# Moving to Serverless AWS

Iceberg has been adopted by all the major cloud providers in some shape or form, but AWS is certainly a big backer.

The advantage of a big backer like AWS is that many of its serverless services already support Iceberg out of the box. AWS offers Athena, which is based on the open-source Trino project, as well as an Iceberg catalog in AWS Glue. We can use `pyathena` to connect to Athena from Python and execute our SQL queries.

You'll need to have completed the Terraform setup and set the `.env` variables to be able to follow along. There will also be some AWS costs involved.

In [1]:
import s3fs
from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor
import polars as pl
from pyiceberg.catalog import load_catalog

When we connect to Athena, we need to specify where Athena should write out the results. Athena always writes the results to a CSV file in the `s3_staging_dir`, and PyAthena then reads that CSV file and returns the results.

In [2]:
conn = connect(s3_staging_dir="s3://pydata-copenhagen-datalake/athena", region_name="eu-north-1", cursor_class=ArrowCursor)
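To make the staging mechanism a little more concrete: for queries submitted through the API, Athena names the result file after the query execution id and places it under the configured output location. The sketch below only builds that expected URI. The helper name `athena_result_uri` and the all-zero query id are made up for illustration.

```python
def athena_result_uri(staging_dir: str, query_id: str) -> str:
    """Build the S3 URI where the CSV result of a query is expected to land."""
    # Tolerate a trailing slash on the staging directory.
    return f"{staging_dir.rstrip('/')}/{query_id}.csv"


# Placeholder id, not a real query execution id.
example_query_id = "00000000-0000-0000-0000-000000000000"
print(athena_result_uri("s3://pydata-copenhagen-datalake/athena", example_query_id))
```

After a real query, the PyAthena cursor should expose the actual values through its `query_id` and `output_location` attributes, which is the more reliable way to find the file.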

Since Athena is basically serverless Trino, we can also use it to create an Iceberg table, using the Trino connector for Iceberg.

In [15]:
sql = r""" 
CREATE TABLE IF NOT EXISTS steam.reviews WITH (table_type = 'ICEBERG', location = 's3://pydata-copenhagen-datalake/staging/reviews', is_external = false) AS

SELECT
regexp_extract("$path", 's3://pydata-copenhagen-datalake/extract/reviews/(\w+).csv', 1) as game_id,
recommendationid, 
language, 
--compatibility with Trino timestamps
CAST(from_unixtime(timestamp_created) as timestamp(6)) as timestamp_created, 
CAST(from_unixtime(timestamp_updated) as timestamp(6)) as timestamp_updated,
CAST(voted_up as boolean) as voted_up,
votes_up,
votes_funny,
weighted_vote_score,
comment_count,
CAST(steam_purchase as boolean) as steam_purchase,
CAST(received_for_free as boolean) as received_for_free,
CAST(written_during_early_access as boolean) as written_during_early_access,
CAST(hidden_in_steam_china as boolean) as hidden_in_steam_china,
author_steamid,
author_num_games_owned,
author_num_reviews,
author_playtime_forever,
author_playtime_last_two_weeks,
author_playtime_at_review,
CAST(from_unixtime(author_last_played) as timestamp(6)) as author_last_played
FROM reviews.reviews
WHERE recommendationid is not null
"""
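The two less obvious transformations in the query above are the `regexp_extract` on `"$path"`, which derives `game_id` from the source file name, and `from_unixtime`, which turns epoch seconds into timestamps. A rough Python analogue of both, as a sketch: the example path assumes the extract files are named `<game_id>.csv`, which the pattern suggests but the source does not show, and the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Pattern copied from the SQL above; the single capture group is the game id.
PATH_PATTERN = re.compile(r"s3://pydata-copenhagen-datalake/extract/reviews/(\w+).csv")


def extract_game_id(path: str) -> Optional[str]:
    """Mimic regexp_extract(path, pattern, 1): first match, capture group 1."""
    match = PATH_PATTERN.search(path)
    return match.group(1) if match else None


def from_unixtime(seconds: int) -> datetime:
    """Mimic from_unixtime: interpret epoch seconds as a UTC timestamp."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Assumed file name, for illustration only.
example_path = "s3://pydata-copenhagen-datalake/extract/reviews/730.csv"
print(extract_game_id(example_path))
print(from_unixtime(0))
```

Rows whose path does not match the pattern would get no `game_id`, which is worth keeping in mind if other files ever end up under the same prefix.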

In [16]:
with conn.cursor() as c:
    c.execute(sql)
    print(c.fetchone())

None


We can verify the row count in the new table.

In [17]:
pl.read_database("SELECT COUNT(*) as num_rows FROM steam.reviews", conn)

| num_rows |
| --- |
| i64 |
| 127572734 |


Now the table is ready for analysis - how about calculating the most reviewed game per language?

In [18]:
sql = """
with lang_reviews as (
    SELECT language, game_id, count(*) as num_reviews 
    FROM steam.reviews group by language, game_id
), max_reviews as (
    select 
    language, 
    game_id, 
    num_reviews,
    RANK() OVER (partition by language order by num_reviews desc) as ordering 
    from lang_reviews
)
select language, game_id, num_reviews from max_reviews
where ordering = 1
order by num_reviews desc
"""
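For readers less familiar with window functions, the logic of the query above (count per language and game, then keep the rows where `RANK()` is 1 within each language) can be sketched in plain Python. The rows below are toy data invented for illustration, and `most_reviewed_per_language` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

# Toy (language, game_id) pairs, one per review. Invented for illustration.
reviews = [
    ("english", "730"), ("english", "730"), ("english", "550"),
    ("russian", "730"), ("russian", "550"),
]


def most_reviewed_per_language(rows):
    """Return (language, game_id, num_reviews) for the top game(s) per language."""
    # Equivalent of: GROUP BY language, game_id with count(*).
    counts = Counter(rows)

    # Highest review count seen in each language partition.
    best = defaultdict(int)
    for (language, _game_id), num_reviews in counts.items():
        best[language] = max(best[language], num_reviews)

    # RANK() = 1 keeps every row tied for first place, so ties are not dropped.
    winners = [
        (language, game_id, num_reviews)
        for (language, game_id), num_reviews in counts.items()
        if num_reviews == best[language]
    ]
    # Equivalent of: ORDER BY num_reviews DESC.
    return sorted(winners, key=lambda row: row[2], reverse=True)


for row in most_reviewed_per_language(reviews):
    print(row)
```

Note that because `RANK()` is used rather than `ROW_NUMBER()`, a language with two equally reviewed games yields two rows, as the `russian` toy rows show here.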

In [19]:
most_reviewed_df = pl.read_database(sql, conn)
most_reviewed_df

| language | game_id | num_reviews |
| --- | --- | --- |
| str | str | i64 |
| "english" | "730" | 2102886 |
| "russian" | "730" | 2006616 |
| "schinese" | "578080" | 1166691 |
| "brazilian" | "730" | 435306 |
| "polish" | "730" | 417014 |
| … | … | … |
| "vietnamese" | "730" | 10445 |
| "bulgarian" | "730" | 9670 |
| "japanese" | "1172470" | 8976 |
| "greek" | "730" | 6955 |


Because this is still Iceberg, we can use `pyiceberg` to talk to the AWS Glue catalog as well.

In [20]:
catalog = load_catalog("aws_iceberg", **{"type": "glue", "glue.region": "eu-north-1"})

In [21]:
catalog.list_namespaces()

[('reviews',), ('steam',)]

In [22]:
catalog.list_tables("steam")

[('steam', 'reviews')]

In [23]:
table = catalog.load_table("steam.reviews")

In [24]:
pl.from_arrow(table.scan(selected_fields=['game_id', 'language', 'voted_up'], row_filter="game_id == '550'").to_arrow())



| game_id | language | voted_up |
| --- | --- | --- |
| str | str | bool |
| "550" | "schinese" | true |
| "550" | "spanish" | true |
| "550" | "english" | true |
| "550" | "russian" | true |
| "550" | "schinese" | true |
| … | … | … |
| "550" | "brazilian" | true |
| "550" | "russian" | true |
| "550" | "english" | true |
| "550" | "schinese" | true |
