# Moving to Serverless AWS

Iceberg has been adopted by all the major cloud providers in some shape or form, but AWS is certainly a big backer.

The advantage of a big backer like AWS is that many of the serverless options in AWS already support Iceberg out of the box. AWS offers Athena, based on the Trino open-source project, as well as an Iceberg catalogue in AWS Glue. We can use `pyathena` to connect to Athena from Python to execute our SQL queries.

To follow along, you'll need to have completed the Terraform setup and set the `.env` variables. There will also be some costs involved.

In [None]:
import s3fs
from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor
import polars as pl
from pyiceberg.catalog import load_catalog
pl.Config().set_thousands_separator(',');

When we connect to Athena, we need to specify where Athena should write out the results. Athena always writes query results to a CSV file in the `s3_staging_dir`; PyAthena then reads that CSV and returns the results.

In [None]:
conn = connect(s3_staging_dir="s3://pydata-copenhagen-datalake/athena", region_name="eu-north-1", cursor_class=ArrowCursor)

Since Athena is basically serverless Trino, we can also use it to create an Iceberg table, using the Trino connector for Iceberg.

Note that, as part of the setup, we uploaded all the CSVs to S3 and ran a Glue crawler over the CSV bucket. This registers the CSV files as a table in the Glue Catalog, which is the metadata Athena needs to be able to execute its queries. Here we want to create an Iceberg table from the pile of CSV files, while also cleaning up the data a little bit.

In [None]:
sql = r""" 
CREATE TABLE IF NOT EXISTS steam.reviews WITH (table_type = 'ICEBERG', location = 's3://pydata-copenhagen-datalake/staging/reviews', is_external = false) AS 

SELECT
regexp_extract("$path", 's3://pydata-copenhagen-datalake/extract/reviews/(\w+)\.csv', 1) as game_id,
recommendationid, 
language, 
--compatibility with Trino timestamps
CAST(from_unixtime(timestamp_created) as timestamp(6)) as timestamp_created, 
CAST(from_unixtime(timestamp_updated) as timestamp(6)) as timestamp_updated,
CAST(voted_up as boolean) as voted_up,
votes_up,
votes_funny,
weighted_vote_score,
comment_count,
CAST(steam_purchase as boolean) as steam_purchase,
CAST(received_for_free as boolean) as received_for_free,
CAST(written_during_early_access as boolean) as written_during_early_access,
CAST(hidden_in_steam_china as boolean) as hidden_in_steam_china,
author_steamid,
author_num_games_owned,
author_num_reviews,
author_playtime_forever,
author_playtime_last_two_weeks,
author_playtime_at_review,
CAST(from_unixtime(author_last_played) as timestamp(6)) as author_last_played
FROM reviews.reviews
WHERE recommendationid is not null
"""

In [None]:
with conn.cursor() as c:
    c.execute(sql)
    print(c.fetchone())
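
To make the `game_id` extraction and the timestamp conversion in the query a bit more concrete, here is a rough plain-Python counterpart. It is only a sketch: the helper names `game_id_from_path` and `from_unixtime`, the example object key and the epoch value are made up for illustration, and the pattern mirrors the one in the query with the dot before `csv` escaped.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Python counterpart of the pattern passed to regexp_extract in the query.
PATH_PATTERN = re.compile(
    r"s3://pydata-copenhagen-datalake/extract/reviews/(\w+)\.csv"
)


def game_id_from_path(path: str) -> Optional[str]:
    """Return the file stem (used as game_id), or None if the path does not match."""
    match = PATH_PATTERN.search(path)
    return match.group(1) if match else None


def from_unixtime(seconds: int) -> datetime:
    """Convert epoch seconds to a UTC datetime (microsecond resolution)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hypothetical object key and epoch value, for illustration only.
example_path = "s3://pydata-copenhagen-datalake/extract/reviews/550.csv"
print(game_id_from_path(example_path))
print(from_unixtime(1_700_000_000).isoformat())
```

Each CSV file holds the reviews for one game, so the file name is the only place the game id lives; that is why the query reads it from the `"$path"` pseudo-column rather than from a regular column.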

We can verify the row count in the new table.

In [None]:
pl.read_database("SELECT COUNT(*) as num_rows FROM steam.reviews", conn)

Now the table is ready for analysis - how about calculating the most reviewed game per language?

In [None]:
sql = """
with lang_reviews as (
    SELECT language, game_id, count(*) as num_reviews 
    FROM steam.reviews group by language, game_id
), max_reviews as (
    select 
    language, 
    game_id, 
    num_reviews,
    RANK() OVER (partition by language order by num_reviews desc) as ordering 
    from lang_reviews
)
select language, game_id, num_reviews from max_reviews
where ordering = 1
order by num_reviews desc
"""

In [None]:
most_reviewed_df = pl.read_database(sql, conn)
most_reviewed_df
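
For readers less used to window functions, the logic of the query can be sketched in plain Python on a handful of rows. The sample data and the function name `most_reviewed_per_language` are hypothetical; the point is only to show what the two CTEs and the `RANK() ... = 1` filter do, including the fact that tied games are all kept.

```python
from collections import Counter

# Hypothetical (language, game_id) pairs, one per review.
reviews = [
    ("english", "550"),
    ("english", "550"),
    ("english", "730"),
    ("danish", "730"),
    ("danish", "550"),
]


def most_reviewed_per_language(rows):
    """Return (language, game_id, num_reviews) for the top game(s) per language."""
    # Equivalent of the lang_reviews CTE: count reviews per (language, game_id).
    counts = Counter(rows)

    # Highest review count seen for each language.
    best = {}
    for (language, _game_id), n in counts.items():
        best[language] = max(best.get(language, 0), n)

    # RANK() = 1 keeps every game tied for first place within a language.
    winners = [
        (language, game_id, n)
        for (language, game_id), n in counts.items()
        if n == best[language]
    ]

    # ORDER BY num_reviews DESC; language and game_id break ties so the
    # output order is stable.
    return sorted(winners, key=lambda r: (-r[2], r[0], r[1]))


for row in most_reviewed_per_language(reviews):
    print(row)
```

With this sample, both Danish games have one review each, so both appear in the result, which is also how `RANK()` behaves in the SQL version.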

Because this is still Iceberg, we can use `pyiceberg` to talk to the AWS Glue catalog as well.

In [None]:
catalog = load_catalog("aws_iceberg", **{"type": "glue", "glue.region": "eu-north-1"})

In [None]:
catalog.list_namespaces()

In [None]:
catalog.list_tables("steam")

In [None]:
table = catalog.load_table("steam.reviews")

In [None]:
pl.from_arrow(table.scan(selected_fields=['game_id', 'language', 'voted_up'], row_filter="game_id == '550'").to_arrow())

We can even use polars directly to query the table.

In [None]:
pl.scan_iceberg(table).select("game_id", "language").filter(pl.col("game_id") == '550').collect()

This interoperability is one of the key benefits of moving towards Iceberg as a storage layer. Athena costs \$5 per TB scanned, while using `polars` costs me \$0.
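
As a back-of-the-envelope check on the Athena side of that comparison, the per-query cost can be estimated from the number of bytes scanned. This is a minimal sketch: the function name `athena_scan_cost` is made up, the rate is the \$5 per TB quoted above, a decimal terabyte is assumed, and any per-query minimum or rounding is ignored.

```python
USD_PER_TB = 5.0        # rate quoted in the text
BYTES_PER_TB = 10**12   # assumption: decimal terabyte


def athena_scan_cost(bytes_scanned: int, usd_per_tb: float = USD_PER_TB) -> float:
    """Estimate the cost in USD of a query that scanned `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Hypothetical query that scanned 250 GB.
print(f"{athena_scan_cost(250 * 10**9):.2f} USD")
```

PyAthena cursors report `data_scanned_in_bytes` after a query has run, so that value could be passed to a helper like this to keep an eye on what each query costs.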