[![AWS Data Wrangler](_static/logo.png "AWS Data Wrangler")](https://github.com/awslabs/aws-data-wrangler)

# 6 - Amazon Athena

[Wrangler](https://github.com/awslabs/aws-data-wrangler) has two ways to run queries on Athena and fetch the result as a DataFrame:

- **ctas_approach=True** (Default)

    Wraps the query in a CTAS (CREATE TABLE AS SELECT) statement and then reads the resulting table data as Parquet directly from S3.
    
    * `PROS`:
        - Faster for medium and large result sizes.
        - Can handle some level of nested types.
    * `CONS`:
         - Requires create/delete table permissions on Glue.
         - Does not support timestamp with time zone.
         - Does not support columns with repeated names.
         - Does not support columns with undefined data types.
         - A temporary table will be created and then deleted immediately.
         - Does not support custom data_source/catalog_id.


- **ctas_approach=False**

    Runs a regular query on Athena and parses the CSV result on S3.
    
    * `PROS`:
        - Faster for small result sizes (less latency).
        - Does not require create/delete table permissions on Glue.
        - Supports timestamp with time zone.
        - Supports custom data_source/catalog_id.
    * `CONS`:
        - Slower for medium and large result sizes (but still faster than other libraries that use the regular Athena API).
        - Does not handle nested types at all.
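
The trade-offs listed above can be turned into a small decision helper. The following is a minimal sketch, not part of awswrangler: `QueryNeeds` and `pick_ctas_approach` are hypothetical names introduced here for illustration, and the fields simply mirror the limitations of `ctas_approach=True` listed above.

```python
from dataclasses import dataclass


@dataclass
class QueryNeeds:
    """Hypothetical description of a query's requirements (not an awswrangler type)."""
    can_manage_glue_tables: bool = True      # create/delete table permissions on Glue
    has_timestamp_with_tz: bool = False      # result contains timestamp with time zone
    has_repeated_column_names: bool = False  # e.g. SELECT a.id, b.id
    has_undefined_types: bool = False        # result contains columns with undefined types
    uses_custom_data_source: bool = False    # custom data_source/catalog_id


def pick_ctas_approach(needs: QueryNeeds) -> bool:
    """Return False when any listed CTAS limitation applies, else True (the default)."""
    blockers = (
        not needs.can_manage_glue_tables,
        needs.has_timestamp_with_tz,
        needs.has_repeated_column_names,
        needs.has_undefined_types,
        needs.uses_custom_data_source,
    )
    return not any(blockers)


print(pick_ctas_approach(QueryNeeds()))                            # True
print(pick_ctas_approach(QueryNeeds(has_timestamp_with_tz=True)))  # False
```

The returned flag could then be passed as `ctas_approach=pick_ctas_approach(needs)` to `wr.athena.read_sql_query`. Note that the helper does not cover one conflict: if a blocker applies and the result also contains nested types, neither approach fits as listed, and the query itself would need to change.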

In [1]:
import awswrangler as wr

## Enter your bucket name:

In [2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/data/"



## Checking/Creating Glue Catalog Databases

In [3]:
if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

### Creating a Parquet Table from NOAA's CSV Files

[Reference](https://registry.opendata.aws/noaa-ghcn/)

In [4]:
cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])  # Read 10 files from the 1890 decade (~1GB)

df

Unnamed: 0,id,dt,element,value,m_flag,q_flag,s_flag,obs_time
0,AGE00135039,1890-01-01,TMAX,160,,,E,
1,AGE00135039,1890-01-01,TMIN,30,,,E,
2,AGE00135039,1890-01-01,PRCP,45,,,E,
3,AGE00147705,1890-01-01,TMAX,140,,,E,
4,AGE00147705,1890-01-01,TMIN,74,,,E,
...,...,...,...,...,...,...,...,...
29240014,UZM00038457,1899-12-31,PRCP,16,,,r,
29240015,UZM00038457,1899-12-31,TAVG,-73,,,r,
29240016,UZM00038618,1899-12-31,TMIN,-76,,,r,
29240017,UZM00038618,1899-12-31,PRCP,0,,,r,


In [5]:
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa"
);

In [6]:
wr.catalog.table(database="awswrangler_test", table="noaa")

Unnamed: 0,Column Name,Type,Partition,Comment
0,id,string,False,
1,dt,timestamp,False,
2,element,string,False,
3,value,bigint,False,
4,m_flag,string,False,
5,q_flag,string,False,
6,s_flag,string,False,
7,obs_time,string,False,


## Reading with ctas_approach=False

In [7]:
%%time

wr.athena.read_sql_query("SELECT * FROM noaa", database="awswrangler_test", ctas_approach=False)

CPU times: user 8min 45s, sys: 6.52 s, total: 8min 51s
Wall time: 11min 3s


Unnamed: 0,id,dt,element,value,m_flag,q_flag,s_flag,obs_time
0,AGE00135039,1890-01-01,TMAX,160,,,E,
1,AGE00135039,1890-01-01,TMIN,30,,,E,
2,AGE00135039,1890-01-01,PRCP,45,,,E,
3,AGE00147705,1890-01-01,TMAX,140,,,E,
4,AGE00147705,1890-01-01,TMIN,74,,,E,
...,...,...,...,...,...,...,...,...
29240014,UZM00038457,1899-12-31,PRCP,16,,,r,
29240015,UZM00038457,1899-12-31,TAVG,-73,,,r,
29240016,UZM00038618,1899-12-31,TMIN,-76,,,r,
29240017,UZM00038618,1899-12-31,PRCP,0,,,r,


## Reading with ctas_approach=True (default) - 13x faster

In [8]:
%%time

wr.athena.read_sql_query("SELECT * FROM noaa", database="awswrangler_test")

CPU times: user 28 s, sys: 6.07 s, total: 34.1 s
Wall time: 50.5 s


Unnamed: 0,id,dt,element,value,m_flag,q_flag,s_flag,obs_time
0,ASN00017088,1890-06-11,PRCP,0,,,a,
1,ASN00017087,1890-06-11,PRCP,0,,,a,
2,ASN00017089,1890-06-11,PRCP,71,,,a,
3,ASN00017095,1890-06-11,PRCP,0,,,a,
4,ASN00017094,1890-06-11,PRCP,0,,,a,
...,...,...,...,...,...,...,...,...
29240014,USC00461260,1899-12-31,SNOW,0,,,6,
29240015,USC00461515,1899-12-31,TMAX,-89,,,6,
29240016,USC00461515,1899-12-31,TMIN,-189,,,6,
29240017,USC00461515,1899-12-31,PRCP,0,,,6,
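
The speedup figures in the section headings can be checked from the reported wall times. The sketch below is only an illustration: `wall_seconds` is a hypothetical helper written for this check, and the timings are copied from the `%%time` outputs in this notebook (11min 3s for `ctas_approach=False`, 50.5 s for the default, and 27.3 s for the categories run reported further down).

```python
import re


def wall_seconds(text: str) -> float:
    """Convert a %%time wall-time string such as '11min 3s' or '50.5 s' to seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"([\d.]+)\s*s\b", text)
    total = 0.0
    if minutes:
        total += 60 * int(minutes.group(1))
    if seconds:
        total += float(seconds.group(1))
    return total


baseline = wall_seconds("11min 3s")    # ctas_approach=False
ctas = wall_seconds("50.5 s")          # ctas_approach=True (default)
categories = wall_seconds("27.3 s")    # ctas_approach=True with categories

print(baseline)                         # 663.0
print(round(baseline / ctas, 1))        # 13.1
print(round(baseline / categories, 1))  # 24.3
```

These are single-run measurements, so the ratios are best read as rough orders of magnitude rather than stable benchmarks.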


## Using categories to speed up and save memory - 24x faster

In [9]:
%%time

wr.athena.read_sql_query(
    "SELECT * FROM noaa",
    database="awswrangler_test",
    categories=["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"],
)

CPU times: user 6.89 s, sys: 2.27 s, total: 9.16 s
Wall time: 27.3 s


Unnamed: 0,id,dt,element,value,m_flag,q_flag,s_flag,obs_time
0,GME00102348,1890-08-03,TMAX,172,,,E,
1,GME00102348,1890-08-03,TMIN,117,,,E,
2,GME00102348,1890-08-03,PRCP,63,,,E,
3,GME00102348,1890-08-03,SNWD,0,,,E,
4,GME00121126,1890-08-03,PRCP,32,,,E,
...,...,...,...,...,...,...,...,...
29240014,USC00461260,1899-12-31,SNOW,0,,,6,
29240015,USC00461515,1899-12-31,TMAX,-89,,,6,
29240016,USC00461515,1899-12-31,TMIN,-189,,,6,
29240017,USC00461515,1899-12-31,PRCP,0,,,6,
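
Columns listed in `categories` come back as pandas categoricals, which store each distinct value once plus a small integer code per row. The following standard-library sketch illustrates that idea only; it is not how pandas or Wrangler implement it, and `encode_categories` and `decode_categories` are hypothetical names. As a simplification, categories are kept in order of first appearance.

```python
from array import array


def encode_categories(values):
    """Dictionary-encode a column: each distinct value once, plus one int code per row."""
    categories = []     # distinct values, in order of first appearance
    index = {}          # value -> position in `categories`
    codes = array("i")  # one compact integer code per row
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return categories, codes


def decode_categories(categories, codes):
    """Rebuild the original column from its categories and codes."""
    return [categories[code] for code in codes]


# Sample `element` values of the kind shown in the outputs above.
element = ["TMAX", "TMIN", "PRCP", "TMAX", "TMIN", "PRCP", "SNOW", "PRCP"]
categories, codes = encode_categories(element)

print(categories)   # ['TMAX', 'TMIN', 'PRCP', 'SNOW']
print(list(codes))  # [0, 1, 2, 0, 1, 2, 3, 2]
assert decode_categories(categories, codes) == element
```

The saving grows with repetition: a column such as `element`, with a handful of distinct values across millions of rows, benefits far more than a column where almost every value is unique.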


## Batching (good for memory-restricted environments)

In [10]:
%%time

dfs = wr.athena.read_sql_query(
    "SELECT * FROM noaa",
    database="awswrangler_test",
    chunksize=True  # Chunksize calculated automatically for ctas_approach.
)

for df in dfs:  # Batching
    print(len(df.index))

1024
8086528
1024
1024
1024
1024
1024
15360
1024
10090496
2153472
8886995
CPU times: user 22.7 s, sys: 5.41 s, total: 28.1 s
Wall time: 48 s


In [11]:
%%time

dfs = wr.athena.read_sql_query(
    "SELECT * FROM noaa",
    database="awswrangler_test",
    chunksize=100_000_000
)

for df in dfs:  # Batching
    print(len(df.index))

29240019
CPU times: user 34.8 s, sys: 8.54 s, total: 43.4 s
Wall time: 1min 1s
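
Batching pays off when each chunk is reduced to a small summary and then discarded. The sketch below shows that pattern under stated assumptions: `summarize_chunks` is a hypothetical helper, and `range` objects stand in for the DataFrames so the example runs without AWS access. The row counts are the ones printed by the `chunksize=True` run above.

```python
# Row counts printed by the chunksize=True run above, in order.
auto_chunk_rows = [
    1024, 8086528, 1024, 1024, 1024, 1024,
    1024, 15360, 1024, 10090496, 2153472, 8886995,
]


def summarize_chunks(chunks):
    """Consume an iterable of chunks once, keeping only running totals in memory."""
    n_chunks = 0
    n_rows = 0
    largest = 0
    for chunk in chunks:
        rows = len(chunk)  # len() of a DataFrame is its row count
        n_chunks += 1
        n_rows += rows
        largest = max(largest, rows)
    return n_chunks, n_rows, largest


# Stand-in chunks: range(n) has length n without allocating n items.
fake_chunks = (range(n) for n in auto_chunk_rows)
print(summarize_chunks(fake_chunks))  # (12, 29240019, 10090496)
```

The total of 29240019 rows matches the single chunk returned with `chunksize=100_000_000`, so both runs cover the same data. With a real query, the iterator returned by `wr.athena.read_sql_query(..., chunksize=True)` could be passed to `summarize_chunks` directly.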


## Cleaning Up S3

In [12]:
wr.s3.delete_objects(path)

## Delete Table

In [13]:
wr.catalog.delete_table_if_exists(database="awswrangler_test", table="noaa");

## Delete Database

In [14]:
wr.catalog.delete_database("awswrangler_test")