[![AWS Data Wrangler](_static/logo.png "AWS Data Wrangler")](https://github.com/awslabs/aws-data-wrangler)

# 6 - Amazon Athena Cache

[Wrangler](https://github.com/awslabs/aws-data-wrangler) has a cache strategy that is disabled by default and can be enabled by passing `max_cache_seconds` bigger than 0. This cache strategy for Amazon Athena can help you to **decrease query times and costs**.

When calling `read_sql_query`, instead of just running the query, we can now check whether the same query has been run before. If it has, and that last run finished within `max_cache_seconds` (a new parameter to `read_sql_query`), we return the same results as last time, provided they are still available in S3. We have seen this improve performance by more than 100x, and the gain grows with the cost of the original query, because a cache hit only has to read results that already exist.

The detailed approach is:
- When `read_sql_query` is called with `max_cache_seconds > 0` (it defaults to 0), we check for the last 50 queries run by the same workgroup (the most we can get without pagination).
- We then sort those queries by CompletionDateTime, in descending order.
- For each of those queries, we check whether its CompletionDateTime is still within the `max_cache_seconds` window. If so, we check whether its query string is the same as the current one (with some heuristics to cover both values of `ctas_approach`). If they match, we check whether that query's results are still on S3, and if so we return them instead of re-running the query.
- During the whole cache resolution phase, if there is anything wrong, the logic falls back to the usual `read_sql_query` path.
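
The steps above can be sketched in plain Python. This is a minimal illustration and not Wrangler's actual implementation: `PastQuery`, `normalize`, and `find_cached` are hypothetical names, the query history is passed in as a list instead of being fetched from Athena, the S3 check is reduced to a boolean flag, and the string comparison is a simple whitespace normalization rather than the heuristics Wrangler uses. Skipping to the next candidate when results are missing is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class PastQuery:
    """Hypothetical stand-in for one entry of the workgroup's query history."""
    query_id: str
    sql: str
    completed_at: datetime
    results_in_s3: bool  # assumed flag: do the result files still exist?


def normalize(sql: str) -> str:
    """Illustrative comparison rule: ignore extra whitespace and a trailing ';'."""
    return " ".join(sql.strip().rstrip(";").split())


def find_cached(sql: str, history: List[PastQuery], max_cache_seconds: int,
                now: datetime) -> Optional[PastQuery]:
    """Return a reusable past query, or None to signal 'run the query normally'."""
    if max_cache_seconds <= 0:  # cache disabled (the default)
        return None
    try:
        oldest_allowed = now - timedelta(seconds=max_cache_seconds)
        # Newest first, as in the second step of the list above.
        for past in sorted(history, key=lambda q: q.completed_at, reverse=True):
            if past.completed_at < oldest_allowed:
                break  # every remaining entry is older still
            if normalize(past.sql) == normalize(sql) and past.results_in_s3:
                return past
    except Exception:
        pass  # any problem during cache resolution falls back to a normal run
    return None


now = datetime(2020, 6, 23, 20, 30, tzinfo=timezone.utc)
history = [
    PastQuery("a", "SELECT 1", now - timedelta(hours=2), True),
    PastQuery("b", "SELECT   1;", now - timedelta(seconds=60), True),
    PastQuery("c", "SELECT 2", now - timedelta(seconds=10), True),
]

hit = find_cached("SELECT 1", history, max_cache_seconds=900, now=now)
print(hit.query_id if hit else None)  # prints: b
```

With `max_cache_seconds=0`, or with a window shorter than 60 seconds, the same call returns `None`, which corresponds to running the query as usual.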

*P.S. The cache scope is bounded by the current workgroup, so you will be able to reuse query results from other colleagues running in the same environment.*

In [None]:
import awswrangler as wr

## Enter your bucket name:

In [3]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/data/"



## Checking/Creating Glue Catalog Databases

In [5]:
if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

### Creating a Parquet Table from NOAA's CSV files

[Reference](https://registry.opendata.aws/noaa-ghcn/)

In [6]:
cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])  # Read 10 files from the 1890 decade (~1GB)

df

| | id | dt | element | value | m_flag | q_flag | s_flag | obs_time |
|---|---|---|---|---|---|---|---|---|
| 0 | ASN00070200 | 1890-01-01 | PRCP | 0 | | | a | |
| 1 | SF000782720 | 1890-01-01 | PRCP | 0 | | | I | |
| 2 | CA005022790 | 1890-01-01 | TMAX | -222 | | | C | |
| 3 | CA005022790 | 1890-01-01 | TMIN | -261 | | | C | |
| 4 | CA005022790 | 1890-01-01 | PRCP | 0 | | | C | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29240014 | USC00181790 | 1899-12-31 | PRCP | 0 | P | | 6 | 1830 |
| 29240015 | ASN00061000 | 1899-12-31 | PRCP | 0 | | | a | |
| 29240016 | ASN00040284 | 1899-12-31 | PRCP | 0 | | | a | |
| 29240017 | ASN00048117 | 1899-12-31 | PRCP | 0 | | | a | |


In [7]:
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa"
);

In [8]:
wr.catalog.table(database="awswrangler_test", table="noaa")

| | Column Name | Type | Partition | Comment |
|---|---|---|---|---|
| 0 | id | string | False | |
| 1 | dt | timestamp | False | |
| 2 | element | string | False | |
| 3 | value | bigint | False | |
| 4 | m_flag | string | False | |
| 5 | q_flag | string | False | |
| 6 | s_flag | string | False | |
| 7 | obs_time | string | False | |


## The test query

The more computational resources the query needs, the more the cache will help you. That's why we're using this long-running query.

In [2]:
query = """
SELECT
    n1.element,
    count(1) as cnt
FROM
    noaa n1
JOIN
    noaa n2
ON
    n1.id = n2.id
GROUP BY
    n1.element
"""

## First execution...

In [12]:
%%time

wr.athena.read_sql_query(query, database="awswrangler_test")

CPU times: user 2.95 s, sys: 259 ms, total: 3.21 s
Wall time: 6min 14s


| | element | cnt |
|---|---|---|
| 0 | MDPR | 114320989 |
| 1 | SNOW | 21950890838 |
| 2 | WT07 | 4486872 |
| 3 | TMAX | 39876132467 |
| 4 | WT09 | 584412 |
| 5 | SNWD | 5089486328 |
| 6 | WT11 | 22212890 |
| 7 | WT08 | 33933005 |
| 8 | WT05 | 8211491 |
| 9 | DATX | 11210687 |


## Second execution with **CACHE** (about 40x faster in this run)

In [3]:
%%time

wr.athena.read_sql_query(query, database="awswrangler_test", max_cache_seconds=900000)

2020-06-23 20:28:09.718831+00:00
CPU times: user 444 ms, sys: 94.8 ms, total: 539 ms
Wall time: 9.46 s


| | element | cnt |
|---|---|---|
| 0 | MDPR | 114320989 |
| 1 | SNOW | 21950890838 |
| 2 | WT07 | 4486872 |
| 3 | TMAX | 39876132467 |
| 4 | WT09 | 584412 |
| 5 | SNWD | 5089486328 |
| 6 | WT11 | 22212890 |
| 7 | WT08 | 33933005 |
| 8 | WT05 | 8211491 |
| 9 | DATX | 11210687 |
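
As a quick check on the two wall times reported above (6min 14s without the cache, 9.46 s with it), the speedup for this particular run can be computed directly. The variable names are only illustrative.

```python
first_run_s = 6 * 60 + 14   # wall time of the uncached run: 6min 14s = 374 s
cached_run_s = 9.46         # wall time of the cached run, in seconds

speedup = first_run_s / cached_run_s
print(f"Speedup: {speedup:.1f}x")  # prints: Speedup: 39.5x
```

The ratio depends on how expensive the original query is: the cached path mostly pays for looking up the previous execution and reading its results, so heavier queries should see larger gains.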


## Cleaning Up S3

In [13]:
wr.s3.delete_objects(path)

## Delete table

In [14]:
wr.catalog.delete_table_if_exists(database="awswrangler_test", table="noaa")

## Delete Database

In [15]:
wr.catalog.delete_database("awswrangler_test")