# EDA with awswrangler
- references
    - [AWS Data Wrangler](https://velog.io/@hsh/AWSPythonAthena-%ED%8C%8C%EC%9D%B4%EC%8D%AC%EC%9C%BC%EB%A1%9C-%EC%95%84%ED%85%8C%EB%82%98%EC%97%90-%EC%BF%BC%EB%A6%AC%ED%95%98%EA%B8%B0-boto3-vs-pyathena-vs-awswrangler)
    - [Data — AWS Wrangler Query Athena](https://dorian599.medium.com/data-aws-wrangler-query-athena-8be83bc8b091)
    - [sample codes](https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/006%20-%20Amazon%20Athena.ipynb)

### Amazon Athena

[awswrangler](https://github.com/aws/aws-sdk-pandas) has three ways to run queries on Athena and fetch the result as a DataFrame:

- **ctas_approach=True** (Default)

    Wraps the query with a CTAS and then reads the table data as Parquet directly from S3.
    
    * `PROS`:
        - Faster for mid and big result sizes.
        - Can handle some level of nested types.
    * `CONS`:
         - Requires create/delete table permissions on Glue.
         - Does not support timestamp with time zone.
         - Does not support columns with repeated names.
         - Does not support columns with undefined data types.
         - A temporary table will be created and then deleted immediately.
         - Does not support custom data_source/catalog_id.

- **unload_approach=True and ctas_approach=False**

    Does an UNLOAD query on Athena and parses the Parquet result on S3.

    * `PROS`:
        - Faster for mid and big result sizes.
        - Can handle some level of nested types.
        - Does not modify Glue Data Catalog.
    * `CONS`:
        - Output S3 path must be empty.
        - Does not support timestamp with time zone.
        - Does not support columns with repeated names.
        - Does not support columns with undefined data types.

- **ctas_approach=False**

    Does a regular query on Athena and parses the regular CSV result on S3.
    
    * `PROS`:
        - Faster for small result sizes (less latency).
        - Does not require create/delete table permissions on Glue.
        - Supports timestamp with time zone.
        - Supports custom data_source/catalog_id.
    * `CONS`:
        - Slower for mid and large result sizes (but still faster than other libraries that use the regular Athena API).
        - Does not handle nested types at all.

In [168]:
import awswrangler as wr

## Enter your bucket name:

In [169]:
bucket_name = "sm-anomaly-detection"  # <your bucket name>
data_path = f"s3://{bucket_name}/data"

## Checking/Creating Glue Catalog Databases

In [170]:
if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")



## Creating a Parquet Table from CSV or Parquet files
- **When the data is stored in S3 as CSV files**

In [180]:
import os
import numpy as np

In [172]:
#wr.s3.read_csv?

In [176]:
dfs = wr.s3.read_csv(
    path=os.path.join(data_path, "csv"),  # folder (prefix) name
    chunksize=10000,  # returns an iterable of DataFrames, 10,000 rows per chunk
    dtype_backend="pyarrow"
)
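The cell above builds the S3 prefix with `os.path.join`, which produces forward slashes on Linux and macOS but backslashes on Windows. A minimal sketch of a platform-independent alternative follows; the helper name `s3_join` is hypothetical and not part of awswrangler.

```python
import posixpath


def s3_join(base: str, *parts: str) -> str:
    """Join S3 URI segments with forward slashes on every platform."""
    cleaned = (part.strip("/") for part in parts)
    return posixpath.join(base.rstrip("/"), *cleaned)


bucket_name = "sm-anomaly-detection"
data_path = f"s3://{bucket_name}/data"

csv_prefix = s3_join(data_path, "csv")
parquet_prefix = s3_join(data_path, "parquet")
print(csv_prefix)      # s3://sm-anomaly-detection/data/csv
print(parquet_prefix)  # s3://sm-anomaly-detection/data/parquet
```

Either result can be passed as `path=` in place of the `os.path.join(...)` expression.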

- **When the data is stored in S3 as Parquet files**

In [138]:
#wr.s3.read_parquet?

Signature:
wr.s3.read_parquet(
    path: Union[str, List[str]],
    path_root: Optional[str] = None,
    dataset: bool = False,
    path_suffix: Union[str, List[str], NoneType] = None,
    path_ignore_suffix: Union[str, List[str], NoneType] = None
    ...

In [185]:
dfs = wr.s3.read_parquet(
    path=os.path.join(data_path, "parquet"),  # folder (prefix) name
    chunked=10000,  # returns an iterable of DataFrames, 10,000 rows per chunk
)
#dfs.sort_values(by="index")

* creating database
    - dtype optimization to improve query speed

In [186]:
for df in dfs:
    # Downcast integer columns to narrower types before writing
    df = df.astype(
        {
            "age": np.int16,
            "recommended_ind": np.int8,
        }
    )
    # Append each chunk to the dataset and register it in the Glue catalog
    wr.s3.to_parquet(
        df=df,
        path=os.path.join(data_path, "parquet_from_parqeut"),
        dataset=True,
        mode="append",
        database="awswrangler_test",
        table="reviews",
    )

In [187]:
wr.catalog.table(database="awswrangler_test", table="reviews")

Unnamed: 0,Column Name,Type,Partition,Comment
0,index,bigint,False,
1,clothing_id,bigint,False,
2,age,smallint,False,
3,title,string,False,
4,review_text,string,False,
5,rating,bigint,False,
6,recommended_ind,tinyint,False,
7,positive_feedback_count,bigint,False,
8,division_name,string,False,
9,department_name,string,False,
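The catalog output above shows that `np.int16` was registered as `smallint` and `np.int8` as `tinyint`, while untouched integer columns stayed `bigint`. The sketch below picks the narrowest signed integer type for a given value range. The helper name `smallest_int_type` is hypothetical, and the `int32` to `int` mapping is an assumption, since that type does not appear in the table above.

```python
from typing import Tuple

# (NumPy dtype name, Athena/Glue type name, minimum, maximum)
_INT_TYPES = [
    ("int8", "tinyint", -(2**7), 2**7 - 1),
    ("int16", "smallint", -(2**15), 2**15 - 1),
    ("int32", "int", -(2**31), 2**31 - 1),  # assumed mapping, not shown above
    ("int64", "bigint", -(2**63), 2**63 - 1),
]


def smallest_int_type(lo: int, hi: int) -> Tuple[str, str]:
    """Return the narrowest (NumPy dtype, Athena type) that holds lo..hi."""
    for np_name, athena_name, type_min, type_max in _INT_TYPES:
        if type_min <= lo and hi <= type_max:
            return np_name, athena_name
    raise ValueError(f"range {lo}..{hi} does not fit in a 64-bit integer")


# A 0/1 indicator such as recommended_ind fits the narrowest type.
print(smallest_int_type(0, 1))    # ('int8', 'tinyint')
# A hypothetical column ranging from 0 to 300 needs the next size up.
print(smallest_int_type(0, 300))  # ('int16', 'smallint')
```

The helper returns the tightest fit; choosing a wider type than strictly necessary, as the notebook does for `age`, leaves headroom for values outside the currently observed range.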


## Athena query

In [190]:
%%time
query = """
SELECT division_name, SUM(recommended_ind) AS SUM_CNT
FROM reviews
GROUP BY division_name
"""

wr.athena.read_sql_query(query, database="awswrangler_test", ctas_approach=False)

CPU times: user 1.01 s, sys: 68.6 ms, total: 1.08 s
Wall time: 3.52 s


Unnamed: 0,division_name,SUM_CNT
0,General,11313
1,General Petite,6707
2,,14
3,Initmates,1280
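The `%%time` magic used in these cells only works inside IPython or Jupyter. When the same query is run from a plain Python script, wall time can be measured with the standard library instead. This is a minimal sketch: `wall_timer` is a hypothetical helper, and the timed body here is a stand-in workload rather than an Athena call.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label: str):
    """Print the elapsed wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s")


# Stand-in workload; in this notebook's setting the body would be the
# wr.athena.read_sql_query(...) call.
with wall_timer("stand-in workload"):
    total = sum(range(1_000_000))
```

Unlike `%%time`, this reports wall time only, not the user and system CPU times.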


### Reading with ctas_approach=False

In [189]:
%%time
wr.athena.read_sql_query("SELECT * FROM reviews ORDER BY index", database="awswrangler_test", ctas_approach=False)

CPU times: user 1.2 s, sys: 90.9 ms, total: 1.29 s
Wall time: 3.82 s


Unnamed: 0,index,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,division_name,department_name,class_name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses


### Reading with ctas_approach=True (default)

In [191]:
%%time
wr.athena.read_sql_query("SELECT * FROM reviews ORDER BY index", database="awswrangler_test")

CPU times: user 2.37 s, sys: 153 ms, total: 2.52 s
Wall time: 5.77 s


Unnamed: 0,index,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,division_name,department_name,class_name
0,19624,879,31,"Adorable shirt, questionable quality",I saw this shirt online first and immediately ...,3,0,21,General,Tops,Knits
1,19625,993,25,Cute print but cut misses the mark,I had been waiting for this skirt to go on sal...,2,0,0,General Petite,Bottoms,Skirts
2,19626,879,48,Great tee!,This is adorable! lots of compliments on colo...,5,1,0,General,Tops,Knits
3,19627,1094,51,Love!,"Besides the beautiful print, i found the fit t...",5,1,4,General Petite,Dresses,Dresses
4,19628,1094,58,Put-together and comfortable,I feel both confident and comfortable in this ...,5,1,3,General Petite,Dresses,Dresses
...,...,...,...,...,...,...,...,...,...,...,...
23481,11816,829,35,Cute!,"Super cute, casual top. i'm usually a medium a...",5,1,0,General,Tops,Blouses
23482,11817,1079,50,"Beautiful, but poor fit","I loved this dress the moment i saw it, pre-or...",4,1,5,General,Dresses,Dresses
23483,11818,829,63,Beautiful!,I saw this top online and loved it. i wasn't s...,5,1,0,General,Tops,Blouses
23484,11819,1022,56,Super cute,"I am a brand loyal ag fan. best washes, best p...",5,1,2,General Petite,Bottoms,Jeans


### Using categories to speed up and save memory

In [192]:
%%time
wr.athena.read_sql_query("SELECT * FROM reviews ORDER BY index", database="awswrangler_test", categories=["division_name", "department_name", "class_name"])

CPU times: user 2.29 s, sys: 136 ms, total: 2.43 s
Wall time: 5.87 s


Unnamed: 0,index,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,division_name,department_name,class_name
0,3904,1081,68,Lovelovelove!,This is the perfect little throw-on dress. fla...,5,1,0,General Petite,Dresses,Dresses
1,3905,1099,38,Ok curvy peeps! ;),So i notice most ladies on retailer love to sa...,5,1,0,General,Dresses,Dresses
2,3906,1100,39,Just as pictured.,This dress is so pretty in person. the colors ...,5,1,14,General,Dresses,Dresses
3,3907,1081,39,Great dress,I have this in black and red. it's a great and...,4,1,0,General Petite,Dresses,Dresses
4,3908,860,52,Boho vibe,Flowy boho tank. great details. good fabric. w...,5,1,1,General,Tops,Knits
...,...,...,...,...,...,...,...,...,...,...,...
23481,2590,975,37,,"This jacket fits just a tad big, but it's quit...",4,1,0,General,Jackets,Jackets
23482,2591,1110,51,Love this dress,I saw this dress online and loved it. i ordere...,5,1,0,General Petite,Dresses,Dresses
23483,2592,863,41,Not as it seems..,I'm always looking for pieces that would good ...,2,0,3,General,Tops,Knits
23484,2593,1089,53,Three strikes,I am now the third person to try this dress an...,2,0,13,General Petite,Dresses,Dresses


### Reading with unload_approach=True
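Because the unload approach requires an empty output S3 path, re-running the query with a fixed `s3_output` may fail once earlier results are present. One way around this is to write each run to a fresh prefix. The sketch below assumes a hypothetical helper named `unique_unload_prefix`; the `data/unload` base mirrors the path used in this notebook.

```python
import uuid


def unique_unload_prefix(bucket_name: str, base: str = "data/unload") -> str:
    """Build an S3 prefix that is new on every call (hypothetical helper)."""
    run_id = uuid.uuid4().hex  # random 32-character hexadecimal identifier
    return f"s3://{bucket_name}/{base.strip('/')}/{run_id}/"


first = unique_unload_prefix("sm-anomaly-detection")
second = unique_unload_prefix("sm-anomaly-detection")
print(first)
print(first != second)  # True: each call yields a different prefix
```

The result would be passed as `s3_output=unique_unload_prefix(bucket_name)`. Since the prefixes stay under `s3://{bucket_name}/data`, deleting `data_path` during cleanup removes them as well.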

In [193]:
%%time
wr.athena.read_sql_query("SELECT * FROM reviews ORDER BY index", database="awswrangler_test", ctas_approach=False, unload_approach=True, s3_output=f"s3://{bucket_name}/data/unload/")

CPU times: user 2.15 s, sys: 148 ms, total: 2.3 s
Wall time: 5.4 s


Unnamed: 0,index,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,division_name,department_name,class_name
0,18328,1077,36,Cute cute cute,"This dress is so cute, and i received so many ...",4,1,0,General Petite,Dresses,Dresses
1,18329,830,35,Sad top.,"I wanted to love this top. but, it looks like ...",2,0,4,General,Tops,Blouses
2,18330,1092,27,Great summer dress,I love this dress. it's a lightweight soft fab...,5,1,0,General Petite,Dresses,Dresses
3,18331,1092,47,,I tried this dress on in hawaii before it even...,4,1,19,General Petite,Dresses,Dresses
4,18332,1112,37,Nice,"I did like this coat, however the material was...",4,1,3,General,Jackets,Outerwear
...,...,...,...,...,...,...,...,...,...,...,...
23481,10505,1080,42,Feminine dress,I love this dress! the colors are so pretty an...,5,1,0,General Petite,Dresses,Dresses
23482,10506,929,47,"On the right person, maybe?",This wasn't for me. i have a short neck and am...,2,0,8,General,Tops,Sweaters
23483,10507,1080,39,Cute,I finally ordered this after looking at it man...,5,1,1,General Petite,Dresses,Dresses
23484,10508,833,39,Beautiful tank to wear for years,I was urged to try this on by my usual stylist...,5,1,11,General Petite,Tops,Blouses
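The wall times recorded for the four `SELECT * FROM reviews ORDER BY index` runs above can be put side by side. The numbers below are copied from the cell outputs. Each is a single run on a fairly small result, so this is a rough comparison rather than a benchmark.

```python
# Wall times in seconds, copied from the %%time outputs above
wall_times_s = {
    "ctas_approach=False": 3.82,
    "ctas_approach=True (default)": 5.77,
    "ctas_approach=True + categories": 5.87,
    "unload_approach=True": 5.40,
}

baseline = wall_times_s["ctas_approach=False"]
relative = {name: seconds / baseline for name, seconds in wall_times_s.items()}

for name, seconds in wall_times_s.items():
    print(f"{name:<32} {seconds:5.2f} s  {relative[name]:.2f}x the regular query")
```

On this dataset the regular CSV query was the quickest of the four. That is consistent with the earlier note that `ctas_approach=False` has less latency for small results, while the CTAS and UNLOAD approaches are described as paying off for mid and big result sizes.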


### Cleaning Up S3

In [194]:
wr.s3.delete_objects(data_path)

### Delete table

In [195]:
wr.catalog.delete_table_if_exists(database="awswrangler_test", table="reviews")

True

### Delete Database

In [196]:
wr.catalog.delete_database('awswrangler_test')