[![AWS Data Wrangler](_static/logo.png "AWS Data Wrangler")](https://github.com/awslabs/aws-data-wrangler)

# 9 - Parquet Crawler

[Wrangler](https://github.com/awslabs/aws-data-wrangler) can extract only the metadata (column types and partition values) from Parquet files and register it in the Glue Catalog, without reading the data itself.

In [1]:
import awswrangler as wr

## Enter your bucket name:

In [2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/data/"

### Creating a Parquet Table from NOAA's CSV files

[Reference](https://registry.opendata.aws/noaa-ghcn/)

In [3]:
cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])  # Read 10 files from the 1890 decade (~1GB)

df

| | id | dt | element | value | m_flag | q_flag | s_flag | obs_time |
|---|---|---|---|---|---|---|---|---|
| 0 | ASN00070200 | 1890-01-01 | PRCP | 0 | | | a | |
| 1 | SF000782720 | 1890-01-01 | PRCP | 0 | | | I | |
| 2 | CA005022790 | 1890-01-01 | TMAX | -222 | | | C | |
| 3 | CA005022790 | 1890-01-01 | TMIN | -261 | | | C | |
| 4 | CA005022790 | 1890-01-01 | PRCP | 0 | | | C | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29240012 | USC00181790 | 1899-12-31 | PRCP | 0 | P | | 6 | 1830 |
| 29240013 | ASN00061000 | 1899-12-31 | PRCP | 0 | | | a | |
| 29240014 | ASN00040284 | 1899-12-31 | PRCP | 0 | | | a | |
| 29240015 | ASN00048117 | 1899-12-31 | PRCP | 0 | | | a | |


In [7]:
df["year"] = df["dt"].dt.year

df.head(3)

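| | id | dt | element | value | m_flag | q_flag | s_flag | obs_time | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ASN00070200 | 1890-01-01 | PRCP | 0 | | | a | | 1890 |
| 1 | SF000782720 | 1890-01-01 | PRCP | 0 | | | I | | 1890 |
| 2 | CA005022790 | 1890-01-01 | TMAX | -222 | | | C | | 1890 |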


In [11]:
res = wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    partition_cols=["year"]
)

In [13]:
[x.split("data/", 1)[1] for x in wr.s3.list_objects(path)]

['year=1890/70dc88e20ada4f68babb85899f31dd90.snappy.parquet',
 'year=1891/324ce11cde2a445d812c7b88b316a275.snappy.parquet',
 'year=1892/357ffc59368440ad8d1d4eec8aea8fe3.snappy.parquet',
 'year=1893/ea383a0ad64444aaae3b40a513c1e4c8.snappy.parquet',
 'year=1894/ac4d78b302e3498e8156e5157a3b5218.snappy.parquet',
 'year=1895/e8d76306c3954176bcd9604bf5d83127.snappy.parquet',
 'year=1896/d12e4e33a1d14b80b6e5b04d81a74ddb.snappy.parquet',
 'year=1897/993d3cee2b574d8e90c66e70d2a599f7.snappy.parquet',
 'year=1898/f6a3d9502e534d7fac2045043b55f32f.snappy.parquet',
 'year=1899/55cb44db93cd4a78864133f06f859725.snappy.parquet']
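
The `year=<value>/` prefixes in the listing above follow the Hive-style partitioning convention, which lets partition values be recovered from the object key alone. The sketch below illustrates that convention with plain Python. The helper `parse_partitions` is hypothetical and written only for this example; it is not part of awswrangler and is not a description of how the library parses paths internally.

```python
from typing import Dict


def parse_partitions(key: str) -> Dict[str, str]:
    """Return the Hive-style partition values encoded in an object key.

    Hypothetical helper for illustration only (not an awswrangler function).
    """
    partitions: Dict[str, str] = {}
    # Every path segment except the last one (the file name) may be "name=value".
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name:
            partitions[name] = value
    return partitions


# Two of the relative keys from the listing above.
keys = [
    "year=1890/70dc88e20ada4f68babb85899f31dd90.snappy.parquet",
    "year=1891/324ce11cde2a445d812c7b88b316a275.snappy.parquet",
]
parsed = [parse_partitions(k) for k in keys]
print(parsed)  # [{'year': '1890'}, {'year': '1891'}]
```

Note that values recovered from a path are always strings; any numeric typing has to come from elsewhere.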

## Crawling!

In [15]:
%%time

res = wr.s3.store_parquet_metadata(
    path=path,
    database="awswrangler_test",
    table="crawler",
    dataset=True
)

CPU times: user 270 ms, sys: 24.8 ms, total: 295 ms
Wall time: 1.91 s
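
`wr.s3.store_parquet_metadata` is documented to return a tuple of three items: the column types, the partition types, and the partition values keyed by S3 location. The sketch below shows one way to unpack and summarize such a result. The helper `summarize_metadata`, the bucket name `my-bucket`, and the sample values are assumptions made up for this example so that it runs without AWS access; in the notebook you would pass the real `res` instead.

```python
from typing import Dict, List, Optional, Tuple

# Shape of the value returned by store_parquet_metadata, per its documentation.
Metadata = Tuple[
    Dict[str, str],                   # columns_types: column name -> type
    Optional[Dict[str, str]],         # partitions_types: partition name -> type
    Optional[Dict[str, List[str]]],   # partitions_values: S3 location -> values
]


def summarize_metadata(res: Metadata) -> str:
    """Build a short, human-readable summary (hypothetical helper)."""
    columns_types, partitions_types, partitions_values = res
    lines = [f"{name}: {dtype}" for name, dtype in columns_types.items()]
    for name, dtype in (partitions_types or {}).items():
        lines.append(f"{name}: {dtype} (partition)")
    lines.append(f"{len(partitions_values or {})} partition(s) registered")
    return "\n".join(lines)


# Illustrative stand-in for `res`; these values are invented for the example.
example_res: Metadata = (
    {"id": "string", "dt": "timestamp", "value": "bigint"},
    {"year": "bigint"},
    {
        "s3://my-bucket/data/year=1890/": ["1890"],
        "s3://my-bucket/data/year=1891/": ["1891"],
    },
)
summary = summarize_metadata(example_res)
print(summary)
```

In the notebook itself, `summarize_metadata(res)` would give a quick view of what was registered before you query the catalog.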


## Checking

In [16]:
wr.catalog.table(database="awswrangler_test", table="crawler")

| | Column Name | Type | Partition | Comment |
|---|---|---|---|---|
| 0 | id | string | False | |
| 1 | dt | timestamp | False | |
| 2 | element | string | False | |
| 3 | value | bigint | False | |
| 4 | m_flag | string | False | |
| 5 | q_flag | string | False | |
| 6 | s_flag | string | False | |
| 7 | obs_time | string | False | |
| 8 | year | bigint | True | |


In [17]:
%%time

wr.athena.read_sql_query("SELECT * FROM crawler WHERE year=1890", database="awswrangler_test")

CPU times: user 1.29 s, sys: 198 ms, total: 1.48 s
Wall time: 14.3 s


| | id | dt | element | value | m_flag | q_flag | s_flag | obs_time | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ASN00070200 | 1890-01-01 | PRCP | 0 | | | a | | 1890 |
| 1 | SF000782720 | 1890-01-01 | PRCP | 0 | | | I | | 1890 |
| 2 | CA005022790 | 1890-01-01 | TMAX | -222 | | | C | | 1890 |
| 3 | CA005022790 | 1890-01-01 | TMIN | -261 | | | C | | 1890 |
| 4 | CA005022790 | 1890-01-01 | PRCP | 0 | | | C | | 1890 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1276241 | ASN00016005 | 1890-12-17 | PRCP | 0 | | | a | | 1890 |
| 1276242 | CA006127887 | 1890-12-17 | TMAX | 0 | | | C | | 1890 |
| 1276243 | CA006127887 | 1890-12-17 | TMIN | -22 | | | C | | 1890 |
| 1276244 | CA006127887 | 1890-12-17 | PRCP | 3 | | | C | | 1890 |


## Cleaning Up S3

In [1]:
wr.s3.delete_objects(path)

## Cleaning Up the Database

In [None]:
for table in wr.catalog.get_tables(database="awswrangler_test"):
    wr.catalog.delete_table_if_exists(database="awswrangler_test", table=table["Name"])