[![AWS Data Wrangler](_static/logo.png "AWS Data Wrangler")](https://github.com/awslabs/aws-data-wrangler)

# 14 - Schema Evolution

Wrangler supports adding new **columns** to an existing Parquet dataset through:

- [wr.s3.to_parquet()](https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.to_parquet.html#awswrangler.s3.to_parquet)
- [wr.s3.store_parquet_metadata()](https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.store_parquet_metadata.html#awswrangler.s3.store_parquet_metadata), which acts like a "Crawler"
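
Conceptually, adding columns means the table schema becomes the union of the old and the new columns, with existing columns keeping their position and type. The sketch below models that idea in plain Python. `merge_schemas` is a hypothetical helper written for illustration, not part of the Wrangler API, and the type names are only example strings.

```python
from typing import Dict


def merge_schemas(current: Dict[str, str], incoming: Dict[str, str]) -> Dict[str, str]:
    """Return `current` extended with columns that only appear in `incoming`.

    Hypothetical helper for illustration; not part of awswrangler.
    """
    merged = dict(current)  # existing columns keep their order and type
    for name, dtype in incoming.items():
        if name not in merged:
            merged[name] = dtype  # a new column is appended at the end
        elif merged[name] != dtype:
            # This sketch only models *added* columns, so a type change is rejected.
            raise ValueError(f"column {name!r} changes type: {merged[name]} -> {dtype}")
    return merged


# Example schemas mirroring the two DataFrames written in this tutorial.
v0 = {"id": "bigint", "value": "string"}
v1 = merge_schemas(
    v0,
    {"id": "bigint", "value": "string", "date": "date", "flag": "boolean"},
)
print(list(v1))
```

The original schema is left untouched, so both versions can be compared side by side, much like the two schema versions shown in the Glue console screenshots.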

In [1]:
from datetime import date
import awswrangler as wr
import pandas as pd

## Enter your bucket name:

In [2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/dataset/"



## Creating the Dataset

In [3]:
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="aws_data_wrangler",
    table="my_table"
)

wr.s3.read_parquet(path, dataset=True)

|   | id | value |
|---|----|-------|
| 0 | 1  | foo   |
| 1 | 2  | boo   |


### Schema Version 0 on Glue Catalog (AWS Console)

![Glue Console](_static/glue_catalog_version_0.png "Glue Console")

## Appending with NEW COLUMNS

In [4]:
df = pd.DataFrame({
    "id": [3, 4],
    "value": ["bar", None],
    "date": [date(2020, 1, 3), date(2020, 1, 4)],
    "flag": [True, False]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="append",
    database="aws_data_wrangler",
    table="my_table",
    catalog_versioning=True  # Optional
)

wr.s3.read_parquet(path, dataset=True, validate_schema=False)

|   | id | value | date       | flag  |
|---|----|-------|------------|-------|
| 0 | 3  | bar   | 2020-01-03 | True  |
| 1 | 4  |       | 2020-01-04 | False |
| 2 | 1  | foo   |            |       |
| 3 | 2  | boo   |            |       |
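
The empty cells in the older rows follow from reading files written under different schemas: a column that a file lacks has no value for that file's rows. Below is a minimal plain-Python sketch of that behaviour, assuming each "file" is just a list of row dictionaries. `read_with_union_schema` is a hypothetical helper, not a Wrangler function; the real reader works on Parquet files and may return rows in a different order.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def read_with_union_schema(files: List[List[Row]]) -> List[Row]:
    """Combine rows from several 'files', filling absent columns with None.

    Hypothetical helper for illustration; not part of awswrangler.
    """
    # First pass: collect every column name in first-seen order.
    columns: List[str] = []
    for rows in files:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Second pass: project every row onto the full column list.
    return [{name: row.get(name) for name in columns} for rows in files for row in rows]


# Rows as written by the first (overwrite) and second (append) calls.
old_file = [{"id": 1, "value": "foo"}, {"id": 2, "value": "boo"}]
new_file = [
    {"id": 3, "value": "bar", "date": "2020-01-03", "flag": True},
    {"id": 4, "value": None, "date": "2020-01-04", "flag": False},
]

rows = read_with_union_schema([old_file, new_file])
for row in rows:
    print(row)
```

Rows from `old_file` come back with `date` and `flag` set to `None`, which corresponds to the blank cells in the output above.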


### Schema Version 1 on Glue Catalog (AWS Console)

![Glue Console](_static/glue_catalog_version_1.png "Glue Console")

## Reading from Athena

In [5]:
wr.athena.read_sql_table(table="my_table", database="aws_data_wrangler")

|   | id | value | date       | flag  |
|---|----|-------|------------|-------|
| 0 | 3  | bar   | 2020-01-03 | True  |
| 1 | 4  |       | 2020-01-04 | False |
| 2 | 1  | foo   |            |       |
| 3 | 2  | boo   |            |       |


## Cleaning Up

In [6]:
wr.s3.delete_objects(path)
wr.catalog.delete_table_if_exists(table="my_table", database="aws_data_wrangler")

True