[![AWS Data Wrangler](_static/logo.png "AWS Data Wrangler")](https://github.com/awslabs/aws-data-wrangler)

# 4 - Parquet Datasets

Wrangler offers three write modes for storing Parquet datasets on Amazon S3.

- **append** (Default)

    Adds new files without deleting anything already in the target path.
    
- **overwrite**

    Deletes everything in the target directory and then adds the new files.
    
- **overwrite_partitions** (Partition Upsert)

    Deletes only the paths of the partitions being updated and then writes the new partition files. In effect, it is a partition-level upsert.
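
To make the difference between the three modes concrete, here is a toy in-memory model. It is only a sketch of the visible effect on the files under the target path, not how Wrangler is implemented: the `write` function, the `Dataset` alias, and the file names are all hypothetical, and a plain dict stands in for the S3 prefix.

```python
from typing import Dict, List

# Toy stand-in for an S3 prefix: partition name -> list of file names.
# This only models the visible effect of each mode; it is not Wrangler's code.
Dataset = Dict[str, List[str]]


def write(dataset: Dataset, new_files: Dataset, mode: str = "append") -> Dataset:
    """Return the dataset contents after writing `new_files` with `mode`."""
    if mode == "append":
        # Keep every existing file; new files are added next to them.
        result = {part: list(files) for part, files in dataset.items()}
    elif mode == "overwrite":
        # Drop every existing file before writing.
        result = {}
    elif mode == "overwrite_partitions":
        # Drop only the partitions that the new write touches.
        result = {
            part: list(files)
            for part, files in dataset.items()
            if part not in new_files
        }
    else:
        raise ValueError(f"unknown mode: {mode}")
    for part, files in new_files.items():
        result.setdefault(part, []).extend(files)
    return result


existing = {"date=2020-01-01": ["a.parquet"], "date=2020-01-02": ["b.parquet"]}
incoming = {"date=2020-01-02": ["c.parquet"], "date=2020-01-03": ["d.parquet"]}

appended = write(existing, incoming, "append")
overwritten = write(existing, incoming, "overwrite")
upserted = write(existing, incoming, "overwrite_partitions")
```

In this model only `overwrite_partitions` leaves `date=2020-01-01` untouched while replacing `b.parquet` with `c.parquet` in `date=2020-01-02`.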

In [1]:
from datetime import date
import awswrangler as wr
import pandas as pd

## Enter your bucket name:

In [2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/dataset/"


## Creating the Dataset

In [3]:
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "date": [date(2020, 1, 1), date(2020, 1, 2)]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite"
)

wr.s3.read_parquet(path, dataset=True)

|   | id | value | date       |
|---|----|-------|------------|
| 0 | 1  | foo   | 2020-01-01 |
| 1 | 2  | boo   | 2020-01-02 |


## Appending

In [4]:
df = pd.DataFrame({
    "id": [3],
    "value": ["bar"],
    "date": [date(2020, 1, 3)]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="append"
)

wr.s3.read_parquet(path, dataset=True)

|   | id | value | date       |
|---|----|-------|------------|
| 0 | 1  | foo   | 2020-01-01 |
| 1 | 2  | boo   | 2020-01-02 |
| 2 | 3  | bar   | 2020-01-03 |


## Overwriting

In [5]:
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite"
)

wr.s3.read_parquet(path, dataset=True)

|   | id | value | date       |
|---|----|-------|------------|
| 0 | 3  | bar   | 2020-01-03 |


## Creating a **Partitioned** Dataset

In [6]:
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "date": [date(2020, 1, 1), date(2020, 1, 2)]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    partition_cols=["date"]
)

wr.s3.read_parquet(path, dataset=True)

|   | id | value | date       |
|---|----|-------|------------|
| 0 | 1  | foo   | 2020-01-01 |
| 1 | 2  | boo   | 2020-01-02 |
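
With `partition_cols=["date"]`, each distinct date ends up under its own prefix, following the Hive-style `column=value/` convention. The sketch below only computes those prefixes for the rows above; `partition_prefixes` is a hypothetical helper, `my-bucket` is a placeholder bucket name, and the Parquet file names that land inside each prefix are not modeled.

```python
from datetime import date
from typing import Dict, List


def partition_prefixes(path: str, rows: List[Dict], partition_cols: List[str]) -> List[str]:
    """Return the distinct Hive-style prefixes (col=value/) the rows fall under."""
    prefixes: List[str] = []
    for row in rows:
        # One path segment per partition column, in the order given.
        suffix = "".join(f"{col}={row[col]}/" for col in partition_cols)
        prefix = f"{path.rstrip('/')}/{suffix}"
        if prefix not in prefixes:
            prefixes.append(prefix)
    return prefixes


rows = [
    {"id": 1, "value": "foo", "date": date(2020, 1, 1)},
    {"id": 2, "value": "boo", "date": date(2020, 1, 2)},
]

# "my-bucket" is a placeholder, not a real bucket.
prefixes = partition_prefixes("s3://my-bucket/dataset/", rows, ["date"])
for p in prefixes:
    print(p)
```

These prefixes are the "paths of partitions" that `overwrite_partitions` deletes and rewrites.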


## Upserting partitions (overwrite_partitions)

In [7]:
df = pd.DataFrame({
    "id": [2, 3],
    "value": ["xoo", "bar"],
    "date": [date(2020, 1, 2), date(2020, 1, 3)]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["date"]
)

wr.s3.read_parquet(path, dataset=True)

|   | id | value | date       |
|---|----|-------|------------|
| 0 | 1  | foo   | 2020-01-01 |
| 1 | 2  | xoo   | 2020-01-02 |
| 2 | 3  | bar   | 2020-01-03 |
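
Note that the upsert works on whole partitions, not on individual rows: a touched partition is replaced by whatever the new DataFrame contains for it. The following row-level sketch reproduces the result above under that assumption. `upsert_partitions` is a hypothetical pure-Python helper, dates are plain strings for brevity, and the extra row with `id` 9 is invented purely to show what happens to rows that are missing from the new write.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def upsert_partitions(existing: List[Row], incoming: List[Row], partition_col: str) -> List[Row]:
    """Model a partition-level upsert: touched partitions are replaced wholesale."""
    by_partition = defaultdict(list)
    for row in existing:
        by_partition[row[partition_col]].append(row)
    # Empty every partition that the incoming rows touch.
    for key in {row[partition_col] for row in incoming}:
        by_partition[key] = []
    for row in incoming:
        by_partition[row[partition_col]].append(row)
    return [row for key in sorted(by_partition) for row in by_partition[key]]


existing = [
    {"id": 1, "value": "foo", "date": "2020-01-01"},
    {"id": 2, "value": "boo", "date": "2020-01-02"},
]
incoming = [
    {"id": 2, "value": "xoo", "date": "2020-01-02"},
    {"id": 3, "value": "bar", "date": "2020-01-03"},
]
result = upsert_partitions(existing, incoming, "date")

# Hypothetical extra row: it shares a partition with id 2 but is absent
# from the incoming rows, so it does not survive the upsert.
with_extra = existing + [{"id": 9, "value": "zzz", "date": "2020-01-02"}]
ids_after = [row["id"] for row in upsert_partitions(with_extra, incoming, "date")]
print(ids_after)  # [1, 2, 3]
```

If rows in an existing partition must be kept, they need to be included in the DataFrame being written, or the write should use `append` instead.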
