
Bad schema with s3.to_parquet in conjunction with non sane index column name (for Glue) #343

@ericct

When a DataFrame has a column whose name needs sanitizing (sanitize_column_name, i.e. a dataset written with an Athena/Glue table), that column ends up duplicated in the table schema. It seems that sanitization occurs at the wrong moment.

I've tried to minimize the bug case, but there may still be unneeded conditions. I derived this case from the one in the [tutorial](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/004 - Parquet Datasets.ipynb) (after running into the problem myself!)

First, create a new table with:

from datetime import date

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "DATE": [date(2020, 1, 1), date(2020, 1, 2)]
})

df.set_index("DATE", inplace=True, verify_integrity=True)

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    index=True,
    database="awswrangler",
    table="test",
    mode="overwrite"
)

wr.s3.read_parquet(path, dataset=True)

Note that the uppercase name of the index column, DATE, is preserved through Parquet serialization/deserialization. The Glue catalog, however, shows the column name in lowercase (date).

Now, add new data with:

df = pd.DataFrame({
    "id": [3],
    "value": ["bar"],
    "DATE": [date(2020, 1, 3)]
})

df.set_index("DATE", inplace=True, verify_integrity=True)

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    index=True,
    database="awswrangler",
    table="test",
    mode="append"
)

wr.s3.read_parquet(path, dataset=True)

Again, the name is preserved in Parquet, but the table schema now has two columns with the same name, date, and the table is broken.
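To make the suspected ordering problem concrete, here is a minimal pure-Python sketch. It is not taken from the awswrangler source: `sanitize_name`, `merge_sanitize_late` and `merge_sanitize_early` are hypothetical helpers, and the sanitization rule (lowercase, non-alphanumerics replaced by underscores) is only an assumed stand-in for the real one. It shows how comparing the raw name DATE against the catalog before sanitizing could yield a second date column, while sanitizing first would not.

```python
import re


def sanitize_name(name: str) -> str:
    """Hypothetical stand-in for Glue-style sanitization (assumed rule):
    lowercase the name and replace other characters with underscores."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())


def merge_sanitize_late(catalog_cols, frame_cols):
    """Compare raw names with the catalog first, sanitize afterwards.
    'DATE' does not match 'date', so it is treated as a new column."""
    new_cols = [c for c in frame_cols if c not in catalog_cols]
    return catalog_cols + [sanitize_name(c) for c in new_cols]


def merge_sanitize_early(catalog_cols, frame_cols):
    """Sanitize first, then compare with the catalog."""
    sanitized = [sanitize_name(c) for c in frame_cols]
    new_cols = [c for c in sanitized if c not in catalog_cols]
    return catalog_cols + new_cols


# Catalog schema after the first (overwrite) write, names already lowercase.
catalog_cols = ["date", "id", "value"]
# Columns of the appended DataFrame, with the index name kept in uppercase.
frame_cols = ["DATE", "id", "value"]

late = merge_sanitize_late(catalog_cols, frame_cols)
early = merge_sanitize_early(catalog_cols, frame_cols)
print(late)   # ['date', 'id', 'value', 'date']
print(early)  # ['date', 'id', 'value']
```

If this is roughly what happens, renaming the index to its lowercase form before calling to_parquet might avoid the duplicate, but I have not verified that.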

In my tests, this behavior only shows up when the affected column is the index.

Thanks!

Labels: bug (Something isn't working), minor release (Will be addressed in the next minor release), ready to release
