Description
When a DataFrame has a column whose name requires sanitization through `sanitize_column_name` (a dataset with an Athena table), that column is duplicated in the schema. It seems that sanitization occurs at the wrong moment.
I've tried to minimize the bug case, but there may still be unneeded conditions. I derived this case from the one in the [tutorial](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/004 - Parquet Datasets.ipynb) (and from the problem I ran into!).
First, create a new table with:
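from datetime import date

import awswrangler as wr
import pandas as pd

# `path` is the S3 prefix of the dataset (its value is not given in this report).
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "DATE": [date(2020, 1, 1), date(2020, 1, 2)]
})
# Use the uppercase "DATE" column as the index.
df.set_index("DATE", inplace=True, verify_integrity=True)
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    index=True,
    database="awswrangler",
    table="test",
    mode="overwrite"
)
wr.s3.read_parquet(path, dataset=True)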
Note that the uppercase name of the index column, `DATE`, is kept through Parquet serialization/deserialization. The Glue catalog, however, shows the lowercase name `date` for the column.
Now, add new data with:
df = pd.DataFrame({
    "id": [3],
    "value": ["bar"],
    "DATE": [date(2020, 1, 3)]
})
df.set_index("DATE", inplace=True, verify_integrity=True)
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    index=True,
    database="awswrangler",
    table="test",
    mode="append"
)
wr.s3.read_parquet(path, dataset=True)
Again, the name is preserved in Parquet, but now the schema contains two columns with the same name, `date`, and the table is broken.
In my tests, this behavior appears only when the affected column is the index.
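To make the suspected mechanism concrete, here is a minimal pure-Python sketch. It is not the library's code: `sanitize`, `merge_schema_late`, and `merge_schema_early` are hypothetical helpers, and the sanitizer is reduced to lowercasing. It only illustrates how comparing raw names before sanitizing could produce a duplicated `date` column, while sanitizing first would not.

```python
def sanitize(name: str) -> str:
    """Hypothetical stand-in for column-name sanitization: lowercase only."""
    return name.lower()


def merge_schema_late(catalog_cols, new_cols):
    """Compare raw names first, sanitize afterwards (the suspected ordering)."""
    merged = list(catalog_cols)
    for col in new_cols:
        if col not in merged:  # "DATE" != "date", so it looks like a new column
            merged.append(col)
    return [sanitize(c) for c in merged]


def merge_schema_early(catalog_cols, new_cols):
    """Sanitize first, then compare, so "DATE" matches the existing "date"."""
    merged = list(catalog_cols)
    for col in map(sanitize, new_cols):
        if col not in merged:
            merged.append(col)
    return merged


catalog = ["id", "value", "date"]   # assumed names stored in the Glue catalog
incoming = ["id", "value", "DATE"]  # names in the appended DataFrame

late = merge_schema_late(catalog, incoming)
early = merge_schema_early(catalog, incoming)
print(late)   # contains "date" twice
print(early)  # contains "date" once
```

If the real cause is similar, sanitizing the index name at the same point as the regular column names would presumably avoid the duplicate.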
Thanks!