Bad schema with s3.to_parquet in conjunction with non sane index column name (for Glue)

When a DataFrame has a column whose name requires sanitization (``sanitize_column_name``, which applies when writing a dataset with an Athena/Glue table), that column ends up duplicated in the table schema. It seems that sanitization occurs at the wrong moment.

I've tried to minimize the bug case, but some of the conditions may still be unnecessary. I derived this case from the one in the [tutorial](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/004 - Parquet Datasets.ipynb) (and from the problem I actually ran into!)

First, create a new table with:
```
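from datetime import date

import awswrangler as wr
import pandas as pd

# `path` is the S3 prefix of the dataset; its value is not shown in this report.
df = pd.DataFrame({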
    "id": [1, 2],
    "value": ["foo", "boo"],
    "DATE": [date(2020, 1, 1), date(2020, 1, 2)]
})

df.set_index("DATE", inplace=True, verify_integrity=True)

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    index=True,
    database="awswrangler",
    table="test",
    mode="overwrite"
)

wr.s3.read_parquet(path, dataset=True)
```

Note that the uppercase name of the index column, ``DATE``, is preserved through Parquet serialization/deserialization. The Glue catalog, however, shows the column under its lowercase name, ``date``.


Now, add new data with:
```
df = pd.DataFrame({
    "id": [3],
    "value": ["bar"],
    "DATE": [date(2020, 1, 3)]
})

df.set_index("DATE", inplace=True, verify_integrity=True)

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    index=True,
    database="awswrangler",
    table="test",
    mode="append"
)

wr.s3.read_parquet(path, dataset=True)
```
Again, the name is preserved in Parquet, but the Glue schema now contains two columns with the same name, ``date``, and the table is broken.
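
To make the suspected failure mode concrete, here is a minimal sketch of how a duplicate could appear if the index name is compared against the catalog *before* it is sanitized. This is only my guess at the cause, not awswrangler's actual code: ``sanitize`` and ``append_schema`` are hypothetical stand-ins written for this illustration.

```python
import re


def sanitize(name: str) -> str:
    """Hypothetical stand-in for Glue-style sanitization:
    lowercase the name and replace other characters with "_"."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())


def append_schema(catalog_columns, frame_columns, sanitize_first):
    """Merge the column names of an appended frame into an existing schema.

    sanitize_first=True  -> compare sanitized names against the catalog.
    sanitize_first=False -> compare raw names, sanitize only when storing.
    """
    merged = list(catalog_columns)
    for col in frame_columns:
        candidate = sanitize(col) if sanitize_first else col
        if candidate not in merged:       # looks like a new column
            merged.append(sanitize(col))  # the stored name is always sanitized
    return merged


catalog = ["id", "value", "date"]  # schema in Glue after the first write
frame = ["id", "value", "DATE"]    # columns plus index name of the appended frame

late = append_schema(catalog, frame, sanitize_first=False)
early = append_schema(catalog, frame, sanitize_first=True)
print(late)   # "DATE" is not found in the catalog, so "date" is added again
print(early)  # "date" is matched, so the schema is unchanged
```

If something like the ``sanitize_first=False`` path is applied to the index name only, it would explain why regular columns are not affected.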

In my tests, this behavior only shows up when the column in question is the index.

Thanks!




