Dates get registered as string in Glue but correct in Parquet #24

@mlavaert

Description

Consider the following Pandas Dataframe:

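import datetime
import pandas as pd

# Start from a small frame so the scalar assignments below broadcast to every row.
df = pd.DataFrame(index=range(3))
df['datetime'] = datetime.datetime.today()
df['normalized_date'] = df['datetime'].dt.normalize()  # same timestamp, time set to midnight
df['date'] = datetime.date.today()  # plain Python date objects
df = df[['datetime', 'normalized_date', 'date']]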

This results in a schema like this:

datetime           datetime64[ns]
normalized_date    datetime64[ns]
date               object

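When the DataFrame is saved, this is the resulting schema in Glue:

[Screenshot: Glue table schema; per the issue title, the date column is registered as string]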

While the schema in Parquet looks like this when reading the file with Spark:

root
 |-- datetime: timestamp (nullable = true)
 |-- normalized_date: timestamp (nullable = true)
 |-- date: date (nullable = true)

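This issue might be caused by the Glue.type_pandas2athena() function, which converts all pandas 'object' dtypes to string. Since a column of Python date objects also has dtype 'object', it ends up as string in Glue even though it is written as a date in Parquet. Maybe the Glue schema should be derived from the pyarrow.Table schema instead of the pandas dtypes.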
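
To illustrate the suggestion, here is a minimal sketch of a type mapping driven by pyarrow type names rather than pandas dtypes. The names `arrow_type_to_athena`, `schema_to_athena` and the mapping table are hypothetical and not part of the library; the input strings are the type names pyarrow prints for its types (for example `timestamp[ns]` and `date32[day]`), and the list of supported types is an assumption that would need to be extended.

```python
import re

# Hypothetical mapping from pyarrow type names (as printed by str(field.type))
# to Athena/Glue column types. Illustrative only, not the library's own table.
_ARROW_TO_ATHENA = {
    "bool": "boolean",
    "int8": "tinyint",
    "int16": "smallint",
    "int32": "int",
    "int64": "bigint",
    "float": "float",
    "double": "double",
    "string": "string",
    "date32[day]": "date",
}


def arrow_type_to_athena(arrow_type: str) -> str:
    """Translate one pyarrow type name into an Athena type name."""
    # Timestamps carry a unit and an optional timezone,
    # e.g. "timestamp[ns]" or "timestamp[us, tz=UTC]".
    if re.fullmatch(r"timestamp\[.*\]", arrow_type):
        return "timestamp"
    try:
        return _ARROW_TO_ATHENA[arrow_type]
    except KeyError:
        raise ValueError(f"Unsupported pyarrow type: {arrow_type}") from None


def schema_to_athena(arrow_schema: dict) -> dict:
    """Translate a {column name: pyarrow type name} dict column by column."""
    return {name: arrow_type_to_athena(t) for name, t in arrow_schema.items()}


# Type names matching the Parquet schema reported above.
example = {
    "datetime": "timestamp[ns]",
    "normalized_date": "timestamp[ns]",
    "date": "date32[day]",
}
print(schema_to_athena(example))
# {'datetime': 'timestamp', 'normalized_date': 'timestamp', 'date': 'date'}
```

With pyarrow installed, the input dict could be built from the table schema, for instance `{f.name: str(f.type) for f in pyarrow.Table.from_pandas(df).schema}`. Raising on unknown types, rather than falling back to string, is a design choice here so that a mismatch like the one in this issue fails loudly.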

Labels

bug (Something isn't working)
