to_parquet() - dtype parameter #365

@jarretg

Description

Hello,

I have a process where I read AWS DMS-created parquet files sourced from SQL Server. One of my source tables has a date column containing the value 3015-06-29. When I read the metadata from the parquet file, the column is reported as a 'date' type ('startdate': 'date'), and when I display the dataframe I can see the 3015-06-29 value in the 'startdate' column. Without any transformations, I attempt to write the file back out, supplying the dtype parameter to the to_parquet() call, but I get an error (OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3015-06-29 00:00:00). However, if I remove the dtype parameter, the write to S3 completes successfully.
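
For what it's worth, the date itself sits outside what pandas can represent as a nanosecond timestamp, so here is a minimal sketch (pandas only, no wrangler involved) reproducing the same error:

import pandas as pd

# pandas datetime64[ns] only covers roughly 1677-09-21 through 2262-04-11,
# so forcing 3015-06-29 into a nanosecond timestamp must fail:
try:
    pd.Timestamp("3015-06-29")
except pd.errors.OutOfBoundsDatetime as exc:
    print(exc)  # Out of bounds nanosecond timestamp: 3015-06-29 00:00:00

So it looks like supplying dtype triggers a cast to datetime64[ns] somewhere, while the default path leaves the column as date objects.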

import awswrangler as wr

key = 's3://bucket/file.parquet'
column_metadata = wr.s3.read_parquet_metadata(key)[0]  # {'startdate': 'date', ...}
df = wr.s3.read_parquet(key)

output_key = 's3://bucket/output/file.parquet'

# Write to S3 with the dtype parameter.
wr.s3.to_parquet(df, path=output_key, dtype=column_metadata, compression='gzip')  # Fails with the OutOfBoundsDatetime error.

# Write to S3 without the dtype parameter.
wr.s3.to_parquet(df, path=output_key, compression='gzip')  # Completes successfully.

wr.s3.read_parquet_metadata(output_key)[0]  # Shows 'startdate': 'date'

wr.s3.read_parquet(output_key)  # Shows the date 3015-06-29 as in the source.

I hope this is all understandable! I need the ability to supply the column types on the write because the inferred schema is causing other issues elsewhere. Thanks for your time.
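
In case it helps, a possible interim workaround (a sketch outside of wrangler, with a hypothetical local path) is to keep the column as datetime.date objects and pin the Arrow type to date32 when writing with pyarrow directly:

import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Object-dtype column of datetime.date values; no datetime64[ns] cast happens.
df = pd.DataFrame({"startdate": [datetime.date(3015, 6, 29)]})

# date32 counts days since the epoch, so far-future dates are representable.
schema = pa.schema([pa.field("startdate", pa.date32())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, "file.parquet", compression="gzip")

That keeps 'startdate' as a parquet date type without ever going through a nanosecond timestamp, though it obviously gives up the wrangler conveniences.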

Edited to add: I am using AWS Data Wrangler version 1.8.0.

Jarret
