to_parquet() - dtype parameter #365

@jarretg

Description

Hello,

I have a process where I read AWS DMS-created parquet files sourced from SQL Server. One of my source tables has a date column containing the value 3015-06-29. When I read the metadata from the parquet file, the column is reported as a 'date' type ('startdate': 'date'), and when I display the dataframe I can see the 3015-06-29 value in the 'startdate' column. Without any transformations, I attempt to write the file back out, supplying the dtype parameter to the to_parquet() call, but I get an error (OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3015-06-29 00:00:00). However, if I remove the dtype parameter, the write to S3 completes successfully.
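
For what it's worth, the date itself sits outside what pandas can represent as a nanosecond timestamp, so here is a minimal sketch (pandas only, no wrangler involved) reproducing the same error:

import pandas as pd

# pandas datetime64[ns] only covers roughly 1677-09-21 through 2262-04-11,
# so forcing 3015-06-29 into a nanosecond timestamp must fail:
try:
    pd.Timestamp("3015-06-29")
except pd.errors.OutOfBoundsDatetime as exc:
    print(exc)  # Out of bounds nanosecond timestamp: 3015-06-29 00:00:00

So it looks like supplying dtype triggers a cast to datetime64[ns] somewhere, while the default path leaves the column as date objects.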

import awswrangler as wr

key = 's3://bucket/file.parquet'
column_metadata = wr.s3.read_parquet_metadata(key)[0]  # {'startdate': 'date', ...}
df = wr.s3.read_parquet(key)

output_key = 's3://bucket/output/file.parquet'

# Write to S3 with the dtype parameter.
wr.s3.to_parquet(df, path=output_key, dtype=column_metadata, compression='gzip')  # Fails with the OutOfBoundsDatetime error.

# Write to S3 without the dtype parameter.
wr.s3.to_parquet(df, path=output_key, compression='gzip')  # Completes successfully.

wr.s3.read_parquet_metadata(output_key)[0]  # Shows 'startdate': 'date'

wr.s3.read_parquet(output_key)  # Shows the date 3015-06-29 as in the source.

I hope this is all understandable! I need the ability to supply the column types on the write because the inferred schema is causing other issues elsewhere. Thanks for your time.
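
In case it helps, a possible interim workaround (a sketch outside of wrangler, with a hypothetical local path) is to keep the column as datetime.date objects and pin the Arrow type to date32 when writing with pyarrow directly:

import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Object-dtype column of datetime.date values; no datetime64[ns] cast happens.
df = pd.DataFrame({"startdate": [datetime.date(3015, 6, 29)]})

# date32 counts days since the epoch, so far-future dates are representable.
schema = pa.schema([pa.field("startdate", pa.date32())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, "file.parquet", compression="gzip")

That keeps 'startdate' as a parquet date type without ever going through a nanosecond timestamp, though it obviously gives up the wrangler conveniences.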

Edited to add: I am using AWS Data Wrangler version 1.8.0.

Jarret
