Description
Describe the bug
I am using the wr.s3.to_parquet method to create a table in the Glue catalog and store the data in an S3 bucket. I want to maintain partitions in S3, and for them to be dynamic I need partition projection.
Requirements/Expectations:
1/ partition format should be yyyy/MM/dd
2/ object prefix --> s3-bucket-name/folder1/folder2/yyyy/MM/dd
3/ My table should have a partition column called 'datepath' that contains the 'yyyy/MM/dd' value, i.e. the partition folder path.
4/ storage.location.template table parameter should be added to my table and its value should be s3-bucket-name/folder1/folder2/${datepath}
Observations:
1/ The to_parquet method does not accept the partition column format as an input, so I ran an ALTER command on the table to add that table property with the format yyyy/MM/dd.
2/ object prefix --> s3-bucket-name/folder1/folder2/datepath=yyyy/MM/dd. I did not expect the folder name to contain the partition column name.
3/ The table has a partition column "datepath" --> as expected
4/ storage.location.template table property is not added to the table.
Observation #1 could be an improvement: accept the partition format from the user. Please let me know what else I need to do to meet my requirements.
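To make requirements 1 and 4 concrete, here is a minimal sketch of the Athena partition projection table properties I expect to end up on the table. The helper name projection_properties is hypothetical and not part of awswrangler, and the s3:// scheme on the bucket path is my assumption; the property keys are the ones Athena documents for partition projection.

```python
from datetime import date


def projection_properties(column, base_location, start, date_format="yyyy/MM/dd"):
    """Build Athena partition projection table properties for one date column.

    Hypothetical helper for illustration only; not an awswrangler function.
    """
    base = base_location.rstrip("/")
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        # Java-style date pattern that Athena uses to render partition values
        f"projection.{column}.format": date_format,
        # Range bounds must be written in the same format as the partition values
        f"projection.{column}.range": f"{start:%Y/%m/%d},NOW+1DAYS",
        f"projection.{column}.interval": "1",
        f"projection.{column}.interval.unit": "DAYS",
        # Maps each projected value to a folder without the "column=" prefix
        "storage.location.template": f"{base}/${{{column}}}",
    }


props = projection_properties(
    "datepath", "s3://s3-bucket-name/folder1/folder2/", date(2023, 1, 15)
)
for key, value in sorted(props.items()):
    print(f"{key} = {value}")
```

These key/value pairs could be applied with ALTER TABLE ... SET TBLPROPERTIES, as I did for the format. wr.catalog.upsert_table_parameters may also work, but I have not checked whether it is available in the version I am running.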
How to Reproduce
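I used the code below:
import awswrangler as wr
import pandas as pd

# data_dict, s3_path, current_date, db_name, table and cols_dict are defined earlier in the Glue job
wr.s3.to_parquet(
    df=pd.DataFrame(data_dict),
    compression='snappy',
    dataset=True,  # dataset mode is required for partition_cols and the catalog arguments
    path=f'{s3_path}/{current_date}',
    partition_cols=['datepath'],
    mode='append',
    # Athena partition projection settings for the 'datepath' column
    projection_enabled=True,
    projection_types={'datepath': 'date'},
    projection_ranges={'datepath': f'{current_date},NOW+1DAYS'},
    projection_intervals={'datepath': '1'},
    schema_evolution=True,
    # Glue catalog target
    database=db_name,
    table=table,
    table_type='EXTERNAL_TABLE',
    dtype=cols_dict,
    sanitize_columns=False,
    # Extra table parameters stored on the Glue table
    parameters={
        'classification': 'parquet',
        'compressionType': 'none',
        'typeOfData': 'file',
    },
)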
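To illustrate observation 2, the sketch below contrasts the Hive-style prefix that a partitioned dataset write produces (column=value) with the plain prefix I need. Both helper names are hypothetical and only mimic the naming convention; they are not awswrangler internals, and the date value is just an example.

```python
def hive_style_prefix(base, column, value):
    """Prefix as produced by Hive-style partitioning: <base>/<column>=<value>/."""
    return f"{base.rstrip('/')}/{column}={value}/"


def plain_prefix(base, value):
    """Prefix I am expecting: <base>/<value>/ with no column name in the folder."""
    return f"{base.rstrip('/')}/{value}/"


base = "s3-bucket-name/folder1/folder2"
print(hive_style_prefix(base, "datepath", "2023/01/15"))  # observed layout
print(plain_prefix(base, "2023/01/15"))                   # required layout
```

With the plain layout, Athena can only locate the folders through storage.location.template, because the folder names no longer carry the partition column name.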
Expected behavior
No response
Your project
No response
Screenshots
No response
OS
The code runs in an AWS Glue job
Python version
3.6
AWS SDK for pandas version
1.1.5
Additional context
No response