projection partitions in s3 to_parquet method #1613

@alekya1024

Description

Describe the bug

I am using the wr.s3.to_parquet method to create a table in the Glue catalog and store the data in an S3 bucket. I want to keep the data partitioned in S3, and for the partitions to be dynamic I need projected partitions.

Requirements/Expectations:
1/ The partition format should be yyyy/MM/dd.
2/ The object prefix should be s3-bucket-name/folder1/folder2/yyyy/MM/dd.
3/ The table should have a partition column called 'datepath' that holds the 'yyyy/MM/dd' value, i.e. the partition folder path.
4/ The storage.location.template table parameter should be added to the table, with the value s3-bucket-name/folder1/folder2/${datepath}.
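
To make requirements 1, 2 and 4 concrete, here is a minimal sketch of the three strings involved. The helper names are hypothetical and not part of awswrangler, and the s3:// scheme on the base location is an assumption, since the requirement above lists the bucket name without it.

```python
from datetime import date


def datepath_value(day: date) -> str:
    """Render a date as the partition value in yyyy/MM/dd form."""
    return day.strftime("%Y/%m/%d")


def object_prefix(base: str, day: date) -> str:
    """Prefix under which one day's objects are expected (requirement 2)."""
    return f"{base.rstrip('/')}/{datepath_value(day)}"


def location_template(base: str, column: str = "datepath") -> str:
    """Value expected for storage.location.template (requirement 4)."""
    # The doubled braces emit a literal "${column}" placeholder.
    return f"{base.rstrip('/')}/${{{column}}}"


# Hypothetical base location, assumed to carry the s3:// scheme.
base = "s3://s3-bucket-name/folder1/folder2"
print(object_prefix(base, date(2023, 1, 5)))
print(location_template(base))
```

The point of the sketch is that the partition value itself contains slashes, so the prefix has no `datepath=` segment in it.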

Observations:
1/ The to_parquet method does not accept the partition column format as an input, so I executed an ALTER command on the table to add that table property with the format yyyy/MM/dd.
2/ The object prefix is s3-bucket-name/folder1/folder2/datepath=yyyy/MM/dd. I did not expect the folder name to contain the partition column name.
3/ The table has a partition column "datepath", as expected.
4/ The storage.location.template table property is not added to the table.
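
For observations 1 and 4, the table properties in question can be sketched as a plain dictionary, together with the kind of ALTER statement mentioned above. This is an illustration under assumptions: the property keys follow the Athena partition projection naming, while the function names, the start date, and the database and table names are hypothetical.

```python
def projection_parameters(column: str, start: str, base: str,
                          date_format: str = "yyyy/MM/dd") -> dict:
    """Build Athena partition projection properties for one date column."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.format": date_format,
        # The start value must be written in the same format as date_format.
        f"projection.{column}.range": f"{start},NOW+1DAYS",
        f"projection.{column}.interval": "1",
        f"projection.{column}.interval.unit": "DAYS",
        "storage.location.template": f"{base.rstrip('/')}/${{{column}}}",
    }


def alter_table_statement(database: str, table: str, parameters: dict) -> str:
    """Render an ALTER TABLE ... SET TBLPROPERTIES statement.

    Assumes keys and values contain no single quotes.
    """
    props = ", ".join(f"'{k}' = '{v}'" for k, v in parameters.items())
    return f"ALTER TABLE `{database}`.`{table}` SET TBLPROPERTIES ({props})"


# Hypothetical names and start date, for illustration only.
params = projection_parameters(
    "datepath", "2023/01/01", "s3://s3-bucket-name/folder1/folder2"
)
print(alter_table_statement("my_db", "my_table", params))
```

The same dictionary could presumably also be merged into the `parameters` argument of the call in the reproduction below. Whether the library keeps those keys alongside its own projection settings is not confirmed here.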

Observation #1 could be an improvement: accept the partition format from the user. Please let me know what else I need to do to meet my requirements.
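
One possible direction for observation #2, offered as an untested assumption rather than a confirmed answer: since partition_cols produces `datepath=` folders, the rows could be grouped by their datepath value and each group written to an explicit prefix without partition_cols. The sketch below only plans the grouping in plain Python; the function name and the sample rows are hypothetical, and the actual write call is left out.

```python
from collections import defaultdict


def plan_writes(rows: list, base: str, column: str = "datepath") -> dict:
    """Group rows by partition value and map each group to its target prefix."""
    groups = defaultdict(list)
    for row in rows:
        # Drop the partition column from the payload: with projection,
        # its value would come from the object path instead.
        payload = {k: v for k, v in row.items() if k != column}
        groups[f"{base.rstrip('/')}/{row[column]}/"].append(payload)
    return dict(groups)


# Hypothetical sample rows.
rows = [
    {"id": 1, "datepath": "2023/01/05"},
    {"id": 2, "datepath": "2023/01/05"},
    {"id": 3, "datepath": "2023/01/06"},
]
for prefix, payload in plan_writes(rows, "s3://s3-bucket-name/folder1/folder2").items():
    print(prefix, len(payload))
```

Each planned group would then need its own write, and the table would need to be registered with the projection properties separately.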

How to Reproduce

I have used the code below:

import awswrangler as wr
import pandas as pd

wr.s3.to_parquet(
    df=pd.DataFrame(data_dict),
    compression='snappy',
    dataset=True,  # boolean, not the string 'True'
    path=f'{s3_path}/{current_date}',
    partition_cols=['datepath'],
    mode='append',
    # Partition projection settings for the 'datepath' column
    projection_enabled=True,
    projection_types={'datepath': 'date'},
    projection_ranges={'datepath': f'{current_date},NOW+1DAYS'},
    projection_intervals={'datepath': '1'},
    schema_evolution=True,  # boolean, not the string 'True'
    database=db_name,
    table=table,
    table_type='EXTERNAL_TABLE',
    dtype=cols_dict,
    sanitize_columns=False,
    # Extra Glue table parameters
    parameters={
        'classification': 'parquet',
        'compressionType': 'none',
        'typeOfData': 'file',
    },
)

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Executing the code in AWS glue job

Python version

3.6

AWS SDK for pandas version

1.1.5

Additional context

No response
