
Support Parquet format in S3 source #5102

Closed
Phlair opened this issue Jul 30, 2021 · 1 comment · Fixed by #5305
Comments


Phlair commented Jul 30, 2021

Tell us about the problem you're trying to solve

Currently the S3 source (via the abstract files source) doesn't support syncing Parquet files.

Describe the solution you’d like

Build a ParquetParser here and make the appropriate changes to spec.py, the unit tests, and the documentation.

If those links are broken, we may have refactored where that code sits and I forgot to update this; comment @ me and I'll fix them!

Additional context

Adding this will enable syncing Parquet files in any new file storage source that builds on the abstract files source.

@Phlair Phlair added the type/enhancement New feature or request label Jul 30, 2021
@sherifnada sherifnada added area/connectors Connector related issues lang/python labels Jul 30, 2021
@sherifnada sherifnada modified the milestone: Connectors August 6th Aug 3, 2021

Phlair commented Aug 3, 2021

Implementing a new file format will require creating an appropriate parser class in .../source-s3/source_s3/source_files_abstract/fileformatparser.py (source_files_abstract will be abstracted out soon, but for now it lives in source-s3).

The least-effort approach here would be to utilise PyArrow (this might also be useful) since the type conversions are already handled, but if there's good reason to use something else then do that and build an appropriate type/schema converter into the FileFormatParser ABC class. A reason to do so would be if PyArrow doesn't support streaming Parquet in batches.

In .../source-s3/source_s3/source_files_abstract/spec.py, make an appropriate {Filetype}Format class (see CsvFormat for an example). Make sure it's referenced by the format property in the SourceFilesAbstractSpec class in the same file. This adds the format to the spec so the user can select it and customise its settings in the UI.
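The format classes in spec.py are pydantic models; a hedged sketch of the shape a ParquetFormat might take (the extra option below is an invented example, not the connector's actual field):

```python
# Hypothetical sketch of the spec addition; only the pattern (a pydantic
# model with a filetype discriminator, per CsvFormat) comes from the
# issue -- the batch_size option is an invented illustration.
from pydantic import BaseModel, Field


class ParquetFormat(BaseModel):
    filetype: str = "parquet"
    # An example per-format option a user could tune in the UI (assumed).
    batch_size: int = Field(default=65536, description="Rows per record batch when streaming.")
```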

You'll also need to add the format to this map, with the key matching the filetype enum from the spec (e.g. csv, parquet) and the value being the name of your parser class, so the connector can use it.
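The map's assumed shape, with stub classes standing in for the real parsers:

```python
# Illustrative only: stub classes stand in for the real parser classes,
# and the dict mirrors the assumed shape of the connector's map.
class CsvParser:
    pass


class ParquetParser:
    pass


fileformatparser_map = {
    "csv": CsvParser,
    "parquet": ParquetParser,  # the new entry this issue calls for
}

# The connector would then look up the parser by the spec's filetype value.
parser_cls = fileformatparser_map["parquet"]
```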

Unit tests in .../source-s3/unit_tests/test_fileformatparser.py will require creating some files to test against and adding a class like TestCsvParser with the details. The abstract tests can then run against these test inputs.
