
Support Parquet format in S3 source #5102

Closed
Phlair opened this issue Jul 30, 2021 · 1 comment · Fixed by #5305
Comments


Phlair commented Jul 30, 2021

Tell us about the problem you're trying to solve

Currently the S3 source (via the abstract files source) doesn't support syncing Parquet files.

Describe the solution you’d like

Build a ParquetParser here and make the appropriate changes to spec.py, the unit tests, and the documentation.

If those links are broken, we may have refactored where that code sits and I forgot to update this; comment @ me and I'll fix them!

Additional context

Adding this will enable syncing Parquet files in any new file storage source that builds on the abstract files source.

@Phlair Phlair added the type/enhancement New feature or request label Jul 30, 2021
@sherifnada sherifnada added area/connectors Connector related issues lang/python labels Jul 30, 2021
@sherifnada sherifnada modified the milestone: Connectors August 6th Aug 3, 2021

Phlair commented Aug 3, 2021

Implementing a new file format will require creating an appropriate parser class in .../source-s3/source_s3/source_files_abstract/fileformatparser.py (source_files_abstract will be abstracted out soon, but for now it lives in source-s3).

The least-effort approach here would be to utilise PyArrow (this might also be useful) since the type conversions are already handled, but if there's good reason to use something else then do that and build an appropriate type/schema converter into the FileFormatParser ABC class. A reason to do so would be if PyArrow doesn't support streaming Parquet in batches.

In .../source-s3/source_s3/source_files_abstract/spec.py, make an appropriate {Filetype}Format class (see CsvFormat for an example). Make sure it's referenced by the format property in the SourceFilesAbstractSpec class in the same file. This adds the format to the spec so the user can select it and customise its settings in the UI.
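The format classes in spec.py are pydantic models; a hedged sketch of the shape a ParquetFormat might take (the extra option below is an invented example, not the connector's actual field):

```python
# Hypothetical sketch of the spec addition; only the pattern (a pydantic
# model with a filetype discriminator, per CsvFormat) comes from the
# issue -- the batch_size option is an invented illustration.
from pydantic import BaseModel, Field


class ParquetFormat(BaseModel):
    filetype: str = "parquet"
    # An example per-format option a user could tune in the UI (assumed).
    batch_size: int = Field(default=65536, description="Rows per record batch when streaming.")
```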

You'll also need to add the format to this map, with the key matching the filetype enum from the spec (e.g. csv, parquet) and the value being the name of your parser class, so the connector can use it.
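The map's assumed shape, with stub classes standing in for the real parsers:

```python
# Illustrative only: stub classes stand in for the real parser classes,
# and the dict mirrors the assumed shape of the connector's map.
class CsvParser:
    pass


class ParquetParser:
    pass


fileformatparser_map = {
    "csv": CsvParser,
    "parquet": ParquetParser,  # the new entry this issue calls for
}

# The connector would then look up the parser by the spec's filetype value.
parser_cls = fileformatparser_map["parquet"]
```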

Unit tests in .../source-s3/unit_tests/test_fileformatparser.py will require creating some files to test against and adding a class like TestCsvParser with the details. The abstract tests can then run against these test inputs.
