
Source S3: infer schema of the first file only #23189

Merged

davydov-d
Collaborator

What

https://github.com/airbytehq/oncall/issues/1470

How

If a user-enforced schema is not provided, infer the file schema from the first file only; do not iterate over all the available files.
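
The behaviour described above can be sketched roughly as follows. This is only an illustration, not the connector's actual code: `resolve_schema`, `infer_schema`, `user_schema` and `fake_infer` are hypothetical names, and the schema is modelled as a plain dict of column name to type.

```python
from typing import Callable, Dict, Iterable, Optional

Schema = Dict[str, str]


def resolve_schema(
    files: Iterable[str],
    infer_schema: Callable[[str], Schema],
    user_schema: Optional[Schema] = None,
) -> Schema:
    """Hypothetical helper: prefer the user-enforced schema, else infer from the first file only."""
    if user_schema:
        # A user-enforced schema wins; no file is opened at all.
        return dict(user_schema)
    # Take only the first file instead of iterating over every available file.
    first_file = next(iter(files), None)
    if first_file is None:
        return {}
    return infer_schema(first_file)


# Stand-in for a real parser: records which files were inspected.
calls = []


def fake_infer(name: str) -> Schema:
    calls.append(name)
    return {"id": "integer"}


schema = resolve_schema(["a.csv", "b.csv", "c.csv"], fake_infer)
print(schema, calls)
```

The trade-off is that columns present only in later files are not discovered, so a user-enforced schema is the way to cover them.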

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/s3 labels Feb 17, 2023
@davydov-d
Collaborator Author

davydov-d commented Feb 17, 2023

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4202316205
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4202316205
🐛 https://gradle.com/s/oena33diakanw

Build Failed

Test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestBasicRead::test_read[inputs2] - Failed: Please check...
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:98: The previous and actual specifications are identical.
SKIPPED [5] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:507: The previous and actual discovered catalogs are identical.
============= 1 failed, 94 passed, 6 skipped in 261.28s (0:04:21) ==============

@davydov-d
Collaborator Author

☝️ the failing test is not related to this PR

this PR fixes the error above

Collaborator

@bazarnov bazarnov left a comment

LGTM for the precise fix you did, but the failing tests are not going to allow you to merge it the way it should be, right? Maybe the order of operations should be different here: you might want to merge the Avro fix first.

Contributor

@alafanechere alafanechere left a comment

Thanks for the effort. The overall logic looks good to me, but I have a couple of questions/suggestions.

@davydov-d
Collaborator Author

Hey @alafanechere can you please take a look at the comments I've left?

@davydov-d
Collaborator Author

davydov-d commented Feb 23, 2023

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4253170292
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4253170292
🐛 https://gradle.com/s/i54x6h6drhdby

Build Failed

Test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestBasicRead::test_read[inputs2] - Failed: Please check...
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:98: The previous and actual specifications are identical.
SKIPPED [5] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:507: The previous and actual discovered catalogs are identical.
============= 1 failed, 94 passed, 6 skipped in 270.79s (0:04:30) ==============

@davydov-d
Collaborator Author

davydov-d commented Feb 24, 2023

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4261712063
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4261712063
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/storagefile.py                       23      0   100%
source_s3/source_files_abstract/spec.py                              55      0   100%
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/jsonl_spec.py                13      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/source.py                                                  27      0   100%
source_s3/exceptions.py                                              10      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source_files_abstract/formats/parquet_parser.py            64      2    97%
source_s3/source_files_abstract/formats/abstract_file_parser.py      41      2    95%
source_s3/stream.py                                                  43      3    93%
source_s3/s3file.py                                                  41      3    93%
source_s3/source_files_abstract/formats/avro_parser.py               39      3    92%
source_s3/source_files_abstract/formats/jsonl_parser.py              53      5    91%
source_s3/source_files_abstract/file_info.py                         26      3    88%
source_s3/source_files_abstract/source.py                            41      7    83%
source_s3/source_files_abstract/formats/csv_parser.py               127     22    83%
source_s3/source_files_abstract/stream.py                           196     39    80%
source_s3/s3_utils.py                                                20      4    80%
source_s3/utils.py                                                   31     10    68%
-------------------------------------------------------------------------------------
TOTAL                                                               882    103    88%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/jsonl_spec.py                13      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/s3_utils.py                                                20      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source_files_abstract/storagefile.py                       23      1    96%
source_s3/source_files_abstract/stream.py                           196     11    94%
source_s3/stream.py                                                  43      3    93%
source_s3/s3file.py                                                  41      3    93%
source_s3/source_files_abstract/formats/abstract_file_parser.py      41      4    90%
source_s3/source.py                                                  27      4    85%
source_s3/source_files_abstract/file_info.py                         26      8    69%
source_s3/utils.py                                                   31     10    68%
source_s3/source_files_abstract/formats/csv_parser.py               127     46    64%
source_s3/exceptions.py                                              10      4    60%
source_s3/source_files_abstract/source.py                            41     18    56%
source_s3/source_files_abstract/spec.py                              55     31    44%
source_s3/source_files_abstract/formats/jsonl_parser.py              53     32    40%
source_s3/source_files_abstract/formats/avro_parser.py               39     25    36%
source_s3/source_files_abstract/formats/parquet_parser.py            64     44    31%
-------------------------------------------------------------------------------------
TOTAL                                                               882    244    72%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:98: The previous and actual specifications are identical.
SKIPPED [5] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:507: The previous and actual discovered catalogs are identical.
================== 95 passed, 6 skipped in 261.94s (0:04:21) ===================

Contributor

@alafanechere alafanechere left a comment

Thanks for the changes! I just have minor suggestions remaining.
Please sync with @airbytehq/cloud-support and @YowanR to decide if it's safe to deploy this potentially backward-incompatible version to all customers now.
If not, I would suggest pinning the version to 1.0.0 on Cloud and waiting for our "feature flag" functionality to test this version on a subset of users.

@davydov-d
Collaborator Author

@alafanechere thanks! What is the "feature flag" functionality? Do you have any links?

@alafanechere
Contributor

@davydov-d It's a project @pedroslopez and @erohmensing will tackle pretty soon. It will allow us to deploy a preview version of a connector to a subset of workspaces. Here's the tech spec if you're curious.

@davydov-d
Collaborator Author

davydov-d commented Mar 14, 2023

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4417909857
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4417909857
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/storagefile.py                       23      0   100%
source_s3/source_files_abstract/spec.py                              55      0   100%
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/jsonl_spec.py                13      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/source.py                                                  27      0   100%
source_s3/exceptions.py                                              10      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source_files_abstract/formats/parquet_parser.py            64      2    97%
source_s3/source_files_abstract/formats/abstract_file_parser.py      41      2    95%
source_s3/stream.py                                                  43      3    93%
source_s3/s3file.py                                                  41      3    93%
source_s3/source_files_abstract/formats/avro_parser.py               39      3    92%
source_s3/source_files_abstract/formats/jsonl_parser.py              53      5    91%
source_s3/source_files_abstract/file_info.py                         26      3    88%
source_s3/source_files_abstract/source.py                            41      7    83%
source_s3/source_files_abstract/formats/csv_parser.py               127     22    83%
source_s3/source_files_abstract/stream.py                           196     39    80%
source_s3/s3_utils.py                                                20      4    80%
source_s3/utils.py                                                   31     10    68%
-------------------------------------------------------------------------------------
TOTAL                                                               882    103    88%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/jsonl_spec.py                13      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/s3_utils.py                                                20      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source_files_abstract/storagefile.py                       23      1    96%
source_s3/source_files_abstract/stream.py                           196     11    94%
source_s3/stream.py                                                  43      3    93%
source_s3/s3file.py                                                  41      3    93%
source_s3/source_files_abstract/formats/abstract_file_parser.py      41      4    90%
source_s3/source.py                                                  27      4    85%
source_s3/source_files_abstract/file_info.py                         26      8    69%
source_s3/utils.py                                                   31     10    68%
source_s3/source_files_abstract/formats/csv_parser.py               127     46    64%
source_s3/exceptions.py                                              10      4    60%
source_s3/source_files_abstract/source.py                            41     18    56%
source_s3/source_files_abstract/spec.py                              55     31    44%
source_s3/source_files_abstract/formats/jsonl_parser.py              53     32    40%
source_s3/source_files_abstract/formats/avro_parser.py               39     25    36%
source_s3/source_files_abstract/formats/parquet_parser.py            64     44    31%
-------------------------------------------------------------------------------------
TOTAL                                                               882    244    72%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
SKIPPED [5] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:509: The previous and actual discovered catalogs are identical.
================== 95 passed, 6 skipped in 266.27s (0:04:26) ===================

@davydov-d
Collaborator Author

davydov-d commented Mar 14, 2023

/publish connector=connectors/source-s3

🕑 Publishing the following connectors:
connectors/source-s3
https://github.com/airbytehq/airbyte/actions/runs/4418337871


Connector Did it publish? Were definitions generated?
connectors/source-s3

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@davydov-d davydov-d merged commit 3eecf54 into master Mar 14, 2023
@davydov-d davydov-d deleted the ddavydov/#1470-source-s3-do-not-infer-schema-of-all-files branch March 14, 2023 18:09
adriennevermorel pushed a commit to adriennevermorel/airbyte that referenced this pull request Mar 17, 2023
* airbytehq#1470 Source S3: infer schema of the first file

* airbytehq#1470 source s3: upd changelog

* airbytehq#1470 source s3: review fixes

* airbytehq#1470 source s3: review fixes

* airbytehq#1470 source s3: bump version

* airbytehq#1470 source s3: review fixes

* auto-bump connector version

---------

Co-authored-by: Serhii Lazebnyi <53845333+lazebnyi@users.noreply.github.com>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
erohmensing pushed a commit that referenced this pull request Mar 22, 2023
* #1470 Source S3: infer schema of the first file

* #1470 source s3: upd changelog

* #1470 source s3: review fixes

* #1470 source s3: review fixes

* #1470 source s3: bump version

* #1470 source s3: review fixes

* auto-bump connector version

---------

Co-authored-by: Serhii Lazebnyi <53845333+lazebnyi@users.noreply.github.com>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>