[file-based cdk]: add config option to limit number of files for schema discover #39317
Conversation
Granted, I'm not the most knowledgeable expert on this, but looks good enough ;)
I think the implementation itself looks good; just a few things around organization and nits. But see the longer comment about why this change poses some potential risks that we need to investigate.
max_n_files_for_schema_inference = self._discovery_policy.get_max_n_files_for_schema_inference(self.get_parser())
if total_n_files > max_n_files_for_schema_inference:
    # Use the most recent files for schema inference, so we pick up schema changes during discovery.
    files = sorted(files, key=lambda x: x.last_modified, reverse=True)[:max_n_files_for_schema_inference]
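For context, the limiting logic under discussion can be sketched as a small standalone function. The `RemoteFile` shape and the helper name here are illustrative assumptions, not the CDK's exact types:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class RemoteFile:
    """Minimal stand-in for the CDK's remote-file record."""
    uri: str
    last_modified: datetime

def limit_files_for_schema_inference(
    files: List[RemoteFile], max_n_files: int
) -> List[RemoteFile]:
    """Keep only the most recently modified files, so schema changes
    in newer files are still picked up during discovery."""
    if len(files) > max_n_files:
        # Sort newest-first, then truncate to the configured limit.
        files = sorted(files, key=lambda f: f.last_modified, reverse=True)[:max_n_files]
    return files
```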
I'm a bit concerned about this aspect. We currently get all of the files matching the pattern and then sort them so that we use the latest ones for schema discovery. If we allow customers to configure the number of files, we might lose the ability to detect schema changes in later files.
My suspicion is that file-based sources like S3 most likely return files in a consistent order, so setting this config value has the potential to break schema inference if later files change their schema.
My ask here is:
- Can you look into the S3 code/API docs to see if there is a way to query for files and specify the order they are returned in? If there is a mechanism to do so, we should use it to ensure we use the last-modified files.
- If there is no way to query S3 in that manner, we should say very explicitly in this config's description that it can affect schema discovery.
- Add a comment in our code so that we know files_to_read_for_schema_discover can cause less accurate or outdated schema detection.
Let me know if this all makes sense, but I think we need to do some more investigation into the implications for accurate schema detection.
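On the first ask: S3's `ListObjectsV2` API returns keys in ascending UTF-8 key order and has no parameter for server-side sorting by `LastModified`, so any most-recent-first ordering has to be computed client-side after paging through the listing. A sketch of that client-side step, written over `ListObjectsV2`-shaped response pages so it stays self-contained (the bucket/prefix in the comment are placeholders):

```python
from typing import Dict, Iterable, List

def newest_keys_from_listing(pages: Iterable[Dict], n: int) -> List[str]:
    """Collect objects from ListObjectsV2-style response pages and return
    the keys of the n most recently modified objects.

    S3 lists keys in lexicographic key order, not by modification time,
    so the sort must happen client-side after the full listing.
    """
    objects = [obj for page in pages for obj in page.get("Contents", [])]
    objects.sort(key=lambda o: o["LastModified"], reverse=True)
    return [o["Key"] for o in objects[:n]]

# With boto3, the pages would come from a paginator, e.g.:
#   paginator = boto3.client("s3").get_paginator("list_objects_v2")
#   pages = paginator.paginate(Bucket="my-bucket", Prefix="data/")
```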
@brianjlai when the file-based CDK was created there was no way to request the files in a specific order, but it definitely makes sense to check whether this has changed. If it has, we'll want to update the way we do read too, so we don't have to page through everything.
Assuming the API hasn't changed, instead of just giving users a warning that schema discovery could be less accurate, I'm wondering if we should surface the option to still list & sort all files even if the user is setting a limit for schema discovery, in addition to your suggestion that we limit the pages of files read. It seems possible to me that for most customers the bulk of the time spent on discover goes to reading, e.g., hundreds of very large files rather than paging through a list of millions of files (obviously this will vary from customer to customer).
I'm loath to add too many options, but since the consequence of a bad schema is pretty bad (dropped records), it feels like we want to make sure we use the most recent records as much as possible.
I'm wondering if we should surface the option to still list & sort all files even if the user is setting a limit for schema discovery, in addition to your suggestion that we limit the pages of files read. It seems possible to me that for most customers the bulk of the time spent on discover is spent on reading
So if I'm understanding correctly, we still fetch every single file and still sort them, and then once sorted we only read the X most recent files according to the config?
I think that would make sense, and in that case we don't need a new config option, right? The presence of files_to_read_for_schema_discovery should be enough to indicate that we only do partial schema discovery of the most recent files. It feels like we shouldn't even give users an option; the default behavior, if they specify this new field, is to first list and sort all files, then do schema discovery.
I guess the only gap is if there are way too many files to fetch before timing out, since fetching the files is a prerequisite to the sort, although maybe we don't need to solve this problem perfectly.
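The flow being converged on here (list everything, sort, then read only the N most recent files for inference) can be sketched as follows. The `infer_schema_from_file` callback and the naive column merge are placeholders for the CDK's actual parser and schema-merge logic:

```python
from datetime import datetime
from typing import Callable, Dict, List, NamedTuple, Optional

class RemoteFile(NamedTuple):
    uri: str
    last_modified: datetime

def discover_schema(
    all_files: List[RemoteFile],
    infer_schema_from_file: Callable[[RemoteFile], Dict[str, str]],
    files_to_read_for_schema_discovery: Optional[int] = None,
) -> Dict[str, str]:
    """List and sort everything, but read only the N most recent files.

    The full listing is always sorted so the limit applies to the newest
    files; only the per-file reads are skipped for older files.
    """
    ordered = sorted(all_files, key=lambda f: f.last_modified, reverse=True)
    if files_to_read_for_schema_discovery is not None:
        ordered = ordered[:files_to_read_for_schema_discovery]
    schema: Dict[str, str] = {}
    for file in ordered:
        # Naive merge: the newest file wins; older files only fill in
        # columns not yet seen.
        for column, column_type in infer_schema_from_file(file).items():
            schema.setdefault(column, column_type)
    return schema
```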
So if I'm understanding correctly, we still fetch every single file, but still sort them. And then once sorted, we only read X most recent files according to the config?
Yep, that's what I had in mind.
Agreed that if we do this, we don't necessarily need a new config option beyond files_to_read_for_schema_discovery, unless we want to let users limit the number of pages read. But perhaps that can come in a separate phase if it's still needed after this change is deployed.
We have a typo in the config name resent_n_files_to_read_for_schema_discovery; otherwise looks good. Thanks for looking into this issue.
…ma discover (#39317) Co-authored-by: askarpets <anton.karpets@globallogic.com> Co-authored-by: Serhii Lazebnyi <serhii.lazebnyi@globallogic.com> Co-authored-by: Serhii Lazebnyi <53845333+lazebnyi@users.noreply.github.com>
What
Update for https://github.com/airbytehq/oncall/issues/4948
How
Add optional field files_to_read_for_schema_discover to the stream's config to specify how many files should be used for schema inference.
Review guide
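The added field might look roughly like this on the stream config model. This is a sketch only: the field name follows the review thread's corrected spelling, and the placement, title, and description text are assumptions rather than the exact merged code:

```python
from typing import Optional
from pydantic import BaseModel, Field

class FileBasedStreamConfig(BaseModel):
    """Sketch of the stream config; the real model has many more fields."""
    name: str
    recent_n_files_to_read_for_schema_discovery: Optional[int] = Field(
        default=None,
        gt=0,
        title="Files To Read For Schema Discovery",
        description=(
            "The number of most recent files to read for schema discovery. "
            "Limiting this can make schema detection less accurate or "
            "outdated if older files have a different schema."
        ),
    )
```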
User Impact
No
Can this PR be safely reverted and rolled back?