[file-based cdk]: add config option to limit number of files for schema discover #39317
Conversation
Granted, I'm not the most knowledgeable expert on this, but looks good enough ;)
I think the implementation itself looks good; just a few things around organization and nits. But see the longer comment about why this change poses some potential risks that we need to investigate.
max_n_files_for_schema_inference = self._discovery_policy.get_max_n_files_for_schema_inference(self.get_parser())
if total_n_files > max_n_files_for_schema_inference:
    # Use the most recent files for schema inference, so we pick up schema changes during discovery.
    files = sorted(files, key=lambda x: x.last_modified, reverse=True)[:max_n_files_for_schema_inference]
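For context, the limiting logic under discussion can be sketched as a small standalone function. The `RemoteFile` shape and the helper name here are illustrative assumptions, not the CDK's exact types:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class RemoteFile:
    """Minimal stand-in for the CDK's remote-file record."""
    uri: str
    last_modified: datetime

def limit_files_for_schema_inference(
    files: List[RemoteFile], max_n_files: int
) -> List[RemoteFile]:
    """Keep only the most recently modified files, so schema changes
    in newer files are still picked up during discovery."""
    if len(files) > max_n_files:
        # Sort newest-first, then truncate to the configured limit.
        files = sorted(files, key=lambda f: f.last_modified, reverse=True)[:max_n_files]
    return files
```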
I'm a bit concerned about this aspect. We currently get all of the files matching the pattern and then sort them so that we use the latest ones for schema discovery. If we allow customers to configure the number of files, we might lose the ability to detect schema changes in later files.
My suspicion is that file-based sources like S3 most likely return files in a consistent order, so setting this config value has the potential to break schema inference if later files change their schema.
My ask here is:
- Can you look into the S3 code/API docs to see if there is a way to query for files and specify the order they are returned in? If there is a mechanism to do so, we should use it to ensure we use the last-modified files.
- If there is no way to query S3 in that manner, we should say very explicitly in this config's description that it can affect schema discovery.
- Add a comment in our code so that we know files_to_read_for_schema_discover can cause less accurate or outdated schema detection.
Let me know if this all makes sense, but I think we need to do some more investigation into the implications for accurate schema detection.
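On the first ask: S3's `ListObjectsV2` API returns keys in ascending UTF-8 key order and has no parameter for server-side sorting by `LastModified`, so any most-recent-first ordering has to be computed client-side after paging through the listing. A sketch of that client-side step, written over `ListObjectsV2`-shaped response pages so it stays self-contained (the bucket/prefix in the comment are placeholders):

```python
from typing import Dict, Iterable, List

def newest_keys_from_listing(pages: Iterable[Dict], n: int) -> List[str]:
    """Collect objects from ListObjectsV2-style response pages and return
    the keys of the n most recently modified objects.

    S3 lists keys in lexicographic key order, not by modification time,
    so the sort must happen client-side after the full listing.
    """
    objects = [obj for page in pages for obj in page.get("Contents", [])]
    objects.sort(key=lambda o: o["LastModified"], reverse=True)
    return [o["Key"] for o in objects[:n]]

# With boto3, the pages would come from a paginator, e.g.:
#   paginator = boto3.client("s3").get_paginator("list_objects_v2")
#   pages = paginator.paginate(Bucket="my-bucket", Prefix="data/")
```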
@brianjlai when the file-based CDK was created there was no way to request the files in a specific order, but it definitely makes sense to check whether this has changed. If it has, we'll want to update the way we do read too, so we don't have to page through everything.
Assuming the API hasn't changed, instead of just giving users a warning that schema discovery could be less accurate, I'm wondering if we should surface the option to still list & sort all files even if the user is setting a limit for schema discovery, in addition to your suggestion that we limit the pages of files read. It seems possible to me that for most customers the bulk of the time spent on discover goes to reading, e.g., hundreds of very large files rather than paging through a list of millions of files (obviously this will vary from customer to customer).
I'm loath to add too many options, but since the consequence of a bad schema is pretty bad (dropped records), it feels like we want to make sure we use the most recent records as much as possible.
I'm wondering if we should surface the option to still list & sort all files even if the user is setting a limit for schema discovery, in addition to your suggestion that we limit the pages of files read. It seems possible to me that for most customers the bulk of the time spent on discover is spent on reading
So if I'm understanding correctly, we still fetch every single file and still sort them, and then once sorted we only read the X most recent files according to the config?
I think that would make sense, and in that case we don't need a new config option, right? The presence of files_to_read_for_schema_discovery should be enough to indicate that we only do partial schema discovery of the most recent files. It feels like we shouldn't even give users an option; the default behavior, if they specify this new field, is to first list and sort all files, then do schema discovery.
I guess the only gap is if there are way too many files to fetch before timing out, since fetching the files is a prerequisite to the sort, although maybe we don't need to solve this problem perfectly.
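The flow being converged on here (list everything, sort, then read only the N most recent files for inference) can be sketched as follows. The `infer_schema_from_file` callback and the naive column merge are placeholders for the CDK's actual parser and schema-merge logic:

```python
from datetime import datetime
from typing import Callable, Dict, List, NamedTuple, Optional

class RemoteFile(NamedTuple):
    uri: str
    last_modified: datetime

def discover_schema(
    all_files: List[RemoteFile],
    infer_schema_from_file: Callable[[RemoteFile], Dict[str, str]],
    files_to_read_for_schema_discovery: Optional[int] = None,
) -> Dict[str, str]:
    """List and sort everything, but read only the N most recent files.

    The full listing is always sorted so the limit applies to the newest
    files; only the per-file reads are skipped for older files.
    """
    ordered = sorted(all_files, key=lambda f: f.last_modified, reverse=True)
    if files_to_read_for_schema_discovery is not None:
        ordered = ordered[:files_to_read_for_schema_discovery]
    schema: Dict[str, str] = {}
    for file in ordered:
        # Naive merge: the newest file wins; older files only fill in
        # columns not yet seen.
        for column, column_type in infer_schema_from_file(file).items():
            schema.setdefault(column, column_type)
    return schema
```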
So if I'm understanding correctly, we still fetch every single file, but still sort them. And then once sorted, we only read X most recent files according to the config?
Yep, that's what I had in mind.
Agreed that if we do this, we don't necessarily need a new config option beyond files_to_read_for_schema_discovery, unless we want to let users limit the number of pages read. But perhaps that can come in a separate phase if it's still needed after this change is deployed.
We have a typo in the config name resent_n_files_to_read_for_schema_discovery; otherwise looks good. Thanks for looking into this issue.
…ma discover (#39317) Co-authored-by: askarpets <anton.karpets@globallogic.com> Co-authored-by: Serhii Lazebnyi <serhii.lazebnyi@globallogic.com> Co-authored-by: Serhii Lazebnyi <53845333+lazebnyi@users.noreply.github.com>
What
Update for https://github.com/airbytehq/oncall/issues/4948
How
Add optional field files_to_read_for_schema_discover to the stream's config to specify how many files should be used for schema inference.
Review guide
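The added field might look roughly like this on the stream config model. This is a sketch only: the field name follows the review thread's corrected spelling, and the placement, title, and description text are assumptions rather than the exact merged code:

```python
from typing import Optional
from pydantic import BaseModel, Field

class FileBasedStreamConfig(BaseModel):
    """Sketch of the stream config; the real model has many more fields."""
    name: str
    recent_n_files_to_read_for_schema_discovery: Optional[int] = Field(
        default=None,
        gt=0,
        title="Files To Read For Schema Discovery",
        description=(
            "The number of most recent files to read for schema discovery. "
            "Limiting this can make schema detection less accurate or "
            "outdated if older files have a different schema."
        ),
    )
```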
User Impact
No
Can this PR be safely reverted and rolled back?