
[file-based cdk]: add config option to limit number of files for schema discover #39317

Merged
merged 13 commits into master on Jul 11, 2024

Conversation

askarpets
Contributor

@askarpets askarpets commented Jun 6, 2024

What

Update for https://github.com/airbytehq/oncall/issues/4948

How

Add an optional field files_to_read_for_schema_discover to the stream's config to specify how many files should be used for schema inference.
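A minimal sketch of what such a stream-config field could look like. This is an illustrative stand-in, not the CDK's actual spec class: the real file-based CDK config model, validation, and defaults differ.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for the CDK's stream config; the real spec class,
# validation, and defaults in the file-based CDK may differ.
@dataclass
class FileBasedStreamConfig:
    name: str
    # Optional cap on how many files are sampled during schema inference;
    # None means "infer from all matching files" (the previous behavior).
    files_to_read_for_schema_discover: Optional[int] = None

    def __post_init__(self) -> None:
        n = self.files_to_read_for_schema_discover
        if n is not None and n < 1:
            raise ValueError(
                "files_to_read_for_schema_discover must be a positive integer"
            )
```

Leaving the field optional keeps existing connections unchanged: only users who set it opt into partial schema discovery.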

Review guide

User Impact

No

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@askarpets askarpets self-assigned this Jun 6, 2024
@askarpets askarpets requested a review from a team as a code owner June 6, 2024 14:00

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Jun 6, 2024
Contributor

@natikgadzhi natikgadzhi left a comment


Granted I'm not the most knowledgeable expert on this, but looks good enough ;)

Contributor

@brianjlai brianjlai left a comment


I think the implementation itself looks good; just a few things around organization and nits. But see the longer comment about why this change poses some potential risks that we need to investigate.

max_n_files_for_schema_inference = self._discovery_policy.get_max_n_files_for_schema_inference(self.get_parser())
if total_n_files > max_n_files_for_schema_inference:
# Use the most recent files for schema inference, so we pick up schema changes during discovery.
files = sorted(files, key=lambda x: x.last_modified, reverse=True)[:max_n_files_for_schema_inference]
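The snippet above can be exercised standalone. In this sketch, RemoteFile is a stand-in for the CDK's remote-file type, reduced to the one attribute the sort uses:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RemoteFile:
    # Stand-in for the CDK's RemoteFile; only the attribute used by the sort.
    uri: str
    last_modified: datetime

def limit_files_for_schema_inference(files, max_n_files):
    # Mirror of the snippet above: keep only the most recently modified
    # files so discovery still picks up recent schema changes.
    if len(files) > max_n_files:
        files = sorted(files, key=lambda f: f.last_modified, reverse=True)[:max_n_files]
    return files

now = datetime(2024, 6, 6)
files = [
    RemoteFile(uri=f"data/f{i}.csv", last_modified=now - timedelta(days=i))
    for i in range(5)
]
picked = limit_files_for_schema_inference(files, 2)
# picked holds the two newest files, data/f0.csv and data/f1.csv
```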
Contributor


I'm a bit concerned about this aspect. We currently get all the matching files according to the pattern, and then sort them so we use the latest ones for schema discovery. If we allow customers to configure the number of files then we might lose the ability to detect schema changes on later files.

My suspicion is that file based sources like S3 most likely return files in a consistent order so adding the config value when set has the potential to break schema inference if later files change their schema.

My ask here is:

  1. Can you look into the S3 code/API docs to see if there is a way to query for files and specify the ordering they are returned in? If there is a mechanism to do so, then we should use it to ensure we read the most recently modified files.
  2. If there is no way to query S3 in that manner, we should state very explicitly in the description of this config that it can affect schema discovery.
  3. Add a comment in our code so that we know files_to_read_for_schema_discover can cause less accurate or outdated schema detection.

Let me know if this all makes sense, but I think we need to do some more investigation into the implications for accurate schema detection.
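For context on point 1: to my knowledge, S3's ListObjectsV2 returns keys in UTF-8 binary order of key name and offers no server-side sort by LastModified, so the usual workaround is to collect every page of the listing and sort client-side. A sketch over paginator-shaped data (the page dicts mimic boto3 paginator output; no real S3 call is made):

```python
from datetime import datetime

def most_recent_keys(pages, n):
    # Flatten all listing pages, then sort locally by LastModified,
    # since the listing API itself offers no ordering control.
    objects = [obj for page in pages for obj in page.get("Contents", [])]
    objects.sort(key=lambda o: o["LastModified"], reverse=True)
    return [o["Key"] for o in objects[:n]]

pages = [
    {"Contents": [
        {"Key": "a.csv", "LastModified": datetime(2024, 1, 1)},
        {"Key": "b.csv", "LastModified": datetime(2024, 3, 1)},
    ]},
    {"Contents": [
        {"Key": "c.csv", "LastModified": datetime(2024, 2, 1)},
    ]},
]
# most_recent_keys(pages, 2) == ["b.csv", "c.csv"]
```

The downside, as noted below, is that the full listing must be paged through before the sort can happen.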

Contributor

@clnoll clnoll Jun 17, 2024


@brianjlai when the file-based CDK was created there was no way to request the files in a specific order. But it definitely makes sense to check to see if this has changed. If it has we'll want to update the way we do read too so we don't have to page through everything.

Assuming the API hasn't changed, instead of just giving users a warning that schema discovery could be less accurate, I'm wondering if we should surface the option to still list & sort all files even if the user is setting a limit for schema discovery, in addition to your suggestion that we limit the pages of files read. It seems possible to me that for most customers the bulk of the time spent on discover is spent on reading e.g. hundreds of very large files rather than paging through the list of millions of files (obviously this will vary from customer to customer though).

I'm loathe to add too many options, but since the consequence of a bad schema is pretty bad (dropped records) it feels like we want to make sure we use the most recent records as much as possible.

Contributor


I'm wondering if we should surface the option to still list & sort all files even if the user is setting a limit for schema discovery, in addition to your suggestion that we limit the pages of files read. It seems possible to me that for most customers the bulk of the time spent on discover is spent on reading

So if I'm understanding correctly, we still fetch every single file, but still sort them. And then once sorted, we only read X most recent files according to the config?

I think that would make sense, and in this case we don't need a new config option right? The presence of files_to_read_for_schema_discovery should be enough to indicate that we only do partial schema discovery of the most recent. It feels like we shouldn't even give users an option and the default behavior if they specify this new field is to first list & sort all files. Then do schema discovery.

I guess the only gap is if there are way too many files to fetch before timing out, since fetching the file list is a prerequisite to the sort, although maybe we don't need to solve this problem perfectly.
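The behavior discussed above can be sketched end to end: list and sort every matching file, but parse only the N most recent for inference. RemoteFile and the column-union "schema merge" here are simplified stand-ins for the CDK's real types and inference logic:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RemoteFile:
    uri: str
    last_modified: datetime
    columns: frozenset  # toy stand-in for the parsed file's fields

def discover_schema(all_files, n_most_recent=None):
    # Always list and sort the full set of matching files, then read only
    # the most recent ones; None keeps the old read-everything behavior.
    ordered = sorted(all_files, key=lambda f: f.last_modified, reverse=True)
    to_read = ordered if n_most_recent is None else ordered[:n_most_recent]
    schema = set()
    for f in to_read:  # toy merge: union of column names
        schema |= f.columns
    return schema

files = [
    RemoteFile("old.csv", datetime(2024, 1, 1), frozenset({"id"})),
    RemoteFile("mid.csv", datetime(2024, 2, 1), frozenset({"id", "name"})),
    RemoteFile("new.csv", datetime(2024, 3, 1), frozenset({"id", "name", "email"})),
]
```

Because the newest file is always read first, a column added recently (email here) survives even with a limit of 1.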

Contributor


So if I'm understanding correctly, we still fetch every single file, but still sort them. And then once sorted, we only read X most recent files according to the config?

Yep that's what I had in mind.

Agree that if we do this, we don't necessarily need a new config option outside of files_to_read_for_schema_discovery, unless we want to let users limit the number of pages read. But perhaps that can come in a separate phase if it's still needed after this change is deployed.

@lazebnyi lazebnyi assigned lazebnyi and unassigned askarpets Jun 12, 2024
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ lazebnyi
❌ askarpets



@lazebnyi lazebnyi changed the title File based CDK: add config option to limit number of files for schema discover [file-based cdk]: add config option to limit number of files for schema discover Jun 25, 2024
@lazebnyi
Copy link
Collaborator

lazebnyi commented Jun 25, 2024

  1. Update to list all files and parse only the most recent ones to speed up the discovery process.
  2. Update the rate limiting section for all file-based sources.

Contributor

@brianjlai brianjlai left a comment


We have a typo in the config name resent_n_files_to_read_for_schema_discovery; otherwise looks good. Thanks for looking into this issue.

@lazebnyi lazebnyi merged commit 6c439a8 into master Jul 11, 2024
35 of 36 checks passed
@lazebnyi lazebnyi deleted the file-based-cdk-limit-number-of-files-for-discover branch July 11, 2024 13:16
xiaohansong pushed a commit that referenced this pull request Jul 12, 2024
…ma discover (#39317)

Co-authored-by: askarpets <anton.karpets@globallogic.com>
Co-authored-by: Serhii Lazebnyi <serhii.lazebnyi@globallogic.com>
Co-authored-by: Serhii Lazebnyi <53845333+lazebnyi@users.noreply.github.com>
Labels
CDK Connector Development Kit team/connectors-python

7 participants