fix(ingest/mongodb): support disabling schemaSamplingSize #9295

Merged
merged 1 commit into from Dec 28, 2023

Conversation

diegoreico
Contributor

@diegoreico diegoreico commented Nov 23, 2023

fix: allow mongodb source connector to have a 0 input for field schemaSamplingSize

Fixes #9287

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@@ -100,7 +100,7 @@ class MongoDBConfig(
enableSchemaInference: bool = Field(
default=True, description="Whether to infer schemas. "
)
- schemaSamplingSize: Optional[PositiveInt] = Field(
+ schemaSamplingSize: Optional[NonNegativeInt] = Field(
default=1000,
description="Number of documents to use when inferring schema size. If set to `0`, all documents will be scanned.",
Collaborator

It seems like the current behavior will already do the right thing if schemaSamplingSize is set to null. I might be missing something, but it seems like it'd make more sense to simply fix the docs here and leave the actual code unchanged

Suggested change
- description="Number of documents to use when inferring schema size. If set to `0`, all documents will be scanned.",
+ description="Number of documents to use when inferring schema size. If set to `null`, all documents will be scanned.",
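
For illustration, here is a rough sketch of what "null means scan everything" could look like on the query side. It assumes the sampling is done with MongoDB's `$sample` aggregation stage; `build_sampling_pipeline` is a hypothetical helper written for this comment, not a function from the connector.

```python
from typing import Any, Dict, List, Optional


def build_sampling_pipeline(sampling_size: Optional[int]) -> List[Dict[str, Any]]:
    """Hypothetical helper: build an aggregation pipeline for schema inference.

    A sampling size of None means "no sampling", so no $sample stage is
    added and the pipeline would read every document in the collection.
    """
    pipeline: List[Dict[str, Any]] = []
    if sampling_size is not None:
        # $sample picks the requested number of documents at random.
        pipeline.append({"$sample": {"size": sampling_size}})
    return pipeline


full_scan = build_sampling_pipeline(None)
sampled = build_sampling_pipeline(1000)
```

With a check like this, no special value such as `0` is needed: an empty pipeline already expresses the full scan.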

Contributor Author

I've added a small update to the linked issue mentioned in my first message.

As far as I can tell, it's impossible to force a full MongoDB collection scan, because the YAML parser + validator requires a value greater than 0, while the internal code checks for null.

I'm going to try to explain my thoughts in detail:

  1. schemaSamplingSize: Optional[PositiveInt] does allow a null value, but not a 0 value, so I changed it to NonNegativeInt
  2. schemaSamplingSize is also assigned to a Pydantic Field with the following default -> default=1000. So, if no value is provided, it will take that default value
  3. You can't explicitly provide a null or none value through the YAML parser (as I now also note in the issue)

So, if I need to provide a value to the parser and it can't be null, providing a 0 value could be a good option.
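
To make the difference between the two constraints concrete, here is a minimal plain-Python sketch. It does not use Pydantic; `check_positive` and `check_non_negative` are hypothetical stand-ins that only mimic the range checks behind `Optional[PositiveInt]` and `Optional[NonNegativeInt]` as described in the points above.

```python
from typing import Callable, Optional


def check_positive(value: Optional[int]) -> Optional[int]:
    """Stand-in for Optional[PositiveInt]: None passes, integers must be > 0."""
    if value is None:
        return None
    if value <= 0:
        raise ValueError("value must be greater than 0")
    return value


def check_non_negative(value: Optional[int]) -> Optional[int]:
    """Stand-in for Optional[NonNegativeInt]: None passes, integers must be >= 0."""
    if value is None:
        return None
    if value < 0:
        raise ValueError("value must be greater than or equal to 0")
    return value


def is_accepted(check: Callable[[Optional[int]], Optional[int]], value: Optional[int]) -> bool:
    """Return True if the given check accepts the value, False if it rejects it."""
    try:
        check(value)
        return True
    except ValueError:
        return False


# 0 is rejected by the positive check but accepted by the non-negative one;
# None is accepted by both because the field is Optional.
zero_results = (is_accepted(check_positive, 0), is_accepted(check_non_negative, 0))
none_results = (is_accepted(check_positive, None), is_accepted(check_non_negative, None))
```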

@maggiehays maggiehays added the community-contribution PR or Issue raised by member(s) of DataHub Community label Nov 29, 2023
@hsheth2 hsheth2 changed the title fix: allow mongodb source connector to have a 0 input for field schem… fix(ingest/mongodb): support disabling schemaSamplingSize Dec 28, 2023
@hsheth2
Collaborator

hsheth2 commented Dec 28, 2023

@diegoreico I've made a small tweak to your code - let me know what you think

Setting schemaSamplingSize: 0 will be invalid, but you should now be able to explicitly set schemaSamplingSize: null and have that work properly.

We actually had both in our documentation, so my hope is this makes it consistent and avoids having a "magic value" of 0.
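
As a rough sketch of why an explicit null can behave differently from an omitted field: once the recipe is parsed into a dictionary, a missing key falls back to the default of 1000, while a key that is present with a null value stays `None`. `resolve_sampling_size` and the plain-dict recipe are assumptions made for illustration and are not the connector's actual config loading code.

```python
from typing import Any, Dict, Optional

DEFAULT_SAMPLING_SIZE = 1000  # default value quoted in the diff


def resolve_sampling_size(source_config: Dict[str, Any]) -> Optional[int]:
    """Hypothetical helper: read schemaSamplingSize from a parsed recipe dict.

    dict.get only uses the fallback when the key is absent, so an explicit
    null (None after parsing) is preserved and can signal a full scan.
    """
    return source_config.get("schemaSamplingSize", DEFAULT_SAMPLING_SIZE)


omitted = resolve_sampling_size({})
explicit_null = resolve_sampling_size({"schemaSamplingSize": None})
explicit_value = resolve_sampling_size({"schemaSamplingSize": 50})
```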

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Dec 28, 2023
@anshbansal anshbansal merged commit 60347d6 into datahub-project:master Dec 28, 2023
53 checks passed
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Development

Successfully merging this pull request may close these issues.

acryl-datahub[mongodb] > field schemaSamplingSize does not work as expected
4 participants