[CT-2042] Enable seeds to be handled from stored manifest data #6875

Closed
gshank opened this issue Feb 6, 2023 · 3 comments · Fixed by #8762
Labels: backport 1.0.latest, backport 1.1.latest, backport 1.2.latest, backport 1.3.latest, backport 1.4.latest, backport 1.5.latest, backport 1.6.latest, file_system, Impact: Orch, python_api, seeds

Comments


gshank commented Feb 6, 2023

Right now we don't store the content of a seed file. We store only a checksum of the content or, if the seed is larger than 1 megabyte, just its path. The seed is retrieved later using the absolute path of the file. That does not work in a world where files can come from a file diff and manifests are portable.
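To make the current behavior concrete, here is a minimal sketch of the checksum-or-path rule described above. It is not dbt-core's actual code: the function name, the returned dictionary shape, the use of SHA-256, and the exact byte threshold are all assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Assumed threshold; the issue only says "1 megabyte".
MAX_SEED_BYTES = 1 * 1024 * 1024


def seed_fingerprint(path: Path) -> dict:
    """Hypothetical helper: fingerprint a seed without storing its content.

    Small seeds are identified by a hash of their bytes; large seeds are
    identified only by their path, so content changes go undetected.
    """
    if path.stat().st_size > MAX_SEED_BYTES:
        return {"name": "path", "checksum": str(path)}
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"name": "sha256", "checksum": digest}
```

In both branches the manifest holds no row data, which is why the file must still be readable at its original absolute path when the seed is later loaded.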

We need to decide how to do this for file diffs and portable manifests. How important is the existence of large seeds to our users? It is documented as an anti-pattern, but I'm sure some people are doing it. Do we want to load the content below a certain size? Do some kind of special handling?


jtcohen6 commented Feb 7, 2023

My instinct here is that we should start storing raw seed contents up to 1 MB. For any seeds that are larger, we should continue storing only the file path pointing to that seed.

That would be consistent with the behavior we've established for detecting seed modifications in state:modified: compare a checksum of the contents for seeds under 1 MB, and compare only the file path for seeds over 1 MB.
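A rough sketch of what that proposal could look like, assuming a hypothetical `SeedEntry` record rather than dbt-core's real manifest node classes. The field names, the SHA-256 choice, the 1,000,000-byte threshold, and the rule for seeds that cross the threshold between runs are all assumptions:

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

MAX_INLINE_BYTES = 1_000_000  # assumed reading of "1 MB"


@dataclass
class SeedEntry:
    """Hypothetical manifest record for one seed."""
    path: str
    checksum: Optional[str]     # set only for seeds at or under the threshold
    raw_content: Optional[str]  # set only for seeds at or under the threshold


def build_seed_entry(path: str, data: bytes) -> SeedEntry:
    """Store raw contents for small seeds, only the path for large ones."""
    if len(data) > MAX_INLINE_BYTES:
        return SeedEntry(path=path, checksum=None, raw_content=None)
    return SeedEntry(
        path=path,
        checksum=hashlib.sha256(data).hexdigest(),
        raw_content=data.decode("utf-8"),
    )


def is_modified(old: SeedEntry, new: SeedEntry) -> bool:
    """Compare checksums for small seeds, paths for large seeds."""
    if old.checksum is not None and new.checksum is not None:
        return old.checksum != new.checksum
    if old.checksum is None and new.checksum is None:
        return old.path != new.path
    # Assumption: a seed that crossed the size threshold counts as modified.
    return True
```

With this shape, a small seed can be loaded from `raw_content` alone, while a large seed still depends on the file being present at `path`.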

The risk is that a project has 100 seeds, each 0.9 MB in size, and we've suddenly bloated the manifest by 90 MB. We could also look into compression... or that could be a problem for another application to solve.
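The worst case and the compression idea can be sketched as follows. This is only an illustration, assuming gzip plus base64 as one possible way to embed compressed bytes in a JSON manifest; the helper names are hypothetical and nothing here reflects an actual dbt-core design decision.

```python
import base64
import gzip


def worst_case_growth_mb(seed_count: int, seed_size_mb: float) -> float:
    """Upper bound on manifest growth if every seed is stored uncompressed."""
    return seed_count * seed_size_mb


def compress_seed(data: bytes) -> str:
    """Gzip the seed, then base64-encode it so it fits in a JSON string."""
    return base64.b64encode(gzip.compress(data)).decode("ascii")


def restore_seed(payload: str) -> bytes:
    """Inverse of compress_seed."""
    return gzip.decompress(base64.b64decode(payload))


# The scenario from the comment above: 100 seeds of 0.9 MB each.
growth = worst_case_growth_mb(100, 0.9)
```

How much compression helps depends on the data. Repetitive CSV content tends to shrink well, but base64 adds roughly a third back on top of the compressed size, so the savings would need to be measured on real seeds.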

jtcohen6 commented

Let's revisit and prioritize this as part of the broader scoping for file_system work.

ChenyuLInx commented

To do this, we might first want to break the current manifest into several individual pieces.
