[CT-2042] Enable seeds to be handled from stored manifest data #6875

gshank · 2023-02-06T14:17:49Z

Right now we don't store the content of a seed file. We store only a checksum of the content, or only the path if the seed exceeds 1 megabyte. The seed is retrieved later using the absolute path of the file. This does not work in a world where files can come from a file diff and manifests are portable.

We need to decide how to do this for file diffs and portable manifests. How important is the existence of large seeds to our users? It is documented as an anti-pattern, but I'm sure some people are doing it. Do we want to load the content below a certain size? Do some kind of special handling?

jtcohen6 · 2023-02-07T10:21:37Z

My instinct here is that we should start storing raw seed contents up to 1 MB. For any seeds that are larger, we should continue storing only the file path pointing to that seed.

That would be consistent with the behavior we've established for detecting seed modifications in state:modified: comparing a checksum of contents for seeds that are <1 MB; only comparing the file path for seeds that are >1 MB.
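A minimal sketch of the proposal above, to make the two branches concrete. It is not dbt-core's implementation: `describe_seed`, `MAX_SEED_BYTES`, the dict layout, and the choice of SHA-256 are all assumptions made for illustration. The thread does not say what happens at exactly 1 MB, so this sketch arbitrarily treats that case as "small".

```python
import hashlib
import os

# Illustrative cutoff; the thread only says "1 MB".
MAX_SEED_BYTES = 1 * 1024 * 1024


def describe_seed(path, max_bytes=MAX_SEED_BYTES):
    """Return a manifest-style entry for one seed file (hypothetical shape)."""
    size = os.path.getsize(path)
    if size > max_bytes:
        # Large seed: record only the file path, no content and no content hash.
        return {"checksum": {"name": "path", "value": path}, "raw_content": None}
    with open(path, "rb") as f:
        data = f.read()
    # Small seed: record a checksum of the contents plus the raw contents.
    return {
        "checksum": {"name": "sha256", "value": hashlib.sha256(data).hexdigest()},
        "raw_content": data.decode("utf-8"),
    }
```

With this shape, a consumer holding only the manifest could rebuild a small seed from `raw_content`, while a large seed would still require access to the original file.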

The risk is that a project has 100 seeds, each 0.9 MB in size, and we've suddenly bloated the manifest by 90 MB. We could also look into compression... or that could be a problem for another application to solve.
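The worst case described above can be checked with a few lines of arithmetic, followed by a rough look at what compression might buy. This is only a sketch: `inlined_bytes` and `MAX_SEED_BYTES` are hypothetical names, the sizes are the ones from the comment, and the sample CSV is artificially repetitive, so the compression ratio it shows says little about real seeds.

```python
import gzip

MAX_SEED_BYTES = 1_000_000  # illustrative "1 MB" cutoff (assumption)


def inlined_bytes(seed_sizes, max_bytes=MAX_SEED_BYTES):
    """Total bytes added by inlining every seed at or under the cutoff."""
    return sum(size for size in seed_sizes if size <= max_bytes)


# The scenario from the comment: 100 seeds of 0.9 MB each.
total = inlined_bytes([900_000] * 100)
print(total / 1_000_000)  # 90.0 (MB)

# Compress a small, highly repetitive CSV sample to compare sizes.
sample = ("id,name\n" + "1,alice\n" * 1000).encode("utf-8")
compressed = gzip.compress(sample)
print(len(sample), len(compressed))
```

The estimate ignores any overhead from encoding the contents inside the manifest, such as escaping newlines and quotes, so the real increase would be somewhat larger.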

jtcohen6 · 2023-02-10T17:28:15Z

Let's revisit and prioritize this as part of broader scoping for file_system work.

ChenyuLInx · 2023-02-13T21:49:53Z

In order to do this we might want to break out the current manifest into several individual pieces first.

github-actions bot changed the title from "Enable seeds to be handled from stored manifest data" to "[CT-2042] Enable seeds to be handled from stored manifest data" Feb 6, 2023

jtcohen6 added the python_api (Issues related to dbtRunner Python entry point) and Team:Language labels Feb 6, 2023

jtcohen6 mentioned this issue Feb 6, 2023

[CT-1945] [Spike] Edge cases for partial manifest updates #6777

Closed

jtcohen6 added the seeds (Issues related to dbt's seed functionality) and file_system (How dbt-core interoperates with file systems to read/write data) labels Feb 7, 2023

This was referenced Mar 3, 2023

[CT-2266] [Feature] Make MAXIMUM_SEED_SIZE configurable #7117

Open

[CT-2271] [Feature] Compute seed file hashes incrementally #7124

Open

frannydelaney mentioned this issue May 3, 2023

[Bug]: Windows file escape for checksum brooklyn-data/dbt_artifacts#303

Closed

jtcohen6 removed the Team:Language label Jul 19, 2023

MichelleArk mentioned this issue Oct 3, 2023

[Fix] respect project root when loading seeds #8762

Merged

5 tasks

martynydbt assigned martynydbt and MichelleArk and unassigned martynydbt Oct 9, 2023

martynydbt added the Impact: Orch label Oct 9, 2023

MichelleArk closed this as completed in #8762 Oct 10, 2023

MichelleArk added the backport 1.6.latest label Oct 10, 2023

martynydbt added the backport 1.0.latest (Tag for PR to be backported to the 1.0.latest branch), backport 1.1.latest, backport 1.2.latest (This PR will be backported to the 1.2.latest branch), backport 1.3.latest, backport 1.4.latest, and backport 1.5.latest labels Oct 11, 2023
