-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Loader parallelism strategies #1457
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
@rudolfix which one is the destination that requires sequentials jobs? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few small issues
tests are amazing! good work!
@@ -30,6 +30,8 @@ | |||
# insert_values - insert SQL statements | |||
# sql - any sql statement | |||
TLoaderFileFormat = Literal["jsonl", "typed-jsonl", "insert_values", "parquet", "csv"] | |||
TLoaderParallelismStrategy = Literal["parallel", "table_sequential", "sequential"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I think we are using dashes not underscores in other places? surely the above both are present :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
dlt/load/utils.py
Outdated
return file_names | ||
|
||
# destination can overwrite ps | ||
parallelism_strategy = capabilities.loader_parallelism_strategy or config.parallelism_strategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config should always has last say. by default config should be None - no strategy should be forced
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
dlt/load/utils.py
Outdated
|
||
# we must ensure there only is one job per table | ||
if parallelism_strategy == "table_sequential": | ||
filtered_jobs: List[str] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good. but you could also use group_by
by table name, maybe the code will be simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i made a one-liner out of it, not sure if it is easier to read now, but it is more pythonic i guess :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out you need to sort by the key you group by for the groupby to work as expected..
@rudolfix I have added an explanation of the new config values to the custom destination page. I briefly thought about adding the parallelization strategy setting to the performance docs page, but I am actually not sure it is useful there, for now I believe we only need this in the custom destination. |
* remove underscore from settingsvalue * simplify job selection function * let config always override destination settings
0ce64ba
to
6942393
Compare
Also I don't know what is going on with this failing duckdb test, this does not happen locally for me and I also don't see any way my changes could influence this.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! no idea where duckdb problem is coming from. I bet some previous test is changing the default database location that's why file is not there
Description