Loader parallelism strategies #1457

Merged
rudolfix merged 8 commits into devel from feat/1456-loader-parallelism
Jun 18, 2024

Conversation

sh-rp
Collaborator

@sh-rp sh-rp commented Jun 12, 2024

Description

  • Add parallelism strategies to loader
  • Add max_load_jobs and parallelism strategies override to destination capabilities
  • Add settings for both on the custom destination
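
One way to read how a job cap and a strategy could combine is sketched below. This is a minimal illustration under stated assumptions, not the code in this PR: `effective_max_jobs`, its parameters, and the rule that "sequential" means one job at a time are all hypothetical, and the strategy names are copied from the diff under review.

```python
from typing import Literal, Optional

# Strategy names as they appear in the diff under review; the final spelling may differ.
Strategy = Literal["parallel", "table_sequential", "sequential"]


def effective_max_jobs(
    workers: int,
    strategy: Optional[Strategy],
    max_load_jobs: Optional[int],
) -> int:
    """Return how many load jobs may run at once (hypothetical helper)."""
    # Assumption: "sequential" allows a single job, regardless of the worker count.
    limit = 1 if strategy == "sequential" else workers
    # Assumption: a destination-level cap, when set, can only lower the limit.
    if max_load_jobs is not None:
        limit = min(limit, max_load_jobs)
    return limit
```

With 20 workers, this sketch yields 20 for "parallel", 1 for "sequential", and 5 when the destination caps jobs at 5.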

@sh-rp sh-rp changed the title from "Load parallelism strategies" to "Loader parallelism strategies" Jun 12, 2024

netlify bot commented Jun 12, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 6942393
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/667136ee9126ce0008021e4e

@sh-rp sh-rp linked an issue Jun 12, 2024 that may be closed by this pull request
@sh-rp sh-rp marked this pull request as ready for review June 12, 2024 14:29
@sh-rp
Collaborator Author

sh-rp commented Jun 12, 2024

@rudolfix which destination is the one that requires sequential jobs?

Collaborator

@rudolfix rudolfix left a comment

a few small issues

tests are amazing! good work!

@@ -30,6 +30,8 @@
# insert_values - insert SQL statements
# sql - any sql statement
TLoaderFileFormat = Literal["jsonl", "typed-jsonl", "insert_values", "parquet", "csv"]
TLoaderParallelismStrategy = Literal["parallel", "table_sequential", "sequential"]
Collaborator

hmm, I think we use dashes, not underscores, in other places? Though both styles are surely present in the line above :)

Collaborator Author

fixed
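
For illustration, a dash-separated version of the literal could look like the sketch below. It assumes the fix renamed "table_sequential" to "table-sequential"; `parse_strategy` is a hypothetical helper added only to show how `typing.get_args` can validate a value against the literal.

```python
from typing import Literal, get_args

# Assumed spelling after the fix: dashes instead of underscores.
TLoaderParallelismStrategy = Literal["parallel", "table-sequential", "sequential"]


def parse_strategy(value: str) -> str:
    """Reject any value that is not one of the literal's members (hypothetical helper)."""
    allowed = get_args(TLoaderParallelismStrategy)
    if value not in allowed:
        raise ValueError(f"unknown strategy {value!r}, expected one of {allowed}")
    return value
```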

dlt/destinations/impl/destination/__init__.py (resolved)
dlt/load/configuration.py (resolved)
dlt/load/utils.py (resolved)
return file_names

# destination can overwrite ps
parallelism_strategy = capabilities.loader_parallelism_strategy or config.parallelism_strategy
Collaborator

config should always have the last say. By default the config value should be None, so that no strategy is forced.

Collaborator Author

fixed
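
The precedence requested in this thread can be sketched as follows. The two dataclasses are hypothetical stand-ins for the loader config and the destination capabilities, and the final fallback to "parallel" is an assumption rather than something stated in the review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Capabilities:
    """Hypothetical stand-in for the destination capabilities."""
    loader_parallelism_strategy: Optional[str] = None


@dataclass
class LoaderConfig:
    """Hypothetical stand-in for the loader config; None means nothing is forced."""
    parallelism_strategy: Optional[str] = None


def resolve_strategy(config: LoaderConfig, capabilities: Capabilities) -> str:
    # The config has the last say when it is set; otherwise the destination's
    # preference applies; "parallel" as the final fallback is an assumption.
    return (
        config.parallelism_strategy
        or capabilities.loader_parallelism_strategy
        or "parallel"
    )
```

Compared with the line under review, the only change is the order of the two operands around `or`.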


# we must ensure there only is one job per table
if parallelism_strategy == "table_sequential":
filtered_jobs: List[str] = []
Collaborator

this looks good, but you could also use groupby on the table name. Maybe the code will be simpler?

Collaborator Author

I made a one-liner out of it. Not sure it is easier to read now, but it is more Pythonic, I guess :)

Collaborator Author

Turns out you need to sort by the key you group by for the groupby to work as expected..
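
The sorting requirement comes from `itertools.groupby`, which only merges adjacent items with equal keys. A minimal sketch of picking one job per table follows; `table_name_of` and the assumption that a job file name starts with the table name followed by a dot are hypothetical and not taken from this PR.

```python
from itertools import groupby
from typing import List


def table_name_of(file_name: str) -> str:
    # Assumption: the file name starts with the table name, followed by a dot.
    return file_name.split(".", 1)[0]


def one_job_per_table(file_names: List[str]) -> List[str]:
    # groupby only groups consecutive items, so sort by the same key first.
    ordered = sorted(file_names, key=table_name_of)
    # Take the first job of each table group.
    return [next(jobs) for _, jobs in groupby(ordered, key=table_name_of)]
```

For `["users.1.jsonl", "orders.1.jsonl", "users.2.jsonl"]`, grouping without the sort would produce three groups (users, orders, users), so two jobs for the `users` table would slip through; with the sort, each table appears once.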

@sh-rp
Collaborator Author

sh-rp commented Jun 17, 2024

@rudolfix I have added an explanation of the new config values to the custom destination page. I briefly thought about adding the parallelization strategy setting to the performance docs page, but I am not sure it is useful there. For now, I believe we only need this in the custom destination docs.

@sh-rp sh-rp force-pushed the feat/1456-loader-parallelism branch from 0ce64ba to 6942393 June 18, 2024 07:27
@sh-rp
Collaborator Author

sh-rp commented Jun 18, 2024

Also, I don't know what is going on with the failing duckdb test. It does not happen locally for me, and I don't see any way my changes could influence it.

Collaborator

@rudolfix rudolfix left a comment

LGTM! No idea where the duckdb problem is coming from. I bet some previous test changes the default database location, which is why the file is not there.

@rudolfix rudolfix merged commit 1959942 into devel Jun 18, 2024
49 of 50 checks passed
@rudolfix rudolfix deleted the feat/1456-loader-parallelism branch June 18, 2024 11:54
Development

Successfully merging this pull request may close these issues.

Make loader parallelism configurable
2 participants