
allows to run parallel pipelines in separate threads #813

Merged
merged 15 commits into from Dec 14, 2023

Conversation

@rudolfix (Collaborator) commented Dec 9, 2023

Description

  1. Allows pipelines to run in parallel, provided that they run in separate threads.
  2. Makes injection containers thread-affine.
  3. See `test_parallel_threads_pipeline` for an example.
  4. Also updates the Performance docs to show how to use asyncio to run pipelines in parallel.

Contains content of #807

  1. Adds metrics for closed files in data writers and collects them when creating StepInfo [WIP]
  2. Fixes edge cases with multiple load storages in normalize
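
The asyncio approach mentioned in point 4 of the description can be sketched as follows. This is a minimal illustration, not dlt's actual docs example: `run_pipeline` is a placeholder for a real `pipeline.run(...)` call; the key point is that `asyncio.to_thread` runs each blocking call in its own worker thread, which is what the thread-affine containers rely on.

```python
import asyncio
import threading

def run_pipeline(name: str) -> str:
    # placeholder for e.g. dlt.pipeline(pipeline_name=name, ...).run(...);
    # each call executes in its own worker thread, so thread-affine
    # injection containers stay isolated per pipeline
    return f"{name} ran in {threading.current_thread().name}"

async def main() -> list:
    # asyncio.to_thread moves each blocking run into a separate thread,
    # and gather awaits them concurrently
    return await asyncio.gather(
        asyncio.to_thread(run_pipeline, "pipeline_1"),
        asyncio.to_thread(run_pipeline, "pipeline_2"),
    )

results = asyncio.run(main())
```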

netlify bot commented Dec 9, 2023

Deploy Preview for dlt-hub-docs ready!

🔨 Latest commit: 910bfa0
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65759b10d7b7fa00087673c4
😎 Deploy Preview: https://deploy-preview-813--dlt-hub-docs.netlify.app

@rudolfix rudolfix requested a review from sh-rp December 10, 2023 17:18
@rudolfix rudolfix self-assigned this Dec 10, 2023
@rudolfix rudolfix added the devel label Dec 10, 2023
@rudolfix rudolfix changed the title [WIP] allows to run parallel pipelines in separate threads allows to run parallel pipelines in separate threads Dec 10, 2023
@rudolfix rudolfix marked this pull request as ready for review December 10, 2023 17:21
```python
context = self._thread_context(spec)
return spec in context

def _thread_context(
```
Collaborator:

Can't you use Python's thread-local data to do all this? https://docs.python.org/3/library/threading.html#thread-local-data
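
For context, `threading.local()` gives each thread an independent attribute namespace; a minimal demonstration:

```python
import threading

ctx = threading.local()
ctx.value = "main"  # set in the main thread

seen = {}

def worker():
    # this thread sees a fresh namespace: "value" set in the main
    # thread is not visible here
    seen["has_value"] = hasattr(ctx, "value")
    ctx.value = "worker"
    seen["worker_value"] = ctx.value

t = threading.Thread(target=worker)
t.start()
t.join()

# the main thread's value is untouched by the worker
assert ctx.value == "main"
```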

Collaborator Author:

Yeah, I know it, but when you look at the code there are exceptions to that behavior:

  1. some types of context are available globally (I use the main thread id)
  2. there's special treatment of the executor thread pool: I use the context of the thread that started the pool, not the current thread

So yeah, I could use local(), but because of those exceptions I'd need to keep more dictionaries. Or can you force the thread id for local()?
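
A minimal sketch of the behavior described above, with illustrative names only (not the actual container code): per-thread context dicts keyed by thread id, plus a fallback to the main thread's context for globally available specs. A real implementation would also guard the dict with a lock.

```python
import threading

class ThreadAffineContainer:
    """Per-thread context dicts keyed by thread id, with a fallback
    to the main thread's context for globally available specs."""

    def __init__(self) -> None:
        self._contexts: dict[int, dict] = {}
        self._main_id = threading.get_ident()  # assumes creation in the main thread

    def _thread_context(self) -> dict:
        tid = threading.get_ident()
        if tid not in self._contexts:
            self._contexts[tid] = {}
        return self._contexts[tid]

    def get(self, spec, globally_available: bool = False):
        ctx = self._thread_context()
        if spec in ctx:
            return ctx[spec]
        if globally_available:
            # exception 1 above: fall back to the main thread's context
            return self._contexts.get(self._main_id, {}).get(spec)
        return None

    def set(self, spec, value) -> None:
        self._thread_context()[spec] = value

container = ThreadAffineContainer()
container.set("config", "main-config")

found = {}

def worker():
    # a plain lookup stays thread-local, the global flag falls back to main
    found["local"] = container.get("config")
    found["global"] = container.get("config", globally_available=True)

t = threading.Thread(target=worker)
t.start()
t.join()
```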

```python
class DataWriterMetrics(NamedTuple):
    file_path: str
    items_count: int
    file_size: int
```
Collaborator:

Maybe column count? But that is not really important, tbh.

Collaborator Author:

I plan to add elapsed time (start/stop). The column count is not known at this moment.

Collaborator Author:

Not during extract, but it is known during normalize. You can, however, get the column count from the relevant schema...

Collaborator:

Yes, elapsed would be cool too!
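
A hedged sketch of what the metrics tuple could look like with timing added; the `created`/`last_modified` field names are assumptions for illustration, not the final API:

```python
import time
from typing import NamedTuple

class DataWriterMetrics(NamedTuple):
    file_path: str
    items_count: int
    file_size: int
    created: float        # hypothetical field: when the writer opened the file
    last_modified: float  # hypothetical field: last write or close time

start = time.monotonic()
# ... writing items would happen here ...
metrics = DataWriterMetrics("data.jsonl", 100, 2048, start, time.monotonic())
elapsed = metrics.last_modified - metrics.created  # derived elapsed time
```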

@@ -249,6 +250,17 @@ The default is to not parallelize normalization and to perform it in the main pr
Normalization is CPU bound and can easily saturate all your cores. Never allow `dlt` to use all cores on your local machine.
:::

:::caution
The default method of spawning a process pool on Linux is **fork**. If you are using threads in your code (or libraries that use threads),
Collaborator:

Should we add a link to some further explanation of this in the Python docs, maybe?

Collaborator Author:

Good point! Could you propose a link? Then I'll add it.
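
For reference, the standard-library way to opt out of fork. This shows the generic `multiprocessing` mechanism only; how dlt itself configures the start method is not shown in this snippet.

```python
import multiprocessing as mp

# the platform default: "fork" on Linux, "spawn" on macOS and Windows
default = mp.get_start_method()

# a spawn context can be requested explicitly without changing the global
# default; pools created from it start fresh interpreters instead of
# forking, which avoids deadlocks when the parent process already runs threads
ctx = mp.get_context("spawn")
# a pool would then be created under `if __name__ == "__main__":`
# via ctx.Pool(...)
```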

```diff
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "dlt"
-version = "0.4.1a0"
+version = "0.4.1a1"
```
Collaborator:

Should we add a test somewhere to check that the version number is in sync everywhere?

Collaborator Author:

What do you mean? There's just one source of truth for the version, and that is the toml file. Or do you want a test that compares the toml file with the installed package? Or maybe that should be a lint step where we force people to run `make dev` when versions are not in sync?

@sh-rp (Collaborator) left a comment:

I think my main question is: why not use the Python built-in thread locals? Or do you need to be able to access another thread's locals?

@rudolfix rudolfix merged commit 00c2725 into devel Dec 14, 2023
44 checks passed
@rudolfix rudolfix deleted the rfix/container-thread-affine branch December 14, 2023 11:34