Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faker gets 250% faster #20741

Merged
merged 14 commits into from
Jan 2, 2023
Merged

Conversation

evantahler
Copy link
Contributor

@evantahler evantahler commented Dec 20, 2022

After the platform performance research and the failure to use Faker within the Airbyte platform to generate 1TB of data, I spent some time looking into the bottlenecks of our Python CDK sources. Just like our Java sources, the limiting factors are:

  • Parallelization of CPU-intensive tasks
  • JSON parsing is slow

This PR explores solutions to these problems:

  1. Use a WorkerPool to generate the fake records in parallel.
    • The code had to be rearchitected to become thread-safe
    • The number of worker "threads" is now available within the connector's configuration (default 4)
  2. In those parallel workers, also generate the AirbyteRecord JSON in parallel, ofloading that work from the main thread.
    • A stub AirbyteMessageWithCachedJSON message was created that renders its own JSON string upon intilization

How much faster did we get? The speedups in this PR add up to a 250% speedup for a 100,000 user sync (20s to 7s)!

Notes:

  • Testing (SAT) now needs to run with only one thread or else you can't predict the order of the records returned

@octavia-squidington-iv octavia-squidington-iv added area/connectors Connector related issues CDK Connector Development Kit connectors/source/faker labels Dec 20, 2022
Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time looking into the bottlenecks of our Python CDK sources

Do you have some graphs or numbers you can share about CPU hotspots?

do you have a good understanding of why this speed up happened?
My understanding of Python is that due to the GIL, multi-threading doesn’t help on CPU-bound operations like Faker (so threads are mostly useful to speed up blocking I/O bound operations). This is just my mental model which is subject to change when brought into contact with real world experience.

@evantahler
Copy link
Contributor Author

@sherifnada this code uses multi-process code, and your mental model is right! I initially started with mutli-threading, and indeed, there's wasn't a speed up. WorkerPool actually makes a few different python process and shares data with IPC, and that's how we win.

@octavia-squidington-iv octavia-squidington-iv removed the CDK Connector Development Kit label Dec 21, 2022
@evantahler evantahler changed the title Faker is 250% faster Faker gets 250% faster Dec 21, 2022
@evantahler
Copy link
Contributor Author

evantahler commented Dec 21, 2022

/test connector=connectors/source-faker

🕑 connectors/source-faker https://github.com/airbytehq/airbyte/actions/runs/3753636010
✅ connectors/source-faker https://github.com/airbytehq/airbyte/actions/runs/3753636010
Python tests coverage:

Name                                               Stmts   Miss  Cover
----------------------------------------------------------------------
source_faker/__init__.py                               2      0   100%
source_faker/source.py                                17      3    82%
source_faker/streams.py                              191     53    72%
source_faker/utils.py                                 18      6    67%
source_faker/airbyte_message_with_cached_json.py       8      4    50%
----------------------------------------------------------------------
TOTAL                                                236     66    72%
	 Name                                                 Stmts   Miss  Cover   Missing
	 ----------------------------------------------------------------------------------
	 source_acceptance_test/base.py                          12      4    67%   16-19
	 source_acceptance_test/config.py                       140      5    96%   87, 93, 238, 242-243
	 source_acceptance_test/conftest.py                     208     92    56%   36, 42-44, 49, 54, 77, 83, 89-91, 110, 115-117, 123-125, 131-132, 137-138, 143, 149, 158-167, 173-178, 193, 217, 248, 254, 262-267, 275-280, 288-301, 306-312, 319-330, 337-353
	 source_acceptance_test/plugin.py                        69     25    64%   22-23, 31, 36, 120-140, 144-148
	 source_acceptance_test/tests/test_core.py              402    115    71%   53, 58, 93-104, 109-116, 120-121, 125-126, 308, 346-363, 376-387, 391-396, 402, 435-440, 478-485, 528-530, 533, 598-606, 618-621, 626, 682-683, 689, 692, 728-738, 751-776
	 source_acceptance_test/tests/test_incremental.py       158     14    91%   52-59, 64-77, 240
	 source_acceptance_test/utils/asserts.py                 39      2    95%   62-63
	 source_acceptance_test/utils/common.py                  94     10    89%   16-17, 32-38, 72, 75
	 source_acceptance_test/utils/compare.py                 62     23    63%   21-51, 68, 97-99
	 source_acceptance_test/utils/connector_runner.py       133     33    75%   24-27, 46-47, 50-54, 57-58, 73-75, 78-80, 83-85, 88-90, 93-95, 124-125, 159-161, 208
	 source_acceptance_test/utils/json_schema_helper.py     107     13    88%   30-31, 38, 41, 65-68, 96, 120, 192-194
	 ----------------------------------------------------------------------------------
	 TOTAL                                                 1603    336    79%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/tests/test_core.py:386: Backward compatibility tests are disabled for version 1.0.0.
======================== 30 passed, 1 skipped in 37.87s ========================

@evantahler evantahler marked this pull request as ready for review December 21, 2022 23:52
@evantahler evantahler requested a review from a team December 21, 2022 23:52
@evantahler
Copy link
Contributor Author

Docs for how to allocate more resources to a connector when deployed: https://docs.airbyte.com/operator-guides/configuring-connector-resources#configuring-connector-specific-requirements

Comment on lines +21 to +34
def prepare(self):
"""
Note: the instances of the mimesis generators need to be global.
Yes, they *should* be able to be instance variables on this class, which should only instantiated once-per-worker, but that's not quite the case:
* relying only on prepare as a pool initializer fails because we are calling the parent process's method, not the fork
* Calling prepare() as part of generate() (perhaps checking if self.person is set) and then `print(self, current_process()._identity, current_process().pid)` reveals multiple object IDs in the same process, resetting the internal random counters
"""

seed_with_offset = self.seed
if self.seed is not None and len(current_process()._identity) > 0:
seed_with_offset = self.seed + current_process()._identity[0]

global dt
global numeric
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sherifnada I was able to address about 1/2 of your comments. For the rest, I added docs :D

@evantahler
Copy link
Contributor Author

evantahler commented Dec 22, 2022

/test connector=connectors/source-faker

🕑 connectors/source-faker https://github.com/airbytehq/airbyte/actions/runs/3761891204
✅ connectors/source-faker https://github.com/airbytehq/airbyte/actions/runs/3761891204
Python tests coverage:

Name                                               Stmts   Miss  Cover
----------------------------------------------------------------------
source_faker/__init__.py                               2      0   100%
source_faker/streams.py                              126      1    99%
source_faker/source.py                                17      3    82%
source_faker/utils.py                                 18      6    67%
source_faker/airbyte_message_with_cached_json.py       8      4    50%
source_faker/user_generator.py                        28     16    43%
source_faker/purchase_generator.py                    55     41    25%
----------------------------------------------------------------------
TOTAL                                                254     71    72%
	 Name                                                 Stmts   Miss  Cover   Missing
	 ----------------------------------------------------------------------------------
	 source_acceptance_test/base.py                          12      4    67%   16-19
	 source_acceptance_test/config.py                       140      5    96%   87, 93, 238, 242-243
	 source_acceptance_test/conftest.py                     208     92    56%   36, 42-44, 49, 54, 77, 83, 89-91, 110, 115-117, 123-125, 131-132, 137-138, 143, 149, 158-167, 173-178, 193, 217, 248, 254, 262-267, 275-280, 288-301, 306-312, 319-330, 337-353
	 source_acceptance_test/plugin.py                        69     25    64%   22-23, 31, 36, 120-140, 144-148
	 source_acceptance_test/tests/test_core.py              402    115    71%   53, 58, 93-104, 109-116, 120-121, 125-126, 308, 346-363, 376-387, 391-396, 402, 435-440, 478-485, 528-530, 533, 598-606, 618-621, 626, 682-683, 689, 692, 728-738, 751-776
	 source_acceptance_test/tests/test_incremental.py       158     14    91%   52-59, 64-77, 240
	 source_acceptance_test/utils/asserts.py                 39      2    95%   62-63
	 source_acceptance_test/utils/common.py                  94     10    89%   16-17, 32-38, 72, 75
	 source_acceptance_test/utils/compare.py                 62     23    63%   21-51, 68, 97-99
	 source_acceptance_test/utils/connector_runner.py       133     33    75%   24-27, 46-47, 50-54, 57-58, 73-75, 78-80, 83-85, 88-90, 93-95, 124-125, 159-161, 208
	 source_acceptance_test/utils/json_schema_helper.py     107     13    88%   30-31, 38, 41, 65-68, 96, 120, 192-194
	 ----------------------------------------------------------------------------------
	 TOTAL                                                 1603    336    79%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/tests/test_core.py:386: Backward compatibility tests are disabled for version 1.0.0.
======================== 30 passed, 1 skipped in 37.60s ========================

@evantahler evantahler merged commit fe328c3 into evan/faker-hydra Jan 2, 2023
@evantahler evantahler deleted the evan/faker-hydra-multiprocess branch January 2, 2023 22:33
octavia-approvington pushed a commit that referenced this pull request Jan 3, 2023
* [faker] decouple stream state

* add PR #

* commit Stream instantiate changes

* fixup expected record

* skip backward test for this version too

* Apply suggestions from code review

Co-authored-by: Augustin <augustin@airbyte.io>

* lint

* Create realistic datasets of 10GB, 100GB, and 1TB in size (#20558)

* Faker CSV Streaming utilities

* readme

* don't do a final pipe to jq or you will run out or ram

* doc

* Faker gets 250% faster (#20741)

* Faker is 250% faster

* threads in spec + lint

* pass tests

* revert changes to record helper

* cleanup

* update expected_records

* bump default records-per-slice to 1k

* enforce unique email addresses

* cleanup

* more comments

* `parallelism` and pass tests

* update expected records

* cleanup notes

* update readme

* update expected records

* auto-bump connector version

Co-authored-by: Augustin <augustin@airbyte.io>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants