[BEAM-5995] Add hot key to Python Synthetic Sources and use it in Load Tests #8664

Merged
merged 2 commits into from Jun 3, 2019

Conversation

@kkucharc kkucharc commented May 23, 2019

Added two new parameters to Synthetic Sources in Python: the number of hot keys and the hot key fraction. They control how hot keys are produced when key-value pairs are generated.

Added Python GBK load tests in Jenkins that use hot keys in practice.
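
As an illustration of how the two parameters could interact, here is a minimal sketch, not the code from this PR. The function names `gen_kv_pair` and `hot_key_pool`, the argument names, and the default sizes are assumptions made for the example; only `np.random.RandomState`, `random_sample`, and `bytes` are real NumPy calls. The hot key fraction decides whether an element receives a hot key, and the number of hot keys bounds how many distinct hot keys can appear.

```python
import numpy as np


def gen_kv_pair(index, num_hot_keys, hot_key_fraction, key_size=8, value_size=8):
    """Deterministically build one (key, value) pair for the given index.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    r = np.random.RandomState(index)
    # One uniform draw in [0, 1) decides whether this element gets a hot key.
    if num_hot_keys > 0 and r.random_sample() < hot_key_fraction:
        # Seeding with index % num_hot_keys limits the hot keys to at most
        # num_hot_keys distinct values.
        r_hot = np.random.RandomState(index % num_hot_keys)
        return r_hot.bytes(key_size), r.bytes(value_size)
    return r.bytes(key_size), r.bytes(value_size)


def hot_key_pool(num_hot_keys, key_size=8):
    """Return the set of keys that the hot branch above can produce."""
    return {np.random.RandomState(k).bytes(key_size) for k in range(num_hot_keys)}


pairs = [gen_kv_pair(i, num_hot_keys=3, hot_key_fraction=0.25) for i in range(2000)]
pool = hot_key_pool(3)
hot_share = sum(key in pool for key, _ in pairs) / len(pairs)
print(f"share of elements on a hot key: {hot_share:.2f}")
```

With these assumed settings, roughly a quarter of the elements should land on one of three keys, which is the kind of skew a GBK load test would want to exercise.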


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@kkucharc kkucharc force-pushed the add-hot-key branch 2 times, most recently from 1e3b437 to a7a8c49 Compare May 23, 2019 14:59
@asfgit

asfgit commented May 23, 2019

SUCCESS

--none--

@kkucharc kkucharc force-pushed the add-hot-key branch 6 times, most recently from ec0aee2 to d18e77f Compare May 27, 2019 14:25
@kkucharc
Contributor Author

Run seed job

@kkucharc
Contributor Author

Run Load Tests Python GBK reiterate Dataflow Batch

@kkucharc
Contributor Author

R: @pabloem, would you have time to take a look at this?

@@ -238,13 +241,25 @@ def get_range_tracker(self, start_position, stop_position):
tracker = range_trackers.UnsplittableRangeTracker(tracker)
return tracker

def _gen_kv_pair(self, index):
Contributor Author

@pabloem In the Java SDK there is a position (elsewhere called an offset). I assumed I can use index here in the same way, WDYT?

Member

that sounds good to me.

Member

@pabloem pabloem left a comment

Thanks Kasia. I just have a question / suggestion, but looks generally fine.

# Generate hot key.
# An integer is randomly selected from the range [0, numHotKeys-1]
# with equal probability.
r_hot = np.random.RandomState(self._num_hot_keys)
Member

Doesn't this always generate the key from the same seed, so it's always a single key? Maybe use something like index % num_hot_keys as the seed?

Contributor Author

You are right. Or maybe just index?

Member

If we use index, then we will have many different seeds, so we'll have many different 'hot keys' instead of restricting them to a total of num_hot_keys, I think?

Contributor Author

That totally makes sense, thanks a lot! And it also answers my second comment here, I guess :) I'll change it this way.
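
The three seeding options discussed in this thread can be compared with a small sketch. This is only an illustration under assumptions: the constants `NUM_HOT_KEYS` and `KEY_SIZE` and the three function names are hypothetical, and only the `np.random.RandomState(seed).bytes(n)` call mirrors the code under review.

```python
import numpy as np

NUM_HOT_KEYS = 3  # hypothetical setting for the example
KEY_SIZE = 4      # hypothetical key size in bytes


def hot_key_fixed_seed(index):
    # Seed depends only on the configuration, so every index yields the same key.
    return np.random.RandomState(NUM_HOT_KEYS).bytes(KEY_SIZE)


def hot_key_index_seed(index):
    # Seed differs per element, so the number of distinct keys is not bounded.
    return np.random.RandomState(index).bytes(KEY_SIZE)


def hot_key_modulo_seed(index):
    # Seed cycles through 0..NUM_HOT_KEYS-1, so at most NUM_HOT_KEYS keys appear.
    return np.random.RandomState(index % NUM_HOT_KEYS).bytes(KEY_SIZE)


indices = range(1000)
fixed = {hot_key_fixed_seed(i) for i in indices}
by_index = {hot_key_index_seed(i) for i in indices}
by_modulo = {hot_key_modulo_seed(i) for i in indices}

print(len(fixed), len(by_index), len(by_modulo))
```

Counting the distinct keys shows why the modulo seed was suggested: the fixed seed collapses everything onto one key, the plain index spreads keys across nearly every element, and the modulo seed keeps the count within the configured number of hot keys.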

# An integer is randomly selected from the range [0, numHotKeys-1]
# with equal probability.
r_hot = np.random.RandomState(self._num_hot_keys)
return r_hot.bytes(self._key_size), r.bytes(self._value_size)
Contributor Author

@pabloem Your previous comment made me think there is also a bug here. In Java the upper bound is hot_key_number, and I'm not sure that is ensured here as well.

@pabloem
Member

pabloem commented Jun 3, 2019

Ok, I think this is ready to merge then. Thanks Kasia!

@pabloem pabloem merged commit 63e0170 into apache:master Jun 3, 2019
pl04351820 pushed a commit to pl04351820/beam that referenced this pull request Dec 20, 2023