Skip to content

Conversation

@kamilwu
Copy link
Contributor

@kamilwu kamilwu commented Jun 13, 2019

Without some pipeline parameters in ParDo Load Test in Python, it is impossible to create all required test cases (see proposal: https://s.apache.org/load-test-basic-operations).


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@kamilwu
Copy link
Contributor Author

kamilwu commented Jun 13, 2019

R: @kkucharc

Copy link
Contributor

@kkucharc kkucharc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Kamil for this PR, I added my comments

"""Prevents metrics from namespace other than specified in pipeline
options from being published."""
if self.metrics_monitor is not None:
self.metrics_monitor.filters = MetricsFilter().with_namespace(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little a bit afraid of using this mainly because of this PR. We aren't sure if we won't cut off some useful metrics with this.

An idea that comes to my mind is to create global variable METRICS_NAMESPACES as it is in Java SDK. Then specify namespaces to be saved here and use with_namespaces. Also it will require to change variable saved as pipeline option. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the _apply_filters() method to the base class. It takes a list of namespaces as a parameter. The author of a derived class will decide whether to call it or not.


def _get_option_or_default(self, opt_name, default=1):
option = self.pipeline.get_option(opt_name)
return int(option) if option is not None else default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to have this in load_test_metrics_utils, WDYT? There may be also exception thrown in casting to int.

self.nb_of_operations = nb_of_operations
self.counters = []
for i in range(nb_of_counters):
self.counters.append(Metrics.counter('do-not-publish-me',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe do-not-publish sounds better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright.

super(ParDoTest, self).setUp()
self._apply_filter()
self.iterations = self._get_option_or_default('iterations')
self.nb_of_counter = self._get_option_or_default('number_of_counters')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go with full name of number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nb is a quite popular abbreviation of number. What's more, shorter variable names are better, so I'd rather keep nb in this case.

Copy link
Contributor

@kkucharc kkucharc Jun 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to be popular abbreviation across Beam (sometimes new_block was called nb as well). But ok until it's just local variable for this test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Kasia - number_of_counters is easier to digest. As a non python dev and a person that didn't know this code, I had to think for a little while what does nb mean. numbersays it explicitly and costs nothing

num_runs = 1
else:
num_runs = int(self.iterations)
class CounterOperation(beam.DoFn):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to have CountBytes metrics in load_test_metrics_utils. WDYT about making this class parametrised also by name and namespace so it can be reused in future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this class can be reusable. Take a note that we don't count bytes here, we just increment counters by 1 in order to simulate stressful conditions (pipeline with a lot of metric counters)

self.metrics_monitor.filters = MetricsFilter().with_namespace(
self.metrics_namespace)

def _get_option_or_default(self, opt_name, default=1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the 1 actually a default value? Maybe it would be better to have 0 and then omit loops as default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All options will be overridden in a Jenkins job anyway. But I think it makes sense to skip loops as default.

@kamilwu
Copy link
Contributor Author

kamilwu commented Jun 25, 2019

Thanks @kkucharc for review! Fixes are ready

Copy link
Contributor

@kkucharc kkucharc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @lgajowy can you double check if it's similar to Java test case?

Copy link
Contributor

@lgajowy lgajowy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments. Thanks!

Side note: could you squash commits and provide a more descriptive name and description, eg:

Title: [BEAM-7550] Reimplement Python ParDo load test according to the proposal`
Description: The proposal can be found here: https://s.apache.org/load-test-basic-operations

This is ParDo load test with Synthetic Source. Besides of the standard
input options there are additional options:
* number_of_counter_operations - number of pardo operations
* iterations - number of ParDo operations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be more descriptive, eg. Number of subsequent ParDo operations to be performed

input options there are additional options:
* number_of_counter_operations - number of pardo operations
* iterations - number of ParDo operations
* number_of_counters - number of counter metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. I suggest: Number of counter metrics to be created for one ParDo operation

* number_of_counter_operations - number of pardo operations
* iterations - number of ParDo operations
* number_of_counters - number of counter metrics
* number_of_counter_operations - number of times all counters are incremented
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of operations on counters to be performed in one ParDo - wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

super(ParDoTest, self).setUp()
self._apply_filter()
self.iterations = self._get_option_or_default('iterations')
self.nb_of_counter = self._get_option_or_default('number_of_counters')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Kasia - number_of_counters is easier to digest. As a non python dev and a person that didn't know this code, I had to think for a little while what does nb mean. numbersays it explicitly and costs nothing

num_runs = 1
else:
num_runs = int(self.iterations)
class CounterOperation(beam.DoFn):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall algorithm looks good to me

@kamilwu
Copy link
Contributor Author

kamilwu commented Jul 4, 2019

Thanks @lgajowy. I squahed commits as you suggested.
Moreover, I extracted get_option_or_default method to the LoadTest class, because I'm going to reuse it in other load tests.

@kamilwu kamilwu force-pushed the python-pardo-refactor branch from 4449767 to 245d59f Compare July 4, 2019 07:40
@kamilwu kamilwu force-pushed the python-pardo-refactor branch from 245d59f to 92eab27 Compare July 4, 2019 08:35
@kamilwu
Copy link
Contributor Author

kamilwu commented Jul 4, 2019

Run Python PreCommit

Copy link
Contributor

@lgajowy lgajowy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@lgajowy lgajowy merged commit 2d5e493 into apache:master Jul 4, 2019
@kamilwu kamilwu deleted the python-pardo-refactor branch July 4, 2019 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants