[BEAM-10675] Add Python GBK Load Tests for streaming on Dataflow #12612

Merged
kamilwu merged 5 commits into apache:master on Aug 26, 2020

@kamilwu kamilwu commented Aug 18, 2020

Add Python GBK Load Tests for streaming on Dataflow. Cases 1, 2, 4 and 5 are temporarily excluded because their execution time is too long.


kamilwu commented Aug 18, 2020

Run Seed Job

kamilwu commented Aug 18, 2020

Run Load Tests Python GBK Dataflow Batch

kamilwu commented Aug 18, 2020

Run Load Tests Python GBK Dataflow Streaming

kamilwu commented Aug 18, 2020

Run Load Tests Python GBK reiterate Dataflow Batch

kamilwu commented Aug 18, 2020

Run Load Tests Python GBK reiterate Dataflow Streaming

kamilwu commented Aug 20, 2020

Run Seed Job

@kamilwu kamilwu marked this pull request as ready for review August 20, 2020 10:13

kamilwu commented Aug 20, 2020

Run Load Tests Python GBK Dataflow Batch

kamilwu commented Aug 20, 2020

Run Load Tests Python GBK Dataflow Streaming

kamilwu commented Aug 20, 2020

Run Load Tests Python GBK reiterate Dataflow Batch

kamilwu commented Aug 20, 2020

Run Load Tests Python GBK reiterate Dataflow Streaming

kamilwu commented Aug 20, 2020

Run Python PreCommit

kamilwu commented Aug 20, 2020

R: @tysonjh Could you take a look?

@tysonjh tysonjh left a comment

Looks good! Some minor comments.

]
],
[
title : 'GroupByKey Python Load test: 2GB of 100B records',
test : 'apache_beam.testing.load_tests.group_by_key_test',
runner : CommonTestProperties.Runner.DATAFLOW,
pipelineOptions: [
- job_name : 'load-tests-python-dataflow-batch-gbk-2-' + now,
+ job_name : 'load-tests-python-dataflow-${mode}-gbk-2-' + now,
Contributor

Suggested change
- job_name : 'load-tests-python-dataflow-${mode}-gbk-2-' + now,
+ job_name : "load-tests-python-dataflow-${mode}-gbk-2-${now}",

Comment on lines 148 to 161
def addStreamingOptions(test) {
test.pipelineOptions << [streaming: null, experiments: 'use_runner_v2',
enable_streaming_engine: null ]
}
Contributor

Can you add a comment explaining these settings? It's unexpected that 'streaming: null' or 'enable_streaming_engine: null' somehow enables streaming, and it's unclear why 'use_runner_v2' is required.

Contributor Author

I removed --enable_streaming_engine, since it is now added automatically when use_runner_v2 is set: #12585
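
One plausible reading of the `streaming: null` entry questioned above is that a key with a null value is rendered as a bare flag (`--streaming`), while a key with a value is rendered as `--key=value`. The sketch below illustrates that convention in Python rather than in the job DSL's Groovy. `to_flags` is a hypothetical helper written for this illustration, not Beam's actual option-rendering code.

```python
def to_flags(options):
    """Render an options mapping as command-line arguments.

    Hypothetical helper: a value of None marks a bare flag,
    any other value is rendered as --key=value.
    """
    flags = []
    for key, value in options.items():
        if value is None:
            flags.append(f"--{key}")
        else:
            flags.append(f"--{key}={value}")
    return flags


# Mirrors the options left in addStreamingOptions after the change described above.
streaming_options = {"streaming": None, "experiments": "use_runner_v2"}
print(" ".join(to_flags(streaming_options)))
```

Under this convention a null value means "the option is present and takes no argument", which would explain why `streaming: null` turns streaming on rather than off.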

def batchLoadTestJob = { scope, triggeringContext ->
scope.description('Runs Python GBK reiterate load tests on Dataflow runner in batch mode')
commonJobProperties.setTopLevelMainJobProperties(scope, 'master', 240)
def addStreamingOptions(test) {
Contributor

Same here, please add a comment.

batchLoadTestJob(delegate, CommonTestProperties.TriggeringContext.POST_COMMIT)
}
CronJobBuilder.cronJob('beam_LoadTests_Python_GBK_reiterate_Dataflow_Batch',
'H 14 * * *', this) {
Contributor

What's the methodology for picking the time to trigger these? Is it documented anywhere?

@kamilwu kamilwu Aug 24, 2020

Ideally, each test suite (GBK, ParDo, IO tests, etc.) should have its own unique time slot, so that Jenkins is not flooded with many tests triggered at the same time. When adding a new test suite, a contributor has to check which time slots are already occupied and avoid them.

I don't think this is documented. I'll add some information here: https://cwiki.apache.org/confluence/display/BEAM/Contribution+Testing+Guide#ContributionTestingGuide
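
The slot check described above could be sketched roughly as follows. This assumes every suite uses a daily Jenkins cron spec of the form `H <hour> * * *`, where `H` lets Jenkins pick a hashed minute within the hour. The helper names and the second schedule entry are hypothetical; only the first entry appears in this PR.

```python
def occupied_hours(schedules):
    """Collect the hour field of each daily cron spec, e.g. 14 for 'H 14 * * *'."""
    hours = set()
    for spec in schedules.values():
        fields = spec.split()
        hour = fields[1]  # cron order: minute, hour, day of month, month, day of week
        if hour.isdigit():
            hours.add(int(hour))
    return hours


def free_hours(schedules):
    """List the hours of the day that no known suite has claimed yet."""
    taken = occupied_hours(schedules)
    return [hour for hour in range(24) if hour not in taken]


# Only the first entry comes from this PR; the second is a made-up placeholder.
schedules = {
    "beam_LoadTests_Python_GBK_reiterate_Dataflow_Batch": "H 14 * * *",
    "hypothetical_other_suite": "H 12 * * *",
}
print(free_hours(schedules))
```

A new suite would then pick one of the returned hours for its own cron spec.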

@@ -22,119 +22,138 @@ import InfluxDBCredentialsHelper

def now = new Date().format("MMddHHmmss", TimeZone.getTimeZone('UTC'))

def loadTestConfigurations = { datasetName ->
// TODO(BEAM-10774): Skipping some cases because they are too slow.
def STREAMING_TESTS_TO_SKIP = [1, 2, 4, 5]
Contributor

@kkucharc made a good point in PR#12435 about using indices for ignoring tests. I'm more inclined towards @kkucharc's approach of excluding tests by job_name.

Contributor Author

Thanks, the argument that @kkucharc made in PR#12435 is convincing. I'll exclude those cases by job_name.
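
As a rough illustration of excluding cases by job_name rather than by list index, the Python sketch below filters configurations with a name pattern. The case numbers 1, 2, 4 and 5 come from the PR description. The name layout follows the `load-tests-python-dataflow-${mode}-gbk-2-` pattern shown earlier, but the total of six cases, the timestamp, and the helper `should_run` are assumptions made for this example. The actual change lives in the Groovy job definition and may look different.

```python
import re

# Case numbers come from the PR description; the job_name layout is assumed.
SKIPPED_STREAMING_CASES = (1, 2, 4, 5)
SKIP_PATTERN = re.compile(
    r"-streaming-gbk-(?:%s)-" % "|".join(str(n) for n in SKIPPED_STREAMING_CASES)
)


def should_run(config):
    """Return True unless the job_name identifies a skipped streaming case."""
    job_name = config["pipelineOptions"]["job_name"]
    return SKIP_PATTERN.search(job_name) is None


# Hypothetical configurations: six cases with a fixed, made-up timestamp suffix.
configs = [
    {"pipelineOptions": {
        "job_name": f"load-tests-python-dataflow-streaming-gbk-{n}-0826094600"}}
    for n in range(1, 7)
]
runnable = [c["pipelineOptions"]["job_name"] for c in configs if should_run(c)]
print(runnable)
```

Matching on the name keeps the exclusion tied to the test itself, so reordering or inserting configurations does not silently change which cases are skipped.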

codecov bot commented Aug 24, 2020

Codecov Report

Merging #12612 into master will decrease coverage by 0.13%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #12612      +/-   ##
==========================================
- Coverage   34.47%   34.33%   -0.14%     
==========================================
  Files         684      692       +8     
  Lines       81483    81958     +475     
  Branches     9180     9264      +84     
==========================================
+ Hits        28090    28144      +54     
- Misses      52965    53391     +426     
+ Partials      428      423       -5     
Impacted Files                                   Coverage Δ
typehints/typecheck_test_py3.py                  31.54% <0.00%> (-16.00%) ⬇️
typehints/typecheck.py                           29.44% <0.00%> (-6.18%) ⬇️
utils/interactive_utils.py                       30.95% <0.00%> (-2.39%) ⬇️
testing/load_tests/load_test_metrics_utils.py    34.98% <0.00%> (-1.39%) ⬇️
runners/worker/opcounters.py                     33.81% <0.00%> (-0.87%) ⬇️
pipeline.py                                      22.04% <0.00%> (-0.28%) ⬇️
options/pipeline_options.py                      52.99% <0.00%> (-0.16%) ⬇️
transforms/ptransform_test.py                    18.37% <0.00%> (-0.09%) ⬇️
transforms/core.py                               36.79% <0.00%> (-0.05%) ⬇️
runners/common.py                                27.50% <0.00%> (ø)
... and 30 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 094d06c...11a1a3b.

kamilwu commented Aug 24, 2020

Run Seed Job

kamilwu commented Aug 24, 2020

Run Load Tests Python GBK Dataflow Streaming

kamilwu commented Aug 24, 2020

Run Load Tests Python GBK Dataflow Batch

kamilwu commented Aug 24, 2020

Run Load Tests Python GBK reiterate Dataflow Streaming

kamilwu commented Aug 24, 2020

Run Load Tests Python GBK reiterate Dataflow Batch

kamilwu commented Aug 24, 2020

@tysonjh Thanks! Could you verify my fixes before I merge?

enable_streaming_engine is now being added automatically when running with 'use_runner_v2'

@tysonjh tysonjh left a comment

LGTM, thanks!

kamilwu commented Aug 26, 2020

Thanks!

@kamilwu kamilwu merged commit f675112 into apache:master Aug 26, 2020
@kamilwu kamilwu deleted the python-gbk-streaming branch August 26, 2020 09:46