Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-11904] Remove PrettyPrint from job definition #14122

Merged
merged 1 commit into from Mar 10, 2021

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Mar 1, 2021

Please add a meaningful description for your change here. I would like to remove the PrettyPrint from the output. This decreases the size of the job definition:

With the PR:

-rw-r--r--  1 fdriesprong  wheel    45M Mar  1 22:27 /tmp/pubsub-to-bigquery.json

With 2.28.0:

-rw-r--r--  1 fdriesprong  wheel    47M Mar  1 22:30 /tmp/pubsub-to-bigquery.json

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment.
  • Format the pull request title like [BEAM-11904] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@dpcollins-google
Copy link
Contributor

R: @pabloem

I don't know if this would affect anything in the dataflow backend? But AFAICT this would result in more compressed JSON and fewer user jobs over the limit.

@Fokko, I don't know that this solves your issue. Even after this change, your job graph size is still over the limit (20 MiB IIUC?) https://cloud.google.com/dataflow/docs/guides/common-errors#job-graph-too-large

It might be good to look at what is in the serialized JSON payload for your job, see if there's any multi-MiB blobs of data that you shouldn't be serializing?

@pabloem
Copy link
Member

pabloem commented Mar 1, 2021

right - often closures with larger-than-intended context can explode the job graph size

@Fokko
Copy link
Contributor Author

Fokko commented Mar 1, 2021

We're writing data from PubSub to BigQuery, and there are Avro messages that we know upfront, and that we decode from PubSub. I don't see anything huge, but the serialized_source and the serialized_fn are rather large (not big). I think the schema is represented multiple times in the graph, each time it is being passed to the next stage ((de)serializing).

@dpcollins-google Thanks for the quick reply. I'm iterating towards a <20MB job, which is indeed the limit. I noticed that it was pretty printed, so that was a quick win of around 5%.

@pabloem
Copy link
Member

pabloem commented Mar 1, 2021

Can you try using --experiments=upload_graph? This should upload your graph to GCS instead of an HTTP request

@pabloem
Copy link
Member

pabloem commented Mar 1, 2021

let me know if that works, and if so, I can get it on stackoverflow / documentation for others

@pabloem
Copy link
Member

pabloem commented Mar 1, 2021

Run Java PostCommit

@Fokko
Copy link
Contributor Author

Fokko commented Mar 1, 2021

@pabloem Thanks for the pointer. It looks good at first glance. The UI seems to become unresponsive: Graph monitoring data exceeded maximum allowable size. Some monitoring data including counters may be incomplete. But if it works, that would be okay for now. :)

@pabloem
Copy link
Member

pabloem commented Mar 1, 2021

hm right, because the UI also passes the job graph in an API call...
I guess a good next-step is to change how we include schemas in the graph to have a dictionary of some sort...

@pabloem pabloem self-requested a review March 1, 2021 22:51
@Fokko
Copy link
Contributor Author

Fokko commented Mar 2, 2021

@pabloem Seems like I'm unable to see the stacktrace of the failed test, can you confirm?

@pabloem
Copy link
Member

pabloem commented Mar 2, 2021

that's correct - though the failed test is broken on the main branch: https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/testReport/

@pabloem
Copy link
Member

pabloem commented Mar 3, 2021

Run Java PostCommit

@Fokko
Copy link
Contributor Author

Fokko commented Mar 4, 2021

@pabloem Do you think this is something that we can move forward?

@pabloem
Copy link
Member

pabloem commented Mar 9, 2021

thanks for rebasing! : )

@pabloem
Copy link
Member

pabloem commented Mar 9, 2021

Run Java PostCommit

@pabloem pabloem merged commit 4479ff6 into apache:master Mar 10, 2021
@Fokko
Copy link
Contributor Author

Fokko commented Mar 10, 2021

Thanks @pabloem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants