Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-53] Java-only Pubsub sink for streaming. #171

Closed
wants to merge 6 commits into from
Closed

[BEAM-53] Java-only Pubsub sink for streaming. #171

wants to merge 6 commits into from

Conversation

mshields822
Copy link
Contributor

No description provided.

@mshields822
Copy link
Contributor Author

R: @kennknowles

@mshields822
Copy link
Contributor Author

Note the progression were on is:
pubsub-grpc DONE
pubsub-sink (this)
pubsub (for Source)
pubsub-runner (rework PubsubIO and google runners)
nexmark (target)

// ================================================================================

/**
* Number of cores available for publishing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a core in this context? My initial instinct is across machines, but it seems to be being used as the sharding factor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now 'numShards' and needs to be chosen to balance num records batch with pubsub latency. Note that a random long indeed shards but results in most bundles having a single element which kills apiary/grpc quota.

I currently (ie in a pending branch) hard code this in PubsubIO. It is comparable to the heuristics we have baked in for calculating the initial splits for an UnboundedReader.

@kennknowles
Copy link
Member

Better to have the owner of I/O review this. Also just want to call out here that the follow-ups suggested on #120 should probably go first, but I'll leave that up to Dan.

R: @dhalperi -@kennknowles

@mshields822
Copy link
Contributor Author

ACK working on a pubsub-apiary follow up. Thanks!

On Wed, Apr 13, 2016 at 7:16 PM, Kenn Knowles notifications@github.com
wrote:

Better to have the owner of I/O review this. Also just want to call out
here that the follow-ups suggested on #120
#120 should probably go
first, but I'll leave that up to Dan.

R: @dhalperi https://github.com/dhalperi -@kennknowles
https://github.com/kennknowles


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#171 (comment)

elementCounter.addValue(1L);
byte[] elementBytes = CoderUtils.encodeToByteArray(elementCoder, c.element());
long timestampMsSinceEpoch = c.timestamp().getMillis();
c.output(KV.of(ThreadLocalRandom.current().nextInt(numCores * SCALE_OUT),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a random int will allow most systems to choose the sharding to an even more arbitrary degree, and then you can remove numCores

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

swegner pushed a commit to swegner/beam that referenced this pull request Apr 22, 2016
@dhalperi
Copy link
Contributor

Looking into this -- looks like you'll have to rebase. I see this code is pre-name-change.

@mshields822
Copy link
Contributor Author

PTAL
Note this is not wired into anything yet - I plan to hook this and the upcoming PubsubUnboundedSource into PubsubIO in a stand-alone PR.

/**
* Coder for conveying outgoing messages between internal stages.
*/
private static final Coder<PubsubClient.OutgoingMessage> CODER = new
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that this is a case we support quite poorly today: A library author offers a sink, requiring a user to convert to a particular datatype Foo. The library author writes a coder for Foo and would like the user to get this benefit automatically.

Not necessarily directly applicable here, but putting it out there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is entirely internal to the Sink / PubsubClient interface. I could put the coder in PubsubClient but since it has no other PCollection/Coder/etc dependencies it felt better leaving it outside.

@kennknowles
Copy link
Member

Added a couple initial comments to share the review load.

elementCounter.addValue(1L);
byte[] elementBytes = CoderUtils.encodeToByteArray(elementCoder, c.element());
long timestampMsSinceEpoch = c.timestamp().getMillis();
c.output(KV.of(ThreadLocalRandom.current().nextInt(numShards),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is preferable (in terms of serialization overhead, which I understand is important for streaming) to make these static inner classes and then pass in constants in constructors. But up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good catch.

@mshields822
Copy link
Contributor Author

PTAL.
Went back and made the factories in PubsubXClient static classes too just to be consistent.

* BLOCKING
* Send {@code messages} as a batch to Pubsub.
*/
private void publishBatch(List<PubsubClient.OutgoingMessage> messages, int bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't much care, but you write this as both OutgoingMessage and PubsubClient.OutgoingMessage in this file. Probably could pick one and stick with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@dhalperi
Copy link
Contributor

Generally looks good to me. Let's sync tomorrow over any remaining comments.

@mshields822
Copy link
Contributor Author

PTAL
Beefed up unit testing for batching, though it's fragile due to direct runner's own choices on how to bundle.
Added display metadata.

throws IOException {
long nowMsSinceEpoch = System.currentTimeMillis();
int n = pubsubClient.publish(topic, messages);
Preconditions.checkState(n == messages.size());
checkState(n == messages.size(), "Attempted to publish %d messaged but %d were successful",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: messages

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@mshields822
Copy link
Contributor Author

PTAL

@asfgit asfgit closed this in 2fbc0ea May 12, 2016
iemejia pushed a commit to iemejia/beam that referenced this pull request Jan 12, 2018
mareksimunek pushed a commit to mareksimunek/beam that referenced this pull request May 9, 2018
…th-value-sorting

158 extend reduce by key with value sorting
damccorm added a commit that referenced this pull request Dec 9, 2022
Co-authored-by: Elias Segundo <elias.segundo@luisrazo.local>

Co-authored-by: Elias Segundo Antonio <eliassegundo.segundo@gmail.com>
Co-authored-by: Elias Segundo <elias.segundo@luisrazo.local>
Co-authored-by: Danny McCormick <dannymccormick@google.com>
lostluck pushed a commit to lostluck/beam that referenced this pull request Dec 22, 2022
Co-authored-by: Elias Segundo <elias.segundo@luisrazo.local>

Co-authored-by: Elias Segundo Antonio <eliassegundo.segundo@gmail.com>
Co-authored-by: Elias Segundo <elias.segundo@luisrazo.local>
Co-authored-by: Danny McCormick <dannymccormick@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants