STORM-583: Initial check-in for storm-eventhubs. #336
Conversation
Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
@shanyu This is a ton of code. Do you have a JIRA for this? Do you have a design document of some sort describing how this works?
@revans2 Yes, this is the JIRA:
@shanyu it might be a good idea to prefix this PR title with STORM-583: that way JIRA's GitHub bot and people can relate the issue and PR to each other.
@@ -0,0 +1,107 @@
<?xml version="1.0" encoding="UTF-8"?>
Need Apache license header.
Thanks @ptgoetz! This is fixed.
I would support pulling this in, but we need additional committers to approve it, and one or more to volunteer as sponsor(s). As @revans2 mentioned, this is a sizable amount of code, and effectively amounts to a code donation from Microsoft. As such it needs to go through the IP clearance process before it can be merged. If there are enough +1 votes from committers (and no vetoes), I will initiate the IP clearance process.
If downstream throughput of a Storm topology is low, the EventHubSpout keeps emitting tuples that do not get acked. This can result in an OutOfMemoryError in the worker process running the EventHubSpout, because the spout must cache all pending messages. We introduce a new configuration, eventhubspout.max.pending.messages.per.partition, with a default value of 1024. With this configuration, once the pending messages reach this limit, the EH spout won't fetch more messages from the EventHubs entity until some are acked. Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
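The throttling behavior described above can be sketched as follows. This is a minimal illustration, not the actual storm-eventhubs code: the class name `ThrottledSpoutSketch`, the `pending` queue, and the `nextMessage`/`ack` helpers are all assumptions standing in for the spout's real internals; only the 1024 default comes from the commit message.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of per-partition pending-message throttling:
// stop fetching from the source once the unacked cache hits the cap.
public class ThrottledSpoutSketch {
    // The commit message gives 1024 as the config's default value.
    static final int DEFAULT_MAX_PENDING = 1024;

    private final int maxPending;
    private final Queue<String> pending = new ArrayDeque<>();

    ThrottledSpoutSketch(int maxPending) {
        this.maxPending = maxPending;
    }

    // Analogous to work done in nextTuple(): skip fetching when the
    // pending cache is full, so memory use stays bounded.
    String nextMessage(Queue<String> partitionSource) {
        if (pending.size() >= maxPending) {
            return null; // back off; wait for acks to drain the cache
        }
        String msg = partitionSource.poll();
        if (msg != null) {
            pending.add(msg); // cache until acked, in case of replay
        }
        return msg;
    }

    // Analogous to ack(): drop the message from the pending cache.
    void ack(String msg) {
        pending.remove(msg);
    }
}
```

With a cap of 2, a third fetch returns nothing until one of the first two messages is acked, which is the bounded-memory behavior the config is meant to guarantee.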
Currently, when a Storm topology first creates an EventHubSpout (no offset saved in ZooKeeper), it always fetches events starting from the oldest messages in EventHubs. One way to solve this is to provide an additional configuration, "eventhub.receiver.enqueuetime.after": if it is set AND there is no offset saved in ZooKeeper (e.g. the first time the topology runs), we use it as an enqueue-time-based filter when creating the EventHubs receiver. With this configuration, Storm topologies can quickly "bootstrap" and start from recent messages, ignoring old messages in EventHubs. Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
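The filter-selection rule above can be sketched as a small decision function. This is an illustrative assumption, not the storm-eventhubs API: the method `chooseFilter` and the string-encoded filters are invented for the example; only the precedence (saved ZooKeeper offset first, then the enqueue-time config, then start-of-stream) comes from the commit message.

```java
// Hypothetical sketch of choosing the receiver's starting filter.
public class FilterSelectionSketch {
    static String chooseFilter(String savedOffset, Long enqueueTimeAfter) {
        if (savedOffset != null) {
            // An offset saved in ZooKeeper always wins: resume where we left off.
            return "offset:" + savedOffset;
        }
        if (enqueueTimeAfter != null) {
            // First run with the new config set: bootstrap from recent messages.
            return "enqueuetime:" + enqueueTimeAfter;
        }
        // First run, config unset: old behavior, start from the oldest messages.
        return "start-of-stream";
    }
}
```

The key design point is that the enqueue-time filter only applies on a cold start; once the spout has checkpointed an offset, restarts ignore the config and resume exactly where they stopped.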
Added copyright headers, and also included the Curator Framework jar in the jar-with-dependencies (because Storm relocates the jar). Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
@ptgoetz Is storm-eventhubs available to use?
@sasincj It's in the process of getting into storm/external, but you can use it by building from @shanyu's branch directly.
@sasincj I am a bit nervous about pulling this in, as we don't have a very good way to test it beyond the unit tests. EventHubs is a Microsoft Azure-specific system; I cannot spin up my own instance of it to validate that the code is doing what I expect prior to a release. Although it may be a bit painful, I can do that for the Kafka, Redis, HDFS, and HBase integrations that we have accepted so far. That doesn't mean we cannot/should not pull it in. To me it mostly means that we need more assurances and support prior to doing so.
…anugage Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
+1 for accepting this as a code donation from Microsoft and moving forward with the IP clearance process. COMMITTERS please note that this must not be merged until the IP clearance process is completed successfully. I am working with @shanyu to complete those requirements.
@shanyu Also, once the source code tarball for IP clearance has been created, please don't add any commits to this pull request. Any necessary modifications can take place later after IP clearance is complete.
@ptgoetz Got it. I will keep the pull request intact during the IP clearance phase. Thanks a lot!
@ptgoetz It has been a while, is the IP clearance done? Thx
Thanks @shanyu. I merged this into master.
…#336)

Conflicts:
	CHANGELOG.md
	pom.xml
	storm-dist/binary/src/main/assembly/binary.xml
Signed-off-by: Shanyu Zhao shzhao@microsoft.com