[SPARK-8378][Streaming]Add the Python API for Flume #6830
Conversation
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
The avro and avro-ipc dependencies are necessary. Without them, the assembly plugin will use avro 1.7.3 and avro-ipc 1.7.4. They are incompatible and will throw:
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.apache.flume.source.avro.AvroFlumeEvent
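A sketch of the fix being described here: pin both artifacts to the same version so the assembly plugin cannot mix 1.7.3 and 1.7.4 (the `${avro.version}` property is taken from the diff above; the exact placement in the pom is assumed):

```xml
<!-- Sketch: pin avro and avro-ipc to the same version so the assembly
     plugin resolves a single, compatible pair of artifacts. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-ipc</artifactId>
  <version>${avro.version}</version>
</dependency>
```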
This seems like an avro bug. Can you file a jira for Avro? Avro should be compatible within the same minor version - 1.7.x
Do you know where the other version of avro is coming from?
> This seems like an avro bug. Can you file a jira for Avro? Avro should be compatible within the same minor version - 1.7.x

I think avro and avro-ipc should have the same version?

> Do you know where the other version of avro is coming from?

Actually "mvn dependency:tree" shows both avro and avro-ipc are 1.7.7. But I don't know why the assembly plugin picks up a different version.
Test build #34942 has finished for PR 6830 at commit
Why does this require yet another flume module? You can make an assembly in the existing one.
I just followed the kafka-assembly module. Is it easy to make maven publish the assembly jar?
Sure, add a usage of the assembly plugin to the existing module? We should not be proliferating these little modules unless they really represent logically distinct artifacts.
@@ -129,6 +138,12 @@ configuring Flume agents.
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
    FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port]);
</div>
<div data-lang="python" markdown="1">
Is the decoding logic the same here too? UTF-8 encoded string, or custom decoding function? If yes, we should move that snippet explaining this outside of both approaches and specify that it is applicable to both.
I updated the doc. Because there is no Python example for FlumeUtils.createPollingStream (it requires the user to install the Spark sink jar into Flume, so we cannot write an out-of-the-box example), I just copied the description of the encoding function here.
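The decoding behavior being discussed can be sketched as a custom body decoder. The `bodyDecoder` parameter and the `FlumeUtils.createStream` usage shown in the comment are assumptions based on this PR's API; the decoder itself is plain Python:

```python
# Sketch of a custom Flume event body decoder (assumption: the Python
# FlumeUtils.createStream added by this PR accepts a bodyDecoder
# callable). By default the event body is decoded as a UTF-8 string;
# this variant falls back to raw bytes for non-UTF-8 payloads.

def lenient_utf8_decoder(body):
    """Decode a Flume event body as UTF-8, falling back to raw bytes."""
    if body is None:
        return None
    try:
        return bytes(body).decode("utf-8")
    except UnicodeDecodeError:
        return bytes(body)  # leave binary payloads untouched

# Hypothetical usage (requires the flume assembly jar on the classpath):
# flume_stream = FlumeUtils.createStream(
#     ssc, "localhost", 9999, bodyDecoder=lenient_utf8_decoder)
```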
+1 on adding the assembly jar to the current module build itself, if possible.
@harishreedharan Take a look at this PR.
@zsxwing This is pretty good to me, but it's not obvious that this will work. Can you add Flume Python tests? See how KafkaTestUtils (present in src, not test) is used to set up and run the Kafka Python tests.
@jerryshao Since you built the Kafka Python API recently, could you take a look at this PR as well? :)
This LGTM. I am not an expert in Python, but it looks like the Flume side should work fine. Adding tests would be great!
Sure. I will try to add some tests.
Added the Python unit tests. I refactored the Flume unit tests and extracted the common code for the Scala and Python unit tests into FlumeTestUtils and PollingFlumeTestUtils.
Test build #35271 has finished for PR 6830 at commit
Test build #35272 has finished for PR 6830 at commit
Test build #35277 has finished for PR 6830 at commit
Test build #35349 has finished for PR 6830 at commit
Test build #35350 has finished for PR 6830 at commit
retest this please
retest this please
Test build #36137 has finished for PR 6830 at commit
retest this please
Test build #36141 has finished for PR 6830 at commit
test this please.
Jenkins, retest this please.
test this please.
test this please.
Test build #36188 has finished for PR 6830 at commit
Jenkins, retest this please
Test build #36184 has finished for PR 6830 at commit
Test build #36186 has finished for PR 6830 at commit
Test build #36189 has finished for PR 6830 at commit
Jenkins, test this please.
Jenkins, test this please.
Test build #36215 has finished for PR 6830 at commit
Jenkins, test this please.
Test build #36217 has finished for PR 6830 at commit
Test build #36224 has finished for PR 6830 at commit
Jenkins, test this please.
LGTM. I will merge tomorrow morning after the current run passes.
Test build #36241 has finished for PR 6830 at commit
retest this please
Test build #36249 has finished for PR 6830 at commit
I am merging this to master. Thanks @zsxwing !
</parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume-assembly_2.10</artifactId>
Hi guys,
Sorry I'm late to the party, but why is this new assembly necessary?
It creates an 80MB jar file that repackages a bunch of things already present in the Spark assembly (e.g. scala.*, org.hadoop.*, and a whole lot of other things). If python/pyspark/streaming/flume.py is meant to be used inside a Spark application, aren't those dependencies already provided by the Spark assembly? In which case all that is needed is the existing spark-streaming-flume artifact?
Nope, none of the flume stuff is present in the spark-assembly. That is precisely why this assembly JAR with spark-streaming-flume and flume + its dependencies was generated.
Well, two things:
- spark-streaming-flume, the existing artifact, has transitive dependencies on flume. So if you add it using the ivy support in spark-submit, you'd get those.
- Even if you want to add this assembly, it currently packages way more than just flume. It includes all of the Scala and Hadoop libraries and a bunch of other things, as I mentioned above.
So any way you look at it, there is still something to be fixed here.
- Good point, we can probably exclude Scala. Why are the Hadoop libraries included? Definitely not through Spark, as spark-streaming is marked as a provided dependency.
- The whole point of making the assembly JAR is to make it easy to run Spark Streaming + Flume applications, especially in Python, where users will not be creating mvn/sbt projects to include the dependencies in an uber jar. The most convenient option for Python users who want to use a Flume stream is to add --jar .jar. Hence flume and all its dependencies need to be included.
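One way to trim such an assembly along the lines suggested above (a sketch, assuming the module builds its jar with the maven-shade-plugin; the artifact patterns are illustrative) is to exclude groups the Spark assembly already provides:

```xml
<!-- Sketch: keep Scala and Hadoop out of the flume assembly jar, since
     the Spark assembly already ships them. Flume and its own
     dependencies remain included. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <excludes>
        <exclude>org.scala-lang:*</exclude>
        <exclude>org.apache.hadoop:*</exclude>
      </excludes>
    </artifactSet>
  </configuration>
</plugin>
```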