
[SPARK-8378][Streaming]Add the Python API for Flume #6830

Closed
wants to merge 16 commits into from

Conversation

@zsxwing (Member) commented Jun 15, 2015

No description provided.

<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>${avro.version}</version>
zsxwing (Member Author):

The avro and avro-ipc dependencies are necessary. Without them, the assembly plugin will use avro 1.7.3 and avro-ipc 1.7.4, which are incompatible and will throw

java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.apache.flume.source.avro.AvroFlumeEvent

Contributor:

This seems like an Avro bug. Can you file a JIRA for Avro? Avro should be compatible within the same minor version (1.7.x).

Contributor:

Do you know where the other version of avro is coming from?

zsxwing (Member Author):

> This seems like an Avro bug. Can you file a JIRA for Avro? Avro should be compatible within the same minor version (1.7.x).

I think avro and avro-ipc should have the same version?

> Do you know where the other version of avro is coming from?

Actually "mvn dependency:tree" shows both avro and avro-ipc are 1.7.7. But, I don't know why the assembly plugin picks up a different version.

@SparkQA commented Jun 15, 2015

Test build #34942 has finished for PR 6830 at commit 9f33873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@srowen (Member) commented Jun 18, 2015

Why does this require yet another Flume module? You can make an assembly in the existing one.

@zsxwing (Member Author) commented Jun 18, 2015

I just followed the kafka-assembly module. Is it easy to make Maven publish the assembly jar?

@srowen (Member) commented Jun 18, 2015

Sure, add a usage of the assembly plugin to the existing module? We should not be proliferating these little modules unless they really represent logically distinct artifacts.

@@ -129,6 +138,12 @@ configuring Flume agents.
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port]);
</div>
<div data-lang="python" markdown="1">
Contributor:

Is the decoding logic the same here too? UTF-8 encoded string, or custom decoding function? If so, we should move the snippet explaining this outside of both approaches and note that it applies to both.

zsxwing (Member Author):

I updated the doc. Because there is no Python example for FlumeUtils.createPollingStream (it requires the user to install the Spark sink jar into Flume, so we cannot write an out-of-the-box example), I just copied the description of the decoding function here.
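
To make the decoding discussion concrete, here is a minimal sketch (not taken from the PR) of how the new Python API could be used with the default UTF-8 decoding or with a custom body decoder. The host, ports, and the raw_decoder helper are illustrative placeholders; createPollingStream takes the same kind of bodyDecoder argument.

```python
# Minimal usage sketch; host/port values and raw_decoder are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

sc = SparkContext(appName="FlumePythonSketch")
ssc = StreamingContext(sc, 2)

# Default behaviour: each element is a (headers, body) pair and the body
# is decoded as a UTF-8 string.
events = FlumeUtils.createStream(ssc, "localhost", 9999)
events.map(lambda event: event[1]).pprint()

# If the event body is not UTF-8 text, pass a custom decoding function.
def raw_decoder(body):
    # Keep the raw bytes instead of decoding them to a string.
    return body

raw_events = FlumeUtils.createStream(ssc, "localhost", 9998,
                                     bodyDecoder=raw_decoder)

ssc.start()
ssc.awaitTermination()
```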

@harishreedharan (Contributor):

+1 on adding the assembly jar to the current module build itself, if possible.

@tdas (Contributor) commented Jun 18, 2015

@harishreedharan Take a look at this PR.

@tdas (Contributor) commented Jun 18, 2015

@zsxwing This looks pretty good to me, but it's not obvious that this will work. Can you add Flume Python tests? See how KafkaTestUtils (present in src, not test) is used to set up and run the Kafka Python tests.

@tdas (Contributor) commented Jun 18, 2015

@jerryshao Since you built the Kafka Python API recently, could you take a look at this PR as well? :)

@harishreedharan (Contributor):

This LGTM. I am not an expert in Python, but the Flume side looks like it should work fine. Adding tests would be great!

@zsxwing (Member Author) commented Jun 19, 2015

> @zsxwing This looks pretty good to me, but it's not obvious that this will work. Can you add Flume Python tests? See how KafkaTestUtils (present in src, not test) is used to set up and run the Kafka Python tests.

Sure. I will try to add some tests.

@zsxwing (Member Author) commented Jun 19, 2015

Added the Python unit tests. I refactored the Flume unit tests and extracted the code shared by the Scala and Python unit tests into FlumeTestUtils and PollingFlumeTestUtils.
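
For readers unfamiliar with that setup, the sketch below shows the general pattern in hypothetical form: the Python test drives a JVM-side helper through the Py4J gateway and then checks what the DStream received. The helper method names (startSingleSink, writeInput) and the timeout are assumptions for illustration only and may not match the real FlumeTestUtils API.

```python
# Hypothetical test sketch: the JVM helper's method names are assumed for
# illustration and may not match the actual FlumeTestUtils API.
from pyspark.streaming.flume import FlumeUtils

def run_flume_smoke_test(ssc):
    jvm = ssc.sparkContext._jvm
    utils = jvm.org.apache.spark.streaming.flume.FlumeTestUtils()

    port = utils.startSingleSink()        # assumed: start an Avro sink and return its port
    stream = FlumeUtils.createStream(ssc, "localhost", port)

    received = []
    stream.foreachRDD(lambda rdd: received.extend(rdd.collect()))

    ssc.start()
    utils.writeInput(["a", "b", "c"])     # assumed: push test events through Flume
    ssc.awaitTerminationOrTimeout(10)

    bodies = [body for _, body in received]
    assert bodies == ["a", "b", "c"]
```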

@SparkQA commented Jun 19, 2015

Test build #35271 has finished for PR 6830 at commit 0336579.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 19, 2015

Test build #35272 has finished for PR 6830 at commit 4762c34.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 19, 2015

Test build #35277 has finished for PR 6830 at commit 14ba0ff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 20, 2015

Test build #35349 has finished for PR 6830 at commit 152364c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 20, 2015

Test build #35350 has finished for PR 6830 at commit 01cbb3d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@zsxwing (Member Author) commented Jun 20, 2015

retest this please

@zsxwing (Member Author) commented Jun 30, 2015

retest this please

@SparkQA commented Jun 30, 2015

Test build #36137 has finished for PR 6830 at commit 78dfdac.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@zsxwing (Member Author) commented Jun 30, 2015

retest this please

@SparkQA commented Jun 30, 2015

Test build #36141 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jun 30, 2015

test this please.

@JoshRosen (Contributor):

Jenkins, retest this please.

@tdas (Contributor) commented Jun 30, 2015

test this please.

@tdas (Contributor) commented Jun 30, 2015

test this please.

@SparkQA commented Jun 30, 2015

Test build #36188 has finished for PR 6830 at commit 78dfdac.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) commented Jun 30, 2015

Jenkins, retest this please

@SparkQA commented Jun 30, 2015

Test build #36184 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 30, 2015

Test build #36186 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 30, 2015

Test build #36189 has finished for PR 6830 at commit 78dfdac.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jul 1, 2015

Jenkins, test this please.

@tdas (Contributor) commented Jul 1, 2015

Jenkins, test this please.

@SparkQA commented Jul 1, 2015

Test build #36215 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jul 1, 2015

Jenkins, test this please.

@SparkQA commented Jul 1, 2015

Test build #36217 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jul 1, 2015

Test build #36224 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jul 1, 2015

Jenkins, test this please.

@tdas (Contributor) commented Jul 1, 2015

LGTM. I will merge tomorrow morning after the current run passes.

@SparkQA commented Jul 1, 2015

Test build #36241 has finished for PR 6830 at commit 78dfdac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@zsxwing (Member Author) commented Jul 1, 2015

retest this please

@SparkQA commented Jul 1, 2015

Test build #36249 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jul 1, 2015

I am merging this to master. Thanks @zsxwing !

@asfgit closed this in 75b9fe4 on Jul 1, 2015
@zsxwing deleted the flume-python branch on July 2, 2015 01:20
</parent>

<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume-assembly_2.10</artifactId>
Marcelo Vanzin (Contributor):

Hi guys,

Sorry I'm late to the party, but why is this new assembly necessary?

It creates an 80MB jar file that repackages a bunch of things already present in the Spark assembly (e.g. scala.*, org.hadoop.*, and a whole lot of other things). If python/pyspark/streaming/flume.py is meant to be used inside a Spark application, aren't those dependencies already provided by the Spark assembly? In which case all that is needed is the existing spark-streaming-flume artifact?

Contributor:

Nope, none of the Flume stuff is present in the Spark assembly. That is precisely why this assembly JAR with spark-streaming-flume and Flume plus its dependencies was generated.


Marcelo Vanzin (Contributor):

Well, two things:

  • spark-streaming-flume, the existing artifact, has transitive dependencies on flume. So if you add it using the ivy support in spark-submit, you'd get those.
  • Even if you want to add this assembly, it currently packages way more than just flume. It includes all of Scala and Hadoop libraries and a bunch of other things, as I mentioned above.

So any way you look at it, there is still something to be fixed here.

Contributor:

  1. Good point, we can probably exclude Scala. Why are the Hadoop libraries included? Definitely not through Spark, since spark-streaming is marked as a provided dependency?
  2. The whole point of making the assembly JAR is to make it easy to run Spark Streaming + Flume applications, especially in Python, where users will not be creating mvn/sbt projects to include the dependencies in an uber jar. The most convenient option for Python users who want to use the Flume stream is to add --jar .jar. Hence Flume and all of its dependencies need to be included.

