
[SPARK-8378][Streaming]Add the Python API for Flume #6830

Closed
wants to merge 16 commits into from

Conversation

@zsxwing (Member) commented Jun 15, 2015

No description provided.

<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>${avro.version}</version>
zsxwing (Member Author):

The avro and avro-ipc dependencies are necessary. Without them, the assembly plugin will use avro 1.7.3 and avro-ipc 1.7.4, which are incompatible and will throw

java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.apache.flume.source.avro.AvroFlumeEvent

Contributor:

This seems like an Avro bug. Can you file a JIRA for Avro? Avro should be compatible within the same minor version (1.7.x).

Contributor:

Do you know where the other version of avro is coming from?

zsxwing (Member Author):

> This seems like an Avro bug. Can you file a JIRA for Avro? Avro should be compatible within the same minor version (1.7.x).

I think avro and avro-ipc should have the same version?

> Do you know where the other version of avro is coming from?

Actually "mvn dependency:tree" shows both avro and avro-ipc are 1.7.7. But, I don't know why the assembly plugin picks up a different version.

@SparkQA commented Jun 15, 2015

Test build #34942 has finished for PR 6830 at commit 9f33873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@srowen (Member) commented Jun 18, 2015

Why does this require yet another Flume module? You can make an assembly in the existing one.

@zsxwing (Member Author) commented Jun 18, 2015

I just followed the kafka-assembly module. Is it easy to make Maven publish the assembly jar?

@srowen (Member) commented Jun 18, 2015

Sure, add a usage of the assembly plugin to the existing module? We should not be proliferating these little modules unless they really represent logically distinct artifacts.

@@ -129,6 +138,12 @@ configuring Flume agents.
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port]);
</div>
<div data-lang="python" markdown="1">
Contributor:

Is the decoding logic the same here too? UTF-8 encoded string, or custom decoding function? If so, we should move the snippet explaining this outside of both approaches and note that it applies to both.

zsxwing (Member Author):

I updated the doc. Because there is no Python example for FlumeUtils.createPollingStream (it requires the user to install the Spark sink jar into Flume, so we cannot write an out-of-the-box example), I just copied the description of the decoding function here.
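
To make the decoding discussion concrete, here is a minimal sketch (not taken from the PR) of how the new Python API could be used with the default UTF-8 decoding or with a custom body decoder. The host, ports, and the raw_decoder helper are illustrative placeholders; createPollingStream takes the same kind of bodyDecoder argument.

```python
# Minimal usage sketch; host/port values and raw_decoder are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

sc = SparkContext(appName="FlumePythonSketch")
ssc = StreamingContext(sc, 2)

# Default behaviour: each element is a (headers, body) pair and the body
# is decoded as a UTF-8 string.
events = FlumeUtils.createStream(ssc, "localhost", 9999)
events.map(lambda event: event[1]).pprint()

# If the event body is not UTF-8 text, pass a custom decoding function.
def raw_decoder(body):
    # Keep the raw bytes instead of decoding them to a string.
    return body

raw_events = FlumeUtils.createStream(ssc, "localhost", 9998,
                                     bodyDecoder=raw_decoder)

ssc.start()
ssc.awaitTermination()
```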

@harishreedharan (Contributor):

+1 on adding the assembly jar to the current module build itself, if possible.

@tdas (Contributor) commented Jun 18, 2015

@harishreedharan Take a look at this PR.

@tdas (Contributor) commented Jun 18, 2015

@zsxwing This looks pretty good to me, but it's not obvious that this will work. Can you add Flume Python tests? See how KafkaTestUtils (present in src, not test) is used to set up and run the Kafka Python tests.

@tdas (Contributor) commented Jun 18, 2015

@jerryshao Since you built the Kafka Python API recently, could you take a look at this PR as well? :)

@harishreedharan (Contributor):

This LGTM. I am not an expert in Python, but the Flume side looks like it should work fine. Adding tests would be great!

@zsxwing (Member Author) commented Jun 19, 2015

> @zsxwing This looks pretty good to me, but it's not obvious that this will work. Can you add Flume Python tests? See how KafkaTestUtils (present in src, not test) is used to set up and run the Kafka Python tests.

Sure. I will try to add some tests.

@zsxwing (Member Author) commented Jun 19, 2015

Added the Python unit tests. I refactored the Flume unit tests and extracted the code shared by the Scala and Python unit tests into FlumeTestUtils and PollingFlumeTestUtils.
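
For readers unfamiliar with that setup, the sketch below shows the general pattern in hypothetical form: the Python test drives a JVM-side helper through the Py4J gateway and then checks what the DStream received. The helper method names (startSingleSink, writeInput) and the timeout are assumptions for illustration only and may not match the real FlumeTestUtils API.

```python
# Hypothetical test sketch: the JVM helper's method names are assumed for
# illustration and may not match the actual FlumeTestUtils API.
from pyspark.streaming.flume import FlumeUtils

def run_flume_smoke_test(ssc):
    jvm = ssc.sparkContext._jvm
    utils = jvm.org.apache.spark.streaming.flume.FlumeTestUtils()

    port = utils.startSingleSink()        # assumed: start an Avro sink and return its port
    stream = FlumeUtils.createStream(ssc, "localhost", port)

    received = []
    stream.foreachRDD(lambda rdd: received.extend(rdd.collect()))

    ssc.start()
    utils.writeInput(["a", "b", "c"])     # assumed: push test events through Flume
    ssc.awaitTerminationOrTimeout(10)

    bodies = [body for _, body in received]
    assert bodies == ["a", "b", "c"]
```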

@SparkQA commented Jun 19, 2015

Test build #35271 has finished for PR 6830 at commit 0336579.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 19, 2015

Test build #35272 has finished for PR 6830 at commit 4762c34.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 19, 2015

Test build #35277 has finished for PR 6830 at commit 14ba0ff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 20, 2015

Test build #35349 has finished for PR 6830 at commit 152364c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 20, 2015

Test build #35350 has finished for PR 6830 at commit 01cbb3d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@zsxwing (Member Author) commented Jun 20, 2015

retest this please

@zsxwing (Member Author) commented Jun 30, 2015

retest this please

@SparkQA commented Jun 30, 2015

Test build #36137 has finished for PR 6830 at commit 78dfdac.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@zsxwing (Member Author) commented Jun 30, 2015

retest this please

@SparkQA commented Jun 30, 2015

Test build #36141 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jun 30, 2015

test this please.

@JoshRosen (Contributor):

Jenkins, retest this please.

@tdas (Contributor) commented Jun 30, 2015

test this please.

@tdas (Contributor) commented Jun 30, 2015

test this please.

@SparkQA commented Jun 30, 2015

Test build #36188 has finished for PR 6830 at commit 78dfdac.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) commented Jun 30, 2015

Jenkins, retest this please

@SparkQA commented Jun 30, 2015

Test build #36184 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 30, 2015

Test build #36186 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jun 30, 2015

Test build #36189 has finished for PR 6830 at commit 78dfdac.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jul 1, 2015

Jenkins, test this please.

@tdas (Contributor) commented Jul 1, 2015

Jenkins, test this please.

@SparkQA commented Jul 1, 2015

Test build #36215 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jul 1, 2015

Jenkins, test this please.

@SparkQA commented Jul 1, 2015

Test build #36217 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@SparkQA commented Jul 1, 2015

Test build #36224 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jul 1, 2015

Jenkins, test this please.

@tdas (Contributor) commented Jul 1, 2015

LGTM. I will merge tomorrow morning after the current run passes.

@SparkQA commented Jul 1, 2015

Test build #36241 has finished for PR 6830 at commit 78dfdac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@zsxwing (Member Author) commented Jul 1, 2015

retest this please

@SparkQA commented Jul 1, 2015

Test build #36249 has finished for PR 6830 at commit 78dfdac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FlumeUtils(object):

@tdas (Contributor) commented Jul 1, 2015

I am merging this to master. Thanks @zsxwing !

@asfgit closed this in 75b9fe4 on Jul 1, 2015
@zsxwing deleted the flume-python branch on July 2, 2015 01:20
</parent>

<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume-assembly_2.10</artifactId>
Marcelo Vanzin (Contributor):

Hi guys,

Sorry I'm late to the party, but why is this new assembly necessary?

It creates an 80MB jar file that repackages a bunch of things already present in the Spark assembly (e.g. scala.*, org.hadoop.*, and a whole lot of other things). If python/pyspark/streaming/flume.py is meant to be used inside a Spark application, aren't those dependencies already provided by the Spark assembly? In which case all that is needed is the existing spark-streaming-flume artifact?

Contributor:

Nope, none of the Flume stuff is present in the Spark assembly. That is precisely why this assembly JAR with spark-streaming-flume and Flume plus its dependencies was generated.


Marcelo Vanzin (Contributor):

Well, two things:

  • spark-streaming-flume, the existing artifact, has transitive dependencies on flume. So if you add it using the ivy support in spark-submit, you'd get those.
  • Even if you want to add this assembly, it currently packages way more than just flume. It includes all of Scala and Hadoop libraries and a bunch of other things, as I mentioned above.

So any way you look at it, there is still something to be fixed here.

Contributor:

  1. Good point, we can probably exclude Scala. Why are the Hadoop libraries included? Definitely not through Spark, since spark-streaming is marked as a provided dependency?
  2. The whole point of making the assembly JAR is to make it easy to run Spark Streaming + Flume applications, especially in Python, where users will not be creating mvn/sbt projects to include the dependencies in an uber jar. The most convenient option for Python users who want to use the Flume stream is to add --jar .jar. Hence Flume and all of its dependencies need to be included.

