
[ZEPPELIN-1883] Can't import spark submitted packages in PySpark #1831

Conversation

@1ambda (Member) commented Jan 2, 2017

### What is this PR for?

Fixed importing packages in PySpark requested via `SPARK_SUBMIT_OPTIONS`.

### What type of PR is it?

[Bug Fix]

### Todos

Nothing

### What is the Jira issue?

[ZEPPELIN-1883](https://issues.apache.org/jira/browse/ZEPPELIN-1883)

### How should this be tested?

1. Download Apache Spark 1.6.2 (since it's the most recent version pyspark-cassandra supports).

2. Set `SPARK_HOME` and `SPARK_SUBMIT_OPTIONS` in `conf/zeppelin-env.sh` like:

```sh
export SPARK_HOME="~/github/apache-spark/1.6.2-bin-hadoop2.6"
export SPARK_SUBMIT_OPTIONS="--packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2,TargetHolding:pyspark-cassandra:0.3.5 --exclude-packages org.slf4j:slf4j-api"
```

3. First check whether you can run `spark-submit` itself:

```
./bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2,TargetHolding:pyspark-cassandra:0.3.5 --exclude-packages org.slf4j:slf4j-api --class org.apache.spark.examples.SparkPi lib/spark-examples-1.6.2-hadoop2.6.0.jar
```

4. Test whether the submitted packages can be imported:

```
%pyspark

import pyspark_cassandra
```
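
If the import fails even though step 3 succeeded, a quick diagnostic (a sketch added for illustration, not part of the original test plan) is to check whether the jar actually reached the interpreter's Python path:

```
%pyspark

import sys
# The pyspark-cassandra jar should appear here once SPARK_SUBMIT_OPTIONS is picked up.
print([p for p in sys.path if 'cassandra' in p.lower()])
```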

### Screenshots (if appropriate)

```
import pyspark_cassandra

Traceback (most recent call last):
  File "/var/folders/lr/8g9y625n5j39rz6qhkg8s6640000gn/T/zeppelin_pyspark-5266742863961917074.py", line 267, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/var/folders/lr/8g9y625n5j39rz6qhkg8s6640000gn/T/zeppelin_pyspark-5266742863961917074.py", line 265, in <module>
    exec(code)
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra
```

### Questions:

* Do the license files need an update? - NO
* Are there breaking changes for older versions? - NO
* Does this need documentation? - NO

@astroshim (Contributor) commented:

In my test, I got the following when I tried to run the paragraph:

```
INFO [2017-01-02 09:08:12,358] ({Exec Default Executor} RemoteInterpreterManagedProcess.java[onProcessComplete]:164) - Interpreter process exited 0
```

Maybe this error occurs when the libraries requested via `SPARK_SUBMIT_OPTIONS` couldn't be downloaded. Is this normal behavior?

@1ambda (Member, Author) commented Jan 2, 2017

@astroshim Thanks for the review!

It's the expected behavior. If spark-submit isn't loaded properly, the Spark interpreter dies without errors.
I've just updated the *How should this be tested?* section so that you can check whether the problem comes from spark-submit or not.

@zjffdu (Contributor) commented Jan 3, 2017

@1ambda Spark doesn't support specifying Python packages through `--packages`; the correct usage is `--py-files`. Although this PR could resolve your issue, the issue here is not due to a Zeppelin bug; it is caused by incorrect usage of `--packages`.

@felixcheung (Member) commented:

Right, I'm a bit concerned whether this is the right fix for the issue.

@1ambda (Member, Author) commented Jan 3, 2017

@zjffdu @felixcheung Thanks for the review :)

1. Then how can I load pyspark-cassandra for PySpark in Zeppelin? Its README suggests:

```sh
# from the pyspark-cassandra README

spark-submit \
    --packages TargetHolding/pyspark-cassandra:<version> \
    --conf spark.cassandra.connection.host=your,cassandra,node,names
```

2. According to SparkSubmit.scala, Spark seems to add the resolved Maven package paths to `--py-files`. Could you check it? (The pyspark-cassandra jar includes some .py files.)

3. If you run the command below, it downloads the packages and adds them to py-files, so that you can import pyspark_cassandra in the Python shell; a sanity-check probe follows the transcript.

```sh
# using Spark 1.6.2

./bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2,TargetHolding:pyspark-cassandra:0.3.5 --exclude-packages org.slf4j:slf4j-api
```

```
>>> import pyspark_cassandra
>>>
```
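
As a sanity check for (2), one can also print what spark-submit registered as py-files from inside that shell. This is an illustrative probe only: `spark.submit.pyFiles` is the property spark-submit populates, but `sc._conf` is internal PySpark API, so treat it as a debugging aid rather than a stable interface.

```python
# Run inside the pyspark shell started above (sc is the shell's SparkContext).
import sys
print(sc._conf.get('spark.submit.pyFiles', ''))     # what spark-submit registered as py-files
print([p for p in sys.path if p.endswith('.jar')])  # jar entries importable via zipimport
```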

Considering (2) and (3), we can conclude that the `--packages` option downloads jars and adds them to py-files even in pyspark, at least for jars that include Python files. Given that, what do you think about this claim?

> Spark doesn't support specifying python packages through --packages, the correct usage is to use --py-files.

4. Why can't Zeppelin use the `--packages` option to download pyspark-cassandra when Spark itself can? Isn't that a Zeppelin bug?

> but the issue here is not due to zeppelin bug,

@zjffdu (Contributor) commented Jan 3, 2017

@1ambda Actually, pyspark-cassandra doesn't work for me in the pyspark shell. I guess it works for you because you have installed it locally.

```
>>> import pyspark_cassandra
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra
```

@1ambda (Member, Author) commented Jan 3, 2017

@zjffdu

I've just created a gist showing that the `--packages` option does download pyspark-cassandra: https://gist.github.com/1ambda/5caf92753ea2f95ada11b1c13945d261

```
$ pip uninstall pyspark-cassandra
Cannot uninstall requirement pyspark-cassandra, not installed

$ ./bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2,TargetHolding:pyspark-cassandra:0.3.5 --exclude-packages org.slf4j:slf4j-api

...

downloading https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.10/1.6.2/spark-cassandra-connector_2.10-1.6.2.jar ...
	[SUCCESSFUL ] com.datastax.spark#spark-cassandra-connector_2.10;1.6.2!spark-cassandra-connector_2.10.jar (450ms)
downloading http://dl.bintray.com/spark-packages/maven/TargetHolding/pyspark-cassandra/0.3.5/pyspark-cassandra-0.3.5.jar ...
	[SUCCESSFUL ] TargetHolding#pyspark-cassandra;0.3.5!pyspark-cassandra.jar (310ms)
downloading https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector-java_2.10/1.6.0-M1/spark-cassandra-connector-java_2.10-1.6.0-M1.jar ...
	[SUCCESSFUL ] com.datastax.spark#spark-cassandra-connector-java_2.10;1.6.0-M1!spark-cassandra-connector-java_2.10.jar (23ms)
downloading https://repo1.maven.org/maven2/com/datastax/cassandra/cassandra-driver-core/3.0.0/cassandra-driver-core-3.0.0.jar ...
	[SUCCESSFUL ] com.datastax.cassandra#cassandra-driver-core;3.0.0!cassandra-driver-core.jar(bundle) (78ms)
:: resolution report :: resolve 2819ms :: artifacts dl 870ms
```

@zjffdu (Contributor) commented Jan 3, 2017

Hmm, it works in local mode but doesn't work in yarn-client mode. Could you try yarn-client mode?

@1ambda (Member, Author) commented Jan 3, 2017

I tested on yarn-client and mesos-client, and found that:

* mesos-client mode does copy the pyspark-cassandra package submitted via `--packages`, as you can see below (the error is due to an incompatible Python version, not Spark or pyspark-cassandra; Python 2.6 lacks dict comprehensions, hence the SyntaxError):

```
Using Python version 2.6.6 (r266:84292, Aug 18 2016 15:13:37)
SparkContext available as sc, HiveContext available as sqlContext.
>>> import pyspark_cassandra
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/spark-df7bc8fa-233f-4124-855b-4a39fa948c1a/userFiles-ab70ffa3-212b-47ee-9611-9c240d3ce899/TargetHolding_pyspark-cassandra-0.3.5.jar/pyspark_cassandra/__init__.py", line 24, in <module>
  File "/tmp/spark-df7bc8fa-233f-4124-855b-4a39fa948c1a/userFiles-ab70ffa3-212b-47ee-9611-9c240d3ce899/TargetHolding_pyspark-cassandra-0.3.5.jar/pyspark_cassandra/context.py", line 16, in <module>
  File "/tmp/spark-df7bc8fa-233f-4124-855b-4a39fa948c1a/userFiles-ab70ffa3-212b-47ee-9611-9c240d3ce899/TargetHolding_pyspark-cassandra-0.3.5.jar/pyspark_cassandra/rdd.py", line 291
    k = Row(**{c: row.__getattr__(c) for c in columns})
                                       ^
SyntaxError: invalid syntax
>>>
```

* yarn-client mode doesn't copy pyFiles, as you can see here:

```scala
// If we're running a python app, set the main class to our specific python runner
if (args.isPython && deployMode == CLIENT) {

...

  if (clusterManager != YARN) {
    // The YARN backend handles python files differently, so don't merge the lists.
    args.files = mergeFileLists(args.files, args.pyFiles)
  }
```

**Summary**

@zjffdu @felixcheung

1. I am not sure why they decided not to copy py-files in yarn-client mode, but it's a problem of Spark, not Zeppelin. Of course, we could implement the exact same behavior in yarn-client mode, but then users could not benefit from `--packages`: they would have to download every transitive dependency, find its location, and pass all the paths to `--py-files` (a possible user-side workaround is sketched below).
2. As you saw, this is the expected behavior, at least in local and mesos-client modes.
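
For completeness, one possible user-side workaround in yarn-client mode (untested here, and the jar path below is a hypothetical Ivy cache location) would be to register the resolved jar manually; `sc.addPyFile` ships the file and puts it on the Python path, and Python's zipimport can then read the `.py` files packaged inside the jar:

```python
# Hypothetical workaround sketch for yarn-client mode: register the resolved
# jar with the SparkContext yourself, instead of relying on spark-submit
# merging it into py-files. Run inside a live PySpark session where sc exists.
jar = '/home/user/.ivy2/jars/TargetHolding_pyspark-cassandra-0.3.5.jar'  # assumed cache path
sc.addPyFile(jar)

import pyspark_cassandra  # should now resolve via zipimport
```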

@1ambda (Member, Author) commented Jan 5, 2017

Any update? This is one of the blockers for 0.7.

@zjffdu (Contributor) commented Jan 5, 2017

I still think this is not the correct fix since it doesn't resolve yarn-client mode, which I believe most users use.

@1ambda (Member, Author) commented Jan 5, 2017

@zjffdu

> since it doesn't resolve the yarn-client mode

1. PySpark itself doesn't support extending PYTHONPATH in yarn-client mode either.
2. You keep saying this is not the right fix without offering an alternative. So let me ask:

* How can you load pyspark-cassandra using `--packages`, as described in its README.md, in local and mesos-client modes (and even in yarn-client) on Zeppelin?

@zjffdu (Contributor) commented Jan 5, 2017

As I said before: why not use `--py-files`? I checked the pyspark-cassandra repository:
https://github.com/TargetHolding/pyspark-cassandra

Its README shows that users can use `--py-files`:

```sh
spark-submit \
    --jars /path/to/pyspark-cassandra-assembly-<version>.jar \
    --driver-class-path /path/to/pyspark-cassandra-assembly-<version>.jar \
    --py-files /path/to/pyspark-cassandra-assembly-<version>.jar \
    --conf spark.cassandra.connection.host=your,cassandra,node,names \
    --master spark://spark-master:7077 \
    yourscript.py
```

@1ambda (Member, Author) commented Jan 5, 2017

1. I read it and replied before:

   > Q. README shows that user can use --py-files
   > A. Users cannot benefit from `--packages`. They need to download every transitive dependency, find its location, and provide the paths to `--py-files`.

   And even in Spark, we can use `--packages` in local and mesos-client modes. Why do you think Zeppelin shouldn't do the same?

2. I tested this PR in yarn-client and it works. How did you test this PR in yarn-client?

   > since it doesn't resolve the yarn-client mode

   Could you tell me about your environment?

   * how you built Zeppelin (command, env)
   * Zeppelin, YARN, and Spark versions

@zjffdu (Contributor) commented Jan 5, 2017

Sorry, I missed your last reply.
I don't mean yarn-client mode in Zeppelin; I mean in Spark. Do you mean yarn-client mode works for you in Spark? If it doesn't work in yarn-client mode, then either this is a bug of Spark, or it is intended behavior; if it is intended, it might not be proper to add features in Zeppelin that Spark doesn't support.

I use the following command to launch pyspark and get the error below.

Launch pyspark (I am using Spark 2.1.0):

```
bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2,TargetHolding:pyspark-cassandra:0.3.5 --exclude-packages org.slf4j:slf4j-api --master yarn-client
```

Importing pyspark_cassandra fails:

```
>>> import pyspark_cassandra
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra
```

@1ambda (Member, Author) commented Jan 10, 2017

@zjffdu I've just changed the fix so that PYTHONPATH is no longer extended with submitted packages in yarn-client mode only.

See f76d2c8

@zjffdu (Contributor) commented Jan 11, 2017

Thanks @1ambda. Do you mind creating a Spark ticket as well? The behavior inconsistency between different modes seems to be a Spark issue; we need to clarify it with the Spark community.

@1ambda (Member, Author) commented Jan 11, 2017

**For reviewers**

Fixed to use `spark.jars` instead of the classpath.

* the classpath doesn't include the submitted jars at the moment (I could get them 7 days ago, but not now)
* it helps simplify the logic, since we can set PYTHONPATH directly in the setupPySparkEnv function
* also, `spark.jars` includes only the required jars (the jars on the classpath are more verbose)
* I tested with Spark 1.6.2 and Spark 2.0.0

A rough sketch of the resulting logic is shown below.
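
The sketch below is a minimal Python illustration only; the real change lives in the interpreter's setupPySparkEnv, so the function and parameter names here are assumptions, while `spark.jars` is the real Spark property:

```python
import os

def extend_pythonpath_with_submitted_jars(spark_conf, master):
    """Sketch: append jars resolved via --packages (exposed through the
    spark.jars property) to PYTHONPATH so PySpark can import the .py files
    packaged inside them. spark_conf is assumed to be a dict-like mapping."""
    # yarn-client handles py-files differently in Spark (see the SparkSubmit
    # excerpt earlier in this thread and commit f76d2c8), so skip it there.
    if master == 'yarn-client':
        return
    jars = [j for j in spark_conf.get('spark.jars', '').split(',') if j]
    current = os.environ.get('PYTHONPATH', '')
    parts = ([current] if current else []) + jars
    if parts:
        os.environ['PYTHONPATH'] = os.pathsep.join(parts)
```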

@zjffdu

The Spark code we talked about came from apache/spark#6360. It seems intentional, so it's OK not to raise an issue.

@1ambda force-pushed the ZEPPELIN-1883/cant-import-submitted-packages-in-pyspark branch 4 times, most recently from c8930cb to 5efacb4, on January 11, 2017 at 19:35
@1ambda force-pushed the ZEPPELIN-1883/cant-import-submitted-packages-in-pyspark branch from 5efacb4 to 585d48a on January 11, 2017 at 21:48
@Leemoonsoo (Member) commented:

Thanks @1ambda for the fix. Thanks @zjffdu for reviewing and verifying.

It looks good to me, since it provides more consistent behavior between the `PySparkInterpreter` and `SPARK_HOME/bin/pyspark`. I will merge to master and branch-0.7 if there are no more comments.

@asfgit closed this in cb8e418 on Jan 14, 2017
asfgit pushed a commit that referenced this pull request Jan 14, 2017
Author: 1ambda <1amb4a@gmail.com>

Closes #1831 from 1ambda/ZEPPELIN-1883/cant-import-submitted-packages-in-pyspark and squashes the following commits:

585d48a [1ambda] Use spark.jars instead of classpath
f76d2c8 [1ambda] fix: Do not extend PYTHONPATH in yarn-client
c735bd5 [1ambda] fix: Import spark submit packages in pyspark

(cherry picked from commit cb8e418)
Signed-off-by: Lee moon soo <moon@apache.org>