[ZEPPELIN-1883] Can't import spark submitted packages in PySpark #1831
Conversation
In my test, I got a
@astroshim Thanks for the review! It's the expected behavior. If
@1ambda Spark doesn't support specifying python packages through
Right, but I'm a bit concerned whether this is the right fix for the issue.
@zjffdu @felixcheung Thanks for review :)
```sh
# used spark 1.6.2
./bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2,TargetHolding:pyspark-cassandra:0.3.5 --exclude-packages org.slf4j:slf4j-api
```

```python
>>> import pyspark_cassandra
>>> Considering
```
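As an aside, one quick way to check from inside a pyspark shell whether the `--packages` jars actually landed on the Python path is to inspect `sys.path`. This is a hedged sketch; what appears there depends on the Spark version and deploy mode:

```python
import sys

# Jars fetched by --packages may be appended to sys.path as .jar/.zip
# entries (Python can import pure-Python packages from inside them).
# Listing those entries shows what the driver actually picked up.
jar_entries = [p for p in sys.path if p.endswith((".jar", ".zip"))]
for entry in jar_entries:
    print(entry)
```

If `pyspark_cassandra` imports even though no matching jar shows up here, it is likely being resolved from a local installation instead.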
@1ambda Actually, pyspark-cassandra doesn't work for me in the pyspark shell. I guess it works for you because you have installed it locally.
I've just created a gist to show
Hmm, it works in local mode but doesn't work in yarn-client mode. Could you try yarn-client mode?
I tested on yarn-client and mesos-client and found that
```
Using Python version 2.6.6 (r266:84292, Aug 18 2016 15:13:37)
SparkContext available as sc, HiveContext available as sqlContext.
>>> import pyspark_cassandra
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/spark-df7bc8fa-233f-4124-855b-4a39fa948c1a/userFiles-ab70ffa3-212b-47ee-9611-9c240d3ce899/TargetHolding_pyspark-cassandra-0.3.5.jar/pyspark_cassandra/__init__.py", line 24, in <module>
  File "/tmp/spark-df7bc8fa-233f-4124-855b-4a39fa948c1a/userFiles-ab70ffa3-212b-47ee-9611-9c240d3ce899/TargetHolding_pyspark-cassandra-0.3.5.jar/pyspark_cassandra/context.py", line 16, in <module>
  File "/tmp/spark-df7bc8fa-233f-4124-855b-4a39fa948c1a/userFiles-ab70ffa3-212b-47ee-9611-9c240d3ce899/TargetHolding_pyspark-cassandra-0.3.5.jar/pyspark_cassandra/rdd.py", line 291
    k = Row(**{c: row.__getattr__(c) for c in columns})
                ^
SyntaxError: invalid syntax
>>>
```
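Note that this `SyntaxError` is not a packaging problem: `rdd.py` line 291 uses a dict comprehension, which was only introduced in Python 2.7, while the session above runs Python 2.6.6, so the file fails at parse time. A minimal sketch of the incompatibility (the column names and row dict are illustrative stand-ins for the real `Row` object):

```python
columns = ["name", "age"]          # illustrative column names
row = {"name": "ada", "age": 36}   # stand-in for a real Row object

# Python >= 2.7 form, as used by pyspark_cassandra's rdd.py
# (raises SyntaxError on Python 2.6):
k = {c: row[c] for c in columns}

# Python 2.6-compatible equivalent:
k_26 = dict((c, row[c]) for c in columns)

assert k == k_26
```

So even once the jars are on the path, this particular package still needs a Python 2.7+ interpreter on the driver and executors.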
```scala
// If we're running a python app, set the main class to our specific python runner
if (args.isPython && deployMode == CLIENT) {
  ...
  if (clusterManager != YARN) {
    // The YARN backend handles python files differently, so don't merge the lists.
    args.files = mergeFileLists(args.files, args.pyFiles)
  }
```
Any update? This is one of the blockers for 0.7.
I still think this is not a correct fix, since it doesn't resolve yarn-client mode, which I believe most users use.
As I said before, why not do it as the README shows? Users can use
And even in Spark, we can use
Could you tell me your env?
Sorry, I missed your last reply. I use the following command to launch pyspark and get the error below. Launch pyspark (I am using Spark 2.1.0):
It fails to import pyspark_cassandra:
Thanks, @1ambda. Do you mind creating a Spark ticket as well? The behavior inconsistency between different modes seems to be a Spark issue; we need to clarify it with the Spark community.
For reviewers: fixed to use
The Spark code we discussed came from apache/spark#6360. It seems intended, so it's OK not to raise an issue.
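For context, the final fix reads the submitted jars from the `spark.jars` conf entry instead of the classpath. The idea can be sketched roughly as follows; the helper name and the jar path are illustrative, not Zeppelin's actual code:

```python
import sys

def add_submitted_jars_to_sys_path(spark_jars):
    """Append each jar from a comma-separated ``spark.jars`` value to sys.path.

    Hypothetical helper illustrating the approach: packages submitted via
    --packages are recorded in the ``spark.jars`` Spark conf entry, and
    putting those jar paths on sys.path lets Python import pure-Python
    modules bundled inside them.
    """
    for jar in spark_jars.split(","):
        jar = jar.strip()
        if jar and jar not in sys.path:
            sys.path.insert(0, jar)

# Illustrative path, not a real artifact location:
add_submitted_jars_to_sys_path("/tmp/ivy/TargetHolding_pyspark-cassandra-0.3.5.jar")
```

Because `spark.jars` is populated consistently by `spark-submit`, this avoids depending on how each cluster manager (notably YARN) merges `--files` and `--py-files`.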
### What is this PR for?
Fixed importing packages in pyspark requested by `SPARK_SUBMIT_OPTIONS`.

### What type of PR is it?
[Bug Fix]

### Todos
Nothing

### What is the Jira issue?
[ZEPPELIN-1883](https://issues.apache.org/jira/browse/ZEPPELIN-1883)

### How should this be tested?
0. Download Apache Spark 1.6.2 (since it's the most recent version supported by pyspark-cassandra)
1. Set `SPARK_HOME` and `SPARK_SUBMIT_OPTIONS` in `conf/zeppelin-env.sh` like

```sh
export SPARK_HOME="~/github/apache-spark/1.6.2-bin-hadoop2.6"
export SPARK_SUBMIT_OPTIONS="--packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2,TargetHolding:pyspark-cassandra:0.3.5 --exclude-packages org.slf4j:slf4j-api"
```

2. Check beforehand whether you can run `spark-submit` or not

```
./bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2,TargetHolding:pyspark-cassandra:0.3.5 --exclude-packages org.slf4j:slf4j-api --class org.apache.spark.examples.SparkPi lib/spark-examples-1.6.2-hadoop2.6.0.jar
```

3. Test whether submitted packages can be imported or not

```
%pyspark
import pyspark_cassandra
```

### Screenshots (if appropriate)

```
import pyspark_cassandra
Traceback (most recent call last):
  File "/var/folders/lr/8g9y625n5j39rz6qhkg8s6640000gn/T/zeppelin_pyspark-5266742863961917074.py", line 267, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/var/folders/lr/8g9y625n5j39rz6qhkg8s6640000gn/T/zeppelin_pyspark-5266742863961917074.py", line 265, in <module>
    exec(code)
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra
```

### Questions:
* Do the license files need an update? - NO
* Are there breaking changes for older versions? - NO
* Does this need documentation? - NO

Author: 1ambda <1amb4a@gmail.com>

Closes #1831 from 1ambda/ZEPPELIN-1883/cant-import-submitted-packages-in-pyspark and squashes the following commits:

585d48a [1ambda] Use spark.jars instead of classpath
f76d2c8 [1ambda] fix: Do not extend PYTHONPATH in yarn-client
c735bd5 [1ambda] fix: Import spark submit packages in pyspark

(cherry picked from commit cb8e418)
Signed-off-by: Lee moon soo <moon@apache.org>