
[ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node #118

Closed
wants to merge 22 commits into from

Conversation

jongyoul
Member

export PYTHONPATH="${ZEPPELIN_HOME}/python/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
else
export PYTHONPATH="$PYTHONPATH${ZEPPELIN_HOME}/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
fi
Member Author
@Leemoonsoo In the case of PYTHONPATH in PySparkInterpreter, it's not affected by the Python driver. In the past, Zeppelin loaded pyspark from SPARK_HOME/python.

@bzz
Member

bzz commented Jun 24, 2015

Raising a question about the necessity of downloading Spark on every dev machine, because I recently heard a joke from @anthonycorbacho (somebody at the Strata conference):

"Want to download the whole internet? Build Zeppelin!"

I.e. the .m2 dir on my machine is about 4 GB already.

Is there a reason we can't have it under some kind of profile?
So users who need it would do something like -Pyarn-python.

@jongyoul
Member Author

Do you think it's a good idea to make a profile? There's no problem making one. I'll handle it so that joke stays just a joke.

@jongyoul
Member Author

Can anyone help me figure out why the last build failed?

@Leemoonsoo
Member

You can download the log file and search for "BUILD FAILURE".

@jongyoul
Member Author

@Leemoonsoo thanks. I'll check it tomorrow. And what do you think about making a profile for yarn-pyspark?

@Leemoonsoo
Member

Making a profile for pyspark is not a bad idea. However, pyspark works not only with YARN but also with Mesos and standalone clusters. So I think the profile should look like -Ppyspark rather than -Pyarn-pyspark.
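For reference, an opt-in profile like the one discussed could be sketched in the pom roughly as follows. This is a hypothetical illustration, not the exact contents of the PR; the profile id and plugin wiring are assumptions:

```xml
<!-- Hypothetical sketch of an opt-in profile: only when a user builds with
     `mvn package -Ppyspark` would the Spark distribution be downloaded and
     pyspark.zip / py4j-*-src.zip repackaged into Zeppelin's python/lib. -->
<profile>
  <id>pyspark</id>
  <build>
    <plugins>
      <!-- download/unpack/zip plugin executions would go here -->
    </plugins>
  </build>
</profile>
```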

@jongyoul
Member Author

I've rebased

@jongyoul
Member Author

@bzz @Leemoonsoo Review this again, please.

@Leemoonsoo
Member

@jongyoul
Thanks for the great contribution. It really helps not only YARN but also local mode and standalone clusters.
LGTM!

@bzz
Member

bzz commented Jun 26, 2015

Thank you very much for contributing this!

It would be great to have a high-level summary of the changes, so please correct me in case I misunderstand something:

This PR lets pyspark users skip setting the PYTHONPATH env var, copying Python modules to every node of the cluster, and installing Spark (in the case of pyspark in local mode on one machine). It does this by adding a new artifact to the Zeppelin build, a python dir hidden behind an optional build profile, which brings in py4j as well as the Python code of pyspark by downloading (and caching) an actual Spark distribution and re-packing those into zip files available on the Zeppelin classpath at runtime.

Is that correct?
If it is, that sounds great to me; pyspark not working in Zeppelin with the local interpreter without Spark installed was very frustrating.

One question: the python dir is not a Maven submodule now, but maybe it should be one day. What do you guys think?
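As background on why shipping pyspark as zip files works at all: Python's import system can treat a zip archive on sys.path like a package directory (zipimport, PEP 273), which is what putting pyspark.zip and the py4j src zip on PYTHONPATH relies on. A minimal self-contained sketch (the module names here are made up for illustration):

```python
# Demonstrate importing a package directly from a zip on sys.path,
# the same mechanism that makes pyspark.zip on PYTHONPATH work.
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "mylib.zip")

# Build a tiny zip "library" with one package and one module.
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mylib/__init__.py", "")
    zf.writestr("mylib/hello.py", "def greet():\n    return 'hello from zip'\n")

# Adding the zip to sys.path has the same effect as listing it in PYTHONPATH.
sys.path.insert(0, zip_path)
from mylib.hello import greet

print(greet())  # -> hello from zip
```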

@jongyoul
Member Author

@bzz Your understanding is correct. I'm worried about Mesos cluster mode because I haven't tested my PR on that kind of cluster yet, but standalone clusters should be OK because they already have the Python libraries. I've only tested local mode and YARN cluster mode. Concerning the directory name python: I think a Python interpreter will be added someday, so can you recommend a directory name?

if [[ x"" == x${PYTHONPATH} ]]; then
export PYTHONPATH="${ZEPPELIN_HOME}/python/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
else
export PYTHONPATH="$PYTHONPATH${ZEPPELIN_HOME}/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
Member

How about adding a colon (:)?
from ....="$PYTHONPATH${ZEPPELIN_HOME}/lib/py.... to ....="$PYTHONPATH:${ZEPPELIN_HOME}/lib/py....

Member

Right, I'm a bit confused by this. Isn't it supposed to have a colon?

Member Author

@Leemoonsoo @felixcheung Yes, you guys are right. I didn't catch this error because I don't set an extra PYTHONPATH at all and didn't test that case. I'll fix it.
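A corrected version of the hunk above would add the colon the reviewers asked for when appending to a non-empty PYTHONPATH (sketch only; the ZEPPELIN_HOME default below is an assumption for illustration, and the POSIX `-z` test replaces the original `x"" == x${PYTHONPATH}` idiom):

```shell
# Append Zeppelin's bundled pyspark/py4j zips to PYTHONPATH, with a colon
# separator when PYTHONPATH is already set.
ZEPPELIN_HOME="${ZEPPELIN_HOME:-/opt/zeppelin}"   # assumed example install dir
if [ -z "${PYTHONPATH:-}" ]; then
  export PYTHONPATH="${ZEPPELIN_HOME}/python/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
else
  export PYTHONPATH="${PYTHONPATH}:${ZEPPELIN_HOME}/python/lib/pyspark.zip:${ZEPPELIN_HOME}/python/lib/py4j-0.8.2.1-src.zip"
fi
```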

@jongyoul
Member Author

I'm trying to remove some unused settings. Eventually, I hope users won't need to set SPARK_HOME to use pyspark.

@jongyoul jongyoul closed this Jul 3, 2015
@jongyoul jongyoul reopened this Jul 3, 2015
@jongyoul
Member Author

jongyoul commented Jul 3, 2015

I've heard that closing and reopening an issue is the easiest way to trigger Travis, but in my case I have a merge conflict.

- Removed redundant dependency setting
- Excludes python/** from apache-rat
- Changed the location of pyspark's directory into interpreter/spark
@jongyoul
Member Author

jongyoul commented Jul 4, 2015

@Leemoonsoo I've rebased it again, and all of the Travis tests passed.

@Leemoonsoo
Member

Thanks @jongyoul. Great work!
Looks good to me.

@asfgit asfgit closed this in 3bd2b21 Jul 5, 2015
asfgit pushed a commit that referenced this pull request Jul 5, 2015
…very yarn node

- Spark supports pyspark on yarn cluster without deploying python libraries from Spark 1.4
 - https://issues.apache.org/jira/browse/SPARK-6869
 - apache/spark#5580, apache/spark#5478

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #118 from jongyoul/ZEPPELIN-18 and squashes the following commits:

a47e27c [Jongyoul Lee] - Fixed test script for spark 1.4.0
72a65fd [Jongyoul Lee] - Fixed test script for spark 1.4.0
ee6d100 [Jongyoul Lee] - Cleanup codes
47fd9c9 [Jongyoul Lee] - Cleanup codes
248e330 [Jongyoul Lee] - Cleanup codes
4cd10b5 [Jongyoul Lee] - Removed meaningless codes comments
c9cda29 [Jongyoul Lee] - Removed setting SPARK_HOME - Changed the location of pyspark's directory into interpreter/spark
ef240f5 [Jongyoul Lee] - Fixed typo
06002fd [Jongyoul Lee] - Fixed typo
4b35c8d [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Dummy for trigger
682986e [Jongyoul Lee] rebased
8a7bf47 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing
ad610fb [Jongyoul Lee] rebased
94bdf30 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Fixed checkstyle
929333d [Jongyoul Lee] rebased
64b8195 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing
0a2d90e [Jongyoul Lee] rebased
b05ae6e [Jongyoul Lee] [ZEPPELIN-18] Remove setting SPARK_HOME for PySpark - Excludes python/** from apache-rat
71e2a92 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Removed verbose setting
0ddb436 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Followed spark's way to support pyspark - https://issues.apache.org/jira/browse/SPARK-6869 - apache/spark#5580 - https://github.com/apache/spark/pull/5478/files
1b192f6 [Jongyoul Lee] [ZEPPELIN-18] Remove setting SPARK_HOME for PySpark - Removed redundant dependency setting
32fd9e1 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing

(cherry picked from commit 3bd2b21)
Signed-off-by: Lee moon soo <moon@apache.org>