[ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node #118
@@ -77,3 +77,4 @@ auto-save-list
 tramp
 .\#*
 *.swp
+**/dependency-reduced-pom.xml
@@ -73,6 +73,19 @@ if [[ ! -d "${ZEPPELIN_LOG_DIR}" ]]; then
   $(mkdir -p "${ZEPPELIN_LOG_DIR}")
 fi
 
+if [[ ! -z "${SPARK_HOME}" ]]; then
+  PYSPARKPATH="${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
+else
+  PYSPARKPATH="${ZEPPELIN_HOME}/interpreter/spark/pyspark/pyspark.zip:${ZEPPELIN_HOME}/interpreter/spark/pyspark/py4j-0.8.2.1-src.zip"
+fi
+
+if [[ x"" == x"${PYTHONPATH}" ]]; then
+  export PYTHONPATH="${PYSPARKPATH}"
+else
+  export PYTHONPATH="${PYTHONPATH}:${PYSPARKPATH}"
+fi
+
+unset PYSPARKPATH
+
 ${ZEPPELIN_RUNNER} ${JAVA_INTP_OPTS} -cp ${CLASSPATH} ${ZEPPELIN_SERVER} ${PORT} &
 pid=$!

Review comments:
- How about moving this conditional block to above the …
- SPARK_HOME affects a lot of things, though; should this be setting it to ZEPPELIN_HOME?
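The PYTHONPATH handling added above can be exercised on its own. Here is a minimal sketch of the same logic; the SPARK_HOME and ZEPPELIN_HOME values are placeholder paths for illustration (the real script inherits them from the environment):

```shell
# Sketch of the PYTHONPATH handling added to bin/interpreter.sh.
# SPARK_HOME / ZEPPELIN_HOME are stand-in values, not real installs.
SPARK_HOME="/opt/spark"
ZEPPELIN_HOME="/opt/zeppelin"

# Prefer the zips shipped with an existing Spark install; otherwise fall
# back to the copies the pyspark build profile places under the interpreter.
if [ -n "${SPARK_HOME}" ]; then
  PYSPARKPATH="${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
else
  PYSPARKPATH="${ZEPPELIN_HOME}/interpreter/spark/pyspark/pyspark.zip:${ZEPPELIN_HOME}/interpreter/spark/pyspark/py4j-0.8.2.1-src.zip"
fi

# Append to any PYTHONPATH the user already exported instead of clobbering it.
if [ -z "${PYTHONPATH}" ]; then
  export PYTHONPATH="${PYSPARKPATH}"
else
  export PYTHONPATH="${PYTHONPATH}:${PYSPARKPATH}"
fi
unset PYSPARKPATH
echo "${PYTHONPATH}"
```

Because the value is appended rather than assigned, a user-provided PYTHONPATH (for example, one pointing at local site-packages) still takes effect ahead of the bundled zips.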
@@ -48,6 +48,8 @@
 
 <akka.group>org.spark-project.akka</akka.group>
 <akka.version>2.3.4-spark</akka.version>
 
+<spark.download.url>http://www.apache.org/dist/spark/spark-${spark.version}/spark-${spark.version}.tgz</spark.download.url>
+
 </properties>
 
 <repositories>

Review comments:
- We need the Python files, and I think it's a good idea to download them from the vanilla Spark package.
- @bzz: Just a question: does that mean every clean build of Zeppelin, on CI and elsewhere, will download the full Spark distribution?
- Reply: @bzz Nope. download-maven-plugin keeps a cache under ~/.m2/repository/.cache; if the same file is requested again, Maven checks that location first.
- @bzz: "Clean build" in my previous message means CI on a Travis virtual machine as well as a freshly installed development environment. So are you sure it will not download it in those cases?
- Reply: @bzz I've found the line below in my failed test. You have a question about this situation, don't you?
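Per the thread above, download-maven-plugin reuses archives cached under ~/.m2/repository/.cache, so only the first build on a machine should fetch the Spark tarball. A quick, read-only way to check what is cached locally (the path comes from the comment above; the exact layout inside the cache may differ between plugin versions):

```shell
# Count Spark tarballs in the download-maven-plugin cache, if any exist.
# CACHE_DIR is the location cited in the review thread.
CACHE_DIR="${HOME}/.m2/repository/.cache"
if [ -d "${CACHE_DIR}" ]; then
  CACHED=$(find "${CACHE_DIR}" -type f -name 'spark-*.tgz' | wc -l | tr -d ' ')
else
  CACHED=0
fi
echo "cached spark archives: ${CACHED}"
```

On a fresh Travis VM or a newly set-up development machine this count is zero, which is consistent with @bzz's point: those environments do pay the download cost once.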
@@ -473,13 +475,6 @@
       </exclusions>
     </dependency>
 
-    <!-- pyspark -->
-    <dependency>
-      <groupId>net.sf.py4j</groupId>
-      <artifactId>py4j</artifactId>
-      <version>0.8.2.1</version>
-    </dependency>
-
     <dependency>
       <groupId>org.apache.commons</groupId>
       <artifactId>commons-exec</artifactId>
@@ -731,6 +726,74 @@
       </dependencies>
     </profile>
 
+    <profile>
+      <id>pyspark</id>
+      <properties>
+        <spark.download.url>http://www.apache.org/dist/spark/spark-${spark.version}/spark-${spark.version}.tgz</spark.download.url>
+      </properties>
+      <build>
+        <plugins>
+          <plugin>
+            <groupId>com.googlecode.maven-download-plugin</groupId>
+            <artifactId>download-maven-plugin</artifactId>
+            <version>1.2.1</version>
+            <executions>
+              <execution>
+                <id>download-pyspark-files</id>
+                <phase>validate</phase>
+                <goals>
+                  <goal>wget</goal>
+                </goals>
+                <configuration>
+                  <url>${spark.download.url}</url>
+                  <unpack>true</unpack>
+                  <outputDirectory>${project.build.directory}/spark-dist</outputDirectory>
+                </configuration>
+              </execution>
+            </executions>
+          </plugin>
+          <plugin>
+            <artifactId>maven-clean-plugin</artifactId>
+            <configuration>
+              <filesets>
+                <fileset>
+                  <directory>${basedir}/../python/build</directory>
+                </fileset>
+                <fileset>
+                  <directory>${project.build.directory}/spark-dist</directory>
+                </fileset>
+              </filesets>
+            </configuration>
+          </plugin>
+          <plugin>
+            <groupId>org.apache.maven.plugins</groupId>
+            <artifactId>maven-antrun-plugin</artifactId>
+            <version>1.7</version>
+            <executions>
+              <execution>
+                <id>download-and-zip-pyspark-files</id>
+                <phase>generate-resources</phase>
+                <goals>
+                  <goal>run</goal>
+                </goals>
+                <configuration>
+                  <target>
+                    <delete dir="../interpreter/spark/pyspark"/>
+                    <copy todir="../interpreter/spark/pyspark"
+                          file="${project.build.directory}/spark-dist/spark-${spark.version}/python/lib/py4j-0.8.2.1-src.zip"/>
+                    <zip destfile="${project.build.directory}/../../interpreter/spark/pyspark/pyspark.zip"
+                         basedir="${project.build.directory}/spark-dist/spark-${spark.version}/python"
+                         includes="pyspark/*.py,pyspark/**/*.py"/>
+                  </target>
+                </configuration>
+              </execution>
+            </executions>
+          </plugin>
+        </plugins>
+      </build>
+    </profile>
+
 <!-- Build without Hadoop dependencies that are included in some runtime environments. -->
 <profile>
   <id>hadoop-provided</id>

Review comments:
- @felixcheung: I'd suggest calling this pyspark_local or something; for those who already have pyspark in their cluster, this should not be necessary?
- Reply: @felixcheung Yes, if you have already installed Spark and use SPARK_HOME, you don't need to build Zeppelin with this profile.
Review comment:
- @Leemoonsoo In the case of PYTHONPATH in PysparkInterpreter, it is not affected by the Python driver. In the past, Zeppelin loaded pyspark from SPARK_HOME/python.