[ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node #118

Closed
wants to merge 22 commits
Commits
32fd9e1
[ZEPPELIN-18] Running pyspark without deploying python libraries to e…
jongyoul Jun 25, 2015
1b192f6
[ZEPPELIN-18] Remove setting SPARK_HOME for PySpark
jongyoul Jun 21, 2015
0ddb436
[ZEPPELIN-18] Running pyspark without deploying python libraries to e…
jongyoul Jun 24, 2015
71e2a92
[ZEPPELIN-18] Running pyspark without deploying python libraries to e…
jongyoul Jun 24, 2015
b05ae6e
[ZEPPELIN-18] Remove setting SPARK_HOME for PySpark
jongyoul Jun 24, 2015
0a2d90e
rebased
jongyoul Jul 3, 2015
64b8195
[ZEPPELIN-18] Running pyspark without deploying python libraries to e…
jongyoul Jun 25, 2015
929333d
rebased
jongyoul Jul 3, 2015
94bdf30
[ZEPPELIN-18] Running pyspark without deploying python libraries to e…
jongyoul Jun 24, 2015
ad610fb
rebased
jongyoul Jul 3, 2015
8a7bf47
[ZEPPELIN-18] Running pyspark without deploying python libraries to e…
jongyoul Jun 25, 2015
682986e
rebased
jongyoul Jul 3, 2015
4b35c8d
[ZEPPELIN-18] Running pyspark without deploying python libraries to e…
jongyoul Jun 25, 2015
06002fd
- Fixed typo
jongyoul Jun 29, 2015
ef240f5
- Fixed typo
jongyoul Jul 2, 2015
c9cda29
- Removed setting SPARK_HOME
jongyoul Jul 3, 2015
4cd10b5
- Removed meaningless codes comments
jongyoul Jul 3, 2015
248e330
- Cleanup codes
jongyoul Jul 3, 2015
47fd9c9
- Cleanup codes
jongyoul Jul 3, 2015
ee6d100
- Cleanup codes
jongyoul Jul 3, 2015
72a65fd
- Fixed test script for spark 1.4.0
jongyoul Jul 3, 2015
a47e27c
- Fixed test script for spark 1.4.0
jongyoul Jul 4, 2015
1 change: 1 addition & 0 deletions .gitignore
@@ -77,3 +77,4 @@ auto-save-list
tramp
.\#*
*.swp
**/dependency-reduced-pom.xml
6 changes: 3 additions & 3 deletions .travis.yml
@@ -22,16 +22,16 @@ before_install:
- "sh -e /etc/init.d/xvfb start"

install:
- mvn package -DskipTests -Phadoop-2.3 -B
- mvn package -DskipTests -Phadoop-2.3 -Ppyspark -B

before_script:
-

script:
# spark 1.4
- mvn package -Pbuild-distr -Phadoop-2.3 -B
- mvn package -Pbuild-distr -Phadoop-2.3 -Ppyspark -B
- ./testing/startSparkCluster.sh 1.4.0 2.3
- SPARK_HOME=./spark-1.4.1-bin-hadoop2.3 mvn verify -Pusing-packaged-distr -Phadoop-2.3 -B
- SPARK_HOME=`pwd`/spark-1.4.0-bin-hadoop2.3 mvn verify -Pusing-packaged-distr -Phadoop-2.3 -Ppyspark -B
- ./testing/stopSparkCluster.sh 1.4.0 2.3
# spark 1.3
- mvn clean package -DskipTests -Pspark-1.3 -Phadoop-2.3 -B -pl 'zeppelin-interpreter,spark'
13 changes: 13 additions & 0 deletions bin/interpreter.sh
@@ -73,6 +73,19 @@ if [[ ! -d "${ZEPPELIN_LOG_DIR}" ]]; then
$(mkdir -p "${ZEPPELIN_LOG_DIR}")
fi

if [[ ! -z "${SPARK_HOME}" ]]; then
PYSPARKPATH="${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
else
PYSPARKPATH="${ZEPPELIN_HOME}/interpreter/spark/pyspark/pyspark.zip:${ZEPPELIN_HOME}/interpreter/spark/pyspark/py4j-0.8.2.1-src.zip"
fi
Member Author:

@Leemoonsoo Regarding PYTHONPATH for PySparkInterpreter: it is not affected by the Python driver. In the past, Zeppelin loaded pyspark from SPARK_HOME/python.


if [[ x"" == x"${PYTHONPATH}" ]]; then
export PYTHONPATH="${PYSPARKPATH}"
else
export PYTHONPATH="${PYTHONPATH}:${PYSPARKPATH}"
fi

Member:

How about moving this conditional block above the if [[ x"" == x${PYTHONPATH} ]]; then ... conditional block and using SPARK_HOME instead of ZEPPELIN_HOME to define PYTHONPATH?
When SPARK_HOME is provided by the user, PYTHONPATH and SPARK_HOME would otherwise point to two different directories, which is a little confusing.

i.e.:

if [[ x"" == x${SPARK_HOME} ]]; then
  export SPARK_HOME=${ZEPPELIN_HOME}
fi

if [[ x"" == x${PYTHONPATH} ]]; then
  export PYTHONPATH="${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
else
  export PYTHONPATH="$PYTHONPATH${SPARK_HOME}/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
fi

Member:

SPARK_HOME affects a lot of things, though. Should this be setting it to ZEPPELIN_HOME?

unset PYSPARKPATH

${ZEPPELIN_RUNNER} ${JAVA_INTP_OPTS} -cp ${CLASSPATH} ${ZEPPELIN_SERVER} ${PORT} &
pid=$!
77 changes: 70 additions & 7 deletions spark/pom.xml
@@ -48,6 +48,8 @@

<akka.group>org.spark-project.akka</akka.group>
<akka.version>2.3.4-spark</akka.version>

<spark.download.url>http://www.apache.org/dist/spark/spark-${spark.version}/spark-${spark.version}.tgz</spark.download.url>
Member Author:

We need the python files. I think it's a good idea to download them from the vanilla Spark package.

Member:

Just a question: does this mean that every clean build of Zeppelin, on CI and elsewhere, will download the full Spark distribution?

Member Author:

@bzz Nope. download-maven-plugin keeps a cache under ~/.m2/repository/.cache. If the same file is requested again, Maven checks this location first.

Member:

"Clean build" in my previous message means CI on a Travis virtual machine as well as a freshly installed development environment.

So, are you sure it will not download it in those cases?

Member Author:

@bzz I've found the lines below in my failed test:

[INFO] --- download-maven-plugin:1.2.1:wget (download-pyspark-files) @ zeppelin-spark ---
[INFO] Got from cache: /home/travis/.m2/repository/.cache/maven-download-plugin/spark-1.3.1.tgz
[INFO] Expanding: /home/travis/build/apache/incubator-zeppelin/spark/target/spark-dist/spark-1.3.1.tgz into /home/travis/build/apache/incubator-zeppelin/spark/target/spark-dist

Does this answer your question?
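For reference, a quick way to confirm the cache is being reused on a development machine (a rough sketch; the directory name is taken from the log above and assumes the default ~/.m2 local repository):

ls -lh ~/.m2/repository/.cache/maven-download-plugin/
# if spark-<version>.tgz already appears here, the next -Ppyspark build reuses it
# instead of downloading the distribution again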

</properties>

<repositories>
@@ -473,13 +475,6 @@
</exclusions>
</dependency>

<!-- pyspark -->
<dependency>
<groupId>net.sf.py4j</groupId>
<artifactId>py4j</artifactId>
<version>0.8.2.1</version>
</dependency>

<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-exec</artifactId>
@@ -731,6 +726,74 @@
</dependencies>
</profile>

<profile>
<id>pyspark</id>
Member:

I'd suggest calling this pyspark_local or something - for those who have pyspark already in their cluster, this should not be necessary?

Member Author:

@felixcheung Yes, if you have already installed Spark and set SPARK_HOME, you don't need to build Zeppelin with this profile.
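For illustration, the two build modes would look roughly like this (command sketches based on the .travis.yml change above; the SPARK_HOME path is hypothetical):

# build with the bundled pyspark/py4j archives (no SPARK_HOME needed at runtime)
mvn clean package -DskipTests -Phadoop-2.3 -Ppyspark

# or build without the profile and point Zeppelin at an existing Spark installation
mvn clean package -DskipTests -Phadoop-2.3
export SPARK_HOME=/path/to/spark-1.4.0-bin-hadoop2.3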

<properties>
<spark.download.url>http://www.apache.org/dist/spark/spark-${spark.version}/spark-${spark.version}.tgz
</spark.download.url>
</properties>
<build>
<plugins>
<plugin>
<groupId>com.googlecode.maven-download-plugin</groupId>
<artifactId>download-maven-plugin</artifactId>
<version>1.2.1</version>
<executions>
<execution>
<id>download-pyspark-files</id>
<phase>validate</phase>
<goals>
<goal>wget</goal>
</goals>
<configuration>
<url>${spark.download.url}</url>
<unpack>true</unpack>
<outputDirectory>${project.build.directory}/spark-dist</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-clean-plugin</artifactId>
<configuration>
<filesets>
<fileset>
<directory>${basedir}/../python/build</directory>
</fileset>
<fileset>
<directory>${project.build.directory}/spark-dist</directory>
</fileset>
</filesets>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-antrun-plugin</artifactId>
<version>1.7</version>
<executions>
<execution>
<id>download-and-zip-pyspark-files</id>
<phase>generate-resources</phase>
<goals>
<goal>run</goal>
</goals>
<configuration>
<target>
<delete dir="../interpreter/spark/pyspark"/>
<copy todir="../interpreter/spark/pyspark"
file="${project.build.directory}/spark-dist/spark-${spark.version}/python/lib/py4j-0.8.2.1-src.zip"/>
<zip destfile="${project.build.directory}/../../interpreter/spark/pyspark/pyspark.zip"
basedir="${project.build.directory}/spark-dist/spark-${spark.version}/python"
includes="pyspark/*.py,pyspark/**/*.py"/>
</target>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>

<!-- Build without Hadoop dependencies that are included in some runtime environments. -->
<profile>
<id>hadoop-provided</id>
@@ -159,18 +159,6 @@ public void open() {
try {
Map env = EnvironmentUtils.getProcEnvironment();

String pythonPath = (String) env.get("PYTHONPATH");
if (pythonPath == null) {
pythonPath = "";
} else {
pythonPath += ":";
}

pythonPath += getSparkHome() + "/python/lib/py4j-0.8.2.1-src.zip:"
+ getSparkHome() + "/python";

env.put("PYTHONPATH", pythonPath);

executor.execute(cmd, env, this);
pythonscriptRunning = true;
} catch (IOException e) {
@@ -26,12 +26,9 @@
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.*;

import com.google.common.base.Joiner;
import org.apache.spark.HttpServer;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
@@ -273,6 +270,34 @@ public SparkContext createSparkContext() {
}
}

//TODO(jongyoul): Move these codes into PySparkInterpreter.java

String pysparkBasePath = getSystemDefault("SPARK_HOME", "spark.home", null);
File pysparkPath;
if (null == pysparkBasePath) {
pysparkBasePath = getSystemDefault("ZEPPELIN_HOME", "zeppelin.home", "../");
pysparkPath = new File(pysparkBasePath,
"interpreter" + File.separator + "spark" + File.separator + "pyspark");
} else {
pysparkPath = new File(pysparkBasePath,
"python" + File.separator + "lib");
}

String[] pythonLibs = new String[]{"pyspark.zip", "py4j-0.8.2.1-src.zip"};
ArrayList<String> pythonLibUris = new ArrayList<>();
for (String lib : pythonLibs) {
File libFile = new File(pysparkPath, lib);
if (libFile.exists()) {
pythonLibUris.add(libFile.toURI().toString());
}
}
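// Ship the archives to the cluster only when every expected file was found locally:
// spark.yarn.dist.files distributes them to the YARN containers, and
// spark.submit.pyArchives registers them as Python archives for the executors.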
pythonLibUris.trimToSize();
if (pythonLibs.length == pythonLibUris.size()) {
conf.set("spark.yarn.dist.files", Joiner.on(",").join(pythonLibUris));
conf.set("spark.files", conf.get("spark.yarn.dist.files"));
conf.set("spark.submit.pyArchives", Joiner.on(":").join(pythonLibs));
}

SparkContext sparkContext = new SparkContext(conf);
return sparkContext;
}
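As a rough usage sketch (illustrative values only, not part of this change): with a -Ppyspark build, the Spark interpreter can target YARN without installing pyspark on the worker nodes, e.g. via conf/zeppelin-env.sh:

# conf/zeppelin-env.sh (illustrative sketch)
export MASTER=yarn-client                  # driver runs locally, executors on YARN
export HADOOP_CONF_DIR=/etc/hadoop/conf    # hypothetical path to the cluster's Hadoop config
# SPARK_HOME is optional here: when it is unset, interpreter.sh falls back to the
# pyspark.zip / py4j-0.8.2.1-src.zip bundled under interpreter/spark/pyspark/, and
# SparkInterpreter ships them to the YARN containers via spark.yarn.dist.files.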