
Setting up a PySpark 2.0 notebook with MLeap and Toree

Mikhail Semeniuk edited this page Nov 7, 2016 · 1 revision

Setting up a Spark 2.0 notebook with MLeap and Toree

Install Jupyter and Toree

We are going to assume you already have the following installed:

  1. Python 2.x
  2. Docker (required to install Toree)

Install Jupyter

virtualenv venv

source ./venv/bin/activate

pip install jupyter

Build and install Toree

Clone the master branch from Toree's GitHub repo into your working directory.

For this next step, make sure that Docker is running.

$ cd incubator-toree

$ make release

$ cd dist/toree-pip

$ pip install toree-0.2.0.dev1.tar.gz

$ SPARK_HOME=<path to spark> jupyter toree install --interpreters=PySpark

Launch Notebook and Include MLeap

The most error-proof way to add MLeap to your project is to modify the kernel config directly (or create a new kernel for Toree and Spark 2.0).

Kernel config files are typically located in /usr/local/share/jupyter/kernels/apache_toree_pyspark/kernel.json
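If you are not sure which kernels are installed in that directory, a short script can list them. This is a minimal sketch using only the standard library; the function name list_kernels is our own, and the directory passed in is an assumption based on the path above, so adjust it for your system.

```python
import json
import os


def list_kernels(kernels_dir):
    """Return {directory name: display name} for each kernel spec found."""
    found = {}
    if not os.path.isdir(kernels_dir):
        return found
    for name in sorted(os.listdir(kernels_dir)):
        spec_path = os.path.join(kernels_dir, name, "kernel.json")
        # Only directories that contain a kernel.json count as kernels.
        if os.path.isfile(spec_path):
            with open(spec_path) as f:
                spec = json.load(f)
            found[name] = spec.get("display_name", name)
    return found


if __name__ == "__main__":
    # Assumed location, taken from the path mentioned above; adjust as needed.
    kernels = list_kernels("/usr/local/share/jupyter/kernels")
    for name in sorted(kernels):
        print("%s -> %s" % (name, kernels[name]))
```

If apache_toree_pyspark does not show up, the kernel was probably installed somewhere else (for example, under your user directory), and that is the kernel.json you would need to edit.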

Go ahead and add or modify the __TOREE_SPARK_OPTS__ and PYTHONPATH entries like so:

"__TOREE_SPARK_OPTS__": "--packages com.databricks:spark-avro_2.11:3.0.1,ml.combust.mleap:mleap-spark_2.11:0.4.0", 
"PYTHONPATH": "/usr/local/spark-2.0.0-bin-hadoop2.7/python:/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip:<path to mleap>/python"
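If you would rather script this change than edit the file by hand, something like the following should work. It is a sketch, not part of Toree or MLeap: the helper name set_kernel_env is hypothetical, and it assumes the two entries live under the "env" key of kernel.json, which is where Jupyter kernel specs keep environment variables.

```python
import json


def set_kernel_env(kernel_json_path, updates):
    """Merge `updates` into the "env" section of a kernel.json file."""
    with open(kernel_json_path) as f:
        spec = json.load(f)
    # Create the "env" section if the kernel spec does not have one yet.
    env = spec.setdefault("env", {})
    env.update(updates)
    with open(kernel_json_path, "w") as f:
        json.dump(spec, f, indent=2)
    return spec


# Example call (commented out so nothing is modified by accident).
# The values are the ones shown above; <path to mleap> is left for you to fill in.
#
# set_kernel_env(
#     "/usr/local/share/jupyter/kernels/apache_toree_pyspark/kernel.json",
#     {
#         "__TOREE_SPARK_OPTS__": "--packages com.databricks:spark-avro_2.11:3.0.1,"
#                                 "ml.combust.mleap:mleap-spark_2.11:0.4.0",
#         "PYTHONPATH": "/usr/local/spark-2.0.0-bin-hadoop2.7/python:"
#                       "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip:"
#                       "<path to mleap>/python",
#     },
# )
```

Other entries already present in the kernel's environment are left untouched. Writing under /usr/local/share may require elevated permissions, and you need to restart the kernel for the change to take effect.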

An alternative is to use the %AddDeps magic, but we've run into dependency collisions with it, so use it at your own risk:

%AddDeps ml.combust.mleap mleap-spark_2.11 0.4.0 --transitive