# How to install PySpark on Jupyter notebook with Anaconda

### The setup has two steps:

    1- Installing PySpark
    2- Activating PySpark in Jupyter
        - Configuring PySpark driver to use Jupyter notebook by running 'pyspark' command 
        - Loading a regular Jupyter notebook and importing PySpark using 'findspark' library

### Step 1: Installing PySpark

You can download Spark from the [downloads page](http://spark.apache.org/downloads.html). Make sure to pick a package that is *pre-built for Apache Hadoop 2.x or later*. Before installing it, make sure that you have Java 8 or higher installed.
Also, you should already have Jupyter Notebook with Python 3 installed through Anaconda.
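
As a quick way to confirm the Java prerequisite, the sketch below reads the output of `java -version` and extracts the major version. The helper name `java_major_version` is hypothetical, and the parsing assumes the usual version string formats (`"1.8.0_281"` for Java 8, `"11.0.2"` for later releases).

```python
import re
import shutil
import subprocess


def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_281").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


if __name__ == "__main__":
    if shutil.which("java") is None:
        print("java was not found on PATH")
    else:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        major = java_major_version(result.stderr)
        print("Java major version:", major)
        if major is not None and major >= 8:
            print("Meets the Java 8 or higher requirement")
        else:
            print("Java 8 or higher is required")
```

If the script reports a version below 8, or cannot find `java` at all, install Java before continuing.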

After downloading Spark, unpack the archive and move it to your */opt* folder. Open your terminal, go to the download folder, and run these commands (the file names below are for Spark 1.2.0; replace them with the version you downloaded):

$ tar -xzf spark-1.2.0-bin-hadoop2.4.tgz

$ sudo mv spark-1.2.0-bin-hadoop2.4 /opt/spark-1.2.0

Then create a symbolic link to the Spark directory:

$ sudo ln -s /opt/spark-1.2.0 /opt/spark

The last step is to set SPARK_HOME and add Spark to the $PATH variable by appending these lines to *~/.bashrc*:

export SPARK_HOME=/opt/spark 

export PATH=$SPARK_HOME/bin:$PATH
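
After reloading *~/.bashrc*, you may want to confirm that the variables took effect. The following is a minimal sketch, not part of Spark itself: the helper name `check_spark_home` is hypothetical, and it assumes the launcher lives at `bin/pyspark` under SPARK_HOME, as it does in the pre-built packages.

```python
import os


def check_spark_home(environ):
    """Return a list of problems found with the SPARK_HOME / PATH setup."""
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    problems = []
    bin_dir = os.path.join(spark_home, "bin")
    launcher = os.path.join(bin_dir, "pyspark")
    # The pyspark launcher script should exist inside $SPARK_HOME/bin.
    if not os.path.isfile(launcher):
        problems.append("no pyspark launcher at " + launcher)
    # $SPARK_HOME/bin must be one of the PATH entries for `pyspark` to be found.
    if bin_dir not in environ.get("PATH", "").split(os.pathsep):
        problems.append(bin_dir + " is not on PATH")
    return problems


if __name__ == "__main__":
    issues = check_spark_home(os.environ)
    print("Spark environment looks fine" if not issues else "\n".join(issues))
```

An empty result means both settings are in place; otherwise each message points at the line in *~/.bashrc* that needs attention.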

#### Testing if PySpark is working properly in the shell

In this step, you check that PySpark is properly installed by launching it from the shell. Open your terminal and run the following command:

$ pyspark

This command should start the PySpark shell and greet you with the Spark welcome banner.

Now, let's run a test script to make sure everything is working properly.
**Note:** The NameError shown below appears because I ran this script in a regular Jupyter Notebook, where 'sc' is not defined. You should run it in the PySpark shell from your terminal instead.

In [1]:
import random
num_samples = 100000000
def inside(p):     
  x, y = random.random(), random.random()
  return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)

sc.stop()

NameError: name 'sc' is not defined

When run in the PySpark shell, this script should print an approximation of pi (a value close to 3.14).
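
To see why the script approximates pi, here is the same Monte Carlo logic as a plain Python sketch that needs no Spark. The function name `estimate_pi` and the `seed` parameter are additions for illustration, and the sample count is much smaller than in the Spark version so it finishes quickly on a single core.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        # Draw a point uniformly from the unit square [0, 1) x [0, 1).
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    # The quarter circle covers pi/4 of the unit square, so scale the fraction by 4.
    return 4 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

The Spark script does the same counting, but `sc.parallelize(...).filter(inside).count()` spreads the samples across workers instead of looping over them in one process.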

### Step 2: Activating PySpark in Jupyter Notebook

#### Method 1- Configuring PySpark driver to use Jupyter Notebook

To configure the PySpark driver to use Jupyter Notebook, open *~/.bashrc* and add these two lines:

export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

After adding these two lines and saving the *~/.bashrc* file, restart the terminal and run the following command:

$ pyspark

This should open Jupyter Notebook in your web browser. Create a new notebook by clicking 'New' > 'Python 3' or 'Python 2'.

Now, copy and paste the pi calculation script and run it.
**Note:** The error below appears because my own Jupyter Notebook is configured with the second method of activating PySpark, which is described later in this document. With the first method set up, the script should run and print an approximation of pi.

In [2]:
import random
num_samples = 100000000
def inside(p):     
  x, y = random.random(), random.random()
  return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)

sc.stop()

NameError: name 'sc' is not defined
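
The `NameError` arises because a regular Python kernel does not define `sc`; that variable is created for you only when the session is started through the `pyspark` launcher. A minimal sketch of a guard you could run in a cell before the script follows. The helper name `has_spark_context` is hypothetical.

```python
def has_spark_context(namespace):
    """Return True if the namespace defines a non-None variable named `sc`."""
    return namespace.get("sc") is not None


if has_spark_context(globals()):
    print("Found sc: this session was started with a SparkContext")
else:
    print("No sc defined: start the notebook with the pyspark command, or create a SparkContext yourself")
```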

#### Method 2- Loading a regular Jupyter notebook and importing PySpark using 'findspark' library

This method is more common, and personally I prefer it over the first one. First, install the *findspark* library using Anaconda by running the following command in the terminal:

$ conda install -c conda-forge findspark

Then launch Jupyter Notebook:

$ jupyter notebook

Create a new notebook and import the *findspark* library you just installed.

In [6]:
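import findspark
findspark.init()  # locate Spark (for example via SPARK_HOME) and add it to sys.path
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):
    # Draw a random point in the unit square and test whether it lies inside the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1
# Count the points inside the quarter circle, then scale the fraction by 4 to estimate pi
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)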

3.14150352


In [4]:
sc.version

'2.2.0'

In [None]:
sc.stop()
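
Calling `pyspark.SparkContext(...)` a second time while a context is still active raises an error, which is why the notebook ends with `sc.stop()`. As an alternative for cells you expect to re-run, here is a sketch built on `SparkContext.getOrCreate`. The function name `get_spark_context` is hypothetical, and the `/opt/spark` default is an assumption based on the installation path used in this guide.

```python
def get_spark_context(app_name="Pi", spark_home="/opt/spark"):
    """Return the active SparkContext, creating one if none exists yet."""
    # Imports are kept inside the function so this definition loads even
    # where findspark or pyspark are not installed.
    import findspark

    # Point findspark at the Spark installation so pyspark becomes importable.
    findspark.init(spark_home)
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name)
    # Reuses a running context instead of failing when one already exists.
    return SparkContext.getOrCreate(conf)
```

In a notebook you would call `sc = get_spark_context()` once at the top, then use `sc` as in the cells above.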

Congratulations, you can now run PySpark on Jupyter Notebook using Anaconda.