# Setting up Spark with Jupyter

In this notebook I outline how I set up Spark/PySpark in Jupyter/IPython (using Python 3.x). I used [this post](https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python) as a reference.

## Initial system setup

On OS X, I installed Python using [Anaconda](https://www.continuum.io/downloads). The default version of Python I currently have installed is 3.4.4 (Anaconda 2.4.0). Note that I also have a 2.x version of Python installed, created with `conda create -n python2 python=2.7 anaconda` (see [SO answer](http://stackoverflow.com/a/24415581/671013)).

## Installing Spark

This is actually the simplest step: download the latest binaries from [here](http://spark.apache.org/downloads.html) into `~/Applications` or some other directory of your choice. Next, untar the archive with `tar -xzf spark-X.Y.Z-bin-hadoopX.Y.tgz`.

For easy access to Spark, create a symbolic link to the Spark directory:

```bash
ln -s ~/Applications/spark-X.Y.Z-bin-hadoopX.Y ~/Applications/spark
```

Lastly, point `SPARK_HOME` at the symbolic link and add its `bin` directory to the `PATH`:

```bash
export SPARK_HOME=~/Applications/spark
export PATH=$SPARK_HOME/bin:$PATH
```

You can now run Spark/PySpark locally: simply invoke `spark-shell` or `pyspark`.
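
As a quick sanity check of the steps above, here is a minimal Python sketch that reports whether `SPARK_HOME` is set and whether the two launchers exist under its `bin` directory. The helper names `launcher_paths` and `check_spark_install` are hypothetical and not part of Spark.

```python
import os


def launcher_paths(spark_home):
    """Return the expected locations of the Spark launchers under spark_home."""
    bin_dir = os.path.join(os.path.expanduser(spark_home), "bin")
    return {name: os.path.join(bin_dir, name) for name in ("spark-shell", "pyspark")}


def check_spark_install(environ=None):
    """Map each launcher name to True/False depending on whether the file exists.

    Returns an empty dict when SPARK_HOME is not set.
    """
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return {}
    return {name: os.path.isfile(path)
            for name, path in launcher_paths(spark_home).items()}
```

Calling `check_spark_install()` with no arguments reads the current environment; an empty result suggests the `export` lines have not been applied in that shell.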

### Verbosity of Spark's output

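By default Spark logs a lot of `INFO` messages to the console. To reduce the verbosity, first copy the logging template by running this command in the Spark directory: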

```bash
cp conf/log4j.properties.template conf/log4j.properties
```

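Then edit `conf/log4j.properties` and change the level of `log4j.rootCategory` from `INFO` (the template's default) to `WARN`, so that the file reads: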

```
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN
```
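
The edit can also be scripted. The following is a minimal sketch, assuming the template contains a line of the form `log4j.rootCategory=INFO, console`; the function name `set_root_level` is hypothetical.

```python
import re


def set_root_level(properties_text, level="WARN"):
    """Return properties_text with the level of log4j.rootCategory replaced.

    Only the level (the first comma-separated value) is changed; the
    appender names that follow it are preserved.
    """
    pattern = re.compile(r"^(log4j\.rootCategory\s*=\s*)[A-Za-z]+", re.MULTILINE)
    return pattern.sub(lambda match: match.group(1) + level, properties_text)
```

To use it, read `conf/log4j.properties.template`, pass its text through `set_root_level`, and write the result to `conf/log4j.properties`. If your Spark version provides `SparkContext.setLogLevel`, calling `sc.setLogLevel("WARN")` from a running session is another option that leaves the file untouched.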

## Setting up Jupyter

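In order to use Spark from within a Jupyter notebook, prepend the following to `PYTHONPATH`. The py4j version in the file name must match the archive that ships in `$SPARK_HOME/python/lib` for your Spark release: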

```bash
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$SPARK_HOME/python/:$PYTHONPATH
```
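
The same two entries can be added from inside a notebook instead of the shell. This is a minimal sketch, assuming the layout described above (a `python` directory and a `python/lib/py4j-*-src.zip` archive under the Spark directory); the helper names are hypothetical. It locates the py4j archive with a glob so the version number does not need to be hard-coded.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the entries PySpark needs on sys.path for a given Spark directory."""
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return py4j_zips + [python_dir]


def add_spark_to_sys_path(spark_home):
    """Prepend any missing PySpark entries to sys.path and return them."""
    paths = spark_python_paths(spark_home)
    # Insert in reverse so the final order on sys.path matches `paths`.
    for path in reversed(paths):
        if path not in sys.path:
            sys.path.insert(0, path)
    return paths
```

In a notebook cell this would be called as `add_spark_to_sys_path(os.environ["SPARK_HOME"])` before `from pyspark import SparkContext`.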

## Testing Spark in Jupyter

From some working directory, say `~/scratch`, start a new Jupyter notebook instance:

```bash
jupyter notebook
```

Then, in a new notebook, create a `SparkContext`:

In [None]:
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')

### Primes count

In [None]:
def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up the square root of n
    # for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

In [None]:
# Create an RDD of the numbers from 0 to 9,999
nums = sc.parallelize(range(10000))

In [None]:
# Compute the number of primes in the RDD
print(nums.filter(isprime).count())
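
To cross-check the Spark result without a cluster, here is a minimal plain-Python sketch that counts the primes below a limit. Note that it uses a sieve of Eratosthenes rather than the trial division in `isprime`, so it is an independent check; the function name `count_primes_below` is hypothetical.

```python
def count_primes_below(limit):
    """Count the primes p with 0 <= p < limit using the sieve of Eratosthenes."""
    if limit < 2:
        return 0
    is_prime = [True] * limit
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Every multiple of n from n*n onward is composite.
            for multiple in range(n * n, limit, n):
                is_prime[multiple] = False
    return sum(is_prime)


print(count_primes_below(10000))  # 1229
```

For `range(10000)` the Spark count above should agree with this value.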

### Word count

In [None]:
from operator import add

In [None]:
# Taken from: http://langs.eserver.org/the-awful-german-language.txt
lines = sc.textFile('./the-awful-german-language.txt',1)

In [None]:
counts = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

In [None]:
# Top 10. http://stackoverflow.com/a/30779026/671013
counts.takeOrdered(10, key=lambda x: -x[1])
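
To make clear what the `flatMap`/`map`/`reduceByKey` pipeline computes, here is a minimal plain-Python sketch of the same word count using `collections.Counter`. The helper names and the sample lines are hypothetical. Like the RDD version, it splits on a single space, so consecutive spaces produce empty-string "words".

```python
from collections import Counter


def word_counts(lines):
    """Count words as the RDD pipeline does: split each line on a single space."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(' '))
    return counts


def top_n(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return counts.most_common(n)


# Hypothetical sample input, standing in for the lines of the text file.
sample = ["the awful german language", "the german language is awful", "the end"]
print(top_n(word_counts(sample), 3))
```

The order of words with equal counts may differ between this sketch and `takeOrdered`, since neither breaks ties on the word itself.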

## Databricks CSV (WIP)

In [None]:
%AddDeps com.databricks spark-csv_2.10 1.2.0 --transitive

In [None]:
from pyspark.sql import SQLContext

In [None]:
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file.csv')
df.show()