# Getting Started with PySpark

Apache Spark is an open-source cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance. [Wikipedia](https://en.wikipedia.org/wiki/Apache_Spark)

While Spark is written in Scala, a language that compiles to bytecode for the JVM, the open-source community has developed a toolkit called PySpark that lets you work with RDDs (resilient distributed datasets, Spark's core data structure) from Python. PySpark relies on a library called Py4J, which lets Python interface with JVM objects, in our case RDDs. [Read more about PySpark here](http://www.kdnuggets.com/2015/11/introduction-spark-python.html)

Let's start by installing and running PySpark on our local machine.  

1. Open a command line separate from the one in which you're running Jupyter.   
2. Install PySpark from the command line with "pip install pyspark".    

You may need to restart your terminal before you can run PySpark (this wasn't required on a Mac). 

3. To check that PySpark is working, start it directly from the command line by typing "pyspark". You should see:  

Welcome to  
       ____              __  
      / __/__  ___ _____/ /__  
     _\ \/ _ \/ _ `/ __/  '_/  
    /__ / .__/\_,_/_/ /_/\_\   version 2.2.0  
       /_/  
  
4. You can quit the command line PySpark by typing "quit()".  

5. Next, add pyspark-shell to the shell environment variable PYSPARK_SUBMIT_ARGS by typing:  
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"  
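If you would rather not touch the shell, the same variable can probably be set from Python instead. The sketch below assumes it runs before the first SparkContext is created, since PySpark reads the variable when it launches the JVM; the value is the one from step 5.

```python
import os

# Assumption: this runs before any SparkContext exists in the process.
# PySpark reads PYSPARK_SUBMIT_ARGS when it starts the JVM gateway,
# so setting it afterwards has no effect on an already running context.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Here `local[2]` asks Spark to run locally with two worker threads.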

In [1]:
import pyspark

In [2]:
from pyspark import SparkContext
sc = SparkContext()

Quick example

In [3]:
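# Calculate a very special number using PySpark!
import random
num_samples = 100000000

def inside(p):
    # Draw a random point in the unit square; p (the sample index) is unused.
    x, y = random.random(), random.random()
    # Keep the point if it falls inside the quarter circle of radius 1.
    return x*x + y*y < 1

# The fraction of points kept approximates pi / 4.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()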

3.14198288
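To see why this works, it may help to run the same estimate without Spark. The sketch below is a plain-Python version of the idea, not part of the original notebook: points are drawn uniformly in the unit square, and the fraction landing inside the quarter circle approximates pi / 4. The function name `estimate_pi`, the seed, and the smaller sample size are assumptions chosen for illustration.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)  # seeded so repeated runs give the same value
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:  # point lies inside the quarter circle
            hits += 1
    return 4 * hits / num_samples

print(estimate_pi(100_000))
```

The PySpark version does the same counting, but `parallelize` splits the range of sample indices across workers and `filter(...).count()` adds up the hits from each partition.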


## PySpark in Jupyter  
If the above procedure didn't work, there are two other ways to make PySpark available in a Jupyter Notebook:

1. Configure the PySpark driver to use Jupyter Notebook: running pyspark will then automatically open a Jupyter Notebook.  

2. Launch a regular Jupyter Notebook and load PySpark using the findspark package.  

The first option is quicker but specific to Jupyter Notebook; the second is a broader approach that also makes PySpark available in your favorite IDE.  

[Source](https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f)

### Method 1 — Configure PySpark driver  

Update the PySpark driver environment variables by adding the lines below to your ~/.bashrc (or ~/.zshrc) file. Entering them directly in the command line also works, but only for the current terminal session.

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'  

Restart your terminal and launch PySpark again:  

$ pyspark  

Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.
Copy and paste our Pi calculation script and run it by pressing Shift + Enter.
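
If the notebook does not open, one thing worth checking is whether the two variables are actually visible to new processes. The sketch below is a small, hypothetical helper (`jupyter_driver_configured` is not part of PySpark) that compares the environment against the values used in this method.

```python
import os

def jupyter_driver_configured(env=None):
    """Return True if the driver variables match the values from Method 1."""
    if env is None:
        env = os.environ  # default to the environment of this process
    return (
        env.get("PYSPARK_DRIVER_PYTHON") == "jupyter"
        and env.get("PYSPARK_DRIVER_PYTHON_OPTS") == "notebook"
    )

print(jupyter_driver_configured())
```

A result of `False` in a fresh terminal suggests the export lines were not picked up, for example because they were added to a startup file your shell does not read.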

### Method 2 — FindSpark package
There is another, more general way to use PySpark in a Jupyter Notebook: use the findspark package to make a Spark Context available in your code.  

The findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too.  
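
Roughly speaking, the trick is to make Spark's bundled Python sources importable by putting them on `sys.path`. The sketch below illustrates that idea only; it is not findspark's actual code. The helper names are hypothetical, and the layout (`python/` plus a `py4j-*.zip` archive under `python/lib/`) is an assumption about how a Spark distribution is organized.

```python
import glob
import os
import sys

def spark_python_paths(spark_home):
    """Return the paths that would expose pyspark for a given Spark directory."""
    spark_python = os.path.join(spark_home, "python")
    # Assumed layout: Py4J ships as a zip archive under python/lib.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    return [spark_python] + sorted(py4j_zips)

def add_spark_to_path(spark_home):
    """Prepend those paths to sys.path and return the ones that were added."""
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added
```

After a call such as `add_spark_to_path("/path/to/spark")`, where the path is a placeholder for your own Spark directory, `import pyspark` would resolve against that installation.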

To install findspark:  

$ pip install findspark

Launch a regular Jupyter Notebook:  

$ jupyter notebook  

Create a new Python [default] notebook and write the following script:

In [None]:
# The only difference from the code above is the first two lines, which initialize findspark.

import findspark

In [None]:
findspark.init()

In [None]:
import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

## Let's do a bit more with PySpark

Here, we count characters
[Source](https://github.com/KristianHolsheimer/pyspark-setup-guide/blob/master/spark_word_count.ipynb)

In [18]:
# This is just to add a sound that can alert us when a cell finishes executing.
from IPython.display import Audio
sound_file = './../beep.wav'

In [4]:
# Only one SparkContext can be active at a time, so stop the previous one (sc.stop()) before creating a new one.

from pyspark import SparkContext
sc = SparkContext()

In [10]:
!wget 'http://www.gutenberg.org/cache/epub/100/pg100.txt'
Audio(url=sound_file, autoplay=True)

--2017-07-26 19:21:30--  http://www.gutenberg.org/cache/epub/100/pg100.txt
Resolving www.gutenberg.org... 152.19.134.47, 2610:28:3090:3000::bad:cafe:47
Connecting to www.gutenberg.org|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5589889 (5.3M) [text/plain]
Saving to: ‘pg100.txt.3’


2017-07-26 19:21:32 (3.35 MB/s) - ‘pg100.txt.3’ saved [5589889/5589889]



In [11]:
raw_text = sc.textFile('pg100.txt', 4)

In [17]:
# check whether the data was loaded properly:
print(u'first line of raw_text:\t "{}"'.format(raw_text.first()))
print(u'total number of lines:\t {}'.format(raw_text.count()))
Audio(url=sound_file, autoplay=True)

first line of raw_text:	 "The Project Gutenberg EBook of The Complete Works of William Shakespeare, by"
total number of lines:	 124787
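
With the text loaded, counting characters could proceed along the following lines. This is a minimal sketch rather than the source notebook's code: `char_counts` and `char_counts_local` are hypothetical names, and the first expects an RDD of text lines such as `raw_text`. The plain-Python twin is included so the logic can be checked on a few lines without a cluster.

```python
from collections import Counter
from operator import add

def char_counts(rdd):
    """Return an RDD of (character, count) pairs for an RDD of text lines."""
    return (
        rdd.flatMap(list)            # split every line into its characters
           .map(lambda ch: (ch, 1))  # pair each character with a count of 1
           .reduceByKey(add)         # sum the counts per character
    )

def char_counts_local(lines):
    """Plain-Python reference that counts characters in an iterable of lines."""
    return Counter(ch for line in lines for ch in line)

print(char_counts_local(["to be", "or not"]).most_common(2))
```

With the `raw_text` RDD from above, `char_counts(raw_text).takeOrdered(5, key=lambda kv: -kv[1])` would return the five most frequent characters. Note that `textFile` yields lines without their trailing newline, so newline characters are not counted.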
