## Installation 

Install PySpark through Anaconda.

```shell
conda install -y -c anaconda-cluster spark
```

Download the latest version of Apache Spark from the [downloads page](https://spark.apache.org/downloads.html). For the package type, choose `pre-built for Hadoop 2.6` and leave the other options unchanged. After downloading, extract the archive, move the folder to your home directory, and rename it to `spark`.

Then you need to define these environment variables before starting the notebook.

```shell
export SPARK_HOME=~/spark
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PACKAGES="com.databricks:spark-csv_2.11:1.4.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
```

On Unix/Mac, these lines can be added to `.bashrc` or `.bash_profile` so they are set in every new shell.
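If editing a shell profile is inconvenient, the same variables can in principle be set from Python before `pyspark` is imported. The sketch below is a minimal version of that idea. It assumes Spark was extracted to `~/spark` as described above, and the helper name `configure_spark_env` is made up for illustration.

```python
import os
import sys


def configure_spark_env(spark_home="~/spark",
                        packages="com.databricks:spark-csv_2.11:1.4.0"):
    """Hypothetical helper mirroring the shell exports above."""
    home = os.path.expanduser(spark_home)
    python_dir = os.path.join(home, "python")

    os.environ["SPARK_HOME"] = home
    os.environ["PYSPARK_PYTHON"] = "python3"
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {} pyspark-shell".format(packages)

    # PYTHONPATH is only read at interpreter startup, so setting it here
    # affects child processes; sys.path is what this interpreter searches.
    existing = os.environ.get("PYTHONPATH", "")
    os.environ["PYTHONPATH"] = (
        python_dir + os.pathsep + existing if existing else python_dir
    )
    if python_dir not in sys.path:
        sys.path.insert(0, python_dir)
    return home


spark_home = configure_spark_env()
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

This must run before the first `import pyspark` in the notebook, since the variables are read when Spark starts.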

- Reference: [Introduction to Spark (Duke STA-663, 2016)](http://people.duke.edu/~ccc14/sta-663-2016/21A_Introduction_To_Spark.html)

# Spark

A running Spark program actually consists of two programs: a driver program and a worker program. The driver program runs on one machine, and the worker program runs either on cluster nodes or in local threads on the same machine.
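The driver/worker split can be pictured with plain Python threads. The sketch below is only an analogy, not Spark code: a "driver" splits the data into partitions, hands each partition to a worker thread, and combines the partial results. The names `partition` and `worker_task` are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split data into roughly equal chunks, one per worker."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def worker_task(chunk):
    """Work done independently on one partition: a sum of squares."""
    return sum(x * x for x in chunk)


# "Driver" side: distribute the partitions, then combine the partial results.
data = list(range(10))
chunks = partition(data, num_partitions=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(worker_task, chunks))

total = sum(partial_sums)
print(partial_sums, total)
```

Spark handles the partitioning, scheduling, and result collection itself; the point here is only that each worker sees its own slice of the data.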

A Spark program first creates a `SparkContext` object, which tells Spark how and where to access a cluster. If we write a Spark program outside the PySpark shell or the Databricks Community Edition environment, we have to create the `SparkContext` ourselves. Next, the program creates a `SQLContext` object, which is used to create DataFrames.

In [2]:
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)

The DataFrames that we create are automatically distributed across all of the workers, and PySpark provides an easy-to-use programming abstraction and parallel runtime.

In [4]:
df = sqlContext.createDataFrame( [("John", "Smith"), ("Ravi", "Singh"), ("Julia", "Jones")], 
                                 ("first_name", "last_name") )
df.show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|      John|    Smith|
|      Ravi|    Singh|
|     Julia|    Jones|
+----------+---------+



In [6]:
df.columns

['first_name', 'last_name']

In [7]:
# collect() returns the DataFrame's rows as a list of Row objects
print( df.select( '*' ).collect() )

# index [0] will access the first Row (data point) and another [0]
# will access the first data point's first feature
print( df.select( '*' ).collect()[0][0] )
print( df.select( 'first_name' ).collect()[0][0] )

[Row(first_name='John', last_name='Smith'), Row(first_name='Ravi', last_name='Singh'), Row(first_name='Julia', last_name='Jones')]
John
John


In [8]:
# or we can use first, if we only wish to access
# the first element
df.first()[0]

'John'
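A `Row` behaves much like a named tuple: fields can be read by position, as in the cells above, or by name. The sketch below uses `collections.namedtuple` as a stand-in so it runs without Spark; `Person` is a hypothetical type, not part of PySpark. On a real `Row`, the name-based forms are `row.first_name` and `row['first_name']`.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, used only so the example runs without Spark.
Person = namedtuple("Person", ["first_name", "last_name"])

rows = [
    Person("John", "Smith"),
    Person("Ravi", "Singh"),
    Person("Julia", "Jones"),
]

first = rows[0]
print(first[0])          # positional access, like df.first()[0]
print(first.first_name)  # access by field name
```

Name-based access tends to be more robust than positional access when the column order changes.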

In [None]:
# one-hot encode 
sample_ohe_dict_auto = create_one_hot_dict(sample_data_df)
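Neither `create_one_hot_dict` nor `sample_data_df` is defined in this section, so the cell above cannot run as written. As a rough illustration of what a one-hot dictionary might look like, the sketch below works on plain Python lists rather than a DataFrame. The functions `build_one_hot_dict` and `one_hot_encode` and the sample data are assumptions, not the original implementation.

```python
def build_one_hot_dict(samples):
    """Map each distinct (feature_id, value) pair to a unique column index."""
    distinct = sorted({feature for sample in samples for feature in sample})
    return {feature: index for index, feature in enumerate(distinct)}


def one_hot_encode(sample, ohe_dict):
    """Return a 0/1 vector with a 1 in the column of each feature present."""
    vector = [0] * len(ohe_dict)
    for feature in sample:
        vector[ohe_dict[feature]] = 1
    return vector


# Hypothetical data: each sample is a list of (feature_id, value) pairs.
sample_data = [
    [(0, "mouse"), (1, "black")],
    [(0, "cat"), (1, "tabby"), (2, "mouse")],
    [(0, "bear"), (1, "black"), (2, "salmon")],
]

ohe_dict = build_one_hot_dict(sample_data)
encoded_first = one_hot_encode(sample_data[0], ohe_dict)
print(len(ohe_dict), encoded_first)
```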

In [2]:
from pyspark.ml.classification import LogisticRegression
# IPython help syntax: show the docstring for LogisticRegression
?LogisticRegression

In [9]:
from pyspark.ml.regression import LinearRegression
model = LinearRegression()
model

In [55]:
from pyspark.ml.feature import PolynomialExpansion
?PolynomialExpansion

In [56]:
from pyspark.ml import Pipeline
?Pipeline

In [2]:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList)
# Print out the type of wordsRDD
print( type(wordsRDD) )

# simple word count
wordPairs = wordsRDD.map( lambda x: ( x, 1 ) )
print( wordPairs.collect() )

wordCounts = wordPairs.reduceByKey( lambda x, y: x + y )
print( wordCounts.collect() )

<class 'pyspark.rdd.RDD'>
[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]
[('elephant', 1), ('rat', 2), ('cat', 2)]
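To see what `map` followed by `reduceByKey` computes in the word count above, here is a plain-Python sketch of the same logic on a local list. The helper `reduce_by_key` is written for this example and only imitates the semantics; it does nothing in parallel.

```python
def reduce_by_key(pairs, func):
    """Combine the values of pairs that share a key, two at a time."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


words = ['cat', 'elephant', 'rat', 'rat', 'cat']

# Same as wordsRDD.map(lambda x: (x, 1)), but on a local list.
word_pairs = [(word, 1) for word in words]

# Same combining function as in the reduceByKey call above.
word_counts = reduce_by_key(word_pairs, lambda x, y: x + y)
print(word_counts)
```

In Spark the combining function is applied within each partition first and then across partitions, which is why it should be associative and commutative.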


In [13]:
# http://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
import string

# Create a dictionary using a comprehension - this maps every character from
# string.punctuation to None. Initialize a translation object from it.
translator = str.maketrans({ key: None for key in string.punctuation })

s = 'string with "punctuation" inside of it! Does this work? I hope so.'

# pass the translator to the string's translate method.
print(s.translate(translator))

string with punctuation inside of it Does this work I hope so
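An alternative to the dictionary comprehension is the three-argument form of `str.maketrans`, whose third argument lists the characters to delete. The sketch below wraps it in a small function so it could be reused, for example inside a `map` over lines of text. The name `remove_punctuation` is an assumption for illustration.

```python
import string

# Three-argument form: no character replacements, delete everything
# listed in the third argument.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCTUATION_TABLE)


s = 'string with "punctuation" inside of it! Does this work? I hope so.'
cleaned = remove_punctuation(s)
print(cleaned)
```

Building the table once at module level avoids rebuilding it for every string.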
