# Basic Configuration

## Required installations
1. Java
    - download from http://www.oracle.com/technetwork/java/javase/downloads/index.html
    - install using the downloaded executable file
    - install Java to a path without spaces (i.e. not under *Program Files*), because whitespace in the path causes issues
2. Anaconda
    - download from https://www.anaconda.com/download/
    - install using the downloaded executable file
3. Apache Spark 
    - download from https://spark.apache.org/downloads.html
    - unzip to a new folder e.g. *C:\spark*
4. Apache Hadoop 
    - only *winutils.exe* is required
    - download from https://github.com/steveloughran/winutils
    - create a new folder, e.g. *C:\hadoop\bin*, and place the file in it
    - run Command Prompt as Administrator and execute
    ```bash
    winutils.exe chmod 777 \tmp\hive
    ```
5. Findspark
    - a utility to locate and initialize pyspark
    - install using
    ```bash
    conda install -c conda-forge findspark
    ```
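
The installation notes above can be sanity-checked with a short script. The following is a minimal sketch, not part of any of the tools listed: `check_install` and its two arguments are hypothetical names, and the example paths are placeholders. It flags a Java path that contains whitespace and confirms that *winutils.exe* sits in the `bin` folder of the Hadoop directory.

```python
import os

def check_install(java_dir, hadoop_dir):
    """Return a list of human-readable problems found in the install layout."""
    problems = []
    # Whitespace in the Java path (e.g. "Program Files") is the issue noted above.
    if any(ch.isspace() for ch in java_dir):
        problems.append("Java path contains whitespace: %r" % java_dir)
    # winutils.exe is expected inside <hadoop_dir>/bin.
    winutils = os.path.join(hadoop_dir, "bin", "winutils.exe")
    if not os.path.isfile(winutils):
        problems.append("winutils.exe not found at %s" % winutils)
    return problems

if __name__ == "__main__":
    # Placeholder paths only; replace with the folders chosen during installation.
    for problem in check_install(r"C:\Java\jdk", r"C:\hadoop"):
        print(problem)
```

An empty result means neither of the two checks failed; it does not prove that the installation as a whole is correct.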

## Environment variables
- HADOOP_HOME = *path\to\hadoop*
- SPARK_HOME = *path\to\spark*
- JAVA_HOME = *path\to\JavaJDK*
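
Before starting a notebook it can help to confirm that the three variables are visible to Python. The sketch below assumes only the variable names listed above; `missing_env_vars` and `REQUIRED_VARS` are hypothetical helper names introduced for illustration.

```python
import os

REQUIRED_VARS = ("HADOOP_HOME", "SPARK_HOME", "JAVA_HOME")

def missing_env_vars(env=None):
    """Return the required variables that are unset or point to a missing folder."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        path = env.get(name)
        # Treat an empty value or a non-existent folder as a problem.
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

print("Missing or invalid:", missing_env_vars())
```

If a variable was set after the terminal or Jupyter server was started, it may not show up until that process is restarted.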

# Import base packages

In [1]:
import sys, os, shutil
import findspark
# use findspark to locate and initialize pyspark before importing pyspark
findspark.init()
import pyspark
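
To illustrate why `findspark.init()` must run before `import pyspark`, the sketch below approximates the idea using only the standard library. It is an assumption about the general approach (placing Spark's bundled Python sources and py4j archive on `sys.path`), not findspark's actual code. `add_spark_to_path` is a hypothetical name, and the `python/lib/py4j-*.zip` layout is assumed from typical Spark distributions.

```python
import glob
import os
import sys

def add_spark_to_path(spark_home):
    """Prepend Spark's Python sources (and py4j, if found) to sys.path."""
    spark_python = os.path.join(spark_home, "python")
    if not os.path.isdir(spark_python):
        raise FileNotFoundError("no 'python' folder under %s" % spark_home)
    # Assumed layout: the py4j bridge ships as a zip under python/lib.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    new_entries = [spark_python] + py4j_zips[:1]
    # Prepend so these entries win over any other pyspark on the path.
    sys.path[:0] = new_entries
    return new_entries
```

In practice `findspark.init()` remains the simpler choice; the sketch only shows the kind of path manipulation that makes the later import succeed.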

# Check environment

In [2]:
print("Python Version:", sys.version)
print("Spark Version:", pyspark.__version__)

Python Version: 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]
Spark Version: 2.1.1+hadoop2.7


# Example 1: Calculate value of Pi

Adapted from https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py

In [3]:
from random import random
from operator import add
from pyspark.sql import SparkSession

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0
    
spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

partitions = 10
num_samples = 10000

count = spark.sparkContext.parallelize(range(1, num_samples + 1), partitions).map(f).reduce(add)

print("Pi is roughly %f" % (4.0 * count / num_samples))

spark.stop()

Pi is roughly 3.131600
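
The estimator itself does not depend on Spark. As a minimal single-process sketch of the same computation, the version below replaces the RDD `map` and `reduce` with Python's built-in `map` and `functools.reduce`. The function name `estimate_pi` and the `seed` parameter are additions for illustration and do not appear in the original example.

```python
from functools import reduce
from operator import add
from random import Random

def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of Pi from points drawn in the square [-1, 1] x [-1, 1]."""
    rng = Random(seed)  # seeded so repeated runs give the same estimate

    def inside(_):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # map + reduce mirrors the RDD pipeline, but runs in a single process.
    count = reduce(add, map(inside, range(num_samples)))
    # The unit circle covers pi/4 of the square, so scale the hit ratio by 4.
    return 4.0 * count / num_samples

print("Pi is roughly %f" % estimate_pi(10000))
```

The factor of 4 comes from the area ratio: the unit circle has area pi and the enclosing square has area 4, so the fraction of points landing inside the circle approaches pi/4.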


# Example 2: Perform Binary Classification using Decision Tree

Adapted from https://github.com/apache/spark/blob/master/examples/src/main/python/ml/decision_tree_classification_example.py

In [4]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("DecisionTreeClassificationExample")\
        .getOrCreate()

data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
                labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# Print a one-line summary of the fitted tree (not the full tree structure)
print(treeModel)

spark.stop()

+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       1.0|         1.0|(692,[95,96,97,12...|
|       1.0|         1.0|(692,[121,122,123...|
|       1.0|         1.0|(692,[123,124,125...|
|       1.0|         1.0|(692,[124,125,126...|
|       1.0|         1.0|(692,[125,126,127...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0 
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_47efbcff89c6e4ec0117) of depth 2 with 5 nodes
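
The reported test error is simply one minus the accuracy, i.e. the fraction of test rows whose prediction differs from the label. A plain-Python sketch of that calculation follows; `error_rate` is a hypothetical helper, and the (prediction, label) pairs are made up for illustration rather than taken from the run above.

```python
def error_rate(pairs):
    """Return 1 - accuracy for an iterable of (prediction, label) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (prediction, label) pair")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return 1.0 - correct / len(pairs)

# Made-up pairs: three of the four predictions match their labels.
sample = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print("Test Error = %g" % error_rate(sample))  # prints "Test Error = 0.25"
```

Because `randomSplit` is called without a seed in the cell above, the rows in the test set, and therefore the error, can differ between runs.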


# Example 3: Perform Multiclass Classification using Multilayer Perceptron

Adapted from https://github.com/apache/spark/blob/master/examples/src/main/python/ml/multilayer_perceptron_classification.py

In [5]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("multilayer_perceptron_classification_example")\
        .getOrCreate()
        
# Load training data
data = spark.read.format("libsvm").load("data/sample_multiclass_classification_data.txt")

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 3 (classes)
layers = [4, 5, 4, 3]

# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# train the model
model = trainer.fit(train)

# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))

spark.stop()

Test set accuracy = 0.9019607843137255
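
To get a feel for the size of the network described by `layers = [4, 5, 4, 3]`, the sketch below counts its trainable parameters. It assumes every layer is fully connected to the next with one bias per unit; whether Spark stores its parameters in exactly this way is not confirmed by the example above. `count_parameters` is a hypothetical helper.

```python
def count_parameters(layers):
    """Count weights and biases in a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layers, layers[1:]):
        # Each of the n_out units has n_in weights plus one bias.
        total += (n_in + 1) * n_out
    return total

layers = [4, 5, 4, 3]
print(count_parameters(layers))  # 25 + 24 + 15 = 64
```

Under that assumption the model is small, which is consistent with using only 100 iterations on the sample data set.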
