# Basic Configuration

## Required installations
1. Java
    - download from http://www.oracle.com/technetwork/java/javase/downloads/index.html
    - install using the downloaded executable file
    - install Java to a path without spaces rather than under *Program Files*, because the whitespace in that folder name causes issues
2. Anaconda
    - download from https://www.anaconda.com/download/
    - install using the downloaded executable file
3. Apache Spark 
    - download from https://spark.apache.org/downloads.html
    - unzip to a new folder e.g. *C:\spark*
4. Apache Hadoop 
    - only *winutils.exe* is required
    - download from https://github.com/steveloughran/winutils
    - create a new folder, e.g. *C:\hadoop\bin*, and place this file in it
    - run Command Prompt as Administrator and execute
    ```bash
    winutils.exe chmod 777 \tmp\hive
    ```
5. Findspark
    - a utility to locate and initialize pyspark
    - install using
    ```bash
    conda install -c conda-forge findspark
    ```

## Environment variables
- HADOOP_HOME = *path\to\hadoop* (the folder that contains *bin\winutils.exe*, e.g. *C:\hadoop*, not the *bin* folder itself)
- SPARK_HOME = *path\to\spark*
- JAVA_HOME = *path\to\JavaJDK*
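
Before starting Spark, it can help to confirm that the three variables above are visible to Python. The following is a minimal sketch, not part of the original setup; the helper name `check_spark_env` is hypothetical, and the whitespace check only mirrors the note about avoiding *Program Files* for Java.

```python
import os

# The three variables listed in the section above.
REQUIRED_VARS = ("HADOOP_HOME", "SPARK_HOME", "JAVA_HOME")


def check_spark_env(environ=None):
    """Return a list of problems found; an empty list means none were found.

    `check_spark_env` is a hypothetical helper written for this guide.
    """
    environ = os.environ if environ is None else environ
    problems = []
    for name in REQUIRED_VARS:
        value = environ.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing folder: {value}")
    # Whitespace in the Java path is the issue mentioned in the install notes.
    if " " in environ.get("JAVA_HOME", ""):
        problems.append("JAVA_HOME contains whitespace")
    return problems


for problem in check_spark_env():
    print("WARNING:", problem)
```

If nothing is printed, all three variables are set and point to existing folders. Environment variables changed in Windows settings are only picked up by processes started afterwards, so restart the notebook server after editing them.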

# Import base packages

In [1]:
import sys, os, shutil
import findspark
# use findspark to locate and initialize pyspark before importing pyspark
findspark.init()
import pyspark
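
To make the role of `findspark.init()` less opaque, the sketch below approximates its core effect: it puts Spark's bundled Python sources and py4j archive on `sys.path` so that `import pyspark` succeeds. This is a simplified illustration under that assumption, not findspark's actual implementation, and the function name `add_pyspark_to_path` is hypothetical.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Prepend Spark's Python folder and its py4j archive to sys.path.

    Hypothetical helper: a rough approximation of what findspark.init()
    does once it has located SPARK_HOME.
    """
    spark_python = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip archive under python/lib.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    if not py4j_zips:
        raise FileNotFoundError("no py4j archive found under " + spark_python)
    new_entries = [spark_python, py4j_zips[0]]
    sys.path[:0] = new_entries
    return new_entries
```

A call such as `add_pyspark_to_path(os.environ["SPARK_HOME"])` would stand in for `findspark.init()` in this simplified picture. In practice findspark is the more convenient choice because it also searches for the Spark folder when `SPARK_HOME` is not set.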

# Check environment

In [2]:
print("Python Version:", sys.version)
print("Spark Version:", pyspark.__version__)

Python Version: 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
Spark Version: 2.1.1+hadoop2.7


# Example 1: Calculate value of Pi

Adapted from https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py

In [3]:
from random import random
from operator import add
from pyspark.sql import SparkSession

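def f(_):
    # Sample a random point in the square [-1, 1) x [-1, 1)
    x = random() * 2 - 1
    y = random() * 2 - 1
    # Count 1 if the point lies inside the unit circle, otherwise 0
    return 1 if x ** 2 + y ** 2 <= 1 else 0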
    
spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

partitions = 10
num_samples = 1000000

count = spark.sparkContext.parallelize(range(1, num_samples + 1), partitions).map(f).reduce(add)

print("Pi is roughly %f" % (4.0 * count / num_samples))

spark.stop()

Pi is roughly 3.139252
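
The estimate works because the unit circle has area π and the enclosing square has area 4, so the fraction of random points that land inside the circle approaches π/4. The same computation can be sketched without Spark, which is handy for checking the logic on a machine where Spark is not yet configured. The function names and the fixed seed below are illustrative choices, not part of the original example.

```python
from random import Random


def inside_unit_circle(rng):
    """Return 1 if a random point in [-1, 1) x [-1, 1) is inside the unit circle."""
    x = rng.random() * 2 - 1
    y = rng.random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0


def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of pi; the seed is fixed only for repeatability."""
    rng = Random(seed)
    count = sum(inside_unit_circle(rng) for _ in range(num_samples))
    # count / num_samples approximates pi / 4, so scale by 4.
    return 4.0 * count / num_samples


print("Pi is roughly %f" % estimate_pi(100000))
```

In the Spark version, `map(f)` plays the role of `inside_unit_circle` and `reduce(add)` plays the role of `sum`, with the samples split across the partitions.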


# Example 2: Perform Binary Classification using Decision Tree

Adapted from https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/decision_tree_classification_example.py

In [4]:
from pyspark import SparkContext

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="PythonDecisionTreeClassificationExample")

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/sample_libsvm_data.txt')

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
    
# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                    impurity='gini', maxDepth=5, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
        lambda lp: lp[0] != lp[1]).count() / float(testData.count())

print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())

# Save model (delete previous model if exists)
if os.path.exists("model/myDecisionTreeClassificationModel"):
    shutil.rmtree("model/myDecisionTreeClassificationModel")
model.save(sc, "model/myDecisionTreeClassificationModel")

# Load the saved model back from disk
sameModel = DecisionTreeModel.load(sc, "model/myDecisionTreeClassificationModel")
    
sc.stop()

Test Error = 0.041666666666666664
Learned classification tree model:
DecisionTreeModel classifier of depth 2 with 5 nodes
  If (feature 406 <= 72.0)
   If (feature 100 <= 165.0)
    Predict: 0.0
   Else (feature 100 > 165.0)
    Predict: 1.0
  Else (feature 406 > 72.0)
   Predict: 1.0
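
The debug string can be read as nested `if`/`else` statements. As an illustration, the sketch below hand-translates the tree printed above into plain Python. It reflects only this particular run: `randomSplit` is called without a seed, so a rerun may learn a different tree. The function name and the dictionary representation of a feature vector are assumptions made for this example.

```python
def predict_from_printed_tree(features):
    """Hand-translated version of the tree printed above.

    `features` maps a feature index to its value; indices that are absent
    are treated as 0.0, as in a sparse vector. Hypothetical helper.
    """
    if features.get(406, 0.0) <= 72.0:
        if features.get(100, 0.0) <= 165.0:
            return 0.0
        return 1.0
    return 1.0


# Made-up feature values, chosen only to exercise each branch.
print(predict_from_printed_tree({406: 10.0, 100: 100.0}))
print(predict_from_printed_tree({406: 10.0, 100: 200.0}))
print(predict_from_printed_tree({406: 100.0}))
```

The tree has depth 2 and 5 nodes (2 decision nodes and 3 leaves), which matches the two `if` tests and three `return` statements in the sketch.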

