# Lab : Feedforward Neural Network on the Iris Dataset

### Overview
Classify the Iris dataset using a feedforward neural network.

### Runtime
30 mins

## Step 1 - About IRIS Dataset

This is [Fisher's Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris)

This dataset contains 150 samples with 4 features each:

1. Sepal Length  (c1)
2. Sepal Width   (c2)
3. Petal Length  (c3)
4. Petal Width   (c4)

There are 3 output classes: Setosa, Versicolor, and Virginica.
In our dataset, we have simplified the labels by encoding the classes as 1, 2, and 3.

Here's an example of what the dataset looks like:

| c1  | c2  | c3  | c4  | label | 
|-----|-----|-----|-----|-------| 
| 6.4 | 2.8 | 5.6 | 2.2 | 3     | 
| 5.0 | 2.3 | 3.3 | 1.0 | 2     | 
| 4.9 | 2.5 | 4.5 | 1.7 | 3     | 
| 4.9 | 3.1 | 1.5 | 0.1 | 1     | 
| 5.7 | 3.8 | 1.7 | 0.3 | 1     | 
| 4.4 | 3.2 | 1.3 | 0.2 | 1     | 
| 5.4 | 3.4 | 1.5 | 0.4 | 1     | 
| 6.9 | 3.1 | 5.1 | 2.3 | 3     | 
| 6.7 | 3.1 | 4.4 | 1.4 | 2     | 

## Step 2 - Init

In [None]:
from zoo.common.nncontext import init_nncontext
import zoo.version

## TODO : use 'init_nncontext ("your app name")' to initialize the app
sc = ???("???")
print("zoo version : ", zoo.version.__version__)

## Spark UI
print('Spark UI running on http://localhost:' + sc.uiWebUrl.split(':')[2])
sc

## Step 3 - Explore Dataset

Let's do some basic exploration of the dataset.

### 3.1 - Load data

In [None]:
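from pyspark.sql import SparkSession

## get a SparkSession that reuses the Spark context created in Step 2
spark = SparkSession.builder.getOrCreate()
data = spark.read.csv("../../data/iris/iris_full.csv",
                      header=True, inferSchema=True, mode="DROPMALFORMED")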
print ("data count ", data.count())
data = data.na.drop()
print ("clean data count ", data.count())
data.show()

### 3.2 - Basic Analysis

In [None]:
## Spark's describe function is pretty powerful
data.describe().show()

### 3.3 - See how data is distributed

In [None]:
## TODO : see how the data is distributed
## Hint : groupBy('label')

data.groupBy("???").count().show()

# we see the data is pretty evenly distributed

### 3.4 - Basic graph

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

class_count = data.groupBy("label").count().orderBy('label').toPandas()
class_count = class_count.set_index('label')
class_count.plot(kind='bar', rot=0)
plt.xlabel("Label")
plt.ylabel("Frequency");


## Step 4 - Create Feature Vectors

### 4.1 - Convert to double
BigDL expects all attributes to be of type double.

In [None]:
from pyspark.sql.functions import col, udf
## TODO : convert everything to double
## Hint : cast all columns to 'double'
data = data.select([col(c).cast("???") for c in data.columns])
data.printSchema()
data.show()

### 4.2 - Assembler

In [None]:
from pyspark.ml.feature import VectorAssembler

## TODO : assemble a feature vector
## Hint : inputCols = ['c1', 'c2', 'c3', 'c4']
assembler = VectorAssembler (inputCols=['c1','c2','???', '???'], outputCol='assembled')
fv = assembler.transform(data)
fv.show()

### 4.3 - Scale Features
It is important to scale features so that they have comparable ranges; otherwise features with larger values can dominate training.

In [None]:
from pyspark.ml.feature import StandardScaler

## TODO : Use 'StandardScaler' to scale the features
## Hint : inputCol='assembled',  outputCol='scaled'
scaler = ???(inputCol="???", outputCol="???")
fv = scaler.fit(fv).transform(fv)
fv.show()
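
To make the scaling step concrete, here is a minimal pure-Python sketch of what standardization does to a single column. It assumes Spark's `StandardScaler` defaults (`withStd=True`, `withMean=False`), under which each value is divided by the column's sample standard deviation without centering. The values are the `c1` column from the sample table in Step 1, and `scale_to_unit_std` is a hypothetical helper name, not part of Spark.

```python
import statistics

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (no mean centering)."""
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    return [v / std for v in values]

# c1 values from the sample table in Step 1
c1 = [6.4, 5.0, 4.9, 4.9, 5.7, 4.4, 5.4, 6.9, 6.7]
scaled_c1 = scale_to_unit_std(c1)

# After scaling, the column has unit standard deviation
print(round(statistics.stdev(scaled_c1), 6))
```

Because the values are not centered in this mode, the scaled column keeps its sign and ordering; only its spread changes.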

### 4.4 - Convert vectors to array

In [None]:
# Add Utils dir to load path

import os
import sys
cwd = os.getcwd()
# print ("cwd : ", cwd)
utils_dir = os.path.abspath(os.path.join(cwd, "../utils"))
# print("utils dir : ", utils_dir)
if utils_dir not in sys.path:
    sys.path.append(utils_dir)
print ("sys.path: " , sys.path)

my_utils_pyfile = os.path.abspath(os.path.join(utils_dir, 'my_utils.py'))
print ("my_utils file : ", my_utils_pyfile)

from my_utils import dense_to_array_udf, sparse_to_array_udf

# add file to spark
sc.addPyFile(my_utils_pyfile)

In [None]:
## convert scaled(vector) --> features(array)
fv = fv.withColumn('features', dense_to_array_udf('???'))

fv.printSchema()
fv.show()

## Step 5 - Split Training / Validation Set

In [None]:
## TODO : split 70% training, 30% validation
## Hint : 70% = 0.7 ,  30% = 0.3
(training, validation) = fv.randomSplit([???, ???])

## TODO : print out the record count in training and validation sets
## Hint : 'count'
print("training set count ", training.???())
print("validation set count ", validation.???())
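
The counts printed above are usually close to, but not exactly, 70% and 30%. A minimal pure-Python sketch of why: if each row is assigned independently by a random draw, the split sizes vary from run to run unless a seed is fixed. `random_split` below is a hypothetical stand-in written for illustration, not Spark's implementation; Spark's `randomSplit` also accepts an optional `seed` argument for repeatable splits.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to training or validation."""
    rng = random.Random(seed)
    training, validation = [], []
    for row in rows:
        target = training if rng.random() < train_fraction else validation
        target.append(row)
    return training, validation

rows = list(range(150))  # stand-in for the 150 Iris records
training_rows, validation_rows = random_split(rows, 0.7, seed=42)

# Counts are near 105 / 45 but generally not exact
print(len(training_rows), len(validation_rows))
```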

## Step 6 - Setup Neural Network


### 6.1 - Designing the network
Here's a picture of a simple neural network, like what we have in this example:

<img src="../../media/feed-forward-1-skitch.png">


As you can see, we have a total of 3 layers:

1. Input layer (sized as the number of features, in this case 4)
2. Hidden layer (its size is something we have to specify as part of the model)
3. Output layer (sized as the number of output classes, in this case 3)
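
To get a feel for how small this network is, the sketch below counts the trainable parameters (weights plus biases) of two fully connected layers, input to hidden and hidden to output. The hidden sizes tried here are illustrative assumptions, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(n_input, n_hidden, n_classes):
    """Weights plus biases for input -> hidden -> output dense layers."""
    hidden_layer = n_input * n_hidden + n_hidden
    output_layer = n_hidden * n_classes + n_classes
    return hidden_layer + output_layer

# 4 input features, 3 output classes, a few candidate hidden sizes
for n_hidden in (3, 4, 5):
    print(n_hidden, count_parameters(4, n_hidden, 3))
```

With so few parameters and only 150 samples, a single small hidden layer is a reasonable starting point.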

### 6.2 - Sizing hidden layers

Sizing hidden layers can be a challenge. The best way to figure this out is to do it empirically. However, we may need a "rule of thumb" to start. Here is a good rule of thumb:

First Hidden Layer:

```
n_hidden_1 = np.sqrt(np.sqrt((n_classes + 2) * n_input) + 2 * np.sqrt(n_input /(n_classes+2.)))
```

Second Hidden Layer: (if needed)

```
n_hidden_2 = n_classes * np.sqrt(n_input / (n_classes + 2.))
```

In this case, we have a VERY simple dataset. We may not need two hidden layers. Let's start with one.

In [None]:
# Estimate the size (number of neurons) of each hidden layer
import numpy as np

n_input = 4  # c1-c4
n_classes = 3  # outcome 1/2/3

n_hidden_guess = np.sqrt(np.sqrt((n_classes + 2) * n_input) + 2 * np.sqrt(n_input /(n_classes+2.)))
print("Hidden layer 1 (Guess) : " + str(n_hidden_guess))

n_hidden_guess_2 = n_classes * np.sqrt(n_input / (n_classes + 2.))
print("Hidden layer 2 (Guess) : " + str(n_hidden_guess_2))

## Step 7 - Setup BigDL Network

### 7.1 - Network Parameters

In [None]:
learning_rate = 0.01
training_epochs = 100
# batch size should be a multiple of the number of cores,
# so a power of two is a good bet
batch_size = 32

# Network Parameters
## TODO : define input / output numbers
## Hint : how many input features are we feeding?
## Hint : How many output classes?
n_input = ???  # c1-c4
n_classes = ???  # outcome 1/2/3
n_hidden_1 = ???  # from the above guess


### 7.2 - setup BigDL network

In [None]:
from bigdl.nn.layer import Sequential, Linear, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from zoo.pipeline.nnframes import  NNClassifier
from bigdl.optim.optimizer import Adam, SGD, Adagrad

## TODO : setup network
## note : parentheses allow chaining across lines with trailing comments
nn = (Sequential()
      .add(Linear(???, ???))   # hint : input --> hidden1
      .add(Linear(???, ???))   # hint : hidden1 --> output
      .add(???()))             # hint : LogSoftMax

## TODO : use a 'ClassNLLCriterion'
criterion = ???()

## TODO : create NNClassifier with parameters : network, criterion, and input_size
estimator = ???(???, ???, [???])

## TODO : set training parameters
estimator.setMaxEpoch(???)\
            .setBatchSize(???)\
            .setLearningRate(???)

## TODO : set featuresCol='features',  labelCol='label'
estimator.setLabelCol("???").setFeaturesCol("???")

# TODO : set an optimizer method 'Adam()', default is SGD
estimator.setOptimMethod(???())

print ("nn \n", nn)
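
To see what the last layer and the criterion compute, here is a minimal pure-Python sketch of log-softmax followed by a negative log-likelihood loss. It is an illustration of the math, not BigDL's implementation; the raw scores are made up, and the 1-based label indexing is an assumption chosen to match the 1/2/3 labels used in this lab.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def class_nll(log_probs, label):
    """Negative log-likelihood for a 1-based class label."""
    return -log_probs[label - 1]

scores = [2.0, 0.5, -1.0]  # hypothetical raw outputs of the last Linear layer
log_probs = log_softmax(scores)
predicted = log_probs.index(max(log_probs)) + 1  # 1-based predicted class

print(predicted, class_nll(log_probs, label=1))
```

The loss is small when the true class has the highest log-probability and grows as the network assigns it less probability.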

## Step 8 - Train the network

### 8.1 - Train

In [None]:
%%time 

## training
print ("starting training...")
## TODO : do training on 'training' dataset
## Hint : call 'fit' function with 'training' parameter
model = estimator.???(???)
print("training finished.\n")

## TODO : note the time it took for training

### 8.2 - Predict
We use the 'validation' dataset for prediction.

In [None]:
%%time

## TODO : do predictions
## Hint : call 'transform' function, pass in 'validation' dataset
predictions = model.???(???)

In [None]:
predictions.groupBy("prediction").count().show()
predictions.show()


## Step 9 - Evaluate the model

### 9.1 - Basic stats

In [None]:
print ("matching predictions ", predictions.filter("prediction == label").count())

## TODO  : print missed prediction count
## Hint : adjust the condition for 'filter' function
print ("missed predictions ", predictions.filter("prediction ??? label").count())

### 9.2 - Accuracy, Precision, AUC

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

## Note : BinaryClassificationEvaluator is designed for two-class problems,
## so treat this value with caution on our 3-class labels
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label", metricName="areaUnderROC")
auROC = evaluator.evaluate(predictions)
print("Area under ROC curve = " , auROC)

recall = MulticlassClassificationEvaluator(metricName="weightedRecall").evaluate(predictions)
print("recall = " , recall)

precision = MulticlassClassificationEvaluator(metricName="weightedPrecision").evaluate(predictions)
print("Precision = ", precision)

accuracy = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy").\
            evaluate(predictions)
print("accuracy = ",  accuracy)
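
The multiclass metrics above can be reproduced by hand, which helps when interpreting them. The sketch below computes accuracy, weighted precision, and weighted recall from (label, prediction) pairs, weighting each class by its share of the true labels. The pairs are hypothetical, not results from this lab, and `evaluate` is an illustrative helper rather than Spark's code.

```python
from collections import Counter

def evaluate(pairs):
    """Return accuracy, weighted precision and weighted recall."""
    total = len(pairs)
    label_counts = Counter(label for label, _ in pairs)
    predicted_counts = Counter(pred for _, pred in pairs)
    correct = Counter(label for label, pred in pairs if label == pred)

    accuracy = sum(correct.values()) / total
    weighted_precision = sum(
        (label_counts[c] / total) * (correct[c] / predicted_counts[c])
        for c in label_counts if predicted_counts[c] > 0
    )
    weighted_recall = sum(
        (label_counts[c] / total) * (correct[c] / label_counts[c])
        for c in label_counts
    )
    return accuracy, weighted_precision, weighted_recall

# Hypothetical (label, prediction) pairs
pairs = [(1, 1), (1, 1), (2, 2), (2, 3), (3, 3), (3, 3), (3, 2), (2, 2)]
acc, w_precision, w_recall = evaluate(pairs)
print(acc, w_precision, w_recall)
```

Note that recall weighted by class frequency always equals accuracy, so those two numbers carry the same information.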

### 9.3 - Confusion Matrix

In [None]:
# Confusion matrix
# we use Spark to calculate confusion matrix as the prediction set can be rather large
cm = predictions.groupBy('label').pivot('prediction', [1,2,3]).count().na.fill(0).orderBy('label')
cm.show()
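
The pivot above is a distributed version of a simple counting operation. As a minimal sketch, the same table can be built in plain Python from (label, prediction) pairs, with rows as true labels and columns as predictions. The pairs are hypothetical and `confusion_matrix` is an illustrative helper.

```python
from collections import Counter

def confusion_matrix(pairs, classes):
    """Rows are true labels, columns are predictions; missing cells are 0."""
    counts = Counter(pairs)
    return [[counts[(label, pred)] for pred in classes] for label in classes]

# Hypothetical (label, prediction) pairs
pairs = [(1, 1), (1, 1), (2, 2), (2, 3), (3, 3), (3, 2)]
classes = [1, 2, 3]
matrix = confusion_matrix(pairs, classes)
for label, row in zip(classes, matrix):
    print(label, row)
```

Correct predictions sit on the diagonal; off-diagonal cells show which classes get confused with each other.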

In [None]:
## Plot

import seaborn as sn

cm_pd = cm.toPandas()
# print(cm_pd)
cm_pd = cm_pd.set_index('label')  # make 'label' as index
# print(cm_pd)

plt.figure(figsize = (10,8))
sn.heatmap(cm_pd, annot=True,fmt='d');

## Step 10 - Experiment
Do a few runs (`Cell --> Run All`) and try the following:
- change the hidden layer size (3, 4, 5)
- change the learning rate (0.0001 --> 0.01)

Then observe how the accuracy and the confusion matrix change.
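
One way to keep the experiments organized is to list the combinations up front and record the accuracy for each. The sketch below only enumerates the settings; the intermediate learning rate `0.001` is an assumed value between the two endpoints, and the dictionary keys simply mirror the variable names used in Step 7.

```python
from itertools import product

hidden_sizes = [3, 4, 5]
learning_rates = [0.0001, 0.001, 0.01]  # 0.001 is an assumed midpoint

# One entry per (hidden size, learning rate) combination
runs = [
    {"n_hidden_1": h, "learning_rate": lr}
    for h, lr in product(hidden_sizes, learning_rates)
]
for run in runs:
    print(run)
```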