# Lab: Visualizing Training With TensorBoard

### Overview
This lab introduces TensorBoard as a visual tool for monitoring the training of a BigDL neural network on the Iris dataset.

### Runtime
30 mins

## Step 1 - About IRIS Dataset

This is [Fisher's Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris)

This dataset contains 150 samples, each with 4 features, as follows:

1. Sepal Length  (c1)
2. Sepal Width   (c2)
3. Petal Length  (c3)
4. Petal Width   (c4)

There are 3 output classes: Setosa, Versicolor, and Virginica.
In our dataset, we have simplified this by encoding the classes as 1, 2, 3.

Here's an example of what the dataset looks like:

| c1  | c2  | c3  | c4  | label | 
|-----|-----|-----|-----|-------| 
| 6.4 | 2.8 | 5.6 | 2.2 | 3     | 
| 5.0 | 2.3 | 3.3 | 1.0 | 2     | 
| 4.9 | 2.5 | 4.5 | 1.7 | 3     | 
| 4.9 | 3.1 | 1.5 | 0.1 | 1     | 
| 5.7 | 3.8 | 1.7 | 0.3 | 1     | 
| 4.4 | 3.2 | 1.3 | 0.2 | 1     | 
| 5.4 | 3.4 | 1.5 | 0.4 | 1     | 
| 6.9 | 3.1 | 5.1 | 2.3 | 3     | 
| 6.7 | 3.1 | 4.4 | 1.4 | 2     | 
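
The label encoding described above can be sketched as a simple lookup. The order of the mapping (Setosa = 1, Versicolor = 2, Virginica = 3) is an assumption based on the order in which the classes are listed; check it against your copy of the CSV before relying on it.

```python
# Assumed mapping from species name to the integer label used in the CSV.
# The ordering is an assumption, not read from the dataset file itself.
SPECIES_TO_LABEL = {"Setosa": 1, "Versicolor": 2, "Virginica": 3}
LABEL_TO_SPECIES = {label: name for name, label in SPECIES_TO_LABEL.items()}


def species_name(label):
    """Translate a numeric label (int or float, e.g. 3 or 3.0) to a species name."""
    return LABEL_TO_SPECIES[int(label)]


print(species_name(3.0))
```

A lookup like this is handy later when reading predictions, since the labels become doubles after the cast in Step 4.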

## Step 2 - Init

In [None]:
from zoo.common.nncontext import init_nncontext
import zoo.version

sc = init_nncontext("single layer IRIS")
print("zoo version : ", zoo.version.__version__)

## Spark UI
print('Spark UI running on http://localhost:' + sc.uiWebUrl.split(':')[2])
sc

## Step 3 - Explore Dataset

Let's do some basic exploration of the dataset.

### 3.1 - Load data

In [None]:
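from pyspark.sql import SparkSession

# get a SparkSession on top of the Spark context created in Step 2
spark = SparkSession.builder.getOrCreate()

data = spark.read.csv("../../data/iris/iris_full.csv", \
                      header=True, inferSchema="true", mode="DROPMALFORMED")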
print ("data count ", data.count())
data = data.na.drop()
print ("clean data count ", data.count())
data.show()

### 3.2 - Basic Analysis

In [None]:
data.describe().show()

In [None]:
data.groupBy('label').count().show()

## Step 4 - Create Feature Vectors

### 4.1 - Convert double
BigDL needs all attributes as doubles.

In [None]:
from pyspark.sql.functions import col
# convert everything to double
data = data.select([col(c).cast("double") for c in data.columns])
data.printSchema()
data.show()

### 4.2 - Assembler

In [None]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler (inputCols=['c1','c2','c3', 'c4'], outputCol='assembled')
fv = assembler.transform(data)
fv.show()

### 4.3 - Scaler

In [None]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="assembled", outputCol="scaled")
fv = scaler.fit(fv).transform(fv)
fv.show()
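
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` rescales each feature to unit standard deviation without centering it. The plain-Python sketch below illustrates that idea on a single column; the helper name `scale_to_unit_std` is made up for this example and is not part of Spark.

```python
import statistics


def scale_to_unit_std(column):
    """Divide each value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(column)  # sample standard deviation (n - 1 denominator)
    return [value / std for value in column]


# First column (c1) of the sample rows shown in Step 1
c1 = [6.4, 5.0, 4.9, 4.9, 5.7, 4.4, 5.4, 6.9, 6.7]
scaled = scale_to_unit_std(c1)

# After scaling, the column's standard deviation is 1 (up to rounding error)
print(round(statistics.stdev(scaled), 6))
```

Because the values are not centered, the scaled column keeps a non-zero mean; only its spread changes.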

### 4.4 - Convert vectors to array
BigDL supports the Array\[\] type, so we convert the scaled vector column into an array column. Spark ML Vector support is coming soon.

In [None]:
# Add Utils dir to load path

import os
import sys
cwd = os.getcwd()
# print ("cwd : ", cwd)
utils_dir = os.path.abspath(os.path.join(cwd, "../utils"))
# print("utils dir : ", utils_dir)
if utils_dir not in sys.path:
    sys.path.append(utils_dir)
print ("sys.path: " , sys.path)

my_utils_pyfile = os.path.abspath(os.path.join(utils_dir, 'my_utils.py'))
print ("my_utils file : ", my_utils_pyfile)

from my_utils import dense_to_array_udf, sparse_to_array_udf

# add file to spark
sc.addPyFile(my_utils_pyfile)

In [None]:
## convert scaled(vector) --> features(array)
fv = fv.withColumn('features', dense_to_array_udf('scaled'))

fv.printSchema()
fv.show()

## Step 5 - Split Training / Validation Set

In [None]:
## split 70% training, 30% validation
(training, validation) = fv.randomSplit([0.7,0.3])
print("training set count ", training.count())
print("validation set count ", validation.count())

## Step 6 - Setup Neural Network


### 6.1 - Designing the network
Here's a picture of a simple neural network, like what we have in this example:

<img src="../../media/feed-forward-1-skitch.png">


As you can see, we have a total of 3 layers:

1. Input layer (sized as number of features -- in this case 4)
2. Hidden Layer (size we have to specify as part of the model).
3. Output Layer (Number of output classes we are trying to classify -- in this case 3)
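
To get a feel for how small this network is, we can count its trainable parameters. This is a minimal sketch that assumes fully connected layers with one bias per output unit and a hidden layer of 3 neurons (an example value); `linear_param_count` is a made-up helper, not a BigDL function.

```python
def linear_param_count(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


n_input, n_hidden, n_classes = 4, 3, 3  # hidden size of 3 is an assumed example value

input_to_hidden = linear_param_count(n_input, n_hidden)     # 4*3 + 3 = 15
hidden_to_output = linear_param_count(n_hidden, n_classes)  # 3*3 + 3 = 12
total = input_to_hidden + hidden_to_output

print(total)  # 15 + 12 = 27
```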

### 6.2 - Sizing hidden layers

Sizing hidden layers can be a challenge. The best way to figure this out is to do it empirically. However, we may need a "rule of thumb" to start. Here is a good rule of thumb:

First Hidden Layer:

```
n_hidden_1 = np.sqrt(np.sqrt((n_classes + 2) * n_input) + 2 * np.sqrt(n_input /(n_classes+2.)))
```

Second Hidden Layer: (if needed)

```
n_hidden_2 = n_classes * np.sqrt(n_input / (n_classes + 2.))
```

In this case, we have a VERY simple dataset. We may not need two hidden layers. Let's start with one.

In [None]:
# Estimate the number of neurons in each hidden layer
import numpy as np

n_input = 4  # c1-4
n_classes = 3  # outcome 1/2/3

n_hidden_guess = np.sqrt(np.sqrt((n_classes + 2) * n_input) + 2 * np.sqrt(n_input /(n_classes+2.)))
print("Hidden layer 1 (Guess) : " + str(n_hidden_guess))

n_hidden_guess_2 = n_classes * np.sqrt(n_input / (n_classes + 2.))
print("Hidden layer 2 (Guess) : " + str(n_hidden_guess_2))
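
The guesses above are fractional, while a layer needs a whole number of neurons. One way to turn them into usable sizes is to round up, sketched here with the standard library only. Rounding up (rather than to the nearest integer) is an assumption made for this example, and `hidden_size_guesses` is a made-up helper that repeats the rule of thumb from this step.

```python
import math


def hidden_size_guesses(n_input, n_classes):
    """Return the (first, second) hidden layer size guesses from the rule of thumb."""
    ratio = math.sqrt(n_input / (n_classes + 2.0))
    first = math.sqrt(math.sqrt((n_classes + 2) * n_input) + 2 * ratio)
    second = n_classes * ratio
    return first, second


first, second = hidden_size_guesses(4, 3)
n_hidden_1 = math.ceil(first)   # about 2.50, rounds up to 3
n_hidden_2 = math.ceil(second)  # about 2.68, rounds up to 3

print(n_hidden_1, n_hidden_2)
```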

## Step 7 - Setup BigDL Network

### 7.1 - Network Parameters

In [None]:
learning_rate = 0.01
training_epochs = 100
# batch size should be a multiple of the number of cores,
# so a power of two is a good bet
batch_size = 32

# Network Parameters
n_input = 4  # c1-4
n_classes = 3  # outcome 1/2/3
n_hidden_1 = 3  # from the above guess
# n_hidden_2 = 3  # 2nd layer number of neurons (from guess above)
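
These parameters also determine roughly how many training iterations to expect, which matters for iteration-based triggers such as `SeveralIteration(50)` used later. The sketch below is only an estimate: it assumes about 105 training rows (70% of 150; the actual `randomSplit` count varies) and that a final partial batch counts as one iteration, which may not match how BigDL forms batches across partitions.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_train = 105  # assumed: roughly 70% of 150 samples
per_epoch = iterations_per_epoch(n_train, batch_size=32)
total_iterations = per_epoch * 100  # 100 training epochs

print(per_epoch, total_iterations)
```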

### 7.2 - Setup BigDL network

In [None]:
from bigdl.nn.layer import Sequential, Linear, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from zoo.pipeline.nnframes import  NNClassifier
from bigdl.optim.optimizer import Adam, SGD, Adagrad

nn = Sequential()\
     .add(Linear(n_input, n_hidden_1))\
     .add(Linear(n_hidden_1, n_classes))\
     .add(LogSoftMax())

estimator = NNClassifier(nn, ClassNLLCriterion(), [n_input])
estimator.setMaxEpoch(training_epochs)\
            .setBatchSize(batch_size)\
            .setLearningRate(learning_rate)
estimator.setLabelCol("label").setFeaturesCol("features")

# optimizer method, default is SGD
estimator.setOptimMethod(Adam())

print ("nn \n", nn)
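
The network ends in `LogSoftMax` and is trained with `ClassNLLCriterion`. The plain-Python sketch below shows what that pairing computes for a single sample. It assumes 1-based class labels, which matches the 1/2/3 labels in this dataset; the scores are made up and the helper functions are illustrations, not BigDL code.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_sum for s in scores]


def class_nll(log_probs, label):
    """Negative log-likelihood for a 1-based class label."""
    return -log_probs[label - 1]


scores = [2.0, 0.5, -1.0]  # made-up output of the last Linear layer
log_probs = log_softmax(scores)

loss_if_correct = class_nll(log_probs, 1)  # label 1 has the highest score
loss_if_wrong = class_nll(log_probs, 3)    # label 3 has the lowest score
print(loss_if_correct < loss_if_wrong)
```

The loss is small when the true class has the highest score and grows as the network assigns it less probability.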


## Step 8 - Setup TensorBoard

### 8.1 - Clean up the TensorBoard logs directory

In [None]:
import os
import shutil
# import datetime as dt


tensorboard_dir=os.environ.get('TENSORBOARD_DIR', '/tmp/tensorboard-logs')
print("TENSORBOARD_DIR : ", tensorboard_dir)

## TODO : give an app name
app_name='???' #+dt.datetime.now().strftime("%Y%m%d-%H%M%S")
base_path = os.path.abspath(os.path.join(tensorboard_dir, app_name))

# clean old logs
try:
    print ("Cleaning : ", base_path)
    shutil.rmtree(base_path)
#     shutil.rmtree('/private' + base_path)  # On Mac
except OSError:
    pass
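
An alternative to the `try`/`except` pattern above is `shutil.rmtree(..., ignore_errors=True)`. The sketch below demonstrates it on a temporary directory so it does not touch real logs; `demo-app` is a hypothetical app name used only for this example.

```python
import os
import shutil
import tempfile

# Stand-in for tensorboard_dir so the example does not touch real logs
demo_root = tempfile.mkdtemp()
demo_path = os.path.join(demo_root, "demo-app")  # "demo-app" is a hypothetical app name
os.makedirs(os.path.join(demo_path, "train"))

shutil.rmtree(demo_path, ignore_errors=True)  # removes the existing directory
shutil.rmtree(demo_path, ignore_errors=True)  # second call is a silent no-op
removed = not os.path.exists(demo_path)
print("removed:", removed)

shutil.rmtree(demo_root, ignore_errors=True)  # clean up the temporary root
```

Note that `ignore_errors=True` also hides failures such as permission errors, so the explicit `try`/`except OSError` form is the better choice when you want to handle those differently.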



### 8.2 - Setup validation parameters

In [None]:
from bigdl.optim.optimizer import EveryEpoch, Top1Accuracy, TrainSummary, SeveralIteration, ValidationSummary

estimator.setValidation(EveryEpoch(), \
                        validation, \
                        [Top1Accuracy()], \
                        batch_size)

## TODO : create a training summary,
##   hint : log_dir=tensorboard_dir
##          app_name=app_name
train_summary = TrainSummary(log_dir=???, app_name=???)
train_summary.set_summary_trigger("Parameters", SeveralIteration(50))

## TODO : create a validation summary
##   hint : log_dir=tensorboard_dir
##          app_name=app_name
val_summary = ValidationSummary(log_dir=???, app_name=???)

log_path = os.path.abspath(os.path.join(tensorboard_dir, app_name))
print("saving logs to ",log_path)

## TODO : set training summary (train_summary)
estimator.setTrainSummary(???)

## TODO : set validation summary (val_summary)
estimator.setValidationSummary(???)

## Step 9 - Train the network

### 9.1 - Train

In [None]:
%%time 

## training
print ("starting training...")
model = estimator.fit(training)
print("training finished.\n")

### 9.2 - Visualize Learning

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

loss = np.array(train_summary.read_scalar("Loss"))
top1 = np.array(val_summary.read_scalar("Top1Accuracy"))

plt.figure(figsize = (12,12))
plt.subplot(2,1,1)
plt.plot(loss[:,0],loss[:,1],label='loss')
plt.xlim(0,loss.shape[0]+10)
plt.grid(True)
plt.title("loss")
plt.subplot(2,1,2)
plt.plot(top1[:,0],top1[:,1],label='top1')
plt.xlim(0,loss.shape[0]+10)
plt.title("top1 accuracy")
plt.grid(True)
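
Per-iteration loss on small batches is usually noisy, so a smoothed curve can be easier to read. Here is a minimal sketch of a trailing moving average. The records are made-up `(iteration, loss)` pairs standing in for the output of `train_summary.read_scalar("Loss")`, and `moving_average` is a helper written for this example.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over fewer samples."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up (iteration, loss) records for illustration only
records = [(1, 1.2), (2, 0.8), (3, 1.0), (4, 0.6), (5, 0.7)]
iterations = [r[0] for r in records]
smoothed = moving_average([r[1] for r in records], window=3)

for it, value in zip(iterations, smoothed):
    print(it, round(value, 3))
```

The smoothed values can be plotted against `iterations` alongside the raw loss with the same `plt.plot` calls used above.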

### 9.3 - Predict
We use the validation dataset for prediction.

In [None]:
%%time

predictions = model.transform(validation)

In [None]:
predictions.groupBy("prediction").count().show()
predictions.show()


## Step 10 - Evaluate the model

### 10.1 - Basic stats

In [None]:
print ("matching predictions ", predictions.filter("prediction == label").count())
print ("missed predictions ", predictions.filter("prediction != label").count())

### 10.2 - Accuracy, Precision, AUC

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

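# NOTE: BinaryClassificationEvaluator assumes a two-class problem and its default
# metric is areaUnderROC; with three classes, treat this value as indicative only.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
auROC = evaluator.evaluate(predictions)
print("Area under ROC curve = " , auROC)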
    
recall = MulticlassClassificationEvaluator(metricName="weightedRecall").evaluate(predictions)
print("recall = " , recall)

precision = MulticlassClassificationEvaluator(metricName="weightedPrecision").evaluate(predictions)
print("Precision = ", precision)

accuracy = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy").\
            evaluate(predictions)
print("accuracy = ",  accuracy)
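
To see what the weighted metrics mean, here is a plain-Python sketch that computes them from `(label, prediction)` pairs. Each class's precision and recall is weighted by the share of true samples in that class. The pairs are made up for illustration, and the function is a simplified stand-in for the Spark evaluator, not its actual implementation.

```python
from collections import Counter


def weighted_precision_recall(pairs):
    """pairs: list of (label, prediction). Returns (weighted precision, weighted recall)."""
    label_counts = Counter(label for label, _ in pairs)
    pred_counts = Counter(pred for _, pred in pairs)
    correct = Counter(label for label, pred in pairs if label == pred)
    total = len(pairs)

    precision = recall = 0.0
    for cls, n_cls in label_counts.items():
        weight = n_cls / total  # share of true samples in this class
        if pred_counts[cls]:    # skip classes that were never predicted
            precision += weight * correct[cls] / pred_counts[cls]
        recall += weight * correct[cls] / n_cls
    return precision, recall


# Made-up (label, prediction) pairs for illustration only
pairs = [(1, 1), (1, 1), (2, 2), (2, 3), (3, 3), (3, 3)]
precision, recall = weighted_precision_recall(pairs)
accuracy = sum(1 for label, pred in pairs if label == pred) / len(pairs)

print(round(precision, 4), round(recall, 4), round(accuracy, 4))
```

With this weighting, weighted recall always equals overall accuracy, so the two values are expected to match.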

### 10.3 - Confusion Matrix

In [None]:
# Confusion matrix
# we use Spark to calculate confusion matrix as the prediction set can be rather large
cm = predictions.groupBy('label').pivot('prediction', [1,2,3]).count().na.fill(0).orderBy('label')
cm.show()
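
For a small result set, the same table can be built locally, which makes the layout easy to check: rows are true labels and columns are predictions. This sketch uses made-up `(label, prediction)` pairs, and `confusion_matrix` is a helper written for this example.

```python
from collections import Counter


def confusion_matrix(pairs, classes):
    """Rows are true labels, columns are predictions, both in the order of `classes`."""
    counts = Counter(pairs)
    return [[counts[(label, pred)] for pred in classes] for label in classes]


# Made-up (label, prediction) pairs for illustration only
pairs = [(1, 1), (1, 1), (2, 2), (2, 3), (3, 3), (3, 3)]
classes = [1, 2, 3]
matrix = confusion_matrix(pairs, classes)

for label, row in zip(classes, matrix):
    print(label, row)
```

Correct predictions land on the diagonal; the single off-diagonal entry here is a class 2 sample predicted as class 3.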

In [None]:
## Plot
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn

cm_pd = cm.toPandas()
# print(cm_pd)
cm_pd = cm_pd.set_index('label')  # make 'label' as index
# print(cm_pd)

plt.figure(figsize = (10,8))
sn.heatmap(cm_pd, annot=True,fmt='d');