# Lab : Diabetes

### Overview
Explore the Pima Indians diabetes dataset and train a neural network classifier to predict the outcome.

### Runtime
30 mins

## Step 1 - About Data

[About data](https://www.kaggle.com/uciml/pima-indians-diabetes-database)

This is a classification dataset: based on eight inputs (`a,b,c,d,e,f,g,h`), we predict the `outcome` (0 or 1).

Sample Data:

```
a,b,c,d,e,f,g,h,outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
```
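
To make the split between inputs and label concrete, here is a small sketch in plain Python (no Spark needed) that parses the sample rows above and separates the eight input columns from `outcome`. The variable names are illustrative only and are not used elsewhere in the lab.

```python
import csv
import io
from collections import Counter

# The four sample rows shown above
sample = """a,b,c,d,e,f,g,h,outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Every column except 'outcome' is an input feature
feature_names = [name for name in rows[0] if name != "outcome"]
features = [[float(row[name]) for name in feature_names] for row in rows]
labels = [int(row["outcome"]) for row in rows]

print(feature_names)    # the 8 input columns
print(features[0])      # first row as a list of floats
print(Counter(labels))  # number of rows per class
```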

## Step 2 - Init

In [None]:
from zoo.common.nncontext import init_nncontext
import zoo.version

## TODO : use 'init_nncontext ("your app name")' to initialize the app
sc = ???("???")
print("zoo version : ", zoo.version.__version__)

## Spark UI
print('Spark UI running on http://localhost:' + sc.uiWebUrl.split(':')[2])
sc

## Step 3 - Explore Dataset

### 3.1 - Load Data

In [None]:
data = spark.read.csv("../../data/diabetes/pima-indians-diabetes-data.csv",
                      header=True, inferSchema=True)
print("record count ", data.count())
data = data.na.drop()
print ("clean data count ", data.count())
data.printSchema()
data.show()

### 3.2 - Basic Exploration

In [None]:
data.describe().toPandas().T

In [None]:
## TODO : see how the data is distributed across outcomes
## Hint : groupBy('outcome')

data.groupBy("???").count().show()


### 3.3 - Graph

In [None]:
## basic frequency graph

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

a = data.groupBy("outcome").count().toPandas()
print(a)
a = a.set_index('outcome')
a.plot(kind='bar', rot=0)
plt.show()

## Step 4 - Create Feature Vectors

### 4.1 - No zeroes in Target Label column

In [None]:
# BigDL expects class labels to start at 1, not 0 (zero),
# so we add 1 to the label (0/1 becomes 1/2)

## TODO : create another column 'outcome2'
## We add 1 to 'outcome' column
data = data.withColumn("???", data['outcome']+1)

## TODO : group by 'outcome2'
data.groupBy("???").count().show()


data.show(10)

### 4.2 - Convert to Double
BigDL expects all numeric columns to be of type Double.

In [None]:
from pyspark.sql.functions import col, udf

## TODO : convert everything to double
## Hint : cast all columns to 'double'
data = data.select([col(c).cast("???") for c in data.columns])
data.printSchema()
data.show(5)


### 4.3 - Feature Vector

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import DoubleType

## TODO : create a feature vector from columns : 'a','b','c','d','e','f','g', 'h'
assembler = VectorAssembler (inputCols=['a', '?', '?'], outputCol='assembled')
fv = assembler.transform(data)

fv = fv.withColumn ('label', fv['outcome2'])
fv.show(5)

### 4.4 - Scaling

In [None]:
from pyspark.ml.feature import StandardScaler

## TODO : scale 'assembled' --> 'scaled'
scaler = StandardScaler (inputCol="???", outputCol="???")

fv = scaler.fit(fv).transform(fv)
fv.show(5)
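
With its default settings (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by its sample standard deviation without centering it. The sketch below reproduces that arithmetic in plain Python for a single column, using the values of column `b` from the sample rows. The helper name `scale_to_unit_std` is made up for illustration, and its handling of a constant column is an assumption of this sketch.

```python
import statistics

def scale_to_unit_std(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    if std == 0:
        # Constant column: nothing to scale (choice made for this sketch)
        return [0.0 for _ in values]
    return [v / std for v in values]

column_b = [148.0, 85.0, 183.0, 89.0]  # column 'b' from the sample rows
scaled_b = scale_to_unit_std(column_b)

print(scaled_b)
print(statistics.stdev(scaled_b))  # close to 1.0 after scaling
```

Because the values are only divided, not shifted, the scaled column keeps its sign and relative ordering.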

### 4.5 - Convert label & feature to arrays
BigDL expects features as an Array\[\] column, so we convert the scaled Spark ML Vector to an array. Spark ML Vector support is coming soon.

In [None]:
# Add Utils dir to load path

import os
import sys
cwd = os.getcwd()
# print ("cwd : ", cwd)
utils_dir = os.path.abspath(os.path.join(cwd, "../utils"))
# print("utils dir : ", utils_dir)
if utils_dir not in sys.path:
    sys.path.append(utils_dir)
print ("sys.path: " , sys.path)

my_utils_pyfile = os.path.abspath(os.path.join(utils_dir, 'my_utils.py'))
print ("my_utils file : ", my_utils_pyfile)

from my_utils import dense_to_array_udf, sparse_to_array_udf

# add file to spark
sc.addPyFile(my_utils_pyfile)

In [None]:
## convert scaled(vector) --> features(array)
fv = fv.withColumn('features', dense_to_array_udf('scaled'))

fv.printSchema()
fv.limit(5).toPandas()

## Step 5 - Split training / validation

In [None]:
## TODO : split the data into 70% training / 30% validation

(training, validation) = fv.randomSplit([???, ???])
print("training set count ", training.count())
print("validation set count ", validation.count())
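
Note that `randomSplit` assigns rows at random, so the two counts will be close to, but usually not exactly, 70% and 30%. The plain-Python sketch below illustrates the idea; it is not Spark's implementation, and `random_split` is a hypothetical helper. Passing a seed (which `randomSplit` also accepts as its second argument) makes the split reproducible.

```python
import random

def random_split(rows, train_fraction, seed=None):
    """Assign each row independently to the training or validation set."""
    rng = random.Random(seed)
    training, validation = [], []
    for row in rows:
        if rng.random() < train_fraction:
            training.append(row)
        else:
            validation.append(row)
    return training, validation

rows = list(range(1000))  # stand-in for DataFrame rows
training, validation = random_split(rows, 0.7, seed=42)

# Roughly 700 / 300, but not exact
print(len(training), len(validation))
```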

## Step 6 - Design Network

### 6.1 - Designing the network
Here's a picture of a simple neural network, like what we have in this example:

<img src="../../media/diabetes-hidden-layer-skitch.png">


As you can see, we have a total of 3 layers:

1. Input layer (sized as number of features -- in this case 8 : 'a' -- 'h')
2. Hidden Layer (size we have to specify as part of the model).
3. Output Layer (Number of output classes we are trying to classify -- in this case 2)
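
To get a feel for how small this network is, the sketch below counts its trainable parameters. It assumes each layer is fully connected with one bias per output neuron, and it uses a hidden size of 5 (the value set later in Step 7.1). `linear_param_count` is a hypothetical helper, not a BigDL function.

```python
def linear_param_count(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

n_input, n_hidden, n_classes = 8, 5, 2  # hidden size 5 is an example value

hidden_params = linear_param_count(n_input, n_hidden)    # input -> hidden
output_params = linear_param_count(n_hidden, n_classes)  # hidden -> output

print(hidden_params, output_params, hidden_params + output_params)
# prints: 45 12 57
```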

### 6.2 - Sizing hidden layers

Sizing hidden layers can be a challenge. The best way to figure this out is to do it empirically. However, we may need a "rule of thumb" to start. Here is a good rule of thumb:

First Hidden Layer:

```
n_hidden_1 = np.sqrt(np.sqrt((n_classes + 2) * n_input) + 2 * np.sqrt(n_input /(n_classes+2.)))
```

Second Hidden Layer: (if needed)

```
n_hidden_2 = n_classes * np.sqrt(n_input / (n_classes + 2.))
```

In this case, we have a VERY simple dataset. We may not need two hidden layers. Let's start with one.
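
As a worked example of the two formulas, the sketch below evaluates them for a made-up problem with 20 inputs and 3 classes (these numbers are not from this lab, they were chosen only because the arithmetic comes out cleanly). The function name `hidden_size_guesses` is hypothetical.

```python
import math

def hidden_size_guesses(n_input, n_classes):
    """Apply the rule-of-thumb formulas for the first and second hidden layer."""
    n_hidden_1 = math.sqrt(
        math.sqrt((n_classes + 2) * n_input)
        + 2 * math.sqrt(n_input / (n_classes + 2))
    )
    n_hidden_2 = n_classes * math.sqrt(n_input / (n_classes + 2))
    return n_hidden_1, n_hidden_2

# sqrt(5 * 20) = 10 and 2 * sqrt(20 / 5) = 4, so layer 1 is sqrt(14)
# Layer 2 is 3 * sqrt(20 / 5) = 6
guess_1, guess_2 = hidden_size_guesses(n_input=20, n_classes=3)

print(round(guess_1, 2))  # 3.74
print(round(guess_2, 2))  # 6.0
```

The results are rarely whole numbers, so treat them as a starting point and round to a nearby integer before experimenting.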

In [None]:
# Estimate the number of neurons in each hidden layer
import numpy as np

## TODO : define input / output numbers
## Hint : how many input features are we feeding?
## Hint : How many output classes?
n_input = ???  # features 'a' - 'h'
n_classes = ???  # outcome2 : 1 / 2

n_hidden_guess = np.sqrt(np.sqrt((n_classes + 2) * n_input) + 2 * np.sqrt(n_input /(n_classes+2.)))
print("Hidden layer 1 (Guess) : " + str(n_hidden_guess))

n_hidden_guess_2 = n_classes * np.sqrt(n_input / (n_classes + 2.))
print("Hidden layer 2 (Guess) : " + str(n_hidden_guess_2))

## Step 7 - Create the Network

### 7.1 - Network Parameters

In [None]:
learning_rate = 0.001
training_epochs = 100

# Batch size should be a multiple of the number of cores,
# so a power of two is a good bet. Start with 32.
batch_size = ???


n_hidden_1 = 5 # 1st layer number of neurons
n_hidden_2 = 3  # 2nd layer number of neurons

### 7.2 - Set Up Network

In [None]:
from bigdl.nn.layer import Sequential, Linear, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from zoo.pipeline.nnframes import  NNClassifier
from bigdl.optim.optimizer import Adam, SGD, Adagrad

## two layers =  input [8] + output [2]
# nn = Sequential().add(Linear(n_input, n_classes)).add(LogSoftMax())

## TODO : setup a network
## 3 layers = input [8] +  hidden1  + output [2]
## (parentheses are used for line continuation; a comment after a trailing '\' is a syntax error)
nn = (Sequential()
      .add(Linear(???, ???))   # Hint : n_input --> n_hidden_1
      .add(Linear(???, ???))   # Hint : n_hidden_1 --> n_classes
      .add(???()))             # Hint : LogSoftMax

## 4 layers = input [8] +  hidden1   +  hidden2 + output [2]
# nn = (Sequential()
#       .add(Linear(???, ???))   # n_input --> n_hidden_1
#       .add(Linear(???, ???))   # n_hidden_1 --> n_hidden_2
#       .add(Linear(???, ???))   # n_hidden_2 --> n_classes
#       .add(???()))             # LogSoftMax


criterion = ClassNLLCriterion()

## TODO : Create 'NNClassifier' with 'network', 'criterion' and 'n_input'
estimator = ???(???, ???, [???])

## TODO : set other network parameters
estimator.setMaxEpoch(???)\
            .setBatchSize(???)\
            .setLearningRate(???)

estimator.setLabelCol("label").setFeaturesCol("features")

## TODO : set the optimizer method to 'Adam()', default is SGD
estimator.setOptimMethod(???())

print ("nn \n", nn)
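
The last layer, `LogSoftMax`, turns the raw scores of the output layer into log-probabilities, and `ClassNLLCriterion` then takes the negative log-probability of the correct class as the loss. The plain-Python sketch below shows that arithmetic for one sample. It assumes 1-based class labels, as prepared in Step 4.1, and the scores are made up. This is an illustration of the math, not BigDL's implementation.

```python
import math

def log_softmax(scores):
    """Convert raw scores to log-probabilities (numerically stable form)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll_loss(log_probs, label):
    """Negative log-likelihood for a 1-based class label."""
    return -log_probs[label - 1]

scores = [0.5, 2.0]  # made-up raw outputs of the last Linear layer
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(probs)                   # the two probabilities sum to 1
print(nll_loss(log_probs, 2))  # smaller loss: class 2 has the higher score
print(nll_loss(log_probs, 1))  # larger loss: class 1 has the lower score
```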

## Step 8 - Train / Predict

### 8.1 - Train

In [None]:
%%time 

## training
print ("starting training...")
## TODO : train using 'fit' method , pass in 'training' data
model = estimator.???(???)
print("initial model training finished.")

# TODO : notice the time it took for training

### 8.2 -  Prediction

In [None]:
%%time

## TODO : predict using 'validation'
predictions = model.transform(???)

In [None]:
predictions.groupBy("prediction").count().show()
predictions.sample(False, 0.1).limit(5).toPandas()

## Step 9 - Evaluating

### 9.1 - Basic Eval

In [None]:
print ("matching predictions ", predictions.filter("prediction == label").count())
print ("missed predictions ", predictions.filter("prediction != label").count())

### 9.2 - Accuracy, Precision, AUC

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

# The default metric of BinaryClassificationEvaluator is 'areaUnderROC'
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", metricName="areaUnderROC")
auROC = evaluator.evaluate(predictions)
print("Area under ROC curve = " , auROC)
    
recall = MulticlassClassificationEvaluator(metricName="weightedRecall").evaluate(predictions)
print("recall = " , recall)

precision = MulticlassClassificationEvaluator(metricName="weightedPrecision").evaluate(predictions)
print("Precision = ", precision)

accuracy = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy").\
            evaluate(predictions)
print("accuracy = ",  accuracy)
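
To see what these numbers mean, the sketch below computes accuracy, weighted precision and weighted recall by hand in plain Python. The (label, prediction) pairs are made up for illustration and are not results from this lab. Each class is weighted by its share of the true labels.

```python
from collections import Counter

# Made-up (label, prediction) pairs, using the lab's 1 / 2 class labels
pairs = [(1, 1), (1, 1), (1, 2), (2, 2), (2, 1), (2, 2), (1, 1), (2, 2)]

confusion = Counter(pairs)  # (label, prediction) -> count
classes = sorted({c for pair in pairs for c in pair})
total = len(pairs)

# Accuracy: share of rows where prediction equals label
accuracy = sum(confusion[(c, c)] for c in classes) / total

weighted_precision = 0.0
weighted_recall = 0.0
for c in classes:
    true_pos = confusion[(c, c)]
    predicted_c = sum(confusion[(other, c)] for other in classes)
    actual_c = sum(confusion[(c, other)] for other in classes)
    precision_c = true_pos / predicted_c if predicted_c else 0.0
    recall_c = true_pos / actual_c if actual_c else 0.0
    weight = actual_c / total  # class share of the true labels
    weighted_precision += weight * precision_c
    weighted_recall += weight * recall_c

print(accuracy, weighted_precision, weighted_recall)
```

With these weights, weighted recall always equals accuracy, so the two values printed by the lab should match.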

### 9.3 - Confusion Matrix

In [None]:
# Confusion matrix
# we use Spark to calculate confusion matrix as the prediction set can be rather large
cm = predictions.groupBy('label').pivot('prediction', [1,2]).count().na.fill(0).orderBy('label')
cm.show()

In [None]:
# basic imports

import matplotlib.pyplot as plt
import seaborn as sn

cm_pd = cm.toPandas()
# print(cm_pd)
cm_pd = cm_pd.set_index('label')  # make 'label' as index
# print(cm_pd)

plt.figure(figsize = (10,8))
sn.heatmap(cm_pd, annot=True,fmt='d');

## Step 10 - Experiment
Try the following:
- increase the number of layers (3 --> 4 --> 5)
- adjust the number of neurons in each hidden layer

See if you can improve the accuracy and the confusion matrix.