# Machine Learning (and Deep Learning) in just over 200 minutes with R
Created by [Ajay Hemanth](https://www.linkedin.com/in/ajayhemanth/?originalSubdomain=au) with [Yoni Nazarathy](https://yoninazarathy.com/). 

See more material at the [workshop's GitHub page](https://github.com/ajayhemanth/Machine-Learning-Workshop).

--- 
# Activity C: Dense Neural Networks
---

# Dense Neural Networks (aka Multi-Layer Perceptrons)

## Classification

### Dataset
We will use the [Covertype dataset](https://archive.ics.uci.edu/ml/datasets/covertype), available in the `dataset` folder, to work with an MLP.
<br>The task is to predict the `Forest Cover type` from the following features.

<img src="img/Dataset.png" />


In [1]:
# Make + work as a string concatenation operator as well. Don't worry about this cell.
"+" = function(x,y) { if(is.character(x) || is.character(y)) return(paste(x , y, sep="")) else .Primitive("+")(x,y) }

In [None]:
library("keras")

In [None]:
d1 = read.csv("dataset/covtype.data", header=FALSE, stringsAsFactors=FALSE)
colnames(d1) = c('Elevation', 'Aspect', 'Slope', 'HorHydro', 'VertHydro', 'HorRoad', 'Shade9', 'Shade12', 'Shade15', 'HorFire', 'WildernessArea'+(1:4), 'SoilType'+(1:40), 'CoverType')

In [None]:
# All columns except the last one are features; the last column (CoverType) is the label
train_data  =  d1[  , -ncol(d1) ]
train_labels = d1[  ,  ncol(d1) ]

#### Input Data Format
Keras requires the input data to be a matrix with features as columns and observations as rows.
<br>
The response variable is required to be a binary matrix with each class in a separate column, which is achieved using the `to_categorical` function of Keras. `to_categorical` requires the labels to start from `0`. Since the labels in our dataset start at `1`, we subtract 1 from the label values.

In [None]:
train_data = as.matrix(train_data)
train_labels = to_categorical(train_labels-1)
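
To make the target format concrete, here is a minimal base-R sketch of the kind of binary matrix described above. The toy labels and the helper `one_hot` are hypothetical and only mimic the shape of the encoded labels; they are not part of Keras, and the helper takes labels starting at 1 for simplicity.

```r
# Hypothetical labels in 1..3 (a toy stand-in for the cover types 1..7)
labels <- c(1, 3, 2, 1)

# Illustrative helper (not a Keras function): one row per observation,
# one column per class, with a single 1 marking the observation's class
one_hot <- function(labels, num_classes = max(labels)) {
  m <- matrix(0, nrow = length(labels), ncol = num_classes)
  m[cbind(seq_along(labels), labels)] <- 1
  m
}

encoded <- one_hot(labels)
print(encoded)
```

Each row sums to 1, and the position of the 1 identifies the class.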

<img src="img/Network.png" width="500" />

#### Linear Stack of Layers
A Keras model is composed of layers of compute units. The most common arrangement is a linear stack of layers.
<br> We define a linear stack using the `keras_model_sequential` function.
<br> We then add layers to the stack one by one.

In [None]:
model = keras_model_sequential() 

#### Input Layer
The first layer of MLP is the input layer which receives the input data.
<br> The total number of nodes in this layer is equal to the total number of input features.

#### Hidden Layer
A hidden layer is where the computation takes place. Several hidden-layer settings change the performance of the model, a few of which are the `activation function`, the `number of hidden layers in a network`, and the `number of hidden units in a layer`. The code below adds two dense layers of 200 units each, both with the `relu` activation function. The first one also declares the input shape (it is the layer referred to as the input layer in the parameter count below), and the second one is the hidden layer.

In [None]:
model = model  %>% layer_dense(
    input_shape = ncol(train_data), 
    units = 200, 
    activation = 'relu',
    name="layer1"
)

In [None]:
model = model  %>% layer_dense(
    units = 200, 
    activation = 'relu',
    name="layer2"
)
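
As a rough illustration of what a single dense layer computes, the base-R sketch below applies `relu(X %*% W + b)` to a toy input. The sizes, the random weights, and the helper names `relu` and `dense_forward` are assumptions for illustration only; Keras initializes and trains its own weights.

```r
set.seed(1)

relu <- function(x) pmax(x, 0)   # element-wise max(x, 0)

# Illustrative forward pass of one dense layer (not Keras code)
dense_forward <- function(X, W, b) {
  # sweep() adds bias b[j] to every entry of column j
  relu(sweep(X %*% W, 2, b, "+"))
}

X <- matrix(rnorm(2 * 3), nrow = 2)   # 2 observations, 3 features
W <- matrix(rnorm(3 * 4), nrow = 3)   # 3 inputs -> 4 units
b <- rep(0.1, 4)                      # one bias per unit

out <- dense_forward(X, W, b)
dim(out)   # 2 rows (observations), 4 columns (units)
```

The output has one column per unit, which is why the next layer only needs to know the number of units of the layer before it.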

#### Output Layer

The output layer is the last layer of the network, which gives the prediction.
<br> The choice of `activation function` depends on whether the task is classification or regression. For classification there are a couple of options, with `softmax` being a common one for multi-class classification.
<br> The total number of units in the output layer should be equal to the total number of classes.

In [None]:
model = model  %>% layer_dense(units = ncol(train_labels), activation = 'softmax', name="layer3")
model
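
For intuition about the `softmax` activation, the following base-R sketch turns a vector of raw scores into class probabilities. The helper `softmax` and the scores are hypothetical; this is not how Keras is called, only what the activation computes.

```r
# Illustrative softmax (base R); subtracting max(z) avoids overflow in exp()
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

scores <- c(2.0, 1.0, 0.1)   # hypothetical raw outputs for 3 classes
probs <- softmax(scores)
print(round(probs, 3))
sum(probs)                   # the probabilities sum to 1
```

The largest score gets the largest probability, and all outputs are positive and sum to 1.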

#### Calculating total number of model parameters

Printing `model` shows the following. Below is the calculation of the number of parameters in each layer.<br>
- The 1st layer (input layer) is connected to the input, so its number of parameters = number of input columns (54) × number of units in the 1st layer (200) + 200 biases (1 for each node) = 11,000 parameters <br>
- The 2nd layer (hidden layer) is connected to the 1st layer, so its number of parameters = number of units in the 1st layer (200) × number of units in the 2nd layer (200) + 200 biases (1 for each node) = 40,200 parameters <br>
- The 3rd layer (output layer) is connected to the 2nd layer, so its number of parameters = number of units in the 2nd layer (200) × number of units in the 3rd layer (7) + 7 biases (1 for each node) = 1,407 parameters <br>
- The total number of parameters is the sum over all layers: 11,000 + 40,200 + 1,407 = 52,607. All of them are trainable unless a layer's weights are set to non-trainable.
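
The arithmetic above can be checked with a short base-R sketch. The helper `dense_params` is a hypothetical name introduced here; the layer sizes are the ones from the text.

```r
# Parameters of a dense layer: one weight per (input, unit) pair plus one bias per unit
dense_params <- function(n_inputs, n_units) n_inputs * n_units + n_units

layer_sizes <- c(54, 200, 200, 7)   # inputs, layer1, layer2, layer3
per_layer <- dense_params(head(layer_sizes, -1), tail(layer_sizes, -1))

per_layer        # 11000 40200 1407
sum(per_layer)   # 52607
```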

<br>
<div>
<img src="img/Model.PNG" width="500"/>
</div>

#### Configure Loss Function, Optimizer Algorithm, Error Metric using `compile` function

- Loss Functions - https://keras.io/api/losses/
- Optimizer Algorithms - https://keras.io/api/optimizers/
- Error Metrics - https://keras.io/api/metrics/ 

compile R function - https://keras.rstudio.com/reference/compile.html

In [None]:
model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = 'accuracy'
)
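
To show what the loss and the metric measure, here is a minimal base-R sketch of categorical cross-entropy and accuracy on toy values. The helper `cce_loss`, the clipping constant `eps`, and the data are assumptions for illustration; Keras' own implementation may differ in details.

```r
# Illustrative categorical cross-entropy for one-hot targets (base R)
# y_true: one-hot matrix, y_pred: predicted class probabilities (rows sum to 1)
cce_loss <- function(y_true, y_pred, eps = 1e-7) {
  y_pred <- pmin(pmax(y_pred, eps), 1 - eps)   # guard against log(0)
  mean(-rowSums(y_true * log(y_pred)))         # average over observations
}

y_true <- rbind(c(1, 0, 0), c(0, 1, 0))               # hypothetical targets
y_pred <- rbind(c(0.7, 0.2, 0.1), c(0.1, 0.8, 0.1))   # hypothetical predictions

loss <- cce_loss(y_true, y_pred)
# Accuracy: share of rows where the most probable class is the true class
accuracy <- mean(max.col(y_pred) == max.col(y_true))
```

The loss only looks at the probability assigned to the true class, so confident correct predictions give a loss near 0.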

#### Fit Model to Training Data

When fitting the model you get to choose 
- batch size
- epochs
- validation split
- class weight
- ... 

The arguments of the `fit` function are in
https://keras.rstudio.com/reference/fit.html

In [None]:
history <- fit(
    object = model,
    x = train_data,
    y = train_labels,
    epochs = 25,
    validation_split = 0.2
)

plot(history)
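
One detail of `validation_split` worth knowing: Keras takes the validation rows from the end of the data, before any shuffling. The base-R sketch below mimics that on a toy matrix and shows one way to shuffle rows beforehand. The toy data, the variable names, and the `floor` rounding are assumptions for illustration, not the Keras implementation.

```r
# Toy stand-in for the training matrix: 10 rows, 2 columns (hypothetical data)
x <- matrix(1:20, nrow = 10)
validation_split <- 0.2

n <- nrow(x)
n_val <- floor(n * validation_split)   # assumed rounding; Keras' exact rounding may differ
val_idx <- (n - n_val + 1):n           # the LAST rows form the validation set
train_idx <- setdiff(seq_len(n), val_idx)

# Shuffling the rows first gives a validation set that is not tied to file order
set.seed(42)
shuffled <- sample(n)
x_shuffled <- x[shuffled, , drop = FALSE]
```

If rows are shuffled this way, the labels must be reordered with the same index vector.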

#### Output of Un-normalized Data

Below is the output for the un-normalized data. Ideally, an MLP requires the data to be normalized. If the data is not normalized, the model takes many more epochs to converge to reasonably good weights, which requires more compute capacity. Also, if the scales of the features are vastly different, training may get stuck in a local minimum far from the global minimum.

<img src="img/denorm.png" width="400"/>


#### Predicting New Data using the Built Model
The built model can be used to predict for new data points using the `predict` function. The test data should have the same structure as the training data. For demo purposes, the test data is taken to be the first two rows of the training data.

In [None]:
test_data = train_data[1:2,]
predict(model, test_data)

The prediction output is the probability of the new data points belonging to each class, and we classify the data into the class with maximum probability. <br>
<img src="img/PredictKeras.PNG" />
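
Picking the class with maximum probability can be sketched in base R as follows. The probability matrix is hypothetical and merely stands in for the output of `predict`.

```r
# Hypothetical predicted probabilities for 2 observations and 7 classes
pred <- rbind(
  c(0.05, 0.10, 0.02, 0.03, 0.70, 0.05, 0.05),
  c(0.10, 0.60, 0.05, 0.05, 0.10, 0.05, 0.05)
)

# Column index of the largest probability in each row; columns are ordered
# as classes 1..7 because 1 was subtracted from the labels before encoding
predicted_class <- apply(pred, 1, which.max)
predicted_class   # 5 2
```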

#### Performance of MLP with Normalized Input Data

On checking the summary of the training data, it can be observed that the first ten columns are on a much larger scale than the remaining columns.

In [None]:
summary(train_data)

One way of normalizing is the z-score: $(x-\mu) / \sigma$ <br>

In [None]:
d2 = d1 

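# Column means and standard deviations of the ten continuous features
means1 = sapply(d1[,1:10], mean)
sd1 = sapply(d1[,1:10], sd)

# Transposing puts features in rows, so the mean/sd vectors recycle per feature
d2[ , 1:10 ] = t((t(d1[ , 1:10 ]) - means1)  / sd1)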
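
As a cross-check of the z-score step, the sketch below compares the transpose-based computation with base R's `scale()` on a toy data frame. The data frame `toy` and its values are hypothetical.

```r
# Toy data frame with two columns on very different scales (hypothetical values)
toy <- data.frame(a = c(1, 2, 3, 4), b = c(100, 300, 500, 900))

# Same transpose-based approach as in the cell above
m <- sapply(toy, mean)
s <- sapply(toy, sd)
z_manual <- t((t(toy) - m) / s)

# Base R's scale() centers by column mean and divides by column sd by default
z_scale <- scale(toy)

max(abs(z_manual - z_scale))   # ~0: both give the same z-scores
round(colMeans(z_scale), 10)   # each column now has mean 0
apply(z_scale, 2, sd)          # ... and standard deviation 1
```

Either form works; `scale()` is shorter, while the explicit version makes the formula visible.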

The rest of the steps follow as before.

In [None]:
train_data  =  d2[  , -ncol(d2) ]
train_labels = d2[  ,  ncol(d2) ]

train_data = as.matrix(train_data)
train_labels = to_categorical(train_labels-1)

In [None]:
model <- keras_model_sequential()  %>% 

layer_dense(
    input_shape = ncol(train_data), 
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(units = ncol(train_labels), activation = 'softmax')

In [None]:
model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = 'accuracy'
)

history <- fit(
    object = model,
    x = train_data,
    y = train_labels,
    epochs = 25,
    validation_split = 0.2
)

plot(history)

#### Output of Normalized Data

It can be observed that the model converges to the optimum faster than with the un-normalized data.
There are many methods of normalization. [This article](https://developers.google.com/machine-learning/data-prep/transform/normalization) provides a good explanation of the available methods.

<div>
<img src="img/norm.png" width="400" />
</div>


#### More Hidden Layers

Deeper neural networks can help predict complex patterns. The code below uses the same model parameters as the previous model, except for an additional hidden layer of 200 units.

In [None]:
model <- keras_model_sequential()  %>% 

layer_dense(
    input_shape = ncol(train_data), 
    units = 200, 
    activation = 'relu'
) %>% 

############ added layer ############

layer_dense(
    units = 200, 
    activation = 'relu'
) %>% 

############ added layer ############

layer_dense(
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(units = ncol(train_labels), activation = 'softmax')

model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = 'accuracy'
)



history <- fit(
    object = model,
    x = train_data,
    y = train_labels,
    epochs = 25,
    validation_split = 0.2
)

plot(history)

The accuracy on the training and validation sets has improved compared to the model with 1 hidden layer.
Note that the performance of the model is not proportional to the number of hidden layers or to the number of units in each hidden layer.
<img src="img/TwoHidden.png" width="400"/>

#### Batch Size
Batch size defines the number of instances processed before the weights are updated.
<br>If the batch size is 1, the weights are updated after each instance. If the batch size is equal to the size of the dataset, the weights are updated once per epoch.
<br>The default batch size is 32.
<br>The larger the batch size, the faster the execution with GPUs.
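
The relationship between batch size and the number of weight updates can be sketched in base R. The helper `updates_per_epoch` and the row count are hypothetical and chosen only to keep the numbers easy to follow.

```r
# Weight updates per epoch = number of batches needed to cover the training rows
updates_per_epoch <- function(n_train, batch_size) ceiling(n_train / batch_size)

n_train <- 1000   # hypothetical number of training rows (after the validation split)

updates_per_epoch(n_train, 1)         # 1000: one update per instance
updates_per_epoch(n_train, 32)        # 32: the last batch holds the remaining 8 rows
updates_per_epoch(n_train, n_train)   # 1: a single update per epoch
```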

In [None]:
############ same as before ############
model <- keras_model_sequential()  %>% 

layer_dense(
    input_shape = ncol(train_data), 
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(units = ncol(train_labels), activation = 'softmax')

############ same as before ############

model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = 'accuracy'
)


############ same as before ############

history <- fit(
    object = model,
    x = train_data,
    y = train_labels,
    epochs = 25,
    validation_split = 0.2,
    batch_size=nrow(train_data) # set the batch size to size of dataset
)

plot(history)

The result shows that a large batch size takes more epochs to converge. With a smaller batch size, the error typically varies more from epoch to epoch, but the model reaches a higher accuracy in fewer epochs. With a very large batch size, the model performance varies little across epochs after convergence, but convergence takes longer.

<div>
<img src="img/BatchSizeMax.png" width="400"/>
</div>

[This link](https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/) has good information on the pros and cons of varying the batch size. The plot below is taken from there and shows the performance of a model on MNIST data for varying batch sizes.

<div>
<img src="img/BatchSizeWebSite.png" width="400"/>
</div>

#### Learning Rate
Batch size and learning rate are usually tuned together. A smaller learning rate takes more epochs to converge, whereas a larger learning rate can cause the model to jump around the optimal point. Since a large batch size takes more epochs to get close to convergence, it is common practice to increase the learning rate with the batch size. However, this can depend on the specific requirements. <br>
Below is a model with a learning rate 100 times smaller than the default.

In [None]:
############ same as before ############
model <- keras_model_sequential()  %>% 

layer_dense(
    input_shape = ncol(train_data), 
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(units = ncol(train_labels), activation = 'softmax')


############ same as before ############


model %>% compile(
    loss = 'categorical_crossentropy',
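    # 0.00001 is 100 times smaller than Adam's default learning rate of 0.001
    # (recent keras releases name this argument learning_rate instead of lr)
    optimizer = optimizer_adam(lr=0.00001),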
    metrics = 'accuracy'
)



history <- fit(
    object = model,
    x = train_data,
    y = train_labels,
    epochs = 25,
    validation_split = 0.2,
    batch_size=nrow(train_data)
)

plot(history)

With a learning rate 100 times smaller than the default, it can be observed that even after 25 epochs the model doesn't seem to be anywhere close to the optimum, because the steps taken by the weights in each epoch are very small.

<img src="img/LearningRate.png" width="500"/>
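
The effect of the step size can be reproduced on a toy problem. The base-R sketch below runs plain gradient descent on `f(w) = w^2`; the function `descend` and the chosen rates are hypothetical, and this is deliberately not the Adam optimizer used in the notebook.

```r
# Plain gradient descent on f(w) = w^2 (gradient 2 * w), starting from w = 1.
# This toy problem is an illustration only, not the Adam optimizer used above.
descend <- function(lr, steps = 25, w = 1) {
  for (i in seq_len(steps)) w <- w - lr * 2 * w
  w
}

descend(lr = 0.001)   # barely moves away from 1 in 25 steps
descend(lr = 0.1)     # ends close to the optimum at 0
descend(lr = 1.1)     # overshoots on every step and diverges
```

The same 25 steps give very different outcomes depending only on the learning rate.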

## Regression

For regression, the main changes in parameters include
- the activation function of the output layer (e.g. linear)
- the loss function (e.g. mean squared error)
- the error metric (e.g. mean absolute error)
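
The two regression measures named in the list can be computed by hand in base R. The target and prediction vectors below are hypothetical.

```r
y_true <- c(2.0, 0.5, -1.0, 3.0)   # hypothetical targets
y_pred <- c(2.5, 0.0, -1.0, 2.0)   # hypothetical predictions

errors <- y_pred - y_true
mse <- mean(errors^2)      # mean squared error: penalizes large errors more
mae <- mean(abs(errors))   # mean absolute error: average size of the error

c(mse = mse, mae = mae)    # mse = 0.375, mae = 0.5
```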

For demo purposes, let's predict Elevation using all the other features.

In [None]:

d2 = d1 

means1 = sapply(d1[,1:10], mean)
sd1 = sapply(d1[,1:10], sd)

d2[ , 1:10 ] = t((t(d1[ , 1:10 ]) - means1)  / sd1)

# Converting Cover Type into Binary Matrix
d2 = cbind(
                    d2[  , -ncol(d2) ],  
    to_categorical( d2[  ,  ncol(d2) ] - 1 )
)


train_data  =  d2[  , -1 ]
train_labels = d2[  ,  1 ]

train_data = as.matrix(train_data)
train_labels = as.matrix(train_labels)



# Changing Output Activation to Linear

model <- keras_model_sequential()  %>% 

layer_dense(
    input_shape = ncol(train_data), 
    units = 200, 
    activation = 'relu',
    name="layer1"
) %>% 

layer_dense(
    units = 200, 
    activation = 'relu',
    name="layer2"
) %>% 

layer_dense(units = ncol(train_labels), activation = 'linear', name="layer3")





# Changing Loss Function to Mean Square Error, and error metric to Mean Absolute Error
model %>% compile(
    loss = 'mse',
    optimizer = optimizer_adam(),
    metrics = 'mean_absolute_error'
)

# Setting Batch Size very high for faster computation
history <- fit(
    object = model,
    x = train_data,
    y = train_labels,
    epochs = 100,
    validation_split = 0.2,
    batch_size=nrow(train_data)
)

plot(history)



<img src="img/Regression.png" width="500"/>

More examples of MLPs and other deep neural networks are available [here](https://keras.rstudio.com/articles/examples/index.html).

## Task
The MLP network for the MNIST dataset available [here](https://tensorflow.rstudio.com/guide/keras/examples/mnist_mlp/) has a test set accuracy of 98.40%. Modify the model and tune its hyper-parameters to improve its performance on the test set.