# Classification using MLP

### Dataset
We will use the Covertype dataset, available in the `dataset` folder, to test an MLP.
<br>The goal is to predict the `Forest Cover type` from the following features:

<div>
<img src="img/Dataset.png" width="800"/>
</div>


In [2]:
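# Overload `+` so that it also concatenates strings. This is only a convenience helper for building column names below.

"+" = function(x,y) {
    if(missing(y)) return(.Primitive("+")(x))  # keep unary plus working
    if(is.character(x) || is.character(y)) return(paste(x , y, sep=""))
    else .Primitive("+")(x,y)
}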

In [None]:
library("keras")

In [None]:
d1 = read.csv("dataset/covtype.data", header=FALSE, stringsAsFactors=FALSE)
colnames(d1) = c('Elevation', 'Aspect', 'Slope', 'HorHydro', 'VertHydro', 'HorRoad', 'Shade9', 'Shade12', 'Shade15', 'HorFire', 'WildernessArea'+(1:4), 'SoilType'+(1:40), 'CoverType')

In [None]:
train_data  =  d1[  , -ncol(d1) ]
train_labels = d1[  ,  ncol(d1) ]

#### Input Data Format
Keras requires the input data to be a matrix with features as columns and instances as rows.
<br>The response variable is required to be a binary matrix with each class in a separate column, which is achieved using the `to_categorical` function of keras. `to_categorical` expects integer class labels that start from `0`. Since the labels in our dataset start at `1`, we subtract 1 from them.

In [None]:
train_data = as.matrix(train_data)
train_labels = to_categorical(train_labels-1)
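
To make the encoding concrete, here is a minimal base R sketch of what a one-hot (binary matrix) encoding looks like. The helper `one_hot` and the toy labels are hypothetical and only for illustration; in the notebook itself `to_categorical` does this work.

```r
# Hand-rolled one-hot encoding for labels that start at 1 (toy example).
# `one_hot` is a hypothetical helper, not part of keras.
one_hot <- function(labels, num_classes = max(labels)) {
    # Shift the 1-based labels to 0-based class indices, as described above.
    idx <- labels - 1
    out <- matrix(0, nrow = length(labels), ncol = num_classes)
    # Matrix indexing: in row i, set the column of class idx[i] to 1.
    out[cbind(seq_along(labels), idx + 1)] <- 1
    out
}

toy_labels <- c(1, 3, 2, 3)
encoded <- one_hot(toy_labels)
print(encoded)
```

Each row contains exactly one `1`, in the column that corresponds to the instance's class.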

<div>
<img src="img/Network.png" width="500"/>
</div>


#### Linear Stack of Layers
A Keras model is composed of layers of compute units; the most common arrangement is a linear stack of layers.
<br> We define a linear stack using the `keras_model_sequential` function.
<br> We then add layers to the stack one at a time.

In [None]:
model = keras_model_sequential() 

#### Input Layer
The first layer of an MLP is the input layer, which receives the input data.
<br> The total number of nodes in this layer is equal to the total number of input features. In the code below it is declared through the `input_shape` argument of the first dense layer.

#### Hidden Layer
The hidden layers are where the computation takes place. Several parameters of the hidden layers affect the performance of the model, among them the `activation function`, the `number of hidden layers in a network`, and the `number of hidden units in a layer`. The example below uses 2 hidden layers with 200 units each and the `relu` activation function in both.

In [None]:
model = model  %>% layer_dense(
    input_shape = ncol(train_data), 
    units = 200, 
    activation = 'relu'
)

In [None]:
model = model  %>% layer_dense(
    units = 200, 
    activation = 'relu'
)
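
As a rough illustration of what one dense layer computes, the base R sketch below applies a weight matrix, a bias, and `relu` to a small random input. The names `relu` and `dense_forward` and all sizes are hypothetical; Keras performs the equivalent computation internally and learns the weights during training.

```r
# ReLU activation: negative values become 0, positive values pass through.
relu <- function(z) pmax(z, 0)

# Forward pass of one dense layer (illustrative, not the Keras implementation).
# x: instances x features, W: features x units, b: one bias per unit.
dense_forward <- function(x, W, b, activation = relu) {
    z <- x %*% W               # weighted sum for every unit
    z <- sweep(z, 2, b, "+")   # add each unit's bias to its column
    activation(z)
}

set.seed(1)
x <- matrix(rnorm(3 * 4), nrow = 3)   # 3 instances, 4 features
W <- matrix(rnorm(4 * 5), nrow = 4)   # 5 hidden units
b <- rep(0.1, 5)

h <- dense_forward(x, W, b)
print(dim(h))   # one row per instance, one column per hidden unit
```

With the 54 input features named earlier and 200 units, the first hidden layer has 54 * 200 + 200 = 11,000 trainable parameters, and the second has 200 * 200 + 200 = 40,200.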


#### Output Layer
The output layer is the last layer of the network and produces the prediction.
<br> The choice of `activation function` depends on whether the task is classification or regression. For classification there are a couple of options; `softmax` is one of them, suited to multi-class classification.
<br> The total number of units in the output layer should be equal to the total number of classes.

In [None]:
model = model  %>% layer_dense(units = ncol(train_labels), activation = 'softmax')
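
The following base R sketch shows how `softmax` turns a row of raw scores into class probabilities. The `softmax` helper and the score values are hypothetical and only meant to illustrate the idea.

```r
# Row-wise softmax (illustrative helper, not the Keras implementation).
softmax <- function(z) {
    # Subtract each row's maximum first; this avoids overflow in exp()
    # and does not change the result.
    z <- z - apply(z, 1, max)
    e <- exp(z)
    e / rowSums(e)
}

# Two instances, three classes (made-up scores).
scores <- matrix(c(2.0, 1.0, 0.1,
                   0.5, 2.5, 1.0), nrow = 2, byrow = TRUE)
probs <- softmax(scores)
print(round(probs, 3))

# The predicted class is the column with the highest probability.
predicted <- apply(probs, 1, which.max)
print(predicted)
```

Every row of `probs` sums to 1, which is why the output can be read as a probability for each class.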

#### Configure Loss Function, Optimizer Algorithm, Error Metric using `compile` function

- Loss Functions - https://keras.io/api/losses/
- Optimizer Algorithms - https://keras.io/api/optimizers/
- Error Metrics - https://keras.io/api/metrics/ 

compile R function - https://keras.rstudio.com/reference/compile.html

In [None]:
model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = 'accuracy'
)
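
To give a feel for what the chosen loss and metric measure, here is a small base R sketch that computes categorical cross-entropy and accuracy by hand. The helper names, the clipping constant `eps`, and the toy matrices are assumptions for illustration; Keras computes these quantities internally.

```r
# Mean categorical cross-entropy over instances (illustrative helper).
# y_true: one-hot matrix, y_pred: predicted probabilities, same shape.
categorical_crossentropy <- function(y_true, y_pred, eps = 1e-7) {
    # Clip probabilities away from 0 and 1 so log() stays finite.
    y_pred <- pmin(pmax(y_pred, eps), 1 - eps)
    mean(-rowSums(y_true * log(y_pred)))
}

# Fraction of instances whose most probable class matches the true class.
accuracy <- function(y_true, y_pred) {
    mean(apply(y_true, 1, which.max) == apply(y_pred, 1, which.max))
}

y_true <- matrix(c(1, 0, 0,
                   0, 1, 0,
                   0, 0, 1), nrow = 3, byrow = TRUE)
y_pred <- matrix(c(0.7, 0.2, 0.1,
                   0.1, 0.8, 0.1,
                   0.3, 0.5, 0.2), nrow = 3, byrow = TRUE)

loss <- categorical_crossentropy(y_true, y_pred)
acc <- accuracy(y_true, y_pred)
print(c(loss = loss, accuracy = acc))
```

The loss only looks at the probability assigned to the true class, so a confident wrong prediction is penalized more heavily than an unsure one.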

#### Fit Model to Training Data

When fitting the model, you can choose, among other things:
- batch size
- epochs
- validation split
- class weight
- ... 

The arguments of the `fit` function are described at the link below:
https://keras.rstudio.com/reference/fit.html

In [None]:
history <- fit(
    object = model,
    x = train_data,
    y = train_labels,
    epochs = 25,
    validation_split = 0.2
)

plot(history)
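
One detail of `validation_split` is worth keeping in mind: according to the Keras documentation, the validation rows are taken from the end of the data, before any shuffling. If the rows of a file happen to be ordered, the validation set may not be representative. The base R sketch below uses made-up toy data to show the effect and one possible remedy, shuffling the rows once before calling `fit`.

```r
set.seed(42)

n <- 10
toy_x <- matrix(seq_len(n * 2), nrow = n)   # 10 instances, 2 features
toy_y <- rep(c(1, 2), each = n / 2)         # labels sorted by class

# Without shuffling, the last 20% of rows all belong to class 2.
n_val <- floor(n * 0.2)
val_rows <- (n - n_val + 1):n
print(table(toy_y[val_rows]))

# Shuffle once, applying the same permutation to features and labels.
perm <- sample(n)
shuffled_x <- toy_x[perm, , drop = FALSE]
shuffled_y <- toy_y[perm]
```

Whether this step is needed depends on how the rows of the actual dataset are ordered.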

#### Output of Un-normalized Data

The plot below shows the training output on the un-normalized data. An MLP trains most efficiently when its input features are on a small, comparable scale, for example between 0 and 1. If the data is not normalized, the model needs many more epochs to converge to reasonably good weights, which costs more compute. Also, if the features differ vastly in scale, training is more likely to get stuck in a local minimum far from the global minimum.

<div>
<img src="img/denorm.png" width="500"/>
</div>
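
As a small illustration of scaling features to the range 0 to 1, the base R sketch below applies min-max normalization column by column. The helper `min_max` and the toy values are hypothetical; they are not taken from the dataset.

```r
# Min-max normalization of one numeric vector to the range [0, 1].
min_max <- function(x) {
    rng <- range(x)
    # A constant column has no spread; map it to 0 to avoid dividing by 0.
    if (rng[1] == rng[2]) return(rep(0, length(x)))
    (x - rng[1]) / (rng[2] - rng[1])
}

# Two made-up features on very different scales.
toy <- data.frame(Elevation = c(1800, 2500, 3200, 3900),
                  Slope     = c(0, 10, 25, 50))

scaled <- as.data.frame(lapply(toy, min_max))
print(scaled)
```

After scaling, both columns span the same range, so neither dominates the weighted sums in the first layer purely because of its units.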


#### Performance of MLP with Normalized Input Data

On checking the summary of the training data, it can be observed that the first ten columns have a much larger scale than the remaining columns, which are 0/1 indicators.

In [None]:
summary(train_data)

One way of normalizing is the z-score: $(x-\mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the feature. <br>

In [None]:
d2 = d1 

means1 = sapply(d1[,1:10], mean)
sd1 = sapply(d1[,1:10], sd)

d2[ , 1:10 ] = t((t(d1[ , 1:10 ]) - means1)  / sd1)
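
The double transpose above relies on R recycling the vectors of means and standard deviations down the columns of the transposed matrix. As a sanity check, the sketch below uses a made-up toy data frame to confirm that this gives the same result as base R's `scale` function.

```r
# Toy data frame with two features on different scales (made-up values).
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 50))

mu    <- sapply(toy, mean)
sigma <- sapply(toy, sd)

# Same transpose trick as in the cell above.
z_manual <- t((t(toy) - mu) / sigma)

# Base R equivalent: center and scale every column.
z_scale <- scale(toy)

print(all.equal(z_manual, z_scale, check.attributes = FALSE))
print(round(colMeans(z_manual), 10))   # each column now has mean 0
print(apply(z_manual, 2, sd))          # and standard deviation 1
```

Either form can be used; `scale` is shorter, while the explicit version makes the formula visible.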

The rest of the steps follow as before.

In [None]:
train_data  =  d2[  , -ncol(d2) ]
train_labels = d2[  ,  ncol(d2) ]

train_data = as.matrix(train_data)
train_labels = to_categorical(train_labels-1)

In [None]:
model <- keras_model_sequential()  %>% 

layer_dense(
    input_shape = ncol(train_data), 
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(
    units = 200, 
    activation = 'relu'
) %>% 

layer_dense(units = ncol(train_labels), activation = 'softmax')

In [None]:
model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = 'accuracy'
)

history <- fit(
    object = model,
    x = train_data,
    y = train_labels,
    epochs = 25,
    validation_split = 0.2
)

plot(history)

#### Output of Normalized Data

It can be observed that the model converges to a good solution faster than it did on the un-normalized data.
There are many methods of normalization. [This article](https://developers.google.com/machine-learning/data-prep/transform/normalization) provides a good explanation of the available methods.

<div>
<img src="img/denorm.png" width="500"/>
</div>
