## Practical I - Neural networks

Welcome to the second machine learning practical. In this practical we will discover how to run an artificial neural network algorithm. A neural network, as its name implies, takes its computational form from the way neurons in a biological system work. In essence, for a given list of inputs, a neural network performs a number of processing steps before returning an output. The complexity in neural networks comes in how many of the processing steps there are, and how complex each particular step might be.

This practical is split into an introduction followed by six sections, a summary, and a set of exercises.


### Introduction
The common tools for making neural nets in R are:

In [None]:
install.packages("neuralnet")

In [None]:
install.packages("caret")

In [None]:
install.packages("nnet")

A very simple example of how a neural network can work is through the use of logic gates. We use logical functions often in programming, but just as a refresher, an AND function is only true if both inputs are true. If one or both inputs are false, the result is false:

In [None]:
TRUE & TRUE

TRUE & FALSE

FALSE & FALSE

We can define a simple neural network as one that takes in two inputs, calculates the AND function, and gives us a result. These can be represented in graphical form where you have layers and nodes. Layers are vertical sections of the visual, and nodes are the points of computation within each layer. The mathematics of this requires the use of a bias variable, which is just a constant we add to the equation for calculation purposes and is represented as its own node, typically at the top of each layer in the neural network.

In the case of the AND function, we’ll use numeric values passed into a classification function to give a value of 1 for TRUE and 0 for FALSE. We can do this using the sigmoid function. Make sure you watch the lecture for a clear explanation of the notation.
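As a minimal sketch of the sigmoid step, the function below squashes any number into the range 0 to 1, and a cut-off then turns the result into a 0/1 class. The name `sigmoid` and the 0.5 cut-off are choices made for this illustration, not something fixed by the lecture notation.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-10, 0, 10)
activation <- sigmoid(z)

# Treat activations of at least 0.5 as TRUE (1) and the rest as FALSE (0)
predicted_class <- as.numeric(activation >= 0.5)

data.frame(z, activation, predicted_class)
```

Large negative values of z land near 0 and large positive values land near 1, which is what lets a numeric score stand in for FALSE and TRUE.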

A neural network is a set of equations that we use to calculate an outcome. They aren’t so scary if we think of them as a brain made out of computer code. Depending on the number of features we have in our data, the neural network almost becomes a “black box.” In principle, we can display the equations that make up a neural network, but at a certain level, the amount of information becomes too cumbersome to intuit easily.

Neural networks are used far and wide in industry, largely due to their accuracy. Sometimes, however, there is a trade-off between a highly accurate model and slow computation speeds. Therefore, it’s best to try **multiple** models and use neural networks *only* if they work for your particular dataset.

### Section 1 - Single-Layer Neural Networks
In the lecture we looked at the development of an AND gate in the original artificial neuron. An AND gate follows logic like this:

In [None]:
x1 <- c(0, 0, 1, 1)
x2 <- c(0, 1, 0, 1)
logic <- data.frame(x1, x2)
logic$AND <- as.numeric(x1 & x2)
logic

If you have two 1 inputs (both TRUE), your output is 1 (TRUE). However, if either of them, or both, are 0 (FALSE), your output is also 0 (FALSE). This computation is somewhat similar to logistic regression. 

For the logic gate, all you need to do is pick weights so that when x1 = 1 and x2 = 1, the result of z passed through the sigmoid function g(z) is close to 1, and close to 0 for every other input. The way the neural network goes about computing those weights is a fairly mathematical process, but it follows the same sort of logic as used in logistic regression.
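To make the idea of picking weights concrete, here is a small sketch that builds an AND gate by hand. The bias of -30 and the weights of 20 are assumptions chosen for illustration; they are not fitted values, and any set of weights with the same pattern would do.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Truth-table inputs for a two-input gate
x1 <- c(0, 0, 1, 1)
x2 <- c(0, 1, 0, 1)

# Hand-picked weights (assumed for illustration, not fitted):
# a large negative bias that only two active inputs can overcome
bias <- -30
w1 <- 20
w2 <- 20

# Weighted sum of the inputs, then the sigmoid, then rounding to 0 or 1
z <- bias + w1 * x1 + w2 * x2
and_gate <- data.frame(x1, x2, z, g_z = sigmoid(z), AND = round(sigmoid(z)))
and_gate
```

Only the row with both inputs equal to 1 gives a positive z, so it is the only row where g(z) is close to 1.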

In [None]:
library(neuralnet)

set.seed(123)
AND <- c(rep(0, 3), 1)
binary.data <- data.frame(expand.grid(c(0, 1), c(0, 1)), AND)
net <- neuralnet(AND ~ Var1 + Var2, binary.data, hidden = 0,
    err.fct = "ce", linear.output = FALSE)
plot(net, rep = "best")

We have a number of aspects in a neural network to be aware of:

#### The input layer
This is a layer that takes in a number of features, including a bias node, which is often just an offset parameter.

#### The hidden layer, or “compute” layer
This is the layer that computes some function of each feature. The number of nodes in this hidden layer depends on the computation. Sometimes, it might be as simple as one node in this layer. Other times, the picture might be more complex with multiple hidden layers.

#### The output layer
This is a final processing node, which might be a single function.


Neural networks come in many different flavors, but the most popular ones stem from single or multilayered neural networks. So far, you’ve seen an example of a single-layer network, for which we take some input (1,0), process it through a sigmoid function, and get some output (0). You can, in fact, chain together these computational steps to form more interconnected and complicated models by taking the output and passing it into further computational layers.

This code example uses the `iris` dataset that is also built in with R:

In [None]:
set.seed(123)
library(nnet)
iris.nn <- nnet(Species ~ ., data = iris, size = 2)

This code uses the `nnet()` function with the familiar `~` formula operator that we used in our previous practical. The `size = 2` option tells `nnet()` to use two nodes in its single hidden layer, and this number must be explicitly specified. The output shows the value of the fitting criterion at intervals as the network trains.

After the neural network has finally converged, we can use it for prediction:

In [None]:
table(iris$Species, predict(iris.nn, iris, type = "class"))


The confusion matrix lists the reference iris species down the rows and the predicted species across the columns, because `table()` uses its first argument for the rows and its second for the columns. The neural network classified the setosa species perfectly, but missed one classification each for the versicolor and virginica species. A perfect machine learning model would have zeroes for all the off-diagonal elements, but this is pretty good for an illustrative example.
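The sketch below shows how to read a confusion matrix and turn it into a single accuracy figure. The labels are hypothetical values invented for this illustration, not output from the iris model.

```r
# Hypothetical labels, invented for illustration only
reference <- factor(c("a", "a", "b", "b", "b", "c"))
predicted <- factor(c("a", "b", "b", "b", "c", "c"),
                    levels = levels(reference))

# First argument -> rows, second argument -> columns
cm <- table(reference = reference, predicted = predicted)
cm

# Correct predictions sit on the diagonal of the table
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Naming the arguments inside `table()` labels the two margins, which makes it much harder to mix up reference and prediction when reading the result.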

In R, there’s really only one neural network library that has built-in functionality for plotting neural networks. In practice, most of the time plotting neural networks is more complicated than it’s worth, as we will demonstrate later. In complex modeling scenarios, neural network diagrams and mathematics become so cumbersome that the model itself more or less becomes a trained black box. If your manager were to ask you to explain the math behind a complex neural network model, you might need to block out an entire afternoon and find the largest whiteboard in the building.

The `neuralnet` library is the one with built-in plotting functionality. In the earlier `plot(net, rep = "best")` call, you plotted the repetition of the neural network that had the lowest error. The number of steps is the number of iterations that have gone on in the background to tune the weights for the lowest error.

### Section 2 - Multiple compute outputs

As alluded to earlier, neural networks can take multiple inputs and provide multiple outputs. If, for example, you have two functions that you want to model via neural networks, you can use R’s formula operator `~` and the `+` operator to add another response to the left-hand side of the equation during modeling:

In [None]:
set.seed(123)
AND <- c(rep(0, 7), 1)
OR <- c(0, rep(1, 7))
binary.data <- data.frame(expand.grid(c(0, 1), c(0, 1), c(0,
    1)), AND, OR)
net <- neuralnet(AND + OR ~ Var1 + Var2 + Var3, binary.data,
    hidden = 0, err.fct = "ce", linear.output = FALSE)
plot(net, rep = "best")

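We can model our AND and OR functions with two equations given by the weights in the above plot, each passed through the sigmoid function g:

AND = g(-19.4 + 7.5 × Var1 + 7.6 × Var2 + 7.6 × Var3) <br>
OR = g(-10.3 + 22.3 × Var1 + 21.8 × Var2 + 21.9 × Var3)

The intercepts are negative: the AND output only climbs above 0.5 when all three inputs are 1, whereas a single active input is enough to push OR above 0.5.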

We can see our output in the same way as before with just one function:

In [None]:
prediction(net)

The neural networks seem to be performing quite nicely!
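As a quick check on the data used in this section, the sketch below rebuilds the AND and OR columns directly from the inputs instead of typing them with `rep()`. The name `truth_table` is just a label chosen for this illustration.

```r
# All 8 combinations of three binary inputs; Var1 varies fastest
inputs <- expand.grid(c(0, 1), c(0, 1), c(0, 1))

# AND is 1 only when every input is 1; OR is 1 when any input is 1
AND <- as.numeric(rowSums(inputs) == 3)
OR <- as.numeric(rowSums(inputs) > 0)

truth_table <- data.frame(inputs, AND, OR)
truth_table
```

Comparing the rounded output of `prediction(net)` against a table like this is an easy way to confirm that both outputs have been learned.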

### Section 3 - Hidden compute layers
So far, you have been building neural networks that have no hidden layers. That is to say, the compute layer is the same as the output layer. The neural network we computed in section 2 comprised zero hidden layers and one output layer. Here, we will look at how adding one hidden layer of computation can help increase the model’s accuracy.

Neural networks use a shorthand notation for defining their architecture, in which we note the number of input nodes, followed by a colon, the number of compute nodes in the hidden layer, another colon, and then the number of output nodes. The architecture of the neural network we built in section 2 was 3:0:2, because it had three inputs, no hidden nodes, and two outputs (AND and OR).
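One way to get a feel for this notation is to count how many weights an architecture implies. The sketch below assumes a fully connected network with one bias node feeding every hidden and output node; `count_weights` is a hypothetical helper written for this illustration.

```r
# Count the weights in a fully connected network written as e.g. 3:1:1.
# Every node in a layer receives one weight per node in the previous layer,
# plus one weight from the bias node.
count_weights <- function(architecture) {
  layers <- architecture[architecture > 0]  # a 0 means "no hidden layer"
  senders <- head(layers, -1) + 1           # +1 for the bias node
  receivers <- tail(layers, -1)
  sum(senders * receivers)
}

count_weights(c(3, 0, 1))  # no hidden layer
count_weights(c(3, 1, 1))  # one hidden node
count_weights(c(3, 8, 1))  # eight hidden nodes
```

The count grows quickly with the number of hidden nodes, which is one reason the plots and equations become harder to read as the architecture gets larger.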

An easier way to illustrate this is by diagramming a new neural network that has three inputs, one hidden layer, and one output layer for a 3:1:1 neural network architecture:

In [None]:
set.seed(123)
AND <- c(rep(0, 7), 1)
# Only the AND response is modelled here, so OR is left out of the data frame
binary.data <- data.frame(expand.grid(c(0, 1), c(0, 1), c(0, 1)), AND)
net <- neuralnet(AND ~ Var1 + Var2 + Var3, binary.data, hidden = 1,
    err.fct = "ce", linear.output = FALSE)
plot(net, rep = "best")

In this case, we have inserted a computation step before the output. Walking through the plot you have generated above from left to right, there are three inputs for a logic gate. These are crunched into a logistic regression function in the middle, hidden layer. The result is then passed on to the output layer for us to use for our AND function. The math would look something like this:

H1 = 8.57 - 3.6 × Var1 - 3.5 × Var2 - 3.6 × Var3 <br>
which we would then pass through the sigmoid function, `g(H1)`. <br>
Next, we take that output and put it through another sigmoid node using the weights calculated on the output node:

AND = g(5.72 - 13.79 × g(H1))
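The same two-step calculation can be sketched by hand in R. The weights below are rounded, illustrative values (assumptions for this sketch, not numbers read from a fitted model), chosen so that the hidden node switches off only when all three inputs are 1.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

inputs <- expand.grid(Var1 = c(0, 1), Var2 = c(0, 1), Var3 = c(0, 1))

# Illustrative, rounded weights (assumptions, not read from a fitted model)
hidden_bias <- 8.5
hidden_weights <- c(-3.5, -3.5, -3.5)
output_bias <- 5.7
output_weight <- -13.8

# Hidden node: weighted sum of the inputs, then the sigmoid
H1 <- hidden_bias + as.matrix(inputs) %*% hidden_weights
g_H1 <- sigmoid(H1)

# Output node: weighted sum of the hidden activation, then the sigmoid
AND_hat <- sigmoid(output_bias + output_weight * g_H1)

cbind(inputs, AND_hat = round(as.vector(AND_hat), 3))
```

The hidden activation stays close to 1 for every row except the last, and the negative output weight turns that into an output close to 0 everywhere except where all three inputs are 1.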

One major advantage of using a hidden layer with some hidden compute nodes is that it can make the neural network more accurate. However, the more complex you make a neural network model, the slower it will be and the more difficult it will be to explain with easy-to-intuit equations. More hidden compute layers also mean that you run the risk of overfitting your model, as you’ve already seen with traditional regression modeling.

Although the numbers tied to the weights of each compute node shown in your plot above are now becoming pretty illegible, the main takeaway here is the error and the number of computation steps, both printed at the bottom of the plot. In this case, the error has gone down a little from 0.033 to 0.027 compared with the last model, and the number of computational steps needed to reach that accuracy has dropped from 143 to 61. So, not only have you increased the accuracy, but you’ve made the model computation quicker at the same time. Exact figures depend on the random seed and package version, so read them off your own plot. The next plot adds a second hidden computation node to the single hidden layer, just before the output layer:

In [None]:
set.seed(123)

net2 <- neuralnet(AND ~ Var1 + Var2 + Var3, binary.data, hidden = 2,
    err.fct = "ce", linear.output = FALSE)

plot(net2, rep = "best")

Mathematically, this can be represented as two logistic regression equations being fed into a final logistic regression equation for our resultant output.

The equations are becoming more and more complicated with each increase in the number of hidden compute nodes. The error with two nodes went up slightly from 0.29 to 0.33, but the number of iteration steps the model took to minimize that error improved a little, going down from 156 to 143. What happens if you turn the number of compute nodes even higher?

In [None]:
set.seed(123)

net4 <- neuralnet(AND ~ Var1 + Var2 + Var3, binary.data, hidden = 4,
    err.fct = "ce", linear.output = FALSE)
net8 <- neuralnet(AND ~ Var1 + Var2 + Var3, binary.data, hidden = 8,
    err.fct = "ce", linear.output = FALSE)

plot(net4, rep = "best")
plot(net8, rep = "best")

The code above uses the same neural network modeling scenario, but the number of hidden computation nodes is increased first to four, and then to eight. The neural network with four hidden computation nodes had a slightly better level of error than the network with only a single hidden node. The error in that case went down from 0.29 to 0.28, but the number of steps went down dramatically from 156 to 58. Quite an improvement! However, a neural network with eight hidden computation nodes might have crossed into overfitting territory. In that network, error went from 0.29 to 0.34, even though the number of steps went from 156 to 51.
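Rather than reading the error and step count off each plot, you can collect them in a table. The sketch below assumes the `neuralnet` package is installed and relies on the `result.matrix` element of the fitted object, which stores the error and the number of steps alongside the weights. Your numbers may differ from those quoted above.

```r
library(neuralnet)

# Three-input AND gate, as used throughout this section
binary.data <- data.frame(expand.grid(c(0, 1), c(0, 1), c(0, 1)),
                          AND = c(rep(0, 7), 1))

set.seed(123)
hidden_sizes <- c(1, 2, 4, 8)

# Fit one network per hidden-layer size and keep its error and step count
results <- do.call(rbind, lapply(hidden_sizes, function(h) {
  fit <- neuralnet(AND ~ Var1 + Var2 + Var3, binary.data, hidden = h,
                   err.fct = "ce", linear.output = FALSE)
  res <- fit$result.matrix  # NULL if the fit did not converge
  data.frame(hidden = h,
             error = if (is.null(res)) NA else unname(res["error", 1]),
             steps = if (is.null(res)) NA else unname(res["steps", 1]))
}))
results
```

A table like this makes it easier to see whether extra hidden nodes are actually buying a lower error or just adding weights.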

### Section 4 - Multilayer Neural Networks

All the neural networks thus far that we’ve played around with have had an architecture that has one input layer, one or zero hidden layers (or compute layers), and one output layer.

We’ve used architectures such as 2:0:1 and 3:1:1 for some classification schemes already. In those examples, we were trying to model classifications based on the AND and OR logic gate functions:

In [None]:
x1 <- c(0, 0, 1, 1)
x2 <- c(0, 1, 0, 1)
logic <- data.frame(x1, x2)
logic$AND <- as.numeric(x1 & x2)
logic$OR <- as.numeric(x1 | x2)
logic

We can represent this table as two plots, each of which shows the input values and marks them with a symbol according to the output of the logic gate:

In [None]:
logic$AND <- as.numeric(x1 & x2) + 1
logic$OR <- as.numeric(x1 | x2) + 1

par(mfrow = c(2, 1))

plot(x = logic$x1, y = logic$x2, pch = logic$AND, cex = 2,
    main = "Simple Classification of Two Types",
    xlab = "x", ylab = "y", xlim = c(-0.5, 1.5), ylim = c(-0.5,
        1.5))

plot(x = logic$x1, y = logic$x2, pch = logic$OR, cex = 2,
    main = "Simple Classification of Two Types",
    xlab = "x", ylab = "y", xlim = c(-0.5, 1.5), ylim = c(-0.5,
        1.5))

These plots use triangles to signify outputs of 1 (or TRUE), and circles for outputs of 0 (or FALSE). You will recall from our discussion on logistic regression that the separating line is called a decision boundary, and that it has always been a straight line. However, we can’t use a straight line to classify more complicated logic gates like XOR or XNOR.
In the lecture we discussed how this problem contributed to an AI winter for neural networks. Let's see now how to get around that problem.

In tabular form, as we’ve seen with the AND and OR functions, the XOR and XNOR functions take inputs of x1, x2, and give us a numeric output in much the same way.

In [None]:
x1 <- c(0, 0, 1, 1)
x2 <- c(0, 1, 0, 1)
logic <- data.frame(x1, x2)
logic$AND <- as.numeric(x1 & x2)
logic$OR <- as.numeric(x1 | x2)
logic$XOR <- as.numeric(xor(x1, x2))
logic$XNOR <- as.numeric(x1 == x2)
logic

In [None]:
logic$XOR <- as.numeric(xor(x1, x2)) + 1
logic$XNOR <- as.numeric(x1 == x2) + 1

par(mfrow = c(2, 1))

plot(x = logic$x1, y = logic$x2, pch = logic$XOR, cex = 2, main = "Non-Linear Classification of Two Types",
    xlab = "x", ylab = "y", xlim = c(-0.5, 1.5), ylim = c(-0.5,
        1.5))

plot(x = logic$x1, y = logic$x2, pch = logic$XNOR, cex = 2, main = "Non-Linear Classification of Two Types",
    xlab = "x", ylab = "y", xlim = c(-0.5, 1.5), ylim = c(-0.5,
        1.5))

There’s no single straight line that can separate dots on the plots generated above. If you try to plot a very simple neural network with no hidden layers for an XOR classification, the results aren’t especially gratifying. Run the cells below and see for yourself.

In [None]:
logic$XOR <- as.numeric(xor(x1, x2))

set.seed(123)
net.xor <- neuralnet(XOR ~ x1 + x2, logic, hidden = 0, err.fct = "ce",
    linear.output = FALSE)
prediction(net.xor)

In [None]:
plot(net.xor, rep = "best")

Trying to use a neural network with no hidden layers results in a large error. Looking at the output from the `prediction()` function, you can see that the network predicts roughly 0.5 for every input, including xor(0,0), and the plot reports an error of about 2.77. That error is the total cross-entropy over the four rows, 4 × ln 2 ≈ 2.77, which is exactly what you get from guessing 0.5 every time. The network has learned nothing useful about XOR, which indicates that this isn’t the best method for you to use.
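You can reproduce this behaviour without a neural network package at all. The sketch below fits an ordinary logistic regression with base R's `glm()`, which is the same model as a network with no hidden layer and a sigmoid output, and then computes the total cross-entropy by hand.

```r
x1 <- c(0, 0, 1, 1)
x2 <- c(0, 1, 0, 1)
XOR <- as.numeric(xor(x1, x2))

# Plain logistic regression: one straight-line decision boundary
fit <- glm(XOR ~ x1 + x2, family = binomial)
p <- unname(fitted(fit))
p

# Total cross-entropy over the four rows
cross_entropy <- -sum(XOR * log(p) + (1 - XOR) * log(1 - p))
cross_entropy
```

Every fitted probability comes out at 0.5, because no straight line can do better than a coin flip on the XOR table.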

A neural network with no hidden layer can only produce a straight-line decision boundary. To separate classes like these, you must rely on nonlinear decision boundaries, or curves. Each hidden node you add contributes another logistic regression decision boundary as a straight line, and from these added lines the network can draw a convex decision boundary that enables nonlinearity. For this, you must rely on a class of neural networks called **multilayer perceptrons**, or MLPs.

One quick-and-dirty way of using an MLP in this case would be to use the inputs x1 and x2 to get the outputs of the AND and OR functions. You then can feed those outputs as individual inputs into a single-layer neural network:

In [None]:
#set up the AND
set.seed(123)
and.net <- neuralnet(AND ~ x1 + x2, logic, hidden = 2, err.fct = "ce",
    linear.output = FALSE)
and.result <- data.frame(prediction(and.net)$rep1)

In [None]:
# set up the OR 
or.net <- neuralnet(OR ~ x1 + x2, logic, hidden = 2, err.fct = "ce",
    linear.output = FALSE)
or.result <- data.frame(prediction(or.net)$rep1)

In [None]:
as.numeric(xor(round(and.result$AND), round(or.result$OR)))

In [None]:
xor.data <- data.frame(and.result$AND, or.result$OR, as.numeric(xor(round(and.result$AND),
    round(or.result$OR))))
names(xor.data) <- c("AND", "OR", "XOR")

xor.net <- neuralnet(XOR ~ AND + OR, data = xor.data, hidden = 0,
    err.fct = "ce", linear.output = FALSE)

prediction(xor.net)
plot(xor.net, rep = "best")

An MLP is exactly what its name implies. A perceptron is a simple type of feed-forward neural network, with its own specific way of calculating the weights and errors. By taking that principle and adding one or more hidden layers, we make it compatible with nonlinear data like the kind we are dealing with in an XOR gate.
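To see why a hidden layer is enough for XOR, here is a hand-wired sketch of a 2:2:1 network. All of the weights are hand-picked assumptions for illustration rather than fitted values: one hidden node behaves like OR, the other like NOT AND, and the output node combines them with an AND.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

x1 <- c(0, 0, 1, 1)
x2 <- c(0, 1, 0, 1)

# Hidden layer: two nodes with hand-picked (assumed) weights
h_or   <- sigmoid(-10 + 20 * x1 + 20 * x2)  # behaves like OR
h_nand <- sigmoid( 30 - 20 * x1 - 20 * x2)  # behaves like NOT AND

# Output node: AND of the two hidden activations
xor_hat <- sigmoid(-30 + 20 * h_or + 20 * h_nand)

data.frame(x1, x2, h_or = round(h_or), h_nand = round(h_nand),
           XOR = round(xor_hat))
```

Each hidden node draws one straight line through the input plane, and the output node keeps only the region that lies between the two lines.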

### Section 5 - Neural Networks for Classification
In a sense, we’ve already demonstrated the use of neural networks for classification via the AND and OR gates that we built in sections 1 and 2. These functions take binary inputs and give us a binary result through sigmoid activation functions at each neural network computational node. You can think of that as binary classification. Most of the time, we’re more interested in multiclass classification.

In this case, you need to split your data into training and test sets, which is straightforward enough. Training the neural network on the training data also makes sense from our past experiences with the train/test approach to machine learning. The difference here is that when you call the `predict()` function on an `nnet` model, you do so with the `type = "class"` option, which returns class labels rather than the numeric outputs you would use for regression:

In [None]:
iris.df <- iris
smp_size <- floor(0.75 * nrow(iris.df))

set.seed(123)
train_ind <- sample(seq_len(nrow(iris.df)), size = smp_size)

train <- iris.df[train_ind, ]
test <- iris.df[-train_ind, ]

iris.nnet <- nnet(Species ~ ., data = train, size = 4, decay = 0.0001,
    maxit = 500, trace = FALSE)
predictions <- predict(iris.nnet, test[, 1:4], type = "class")
table(predictions, test$Species)

You can see that the confusion matrix provides a pretty good result for classification using neural networks. Look back at the example in Section 1 in this practical and the examples in the previous practical for kmeans multiclass clustering; we have no cases here that are mislabeled compared to the two mislabeled cases that we saw previously.
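To see what the `type` option changes, the sketch below refits the same kind of model and compares `type = "raw"` with `type = "class"`. It assumes the `nnet` package is available and reuses the split and settings from the cell above; the exact accuracy you get may differ.

```r
library(nnet)

set.seed(123)
train_ind <- sample(seq_len(nrow(iris)), size = floor(0.75 * nrow(iris)))
train <- iris[train_ind, ]
test <- iris[-train_ind, ]

fit <- nnet(Species ~ ., data = train, size = 4, decay = 0.0001,
            maxit = 500, trace = FALSE)

# type = "raw" returns one column of class probabilities per species
raw_pred <- predict(fit, test[, 1:4], type = "raw")
head(round(raw_pred, 3))

# type = "class" returns the label with the highest probability
class_pred <- predict(fit, test[, 1:4], type = "class")

# Share of test rows classified correctly
accuracy <- mean(class_pred == test$Species)
accuracy
```

The raw output is useful when you want to see how confident the network is, while the class output is what you pass to `table()` for a confusion matrix.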

### Section 6 - Classification with caret
`caret` is a great package for machine learning in R. Classification with `caret` works in a similar manner whichever method you are using. You can use most caret methods for classification or regression, but some are specific to one or the other. Of caret's neural network methods, the only one that is explicitly classification only is `multinom`, whereas the methods `neuralnet`, `brnn`, `qrnn`, and `mlpSGD` are explicitly regression only. You can use the rest for either classification or regression:

In [None]:
library(caret)
iris.caret <- train(Species ~ ., data = train, method = "nnet",
    trace = FALSE)
predictions <- predict(iris.caret, test[, 1:4])
table(predictions, test$Species)

The end result here is the same as earlier in terms of model accuracy, but the flexibility of caret allows you again to test against other methods pretty easily:

In [None]:
iris.caret.m <- train(Species ~ ., data = train, method = "multinom",
    trace = FALSE)
predictions.m <- predict(iris.caret.m, test[, 1:4])
table(predictions.m, test$Species)

Good to know that other methods are also quite accurate!
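If you are unsure what a caret method supports, you can ask caret directly. The sketch below assumes the `caret` package is installed and uses its `modelLookup()` function, which lists each tuning parameter of a method together with flags for regression (`forReg`) and classification (`forClass`).

```r
library(caret)

# modelLookup() lists a method's tuning parameters and whether it can be
# used for regression (forReg) and classification (forClass)
nnet_info <- modelLookup("nnet")
multinom_info <- modelLookup("multinom")

nnet_info[, c("model", "parameter", "forReg", "forClass")]
multinom_info[, c("model", "parameter", "forReg", "forClass")]
```

The `parameter` column also tells you which settings `train()` will tune for you, such as `size` and `decay` for the `nnet` method.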

### Summary
Neural networks can seem very complicated at first glance. Often they are thought of as a black box; data goes in, and insight comes out. In reality, neural networks are pretty easy to understand in their simplest form, but difficult to explain when they become more complex. At their core, neural networks take some input values, crunch them through an activation function, and return an output. The activation function is usually just a sigmoid function, so you can think of neural networks as more complicated logistic regression models. In fact, for a simple neural network architecture the computation is almost identical.

Neural networks become more complex when you begin changing their architecture. A neural network’s architecture is made up of an input layer, a number of hidden layers, and an output layer. The input layer simply holds the values of the features you are passing in to your model. The hidden layers are those that handle the computation and processing. The output layer is the one from which you get your results. In simple cases, the computation can happen directly in the output layer with no hidden layer at all, as when modeling logic gate functions like AND and OR. An example architecture for a neural network with three inputs, one hidden node, and one output node is 3:1:1. Increasing the number of compute nodes to something like 3:8:1 can overfit the data.

Multilayered neural networks (e.g., a 3:2:2:1 neural network) can also model nonlinear behavior. Logistic regression is good at finding decision boundaries that are straight lines to separate data into several classes or types, but it fails for nonlinear behavior. By introducing multiple decision boundaries into a system via hidden layers, you can create a curve that then can separate data, which is something that a straight line cannot do.

You can use neural networks both for regression modeling and classification. However, with regression modeling, it pays to be cautious and practice data normalization. In many cases, neural networks work best when data is scaled to the range 0 to 1, and trying to model data that has much larger values can be problematic. For classification with `nnet`, when you use the `predict()` function, you also need to pass the `type = "class"` option in order to get class labels back.
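A common way to normalize data is min-max scaling, sketched below on the built-in `mtcars` dataset. The helper name `normalize` is a choice made for this illustration, and min-max scaling is only one of several reasonable approaches.

```r
# Min-max scaling: rescale a numeric vector to the range 0 to 1.
# `normalize` is a hypothetical helper name chosen for this sketch.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Scale every numeric column of the built-in mtcars data
mtcars_scaled <- as.data.frame(lapply(mtcars, normalize))
summary(mtcars_scaled$mpg)
```

If you scale the response as well, remember to convert predictions back to the original units before reporting errors.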

There are a slew of neural network methods that you can use with the `caret` package in R, as well. While some of these are limited to only regression or classification, a good majority of them are flexible enough to be used with either. It pays to be cautious in method selection, not just to select one that can do the job you’re interested in, but because there can be tuning or optimization parameters that might need to be passed into the model to speed it up or make it more accurate.

### Exercises
Everything we have done in this practical has been with logic gate data or with the `iris` dataset. Repeat making basic neural networks for the built-in `mtcars` dataset and see how the models perform differently. You should generate both a shallow and a deep neural net and compare how they perform for predicting an aspect of the dataset of your choice. Post your answer in a notebook to the Canvas discussion for week 10.