# **Lab 7 - Neural Networks**

The inspiration for neural networks was the recognition that complex learning systems
in animal brains consisted of closely interconnected sets of neurons. Although a particular neuron may be relatively simple in structure, dense networks of interconnected neurons could perform complex learning tasks such as classification and pattern recognition. The human brain, for example, contains approximately $10^{11}$ neurons, each connected on average to 10,000 other neurons, making a total of 1,000,000,000,000,000 = $10^{15}$ synaptic connections.

**Definition**

Artificial neural networks (hereafter, neural networks) represent an attempt at a very basic level to imitate the type of nonlinear learning that occurs in the networks of neurons found in nature.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Loading the dataset**

In [2]:
from random import seed
from random import randrange
from random import random
from csv import reader
from math import exp

In [3]:
# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

In [4]:
# Test backprop on the Iris dataset
seed(1)
#load and prepare data
filename = '/content/drive/MyDrive/Iris.csv'
dataset = load_csv(filename)

In [5]:
dataset[:10]

[['Id',
  'SepalLengthCm',
  'SepalWidthCm',
  'PetalLengthCm',
  'PetalWidthCm',
  'Species'],
 ['1', '5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
 ['2', '4.9', '3.0', '1.4', '0.2', 'Iris-setosa'],
 ['3', '4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
 ['4', '4.6', '3.1', '1.5', '0.2', 'Iris-setosa'],
 ['5', '5.0', '3.6', '1.4', '0.2', 'Iris-setosa'],
 ['6', '5.4', '3.9', '1.7', '0.4', 'Iris-setosa'],
 ['7', '4.6', '3.4', '1.4', '0.3', 'Iris-setosa'],
 ['8', '5.0', '3.4', '1.5', '0.2', 'Iris-setosa'],
 ['9', '4.4', '2.9', '1.4', '0.2', 'Iris-setosa']]

## **Data Preprocessing**

In [6]:
del dataset[0]

In [7]:
dataset[:10]

[['1', '5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
 ['2', '4.9', '3.0', '1.4', '0.2', 'Iris-setosa'],
 ['3', '4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
 ['4', '4.6', '3.1', '1.5', '0.2', 'Iris-setosa'],
 ['5', '5.0', '3.6', '1.4', '0.2', 'Iris-setosa'],
 ['6', '5.4', '3.9', '1.7', '0.4', 'Iris-setosa'],
 ['7', '4.6', '3.4', '1.4', '0.3', 'Iris-setosa'],
 ['8', '5.0', '3.4', '1.5', '0.2', 'Iris-setosa'],
 ['9', '4.4', '2.9', '1.4', '0.2', 'Iris-setosa'],
 ['10', '4.9', '3.1', '1.5', '0.1', 'Iris-setosa']]

In [8]:
# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    print(unique)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup


In [9]:
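# Convert the numeric columns to floats. Note that column 0 is the row Id,
# which is therefore kept as an input feature; consider dropping it first.
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)

# Convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)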

{'Iris-virginica', 'Iris-versicolor', 'Iris-setosa'}

{'Iris-virginica': 0, 'Iris-versicolor': 1, 'Iris-setosa': 2}


In [10]:
dataset[:10]

[[1.0, 5.1, 3.5, 1.4, 0.2, 1],
 [2.0, 4.9, 3.0, 1.4, 0.2, 1],
 [3.0, 4.7, 3.2, 1.3, 0.2, 1],
 [4.0, 4.6, 3.1, 1.5, 0.2, 1],
 [5.0, 5.0, 3.6, 1.4, 0.2, 1],
 [6.0, 5.4, 3.9, 1.7, 0.4, 1],
 [7.0, 4.6, 3.4, 1.4, 0.3, 1],
 [8.0, 5.0, 3.4, 1.5, 0.2, 1],
 [9.0, 4.4, 2.9, 1.4, 0.2, 1],
 [10.0, 4.9, 3.1, 1.5, 0.1, 1]]
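
The integer assigned to each species depends on the iteration order of a Python `set`, which can differ between interpreter sessions because string hashing is randomized. That is a likely reason why the class codes in the rows above need not match a mapping printed in another run. A minimal sketch of a reproducible variant follows, assuming it is acceptable to number the labels in alphabetical order. The function name `encode_labels_sorted` and the toy rows are hypothetical and not part of the lab code.

```python
def encode_labels_sorted(dataset, column):
    """Map class labels to integers in alphabetical order (reproducible)."""
    labels = sorted({row[column] for row in dataset})
    lookup = {label: index for index, label in enumerate(labels)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup


# Toy rows for illustration only: one feature plus the class label
rows = [
    [5.1, 'Iris-setosa'],
    [7.0, 'Iris-versicolor'],
    [6.3, 'Iris-virginica'],
    [4.9, 'Iris-setosa'],
]
lookup = encode_labels_sorted(rows, 1)
print(lookup)
```

With this variant the mapping is the same on every run, so saved outputs and later cells stay consistent with each other.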

# **Input-Output Encoding**

One possible drawback of neural networks is that all attribute values must be encoded in a standardized manner, taking values between zero and 1, even for categorical variables. Later, when we examine the details of the back-propagation algorithm, we shall understand why this is necessary.

For now, however, how does one go about standardizing all the attribute values?

## **Continuous Variables**

For continuous variables, this is not a problem, as we discussed in Lecture 2.
We may simply apply the *min–max normalization*:
\begin{equation*}
    X^* = \frac{X - \min(X)}{\text{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}
\end{equation*}
This works well as long as the minimum and maximum values are known and all
potential new data are bounded between them. Neural networks are somewhat robust to minor violations of these boundaries. If more serious violations are expected,
certain ad hoc solutions may be adopted, such as rejecting values that are outside the
boundaries, or assigning such values to either the minimum or maximum value.
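
The formula and the second ad hoc solution (assigning out-of-range values to the minimum or maximum) can be sketched for a single value as follows. This is an illustration only: the helper name `minmax_scale` and its `clip` flag are hypothetical, and the bounds 4.3 and 7.9 are assumed to be the known minimum and maximum of the attribute.

```python
def minmax_scale(value, lo, hi, clip=True):
    """Scale value to [0, 1] given a known minimum lo and maximum hi."""
    if hi == lo:
        raise ValueError("min and max must differ")
    scaled = (value - lo) / (hi - lo)
    if clip:
        # Assign values outside the known range to the nearest boundary
        scaled = min(1.0, max(0.0, scaled))
    return scaled


print(minmax_scale(5.1, 4.3, 7.9))              # inside the known range
print(minmax_scale(8.5, 4.3, 7.9))              # above the maximum, clipped
print(minmax_scale(8.5, 4.3, 7.9, clip=False))  # above the maximum, not clipped
```

Rejecting out-of-range values instead of clipping them would amount to raising an error or skipping the record at the point where `clip` is applied.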

In [11]:
# Find the min and max values for each column
def dataset_minmax(dataset):
    # zip(*dataset) transposes the rows into columns
    stats = [[min(column), max(column)] for column in zip(*dataset)]
    return stats

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)-1):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

In [12]:
# normalize input variables
minmax = dataset_minmax(dataset)
normalize_dataset(dataset, minmax)

In [13]:
dataset[:10]

[[0.0,
  0.22222222222222213,
  0.6249999999999999,
  0.06779661016949151,
  0.04166666666666667,
  1],
 [0.006711409395973154,
  0.1666666666666668,
  0.41666666666666663,
  0.06779661016949151,
  0.04166666666666667,
  1],
 [0.013422818791946308,
  0.11111111111111119,
  0.5,
  0.05084745762711865,
  0.04166666666666667,
  1],
 [0.020134228187919462,
  0.08333333333333327,
  0.4583333333333333,
  0.0847457627118644,
  0.04166666666666667,
  1],
 [0.026845637583892617,
  0.19444444444444448,
  0.6666666666666666,
  0.06779661016949151,
  0.04166666666666667,
  1],
 [0.03355704697986577,
  0.30555555555555564,
  0.7916666666666665,
  0.11864406779661016,
  0.12500000000000003,
  1],
 [0.040268456375838924,
  0.08333333333333327,
  0.5833333333333333,
  0.06779661016949151,
  0.08333333333333333,
  1],
 [0.04697986577181208,
  0.19444444444444448,
  0.5833333333333333,
  0.0847457627118644,
  0.04166666666666667,
  1],
 [0.053691275167785234,
  0.027777777777777922,
  0.3749999999999999

## **Output**

With respect to output, we shall see that neural network output nodes always
return a continuous value between zero and 1 as output.

**How can we use such continuous output for classification?**

Many classification problems have a dichotomous result, an up-or-down decision, with only two possible outcomes. For example, “Is this customer about to leave our company’s service?” For dichotomous classification problems, one option is to use a single output node, with a threshold value set a priori which would separate the classes, such as “leave” or “stay.” For example, with the threshold of “leave if output $\geq$ 0.67,” an output of 0.72 from the output node would classify that record as likely to leave the company’s service.
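
The threshold rule in this example can be written in a couple of lines. The sketch below assumes a single output node and uses the 0.67 threshold from the text; the function name `classify_by_threshold` is hypothetical.

```python
def classify_by_threshold(output, threshold=0.67, positive="leave", negative="stay"):
    """Turn a continuous output in [0, 1] into one of two class labels."""
    return positive if output >= threshold else negative


print(classify_by_threshold(0.72))  # at or above the threshold
print(classify_by_threshold(0.40))  # below the threshold
```

The Iris problem in this lab has three classes, so the code that follows uses one output node per class and picks the node with the largest output instead of a single threshold.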

In [14]:
# Initialize a network
def initialize_network(n_inputs, n_hidden, n_outputs):
    network = list()
    hidden_layer = [{'weights':[random() for i in range(n_inputs + 1)]} for i in range(n_hidden)]
    network.append(hidden_layer)
    output_layer = [{'weights':[random() for i in range(n_hidden + 1)]} for i in range(n_outputs)]
    network.append(output_layer)
    return network

In [15]:
# Update network weights with error
def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]
        if i != 0:
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                neuron['weights'][j] -= l_rate * neuron['delta'] * inputs[j]
            neuron['weights'][-1] -= l_rate * neuron['delta']

# **Sigmoid Activation Function**

In [16]:
# Calculate neuron activation for an input
def activate(weights, inputs):
    activation = weights[-1]
    for i in range(len(weights)-1):
        activation += weights[i] * inputs[i]
    return activation

# Transfer neuron activation
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))
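
The `transfer` function above is the logistic sigmoid $y = 1/(1+e^{-x})$, which squashes any activation into the interval between 0 and 1. Its derivative can be written in terms of the output alone as $y(1-y)$. The sketch below checks that identity numerically with a central difference; the function names and the test points are chosen for illustration only.

```python
from math import exp


def sigmoid(x):
    """Logistic sigmoid, the same formula as transfer() in the lab code."""
    return 1.0 / (1.0 + exp(-x))


def sigmoid_derivative_from_output(y):
    """Derivative of the sigmoid expressed through its output y."""
    return y * (1.0 - y)


h = 1e-6
for x in (-4.0, 0.0, 4.0):
    y = sigmoid(x)
    # Central-difference approximation of the slope at x
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, round(y, 4), round(sigmoid_derivative_from_output(y), 4), round(numeric, 4))
```

The slope is largest at an activation of 0 and shrinks toward zero for large positive or negative activations, which is one reason inputs are kept in a small standardized range.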

# **Back propagation**

**How does a neural network learn?**


Neural networks represent a supervised learning
method, requiring a large training set of complete records, including the target
variable. As each observation from the training set is processed through the network,
an output value is produced from the output node.

This output value is then compared to the actual value
of the target variable for this training set observation, and the error (actual - output)
is calculated. This prediction error is analogous to the residuals in regression models.
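
As a small illustration of the comparison step, the sketch below computes the per-node errors (actual - output) and the squared error for one record of a three-class problem. The output values are made up, and the helper names `one_hot` and `squared_error` are hypothetical; the one-hot target mirrors what the training code in this lab builds for each row.

```python
def one_hot(class_index, n_classes):
    """Target vector with 1 at the true class and 0 elsewhere."""
    expected = [0] * n_classes
    expected[class_index] = 1
    return expected


def squared_error(expected, outputs):
    """Sum of squared (actual - output) over the output nodes of one record."""
    return sum((e - o) ** 2 for e, o in zip(expected, outputs))


outputs = [0.2, 0.7, 0.1]   # hypothetical network outputs for one record
expected = one_hot(1, 3)    # the record's true class is 1
errors = [e - o for e, o in zip(expected, outputs)]
print(errors)
print(squared_error(expected, outputs))
```

Summing this quantity over all training records gives the sum of squared errors (SSE) that training tries to reduce.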

In [17]:
# Train a network for a fixed number of epochs
def train_network(network, train, l_rate, n_epoch, n_outputs):
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            outputs = forward_propagate(network, row)
            expected = [0 for i in range(n_outputs)]
            expected[row[-1]] = 1
            sum_error += sum([(expected[i]-outputs[i])**2 for i in range(len(expected))])
            backward_propagate_error(network, expected)
            update_weights(network, row, l_rate)
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))

# **Gradient Descent Method**

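Because the sigmoid functions throughout the network are nonlinear, there is no closed-form solution for the weights that minimize the sum of squared errors (SSE), as there is for least-squares regression. We must therefore turn to optimization methods, specifically gradient-descent methods,
to help us find the set of weights that will minimize SSE.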


Suppose that we have a
set (vector) of $m+1$ weights $w = (w_0, w_1, w_2, \dots, w_m)$ in our neural network model and
we wish to find the values for each of these weights that, together, minimize SSE.
We can use the gradient descent method, which gives us the direction in which we should
adjust the weights in order to decrease SSE. The gradient of SSE with respect to the
vector of weights $w$ is the vector derivative:

$
\nabla SSE(w) = \left[ \frac{\partial SSE}{\partial w_0}, \frac{\partial SSE}{\partial w_1}, \dots, \frac{\partial SSE}{\partial w_m} \right]
$

that is, the vector of partial derivatives of SSE with respect to each of the weights.
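
To make the gradient concrete, the sketch below approximates each partial derivative by a central difference and then takes one step against the gradient. It is an illustration under simplifying assumptions: a single sigmoid neuron stands in for the whole network, and the two records, the starting weights, and the step size `eta` are made up.

```python
from math import exp


def sse(weights, records):
    """SSE of a single sigmoid neuron; weights[-1] is the bias."""
    total = 0.0
    for inputs, actual in records:
        activation = weights[-1] + sum(w * x for w, x in zip(weights, inputs))
        output = 1.0 / (1.0 + exp(-activation))
        total += (actual - output) ** 2
    return total


def numerical_gradient(f, weights, h=1e-6):
    """Approximate [dF/dw_0, ..., dF/dw_m] by central differences."""
    grad = []
    for k in range(len(weights)):
        up = list(weights)
        up[k] += h
        down = list(weights)
        down[k] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


records = [([0.0, 1.0], 1), ([1.0, 0.0], 0)]  # (inputs, actual) pairs
weights = [0.5, -0.5, 0.1]                    # two input weights and a bias
grad = numerical_gradient(lambda w: sse(w, records), weights)

eta = 0.5  # step size, chosen for illustration
stepped = [w - eta * g for w, g in zip(weights, grad)]
print(grad)
print(sse(weights, records), sse(stepped, records))
```

Moving against the gradient lowers SSE here. Back-propagation obtains the same partial derivatives analytically, which is far cheaper than perturbing every weight.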

In [18]:
# Forward propagate input to a network output
def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            activation = activate(neuron['weights'], inputs)
            neuron['output'] = transfer(activation)
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs


# **Back Propagation Rules**

The back-propagation algorithm takes the prediction error (actual - output) for a
particular record and percolates the error back through the network, assigning partitioned responsibility for the error to the various connections. The weights on these
connections are then adjusted to decrease the error, using gradient descent.

Using the sigmoid activation function and gradient descent, Mitchell derives
the back-propagation rules as follows:

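$ w_{ij,new} = w_{ij,current} + \Delta w_{ij} \quad \text{ where,}
\quad \Delta w_{ij} = \eta \delta_{j} x_{ij} $

Here $\eta$ is the learning rate, $x_{ij}$ is the $i$th input to node $j$, and $\delta_j$ represents the responsibility for a particular error belonging to node $j$.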
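
A single application of this rule to one weight of an output node can be worked through numerically. The sketch assumes the usual sigmoid result for an output node, $\delta_j = output_j(1 - output_j)(actual_j - output_j)$; all numbers and function names below are hypothetical.

```python
def output_delta(output, actual):
    """Error responsibility of an output node with sigmoid activation."""
    return output * (1.0 - output) * (actual - output)


def updated_weight(w_current, eta, delta, x):
    """Apply w_new = w_current + eta * delta * x."""
    return w_current + eta * delta * x


output, actual = 0.8, 1.0   # hypothetical node output and target value
eta = 0.1                   # learning rate
x = 0.5                     # input arriving over this connection
w_current = 0.9

delta = output_delta(output, actual)
w_new = updated_weight(w_current, eta, delta, x)
print(delta, w_new)
```

The output was too low, so `delta` is positive and the weight increases slightly. The lab code stores `delta` with the opposite sign, (output - expected) times the derivative, and subtracts the step in `update_weights`, which yields the same update.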

In [20]:
# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        if i != len(network)-1:
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += (neuron['weights'][j] * neuron['delta'])
                errors.append(error)
        else:
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(neuron['output'] - expected[j])
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])


In [21]:
# Backpropagation Algorithm With Stochastic Gradient Descent
def back_propagation(train, test, l_rate, n_epoch, n_hidden):
    n_inputs = len(train[0]) - 1
    n_outputs = len(set([row[-1] for row in train]))
    network = initialize_network(n_inputs, n_hidden, n_outputs)
    train_network(network, train, l_rate, n_epoch, n_outputs)
    predictions = list()
    for row in test:
        prediction = predict(network, row)
        predictions.append(prediction)
    return predictions

In [23]:
# Calculate the derivative of a neuron output
def transfer_derivative(output):
    return output * (1.0 - output)

# **Termination Criteria**

The neural network algorithm would then proceed to work through the training data
set, record by record, adjusting the weights constantly to reduce the prediction error.

It may take many passes through the data set before the algorithm’s termination
criterion is met. What, then, serves as the termination criterion, or stopping criterion?
If training time is an issue, one may simply set the number of passes through the
data, or the amount of real time the algorithm may consume, as termination criteria.

However, what one gains in short training time is probably bought with degradation
in model efficacy.
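
The two simple termination criteria mentioned above, a cap on passes and a cap on wall-clock time, can be sketched as a small training loop. This is a hypothetical wrapper rather than the lab's `train_network`: `train_until` and its parameters are invented for illustration, and `fake_epoch` merely stands in for one pass over the training data.

```python
import time


def train_until(train_one_epoch, max_epochs=None, max_seconds=None):
    """Run epochs until the pass limit or the time budget is reached."""
    if max_epochs is None and max_seconds is None:
        raise ValueError("set at least one termination criterion")
    start = time.monotonic()
    history = []
    epoch = 0
    while True:
        if max_epochs is not None and epoch >= max_epochs:
            break
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
        history.append(train_one_epoch(epoch))
        epoch += 1
    return history


def fake_epoch(epoch):
    """Stand-in for one pass over the data; returns a made-up error value."""
    return 100.0 / (epoch + 1)


history = train_until(fake_epoch, max_epochs=5)
print(len(history), history[-1])
```

The lab itself uses only the first criterion, through the fixed `n_epoch` argument.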

In [24]:
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

In [25]:
# Make a prediction with a network
def predict(network, row):
    outputs = forward_propagate(network, row)
    return outputs.index(max(outputs))

# **Learning Rate**

Recall that the learning rate $\eta$, $0 < \eta < 1$, is a constant chosen to help us move the
network weights toward a global minimum for SSE.

**However, what value should $\eta$ take? How large should the weight adjustments be?**

When the learning rate is very small, the weight adjustments tend to be very
small. Thus, if $\eta$ is small when the algorithm is initialized, the network will probably
take an unacceptably long time to converge. Is the solution therefore to use large
values for $\eta$? Not necessarily. Suppose that the algorithm is close to the optimal
solution and we have a large value for $\eta$. This large $\eta$ will tend to make the algorithm
overshoot the optimal solution.

Consider Figure 7.5, where $w^*$ is the optimum value for weight $w$, which
has current value $w_{current}$. According to the gradient descent rule, $\Delta w_{current} = -\eta(\partial SSE/\partial w_{current})$, $w_{current}$ will be adjusted in the direction of $w^*$. But if the learning
rate $\eta$, which acts as a multiplier in the formula for $\Delta w_{current}$, is too large, the new
weight value $w_{new}$ will jump right past the optimal value $w^*$, and may in fact end up
farther away from $w^*$ than $w_{current}$.
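
The overshoot can be reproduced on a toy one-weight problem. The sketch below assumes a made-up error curve $SSE(w) = c\,(w - w^*)^2$ with $w^* = 1$ and $c = 2$, so that $\partial SSE/\partial w = 2c\,(w - w^*)$; the function name and all values are hypothetical.

```python
def gradient_step(w_current, eta, w_star=1.0, curvature=2.0):
    """One gradient-descent step on SSE(w) = curvature * (w - w_star)**2."""
    gradient = 2.0 * curvature * (w_current - w_star)
    return w_current - eta * gradient


w_current = 0.0
for eta in (0.01, 0.2, 0.9):
    w_new = gradient_step(w_current, eta)
    # Print the learning rate, the new weight, and its distance from w_star
    print(eta, w_new, abs(w_new - 1.0))
```

With the smallest rate the weight barely moves, with the middle rate it lands close to the optimum, and with the largest rate it jumps past the optimum and ends up farther away than where it started.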

In [26]:
n_folds = 5
l_rate = 0.1
n_epoch = 400
n_hidden = 21
scores = evaluate_algorithm(dataset, back_propagation, n_folds, l_rate, n_epoch, n_hidden)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

>epoch=0, lrate=0.100, error=239.896
>epoch=1, lrate=0.100, error=239.893
>epoch=2, lrate=0.100, error=239.889
>epoch=3, lrate=0.100, error=239.885
>epoch=4, lrate=0.100, error=239.881
>epoch=5, lrate=0.100, error=239.876
>epoch=6, lrate=0.100, error=239.871
>epoch=7, lrate=0.100, error=239.866
>epoch=8, lrate=0.100, error=239.860
>epoch=9, lrate=0.100, error=239.853
>epoch=10, lrate=0.100, error=239.845
>epoch=11, lrate=0.100, error=239.837
>epoch=12, lrate=0.100, error=239.827
>epoch=13, lrate=0.100, error=239.816
>epoch=14, lrate=0.100, error=239.802
>epoch=15, lrate=0.100, error=239.787
>epoch=16, lrate=0.100, error=239.768
>epoch=17, lrate=0.100, error=239.744
>epoch=18, lrate=0.100, error=239.714
>epoch=19, lrate=0.100, error=239.673
>epoch=20, lrate=0.100, error=239.616
>epoch=21, lrate=0.100, error=239.530
>epoch=22, lrate=0.100, error=239.380
>epoch=23, lrate=0.100, error=239.052
>epoch=24, lrate=0.100, error=237.640
>epoch=25, lrate=0.100, error=204.137
>epoch=26, lrate=0.100