# Multi-Layer Neural Network using Sigmoid Activation for Toy Problem 3

In exercise 670, we built a multi-layer classifier for Toy Problem 3 and used **the ReLU as the activation function**.

Let's see **what happens if we use a sigmoid** instead of the ReLU.

Note: The sigmoid non-linearity was the most commonly used "squashing function" or non-linearity before the advent of the ReLU.
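
To make the term "squashing function" concrete, here is a minimal sketch comparing the two activations on a few hand-picked inputs. It is written in plain Python so the formulas are visible; the input values are illustrative and are not taken from the toy problem data.

```python
import math

def sigmoid(x):
    # 1 / (1 + exp(-x)): squashes any real input into the range 0 to 1
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # max(0, x): passes positive inputs through unchanged, zeroes the rest
    return max(0.0, x)

# Illustrative inputs only; not taken from the toy problem data.
for x in [-20.0, -1.0, 0.0, 1.0, 20.0]:
    print("x = " + str(x) + "  sigmoid = " + str(sigmoid(x)) + "  relu = " + str(relu(x)))
```

The sigmoid output never leaves the interval between 0 and 1, whereas the ReLU output grows without bound for positive inputs.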

Also note: The code is unchanged from exercise_670 except that the ReLU has been replaced by the sigmoid.

We've provided a utility class 'Data' (in data_reader.py) to load the training data (it works for all the toy problems).

In [1]:
import torch
import torch.nn.functional as F
from data_reader import Data

data = Data("data/toy_problem_3_train.txt")

labels, features = data.get_sample()

print("Labels:\n"+str(labels))

print("Features:\n"+str(features))
    
target = torch.autograd.Variable(torch.LongTensor(labels))
#print("Labels Tensor:\n"+str(target))

features = torch.autograd.Variable(torch.Tensor(features))
#print("Features Tensor:\n"+str(features))

Labels:
[1, 0, 1, 1, 1, 1, 1, 0, 1, 0]
Features:
[[87, -23], [32, 34], [44, -46], [22, -69], [-75, 91], [87, -7], [-94, 18], [70, 22], [-1, 8], [20, 48]]


We initialize the weights (one set of weights per layer) randomly.

In [2]:
middle = 4

weights1 = torch.nn.Parameter(torch.rand(2, middle))
print("Weights1 => "+str(weights1))

weights2 = torch.nn.Parameter(torch.rand(middle, 2))
print("Weights2 => "+str(weights2))


Weights1 => Parameter containing:
tensor([[ 0.5590,  0.5279,  0.6394,  0.4857],
        [ 0.8009,  0.4655,  0.4313,  0.1399]])
Weights2 => Parameter containing:
tensor([[ 0.0429,  0.9055],
        [ 0.2008,  0.9115],
        [ 0.2530,  0.3775],
        [ 0.5487,  0.1264]])
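
One detail worth noting is that `torch.rand` draws from the uniform distribution on [0, 1), so every initial weight is non-negative, as the printout above shows. Here is a small sketch to confirm the range. The seed and the name `sample_weights` are assumptions added for illustration and are not part of the exercise.

```python
import torch

torch.manual_seed(0)  # assumed seed, only so that reruns give the same draw

middle = 4
sample_weights = torch.rand(2, middle)  # same shape as weights1

print("Smallest weight: " + str(sample_weights.min().item()))
print("Largest weight:  " + str(sample_weights.max().item()))
```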


We can now run the 1000 learning iterations below as many times as we want.

Notice that the code for the learning iterations is almost identical to that of exercise 630, except that we've used the Adam optimizer class in PyTorch to nudge the weights in the direction they must go.

In [3]:
optimizer = torch.optim.Adam([weights1, weights2], lr=0.01)

for i in range(1001):
    optimizer.zero_grad()   # zero the gradient buffers
    
    labels, features = data.get_sample(1000)
    
    features = torch.autograd.Variable(torch.Tensor(features))
    #print("Features: "+str(features))
    
    target = torch.autograd.Variable(torch.LongTensor(labels))
    #print("Target: "+str(target))
    
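    result = features.mm(weights1)    # first layer: features times weights1
    result1 = torch.sigmoid(result)   # squash each hidden value into (0, 1)
    result2 = result1.mm(weights2)    # second layer: one score per class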
    
    loss = F.cross_entropy(result2, target)
    #print("Cross entropy loss: "+str(loss))

    loss.backward()
    
    optimizer.step()
        
    if i % 10 == 0:
        # .item() extracts the Python float from the one-element loss tensor
        print("The loss is now "+str(loss.item()))

torch.save(weights1, "models/toy_problem_3_trained_deep_model_weights1.bin")
torch.save(weights2, "models/toy_problem_3_trained_deep_model_weights2.bin")

The loss is now 0.7574540972709656
The loss is now 0.7175595164299011
The loss is now 0.6913857460021973
The loss is now 0.698091447353363
The loss is now 0.6923474073410034
The loss is now 0.6937881708145142
The loss is now 0.69038987159729
The loss is now 0.6925729513168335
The loss is now 0.6897422075271606
The loss is now 0.6919993162155151
The loss is now 0.6926519274711609
The loss is now 0.6930178999900818
The loss is now 0.6905223727226257
The loss is now 0.6888645887374878
The loss is now 0.692439079284668
The loss is now 0.6946567296981812
The loss is now 0.6904807090759277
The loss is now 0.6891172528266907
The loss is now 0.6931989789009094
The loss is now 0.6928488612174988
The loss is now 0.6929441690444946
The loss is now 0.6920353770256042
The loss is now 0.6916951537132263
The loss is now 0.6908668875694275
The loss is now 0.6910408139228821
The loss is now 0.691026508808136
The loss is now 0.6910014152526855
The loss is now 0.6886860132217407
The loss is now 0.6922541

## The Loss

Observe the loss, which is printed every 10 iterations.

No matter how many hundreds of times you run the hill-climbing code, the loss does not decrease very much: it stays close to 0.69.

This tells us that the machine learning algorithm is probably not learning anything much.
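
The value at which the loss levels off is itself informative. A two-class classifier that gives both classes equal scores has a cross-entropy loss of ln 2, roughly 0.693, which is where the printed loss settles. Here is a minimal sketch to check this, using all-zero scores and made-up targets (both are illustrative and not from the exercise's data):

```python
import math
import torch
import torch.nn.functional as F

# Illustrative batch: equal scores for both classes on every example.
scores = torch.zeros(6, 2)
target = torch.LongTensor([1, 0, 1, 1, 0, 1])

chance_loss = F.cross_entropy(scores, target).item()

print("Loss with equal scores: " + str(chance_loss))
print("ln(2):                  " + str(math.log(2)))
```

A loss that stays near this value suggests the network's predictions are no better than a coin toss.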

## Parameters

We can now print the weights.

In [4]:
print("The first layer weights are now "+str(weights1.data))
print("and the second layer's weights are now "+str(weights2.data))

The first layer weights are now tensor([[ 0.5763,  0.9937,  0.7670,  0.2050],
        [ 1.0866,  0.3080,  0.3730,  0.4365]])
and the second layer's weights are now tensor([[ 0.0819,  0.8665],
        [ 0.3775,  0.7348],
        [ 0.5055,  0.1251],
        [ 0.6643,  0.0109]])


## Classifier Test - Toy Problem 3

We have just trained a multilayer classifier for Toy Problem 3.

It doesn't seem to be learning anything (the loss on the training data does not decrease).

But, to make sure, let us evaluate the performance of the classifier on the test data.

In [5]:
data = Data("data/toy_problem_3_test.txt")

weights1 = torch.load("models/toy_problem_3_trained_deep_model_weights1.bin")
print(weights1)
weights2 = torch.load("models/toy_problem_3_trained_deep_model_weights2.bin")
print(weights2)

labels, features = data.get_all()

features = torch.autograd.Variable(torch.Tensor(features))
#print(features)

target = torch.autograd.Variable(torch.LongTensor(labels))
#print(target)

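result = torch.mm(features, weights1)   # first layer: features times weights1
result1 = torch.sigmoid(result)         # squash each hidden value into (0, 1)
result2 = torch.mm(result1, weights2)   # second layer: one score per class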
#print(result2)

maxv, observed = torch.max(result2, 1)

total = 0
correct = 0
for i in range(len(labels)):
    total += 1
    #print(str(target.data[i]) + " " + str(observed.data[i]))
    if target.data[i] == observed.data[i]:
        correct += 1
accuracy = correct / total
print("Accuracy: "+str(accuracy))

tensor([[ 0.5763,  0.9937,  0.7670,  0.2050],
        [ 1.0866,  0.3080,  0.3730,  0.4365]])
tensor([[ 0.0819,  0.8665],
        [ 0.3775,  0.7348],
        [ 0.5055,  0.1251],
        [ 0.6643,  0.0109]])
Accuracy: 0.523


As you can see, the accuracy is around 50% (0.523 on this run).

This means the classifier hasn't learnt anything at all.
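
The counting loop in the test cell can also be written as a single tensor comparison. Here is a sketch with made-up scores and labels. The values are illustrative, and the names `scores` and `predicted` are not from the exercise.

```python
import torch

# Illustrative class scores for four examples, and their true labels.
scores = torch.tensor([[0.2, 0.9],
                       [0.8, 0.1],
                       [0.4, 0.6],
                       [0.7, 0.3]])
target = torch.LongTensor([1, 0, 0, 0])

# torch.max along dimension 1 returns (max values, index of the max);
# the index is the predicted class.
maxv, predicted = torch.max(scores, 1)

# Comparing elementwise gives a tensor of booleans; its mean is the accuracy.
accuracy = (predicted == target).float().mean().item()
print("Accuracy: " + str(accuracy))
```

Three of the four made-up predictions match their labels here, so the sketch reports an accuracy of 0.75.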

It tells us that the multi-layer neural network (without a bias term) was **not able to learn the non-linear XOR function using the sigmoid activation function**, though **it was able to learn the same function using the ReLU** activation function.

Note: This does not mean that a multi-layer neural network (using sigmoid activation) can never learn the non-linear XOR function. It can, as we shall see in the next exercise, if the neural network uses bias parameters in each layer in addition to the weights.
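
As a preview of what that change involves, the sketch below adds one bias vector per layer to the forward pass used in this exercise. It is an assumption about the general shape of such a network, not the next exercise's actual code. The names `bias1` and `bias2` and the zero initialization are hypothetical.

```python
import torch

middle = 4

weights1 = torch.nn.Parameter(torch.rand(2, middle))
bias1 = torch.nn.Parameter(torch.zeros(middle))   # hypothetical name and init
weights2 = torch.nn.Parameter(torch.rand(middle, 2))
bias2 = torch.nn.Parameter(torch.zeros(2))        # hypothetical name and init

# Three of the sample feature rows printed at the top of this exercise.
features = torch.Tensor([[87, -23], [32, 34], [44, -46]])

hidden = torch.sigmoid(features.mm(weights1) + bias1)  # bias shifts each hidden unit
scores = hidden.mm(weights2) + bias2                   # one score per class

print(scores.shape)
```

To train such a network, the bias parameters would need to be passed to the optimizer along with the weights.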