# Multi-Layer Neural Network using Sigmoid Activation for Toy Problem 3

In exercise 670, we built a multi-layer classifier for Toy Problem 3 and used **the ReLU as the activation function**.

Let's see **what happens if we use a sigmoid** instead of the ReLU.

Note: The sigmoid non-linearity was the most commonly used "squashing function" or non-linearity before the advent of the ReLU.
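
As a quick illustration of what "squashing" means, the sketch below compares the two activations in plain Python (no PyTorch, so it runs on its own). The helper names `sigmoid` and `relu` are ours and are not part of the exercise code.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def relu(z):
    """Rectified linear unit: passes positive values, zeroes the rest."""
    return max(0.0, z)

# Compare both activations on a few inputs.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  relu={relu(z):.1f}")
```

The sigmoid maps every input into the open interval (0, 1), whereas the ReLU is unbounded for positive inputs.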

Also note: the code is unchanged from exercise 670, except that the ReLU has been replaced by the sigmoid.

We've provided a utility class 'Data' (in data_reader.py) to load the training data (it works for all the toy problems).

In [1]:
import torch
import torch.nn.functional as F
from data_reader import Data

data = Data("data/toy_problem_3_train.txt")

labels, features = data.get_sample()

print("Labels:\n"+str(labels))

print("Features:\n"+str(features))
    
target = torch.autograd.Variable(torch.LongTensor(labels))
#print("Labels Tensor:\n"+str(target))

features = torch.autograd.Variable(torch.Tensor(features))
#print("Features Tensor:\n"+str(features))

Labels:
[0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
Features:
[[-19, -99], [-35, 47], [59, -70], [-85, -69], [18, -3], [-16, -42], [-23, 45], [66, 9], [-85, -45], [-57, -61]]


We initialize the weights (one set of weights per layer) randomly.

In [2]:
middle = 4

weights1 = torch.nn.Parameter(torch.rand(2, middle))
print("Weights1 => "+str(weights1))

weights2 = torch.nn.Parameter(torch.rand(middle, 2))
print("Weights2 => "+str(weights2))


Weights1 => Parameter containing:
 0.5830  0.2759  0.9354  0.8574
 0.5529  0.3738  0.4447  0.3815
[torch.FloatTensor of size 2x4]

Weights2 => Parameter containing:
 0.0206  0.9780
 0.7147  0.8743
 0.7247  0.8200
 0.7998  0.6264
[torch.FloatTensor of size 4x2]
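
To get a feel for the scale of the numbers involved, the sketch below pushes the first sample printed earlier through the first layer by hand, in plain Python. It assumes the first layer multiplies the features by weights1 and then applies the sigmoid, and it uses the initial weights1 values printed above (they are random, so your own run will show different numbers). The helper `sigmoid` is ours.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# First sample and initial weights1 (2x4), copied from the printed output above.
x = [-19.0, -99.0]
weights1 = [[0.5830, 0.2759, 0.9354, 0.8574],
            [0.5529, 0.3738, 0.4447, 0.3815]]

# Pre-activations: z[j] = sum_i x[i] * weights1[i][j], i.e. a 1x2 by 2x4 product.
z = [sum(x[i] * weights1[i][j] for i in range(2)) for j in range(4)]
h = [sigmoid(v) for v in z]
# Slope of the sigmoid at each pre-activation: s * (1 - s).
dh = [s * (1.0 - s) for s in h]

print("pre-activations:", [round(v, 2) for v in z])
print("hidden outputs: ", h)
print("sigmoid slopes: ", dh)
```

For this sample every pre-activation is below -40, so the sigmoid output and its slope are both effectively zero. Inputs of this size push the sigmoid deep into its flat region.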



The cell below performs 1001 learning iterations (range(1001)), and we can run it as many times as we want.

Notice that the code for the learning iterations is almost identical to that of exercise 630, except that we've used the Adam optimizer class in PyTorch to nudge the weights in the direction they must go.
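
For readers curious what "nudging the weights" involves, here is a minimal single-parameter sketch of the Adam update rule in plain Python. It is not PyTorch's implementation. The function name `adam_minimize` and the toy objective (w - 3)^2 are assumptions chosen for illustration; the values beta1=0.9, beta2=0.999 and eps=1e-8 follow the defaults documented for `torch.optim.Adam`, and lr=0.01 matches the cell below.

```python
import math

def adam_minimize(grad, w, steps, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a single scalar parameter `w`."""
    m = 0.0  # running average of the gradient
    v = 0.0  # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Hypothetical toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
grad = lambda w: 2.0 * (w - 3.0)

print(adam_minimize(grad, w=0.0, steps=1))     # one small step towards 3
print(adam_minimize(grad, w=0.0, steps=2000))  # should end up close to 3
```

Because the step is the averaged gradient divided by the root of the averaged squared gradient, the very first update has a size of roughly lr regardless of how large the gradient is.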

In [3]:
optimizer = torch.optim.Adam([weights1, weights2], lr=0.01)

for i in range(1001):
    optimizer.zero_grad()   # zero the gradient buffers
    
    labels, features = data.get_sample(1000)
    
    features = torch.autograd.Variable(torch.Tensor(features))
    #print("Features: "+str(features))
    
    target = torch.autograd.Variable(torch.LongTensor(labels))
    #print("Target: "+str(target))
    
    result = features.mm(weights1)
    result1 = torch.sigmoid(result)  # F.sigmoid is deprecated in newer PyTorch releases
    result2 = result1.mm(weights2)
    
    loss = F.cross_entropy(result2, target)
    #print("Cross entropy loss: "+str(loss))

    loss.backward()
    
    optimizer.step()
        
    if i % 10 == 0:
        print("The loss is now "+str(loss.item()))  # loss.data[0] fails on 0-dim tensors in PyTorch 0.4+

torch.save(weights1, "models/toy_problem_3_trained_deep_model_weights1.bin")
torch.save(weights2, "models/toy_problem_3_trained_deep_model_weights2.bin")

The loss is now 0.7401268482208252
The loss is now 0.6937983632087708
The loss is now 0.6957014799118042
The loss is now 0.6931490302085876
The loss is now 0.6924446821212769
The loss is now 0.6924837827682495
The loss is now 0.690542459487915
The loss is now 0.6924556493759155
The loss is now 0.6921226382255554
The loss is now 0.6910936236381531
The loss is now 0.69541996717453
The loss is now 0.6904984712600708
The loss is now 0.6919944882392883
The loss is now 0.6886522173881531
The loss is now 0.6914364695549011
The loss is now 0.6887767314910889
The loss is now 0.6889198422431946
The loss is now 0.6891558170318604
The loss is now 0.6901046633720398
The loss is now 0.6916608810424805
The loss is now 0.6940774917602539
The loss is now 0.6919639110565186
The loss is now 0.6932433843612671
The loss is now 0.6892197132110596
The loss is now 0.6898062825202942
The loss is now 0.6901471018791199
The loss is now 0.6912344098091125
The loss is now 0.6914899349212646
The loss is now 0.69211

## The Loss

Observe the loss, which is printed every 10 iterations.

No matter how many times you run the training code, the loss stays close to 0.69 and does not decrease meaningfully.

This tells us that the machine learning algorithm is probably not learning anything much.
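
The value at which the loss plateaus is itself informative. `F.cross_entropy` applies a softmax to the two output scores and averages the negative log-probability of the correct class over the batch. The plain-Python sketch below computes the same quantity for a single sample; the helper `cross_entropy` is ours, not the PyTorch function. It shows that when both classes receive the same score, the loss is ln 2, roughly 0.693, which is where the printed losses hover.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log(softmax(logits)[target])."""
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[target]

# Equal scores for both classes: the classifier is maximally undecided.
print(cross_entropy([0.3, 0.3], 0), math.log(2))
# A confident and correct prediction gives a loss close to 0.
print(cross_entropy([0.0, 8.0], 1))
# A confident but wrong prediction gives a large loss.
print(cross_entropy([0.0, 8.0], 0))
```

A loss stuck near ln 2 is therefore what one would expect from a classifier that, on average, does no better than assigning both classes equal probability.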

## Parameters

We can now print the weights.

In [4]:
print("The first layer weights are now "+str(weights1.data))
print("and the second layer's weights are now "+str(weights2.data))

The first layer weights are now 
 0.6968  0.7275  0.8728  0.8050
 0.5757  0.2892  0.7656  0.5792
[torch.FloatTensor of size 2x4]

and the second layer's weights are now 
 0.1540  0.8446
 0.7007  0.8883
 0.9043  0.6404
 0.9424  0.4838
[torch.FloatTensor of size 4x2]



## Classifier Test - Toy Problem 3

We have just trained a multilayer classifier for Toy Problem 3.

It doesn't seem to be learning anything (the loss on the training data does not decrease).

But, to make sure, let us evaluate the performance of the classifier on the test data.

In [5]:
data = Data("data/toy_problem_3_test.txt")

weights1 = torch.load("models/toy_problem_3_trained_deep_model_weights1.bin")
print(weights1)
weights2 = torch.load("models/toy_problem_3_trained_deep_model_weights2.bin")
print(weights2)

labels, features = data.get_all()

features = torch.autograd.Variable(torch.Tensor(features))
#print(features)

target = torch.autograd.Variable(torch.LongTensor(labels))
#print(target)

result = torch.mm(features, weights1)
result1 = torch.sigmoid(result)  # F.sigmoid is deprecated in newer PyTorch releases
result2 = torch.mm(result1, weights2)
#print(result2)

maxv, observed = torch.max(result2, 1)

total = 0
correct = 0
for i in range(len(labels)):
    total += 1
    #print(str(target.data[i]) + " " + str(observed.data[i]))
    if target.data[i] == observed.data[i]:
        correct += 1
accuracy = correct / total
print("Accuracy: "+str(accuracy))

Parameter containing:
 0.6968  0.7275  0.8728  0.8050
 0.5757  0.2892  0.7656  0.5792
[torch.FloatTensor of size 2x4]

Parameter containing:
 0.1540  0.8446
 0.7007  0.8883
 0.9043  0.6404
 0.9424  0.4838
[torch.FloatTensor of size 4x2]

Accuracy: 0.465


As you can see, the accuracy is around 50%.

This means the classifier hasn't learnt anything at all: on a two-class problem with roughly balanced classes, guessing would do about as well.

It tells us that the multi-layer neural network (without a bias term) was **not able to learn the non-linear XOR function using the sigmoid activation function**, though **it was able to learn the same function using the ReLU** activation function.

Note: This does not mean that a multi-layer neural network (using sigmoid activation) can never learn the non-linear XOR function. It can, as we shall see in the next exercise, provided the neural network uses bias parameters in each layer in addition to the weights.
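
One way to see why the missing bias matters: the sigmoid satisfies sigmoid(-z) = 1 - sigmoid(z), so in a bias-free network of the form sigmoid(x·W1)·W2 the scores for a point x and for its mirror image -x always add up to the column sums of W2, whatever the weights are. The sketch below checks this numerically in plain Python, using the trained weights printed earlier (rounded to four decimals). The helpers `sigmoid` and `logits` are ours and are not part of the exercise code.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logits(x, w1, w2):
    """Bias-free forward pass sigmoid(x @ w1) @ w2 on plain Python lists."""
    hidden = [sigmoid(sum(x[i] * w1[i][j] for i in range(len(x))))
              for j in range(len(w1[0]))]
    return [sum(hidden[j] * w2[j][k] for j in range(len(hidden)))
            for k in range(len(w2[0]))]

# Trained weights as printed earlier in this exercise.
w1 = [[0.6968, 0.7275, 0.8728, 0.8050],
      [0.5757, 0.2892, 0.7656, 0.5792]]
w2 = [[0.1540, 0.8446],
      [0.7007, 0.8883],
      [0.9043, 0.6404],
      [0.9424, 0.4838]]

col_sums = [sum(row[k] for row in w2) for k in range(2)]
points = ([18.0, -3.0], [0.5, 0.25], [-0.1, 0.3])
for x in points:
    neg_x = [-v for v in x]
    a, b = logits(x, w1, w2), logits(neg_x, w1, w2)
    # The two score vectors always sum to the column sums of w2.
    print(x, [a[k] + b[k] for k in range(2)], col_sums)
```

If the labels of Toy Problem 3 are the XOR of the feature signs, as the text and the printed sample suggest, then x and -x always share a label. The constraint above fixes the sum of the score differences at x and -x, so for at least one of the two classes the network cannot give both mirrored points the correct decision. A bias term inside the sigmoid removes this symmetry.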