# Chapter 1 — Classifying Linearly Separable Data with Neural Nets
## Introduction
Your journey to understand neural networks starts with a humble example: building a binary classifier for linearly separable data. 'Linearly separable data' means that a linear decision boundary can split (or classify) the data into two classes. For two-dimensional data the decision boundary is a line, for three-dimensional data it is a plane, and for higher-dimensional data it is a hyperplane.

## The Dataset
We’ll use the example below to illustrate the workings of the neural network. Imagine you are running a computing infrastructure with just two tiers of pricing, VIP and Regular. You want to classify each user into one of the two tiers (or classes) based on the two resources they consume. Let's call these resources x1 and x2 (for instance, the number of CPU cores and the network bandwidth). You are also just starting out providing this service, so you have only a handful of users (i.e., a small number of data samples). Your task is to draw the decision boundary between the two classes of users. The decision boundary is a straight line, and you are granted the assumption that the data is linearly separable.

![startup pic](plots/first-diag.png)




A common practice is to center the data around the origin (0, 0) by computing the mean of the data and subtracting it from each data point.
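As a minimal sketch of this step, the snippet below centers a handful of made-up (x1, x2) points in plain Python. The raw values and the helper name `center` are hypothetical and are not the chapter's dataset or code.

```python
# Hypothetical raw measurements (x1, x2); not the chapter's actual data.
raw_points = [(4.0, 10.0), (6.0, 14.0), (8.0, 12.0)]

def center(points):
    """Subtract the per-feature mean from every point."""
    n = len(points)
    mean_x1 = sum(p[0] for p in points) / n
    mean_x2 = sum(p[1] for p in points) / n
    return [(p[0] - mean_x1, p[1] - mean_x2) for p in points]

centered = center(raw_points)
print(centered)  # [(-2.0, -2.0), (0.0, 2.0), (2.0, 0.0)]
```

After centering, each feature averages to zero, so the cloud of points sits around the origin.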

![](plots/centered-data.png)

The decision boundary could be anywhere in the shaded area.
All the data points in the graph above are shown later in the form of a Python list (`train_set`). Each element consists of the two resources (x1, x2) and the class the user belongs to (0 or 1).
 
## The Neural Network

In [4]:
"""
Binary Classifier with Linear layer and Back Propagation of Errors
"""
import torch
import torch.nn as nn
from torch.nn import Linear
import torch.optim as optimizer
from utils import display_loss, display_points, plot_points_line_slope_intercept

print(f'Torch version: {torch.__version__}')


Torch version: 1.7.1


Since the data is linearly separable, you need to estimate the parameters w1, w2, and b of the linear function:

w1 * x1 + w2 * x2 + b

or 

WX + b.  

X is the input vector of size two consisting of the scalars x1 and x2.

W is the weight vector consisting of the coefficients w1 and w2.

The bias term is denoted as b. 

The neural network, shown below, has one linear layer (that's it!). It does not have any non-linearity (such as Sigmoid or ReLU) added to it. The reason is that the task of building a decision boundary on the (toy) data we chose is simple and does not require any more complexity. Note that even though the network has no non-linear activation, we call it a neural network because we use backpropagation to estimate the model parameters w1, w2, and b.

We’re going to learn the model parameters, w1, w2, and b by the technique of backpropagation of errors. That means we need a loss function at the head of the network.

The network is represented in Fig 1.

![network](plots/network.png)

Fig 1. The inputs x1 and x2, along with the corresponding class label, are fed into a Linear layer. The arrows indicate the forward path of the single-layer neural network.

The Python class describing the above network in the PyTorch framework is as follows.

Note that we define only the forward path through the network and let PyTorch handle the backward pass.



In [5]:
class LinearClassifier(nn.Module):
    """ One Linear layer """
    def __init__(self):
        super(LinearClassifier, self).__init__()
        self.fully_connected = Linear(2, 1)


    def forward(self, x):
        return self.fully_connected(x)  # WX + b

The reason we do not state anything about the backward pass is that the torch.autograd package provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions.
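To make concrete what automatic differentiation saves us from writing, here is a plain-Python sketch (no PyTorch) of the gradients of the squared error for a single sample, checked against a finite-difference estimate. The function names and the parameter values are illustrative assumptions, not part of the chapter's code.

```python
def predict(w1, w2, b, x1, x2):
    """Output of the single linear unit."""
    return w1 * x1 + w2 * x2 + b

def squared_error(w1, w2, b, x1, x2, y):
    return (predict(w1, w2, b, x1, x2) - y) ** 2

def analytic_grads(w1, w2, b, x1, x2, y):
    """Chain rule: d/dw (y_pred - y)^2 = 2 * (y_pred - y) * d(y_pred)/dw."""
    err = predict(w1, w2, b, x1, x2) - y
    return 2 * err * x1, 2 * err * x2, 2 * err

# Illustrative parameter values and one sample, (x1, x2) = (-2, 1) with class 1.
w1, w2, b = 0.1, -0.2, 0.05
x1, x2, y = -2.0, 1.0, 1.0

g_w1, g_w2, g_b = analytic_grads(w1, w2, b, x1, x2, y)

# Central finite difference on w1 as a sanity check.
eps = 1e-6
numeric_g_w1 = (squared_error(w1 + eps, w2, b, x1, x2, y)
                - squared_error(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

print(g_w1, numeric_g_w1)  # the two values should agree closely
```

Calling `loss.backward()` in PyTorch produces these same kinds of derivatives for every parameter without any hand-written formulas.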

The class `Linear` applies a linear transformation with W^T to the incoming data X and adds a bias term:

XW^T + b

Here, the size of each input sample is 2 and the size of the output is 1.
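The arithmetic behind `Linear(2, 1)` can be sketched in plain Python. In PyTorch the weight of this layer has shape (1, 2), that is (output size, input size), which is why the transpose appears. The weight and bias values below are made up for illustration; a real layer starts from random values.

```python
# Made-up parameters with the same shapes as Linear(2, 1).
W = [[0.5, -1.0]]   # weight, shape (1, 2)
b = [0.25]          # bias, shape (1,)

def linear(X, W, b):
    """Compute X @ W^T + b for a batch X of shape (n, 2); the result has shape (n, 1)."""
    return [[sum(x_j * w_j for x_j, w_j in zip(x, w_row)) + b_k
             for w_row, b_k in zip(W, b)]
            for x in X]

X = [[-2.0, 1.0], [1.0, 1.0]]   # two samples, two features each
print(linear(X, W, b))          # [[-1.75], [-0.25]]
```

Each input row of two numbers is reduced to a single score, which is the value the loss function later compares with the class label.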

## Setting up the Neural Network

We instantiate the model, define the loss function (mean-squared error, MSE for short), and define the optimizer as Stochastic Gradient Descent (SGD) with a learning rate (lr) of 0.01 and a momentum of 0.5.
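For intuition, the update that SGD with momentum applies to a single parameter can be sketched in plain Python. This follows the update rule documented for `torch.optim.SGD` with its default settings (no dampening, no weight decay, no Nesterov). The helper name and the gradient values are made up; the learning rate of 0.01 and momentum of 0.5 match the optimizer call used in this chapter.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.5):
    """One update: the velocity accumulates gradients, then the parameter moves against it."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity

w, v = 1.0, 0.0
for grad in [2.0, 2.0, 2.0]:      # made-up gradients, held constant for clarity
    w, v = sgd_momentum_step(w, grad, v)
    print(round(w, 4), v)
```

With a constant gradient the velocity grows toward grad / (1 - momentum), so each step becomes larger than plain SGD would take, which is the intended accelerating effect.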


In [None]:
model = LinearClassifier()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)  # torch.optim.SGD keeps the cell safe to re-run

train_set = [((-2, -1), 0), ((-2, 1), 1), ((-1, -1.5), 0),
             ((1, 1), 1), ((1.5, -0.5), 1), ((2, -2), 0)]
display_points([sample[0] for sample in train_set],
               [sample[1] for sample in train_set], "Data")


We also define the training set, which consists of only the six points in our dataset.

## Training the Neural Network
We train the model for 50 epochs with an often-used “training” recipe. The recipe consists of the following steps:

- Zeroing out the gradients
- Making a forward pass through the model with a batch of input data
- Applying the loss at the head of the network (the mean squared error between the predicted value and the target value)
- Computing the gradients via backpropagation (in this case, backpropagating through just the one linear layer)
- Updating the model parameters




In [6]:
model.train()
loss_over_epochs = []
for epoch in range(50):
    epoch_loss = 0
    for train_data in train_set:
        # Inputs and targets are data, not parameters, so they need no gradients
        X = torch.tensor([train_data[0]], dtype=torch.float)
        y = torch.tensor([train_data[1]], dtype=torch.float)

        optimizer.zero_grad() # Zero out for each batch
        y_pred = model(X)     # Forward Propagation
        loss = criterion(torch.squeeze(y_pred, 1), y)  # Compute loss
        loss.backward()       # Compute gradient
        optimizer.step()      # Update model parameters
        epoch_loss += loss.item()  # Accumulate a plain float, not a graph-attached tensor

    loss_over_epochs.append(epoch_loss)
    print('Epoch {}, Epoch loss:{}'.format(epoch, epoch_loss))

As shown, we iterate over all the data and all the epochs with a nested for loop.

In our case, we use a batch size of one for simplicity (this need not be so if we had a lot of data). Also, in our case, the computation is performed on the CPU (the default behavior). Later, we’ll show the extra steps needed to move the computations onto the GPU.

Putting the training through its paces generates a loss plot that looks quite typical in its convergence characteristics, which suggests that the choice of hyperparameters (learning rate and momentum) is adequate.

In [None]:
display_loss(loss_over_epochs, "Loss Plot")
print('Model params:', list(model.parameters()))  # https://graphsketch.com/

The model parameters give us the values for w1, w2, and b. We use these later to display the decision boundary.
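One way to turn w1, w2, and b into a line that can be plotted is sketched below. It assumes the class is decided by thresholding the model output at 0.5 (halfway between the targets 0 and 1), so the boundary is the set of points where w1 * x1 + w2 * x2 + b = 0.5. The helper name and the parameter values are hypothetical.

```python
def boundary_slope_intercept(w1, w2, b, threshold=0.5):
    """Solve w1*x1 + w2*x2 + b = threshold for x2 as a function of x1."""
    if w2 == 0:
        raise ValueError("vertical boundary: x2 cannot be written as a function of x1")
    slope = -w1 / w2
    intercept = (threshold - b) / w2
    return slope, intercept

# Hypothetical learned values, for illustration only.
w1, w2, b = 0.1, 0.4, 0.7
slope, intercept = boundary_slope_intercept(w1, w2, b)

# Any point on the line should score at the threshold.
x1 = 2.0
x2 = slope * x1 + intercept
print(w1 * x1 + w2 * x2 + b)  # approximately 0.5
```

Points that score above the threshold fall on one side of this line and are assigned class 1; points that score below it are assigned class 0.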


In [None]:
# detach() returns the values without tracking gradients
(w1, w2) = model.fully_connected.weight.detach().numpy()[0]
b = model.fully_connected.bias.detach().numpy()[0]

In [None]:
# Boundary: w1*x1 + w2*x2 + b = 0.5, i.e. x2 = -(w1/w2)*x1 + (0.5 - b)/w2
plot_points_line_slope_intercept([sample[0] for sample in train_set],
                                 [sample[1] for sample in train_set],
                                 -w1/w2, (0.5 - b)/w2, 'Decision Boundary')

    
# Test on training data
model.eval()
with torch.no_grad():  # No gradients are needed for inference
    for train_data in train_set:
        # The model outputs a raw score, not a probability
        score = model(torch.tensor([train_data[0]], dtype=torch.float)).item()
        label = 0 if score < 0.5 else 1  # Threshold halfway between targets 0 and 1
        verdict = 'correct' if label == train_data[1] else 'wrong'
        print('Data in:{}, Actual Class:{}, Out Score:{:.4f}, Predicted Class:{} ({})'.format(
            train_data, train_data[1], score, label, verdict))