# Continuous XOR


In this section you will build a simple neural network and train it to classify a dataset representing a continuous XOR (the continuous data are generated by adding Gaussian noise to the binary inputs). The desired separation of an XOR dataset could look as follows:

<center style="width: 100%"><img src="https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial2/continuous_xor.svg?raw=1" width="350px"></center>


## Perform standard imports

In [1]:
## Standard libraries
# import os
# import math
import numpy as np 
# import time

## Imports for plotting
import matplotlib.pyplot as plt
%matplotlib inline 
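# IPython.display.set_matplotlib_formats is deprecated; the same function lives in matplotlib_inline
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg', 'pdf') # For export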
from matplotlib.colors import to_rgba
import seaborn as sns
sns.set()

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data

## Generate the data
To generate the data we will use the package `torch.utils.data`, which lets us load the training and test data efficiently.

The data package defines two classes which are the standard interface for handling data in PyTorch: `data.Dataset` and `data.DataLoader`. The dataset class provides a uniform interface to access the training/test data, while the data loader efficiently loads the data points from the dataset and stacks them into batches during training.

#### The dataset class

The dataset class summarizes the basic functionality of a dataset in a natural way. To define a dataset in PyTorch, we simply specify two functions: `__getitem__`, and `__len__`. The get-item function has to return the $i$-th data point in the dataset, while the len function returns the size of the dataset. For the XOR dataset, we can define the dataset class as follows:

In [2]:
class XORDataset(data.Dataset):

    def __init__(self, size, std=0.1):
        """
        Inputs:
            size - Number of data points we want to generate
            std - Standard deviation of the noise (see generate_continuous_xor function)
        """
        super().__init__()
        self.size = size
        self.std = std
        self.generate_continuous_xor()

    def generate_continuous_xor(self):
        # Each data point in the XOR dataset has two variables, x and y, that can be either 0 or 1
        # The label is their XOR combination, i.e. 1 if only x or only y is 1 while the other is 0.
        # If x=y, the label is 0.
        data = torch.randint(low=0, high=2, size=(self.size, 2), dtype=torch.float32)
        label = (data.sum(dim=1) == 1).to(torch.long)
        # To make it slightly more challenging, we add a bit of gaussian noise to the data points.
        data += self.std * torch.randn(data.shape)

        self.data = data
        self.label = label

    def __len__(self):
        # Number of data points we have. Alternatively self.data.shape[0], or self.label.shape[0]
        return self.size

    def __getitem__(self, idx):
        # Return the idx-th data point of the dataset
        # If we have multiple things to return (data point and label), we can return them as tuple
        data_point = self.data[idx]
        data_label = self.label[idx]
        return data_point, data_label

Let's try to create such a dataset and inspect it:

In [3]:
dataset = XORDataset(size=200)
print("Size of dataset:", len(dataset))
print("Data point 0:", dataset[0])

Size of dataset: 200
Data point 0: (tensor([1.0160, 0.0804]), tensor(1))


### EXERCISE:  plot the samples below. 

In [4]:
# YOUR CODE HERE
# Remember to do data.cpu().numpy(), etc before plotting


### QUICK RECALL: The data loader class

The class `torch.utils.data.DataLoader` represents a Python iterable over a dataset with support for automatic batching, multi-process data loading and many more features. The data loader communicates with the dataset using the function `__getitem__`, and stacks its outputs as tensors over the first dimension to form a batch.
In contrast to the dataset class, we usually don't have to define our own data loader class, but can just create an object of it with the dataset as input. Additionally, we can configure our data loader with the following input arguments (only a selection, see full list [here](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)):

* `batch_size`: Number of samples to stack per batch
* `shuffle`: If True, the data is returned in a random order. This is important during training for introducing stochasticity. 
* `num_workers`: Number of subprocesses to use for data loading. The default, 0, means that the data will be loaded in the main process, which can slow down training for datasets where loading a data point takes a considerable amount of time (e.g. large images). More workers are recommended for those, but can cause issues on Windows computers. For tiny datasets such as ours, 0 workers are usually faster.
* `pin_memory`: If True, the data loader will copy tensors into pinned (page-locked) host memory before returning them, which can speed up transfers to the GPU for large data points. This is usually good practice for a training set, but not necessarily for validation and test sets, where saving pinned memory may matter more.
* `drop_last`: If True, the last batch is dropped in case it is smaller than the specified batch size. This occurs when the dataset size is not a multiple of the batch size. Only potentially helpful during training to keep a consistent batch size.
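
To make the effect of `batch_size` and `drop_last` concrete, here is a minimal sketch on a toy dataset. The tensors and the use of `data.TensorDataset` are assumptions made for illustration only; they are not the XOR dataset from this section.

```python
import torch
import torch.utils.data as data

# Hypothetical toy data: 10 points with 2 features each, and binary labels.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10) % 2
toy_dataset = data.TensorDataset(features, labels)

# batch_size=4 over 10 points gives batches of size 4, 4 and 2.
loader = data.DataLoader(toy_dataset, batch_size=4, shuffle=False)
batch_sizes = [x.shape[0] for x, _ in loader]

# With drop_last=True the final, smaller batch is discarded.
loader_drop = data.DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)
batch_sizes_drop = [x.shape[0] for x, _ in loader_drop]

print(batch_sizes)        # [4, 4, 2]
print(batch_sizes_drop)   # [4, 4]
```

Calling `len(loader)` returns the number of batches (3 and 2 here), not the number of data points.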

### EXERCISE: create a data loader for the training set and one for the test set

You can create the training and test sets separately, or create a single dataset and split it using scikit-learn's `train_test_split` function. 
Do not forget to import it using `from sklearn.model_selection import train_test_split`.

The training set should contain 2500 points and use a batch size of 128. The test set should contain 500 points and use a batch size of 25.


In [5]:
# YOUR CODE HERE



In [6]:
# next(iter(...)) catches the first batch of the data loader
# If shuffle is True, this will return a different batch every time we run this cell
# For iterating over the whole dataset, we can simply use "for batch in data_loader: ..."

data_loader = data.DataLoader(dataset, batch_size=8, shuffle=True)
data_inputs, data_labels = next(iter(data_loader))

# The shape of the outputs is [batch_size, d_1,...,d_N] where d_1,...,d_N are the 
# dimensions of the data point returned from the dataset class
print("Data inputs", data_inputs.shape, "\n", data_inputs)
print("Data labels", data_labels.shape, "\n", data_labels)

Data inputs torch.Size([8, 2]) 
 tensor([[ 1.1871,  0.1322],
        [ 1.0012,  0.0388],
        [ 0.1054, -0.0954],
        [-0.0770,  1.0230],
        [-0.0293, -0.0560],
        [-0.0918, -0.0035],
        [ 0.9766,  0.0374],
        [ 1.1661, -0.1935]])
Data labels torch.Size([8]) 
 tensor([1, 1, 0, 1, 0, 0, 1, 1])


## Build the model

### EXERCISE:
* Construct the class for a minimal network with an input layer, one hidden layer with tanh as activation function, and an output layer. 
* Create an instance of the model using four hidden neurons.
* Print the model and its parameters

In [7]:
# YOUR CODE HERE


In [8]:
# YOUR CODE HERE


## Train the model

After defining the model and the dataset, it is time to prepare the optimization of the model. 

Recall that during training, we will perform the following steps:

1. Get a batch from the data loader
2. Obtain the predictions from the model for the batch
3. Calculate the loss based on the difference between predictions and labels
4. Backpropagation: calculate the gradients for every parameter with respect to the loss
5. Update the parameters of the model by taking a step in the direction opposite to the gradients, which decreases the loss
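
The steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the toy data, the `nn.Linear` placeholder model, and the hyperparameters are invented for the example and are not the network you are asked to build.

```python
import torch
import torch.nn as nn
import torch.utils.data as data

torch.manual_seed(0)

# Hypothetical toy problem: the label is 1 when the first feature is positive.
inputs = torch.randn(64, 2)
targets = (inputs[:, 0] > 0).float()
loader = data.DataLoader(data.TensorDataset(inputs, targets), batch_size=16, shuffle=True)

model = nn.Linear(2, 1)                  # placeholder model that outputs one logit
loss_module = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(20):
    running = 0.0
    for batch_inputs, batch_targets in loader:        # 1. get a batch
        logits = model(batch_inputs).squeeze(dim=1)   # 2. obtain predictions
        loss = loss_module(logits, batch_targets)     # 3. calculate the loss
        optimizer.zero_grad()                         # clear gradients of the previous step
        loss.backward()                               # 4. backpropagation
        optimizer.step()                              # 5. update the parameters
        running += loss.item() * batch_inputs.shape[0]
    epoch_losses.append(running / len(inputs))

print(epoch_losses[0], epoch_losses[-1])
```

Note the call to `optimizer.zero_grad()`: PyTorch accumulates gradients across `backward()` calls, so they have to be reset before each new backward pass.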


#### Loss modules

We can calculate the loss for a batch by simply performing a few tensor operations as those are automatically added to the computation graph. For instance, for binary classification, we can use Binary Cross Entropy (BCE) which is defined as follows:

$$\mathcal{L}_{BCE} = -\sum_i \left[ y_i \log x_i + (1 - y_i) \log (1 - x_i) \right]$$

where $y$ are our labels, and $x$ our predictions, both in the range of $[0,1]$. However, PyTorch already provides a list of predefined loss functions which we can use (see [here](https://pytorch.org/docs/stable/nn.html#loss-functions) for a full list). For instance, for BCE, PyTorch has two modules: `nn.BCELoss()` and `nn.BCEWithLogitsLoss()`. While `nn.BCELoss` expects the inputs $x$ to be in the range $[0,1]$, i.e. the output of a sigmoid, `nn.BCEWithLogitsLoss` combines a sigmoid layer and the BCE loss in a single class. This version is numerically more stable than using a plain sigmoid followed by a BCE loss because of the logarithms applied in the loss function. Hence, it is advised to use loss functions applied on "logits" where possible (remember not to apply a sigmoid on the output of the model in this case!). For the model in this section, we therefore use the module `nn.BCEWithLogitsLoss`. 
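
As a quick check of the relation between the formula and the two modules, the following sketch compares them on a few hand-picked values (the numbers are arbitrary assumptions). Note that both modules average over the batch by default (`reduction="mean"`), so `reduction="sum"` is used here to match the summed formula above.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs (hypothetical values)
labels = torch.tensor([1.0, 0.0, 1.0])

# BCE written out by hand, following the formula with x = sigmoid(logits).
probs = torch.sigmoid(logits)
manual = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).sum()

bce = nn.BCELoss(reduction="sum")(probs, labels)
bce_logits = nn.BCEWithLogitsLoss(reduction="sum")(logits, labels)
print(torch.allclose(manual, bce), torch.allclose(manual, bce_logits))

# Numerical stability: for a very large logit the sigmoid saturates to exactly 1.0
# in float32, so the hand-written log(1 - x) becomes log(0).
big_logit = torch.tensor([100.0])
zero_label = torch.tensor([0.0])
naive = -torch.log(1 - torch.sigmoid(big_logit))       # infinite
stable = nn.BCEWithLogitsLoss()(big_logit, zero_label)  # finite, about 100
print(naive, stable)
```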

#### Stochastic Gradient Descent

For updating the parameters, you will use `torch.optim.SGD`, as seen in the previous sections.

Remember that the input to the optimizer is the set of model parameters: `model.parameters()`.


A good default value of the learning rate for a small network such as ours is 0.1. 
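
To see what a single SGD step does, here is a minimal sketch on one hand-made parameter. The parameter `w` and the quadratic loss are assumptions chosen so that the gradient is easy to verify by hand; without momentum or weight decay, the update is simply `w - lr * grad`.

```python
import torch

# A single hypothetical parameter vector instead of model.parameters().
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()        # the gradient of this loss is 2 * w
optimizer.zero_grad()
loss.backward()

before = w.detach().clone()
grad = w.grad.clone()
optimizer.step()

# Plain SGD moves each parameter against its gradient, scaled by the learning rate.
expected = before - 0.1 * grad
print(w.detach(), expected)
```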

### EXERCISE: Train your network and evaluate it on the test data

* Train the network for 100 epochs
* Evaluate it simultaneously (do not forget to deactivate gradients using `with torch.no_grad()`).
* Use the accuracy as the evaluation metric. You should also plot the loss on the training and test sets across epochs.

Before starting, it is better to redo the following steps:
* Create a training set of 2500 points with batch size 128, and a test set of 500 points with batch size 25
* Create a new instance of your model and send it to the device


In [9]:
# YOUR CODE HERE


### Training

Finally, we are ready to train our model. As a first step, we create a slightly larger dataset and specify a data loader with a larger batch size. 

HINT: 

1. You can write a small training function. Remember our five steps: load a batch, obtain the predictions, calculate the loss, backpropagate, and update. Here the model should be in training mode, which is set by calling `model.train()`. Training mode matters for layers such as BatchNorm and Dropout, which behave differently during training and evaluation.

2. Similarly, you can write an evaluation function and call it from the training function. Here the model should be in evaluation mode (use `model.eval()`).
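
As a hedged illustration of the second hint, the sketch below computes accuracy from logits in evaluation mode. The helper name `accuracy`, the stand-in model with a Dropout layer, and the random data are all assumptions for the example; your own evaluation function will typically iterate over a data loader instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model with a Dropout layer, so that train and eval modes differ.
model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Dropout(p=0.5), nn.Linear(4, 1))
inputs = torch.randn(8, 2)
labels = torch.randint(low=0, high=2, size=(8,))

@torch.no_grad()                      # no computation graph is built inside this function
def accuracy(model, inputs, labels):
    model.eval()                      # Dropout is disabled in evaluation mode
    logits = model(inputs).squeeze(dim=1)
    preds = (torch.sigmoid(logits) >= 0.5).long()   # threshold the probabilities at 0.5
    return (preds == labels).float().mean().item()

acc = accuracy(model, inputs, labels)

# In evaluation mode two forward passes give identical outputs.
with torch.no_grad():
    out_a = model(inputs)
    out_b = model(inputs)

model.train()                         # switch back before continuing training
print(acc, model.training)
```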



In [10]:
# YOUR CODE HERE, code for evaluation function


In [11]:
# YOUR CODE HERE, code for training function



In [12]:
# YOUR CODE HERE, here you call the training function and get the results

### EXERCISE: plot the losses and accuracy

In [13]:
# YOUR CODE HERE

If we trained our model correctly, we should see a score close to 100% accuracy. However, this is only possible because of our simple task, and unfortunately, we usually don't get such high scores on test sets of more complex tasks.

### EXERCISE: save the model and reload it

In [14]:
# SAVE THE MODEL


# A detailed tutorial on saving and loading models in PyTorch 
# can be found [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html).
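
One common pattern, sketched here under assumptions, is to save only the `state_dict` and load it into a freshly created model with the same architecture. The `build_model` helper, the file name `xor_model.pt`, and the temporary directory are hypothetical choices for this example.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model():
    # Placeholder architecture; it must be identical when saving and when loading.
    return nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1))

model = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "xor_model.pt")    # hypothetical file name
    torch.save(model.state_dict(), path)            # store the parameters only

    restored = build_model()                        # fresh, randomly initialized model
    restored.load_state_dict(torch.load(path))      # overwrite it with the saved parameters

restored.eval()

# Both models should now produce the same outputs.
x = torch.randn(5, 2)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print(same)
```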

### EXERCISE: visualize the classification boundaries using the model you just loaded

To visualize what our model has learned, we can perform a prediction for every data point in a range of $[-0.5, 1.5]$, and visualize the predicted class as in the sample figure at the beginning of this section. This shows where the model has created decision boundaries, and which points would be classified as $0$, and which as $1$. We therefore get a background image out of blue (class 0) and orange (class 1). 
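
The grid of predictions can be computed as in the following sketch. The untrained stand-in model and the grid resolution of 50 are assumptions; in the exercise you would use your trained model and add the plotting code.

```python
import torch
import torch.nn as nn

# Untrained stand-in model; replace it with your trained or reloaded model.
model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1))
model.eval()

# Regular grid over [-0.5, 1.5] x [-0.5, 1.5].
coords = torch.linspace(-0.5, 1.5, steps=50)
xx, yy = torch.meshgrid(coords, coords, indexing="ij")
grid = torch.stack([xx, yy], dim=-1)                 # shape [50, 50, 2]

with torch.no_grad():
    logits = model(grid.reshape(-1, 2)).reshape(50, 50)
    probs = torch.sigmoid(logits)                    # probability of class 1 per grid point

predicted_class = (probs >= 0.5).long()              # 0 or 1 for every grid point
print(predicted_class.shape)
```

With `indexing="ij"` the first axis follows $x$, so an image plot would need the transpose, for example `plt.imshow(predicted_class.T, origin="lower", extent=(-0.5, 1.5, -0.5, 1.5))`.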

In [16]:
# YOUR CODE HERE
# @torch.no_grad() # Decorator, same effect as "with torch.no_grad(): ..." over the whole function.
