## 3.3 Lab 1 / Case 1: Non-Linear Regression

In this lab, you will use the same [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg), but we'll bring more features to the mix, as you will also learn how to embed discrete/categorical features so they can be used to train the model.

The columns, or attributes, of this dataset, are as follows:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Remember that the last column, `car name`, is actually separated by tabs (instead of spaces), so we're considering the cars' names as comments while loading the dataset.

We're loading the dataset into a Pandas dataframe just like before:

In [None]:
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']

df = pd.read_csv(url, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True)
df

### 3.3.1 Train-Validation-Test Split

Split the dataset into train, validation, and test sets using Scikit-Learn's `train_test_split()` function:

In [None]:
from sklearn.model_selection import train_test_split

raw_data = {}
raw_data['train'], raw_data['test'] = ...
raw_data['train'], raw_data['val'] = ...

In [None]:
raw_data['train']
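
As a rough illustration of how two consecutive calls can produce three sets, here is a sketch on a toy dataframe. The 80/20 proportions, the `random_state`, and every `toy_*` name are assumptions made for this example only, not the values prescribed by the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for the real dataset (hypothetical data)
toy_df = pd.DataFrame({'x': range(100), 'y': range(100)})

# First call: hold out 20% of the rows as the test set
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=42)
# Second call: split the remaining rows again to carve out a validation set
toy_train, toy_val = train_test_split(toy_train, test_size=0.2, random_state=42)

print(len(toy_train), len(toy_val), len(toy_test))
```

The second call only sees rows left over from the first one, so the three sets never overlap.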

### 3.3.2 Missing Values

In this lab, we're throwing rows with missing values away, so make sure there are no NAs left in your datasets.

In [None]:
...
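
A minimal sketch of dropping incomplete rows, using a made-up dataframe (the values and the `toy_df` and `clean_df` names are hypothetical). Pandas' `dropna()` returns a new dataframe, so the result must be assigned back to be kept.

```python
import numpy as np
import pandas as pd

# Toy dataframe with one missing horsepower value (hypothetical data)
toy_df = pd.DataFrame({'hp': [130.0, np.nan, 150.0],
                       'weight': [3504.0, 3693.0, 3436.0]})

# dropna() removes every row that contains at least one missing value
clean_df = toy_df.dropna()

# Count the missing values left in the cleaned dataframe
n_missing = int(clean_df.isna().sum().sum())
print(len(toy_df), len(clean_df), n_missing)
```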

### 3.3.3 Continuous Attributes

We've done this already, but this time you should write a `standardize()` function that:
- takes a Pandas dataframe, a list of column names that are continuous attributes, and an optional scaler
- creates and trains a Scikit-Learn `StandardScaler` if one isn't provided as an argument
- returns a PyTorch tensor containing the standardized features and an instance of Scikit-Learn's `StandardScaler`

In [None]:
from sklearn.preprocessing import StandardScaler

def standardize(df, cont_attr, scaler=None):
    # write your code here
    ...
    
    return cont_X, scaler

Use your `standardize` function to standardize all continuous attributes in our datasets. Don't forget you shouldn't train scalers on validation and test sets. They must use the scaler trained on the training set!

In [None]:
cont_attr = ['disp', 'hp', 'weight', 'acc']

cont_data = {'train': None, 'val': None, 'test': None}
# write your code here
...
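
The "fit on train, reuse elsewhere" rule can be sketched on toy data. This is only an illustration under assumed column names (`a`, `b`) and made-up values; it is not the `standardize()` function itself.

```python
import pandas as pd
import torch
from sklearn.preprocessing import StandardScaler

# Toy training and validation dataframes (hypothetical data)
toy_train = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})
toy_val = pd.DataFrame({'a': [2.5], 'b': [50.0]})

# The scaler learns means and standard deviations from the training set only
toy_scaler = StandardScaler()
toy_scaler.fit(toy_train[['a', 'b']])

# The very same fitted scaler transforms both sets
train_X = torch.as_tensor(toy_scaler.transform(toy_train[['a', 'b']]), dtype=torch.float32)
val_X = torch.as_tensor(toy_scaler.transform(toy_val[['a', 'b']]), dtype=torch.float32)

print(train_X.mean(dim=0), val_X)
```

The validation value `2.5` in column `a` happens to equal the training mean, so it is mapped to zero. The value `50.0` in column `b` lies outside the training range and ends up well above one.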

### 3.3.4 Discrete and Categorical Attributes

We've already talked about these attributes, but we didn't do anything about them. It is time to change that!

Your goal here is to convert each possible value in a discrete or categorical attribute into a numerical array of a given length (that does not need to match the number of unique values). Before converting them into arrays, though, we need to encode them as sequential numbers first.

Let's see what this looks like for the `cyl` attribute of our training dataset. It has only five unique values: 3, 4, 5, 6, and 8 cylinders.

In [None]:
cyls = sorted(raw_data['train']['cyl'].unique())
cyls

We can easily build a dictionary to map them into sequential numbers:

In [None]:
cyls_map = dict((v, i) for i, v in enumerate(cyls))
cyls_map

Now imagine there's a lookup table with as many entries as unique values, each entry being a numerical array of a given length (say, eight elements). Let's create such a lookup table filled with random values as an illustration:

In [None]:
import torch

n_dim = 8
lookup_table = torch.randn((len(cyls), n_dim))
lookup_table

There are five rows, each corresponding to a unique number of cylinders. Three cylinders, according to our mapping dictionary, corresponds to the first (index zero) row; four cylinders, to the second (index one) row; and so on, and so forth.

Let's say we'd like to retrieve the numerical array corresponding to six cylinders. We apply the mapping to find the corresponding index (`cyls_map[6]`) and use the result to actually slice the corresponding row from the lookup table (`lookup_table[idx]`):

In [None]:
idx = cyls_map[6]
lookup_table[idx]

There we go! Now, any number of cylinders can easily be mapped to a sequence of eight numerical values. It is as if any given number of cylinders, a categorical attribute, were now represented by eight numerical features instead. We have just (re)invented embeddings!

The fact that these numbers are random is not necessarily an issue: we can simply turn the whole lookup table into parameters of the model itself, so they are also learned during training. The model will learn the best way to represent each value in a categorical attribute as a sequence of numerical attributes! How cool is that?
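
To make the idea of a learnable lookup table a bit more concrete, the sketch below wraps a random table in `nn.Parameter` and backpropagates a made-up loss through a single row. The table size, the index, and the loss are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(13)
# Hypothetical learnable lookup table: 5 unique values, 8 dimensions each
toy_table = nn.Parameter(torch.randn(5, 8))

# Pretend the row at index 3 was used to compute some loss
loss = toy_table[3].sum()
loss.backward()

# Only the row that was looked up receives a nonzero gradient
print(toy_table.grad.abs().sum(dim=1))
```

Rows that were never looked up get zero gradients, so an optimizer step would only change the representation of values that actually appeared in the mini-batch.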

So, let's use PyTorch's `Embedding` layer instead of our own lookup table. The arguments are the same, though: the number of unique values, and the desired number of elements - or dimensions - in the returned numerical array.

In [None]:
import torch.nn as nn

emb_table = nn.Embedding(len(cyls), n_dim)
emb_table.weight

The embedding layer, like any other layer in PyTorch, is also a model. Its weights are, surprise, surprise, the lookup table itself. Besides, since it's a model, it can be called as such, and its expected input is a batch of indices. Let's try it out and see what we get out of it:

In [None]:
idx = cyls_map[6]
emb_table(torch.as_tensor([idx]))

There we go! Six cylinders is the fourth value in our encoded list, and therefore the embedding layer returned its fourth row of weights.
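
As a quick sanity check, the sketch below confirms that calling an embedding layer on a batch of indices is the same as indexing its weights. The layer size and the indices are arbitrary choices for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
toy_emb = nn.Embedding(5, 8)  # 5 unique values, 8 dimensions

# A mini-batch of three encoded values; index 3 appears twice
batch_idx = torch.as_tensor([3, 0, 3])
out = toy_emb(batch_idx)

# Calling the layer returns the same rows as indexing its weights directly
same = torch.equal(out, toy_emb.weight[batch_idx])
print(out.shape, same)
```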

A special case of embedding is the one-hot encoding (OHE) approach: instead of letting the model learn it during training, the mapping is fixed. In OHE, the numerical array has the same length as the number of unique values and it has only one nonzero element. It works as if each unique value were a dummy variable, for example: `cyl3`, `cyl4`, `cyl5`, `cyl6`, and `cyl8`, and only one of those dummy variables may have a nonzero value.

In [None]:
ohe_table = torch.eye(len(cyls))
ohe_table

In [None]:
idx = cyls_map[6]
ohe_table[idx]
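
One way to see why OHE is a special case of embedding: multiplying a one-hot vector by a weight matrix selects exactly one row of that matrix, which is what an embedding lookup does. The sketch below assumes five unique values and eight dimensions, and uses PyTorch's `F.one_hot` to build the one-hot vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
n_values, n_dim = 5, 8
toy_emb = nn.Embedding(n_values, n_dim)

idx = torch.as_tensor([3])
# One-hot representation of index 3, shape (1, 5)
one_hot = F.one_hot(idx, num_classes=n_values).float()

# Multiplying a one-hot vector by the weight matrix selects a single row
via_matmul = one_hot @ toy_emb.weight
via_lookup = toy_emb(idx)

print(torch.allclose(via_matmul, via_lookup))
```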

In this lab, we'll be using real, learnable, embeddings. Embeddings are an important part of modern deep learning, and a fundamental piece of natural language processing, as we'll see in Chapters 2 and 4.

Now, it's your time to encode categorical attributes and create embeddings for them!

#### 3.3.4.1 Ordinal Encoder

Instead of building dictionaries to manually encode categorical values into sequential numbers, write a function that uses Scikit-Learn's `OrdinalEncoder` instead. Similarly to the standardization function you already wrote, this function:
- takes a Pandas dataframe, a list of column names that are categorical attributes, and an optional encoder
- creates and trains a Scikit-Learn `OrdinalEncoder` if one isn't provided as an argument
- returns a PyTorch tensor containing the encoded categorical features and an instance of Scikit-Learn's `OrdinalEncoder`

In [None]:
from sklearn.preprocessing import OrdinalEncoder

def encode(df, cat_attr, encoder=None):
    # write your code here
    ...
    
    return cat_X, encoder

Use your `encode` function to encode all categorical attributes in our datasets. Don't forget you shouldn't train encoders on validation and test sets. They must use the encoder trained on the training set!

In [None]:
disc_attr = ['cyl', 'origin']

cat_data = {'train': None, 'val': None, 'test': None}
# write your code here
...

The `categories_` attribute of the trained encoder should be a list of arrays of unique values, one for each encoded attribute:

In [None]:
encoder.categories_

If we check the encoded attributes, their unique values should be lists of sequential numbers.

In [None]:
cat_data['train'][:, 0].unique(), cat_data['train'][:, 1].unique()
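
For reference, this is roughly how an `OrdinalEncoder` behaves on a toy example. The dataframes and the `toy_*` names are made up for illustration; the conversion to `torch.long` is a choice made here because embedding layers expect integer indices.

```python
import pandas as pd
import torch
from sklearn.preprocessing import OrdinalEncoder

# Toy training and validation dataframes (hypothetical data)
toy_train = pd.DataFrame({'cyl': [4, 8, 6, 4], 'origin': [1, 3, 2, 1]})
toy_val = pd.DataFrame({'cyl': [6, 4], 'origin': [3, 3]})

# The encoder learns the sorted unique values of each column from the training set
toy_encoder = OrdinalEncoder()
toy_encoder.fit(toy_train[['cyl', 'origin']])

# transform() returns floats, so we cast the result to integer (long) indices
train_cat = torch.as_tensor(toy_encoder.transform(toy_train[['cyl', 'origin']]), dtype=torch.long)
val_cat = torch.as_tensor(toy_encoder.transform(toy_val[['cyl', 'origin']]), dtype=torch.long)

print(toy_encoder.categories_)
print(train_cat)
```

With its default settings, the encoder raises an error if the validation or test set contains a value it has never seen during fitting.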

### 3.3.5 Embeddings: From Categorical to Continuous

Write code to create a list of embedding layers, each layer configured to handle one particular attribute, that is, one layer to embed `cyl` and another one to embed `origin`. You're free to choose the number of elements/dimensions that the resulting arrays will have.

In [None]:
embedding_layers = []
# write your code here
...

In [None]:
embedding_layers

Now, try out your layers by embedding the first five rows of your categorical training data. You should get a list containing two tensors with five rows and as many columns/dimensions as you chose in the previous step.

In [None]:
embeddings = []
# write your code here
...

In [None]:
embeddings

In practice, though, your model won't be using a list of embeddings, but their concatenation along the horizontal axis instead. You can use `torch.cat` to accomplish this.

In [None]:
torch.cat(embeddings, 1)

Now your categorical attributes are represented by many (learned) numerical features. Later on, when building your model, you will have to concatenate both the original continuous features, and those learned via embeddings.
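
The shapes involved in that final concatenation can be sketched with toy tensors. The sizes below (two attributes with five and three unique values, four dimensions each, four continuous features) are arbitrary assumptions, and the `toy_*` names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
# Hypothetical sizes: 5 values for one attribute, 3 for the other, 4 dimensions each
toy_layers = [nn.Embedding(5, 4), nn.Embedding(3, 4)]

toy_cat = torch.as_tensor([[0, 2], [3, 1], [4, 0]])  # three rows, two encoded attributes
toy_cont = torch.randn(3, 4)                          # three rows, four continuous features

# Each layer embeds its own column of indices
toy_embs = [layer(toy_cat[:, i]) for i, layer in enumerate(toy_layers)]

# Continuous features and concatenated embeddings sit side by side
toy_features = torch.cat([toy_cont, torch.cat(toy_embs, 1)], 1)
print(toy_features.shape)
```

Four continuous features plus two embeddings of four dimensions each yield twelve columns per row.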

### 3.3.6 Target and Task

Your features are already taken care of, so it's time to create column tensors for your target attribute. Make sure they are of the type `float32`.

In [None]:
target_data = {'train': None, 'val': None, 'test': None}
target_col = 'mpg'
# write your code here
...

### 3.3.7 Custom Dataset

Previously, we used a simple `TensorDataset` for our single feature and target. Now let's build our own custom dataset class instead by inheriting from the `Dataset` class. 

It needs to implement some basic methods:
- `__init__(self)`
- `__getitem__(self, index)`
- `__len__(self)`

The constructor (`__init__()`) method may receive any arguments you may possibly need, so you can create and preprocess your tensors right away or, as is often the case when your dataset is too large, load them on demand. In our case, the constructor will receive the following arguments:

- `raw_data`: a Pandas dataframe containing our (small) dataset
- `cont_attr`: a list of the continuous attributes we'd like to use
- `disc_attr`: a list of the discrete/categorical attributes we'd like to use
- `target_col`: the name of the column containing the target attribute we'd like to predict
- `scaler`: an optional instance of a `StandardScaler` to standardize the continuous attributes
- `encoder`: an optional instance of an `OrdinalEncoder` to encode the discrete attributes sequentially

You can use these arguments to preprocess and store the resulting tensors as class attributes, which you can retrieve at your convenience when other methods are called. Remember that you have already written functions to standardize continuous attributes and to encode categorical ones, so feel free to use them.

In the `__getitem__()` method, which makes a dataset "sliceable" just like a Python list, you should return a tuple `(features, target)` corresponding to the requested index. Notice that the first element of your tuple, `features`, does not necessarily need to be a single tensor. It may be anything: another tuple, or even a dictionary. Remember that we have two types of features, continuous and categorical, and they are going to be handled differently in our model.

In the `__len__()` method, you only need to return the total number of elements in your dataset.
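
The three methods described above can be sketched with a deliberately tiny, unrelated dataset. The `ToyDataset` class and its contents are hypothetical; the only point is to show `features` being returned as a tuple of two tensors.

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, n):
        # Tensors are created once, in the constructor, and stored as attributes
        self.n = n
        self.x = torch.arange(n, dtype=torch.float32).view(-1, 1)
        self.flag = torch.arange(n) % 2
        self.y = 2 * self.x

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Features are a tuple: one float tensor and one integer tensor
        features = (self.x[idx], self.flag[idx])
        return (features, self.y[idx])

toy_ds = ToyDataset(10)
(first_x, first_flag), first_y = toy_ds[:3]
print(len(toy_ds), first_x.shape, first_flag, first_y.shape)
```

Since the stored tensors support slicing, indexing the dataset with a slice such as `toy_ds[:3]` works without any extra code.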

In [None]:
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    def __init__(self, raw_data, cont_attr, disc_attr, target_col, scaler=None, encoder=None):
        self.n = ...
        self.target = ...
        self.cont_data, self.scaler = ...
        self.cat_data, self.encoder = ...
        
    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        features = ...
        target = ...
        return (features, target)

Once your custom class has been defined, use it to create training, validation, and test datasets. Don't forget that the scaler and the encoder should be fitted on the training set only!

In [None]:
datasets = {'train': None, 'val': None, 'test': None}
# write your code here
...

In [None]:
datasets['train'][:5]

You should see the features and targets of the first five elements from your training set.

### 3.3.8 Data Loaders

Next, you need to create data loaders, one for each set. It is recommended to shuffle the training set, but don't bother shuffling the others. Dropping the last mini-batch, in case your set isn't a perfect multiple of your mini-batch size, is also recommended.

In [None]:
from torch.utils.data import DataLoader

dataloaders = {'train': None, 'val': None, 'test': None}
# write your code here
...
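
The effect of `shuffle` and `drop_last` is easy to see on a toy dataset of ten points. The batch size of four and the `toy_ds` name are assumptions made for this sketch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy data points (hypothetical features and targets)
toy_x = torch.arange(10).float().view(-1, 1)
toy_ds = TensorDataset(toy_x, 2 * toy_x)

# Training loader: shuffled, and the incomplete last mini-batch is dropped
train_loader = DataLoader(toy_ds, batch_size=4, shuffle=True, drop_last=True)
# Validation loader: no shuffling, every data point is kept
val_loader = DataLoader(toy_ds, batch_size=4)

print(len(train_loader), len(val_loader))
```

Ten points with a batch size of four give two full mini-batches when the last one is dropped, and three mini-batches (the last one with only two points) otherwise.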

### 3.3.9 Custom Model

Your next task is to build a custom model that can handle continuous and categorical features (via embeddings), and that is non-linear in nature. Before moving on, let's briefly discuss two topics: `ModuleList` and the importance of non-linearities.

#### 3.3.9.1 `ModuleList`

`ModuleList` is a special type of list, one that allows PyTorch to recursively look for learnable parameters of layers and models inside its contents. As it turns out, if an attribute of your custom model is a regular Python list, any layers or models inside it will be ignored by PyTorch during training. By explicitly making a `ModuleList` out of a regular Python list, we ensure that its parameters are also accounted for.

In our custom model, we have a list of embedding layers, one for each categorical attribute. Therefore, if we want our model to properly learn these embeddings, we need to make it a `ModuleList`.
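
The difference can be checked by counting the parameters each version exposes. The two classes below are hypothetical, stripped-down models that only hold two embedding layers.

```python
import torch.nn as nn

class WithPythonList(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers inside a regular list are NOT registered by PyTorch
        self.layers = [nn.Embedding(5, 4), nn.Embedding(3, 4)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers inside a ModuleList are registered as submodules
        self.layers = nn.ModuleList([nn.Embedding(5, 4), nn.Embedding(3, 4)])

n_plain = len(list(WithPythonList().parameters()))
n_registered = len(list(WithModuleList().parameters()))
print(n_plain, n_registered)
```

An optimizer built from the first model's `parameters()` would have nothing to update, and the embeddings would stay at their random initial values.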

#### 3.3.9.2 Non-Linearities

This is the "secret sauce" of deep learning models! If it weren't for non-linear activation functions, functions that twist and turn, or outright chop off, intermediate values, we would be eternally stuck with linear regression.

If you line up two linear layers in a row with nothing in-between, these two linear layers will have an exact single-layer equivalent model. Only by adding non-linear activation functions between layers can we break this equivalence and make the model effectively more complex.

```python
# Redundant
model = nn.Sequential(nn.Linear(1, 10),
                      nn.Linear(10, 1))

# Good!
model = nn.Sequential(nn.Linear(1, 10),
                      nn.ReLU(),  # non-linearity FTW!
                      nn.Linear(10, 1))
```
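
The redundancy can be verified numerically. Since $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, the sketch below builds the equivalent single layer by hand and compares its outputs to those of the stacked layers. The layer sizes match the snippet above; the variable names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
first, second = nn.Linear(1, 10), nn.Linear(10, 1)
stacked = nn.Sequential(first, second)

# Equivalent single layer: W = W2 @ W1 and b = W2 @ b1 + b2
single = nn.Linear(1, 1)
with torch.no_grad():
    single.weight.copy_(second.weight @ first.weight)
    single.bias.copy_(second.weight @ first.bias + second.bias)

x = torch.linspace(-3, 3, 7).view(-1, 1)
# Largest absolute difference between the two models' predictions
max_diff = (stacked(x) - single(x)).abs().max().item()
print(max_diff)
```

The difference should be zero up to floating-point rounding, which means the stacked model has 31 parameters but no more expressive power than a single layer with two.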

The use of non-linear activation functions is what allows models to learn complex decision boundaries when separating data points in different classes. For example, in a binary classification problem, a linear model can only produce a decision boundary that is, well, linear (left plot). A more-complex model that includes a non-linear activation function, on the other hand, can produce, well, non-linear boundaries (right plot).

|Linear  |Non-Linear  |
|---|---|
| ![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch3/linear_boundary.png) | ![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch3/non_linear_boundaries.png) |

***

PyTorch implements many non-linear activation functions.

Classical functions:

- [Sigmoid](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html#torch.nn.Sigmoid): the first activation function, chosen because of its mathematical properties; it is rarely used in-between layers, but it can be used to convert logits into probabilities for binary classification tasks;
- [Tanh](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html#torch.nn.Tanh): the hyperbolic tangent activation function was developed to overcome the major issue with the sigmoid function, the fact that it isn't centered at zero; it is also rarely used in-between layers, but it's an internal component of other layers, such as recurrent layers.

ReLU-family of functions:

- [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU): the rectified linear unit is a simple yet powerful activation function as it simply preserves positive values while turning all negative values into zero; it addressed the problem of vanishing gradients (that is, when a model stops learning) and it spawned a whole family of activation functions;
- [ReLU6](https://pytorch.org/docs/stable/generated/torch.nn.ReLU6.html#torch.nn.ReLU6): a ReLU function that is capped at six;
- [LeakyReLU](https://pytorch.org/docs/stable/generated/torch.nn.LeakyReLU.html#torch.nn.LeakyReLU): the leaky rectified linear unit is a modified ReLU where negative values are multiplied by a tiny factor such as 0.01 instead of being turned into zero; it addressed the problem of "dead neurons" (that is, a neuron whose inputs are consistently negative and therefore does not have its weights updated);
- [PReLU](https://pytorch.org/docs/stable/generated/torch.nn.PReLU.html#torch.nn.PReLU): the parametric version of the LeakyReLU, where the multiplying factor is also learned by the model;
- [RReLU](https://pytorch.org/docs/stable/generated/torch.nn.RReLU.html#torch.nn.RReLU): randomized leaky rectified linear unit;
- [SELU](https://pytorch.org/docs/stable/generated/torch.nn.SELU.html#torch.nn.SELU): scaled exponential linear unit;
- [CELU](https://pytorch.org/docs/stable/generated/torch.nn.CELU.html#torch.nn.CELU): continuously differentiable exponential linear unit;
- [GELU](https://pytorch.org/docs/stable/generated/torch.nn.GELU.html#torch.nn.GELU): Gaussian error linear unit;
- [SiLU](https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html#torch.nn.SiLU): sigmoid linear unit;
- [ELU](https://pytorch.org/docs/stable/generated/torch.nn.ELU.html#torch.nn.ELU): exponential linear unit.

And more, less-known, functions as well. For a full list of activation functions, check the [Non-linear Activations](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity) section of PyTorch's documentation. Moreover, non-linearities are also available in [functional](https://pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions) form.
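
To get a feel for how a few members of the ReLU family differ, the sketch below applies three of them to the same handful of values. The input values are arbitrary, and `negative_slope=0.01` is spelled out even though it is LeakyReLU's default.

```python
import torch
import torch.nn as nn

values = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])

# ReLU: negative values become zero, positive values are preserved
relu_out = nn.ReLU()(values)
# ReLU6: same as ReLU, but the output is capped at six
relu6_out = nn.ReLU6()(values)
# LeakyReLU: negative values are multiplied by a small factor instead of zeroed
leaky_out = nn.LeakyReLU(negative_slope=0.01)(values)

print(relu_out)
print(relu6_out)
print(leaky_out)
```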

For a more detailed explanation of the inner workings of activation functions, check my blog post ["Hyper-parameters in Action - Part I: Activation Functions."](https://towardsdatascience.com/hyper-parameters-in-action-a524bf5bf1c)
***

#### 3.3.9.3 Methods

A custom model class must implement a couple of methods:
- `__init__(self)`
- `forward(self, x)`

In the constructor method, you will define the parts that make up your model, like linear layers and embeddings, as class attributes. Don't forget to include a call to `super().__init__()` at the top of the method so it executes the code from the parent class before your own. In our case, the model will receive the following arguments:

- `n_cont`: the number of continuous attributes
- `cat_list`: a list of lists of unique values of categorical attributes (as returned by the `categories_` property of the `OrdinalEncoder`)
- `emb_dim`: the number of dimensions of each embedding (we're keeping them the same for every categorical attribute for simplicity)

The `forward()` method is where the magic happens, as you know. It receives an input `x`, which can be anything (e.g. a tensor, a tuple, a dictionary), and forwards this input through your model's components, such as layers, activation functions, and embeddings. In the end, it should return a prediction. The diagram below illustrates the flow of the inputs through the model's components in the forward pass. Please refer to it for its implementation.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch3/lab1_model.png)

In [None]:
import torch.nn.functional as F

class FFN(nn.Module):
    def __init__(self, n_cont, cat_list, emb_dim):
        super().__init__()
        
        # Embedding layers
        embedding_layers = []
        # Creates one embedding layer for each categorical feature
        
        # write your code here
        ...
        
        self.emb_layers = nn.ModuleList(embedding_layers)

        # Total number of embedding dimensions
        self.n_emb = len(cat_list) * emb_dim
        self.n_cont = n_cont

        # Linear Layer(s)
        lin_layers = []
        # The input layer takes as many inputs as the number of continuous features
        # plus the total number of concatenated embeddings
        # The number of outputs is your own choice
        # Optionally, add more hidden layers, don't forget to match the dimensions if you do

        # write your code here
        ...
        
        self.lin_layers = nn.ModuleList(lin_layers)
        
        # The output layer must have as many inputs as there were outputs in the last hidden layer
        self.output_layer = ...

        # Layer initialization
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data, nonlinearity='relu')
        nn.init.kaiming_normal_(self.output_layer.weight.data, nonlinearity='relu')

    def forward(self, inputs):
        # The inputs are the features as returned in the first element of a tuple
        # coming from the dataset/dataloader
        # Make sure you split it into continuous and categorical attributes according
        # to your dataset implementation of __getitem__
        cont_data, cat_data = inputs
        
        # Retrieve embeddings for each categorical attribute and concatenate them
        embeddings = []
        
        # write your code here
        ...
        
        embeddings = torch.cat(embeddings, 1)
        
        # Concatenate all features together, continuous and embeddings
        x = ...
        
        # Run the inputs through each layer and apply an activation function to each output
        for layer in self.lin_layers:
            # write your code here
            ...
            
        # Run the output of the last linear layer through the output layer
        # write your code here
        ...
        
        # Return the prediction
        return ...

Perhaps you noticed something unexpected in the constructor method of the model above: a couple of calls to something named `kaiming_normal_()`, which handles layer initialization. We always start with random weights whenever we instantiate a new model, but we can tweak their initial distribution a little bit so the model is less likely to end up in a vanishing gradients situation, that is, to stop learning altogether. Kaiming (also known as He) initialization is the prescribed scheme to be used with ReLU activation functions. Initialization schemes may be relevant for training models from scratch, but since we'll be mostly using pretrained models, we won't be going into further details here. For a thorough explanation of the inner workings of different schemes, please check my blog post ["Hyper-parameters in Action! Part II — Weight Initializers."](https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404)
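
As a small illustration of what the initializer does, the sketch below draws the weights of a large linear layer and compares their standard deviation to the value targeted by Kaiming initialization with its default `mode='fan_in'`, that is, $\sqrt{2 / \text{fan\_in}}$ for ReLU. The layer size is an arbitrary choice, made large so the estimate is stable.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(42)
layer = nn.Linear(1000, 1000)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# With the default mode='fan_in', the target standard deviation is sqrt(2 / fan_in)
expected_std = math.sqrt(2.0 / layer.in_features)
actual_std = layer.weight.std().item()

print(expected_std, actual_std)
```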

### 3.3.10 Training

Now it is time to write your own training loop. 

First, you need to instantiate your model, create an optimizer for its parameters, and choose the appropriate loss function for the task. Use the data loaders to iterate through your training and validation data. If your features are a more-complex type (e.g. tuples or dictionaries), don't forget to send each one of their components to the appropriate device. Remember that models have two modes, training and evaluation, so set them accordingly. Optionally, you can also implement early stopping.
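
Sending a tuple of features to a device could look like the sketch below. It assumes the features come as a `(continuous, categorical)` tuple of tensors; the mini-batch itself is made up, and a dictionary of features would need a dictionary comprehension instead.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Hypothetical mini-batch: features are a (continuous, categorical) tuple
batch_features = (torch.randn(4, 3), torch.zeros(4, 2, dtype=torch.long))
batch_targets = torch.randn(4, 1)

# A tuple has no .to() method, so each tensor inside it is sent individually
batch_features = tuple(t.to(device) for t in batch_features)
batch_targets = batch_targets.to(device)

print([t.device for t in batch_features], batch_targets.device)
```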

In [None]:
n_cont = scaler.n_features_in_
cat_list = encoder.categories_

n_cont, cat_list

In [None]:
torch.manual_seed(42)
model = ...
lr = 1e-2
optimizer = ...
loss_fn = ...

In [None]:
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_epochs = 100

losses = torch.empty(n_epochs)
val_losses = torch.empty(n_epochs)

best_loss = torch.inf
best_epoch = -1
patience = 3

model.to(device)

progress_bar = tqdm(range(n_epochs))

for epoch in progress_bar:
    batch_losses = torch.empty(len(dataloaders['train']))
    
    ## Training
    for i, (batch_features, batch_targets) in enumerate(dataloaders['train']):
        # Set the model to training mode
        # write your code here
        ...
        
        # Send batch features and targets to the device
        # write your code here
        ...
        
        # Step 1 - forward pass
        predictions = ...

        # Step 2 - computing the loss
        loss = ...

        # Step 3 - computing the gradients
        # Tip: it requires a single method call to backpropagate gradients
        # write your code here
        ...
        
        batch_losses[i] = loss.item()

        # Step 4 - updating parameters and zeroing gradients
        # Tip: it takes two calls to optimizer's methods
        # write your code here
        ...
        
    losses[epoch] = batch_losses.mean()

    ## Validation   
    with torch.inference_mode():
        batch_losses = torch.empty(len(dataloaders['val']))    

        for i, (val_features, val_targets) in enumerate(dataloaders['val']):
            # Set the model to evaluation mode
            # write your code here
            ...

            # Send batch features and targets to the device
            # write your code here
            ...

            # Step 1 - forward pass
            predictions = ...

            # Step 2 - computing the loss
            loss = ...
            
            batch_losses[i] = loss.item()

        val_losses[epoch] = batch_losses.mean()
        
        if val_losses[epoch] < best_loss:
            best_loss = val_losses[epoch]
            best_epoch = epoch
            save_checkpoint(model, optimizer, "best_model.pth")
        elif (epoch - best_epoch) > patience:
            print(f"Early stopping at epoch #{epoch}")
            break
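
The early-stopping condition used above can be traced in isolation with a made-up sequence of validation losses. The numbers are hypothetical; the logic mirrors the `if`/`elif` branches of the loop, minus the checkpointing.

```python
# Made-up validation losses: they improve until epoch 2, then stall
fake_val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.9, 0.95]

best_loss, best_epoch, patience = float('inf'), -1, 3
stopped_at = None

for epoch, val_loss in enumerate(fake_val_losses):
    if val_loss < best_loss:
        # New best: remember it (the real loop also saves a checkpoint here)
        best_loss, best_epoch = val_loss, epoch
    elif (epoch - best_epoch) > patience:
        # More than `patience` epochs have passed since the last improvement
        stopped_at = epoch
        break

print(best_epoch, best_loss, stopped_at)
```

Because the comparison is strict (`>`), the loop tolerates `patience` epochs without improvement and only stops on the following one.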

Let's check the evolution of the losses:

In [None]:
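import matplotlib.pyplot as plt

# The slice goes up to epoch + 1 so the last completed epoch is included
plt.plot(losses[:epoch + 1], label='Training')
plt.plot(val_losses[:epoch + 1], label='Validation')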
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.yscale('log')
plt.legend()

Then, let's compare predicted and actual values in the validation set. Hopefully, it will be much better than our former linear regression.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
split = 'val'
batch = list(datasets[split][:][0])
batch[0] = batch[0].to(device)
batch[1] = batch[1].to(device)
ax.scatter(datasets[split][:][1].tolist(), model(batch).tolist(), alpha=.5)
#ax.scatter(y_true, y_hat)
ax.plot([0, 45], [0, 45], linestyle='--', c='k', linewidth=1)
ax.set_xlabel('Actual')
ax.set_xlim([0, 45])
ax.set_ylabel('Predicted')
ax.set_ylim([0, 45])
ax.set_title('MPG')