## 3.4 Lab 2 / Case 2: Price Prediction

In this lab, we'll use a different dataset: the [100,000 UK Used Car Data set](https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes) from Kaggle. It contains scraped data of used car listings split into CSV files according to the manufacturer: Audi, BMW, Ford, Hyundai, Mercedes, Skoda, Toyota, Vauxhall, and VW. It also contains a few extra files of particular models (`cclass.csv`, `focus.csv`, `unclean_cclass.csv`, and `unclean_focus.csv`) that we won't be using.

Each file has nine columns with the car's attributes: model, year, price, transmission, mileage, fuel type, road tax, fuel consumption (mpg), and engine size. Model, transmission, and fuel type are categorical attributes; year is discrete, but we'll treat it as continuous later on, along with the remaining attributes. Our goal here is to predict the car's price based on its other attributes.

We'll start by building a datapipe that reads all the information from the CSV files, and then we'll use this datapipe as a drop-in replacement for the dataset we typically use with data loaders. Using datapipes in their functional form will illustrate some of the challenges of dealing with real-world data.

To download the dataset, you'll need to create a Kaggle account. In the following sections, we're assuming the dataset was downloaded and unzipped to a local folder named `car_prices`.

In [None]:
#!wget https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip
#!unzip car_prices.zip -d car_prices

### 3.4.1 DataPipes

There are many different classes of datapipes available in PyTorch. They are highly configurable, but we're sticking to the basics to illustrate how they work. The recommendation from PyTorch's team is to use their functional form and chain several operations in a sequence. We'll be doing exactly that, but we'll also be inspecting the results at the end of each operation to more easily understand what's happening under the hood.

Our goal is to build a datapipe that produces a dictionary with three keys in it: `label` (containing the prices we want to predict), `cont_X` (an array of the continuous attributes), and `cat_X` (an array of sequentially-encoded categorical attributes).

Let's start with a `FileLister` datapipe which, as its name says, lists all files inside a given folder:

In [None]:
import torchdata.datapipes as dp

datapipe = dp.iter.FileLister('./car_prices')

Unlike datasets, datapipes do not behave like lists: they do not implement the `__getitem__()` method. So, let's create a temporary data loader in order to retrieve a couple of elements from it:

In [None]:
from torch.utils.data import DataLoader
next(iter(DataLoader(dataset=datapipe, batch_size=16)))

As expected, it listed every file (up to 16) inside that folder. But we don't actually want to use all of these files, so we need to filter some of them out. Following the functional approach, we can call the `filter()` method of the datapipe we just created using a function that returns `True` only if the filename matches our filter conditions:

In [None]:
def filter_for_data(filename):
    return ("unclean" not in filename) and ("focus" not in filename) and ("cclass" not in filename) and filename.endswith(".csv")

datapipe = datapipe.filter(filter_fn=filter_for_data)

In [None]:
next(iter(DataLoader(dataset=datapipe, batch_size=16)))

Great, now we only have nine filenames, one for each manufacturer.
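The predicate can also be checked on its own, without touching the file system. The snippet below is a minimal sketch: it repeats the function so it runs standalone, and the paths in `candidates` are made up for illustration (only `focus.csv` and `unclean_cclass.csv` are names mentioned in the dataset description).

```python
# Same predicate as in the cell above, repeated so this snippet runs on its own
def filter_for_data(filename):
    return (
        "unclean" not in filename
        and "focus" not in filename
        and "cclass" not in filename
        and filename.endswith(".csv")
    )

# Hypothetical paths, for illustration only
candidates = [
    "./car_prices/audi.csv",
    "./car_prices/focus.csv",
    "./car_prices/unclean_cclass.csv",
    "./car_prices/readme.txt",
]

# Only the manufacturer file survives the filter
kept = [name for name in candidates if filter_for_data(name)]
print(kept)
```

The datapipe's `filter()` method applies the same function lazily to each filename as it is produced.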

#### 3.4.1.1 Loading CSV Files

Next, we need to actually open and parse these CSV files. We're skipping the first line of each file (it contains the headers), and we're setting the parser's `return_path` argument to `True`, so we know which file each row came from:

In [None]:
datapipe = datapipe.open_files(mode='rt')
datapipe = datapipe.parse_csv(delimiter=",", skip_lines=1, return_path=True)

In [None]:
next(iter(DataLoader(dataset=datapipe, batch_size=4)))

Let's take a moment to analyze the output above. It is a list containing two elements:
- a tuple containing four filenames
- a list of nine tuples, each tuple containing four values

Each tuple, both the first one and those inside the inner list, has four elements since we requested a mini-batch of four. The inner list has nine elements because there are nine columns in each CSV file. 

In effect, every element of our datapipe is a tuple `(filepath, features)`, but each feature is in its own tuple: `(filepath, [(f1,),(f2,), ...])`
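The regrouping described above can be mimicked in plain Python. This is only a sketch of the idea, not the data loader's actual collation code, and the filename and cell values are placeholders.

```python
# Four hypothetical datapipe elements: (filepath, list of nine string features)
elements = [
    ("car_prices/audi.csv", [f"r{i}_c{j}" for j in range(9)])
    for i in range(4)
]

# Collating a mini-batch regroups values by position instead of by row:
# first all the file paths, then all the feature lists
paths, rows = zip(*elements)

# The feature lists are regrouped again, column by column
columns = list(zip(*rows))

print(len(paths))       # number of filenames in the mini-batch
print(len(columns))     # number of CSV columns
print(len(columns[0]))  # number of values collected for each column
```

Each entry of `columns` holds the values of a single CSV column for the four rows in the mini-batch, which is why the output shows nine tuples of four values each.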

It would be interesting to have the manufacturer as a feature instead of a filename, right? Let's do that by calling the `map()` method of our datapipe with our own `get_manufacturer()` function. This function works by extracting the name of the manufacturer out of the filename, and appending it to the existing list of features:

In [None]:
import os

def get_manufacturer(content):
    path, data = content
    manuf = os.path.splitext(os.path.basename(path))[0].upper()
    data.extend([manuf])
    return data

datapipe = datapipe.map(get_manufacturer)

In [None]:
next(iter(DataLoader(dataset=datapipe, batch_size=4)))

There are ten features now, as expected.
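To see what the function does to a single element, we can call it directly. The snippet below repeats the function so it runs standalone; the path and the row values are hypothetical.

```python
import os

# Same function as in the cell above, repeated so this snippet runs on its own
def get_manufacturer(content):
    path, data = content
    # "./car_prices/audi.csv" -> "audi.csv" -> "audi" -> "AUDI"
    manuf = os.path.splitext(os.path.basename(path))[0].upper()
    data.extend([manuf])
    return data

# Hypothetical parsed row: nine string values, as read from a CSV file
row = ["A1", "2017", "12500", "Manual", "15735", "Petrol", "150", "55.4", "1.4"]
features = get_manufacturer(("./car_prices/audi.csv", row))
print(features[-1], len(features))
```

Note that the function extends the incoming list in place and drops the file path, so the datapipe's elements are plain lists of features from this point on.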

#### 3.4.1.2 Encoding Categorical Attributes

Now, it is time to encode categorical attributes, just like we did before using Scikit-Learn's `OrdinalEncoder`. There is an issue, though. We haven't even split our data into training, validation, and test sets yet. Moreover, it would be impractical to train an encoder inside a datapipe.

So, let's take a step back and imagine that we had already discussed the problem at length, and we're aware of all valid unique values for each categorical attribute. It makes sense: if we're building an app that estimates car prices, we'll eventually ask the end user to provide the characteristics of their car, most likely using dropdowns so they can choose from a list of predefined values. For example, you shouldn't let the user enter some made-up fuel type; they must choose among "petrol", "diesel", "hybrid", "other", or "electric". Each item in the dropdown corresponds to an integer value, so "petrol" is zero, "diesel" is one, and so on. That's the same as sequentially encoding the fuel type.

Of course, we won't be actually designing the frontend of an app in this lab, so let's just pretend we did it, and cheat a little bit by looking at the whole data first and building dictionaries that perform the encoding described above.

In [None]:
import pandas as pd

colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']
df = pd.DataFrame(list(datapipe), columns=colnames)

In [None]:
N_ROWS = len(df)

In [None]:
def gen_encoder_dict(series):
    values = series.unique()
    return dict(zip(values, range(len(values))))

In [None]:
cont_attr = ['year', 'mileage', 'road_tax', 'mpg', 'engine_size']
cat_attr = ['model', 'transmission', 'fuel_type', 'manufacturer']

dropdown_encoders = {col: gen_encoder_dict(df[col]) for col in cat_attr}

In [None]:
dropdown_encoders['fuel_type']

Datapipes are less flexible than datasets in this sense. They require that you know, beforehand, what your data structure is, what your data actually looks like, and what transformations are needed.
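A small, self-contained sketch makes this constraint concrete. It uses a plain-Python stand-in for `gen_encoder_dict()` (a list instead of a pandas series) and a hypothetical column of fuel types.

```python
def gen_encoder_dict(values):
    # dict.fromkeys keeps the first occurrence of each value, in order
    unique = list(dict.fromkeys(values))
    return dict(zip(unique, range(len(unique))))

# Hypothetical column of fuel types
fuel_types = ["Petrol", "Diesel", "Petrol", "Hybrid", "Diesel"]
encoder = gen_encoder_dict(fuel_types)
print(encoder)

# A value that was never seen has no code, so the lookup fails
try:
    encoder["Hydrogen"]
except KeyError as err:
    print(f"unknown category: {err}")
```

Any category missing from the dictionaries would raise an error in the middle of the pipeline, which is why the valid values have to be known up front.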

So, let's use this knowledge to build a preprocessing function that takes a row as input - containing ten columns of data - and produces the desired dictionary as output.

#### 3.4.1.3 Row Output

In [None]:
import numpy as np

def preproc(row):
    colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']
    
    cat_attr = ['model', 'transmission', 'fuel_type', 'manufacturer']
    cont_attr = ['year', 'mileage', 'road_tax', 'mpg', 'engine_size']
    target = 'price'
    
    vals = dict(zip(colnames, row))
    cont_X = [float(vals[name]) for name in cont_attr]
    cat_X = [dropdown_encoders[name][vals[name]] for name in cat_attr]
            
    return {'label': np.array([float(vals[target])], dtype=np.float32),
            'cont_X': np.array(cont_X, dtype=np.float32), 
            'cat_X': np.array(cat_X, dtype=int)}

We can, once again, use the datapipe's `map()` method to apply the function above to every data point.

In [None]:
datapipe = datapipe.map(preproc)

Let's take a mini-batch of four data points from our datapipe:

In [None]:
next(iter(DataLoader(dataset=datapipe, batch_size=4)))

Nice! We got the desired dictionary back: each key has a tensor with four rows (our mini-batch size), and the categorical attributes are encoded as integers.

At this point, you're probably wondering why we didn't bother at all to standardize/scale the continuous attributes. Don't worry, we'll get back to it in a couple of sections.

#### 3.4.1.4 The Full DataPipe and Splits

Let's piece all the parts together and re-build the full datapipe from top to bottom:

In [None]:
datapipe = dp.iter.FileLister('./car_prices')
datapipe = datapipe.filter(filter_fn=filter_for_data)
datapipe = datapipe.open_files(mode='rt')
datapipe = datapipe.parse_csv(delimiter=",", skip_lines=1, return_path=True)
datapipe = datapipe.map(get_manufacturer)
datapipe = datapipe.map(preproc)

The datapipe is ready, and it works like a "recipe" of all the steps required to load, clean, and preprocess our data. Now, let's call its `random_split()` method to create the training, validation, and test sets. We shouldn't forget to shuffle the training set, though.
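Before doing that, it may help to picture what a seeded split does. The sketch below is a simplified stand-in written in plain Python, not torchdata's actual implementation: the helper `assign_splits()` is hypothetical, and it draws the split of each row independently from a seeded generator.

```python
import random

def assign_splits(n_rows, weights, seed):
    # One seeded generator decides the split of every row, in order
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return [rng.choices(names, weights=probs)[0] for _ in range(n_rows)]

weights = {"train": 0.8, "val": 0.1, "test": 0.1}

# Two independent passes with the same seed agree on every row
first = assign_splits(1000, weights, seed=11)
second = assign_splits(1000, weights, seed=11)
print(first == second)

# So rows picked for training in one pass never show up as validation in another
train_rows = [i for i, name in enumerate(first) if name == "train"]
val_rows = [i for i, name in enumerate(second) if name == "val"]
print(set(train_rows) & set(val_rows))
```

Under this simplified view, using the same seed and weights in every call is what keeps the three splits from overlapping.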

In [None]:
datapipes = {}
datapipes['train'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='train')
datapipes['val'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='val')
datapipes['test'] = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8, "val": 0.1, "test": 0.1}, seed=11, target='test')

datapipes['train'] = datapipes['train'].shuffle(buffer_size=100000)

Datapipes are a drop-in replacement for datasets, so we can simply use data loaders exactly the same way we've been doing so far:

In [None]:
dataloaders = {}
dataloaders['train'] = DataLoader(dataset=datapipes['train'], batch_size=128, drop_last=True, shuffle=True)
dataloaders['val'] = DataLoader(dataset=datapipes['val'], batch_size=128)
dataloaders['test'] = DataLoader(dataset=datapipes['test'], batch_size=128)

### 3.4.2 BatchNorm for Continuous Attributes

As promised a couple of sections ago, let's discuss what to do with the continuous attributes. Just like with the ordinal encoder, it isn't practical to train a `StandardScaler` inside a datapipe. Moreover, how are we supposed to fit a `StandardScaler` on training data only, if the split is performed at the very end of the datapipe?

Luckily, we don't necessarily need to standardize the data using statistics (mean and standard deviation) computed on the whole training set. We can standardize them using running mini-batch statistics instead! That's what batch normalization does. 

Let's see how it works by, first, retrieving a mini-batch of data and computing the statistics of its continuous attributes:

In [None]:
import torch
import torch.nn as nn

batch = next(iter(dataloaders['train']))
batch['cont_X'].mean(axis=0), batch['cont_X'].std(axis=0, unbiased=False)

Now, let's create an instance of a batch norm layer, use our mini-batch as input, and compute statistics on the output:

In [None]:
bn_layer = nn.BatchNorm1d(num_features=len(cont_attr))

normalized_cont = bn_layer(batch['cont_X'])
normalized_cont.mean(axis=0), normalized_cont.std(axis=0, unbiased=False)

There we go! The continuous attributes of our mini-batch were standardized (or normalized, following the technique's name), so they are zero-centered and have unit standard deviation. As it turns out, the batch normalization layer keeps track of running statistics, so after seeing this one mini-batch of data, it will have some statistics of its own already:

In [None]:
bn_layer.state_dict()

When in training mode, the layer keeps updating its running statistics with every mini-batch, so after one epoch it will have seen the whole training set. At this point, its statistics should be very close to those we would get if we had computed them over the whole training set in the first place.

Once the model is switched to evaluation mode, the batch norm layer doesn't update its internal statistics anymore, but it still normalizes new data points using those it learned during training. Batch norm layers, together with dropout layers, are classic examples of layers that behave differently depending on which mode the model is set to.
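The two modes can be sketched in plain Python for a single feature. This is a simplified illustration rather than PyTorch's code: the `RunningNorm` class is hypothetical, it leaves out the learnable scale and shift, and it assumes the layer's default momentum of 0.1 with running statistics starting at zero mean and unit variance.

```python
def batch_stats(values):
    n = len(values)
    mean = sum(values) / n
    biased_var = sum((v - mean) ** 2 for v in values) / n
    unbiased_var = biased_var * n / (n - 1)
    return mean, biased_var, unbiased_var

class RunningNorm:
    """Simplified, single-feature view of batch normalization."""
    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0
        self.training = True

    def __call__(self, values):
        if self.training:
            # Normalize with the mini-batch statistics and update the running ones
            mean, var, unbiased_var = batch_stats(values)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * unbiased_var
        else:
            # Normalize with the statistics collected during training
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in values]

norm = RunningNorm()
mileage = [10000.0, 20000.0, 30000.0, 40000.0]  # hypothetical mini-batch
out = norm(mileage)
print(batch_stats(out)[0])  # mean of the normalized mini-batch, approximately zero
print(norm.running_mean)    # moved 10% of the way from 0 toward the batch mean

# In evaluation mode, the running statistics stay frozen
norm.training = False
frozen = norm.running_mean
norm([50000.0, 60000.0])
print(norm.running_mean == frozen)
```

Because each update only moves the running statistics a fraction of the way toward the current mini-batch, it takes many mini-batches for them to settle.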

All we have to do now is to add one of these layers to normalize the inputs of our model and, optionally, after every hidden layer as well. That may raise a question: should we place the batch normalization before or after the activation function? On a theoretical level, it makes more sense to place it after the activation function, so the outputs are zero-centered. However, successful models such as Inception V3 place it before the activation function. Unfortunately, there's no straight answer to this question; the choice is yours to make.

### 3.4.3 Custom Model

You know the drill: write a custom model class that implements both `__init__()` and `forward()` methods. You can use the model you wrote in Lab 1 as a starting point.

In the constructor method, you will define the parts that make up your model, like linear layers and embeddings, as class attributes. Don't forget to include a call to `super().__init__()` at the top of the method so it executes the code from the parent class before your own. In our case, the model will receive the following arguments:

- `n_cont`: the number of continuous attributes
- `cat_list`: a list of lists of unique values of categorical attributes (as returned by the `categories_` property of the `OrdinalEncoder`)
- `emb_dim`: the number of dimensions of each embedding (we're keeping them the same for every categorical attribute for simplicity)

The `forward()` method is where the magic happens, as you know. It receives an input `x`, which can be anything (e.g. a tensor, a tuple, a dictionary), and forwards this input through your model's components, such as layers, activation functions, and embeddings. In the end, it should return a prediction.

Don't forget your data loader is returning dictionaries now, so you'll need to make adjustments to how your model treats its inputs. Also, don't forget to add a batch normalization layer to preprocess the continuous attributes and, optionally, you can also add batch normalization layers after each hidden linear layer. Please refer to the diagram below for the implementation.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch3/lab2_model.png)

In [None]:
import torch.nn.functional as F

class FFN(nn.Module):
    def __init__(self, n_cont, cat_list, emb_dim):
        super().__init__()
        
        # Embedding layers
        embedding_layers = []
        # Creates one embedding layer for each categorical feature

        # write your code here
        ...
        self.emb_layers = ...

        # Total number of embedding dimensions
        self.n_emb = len(cat_list) * emb_dim
        self.n_cont = n_cont
        # Batch Normalization layer for continuous features
        self.bn_input = nn.BatchNorm1d(n_cont)

        # Linear Layer(s)
        lin_layers = []
        # The input layer takes as many inputs as the number of continuous features
        # plus the total number of concatenated embeddings
        # The number of outputs is your own choice
        # Optionally, add more hidden layers, don't forget to match the dimensions if you do
        
        # write your code here
        ...
        self.lin_layers = ...

        # Batch Normalization Layer(s)
        bn_layers = []
        # Creates batch normalization layers for each linear hidden layer

        # write your code here
        ...
        self.bn_layers = ...
        
        # The output layer must have as many inputs as there were outputs in the last hidden layer
        self.output_layer = ...

        # Layer initialization
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data, nonlinearity='relu')
        nn.init.kaiming_normal_(self.output_layer.weight.data, nonlinearity='relu')

    def forward(self, inputs):
        # The inputs are now a dictionary coming from the data loader
        # Make sure you split it into continuous and categorical attributes
        # using the keys produced by the datapipe ('cont_X' and 'cat_X')
        cont_data, cat_data = ...
        
        # Retrieve embeddings for each categorical attribute and concatenate them
        embeddings = []
        # write your code here
        ...
        
        # Normalizes continuous features using Batch Normalization layer
        normalized_cont_data = self.bn_input(cont_data)
        
        # Concatenate all features together, normalized continuous and embeddings
        x = ...
        
        # Run the inputs through each layer and apply an activation function and batch norm to each output
        for layer, bn_layer in zip(self.lin_layers, self.bn_layers):
            # write your code here
            ...
            
        # Run the output of the last linear layer through the output layer
        ...
        
        # Return the prediction
        return ...

### 3.4.4 Training

Now it is time to write your own training loop. Once again, you need to instantiate your model, create an optimizer for its parameters, and the appropriate loss function for the task. The training loop itself is pretty much the same as in the previous lab, but don't forget your data loaders return dictionaries now, so you'll need to adjust the way your data is being sent to the appropriate device.
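The loop further down also stops early once the validation loss hasn't improved for more than `patience` epochs. That bookkeeping can be tried in isolation, as in the minimal sketch below; the helper `run_with_early_stopping()` and the validation losses are made up for illustration.

```python
def run_with_early_stopping(val_losses, patience=3):
    best_loss, best_epoch = float("inf"), -1
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            # New best: this is where a checkpoint would be saved
            best_loss, best_epoch = val_loss, epoch
        elif (epoch - best_epoch) > patience:
            # Too many epochs since the last improvement
            return epoch, best_epoch, best_loss
    return None, best_epoch, best_loss

# Hypothetical validation losses, one per epoch
val_losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.3, 2.0]
stopped_at, best_epoch, best_loss = run_with_early_stopping(val_losses)
print(stopped_at, best_epoch, best_loss)
```

In this made-up run, training stops before the last epoch is ever reached, so its lower loss is never seen. That's the trade-off of a small patience value.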

In [None]:
n_cont = len(cont_attr)
cat_list = [np.array(list(dropdown_encoders[name].values())) for name in cat_attr]

n_cont, cat_list

In [None]:
torch.manual_seed(42)

lr = 3e-3

model = ...
optimizer = ...
loss_fn = ...

In [None]:
model.state_dict().keys()

In [None]:
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_epochs = 20

losses = torch.empty(n_epochs)
val_losses = torch.empty(n_epochs)

best_loss = torch.inf
best_epoch = -1
patience = 3

model.to(device)

progress_bar = tqdm(range(n_epochs))

for epoch in progress_bar:
    batch_losses = []
    
    ## Training
    for i, batch in enumerate(dataloaders['train']):
        # Set the model to training mode
        # write your code here
        ...
        
        # Send batch features and targets to the device
        # write your code here
        ...
        
        # Step 1 - forward pass
        predictions = ...

        # Step 2 - computing the loss
        loss = ...

        # Step 3 - computing the gradients
        # Tip: it requires a single method call to backpropagate gradients
        # write your code here
        ...

        batch_losses.append(loss.item())

        # Step 4 - updating parameters and zeroing gradients
        # Tip: it takes two calls to optimizer's methods
        # write your code here
        ...
        
    losses[epoch] = torch.tensor(batch_losses).mean()

    ## Validation   
    with torch.inference_mode():
        batch_losses = []

        for i, val_batch in enumerate(dataloaders['val']):
            # Set the model to evaluation mode
            # write your code here
            ...

            # Send batch features and targets to the device
            # write your code here
            ...

            # Step 1 - forward pass
            predictions = ...

            # Step 2 - computing the loss
            loss = ...

            batch_losses.append(loss.item())

        val_losses[epoch] = torch.tensor(batch_losses).mean()
        
        if val_losses[epoch] < best_loss:
            best_loss = val_losses[epoch]
            best_epoch = epoch
            save_checkpoint(model, optimizer, "best_model.pth")
        elif (epoch - best_epoch) > patience:
            print(f"Early stopping at epoch #{epoch}")
            break

Let's check the evolution of the losses:

In [None]:
import matplotlib.pyplot as plt

plt.plot(losses[:epoch], label='Training')
plt.plot(val_losses[:epoch], label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.yscale('log')
plt.legend()

Then, let's compare predicted and actual values in the validation set.

In [None]:
split = 'val'
y_hat = []
y_true = []

# Evaluation mode only needs to be set once, and no gradients are needed here
model.eval()
with torch.inference_mode():
    for batch in dataloaders[split]:
        batch['cont_X'] = batch['cont_X'].to(device)
        batch['cat_X'] = batch['cat_X'].to(device)
        batch['label'] = batch['label'].to(device)
        y_hat.extend(model(batch).tolist())
        y_true.extend(batch['label'].tolist())

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.scatter(y_true, y_hat, alpha=0.25)
ax.plot([0, 80000], [0, 80000], linestyle='--', c='k', linewidth=1)
ax.set_xlabel('Actual')
ax.set_xlim([0, 80000])
ax.set_ylabel('Predicted')
ax.set_ylim([0, 80000])
ax.set_title('Price')

Ideally, you'll see a cloud of points around the diagonal line. What about the R2 score?

In [None]:
from sklearn.metrics import r2_score
r2_score(y_true, y_hat)

If your cloud of points was indeed around the diagonal line, you're probably expecting a high R2 score (>0.8). If you got a surprisingly low value for it, can you guess why?
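As a reminder of what the score measures, it compares the model's squared errors to those of a baseline that always predicts the mean of the actual values. The sketch below computes it by hand in plain Python, using made-up prices; the helper `r2()` is for illustration only and is not Scikit-Learn's implementation.

```python
def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    # Squared errors of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared errors of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

prices = [10000.0, 20000.0, 30000.0]  # hypothetical actual prices

print(r2(prices, prices))                         # perfect predictions
print(r2(prices, [20000.0, 20000.0, 20000.0]))    # always predicting the mean
```

A perfect model scores one, and a model that is no better than predicting the mean scores zero.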