<a href="https://colab.research.google.com/github/frank-895/machine_learning_journey/blob/main/linear_to_NN_to_DL/lin_to_NN_to_DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [52]:
import torch, numpy as np, pandas as pd

# Creating a Linear Model, Neural Net and Deep Learning Model from Scratch using Tabular Data

## Introduction

In this notebook, I will be demonstrating my learning in creating neural networks (NNs) from scratch. I will be building on my knowledge by going through 3 distinct steps:

1. Build **linear model** from scratch.
2. Build simple **NN** from scratch.
3. Build a **deep learning** (DL) model from scratch.

While a similar task was previously completed for image classification using the MNIST dataset, this notebook will focus on the Titanic dataset, aiming to build a model that can predict the chance of survival.

## Data Extraction and Cleaning

The data is contained in a csv file which we can open with Pandas.

In [53]:
df = pd.read_csv('train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


We can see that some of the columns contain NaN values, which we will be unable to multiply by coefficients.

Let's replace the missing values with something - normally the mode is a good place to start.

In [54]:
modes = df.mode().iloc[0] # we use .iloc to take first row, as it will return more than 1 row in case of a tie.
df.fillna(modes, inplace=True)
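
To see why `.iloc[0]` is needed, here is a minimal sketch on a made-up frame (the `toy` data is purely illustrative and not part of the Titanic set). When two values tie for most frequent, `mode()` returns one row per tied value, so we keep only the first row before filling.

```python
import pandas as pd

# Hypothetical toy frame: column "a" has a two-way tie (1 and 2) for most frequent value.
toy = pd.DataFrame({"a": [1, 1, 2, 2, None], "b": ["x", "y", "y", None, "y"]})

all_modes = toy.mode()            # one row per tied mode, padded with NaN where needed
first_modes = all_modes.iloc[0]   # keep a single value per column
filled = toy.fillna(first_modes)  # each NaN is replaced by that column's mode

print(all_modes)
print(filled)
```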

Now we need to make sure that our data will be appropriate to feed through a model. A good place to start is with `.describe()` to see a summary, selecting numeric columns only to start with.

In [55]:
df.describe(include=(np.number))

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,28.56697,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,13.199572,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,22.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,24.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,35.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


We need to do a bit of feature engineering to make our data fit for purpose.

**Fare** has many values between 0 and 30 but a few very large ones too (the maximum is about 512). A long tail like this lets a handful of rows dominate the model, so we take the `log` to bring all values into a sensible range. This is generally a good technique for continuous variables involving *money* or *population*.

In [56]:
df['LogFare'] = np.log(df['Fare'] + 1) # we add 1 to avoid log(0)!
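
As a quick illustration of what the log does to a long-tailed column, the sketch below uses a few made-up fare values (not the real column) and compares how far the maximum sits from the median before and after the transform. `np.log1p(x)` computes the same thing as `np.log(x + 1)`.

```python
import numpy as np

# Illustrative values only, chosen to mimic a long right tail.
fares = np.array([0.0, 7.25, 14.45, 31.0, 512.33])

log_fares = np.log1p(fares)  # equivalent to np.log(fares + 1)

# How many times larger is the maximum than the median?
ratio_raw = fares.max() / np.median(fares)
ratio_log = log_fares.max() / np.median(log_fares)

print(f"raw: {ratio_raw:.1f}x, log: {ratio_log:.1f}x")
```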

Clearly, we still have some issues - notably, the strings cannot be multiplied by coefficients! Let's replace them with numbers.

Pandas allows us to create **dummy variables**: new columns that contain a `1` where the original column holds a particular value and `0` otherwise. This is very easy using the `get_dummies` function. By default it produces n columns for n categories, even though technically we only need n-1 columns because the final column can be derived from the others. Keeping all n is useful here: every row has a `1` in exactly one dummy column of each group, so those coefficients can absorb the baseline and we do not need to add a separate constant (intercept) term.

We will do this for all categorical variables, even `Pclass`, because as shown below, this has 3 distinct values.

In [57]:
# only 3 distinct values in Pclass
pclasses = sorted(df.Pclass.unique())
pclasses

[1, 2, 3]

In [58]:
categorical = ['Pclass', 'Sex', 'Embarked']
df = pd.get_dummies(df, columns=categorical)
# get_dummies will automatically remove the original columns.
df.head()

Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,LogFare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,B96 B98,2.110213,False,False,True,False,True,False,False,True
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",38.0,1,0,PC 17599,71.2833,C85,4.280593,True,False,False,True,False,True,False,False
2,3,1,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,B96 B98,2.188856,False,False,True,True,False,False,False,True
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,3.990834,True,False,False,True,False,False,False,True
4,5,0,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,B96 B98,2.202765,False,False,True,False,True,False,False,True
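
The n versus n-1 point can be checked on a tiny made-up column (the `toy` frame is an assumption for illustration). With all n dummy columns, each row sums to exactly 1, which is why the group can stand in for an intercept. `drop_first=True` gives the n-1 version instead.

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})  # illustrative rows

# Default behaviour: one column per category (n columns).
dummies = pd.get_dummies(toy, columns=["Embarked"])

# Exactly one dummy is set in every row, so the row sums are all 1.
row_sums = dummies.astype(int).sum(axis=1)

# n-1 encoding: the dropped category is implied when all others are 0.
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(dummies.columns.tolist(), reduced.columns.tolist())
```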


Now, we have engineered both the independent and dependent variables we will use in the model. We will discard some of the remaining columns for the purpose of this notebook (which is focusing on creating a NN from scratch rather than effective data transformation).

*However*, it is worth noting that a lot can be done with the remaining columns. One of the top-scoring Kaggle notebooks actually used *only* the name column to predict chance of survival!

In [59]:
df.drop(columns=["Fare", "PassengerId", "Name", "Ticket", "Cabin"], inplace=True)

df.head()

Unnamed: 0,Survived,Age,SibSp,Parch,LogFare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,1,0,2.110213,False,False,True,False,True,False,False,True
1,1,38.0,1,0,4.280593,True,False,False,True,False,True,False,False
2,1,26.0,0,0,2.188856,False,False,True,True,False,False,False,True
3,1,35.0,1,0,3.990834,True,False,False,True,False,False,False,True
4,0,35.0,0,0,2.202765,False,False,True,False,True,False,False,True


The final step is to **normalize** our data so that all the columns contain numbers from `0` to `1`. We do this by dividing each column by its maximum value, which works here because no column contains negative values. This will prevent the model from being dominated by columns with larger values, such as *age*.

It is worth noting that I am practicing using Pandas here for data manipulation, but in practice (particularly if working with more data) this would be much more efficient to perform in PyTorch, as we could use **broadcasting** to rapidly perform divisions. However, this allows me to have all the data engineering together so I can focus on creating the models.

Also note, we are converting our boolean columns to floats, as this will enable matrix multiplication using PyTorch later.

In [60]:
# PyTorch expects floats to perform matrix multiplications
indep_cols = df.columns[df.columns != "Survived"]
df[indep_cols] = df[indep_cols].astype(float)

In [61]:
for col in df.columns:
  df[col] = df[col]/df[col].max()
df.head()

Unnamed: 0,Survived,Age,SibSp,Parch,LogFare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0.0,0.275,0.125,0.0,0.338125,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,1.0,0.475,0.125,0.0,0.685892,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,1.0,0.325,0.0,0.0,0.350727,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
3,1.0,0.4375,0.125,0.0,0.639463,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,0.0,0.4375,0.0,0.0,0.352955,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
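
For comparison, here is a rough sketch of the broadcasting approach mentioned above, using a small made-up tensor rather than the real data. Dividing a `[rows, cols]` tensor by a `[cols]` tensor of column maxima applies the division to every row at once, with no Python loop.

```python
import torch

# Illustrative values: 3 rows, 3 columns (think age, siblings, fare).
t = torch.tensor([[22.0, 1.0, 7.25],
                  [38.0, 1.0, 71.28],
                  [80.0, 8.0, 512.33]])

col_max = t.max(dim=0).values  # shape [3]: the maximum of each column
normalized = t / col_max       # [3, 3] / [3] broadcasts across the rows

print(normalized)
```

Like the Pandas loop, this assumes every column is non-negative with a non-zero maximum.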


Now, we can start building models!

## Linear Model

### Compile Data

We will be using PyTorch to build our model, as PyTorch tensors can make use of the GPU to make ultra-fast calculations.

We start by turning our independent variables (predictors) and dependent variables (targets) into tensors.

In [137]:
from torch import tensor

dep = tensor(df["Survived"])
indep = tensor(df[indep_cols].values)

indep.shape, dep.shape

(torch.Size([891, 12]), torch.Size([891]))

At this point, it is also important to split the data into **training** and **validation** sets, a topic I have covered in depth previously.

We can use fastai's `RandomSplitter` for this task, as it returns two non-overlapping random selections of row indices: one for the training set and one for the validation set.

In [138]:
from fastai.data.transforms import RandomSplitter
train, val = RandomSplitter(seed=42)(df)
train, val

((#713) [788,525,821,253,374,98,215,313,281,305,701,812,76,50,387,47,516,564,434,117...],
 (#178) [303,778,531,385,134,476,691,443,386,128,579,65,869,359,202,187,456,880,705,797...])
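
If fastai is not available, a similar split can be sketched in plain PyTorch. The helper name `random_split_idxs` and its defaults are my own assumptions for illustration. It will not reproduce the exact indices shown above, only the same idea of shuffling the row numbers and cutting off 20% for validation.

```python
import torch

def random_split_idxs(n, valid_pct=0.2, seed=42):
    """Hypothetical helper: shuffle row indices and cut off a validation share."""
    g = torch.Generator().manual_seed(seed)  # local generator, so the split is repeatable
    perm = torch.randperm(n, generator=g)    # random ordering of 0..n-1
    cut = int(valid_pct * n)                 # number of validation rows
    return perm[cut:], perm[:cut]            # (training idxs, validation idxs)

trn_idx, val_idx = random_split_idxs(891)
print(len(trn_idx), len(val_idx))
```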

In [139]:
trn_dep, val_dep = dep[train], dep[val]
trn_indep, val_indep = indep[train], indep[val]

trn_indep.shape, val_indep.shape, trn_dep.shape, val_dep.shape

(torch.Size([713, 12]),
 torch.Size([178, 12]),
 torch.Size([713]),
 torch.Size([178]))

Finally, we need to turn our dependent variable into a column vector (rank-2 tensor) which we can do by indexing the column dimension (which doesn't currently exist) with the special value `None` which tells PyTorch to add a new dimension here.

We are doing this as we will be using matrix multiplication and the predictions will be returned as a rank-2 column vector.

In [140]:
trn_dep = trn_dep[:,None]
val_dep = val_dep[:,None]
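
The shape matters more than it might seem. In this small sketch with made-up numbers, subtracting a rank-1 target from rank-2 predictions does not raise an error. Instead it silently broadcasts to a square matrix, which would give a meaningless loss.

```python
import torch

v = torch.tensor([0.0, 1.0, 1.0])              # rank-1 targets, shape [3]
col = v[:, None]                               # rank-2 column vector, shape [3, 1]

preds = torch.tensor([[0.2], [0.9], [0.6]])    # predictions, shape [3, 1]

bad = preds - v     # [3, 1] - [3] broadcasts to [3, 3]
good = preds - col  # [3, 1] - [3, 1] stays [3, 1]

print(col.shape, bad.shape, good.shape)
```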

### Initialise coefficients

Now that we have our variables, we need to generate our (initially) random coefficients.

We need one coefficient for each independent variable, and we will start with small random numbers in the range [0, 0.1): `torch.rand` samples uniformly from [0, 1), and we scale the result by `0.1`.

When we perform matrix multiplication between the coefficients and independent variables, we need `coeffs` to be a rank-2 column vector, hence, we add an extra dimension through the second argument of `torch.rand()`.

In [151]:
def init_coeffs(n_coeff):
  return (torch.rand(n_coeff, 1, dtype=torch.float64) * 0.1).requires_grad_(True)
  # I use dtype as I need to make sure the coeffs are the same data type as the independent variables.

### Calculating Predictions

Our predictions will be calculated by multiplying each row by the coefficients and adding them up.

The function `calc_preds` uses **broadcasting** to multiply each row of independent variables by the coefficients. The products in each row are then summed, and that sum is the prediction for that row.

But what about the `sigmoid` function? It squashes any real number into the range `0` to `1`, which matches our targets (`0` means died and `1` means survived). Without it, the linear output could fall well outside this range, which makes the loss harder to optimise. Passing the output through a sigmoid is a standard choice whenever we perform binary classification like this.
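
A quick check of the sigmoid on a few hand-picked inputs shows the squashing behaviour. The manual formula, 1 / (1 + e^(-z)), is included only to confirm what `torch.sigmoid` computes.

```python
import torch

z = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])  # arbitrary raw model outputs

p = torch.sigmoid(z)
manual = 1 / (1 + torch.exp(-z))  # the sigmoid written out by hand

print(p)  # every value lies strictly between 0 and 1, and sigmoid(0) is 0.5
```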



In [142]:
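def calc_preds(coeffs, indeps):
  # coeffs is a column vector [n, 1]; transpose it to [1, n] so it broadcasts across each row of indeps [rows, n]
  return torch.sigmoid((indeps*coeffs.T).sum(dim=1, keepdim=True))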

This function looks great! However, we can actually improve it further!

Multiplying elements together, then adding across rows is identical to doing a matrix-vector product. We can use the Python `@` operator to perform PyTorch optimised matrix products.

In [143]:
def calc_preds(coeffs, indeps):
  return torch.sigmoid(indeps@coeffs)
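
To convince myself that the two versions agree, here is a small sketch with random stand-in tensors (the sizes are arbitrary). Note that with a column vector of coefficients, the element-wise version needs the transpose so that the shapes line up for broadcasting.

```python
import torch

torch.manual_seed(0)
x = torch.rand(5, 3, dtype=torch.float64)  # 5 rows, 3 predictors
w = torch.rand(3, 1, dtype=torch.float64)  # column vector of coefficients

elementwise = (x * w.T).sum(dim=1, keepdim=True)  # multiply, then add across each row
matmul = x @ w                                    # matrix product

print(torch.allclose(elementwise, matmul))
```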

### Calculating Loss

Once the predictions are made (initially on random coefficients) for each row of independent variables, we need to calculate the **loss**. The loss will allow us to update the coefficients (we will now refer to them as **parameters**).

`calc_loss` calls the `calc_preds` function and uses the **mean absolute error** for loss.

In [144]:
def calc_loss(coeffs, indeps, deps):
  return torch.abs(calc_preds(coeffs, indeps) - deps).mean()
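
As a worked example of mean absolute error with made-up numbers: the three absolute errors below are 0.1, 0.2 and 0.6, so the loss is their average, 0.3.

```python
import torch

preds = torch.tensor([[0.9], [0.2], [0.6]])    # illustrative predictions
targets = torch.tensor([[1.0], [0.0], [0.0]])  # illustrative targets

mae = torch.abs(preds - targets).mean()  # (0.1 + 0.2 + 0.6) / 3

print(f"{mae:.3f}")
```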

### Update Parameters

At this point, we have made predictions and calculated the loss. Now we need to use the loss and **stochastic gradient descent** (SGD) to update the parameters.

`update_coeffs` uses the gradient of each coefficient (calculated by PyTorch because we called `requires_grad_` when initialising the parameters) to determine how to adjust it. The size of each step is scaled by the **learning rate**, an important hyperparameter when designing a model.

We then zero the gradients, because by default PyTorch adds new gradients to the existing ones every time `backward()` is called.

In [145]:
def update_coeffs(coeffs, lr):
  coeffs.sub_(coeffs.grad * lr)
  coeffs.grad.zero_()
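
The accumulation behaviour is easy to see on a single made-up parameter. The gradient of `3 * w` with respect to `w` is always 3, yet a second `backward()` without zeroing leaves 6 in `w.grad`.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()        # gradient after one backward pass

(3 * w).sum().backward()
accumulated = w.grad.clone()  # second pass was added on top of the first

w.grad.zero_()                # reset, as update_coeffs does
(3 * w).sum().backward()
after_zero = w.grad.clone()   # back to a single pass worth of gradient

print(first, accumulated, after_zero)
```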

### Training the Model

Now we have all of the important functions to:

1. Initialise Coefficients
2. Calculate Predictions
3. Calculate Loss
4. Update Parameters

We will put this all together using 2 functions.

The first, called `epoch`, represents a single **epoch**, which is a full pass through all of the training data, and uses our previously defined functions.

The second function called `train_model` will initialise the coefficients then call `epoch` for each epoch we want to perform.  

In [146]:
def epoch(coeffs, lr):
  loss = calc_loss(coeffs, trn_indep, trn_dep)
  loss.backward()
  with torch.no_grad():
    update_coeffs(coeffs, lr)
  print(f"{loss:.3f}")

In [147]:
def train_model(epochs, lr):
  torch.manual_seed(442)
  coeffs = init_coeffs(indep.shape[1])
  for i in range(epochs):
    epoch(coeffs, lr=lr)
  return coeffs

In [148]:
coeffs = train_model(15, 100)

0.515
0.323
0.288
0.204
0.200
0.198
0.197
0.197
0.196
0.196
0.196
0.195
0.195
0.195
0.195


### Analyse Coefficients

We can write a quick function to see the coefficient for each independent variable. This will give us an idea of how our model works.

For example, by looking at the coefficients we can see that a higher *age* pushes the prediction towards death, while a higher *class* (as in 1 or 2) is a strong predictor of survival.

In [149]:
def show_coeffs():
  return dict(zip(indep_cols, coeffs.requires_grad_(False)))
show_coeffs()

{'Age': tensor([-1.1208], dtype=torch.float64),
 'SibSp': tensor([-0.8208], dtype=torch.float64),
 'Parch': tensor([-0.3355], dtype=torch.float64),
 'LogFare': tensor([0.4991], dtype=torch.float64),
 'Pclass_1': tensor([3.3172], dtype=torch.float64),
 'Pclass_2': tensor([1.2995], dtype=torch.float64),
 'Pclass_3': tensor([-6.3529], dtype=torch.float64),
 'Sex_female': tensor([8.2192], dtype=torch.float64),
 'Sex_male': tensor([-10.0180], dtype=torch.float64),
 'Embarked_C': tensor([1.2405], dtype=torch.float64),
 'Embarked_Q': tensor([1.4383], dtype=torch.float64),
 'Embarked_S': tensor([-4.3675], dtype=torch.float64)}

### Calculating Metrics

Now we just need a **metric** to determine the quality of the model. We can see that the loss is going down, but loss is not suitable for evaluating how **accurate** the model is.

We will define a prediction of *death* as any value < 0.5 and a prediction of *life* as any value >= 0.5.

We have our coefficients, but we haven't yet used our validation set to determine how effective the model is. We will use the validation set to calculate accuracy.

In [150]:
def accuracy(coeffs):
  # compute predictions inside the function so it uses the coeffs that are passed in
  preds = calc_preds(coeffs, val_indep)
  return (val_dep.bool()==(preds>=0.5)).float().mean()
accuracy(coeffs)

tensor(0.8258)
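
To make the thresholding concrete, here is the same calculation on four made-up predictions. Three of the four thresholded predictions match their targets, so the accuracy is 0.75.

```python
import torch

preds = torch.tensor([[0.8], [0.4], [0.5], [0.1]])    # illustrative model outputs
targets = torch.tensor([[1.0], [1.0], [1.0], [0.0]])  # illustrative true labels

predicted_survived = preds >= 0.5  # True means the model predicts survival
acc = (targets.bool() == predicted_survived).float().mean()

print(acc)
```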

### Summary

We've now built a linear model that is performing very well! This model is not yet a NN but **it is the basis for creating a layer of a NN**, which we will be working on in the next section!

## Neural Network