# **DIVE 2 DEEP LEARNING textbook**

Online book : https://d2l.ai/index.html

Pytorch 60 mins tutorial: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

Online book: https://www.deeplearningbook.org

Online book: http://neuralnetworksanddeeplearning.com

In [1]:
pip install -U d2l

Collecting d2l
  Downloading d2l-1.0.3-py3-none-any.whl (111 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m111.7/111.7 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting jupyter==1.0.0 (from d2l)
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting matplotlib==3.7.2 (from d2l)
  Downloading matplotlib-3.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.6/11.6 MB[0m [31m31.0 MB/s[0m eta [36m0:00:00[0m
Collecting pandas==2.0.3 (from d2l)
  Downloading pandas-2.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.3/12.3 MB[0m [31m64.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting scipy==1.10.1 (from d2l)
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [

## Implementation of Multilayer Perceptrons from Scratch


In [2]:
import torch
from torch import nn
from d2l import torch as d2l

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ../data/FashionMNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 26421880/26421880 [00:02<00:00, 11711272.78it/s]


Extracting ../data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ../data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 29515/29515 [00:00<00:00, 210019.04it/s]


Extracting ../data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ../data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 4422102/4422102 [00:01<00:00, 3881992.53it/s]


Extracting ../data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ../data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 5148/5148 [00:00<00:00, 13453132.08it/s]


Extracting ../data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw





### Initializing Model Parameters

Recall that Fashion-MNIST contains 10 classes,
and that each image consists of a $28 \times 28 = 784$
grid of grayscale pixel values.
Again, we will disregard the spatial structure
among the pixels for now,
so we can think of this as simply a classification dataset
with 784 input features and 10 classes.
To begin, we will [**implement an MLP
with one hidden layer and 256 hidden units.**]
Note that we can regard both of these quantities
as hyperparameters.
Typically, we choose layer widths in powers of 2,
which tend to be computationally efficient because
of how memory is allocated and addressed in hardware.

Again, we will represent our parameters with several tensors.
Note that *for every layer*, we must keep track of
one weight matrix and one bias vector.
As always, we allocate memory
for the gradients of the loss with respect to these parameters.

In [3]:
num_inputs, num_outputs, num_hiddens = 784, 10, 256

W1 = nn.Parameter(torch.randn(
    num_inputs, num_hiddens, requires_grad=True) * 0.01)
b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
W2 = nn.Parameter(torch.randn(
    num_hiddens, num_outputs, requires_grad=True) * 0.01)
b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))

params = [W1, b1, W2, b2]

### Activation Function

To make sure we know how everything works,
we will [**implement the ReLU activation**] ourselves
using the maximum function rather than
invoking the built-in `relu` function directly.

In [4]:
def relu(X):
    a = torch.zeros_like(X)
    return torch.max(X, a)

### Model

Because we are disregarding spatial structure,
we `reshape` each two-dimensional image into
a flat vector of length  `num_inputs`.
Finally, we (**implement our model**)
with just a few lines of code.


In [5]:
def net(X):
    X = X.reshape((-1, num_inputs))
    H = relu(X@W1 + b1)  # Here '@' stands for matrix multiplication
    return (H@W2 + b2)

### Loss Function


In [6]:
loss = nn.CrossEntropyLoss(reduction='none')


### Training

Fortunately, [**the training loop for MLPs
is exactly the same as for softmax regression.**]
Leveraging the `d2l` package again,
we call the `train_ch3` function
(see `sec_softmax_scratch`),
setting the number of epochs to 10
and the learning rate to 0.1.

In [8]:
num_epochs, lr = 10, 0.1
updater = torch.optim.SGD(params, lr=lr)
#d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)

In [9]:
# d2l.predict_ch3(net, test_iter)

# Concise Implementation of Multilayer Perceptrons

## Model

We add *two* fully-connected layers
(previously, we added *one*).
The first is [**our hidden layer**],
which (**contains 256 hidden units
and applies the ReLU activation function**).
The second is our output layer.

In [10]:
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Linear(256, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights);

In [12]:
batch_size, lr, num_epochs = 256, 0.1, 10
loss = nn.CrossEntropyLoss(reduction='none')
trainer = torch.optim.SGD(net.parameters(), lr=lr)

train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
# d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

## Exercises

1. Try adding different numbers of hidden layers (you may also modify the learning rate). What setting works best?
1. Try out different activation functions. Which one works best?
1. Try different schemes for initializing the weights. What method works best?

### Task 0:
Read
https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py

### Task 1. MLP for Binary Classification

We will use the Ionosphere binary (two-class) classification dataset to demonstrate an MLP for binary classification.

This dataset involves predicting whether a structure is in the atmosphere or not given radar returns.

The dataset will be downloaded automatically using [Pandas](https://pandas.pydata.org/), but you can learn more about it here.

*  [Ionosphere Dataset (csv)](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv).
* [Ionosphere Dataset Description (csv)](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.names).

We will use a [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to encode the string labels to integer values 0 and 1. The model will be fit on 67 percent of the data, and the remaining 33 percent will be used for evaluation, split using the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.

It is a good practice to use ‘relu‘ activation with a ‘he_normal‘ weight initialization. This combination goes a long way to overcome the problem of vanishing gradients when training deep neural network models. For more on ReLU, see the tutorial:
[A Gentle Introduction to the Rectified Linear Unit (ReLU)](https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/).

The model predicts the probability of class 1 and uses the sigmoid activation function. The model is optimized using the adam version of stochastic gradient descent and seeks to minimize the cross-entropy loss.

Implement this network in Pytorch



In [13]:
# mlp for binary classification
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(235, 34) (116, 34) (235,) (116,)


In [19]:
class MLP(nn.Module):

    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        self.hidden1 = nn.Linear(n_inputs, 10)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(10, 8)
        self.act2 = nn.ReLU()
        self.hidden3 = nn.Linear(8, 1)
        self.act3 = nn.Sigmoid()

    def forward(self, X):
        X = self.hidden1(X)
        X = self.act1(X)
        X = self.hidden2(X)
        X = self.act2(X)
        X = self.hidden3(X)
        X = self.act3(X)
        return X

from torch.utils.data import Dataset, DataLoader
class Dataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y.reshape(-1, 1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

train_dl = DataLoader(Dataset(X_train, y_train), batch_size=32, shuffle=True)
test_dl = DataLoader(Dataset(X_test, y_test), batch_size=1024, shuffle=False)

net = MLP(X_train.shape[1])
print(net)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
for epoch in range(100):

    epoch_loss = 0
    eval_loss = 0
    eval_acc = 0
    for i, (X, y) in enumerate(train_dl):
        optimizer.zero_grad()
        yhat = net(X)
        loss = criterion(yhat, y.float())
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()


    with torch.no_grad():
        for i, (X, y) in enumerate(test_dl):
            yhat = net(X)
            loss = criterion(yhat, y.float())
            eval_loss += loss.item()
            yhat = torch.round(yhat)
            acc = (yhat == y).sum().item()
            eval_acc += acc

    print('epoch: %d, Train loss %.3f, Test loss %.3f Test acc %.3f' % (epoch, epoch_loss, eval_loss, eval_acc/len(y_test)))


MLP(
  (hidden1): Linear(in_features=34, out_features=10, bias=True)
  (act1): ReLU()
  (hidden2): Linear(in_features=10, out_features=8, bias=True)
  (act2): ReLU()
  (hidden3): Linear(in_features=8, out_features=1, bias=True)
  (act3): Sigmoid()
)
epoch: 0, Train loss 5.607, Test loss 0.703 Test acc 0.302
epoch: 1, Train loss 5.564, Test loss 0.696 Test acc 0.302
epoch: 2, Train loss 5.519, Test loss 0.690 Test acc 0.569
epoch: 3, Train loss 5.485, Test loss 0.685 Test acc 0.828
epoch: 4, Train loss 5.458, Test loss 0.680 Test acc 0.828
epoch: 5, Train loss 5.419, Test loss 0.675 Test acc 0.802
epoch: 6, Train loss 5.382, Test loss 0.670 Test acc 0.793
epoch: 7, Train loss 5.338, Test loss 0.664 Test acc 0.767
epoch: 8, Train loss 5.293, Test loss 0.657 Test acc 0.767
epoch: 9, Train loss 5.242, Test loss 0.649 Test acc 0.776
epoch: 10, Train loss 5.207, Test loss 0.641 Test acc 0.776
epoch: 11, Train loss 5.123, Test loss 0.632 Test acc 0.767
epoch: 12, Train loss 5.020, Test loss 0

### Task 2. MLP for Regression

We will use the Boston housing regression dataset to demonstrate an MLP for regression predictive modeling.

This problem involves predicting house value based on properties of the house and neighborhood.

The dataset will be downloaded automatically using Pandas, but you can learn more about it here.

* [Boston Housing Dataset (csv).](https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv)
* [Boston Housing Dataset Description (csv).](https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.names)

This is a regression problem that involves predicting a single numerical value. As such, the output layer has a single node and uses the default or linear activation function (no activation function). The mean squared error (mse) loss is minimized when fitting the model.

Recall that this is a regression, not classification; therefore, we cannot calculate classification accuracy. The complete example of fitting and evaluating an MLP on the Boston housing dataset is listed below.

In [25]:
# mlp for regression
from pandas import read_csv
from sklearn.model_selection import train_test_split
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(path, header=None)

In [26]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [27]:
df.columns

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype='int64')

In [28]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=[0]), df[0], test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(339, 13) (167, 13) (339,) (167,)


In [33]:
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [56]:
# mlp for regression with mse loss function and adam optimizer

# define the network
class MLP(nn.Module):

        def __init__(self, n_inputs):
            super(MLP, self).__init__()
            self.hidden1 = nn.Linear(n_inputs, 10)
            self.act1 = nn.ReLU()
            self.hidden2 = nn.Linear(10, 8)
            self.act2 = nn.ReLU()
            self.hidden3 = nn.Linear(8, 1)

        def forward(self, X):
            X = X.to(torch.float32)
            X = self.hidden1(X)
            X = self.act1(X)
            X = self.hidden2(X)
            X = self.act2(X)
            X = self.hidden3(X)
            return X

# train the model

#create dataloader

from torch.utils.data import Dataset, DataLoader
class Dataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return [self.X.loc[idx].values, self.y.loc[idx]]


train_dl = DataLoader(Dataset(X_train, y_train), batch_size=32, shuffle=True)
test_dl = DataLoader(Dataset(X_test, y_test), batch_size=1024, shuffle=False)

net = MLP(X_train.shape[1])
print(net)
# define the optimization
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
# enumerate epochs
for epoch in range(100):

        epoch_loss = 0
        eval_loss = 0
        # enumerate mini batches
        for i, (X, y) in enumerate(train_dl):
            # clear the gradients
            optimizer.zero_grad()
            # compute the model output
            yhat = net(X)
            # calculate loss
            yhat = yhat.squeeze(1)
            loss = criterion(yhat, y.float())
            # credit assignment
            loss.backward()
            # update model weights
            optimizer.step()
            epoch_loss += loss.item()

        # evaluate the model
        with torch.no_grad():
            # use test_dl
            for i, (X, y) in enumerate(test_dl):
                # make prediction
                yhat = net(X)
                yhat = yhat.squeeze(1)
                # calculate loss
                loss = criterion(yhat, y.float())
                # credit assignment
                eval_loss += loss.item()

        print('epoch: %d, Train loss %.3f, Test loss %.3f' % (epoch, epoch_loss, eval_loss))


MLP(
  (hidden1): Linear(in_features=13, out_features=10, bias=True)
  (act1): ReLU()
  (hidden2): Linear(in_features=10, out_features=8, bias=True)
  (act2): ReLU()
  (hidden3): Linear(in_features=8, out_features=1, bias=True)
)
epoch: 0, Train loss 3801.039, Test loss 115.912
epoch: 1, Train loss 1857.728, Test loss 46.214
epoch: 2, Train loss 1136.745, Test loss 40.455
epoch: 3, Train loss 965.694, Test loss 35.257
epoch: 4, Train loss 850.126, Test loss 25.493
epoch: 5, Train loss 776.548, Test loss 20.477
epoch: 6, Train loss 740.352, Test loss 18.300
epoch: 7, Train loss 705.084, Test loss 18.102
epoch: 8, Train loss 762.705, Test loss 17.494
epoch: 9, Train loss 668.393, Test loss 16.810
epoch: 10, Train loss 657.679, Test loss 17.225
epoch: 11, Train loss 652.680, Test loss 17.351
epoch: 12, Train loss 650.931, Test loss 16.605
epoch: 13, Train loss 647.722, Test loss 16.232
epoch: 14, Train loss 665.205, Test loss 17.190
epoch: 15, Train loss 644.973, Test loss 17.483
epoch: 1