# Task.
Train a neural network on the ESOL dataset using Morgan fingerprints as input features, with an 8/1/1 train/valid/test split. Use RMSE as both the cost function and the performance measure, and report it for all three sets.

In [1]:
import pandas as pd
import numpy as np

In [2]:
from rdkit import Chem
from rdkit.Chem import AllChem

In [3]:
import torch
import torch.nn as nn

from torch.utils.data import random_split
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

### Download data

In [4]:
esol = pd.read_csv("delaney_processed.csv")
esol.head()

| | Compound ID | ESOL predicted log solubility in mols per litre | Minimum Degree | Molecular Weight | Number of H-Bond Donors | Number of Rings | Number of Rotatable Bonds | Polar Surface Area | measured log solubility in mols per litre | smiles |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Amigdalin | -0.974 | 1 | 457.432 | 7 | 3 | 7 | 202.32 | -0.77 | `OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...` |
| 1 | Fenfuram | -2.885 | 1 | 201.225 | 1 | 2 | 2 | 42.24 | -3.3 | `Cc1occc1C(=O)Nc2ccccc2` |
| 2 | citral | -2.579 | 1 | 152.237 | 0 | 0 | 4 | 17.07 | -2.06 | `CC(C)=CCCC(C)=CC(=O)` |
| 3 | Picene | -6.618 | 2 | 278.354 | 0 | 5 | 0 | 0.0 | -7.87 | `c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43` |
| 4 | Thiophene | -2.232 | 2 | 84.143 | 0 | 1 | 0 | 0.0 | -1.33 | `c1ccsc1` |


### Process data

In [5]:
# function to generate canon SMILES
def gen_canon_smiles(smiles_list):
    
    invalid_ids = []
    canon_smiles = []

    for i in range(len(smiles_list)):   
        mol = Chem.MolFromSmiles(smiles_list[i])
        
        # do not append NoneType if invalid
        if mol is None: 
            invalid_ids.append(i)
            continue

        canon_smiles.append(Chem.MolToSmiles(mol))

    return canon_smiles, invalid_ids

In [6]:
# function to calculate morgan fingerprints from SMILES
def calc_morgan_fpts(smiles_list):
    morgan_fingerprints = []
    
    for i in smiles_list:
        mol = Chem.MolFromSmiles(i)
        
        # do not try to calculate if invalid
        if mol is None: continue
            
        # radius-2 Morgan fingerprint folded into a 2048-bit vector
        fpts = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
        mfpts = np.array(fpts)
        morgan_fingerprints.append(mfpts) 
        
    return np.array(morgan_fingerprints)

In [7]:
# generate canon smiles
canon_smiles, invalid_ids = gen_canon_smiles(esol.smiles)

# drop rows with invalid SMILES
esol = esol.drop(invalid_ids)

# replace SMILES with canon SMILES
esol.smiles = canon_smiles

# drop duplicates to prevent train/valid/test contamination
esol.drop_duplicates(subset=['smiles'], inplace=True)

In [8]:
# calculate fingerprints and create TensorDataset
X = torch.from_numpy(calc_morgan_fpts(esol.smiles)).float()
y = torch.from_numpy(esol["measured log solubility in mols per litre"].values).float()
esol_ds = TensorDataset(X, y)

In [9]:
# split data into training, validation, and test sets
train_ds, valid_ds, test_ds = random_split(esol_ds, [0.80, 0.10, 0.10])
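
`random_split` draws from PyTorch's global random number generator, so the split above changes from run to run unless a seed is set beforehand. A minimal sketch of a reproducible 8/1/1 split, assuming PyTorch 1.13 or later (which accepts fractional lengths); `split_8_1_1`, the toy tensors, and the seed value are placeholders introduced here, not the notebook's data:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the fingerprint dataset (1000 samples, 8 features).
X_toy = torch.arange(8000, dtype=torch.float32).reshape(1000, 8)
y_toy = torch.arange(1000, dtype=torch.float32)
toy_ds = TensorDataset(X_toy, y_toy)

def split_8_1_1(dataset, seed=1):
    """Split a dataset 80/10/10 using a dedicated, seeded generator."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [0.8, 0.1, 0.1], generator=generator)

train_toy, valid_toy, test_toy = split_8_1_1(toy_ds)
print(len(train_toy), len(valid_toy), len(test_toy))
```

Passing an explicit generator keeps the split independent of any other random calls made earlier in the session.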

In [10]:
# create DataLoaders
batch_size = 64

train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=True)

### Model testing

#### Base model

In [11]:
hidden_units = [32, 16]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=32, bias=True)
  (1): ReLU()
  (2): Linear(in_features=32, out_features=16, bias=True)
  (3): ReLU()
  (4): Linear(in_features=16, out_features=1, bias=True)
)
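
The same layer-building loop recurs for every configuration tried in this notebook. As a sketch, it could be wrapped in a helper; `build_mlp` and its parameters are hypothetical names introduced here, not part of the original notebook:

```python
import torch
import torch.nn as nn

def build_mlp(input_size, hidden_units, activation=nn.ReLU, batch_norm=False):
    """Stack Linear (+ optional BatchNorm1d) + activation blocks, then a 1-unit output."""
    layers = []
    for hidden_unit in hidden_units:
        layers.append(nn.Linear(input_size, hidden_unit))
        if batch_norm:
            layers.append(nn.BatchNorm1d(hidden_unit))
        layers.append(activation())  # fresh activation module per block
        input_size = hidden_unit
    layers.append(nn.Linear(input_size, 1))  # single regression output
    return nn.Sequential(*layers)

# Same layout as the base model above.
base_model = build_mlp(2048, [32, 16])
# A deeper variant with batch normalization and a different activation.
bn_model = build_mlp(2048, [512, 256, 64], activation=nn.LeakyReLU, batch_norm=True)
```

With a helper like this, each experiment would differ only in the arguments passed, which makes it easier to see what changed between runs.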

In [12]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
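
The training loop below obtains RMSE by taking `torch.sqrt` of the MSE. One caveat: the derivative of the square root is unbounded at zero, so a batch with exactly zero error can produce NaN gradients. A hedged sketch of a wrapper that adds a small constant inside the root; `RMSELoss` and `eps` are assumptions introduced for illustration, not something the notebook uses:

```python
import torch
import torch.nn as nn

class RMSELoss(nn.Module):
    """Root-mean-square error with a small epsilon for numerical stability."""

    def __init__(self, eps=1e-8):
        super().__init__()
        self.mse = nn.MSELoss()
        self.eps = eps

    def forward(self, pred, target):
        # eps keeps the argument of sqrt strictly positive
        return torch.sqrt(self.mse(pred, target) + self.eps)

rmse_loss = RMSELoss()
pred = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([1.0, 2.0, 5.0])
loss = rmse_loss(pred, target)  # squared errors 0, 0, 4 -> sqrt(4/3 + eps)
loss.backward()
```

On real data an exactly zero batch error is unlikely, so this is a safeguard rather than a required change.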

In [13]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 7.7337, Valid Loss: 2.3464
Epoch 040, Train Loss: 4.6522, Valid Loss: 2.7143
Epoch 060, Train Loss: 4.2944, Valid Loss: 2.6510
Epoch 080, Train Loss: 3.8723, Valid Loss: 2.6809
Epoch 100, Train Loss: 3.8019, Valid Loss: 2.6742
Epoch 120, Train Loss: 3.7750, Valid Loss: 2.6409
Epoch 140, Train Loss: 3.5653, Valid Loss: 2.6502
Epoch 160, Train Loss: 3.6969, Valid Loss: 2.5938
Epoch 180, Train Loss: 3.5051, Valid Loss: 2.6314
Epoch 200, Train Loss: 3.6971, Valid Loss: 2.5945
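
Note that the logged values are sums of per-batch RMSEs, so they scale with the number of batches: the training loader yields more batches than the validation loader, which is why the two columns are not directly comparable and neither is an RMSE over the whole set. A minimal sketch of a helper that pools squared errors over all samples before taking the root; `dataset_rmse` is a hypothetical name, and the toy model and data only serve to make the block runnable:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def dataset_rmse(model, data_loader):
    """RMSE over every sample in the loader, independent of batch size."""
    was_training = model.training
    model.eval()  # inference behavior for layers such as BatchNorm
    squared_error, n_samples = 0.0, 0
    with torch.no_grad():
        for x_batch, y_batch in data_loader:
            pred = model(x_batch)[:, 0]
            squared_error += torch.sum((pred - y_batch) ** 2).item()
            n_samples += y_batch.numel()
    model.train(was_training)  # restore the previous mode
    return (squared_error / n_samples) ** 0.5

# Toy check: identity model, one of four samples is off by 2.
toy_model = nn.Linear(1, 1)
with torch.no_grad():
    toy_model.weight.fill_(1.0)
    toy_model.bias.fill_(0.0)
toy_x = torch.tensor([[0.0], [1.0], [2.0], [3.0]])
toy_y = torch.tensor([0.0, 1.0, 2.0, 5.0])
toy_dl = DataLoader(TensorDataset(toy_x, toy_y), batch_size=3)
print(dataset_rmse(toy_model, toy_dl))
```

Calling such a helper on `train_dl`, `valid_dl`, and `test_dl` after training would give the three numbers the task asks for.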


#### Increase hidden units

In [14]:
hidden_units = [256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=256, bias=True)
  (1): ReLU()
  (2): Linear(in_features=256, out_features=64, bias=True)
  (3): ReLU()
  (4): Linear(in_features=64, out_features=1, bias=True)
)

In [15]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [16]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 4.6639, Valid Loss: 2.2584
Epoch 040, Train Loss: 4.0665, Valid Loss: 2.1762
Epoch 060, Train Loss: 3.9779, Valid Loss: 2.0550
Epoch 080, Train Loss: 3.7950, Valid Loss: 2.1702
Epoch 100, Train Loss: 3.9314, Valid Loss: 2.1409
Epoch 120, Train Loss: 3.7042, Valid Loss: 2.1247
Epoch 140, Train Loss: 3.5746, Valid Loss: 2.1323
Epoch 160, Train Loss: 3.6995, Valid Loss: 2.1275
Epoch 180, Train Loss: 3.7217, Valid Loss: 2.1294
Epoch 200, Train Loss: 3.6591, Valid Loss: 2.1263


**Increase hidden units again**

In [17]:
hidden_units = [512, 256]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=1, bias=True)
)

In [18]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [19]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 4.5893, Valid Loss: 2.2734
Epoch 040, Train Loss: 4.3788, Valid Loss: 2.1372
Epoch 060, Train Loss: 4.0292, Valid Loss: 1.9974
Epoch 080, Train Loss: 4.0408, Valid Loss: 2.1087
Epoch 100, Train Loss: 3.9150, Valid Loss: 2.0536
Epoch 120, Train Loss: 3.7056, Valid Loss: 2.0718
Epoch 140, Train Loss: 3.7043, Valid Loss: 2.0739
Epoch 160, Train Loss: 3.7188, Valid Loss: 2.0721
Epoch 180, Train Loss: 3.6855, Valid Loss: 2.0499
Epoch 200, Train Loss: 4.1616, Valid Loss: 2.0475


```hidden_units=[512, 256]``` seems to give the best performance on the validation set.

#### Add a hidden layer

In [20]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [21]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [22]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 5.5669, Valid Loss: 2.3301
Epoch 040, Train Loss: 4.4709, Valid Loss: 2.1514
Epoch 060, Train Loss: 4.0999, Valid Loss: 1.9826
Epoch 080, Train Loss: 3.9182, Valid Loss: 2.0937
Epoch 100, Train Loss: 3.8080, Valid Loss: 2.0679
Epoch 120, Train Loss: 3.5776, Valid Loss: 2.0766
Epoch 140, Train Loss: 3.6676, Valid Loss: 2.1193
Epoch 160, Train Loss: 3.6619, Valid Loss: 2.0797
Epoch 180, Train Loss: 3.7371, Valid Loss: 2.0759
Epoch 200, Train Loss: 4.3581, Valid Loss: 2.0896


**Try adding a hidden layer with higher dimension**

In [23]:
hidden_units = [512, 512, 256]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=512, bias=True)
  (3): ReLU()
  (4): Linear(in_features=512, out_features=256, bias=True)
  (5): ReLU()
  (6): Linear(in_features=256, out_features=1, bias=True)
)

In [24]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [25]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 4.8762, Valid Loss: 2.2480
Epoch 040, Train Loss: 4.5860, Valid Loss: 2.0920
Epoch 060, Train Loss: 4.2419, Valid Loss: 1.9641
Epoch 080, Train Loss: 3.9658, Valid Loss: 2.0902
Epoch 100, Train Loss: 3.9169, Valid Loss: 2.0401
Epoch 120, Train Loss: 3.9030, Valid Loss: 2.0443
Epoch 140, Train Loss: 3.6570, Valid Loss: 2.0290
Epoch 160, Train Loss: 3.6731, Valid Loss: 2.0627
Epoch 180, Train Loss: 4.1093, Valid Loss: 2.0600
Epoch 200, Train Loss: 4.2778, Valid Loss: 2.0707


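None of the three-layer networks clearly beats ```hidden_units=[512, 256]``` on the validation set: at epoch 200 the validation losses are 2.0896 for ```[512, 256, 64]``` and 2.0707 for ```[512, 512, 256]```, versus 2.0475 for ```[512, 256]```, and these gaps are comparable to the fluctuation between logged epochs. I continue with ```hidden_units=[512, 256, 64]```. It is interesting to note that increasing the size of the hidden layers did not bring a clear improvement on the validation set.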

#### Try various activation functions

* **ReLU (current)**

In [26]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [27]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [28]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 5.5669, Valid Loss: 2.3301
Epoch 040, Train Loss: 4.4709, Valid Loss: 2.1514
Epoch 060, Train Loss: 4.0999, Valid Loss: 1.9826
Epoch 080, Train Loss: 3.9182, Valid Loss: 2.0937
Epoch 100, Train Loss: 3.8080, Valid Loss: 2.0679
Epoch 120, Train Loss: 3.5776, Valid Loss: 2.0766
Epoch 140, Train Loss: 3.6676, Valid Loss: 2.1193
Epoch 160, Train Loss: 3.6619, Valid Loss: 2.0797
Epoch 180, Train Loss: 3.7371, Valid Loss: 2.0759
Epoch 200, Train Loss: 4.3581, Valid Loss: 2.0896


* **LeakyReLU**

In [29]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.LeakyReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): LeakyReLU(negative_slope=0.01)
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): LeakyReLU(negative_slope=0.01)
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): LeakyReLU(negative_slope=0.01)
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [30]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [31]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 4.8468, Valid Loss: 2.2416
Epoch 040, Train Loss: 4.5766, Valid Loss: 2.1587
Epoch 060, Train Loss: 4.1499, Valid Loss: 2.0231
Epoch 080, Train Loss: 3.8779, Valid Loss: 2.1367
Epoch 100, Train Loss: 3.7839, Valid Loss: 2.0945
Epoch 120, Train Loss: 3.8141, Valid Loss: 2.1212
Epoch 140, Train Loss: 3.6841, Valid Loss: 2.1421
Epoch 160, Train Loss: 3.6417, Valid Loss: 2.1128
Epoch 180, Train Loss: 3.7319, Valid Loss: 2.1447
Epoch 200, Train Loss: 4.3586, Valid Loss: 2.1514


* **ELU**

In [32]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ELU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ELU(alpha=1.0)
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ELU(alpha=1.0)
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ELU(alpha=1.0)
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [33]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [34]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 4.9892, Valid Loss: 2.3360
Epoch 040, Train Loss: 4.5836, Valid Loss: 2.1643
Epoch 060, Train Loss: 4.3040, Valid Loss: 2.1225
Epoch 080, Train Loss: 3.8226, Valid Loss: 2.2610
Epoch 100, Train Loss: 3.8776, Valid Loss: 2.2129
Epoch 120, Train Loss: 3.8115, Valid Loss: 2.1850
Epoch 140, Train Loss: 3.6759, Valid Loss: 2.2240
Epoch 160, Train Loss: 3.8099, Valid Loss: 2.2118
Epoch 180, Train Loss: 3.7035, Valid Loss: 2.2165
Epoch 200, Train Loss: 3.8563, Valid Loss: 2.2237


* **SELU**

In [35]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.SELU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): SELU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): SELU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): SELU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [36]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [37]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 5.2109, Valid Loss: 2.4745
Epoch 040, Train Loss: 4.7176, Valid Loss: 2.2522
Epoch 060, Train Loss: 4.2609, Valid Loss: 2.1464
Epoch 080, Train Loss: 3.7815, Valid Loss: 2.2254
Epoch 100, Train Loss: 3.9192, Valid Loss: 2.1982
Epoch 120, Train Loss: 3.8465, Valid Loss: 2.2265
Epoch 140, Train Loss: 3.6199, Valid Loss: 2.2087
Epoch 160, Train Loss: 3.7528, Valid Loss: 2.1797
Epoch 180, Train Loss: 3.5920, Valid Loss: 2.1847
Epoch 200, Train Loss: 3.6484, Valid Loss: 2.1913


The best performance on the validation set seems to come from ReLU.

#### Batch normalization

* **No batch normalization**

In [38]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [39]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [40]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 5.5669, Valid Loss: 2.3301
Epoch 040, Train Loss: 4.4709, Valid Loss: 2.1514
Epoch 060, Train Loss: 4.0999, Valid Loss: 1.9826
Epoch 080, Train Loss: 3.9182, Valid Loss: 2.0937
Epoch 100, Train Loss: 3.8080, Valid Loss: 2.0679
Epoch 120, Train Loss: 3.5776, Valid Loss: 2.0766
Epoch 140, Train Loss: 3.6676, Valid Loss: 2.1193
Epoch 160, Train Loss: 3.6619, Valid Loss: 2.0797
Epoch 180, Train Loss: 3.7371, Valid Loss: 2.0759
Epoch 200, Train Loss: 4.3581, Valid Loss: 2.0896


* **1D batch norm**

In [41]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.BatchNorm1d(hidden_unit))
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
  (3): Linear(in_features=512, out_features=256, bias=True)
  (4): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (5): ReLU()
  (6): Linear(in_features=256, out_features=64, bias=True)
  (7): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (8): ReLU()
  (9): Linear(in_features=64, out_features=1, bias=True)
)

In [42]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [43]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 6.1459, Valid Loss: 2.0616
Epoch 040, Train Loss: 5.5160, Valid Loss: 1.9682
Epoch 060, Train Loss: 5.7625, Valid Loss: 1.9632
Epoch 080, Train Loss: 5.0682, Valid Loss: 1.9542
Epoch 100, Train Loss: 5.2680, Valid Loss: 2.1759
Epoch 120, Train Loss: 4.8420, Valid Loss: 1.9381
Epoch 140, Train Loss: 4.4300, Valid Loss: 1.9807
Epoch 160, Train Loss: 4.0248, Valid Loss: 1.9948
Epoch 180, Train Loss: 4.1317, Valid Loss: 1.9803
Epoch 200, Train Loss: 4.5720, Valid Loss: 1.9469


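With batch normalization the training loss is higher at every logged epoch (4.5720 versus 4.3581 at epoch 200), while the validation loss is slightly lower (1.9469 versus 2.0896). Because the validation loop never calls `model.eval()`, the batch-norm layers still normalize with per-batch statistics during validation, so these validation numbers are not a reliable basis for comparison. I will not implement batch normalization here.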
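
One detail matters when comparing runs with and without batch normalization: `nn.BatchNorm1d` behaves differently in training and evaluation mode, and `torch.no_grad()` does not switch modes. The following sketch, on random toy data rather than the ESOL fingerprints, illustrates the difference:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 3 + 5  # toy batch with a non-zero mean

bn.train()
with torch.no_grad():
    out_train = bn(x)  # normalized with this batch's own mean and variance
# The running mean was updated even though gradients were disabled.
running_mean_after = bn.running_mean.clone()

bn.eval()
with torch.no_grad():
    out_eval_full = bn(x)      # normalized with the stored running statistics
    out_eval_part = bn(x[:2])  # per-sample result no longer depends on the batch
```

In the loops above, calling `model.train()` before the training pass and `model.eval()` before the validation pass would make the batch-norm comparison cleaner.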

#### Learning rates

* **```learning_rate=0.001```**

In [44]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [45]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [46]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 5.5669, Valid Loss: 2.3301
Epoch 040, Train Loss: 4.4709, Valid Loss: 2.1514
Epoch 060, Train Loss: 4.0999, Valid Loss: 1.9826
Epoch 080, Train Loss: 3.9182, Valid Loss: 2.0937
Epoch 100, Train Loss: 3.8080, Valid Loss: 2.0679
Epoch 120, Train Loss: 3.5776, Valid Loss: 2.0766
Epoch 140, Train Loss: 3.6676, Valid Loss: 2.1193
Epoch 160, Train Loss: 3.6619, Valid Loss: 2.0797
Epoch 180, Train Loss: 3.7371, Valid Loss: 2.0759
Epoch 200, Train Loss: 4.3581, Valid Loss: 2.0896


* **```learning_rate=0.0001```**

In [47]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [48]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

In [49]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 10.6672, Valid Loss: 2.1962
Epoch 040, Train Loss: 4.7093, Valid Loss: 2.2749
Epoch 060, Train Loss: 3.8926, Valid Loss: 2.2123
Epoch 080, Train Loss: 3.5693, Valid Loss: 2.2910
Epoch 100, Train Loss: 3.6135, Valid Loss: 2.2499
Epoch 120, Train Loss: 3.4165, Valid Loss: 2.2592
Epoch 140, Train Loss: 3.3112, Valid Loss: 2.2460
Epoch 160, Train Loss: 3.4705, Valid Loss: 2.2532
Epoch 180, Train Loss: 3.4147, Valid Loss: 2.2539
Epoch 200, Train Loss: 3.5822, Valid Loss: 2.2623


* **```learning_rate=0.01```**

In [50]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [51]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

In [52]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 6.5789, Valid Loss: 2.1498
Epoch 040, Train Loss: 5.4988, Valid Loss: 2.1210
Epoch 060, Train Loss: 4.9793, Valid Loss: 1.8967
Epoch 080, Train Loss: 4.0346, Valid Loss: 2.0415
Epoch 100, Train Loss: 5.3023, Valid Loss: 2.0121
Epoch 120, Train Loss: 4.1119, Valid Loss: 2.0480
Epoch 140, Train Loss: 3.8194, Valid Loss: 2.0290
Epoch 160, Train Loss: 4.2391, Valid Loss: 2.0796
Epoch 180, Train Loss: 3.9281, Valid Loss: 2.0583
Epoch 200, Train Loss: 3.8436, Valid Loss: 2.0673


Decreasing the learning rate to ```0.0001``` worsens the performance on the validation set (2.2623 versus 2.0896 at epoch 200), even though the training loss ends lower. Increasing the learning rate to ```0.01``` reaches a lower validation loss sooner (1.8967 at epoch 60), but makes the losses much more erratic from epoch to epoch. There is no significant overall improvement in comparison to the default ```0.001```, so I will not change it.
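
The three learning-rate runs above differ only in the `lr` argument, so the comparison can be expressed as a loop over candidate values. A sketch under stated assumptions: `train_once` is a hypothetical helper, the tiny network and synthetic data merely keep the block fast and self-contained, and the validation score is pooled over all samples rather than summed per batch:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_once(train_loader, valid_loader, input_size, lr, num_epochs=5, seed=1):
    """Train a small MLP with Adam and return the pooled validation RMSE."""
    torch.manual_seed(seed)  # same initial weights and shuffling for every lr
    model = nn.Sequential(nn.Linear(input_size, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(num_epochs):
        for x_batch, y_batch in train_loader:
            loss = torch.sqrt(loss_fn(model(x_batch)[:, 0], y_batch))
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    squared_error, n_samples = 0.0, 0
    with torch.no_grad():
        for x_batch, y_batch in valid_loader:
            pred = model(x_batch)[:, 0]
            squared_error += torch.sum((pred - y_batch) ** 2).item()
            n_samples += y_batch.numel()
    return (squared_error / n_samples) ** 0.5

# Synthetic regression data standing in for fingerprints and solubilities.
torch.manual_seed(0)
X_toy = torch.randn(200, 10)
y_toy = X_toy.sum(dim=1)
toy_train_dl = DataLoader(TensorDataset(X_toy[:160], y_toy[:160]), batch_size=32, shuffle=True)
toy_valid_dl = DataLoader(TensorDataset(X_toy[160:], y_toy[160:]), batch_size=32)

results = {lr: train_once(toy_train_dl, toy_valid_dl, 10, lr) for lr in (1e-4, 1e-3, 1e-2)}
for lr, rmse in results.items():
    print(f"lr={lr:g}: valid RMSE {rmse:.4f}")
```

Seeding inside the helper means every learning rate starts from the same weights, so differences in the result come from the learning rate alone.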

#### Optimization functions

* **Adam optimization**

In [53]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [54]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [55]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 5.5669, Valid Loss: 2.3301
Epoch 040, Train Loss: 4.4709, Valid Loss: 2.1514
Epoch 060, Train Loss: 4.0999, Valid Loss: 1.9826
Epoch 080, Train Loss: 3.9182, Valid Loss: 2.0937
Epoch 100, Train Loss: 3.8080, Valid Loss: 2.0679
Epoch 120, Train Loss: 3.5776, Valid Loss: 2.0766
Epoch 140, Train Loss: 3.6676, Valid Loss: 2.1193
Epoch 160, Train Loss: 3.6619, Valid Loss: 2.0797
Epoch 180, Train Loss: 3.7371, Valid Loss: 2.0759
Epoch 200, Train Loss: 4.3581, Valid Loss: 2.0896


* **SGD optimization**

In [56]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [57]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

In [58]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 49.5324, Valid Loss: 7.2687
Epoch 040, Train Loss: 46.2327, Valid Loss: 6.8517
Epoch 060, Train Loss: 42.7446, Valid Loss: 6.2365
Epoch 080, Train Loss: 39.3341, Valid Loss: 5.8618
Epoch 100, Train Loss: 35.3740, Valid Loss: 5.1722
Epoch 120, Train Loss: 31.6321, Valid Loss: 4.7887
Epoch 140, Train Loss: 28.9457, Valid Loss: 4.3628
Epoch 160, Train Loss: 27.8172, Valid Loss: 4.2317
Epoch 180, Train Loss: 27.4354, Valid Loss: 4.0932
Epoch 200, Train Loss: 27.0568, Valid Loss: 4.1382


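With the same learning rate (```0.001```) and 200 epochs, Adam optimization leads to significantly better performance than SGD: the final validation loss is about 2.09 with Adam versus 4.14 with SGD. The SGD training loss is still decreasing at epoch 200, so SGD may simply need more epochs or a larger learning rate.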

#### Implement dropout

In [59]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    all_layers.append(nn.Dropout(p=0.5))
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.5, inplace=False)
  (3): Linear(in_features=512, out_features=256, bias=True)
  (4): ReLU()
  (5): Dropout(p=0.5, inplace=False)
  (6): Linear(in_features=256, out_features=64, bias=True)
  (7): ReLU()
  (8): Dropout(p=0.5, inplace=False)
  (9): Linear(in_features=64, out_features=1, bias=True)
)

In [60]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [61]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 12.8419, Valid Loss: 2.5981
Epoch 040, Train Loss: 11.8132, Valid Loss: 2.6594
Epoch 060, Train Loss: 11.1727, Valid Loss: 2.7630
Epoch 080, Train Loss: 11.2859, Valid Loss: 2.8195
Epoch 100, Train Loss: 10.3954, Valid Loss: 2.4595
Epoch 120, Train Loss: 10.3108, Valid Loss: 2.7423
Epoch 140, Train Loss: 10.2185, Valid Loss: 2.6139
Epoch 160, Train Loss: 9.4553, Valid Loss: 2.5486
Epoch 180, Train Loss: 9.7496, Valid Loss: 2.5442
Epoch 200, Train Loss: 9.0713, Valid Loss: 2.4131


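Dropout does not help the performance of this model, so it will not be implemented. One caveat: the validation loop never calls ```model.eval()```, so dropout is still active when the validation loss is computed, which may make the dropout model look worse than it is.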
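For context on how dropout behaves at evaluation time, here is a minimal sketch on a toy network with random inputs (both are placeholders, not the model or data from this notebook). In PyTorch, `nn.Dropout` only zeroes activations while the module is in training mode; calling `eval()` turns it into a pass-through.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Toy network with one dropout layer (placeholder sizes, for illustration only).
net = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(4, 1))
x = torch.randn(5, 8)

net.eval()  # dropout disabled: repeated forward passes give identical outputs
with torch.no_grad():
    out_a = net(x)
    out_b = net(x)
eval_is_deterministic = torch.equal(out_a, out_b)
print("eval mode deterministic:", eval_is_deterministic)

net.train()  # switch back before the next training epoch so dropout is active again
print("back in training mode:", net.training)
```

In a training loop, this would mean calling `model.train()` at the start of each epoch and `model.eval()` before the validation pass.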

#### Batch size

* **```batch_size=64```**

In [62]:
# create DataLoaders
batch_size = 64

train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)

In [63]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [64]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [65]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 5.0191, Valid Loss: 2.2429
Epoch 040, Train Loss: 4.6213, Valid Loss: 2.1831
Epoch 060, Train Loss: 4.0735, Valid Loss: 2.0219
Epoch 080, Train Loss: 3.9549, Valid Loss: 2.1234
Epoch 100, Train Loss: 3.8354, Valid Loss: 2.0831
Epoch 120, Train Loss: 3.7275, Valid Loss: 2.0854
Epoch 140, Train Loss: 3.6928, Valid Loss: 2.1041
Epoch 160, Train Loss: 3.6888, Valid Loss: 2.1020
Epoch 180, Train Loss: 3.5296, Valid Loss: 2.0844
Epoch 200, Train Loss: 4.4159, Valid Loss: 2.1053


* **```batch_size=32```**

In [66]:
# create DataLoaders
batch_size = 32

train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)

In [67]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [68]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [69]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 9.7082, Valid Loss: 4.3951
Epoch 040, Train Loss: 8.8270, Valid Loss: 4.3386
Epoch 060, Train Loss: 8.0471, Valid Loss: 4.0919
Epoch 080, Train Loss: 7.6728, Valid Loss: 4.1408
Epoch 100, Train Loss: 7.7827, Valid Loss: 4.1585
Epoch 120, Train Loss: 7.7383, Valid Loss: 4.2119
Epoch 140, Train Loss: 6.9498, Valid Loss: 4.1421
Epoch 160, Train Loss: 7.2664, Valid Loss: 3.9055
Epoch 180, Train Loss: 6.8677, Valid Loss: 4.1584
Epoch 200, Train Loss: 7.4001, Valid Loss: 4.1780


* **```batch_size=128```**

In [70]:
# create DataLoaders
batch_size = 128

train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)

In [71]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [72]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [73]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 2.4800, Valid Loss: 1.0850
Epoch 040, Train Loss: 2.1697, Valid Loss: 1.0788
Epoch 060, Train Loss: 2.2222, Valid Loss: 1.0476
Epoch 080, Train Loss: 2.0014, Valid Loss: 1.0544
Epoch 100, Train Loss: 2.0705, Valid Loss: 1.0673
Epoch 120, Train Loss: 2.0171, Valid Loss: 1.0462
Epoch 140, Train Loss: 1.9127, Valid Loss: 1.0582
Epoch 160, Train Loss: 2.1763, Valid Loss: 1.0727
Epoch 180, Train Loss: 2.0089, Valid Loss: 1.0535
Epoch 200, Train Loss: 1.8305, Valid Loss: 1.0584


* **```batch_size=256```**

In [74]:
# create DataLoaders
batch_size = 256

train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)

In [75]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [76]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [77]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 1.6488, Valid Loss: 1.0784
Epoch 040, Train Loss: 1.3109, Valid Loss: 1.0602
Epoch 060, Train Loss: 1.2366, Valid Loss: 1.0655
Epoch 080, Train Loss: 1.0730, Valid Loss: 1.0564
Epoch 100, Train Loss: 1.1606, Valid Loss: 1.0508
Epoch 120, Train Loss: 1.1058, Valid Loss: 1.0562
Epoch 140, Train Loss: 1.0114, Valid Loss: 1.0461
Epoch 160, Train Loss: 1.0960, Valid Loss: 1.0505
Epoch 180, Train Loss: 1.0684, Valid Loss: 1.0341
Epoch 200, Train Loss: 1.1879, Valid Loss: 1.0458


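Batch sizes ```128``` and ```256``` give the lowest logged losses out of the sizes tried, and the difference between ```128``` and ```256``` seems minimal. Note, however, that each logged value is the sum of the per-batch RMSEs, so a larger batch size (fewer batches per epoch) produces a smaller sum by construction. The logged values are therefore not directly comparable across batch sizes.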
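To put the logged validation values on a common scale, one option is to divide each sum by the number of validation batches. The sketch below does this for the final-epoch values copied from the logs above. It assumes about 113 validation molecules (10% of the 1128 ESOL compounds); that size is an assumption, not something printed in this section. The mean of per-batch RMSEs is also only an approximation of the RMSE over the whole validation set.

```python
import math

# Final-epoch "Valid Loss" values copied from the logs above (sums of per-batch RMSE).
logged_valid_sum = {32: 4.1780, 64: 2.1053, 128: 1.0584, 256: 1.0458}

# ASSUMPTION: roughly 113 validation molecules (10% of 1128 ESOL compounds).
n_valid = 113


def mean_batch_rmse(summed_loss, n_samples, batch_size):
    """Divide a summed per-batch loss by the number of batches in one epoch."""
    n_batches = math.ceil(n_samples / batch_size)
    return summed_loss / n_batches


per_batch_mean = {
    batch_size: mean_batch_rmse(total, n_valid, batch_size)
    for batch_size, total in logged_valid_sum.items()
}

for batch_size, value in per_batch_mean.items():
    print(f"batch_size={batch_size:>3}: mean per-batch valid RMSE = {value:.4f}")
```

Under that assumption all four batch sizes land between roughly 1.04 and 1.06, which suggests the validation performance is much closer across batch sizes than the raw sums indicate.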

### Best model (so far)

In [78]:
# create DataLoaders
batch_size = 128

train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=True)

In [79]:
hidden_units = [512, 256, 64]
input_size = X.shape[1]
all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 1))
model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=2048, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=1, bias=True)
)

In [80]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [81]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20
for epoch in range(1, num_epochs+1):
    loss_hist_train = 0
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = torch.sqrt(loss_fn(pred, y_batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()
    
    loss_hist_valid = 0
    with torch.no_grad():
        for x_batch, y_batch in valid_dl:
            pred = model(x_batch)[:, 0]
            loss = torch.sqrt(loss_fn(pred, y_batch))
            loss_hist_valid += loss.item()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch:0>3}, Train Loss: 'f'{loss_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}')
        
    if epoch == num_epochs:
        print()
        # the loss is sqrt(MSE), i.e. RMSE, accumulated over the batches of the last epoch
        print(f'Final Train RMSE (summed over batches): {loss_hist_train:.4f}')
        print(f'Final Valid RMSE (summed over batches): {loss_hist_valid:.4f}')

Epoch 020, Train Loss: 2.4800, Valid Loss: 1.0850
Epoch 040, Train Loss: 2.1697, Valid Loss: 1.0788
Epoch 060, Train Loss: 2.2222, Valid Loss: 1.0476
Epoch 080, Train Loss: 2.0014, Valid Loss: 1.0544
Epoch 100, Train Loss: 2.0705, Valid Loss: 1.0673
Epoch 120, Train Loss: 2.0171, Valid Loss: 1.0462
Epoch 140, Train Loss: 1.9127, Valid Loss: 1.0582
Epoch 160, Train Loss: 2.1763, Valid Loss: 1.0727
Epoch 180, Train Loss: 2.0089, Valid Loss: 1.0535
Epoch 200, Train Loss: 1.8305, Valid Loss: 1.0584

Final Train RMSE (summed over batches): 1.8305
Final Valid RMSE (summed over batches): 1.0584


### Evaluate on the test set

In [82]:
with torch.no_grad():
    for x_batch, y_batch in test_dl:
        pred = model(x_batch)[:, 0]
        # sqrt(MSE) is the RMSE; `loss` keeps only the value of the last batch
        loss = torch.sqrt(loss_fn(pred, y_batch)).item()
    print(f'Test RMSE: {loss:.4f}')

Test RMSE: 0.9465
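The cell above reports the loss of the last test batch, which equals the test-set value only when the test set fits in a single batch. A batch-size-independent alternative is to accumulate squared errors over all batches and take one square root at the end. The helper name `dataset_rmse`, the toy linear model, and the random data below are hypothetical placeholders for illustration.

```python
import math

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def dataset_rmse(model, loader):
    """RMSE over every sample in `loader`, independent of the batch size."""
    model.eval()  # disable dropout and similar layers; the model stays in eval mode afterwards
    squared_error, n_samples = 0.0, 0
    with torch.no_grad():
        for x_batch, y_batch in loader:
            pred = model(x_batch)[:, 0]
            squared_error += torch.sum((pred - y_batch) ** 2).item()
            n_samples += y_batch.numel()
    return math.sqrt(squared_error / n_samples)


# Placeholder model and data (assumptions, not the notebook's model or ESOL data).
torch.manual_seed(1)
demo_model = nn.Linear(4, 1)
demo_ds = TensorDataset(torch.randn(10, 4), torch.randn(10))

rmse_small_batches = dataset_rmse(demo_model, DataLoader(demo_ds, batch_size=3))
rmse_single_batch = dataset_rmse(demo_model, DataLoader(demo_ds, batch_size=10))
print(f"batch_size=3:  {rmse_small_batches:.4f}")
print(f"batch_size=10: {rmse_single_batch:.4f}")
```

Both calls give the same value up to floating-point rounding, so the same function could be applied to `train_dl`, `valid_dl`, and `test_dl` to report comparable numbers for all three splits.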


### Summary

When constructing my model I explored a number of hyperparameters and design choices, including: hidden unit size, hidden layers, activation functions, normalization, dropout, optimization functions, learning rate, and batch size.

For the hidden unit size of individual layers, there seems to be a trend that more units lead to better performance. Adding a hidden layer also improved performance. I found ReLU to be the best performing activation function; however, I did not perform parameter explorations with the modified ReLU functions, so there is room for further testing here. Surprisingly, normalization did not improve model performance, and neither did dropout. For learning rate, neither 0.01 nor 0.0001 gave a significant overall improvement over the default 0.001, so I kept the default. For optimization functions, Adam gave the best performance. Increasing the batch size to 128 further lowered the logged loss of my model.

The NN model I developed here turned out similar to the model I developed in HW8c. They have a similar number of layers with similar hidden sizes, and both use the ReLU activation function. The biggest difference is that the model in HW8c implements normalization while this model does not. I am not entirely sure why normalization did not improve performance in this model; more testing would be needed to find out.

Compared to my model in HW4, this NN is a vast improvement. The RMSE of the predictions using this model is orders of magnitude lower than the RMSE of the predictions using my HW4 model.