# Neural network training with PyTorch

Outline:
1. Making train, validation, and test datasets
2. Splitting into features and targets
3. Converting to PyTorch tensors
4. Creating a DataLoader for batching
5. Defining the neural network architecture
6. Defining the loss function
7. Training the model
8. Evaluating the model
9. Saving the model

In [95]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

### 1. Making train, validation, and test datasets

The file `probabilities.csv` in the `data` folder contains various probabilities as features (and targets) for each election year. For this notebook:
- We combine the years $2008,2012$ into the training set.
- We use the year $2016$ for validation.
- We use the year $2020$ for testing.

In [96]:
# load the datasets
df = pd.read_csv('../midterm/data/probabilities.csv')
years = [2008, 2012, 2016, 2020]
dfs = {year: df[df['year'] == year] for year in years}

# make train, val, and test sets
df_train = pd.concat([dfs[2008], dfs[2012]], axis=0)
df_val = dfs[2016].copy()
df_test = dfs[2020].copy()

# display shapes of the datasets
print(f"Train shape: {df_train.shape}")
print(f"Validation shape: {df_val.shape}")
print(f"Test shape: {df_test.shape}")

Train shape: (6180, 119)
Validation shape: (3090, 119)
Test shape: (3090, 119)


### 2. Splitting into features and targets

- **Targets**: Our target for this notebook is not one column, but the last four columns:
    - `P(democrat|C)` - probability of voting Democrat, given the county
    - `P(other|C)` - probability of voting for a third party, given the county
    - `P(republican|C)` - probability of voting Republican, given the county
    - `P(non_voter|C)` - probability of not voting, given the county

  Note that for each row, the four columns sum to $1$, i.e. they form a probability distribution with $4$ classes. Our goal is to predict the four probabilities for each county, as well as for the whole country.
- **Features**: The features are all the columns from `P(C)` up to (but not including) the four target columns. That is, `feature_cols = df_train.columns[4:-4]`.

In [97]:
df.columns

Index(['year', 'gisjoin', 'state', 'county', 'P(C)',
       'P(households_income_under_10k|C)', 'P(households_income_10k_15k|C)',
       'P(households_income_15k_25k|C)', 'P(households_income_25k_plus|C)',
       'P(persons_male|C)',
       ...
       'P(labor_force_civilian|C)', 'P(labor_force_employed|C)',
       'P(labor_force_unemployed|C)', 'P(not_in_labor_force|C)',
       'P(persons_hispanic|C)', 'P(persons_below_poverty|C)', 'P(democrat|C)',
       'P(other|C)', 'P(republican|C)', 'P(non_voter|C)'],
      dtype='object', length=119)

In [98]:
# Last four columns are the targets
target_cols = df_train.columns[-4:].tolist()

# All columns from 4 to -4 are features
feature_cols = df_train.columns[4:-4].tolist()

print('Targets:', target_cols)
print('Features:', feature_cols)

Targets: ['P(democrat|C)', 'P(other|C)', 'P(republican|C)', 'P(non_voter|C)']
Features: ['P(C)', 'P(households_income_under_10k|C)', 'P(households_income_10k_15k|C)', 'P(households_income_15k_25k|C)', 'P(households_income_25k_plus|C)', 'P(persons_male|C)', 'P(persons_female|C)', 'P(male_never_married|C)', 'P(male_married|C)', 'P(male_separated|C)', 'P(male_widowed|C)', 'P(male_divorced|C)', 'P(female_never_married|C)', 'P(female_married|C)', 'P(female_separated|C)', 'P(female_widowed|C)', 'P(female_divorced|C)', 'P(male_18_24_less_than_9th|C)', 'P(male_18_24_some_hs|C)', 'P(male_18_24_hs_grad|C)', 'P(male_18_24_some_college|C)', 'P(male_18_24_associates|C)', 'P(male_18_24_bachelors|C)', 'P(male_18_24_graduate|C)', 'P(male_25_34_less_than_9th|C)', 'P(male_25_34_some_hs|C)', 'P(male_25_34_hs_grad|C)', 'P(male_25_34_some_college|C)', 'P(male_25_34_associates|C)', 'P(male_25_34_bachelors|C)', 'P(male_25_34_graduate|C)', 'P(male_35_44_less_than_9th|C)', 'P(male_35_44_some_hs|C)', 'P(male_35

In [99]:
# Split into features and targets for each set
X_tr = df_train[feature_cols].values
y_tr = df_train[target_cols].values

X_val = df_val[feature_cols].values
y_val = df_val[target_cols].values

X_te = df_test[feature_cols].values
y_te = df_test[target_cols].values

# Also need county weights for custom loss; explained later
w_tr = df_train['P(C)'].values
w_val = df_val['P(C)'].values
w_te = df_test['P(C)'].values

# print shapes of all X sets
print(f"X_tr shape: {X_tr.shape}, y_tr shape: {y_tr.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}")
print(f"X_te shape: {X_te.shape}, y_te shape: {y_te.shape}")

X_tr shape: (6180, 111), y_tr shape: (6180, 4)
X_val shape: (3090, 111), y_val shape: (3090, 4)
X_te shape: (3090, 111), y_te shape: (3090, 4)


### Pre-processing
Before going any further, it is important to pre-process the data. In our present case, we will standardize each feature in the training set to have mean $0$ and standard deviation $1$. We will then apply the same transformation to the validation and test sets (using the mean and standard deviation from the training set).

In [100]:
# After splitting into X_tr, X_val, X_te but before creating tensors
from sklearn.preprocessing import StandardScaler

# Fit scaler on training data only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# Transform validation and test sets using training parameters
X_val_scaled = scaler.transform(X_val)
X_te_scaled = scaler.transform(X_te)
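
To make the "fit on train, apply to validation/test" step concrete, here is a minimal NumPy-only sketch of the same computation. The arrays `X_train_demo` and `X_val_demo` are synthetic and purely illustrative. The sketch assumes the default behaviour of `StandardScaler`, which divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

# Synthetic stand-ins for the real feature matrices (illustration only)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_val_demo = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

# "Fit": compute per-feature statistics on the training data only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # ddof=0, i.e. population standard deviation

# "Transform": apply the *training* statistics to every split
X_train_std = (X_train_demo - mu) / sigma
X_val_std = (X_val_demo - mu) / sigma

print("Train mean:", X_train_std.mean(axis=0))  # numerically zero
print("Train std: ", X_train_std.std(axis=0))   # numerically one
print("Val mean:  ", X_val_std.mean(axis=0))    # close to, but not exactly, zero
```

The validation mean is not exactly zero because the validation set is scaled with the training statistics. That is intended: no information from the validation or test sets leaks into the preprocessing.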

### 3. PyTorch tensors

#### What are tensors?
Mathematically, tensors are a generalization of matrices. Namely, a tensor is a multi-dimensional array. For example, a vector is a 1D tensor, a matrix is a 2D tensor, and a 3D tensor is a 3D array. 

The **shape** of a tensor is the number of elements in each dimension. For example:
- A vector is a 1D tensor with shape `(n,)`, where `n` is the number of elements in the vector. A 1D tensor is neither a row nor a column; to make that distinction, use a 2D tensor with shape `(1,n)` (row vector) or `(n,1)` (column vector).
- A matrix is a 2D tensor with shape `(m,n)`, where `m` is the number of rows and `n` is the number of columns.
- A 3D array (consisting of entries which are located using $3$ indices) is a 3D tensor with shape `(m,n,p)`, where `m` is the number of rows, `n` is the number of columns, and `p` is the size of the third dimension (say, depth).
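
To see how shapes and the number of dimensions relate in practice, here is a small sketch (the tensor names are made up for illustration). It builds a 1D vector, reshapes it into 2D row and column forms, and creates a 3D tensor.

```python
import torch

v = torch.tensor([1.0, 2.0, 3.0])  # 1D tensor with 3 elements
row = v.reshape(1, 3)              # 2D tensor: one row, three columns
col = v.reshape(3, 1)              # 2D tensor: three rows, one column
cube = torch.zeros(2, 3, 4)        # 3D tensor

for name, t in [("v", v), ("row", row), ("col", col), ("cube", cube)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

Note that `v`, `row`, and `col` hold the same three numbers; only the shape (and hence how they behave in matrix operations and broadcasting) differs.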

Tensors are the fundamental data structure in PyTorch. They are similar to NumPy arrays, but they can be used on GPUs to accelerate computing. PyTorch tensors can be created from NumPy arrays, and they can also be converted back to NumPy arrays. Let's demonstrate some of the common operations on PyTorch tensors below.
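
One detail worth knowing about the NumPy conversion: `torch.from_numpy()` and `Tensor.numpy()` share memory with the original array (for CPU tensors), whereas `torch.tensor()` makes a copy. The following sketch, on a toy array, illustrates the difference.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

t_shared = torch.from_numpy(arr)  # shares memory with arr
t_copy = torch.tensor(arr)        # independent copy of the data

arr[0] = 100.0                    # modify the NumPy array in place

print(t_shared)  # first entry is now 100.0
print(t_copy)    # unchanged: first entry is still 1.0

back = t_shared.numpy()           # back to NumPy, again without copying
print(np.shares_memory(arr, back))
```

If you want the tensor to be independent of the source array, use `torch.tensor(arr)` or call `.clone()` on the result of `torch.from_numpy(arr)`.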

#### Creating PyTorch tensors
There are multiple ways to create PyTorch tensors:
- From a NumPy array, using `torch.from_numpy()` or `torch.FloatTensor()` (if you want to create a float tensor)
- From a list, using `torch.tensor()`
- From a scalar, using `torch.tensor()`
- From a random number generator, using `torch.randn()`, `torch.rand()`, `torch.randint()`, etc.

We give examples of each below.

In [101]:
# Creating tensors from different sources

# 1. From NumPy arrays
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])
tensor_from_numpy = torch.from_numpy(numpy_array)  # Keeps original datatype
tensor_float = torch.FloatTensor(numpy_array)      # Converts to float32
print(f"From NumPy:\n{tensor_from_numpy}")
print(f"DataType: {tensor_from_numpy.dtype}\n")
print(f'Shape: {tensor_from_numpy.shape}\n')

From NumPy:
tensor([[1, 2, 3],
        [4, 5, 6]])
DataType: torch.int64

Shape: torch.Size([2, 3])



In [102]:
# 2. From Python lists
list_data = [[1, 2, 3], [4, 5, 6]]
tensor_from_list = torch.tensor(list_data, dtype=torch.float32)
print(f"From list:\n{tensor_from_list}")
print(f"DataType: {tensor_from_list.dtype}\n")

From list:
tensor([[1., 2., 3.],
        [4., 5., 6.]])
DataType: torch.float32



In [103]:
# 3. From scalars
scalar_tensor = torch.tensor(3.14)
print(f"From scalar: {scalar_tensor}")
print(f"DataType: {scalar_tensor.dtype}\n")

From scalar: 3.140000104904175
DataType: torch.float32



In [104]:
# 4. From random number generators
# 4a. Uniform random on [0,1)
random_uniform = torch.rand(2, 3)  # 2x3 matrix with values in [0,1) (1 is excluded)
print(f"Random uniform [0,1]:\n{random_uniform}\n")

# 4b. Normal distribution (mean=0, std=1)
random_normal = torch.randn(2, 3)  # 2x3 matrix from standard normal
print(f"Random normal (mean=0, std=1):\n{random_normal}\n")

# 4c. Random integers
random_ints = torch.randint(low=0, high=10, size=(2, 3))  # 2x3 matrix, ints in [0,10)
print(f"Random integers [0,10):\n{random_ints}\n")

Random uniform [0,1]:
tensor([[0.9025, 0.8809, 0.1133],
        [0.2601, 0.4873, 0.8141]])

Random normal (mean=0, std=1):
tensor([[-2.1326, -0.7223, -0.7623],
        [ 1.1379,  1.4525, -1.1673]])

Random integers [0,10):
tensor([[2, 8, 5],
        [5, 9, 9]])



In [105]:
# 5. Special tensors
ones = torch.ones(2, 3)      # 2x3 matrix of ones
zeros = torch.zeros(2, 3)    # 2x3 matrix of zeros
identity = torch.eye(3)      # 3x3 identity matrix
print(f"Ones:\n{ones}\n")
print(f"Zeros:\n{zeros}\n")
print(f"Identity:\n{identity}")

Ones:
tensor([[1., 1., 1.],
        [1., 1., 1.]])

Zeros:
tensor([[0., 0., 0.],
        [0., 0., 0.]])

Identity:
tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])


In our case, we already have NumPy arrays with the (scaled) features, the targets, and the county weights for the training, validation, and test sets. We can convert them to PyTorch tensors using `torch.FloatTensor()` (which is appropriate because all our data consists of floats).

In [106]:
# Convert to PyTorch tensors
X_tr_tensor = torch.FloatTensor(X_tr_scaled)
y_tr_tensor = torch.FloatTensor(y_tr)
w_tr_tensor = torch.FloatTensor(w_tr)

X_val_tensor = torch.FloatTensor(X_val_scaled)
y_val_tensor = torch.FloatTensor(y_val)
w_val_tensor = torch.FloatTensor(w_val)

X_te_tensor = torch.FloatTensor(X_te_scaled)
y_te_tensor = torch.FloatTensor(y_te)
w_te_tensor = torch.FloatTensor(w_te)

In [107]:
X_tr_tensor

tensor([[-0.1414, -0.4803, -0.7481,  ..., -0.8928, -0.4503, -0.8186],
        [ 0.2537, -0.6658, -0.8407,  ..., -0.0219, -0.3300, -0.5752],
        [-0.2234,  2.3103,  1.2226,  ...,  1.3166, -0.2766,  1.1116],
        ...,
        [-0.2472, -0.4677, -0.5342,  ..., -1.3566,  0.0513, -0.2538],
        [-0.2863, -0.2451,  0.1455,  ..., -0.5341,  0.4313, -0.2697],
        [-0.2903, -0.6464, -0.2721,  ..., -0.0672, -0.3482, -0.6281]])

#### Attributes of PyTorch tensors
PyTorch tensors have the following attributes:
- `shape`: the shape of the tensor
- `dtype`: the data type of the tensor
- `device`: the device on which the tensor is stored (CPU or GPU)
- `requires_grad`: whether the tensor requires gradients (for backpropagation)

Let's first demonstrate the `shape`, `dtype`, and `requires_grad` attributes of PyTorch tensors.

In [108]:
# Demonstrate tensor attributes using our training tensors
print("X_tr_tensor attributes:")
print(f"Shape: {X_tr_tensor.shape}")  # Shows dimensions (num_samples, num_features)
print(f"Data type: {X_tr_tensor.dtype}")  # Should be torch.float32
print(f"Device: {X_tr_tensor.device}")  # Shows if tensor is on CPU or GPU
print(f"Requires gradients: {X_tr_tensor.requires_grad}\n")  # Default is False

# Create a tensor that requires gradients (common for neural network parameters)
weight = torch.randn(X_tr_tensor.shape[1], 3, requires_grad=True)
print("Neural network weight tensor attributes:")
print(f"Shape: {weight.shape}")  # (num_features, num_classes)
print(f"Data type: {weight.dtype}")
print(f"Device: {weight.device}")
print(f"Requires gradients: {weight.requires_grad}\n")

X_tr_tensor attributes:
Shape: torch.Size([6180, 111])
Data type: torch.float32
Device: cpu
Requires gradients: False

Neural network weight tensor attributes:
Shape: torch.Size([111, 3])
Data type: torch.float32
Device: cpu
Requires gradients: True
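
To illustrate what `requires_grad=True` buys us, here is a tiny autograd sketch with made-up numbers. PyTorch tracks operations on `w` and fills in `w.grad` when we call `.backward()`; the tensor `x` does not require gradients, so its `.grad` stays `None`.

```python
import torch

w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)  # "parameter"
x = torch.tensor([0.5, 1.0, 2.0])                       # "data", no gradient needed

# loss = (w . x - 1)^2, so d(loss)/dw = 2 * (w . x - 1) * x
loss = ((w * x).sum() - 1.0) ** 2
loss.backward()

print(loss.item())  # 12.25, since w . x = 4.5
print(w.grad)       # tensor([ 3.5000,  7.0000, 14.0000])
print(x.grad)       # None
```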



#### Device for PyTorch tensors
PyTorch tensors can be stored on either the CPU or a GPU. By default, tensors are created on the CPU. If you have a GPU available, you can move a tensor to it using the `.to()` method, which returns a tensor on the target device rather than modifying the original in place. For example, `tensor.to('cuda')` moves the tensor to an NVIDIA GPU (via CUDA), and `tensor.to('mps')` moves it to the GPU of an Apple Silicon Mac (via Metal Performance Shaders). You can also move the tensor back to the CPU using `tensor.to('cpu')`.

Let's illustrate this with an example.

In [109]:
# Check for MPS (Metal Performance Shaders) availability on Apple Silicon
# or CUDA availability on machines with NVIDIA GPUs
device = (
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)

# Move tensor to appropriate device
device = torch.device(device)
X_tr_gpu = X_tr_tensor.to(device)
print(f"Using device: {device}")
print(f"Tensor device: {X_tr_gpu.device}")

# Move tensor back to CPU
X_tr_cpu = X_tr_gpu.to("cpu")
print(f"Tensor moved back to CPU: {X_tr_cpu.device}")

Using device: mps
Tensor device: mps:0
Tensor moved back to CPU: cpu


#### Best practices for device usage
- Initially store all tensors on the CPU.
- Move the tensors to the GPU only when you need to perform computations on them (i.e. inside the training loop), and don't move too many at once because GPU memory is limited.

#### Operations on PyTorch tensors
PyTorch tensors support a wide variety of operations, including:
- Element-wise operations (addition, subtraction, multiplication, division)
- Matrix operations (dot product, matrix multiplication, transpose)
- Reduction operations (sum, mean, max, min)
- Indexing and slicing
- Reshaping and resizing
- Concatenation and stacking

Let's demonstrate some of these operations below.

In [110]:
# Make a copy of training tensor for demonstrations
X = X_tr_tensor.clone()  # clone() creates a copy
print(f"Original tensor shape: {X.shape}\n")

# 1. Element-wise operations
print("Element-wise operations:")
X_scaled = X * 2.0  # multiplication by scalar
X_normalized = (X - X.mean()) / X.std()  # standardization along all rows and columns
print(f"After standardization - mean: {X_normalized.mean():.3f}, std: {X_normalized.std():.3f}\n")

# standardize each feature
X_means = X.mean(dim=0) # tensor with feature means
X_stds = X.std(dim=0)   # tensor with feature stds
X_standardized = (X - X_means) / X_stds
print(f"Standardized tensor shape: {X_standardized.shape}\n")
print(f'Tensor of means: {X_means}\n')
print(f'Tensor of stds: {X_stds}\n')
print(f'Shape of tensor of means: {X_means.shape}\n')
print(f'Shape of tensor of stds: {X_stds.shape}\n')

# 2. Matrix operations
print("Matrix operations:")
# Transpose
X_t = X.T  # or X.transpose(0,1)
print(f"Original shape: {X.shape},\nTransposed shape: {X_t.shape}")

# Matrix multiplication (assuming X has shape (n_samples, n_features))
X_gram = torch.mm(X_t, X)  # Gram matrix
print(f"Gram matrix shape: {X_gram.shape}\n")

Original tensor shape: torch.Size([6180, 111])

Element-wise operations:
After standardization - mean: -0.000, std: 1.000

Standardized tensor shape: torch.Size([6180, 111])

Tensor of means: tensor([-9.2590e-10,  2.2376e-09, -2.4691e-09, -2.0061e-09, -7.7158e-10,
         1.2345e-09, -1.2345e-09, -1.4660e-09, -9.2590e-10,  9.2590e-10,
        -2.6234e-09, -2.7777e-09,  6.1726e-10,  3.0863e-10,  1.8518e-09,
         1.8518e-09,  9.2590e-10, -2.1604e-09, -9.2590e-10,  2.0061e-09,
        -1.0802e-09, -3.8579e-09,  5.8640e-09,  1.0339e-08,  9.2590e-10,
        -3.0863e-09, -3.0863e-10, -1.0802e-09, -1.2345e-09,  2.4691e-09,
         1.5432e-09, -3.0863e-09, -1.2345e-09,  1.5432e-10,  8.4874e-10,
         4.3209e-09, -6.1726e-10, -4.9381e-09, -1.5432e-09, -3.0863e-10,
        -3.8579e-10, -2.3147e-10,  1.0802e-09,  3.0863e-10, -4.6295e-10,
        -2.4691e-09,  1.5432e-10, -3.9351e-09,  1.2345e-09, -5.2468e-09,
         3.0863e-10, -4.3209e-09,  3.0863e-09,  0.0000e+00, -2.3147e-10,
     

In [111]:
# 3. Reduction operations
print("Reduction operations:")
print(f"Mean of all elements: {X.mean():.3f}")
print(f"Sum of all elements: {X.sum():.3f}")
print(f"Max value: {X.max():.3f}")
print(f"Min value: {X.min():.3f}")
# Per-feature statistics
print(f"Mean per feature (first 3): {X.mean(dim=0)[:3]}\n")

# 4. Indexing and slicing
print("Indexing and slicing:")
first_sample = X[0]  # First sample
first_feature = X[:, 0]  # First feature for all samples
subset = X[:10, :5]  # First 10 samples, first 5 features
print(f"Subset shape: {subset.shape}\n")

# 5. Reshaping and resizing
print("Reshaping:")
n_samples, n_features = X.shape
X_reshaped = X.reshape(n_samples, -1)  # -1 automatically calculates size
X_viewed = X.view(n_samples, -1)  # similar to reshape
print(f"Reshaped tensor shape: {X_reshaped.shape}\n")

Reduction operations:
Mean of all elements: -0.000
Sum of all elements: -0.000
Max value: 48.142
Min value: -9.568
Mean per feature (first 3): tensor([-9.2590e-10,  2.2376e-09, -2.4691e-09])

Indexing and slicing:
Subset shape: torch.Size([10, 5])

Reshaping:
Reshaped tensor shape: torch.Size([6180, 111])



In [112]:
# 6. Concatenation and stacking
print("Concatenation and stacking:")
# Concatenate along rows (dim=0)
X_doubled = torch.cat([X, X], dim=0)
print(f"After row concatenation: {X_doubled.shape}")
# Stack creates a new dimension (the first dimension)
X_stacked = torch.stack([X, X], dim=0)
print(f"After stacking: {X_stacked.shape}\n")

Concatenation and stacking:
After row concatenation: torch.Size([12360, 111])
After stacking: torch.Size([2, 6180, 111])



In [113]:
X_stacked = torch.stack([X_val_tensor, X_te_tensor], dim=0)
print(f'After stacking: {X_stacked.shape}')

# display the first tensor along the new (stacked) dimension, i.e. X_val_tensor
print(f'First tensor in stacked tensor: {X_stacked[0]}')

# display the second tensor along the new (stacked) dimension, i.e. X_te_tensor
print(f'Second tensor in stacked tensor: {X_stacked[1]}')

After stacking: torch.Size([2, 3090, 111])
First tensor in stacked tensor: tensor([[-0.1453, -0.3724, -0.4472,  ...,  0.0314, -0.4161, -0.0543],
        [ 0.3181, -0.8210, -0.6923,  ...,  0.2621, -0.2843, -0.8470],
        [-0.2345,  1.2683,  0.8551,  ...,  1.8157, -0.2999,  1.6332],
        ...,
        [-0.2501, -0.7628, -1.0018,  ..., -1.1039,  0.0705, -0.5239],
        [-0.2879, -1.2734, -0.7343,  ..., -0.2162,  0.4554, -0.5693],
        [-0.2911, -0.4320, -0.7033,  ...,  0.3035, -0.5124,  0.1476]])
Second tensor in stacked tensor: tensor([[-0.1390, -1.0673, -1.6163,  ...,  0.1003, -0.3853, -0.7035],
        [ 0.3771, -1.0615, -1.0949,  ...,  0.2394, -0.2608, -0.9059],
        [-0.2391,  0.4376,  0.5549,  ...,  2.1272, -0.2586,  0.9205],
        ...,
        [-0.2519, -1.6994, -1.4649,  ..., -0.6886,  0.1250, -1.5012],
        [-0.2898, -1.2481, -1.5005,  ..., -0.3608,  0.4626, -1.4830],
        [-0.2923, -0.1725, -0.4869,  ...,  0.7971, -0.3574, -0.5731]])


#### Broadcasting
Broadcasting is a powerful mechanism that allows PyTorch to perform operations on tensors of different shapes. It automatically expands smaller tensors to match the shape of larger tensors without creating copies of the data. This is useful for performing element-wise operations on tensors of different shapes.

**Broadcasting rules**:
1. If the two tensors have a different number of dimensions, the shape of the tensor with fewer dimensions is padded with dimensions of size $1$ on the left.
2. Starting from the *right-most dimension*, the sizes of the two tensors are compared one-by-one:
    - If they are equal, they are compatible (no broadcasting is needed).
    - If one of them is $1$, that tensor is expanded along this dimension to match the size of the other.
    - Otherwise, the dimensions are not compatible and an error is raised.

The result of the operation is a tensor whose size in each dimension is the larger of the two sizes. In particular, it need not have the shape of either input.
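
The shape logic can be checked without creating any data, using `torch.broadcast_shapes` (available in recent PyTorch versions). The sketch below also uses `expand` to show that a broadcast dimension is a view with stride $0$, not a copy. The shapes are chosen only for illustration.

```python
import torch

# Result shapes for compatible inputs
print(torch.broadcast_shapes((5, 3), (3,)))          # torch.Size([5, 3])
print(torch.broadcast_shapes((4, 1, 3), (1, 2, 3)))  # torch.Size([4, 2, 3])

# Incompatible shapes raise an error (4 vs 3 in the right-most dimension)
try:
    torch.broadcast_shapes((3, 4), (2, 3))
except (RuntimeError, ValueError) as err:
    print("Not broadcastable:", err)

# Expanding does not copy data: the new dimension has stride 0
v = torch.arange(3.0)
expanded = v.expand(5, 3)
print(expanded.shape, expanded.stride())  # torch.Size([5, 3]) (0, 1)
```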

In [114]:
# 1. Basic Broadcasting Example

# Create tensors of different shapes
matrix = torch.randn(5, 3)           # Shape: (5, 3)
print('Matrix:', matrix)
print('Matrix shape:', matrix.shape,'\n')

vector = torch.randn(3)              # Shape: (3,)
print('Vector:', vector)
print('Vector:', vector.shape,'\n')

# reshape vector to (5,3) for clarity
vector_reshaped = vector.view(1, 3).expand(5, 3)  # Shape: (5, 3)
print('Vector reshaped:', vector_reshaped)

result1 = matrix + vector           # Vector is broadcast to shape (5, 3)

# Display individual tensors and the result after adding/multiplying
print(f"matrix + vector:\n{result1}")
print('Matrix + reshaped vector:\n', matrix + vector_reshaped)
print(f'Shape of matrix + vector: {result1.shape}\n')

Matrix: tensor([[ 0.8881,  0.6568, -0.3359],
        [-1.1320,  0.5062,  0.6695],
        [-0.0978,  0.2331, -0.5739],
        [ 2.3020,  0.4558, -0.5122],
        [-0.2288,  0.4731, -0.3057]])
Matrix shape: torch.Size([5, 3]) 

Vector: tensor([ 0.5620, -0.2678,  0.0077])
Vector: torch.Size([3]) 

Vector reshaped: tensor([[ 0.5620, -0.2678,  0.0077],
        [ 0.5620, -0.2678,  0.0077],
        [ 0.5620, -0.2678,  0.0077],
        [ 0.5620, -0.2678,  0.0077],
        [ 0.5620, -0.2678,  0.0077]])
matrix + vector:
tensor([[ 1.4500,  0.3890, -0.3282],
        [-0.5700,  0.2384,  0.6772],
        [ 0.4641, -0.0347, -0.5662],
        [ 2.8640,  0.1880, -0.5044],
        [ 0.3331,  0.2053, -0.2980]])
Matrix + reshaped vector:
 tensor([[ 1.4500,  0.3890, -0.3282],
        [-0.5700,  0.2384,  0.6772],
        [ 0.4641, -0.0347, -0.5662],
        [ 2.8640,  0.1880, -0.5044],
        [ 0.3331,  0.2053, -0.2980]])
Shape of matrix + vector: torch.Size([5, 3])



In [115]:
scalar = torch.tensor(2.0)           # Shape: ()
print('Scalar tensor', scalar)
print('Scalar shape', scalar.shape,'\n')

result2 = matrix * scalar           # Scalar is broadcast to shape (5, 3)

print(f"matrix * scalar:\n{result2}")
print(f'Shape of matrix * scalar: {result2.shape}\n')

Scalar tensor tensor(2.)
Scalar shape torch.Size([]) 

matrix * scalar:
tensor([[ 1.7761,  1.3136, -0.6718],
        [-2.2639,  1.0124,  1.3389],
        [-0.1957,  0.4662, -1.1479],
        [ 4.6041,  0.9116, -1.0243],
        [-0.4576,  0.9462, -0.6113]])
Shape of matrix * scalar: torch.Size([5, 3])



In [116]:
# 2. More Complex Broadcasting Example
# Create tensors with compatible shapes for broadcasting
A = torch.randn(4, 1, 3)           # Shape: (4, 1, 3)
B = torch.randn(1, 2, 3)           # Shape: (1, 2, 3)
C = A + B                          # Result shape: (4, 2, 3)

print("\nComplex broadcasting:")
print(f"A shape: {A.shape}")
print(f"B shape: {B.shape}")
print(f"Result shape: {C.shape}")


Complex broadcasting:
A shape: torch.Size([4, 1, 3])
B shape: torch.Size([1, 2, 3])
Result shape: torch.Size([4, 2, 3])


In [117]:
# Common use cases in neural networks

# 1. Adding bias terms
X = torch.randn(32, 10)    # 32 samples, 10 features
bias = torch.randn(10)            # 10 bias terms
output = X + bias          # bias is broadcast to (32, 10)

print(f'X shape: {X.shape}')
print(f'bias shape: {bias.shape}')
print(f'X + bias shape: {output.shape}\n')

# 2. Mean-centering each feature (the first step of batch normalization)
batch_mean = X.mean(dim=0, keepdim=False)  # Shape: (n_features,)
normalized = X - batch_mean      
print(f'Batch mean shape: {batch_mean.shape}')
print(f'Normalized shape: {normalized.shape}\n')

# Alternative with keepdim=True
batch_mean = X.mean(dim=0, keepdim=True)  # Shape: (1, n_features)
normalized = X - batch_mean
print(f'Batch mean shape with keepdim=True: {batch_mean.shape}')
print(f'Normalized shape with keepdim=True: {normalized.shape}\n')

X shape: torch.Size([32, 10])
bias shape: torch.Size([10])
X + bias shape: torch.Size([32, 10])

Batch mean shape: torch.Size([10])
Normalized shape: torch.Size([32, 10])

Batch mean shape with keepdim=True: torch.Size([1, 10])
Normalized shape with keepdim=True: torch.Size([32, 10])




In [118]:
# This will raise an error
a = torch.randn(3, 4)
b = torch.randn(2, 3)
c = a + b  # Error: shapes (3,4) and (2,3) cannot be broadcast

RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 1

### 4. Creating a DataLoader for batching
In today's notebook we will use "mini-batch" gradient descent to train our network. This means that we will use a small batch of data to compute the gradients and update the weights, instead of using the entire dataset. This is done to speed up the training process and to reduce the memory usage.

PyTorch provides a convenient way to create mini-batches using the `DataLoader` class. The `DataLoader` class takes a dataset and creates mini-batches of a specified size. It can also shuffle the data at the start of every epoch (with `shuffle=True`) and load data in parallel using multiple worker processes (with `num_workers`).

Below, we will create a `DataLoader` for our training, validation, and test sets. Later, we will use the `DataLoader` to iterate over the mini-batches during training and evaluation (moving the tensors to the GPU when needed).

In [119]:
# Import TensorDataset and DataLoader
from torch.utils.data import TensorDataset, DataLoader

# Create TensorDatasets
train_dataset = TensorDataset(X_tr_tensor, y_tr_tensor, w_tr_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor, w_val_tensor)
test_dataset = TensorDataset(X_te_tensor, y_te_tensor, w_te_tensor)

batch_size = 32 # Common batch sizes are 32, 64, 128

# Create DataLoaders

# Training DataLoader with shuffling
train_loader = DataLoader(
    train_dataset, 
    batch_size=batch_size,
    shuffle=True        # Shuffle training data to avoid learning order patterns
)
# Validation and Test DataLoaders (no shuffling needed)
val_loader = DataLoader(
    val_dataset, 
    batch_size=batch_size,
    shuffle=False
)
test_loader = DataLoader(
    test_dataset, 
    batch_size=batch_size,
    shuffle=False
)

# Example: Iterating through batches
print("Demonstrating batch iteration:")
# choose device ('mps' for mac and 'cuda' for NVIDIA GPUs)
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
for batch_idx, (features, targets, weights) in enumerate(train_loader):
    # Move batch to device if using GPU
    features = features.to(device)
    targets = targets.to(device)
    weights = weights.to(device)
    
    print(f"Batch {batch_idx}")
    print(f"Features shape: {features.shape}")  # Should be (batch_size, n_features)
    print(f"Targets shape: {targets.shape}")    # Should be (batch_size, 4)
    print(f"Weights shape: {weights.shape}")    # Should be (batch_size,)
    
    if batch_idx >= 2:  # Only show first 3 batches
        break

Demonstrating batch iteration:
Batch 0
Features shape: torch.Size([32, 111])
Targets shape: torch.Size([32, 4])
Weights shape: torch.Size([32])
Batch 1
Features shape: torch.Size([32, 111])
Targets shape: torch.Size([32, 4])
Weights shape: torch.Size([32])
Batch 2
Features shape: torch.Size([32, 111])
Targets shape: torch.Size([32, 4])
Weights shape: torch.Size([32])
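
Note that the last batch is usually smaller than `batch_size`: a `DataLoader` yields $\lceil N / \textup{batch size} \rceil$ batches per epoch unless `drop_last=True` is set. For our training set this gives $\lceil 6180/32 \rceil = 194$ batches, the last one containing $4$ samples. Here is a minimal sketch on synthetic data (the tensors below are placeholders, not the notebook's data):

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

n_samples, n_features, batch_size = 100, 4, 32
X_demo = torch.randn(n_samples, n_features)  # synthetic features
y_demo = torch.randn(n_samples, 2)           # synthetic targets

loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=batch_size, shuffle=False)

# Number of batches per epoch is ceil(n_samples / batch_size)
print(len(loader), math.ceil(n_samples / batch_size))  # 4 4

# The final batch holds the leftover samples
sizes = [xb.shape[0] for xb, _ in loader]
print(sizes)  # [32, 32, 32, 4]
```

This matters when averaging losses over an epoch: averaging per-batch means gives the small final batch the same influence as a full batch.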


### 5. Defining the neural network architecture

#### Layers
We will use the following architecture for our neural network:
- An input layer with `n` neurons (where `n` is the number of features in the dataset)
- A hidden layer with `h` neurons (where `h` is a hyperparameter that we will tune)
- An output layer with `4` neurons (one for each probability class: `P(democrat|C)`, `P(republican|C)`, `P(other|C)`, and `P(non_voter|C)`).

#### Activation functions
We will try multiple activation functions for the hidden layer (ReLU, tanh, sigmoid) and the softmax activation function for the output layer. Thus, mathematically, our neural network can be represented as the composition of functions:
\begin{equation*}
    \mathbb{R}^n \xrightarrow{W^{(1)},b^{(1)}} \mathbb{R}^{h} \xrightarrow{\textup{activation}} \mathbb{R}^{h} \xrightarrow{W^{(2)},b^{(2)}} \mathbb{R}^{4} \xrightarrow{\textup{softmax}} \Delta_4,
\end{equation*}
where:
- $W^{(1)}$ and $b^{(1)}$ are the weights and biases of the first layer, shapes $(h,n)$ and $(h,1)$ respectively.
- $W^{(2)}$ and $b^{(2)}$ are the weights and biases of the second layer, shapes $(4,h)$ and $(4,1)$ respectively.
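
The composition above can be sketched directly with raw tensors, before wrapping it in a model class. This is only an illustration with randomly chosen weights, ReLU as the hidden activation, and four output classes as in the diagram. Since samples are stored as rows, we compute $xW^{\top} + b$ and store the biases as 1D tensors that broadcast over the batch.

```python
import torch

torch.manual_seed(0)
n, h, n_classes, batch = 111, 128, 4, 8  # sizes chosen for illustration

# Layer parameters, using the (out_features, in_features) convention
W1, b1 = torch.randn(h, n) * 0.1, torch.zeros(h)
W2, b2 = torch.randn(n_classes, h) * 0.1, torch.zeros(n_classes)

x = torch.randn(batch, n)             # a batch of inputs, one sample per row

hidden = torch.relu(x @ W1.T + b1)    # R^n -> R^h, then activation
logits = hidden @ W2.T + b2           # R^h -> R^4
probs = torch.softmax(logits, dim=1)  # R^4 -> probability simplex

print(probs.shape)       # torch.Size([8, 4])
print(probs.sum(dim=1))  # each row sums to 1 (up to rounding)
```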

### 6. Defining the loss function

#### Sample loss
In a usual four-class classification problem, there would be a **single** column with the class labels (e.g. $1,2,3,4$). In our case, we have four columns with the probabilities of each class (sometimes called the **soft labels**).

We can *directly* use the four columns as the vector of true probabilities in the **cross-entropy loss function**. Thus, if our output for a particular row is $p = (p_1,p_2,p_3,p_4)$, and the true probabilities are $y = (y_1,y_2,y_3,y_4)$, then our loss function (for the particular sample $C$) is given by:
\begin{equation*}
    \textup{sample loss} = L(p,y;C) = -\sum_{i=1}^{4} y_i \log(p_i).
\end{equation*}
This is the same as the usual cross-entropy loss function, but with the true probabilities from the dataset instead of the class labels.
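
As a quick numerical illustration, here is the sample loss evaluated for one hypothetical county. The distributions `y`, `p_good`, and `p_bad` are invented for this example. The loss is smallest when the prediction equals the true distribution, in which case it equals the entropy of `y`.

```python
import torch

# Hypothetical true distribution and two candidate predictions (each sums to 1)
y = torch.tensor([0.30, 0.05, 0.25, 0.40])
p_good = torch.tensor([0.28, 0.06, 0.26, 0.40])
p_bad = torch.tensor([0.70, 0.10, 0.10, 0.10])

def soft_cross_entropy(p, y):
    """Cross-entropy -sum_i y_i * log(p_i) between soft labels y and prediction p."""
    return -(y * torch.log(p)).sum()

entropy = soft_cross_entropy(y, y)       # lower bound, attained when p == y
loss_good = soft_cross_entropy(p_good, y)
loss_bad = soft_cross_entropy(p_bad, y)

print(f"Entropy of y (best possible loss): {entropy.item():.4f}")
print(f"Loss for a close prediction:       {loss_good.item():.4f}")
print(f"Loss for a poor prediction:        {loss_bad.item():.4f}")
```

One consequence: unlike with hard labels, the loss cannot reach zero here, so the training loss should be judged relative to the entropy of the targets.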

#### Batch loss
Note that different counties contribute differently to the overall vote counts: counties with more voters contribute more. Thus, it is natural to weight each sample loss by the `P(C)` column, which represents the probability that a randomly chosen person is from that county. Since the weights within a batch do not sum to $1$, we also divide by their total, so that the batch loss is a weighted average of the sample losses:
\begin{equation*}
    \textup{batch loss} = L(p,y; \textup{batch}) = \frac{\sum_{C \in \textup{batch}} P(C) \cdot L(p,y;C)}{\sum_{C \in \textup{batch}} P(C)}.
\end{equation*}
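
Here is a small worked example with three hypothetical counties (the losses and weights are made up). It computes the weighted sum of the sample losses and then divides by the total weight, which is the normalization used in the implementation in this section.

```python
import torch

sample_losses = torch.tensor([1.0, 2.0, 4.0])  # hypothetical per-county losses
weights = torch.tensor([0.006, 0.003, 0.001])  # hypothetical P(C) values

weighted_sum = (weights * sample_losses).sum()  # 0.006 + 0.006 + 0.004 = 0.016
weighted_mean = weighted_sum / weights.sum()    # 0.016 / 0.010 = 1.6
plain_mean = sample_losses.mean()               # (1 + 2 + 4) / 3, about 2.333

print(f"Weighted sum:  {weighted_sum.item():.4f}")
print(f"Weighted mean: {weighted_mean.item():.4f}")
print(f"Plain mean:    {plain_mean.item():.4f}")
```

The weighted mean is pulled toward the loss of the most populous county, and, unlike the raw weighted sum, its scale does not depend on which counties happen to land in the batch.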

#### Implementation of custom loss function
We will implement the custom loss function as a PyTorch module. This will allow us to use it in the training loop and to compute the gradients automatically.

In [158]:
class WeightedCrossEntropyLoss(nn.Module):
    """
    Weighted Cross Entropy Loss that computes the expected value of the sample loss.
    
    This loss function is specifically designed for multi-class probability prediction
    where each sample has an associated weight P(C). The loss is computed as:
        E[Loss] = Σ(P(C) * Loss(C)) / Σ(P(C))
    where Loss(C) = -Σ(y_i * log(p_i)) is the cross entropy for sample C,
    y_i are true probabilities, and p_i are predicted probabilities.

    The normalization by Σ(P(C)) ensures the result can be interpreted as
    the expected loss per unit of population weight, rather than per sample.
    """
    def __init__(self):
        super().__init__()
        
    def forward(self, outputs, targets, weights):
        """
        Compute weighted cross entropy loss.

        Args:
            outputs (torch.Tensor): Model predictions (batch_size, n_classes)
            targets (torch.Tensor): True probability distributions (batch_size, n_classes)
            weights (torch.Tensor): Sample weights P(C) (batch_size,)

        Returns:
            torch.Tensor: Expected value of cross entropy loss
                         computed as Σ(P(C) * Loss(C)) / Σ(P(C))
        """
        # Add small epsilon to avoid log(0)
        eps = 1e-7
        outputs = torch.clamp(outputs, eps, 1.0)
        
        # Compute cross entropy for each sample
        sample_losses = -torch.sum(targets * torch.log(outputs), dim=1)
        
        # Weight each sample loss by P(C)
        weighted_losses = sample_losses * weights
        
        # Return expected value (weighted average)
        return weighted_losses.sum() / weights.sum()
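
The class above expects probabilities as `outputs`. As an alternative sketch (not the implementation used in this notebook), the same weighted loss can be computed from raw logits with `F.log_softmax`, which avoids the `clamp` and is numerically more stable. The function name `weighted_soft_ce_from_logits` and the random inputs are hypothetical.

```python
import torch
import torch.nn.functional as F

def weighted_soft_ce_from_logits(logits, targets, weights):
    """Weighted soft-label cross-entropy, computed from raw logits."""
    log_probs = F.log_softmax(logits, dim=1)           # stable log(softmax(.))
    sample_losses = -(targets * log_probs).sum(dim=1)  # one loss per sample
    return (sample_losses * weights).sum() / weights.sum()

torch.manual_seed(0)
logits = torch.randn(5, 4, requires_grad=True)     # hypothetical model outputs
targets = torch.softmax(torch.randn(5, 4), dim=1)  # hypothetical soft labels
weights = torch.rand(5) + 0.01                     # hypothetical P(C) values

loss = weighted_soft_ce_from_logits(logits, targets, weights)
loss.backward()  # gradients flow back to the logits

# Reference: softmax first, then clamp + log, as in the class above
probs = torch.softmax(logits.detach(), dim=1).clamp(1e-7, 1.0)
reference = ((-(targets * probs.log()).sum(dim=1)) * weights).sum() / weights.sum()

print(loss.item(), reference.item())  # the two values agree closely
print(logits.grad.shape)              # torch.Size([5, 4])
```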

#### Creating the model class
We will create a class for our model that inherits from `torch.nn.Module`. This class will contain the following methods:
- `__init__`: constructor that initializes the model
- `forward`: method that defines the forward pass of the model
- `loss`: method that computes the loss function

NOTE: below, `n_hidden` is what we called `h` above, and `n_features` is what we called `n` above.

In [159]:
class VotingModel(nn.Module):
    def __init__(self, 
                 n_features, # Number of input features
                 n_hidden=128, # Number of neurons in hidden layer
                 activation='relu', # Activation function ('relu' or 'tanh')
                 dropout_rate=0.2):
        """
        Args:
            n_features (int): Number of input features
            n_hidden (int): Number of neurons in hidden layer
            activation (str): Activation function ('relu' or 'tanh')
            dropout_rate (float): Dropout probability
        """
        super().__init__()
        
        # Input layer -> Hidden layer
        self.fc1 = nn.Linear(n_features, n_hidden)
        
        # Activation function
        self.activation = nn.ReLU() if activation == 'relu' else nn.Tanh()
        
        # Dropout layer
        self.dropout = nn.Dropout(p=dropout_rate)
        
        # Hidden layer -> Output layer (4 classes)
        self.fc2 = nn.Linear(n_hidden, 4)
        
        # Initialize weights
        self._init_weights()
    
    def _init_weights(self):
        """Initialize weights using He/Kaiming initialization for ReLU
        and Xavier/Glorot initialization for Tanh"""
        if isinstance(self.activation, nn.ReLU):
            # He initialization for ReLU
            nn.init.kaiming_normal_(self.fc1.weight)
            nn.init.kaiming_normal_(self.fc2.weight)
        else:
            # Xavier initialization for Tanh
            nn.init.xavier_normal_(self.fc1.weight)
            nn.init.xavier_normal_(self.fc2.weight)
        
        # Initialize biases to zero
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.bias)
    
    def forward(self, x):
        """
        Forward pass of the model.
        
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, n_features)
            
        Returns:
            torch.Tensor: Raw logits of shape (batch_size, 4)
        """
        # First layer + activation
        x = self.activation(self.fc1(x))
        
        # Apply dropout (only during training)
        x = self.dropout(x)
        
        # Output layer (no activation - will be applied with temperature scaling later)
        x = self.fc2(x)
        
        return x
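
The model returns raw logits, and the training loop later turns them into probabilities with `F.softmax(outputs / temperature, dim=1)`. To illustrate what the temperature does, here is a minimal pure-Python sketch. The function name `softmax_with_temperature` and the logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 1.5, 3.0]  # hypothetical raw outputs for one county
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The largest logit always receives the largest probability. A higher temperature spreads the probability mass more evenly, and a lower temperature concentrates it on the largest logit.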

### 7. Training the model
We will use the following steps to train the model:
1. Iterate over all combinations of hyperparameters.
2. For each combination, initialize the model, optimizer, and loss function.
3. Iterate over `n_epochs`:
    - Iterate over mini-batches:
        - Move the mini-batch to the GPU (if available)
        - Zero the gradients
        - Forward pass
        - Compute the loss
        - Backward pass
        - Update the weights
4. Save the model with the best validation loss.

First, let's define a function for the training loop (since we will want to experiment with different hyperparameters).

In [163]:
def train_model(model, 
                train_loader, 
                val_loader, 
                criterion, 
                optimizer, 
                device, 
                n_epochs, 
                temperature):
    """
        Args:
            model: PyTorch model
            train_loader: DataLoader for training data
            val_loader: DataLoader for validation data
            criterion: Loss function
            optimizer: Optimizer
            device: Device to use ('cpu', 'cuda', or 'mps')
            n_epochs: Number of epochs to train
            temperature: Temperature for softmax scaling
        Returns: tuple with
            best_model_state: Best model state
            best_val_loss: Best validation loss
    """
    
    best_val_loss = float('inf')
    best_model_state = None
    
    # Training loop
    for epoch in range(n_epochs):
        # Training phase
        model.train()
        # Initialize training loss to 0
        train_loss = 0
        # Loop through training batches
        for features, targets, weights in train_loader:
            # Move to device
            features, targets, weights = features.to(device), targets.to(device), weights.to(device)
            # Zero gradients
            optimizer.zero_grad()
            # Forward pass
            outputs = model(features)
            outputs = F.softmax(outputs / temperature, dim=1)  # Apply temperature scaling
            # Compute loss
            loss = criterion(outputs, targets, weights)
            # Backward pass to compute gradients
            loss.backward()
            # Update weights
            optimizer.step()
            # Accumulate training loss
            train_loss += loss.item()
        
        # Average training loss over all batches
        train_loss /= len(train_loader)

        # Validation phase, switch to eval mode
        model.eval()
        # Initialize validation loss to 0
        val_loss = 0
        # Disable gradient calculation
        with torch.no_grad(): 
            # Loop through validation batches
            for features, targets, weights in val_loader: 
                # Move to device
                features, targets, weights = features.to(device), targets.to(device), weights.to(device)
                # Forward pass
                outputs = model(features)
                # Apply temperature scaling
                outputs = F.softmax(outputs / temperature, dim=1)
                # Compute loss for the batch
                loss = criterion(outputs, targets, weights)
                # Accumulate validation loss
                val_loss += loss.item()
        
        # Average validation loss over all batches
        val_loss /= len(val_loader)

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
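            # Save a copy of the model state: state_dict() returns references
            # to the live parameters, which later epochs would keep updating
            best_model_state = {k: v.detach().clone()
                                for k, v in model.state_dict().items()}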
            
        print(f'Epoch {epoch+1}/{n_epochs}:')
        print(f'Training Loss: {train_loss:.6f}')
        print(f'Validation Loss: {val_loss:.6f}')
    
    return best_model_state, best_val_loss

In [164]:
# check if the training loop works

#initialize VotingModel
n_features = X_tr_tensor.shape[1]  # Number of input features
model = VotingModel(n_features, 
                    n_hidden=128, 
                    activation='relu', 
                    dropout_rate=0.2)

In [165]:
# train the model
n_epochs = 5
temperature = 1.0

# Define loss function and optimizer
criterion = WeightedCrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Move model to device
model.to(device)

# Train the model
best_model_state, best_val_loss = train_model(model, 
                                              train_loader, 
                                              val_loader, 
                                              criterion, 
                                              optimizer, 
                                              device, 
                                              n_epochs, 
                                              temperature)

Epoch 1/5:
Training Loss: 1.331157
Validation Loss: 1.123086
Epoch 2/5:
Training Loss: 1.085924
Validation Loss: 1.084971
Epoch 3/5:
Training Loss: 1.046492
Validation Loss: 1.075574
Epoch 4/5:
Training Loss: 1.030230
Validation Loss: 1.067236
Epoch 5/5:
Training Loss: 1.017985
Validation Loss: 1.062286


In [166]:
best_val_loss

1.0622857606288083

In [167]:
best_model_state

OrderedDict([('fc1.weight',
              tensor([[ 1.6026e-01, -1.0483e-01, -1.7721e-01,  ...,  6.6100e-02,
                        7.9640e-03,  1.1258e-01],
                      [ 5.9188e-02, -5.4639e-02,  1.6421e-02,  ..., -2.7527e-02,
                        3.9538e-03, -6.6740e-02],
                      [ 2.3181e-01, -4.0203e-02, -1.2916e-04,  ...,  4.4928e-02,
                       -8.8878e-02, -7.5363e-03],
                      ...,
                      [ 7.3832e-02,  1.2790e-01,  2.0309e-01,  ..., -1.4712e-01,
                       -2.7946e-01,  1.2865e-01],
                      [ 1.9368e-01, -1.1194e-01,  6.6382e-02,  ...,  1.0515e-01,
                        6.7795e-02,  9.5801e-02],
                      [-1.2282e-01, -2.3205e-01, -5.8417e-02,  ..., -1.9362e-01,
                       -2.1366e-02, -5.7792e-02]], device='mps:0')),
             ('fc1.bias',
              tensor([-0.0174,  0.0824,  0.0404, -0.0652, -0.1205, -0.0412, -0.1010, -0.0523,
                    

### 8. Evaluating the model
We will use the following steps to evaluate the model:
1. Load the model with the best validation loss (i.e. the best choice of hyperparameters).
2. Move the model to the GPU (if available).
3. Retrain the model on the entire training set (train + val).
4. Evaluate the model on the test set:
    - Move the test set to the GPU (if available)
    - Forward pass
    - Compute the weighted cross entropy loss on the whole test set

In [168]:
def evaluate_model(model, test_loader, criterion, device, temperature):
    """
    Evaluate model on test set and compute country-wide predictions.
    
    Args:
        model: Trained PyTorch model
        test_loader: DataLoader for test data
        criterion: Loss function
        device: Device to use ('cpu' or 'cuda' or 'mps')
        temperature: Temperature for softmax scaling
    
    Returns:
        test_loss: Average loss on test set
        country_true: True probabilities for whole country (size 4)
        country_pred: Predicted probabilities for whole country (size 4)
    """
    model.eval()  # Set model to evaluation mode
    test_loss = 0
    
    # Arrays to store predictions and true values for all counties
    all_preds = []
    all_true = []
    all_weights = []
    
    with torch.no_grad():  # No need to track gradients
        for features, targets, weights in test_loader:
            # Move to device
            features = features.to(device)
            targets = targets.to(device)
            weights = weights.to(device)
            
            # Forward pass
            outputs = model(features)
            outputs = F.softmax(outputs / temperature, dim=1)
            
            # Compute loss
            loss = criterion(outputs, targets, weights)

            # Accumulate loss
            test_loss += loss.item()
            
            # Store predictions and true values
            all_preds.append(outputs.cpu())
            all_true.append(targets.cpu())
            all_weights.append(weights.cpu())
    
    # Concatenate all batches
    all_preds = torch.cat(all_preds, dim=0)
    all_true = torch.cat(all_true, dim=0)
    all_weights = torch.cat(all_weights, dim=0)
    
    # Compute weighted average for whole country
    # Expand weights to match prediction shape
    weights_expanded = all_weights.unsqueeze(1)  # Shape: (n_counties, 1)
    
    # Compute weighted sum of predictions and true values
    country_pred = (all_preds * weights_expanded).sum(dim=0)  # Shape: (4,)
    country_true = (all_true * weights_expanded).sum(dim=0)   # Shape: (4,)
    
    # Compute average test loss
    test_loss /= len(test_loader)
    
    return test_loss, country_true, country_pred

# Use the function to evaluate your model
model.load_state_dict(best_model_state)  # Load best model state
model.to(device)
test_loss, country_true, country_pred = evaluate_model(
    model, test_loader, criterion, device, temperature
)
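
Note that averaging the per-batch losses, as `test_loss /= len(test_loader)` does, generally differs from the weighted cross entropy over the whole test set, because each batch is normalized by its own weight sum. The following pure-Python sketch shows the difference with hypothetical per-sample losses and weights. The helper `weighted_mean` is not part of the notebook.

```python
def weighted_mean(losses, weights):
    """Sum of w * loss divided by sum of w."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

# Hypothetical per-county losses and weights, split into two batches
batches = [
    ([1.0, 2.0], [0.1, 0.1]),   # (sample losses, weights P(C)) for batch 1
    ([3.0, 5.0], [0.6, 0.2]),   # batch 2 carries most of the weight
]

# Approach 1: mean of the per-batch weighted means
mean_of_batches = sum(weighted_mean(l, w) for l, w in batches) / len(batches)

# Approach 2: accumulate numerator and denominator across all batches
numerator = sum(l * w for ls, ws in batches for l, w in zip(ls, ws))
denominator = sum(w for _, ws in batches for w in ws)
whole_set = numerator / denominator

print(f"mean of batch losses: {mean_of_batches:.4f}")
print(f"whole-set weighted loss: {whole_set:.4f}")
```

If the exact whole-set value is wanted, one option would be to call `criterion(all_preds, all_true, all_weights)` once on the concatenated tensors inside `evaluate_model`.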

In [169]:
# convert country_true and country_pred into rows of a dataframe.
# Columns should be the target columns, index should be ['True', 'Predicted']
df_results = pd.DataFrame(
    data=np.vstack((country_true.numpy(), country_pred.numpy())),
    columns=target_cols,
    index=['True', 'Predicted']
)
df_results

           P(democrat|C)  P(other|C)  P(republican|C)  P(non_voter|C)
True             0.24442     0.00902         0.224074        0.522486
Predicted       0.189179    0.007039         0.192075        0.611707
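
The `True` and `Predicted` rows above are weighted sums of the county-level distributions, i.e. $P(\text{class}) = \sum_C P(C)\,P(\text{class}\mid C)$. Here is a small pure-Python sketch of that aggregation with three hypothetical counties whose weights are assumed to sum to $1$. All numbers are invented.

```python
# Hypothetical county-level distributions over 4 classes (each row sums to 1)
county_probs = [
    [0.30, 0.01, 0.19, 0.50],
    [0.20, 0.02, 0.28, 0.50],
    [0.25, 0.01, 0.24, 0.50],
]
county_weights = [0.5, 0.3, 0.2]  # assumed share of each county, P(C)

n_classes = len(county_probs[0])
# Weighted sum over counties, one entry per class
country = [
    sum(w * row[k] for w, row in zip(county_weights, county_probs))
    for k in range(n_classes)
]
print([round(p, 4) for p in country])
```

Because the weights sum to $1$ and each county row is a probability distribution, the country-wide vector is again a probability distribution.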


In [157]:
# print the test loss
print(f"Test Loss: {test_loss:.6f}")

Test Loss: 1.085906


### 9. Hyperparameter tuning

#### Dropout (for regularization)
We use dropout in the hidden layer as a regularizer. During training, dropout randomly sets a fraction of the hidden-layer activations to $0$, which forces the network to learn more robust features and reduces overfitting. At evaluation time, dropout is disabled. The dropout rate is a hyperparameter that we will tune.

For example, if the dropout rate is $0.5$, then during training, each neuron in the hidden layer has a $50\%$ chance of being set to $0$. This forces the network to not rely on any particular neuron and to "use" all the neurons in the hidden layer. 
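
To make this concrete, here is a pure-Python sketch of the "inverted dropout" scheme that `nn.Dropout` follows: during training, surviving activations are scaled by $1/(1-p)$ so that their expected value is unchanged. The function below is a hypothetical stand-in, not PyTorch's implementation, and it assumes $p < 1$.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(activations)  # identity at evaluation time
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

rng = random.Random(0)               # fixed seed so the sketch is repeatable
hidden = [1.0] * 10_000              # hypothetical hidden-layer activations
dropped = dropout(hidden, p=0.5, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(f"fraction zeroed: {zero_fraction:.3f}, mean activation: {mean_activation:.3f}")
print(dropout([1.0, 2.0, 3.0], p=0.5, training=False))
```

With $p = 0.5$, roughly half of the values are zeroed and the survivors are doubled, so the mean activation stays close to its original value.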

#### Weight decay
We will also use weight decay (L2 regularization) to prevent overfitting. Weight decay adds a penalty to the loss function that is proportional to the square of the weights. This forces the network to learn smaller weights and prevents overfitting. The weight decay coefficient is a hyperparameter that we will tune.
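
As an illustration of the penalty, here is a pure-Python sketch of a single plain gradient step with an L2 term. The function name and numbers are hypothetical. In PyTorch, weight decay is typically set through the `weight_decay` argument of an optimizer such as `optim.Adam`; the optimizer calls shown in this notebook do not set it.

```python
def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One gradient step on loss + (weight_decay / 2) * sum(w ** 2)."""
    # The gradient of the penalty term with respect to w is weight_decay * w
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

weights = [1.0, -2.0, 0.5]   # hypothetical parameters
grads = [0.0, 0.0, 0.0]      # zero data gradient isolates the decay effect

decayed = sgd_step_with_weight_decay(weights, grads, lr=0.1, weight_decay=0.01)
print(decayed)
```

With a zero data gradient, every weight shrinks toward $0$ by the factor $1 - \text{lr} \cdot \text{weight\_decay}$, which is where the name "weight decay" comes from.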

#### Hyperparameters to be tuned
In summary, our training loop will be used to train the weights and also tune the following hyperparameters:
- `n_hidden`: number of neurons in the hidden layer
- `learning_rate`: learning rate for the optimizer
- `batch_size`: size of the mini-batches
- `temperature`: temperature for the softmax function (the logits are divided by the temperature before softmax is applied; a higher temperature gives a more uniform distribution, a lower temperature a more peaked one)
- `dropout_rate`: dropout rate for the dropout layer
- `activation_function`: activation function for the hidden layer

We organize the hyperparameters in a dictionary for easy access during training.

In [130]:
# Define hyperparameter configurations
hyperparams = {
    # Architecture
    'n_hidden': [32, 256],
    'activation_function': ['relu', 'tanh'],
    'dropout_rate': [0.1, 0.5],
    
    # Training process
    'learning_rate': [1e-3, 1e-4],
    'batch_size': [32, 64],
    'temperature': [1.0, 2.0]  # For softmax
}

# Example of a specific configuration
example_config = {
    'n_hidden': 128,
    'activation_function': 'relu',
    'dropout_rate': 0.2,
    'learning_rate': 1e-3,
    'batch_size': 64,
    'temperature': 1.0
}

Next, let's define a function for the hyperparameter tuning, which requires us to try all combinations of hyperparameters and do a separate training loop for each combo. This is called a **grid search**, because we are searching for the best combination out of a grid of hyperparameters.

In [131]:
# Grid search over hyperparameters
def tune_hyperparameters(train_loader, 
                         val_loader, 
                         device, 
                         hyperparams):
    best_val_loss = float('inf')
    best_config = None
    best_model_state = None
    
    # Iterate over hyperparameter combinations
    for n_hidden in hyperparams['n_hidden']:
        for act_fn in hyperparams['activation_function']:
            for dropout in hyperparams['dropout_rate']:
                for lr in hyperparams['learning_rate']:
                    for temp in hyperparams['temperature']:
                        # Initialize model and optimizer
                        model = VotingModel(n_features=X_tr_tensor.shape[1],
                                            n_hidden=n_hidden, 
                                            activation=act_fn,
                                            dropout_rate=dropout).to(device)
                        optimizer = optim.Adam(model.parameters(), lr=lr)
                        criterion = WeightedCrossEntropyLoss()
                        
                        # Display hyperparameter configuration
                        print(f"Training with config: {n_hidden=}, {act_fn=}, {dropout=}, {lr=}, {temp=}")

                        # Train model
                        model_state, val_loss = train_model(
                            model, train_loader, val_loader,
                            criterion, optimizer, device,
                            n_epochs=5, temperature=temp
                        )
                        
                        # Update best model if needed
                        if val_loss < best_val_loss:
                            best_val_loss = val_loss
                            best_config = {
                                'n_hidden': n_hidden,
                                'activation': act_fn,
                                'dropout': dropout,
                                'learning_rate': lr,
                                'temperature': temp
                            }
                            best_model_state = model_state
    
    return best_config, best_model_state, best_val_loss
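
The nested loops above could also be flattened with `itertools.product`. The sketch below only enumerates the configurations and does not train anything. The helper name `iter_configs` is hypothetical, and the grid is a copy of the `hyperparams` dictionary defined earlier.

```python
from itertools import product

# Copy of the grid defined earlier in the notebook
hyperparams = {
    'n_hidden': [32, 256],
    'activation_function': ['relu', 'tanh'],
    'dropout_rate': [0.1, 0.5],
    'learning_rate': [1e-3, 1e-4],
    'batch_size': [32, 64],
    'temperature': [1.0, 2.0],
}

def iter_configs(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(hyperparams))
print(len(configs))   # 2**6 combinations
print(configs[0])
```

The full grid has $2^6 = 64$ combinations. The loops in `tune_hyperparameters` do not iterate over `batch_size`, since the data loaders are passed in ready-made, so that function visits $32$ of them.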

In [132]:
# run the grid search
best_config, best_model_state, best_val_loss = tune_hyperparameters(train_loader=train_loader,
                                                                    val_loader=val_loader,
                                                                    device=device,
                                                                    hyperparams=hyperparams)
                                                                    

Training with config: n_hidden=32, act_fn='relu', dropout=0.1, lr=0.001, temp=1.0
Epoch 1/5:
Training Loss: 0.000448
Validation Loss: 0.000362
Epoch 2/5:
Training Loss: 0.000350
Validation Loss: 0.000354
Epoch 3/5:
Training Loss: 0.000344
Validation Loss: 0.000365
Epoch 4/5:
Training Loss: 0.000337
Validation Loss: 0.000341
Epoch 5/5:
Training Loss: 0.000331
Validation Loss: 0.000340
Training with config: n_hidden=32, act_fn='relu', dropout=0.1, lr=0.001, temp=2.0
Epoch 1/5:
Training Loss: 0.000368
Validation Loss: 0.000341
Epoch 2/5:
Training Loss: 0.000329
Validation Loss: 0.000339
Epoch 3/5:
Training Loss: 0.000325
Validation Loss: 0.000337
Epoch 4/5:
Training Loss: 0.000321
Validation Loss: 0.000336
Epoch 5/5:
Training Loss: 0.000320
Validation Loss: 0.000335
Training with config: n_hidden=32, act_fn='relu', dropout=0.1, lr=0.0001, temp=1.0
Epoch 1/5:
Training Loss: 0.000498
Validation Loss: 0.000408
Epoch 2/5:
Training Loss: 0.000423
Validation Loss: 0.000376
Epoch 3/5:
Training L

In [133]:
best_val_loss

0.03213225838044309