### Lab 3.1: Batching and Regularization

In this lab you will learn how to set up a dataset to be processed in batches, rather than processing the entire dataset in each training iteration, and explore neural network regularization.

In [20]:
import numpy as np
import torch

In [21]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables)

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

In [22]:
X.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [23]:
y = y['income'].map({'<=50K':0,'<=50K.':0,'>50K':1,'>50K.':1})

In [24]:
X = X[['age','fnlwgt','education-num','capital-gain','capital-loss','hours-per-week']]

In [25]:
y = y.values
X = X.values.astype('float64')

To make the learning algorithm work more smoothly, we will subtract the mean of each feature.

Here `np.mean` calculates a mean, and `axis=0` tells NumPy to average down the rows, which produces one mean per column (that is, one per feature).

In [26]:
X -= np.mean(X,axis=0)
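
As a small illustration of what `axis=0` does, the sketch below centers a toy array and checks that every column mean is zero afterwards. The array `A` and its values are made up for illustration and are not part of the lab data.

```python
import numpy as np

# Toy data (illustrative only): 3 rows (samples), 2 columns (features)
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = np.mean(A, axis=0)  # one mean per column, shape (2,)
A_centered = A - col_means      # broadcasting subtracts each column's mean

print(col_means)                # [ 2. 20.]
print(A_centered.mean(axis=0))  # [0. 0.]
```

Note that centering changes only the offset of each feature, not its spread: a column with values in the hundreds of thousands still has values of that magnitude after the mean is removed.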

Now we will convert our `X` and `y` arrays to torch Tensors.

In [27]:
X = torch.tensor(X).float()
y = torch.tensor(y).long()

### Exercises

1. Divide the data into train and test splits.
2. Create a neural network for this dataset.
3. Use `TensorDataset` and `DataLoader` to batch the dataset during training.  
4. Use the `weight_decay` parameter of `optim.SGD` to introduce L2 regularization during training. Evaluate the effect of regularization on test set accuracy.
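
To make the last exercise more concrete, here is a minimal sketch of what `weight_decay` does in plain `torch.optim.SGD` (no momentum): it adds `weight_decay * w` to the gradient before the update, which is the gradient of an L2 penalty on the weights. The parameter `w` and the values of `lr` and `wd` are arbitrary choices for illustration. The loss is deliberately constructed to have zero gradient, so any change in `w` comes from weight decay alone.

```python
import torch

# A single parameter with known values (illustrative, not from the lab)
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))

lr, wd = 0.1, 0.5
opt = torch.optim.SGD([w], lr=lr, weight_decay=wd)

# A loss whose gradient with respect to w is zero everywhere
loss = (w * 0.0).sum()
opt.zero_grad()
loss.backward()
opt.step()

# Plain SGD with weight decay: w_new = w - lr * (grad + wd * w) = w * (1 - lr * wd)
expected = torch.tensor([1.0, -2.0]) * (1 - lr * wd)
print(w.data)
print(expected)
```

Each step therefore shrinks the weights toward zero by a factor of `1 - lr * wd`, on top of whatever the data gradient contributes.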

In [28]:
# 1. Divide data into train and test splits
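# Import the function directly: a bare `import sklearn` does not guarantee
# that the `sklearn.model_selection` submodule is loaded
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.75)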

In [29]:
# 2. Create a neural network for this dataset. 
mlp_model = torch.nn.Sequential(
    torch.nn.Linear(6, 100), # 6 inputs, 1 hidden layer of size 100
    
    # hidden activation function: the nonlinearity that lets the network model more than a linear map
    torch.nn.ReLU(),
    
    torch.nn.Linear(100, 2) # 100 inputs, 2 outputs
)

# Create a cross-entropy loss function and a stochastic gradient descent (SGD) optimizer
loss_fn = torch.nn.CrossEntropyLoss()
lr = 1e-2
opt = torch.optim.SGD(mlp_model.parameters(), lr=lr)
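
Before training, it can help to sanity-check the shapes. The sketch below rebuilds the same architecture under the hypothetical name `model` and passes a fake batch through it. `CrossEntropyLoss` expects raw logits of shape `(batch, classes)` and integer class labels of dtype `long`, which is why the network has no softmax layer at the end.

```python
import torch

# Same architecture as above, rebuilt so this snippet is self-contained
model = torch.nn.Sequential(
    torch.nn.Linear(6, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 2),
)

x = torch.randn(4, 6)                # fake batch: 4 samples, 6 features
target = torch.tensor([0, 1, 1, 0])  # fake class labels (dtype long)

logits = model(x)                    # raw, unnormalized scores
loss = torch.nn.CrossEntropyLoss()(logits, target)

print(logits.shape)  # torch.Size([4, 2])
print(loss.item())
```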

In [30]:

# 3. Use TensorDataset and DataLoader to batch the dataset during training.

batch = 32

train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch, shuffle=True)

test_dataset = torch.utils.data.TensorDataset(X_test, y_test)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch, shuffle=False)

epochs = 100

for epoch in range(epochs):
    total_loss = 0  # reset the running loss so each epoch reports its own mean
    for batch_X, batch_y in train_dataloader:
        opt.zero_grad()  # Zero out gradients

        z = mlp_model(batch_X)  # Forward pass
        loss = loss_fn(z, batch_y)  # Compute loss

        loss.backward()  # Backpropagation
        opt.step()  # Apply gradients

        total_loss += loss.item()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_dataloader):.4f}")



Epoch 1/100, Loss: nan
Epoch 2/100, Loss: nan
Epoch 3/100, Loss: nan
Epoch 4/100, Loss: nan
Epoch 5/100, Loss: nan
Epoch 6/100, Loss: nan
Epoch 7/100, Loss: nan
Epoch 8/100, Loss: nan
Epoch 9/100, Loss: nan
Epoch 10/100, Loss: nan
Epoch 11/100, Loss: nan
Epoch 12/100, Loss: nan
Epoch 13/100, Loss: nan
Epoch 14/100, Loss: nan
Epoch 15/100, Loss: nan
Epoch 16/100, Loss: nan
Epoch 17/100, Loss: nan
Epoch 18/100, Loss: nan
Epoch 19/100, Loss: nan
Epoch 20/100, Loss: nan
Epoch 21/100, Loss: nan
Epoch 22/100, Loss: nan
Epoch 23/100, Loss: nan
Epoch 24/100, Loss: nan
Epoch 25/100, Loss: nan
Epoch 26/100, Loss: nan
Epoch 27/100, Loss: nan
Epoch 28/100, Loss: nan
Epoch 29/100, Loss: nan
Epoch 30/100, Loss: nan
Epoch 31/100, Loss: nan
Epoch 32/100, Loss: nan
Epoch 33/100, Loss: nan
Epoch 34/100, Loss: nan
Epoch 35/100, Loss: nan
Epoch 36/100, Loss: nan
Epoch 37/100, Loss: nan
Epoch 38/100, Loss: nan
Epoch 39/100, Loss: nan
Epoch 40/100, Loss: nan
Epoch 41/100, Loss: nan
Epoch 42/100, Loss: nan
E

KeyboardInterrupt: 
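
The loss above is `nan` from the first epoch. One plausible cause (an assumption, not verified against this run) is that the features were centered but never rescaled, so columns such as `fnlwgt` and `capital-gain` keep very large magnitudes and the SGD updates can overflow at `lr = 1e-2`. The sketch below shows one way to address this and to complete exercise 4. It uses synthetic data as a stand-in for the lab data, standardizes with training-set statistics, and compares test accuracy with and without `weight_decay`. The helper `train_and_evaluate`, the synthetic labels, the short epoch count, and the decay value `1e-2` are all illustrative choices.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic stand-in for the lab data (assumption): 6 features, binary labels
n = 2000
X = torch.randn(n, 6)
y = (X[:, 0] + X[:, 1] > 0).long()  # made-up labels
X[:, 1] = X[:, 1] * 1e5             # mimic one feature with a huge raw scale

# Hold out the last 25% as a test set
n_train = int(0.75 * n)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Standardize with training-set statistics only, then reuse them on the test set
mean = X_train.mean(dim=0)
std = X_train.std(dim=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=32, shuffle=False)

def train_and_evaluate(weight_decay, epochs=5):
    """Train a fresh MLP and return (last-epoch mean train loss, test accuracy)."""
    model = torch.nn.Sequential(
        torch.nn.Linear(6, 100), torch.nn.ReLU(), torch.nn.Linear(100, 2))
    loss_fn = torch.nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=weight_decay)

    for epoch in range(epochs):
        total_loss = 0.0  # running loss for this epoch only
        for batch_X, batch_y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(batch_X), batch_y)
            loss.backward()
            opt.step()
            total_loss += loss.item()
        mean_loss = total_loss / len(train_loader)

    # Evaluate accuracy on the held-out test set without tracking gradients
    model.eval()
    correct = 0
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            preds = model(batch_X).argmax(dim=1)
            correct += (preds == batch_y).sum().item()
    return mean_loss, correct / len(y_test)

results = {}
for wd in [0.0, 1e-2]:
    results[wd] = train_and_evaluate(wd)
    train_loss, test_acc = results[wd]
    print(f"weight_decay={wd}: train loss {train_loss:.4f}, test accuracy {test_acc:.3f}")
```

On the real dataset, the same pattern would apply to `X_train` and `X_test` from the split above: compute the mean and standard deviation on the training split, apply them to both splits, and then sweep a few `weight_decay` values to see how test accuracy responds.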