### Lab 3.1: Batching and Regularization

In this lab you will learn how to set up a dataset to be processed in batches, rather than processing the entire dataset in each training iteration, and explore neural network regularization.

In [33]:
import numpy as np
import torch

In [34]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables)

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

In [35]:
X.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [36]:
# The raw labels appear with and without a trailing period; map all four spellings to 0/1.
y = y['income'].map({'<=50K':0,'<=50K.':0,'>50K':1,'>50K.':1})

Here I remove the rows with missing feature values from both the features and the labels.

In [37]:
bad = X.isna().any(axis=1)
X = X[~bad]
y = y[~bad]
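
The masking pattern above can be checked on a toy example. This is only a sketch: the small `X_toy` frame and `y_toy` series are made-up stand-ins, not rows from the Adult dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the middle row has a missing feature value.
X_toy = pd.DataFrame({'age': [25, 40, 31],
                      'workclass': ['Private', np.nan, 'State-gov']})
y_toy = pd.Series([0, 1, 0])

# True for every row that contains at least one NaN.
bad = X_toy.isna().any(axis=1)

# Apply the same mask to features and labels so they stay aligned.
X_clean = X_toy[~bad]
y_clean = y_toy[~bad]

print(bad.tolist())
print(len(X_clean), len(y_clean))
```

Filtering `X` and `y` with the same boolean mask keeps each label paired with its feature row.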

Selecting only the numeric variables:

In [38]:
X = X[['age','fnlwgt','education-num','capital-gain','capital-loss','hours-per-week']]

In [39]:
y = y.values
X = X.values.astype('float64')

To make the learning algorithm work more smoothly, we will standardize each feature: subtract its mean and divide by its standard deviation.

Here `np.mean` calculates a mean and `np.std` a standard deviation, and `axis=0` tells NumPy to calculate them over the rows (one value for each column).

In [40]:
X -= np.mean(X,axis=0)
X /= np.std(X,axis=0)
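
As a quick sanity check of what `axis=0` does, here is a minimal sketch on a made-up 4x2 array (`X_toy` is illustrative only). After standardization each column should have mean close to 0 and standard deviation close to 1.

```python
import numpy as np

# Toy feature matrix: 4 rows (samples) x 2 columns (features), values made up.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

col_mean = np.mean(X_toy, axis=0)  # one mean per column, shape (2,)
col_std = np.std(X_toy, axis=0)    # one standard deviation per column, shape (2,)

# Broadcasting applies the per-column statistics to every row.
X_std = (X_toy - col_mean) / col_std

print(col_mean)
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```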

Now we will convert our `X` and `y` arrays to torch Tensors.

In [41]:
X = torch.tensor(X).float()
y = torch.tensor(y).long()
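
The `.float()` and `.long()` calls matter for what follows: NumPy arrays are `float64` by default, while `nn.Linear` layers hold `float32` weights by default, and `nn.CrossEntropyLoss` takes class indices as 64-bit integers. The sketch below illustrates the conversions on made-up arrays (`X_np`, `y_np` are not the lab's data).

```python
import numpy as np
import torch
from torch import nn

X_np = np.array([[0.5, -1.2], [1.0, 0.3]])  # NumPy defaults to float64
y_np = np.array([0, 1])                     # integer class labels

X_t = torch.tensor(X_np)         # dtype is inherited from NumPy: float64
X_f = X_t.float()                # float32, matching nn.Linear's default weights
y_l = torch.tensor(y_np).long()  # int64 class indices

# A float32 input passes through a default-initialized layer without dtype errors.
out = nn.Linear(2, 3)(X_f)

print(X_t.dtype, X_f.dtype, y_l.dtype, out.shape)
```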

### Exercises

1. Divide the data into train and test splits.
2. Create a neural network for this dataset.
3. Use `TensorDataset` and `DataLoader` to batch the dataset during training.  
4. Use the `weight_decay` parameter of `optim.SGD` to introduce L2 regularization during training. Evaluate the effect of regularization on test set accuracy.
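
To see why `weight_decay` counts as L2 regularization for exercise 4, the sketch below compares one plain SGD step with `weight_decay=wd` against one step where the penalty `(wd / 2) * ||w||^2` is added to the loss by hand. The toy loss and the values of `lr` and `wd` are arbitrary choices for illustration; for plain SGD without momentum the two updates should agree.

```python
import torch

torch.manual_seed(0)
lr, wd = 0.1, 0.01

# Two identical copies of a toy weight vector.
w_a = torch.randn(5, requires_grad=True)
w_b = w_a.detach().clone().requires_grad_(True)

def data_loss(w):
    # Arbitrary toy loss, used only to produce a gradient.
    return (w ** 3).sum()

# (a) Let the optimizer apply weight decay.
opt_a = torch.optim.SGD([w_a], lr=lr, weight_decay=wd)
opt_a.zero_grad()
data_loss(w_a).backward()
opt_a.step()

# (b) No weight decay in the optimizer; add the L2 penalty to the loss instead.
opt_b = torch.optim.SGD([w_b], lr=lr, weight_decay=0.0)
opt_b.zero_grad()
(data_loss(w_b) + 0.5 * wd * (w_b ** 2).sum()).backward()
opt_b.step()

same = torch.allclose(w_a, w_b, atol=1e-6)
print(same)
```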

In [42]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.1,random_state=42)

In [43]:
X_train.shape

torch.Size([42858, 6])

In [44]:
y_train.max()

tensor(1)

In [45]:
import torch
from torch import nn

nn_model = nn.Sequential(
    nn.Linear(6,100),
    nn.ReLU(),
    nn.Linear(100,2)
)

In [46]:
from torch.utils.data import TensorDataset, DataLoader
train_ds = TensorDataset(X_train,y_train)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
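
To illustrate what the loader yields, here is a small sketch with random stand-in tensors (100 samples rather than the real training set; `X_toy`, `y_toy`, and `toy_loader` are hypothetical names). With `batch_size=32`, 100 samples give three full batches and one final batch of 4.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy stand-ins for X_train and y_train: 100 samples, 6 features, 2 classes.
X_toy = torch.randn(100, 6)
y_toy = torch.randint(0, 2, (100,))

toy_ds = TensorDataset(X_toy, y_toy)
toy_loader = DataLoader(toy_ds, batch_size=32, shuffle=True)

# Each iteration yields one (features, labels) pair of at most 32 rows.
batch_sizes = [(xb.shape[0], yb.shape[0]) for xb, yb in toy_loader]

print(len(toy_loader))
print(batch_sizes)
```

Because `shuffle=True`, the rows are drawn in a new random order on every pass over the loader, but the batch sizes stay the same.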

In [47]:
loss_fn = nn.CrossEntropyLoss()

In [48]:
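def calc_accuracy(model,X,y):
    # Evaluate the model passed in (not the global nn_model); no gradients are needed here.
    with torch.no_grad():
        z = model(X)
    pred = torch.argmax(z,dim=-1)  # predicted class = index of the largest logit
    acc = (pred==y).float().mean().item()
    return acc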

In [54]:
def train_model(model,weight_decay):
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=weight_decay)

    for epoch in range(10):
        for (X_batch,y_batch) in train_loader:
            opt.zero_grad()

            z = model(X_batch)
            loss = loss_fn(z,y_batch)

            loss.backward()
            opt.step()
        
    train_acc = calc_accuracy(model,X_train,y_train)
    test_acc = calc_accuracy(model,X_test,y_test)

    print(f'train acc: {train_acc}, test acc: {test_acc}')

In [55]:
train_model(nn_model,weight_decay=0)

train acc: 0.8171403408050537, test acc: 0.8141927123069763


In [58]:
train_model(nn_model,weight_decay=1e-2)

train acc: 0.7582948207855225, test acc: 0.7518370747566223


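With `weight_decay=1e-2`, both train and test accuracy dropped (from about 0.82/0.81 to about 0.76/0.75), so L2 regularization (weight decay) doesn't seem to be helping anything for this particular model and dataset. Note that the second call kept training the same `nn_model` instead of starting from a freshly initialized network, so the two runs are not a fully clean comparison.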
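
Since both `train_model` calls above receive the same `nn_model`, one way to make the comparison cleaner is to build a new network for each `weight_decay` value. The sketch below does this on synthetic stand-in data so that it runs on its own; `make_model`, `accuracy`, and the synthetic tensors are hypothetical helpers, and the accuracies it prints say nothing about the Adult dataset.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic stand-in data: 6 features, label depends on the first two.
X_syn = torch.randn(600, 6)
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).long()
X_tr, y_tr, X_te, y_te = X_syn[:500], y_syn[:500], X_syn[500:], y_syn[500:]
loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=32, shuffle=True)

def make_model():
    # Same architecture as nn_model in the lab, rebuilt from scratch on each call.
    return nn.Sequential(nn.Linear(6, 100), nn.ReLU(), nn.Linear(100, 2))

def accuracy(model, X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=-1) == y).float().mean().item()

loss_fn = nn.CrossEntropyLoss()
results = {}
for wd in [0.0, 1e-3, 1e-2]:
    model = make_model()  # fresh weights for every setting
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=wd)
    for epoch in range(5):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    results[wd] = (accuracy(model, X_tr, y_tr), accuracy(model, X_te, y_te))

for wd, (tr, te) in results.items():
    print(f'weight_decay={wd}: train acc {tr:.3f}, test acc {te:.3f}')
```

Applying the same pattern to the lab's `X_train`, `X_test`, and `train_loader` would give each setting the same starting conditions up to random initialization.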