# Basic Tutorial for GEORGE

In this notebook, we demonstrate a simple experiment comparing empirical risk minimization (ERM) and our method (GEORGE) on the MNIST dataset, using a small three-layer CNN model. In this simple example, GEORGE improves worst-case accuracy (i.e., the minimum accuracy over any subclass) compared to ERM. More sophisticated experiments are described in the blog post (and paper, coming soon). The notebook can be run with or without GPU support. For a script version rather than a notebook, see `stratification/run.py`.

There are four main sections to this notebook:
1. **Setup**: Imports and setting up the dataset and model.
2. **Train ERM Model**: we train an empirical risk minimization (ERM) model on the `superclass` labels.
3. **Cluster Activations**: using the feature representation of the ERM model, we leverage dimensionality reduction and clustering techniques in order to estimate approximate `subclass` labels for each example.
4. **Train "GEORGE" Model**: we train a new model that exploits the recovered `subclass` labels to improve worst-case performance on them, using group distributionally robust optimization (GDRO) \[[Sagawa et al. (2020)](https://arxiv.org/abs/1911.08731)\].

## 1. Setup

### 1.1 Imports and configuration setup
Before you start, make sure you have set up the repository correctly and installed all dependencies, as described in the README.

All training options are handled by a single `config` object. In this tutorial, we use the configuration provided in `demo_config.json`. To see how configuration files are defined, validated, and optionally modified via the command line, check out `stratification/utils/parse_args.py` and `stratification/utils/schema.py`.
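To make the idea of command-line modification concrete, here is a minimal sketch of applying dotted `key=value` overrides (such as `reduction_config.components=8`) to a nested configuration dictionary. The helper `apply_overrides` is hypothetical and written only for illustration; it is not the logic in `stratification/utils/parse_args.py`, which also validates the result against a schema.

```python
import json


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a nested dict and return a new dict.

    Hypothetical helper for illustration; not part of the repository.
    """
    result = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        try:
            # Interpret JSON literals such as numbers or quoted strings
            node[leaf] = json.loads(raw_value)
        except json.JSONDecodeError:
            # Anything else is kept as a plain string
            node[leaf] = raw_value
    return result


base = {"mode": "erm", "reduction_config": {"model": "umap", "components": 2}}
updated = apply_overrides(base, ["mode=george", "reduction_config.components=8"])
print(json.dumps(updated, indent=4))
```

The original dictionary is left untouched, so the same base configuration can be reused across runs with different overrides.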

In [2]:
import shutil
import os

# Define the source and destination paths
source_folder = '/kaggle/input/hidden-stratification'
destination_folder = '/kaggle/working/hidden-stratification'

# Create the destination folder if it doesn't exist
os.makedirs(destination_folder, exist_ok=True)

# Copy the contents of the source folder to the destination folder
for item in os.listdir(source_folder):
    s = os.path.join(source_folder, item)
    d = os.path.join(destination_folder, item)
    if os.path.isdir(s):
        shutil.copytree(s, d, dirs_exist_ok=True)  # Copy directories
    else:
        shutil.copy2(s, d)  # Copy files

print(f"Contents of {source_folder} copied to {destination_folder}")


Contents of /kaggle/input/hidden-stratification copied to /kaggle/working/hidden-stratification


In [3]:
import shutil
import os

# Define the source and destination paths
source_folder = '/kaggle/input/kaggle-json'
destination_folder = '/kaggle/working/kaggle-json'

# Create the destination folder if it doesn't exist
os.makedirs(destination_folder, exist_ok=True)

# Copy the contents of the source folder to the destination folder
for item in os.listdir(source_folder):
    s = os.path.join(source_folder, item)
    d = os.path.join(destination_folder, item)
    if os.path.isdir(s):
        shutil.copytree(s, d, dirs_exist_ok=True)  # Copy directories
    else:
        shutil.copy2(s, d)  # Copy files

print(f"Contents of {source_folder} copied to {destination_folder}")


Contents of /kaggle/input/kaggle-json copied to /kaggle/working/kaggle-json


In [4]:
import os
import shutil

# Create the .kaggle directory if it doesn't exist
os.makedirs('/root/.kaggle', exist_ok=True)

# Move kaggle.json to the .kaggle directory
shutil.move('/kaggle/working/kaggle-json/kaggle.json', '/root/.kaggle/kaggle.json')

# Set the correct permissions for the file
os.chmod('/root/.kaggle/kaggle.json', 0o600)

print("kaggle.json has been moved to /root/.kaggle and permissions have been set.")


kaggle.json has been moved to /root/.kaggle and permissions have been set.


In [5]:
import os
os.chdir('/kaggle/working/hidden-stratification')

In [None]:
!pip install -r requirements.txt

In [7]:
!pip install -e .

Obtaining file:///kaggle/working/hidden-stratification
  Preparing metadata (setup.py) ... done
Installing collected packages: stratification
  Running setup.py develop for stratification
Successfully installed stratification-1.0


In [12]:
!python stratification/run.py configs/cifar.json exp_dir=checkpoints/new-experiment mode='george' reduction_config.model='none' reduction_config.components=8 activations_dir='NONE' classification_config.num_epochs=150 classification_config.bit_pretrained=False cluster_dir='/kaggle/working/hidden-stratification/checkpoints/new-experiment/run_2024-06-27_03-31-16_e3cd3f57/erm_2024-06-27_03-32-04_709047de/reduce_2024-06-27_03-34-09_af0a39aa/cluster_2024-06-27_03-34-16_6133e806'

2024-06-27 04:46:19.065756: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-27 04:46:19.065811: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-27 04:46:19.067393: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
{
    "exp_dir": "checkpoints/new-experiment",
    "dataset": "cifar",
    "mode": "george",
    "seed": 0,
    "classification_config": {
        "model": "resnet50",
        "checkpoint_metric": "val_acc_rw",
        "optimizer_config": {
            "class_args": {
                "lr": 0.005,
                "weight_decay": 1e-05,
                "momentum": 

In [18]:
import shutil
import os

# Define the source and destination paths
s = '/kaggle/working/hidden-stratification/checkpoints/new-experiment/run_2024-06-27_03-31-16_e3cd3f57/erm_2024-06-27_03-32-04_709047de/reduce_2024-06-27_03-34-09_af0a39aa/cluster_2024-06-27_03-34-16_6133e806/metrics.json'
d = '/kaggle/working/logs'

# Create the destination folder if it doesn't exist
os.makedirs(d, exist_ok=True)
shutil.copy2(s, d)  # Copy files


'/kaggle/working/logs/metrics.json'

In [None]:
import shutil
import os

# Define the source and destination paths
source_folder = '/kaggle/working/hidden-stratification/checkpoints/new-experiment/run_2024-06-20_02-11-51_b9de6416/erm_2024-06-20_02-11-52_692fb72f/reduce_2024-06-20_03-53-55_5206e77c/cluster_2024-06-20_03-54-00_96bc0947/visualizations/test'
destination_folder = '/kaggle/working/plots'

# Create the destination folder if it doesn't exist
os.makedirs(destination_folder, exist_ok=True)

# Copy the contents of the source folder to the destination folder
for item in os.listdir(source_folder):
    s = os.path.join(source_folder, item)
    d = os.path.join(destination_folder, item)
    if os.path.isdir(s):
        shutil.copytree(s, d, dirs_exist_ok=True)  # Copy directories
    else:
        shutil.copy2(s, d)  # Copy files

print(f"Contents of {source_folder} copied to {destination_folder}")


In [15]:
from IPython.display import FileLinks
FileLinks('/kaggle/working/hidden-stratification/checkpoints')

In [7]:
import torch
o = torch.load('/kaggle/working/hidden-stratification/checkpoints/new-experiment/run_2024-06-26_11-47-26_aa84b7b4/erm_2024-06-26_11-47-55_4b817753/outputs.pt')

In [8]:
o['train']['activations'].shape

(40000, 1536)

In [None]:
import json
import os
import pprint

import torch

from stratification.harness import GEORGEHarness
from stratification.utils.utils import set_seed, init_cuda
from stratification.utils.parse_args import get_config
from stratification.cluster.models.cluster import GaussianMixture
from stratification.cluster.models.reduction import UMAPReducer

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Repository base directory (the working directory was set to the repo root above)
REPO_DIR = os.getcwd()
print(REPO_DIR)

# Load the JSON configuration and pass it to the repository's config parser
with open('configs/cifar.json', 'r') as f:
    config = json.dumps(json.load(f))
config = get_config([config])

os.chdir(os.path.join(REPO_DIR, 'stratification'))

use_cuda_if_available = True  # set to False to force CPU
use_cuda = use_cuda_if_available and torch.cuda.is_available()
# set seeds for reproducibility
set_seed(config['seed'], use_cuda)
# initialize CUDA, if available
init_cuda(config['deterministic'], config['allow_multigpu'])
pprint.pprint(config)

In [None]:
print(json.dumps(config,indent=4))

### 1.2. Initialize GEORGEHarness

The `GEORGEHarness` is an object that handles the "bookkeeping" for each of the steps outlined in this tutorial, such as setting up experiment directories and loading/saving models. Experiment files are stored in the base directory specified by `config['exp_dir']`. Each experiment run is stored in a subdirectory of this base directory whose name is based on (1) the training method (ERM, GEORGE, etc.), (2) the timestamp, and (3) a random hash (to avoid collisions).
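As a rough illustration of this naming scheme, the sketch below builds a directory name from a method, a timestamp, and a short random token. The function `make_run_dir_name` is hypothetical, and the exact format is an assumption inferred from directory names such as `erm_2024-06-27_03-32-04_709047de` that appear in this notebook; the harness may construct its names differently.

```python
import uuid
from datetime import datetime


def make_run_dir_name(method, now=None, token=None):
    """Return '<method>_<timestamp>_<hash>' (hypothetical helper, assumed format)."""
    now = now or datetime.now()
    # Eight hex characters keep names short while making collisions unlikely
    token = token or uuid.uuid4().hex[:8]
    return f"{method}_{now.strftime('%Y-%m-%d_%H-%M-%S')}_{token}"


# Fixed inputs reproduce a name of the same shape as the ones seen above
name = make_run_dir_name("erm", now=datetime(2024, 6, 27, 3, 32, 4), token="709047de")
print(name)

# With no arguments, the current time and a fresh random token are used
print(make_run_dir_name("george"))
```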

In [None]:
#
harness = GEORGEHarness(config, use_cuda=use_cuda, log_format='simple')

### 1.3 Get Data and Model Architecture

In this tutorial, we'll use the MNIST dataset. In our case, the task will be to classify digits as < 5 or ≥ 5; these correspond to the two *superclasses*. The *subclasses* are the individual digits (0,1,2,3,4 are the subclasses of the first superclass, and 5,6,7,8,9 are the subclasses of the second superclass).
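The superclass/subclass relationship described above can be written down in a few lines. This is only a sketch of the labeling rule; the function name `digit_to_superclass` is ours and is not taken from the repository's dataset code.

```python
def digit_to_superclass(digit):
    """Return 0 for digits 0-4 and 1 for digits 5-9."""
    return 0 if digit < 5 else 1


# Group the ten digits (subclasses) under their superclass
subclasses_by_superclass = {0: [], 1: []}
for digit in range(10):
    subclasses_by_superclass[digit_to_superclass(digit)].append(digit)

print(subclasses_by_superclass)
# {0: [0, 1, 2, 3, 4], 1: [5, 6, 7, 8, 9]}
```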

When fetching the dataloaders and NN architecture, we can also specify the "mode" (training method), as one can specify different data and model options in the configuration for the different training methods.

Additional datasets and architectures can be added under `stratification/classification/datasets` and `stratification/classification/models`, respectively.

In [None]:
dataloaders = harness.get_dataloaders(config, mode='erm')
num_classes = dataloaders['train'].dataset.get_num_classes('superclass')
model = harness.get_nn_model(config, num_classes=num_classes, mode='erm')

## 2. Train ERM Model

We've already initialized our model, a simple CNN architecture. Let's print it out:

In [None]:
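import torch

# Scratch check: map raw labels to group indices, sending unlisted labels to group 2
a = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
class_div = {0: 0, 4: 0, 2: 1}
print(class_div.get(1, 2))  # label 1 is not in class_div, so the default (2) is printed
torch.tensor([class_div.get(int(x), 2) for x in a]).long()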

In [None]:
print('Model architecture:')
print(model)

Now, we're ready to train a classifier! First we'll just train a standard classifier using empirical risk minimization (a fancy term for minimizing the average training loss). We'll print out both the overall accuracy and the true robust accuracy (i.e., the minimum accuracy on any subclass), along with the losses as well. The robust accuracy is the metric we are interested in maximizing. We train our models as though we don't know the subclass (digit) labels, in which case we can't actually measure the true robust accuracy. In reality, we do know the subclass labels for this dataset, so we'll measure per-subclass performance to see how well each method *really* does.
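To make the two metrics concrete, here is a small sketch that computes overall accuracy and robust accuracy (the minimum per-subclass accuracy) from predictions. The function `accuracy_report` and the toy data are made up for illustration; the harness computes and logs its own metrics during training.

```python
from collections import defaultdict


def accuracy_report(predictions, superclass_labels, subclass_labels):
    """Return (overall accuracy, robust accuracy, per-subclass accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, sup, sub in zip(predictions, superclass_labels, subclass_labels):
        total[sub] += 1
        correct[sub] += int(pred == sup)  # the model predicts the superclass
    per_subclass = {sub: correct[sub] / total[sub] for sub in total}
    overall = sum(correct.values()) / sum(total.values())
    robust = min(per_subclass.values())
    return overall, robust, per_subclass


# Toy data: digits 0 and 3 belong to superclass 0, digits 7 and 8 to superclass 1
subclass_labels = [0, 0, 0, 0, 3, 3, 7, 7, 7, 7, 8, 8]
superclass_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

overall, robust, per_subclass = accuracy_report(
    predictions, superclass_labels, subclass_labels)
print(f"overall={overall:.3f} robust={robust:.3f}")
print(per_subclass)
```

In this toy example the overall accuracy looks healthy, while the rarer subclasses (3 and 8) are only classified correctly half of the time. This is exactly the gap that robust accuracy exposes.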

In [None]:
!pip install torchsummary

In [None]:
from torchsummary import summary
summary(model, (3, 28, 28))

In [None]:
erm_dir = harness.classify(config['classification_config'],  model, dataloaders, 'erm')

The overall test accuracy is around **92%**, but the robust accuracy is quite a bit lower at around **82%**. (Note: Results may vary based on random seed, platform, GPU use, etc.) Let's see if we can improve on this!

## 3. Cluster Model Activations

Now, we'll cluster the data of each superclass, to try and automatically identify the subclasses. However, just clustering the raw data usually doesn't work that well - instead, we cluster in the *feature space* of a trained model. We just trained an ERM model on the task, so we'll now use this model to extract features which we use for clustering. Specifically, the features are the activations (outputs) of the penultimate layer (right before the classification layer).
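If "penultimate-layer activations" is unfamiliar, the following NumPy sketch shows the idea on a tiny two-layer network with random weights: the features are whatever is fed into the final classification layer. The network, its sizes, and the `forward` function are invented for illustration and are unrelated to the model trained above.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy network: 8 inputs -> 4 hidden units -> 2 classes (weights are random)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)


def forward(x):
    """Return the class logits and the penultimate-layer activations."""
    features = np.maximum(x @ W1 + b1, 0.0)  # ReLU output of the hidden layer
    logits = features @ W2 + b2              # final classification layer
    return logits, features


x = rng.normal(size=(5, 8))  # a batch of 5 examples
logits, features = forward(x)
print(features.shape, logits.shape)
```

Each example is mapped to one feature vector, and those vectors (rather than the raw inputs) are what we reduce and cluster.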

### 3.1. Initialize Cluster Model and Reduction Model

The clustering procedure consists of two steps:
1. Dimensionality reduction of the activations (optional). If `reduction_model` is `None`, the raw activations are used.
2. Fitting a separate cluster model on the reduced training activations of each superclass.

We'll use UMAP for dimensionality reduction, and we'll use Gaussian mixture model clustering. For simplicity, in this tutorial we fix the number of clusters per superclass to 5 (the true number of subclasses per superclass for this task). *Automatic* selection of the number of clusters based on unsupervised metrics (such as the Silhouette score) is also supported - in fact, our experiments in the blog post and paper are run using this automatic selection procedure, rather than pre-specifying the number of clusters.
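The key structural point is that clustering is run separately within each superclass. The sketch below illustrates this with a plain k-means routine standing in for the Gaussian mixture model, purely to keep the example dependency-light. The functions `kmeans` and `cluster_per_superclass`, the offsetting of cluster IDs, and the toy data are all assumptions made for illustration, not the repository's implementation.

```python
import numpy as np


def kmeans(points, k, n_iters=20):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def cluster_per_superclass(features, superclass_labels, k):
    """Cluster each superclass separately; IDs are offset so they never collide."""
    cluster_ids = np.zeros(len(features), dtype=int)
    for index, superclass in enumerate(np.unique(superclass_labels)):
        mask = superclass_labels == superclass
        cluster_ids[mask] = kmeans(features[mask], k) + index * k
    return cluster_ids


# Toy 2-D "features": two well-separated blobs (hidden subclasses) per superclass
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
features = np.concatenate([c + 0.3 * rng.normal(size=(20, 2)) for c in blob_centers])
superclass_labels = np.repeat([0, 0, 1, 1], 20)

cluster_ids = cluster_per_superclass(features, superclass_labels, k=2)
print(np.unique(cluster_ids, return_counts=True))
```

Because each superclass is clustered on its own, a cluster never mixes examples from different superclasses.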

In [None]:
# Dimensionality reduction model
reduction_model = UMAPReducer(random_state=12345, n_components=2)
# Clustering model
cluster_model = GaussianMixture(covariance_type="full", n_components=2, n_init=3)

### 3.2. Run `harness.reduce`

Now, we use UMAP to reduce the dimensionality of the activations, producing the "features" that we'll cluster.

In [None]:
reduction_dir = harness.reduce(config['reduction_config'], reduction_model,
                               inputs_path=os.path.join(erm_dir, 'outputs.pt'))

### 3.3. Run `harness.cluster`

Now, we cluster the aforementioned features.

In [None]:
# Now, we cluster the features separately for each superclass.
# This step also generates and saves visualizations of the data, which we'll look at in the next part.
cluster_dir = harness.cluster(config['cluster_config'], cluster_model,
                              inputs_path=os.path.join(reduction_dir, 'outputs.pt'));

### 3.4. Visualizing the clusters

Now let's look at the clusters that we found. Since we apply UMAP to reduce to dimension 2, we can directly visualize the data in this two-dimensional "feature space" - see below! The first row corresponds to the first superclass (< 5, i.e. digits 0-4) and the second row corresponds to the second superclass (≥ 5, i.e. digits 5-9). On the left, we color each point by its assigned cluster label. On the right, we color each point by its actual subclass (i.e., which digit that datapoint is). As we can see, the individual subclasses are fairly easy to distinguish in feature space, and as a result *up to permutation* the clusters we find match up quite well with the actual subclasses.
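"Up to permutation" means that cluster IDs are arbitrary: cluster 2 may correspond to digit 0, and so on. One simple way to quantify the match is to try every assignment of clusters to subclasses and keep the best one. The helper `best_permutation_agreement` below is a hypothetical sketch that assumes there are no more clusters than subclasses; brute force is fine for five clusters (120 permutations), while larger problems would call for an assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations


def best_permutation_agreement(cluster_labels, subclass_labels):
    """Return the best agreement rate and the cluster -> subclass mapping achieving it."""
    clusters = sorted(set(cluster_labels))
    subclasses = sorted(set(subclass_labels))
    best_score, best_mapping = -1.0, {}
    for perm in permutations(subclasses, len(clusters)):
        mapping = dict(zip(clusters, perm))
        matches = sum(mapping[c] == s for c, s in zip(cluster_labels, subclass_labels))
        score = matches / len(cluster_labels)
        if score > best_score:
            best_score, best_mapping = score, mapping
    return best_score, best_mapping


# Toy example: the cluster IDs are a relabeling of the subclasses, with one mistake
cluster_labels = [2, 2, 0, 0, 1, 1, 1]
subclass_labels = [0, 0, 1, 1, 2, 2, 0]

score, mapping = best_permutation_agreement(cluster_labels, subclass_labels)
print(f"agreement={score:.3f} mapping={mapping}")
```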

In [None]:
viz_dir = os.path.join(cluster_dir, 'visualizations')
fig, axarr = plt.subplots(2, 2, figsize=(18, 12), gridspec_kw={'wspace':0, 'hspace':0}, squeeze=True, dpi=300)
for i in range(2):
    axarr[i, 0].imshow(mpimg.imread(os.path.join(viz_dir, f'train/group_{i}_cluster_viz.png')))
    axarr[i, 0].axis('off')
    axarr[i, 1].imshow(mpimg.imread(os.path.join(viz_dir, f'train/group_{i}_true_subclass_viz.png')))
    axarr[i, 1].axis('off')

## 4. Train Final (GEORGE) Model

Training the GEORGE model is simple. The only new thing to pass in is the `clusters.pt` path. This is a pickled dictionary saved by `harness.cluster` that contains the cluster labels assigned to the datapoints. These estimated cluster labels are used as a surrogate for the subclass labels. We now train a model to minimize the *worst-case* loss over the clusters using GDRO. Since the cluster assignments are similar to the true subclass labels up to permutation, we expect that this procedure should also improve worst-case accuracy on the true subclasses.
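For intuition about what GDRO optimizes, here is a minimal sketch of the group-weight update in the style of Sagawa et al. (2020): each group's weight is scaled up exponentially in its loss and renormalized, so the weighted loss emphasizes the worst-performing groups. The function `gdro_step`, the step size, and the fixed toy losses are assumptions for illustration; in real training the model parameters would also be updated to reduce this weighted loss, and the repository's implementation may differ in its details.

```python
import math


def gdro_step(group_losses, group_weights, step_size=0.1):
    """One exponentiated-gradient update of the group weights.

    Returns the weighted (robust) loss and the new, normalized weights.
    """
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(group_weights, group_losses)]
    total = sum(scaled)
    new_weights = [s / total for s in scaled]
    robust_loss = sum(w * loss for w, loss in zip(new_weights, group_losses))
    return robust_loss, new_weights


# Four clusters with uniform initial weights; the third has the highest loss
losses = [0.2, 0.3, 1.5, 0.25]
weights = [0.25] * 4

for _ in range(50):
    robust_loss, weights = gdro_step(losses, weights, step_size=0.5)

print([round(w, 4) for w in weights])
print(round(robust_loss, 4))
```

With the losses held fixed, the weights concentrate on the worst group and the weighted loss approaches the maximum group loss, which is the quantity that worst-case training targets.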

As before, we'll print out both the overall accuracy and true robust accuracy. We'll also print out the *estimated* robust accuracy, which is the minimum accuracy on any *cluster*. Unlike the true robust accuracy, this is something we can actually measure even when we don't know the true subclass labels. If the cluster labels are a good estimate of the true subclass labels, then this estimated robust accuracy should be a good estimate of the true robust accuracy.

In [None]:
set_seed(config['seed'], use_cuda)  # reset random state
# Specify path to estimated subclass labels
dataloaders = harness.get_dataloaders(
    config, mode='george', subclass_labels=os.path.join(cluster_dir, 'clusters.pt'))
# Initialize new model
model = harness.get_nn_model(config, num_classes=num_classes, mode='george')

# Train the final (GEORGE) model
george_dir = harness.classify(config['classification_config'], model, dataloaders,
                              mode='george')

The overall accuracy is similar to that of the ERM model, but the robust accuracy has improved to around **89%**!
In addition, our estimate of the robust accuracy (**91%**) is quite close to the actual robust accuracy (and these two metrics remain close throughout the entire training run).
Again, results may vary somewhat - but on average across random seeds, the GEORGE model outperforms the ERM model in terms of robust accuracy.

## Conclusion

This notebook demonstrates our framework (GEORGE) for estimating subclasses and improving worst-case subclass accuracy, on a simple "toy" example. Although the end-to-end performance gains are modest in this case, on more complex datasets the gains can be quite dramatic! See our blog post and paper for more details.