## Pre-requisites

Before running this notebook, you should have already used the `extract_features.py` script to extract features from models trained on DHS data. You should have the following structure under the `outputs/` directory:

```
dhs_ooc/
    DHS_OOC_A_ms_samescaled_b64_fc01_conv01_lr0001/
        features.npz
    ...
    DHS_OOC_E_rgb_same_b64_fc001_conv001_lr0001/
        features.npz
dhs_incountry/
    DHS_Incountry_A_ms_samescaled_b64_fc01_conv01_lr001/
        features.npz
    ...
    DHS_Incountry_E_nl_random_b64_fc01_conv01_lr001/
        features.npz
transfer/
    transfer_nlcenter_ms_b64_fc001_conv001_lr0001/
        features.npz
    transfer_nlcenter_rgb_b64_fc001_conv001_lr0001/
        features.npz
        

TODO: update when keep-frac models are added
```
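Because the later cells run for hours, it may be worth confirming up front that every expected `features.npz` file is present. The following is a minimal sketch of such a check, not part of the original pipeline: `find_missing_features` is a hypothetical helper, and the demo uses a throwaway directory tree with made-up model directory names.

```python
from __future__ import annotations

import os
import tempfile


def find_missing_features(root: str, subdir: str, model_dirs: list[str]) -> list[str]:
    """Returns the expected features.npz paths that do not exist on disk."""
    missing = []
    for model_dir in model_dirs:
        path = os.path.join(root, subdir, model_dir, 'features.npz')
        if not os.path.isfile(path):
            missing.append(path)
    return missing


# Demo on a temporary directory tree (hypothetical model directory names).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'transfer', 'model_a'))
    os.makedirs(os.path.join(root, 'transfer', 'model_b'))
    # Only model_a gets a features.npz file.
    open(os.path.join(root, 'transfer', 'model_a', 'features.npz'), 'wb').close()
    missing = find_missing_features(root, 'transfer', ['model_a', 'model_b'])
    missing_names = [os.path.basename(os.path.dirname(p)) for p in missing]

print(missing_names)  # ['model_b']
```

In this notebook, the same helper could presumably be called with `OUTPUTS_ROOT_DIR`, one of `'dhs_ooc'`, `'dhs_incountry'`, or `'transfer'`, and the matching values of `MODEL_DIRS`.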

## Instructions

This notebook essentially fine-tunes the final layer of the ResNet-18 models. However, instead of fine-tuning the ResNet-18 models directly in TensorFlow, we train ridge-regression models on the extracted features. We take this approach for two reasons:

1. It is easier to perform leave-one-group-out ("logo") cross-validated ridge regression using scikit-learn, as opposed to TensorFlow. For out-of-country (OOC) experiments, the left-out group is the test country. For in-country experiments, the left-out group is the test split.
2. We can concatenate the 512-dim features from the RGB/MS CNN models with the 512-dim features from the NL CNN models to form a larger 1024-dim feature vector capturing RGB/MS + NL imagery information. We concatenate features instead of training a single CNN on stacked MS+NL imagery because we found that concatenation results in better performance.
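The two ideas above can be sketched on synthetic data using scikit-learn directly. This is only an illustration under stated assumptions: the features, labels, and groups are random stand-ins, the ridge penalty `alpha` is fixed rather than tuned, and the code is not the project's `ridge_cv` function, whose internals are not shown in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n, d = 300, 8  # toy sizes; the notebook's CNN features are 512-dim per model

# Stand-ins for features extracted by two models (e.g. MS and NL) on the same clusters.
feats_ms = rng.normal(size=(n, d))
feats_nl = rng.normal(size=(n, d))
features = np.concatenate([feats_ms, feats_nl], axis=1)  # shape [n, 2*d]

# Synthetic labels: a linear function of the features plus a little noise.
true_w = rng.normal(size=2 * d)
labels = features @ true_w + 0.1 * rng.normal(size=n)
groups = np.arange(n) % 5  # 5 synthetic groups (countries or folds in the notebook)

# Leave-one-group-out: fit on all other groups, predict on the held-out group.
preds = np.zeros(n)
for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, groups):
    model = Ridge(alpha=1.0)  # fixed penalty, an assumption made for brevity
    model.fit(features[train_idx], labels[train_idx])
    preds[test_idx] = model.predict(features[test_idx])

r2 = np.corrcoef(preds, labels)[0, 1] ** 2
print(f'held-out r^2: {r2:.3f}')
```

Every prediction in `preds` comes from a model that never saw that sample's group during fitting, which is the property the "logo" runs below rely on.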

Because of the extensive cross-validation, each "logo" CV run may take ~2-4 hours. In total, this notebook may take upwards of 15 hours to complete.

After completing this notebook, use the `model_analysis/dhs_ooc.ipynb` and `model_analysis/dhs_incountry.ipynb` (TODO) notebooks to analyze the final performance of the fine-tuned ResNet-18 models.

## Imports and Constants

In [None]:
%cd '../'
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from __future__ import annotations

from collections.abc import Iterable
import os
import pickle

import numpy as np
import pandas as pd

from batchers import dataset_constants
from models.linear_model import ridge_cv
from utils.general import load_npz

In [None]:
FOLDS = ['A', 'B', 'C', 'D', 'E']
SPLITS = ['train', 'val', 'test']
OUTPUTS_ROOT_DIR = 'outputs'
COUNTRIES = dataset_constants.DHS_COUNTRIES

KEEPS = [0.05, 0.1, 0.25, 0.5]
SEEDS = [123, 456, 789]

In [None]:
MODEL_DIRS = {
    # OOC models
    'resnet_ms_A': 'DHS_OOC_A_ms_samescaled_b64_fc01_conv01_lr0001',
    'resnet_ms_B': 'DHS_OOC_B_ms_samescaled_b64_fc001_conv001_lr0001',
    'resnet_ms_C': 'DHS_OOC_C_ms_samescaled_b64_fc001_conv001_lr001',
    'resnet_ms_D': 'DHS_OOC_D_ms_samescaled_b64_fc001_conv001_lr01',
    'resnet_ms_E': 'DHS_OOC_E_ms_samescaled_b64_fc01_conv01_lr001',
    'resnet_nl_A': 'DHS_OOC_A_nl_random_b64_fc1.0_conv1.0_lr0001',
    'resnet_nl_B': 'DHS_OOC_B_nl_random_b64_fc1.0_conv1.0_lr0001',
    'resnet_nl_C': 'DHS_OOC_C_nl_random_b64_fc1.0_conv1.0_lr0001',
    'resnet_nl_D': 'DHS_OOC_D_nl_random_b64_fc1.0_conv1.0_lr01',
    'resnet_nl_E': 'DHS_OOC_E_nl_random_b64_fc1.0_conv1.0_lr0001',
    'resnet_rgb_A': 'DHS_OOC_A_rgb_same_b64_fc001_conv001_lr01',
    'resnet_rgb_B': 'DHS_OOC_B_rgb_same_b64_fc001_conv001_lr0001',
    'resnet_rgb_C': 'DHS_OOC_C_rgb_same_b64_fc001_conv001_lr0001',
    'resnet_rgb_D': 'DHS_OOC_D_rgb_same_b64_fc1.0_conv1.0_lr01',
    'resnet_rgb_E': 'DHS_OOC_E_rgb_same_b64_fc001_conv001_lr0001',

    # incountry models
    'incountry_resnet_ms_A': 'DHS_Incountry_A_ms_samescaled_b64_fc01_conv01_lr001',
    'incountry_resnet_ms_B': 'DHS_Incountry_B_ms_samescaled_b64_fc1_conv1_lr001',
    'incountry_resnet_ms_C': 'DHS_Incountry_C_ms_samescaled_b64_fc1.0_conv1.0_lr0001',
    'incountry_resnet_ms_D': 'DHS_Incountry_D_ms_samescaled_b64_fc001_conv001_lr0001',
    'incountry_resnet_ms_E': 'DHS_Incountry_E_ms_samescaled_b64_fc001_conv001_lr0001',
    'incountry_resnet_nl_A': 'DHS_Incountry_A_nl_random_b64_fc1.0_conv1.0_lr0001',
    'incountry_resnet_nl_B': 'DHS_Incountry_B_nl_random_b64_fc1.0_conv1.0_lr0001',
    'incountry_resnet_nl_C': 'DHS_Incountry_C_nl_random_b64_fc1.0_conv1.0_lr0001',
    'incountry_resnet_nl_D': 'DHS_Incountry_D_nl_random_b64_fc1.0_conv1.0_lr0001',
    'incountry_resnet_nl_E': 'DHS_Incountry_E_nl_random_b64_fc01_conv01_lr001',

    # transfer models
    'transfer_resnet_ms': 'transfer_nlcenter_ms_b64_fc001_conv001_lr0001',
    'transfer_resnet_rgb': 'transfer_nlcenter_rgb_b64_fc001_conv001_lr0001',

    # keep-frac models
    # TODO
}

## Load data

`country_labels` is a `np.ndarray` that shows which country each cluster belongs to. Countries are indexed by their position in `dataset_constants.DHS_COUNTRIES`.
```python
array([ 0,  0,  0, ..., 22, 22, 22])
```
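As a small illustration of how such an index array can be built, the sketch below maps country names to their list positions with `Series.map`, the same call used in the loading cell. The three-country list and the toy DataFrame are hypothetical stand-ins for `dataset_constants.DHS_COUNTRIES` and `data/dhs_clusters.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-country list standing in for dataset_constants.DHS_COUNTRIES.
toy_countries = ['angola', 'benin', 'ghana']
toy_df = pd.DataFrame({'country': ['angola', 'angola', 'ghana', 'benin']})

# Map each country name to its position in the list.
toy_country_labels = toy_df['country'].map(toy_countries.index).to_numpy()
print(toy_country_labels)  # [0 0 2 1]
```

Note that `list.index` raises a `ValueError` for a name that is not in the list, so an unexpected country in the CSV would fail loudly rather than produce a wrong label.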

`incountry_group_labels` is a `np.ndarray` that gives, for each cluster, the index of the fold whose "test" split contains that cluster. In the example below, the first cluster belongs to the "test" split of fold "B" (folds are 0-indexed, so label 1 corresponds to fold "B").
```python
array([1, 1, 4, ..., 1, 0, 3])
```
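The construction of this array can be sketched on a toy example. The fold dictionary below is hypothetical. It only assumes the `{fold: {'test': indices}}` layout that the loading cell reads from `data/dhs_incountry_folds.pkl`.

```python
import numpy as np

toy_folds = ['A', 'B', 'C']
# Hypothetical assignment of 6 clusters; each cluster is in exactly one "test" split.
toy_incountry_folds = {
    'A': {'test': np.array([4])},
    'B': {'test': np.array([0, 1])},
    'C': {'test': np.array([2, 3, 5])},
}

# Sanity check: the "test" splits together cover every cluster exactly once.
all_test = np.concatenate([toy_incountry_folds[f]['test'] for f in toy_folds])
assert np.array_equal(np.sort(all_test), np.arange(6))

toy_group_labels = np.zeros(6, dtype=np.int64)
for i, fold in enumerate(toy_folds):
    toy_group_labels[toy_incountry_folds[fold]['test']] = i

print(toy_group_labels)  # [1 1 2 2 0 2]
```

The sanity check matters because the array starts as all zeros: a cluster missing from every "test" split would otherwise be silently labeled as fold "A".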

In [None]:
df = pd.read_csv('data/dhs_clusters.csv', float_precision='high', index_col=False)
labels = df['wealthpooled'].to_numpy(dtype=np.float32)
locs = df[['lat', 'lon']].to_numpy(dtype=np.float32)
country_labels = df['country'].map(COUNTRIES.index).to_numpy()

with open('data/dhs_incountry_folds.pkl', 'rb') as f:
    incountry_folds = pickle.load(f)

incountry_group_labels = np.zeros(len(df), dtype=np.int64)
for i, fold in enumerate(FOLDS):
    test_indices = incountry_folds[fold]['test']
    incountry_group_labels[test_indices] = i

## OOC

In [None]:
def ridgecv_ooc_wrapper(model_name: str, savedir: str) -> None:
    '''
    Args
    - model_name: str, corresponds to keys in MODEL_DIRS (without the fold suffix)
    - savedir: str, path to directory for saving ridge regression weights and predictions
    '''
    features_dict = {}
    for f in FOLDS:
        model_fold_name = f'{model_name}_{f}'
        model_dir = MODEL_DIRS[model_fold_name]
        npz_path = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_ooc', model_dir, 'features.npz')
        npz = load_npz(npz_path, check={'labels': labels})
        features = npz['features']
        for country in dataset_constants.SURVEY_NAMES[f'DHS_OOC_{f}']['test']:
            features_dict[country] = features

    ridge_cv(
        features=features_dict,
        labels=labels,
        group_labels=country_labels,
        group_names=COUNTRIES,
        do_plot=True,
        savedir=savedir,
        save_weights=True,
        save_dict=dict(locs=locs))

Each of the following 3 cells may take ~2 hours to run.

In [None]:
model_name = 'resnet_ms'
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_ooc', 'resnet_ms')
ridgecv_ooc_wrapper(model_name, savedir)

In [None]:
model_name = 'resnet_rgb'
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_ooc', 'resnet_rgb')
ridgecv_ooc_wrapper(model_name, savedir)

In [None]:
model_name = 'resnet_nl'
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_ooc', 'resnet_nl')
ridgecv_ooc_wrapper(model_name, savedir)

### Concatenated RGB/MS + NL features

In [None]:
def ridgecv_ooc_concat_wrapper(model_names: Iterable[str], savedir: str) -> None:
    '''
    Args
    - model_names: list of str, correspond to keys in MODEL_DIRS (without the fold suffix)
    - savedir: str, path to directory for saving ridge regression weights and predictions
    '''
    features_dict = {}
    for f in FOLDS:
        concat_features = []  # list of np.array, each shape [N, D_i]
        for model_name in model_names:
            model_dir = MODEL_DIRS[f'{model_name}_{f}']
            npz_path = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_ooc', model_dir, 'features.npz')
            npz = load_npz(npz_path, check={'labels': labels})
            concat_features.append(npz['features'])
        concat_features = np.concatenate(concat_features, axis=1)  # shape [N, D_1 + ... + D_m]
        for country in dataset_constants.SURVEY_NAMES[f'DHS_OOC_{f}']['test']:
            features_dict[country] = concat_features

    ridge_cv(
        features=features_dict,
        labels=labels,
        group_labels=country_labels,
        group_names=COUNTRIES,
        do_plot=True,
        savedir=savedir,
        save_weights=True,
        save_dict=dict(locs=locs),
        verbose=True)

Each of the following 2 cells may take ~3-4 hours to run.

In [None]:
model_names = ['resnet_ms', 'resnet_nl']
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_ooc', 'resnet_msnl_concat')
ridgecv_ooc_concat_wrapper(model_names, savedir)

In [None]:
model_names = ['resnet_rgb', 'resnet_nl']
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_ooc', 'resnet_rgbnl_concat')
ridgecv_ooc_concat_wrapper(model_names, savedir)

## Incountry

In [None]:
def ridgecv_incountry_wrapper(model_name: str, savedir: str) -> None:
    '''
    Args
    - model_name: str, corresponds to keys in MODEL_DIRS (without the fold suffix)
    - savedir: str, path to directory for saving ridge regression weights and predictions
    '''
    features_dict = {}
    for f in FOLDS:
        model_fold_name = f'{model_name}_{f}'
        model_dir = MODEL_DIRS[model_fold_name]
        npz_path = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_incountry', model_dir, 'features.npz')
        npz = load_npz(npz_path, check={'labels': labels})
        features_dict[f] = npz['features']

    ridge_cv(
        features=features_dict,
        labels=labels,
        group_labels=incountry_group_labels,
        group_names=FOLDS,
        do_plot=True,
        savedir=savedir,
        save_weights=True,
        verbose=True)

In [None]:
model_name = 'incountry_resnet_ms'
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_incountry', 'resnet_ms')
ridgecv_incountry_wrapper(model_name, savedir)

In [None]:
model_name = 'incountry_resnet_nl'
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_incountry', 'resnet_nl')
ridgecv_incountry_wrapper(model_name, savedir)

### Concatenated MS + NL features

In [None]:
def ridgecv_incountry_concat_wrapper(model_names: Iterable[str], savedir: str) -> None:
    '''
    Args
    - model_names: list of str, correspond to keys in MODEL_DIRS (without the fold suffix)
    - savedir: str, path to directory for saving ridge regression weights and predictions
    '''
    features_dict = {}
    for f in FOLDS:
        concat_features = []  # list of np.array, each shape [N, D_i]
        for model_name in model_names:
            model_dir = MODEL_DIRS[f'{model_name}_{f}']
            npz_path = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_incountry', model_dir, 'features.npz')
            npz = load_npz(npz_path, check={'labels': labels})
            concat_features.append(npz['features'])
        concat_features = np.concatenate(concat_features, axis=1)  # shape [N, D_1 + ... + D_m]
        features_dict[f] = concat_features

    ridge_cv(
        features=features_dict,
        labels=labels,
        group_labels=incountry_group_labels,
        group_names=FOLDS,
        do_plot=True,
        savedir=savedir,
        save_weights=True,
        verbose=True)

In [None]:
model_names = ['incountry_resnet_ms', 'incountry_resnet_nl']
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'dhs_incountry', 'resnet_msnl_concat')
ridgecv_incountry_concat_wrapper(model_names, savedir)

### Transfer

In [None]:
def ridgecv_incountry_transfer_wrapper(model_name: str, savedir: str) -> None:
    '''
    Args
    - model_name: str, a key in MODEL_DIRS (transfer models have no fold suffix)
    - savedir: str, path to directory for saving ridge regression weights and predictions
    '''
    model_dir = MODEL_DIRS[model_name]
    npz_path = os.path.join(OUTPUTS_ROOT_DIR, 'transfer', model_dir, 'features.npz')
    features = load_npz(npz_path, check={'labels': labels})['features']
    ridge_cv(
        features=features,
        labels=labels,
        group_labels=incountry_group_labels,
        group_names=FOLDS,
        do_plot=True,
        savedir=savedir,
        save_weights=False)

In [None]:
model_name = 'transfer_resnet_ms'
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'transfer', MODEL_DIRS[model_name])
ridgecv_incountry_transfer_wrapper(model_name, savedir)

In [None]:
model_name = 'transfer_resnet_rgb'
savedir = os.path.join(OUTPUTS_ROOT_DIR, 'transfer', MODEL_DIRS[model_name])
ridgecv_incountry_transfer_wrapper(model_name, savedir)