# Cassava Leaf Disease Classification

This notebook builds and trains a cassava leaf disease classifier for the [Kaggle competition](https://www.kaggle.com/c/cassava-leaf-disease-classification/overview).

## Ideas

1. Cross-entropy loss, stratified CV, no FMix, CutMix, or MixUp, with gradient scaling and accumulation [done]
2. Add hyperparameter tuning with Ray Tune [done]
3. Add smoothed cross-entropy loss
4. Add *mixes (FMix, CutMix, MixUp)
5. External data
6. Ensemble of models: train a model for each fold, then average their predictions during inference [done]
7. Train for 15-20 epochs [done]
8. Test-time augmentation
9. Better ensemble prediction: majority vote, other...?
10. Train a ResNet model
11. Balanced classes instead of stratified?
12. Verify per-class accuracy
13. AdaBound: "as good as SGD and as fast as Adam"
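
The two ensemble ideas in the list above (averaging fold predictions vs. majority vote) can disagree on the same inputs. A minimal, framework-free sketch, assuming each fold model outputs per-class probabilities. The function names and the toy numbers are hypothetical and are not taken from `ensemble_inference`:

```python
from collections import Counter


def average_vote(fold_probs):
    """Average class probabilities across models, then take the argmax per sample."""
    n_models = len(fold_probs)
    n_samples = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    preds = []
    for i in range(n_samples):
        mean = [sum(m[i][c] for m in fold_probs) / n_models for c in range(n_classes)]
        preds.append(max(range(n_classes), key=mean.__getitem__))
    return preds


def majority_vote(fold_probs):
    """Each model votes for its own argmax; ties go to the lowest class index."""
    n_samples = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    preds = []
    for i in range(n_samples):
        votes = Counter(max(range(n_classes), key=m[i].__getitem__) for m in fold_probs)
        best = max(votes.values())
        preds.append(min(c for c, v in votes.items() if v == best))
    return preds


# Toy probabilities (made up): 3 models, 2 samples, 3 classes.
fold_probs = [
    [[0.40, 0.60, 0.00], [0.1, 0.2, 0.7]],
    [[0.45, 0.55, 0.00], [0.2, 0.1, 0.7]],
    [[0.90, 0.10, 0.00], [0.3, 0.3, 0.4]],
]
avg_preds = average_vote(fold_probs)
vote_preds = majority_vote(fold_probs)
print(avg_preds, vote_preds)
```

On the first toy sample, one very confident model outweighs two mildly confident ones under averaging, while majority vote sides with the two. Which behaviour is preferable depends on how well calibrated the fold models are.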

In [1]:
import os
from datetime import datetime, timedelta
import time
import random
import warnings
import joblib
import gc
import errno
import shutil

# My modules
from config import Config
from logger import init_logger
from common_utils import (set_seeds, read_csvs, stratify_split, setup_model_optimizer, 
                          get_data_dfs, get_loaders, create_holdout_loader, get_schd_crit)
from model import Model
from train_loop_functions import train_epoch, valid_epoch, ensemble_inference
from cassava_dataset import CassavaDataset
from early_stopping import EarlyStopping

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score

import torch
from torch import nn
from torch.cuda.amp import GradScaler
from torch.utils.tensorboard import SummaryWriter

In [2]:
%load_ext autoreload
%autoreload 2

## Preliminaries

In [3]:
label_map = pd.read_json(Config.data_dir + '/label_num_to_disease_map.json', orient='index')
label_map

| label_num | disease |
|---|---|
| 0 | Cassava Bacterial Blight (CBB) |
| 1 | Cassava Brown Streak Disease (CBSD) |
| 2 | Cassava Green Mottle (CGM) |
| 3 | Cassava Mosaic Disease (CMD) |
| 4 | Healthy |
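
With the label map it is easy to report accuracy per disease class rather than a single overall number. A small sketch with made-up labels and predictions. `per_class_accuracy` is a hypothetical helper, and the value it reports for each class is the fraction of that class's samples predicted correctly (its recall):

```python
from collections import defaultdict

# Class names copied from the label map shown above.
LABEL_NAMES = {
    0: 'Cassava Bacterial Blight (CBB)',
    1: 'Cassava Brown Streak Disease (CBSD)',
    2: 'Cassava Green Mottle (CGM)',
    3: 'Cassava Mosaic Disease (CMD)',
    4: 'Healthy',
}


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of samples with that true label predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # Classes with no true samples are left out instead of dividing by zero.
    return {label: correct[label] / total[label] for label in sorted(total)}


# Made-up labels and predictions, for illustration only.
y_true = [3, 3, 3, 3, 4, 4, 0, 1]
y_pred = [3, 3, 3, 4, 4, 3, 0, 2]
report = per_class_accuracy(y_true, y_pred)
for label, acc in report.items():
    print(f'{LABEL_NAMES[label]}: {acc:.2f}')
```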


In [4]:
set_seeds(Config.seed)
LOGGER = init_logger() # uses Python's logging framework

## Training and Validation Functions

References:

- Gradient scaling: https://pytorch.org/docs/stable/notes/amp_examples.html
- Gradient accumulation: https://towardsdatascience.com/what-is-gradient-accumulation-in-deep-learning-ec034122cfa
- Training loop: https://towardsdatascience.com/deep-learning-model-training-loop-e41055a24b73
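
A framework-free sketch of why gradient accumulation usually divides each micro-batch loss by the number of accumulation steps: the accumulated gradient then matches the gradient of one large batch. It assumes equal-sized micro-batches and a mean-reduced loss. `mse_grad` and `accumulated_grad` are illustrative names, and nothing here reflects how `train_epoch` is actually implemented:

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accum_iter):
    """Sum micro-batch gradients, each scaled by 1/accum_iter."""
    assert len(xs) % accum_iter == 0, 'sketch assumes equal-sized micro-batches'
    size = len(xs) // accum_iter
    grad = 0.0  # plays the role of the .grad buffer that backward() adds into
    for i in range(accum_iter):
        lo, hi = i * size, (i + 1) * size
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_iter
    return grad


# Made-up data for a one-parameter model y = w * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

full = mse_grad(w, xs, ys)                          # one batch of 8
accum = accumulated_grad(w, xs, ys, accum_iter=4)   # four micro-batches of 2
print(full, accum)
```

In a PyTorch loop the same effect comes from calling `backward()` on `loss / accum_iter` for each micro-batch and stepping the optimizer only every `accum_iter` batches.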

In [5]:
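def train_valid_test(train_folds_df, fold, resultsStore, device, 
                     experiment_name_dir, holdout_dataloader, holdout_targets, tb_writer):
    '''
    Trains the model for up to Config.epochs epochs on a given fold, then
    evaluates the best checkpoint on the holdout set.

    train_folds_df: the dataset with a column for fold number
    fold: an integer representing the fold used for validation

    Returns a tuple of:
      - a DataFrame with only the rows used for validation, along with the model's predictions
      - the validation accuracy of the best checkpoint
      - the holdout accuracy of the best checkpoint
      - the per-epoch training losses and validation losses
    '''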
    
    # -------- DATASETS AND LOADERS --------
    # select one of the folds, create train & validation set loaders
    train_df, valid_df = get_data_dfs(train_folds_df, fold)
    train_dataloader, valid_dataloader = get_loaders(train_df, valid_df,
                                                     Config.train_bs, 
                                                     Config.data_dir+'/train_images')
    
    
    # make model and optimizer
    model, optimizer = setup_model_optimizer(Config.model_arch, 
                                           Config.lr, 
                                           Config.is_amsgrad, 
                                           num_labels=train_folds_df.label.nunique(), 
                                           weight_decay=Config.weight_decay,
                                           fc_layer={"middle_fc": False, "middle_fc_size": 0},
                                           device=device,
                                           checkpoint=None)

    scheduler, criterion = get_schd_crit(optimizer)
    
    accuracy = 0.
    best_val_loss = float('inf')
    train_losses, val_losses = [], []
    early_stop = EarlyStopping('val_loss', LOGGER)
    # create the gradient scaler once so its scale factor persists across epochs
    scaler = GradScaler()

    for e in range(Config.epochs):
        epoch_start_time = time.time()
        LOGGER.info(f'Training epoch {e+1}/{Config.epochs}')
        
        # -------- TRAIN --------
        avg_training_loss = train_epoch(train_dataloader, model, 
                                      criterion, optimizer, 
                                      scheduler, scaler, 
                                      Config.accum_iter, LOGGER,
                                      device, tb_writer, fold, e)

        # -------- VALIDATE --------
        avg_validation_loss, preds = valid_epoch(valid_dataloader, model, 
                                                 criterion, LOGGER, device, 
                                                 tb_writer, fold, e)
        
        train_losses.append(avg_training_loss)
        val_losses.append(avg_validation_loss)

        # -------- SCORE METRICS & LOGGING FOR THIS EPOCH --------
        validation_labels = valid_df[Config.target_col].values
        accuracy = accuracy_score(y_true=validation_labels, y_pred=preds)
       
        epoch_elapsed_time = time.time() - epoch_start_time
        tb_writer.add_scalar(f'Avg Epoch Train Loss Fold {fold}', avg_training_loss, e)
        tb_writer.add_scalar(f'Avg Epoch Val Loss Fold {fold}', avg_validation_loss, e)
        tb_writer.add_scalar(f'Epoch Val Accuracy Fold {fold}', accuracy, e)
        
        LOGGER.info(f'\nEpoch training summary:\n Fold {fold+1}/{Config.fold_num} | ' + \
                    f'Epoch: {e+1}/{Config.epochs} | ' + \
                    f'Epoch time: {epoch_elapsed_time} sec\n' + \
                    f'Training loss: {avg_training_loss} | ' + \
                    f'Validation loss: {avg_validation_loss} | ' + \
                    f'Accuracy: {accuracy}')
        
        early_stop(avg_validation_loss)
        if early_stop.stop: break
            
        # --------SAVE MODEL --------
        if avg_validation_loss < best_val_loss: 
            best_val_loss = avg_validation_loss
            torch.save({'model': model.state_dict(), 
                        'accuracy': accuracy, 
                        'preds': preds,
                        'val_loss': best_val_loss,
                        'fold': fold
                       },
                      Config.save_dir + f'/{experiment_name_dir}/{Config.model_arch}_fold{fold}.pth')
            LOGGER.info('Saved model!')
        LOGGER.info('----------------')
        # -------- UPDATE LR --------
        if scheduler and e > 4:
            if Config.scheduler == 'ReduceLROnPlateau':
                scheduler.step(avg_validation_loss)
            elif Config.scheduler == 'CosineAnnealingLR' or Config.scheduler == 'CosineAnnealingWarmRestarts':
                scheduler.step()
        gc.collect()

    # -------- TEST ON HOLDOUT SET --------
    # load best model
    checkpoint = torch.load(Config.save_dir + f'/{experiment_name_dir}/{Config.model_arch}_fold{fold}.pth')
    model.load_state_dict(checkpoint['model']) 
    # test
    _, holdout_preds = valid_epoch(holdout_dataloader, model, criterion, LOGGER, device, tb_writer, holdout=True)
    holdout_accuracy = accuracy_score(y_true=holdout_targets, y_pred=holdout_preds)
    
    valid_df['prediction'] = checkpoint['preds']
    del model
    del optimizer
    del train_dataloader
    del valid_dataloader
    return valid_df, checkpoint['accuracy'], holdout_accuracy, train_losses, val_losses
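
`EarlyStopping` is imported from a local module that is not shown here. Below is a hypothetical stand-in with the same call pattern (`early_stop(loss)`, then check `.stop`). It assumes the class counts consecutive epochs without improvement and uses a patience of 4, as the fold 1 log message ("has not seen improvement in 4 epochs") suggests:

```python
class PatienceStopper:
    """Hypothetical stand-in for EarlyStopping: stop after `patience` epochs without improvement."""

    def __init__(self, patience=4, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_epochs = 0
        self.stop = False

    def __call__(self, value):
        if value < self.best - self.min_delta:
            # New best value: remember it and reset the counter.
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.stop = True


# Validation losses logged for fold 1, epochs 8 to 12 (rounded to 4 decimals).
val_losses = [0.3285, 0.3468, 0.3423, 0.3424, 0.3406]
stopper = PatienceStopper(patience=4)
stopped_at = None
for i, loss in enumerate(val_losses):
    stopper(loss)
    if stopper.stop:
        stopped_at = i
        break
print(stopped_at, stopper.best)
```

With these values the stopper fires on the fourth epoch after the best one, which matches the point where the fold 1 log reports the early stop.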

## Main loop

In [6]:
class Results:
    def __init__(self):
        self.fold_to_predictions = []
        self.fold_to_accuracy = []
        self.train_losses = []
        self.val_losses = []

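def main(experiment_name_dir, kaggle):
    """
    Entry point to training and inference.

    experiment_name_dir (required): a name for a directory in ./trained-models
    kaggle (required): whether to run inference on the Kaggle test set instead of the holdout set

    Returns the Results object holding per-fold predictions, accuracies, and losses.
    """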
    base_experiment_filename = Config.save_dir + f'/{experiment_name_dir}/{Config.model_arch}_fold'
    try:
        # -------- SETUP --------
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        resultsStore = Results()
        tb_writer = SummaryWriter(f'./runs/{experiment_name_dir}')
        
        # -------- LOAD DATA FROM FILE --------
        data_df, sample_df, holdout_df = read_csvs(Config.data_dir, Config.debug, test_proportion=0.15)
        folds = stratify_split(data_df, Config.fold_num, Config.seed, Config.target_col)
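        # placeholders: this function never loads the Kaggle test set, so kaggle=True will not work as written
        test_df, test_dataloader = None, None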
        
        # create holdout dataloader to test on totally unseen data
        holdout_dataloader, holdout_targets = create_holdout_loader(holdout_df, Config.data_dir + '/train_images')   

        experiment_list = os.listdir(Config.save_dir)
        if not Config.inference and experiment_name_dir in experiment_list: # resume training from the last fold's checkpoint
            last_fold = len(os.listdir(Config.save_dir + f'/{experiment_name_dir}')) - 1
            if last_fold >= 0: 
                print(f'Experiment exists. Resuming training from latest fold ({last_fold}).')

                checkpoint = torch.load(base_experiment_filename + f'{last_fold}.pth')

                #resume(checkpoint, fold, model, optimizer)
        else: # -------- START TRAINING --------
            if Config.train:
                # make directory for experiment
                try:
                    os.makedirs(Config.save_dir + f'/{experiment_name_dir}')
                    # copy the config file to this directory
                    shutil.copy2('./config.py', Config.save_dir + f'/{experiment_name_dir}')
                except OSError as e:
                    if e.errno != errno.EEXIST:
                        raise
                LOGGER.info('\n========== Running training ==========\n')

                aggregated_output_df = pd.DataFrame()
                
                time_training_start = time.time()
                for fold in range(Config.fold_num):    
                    # _df is the validation prediction output
                    # _df.columns: ['image_id', 'label', 'fold', 'prediction']
                    _df, val_accuracy, holdout_accuracy, train_losses, val_losses = train_valid_test(
                                                                        folds, fold, 
                                                                        resultsStore, device,
                                                                        experiment_name_dir,
                                                                        holdout_dataloader, 
                                                                        holdout_targets, tb_writer)
                    resultsStore.train_losses.append(train_losses)
                    resultsStore.val_losses.append(val_losses)
                    
                    if aggregated_output_df.empty:
                        aggregated_output_df[['image_id', 'label']] = _df[['image_id', 'label']]
                    aggregated_output_df[['prediction_fold'+str(fold)]] = _df['prediction']

                    resultsStore.fold_to_predictions.append(_df[['image_id', 'label', 'prediction']])
                    resultsStore.fold_to_accuracy.append((val_accuracy, holdout_accuracy))

                    LOGGER.info(f'========== fold: {fold+1}/{Config.fold_num} result ==========')
                    LOGGER.info(f'Validation Accuracy: {val_accuracy}')
                    LOGGER.info(f'Holdout Accuracy: {holdout_accuracy}')

                # Cross validation
                time_elapsed_training = time.time() - time_training_start 
                LOGGER.info(f"Training time: {str(timedelta(seconds=time_elapsed_training))}")
                LOGGER.info("========== CV ==========") # best results across all folds
                LOGGER.info(f"{resultsStore.fold_to_accuracy}")

                # Save result
                aggregated_output_df.to_csv(Config.save_dir + f'/{experiment_name_dir}/aggregated_output_df.csv', index=False)                
                
        if Config.inference: # runs inference on all trained models, averages result
            LOGGER.info('\n========== Running inference ==========\n')
            
            model_states = [torch.load(base_experiment_filename + f'{fold}.pth')['model']
                            for fold in range(Config.fold_num)]
            assert len(model_states) == Config.fold_num
            
            
            if not kaggle: 
                loader = holdout_dataloader
                num_samples = len(holdout_df)
            else: 
                loader = test_dataloader 
                num_samples = len(test_df)
            
            inference_start = time.time()
            
            predictions = ensemble_inference(model_states, Config.model_arch, 
                                    data_df.label.nunique(), loader, num_samples, device)
            
            inference_elapsed = time.time() - inference_start
            LOGGER.info(f"Inference time: {str(timedelta(seconds=inference_elapsed))}")
            
            if not kaggle:
                holdout_accuracy = accuracy_score(y_true=holdout_targets, y_pred=predictions)
                LOGGER.info(f"Ensemble model holdout accuracy: {holdout_accuracy}")
            
            # submission
            if kaggle:
                submission = pd.DataFrame()
                submission['image_id'] = test_df['image_id']
                submission['label'] = predictions
                submission.to_csv('submission.csv', index=False)
        return resultsStore
    finally: 
        torch.cuda.empty_cache()
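
`stratify_split` lives in `common_utils` and is not shown. A minimal sketch of the idea, assuming it assigns each row a fold index so that every fold keeps roughly the same class proportions. `assign_stratified_folds` is a hypothetical name, and scikit-learn's `StratifiedKFold` is the usual tool for this in practice:

```python
import random
from collections import Counter, defaultdict


def assign_stratified_folds(labels, n_folds, seed):
    """Return one fold index per sample so each fold has a similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [None] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)               # randomise which samples land in which fold
        for pos, idx in enumerate(indices):
            folds[idx] = pos % n_folds     # deal the class out round-robin
    return folds


# Made-up imbalanced labels: 12 of class 3, 8 of class 4, 4 of class 0.
labels = [3] * 12 + [4] * 8 + [0] * 4
folds = assign_stratified_folds(labels, n_folds=4, seed=42)
per_fold = [Counter(l for l, f in zip(labels, folds) if f == k) for k in range(4)]
print(per_fold)
```

Each of the four folds ends up with three samples of class 3, two of class 4, and one of class 0, so validation accuracy is measured on the same class mix in every fold.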

In [7]:
resultsStore = None

In [8]:
if __name__ == '__main__':
    try:
        print('Running in debug mode:', Config.debug)
        resultsStore = main(experiment_name_dir='exp3_adabound', kaggle=False)
    except KeyboardInterrupt:
        pass





Running in debug mode: False


Training epoch 1/20
  0%|          | 0/853 [00:00<?, ?it/s]

Epoch     0: adjusting learning rate of group 0 to 1.2185e-03.


	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /opt/conda/conda-bld/pytorch_1607370128159/work/torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)
[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:41<00:00,  1.85it/s]

[TRAIN] batch loss: 0.07326995573309004
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:38<00:00,  1.46it/s]

[VAL] batch loss: 0.36422240853934856

Epoch training summary:
 Fold 1/4 | Epoch: 1/20 | Epoch time: 559.6121728420258 sec
Training loss: 0.07326995573309004 | Validation loss: 0.36422240853934856 | Accuracy: 0.8759621728612271

Saved model!
Training epoch 2/20
[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:38<00:00,  1.86it/s]

[TRAIN] batch loss: 0.05338739637713465
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:39<00:00,  1.44it/s]

[VAL] batch loss: 0.34871530835028297

Epoch tra

Epoch     1: adjusting learning rate of group 0 to 1.1911e-03.


[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:38<00:00,  1.86it/s]

[TRAIN] batch loss: 0.04031860592948646
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:37<00:00,  1.47it/s]

[VAL] batch loss: 0.33997241988569704

Epoch training summary:
 Fold 1/4 | Epoch: 7/20 | Epoch time: 555.9259831905365 sec
Training loss: 0.04031860592948646 | Validation loss: 0.33997241988569704 | Accuracy: 0.887618209808665

Training epoch 8/20
  0%|          | 0/853 [00:00<?, ?it/s]

Epoch     2: adjusting learning rate of group 0 to 1.1117e-03.


[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:39<00:00,  1.86it/s]

[TRAIN] batch loss: 0.038369858219285014
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:36<00:00,  1.48it/s]

[VAL] batch loss: 0.3284887628486523

Epoch training summary:
 Fold 1/4 | Epoch: 8/20 | Epoch time: 555.8765950202942 sec
Training loss: 0.038369858219285014 | Validation loss: 0.3284887628486523 | Accuracy: 0.8895975368374752

Saved model!
Training epoch 9/20
  0%|          | 0/853 [00:00<?, ?it/s]

Epoch     3: adjusting learning rate of group 0 to 9.8797e-04.


[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:29<00:00,  1.90it/s]

[TRAIN] batch loss: 0.03704073981426717
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:37<00:00,  1.47it/s]

[VAL] batch loss: 0.34678934359519215

Epoch training summary:
 Fold 1/4 | Epoch: 9/20 | Epoch time: 546.9207141399384 sec
Training loss: 0.03704073981426717 | Validation loss: 0.34678934359519215 | Accuracy: 0.8904772377391686

Training epoch 10/20
  0%|          | 0/853 [00:00<?, ?it/s]

Epoch     4: adjusting learning rate of group 0 to 8.3207e-04.


[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:38<00:00,  1.86it/s]

[TRAIN] batch loss: 0.03663248365257996
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:36<00:00,  1.48it/s]

[VAL] batch loss: 0.34234492366130537

Epoch training summary:
 Fold 1/4 | Epoch: 10/20 | Epoch time: 555.8240222930908 sec
Training loss: 0.03663248365257996 | Validation loss: 0.34234492366130537 | Accuracy: 0.8935561908950956

Training epoch 11/20
  0%|          | 0/853 [00:00<?, ?it/s]

Epoch     5: adjusting learning rate of group 0 to 6.5925e-04.


[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:38<00:00,  1.86it/s]

[TRAIN] batch loss: 0.034545274017376504
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:37<00:00,  1.47it/s]

[VAL] batch loss: 0.34237468406766447

Epoch training summary:
 Fold 1/4 | Epoch: 11/20 | Epoch time: 556.1295082569122 sec
Training loss: 0.034545274017376504 | Validation loss: 0.34237468406766447 | Accuracy: 0.8931163404442489

Training epoch 12/20
  0%|          | 0/853 [00:00<?, ?it/s]

Epoch     6: adjusting learning rate of group 0 to 4.8643e-04.


[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:39<00:00,  1.86it/s]

[TRAIN] batch loss: 0.03389213820807483
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:38<00:00,  1.45it/s]

[VAL] batch loss: 0.3406490725497683

Epoch training summary:
 Fold 1/4 | Epoch: 12/20 | Epoch time: 558.1848247051239 sec
Training loss: 0.03389213820807483 | Validation loss: 0.3406490725497683 | Accuracy: 0.8913569386408621

Metric val_loss has not seen improvement in 4 epochs. Early stop.
[VAL] Processing batch 101: 100%|██████████| 101/101 [01:10<00:00,  1.43it/s]

[VAL] batch loss: 0.34457306497462903
Validation Accuracy: 0.8895975368374752
Holdout Accuracy: 0.8800623052959502
Training epoch 1/20
  0%|          | 0/853 [00:00<?, ?it/s]

Epoch     0: adjusting learning rate of group 0 to 1.2185e-03.


[TRAIN] Processing batch 853: 100%|██████████| 853/853 [07:30<00:00,  1.89it/s]

[TRAIN] batch loss: 0.06988417021144107
[VAL] Processing batch 143: 100%|██████████| 143/143 [01:35<00:00,  1.50it/s]

[VAL] batch loss: 0.40336341572391404

Epoch training summary:
 Fold 2/4 | Epoch: 1/20 | Epoch time: 545.9302754402161 sec
Training loss: 0.06988417021144107 | Validation loss: 0.40336341572391404 | Accuracy: 0.8660655377171762

Saved model!
Training epoch 2/20
[TRAIN] Processing batch 297:  35%|███▍      | 297/853 [02:35<04:50,  1.91it/s]
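
The learning rates printed in the output above are consistent with a cosine annealing schedule. A quick check, assuming a minimum learning rate of 1e-4 and a period of 10 scheduler steps. Both values are inferred by fitting the logged numbers and are not read from `Config`:

```python
import math


def cosine_annealing_lr(step, eta_max, eta_min, t_max):
    """Closed-form cosine annealing: eta_max at step 0, eta_min at step t_max."""
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2


ETA_MAX = 1.2185e-3  # initial learning rate, as printed in the log
ETA_MIN = 1e-4       # assumption: inferred from the logged values
T_MAX = 10           # assumption: inferred from the logged values

# Learning rates printed in the log for scheduler steps 0 to 6.
logged = [1.2185e-3, 1.1911e-3, 1.1117e-3, 9.8797e-4, 8.3207e-4, 6.5925e-4, 4.8643e-4]
computed = [cosine_annealing_lr(s, ETA_MAX, ETA_MIN, T_MAX) for s in range(len(logged))]

for step, (got, want) in enumerate(zip(computed, logged)):
    print(f'step {step}: computed {got:.4e}, logged {want:.4e}')
```

Because the training loop only steps the scheduler once `e > 4`, the scheduler step index lags the training epoch: step 1 is logged just before training epoch 7.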
