## Stratified K-Fold Cross-Validation and Resnet34 with fast.ai

- The first notebook of this series was a simple baseline using resnet34: [Fast Resnet34 with Fastai](https://www.kaggle.com/code/fmussari/fast-resnet34-with-fastai)  
- In this notebook we explore that same model, ensembling the models trained on 5 folds to see how much that improves accuracy.
  


<img src="https://drive.google.com/uc?export=view&id=1EucGY8cJYJiuAZHBp95UdS22zVeWvyjl" width="500">

## Acknowledgements

- [Fast Resnet34 with Fastai](https://www.kaggle.com/code/fmussari/fast-resnet34-with-fastai)

**fastai course:**
- [Practical Deep Learning for Coders (a UQ collaboration with fast.ai)](https://itee.uq.edu.au/event/2022/practical-deep-learning-coders-uq-fastai)  

**Jeremy's Notebook Series:**
- [First Steps: Road to the Top, Part 1](https://www.kaggle.com/code/jhoward/first-steps-road-to-the-top-part-1)
- [Small models: Road to the Top, Part 2](https://www.kaggle.com/code/jhoward/small-models-road-to-the-top-part-2)
- [Scaling Up: Road to the Top, Part 3](https://www.kaggle.com/code/jhoward/scaling-up-road-to-the-top-part-3)
- [Multi-target: Road to the Top, Part 4](https://www.kaggle.com/code/jhoward/multi-target-road-to-the-top-part-4)

## K-Fold

- K-Fold is a technique in which the data is divided into K parts (folds). In this case we are going to use K=5.
- We will run 5 trainings, each with a different validation set: in each one, a single fold is held out for validation and the other four are used for training.
- This way we end up with 5 models trained on slightly different data (any two training sets share three of their four folds), each validated on a completely different set.
- We don't know which of them is the best model, but we can assume that averaging their predictions generalizes better than any single one.
- We are going to use [KFold from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).
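
As a minimal sketch of the idea above, the following splits 10 toy items (a stand-in for the training images, not the competition data) into 5 folds and checks that every item is used for validation exactly once. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

items = np.arange(10)  # toy stand-in for 10 training images
kf = KFold(n_splits=5, shuffle=True, random_state=42)

valid_seen = []
for fold, (train_idx, valid_idx) in enumerate(kf.split(items)):
    # 8 items train the model, the remaining 2 validate it
    print(f"fold {fold}: train={len(train_idx)} valid={sorted(valid_idx.tolist())}")
    valid_seen.extend(valid_idx.tolist())

# Every item lands in a validation set exactly once across the 5 folds
every_item_validated_once = sorted(valid_seen) == items.tolist()
print(every_item_validated_once)
```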

## Installing the libraries


In [1]:
# fastkaggle allows you to work locally and then submit the results and notebook to Kaggle

try: import fastkaggle

except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *
from sklearn.model_selection import KFold
import plotly.express as px
from datetime import datetime as dt


In [2]:
competition = 'paddy-disease-classification'
path = setup_comp(competition, install='fastai "timm>=0.6.2.dev0"')

from fastai.vision.all import *
#from scipy.special import softmax, log_softmax
import gc

## Setting data paths

In [3]:
# train images
train_path = path / 'train_images'
train_files = get_image_files(train_path)

# test images
test_path = path/'test_images'
test_files = get_image_files(test_path).sorted()

# sample submission
sample_submission = pd.read_csv(path/'sample_submission.csv')

# train labels
train_df = pd.read_csv(path / 'train.csv')


### target distribution

In [4]:
train_df.label.value_counts() * 100 / len(train_df)

normal                      16.950130
blast                       16.700298
hispa                       15.316614
dead_heart                  13.856058
tungro                      10.454502
brown_spot                   9.272605
downy_mildew                 5.957529
bacterial_leaf_blight        4.602671
bacterial_leaf_streak        3.651388
bacterial_panicle_blight     3.238205
Name: label, dtype: float64

## Stratified Folding
- If we apply sklearn KFold to the whole dataset, we could end up with 5 folds whose target distributions differ from that of the full dataset.
- sklearn provides [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) to ensure that each fold preserves the percentage of samples of each target class.
- Or, as we are going to do here, we can apply K-Fold individually to each label. If we randomly divide each class into 5 parts, then when we join the pieces each fold has the same distribution as the full dataset.
- To keep track of which image belongs to which fold, we create a field called `kfold` and initialize it with `-1`:

In [5]:
train_df['kfold'] = -1

# Number of splits
n_folds = 5
reverse_order = False  # avoids shadowing the built-in `reversed`

# For each label we are going to create n_folds folds
for label in train_df.label.unique():
    
    # KFold makes the first folds one sample larger when a class does not divide evenly,
    # so we alternate the numbering (0..n_folds-1, then n_folds-1..0) between labels
    # to keep the overall fold sizes balanced
    folds = list(range(n_folds))
    if reverse_order: folds.reverse()
    
    kf = KFold(n_splits=n_folds, random_state=42, shuffle=True)
    
    # Indices for each label
    label_idxs = train_df[train_df.label==label].index
    
    # Each split yields the positions (within label_idxs) of one validation fold
    for _, valid_index in kf.split(label_idxs):

        actual_fold = folds.pop(0)
        df_index = label_idxs[valid_index]
        train_df.loc[df_index, 'kfold'] = actual_fold
    reverse_order = not reverse_order
        

In [6]:
train_df.sample(5)

Unnamed: 0,image_id,label,variety,age,kfold
6165,106728.jpg,hispa,ADT45,50,2
6892,105414.jpg,hispa,AndraPonni,65,2
4639,108404.jpg,dead_heart,ADT45,72,1
1931,109459.jpg,blast,ADT45,70,1
4290,104517.jpg,dead_heart,ADT45,70,0
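
For comparison, here is a minimal sketch of the `StratifiedKFold` alternative mentioned in the Stratified Folding section. It is not what this notebook runs. The labels below are hypothetical toy data, and `placeholder_X` and `fold_of` are names introduced only for this illustration.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 20 'blast' and 10 'hispa' samples
labels = np.array(['blast'] * 20 + ['hispa'] * 10)
placeholder_X = np.zeros((len(labels), 1))  # features are not used to build the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_of = np.full(len(labels), -1)  # same role as the `kfold` column
for fold, (_, valid_idx) in enumerate(skf.split(placeholder_X, labels)):
    fold_of[valid_idx] = fold

# Each fold keeps the 2:1 class ratio of the full toy dataset
counts_per_fold = [Counter(labels[fold_of == f].tolist()) for f in range(5)]
for f, counts in enumerate(counts_per_fold):
    print(f, dict(counts))
```

Both approaches aim at the same outcome: every fold mirrors the class proportions of the whole training set.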


### plot distributions by fold

In [7]:
df = train_df.groupby(['label', 'kfold']).size().reset_index()
df.columns = ['label', 'kfold', 'count']
#df.kfold = df.kfold.astype('str')

fig = px.bar(
    df, x="kfold", y="count",
    color='label', barmode='group',
    height=400
)
fig.show()
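
A numeric check can complement the plot. The sketch below uses a small hypothetical frame with the same `label` and `kfold` columns as `train_df` and computes the share of each label inside each fold; with stratified folds, every row should show the same proportions.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same columns used for the plot
toy = pd.DataFrame({
    'label': ['blast'] * 10 + ['hispa'] * 5,
    'kfold': [0, 1, 2, 3, 4] * 3,
})

# Rows are folds, columns are labels, values are the label share within the fold
share = pd.crosstab(toy.kfold, toy.label, normalize='index')
print(share)
```

Running the same `pd.crosstab` call on `train_df` should give near-identical rows, with small differences where a class size is not divisible by 5.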

## Custom Split Function
- In the first notebook we passed the following splitter to the `DataBlock`:
```
splitter=RandomSplitter(0.2, seed=42),
```
- In this case we are going to use a custom function as splitter.

### FuncSplitter
- fast.ai's `FuncSplitter` takes a function that returns `True` for items that belong to the validation set.
- So we are going to create a list of dictionaries (one per fold) containing that information.


In [8]:
img2valid = []

for fold in range(n_folds):
    train_df['is_valid'] = False
    idxs = train_df[train_df.kfold == fold].index
    train_df.loc[idxs, 'is_valid'] = True
    
    img2valid.append({ r.image_id: r.is_valid for _, r in train_df.iterrows() })

- There we are: a list of dictionaries, one per fold, mapping each image to `True` if it is in that fold's validation set. Now it is easy to look up, for any fold, whether an image belongs to the validation set.
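
To make the role of the lookup concrete, here is a simplified stand-in for a function-based splitter. It is a sketch of the idea, not fastai's implementation; `split_by_function`, `fold_lookup`, and the file names are hypothetical.

```python
from pathlib import Path

def split_by_function(items, is_valid):
    """Simplified stand-in for a function-based splitter (not fastai's code)."""
    valid_idxs = [i for i, item in enumerate(items) if is_valid(item)]
    train_idxs = [i for i, item in enumerate(items) if not is_valid(item)]
    return train_idxs, valid_idxs

# Hypothetical lookup for one fold: file name -> belongs to the validation set?
fold_lookup = {'a.jpg': False, 'b.jpg': True, 'c.jpg': False, 'd.jpg': True}
items = [Path('train_images/blast') / name for name in fold_lookup]

# The predicate looks up the file name, just like a per-fold dictionary would
train_idxs, valid_idxs = split_by_function(items, lambda p: fold_lookup[p.name])
print(train_idxs, valid_idxs)
```

Keying the dictionary on the file name works here because the image names are unique across class folders.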

## Dataloaders for fastai training


In [9]:
def get_datablock(i_fold, size, item_tfms, accum):
    
    def get_split(p):
        # For each fold, return if an image is in valid set or not
        return img2valid[i_fold][p.name]
    
    dblock = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        get_y=parent_label,
        # Custom Splitter
        splitter = FuncSplitter(get_split),
        item_tfms=item_tfms,
        batch_tfms=aug_transforms(size=size, min_scale=0.75)
    )
    return dblock.dataloaders(train_path, bs=64//accum)

## Training Function

In [10]:
def train(i_fold, arch, size=224, item_tfms=Resize(480, method='squish'), accum=1, epochs=16, lr=0.005):
    
    dls = get_datablock(i_fold=i_fold, size=size, item_tfms=item_tfms, accum=accum)
    print('- First 5 validation images:')
    print([each.name for each in dls.valid.items[:5]])
    
    cbs = GradientAccumulation(64) if accum!=1 else []
    
    # Force torchvision models instead of TIMM, when possible
    try: arch = eval(arch)
    except: arch = arch
        
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    print('- Fine Tuning')
    learn.fine_tune(epochs, lr)
    #print('- Getting predictions')
    #probs, _ = learn.get_preds(dl=dls.test_dl(test_files))
    probs = None # Return only tta_preds
    print('- Getting tta_predictions')
    preds, _ = learn.tta(dl=dls.test_dl(test_files))
    
    return probs, preds, dls.vocab

## Running the Model(s) on Selected Fold
- In this notebook, as in the [Fast Resnet34 with Fastai](https://www.kaggle.com/code/fmussari/fast-resnet34-with-fastai) notebook, we are going to experiment with resnet34.
- You can create a copy of this notebook and, just as Jeremy did in [Scaling Up: Road to the Top, Part 3](https://www.kaggle.com/code/jhoward/scaling-up-road-to-the-top-part-3), try different models with control over the validation set used for each model or experiment. Remember to set `accum` according to model size and available GPU memory.
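
The arithmetic behind `accum` can be sketched as follows, assuming that `GradientAccumulation(64)` counts items (not batches) before each optimizer step, as described in Jeremy's Part 3 notebook. The variable names are illustrative.

```python
target_items_per_step = 64  # value passed to GradientAccumulation in train()

rows = []
for accum in (1, 2, 4):
    batch_size = 64 // accum  # bs used by get_datablock
    batches_per_step = target_items_per_step // batch_size
    effective_batch = batch_size * batches_per_step
    rows.append((accum, batch_size, batches_per_step, effective_batch))
    print(f"accum={accum}: bs={batch_size}, batches per step={batches_per_step}, "
          f"effective batch={effective_batch}")
```

Under that assumption, a larger `accum` lowers the memory needed per batch while the number of items per weight update stays at 64.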

### resnet34 in each fold

In [11]:
models = {
    'resnet34': {
        (0, Resize(480, method='squish'), 224),
        (1, Resize(480, method='squish'), 224),
        (2, Resize(480, method='squish'), 224),
        (3, Resize(480, method='squish'), 224),
        (4, Resize(480, method='squish'), 224)
    }
    
}

In [12]:
predictions = []
tta_predictions = []

### run the experiments

In [13]:
exp = 1
for arch, details in models.items():
    
    for i_fold, item, size in details:
        print('////'*10)
        print('---Experiment', exp, '--', arch)
        print('fold: ', i_fold)
        print(item.name)
        
        preds, tta_preds, vocab = train(i_fold, arch, size, item_tfms=item, accum=1, epochs=20, lr=0.005)
        
        predictions.append(preds)
        tta_predictions.append(tta_preds)
        
        now = dt.now().strftime("%Y%m%d")
        filename = f'{now}-exp{exp}-{arch}-Fold{i_fold}.csv'
        print(f'Saving {filename}')
        
        sample_submission.label = vocab[tta_preds.argmax(axis=1)]
        sample_submission.to_csv(filename, index=False)
        
        gc.collect()
        torch.cuda.empty_cache()
        exp += 1

////////////////////////////////////////
---Experiment 1 -- resnet34
fold:  3
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
- First 5 validation images:
['106433.jpg', '107860.jpg', '107042.jpg', '100886.jpg', '105098.jpg']


Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth



- Fine Tuning


epoch,train_loss,valid_loss,error_rate,time
0,1.82106,1.007575,0.330773,01:59


epoch,train_loss,valid_loss,error_rate,time
0,0.762712,0.411636,0.12674,01:47
1,0.465117,0.326431,0.111858,01:47
2,0.35364,0.322618,0.088814,01:47
3,0.324914,0.378107,0.113298,01:47
4,0.301681,0.25977,0.076812,01:47
5,0.303149,0.260858,0.083053,01:48
6,0.253609,0.310548,0.090735,01:47
7,0.234559,0.203551,0.056169,01:47
8,0.193365,0.181542,0.046567,01:48
9,0.171235,0.184694,0.057129,01:49


- Getting tta_predictions


Saving 20220718-exp1-resnet34-Fold3.csv
////////////////////////////////////////
---Experiment 2 -- resnet34
fold:  1
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
- First 5 validation images:
['104765.jpg', '102416.jpg', '101046.jpg', '108310.jpg', '104555.jpg']
- Fine Tuning


epoch,train_loss,valid_loss,error_rate,time
0,1.789822,1.030524,0.343104,01:44


epoch,train_loss,valid_loss,error_rate,time
0,0.749005,0.437849,0.145123,01:47
1,0.434743,0.308224,0.099471,01:48
2,0.352224,0.293205,0.092744,01:47
3,0.331259,0.258127,0.079289,01:47
4,0.32214,0.31262,0.091302,01:47
5,0.318279,0.394702,0.112926,01:47
6,0.237401,0.252605,0.077367,01:47
7,0.23181,0.248449,0.067275,01:49
8,0.200864,0.204255,0.055262,01:47
9,0.15664,0.185162,0.051418,01:48


- Getting tta_predictions


Saving 20220718-exp2-resnet34-Fold1.csv
////////////////////////////////////////
---Experiment 3 -- resnet34
fold:  0
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
- First 5 validation images:
['100098.jpg', '102484.jpg', '107136.jpg', '105430.jpg', '101813.jpg']
- Fine Tuning


epoch,train_loss,valid_loss,error_rate,time
0,1.75952,1.06831,0.339423,01:45


epoch,train_loss,valid_loss,error_rate,time
0,0.771473,0.420047,0.13125,01:48
1,0.451689,0.284835,0.094231,01:49
2,0.350093,0.271084,0.0875,01:48
3,0.346398,0.356776,0.108173,01:47
4,0.290721,0.303694,0.088942,01:48
5,0.308513,0.370208,0.099038,01:48
6,0.268207,0.216736,0.079327,01:47
7,0.200613,0.210314,0.065865,01:47
8,0.196857,0.202242,0.057692,01:48
9,0.167895,0.14916,0.040385,01:47


- Getting tta_predictions


Saving 20220718-exp3-resnet34-Fold0.csv
////////////////////////////////////////
---Experiment 4 -- resnet34
fold:  2
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
- First 5 validation images:
['102734.jpg', '108930.jpg', '102019.jpg', '100464.jpg', '101369.jpg']
- Fine Tuning


epoch,train_loss,valid_loss,error_rate,time
0,1.758307,0.986137,0.330932,01:46


epoch,train_loss,valid_loss,error_rate,time
0,0.770776,0.397107,0.126321,01:49
1,0.447202,0.280368,0.086455,01:49
2,0.347407,0.269754,0.087896,01:48
3,0.327556,0.320429,0.104707,01:50
4,0.323262,0.324849,0.090298,01:48
5,0.296593,0.258956,0.073967,01:49
6,0.236839,0.250007,0.069164,01:51
7,0.217097,0.224519,0.060519,01:50
8,0.197059,0.148454,0.045149,01:50
9,0.144682,0.155358,0.040826,01:51


- Getting tta_predictions


Saving 20220718-exp4-resnet34-Fold2.csv
////////////////////////////////////////
---Experiment 5 -- resnet34
fold:  4
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
- First 5 validation images:
['109629.jpg', '109706.jpg', '104026.jpg', '110365.jpg', '107089.jpg']
- Fine Tuning


epoch,train_loss,valid_loss,error_rate,time
0,1.763906,1.049422,0.315233,01:45


epoch,train_loss,valid_loss,error_rate,time
0,0.741525,0.453382,0.14272,01:48
1,0.451914,0.354371,0.106199,01:49
2,0.344301,0.37021,0.101874,01:48
3,0.335128,0.452625,0.12494,01:48
4,0.322635,0.322908,0.094666,01:49
5,0.279385,0.327499,0.08938,01:48
6,0.226315,0.429295,0.113407,01:48
7,0.220384,0.255178,0.068236,01:48
8,0.199664,0.255937,0.066795,01:48
9,0.178228,0.25938,0.059106,01:48


- Getting tta_predictions


Saving 20220718-exp5-resnet34-Fold4.csv


## Ensembling

In [14]:
[each.shape for each in tta_predictions]

[torch.Size([3469, 10]),
 torch.Size([3469, 10]),
 torch.Size([3469, 10]),
 torch.Size([3469, 10]),
 torch.Size([3469, 10])]

In [15]:
avg_tta_predictions = torch.stack(tta_predictions).mean(0)
avg_tta_predictions.shape

torch.Size([3469, 10])
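
The averaging step can be illustrated on a toy example, using NumPy in place of torch. The probabilities below are hypothetical: 3 models, 2 test images, 3 classes.

```python
import numpy as np

# Hypothetical class probabilities from 3 models for 2 test images and 3 classes
fold_preds = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.4, 0.5, 0.1], [0.1, 0.6, 0.3]]),
    np.array([[0.5, 0.4, 0.1], [0.3, 0.2, 0.5]]),
]

# Stack to (n_models, n_images, n_classes), then average over the model axis
avg_preds = np.stack(fold_preds).mean(axis=0)

per_model_choice = [p.argmax(axis=1).tolist() for p in fold_preds]
ensemble_choice = avg_preds.argmax(axis=1).tolist()
print(per_model_choice)
print(ensemble_choice)
```

In this toy case each image has one model that disagrees with the other two, and the averaged probabilities side with the other two.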

## Final Submission

In [16]:
vocab[avg_tta_predictions.argmax(dim=1)]

(#3469) ['hispa','normal','blast','blast','blast','brown_spot','dead_heart','brown_spot','hispa','normal'...]

In [17]:
sample_submission.label = vocab[avg_tta_predictions.argmax(dim=1)]

sample_submission.to_csv('submission.csv', index=False)

## Conclusions

- It may be more useful to try different models with different random splits, as Jeremy did in [Scaling Up: Road to the Top, Part 3](https://www.kaggle.com/code/jhoward/scaling-up-road-to-the-top-part-3), instead of using cross-validation (a different validation set for each training).
- The results when running it locally for 16 epochs were the following:

<img src="https://drive.google.com/uc?export=view&id=1EucGY8cJYJiuAZHBp95UdS22zVeWvyjl" width="500">

- The ensembled model had a score of **0.98615**, which is better than the best score of an individual fold (0.98269).

In [18]:
# Pushing the notebook from my home PC to Kaggle

if not iskaggle:
    push_notebook(
        'fmussari', 
        'Stratified Cross-Validation, Resnet34 & Fastai',
        title='Stratified Cross-Validation, Resnet34 & Fastai',
        file='2022-07. Cross-validation and Resnet34 with Fastai [Submission].ipynb',
        competition=competition, 
        private=False, 
        gpu=True
    )