[Paddy Doctor: Paddy Disease Classification](https://www.kaggle.com/competitions/paddy-disease-classification/overview)

[Scaling Up: Road to the Top, Part 3](https://www.kaggle.com/code/jhoward/scaling-up-road-to-the-top-part-3)

# Memory and gradient accumulation

In [1]:
from fastai.vision.all import *
from fastkaggle import *

set_seed(42)

path = Path("./input")
tst_files = get_image_files(path / "test_images").sorted()

In this analysis our goal will be to train an ensemble of larger models with larger inputs. The challenge when training such models is generally GPU memory. Kaggle GPUs have 16280MiB of memory available, as of the time of writing. I like to try out my notebooks on my home PC, then upload them -- but I still need them to run OK on Kaggle (especially if it's a code competition, where this is required). My home PC has 24GiB cards, so just because it runs OK at home doesn't mean it'll run OK on Kaggle.

It's really helpful to be able to quickly try a few models and image sizes and find out what will run successfully. To make this quick, we can just grab a small subset of the data for running short epochs -- the memory use will still be the same, but it'll be much faster.

In [2]:
df = pd.read_csv(path / "train.csv")
df.label.value_counts()

label
normal                      1764
blast                       1738
hispa                       1594
dead_heart                  1442
tungro                      1088
brown_spot                   965
downy_mildew                 620
bacterial_leaf_blight        479
bacterial_leaf_streak        380
bacterial_panicle_blight     337
Name: count, dtype: int64

Let's use *bacterial_panicle_blight* since it's the smallest:

In [3]:
trn_path = path / "train_images" / "bacterial_panicle_blight"
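
If you'd rather not hard-code the folder name, the smallest class can be picked programmatically. Here's a minimal sketch in plain Python, using the counts printed above (the variable names are just for illustration; with pandas, `df.label.value_counts().idxmin()` should give the same label):

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = Counter({
    "normal": 1764, "blast": 1738, "hispa": 1594, "dead_heart": 1442,
    "tungro": 1088, "brown_spot": 965, "downy_mildew": 620,
    "bacterial_leaf_blight": 479, "bacterial_leaf_streak": 380,
    "bacterial_panicle_blight": 337,
})

# min() over the (label, count) pairs, keyed by count, gives the rarest label.
smallest_label, smallest_count = min(counts.items(), key=lambda kv: kv[1])
print(smallest_label, smallest_count)
```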

Now we'll set up a `train` function which is very similar to the steps we used for training in the last notebook. But there are a few significant differences...

The first is that I'm using a `finetune` argument to pick whether we are going to run the `fine_tune()` method, or the `fit_one_cycle()` method -- the latter is faster since it doesn't do an initial fine-tuning of the head. When we fine-tune in this function I also have it calculate and return the TTA predictions on the test set, since later on we'll be ensembling the TTA results of a number of models. Note also that we no longer have `seed=42` in the `ImageDataLoaders` line -- that means we'll have different training and validation sets each time we call this. That's what we'll want for ensembling, since it means that each model will use slightly different data.

The more important change is that I've added an `accum` argument to implement *gradient accumulation*. As you'll see in the code below, this does two things:
1. Divide the batch size by `accum`
2. Add the `GradientAccumulation` callback, passing in the target effective batch size (64)

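## Difference between `fine_tune()` and `fit_one_cycle()` in `fastai`

### 1. `fine_tune()`
- **Purpose**: Fine-tune a pretrained model.
- **How it works**:
  1. **First stage**: Train only the final layers (the newly added head), keeping the pretrained weights in the remaining layers frozen.
  2. **Second stage**: Unfreeze and fine-tune the whole model with a small learning rate.
- **When to use**: Adapting a pretrained model to a new dataset.

### 2. `fit_one_cycle()`
- **Purpose**: Train a model using the One Cycle Policy learning rate schedule, with no separate head-only stage.
- **How it works**: The learning rate is gradually increased and then decreased again over the course of training.
  - The warm-up helps keep early training stable, and the schedule tends to make training more efficient.
- **When to use**: Training a model from scratch, or retraining all layers of a pretrained model.

### Summary
- **`fine_tune()`**: Trains the head first, then fine-tunes the whole pretrained model.
- **`fit_one_cycle()`**: Trains whichever layers are currently trainable using a one-cycle learning rate schedule.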
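
To make the "increase, then decrease" shape of the one-cycle schedule concrete, here's a minimal sketch in plain Python. The function names are hypothetical, and the defaults (`div=25`, `div_final=1e5`, `pct_start=0.25`) reflect my understanding of fastai's defaults, so treat them as assumptions rather than a definitive copy of the library's implementation:

```python
import math

def cos_anneal(start, end, pct):
    "Cosine interpolation from `start` (pct=0) to `end` (pct=1)."
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle_lr(pct, lr_max=0.01, div=25.0, div_final=1e5, pct_start=0.25):
    "Learning rate at fraction `pct` (0..1) of the way through training."
    if pct < pct_start:
        # Warm-up phase: lr_max/div -> lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: lr_max -> lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.125, 0.25, 0.5, 1.0):
    print(f"{pct:.3f} -> {one_cycle_lr(pct):.6f}")
```

With `lr_max=0.01` (the value passed in the `train` function), the rate starts at `lr_max / div`, peaks at `lr_max` a quarter of the way in, and ends close to zero.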


In [4]:
def train(
    arch, size, item=Resize(480, method="squish"), accum=1, finetune=True, epochs=12
):
    dls = ImageDataLoaders.from_folder(
        trn_path,
        valid_pct=0.2,
        item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75),
        bs=64 // accum,  # number of images per mini-batch
    )
    # When accum is set, accumulate gradients across mini-batches and
    # apply a single weight update once 64 images have been processed
    cbs = GradientAccumulation(64) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    if finetune:
        learn.fine_tune(epochs, 0.01)
        return learn.tta(dl=dls.test_dl(tst_files))
    else:
        learn.unfreeze()
        learn.fit_one_cycle(epochs, 0.01)

In [5]:
# trn_path = path / "train_images"
# train(
#     "convnext_large_in22k",
#     (320, 224),
#     Resize((640, 480)),
#     epochs=12,
#     accum=4,
#     finetune=True,
# )

*Gradient accumulation* refers to a very simple trick: rather than updating the model weights after every batch based on that batch's gradient, instead keep *accumulating* (adding up) the gradients for a few batches, and then update the model weights with those accumulated gradients. In fastai, the parameter you pass to `GradientAccumulation` defines how many items must be seen before the weights are updated (64 in the `train` function above). Since we divide the batch size by `accum`, the gradients are therefore added up over `accum` mini-batches for each update. The resulting training loop is nearly mathematically identical to using the original batch size, but the amount of memory used is the same as using a batch size `accum` times smaller!

For instance, here's a basic example of a single epoch of a training loop without gradient accumulation:

```python
for x, y in dl:
    # backward() differentiates the loss with respect to the parameters coeffs;
    # the resulting gradients are stored in coeffs.grad
    calc_loss(coeffs, x, y).backward()
    # update the parameters by subtracting the gradient times the learning rate
    coeffs.data.sub_(coeffs.grad * lr)
    # reset the parameter gradients
    coeffs.grad.zero_()
```

Here's the same thing, but with gradient accumulation added (assuming a target effective batch size of 64):

```python
count = 0  # track count of items seen since last weight update
for x, y in dl:
    count += len(x)  # update count based on this minibatch size
    calc_loss(coeffs, x, y).backward()
    if count >= 64:  # count >= accumulation target, so do weight update
        coeffs.data.sub_(coeffs.grad * lr)
        coeffs.grad.zero_()
        count = 0  # reset count
```

The full implementation in fastai is only a few lines of code -- here's the [source code](https://github.com/fastai/fastai/blob/master/fastai/callback/training.py#L26).
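
To check the claim that accumulation gives (nearly) the same update as one big batch, here's a small runnable sketch using a toy one-weight model `y = w * x` in plain Python. The model, data, and helper `mse_grad` are made up for illustration. Each mini-batch gradient is scaled by its share of the 64 items; my understanding is that fastai's callback rescales the loss in a similar way, but treat that as an assumption:

```python
import random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(64)]
ys = [3.0 * x + random.gauss(0, 0.1) for x in xs]
w = 0.5  # current weight of the toy model y = w * x

def mse_grad(w, xb, yb):
    "Gradient of mean((w*x - y)**2) with respect to w for one batch."
    return sum(2 * (w * x - y) * x for x, y in zip(xb, yb)) / len(xb)

# Gradient from a single batch of 64 items.
full_grad = mse_grad(w, xs, ys)

# Four mini-batches of 16, accumulated before a single update.
accum = 4
bs = 64 // accum
acc_grad, count = 0.0, 0
for i in range(0, 64, bs):
    xb, yb = xs[i:i + bs], ys[i:i + bs]
    # Scale each mini-batch gradient by its share of the 64 items, so the
    # accumulated sum is a mean over all 64 rather than 4x too large.
    acc_grad += mse_grad(w, xb, yb) * len(xb) / 64
    count += len(xb)

print(full_grad, acc_grad, count)
```

The two gradients agree up to floating-point rounding. In a real network the match is only "nearly" exact, because layers whose behaviour depends on the batch (such as batch normalization) see the smaller mini-batches.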

To see the impact of gradient accumulation, consider this small model:

In [6]:
train("convnext_small_in22k", 128, epochs=1, accum=1, finetune=False)



epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:03


Let's create a function to find out how much memory it used, and also to then clear out the memory for the next run:

In [7]:
import gc


def report_gpu():
    print(
        f"{torch.cuda.memory_reserved() / (1024 ** 2):.3f} MB({torch.cuda.memory_reserved() / (1024 ** 3):.3f} GB) GPU memory used."
    )
    gc.collect()
    torch.cuda.empty_cache()

In [8]:
report_gpu()

3070.000 MB(2.998 GB) GPU memory used.


So with `accum=1` the GPU used around 3GB RAM. Let's try `accum=2`:

In [9]:
train("convnext_small_in22k", 128, epochs=1, accum=2, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:03


2010.000 MB(1.963 GB) GPU memory used.


As you see, the RAM usage has now gone down to 2GB. It's not halved since there's other overhead involved (for larger models this overhead is likely to be relatively lower).

Let's try `4`:

In [10]:
train("convnext_small_in22k", 128, epochs=1, accum=4, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:04


1474.000 MB(1.439 GB) GPU memory used.


The memory use is even lower!
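
As a rough back-of-the-envelope check on the "other overhead" mentioned above, we can fit a straight line (fixed overhead plus a per-image cost) through the first two measurements and see how well it predicts the third. This is only a sketch: the linear model is an assumption, and the numbers are the ones reported by `report_gpu()` in the runs above.

```python
# Reported memory (MB) from the three runs above, keyed by mini-batch size
# (bs = 64 // accum, so accum=1, 2, 4 gives bs=64, 32, 16).
measured = {64: 3070.0, 32: 2010.0, 16: 1474.0}

# Fit memory = overhead + per_image * batch_size through the first two points.
per_image = (measured[64] - measured[32]) / (64 - 32)
overhead = measured[64] - per_image * 64

predicted_16 = overhead + per_image * 16
print(f"per image: {per_image:.3f} MB, fixed overhead: {overhead:.0f} MB")
print(f"predicted at bs=16: {predicted_16:.0f} MB, measured: {measured[16]:.0f} MB")
```

Under this simple model, roughly 950 MB is fixed overhead that doesn't shrink with the batch size, which is why halving the batch size doesn't halve the memory.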

# Checking memory use

We'll now check the memory use for each of the archs and sizes we'll be training later, to ensure they all fit in 12GB RAM. For each of these, I tried `accum=1` first, and then doubled it any time the resulting memory use was over 12GB. As it turns out, `accum=4` was what I needed for every case.
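
The "start at 1 and keep doubling" procedure I followed by hand can be sketched as a small helper. Everything here is hypothetical: `find_accum` and `measure_mb` are illustrative names, and the stand-in measurements and the 1800 MB limit are chosen only so the example runs without a GPU.

```python
def find_accum(measure_mb, limit_mb, max_accum=64):
    """Return the smallest power-of-two accum whose measured memory fits.

    `measure_mb` is a hypothetical callable: accum -> memory use in MB.
    """
    accum = 1
    while accum <= max_accum:
        if measure_mb(accum) <= limit_mb:
            return accum
        accum *= 2  # over budget: halve the mini-batch size and retry
    raise RuntimeError(f"does not fit even with accum={max_accum}")

# Stand-in measurements taken from the convnext_small runs earlier.
fake_runs = {1: 3070, 2: 2010, 4: 1474}
print(find_accum(lambda a: fake_runs[a], limit_mb=1800))
```

In the notebook itself, `measure_mb` would need to run a short `train(..., accum=accum, finetune=False)` and then read the memory figure, as the cells below do one at a time.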

First, `convnext_large`:

In [11]:
train("convnext_large_in22k", 224, epochs=1, accum=4, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:07


6958.000 MB(6.795 GB) GPU memory used.


In [12]:
train("convnext_large_in22k", (320, 240), epochs=1, accum=4, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:09


8618.000 MB(8.416 GB) GPU memory used.


Here's `vit_large`.

In [13]:
train("vit_large_patch16_224", 224, epochs=1, accum=4, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:10


10014.000 MB(9.779 GB) GPU memory used.


Then finally our `swinv2` and `swin` models:

In [14]:
train("swinv2_large_window12_192_22k", 192, epochs=1, accum=4, finetune=False)
report_gpu()



epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:09


8072.000 MB(7.883 GB) GPU memory used.


In [15]:
train("swin_large_patch4_window7_224", 224, epochs=1, accum=4, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:08


7246.000 MB(7.076 GB) GPU memory used.


# Running the models

Using the previous notebook, I tried a bunch of different archs and preprocessing approaches on small models, and picked a few which looked good. We'll use a `dict` to list out the preprocessing approaches we'll use for each arch of interest based on that analysis:

In [16]:
res = 640, 480

In [17]:
models = {
    "convnext_large_in22k": {(Resize(res), (320, 224))},
    "vit_large_patch16_224": {(Resize(480, method="squish"), 224), (Resize(res), 224)},
    "swinv2_large_window12_192_22k": {
        (Resize(480, method="squish"), 192),
        (Resize(res), 192),
    },
    "swin_large_patch4_window7_224": {(Resize(res), 224)},
}

We'll need to switch to using the full training set of course!

In [18]:
trn_path = path / "train_images"

Now we're ready to train all these models. Remember that each is using a different training and validation set, so the results aren't directly comparable.

We'll append each set of TTA preds on the test set into a list called `tta_res`.

In [19]:
tta_res = []

for arch, details in models.items():
    for item, size in details:
        print("---", arch)
        print(size)
        print(item.name)
        tta_res.append(train(arch, size, item=item, accum=4))
        gc.collect()
        torch.cuda.empty_cache()

--- convnext_large_in22k
(320, 224)
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,0.864535,0.554393,0.161941,03:44


epoch,train_loss,valid_loss,error_rate,time
0,0.473602,0.248172,0.075444,04:46
1,0.334652,0.236502,0.067756,04:45
2,0.341211,0.252537,0.075444,04:45
3,0.27939,0.192348,0.055262,04:44
4,0.172098,0.16588,0.040846,04:43
5,0.165812,0.152433,0.034599,04:44
6,0.142954,0.135342,0.037963,04:45
7,0.122391,0.142625,0.031716,04:45
8,0.06545,0.127263,0.032677,04:45
9,0.053337,0.115736,0.027871,04:45


--- vit_large_patch16_224
224
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,1.00416,0.594354,0.197982,04:09


epoch,train_loss,valid_loss,error_rate,time
0,0.448514,0.238791,0.074483,05:12
1,0.36961,0.220536,0.067275,05:11
2,0.363891,0.213627,0.064873,05:11
3,0.344941,0.351641,0.085536,05:11
4,0.21482,0.184724,0.050937,05:11
5,0.146942,0.133766,0.035079,05:11
6,0.121613,0.124761,0.033638,05:11
7,0.083246,0.131639,0.031235,05:11
8,0.056405,0.112295,0.022585,05:11
9,0.055756,0.096729,0.019222,05:11


--- vit_large_patch16_224
224
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,1.005666,0.603615,0.193657,04:14


epoch,train_loss,valid_loss,error_rate,time
0,0.421142,0.253014,0.082653,05:18
1,0.35025,0.275275,0.072081,05:18
2,0.406433,0.27486,0.086016,05:19
3,0.38081,0.340463,0.103316,05:19
4,0.227668,0.18314,0.051898,05:19
5,0.130352,0.173375,0.052379,05:19
6,0.12647,0.136875,0.037001,05:21
7,0.081143,0.144465,0.036521,05:20
8,0.050411,0.13472,0.029313,05:20
9,0.040484,0.126039,0.030274,05:20


--- swinv2_large_window12_192_22k
192
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,0.867091,0.568214,0.192215,03:53


epoch,train_loss,valid_loss,error_rate,time
0,0.436477,0.255844,0.086977,04:39
1,0.369329,0.241038,0.072561,04:39
2,0.362735,0.26315,0.084575,04:39
3,0.312119,0.208607,0.064392,04:39
4,0.219426,0.195088,0.054781,04:39
5,0.190413,0.141191,0.039404,04:39
6,0.159968,0.147514,0.038443,04:39
7,0.117735,0.101439,0.031716,04:39
8,0.070113,0.082523,0.024027,04:40
9,0.057436,0.079551,0.021624,04:40


--- swinv2_large_window12_192_22k
192
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,0.911122,0.633168,0.20519,03:50


epoch,train_loss,valid_loss,error_rate,time
0,0.471424,0.248546,0.079769,04:35
1,0.382447,0.234357,0.07112,04:36
2,0.417509,0.254896,0.0889,04:36
3,0.272766,0.20725,0.059106,04:36
4,0.24843,0.152935,0.042768,04:36
5,0.215468,0.133694,0.041807,04:36
6,0.118604,0.09366,0.028352,04:36
7,0.109291,0.081134,0.022585,04:36
8,0.062955,0.082704,0.023066,04:36
9,0.05504,0.059718,0.013455,04:36


--- swin_large_patch4_window7_224
224
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,0.954399,0.531091,0.164825,03:30


epoch,train_loss,valid_loss,error_rate,time
0,0.481685,0.251494,0.072561,04:14
1,0.380806,0.219351,0.068236,04:14
2,0.374081,0.192615,0.055262,04:14
3,0.348406,0.16501,0.049495,04:14
4,0.199248,0.156519,0.041326,04:14
5,0.167483,0.15107,0.037963,04:14
6,0.134332,0.130631,0.031716,04:14
7,0.09113,0.107957,0.025949,04:14
8,0.086154,0.099502,0.023546,04:14
9,0.050942,0.097683,0.019702,04:14


# Ensembling

Since this has taken quite a while to run, let's save the results, just in case something goes wrong!

In [20]:
save_pickle("tta_res.pkl", tta_res)

`Learner.tta` returns preds and targs for each row. We just want the preds:

In [21]:
tta_prs = first(zip(*tta_res))

Originally I just used the above preds, but later I realised in my experiments on smaller models that `vit` was a bit better than everything else, so I decided to give those double the weight in my ensemble. I did that by simply adding them to the list a second time (we could also do this by using a weighted average):

In [22]:
tta_prs += tta_prs[1:3]
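
To see why adding a model's preds to the list twice is the same as a weighted average, here's a tiny sketch in plain Python with made-up numbers (three models, two images, two classes). The helper names are hypothetical and the values are not results from this notebook:

```python
# Toy predictions: 3 models x 2 images x 2 classes (made-up numbers).
preds = [
    [[0.9, 0.1], [0.4, 0.6]],  # model 0
    [[0.6, 0.4], [0.2, 0.8]],  # model 1 (the one we want to count twice)
    [[0.7, 0.3], [0.5, 0.5]],  # model 2
]

def mean_preds(models):
    "Element-wise mean over a list of models' predictions."
    n = len(models)
    return [[sum(m[i][j] for m in models) / n
             for j in range(len(models[0][i]))]
            for i in range(len(models[0]))]

def weighted_mean_preds(models, weights):
    "Element-wise weighted mean; the weights need not sum to 1."
    total = sum(weights)
    return [[sum(w * m[i][j] for m, w in zip(models, weights)) / total
             for j in range(len(models[0][i]))]
            for i in range(len(models[0]))]

duplicated = mean_preds(preds + preds[1:2])       # append model 1 again
weighted = weighted_mean_preds(preds, [1, 2, 1])  # give model 1 weight 2
print(duplicated)
print(weighted)
```

Both approaches produce the same averaged predictions (up to floating-point rounding); duplicating entries is just the quickest way to do it with the list we already have.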

An *ensemble* simply refers to a model which is itself the result of combining a number of other models. The simplest way to do ensembling is to take the average of the preds of each model:

In [23]:
avg_pr = torch.stack(tta_prs).mean(0)
avg_pr.shape

torch.Size([3469, 10])

That's all that's needed to create an ensemble! Finally, we copy the steps we used in the last notebook to create a submission file:

In [24]:
dls = ImageDataLoaders.from_folder(
    trn_path,
    valid_pct=0.2,
    item_tfms=Resize(480, method="squish"),
    batch_tfms=aug_transforms(size=224, min_scale=0.75),
)

In [25]:
idxs = avg_pr.argmax(dim=1)
vocab = np.array(dls.vocab)
ss = pd.read_csv(path / "sample_submission.csv")
ss["label"] = vocab[idxs]
ss.to_csv("subm3.csv", index=False)

Going from bottom to top of the submissions list, here's what each submission was:

1. `convnext_small` trained for 12 epochs, with TTA
2. `convnext_large` trained the same way
3. The ensemble in this notebook, with `vit` models not over-weighted
4. The ensemble in this notebook, with `vit` models over-weighted

# Conclusion

The key takeaway I hope to get across from this series so far is that you can get great results in image recognition using very little code and a very standardised approach, and that with a rigorous process you can improve in significant steps. Our training function, including data processing and TTA, is just half a dozen lines of code, plus another 7 lines of code to ensemble the models and create a submission file!