# Which batch size to use?

When training a deep learning model, we have to choose many hyperparameters, the settings that control how the model's parameters are learned. Examples include the learning rate, the batch size, and the number of epochs. So, which batch size should we use?

Some may ask, "Batch size for what?" In deep learning, we update the parameters of the model after feeding data into it. Let's say we want to build a model that classifies whether an image shows a cat or a dog, and we have many correctly labelled pictures of dogs and cats. We can let our model update its parameters after each image or after the whole dataset. Using one image at a time is inefficient because it does not take advantage of the GPU's parallelism, and it also makes it harder for the model to generalize because each update optimizes for a single image. However, if we use the whole dataset at once, we may run out of GPU memory.

To combine both approaches, we divide the dataset into small batches. So, how many images should we put into each batch? We generally use 64, but really, we can try anything. My guess is that a bigger batch size will always give a better result, so I will start with the biggest batch size that fits on the GPU. Then, I will compare it with smaller batch sizes.
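
To make the trade-off concrete, here is a minimal sketch of how the batch size determines the number of parameter updates per epoch. The helper `iter_batches` and the dataset size of 1000 are hypothetical and chosen only for illustration; they are not part of fastai or the Paddy dataset.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence, bs: int, drop_last: bool = False) -> Iterator[List]:
    """Yield consecutive chunks of `bs` items; optionally skip a short final chunk."""
    for start in range(0, len(items), bs):
        chunk = list(items[start:start + bs])
        if drop_last and len(chunk) < bs:
            return
        yield chunk


n_images = 1000                 # hypothetical dataset size
images = list(range(n_images))  # stand-ins for real images

# One update per batch: bs=1 is "one image at a time",
# bs=n_images is "the whole dataset at once".
for bs in (1, 64, n_images):
    n_updates = sum(1 for _ in iter_batches(images, bs))
    print(f"bs={bs:>4}: {n_updates} parameter updates per epoch")
```

Some data loaders drop a short final batch during training, which is what `drop_last=True` imitates here.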

For data, I will use the [Paddy competition](https://www.kaggle.com/competitions/paddy-disease-classification) dataset because it's small and we used it last time.

In [1]:
from fastai.vision.all import *

In [2]:
path = Path('.')
trn_path = path/'train_images'
im = PILImage.create((trn_path/'bacterial_leaf_blight'/'110320.jpg'))
print(im.size)

(192, 256)


In [3]:
480 / 640, 192 / 256

(0.75, 0.75)

We will use resnet18 for its speed, as accuracy is not important here. We just want a model that trains fast with reasonable results so that we can try different batch sizes. Will the results differ for a different model, such as ConvNeXt or ResNeXt? Maybe, but I don't think they will be very different. We should try that another time.

In [4]:
arch = resnet18

## Batch size difference

Here, I define a function that uses the same configuration for every run except for the batch size.

In [5]:
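def train(arch, epochs=5, item=Resize(224, method='squish'), batch=aug_transforms(size=128), bs=64):
    "Fine-tune `arch` on the training images; only `bs` changes between runs."
    # Random 80/20 train/validation split of the image folders
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item, batch_tfms=batch, bs=bs)
    # Mixed precision (fp16) to speed up training
    learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
    # One epoch with the pretrained body frozen, then `epochs` epochs unfrozen
    learn.fine_tune(epochs, 0.01)
    return learn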

In [6]:
learn = train(arch, epochs=20, bs=512)

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth

epoch,train_loss,valid_loss,error_rate,time
0,2.596521,1.635138,0.465161,00:30


epoch,train_loss,valid_loss,error_rate,time
0,1.375922,0.87113,0.275348,00:31
1,1.082382,0.592112,0.197982,00:30
2,0.848206,0.427977,0.131187,00:31
3,0.669034,0.370086,0.11581,00:30
4,0.527107,0.346576,0.09803,00:30
5,0.426124,0.274266,0.084575,00:30
6,0.346506,0.313324,0.086016,00:29
7,0.288119,0.302993,0.075925,00:29
8,0.238961,0.259476,0.060548,00:29
9,0.19718,0.256443,0.060548,00:30


In [7]:
learn = train(arch, epochs=20, bs=256)

epoch,train_loss,valid_loss,error_rate,time
0,2.346818,1.357702,0.417107,00:23


epoch,train_loss,valid_loss,error_rate,time
0,1.19778,0.712432,0.231139,00:27
1,0.877656,0.425744,0.140798,00:27
2,0.640672,0.33931,0.100913,00:27
3,0.482401,0.316858,0.090822,00:27
4,0.36976,0.276862,0.077847,00:27
5,0.2983,0.24556,0.067275,00:28
6,0.248462,0.308261,0.0889,00:29
7,0.210882,0.189919,0.049495,00:28
8,0.176628,0.1775,0.048534,00:28
9,0.142462,0.201183,0.051898,00:27


In [8]:
# Smaller batch size
learn = train(arch, epochs=20, bs=128)

epoch,train_loss,valid_loss,error_rate,time
0,2.136984,1.406114,0.447862,00:22


epoch,train_loss,valid_loss,error_rate,time
0,1.054516,0.663863,0.215281,00:27
1,0.666271,0.420409,0.132148,00:26
2,0.457096,0.343254,0.101874,00:26
3,0.382713,0.327504,0.100432,00:26
4,0.316829,0.331945,0.090822,00:26
5,0.302304,0.390939,0.100913,00:26
6,0.255539,0.309886,0.088419,00:26
7,0.21978,0.32307,0.077847,00:26
8,0.179994,0.264668,0.067756,00:26
9,0.147762,0.242832,0.059106,00:26


In [9]:
learn = train(arch, epochs=20, bs=64)

epoch,train_loss,valid_loss,error_rate,time
0,1.959155,1.336008,0.434887,00:20


epoch,train_loss,valid_loss,error_rate,time
0,0.92832,0.53258,0.175877,00:24
1,0.577576,0.368951,0.118693,00:24
2,0.450068,0.361453,0.105718,00:24
3,0.451048,0.348951,0.10716,00:25
4,0.422584,0.425991,0.131187,00:25
5,0.368363,0.320497,0.103316,00:25
6,0.315106,0.455855,0.102355,00:25
7,0.254068,0.372847,0.0889,00:25
8,0.228498,0.220114,0.064392,00:25
9,0.183738,0.196138,0.052859,00:25


In [10]:
learn = train(arch, epochs=20, bs=32)

epoch,train_loss,valid_loss,error_rate,time
0,1.922241,1.635834,0.477655,00:24


epoch,train_loss,valid_loss,error_rate,time
0,0.826747,0.497525,0.155214,00:31
1,0.564383,0.400657,0.128784,00:31
2,0.538409,0.509276,0.150408,00:31
3,0.542932,0.409684,0.125901,00:32
4,0.488993,0.436753,0.109082,00:31
5,0.464314,0.43665,0.122057,00:31
6,0.382847,0.394439,0.106679,00:31
7,0.304165,0.369828,0.095627,00:31
8,0.279569,0.222677,0.064392,00:31
9,0.21988,0.206119,0.05334,00:31


In [11]:
learn = train(arch, epochs=20, bs=16)

epoch,train_loss,valid_loss,error_rate,time
0,2.05892,1.550412,0.474291,00:36


epoch,train_loss,valid_loss,error_rate,time
0,0.927028,0.553331,0.185968,00:46
1,0.678595,0.480951,0.150408,00:46
2,0.700101,0.53478,0.16098,00:46
3,0.704966,0.474731,0.15137,00:49
4,0.582422,0.438784,0.127343,00:46
5,0.585485,0.404436,0.127823,00:46
6,0.480821,0.277841,0.082172,00:46
7,0.379346,0.350564,0.091302,00:46
8,0.322888,0.243706,0.065353,00:46
9,0.281296,0.240629,0.063912,00:46


In [12]:
learn = train(arch, epochs=20, bs=8)

epoch,train_loss,valid_loss,error_rate,time
0,2.06032,1.585985,0.498318,01:07


epoch,train_loss,valid_loss,error_rate,time
0,1.120452,0.647971,0.212398,01:25
1,0.911752,0.569064,0.172513,01:24
2,0.856301,0.513152,0.162422,01:24
3,0.937212,0.528084,0.166747,01:24
4,0.768487,0.584787,0.168188,01:24
5,0.675419,0.453394,0.14272,01:25
6,0.617704,0.359392,0.102355,01:24
7,0.651471,0.354981,0.098991,01:25
8,0.417352,0.246947,0.073042,01:25
9,0.34983,0.197683,0.058626,01:28


In [13]:
learn = train(arch, epochs=20, bs=4)

epoch,train_loss,valid_loss,error_rate,time
0,2.258966,1.692897,0.561749,02:00


epoch,train_loss,valid_loss,error_rate,time
0,1.602062,1.179904,0.40221,02:29
1,1.531049,1.210997,0.384911,02:27
2,1.521575,1.199127,0.394041,02:30
3,1.459329,0.981559,0.316194,02:27
4,1.385336,1.032648,0.30322,02:30
5,1.161227,0.891037,0.265257,02:29
6,1.235623,0.780257,0.225853,02:28
7,1.092432,0.680842,0.195099,02:31
8,1.002529,0.697929,0.206631,02:31
9,0.911738,0.547545,0.153292,02:29


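It's interesting that the biggest batch size did not perform best. I thought a bigger batch size would always be better. The batch size of 512 finished with a higher error rate (6.1%) than 256 (5.2%) and 64 (5.3%), and it also took longer per epoch (about 30 seconds) than 256, 128, and 64. Below 64, training time increased as the batch size got smaller, reaching about two and a half minutes per epoch at a batch size of 4, which also had by far the worst error rate (15.3%). A batch size of 64 had the lowest training time (about 25 seconds per epoch), and its final error rate was within 0.1 percentage points of the best run, 256. Each batch size was run only once, so small differences may be noise.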
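
As a quick check of this comparison, the sketch below collects the final-epoch numbers from the logs above and picks out the batch size with the lowest error rate and the one with the shortest epoch. The dictionary layout and variable names are my own; the values are copied from the last row of each run, with times converted to seconds.

```python
# bs -> (error_rate at the final epoch, time of the final epoch in seconds),
# copied from the last row of each training log above
results = {
    512: (0.060548, 30),
    256: (0.051898, 27),
    128: (0.059106, 26),
    64: (0.052859, 25),
    32: (0.053340, 31),
    16: (0.063912, 46),
    8: (0.058626, 88),
    4: (0.153292, 149),
}

# Batch size with the lowest final error rate, and with the shortest final epoch
best_error = min(results, key=lambda bs: results[bs][0])
fastest = min(results, key=lambda bs: results[bs][1])

for bs, (err, secs) in results.items():
    print(f"bs={bs:>3}  error_rate={err:.4f}  last_epoch={secs:>3}s")
print("lowest final error rate: bs =", best_error)
print("shortest final epoch:    bs =", fastest)
```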

So, should we always use 64? Well, we should try different pre-trained models and different data to find out whether this holds, but it could be a good starting point.
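
For that kind of follow-up, a small loop can replace the one-cell-per-batch-size approach used here. This is only a sketch: `sweep` and `fake_train` are hypothetical names, and `fake_train` is a stand-in that returns a placeholder value so the example runs without a GPU.

```python
import time
from typing import Callable, Dict, Iterable, List


def sweep(train_fn: Callable[[int], float], batch_sizes: Iterable[int]) -> List[Dict[str, float]]:
    """Call `train_fn(bs)` for each batch size and record its result and wall time."""
    rows = []
    for bs in batch_sizes:
        start = time.perf_counter()
        error = train_fn(bs)
        rows.append({"bs": bs, "error_rate": error, "seconds": time.perf_counter() - start})
    return rows


def fake_train(bs: int) -> float:
    """Stand-in for a real training run; returns a placeholder, not a real error rate."""
    return 0.5


rows = sweep(fake_train, [64, 32, 16])
for row in rows:
    print(row)

# With the `train` function from this post, `train_fn` could be something like
#   lambda bs: float(train(arch, bs=bs).validate()[1])
# assuming `validate()` returns the validation loss followed by `error_rate`.
```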