# Checkpointing Tutorial

This notebook shows how to stop and resume the training procedure using checkpoints. __Do not clear the output cells!__


In [1]:
import torch
import torch.nn as nn
from src.dataset import OriginalPatchLocalizationDataset, sample_img_paths
from src.models import OriginalPretextNetwork
from src.train_pretext import train_model

### Initial Training
Let us first set up the training for some dummy experiment.

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

img_paths = sample_img_paths(frac=0.1)

ds_train = OriginalPatchLocalizationDataset(img_paths=img_paths[:4600], samples_per_image=1)
ds_val = OriginalPatchLocalizationDataset(img_paths=img_paths[4600:], samples_per_image=1)

print(f"Number of training images: \t {len(ds_train)}")
print(f"Number of validation images: \t {len(ds_val)}")

train_loader = torch.utils.data.DataLoader(ds_train, batch_size=64, shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(ds_val, batch_size=64, shuffle=False, num_workers=4)

model = OriginalPretextNetwork(backbone="resnet18").to(device)
criterion = nn.CrossEntropyLoss().to(device)

Device: cuda
Number of training images: 	 4600
Number of validation images: 	 367


Now let us start the training. Although the call below requests 4 epochs, we interrupt it manually after 2.

In [3]:
description = """this is just some dummy experiment""" # <-- describe your experiment so that you can later tell what you did

train_model(
    experiment_id="dummy_42", # <-- this will be used to create a directory with all the models, logs, etc.
    experiment_descr=description,
    model=model,    
    train_loader=train_loader,
    val_loader=val_loader,
    device=device,
    criterion=criterion,
    optimizer=None, # <-- if None, the default Adam optimizer will be used
    num_epochs=4,
    log_frequency=36,
)

Epoch: [0][0/71]	Time 3.316s (3.316s)	Speed 19.3 samples/s	Data 0.001s (0.001s)	Loss 2.08310 (2.08310)
Epoch: [0][36/71]	Time 0.279s (0.357s)	Speed 229.4 samples/s	Data 0.011s (0.012s)	Loss 2.07283 (2.36810)
Epoch: [0][71/71]	Time 0.240s (0.316s)	Speed 233.3 samples/s	Data 0.008s (0.012s)	Loss 2.08574 (2.23140)
Test: [0/5]	Time 0.293 (0.293)	Loss 2.0655 (2.0655)
Test: [5/5]	Time 0.082 (0.135)	Loss 2.0797 (2.0713)
Accuracy: 0.142
Saving best model to ./out/dummy_42/
Saving checkpoint to ./out/dummy_42/
Epoch: [1][0/71]	Time 0.547s (0.547s)	Speed 117.0 samples/s	Data 0.001s (0.001s)	Loss 2.08097 (2.08097)
Epoch: [1][36/71]	Time 0.274s (0.281s)	Speed 233.6 samples/s	Data 0.012s (0.011s)	Loss 2.07400 (2.07842)
Epoch: [1][71/71]	Time 0.239s (0.278s)	Speed 234.3 samples/s	Data 0.008s (0.011s)	Loss 2.06522 (2.07731)
Test: [0/5]	Time 0.305 (0.305)	Loss 2.0480 (2.0480)
Test: [5/5]	Time 0.083 (0.141)	Loss 2.0794 (2.0782)
Accuracy: 0.142
Saving checkpoint to ./out/dummy_42/


KeyboardInterrupt: 

### Continuation of Training
Before we can continue the training, we need to set it up again. Apart from loading the checkpoint, everything is exactly the same as the setup code for the initial training.

In [4]:
# literally the same code as above...
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

img_paths = sample_img_paths(frac=0.1)

ds_train = OriginalPatchLocalizationDataset(img_paths=img_paths[:4600], samples_per_image=1)
ds_val = OriginalPatchLocalizationDataset(img_paths=img_paths[4600:], samples_per_image=1)

print(f"Number of training images: \t {len(ds_train)}")
print(f"Number of validation images: \t {len(ds_val)}")

train_loader = torch.utils.data.DataLoader(ds_train, batch_size=64, shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(ds_val, batch_size=64, shuffle=False, num_workers=4)

model = OriginalPretextNetwork(backbone="resnet18").to(device)
criterion = nn.CrossEntropyLoss().to(device)

# load checkpoint
checkpoint = torch.load("./out/dummy_42/checkpoint.pth.tar", map_location=device) # map tensors to the current device
model.load_state_dict(checkpoint['model_state_dict']) # restore model state
optimizer = torch.optim.Adam(model.parameters()) # its hyperparameters are overwritten by the saved ones below
optimizer.load_state_dict(checkpoint['optimizer_state_dict']) # restore optimizer state
next_epoch = checkpoint['next_epoch']
best_acc = checkpoint['best_acc']

Device: cuda
Number of training images: 	 4600
Number of validation images: 	 367
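
The cell above reads four entries from the checkpoint: the model state, the optimizer state, the next epoch, and the best accuracy. The notebook does not show how the checkpoint is written, so the following is only a minimal sketch of a matching save and load round trip. It assumes the file is a plain dictionary stored with torch.save; the helper save_checkpoint and the small nn.Linear stand-in model are hypothetical and not part of src.train_pretext.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, next_epoch, best_acc):
    """Hypothetical helper: bundle everything needed to resume into one file."""
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "next_epoch": next_epoch,  # epoch to start from when resuming
            "best_acc": best_acc,      # best validation accuracy so far
        },
        path,
    )


# Stand-in model (assumption); the tutorial uses OriginalPretextNetwork instead.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# Take one optimization step so that Adam has internal state worth saving.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth.tar")
save_checkpoint(ckpt_path, model, optimizer, next_epoch=2, best_acc=0.142)

# Restore into freshly constructed objects, mirroring the cell above.
restored_model = nn.Linear(4, 2)
restored_optimizer = torch.optim.Adam(restored_model.parameters())
checkpoint = torch.load(ckpt_path, map_location="cpu")
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```

Saving the optimizer state next to the model weights matters for Adam in particular, because its running moment estimates would otherwise restart from zero after the interruption.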


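Now we can resume the training. Run the same call as above, but additionally pass the start epoch, the best accuracy achieved so far and, if you initially passed _None_ as the optimizer, the restored Adam optimizer. Note that num_epochs stays at 4: as the log below shows, it is the total number of epochs, not the number of additional ones.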

In [5]:
description = """this is just some dummy experiment"""

train_model(
    experiment_id="dummy_42",
    experiment_descr=description,
    model=model,    
    train_loader=train_loader,
    val_loader=val_loader,
    device=device,
    criterion=criterion,
    optimizer=optimizer, # <-- don't forget to pass the Adam optimizer with its latest state if you passed None before!
    start_epoch=next_epoch, # <-- we don't need to start from scratch again!
    num_epochs=4,
    curr_best_acc=best_acc, # <-- need to know the best accuracy achieved so far (to save the best model of the entire training)
    log_frequency=36,
)

Epoch: [2][0/71]	Time 0.600s (0.600s)	Speed 106.7 samples/s	Data 0.011s (0.011s)	Loss 2.07153 (2.07153)
Epoch: [2][36/71]	Time 0.275s (0.284s)	Speed 232.7 samples/s	Data 0.010s (0.012s)	Loss 2.05020 (2.07595)
Epoch: [2][71/71]	Time 0.252s (0.280s)	Speed 222.2 samples/s	Data 0.010s (0.012s)	Loss 2.08823 (2.07424)
Test: [0/5]	Time 0.415 (0.415)	Loss 2.0749 (2.0749)
Test: [5/5]	Time 0.080 (0.155)	Loss 1.9827 (2.0709)
Accuracy: 0.174
Saving best model to ./out/dummy_42/
Saving checkpoint to ./out/dummy_42/
Epoch: [3][0/71]	Time 0.535s (0.535s)	Speed 119.6 samples/s	Data 0.002s (0.002s)	Loss 2.08128 (2.08128)
Epoch: [3][36/71]	Time 0.277s (0.283s)	Speed 231.0 samples/s	Data 0.013s (0.012s)	Loss 2.05026 (2.07260)
Epoch: [3][71/71]	Time 0.239s (0.277s)	Speed 234.3 samples/s	Data 0.008s (0.011s)	Loss 2.04514 (2.07577)
Test: [0/5]	Time 0.451 (0.451)	Loss 2.0953 (2.0953)
Test: [5/5]	Time 0.081 (0.163)	Loss 2.0671 (2.0775)
Accuracy: 0.131
Saving checkpoint to ./out/dummy_42/
Saving final model to

Done!
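
To see why the resumed call passes both start_epoch and curr_best_acc, here is a toy, PyTorch-free sketch of the bookkeeping. It is not the code of train_model: the function run_epochs, the strict "greater than" comparison and the default best accuracy of 0.0 are assumptions made for illustration. The accuracies are the ones printed in the logs above.

```python
def run_epochs(val_accuracies, start_epoch=0, num_epochs=4, curr_best_acc=0.0):
    """Toy stand-in for a training loop (assumption, not train_model itself).

    val_accuracies[e] plays the role of the validation accuracy after epoch e.
    Returns the best accuracy and a list of (epoch, what_was_saved) events.
    """
    best_acc = curr_best_acc
    events = []
    for epoch in range(start_epoch, num_epochs):
        acc = val_accuracies[epoch]
        if acc > best_acc:  # only a strictly better model replaces the best one
            best_acc = acc
            events.append((epoch, "best"))
        events.append((epoch, "checkpoint"))  # a checkpoint is written every epoch
    return best_acc, events


# Validation accuracies of epochs 0 to 3 as printed in the logs above.
accs = [0.142, 0.142, 0.174, 0.131]

# Resuming at epoch 2 with the best accuracy from the checkpoint.
best, events = run_epochs(accs, start_epoch=2, curr_best_acc=0.142)

# Hypothetical: resuming at epoch 3, once with and once without curr_best_acc.
kept_best, kept_events = run_epochs(accs, start_epoch=3, curr_best_acc=0.174)
lost_best, lost_events = run_epochs(accs, start_epoch=3)
```

In the last call the best accuracy falls back to the assumed default of 0.0, so the weaker model from epoch 3 would replace the stored best model. This is the situation that passing curr_best_acc avoids.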