# Evaluate the finetuned model
The original goal was to check whether the new code (systematically refactored into a command-line interface) reproduces the performance of the old code. However, because the new code handles only one kind of measurement per model, there is no straightforward way to compare the two versions. The old model did not perform especially well either. I therefore changed the goal to checking whether the new code yields acceptable performance. Later on, we will run a series of hyperparameter tuning experiments to improve the performance.

In [1]:
import torch
from src.eval.eval import finetune_evaluate, finetune_evaluate_base
from src.models.mae_vit_regressor import mae_vit_base_patch16
from src.datas import transforms
from src.datas.dataloader import get_dataloader

In [2]:
# CaCO3
target = "CaCO3"
criterion = torch.nn.MSELoss()
device = torch.device('cuda')

target_mean = torch.load(f"src/datas/xpt_{target}_target_mean.pth")
target_std = torch.load(f"src/datas/xpt_{target}_target_std.pth")
target_transform = transforms.Normalize(target_mean, target_std)

model = mae_vit_base_patch16(pretrained=True, weights=f"results/finetune_test_{target}_20240610/model.ckpt").to(device)
dataloader = get_dataloader(ispretrain=False, annotations_file=f"data/finetune/{target}%/train/info.csv", input_dir=f"data/finetune/{target}%/train", 
                            batch_size=256, transform=transforms.InstanceNorm(), target_transform=target_transform, num_workers=8)

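# MSE of the fine-tuned model on the validation split
model_mse = finetune_evaluate(model=model, dataloader=dataloader['val'], criterion=criterion)

# MSE of the baseline from finetune_evaluate_base, given the stored target mean
base_mse = finetune_evaluate_base(dataloader=dataloader['val'], criterion=criterion, mean=target_mean)

# R2 of the model relative to that baseline
r_square = 1 - model_mse / base_mse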

print(target)
print(f'MSE: {model_mse:.3f}')
print(f'MSE of base model: {base_mse:.3f}')
print(f'R2: {r_square:.3f}')

CaCO3
MSE: 0.202
MSE of base model: 98.973
R2: 0.998


In [4]:
# TOC
target = "TOC"
criterion = torch.nn.MSELoss()
device = torch.device('cuda')

target_mean = torch.load(f"src/datas/xpt_{target}_target_mean.pth")
target_std = torch.load(f"src/datas/xpt_{target}_target_std.pth")
target_transform = transforms.Normalize(target_mean, target_std)

model = mae_vit_base_patch16(pretrained=True, weights=f"results/finetune_test_{target}_20240610/model.ckpt").to(device)
dataloader = get_dataloader(ispretrain=False, annotations_file=f"data/finetune/{target}%/train/info.csv", input_dir=f"data/finetune/{target}%/train", 
                            batch_size=256, transform=transforms.InstanceNorm(), target_transform=target_transform, num_workers=8)

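# MSE of the fine-tuned model on the validation split
model_mse = finetune_evaluate(model=model, dataloader=dataloader['val'], criterion=criterion)

# MSE of the baseline from finetune_evaluate_base, given the stored target mean
base_mse = finetune_evaluate_base(dataloader=dataloader['val'], criterion=criterion, mean=target_mean)

# R2 of the model relative to that baseline
r_square = 1 - model_mse / base_mse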

print(target)
print(f'MSE: {model_mse:.3f}')
print(f'MSE of base model: {base_mse:.3f}')
print(f'R2: {r_square:.3f}')

TOC
MSE: 0.244
MSE of base model: 1.076
R2: 0.773
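
The R2 in both cells is computed as `1 - model_mse / base_mse`. That ratio matches the usual R2 only when both MSE values are measured on the same scale and the baseline predicts the mean of the targets on that same scale. The pure-Python sketch below illustrates this with made-up toy numbers. The helpers `mse`, `r_squared`, and `normalize` are hypothetical and are not part of `src.eval.eval`. Whether `finetune_evaluate_base` compares `target_mean` against raw or normalized targets is not visible in this notebook, so the sketch states only the condition, not how the repository code behaves.

```python
def mse(pred, true):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def r_squared(pred, true, baseline_value):
    """R2 as 1 - MSE(model) / MSE(constant baseline)."""
    model_mse = mse(pred, true)
    base_mse = mse([baseline_value] * len(true), true)
    return 1 - model_mse / base_mse


# Toy values, invented for illustration only (not from the dataset)
raw_targets = [10.0, 20.0, 30.0, 40.0]
raw_preds = [12.0, 18.0, 33.0, 39.0]

mean = sum(raw_targets) / len(raw_targets)
std = (sum((t - mean) ** 2 for t in raw_targets) / len(raw_targets)) ** 0.5


def normalize(values):
    """Standardize values with the target mean and std, as a Normalize transform would."""
    return [(v - mean) / std for v in values]


# Everything in raw units: baseline predicts the raw mean
r2_raw = r_squared(raw_preds, raw_targets, baseline_value=mean)

# Everything in normalized units: the normalized mean is 0
r2_norm = r_squared(normalize(raw_preds), normalize(raw_targets), baseline_value=0.0)

# Mixed scales: normalized targets compared against the raw mean
r2_mixed = r_squared(normalize(raw_preds), normalize(raw_targets), baseline_value=mean)

print(f"raw: {r2_raw:.3f}, normalized: {r2_norm:.3f}, mixed: {r2_mixed:.5f}")
```

In this toy case the raw and normalized R2 agree, while the mixed-scale version is inflated because the baseline MSE grows by roughly the square of the raw mean.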


# Summary
The models do more than give acceptable performance. On the validation split, they perform comparably to, or even slightly better than, the results in Lee et al. (2022). The CaCO3 model's R2 is 0.998, which exceeds the 0.96 reported in Lee et al. (2022). The TOC model's R2 is 0.773, which is comparable to the 0.78 in Lee et al. (2022). The next step is hyperparameter tuning, after some minor modifications.