This notebook uses the model from https://huggingface.co/google/vit-base-patch16-224-in21k

All seeds are set to 42. If you run torch on CUDA, still expect slight variations in the results between reruns, because torch uses non-deterministic algorithms by default (in my testing, these were about 3x faster than the deterministic ones for this model).
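If bit-for-bit reproducibility ever matters more than speed, the settings below are a minimal sketch of how determinism could be requested. The helper name `enable_determinism` is hypothetical and not part of this notebook; `warn_only=True` is chosen here so that operations without a deterministic implementation warn instead of raising. The `CUBLAS_WORKSPACE_CONFIG` variable only has an effect if it is set before CUDA is first used.

```python
import os
import random

import numpy as np
import torch


def enable_determinism(seed: int = 42) -> None:
    """Hypothetical helper: seed every RNG and ask torch for deterministic kernels."""
    # Required by cuBLAS for deterministic matmuls; must be set before CUDA initializes
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # Disable cuDNN autotuning, which can pick different kernels between runs
    torch.backends.cudnn.benchmark = False
    # Warn (rather than raise) when an op has no deterministic implementation
    torch.use_deterministic_algorithms(True, warn_only=True)


enable_determinism(42)
```

Expect training to be slower with these settings, in line with the speed difference noted above.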

Based on the code from "may15_VIT3.ipynb" in old

Overall:
- uses 8 epochs, should take 30 minutes at most
    - could be improved by letting the model train longer (say, up to 20 epochs) until the validation loss plateaus
- has L2 regularization
    - ctrl+f "weight_decay" to find the line that sets this
- NO dropouts 
    - from testing may15_VIT2 vs. may15_VIT3, applying 0.2 dropout made the validation loss slightly higher, but other dropout values (0.1 or 0.5) might do better
- uses the initial image preprocessing transforms from may13_VIT1.ipynb
    - this achieves the lowest training loss and a lower validation loss as well, but it also has the biggest gap between training and validation loss

Useful documentation links for finding what parameters we can change:

HuggingFace:
- https://huggingface.co/docs/transformers/v4.51.3/en/main_classes/trainer#transformers.TrainingArguments
- https://huggingface.co/docs/transformers/en/model_doc/vit#transformers.ViTConfig

PyTorch:
- https://docs.pytorch.org/vision/0.9/transforms.html
- https://docs.pytorch.org/vision/main/generated/torchvision.transforms.Compose.html


May 15, changes by justin
- initial learning rate set to 1e-4
- lr scheduler: cosine
- early stopping callback with patience = 5, threshold = 0.001

best model: epoch 5, step 680

results: may13_VIT1 is still better in terms of loss and f1-score...
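
The early stopping settings above (patience = 5, threshold = 0.001) can be checked against the validation losses logged by this run. The sketch below is a simplified patience loop, not the actual `EarlyStoppingCallback` code; the function name `simulate_early_stopping` is hypothetical, and it assumes the monitored metric is the validation loss (lower is better).

```python
def simulate_early_stopping(val_losses, patience=5, threshold=0.001):
    """Return (best_epoch, stop_epoch) for a lower-is-better metric.

    An epoch counts as an improvement only if it beats the best loss
    seen so far by more than `threshold`.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss - loss > threshold:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None  # patience was never exhausted


# Validation losses per epoch, copied from this notebook's training log
val_losses = [1.513331, 0.899073, 0.590421, 0.480438, 0.399213,
              0.503302, 0.468005, 0.461604, 0.444649, 0.464152]

best_epoch, stop_epoch = simulate_early_stopping(val_losses)
print(best_epoch, stop_epoch)  # 5 10
```

This matches the run: the best checkpoint is from epoch 5, and training ends after epoch 10 because five consecutive epochs failed to improve on it.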

In [5]:
## if running in Google Colab, do these
# !pip install datasets=='3.5.1' evaluate torch torchvision transformers Pillow numpy scikit-learn 'accelerate>=0.26.0'

In [6]:
from datasets import load_dataset, Image
import os
"""
.venv/Scripts/activate

python -m image_process
"""
base_output_dir = "model" ## if you wanna save different models, just make a new git branch, saving a VIT model takes up A LOT of space
os.makedirs(base_output_dir, exist_ok=True)
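dataset = load_dataset("potato_train/train")
# Second view of the same data with decoding disabled, so each row exposes the image path
filenames_ds = dataset.cast_column("image", Image(decode=False))

# Use os.path.basename so the filename is extracted correctly on any OS, not only on Windows
filename_col = [os.path.basename(x['image']['path']) for x in filenames_ds['train']]
dataset['train'] = dataset['train'].add_column("filename", filename_col)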

We load the image processor (feature extractor) that belongs to the chosen ViT model.

In [7]:
from transformers import ViTImageProcessor

# import model
model_id = 'google/vit-base-patch16-224-in21k'
feature_extractor = ViTImageProcessor.from_pretrained(
    model_id
)
# feature_extractor

These are the steps used to preprocess the images and to perform data augmentation on the training dataset.

In [8]:
from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomVerticalFlip,
    # RandomRotation,
    Resize,
    ToTensor,
    ColorJitter,
    RandomAffine,
    # Pad,
    # RandomCrop
)
from PIL import Image  # PIL constants for RandomAffine's interpolation argument
import torch
import numpy as np

def set_seeds(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

seed = 42
set_seeds(seed)

normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
size = (feature_extractor.size["height"], feature_extractor.size["width"])

training_transforms = Compose([
    Resize(size),
    CenterCrop(size),
    # RandomRotation((-30, 30)),
    RandomHorizontalFlip(),
    RandomVerticalFlip(),
    ColorJitter(brightness=0.3, contrast=0.2, saturation=0.1, hue=0.05),
    RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.95, 1.05), interpolation=Image.BILINEAR),
    ToTensor(),
    normalize
])

def training_image_preprocess(batch):
    batch["pixel_values"] = torch.stack([training_transforms(img) for img in batch["image"]])
    return batch

def preprocess(batch):
    inputs = feature_extractor(batch['image'], return_tensors='pt')
    inputs['label'] = batch['label']
    return inputs

In [9]:
import torch
import numpy as np
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

We split the dataset: 80% for training and 20% for testing. The test split also serves as the validation set during training.

In [10]:
train_test_split = dataset["train"].train_test_split(test_size=0.2, shuffle=True, seed=42)
dataset_train = train_test_split["train"]
dataset_test = train_test_split["test"]

In [11]:
num_classes = len(set(dataset_train['label']))
labels = dataset_train.features['label']
num_classes, labels

(6,
 ClassLabel(names=['Bacteria', 'Fungi', 'Healthy', 'Pest', 'Phytopthora', 'Virus'], id=None))

In [12]:
prepared_train = dataset_train.with_transform(training_image_preprocess)
prepared_test = dataset_test.with_transform(preprocess)

Save a sample of the preprocessed images (the first 10 from both train and test).

In [13]:
import os
from torchvision.transforms.functional import to_pil_image

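def save_unnormalized_images(prepared_dataset, raw_dataset, directory: str):
    """Save the first 10 preprocessed images as PNGs, undoing the Normalize step first."""
    os.makedirs(directory, exist_ok=True)
    # Per-channel statistics used by Normalize, reshaped to broadcast over (C, H, W)
    mean = torch.tensor(feature_extractor.image_mean).view(-1, 1, 1)
    std = torch.tensor(feature_extractor.image_std).view(-1, 1, 1)
    for index, item in enumerate(prepared_dataset):
        if index >= 10:
            break
        # Reverse the normalization so values are back in [0, 1] before converting to PIL
        pixel_values = (item["pixel_values"] * std + mean).clamp(0, 1)
        image = to_pil_image(pixel_values)
        label_filename = raw_dataset[index]["filename"]

        # Keep the original base name but always save as PNG
        name_without_extension, _ = os.path.splitext(label_filename)
        filename = f"pp_{name_without_extension}.png"

        filepath = os.path.join(directory, filename)
        image.save(filepath)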


save_unnormalized_images(prepared_train, dataset_train, f"{base_output_dir}/preprocessed_train_images")
save_unnormalized_images(prepared_test, dataset_test, f"{base_output_dir}/preprocessed_test_images")

In [14]:
import evaluate

def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.tensor([x['label'] for x in batch])
    }

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(p):
    predictions = np.argmax(p.predictions, axis=1)
    results = {}
    results.update(accuracy_metric.compute(
        predictions=predictions, 
        references=p.label_ids,
        )
    )
    results.update(f1_metric.compute(predictions=predictions, references=p.label_ids, average="weighted"))
    return results
#
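
For reference, `average="weighted"` means the per-class F1 scores are averaged with each class weighted by its number of true samples. The following is a minimal pure-Python sketch of that definition for illustration only; the function name `weighted_f1` and the toy labels are made up here, and the notebook itself keeps using the `evaluate` metric.

```python
from collections import Counter


def weighted_f1(references, predictions):
    """Support-weighted mean of per-class F1 scores."""
    labels = sorted(set(references) | set(predictions))
    support = Counter(references)  # number of true samples per class
    total = len(references)
    score = 0.0
    for label in labels:
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


# Toy example: one sample of class 0 is misclassified as class 1
refs = [0, 0, 1, 1, 2]
preds = [0, 1, 1, 1, 2]
print(round(weighted_f1(refs, preds), 4))  # 0.7867
```

Because of the support weighting, a class with few samples moves this score less than it would move a macro-averaged F1.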

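Training arguments and model config. This is where you can apply things like
- weight decay (L2-style regularization), via `weight_decay` in TrainingArguments
- hidden and attention dropout, via ViTConfig

etc.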

In [19]:
from transformers import ViTForImageClassification, Trainer, TrainingArguments, ViTConfig
#from transformers.trainer_utils import IntervalStrategy, SchedulerType
#from transformers.training_args import OptimizerNames
from transformers.trainer_callback import EarlyStoppingCallback

training_args = TrainingArguments(
  output_dir=base_output_dir,
  per_device_train_batch_size=16,
  eval_strategy="epoch",  # Evaluate at the end of each epoch for early stopping
  save_strategy="epoch",  # Must match eval_strategy when load_best_model_at_end=True
  num_train_epochs=50,  # Upper bound only; the early stopping callback ends training sooner
  logging_steps=100,
  learning_rate=1e-4,  # Peak learning rate, reached at the end of warmup
  save_total_limit=2,
  seed=seed,
  remove_unused_columns=False,
  push_to_hub=False,
  load_best_model_at_end=True,
  weight_decay=0.01,  # Weight decay (L2-style regularization)
  lr_scheduler_type="cosine",  # Use cosine learning rate scheduler
  # NOTE: this evaluates to 10x the steps in one epoch (0.1 * steps_per_epoch * 100),
  # i.e. roughly 10 epochs of warmup. For 10% of the first epoch, drop the "* 100".
  warmup_steps=int(0.1 * (len(prepared_train) / 16) * 100),
)
config = ViTConfig.from_pretrained(model_id)
config.num_labels = len(dataset_train.features['label'].names)
# If you want to change it (do this BEFORE loading the model with from_pretrained):
# config.hidden_dropout_prob = 0.2
# config.attention_probs_dropout_prob = 0.2
print(f"hidden dropout={config.hidden_dropout_prob}, attention dropout={config.attention_probs_dropout_prob}")

# The classification head is newly initialized with config.num_labels outputs
model = ViTForImageClassification.from_pretrained(
    model_id,
    config=config,
)
model.to(device)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_train,
    eval_dataset=prepared_test,
    processing_class=feature_extractor,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=0.001)],
)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


hidden dropout=0.0, attention dropout=0.0
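
To see what the `warmup_steps` expression in the cell above means in practice, here is a small pure-Python sketch of a linear-warmup plus cosine-decay schedule. It is written to follow the usual half-cycle cosine formula and is not the Trainer's own code. The figure of 136 steps per epoch is an assumption inferred from "epoch 5, step 680", and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then half-cycle cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


steps_per_epoch = 136                    # assumption: 680 steps / 5 epochs
total_steps = 50 * steps_per_epoch       # num_train_epochs=50
warmup_as_written = 10 * steps_per_epoch   # 0.1 * steps_per_epoch * 100
warmup_tenth_epoch = steps_per_epoch // 10  # 10% of the first epoch

for epoch in (1, 5, 10):
    step = epoch * steps_per_epoch
    print(
        epoch,
        lr_at_step(step, 1e-4, warmup_as_written, total_steps),
        lr_at_step(step, 1e-4, warmup_tenth_epoch, total_steps),
    )
```

Under these assumptions the warmup as written lasts about as long as the whole run did (10 epochs), so the learning rate was still ramping up when early stopping ended training, and the best checkpoint at step 680 was trained at no more than half of the nominal 1e-4.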


In [20]:
train_results = trainer.train()

trainer.save_model() # save the model together with the image processor config
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)

trainer.save_state() # save the trainer state

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,1.7368,1.513331,0.597786,0.538326
2,1.4217,0.899073,0.789668,0.785758
3,0.6371,0.590421,0.845018,0.839725
4,0.4651,0.480438,0.857934,0.858274
5,0.4353,0.399213,0.885609,0.884025
6,0.3096,0.503302,0.843173,0.840912
7,0.3018,0.468005,0.843173,0.84152
8,0.2573,0.461604,0.857934,0.859015
9,0.2534,0.444649,0.867159,0.864899
10,0.2161,0.464152,0.863469,0.862233


***** train metrics *****
  epoch                    =         10.0
  total_flos               = 1562537366GF
  train_loss               =       0.5692
  train_runtime            =   1:09:28.89
  train_samples_per_second =       25.966
  train_steps_per_second   =        1.631


Retrieve the saved model from the directory, then run the evaluation

In [21]:
from transformers import Trainer, ViTForImageClassification, ViTImageProcessor

model = ViTForImageClassification.from_pretrained(base_output_dir)
# ViTImageProcessor replaces the deprecated ViTFeatureExtractor and matches the class used earlier
feature_extractor = ViTImageProcessor.from_pretrained(base_output_dir)

# Define the Trainer for evaluation
# (to evaluate the already-trained in-memory model instead of reloading it from the directory,
#  skip this Trainer(...) block so the existing `trainer` is reused)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_train,
    eval_dataset=prepared_test,
    processing_class=feature_extractor,
)
eval_results = trainer.evaluate()

trainer.log_metrics("eval", eval_results)
trainer.save_metrics("eval", eval_results)



***** eval metrics *****
  eval_accuracy               =     0.8856
  eval_f1                     =      0.884
  eval_loss                   =     0.3992
  eval_model_preparation_time =      0.002
  eval_runtime                = 0:00:23.50
  eval_samples_per_second     =     23.063
  eval_steps_per_second       =      2.894
