# Image Classification

## About
- Fine-tune a Vision Transformer (ViT) on a 5,000-image subset of the Food-101 dataset to classify the food item in an image.
- Use the fine-tuned model for inference.

## Loading Food-101 Dataset

In [1]:
from datasets import load_dataset

food = load_dataset('ethz/food101', split='train[:5000]')

food = food.train_test_split(test_size=0.2)
food['train'][0]

{'image': <PIL.Image.Image image mode=RGB size=512x384>, 'label': 81}

In [2]:
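# Class names in index order, taken from the dataset's ClassLabel feature.
labels = food['train'].features['label'].names
# Two-way mapping between class names and ids (ids are stored as strings).
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label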

In [3]:
id2label[str(79)]

'prime_rib'
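
The mapping built above stores ids as strings, so an integer class index has to be converted before the lookup. The sketch below illustrates the round trip on a three-item list; `toy_labels` and `predicted_index` are hypothetical stand-ins, not values taken from the run above.

```python
# Hypothetical three-class label list standing in for the Food-101 names.
toy_labels = ["apple_pie", "baby_back_ribs", "baklava"]

# Same construction as in the notebook: ids are stored as strings.
label2id = {label: str(i) for i, label in enumerate(toy_labels)}
id2label = {str(i): label for i, label in enumerate(toy_labels)}

# Round trip: name -> id -> name returns the original name.
for name in toy_labels:
    assert id2label[label2id[name]] == name

# A model emits an integer class index, so convert it before the lookup.
predicted_index = 2
print(id2label[str(predicted_index)])  # baklava
```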

## Preprocess

In [4]:
from transformers import AutoImageProcessor

checkpoint = "google/vit-base-patch16-224-in21k"
img_processor = AutoImageProcessor.from_pretrained(checkpoint)

Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.


In [5]:
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

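# Normalize with the per-channel statistics stored in the checkpoint's processor config.
normalize = Normalize(
    mean=img_processor.image_mean,
    std=img_processor.image_std
)
# Target crop size: a single int if the processor defines 'shortest_edge',
# otherwise a (height, width) tuple.
size = (
    img_processor.size['shortest_edge']
    if 'shortest_edge' in img_processor.size
    else (
        img_processor.size['height'],
        img_processor.size['width']
    )
)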

_transforms = Compose([
    RandomResizedCrop(size),
    ToTensor(),
    normalize
])
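
To make the `Normalize` step concrete, here is a minimal sketch of the arithmetic it performs on a single pixel. The mean and std of 0.5 per channel are an assumption about this checkpoint's processor config; the real values should be read from `img_processor.image_mean` and `img_processor.image_std`. The helper `normalize_pixel` is hypothetical and only mirrors the formula `(value - mean) / std`.

```python
# Assumed per-channel statistics; read the real ones from
# img_processor.image_mean and img_processor.image_std.
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]

def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel already scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]

# ToTensor scales 8-bit values to [0, 1] first: 0 -> 0.0, 255 -> 1.0.
black = [0 / 255, 0 / 255, 0 / 255]
white = [255 / 255, 255 / 255, 255 / 255]

print(normalize_pixel(black))  # [-1.0, -1.0, -1.0]
print(normalize_pixel(white))  # [1.0, 1.0, 1.0]
```

Under these assumed statistics, pixel values end up in the range [-1, 1].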

In [6]:
def transforms(examples):
    examples['pixel_values'] = [_transforms(img.convert('RGB')) for img in examples['image']]
    del examples['image']
    return examples

In [7]:
food = food.with_transform(transforms)

In [8]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
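
As a rough illustration of what collation does, the sketch below stacks per-example dictionaries into one batch with NumPy. It is not the library's implementation: `collate` is a hypothetical helper, the images are fake arrays, and the renaming of `label` to `labels` reflects my understanding of the default collator's behaviour.

```python
import numpy as np

def collate(examples):
    """Stack a list of per-example dicts into one batch dict (illustration only)."""
    return {
        "pixel_values": np.stack([ex["pixel_values"] for ex in examples]),
        # The model's forward pass takes the targets under the name "labels".
        "labels": np.array([ex["label"] for ex in examples], dtype=np.int64),
    }

# Two fake 3x224x224 images with hypothetical class ids.
examples = [
    {"pixel_values": np.zeros((3, 224, 224), dtype=np.float32), "label": 81},
    {"pixel_values": np.ones((3, 224, 224), dtype=np.float32), "label": 79},
]
batch = collate(examples)
print(batch["pixel_values"].shape)  # (2, 3, 224, 224)
print(batch["labels"])              # [81 79]
```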

## Evaluate

In [9]:
import evaluate

accuracy = evaluate.load('accuracy')

In [10]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
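
The metric above reduces to an argmax over the class axis followed by a comparison with the reference labels. A small sketch on made-up logits shows the computation without the `evaluate` dependency; `toy_accuracy` and the arrays are hypothetical.

```python
import numpy as np

def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = np.argmax(logits, axis=1)
    return float((predictions == labels).mean())

# Hypothetical logits for 4 images and 3 classes.
logits = np.array([
    [2.0, 0.1, 0.3],   # predicts class 0
    [0.2, 1.5, 0.1],   # predicts class 1
    [0.3, 0.2, 0.9],   # predicts class 2
    [1.1, 0.4, 0.2],   # predicts class 0
])
labels = np.array([0, 1, 2, 1])  # the last example is misclassified

print(toy_accuracy(logits, labels))  # 0.75
```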

## Train

In [11]:
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels = len(labels),
    id2label = id2label,
    label2id=label2id,
)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="models/food_vit_img_cls_model",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    warmup_steps=1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=food["train"],
    eval_dataset=food["test"],
    processing_class=img_processor,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 2.5797 | 2.411222 | 0.844 |
| 2 | 1.8555 | 1.7744 | 0.89 |
| 3 | 1.607 | 1.624502 | 0.899 |

TrainOutput(global_step=189, training_loss=2.3317694184641358, metrics={'train_runtime': 548.8987, 'train_samples_per_second': 21.862, 'train_steps_per_second': 0.344, 'total_flos': 9.307289843712e+17, 'train_loss': 2.3317694184641358, 'epoch': 3.0})
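
The reported `global_step=189` can be reproduced from the settings above with a little arithmetic. The sketch assumes a single device and assumes that a partial final gradient-accumulation window still counts as one optimizer step; other Trainer versions may round differently.

```python
import math

total_examples = 5000                             # split='train[:5000]'
test_examples = round(total_examples * 0.2)       # train_test_split(test_size=0.2)
train_examples = total_examples - test_examples   # 4000

per_device_batch = 16
grad_accum = 4
epochs = 3

# Assumes a single device, so the effective batch is 16 * 4 = 64.
effective_batch = per_device_batch * grad_accum

# Assumes a partial final accumulation window still counts as one step.
batches_per_epoch = math.ceil(train_examples / per_device_batch)   # 250
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)        # 63
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 64 63 189
```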

## Inference

In [15]:
ds = load_dataset("ethz/food101", split="validation[:10]")
img = ds[0]["image"]  # index the row first so only one image is decoded

In [16]:
from transformers import pipeline

classifier = pipeline("image-classification", model="models/food_vit_img_cls_model")
classifier(img)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use mps:0


[{'label': 'beignets', 'score': 0.24651269614696503},
 {'label': 'prime_rib', 'score': 0.016910437494516373},
 {'label': 'pork_chop', 'score': 0.013105366379022598},
 {'label': 'hamburger', 'score': 0.01308775506913662},
 {'label': 'chicken_wings', 'score': 0.012260889634490013}]

In [18]:
from transformers import AutoImageProcessor
import torch

img_processor_inference = AutoImageProcessor.from_pretrained("models/food_vit_img_cls_model")
inputs = img_processor_inference(img, return_tensors="pt") 

In [19]:
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained("models/food_vit_img_cls_model")

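# Inference only: disable gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits

# Index of the highest logit, mapped back to its class name.
predicted_label = logits.argmax(-1).item()
model.config.id2label[predicted_label]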

'beignets'
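
The manual argmax and the pipeline's top result agree because a softmax keeps the ordering of the logits. The sketch below shows this on made-up numbers, assuming the pipeline scores are softmax probabilities (which I believe is the default for single-label classification); `names` and `logits` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes; names are only for illustration.
names = ["beignets", "prime_rib", "pork_chop"]
logits = [2.0, 0.5, 0.1]

probs = softmax(logits)
best = max(range(len(logits)), key=lambda i: logits[i])  # same as argmax

# Softmax is monotonic, so the top logit is also the top probability.
assert best == max(range(len(probs)), key=lambda i: probs[i])
print(names[best], round(sum(probs), 6))  # beignets 1.0
```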