## Training a Neural Network for Image Classification

The purpose of the convolutional layers is to learn important image features that can
be used for the task at hand. Convolutional layers work by applying a filter to a
particular area of an image (the size of the convolution). The weights of this layer
then learn to recognize specific image features critical in the classification task. For
instance, if we’re training a model that recognizes a person’s hand, the filter may learn
to recognize fingers.
The purpose of the pooling layer is typically to reduce the dimensionality of the inputs
from the previous layer. This layer also uses a filter applied to a portion of the input,
but it has no activation. Instead, it reduces dimensionality of the input by performing
max pooling (where it selects the pixel in the filter with the highest value) or average
pooling (where it takes an average of the input pixels to use instead)

In [1]:
# Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

In [6]:
# Define the convolutional neural network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(64 * 14 * 14, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = nn.functional.relu(self.conv1(x))
        x = nn.functional.relu(self.conv2(x))
        x = nn.functional.max_pool2d(self.dropout1(x), 2)
        x = torch.flatten(x, 1)
        x = nn.functional.relu(self.fc1(self.dropout2(x)))
        x = self.fc2(x)
        return nn.functional.log_softmax(x, dim=1)

In [7]:
# Set the device to run on
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Define the data preprocessing steps
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])

In [8]:
# Load the MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True,
transform=transform)
test_dataset = datasets.MNIST('./data', train=False, transform=transform)
# Create data loaders
batch_size = 64
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size,
shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size,
shuffle=True)
# Initialize the model and optimizer
model = Net().to(device)
optimizer = optim.Adam(model.parameters())
# Compile the model using torch 2.0's optimizer
model = torch.compile(model)

In [9]:
# Define the training loop
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()

In [10]:
# Define the testing loop
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        # get the index of the max log-probability
        test_loss += nn.functional.nll_loss(
        output, target, reduction='sum'
        ).item() # sum up batch loss
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

In [11]:
test_loss /= len(test_loader.dataset)

## Fine-Tuning a Pretrained Model for Image Classification

In [1]:
# Import libraries
import torch
from torchvision.transforms import(
RandomResizedCrop, Compose, Normalize, ToTensor
)
from transformers import Trainer, TrainingArguments, DefaultDataCollator
from transformers import ViTFeatureExtractor, ViTForImageClassification
from datasets import load_dataset, load_metric, Image

2023-12-28 01:53:54.402054: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-28 01:53:54.404584: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-28 01:53:54.445235: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-28 01:53:54.445266: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-28 01:53:54.446391: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

In [2]:
# Define a helper function to convert the images into RGB

def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in
    examples["image"]]
    del examples["image"]
    return examples

# Define a helper function to compute metrics
def compute_metrics(p):
    return metric.compute(predictions=np.argmax(p.predictions, axis=1),
    references=p.label_ids)

# Load the fashion mnist dataset
dataset = load_dataset("fashion_mnist")

In [3]:
# Load the processor from the VIT model
image_processor = ViTFeatureExtractor.from_pretrained(
"google/vit-base-patch16-224-in21k"
)
# Set the labels from the dataset
labels = dataset['train'].features['label'].names



In [4]:
# Load the pretrained model
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(labels),
    id2label={str(i): c for i, c in enumerate(labels)},
    label2id={c: str(i) for i, c in enumerate(labels)}
)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# Define the collator, normalizer, and transforms
collate_fn = DefaultDataCollator()
normalize = Normalize(mean=image_processor.image_mean,
std=image_processor.image_std)
size = (
image_processor.size["shortest_edge"]
if "shortest_edge" in image_processor.size
else (image_processor.size["height"], image_processor.size["width"])
)
_transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
# Load the dataset we'll use with transformations
dataset = dataset.with_transform(transforms)
# Use accuracy as our metric
metric = load_metric("accuracy")

  metric = load_metric("accuracy")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [6]:
# Set the training args
training_args = TrainingArguments(
    output_dir="fashion_mnist_model",
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=0.01,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
    )

In [7]:
# Instantiate a trainer
trainer = Trainer(
model=model,
args=training_args,
data_collator=collate_fn,
compute_metrics=compute_metrics,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
tokenizer=image_processor,
)

In [8]:
# Train the model, log and save metrics
train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

  0%|          | 0/937 [00:00<?, ?it/s]

{'loss': 2.0621, 'learning_rate': 0.0010638297872340426, 'epoch': 0.01}
{'loss': 1.8625, 'learning_rate': 0.002127659574468085, 'epoch': 0.02}
{'loss': 2.029, 'learning_rate': 0.003191489361702128, 'epoch': 0.03}
{'loss': 2.1504, 'learning_rate': 0.00425531914893617, 'epoch': 0.04}
{'loss': 2.2108, 'learning_rate': 0.005319148936170213, 'epoch': 0.05}
{'loss': 2.1254, 'learning_rate': 0.006382978723404256, 'epoch': 0.06}
{'loss': 2.1921, 'learning_rate': 0.007446808510638297, 'epoch': 0.07}
{'loss': 2.2257, 'learning_rate': 0.00851063829787234, 'epoch': 0.09}
{'loss': 2.1419, 'learning_rate': 0.009574468085106383, 'epoch': 0.1}
{'loss': 2.1191, 'learning_rate': 0.009928825622775802, 'epoch': 0.11}
{'loss': 2.0848, 'learning_rate': 0.009810201660735469, 'epoch': 0.12}
{'loss': 2.0774, 'learning_rate': 0.009691577698695136, 'epoch': 0.13}
{'loss': 2.1317, 'learning_rate': 0.009572953736654805, 'epoch': 0.14}
{'loss': 2.0978, 'learning_rate': 0.009454329774614472, 'epoch': 0.15}
{'loss': 

  0%|          | 0/625 [00:00<?, ?it/s]

NameError: name 'np' is not defined

In the realm of unstructured data like text and images, it is extremely common
to start from pretrained models trained on large datasets, instead of starting from
scratch, especially in cases where we don’t have access to as much labeled data. Using
embeddings and other information from the larger model, we can then fine-tune
our own model for a new task without the need for as much labeled information.
In addition, the pretrained model may have information not captured at all in our
training dataset, resulting in an overall performance improvement. This process is
known as transfer learning.
In this example, we load the weights from Google’s ViT (Vision Transformer) model.