# Transfer Learning

Transfer learning consists of reusing a previously trained model, generally trained on
a large and generic dataset (for example, ImageNet, which contains millions of images),
as the starting point for solving a more specific task. Instead of training a neural
network from scratch, the representations the model has already learned are leveraged.
These representations tend to capture basic patterns such as edges, textures, and shapes,
as well as increasingly complex visual compositions.

This approach notably reduces the amount of data required, accelerates training, and
usually yields better performance when only small or medium-sized datasets are available.
The network already encodes general visual features, and it only needs to be specialized
for the new task.

## Fundamental Strategies in Transfer Learning

In practice, pretrained models are used through three main strategies, which differ in
which parts of the model are updated during training: feature extraction, partial
fine-tuning, and full fine-tuning. The choice among
these strategies depends primarily on the size of the target dataset and the similarity
between the pretraining domain and the new task.

### Feature Extraction

In the feature extraction strategy, all parameters of the pretrained model are frozen
except for the final classification layer. In this configuration, the pretrained network
acts as a fixed feature extractor, and only a lightweight classifier is trained on top of
those features.

In [None]:
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained model (torchvision >= 0.13: `weights` replaces the deprecated `pretrained=True`)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze ALL layers
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer (classifier)
num_classes = 2  # Example: Dogs vs. cats
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only model.fc parameters will be trained

This option is particularly suitable when the dataset is very small (for example, fewer
than approximately 1,000 images) and when the images are relatively similar to those in
ImageNet, such as natural scenes and everyday objects. Training is fast, and the risk of
overfitting is reduced, since most weights remain fixed and only the last layer adapts to
the new categories. However, the model's capacity for adapting to domains that differ
substantially from the pretraining data is limited, because the internal representations
are not modified.
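
A quick way to confirm that the freezing step did what was intended is to count trainable parameters. The sketch below is illustrative only: `count_parameters` is a hypothetical helper, and a tiny two-layer `toy_model` stands in for the pretrained network so that the example runs without downloading weights.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


# Hypothetical stand-in for a pretrained backbone followed by a classifier head
toy_model = nn.Sequential(
    nn.Linear(4, 3),  # plays the role of the frozen backbone
    nn.Linear(3, 2),  # plays the role of the new classification layer
)

# Freeze everything, then re-enable gradients for the head only
for param in toy_model.parameters():
    param.requires_grad = False
for param in toy_model[1].parameters():
    param.requires_grad = True

trainable, total = count_parameters(toy_model)
print(f"Trainable: {trainable} of {total} parameters")
```

Applied to the ResNet-18 configured above with a two-class head, the trainable count should be 512 × 2 + 2 = 1,026, which is the size of `model.fc` alone.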

### Partial Fine-Tuning

In partial fine-tuning, the earliest layers of the network (those closest to the input)
are kept frozen, whereas several of the last convolutional layers, together with the
classifier, are unfrozen and updated during training. The underlying idea is to preserve
the most generic features, such as edges and simple textures, and adapt the higher-level
representations to the new task.

In [None]:
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained model (torchvision >= 0.13: `weights` replaces the deprecated `pretrained=True`)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the first layers (for example, all except layer4 and fc)
for name, param in model.named_parameters():
    if "layer4" not in name and "fc" not in name:
        param.requires_grad = False

# Replace classifier
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optimizer with different learning rates
optimizer = torch.optim.Adam(
    [
        {
            "params": model.layer4.parameters(),
            "lr": 1e-4,  # Low learning rate for pretrained layers
        },
        {
            "params": model.fc.parameters(),
            "lr": 1e-3,  # Higher learning rate for the new layer
        },
    ]
)

This approach is appropriate for medium-sized datasets (on the order of 1,000 to 10,000
images) and when the domain is moderately different from ImageNet. An example is a
medical imaging task in which the structures still share certain visual patterns, such as
textures or shapes, with natural images, but require adaptation of the high-level
abstractions. In this context, partial fine-tuning offers a balance between flexibility
and protection against overfitting, since only part of the network is modified.
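
When the set of unfrozen layers changes between experiments, hardcoding layer names in the optimizer becomes error-prone. One possible approach, sketched below, is to build the parameter groups from whatever is currently trainable. `build_param_groups` is a hypothetical helper, the learning rates mirror the ones used above, and a three-layer `toy_model` stands in for the real network.

```python
import torch
import torch.nn as nn


def build_param_groups(model, head, backbone_lr=1e-4, head_lr=1e-3):
    """Split trainable parameters into a backbone group and a head group."""
    head_ids = {id(p) for p in head.parameters()}
    backbone_params = [
        p for p in model.parameters() if p.requires_grad and id(p) not in head_ids
    ]
    head_params = [p for p in head.parameters() if p.requires_grad]
    return [
        {"params": backbone_params, "lr": backbone_lr},  # unfrozen pretrained layers
        {"params": head_params, "lr": head_lr},  # newly initialized classifier
    ]


# Hypothetical three-stage stand-in: early block, late block, classifier
toy_model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4), nn.Linear(4, 2))
for param in toy_model[0].parameters():
    param.requires_grad = False  # keep the earliest block frozen

groups = build_param_groups(toy_model, head=toy_model[2])
optimizer = torch.optim.Adam(groups)
print([len(g["params"]) for g in optimizer.param_groups])
```

With the ResNet-18 above, calling `build_param_groups(model, head=model.fc)` after the freezing loop would place the `layer4` parameters in the first group and the `fc` parameters in the second.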

### Full Fine-Tuning

In full fine-tuning, all parameters of the pretrained model are updated during training,
typically using a relatively low learning rate. The objective is to adapt the entire
network to the new domain while avoiding large updates that would abruptly overwrite the
knowledge acquired during pretraining (a failure mode known as catastrophic forgetting).

In [None]:
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained model (torchvision >= 0.13: `weights` replaces the deprecated `pretrained=True`)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace classifier
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Entire model is trainable (low learning rate)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

This strategy is recommended when a large dataset is available (for example, more than
10,000 images) or when the domain is very different from the original one (for instance,
highly specific medical images, satellite images, or industrial data with unusual
textures). It implies a higher computational cost and a greater risk of overfitting if
the dataset is too small, but with sufficient data it often yields the best performance
of the three strategies, as the entire network is adapted to the new problem.
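
The rules of thumb given for the three strategies can be collected in a small helper. This is only a sketch of one reading of those guidelines: `suggest_strategy` and the `domain_gap` labels are hypothetical, the thresholds are the approximate figures quoted above, and how to resolve mixed cases (for example, a small dataset with a moderate domain gap) is an assumption.

```python
def suggest_strategy(num_images: int, domain_gap: str) -> str:
    """Map dataset size and domain gap to a starting strategy.

    domain_gap is a hypothetical label: "small", "moderate", or "large".
    """
    if domain_gap not in {"small", "moderate", "large"}:
        raise ValueError(f"Unknown domain_gap: {domain_gap!r}")
    # Large dataset or very different domain: adapt the whole network
    if num_images > 10_000 or domain_gap == "large":
        return "full fine-tuning"
    # Medium dataset or moderately different domain: adapt the last layers
    if num_images >= 1_000 or domain_gap == "moderate":
        return "partial fine-tuning"
    # Very small dataset that resembles ImageNet: train the classifier only
    return "feature extraction"


print(suggest_strategy(500, "small"))
print(suggest_strategy(5_000, "moderate"))
print(suggest_strategy(50_000, "small"))
```

The result is a starting point rather than a decision. A small dataset from a very different domain, for instance, maps to full fine-tuning here, but the overfitting risk noted above still applies and should be checked on a validation set.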

## Complete Example: Binary Classification (Dogs vs. Cats)

The following section presents a simplified workflow based on the feature extraction
strategy applied to a binary classification problem, such as distinguishing dogs from
cats. This example illustrates the typical steps needed to reuse a pretrained
convolutional network in a practical setting.

### Data Preparation

The necessary transformations are defined first. These transformations include resizing,
cropping, conversion to tensor, and normalization with the mean and standard deviation
used in ImageNet. This normalization matters when reusing pretrained models, since their
weights were optimized on inputs standardized with these per-channel statistics.

In [None]:
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Transformations (ImageNet normalization)
transform = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],  # ImageNet mean
            std=[0.229, 0.224, 0.225],  # ImageNet standard deviation
        ),
    ]
)

# Load dataset organized in folders by class
train_dataset = datasets.ImageFolder("data/train", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

The dataset is assumed to be organized in directories, where each subfolder corresponds
to a class (for example, `dogs/` and `cats/`), following the typical structure used by
`ImageFolder`. This organization enables the automatic assignment of labels according to
the folder names.
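
To make the label assignment concrete, the sketch below mimics it using only the standard library: class subfolders are sorted alphabetically and numbered from zero. `folder_class_mapping` is a hypothetical helper written for illustration, and the directory tree is a throwaway one created on the fly.

```python
import os
import tempfile


def folder_class_mapping(root: str) -> dict[str, int]:
    """Mimic how ImageFolder maps class subfolders to integer labels."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}


# Build a throwaway directory tree with the two example classes
with tempfile.TemporaryDirectory() as root:
    for class_name in ("dogs", "cats"):
        os.makedirs(os.path.join(root, class_name))
    mapping = folder_class_mapping(root)

print(mapping)  # {'cats': 0, 'dogs': 1}
```

On a real dataset, the same mapping is available as `train_dataset.class_to_idx`, which is worth printing once to confirm which integer corresponds to which class.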

### Model Loading and Training

ResNet-18 is used as a feature extractor, and only the last fully connected layer is
trained for the specific classification problem.

In [None]:
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained model (torchvision >= 0.13: `weights` replaces the deprecated `pretrained=True`)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature extraction: Freeze everything except the last layer
for param in model.parameters():
    param.requires_grad = False

# Replace the classification layer: 2 classes (dogs and cats)
model.fc = nn.Linear(512, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

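for epoch in range(5):
    # Keep the frozen backbone in eval mode so that BatchNorm running statistics stay
    # fixed; the new linear layer trains the same way in either mode
    model.eval()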
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

After this training process, the model is specialized in distinguishing between dogs and
cats. It reuses the general visual representations that it previously learned from
ImageNet and adapts only the final decision layer. In a more complete setting, it is
advisable to incorporate validation sets, early stopping, and possibly data augmentation
in order to improve generalization.
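
As an illustration of the early stopping mentioned above, the following sketch tracks the validation loss and signals a stop once it has failed to improve for a given number of epochs. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values are all made up for this example; in practice the losses would come from evaluating the model on a held-out validation loader after each epoch.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses, one per epoch, for illustration only
val_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}, best loss {stopper.best_loss:.2f}")
```

Here training stops at epoch 5, after three epochs without beating the best loss of 0.50, so the later value of 0.49 is never reached. This shows the trade-off in choosing `patience`: a small value saves computation but may stop before a late improvement.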

## Embedding Extraction

Embeddings are dense numerical vectors (512-dimensional for ResNet-18) that compactly
represent the content of an image. To obtain them, the pretrained model is reused by removing the final
classification layer and retaining only the part that acts as a feature extractor. In
convolutional architectures such as ResNet, this generally corresponds to the
convolutional trunk and the global pooling operations, up to but excluding the final
fully connected layer.

In [None]:
import torch
import torch.nn as nn
import numpy as np


class FeatureExtractor:
    def __init__(self, model):
        # Remove the last layer (classifier) and keep the convolutional trunk
        self.features = nn.Sequential(*list(model.children())[:-1])
        self.features.eval()

    def extract(self, image):
        """Extracts the embedding of one or more images."""
        with torch.no_grad():
            embedding = self.features(image)  # Shape (B, C, 1, 1) in ResNet
            embedding = embedding.view(embedding.size(0), -1)  # Flatten to (B, C)
        # Move to CPU first so the conversion also works when the model runs on a GPU
        return embedding.cpu().numpy()


# Usage example
extractor = FeatureExtractor(model)
image = torch.randn(1, 3, 224, 224)  # Example image
embedding = extractor.extract(image)
print(f"Embedding shape: {embedding.shape}")  # (1, 512) in ResNet18

These embeddings can be used as input for subsequent tasks such as clustering,
visualization, similarity search, or integration into other machine learning models that
operate on vector representations instead of raw images. For example, these vectors may
serve as input to classical algorithms such as support vector machines, logistic
regression, or k-means clustering.
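
As a sketch of that idea, the example below fits a scikit-learn logistic regression on embedding vectors and measures held-out accuracy. The embeddings here are synthetic Gaussian clusters standing in for real `(N, 512)` ResNet-18 outputs, so the accuracy says nothing about real images; the class offset and the split ratio are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (N, 512) ResNet-18 embeddings of two classes
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=0.0, scale=1.0, size=(100, 512))
class_1 = rng.normal(loc=0.5, scale=1.0, size=(100, 512))
features = np.vstack([class_0, class_1])
targets = np.array([0] * 100 + [1] * 100)

# Hold out a quarter of the samples to estimate generalization
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.25, stratify=targets, random_state=0
)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

With real data, `features` would be replaced by the array returned by `extractor.extract` and `targets` by the corresponding labels.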

## Embedding Visualization with PCA and t-SNE

When embeddings have been obtained for many images, for example for the entire training
set, they can be projected into two dimensions to visualize how the different classes
cluster in the feature space. This provides an intuitive understanding of how well the
model separates the categories and whether additional processing or model adaptation
might be necessary.

### Visualization with PCA

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that
identifies the directions of maximum variance in the data. It is fast, deterministic, and
offers a first approximation of the structure of the embedding space.

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

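# Assumes `embeddings` is an (N, D) array of image embeddings and `labels` is the
# matching (N,) array of class indices, both computed for the whole dataset
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)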

plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=labels, cmap="tab10")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Embeddings in 2D (PCA)")
plt.show()

The resulting plot allows observation of whether the embeddings form well-separated
clusters according to the class labels or whether there is substantial overlap. If the
classes are not clearly separated, it may be necessary to revise the training strategy,
collect more data, or modify the network architecture.
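
Because a 2D PCA plot shows only the two directions of greatest variance, it helps to check how much of the total variance those directions capture before drawing conclusions from overlap. The sketch below computes the projection and the explained-variance ratio directly from a singular value decomposition. `pca_project` is a hypothetical helper, and the data are synthetic.

```python
import numpy as np


def pca_project(data: np.ndarray, n_components: int = 2):
    """Project rows of `data` onto the top principal components via SVD."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained_variance = singular_values**2 / (len(data) - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return projected, ratio


# Synthetic data: most of the variance lies along the first feature axis
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10)) * np.array([5.0] + [1.0] * 9)

projected, ratio = pca_project(data, n_components=2)
print(projected.shape)  # (200, 2)
print(ratio)
```

scikit-learn exposes the same quantity as `pca.explained_variance_ratio_` after fitting. If the two plotted components account for only a small share of the variance, apparent overlap between classes may be an artifact of the projection rather than a property of the embeddings.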

### Visualization with t-SNE

The t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm is a nonlinear
dimensionality reduction technique that often provides more visually interpretable
clusters than PCA, especially in high-dimensional spaces. However, it is more
computationally expensive and sensitive to hyperparameters such as perplexity and
learning rate.

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30)
embeddings_2d = tsne.fit_transform(embeddings)

plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=labels, cmap="tab10")
plt.title("Embeddings in 2D (t-SNE)")
plt.show()

PCA is deterministic and efficient, which makes it suitable for quick analyses and large
datasets. t-SNE, by contrast, tends to better preserve local neighborhood relationships
and reveal fine-grained groupings, at the expense of higher computational cost and some
variability between runs. In practice, it is common to use PCA as a preliminary step to
reduce dimensionality before applying t-SNE on a smaller number of components.
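
The two-stage approach described above might look as follows. This is a minimal sketch on synthetic data: the number of PCA components and the perplexity are arbitrary values chosen for a small example, and fixing `random_state` is one way to limit the run-to-run variability of t-SNE.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for (N, 512) embeddings from two classes
rng = np.random.default_rng(0)
features = np.vstack(
    [
        rng.normal(loc=0.0, size=(40, 512)),
        rng.normal(loc=1.0, size=(40, 512)),
    ]
)

# Step 1: linear reduction to a moderate number of components
reduced = PCA(n_components=30, random_state=0).fit_transform(features)

# Step 2: nonlinear projection to 2D; perplexity must be smaller than the sample count
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
projected = tsne.fit_transform(reduced)
print(projected.shape)  # (80, 2)
```

The resulting `projected` array can be passed to the same `plt.scatter` call used in the earlier plots.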

## Semantic Search with Cosine Similarity

A direct application of embeddings is semantic search: Given a query embedding, the goal
is to retrieve the most similar images in a database by comparing their vector
representations using a similarity measure such as cosine similarity.

### Cosine Similarity

Cosine similarity between two vectors $v_1$ and $v_2$ is defined as:

$$\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \; \lVert v_2 \rVert}$$

and takes values between $-1$ and $1$. A value of $1$ indicates that the vectors point in
the same direction, $0$ indicates orthogonality (absence of directional relationship),
and $-1$ indicates that the vectors point in opposite directions. The ResNet embeddings
used here are taken after a ReLU activation and average pooling, so their components are
non-negative and their similarities fall between $0$ and $1$, with values close to $1$
for semantically similar images.

In [None]:
import numpy as np


def cosine_similarity(vec1, vec2):
    """
    Calculates cosine similarity between two vectors.
    Result is between -1 and 1.
    """
    dot = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot / (norm1 * norm2)
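
The function above divides by zero when either input is the zero vector. The variant below is a sketch of one way to guard against that, followed by the three reference cases from the definition. `safe_cosine_similarity` and its `eps` threshold are hypothetical, and returning `0.0` for a zero vector is a convention chosen here, not a mathematical necessity.

```python
import numpy as np


def safe_cosine_similarity(vec1, vec2, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector is (near) zero."""
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if norm_product < eps:
        return 0.0
    return float(np.dot(vec1, vec2) / norm_product)


a = np.array([1.0, 0.0])
print(safe_cosine_similarity(a, np.array([3.0, 0.0])))   # 1.0 (same direction)
print(safe_cosine_similarity(a, np.array([0.0, 2.0])))   # 0.0 (orthogonal)
print(safe_cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0 (opposite)
print(safe_cosine_similarity(a, np.zeros(2)))            # 0.0 (zero-vector guard)
```

The first case also shows that cosine similarity ignores magnitude: scaling a vector by 3 leaves the result at 1.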

### Simple Semantic Search Engine

Based on cosine similarity, a search mechanism can be constructed that, given a query
embedding, returns the $k$ most similar images in a reference set.

In [None]:
class SemanticSearch:
    def __init__(self, embeddings_db, labels_db):
        """
        embeddings_db: Matrix (N, D) with database embeddings.
        labels_db: Vector (N,) with labels or identifiers.
        """
        norms = np.linalg.norm(embeddings_db, axis=1, keepdims=True)
        self.embeddings_db = embeddings_db / norms  # Normalize for cosine similarity
        self.labels_db = labels_db

    def search(self, query_embedding, top_k=5):
        """Returns the top_k most similar elements to the query."""
        query_norm = query_embedding / np.linalg.norm(query_embedding)
        similarities = np.dot(self.embeddings_db, query_norm)
        top_indices = np.argsort(similarities)[::-1][:top_k]
        top_similarities = similarities[top_indices]
        return top_indices, top_similarities

To construct the search database, embeddings are computed for all images in the dataset:

In [None]:
all_embeddings = []
all_labels = []

for images, labels in train_loader:
    embs = extractor.extract(images)
    all_embeddings.append(embs)
    all_labels.extend(labels.numpy())

all_embeddings = np.vstack(all_embeddings)
all_labels = np.array(all_labels)

searcher = SemanticSearch(all_embeddings, all_labels)

# Query with a new image
query_image = torch.randn(1, 3, 224, 224)  # Example query
query_emb = extractor.extract(query_image).squeeze()

indices, sims = searcher.search(query_emb, top_k=5)

for i, (idx, sim) in enumerate(zip(indices, sims), 1):
    print(f"#{i}: Index {idx}, Similarity {sim:.3f}, Class {all_labels[idx]}")

This mechanism constitutes the basis of visual recommendation systems, image search by
similarity, and interactive tools for exploring visual databases. By changing how the
embeddings are computed or indexed (for example, using approximate nearest-neighbor
search with specialized libraries), the system can scale to large collections of images
without prohibitive computational costs at query time.
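
Before reaching for an approximate index, some cost can be removed from exact search itself. The sketch below handles a batch of queries with one matrix product and uses `np.argpartition` to select the top `k` scores without fully sorting every row. Note that this is still exact search, not approximate nearest-neighbor search; `top_k_cosine` is a hypothetical helper and the data are random.

```python
import numpy as np


def top_k_cosine(database, queries, k=5):
    """Exact top-k cosine search for a batch of queries.

    database: (N, D) array, queries: (Q, D) array, with k <= N.
    Returns (indices, similarities), each of shape (Q, k), best match first.
    """
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    qs = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = qs @ db.T  # (Q, N) matrix of cosine similarities

    # argpartition finds the k largest per row without sorting all N scores
    candidates = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    candidate_sims = np.take_along_axis(sims, candidates, axis=1)

    # Sort only the k candidates so that the best match comes first
    order = np.argsort(-candidate_sims, axis=1)
    indices = np.take_along_axis(candidates, order, axis=1)
    return indices, np.take_along_axis(candidate_sims, order, axis=1)


rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 64))
# Queries are slightly perturbed copies of the first three database rows
queries = database[:3] + 0.01 * rng.normal(size=(3, 64))

indices, sims = top_k_cosine(database, queries, k=5)
print(indices[:, 0])  # nearest neighbor of each query
```

Each query should retrieve the row it was derived from as its nearest neighbor. The cost still grows linearly with the database size, which is the point at which approximate indexing becomes worthwhile.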

## Simplified Complete Transfer Learning Pipeline

The typical workflow for transfer learning in image classification can be conceptually
structured in the following steps: A pretrained model is loaded; a training strategy is
selected (feature extraction, partial fine-tuning, or full fine-tuning); the model is
trained according to this strategy; an embedding extractor is built; embeddings are
computed for the dataset; the structure of the feature space is explored via
visualization techniques; and finally, a semantic search system is constructed using
these embeddings.

In code, a simplified pipeline oriented to feature extraction may take the following
form:

In [None]:
import torch
import torch.nn as nn
import torchvision.models as models
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

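# STEP 1: Load pretrained model and prepare for feature extraction
# (torchvision >= 0.13: `weights` replaces the deprecated `pretrained=True`)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)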
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, 2)  # Example: 2 classes

# STEP 2: Train only the classifier
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

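for epoch in range(5):
    # Keep the frozen backbone in eval mode so that BatchNorm running statistics stay
    # fixed; the new linear layer trains the same way in either mode
    model.eval()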
    for images, labels in train_loader:
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# STEP 3: Create embedding extractor
extractor = FeatureExtractor(model)

# STEP 4: Compute dataset embeddings
embeddings = []
labels_list = []

for images, labels in train_loader:
    emb = extractor.extract(images)
    embeddings.append(emb)
    labels_list.extend(labels.numpy())

embeddings = np.vstack(embeddings)
labels_array = np.array(labels_list)

# STEP 5: Visualize embeddings (for example, with PCA)
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(embeddings)

plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=labels_array, cmap="tab10")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Embeddings in 2D (PCA)")
plt.show()

# STEP 6: Create semantic search engine
searcher = SemanticSearch(embeddings, labels_array)

# STEP 7: Search for images similar to a query image
query_img = next(iter(train_loader))[0][0:1]
query_emb = extractor.extract(query_img).squeeze()

indices, sims = searcher.search(query_emb, top_k=5)

for i, (idx, sim) in enumerate(zip(indices, sims), 1):
    print(f"#{i}: Index {idx}, Similarity {sim:.3f}, Class {labels_array[idx]}")

This pipeline exemplifies how a pretrained model is reused to solve a classification
problem, how embeddings derived from that model can be exploited, and how these
embeddings feed into visualization and semantic search functionalities. The complete
process illustrates the versatility of transfer learning as a foundational tool in modern
computer vision, enabling efficient development of high-performance models even when
labeled data are limited.