# Deep Learning for Image Data

- Pre-trained models
- Model building from scratch

# Pre-trained models

<img src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cecbccba-6358-476e-9fd8-e2807de9f220/Frame_118.png?t=1693044751" width=500>

Hugging Face was founded in 2016.

It hosts thousands of pre-trained models (e.g., BERT, GPT-2) that you can use **without training from scratch**!

[Go to Hugging Face](https://huggingface.co/) and explore [the pre-trained models available on the website](https://huggingface.co/models).

## [ResNet-50 v1.5](https://huggingface.co/microsoft/resnet-50)

"ResNet (Residual Network) is a convolutional neural network. ResNet model pre-trained on ImageNet-1k at resolution 224x224."

- 1,000 object categories (classes)
- 1.2 million training images
- 50,000 validation images

In [None]:
import warnings
warnings.filterwarnings("ignore")

from transformers import pipeline
import torch

import textwrap # print output in multiple lines

<img src="https://hips.hearstapps.com/hmg-prod/images/pembroke-welsh-corgi-royalty-free-image-1726720011.jpg?crop=1.00xw:0.756xh;0,0.134xh&resize=1024:">

In [None]:
# Check for CUDA (GPU)
device = 0 if torch.cuda.is_available() else -1  # 0 for GPU, -1 for CPU
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

# Load image classification pipeline with ResNet-50 (CNN)
classifier = pipeline(
    "image-classification",
    model="microsoft/resnet-50",
    device=device,
    use_fast=True  # Use the fast image processor to avoid the warning
)

# Classify the image
result = classifier("https://hips.hearstapps.com/hmg-prod/images/pembroke-welsh-corgi-royalty-free-image-1726720011.jpg")

for item in result:
    print(f"Label: {item['label']}, Score: {item['score']:.4f}")

<img src="https://people.com/thmb/TlNhUj4fJ8pnJNpEvUN-015Jcac=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(979x595:981x597):format(webp)/bts-members-1-03a9c478f1794c448bcb5f74bf94812c.jpg">

In [None]:
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests

image = Image.open(requests.get("https://people.com/thmb/TlNhUj4fJ8pnJNpEvUN-015Jcac=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(979x595:981x597):format(webp)/bts-members-1-03a9c478f1794c448bcb5f74bf94812c.jpg", stream=True).raw)

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))




The caption says "btt", not "BTS"?

Oops! That looks like a hallucination from the model.

More advanced models (e.g., [BLIP-2](https://huggingface.co/docs/transformers/en/model_doc/blip-2)) are generally more accurate.

## [CLIP model](https://huggingface.co/docs/transformers/en/model_doc/clip)

"CLIP is a multimodal vision and language model motivated by **overcoming the fixed number of object categories** when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining on this scale enables **zero-shot transfer** to downstream tasks." Developed by OpenAI.

This is a **transformer**-based model.

In [None]:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch, requests

# Load model & processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id).to(device)
processor = CLIPProcessor.from_pretrained(model_id)

# Image and candidate captions
image = Image.open(requests.get(
    "https://people.com/thmb/TlNhUj4fJ8pnJNpEvUN-015Jcac=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(979x595:981x597):format(webp)/bts-members-1-03a9c478f1794c448bcb5f74bf94812c.jpg",
    stream=True).raw)

texts = ["a photo of BTS",
         "a photo of a dog",
         "a photo of a band",
         "a group of men"]

# Predict
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

# Display results
print("\n CLIP Similarity Scores:")
for text, p in zip(texts, probs):
    print(f"{text:<25} -> {p:.4f}")
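
The `softmax(dim=1)` step in the cell above turns CLIP's raw image-text similarity logits into scores that sum to 1 across the candidate captions. A minimal sketch of that step in plain Python follows; the logit values are made up for illustration and are not actual CLIP outputs.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity logits for four candidate captions (illustrative only)
example_texts = ["a photo of BTS", "a photo of a dog", "a photo of a band", "a group of men"]
example_logits = [24.0, 15.0, 23.0, 22.0]

example_probs = softmax(example_logits)
for text, p in zip(example_texts, example_probs):
    print(f"{text:<25} -> {p:.4f}")
```

Because the scores are relative to the candidate list, adding or removing a caption changes every probability.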

# CNN Model Building from Scratch

<img src="https://i0.wp.com/developersbreach.com/wp-content/uploads/2020/08/cnn_banner.png?fit=1400%2C658&ssl=1">

### CNN Architecture Summary Table

| Step | Layer Type               | Description                                                                 |
|------|--------------------------|-----------------------------------------------------------------------------|
| 1️⃣   | **Input**                | Raw image input (e.g., a zebra).                                            |
| 2️⃣   | **Convolution + ReLU**  | Filters (kernels) extract features like edges; ReLU adds non-linearity.     |
| 3️⃣   | **Pooling**             | Downsamples feature maps to reduce size and retain important info.          |
| 4️⃣   | **Convolution + ReLU**  | Further feature extraction (deeper patterns).                               |
| 5️⃣   | **Pooling**             | More downsampling for dimensionality reduction.                             |
| 6️⃣   | **Flatten**             | Converts feature maps into a 1D feature vector.                             |
| 7️⃣   | **Fully Connected**     | Dense layers combine features and learn decision boundaries.                |
| 8️⃣   | **Output (Softmax)**    | Outputs class probabilities (e.g., Zebra: 0.7).                             |

✅ **Final Prediction**: Class with highest probability (e.g., **Zebra**).
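
Steps 2 and 3 of the table can be sketched by hand with NumPy on a toy image. This is only an illustration under simplifying assumptions: a single hand-picked vertical-edge kernel, stride 1, and no padding, whereas a trained CNN learns many kernels from data.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Replace negative values with 0."""
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

# Toy 6x6 image: dark left half (0), bright right half (1)
toy_image = np.zeros((6, 6))
toy_image[:, 3:] = 1.0

# Hand-picked vertical-edge kernel (hypothetical, for illustration)
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

feature_map = relu(conv2d_valid(toy_image, edge_kernel))  # shape (4, 4)
pooled = max_pool_2x2(feature_map)                        # shape (2, 2)
print(feature_map)
print(pooled)
```

The feature map is largest where the kernel straddles the dark-to-bright boundary, and pooling keeps that strong response while halving each dimension.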


```python
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
```


**CNN Model Layer Breakdown**

| Layer                                      | Purpose (Matches CNN Diagram)                                               |
|-------------------------------------------|------------------------------------------------------------------------------|
| `tf.keras.Input(shape=(28,28,1))`          | Input layer for 28×28 grayscale images (e.g., MNIST digits)                 |
| `Conv2D(16, (3, 3), activation='relu')`    | Convolution layer + ReLU to extract local patterns                          |
| `MaxPooling2D(2, 2)`                       | Pooling layer to downsample and retain key features                         |
| `Flatten()`                                | Flatten feature maps to 1D vector (prepares for Dense layers)               |
| `Dense(64, activation='relu')`             | Fully connected hidden layer                                                |
| `Dense(10, activation='softmax')`          | Output layer: softmax to predict probabilities for 10 classes               |
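
To see how each layer changes the data, the output shapes and trainable parameter counts can be worked out by hand. The arithmetic below is a sketch that assumes the Keras defaults for `Conv2D` (stride 1, no padding); `model.summary()` should report matching numbers.

```python
# Input: 28x28 grayscale image
height, width, channels = 28, 28, 1

# Conv2D(16, (3, 3)): each filter has 3*3*channels weights plus 1 bias
filters, kernel = 16, 3
conv_h, conv_w = height - kernel + 1, width - kernel + 1   # 26 x 26
conv_params = (kernel * kernel * channels + 1) * filters

# MaxPooling2D(2, 2): halves height and width, no parameters
pool_h, pool_w = conv_h // 2, conv_w // 2                   # 13 x 13

# Flatten: 13 * 13 * 16 values in one vector, no parameters
flat_size = pool_h * pool_w * filters

# Dense layers: (inputs + 1 bias) * units
dense1_params = (flat_size + 1) * 64
dense2_params = (64 + 1) * 10

total_params = conv_params + dense1_params + dense2_params

print(f"Conv2D output:  ({conv_h}, {conv_w}, {filters}), params: {conv_params}")
print(f"Pooling output: ({pool_h}, {pool_w}, {filters})")
print(f"Flatten output: ({flat_size},)")
print(f"Dense(64) params: {dense1_params}")
print(f"Dense(10) params: {dense2_params}")
print(f"Total params: {total_params}")
```

Almost all of the parameters sit in the first dense layer, which is why pooling before flattening matters for model size.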


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Deep Learning Setup
import tensorflow as tf
from tensorflow.keras.models import Sequential           # Sequential model: stack layers linearly
from tensorflow.keras.layers import Dense, Input         # Dense: fully connected layer, Input: define input shape
from tensorflow.keras.optimizers import Adam             # Adam: an efficient optimizer for training
from tensorflow.keras.utils import plot_model

import warnings
warnings.filterwarnings("ignore")

# Set seeds for reproducibility
import random
seed_value = 42  # Choose any seed value you want
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)
tf.config.experimental.enable_op_determinism()  # TensorFlow 2.9+

In [None]:
# Load data
(images, labels), _ = tf.keras.datasets.fashion_mnist.load_data()

# images are like X.
# labels are like y.

Fashion MNIST is a dataset of **grayscale images** of clothing items, commonly used for training image classification models.

- **Training set**: 60,000 images and labels
- **Test set**: 10,000 images and labels
- **Image size**: 28 × 28 pixels
- **Color**: Grayscale (single channel)
- **Labels**: Integers from 0 to 9, each representing a clothing category

| Label | Class Name   |
|------|--------------|
| 0    | T-shirt/top   |
| 1    | Trouser       |
| 2    | Pullover      |
| 3    | Dress         |
| 4    | Coat          |
| 5    | Sandal        |
| 6    | Shirt         |
| 7    | Sneaker       |
| 8    | Bag           |
| 9    | Ankle boot    |

Each image represents one article of clothing, and the label indicates the correct category.


In [None]:
# Create a DataFrame

flat_images = images.reshape(images.shape[0], -1)

df = pd.DataFrame(flat_images)
df['label'] = labels

label_map = {0: 'T-shirt/top', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat',
    5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle boot'}
df['label_name'] = df['label'].map(label_map)

df.head()

In [None]:
# view the first image
plt.imshow(images[0], cmap='gray')
plt.title("Label: {}".format(labels[0]))
plt.show()
# Label 9 is Ankle boot

In [None]:
# view the actual value of the above image
images[0]

- 28 rows  → height of the image  
- 28 cols  → width of the image  
- Each number = brightness of a pixel

In [None]:
# Neural networks usually train more reliably when inputs are on a small, consistent scale,
# so rescale the pixel values from 0–255 to 0.0–1.0
images = images / 255.0
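
The effect of dividing by 255 is easy to check on a tiny array. This sketch uses a made-up 2×3 "image" rather than Fashion MNIST data.

```python
import numpy as np

# A tiny made-up 2x3 "image" with 8-bit pixel values (0 = black, 255 = white)
toy_pixels = np.array([[0, 64, 128],
                       [191, 255, 32]], dtype=np.uint8)

# True division promotes uint8 to float64
toy_scaled = toy_pixels / 255.0

print(toy_scaled.dtype)
print(toy_scaled.min(), toy_scaled.max())
```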

In [None]:
# Expected shape by a CNN: (height, width, channels)
# Must be 3D per image: (28, 28, 1)
# The 1 is the channel → 1 for grayscale, 3 for RGB.
# Add channel dimension (needed for CNN)
images = images.reshape(-1, 28, 28, 1)

| Value | Meaning |
|:------|:--------|
| -1 | Automatically infer the number of images (batch size) |
| 28 | Height (pixels) |
| 28 | Width (pixels) |
| 1 | 1 Channel (grayscale) |
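
The role of `-1` can be checked on a small stand-in array. This sketch uses five blank images instead of the real dataset; `np.expand_dims` is shown as an equivalent alternative, not as what the notebook uses.

```python
import numpy as np

# Stand-in for a batch of 5 grayscale images (zeros instead of real pixels)
toy_batch = np.zeros((5, 28, 28))

# -1 lets NumPy infer the first dimension from the total number of elements
toy_reshaped = toy_batch.reshape(-1, 28, 28, 1)
print(toy_reshaped.shape)  # (5, 28, 28, 1)

# Equivalent alternative: append a channel axis without spelling out the sizes
toy_expanded = np.expand_dims(toy_batch, axis=-1)
print(toy_expanded.shape)  # (5, 28, 28, 1)
```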


## Training model

In [None]:
# Build a simple CNN model with softmax output

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                    # Input: grayscale image
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu'),  # Conv layer to extract patterns
    tf.keras.layers.MaxPooling2D(2, 2),                     # Downsample by 2
    tf.keras.layers.Flatten(),                              # Flatten 2D → 1D
    tf.keras.layers.Dense(64, activation='relu'),           # Hidden layer
    tf.keras.layers.Dense(10, activation='softmax')         # Output: 10 class probabilities
])

In [None]:
# Choose the optimizer, loss function, and metric:
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # multi-class classification with integer labels (0–9)
    metrics=['accuracy']
)
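
For integer labels (0–9) and a softmax output, a common Keras loss for multi-class classification is `sparse_categorical_crossentropy`. The sketch below computes the same quantity by hand with NumPy on made-up probabilities, to show what the number means; the helper function is written for illustration and is not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean negative log-probability assigned to the true class."""
    rows = np.arange(len(y_true))
    true_class_probs = y_prob[rows, y_true]
    return float(np.mean(-np.log(true_class_probs)))

# Made-up softmax outputs for 2 images and 3 classes (each row sums to 1)
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])
toy_labels = np.array([0, 2])        # integer labels, one per image

loss = sparse_categorical_crossentropy(toy_labels, toy_probs)
print(f"Loss when the true classes get high probability: {loss:.4f}")

# Same predictions, but now the true classes are the low-probability ones
wrong_labels = np.array([2, 0])
worse_loss = sparse_categorical_crossentropy(wrong_labels, toy_probs)
print(f"Loss when the true classes get low probability:  {worse_loss:.4f}")
```

The loss shrinks toward 0 as the model puts more probability on the correct class, which is what training tries to achieve.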

In [None]:
# Train the model on all 60,000 training images (no validation split)
history = model.fit(images, labels, epochs=3)

**Learning Progress Over Epochs**

| Epoch | What Happens                            | Accuracy    |
|-------|-----------------------------------------|-------------|
| 1️⃣    | Model starts with random weights        | Low         |
| 2️⃣    | Learns basic patterns                   | Higher      |
| 3️⃣    | Learns finer patterns, reduces mistakes | Even higher |


In [None]:
# Plot training history
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training Accuracy')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Accuracy after each epoch
print(history.history['accuracy'])

# Final accuracy
final_acc = history.history['accuracy'][-1]
print(f"Final Training Accuracy: {final_acc:.2f}")

In [None]:
probabilities = model.predict(images)

# Convert to class labels
y_pred = np.argmax(probabilities, axis=1)

# True labels
y_true = labels  # still integers 0–9

cm = confusion_matrix(y_true, y_pred)
cm

The confusion matrix suggests that **label 6 (Shirt)** is the hardest class to predict.
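
One way to read a confusion matrix is per-class recall: the diagonal (correct predictions) divided by the row totals (true counts). The sketch below uses a made-up 3-class matrix, not the notebook's actual output, and follows scikit-learn's convention of rows for true classes and columns for predicted classes.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = true class, columns = predicted class)
toy_cm = np.array([[90,  5,  5],
                   [ 4, 95,  1],
                   [20, 10, 70]])

# Recall per class: correct predictions divided by the number of true examples
per_class_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)
hardest_class = int(np.argmin(per_class_recall))

for cls, recall in enumerate(per_class_recall):
    print(f"Class {cls}: recall = {recall:.2f}")
print(f"Hardest class: {hardest_class}")
```

Applied to the notebook's matrix, the same calculation would be `cm.diagonal() / cm.sum(axis=1)`.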

In [None]:
# Find the first index where label == 6 ('Shirt')
shirt_index = (labels == 6).nonzero()[0][0]

# Get the image
shirt_image = images[shirt_index]

# Plot the image
plt.imshow(shirt_image, cmap='gray')
plt.title('Label: 6 (Shirt)')
plt.axis('off')
plt.show()

## Predict New Images

In [None]:
# This is the second image in the dataset. It's a T-shirt/top and its label is 0

sample = images[1]  # already normalized, shape: (28, 28, 1)

plt.imshow(sample, cmap='gray')
plt.title("A sample image. Predict me!")
plt.show()

In [None]:
sample = sample.reshape(1, 28, 28, 1)

In [None]:
probabilities = model.predict(sample)
predicted_class = tf.argmax(probabilities, axis=1).numpy()[0]
print(f"Predicted class: {predicted_class}")

# Label 0 is a T-shirt/top
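
To report a readable class name instead of a bare integer, the predicted index can be looked up in the label mapping. This sketch uses a made-up probability vector in place of `model.predict` output, so the numbers are illustrative only.

```python
import numpy as np

label_map = {0: 'T-shirt/top', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat',
             5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle boot'}

# Made-up softmax output for one image, shape (1, 10); not real model output
toy_probabilities = np.array([[0.80, 0.01, 0.02, 0.03, 0.01,
                               0.00, 0.12, 0.00, 0.01, 0.00]])

# Index of the largest probability is the predicted class
toy_class = int(np.argmax(toy_probabilities, axis=1)[0])
toy_confidence = float(toy_probabilities[0, toy_class])

print(f"Predicted class: {toy_class} ({label_map[toy_class]}), "
      f"probability: {toy_confidence:.2f}")
```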

Let's try one more.

In [None]:
# Label 6 - Shirt
print(shirt_index)

In [None]:
sample = images[18]

plt.imshow(sample, cmap='gray')
plt.title("A sample image. Predict me!")
plt.show()

In [None]:
sample = sample.reshape(1, 28, 28, 1)

In [None]:
probabilities = model.predict(sample)
predicted_class = tf.argmax(probabilities, axis=1).numpy()[0]
print(f"Predicted class: {predicted_class}")

# Label 6 is a Shirt

Conclusion: Our image recognition model works well on these training images :)