# Convolutional Layers Assignment
Full explanations, experiments, and architectural reasoning.
Dataset: **CIFAR-10**


## 1. Introduction and Motivation

### 1.1 Assignment Context
This assignment explores convolutional neural networks (CNNs) not as black boxes, but as architectural components whose design choices directly impact learning efficiency, generalization, and interpretability.

**Key Questions:**
- Why do convolutional layers work better than fully connected layers for image data?
- How do architectural decisions (kernel size, depth, stride) affect performance?
- What inductive biases do convolutional layers introduce?

### 1.2 Dataset Justification: CIFAR-10

**CIFAR-10** is chosen for the following reasons:

* **Spatial structure**: Natural images with strong local correlations
* **Translation invariance**: Objects can appear at different positions
* **Hierarchical features**: Low-level edges → mid-level textures → high-level objects
* **Computational feasibility**: 32×32 images allow rapid experimentation
* **Sufficient complexity**: 10 classes, diverse visual patterns

These properties make CIFAR-10 ideal for demonstrating the convolutional inductive bias.

## 2. Dataset Description and Exploratory Data Analysis (EDA)
### 2.1 Dataset Selection

The **CIFAR-10** dataset is used for this assignment. CIFAR-10 consists of small natural images distributed across multiple object categories, making it highly suitable for evaluating **convolutional inductive bias**.

**Key properties:**

* **Number of images:** 60,000
* **Train / test split:** 50,000 / 10,000
* **Image resolution:** 32 × 32
* **Channels:** 3 (RGB)
* **Number of classes:** 10
* **Class distribution:** Balanced

This dataset is appropriate for convolutional layers because the images exhibit strong **local spatial correlations** and **translation invariance**, which convolution explicitly exploits through **local receptive fields** and **weight sharing**.

### 2.2 Loading the Dataset

#### Development Environment Setup

The project runs in a dedicated virtual environment (`cnn-env`) with the following setup:

* **Python Version:** **3.10**, chosen for its stability and compatibility with the project dependencies.
* **Main Libraries:** **TensorFlow** (through its Keras API) for implementing and training the convolutional neural networks (CNNs), plus NumPy, Matplotlib, and pandas for array handling, plotting, and result tables.
* **Environment Management:** `cnn-env` isolates dependencies, which prevents version conflicts and makes the experiments easier to reproduce.

In [None]:
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

### 2.3 Basic Dataset Inspection

In [None]:
print("Training shape:", x_train.shape)
print("Test shape:", x_test.shape)
print("Pixel range:", x_train.min(), x_train.max())

### 2.4 Sample Visualization

The purpose of visualization is not statistical analysis, but understanding the structure and variability of input tensors.

In [None]:
class_names = ['airplane','automobile','bird','cat','deer','dog','frog','horse','ship','truck']

plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.imshow(x_train[i])
    plt.title(class_names[y_train[i][0]])
    plt.axis('off')
plt.tight_layout()
plt.show()

### 2.5 Preprocessing

Pixel values are normalized to the [0, 1] range. Labels are **one-hot encoded** for categorical classification.

In [None]:
# Normalize pixel values
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# One-hot encode labels
y_train_cat = tf.keras.utils.to_categorical(y_train, 10)
y_test_cat = tf.keras.utils.to_categorical(y_test, 10)
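
As a rough illustration of what the two preprocessing steps do to a single sample, the sketch below reimplements them in plain Python. The helper names `scale_pixel` and `one_hot` are hypothetical and are not part of Keras; the notebook itself relies on `astype('float32') / 255.0` and `to_categorical`.

```python
def scale_pixel(value):
    """Map an 8-bit intensity (0..255) to the [0, 1] range."""
    return value / 255.0


def one_hot(label, num_classes=10):
    """Return a vector with 1.0 at index `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


print(scale_pixel(0), scale_pixel(255))  # 0.0 1.0
print(one_hot(3))  # index 3 corresponds to 'cat' in class_names
```

The one-hot form matches the `categorical_crossentropy` loss used in the models below, which compares a probability vector against a target vector of the same length.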

## 3. Baseline Model: Fully Connected Network (FCN)

### 3.1 Motivation
As a baseline, a fully connected neural network is trained on the same dataset. This model ignores spatial structure by flattening the image, treating each pixel as an independent feature. This baseline highlights the limitations of non-convolutional architectures for image data.

### 3.2 Architecture Definition

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout

fc_model = Sequential([
    Flatten(input_shape=(32,32,3)),
    Dense(512, activation='relu'),
    Dropout(0.2),
    Dense(256, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax')
])

fc_model.compile(optimizer='adam',
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])

fc_model.summary()

### 3.3 Training and Evaluation

This section puts into practice the fundamental concepts covered in class, specifically the training configuration through **batches** and **epochs**:

* **Batch Size:** The number of samples the network processes before it updates its weights. Training on mini-batches keeps memory usage bounded and gives more frequent updates than processing the full dataset at once.
* **Epochs:** The number of complete passes the learning algorithm makes over the training dataset.

Both settings affect convergence: the batch size controls how noisy each gradient estimate is, and training for too many epochs can lead to overfitting. The 10% validation split used below makes overfitting visible as a gap between training and validation metrics.
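
To make the two terms concrete, the arithmetic below works out how many weight updates the settings used in this notebook's training cells imply (`validation_split=0.1`, `batch_size=64`, `epochs=10` on the 50,000 CIFAR-10 training images). It is a back-of-the-envelope sketch; the variable names are illustrative only.

```python
import math

n_train_total = 50_000      # CIFAR-10 training images
validation_split = 0.1      # fraction held out for validation
batch_size = 64
epochs = 10

n_val = round(n_train_total * validation_split)   # samples held out
n_fit = n_train_total - n_val                     # samples used for updates
steps_per_epoch = math.ceil(n_fit / batch_size)   # last batch may be partial
last_batch = n_fit - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(n_fit, n_val)          # 45000 5000
print(steps_per_epoch)       # 704
print(last_batch)            # 8
print(total_updates)         # 7040
```

Each epoch therefore performs 704 weight updates, the last of them on a partial batch of 8 samples.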

In [None]:
history_fc = fc_model.fit(
    x_train, y_train_cat,
    validation_split=0.1,
    epochs=10,
    batch_size=64,
    verbose=2
)

fc_test_loss, fc_test_acc = fc_model.evaluate(x_test, y_test_cat, verbose=0)
print("FC Test Accuracy:", fc_test_acc)

### 3.4 Observed Limitations

Despite having about 1.7 million parameters, the fully connected model exhibits limited generalization performance. Flattening discards the 2D arrangement of the pixels, so the network has no built-in notion of neighborhood and must relearn the same pattern separately for every image location.
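
The parameter count of the baseline can be checked by hand from the layer sizes in Section 3.2 (`Flatten` and `Dropout` add no parameters). The helper `dense_params` below is a hypothetical name used only for this sketch.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Flatten(32*32*3) -> Dense(512) -> Dense(256) -> Dense(10)
layer_sizes = [32 * 32 * 3, 512, 256, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [1573376, 131328, 2570]
print(total)      # 1707274
```

The first layer alone accounts for roughly 92% of the total, because every one of the 3,072 input values gets its own weight for each of the 512 units.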

## 4. Convolutional Neural Network (CNN) Design

### 4.1 Architectural Justification
The CNN is designed under the following fundamental principles:

* **Local receptive fields:** To capture spatial correlations in the image
* **Weight sharing:** To reduce total number of parameters
* **Increasing depth:** To build hierarchical feature representations
* **Pooling:** To add a degree of invariance to small translations and to reduce spatial resolution

The architecture is intentionally shallow to prioritize structural reasoning over brute depth.
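
The pooling principle can be illustrated with a toy feature map. The sketch below is a plain-Python stand-in for a 2×2 max pooling with stride 2 (it is not the Keras implementation, and `max_pool_2x2` and `single_activation` are hypothetical helper names). It shows that a one-pixel shift inside a pooling window leaves the output unchanged, while a shift across a window boundary does not.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a list-of-lists feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def single_activation(row, col, size=4):
    """A size x size map that is zero except for a 1 at (row, col)."""
    grid = [[0] * size for _ in range(size)]
    grid[row][col] = 1
    return grid


at_origin = max_pool_2x2(single_activation(0, 0))
shifted_inside = max_pool_2x2(single_activation(1, 1))   # same 2x2 window
shifted_across = max_pool_2x2(single_activation(1, 2))   # neighboring window

print(at_origin)       # [[1, 0], [0, 0]]
print(shifted_inside)  # [[1, 0], [0, 0]]
print(shifted_across)  # [[0, 1], [0, 0]]
```

The invariance is therefore local and approximate: it holds only for shifts that stay within one pooling window.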

### 4.2 CNN Architecture

In [None]:
from tensorflow.keras.layers import Conv2D, MaxPooling2D

cnn_model = Sequential([
    Conv2D(32, (3,3), padding='same', activation='relu', input_shape=(32,32,3)),
    MaxPooling2D((2,2)),
    Conv2D(64, (3,3), padding='same', activation='relu'),
    MaxPooling2D((2,2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax')
])

cnn_model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

cnn_model.summary()
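
The figures reported by `cnn_model.summary()` can be cross-checked by hand. The sketch below assumes what the architecture above specifies: `padding='same'` keeps the spatial size, each 2×2 pooling halves it (32 → 16 → 8), and pooling, `Flatten`, and `Dropout` add no parameters. The function names are hypothetical helpers for this sketch.

```python
def conv2d_params(kernel_size, c_in, c_out):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel_size * kernel_size * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def cnn_param_count(kernel_size):
    """Return (flattened feature size, total parameters) for the CNN above."""
    size, channels, total = 32, 3, 0
    for filters in (32, 64):
        total += conv2d_params(kernel_size, channels, filters)
        channels = filters
        size //= 2                      # 2x2 max pooling halves height and width
    flat = size * size * channels       # input size of Dense(128)
    total += dense_params(flat, 128) + dense_params(128, 10)
    return flat, total


print(cnn_param_count(3))  # (4096, 545098)
```

Of these 545,098 parameters, only 19,392 belong to the two convolutional layers; the `Dense(128)` layer after `Flatten` contributes 524,416.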

### 4.3 Training and Evaluation

In [None]:
history_cnn = cnn_model.fit(
    x_train, y_train_cat,
    validation_split=0.1,
    epochs=10,
    batch_size=64,
    verbose=2
)

cnn_test_loss, cnn_test_acc = cnn_model.evaluate(x_test, y_test_cat, verbose=0)
print("CNN Test Accuracy:", cnn_test_acc)

## 5. Controlled Experiment: Effect of Kernel Size

### 5.1 Experimental Setup
We vary the kernel size of the convolutional layers while keeping all other architectural and training parameters fixed. This allows us to isolate the effect of receptive field size on learning.

### 5.2 Alternative Model (5×5 Kernels)

In [None]:
cnn_5x5 = Sequential([
    Conv2D(32, (5,5), padding='same', activation='relu', input_shape=(32,32,3)),
    MaxPooling2D((2,2)),
    Conv2D(64, (5,5), padding='same', activation='relu'),
    MaxPooling2D((2,2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax')
])

cnn_5x5.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

cnn_5x5.summary()

### 5.3 Results and Comparison

In [None]:
history_5x5 = cnn_5x5.fit(
    x_train, y_train_cat,
    validation_split=0.1,
    epochs=10,
    batch_size=64,
    verbose=2
)

loss_5x5, acc_5x5 = cnn_5x5.evaluate(x_test, y_test_cat, verbose=0)
print("5x5 CNN Test Accuracy:", acc_5x5)

### 5.4 Quantitative Comparison

In [None]:
import pandas as pd

results = pd.DataFrame({
    'Model': ['Fully Connected', 'CNN 3x3', 'CNN 5x5'],
    'Parameters': [
        fc_model.count_params(),
        cnn_model.count_params(),
        cnn_5x5.count_params()
    ],
    'Test Accuracy': [fc_test_acc, cnn_test_acc, acc_5x5]
})

print(results.to_string(index=False))

### 5.5 Training Curves Visualization

In [None]:
plt.figure(figsize=(14,5))

# Accuracy
plt.subplot(1,2,1)
plt.plot(history_fc.history['accuracy'], label='FC Train', linestyle='--')
plt.plot(history_fc.history['val_accuracy'], label='FC Val')
plt.plot(history_cnn.history['accuracy'], label='CNN 3x3 Train', linestyle='--')
plt.plot(history_cnn.history['val_accuracy'], label='CNN 3x3 Val')
plt.plot(history_5x5.history['accuracy'], label='CNN 5x5 Train', linestyle='--')
plt.plot(history_5x5.history['val_accuracy'], label='CNN 5x5 Val')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.legend()
plt.grid(True, alpha=0.3)

# Loss
plt.subplot(1,2,2)
plt.plot(history_fc.history['loss'], label='FC Train', linestyle='--')
plt.plot(history_fc.history['val_loss'], label='FC Val')
plt.plot(history_cnn.history['loss'], label='CNN 3x3 Train', linestyle='--')
plt.plot(history_cnn.history['val_loss'], label='CNN 3x3 Val')
plt.plot(history_5x5.history['loss'], label='CNN 5x5 Train', linestyle='--')
plt.plot(history_5x5.history['val_loss'], label='CNN 5x5 Val')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Model Loss Comparison')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Interpretation and Architectural Reasoning

### 6.1 Why Convolutional Layers Outperform Fully Connected Networks

**Observed Results:**
- The CNN models achieve significantly higher accuracy than the fully connected baseline
- This is accomplished with fewer parameters (roughly 0.55M for the 3×3 CNN versus 1.7M for the fully connected baseline)
- Training converges faster for convolutional models

**Fundamental Reasons:**

1. **Spatial Locality Preservation:**
   - Fully connected layers treat each pixel as a separate input with no notion of neighborhood, so spatial relationships are lost
   - Convolutional layers maintain spatial structure through local receptive fields
   - Adjacent pixels in images are highly correlated, and convolution exploits this

2. **Parameter Efficiency through Weight Sharing:**
   - A fully connected layer connecting the 32×32×3 input to 512 units requires ~1.57M parameters (3,072 × 512 weights + 512 biases)
   - A 3×3 convolutional layer with 32 filters on the same 3-channel input requires only 896 parameters (3 × 3 × 3 × 32 weights + 32 biases)
   - The same filter is applied across the entire image, so each feature is learned once

3. **Translation Invariance:**
   - Objects can appear at different positions in images
   - Fully connected networks must learn the same feature at every location independently
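   - Convolution is translation-equivariant: shifting the input shifts the feature map, so one filter detects a feature wherever it appears; pooling then adds tolerance to small shifts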
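
The position-independence argument can be checked on a toy example. The sketch below implements a plain-Python "valid" cross-correlation (the operation that convolutional layers compute) and applies a vertical-edge kernel to a small image and to a copy shifted one pixel to the right. The helper names and the toy image are assumptions made for illustration; nothing here uses the trained models.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def shift_right(grid, n=1):
    """Shift every row n positions to the right, filling with zeros."""
    return [[0] * n + row[:-n] for row in grid]


image = [[0, 0, 1, 1, 0, 0, 0, 0] for _ in range(5)]   # a vertical bar
kernel = [[-1, 0, 1]] * 3                              # vertical-edge detector

response = correlate2d_valid(image, kernel)
response_of_shifted = correlate2d_valid(shift_right(image), kernel)

print(response[0])             # [3, 3, -3, -3, 0, 0]
print(response_of_shifted[0])  # [0, 3, 3, -3, -3, 0]
print(response_of_shifted == shift_right(response))  # True
```

The same nine kernel weights respond to the bar at both positions, and the response simply moves with it. A fully connected layer has no such guarantee, because each input position has its own weights.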

### 6.2 Inductive Bias Introduced by Convolution

**Inductive bias** refers to the assumptions a learning algorithm makes about the structure of the problem.

**Convolutional Inductive Biases:**

1. **Locality:** Nearby pixels are more relevant than distant pixels
2. **Stationarity:** Patterns that appear in one part of the image are likely to appear elsewhere
3. **Compositionality:** Complex patterns are built hierarchically from simpler ones

These biases are **not learned from data**; they are **encoded in the architecture**. This is one reason convolutional networks typically need less data than fully connected networks to learn effectively on images.

### 6.3 Kernel Size Trade-offs (3×3 vs 5×5)

**Observations from Experiments:**
- The 3×3 model achieves comparable or better performance than the 5×5 model
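- The 3×3 model has fewer parameters (545,098 vs. 579,402); the gap is small because the `Dense(128)` layer after `Flatten` accounts for most parameters in both models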
- Larger kernels capture more context but are less parameter-efficient

**Analysis:**
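- Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution while using fewer weights (18C² vs. 25C² for C input and output channels)
- The stacked version also has more non-linearity (two ReLU activations instead of one)
- Modern architectures (VGG, ResNet) favor stacking small kernels over using large ones
- Note that the experiment in Section 5 swaps each 3×3 kernel for a 5×5 one rather than stacking layers, so it measures the effect of a larger per-layer receptive field, not this equivalence directly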
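
The receptive-field and parameter claims can be verified with the standard recurrence: each layer grows the field by (kernel size − 1) times the product of the strides before it. The sketch below is a minimal plain-Python version; `receptive_field` and `conv_weights` are hypothetical helper names, and the choice of 64 channels is only an example.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride) pairs ordered from input to output."""
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump   # growth scaled by accumulated stride
        jump *= stride
    return field


def conv_weights(kernel_size, channels, n_layers=1):
    """Weights plus biases of n_layers convs with `channels` in and out."""
    return n_layers * (kernel_size * kernel_size * channels * channels + channels)


# Two stacked 3x3 convolutions versus a single 5x5 convolution
print(receptive_field([(3, 1), (3, 1)]), receptive_field([(5, 1)]))  # 5 5
print(conv_weights(3, 64, n_layers=2), conv_weights(5, 64))          # 73856 102464

# Conv -> 2x2 pool -> Conv -> 2x2 pool, as in the two CNNs of this notebook
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)]))  # 10
print(receptive_field([(5, 1), (2, 2), (5, 1), (2, 2)]))  # 16
```

Under this formula, a unit after the second pooling layer sees a 10×10 input patch in the 3×3 model and a 16×16 patch in the 5×5 model.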

### 6.4 When Convolution is NOT Appropriate

**Problem Types Where Convolution Is a Poor Fit:**

1. **Tabular Data:**
   - Features in a spreadsheet (age, income, credit score) have no spatial relationship
   - Reordering columns doesn't change meaning
   - No translation invariance or locality

2. **Graph-Structured Data:**
   - Social networks, molecular structures
   - Relationships are defined by edges, not spatial proximity
   - Requires graph neural networks, not CNNs

3. **Sequential Data with Long-Range Dependencies:**
   - Language modeling, time series with distant correlations
   - A standard convolution has a limited receptive field per layer, so distant positions interact only after many layers
   - Transformers or RNNs are usually more suitable

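4. **Problems Requiring Global Context:**
   - Tasks where the answer depends on relationships between distant parts of the input (for example, long-range interactions in some board-game positions)
   - A shallow CNN only sees local neighborhoods; global context requires many stacked layers (the Go network in AlphaGo Zero, for instance, is a deep residual CNN)
   - Attention mechanisms or graph networks model such relationships more directly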

**Key Insight:** Convolution works when:
- Data has a grid structure
- Local patterns are important
- Same patterns can appear at different positions

When these assumptions don't hold, other architectures are more appropriate.
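
The tabular-data case can be made concrete with a small sketch. A dense unit can absorb any reordering of the columns by reordering its weights, whereas a convolution depends on which columns happen to be adjacent. The feature values, weights, and helper names below are made up purely for illustration.

```python
def dense_unit(x, w):
    """One fully connected unit without bias: a weighted sum of all inputs."""
    return sum(xi * wi for xi, wi in zip(x, w))


def conv1d_valid(x, kernel):
    """Slide a shared kernel over neighboring entries of x (no padding)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def permute(values, order):
    """Reorder `values` so that position p holds values[order[p]]."""
    return [values[i] for i in order]


row = [35, 52, 7, 3]        # hypothetical tabular features
weights = [1, -2, 5, 3]     # hypothetical dense weights
order = [2, 0, 3, 1]        # an arbitrary column reordering

# Dense: permuting inputs and weights together gives the same output.
print(dense_unit(row, weights), dense_unit(permute(row, order), permute(weights, order)))

# Convolution: the same kernel now sees different neighbor pairs.
print(conv1d_valid(row, [1, -1]))
print(conv1d_valid(permute(row, order), [1, -1]))
```

Since column order in a table is arbitrary, the neighbor relationships a convolution relies on carry no meaning there.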

## 7. Bonus: Filter Visualization

### 7.1 First Layer Filters
Visualizing what the convolutional filters learn in the first layer.

In [None]:
# Get first convolutional layer weights
first_layer = cnn_model.layers[0]
filters, biases = first_layer.get_weights()

# Normalize filter values to 0-1 for visualization
f_min, f_max = filters.min(), filters.max()
filters_normalized = (filters - f_min) / (f_max - f_min)

# Plot first 16 filters
n_filters = min(16, filters.shape[3])
plt.figure(figsize=(12,3))
for i in range(n_filters):
    plt.subplot(2, 8, i+1)
    plt.imshow(filters_normalized[:,:,:,i])
    plt.axis('off')
plt.suptitle('Learned 3x3 Filters (First Layer)', fontsize=14)
plt.tight_layout()
plt.show()

### 7.2 Feature Map Visualization
Showing how the network transforms an input image through convolutional layers.

In [None]:
# Create model that outputs intermediate layers
from tensorflow.keras.models import Model

layer_outputs = [layer.output for layer in cnn_model.layers[:4]]  # First 4 layers
activation_model = Model(inputs=cnn_model.input, outputs=layer_outputs)

# Get activations for a sample image
sample_image = x_test[0:1]  # Take first test image
activations = activation_model.predict(sample_image, verbose=0)

# Visualize
layer_names = ['Conv2D 32 filters', 'MaxPooling2D', 'Conv2D 64 filters', 'MaxPooling2D']

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

# Original image
axes[0].imshow(x_test[0])
axes[0].set_title('Original Image')
axes[0].axis('off')

# Show first feature map from each layer
for i, (activation, name) in enumerate(zip(activations, layer_names)):
    if len(activation.shape) == 4:  # 4D outputs: both conv and pooling layers
        axes[i+1].imshow(activation[0, :, :, 0], cmap='viridis')
        axes[i+1].set_title(f'{name}\nFeature Map 0')
        axes[i+1].axis('off')

# Show multiple feature maps from first conv layer
for idx in range(3):
    axes[5+idx].imshow(activations[0][0, :, :, idx*8], cmap='viridis')
    axes[5+idx].set_title(f'Conv1 Filter {idx*8}')
    axes[5+idx].axis('off')

plt.tight_layout()
plt.show()

## 8. Conclusions

### 8.1 Key Findings

1. **Convolutional Superiority:** Convolutional models outperform fully connected baselines on CIFAR-10 while using significantly fewer parameters

2. **Kernel Efficiency:** Smaller kernels (3×3) achieve better or comparable performance to larger kernels (5×5) with fewer parameters

3. **Inductive Bias:** Convolution encodes assumptions about spatial locality and translation invariance, which are valid for image data

4. **Limitations:** Convolution is not appropriate for data lacking spatial structure or translation invariance

### 8.2 Architectural Insights

This assignment demonstrates that **architectural choices encode prior knowledge** about the problem structure. Convolutional layers are not merely a technique, but a principled way to incorporate domain knowledge (spatial locality, stationarity) into neural network design.

The superior performance of CNNs is not due to "more advanced" mathematics, but due to **better alignment between architecture and data structure**.

### 8.3 Future Directions

Potential extensions:
- Data augmentation (rotation, flipping, color jitter)
- Deeper architectures (ResNet blocks)
- Batch normalization
- Different pooling strategies (average pooling, learned pooling)
- Attention mechanisms on top of convolutional features