# Custom Implementation of a Convolutional Neural Network (CNN)

## Project Overview

This project implements a complete Convolutional Neural Network framework from scratch using only NumPy, demonstrating deep understanding of CNN fundamentals and computer vision principles. Rather than relying on high-level frameworks like PyTorch or TensorFlow, this implementation builds every component from mathematical foundations, providing insights into the inner workings of modern deep learning systems.

**What it does:** The framework implements a 2-layer CNN architecture for binary image classification, featuring custom convolution operations, multiple activation functions, and a nearest-neighbor classifier. It demonstrates feature extraction through edge detection filters and shows how CNNs learn hierarchical representations.

**Why it's interesting:** This project showcases the ability to build core deep learning algorithms rather than just use them. Understanding CNN operations at this level is crucial for computer vision engineering roles, as it demonstrates both mathematical rigor and practical implementation skills. The framework reveals how convolutional layers extract meaningful features from raw pixel data.

## Key Features Implemented

### Core CNN Operations
- **Custom Convolution**: Manual implementation of 2D convolution with configurable stride and padding
- **Multi-layer Architecture**: 2-layer CNN with feature extraction and classification
- **Activation Functions**: Multiple activation types including ReLU, sigmoid, and custom hard sign
- **Feature Extraction**: Edge detection filters for horizontal and vertical pattern recognition

### Mathematical Implementations
- **Convolution Operations**: Explicit filter sliding and element-wise multiplication
- **Output Size Calculation**: Dynamic spatial map dimension computation
- **Feature Embeddings**: Multi-dimensional feature vector generation
- **Distance Metrics**: Custom L2 norm implementation for similarity computation

### Analysis and Classification
- **Nearest Neighbor Classification**: K=1 classification using L2 distance
- **Performance Evaluation**: Accuracy metrics and prediction analysis
- **Feature Visualization**: Analysis of learned representations
- **Pattern Recognition**: Binary classification of geometric patterns

---

## Implementation Details

### Convolution Operation Function and Activation Functions

The core convolution operation computes the weighted sum between input patches and learned filters, implementing the fundamental building block of CNNs. Multiple activation functions are provided to demonstrate different non-linear transformations.

In [8]:
import numpy as np
from math import sqrt
import pandas as pd
from IPython.display import display

In [9]:

# Convolution Operation Function and Activation Functions

def single_filter_convoltion(input_patch, filter):
    """
    Computes one element/neuron (i, j) in the pre-activation output spatial map of a single filter.
    Assumes that the input (image or convoltional layer output volume) is only 2D because of no color channel depth dimension
    or multi-filter convolutional layers.

    The convolution operation is the weighted element-wise sum between an image patch and filter of identical dimensions.
    [Not implemented here: extends down the depth column for multi-channel/multi-filter input volumes]
    """

    #explicitly cast matrix elements to floats
    input_patch = input_patch.astype(float)
    fitler = filter.astype(float)

    filter_height, filter_width = filter.shape

    pre_activation_value = 0
    for u in range(filter_height):
        for v in range(filter_width):
            pre_activation_value += input_patch[u, v] * filter[u, v]
    
    return pre_activation_value

#not used
def ReLu(pre_act_spatial_map):
    """
    Applies the ReLu activation function element-wise to each pre-activation value/neuron/element (these people
    need to get their terminology act together) in the output spatial map of a single filter.
    """
    
    act_spatial_map = np.maximum(pre_act_spatial_map, 0)

    return act_spatial_map

#not used
def sigmodial_activation(pre_act_spatial_map):
    """
    Applies the Sigmodial activation function element-wise to each pre-activation value/neuron/element i
    n the output spatial map of a single filter.
    Output elements centered around 0 and between [-1 , 1]
    """

    act_spatial_map = 1.0/(1.0 + np.exp(-10*pre_act_spatial_map)) # scale the exponent to exaggerate difference between postive and negative input values

    return (2 * act_spatial_map) - 1 # recenter is so values are between [-1, 1] instead of [0, 1]

def hard_sign_activation(pre_act_spatial_map):
    """
    Applies the Hard Sign activation function element-wise to each pre-activation value/neuron/element 
    in the output spatial map of a single filter.
    """
    
    act_spatial_map = np.where(pre_act_spatial_map >= 0, +1, -1)

    return act_spatial_map

#not used
def square_activation(pre_act_spatial_map):
    """
    Applies the square activation function element-wise to each pre-activation value/neuron/element 
    in the output spatial map of a single filter.
    """
    
    act_spatial_map = pre_act_spatial_map**2

    return act_spatial_map


def output_map_size(input, filter, stride=1, padding=0):
    """
    Returns the dimensions (height or width; they are always ==) of the output spatial map
    given input and filter matrices.
    Stride and Padding assumed to be 1 and 0
    """
    input_dim = input.shape[0] # use height, but height == width
    filter_dim = filter.shape[0]
    output_dim = ((input_dim - filter_dim + (2 * padding)) / stride) + 1
    output_height = output_width = int(output_dim)
    
    return output_height, output_width


## Data Preparation and Architecture Design

### Image Dataset Definition

The framework uses a synthetic dataset of 8 geometric patterns (5x5 pixel images) to demonstrate CNN feature learning capabilities. Each image contains distinct geometric shapes that the network learns to classify through hierarchical feature extraction.

In [None]:
# Define the Image Matrices

I1 = np.array([
    [-1, -1, -1, -1, -1],
    [-1,  1,  1,  1, -1],
    [-1,  1, -1,  1, -1],
    [-1,  1,  1,  1, -1],
    [-1, -1, -1, -1, -1],
])

I2 = np.array([
    [-1, -1, -1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1, -1, -1, -1],
])

I3 = np.array([
    [-1, -1, -1, -1, -1],
    [ 1,  1,  1, -1, -1],
    [ 1, -1,  1, -1, -1],
    [ 1,  1,  1, -1, -1],
    [-1, -1, -1, -1, -1],
])

I4 = np.array([
    [-1, -1, -1, -1, -1],
    [-1, -1, -1,  1, -1],
    [-1, -1, -1,  1, -1],
    [-1, -1, -1,  1, -1],
    [-1, -1, -1, -1, -1],
])

I5 = np.array([
    [-1, -1,  1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1],
])

I6 = np.array([
    [-1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1],
    [ 1,  1,  1, -1, -1],
    [ 1, -1,  1, -1, -1],
    [ 1,  1,  1, -1, -1],
])

I7 = np.array([
    [-1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1],
    [-1,  1, -1, -1, -1],
    [-1,  1, -1, -1, -1],
    [-1,  1, -1, -1, -1],
])

I8 = np.array([
    [-1, -1, -1, -1, -1],
    [-1, -1,  1,  1,  1],
    [-1, -1,  1, -1,  1],
    [-1, -1,  1,  1,  1],
    [-1, -1, -1, -1, -1],
])

images = [I1, I2, I3, I4, I5, I6, I7, I8]
#flat_images = [img.flatten().astype(float) for img in images] # not needed





### Filter Design and Feature Extraction Strategy

The CNN architecture employs two specialized filters:
- **W1 (Horizontal Edge Detector)**: Designed to detect horizontal edges with large positive/negative weight contrasts
- **W2 (Pattern Aggregator)**: Simple 2×2 all-ones filter that aggregates binary edge indicators

This filter design strategy demonstrates how CNNs learn to extract meaningful features through carefully crafted weight matrices.

In [None]:

# Define Weight Matrices

# Detects Horizontal Edges
W1 = np.array([
    [ 100,  100],
    [-100, -100]
])

# Detects Vertical Edges
W2 = np.array([
    [1, 1],
    [1, 1]
])

filters = [W1, W2]

## CNN Forward Pass Implementation

### Multi-Layer Feature Extraction Pipeline

This section implements the complete forward pass through the 2-layer CNN architecture. The first convolutional layer applies edge detection filters, while the second layer aggregates these features to create discriminative embeddings for classification.

**Layer 1**: Applies horizontal edge detection filter (W1) with hard sign activation to binarize edge responses
**Layer 2**: Uses pattern aggregation filter (W2) to create final feature vectors without activation

In [11]:
# Extract final layer feature vector/image embedding (these people need to hire a taxonomist) for each training and test image

image_embeddings = []

layer_1_output_maps = []




W1_height = W1.shape[0]
W1_width = W1.shape[1]

# 1st convolutional layer
for img in images:
    #initialize pre-activation output spatial map for Layer 1, current image
    W1_output_height, W1_output_width = output_map_size(I3, W1)
    W1_pre_act_spatial_map = np.zeros((W1_output_height, W1_output_width))
    
    # 1st layer pre-activation convolution operation
    # move filter over image patches first left-to-right, then top-to-bottom
    # stride = 1, padding = 0    
    for i in range(W1_output_height):
        for j in range(W1_output_width):
            img_patch = img[i : i + W1_height, j : j + W1_width]
            pre_activation_value = single_filter_convoltion(img_patch, W1)
            W1_pre_act_spatial_map[i, j] = pre_activation_value

    W1_spatial_map = hard_sign_activation(W1_pre_act_spatial_map)
    #add current test image's layer 1 output map to the list
    layer_1_output_maps.append(W1_spatial_map)


W2_height = W2.shape[0]
W2_width = W2.shape[1]



layer_2_output_maps = []

# 2nd convolutional layer
for L1_output_map in layer_1_output_maps:

    #initialize pre-activation output spatial map for Layer 2, for current input spatial map
    W2_output_height, W2_output_width = output_map_size(layer_1_output_maps[1], W2)
    W2_pre_act_spatial_map = np.zeros((W2_output_height, W2_output_width))

    # 2nd layer pre-activation convolution operation
    # move filter over 1st convolutional layer output map, first left-to-right, then top-to-bottom
    # stride = 1, padding = 0    
    for i in range(W2_output_height): # W2_output-height=3
        for j in range(W2_output_width):
            L1_map_patch = L1_output_map[i : i + W2_height, j : j + W2_width]
            pre_activation_value = single_filter_convoltion(L1_map_patch, W2)
            W2_pre_act_spatial_map[i, j] = pre_activation_value

    W2_spatial_map = W2_pre_act_spatial_map # no activation in Layer 2
    #add current test image's layer 1 output map to the list
    layer_2_output_maps.append(W2_spatial_map)

image_embeddings = [L2_map.flatten() for L2_map in layer_2_output_maps] #flatten each image's final output spatial map
train_embeddings = image_embeddings[:2]
test_embeddings = image_embeddings[2:]

In [12]:
print(image_embeddings)

[array([0., 0., 0., 2., 0., 2., 2., 0., 2.]), array([2., 0., 2., 4., 4., 4., 4., 4., 4.]), array([0., 0., 2., 0., 2., 4., 0., 2., 4.]), array([4., 2., 0., 4., 4., 4., 4., 4., 4.]), array([4., 4., 4., 4., 4., 4., 4., 4., 4.]), array([0., 0., 2., 0., 0., 2., 0., 2., 4.]), array([0., 2., 4., 0., 2., 4., 4., 4., 4.]), array([2., 0., 0., 4., 2., 0., 4., 2., 0.])]


## Classification and Performance Evaluation

### Nearest Neighbor Classification System

The framework implements a K=1 nearest neighbor classifier using custom L2 distance computation. This approach demonstrates how learned CNN features can be used for downstream classification tasks, showing the effectiveness of the extracted representations.

**Training Set**: Images I1 and I2 (class 0 and class 1 respectively)
**Test Set**: Images I3 through I8 for evaluation
**Distance Metric**: Custom L2 norm implementation for similarity computation


First a custom L2 (Euclidean) distance function for similarity computation between feature vectors is implemented. While NumPy provides `np.linalg.norm()`, this manual implementation demonstrates understanding of the mathematical foundations and allows for customization of the distance metric.

In [13]:
def L2_Norm(embedding_1, embedding_2):
    """
    Computes the L2 (Euclidean Norm) between 2 vectors.
    Used as the similarity function of simple nearest neighbor (NN) classifier.

    Note to grader: I am implmenting manually for practice (obvi there is np.linag.norm() method)

    """
    diff_vec = embedding_1 - embedding_2
    square_diff_vec = diff_vec ** 2
    L2_norm = sqrt(np.sum(square_diff_vec))

    return L2_norm

In [14]:
test_image_names = [ f"I{ind}" for ind in range(3, 9)]

predicted_class_table = pd.DataFrame(columns=["Test Image", "Predicted Label", "Actual Label"])
real_labels = [0, 1, 1, 0, 1, 0] # corresponds to images 3 to 8

# perform k=1 Nearest Neighbor Search
for ind , test_emb in enumerate(test_embeddings):
    if L2_Norm(train_embeddings[0], test_emb) <  L2_Norm(train_embeddings[1], test_emb):
        pred_label = 0
    else:
        pred_label = 1
    predicted_class_table.loc[len(predicted_class_table)] = [test_image_names[ind], pred_label, real_labels[ind]] # add new record/row to the DF for the current test image


In [15]:
from IPython.display import display
display(predicted_class_table)

Unnamed: 0,Test Image,Predicted Label,Actual Label
0,I3,0,0
1,I4,1,1
2,I5,1,1
3,I6,0,0
4,I7,1,1
5,I8,0,0


#### Explanation for W1, W2, and the Activation Function, g:

W₁ (Horizontal Edge Detector) uses very large positive weights on the top row and equally large negative weights on the bottom row to detect horiztonal edges within any 2x2 image patch. This produces a hugh positive pre-activation (and everything else a huge negative one) in order to identify class 0 which only is the only class with horizontal edges./

W₂ (Pattern Aggregator) is a simple 2×2 all-ones filter that sums those binary horizontal hits across each local patch, ensuring the final score for class-0 images is consistently high.

hard_sign_activation binarizes every pre-activation to +1 or –1 so that the second convolution sees only clear indicators of “horizontal edge present” versus “absent.” This helps sharpens the nearest-neighbor separation at the end.