# Convolutional Neural Network (CNN) Tutorial

A typical CNN has the following structure:

1. **Input Layer**
2. **Convolution Layer + Activation Function**
3. **Pooling Layer**
4. **Fully Connected Layer (Classification)**

<img src="img/ex11.png">



## Input Layer

* The input layer receives **raw image data**.
* Images can be **grayscale (1 channel)** or **RGB (3 channels)**.
* Pixel values usually range from **0 to 255**. To stabilize training, we **normalize** pixel values:

$$
x_\text{normalized} = \frac{x}{255}, \quad x_\text{normalized} \in [0,1]
$$

* Example: Pixel value $128 \to 128/255 \approx 0.502$

* If the input is an RGB image of size $H \times W$, each channel is normalized separately.
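The scaling above can be sketched in a few lines of NumPy. The tiny `image` array is a made-up example, not data from this tutorial; it only illustrates that every channel is divided by the same constant.

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit pixel values, shape (H, W, 3)
image = np.array(
    [[[0, 128, 255], [64, 64, 64]],
     [[255, 0, 0], [32, 200, 16]]],
    dtype=np.uint8,
)

# Convert to float32 (a common dtype for training) and scale to [0, 1].
# The division is element-wise, so each channel is normalized separately.
normalized = image.astype(np.float32) / 255.0

print(normalized.shape)                       # (2, 2, 3)
print(normalized.min(), normalized.max())     # 0.0 1.0
print(round(float(normalized[0, 0, 1]), 3))   # 0.502
```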



## Convolution Layer

The **convolution layer** extracts **local features** like edges, corners, and textures.

* Let input image $I$ be of size $H \times W$ and filter (kernel) $K$ of size $k_h \times k_w$.
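* The **feature map** $S$ is calculated by sliding the kernel over the image. Deep learning libraries compute this as a **2-D cross-correlation** (the kernel is not flipped), which is conventionally still called convolution: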

$$
S[i,j] = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I[i+m, j+n]\, K[m,n]
$$

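* **Stride $s$:** Number of pixels the kernel moves at each step. A larger stride gives a smaller feature map: without padding, the output height is $\lfloor (H - k_h)/s \rfloor + 1$, and likewise for the width.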
* **Activation function:** Introduces **nonlinearity**. Example: ReLU

$$
f(x) = \max(0, x)
$$
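As a rough illustration of the sliding-window sum and ReLU above, here is a minimal NumPy sketch. The function names `conv2d_valid` and `relu`, the `stride` handling, and the 5×5 test image are assumptions made for this example. Real libraries use much faster implementations.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and sum the products."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the window
            out[i, j] = np.sum(image[r:r + k_h, c:c + k_w] * kernel)
    return out

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5x5 input
kernel = np.array([[1, 0, -1]] * 3, dtype=float)   # 3x3 vertical-edge filter

print(conv2d_valid(image, kernel, stride=1).shape)  # (3, 3)
print(conv2d_valid(image, kernel, stride=2).shape)  # (2, 2)

# Every window of this ramp image sums to -6, so ReLU clips the map to zeros
print(relu(conv2d_valid(image, kernel)))
```

The stride-2 call shows the feature map shrinking from 3×3 to 2×2 on the same input.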



### Numerical Example

* Input patch:

$$
I_\text{patch} = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 1 & 0 \\ 2 & 1 & 2 \end{bmatrix}
$$

* Filter (vertical edge detector):

$$
K = \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}
$$

Compute **convolution at top-left corner**:

$$
\begin{aligned}
S[0,0] &= 1\cdot 1 + 2\cdot 0 + 1\cdot(-1) \\
&+ 0\cdot 1 + 1\cdot 0 + 0\cdot(-1) \\
&+ 2\cdot 1 + 1\cdot 0 + 2\cdot(-1) \\
&= 1 - 1 + 0 + 2 - 2 = 0
\end{aligned}
$$

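* The result is 0 because the left and right columns of the patch are identical, so the vertical edge detector finds no horizontal change in intensity.
* A $3 \times 3$ kernel fits a $3 \times 3$ patch in exactly one position, so this patch yields a single output value. In general, without padding the **feature map is smaller** than the input.
* Increasing the stride reduces the size further.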
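The arithmetic in this example can be checked with NumPy. The second array, `edge_patch`, is a made-up patch that is not part of the example above; it is included only to show a non-zero response when a vertical edge is present.

```python
import numpy as np

patch = np.array([[1, 2, 1],
                  [0, 1, 0],
                  [2, 1, 2]])
K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])

# Element-wise product followed by a sum reproduces S[0,0]
s00 = int(np.sum(patch * K))
print(s00)  # 0

# Hypothetical patch: bright on the left, dark on the right
edge_patch = np.array([[1, 1, 0],
                       [1, 1, 0],
                       [1, 1, 0]])
edge_response = int(np.sum(edge_patch * K))
print(edge_response)  # 3
```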



## Pooling Layer

Pooling **reduces spatial dimensions** while preserving important information.

* Common types:

1. **Max Pooling:** $S_\text{pooled}[i,j] = \max S[\text{patch around }(i,j)]$
2. **Average Pooling:** $S_\text{pooled}[i,j] = \frac{1}{N} \sum S[\text{patch around }(i,j)]$, where $N$ is the number of elements in the patch

* Example: 2×2 max pooling, stride 2, on

$$
S = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}
$$

$$
\max(1,3,2,4) = 4
$$

* Pooling **reduces computation** in later layers and adds **translation invariance** for small shifts.

* Larger pooling windows reduce feature map size more aggressively.



### 2-D Max Pooling Example

* Input:

$$
S = \begin{bmatrix} 1 & 3 & 2 & 4 \\ 5 & 6 & 7 & 8 \\ 2 & 1 & 0 & 3 \\ 4 & 5 & 2 & 1 \end{bmatrix}
$$

* Pool size = 2, stride = 2

* Pool regions and max values:

$$
\begin{aligned}
\max\begin{bmatrix}1 & 3 \\ 5 & 6\end{bmatrix} &= 6 \\
\max\begin{bmatrix}2 & 4 \\ 7 & 8\end{bmatrix} &= 8 \\
\max\begin{bmatrix}2 & 1 \\ 4 & 5\end{bmatrix} &= 5 \\
\max\begin{bmatrix}0 & 3 \\ 2 & 1\end{bmatrix} &= 3
\end{aligned}
$$

* Resulting pooled map:

$$
S_\text{pooled} = \begin{bmatrix} 6 & 8 \\ 5 & 3 \end{bmatrix}
$$
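The pooled map above can be reproduced with a short NumPy sketch. The helper name `max_pool2d` and its loop-based form are assumptions chosen for readability, not how a framework implements pooling.

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Take the maximum of each pool x pool window, moving `stride` pixels."""
    out_h = (x.shape[0] - pool) // stride + 1
    out_w = (x.shape[1] - pool) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the window
            out[i, j] = x[r:r + pool, c:c + pool].max()
    return out

S = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [2, 1, 0, 3],
              [4, 5, 2, 1]])

pooled = max_pool2d(S, pool=2, stride=2)
print(pooled)
# [[6 8]
#  [5 3]]
```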



## Fully Connected Layer (FC)

After convolution + pooling, feature maps are **flattened** into a vector.

* Each neuron in the FC layer connects to **every input feature**:

$$
y = f\left(\sum_i w_i x_i + b\right)
$$

* Example: MNIST digit classification (10 classes)
* Final layer usually uses **softmax** to produce probabilities:

$$
P(y=i|x) = \frac{e^{z_i}}{\sum_{j=0}^{9} e^{z_j}}
$$

* Softmax ensures $0 \le P(y=i) \le 1$ and $\sum_i P(y=i) = 1$
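A minimal NumPy sketch of one fully connected layer followed by softmax is shown below. The feature vector `x`, the weights `W`, and the bias `b` are random placeholders assumed for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)   # avoids overflow in exp; the output is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
x = rng.random(16)              # hypothetical flattened feature vector
W = rng.normal(size=(10, 16))   # hypothetical weights, one row per class
b = np.zeros(10)                # hypothetical biases

z = W @ x + b                   # weighted sum for each of the 10 output neurons
probs = softmax(z)

print(probs.shape)              # (10,)
print(float(probs.sum()))       # sums to 1 up to floating-point rounding
print(int(np.argmax(probs)))    # index of the most likely class
```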



## Digit Recognizer Example (Concrete Pipeline)

1. Input: $28 \times 28$ grayscale image of a handwritten digit
2. **Convolution Layer:** Detect edges and local patterns

   * Example: $3 \times 3$ filters, stride 1, no padding $\rightarrow$ feature map $26 \times 26$
3. **Activation:** ReLU → apply $\max(0,x)$ elementwise
4. **Pooling Layer:** $2 \times 2$ max pooling, stride 2 → reduces map to $13 \times 13$
5. **Flatten:** Convert pooled feature maps to vector of length $13 \cdot 13 \cdot \text{num filters}$
6. **Fully Connected Layer:** Map vector to 10 neurons → softmax → predicted digit probabilities
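The sizes in this pipeline can be traced with plain Python. The helper `output_size` and the value `num_filters = 8` are assumptions for this sketch, since the pipeline above leaves the number of filters open.

```python
def output_size(size, window, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window along one axis."""
    return (size + 2 * padding - window) // stride + 1

num_filters = 8  # hypothetical; the pipeline does not fix this value

after_conv = output_size(28, window=3, stride=1)          # 3x3 conv, no padding
after_pool = output_size(after_conv, window=2, stride=2)  # 2x2 max pooling
flat_length = after_pool * after_pool * num_filters       # flattened vector

print(after_conv, after_pool, flat_length)  # 26 13 1352
```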


In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import os

warnings.filterwarnings('ignore')
print(os.listdir('input/'))

## Loading the Data Set

In [None]:
train = pd.read_csv("input/train.csv")
print(train.shape)
train.head()

In [None]:
test = pd.read_csv("input/test.csv")
print(test.shape)
test.head()

In [None]:
Y_train = train["label"]
X_train = train.drop(labels = ["label"], axis = 1)

In [None]:
Y_train.value_counts(dropna=False).reset_index()

In [None]:
plt.figure(figsize=(15,3))
plt.title("Number of digit classes")

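# Count samples per digit; sort_index() orders the bars from 0 to 9
Y_train_vc = Y_train.value_counts(dropna=False).sort_index()
print(Y_train_vc)

# Use the Series index and values directly, so the code does not depend
# on the column names that reset_index() produces in a given pandas version
sns.barplot(x=Y_train_vc.index, y=Y_train_vc.values, palette="icefire")
plt.xlabel("Digit")
plt.ylabel("Count")
plt.show()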

In [None]:
# plot some samples
img = X_train.iloc[0].to_numpy()
img = img.reshape((28,28))

plt.imshow(img,cmap='gray')
plt.title(train.iloc[0,0])
plt.axis("off")
plt.show()

In [None]:
# plot some samples
img = X_train.iloc[3].to_numpy()
img = img.reshape((28,28))

plt.imshow(img,cmap='gray')
plt.title(train.iloc[3,0])
plt.axis("off")
plt.show()

## Normalization, Reshape and Label Encoding

- Normalization
    - We scale the grayscale pixel values to [0, 1] to reduce the effect of illumination differences.
    - Normalized inputs also help the CNN converge faster.
- Reshape
    - Train and test images are 28 x 28 pixels.
    - We reshape all data to 28x28x1 3D matrices.
    - Keras needs an extra dimension at the end, which corresponds to the channels. Our images are grayscale, so they use only one channel.
- Label Encoding
    - Encode labels as one-hot vectors
        - 2 => [0,0,1,0,0,0,0,0,0,0]
        - 4 => [0,0,0,0,1,0,0,0,0,0]
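To make the reshape and one-hot steps concrete, here is a small NumPy sketch. The helper `one_hot` and the all-zero `flat_images` array are stand-ins assumed for illustration. The tutorial itself uses the Keras `to_categorical` utility for the encoding.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return one row per label with a single 1 at the label's index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=int)
    encoded[np.arange(labels.size), labels] = 1
    return encoded

print(one_hot([2, 4]))
# [[0 0 1 0 0 0 0 0 0 0]
#  [0 0 0 0 1 0 0 0 0 0]]

# Hypothetical batch of 3 flattened images with 784 pixels each
flat_images = np.zeros((3, 784))
images = flat_images.reshape(-1, 28, 28, 1)  # -1 lets NumPy infer the batch size
print(images.shape)  # (3, 28, 28, 1)
```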

In [None]:
# Normalize the data
X_train = X_train / 255.0
test = test / 255.0

print("x_train shape: ",X_train.shape)
print("test shape: ",test.shape)

In [None]:
# Reshape
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)

print("x_train shape: ",X_train.shape)
print("test shape: ",test.shape)

In [None]:
from tensorflow.keras.utils import to_categorical

Y_train = to_categorical(Y_train, num_classes=10)


### Train/Validation Split

In [None]:
# Split the train and the validation set for the fitting
from sklearn.model_selection import train_test_split


X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=2)

print("x_train shape",X_train.shape)
print("x_val shape",X_val.shape)
print("y_train shape",Y_train.shape)
print("y_val shape",Y_val.shape)

In [None]:
# Some examples
plt.imshow(X_train[2][:,:,0],cmap='gray')
plt.show()

## Implementing with Keras

In [None]:
from sklearn.metrics import confusion_matrix

# to_categorical was already imported in the label-encoding cell
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.optimizers import Adam


In [None]:
model = Sequential()
# Block 1: 8 filters of 5x5; 'same' padding keeps the 28x28 spatial size
model.add(Conv2D(filters = 8, kernel_size = (5,5),padding = 'same', 
                 activation ='relu', input_shape = (28,28,1)))
model.add(MaxPool2D(pool_size=(2,2)))  # 28x28 -> 14x14
model.add(Dropout(0.25))
# Block 2: 16 filters of 3x3
model.add(Conv2D(filters = 16, kernel_size = (3,3),padding = 'same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))  # 14x14 -> 7x7
model.add(Dropout(0.25))
# Classifier head
model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))  # one probability per digit


# Define the optimizer
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# Compile the model
model.compile(optimizer = optimizer ,loss = "categorical_crossentropy", metrics=["accuracy"])
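
As a sanity check on the architecture above, the number of trainable parameters can be worked out by hand. This sketch assumes that `same` padding preserves the spatial size and that each 2x2 pooling layer halves it (28 → 14 → 7). The helper names are made up for this example, and the total should agree with what `model.summary()` reports.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units

conv1 = conv_params(5, 5, in_channels=1, filters=8)    # 208
conv2 = conv_params(3, 3, in_channels=8, filters=16)   # 1168
flat = 7 * 7 * 16                                      # 784 values after Flatten
dense1 = dense_params(flat, 256)                       # 200960
dense2 = dense_params(256, 10)                         # 2570

total = conv1 + conv2 + dense1 + dense2
print(total)  # 204906
```

Pooling, dropout, and flatten layers have no trainable parameters, so they do not appear in the sum.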

In [None]:
epochs = 10  # increase the number of epochs for better results
batch_size = 250

### Data Augmentation
- To reduce overfitting, we artificially expand our handwritten digit dataset.
- We alter the training data with small transformations that reproduce natural variations in how digits are written.
- For example, the digit may not be centered, the scale may differ (some people write larger or smaller digits), or the image may be rotated.

<img src="img/ex10.png">
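To give a feel for one such transformation, here is a NumPy-only sketch of a random shift. The helper `shift_image`, the zero fill for vacated pixels, and the synthetic `img` blob are all assumptions for illustration. A library generator may fill the borders differently and combines several transformations.

```python
import numpy as np

def shift_image(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling vacated pixels with 0."""
    h, w = img.shape
    out = np.zeros_like(img)
    # Matching source and destination slices for the region that stays in frame
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    out[dst_rows, dst_cols] = img[src_rows, src_cols]
    return out

# Hypothetical 28x28 "digit": a bright rectangle on a dark background
img = np.zeros((28, 28))
img[10:18, 12:16] = 1.0

rng = np.random.default_rng(0)
max_shift = int(0.1 * 28)  # about 10% of the image size, i.e. 2 pixels
dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))

augmented = shift_image(img, dy, dx)
print(augmented.shape)              # (28, 28)
print(augmented.sum() == img.sum()) # True: the blob stays fully inside the frame
```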

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# data augmentation
datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening (decorrelates pixels)
        rotation_range=5,  # randomly rotate images by up to 5 degrees
        zoom_range=0.1,  # randomly zoom images by up to 10%
        width_shift_range=0.1,  # randomly shift images horizontally by up to 10% of the width
        height_shift_range=0.1,  # randomly shift images vertically by up to 10% of the height
        horizontal_flip=False,  # no horizontal flips (mirrored digits are not valid samples)
        vertical_flip=False)  # no vertical flips

datagen.fit(X_train)

## Fit the model

In [None]:
# Fit the model
history = model.fit(datagen.flow(X_train,Y_train, batch_size=batch_size),
                              epochs = epochs, validation_data = (X_val,Y_val), 
                              steps_per_epoch=X_train.shape[0] // batch_size)

## Evaluate the model

In [None]:
# Plot the validation loss curve
plt.plot(history.history['val_loss'], color='b', label="validation loss")
plt.title("Validation Loss")
plt.xlabel("Number of Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

In [None]:
Y_pred = model.predict(X_val)
Y_pred_classes = np.argmax(Y_pred, axis = 1) 
Y_true = np.argmax(Y_val,axis = 1) 
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 

f,ax = plt.subplots(figsize=(8, 8))
sns.heatmap(confusion_mtx, annot=True, linewidths=0.01,cmap="Greens",linecolor="gray", fmt= 'd',ax=ax)  # counts are integers
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()
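
A confusion matrix can also be summarized numerically. The sketch below uses a made-up 3-class matrix `cm` rather than results from this model, with rows as true labels and columns as predictions, the same layout `confusion_matrix` returns.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true labels, columns = predicted labels
cm = np.array([[50, 2, 3],
               [4, 45, 1],
               [0, 5, 40]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)      # per true class
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class

print(round(float(accuracy), 3))   # 0.9
print(np.round(recall, 3))
print(np.round(precision, 3))
```

Classes with low recall are the ones the model most often misses, and the off-diagonal cells show which digits they are confused with.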