### **Convolutional Neural Network (CNN)**
**CNN** is a specialized deep-learning **ANN** architecture designed for image processing, computer vision, and spatial data analysis. It is used for object detection, face recognition, self-driving cars, and similar tasks.

**CNN** = **Convolutional Network** + **Pooling** + **ANN**


A **CNN** is essentially an ANN with built-in feature extraction. We first **extract features** from the image, then **compress those features**, and finally send them to the ANN for processing.

**Why is an ANN alone not good enough for image processing?**
- Say you have a color image of 50 x 50 pixels.

- A color image has 3 channels: R, G, and B.

- Hence, the image is made of 50 x 50 x 3 = 7,500 values (2,500 pixels, each with 3 channel values).

- But pixel values alone do not identify the object in the image. **You need spatial features ALSO**, and a plain ANN flattens the image, discarding which pixels are neighbors.

- A 3-minute video captured with a 60 Hz camera gives you roughly 10,000 frames (3 x 60 x 60 = 10,800 images).

- You end up with about **10,000 x 7,500 = 75,000,000 values**, which is a huge amount of data to process for a 3-minute video.

- **_Hence, along with spatial features, you also need data compression or sample downsizing._**

- This explains why an ANN alone is not enough for image processing.
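
The arithmetic in the list above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the exact frame count (3 x 60 x 60 = 10,800) instead of the rounded 10,000, and the 128-neuron dense layer is a hypothetical size picked purely for illustration.

```python
# Size of one 50 x 50 RGB image, counted in individual values.
height, width, channels = 50, 50, 3
values_per_image = height * width * channels      # 7500

# A 3-minute video at 60 frames per second.
fps, seconds = 60, 3 * 60
frames = fps * seconds                            # 10800
values_per_video = frames * values_per_image      # 81000000

# Hypothetical: one dense layer of 128 neurons reading a flattened image.
# Every neuron needs one weight per input value, plus one bias.
hidden_units = 128
dense_params = values_per_image * hidden_units + hidden_units

print(values_per_image, frames, values_per_video, dense_params)
```

Even for this tiny image, a single small dense layer already needs close to a million parameters, which is part of why the flattened-input approach scales poorly.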

#### **Solution**
You need **Convolution Network** and **Pooling**. **Convolution Network** extracts the spatial features from the image, and **Pooling** does the sample or feature downsizing.

![CNN](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41598-024-51258-6/MediaObjects/41598_2024_51258_Fig1_HTML.png?as=webp)

https://poloclub.github.io/cnn-explainer/


### 1. Convolution Network / Layer
**A convolution layer is the core building block of a Convolutional Neural Network (CNN)**. It is designed to automatically extract spatial features from input data (usually from images) using a set of learnable filters or kernels.

**Imagine**

You have an image represented as a grid of pixel values (e.g., a 7x7 grayscale image). The convolution layer slides a small filter (such as 3x3) over the image and, at each position, performs element-wise multiplication and summation to produce one value of a feature map (also called an activation map).

![CNN](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*ciDgQEjViWLnCbmX-EeSrA.gif)
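
The sliding-window operation described above can be sketched in plain NumPy. This is a minimal illustration, not how Keras implements it: the helper name `conv2d_valid`, the toy image, and the kernel values are all made up for this example. Like most deep-learning libraries, it does not flip the kernel (strictly speaking, this is cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]            # region under the kernel
            feature_map[i, j] = np.sum(patch * kernel)   # multiply element-wise, then sum
    return feature_map

# Toy 7x7 "image" whose values increase from left to right and top to bottom.
image = np.arange(49, dtype=float).reshape(7, 7)

# Hypothetical 3x3 kernel that responds to left-to-right intensity changes.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
```

A 3x3 kernel fits into a 7x7 image at 5 x 5 positions, so the feature map is smaller than the input. In a real CNN the kernel values are not hand-written like this; they are learned during training.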

Popular Kernel Sizes are:
- 7 x 7 (used usually in the first convolution layer for large images)
- 5 x 5
- 3 x 3

**Even-sized kernels (e.g., 2 x 2 or 4 x 4) are rarely used because they have no single center pixel to align with the pixel being processed.**

**Note**
- ***One kernel / filter means one feature***. Hence, the more features you need, the more kernels you have to add.
- If we need minute details from the image, we have to apply multiple filters. (Popular filter counts are 16, 32, and 64.)
- The more features, the more complex the model.
- **_An activation function is applied after convolution to introduce non-linearity, which helps the network learn complex patterns._**
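
To make the notes above concrete, here is a small NumPy sketch of applying several filters to one image and passing the result through ReLU. The helper names (`relu`, `apply_filters`) and the two hand-written kernels are assumptions for illustration; a trained network learns its kernels.

```python
import numpy as np

def relu(x):
    """ReLU activation: negative values become 0, positive values pass through."""
    return np.maximum(x, 0)

def apply_filters(image, kernels):
    """Return ReLU-activated feature maps, one output channel per kernel (no padding, stride 1)."""
    n, kh, kw = kernels.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    maps = np.zeros((out_h, out_w, n))
    for f in range(n):
        for i in range(out_h):
            for j in range(out_w):
                maps[i, j, f] = np.sum(image[i:i + kh, j:j + kw] * kernels[f])
    return relu(maps)

# Toy 5x5 image whose values increase from left to right.
image = np.arange(25, dtype=float).reshape(5, 5)

kernels = np.array([
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],   # fires when intensity increases left to right
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],   # fires when intensity decreases left to right
], dtype=float)

feature_maps = apply_filters(image, kernels)
print(feature_maps.shape)
```

Two kernels give two feature maps, stacked along the last axis. On this image the first kernel produces positive responses, while the second produces negative ones that ReLU clips to zero.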

**Other Parameters in Convolutional Operation**

- **Stride**: Stride indicates how many pixels the kernel shifts at a time. Like kernel size, it affects the size of the output. With a stride of 1, the filter moves one pixel at a time; with a stride of 2, it moves two pixels at a time. The higher the stride, the smaller the output, and vice versa.

- **Padding**: Padding adds extra border values (usually zeros) around the input so that the kernel can be centered on edge pixels, where it would otherwise extend beyond the input. Without padding, every convolution shrinks the output.
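
The combined effect of kernel size, stride, and padding on the output size follows the standard formula `floor((n + 2p - k) / s) + 1`. A minimal sketch, assuming a square input and the same padding on both sides (the function name `conv_output_size` is made up for this example):

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output width/height for an input of size n, kernel of size k, given stride and padding."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(7, 3))               # 7x7 input, 3x3 kernel
print(conv_output_size(28, 3))              # no padding: output shrinks
print(conv_output_size(28, 3, padding=1))   # padding of 1 keeps the size
print(conv_output_size(28, 3, stride=2))    # stride of 2 roughly halves the size
```

The same formula applies to pooling windows, with `k` as the pool size and `stride` usually equal to it.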



### 2. Pooling
**Pooling** is a downsampling operation used in Convolutional Neural Networks (CNNs) to reduce the spatial size (width × height) of feature maps, while retaining important features. It is done to decrease computational load and control overfitting.

#### Max Pooling vs Average Pooling

![Pooling](https://cdn.analyticsvidhya.com/wp-content/uploads/2024/08/597371-kqieqhxzicu7thjaqbfpbq-66c7045e59b1e.webp)
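
The difference between the two can be sketched in NumPy. Max pooling keeps the largest value in each window, while average pooling keeps the mean. The helper `pool2d` and the sample values are assumptions for illustration; the sketch uses non-overlapping windows (stride equal to the window size).

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample a 2D feature map using non-overlapping size x size windows."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # drop rows/columns that do not fill a window
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 7, 6, 8],
                        [9, 2, 1, 0],
                        [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)   # [[7. 8.] [9. 6.]]
print(avg_pooled)   # [[4.  5. ] [4.5 3. ]]
```

Either way, the 4x4 map becomes 2x2, so the next layer has a quarter as many values to process.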



### 3. Fully Connected (Dense) Layers
A dense layer is simply an ANN layer. **After several** convolutional and pooling layers, the feature maps are flattened and passed to fully connected layers for classification or regression.
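
As a rough NumPy sketch of this step: a stack of feature maps is flattened into one vector, multiplied by a weight matrix, and turned into class probabilities with softmax. The 5 x 5 x 64 shape, the random weights, and the 10 output classes are illustrative assumptions; a real network learns the weights during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 64 feature maps of 5 x 5.
feature_maps = rng.random((5, 5, 64))

# Flatten: every value becomes one input of the dense layer.
flat = feature_maps.reshape(-1)

# Dense layer with 10 outputs (random, untrained weights for illustration).
weights = rng.normal(scale=0.01, size=(flat.size, 10))
bias = np.zeros(10)
logits = flat @ weights + bias

# Softmax turns the 10 scores into probabilities that sum to 1.
exp_scores = np.exp(logits - logits.max())
probs = exp_scores / exp_scores.sum()

print(flat.shape, probs.shape)
```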

### 4. Dropout (optional)
Dropout is a regularization technique used during training to prevent overfitting by randomly "dropping out" (turning off) a fraction of the neurons in a layer during each training step.
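
A minimal NumPy sketch of the idea, assuming the common "inverted dropout" variant, in which the surviving activations are scaled up by `1 / (1 - rate)` so that their expected value stays the same. The function name and signature are made up for this example. In Keras you would add `layers.Dropout(rate)` to the model instead.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out roughly `rate` of the activations during training (inverted dropout)."""
    if not training or rate == 0.0:
        return activations                      # dropout is switched off at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob       # rescale so the expected value is unchanged

rng = np.random.default_rng(42)
activations = np.ones(10_000)

dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, rng=rng, training=False)

print(np.unique(dropped))   # only 0.0 (dropped) and 2.0 (kept and rescaled)
print(dropped.mean())       # close to 1.0
```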

## A Simple CNN Code

In [14]:
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from matplotlib import pyplot as plt

# Load and preprocess the data
# x_train is a 3-dimensional array 60000 x 28 x 28 --> 60000 training images. Each image is a 28 x 28 pixel sample of a hand-written digit.
# Each element of x_train (i.e. x_train[i][r][c]) is a gray level on the 0-255 scale
# y_train is a 1-dimensional array of length 60000 --> each element is the digit label (0-9) of the matching image in x_train

# x_test is a 3-dimensional array 10000 x 28 x 28 --> 10000 test images. Each image is a 28 x 28 pixel sample of a hand-written digit.
# Each element of x_test (i.e. x_test[i][r][c]) is a gray level on the 0-255 scale
# y_test is a 1-dimensional array of length 10000 --> each element is the digit label (0-9) of the matching image in x_test
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# print(x_train.shape)

# plt.imshow(x_train[0]) # grayscale image, but matplotlib applies its default viridis colormap
# plt.imshow(x_train[0], cmap="gray")
# plt.show()


# Reshape and normalize the numerical data
## reshape() turns the 3D array 60000 x 28 x 28 into a 4D array 60000 x 28 x 28 x 1
## The last axis is the number of channels. Our images are grayscale, so its size is always 1.
## Why 4D? Because Conv2D expects input of shape (batch, height, width, channels).
# x_train[i] --> image index (0 - 59999)
# x_train[i][r] --> pixel row index (0 - 27)
# x_train[i][r][c] --> pixel column index (0 - 27)
# x_train[i][r][c][ch] --> channel index (0 only for grayscale images, 0 - 2 for RGB images)
## Dividing by 255 scales the pixel values from the 0-255 range to the 0-1 range.
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255
x_test  = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255
# print(x_train.shape)

# Encode the categorical data (target)
## The target is the hand-written digit class, ranging from 0 to 9
## to_categorical() one-hot encodes it.
## One-hot encoded targets are required by the "categorical_crossentropy" loss function
print(y_train[0])
y_train = to_categorical(y_train)
y_test  = to_categorical(y_test)
print(y_train[0])

# Build CNN model
model = models.Sequential([
    # Convolution Layer 1 - apply 32 filters/kernels of size 3x3
    # Each image has shape 28 x 28 x 1 (784 pixel values). Conv2D reads it as a 2D grid, not as 784 flattened inputs.
    # input_shape=(28, 28, 1) declares the shape of one input image
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)), # Pooling using 2x2 filter

    # Convolution Layer2 - Apply 64 filter/kernel of 3x3 size
    ## input_shape() is not specified here. Keras determines the shape and passes it automatically.
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),


    layers.Flatten(), # Flatten the 5 x 5 x 64 feature maps into a 1D vector of 1600 values

    # Here come the ANN layers
    layers.Dense(64, activation='relu'),    # ANN hidden/dense layer with 64 neurons
    layers.Dense(10, activation='softmax')  # Output layer: 10 classes, one per MNIST digit (0 to 9)
])

# Compile and train the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, validation_data=(x_test, y_test))




5
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]


Epoch 1/4
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 74s 38ms/step - accuracy: 0.9049 - loss: 0.3037 - val_accuracy: 0.9851 - val_loss: 0.0487
Epoch 2/4
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 82s 38ms/step - accuracy: 0.9851 - loss: 0.0493 - val_accuracy: 0.9878 - val_loss: 0.0359
Epoch 3/4
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 72s 39ms/step - accuracy: 0.9888 - loss: 0.0337 - val_accuracy: 0.9914 - val_loss: 0.0276
Epoch 4/4
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 69s 37ms/step - accuracy: 0.9929 - loss: 0.0227 - val_accuracy: 0.9895 - val_loss: 0.0301
313/313 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - accuracy: 0.9853 - loss: 0.0399
Test accuracy: 0.9894999861717224


In [17]:
# Evaluate on training data
train_loss, train_acc = model.evaluate(x_train, y_train)
print("Training accuracy:", train_acc)

# Evaluate on test data
test_loss, test_acc = model.evaluate(x_test, y_test)
print("Test accuracy:", test_acc)


1875/1875 ━━━━━━━━━━━━━━━━━━━━ 22s 11ms/step - accuracy: 0.9955 - loss: 0.0133
Training accuracy: 0.9958999752998352
313/313 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - accuracy: 0.9853 - loss: 0.0399
Test accuracy: 0.9894999861717224


In [18]:
import tensorflow as tf
from sklearn.metrics import accuracy_score

# 1. Predict probabilities for test data
pred = model.predict(x_test)

# 2. Get predicted class labels by taking argmax
y_pred = tf.argmax(pred, axis=1).numpy()  # convert to NumPy array

# 3. If y_test is one-hot encoded, convert it back to class labels
if len(y_test.shape) > 1 and y_test.shape[1] > 1:
    y_true = tf.argmax(y_test, axis=1).numpy()
else:
    y_true = y_test

# 4. Compute accuracy
acc = accuracy_score(y_true, y_pred)
print(f"Test accuracy (manual): {acc:.4f}")


313/313 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step
Test accuracy (manual): 0.9895
