# Explore the data step by step and preprocess it for use in a Convolutional Neural Network

In [6]:
import numpy as np
from tensorflow import keras
from tensorflow.keras.utils import to_categorical

In [13]:
print("Loading MNIST dataset...")
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
print("MNIST loaded successfully")

Loading MNIST dataset...
MNIST loaded successfully


In [8]:
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")

Training data shape: (60000, 28, 28)
Test data shape: (10000, 28, 28)
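Beyond the shapes, it can help to look at the dtype, the pixel value range and the number of samples per class before preprocessing. The sketch below is only an illustration: `X_demo` and `y_demo` are hypothetical random stand-ins with the same layout as the MNIST arrays (uint8 images, integer labels), so it runs without downloading anything. The same calls should work on `X_train` and `y_train`.

```python
import numpy as np

# Hypothetical stand-in data shaped like MNIST: uint8 images and integer labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
y_demo = rng.integers(0, 10, size=100)

# Basic facts about the images
print(f"dtype: {X_demo.dtype}")
print(f"pixel range: [{X_demo.min()}, {X_demo.max()}]")

# Number of samples per class (index i holds the count for digit i)
class_counts = np.bincount(y_demo, minlength=10)
print(f"samples per class: {class_counts}")
```

A strongly uneven class distribution would be worth knowing about before training, since it affects how accuracy should be interpreted.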


Normalise pixel values. Dividing by 255 rescales every value from the range [0, 255] to [0, 1]. Small, consistently scaled inputs generally help gradient-based training converge faster and more stably.

In [9]:
X_train = X_train.astype("float32") / 255.0
X_test = X_test.astype("float32") / 255.0
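A quick way to confirm that the normalisation behaves as described is to check the dtype and the value range afterwards. The sketch below uses a tiny hypothetical array (`pixels`) instead of the real images so that it is self-contained.

```python
import numpy as np

# Hypothetical 2x2 "image" that covers both extremes of the uint8 range
pixels = np.array([[0, 51], [204, 255]], dtype=np.uint8)

# Same operation as applied to X_train and X_test
scaled = pixels.astype("float32") / 255.0

print(scaled.dtype)                # float32
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Running the same two `print` calls on `X_train` after the cell above should give the same dtype, and a range inside [0, 1].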

Reshape the data for CNN input. The arrays currently have shape (batch_size, height, width). A CNN needs shape (batch_size, height, width, channels).
- batch_size - number of images along the first axis (here the whole set: 60000 training images, 10000 test images)
- channels - number of color channels: 1 for grayscale, 3 for RGB (MNIST is grayscale)


In [10]:
X_train = X_train.reshape(-1, 28, 28, 1) # using -1 automatically infers batch size
X_test = X_test.reshape(-1, 28, 28, 1)
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")

Training data shape: (60000, 28, 28, 1)
Test data shape: (10000, 28, 28, 1)
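Adding the channel axis does not change any pixel values; it only adds a trailing dimension of size 1. As an illustration, the sketch below shows two NumPy alternatives to `reshape` that produce the same shape. The array `batch` is a hypothetical stand-in for a small batch of images.

```python
import numpy as np

# Hypothetical batch of 5 grayscale images
batch = np.zeros((5, 28, 28), dtype="float32")

a = batch.reshape(-1, 28, 28, 1)    # as used above; -1 infers the batch size
b = np.expand_dims(batch, axis=-1)  # append a new last axis
c = batch[..., np.newaxis]          # indexing form of the same operation

print(a.shape, b.shape, c.shape)
```

`expand_dims` and `np.newaxis` avoid hard-coding the image size, which can be convenient if the same code is later reused for images that are not 28x28.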


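The MNIST labels are integers from ```0``` to ```9```. We use one-hot encoding to convert each integer into a vector of length 10 where the position of the correct class is ```1``` and the others are ```0```.
We use ```.astype("float32")``` for several reasons: first, the loss function compares these targets with the model's floating-point predictions, so they should be floats rather than integers; second, ```float32``` matches the dtype of the input images and uses half the memory of ```float64```.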

In [14]:
num_classes = 10
print(f"Training labels shape before: {y_train.shape}")
print(f"Test labels shape before: {y_test.shape}")
# One-hot encode the integer labels and store them as float32
y_train = to_categorical(y_train, num_classes).astype("float32")
y_test = to_categorical(y_test, num_classes).astype("float32")
print(f"Training labels shape after: {y_train.shape}")
print(f"Test labels shape after: {y_test.shape}")

Training labels shape before: (60000,)
Test labels shape before: (10000,)
Training labels shape after: (60000, 10)
Test labels shape after: (10000, 10)
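To make the encoding concrete, here is a minimal NumPy sketch of one-hot encoding on three hypothetical labels. It is not how `to_categorical` is implemented, only an equivalent way to build the same kind of vectors, and it shows that `argmax` recovers the original integer labels.

```python
import numpy as np

# Hypothetical labels for three images
labels = np.array([3, 0, 9])
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # a single 1 at index 3, zeros elsewhere

# The position of the 1 in each row is the original label
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The same `argmax(axis=1)` call is typically what turns a model's predicted class probabilities back into digit labels.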
