In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Digit Recognizer: A Comprehensive Walkthrough

This notebook details the process of building and training a convolutional neural network (CNN) to recognize handwritten digits from the famous MNIST dataset. We will cover everything from data loading and exploration to model building, training with cross-validation, and finally, generating a submission file for the Kaggle competition.

## Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Setup and Dependencies](#2.-Setup-and-Dependencies)
* [3. Reproducibility](#3.-Reproducibility)
* [4. Data Loading](#4.-Data-Loading)
* [5. Exploratory Data Analysis (EDA)](#5.-Exploratory-Data-Analysis-(EDA))
* [6. Data Preprocessing](#6.-Data-Preprocessing)
* [7. Data Visualization](#7.-Data-Visualization)
* [8. Model Configuration and Training Strategy](#8.-Model-Configuration-and-Training-Strategy)
* [9. Data Preprocessing Pipeline](#9.-Data-Preprocessing-Pipeline)
* [10. Model Architecture](#10.-Model-Architecture)
* [11. Model Training with Cross-Validation](#11.-Model-Training-with-Cross-Validation)
* [12. Performance Evaluation](#12.-Performance-Evaluation)
* [13. Visualizing Training History](#13.-Visualizing-Training-History)
* [14. Preparing Test Data](#14.-Preparing-Test-Data)
* [15. Generating Predictions](#15.-Generating-Predictions)
* [16. Creating the Submission File](#16.-Creating-the-Submission-File)
* [17. Final Submission](#17.-Final-Submission)
* [18. Conclusion](#18.-Conclusion)

<a id='1.-Introduction'></a>
## 1. Introduction

Welcome to this comprehensive guide on building a digit recognizer using deep learning. In this notebook, we'll tackle the classic MNIST handwritten digit classification problem. Our goal is to create a model that can accurately identify digits from 0 to 9 based on pixel data from images. We will employ a **Convolutional Neural Network (CNN)**, a powerful type of neural network particularly well-suited for image-based tasks. This notebook will walk you through the entire machine learning workflow, from setting up the environment and exploring the data to building, training, and evaluating our model. We'll also cover best practices like cross-validation to ensure our model is robust and performs well on unseen data. Let's get started!

<a id='2.-Setup-and-Dependencies'></a>
## 2. Setup and Dependencies

Before we dive into the project, it's crucial to import all the necessary libraries. This initial step ensures that we have all the tools we need for data manipulation, visualization, and building our deep learning model. Here, we import **NumPy** for numerical operations, **Pandas** for handling our datasets, **Matplotlib** for plotting and visualizing the digit images, and **TensorFlow**, the core library for building and training our neural network. We also import **StratifiedKFold** from **scikit-learn**, which will be instrumental in our cross-validation strategy, helping us to create balanced folds for training and validation.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

<a id='3.-Reproducibility'></a>
## 3. Reproducibility

To keep our results consistent across runs, we set a global random seed. Machine learning models involve random processes, such as the initialization of model weights and the shuffling of data. Seeding both **NumPy** and **TensorFlow** makes these operations draw the same sequence of random numbers each time the code is executed. Some GPU operations are still nondeterministic, so metrics can differ slightly between runs, but seeding removes most of the variation. This helps with debugging, sharing our work, and comparing experiments.

In [None]:
# Set random seeds for reproducibility across NumPy and TensorFlow
SEED = 28
np.random.seed(SEED)
tf.random.set_seed(SEED)
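
As a small illustration of what seeding does, the sketch below reseeds NumPy's global generator and checks that the same seed reproduces the same draws. The helper `draw_sample` is a hypothetical name introduced here for the demonstration; it is not part of the notebook's pipeline.

```python
import numpy as np

SEED = 28  # same value as the notebook's seed


def draw_sample(seed, n=5):
    """Reseed NumPy's global generator and draw n uniform random numbers."""
    np.random.seed(seed)
    return np.random.rand(n)


first = draw_sample(SEED)
second = draw_sample(SEED)
different = draw_sample(SEED + 1)

print(np.array_equal(first, second))     # same seed, same sequence
print(np.array_equal(first, different))  # different seed, different sequence
```

The same idea applies to TensorFlow's generator, although this check only covers NumPy.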

<a id='4.-Data-Loading'></a>
## 4. Data Loading

The first step in any machine learning project is to load the data. Here, we use the **Pandas** library to read our training and testing datasets from CSV files. The `train.csv` file contains the pixel values for each image along with a 'label' column that indicates the actual digit. The `test.csv` file contains the pixel values for the images we need to classify for our Kaggle submission. Loading these into pandas DataFrames allows for easy manipulation and inspection of the data, which is essential for the subsequent steps of our analysis.

In [None]:
df_train = pd.read_csv('../input/digit-recognizer/train.csv')
df_test = pd.read_csv('../input/digit-recognizer/test.csv')

<a id='5.-Exploratory-Data-Analysis-(EDA)'></a>
## 5. Exploratory Data Analysis (EDA)

A fundamental step in the machine learning pipeline is to get a feel for the data we're working with. In this section, we perform some initial exploratory data analysis (EDA) to understand the structure and characteristics of our datasets. We will inspect the last few rows of our training and test DataFrames using the `.tail()` method. This gives us a quick preview of the data, showing the pixel values and, for the training set, the corresponding labels. This initial look helps confirm that the data has been loaded correctly and gives us a sense of the format we'll be working with. We'll also check the dimensions of our datasets, examine the unique labels to ensure we have all ten digits (0-9), and check for any missing values. This foundational analysis is key to building an effective and robust model.

In [None]:
df_train.tail()

In [None]:
df_test.tail()

In [None]:
print(df_train.shape, df_test.shape)

In [None]:
# List the distinct labels in ascending order (expected: 0 through 9)
np.sort(df_train['label'].unique())

In [None]:
df_train['label'].value_counts()

In [None]:
df_train.isna().sum().sum()

In [None]:
df_test.isna().sum().sum()

<a id='6.-Data-Preprocessing'></a>
## 6. Data Preprocessing

With a better understanding of our data, we now move to the preprocessing stage. This is where we prepare the data for our machine learning model. The primary goal here is to separate our features (the pixel values of the images) from our target variable (the digit labels). We create `X_train` by dropping the 'label' column from our training DataFrame and `y_train` by selecting only the 'label' column. For the test set, since there are no labels, `X_test` will consist of all the pixel values. We also convert these pandas DataFrames into **NumPy** arrays using the `.values` attribute, as this is the format expected by **TensorFlow**. Finally, we print the shapes of our newly created arrays to verify that the dimensions are correct and that the preprocessing step was successful.

In [None]:
X_train = df_train.drop('label', axis= 1).values
X_test = df_test.values
y_train = df_train['label'].values
print(X_train.shape, y_train.shape, X_test.shape)

<a id='7.-Data-Visualization'></a>
## 7. Data Visualization

To gain a more intuitive understanding of our dataset, it's helpful to visualize some of the images. In this step, we'll take a look at a sample image from both the training and test sets. We select an image by its index, reshape the 1D array of 784 pixels into a 2D 28x28 array, and then use **Matplotlib** to display it. For the training image, we also display its corresponding label as the title of the plot. This not only confirms that our data represents actual handwritten digits but also gives us a qualitative sense of the data's characteristics, such as the variation in writing styles.

In [None]:
# Select and reshape a sample training image (28x28) for visualization
INDEX = 1234

np.set_printoptions(linewidth=120)
img_train = X_train[INDEX].reshape(28, 28)
print(img_train)

In [None]:
# Do same thing for a test image
img_test = X_test[INDEX].reshape(28, 28)
print(img_test)

In [None]:
# Display the selected training image and its true label
plt.imshow(img_train)
plt.title(f'{y_train[INDEX]}')
plt.show()

In [None]:
# Display the selected test image (test data has no label to show)
plt.imshow(img_test)
plt.show()
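
The reshape used above relies on NumPy's default row-major order: pixel `k` of the flat vector lands at row `k // 28`, column `k % 28`. A minimal sketch, using a stand-in array of pixel positions rather than real image data:

```python
import numpy as np

flat = np.arange(784)         # stand-in for one row of 784 pixel columns
image = flat.reshape(28, 28)  # NumPy fills rows first (C order)

row, col = 5, 10
print(image[row, col])        # 150, because 5 * 28 + 10 = 150
print(divmod(150, 28))        # (5, 10): recovers row and column from the flat index
```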

<a id='8.-Model-Configuration-and-Training-Strategy'></a>
## 8. Model Configuration and Training Strategy

Before we start building and training our model, it's important to define our training strategy and set up some key hyperparameters. In this section, we establish the configuration for our training process. We define the number of folds for our cross-validation (`NUM_FOLD`), the batch size for training (`BATCH_SIZE`), and parameters for our data pipeline like the shuffle buffer size (`SHUFFLE_SIZE`) and prefetch size (`PREFETCH_SIZE`). We then initialize **StratifiedKFold** from scikit-learn, which will ensure that each fold of our data has a proportional representation of each digit class. We also create empty lists to store the accuracy and loss for each fold, as well as the training history. This systematic approach allows for a more organized and robust training process.

In [None]:
# Define cross-validation parameters and data pipeline constants
NUM_FOLD = 10          # Number of folds for Stratified K-Fold CV
BATCH_SIZE = 32        # Batch size for training
SHUFFLE_SIZE = 1000    # Shuffle buffer size for dataset shuffling
PREFETCH_SIZE = tf.data.AUTOTUNE  # Automatic prefetch tuning for performance

# Stratified K-Fold object to preserve label distribution
kf = StratifiedKFold(NUM_FOLD, shuffle=True, random_state=SEED)

# Lists to store metrics across folds
fold_acc_hist, fold_loss_hist = [], []
histories = []
# A boolean variable for logging through KFold loop
log_print = True
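
To see what stratification buys us, the sketch below runs `StratifiedKFold` on a small, deliberately imbalanced set of hypothetical labels (80 samples of class 0 and 20 of class 1, not the MNIST data) and counts the minority class in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels, for illustration only.
labels = np.array([0] * 80 + [1] * 20)
features = np.zeros((len(labels), 1))  # placeholder features; only labels matter here

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=28)

val_class1_counts = []
for train_idx, val_idx in skf.split(features, labels):
    # Each validation fold holds 20 samples; count how many belong to class 1
    val_class1_counts.append(int(labels[val_idx].sum()))

print(val_class1_counts)  # every fold keeps the 80/20 ratio: [4, 4, 4, 4, 4]
```

A plain `KFold` with shuffling would not guarantee these counts, which is why the stratified variant is the safer choice for classification.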

<a id='9.-Data-Preprocessing-Pipeline'></a>
## 9. Data Preprocessing Pipeline

To streamline the preprocessing of our image data, we create a sequential model using **TensorFlow's Keras API**. This pipeline will take our raw input data (a 1D array of 784 pixels) and transform it into the format expected by our convolutional neural network. The pipeline consists of two main steps: first, a `Reshape` layer that converts the 1D array into a 2D image of size 28x28 with a single color channel (grayscale). Second, a `Rescaling` layer that normalizes the pixel values from the range [0, 255] to [0, 1]. This normalization is a crucial step that helps the model converge faster and more effectively during training.

In [None]:
# Preprocessing pipeline: reshape flat vectors into 28x28x1 and scale pixel values
preprocessing_data = tf.keras.Sequential([
    tf.keras.Input(shape= (784,)),
    tf.keras.layers.Reshape((28, 28, 1)),
    tf.keras.layers.Rescaling(1./255)
])
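
For intuition, the two layers above do the same arithmetic as the plain NumPy sketch below. The function name `preprocess_numpy` and the toy batch are assumptions made for this example; the notebook itself uses the Keras pipeline.

```python
import numpy as np


def preprocess_numpy(flat_pixels):
    """Reshape (n, 784) pixel rows to (n, 28, 28, 1) and scale values to [0, 1]."""
    images = np.asarray(flat_pixels, dtype=np.float32).reshape(-1, 28, 28, 1)
    return images / 255.0


# Toy batch: one image that is black except for the last pixel, one mid-gray image
batch = np.array([[0] * 783 + [255], [128] * 784])
out = preprocess_numpy(batch)

print(out.shape)             # (2, 28, 28, 1)
print(out.min(), out.max())  # 0.0 1.0
```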

<a id='10.-Model-Architecture'></a>
## 10. Model Architecture

Here, we define the architecture of our **Convolutional Neural Network (CNN)**. A CNN is a type of deep learning model that is particularly effective for image classification tasks. Our model is built using the Keras functional API, which offers a flexible way to create complex models.

The architecture consists of the following layers:
* **Input Layer**: Specifies the input shape of our data, which is a 28x28 grayscale image.
* **Convolutional Layer (`Conv2D`)**: This is the core building block of a CNN. It applies a set of learnable filters to the input image, allowing the model to detect features like edges, corners, and textures. We use 32 filters of size 3x3 and a 'relu' activation function.
* **Max Pooling Layer (`MaxPooling2D`)**: This layer downsamples the feature maps, reducing their spatial dimensions. This helps to make the model more robust to variations in the position of features in the image and also reduces the computational load.
* **Flatten Layer**: This layer converts the 2D feature maps into a 1D vector, preparing the data to be fed into the dense layers.
* **Dense Layer**: A fully connected layer with 128 neurons and a 'relu' activation function. This layer learns to combine the features extracted by the convolutional layers to make classifications.
* **Output Layer**: The final dense layer with 10 neurons, one for each digit class (0-9). It uses a 'linear' activation, so it outputs raw logits rather than probabilities. We pair it with `SparseCategoricalCrossentropy(from_logits=True)`, which applies the softmax inside the loss; this is more numerically stable than adding a separate softmax layer.

This architecture is a simple yet effective design for the MNIST digit recognition task.

In [None]:
# Build a simple CNN model for digit classification
def build_digit_model(input_size=(28, 28, 1), num_classes=10):
    inputs = tf.keras.Input(shape=input_size)
    x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu')(inputs)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='linear')(x)  # logits output

    model = tf.keras.Model(inputs, outputs)
    return model
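
The layer shapes and parameter counts of this architecture can be worked out by hand, which is a useful cross-check against `model.summary()`. The sketch below assumes the Keras defaults used above: 'valid' padding, stride 1 for the convolution, and one bias per filter or unit.

```python
def conv2d_output(size, kernel):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


side = conv2d_output(28, 3)          # 28x28 -> 26x26
conv_params = (3 * 3 * 1 + 1) * 32   # 9 weights + 1 bias for each of 32 filters
pooled = side // 2                   # 2x2 max pooling: 26 -> 13
flat_units = pooled * pooled * 32    # 13 * 13 * 32 values after Flatten
dense_params = (flat_units + 1) * 128
output_params = (128 + 1) * 10
total_params = conv_params + dense_params + output_params

print(side, pooled, flat_units)      # 26 13 5408
print(conv_params, dense_params, output_params)  # 320 692352 1290
print(total_params)                  # 693962
```

Almost all parameters sit in the first dense layer, so that is where shrinking the network would have the largest effect.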


<a id='11.-Model-Training-with-Cross-Validation'></a>
## 11. Model Training with Cross-Validation

This is the core of our notebook, where we train our CNN model using a robust cross-validation strategy. We use **StratifiedKFold** to split our training data into 10 folds. For each fold, we:

1.  **Split the data**: We divide the data into a training set and a validation set.
2.  **Create `tf.data.Dataset` objects**: We convert our NumPy arrays into `tf.data.Dataset` objects, which are highly efficient for building input pipelines in TensorFlow.
3.  **Apply preprocessing**: We apply our `preprocessing_data` pipeline to both the training and validation datasets.
4.  **Build and compile the model**: We create a new instance of our `build_digit_model()` and compile it with the Adam optimizer, sparse categorical crossentropy loss, and accuracy as our metric.
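5.  **Define callbacks**: We use two callbacks, both of which monitor the validation loss by default. `EarlyStopping` stops training once the validation loss has not improved for 5 consecutive epochs and restores the weights from the best epoch, which limits overfitting. `ReduceLROnPlateau` multiplies the learning rate by 0.2 after 3 epochs without improvement, down to a minimum of 1e-5.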
6.  **Train the model**: We train the model for up to 50 epochs, using our prepared datasets and callbacks.
7.  **Evaluate the model**: After training, we evaluate the model on the validation set and record the loss and accuracy for that fold.

This process is repeated for all 10 folds, ensuring that our model's performance is not dependent on a specific random split of the data.

In [None]:
# Perform 10-fold stratified cross-validation
for i, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train)):
    
    # Split data into training and validation subsets for the current fold
    X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
    y_train_fold, y_val_fold = y_train[train_idx], y_train[val_idx]
    
    # Convert numpy arrays to TensorFlow Datasets
    train_ds = tf.data.Dataset.from_tensor_slices((X_train_fold, y_train_fold))
    val_ds = tf.data.Dataset.from_tensor_slices((X_val_fold, y_val_fold))
    if log_print:
        print('Train and Validation Datasets created successfully!')
    
    # Optimize dataset pipelines: shuffle, batch, prefetch
    train_ds = (train_ds
                .cache()
                .shuffle(SHUFFLE_SIZE)
                .batch(BATCH_SIZE)
                .prefetch(PREFETCH_SIZE))
    val_ds = (val_ds
              .cache()
              .batch(BATCH_SIZE)
              .prefetch(PREFETCH_SIZE))
    
    # Apply preprocessing (reshape + rescaling) to datasets
    processed_train_ds = train_ds.map(lambda x, y: (preprocessing_data(x), y))
    processed_val_ds = val_ds.map(lambda x, y: (preprocessing_data(x), y))
    if log_print:
        print('Processing datasets done!')
    
    
    model = build_digit_model()
    if log_print:
        model.summary()
        log_print = False
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )
    
    # Define callbacks for early stopping and learning rate reduction
    early_stopping = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(patience=3, factor=0.2, min_lr=1e-5)
    
    # Train the model
    print(f"----- Fold {i+1}/{NUM_FOLD} -----")
    history = model.fit(
        processed_train_ds,
        epochs=50,
        validation_data=processed_val_ds,
        callbacks=[early_stopping, reduce_lr],
        verbose=2
    )
    
    # Save training history and evaluate the model on validation set
    histories.append(history.history)
    val_loss, val_accuracy = model.evaluate(processed_val_ds, verbose=0)
    
    # Record performance metrics for the current fold
    print(f'Fold {i+1} Loss: {val_loss:.4f}')
    print(f'Fold {i+1} Accuracy: {val_accuracy:.4f}')
    fold_acc_hist.append(val_accuracy)
    fold_loss_hist.append(val_loss)
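
To make the interaction of the two callbacks concrete, here is a simplified pure-Python model of their patience logic. It is a sketch, not Keras's implementation: the function `simulate_callbacks` is hypothetical, it ignores details such as `min_delta` and `cooldown`, and the validation losses are made-up numbers.

```python
def simulate_callbacks(val_losses, lr=1e-3, stop_patience=5, lr_patience=3,
                       factor=0.2, min_lr=1e-5):
    """Return (last_epoch_run, final_learning_rate) for a list of validation losses."""
    best_for_stop = best_for_lr = float('inf')
    stop_wait = lr_wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        # Learning-rate reduction: shrink lr after lr_patience epochs without improvement
        if loss < best_for_lr:
            best_for_lr, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * factor, min_lr)
                lr_wait = 0
        # Early stopping: stop after stop_patience epochs without improvement
        if loss < best_for_stop:
            best_for_stop, stop_wait = loss, 0
        else:
            stop_wait += 1
            if stop_wait >= stop_patience:
                return epoch, lr
    return len(val_losses), lr


# Hypothetical validation losses: best value at epoch 2, then slowly getting worse
losses = [0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46]
stopped_epoch, final_lr = simulate_callbacks(losses)
print(stopped_epoch, final_lr)
```

With these numbers the learning rate is cut once (at epoch 5) and training stops at epoch 7, five epochs after the best loss.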

<a id='12.-Performance-Evaluation'></a>
## 12. Performance Evaluation

After completing our 10-fold cross-validation, we can now assess the overall performance of our model. We calculate the average validation accuracy and loss across all the folds. This gives us a more reliable estimate of how our model is likely to perform on unseen data compared to a single train-validation split. A high average accuracy and low average loss indicate that our model has learned the underlying patterns in the data well and is able to generalize to new examples.

In [None]:
# Display average validation accuracy and loss across all folds
print(f'Average Validation Accuracy: {np.mean(fold_acc_hist):.4f}')
print(f'Average Validation Loss: {np.mean(fold_loss_hist):.4f}')
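
The mean alone hides how much the folds disagree, so it can help to report the spread as well. A minimal sketch, assuming a hypothetical helper `summarize_folds`; the accuracies below are placeholder values for illustration, not results from this notebook.

```python
import numpy as np


def summarize_folds(values):
    """Return the mean and sample standard deviation of per-fold metrics."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1)


# Placeholder per-fold accuracies, for illustration only
example_acc = [0.90, 0.92, 0.91, 0.93, 0.89]
mean_acc, std_acc = summarize_folds(example_acc)
print(f'{mean_acc:.4f} +/- {std_acc:.4f}')
```

In the notebook, the same function could be called on `fold_acc_hist` and `fold_loss_hist`.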

<a id='13.-Visualizing-Training-History'></a>
## 13. Visualizing Training History

To get a more detailed look at our model's learning process, we visualize the training and validation metrics from the last fold. We plot the accuracy and loss for both the training and validation sets over the course of the epochs.

* **Accuracy Plot**: This plot shows how the model's accuracy on the training and validation data changes with each epoch. Ideally, both lines should increase and converge. A large gap between the two lines can be a sign of overfitting.
* **Loss Plot**: This plot shows the model's loss on the training and validation data. We want to see both lines decrease and converge. An increasing validation loss while the training loss decreases is a clear indicator of overfitting.

These plots are invaluable for diagnosing training issues and understanding the behavior of our model.

In [None]:
# Convert the last training history to a DataFrame for easy plotting
last_hist = histories[-1]
hist_df = pd.DataFrame(last_hist)

# Plot training vs validation accuracy and loss
fig, axs = plt.subplots(1, 2, figsize=(12, 6))

axs[0].plot(hist_df['accuracy'], label='Train Accuracy')
axs[0].plot(hist_df['val_accuracy'], label='Validation Accuracy')
axs[0].set_title('Model Accuracy (Last Fold)')
axs[0].set_xlabel('Epoch', size=10)
axs[0].set_ylabel('Accuracy', size=10)
axs[0].legend(loc='lower right')

axs[1].plot(hist_df['loss'], label='Train Loss')
axs[1].plot(hist_df['val_loss'], label='Validation Loss')
axs[1].set_title('Model Loss (Last Fold)')
axs[1].set_xlabel('Epoch', size=10)
axs[1].set_ylabel('Loss', size=10)
axs[1].legend(loc='upper right')

plt.show()


<a id='14.-Preparing-Test-Data'></a>
## 14. Preparing Test Data

Now that we have a trained model, it's time to prepare the test data for prediction. Similar to our training data pipeline, we first convert the test data into a `tf.data.Dataset`. We then apply the same preprocessing steps, including reshaping and rescaling, to ensure that the test data is in the exact same format as the data our model was trained on. This consistency is crucial for obtaining accurate predictions.

In [None]:
# Create and preprocess test dataset for inference
test_ds = tf.data.Dataset.from_tensor_slices(X_test)
test_ds = (test_ds
           .batch(BATCH_SIZE)
           .prefetch(PREFETCH_SIZE))
processed_test_ds = test_ds.map(lambda x: preprocessing_data(x))

<a id='15.-Generating-Predictions'></a>
## 15. Generating Predictions

With our test data preprocessed, we can now make predictions. We call `model.predict()` on the processed test dataset; note that `model` here is the network trained in the last cross-validation fold, because the training loop builds a fresh model on every iteration. The call returns an array of logits, where each row corresponds to an image and each column holds the unnormalized score for one digit class. To get the final predicted label, we use `tf.argmax()` to find the index of the highest logit for each image, which is the digit the model considers most likely. We then print the first ten predictions as a sanity check.

In [None]:
# Generate predictions (logits) on the test dataset and convert to labels
preds = model.predict(processed_test_ds)
preds_label = tf.argmax(preds, axis=1)
print(f'Sample Prediction: {preds_label[:10]}')
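
Logits are not probabilities, but a softmax turns them into probabilities without changing which class scores highest, which is why taking the argmax of the raw logits is sufficient. A NumPy sketch with hypothetical logits for two images and three classes (the real model outputs ten columns):

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical logits, for illustration only
example_logits = np.array([[2.0, -1.0, 0.5],
                           [0.1, 3.2, -0.7]])
probs = softmax(example_logits)

print(probs.sum(axis=1))               # each row sums to 1
print(example_logits.argmax(axis=1))   # [0 1]
print(probs.argmax(axis=1))            # [0 1], same labels as from the logits
```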

<a id='16.-Creating-the-Submission-File'></a>
## 16. Creating the Submission File

For the Kaggle competition, we need to format our predictions into a specific CSV file format. The submission file should have two columns: 'ImageId' and 'Label'. The 'ImageId' should be a 1-based index for each image in the test set, and the 'Label' column should contain our model's prediction for that image. We create a pandas DataFrame with these two columns, using the index of the test DataFrame for the 'ImageId' and our predicted labels.

In [None]:
# Prepare submission DataFrame with image indices and predicted labels
submission = pd.DataFrame(
    {'ImageId': df_test.index + 1,  # ImageId is 1-based; the default DataFrame index starts at 0
     'Label': preds_label.numpy()}
)
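
Before uploading, a few cheap checks can catch formatting mistakes. The sketch below uses a hypothetical helper, `find_submission_problems`, and tiny made-up DataFrames; the rules it checks (columns `ImageId` and `Label`, ids running from 1 to N, labels between 0 and 9) follow the format described above.

```python
import numpy as np
import pandas as pd


def find_submission_problems(sub, n_test_rows):
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    problems = []
    if list(sub.columns) != ['ImageId', 'Label']:
        problems.append('columns must be exactly ImageId, Label')
        return problems
    if len(sub) != n_test_rows:
        problems.append('row count does not match the test set')
    if not np.array_equal(sub['ImageId'].to_numpy(), np.arange(1, len(sub) + 1)):
        problems.append('ImageId must run 1..N')
    if not sub['Label'].between(0, 9).all():
        problems.append('Label must be a digit 0-9')
    return problems


# Made-up examples: one valid frame, one with a 0-based id and an invalid label
good = pd.DataFrame({'ImageId': [1, 2, 3], 'Label': [7, 0, 9]})
bad = pd.DataFrame({'ImageId': [0, 1, 2], 'Label': [7, 0, 10]})

print(find_submission_problems(good, n_test_rows=3))  # []
print(find_submission_problems(bad, n_test_rows=3))
```

In the notebook, the call would be `find_submission_problems(submission, len(df_test))`.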

<a id='17.-Final-Submission'></a>
## 17. Final Submission

The final step is to save our submission DataFrame to a CSV file. We use the `.to_csv()` method from pandas, making sure to set `index=False` to avoid writing the DataFrame's index as an extra column. We then print a confirmation message and display the first few rows of our submission file using `.head()` to verify that it has been created correctly. This file is now ready to be uploaded to the Kaggle competition for scoring.

In [None]:
# Save predictions to CSV for Kaggle submission
submission.to_csv('/kaggle/working/submission.csv', index= False)
print('Submission CSV file created successfully!')
submission.head()

<a id='18.-Conclusion'></a>
## 18. Conclusion

In this notebook, we built, trained, and evaluated a convolutional neural network for handwritten digit recognition. We followed a structured approach, starting with data exploration and preprocessing, followed by model building and 10-fold stratified cross-validation. The average validation accuracy across the folds (Section 12) estimates how well this architecture generalizes, while the submission itself comes from the model trained in the last fold.

**Potential Next Steps:**

* **Data Augmentation**: To make our model even more robust, we could apply data augmentation techniques like random rotations, shifts, and zooms to the training images. This would expose the model to more variations and could improve its generalization performance.
* **Model Architecture Tuning**: We could experiment with different CNN architectures, such as adding more convolutional or dense layers, using different filter sizes, or incorporating techniques like batch normalization and dropout.
* **Hyperparameter Optimization**: A more systematic approach to hyperparameter tuning, using techniques like grid search or Bayesian optimization, could help us find the optimal combination of learning rate, batch size, and other parameters.
* **Ensemble Methods**: Combining the predictions of multiple models (an ensemble) can often lead to better performance than any single model. We could train several different models and average their predictions.
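
As a rough illustration of the ensemble idea, the sketch below averages the softmax probabilities of several models and takes the argmax of the average. Everything here is an assumption for the example: the function `ensemble_predict` is hypothetical and the logits are made-up numbers for three models, two images, and three classes. In this notebook, the models could be the ten networks trained during cross-validation, if they were kept in a list.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def ensemble_predict(logits_per_model):
    """Average class probabilities over models, then pick the most likely class."""
    mean_probs = np.mean([softmax(logits) for logits in logits_per_model], axis=0)
    return mean_probs.argmax(axis=1)


# Made-up logits from three models, for illustration only
model_a = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
model_b = np.array([[1.5, 0.5, 0.0], [0.0, 0.2, 1.0]])
model_c = np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])

ensemble_labels = ensemble_predict([model_a, model_b, model_c])
print(ensemble_labels)  # [0 1]
```

Note that individual models disagree here (model C prefers class 1 for the first image, model B prefers class 2 for the second), yet the averaged vote settles on a single label per image.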

Thank you for following along with this digit recognizer project. We hope this comprehensive walkthrough has been both informative and practical.