## Digit Recognizer

Although this is a computer vision problem, I built a simple **K-Nearest Neighbors** model in this notebook as a starting point, knowing that a CNN would be a much better option. I used **GridSearchCV** to tune the hyperparameters *n_neighbors* and *weights* with cross-validation. I also used **Data Augmentation** (also called **Artificial Data Synthesis**) to boost the model's performance on the test set.

Please **upvote** if you like this notebook and share your valuable feedback.

You can find my other notebooks below:

* [Disaster Tweets Classification](https://www.kaggle.com/gauthampughazh/disaster-or-not-plotly-use-tfidf-h2o-ai-automl)
* [House Sales Price Prediction](https://www.kaggle.com/gauthampughazh/house-sales-price-prediction-svr)
* [Titanic Survival Classification](https://www.kaggle.com/gauthampughazh/titanic-survival-prediction-pandas-plotly-keras)

In [None]:
import numpy as np # Linear algebra
import pandas as pd # For data manipulation
import json
import os
import matplotlib.pyplot as plt # For visualization
from sklearn.neighbors import KNeighborsClassifier # For modelling
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold # For evaluation and hyperparameter tuning
from sklearn.metrics import confusion_matrix, classification_report # For evaluation
from scipy.ndimage import shift # For data augmentation (the scipy.ndimage.interpolation namespace is deprecated)
from IPython.display import FileLink # For downloading the output file

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Peeking at the data**

Loading the datasets into dataframes

In [None]:
train_df = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test_df = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")
submission_df = pd.read_csv("/kaggle/input/digit-recognizer/sample_submission.csv")

Inspecting the features in the datasets

In [None]:
train_df.info()

In [None]:
test_df.info()

Setting up the training and testing data as numpy arrays

In [None]:
X_train = train_df.iloc[:, 1:].values
y_train = train_df.iloc[:, 0].values
X_test = test_df.values

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")

Visualizing a single digit as a 28 X 28 image from the dataset

In [None]:
some_digit = X_train[40]

some_digit_image = some_digit.reshape(28, 28)
print(f"Label: {y_train[40]}")
plt.imshow(some_digit_image, cmap="binary")
plt.show()

**Model Selection**

Using **StratifiedKFold** so that every test fold contains samples from all classes (digits) in roughly the same proportions as the full training set. The model is cross-validated with 5 folds, and the classification report is displayed for each test fold. A **confusion matrix** for each fold gives more detail on how well each digit is classified.

In [None]:
stratified_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, indices in enumerate(stratified_fold.split(X_train, y_train)):
    # Creating datasets for training and testing the model 
    X_train_, y_train_ = X_train[indices[0]], y_train[indices[0]]
    X_test_, y_test_ = X_train[indices[1]], y_train[indices[1]]
    
    estimator = KNeighborsClassifier()
    estimator.fit(X_train_, y_train_)
    predictions = estimator.predict(X_test_)
    
    print(f"Classification report for Fold {fold + 1}:")
    print(classification_report(y_test_, predictions, digits=3), end="\n\n")
    
    print(f"Confusion Matrix for Fold {fold + 1}:")
    print(confusion_matrix(y_test_, predictions), end="\n\n")
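To make the stratification concrete, here is a minimal sketch on toy labels (an assumption for illustration, not the competition data). It only checks how many samples of each class land in each test fold; the names `y_demo`, `X_demo`, and `fold_counts` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumed for illustration): 10 classes with 50 samples each
y_demo = np.repeat(np.arange(10), 50)
X_demo = np.zeros((y_demo.size, 1))  # Feature values do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_demo, y_demo)):
    # Count how many samples of each class ended up in this test fold
    counts = np.bincount(y_demo[test_idx], minlength=10)
    fold_counts.append(counts)
    print(f"Fold {fold + 1} test-set class counts: {counts}")
```

With perfectly balanced toy labels, every fold receives the same number of samples per class. On the real training set the classes are not perfectly balanced, so the counts would only be approximately proportional.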

**Fine-tuning the model by finding the best values for the hyperparameters (weights, n_neighbors) using GridSearchCV**

In [None]:
grid_params = {
    "weights": ['uniform', 'distance'],
    "n_neighbors": [3, 4, 5, 6, 8, 10]
}

estimator = KNeighborsClassifier()
grid_estimator = GridSearchCV(estimator, # Base estimator
                              grid_params, # Parameters to tune
                              cv=stratified_fold, # Cross-validation strategy
                              verbose=2, # Verbosity of the logs
                              n_jobs=-1) # Number of jobs to be run concurrently with -1 meaning all the processors
# Fitting the estimator with training data
grid_estimator.fit(X_train, y_train)

print(f"Best Score: {grid_estimator.best_score_}", end="\n\n")
print(f"Best Parameters: \n{json.dumps(grid_estimator.best_params_, indent=4)}",
      end="\n\n")
print("Grid Search CV results:")
results_df = pd.DataFrame(grid_estimator.cv_results_)
results_df
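The `weights` hyperparameter controls how the neighbors vote. The sketch below is a simplified, hand-rolled illustration of the two options and is not scikit-learn's actual implementation: `knn_vote` is a hypothetical helper, and clamping tiny distances is a simplification of how exact matches are handled.

```python
import numpy as np

def knn_vote(distances, labels, weights="uniform"):
    """Return the winning label among the k nearest neighbors (simplified)."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    if weights == "uniform":
        # Every neighbor counts equally
        vote_weights = np.ones_like(distances)
    else:
        # Closer neighbors count more; clamp to avoid division by zero
        vote_weights = 1.0 / np.maximum(distances, 1e-12)
    classes = np.unique(labels)
    scores = np.array([vote_weights[labels == c].sum() for c in classes])
    # On a tie, argmax returns the smallest class label
    return classes[np.argmax(scores)]

# Four neighbors: the closest one is a 7, but the vote count is tied 2-2
distances = [1.0, 4.0, 5.0, 6.0]
labels = [7, 1, 1, 7]

uniform_winner = knn_vote(distances, labels, weights="uniform")
distance_winner = knn_vote(distances, labels, weights="distance")
print(f"uniform: {uniform_winner}, distance: {distance_winner}")
```

In this toy case the uniform vote is tied and falls back to the smaller label, while the distance-weighted vote favors the class of the closest neighbor. This may be one reason an even `n_neighbors` can still work well when combined with `weights='distance'`.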

**Fitting a new model with the found hyperparameter values to the training data and making predictions on the test data**

In [None]:
estimator = KNeighborsClassifier(n_neighbors=4, weights='distance')
estimator.fit(X_train, y_train)
predictions = estimator.predict(X_test)

**Data Augmentation**

After reshaping each row of pixels to a 28 X 28 image, the image is shifted down, up, left, and right by one pixel, generating four additional images for each image in the dataset.

In [None]:
def shift_in_one_direction(image, direction):
    if direction == "DOWN":
        image = shift(image, [1, 0])
    elif direction == "UP":
        image = shift(image, [-1, 0])
    elif direction == "LEFT":
        image = shift(image, [0, -1])
    elif direction == "RIGHT":
        image = shift(image, [0, 1])
    else:
        raise ValueError(f"Unknown direction: {direction}")

    return image


def shift_in_all_directions(image):
    reshaped_image = image.reshape(28, 28)

    down_shifted_image = shift_in_one_direction(reshaped_image, "DOWN").reshape(1, 784)
    up_shifted_image = shift_in_one_direction(reshaped_image, "UP").reshape(1, 784)
    left_shifted_image = shift_in_one_direction(reshaped_image, "LEFT").reshape(1, 784)
    right_shifted_image = shift_in_one_direction(reshaped_image, "RIGHT").reshape(1, 784)

    return np.r_[down_shifted_image, up_shifted_image,
                 left_shifted_image, right_shifted_image]


X_train_add = np.apply_along_axis(shift_in_all_directions, 1, X_train).reshape(-1, 784)
y_train_add = np.repeat(y_train, 4)

print(f"X_train_add shape: {X_train_add.shape}")
print(f"y_train_add shape: {y_train_add.shape}")
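As an alternative sketch, the same four one-pixel shifts can be done for the whole batch at once with plain NumPy slicing instead of `scipy.ndimage.shift`. This is a different technique from the one used above: it fills vacated pixels with zeros and does no interpolation. The helpers `shift_batch` and `augment` and the toy arrays are hypothetical.

```python
import numpy as np

def shift_batch(images, dy, dx):
    """Shift a batch of (n, h, w) images by (dy, dx) pixels, zero-filling the border."""
    _, h, w = images.shape
    shifted = np.zeros_like(images)
    src_rows = slice(max(0, -dy), h - max(0, dy))
    dst_rows = slice(max(0, dy), h - max(0, -dy))
    src_cols = slice(max(0, -dx), w - max(0, dx))
    dst_cols = slice(max(0, dx), w - max(0, -dx))
    shifted[:, dst_rows, dst_cols] = images[:, src_rows, src_cols]
    return shifted

def augment(X, y, size=28):
    """Return four shifted copies (down, up, left, right) of every row in X."""
    images = X.reshape(-1, size, size)
    offsets = [(1, 0), (-1, 0), (0, -1), (0, 1)]  # down, up, left, right
    # Stack on axis 1 so the four copies of each image stay adjacent,
    # which matches the label order produced by np.repeat
    shifted = np.stack([shift_batch(images, dy, dx) for dy, dx in offsets], axis=1)
    return shifted.reshape(-1, size * size), np.repeat(y, len(offsets))

# Toy data (assumed): two blank images, the first with one bright pixel at (5, 5)
X_toy = np.zeros((2, 784), dtype=np.int64)
X_toy[0, 5 * 28 + 5] = 255
y_toy = np.array([3, 8])

X_toy_add, y_toy_add = augment(X_toy, y_toy)
print(f"X_toy_add shape: {X_toy_add.shape}")
print(f"y_toy_add: {y_toy_add}")
```

Stacking the shifted copies along axis 1 before reshaping keeps the row order aligned with the repeated labels, the same alignment the `np.apply_along_axis` version relies on.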

Combining the synthesized data with the actual training data

In [None]:
X_train_combined = np.r_[X_train, X_train_add]
y_train_combined = np.r_[y_train, y_train_add]

print(f"X_train_combined shape: {X_train_combined.shape}")
print(f"y_train_combined shape: {y_train_combined.shape}")

**Fitting a new model with the tuned hyperparameters to the combined dataset**

In [None]:
cdata_estimator = KNeighborsClassifier(n_neighbors=4, weights='distance')
cdata_estimator.fit(X_train_combined, y_train_combined)
cdata_estimator_predictions = cdata_estimator.predict(X_test)

**Generating the submission file**

In [None]:
submission_df["Label"] = predictions
submission_df.to_csv('submission.csv', index=False)
FileLink('submission.csv')

In [None]:
submission_df["Label"] = cdata_estimator_predictions
submission_df.to_csv('cdata_submission.csv', index=False)
FileLink('cdata_submission.csv')

**Note:** With **Data Augmentation**, the accuracy on the test data jumped from 97.185% to 98.128%.