Kaggle competition Digit Recognizer https://www.kaggle.com/competitions/digit-recognizer/overview

Public score: 0.9736.

I constructed this notebook following Kaggle Learn Intro to Deep Learning https://www.kaggle.com/learn/intro-to-deep-learning and TensorFlow tutorials https://www.tensorflow.org/tutorials/quickstart/beginner. I have introduced dropout to correct overfitting, and used early stopping to prevent both underfitting and overfitting. Other main references include: NURSULTAN KURMANBEKOV's digit-recognizer-mnist https://www.kaggle.com/code/nursultankurmanbekov/digit-recognizer-mnist and YASSINE GHOUZAM's Introduction to CNN Keras - 0.997 (top 6%) https://www.kaggle.com/code/yassineghouzam/introduction-to-cnn-keras-0-997-top-6.

As pointed out in Chapter 2 of Deep Learning with Python by François Chollet, this simple model suffers from overfitting, so I switched to a convnet approach in a separate notebook https://www.kaggle.com/code/garfield2021/digit-recognizer-keras-cnn. The test accuracy score increased from ~0.97 to ~0.989.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

# Import everything from tensorflow.keras so the standalone keras and
# tf.keras packages are not mixed.
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# 1. Process the Data

In [None]:
# Load the train and test datasets.
train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
test = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')

In [None]:
Y_train = train["label"]
X_train = train.drop(labels=["label"], axis=1)

del train

In [None]:
# The counts for the 10 digits are close to uniform.
g = sns.histplot(data=Y_train)
Y_train.value_counts()

In [None]:
# No missing values.
X_train.isnull().sum().max()

In [None]:
test.isnull().sum().max()

In [None]:
print("X_train shape: ", X_train.shape)
print("Y_train shape: ", Y_train.shape)
print("test shape: ", test.shape)

# 2. Normalize and Split

The pixel values of the images range from 0 through 255. Scale these values to a range of 0 to 1 by dividing the values by 255.0. This also converts the sample data from integers to floating-point numbers.
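
As a small illustration of the scaling step described above, the sketch below applies the same division to a tiny made-up array. The array `pixels` is hypothetical and stands in for the real image data.

```python
import numpy as np

# Hypothetical 2x3 "image" with 8-bit pixel intensities (illustration only).
pixels = np.array([[0, 128, 255],
                   [64, 32, 16]], dtype=np.uint8)

# Dividing by a float promotes the integer array to floating point
# and maps the range [0, 255] onto [0, 1].
scaled = pixels / 255.0

print(scaled.dtype)
print(scaled.min(), scaled.max())
```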

In [None]:
X_train, test = X_train / 255.0, test / 255.0

In [None]:
Y_train = to_categorical(Y_train, num_classes = 10)

In [None]:
# stratify - make sure classes are evenly represented across splits
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, stratify=Y_train, train_size=0.75)

In [None]:
plt.imshow(X_train.values.reshape(-1,28,28,1)[0], cmap=plt.get_cmap('gray'))

In [None]:
print("X_train shape: ", X_train.shape)
print("Y_train shape: ", Y_train.shape)
print("X_val shape: ", X_val.shape)
print("Y_val shape: ", Y_val.shape)

# 3. Build the Model

I have chosen a three-layer network with over 500 neurons, which should be capable of learning the relationships in the data. The Sequential model connects a list of layers in order from first to last: the first layer gets the input, and the last layer produces the output. A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs. Note that the model in this notebook does not include a batch normalization layer; it uses dropout for regularization.
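
The batch normalization step described above can be sketched in plain NumPy. This is a simplified, training-time illustration only, not Keras's implementation: the function name `batch_norm_forward`, the `eps` value, and the sample batch are assumptions made for the example, and the moving averages that a real layer keeps for inference are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale with gamma and beta."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # two trainable rescaling parameters

# Hypothetical batch: 4 samples, 3 features on very different scales.
batch = np.array([[1.0, 100.0, -5.0],
                  [2.0, 200.0, -6.0],
                  [3.0, 300.0, -7.0],
                  [4.0, 400.0, -8.0]])
gamma = np.ones(3)   # scale, typically initialized to 1
beta = np.zeros(3)   # shift, typically initialized to 0

out = batch_norm_forward(batch, gamma, beta)
print(out.mean(axis=0))  # approximately 0 for every feature
print(out.std(axis=0))   # approximately 1 for every feature
```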

In [None]:
model = Sequential([
        
    # the hidden ReLU layers
    Dense(256, activation='relu'),
    # randomly drop 20% of the previous layer's outputs during training
    Dropout(0.2),
    
    Dense(256, activation='relu'),
    Dropout(0.2),
    
    # the multi-class output layer
    Dense(10, activation='softmax')
])

In [None]:
# After defining the model, we compile in the optimizer and loss function. 
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

When a model is too eagerly learning noise, the validation loss may start to increase during training. To prevent this, we can simply stop the training whenever it seems the validation loss isn't decreasing anymore, i.e., early stopping. Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred. Training with early stopping also means we're in less danger of stopping the training too early, before the network has finished learning signal. So besides preventing overfitting from training too long, early stopping can also prevent underfitting from not training long enough.
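
The stopping rule described above can be sketched in pure Python. This is a simplified illustration of the idea, not Keras's actual `EarlyStopping` code: the function name `early_stopping_epoch` and the list of validation losses are made up for the example, while `patience` and `min_delta` mirror the values used in this notebook.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.001):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    An epoch counts as an improvement only if it beats the best loss so far
    by more than min_delta. Training stops after `patience` epochs in a row
    without improvement; best_epoch is where the weights would be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improve until epoch 2, then drift upward.
losses = [0.30, 0.20, 0.15, 0.1495, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)  # 7 2
```

Here the change at epoch 3 is smaller than `min_delta`, so it does not reset the counter, and training stops five epochs after the best one.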

In [None]:
early_stopping = EarlyStopping(
    
    # how many epochs to wait before stopping
    patience=5,
    
    # minimum amount of change to count as an improvement
    min_delta=0.001,
    
    restore_best_weights=True
)

We've told Keras to feed the optimizer 128 rows of the training data at a time (the batch_size) and to do that up to 10 times all the way through the dataset (the epochs); early stopping may end training sooner. Keras will keep you updated on the loss as the model trains.
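
As a rough sketch of what batch_size and epochs imply, the snippet below counts optimizer updates using the values passed to model.fit in the next cell (batch_size=128, epochs=10). The row count of 31,500 is an assumption (75% of a 42,000-row train.csv), and `steps_per_epoch` is a hypothetical helper written for this example.

```python
import math

def steps_per_epoch(n_rows, batch_size):
    """Number of batches needed to pass once through n_rows samples."""
    return math.ceil(n_rows / batch_size)

# Assumption: 31,500 training rows remain after the 75/25 split.
n_rows = 31_500
batch_size = 128
epochs = 10

per_epoch = steps_per_epoch(n_rows, batch_size)
print(per_epoch)           # 247 (the last batch is smaller than 128 rows)
print(per_epoch * epochs)  # 2470, an upper bound if early stopping ends training sooner
```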

In [None]:
history = model.fit(
    X_train, Y_train,
    validation_data=(X_val, Y_val),
    batch_size=128,
    epochs=10,
    
    # put your callbacks in a list
    callbacks=[early_stopping]
)

Often, a better way to view the loss is to plot it. The fit method keeps a record of the loss produced during training in a History object. We'll convert the data to a Pandas dataframe, which makes the plotting easy. The best training accuracy is close to 0.99 while the validation accuracy stays below 0.98; this gap indicates that the model overfits the training data.

In [None]:
history_df = pd.DataFrame(history.history)

history_df.loc[:, ['loss', 'val_loss']].plot(title="Cross-entropy")
print("Minimum validation loss: {}".format(history_df['val_loss'].min()))

history_df.loc[:, ['accuracy', 'val_accuracy']].plot(title="Accuracy")
print("Maximum validation accuracy: {}".format(history_df['val_accuracy'].max()))

# 4. Predict and Submit

In [None]:
# Make predictions with the model trained above.
predictions = model.predict(test)

In [None]:
# Select the index with the maximum probability
predictions = np.argmax(predictions, axis=1)

In [None]:
predictions = pd.Series(predictions, name='Label')

In [None]:
submission = pd.concat([pd.Series(range(1, 28001), name="ImageId"), predictions], axis=1)
submission.to_csv("submission.csv",index=False)