<a href="https://colab.research.google.com/github/bradyschiu/raiso-winter/blob/main/RAISO_hand_symbol_recognizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hand Symbol Recognizer #

Winter quarter hand symbol recognizer, in collaboration with the goat Gustavo Mercier and Northwestern RAISO club.

## Installing Libraries ##

Click the Play icon below to import the libraries. Clicking the Play icon in any cell runs the Python code in that cell.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import keras
import pandas as pd
import kagglehub
from keras.callbacks import ReduceLROnPlateau
from keras.models import Sequential
from keras.layers import Dense, Conv2D , MaxPool2D , Flatten , Dropout , BatchNormalization
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Retrieving Data ##

In the interest of time, you can collapse the "Retrieving Data" section and run all of its cells at once. However, if you are interested in learning what each line is doing, I highly encourage you to read on and do some sleuthing to see which line corresponds to each part of our explanation.

In [None]:
# Download the dataset from Kaggle
path = kagglehub.dataset_download("datamunge/sign-language-mnist")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/sign-language-mnist


We have two datasets, a training set and a testing set. For example, when you study for a test, you have practice problems where you train your skills, and you have the real exam problems that you test your skills on.

In [None]:
# Reads the csv data from the downloaded files and converts them into a Pandas DataFrame
# This will allow us to train our neural network on the data
train_df = pd.read_csv(path + "/" + "sign_mnist_train.csv")
test_df = pd.read_csv(path + "/" + "sign_mnist_test.csv")

The 'label' column contains the actual sign language letter each image represents -- this is what we want the model to eventually predict. So, we need to remove this column from both the testing and training dataframes because we don't want the answers mixed in with the model's inputs during training. We want it to learn the patterns from the image data, and then be tested on its ability to predict the labels.

In [None]:
# Separate the labels (or what we want to predict) from the dataset
# This keeps the answers out of the input features (we don't want the model to see what we want it to predict)
y_train = train_df['label']
y_test = test_df['label']
del train_df['label']
del test_df['label']

1. **Label Binarization**
We create a LabelBinarizer object. It converts categorical labels into a one-hot (binary) format. We then use it to encode the y_train and y_test labels (the same ones we removed earlier).

2. **Feature Scaling**
We extract the feature data (the pixel values of the images) from the dataframes and store them in x_train and x_test. We then divide each by 255, which normalizes the pixel values to a range between 0 and 1 and helps the model train more reliably.

3. **Reshaping**
We reshape x_train and x_test to the shape (-1, 28, 28, 1) to match the accepted input for a CNN. -1 indicates that the size of the first dimension should be inferred from the number of samples. 28, 28 is the dimension of the image, and 1 specifies that it's a grayscale image.

In [None]:
# Modifies the data values:
#   One-hot encodes the labels so they match the model's softmax output
#   Scales each grayscale pixel value from 0-255 down to 0-1
#   Reshapes each row of 784 pixel values into a 28 x 28 x 1 image
label_binarizer = LabelBinarizer()
y_train = label_binarizer.fit_transform(y_train)
# Reuse the encoding learned from the training labels instead of refitting on the test labels
y_test = label_binarizer.transform(y_test)

x_train = train_df.values
x_test = test_df.values

x_train = x_train / 255
x_test = x_test / 255

x_train = x_train.reshape(-1,28,28,1)
x_test = x_test.reshape(-1,28,28,1)
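
To make the three preprocessing steps concrete, here is a minimal NumPy-only sketch on made-up data. The toy `pixels` and `labels` arrays are assumptions for illustration, and the one-hot step only mimics what LabelBinarizer produces for more than two classes (one column per distinct label, in sorted order) rather than calling it.

```python
import numpy as np

# Toy stand-in for the real data: 3 "images" of 28 * 28 = 784 pixel values each
rng = np.random.default_rng(seed=0)
pixels = rng.integers(0, 256, size=(3, 784))
labels = np.array([3, 0, 24])

# 1. One-hot encode: one column per distinct label, in sorted order
classes = np.unique(labels)                                   # sorted distinct labels
one_hot = (labels[:, None] == classes[None, :]).astype(int)

# 2. Scale the pixel values from 0-255 down to 0-1
scaled = pixels / 255

# 3. Reshape each flat row into a 28 x 28 image with 1 (grayscale) channel
images = scaled.reshape(-1, 28, 28, 1)

print(one_hot)
print(images.shape)
```

Each row of `one_hot` has a single 1 in the column of its label, and the -1 in the reshape is resolved to 3 because there are 3 samples.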

Next, we use an ImageDataGenerator from Keras to augment the data. Specifically, we artificially increase the variety of the training dataset by generating randomly modified versions of the existing images.

*rotation_range, zoom_range, width_shift_range, height_shift_range*
Define the range of random transformations to apply to the images, such as rotations, zooms, and shifts.

*horizontal_flip, vertical_flip*
Control random horizontal and vertical flips. Both are set to False here, so no images are flipped.

Lastly, we fit the ImageDataGenerator to the training data (x_train). This calculates the statistics (mean, standard deviation) that the featurewise normalization and ZCA whitening options need. Since those options are all set to False here, the call does not change the augmentation.
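
As a rough illustration of what a width shift does to an image, here is a small NumPy sketch. It is a simplification and not how ImageDataGenerator works internally: the helper `shift_horizontally` is a hypothetical name, it moves the image by whole pixels, and it fills the gap with zeros.

```python
import numpy as np

def shift_horizontally(image, shift):
    """Shift a 2D image right by `shift` pixels (left if negative), filling the gap with zeros."""
    shifted = np.zeros_like(image)
    width = image.shape[1]
    if shift > 0:
        shifted[:, shift:] = image[:, :width - shift]
    elif shift < 0:
        shifted[:, :width + shift] = image[:, -shift:]
    else:
        shifted[:] = image
    return shifted

rng = np.random.default_rng(seed=0)
image = np.arange(1, 28 * 28 + 1).reshape(28, 28)   # toy 28 x 28 "image"

# A shift range of 0.1 on a 28-pixel-wide image allows at most 2 whole pixels
max_shift = int(0.1 * 28)
shift = int(rng.integers(-max_shift, max_shift + 1))
augmented = shift_horizontally(image, shift)
```

The label stays the same after the shift, which is the point of augmentation: the model sees the same sign in slightly different positions.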

In [None]:
datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False)  # randomly flip images

datagen.fit(x_train)

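# Split the original test set in half: the first half stays as the test set,
# the second half becomes the validation set used during training
midpoint = len(x_test) // 2
x_test, y_test, x_valid, y_valid = x_test[:midpoint], y_test[:midpoint], x_test[midpoint:], y_test[midpoint:]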

## CNN ##

Next up, we'll need to construct a model that can be trained. We went over a lot of the context earlier, but to summarize we'll be creating a barebones Convolutional Neural Network (or CNN) to predict what sign we're using, and then training it over several iterations to see how accurate we can get it.

Use [this link](https://keras.io/) to find the appropriate documentation, complete the TODO statements, and run the model!

First, we create a Sequential model, which is a linear stack of layers used in Keras to build a neural network.

A layer is just a processing step that transforms input data to something we can work with.

In [None]:
model = Sequential()

Now we add layers to the model, including **Conv2D** (convolutional layer), **MaxPool2D** (pooling layer), **Flatten** (flattening layer), and **Dense** (fully connected layer).

My notes on the different layers in a CNN are documented here: https://docs.google.com/document/d/1SlFrsZSmV-sS4_ooRl9xt0SADEzBClv7mF2kJ8nTNiw/edit?usp=sharing

TODOS:
We added two more Dense layers after the initial linear layer. One with 256 neurons and relu activation. Another with 128 neurons and relu activation.
We also added a Dropout layer with a rate of 0.2 after the first added Dense layer.
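
A quick way to sanity-check the convolutional part of this notebook's model is to trace how the image shrinks through it. The sketch below only does the arithmetic, using the rule that a layer with padding='same' outputs ceil(size / stride) along each axis. The helper name `same_padding_output` is made up for this example.

```python
import math

def same_padding_output(size, stride):
    """Output length along one axis for a layer with padding='same'."""
    return math.ceil(size / stride)

size = 28                          # input images are 28 x 28
filters_per_block = [75, 50, 25]   # filters in each Conv2D layer of the model
shapes = []
for filters in filters_per_block:
    size = same_padding_output(size, stride=1)   # Conv2D with strides = 1 keeps the size
    size = same_padding_output(size, stride=2)   # MaxPool2D with strides = 2 halves it, rounding up
    shapes.append((size, size, filters))

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)      # [(14, 14, 75), (7, 7, 50), (4, 4, 25)]
print(flattened)   # 400
```

So Flatten hands 400 values to the first Dense layer. Running model.summary() should report the same shapes.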

In [None]:
model = Sequential()

# Convolutional layers
model.add(Conv2D(75 , (3,3) , strides = 1 , padding = 'same' , activation = 'relu' , input_shape = (28,28,1)))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))

model.add(Conv2D(50 , (3,3) , strides = 1 , padding = 'same' , activation = 'relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))

model.add(Conv2D(25 , (3,3) , strides = 1 , padding = 'same' , activation = 'relu'))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))

# Convolutional -> Linear layers
model.add(Flatten())

# Linear layers
model.add(Dense(units = 512 , activation = 'relu'))
model.add(Dropout(0.3))

# TODO: Add more linear layers using the two lines above as a template!
# TODO: Add a layer with as many neurons (units) as you'd like!
model.add(Dense(units = 256 , activation = 'relu'))  # Added a layer with 256 neurons

# TODO: Add a layer with a dropout of your choice to see how it affects the accuracy!
model.add(Dropout(0.2))  # Added a dropout layer with rate 0.2

# Warning: The more layers and neurons, the more complex the model, and the slower it might train!
model.add(Dense(units = 128 , activation = 'relu'))  # Added another layer with 128 neurons


# Final layer that converts to different letters
model.add(Dense(units = 24 , activation = 'softmax'))



Our model uses the Adam optimizer and categorical cross entropy.

**Adam** stands for Adaptive Moment Estimation. The optimizer is what updates the network's weights during training. Adam adapts the step size for each weight using running averages of the gradient and of its square.

**Categorical cross entropy** is a loss function used when the task involves classifying inputs into multiple categories. It measures the difference between the network's predicted probability distribution (from the softmax output) and the actual distribution (the true class, usually one-hot encoded). Minimizing this loss during training helps the network improve its accuracy in predictions. In a one-hot encoding, each sample gets a vector where the entry corresponding to the correct class is 1, and all the others are 0.
For example, if you have three classes and the correct class for an image is the second one, its one-hot encoding would be [0, 1, 0]
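
Here is a small worked example of that loss in plain Python, using the same three-class target [0, 1, 0]. The raw scores passed to `softmax` are invented for illustration. Keras computes the same quantity internally, with extra numerical safeguards.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

target = [0, 1, 0]                     # the correct class is the second one
confident = softmax([0.0, 4.0, 0.0])   # strongly favors the second class
unsure = softmax([1.0, 1.0, 1.0])      # spreads probability evenly

loss_confident = categorical_cross_entropy(target, confident)
loss_unsure = categorical_cross_entropy(target, unsure)
print(loss_confident, loss_unsure)
```

The confident, correct prediction gets a loss close to 0, while the evenly spread one gets log(3), so training pushes the network toward putting more probability on the true class.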

metrics = ['accuracy'] specifies the metrics used to evaluate the model's performance during training and validation.


In [None]:
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.summary()

Now we train the model using the provided training and validation data. datagen.flow creates a data generator that yields batches of augmented training data. We specify the number of training epochs (full passes over the training data), in this case, 3. Then, we specify the validation data (x_valid, y_valid) used to monitor the model's performance during training. x_test and y_test are held back for the final evaluation.

TODOS:
We adjusted the ReduceLROnPlateau parameters to patience=3: Wait for 3 epochs without improvement before reducing the learning rate.
min_lr=0.000001: Set a lower minimum learning rate.
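
To see how patience, factor, and min_lr interact, here is a simplified plain-Python simulation of the plateau logic. It is a sketch, not the Keras implementation: `simulate_reduce_lr` is a hypothetical helper, it ignores the cooldown option, and it assumes an improvement threshold (min_delta) of 1e-4.

```python
def simulate_reduce_lr(val_accuracies, lr=0.001, patience=3, factor=0.5,
                       min_lr=0.000001, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("-inf")
    wait = 0
    history = []
    for val_acc in val_accuracies:
        if val_acc > best + min_delta:   # counts as an improvement
            best = val_acc
            wait = 0
        else:
            wait += 1                    # another epoch without improvement
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Validation accuracy stalls at 0.60 for three epochs, so the rate is halved once
plateau_run = simulate_reduce_lr([0.50, 0.60, 0.60, 0.60, 0.60, 0.61])

# With only 3 epochs, patience=3 can never be reached, even with no improvement at all
short_run = simulate_reduce_lr([0.50, 0.50, 0.50])
```

This suggests that the callback only starts to matter once epochs is raised above 3, or patience is lowered.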

Grid Search, which tries every combination of candidate settings such as dropout rates and optimizers, is left as the final TODO in the cell below. It is one way to systematically search for better hyperparameter settings.
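
One possible shape for a grid search is sketched below in plain Python. Everything in it is an assumption for illustration: `grid_search` and `fake_score` are hypothetical names, and `fake_score` is a placeholder that returns made-up numbers so the sketch runs on its own. In the notebook, the scoring function would instead build and train a model with the given settings and return its accuracy on (x_valid, y_valid).

```python
from itertools import product

def grid_search(param_grid, train_and_score):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def fake_score(dropout, optimizer):
    """Placeholder scorer with made-up values; swap in real training and validation."""
    penalty = 0.05 if optimizer == "sgd" else 0.0
    return 1.0 - abs(dropout - 0.3) - penalty

param_grid = {"dropout": [0.2, 0.3, 0.5], "optimizer": ["adam", "sgd"]}
best_params, best_score = grid_search(param_grid, fake_score)
print(best_params, best_score)
```

Scoring on the validation set rather than the test set keeps the final test accuracy an honest measure. Note that the grid above already means 6 full training runs, so the cost grows quickly as options are added.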

In [None]:
# TODO: change the inputs to see how the learning rate affects the results!
# for reference, review the ReduceLROnPlateau documentation
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', patience=3, verbose=1, factor=0.5, min_lr=0.000001)  # Adjusted parameters

# TODO: Increase the epochs to see if the validation accuracy levels off!
# Warning: Increasing epochs might increase training time
history = model.fit(datagen.flow(x_train,y_train, batch_size = 128), epochs = 3, validation_data = (x_valid, y_valid), callbacks = [learning_rate_reduction])

# Evaluate the model on the test data. This final accuracy will be your score!
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)

print(f'Loss: {loss:.4f}')
print(f'Accuracy: {accuracy:.4f}')

# TODO: Use a Hyperparameter Tuning Algorithm like Grid Search!




Epoch 1/3
215/215 ━━━━━━━━━━━━━━━━━━━━ 117s 510ms/step - accuracy: 0.3829 - loss: 2.0102 - val_accuracy: 0.0728 - val_loss: 4.6332 - learning_rate: 0.0010
Epoch 2/3
215/215 ━━━━━━━━━━━━━━━━━━━━ 111s 517ms/step - accuracy: 0.8692 - loss: 0.3761 - val_accuracy: 0.5800 - val_loss: 1.2886 - learning_rate: 0.0010
Epoch 3/3
215/215 ━━━━━━━━━━━━━━━━━━━━ 108s 501ms/step - accuracy: 0.9413 - loss: 0.1645 - val_accuracy: 0.9144 - val_loss: 0.2545 - learning_rate: 0.0010
Loss: 0.2637
Accuracy: 0.9102


Accuracy of 96.54%
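
To turn the model's 24 output probabilities back into a letter, something like the sketch below could be used. It assumes the dataset's label scheme of 0-25 for A-Z, with 9 (J) and 25 (Z) absent because those signs involve motion. The helper names are made up for this example, and in the notebook label_binarizer.classes_ should hold the same list as `present_labels`.

```python
# Assumed label scheme: 0-25 map to A-Z, but 9 (J) and 25 (Z) never appear in the data
present_labels = [label for label in range(26) if label not in (9, 25)]

def index_to_letter(class_index):
    """Map a position in the 24-way softmax output to its letter."""
    label = present_labels[class_index]
    return chr(ord("A") + label)

def predict_letter(probabilities):
    """Pick the most likely class from one row of model.predict output."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index_to_letter(best_index)

# Made-up probabilities: nearly all of the weight sits on output index 9
example = [0.01] * 24
example[9] = 0.77
print(predict_letter(example))   # K
```

Output index 9 maps to K rather than J because J is skipped, which is why the index cannot simply be added to 'A'.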

# Transformer (thank you Gustavo) #



In [None]:
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import MultiHeadAttention, LayerNormalization, GlobalAveragePooling1D, Reshape, Embedding

def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Normalization and Attention
    x = LayerNormalization(epsilon=1e-6)(inputs)
    x = MultiHeadAttention(num_heads=num_heads, key_dim=head_size)(x, x)
    x = Dropout(dropout)(x)
    res = x + inputs

    # Feed Forward Part
    x = LayerNormalization(epsilon=1e-6)(res)
    x = Dense(ff_dim, activation="relu")(x)
    x = Dense(inputs.shape[-1])(x)
    x = Dropout(dropout)(x)
    return x + res

# The code might look different, but it is functionally the same as what we've been doing
# Just with a transformer
inputs = keras.Input(shape=(28,28,1))

x = Conv2D(75, (3,3), strides=1, padding='same', activation='relu')(inputs)
x = BatchNormalization()(x)
x = MaxPool2D((2,2), strides=2, padding='same')(x)

x = Conv2D(50, (3,3), strides=1, padding='same', activation='relu')(x)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
x = MaxPool2D((2,2), strides=2, padding='same')(x)

x = Conv2D(25, (3,3), strides=1, padding='same', activation='relu')(x)
x = BatchNormalization()(x)
cnn_output = MaxPool2D((2,2), strides=2, padding='same')(x)


# Transformer Integration (4x4x25 -> 16x25 sequence)
x = Reshape((16, 25))(cnn_output)

# Add positional embeddings
positions = Embedding(input_dim=16, output_dim=25)(tf.range(start=0, limit=16, delta=1))
x = x + positions

# Transformer Encoder Block
x = transformer_encoder(x, head_size=25, num_heads=4, ff_dim=128, dropout=0.1)


x = GlobalAveragePooling1D()(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.3)(x)
outputs = Dense(24, activation='softmax')(x)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

history = model.fit(datagen.flow(x_train,y_train, batch_size = 128), epochs = 3, validation_data = (x_valid, y_valid), callbacks = [learning_rate_reduction])
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)

print(f'Loss: {loss:.4f}')
print(f'Accuracy: {accuracy:.4f}')

Epoch 1/3
215/215 ━━━━━━━━━━━━━━━━━━━━ 123s 538ms/step - accuracy: 0.3174 - loss: 2.1737 - val_accuracy: 0.0343 - val_loss: 17.1471 - learning_rate: 0.0010
Epoch 2/3
215/215 ━━━━━━━━━━━━━━━━━━━━ 115s 533ms/step - accuracy: 0.8028 - loss: 0.5418 - val_accuracy: 0.0463 - val_loss: 16.6977 - learning_rate: 0.0010
Epoch 3/3
215/215 ━━━━━━━━━━━━━━━━━━━━ 139s 522ms/step - accuracy: 0.9034 - loss: 0.2750 - val_accuracy: 0.4258 - val_loss: 3.4281 - learning_rate: 0.0010
Loss: 3.6368
Accuracy: 0.4172
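
The step where the 4x4x25 CNN output becomes a 16x25 sequence can be hard to picture, so here is a NumPy-only sketch of the shapes involved. The random arrays are stand-ins: `feature_map` plays the role of cnn_output for a single image, and `positions` plays the role of the learned Embedding weights. The attention block itself is left out.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
feature_map = rng.normal(size=(1, 4, 4, 25))   # (batch, height, width, channels)

# Reshape: each of the 16 spatial positions becomes one 25-dimensional token
tokens = feature_map.reshape(1, 16, 25)

# Token i comes from row i // 4 and column i % 4 of the feature map
row, col = divmod(5, 4)
same_values = np.array_equal(tokens[0, 5], feature_map[0, row, col])

# Positional embeddings: one vector per token position, broadcast over the batch
positions = rng.normal(size=(16, 25))
tokens_with_position = tokens + positions

# GlobalAveragePooling1D averages over the 16 tokens, leaving one vector per image
pooled = tokens_with_position.mean(axis=1)
print(tokens.shape, pooled.shape)   # (1, 16, 25) (1, 25)
```

Without the positional embeddings, the attention layer would have no way to tell which part of the image a token came from, since reshaping alone only reorders the values.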
