# Classification on `emnist`

## 1. Create `Readme.md` to document your work

Explain your choices, process, and outcomes.

## 2. Classify all symbols

### Choose a model

Your choice of model! Choose wisely...

### Train away!

Do you need to tune any parameters? Is the model expecting data in a different format?

### Evaluate the model

Evaluate the model on the test set, and analyze the confusion matrix to see where the model performs well and where it struggles.

### Investigate subsets

On which classes does the model perform well? Poorly? Evaluate again, excluding easily confused symbols (such as 'O' and '0').

### Improve performance

Brainstorm ways to improve performance. This could include trying different architectures, adding more layers, changing the loss function, or using data augmentation techniques.
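As one concrete illustration of data augmentation, the sketch below shifts an image by a few pixels so that extra, slightly different training samples can be generated. The helper name `shift_image` and the toy image are hypothetical, and whether this actually helps on `emnist` would need to be tested.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2D image by (dx, dy) pixels, filling vacated pixels with 0."""
    shifted = np.zeros_like(image)
    h, w = image.shape
    # Part of the source image that remains visible after the shift
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    shifted[top:top + src.shape[0], left:left + src.shape[1]] = src
    return shifted

# Toy 28x28 image with a single bright pixel (made-up data)
image = np.zeros((28, 28), dtype=np.uint8)
image[5, 5] = 255

# Move the content 2 pixels right and 1 pixel down
shifted = shift_image(image, dx=2, dy=1)
print(np.argwhere(shifted == 255))
```

Shifted copies keep the original label, so they can simply be appended to the training data before flattening.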

## 3. Classify digits vs. letters model showdown

Perform a full showdown classifying digits vs letters:

1. Create a column for whether each row is a digit or a letter
2. Choose an evaluation metric
3. Choose several candidate models to train
4. Divide data to reserve a validation set that will NOT be used in training/testing
5. K-fold train/test
    1. Create train/test splits from the non-validation dataset
    2. Train each candidate model (best practice: use the same split for all models)
    3. Apply the model to the test split
    4. (*Optional*) Perform hyperparameter search
    5. Record the model evaluation metrics
    6. Repeat with a new train/test split
6. Promote winner, apply model to validation set
7. (*Optional*) Perform hyperparameter search, if applicable
8. Report model performance
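The steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the `emnist` pipeline itself; the two candidate models, the F1 metric, and the choice of 5 folds are assumptions made for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the digit-vs-letter problem (binary labels)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Step 4: reserve a validation set that is never touched during k-fold
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 3: candidate models (hypothetical choices for illustration)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Step 5: k-fold train/test, reusing the same splits for every model
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(splitter.split(X_dev, y_dev))
scores = {name: [] for name in candidates}
for train_idx, test_idx in folds:
    for name, model in candidates.items():
        fitted = clone(model).fit(X_dev[train_idx], y_dev[train_idx])
        y_pred = fitted.predict(X_dev[test_idx])
        scores[name].append(f1_score(y_dev[test_idx], y_pred))

# Step 6: promote the winner and score it once on the validation set
mean_scores = {name: float(np.mean(vals)) for name, vals in scores.items()}
winner = max(mean_scores, key=mean_scores.get)
final_model = clone(candidates[winner]).fit(X_dev, y_dev)
val_f1 = f1_score(y_val, final_model.predict(X_val))
print(mean_scores, winner, val_f1)
```

Building the list of folds once and looping over the models inside each fold is what keeps the comparison fair: every candidate sees exactly the same train/test rows.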

In [None]:
!pip install emnist

In [None]:
# Import packages
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import emnist
from hashlib import sha1

In [None]:
# Load the data; each image comes as a 28x28 array

# The size of each image is 28x28 pixels
size = 28

# Extract the training split as images and labels
image, label = emnist.extract_training_samples('byclass')

# Create an empty DataFrame to hold the labels and images
raw_train = pd.DataFrame()

# Add a column showing the label
raw_train['label'] = label

# Add a column with the image data as a 28x28 array
raw_train['image'] = list(image)


# Repeat for the test split
image, label = emnist.extract_test_samples('byclass')
raw_test = pd.DataFrame()
raw_test['label'] = label
raw_test['image'] = list(image)

In [None]:
# Let's start cleaning!

# Labels! They're hard to understand as numbers, so let's map them to characters
# We can do this by manually creating a list, where the index is the numeric label:
LABELS = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
          'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
          'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

# Or generate the list of labels using the following code:
# create the characters list, which is the digits, then uppercase, then lowercase
chars = string.digits + string.ascii_uppercase + string.ascii_lowercase
# create the dictionary mapping the numbers to the characters
num_to_char = {i: chars[i] for i in range(len(chars))}
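A quick sanity check can confirm that the hand-written list and the generated mapping agree. The sketch below uses only the standard library; it rebuilds `LABELS` in three pieces rather than copying the full literal, which is purely a space-saving choice.

```python
import string

# The hand-written list from the cell above, rebuilt here in three pieces
LABELS = (
    list("0123456789")
    + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    + list("abcdefghijklmnopqrstuvwxyz")
)

# The generated version: digits, then uppercase, then lowercase
chars = string.digits + string.ascii_uppercase + string.ascii_lowercase
num_to_char = dict(enumerate(chars))

# Both approaches should give the same 62 classes in the same order
assert list(chars) == LABELS
assert len(num_to_char) == 62

print(num_to_char[0], num_to_char[10], num_to_char[36])  # prints: 0 A a
```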

In [None]:
raw_train['mapped_label'] = raw_train['label'].map(num_to_char)
print(raw_train[['mapped_label']])

raw_test['mapped_label'] = raw_test['label'].map(num_to_char)
print(raw_test[['mapped_label']])

def label_category(value):
    if pd.isnull(value):
        return pd.NA  # Use pd.NA for missing values
    # Try to convert to numeric, and check if the result is not NaN
    elif not pd.isnull(pd.to_numeric(value, errors='coerce')):
        return 'number'
    elif isinstance(value, str) and value.isalpha():
        return 'letter'
    else:
        return pd.NA  # Use pd.NA for any other case that is considered missing

raw_train['label_cat'] = raw_train['mapped_label'].apply(label_category)
print(raw_train['label_cat'])

raw_test['label_cat'] = raw_test['mapped_label'].apply(label_category)

       mapped_label
0                 Z
1                 a
2                 6
3                 3
4                 M
...             ...
697927            e
697928            l
697929            5
697930            B
697931            M

[697932 rows x 1 columns]
       mapped_label
0                 I
1                 a
2                 0
3                 3
4                 X
...             ...
116318            7
116319            t
116320            S
116321            0
116322            5

[116323 rows x 1 columns]
0         letter
1         letter
2         number
3         number
4         letter
           ...  
697927    letter
697928    letter
697929    number
697930    letter
697931    letter
Name: label_cat, Length: 697932, dtype: object


In [None]:
def label_category_code(value):
    if pd.isnull(value):
        return pd.NA  # Use pd.NA for missing values
    # Try to convert to numeric, and check if the result is not NaN
    elif not pd.isnull(pd.to_numeric(value, errors='coerce')):
        return 1
    elif isinstance(value, str) and value.isalpha():
        return 0
    else:
        return pd.NA  # Use pd.NA for any other case that is considered missing

raw_train['label_cat_code'] = raw_train['mapped_label'].apply(label_category_code)
print(raw_train['label_cat_code'])

raw_test['label_cat_code'] = raw_test['mapped_label'].apply(label_category_code)

0         0
1         0
2         1
3         1
4         0
         ..
697927    0
697928    0
697929    1
697930    0
697931    0
Name: label_cat_code, Length: 697932, dtype: int64
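The two helper functions above are applied row by row, which can be slow on roughly 700,000 rows. A vectorized alternative is sketched below on a small stand-in Series; it assumes every label is a single character with no missing values, which holds for the `num_to_char` mapping but is not checked here.

```python
import pandas as pd

# Small stand-in for raw_train['mapped_label'] (hypothetical sample values)
mapped = pd.Series(["Z", "a", "6", "3", "M"])

# Labels are single characters, so str.isdigit() separates digits from letters
is_digit = mapped.str.isdigit()

label_cat = is_digit.map({True: "number", False: "letter"})
label_cat_code = is_digit.astype(int)  # 1 = number, 0 = letter

print(label_cat.tolist())
print(label_cat_code.tolist())
```

If missing labels were possible, `astype(int)` would fail on them, so the original functions' explicit `pd.NA` handling would still be needed in that case.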


In [None]:
%pip install scikit-learn

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# Flatten each image if they are 2D arrays
X_train = np.array([image.flatten() for image in raw_train['image']])

# Ensure the target variable is in the correct shape
y_train = raw_train['label_cat_code'].values  # Assuming 'label_cat_code' is a column in a pandas DataFrame
y_train_allclass = raw_train['mapped_label'].values

# Create a validation set (this is what was called the test set in class)
# stratify=y_train keeps the digit/letter ratio the same in both parts
X_train, X_val, y_train, y_val, y_train_allclass, y_val_allclass = train_test_split(
    X_train, y_train, y_train_allclass, test_size=0.2, random_state=42, stratify=y_train
)

X_head = X_train[:1000]  # Using the first 1000 samples for a smaller training subset
y_head = y_train[:1000]
y_allclass_head = y_train_allclass[:1000]


Ex. Part 1 - Classifying all symbols using random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Define the parameter distribution
param_dist = {
    'n_estimators': randint(10, 50),  # Example: Number of trees in a range
    'max_depth': [None, 3, 5, 7],  # Example: Maximum depth of the tree
    # Add more parameters and distributions here
}

# Initialize the classifier
rf_classifier = RandomForestClassifier()

# Initialize the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=rf_classifier, param_distributions=param_dist, n_iter=50, cv=3, n_jobs=-1, verbose=2, random_state=42)
# If scoring is not specified, the search uses the estimator's default scorer,
# which is accuracy for classifiers.
# n_iter and cv are kept small so the run time stays short.

# Fit the random search to the data
random_search.fit(X_head, y_allclass_head)

# Print the best parameters
print("Best parameters found: ", random_search.best_params_)

# Use the best estimator for further predictions
best_rf_classifier = random_search.best_estimator_


Fitting 3 folds for each of 50 candidates, totalling 150 fits




KeyboardInterrupt: 

In [None]:
rf_classifier = RandomForestClassifier(n_estimators=100)
rf_classifier.fit(X_head, y_allclass_head)

In [None]:
# Estimate accuracy with three-fold cross-validation on the training subset
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Use a separate, unfitted classifier so the fitted rf_classifier above is not overwritten
cv_classifier = RandomForestClassifier(n_estimators=100)

# Perform three-fold cross-validation (cross_val_score fits clones of cv_classifier)
cv_scores = cross_val_score(cv_classifier, X_head, y_allclass_head, cv=3)

# Print the accuracy for each fold
print(f'Accuracy scores for each fold: {cv_scores}')

1.0
[[58  0  0 ...  0  0  0]
 [ 0 51  0 ...  0  0  0]
 [ 0  0 50 ...  0  0  0]
 ...
 [ 0  0  0 ...  4  0  0]
 [ 0  0  0 ...  0  2  0]
 [ 0  0  0 ...  0  0  6]]


In [None]:
print(f'The accuracy using the training dataset used to train the model is {accuracy_score(y_allclass_head, rf_classifier.predict(X_head))}')
print(f'The accuracy using the whole training dataset is {accuracy_score(y_train_allclass, rf_classifier.predict(X_train))}') # And I'm looking at the accuracy of the whole train data

The accuracy using the training dataset used to train the model is 1.0
The accuracy using the whole training dataset is 0.5636300137012062


In [None]:
# Evaluate on test set
X_test = np.array([image.flatten() for image in raw_test['image']])
print(X_test)
y_test_allclass = raw_test['mapped_label']
print(y_test_allclass)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
0         I
1         a
2         0
3         3
4         X
         ..
116318    7
116319    t
116320    S
116321    0
116322    5
Name: mapped_label, Length: 116323, dtype: object


In [None]:
# Prediction accuracy and confusion matrix on the test data
y_pred_test = rf_classifier.predict(X_test)
print(accuracy_score(y_test_allclass, y_pred_test))
print(confusion_matrix(y_test_allclass, y_pred_test))

0.5636374577684551
[[3628    5   13 ...    0    0    0]
 [   0 5657   29 ...    0    0    0]
 [  67   42 4990 ...    0    0   23]
 ...
 [   1   17  112 ...   38    1    1]
 [   1   27    4 ...    0    0    0]
 [   3    4  275 ...    0    0   16]]
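A 62x62 confusion matrix is hard to read when printed. One way to find where the model struggles is to list the largest off-diagonal entries and compute per-class recall. The sketch below does this on a made-up 3-class matrix; the helper name `most_confused` and the counts are hypothetical.

```python
import numpy as np

def most_confused(cm, labels, top=3):
    """Return the largest off-diagonal entries as (true, predicted, count)."""
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    # Sort the flattened indices by count, largest first
    order = np.argsort(off_diag, axis=None)[::-1][:top]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [(labels[r], labels[c], int(off_diag[r, c])) for r, c in zip(rows, cols)]

# Toy matrix with made-up counts (rows = true class, columns = predicted class)
labels = ["0", "O", "o"]
cm = np.array([[50, 30, 5],
               [25, 40, 20],
               [2, 18, 60]])

# Per-class recall: correct predictions divided by the row total
recall = cm.diagonal() / cm.sum(axis=1)

print(most_confused(cm, labels))
print(dict(zip(labels, recall.round(2))))
```

For the real matrix, the row order must match the label list; passing `labels=` to `confusion_matrix` makes that order explicit.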


In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Exclude '0' and 'O' from the evaluation
excluded_labels = ['0', 'O']

# Create a mask for letters (excluding 'O')
is_letter = np.array([label.isalpha() and label not in excluded_labels for label in y_test_allclass])

# Create a mask for numbers (excluding '0')
is_number = np.array([label.isdigit() and label not in excluded_labels for label in y_test_allclass])

# Filter the test set for letters
X_test_letters = X_test[is_letter]
y_test_letters = y_test_allclass[is_letter]

# Filter the test set for numbers
X_test_numbers = X_test[is_number]
y_test_numbers = y_test_allclass[is_number]

# Make predictions for letters
y_pred_letters = rf_classifier.predict(X_test_letters)

# Make predictions for numbers
y_pred_numbers = rf_classifier.predict(X_test_numbers)

# Calculate and print the accuracy for letters
accuracy_letters = accuracy_score(y_test_letters, y_pred_letters)
print(f'Accuracy for letters (excluding "O"): {accuracy_letters}')

# Calculate and print the accuracy for numbers
accuracy_numbers = accuracy_score(y_test_numbers, y_pred_numbers)
print(f'Accuracy for numbers (excluding "0"): {accuracy_numbers}')

# Count the number of letters (excluding 'O') - Looking at the distribution 
num_letters = len(y_test_letters)
print(f'Number of letters (excluding "O"): {num_letters}')

# Count the number of numbers (excluding '0')
num_numbers = len(y_test_numbers)
print(f'Number of numbers (excluding "0"): {num_numbers}')

Accuracy for letters (excluding "O"): 0.30717616914597506
Accuracy for numbers (excluding "0"): 0.8324319140774837
Number of letters (excluding "O"): 54249
Number of numbers (excluding "0"): 52140


Judging by this small model (trained on only 1,000 samples), the random forest classifies numbers far better than letters: test accuracy is about 0.83 for numbers versus about 0.31 for letters, excluding '0' and 'O'.

One idea for improving the model is to train on more letters relative to numbers, since the numbers may be pushing the model toward overfitting (high variance). The accuracy is 1.0 on the data the model was trained on but only about 0.56 on the whole training set and on the test set. This suggests that the model has very low bias, and that what we sacrificed in exchange is high variance.

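Limiting the depth of the trees in the random forest should also help reduce this variance. Using fewer trees is less likely to help, since averaging over more trees generally lowers variance rather than raising it.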
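One way to check how tree depth affects overfitting is to compare training accuracy with cross-validated accuracy for a few depths. The sketch below does this on synthetic data from `make_classification`, used here as a stand-in for the 1,000-sample subset; the depths and the number of trees are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic multi-class data standing in for X_head / y_allclass_head (assumption)
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=10, n_classes=4, random_state=0
)

results = {}
for depth in [3, 7, None]:
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    # Cross-validated accuracy approximates performance on unseen data
    cv_acc = cross_val_score(model, X, y, cv=3).mean()
    # Training accuracy: fit on everything, score on the same data
    train_acc = model.fit(X, y).score(X, y)
    results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

A large gap between the two numbers for a given depth points to overfitting; a depth where both are low points to underfitting.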

Ex. Part 2 - Linear logistic regression

In [None]:
# Try logistic regression

logistic_regression = LogisticRegression(solver = 'liblinear', random_state=0)
logistic_regression.fit(X_head, y_head)

In [None]:
confusion_matrix(y_train, logistic_regression.predict(X_train))

array([[173186, 109131],
       [103883, 172145]])
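The confusion matrix above can be turned into summary metrics by hand. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes, in sorted label order) together with the encoding from `label_cat_code`, where 0 is a letter and 1 is a number.

```python
# Counts copied from the confusion matrix printed above
tn, fp = 173186, 109131  # true letters predicted as letter / as number
fn, tp = 103883, 172145  # true numbers predicted as letter / as number

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the predicted numbers, how many really are numbers
recall = tp / (tp + fn)     # of the true numbers, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

All three values come out at roughly 0.62, only modestly better than guessing on a nearly balanced two-class problem, which is consistent with the model having been fit on just 1,000 samples.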

Ridge Regression

In [None]:
ridge_regression = Ridge(alpha=1.0)

k-nearest neighbour

Random Forest

Each model performance