# Classification on `emnist`

## 1. Create `Readme.md` to document your work

Explain your choices, process, and outcomes.

## 2. Classify all symbols

### Choose a model

Your choice of model! Choose wisely...

### Train away!

Do you need to tune any parameters? Does the model expect the data in a different format?
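As one illustration of the data-format question, many scikit-learn style models expect one flat row of features per sample rather than a 2D image. The sketch below uses a random array as a hypothetical stand-in for EMNIST images (the names `images`, `X`, and `X_scaled` are assumptions, not part of the assignment), flattens each 28x28 image to 784 features, and scales pixels to the range 0 to 1.

```python
import numpy as np

# Hypothetical stand-in for a batch of EMNIST images: 5 images of 28x28 uint8 pixels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a single row of 784 features
X = images.reshape(len(images), -1)

# Scale pixel intensities from [0, 255] to [0.0, 1.0]
X_scaled = X.astype(np.float32) / 255.0

print(X.shape, X_scaled.dtype)
```

Whether scaling helps depends on the model you choose; tree-based models are largely insensitive to it, while linear models and neural networks usually train more smoothly on scaled inputs.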

### Evaluate the model

Evaluate the model on the test set, then analyze the confusion matrix to see where the model performs well and where it struggles.
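A minimal sketch of reading a confusion matrix, assuming scikit-learn is available. The labels and predictions here are a tiny hand-written toy example, not real model output; with EMNIST you would pass your test labels and predictions instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (hypothetical, for illustration only)
labels = ['0', 'O', 'l']
y_true = np.array(['0', '0', '0', 'O', 'O', 'l', 'l', 'l'])
y_pred = np.array(['0', 'O', '0', 'O', '0', 'l', 'l', 'l'])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class recall: correct predictions divided by the number of true samples
recall = cm.diagonal() / cm.sum(axis=1)

# Find the most frequent mistake by ignoring the diagonal
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
i, j = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print(cm)
print(dict(zip(labels, recall)))
print(f"Most common confusion: true {labels[i]!r} predicted as {labels[j]!r}")
```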

### Investigate subsets

On which classes does the model perform well? Poorly? Evaluate again, excluding easily confused symbols (such as 'O' and '0').
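One way to re-evaluate without the confusable symbols is to mask them out of the true labels before scoring. This is a sketch on toy data; the `confusable` list is a hypothetical choice, and you would pick yours from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predictions (hypothetical, for illustration only)
y_true = np.array(['0', 'O', 'o', 'A', '7', '0', 'O', 'b'])
y_pred = np.array(['O', 'O', '0', 'A', '7', '0', 'o', 'b'])

# Hypothetical set of symbols that are easy to confuse with each other
confusable = ['0', 'O', 'o']

# Keep only the rows whose true label is outside the confusable set
keep = ~np.isin(y_true, confusable)

acc_all = accuracy_score(y_true, y_pred)
acc_subset = accuracy_score(y_true[keep], y_pred[keep])

print(f"All classes: {acc_all:.3f}, without confusable symbols: {acc_subset:.3f}")
```

Filtering on the true label keeps the comparison honest: a sample is dropped because of what it is, not because of what the model guessed.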

### Improve performance

Brainstorm ways to improve performance. These could include trying different architectures, adding more layers, changing the loss function, or using data augmentation techniques.
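As an example of data augmentation, the sketch below adds copies of each image shifted by one pixel in four directions. It assumes SciPy is installed and that the image background is 0; the helper name `augment_with_shifts` and the toy images are hypothetical.

```python
import numpy as np
from scipy.ndimage import shift

def augment_with_shifts(images, labels, offsets=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return the originals plus one shifted copy per (dy, dx) offset."""
    all_images = [images]
    all_labels = [labels]
    for dy, dx in offsets:
        # order=0 avoids interpolation; the vacated edge is filled with 0 (background)
        shifted = np.stack([
            shift(img, (dy, dx), order=0, mode='constant', cval=0) for img in images
        ])
        all_images.append(shifted)
        all_labels.append(labels)
    return np.concatenate(all_images), np.concatenate(all_labels)

# Toy stand-in for EMNIST: two blank 28x28 images with a single bright pixel
toy_images = np.zeros((2, 28, 28), dtype=np.uint8)
toy_images[:, 14, 14] = 255
toy_labels = np.array([3, 7])

aug_images, aug_labels = augment_with_shifts(toy_images, toy_labels)
print(aug_images.shape, aug_labels.shape)
```

Apply augmentation to the training data only, so that test and validation scores still reflect unmodified images.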

## 3. Classify digits vs. letters model showdown

Perform a full showdown classifying digits vs letters:

1. Create a column for whether each row is a digit or a letter
2. Choose an evaluation metric 
3. Choose several candidate models to train
4. Divide data to reserve a validation set that will NOT be used in training/testing
5. K-fold train/test
    1. Create train/test splits from the non-validation dataset 
    2. Train each candidate model (best practice: use the same split for all models)
    3. Apply the model to the test split
    4. (*Optional*) Perform hyperparameter search
    5. Record the model evaluation metrics
    6. Repeat with a new train/test split
6. Promote winner, apply model to validation set
7. (*Optional*) Perform hyperparameter search, if applicable
8. Report model performance
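The steps above can be sketched as follows. This is a minimal outline under stated assumptions, not a prescribed solution: it uses synthetic data from `make_classification` in place of EMNIST, F1 as the evaluation metric, and two arbitrary candidate models. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the digit-vs-letter problem
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Step 4: reserve a validation set that the k-fold loop never sees
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 3: candidate models (arbitrary choices for this sketch)
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(n_neighbors=5),
}

# Step 5: k-fold train/test, using the same split for every model
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: [] for name in candidates}
for train_idx, test_idx in kfold.split(X_dev):
    for name, model in candidates.items():
        model.fit(X_dev[train_idx], y_dev[train_idx])
        predictions = model.predict(X_dev[test_idx])
        scores[name].append(f1_score(y_dev[test_idx], predictions))

# Step 6: promote the model with the best mean score, then refit on all non-validation data
mean_scores = {name: float(np.mean(values)) for name, values in scores.items()}
winner = max(mean_scores, key=mean_scores.get)
final_model = candidates[winner].fit(X_dev, y_dev)

# Step 8: report performance on the untouched validation set
val_f1 = f1_score(y_val, final_model.predict(X_val))
print(mean_scores)
print(f"Winner: {winner}, validation F1: {val_f1:.3f}")
```

Looping over folds on the outside and models on the inside is what guarantees that every candidate is trained and scored on identical splits.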

In [1]:
# Import packages
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import emnist
from hashlib import sha1

In [2]:
# Load the data; each image is kept as a 28x28 array

# The size of each image is 28x28 pixels
size = 28 

# Extract the training split as images and labels
image, label = emnist.extract_training_samples('byclass')

# Start with an empty DataFrame; each image is stored whole in one column (not as 784 pixel columns)
raw_train = pd.DataFrame()

# Add a column showing the label
raw_train['label'] = label

# Add a column with the image data as a 28x28 array
raw_train['image'] = list(image)


# Repeat for the test split
image, label = emnist.extract_test_samples('byclass')
raw_test = pd.DataFrame()
raw_test['label'] = label
raw_test['image'] = list(image)

In [3]:
# Let's start cleaning!

# Labels! They're hard to understand as numbers, so let's map them to characters
# We can do this by manually writing out the list of characters, indexed by label:
LABELS = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
          'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
          'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

# Or generate the list of labels using the following code:
# create the characters list, which is the digits, then uppercase, then lowercase
chars = string.digits + string.ascii_uppercase + string.ascii_lowercase
# create the dictionary mapping the numbers to the characters
num_to_char = {i: chars[i] for i in range(len(chars))}

In [4]:
raw_train['mapped_label'] = raw_train['label'].map(num_to_char)
print(raw_train[['mapped_label']])

def label_category(value):
    if pd.isnull(value):
        return pd.NA  # Use pd.NA for missing values
    # Try to convert to numeric, and check if the result is not NaN
    elif not pd.isnull(pd.to_numeric(value, errors='coerce')):
        return 'number'
    elif isinstance(value, str) and value.isalpha():
        return 'letter'
    else:
        return pd.NA  # Use pd.NA for any other case that is considered missing

raw_train['label_cat'] = raw_train['mapped_label'].apply(label_category)
print(raw_train['label_cat'])

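# The test split needs its own mapped_label column before it can be categorized
raw_test['mapped_label'] = raw_test['label'].map(num_to_char)
raw_test['label_cat'] = raw_test['mapped_label'].apply(label_category)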

       mapped_label
0                 Z
1                 a
2                 6
3                 3
4                 M
...             ...
697927            e
697928            l
697929            5
697930            B
697931            M

[697932 rows x 1 columns]
0         letter
1         letter
2         number
3         number
4         letter
           ...  
697927    letter
697928    letter
697929    number
697930    letter
697931    letter
Name: label_cat, Length: 697932, dtype: object


In [5]:
def label_category_code(value):
    if pd.isnull(value):
        return pd.NA  # Use pd.NA for missing values
    # Try to convert to numeric, and check if the result is not NaN
    elif not pd.isnull(pd.to_numeric(value, errors='coerce')):
        return 1
    elif isinstance(value, str) and value.isalpha():
        return 0
    else:
        return pd.NA  # Use pd.NA for any other case that is considered missing
    
raw_train['label_cat_code'] = raw_train['mapped_label'].apply(label_category_code)
print(raw_train['label_cat_code'])

raw_test['label_cat_code'] = raw_test['mapped_label'].apply(label_category_code)

0         0
1         0
2         1
3         1
4         0
         ..
697927    0
697928    0
697929    1
697930    0
697931    0
Name: label_cat_code, Length: 697932, dtype: int64
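The two row-by-row functions above can also be expressed with vectorized string methods, which tends to be faster on a DataFrame of this size. This is an alternative sketch rather than the notebook's own approach; it assumes every mapped label is a single alphanumeric character with no missing values, and the small `mapped` Series is a hypothetical stand-in for `raw_train['mapped_label']`.

```python
import pandas as pd

# Hypothetical stand-in for raw_train['mapped_label']
mapped = pd.Series(['Z', 'a', '6', '3', 'M'], name='mapped_label')

# True where the character is a digit, False where it is a letter
is_digit = mapped.str.isdigit()

# Text category and numeric code derived from the same boolean mask
label_cat = is_digit.map({True: 'number', False: 'letter'})
label_cat_code = is_digit.astype(int)

print(pd.DataFrame({'mapped_label': mapped, 'label_cat': label_cat, 'label_cat_code': label_cat_code}))
```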


In [8]:
pip install scikit-learn


Linear logistic regression

In [10]:
# Try logistic regression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split  # used below to create the validation set

logistic_regression = LogisticRegression(solver = 'liblinear', random_state=0)

LogisticRegression(random_state=0, solver='liblinear')


In [17]:
# Flatten each image if they are 2D arrays
X_train = np.array([image.flatten() for image in raw_train['image']])

# Ensure the target variable is in the correct shape
y_train = raw_train['label_cat_code'].values  # Assuming 'label_cat_code' is a column in a pandas DataFrame

# Create validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Fit the model
logistic_regression.fit(X_train, y_train)

In [None]:
# Evaluate on the held-out split created above. predict() needs the same flattened
# 784-feature rows that fit() saw, not the column of 28x28 arrays in raw_train['image'].
# Run the fitting cell first: calling predict() on an unfitted model raises NotFittedError.
confusion_matrix(y_val, logistic_regression.predict(X_val))

Ridge Regression

In [None]:
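# Note: Ridge is a regressor, so its outputs on the 0/1 target need thresholding; RidgeClassifier is the classification variant
ridge_regression = Ridge(alpha=1.0)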

k-nearest neighbour

Random Forest

Each model performance
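One possible way to lay out the per-model comparison is a small table of metrics computed on the same labels. The sketch below uses hand-written toy predictions for two placeholder models (`model_a` and `model_b` are hypothetical names, and the numbers are not results from this notebook); replace them with each fitted model's predictions on the same split.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy ground truth and predictions (hypothetical, for illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predictions = {
    'model_a': np.array([1, 0, 1, 0, 0, 0, 1, 1]),
    'model_b': np.array([1, 0, 1, 1, 0, 1, 1, 0]),
}

# Compute the same metrics for every model so the rows are comparable
rows = {
    name: {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    }
    for name, y_pred in predictions.items()
}

# One row per model, best F1 first
results = pd.DataFrame.from_dict(rows, orient='index').sort_values('f1', ascending=False)
print(results)
```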