<a href="https://colab.research.google.com/github/ambwhl/datasci_223/blob/exercise-4/exercises/4-classification/exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification on `emnist`

## 1. Create `Readme.md` to document your work

Explain your choices, process, and outcomes.

## 2. Classify all symbols

### Choose a model

Your choice of model! Choose wisely...

### Train away!

Is do you need to tune any parameters? Is the model expecting data in a different format?

### Evaluate the model

Evaluate the models on the test set, analyze the confusion matrix to see where the model performs well and where it struggles.

### Investigate subsets

On which classes does the model perform well? Poorly? Evaluate again, excluding easily confused symbols (such as 'O' and '0').

### Improve performance

Brainstorm for improving the performance. This could include trying different architectures, adding more layers, changing the loss function, or using data augmentation techniques.

## 2. Classify digits vs. letters model showdown

Perform a full showdown classifying digits vs letters:

1. Create a column for whether each row is a digit or a letter
2. Choose an evaluation metric
3. Choose several candidate models to train
4. Divide data to reserve a validation set that will NOT be used in training/testing
5. K-fold train/test
    1. Create train/test splits from the non-validation dataset
    2. Train each candidate model (best practice: use the same split for all models)
    3. Apply the model the the test split
    4. (*Optional*) Perform hyper-parametric search
    5. Record the model evaluation metrics
    6. Repeat with a new train/test split
6. Promote winner, apply model to validation set
7. (*Optional*) Perform hyper-parametric search, if applicable
8. Report model performance

In [3]:
# Uncomment and install below packages if not already installed
%pip install -q numpy pandas scikit-learn emnist matplotlib


In [4]:
%reset -f

In [1]:
# Import packages
import os
import string
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import emnist
from IPython.display import display, Markdown

# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score


In [2]:
##help function
def int_to_char(label):
    if label < 10:
        return str(label)
    elif label < 36:
        return chr(label - 10 + ord('A'))
    else:
        return chr(label - 36 + ord('a'))

##Display performance metrics and confusion matrix for a model.
def display_metrics(task, model_name, metrics_dict):
    metrics = metrics_dict[task][model_name]
    acc = metrics['accuracy']
    prec = metrics['precision']
    rec = metrics['recall']
    f1 = metrics['f1']
    cm = metrics['confusion_matrix']
    metrics_df = pd.DataFrame({
        'Accuracy': [acc],
        'Precision': [prec],
        'Recall': [rec],
        'F1 Score': [f1]
    })
    cm_df = pd.DataFrame(cm, index=['Actual Class {}'.format(i) for i in range(len(cm))],
                         columns=['Predicted Class {}'.format(i) for i in range(len(cm[0]))])

    # Display performance metrics and confusion matrix
    display(Markdown(f"### Performance Metrics for {model_name}"))
    display(metrics_df)
    display(Markdown(f"### Confusion Matrix for {model_name}"))
    display(cm_df)

metrics_dict = {}
metrics_dict['letter_vs_digit'] = {}
metrics_dict['all symbols'] = {}


In [22]:
# Load train data, 16 seconds using T4 GPU provided by google colab
image, label = emnist.extract_training_samples('byclass')
train = pd.DataFrame()
train['image'] = list(image)
train['image_flat'] = train['image'].apply(lambda x: np.array(x).reshape(-1))
train['label'] = label

# Add a column with the character corresponding to the label
class_label = np.array([int_to_char(l) for l in label])
train['class'] = class_label
train = train[:1000]

# load test set
imaget, labelt = emnist.extract_test_samples('byclass')
class_labelt = np.array([int_to_char(l) for l in labelt])
valid = pd.DataFrame()
valid['image'] = list(imaget)
valid['image_flat'] = valid['image'].apply(lambda x: np.array(x).reshape(-1))
valid['label'] = labelt
valid['class'] = class_labelt
valid = valid[:1000]

In [None]:
#### Task1: Classify all symbols ####

# train in RandomForest，use T4 GPU provided by google colab，8 min 57sec
task = 'all symbols'
model_name = 'random_forest'
metrics_dict[task] = {model_name: {}}

# Initialize random forest classifier
rf_clf = RandomForestClassifier(n_estimators=50, random_state=42)##n_estimators more than 50 collapses

# Train model
rf_clf.fit(train['image_flat'].tolist(), train['label'])


In [22]:
##evaluate in test set
y_pred = rf_clf.predict(valid['image_flat'].tolist())

In [26]:
#analyze the confusion matrix
acc = accuracy_score(valid['label'], y_pred)
prec = precision_score(valid['label'], y_pred,average = 'weighted')
rec = recall_score(valid['label'], y_pred,average = 'weighted')
f1 = f1_score(valid['label'], y_pred,average = 'weighted')
cm = confusion_matrix(valid['label'], y_pred)

metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

display_metrics(task, model_name, metrics_dict)

### Performance Metrics for random_forest

Unnamed: 0,Accuracy,Precision,Recall,F1 Score
0,0.825469,0.817227,0.825469,0.806842


### Confusion Matrix for random_forest

Unnamed: 0,Predicted Class 0,Predicted Class 1,Predicted Class 2,Predicted Class 3,Predicted Class 4,Predicted Class 5,Predicted Class 6,Predicted Class 7,Predicted Class 8,Predicted Class 9,...,Predicted Class 52,Predicted Class 53,Predicted Class 54,Predicted Class 55,Predicted Class 56,Predicted Class 57,Predicted Class 58,Predicted Class 59,Predicted Class 60,Predicted Class 61
Actual Class 0,4747,2,1,2,10,3,7,0,12,1,...,0,1,0,0,0,0,0,0,0,0
Actual Class 1,0,5708,9,0,2,0,1,10,3,0,...,0,2,0,12,0,0,0,0,0,0
Actual Class 2,5,4,5682,14,4,1,1,19,10,1,...,0,0,0,3,0,0,0,1,0,24
Actual Class 3,2,0,27,5808,0,20,0,26,26,14,...,0,0,0,2,0,0,0,1,0,0
Actual Class 4,0,3,6,1,5399,0,6,2,2,31,...,0,3,0,24,0,1,0,1,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Actual Class 57,0,0,0,0,9,0,1,0,3,0,...,0,9,0,0,0,140,1,0,1,0
Actual Class 58,1,0,0,0,4,0,2,0,0,0,...,0,0,0,0,0,0,272,0,0,0
Actual Class 59,0,2,13,1,11,2,0,2,6,0,...,0,2,0,5,0,0,1,255,6,0
Actual Class 60,1,1,2,6,106,2,0,2,5,3,...,0,1,0,2,0,3,0,1,66,0


In [28]:
# Subset `train` and `valid` to only include digits
symbols_list = ['0', '1',' 2', '3', '4', '5', '6', '7', '8', '9']

mask_train = train['class'].apply(lambda x: x in symbols_list)
train_01 = train[mask_train]
train_01.reset_index(drop=True, inplace=True)

mask_valid = valid['class'].apply(lambda x: x in symbols_list)
valid_01 = valid[mask_valid]
valid_01.reset_index(drop=True, inplace=True)

In [31]:
# train subsets in RandomForest，use CPU provided by google colab, 4 min 33 sec
rf_clf.fit(train_01['image_flat'].tolist(), train_01['label'])
y_pred = rf_clf.predict(valid_01['image_flat'].tolist())
acc = accuracy_score(valid_01['label'], y_pred)
prec = precision_score(valid_01['label'], y_pred,average = 'weighted')
rec = recall_score(valid_01['label'], y_pred,average = 'weighted')
f1 = f1_score(valid_01['label'], y_pred,average = 'weighted')
cm = confusion_matrix(valid_01['label'], y_pred)

metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

display_metrics(task, model_name, metrics_dict)


### Performance Metrics for random_forest

Unnamed: 0,Accuracy,Precision,Recall,F1 Score
0,0.984995,0.985003,0.984995,0.98499


### Confusion Matrix for random_forest

Unnamed: 0,Predicted Class 0,Predicted Class 1,Predicted Class 2,Predicted Class 3,Predicted Class 4,Predicted Class 5,Predicted Class 6,Predicted Class 7,Predicted Class 8
Actual Class 0,5726,7,3,13,4,10,0,13,2
Actual Class 1,1,6304,8,3,0,1,8,3,2
Actual Class 2,7,5,5865,5,23,1,24,27,12
Actual Class 3,5,2,1,5553,0,15,1,8,34
Actual Class 4,10,2,46,5,5085,19,0,11,12
Actual Class 5,12,9,0,10,15,5655,0,4,0
Actual Class 6,1,8,3,28,1,1,6044,9,44
Actual Class 7,10,13,26,32,29,13,5,5479,26
Actual Class 8,8,3,28,38,4,0,28,20,5557


In [33]:
##initial RandomForest with higher n_estimators
rf_clf_1 = RandomForestClassifier(n_estimators=100, random_state=42)##

##train and test new model in subsets, use CPU provided by google colab, 8 min 31sec
rf_clf_1.fit(train_01['image_flat'].tolist(), train_01['label'])
y_pred = rf_clf_1.predict(valid_01['image_flat'].tolist())
acc = accuracy_score(valid_01['label'], y_pred)
prec = precision_score(valid_01['label'], y_pred,average = 'weighted')
rec = recall_score(valid_01['label'], y_pred,average = 'weighted')
f1 = f1_score(valid_01['label'], y_pred,average = 'weighted')
cm = confusion_matrix(valid_01['label'], y_pred)

metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

display_metrics(task, model_name, metrics_dict)

### Performance Metrics for random_forest

Unnamed: 0,Accuracy,Precision,Recall,F1 Score
0,0.985591,0.985602,0.985591,0.985589


### Confusion Matrix for random_forest

Unnamed: 0,Predicted Class 0,Predicted Class 1,Predicted Class 2,Predicted Class 3,Predicted Class 4,Predicted Class 5,Predicted Class 6,Predicted Class 7,Predicted Class 8
Actual Class 0,5725,5,5,12,4,12,0,12,3
Actual Class 1,0,6299,9,3,2,2,8,4,3
Actual Class 2,6,3,5866,5,23,1,27,25,13
Actual Class 3,8,2,1,5557,0,10,1,7,33
Actual Class 4,7,2,41,3,5094,21,0,10,12
Actual Class 5,10,9,1,10,15,5653,0,7,0
Actual Class 6,1,6,4,25,1,2,6050,6,44
Actual Class 7,9,12,20,33,26,13,7,5487,26
Actual Class 8,4,3,24,37,5,1,25,19,5568


In [7]:
###Task2: Classify digits vs. letters model showdown

#1.Create a column for whether each row is a digit or a letter
train['is_letter'] = train['label'] >= 10
valid['is_letter'] = valid['label'] >= 10

# Display the first few rows of the dataset
display(train.head())



Unnamed: 0,image,image_flat,label,class,is_letter
0,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",35,Z,True
1,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",36,a,True
2,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",6,6,False
3,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,3,False
4,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",22,M,True


In [6]:
## new display function
def display_metrics(task, model_name, metrics_dict):
    """Display performance metrics and confusion matrix for a model."""
    metrics_df = pd.DataFrame()
    cm_df = pd.DataFrame()
    for key, value in metrics_dict[task][model_name].items():
        if type(value) == np.ndarray:
            cm_df = pd.DataFrame(value, index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
        else:
            metrics_df[key] = [value]
    display(Markdown(f'# Performance Metrics: {model_name}'))
    display(metrics_df)
    display(Markdown(f'# Confusion Matrix: {model_name}'))
    display(cm_df)

In [5]:
#2.Choose an Letter vs Digit evaluation metric: Logistic Regression

task = 'letter_vs_digit'
model_name = 'logistic_regression'

# load Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


In [11]:
# Initialize logistic regression classifier
lr_clf = LogisticRegression(max_iter=1000, random_state=42)

# Scale the data, use CPU on Google colab, 11sec
# When running without scaling the data, the model does not converge
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train['image_flat'].tolist())
valid_scaled = scaler.transform(valid['image_flat'].tolist())

In [12]:
# Train and evaluate model,use CPU on Google colab, 6min48sec
lr_clf.fit(train_scaled, train['is_letter'])
y_pred = lr_clf.predict(valid_scaled)

# Calculate performance metrics
acc = accuracy_score(valid['is_letter'], y_pred)
prec = precision_score(valid['is_letter'], y_pred)
rec = recall_score(valid['is_letter'], y_pred)
f1 = f1_score(valid['is_letter'], y_pred)
cm = confusion_matrix(valid['is_letter'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

In [8]:
#3.Choose several candidate models:random forest
task = 'letter_vs_digit'
model_name = 'random forest'

# Initialize random forest classifier and train, use CPU on Google colab, 9min7sec
rf_clf_2 = RandomForestClassifier(n_estimators=50, random_state=42)
rf_clf_2.fit(train['image_flat'].tolist(), train['is_letter'])


In [10]:
##evaluate in test set
y_pred = rf_clf_2.predict(valid['image_flat'].tolist())

In [12]:
# Calculate performance metrics
acc = accuracy_score(valid['is_letter'], y_pred)
prec = precision_score(valid['is_letter'], y_pred)
rec = recall_score(valid['is_letter'], y_pred)
f1 = f1_score(valid['is_letter'], y_pred)
cm = confusion_matrix(valid['is_letter'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: random forest

Unnamed: 0,accuracy,precision,recall,f1
0,0.898017,0.913501,0.880233,0.896558


# Confusion Matrix: random forest

Unnamed: 0,predicted 0,predicted 1
actual 0,53050,4868
actual 1,6995,51410


In [23]:
#4.Divide data to reserve a validation set that will NOT be used in training/testing
byclass = pd.concat([train, valid], ignore_index=True)
byclass = byclass.sample(frac=1).reset_index(drop=True)
valid_n = byclass[:500]
non_valid= byclass[500:2000]

In [37]:
#5.K-fold train/test
from sklearn.model_selection import KFold
rf_clf = RandomForestClassifier(n_estimators=50, random_state=42)
lr_clf = LogisticRegression(max_iter=100, random_state=42)
scaler = StandardScaler()

##split k-fold data
k = 2
kf = KFold(n_splits=k, shuffle=True, random_state=42)
#Create train/test splits from the non-validation dataset
round = 0
metricsrecord = {}
for train_index, test_index in kf.split(non_valid):
    round = + 1
    train_n = non_valid.iloc[train_index]
    test_n = non_valid.iloc[test_index]
    task = str(round)
    ##random forest
    model_name = 'random forest'
    rf_clf.fit(train_n['image_flat'].tolist(), train_n['class'])
    y_pred = rf_clf.predict(test_n['image_flat'].tolist())
    acc = accuracy_score(test_n['class'], y_pred)
    prec = precision_score(test_n['class'], y_pred,average = 'weighted')
    rec = recall_score(test_n['class'], y_pred,average = 'weighted')
    f1 = f1_score(test_n['class'], y_pred,average = 'weighted')
    cm = confusion_matrix(test_n['class'], y_pred)
    metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}
    metricsrecord.append(metrics_dict)#Record the model evaluation metrics
    ##logistic model
    model_name = 'logistic model'
    train_scaled = scaler.fit_transform(train_n['image_flat'].tolist())
    test_scaled = scaler.transform(test_n['image_flat'].tolist())
    lr_clf.fit(test_scaled, test_n['class'])
    y_pred = lr_clf.predict(test_scaled)

    acc = accuracy_score(test_n['class'], y_pred)
    prec = precision_score(test_n['class'], y_pred,average = 'weighted')
    rec = recall_score(test_n['class'], y_pred,average = 'weighted')
    f1 = f1_score(test_n['class'], y_pred,average = 'weighted')
    cm = confusion_matrix(test_n['class'], y_pred)
    metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}
    metricsrecord.append(metrics_dict)#Record the model evaluation metrics


#6.Promote winner, apply model to validation set

#8.Report model performance

  _warn_prf(average, modifier, msg_start, len(result))


TypeError: string indices must be integers