<a href="https://colab.research.google.com/github/ambwhl/datasci_223/blob/exercise-4/exercises/4-classification/exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification on `emnist`

## 1. Create `Readme.md` to document your work

Explain your choices, process, and outcomes.

## 2. Classify all symbols

### Choose a model

Your choice of model! Choose wisely...

### Train away!

Do you need to tune any parameters? Does the model expect data in a different format?
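As one illustration of the data-format question: scikit-learn estimators expect a 2-D array of shape `(n_samples, n_features)`, while Keras convolutional layers typically expect a trailing channel axis. The sketch below uses random stand-in images rather than EMNIST, and the names `images`, `X_flat`, and `X_cnn` are illustrative assumptions, not part of the exercise.

```python
import numpy as np

SIZE = 28

# Random stand-in for a batch of 28x28 grayscale images (not real EMNIST data)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, SIZE, SIZE), dtype=np.uint8)

# Flatten each image to a 784-long row and scale pixels to [0, 1]
X_flat = images.reshape(len(images), -1).astype(np.float32) / 255.0

# Add a channel axis for models that expect (n, height, width, channels)
X_cnn = (images.astype(np.float32) / 255.0)[..., np.newaxis]

print(X_flat.shape, X_cnn.shape)
```

Tree-based models such as random forests do not need the scaling step, but it is usually helpful for logistic regression and neural networks.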

### Evaluate the model

Evaluate the model on the test set, then analyze the confusion matrix to see where the model performs well and where it struggles.
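One way to read a large confusion matrix is to list its biggest off-diagonal cells. The sketch below does that on a made-up 3-class matrix; `top_confusions` is a hypothetical helper, and the counts are invented for illustration.

```python
import numpy as np

def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (actual, predicted, count)."""
    off_diag = np.array(cm)          # copy so the caller's matrix is untouched
    np.fill_diagonal(off_diag, 0)    # ignore correct predictions
    # Flat indices of the k largest counts, largest first
    flat_order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, off_diag.shape)
    return [(labels[r], labels[c], int(off_diag[r, c])) for r, c in zip(rows, cols)]

# Toy matrix (made-up counts): rows are actual classes, columns are predictions
toy_cm = [[50, 2, 8],
          [1, 40, 0],
          [9, 3, 30]]
pairs = top_confusions(toy_cm, labels=['0', 'I', 'O'], k=2)
print(pairs)  # [('O', '0', 9), ('0', 'O', 8)]
```

With the real 62-class matrix, `labels` could be the characters produced by a label-to-character mapping, so each pair reads directly as "actual symbol predicted as another symbol".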

### Investigate subsets

On which classes does the model perform well? Poorly? Evaluate again, excluding easily confused symbols (such as 'O' and '0').
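A minimal sketch of re-evaluating on a subset, assuming labels follow the 0-9, A-Z, a-z ordering (so `0` is the digit '0' and `24` is 'O'). The helper `accuracy_excluding` and the label arrays are hypothetical and made up for illustration.

```python
import numpy as np

def accuracy_excluding(y_true, y_pred, excluded):
    """Accuracy over samples whose true label is not in `excluded`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    keep = ~np.isin(y_true, list(excluded))  # mask out the confusable classes
    return float(np.mean(y_true[keep] == y_pred[keep]))

# Made-up labels: the first four samples are '0' or 'O', the rest are other digits
y_true = np.array([0, 24, 0, 24, 1, 2, 3, 1])
y_pred = np.array([24, 0, 0, 24, 1, 2, 3, 7])

overall = float(np.mean(y_true == y_pred))
subset = accuracy_excluding(y_true, y_pred, excluded={0, 24})
print(overall, subset)  # 0.625 0.75
```

Note that this filters on the true label only: a remaining sample that is predicted as an excluded class still counts as an error.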

### Improve performance

Brainstorm ways to improve performance. This could include trying different architectures, adding more layers, changing the loss function, or using data augmentation techniques.

## 3. Classify digits vs. letters model showdown

Perform a full showdown classifying digits vs letters:

1. Create a column for whether each row is a digit or a letter
2. Choose an evaluation metric
3. Choose several candidate models to train
4. Divide data to reserve a validation set that will NOT be used in training/testing
5. K-fold train/test
    1. Create train/test splits from the non-validation dataset
    2. Train each candidate model (best practice: use the same split for all models)
    3. Apply the model to the test split
    4. (*Optional*) Perform hyperparameter search
    5. Record the model evaluation metrics
    6. Repeat with a new train/test split
6. Promote winner, apply model to validation set
7. (*Optional*) Perform hyperparameter search, if applicable
8. Report model performance
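The steps above can be sketched roughly as follows. This is an outline under stated assumptions, not a reference solution: it uses synthetic data from `make_classification` in place of the flattened EMNIST images, treats `y` as the digit-vs-letter column, picks F1 as the metric, and the two candidate models are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in: X plays the flattened images, y the is-digit column
X, y = make_classification(n_samples=600, n_features=20, random_state=42)

# Step 4: reserve a validation set that the k-fold loop never sees
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Step 3: candidate models, built fresh for every fold
candidates = {
    'logistic_regression': lambda: LogisticRegression(max_iter=1000),
    'random_forest': lambda: RandomForestClassifier(n_estimators=50, random_state=42),
}

# Step 5: k-fold train/test, with every candidate seeing the same splits
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {name: [] for name in candidates}
for train_idx, test_idx in skf.split(X_dev, y_dev):
    for name, make_model in candidates.items():
        model = make_model().fit(X_dev[train_idx], y_dev[train_idx])
        y_hat = model.predict(X_dev[test_idx])
        scores[name].append(f1_score(y_dev[test_idx], y_hat))

# Step 6: promote the best mean F1, refit on all non-validation data,
# then score once on the reserved validation set
winner = max(scores, key=lambda name: np.mean(scores[name]))
final_model = candidates[winner]().fit(X_dev, y_dev)
val_f1 = f1_score(y_val, final_model.predict(X_val))
print(winner, round(val_f1, 3))
```

Building each model from a factory keeps folds independent, and iterating over models inside the fold loop guarantees that every candidate is compared on identical splits.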

In [None]:
# Install the packages below if they are not already installed
%pip install -q numpy pandas matplotlib seaborn scikit-learn tensorflow emnist xgboost


In [None]:
%reset -f

In [None]:
# Import packages
import os
import string
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import emnist
from IPython.display import display, Markdown

# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
# Logistic Regression
#from sklearn.linear_model import LogisticRegression
#from sklearn.preprocessing import StandardScaler
# XGBoost (gradient-boosted trees)
#from xgboost import XGBClassifier
# Deep Learning
#import tensorflow as tf
#from tensorflow import keras
#from tensorflow.keras.models import Sequential
#from tensorflow.keras.layers import Dense, Flatten

# Constants
SIZE = 28
REBUILD = True
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

In [None]:
# Helper: map an EMNIST byclass label (0-61) to its character (0-9, A-Z, a-z)
def int_to_char(label):
    if label < 10:
        return str(label)
    elif label < 36:
        return chr(label - 10 + ord('A'))
    else:
        return chr(label - 36 + ord('a'))

# Display performance metrics and confusion matrix for a model
def display_metrics(task, model_name, metrics_dict):
    metrics = metrics_dict[task][model_name]
    acc = metrics['accuracy']
    prec = metrics['precision']
    rec = metrics['recall']
    f1 = metrics['f1']
    cm = metrics['confusion_matrix']
    metrics_df = pd.DataFrame({
        'Accuracy': [acc],
        'Precision': [prec],
        'Recall': [rec],
        'F1 Score': [f1]
    })
    cm_df = pd.DataFrame(cm, index=['Actual Class {}'.format(i) for i in range(len(cm))],
                         columns=['Predicted Class {}'.format(i) for i in range(len(cm[0]))])

    # Display performance metrics and confusion matrix
    display(Markdown(f"### Performance Metrics for {model_name}"))
    display(metrics_df)
    display(Markdown(f"### Confusion Matrix for {model_name}"))
    display(cm_df)




In [None]:
digits = list(range(10))
uppercase_letters = list(range(10, 36))
lowercase_letters = list(range(36, 62))
class_labels = digits + uppercase_letters + lowercase_letters
assert len(class_labels) == 62
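A quick sanity check for the label mapping, assuming the 0-9, A-Z, a-z ordering that `int_to_char` encodes. The function is repeated here only so the snippet runs on its own; `expected` and `decoded` are illustrative names.

```python
import string

def int_to_char(label):
    # Same mapping as the helper defined earlier in the notebook
    if label < 10:
        return str(label)
    elif label < 36:
        return chr(label - 10 + ord('A'))
    else:
        return chr(label - 36 + ord('a'))

# Decoding labels 0..61 should spell out digits, uppercase, then lowercase
expected = string.digits + string.ascii_uppercase + string.ascii_lowercase
decoded = ''.join(int_to_char(i) for i in range(62))
assert decoded == expected
print(decoded)
```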

In [None]:
# Load training data (16 seconds on a Google Colab T4 runtime)
image, label = emnist.extract_training_samples('byclass')
train = pd.DataFrame()
train['image'] = list(image)
train['image_flat'] = train['image'].apply(lambda x: np.array(x).reshape(-1))
train['label'] = label

# Convert labels to characters
class_label = np.array([int_to_char(l) for l in label])

# Add a column with the character corresponding to the label
train['class'] = class_label



Downloading emnist.zip: 536MB [00:25, 21.6MB/s]


In [None]:
# load test set
imaget, labelt = emnist.extract_test_samples('byclass')
class_labelt = np.array([int_to_char(l) for l in labelt])
valid = pd.DataFrame()
valid['image'] = list(imaget)
valid['image_flat'] = valid['image'].apply(lambda x: np.array(x).reshape(-1))
valid['label'] = labelt
valid['class'] = class_labelt

In [None]:
metrics_dict = {
    'all symbols' : { # task name
        'logistic_regression': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'xgboost': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'random_forest': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'neural_network': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        }
    }
}



In [None]:
# Random forest: 8 min 57 s on a Google Colab T4 runtime (scikit-learn trains on the CPU)
task = 'all symbols'
model_name = 'random_forest'
# Reset only this model's entry so results for other models are kept
metrics_dict[task][model_name] = {}

# Initialize random forest classifier
rf_clf = RandomForestClassifier(n_estimators=50, random_state=42)  # n_estimators above 50 crashed the session

# Train and evaluate model
rf_clf.fit(train['image_flat'].tolist(), train['label'])


In [None]:
y_pred = rf_clf.predict(valid['image_flat'].tolist())


In [None]:
# Compute accuracy and macro-averaged precision, recall, and F1 on the EMNIST test split
acc = accuracy_score(valid['label'], y_pred)
prec = precision_score(valid['label'], y_pred, average='macro')
rec = recall_score(valid['label'], y_pred, average='macro')
f1 = f1_score(valid['label'], y_pred, average='macro')
cm = confusion_matrix(valid['label'], y_pred)

# Store the results for this task and model
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

display_metrics(task, model_name, metrics_dict)

### Performance Metrics for random_forest

| | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| 0 | 0.825469 | 0.77154 | 0.646397 | 0.663805 |


### Confusion Matrix for random_forest

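| | Predicted Class 0 | Predicted Class 1 | Predicted Class 2 | Predicted Class 3 | Predicted Class 4 | Predicted Class 5 | Predicted Class 6 | Predicted Class 7 | Predicted Class 8 | Predicted Class 9 | ... | Predicted Class 52 | Predicted Class 53 | Predicted Class 54 | Predicted Class 55 | Predicted Class 56 | Predicted Class 57 | Predicted Class 58 | Predicted Class 59 | Predicted Class 60 | Predicted Class 61 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Actual Class 0 | 4747 | 2 | 1 | 2 | 10 | 3 | 7 | 0 | 12 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Actual Class 1 | 0 | 5708 | 9 | 0 | 2 | 0 | 1 | 10 | 3 | 0 | ... | 0 | 2 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 |
| Actual Class 2 | 5 | 4 | 5682 | 14 | 4 | 1 | 1 | 19 | 10 | 1 | ... | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 24 |
| Actual Class 3 | 2 | 0 | 27 | 5808 | 0 | 20 | 0 | 26 | 26 | 14 | ... | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| Actual Class 4 | 0 | 3 | 6 | 1 | 5399 | 0 | 6 | 2 | 2 | 31 | ... | 0 | 3 | 0 | 24 | 0 | 1 | 0 | 1 | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Actual Class 57 | 0 | 0 | 0 | 0 | 9 | 0 | 1 | 0 | 3 | 0 | ... | 0 | 9 | 0 | 0 | 0 | 140 | 1 | 0 | 1 | 0 |
| Actual Class 58 | 1 | 0 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 272 | 0 | 0 | 0 |
| Actual Class 59 | 0 | 2 | 13 | 1 | 11 | 2 | 0 | 2 | 6 | 0 | ... | 0 | 2 | 0 | 5 | 0 | 0 | 1 | 255 | 6 | 0 |
| Actual Class 60 | 1 | 1 | 2 | 6 | 106 | 2 | 0 | 2 | 5 | 3 | ... | 0 | 1 | 0 | 2 | 0 | 3 | 0 | 1 | 66 | 0 |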
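
The reported accuracy (0.825) is well above the macro-averaged recall (0.646). That pattern is consistent with class imbalance: large classes dominate accuracy, while macro recall weights every class equally. The sketch below shows the effect on a made-up two-class matrix; the counts are invented and `per_class_recall` is an illustrative name.

```python
import numpy as np

# Toy confusion matrix (made-up counts): one large class, one small class
cm = np.array([[90, 10],
               [6, 4]])

# Accuracy: correct predictions over all samples
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal over row totals (rows are actual classes)
per_class_recall = np.diag(cm) / cm.sum(axis=1)

# Macro recall: unweighted mean across classes
macro_recall = per_class_recall.mean()

print(round(accuracy, 3), per_class_recall, round(macro_recall, 3))
```

Applying the same `np.diag(cm) / cm.sum(axis=1)` computation to the full 62-class matrix gives a per-class recall vector that can be sorted to find the weakest symbols.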


