# Classification on `emnist`

## 1. Create `Readme.md` to document your work

Explain your choices, process, and outcomes.

## 2. Classify all symbols

### Choose a model

Your choice of model! Choose wisely...

### Train away!

Do you need to tune any parameters? Is the model expecting data in a different format?

### Evaluate the model

Evaluate the model on the test set, then analyze the confusion matrix to see where the model performs well and where it struggles.

### Investigate subsets

On which classes does the model perform well? Poorly? Evaluate again, excluding easily confused symbols (such as 'O' and '0').

### Improve performance

Brainstorm ways to improve performance. This could include trying different architectures, adding more layers, changing the loss function, or using data augmentation techniques.
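
As one illustration of the data augmentation idea, the sketch below shifts a 28x28 image by a few pixels and fills the vacated area with zeros (the background value in these images). The helper name `shift_image` is hypothetical and not part of the `emnist` package or scikit-learn; it is a minimal NumPy sketch, not a tuned augmentation pipeline.

```python
import numpy as np

def shift_image(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, filling vacated pixels with 0.

    Hypothetical helper for illustration only.
    """
    h, w = img.shape
    out = np.zeros_like(img)
    # Rows/columns that remain inside the frame after the shift
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    out[dst_rows, dst_cols] = img[src_rows, src_cols]
    return out

# Toy image with a single bright pixel
demo_img = np.zeros((28, 28), dtype=np.uint8)
demo_img[10, 10] = 255

# Move the content 2 pixels down and 3 pixels left
demo_shifted = shift_image(demo_img, dy=2, dx=-3)
```

Shifted copies would be appended to the training rows only (with the same labels), never to the test or validation data.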

## 3. Classify digits vs. letters model showdown

Perform a full showdown classifying digits vs letters:

1. Create a column for whether each row is a digit or a letter
2. Choose an evaluation metric
3. Choose several candidate models to train
4. Divide data to reserve a validation set that will NOT be used in training/testing
5. K-fold train/test
    1. Create train/test splits from the non-validation dataset
    2. Train each candidate model (best practice: use the same split for all models)
    3. Apply the model to the test split
    4. (*Optional*) Perform hyperparameter search
    5. Record the model evaluation metrics
    6. Repeat with a new train/test split
6. Promote winner, apply model to validation set
7. (*Optional*) Perform hyperparameter search, if applicable
8. Report model performance
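
Steps 4 and 5 of the list above can be sketched with index arrays alone. The function `holdout_and_folds` and its parameters are hypothetical names chosen for this sketch; scikit-learn's `train_test_split` and `KFold` provide the same functionality in practice.

```python
import numpy as np

def holdout_and_folds(n_samples, holdout_frac=0.2, k=3, seed=0):
    """Reserve a hold-out (validation) set, then build k train/test splits
    from the remaining rows. Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)

    # Step 4: rows that are never touched during the k-fold loop
    n_hold = int(round(n_samples * holdout_frac))
    holdout_idx, work_idx = idx[:n_hold], idx[n_hold:]

    # Step 5: each fold serves once as the test split
    folds = np.array_split(work_idx, k)
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train_idx, test_idx))
    return holdout_idx, splits

holdout_idx, splits = holdout_and_folds(100, holdout_frac=0.2, k=4)
for train_idx, test_idx in splits:
    # Every candidate model would be trained and scored on this same split
    print(len(train_idx), len(test_idx))
```

Reusing the same `splits` for every candidate model keeps the comparison fair, which is the best practice noted in step 5.2.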

In [1]:
!pip install emnist


In [2]:
# Import packages
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import emnist
from hashlib import sha1

In [3]:
# Load the data, and reshape it into a 28x28 array

# The size of each image is 28x28 pixels
size = 28

# Extract the training split as images and labels
image, label = emnist.extract_training_samples('byclass')

# Create an empty DataFrame to hold the labels and images
raw_train = pd.DataFrame()

# Add a column showing the label
raw_train['label'] = label

# Add a column with the image data as a 28x28 array
raw_train['image'] = list(image)


# Repeat for the test split
image, label = emnist.extract_test_samples('byclass')
raw_test = pd.DataFrame()
raw_test['label'] = label
raw_test['image'] = list(image)

In [4]:
# Let's start cleaning!

# Labels! They're hard to understand as numbers, so let's map them to characters
# We can do this by manually creating a list (the position in the list is the numeric label):
LABELS = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
          'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
          'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

# Or generate the list of labels using the following code:
# create the characters list, which is the digits, then uppercase, then lowercase
chars = string.digits + string.ascii_uppercase + string.ascii_lowercase
# create the dictionary mapping the numbers to the characters
num_to_char = {i: chars[i] for i in range(len(chars))}

In [5]:
raw_train['mapped_label'] = raw_train['label'].map(num_to_char)
print(raw_train[['mapped_label']])

raw_test['mapped_label'] = raw_test['label'].map(num_to_char)
print(raw_test[['mapped_label']])

def label_category(value):
    if pd.isnull(value):
        return pd.NA  # Use pd.NA for missing values
    # Try to convert to numeric, and check if the result is not NaN
    elif not pd.isnull(pd.to_numeric(value, errors='coerce')):
        return 'number'
    elif isinstance(value, str) and value.isalpha():
        return 'letter'
    else:
        return pd.NA  # Use pd.NA for any other case that is considered missing

raw_train['label_cat'] = raw_train['mapped_label'].apply(label_category)
print(raw_train['label_cat'])

raw_test['label_cat'] = raw_test['mapped_label'].apply(label_category)

       mapped_label
0                 Z
1                 a
2                 6
3                 3
4                 M
...             ...
697927            e
697928            l
697929            5
697930            B
697931            M

[697932 rows x 1 columns]
       mapped_label
0                 I
1                 a
2                 0
3                 3
4                 X
...             ...
116318            7
116319            t
116320            S
116321            0
116322            5

[116323 rows x 1 columns]


0         letter
1         letter
2         number
3         number
4         letter
           ...  
697927    letter
697928    letter
697929    number
697930    letter
697931    letter
Name: label_cat, Length: 697932, dtype: object


In [6]:
def label_category_code(value):
    if pd.isnull(value):
        return pd.NA  # Use pd.NA for missing values
    # Try to convert to numeric, and check if the result is not NaN
    elif not pd.isnull(pd.to_numeric(value, errors='coerce')):
        return 1
    elif isinstance(value, str) and value.isalpha():
        return 0
    else:
        return pd.NA  # Use pd.NA for any other case that is considered missing

raw_train['label_cat_code'] = raw_train['mapped_label'].apply(label_category_code)
print(raw_train['label_cat_code'])

raw_test['label_cat_code'] = raw_test['mapped_label'].apply(label_category_code)

0         0
1         0
2         1
3         1
4         0
         ..
697927    0
697928    0
697929    1
697930    0
697931    0
Name: label_cat_code, Length: 697932, dtype: int64
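
Because the digits occupy positions 0 to 9 in `chars`, the digit/letter flag can also be derived directly from the numeric label without calling `apply` row by row. The sketch below uses a small hand-written array, `demo_labels`, as a stand-in for `raw_train['label']`; its values are the numeric codes of the first five mapped labels shown earlier (Z, a, 6, 3, M).

```python
import string
import numpy as np

chars = string.digits + string.ascii_uppercase + string.ascii_lowercase

# Stand-in for raw_train['label'] (assumption: numeric codes for Z, a, 6, 3, M)
demo_labels = np.array([35, 36, 6, 3, 22])

# Labels 0-9 are digits, so one vectorized comparison gives the 1/0 code
demo_cat_code = (demo_labels < 10).astype(int)

print([chars[i] for i in demo_labels])  # ['Z', 'a', '6', '3', 'M']
print(demo_cat_code.tolist())           # [0, 0, 1, 1, 0]
```

On the full DataFrame the same comparison would be written against the `label` column, which avoids hundreds of thousands of Python-level function calls.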


In [7]:
%pip install scikit-learn


In [8]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [9]:
# Flatten each image if they are 2D arrays
X_train = np.array([image.flatten() for image in raw_train['image']])

# Ensure the target variable is in the correct shape
y_train = raw_train['label_cat_code'].values  # Assuming 'label_cat_code' is a column in a pandas DataFrame
y_train_allclass = raw_train['mapped_label'].values

# Create a validation set (this is what was called the test set in class)
# X_train has been flattened above; y_train and y_train_allclass are defined above
X_train, X_val, y_train, y_val, y_train_allclass, y_val_allclass = train_test_split(
    X_train, y_train, y_train_allclass, test_size=0.2, random_state=42, stratify=y_train
)

X_head = X_train[:1000]  # Using the first 1000 samples for a smaller training subset
y_head = y_train[:1000]
y_allclass_head = y_train_allclass[:1000]
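
A possible alternative to the per-image `flatten()` loop is a single `reshape` on the whole image stack, which yields the same matrix without a Python-level loop. The sketch uses random stand-in images (`demo_images`) rather than the EMNIST arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the (n_samples, 28, 28) array returned by the emnist loader
demo_images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Loop version, as in the cell above
flat_loop = np.array([img.flatten() for img in demo_images])

# Vectorized version: keep the sample axis, merge the two pixel axes
flat_reshape = demo_images.reshape(len(demo_images), -1)

print(flat_reshape.shape)  # (3, 784)
```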


Ex. Part 1 - Classifying all using random forest

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [11]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Define the parameter distribution
param_dist = {
    'n_estimators': randint(10, 50),  # Example: Number of trees in a range
    'max_depth': [3, 5, 7, 10],  # Example: Maximum depth of the tree
    # Add more parameters and distributions here
}

# Initialize the classifier
rf_classifier = RandomForestClassifier()

# Initialize the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=rf_classifier, param_distributions=param_dist, n_iter=50, cv=3, n_jobs=-1, verbose=2, random_state=42)
# If `scoring` is not specified, the search falls back to the estimator's default scorer, which is accuracy for classifiers.
# n_iter and cv are kept small to limit run time.

# Fit the random search to the data
random_search.fit(X_head, y_allclass_head)

# Print the best parameters
print("Best parameters found: ", random_search.best_params_)

# Use the best estimator for further predictions
best_rf_classifier = random_search.best_estimator_


Fitting 3 folds for each of 50 candidates, totalling 150 fits




[CV] END .......................max_depth=7, n_estimators=38; total time=   1.6s
[CV] END .......................max_depth=7, n_estimators=17; total time=   0.9s
[CV] END .......................max_depth=7, n_estimators=17; total time=   1.0s
[CV] END .......................max_depth=3, n_estimators=30; total time=   0.9s
[CV] END .......................max_depth=7, n_estimators=17; total time=   0.9s
[CV] END .......................max_depth=7, n_estimators=38; total time=   1.6s
[CV] END .......................max_depth=3, n_estimators=30; total time=   0.5s
[CV] END .......................max_depth=3, n_estimators=30; total time=   0.4s
[CV] END .......................max_depth=7, n_estimators=38; total time=   1.7s
[CV] END .......................max_depth=7, n_estimators=20; total time=   0.7s
[CV] END .......................max_depth=7, n_estimators=28; total time=   1.1s
[CV] END .......................max_depth=7, n_estimators=20; total time=   0.8s
[CV] END ...................

In [12]:
n_est = random_search.best_params_['n_estimators']
md = random_search.best_params_['max_depth']
rf_classifier = RandomForestClassifier(n_estimators=n_est, max_depth=md)
# Combine the 1000-row training subset with the first 429 rows of X_val
X_combined = np.concatenate((X_head, X_val[:429]), axis=0)  # 429 rows gives roughly a 70%/30% split

# If you also want to combine the corresponding labels
y_allclass_combined = np.concatenate((y_allclass_head, y_val_allclass[:429]), axis=0)

# Train the model here
rf_classifier.fit(X_head, y_allclass_head)

In [13]:
# Looking at the prediction accuracy using the trained data
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Perform three-fold cross-validation
cv_scores = cross_val_score(rf_classifier, X_combined, y_allclass_combined, cv=3)

# Print the accuracy for each fold
print(f'Accuracy scores for each fold: {cv_scores}')
import numpy as np
print(f'Mean accuracy out of the three-fold test: {np.mean(cv_scores)}')



Accuracy scores for each fold: [0.57442348 0.52310924 0.54831933]
Mean accuracy out of the three-fold test: 0.548617350504143


In [14]:
# Evaluate on test set
X_test = np.array([image.flatten() for image in raw_test['image']])
print(X_test)
y_test_allclass = raw_test['mapped_label']
print(y_test_allclass)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
0         I
1         a
2         0
3         3
4         X
         ..
116318    7
116319    t
116320    S
116321    0
116322    5
Name: mapped_label, Length: 116323, dtype: object


In [15]:
# Looking at the prediction accuracy using the test data
y_pred_test = rf_classifier.predict(X_test)  # predict once and reuse the result
print(accuracy_score(y_test_allclass, y_pred_test))
print(confusion_matrix(y_test_allclass, y_pred_test))

0.5353799334611384
[[3666   17   29 ...    0    0    0]
 [   1 5808   19 ...    0    0    0]
 [  60   55 4697 ...    0    0   24]
 ...
 [   2   12  116 ...   15    0    0]
 [   2   18    5 ...    0    0    0]
 [   5    5  238 ...    1    0    8]]
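
A 62x62 matrix is hard to read when printed, so it can help to extract per-class recall and the largest off-diagonal entries. The helpers `per_class_recall` and `top_confusions` below are hypothetical names for this sketch, and the 3x3 matrix is made-up toy data, not output from the model above. It follows the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    cm = np.asarray(cm).copy()
    np.fill_diagonal(cm, 0)  # ignore correct predictions
    flat_order = np.argsort(cm, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, cm.shape)
    return [(labels[r], labels[c], int(cm[r, c])) for r, c in zip(rows, cols)]

# Toy confusion matrix (made-up numbers for illustration)
demo_labels = ['0', 'O', '1']
demo_cm = [[50, 40, 10],
           [30, 60, 10],
           [5, 5, 90]]

demo_recall = per_class_recall(demo_cm)
demo_top = top_confusions(demo_cm, demo_labels, k=2)
print(demo_recall)
print(demo_top)
```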


In [16]:
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Exclude '0' and 'O' from the evaluation
excluded_labels = ['0', 'O']

# Create a mask for letters (excluding 'O')
is_letter = np.array([label.isalpha() and label not in excluded_labels for label in y_test_allclass])

# Create a mask for numbers (excluding '0')
is_number = np.array([label.isdigit() and label not in excluded_labels for label in y_test_allclass])

# Filter the test set for letters
X_test_letters = X_test[is_letter]
y_test_letters = y_test_allclass[is_letter]

# Filter the test set for numbers
X_test_numbers = X_test[is_number]
y_test_numbers = y_test_allclass[is_number]

# Make predictions for letters
y_pred_letters = rf_classifier.predict(X_test_letters)

# Make predictions for numbers
y_pred_numbers = rf_classifier.predict(X_test_numbers)

# Calculate and print the accuracy for letters
accuracy_letters = accuracy_score(y_test_letters, y_pred_letters)
print(f'Accuracy for letters (excluding "O"): {accuracy_letters}')

# Calculate and print the accuracy for numbers
accuracy_numbers = accuracy_score(y_test_numbers, y_pred_numbers)
print(f'Accuracy for numbers (excluding "0"): {accuracy_numbers}')

# Count the number of letters (excluding 'O') - Looking at the distribution 
num_letters = len(y_test_letters)
print(f'Number of letters (excluding "O"): {num_letters}')

# Count the number of numbers (excluding '0')
num_numbers = len(y_test_numbers)
print(f'Number of numbers (excluding "0"): {num_numbers}')

Accuracy for letters (excluding "O"): 0.2747516083245774
Accuracy for numbers (excluding "0"): 0.806367472190257
Number of letters (excluding "O"): 54249
Number of numbers (excluding "0"): 52140


Based on this small model (trained on only 1,000 samples), the random forest is much better at numbers than at letters: test accuracy is about 0.81 on numbers versus about 0.27 on letters.

One idea for improving the model is to include proportionally more letters than numbers in the training data, since the numbers may be contributing to overfitting (higher variance). The accuracy is 1.0 when the model is scored on its own training data, which suggests very low bias, sacrificed for high variance.

We could also use shallower trees to reduce that variance. Using fewer trees is less likely to help, since adding trees to a random forest does not usually increase overfitting.
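
The overfitting concern above comes down to comparing accuracy on the training data with accuracy on unseen data. The sketch below illustrates that gap on synthetic data. It uses a 1-nearest-neighbour classifier written in NumPy as a stand-in for the random forest (an assumption made purely for illustration), because it also memorizes its training set. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with noisy labels, so perfect generalization is impossible
X_tr = rng.normal(size=(200, 5))
y_tr = (X_tr[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)
X_te = rng.normal(size=(200, 5))
y_te = (X_te[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

def predict_1nn(X_fit, y_fit, X_new):
    """Label each new row with the label of its closest fitted row."""
    dists = ((X_new[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    return y_fit[dists.argmin(axis=1)]

train_acc = (predict_1nn(X_tr, y_tr, X_tr) == y_tr).mean()
test_acc = (predict_1nn(X_tr, y_tr, X_te) == y_te).mean()

print(f'train accuracy: {train_acc:.3f}')
print(f'test accuracy:  {test_acc:.3f}')
print(f'gap:            {train_acc - test_acc:.3f}')
```

A training accuracy of 1.0 with a clearly lower test accuracy is the pattern described above: the model reproduces its training data but does not generalize as well.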

Ex. Part 2 - Ridge regression

In [17]:
# Try ridge regression as classifier

# Combine the 1000-row training subset with the first 429 rows of X_val
X_combined = np.concatenate((X_head, X_val[:429]), axis=0)  # 429 rows gives roughly a 70%/30% split

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# StandardScaler accepts NumPy arrays directly; converting to lists first is slow and unnecessary
X_combined_scale = scaler.fit_transform(X_combined)
X_test_scale = scaler.transform(X_test)

# If you also want to combine the corresponding labels
y_combined = np.concatenate((y_head, y_val[:429]), axis=0)

# Initialize the ridge classifier
from sklearn.linear_model import RidgeClassifier
ridge = RidgeClassifier(random_state = 235)

# Define the parameter grid to search over for Ridge Regression
param_grid = {
    'alpha': [0.1, 1.0, 10.0, 100.0]
}

# Initialize GridSearchCV with 3-fold cross-validation
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(ridge, param_grid, cv=3, scoring='f1', n_jobs=-1)

# Fit GridSearchCV to the scaled data
grid_search.fit(X_combined_scale, y_combined)

# Print the best parameters and the corresponding score
print("Best parameters:", grid_search.best_params_)

# If you want to use the best model found by GridSearchCV
best_ridge = grid_search.best_estimator_

Best parameters: {'alpha': 100.0}
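
One caveat about the cell above: the scaler is fit on all of `X_combined` before the grid search, so each cross-validation fold is scaled using statistics that include its own held-out rows. A common way to avoid this is to put the scaler and the classifier in a `Pipeline`, so that scaling is refit inside every fold. The sketch below shows the pattern on synthetic data from `make_classification`, which is a stand-in for `(X_combined, y_combined)` and not the EMNIST data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X_combined, y_combined); an assumption for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# The scaler is now part of the estimator, so it is fit on training folds only
pipe = make_pipeline(StandardScaler(), RidgeClassifier(random_state=235))

# make_pipeline names each step after its lowercased class name
param_grid_pipe = {'ridgeclassifier__alpha': [0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipe, param_grid_pipe, cv=3, scoring='f1')
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV F1:", search.best_score_)
```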


In [18]:
# GridSearchCV already refits best_estimator_ on the full data (refit=True by default), so this explicit fit is redundant but harmless
best_ridge.fit(X_combined_scale, y_combined)

In [19]:
print(confusion_matrix(y_combined, best_ridge.predict(X_combined_scale)))
print(f'The whole accuracy here is: {accuracy_score(y_combined, best_ridge.predict(X_combined_scale))}')
from sklearn.metrics import f1_score
print(f'The whole F1 score is: {f1_score(y_combined, best_ridge.predict(X_combined_scale))}')

# Perform three-fold cross-validation
cv_scores_ridge = cross_val_score(best_ridge, X_combined_scale, y_combined, cv=3, scoring='f1')
print(f'The scores of each three-fold validation: {cv_scores_ridge}')
print(f'The mean score is: {np.mean(cv_scores_ridge)}')

[[580 130]
 [127 592]]
The whole accuracy here is: 0.8201539538138558
The whole F1 score is: 0.8216516308119362
The scores of each three-fold validation: [0.70612245 0.68979592 0.69246436]
The mean score is: 0.6961275752663592
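
The accuracy and F1 values printed above can be reproduced by hand from the confusion matrix, which is a useful sanity check. With scikit-learn's layout (rows are true labels, columns are predictions, classes sorted as 0 then 1), the four cells are TN, FP, FN, TP, where class 1 is "number".

```python
# Cells of the ridge confusion matrix printed above: [[580, 130], [127, 592]]
tn, fp, fn, tp = 580, 130, 127, 592

precision = tp / (tp + fp)   # of predicted numbers, how many were numbers
recall = tp / (tp + fn)      # of true numbers, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f'precision: {precision:.4f}')
print(f'recall:    {recall:.4f}')
print(f'F1:        {f1:.4f}')        # matches the 0.8217 reported above
print(f'accuracy:  {accuracy:.4f}')  # matches the 0.8202 reported above
```

Note that these figures are computed on the same rows the model was fit on, which is why they are higher than the cross-validated mean F1 of about 0.70.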


In [20]:
# Create a DataFrame to store F1 scores and the mean F1 score for each run
# Assuming you want to store results of multiple runs, consider each run as an iteration
results_df = pd.DataFrame(columns=['Model', 'Fold 1 F1 Score', 'Fold 2 F1 Score', 'Fold 3 F1 Score', 'Mean F1 Score'])

# Add the results of this run to the DataFrame
results_df.loc[len(results_df)] = ['Ridge Classifier', *cv_scores_ridge, np.mean(cv_scores_ridge)]

# Display the DataFrame
print(results_df)

              Model  Fold 1 F1 Score  Fold 2 F1 Score  Fold 3 F1 Score  \
0  Ridge Classifier         0.706122         0.689796         0.692464   

   Mean F1 Score  
0       0.696128  


k-nearest neighbour

In [21]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

# Instantiate the kNN model for classification
knn_classifier = KNeighborsClassifier()  # Defaults to n_neighbors=5; tuned by the grid search below

# Define the parameter grid to search over for kNN
param_grid = {
    'n_neighbors': [2, 5, 7, 10] 
}

# Initialize GridSearchCV with 3-fold cross-validation
from sklearn.model_selection import GridSearchCV
grid_search_knn = GridSearchCV(knn_classifier, param_grid, cv=3, scoring='f1', n_jobs=-1)

# Fit GridSearchCV to the scaled data
grid_search_knn.fit(X_combined_scale, y_combined)

# Print the best parameters and the corresponding score
print("Best parameters:", grid_search_knn.best_params_)

# If you want to use the best model found by GridSearchCV
best_knn = grid_search_knn.best_estimator_

# Fit the model to your scaled training data
best_knn.fit(X_combined_scale, y_combined)

Best parameters: {'n_neighbors': 5}


In [22]:
print(confusion_matrix(y_combined, best_knn.predict(X_combined_scale)))
print(f'The whole accuracy here is: {accuracy_score(y_combined, best_knn.predict(X_combined_scale))}')
from sklearn.metrics import f1_score
print(f'The whole F1 score is: {f1_score(y_combined, best_knn.predict(X_combined_scale))}')

# Perform three-fold cross-validation
cv_scores_knn = cross_val_score(best_knn, X_combined_scale, y_combined, cv=3, scoring='f1')
print(f'The scores of each three-fold validation: {cv_scores_knn}')
print(f'The mean score is: {np.mean(cv_scores_knn)}')

# Add the results of this run to the DataFrame
results_df.loc[len(results_df)] = ['kNN', *cv_scores_knn, np.mean(cv_scores_knn)]

# Display the DataFrame
print(results_df)

[[585 125]
 [ 83 636]]
The whole accuracy here is: 0.85444366689993
The whole F1 score is: 0.8594594594594595
The scores of each three-fold validation: [0.78740157 0.76861167 0.78557114]
The mean score is: 0.7805281290359466
              Model  Fold 1 F1 Score  Fold 2 F1 Score  Fold 3 F1 Score  \
0  Ridge Classifier         0.706122         0.689796         0.692464   
1               kNN         0.787402         0.768612         0.785571   

   Mean F1 Score  
0       0.696128  
1       0.780528  


Random Forest

In [23]:
# Define the parameter distribution
param_dist = {
    'n_estimators': randint(10, 50),  # Example: Number of trees in a range
    'max_depth': [3, 5, 7, 10],  # Example: Maximum depth of the tree
    # Add more parameters and distributions here
}

# Initialize the classifier
rf_classifier_bin = RandomForestClassifier()

# Initialize the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=rf_classifier_bin, param_distributions=param_dist, n_iter=50, cv=3, n_jobs=-1, verbose=2, random_state=42)
# If `scoring` is not specified, the search falls back to the estimator's default scorer, which is accuracy for classifiers.
# n_iter and cv are kept small to limit run time.

# Fit the random search to the data
random_search.fit(X_combined_scale, y_combined)

# Print the best parameters
print("Best parameters found: ", random_search.best_params_)

# Use the best estimator for further predictions
best_rf_bin_classifier = random_search.best_estimator_

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[CV] END .......................max_depth=3, n_estimators=30; total time=   0.3s
[CV] END .......................max_depth=7, n_estimators=17; total time=   0.3s
[CV] END .......................max_depth=7, n_estimators=17; total time=   0.3s
[CV] END .......................max_depth=7, n_estimators=17; total time=   0.3s
[CV] END .......................max_depth=3, n_estimators=30; total time=   0.4s
[CV] END .......................max_depth=3, n_estimators=30; total time=   0.4s
[CV] END .......................max_depth=7, n_estimators=38; total time=   0.8s
[CV] END .......................max_depth=7, n_estimators=38; total time=   0.8s
[CV] END .......................max_depth=7, n_estimators=38; total time=   0.9s
[CV] END .......................max_depth=7, n_estimators=20; total time=   0.7s
[CV] END .......................max_depth=7, n_estimators=28; total time=   0.8s
[CV] END .......................max_depth=7, n_estimators=28; total time=   0.9s
[CV] END ...................

In [24]:
print(confusion_matrix(y_combined, best_rf_bin_classifier.predict(X_combined_scale)))
print(f'The whole accuracy here is: {accuracy_score(y_combined, best_rf_bin_classifier.predict(X_combined_scale))}')
print(f'The whole F1 score is: {f1_score(y_combined, best_rf_bin_classifier.predict(X_combined_scale))}')

# Perform three-fold cross-validation
cv_scores_rf = cross_val_score(best_rf_bin_classifier, X_combined_scale, y_combined, cv=3, scoring='f1')
print(f'The scores of each three-fold validation: {cv_scores_rf}')
print(f'The mean score is: {np.mean(cv_scores_rf)}')

# Add the results of this run to the DataFrame
results_df.loc[len(results_df)] = ['RF', *cv_scores_rf, np.mean(cv_scores_rf)]

# Display the DataFrame
print(results_df)

[[709   1]
 [  9 710]]
The whole accuracy here is: 0.9930020993701889


The whole F1 score is: 0.993006993006993
The scores of each three-fold validation: [0.7375     0.71914894 0.73360656]
The mean score is: 0.7300851645157541
              Model  Fold 1 F1 Score  Fold 2 F1 Score  Fold 3 F1 Score  \
0  Ridge Classifier         0.706122         0.689796         0.692464   
1               kNN         0.787402         0.768612         0.785571   
2                RF         0.737500         0.719149         0.733607   

   Mean F1 Score  
0       0.696128  
1       0.780528  
2       0.730085  


Gradient Boosting

In [27]:
from sklearn.ensemble import GradientBoostingClassifier
# Instantiate the gradient boosting model for classification
gb_classifier = GradientBoostingClassifier()  # Default settings; tuned by the grid search below

# Define the parameter grid to search over for gradient boosting
param_grid = {
    'n_estimators': [2, 5, 7, 10],
    'learning_rate': [0.1, 0.01, 0.001]
}
gb_classifier_gs = GridSearchCV(gb_classifier, param_grid, cv=3, scoring='f1', n_jobs=-1)

# Fit GridSearchCV to the scaled data
gb_classifier_gs.fit(X_combined_scale, y_combined)

# Print the best parameters and the corresponding score
print("Best parameters:", gb_classifier_gs.best_params_)

# If you want to use the best model found by GridSearchCV
best_gb = gb_classifier_gs.best_estimator_

# Fit the model to your scaled training data
best_gb.fit(X_combined_scale, y_combined)

Best parameters: {'learning_rate': 0.1, 'n_estimators': 10}


In [28]:
print(confusion_matrix(y_combined, best_gb.predict(X_combined_scale)))
print(f'The whole accuracy here is: {accuracy_score(y_combined, best_gb.predict(X_combined_scale))}')
print(f'The whole F1 score is: {f1_score(y_combined, best_gb.predict(X_combined_scale))}')

# Perform three-fold cross-validation
cv_scores_gb = cross_val_score(best_gb, X_combined_scale, y_combined, cv=3, scoring='f1')
print(f'The scores of each three-fold validation: {cv_scores_gb}')
print(f'The mean score is: {np.mean(cv_scores_gb)}')

# Add the results of this run to the DataFrame
results_df.loc[len(results_df)] = ['GB', *cv_scores_gb, np.mean(cv_scores_gb)]

# Display the DataFrame
print(results_df)

[[576 134]
 [167 552]]
The whole accuracy here is: 0.7893631910426872
The whole F1 score is: 0.7857651245551601
The scores of each three-fold validation: [0.69333333 0.69787234 0.70292887]
The mean score is: 0.6980448480172509
              Model  Fold 1 F1 Score  Fold 2 F1 Score  Fold 3 F1 Score  \
0  Ridge Classifier         0.706122         0.689796         0.692464   
1               kNN         0.787402         0.768612         0.785571   
2                RF         0.737500         0.719149         0.733607   
3                GB         0.693333         0.697872         0.702929   

   Mean F1 Score  
0       0.696128  
1       0.780528  
2       0.730085  
3       0.698045  


Performance of each model

In [29]:
# Find the index of the row with the highest mean F1 score
max_f1_index = results_df['Mean F1 Score'].idxmax()

# Print the row with the highest mean F1 score
print(results_df.loc[max_f1_index])


Model                   kNN
Fold 1 F1 Score    0.787402
Fold 2 F1 Score    0.768612
Fold 3 F1 Score    0.785571
Mean F1 Score      0.780528
Name: 1, dtype: object


In [None]:
# Apply the winning model (kNN) to the held-out data (the EMNIST test split)
y_test = raw_test['label_cat_code']
y_pred_knn = best_knn.predict(X_test_scale)  # predict once; kNN inference is slow on ~116k rows

print(confusion_matrix(y_test, y_pred_knn))
print(f'The whole accuracy here is: {accuracy_score(y_test, y_pred_knn)}')
print(f'The whole F1 score is: {f1_score(y_test, y_pred_knn)}')