# Lecture 11: Classifier Model Evaluation Metrics
These notes build on the code we wrote in Lecture 10.

![Machine learning diagram](./ml_diagram.png)

### First, set up a virtual environment (venv)

In the terminal (or command-line interface), run this command to create a virtual environment: `python -m venv venv` or `python3 -m venv venv`.

To activate the virtual environment on a Mac, run the command: `source venv/bin/activate` <br>
To activate the virtual environment on a PC with Git Bash, run the command: `source venv/Scripts/activate`

Check which Python environment you are in using the command: `which python`

Virtual environments let you install different versions of Python packages for different projects on the same computer. I recommend setting up a virtual environment before installing Python packages.

### Install the dependencies

The packages required to run this repository are listed in `requirements.txt`.

To install them, run the command:

`pip install -r requirements.txt`

### Set up imports

In [None]:
# import pandas, matplotlib, and the necessary functions from scikit-learn
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

### Load dataset

In [None]:
# Load the dataset using pandas and the read_csv function
thyroid_df = pd.read_csv('data/thyroid_data.csv')

thyroid_df.head()


In [None]:
# inspect the shape of the dataframe
thyroid_df.shape

In [None]:
# inspect the fraction of rows where Recurred is 'Yes' or 'No' using the value_counts method
thyroid_df['Recurred'].value_counts(normalize=True)

### Data preprocessing (one hot encoding)
Transform categorical features into an ML-compatible format.

In [None]:
# make a list of the column names of the categorical features
columns_to_exclude = ['Age', 'Recurred']

categorical_columns = [col for col in thyroid_df.columns if col not in columns_to_exclude]

categorical_columns


In [None]:
# Define a scikit-learn ColumnTransformer that will encode the categorical columns using the OneHotEncoder
column_transformer = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), categorical_columns)
    ],
    remainder='passthrough'  # This leaves the excluded columns unchanged
)

# Apply the transformer to the DataFrame using the fit_transform method
transformed_data = column_transformer.fit_transform(thyroid_df)

type(transformed_data)

In [None]:
# get the names of the encoded features using the get_feature_names_out function
encoded_feature_names = column_transformer.named_transformers_['encoder'].get_feature_names_out(categorical_columns)

encoded_feature_names

In [None]:
# make a list of all the feature names to use as column names in the DataFrame
all_feature_names = list(encoded_feature_names) + ['Age', 'Recurred']

all_feature_names

In [None]:
# create a dataframe using the transformed data with column names from the all_feature_names list
transformed_df = pd.DataFrame(transformed_data, columns=all_feature_names)

transformed_df.head()


In [None]:
# inspect the shape of the transformed dataframe
transformed_df.shape

### Data preprocessing (split data into training and testing data)

In [None]:
# split the transformed dataframe into a features dataframe and target series
X = transformed_df.drop('Recurred', axis=1)

y = transformed_df['Recurred']


# inspect feature dataframe
X.head()



In [None]:
# inspect target series
y.head()

In [None]:
# split into training and testing sets. The training set should have 283 samples, and the testing set should have 100 samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=100, random_state=42)

In [None]:
# validate that the training and testing sets have the correct number of samples
X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Using the training data to perform grid search to find the best hyperparameters

In [None]:
# Define the parameter grid
param_grid = {
    'class_weight': ['balanced'],  # Kept fixed
    'criterion': ['gini', 'entropy'],  # Compare both split criteria
    'max_depth': [2, 4, 8],  # Exploring values around 4
    'max_features': ['sqrt'],  # Kept fixed
    'n_estimators': [30, 60, 90]  # Exploring values around 60
}


In [None]:
# Initialize the RandomForestClassifier
rf_classifier = RandomForestClassifier()

# Initialize the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=2, scoring='accuracy')

# Fit the GridSearchCV object to your data (X_train, y_train)
grid_search.fit(X_train, y_train)



In [None]:
# Get the best parameters found by GridSearchCV from the best_params_ attribute
best_params = grid_search.best_params_

best_params

In [None]:
# Get the mean cross-validated accuracy of the best parameter combination
best_score = grid_search.best_score_

best_score

### Train a model using the best set of hyperparameters

In [None]:
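# Initialize a new RandomForestClassifier model with the best parameters.
# These values are hard-coded; if your grid search returned different
# best_params, you can pass **best_params instead.
final_model = RandomForestClassifier(
    class_weight='balanced',
    criterion='entropy',
    max_depth=8,
    max_features='sqrt',
    n_estimators=90,
)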


# Fit the model on your training data
final_model.fit(X_train, y_train)

### Test the model with the testing data

In [None]:
# use the score method to get the accuracy of the model on the testing data
final_model.score(X_test, y_test)

### Plot the feature importances

In [None]:
# create a dataframe of the feature importances
feature_importances = final_model.feature_importances_

feature_importances_df = pd.DataFrame({'feature': X.columns, 'importance': feature_importances})

# sort the dataframe so the most important features are at the top
feature_importances_df.sort_values('importance', ascending=False, inplace=True)

feature_importances_df.head()


In [None]:
# plot the feature importances as a bar chart using .plot.bar()
feature_importances_df.plot.bar(x='feature', y='importance', rot=90, figsize=(13, 6))

# Start Lecture 11
### Let's look at the confusion matrix
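
As a warm-up for the cells that follow, here is a minimal sketch of a confusion matrix built from a handful of made-up labels. The `y_true_toy` and `y_pred_toy` lists are hypothetical; with the real model you would presumably use `y_test` and the model's predictions on `X_test` instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (an assumption for illustration)
y_true_toy = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
y_pred_toy = ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["No", "Yes"])

# With "Yes" as the positive class, ravel() unpacks the 2x2 matrix row by row
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Passing `labels` explicitly fixes the row and column order, so it is unambiguous which cell holds the true positives.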

In [None]:
# look at the confusion matrix for the random forest model


In [None]:
# look at the values of the confusion matrix


### Get the associated evaluation metrics
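
The metrics in this section are all ratios of the four confusion-matrix counts. The sketch below computes them by hand from hypothetical toy labels (not the thyroid results) and cross-checks three of them against scikit-learn's built-in functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels (an assumption for illustration); "Yes" is the positive class
y_true_toy = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
y_pred_toy = ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=["No", "Yes"]).ravel()

sensitivity = tp / (tp + fn)  # true positive rate, recall
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)    # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean

# Cross-check against scikit-learn's implementations
recall_sk = recall_score(y_true_toy, y_pred_toy, pos_label="Yes")
precision_sk = precision_score(y_true_toy, y_pred_toy, pos_label="Yes")
f1_sk = f1_score(y_true_toy, y_pred_toy, pos_label="Yes")

print(sensitivity, specificity, precision, npv, f1)
```

scikit-learn has no dedicated specificity function, but specificity is the recall of the negative class, so `recall_score(..., pos_label="No")` gives the same number.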

In [None]:
# get sensitivity (true positive rate, recall) from the confusion matrix


In [None]:
# get specificity (true negative rate) from the confusion matrix


In [None]:
# get precision (positive predictive value) from the confusion matrix


In [None]:
# get the negative predictive value from the confusion matrix


In [None]:
# calculate the F1 score


In [None]:
# look at the classification report


In [None]:
# look at area under the ROC curve


In [None]:
# look at y_pred_proba


In [None]:
# plot the ROC curve


In [None]:
# look at the thresholds
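
The ROC cells above can be sketched on toy data as follows. The labels and scores are hypothetical; with a fitted classifier, the scores would typically be the positive-class column of `predict_proba`, for example `final_model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(y_true_toy, y_score_toy)

# Area under the ROC curve: 0.5 is chance level, 1.0 is perfect ranking
roc_auc = roc_auc_score(y_true_toy, y_score_toy)

print("FPR:", fpr)
print("TPR:", tpr)
print("Thresholds:", thresholds)
print("AUC:", roc_auc)
```

Plotting `tpr` against `fpr` (for example with `plt.plot(fpr, tpr)`) draws the curve. The first threshold is a placeholder above the highest score, so that the curve starts at the point (0, 0).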


### Train K-nearest neighbors and Support Vector Machine Models

![image.png](metric_table.png)

### Train a k-nearest neighbors model and look at metrics
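
The overall workflow for this section can be sketched on synthetic data as follows. The data comes from `make_classification`, and `n_neighbors=5` is scikit-learn's default rather than a value taken from the paper, so treat both as assumptions; in the cells below you would use `X_train`, `X_test`, `y_train`, and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (an assumption for illustration)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=50, random_state=42)

# Train a KNN model; n_neighbors=5 is the library default, not a tuned value
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)

# Hard class predictions and positive-class probabilities
y_pred_knn = knn.predict(X_te)
y_proba_knn = knn.predict_proba(X_te)[:, 1]

# Per-class precision, recall, and F1, plus area under the ROC curve
report_knn = classification_report(y_te, y_pred_knn)
roc_auc_knn = roc_auc_score(y_te, y_proba_knn)

print(report_knn)
print("ROC AUC:", roc_auc_knn)
```

Note that the classification report uses the hard predictions, while the ROC AUC needs the probabilities.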

In [None]:
# train a KNN model


In [None]:
# get the predictions


In [None]:
# get the classification report


In [None]:
# look at area under the ROC curve


In [None]:
# plot the ROC curve


# Activity

### Train a support vector machine classifier model and look at metrics
Use the hyperparameters `C=0.01`, `class_weight='balanced'`, `kernel='linear'`, and `gamma='scale'` to match what was used in the paper.

In [None]:
# train an SVM classifier model
from sklearn.svm import SVC


In [None]:
# get the predictions


In [None]:
# get the classification report


In [None]:
assert report_svm[75:78] == '.83'

In [None]:
# look at area under the ROC curve


In [None]:
assert roc_auc_svm > 0.921
assert roc_auc_svm < 0.922

In [None]:
# plot the ROC curve
