## **SHIFTKEY x ACENET: Machine Learning Basics**
**Module 2: Implementations of Machine Learning -- Classification**




Here we will be doing a classification analysis on the Iris dataset.

Read about the dataset here:

https://scikit-learn.org/1.4/auto_examples/datasets/plot_iris_dataset.html

https://en.wikipedia.org/wiki/Iris_flower_data_set


We will walk through:

1. Importing relevant libraries
2. Loading the dataset
3. Exploratory data analysis
4. Data preprocessing
5. Define the model
6. Fit model and make predictions
7. Evaluate model
8. Use cross validation for better statistics
9. Hyperparameter tuning to optimize model
10. Feature importance to get insights on model

**1. Import libraries**

First, we import the libraries that we will need for our analysis. When doing your own analysis you might add these as you go along, which is fine. When you do, it is common practice to keep them all organized at the top of your code.

An overview of some of these libraries:

**Matplotlib:** Plotting.

**Seaborn:** More plotting.

**Pandas:** Manipulation of dataframes.

**NumPy:** Create arrays, use mathematical functions.

**Scikit-learn:** Machine learning.



In [None]:
# IMPORTS

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

from sklearn.datasets import load_iris

**2. Loading the dataset**

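Now we load the Iris dataset. For practice purposes, we are using a small, well-known dataset (the classic version also ships with ```sklearn``` as ```load_iris```), so that we can focus on the machine learning. In practice, loading and cleaning your own data will be more involved.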

We will use the Iris extended dataset which can be downloaded here:

https://www.kaggle.com/datasets/samybaladram/iris-dataset-extended?resource=download&select=iris_extended.csv

In [None]:
# iris = load_iris(as_frame=True)
local = True

if not local:
    from google.colab import files
    uploaded = files.upload()

    # download here: https://www.kaggle.com/datasets/samybaladram/iris-dataset-extended?resource=download
    fn = 'iris_extended.csv'
else:
    fn = './iris_extended.csv'

data = pd.read_csv(fn)

# create a target column
species_mapping = {
    "setosa": 0,
    "versicolor": 1,
    "virginica": 2
}
data["target"] = data["species"].map(species_mapping)

In [None]:
# look at the columns, examine the top 10 rows of the dataset

In [None]:
# select features i.e. the columns that we want to use

**3. Exploratory data analysis**

Now we do Exploratory Data Analysis, also known as EDA, to examine our dataset. It is very important to understand your dataset in order to conduct a meaningful analysis.
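A minimal sketch of a few numeric EDA checks is given below. As an assumption, it uses the classic 150-row Iris data bundled with ```sklearn``` as a stand-in for `iris_extended.csv`, so the column names will differ from the extended file; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the classic Iris data bundled with scikit-learn stands in
# for iris_extended.csv so that this sketch runs without a download.
iris = load_iris(as_frame=True)
df = iris.frame  # four measurement columns plus an integer "target" column

# size of the dataset and summary statistics of every column
print(df.shape)
print(df.describe())

# number of samples per class (checks the class balance)
class_counts = df["target"].value_counts().sort_index()
print(class_counts)

# mean of each feature per class: large differences hint at useful features
print(df.groupby("target").mean())

# correlation matrix of the features only
corr = df.drop(columns="target").corr()
print(corr.round(2))
```

The same dataframe can be passed to `sns.pairplot(df, hue="target")` for the pairplot, and `corr` to `sns.heatmap(corr, annot=True)` for a plot of the correlation matrix.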

In [None]:
# exploratory data analysis, plot the distributions, the pairplot, the correlation matrix

# distributions


# pairplot


# correlation matrix



**4. Data preprocessing**

Now we do our data preprocessing, where we prepare the data to be input to the model. Since we are using the Iris dataset there is not too much to do in this particular instance, but typically this involves:

**Imputation:** Filling in missing values through various methods.
https://scikit-learn.org/stable/modules/impute.html

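**Scaling:** Put the features of the data on a comparable scale, for example by rescaling each one to a fixed range such as 0 to 1 (```MinMaxScaler```) or by standardizing to zero mean and unit variance (```StandardScaler```).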
https://scikit-learn.org/stable/modules/preprocessing.html

ONE METHOD (of many) is to use the ```sklearn``` class called ```Pipeline```. It chains the preprocessing steps together with the model, so the same transformations are applied consistently whenever the model is fit or used for prediction. When using ```sklearn``` machine learning models I like to use this. If you want to use a different library, for example if you are doing deep learning and using ```PyTorch``` to create a neural network, you will need to apply the preprocessing transformations to your data manually. If you would like to learn more about this, keep an eye out for ACENET's upcoming training sessions!

NOTE:

We use $X$ to represent our inputs aka the features.

We use $y$ to represent our output aka the target.
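One possible way to set up the imputation and scaling steps is sketched below. It assumes the bundled ```sklearn``` Iris data as a stand-in for the extended CSV, a median imputation strategy, and one artificially removed value so that the imputer has something to do; none of these choices come from the workshop itself.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data as a stand-in for iris_extended.csv
iris = load_iris(as_frame=True)
X = iris.data.copy()  # inputs (features)
y = iris.target       # output (target)

# remove one value on purpose, only to demonstrate the imputer
X.iloc[0, 0] = np.nan

# names of all numeric feature columns
numeric_features = X.select_dtypes(include="number").columns.tolist()

# numeric transformer: fill missing values, then standardize
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# preprocessor: apply the numeric transformer to the numeric columns
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
])

# fit on the full data here only to inspect the result
X_prepared = preprocessor.fit_transform(X)
print(np.isnan(X_prepared).any())
print(X_prepared.mean(axis=0).round(6))
print(X_prepared.std(axis=0).round(6))
```

In the actual workflow the preprocessor is not fit on its own like this. It is placed inside the ```Pipeline``` with the model, so that it only ever learns its statistics from the training data.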

In [None]:
# data preprocessing


# set up X and y



# get numeric features



# set up numeric transformer



# put together into preprocessor object


**5. Define the model**

Now we define our classifier model. Keep an eye on the parameters we select; these are the ones we tune later.

We then use the ```sklearn.pipeline.Pipeline``` class to chain together our preprocessor and our classifier model.
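As a sketch, the classifier and the pipeline could be defined as follows. The classifier parameters are the ones quoted later in the hyperparameter tuning section; the four column names and the step names are assumptions for illustration and should be replaced by the columns you actually selected.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: placeholder column names, adjust to your own feature columns
numeric_features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# preprocessing for the numeric columns (imputation, then scaling)
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
])

# define the classifier
classifier = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_depth=None,       # grow each tree until all leaves are pure
    min_samples_split=2,  # minimum samples required to split a node
    random_state=1997,    # ensures reproducibility
)

# link the preprocessor and the classifier in the pipeline
iris_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", classifier),
])
print(iris_model)
```

Naming the steps matters later: the fitted classifier can be reached as `iris_model.named_steps["classifier"]`, and its parameters are addressed in a grid search with the prefix `classifier__`.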

In [None]:
# select the model: we will use a random forest classifier

# define classifier


# link the preprocessor and the model in the pipeline


Now, we split the dataset into training and testing subsets. This is because we want to use a portion of the data to "fit" the model, and a **SEPARATE** portion of the data to test and evaluate the model.

**IT IS EXTREMELY IMPORTANT THAT THE TRAINING AND TESTING SUBSETS DO NOT MIX, OR YOUR EVALUATION OF THE MODEL IS COMPLETELY INVALID!**
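A short sketch of such a split is shown here, assuming the bundled ```sklearn``` Iris data as a stand-in, an 80/20 split, and stratification by class. The split fraction and the `random_state` value are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data as a stand-in for iris_extended.csv
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% of the rows for testing
    stratify=y,         # keep the class proportions equal in both subsets
    random_state=1997,  # make the split reproducible
)
print(X_train.shape, X_test.shape)

# sanity check: no row appears in both subsets
overlap = set(X_train.index) & set(X_test.index)
print(len(overlap))
```

Checking that the two index sets do not overlap is a cheap way to confirm that the training and testing subsets have not mixed.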

In [None]:
# now split the data for fitting



**6. Fit model and make predictions**

Now we fit the model. Since we used the ```Pipeline```, this is very simple: we can just call ```iris_model.fit``` and it will automatically apply the preprocessing and then fit the model to learn its parameters. Remember that you **fit** with the **training** data.
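The fit step might look like the sketch below. To keep it self-contained it assumes the bundled ```sklearn``` Iris data and a simplified pipeline with only a scaler as the preprocessing step; the name `iris_model` follows the text above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data as a stand-in for iris_extended.csv
iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.2, stratify=iris.target, random_state=1997,
)

# simplified pipeline: scaling followed by the classifier
iris_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=1997)),
])

# fit with the TRAINING data only; the scaler is fit on it as well
iris_model.fit(X_train, y_train)

# predict on the TEST data; the fitted scaler is applied automatically
y_pred = iris_model.predict(X_test)
accuracy = iris_model.score(X_test, y_test)
print(accuracy)
```

Because the scaler lives inside the pipeline, its mean and variance are learned from the training rows only, which keeps information from the test subset out of the fit.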

In [None]:
# now fit the model



**7. Evaluate model**

Now that we have fit the model, we can make predictions on the testing data and evaluate the model. Recall the precision, recall, and F1-score from Module 1 last week.

We also plot the confusion matrix, which summarizes the performance of a classification model by comparing the true labels to the predicted labels. This gives a more detailed evaluation of the model than a single number can. Ideally the largest values in the matrix are on the diagonal. Large values concentrated on one side of the diagonal (the upper or lower part of the matrix) could indicate a systematic bias in the classifier.
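To make the layout of these metrics concrete, here is a small sketch on made-up labels (nine samples invented for illustration, not results from the Iris model). Rows of the confusion matrix are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import classification_report, confusion_matrix

species_names = ["setosa", "versicolor", "virginica"]

# Made-up labels purely for illustration (0=setosa, 1=versicolor, 2=virginica)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 1]

# rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# precision, recall and F1-score per class
print(classification_report(y_true, y_pred, target_names=species_names))
```

In this toy example one versicolor sample is predicted as virginica and one virginica sample as versicolor, so the two off-diagonal entries appear in the lower right block. With real predictions, `ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=species_names).plot()` draws the same matrix as a figure.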

In [None]:
# now that the model is fit, predict the testing data and do evaluation metrics

# calculate predictions on the TEST data

# calculate the classification report and confusion matrix
species_names = ['setosa','versicolor','virginica']


**7.1 Feature importance**

Now we calculate the feature importance. In supervised learning tasks like this classification, not all of the input features contribute equally to the outcome. Some features will be stronger predictors of the target, and others may be weaker predictors or redundant.

Understanding the feature importance in your model can help with interpreting the model, as you can see how each feature affects the output. This could be very useful in a science experiment, for example, as you could see which measurements affect the outcome the most, which can help to inform future studies.
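A sketch of reading the importances from a random forest follows. It assumes the bundled ```sklearn``` Iris data and a bare classifier fit on all rows, purely to show the mechanics; the numbers it prints are not the workshop's results.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: bundled Iris data as a stand-in for iris_extended.csv
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

classifier = RandomForestClassifier(n_estimators=100, random_state=1997)
classifier.fit(X, y)

# match each importance with its feature name and sort, largest first
feature_importance_df = pd.DataFrame({
    "Feature": X.columns,
    "Importance": classifier.feature_importances_,
}).sort_values(by="Importance", ascending=False)

print(feature_importance_df)
```

These are impurity-based importances, which sum to 1 across the features. When the classifier sits inside a pipeline, the same attribute is reached through `named_steps["classifier"].feature_importances_`.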

In [None]:
# FEATURE IMPORTANCE

# get feature importance from classifier model

# match feature importance with feature names

# sort by importance

# print top features

# plot feature importance

**8. Cross validation for better statistics**

Previously we split the data into a single pair of training and validation (testing) subsets: the model was fit with the training data and then evaluated on the validation data.

This is a valid method, but keep in mind that the performance metrics we get from one split (accuracy, confusion matrix, etc.) are only one possible estimate. These metrics could change if we split the data differently!

For example, let's say our data consists of `A,B,C,D,E`. If we use `A,B,C,D` for training and then `E` for validation, that will give us a set of performance metrics.

What if we did the analysis 5 times with the following combinations:

Training: `A,B,C,D`, Validation: `E`

Training: `A,B,C,E`, Validation: `D`

Training: `A,B,D,E`, Validation: `C`

Training: `A,C,D,E`, Validation: `B`

Training: `B,C,D,E`, Validation: `A`

And then averaged the performance metrics from each of these experiments? This would give us a more robust idea of the performance of the model that is less dependent on the specific training and validation split that we used.
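The idea can be sketched in two parts: first the five splits of `A,B,C,D,E` generated with `KFold`, then `cross_val_score` applied to a model. The bundled ```sklearn``` Iris data and the bare random forest are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Part 1: the five train/validation combinations of the toy data
chunks = ["A", "B", "C", "D", "E"]
validation_folds = []
for train_idx, val_idx in KFold(n_splits=5).split(chunks):
    training = [chunks[i] for i in train_idx]
    validation = [chunks[i] for i in val_idx]
    validation_folds.append(validation)
    print("Training:", training, "Validation:", validation)

# Part 2: the same procedure on real data, one score per fold
# Assumption: bundled Iris data as a stand-in for iris_extended.csv
iris = load_iris(as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=1997)
scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring="accuracy")

print(scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the mean together with the standard deviation of the fold scores shows both the typical performance and how much it depends on the particular split.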

In [None]:
# now use cross_val_score to fit and predict the model over multiple train/test splits



# print results

**9. Hyperparameter tuning**

Remember we defined our classifier model like this:

classifier = RandomForestClassifier(
    n_estimators=100,            # number of trees in the forest
    max_depth=None,              # maximum depth of each tree (None = expand until all leaves are pure)
    min_samples_split=2,         # minimum number of samples required to split an internal node
    random_state=1997            # ensures reproducibility
)

How do we choose hyperparameters like the number of estimators or the max depth? We can guess, run the model, and look at the performance metrics, then make slightly different guesses and repeat. Hyperparameter tuning is a systematic way of automating this process.

We use grid search parameter tuning, where we define a grid of parameters to test, and every combination is then tested.
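To see what "every combination" means, the sketch below enumerates a small grid by hand. The grid values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical grid values, chosen only to illustrate the idea
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 4],
}

# every combination of one value per parameter
names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*param_grid.values())
]

print(len(combinations))  # 3 * 3 * 2 combinations
print(combinations[0])
```

The number of combinations is the product of the list lengths, so it grows quickly as parameters are added. Each combination is then fit once per cross validation fold.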

The `sklearn` class `GridSearchCV` combines the grid search for hyperparameter tuning with cross validation (you set the number of folds with the parameter `cv`). From the results you can then retrieve the best model (the `best_estimator_` attribute).
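A minimal sketch of a grid search over a pipeline is given below. It assumes the bundled ```sklearn``` Iris data, a simplified pipeline with only a scaler, and a deliberately tiny grid so that it runs quickly; the grid values are illustrative, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data as a stand-in for iris_extended.csv
iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.2, stratify=iris.target, random_state=1997,
)

iris_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=1997)),
])

# parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "classifier__n_estimators": [25, 100],
    "classifier__max_depth": [None, 3],
}

# 5-fold cross validation for each of the 4 combinations
grid_search = GridSearchCV(iris_model, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

# the best pipeline, refit on all of the training data
best_model = grid_search.best_estimator_
print(best_model.score(X_test, y_test))
```

The search only ever sees the training subset, so the test subset remains untouched for the final evaluation of `best_model`.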

There are also other methods for hyperparameter tuning, like random search, where you can define a larger grid and, instead of trying every possible combination, combinations are chosen at random. There is also Bayesian search, where a statistical model of the results so far is used to decide which combination of hyperparameters to try next.

In [None]:
# hyperparameter tuning to find the best model

# set up param grid


# use that with grid search


# print best params and corresponding score

In [None]:
# now do predictions with the best model and look at the confusion matrix again


In [None]:
# FEATURE IMPORTANCE AGAIN

# # get feature importance from best random forest model
# feature_importance = best_model.named_steps['classifier'].feature_importances_

# # match feature importance with feature names
# feature_names = X_train.columns
# feature_importance_df = pd.DataFrame({'Feature': feature_names,
#                                       'Importance': feature_importance})

# # sort by importance
# feature_importance_df = feature_importance_df.sort_values(by='Importance',ascending=False)

# # print top features
# print(feature_importance_df)

# # plot feature importance
# plt.figure(figsize=(10, 6))
# plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='skyblue')
# plt.xlabel("Feature Importance")
# plt.ylabel("Features")
# plt.title("Feature Importance on Predicting Iris Species in RandomForestClassifier Model")
# plt.gca().invert_yaxis()  # Invert axis to show highest importance at the top
# plt.show()