<a href="https://colab.research.google.com/github/carighi/al_ml_workshop/blob/main/Model_Building_and_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Model Building and Evaluation**
By now, you have learned how to clean and process a dataset and its feature set. In this tutorial, we will look at different types of algorithms and the metrics used to evaluate their performance.

Adapted from Wafiq Syed 2020 [How to use Scikit-Learn Datasets for Machine Learning](https://towardsdatascience.com/how-to-use-scikit-learn-datasets-for-machine-learning-d6493b38eca3) and Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

[Scikit-learn](https://scikit-learn.org/stable/) is a great package for predictive analysis in Python. In code, the package is called `sklearn`, and you import specific modules from it using `from sklearn import <module>`.
For creating graphs and plots we will use the `matplotlib.pyplot` module.


In [None]:
# imports necessary libraries
from sklearn import datasets, metrics
import matplotlib.pyplot as plt

# sets the backend of matplotlib to the 'inline' backend: With this backend,
# the output of plotting commands is displayed inline within frontends like
# the Jupyter notebook, directly below the code cell that produced it.
# The resulting plots will then also be stored in the notebook document.
%matplotlib inline

## Classification Example

In this example, we’ll be working with the [“Breast Cancer Wisconsin” dataset](https://colab.research.google.com/drive/18WQntcxOb2dqanbT_EQaJ-q11wtJgdAx?authuser=1#scrollTo=5yqLtP6IhsUC). We will import the data and understand how to read it. We will also build a simple ML model that is able to classify cancer scans either as malignant or benign.

### Import “Breast Cancer Wisconsin” dataset

We will obtain data from [sklearn.datasets](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer).

When using sklearn datasets, each dataset has a corresponding function used to load it. These functions follow the same format: `load_DATASET()`, where DATASET refers to the name of the dataset.
The output is a Bunch object (a dictionary-like container).

In our previous exercises we used !wget to get the data from some repository or website.

In [None]:
from sklearn import datasets
data = datasets.load_breast_cancer()

These load functions (such as *load_breast_cancer()*) don't return data in a tabular format. They return a **Bunch** object, Scikit-Learn's name for a dictionary-like object whose keys can also be accessed as attributes.

Let's look at its keys.

In [None]:
print(data.keys())

The keys include:

* **data** is all the feature data (the attributes of the scan that help us identify if the tumor is malignant or benign, such as radius, area, etc.) in a NumPy array
* **target** is the target data (the variable you want to predict, in this case whether the tumor is malignant or benign) in a NumPy array
* **feature_names** are the names of the feature variables, in other words the names of the columns in data
* **target_names** is the name(s) of the target variable(s), in other words the name(s) of the target column(s)
* **DESCR**, short for DESCRIPTION, is a description of the dataset
* **filename** is the location of the CSV file that holds the data

It’s important to note that all Scikit-Learn datasets are divided into `data` and `target`. `data` holds the features, which are the variables that help the model learn how to predict. `target` holds the actual labels. In our case, the target is a single column that classifies each tumor as either 0 (malignant) or 1 (benign).
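
To make the `data`/`target` split concrete, here is a small sketch that inspects both arrays. The use of `np.bincount` to count the labels is our own addition and not part of the original tutorial.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# A Bunch supports dictionary-style and attribute-style access to the same object.
print(data["data"] is data.data)

# Feature matrix: one row per sample, one column per feature.
print("features:", data.data.shape)
# Target vector: one label per sample.
print("target:", data.target.shape)

# Count how many samples carry each label (index 0 and index 1 of target_names).
counts = np.bincount(data.target)
for name, count in zip(data.target_names, counts):
    print(name, count)
```

The counts show that the two classes are not equally represented, which is worth keeping in mind when we interpret accuracy later.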

Let's take a look at the description of the dataset.

In [None]:
print(data.DESCR)

### Working with the Dataset

We can use *pandas* to explore the dataset.

In [None]:
# Import pandas
import pandas as pd
# Build the DataFrame, first using the feature data
df = pd.DataFrame(data.data, columns=data.feature_names)
# Add a target column, and fill it with the target data. 0 encodes malignant and 1 encodes benign.
df['target'] = data.target
# Show the first five rows
df.head()

To see a summary of this dataset (column names, non-null counts, and data types), run

In [None]:
df.info()

### Let's do some machine learning (ML)

Let’s build a model that classifies cancer tumors as malignant (spreading) or benign (non-spreading). This will show you how to use the data for your own models. We’ll try two simple models: a support vector machine and K-Nearest Neighbors.

First, let’s split the dataset into two, one for training the model — giving it data to learn from, and the second for testing the model — seeing how well the model performs on data (scans) it hasn’t seen before.

In [None]:
# Store the feature data
X = data.data
# store the target data
y = data.target
# split the data using Scikit-Learn's train_test_split
from sklearn.model_selection import train_test_split
# split is 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('size of train and test datasets', X_train.shape, X_test.shape)

This gives us two datasets, one for training and one for testing. Let’s move on to training the model.
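
Because the two classes are not equally frequent, it can be useful to check that the split keeps roughly the same class ratio in both subsets. The sketch below uses the `stratify` argument of `train_test_split`, which is an optional extra that this tutorial's split does not use.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Assumption: stratify=y is added here for illustration; it keeps the
# malignant/benign ratio nearly identical in the train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Because the labels are 0/1, the mean equals the fraction of label 1 (benign).
print("benign fraction, full data:", round(y.mean(), 3))
print("benign fraction, train:", round(y_train.mean(), 3))
print("benign fraction, test:", round(y_test.mean(), 3))
```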



## SVM model

Let's try classifying with a [support vector machine classifier](https://youtu.be/Y6RRHw9uN9o) first.

For classification we will use a linear SVM (`kernel="linear"`). We also need to set the regularization parameter `C`, which controls the trade-off between achieving a low training error and a low testing error, that is, the ability of the classifier to generalize to unseen data. `C` takes positive values. A very small value of `C` will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more training points. This sometimes helps to prevent overfitting to the training data and improves the model's ability to generalize. A large value of `C` penalizes misclassified training points more heavily, which gives a narrower margin that follows the training data more closely. So, by adjusting the value of `C`, you can control the balance between keeping the decision boundary smooth and minimizing the classification error on the training set.
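
To see the effect of `C` in practice, the following sketch fits a linear SVM for a few values and prints the training and test accuracy for each. The list of `C` values and the `StandardScaler` step are our own assumptions; scaling is added only so that the linear SVM converges quickly for large `C` and is not part of the tutorial's model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for C in [0.001, 0.025, 1, 100]:
    # Assumption: features are standardized first (not done in the tutorial cell).
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this value of C.
    results[C] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"C={C}: train accuracy={results[C][0]:.3f}, test accuracy={results[C][1]:.3f}")
```

A growing gap between training and test accuracy as `C` increases would be a sign that the model is starting to overfit.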


In [None]:
from sklearn.svm import SVC
# 'random_state' is kept for consistency with other cells; SVC only uses it when probability=True
svm = SVC(kernel="linear", C=0.025, random_state=42)
svm.fit(X_train, y_train)
#score returns the mean accuracy
svm.score(X_test, y_test)

## K nearest neighbor classifier
Let's now try a different algorithm: the [K nearest neighbor classifier](https://www.ibm.com/topics/knn). To apply this method, we need to import `KNeighborsClassifier` from `sklearn.neighbors`. The most important parameter is the number of neighbors (usually called k, and `n_neighbors` in scikit-learn). If you set `n_neighbors` to a small value, like 1 or 2, the model might be too sensitive to noise in the data. On the other hand, if you set `n_neighbors` to a large value, the neighborhood might include points from other classes, which could lead to misclassification.

A common practice is to try different values of n_neighbors (like 3, 5, 7, etc.) and see which one gives the best performance on your validation set. This process is known as hyperparameter tuning.
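
The tuning loop described above can be sketched as follows. Since this tutorial has no separate validation set, the sketch assumes 5-fold cross-validation on the training data as a stand-in; the candidate values are arbitrary examples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

candidates = [1, 3, 5, 7, 9, 11]  # example values, chosen for illustration
cv_scores = {}
for k in candidates:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation on the training data only; the test set stays untouched.
    cv_scores[k] = cross_val_score(knn, X_train, y_train, cv=5).mean()
    print(f"n_neighbors={k}: mean CV accuracy={cv_scores[k]:.3f}")

# Pick the value with the best cross-validated accuracy and evaluate it once on the test set.
best_k = max(cv_scores, key=cv_scores.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("best n_neighbors:", best_k, "test accuracy:", round(test_accuracy, 3))
```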

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# K nearest neighbor classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
# score returns the mean accuracy
knn.score(X_test, y_test)

**QUESTION:** Which method gives a better mean accuracy?

## Classification Model Evaluation Metrics

Let's check other metrics. We will repeat some steps from the previous section to keep this part self-contained.

In [None]:
# let’s first prepare train and test datasets to build our classification models.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 'random_state' is kept for consistency with other cells; SVC only uses it when probability=True
svm = SVC(kernel="linear", C=0.025, random_state=42)
svm.fit(X_train, y_train)

print('size of train and test datasets', X_train.shape, X_test.shape)
#score returns the mean accuracy
svm.score(X_test, y_test)

### Confusion Matrix

In [None]:
import pandas as pd

# Display the confusion matrix for the model predictions on the test dataset.
def display_confusion_matrix(true_labels, predicted_labels, classes=[1,0]):

    total_classes = len(classes)
    level_labels = [total_classes*[0], list(range(total_classes))]
    cm = metrics.confusion_matrix(y_true=true_labels, y_pred=predicted_labels,
                                  labels=classes)
    cm_frame = pd.DataFrame(data=cm,
                            columns=pd.MultiIndex(levels=[['Predicted:'], classes],
                                                  codes=level_labels),
                            index=pd.MultiIndex(levels=[['Actual:'], classes],
                                                codes=level_labels))
    print(cm_frame)


# predict on test data and view confusion matrix
y_pred = svm.predict(X_test)
display_confusion_matrix(true_labels=y_test, predicted_labels=y_pred, classes=[0, 1])

The matrix should be read as follows:

* 'Actual: 0, Predicted: 0': the model correctly predicted 60 instances where the actual class was '0'.
* 'Actual: 0, Predicted: 1': the model incorrectly predicted 3 instances as class '1' when they were actually class '0'.
* 'Actual: 1, Predicted: 0': the model incorrectly predicted 2 instances as class '0' when they were actually class '1'.
* 'Actual: 1, Predicted: 1': the model correctly predicted 106 instances where the actual class was '1'.

This matrix helps you understand where your model is making mistakes. It's especially useful for imbalanced datasets where accuracy alone can be misleading.

Specific example: out of the 63 observations with label 0 (malignant), our model has correctly predicted 60. Similarly, out of the 108 observations with label 1 (benign), our model has correctly predicted 106.

### True Positive, False Positive, True Negative and False Negative

In [None]:
# considering 1 as our positive class
positive_class = 1
# True Positive (TP): This is the count of the total number of instances from the
#positive class where the true class label was equal to the predicted class label.
TP = 106

# False Positive (FP): This is the count of the total number of instances from the
#negative class where our model misclassified them by predicting them as positive.
FP = 3

# True Negative (TN): This is the count of the total number of instances from the
# negative class where the true class label was equal to the predicted class label.
TN = 60

# False Negative (FN): This is the count of the total number of instances from the
# positive class where our model misclassified them by predicting them as negative.
FN = 2
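
The four counts above are typed in by hand. As a sketch of how they could be read from the confusion matrix instead, the example below uses a small set of made-up labels (not the tutorial's predictions) and treats class 1 as the positive class.

```python
from sklearn import metrics

# Toy labels for illustration only (hypothetical, not the SVM predictions).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so the flattened matrix reads TN, FP, FN, TP when class 1 is the positive class.
cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
TN, FP, FN, TP = cm.ravel()
print("TP:", TP, "FP:", FP, "TN:", TN, "FN:", FN)
```

The same unpacking should work with `y_test` and `y_pred` from the cells above, which avoids copying numbers from the printed matrix.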

### Accuracy

This is one of the most popular measures of classifier performance. It is defined as the overall
accuracy or proportion of correct predictions of the model. The formula for computing accuracy from the
confusion matrix is:

$Accuracy=\frac{TP+TN}{TP+FP+TN+FN}$

In [None]:
fw_acc = round(metrics.accuracy_score(y_true=y_test, y_pred=y_pred), 5)
mc_acc = round((TP + TN) / (TP + TN + FP + FN), 5)
print('Framework Accuracy:', fw_acc)
print('Manually Computed Accuracy:', mc_acc)

### Precision

Precision, also known as positive predictive value, is another metric that can be derived from
the confusion matrix. It is defined as the fraction of positive-class predictions that are actually
correct. The formula for precision is as follows:

$Precision=\frac{TP}{TP+FP}$

A model with high precision raises fewer false alarms than a model with lower precision: when it
predicts the positive class, it is usually right. Precision becomes important in cases where false
positives are costly and we want positive predictions to be trustworthy, even if some positive instances are missed.

In [None]:
fw_prec = round(metrics.precision_score(y_true=y_test, y_pred=y_pred), 5)
mc_prec = round((TP) / (TP + FP), 5)
print('Framework Precision:', fw_prec)
print('Manually Computed Precision:', mc_prec)

### Recall

Recall, also known as sensitivity, hit rate, or coverage, measures the ability of a model to identify
the relevant data points. It is defined as the fraction of instances of the positive class that were
correctly predicted. The formula for recall is:

$Recall=\frac{TP}{TP+FN}$

Recall becomes an important measure of classifier performance in scenarios where we want to catch
as many instances of a particular class as possible, even when it increases our false positives. For example,
consider the case of bank fraud: a model with high recall will flag a larger number of potential fraud cases,
some of them false alarms, but it will also raise the alarm for most of the truly suspicious cases.

In [None]:
fw_rec = round(metrics.recall_score(y_true=y_test, y_pred=y_pred), 5)
mc_rec = round((TP) / (TP + FN), 5)
print('Framework Recall:', fw_rec)
print('Manually Computed Recall:', mc_rec)

### F1-Score

There are some cases in which we want a balanced optimization of both precision and recall.
F1 score is a metric that is the harmonic mean of precision and recall and helps us optimize a classifier for
balanced precision and recall performance.
The formula for the F1 score is:

$F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$
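
To see why a harmonic mean is used rather than a plain average, consider this small sketch. The precision and recall values are hypothetical, and `f1_from` is a helper name introduced only for this illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (returns 0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier: very precise, but it misses most positive instances.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print("arithmetic mean:", round(arithmetic_mean, 3))
print("F1 (harmonic mean):", round(f1, 3))
```

The arithmetic mean looks respectable, while the F1 score stays close to the weaker of the two values, so a classifier cannot score well by excelling at only one of them.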

In [None]:
fw_f1 = round(metrics.f1_score(y_true=y_test, y_pred=y_pred), 5)
mc_f1 = round((2*mc_prec*mc_rec) / (mc_prec+mc_rec), 5)
print('Framework F1-Score:', fw_f1)
print('Manually Computed F1-Score:', mc_f1)
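
The cells above report each metric for the positive class only (label 1). As a possible shortcut, scikit-learn can report precision, recall, and F1 for every class at once. The sketch below uses made-up labels rather than the tutorial's predictions.

```python
from sklearn import metrics

# Toy labels for illustration only (hypothetical, not the SVM predictions).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# Text summary with one row per class; target_names follow the label order 0, 1.
print(metrics.classification_report(y_true, y_pred, target_names=["malignant", "benign"]))

# The same numbers as arrays, one entry per class in the order given by labels.
precision, recall, f1, support = metrics.precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)
print("precision per class:", precision)
print("recall per class:", recall)
```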

## Clustering Example

In this example, we will learn how to fit a clustering model on the “Breast Cancer Wisconsin” dataset. We will use a labeled dataset so that we can see the results of the clustering model and compare them with the actual labels. A point to remember here is that labeled data is usually not available in the real world,
which is why we turn to unsupervised methods like clustering. We will try to cover two different
algorithms, one each from partitioning based clustering and hierarchical clustering.

In [None]:
# Load Wisconsin Breast Cancer Dataset
import numpy as np
from sklearn.datasets import load_breast_cancer

# load data
data = load_breast_cancer()
X = data.data
y = data.target
print(X.shape, data.feature_names)

It is evident that we have a total of 569 observations and 30 attributes or features for each observation.

### Partition based Clustering Example

We will choose the simplest yet most popular partition based clustering model for our example, which
is the **K-means** algorithm. This is a centroid based clustering algorithm, which starts with some
assumption about the total number of clusters in the data and with random centers assigned to each of the clusters.
It then reassigns each data point to the center closest to it, using Euclidean distance as the distance metric.
After each reassignment, it recalculates the center of each cluster. The whole process is repeated iteratively
and stops when reassigning the data points no longer changes the cluster centers. Variants include
algorithms like **K-medoids**.
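
The assign-and-update loop described above can be written in a few lines of NumPy. This is a minimal sketch for illustration, not scikit-learn's implementation; the function name `kmeans_sketch`, the toy data, and the choice of initial centers (random data points) are all our own assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-means loop for illustration (not scikit-learn's implementation)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every point to every center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points keeps its previous position).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy blobs (hypothetical data, 20 points each).
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
toy_labels, toy_centers = kmeans_sketch(X_toy, k=2)
print(toy_labels)
print(toy_centers)
```

In practice we rely on `sklearn.cluster.KMeans`, which adds better initialization and several restarts on top of this basic loop.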

In [None]:
# determine these two clusters (either 0 or 1) from the data by K-means clustering
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=2)
km.fit(X)

labels = km.labels_
centers = km.cluster_centers_
# labels of the first 10 data points
print(labels[:10])
# coordinates of the two cluster centers in the feature space
# (one value for each of the 30 attributes in the dataset)
print(centers)

In [None]:
# we will leverage PCA to reduce the input dimensions (30) to two principal components
# and visualize the clusters on top of the same.
import plotly.express as px
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
bc_pca = pca.fit_transform(X)
# color the points by the true labels y loaded in this section
fig = px.scatter(bc_pca, x=0, y=1, color=y)
fig.show()

In [None]:
# visualize the clusters on the reduced 2D feature space for the actual labels as
# well as the clustered output labels.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
fig.suptitle('Visualizing breast cancer clusters')
fig.subplots_adjust(top=0.85, wspace=0.5)
ax1.set_title('Actual Labels')
ax2.set_title('Clustered Labels')

# plot each class (left panel) and each cluster (right panel) with one call per label value
for value, color in [(0, 'g'), (1, 'r')]:
    ax1.scatter(bc_pca[y == value, 0], bc_pca[y == value, 1],
                c=color, marker='.', label=str(value))
    ax2.scatter(bc_pca[labels == value, 0], bc_pca[labels == value, 1],
                c=color, marker='.', label=str(value))

l1 = ax1.legend()
l2 = ax2.legend()

We can see that the clustering has worked quite well: it shows a distinct
separation between the clusters with labels 0 and 1, and the result is quite similar to the actual labels. However, we do
have some overlap where some instances have been mislabeled.

Remember that in a real-world scenario, you will not have the actual labels to compare with. The
main idea is to find structures or patterns in your data in the form of these clusters.
Hence, even when you run clustering on labeled data, do not compare the cluster label values
directly with the actual labels to measure accuracy.

Another very important point to remember is that cluster label values have no significance.
The labels 0 and 1 are just values to distinguish cluster data points from each other,
so cluster 0 may just as well correspond to class 1.

Another important note is that if we had asked for more than two clusters, the algorithm would have
readily supplied them, but they would have been hard to interpret and many of them would not make sense.
Hence, one of the caveats of the K-means algorithm is that it works best when we have some idea about the total
number of clusters that may exist in the data.
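
A tiny example makes the point about arbitrary label values concrete. The arrays below are made up: the clustering recovers the two groups perfectly but happens to use the opposite names. Accuracy treats this as a complete failure, while a score that ignores label names, such as the V-measure (`metrics.v_measure_score`) discussed in the next section, does not.

```python
import numpy as np
from sklearn import metrics

# Hypothetical example: identical grouping, but the cluster names are swapped.
y_true = np.array([0, 0, 0, 1, 1, 1])
cluster_labels = np.array([1, 1, 1, 0, 0, 0])

acc = metrics.accuracy_score(y_true, cluster_labels)
v = metrics.v_measure_score(y_true, cluster_labels)
v_renamed = metrics.v_measure_score(y_true, 1 - cluster_labels)

print("accuracy:", acc)
print("V-measure:", v)
print("V-measure after renaming the clusters:", v_renamed)
```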

## Clustering Model Evaluation Metrics

The lack of a validated ground truth, i.e. the absence of true labels in the data, makes the evaluation of clustering (or unsupervised models in general) very difficult.



### External validation

External validation means validating the clustering model when we have some ground truth available
as labeled data. The presence of external labels reduces most of the complexity of model evaluation as
the clustering (unsupervised) model can be validated in similar fashion to classification models.

Three popular metrics can be used in this scenario:

* **Homogeneity**: A clustering model prediction result satisfies homogeneity if all of
its clusters contain only data points that are members of a single class (based on the
true class labels).
* **Completeness**: A clustering model prediction result satisfies completeness if
all the data points of a specific ground truth class label are also elements of the
same cluster.
* **V-measure**: The harmonic mean of homogeneity and completeness scores gives us
the V-measure value.
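
Two extreme toy clusterings (made up for illustration) show how homogeneity and completeness pull in opposite directions: putting every point in its own cluster gives pure clusters but splits the classes, while putting everything in one cluster keeps classes together but mixes them.

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]

# Every point in its own cluster: each cluster is pure, but each class is split apart.
over_split = [0, 1, 2, 3]
h, c, v = metrics.homogeneity_completeness_v_measure(y_true, over_split)
print("over-split:", round(h, 3), round(c, 3), round(v, 3))

# Everything in one cluster: classes stay together, but the cluster mixes both classes.
one_cluster = [0, 0, 0, 0]
h1, c1, v1 = metrics.homogeneity_completeness_v_measure(y_true, one_cluster)
print("single cluster:", round(h1, 3), round(c1, 3), round(v1, 3))
```

Because the V-measure is a harmonic mean, neither extreme scores well on it, even though each one maximizes one of the two components.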

Values are bounded between 0 and 1, and higher values are better.

Let's calculate these metrics for the K-means clustering of the breast cancer data we ran before (variables X for the data and y for the observation labels).

In [None]:
# km.labels_ holds the cluster assigned to each point in the dataset
km_labels = km.labels_
# homogeneity, completeness, and V-measure evaluate the clustering against the true labels y
km_hcv = np.round(metrics.homogeneity_completeness_v_measure(y, km_labels), 3)
print('Homogeneity, Completeness, V-measure metrics for num clusters=2: ', km_hcv)

### Build two clustering models on the breast cancer dataset

We will now compare two K-means models,
one with two clusters and one with five clusters, and
then evaluate their performance.

In [None]:
km2 = KMeans(n_clusters=2, random_state=42).fit(X)
km2_labels = km2.labels_

km5 = KMeans(n_clusters=5, random_state=42).fit(X)
km5_labels = km5.labels_

In [None]:
km2_hcv = np.round(metrics.homogeneity_completeness_v_measure(y, km2_labels), 3)
km5_hcv = np.round(metrics.homogeneity_completeness_v_measure(y, km5_labels), 3)

print('Homogeneity, Completeness, V-measure metrics for num clusters=2: ', km2_hcv)
print('Homogeneity, Completeness, V-measure metrics for num clusters=5: ', km5_hcv)

We can see that the V-measure for the first model with two clusters is better than the one with five
clusters, and the reason is its higher completeness score.

### Internal validation

Internal validation means validating a clustering model by defining metrics that capture the expected
behavior of a good clustering model. A good clustering model can be identified by two very desirable traits:
* Compact groups, i.e. the data points in one cluster lie close to each other.
* Well separated groups, i.e. any two groups (clusters) are as far apart from
each other as possible.



#### Silhouette Coefficient
Silhouette coefficient is a metric that tries to combine the two requirements of a good clustering model. The
silhouette coefficient is defined for each sample and is a combination of its similarity to the data points in its
own cluster and its dissimilarity to the data points not in its cluster.

The silhouette coefficient is bounded between -1 (incorrect clustering) and +1 (excellent quality
dense clusters). A higher value of silhouette coefficient generally means that the clustering model is leading
to clusters that are dense and well separated and distinguishable from each other. Scores near 0 indicate
overlapping clusters.
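
For a single sample, the silhouette coefficient is commonly written as s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes this by hand on four made-up points and compares the result with `metrics.silhouette_samples`. The helper name `silhouette_by_hand` is ours, and it assumes every cluster has at least two points.

```python
import numpy as np
from sklearn import metrics

# Four 1-D points in two obvious clusters (hypothetical toy data).
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    """Per-sample s = (b - a) / max(a, b); assumes each cluster has >= 2 points."""
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dist[i, same].mean()  # mean distance within its own cluster
        # b: smallest mean distance to the points of any other cluster
        b = min(
            dist[i, labels == other].mean()
            for other in set(labels.tolist())
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return np.array(scores)

by_hand = silhouette_by_hand(X_toy, toy_labels)
from_sklearn = metrics.silhouette_samples(X_toy, toy_labels)
print("by hand:", by_hand)
print("scikit-learn:", from_sklearn)
```

`metrics.silhouette_score`, used in the next cell, is the mean of these per-sample values.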

In [None]:
from sklearn import metrics

km2_silc = metrics.silhouette_score(X, km2_labels, metric='euclidean')
km5_silc = metrics.silhouette_score(X, km5_labels, metric='euclidean')

print('Silhouette Coefficient for num clusters=2: ', km2_silc)
print('Silhouette Coefficient for num clusters=5: ', km5_silc)

Based on these metric results, the model with two clusters appears to give better
cluster quality than the model with five clusters.