<a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_3/Labs/Lab3_1_Refining_a_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3.1: Refining a Machine Learning Model

In this lab, you'll perform the following tasks:

- Build a Logistic Regression model to classify the data without any modifications to the data
- Examine the results and measure the performance, especially the precision
- Explore and refine the dataset
- Recreate and retest the model
- Repeat until the performance is optimized

You'll also compare the performance of two models constructed using different algorithms.

## Scenario

This dataset is related to white variants of the Portuguese "Vinho Verde" wine. The dataset describes the amounts of various chemicals present in the wine and their effect on its quality. This is a binary classification dataset; the quality is either 'Poor' or 'Good'. Your task is to predict the quality of a wine using the given data.

The dataset contains the following columns:

Input variables (based on physicochemical tests):\
1 - fixed acidity\
2 - volatile acidity\
3 - citric acid\
4 - residual sugar\
5 - chlorides\
6 - free sulfur dioxide\
7 - total sulfur dioxide\
8 - density\
9 - pH\
10 - sulphates\
11 - alcohol\
12 - alkalinity\
13 - e330 level\
14 - effervescence index\
15 - consumable\
\
Output variable (based on sensory data):\
16 - quality (0=poor, 1=good)

## Acknowledgements:
This dataset is also available from Kaggle and the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/wine+quality.

# Read the data

In [None]:
# Upload the winequalitywhites.csv file from Github

!wget 'https://raw.githubusercontent.com/cm-int/machine-learning-fundamentals/main/module_3/Labs/winequalitywhites.csv'

In [None]:
import pandas as pd
import numpy as np

# Read the data into a Pandas DataFrame named wine_data

wine_data = pd.read_csv('winequalitywhites.csv')
wine_data

# Split the data

In [None]:
# Create the wine_features DataFrame with every column apart from quality

wine_features = wine_data.drop(['quality'], axis=1)
wine_features

In [None]:
# Create the wine_quality series containing only the quality column

wine_quality = wine_data['quality']
wine_quality

In [None]:
# Split the data into training and test datasets

from sklearn.model_selection import train_test_split

features_train, features_test, predictions_train, predictions_test = train_test_split(wine_features, wine_quality, test_size=0.33, random_state=13)

# Create a Logistic Regression model to classify the data

In [None]:
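# Create and fit the Logistic Regression model with the 'saga' solver, no regularization,
# an increased number of iterations, and a looser tolerance than the default (to allow the algorithm to converge)
# Note: scikit-learn 1.2 and later expect penalty=None in place of the string 'none'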

from sklearn.linear_model import LogisticRegression

wine_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

In [None]:
# Test the model and examine the confusion matrix

from sklearn.metrics import ConfusionMatrixDisplay 

test_results = wine_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results)

In [None]:
# Calculate the precision, recall, F1-score, AUC and accuracy for the model

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score

print(f'Precision: {precision_score(predictions_test, test_results, zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, test_results, zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, test_results, zero_division=0)}\n')
print(f'AUC: {roc_auc_score(predictions_test, test_results)}\n')
print(f'Accuracy: {accuracy_score(predictions_test, test_results)}\n')

In [None]:
# Plot the ROC curve for the model from the estimator and from the test predictions
from sklearn.metrics import roc_curve, RocCurveDisplay

display = RocCurveDisplay.from_estimator(wine_model, features_test, predictions_test)
display = RocCurveDisplay.from_predictions(predictions_test, test_results)

In [None]:
# Plot the Precision/Recall graph for the model using the estimator and from the test results
from sklearn.metrics import PrecisionRecallDisplay

_ = PrecisionRecallDisplay.from_estimator(wine_model, features_test, predictions_test)

display = PrecisionRecallDisplay.from_predictions(predictions_test, test_results)
_ = display.ax_.set_ylim(bottom=0, top=1)

In [None]:
# Find the probability threshold that maximizes the F1 score (the balance of precision and recall) for the 'good' (1) class label
# Display the F1 score, precision, and recall for this threshold

from sklearn.metrics import precision_recall_curve

test_results_proba = wine_model.predict_proba(features_test)
precision, recall, thresholds = precision_recall_curve(predictions_test, test_results_proba[:, 1])

precision[precision == 0] = 1e-99
recall[recall == 0] = 1e-99
fscores = (2 * precision * recall) / (precision + recall)

ix = np.argmax(fscores)
print(f'Optimal threshold is {thresholds[ix]}\nF1 Score is {fscores[ix]}\nPrecision is {precision[ix]}\nRecall is {recall[ix]}')

**What do you conclude from these statistics?**

The precision indicates that the model has a large false positive rate. Many wines classified as having *good* quality are actually *poor*.

The recall shows that the model has a much smaller false negative rate. A few wines that are classified as *poor* should actually be *good*.

The high recall but low precision results in a misleadingly high F1 score.

The AUC indicates that the model is performing no better than random guesswork.

These statistics show that you should never use one measurement in isolation to judge the performance of a model.

The model *may* appear to work better with a probability threshold of 0.484 for the class labels; predictions with a probability less than this value should be a 0, and those at or above this value should be a 1. However, the precision indicates that reducing the threshold is likely to increase the already substantial number of false positives (the precision will drop) while only reducing the number of false negatives (the recall will improve); it makes a biased model even more biased.
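
The effect of moving the threshold can be sketched on a handful of made-up values. The labels and probabilities below are hypothetical, not taken from the wine data, and `classify` is an illustrative helper rather than part of scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and positive-class probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
proba_good = np.array([0.30, 0.49, 0.47, 0.52, 0.90, 0.55, 0.485, 0.10])


def classify(probabilities, threshold):
    """Label an observation 1 when its probability is at or above the threshold."""
    return (probabilities >= threshold).astype(int)


# Compare the default threshold with the lower one suggested by the F1 search
for threshold in (0.5, 0.484):
    labels = classify(proba_good, threshold)
    p = precision_score(y_true, labels, zero_division=0)
    r = recall_score(y_true, labels, zero_division=0)
    print(f'threshold={threshold}: precision={p:.3f}, recall={r:.3f}')
```

With these hypothetical values, lowering the threshold from 0.5 to 0.484 adds one true positive and one false positive, so the recall rises while the precision falls.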

# Evaluate the model

In [None]:
# Calculate the Gini Coefficient for the model
# Gini Coefficient = (2 × AUC) − 1

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(predictions_test, test_results)
gini_coeff = (2 * auc) - 1
print(f'Gini Coefficient is: {gini_coeff}')

**What does this coefficient signify?**

A Gini Coefficient of 0.01 indicates the model has very poor performance. Ideally, you should aim for a Gini Coefficient greater than 0.6.
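
The cell above computes the AUC from the predicted class labels. As a point of comparison, `roc_auc_score` also accepts probability scores, which keep the ranking information that thresholding discards. The following is a minimal sketch on hypothetical values (not the wine data) showing how the two inputs can give different Gini Coefficients:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and positive-class probabilities, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.40, 0.45, 0.55, 0.48, 0.60, 0.70])

# Class labels obtained with the default 0.5 threshold
labels = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, labels)
auc_from_scores = roc_auc_score(y_true, scores)

# Gini Coefficient = (2 x AUC) - 1 in both cases
gini_from_labels = (2 * auc_from_labels) - 1
gini_from_scores = (2 * auc_from_scores) - 1

print(f'From labels: AUC={auc_from_labels:.3f}, Gini={gini_from_labels:.3f}')
print(f'From scores: AUC={auc_from_scores:.3f}, Gini={gini_from_scores:.3f}')
```

For these made-up values, the thresholded labels give an AUC of about 0.67, whereas ranking by probability gives about 0.89. If you try this with the lab's model, the second column of `predict_proba` would play the role of `scores`.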

In [None]:
# Calculate Cohen's Kappa for the model

from sklearn.metrics import cohen_kappa_score

kappa_score = cohen_kappa_score(predictions_test, test_results)
print(f"Cohen's Kappa is: {kappa_score}")

**What does this value mean?**

The Cohen's Kappa value lies between 0.01 and 0.2. This indicates that there is very slight agreement between the model and the real observations.
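
To see where a value in this range comes from, the sketch below recomputes Cohen's Kappa by hand as (observed agreement − chance agreement) / (1 − chance agreement) and checks it against `cohen_kappa_score`. The labels are hypothetical and chosen only to land in the "slight agreement" band:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical observed and predicted labels, for illustration only
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: observed, columns: predicted
total = cm.sum()

# Proportion of observations where the model and the observations agree
p_observed = np.trace(cm) / total

# Agreement expected by chance, from the row and column totals
p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f'Manual kappa: {kappa:.4f}')
print(f'sklearn kappa: {cohen_kappa_score(y_true, y_pred):.4f}')
```

Although 60% of these predictions are correct, 56% agreement would be expected by chance alone, which is why the Kappa value is so small.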

In [None]:
# Calculate the Hamming Loss for the model
from sklearn.metrics import hamming_loss

hamming_score = hamming_loss(predictions_test, test_results)
print(f'Hamming Loss is: {hamming_score}')

**What proportion of the predictions are incorrect?**

The Hamming Loss indicates that 33.5% of the predictions are incorrect. This model is a poor fit.
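
For a single-label binary problem like this one, the Hamming Loss is simply the fraction of mismatched labels, so it equals 1 minus the accuracy. A minimal sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical observed and predicted labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 1, 1, 0, 0, 1])

# Fraction of positions where the prediction differs from the observation
manual_loss = np.mean(y_true != y_pred)

print(f'Manual Hamming Loss: {manual_loss:.4f}')
print(f'sklearn Hamming Loss: {hamming_loss(y_true, y_pred):.4f}')
print(f'1 - accuracy: {1 - accuracy_score(y_true, y_pred):.4f}')
```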

In [None]:
# Calculate the Matthews Correlation Coefficient for the model

from sklearn.metrics import matthews_corrcoef

mcc = matthews_corrcoef(predictions_test, test_results)
print(f'Matthews Correlation Coefficient is: {mcc}')

**How strong is the relationship between the predicted and observed class labels?**

The coefficient is between 0 and 0.19, which means there is a negligible relationship.
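
The Matthews Correlation Coefficient uses all four cells of the confusion matrix, which is why it is hard to inflate with a biased model. The sketch below computes it from hypothetical counts and compares the result with `matthews_corrcoef`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical observed and predicted labels, for illustration only
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

numerator = (tp * tn) - (fp * fn)
denominator = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
mcc_manual = numerator / denominator

print(f'Manual MCC: {mcc_manual:.4f}')
print(f'sklearn MCC: {matthews_corrcoef(y_true, y_pred):.4f}')
```

The model in this example finds five of the six positives, but it also mislabels three of the four negatives, so the coefficient stays close to 0.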

In [None]:
# Plot the cumulative gains chart for the model
!pip install Scikit-plot

import scikitplot as skplt

test_results_proba = wine_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, test_results_proba, figsize=(10, 10))

**Overall, do your findings confirm your earlier conclusions about the precision and recall of the model?**

All of the metrics confirm that the model is currently a very poor fit.
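
If you want to see what the cumulative gains chart is plotting, the curve can be reproduced with NumPy alone. This is a sketch under stated assumptions: `cumulative_gain` is a hypothetical helper written for this example (it is not the scikit-plot implementation), and the labels and probabilities are made up:

```python
import numpy as np


def cumulative_gain(y_true, proba_positive):
    """Return (fraction of sample examined, fraction of positives captured)."""
    y_true = np.asarray(y_true)
    # Rank the observations from the highest to the lowest predicted probability
    order = np.argsort(proba_positive)[::-1]
    captured = np.cumsum(y_true[order]) / y_true.sum()
    fraction = np.arange(1, len(y_true) + 1) / len(y_true)
    # Prepend the origin so the curve starts at (0, 0)
    return np.insert(fraction, 0, 0.0), np.insert(captured, 0, 0.0)


# Hypothetical labels and positive-class probabilities, for illustration only
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
proba = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

fraction, gain = cumulative_gain(y_true, proba)
for f, g in zip(fraction, gain):
    print(f'Top {f:.0%} of the sample captures {g:.0%} of the positives')
```

A model that ranks no better than chance would track the diagonal, capturing roughly 30% of the positives in the top 30% of the sample; the further the curve rises above the diagonal, the better the ranking.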

# Refine the model - scale the data

In [None]:
# Apply a MinMaxScaler to the wine_features dataframe 

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
column_names = wine_features.columns
scaled_wine_features = pd.DataFrame(scaler.fit_transform(wine_features), columns=column_names)

scaled_wine_features

In [None]:
# Rebuild the model with scaled features:
# - Recreate test and training datasets
# - Build the Logistic Regression model with the same parameters as before

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

features_train, features_test, predictions_train, predictions_test = train_test_split(scaled_wine_features, wine_quality, test_size=0.33, random_state=13)
wine_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

In [None]:
# Test the model
# - Make predictions and examine the confusion matrix
# - Calculate the precision, recall, F1-score, AUC and accuracy for the model
# - Plot the ROC curve for the model from the estimator and from the test predictions
# - Plot the Precision/Recall graph for the model using the estimator and from the test results

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay

test_results = wine_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results)

print(f'Precision: {precision_score(predictions_test, test_results, zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, test_results, zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, test_results, zero_division=0)}\n')
print(f'AUC: {roc_auc_score(predictions_test, test_results)}\n')
print(f'Accuracy: {accuracy_score(predictions_test, test_results)}\n')

display = RocCurveDisplay.from_estimator(wine_model, features_test, predictions_test)
display = RocCurveDisplay.from_predictions(predictions_test, test_results)

_ = PrecisionRecallDisplay.from_estimator(wine_model, features_test, predictions_test)
display = PrecisionRecallDisplay.from_predictions(predictions_test, test_results)
_ = display.ax_.set_ylim(bottom=0, top=1)

In [None]:
# Evaluate the model
# - Calculate the Gini Coefficient for the model
# - Calculate Cohen's Kappa
# - Calculate the Hamming Loss
# - Calculate the Matthews Correlation Coefficient
# - Plot the cumulative gains chart for the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt

auc = roc_auc_score(predictions_test, test_results)
gini_coeff = (2 * auc) - 1
print(f'Gini Coefficient is: {gini_coeff}')

kappa_score = cohen_kappa_score(predictions_test, test_results)
print(f"Cohen's Kappa is: {kappa_score}")

hamming_score = hamming_loss(predictions_test, test_results)
print(f'Hamming Loss is: {hamming_score}')

mcc = matthews_corrcoef(predictions_test, test_results)
print(f'Matthews Correlation Coefficient is: {mcc}')

test_results_proba = wine_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, test_results_proba, figsize=(10, 10))

**Has the model improved?**

There has been a notable improvement in all metrics. The false positive and false negative rates have both decreased.
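
The lab scales the whole dataset before splitting it. A common alternative, shown here only as a sketch and not as the lab's method, is to fit the scaler on the training data alone by placing it in a pipeline, so that the test data has no influence on the scaling. The data below is synthetic (generated with `make_classification`), and the model uses the default settings rather than the lab's parameters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the wine data, for illustration only
X, y = make_classification(n_samples=300, n_features=6, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)

# The pipeline fits the scaler on the training rows only, then scales and fits the model
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=2000))
pipeline.fit(X_train, y_train)

# The fitted scaler has only seen the training rows
scaler = pipeline.named_steps['minmaxscaler']
print(f'Rows seen by the scaler: {scaler.n_samples_seen_} of {len(X)}')

# At prediction time the test rows are scaled with the training minimum and maximum
test_accuracy = pipeline.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy:.3f}')
```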

# Refine the model - remove constant and quasi-constant features

In [None]:
# Look for features with little variance in the scaled dataframe

print(scaled_wine_features.var())

**Which features have a notably small variance?**

The *consumable* feature has zero variance, so it has the same value in every observation.
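
Reading the variances by eye works for a small dataset. With many columns, scikit-learn's `VarianceThreshold` can flag constant features for you. The sketch below uses a tiny hypothetical DataFrame rather than the wine data:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical scaled features, for illustration only; 'consumable' never changes
frame = pd.DataFrame({
    'alcohol': [0.1, 0.5, 0.9, 0.3],
    'consumable': [1.0, 1.0, 1.0, 1.0],
    'pH': [0.2, 0.4, 0.6, 0.8],
})

# A threshold of 0.0 removes only the features with zero variance
selector = VarianceThreshold(threshold=0.0)
selector.fit(frame)

# get_support() returns a Boolean mask with one entry per input column
kept = frame.columns[selector.get_support()]
dropped = frame.columns[~selector.get_support()]

print(f'Kept: {list(kept)}')
print(f'Dropped: {list(dropped)}')
```

Raising the threshold slightly above zero would also catch quasi-constant features, at the risk of removing a feature that carries a weak but genuine signal.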

In [None]:
# Verify that 'consumable' has only one value - display all the unique values in this feature

print(np.unique(scaled_wine_features['consumable']))

In [None]:
# Rebuild the model without this feature:
# - Drop the feature from the scaled_wine_features dataframe  
# - Recreate test and training datasets
# - Build the Logistic Regression model with the same parameters as before

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

no_constants_wine_features = scaled_wine_features.drop(['consumable'], axis=1)

features_train, features_test, predictions_train, predictions_test = train_test_split(no_constants_wine_features, wine_quality, test_size=0.33, random_state=13)

wine_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

In [None]:
# Test the model
# - Make predictions and examine the confusion matrix
# - Calculate the precision, recall, F1-score, AUC and accuracy for the model
# - Plot the ROC curve for the model from the estimator and from the test predictions
# - Plot the Precision/Recall graph for the model using the estimator and from the test results

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay

test_results = wine_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results)

print(f'Precision: {precision_score(predictions_test, test_results, zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, test_results, zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, test_results, zero_division=0)}\n')
print(f'AUC: {roc_auc_score(predictions_test, test_results)}\n')
print(f'Accuracy: {accuracy_score(predictions_test, test_results)}\n')

display = RocCurveDisplay.from_estimator(wine_model, features_test, predictions_test)
display = RocCurveDisplay.from_predictions(predictions_test, test_results)

_ = PrecisionRecallDisplay.from_estimator(wine_model, features_test, predictions_test)
display = PrecisionRecallDisplay.from_predictions(predictions_test, test_results)
_ = display.ax_.set_ylim(bottom=0, top=1)

In [None]:
# Evaluate the model
# - Calculate the Gini Coefficient for the model
# - Calculate Cohen's Kappa
# - Calculate the Hamming Loss
# - Calculate the Matthews Correlation Coefficient
# - Plot the cumulative gains chart for the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt

auc = roc_auc_score(predictions_test, test_results)
gini_coeff = (2 * auc) - 1
print(f'Gini Coefficient is: {gini_coeff}')

kappa_score = cohen_kappa_score(predictions_test, test_results)
print(f"Cohen's Kappa is: {kappa_score}")

hamming_score = hamming_loss(predictions_test, test_results)
print(f'Hamming Loss is: {hamming_score}')

mcc = matthews_corrcoef(predictions_test, test_results)
print(f'Matthews Correlation Coefficient is: {mcc}')

test_results_proba = wine_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, test_results_proba, figsize=(10, 10))

**Has the model improved?**

There has been no effect on predictive power. The constant column was probably not an important part of the model, but it makes sense to remove it to save resources.

# Refine the model - find and remove correlated features

In [None]:
# Find correlated features in the scaled and reduced dataset

import seaborn as sns
from matplotlib import pyplot as plt

correlation_matrix = no_constants_wine_features.corr(method='kendall')
plt.figure(figsize=(15, 15))
sns.heatmap(correlation_matrix, annot=True, linecolor='black')

plt.show()

**Which features show a strong correlation?**

The 'e330.level' and 'citric acid' features have a positive correlation coefficient of 1, meaning that they convey the same information.

The 'pH' and 'alkalinity' columns have a negative correlation coefficient of -1. Alkalinity is the exact converse of pH.
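
A heatmap becomes hard to read as the number of features grows, so it can help to list the strongly correlated pairs directly. In the sketch below, `correlated_pairs` is a hypothetical helper written for this example, the 0.95 threshold is an arbitrary choice, and the values are made up so that two pairs are perfectly correlated:

```python
import pandas as pd


def correlated_pairs(frame, threshold=0.95, method='kendall'):
    """List each pair of columns whose absolute correlation is at or above the threshold."""
    corr = frame.corr(method=method)
    columns = corr.columns
    pairs = []
    # Visit only the upper triangle so that each pair is reported once
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            value = float(corr.iloc[i, j])
            if abs(value) >= threshold:
                pairs.append((columns[i], columns[j], value))
    return pairs


# Hypothetical values, for illustration only
ph_values = [3.0, 3.4, 3.1, 3.3, 3.2]
citric_values = [0.1, 0.4, 0.2, 0.9, 0.5]
frame = pd.DataFrame({
    'citric.acid': citric_values,
    'e330.level': [2 * v for v in citric_values],  # rises in step with citric.acid
    'pH': ph_values,
    'alkalinity': [-v for v in ph_values],         # falls as pH rises
})

pairs = correlated_pairs(frame)
for first, second, value in pairs:
    print(f'{first} / {second}: {value:+.2f}')
```

From each reported pair you keep one feature and drop the other, as the next cell does with 'e330.level' and 'alkalinity'.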

In [None]:
# Remove the e330.level and alkalinity features and rebuild the model
# - Drop the features from the no_constants_wine_features dataframe
# - Recreate test and training datasets
# - Build the Logistic Regression model with the same parameters as before

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

no_correlation_wine_features = no_constants_wine_features.drop(['e330.level', 'alkalinity'], axis=1)

features_train, features_test, predictions_train, predictions_test = train_test_split(no_correlation_wine_features, wine_quality, test_size=0.33, random_state=13)

wine_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

In [None]:
# Test the model
# - Make predictions and examine the confusion matrix
# - Calculate the precision, recall, F1-score, AUC and accuracy for the model
# - Plot the ROC curve for the model from the estimator and from the test predictions
# - Plot the Precision/Recall graph for the model using the estimator and from the test results

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay

test_results = wine_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results)

print(f'Precision: {precision_score(predictions_test, test_results, zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, test_results, zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, test_results, zero_division=0)}\n')
print(f'AUC: {roc_auc_score(predictions_test, test_results)}\n')
print(f'Accuracy: {accuracy_score(predictions_test, test_results)}\n')

display = RocCurveDisplay.from_estimator(wine_model, features_test, predictions_test)
display = RocCurveDisplay.from_predictions(predictions_test, test_results)

_ = PrecisionRecallDisplay.from_estimator(wine_model, features_test, predictions_test)
display = PrecisionRecallDisplay.from_predictions(predictions_test, test_results)
_ = display.ax_.set_ylim(bottom=0, top=1)

In [None]:
# Evaluate the model
# - Calculate the Gini Coefficient for the model
# - Calculate Cohen's Kappa
# - Calculate the Hamming Loss
# - Calculate the Matthews Correlation Coefficient
# - Plot the cumulative gains chart for the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt

auc = roc_auc_score(predictions_test, test_results)
gini_coeff = (2 * auc) - 1
print(f'Gini Coefficient is: {gini_coeff}')

kappa_score = cohen_kappa_score(predictions_test, test_results)
print(f"Cohen's Kappa is: {kappa_score}")

hamming_score = hamming_loss(predictions_test, test_results)
print(f'Hamming Loss is: {hamming_score}')

mcc = matthews_corrcoef(predictions_test, test_results)
print(f'Matthews Correlation Coefficient is: {mcc}')

test_results_proba = wine_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, test_results_proba, figsize=(10, 10))

**Has the model improved?**

No, but you shouldn't necessarily expect it to have done. Like removing constant and quasi-constant features, the purpose of removing correlated features is to minimize the resources required to build and use the model. The important point is that the model shouldn't be worse as a result. In this case, the metrics are the same as the previous model.

# Refine the model - remove noise using univariate feature selection

In [None]:
# Perform SHAP analysis to find the features that have the most impact on predictions

!pip install shap

import shap

explainer = shap.Explainer(wine_model.predict, features_test) 
values = explainer(features_train)

shap.summary_plot(shap_values=values, features=features_train, plot_type="bar")
shap.summary_plot(shap_values=values, features=features_train, plot_type="violin") 

In [None]:
# Rebuild the model with only the top five features

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

reduced_wine_features = no_correlation_wine_features[['density', 'residual.sugar', 'alcohol', 'volatile.acidity', 'pH']]

features_train, features_test, predictions_train, predictions_test = train_test_split(reduced_wine_features, wine_quality, test_size=0.33, random_state=13)

wine_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

In [None]:
# Test the model

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay

test_results = wine_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results)

print(f'Precision: {precision_score(predictions_test, test_results, zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, test_results, zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, test_results, zero_division=0)}\n')
print(f'AUC: {roc_auc_score(predictions_test, test_results)}\n')
print(f'Accuracy: {accuracy_score(predictions_test, test_results)}\n')

display = RocCurveDisplay.from_estimator(wine_model, features_test, predictions_test)
display = RocCurveDisplay.from_predictions(predictions_test, test_results)

_ = PrecisionRecallDisplay.from_estimator(wine_model, features_test, predictions_test)
display = PrecisionRecallDisplay.from_predictions(predictions_test, test_results)
_ = display.ax_.set_ylim(bottom=0, top=1)

In [None]:
# Evaluate the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt

auc = roc_auc_score(predictions_test, test_results)
gini_coeff = (2 * auc) - 1
print(f'Gini Coefficient is: {gini_coeff}')

kappa_score = cohen_kappa_score(predictions_test, test_results)
print(f"Cohen's Kappa is: {kappa_score}")

hamming_score = hamming_loss(predictions_test, test_results)
print(f'Hamming Loss is: {hamming_score}')

mcc = matthews_corrcoef(predictions_test, test_results)
print(f'Matthews Correlation Coefficient is: {mcc}')

test_results_proba = wine_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, test_results_proba, figsize=(10, 10))

**Has the model improved?**

There is a marginal improvement in the number of true positives and false negatives.

# Refine the model - find the combination of features that give the lowest false positive rate

In [None]:
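# Use the SelectFpr class with the chi-squared test to find the combination of features to keep
# SelectFpr keeps each feature whose p-value is below alpha (0.05 by default), which limits the rate of falsely selected features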

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFpr
from sklearn.feature_selection import chi2

features_train, features_test, predictions_train, predictions_test = train_test_split(no_correlation_wine_features, wine_quality, test_size=0.33, random_state=13)

features_selector = SelectFpr(score_func=chi2)
_ = features_selector.fit(features_train, predictions_train)

In [None]:
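# Print the chi-squared score for every input feature, then list the features that were selected
# (scores_ covers all input features, so pair it with feature_names_in_ rather than the selected names)

for name, score in zip(features_selector.feature_names_in_, features_selector.scores_):
    print(f'Feature {name}: {score}')

print(f'Selected features: {features_selector.get_feature_names_out()}')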

**How does this compare to the features found by using SHAP analysis?**

The list found by using SHAP analysis was 'density', 'residual.sugar', 'alcohol', 'volatile.acidity', and 'pH'. These results suggest that selecting the features 'volatile.acidity', 'chlorides', 'density', and 'alcohol' will give the lowest FPR.
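
It can help to look at how `SelectFpr` makes its decision: it runs the scoring function on each feature and keeps those whose p-value is below `alpha`. The sketch below shows this on synthetic data; the two features and their names are hypothetical, with one built to track the label and one that is pure noise:

```python
import numpy as np
from sklearn.feature_selection import SelectFpr, chi2

# Synthetic data, for illustration only
rng = np.random.default_rng(13)
y = rng.integers(0, 2, size=400)
informative = (y * 0.6) + (rng.random(400) * 0.4)  # higher when the label is 1
noise = rng.random(400)                             # unrelated to the label
X = np.column_stack([informative, noise])           # chi2 requires non-negative values

alpha = 0.05
selector = SelectFpr(score_func=chi2, alpha=alpha).fit(X, y)

# Each feature is kept only if its p-value is below alpha
names = ['informative', 'noise']
for name, score, p_value, keep in zip(names, selector.scores_, selector.pvalues_, selector.get_support()):
    print(f'{name}: score={score:.2f}, p-value={p_value:.4f}, selected={keep}')
```

Note that `scores_` and `pvalues_` have one entry per input feature, whereas `get_feature_names_out()` returns only the selected names.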

In [None]:
# Rebuild the model with these features

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

reduced_wine_features = no_correlation_wine_features[['volatile.acidity', 'chlorides', 'density', 'alcohol']]

features_train, features_test, predictions_train, predictions_test = train_test_split(reduced_wine_features, wine_quality, test_size=0.33, random_state=13)

wine_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

In [None]:
# Test the model

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay

test_results = wine_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results)

print(f'Precision: {precision_score(predictions_test, test_results, zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, test_results, zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, test_results, zero_division=0)}\n')
print(f'AUC: {roc_auc_score(predictions_test, test_results)}\n')
print(f'Accuracy: {accuracy_score(predictions_test, test_results)}\n')

display = RocCurveDisplay.from_estimator(wine_model, features_test, predictions_test)
display = RocCurveDisplay.from_predictions(predictions_test, test_results)

_ = PrecisionRecallDisplay.from_estimator(wine_model, features_test, predictions_test)
display = PrecisionRecallDisplay.from_predictions(predictions_test, test_results)
_ = display.ax_.set_ylim(bottom=0, top=1)

In [None]:
# Evaluate the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt

auc = roc_auc_score(predictions_test, test_results)
gini_coeff = (2 * auc) - 1
print(f'Gini Coefficient is: {gini_coeff}')

kappa_score = cohen_kappa_score(predictions_test, test_results)
print(f"Cohen's Kappa is: {kappa_score}")

hamming_score = hamming_loss(predictions_test, test_results)
print(f'Hamming Loss is: {hamming_score}')

mcc = matthews_corrcoef(predictions_test, test_results)
print(f'Matthews Correlation Coefficient is: {mcc}')

test_results_proba = wine_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, test_results_proba, figsize=(10, 10))

**Has the model improved?**

There is a very small decrease in the FPR but also a decrease in the TPR and an increase in the FNR.

# Refine the model - remove noise using multivariate feature selection

In [None]:
# Use forward selection to find the best combination of features

from sklearn.linear_model import LogisticRegression 
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split

logistic_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3) 

features_train, features_test, predictions_train, predictions_test = train_test_split(no_correlation_wine_features, wine_quality, test_size=0.33, random_state=13)

sfs_forward = SequentialFeatureSelector(logistic_model, n_features_to_select=5, direction="forward")
_ = sfs_forward.fit(features_train, predictions_train) 

print(f'Features selected by forward sequential selection: {sfs_forward.get_feature_names_out()}') 

In [None]:
# Rebuild the model with only the top five features

from sklearn.linear_model import LogisticRegression

reduced_wine_features = no_correlation_wine_features[sfs_forward.get_feature_names_out()]

features_train, features_test, predictions_train, predictions_test = train_test_split(reduced_wine_features, wine_quality, test_size=0.33, random_state=13)

wine_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

In [None]:
# Test the model

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay

test_results = wine_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results)

print(f'Precision: {precision_score(predictions_test, test_results, zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, test_results, zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, test_results, zero_division=0)}\n')
print(f'AUC: {roc_auc_score(predictions_test, test_results)}\n')
print(f'Accuracy: {accuracy_score(predictions_test, test_results)}\n')

display = RocCurveDisplay.from_estimator(wine_model, features_test, predictions_test)
display = RocCurveDisplay.from_predictions(predictions_test, test_results)

_ = PrecisionRecallDisplay.from_estimator(wine_model, features_test, predictions_test)
display = PrecisionRecallDisplay.from_predictions(predictions_test, test_results)
_ = display.ax_.set_ylim(bottom=0, top=1)

In [None]:
# Evaluate the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt

auc = roc_auc_score(predictions_test, test_results)
gini_coeff = (2 * auc) - 1
print(f'Gini Coefficient is: {gini_coeff}')

kappa_score = cohen_kappa_score(predictions_test, test_results)
print(f"Cohen's Kappa is: {kappa_score}")

hamming_score = hamming_loss(predictions_test, test_results)
print(f'Hamming Loss is: {hamming_score}')

mcc = matthews_corrcoef(predictions_test, test_results)
print(f'Matthews Correlation Coefficient is: {mcc}')

test_results_proba = wine_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, test_results_proba, figsize=(10, 10))

**Has the model improved?**

Multivariate forward selection has produced a better model overall than univariate feature selection. The Matthews Correlation Coefficient now indicates a strong relationship between the observed values and predictions made by the model, although the Gini Coefficient is still relatively low.

# Investigate the impact of regularization on the model

In [None]:
# Measure the learning curve of the model before regularization

from sklearn.model_selection import learning_curve, train_test_split
from sklearn.linear_model import LogisticRegression

features_train, features_test, predictions_train, predictions_test = train_test_split(no_constants_wine_features, wine_quality, test_size=0.33, random_state=13)
wine_model = LogisticRegression(solver='saga', penalty='none', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

# Compute the data for the learning curve using 10-fold cross validation of the model
train_sizes, train_scores, test_scores = learning_curve(estimator=wine_model, X=features_train, y=predictions_train, train_sizes=np.linspace(0.1, 1.0, 19), cv=10, scoring='precision')

In [None]:
# Plot the learning curve

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10))
plt.plot((0,3000), (0.75,0.75), c='Grey', alpha=0.5)
plt.plot((0,3000), (0.80,0.80), c='Grey', alpha=0.5)
plt.plot(train_sizes, np.mean(train_scores,axis=1), label='Train (no penalty)')
plt.plot(train_sizes, np.mean(test_scores,axis=1), label='Test (no penalty)')
plt.xlabel('Dataset Size', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylabel('Precision', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylim(bottom=0.7, top=0.85)
plt.legend(prop={'size': 20})
plt.show()

print(f'Best test score precision: {np.max(np.mean(test_scores,axis=1))}')

In [None]:
# Measure the learning curve of the model with L1 regularization

wine_model = LogisticRegression(solver='saga', penalty='l1', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

# Compute the data for the learning curve using 10-fold cross validation of the model
train_sizes, train_scores, test_scores = learning_curve(estimator=wine_model, X=features_train, y=predictions_train, train_sizes=np.linspace(0.1, 1.0, 19), cv=10, scoring='precision')

In [None]:
# Plot the learning curve

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10))
plt.plot((0,3000), (0.75,0.75), c='Grey', alpha=0.5)
plt.plot((0,3000), (0.80,0.80), c='Grey', alpha=0.5)
plt.plot(train_sizes, np.mean(train_scores,axis=1), label='Train (L1)')
plt.plot(train_sizes, np.mean(test_scores,axis=1), label='Test (L1)')
plt.xlabel('Dataset Size', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylabel('Precision', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylim(bottom=0.7, top=0.85)
plt.legend(prop={'size': 20})
plt.show()

print(f'Best test score precision: {np.max(np.mean(test_scores,axis=1))}')

In [None]:
# Measure the learning curve of the model with L2 regularization

wine_model = LogisticRegression(solver='saga', penalty='l2', max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

# Compute the data for the learning curve using 10-fold cross validation of the model
train_sizes, train_scores, test_scores = learning_curve(estimator=wine_model, X=features_train, y=predictions_train, train_sizes=np.linspace(0.1, 1.0, 19), cv=10, scoring='precision')

In [None]:
# Plot the learning curve

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10))
plt.plot((0,3000), (0.75,0.75), c='Grey', alpha=0.5)
plt.plot((0,3000), (0.80,0.80), c='Grey', alpha=0.5)
plt.plot(train_sizes, np.mean(train_scores,axis=1), label='Train (L2)')
plt.plot(train_sizes, np.mean(test_scores,axis=1), label='Test (L2)')
plt.xlabel('Dataset Size', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylabel('Precision', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylim(bottom=0.7, top=0.85)
plt.legend(prop={'size': 20})
plt.show()

print(f'Best test score precision: {np.max(np.mean(test_scores,axis=1))}')

In [None]:
# Measure the learning curve of the model with Elastic Net regularization

wine_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

# Compute the data for the learning curve using 10-fold cross validation of the model
train_sizes, train_scores, test_scores = learning_curve(estimator=wine_model, X=features_train, y=predictions_train, train_sizes=np.linspace(0.1, 1.0, 19), cv=10, scoring='precision')

In [None]:
# Plot the learning curve

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10))
plt.plot((0,3000), (0.75,0.75), c='Grey', alpha=0.5)
plt.plot((0,3000), (0.80,0.80), c='Grey', alpha=0.5)
plt.plot(train_sizes, np.mean(train_scores,axis=1), label='Train (Elastic Net)')
plt.plot(train_sizes, np.mean(test_scores,axis=1), label='Test (Elastic Net)')
plt.xlabel('Dataset Size', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylabel('Precision', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylim(bottom=0.7, top=0.85)
plt.legend(prop={'size': 20})
plt.show()

print(f'Best test score precision: {np.max(np.mean(test_scores,axis=1))}')

**What do you conclude about applying the different forms of regularization to this model?**

L1 regularization and L2 regularization both have a small detrimental effect. The model is not being overfitted, so regularization is probably unnecessary.
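One rough way to check whether a model is overfitting is to compare the mean training score with the mean cross-validation score at the largest training size: a large gap suggests overfitting, while a small gap suggests that regularization has little to correct. The sketch below assumes score tables shaped like the output of `learning_curve` (one row per training size, one column per fold). The function name and all the numbers are hypothetical and are not taken from the wine models.

```python
from statistics import mean

def generalization_gap(train_scores, test_scores):
    """Mean training score minus mean validation score at the largest training size."""
    return mean(train_scores[-1]) - mean(test_scores[-1])

# Hypothetical scores: two training sizes (rows), three folds (columns)
example_train_scores = [[0.82, 0.80, 0.81],
                        [0.79, 0.78, 0.80]]
example_test_scores = [[0.74, 0.75, 0.73],
                       [0.78, 0.77, 0.79]]

gap = generalization_gap(example_train_scores, example_test_scores)
print(f'Gap between train and test precision: {gap:.3f}')
```

How small a gap counts as "not overfitted" is a judgment call that depends on the metric and on the variation between folds.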

# Compare the Logistic Regression model to a Random Forest model

In [None]:
# Create a random forest model over the same data

from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier()
_ = forest_model.fit(features_train, predictions_train)

In [None]:
# Test the random forest model

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay

rf_test_results = forest_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, rf_test_results)

# All metrics must use the random forest predictions (rf_test_results)
print(f'Precision: {precision_score(predictions_test, rf_test_results, zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, rf_test_results, zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, rf_test_results, zero_division=0)}\n')
print(f'AUC: {roc_auc_score(predictions_test, rf_test_results)}\n')
print(f'Accuracy: {accuracy_score(predictions_test, rf_test_results)}\n')

display = RocCurveDisplay.from_estimator(forest_model, features_test, predictions_test)
display = RocCurveDisplay.from_predictions(predictions_test, rf_test_results)

_ = PrecisionRecallDisplay.from_estimator(forest_model, features_test, predictions_test)
display = PrecisionRecallDisplay.from_predictions(predictions_test, rf_test_results)
_ = display.ax_.set_ylim(bottom=0, top=1)

In [None]:
# Evaluate the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt

auc = roc_auc_score(predictions_test, rf_test_results)
gini_coeff = (2 * auc) - 1
print(f'Gini Coefficient is: {gini_coeff}')

kappa_score = cohen_kappa_score(predictions_test, rf_test_results)
print(f"Cohen's Kappa is: {kappa_score}")

hamming_score = hamming_loss(predictions_test, rf_test_results)
print(f'Hamming Loss is: {hamming_score}')

mcc = matthews_corrcoef(predictions_test, rf_test_results)
print(f'Matthews Correlation Coefficient is: {mcc}')

rf_test_results_proba = forest_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, rf_test_results_proba, figsize=(10, 10))

**How does this model compare to the Logistic Regression model?**

The random forest model has significantly better recall and precision than the logistic regression model. Overall its performance is superior.

In [None]:
# Perform McNemar's test to compare the error rates of the models

!pip install mlxtend

from mlxtend.evaluate import mcnemar_table, mcnemar

table = mcnemar_table(y_target=predictions_test, y_model1=test_results, y_model2=rf_test_results)

chi2, p = mcnemar(ary=table, corrected=True)
print(f'\nContingency table\n{table}')
print(f'\nchi-squared statistic: {chi2}, p-value: {p}\n')

**What does this test indicate?**

The p-value is very small (well below 0.05). The difference in error rates between the two models is statistically significant.
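As an illustration of what McNemar's test computes, the following sketch applies the continuity-corrected formula to the two discordant cells of a contingency table: the number of samples that only the first model classified correctly (`b`) and the number that only the second model classified correctly (`c`). The function name and the counts are hypothetical; this is a sketch of the standard formula, not the `mlxtend` implementation.

```python
import math

def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic and p-value.

    b: samples model 1 got right and model 2 got wrong
    c: samples model 1 got wrong and model 2 got right
    """
    if b + c == 0:
        # The models never disagree, so there is no evidence of a difference
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical discordant counts, for illustration only
chi2_example, p_example = mcnemar_corrected(b=25, c=60)
print(f'chi-squared statistic: {chi2_example}, p-value: {p_example}')
```

Only the samples on which the models disagree affect the result; samples that both models classify correctly, or both misclassify, do not enter the statistic.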

In [None]:
# Perform 5x2 cross-validation test to compare the models

from mlxtend.evaluate import paired_ttest_5x2cv

t, p = paired_ttest_5x2cv(estimator1=wine_model, estimator2=forest_model, X=features_train, y=predictions_train)

print(f't-statistic: {t}')
print(f'p-value: {p}')

**Is the difference in the accuracy of the two models statistically significant?**

The p-value is very low and is below the accepted threshold of 5% for statistical significance. This result indicates that there is a statistically significant difference in the accuracy of the two models.
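The t-statistic reported by the 5x2 cross-validation test can be sketched as follows, based on the usual description of the test: the data is split in half five times, both models are scored on each half, and the statistic divides the score difference from the first fold of the first repetition by an estimate of the standard deviation pooled over all five repetitions. The function name and the differences below are hypothetical values chosen to show the arithmetic; they are not output from the wine models.

```python
import math

def t_statistic_5x2cv(differences):
    """t-statistic for the 5x2cv paired t-test.

    differences: five (fold 1, fold 2) pairs, each holding the difference
    in score between the two models on that fold.
    """
    variance_sum = 0.0
    for d1, d2 in differences:
        mean_diff = (d1 + d2) / 2
        # Variance estimate for this repetition of 2-fold cross-validation
        variance_sum += (d1 - mean_diff) ** 2 + (d2 - mean_diff) ** 2
    # Assumes the differences are not all identical (otherwise this divides by zero)
    return differences[0][0] / math.sqrt(variance_sum / 5)

# Hypothetical accuracy differences (model 2 minus model 1) for 5 repetitions
example_differences = [(0.06, 0.04), (0.05, 0.07), (0.04, 0.06), (0.07, 0.05), (0.06, 0.04)]
t_example = t_statistic_5x2cv(example_differences)
print(f't-statistic: {t_example}')
```

The statistic is compared against a t-distribution with 5 degrees of freedom, for which an absolute value above roughly 2.571 corresponds to a two-sided p-value below 0.05.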

In [None]:
# Compare the DET curves for the two models

from sklearn.metrics import DetCurveDisplay
import matplotlib.pyplot as plt

fig, ax_det = plt.subplots(1, 1, figsize=(10, 10))
_ = DetCurveDisplay.from_estimator(wine_model, features_test, predictions_test, ax=ax_det, name='Logistic Regression Model')

_ = DetCurveDisplay.from_estimator(forest_model, features_test, predictions_test, ax=ax_det, name='Random Forest Model')


**How does the Logistic Regression model compare to the Random Forest model?**

The DET curve shows that the Random Forest model generally has a lower error rate than the Logistic Regression model. Remember that the axes on this graph have a non-linear scale.
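To give a feel for the non-linear scale, the sketch below maps a few error rates through the inverse of the standard normal cumulative distribution function (the normal deviate scale on which DET curves are commonly plotted). The function name and the chosen rates are for illustration only.

```python
from statistics import NormalDist

def normal_deviate(rate):
    """Map an error rate in (0, 1) to its position on a normal deviate axis."""
    return NormalDist().inv_cdf(rate)

# Show where some example error rates would fall on the axis
for rate in (0.01, 0.05, 0.20, 0.50):
    print(f'Error rate {rate:.2f} -> axis position {normal_deviate(rate):.3f}')

# Compare the axis distance covered by two ranges of error rates
low_range = normal_deviate(0.05) - normal_deviate(0.01)
mid_range = normal_deviate(0.50) - normal_deviate(0.20)
print(f'Axis distance from 1% to 5%: {low_range:.3f}')
print(f'Axis distance from 20% to 50%: {mid_range:.3f}')
```

On this scale, the step from 1% to 5% takes up a distance comparable to the step from 20% to 50%, so differences between models at low error rates are stretched out and easier to see.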

# Conclusions

It is important to understand how to measure the effects of tuning a model in different ways, and how to compare the performance of two models.

Scaling the features can have a notable effect on a linear model, although the results will likely be less dramatic on a tree-based model.

This exercise also highlights that algorithm selection is an important part of building a machine learning classification model. The random forest model worked much better than the logistic regression model, even without performing any tuning.