<a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_3/Labs/Lab3_1_Refining_a_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3.1: Refining a Machine Learning Model

In this lab, you'll perform the following tasks:

- Build a Logistic Regression model to classify the data without any modifications to the data
- Examine the results and measure the performance, especially the precision
- Explore and refine the dataset
- Recreate and retest the model
- Repeat until the performance is optimized

You'll also compare the performance of two models constructed using different algorithms.

## Scenario

This dataset relates to white variants of the Portuguese "Vinho Verde" wine. It describes the amounts of various chemicals present in each wine and their effect on its quality. This is a binary classification dataset; the quality is either 'Poor' or 'Good'. Your task is to predict the quality of a wine from the given data.

The dataset contains the following columns:

Input variables (based on physicochemical tests):\
1 - fixed acidity\
2 - volatile acidity\
3 - citric acid\
4 - residual sugar\
5 - chlorides\
6 - free sulfur dioxide\
7 - total sulfur dioxide\
8 - density\
9 - pH\
10 - sulphates\
11 - alcohol\
12 - alkalinity\
13 - e330 level\
14 - effervescence index\
15 - consumable\
\
Output variable (based on sensory data):\
16 - quality (0=poor, 1=good)

## Acknowledgements:
This dataset is also available from Kaggle and the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine+quality.

The solution code for this lab is available <a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_3/Labs/Lab3_1_Refining_a_Model_solution.ipynb" target="_parent">here</a>.

# Read the data

In [None]:
# Upload the winequalitywhites.csv file from Github
# This step is complete

!wget 'https://raw.githubusercontent.com/cm-int/machine-learning-fundamentals/main/module_3/Labs/winequalitywhites.csv'

In [None]:
import pandas as pd
import numpy as np

# Read the data into a Pandas DataFrame named wine_data
# This step is complete

wine_data = pd.read_csv('winequalitywhites.csv')
wine_data

# Split the data

In [None]:
# Create the wine_features DataFrame with every column apart from quality
# This step is complete

wine_features = wine_data.drop(['quality'], axis=1)
wine_features

In [None]:
# Create the wine_quality series containing only the quality column
# This step is complete

wine_quality = wine_data['quality']
wine_quality

In [None]:
# Split the data into training and test datasets
# This step is complete

from sklearn.model_selection import train_test_split

features_train, features_test, predictions_train, predictions_test = train_test_split(wine_features, wine_quality, test_size=0.33, random_state=13)

# Create a Logistic Regression model to classify the data

In [None]:
# Create and fit a Logistic Regression model named 'wine_model'.
# Use the 'saga' solver with no regularization, an increased number of iterations, and a relaxed tolerance such as tol=1e-3 (to allow the algorithm to converge)

from sklearn.linear_model import LogisticRegression



In [None]:
# Make predictions using the features_test dataset (save the results as the variable 'test_results') and examine the confusion matrix

from sklearn.metrics import ConfusionMatrixDisplay 


In [None]:
# Calculate the precision, recall, F1-score, AUC and accuracy for the model

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score



In [None]:
# Plot the ROC curve for the model from the estimator and from the test predictions
from sklearn.metrics import roc_curve, RocCurveDisplay



In [None]:
# Plot the Precision/Recall graph for the model using the estimator and from the test results
from sklearn.metrics import PrecisionRecallDisplay



In [None]:
# Find the threshold that maximizes the F1 score (the harmonic mean of precision and recall) for the 'good' (1) class label
# Display the F1 score, precision, and recall for this threshold

# This step is complete

from sklearn.metrics import precision_recall_curve

test_results_proba = wine_model.predict_proba(features_test)
precision, recall, thresholds = precision_recall_curve(predictions_test, test_results_proba[:, 1])

precision[precision == 0] = 1e-99
recall[recall == 0] = 1e-99
fscores = (2 * precision * recall) / (precision + recall)

ix = np.argmax(fscores)
print(f'Optimal threshold is {thresholds[ix]}\nF1 Score is {fscores[ix]}\nPrecision is {precision[ix]}\nRecall is {recall[ix]}')

**What do you conclude from these statistics?**

# Evaluate the model

In [None]:
# Calculate the Gini Coefficient for the model
# Gini Coefficient = (2 × AUC) − 1

from sklearn.metrics import roc_auc_score


**What does this coefficient signify?**
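
To make the relationship Gini = (2 × AUC) − 1 concrete, here is a small worked sketch. The labels and scores are made up for illustration (they are not the wine data), and the AUC is computed by hand as the fraction of (positive, negative) pairs that the scores rank correctly, which is the quantity `roc_auc_score` reports.

```python
# Made-up labels and scores, for illustration only (not the wine data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC equals the fraction of (positive, negative) pairs in which the
# positive example receives the higher score (ties count as half)
positives = [s for s, y in zip(y_score, y_true) if y == 1]
negatives = [s for s, y in zip(y_score, y_true) if y == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in positives for n in negatives)
auc = wins / (len(positives) * len(negatives))

# Rescale AUC so that random ranking maps to 0 and perfect ranking to 1
gini = 2 * auc - 1
print(f'AUC = {auc:.2f}, Gini = {gini:.2f}')
```

A model that ranks at random has an AUC near 0.5 and therefore a Gini coefficient near 0; a perfect ranking gives 1.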

In [None]:
# Calculate Cohen's Kappa for the model

from sklearn.metrics import cohen_kappa_score


**What does this value mean?**
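
As a rough aid to interpreting the value, the sketch below computes Cohen's Kappa by hand from a hypothetical confusion matrix (the counts are invented, not taken from the wine model). Kappa compares the observed agreement between predictions and labels with the agreement expected by chance.

```python
# Hypothetical confusion-matrix counts (not from the wine model)
tn, fp, fn, tp = 40, 10, 20, 30
n = tn + fp + fn + tp

# Observed agreement: the proportion of predictions that match the labels
observed = (tp + tn) / n

# Chance agreement: the probability that prediction and label coincide
# if they were independent, given the marginal totals for each class
expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(f'Observed = {observed:.2f}, expected = {expected:.2f}, kappa = {kappa:.2f}')
```

A Kappa of 0 means the model agrees with the labels no more often than chance would; 1 means perfect agreement.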

In [None]:
# Calculate the Hamming Loss for the model
from sklearn.metrics import hamming_loss


**What proportion of the predictions are incorrect?**
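
For a single-label binary problem like this one, the Hamming Loss is simply the fraction of mismatched predictions, so it equals 1 minus the accuracy. A tiny sketch with invented labels (not the wine data) shows the arithmetic:

```python
# Invented labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the positions where the prediction differs from the label
mismatches = sum(t != p for t, p in zip(y_true, y_pred))
hamming = mismatches / len(y_true)
accuracy = 1 - hamming
print(f'Hamming loss = {hamming:.2f}, accuracy = {accuracy:.2f}')
```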

In [None]:
# Calculate the Matthews Correlation Coefficient for the model

from sklearn.metrics import matthews_corrcoef


**How strong is the relationship between the predicted and observed class labels?**
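
The Matthews Correlation Coefficient can also be worked out by hand from the four cells of a confusion matrix. The counts below are hypothetical and serve only to show the formula; they are not results from this lab.

```python
import math

# Hypothetical confusion-matrix counts (not from the wine model)
tn, fp, fn, tp = 40, 10, 20, 30

# The numerator rewards correct predictions in both classes; the
# denominator normalizes the result to the range [-1, 1]
numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator
print(f'MCC = {mcc:.3f}')
```

Values near 0 indicate little relationship between predicted and observed labels, while values near +1 or −1 indicate strong agreement or strong disagreement respectively.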

In [None]:
# Plot the cumulative gains chart for the model
!pip install Scikit-plot

import scikitplot as skplt

test_results_proba = wine_model.predict_proba(features_test)
_ = skplt.metrics.plot_cumulative_gain(predictions_test, test_results_proba, figsize=(10, 10))

**Overall, do your findings confirm your earlier conclusions about the precision and recall of the model?**


# Refine the model - scale the data
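
As background for this step, min-max scaling maps each feature to the range [0, 1] using (x − min) / (max − min), computed per column. The sketch below applies that formula by hand to two made-up columns on very different scales; the helper name `min_max_scale` and the values are assumptions for illustration only.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up values on very different scales (not the wine data)
alcohol = [8.0, 9.5, 11.0, 14.0]
total_sulfur_dioxide = [100.0, 150.0, 200.0, 300.0]

print(min_max_scale(alcohol))
print(min_max_scale(total_sulfur_dioxide))
```

After scaling, both columns span the same range, so neither dominates the gradient updates of a solver such as 'saga' purely because of its units.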

In [None]:
# Apply a MinMaxScaler to the wine_features dataframe to create a new dataframe named 'scaled_wine_features'

from sklearn.preprocessing import MinMaxScaler


In [None]:
# Rebuild the model with scaled features:
# - Recreate test and training datasets
# - Build the Logistic Regression model with the same parameters as before

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


In [None]:
# Test the model
# - Make predictions and examine the confusion matrix
# - Calculate the precision, recall, F1-score, AUC and accuracy for the model
# - Plot the ROC curve for the model from the estimator and from the test predictions
# - Plot the Precision/Recall graph for the model using the estimator and from the test results


In [None]:
# Evaluate the model
# - Calculate the Gini Coefficient for the model
# - Calculate Cohen's Kappa
# - Calculate the Hamming Loss
# - Calculate the Matthews Correlation Coefficient
# - Plot the cumulative gains chart for the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt


**Has the model improved?**

# Refine the model - remove constant and quasi-constant features

In [None]:
# Look for features with little variance in the scaled dataframe using the 'var()' function of the dataframe.


**Which features have a notably small variance?**
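
The idea behind this check can be sketched without the wine data: a column that never changes has a variance of exactly zero and cannot help a model separate the classes. The column values and the 0.01 cut-off below are assumptions chosen for illustration.

```python
import statistics

# Made-up scaled columns; 'consumable' is constant in this toy example
columns = {
    'alcohol': [0.0, 0.25, 0.5, 1.0, 0.75],
    'consumable': [1, 1, 1, 1, 1],
}

# Sample variance, the same statistic that DataFrame.var() reports by default
variances = {name: statistics.variance(values) for name, values in columns.items()}

# The 0.01 threshold is an illustrative assumption, not a recommended value
low_variance = [name for name, var in variances.items() if var < 0.01]
print(variances)
print(f'Low-variance features: {low_variance}')
```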

In [None]:
# Verify that 'consumable' has only one value - display all the unique values in this feature


In [None]:
# Rebuild the model without this feature:
# - Drop the feature from the scaled_wine_features dataframe to create a dataframe named 'no_constants_wine_features'
# - Recreate test and training datasets
# - Build the Logistic Regression model with the same parameters as before

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


In [None]:
# Test the model
# - Make predictions and examine the confusion matrix
# - Calculate the precision, recall, F1-score, AUC and accuracy for the model
# - Plot the ROC curve for the model from the estimator and from the test predictions
# - Plot the Precision/Recall graph for the model using the estimator and from the test results

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay


In [None]:
# Evaluate the model
# - Calculate the Gini Coefficient for the model
# - Calculate Cohen's Kappa
# - Calculate the Hamming Loss
# - Calculate the Matthews Correlation Coefficient
# - Plot the cumulative gains chart for the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt



**Has the model improved?**

# Refine the model - find and remove correlated features

In [None]:
# Find correlated features in the 'no_constants_wine_features' dataset

import seaborn as sns
from matplotlib import pyplot as plt


**Which features show a strong correlation?**

In [None]:
# Remove the e330.level and alkalinity features and rebuild the model
# - Drop the features from the no_constants_wine_features dataframe to create the no_correlation_wine_features dataset
# - Recreate test and training datasets
# - Build the Logistic Regression model with the same parameters as before

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


In [None]:
# Test the model
# - Make predictions and examine the confusion matrix
# - Calculate the precision, recall, F1-score, AUC and accuracy for the model
# - Plot the ROC curve for the model from the estimator and from the test predictions
# - Plot the Precision/Recall graph for the model using the estimator and from the test results

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay


In [None]:
# Evaluate the model
# - Calculate the Gini Coefficient for the model
# - Calculate Cohen's Kappa
# - Calculate the Hamming Loss
# - Calculate the Matthews Correlation Coefficient
# - Plot the cumulative gains chart for the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt


**Has the model improved?**

# Refine the model - remove noise using univariate feature selection

In [None]:
# Perform SHAP analysis to find the features that have the most impact on predictions
# This step is complete

!pip install shap

import shap

explainer = shap.Explainer(wine_model.predict, features_test) 
values = explainer(features_train)

shap.summary_plot(shap_values=values, features=features_train, plot_type="bar")
shap.summary_plot(shap_values=values, features=features_train, plot_type="violin") 

In [None]:
# Create a dataset named 'reduced_wine_features' from the 'no_correlation_wine_features' dataset with only the top five features and rebuild the model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


In [None]:
# Test the model (repeat all previous tests)

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay



In [None]:
# Evaluate the model (repeat all previous tests)

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt



**Has the model improved?**


# Refine the model - find the combination of features that gives the lowest false positive rate

In [None]:
# Use the SelectFpr class to find the combination of features that minimizes the FPR, using the 'no_correlation_wine_features' dataset

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFpr
from sklearn.feature_selection import chi2


In [None]:
# Print the feature names and scores


**How does this compare to the features found by using SHAP analysis?**
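
For comparison, here is a minimal sketch of how `SelectFpr` behaves, run on synthetic data from `make_classification` rather than the wine dataset. The `alpha` value is an assumption. `SelectFpr` keeps the features whose p-value from the score function falls below `alpha`, so the false positive rate it controls is that of the per-feature statistical tests. The `chi2` score function requires non-negative inputs, which min-max scaled data satisfies.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption; not the wine dataset)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=13)

# chi2 needs non-negative features, so scale everything to [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X)

# alpha=0.05 is an illustrative choice for the p-value cut-off
selector = SelectFpr(score_func=chi2, alpha=0.05)
X_selected = selector.fit_transform(X_scaled, y)

mask = selector.get_support()
for i, (score, p_value) in enumerate(zip(selector.scores_, selector.pvalues_)):
    status = 'kept' if mask[i] else 'dropped'
    print(f'feature {i}: score={score:.3f}, p-value={p_value:.4f} ({status})')
```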

In [None]:
# Rebuild and fit the model with these features. Name the new dataframe 'reduced_wine_features'

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


In [None]:
# Test the model

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay


In [None]:
# Evaluate the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt



**Has the model improved?**

# Refine the model - remove noise using multivariate feature selection

In [None]:
# Use forward selection to find the best combination of features
# This step is complete

from sklearn.linear_model import LogisticRegression 
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split

# penalty=None disables regularization (the string 'none' was removed in scikit-learn 1.4)
logistic_model = LogisticRegression(solver='saga', penalty=None, max_iter=2000, tol=1e-3)

features_train, features_test, predictions_train, predictions_test = train_test_split(no_correlation_wine_features, wine_quality, test_size=0.33, random_state=13)

sfs_forward = SequentialFeatureSelector(logistic_model, n_features_to_select=5, direction="forward")
_ = sfs_forward.fit(features_train, predictions_train) 

print(f'Features selected by forward sequential selection: {sfs_forward.get_feature_names_out()}') 

In [None]:
# Rebuild the model with only the top five features

from sklearn.linear_model import LogisticRegression


In [None]:
# Test the model

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay


In [None]:
# Evaluate the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt


**Has the model improved?**

# Investigate the impact of regularization on the model

In [None]:
# Compute the learning curve of the model before regularization. Use the 'no_constants_wine_features' dataframe to generate the test and training datasets
# This step is complete

from sklearn.model_selection import learning_curve, train_test_split
from sklearn.linear_model import LogisticRegression

features_train, features_test, predictions_train, predictions_test = train_test_split(no_constants_wine_features, wine_quality, test_size=0.33, random_state=13)
# penalty=None disables regularization (the string 'none' was removed in scikit-learn 1.4)
wine_model = LogisticRegression(solver='saga', penalty=None, max_iter=2000, tol=1e-3)
_ = wine_model.fit(features_train, predictions_train)

# Compute the data for the learning curve using 10-fold cross validation of the model
train_sizes, train_scores, test_scores = learning_curve(estimator=wine_model, X=features_train, y=predictions_train, train_sizes=np.linspace(0.1, 1.0, 19), cv=10, scoring='precision')

In [None]:
# Plot the learning curve
# This step is complete

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10))
plt.plot((0,3000), (0.75,0.75), c='Grey', alpha=0.5)
plt.plot((0,3000), (0.80,0.80), c='Grey', alpha=0.5)
plt.plot(train_sizes, np.mean(train_scores,axis=1), label='Train (no penalty)')
plt.plot(train_sizes, np.mean(test_scores,axis=1), label='Test (no penalty)')
plt.xlabel('Dataset Size', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylabel('Precision', fontdict={'family': 'serif', 'color':'darkred', 'weight':'normal', 'size': 28})
plt.ylim(bottom=0.7, top=0.85)
plt.legend(prop={'size': 20})
plt.show()

print(f'Best test score precision: {np.max(np.mean(test_scores,axis=1))}')

In [None]:
# Compute the learning curve of the model with L1 regularization



In [None]:
# Plot the learning curve

import matplotlib.pyplot as plt


In [None]:
# Compute the learning curve of the model with L2 regularization


In [None]:
# Plot the learning curve

import matplotlib.pyplot as plt


In [None]:
# Compute the learning curve of the model with Elastic Net regularization


In [None]:
# Plot the learning curve

import matplotlib.pyplot as plt


**What do you conclude about applying the different forms of regularization to this model?**
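
When comparing the penalties, it may help to see the general shape of the computation on neutral data. The sketch below computes a learning curve for an L1-regularized model on synthetic data from `make_classification`, not the wine data. The regularization strength `C=1.0`, the fold count, and the training sizes are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (an assumption; not the wine dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=13)

# L1-regularized model; C=1.0 is an illustrative choice, not a tuned value
l1_model = LogisticRegression(solver='saga', penalty='l1', C=1.0,
                              max_iter=2000, tol=1e-3)

# Score precision at five training-set sizes using 5-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(
    estimator=l1_model, X=X, y=y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring='precision')

for size, train, test in zip(train_sizes,
                             train_scores.mean(axis=1),
                             test_scores.mean(axis=1)):
    print(f'{int(size):4d} samples: train precision {train:.3f}, '
          f'validation precision {test:.3f}')
```

A widening gap between the training and validation curves suggests overfitting, which is the situation in which regularization is most likely to help.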

# Compare the Logistic Regression model to a Random Forest model

In [None]:
# Create a random forest model over the same data
# This step is complete

from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier()
_ = forest_model.fit(features_train, predictions_train)

In [None]:
# Test the random forest model

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import roc_curve, RocCurveDisplay


In [None]:
# Evaluate the model

from sklearn.metrics import roc_auc_score, cohen_kappa_score, hamming_loss, log_loss, matthews_corrcoef
import scikitplot as skplt


**How does this model compare to the Logistic Regression model?**

In [None]:
# Perform McNemar's test to compute the marginal error rates of the models and compare these error rates

!pip install Mlxtend

from mlxtend.evaluate import mcnemar_table, mcnemar


**What does this test indicate?**
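
To help read the result, the sketch below works through the continuity-corrected McNemar statistic by hand. The two disagreement counts are hypothetical, not taken from the models in this lab. Only the cases where exactly one model is correct enter the statistic, and the p-value comes from a chi-squared distribution with one degree of freedom.

```python
import math

# Hypothetical disagreement counts between two classifiers on one test set
b = 25  # model A correct, model B wrong
c = 10  # model A wrong, model B correct

# Continuity-corrected McNemar statistic
chi2_stat = (abs(b - c) - 1) ** 2 / (b + c)

# Survival function of the chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(f'chi-squared = {chi2_stat:.2f}, p-value = {p_value:.4f}')
if p_value < 0.05:
    print('The two models make significantly different errors at the 5% level')
```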


In [None]:
# Perform 5x2 cross-validation test to compare the models

from mlxtend.evaluate import paired_ttest_5x2cv


**Is the difference in accuracy between the two models statistically significant?**


In [None]:
# Compare the DET curves for the two models

from sklearn.metrics import DetCurveDisplay
import matplotlib.pyplot as plt



**How does the Logistic Regression model compare to the Random Forest model?**


# Conclusions

It is important to understand how to measure the effects of tuning a model in different ways, and how to compare the performance of two models.

Scaling the features can have a notable effect on a linear model, although the results will likely be less dramatic on a tree-based model.

This exercise also highlights that algorithm selection is an important part of building a machine learning classification model. The random forest model worked much better than the logistic regression model, even without performing any tuning.