<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_2/Section_8_Python_Example__Model_Comparison_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 8 Python example - model comparison techniques

Model comparison is a central part of model selection in data science. It means evaluating different statistical or machine learning models against one another to determine which best suits the needs of a project. Accuracy is only one criterion: other performance metrics, model simplicity, computation time, and ease of interpretation also matter. In this section, we'll demonstrate how to implement model comparison techniques in Python using the Scikit-learn library.

1. Setting Up the Environment:

First, ensure that Python and Scikit-learn are installed in your environment. If Scikit-learn is not installed, you can install it using pip:

In [None]:
%pip install scikit-learn

2. Importing Required Libraries:

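Import the libraries needed for data handling, model fitting, and evaluation. We'll use NumPy to summarize scores, Scikit-learn for modeling and metrics, and Matplotlib for visualization. pandas is imported for convenience if you want to inspect the data as a DataFrame, though the example below does not require it.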

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

3. Generating Synthetic Data:

For this example, let’s create a synthetic dataset suitable for classification:

In [None]:
from sklearn.datasets import make_classification

# Generate a binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
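
Before training, it can help to confirm the shape of each split and the balance between classes. The following is a minimal sketch, assuming the same `make_classification` settings as above. The `stratify=y` argument and the `_s`-suffixed variable names are additions for illustration and are not part of the original split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic data settings as in the cell above
X_s, y_s = make_classification(n_samples=1000, n_features=20, random_state=42)

# stratify=y_s is an assumption added here: it keeps the class proportions
# the same in the training and testing sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_s, y_s, test_size=0.25, random_state=42, stratify=y_s
)

print("Training shape:", X_train_s.shape)
print("Testing shape:", X_test_s.shape)
print("Class counts (train):", np.bincount(y_train_s))
print("Class counts (test):", np.bincount(y_test_s))
```

Separate variable names are used so that running this check does not overwrite the `X_train` and `X_test` arrays used in the rest of the example.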

4. Training Models:

We'll compare two commonly used models: Logistic Regression and Random Forest.

In [None]:
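# Initialize the models; a fixed random_state makes the Random Forest reproducible,
# and a higher max_iter gives the Logistic Regression solver room to converge
model_lr = LogisticRegression(max_iter=1000)
model_rf = RandomForestClassifier(random_state=42)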

# Fit the models
model_lr.fit(X_train, y_train)
model_rf.fit(X_train, y_train)

5. Evaluating and Comparing Models:

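Use 5-fold cross-validation on the training set to estimate each model's accuracy, then use ROC AUC scores and ROC curves on the held-out test set to compare how well the models separate the two classes. Note that `cross_val_score` fits fresh copies of each model on every fold, so it does not reuse the fits from step 4.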

In [None]:
# Perform cross-validation
scores_lr = cross_val_score(model_lr, X_train, y_train, cv=5, scoring='accuracy')
scores_rf = cross_val_score(model_rf, X_train, y_train, cv=5, scoring='accuracy')

print("Average accuracy for Logistic Regression: {:.2f}%".format(np.mean(scores_lr) * 100))
print("Average accuracy for Random Forest: {:.2f}%".format(np.mean(scores_rf) * 100))

# Predicted probabilities for the positive class on the test set
proba_lr = model_lr.predict_proba(X_test)[:, 1]
proba_rf = model_rf.predict_proba(X_test)[:, 1]

# Compute ROC AUC scores
roc_auc_lr = roc_auc_score(y_test, proba_lr)
roc_auc_rf = roc_auc_score(y_test, proba_rf)

print("ROC AUC for Logistic Regression: {:.2f}".format(roc_auc_lr))
print("ROC AUC for Random Forest: {:.2f}".format(roc_auc_rf))

# Generate ROC curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, proba_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, proba_rf)

plt.figure()
plt.plot(fpr_lr, tpr_lr, label='Logistic Regression (area = {:.2f})'.format(roc_auc_lr))
plt.plot(fpr_rf, tpr_rf, label='Random Forest (area = {:.2f})'.format(roc_auc_rf))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
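
The introduction lists computation time alongside accuracy as a comparison criterion. One way to gather several metrics and fit times in a single pass is Scikit-learn's `cross_validate`. The following is a minimal sketch, not the approach used in the cell above. It assumes the same synthetic data settings and cross-validates on the full dataset for brevity, and the names `X_cv`, `y_cv`, `candidates`, and `summary` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Assumed data: same settings as the synthetic dataset in step 3
X_cv, y_cv = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hypothetical candidate set; random_state is fixed for reproducibility
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

summary = {}
for name, estimator in candidates.items():
    # cross_validate returns per-fold test scores and fit times in one call
    cv_results = cross_validate(
        estimator, X_cv, y_cv, cv=5, scoring=["accuracy", "roc_auc"]
    )
    summary[name] = {
        "accuracy": np.mean(cv_results["test_accuracy"]),
        "roc_auc": np.mean(cv_results["test_roc_auc"]),
        "fit_time_s": np.mean(cv_results["fit_time"]),
    }

for name, row in summary.items():
    print("{}: accuracy={:.3f}, ROC AUC={:.3f}, mean fit time={:.3f}s".format(
        name, row["accuracy"], row["roc_auc"], row["fit_time_s"]))
```

Fit times depend on the machine and on settings such as `n_estimators`, so treat them as a rough guide to relative cost rather than a benchmark.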

6. Conclusion:

The above example illustrates how to compare models using cross-validation and ROC curves. Cross-validated accuracy gives a quick estimate of how often each model is correct at its default decision threshold, while ROC curves and AUC scores show how well each model separates the classes across all classification thresholds. Together, these techniques support an informed choice about which model to deploy based on the project's specific requirements.
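
To make the point about classification thresholds concrete, the sketch below recomputes the true positive rate, false positive rate, and accuracy of a Logistic Regression model at three thresholds. It is an illustration under stated assumptions: the data settings match step 3, the thresholds 0.3, 0.5, and 0.7 are arbitrary choices, and the `_th`-suffixed names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumed data and split: same settings as steps 3 and 4
X_th, y_th = make_classification(n_samples=1000, n_features=20, random_state=42)
X_th_train, X_th_test, y_th_train, y_th_test = train_test_split(
    X_th, y_th, test_size=0.25, random_state=42
)

clf_th = LogisticRegression(max_iter=1000).fit(X_th_train, y_th_train)
proba_th = clf_th.predict_proba(X_th_test)[:, 1]  # probability of class 1

threshold_rows = []
for threshold in (0.3, 0.5, 0.7):  # arbitrary illustrative thresholds
    # Predict class 1 whenever the probability reaches the threshold
    y_pred = (proba_th >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_th_test, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)  # true positive rate (one ROC curve coordinate)
    fpr = fp / (fp + tn)  # false positive rate (the other coordinate)
    acc = accuracy_score(y_th_test, y_pred)
    threshold_rows.append((threshold, tpr, fpr, acc))
    print("threshold={:.1f}  TPR={:.2f}  FPR={:.2f}  accuracy={:.2f}".format(
        threshold, tpr, fpr, acc))
```

Each row corresponds to a single point on the ROC curve. Raising the threshold can only lower or hold both rates, which is why one accuracy figure at one threshold says less than the full curve.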