
#Model Evaluation and Validation


##Overview

Model evaluation and validation are essential steps in the machine learning workflow: they check that a trained model performs well on unseen data and generalizes effectively. Python provides a variety of libraries and tools to perform these tasks efficiently. This section introduces the key concepts and techniques involved in model evaluation and validation in Python.

One of the primary goals of model evaluation is to assess the performance of a trained machine learning model. A common practice is to split the dataset into two subsets: the training set and the testing set. The training set is used to train the model, while the testing set is used to evaluate its performance. Scikit-learn, a popular Python library for machine learning, provides the `train_test_split` function to easily perform this data splitting.

To evaluate the model's performance, various metrics can be employed, depending on the type of machine learning problem. For classification tasks, metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) are commonly used. For regression tasks, metrics like mean squared error (MSE), mean absolute error (MAE), and R-squared are frequently employed.
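As a small illustration of the regression metrics named above, the following sketch computes MSE, MAE, and R-squared with Scikit-learn. The arrays `y_true` and `y_hat` are made-up values chosen only for this example, not results from any dataset in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true values and predictions, chosen only for illustration
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_hat)   # mean of the squared errors
mae = mean_absolute_error(y_true, y_hat)  # mean of the absolute errors
r2 = r2_score(y_true, y_hat)              # 1 - (residual sum of squares / total sum of squares)

print("MSE:", mse)
print("MAE:", mae)
print("R-squared:", r2)
```

MSE penalizes large errors more heavily than MAE because the errors are squared before averaging, which is one reason the two are often reported together.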

Cross-validation is another critical technique in model evaluation, especially when the dataset is limited. It involves dividing the data into multiple subsets (folds) and iteratively training and testing the model on different combinations of these folds. This process helps in obtaining a more robust estimate of the model's performance. Scikit-learn provides the `cross_val_score` function to perform cross-validation easily.

Model validation also underpins hyperparameter tuning, the search for the configuration that gives the best validated performance. Grid search and randomized search are popular techniques, implemented by the `GridSearchCV` and `RandomizedSearchCV` classes from Scikit-learn. Grid search tries every combination in a specified grid, while randomized search samples a fixed number of combinations; both score each candidate with cross-validation and report the best configuration found.
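A minimal sketch of grid search follows. It assumes a synthetic dataset from `make_classification` (so it runs without a download) and an illustrative grid over the logistic regression parameter `C`; neither choice comes from this notebook's examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (an assumption, used so the sketch runs offline)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Candidate values for the inverse regularization strength C (illustrative grid)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated grid search; tuning only ever sees the training split
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", search.score(X_te, y_te))
```

Keeping the test split out of the search means the final test accuracy is not influenced by the tuning process. `RandomizedSearchCV` has a similar interface but takes `param_distributions` and an `n_iter` budget instead of an exhaustive grid.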



#Training and testing split



##Overview


The training and testing split is a fundamental concept in machine learning that plays a crucial role in developing reliable and accurate predictive models. When building a machine learning model, it is essential to evaluate its performance on unseen data to assess its generalization ability. Dividing the available dataset into two distinct subsets, the training set and the testing set, makes this possible.

**1. Purpose of Training and Testing Split:**
The main purpose of the training and testing split is to assess how well a machine learning model can generalize to new, unseen data. During the training phase, the model learns the underlying patterns and relationships present in the training data, adjusting its parameters to minimize the prediction errors. However, it is critical to ensure that the model does not simply memorize the training data but rather captures the underlying patterns that apply to unseen data as well. The testing set serves as a proxy for the real-world data that the model will encounter in production, allowing us to evaluate its performance on data it has never seen before.

**2. Dividing the Dataset:**
The process of creating a training and testing split involves dividing the original dataset into two distinct and non-overlapping subsets. Typically, the majority of the data (around 70-80%) is allocated to the training set, while the remaining portion (around 20-30%) becomes the testing set. This division ensures that the model is exposed to a sufficient amount of data during training while leaving enough unseen data for evaluation.

**3. Importance of Unbiased Split:**
It is crucial to ensure that the training and testing sets are representative of the overall dataset. The split should be unbiased, meaning that both subsets should contain a diverse and random sampling of data points from different classes or categories. An unbiased split helps avoid introducing biases into the model's training process, leading to more accurate and reliable performance evaluations.

**4. Cross-Validation:**
In some cases, a single train-test split may not provide a robust evaluation of the model's performance, because the result depends on which data points happen to land in the testing set. To overcome this limitation and obtain more reliable results, techniques like k-fold cross-validation can be employed. Cross-validation involves dividing the data into multiple subsets or "folds," using each fold in turn as the testing set while the remaining folds are used for training. The process is repeated once per fold, and the performance metrics are averaged to obtain a more comprehensive assessment of the model's generalization performance.
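The representativeness concern in point 3 can be addressed for classification targets with the `stratify` argument of `train_test_split`. The sketch below uses made-up imbalanced labels (80 negatives, 20 positives) and a placeholder feature column, both assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 negatives and 20 positives (illustrative only)
y_imb = np.array([0] * 80 + [1] * 20)
X_imb = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y_imb asks train_test_split to preserve the class proportions in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42
)

print("Positive rate overall:", y_imb.mean())
print("Positive rate in train:", y_tr_s.mean())
print("Positive rate in test:", y_te_s.mean())
```

Without `stratify`, a purely random split can leave the testing set with noticeably more or fewer positives than the full dataset, especially when the data is small or imbalanced.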



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Split the dataset into features (X) and target variable (y)
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Perform the training and testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


In this example, we first load the Pima Indian Diabetes dataset using Pandas. We then split the dataset into features (X) and the target variable (y). The features are all the columns except for the 'Outcome' column, which represents the presence or absence of diabetes. The target variable is the 'Outcome' column itself.

Next, we use the `train_test_split` function from scikit-learn to split the data into training and testing sets. We pass the features (X) and the target variable (y) to the function along with the `test_size` parameter, which specifies the proportion of the dataset that should be allocated to the testing set (in this case, 20%). The `random_state` parameter ensures reproducibility of the split.

Finally, we print the shapes of the resulting splits (`X_train`, `X_test`, `y_train`, `y_test`) to verify the sizes of the training and testing sets.

Note: Make sure you have the scikit-learn and pandas libraries installed to run this example (`pip install scikit-learn pandas`).


##Cross-validation techniques
Cross-validation is a technique used in machine learning to evaluate the performance and generalization ability of a model. It involves dividing the dataset into multiple subsets, training the model on a portion of the data, and evaluating its performance on the remaining part. Scikit-Learn provides several cross-validation techniques through its `model_selection` module, such as K-fold cross-validation and stratified K-fold cross-validation.

Here's an example of using K-fold cross-validation and stratified K-fold cross-validation on the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate features and target variable
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Perform K-fold cross-validation with Logistic Regression
# (KFold splits the rows without looking at the class labels)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# max_iter is raised above the default to help the solver converge on unscaled features
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=kfold)

print("K-fold cross-validation scores:")
print(scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())

# Perform stratified K-fold cross-validation with Logistic Regression
# (StratifiedKFold keeps the class proportions similar in every fold)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)
stratified_scores = cross_val_score(model, X, y, cv=stratified_kfold)

print("Stratified K-fold cross-validation scores:")
print(stratified_scores)
print("Mean accuracy:", stratified_scores.mean())
print("Standard deviation:", stratified_scores.std())


In this example, we first load the Pima Indian Diabetes dataset using Pandas. Then, we separate the features (X) and the target variable (y).

For K-fold cross-validation, we create a `KFold` object with 5 splits, shuffle the data, and set a random seed for reproducibility. We initialize a logistic regression model and use the `cross_val_score` function to perform K-fold cross-validation on the model using the features (X) and target variable (y). The resulting scores are printed along with the mean accuracy and standard deviation.

For stratified K-fold cross-validation, we follow the same process but use a `StratifiedKFold` object instead. This technique ensures that each fold contains approximately the same proportion of target classes as the whole dataset, which is important when dealing with imbalanced datasets. The scores, mean accuracy, and standard deviation are printed for stratified K-fold cross-validation as well.

Note that in the example above, we use logistic regression as a simple example model, but you can replace it with any other model or algorithm from Scikit-Learn according to your requirements.
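To make the difference between the two splitters concrete, the following sketch counts how many positive labels land in each test fold. It uses made-up labels (90 negatives, 10 positives) rather than the diabetes data, and the helper `positives_per_fold` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 90 negatives followed by 10 positives (illustrative only)
y_cv = np.array([0] * 90 + [1] * 10)
X_cv = np.zeros((100, 1))  # features are irrelevant for inspecting the splits

def positives_per_fold(splitter):
    """Count the positive labels in each test fold produced by the splitter."""
    return [int(y_cv[test_idx].sum()) for _, test_idx in splitter.split(X_cv, y_cv)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=42))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("Positives per test fold, KFold:", plain)
print("Positives per test fold, StratifiedKFold:", strat)
```

With `StratifiedKFold`, every test fold receives the same share of the minority class, whereas the counts under `KFold` depend on the shuffle.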


##Evaluation metrics (accuracy, precision, recall, F1-score)

Scikit-Learn provides various evaluation metrics to assess the performance of classification models. Here are the commonly used evaluation metrics in Scikit-Learn:

1. Accuracy: Accuracy measures the proportion of correctly classified instances out of the total instances. It is computed as the ratio of the number of correct predictions to the total number of predictions.

2. Precision: Precision measures the proportion of correctly predicted positive instances out of the total instances predicted as positive. It is computed as the ratio of true positives to the sum of true positives and false positives.

3. Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of the total actual positive instances. It is computed as the ratio of true positives to the sum of true positives and false negatives.

4. F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balanced measure that considers both precision and recall. F1-score is useful when you want to find a balance between precision and recall.
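The four definitions above can be checked by hand from confusion-matrix counts. The sketch below uses ten made-up labels (not model output) and compares the manual formulas with Scikit-Learn's metric functions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels for ten instances (1 = positive, 0 = negative)
y_actual = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_guess = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_actual, y_guess).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Accuracy:", accuracy_manual, "vs", accuracy_score(y_actual, y_guess))
print("Precision:", precision_manual, "vs", precision_score(y_actual, y_guess))
print("Recall:", recall_manual, "vs", recall_score(y_actual, y_guess))
print("F1-score:", f1_manual, "vs", f1_score(y_actual, y_guess))
```

Here there are 3 true positives, 2 false positives, 1 false negative, and 4 true negatives, so precision (3/5) and recall (3/4) differ and the F1-score falls between them.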

Here's an example using the Pima Indian Diabetes dataset to demonstrate the calculation of these evaluation metrics:


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate features and target variable
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

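# Train a logistic regression model
# (max_iter is raised above the default to help the solver converge on unscaled features)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)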

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)


In this example, we load the Pima Indian Diabetes dataset using the Pandas library. We split the dataset into training and test sets using the `train_test_split()` function from Scikit-Learn. Then, we train a logistic regression model on the training set and make predictions on the test set.

After that, we calculate the evaluation metrics by comparing the predicted values (`y_pred`) with the actual values (`y_test`). We use the functions `accuracy_score()`, `precision_score()`, `recall_score()`, and `f1_score()` from Scikit-Learn to compute these metrics.

Finally, we print the evaluation metrics to see the output.


##Overfitting and underfitting

Overfitting and underfitting are two common problems in machine learning models, including those implemented with the Scikit-Learn library. Let's understand each concept and provide examples using the Pima Indian Diabetes dataset.

1. Overfitting:
Overfitting occurs when a model learns the training data too well and becomes too specialized, failing to generalize well to new, unseen data. It often happens when the model is too complex and captures noise or random fluctuations in the training data. Signs of overfitting include high accuracy on the training set but poor performance on the test/validation set.

Example:


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Split the dataset into features and target variable
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an overly complex Decision Tree classifier
# (random_state fixes the tie-breaking so the result is reproducible)
model = DecisionTreeClassifier(max_depth=10, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Evaluate the model on the training set
train_predictions = model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)

# Evaluate the model on the test set
test_predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)

# Print the accuracies
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)


In this example, we use a Decision Tree classifier with a max_depth of 10. The model is trained on the Pima Indian Diabetes dataset, and the accuracies are calculated for both the training and test sets. If the training accuracy is significantly higher than the test accuracy, it indicates overfitting.

2. Underfitting:
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It often happens when the model has insufficient complexity or features to learn from the data. Signs of underfitting include low accuracy on both the training and test/validation sets.

Example:


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Split the dataset into features and target variable
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a simple Logistic Regression classifier
# (max_iter is raised above the default to help the solver converge on unscaled features)
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Evaluate the model on the training set
train_predictions = model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)

# Evaluate the model on the test set
test_predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)

# Print the accuracies
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)


In this example, we use a simple Logistic Regression classifier. The model is trained on the Pima Indian Diabetes dataset, and the accuracies are calculated for both the training and test sets. If both the training and test accuracies are low, it suggests underfitting, indicating that the model is too simple to capture the patterns in the data effectively.

To mitigate overfitting, you can reduce the model complexity (e.g., by reducing the number of features or constraining hyperparameters such as a decision tree's maximum depth) or use regularization techniques. To address underfitting, you can try increasing the model complexity (e.g., by adding more informative features, relaxing constraints on hyperparameters, or weakening regularization) or using a more expressive algorithm.
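The effect of model complexity described above can be observed by sweeping a single hyperparameter and comparing training and test accuracy. This is a sketch on synthetic noisy data from `make_classification` (an assumption, so it runs without a download); the depth values are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

results = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr_d, y_tr_d)
    results[depth] = (tree.score(X_tr_d, y_tr_d), tree.score(X_te_d, y_te_d))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A very shallow tree tends to score low on both sets (underfitting), while an unconstrained tree fits the training set perfectly, noise included, and typically shows a gap between training and test accuracy (overfitting).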


#Reflection points

1. **Training and Testing Split:**
   - What is the purpose of splitting data into training and testing sets?
     - The purpose is to evaluate the performance of a machine learning model on unseen data. The training set is used to fit the model, while the testing set helps assess how well the model generalizes.

   - What is the recommended ratio for splitting data into training and testing sets?
     - A common practice is to use a 70-30 or 80-20 ratio, where 70% or 80% of the data is used for training and the remaining portion for testing. However, the ratio can vary depending on the dataset size and specific requirements.

2. **Cross-Validation Techniques:**
   - What is cross-validation, and why is it useful?
     - Cross-validation is a technique to assess the model's performance by partitioning the data into multiple subsets and performing training and testing iteratively. It provides a more reliable estimate of the model's generalization ability than a single split and helps detect issues like overfitting.

   - What are some common cross-validation methods?
     - Common cross-validation methods include k-fold cross-validation, stratified k-fold cross-validation, and leave-one-out cross-validation. Each method has its own advantages and suits different scenarios.

3. **Evaluation Metrics (Accuracy, Precision, Recall, F1-Score):**
   - What is accuracy, and when is it suitable to use?
     - Accuracy measures the overall correctness of the model's predictions. It is suitable when the class distribution is balanced and false positives and false negatives have similar consequences.

   - What is precision, and when is it important to consider?
     - Precision measures the proportion of true positive predictions among all positive predictions. It is important to consider when false positives have significant consequences, such as in spam detection, where flagging a legitimate email as spam is costly.

   - What is recall, and when is it important to consider?
     - Recall (also known as sensitivity or true positive rate) measures the proportion of true positives captured among all actual positives. It is important to consider when false negatives have severe consequences, such as identifying diseases or detecting fraudulent transactions.

   - What is the F1-score, and why is it useful?
     - The F1-score is the harmonic mean of precision and recall, providing a balanced measure of the model's performance. It is particularly useful when there is an imbalance between the positive and negative classes in the dataset.

4. **Overfitting and Underfitting:**
   - What is overfitting, and how does it occur?
     - Overfitting happens when a model learns the training data too well, resulting in poor performance on unseen data. It occurs when the model becomes too complex or when the training data is insufficient.

   - What is underfitting, and how does it occur?
     - Underfitting occurs when a model is too simple or lacks the capacity to capture the underlying patterns in the data. It leads to poor performance both on the training set and unseen data.

   - What are some methods to address overfitting and underfitting?
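     - To address overfitting, common approaches include regularization techniques (e.g., L1 and L2 regularization), collecting more data, reducing model complexity, feature selection, and early stopping. To address underfitting, common approaches include increasing model complexity, adding more informative features, and weakening regularization.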


#A quiz on Model Evaluation and Validation