# Drift Monitoring and Detection

### Purpose:
In this exercise, you will learn how to detect drift in data over time and understand how it affects model predictions. Data drift occurs when the statistical properties of the input data change between the time a model was trained and the time it is used, which can reduce accuracy and performance. (The related term concept drift refers to a change in the relationship between the features and the target.)

### Objective:
By the end of this exercise, you will be able to:
1. Split a dataset to simulate training and new data.
2. Detect and visualize drift using Evidently AI.
3. Quantify drift using SciPy's statistical tests.

### Tools Used:
- Python: For data manipulation and analysis
- Scikit-learn: To load and split the dataset
- SciPy: To calculate drift metrics with the Kolmogorov-Smirnov test
- Evidently AI: To detect and visualize data drift with its powerful dashboard and reports

## Step 0: Setup
**Task**: Install and import the required libraries.

In [None]:
# Install Evidently AI (the imports below assume the pre-0.7 Report API, hence the pin)
!pip install "evidently<0.7"

# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from scipy.stats import ks_2samp



## Step 1: Load and Prepare the Dataset
**Task**: Load the Wine Quality dataset and split it into "training" and "new" datasets. Simulate drift by modifying specific features in the "new" data.

In [None]:
# Load the Wine Quality dataset from OpenML
wine_data = fetch_openml(name="wine-quality-red", version=1, as_frame=True)
df = wine_data.frame

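# Split the data into "training" (reference) and "new" (current) datasets
train_data, new_data = train_test_split(df, test_size=0.5, random_state=42)

# Work on an explicit copy to avoid a possible pandas SettingWithCopyWarning
new_data = new_data.copy()

# Introduce drift in the "new" data: scale alcohol by 10% and add Gaussian noise to pH
# (the noise is unseeded, so the pH results vary slightly between runs)
new_data["alcohol"] = new_data["alcohol"] * 1.1
new_data["pH"] = new_data["pH"] + np.random.normal(0, 0.05, new_data.shape[0])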
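
Before running any drift tooling, it can help to confirm that the modifications did what was intended. The sketch below uses a small synthetic frame rather than the Wine Quality data, so it runs without a download; `summarize_shift` is a hypothetical helper name, not part of any library, and the distribution parameters are illustrative only. Unlike the exercise, the "current" frame here is a modified copy of the same rows, which makes the effect of each change easy to read: scaling by 1.1 moves the mean by 10%, while zero-mean noise leaves the mean almost unchanged.

```python
import numpy as np
import pandas as pd


def summarize_shift(reference: pd.DataFrame, current: pd.DataFrame, columns) -> pd.DataFrame:
    """Compare mean and standard deviation of selected columns in two frames."""
    rows = []
    for col in columns:
        ref_mean, cur_mean = reference[col].mean(), current[col].mean()
        rows.append({
            "feature": col,
            "ref_mean": ref_mean,
            "cur_mean": cur_mean,
            "mean_change_pct": 100 * (cur_mean - ref_mean) / ref_mean,
            "ref_std": reference[col].std(),
            "cur_std": current[col].std(),
        })
    return pd.DataFrame(rows).set_index("feature")


# Synthetic stand-in for the wine data (values are illustrative only)
rng = np.random.default_rng(0)
reference = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 500),
    "pH": rng.normal(3.3, 0.15, 500),
})

# Apply the same two modifications used in this step
current = reference.copy()
current["alcohol"] = current["alcohol"] * 1.1
current["pH"] = current["pH"] + rng.normal(0, 0.05, len(current))

shift_summary = summarize_shift(reference, current, ["alcohol", "pH"])
print(shift_summary.round(3))
```

In the notebook itself, the equivalent call would be `summarize_shift(train_data, new_data, ["alcohol", "pH"])`; there the two frames hold different wines, so the numbers also include ordinary sampling variation.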

## Step 2: Detect Drift with Evidently AI
**Task**: Generate a drift report using Evidently AI and save it as an HTML file.

In [None]:
# Define column mapping for Evidently
column_mapping = ColumnMapping()

# Create a report with Data Drift Metrics
drift_report = Report(metrics=[DataDriftPreset()])
drift_report.run(reference_data=train_data, current_data=new_data, column_mapping=column_mapping)

# Save and display the drift report
drift_report.save_html("drift_report.html")
print("Drift report saved as 'drift_report.html'")

Drift report saved as 'drift_report.html'


**Instructions**:
1. After running the code, download the `drift_report.html` file from your Colab environment.
2. Open it in your browser to explore detailed metrics and visualizations, then answer the questions below.

**Questions**:
1. What percentage of columns are identified as drifted?
2. What specific columns are flagged as drifted?
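
The figures shown in the HTML report can usually be read programmatically as well. The sketch below assumes the legacy `Report` API imported in Step 0, where `drift_report.as_dict()` returns a dictionary whose `metrics` list includes an entry with `share_of_drifted_columns` and a per-column `drift_by_columns` mapping; those key names are an assumption tied to that API version and may differ in other releases. `summarize_drift` is a hypothetical helper, and the dictionary used to demonstrate it is a hand-written stand-in with invented values, not real report output.

```python
def summarize_drift(report_dict: dict) -> dict:
    """Extract the drifted-column share and names from a report dictionary.

    Assumes the layout described above; returns None values when the
    expected entry is missing instead of raising.
    """
    for entry in report_dict.get("metrics", []):
        result = entry.get("result", {})
        by_column = result.get("drift_by_columns")
        if by_column is None:
            continue  # not the per-column table, keep looking
        drifted = sorted(
            name for name, info in by_column.items() if info.get("drift_detected")
        )
        return {
            "share_drifted_pct": 100 * result.get("share_of_drifted_columns", 0.0),
            "drifted_columns": drifted,
        }
    return {"share_drifted_pct": None, "drifted_columns": None}


# Hand-written stand-in with invented values (NOT real Evidently output)
example_report = {
    "metrics": [
        {"metric": "DatasetDriftMetric", "result": {"share_of_drifted_columns": 0.5}},
        {
            "metric": "DataDriftTable",
            "result": {
                "share_of_drifted_columns": 0.5,
                "drift_by_columns": {
                    "feature_a": {"drift_detected": True, "drift_score": 0.001},
                    "feature_b": {"drift_detected": False, "drift_score": 0.62},
                },
            },
        },
    ]
}

summary = summarize_drift(example_report)
print(summary)
```

In the notebook, the call would be `summarize_drift(drift_report.as_dict())`. If it returns `None` values, inspect the dictionary's keys directly, since the layout assumed here may not match the installed version.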



## Step 3: Quantify Drift with Statistical Tests
**Task**: Compute drift metrics using two-sample Kolmogorov-Smirnov (KS) tests.

In [None]:
# Calculate drift metrics using Kolmogorov-Smirnov tests
print("Kolmogorov-Smirnov Test Results:")
for column in train_data.columns:
    stat, p_value = ks_2samp(train_data[column], new_data[column])
    print(f"Feature: {column} | KS Statistic: {stat:.3f}, P-value: {p_value:.3f}")

Kolmogorov-Smirnov Test Results:
Feature: fixed_acidity | KS Statistic: 0.061, P-value: 0.095
Feature: volatile_acidity | KS Statistic: 0.040, P-value: 0.528
Feature: citric_acid | KS Statistic: 0.036, P-value: 0.646
Feature: residual_sugar | KS Statistic: 0.032, P-value: 0.806
Feature: chlorides | KS Statistic: 0.027, P-value: 0.927
Feature: free_sulfur_dioxide | KS Statistic: 0.032, P-value: 0.794
Feature: total_sulfur_dioxide | KS Statistic: 0.015, P-value: 1.000
Feature: density | KS Statistic: 0.045, P-value: 0.377
Feature: pH | KS Statistic: 0.075, P-value: 0.021
Feature: sulphates | KS Statistic: 0.025, P-value: 0.963
Feature: alcohol | KS Statistic: 0.439, P-value: 0.000
Feature: class | KS Statistic: 0.034, P-value: 0.710


**Instructions**:
1. Analyze the KS statistic and p-value for each feature:
   - **High KS statistic and low p-value (< 0.05)**: Significant drift.
   - **Low KS statistic and high p-value (≥ 0.05)**: No significant drift.
   - **Moderate KS statistic with a p-value near 0.05**: A small shift that may still be worth monitoring.
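
The reading rules above can be applied in code rather than by eye. The following is a minimal sketch, assuming a 0.05 threshold and numeric columns only; `ks_drift_table` is a hypothetical helper name, and the demonstration data is synthetic rather than the Wine Quality dataset. It collects the KS statistic and p-value per feature into a table, adds a boolean `drift` flag, and sorts by the size of the shift.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_drift_table(reference: pd.DataFrame, current: pd.DataFrame,
                   alpha: float = 0.05) -> pd.DataFrame:
    """Run a two-sample KS test per shared numeric column and flag p < alpha."""
    numeric_cols = reference.select_dtypes(include="number").columns
    rows = []
    for col in numeric_cols:
        if col not in current.columns:
            continue  # only compare columns present in both frames
        stat, p_value = ks_2samp(reference[col], current[col])
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value,
                     "drift": p_value < alpha})
    # Largest KS statistic first, so the most shifted features are at the top
    return pd.DataFrame(rows).set_index("feature").sort_values("ks_stat", ascending=False)


# Synthetic demonstration data (illustrative parameters, not the wine dataset)
rng = np.random.default_rng(1)
reference = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 800),
    "density": rng.normal(0.997, 0.002, 800),
})
current = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 800) * 1.1,  # scaled by 10%, as in Step 1
    "density": rng.normal(0.997, 0.002, 800),     # drawn from the same distribution
})

ks_table = ks_drift_table(reference, current)
print(ks_table.round(3))
print(f"Share of features flagged: {ks_table['drift'].mean():.0%}")
```

In the notebook, `ks_drift_table(train_data, new_data)` would produce the same kind of table for the wine features. When many features are tested at the 0.05 level, an occasional flag can arise by chance, so the KS statistic is worth reading alongside the p-value.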

**Questions**:
1. Which features show significant drift based on their p-values (p-value < 0.05)?
2. What might be the implications of drift in a feature with highly significant drift, such as "alcohol", for a model predicting wine quality?
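
One way to reason about the second question is to measure the effect directly. The sketch below is an illustration on synthetic data, not the Wine Quality dataset: it assumes a hypothetical linear relationship between `alcohol` and a quality score (the coefficient and noise level are invented), fits scikit-learn's `LinearRegression` on undrifted data, and then scores the same held-out rows after scaling `alcohol` by 1.1 as in Step 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

# Hypothetical data: quality depends linearly on alcohol plus noise
alcohol = rng.normal(10.4, 1.0, 1000)
quality = 0.5 * alcohol + rng.normal(0, 0.3, 1000)

# Fit on the first half, hold out the second half
X_train, y_train = alcohol[:500].reshape(-1, 1), quality[:500]
X_new, y_new = alcohol[500:].reshape(-1, 1), quality[500:]
model = LinearRegression().fit(X_train, y_train)

# Same held-out rows, but the alcohol reading is inflated by 10%
X_new_drifted = X_new * 1.1

mae_clean = mean_absolute_error(y_new, model.predict(X_new))
mae_drifted = mean_absolute_error(y_new, model.predict(X_new_drifted))
print(f"MAE without drift: {mae_clean:.3f}")
print(f"MAE with drift:    {mae_drifted:.3f}")
```

Because the labels are unchanged while the input is shifted, the predictions are biased upward and the error grows. On real data, the size of the effect depends on how strongly the model relies on the drifted feature.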