<img src="media/LandingPage-Header-RED-CENTRE.jpg" alt="Notebook Banner" style="width:100%; height:auto; display:block; margin-left:auto; margin-right:auto;">

# Understanding Data Drift, Concept Drift, and Outliers

This section provides an overview of the theoretical concepts and methods used for detecting different types of drift and outliers in machine learning models, as demonstrated in the subsequent code sections.

## Data Drift

Data drift refers to changes in the statistical properties of the input data used by a machine learning model over time. This can degrade model performance even if the underlying relationship between features and the target remains constant. We use statistical tests and distance metrics to compare the distributions of features between a reference dataset (e.g., training data) and a current production dataset.

### Techniques to Measure Data Drift

The following statistical tests and metrics are frequently used to quantify data drift. They are grouped by the type of variable each method best addresses.

#### For Numerical Variables

- **Population Stability Index (PSI)**  
  Quantifies distributional differences between two data sets by binning the data and comparing percentage shares per bin. A high PSI indicates a material shift.

- **Kolmogorov–Smirnov (KS) test**  
  A non-parametric test that compares the cumulative distribution functions of two samples. Sensitive to changes in location, shape, or scale, and often a default choice for numerical features.

- **Jensen–Shannon Divergence (JSD)**  
  Measures similarity between two probability distributions. When computed with base-2 logarithms it is bounded between 0 and 1, which eases interpretation. Higher values denote greater divergence.
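
As a rough illustration of the binning idea behind PSI, the sketch below computes $\text{PSI} = \sum_i (c_i - r_i)\,\ln(c_i / r_i)$, where $r_i$ and $c_i$ are the reference and current shares in bin $i$. The function name `psi`, the choice of reference-quantile bins, the `eps` floor for empty bins, and the synthetic visit data are all assumptions made for this example, not the notebook's own implementation.

```python
import math
import random


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges come from the reference sample's quantiles (an assumption;
    equal-width bins are also common).
    """
    ref_sorted = sorted(reference)
    n_ref = len(ref_sorted)
    # Interior cut points at the 1/n_bins, 2/n_bins, ... quantiles of the reference.
    edges = [ref_sorted[(n_ref * i) // n_bins] for i in range(1, n_bins)]

    def bin_shares(sample):
        counts = [0] * n_bins
        for x in sample:
            # The number of cut points at or below x is the bin index of x.
            counts[sum(1 for edge in edges if x >= edge)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(count / len(sample), eps) for count in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Synthetic, hypothetical data: monthly visits drawn from normal distributions.
random.seed(0)
reference_visits = [random.gauss(8, 2) for _ in range(1000)]
stable_visits = [random.gauss(8, 2) for _ in range(1000)]
shifted_visits = [random.gauss(5, 2) for _ in range(1000)]

psi_stable = psi(reference_visits, stable_visits)
psi_shifted = psi(reference_visits, shifted_visits)
print(f"PSI, same distribution: {psi_stable:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as no meaningful shift, 0.1 to 0.25 as a moderate shift, and above 0.25 as a shift that warrants action.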


#### For Categorical Variables

- **Chi-squared test**  
  Compares observed category frequencies in the new data against expected frequencies from a baseline data set to detect significant differences.
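
The comparison of observed against expected frequencies can be sketched as follows. The category counts are hypothetical, and the helper `chi_squared_statistic` is an illustrative name. The sketch treats the baseline proportions as fixed expected frequencies (a goodness-of-fit test) and compares the statistic with the standard critical value for 2 degrees of freedom at the 0.05 level; a two-sample test of homogeneity, such as `scipy.stats.chi2_contingency`, is a common alternative.

```python
def chi_squared_statistic(reference_counts, current_counts):
    """Pearson chi-squared statistic of current counts against reference proportions."""
    ref_total = sum(reference_counts.values())
    cur_total = sum(current_counts.values())
    statistic = 0.0
    for category, ref_count in reference_counts.items():
        # Expected count if the current data followed the reference proportions.
        expected = cur_total * ref_count / ref_total
        observed = current_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical drink-preference counts for 1,000 customers in each period.
reference_counts = {"Coffee": 600, "Tea": 250, "Specialty": 150}
current_counts = {"Coffee": 450, "Tea": 250, "Specialty": 300}

# Critical value of the chi-squared distribution: 3 categories -> 2 degrees
# of freedom, significance level 0.05.
CRITICAL_VALUE_DF2_ALPHA_05 = 5.991

statistic = chi_squared_statistic(reference_counts, current_counts)
drift_detected = statistic > CRITICAL_VALUE_DF2_ALPHA_05
print(f"Chi-squared statistic: {statistic:.1f}, drift detected: {drift_detected}")
```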

In the example that follows, we will explore the concept of data drift.


---------------------------------

# Introduction to Data Drift for *Heritage Brew Collective*

Consider the case of a neighborhood coffee shop, *Heritage Brew Collective*. Over the prior year, management collected customer-level data to understand behavior and forecast loyalty or churn. That historical dataset serves as the **reference data** used to train a customer-prediction model.

This year, the model is being applied to new, incoming observations designated as the **current data**.  The central question is whether the customer base has changed materially year over year. If meaningful differences exist, model performance may degrade due to **data drift**.

## Feature Categories Under Review

- **Numerical features** — for example, the average number of visits per month.  
- **Categorical features** — for example, the preferred drink category (Coffee, Tea, Specialty).


## Drift-Detection Metrics

- **Population Stability Index (PSI)** for numerical variables.  
- **Chi-squared test** for categorical variables.


The following analysis determines whether the current customer population remains consistent with the historical baseline.

### Scenario setup

### Visual comparison

In [None]:
# --- 2. Visualizations to Compare Distributions ---

# --- Numerical Features ---




In [None]:
# --- Categorical Features ---



### Analysing drift 
#### Numerical data

In [None]:
# --- 2. Numerical Drift Detection: Population Stability Index (PSI) ---




In [None]:
# --- Numerical Features ---


#### Categorical data

In [None]:
# --- 3. Categorical Drift Detection: Chi-Squared Test ---




In [None]:
# --- Categorical Features ---


# Analysis

- **Average Visits Per Month (Numerical)**: The Population Stability Index (PSI) is **0.4402**, well above the 0.25 action threshold, indicating a meaningful shift toward **fewer** monthly visits in the current data. Possible drivers include increased local competition or a greater prevalence of remote work.

- **Favorite Drink Category (Categorical)**: The Chi-squared test produced a p-value that rounds to **0.0000**, far below the 0.05 significance level, confirming a statistically significant change in customer preferences. The categorical distribution shows a decline in **“Coffee”** orders and a rise in **“Specialty”** beverages. Menu positioning and promotional strategy should be adjusted accordingly.

- **Age and Loyalty-Program Membership**: Both variables exhibit **no significant drift**, suggesting that the customer demographic profile and loyalty engagement remain stable.


## Interpreting a “Global Drift Score” in Practice

A single composite metric risks masking critical feature-level issues. For example, if 99 features show no drift but one crucial feature drifts sharply, a simple average can make the overall score look fine, which is misleading. A practical overview instead combines:

1. **Feature-Level Drift Counts**  
   Drift was identified in **2 of 4 monitored features**, immediately highlighting that half of the key variables have materially changed.

2. **Impact on Model Performance**  
   Ultimately, the most consequential indicator is the performance of downstream models (e.g., churn-prediction accuracy). Abrupt degradation is a clear signal that the detected drift is affecting business outcomes. Feature-level diagnostics (PSI, Chi-squared) help to isolate the contributing variables.
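
The averaging pitfall described above can be made concrete with a few lines of arithmetic. The per-feature scores, the feature names, and the 0.25 alert threshold below are hypothetical values chosen only to mirror the "99 stable features, one drifted feature" example.

```python
# Hypothetical per-feature drift scores on a 0-1 scale:
# 99 stable features and one crucial feature with extreme drift.
drift_scores = {f"feature_{i}": 0.0 for i in range(99)}
drift_scores["crucial_feature"] = 1.0

DRIFT_THRESHOLD = 0.25  # assumed per-feature alert threshold

# A simple average hides the problem...
global_average = sum(drift_scores.values()) / len(drift_scores)

# ...whereas a feature-level count surfaces it.
drifted = [name for name, score in drift_scores.items() if score > DRIFT_THRESHOLD]

print(f"Global average score: {global_average:.2f}")
print(f"Drifted features: {len(drifted)} of {len(drift_scores)} -> {drifted}")
```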



### Summary

Significant drift is present in customer visit frequency and drink preferences, while age and loyalty metrics remain stable. Ongoing monitoring of the churn-prediction model is recommended, with consideration for retraining on the updated data—particularly to account for evolving drink-preference patterns.


# Evidently AI for Automated Drift Diagnostics

Earlier, we manually calculated the **Population Stability Index (PSI)** for numerical variables and the **Chi-squared test** for categorical variables. While this clarifies the underlying statistics, production workflows typically rely on specialised tooling for efficiency and reproducibility.

**Evidently AI** is an open-source Python library that streamlines the drift-evaluation process by providing:

- **Drift detection**: applies statistical tests (including PSI and Chi-squared) across selected features.  
- **Visualisation**: produces interactive plots that highlight distributional changes.  
- **Reporting**: compiles a comprehensive HTML report that can be viewed in-notebook or shared with stakeholders.

https://docs.evidentlyai.com/introduction

**Prerequisites**

To use Evidently AI, you'll need a Python environment with the library installed. If you haven't already, you can install it using pip:
```bash
pip install evidently
```

Evidently AI enables rapid, repeatable drift assessments with minimal code overhead.

*Next, we shall integrate Evidently AI into our workflow and review its results.*


In [None]:
# -------------------------------------------------
# 2. Map the schema
# -------------------------------------------------


Evidently selects the statistical test automatically. The choice depends on the column type, on how many distinct values each column contains, and on the size of the reference dataset.

| Column | Distinct categories (`n_unique`) | Default test Evidently applies | Reason |
|--------|----------------------------------|--------------------------------|--------|
| `Loyalty_Program_Member` | 2 (`Yes`, `No`) | Two-proportion z-test | When a feature is binary, Evidently compares two independent proportions and applies the z-test. |
| `Favorite_Drink_Category` | 3 (`Coffee`, `Tea`, `Specialty`) | Chi-squared test | For multi-class categorical variables (`n_unique` greater than 2) Evidently runs the Pearson chi-squared test to compare the full distribution across classes. |

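The rule is part of Evidently’s default drift-detection logic for tabular data. According to the Evidently documentation, for reference datasets of up to 1,000 rows:

- Binary categorical → proportion z-test  
- Categorical with more than two levels (or numerical with at most 5 unique values) → chi-squared test  
- Numerical with more than 5 unique values → two-sample KS test  

For larger reference datasets, the defaults switch to distance-based measures: Wasserstein distance for numerical columns and Jensen–Shannon divergence for categorical ones.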


In [None]:
# Save the Report to an HTML file ---


In [None]:
# Accessing Report Results Programmatically (for interpretation) ---
# You can also access the results of the report programmatically if needed.
# This can be useful for automated alerting or further analysis.


{'metrics': [{'id': '15e89f895b482f9b84ba7274ed18a106',
   'metric_id': 'DriftedColumnsCount(drift_share=0.5)',
   'value': {'count': 2.0, 'share': 0.5}},
  {'id': '23fa9953455b31fa1983292360fee686',
   'metric_id': 'ValueDrift(column=Average_Visits_Per_Month)',
   'value': 9.185171040635447e-122},
  {'id': '8f5d1c60a32d6fc1bd54bc53af61d8e8',
   'metric_id': 'ValueDrift(column=Age)',
   'value': 0.24068202486600215},
  {'id': 'a38739929e1f77c72ee0757c627b673c',
   'metric_id': 'ValueDrift(column=Favorite_Drink_Category)',
   'value': 3.529680030793938e-97},
  {'id': 'ca1f379701267b53d00b5547c04ac4fa',
   'metric_id': 'ValueDrift(column=Loyalty_Program_Member)',
   'value': 0.6652439504716796}],
 'tests': []}
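
For automated alerting, a dictionary of this shape can be filtered with plain Python. The sketch below copies the values from the output above (ids omitted) and assumes that each `ValueDrift` value is a p-value, which is consistent with the reported count of 2 drifted columns. The helper `drifted_columns` and the 0.05 threshold are illustrative choices and not part of Evidently's API.

```python
# Values copied from the report output above; the "id" fields are omitted.
report_results = {
    "metrics": [
        {"metric_id": "DriftedColumnsCount(drift_share=0.5)",
         "value": {"count": 2.0, "share": 0.5}},
        {"metric_id": "ValueDrift(column=Average_Visits_Per_Month)",
         "value": 9.185171040635447e-122},
        {"metric_id": "ValueDrift(column=Age)",
         "value": 0.24068202486600215},
        {"metric_id": "ValueDrift(column=Favorite_Drink_Category)",
         "value": 3.529680030793938e-97},
        {"metric_id": "ValueDrift(column=Loyalty_Program_Member)",
         "value": 0.6652439504716796},
    ],
    "tests": [],
}

P_VALUE_THRESHOLD = 0.05  # assumed significance level
PREFIX = "ValueDrift(column="


def drifted_columns(results, threshold=P_VALUE_THRESHOLD):
    """Return column names whose ValueDrift value falls below the threshold."""
    drifted = []
    for metric in results["metrics"]:
        metric_id = metric["metric_id"]
        if metric_id.startswith(PREFIX) and metric_id.endswith(")"):
            column = metric_id[len(PREFIX):-1]  # text between "column=" and ")"
            if metric["value"] < threshold:
                drifted.append(column)
    return drifted


columns_with_drift = drifted_columns(report_results)
print("Columns with drift:", columns_with_drift)
```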

# Understanding Concept Drift in Predictive Models

Having examined **data drift**, which concerns shifts in the characteristics of input data, we now turn to **concept drift**, an equally critical challenge for machine-learning models.

Consider a predictive model in a retail context, such as a coffee shop, designed to forecast **customer churn** (the cessation of customer engagement).

- **Data drift** occurs when the *distribution of input features* changes (for example, an overall reduction in average customer visits).  
- **Concept drift** arises when the *underlying relationship* between the input features and the target variable (churn) changes. In this situation the concept that the model initially learned to recognise is no longer valid.

Because concept drift changes the relationship between features and target, it directly impacts the model's ability to make correct predictions, even if the input data distribution itself has not changed. Detecting it relies primarily on monitoring the model's performance metrics on incoming data, which requires ground-truth labels for that data.

These metrics are derived from the **Confusion Matrix**, which summarises the performance of a classification model:
* **True Positives (TP):** Correctly predicted positive instances.
* **True Negatives (TN):** Correctly predicted negative instances.
* **False Positives (FP):** Negative instances incorrectly predicted as positive (Type I error).
* **False Negatives (FN):** Positive instances incorrectly predicted as negative (Type II error).

##### 1. Accuracy

* **Definition:** The proportion of total predictions that were correct. It measures how often the classifier is correct overall.
* **Formula:**
    $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$
* **Relevance to Concept Drift:** A significant drop in accuracy on new data, compared to baseline, indicates the model's overall correctness has degraded, suggesting concept drift.

##### 2. Precision

* **Definition:** Of all the instances predicted as positive, what proportion were actually positive. Answers: "When the model predicts positive, how often is it correct?"
* **Formula:**
    $$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
* **Relevance to Concept Drift:** A drop in precision suggests the model is making more false positive errors than before, which can be a symptom of concept drift, especially if the cost of false positives is high (e.g., incorrectly identifying a customer as churning and offering a costly retention incentive).

##### 3. Recall (Sensitivity or True Positive Rate)

* **Definition:** Of all the actual positive instances, what proportion did the model correctly identify. Answers: "Of all the actual positive cases, how many did the model 'recall' or find?"
* **Formula:**
    $$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
* **Relevance to Concept Drift:** A drop in recall suggests the model is missing more actual positive cases (false negatives) than before. This is critical for churn prediction, as missing actual churners means missed opportunities for retention. A sustained drop is a strong sign of concept drift.

##### 4. F1-Score

* **Definition:** The harmonic mean of Precision and Recall. It provides a single score that balances both precision and recall.
* **Formula:**
    $$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
* **Relevance to Concept Drift:** A drop in F1-Score indicates a degradation in the model's ability to balance false positives and false negatives, making it a robust overall indicator of concept drift, especially in imbalanced datasets.
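
The four formulas above translate directly into code. The sketch below uses hypothetical confusion-matrix counts for a reference period and a current period; the function name `classification_metrics` and all counts are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 1,000 customers in each period.
reference_metrics = classification_metrics(tp=40, tn=900, fp=10, fn=50)
current_metrics = classification_metrics(tp=20, tn=700, fp=80, fn=200)

for name in reference_metrics:
    drop = reference_metrics[name] - current_metrics[name]
    print(f"{name:9s} reference={reference_metrics[name]:.3f} "
          f"current={current_metrics[name]:.3f} drop={drop:.3f}")
```

Tracking the drop per metric, rather than accuracy alone, shows whether the degradation comes mainly from additional false positives or from additional false negatives.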

In the example that follows, we will explore concept drift on the scenario for *Heritage Brew Collective*.

## Shifting Loyalty Dynamics for *Heritage Brew Collective*

Previously, the model may have established a clear concept: customers with high visit frequency (for example, more than 10 visits per month) **and** participation in a loyalty programme were reliably classified as **non-churners**.

A fundamental change in market conditions could alter this relationship:

- **Scenario**: A new, highly competitive establishment enters the market, offering innovative products and an enhanced customer experience. Even historically frequent visitors and loyalty members might now explore alternatives.

In this evolving environment, a customer's high **`Average_Visits_Per_Month`** and **`Loyalty_Program_Member`** status may no longer predict non-churn behaviour. The **relationship** between these features and **`Churn`** has changed. Consequently, a model trained on the previous concept of loyalty would begin to generate inaccurate predictions.

This illustrates concept drift: the input data may appear stable, yet the underlying predictive rules have changed. Monitoring model performance is the primary means of detecting concept drift, but with an interpretable model such as logistic regression we can also observe how the **coefficients** (the learned relationships) themselves change, offering direct insight into the nature of the drift.

The following analysis will simulate such a scenario, demonstrating the impact on a churn-prediction model and, crucially, the shift in its learned coefficients.
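
Before the full simulation, the idea of a coefficient shift can be illustrated on a toy dataset. To stay dependency-free, the sketch below fits a one-feature logistic model with a hand-written gradient descent instead of the notebook's own model. The visit values, the churn labels, and the helper names `fit_logistic` and `standardise` are all hypothetical.

```python
import math
import statistics


def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit churn ~ sigmoid(w * x + b) by plain gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted churn probability
            grad_w += (p - y) * x / n
            grad_b += (p - y) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def standardise(values):
    """Rescale to zero mean and unit variance so coefficients are comparable."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Toy data: 16 customers with 1 to 16 visits per month (hypothetical).
visits = list(range(1, 17))
x = standardise(visits)

# Reference concept: only infrequent visitors churn.
reference_churn = [1 if v <= 4 else 0 for v in visits]
# Current concept: churn no longer depends on visit frequency
# (labels simply alternate across the visit range).
current_churn = [v % 2 for v in visits]

w_reference, _ = fit_logistic(x, reference_churn)
w_current, _ = fit_logistic(x, current_churn)
print(f"Visit coefficient on reference data: {w_reference:.2f}")
print(f"Visit coefficient on current data:   {w_current:.2f}")
```

On the reference labels the visit coefficient is strongly negative (more visits, less churn), whereas on the current labels it shrinks towards zero, even though the input feature is identical in both fits.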

--- Reference Data ---


Unnamed: 0,customer_id,Loyalty_Program_Member,Age,Average_Visits_Per_Month,Spend_Per_Visit,Churn
0,0,Yes,64,8.0,5.13,0
1,1,No,29,9.7,3.79,0
2,2,No,33,8.2,3.75,0
3,3,Yes,41,7.8,5.78,0
4,4,Yes,36,9.8,5.63,0



Reference Churn Rate: 0.02

--- Current Data ---


Unnamed: 0,customer_id,Loyalty_Program_Member,Age,Average_Visits_Per_Month,Spend_Per_Visit,Churn
0,0,Yes,51,4.8,7.71,0
1,1,No,23,1.5,1.42,1
2,2,No,39,13.6,3.62,0
3,3,Yes,31,2.1,2.4,0
4,4,Yes,20,9.8,4.11,1



Current Churn Rate: 0.35


In [None]:
# --- 2. Preprocessing for Model Training ---



In [None]:
# --- 4. Evaluate Model Performance 





In [None]:
print("\n--- Visualising Logistic Regression Coefficients ---")



In [None]:
# Evaluate on Current Data (Expected Performance Drop due to Concept Drift)


In [None]:
# --- 5. Visualize Performance Degradation (Concept Drift Indicator) ---



In [None]:
# --- 4. Generate Predictions for Evidently AI ---

# Get predictions and probabilities from the trained model for both datasets


### Data Drift Analysis

In [None]:
# -------------------------------------------------
# 2. Map the schema (mandatory in the new API)
# -------------------------------------------------


## Take Away

This analysis with Evidently AI has demonstrated a key aspect of **concept drift**. We trained a model on the *reference* data, where customer loyalty and visit frequency were strong predictors of churn. When the same model was applied to the *current* data—where that relationship had weakened—we observed a clear decline in performance metrics such as **accuracy** and **ROC AUC**.

The visible degradation in model performance is a direct indicator of concept drift. The model’s learned *concept* of churn is no longer a good fit for present-day behaviour, making its predictions less reliable.

## Next Steps

Having confirmed concept drift (via performance decline) and data drift (through the earlier analysis), you can consider the following actions:

1. **Retrain the model**  
   The most common remedy is to retrain using the current data so the model can learn the updated relationship between features and churn. Retraining periodically can help the model adapt to a continually shifting concept.

2. **Explore a different model**  
   An alternative architecture may be more robust to changes in the underlying concept. You could evaluate a more flexible algorithm, such as Gradient Boosting Machines or Neural Networks, to determine whether they maintain performance over time.


## Outliers

An **outlier** is a data point that differs markedly from other observations in a data set. Outliers may arise for several reasons, including:

- **Measurement errors**: mistakes made during data collection  
- **Data-entry errors**: typographical errors when values are recorded  
- **True anomalies**: rare yet legitimate observations that genuinely deviate from the general pattern  

Outliers can significantly affect machine-learning models, particularly linear models such as logistic regression. They may distort learnt relationships and lead to poorer overall performance. Identifying and handling them appropriately is therefore crucial.

### Main Methods to Detect Outliers

Various techniques exist for detecting outliers. Below are two straightforward statistical approaches:

1. **Z-score method**  
   This measures how many standard deviations a data point lies from the mean. A common rule of thumb is to flag any point with a Z-score greater than 3 or less than -3 as an outlier.

2. **Interquartile range (IQR) method**  
   This defines outliers as values falling below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 − Q1. It is especially useful for data that are not normally distributed.
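
Both rules can be sketched with the standard library alone. The visit counts below are hypothetical, and the function names `zscore_outliers` and `iqr_outliers` are illustrative. The quartiles come from `statistics.quantiles` with its default method, so other quartile conventions may give slightly different fences.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Indices of points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Indices of points outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical monthly visit counts; the final value is an injected outlier.
visits = [8, 9, 7, 10, 8, 9, 11, 7, 8, 10,
          9, 8, 7, 9, 10, 8, 9, 11, 8, 9, 60]

print("Z-score outlier indices:", zscore_outliers(visits))
print("IQR outlier indices:", iqr_outliers(visits))
```

Note that the Z-score rule needs a reasonably large sample: because the outlier itself inflates the mean and standard deviation, very small samples can never produce a Z-score above 3.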


In the example that follows, we will explore this concept by manually inserting a few outliers and then locating them using visualisations and the Z-score method.

In [None]:
# --- 3. Visualize the Outliers ---
# A box plot is an excellent tool for visualizing outliers in a single feature.
# A scatter plot helps visualize outliers in the context of two features.



In [None]:
# Scatter plot for two features


In [None]:
# --- 4. Outlier Detection using the Z-score Method ---
# This function now returns the indices of the outliers.



--- Outliers Detected by Z-score Method ---
Indices of detected outliers: [596, 284, 685, 551]


### Key Strategies for Handling Outliers

1. **Removal**  
   Delete clearly erroneous values (e.g., an age of 200). Suitable when the outlier stems from data-entry or measurement errors. Remove sparingly to avoid excessive data loss.

2. **Imputation / Capping**  
   Retain the record but limit its influence, for example by replacing values above the 99th percentile with the 99th-percentile value (Winsorisation) or by substituting the median. Use when the point is genuine yet its magnitude distorts the model.

3. **Transformation**  
   Apply a mathematical function, such as a logarithm or square root, to compress extreme values. Ideal for naturally skewed distributions (e.g., income) where outliers are inherent.

4. **Separate Modelling**  
   Treat outliers as a distinct class when they represent events of interest, such as fraud or rare faults. Build a dedicated detection model rather than removing or altering these observations.

There is no universal remedy; the optimal technique depends on the data characteristics and the business objective.
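
The first three strategies can be sketched as small helper functions. The names `remove_outliers`, `cap_with_median`, and `log_transform`, as well as the spend values, are hypothetical and only illustrate the ideas; they are not the functions defined later in this notebook.

```python
import math
import statistics


def remove_outliers(values, outlier_indices):
    """Removal: drop the flagged observations entirely."""
    flagged = set(outlier_indices)
    return [v for i, v in enumerate(values) if i not in flagged]


def cap_with_median(values, outlier_indices):
    """Capping: replace flagged values with the median of the non-flagged values."""
    flagged = set(outlier_indices)
    median = statistics.median(v for i, v in enumerate(values) if i not in flagged)
    return [median if i in flagged else v for i, v in enumerate(values)]


def log_transform(values):
    """Transformation: log(1 + x) compresses large non-negative values."""
    return [math.log1p(v) for v in values]


# Hypothetical spend-per-visit values; index 6 is an obvious outlier.
spend = [5.1, 3.8, 3.7, 5.8, 5.6, 4.9, 250.0]
outlier_indices = [6]

removed = remove_outliers(spend, outlier_indices)
capped = cap_with_median(spend, outlier_indices)
transformed = log_transform(spend)

print("Rows after removal:", len(removed))
print("Rows after capping:", len(capped), "capped value:", capped[6])
print("Largest value after log transform:", round(max(transformed), 3))
```

Removal shortens the dataset, whereas capping and transformation keep every row, which matters when other columns of the same record are still informative.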


In [None]:
# --- 5. Handling Outliers with Functions ---



In [None]:
# Demonstrate the new functions




--- Data after Outlier Removal ---
Original number of rows: 1000
New number of rows: 996

--- Data after Outlier Capping (with Median) ---
Original number of rows: 1000
New number of rows: 1000


In [None]:
# --- 3. Visualize the Outliers ---
# A box plot is an excellent tool for visualizing outliers in a single feature.
# A scatter plot helps visualize outliers in the context of two features.




## Conclusion

This notebook has shown how to identify and handle outliers. By visualising the data with box plots and scatter plots, we could clearly see those observations that lay far from the main cluster. The **Z-score method** then provided a simple statistical approach to detect the same outliers programmatically.

We have also created functions to either **remove** the outliers entirely or **cap** them at a less extreme value (for example, the mean, median, or maximum of the non-outlier data). These options give you flexible ways to prepare your data.

## Next Steps

With tools in place to detect and handle outliers, here are two avenues to explore:

1. **Evaluate the impact**  
   Use the cleaned data (either the removed-outlier or capped-outlier version) to train the churn-prediction model and compare performance. This will reveal whether addressing outliers actually improves accuracy.

2. **Compare handling methods**  
   Train one model on data with removed outliers and another on data with capped outliers, then compare their performance. This helps determine which strategy is best for your specific dataset.

