### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., gender) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test to compare the baseline and production distributions of the categorical feature.
4. If significant drift is detected, investigate the cause and update the model as needed.
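Steps 1 to 3 assume that per-category counts are already available. When starting from raw records instead, the counts can be derived with `value_counts`. The following is a minimal sketch of that, not a definitive implementation: the function name `categorical_drift_test` and the series `train_gender` and `prod_gender` are hypothetical, and the data is made up for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def categorical_drift_test(baseline, production, alpha=0.05):
    """Chi-squared test comparing two samples of one categorical feature.

    Hypothetical helper for illustration; both inputs are pandas Series
    of raw category labels.
    """
    # Count each category in both samples. A category that is missing
    # from one sample gets a count of 0 there.
    counts = pd.concat(
        [baseline.value_counts(), production.value_counts()],
        axis=1,
        keys=["baseline", "production"],
    ).fillna(0)

    # Contingency table: one row per sample, one column per category.
    chi2, p, dof, _ = chi2_contingency(counts.T.to_numpy())
    return {
        "chi2": float(chi2),
        "p_value": float(p),
        "dof": int(dof),
        "drift": bool(p < alpha),
    }


# Made-up raw records: 60 F / 40 M in training, 50 F / 50 M in production.
train_gender = pd.Series(["F"] * 60 + ["M"] * 40)
prod_gender = pd.Series(["F"] * 50 + ["M"] * 50)

result = categorical_drift_test(train_gender, prod_gender)
print(result)
```

Returning a dictionary rather than printing keeps the check easy to reuse in a scheduled monitoring job, where the `drift` flag could trigger the investigation described in step 4.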

In [1]:
import pandas as pd
from scipy.stats import chi2_contingency

# Example baseline distribution (training data)
baseline_data = pd.DataFrame({'Gender': ['F', 'M'], 'Count': [60, 40]})

# Example production data (current data)
production_data = pd.DataFrame({'Gender': ['F', 'M'], 'Count': [50, 50]})

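# Merge the two distributions on the categorical feature.
# An outer join keeps categories that appear in only one dataset (count 0 in the other);
# an inner join would silently drop them, hiding that form of drift.
merged_data = pd.merge(baseline_data, production_data, on='Gender', how='outer', suffixes=('_baseline', '_production')).fillna(0)

# Perform chi-squared test on the contingency table (rows: baseline, production).
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
chi2, p, dof, expected = chi2_contingency([merged_data['Count_baseline'], merged_data['Count_production']])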

# Check for significant drift
alpha = 0.05
if p < alpha:
    print("Significant drift detected (p-value:", p, ")")
else:
    print("No significant drift detected (p-value:", p, ")")

No significant drift detected (p-value: 0.20082512269514174 )
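The p-value above reflects the sample size as well as the size of the shift. As a rough illustration, the sketch below reruns the same test with the same proportions (60/40 versus 50/50) but with hypothetical counts ten times larger. The variable names and the scaled counts are assumptions made for this example only.

```python
from scipy.stats import chi2_contingency

# Same counts as in the example: 100 observations per dataset.
small_table = [[60, 40], [50, 50]]

# Hypothetical counts with identical proportions, 1000 observations per dataset.
large_table = [[600, 400], [500, 500]]

_, p_small, _, _ = chi2_contingency(small_table)
_, p_large, _, _ = chi2_contingency(large_table)

alpha = 0.05
print("100 per dataset: drift detected =", p_small < alpha)
print("1000 per dataset: drift detected =", p_large < alpha)
```

The same ten-point shift in proportions goes undetected with 100 observations per dataset but is flagged with 1000, so a "no significant drift" result on a small production sample may be worth rechecking once more data has accumulated.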
