# Machine-Learning-based Methods for Drift Detection

### ROC Approach for Drift Detection
The Receiver Operating Characteristic (ROC) approach to drift detection trains a classifier to distinguish data from the reference period from data from the target period. If the classifier separates the two well (a ROC AUC score well above 0.5), the feature distributions differ, which indicates drift.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Load the dataset
housing_data = pd.read_csv('housing.csv')

# Drop the 'ocean_proximity' and 'median_house_value' columns
housing_data = housing_data.drop(columns=['ocean_proximity', 'median_house_value'])

housing_data

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 |


In [2]:
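# Split the data into two periods by row order
# (the dataset has no timestamp column, so the first half of the rows
# stands in for the reference period and the second half for the target period)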
split_index = len(housing_data) // 2
data_period_1 = housing_data[:split_index].copy()
data_period_2 = housing_data[split_index:].copy()

In [3]:
# Create a new column 'period' to label the data
data_period_1['period'] = 0
data_period_2['period'] = 1

# Combine the datasets
combined_data = pd.concat([data_period_1, data_period_2])

# Separate features and target
X = combined_data.drop(columns='period')
y = combined_data['period']

In [4]:
from sklearn.impute import SimpleImputer

# Impute missing values
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.3, random_state=42)

# Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict probabilities
y_pred_proba = clf.predict_proba(X_test)[:, 1]

# Calculate the ROC AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC AUC Score: {roc_auc}")


ROC AUC Score: 0.9961335356664237


### Analysis
The idea is that if a classifier can easily tell the two subsets apart, their feature distributions differ, i.e. there is drift.
- The performance of the classifier (measured by the ROC AUC score) indicates the extent of the drift.
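
To make the metric concrete: ROC AUC equals the probability that a randomly chosen period-1 sample receives a higher predicted score than a randomly chosen period-0 sample, with ties counted as half. The following is a minimal pure-Python sketch of that definition. `pairwise_auc` is a hypothetical helper written for illustration (it is not part of scikit-learn), and the scores are made up.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Hypothetical illustration helper; ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample from each period")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up scores for three period-0 and three period-1 samples.
labels = [0, 0, 0, 1, 1, 1]
separable = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
uninformative = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

print(pairwise_auc(labels, separable))      # 1.0: periods perfectly separable
print(pairwise_auc(labels, uninformative))  # 0.5: no better than chance
```

The quadratic loop is only meant to mirror the definition; on real data `roc_auc_score` computes the same quantity far more efficiently.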

**What does it mean?**

    around 0.5: No drift (the classifier does no better than random guessing)
    0.5 - 0.7: Small drift
    0.7 - 0.9: Moderate drift
    greater than 0.9: Significant drift

A score of **0.9961** suggests that the classifier can almost perfectly distinguish between the two periods. This implies that there has been a significant change in the distribution of the features between the two periods.
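
The bands above are a rule of thumb, and they can be turned into a small helper for use in monitoring code. This is only a sketch: `interpret_auc` is a hypothetical function name, and the handling of the exact boundary values (0.5, 0.7, 0.9) is an assumption, since the bands do not say which side they belong to.

```python
def interpret_auc(auc):
    """Map a ROC AUC score to the drift bands listed above (rule of thumb).

    Assumption: scores at or below 0.5 are treated as "no drift", and each
    boundary value (0.7, 0.9) is assigned to the higher band.
    """
    if not 0.0 <= auc <= 1.0:
        raise ValueError("ROC AUC must lie in [0, 1]")
    if auc <= 0.5:
        return "no drift"
    if auc < 0.7:
        return "small drift"
    if auc < 0.9:
        return "moderate drift"
    return "significant drift"


print(interpret_auc(0.9961335356664237))  # significant drift
```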

**Reasons for high ROC AUC Score:**
1. **Significant Distribution Shift**: The distributions of one or more features differ substantially between the two periods.
2. **Strong Feature Changes**: Specific features might differ so strongly between the periods that the two are easily distinguishable. Note that `median_house_value` cannot be one of them, because it was dropped before training.


### Further Analysis
To better understand which features contribute most to the drift, we can examine the feature importances from the trained <code>**RandomForestClassifier**</code>.

In [5]:
# Get feature importances from the classifier
feature_importances = clf.feature_importances_

# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': feature_importances
})

# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)

# Display the feature importances
feature_importance_df


| | feature | importance |
|---|---|---|
| 0 | longitude | 0.399874 |
| 1 | latitude | 0.318976 |
| 2 | housing_median_age | 0.068858 |
| 7 | median_income | 0.050403 |
| 3 | total_rooms | 0.043032 |
| 5 | population | 0.041109 |
| 6 | households | 0.039837 |
| 4 | total_bedrooms | 0.037912 |


### Interpretation

**Feature Importance Interpretation**

**1. longitude (0.399874):**
- **Importance**: 39.99%
- **Interpretation**: Longitude (east-west position) is the most important feature for telling the two periods apart. Note that the target variable of this classifier is the `period` label, not the house price, so a high importance means that the longitude distribution differs strongly between the two periods.

**2. latitude (0.318976):**
- **Importance**: 31.90%
- **Interpretation**: Latitude (north-south position) is also highly important, only somewhat less so than longitude. Together, longitude and latitude account for about 72% of the total importance, so the detected drift is mainly a shift in geographic location.

**3. housing_median_age (0.068858):**
- **Importance**: 6.89%
- **Interpretation**: The median age of houses in the area is the third most important feature, but it lies far behind the two location features. The age of the housing stock contributes only a little to separating the two periods.

**Key Point:**
- **Geographical Location** is the main driver of the detected drift. Because the data was split by row order, this suggests that the row order of the dataset is correlated with location, so the two "periods" largely cover different regions.
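
One way to check how the location features differ between the periods is to summarise them per period. The sketch below is illustrative only: it uses just the nine rows visible in the DataFrame preview earlier in the notebook (the first five rows fall in the first half, the last four displayed rows in the second half), so it hints at the pattern rather than proving it. The name `sample` is introduced here for the example.

```python
import pandas as pd

# Rows copied from the DataFrame preview: the first five rows belong to
# period 0 and the last four displayed rows belong to period 1.
sample = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24, -122.25, -122.25,
                  -121.09, -121.21, -121.22, -121.32],
    "latitude": [37.88, 37.86, 37.85, 37.85, 37.85,
                 39.48, 39.49, 39.43, 39.43],
    "period": [0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# Per-period range of the two most important features.
summary = sample.groupby("period")[["longitude", "latitude"]].agg(["min", "max"])
print(summary)
```

On the full data, the same idea could be applied to `combined_data`, for example with `combined_data.groupby('period')[['longitude', 'latitude']].describe()`.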

### How does the Feature Importance Examination work?
During training, the model records how much each feature contributes to separating the classes. In a Random Forest, this is measured by how much the splits on each feature reduce node impurity.

#### Calculation in Tree-based Models
Feature importance is often calculated from the impurity reduction (e.g., in Gini impurity or entropy) achieved by the splits on each feature, accumulated across all the trees in the forest. The importance of a feature is computed as the (normalized) total reduction in the criterion brought by that feature.

#### Gini Impurity/Entropy Reduction
When a decision tree splits a node based on a feature, it measures how much the split reduces impurity. The more a feature reduces impurity, the more important it is.
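
The impurity reduction of a single split can be worked through by hand. The sketch below computes the Gini impurity of a toy node and the decrease achieved by two candidate splits. The function names and the toy labels are made up for illustration. A forest additionally weights each decrease by the share of samples reaching the node, sums the decreases per feature, and normalizes the result, which this sketch leaves out.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Toy node: four samples from period 0 and four from period 1.
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the periods perfectly removes all impurity.
print(impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1]))  # 0.5

# A split that leaves both children as mixed as the parent removes none.
print(impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0
```

A feature whose splits look like the first case in many trees ends up with a high importance; a feature whose splits look like the second case contributes almost nothing.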

### Next Steps
- **Feature Engineering**: Continuously update and refine feature engineering processes to incorporate new patterns and relationships as they emerge. This might involve adding new features or transforming existing ones to better capture the dynamics of the housing market.