<a href="https://colab.research.google.com/github/calicartels/XAI--Explainable-Techniques-II/blob/main/Explainable_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explainable Techniques II


In [None]:

!pip install numpy==1.25.2 pandas==2.0.3 scikit-learn==1.2.2 shap==0.45.1
!pip install git+https://github.com/MaximeJumelle/ALEPython.git@dev#egg=alepython

In [None]:

# Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Models
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor


# XAI
import shap
from alepython import ale_plot
from sklearn.inspection import PartialDependenceDisplay
from sklearn.inspection import permutation_importance

np.random.seed(1)

In [None]:
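# Correlation analysis on a subset of the California housing features
import seaborn as sns  # needed for the heatmap below

# Load the data here so this cell runs on its own (the same call is reused for modeling below).
# A separate name avoids overwriting the full feature matrix X.
X_corr, _ = shap.datasets.california(n_points=1000)
X_corr = X_corr[[
    'MedInc', 'HouseAge', 'AveRooms', 'AveOccup',
    'Latitude', 'Longitude'
]]

plt.figure(figsize=(10, 6))
correlation_matrix = X_corr.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()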

The correlation matrix reveals several important relationships:
- Strong negative correlation (-0.92) between Latitude and Longitude
- Weak positive correlation (0.27) between MedInc and AveRooms
- Most other feature pairs are weakly correlated (|r| < 0.15)
- Because Latitude and Longitude are so strongly correlated, their individual effects are hard to separate, so single-feature explanations for them should be read with caution
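
Reading strongly correlated pairs off a heatmap works for six features but gets tedious for more. As a minimal sketch, the helper below lists every feature pair whose absolute correlation exceeds a threshold. The function name `correlated_pairs`, the 0.8 threshold, and the synthetic stand-in data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic stand-in data (NOT the California housing values):
# two strongly anti-correlated columns plus one independent column.
rng = np.random.default_rng(0)
lat = rng.normal(size=500)
demo = pd.DataFrame({
    "Latitude": lat,
    "Longitude": -lat + 0.3 * rng.normal(size=500),
    "MedInc": rng.normal(size=500),
})

flagged = correlated_pairs(demo, threshold=0.8)
print(flagged)
```

On the notebook's data, the same helper could be called on the feature DataFrame used for the heatmap.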

In [None]:

n_points = 1000
X, y = shap.datasets.california(n_points=n_points)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



In [None]:
# Training RF
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)  # Use raw features for training



In [None]:
# Partial Dependence Plots (PDP)
print("\nGenerating Partial Dependence Plots (PDP)...")
features_to_plot = [0, 5]  # Column indices in X_train: 0 = 'MedInc', 5 = 'AveOccup'
PartialDependenceDisplay.from_estimator(
    model,
    X_train,
    features=features_to_plot,
    grid_resolution=50
)
plt.show()



- MedInc: strong positive relationship with the target variable
  * Almost monotonic increase
  * Steeper slope after a MedInc value of 6
  * Suggests that higher median income strongly predicts higher house values

- AveOccup: negative relationship with the target
  * Steep decline initially
  * Levels off after an AveOccup value of ~3
  * Suggests that the model predicts lower house values where average occupancy is higher
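
To make explicit what a PDP curve averages, here is a minimal from-scratch sketch: for each grid value, the chosen feature is overwritten in every row and the predictions are averaged. The names `partial_dependence_1d` and `toy_predict` are hypothetical, and the toy model is an assumption chosen only to mimic a decline that levels off. It is not the fitted random forest.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

# Hypothetical toy model: feature 0 raises the prediction linearly,
# feature 1 lowers it until it saturates at a value of 3.
def toy_predict(X):
    return 2.0 * X[:, 0] - 1.5 * np.minimum(X[:, 1], 3.0)

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 6.0, size=(200, 2))
grid = np.linspace(0.0, 6.0, 7)  # 0, 1, ..., 6

pd_curve = partial_dependence_1d(toy_predict, X_demo, feature=1, grid=grid)
print(pd_curve - pd_curve[0])  # change relative to the first grid point
```

In this toy curve the average prediction drops by 1.5 per unit up to a grid value of 3 and is flat afterwards. Note that every row is evaluated at every grid value, whether or not that combination of feature values occurs in the data.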

In [None]:
# SHAP values
print("\nGenerating SHAP values...")
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test)

# SHAP summary
shap.summary_plot(shap_values, X_test)
plt.show()

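- MedInc has the highest impact on predictions
- A wide horizontal spread of SHAP values means a feature's contribution varies strongly across samples, which may reflect nonlinear effects or interactions

The color gradient shows how feature values relate to their impact: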

- Higher MedInc (pink) → higher positive impact
- Higher AveOccup (pink) → more negative impact, while lower AveOccup (blue) pushes predictions up, consistent with the negative PDP trend
- Geographic features (Latitude/Longitude) show clustered effects
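
For intuition about what each point in the summary plot represents, the sketch below computes exact Shapley values for one row of a tiny toy model by enumerating all feature coalitions. Features outside a coalition are filled in from a background sample. The names `exact_shapley` and `toy_predict` are hypothetical. Enumeration like this is only feasible for a handful of features, and it is not how the `shap` library computes values for tree models.

```python
import itertools
import math
import numpy as np

def exact_shapley(predict, x, background):
    """Exact Shapley values for one row `x`, averaging over a background sample."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the values from x; the rest keep background values.
        X_mix = background.copy()
        cols = list(subset)
        if cols:
            X_mix[:, cols] = x[cols]
        return predict(X_mix).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in itertools.combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

# Hypothetical toy model with one additive feature and one interaction
def toy_predict(X):
    return 3.0 * X[:, 0] + X[:, 1] * X[:, 2]

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, -1.0])

phi = exact_shapley(toy_predict, x, background)
base_value = toy_predict(background).mean()
prediction = toy_predict(x[None, :])[0]
print(phi, base_value + phi.sum(), prediction)
```

The contributions add up to the difference between the prediction for `x` and the average prediction over the background sample, which is the additivity property that lets SHAP values be read as per-feature pushes above or below a baseline.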

**Key insights and recommendations**

Feature Importance:


- MedInc is clearly the most influential feature
- Geographic location (Latitude/Longitude) has significant but complex effects
- AveOccup has a moderate negative impact
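
A ranking like this can be cross-checked with permutation importance, which asks how much the error grows when one feature's values are shuffled. The sketch below is a minimal from-scratch version using mean squared error. The names `permutation_importance_mse` and `toy_predict` and the synthetic data are illustrative assumptions, not results from the notebook's model.

```python
import numpy as np

def permutation_importance_mse(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column is shuffled independently."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Hypothetical toy model: feature 0 matters most, feature 2 is ignored
def toy_predict(X):
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
y_demo = toy_predict(X_demo) + 0.1 * rng.normal(size=500)

scores = permutation_importance_mse(toy_predict, X_demo, y_demo)
ranking = np.argsort(-scores)  # most important feature first
print(scores, ranking)
```

Feature 2 scores exactly zero here because the toy model never reads it. In the notebook itself, the already imported `permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)` from scikit-learn plays the same role and exposes the averaged scores as `importances_mean`.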


Model Interpretation Considerations:


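- The strong correlation between Latitude and Longitude means their individual PDPs should be interpreted with caution: a PDP varies one of them while the other keeps its observed values, so the model is evaluated on coordinate combinations that rarely or never occur in the data

- The PDPs are smooth and move in plausible directions (higher income raises predictions, higher occupancy lowers them), which suggests the model is capturing reasonable relationships rather than noise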
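
Accumulated Local Effects (ALE) are a common alternative to PDPs when features are correlated, because each local effect is computed only from rows that actually fall in the corresponding bin. The notebook already imports `ale_plot` from `alepython` for this. The sketch below is a simplified first-order ALE written from scratch to show the idea. The name `ale_1d`, the toy model, and the approximate centering step are assumptions, not `alepython`'s implementation.

```python
import numpy as np

def ale_1d(predict, X, feature, n_bins=10):
    """Simplified first-order ALE for one feature, using quantile bins."""
    X = np.asarray(X, dtype=float)
    x = X[:, feature]
    # Quantile-based edges so each interval holds a similar number of rows
    edges = np.unique(np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)))
    n_intervals = len(edges) - 1
    # Interval index per row: (edges[k], edges[k+1]], with the minimum in interval 0
    idx = np.clip(np.searchsorted(edges, x, side="left") - 1, 0, n_intervals - 1)

    effects = np.zeros(n_intervals)
    for k in range(n_intervals):
        rows = X[idx == k]
        if len(rows) == 0:
            continue  # empty interval contributes no local effect
        lower, upper = rows.copy(), rows.copy()
        lower[:, feature] = edges[k]
        upper[:, feature] = edges[k + 1]
        # Local effect: prediction change across the interval, other features untouched
        effects[k] = np.mean(predict(upper) - predict(lower))

    ale = np.concatenate([[0.0], np.cumsum(effects)])
    ale -= ale[idx + 1].mean()  # approximate centering around the data average
    return edges, ale

# Hypothetical toy model and two strongly anti-correlated features
def toy_predict(X):
    return 2.0 * X[:, 0] + 3.0 * X[:, 1]

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
X_demo = np.column_stack([x0, -x0 + 0.3 * rng.normal(size=500)])

edges, ale = ale_1d(toy_predict, X_demo, feature=0, n_bins=10)
print(np.diff(ale) / np.diff(edges))  # recovered slope per interval
```

For this linear toy model the recovered slope is 2 in every interval, matching the coefficient of feature 0 even though the second feature is strongly correlated with it.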