### eXplainable AI - XAI
- XAI (Explainable AI): answers "why did the model make this specific prediction?"
- Supports better decision-making by revealing influential factors.
- Improves trust in the model.


#### LIME
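- LIME (Local Interpretable Model-agnostic Explanations) fits a simple surrogate model around a single instance to approximate the model's local behavior
- Perturbs the inputs slightly → observes how the outputs change → identifies the key features
- Shows which features mattered most for that prediction
- Highlights the magnitude & direction of each feature's contribution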
- Practical for human-understandable explanations of feature influence
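
The perturb, observe, and fit loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the `lime` library's implementation: `lime_sketch`, `black_box`, the Gaussian perturbation, and the plain weighted least-squares fit are all assumptions chosen for clarity (the library, for example, discretizes numeric features by default).

```python
import numpy as np


def lime_sketch(predict_fn, instance, scale, n_samples=2000, seed=0):
    """Approximate predict_fn near `instance` with a weighted linear surrogate."""
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)
    scale = np.asarray(scale, dtype=float)

    # 1. Perturb: sample points around the instance.
    noise = rng.normal(size=(n_samples, instance.size))
    samples = instance + noise * scale

    # 2. Observe: query the black-box model on the perturbed points.
    predictions = predict_fn(samples)

    # 3. Weight each sample by its proximity to the instance (exponential kernel).
    distances = np.linalg.norm(noise, axis=1)  # distance in units of `scale`
    kernel_width = 0.75 * np.sqrt(instance.size)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # 4. Fit: weighted least squares with an intercept column.
    design = np.column_stack([np.ones(n_samples), samples - instance])
    sqrt_w = np.sqrt(weights)
    solution, *_ = np.linalg.lstsq(
        design * sqrt_w[:, None], predictions * sqrt_w, rcond=None
    )
    return solution[1:]  # one local coefficient per feature


def black_box(X):
    # Hypothetical model: feature 0 raises the output, feature 1 lowers it,
    # and feature 2 is ignored.
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.0 * X[:, 2]


coefs = lime_sketch(black_box, instance=[1.0, 2.0, 3.0], scale=[1.0, 1.0, 1.0])
for name, coef in zip(["f0", "f1", "f2"], coefs):
    print(f"{name}: {coef:+.2f}")
```

The sign of each coefficient gives the direction of a feature's local influence and its absolute value gives the magnitude, which is the information LIME reports per feature.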


#### Regression


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestRegressor

# 1. Load and prepare dataset (Car Price Prediction)
df = pd.read_csv('primary_features_boolean_converted_final.csv')
df = df.dropna(subset=['price(Georgian Lari)'])  # Remove rows with missing target

# Feature engineering - create meaningful features for regression
df['vehicle_age'] = 2024 - df['product_year']  # Age affects value and condition
df['luxury_score'] = df[['engine_volume', 'cylinders', 'airbags']].sum(axis=1)  # Luxury indicators
df['safety_score'] = df[['airbags', 'ABS', 'ESP', 'Central Locking', 'Alarm System']].sum(axis=1)  # Safety features

# Select numerical features for regression
numerical_features = ['vehicle_age', 'luxury_score', 'safety_score', 'mileage', 'engine_volume']
X = df[numerical_features].fillna(0)  # Fill missing values with 0
y = df['price(Georgian Lari)']  # Target: continuous price

feature_names = numerical_features

# Take only 1000 random samples for faster computation
np.random.seed(42)
indices = np.random.choice(len(X), min(1000, len(X)), replace=False)
X = X.iloc[indices] if isinstance(X, pd.DataFrame) else X[indices]
y = y.iloc[indices] if isinstance(y, pd.Series) else y[indices]

# Convert to numpy arrays for compatibility
X = np.array(X)
y = np.array(y)

# 2. Train a regression model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# 3. Initialize LIME explainer
explainer = LimeTabularExplainer(
    training_data=X_train,
    feature_names=feature_names,
    mode="regression",
    random_state=42  # fix LIME's random sampling so the explanation is reproducible
)

# 4. Pick a test instance
i = 6
exp = explainer.explain_instance(
    data_row=X_test[i],
    predict_fn=model.predict
)

# 5. Get explanation as table
explanation_list = exp.as_list()
df_explanation = pd.DataFrame(explanation_list, columns=["Feature", "Contribution"])

print("True value:", y_test[i])
print("Predicted value:", model.predict([X_test[i]])[0])
print(df_explanation)


#### SHAP
- Explains predictions by assigning each feature a contribution value
- Based on Shapley values from cooperative game theory, which fairly distribute the "payout" (the prediction) among all features
- Provides both local explanations for individual predictions and global explanations through aggregated feature importance.
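
To make the "fair payout" idea concrete, the sketch below computes exact Shapley values by enumerating every feature coalition, then aggregates them into a global importance score. It is an illustration under stated assumptions, not how the `shap` library computes values in practice: `exact_shapley` and `toy_model` are hypothetical names, absent features are filled in from a random background sample, and the cost grows as 2^n in the number of features.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict_fn, x, background):
    """Exact Shapley values for one instance by enumerating all coalitions."""
    x = np.asarray(x, dtype=float)
    background = np.asarray(background, dtype=float)
    n = x.size

    def coalition_value(subset):
        # Features in `subset` take the instance's values; the others keep
        # their background values. The value is the average prediction.
        mixed = background.copy()
        if subset:
            idx = list(subset)
            mixed[:, idx] = x[idx]
        return predict_fn(mixed).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = coalition_value(subset + (i,)) - coalition_value(subset)
                phi[i] += weight * gain
    return phi, coalition_value(())  # contributions and the baseline


def toy_model(X):
    # Hypothetical model with an interaction between features 0 and 1.
    return 2.0 * X[:, 0] + X[:, 0] * X[:, 1] - X[:, 2]


rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
instances = rng.normal(size=(5, 3))

results = [exact_shapley(toy_model, x, background) for x in instances]
all_phi = np.array([phi for phi, _ in results])
base_value = results[0][1]  # the baseline is the same for every instance

# Local explanation: contributions for a single prediction.
print("Local SHAP values (instance 0):", np.round(all_phi[0], 3))

# Global explanation: mean absolute contribution per feature.
global_importance = np.abs(all_phi).mean(axis=0)
print("Global importance:", np.round(global_importance, 3))
```

Each row of `all_phi` sums to the instance's prediction minus the baseline, which is the property that lets SHAP values be read as a decomposition of the prediction.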


In [None]:
import shap
shap.initjs()


In [None]:
# Create a SHAP explainer ("prepare the tool"):
# it wraps the trained model's predict function, uses X_test as background
# data to estimate the baseline, and labels outputs with feature_names.
# Nothing is computed yet; SHAP values are calculated when the explainer is called.
shap_explainer = shap.Explainer(model.predict, X_test, feature_names=feature_names)


In [None]:
# Generate SHAP values for the test set:
# this runs the explainer on X_test and returns, for every instance, how much
# each feature contributes (positively or negatively) to the model's prediction.
shap_values = shap_explainer(X_test)


In [None]:
test_instance_index = 4
shap_values[test_instance_index]


- SHAP values → feature contributions for each instance.
- Base value → the model's average prediction over the background data (the baseline).
- Data → the actual input samples used (here, rows from X_test).


In [None]:
import matplotlib.pyplot as plt
import matplotlib.text as mtext

# Draw the waterfall plot without showing it, then grab the current figure
# (this works whether waterfall returns a Figure or an Axes)
shap.plots.waterfall(shap_values[test_instance_index], show=False)
fig = plt.gcf()

# Resize the figure
fig.set_size_inches(10, 6)

for axis in fig.axes:
    for child in axis.get_children():
        if isinstance(child, mtext.Text):
            #child.set_fontsize(10)         # Font size
            child.set_fontfamily('Calibri')  # Font family (matplotlib falls back to its default if not installed)
            child.set_color('black')    # Font color

    # Optional: update axis tick labels
    axis.tick_params(axis='both', labelsize=12, colors='black')
    axis.set_title("SHAP explanation for Car Price Prediction", fontsize=12, fontfamily='Cambria', color='black')

# Show the resized figure
plt.show()


In [None]:
X_test[test_instance_index]


- Baseline car price: the model's average prediction over the background data
- Feature contributions: Each feature's SHAP value shows how much it increases or decreases the predicted price
- Final prediction = base_value + sum(SHAP values)
- Positive SHAP values → increase predicted price
- Negative SHAP values → decrease predicted price
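
The additivity rule can be checked by hand on a model simple enough to have closed-form SHAP values. The sketch below assumes a hypothetical linear price model, for which the contribution of feature i is `w_i * (x_i - mean(x_i))` when absent features are replaced by background values. The weights, intercept, and data are made up for illustration and are not taken from the dataset or the random forest above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical linear price model: price = weights . x + intercept
names = ["vehicle_age", "mileage", "engine_volume"]
weights = np.array([-800.0, -0.05, 4000.0])
intercept = 20000.0


def predict(X):
    return X @ weights + intercept


# Hypothetical background data (200 cars) used to define the baseline.
background = np.column_stack([
    rng.uniform(0, 20, 200),        # vehicle_age in years
    rng.uniform(0, 300_000, 200),   # mileage
    rng.uniform(1.0, 4.0, 200),     # engine_volume
])

# The car to explain.
x = np.array([5.0, 60_000.0, 2.5])

# Baseline: average prediction over the background data.
base_value = predict(background).mean()

# Closed-form SHAP values for a linear model.
shap_row = weights * (x - background.mean(axis=0))

# Final prediction = base_value + sum(SHAP values)
reconstructed = base_value + shap_row.sum()
actual = predict(x[None, :])[0]

print(f"Baseline price:   {base_value:,.0f}")
for name, phi in zip(names, shap_row):
    direction = "increases" if phi > 0 else "decreases"
    print(f"{name:>14}: {phi:+,.0f} ({direction} the predicted price)")
print(f"Reconstructed:    {reconstructed:,.0f}")
print(f"Model prediction: {actual:,.0f}")
```

The same check applies to the `shap` output: for a given row, the base value plus the sum of that row's SHAP values should match the model's prediction up to numerical tolerance.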


#### Classification


In [None]:
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# 1. Load and prepare dataset (Car Price Categories for Classification)
df_cls = pd.read_csv('primary_features_boolean_converted_final.csv')
df_cls = df_cls.dropna(subset=['price(Georgian Lari)'])  # Remove rows with missing target

# Feature engineering - create meaningful features for classification
df_cls['vehicle_age'] = 2024 - df_cls['product_year']  # Age affects value and condition
df_cls['luxury_score'] = df_cls[['engine_volume', 'cylinders', 'airbags']].sum(axis=1)  # Luxury indicators
df_cls['safety_score'] = df_cls[['airbags', 'ABS', 'ESP', 'Central Locking', 'Alarm System']].sum(axis=1)  # Safety features

# Create target variable for classification - use quantile-based binning
price_data = df_cls['price(Georgian Lari)']
q33 = price_data.quantile(0.33)
q67 = price_data.quantile(0.67)

# Create 3 bins if the bin edges are distinct, otherwise fall back to 2 bins (median split)
bins = [price_data.min(), q33, q67, price_data.max()]
if len(set(bins)) == 4:  # min <= q33 <= q67 <= max, so distinct edges are strictly increasing
    df_cls['price_category'] = pd.cut(price_data, bins=bins, labels=[0, 1, 2], include_lowest=True)
else:
    median_price = price_data.median()
    df_cls['price_category'] = (price_data > median_price).astype(int)

df_cls = df_cls.dropna(subset=['price_category'])

# Select numerical features for classification
numerical_features_cls = ['vehicle_age', 'luxury_score', 'safety_score', 'mileage', 'engine_volume']
X_cls = df_cls[numerical_features_cls].fillna(0)  # Fill missing values with 0
y_cls = df_cls['price_category'].astype(int)  # Target: categorical price

feature_names_cls = numerical_features_cls

# Take only 1000 random samples for faster computation
np.random.seed(42)
indices_cls = np.random.choice(len(X_cls), min(1000, len(X_cls)), replace=False)
X_cls = X_cls.iloc[indices_cls] if isinstance(X_cls, pd.DataFrame) else X_cls[indices_cls]
y_cls = y_cls.iloc[indices_cls] if isinstance(y_cls, pd.Series) else y_cls[indices_cls]

# Convert to numpy arrays for compatibility
X_cls = np.array(X_cls)
y_cls = np.array(y_cls)

# Get class names
unique_classes = np.unique(y_cls)
class_names = [f'Price Category {i}' for i in unique_classes]

# 2. Train a model
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X_cls, y_cls, test_size=0.2, random_state=42)
model_cls = RandomForestClassifier(n_estimators=500, random_state=42)
model_cls.fit(X_train_cls, y_train_cls)

# 3. Initialize LIME explainer
explainer_cls = LimeTabularExplainer(
    training_data=X_train_cls,
    feature_names=feature_names_cls,
    class_names=class_names,
    mode="classification"
)

# 4. Pick a test instance to explain
i = 10
# Class labels here are 0..k-1, so the predicted label is also its predict_proba column index
pred_class = int(model_cls.predict([X_test_cls[i]])[0])
exp_cls = explainer_cls.explain_instance(
    data_row=X_test_cls[i],
    predict_fn=model_cls.predict_proba,
    labels=(pred_class,)  # explain the predicted class (LIME explains label 1 by default)
)

# 5. Show explanation
print("True class:", class_names[y_test_cls[i]])
print("Predicted class:", class_names[pred_class])
explanation_list_cls = exp_cls.as_list(label=pred_class)

# Convert to DataFrame
df_explanation_cls = pd.DataFrame(explanation_list_cls, columns=["Feature", "Contribution"])

print(df_explanation_cls)
