# **Local Interpretable Model-agnostic Explanations (LIME)**

This notebook demonstrates the use of LIME, a post-hoc method for explainability.

- In this technique, simpler explainable surrogate models are trained to approximate the predictions of an underlying complex black box model. 

- It provides local explanations, meaning it explains why a model made a specific prediction for an individual instance (e.g., one patient, one image, one data point).
- LIME tests what happens to the black-box model's predictions when perturbations are applied to the inputs.
- STEPS:
    1. Take the instance to explain.
    2. Perturb the features of this instance slightly:
        - tabular data: modifying feature values slightly
        - image data: masking parts of an image
    3. Predict on the perturbed instances with the complex black-box model.
    4. Assign weights to the perturbed instances based on how similar they are to the original instance; closer instances are weighted more heavily.
    5. Fit a simpler, interpretable model to the black-box predictions on the weighted perturbed instances.
    6. From the simpler model, extract feature importance scores for the original instance. The scores indicate which features contributed most to the prediction.
    7. Visualize the explanations.
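
The steps above can be sketched from scratch in a few lines. This is a minimal illustration, not the `lime` library's actual implementation: the synthetic data, the Gaussian perturbations, the exponential kernel with its width, and the ridge surrogate are all assumptions chosen for simplicity.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy black-box model trained on synthetic data (stand-in for any complex model)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)
black_box = GradientBoostingRegressor(random_state=0).fit(X, y)

# Step 1: take the instance to explain
x0 = X[0]

# Step 2: perturb the features around the instance (Gaussian noise is an assumption)
Z = x0 + rng.normal(scale=0.5, size=(1000, 3))

# Step 3: predict with the black-box model
preds = black_box.predict(Z)

# Step 4: weight perturbed samples by proximity to x0 (kernel width is an arbitrary choice)
distances = np.linalg.norm(Z - x0, axis=1)
kernel_width = 0.75 * np.sqrt(X.shape[1])
weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# Step 5: fit a simple weighted linear surrogate to the black-box predictions
surrogate = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)

# Step 6: the surrogate's coefficients act as local feature importance scores
local_importance = dict(zip(["f0", "f1", "f2"], surrogate.coef_))
print(local_importance)
```

Because the toy target depends positively on `f0`, negatively on `f1`, and not at all on `f2`, the local coefficients should show the same pattern around `x0`.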


The original paper describing this approach is https://arxiv.org/pdf/1602.04938


In [1]:
import pickle

import lime
import lime.lime_tabular
import numpy as np
import pandas as pd
import sklearn
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Fix NumPy's global seed for reproducibility
np.random.seed(1)

## **1. Import Gas Turbine Sensor Dataset**

The dataset contains 36,733 instances of 11 sensor measures, aggregated over one hour, from a gas turbine located in Turkey. It was collected to study flue gas emissions, namely CO and NOx. The dataset is available at https://archive.ics.uci.edu/dataset/551/gas+turbine+co+and+nox+emission+data+set

The variables are summarized in the table below.

| Variable Name | Role    | Type       | Description                                    | Units  | Min    | Max    | Mean   | Missing Values |
|--------------|--------|------------|------------------------------------------------|--------|--------|--------|--------|----------------|
| year         | Feature | Integer    | Year of observation                           | -      | -      | -      | -      | no             |
| AT          | Feature | Continuous | Ambient temperature                           | °C     | –6.23  | 37.10  | 17.71  | no             |
| AP          | Feature | Continuous | Ambient pressure                              | mbar   | 985.85 | 1036.56| 1013.07| no             |
| AH          | Feature | Continuous | Ambient humidity                              | %      | 24.08  | 100.20 | 77.87  | no             |
| AFDP        | Feature | Continuous | Air filter difference pressure                | mbar   | 2.09   | 7.61   | 3.93   | no             |
| GTEP        | Feature | Continuous | Gas turbine exhaust pressure                  | mbar   | 17.70  | 40.72  | 25.56  | no             |
| TIT         | Feature | Continuous | Turbine inlet temperature                     | °C     | 1000.85| 1100.89| 1081.43| no             |
| TAT         | Feature | Continuous | Turbine after temperature                     | °C     | 511.04 | 550.61 | 546.16 | no             |
| TEY         | Feature | Continuous | Turbine energy yield                          | MWH    | 100.02 | 179.50 | 133.51 | no             |
| CDP         | Feature | Continuous | Compressor discharge pressure                 | mbar   | 9.85   | 15.16  | 12.06  | no             |
| CO          | Feature | Continuous | Carbon monoxide concentration                 | mg/m³  | 0.00   | 44.10  | 2.37   | no             |
| NOx         | Feature | Continuous | Nitrogen oxides concentration                 | mg/m³  | 25.90  | 119.91 | 65.29  | no             |



Let us treat this as a regression problem and use the remaining sensor data to predict turbine energy yield (TEY). We will first concatenate the data from four years (2011-2014) for training and validation, and use the data from 2015 to test the model.

In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Define the folder path
folder_path = '/Users/Dhaneshr/code/interpretability-bootcamp/reference_implementations/Post-hoc/datasets/gas+turbine+co+and+nox+emission+data+set'

# Load the CSV files
files_to_load = ['gt_2011.csv', 'gt_2012.csv', 'gt_2013.csv', 'gt_2014.csv']
dataframes = [pd.read_csv(os.path.join(folder_path, file)) for file in files_to_load]


# Concatenate the dataframes
train_val_data = pd.concat(dataframes, ignore_index=True)
display(train_val_data.head())

# Load the test data
test_data = pd.read_csv(os.path.join(folder_path, 'gt_2015.csv'))


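# Standardize every column (including the target TEY) using statistics from 2011-2014 only;
# the 2015 test data is transformed with the same fitted scaler.
# All values downstream (predictions, LIME conditions, tree splits) are in standardized units.
scaler = StandardScaler()
train_val_data = pd.DataFrame(scaler.fit_transform(train_val_data), columns=train_val_data.columns)
test_data = pd.DataFrame(scaler.transform(test_data), columns=test_data.columns)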


# Split the training and validation data
train_data, val_data = train_test_split(train_val_data, test_size=0.2, random_state=42)

# Use column TEY as the target
target = 'TEY'
X_train = train_data.drop(target, axis=1)
y_train = train_data[target]
X_val = val_data.drop(target, axis=1)
y_val = val_data[target]
X_test = test_data.drop(target, axis=1)
y_test = test_data[target]



# Print the shapes of the datasets
print(f'Training data shape: {train_data.shape}')
print(f'Validation data shape: {val_data.shape}')
print(f'Test data shape: {test_data.shape}')

## **2. Set up Regression Model**

- For this example, we shall use a Gradient Boosting Regression model.
- Let's use grid search to find the best hyperparameters for this problem. To save time, we search a small grid: three candidate values for each of three hyperparameters (27 combinations).
- After the grid search, we will save the best model.
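
To get a feel for the cost of the search before running it, the number of model fits can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that assumes the same grid and 3-fold cross-validation used in the cell below; the variable names `n_candidates`, `cv_folds`, and `total_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid-search cell
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [3, 4, 6],
    'learning_rate': [0.05, 0.1, 0.2]
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 combinations
cv_folds = 3
total_fits = n_candidates * cv_folds

print(f'{n_candidates} candidate settings x {cv_folds} folds = {total_fits} fits')
# prints: 27 candidate settings x 3 folds = 81 fits
```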

In [None]:


# Define the parameter grid
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [3, 4, 6],
    'learning_rate': [0.05, 0.1, 0.2]
}

# Initialize the model
gbr = GradientBoostingRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=gbr, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f'Best parameters found: {best_params}')

# Initialize the model with the best parameters
gbr = GradientBoostingRegressor(**best_params, random_state=1)

# Train the model
gbr.fit(X_train, y_train)

# Make predictions on the validation data
y_val_pred = gbr.predict(X_val)

# Save the model to disk (the context manager closes the file handle)
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(gbr, f)


# Evaluate the model's performance
mse = mean_squared_error(y_val, y_val_pred)
print(f'Mean Squared Error on validation data: {mse}')

ytest_pred = gbr.predict(X_test)
mse_test = mean_squared_error(y_test, ytest_pred)
print(f'Mean Squared Error on test data: {mse_test}')
r2 = r2_score(y_test, ytest_pred)
print(f'R2 score on test data: {r2}')



## **3. LIME Local Explanations**

Let's use LIME to gain insight into the model's predictions and relate them to domain knowledge about gas turbine generators.
Here we will load the trained model and use LIME's `explain_instance()` function to explain the model's prediction for the test sample at index 12,
plotting the attributions of the six most influential sensor readings.

In [None]:
# Load the saved model and use LIME to explain its predictions
filename = 'finalized_model.sav'
with open(filename, 'rb') as f:
    gbr = pickle.load(f)

explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train.values,
    mode='regression',
    feature_names=X_train.columns.tolist(),
    discretize_continuous=True,
)

# Pick the instance at index 12 of the test data (0-based, i.e. the 13th row)
i = 12
exp = explainer.explain_instance(X_test.values[i], gbr.predict, num_features=6)
display(X_test.iloc[i])
exp.show_in_notebook()
# fig = exp.as_pyplot_figure(label=1)
# fig.savefig('lime_oi.png')
# print(exp.as_list())


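Here we note that, for the model's prediction on this instance, the low ambient temperature and the low air filter difference pressure push the prediction up, while the other four features push it down. Feature values and contributions are in standardized units.
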
| Feature Condition         | Contribution | Interpretation |
|---------------------------|-------------|----------------|
| **AT <= -0.81** (Positive)   | **+0.239**  | A very **low Ambient Temperature (AT)** significantly **increases** the model’s predicted output. This suggests that turbine performance improves in colder conditions. |
| **-0.58 < TIT <= 0.27** (Negative)  | **-0.100** | A **moderate Turbine Inlet Temperature (TIT)** **lowers** the predicted output, which may indicate that higher temperatures are needed for optimal turbine efficiency. |
| **-0.58 < CDP <= -0.08** (Negative)  | **-0.099** | A **low Compressor Discharge Pressure (CDP)** **reduces** the predicted output, possibly due to lower compression efficiency in the turbine system. |
| **-0.56 < GTEP <= -0.09** (Negative) | **-0.063** | A **low Gas Turbine Exhaust Pressure (GTEP)** is associated with a **decrease** in the predicted value, likely due to inefficiencies in exhaust flow. |
| **AP > 0.62** (Negative)   | **-0.060**  | A **higher Ambient Pressure (AP)** has a **negative effect** on the predicted outcome, suggesting that increased pressure might create inefficiencies in the turbine system. |
| **AFDP <= -0.66** (Positive) | **+0.046**  | A **low Air Filter Difference Pressure (AFDP)** slightly **increases** the predicted value, indicating that less resistance in air intake improves efficiency. |

The turbine’s energy yield (TEY) depends on environmental conditions and machine parameters: 

- Cooler temperatures (low AT) improve efficiency, which aligns with real-world turbine behavior: colder air is denser, so more air mass flows through the compressor.
- Compressor discharge pressure (CDP) and turbine inlet temperature (TIT) impact performance; in this explanation, a low CDP and a moderate TIT both reduce the predicted yield.
- Turbine exhaust pressure (GTEP) also influences yield, as inefficient exhaust flow can reduce power generation.
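
Because the data was standardized, conditions such as `AT <= -0.81` are expressed in standard deviations from the training mean rather than in °C. A threshold can be mapped back to physical units with the fitted scaler's `mean_` and `scale_` attributes. The sketch below uses a made-up toy table and a hypothetical helper, `to_original_units`; with the notebook's own `scaler` and column order, the same arithmetic should apply.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the sensor table (values are invented for illustration only)
toy = pd.DataFrame({"AT": [5.0, 10.0, 20.0, 30.0],
                    "CDP": [10.0, 11.5, 12.5, 14.0]})
toy_scaler = StandardScaler().fit(toy)

def to_original_units(z, column, scaler, columns):
    """Map a standardized value z back to the original units of `column`."""
    idx = list(columns).index(column)
    return z * scaler.scale_[idx] + scaler.mean_[idx]

at_threshold = to_original_units(-0.81, "AT", toy_scaler, toy.columns)
print(f"AT <= -0.81 (standardized) corresponds to AT <= {at_threshold:.2f} in the toy data")
```

Reporting thresholds in physical units makes it easier to check an explanation against domain knowledge.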

## **4. Extracting Global Explanations**

LIME provides inherently local explanations. How, then, can we obtain a global interpretation of the model's decisions?
One way is to use a **global surrogate**: a simple interpretable model is trained to reproduce the predictions of the more complex model, and is then inspected in its place.
In this example, let's use a simple decision tree.

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Train a shallow decision tree as a surrogate model: it learns to mimic
# the black-box predictions, not the true targets
surrogate_model = DecisionTreeRegressor(max_depth=3, random_state=42)
surrogate_model.fit(X_train, gbr.predict(X_train))

# Visualize the tree to interpret global importance
# (feature_names is passed as a plain list for compatibility across scikit-learn versions)
plt.figure(figsize=(12, 6))
plot_tree(surrogate_model, feature_names=X_train.columns.tolist(), filled=True)
plt.show()

From the plot of the decision tree, we can see that CDP (Compressor Discharge Pressure) appears to be the most important feature, followed by AT (Ambient Temperature). The split thresholds are in standardized units.

| Feature Split | Impact on TEY | Explanation |
|--------------|--------------|-------------|
| **CDP ≤ 0.473** | Decreases TEY | Low compressor discharge pressure reduces turbine efficiency. |
| **CDP > 1.461** | Increases TEY | High CDP improves performance. |
| **AT ≤ -0.925** | Increases TEY | Low ambient temperature enhances turbine efficiency. |
| **CDP ≤ -1.111** | Strongly Decreases TEY | Extremely low CDP causes major efficiency loss. |


1. CDP (Compressor Discharge Pressure) is the most important feature:
    - It is used for the first split, meaning it strongly influences the predicted TEY.
    - Lower CDP results in lower TEY (left side of the tree).
    - Higher CDP leads to higher TEY (right side of the tree).
2. Ambient Temperature (AT) is also important:
    - The tree splits on AT ≤ -0.925, showing that colder temperatures increase the predicted TEY.
    - This aligns with the physics of turbines, where cooler air improves efficiency.
3. Squared error decreases as we move down the tree:
    - The squared error shown at each node measures how much the black-box predictions vary among the samples reaching that node, and it generally decreases at deeper nodes.
    - This means each split groups together samples with more similar predictions, so the leaf values approximate the black-box model more closely.
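
A global surrogate is only as trustworthy as its fidelity, that is, how closely it reproduces the black-box predictions. One common check is the R² between the surrogate's output and the black-box output, alongside the tree's impurity-based feature importances. The sketch below runs on synthetic data with hypothetical names (`X_toy`, `toy_surrogate`, `f0`...`f3`); with the notebook's objects, the same calls would be applied to `surrogate_model`, `gbr`, and `X_train`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Synthetic data: only f0 and f1 influence the target
X_toy = rng.normal(size=(1000, 4))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] ** 2 + 0.1 * rng.normal(size=1000)

# Black-box model and its predictions
black_box = GradientBoostingRegressor(random_state=0).fit(X_toy, y_toy)
bb_preds = black_box.predict(X_toy)

# Shallow tree trained to mimic the black box
toy_surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, bb_preds)

# Fidelity: how much of the black-box prediction variance the surrogate reproduces
fidelity = r2_score(bb_preds, toy_surrogate.predict(X_toy))
print(f"Surrogate fidelity (R^2 vs. black-box predictions): {fidelity:.3f}")

# Impurity-based importances of the surrogate tree (they sum to 1)
importances = dict(zip(feature_names, toy_surrogate.feature_importances_))
print(importances)
```

A low fidelity suggests the tree is too simple to stand in for the black box, in which case its splits should be interpreted with caution. Measuring fidelity on held-out data would give a less optimistic estimate.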