# Interpreting a Random Forest Model with LIME
### A Demonstration for Federal Government AI Professionals

In this notebook, we will:
1. Install and import [**LIME**](https://github.com/marcotcr/lime), a popular library for model interpretability.
2. Load the **Boston Housing** dataset from a CSV file hosted on GitHub.
3. Train a **Random Forest** regressor to predict housing prices.
4. Use **LIME** to explain the predictions of specific instances.

> **Why it matters**: Federal agencies often require interpretable models to ensure fairness, transparency, and accountability in decision-making processes. Whether you’re working on a housing grants project or any other high-stakes federal program, LIME can help you provide clear explanations to policymakers and stakeholders.

## 1. Install LIME (if needed)

In [None]:
!pip install lime --quiet

## 2. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import requests

# Sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# LIME
from lime.lime_tabular import LimeTabularExplainer

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline

## 3. Load the Dataset

We'll pull the Boston Housing dataset from GitHub (hosted by [Selva Prabhakaran](https://github.com/selva86/datasets/)).

> **Note**: The original code attempted to load a `.pkl` file from a URL that’s no longer valid. Here, we’ve switched to a CSV-based version of the dataset for simplicity.

In [None]:
from io import StringIO

# URL for the Boston Housing CSV
url = 'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv'

# Fetch the data and stop early if the request failed
response = requests.get(url, timeout=30)
response.raise_for_status()

# Read the CSV text into a pandas DataFrame
housing_df = pd.read_csv(StringIO(response.text))

print("Data loaded successfully!")
housing_df.head()
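
Before moving on, it can help to confirm that the download produced the kind of table the rest of the notebook expects. The sketch below is one way to do that. `check_housing_frame` is a hypothetical helper (it is not part of pandas or of the original notebook), and `demo_df` is a tiny stand-in with made-up values so the cell runs without network access.

```python
import pandas as pd


def check_housing_frame(df, target="medv"):
    """Summarize a loaded frame, raising if the target column is absent.

    Hypothetical helper for illustration only.
    """
    if target not in df.columns:
        raise ValueError(f"Expected a '{target}' column, found: {list(df.columns)}")
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "non_numeric_columns": non_numeric,
    }


# Stand-in data with made-up values; in the notebook you would pass housing_df.
demo_df = pd.DataFrame(
    {"rm": [6.5, 5.9, 7.1], "chas": [0, 1, 0], "medv": [24.0, 21.6, 34.7]}
)
summary = check_housing_frame(demo_df)
print(summary)
```

A non-zero `missing_values` count or a non-empty `non_numeric_columns` list would be a signal to clean the data before training, since both the Random Forest and LIME expect numeric input.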

### Quick Data Exploration

In [None]:
# Summary stats
housing_df.describe()

## 4. Define Features and Target

The dataset columns:
- `medv` is the median home value (our **target**).
- All other columns are **features**.

For this demo, we use all features, but in a real Federal agency scenario, you might select features aligned with policy or domain constraints (e.g., focusing on geospatial, demographic, or environmental attributes).
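
As a rough illustration of restricting the model to an approved set of columns, the sketch below keeps only an allow-list of features. The list itself (`rm`, `nox`, `dis`) is a hypothetical choice made for this example, `select_features` is not part of the notebook, and the stand-in frame uses made-up values.

```python
import pandas as pd

# Hypothetical allow-list; the right choice depends on your program's policy review.
ALLOWED_FEATURES = ["rm", "nox", "dis"]


def select_features(df, allowed, target="medv"):
    """Split df into (X, y), keeping only the allow-listed feature columns."""
    missing = [c for c in allowed if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in data: {missing}")
    return df[allowed].copy(), df[target].copy()


# Stand-in data with made-up values; in the notebook you would pass housing_df.
demo_df = pd.DataFrame(
    {
        "rm": [6.5, 5.9],
        "nox": [0.54, 0.47],
        "dis": [4.1, 5.0],
        "lstat": [5.0, 9.1],
        "medv": [24.0, 21.6],
    }
)
X_demo, y_demo = select_features(demo_df, ALLOWED_FEATURES)
print(X_demo.columns.tolist())
```

Raising on a missing column, rather than silently skipping it, keeps the feature set auditable: the model is trained on exactly the columns that were approved.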

In [None]:
# Separate features (X) and target (y)
X = housing_df.drop('medv', axis=1)
y = housing_df['medv']

# We will store feature names for LIME
feature_names = X.columns.tolist()

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

## 5. Train a Random Forest Model

Random Forest is a popular ensemble method that often works well out-of-the-box. After training, we can interpret individual predictions using LIME.

In [None]:
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)

train_score = regressor.score(X_train, y_train)
test_score = regressor.score(X_test, y_test)

print(f"Random Forest Training R^2 Score: {train_score:.3f}")
print(f"Random Forest Test R^2 Score: {test_score:.3f}")
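
For scikit-learn regressors, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand and adds RMSE, which is expressed in the target's own units. `r2_manual` and `rmse` are illustrative helpers, and the numbers are made up rather than taken from this model.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values for illustration; in the notebook you would pass
# y_test and regressor.predict(X_test).
y_true_demo = [20.0, 25.0, 30.0, 35.0]
y_pred_demo = [22.0, 24.0, 29.0, 36.0]

r2_demo = r2_manual(y_true_demo, y_pred_demo)
rmse_demo = rmse(y_true_demo, y_pred_demo)
print(f"R^2: {r2_demo:.3f}, RMSE: {rmse_demo:.3f}")  # R^2: 0.944, RMSE: 1.323
```

A training R^2 well above the test R^2 is a common sign of overfitting. It is worth noting before explaining individual predictions, because LIME explains what the model does, not whether the model is right.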

## 6. Interpreting Predictions with LIME

[**LIME**](https://github.com/marcotcr/lime) stands for **Local Interpretable Model-agnostic Explanations**. It explains individual predictions by approximating the complex model locally (near the instance you want to explain) with a simpler model (like a linear model).
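
To make the "approximate locally with a simpler model" idea concrete, here is a minimal NumPy sketch of a local linear surrogate. It is not LIME's actual implementation: the real library also discretizes features, selects a subset of them, and fits a regularized model. The function name, the Gaussian perturbation, and the kernel width are assumptions made for this illustration.

```python
import numpy as np


def local_linear_surrogate(predict_fn, x, scale, num_samples=2000, seed=0):
    """Fit a proximity-weighted linear model around x (simplified LIME-style sketch).

    Returns (intercept, weights), with weights in units of `scale`.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    scale = np.asarray(scale, dtype=float)
    kernel_width = 0.75 * np.sqrt(x.size)  # assumed width for this sketch

    # 1. Perturb the instance to sample its neighborhood.
    Z = x + rng.normal(size=(num_samples, x.size)) * scale

    # 2. Ask the black-box model for predictions on the perturbed points.
    y = predict_fn(Z)

    # 3. Weight each sample by its closeness to x (exponential kernel).
    offsets = (Z - x) / scale
    distances = np.linalg.norm(offsets, axis=1)
    sample_weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # 4. Solve weighted least squares for an intercept plus one slope per feature.
    A = np.column_stack([np.ones(num_samples), offsets])
    sw = np.sqrt(sample_weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]


def black_box(Z):
    """Toy nonlinear model standing in for regressor.predict."""
    return Z[:, 0] ** 2 + 3.0 * Z[:, 1]


intercept, weights = local_linear_surrogate(black_box, x=[2.0, 1.0], scale=[1.0, 1.0])
print("Local intercept:", round(float(intercept), 2))
print("Local feature weights:", np.round(weights, 2))
```

The weights describe the model's behavior only near the chosen point. For the toy function above, the slope for the first feature depends on where `x` sits, which is exactly why LIME explanations are called local.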

### Defining Categorical Features
In some Federal use-cases, you may have a mix of numeric and categorical data. For demonstration, let’s assume `chas` (Charles River dummy variable) is categorical:
- `chas` = 1 if the property bounds the river; 0 otherwise.

In the Boston dataset, that’s the 4th column (0-indexed = 3).

In [None]:
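# Identify categorical columns for LIME by name rather than by hard-coded position
cat_cols = [feature_names.index('chas')]  # 'chas' is column index 3

explainer = LimeTabularExplainer(
    training_data=np.array(X_train),
    feature_names=feature_names,
    categorical_features=cat_cols,
    mode='regression',
    random_state=42  # LIME samples randomly; fix the seed for repeatable explanations
)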
print("LIME explainer initialized!")

### 6.1 Explain a Single Instance
We will select an instance (say, index=10 in our test set) and explain the Random Forest’s predicted price.

In [None]:
instance_idx = 10  # positional index into the test set (used with .iloc)

exp = explainer.explain_instance(
    data_row=X_test.iloc[instance_idx].to_numpy(),  # LIME expects a 1-D NumPy array
    predict_fn=regressor.predict,
    num_features=8
)

fig = exp.as_pyplot_figure()
plt.title(f"LIME Explanation for Test Instance #{instance_idx}")
plt.show()

exp_list = exp.as_list()
print("\nExplanation as list of (feature, effect):")
for feature, val in exp_list:
    print(f"{feature} => {val:.3f}")
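
The `(feature, effect)` pairs are easier to read once they are ranked by magnitude and split by direction. The sketch below shows one way to do that in plain Python. The helper names are hypothetical, and `demo_exp_list` contains made-up values shaped like `exp.as_list()` output, not results from this model.

```python
def rank_effects(exp_list, top_n=None):
    """Sort (description, effect) pairs by absolute effect, largest first."""
    ranked = sorted(exp_list, key=lambda pair: abs(pair[1]), reverse=True)
    return ranked if top_n is None else ranked[:top_n]


def split_by_direction(exp_list):
    """Separate effects that raise the prediction from those that lower it."""
    raises = [(desc, val) for desc, val in exp_list if val > 0]
    lowers = [(desc, val) for desc, val in exp_list if val < 0]
    return raises, lowers


# Made-up values shaped like exp.as_list() output; not results from this model.
demo_exp_list = [
    ("0.45 < nox <= 0.54", -0.3),
    ("rm > 6.62", 4.1),
    ("chas=0", -0.1),
    ("lstat <= 6.95", 2.7),
]

ranked = rank_effects(demo_exp_list, top_n=3)
raises, lowers = split_by_direction(demo_exp_list)

for description, effect in ranked:
    direction = "raises" if effect > 0 else "lowers"
    print(f"{description:<22} {direction} the local prediction by about {abs(effect):.1f}")
```

Effects are expressed in the units of the target, so for `medv` a positive value means the condition pushes the predicted median home value up in the neighborhood of this instance.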

### 6.2 Compare Explanations for Multiple Instances
Because LIME explanations are local, the most influential features can differ from one instance to the next. Let’s compare explanations for two more instances.

In [None]:
instances_to_explain = [5, 25]  # you can pick any two test indices

for i in instances_to_explain:
    exp = explainer.explain_instance(
        data_row=X_test.iloc[i].to_numpy(),  # LIME expects a 1-D NumPy array
        predict_fn=regressor.predict,
        num_features=8
    )
    fig = exp.as_pyplot_figure()
    plt.title(f"LIME Explanation for Test Instance #{i}")
    plt.show()
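
Side-by-side plots can be hard to compare because LIME labels each bar with a condition (for example `rm > 6.62`) that differs between instances. One possible approach, sketched below, is to map each condition back to its base feature name and line the effects up in a table. Both helpers are hypothetical, and the effect values are made up for illustration.

```python
import re

import pandas as pd


def base_feature(description, feature_names):
    """Return the feature name mentioned in a LIME-style condition string."""
    for token in re.findall(r"[A-Za-z_]\w*", description):
        if token in feature_names:
            return token
    return description  # fall back to the raw text if nothing matches


def compare_effects(explanations, feature_names):
    """Build a feature-by-instance table from {instance_id: [(description, effect), ...]}."""
    columns = {
        instance_id: {base_feature(desc, feature_names): effect for desc, effect in pairs}
        for instance_id, pairs in explanations.items()
    }
    return pd.DataFrame(columns).rename_axis("feature")


# Made-up values shaped like exp.as_list() output; in the notebook you would
# collect exp.as_list() for each instance inside the loop above.
demo_explanations = {
    5: [("rm > 6.62", 3.2), ("lstat <= 6.95", 2.1), ("chas=0", -0.2)],
    25: [("rm <= 5.89", -2.8), ("lstat > 16.96", -3.4), ("0.45 < nox <= 0.54", 0.4)],
}
demo_names = ["rm", "lstat", "nox", "chas"]

table = compare_effects(demo_explanations, demo_names)
print(table)
```

A missing value (`NaN`) means the feature was not among the top features reported for that instance. A sign flip across columns, as with `rm` here, shows the same feature pushing two predictions in opposite directions.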

## Conclusion

You have just seen how **LIME** can help Federal Government AI professionals (and anyone else) explain the predictions of complex models, such as the **Random Forest** in this example.

In a real-world Federal use-case, such transparency is often required for:
- **Policy compliance** (e.g., ensuring no unlawful biases).
- **Audit and oversight** by internal or external agencies.
- **Stakeholder communication** with leadership and the public.

Experiment with different rows, features, and model parameters to see how explanations change. This will help build trust and confidence in your machine learning solutions!