# Model Feature Importance & SHAP Analysis

This notebook analyzes feature importances and SHAP values for multiple trained models (RandomForest, XGBoost, LightGBM, CatBoost) on a food/nutrition dataset. We will:
- Load each model and generate feature importance plots
- For tree-based models, compute SHAP values and summary plots
- Save all plots to the `model_insights/` folder
- Compare which features are consistently important across models


## 1. Import Required Libraries
We will import libraries for model loading, plotting, and SHAP analysis.

In [1]:
# Import libraries
import os
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor

# Ensure output folder exists
os.makedirs('model_insights', exist_ok=True)


  from .autonotebook import tqdm as notebook_tqdm


## 2. Define Utility Functions for Loading Models and Plotting
Reusable functions for loading models, plotting feature importances, computing SHAP values, and saving plots.

In [2]:
# Utility functions
def load_model(path):
    return joblib.load(path)

def plot_feature_importance(importances, feature_names, model_name, top_n=20):
    indices = np.argsort(importances)[::-1][:top_n]
    plt.figure(figsize=(8, 6))
    sns.barplot(x=importances[indices], y=np.array(feature_names)[indices], orient='h')
    plt.title(f'{model_name} Feature Importances (Top {top_n})')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    fname = f"model_insights/{model_name.lower()}_importance.png"
    plt.savefig(fname)
    plt.close()
    print(f"Saved: {fname}")

def plot_shap_summary(model, X, model_name, max_display=20):
    explainer = shap.Explainer(model, X)
    shap_values = explainer(X)
    plt.figure()
    shap.summary_plot(shap_values, X, show=False, max_display=max_display)
    fname = f"model_insights/{model_name.lower()}_shap.png"
    plt.tight_layout()
    plt.savefig(fname)
    plt.close()
    print(f"Saved: {fname}")
    return shap_values


## 3. Load Trained Models
Load the trained RandomForest, XGBoost, LightGBM, and CatBoost models from their .joblib files.

In [3]:
# Load dataset (features must match training)
df = pd.read_csv('data/cleaned_nutricare.csv')
def feature_engineering(df):
    total = df['Recommended_Carbs'] + df['Recommended_Protein'] + df['Recommended_Fats']
    total = total.replace(0, pd.NA)
    df['Carb_ratio'] = df['Recommended_Carbs'] / total
    df['Protein_ratio'] = df['Recommended_Protein'] / total
    df['Fat_ratio'] = df['Recommended_Fats'] / total
    df[['Carb_ratio', 'Protein_ratio', 'Fat_ratio']] = df[['Carb_ratio', 'Protein_ratio', 'Fat_ratio']].fillna(0)
    one_hot_cols = [col for col in ['Chronic_Disease', 'Gender'] if col in df.columns]
    df = pd.get_dummies(df, columns=one_hot_cols, drop_first=True)
    features = ['Age', 'BMI', 'Carb_ratio', 'Protein_ratio', 'Fat_ratio'] + \
               [col for col in df.columns if col.startswith('Chronic_Disease_') or col.startswith('Gender_')]
    features = [f for f in features if f in df.columns]
    return df[features], features

X, feature_names = feature_engineering(df)

# Load available models only
rf = load_model('models/best_model_RandomForest.joblib')
xgb = load_model('models/best_model_Advanced.joblib')  # XGBoost
models = {
    'RandomForest': rf,
    'XGBoost': xgb
}

## 4. Generate and Save Feature Importance Plots
For each model, extract feature importances and generate bar charts. Save each plot as a PNG file in the `model_insights/` folder.

In [4]:
# Generate and save feature importance plots
for name, model in models.items():
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    elif hasattr(model, 'estimators_') and hasattr(model.estimators_[0], 'feature_importances_'):
        # MultiOutputRegressor: average importances across outputs
        importances = np.mean([est.feature_importances_ for est in model.estimators_], axis=0)
    else:
        print(f"No feature_importances_ for {name}")
        continue
    plot_feature_importance(importances, feature_names, name)

Saved: model_insights/randomforest_importance.png
Saved: model_insights/xgboost_importance.png


## 5. Compute and Save SHAP Value Plots for Tree-Based Models
For XGBoost, LightGBM, and CatBoost, compute SHAP values using a sample of the dataset and generate SHAP summary plots. Save each plot in the `model_insights/` folder.

**Note:**

If you still encounter errors with SHAP and XGBoost (such as `'TreeEnsemble' object has no attribute 'values'`), upgrade both packages to the latest versions and restart the kernel:

```bash
pip install --upgrade shap xgboost
```

After upgrading, restart the notebook kernel and re-run the SHAP analysis cell.

In [5]:
# Compute and save SHAP value plots for available tree-based models
shap_summaries = {}
# Ensure the SHAP sample is numeric to avoid dtype errors
shap_sample = X.sample(n=min(500, len(X)), random_state=42)
shap_sample = shap_sample.apply(pd.to_numeric, errors='coerce').fillna(0)
for name, model in models.items():
    # If model is MultiOutputRegressor, use the first estimator for SHAP
    base_model = model
    if hasattr(model, "estimators_") and len(model.estimators_) > 0:
        base_model = model.estimators_[0]
    try:
        shap_values = plot_shap_summary(base_model, shap_sample, name)
        shap_summaries[name] = shap_values
    except Exception as e:
        print(f"SHAP failed for {name}: {e}")

SHAP failed for RandomForest: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'


SHAP failed for XGBoost: 'TreeEnsemble' object has no attribute 'values'
SHAP failed for XGBoost: 'TreeEnsemble' object has no attribute 'values'


## 6. Compare Important Features Across Models
Analyze and display which features are consistently important across all models.

In [6]:
# Compare top features across models
def get_top_features(model, feature_names, n=10):
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    elif hasattr(model, 'estimators_') and hasattr(model.estimators_[0], 'feature_importances_'):
        importances = np.mean([est.feature_importances_ for est in model.estimators_], axis=0)
    else:
        return []
    indices = np.argsort(importances)[::-1][:n]
    return [feature_names[i] for i in indices]

feature_rankings = {name: get_top_features(model, feature_names) for name, model in models.items()}

# Display as DataFrame
top_features_df = pd.DataFrame(feature_rankings)
display(top_features_df)

# Text summary
from collections import Counter
all_top = sum(feature_rankings.values(), [])
common = Counter(all_top).most_common(10)
print("Most consistently important features across models:")
for feat, count in common:
    print(f"{feat}: {count} models")

Unnamed: 0,RandomForest,XGBoost
0,Fat_ratio,Fat_ratio
1,Protein_ratio,Protein_ratio
2,Carb_ratio,Carb_ratio
3,BMI,BMI
4,Age,Age
5,Chronic_Disease_none,Chronic_Disease_none
6,Chronic_Disease_heart_disease,Gender_Male
7,Gender_Other,Gender_Other
8,Chronic_Disease_obesity,Chronic_Disease_heart_disease
9,Chronic_Disease_hypertension,Chronic_Disease_hypertension


Most consistently important features across models:
Fat_ratio: 2 models
Protein_ratio: 2 models
Carb_ratio: 2 models
BMI: 2 models
Age: 2 models
Chronic_Disease_none: 2 models
Chronic_Disease_heart_disease: 2 models
Gender_Other: 2 models
Chronic_Disease_hypertension: 2 models
Chronic_Disease_obesity: 1 models
