# SHAP Visualization & Interpretation

This notebook collects SHAP artifacts, plots, and related interpretation files produced during the analysis. Run cells sequentially to display available figures and CSV summaries. Cells are robust to missing files and will print messages if artifacts are not found.


**Purpose.** The following cells present SHAP explanations and feature-importance outputs for the Random Forest classifier trained to predict high vs. low cognitive load. Each figure is accompanied by a short caption that highlights the key insight, so the visuals can be reused in reports or presentations.


In [None]:

# Helper imports and lookup paths
from pathlib import Path
from IPython.display import display, Image, HTML, Markdown
import pandas as pd, numpy as np, matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 150

SEARCH_DIRS = [
    Path('/mnt/data/analysis/interpretation'),
    Path('/mnt/data/analysis/results'),
    Path('/mnt/data/analysis/behavioral_patterns'),
    Path('/mnt/data')
]

def find_file(names):
    for d in SEARCH_DIRS:
        for n in names:
            p = d / n
            if p.exists():
                return p
    return None

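def show_image(p, caption=None, width=900):
    # Embed the image bytes with Image(); a raw <img src='/abs/path'> tag usually
    # fails for files that live outside the notebook's working directory.
    if p and p.exists():
        display(Image(filename=str(p), width=width))
        display(HTML(f"<p style='font-size:90%'>{caption or p.name}</p>"))
    else:
        print('Not found:', p)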


## SHAP Summary Plots
Below are the SHAP summary bar plot (mean(|SHAP|)) and the SHAP beeswarm. These plots summarize which features contribute most to model predictions and how feature values relate to predicted probability of 'High Load'.

In [None]:

bar = find_file(['shap_summary_bar.png','shap_summary_barplot.png'])
bees = find_file(['shap_summary_beeswarm.png','shap_summary_beeswarmplot.png'])
print('SHAP bar path:', bar)
print('SHAP beeswarm path:', bees)
if bar:
    show_image(bar, caption='Figure: SHAP summary bar (mean absolute SHAP values). Caption: Features are ranked by the average magnitude of their impact on model output; larger values indicate stronger influence on the High Load prediction, in either direction.')
else:
    print('SHAP summary bar not found.')
if bees:
    show_image(bees, caption='Figure: SHAP beeswarm. Caption: Each point represents one sample-feature SHAP value; color shows the original feature value. The plot reveals whether high or low feature values push predictions toward or away from High Load.')
else:
    print('SHAP beeswarm not found.')
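
The ranking behind the summary bar plot can be reproduced from a SHAP matrix directly. The sketch below is a minimal illustration on toy data, assuming a 2-D array shaped (samples × features); the helper name `rank_features_by_mean_abs_shap` and the feature names are hypothetical and not part of the analysis outputs.

```python
import numpy as np

def rank_features_by_mean_abs_shap(shap_matrix, feature_names):
    """Return (name, mean |SHAP|) pairs sorted from most to least influential."""
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    if shap_matrix.ndim != 2 or shap_matrix.shape[1] != len(feature_names):
        raise ValueError("expected shape (n_samples, n_features) matching feature_names")
    # Average the magnitude of each feature's contribution over all samples.
    mean_abs = np.abs(shap_matrix).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]  # descending
    return [(feature_names[i], float(mean_abs[i])) for i in order]

# Toy SHAP matrix (3 samples x 3 features); values are illustrative only.
toy_shap = np.array([[ 0.2, -0.6,  0.0],
                     [-0.4,  0.1,  0.1],
                     [ 0.3, -0.5, -0.1]])
toy_names = ["feat_a", "feat_b", "feat_c"]

ranking = rank_features_by_mean_abs_shap(toy_shap, toy_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the absolute value is taken before averaging, a feature that pushes some samples up and others down still ranks highly; the sign information is only visible in the beeswarm.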


## SHAP Clusters (PCA visualization)
If cluster analysis on SHAP vectors was performed, this PCA plot visualizes cluster separation. Clusters can indicate subgroups of participants/sessions with distinct explanation patterns.

In [None]:

pca = find_file(['shap_clusters_pca.png','shap_clusters_pca_plot.png'])
if pca:
    show_image(pca, caption='Figure: PCA of SHAP vectors with k-means clusters. Caption: Points are samples; color shows cluster membership derived from SHAP patterns, indicating groups with similar model explanations.')
else:
    print('SHAP clusters PCA not found.')
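
For readers who want to rebuild a similar projection, the following is a rough sketch of a 2-D PCA of SHAP vectors using NumPy's SVD. It is a stand-in under stated assumptions, not the code that produced the saved figure: the helper `pca_project` is hypothetical, the data are synthetic, and the k-means step is omitted.

```python
import numpy as np

def pca_project(shap_matrix, n_components=2):
    """Project rows onto the top principal components; also return explained variance ratios."""
    X = np.asarray(shap_matrix, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return coords, explained[:n_components]

# Synthetic SHAP vectors: two groups with opposite explanation patterns.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.3, -0.1, 0.0], scale=0.02, size=(20, 3))
group_b = rng.normal(loc=[-0.3, 0.1, 0.0], scale=0.02, size=(20, 3))
toy_shap = np.vstack([group_a, group_b])

coords, explained = pca_project(toy_shap)
print("Projected shape:", coords.shape)
print("Explained variance ratio:", np.round(explained, 3))
```

In this toy case the two groups fall on opposite sides of the first component, which is the kind of separation the cluster plot is meant to reveal.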


## Feature Importances (Random Forest)
The Random Forest feature importance plot and CSV list the features the model relied on most for classification. Treat RF importances as complementary to SHAP: SHAP gives local, signed, per-sample contributions, whereas RF importance gives one global, unsigned score per feature (typically the mean decrease in impurity across trees).

In [None]:

fi_csv = find_file(['feature_importances.csv','feature_importances_ultrarealistic_summary.csv'])
fi_png = find_file(['feature_importances_barplot.png','feature_importances.png','feature_importances_ultrarealistic_summary.png'])
if fi_csv:
    df_fi = pd.read_csv(fi_csv)
    display(df_fi.head(30))
    display(Markdown('**Caption:** Feature importances (Random Forest). Higher values indicate features the forest relied on more heavily when splitting.'))
else:
    print('feature_importances CSV not found.')
if fi_png:
    show_image(fi_png, caption='Figure: Random Forest feature importances. Caption: Visual ranking of features by importance.')
else:
    print('Feature importance image not found.')


## SHAP Cluster Labels and Sample Examples
This table (if present) shows cluster assignments based on SHAP vectors with optional participant/task metadata. Use it to inspect whether particular tasks or participants map to specific explanation clusters.

In [None]:

cluster_csv = find_file(['shap_cluster_labels.csv','shap_cluster_labels_with_meta.csv','participant_shap_clusters.csv'])
if cluster_csv:
    dfc = pd.read_csv(cluster_csv)
    display(dfc.head(50))
    display(Markdown('**Caption:** SHAP cluster labels. Each row is a sample (participant-task) with assigned cluster; examine cluster composition for intuition.'))
else:
    print('No SHAP cluster labels CSV found.')
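
One way to check whether tasks map to specific explanation clusters is a row-normalized cross-tabulation. The sketch below assumes the table has `task_id` and `shap_cluster` columns (names borrowed from the modeling dataset; the actual CSV may differ), and the helper `cluster_composition` and the toy rows are hypothetical.

```python
import pandas as pd

def cluster_composition(df, cluster_col="shap_cluster", group_col="task_id"):
    """Share of each group's samples that falls into each cluster (rows sum to 1)."""
    counts = pd.crosstab(df[group_col], df[cluster_col])
    return counts.div(counts.sum(axis=1), axis=0)

# Toy cluster-label table; column names are assumptions, values are illustrative.
toy = pd.DataFrame({
    "participantId": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "task_id":       ["easy", "hard", "easy", "hard", "easy", "hard"],
    "shap_cluster":  [0, 1, 0, 1, 0, 0],
})

shares = cluster_composition(toy)
print(shares)
```

Passing `group_col="participantId"` gives the same view per participant instead of per task.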


## SHAP Arrays and Feature Names
If available, `shap_values.npy` provides the matrix of SHAP values (samples × features) and `shap_feature_names.json` lists feature order. These are useful for custom plotting and cluster analyses.

In [None]:

shap_npy = find_file(['shap_values.npy','shap_values_full.npy'])
shap_feat = find_file(['shap_feature_names.json','shap_feature_names.txt'])
if shap_npy and shap_feat:
    import json
    arr = np.load(shap_npy)
    # JSON holds a list of names; the .txt fallback is assumed to list one name per line.
    text = shap_feat.read_text()
    if shap_feat.suffix == '.json':
        feat_names = json.loads(text)
    else:
        feat_names = [ln.strip() for ln in text.splitlines() if ln.strip()]
    print('SHAP array shape:', arr.shape)
    print('First 20 feature names:', feat_names[:20])
    display(Markdown('**Caption:** SHAP array shape (samples × features) and feature order. Use for custom analyses (e.g., per-feature cluster centroids).'))
else:
    print('SHAP arrays or feature names not found.')
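
As an example of a custom analysis on the SHAP array, per-cluster centroids can be computed as the mean SHAP vector of each cluster. This is a minimal sketch on toy data, assuming a (samples × features) matrix and one integer cluster label per sample; `cluster_centroids` is a hypothetical helper.

```python
import numpy as np

def cluster_centroids(shap_matrix, labels):
    """Map each cluster label to the mean SHAP vector of its samples."""
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    labels = np.asarray(labels)
    if shap_matrix.shape[0] != labels.shape[0]:
        raise ValueError("need exactly one cluster label per sample")
    return {int(c): shap_matrix[labels == c].mean(axis=0) for c in np.unique(labels)}

# Toy SHAP matrix (3 samples x 2 features) with illustrative cluster labels.
toy_shap = np.array([[ 0.2, -0.1],
                     [ 0.4, -0.3],
                     [-0.2,  0.5]])
toy_labels = np.array([0, 0, 1])

centroids = cluster_centroids(toy_shap, toy_labels)
for cluster_id, centroid in centroids.items():
    print(cluster_id, np.round(centroid, 3))
```

Comparing centroids feature by feature shows which features drive the model differently in each cluster.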


## Regenerating SHAP Values (Guidance)
If SHAP outputs are missing, use the example code below to compute TreeSHAP values for the trained Random Forest pipeline. The example assumes the pipeline exposes the Random Forest as `pipeline.named_steps['rf']` and its preprocessing steps as `imputer` and `scaler`; adapt these names to your pipeline.

In [None]:

print('Example regeneration code (edit paths and pipeline access as needed):\n')
print('''import json
import joblib, shap, numpy as np, pandas as pd
pipeline = joblib.load('models/tuned_random_forest_model.joblib') # or path to saved pipeline
df = pd.read_csv('data/processed/modeling_dataset_ultrarealistic.csv')
feature_cols = [c for c in df.columns if c not in ('participantId','task_id','tlx','High_Load','shap_cluster')]
X = df[feature_cols]
# If you used a preprocessing pipeline (step names 'imputer' and 'scaler' are assumed):
X_proc = pipeline.named_steps['scaler'].transform(pipeline.named_steps['imputer'].transform(X))
explainer = shap.TreeExplainer(pipeline.named_steps['rf'])
raw = explainer.shap_values(X_proc)
# Older SHAP releases return a list with one array per class; newer ones return a
# single array shaped (samples, features, classes). Index 1 is assumed to be 'High Load'.
if isinstance(raw, list):
    shap_vals = raw[1]
elif raw.ndim == 3:
    shap_vals = raw[:, :, 1]
else:
    shap_vals = raw
np.save('analysis/interpretation/shap_values.npy', shap_vals)
# Feature names are a plain list, so write them as JSON rather than with np.save.
with open('analysis/interpretation/shap_feature_names.json', 'w') as f:
    json.dump(feature_cols, f)
''')



## Export & Figure Usage Notes

- Use the provided PNGs directly in reports.
- For publication figures, prefer the SHAP beeswarm for nuanced relationships and the SHAP bar for a compact importance summary.
- For vector graphics suitable for publication, regenerate the plots in Python and export as SVG or PDF.
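
As a rough illustration of the vector-export note above, the sketch below draws a simple importance bar chart with matplotlib and saves it as both SVG and PDF. The helper `save_importance_bar`, the file stem, and the toy values are assumptions for illustration; it does not reproduce the original SHAP figures.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

def save_importance_bar(names, values, out_dir, stem="shap_summary_bar"):
    """Draw a horizontal bar chart and save it in vector formats (SVG and PDF)."""
    order = np.argsort(values)  # ascending, so the largest bar ends up on top
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.barh([names[i] for i in order], [values[i] for i in order])
    ax.set_xlabel("mean(|SHAP value|)")
    fig.tight_layout()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for ext in ("svg", "pdf"):
        path = out_dir / f"{stem}.{ext}"
        fig.savefig(path)  # output format is inferred from the file extension
        paths.append(path)
    plt.close(fig)
    return paths

# Toy values; replace with real mean(|SHAP|) scores and an output folder of your choice.
out_paths = save_importance_bar(["feat_a", "feat_b", "feat_c"], [0.3, 0.4, 0.07], tempfile.mkdtemp())
print([p.name for p in out_paths])
```

Vector files scale without pixelation, so the `plt.rcParams['figure.dpi']` setting used for on-screen PNGs does not limit their print quality.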
