
Assignment: Feature Importance Analysis using SHAP
Objective
To select a publicly available dataset from any domain, apply SHAP (SHapley Additive exPlanations) to
identify important features, build a predictive model, and interpret the results in detail.
Dataset Selection Guidelines
Students choose a dataset from the following domain:
 Education – e.g., student performance, exam score prediction.
Requirements for dataset selection:
 At least 500 rows of data.
 Minimum 5 independent variables (features).
 A clear target variable for classification or regression.
 Dataset must be publicly accessible (Kaggle, UCI Repository, government portals, etc.).
Tasks
 Data Collection & Preprocessing
 Download the chosen dataset in .csv format (or any other format).
 Load it into Python using Pandas.
 Handle missing values, duplicates, and outliers.
 Encode categorical variables if needed.
 Normalize or standardize data when required.
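
The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration and not the notebook's own preprocessing: the helper name `basic_clean`, the median imputation, and the 1.5 × IQR clipping rule are all assumptions chosen for the example, and the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd


def basic_clean(df, outlier_cols):
    """Drop duplicate rows, impute numeric NaNs with the median, clip outliers.

    `outlier_cols` names the columns to clip to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    Both the imputation strategy and the clipping rule are assumptions.
    """
    out = df.drop_duplicates().copy()

    # Median imputation for every numeric column
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # IQR-based clipping, only for the columns the caller asks for
    for col in outlier_cols:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    return out.reset_index(drop=True)


# Hypothetical toy data: one missing value, one extreme value, two duplicate rows
toy = pd.DataFrame({
    "absences": [0, 2, 4, 2, np.nan, 90, 2],
    "studytime": [2, 2, 3, 2, 1, 2, 2],
})
clean = basic_clean(toy, outlier_cols=["absences"])
print(clean)
```

Clipping is applied only to columns named by the caller because an IQR rule collapses short ordinal scales (such as a 1 to 4 rating) whose quartiles coincide.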
Model Building
 Split the dataset into training (80%) and testing (20%) sets.
 Choose a suitable model (e.g., Random Forest, Logistic Regression, XGBoost).
 Train and evaluate the model using relevant metrics:
 Classification: Accuracy, Precision, Recall, F1-score, ROC.
 Regression: RMSE, MSE, MAPE, MPE, MAE, R² score.
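
The metrics named above can be computed with scikit-learn plus a few lines of NumPy. The sketch below is illustrative only: the function names `classification_metrics` and `regression_metrics` are hypothetical, the labels and predictions are toy values, and MPE is taken as the mean of (actual − predicted) / actual, which is one common sign convention.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, mean_absolute_error, mean_squared_error,
    precision_score, r2_score, recall_score, roc_auc_score,
)


def classification_metrics(y_true, y_pred, y_score):
    """Binary-classification metrics; y_score is the predicted probability of class 1."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }


def regression_metrics(y_true, y_pred):
    """Regression metrics; MAPE and MPE are undefined when any y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rel_err = (y_true - y_pred) / y_true
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": mean_absolute_error(y_true, y_pred),
        "mape_percent": float(np.mean(np.abs(rel_err)) * 100),
        "mpe_percent": float(np.mean(rel_err) * 100),
        "r2": r2_score(y_true, y_pred),
    }


# Hypothetical toy labels, predictions, and scores (not results from any model)
clf_report = classification_metrics(
    y_true=[1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 0, 1, 1, 1],
    y_score=[0.9, 0.2, 0.4, 0.8, 0.6, 0.7],
)
reg_report = regression_metrics(y_true=[10, 12, 8, 15], y_pred=[11, 12, 6, 14])
print(clf_report)
print(reg_report)
```

For a fitted classifier, `y_pred` would typically come from `model.predict(X_test)` and `y_score` from `model.predict_proba(X_test)[:, 1]`.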
SHAP Implementation
 Install and import SHAP (pip install shap).
 Select an appropriate SHAP explainer (TreeExplainer, KernelExplainer, etc.).
 Compute SHAP values for the test set.
Generate and include:
 Summary plot – overall feature importance.
 Force plot – individual prediction explanation.
 Waterfall plot – step-by-step feature contribution.
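
Force and waterfall plots both visualize the same additive decomposition: a base value plus one contribution per feature equals the model's prediction for that row. The sketch below computes exact Shapley values by brute force for a hypothetical three-feature toy model to show that property. It is a simplification and not the SHAP library's algorithm: "absent" features are replaced by a single baseline value instead of being averaged over background data, and `toy_model`, `x`, and `baseline` are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row, using a fixed baseline for absent features."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest take the baseline
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical model with one linear term and one interaction term
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]          # row being explained
baseline = [0.0, 0.0, 0.0]   # reference input (assumption)

phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)
prediction = toy_model(x)

# A waterfall plot walks from base_value to prediction one contribution at a time
running = base_value
for i, contribution in enumerate(phi):
    running += contribution
    print(f"feature {i}: {contribution:+.2f} -> running total {running:.2f}")
```

The interaction term `z[1] * z[2]` is split evenly between the two features involved, which is why a single feature's contribution in such plots can depend on the values of the others.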
Result Interpretation
 Identify and explain the top 5 most influential features.
 Compare SHAP feature importance with the model’s built-in feature importance (if available).
 Discuss whether the results are meaningful in the chosen domain.
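
One way to make the comparison between SHAP importance and built-in importance concrete is to check how similarly the two methods rank the features. The sketch below uses a Spearman rank correlation (computed as a Pearson correlation on ranks) and a top-k overlap. The helper `compare_rankings`, the feature names, and the importance values are placeholders for illustration, not results from this notebook.

```python
import pandas as pd


def compare_rankings(table, col_a, col_b, k=3):
    """Return the Spearman rank correlation and the shared top-k features."""
    # Spearman correlation is the Pearson correlation of the rank-transformed values
    rho = table[col_a].rank().corr(table[col_b].rank())
    top_a = set(table.nlargest(k, col_a)["Feature"])
    top_b = set(table.nlargest(k, col_b)["Feature"])
    return rho, top_a & top_b


# Placeholder values only; in practice this would be the saved comparison table
toy_imp = pd.DataFrame({
    "Feature": ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"],
    "RF_importance": [0.30, 0.25, 0.20, 0.15, 0.10],
    "SHAP_importance": [0.12, 0.15, 0.05, 0.06, 0.02],
})

rho, shared_top3 = compare_rankings(toy_imp, "RF_importance", "SHAP_importance", k=3)
print(f"Spearman rho = {rho:.2f}; shared top-3 features: {sorted(shared_top3)}")
```

A high correlation suggests the two methods broadly agree; features that appear in only one top-k list are natural candidates for the domain discussion.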
Report Preparation
 Title Page – Assignment title, student name, roll number, date.
 Introduction – Problem statement and dataset overview.
 Dataset Description – Source, size, features, target variable.
 Preprocessing Steps – Cleaning and transformation details.
 Model & Performance – Algorithm choice, parameters, evaluation metrics.
 SHAP Analysis – Plots and explanations.
Conclusion – Key insights, limitations, and possible improvements.
Submission Requirements
 Python code file (.ipynb or .py).
 Dataset file (.csv).
 Report (.pdf) including SHAP plots and explanations.

In [1]:
# ==========================================
# SHAP Feature Importance Analysis (Safe Version)
# ==========================================
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import shap
import warnings
warnings.filterwarnings("ignore")

# -------------------------
# 1) Load dataset
# -------------------------
CSV_PATH = "/content/student_performance_cleaned.csv"  # change if needed

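# Try multiple separators; keep the first one that yields several columns
df = None
for sep in [",", ";", "\t"]:
    try:
        candidate = pd.read_csv(CSV_PATH, sep=sep)
    except Exception:
        continue
    if candidate.shape[1] > 2:  # at least a few columns
        df = candidate
        print(f"Loaded with sep='{sep}' → shape {df.shape}")
        break

# Fail loudly instead of continuing with an undefined or badly parsed frame
if df is None:
    raise ValueError(f"Could not parse {CSV_PATH} with ',', ';' or tab separators.")

print(df.head())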

# -------------------------
# 2) Target column
# -------------------------
if "pass" not in df.columns:
    if "G3" in df.columns:
        df["pass"] = (df["G3"] >= 10).astype(int)
    else:
        raise ValueError("Dataset must contain either 'pass' or 'G3' column.")

y = df["pass"].astype(int)
X = df.drop(columns=[c for c in ["G3", "pass"] if c in df.columns])

# -------------------------
# 3) Handle categorical & numeric
# -------------------------
categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
numeric_cols = [c for c in X.columns if c not in categorical_cols]

if categorical_cols:
    # scikit-learn >= 1.2 uses `sparse_output`; the older `sparse` argument was removed in 1.4
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    ct = ColumnTransformer(
        [("ohe", ohe, categorical_cols)], remainder="passthrough"
    )
    X_trans = ct.fit_transform(X)
    ohe_features = ct.named_transformers_["ohe"].get_feature_names_out(categorical_cols)
    feature_names = list(ohe_features) + numeric_cols
else:
    X_trans = X.to_numpy()
    feature_names = numeric_cols

print(f"X shape after transform: {X_trans.shape}")

# -------------------------
# 4) Train model
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X_trans, y, test_size=0.2, random_state=42, stratify=y if y.nunique() > 1 else None
)

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# -------------------------
# 5) Importances
# -------------------------
rf_importances = rf.feature_importances_

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_train)

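# Older SHAP releases return a list (one array per class); newer ones may return
# a single array of shape (n_samples, n_features, n_classes) for classifiers.
if isinstance(shap_values, list):
    shap_values = np.stack(shap_values, axis=-1)
shap_values = np.asarray(shap_values)

if shap_values.ndim == 3 and shap_values.shape[-1] == 2:
    shap_importance = np.abs(shap_values[:, :, 1]).mean(axis=0)  # positive class only
elif shap_values.ndim == 3:
    shap_importance = np.abs(shap_values).mean(axis=0).sum(axis=-1)  # sum over classes
else:
    shap_importance = np.abs(shap_values).mean(axis=0)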

# -------------------------
# 6) Combine & save
# -------------------------


# Flatten arrays to 1D just in case
rf_importances = np.ravel(rf_importances)
shap_importance = np.ravel(shap_importance)
feature_names = list(np.ravel(feature_names))

print(f"Lengths → features={len(feature_names)}, rf={len(rf_importances)}, shap={len(shap_importance)}")

# Align lengths safely
min_len = min(len(feature_names), len(rf_importances), len(shap_importance))
feat_imp = pd.DataFrame({
    "Feature": feature_names[:min_len],
    "RF_importance": rf_importances[:min_len],
    "SHAP_importance": shap_importance[:min_len]
}).sort_values("SHAP_importance", ascending=False).reset_index(drop=True)

print("\nTop 10 features by SHAP importance:")
print(feat_imp.head(10))

feat_imp.to_csv("feature_importance_comparison.csv", index=False)
print("\nSaved feature_importance_comparison.csv ✅")



Loaded with sep=',' → shape (649, 40)
   age  Medu  Fedu  traveltime  studytime  failures  famrel  freetime  goout  \
0   18     4   4.0           2          2         0       4         3      4   
1   17     1   1.0           1          2         0       5         3      3   
2   15     1   1.0           1          2         0       4         3      2   
3   15     4   2.0           1          3         0       3         2      2   
4   16     3   3.0           1          2         0       4         3      2   

   Dalc  ...  guardian_other  schoolsup_yes  famsup_yes  paid_yes  \
0     1  ...               0              1           0         0   
1     1  ...               0              0           1         0   
2     2  ...               0              1           0         0   
3     1  ...               0              0           1         0   
4     1  ...               0              0           1         0   

   activities_yes  nursery_yes  higher_yes  internet_yes  romantic

In [None]:
import pandas as pd

# Try reading with comma separator
try:
    df_comma = pd.read_csv("/content/student_performance_cleaned.csv", sep=",")
    print("DataFrame read with comma separator:")
    display(df_comma.head())
    print("\nColumns after reading with comma separator:")
    print(df_comma.columns)
except Exception as e:
    print(f"Error reading with comma separator: {e}")

print("-" * 30)

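# Try letting pandas detect the separator (sep=None requires the Python engine;
# calling read_csv without `sep` would simply default to a comma)
try:
    df_auto = pd.read_csv("/content/student_performance_cleaned.csv", sep=None, engine="python")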
    print("DataFrame read with auto-detected separator:")
    display(df_auto.head())
    print("\nColumns after reading with auto-detected separator:")
    print(df_auto.columns)
except Exception as e:
    print(f"Error reading with auto-detected separator: {e}")

DataFrame read with comma separator:


   age  Medu  Fedu  traveltime  studytime  failures  famrel  freetime  goout  \
0   18     4   4.0           2          2         0       4         3      4
1   17     1   1.0           1          2         0       5         3      3
2   15     1   1.0           1          2         0       4         3      2
3   15     4   2.0           1          3         0       3         2      2
4   16     3   3.0           1          2         0       4         3      2

   Dalc  ...  guardian_other  schoolsup_yes  famsup_yes  paid_yes  \
0     1  ...               0              1           0         0
1     1  ...               0              0           1         0
2     2  ...               0              1           0         0
3     1  ...               0              0           1         0
4     1  ...               0              0           1         0

   activities_yes  nursery_yes  higher_yes  internet_yes  romantic_yes  pass
0               0            1           1             0             0     1
1               0            0           1             1             0     1
2               0            1           1             1             0     1
3               1            1           1             1             1     1
4               0            1           1             0             0     1



Columns after reading with comma separator:
Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'school_MS',
       'sex_M', 'address_U', 'famsize_LE3', 'Pstatus_T', 'Mjob_health',
       'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_health',
       'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_home',
       'reason_other', 'reason_reputation', 'guardian_mother',
       'guardian_other', 'schoolsup_yes', 'famsup_yes', 'paid_yes',
       'activities_yes', 'nursery_yes', 'higher_yes', 'internet_yes',
       'romantic_yes', 'pass'],
      dtype='object')
------------------------------
DataFrame read with auto-detected separator:


   age  Medu  Fedu  traveltime  studytime  failures  famrel  freetime  goout  \
0   18     4   4.0           2          2         0       4         3      4
1   17     1   1.0           1          2         0       5         3      3
2   15     1   1.0           1          2         0       4         3      2
3   15     4   2.0           1          3         0       3         2      2
4   16     3   3.0           1          2         0       4         3      2

   Dalc  ...  guardian_other  schoolsup_yes  famsup_yes  paid_yes  \
0     1  ...               0              1           0         0
1     1  ...               0              0           1         0
2     2  ...               0              1           0         0
3     1  ...               0              0           1         0
4     1  ...               0              0           1         0

   activities_yes  nursery_yes  higher_yes  internet_yes  romantic_yes  pass
0               0            1           1             0             0     1
1               0            0           1             1             0     1
2               0            1           1             1             0     1
3               1            1           1             1             1     1
4               0            1           1             0             0     1



Columns after reading with auto-detected separator:
Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'school_MS',
       'sex_M', 'address_U', 'famsize_LE3', 'Pstatus_T', 'Mjob_health',
       'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_health',
       'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_home',
       'reason_other', 'reason_reputation', 'guardian_mother',
       'guardian_other', 'schoolsup_yes', 'famsup_yes', 'paid_yes',
       'activities_yes', 'nursery_yes', 'higher_yes', 'internet_yes',
       'romantic_yes', 'pass'],
      dtype='object')
