### Modeling Objectives
- Train and evaluate classification models to predict satisfaction.
- Use SHAP or LIME to explain key satisfaction drivers.
- Monitor performance using recall, F1, and AUC scores.


In [1]:
# Import libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, recall_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")  # Silence library warnings to keep the notebook output readable

# Import explainability libraries
import shap
import lime
from lime.lime_tabular import LimeTabularExplainer 





In [2]:
# df1 = pd.read_csv("cleaned.csv")
# df1.head()

In [3]:
# Load and prepare data
df = pd.read_csv('eda_incl.csv')  
df.head()

Unnamed: 0,Agency Name,Complaint Type,Descriptor,Borough,Resolution Description,Survey Year,Survey Month,Satisfaction Response,Dissatisfaction Reason,Justified Dissatisfaction,Cluster,Combined_Feedback,Sentiment Score,Sentiment Label
0,Department of Buildings,Adult Establishment,Zoning Violation,MANHATTAN,The Department of Buildings investigated this ...,2022,10,Strongly Agree,Not Applicable,Delays in inspections or provision of construc...,0,Delays in inspections or provision of construc...,0.0,neutral
1,Department of Buildings,Adult Establishment,Zoning Violation,BROOKLYN,The Department of Buildings investigated this ...,2024,11,Strongly Disagree,The Agency did not correct the issue.,Delays in inspections or provision of construc...,1,Delays in inspections or provision of construc...,0.0,neutral
2,Department for the Aging,Legal Services Provider Complaint,Not Provided,MANHATTAN,The Department for the Aging contacted you and...,2024,3,Strongly Agree,Not Applicable,Lack of timely support or services for senior ...,0,Lack of timely support or services for senior ...,0.1027,positive
3,Department of Buildings,Advertising Sign,Poster,MANHATTAN,The Department of Buildings reviewed this comp...,2024,2,Neutral,Not Applicable,Delays in inspections or provision of construc...,0,Delays in inspections or provision of construc...,0.0,neutral
4,Department of Buildings,Advertising Sign,Billboard,MANHATTAN,The Department of Buildings investigated this ...,2023,10,Strongly Disagree,"Status updates were unhelpful, inaccurate, inc...",Delays in inspections or provision of construc...,3,Delays in inspections or provision of construc...,0.0,neutral


In [4]:
# # Check for duplicate columns
# df1_cols = set(df1.columns)
# df2_cols = set(df2.columns)
# common_cols = df1_cols.intersection(df2_cols)
# print(f"Common columns: {common_cols}")


In [5]:
# # Combine df1 and df2 
# # Get unique columns from df2 that aren't in df1
# df2_unique_cols = [col for col in df2.columns if col not in df1.columns]

# # Merge on all common columns
# common_cols = list(set(df1.columns).intersection(set(df2.columns)))
# merged_df = pd.merge(df1, df2[common_cols + df2_unique_cols], on=common_cols, how='outer')

# merged_df.head()

In [6]:
# merged_df.shape

In [7]:
# Create binary target: 1 for satisfied (Strongly Agree, Agree), 0 for every other response (including Neutral)
df['Satisfied'] = df['Satisfaction Response'].isin(['Strongly Agree', 'Agree']).astype(int)

In [8]:
# Select model features: agency, complaint, location, survey date, cluster, and sentiment columns
features = ['Agency Name', 'Complaint Type', 'Borough', 'Survey Year', 
           'Survey Month', 'Cluster', 'Sentiment Score']

X = df[features].copy()
y = df['Satisfied']

In [9]:
# Encode categorical variables
le_dict = {}
for col in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    le_dict[col] = le
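
`LabelEncoder` maps each category to an arbitrary integer, which tree-based models tolerate but which a linear model such as Logistic Regression would read as an ordering. The block below is a minimal sketch of one-hot encoding as an alternative. The toy frame and its values are made up for illustration and only borrow two column names from `features`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for X; the values are invented.
toy = pd.DataFrame({
    "Borough": ["MANHATTAN", "BROOKLYN", "QUEENS", "BROOKLYN"],
    "Survey Month": [10, 11, 3, 2],
})

# One-hot encode the categorical column and pass numeric columns through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Borough"])],
    remainder="passthrough",
)
encoded = encoder.fit_transform(toy)

# Three borough indicator columns plus one numeric column.
print(encoded.shape)
```

For high-cardinality columns such as `Complaint Type`, one-hot encoding can produce many sparse columns, so the choice depends on which model is being trained.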

In [10]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [11]:
# Train a Random Forest baseline (tree ensembles need no feature scaling and accept label-encoded categoricals)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
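
A single train/test split gives one estimate of performance. `cross_val_score` is imported above but never called, so here is a minimal sketch of how stratified cross-validation could give a sense of the variance. It runs on synthetic data from `make_classification` (an assumption, since `eda_incl.csv` is not available here); in the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 7 features, roughly one third positive class.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4,
    weights=[0.67, 0.33], random_state=42,
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# One F1 score per fold; the spread hints at how stable the estimate is.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```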

In [24]:
# Evaluate
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)[:, 1]

print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")

Recall: 0.983
F1 Score: 0.926
AUC Score: 0.971


In [23]:
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Satisfied', 'Satisfied']))

Classification Report:
               precision    recall  f1-score   support

Not Satisfied       0.99      0.93      0.96     48544
    Satisfied       0.88      0.98      0.93     24394

     accuracy                           0.95     72938
    macro avg       0.93      0.96      0.94     72938
 weighted avg       0.95      0.95      0.95     72938
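
The report shows 0.88 precision for the Satisfied class, so roughly 12% of the responses predicted as satisfied were actually not. `confusion_matrix` is imported above but unused; the sketch below shows how it breaks predictions into raw counts. The label arrays are invented for illustration, and in the notebook `y_test` and `y_pred` would replace them.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = Not Satisfied, 1 = Satisfied.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```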



In [13]:
# Feature importance
feat_importances = pd.Series(clf.feature_importances_, index=features)
print("\nTop Features:")
print(feat_importances.nlargest(5))


Top Features:
Cluster            0.907347
Complaint Type     0.035440
Survey Month       0.017973
Sentiment Score    0.015016
Borough            0.009906
dtype: float64
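
`feature_importances_` is an impurity-based measure computed on the training data, and it can favor features with many distinct values. With `Cluster` at about 0.91, it may also be worth checking how the clusters were built: if they were derived from the survey responses themselves (an assumption, since the notebook does not say), the feature could be leaking the target. The sketch below uses scikit-learn's `permutation_importance` on held-out data as a cross-check. It is a swapped-in technique, not the notebook's method, and it runs on synthetic data with hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=42,
)
names = [f"feature_{i}" for i in range(5)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo,
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test set and record the average drop in AUC.
result = permutation_importance(
    rf, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42,
)
perm_importances = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)
print(perm_importances)
```

If one feature still dominates under permutation, retraining without it shows how much of the reported recall and AUC depends on that column alone.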


### Key Steps Summary:

1. **Data Preparation**: Load data, handle categorical variables, split into train/test
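2. **Model Training**: Train a Random Forest classifier (100 trees). Logistic Regression and Gradient Boosting are imported for comparison but have not been trained yet.
3. **Performance Evaluation**: Score the model on the held-out 20% test set using recall (0.983), F1 (0.926), and AUC (0.971).
4. **Model Explanation**: Rank features by Random Forest importance, where `Cluster` dominates at 0.907. SHAP (global) and LIME (local) explanations are imported but not yet run.
5. **Performance Monitoring**: Track recall, F1, and AUC across retraining runs. No monitoring visualizations have been produced yet.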
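
A minimal sketch of how the three model families named in step 2 could be compared on the metrics from step 3. It uses synthetic data as a stand-in (an assumption), and the names `models` and `comparison` are hypothetical. Logistic Regression is wrapped in a pipeline with `StandardScaler` because it is sensitive to feature scale, while the tree ensembles are not.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and binary target.
X_demo, y_demo = make_classification(
    n_samples=800, n_features=7, n_informative=4, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo,
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    rows.append({
        "model": name,
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auc": roc_auc_score(y_te, proba),
    })

# One row per model, one column per metric.
comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```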

**Next Steps:**
- Train Logistic Regression and Gradient Boosting on the same split and compare them with the Random Forest
- Check whether `Cluster` leaks the target, given that it accounts for 0.907 of the feature importance
- Consider hyperparameter tuning for the best-performing model
- Analyze SHAP/LIME results to identify key satisfaction drivers