# XAI Pipeline - Google Colab Notebook
## Explainable AI for AWS CloudTrail Analysis

This notebook runs the XAI pipeline without Ollama/local LLMs.
Focus: **Quantitative validation of XAI techniques (SHAP, LIME)**

## 1. Setup & Installation

In [None]:
# Install dependencies
!pip install -q shap lime sentence-transformers transformers joblib scikit-learn scipy tensorflow

In [None]:
# Mount Google Drive (if using Drive for models)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Clone your GitHub repository
!git clone https://github.com/anuradha-gh/xai-saas-access-control.git
%cd xai-saas-access-control

## 2. Configure Paths (Update These!)

In [None]:
# Option A: If models are in Google Drive
MODEL_PATH = '/content/drive/MyDrive/SAAS_XAI/'  # Update this path

CONFIG = {
    'c1_autoencoder': MODEL_PATH + 'autoencoder.h5',
    'c1_iso_forest': MODEL_PATH + 'isolation_forest.joblib',
    'c2_bert_path': MODEL_PATH + 'trained_role_classifier/checkpoint-15000',
    'c3_sbert_path': MODEL_PATH + 'c3_unsupervised_aws_model/sbert_model',
    'c3_iso_forest': MODEL_PATH + 'c3_unsupervised_aws_model/isolation_forest.joblib',
    'log_data': MODEL_PATH + 'flaws_cloudtrail00.json'
}

# Option B: Upload models directly to Colab (for small files)
# from google.colab import files
# uploaded = files.upload()
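Before loading anything, it can save time to confirm that every configured path exists, since a typo in `MODEL_PATH` otherwise only surfaces several cells later. A minimal sketch, assuming a hypothetical helper `find_missing_paths` (not part of the repository) and a dictionary shaped like `CONFIG` above:

```python
import os

def find_missing_paths(config):
    """Return the config keys whose paths do not exist on disk.

    Hypothetical helper for illustration; not part of the repository.
    """
    return [key for key, path in config.items() if not os.path.exists(path)]

# Usage in the notebook (CONFIG is defined in the cell above):
# missing = find_missing_paths(CONFIG)
# if missing:
#     print("Missing paths for:", missing)
```

Both files and directories (such as the BERT checkpoint folder) pass the `os.path.exists` check, so the same helper covers every entry.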

## 3. Configure XAI (No LLM Mode)

In [None]:
# Disable LLM for Colab
USE_LLM = False  # No Ollama needed
LOCAL_MODEL_NAME = None

XAI_CONFIG = {
    'enable_xai': True,
    'default_stakeholder': 'technical',  # Doesn't matter without LLM
    'enable_validation': True,  # IMPORTANT: Enable validation
    'num_shap_samples': 100,
    'num_lime_samples': 1000,
}

print("✅ XAI configured for Colab (LLM disabled)")

## 4. Import XAI Modules

In [None]:
import sys
import json
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Import XAI modules
from xai_explainer import XAIExplainerFactory, SHAPExplainer, LIMETextExplainer
from llm_translator import LLMTranslator, StakeholderType
from xai_validator import XAIValidator

print("✅ XAI modules imported")

## 5. Load Models

In [None]:
import tensorflow as tf
from tensorflow.keras.models import load_model
import joblib
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

state = {'models': {}, 'preprocessors': {}}

print("📂 Loading CloudTrail data for preprocessing...")
with open(CONFIG['log_data'], 'r') as f:
    log_data = json.load(f)

# Helper functions (from XAI.py)
def parse_log_c1(record):
    return {
        'eventName': record.get('eventName', 'Unknown'),
        'eventSource': record.get('eventSource', 'Unknown').split('.')[0],
        'userIdentityType': record.get('userIdentity', {}).get('type', 'Unknown'),
        'awsRegion': record.get('awsRegion', 'Unknown'),
    }

# Prepare preprocessing
records = log_data.get('Records', [])[:1000]  # Use subset for Colab
df = pd.DataFrame([parse_log_c1(r) for r in records])
cat_features = ['eventName', 'eventSource', 'userIdentityType', 'awsRegion']
for col in cat_features:
    df[col] = df[col].fillna('Unknown')

# C1 Preprocessing
print("🔧 Fitting C1 preprocessors...")
preprocessor = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)], remainder='passthrough')
X_processed = preprocessor.fit_transform(df).toarray()
scaler = StandardScaler()
scaler.fit(X_processed)
state['preprocessors']['c1_prep'] = preprocessor
state['preprocessors']['c1_scaler'] = scaler

# Load C1 Models
print("🧠 Loading C1 Autoencoder...")
state['models']['c1_ae'] = load_model(CONFIG['c1_autoencoder'])
print("🌲 Loading C1 IsolationForest...")
state['models']['c1_if'] = joblib.load(CONFIG['c1_iso_forest'])
dense_layers = [l for l in state['models']['c1_ae'].layers if isinstance(l, tf.keras.layers.Dense)]
bottleneck = min(dense_layers, key=lambda l: l.units)
state['models']['c1_enc'] = tf.keras.models.Model(inputs=state['models']['c1_ae'].input, outputs=bottleneck.output)

# Load C2 Models
print("🤖 Loading C2 BERT model...")
tokenizer = AutoTokenizer.from_pretrained(CONFIG['c2_bert_path'])
c2_model = AutoModelForSequenceClassification.from_pretrained(CONFIG['c2_bert_path'])
state['models']['c2_pipe'] = pipeline("text-classification", model=c2_model, tokenizer=tokenizer, return_all_scores=True)

# Load C3 Models
print("📝 Loading C3 Sentence-BERT...")
state['models']['c3_sbert'] = SentenceTransformer(CONFIG['c3_sbert_path'])
print("🌲 Loading C3 IsolationForest...")
state['models']['c3_if'] = joblib.load(CONFIG['c3_iso_forest'])

print("\n✅ All models loaded successfully!")
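One thing worth checking at this point: the one-hot encoder is refit here on a 1,000-record subset, so the number of encoded columns depends on which category values that subset happens to contain, while the saved autoencoder expects a fixed input width. The sketch below shows the idea with a hypothetical helper `one_hot_width` (not part of the repository); it assumes rows are flat dicts like those returned by `parse_log_c1`, and the sample rows are made up.

```python
def one_hot_width(rows, columns):
    """Number of columns a plain one-hot encoding of `columns` would produce."""
    return sum(len({row[col] for row in rows}) for col in columns)

# Made-up rows for illustration
rows = [
    {'eventName': 'ListBuckets', 'awsRegion': 'us-west-2'},
    {'eventName': 'DeleteBucket', 'awsRegion': 'us-west-2'},
    {'eventName': 'ListBuckets', 'awsRegion': 'us-east-1'},
]
width = one_hot_width(rows, ['eventName', 'awsRegion'])
print(width)  # 2 event names + 2 regions = 4

# In the notebook, a comparable check (assuming the Keras model exposes
# `input_shape`) could be:
# expected = state['models']['c1_ae'].input_shape[-1]
# assert X_processed.shape[1] == expected, (X_processed.shape[1], expected)
```

If the widths differ, the autoencoder will reject the input, and the usual remedy is to reuse the preprocessor that was fitted at training time rather than refitting it.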

## 6. Initialize XAI Pipeline

In [None]:
print("🔧 Initializing XAI Pipeline...")

# Prepare background data for SHAP (latent + MSE features)
X_scaled = scaler.transform(X_processed[:100])
latent = state['models']['c1_enc'].predict(X_scaled, verbose=0)
recon = state['models']['c1_ae'].predict(X_scaled, verbose=0)
mse = np.mean((X_scaled - recon)**2, axis=1).reshape(-1, 1)
c1_features_for_shap = np.hstack([latent, mse])

# Create explainer factory
state['xai_explainer'] = XAIExplainerFactory(state['models'], state['preprocessors'])
background_data = {'c1_features': c1_features_for_shap}
state['xai_explainer'].initialize(background_data)

# Create LLM translator (template mode)
state['llm_translator'] = LLMTranslator(use_ollama=False)

# Create validator
state['xai_validator'] = XAIValidator()

print("✅ XAI Pipeline Ready!")
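The background matrix passed to SHAP has one row per log record: the encoder's latent vector followed by a single reconstruction-error column. The sketch below reproduces that layout with NumPy, using random arrays as stand-ins for the encoder and autoencoder outputs (the sizes are illustrative assumptions, not the real model dimensions).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_inputs, n_latent = 5, 12, 3  # illustrative sizes only

X_scaled = rng.normal(size=(n_samples, n_inputs))          # stand-in for scaled inputs
latent = rng.normal(size=(n_samples, n_latent))            # stand-in for encoder output
recon = X_scaled + 0.1 * rng.normal(size=X_scaled.shape)   # stand-in for reconstruction

# Per-row mean squared reconstruction error, kept as a column vector
mse = np.mean((X_scaled - recon) ** 2, axis=1).reshape(-1, 1)

# Final feature matrix: latent dimensions followed by the MSE column
features = np.hstack([latent, mse])
print(features.shape)  # (5, 4)
```

The `reshape(-1, 1)` matters: without it `mse` is one-dimensional and `np.hstack` cannot append it as a column.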

## 7. Test XAI on Example Log

In [None]:
# Example CloudTrail log
test_log = {
    "eventTime": "2017-02-12T21:30:56Z",
    "eventSource": "s3.amazonaws.com",
    "eventName": "DeleteBucket",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "AWS Internal",
    "userIdentity": {
        "type": "Root",
        "userName": "root_account"
    }
}

# Prepare features
p_log = parse_log_c1(test_log)
processed = state['preprocessors']['c1_prep'].transform(pd.DataFrame([p_log])).toarray()
scaled = state['preprocessors']['c1_scaler'].transform(processed)

# Extract latent + MSE
latent = state['models']['c1_enc'].predict(scaled, verbose=0)
recon = state['models']['c1_ae'].predict(scaled, verbose=0)
mse = np.mean((scaled - recon)**2, axis=1).reshape(-1, 1)
features_for_if = np.hstack([latent, mse])

# Get XAI explanation
xai_result = state['xai_explainer'].explain_c1(features_for_if, p_log)

print("\n📊 XAI EXPLANATION:")
print("="*70)
if 'shap' in xai_result:
    print("\n🔍 SHAP Feature Importance:")
    for feat in xai_result['shap']['feature_importance'][:5]:
        print(f"  - {feat['feature']}: {feat['shap_value']:.4f} (importance: {feat['importance']:.4f})")

if 'reconstruction' in xai_result:
    print("\n🔧 Reconstruction Errors:")
    for feat in xai_result['reconstruction']['feature_errors'][:3]:
        print(f"  - {feat['feature']}: {feat['error']:.4f} (value: {feat['original_value']})")

## 8. Run XAI Validation Suite

In [None]:
print("\n🔬 RUNNING XAI VALIDATION SUITE")
print("="*70)

# Prepare test data
test_records = log_data.get('Records', [])[:50]
df_test = pd.DataFrame([parse_log_c1(r) for r in test_records])
for col in cat_features:
    df_test[col] = df_test[col].fillna('Unknown')

X_test_processed = state['preprocessors']['c1_prep'].transform(df_test).toarray()
X_test_scaled = state['preprocessors']['c1_scaler'].transform(X_test_processed)

# Extract latent + MSE for test data
latent_test = state['models']['c1_enc'].predict(X_test_scaled, verbose=0)
recon_test = state['models']['c1_ae'].predict(X_test_scaled, verbose=0)
mse_test = np.mean((X_test_scaled - recon_test)**2, axis=1).reshape(-1, 1)
c1_test_features = np.hstack([latent_test, mse_test])

test_data = {'c1': c1_test_features}

# Run validation
report = state['xai_validator'].validate_all(state['models'], state['xai_explainer'], test_data)

# Display results
print(state['xai_validator'].generate_report())
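The stability metric reported by the validator is described in the summary as based on Jaccard similarity. One common way to apply it, sketched below, is to compare the top-k features of two explanations for nearly identical inputs. Whether `XAIValidator` computes it exactly this way is an assumption; `top_k_features` and `jaccard` are hypothetical helpers, and the feature names and values are made up.

```python
def top_k_features(shap_by_feature, k):
    """Names of the k features with the largest absolute attribution."""
    ranked = sorted(shap_by_feature, key=lambda name: abs(shap_by_feature[name]), reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Illustrative attributions for an input and a slightly perturbed copy
original = {'latent_0': 0.40, 'latent_1': -0.25, 'latent_2': 0.05, 'mse': 0.30}
perturbed = {'latent_0': 0.38, 'latent_1': -0.02, 'latent_2': 0.10, 'mse': 0.33}

score = jaccard(top_k_features(original, 3), top_k_features(perturbed, 3))
print(score)  # 0.5: two shared features out of four distinct ones
```

A score near 1.0 means small input changes leave the top-ranked features unchanged; a low score suggests the explanation is unstable.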

## 9. Visualize SHAP Values

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Get SHAP values for multiple instances
explainer = state['xai_explainer'].get_explainer('c1_shap')
shap_results = []

for i in range(min(10, len(c1_test_features))):
    result = explainer.explain(c1_test_features[i:i+1])
    shap_results.append(result)

# Extract feature names and values
feature_names = [f['feature'] for f in shap_results[0]['feature_importance']]
shap_values = np.array([[f['shap_value'] for f in r['feature_importance']] for r in shap_results])

# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_names, np.mean(np.abs(shap_values), axis=0))
plt.xlabel('Mean Absolute SHAP Value')
plt.title('Feature Importance for C1 Anomaly Detection')
plt.tight_layout()
plt.show()

print("\n✅ Visualization complete!")
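The plot aggregates with the mean of absolute SHAP values rather than the plain mean because attributions of opposite sign would otherwise cancel out. A small illustration with made-up numbers:

```python
import numpy as np

# Rows are instances, columns are features (made-up attributions)
shap_values = np.array([
    [ 0.5, 0.1],
    [-0.5, 0.1],
])

signed_mean = np.mean(shap_values, axis=0)       # first feature cancels to 0
mean_abs = np.mean(np.abs(shap_values), axis=0)  # first feature stays at 0.5

print(signed_mean)
print(mean_abs)
```

With the signed mean, the first feature would look irrelevant even though it has the largest effect on every individual instance.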

## 10. Save Validation Report

In [None]:
# Save to JSON
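with open('validation_report_colab.json', 'w') as f:
    # default= converts NumPy scalars/arrays, which json cannot serialize natively
    json.dump(report, f, indent=2, default=lambda o: o.tolist() if hasattr(o, 'tolist') else str(o))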

# Download to local machine
from google.colab import files
files.download('validation_report_colab.json')

print("✅ Validation report saved and downloaded!")

## Summary

This notebook demonstrates:
- ✅ Running XAI pipeline in Colab **without Ollama**
- ✅ **SHAP explanations** with feature importance
- ✅ **Validation metrics** (fidelity & stability)
- ✅ **Visualization** of results
- ✅ Works with free Colab GPU

**Key Insight**: You don't need an LLM for XAI validation! The numerical metrics (SHAP values, perturbation sensitivity, Jaccard similarity) are the **ground truth** for XAI quality.
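To make the perturbation-sensitivity idea concrete: if an explanation is faithful, replacing the features it ranks highest should change the model's score more than replacing the features it ranks lowest. The sketch below uses a toy linear scorer in place of the real IsolationForest, and the helper name `perturbation_effect` is hypothetical; the repository's `XAIValidator` may define the metric differently.

```python
import numpy as np

def perturbation_effect(score_fn, x, baseline, feature_idx):
    """Absolute change in score when the chosen features are set to baseline values."""
    x_perturbed = x.copy()
    x_perturbed[feature_idx] = baseline[feature_idx]
    return abs(score_fn(x) - score_fn(x_perturbed))

weights = np.array([2.0, -1.0, 0.1, 0.0])    # toy model: score = w . x

def score_fn(v):
    return float(weights @ v)

x = np.array([1.0, 1.0, 1.0, 1.0])
baseline = np.zeros(4)
attributions = weights * (x - baseline)      # attributions of a linear model relative to the baseline

order = np.argsort(-np.abs(attributions))    # most important first
top_effect = perturbation_effect(score_fn, x, baseline, order[:2])
bottom_effect = perturbation_effect(score_fn, x, baseline, order[-2:])
print(top_effect > bottom_effect)  # True
```

Here removing the two top-ranked features shifts the score by about 1.0, while removing the two bottom-ranked ones shifts it by about 0.1, which is the pattern a faithful explanation should produce.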