# Violence Classification Model

**Objective:** Build a classification model that predicts whether a crime incident is violent or non-violent

**Purpose:** Support resource allocation decisions by identifying violent incident patterns

**Methodology:**
- Binary classification: Violent (1) vs Non-Violent (0)
- Time-aware validation to prevent data leakage
- Feature importance analysis using built-in methods and SHAP
- Model card documenting performance and limitations

**Requirements:** FORECAST-02

## 1. Setup and Configuration

In [None]:
# Standard library imports
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
)
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgb
import shap

# Ensure the repo root is on sys.path for project imports.
# Assumes the notebook runs from a directory one level below the repo root.
repo_root = Path.cwd().parent

sys.path.insert(0, str(repo_root))

# Project imports
from analysis.config import CRIME_DATA_PATH, REPORTS_DIR
from analysis.utils import load_data, classify_crime_category
from analysis.models.classification import (
    create_time_aware_split,
    train_random_forest,
    train_xgboost,
    extract_feature_importance,
    evaluate_classifier,
)
from analysis.models.validation import create_model_card, validate_temporal_split

# Ensure reports directory exists
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Plot settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('Set2')

print(f"✓ Notebook initialized")
print(f"✓ Random seed: {RANDOM_SEED}")
print(f"✓ Data path: {CRIME_DATA_PATH}")
print(f"✓ Reports directory: {REPORTS_DIR}")

## 2. Data Loading and Preparation

Load crime data and create binary target variable for violent vs non-violent classification.

In [None]:
# Load crime data
df = load_data(clean=True)

# Classify crimes into categories
df = classify_crime_category(df)

# Create binary target: Violent (1) vs Non-Violent (0)
df['is_violent'] = (df['crime_category'] == 'Violent').astype(int)

# Sort by date to maintain temporal order (critical for time-aware validation)
df = df.sort_values('dispatch_date').reset_index(drop=True)

print(f"Total incidents: {len(df):,}")
print(f"Date range: {df['dispatch_date'].min()} to {df['dispatch_date'].max()}")
print(f"\nClass distribution:")
print(df['is_violent'].value_counts())
print(f"\nViolent crime percentage: {df['is_violent'].mean() * 100:.2f}%")

### Exploratory Analysis: Class Distribution

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

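# Bar plot (value_counts sorts by frequency, so reindex to match the label order below)
class_counts = df['is_violent'].value_counts().reindex([0, 1], fill_value=0)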
axes[0].bar(['Non-Violent', 'Violent'], class_counts.values, color=['#457B9D', '#E63946'])
axes[0].set_ylabel('Count')
axes[0].set_title('Class Distribution: Violent vs Non-Violent')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(class_counts.values):
    axes[0].text(i, v, f'{v:,}', ha='center', va='bottom')

# Time series of violent crime percentage
monthly_violent = df.groupby(df['dispatch_date'].dt.to_period('M'))['is_violent'].mean() * 100
monthly_violent.index = monthly_violent.index.to_timestamp()
axes[1].plot(monthly_violent.index, monthly_violent.values, linewidth=1.5, color='#E63946')
axes[1].axhline(df['is_violent'].mean() * 100, color='gray', linestyle='--', alpha=0.5, label='Overall mean')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Violent Crime %')
axes[1].set_title('Violent Crime Percentage Over Time (Monthly)')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(REPORTS_DIR / '04_classification_class_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Class distribution visualization saved")

## 3. Feature Engineering

Create features suitable for classification while preserving temporal order.

In [None]:
# Extract datetime features
df['year'] = df['dispatch_date'].dt.year
df['month'] = df['dispatch_date'].dt.month
df['day_of_week'] = df['dispatch_date'].dt.dayofweek
df['hour'] = df['dispatch_date'].dt.hour
df['day_of_year'] = df['dispatch_date'].dt.dayofyear
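# isocalendar() returns a nullable UInt32 column; cast to a plain integer dtype for modeling
df['week_of_year'] = df['dispatch_date'].dt.isocalendar().week.astype(int)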
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

# Time of day categories (left-closed bins: Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23)
df['time_of_day'] = pd.cut(
    df['hour'],
    bins=[0, 6, 12, 18, 24],
    labels=['Night', 'Morning', 'Afternoon', 'Evening'],
    right=False
)

# District features (if available)
if 'dc_dist' in df.columns:
    # Clean district codes
    df['district'] = pd.to_numeric(df['dc_dist'], errors='coerce').fillna(0).astype(int)
else:
    df['district'] = 0

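# UCR code features (kept for reference only; they are left out of the model inputs below,
# since the violent/non-violent label is itself a crime-type category and they would likely leak it)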
if 'ucr_general' in df.columns:
    df['ucr_code'] = pd.to_numeric(df['ucr_general'], errors='coerce').fillna(0).astype(int)
    df['ucr_category'] = (df['ucr_code'] // 100).astype(int)  # Hundred-bands
else:
    df['ucr_code'] = 0
    df['ucr_category'] = 0

# Location features (if available)
if 'point_x' in df.columns and 'point_y' in df.columns:
    df['location_x'] = pd.to_numeric(df['point_x'], errors='coerce').fillna(0)
    df['location_y'] = pd.to_numeric(df['point_y'], errors='coerce').fillna(0)
else:
    df['location_x'] = 0
    df['location_y'] = 0

print("✓ Temporal, location, and crime-type columns created")
print("\nFeature summary:")
print("  - Temporal: year, month, day_of_week, hour, day_of_year, week_of_year, is_weekend, time_of_day")
print("  - Location: district, location_x, location_y")
print("  - Crime type (not used as model inputs): ucr_code, ucr_category")

In [None]:
# Select features for modeling
feature_columns = [
    'year',
    'month',
    'day_of_week',
    'hour',
    'day_of_year',
    'week_of_year',
    'is_weekend',
    'district',
    'location_x',
    'location_y',
]

# Encode categorical time_of_day
time_dummies = pd.get_dummies(df['time_of_day'], prefix='time', drop_first=True, dtype=int)
df = pd.concat([df, time_dummies], axis=1)
feature_columns.extend(time_dummies.columns.tolist())

# Prepare feature matrix and target
X = df[feature_columns].copy()
y = df['is_violent'].copy()

# Add dispatch_date as index for temporal split
X.index = df['dispatch_date']
y.index = df['dispatch_date']

print(f"Feature matrix shape: {X.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")
print(f"\nFeatures: {list(X.columns)}")
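
Where boundary hours land depends on how `pd.cut` closes its intervals (right-closed by default). The following is a minimal sketch on toy hours, assuming left-closed bins are wanted so that hour 6 counts as Morning; it also shows which category `drop_first=True` removes. The toy `hours` series is illustrative and not taken from the crime data.

```python
import pandas as pd

# Toy hours covering every bin edge (0, 6, 12, 18) plus the last hour of the day
hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23], name="hour")

# right=False makes each bin left-closed: [0, 6), [6, 12), [12, 18), [18, 24)
time_of_day = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    labels=["Night", "Morning", "Afternoon", "Evening"],
    right=False,
)

# drop_first=True drops the first category ("Night"), which becomes the
# reference level: a row whose three dummies are all 0 is a night-time row
dummies = pd.get_dummies(time_of_day, prefix="time", drop_first=True, dtype=int)

print(pd.concat([hours, time_of_day.rename("time_of_day"), dummies], axis=1))
```

With right-closed bins instead, hours 6, 12, and 18 would each fall into the earlier category.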

## 4. Time-Aware Train/Test Split

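Split chronologically (no shuffling): the earliest 80% of incidents train the model and the most recent 20% test it, so no future information leaks into training.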

In [None]:
# Create time-aware split (80/20)
X_train, X_test, y_train, y_test = create_time_aware_split(
    X, y, test_size=0.2, ensure_sorted=True
)

# Validate temporal split
split_validation = validate_temporal_split(
    pd.Series(X_train.index),
    pd.Series(X_test.index),
    min_gap_days=0
)

print("Time-aware split created:")
print(f"  Training period: {X_train.index.min()} to {X_train.index.max()}")
print(f"  Testing period: {X_test.index.min()} to {X_test.index.max()}")
print(f"  Train size: {len(X_train):,} ({len(X_train)/len(X)*100:.1f}%)")
print(f"  Test size: {len(X_test):,} ({len(X_test)/len(X)*100:.1f}%)")
print(f"\nTrain class distribution: {y_train.value_counts().to_dict()}")
print(f"Test class distribution: {y_test.value_counts().to_dict()}")
print(f"\nValidation:")
print(f"  Temporal order preserved: {split_validation['valid_temporal_order']}")
print(f"  Gap between train and test: {split_validation['gap_days']} days")
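
The idea behind the split can be sketched on toy data. `chronological_split` below is a hypothetical helper written for illustration; it is not the project's `create_time_aware_split`, whose internals are not shown in this notebook.

```python
import numpy as np
import pandas as pd


def chronological_split(X, y, test_size=0.2):
    """Hypothetical helper: earliest rows train, latest rows test, no shuffling."""
    # Sort positionally so duplicate timestamps in the index cause no alignment issues
    order = np.argsort(X.index.values, kind="stable")
    X_sorted, y_sorted = X.iloc[order], y.iloc[order]
    n_test = int(round(len(X_sorted) * test_size))
    cut = len(X_sorted) - n_test
    return X_sorted.iloc[:cut], X_sorted.iloc[cut:], y_sorted.iloc[:cut], y_sorted.iloc[cut:]


dates = pd.date_range("2023-01-01", periods=10, freq="D")
X = pd.DataFrame({"hour": range(10)}, index=dates)
y = pd.Series([0, 1, 0, 0, 1, 0, 0, 1, 0, 1], index=dates)

# Shuffle the rows first to show that the split does not rely on input order
shuffled = X.sample(frac=1, random_state=0).index
X_train, X_test, y_train, y_test = chronological_split(X.loc[shuffled], y.loc[shuffled])

gap = X_test.index.min() - X_train.index.max()
print(f"train ends {X_train.index.max().date()}, test starts {X_test.index.min().date()}")
print(f"gap: {gap.days} day(s)")
```

One caveat with any cut by row position: incidents sharing the boundary timestamp can land on both sides of the split, which is why a reported gap of 0 days is possible on real data.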

## 5. Model Training: Random Forest

Train a Random Forest classifier on the chronologically earlier split and evaluate it on the held-out later period.

In [None]:
# Train Random Forest model
print("Training Random Forest classifier...")

rf_model, rf_scaler = train_random_forest(
    X_train,
    y_train,
    n_estimators=200,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=RANDOM_SEED,
    scale_features=True,
)

print("✓ Random Forest model trained")

In [None]:
# Make predictions on test set
X_test_scaled = rf_scaler.transform(X_test) if rf_scaler else X_test
y_pred_rf = rf_model.predict(X_test_scaled)
y_prob_rf = rf_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate Random Forest
rf_metrics = evaluate_classifier(
    y_test,
    y_pred_rf,
    y_prob_rf,
    target_names=['Non-Violent', 'Violent']
)

print("\n" + "="*60)
print("RANDOM FOREST PERFORMANCE")
print("="*60)
print(f"\nROC-AUC Score: {rf_metrics['roc_auc']:.4f}")
print(f"\nClassification Report:")
print(pd.DataFrame(rf_metrics['classification_report']).T)
print(f"\nConfusion Matrix:")
print(rf_metrics['confusion_matrix'])
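
A single hold-out period gives one estimate of out-of-time performance. As a complement, the sketch below shows expanding-window cross-validation with scikit-learn's `TimeSeriesSplit` (imported in the setup cell). It runs on synthetic data and assumes rows are already in chronological order; the fold count and model settings are illustrative, not tuned values from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)

# Synthetic data standing in for time-ordered incidents (assumption: sorted by date)
n = 600
X_toy = rng.normal(size=(n, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=n) > 0.8).astype(int)

tscv = TimeSeriesSplit(n_splits=4)
fold_aucs = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_toy), start=1):
    # Every validation row comes after every training row
    assert train_idx.max() < val_idx.min()
    model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    prob = model.predict_proba(X_toy[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_toy[val_idx], prob))
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, AUC={fold_aucs[-1]:.3f}")

print(f"mean AUC across folds: {np.mean(fold_aucs):.3f}")
```

The spread of AUC across folds gives a rough sense of how stable performance is over time, which a single 80/20 split cannot show.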

## 6. Model Training: XGBoost

Train an XGBoost classifier for comparison.

In [None]:
# Train XGBoost model
print("Training XGBoost classifier...")

xgb_model, xgb_scaler = train_xgboost(
    X_train,
    y_train,
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=RANDOM_SEED,
    scale_features=False,  # XGBoost doesn't need scaling
)

print("✓ XGBoost model trained")

In [None]:
# Make predictions on test set
y_pred_xgb = xgb_model.predict(X_test)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]

# Evaluate XGBoost
xgb_metrics = evaluate_classifier(
    y_test,
    y_pred_xgb,
    y_prob_xgb,
    target_names=['Non-Violent', 'Violent']
)

print("\n" + "="*60)
print("XGBOOST PERFORMANCE")
print("="*60)
print(f"\nROC-AUC Score: {xgb_metrics['roc_auc']:.4f}")
print(f"\nClassification Report:")
print(pd.DataFrame(xgb_metrics['classification_report']).T)
print(f"\nConfusion Matrix:")
print(xgb_metrics['confusion_matrix'])
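
The `scale_features=False` choice rests on the fact that tree-based models split on thresholds, so a monotonic rescaling of a feature leaves the resulting partitions unchanged. The sketch below illustrates this with a scikit-learn `DecisionTreeClassifier` as a lightweight stand-in for XGBoost; the toy features and the deterministic labelling rule are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Toy integer features on different scales (hour of day, district id)
hour = rng.integers(0, 24, size=500)
district = rng.integers(1, 40, size=500)
X_raw = np.column_stack([hour, district]).astype(float)

# Illustrative deterministic rule, not a real crime pattern
y_toy = ((hour >= 20) | (district > 30)).astype(int)

X_std = StandardScaler().fit_transform(X_raw)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_raw, y_toy)
tree_std = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_std, y_toy)

# Both trees recover the same partitions, so their predictions match
same_predictions = bool(np.array_equal(tree_raw.predict(X_raw), tree_std.predict(X_std)))
print("identical predictions:", same_predictions)
```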

## 7. Feature Importance Analysis

Extract and visualize feature importance from both models.

In [None]:
# Extract feature importance from Random Forest
rf_importance = extract_feature_importance(
    rf_model,
    feature_names=X_train.columns.tolist(),
    top_n=15
)

# Extract feature importance from XGBoost
xgb_importance = extract_feature_importance(
    xgb_model,
    feature_names=X_train.columns.tolist(),
    top_n=15
)

print("Top 10 Features (Random Forest):")
print(rf_importance.head(10))
print("\nTop 10 Features (XGBoost):")
print(xgb_importance.head(10))

In [None]:
# Visualize feature importance comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Random Forest importance
axes[0].barh(range(len(rf_importance)), rf_importance['importance'], color='#457B9D')
axes[0].set_yticks(range(len(rf_importance)))
axes[0].set_yticklabels(rf_importance['feature'])
axes[0].set_xlabel('Importance')
axes[0].set_title(f'Random Forest: Top {len(rf_importance)} Feature Importances')
axes[0].invert_yaxis()
axes[0].grid(axis='x', alpha=0.3)

# XGBoost importance
axes[1].barh(range(len(xgb_importance)), xgb_importance['importance'], color='#E63946')
axes[1].set_yticks(range(len(xgb_importance)))
axes[1].set_yticklabels(xgb_importance['feature'])
axes[1].set_xlabel('Importance')
axes[1].set_title(f'XGBoost: Top {len(xgb_importance)} Feature Importances')
axes[1].invert_yaxis()
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig(REPORTS_DIR / '04_classification_feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Feature importance visualization saved")
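
For readers without access to the project helper, here is a rough sketch of how built-in importances can be read from a fitted tree ensemble. `importance_table` is a hypothetical stand-in, not the actual `extract_feature_importance`; it assumes only that the model exposes `feature_importances_`, which both scikit-learn's `RandomForestClassifier` and `XGBClassifier` do.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def importance_table(model, feature_names, top_n=15):
    """Hypothetical helper: rank features by the model's built-in importances."""
    table = pd.DataFrame(
        {"feature": feature_names, "importance": model.feature_importances_}
    )
    # reset_index gives a clean 0..n-1 ranking after sorting
    return table.sort_values("importance", ascending=False).head(top_n).reset_index(drop=True)


rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "hour": rng.integers(0, 24, size=1000),
    "district": rng.integers(1, 40, size=1000),
    "noise": rng.normal(size=1000),
})
# Illustrative rule: only "hour" carries signal
y_toy = (X_toy["hour"] >= 20).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)
table = importance_table(model, X_toy.columns.tolist())
print(table)
```

Impurity-based importances sum to 1 for a random forest and tend to favor high-cardinality features, which is one reason to cross-check them against SHAP.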

## 8. SHAP Analysis for Interpretability

Use SHAP values to understand model decisions (using XGBoost model).

In [None]:
# Compute SHAP values for a sample of test data
print("Computing SHAP values (this may take a minute)...")

# Sample 500 instances for SHAP analysis (to speed up computation)
X_test_sample = X_test.sample(min(500, len(X_test)), random_state=RANDOM_SEED)

# Create SHAP explainer
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test_sample)

print("✓ SHAP values computed")

In [None]:
# SHAP summary plot
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test_sample, show=False)
plt.title('SHAP Feature Importance Summary', pad=20)
plt.tight_layout()
plt.savefig(REPORTS_DIR / '04_classification_shap_summary.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ SHAP summary visualization saved")
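
When reading the summary plot, it helps to recall that SHAP values are additive. Assuming `TreeExplainer` is used with its default raw model output, which for an XGBoost binary classifier is the log-odds margin rather than a probability, each explained prediction decomposes as:

$$
\log\frac{\hat{p}(x)}{1-\hat{p}(x)} \;=\; \phi_0 + \sum_{j=1}^{M} \phi_j(x)
$$

Here $\phi_0$ is the explainer's expected value (the baseline margin), $M$ is the number of features, and $\phi_j(x)$ is the SHAP value of feature $j$ for incident $x$. Under that assumption, a positive $\phi_j(x)$ pushes the prediction toward the Violent class and a negative one pushes it toward Non-Violent.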

## 9. Performance Visualizations

In [None]:
# ROC curves and Precision-Recall curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_prob_xgb)

axes[0].plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC={rf_metrics["roc_auc"]:.3f})', linewidth=2, color='#457B9D')
axes[0].plot(fpr_xgb, tpr_xgb, label=f'XGBoost (AUC={xgb_metrics["roc_auc"]:.3f})', linewidth=2, color='#E63946')
axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.3, label='Random Classifier')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curves')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Precision-Recall Curve
precision_rf, recall_rf, _ = precision_recall_curve(y_test, y_prob_rf)
precision_xgb, recall_xgb, _ = precision_recall_curve(y_test, y_prob_xgb)

axes[1].plot(recall_rf, precision_rf, label='Random Forest', linewidth=2, color='#457B9D')
axes[1].plot(recall_xgb, precision_xgb, label='XGBoost', linewidth=2, color='#E63946')
axes[1].axhline(y_test.mean(), color='k', linestyle='--', alpha=0.3, label='Baseline')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curves')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(REPORTS_DIR / '04_classification_performance_curves.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Performance curves saved")
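
The dashed baseline on the precision-recall plot is drawn at the test-set positive rate because an uninformative classifier has precision equal to prevalence at every threshold. A small simulation illustrates this; the prevalence of 0.2 and the sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 100_000
prevalence = 0.2
y_true = (rng.random(n) < prevalence).astype(int)
random_scores = rng.random(n)  # scores carry no information about y_true


def precision_at(threshold):
    """Share of true positives among rows flagged at this threshold."""
    flagged = random_scores >= threshold
    return y_true[flagged].mean()


for t in (0.1, 0.5, 0.9):
    print(f"threshold={t:.1f}  precision={precision_at(t):.3f}")
print(f"positive rate: {y_true.mean():.3f}")
```

A model is only useful to the extent that its curve sits above this line, so the gap to the baseline is more informative than raw precision when classes are imbalanced.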

## 10. Model Card: Documentation and Limitations

Create comprehensive model documentation including performance metrics and known limitations.

In [None]:
# Create model cards for both models
rf_card = create_model_card(
    model_name="Violence Classification - Random Forest",
    model_type="RandomForestClassifier",
    features=X_train.columns.tolist(),
    train_metrics={
        "accuracy": rf_model.score(rf_scaler.transform(X_train) if rf_scaler else X_train, y_train),
    },
    test_metrics={
        "accuracy": rf_metrics['classification_report']['accuracy'],
        "roc_auc": rf_metrics['roc_auc'],
        "precision_violent": rf_metrics['classification_report']['Violent']['precision'],
        "recall_violent": rf_metrics['classification_report']['Violent']['recall'],
        "f1_violent": rf_metrics['classification_report']['Violent']['f1-score'],
    },
    limitations=[
        "Model trained on historical data; performance may degrade with changing crime patterns",
        f"Class imbalance: {(1-y.mean())*100:.1f}% non-violent vs {y.mean()*100:.1f}% violent",
        "Temporal features may not capture sudden policy changes or external events",
        "Limited to features available at time of dispatch (no investigation outcomes)",
        "Geographic coverage limited to Philadelphia; not generalizable to other cities",
        "Model should be retrained periodically as new data becomes available",
    ]
)

xgb_card = create_model_card(
    model_name="Violence Classification - XGBoost",
    model_type="XGBClassifier",
    features=X_train.columns.tolist(),
    train_metrics={
        "accuracy": xgb_model.score(X_train, y_train),
    },
    test_metrics={
        "accuracy": xgb_metrics['classification_report']['accuracy'],
        "roc_auc": xgb_metrics['roc_auc'],
        "precision_violent": xgb_metrics['classification_report']['Violent']['precision'],
        "recall_violent": xgb_metrics['classification_report']['Violent']['recall'],
        "f1_violent": xgb_metrics['classification_report']['Violent']['f1-score'],
    },
    limitations=[
        "Model trained on historical data; performance may degrade with changing crime patterns",
        f"Class imbalance: {(1-y.mean())*100:.1f}% non-violent vs {y.mean()*100:.1f}% violent",
        "Temporal features may not capture sudden policy changes or external events",
        "Limited to features available at time of dispatch (no investigation outcomes)",
        "Geographic coverage limited to Philadelphia; not generalizable to other cities",
        "Model should be retrained periodically as new data becomes available",
    ]
)

print("Model Cards Created\n")
print("="*80)
print("RANDOM FOREST MODEL CARD")
print("="*80)
print(f"Model: {rf_card['model_name']}")
print(f"Type: {rf_card['model_type']}")
print(f"Features: {rf_card['n_features']}")
print(f"\nTest Performance:")
for metric, value in rf_card['test_performance'].items():
    print(f"  {metric}: {value:.4f}")
print(f"\nLimitations:")
for i, limitation in enumerate(rf_card['limitations'], 1):
    print(f"  {i}. {limitation}")

print("\n" + "="*80)
print("XGBOOST MODEL CARD")
print("="*80)
print(f"Model: {xgb_card['model_name']}")
print(f"Type: {xgb_card['model_type']}")
print(f"Features: {xgb_card['n_features']}")
print(f"\nTest Performance:")
for metric, value in xgb_card['test_performance'].items():
    print(f"  {metric}: {value:.4f}")
print(f"\nLimitations:")
for i, limitation in enumerate(xgb_card['limitations'], 1):
    print(f"  {i}. {limitation}")

In [None]:
# Save model cards to JSON
import json

with open(REPORTS_DIR / '04_classification_rf_model_card.json', 'w') as f:
    json.dump(rf_card, f, indent=2)

with open(REPORTS_DIR / '04_classification_xgb_model_card.json', 'w') as f:
    json.dump(xgb_card, f, indent=2)

print("✓ Model cards saved to reports directory")
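
The structure of a model card can be sketched as a plain dictionary. `build_model_card` below is a hypothetical stand-in for `create_model_card`; most key names mirror those printed above, while the `features` key and the `to_builtin` fallback are assumptions added for illustration. The fallback matters because `json.dump` rejects NumPy scalars such as `np.float32`.

```python
import json

import numpy as np


def build_model_card(model_name, model_type, features, test_metrics, limitations):
    """Hypothetical stand-in for the project's model card helper."""
    return {
        "model_name": model_name,
        "model_type": model_type,
        "n_features": len(features),
        "features": list(features),
        "test_performance": dict(test_metrics),
        "limitations": list(limitations),
    }


def to_builtin(value):
    """Fallback for json.dumps: convert NumPy scalars and arrays to plain Python."""
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Not JSON serializable: {type(value).__name__}")


card = build_model_card(
    model_name="Toy classifier",
    model_type="RandomForestClassifier",
    features=["hour", "district"],
    test_metrics={"accuracy": np.float32(0.75), "roc_auc": 0.81},  # made-up values
    limitations=["Illustrative values only"],
)

encoded = json.dumps(card, indent=2, default=to_builtin)
print(encoded)
```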

## 11. Summary and Recommendations

Key findings and operational recommendations for resource allocation.

In [None]:
# Generate summary report
summary = f"""
VIOLENCE CLASSIFICATION MODEL SUMMARY
{'='*80}

OBJECTIVE:
  Predict whether crime incidents are violent vs non-violent to support
  resource allocation and operational planning.

DATA:
  Total incidents: {len(df):,}
  Date range: {df['dispatch_date'].min()} to {df['dispatch_date'].max()}
  Violent incidents: {y.sum():,} ({y.mean()*100:.2f}%)
  Non-violent incidents: {(~y.astype(bool)).sum():,} ({(1-y.mean())*100:.2f}%)

METHODOLOGY:
  - Time-aware train/test split (80/20) - no shuffling to prevent data leakage
  - Training period: {X_train.index.min()} to {X_train.index.max()}
  - Testing period: {X_test.index.min()} to {X_test.index.max()}
  - Random seed: {RANDOM_SEED} (for reproducibility)

MODELS TRAINED:
  1. Random Forest (n_estimators=200, max_depth=10)
  2. XGBoost (n_estimators=200, max_depth=6, lr=0.1)

PERFORMANCE (Test Set):
  Random Forest:
    - ROC-AUC: {rf_metrics['roc_auc']:.4f}
    - Accuracy: {rf_metrics['classification_report']['accuracy']:.4f}
    - Violent Precision: {rf_metrics['classification_report']['Violent']['precision']:.4f}
    - Violent Recall: {rf_metrics['classification_report']['Violent']['recall']:.4f}
    - Violent F1: {rf_metrics['classification_report']['Violent']['f1-score']:.4f}

  XGBoost:
    - ROC-AUC: {xgb_metrics['roc_auc']:.4f}
    - Accuracy: {xgb_metrics['classification_report']['accuracy']:.4f}
    - Violent Precision: {xgb_metrics['classification_report']['Violent']['precision']:.4f}
    - Violent Recall: {xgb_metrics['classification_report']['Violent']['recall']:.4f}
    - Violent F1: {xgb_metrics['classification_report']['Violent']['f1-score']:.4f}

TOP 5 PREDICTIVE FEATURES:
  Random Forest:
{chr(10).join([f'    {rank}. {row["feature"]}: {row["importance"]:.4f}' for rank, (_, row) in enumerate(rf_importance.head(5).iterrows(), 1)])}

  XGBoost:
{chr(10).join([f'    {rank}. {row["feature"]}: {row["importance"]:.4f}' for rank, (_, row) in enumerate(xgb_importance.head(5).iterrows(), 1)])}

KEY LIMITATIONS:
  - Class imbalance may affect minority class predictions
  - Model performance depends on stable crime patterns
  - Limited to features available at dispatch time
  - Requires periodic retraining with new data

OPERATIONAL RECOMMENDATIONS:
  1. Use model predictions to prioritize violent incident response
  2. Monitor model performance monthly; retrain quarterly
  3. Combine predictions with officer judgment for final decisions
  4. Track false positives/negatives to identify improvement areas
  5. Consider ensemble of both models for robust predictions

ARTIFACTS GENERATED:
  - reports/04_classification_class_distribution.png
  - reports/04_classification_feature_importance.png
  - reports/04_classification_shap_summary.png
  - reports/04_classification_performance_curves.png
  - reports/04_classification_rf_model_card.json
  - reports/04_classification_xgb_model_card.json
  - reports/04_classification_summary.txt

{'='*80}
"""

print(summary)

# Save summary to file
with open(REPORTS_DIR / '04_classification_summary.txt', 'w') as f:
    f.write(summary)

print("✓ Summary report saved")
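
Recommendation 5 in the summary mentions combining both models. One simple way to do that is soft voting, that is, averaging the predicted probabilities. The sketch below uses made-up probabilities; `average_probabilities` and its `weight_a` parameter are hypothetical names, and the 0.5 decision threshold is an assumption that would need tuning on validation data.

```python
import numpy as np


def average_probabilities(prob_a, prob_b, weight_a=0.5):
    """Hypothetical helper: weighted average of two models' positive-class probabilities."""
    prob_a = np.asarray(prob_a, dtype=float)
    prob_b = np.asarray(prob_b, dtype=float)
    return weight_a * prob_a + (1.0 - weight_a) * prob_b


# Made-up probabilities standing in for the two models' outputs
toy_rf = np.array([0.20, 0.60, 0.90, 0.40])
toy_xgb = np.array([0.10, 0.70, 0.70, 0.55])

blended = average_probabilities(toy_rf, toy_xgb)
labels = (blended >= 0.5).astype(int)

print(blended)
print(labels)
```

In this notebook the equivalent call would be `average_probabilities(y_prob_rf, y_prob_xgb)`, with the blended scores evaluated on the same test period as the individual models.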

## Reproducibility Cell

Re-running this notebook end-to-end with the same input data, random seed, and package versions should reproduce the results.

In [None]:
# Environment information for reproducibility
import sys
print("Python version:", sys.version)
print("\nKey package versions:")
import pandas, numpy, sklearn, xgboost, shap, matplotlib, seaborn
print(f"  pandas: {pandas.__version__}")
print(f"  numpy: {numpy.__version__}")
print(f"  scikit-learn: {sklearn.__version__}")
print(f"  xgboost: {xgboost.__version__}")
print(f"  shap: {shap.__version__}")
print(f"  matplotlib: {matplotlib.__version__}")
print(f"  seaborn: {seaborn.__version__}")
print(f"\nRandom seed: {RANDOM_SEED}")
print("\nResults should be reproducible when this notebook is re-run with the same seed, input data, and package versions.")