---

# 1. Setup

This section handles the initial setup, including importing necessary libraries, defining file paths, and configuring the environment for spatial clustering analysis. Custom transformers for spatial feature engineering are imported from our utilities module.

## Optional: Google Colab Setup

Uncomment and run this cell when working in a Google Colab environment.

In [1]:
# Run on Google Colab (optional)
# from google.colab import drive
# drive.mount('/drive', force_remount=True)

## Import Libraries

Import the libraries required for spatial clustering analysis, pipeline construction, and visualization.

In [2]:
# Core data manipulation and computation
import pandas as pd
import numpy as np
import os
import sys
import warnings
import joblib
import json
import time
import math
from statistics import mean

# Shapely and geographic projection
from shapely.geometry import Point, MultiPoint, mapping
from shapely.ops import transform as shp_transform
from pyproj import Transformer

# Machine learning and clustering
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import DBSCAN, KMeans
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score, adjusted_rand_score, adjusted_mutual_info_score

# Visualization libraries
import matplotlib.pyplot as plt

# custom transformers
try:
    from Utilities.clustering_transformers import (
        SpatialProjectionTransformer, 
        CyclicalTransformer, 
        CategoricalPreprocessor,
        MCATransformer,
        MixedFeaturePreprocessor,
        IdentityPreprocessor,
        ColumnBinner
    )
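except ImportError:
    # Reload the module and retry; this helps when a stale cached copy
    # (for example, one lacking a newly added class) is already imported
    import importlib
    import Utilities.clustering_transformers as ct
    ct = importlib.reload(ct)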
    from Utilities.clustering_transformers import (
        SpatialProjectionTransformer, 
        CyclicalTransformer, 
        CategoricalPreprocessor,
        MCATransformer,
        MixedFeaturePreprocessor,
        IdentityPreprocessor,
        ColumnBinner
    )

# Geographic and mapping libraries
try:
    import folium
    from folium import plugins, FeatureGroup, GeoJson
    FOLIUM_AVAILABLE = True
except ImportError:
    print("Folium not available. Map visualizations will be limited.")
    FOLIUM_AVAILABLE = False

# Configure plotting
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10
warnings.filterwarnings('ignore')

## Configure Paths and Custom Utilities

Set up file paths and add the custom utilities directory to the Python path.

In [3]:
# Configure working directory and paths
current_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(current_dir, '../..'))
data_dir = os.path.join(project_root, 'Data')
output_dir = os.path.join(project_root, 'JupyterOutputs', 'Clustering')

# Create output directories if they don't exist
os.makedirs(output_dir, exist_ok=True)

print(f"Project root: {project_root}")
print(f"Data directory: {data_dir}")
print(f"Output directory: {output_dir}")

# Add utilities to Python path
utilities_path = os.path.join(os.getcwd(), 'Utilities')
if utilities_path not in sys.path:
    sys.path.append(utilities_path)

Project root: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer
Data directory: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\Data
Output directory: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\JupyterOutputs\Clustering


## Configure Analysis Parameters

Define key parameters for the clustering analysis.

In [4]:
# Random state for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Feature columns for spatial analysis (aligned with actual dataset)
SPATIAL_FEATURES = ['Latitude', 'Longitude']

# Primary temporal features for clustering
TEMPORAL_FEATURES = ['HOUR', 'WEEKDAY', 'MONTH']

# Extended temporal features available in dataset
EXTENDED_TEMPORAL_FEATURES = [
    'HOUR', 'DAY', 'WEEKDAY', 'IS_WEEKEND', 'MONTH', 'YEAR', 
    'SEASON', 'TIME_BUCKET', 'IS_HOLIDAY', 'IS_PAYDAY'
]

# Categorical features
CATEGORICAL_FEATURES = ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']

# Extended categorical features available
EXTENDED_CATEGORICAL_FEATURES = [
    'BORO_NM', 'LAW_CAT_CD', 'LOC_OF_OCCUR_DESC', 'OFNS_DESC', 'PREM_TYP_DESC',
    'SUSP_AGE_GROUP', 'SUSP_RACE', 'SUSP_SEX', 'VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX'
]

# Spatial context features (POI-based features for enhanced spatial analysis)
SPATIAL_CONTEXT_FEATURES = [
    'BAR_DISTANCE', 'NIGHTCLUB_DISTANCE', 'ATM_DISTANCE', 'METRO_DISTANCE',
    'MIN_POI_DISTANCE', 'AVG_POI_DISTANCE', 'MAX_POI_DISTANCE',
    'ATMS_COUNT', 'BARS_COUNT', 'BUS_STOPS_COUNT', 'METROS_COUNT', 
    'NIGHTCLUBS_COUNT', 'SCHOOLS_COUNT', 'TOTAL_POI_COUNT',
    'POI_DIVERSITY', 'POI_DENSITY_SCORE'
]

# Social features 
SOCIAL_FEATURES = ['SAME_AGE_GROUP', 'SAME_SEX']

print("Analysis parameters configured successfully!")
print(f"Primary spatial features: {SPATIAL_FEATURES}")
print(f"Primary temporal features: {TEMPORAL_FEATURES}")
print(f"Primary categorical features: {CATEGORICAL_FEATURES}")
print(f"Available spatial context features: {len(SPATIAL_CONTEXT_FEATURES)} POI-based features")
print(f"Available extended temporal features: {len(EXTENDED_TEMPORAL_FEATURES)} temporal features")

Analysis parameters configured successfully!
Primary spatial features: ['Latitude', 'Longitude']
Primary temporal features: ['HOUR', 'WEEKDAY', 'MONTH']
Primary categorical features: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']
Available spatial context features: 16 POI-based features
Available extended temporal features: 10 temporal features


## Cross-Validation Functions

Define custom cross-validation functions for clustering evaluation.

In [5]:
def clustering_cross_validation(pipeline, X, param_grid, cv=5, random_state=42):
    """
    Cross-validation for clustering with appropriate metrics.

    Evaluates clustering quality using:
    - Silhouette Score: Quality of clusters (intra vs inter cluster distance)

    Parameters
    ----------
    pipeline : sklearn.Pipeline
        Clustering pipeline to evaluate
    X : DataFrame
        Input data
    param_grid : dict
        Parameter grid to test
    cv : int, default=5
        Number of cross-validation folds
    random_state : int, default=42
        Random state for reproducibility

    Returns
    -------
    DataFrame
        Results sorted by composite score
    """
    from sklearn.model_selection import KFold
    from sklearn.base import clone

    print(f"\n🔄 CROSS-VALIDATION CLUSTERING EVALUATION")
    print(f"Dataset shape: {X.shape}")
    print(f"CV folds: {cv}")
    print(f"Parameter combinations: {len(list(ParameterGrid(param_grid)))}")
    print("-" * 50)

    kf = KFold(n_splits=cv, shuffle=True, random_state=random_state)
    results = []

    for i, params in enumerate(ParameterGrid(param_grid)):
        print(f"\nTesting combination {i+1}: {params}")

        # Metrics for each fold
        fold_silhouettes = []
        fold_inertias = []

        for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
            X_train = X.iloc[train_idx]
            X_val = X.iloc[val_idx]

            try:
                # Fit pipeline on training fold
                pipeline_copy = clone(pipeline)
                pipeline_copy.set_params(**params)
                pipeline_copy.fit(X_train)

                # Predict on validation fold
                val_labels = pipeline_copy.predict(X_val)

                # Skip if only one cluster found
                if len(np.unique(val_labels)) < 2:
                    continue

                # Get transformed validation data for metrics
                X_val_transformed = pipeline_copy.named_steps['preprocess'].transform(X_val)

                # Convert to numpy array if needed for metrics
                if hasattr(X_val_transformed, 'values'):
                    X_val_array = X_val_transformed.values
                else:
                    X_val_array = X_val_transformed

                # Calculate clustering metrics
                # Silhouette Score (range: -1 to 1, higher better)
                sil_score = silhouette_score(X_val_array, val_labels)
                fold_silhouettes.append(sil_score)

                # Inertia (for K-Means based methods, lower better)
                if hasattr(pipeline_copy.named_steps['cluster'], 'inertia_'):
                    # Re-fit on validation to get inertia
                    cluster_step = clone(pipeline_copy.named_steps['cluster'])
                    cluster_step.set_params(**{k.replace('cluster__', ''): v \
                                             for k, v in params.items() \
                                             if k.startswith('cluster__')})
                    cluster_step.fit(X_val_array)
                    fold_inertias.append(cluster_step.inertia_)

            except Exception as e:
                print(f"    Fold {fold+1} failed: {e}")
                continue

        # Aggregate results across folds
        if fold_silhouettes:
            mean_silhouette = np.mean(fold_silhouettes)
            std_silhouette = np.std(fold_silhouettes)
            mean_inertia = np.mean(fold_inertias) if fold_inertias else float('inf')

            # Composite score: rely on silhouette only
            composite_score = mean_silhouette

            results.append({
                'params': params,
                'cv_silhouette_mean': mean_silhouette,
                'cv_silhouette_std': std_silhouette,
                'cv_inertia_mean': mean_inertia,
                'composite_score': composite_score,
                'n_successful_folds': len(fold_silhouettes)
            })

            print(f"  ✓ CV Results:")
            print(f"    Silhouette: {mean_silhouette:.3f} (±{std_silhouette:.3f})")
            print(f"    Composite Score: {composite_score:.4f}")
            print(f"    Successful folds: {len(fold_silhouettes)}/{cv}")
        else:
            print(f"  ❌ No successful folds for this parameter combination")

    df_results = pd.DataFrame(results)
    if not df_results.empty:
        df_results = df_results.sort_values('composite_score', ascending=False)
        print(f"\n🏆 CROSS-VALIDATION COMPLETED")
        print(f"Total successful parameter combinations: {len(df_results)}")
        if len(df_results) > 0:
            best_params = df_results.iloc[0]['params']
            best_score = df_results.iloc[0]['composite_score']
            print(f"Best parameters: {best_params}")
            print(f"Best composite score: {best_score:.4f}")

    return df_results


def evaluate_best_pipeline(pipeline, X, best_params, method_name):
    """
    Evaluate the best pipeline with full dataset and detailed metrics.
    """
    print(f"\n🎯 FINAL {method_name.upper()} MODEL EVALUATION")
    print("-" * 50)

    # Set best parameters and fit on full dataset
    pipeline.set_params(**best_params)
    t0 = time.perf_counter()
    labels = pipeline.fit_predict(X)
    runtime = time.perf_counter() - t0

    # Get transformed data
    X_transformed = pipeline.named_steps['preprocess'].transform(X)
    if hasattr(X_transformed, 'values'):
        X_array = X_transformed.values
    else:
        X_array = X_transformed

    # Calculate final metrics
    n_clusters = len(np.unique(labels))
    silhouette = silhouette_score(X_array, labels)

    print(f"✓ Model fitted successfully")
    print(f"Runtime: {runtime:.2f} seconds")
    print(f"Number of clusters: {n_clusters}")
    print(f"Dataset size: {len(labels):,}")

    print(f"\nFinal Quality Metrics:")
    print(f"  Silhouette Score: {silhouette:.4f}")

    # Cluster distribution
    cluster_counts = pd.Series(labels).value_counts().sort_index()
    print(f"\nCluster Distribution:")
    for cluster_id, count in cluster_counts.items():
        percentage = (count / len(labels)) * 100
        print(f"  Cluster {cluster_id}: {count:,} samples ({percentage:.1f}%)")

    return labels, {
        'n_clusters': n_clusters,
        'silhouette_score': silhouette,
        'runtime_seconds': runtime,
        'cluster_sizes': cluster_counts.to_dict()
    }
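
As a minimal usage sketch, the held-out scoring idea behind `clustering_cross_validation` can be reproduced on synthetic data. The toy blobs, the parameter grid, and the choice of `StandardScaler` plus `KMeans` are assumptions made for illustration only; the step names `preprocess` and `cluster` match what the functions above look up through `named_steps`.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: three well-separated 2-D blobs
X_arr, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5, random_state=42,
)
X_demo = pd.DataFrame(X_arr, columns=["x", "y"])

# Step names must be 'preprocess' and 'cluster' for the helpers above
demo_pipeline = Pipeline([
    ("preprocess", StandardScaler()),
    ("cluster", KMeans(n_init=10, random_state=42)),
])
demo_grid = {"cluster__n_clusters": [2, 3, 4]}

kf = KFold(n_splits=3, shuffle=True, random_state=42)
demo_scores = {}
for params in ParameterGrid(demo_grid):
    fold_scores = []
    for train_idx, val_idx in kf.split(X_demo):
        # Fit on the training fold, assign labels to the held-out fold
        model = clone(demo_pipeline).set_params(**params)
        model.fit(X_demo.iloc[train_idx])
        X_val = X_demo.iloc[val_idx]
        val_labels = model.predict(X_val)
        if len(np.unique(val_labels)) < 2:
            continue  # silhouette is undefined for a single cluster
        # Score in the same feature space the clusterer saw
        X_val_t = model.named_steps["preprocess"].transform(X_val)
        fold_scores.append(silhouette_score(X_val_t, val_labels))
    demo_scores[params["cluster__n_clusters"]] = float(np.mean(fold_scores))

best_k = max(demo_scores, key=demo_scores.get)
print(demo_scores, "best k:", best_k)
```

Because the validation labels come from `predict`, this scheme only works for clusterers that can assign new points to existing clusters; an estimator such as `DBSCAN`, which has no `predict` method, would fail in every fold.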

---

# 2. Data Loading & Feature Preparation

This section loads the preprocessed crime dataset and prepares features specifically for clustering analysis. We validate coordinate accuracy, assess feature completeness, and prepare the data for various clustering algorithms.

## Load Preprocessed Crime Dataset

Load and validate the preprocessed crime data.

In [6]:
# Define data file path
data_file = os.path.join(data_dir, 'final_crime_data.csv')

# Check if data file exists
if not os.path.exists(data_file):
    raise FileNotFoundError(f"Data file not found: {data_file}")

print(f"Loading data from: {data_file}")

# Load the dataset
try:
    df = pd.read_csv(data_file)
    print(f"Dataset loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
except Exception as e:
    raise RuntimeError(f"Error loading dataset: {e}")

# Display basic dataset information
print("\n" + "="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"Total records: {len(df):,}")
print(f"Total features: {df.shape[1]}")

Loading data from: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\Data\final_crime_data.csv


Dataset loaded successfully!
Shape: (2493835, 44)
Memory usage: 2453.35 MB

DATASET OVERVIEW
Total records: 2,493,835
Total features: 44


## Data Cleaning and Validation


In [7]:
# Validate feature availability
print("\n" + "="*60)
print("FEATURE VALIDATION")
print("="*60)

# Check spatial features
spatial_available = [col for col in SPATIAL_FEATURES if col in df.columns]
temporal_available = [col for col in TEMPORAL_FEATURES if col in df.columns]
categorical_available = [col for col in CATEGORICAL_FEATURES if col in df.columns]

print("Feature availability:")
print("-" * 30)
print(f"Spatial features: {spatial_available} ({len(spatial_available)}/{len(SPATIAL_FEATURES)})")
print(f"Temporal features: {temporal_available} ({len(temporal_available)}/{len(TEMPORAL_FEATURES)})")
print(f"Categorical features: {categorical_available} ({len(categorical_available)}/{len(CATEGORICAL_FEATURES)})")

# Apply the same temporal filter as classification test set for consistency
print(f"\n" + "="*60)
print("TEMPORAL FILTERING FOR CLUSTERING")
print("="*60)

if 'YEAR' in df.columns and 'MONTH' in df.columns:
    print(f"Original dataset years: {sorted(df['YEAR'].unique())}")
    print(f"Total records before temporal filter: {len(df):,}")
    
    # Create YearMonth for consistent filtering with classification approach
    df['YearMonth'] = df['YEAR'] * 100 + df['MONTH']
    print(f"Year-Month distribution in original dataset:")
    ym_counts = df['YearMonth'].value_counts().sort_index()
    for ym, count in ym_counts.items():
        print(f"  {ym}: {count:,} records")

    # Use the same temporal split point as classification test set (YearMonth >= 202401)
    # This corresponds to the most recent data used in classification evaluation
    test_set_start_ym = 202401  # January 2024
    print(f"\nFiltering for clustering analysis using classification test set period:")
    print(f"Using YearMonth >= {test_set_start_ym} (same as classification test set)")
    
    # Apply filter - using the classification test set period
    df_filtered = df[df['YearMonth'] >= test_set_start_ym].copy()
    
    print(f"Records after temporal filter: {len(df_filtered):,} ({(len(df_filtered)/len(df))*100:.1f}% of original)")
    
    if len(df_filtered) > 0:
        print(f"Filtered dataset year-months:")
        filtered_ym_counts = df_filtered['YearMonth'].value_counts().sort_index()
        for ym, count in filtered_ym_counts.items():
            print(f"  {ym}: {count:,} records")
        
        # Drop the temporary YearMonth column
        df_filtered.drop(columns=['YearMonth'], inplace=True)
        df.drop(columns=['YearMonth'], inplace=True)
        
        # Use filtered data for clustering
        df = df_filtered
        print(f"\nUsing filtered dataset for clustering analysis.")
        print(f"This approach ensures consistency with classification methodology")
        print(f"and focuses on the most recent crime patterns.")
    else:
        print(f"Warning: No data found for YearMonth >= {test_set_start_ym}")
        print("Falling back to recent years filter (YEAR >= 2023)")
        # Fallback to the previous approach
        recent_years_threshold = 2023
        df_filtered = df[df['YEAR'] >= recent_years_threshold].copy()
        df = df_filtered
        df.drop(columns=['YearMonth'], inplace=True)
else:
    print("Warning: YEAR or MONTH column not found. Skipping temporal filtering.")
    print("Using full dataset for clustering analysis.")

# Create clean dataset for spatial analysis
print(f"\nPreparing dataset for spatial clustering...")

# Filter for valid coordinates
if len(spatial_available) >= 2:
    # Remove rows with missing coordinates
    valid_coords_mask = df[spatial_available].notna().all(axis=1)
    df_spatial = df[valid_coords_mask].copy()
    
    print(f"Records with valid coordinates: {len(df_spatial):,} ({(len(df_spatial)/len(df))*100:.2f}%)")
    
    # Additional coordinate validation
    lat_col = 'Latitude'
    lon_col = 'Longitude'
    
    if lat_col in df_spatial.columns and lon_col in df_spatial.columns:
        # NYC coordinate bounds
        nyc_bounds_mask = (
            df_spatial[lat_col].between(40.4774, 40.9176) &
            df_spatial[lon_col].between(-74.2591, -73.7004)
        )
        df_spatial = df_spatial[nyc_bounds_mask].copy()
        
        print(f"Records within NYC bounds: {len(df_spatial):,}")
    else:
        print(f"Warning: Coordinate columns {lat_col}/{lon_col} not found for geographic filtering")
else:
    raise ValueError("Insufficient spatial features for clustering analysis")

# Display final dataset summary
print(f"\nFinal dataset for clustering:")
print(f"Shape: {df_spatial.shape}")
print(f"Coordinate coverage: {len(df_spatial):,} records")
print(f"Time range: {df_spatial['YEAR'].min()} - {df_spatial['YEAR'].max()}")
print(f"Memory usage: {df_spatial.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Check availability of extended features
print(f"\nExtended features availability:")
print("-" * 40)

# Check spatial context features (POI-based)
spatial_context_available = [col for col in SPATIAL_CONTEXT_FEATURES if col in df.columns]
print(f"Spatial context features: {len(spatial_context_available)}/{len(SPATIAL_CONTEXT_FEATURES)} available")
if spatial_context_available:
    print(f"  Available: {spatial_context_available[:5]}...")  # Show first 5

# Check extended temporal features
extended_temporal_available = [col for col in EXTENDED_TEMPORAL_FEATURES if col in df.columns]
print(f"Extended temporal features: {len(extended_temporal_available)}/{len(EXTENDED_TEMPORAL_FEATURES)} available")
if extended_temporal_available:
    print(f"  Available: {extended_temporal_available}")

# Check extended categorical features
extended_categorical_available = [col for col in EXTENDED_CATEGORICAL_FEATURES if col in df.columns]
print(f"Extended categorical features: {len(extended_categorical_available)}/{len(EXTENDED_CATEGORICAL_FEATURES)} available")
if extended_categorical_available:
    print(f"  Available: {extended_categorical_available[:5]}...")  # Show first 5

# Check social features
social_available = [col for col in SOCIAL_FEATURES if col in df.columns]
print(f"Social features: {len(social_available)}/{len(SOCIAL_FEATURES)} available")
if social_available:
    print(f"  Available: {social_available}")

# Display temporal distribution after filtering
if 'YEAR' in df_spatial.columns and 'MONTH' in df_spatial.columns:
    print(f"\nTemporal distribution in filtered dataset:")
    print("-" * 40)
    yearly_counts = df_spatial['YEAR'].value_counts().sort_index()
    for year, count in yearly_counts.items():
        print(f"  {year}: {count:,} records")
    
    # Show monthly distribution for each year in the filtered data
    years_in_data = sorted(df_spatial['YEAR'].unique())
    for year in years_in_data:
        monthly_counts = df_spatial[df_spatial['YEAR'] == year]['MONTH'].value_counts().sort_index()
        print(f"\nMonthly distribution for {year}:")
        print("-" * 40)
        for month, count in monthly_counts.items():
            print(f"  Month {month:2d}: {count:,} records")

# Final clustering dataset summary
print(f"\n" + "="*60)
print("CLUSTERING DATASET SUMMARY")
print("="*60)
print(f"Dataset period: Jan 2024 onwards")
print(f"Total records for clustering: {len(df_spatial):,}")
print(f"Temporal consistency: ✓ Matches classification evaluation period")
print(f"Geographic validity: ✓ NYC coordinate bounds enforced")


FEATURE VALIDATION
Feature availability:
------------------------------
Spatial features: ['Latitude', 'Longitude'] (2/2)
Temporal features: ['HOUR', 'WEEKDAY', 'MONTH'] (3/3)
Categorical features: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC'] (3/3)

TEMPORAL FILTERING FOR CLUSTERING
Original dataset years: [np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024)]
Total records before temporal filter: 2,493,835
Year-Month distribution in original dataset:
  202001: 38,695 records
  202002: 35,446 records
  202003: 32,679 records
  202004: 24,907 records
  202005: 32,023 records
  202006: 32,604 records
  202007: 35,531 records
  202008: 37,524 records
  202009: 35,983 records
  202010: 37,403 records
  202011: 35,148 records
  202012: 33,516 records
  202101: 33,217 records
  202102: 28,332 records
  202103: 34,612 records
  202104: 32,619 records
  202105: 36,711 records
  202106: 37,516 records
  202107: 39,402 records
  202108: 38,945 records
  202109: 39,906 rec
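
The two filters applied in this cell, the integer `YearMonth` key and the coordinate bounding box, can be checked in isolation on a handful of rows. The sketch below uses invented values; only the column names, the `202401` cut-off, and the bounds are taken from the cell above.

```python
import pandas as pd

# Toy records: column names follow the notebook, values are invented
toy = pd.DataFrame({
    "YEAR": [2023, 2023, 2024, 2024, 2024],
    "MONTH": [11, 12, 1, 2, 3],
    "Latitude": [40.70, 40.80, 40.65, 41.50, None],
    "Longitude": [-73.95, -73.90, -73.89, -73.95, -73.80],
})

# An integer key of the form YYYYMM sorts chronologically,
# so one comparison is enough to select a period
year_month = toy["YEAR"] * 100 + toy["MONTH"]
recent = toy[year_month >= 202401]

# Series.between is inclusive on both ends by default,
# and a missing coordinate evaluates to False
in_bounds = (
    recent["Latitude"].between(40.4774, 40.9176)
    & recent["Longitude"].between(-74.2591, -73.7004)
)
clean = recent[in_bounds]
print(clean)
```

Since a missing coordinate never satisfies `between`, the bounding-box mask alone would also drop rows with NaN coordinates; the separate `notna()` step in the cell mainly serves to report the two counts separately.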

In [8]:
# --- Data Configuration: Full Dataset vs Sample ---
# Set to True to use full dataset, False to use sample
USE_FULL_DATA = False

# Helper: balanced stratified sampling with shortfall redistribution
def stratified_sample_balanced(df_in, strat_col, n_total, random_state=42):
    if strat_col not in df_in.columns:
        # Fallback to simple sample
        n_take = min(n_total, len(df_in))
        return df_in.sample(n_take, random_state=random_state)

    df_in = df_in[df_in[strat_col].notna()].copy()
    if len(df_in) == 0:
        return df_in

    total = len(df_in)
    if n_total >= total:
        # If asking more than available, return all rows
        return df_in.sample(frac=1.0, random_state=random_state)

    sizes = df_in[strat_col].value_counts().sort_index()
    groups = df_in.groupby(strat_col)

    # Ideal proportional allocation
    ideal = sizes / total * n_total
    base = np.floor(ideal).astype(int)

    # Cap by availability
    cap = sizes
    alloc = base.clip(upper=cap)

    # Largest remainder method + capacity-aware fill to reach exactly n_total
    remaining = int(n_total - alloc.sum())
    remainders = (ideal - base)

    # First pass: distribute by largest remainders
    for key in remainders.sort_values(ascending=False).index:
        if remaining == 0:
            break
        if alloc[key] < cap[key]:
            alloc[key] += 1
            remaining -= 1

    # If still remaining, do capacity-aware round-robin until filled or no capacity left
    while remaining > 0:
        progressed = False
        for key in remainders.sort_values(ascending=False).index:
            if remaining == 0:
                break
            if alloc[key] < cap[key]:
                alloc[key] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # No more capacity anywhere

    # Draw samples per stratum deterministically
    parts = []
    for key, g in groups:
        k = int(alloc.get(key, 0))
        if k <= 0:
            continue
        if k >= len(g):
            parts.append(g)
        else:
            parts.append(g.sample(n=k, random_state=random_state))

    out = pd.concat(parts, axis=0)
    # Final shuffle for randomness while preserving reproducibility
    out = out.sample(frac=1.0, random_state=random_state).reset_index(drop=True)

    # Safety: trim in the rare case of over-allocation
    if len(out) > n_total:
        out = out.iloc[:n_total].copy()

    return out

# Configure dataset based on flag
if USE_FULL_DATA:
    df = df_spatial.copy()
    print(f"Using full dataset: {df.shape[0]:,} rows")
else:
    N_SAMPLE = 5_000  # target number of rows to take
    if 'BORO_NM' in df_spatial.columns and df_spatial['BORO_NM'].notna().any():
        df = stratified_sample_balanced(df_spatial, 'BORO_NM', N_SAMPLE, random_state=RANDOM_STATE).copy()
        print(f"Using stratified sample by BORO_NM: {df.shape[0]:,} rows out of {df_spatial.shape[0]:,} total")
        try:
            counts = df['BORO_NM'].value_counts().sort_index()
            props = (counts / len(df)).round(3)
            print("Stratum distribution (sample):")
            for name, count in counts.items():
                print(f"  {name}: {count:,} ({props[name]:.3f})")
        except Exception:
            pass
    else:
        n_take = min(N_SAMPLE, len(df_spatial))
        df = df_spatial.sample(n_take, random_state=RANDOM_STATE).copy()
        print(f"Using simple sample: {df.shape[0]:,} rows out of {df_spatial.shape[0]:,} total")

print(f"Dataset created: {df.shape[0]} rows out of {df_spatial.shape[0]} total")
pd.set_option("display.width", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 44)
print(df.head(1))

Using stratified sample by BORO_NM: 5,000 rows out of 560,819 total
Stratum distribution (sample):
  BRONX: 1,091 (0.218)
  BROOKLYN: 1,383 (0.277)
  MANHATTAN: 1,204 (0.241)
  QUEENS: 1,109 (0.222)
  STATEN ISLAND: 213 (0.043)
Dataset created: 5000 rows out of 560819 total
    BORO_NM  KY_CD   LAW_CAT_CD LOC_OF_OCCUR_DESC                 OFNS_DESC  \
0  BROOKLYN    348  MISDEMEANOR              REAR  VEHICLE AND TRAFFIC LAWS   

   PD_CD PREM_TYP_DESC SUSP_AGE_GROUP SUSP_RACE SUSP_SEX VIC_AGE_GROUP  \
0    916        STREET        UNKNOWN   UNKNOWN        U       UNKNOWN   

  VIC_RACE VIC_SEX   Latitude  Longitude  BAR_DISTANCE  NIGHTCLUB_DISTANCE  \
0    BLACK       E  40.653066 -73.889789    2932.76284         4077.648209   

   ATM_DISTANCE  ATMS_COUNT  BARS_COUNT  BUS_STOPS_COUNT  METROS_COUNT  \
0   4893.870711         0.0         0.0              1.0           0.0   

   NIGHTCLUBS_COUNT  SCHOOLS_COUNT  METRO_DISTANCE  MIN_POI_DISTANCE  \
0               0.0            0.0     
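
The allocation step inside `stratified_sample_balanced` follows the largest remainder method. The sketch below isolates that step with a hypothetical helper, `largest_remainder_allocation`, and invented stratum sizes. It assumes the requested total is smaller than the population, in which case no stratum can be asked for more rows than it holds, so the capacity checks of the full function are left out.

```python
import numpy as np
import pandas as pd

def largest_remainder_allocation(sizes: pd.Series, n_total: int) -> pd.Series:
    """Split n_total across strata proportionally to their sizes.

    Hypothetical helper for illustration; assumes n_total < sizes.sum().
    """
    ideal = sizes / sizes.sum() * n_total      # fractional quotas
    alloc = np.floor(ideal).astype(int)        # guaranteed whole units
    shortfall = int(n_total - alloc.sum())     # units still to hand out
    # Give the leftover units to the strata with the largest fractional parts
    order = (ideal - alloc).sort_values(ascending=False).index
    for key in order[:shortfall]:
        alloc[key] += 1
    return alloc

# Invented stratum sizes
strata_sizes = pd.Series({"A": 500, "B": 300, "C": 200, "D": 7})
allocation = largest_remainder_allocation(strata_sizes, n_total=10)
print(allocation.to_dict())
```

Plain rounding of the quotas could overshoot or undershoot the target; flooring first and then distributing the shortfall guarantees that the allocation sums exactly to `n_total`.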

# 3. K-Modes for Categorical Crime Patterns


In [9]:
# Import K-Modes clustering for categorical data
from kmodes.kmodes import KModes

# Import additional libraries for categorical clustering
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import LabelEncoder
import itertools

print("✓ K-Modes clustering library loaded")
print("K-Modes clustering setup completed")

✓ K-Modes clustering library loaded
K-Modes clustering setup completed


### Categorical Feature Preparation for K-Modes

K-Modes clustering is specifically designed for categorical data. We prepare our categorical features for pattern discovery in crime types, locations, and demographic information.
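
To make the idea concrete: K-Modes replaces the Euclidean distance of K-Means with a simple matching dissimilarity (the number of attributes on which two records differ) and replaces the cluster mean with the per-attribute mode. The sketch below illustrates both notions with plain pandas on invented records; it does not call the `kmodes` library, and `matching_dissimilarity` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def matching_dissimilarity(a, b) -> int:
    """Count the attributes on which two categorical records differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

# Invented records; column names follow the notebook
records = pd.DataFrame({
    "BORO_NM": ["BROOKLYN", "BROOKLYN", "QUEENS", "BROOKLYN"],
    "OFNS_DESC": ["PETIT LARCENY", "PETIT LARCENY", "ROBBERY", "PETIT LARCENY"],
    "PREM_TYP_DESC": ["STREET", "STORE", "STREET", "STREET"],
})

# The cluster "centroid" is the most frequent value of each column
mode = records.mode().iloc[0]

# Distance of every record to that mode
distances = [matching_dissimilarity(row, mode) for row in records.to_numpy()]
print(mode.to_dict())
print(distances)
```

Every attribute contributes either 0 or 1 to the distance, so a feature with many rare categories weighs exactly as much as a binary flag. This is one reason to collapse high-cardinality columns and to bin numeric ones before clustering.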

In [10]:
# Define categorical features for K-Modes clustering
# Base, highly interpretable police features
BASE_KMODES_FEATURES = ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']

# Additional categorical features directly useful for operations
# Kept: coarse time bucket, weekend/holiday flags
# Commented out: LAW_CAT_CD and LOC_OF_OCCUR_DESC (covered by OFNS_DESC and PREM_TYP_DESC),
# WEEKDAY (redundant with TIME_BUCKET), IS_PAYDAY (low impact), SAME_* flags (low operational signal)
EXTRA_CATEGORICAL_FEATURES = [
    # 'LAW_CAT_CD',               # Optional: not useful having OFNS_DESC
    # 'LOC_OF_OCCUR_DESC',        # Optional: not useful having PREM_TYP_DESC
    'TIME_BUCKET',              # Coarse time-of-day buckets
    'IS_WEEKEND', 'IS_HOLIDAY', # Operationally meaningful flags
    # 'WEEKDAY',               # Optional: more granular temporal, commented to reduce noise
    # 'IS_PAYDAY',            # Optional: typically low impact
    # 'SAME_AGE_GROUP',       # Optional: low operational value
    # 'SAME_SEX',             # Optional: low operational value
]

# Demographics: keep age/sex; exclude race to avoid bias and improve fairness
DEMOGRAPHIC_CATEGORICAL = [
    'SUSP_SEX', 'SUSP_AGE_GROUP',
    'VIC_SEX', 'VIC_AGE_GROUP',
    # 'SUSP_RACE', 'VIC_RACE',   # Excluded intentionally
]

# Numeric POI/distances transformed into interpretable categories
# Keep a compact, high-signal subset; comment out the rest
DISTANCE_COLS = [
    'METRO_DISTANCE',           # Kept: strong mobility/transit signal
    # 'BAR_DISTANCE', 'NIGHTCLUB_DISTANCE', 'ATM_DISTANCE',
    # 'MIN_POI_DISTANCE', 'AVG_POI_DISTANCE', 'MAX_POI_DISTANCE',
]
COUNT_COLS = [
    # 'TOTAL_POI_COUNT',
    # 'ATMS_COUNT', 'BARS_COUNT', 'BUS_STOPS_COUNT', 'METROS_COUNT',
    # 'NIGHTCLUBS_COUNT', 'SCHOOLS_COUNT',
]
SCORE_COLS = [
    'POI_DENSITY_SCORE',        # Kept: density proxy
    # 'POI_DIVERSITY',          # Optional, commented to simplify
]

print("=== CATEGORICAL FEATURE PREPARATION FOR K-MODES ===")

# Use the configured dataset
_df = df.copy()

# Helper: top-K mapping for high-cardinality categoricals
from collections import Counter

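def top_k_map(series, k=10):
    # Fill missing values before casting: astype(str) would turn NaN into the
    # literal string 'nan', which fillna can no longer detect
    if series.isna().all():
        return series.fillna('Unknown')
    vc = series.value_counts()
    top = set(vc.head(k).index)
    mapped = series.where(series.isin(top) | series.isna(), 'OTHER')
    return mapped.fillna('Unknown').astype(str)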

# Use ColumnBinner instead of inline binning helpers
binner_config = {
    'METRO_DISTANCE': {
        'kind': 'distance',
        'bins': [-np.inf, 250, 1000, np.inf],
        'labels': ['Near', 'Mid', 'Far']
    },
    'TOTAL_POI_COUNT': {
        'kind': 'count',
        'quantiles': 4,
        'labels': ['Low', 'Medium', 'High', 'VeryHigh'],
        'zero_label': 'Zero'
    },
    'POI_DENSITY_SCORE': {
        'kind': 'score',
        'quantiles': 4,
        'labels': ['Low', 'Medium', 'High', 'VeryHigh']
    }
}

# Instantiate and apply binner
column_binner = ColumnBinner(config=binner_config, suffix="_BIN", fill_unknown="Unknown")
column_binner.fit(_df)
_df = column_binner.transform(_df)

# Track created bins consistently
created_bins = [col for col in column_binner.get_feature_names_out() if col in _df.columns]

# Apply top-K mapping to selected demographics (race excluded)
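for col in DEMOGRAPHIC_CATEGORICAL:
    if col in _df.columns:
        k = 10
        # Fill missing values before the string cast so NaN does not become the literal 'nan'
        _df[col] = top_k_map(_df[col].fillna('Unknown').astype(str), k=k)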

# Ensure core categoricals are strings and filled
for col in BASE_KMODES_FEATURES + EXTRA_CATEGORICAL_FEATURES:
    if col in _df.columns:
        # fillna must run before astype(str); afterwards NaN is already the string 'nan'
        _df[col] = _df[col].fillna('Unknown').astype(str)

# Compose the final feature list for K-Modes
CATEGORICAL_FEATURES_KMODES = BASE_KMODES_FEATURES + [
    c for c in EXTRA_CATEGORICAL_FEATURES if c in _df.columns
] + created_bins + [c for c in DEMOGRAPHIC_CATEGORICAL if c in _df.columns]

print("Base categorical features:", BASE_KMODES_FEATURES)
print("Added operational categorical features:", [c for c in EXTRA_CATEGORICAL_FEATURES if c in _df.columns])
print("Added POI/context bins:", created_bins)
print("Added demographics (race excluded):", [c for c in DEMOGRAPHIC_CATEGORICAL if c in _df.columns])

# Check feature availability
categorical_available = [col for col in CATEGORICAL_FEATURES_KMODES if col in _df.columns]
print(f"Total categorical features for K-Modes: {len(categorical_available)}")

# Prepare dataset holders used downstream
if not categorical_available:
    raise ValueError("No categorical features available for K-Modes clustering")

# Keep a wide copy for labeling/ops; drop rows only if core/base features are missing
# (to avoid losing useful non-feature columns like HOUR, IS_WEEKEND later)
df_kmodes_input = _df.copy()
required_for_row = [c for c in BASE_KMODES_FEATURES if c in df_kmodes_input.columns]
df_kmodes = df_kmodes_input.dropna(subset=required_for_row).copy() if required_for_row else df_kmodes_input.copy()

# Build X_categorical as the feature matrix used by the pipeline
CATEGORICAL_FEATURES_KMODES_AVAILABLE = [c for c in CATEGORICAL_FEATURES_KMODES if c in df_kmodes.columns]
X_categorical = df_kmodes[CATEGORICAL_FEATURES_KMODES_AVAILABLE].astype(str).fillna('Unknown')

print(f"Rows available for K-Modes after base-feature check: {len(df_kmodes):,}")
print(f"Feature matrix shape: {X_categorical.shape}")
print(f"First feature columns: {CATEGORICAL_FEATURES_KMODES_AVAILABLE[:5]}")

=== CATEGORICAL FEATURE PREPARATION FOR K-MODES ===
Base categorical features: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']
Added operational categorical features: ['TIME_BUCKET', 'IS_WEEKEND', 'IS_HOLIDAY']
Added POI/context bins: [np.str_('METRO_DISTANCE_BIN'), np.str_('TOTAL_POI_COUNT_BIN'), np.str_('POI_DENSITY_SCORE_BIN')]
Added demographics (race excluded): ['SUSP_SEX', 'SUSP_AGE_GROUP', 'VIC_SEX', 'VIC_AGE_GROUP']
Total categorical features for K-Modes: 13
Rows available for K-Modes after base-feature check: 5,000
Feature matrix shape: (5000, 13)
First feature columns: ['BORO_

### K-Modes Pipeline Construction

Following the same modular pipeline approach as SpatialHotspotAnalysis, we create a preprocessing pipeline for categorical features and K-Modes clustering.
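
K-Modes replaces the Euclidean distance and mean of K-Means with a simple matching dissimilarity and a per-feature mode. The following is a minimal, library-free sketch of those two building blocks; the helper names `matching_dissimilarity` and `cluster_mode` are illustrative and are not part of the `kmodes` package, and the three records are made-up examples.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of features on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Most frequent category per feature; this plays the role of the centroid."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical records: (borough, offense, premise type)
records = [
    ("BROOKLYN", "PETIT LARCENY", "STREET"),
    ("BROOKLYN", "PETIT LARCENY", "STORE"),
    ("QUEENS", "PETIT LARCENY", "STREET"),
]
mode = cluster_mode(records)
distances = [matching_dissimilarity(r, mode) for r in records]
print(mode, distances)
```

Each record's distance to the mode is the number of attributes it would have to change to match the cluster's typical profile, which is the quantity the K-Modes cost sums over all samples.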

In [11]:
print("=== K-MODES PIPELINE CONSTRUCTION ===")

# Create preprocessing pipeline (following SpatialHotspotAnalysis structure)
# CategoricalPreprocessor is now imported from clustering_transformers
categorical_preprocessor = CategoricalPreprocessor(handle_missing='drop')

# K-Modes clustering pipeline (similar to DBSCAN pipeline in SpatialHotspotAnalysis)
kmodes_pipeline = Pipeline([
    ('preprocess', categorical_preprocessor),
    ('cluster', KModes(n_clusters=5, init='Huang', n_init=5, verbose=1, random_state=RANDOM_STATE))
])

print("✓ K-Modes pipeline constructed successfully")
print(f"Pipeline steps: {[step[0] for step in kmodes_pipeline.steps]}")

# Define parameter grid for systematic exploration
kmodes_param_grid = {
    'cluster__n_clusters': [3, 4, 5, 6, 7, 8, 9, 10],
    'cluster__init': ['Huang', 'Cao'],
    'cluster__n_init': [5, 10]
}

print(f"Parameter grid defined:")
print(f"  n_clusters: {kmodes_param_grid['cluster__n_clusters']}")
print(f"  init methods: {kmodes_param_grid['cluster__init']}")
print(f"  n_init: {kmodes_param_grid['cluster__n_init']}")
print(f"Total combinations: {len(list(ParameterGrid(kmodes_param_grid)))}")

# Prepare data for clustering (convert to numpy array as K-Modes expects)
print(f"\n=== DATA PREPARATION FOR K-MODES ===")
X_categorical_processed = categorical_preprocessor.fit_transform(X_categorical)
X_categorical_array = X_categorical_processed.values  # K-Modes requires numpy array

print(f"Processed data shape: {X_categorical_array.shape}")
print(f"Data type: {type(X_categorical_array)}")
print(f"First few samples:")
for i in range(min(3, len(X_categorical_array))):
    print(f"  Sample {i+1}: {X_categorical_array[i]}")

=== K-MODES PIPELINE CONSTRUCTION ===
✓ K-Modes pipeline constructed successfully
Pipeline steps: ['preprocess', 'cluster']
Parameter grid defined:
  n_clusters: [3, 4, 5, 6, 7, 8, 9, 10]
  init methods: ['Huang', 'Cao']
  n_init: [5, 10]
Total combinations: 32

=== DATA PREPARATION FOR K-MODES ===
Processed data shape: (5000, 13)
Data type: <class 'numpy.ndarray'>
First few samples:
  Sample 1: ['BROOKLYN' 'VEHICLE AND TRAFFIC LAWS' 'STREET' 'EVENING' '0' '0' 'Mid'
 'High' 'High' 'U' 'UNKNOWN' 'E' 'UNKNOWN']
  Sample 2: ['MANHATTAN' 'HARRASSMENT 2' 'COMMERCIAL BUILDING' 'MORNING' '0' '0'
 'Near' 'High' 'High' 'M' '25-44' 'F' '25-44']
  Sample 3: ['MANHATTAN' 'HARRASSMENT 2' 'STREET' 'AFTERNOON' '0' '0' 'Mid' 'Low'
 'Low' 'F' 'UNKNOWN' 'M' '<18']


### K-Modes Parameter Grid Search & Evaluation

Following the same systematic parameter optimization approach as SpatialHotspotAnalysis, we perform grid search to find optimal K-Modes parameters.
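
To make the ranking criterion concrete, here is a small stand-alone sketch of the grid size and of the composite score that the search loop computes (`1 / (1 + avg_dissimilarity) - cluster_balance / 100`). The metric values passed in below are hypothetical and chosen only to show how the trade-off behaves; `composite_score` is an illustrative helper, not a function defined in the notebook.

```python
from itertools import product

def composite_score(avg_dissimilarity, cluster_balance):
    """Higher is better: rewards compact clusters, penalises size imbalance."""
    return 1.0 / (1.0 + avg_dissimilarity) - cluster_balance / 100.0

# Same value lists as the K-Modes parameter grid
grid = {
    "n_clusters": [3, 4, 5, 6, 7, 8, 9, 10],
    "init": ["Huang", "Cao"],
    "n_init": [5, 10],
}
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))

# Hypothetical metric values, not results from this dataset
compact_but_skewed = composite_score(avg_dissimilarity=4.0, cluster_balance=20.0)
looser_but_even = composite_score(avg_dissimilarity=5.0, cluster_balance=2.0)
print(round(compact_but_skewed, 4), round(looser_but_even, 4))
```

Because the compactness term is bounded between 0 and 1 while the balance penalty grows without limit, a run whose largest cluster is 20 times its smallest loses 0.2 points, which can outweigh a sizeable gain in homogeneity.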

In [12]:
print("=== K-MODES PARAMETER GRID SEARCH ===")

# Grid search implementation (following SpatialHotspotAnalysis methodology)
kmodes_results = []
best_params = None
best_score = -np.inf

# Custom evaluation metric for categorical clustering
def evaluate_kmodes_clustering(X, labels, centroids):
    """
    Evaluate K-Modes clustering quality using multiple metrics.
    Similar to the evaluation approach in SpatialHotspotAnalysis.
    """
    n_clusters = len(np.unique(labels))
    n_samples = len(labels)
    
    # Basic cluster statistics
    cluster_sizes = pd.Series(labels).value_counts().sort_index()
    min_cluster_size = cluster_sizes.min()
    max_cluster_size = cluster_sizes.max()
    
    # Cluster balance (smaller is better for balanced clusters)
    cluster_balance = max_cluster_size / max(min_cluster_size, 1)
    
    # Intra-cluster homogeneity (categorical version of compactness):
    # count attribute mismatches between each sample and its cluster centroid (mode)
    total_dissimilarity = 0
    # Iterate over the labels that actually occur, so an empty cluster
    # cannot shift the centroid index
    for cluster_id in np.unique(labels):
        cluster_mask = labels == cluster_id
        # Vectorized matching (Hamming) distance of all members to the centroid
        total_dissimilarity += int(np.sum(X[cluster_mask] != centroids[cluster_id]))
    
    avg_dissimilarity = total_dissimilarity / max(n_samples, 1)
    
    return {
        'n_clusters': n_clusters,
        'min_cluster_size': min_cluster_size,
        'max_cluster_size': max_cluster_size,
        'cluster_balance': cluster_balance,
        'avg_dissimilarity': avg_dissimilarity
    }

print(f"Starting grid search with {len(list(ParameterGrid(kmodes_param_grid)))} parameter combinations...")

# Execute grid search
for i, params in enumerate(ParameterGrid(kmodes_param_grid)):
    print(f"\nTesting combination {i+1}: n_clusters={params['cluster__n_clusters']}, "
          f"init={params['cluster__init']}, n_init={params['cluster__n_init']}")
    
    try:
        # Set parameters and fit model
        kmodes_pipeline.set_params(**params)
        t0 = time.perf_counter()
        labels = kmodes_pipeline.fit_predict(X_categorical)
        runtime = time.perf_counter() - t0
        
        # Get centroids from the fitted model
        centroids = kmodes_pipeline.named_steps['cluster'].cluster_centroids_
        
        # Evaluate clustering quality
        eval_metrics = evaluate_kmodes_clustering(X_categorical_array, labels, centroids)
        
        # Calculate composite score (similar to SpatialHotspotAnalysis approach)
        # Lower dissimilarity and better balance = higher score
        composite_score = 1.0 / (1.0 + eval_metrics['avg_dissimilarity']) - eval_metrics['cluster_balance'] / 100.0
        
        # Store results
        result = {
            'n_clusters': params['cluster__n_clusters'],
            'init_method': params['cluster__init'],
            'n_init': params['cluster__n_init'],
            'runtime_s': runtime,
            'composite_score': composite_score,
            **eval_metrics
        }
        
        kmodes_results.append(result)
        
        # Track best parameters
        if composite_score > best_score:
            best_score = composite_score
            best_params = params.copy()
        
        print(f"  Runtime: {runtime:.2f}s")
        print(f"  Clusters: {eval_metrics['n_clusters']}")
        print(f"  Cluster sizes: {eval_metrics['min_cluster_size']}-{eval_metrics['max_cluster_size']}")
        print(f"  Balance ratio: {eval_metrics['cluster_balance']:.2f}")
        print(f"  Avg dissimilarity: {eval_metrics['avg_dissimilarity']:.3f}")
        print(f"  Composite score: {composite_score:.4f}")
        
    except Exception as e:
        print(f"  ❌ Failed: {str(e)}")
        continue

# Convert results to DataFrame for analysis (following SpatialHotspotAnalysis approach)
df_kmodes_results = pd.DataFrame(kmodes_results)

if not df_kmodes_results.empty:
    print(f"\n=== K-MODES GRID SEARCH RESULTS ===")
    print(f"Total successful runs: {len(df_kmodes_results)}")
    print(f"Best composite score: {best_score:.4f}")
    print(f"Best parameters: {best_params}")
    
    # Display top results
    top_results = df_kmodes_results.nlargest(5, 'composite_score')
    print(f"\nTop 5 parameter combinations:")
    print(top_results[['n_clusters', 'init_method', 'n_init', 'composite_score', 
                      'min_cluster_size', 'max_cluster_size', 'avg_dissimilarity']].round(4))
else:
    print("❌ No successful K-Modes runs completed")

=== K-MODES PARAMETER GRID SEARCH ===
Starting grid search with 32 parameter combinations...
\nTesting combination 1: n_clusters=3, init=Huang, n_init=5
Init: initializing centroids
Init: initializing centroids
Init: initializing clusters
Init: initializing clusters
Starting iterations...
Starting iterations...
Run 1, iteration: 1/100, moves: 1340, cost: 30492.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 1340, cost: 30492.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 1797, cost: 30391.0
Run 2, iteration: 1/100, moves: 1797, cost: 30391.0
Run 2, iteration: 2/100, moves: 347, cost: 30391.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 2/100, moves: 347, cost: 30391.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 1158, cost: 31601.0
Init:

### K-Modes Results Analysis & Pattern Discovery

We analyze the discovered categorical crime patterns and create detailed cluster profiles.
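
As a minimal sketch of what a cluster profile is, the following stand-alone example computes the most frequent values per feature within each cluster, together with their share. The helper `profile_clusters`, the sample records, and the labels are hypothetical; the notebook cell below does the equivalent with pandas `value_counts`.

```python
from collections import Counter, defaultdict

def profile_clusters(records, labels, feature_names, top_n=2):
    """Return {cluster: {feature: [(value, share), ...]}} with shares in [0, 1]."""
    by_cluster = defaultdict(list)
    for record, label in zip(records, labels):
        by_cluster[label].append(record)

    profiles = {}
    for label, members in sorted(by_cluster.items()):
        profiles[label] = {
            name: [(value, count / len(members))
                   for value, count in Counter(column).most_common(top_n)]
            for name, column in zip(feature_names, zip(*members))
        }
    return profiles

# Hypothetical records: (borough, time bucket) with made-up cluster labels
records = [
    ("MANHATTAN", "MORNING"),
    ("MANHATTAN", "EVENING"),
    ("MANHATTAN", "MORNING"),
    ("QUEENS", "EVENING"),
]
labels = [0, 0, 0, 1]
profiles = profile_clusters(records, labels, ["BORO_NM", "TIME_BUCKET"])
print(profiles[0]["TIME_BUCKET"])
```

Reading the shares alongside the centroid helps distinguish a feature that truly defines a cluster (share near 100%) from one where the mode only narrowly wins.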

In [13]:
if 'df_kmodes_results' in locals() and not df_kmodes_results.empty:
    print("=== K-MODES FINAL MODEL & PATTERN ANALYSIS ===")
    
    # Fit best model (following SpatialHotspotAnalysis approach)
    print(f"Fitting final K-Modes model with best parameters...")
    print(f"Best parameters: {best_params}")
    
    # Create and fit final pipeline
    final_kmodes_pipeline = Pipeline([
        ('preprocess', CategoricalPreprocessor(handle_missing='drop')),
        ('cluster', KModes(
            n_clusters=best_params['cluster__n_clusters'],
            init=best_params['cluster__init'],
            n_init=best_params['cluster__n_init'],
            verbose=1,
            random_state=RANDOM_STATE
        ))
    ])
    
    # Fit and predict
    final_labels = final_kmodes_pipeline.fit_predict(X_categorical)
    final_centroids = final_kmodes_pipeline.named_steps['cluster'].cluster_centroids_
    
    # Add cluster labels to original dataframe for analysis
    df_kmodes_labeled = df_kmodes.copy()
    df_kmodes_labeled['cluster'] = final_labels
    
    print(f"✓ Final model fitted successfully")
    print(f"Number of clusters: {len(np.unique(final_labels))}")
    print(f"Cluster distribution:")
    cluster_counts = pd.Series(final_labels).value_counts().sort_index()
    for cluster_id, count in cluster_counts.items():
        print(f"  Cluster {cluster_id}: {count} samples ({(count/len(final_labels))*100:.1f}%)")
    
    # Create detailed cluster profiles (similar to SpatialHotspotAnalysis cluster profiling)
    print(f"\n=== CATEGORICAL CRIME PATTERN PROFILES ===")
    
    cluster_profiles = []
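    # Use every column the model was fitted on, in order, so that feature_names[i]
    # always lines up with centroid[i] (filtering a name out here would shift the labels)
    feature_names = X_categorical.columns.tolist()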
    
    for cluster_id in sorted(np.unique(final_labels)):
        cluster_mask = final_labels == cluster_id
        cluster_data = df_kmodes_labeled[cluster_mask]
        cluster_size = cluster_mask.sum()
        
        print(f"\n--- CLUSTER {cluster_id} PROFILE ---")
        print(f"Size: {cluster_size} samples ({(cluster_size/len(final_labels))*100:.1f}%)")
        
        # Get centroid pattern
        centroid = final_centroids[cluster_id]
        print(f"Centroid pattern:")
        for i, feature in enumerate(feature_names):
            print(f"  {feature}: {centroid[i]}")
        
        # Top distributions per key features
        summary_cols = [
            'BORO_NM', 'PREM_TYP_DESC', 'OFNS_DESC',
            'TIME_BUCKET', 'IS_WEEKEND', 'IS_HOLIDAY',
            'METRO_DISTANCE_BIN', 'POI_DENSITY_SCORE_BIN',
        ]
        summary_cols = [c for c in summary_cols if c in cluster_data.columns]
        for col in summary_cols:
            dist = cluster_data[col].value_counts(normalize=True).head(5)
            print(f"Top {col}:")
            for val, pct in dist.items():
                print(f"  {val}: {pct*100:.1f}%")
        
        # Store a compact profile
        profile = {
            'cluster': int(cluster_id),
            'size': int(cluster_size),
        }
        for col in summary_cols:
            top_val = cluster_data[col].value_counts().idxmax()
            profile[f'top_{col.lower()}'] = str(top_val)
        cluster_profiles.append(profile)
    
    df_cluster_profiles = pd.DataFrame(cluster_profiles)
    print("\nSample cluster profiles:")
    print(df_cluster_profiles.head())

=== K-MODES FINAL MODEL & PATTERN ANALYSIS ===
Fitting final K-Modes model with best parameters...
Best parameters: {'cluster__init': 'Huang', 'cluster__n_clusters': 10, 'cluster__n_init': 10}
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 1609, cost: 25724.0
Run 1, iteration: 2/100, moves: 967, cost: 25240.0
Run 1, iteration: 3/100, moves: 523, cost: 25095.0
Run 1, iteration: 4/100, moves: 106, cost: 25095.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 1397, cost: 25557.0
Run 2, iteration: 2/100, moves: 558, cost: 25488.0
Run 2, iteration: 3/100, moves: 23, cost: 25488.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 1607, cost: 25351.0
Run 3, iteration: 2/100, moves: 375, cost: 25351.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 4, iteration: 1/100, m

---

# 4. Police-Focused Crime Intelligence Analysis

This section transforms the K-Modes clustering results into **actionable intelligence** for law enforcement operations. We create operational crime profiles, tactical recommendations, and priority assessment for police deployment.

## Crime Pattern Intelligence Reports

Transform abstract clustering results into concrete operational insights for police commanders and patrol units.

In [14]:
# Create binned context features (drop TOTAL_POI_COUNT_BIN entirely)
created_bins = []
if 'column_binner' in locals():
    _df = df_kmodes.copy()
    _df = column_binner.fit_transform(_df)
    created_bins = [col for col in column_binner.get_feature_names_out() if col in _df.columns]
    # Remove any accidental legacy bins
    created_bins = [c for c in created_bins if c != 'TOTAL_POI_COUNT_BIN']
    df_kmodes = _df

# Print concise list of bins actually used
print("Added POI/context bins:", [c for c in created_bins])

# Build final categorical feature list for kmodes (no TOTAL_POI_COUNT_BIN)
CATEGORICAL_FEATURES_KMODES = [
    *[c for c in BASE_KMODES_FEATURES if c in df_kmodes.columns],
    *[c for c in EXTRA_CATEGORICAL_FEATURES if c in df_kmodes.columns],
    *[c for c in DEMOGRAPHIC_CATEGORICAL if c in df_kmodes.columns],
    *[c for c in created_bins if c in df_kmodes.columns]
]

# Ensure TOTAL_POI_COUNT_BIN is not present
CATEGORICAL_FEATURES_KMODES = [c for c in CATEGORICAL_FEATURES_KMODES if c != 'TOTAL_POI_COUNT_BIN']

# Prepare X for K-Modes
X_categorical = df_kmodes[CATEGORICAL_FEATURES_KMODES].astype(str).fillna('Unknown')
X_categorical_processed = X_categorical
X_categorical_array = X_categorical_processed.values
print(f"X_categorical shape: {X_categorical.shape}")

Added POI/context bins: [np.str_('METRO_DISTANCE_BIN'), np.str_('POI_DENSITY_SCORE_BIN')]
X_categorical shape: (5000, 12)


In [15]:
# Binner config
binner_config = {
    'METRO_DISTANCE': {
        'kind': 'distance',
        'bins': [0, 250, 1000, np.inf],
        'labels': ['Near', 'Mid', 'Far']
    },
    'POI_DENSITY_SCORE': {
        'kind': 'score', 'quantiles': 4,
        'labels': ['Low','Medium','High','VeryHigh']
    }
}

column_binner = ColumnBinner(config=binner_config, suffix="_BIN", fill_unknown="Unknown")

In [21]:
# === POLICE EXECUTIVE DASHBOARD ===
print("\n" + "="*60)
print("📊 EXECUTIVE CRIME INTELLIGENCE DASHBOARD")
print("="*60)

if 'df_operational' in locals() and not df_operational.empty:
    
    # Executive summary statistics
    total_crimes = df_operational['crime_count'].sum()
    high_priority_patterns = len(df_operational[df_operational['priority'] == 'HIGH'])
    high_priority_crimes = df_operational[df_operational['priority'] == 'HIGH']['crime_count'].sum()
    
    most_concentrated = df_operational.iloc[0]  # Already sorted by concentration
    highest_volume = df_operational.loc[df_operational['crime_count'].idxmax()]
    
    print(f"\n📈 EXECUTIVE SUMMARY")
    print(f"   🔢 Total Crimes Analyzed: {total_crimes:,}")
    print(f"   🎯 Crime Patterns Identified: {len(df_operational)}")
    print(f"   🚨 High Priority Patterns: {high_priority_patterns}")
    print(f"   📊 High Priority Crime Volume: {high_priority_crimes:,} ({(high_priority_crimes/total_crimes)*100:.1f}%)")
    
    print(f"\n🔍 KEY INSIGHTS")
    print(f"   🎯 Most Concentrated Pattern: {most_concentrated['primary_crime']} in {most_concentrated['primary_borough']}")
    print(f"      └── {most_concentrated['concentration_score']:.0%} concentration, {most_concentrated['crime_count']:,} crimes")
    print(f"   📈 Highest Volume Pattern: {highest_volume['primary_crime']} in {highest_volume['primary_borough']}")
    print(f"      └── {highest_volume['crime_count']:,} crimes ({(highest_volume['crime_count']/total_crimes)*100:.1f}% of total)")
    
    # Borough-level intelligence
    print(f"\n🗺️ BOROUGH CRIME INTELLIGENCE")
    cluster_data_all = df_kmodes_labeled.copy()
    
    # Add cluster priority information to the main dataset
    priority_map = dict(zip(df_operational['cluster_id'], df_operational['priority']))
    cluster_data_all['priority'] = cluster_data_all['cluster'].map(priority_map)
    
    borough_intelligence = cluster_data_all.groupby('BORO_NM').agg({
        'cluster': 'count',
        'priority': lambda x: (x == 'HIGH').sum()
    }).rename(columns={'cluster': 'total_crimes', 'priority': 'high_priority_crimes'})
    
    borough_intelligence['high_priority_pct'] = (borough_intelligence['high_priority_crimes'] / 
                                               borough_intelligence['total_crimes'] * 100).round(1)
    
    borough_intelligence = borough_intelligence.sort_values('high_priority_crimes', ascending=False)
    
    for borough, stats in borough_intelligence.iterrows():
        print(f"   📍 {borough}:")
        # iterrows() upcasts the mixed int/float row to float, so cast the counts back to int
        print(f"      └── Total: {int(stats['total_crimes']):,} | High Priority: {int(stats['high_priority_crimes']):,} ({stats['high_priority_pct']:.1f}%)")
    
    # Crime type intelligence
    print(f"\n🔍 CRIME TYPE INTELLIGENCE")
    crime_intelligence = cluster_data_all.groupby('OFNS_DESC').agg({
        'cluster': 'count',
        'priority': lambda x: (x == 'HIGH').sum()
    }).rename(columns={'cluster': 'total_crimes', 'priority': 'high_priority_crimes'})
    
    crime_intelligence['high_priority_pct'] = (crime_intelligence['high_priority_crimes'] / 
                                             crime_intelligence['total_crimes'] * 100).round(1)
    
    top_crimes = crime_intelligence.sort_values('total_crimes', ascending=False).head(5)
    
    for crime_type, stats in top_crimes.iterrows():
        print(f"   {crime_type}:")
        # iterrows() returns the counts as floats; cast to int for clean output
        print(f"      └── {int(stats['total_crimes']):,} crimes | {int(stats['high_priority_crimes']):,} high priority ({stats['high_priority_pct']:.1f}%)")
    
    # Operational recommendations summary
    print(f"\n⚡ IMMEDIATE ACTION ITEMS")
    
    # Get high priority patterns for immediate action
    high_priority_df = df_operational[df_operational['priority'] == 'HIGH'].head(3)
    
    if not high_priority_df.empty:
        print("   🚨 DEPLOY IMMEDIATELY:")
        for i, (_, pattern) in enumerate(high_priority_df.iterrows(), 1):
            print(f"      {i}. {pattern['primary_crime']} in {pattern['primary_borough']}")
            print(f"         └── {pattern['crime_count']:,} crimes, {pattern['concentration_score']:.0%} concentration")
    
    # Medium priority for planning
    medium_priority_df = df_operational[df_operational['priority'].isin(['MEDIUM-HIGH', 'MEDIUM'])].head(2)
    
    if not medium_priority_df.empty:
        print("   📋 PLAN ENHANCED OPERATIONS:")
        for i, (_, pattern) in enumerate(medium_priority_df.iterrows(), 1):
            print(f"      {i}. {pattern['primary_crime']} in {pattern['primary_borough']}")
            print(f"         └── {pattern['crime_count']:,} crimes")
    
    # Resource allocation recommendation
    print(f"\n💰 RESOURCE ALLOCATION RECOMMENDATION")
    total_budget_crimes = df_operational[df_operational['priority'].isin(['HIGH', 'MEDIUM-HIGH'])]['crime_count'].sum()
    print(f"   🎯 Focus 80% of resources on {len(df_operational[df_operational['priority'].isin(['HIGH', 'MEDIUM-HIGH'])])} patterns")
    print(f"   📊 Targeting {total_budget_crimes:,} crimes ({(total_budget_crimes/total_crimes)*100:.1f}% of total volume)")
    print(f"   💡 Expected Result: Maximum impact with focused deployment")
    
    # Create executive summary for export
    executive_summary = {
        'analysis_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M'),
        'total_crimes_analyzed': int(total_crimes),
        'patterns_identified': int(len(df_operational)),
        'high_priority_patterns': int(high_priority_patterns),
        'high_priority_crime_percentage': round((high_priority_crimes/total_crimes)*100, 1),
        'most_concentrated_pattern': {
            'crime_type': most_concentrated['primary_crime'],
            'location': most_concentrated['primary_borough'],
            'concentration': f"{most_concentrated['concentration_score']:.0%}",
            'volume': int(most_concentrated['crime_count'])
        },
        'highest_volume_pattern': {
            'crime_type': highest_volume['primary_crime'],
            'location': highest_volume['primary_borough'],
            'volume': int(highest_volume['crime_count']),
            'percentage_of_total': round((highest_volume['crime_count']/total_crimes)*100, 1)
        },
        'immediate_deployment_needed': high_priority_patterns > 0,
        'resource_focus_patterns': int(len(df_operational[df_operational['priority'].isin(['HIGH', 'MEDIUM-HIGH'])]))
    }
    
    # Save executive summary
    with open(os.path.join(output_dir, 'executive_crime_summary.json'), 'w') as f:
        json.dump(executive_summary, f, indent=2, default=str)
    
    print(f"\n✅ Executive summary saved to: executive_crime_summary.json")
    print(f"📁 All police intelligence reports saved to: {output_dir}")
    
else:
    print("❌ No operational data available for executive dashboard")


📊 EXECUTIVE CRIME INTELLIGENCE DASHBOARD

📈 EXECUTIVE SUMMARY
   🔢 Total Crimes Analyzed: 5,000
   🎯 Crime Patterns Identified: 10
   🚨 High Priority Patterns: 0
   📊 High Priority Crime Volume: 0 (0.0%)

🔍 KEY INSIGHTS
   🎯 Most Concentrated Pattern: PETIT LARCENY in QUEENS
      └── 55% concentration, 363 crimes
   📈 Highest Volume Pattern: PETIT LARCENY in MANHATTAN
      └── 739 crimes (14.8% of total)

🗺️ BOROUGH CRIME INTELLIGENCE
   📍 BRONX:
      └── Total: 1,091.0 | High Priority: 0.0 (0.0%)
   📍 BROOKLYN:
      └── Total: 1,383.0 | High Priority: 0.0 (0.0%)
   📍 MANHATTAN:
      └── Total: 1,204.0 | High Priority: 0.0 (0.0%)
   📍 QUEENS:
      └── Total: 1,109.0 | High Priority: 0.0 (0.0%)
   📍 STATEN ISLAND:
      └── Total: 213.0 | High Priority: 0.0 (0.0%)

🔍 CRIME TYPE INTELLIGENCE
   PETIT LARCENY:
      └── 916.0 crimes | 0.0 high priority (0.0%)
   HARRASSMENT 2:
      ‚î

In [22]:
# === POLICE-READY DELIVERABLES SUMMARY ===
print("\n" + "="*60)
print("📋 POLICE-READY DELIVERABLES GENERATED")
print("="*60)

print("\n🎯 The following actionable intelligence reports have been generated:")
print("\n1. 📊 OPERATIONAL INTELLIGENCE REPORT")
print("   📁 File: police_operational_intelligence.csv")
print("   📝 Content: Crime patterns with priority levels, concentration scores, and volume analysis")
print("   👮 Use: Daily briefings, resource allocation planning, patrol deployment decisions")

print("\n2. 🎯 TACTICAL RECOMMENDATIONS")
print("   📁 File: tactical_recommendations.json")
print("   📝 Content: Specific tactical advice for each crime pattern (deployment strategies, focus areas)")
print("   👮 Use: Field operations planning, specialized unit deployment, tactical decision making")

print("\n3. 📈 EXECUTIVE SUMMARY")
print("   📁 File: executive_crime_summary.json")
print("   📝 Content: High-level intelligence summary for command staff")
print("   👮 Use: Budget planning, strategic decisions, performance metrics, public reporting")

print("\n4. 🗂️ DETAILED CLUSTER DATA")
print("   📁 File: kmodes_clustered_data.csv")
print("   📝 Content: Full crime dataset with cluster assignments for detailed analysis")
print("   👮 Use: Detective investigations, pattern analysis, evidence correlation")

print(f"\n📍 All files saved to: {output_dir}")

# Quick verification of file sizes
import os
files_info = []
expected_files = [
    'police_operational_intelligence.csv',
    'tactical_recommendations.json', 
    'executive_crime_summary.json',
    'kmodes_clustered_data.csv'
]

for filename in expected_files:
    filepath = os.path.join(output_dir, filename)
    if os.path.exists(filepath):
        size_kb = os.path.getsize(filepath) / 1024
        files_info.append(f"   ✅ {filename} ({size_kb:.1f} KB)")
    else:
        files_info.append(f"   ❌ {filename} (not found)")

print(f"\n📁 FILE STATUS:")
for info in files_info:
    print(info)

print(f"\n🚀 NEXT STEPS FOR POLICE IMPLEMENTATION:")
print("   1. 📋 Review operational intelligence report for immediate deployment decisions")
print("   2. 🎯 Implement tactical recommendations for high-priority patterns")
print("   3. 📊 Use executive summary for resource allocation and strategic planning")
print("   4. 🔄 Establish regular analysis schedule (weekly/monthly) for updated intelligence")
print("   5. 📈 Track effectiveness of deployments and adjust strategies based on results")

print(f"\n✅ MISSION ACCOMPLISHED: Crime clustering analysis now provides direct operational value to law enforcement!")


📋 POLICE-READY DELIVERABLES GENERATED

🎯 The following actionable intelligence reports have been generated:

1. 📊 OPERATIONAL INTELLIGENCE REPORT
   📁 File: police_operational_intelligence.csv
   📝 Content: Crime patterns with priority levels, concentration scores, and volume analysis
   👮 Use: Daily briefings, resource allocation planning, patrol deployment decisions

2. 🎯 TACTICAL RECOMMENDATIONS
   📁 File: tactical_recommendations.json
   📝 Content: Specific tactical advice for each crime pattern (deployment strategies, focus areas)
   👮 Use: Field operations planning, specialized unit deployment, tactical decision making

3. 📈 EXECUTIVE SUMMARY
   📁 File: executive_crime_summary.json
   📝 Content: High-level intelligence summary for command staff
   👮 Use: Budget planning, strategic decisions, performance metrics, public reporting

4. 🗂️ DETAILED CLUSTER DATA
   📁 File: kmodes_clustered_data.csv
   📝 Content: Full crime dataset 

---

# 5. Advanced Clustering Methods

This section explores advanced clustering techniques for discovering complex crime patterns that might be missed by traditional methods. These approaches complement the operational analysis above and provide research-grade insights for academic and advanced analytical purposes.


## Categorical Dimensionality Reduction + Clustering

Categorical dimensionality reduction transforms high-cardinality categorical data into a lower-dimensional continuous space, enabling the application of distance-based clustering algorithms while preserving the essential categorical relationships. 

The **Categorical Dimensionality Reduction Pipeline** follows our established architecture and uses a robust **OneHot + PCA approach** instead of traditional MCA:

`CategoricalPreprocessor → CategoricalDimensionalityReducer → KMeans`

**Why OneHot + PCA instead of MCA?**
- **Numerical Stability**: No NaN values produced during transformation
- **Robust Implementation**: Well-tested sklearn components
- **Consistent Results**: Reproducible across different data distributions
- **Better Performance**: More efficient and scalable for large datasets
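
The general OneHot + PCA idea can be sketched without any library. The internals of `CategoricalDimensionalityReducer` are not shown in this notebook, so the example below illustrates the technique in general and makes no claim about that class. The helper names are hypothetical, and power iteration stands in for a full PCA: it recovers only the first principal component.

```python
import math
import random

def one_hot(rows):
    """Expand categorical rows into 0/1 indicator columns."""
    columns = sorted({(j, value) for row in rows for j, value in enumerate(row)})
    matrix = [[1.0 if row[j] == value else 0.0 for j, value in columns] for row in rows]
    return matrix, columns

def first_component_scores(matrix, n_iter=100, seed=0):
    """Project centred rows onto the leading principal axis (power iteration).

    Assumes the rows are not all identical, otherwise the variance is zero.
    """
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[c] for row in matrix) / n for c in range(d)]
    centred = [[row[c] - means[c] for c in range(d)] for row in matrix]

    rng = random.Random(seed)
    axis = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(n_iter):
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centred]
        # Multiply by X^T X, i.e. the covariance matrix up to a constant factor
        axis = [sum(s * row[c] for s, row in zip(scores, centred)) for c in range(d)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    return [sum(x * a for x, a in zip(row, axis)) for row in centred]

# Hypothetical records: (borough, premise type)
rows = [
    ("BROOKLYN", "STREET"),
    ("BROOKLYN", "STREET"),
    ("MANHATTAN", "STORE"),
    ("MANHATTAN", "STORE"),
]
matrix, columns = one_hot(rows)
scores = first_component_scores(matrix)
print(columns)
print([round(s, 3) for s in scores])
```

Identical records receive identical coordinates and the two groups land on opposite sides of zero, which is what lets a distance-based algorithm such as KMeans separate them afterwards. The sign of the axis is arbitrary.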

In [18]:
print("=== CATEGORICAL DIMENSIONALITY REDUCTION + KMEANS ===")

# Ensure Utilities path is available
import sys, os
if 'utilities_path' in globals():
    if utilities_path not in sys.path and os.path.isdir(utilities_path):
        sys.path.append(utilities_path)
else:
    # Fallbacks
    for candidate in [
        os.path.join(os.getcwd(), "Notebooks", "Clustering", "Utilities"),
        os.path.join(os.path.dirname(os.getcwd()), "Notebooks", "Clustering", "Utilities")
    ]:
        if os.path.isdir(candidate) and candidate not in sys.path:
            sys.path.append(candidate)

# Import the custom dimensionality reducer from clustering utilities
try:
    from clustering_transformers import CategoricalDimensionalityReducer
except Exception as e:
    try:
        from Utilities.clustering_transformers import CategoricalDimensionalityReducer
    except Exception as e2:
        print("Warning: CategoricalDimensionalityReducer import failed:", e2)

# Pipeline: OneHot+PCA-like reducer + KMeans
categorical_dimred_pipeline = Pipeline([
    ('dimred', CategoricalDimensionalityReducer(n_components=5, random_state=RANDOM_STATE)),
    ('cluster', KMeans(n_clusters=5, n_init=10, random_state=RANDOM_STATE))
])
print("✓ Pipeline constructed")

# Parameter grid
cat_dimred_param_grid = {
    'dimred__n_components': [3, 5, 8],
    'cluster__n_clusters': [3, 4, 5, 6, 7, 8],
    'cluster__n_init': [5, 10]
}
print("Grid ready:", cat_dimred_param_grid)


=== CATEGORICAL DIMENSIONALITY REDUCTION + KMEANS ===
✓ Pipeline constructed
Grid ready: {'dimred__n_components': [3, 5, 8], 'cluster__n_clusters': [3, 4, 5, 6, 7, 8], 'cluster__n_init': [5, 10]}


In [19]:
print("=== SPECTRAL CLUSTERING (k-NN affinity) ===")

spectral_pipeline = Pipeline([
    ('preprocess', IdentityPreprocessor()),
    ('cluster', SpectralClustering(
        n_clusters=4,
        affinity='nearest_neighbors',
        n_neighbors=15,
        assign_labels='kmeans',
        n_init=5,
        random_state=RANDOM_STATE,
    ))
])

print("✓ Spectral pipeline constructed")

spectral_param_grid = {
    'cluster__n_clusters': [3, 4, 5, 6, 7],
    'cluster__n_neighbors': [10, 15, 20],
}

print("Grid:")
print(f"  n_clusters: {spectral_param_grid['cluster__n_clusters']}")
print(f"  n_neighbors: {spectral_param_grid['cluster__n_neighbors']}")
print(f"Total combinations: {len(list(ParameterGrid(spectral_param_grid)))}")

=== SPECTRAL CLUSTERING (k-NN affinity) ===


NameError: name 'IdentityPreprocessor' is not defined

## Spectral Clustering for Non-Convex Patterns

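Spectral clustering embeds the data using the leading eigenvectors of a graph Laplacian built from a pairwise similarity (affinity) matrix, then clusters the embedded points. Because it relies on connectivity rather than distance to a centroid, it can recover non-convex, irregularly shaped clusters that centroid-based methods such as K-Means tend to miss.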
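
A small synthetic example makes the non-convex case concrete. The sketch below uses scikit-learn's two-moons toy dataset, not the crime data, and compares K-Means with a k-NN-affinity `SpectralClustering` by agreement with the known moon membership. The sample size, noise level, and `n_neighbors` value are illustrative assumptions.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a classic non-convex cluster shape
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)

# Centroid-based baseline
kmeans_moon_labels = KMeans(
    n_clusters=2, n_init=10, random_state=0
).fit_predict(X_moons)

# Graph-based clustering on a sparse k-nearest-neighbors affinity
spectral_moon_labels = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    assign_labels="kmeans",
    random_state=0,
).fit_predict(X_moons)

# Adjusted Rand Index against the true moon membership (1.0 = perfect match)
ari_kmeans = adjusted_rand_score(y_moons, kmeans_moon_labels)
ari_spectral = adjusted_rand_score(y_moons, spectral_moon_labels)
print(f"K-Means ARI:  {ari_kmeans:.3f}")
print(f"Spectral ARI: {ari_spectral:.3f}")
```

On data like this the spectral result should follow each moon, while K-Means splits the plane with a straight boundary. Real crime features are noisier, so the gap there may be much smaller.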

The **Spectral Pipeline** handles mixed categorical and numerical features:

`MixedFeaturePreprocessor → SpectralClustering`

The MixedFeaturePreprocessor combines categorical preprocessing (via CategoricalPreprocessor) with numerical feature standardization, maintaining consistency with our pipeline architecture.
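
As a rough picture of what such mixed preprocessing involves, the sketch below builds a stand-in from scikit-learn's `ColumnTransformer`. It is not the `MixedFeaturePreprocessor` from the utilities module, whose internals are not shown here: the one-hot encoding choice is an assumption, and the toy rows are made up, reusing only column names that appear in this notebook's output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (hypothetical values) with one categorical and two numerical columns
toy_mixed = pd.DataFrame({
    "BORO_NM": ["BRONX", "QUEENS", "BRONX", "MANHATTAN"],
    "HOUR": [1, 13, 22, 8],
    "METRO_DISTANCE": [120.0, 450.0, 80.0, 300.0],
})

# Hypothetical stand-in: one-hot encode categoricals, standardize numericals
mixed_sketch = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["BORO_NM"]),
        ("num", StandardScaler(), ["HOUR", "METRO_DISTANCE"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_mixed_sketch = mixed_sketch.fit_transform(toy_mixed)
print(X_mixed_sketch.shape)  # 3 one-hot columns + 2 scaled columns
```

Standardizing the numerical block keeps distance-based neighbors from being dominated by large-valued columns such as distances in metres, which matters for a k-NN affinity graph.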

In [None]:
# === SPECTRAL CLUSTERING FOR NON-CONVEX PATTERNS ===
print("=== SPECTRAL CLUSTERING ANALYSIS ===")

from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold


def spectral_clustering_evaluation(pipeline, X, param_grid, cv=3, random_state=42):
    """
    Custom evaluation for SpectralClustering which doesn't have predict() method.
    SpectralClustering is a transductive method that only works on the data it was trained on.
    """
    print(f"\n🔄 SPECTRAL CLUSTERING EVALUATION (CUSTOM)")
    print(f"Dataset shape: {X.shape}")
    print(f"CV folds: {cv}")
    print(f"Parameter combinations: {len(list(ParameterGrid(param_grid)))}")
    print("-" * 50)

    kf = KFold(n_splits=cv, shuffle=True, random_state=random_state)
    results = []

    for i, params in enumerate(ParameterGrid(param_grid)):
        print(f"\nTesting combination {i+1}: {params}")

        # For SpectralClustering, we evaluate on the full fold (not train/validation split)
        fold_silhouettes = []

        for fold, (_, fold_idx) in enumerate(kf.split(X)):
            X_fold = X.iloc[fold_idx]

            try:
                # Fit pipeline on fold data
                pipeline_copy = clone(pipeline)
                pipeline_copy.set_params(**params)

                # For SpectralClustering, fit_predict gives us the labels directly
                fold_labels = pipeline_copy.fit_predict(X_fold)

                # Skip if only one cluster found
                if len(np.unique(fold_labels)) < 2:
                    print(f"    Fold {fold+1}: Only one cluster found, skipping")
                    continue

                # Reuse the preprocessor already fitted by fit_predict so the metric
                # is computed on the same feature space the clusterer saw
                X_fold_transformed = pipeline_copy.named_steps['preprocess'].transform(X_fold)

                # Convert to numpy array if needed
                if hasattr(X_fold_transformed, 'values'):
                    X_fold_array = X_fold_transformed.values
                else:
                    X_fold_array = X_fold_transformed

                # Calculate clustering metrics
                sil_score = silhouette_score(X_fold_array, fold_labels)
                fold_silhouettes.append(sil_score)

                print(f"    Fold {fold+1}: Silhouette={sil_score:.3f}")

            except Exception as e:
                print(f"    Fold {fold+1} failed: {e}")
                continue

        # Aggregate results across folds
        if fold_silhouettes:
            mean_silhouette = np.mean(fold_silhouettes)
            std_silhouette = np.std(fold_silhouettes)

            # Composite score: silhouette only
            composite_score = mean_silhouette

            results.append({
                'params': params,
                'cv_silhouette_mean': mean_silhouette,
                'cv_silhouette_std': std_silhouette,
                'composite_score': composite_score,
                'n_successful_folds': len(fold_silhouettes)
            })

            print(f"  ✓ Results: Silhouette={mean_silhouette:.3f}±{std_silhouette:.3f}, Composite={composite_score:.4f}")
        else:
            print(f"  ❌ No successful folds for this parameter combination")

    df_results = pd.DataFrame(results)
    if not df_results.empty:
        df_results = df_results.sort_values('composite_score', ascending=False)
    return df_results


if 'df_kmodes_labeled' in locals() and not df_kmodes_labeled.empty:

    print("\n🔬 ADVANCED RESEARCH METHOD: SPECTRAL CLUSTERING PIPELINE")
    print("="*55)

    # Following the same pipeline structure as K-Modes and categorical dimred
    print(f"Constructing Spectral Clustering pipeline...")

    # Prepare input data with mixed features (following preprocessing approach)
    X_mixed_input = df_kmodes_labeled.copy()

    # Exclude raw coordinates from clustering features
    _excluded_coords = ['Latitude', 'Longitude']
    X_mixed_input.drop(columns=_excluded_coords, errors='ignore', inplace=True)
    actually_excluded = [c for c in _excluded_coords if c in df_kmodes_labeled.columns]
    print(f"Excluded raw coordinate columns from clustering: {actually_excluded}")

    print(f"Input data shape: {X_mixed_input.shape}")
    print(f"Available categorical features: {CATEGORICAL_FEATURES_KMODES}")
    print(f"Available numerical features: {TEMPORAL_FEATURES + SPATIAL_CONTEXT_FEATURES[:5]}")

    # Construct Spectral pipeline with sparse nearest-neighbors affinity for scalability
    spectral_pipeline = Pipeline([
        ('preprocess', MixedFeaturePreprocessor(max_categorical_features=20, max_numerical_features=10)),
        ('cluster', SpectralClustering(
            random_state=RANDOM_STATE,
            affinity='nearest_neighbors',
            n_neighbors=20,
            assign_labels='kmeans',
            n_init=5
        ))
    ])

    print(f"✓ Spectral pipeline constructed")
    print(f"Pipeline steps: {[step[0] for step in spectral_pipeline.steps]}")

    # Lean parameter grid for computational efficiency
    spectral_param_grid = {
        'cluster__n_clusters': [3, 4, 5],
        'cluster__n_neighbors': [10, 15, 20]
    }

    print(f"\nParameter grid defined:")
    print(f"  n_clusters: {spectral_param_grid['cluster__n_clusters']}")
    print(f"  n_neighbors: {spectral_param_grid['cluster__n_neighbors']}")

    # Execute custom evaluation
    print(f"\n🔄 SPECTRAL CLUSTERING EVALUATION")
    print("-" * 40)

    # Use custom evaluation function for SpectralClustering
    df_spectral_results = spectral_clustering_evaluation(
        pipeline=spectral_pipeline,
        X=X_mixed_input,
        param_grid=spectral_param_grid,
        cv=3,  # 3-fold CV for computational efficiency
        random_state=RANDOM_STATE
    )

    if not df_spectral_results.empty:
        # Get best parameters from evaluation
        best_spectral_params = df_spectral_results.iloc[0]['params']
        best_spectral_score = df_spectral_results.iloc[0]['composite_score']

        print(f"\n📊 TOP SPECTRAL CLUSTERING RESULTS")
        print(f"Best parameters: {best_spectral_params}")
        print(f"Best score: {best_spectral_score:.4f}")

        # Display top 3 results
        top_spectral_results = df_spectral_results.head(3)
        print(f"\nTop 3 parameter combinations:")
        display_cols = ['cv_silhouette_mean', 'composite_score']
        print(top_spectral_results[['params'] + display_cols].round(4))

        # Fit final model on full dataset
        print(f"\n🎯 FITTING FINAL SPECTRAL MODEL")
        final_spectral_pipeline = clone(spectral_pipeline)
        final_spectral_pipeline.set_params(**best_spectral_params)

        # Use fit_predict for SpectralClustering
        final_spectral_labels = final_spectral_pipeline.fit_predict(X_mixed_input)

        # Calculate final metrics on the features produced by the already-fitted preprocessor
        X_transformed = final_spectral_pipeline.named_steps['preprocess'].transform(X_mixed_input)
        if hasattr(X_transformed, 'values'):
            X_array = X_transformed.values
        else:
            X_array = X_transformed

        spectral_final_metrics = {
            'silhouette_score': silhouette_score(X_array, final_spectral_labels),
            'n_clusters': len(np.unique(final_spectral_labels)),
            'cluster_sizes': np.bincount(final_spectral_labels).tolist()
        }

        print(f"Final model performance:")
        print(f"  Silhouette Score: {spectral_final_metrics['silhouette_score']:.4f}")
        print(f"  Number of clusters: {spectral_final_metrics['n_clusters']}")
        print(f"  Cluster sizes: {spectral_final_metrics['cluster_sizes']}")

        # Agreement analysis with other methods (optional, only if available in session)
        if 'final_labels' in locals():
            print(f"\n🔍 SPECTRAL vs K-MODES COMPARISON")
            final_ari_kmodes = adjusted_rand_score(final_labels, final_spectral_labels)
            final_ami_kmodes = adjusted_mutual_info_score(final_labels, final_spectral_labels)

            print(f"Spectral vs K-Modes agreement:")
            print(f"  Adjusted Rand Index: {final_ari_kmodes:.3f}")
            print(f"  Adjusted Mutual Information: {final_ami_kmodes:.3f}")

        # Save Spectral results
        spectral_analysis_results = {
            'best_parameters': best_spectral_params,
            'best_score': best_spectral_score,
            'final_metrics': spectral_final_metrics,
            'evaluation_summary': {
                'total_combinations_tested': len(df_spectral_results),
                'cv_folds': 3,
                'evaluation_method': 'custom_spectral_evaluation',
                'note': 'Nearest-neighbors affinity for efficiency'
            },
            'detailed_results': df_spectral_results.to_dict('records')
        }

        with open(os.path.join(output_dir, 'spectral_analysis_results.json'), 'w') as f:
            json.dump(spectral_analysis_results, f, indent=2, default=str)

        print(f"\n✅ Spectral analysis results saved to: spectral_analysis_results.json")

    else:
        print("❌ No successful Spectral runs completed")

else:
    print("⚠️ Skipping Spectral analysis - no K-Modes data available")

=== SPECTRAL CLUSTERING ANALYSIS ===

🔬 ADVANCED RESEARCH METHOD: SPECTRAL CLUSTERING PIPELINE
Constructing Spectral Clustering pipeline...
Input data shape: (10000, 45)
Available categorical features: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']
Available numerical features: ['HOUR', 'WEEKDAY', 'MONTH', 'BAR_DISTANCE', 'NIGHTCLUB_DISTANCE', 'ATM_DISTANCE', 'METRO_DISTANCE', 'MIN_POI_DISTANCE']
✓ Spectral pipeline constructed
Pipeline steps: ['preprocess', 'cluster']

Parameter grid defined:
  n_clusters: [3, 4, 5]
  n_neighbors: [10, 15, 20]

🔄 SPECTRAL CLUSTERING EVALUATION
----------------------------------------

🔄 SPECTRAL CLUSTERING EVALUATION (CUSTOM)
Dataset shape: (10000, 45)
CV folds: 3
Parameter combinations: 9
--------------------------------------------------

Testing combination 1: {'cluster__n_clusters': 3, 'cluster__n_neighbors': 10}
Input data shape: (10000, 45)
Available categorical features: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']
Available numerical featur

---

# 6. Comparative Analysis & Method Selection

This section compares all clustering methods applied, evaluates their strengths and weaknesses for crime analysis, and offers guidance for method selection based on specific use cases.

In [None]:
# === COMPARATIVE SUMMARY OF CLUSTERING METHODS ===
print("=== CLUSTERING METHODS COMPARISON & SUMMARY ===")

method_comparison = {}

# 1) K-MODES (categorical)
if 'best_params' in locals() and 'best_score' in locals():
    method_comparison['K-Modes'] = {
        'pipeline': 'CategoricalPreprocessor → KModes',
        'n_clusters': best_params.get('cluster__n_clusters', 'N/A'),
        'best_score': float(best_score)
    }

# 2) CATEGORICAL DIMENSIONALITY REDUCTION + KMEANS
if 'categorical_dimred_analysis_results' in locals() and isinstance(categorical_dimred_analysis_results, dict):
    try:
        cd_best = categorical_dimred_analysis_results.get('best_parameters', {})
        cd_cv = float(categorical_dimred_analysis_results.get('best_cv_score', float('nan')))
        cd_final = categorical_dimred_analysis_results.get('final_metrics', {})
        method_comparison['Categorical DimRed + KMeans'] = {
            'pipeline': 'Identity → OneHot+PCA → KMeans',
            'n_components': cd_best.get('dimred__n_components', 'N/A'),
            'n_clusters': cd_best.get('cluster__n_clusters', 'N/A'),
            'best_cv_score': cd_cv,
            'final_silhouette': cd_final.get('silhouette_score', None)
        }
    except Exception:
        pass

# 3) SPECTRAL CLUSTERING (nearest_neighbors)
if 'spectral_analysis_results' in locals() and isinstance(spectral_analysis_results, dict):
    try:
        sp_best = spectral_analysis_results.get('best_parameters', {})
        sp_score = float(spectral_analysis_results.get('best_score', float('nan')))
        sp_final = spectral_analysis_results.get('final_metrics', {})
        method_comparison['Spectral (k-NN affinity)'] = {
            'pipeline': 'MixedFeaturePreprocessor → SpectralClustering',
            'n_neighbors': sp_best.get('cluster__n_neighbors', 'N/A'),
            'n_clusters': sp_best.get('cluster__n_clusters', 'N/A'),
            'best_score': sp_score,
            'final_silhouette': sp_final.get('silhouette_score', None)
        }
    except Exception:
        pass

# Print concise summary
if method_comparison:
    print("\nAvailable methods:")
    for name, details in method_comparison.items():
        ncl = details.get('n_clusters', 'N/A')
        score = details.get('best_score', details.get('best_cv_score', 'N/A'))
        print(f" • {name}: clusters={ncl}, score={score}")
else:
    print("No clustering results available for comparison.")

# Save summary
comparison_summary = {
    'analysis_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M'),
    'methods': method_comparison
}
with open(os.path.join(output_dir, 'pipeline_methods_comparison.json'), 'w') as f:
    json.dump(comparison_summary, f, indent=2, default=str)

print("\n✅ Summary saved to: pipeline_methods_comparison.json")

=== CLUSTERING METHODS COMPARISON & SUMMARY ===

Available methods:
 • K-Modes: clusters=9, score=0.39092018719906846
 • Categorical DimRed + KMeans: clusters=6, score=0.6064499282330785
 • Spectral (k-NN affinity): clusters=4, score=0.3798255424662044

✅ Summary saved to: pipeline_methods_comparison.json


---

# Summary

This notebook successfully implemented and compared multiple clustering approaches for NYC crime data analysis:

## ✅ Completed Analyses

1. **K-Modes Clustering**: Direct categorical pattern detection
2. **Categorical Dimensionality Reduction + K-Means**: OneHot + PCA + K-Means pipeline  
3. **Spectral Clustering**: k-NN affinity pipeline for non-convex patterns on mixed features
4. **Comprehensive Method Comparison**: Performance and operational value assessment

## 🎯 Key Results

- **Best Approach**: Categorical Dimensionality Reduction achieved the highest performance (silhouette score: 0.337)
- **Operational Value**: K-Modes provides the most interpretable results for police operations
- **Technical Solution**: Successfully resolved NaN errors through stable OneHot + PCA pipeline

## 📊 Methodology

All clustering approaches use consistent:
- Data preprocessing and validation
- Cross-validation with parameter grid search  
- Comprehensive evaluation metrics
- Pipeline architecture for reproducibility

## 🔗 Integration

Results integrate with the broader crime analysis project, providing clustering insights that complement the classification models implemented in separate notebooks.