# Multidimensional Clustering Analysis

Purpose
- Discover meaningful crime patterns from categorical and mixed features.
- Compare three unsupervised methods: K-Modes, Categorical Dimensionality Reduction + KMeans, and Spectral Clustering.
- Produce concise artifacts for analysis and operational intelligence.

Inputs
- DataFrame `base_df` prepared from project data (see preprocessing cells):
  - Temporal, spatial context, social, count/score feature groups (configured via lists like `TEMPORAL_FEATURES`, `SPATIAL_CONTEXT_FEATURES`, etc.).
  - Optional binned context features via `ColumnBinner` with `binner_config`.
- Hyperparameter grids:
  - `kmodes_param_grid`, `cat_dimred_param_grid`, `spectral_param_grid`.
- Reproducibility settings: `RANDOM_STATE`, sampling flags (`USE_FULL_DATA`, `N_SAMPLE`).

Outputs
- Model comparison (JSON): `JupyterOutputs/Clustering (MultidimensionalClusteringAnalysis)/pipeline_methods_comparison.json`.
- Executive summary (JSON): `JupyterOutputs/Clustering (MultidimensionalClusteringAnalysis)/executive_crime_summary.json`.
- Optional enriched artifacts (if enabled downstream):
  - `.../executive_crime_summary_enriched.json`
  - `.../police_operational_intelligence_enriched.csv`
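
The JSON artifacts listed above may be built from dictionaries that contain NumPy scalars (cluster sizes, scores), which the standard `json` module does not serialize on its own. The following is a minimal sketch of one way to write such a payload. The helper names `to_builtin` and `write_json` and the demo payload are hypothetical, and the real schema of `pipeline_methods_comparison.json` is not shown in this section.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def to_builtin(obj):
    """Fallback for json.dump: convert NumPy scalars/arrays to plain Python types."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")


def write_json(payload, path):
    """Write payload to path as indented UTF-8 JSON, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, default=to_builtin)
    return path


# Hypothetical payload; keys and values are illustrative only.
results = {"method": "kmodes", "n_clusters": np.int64(8), "sizes": np.array([3, 5])}
out_path = write_json(results, Path(tempfile.mkdtemp()) / "demo" / "comparison_demo.json")
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```

Passing a `default` callable keeps the conversion in one place instead of casting every value before it enters the results dictionary.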




# Setup

This section imports the required libraries, defines file paths, and configures the environment for the clustering analysis. Custom transformers (categorical preprocessing, column binning, categorical dimensionality reduction, and group-balanced one-hot encoding) are imported from the project's `Utilities.clustering_transformers` module.

## Import Libraries

Import the libraries required for data manipulation, clustering, and evaluation.

In [35]:

# Core data manipulation and computation
import json
import os
import sys
import time
from statistics import mean

import numpy as np
import pandas as pd

# Machine learning and clustering
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans, SpectralClustering
from kmodes.kmodes import KModes
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import silhouette_score

from pathlib import Path


# custom transformers
from Utilities.clustering_transformers import (
    CategoricalPreprocessor,
    ColumnBinner,
    CategoricalDimensionalityReducer,
    GroupBalancedOneHotEncoder
)
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

from collections.abc import Mapping
from itertools import product

from datetime import datetime


## Configure Paths and Custom Utilities

Set up file paths, create the output directory, and add the `Utilities` folder to the Python path.

In [36]:
# Configure working directory and paths
current_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(current_dir, '../..'))
data_dir = os.path.join(project_root, 'Data')
output_dir = os.path.join(project_root, 'JupyterOutputs', 'Clustering (MultidimensionalClusteringAnalysis)')

# Create output directories if they don't exist
os.makedirs(output_dir, exist_ok=True)

print(f"Project root: {project_root}")
print(f"Data directory: {data_dir}")
print(f"Output directory: {output_dir}")

# Add utilities to Python path
utilities_path = os.path.join(os.getcwd(), 'Utilities')
if utilities_path not in sys.path:
    sys.path.append(utilities_path)

Project root: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer
Data directory: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\Data
Output directory: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\JupyterOutputs\Clustering (MultidimensionalClusteringAnalysis)
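
The path setup above relies on `os.path`; since `pathlib.Path` is already imported, the same layout can be sketched with it. This is an illustrative alternative, not the notebook's code: the function name `resolve_project_paths`, the `levels_up` parameter, and the `Notebooks/Clustering` demo folders are assumptions.

```python
import tempfile
from pathlib import Path


def resolve_project_paths(start_dir, levels_up=2):
    """Derive project, data and output folders from a starting directory."""
    start = Path(start_dir).resolve()
    # parents[0] is the direct parent, so two levels up is parents[1].
    project_root = start.parents[levels_up - 1]
    data_dir = project_root / "Data"
    output_dir = project_root / "JupyterOutputs" / "Clustering (MultidimensionalClusteringAnalysis)"
    output_dir.mkdir(parents=True, exist_ok=True)
    return {"project_root": project_root, "data_dir": data_dir, "output_dir": output_dir}


# Demo on a throwaway directory tree instead of the real project.
demo_root = Path(tempfile.mkdtemp()).resolve()
notebook_dir = demo_root / "Notebooks" / "Clustering"
notebook_dir.mkdir(parents=True)
paths = resolve_project_paths(notebook_dir)
```

Resolving from an explicit starting directory rather than `os.getcwd()` makes the result independent of where the kernel was launched.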


## Configure Analysis Parameters

Define key parameters for the clustering analysis.

In [37]:
# Random state for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Feature columns for spatial analysis (aligned with actual dataset)
SPATIAL_FEATURES = ['Latitude', 'Longitude']

# Primary temporal features for clustering
TEMPORAL_FEATURES = ['HOUR', 'WEEKDAY', 'MONTH']

# Extended temporal features available in dataset
EXTENDED_TEMPORAL_FEATURES = [
    'HOUR', 'DAY', 'WEEKDAY', 'IS_WEEKEND', 'MONTH', 'YEAR', 
    'SEASON', 'TIME_BUCKET', 'IS_HOLIDAY', 'IS_PAYDAY'
]

# Categorical features
CATEGORICAL_FEATURES = ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']

# Extended categorical features available
EXTENDED_CATEGORICAL_FEATURES = [
    'BORO_NM', 'LAW_CAT_CD', 'LOC_OF_OCCUR_DESC', 'OFNS_DESC', 'PREM_TYP_DESC',
    'SUSP_AGE_GROUP', 'SUSP_RACE', 'SUSP_SEX', 'VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX'
]

# Spatial context features (POI-based features for enhanced spatial analysis)
SPATIAL_CONTEXT_FEATURES = [
    'BAR_DISTANCE', 'NIGHTCLUB_DISTANCE', 'ATM_DISTANCE', 'METRO_DISTANCE',
    'MIN_POI_DISTANCE', 'AVG_POI_DISTANCE', 'MAX_POI_DISTANCE',
    'ATMS_COUNT', 'BARS_COUNT', 'BUS_STOPS_COUNT', 'METROS_COUNT', 
    'NIGHTCLUBS_COUNT', 'SCHOOLS_COUNT', 'TOTAL_POI_COUNT',
    'POI_DIVERSITY', 'POI_DENSITY_SCORE'
]

# Social features 
SOCIAL_FEATURES = ['SAME_AGE_GROUP', 'SAME_SEX']

print("Analysis parameters configured successfully!")
print(f"Primary spatial features: {SPATIAL_FEATURES}")
print(f"Primary temporal features: {TEMPORAL_FEATURES}")
print(f"Primary categorical features: {CATEGORICAL_FEATURES}")
print(f"Available spatial context features: {len(SPATIAL_CONTEXT_FEATURES)} POI-based features")
print(f"Available extended temporal features: {len(EXTENDED_TEMPORAL_FEATURES)} temporal features")

Analysis parameters configured successfully!
Primary spatial features: ['Latitude', 'Longitude']
Primary temporal features: ['HOUR', 'WEEKDAY', 'MONTH']
Primary categorical features: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']
Available spatial context features: 16 POI-based features
Available extended temporal features: 10 temporal features
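
The feature groups above are plain lists of column names, so a configured column that is absent from the dataset would otherwise only surface later. A small sketch of an availability check follows; the helper name `split_available` and the toy frame are hypothetical.

```python
import pandas as pd


def split_available(df, feature_groups):
    """Return {group: (present, missing)} for each named list of columns."""
    report = {}
    for group, columns in feature_groups.items():
        present = [c for c in columns if c in df.columns]
        missing = [c for c in columns if c not in df.columns]
        report[group] = (present, missing)
    return report


# Toy frame: only part of the configured columns exist.
toy = pd.DataFrame({"HOUR": [1], "MONTH": [2], "BORO_NM": ["BRONX"]})
groups = {
    "temporal": ["HOUR", "WEEKDAY", "MONTH"],
    "categorical": ["BORO_NM", "OFNS_DESC"],
}
report = split_available(toy, groups)
```

Reporting the missing names, not just the counts, makes a typo in a feature list easy to spot.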


---

# Data Loading & Feature Preparation

This section loads the preprocessed crime dataset and prepares features specifically for clustering analysis. We validate coordinate accuracy, assess feature completeness, and prepare the data for various clustering algorithms.
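
The coordinate validation mentioned above amounts to a bounding-box filter. A minimal sketch, assuming the `Latitude`/`Longitude` column names and the NYC bounds used in the cleaning cell of this notebook; the helper name `filter_to_bounds` is hypothetical.

```python
import pandas as pd

# Approximate NYC extent, matching the bounds used in this notebook's cleaning cell.
NYC_BOUNDS = {"lat": (40.4774, 40.9176), "lon": (-74.2591, -73.7004)}


def filter_to_bounds(df, lat_col="Latitude", lon_col="Longitude", bounds=NYC_BOUNDS):
    """Keep rows whose coordinates are present and inside the bounding box."""
    lat_ok = df[lat_col].between(*bounds["lat"])
    lon_ok = df[lon_col].between(*bounds["lon"])
    # between() is False for NaN, so missing coordinates are dropped as well.
    return df[lat_ok & lon_ok].copy()


toy = pd.DataFrame({
    "Latitude": [40.7128, 0.0, None, 40.85],
    "Longitude": [-74.0060, 0.0, -73.9, -73.93],
})
kept = filter_to_bounds(toy)
```

A rectangular box is a coarse check: it removes null-island and out-of-region points but will still accept points in neighbouring areas that fall inside the rectangle.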

## Load Preprocessed Crime Dataset

Load and validate the preprocessed crime data.

In [38]:
# Define data file path
data_file = os.path.join(data_dir, 'final_crime_data.csv')

# Check if data file exists
if not os.path.exists(data_file):
    raise FileNotFoundError(f"Data file not found: {data_file}")

print(f"Loading data from: {data_file}")

# Load the dataset
try:
    df = pd.read_csv(data_file)
    print("Dataset loaded successfully.")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
except Exception as e:
    raise RuntimeError(f"Error loading dataset: {e}")

# Display basic dataset information
print("\n" + "="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"Total records: {len(df):,}")
print(f"Total features: {df.shape[1]}")

Loading data from: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\Data\final_crime_data.csv
Dataset loaded successfully.
Shape: (2493835, 44)
Memory usage: 2453.35 MB

DATASET OVERVIEW
Total records: 2,493,835
Total features: 44
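
The load above reports roughly 2.4 GB in memory for the full file. If that becomes a constraint, one option is to read only the needed columns and store repetitive text columns as categoricals. The sketch below uses a tiny in-memory CSV as a stand-in; the chosen columns and dtypes are assumptions, not the notebook's configuration.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for final_crime_data.csv.
csv_text = (
    "BORO_NM,OFNS_DESC,HOUR,Latitude,Longitude,UNUSED\n"
    "BRONX,ROBBERY,13,40.85,-73.90,x\n"
    "QUEENS,ROBBERY,2,40.72,-73.80,y\n"
    "BRONX,BURGLARY,23,40.84,-73.91,z\n"
)

wanted = ["BORO_NM", "OFNS_DESC", "HOUR", "Latitude", "Longitude"]
dtypes = {"BORO_NM": "category", "OFNS_DESC": "category", "HOUR": "int16"}

# usecols skips unneeded columns at parse time; dtype avoids object-typed strings.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=wanted, dtype=dtypes)
memory_mb = df_small.memory_usage(deep=True).sum() / 1024**2
```

One caveat: a categorical column rejects values outside its categories, so a later `fillna('Unknown')` needs the category to be added first (or the column cast back to object).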


## Data Cleaning and Validation

Check that the configured features exist, restrict the data to November 2024 onward, and keep only records with valid coordinates inside the NYC bounding box.


In [39]:
# Validate feature availability
print("\n" + "="*60)
print("FEATURE VALIDATION")
print("="*60)

# Check spatial features
spatial_available = [col for col in SPATIAL_FEATURES if col in df.columns]
temporal_available = [col for col in TEMPORAL_FEATURES if col in df.columns]
categorical_available = [col for col in CATEGORICAL_FEATURES if col in df.columns]

print("Feature availability:")
print("-" * 30)
print(f"Spatial features: {spatial_available} ({len(spatial_available)}/{len(SPATIAL_FEATURES)})")
print(f"Temporal features: {temporal_available} ({len(temporal_available)}/{len(TEMPORAL_FEATURES)})")
print(f"Categorical features: {categorical_available} ({len(categorical_available)}/{len(CATEGORICAL_FEATURES)})")

print(f"\n" + "="*60)
print("TEMPORAL FILTERING FOR CLUSTERING")
print("="*60)

if 'YEAR' in df.columns and 'MONTH' in df.columns:
    print(f"Original dataset years: {sorted(df['YEAR'].unique())}")
    print(f"Total records before temporal filter: {len(df):,}")
    
    # Create YearMonth for filtering
    df['YearMonth'] = df['YEAR'] * 100 + df['MONTH']
    print(f"Year-Month distribution in original dataset:")
    ym_counts = df['YearMonth'].value_counts().sort_index()
    for ym, count in ym_counts.items():
        print(f"  {ym}: {count:,} records")

    # Use the temporal split point for 2024
    test_set_start_ym = 202411  # November 2024
    print(f"\nFiltering for clustering analysis with period:")
    print(f"Using YearMonth >= {test_set_start_ym}")
    
    # Apply filter
    df_filtered = df[df['YearMonth'] >= test_set_start_ym].copy()
    
    print(f"Records after temporal filter: {len(df_filtered):,} ({(len(df_filtered)/len(df))*100:.1f}% of original)")
    
    if len(df_filtered) > 0:
        print(f"Filtered dataset year-months:")
        filtered_ym_counts = df_filtered['YearMonth'].value_counts().sort_index()
        for ym, count in filtered_ym_counts.items():
            print(f"  {ym}: {count:,} records")
        
        # Drop the temporary YearMonth column
        df_filtered.drop(columns=['YearMonth'], inplace=True)
        df.drop(columns=['YearMonth'], inplace=True)
        
        # Use filtered data for clustering
        df = df_filtered
        print(f"\nUsing filtered dataset for clustering analysis.")
    else:
        print(f"Warning: No data found for YearMonth >= {test_set_start_ym}")
        print("Falling back to recent years filter (YEAR >= 2023)")
        # Fallback to the previous approach
        recent_years_threshold = 2023
        df_filtered = df[df['YEAR'] >= recent_years_threshold].copy()
        df = df_filtered
        df.drop(columns=['YearMonth'], inplace=True)
else:
    print("Warning: YEAR or MONTH column not found. Skipping temporal filtering.")
    print("Using full dataset for clustering analysis.")

# Create clean dataset for spatial analysis
print(f"\nPreparing dataset for spatial clustering...")

# Filter for valid coordinates
if len(spatial_available) >= 2:
    # Remove rows with missing coordinates
    valid_coords_mask = df[spatial_available].notna().all(axis=1)
    df_spatial = df[valid_coords_mask].copy()
    
    print(f"Records with valid coordinates: {len(df_spatial):,} ({(len(df_spatial)/len(df))*100:.2f}%)")
    
    # Additional coordinate validation
    lat_col = 'Latitude'
    lon_col = 'Longitude'
    
    if lat_col in df_spatial.columns and lon_col in df_spatial.columns:
        # NYC coordinate bounds
        nyc_bounds_mask = (
            df_spatial[lat_col].between(40.4774, 40.9176) &
            df_spatial[lon_col].between(-74.2591, -73.7004)
        )
        df_spatial = df_spatial[nyc_bounds_mask].copy()
        
        print(f"Records within NYC bounds: {len(df_spatial):,}")
    else:
        print(f"Warning: Coordinate columns {lat_col}/{lon_col} not found for geographic filtering")
else:
    raise ValueError("Insufficient spatial features for clustering analysis")

# Display final dataset summary
print(f"\nFinal dataset for clustering:")
print(f"Shape: {df_spatial.shape}")
print(f"Coordinate coverage: {len(df_spatial):,} records")
print(f"Time range: {df_spatial['YEAR'].min()} - {df_spatial['YEAR'].max()}")
print(f"Memory usage: {df_spatial.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Check availability of extended features
print(f"\nExtended features availability:")
print("-" * 40)

# Check spatial context features (POI-based)
spatial_context_available = [col for col in SPATIAL_CONTEXT_FEATURES if col in df.columns]
print(f"Spatial context features: {len(spatial_context_available)}/{len(SPATIAL_CONTEXT_FEATURES)} available")
if spatial_context_available:
    print(f"  Available: {spatial_context_available[:5]}...")  # Show first 5

# Check extended temporal features
extended_temporal_available = [col for col in EXTENDED_TEMPORAL_FEATURES if col in df.columns]
print(f"Extended temporal features: {len(extended_temporal_available)}/{len(EXTENDED_TEMPORAL_FEATURES)} available")
if extended_temporal_available:
    print(f"  Available: {extended_temporal_available}")

# Check extended categorical features
extended_categorical_available = [col for col in EXTENDED_CATEGORICAL_FEATURES if col in df.columns]
print(f"Extended categorical features: {len(extended_categorical_available)}/{len(EXTENDED_CATEGORICAL_FEATURES)} available")
if extended_categorical_available:
    print(f"  Available: {extended_categorical_available[:5]}...")  # Show first 5

# Check social features
social_available = [col for col in SOCIAL_FEATURES if col in df.columns]
print(f"Social features: {len(social_available)}/{len(SOCIAL_FEATURES)} available")
if social_available:
    print(f"  Available: {social_available}")

# Display temporal distribution after filtering
if 'YEAR' in df_spatial.columns and 'MONTH' in df_spatial.columns:
    print(f"\nTemporal distribution in filtered dataset:")
    print("-" * 40)
    yearly_counts = df_spatial['YEAR'].value_counts().sort_index()
    for year, count in yearly_counts.items():
        print(f"  {year}: {count:,} records")
    
    # Show monthly distribution for each year in the filtered data
    years_in_data = sorted(df_spatial['YEAR'].unique())
    for year in years_in_data:
        monthly_counts = df_spatial[df_spatial['YEAR'] == year]['MONTH'].value_counts().sort_index()
        print(f"\nMonthly distribution for {year}:")
        print("-" * 40)
        for month, count in monthly_counts.items():
            print(f"  Month {month:2d}: {count:,} records")

# Final clustering dataset summary
print(f"\n" + "="*60)
print("CLUSTERING DATASET SUMMARY")
print("="*60)
print(f"Dataset period: Nov 2024 onwards")
print(f"Total records for clustering: {len(df_spatial):,}")
print(f"Geographic validity: NYC coordinate bounds enforced")


FEATURE VALIDATION
Feature availability:
------------------------------
Spatial features: ['Latitude', 'Longitude'] (2/2)
Temporal features: ['HOUR', 'WEEKDAY', 'MONTH'] (3/3)
Categorical features: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC'] (3/3)

TEMPORAL FILTERING FOR CLUSTERING
Original dataset years: [np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024)]
Total records before temporal filter: 2,493,835
Year-Month distribution in original dataset:
  202001: 38,695 records
  202002: 35,446 records
  202003: 32,679 records
  202004: 24,907 records
  202005: 32,023 records
  202006: 32,604 records
  202007: 35,531 records
  202008: 37,524 records
  202009: 35,983 records
  202010: 37,403 records
  202011: 35,148 records
  202012: 33,516 records
  202101: 33,217 records
  202102: 28,332 records
  202103: 34,612 records
  202104: 32,619 records
  202105: 36,711 records
  202106: 37,516 records
  202107: 39,402 records
  202108: 38,945 records
  202109: 39,906 rec
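
The temporal filter in the cell above encodes each record as `YEAR * 100 + MONTH`, so that a single integer comparison selects everything from a given month onward. A compact sketch of the same idea that avoids adding and dropping a temporary column; the helper name `filter_from_year_month` is hypothetical.

```python
import pandas as pd


def filter_from_year_month(df, start_ym, year_col="YEAR", month_col="MONTH"):
    """Keep rows whose YEAR*100 + MONTH key is >= start_ym (e.g. 202411)."""
    year_month = df[year_col] * 100 + df[month_col]
    return df[year_month >= start_ym].copy()


toy = pd.DataFrame({
    "YEAR": [2023, 2024, 2024, 2024],
    "MONTH": [12, 10, 11, 12],
})
recent = filter_from_year_month(toy, 202411)
```

Because months never exceed 12, the encoded keys sort in chronological order, which is what makes the plain `>=` comparison valid.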

In [40]:
# --- Data Configuration: Full Dataset vs Sample ---
# Set to True to use full dataset, False to use sample
USE_FULL_DATA = False

# Helper: proportional stratified sampling (largest-remainder allocation, capped by stratum size)
def stratified_sample_balanced(df_in, strat_col, n_total, random_state=42):
    if strat_col not in df_in.columns:
        # Fallback to simple sample
        n_take = min(n_total, len(df_in))
        return df_in.sample(n_take, random_state=random_state)

    df_in = df_in[df_in[strat_col].notna()].copy()
    if len(df_in) == 0:
        return df_in

    total = len(df_in)
    if n_total >= total:
        # Return all rows if requested sample >= population
        return df_in.sample(frac=1.0, random_state=random_state)

    sizes = df_in[strat_col].value_counts().sort_index()
    groups = df_in.groupby(strat_col)

    # Ideal proportional allocation
    ideal = sizes / total * n_total
    base = np.floor(ideal).astype(int)

    # Cap by availability
    cap = sizes
    alloc = base.clip(upper=cap)

    # Largest remainder method
    remaining = int(n_total - alloc.sum())
    remainders = (ideal - base)

    # First pass: distribute by largest remainders
    for key in remainders.sort_values(ascending=False).index:
        if remaining == 0:
            break
        if alloc[key] < cap[key]:
            alloc[key] += 1
            remaining -= 1

    # Draw samples per stratum
    parts = []
    for key, g in groups:
        k = int(alloc.get(key, 0))
        if k <= 0:
            continue
        if k >= len(g):
            parts.append(g)
        else:
            parts.append(g.sample(n=k, random_state=random_state))

    out = pd.concat(parts, axis=0)
    # Shuffle for randomness with reproducibility
    out = out.sample(frac=1.0, random_state=random_state).reset_index(drop=True)

    # Safety: trim over allocation
    if len(out) > n_total:
        out = out.iloc[:n_total].copy()

    return out

# Configure dataset based on flag
if USE_FULL_DATA:
    df = df_spatial.copy()
    print(f"Using full dataset: {df.shape[0]:,} rows")
else:
    N_SAMPLE = 10_000  # target number of rows to take
    if 'BORO_NM' in df_spatial.columns and df_spatial['BORO_NM'].notna().any():
        df = stratified_sample_balanced(df_spatial, 'BORO_NM', N_SAMPLE, random_state=RANDOM_STATE).copy()
        print(f"Using stratified sample by BORO_NM: {df.shape[0]:,} rows out of {df_spatial.shape[0]:,} total")
        try:
            counts = df['BORO_NM'].value_counts().sort_index()
            props = (counts / len(df)).round(3)
            print("Stratum distribution (sample):")
            for name, count in counts.items():
                print(f"  {name}: {count:,} ({props[name]:.3f})")
        except Exception:
            pass
    else:
        n_take = min(N_SAMPLE, len(df_spatial))
        df = df_spatial.sample(n_take, random_state=RANDOM_STATE).copy()
        print(f"Using simple sample: {df.shape[0]:,} rows out of {df_spatial.shape[0]:,} total")

print(f"Dataset created: {df.shape[0]} rows out of {df_spatial.shape[0]} total")
pd.set_option("display.width", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 44)
print(df.head(1))

Using stratified sample by BORO_NM: 10,000 rows out of 85,968 total
Stratum distribution (sample):
  BRONX: 2,230 (0.223)
  BROOKLYN: 2,747 (0.275)
  MANHATTAN: 2,447 (0.245)
  QUEENS: 2,160 (0.216)
  STATEN ISLAND: 416 (0.042)
Dataset created: 10000 rows out of 85968 total
     BORO_NM  KY_CD   LAW_CAT_CD LOC_OF_OCCUR_DESC  \
0  MANHATTAN    344  MISDEMEANOR            INSIDE   

                      OFNS_DESC  PD_CD           PREM_TYP_DESC SUSP_AGE_GROUP  \
0  ASSAULT 3 & RELATED OFFENSES    114  RESIDENCE - APT. HOUSE          18-24   

        SUSP_RACE SUSP_SEX VIC_AGE_GROUP        VIC_RACE VIC_SEX   Latitude  \
0  BLACK HISPANIC        M         45-64  BLACK HISPANIC       F  40.858427   

   Longitude  BAR_DISTANCE  NIGHTCLUB_DISTANCE  ATM_DISTANCE  ATMS_COUNT  \
0 -73.928801    486.953208          779.356614    766.720091         0.0   

   BARS_COUNT  BUS_STOPS_COUNT  METROS_COUNT  NIGHTCLUBS_COUNT  SCHOOLS_COUNT  \
0         0.0              0.0           0.0               0
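
The sampling helper above allocates rows to boroughs with the largest-remainder method: each stratum first receives the floor of its proportional share, and the leftover rows go to the strata with the largest fractional parts. The allocation step on its own can be sketched as follows; the function name and the toy stratum sizes are illustrative.

```python
import numpy as np
import pandas as pd


def largest_remainder_allocation(sizes, n_total):
    """Split n_total across strata proportionally to sizes, never exceeding a stratum."""
    sizes = pd.Series(sizes, dtype="int64")
    n_total = min(int(n_total), int(sizes.sum()))
    ideal = sizes / sizes.sum() * n_total
    alloc = np.floor(ideal).astype("int64").clip(upper=sizes)
    remaining = n_total - int(alloc.sum())
    # Hand out leftover units to the strata with the largest fractional parts.
    order = (ideal - np.floor(ideal)).sort_values(ascending=False).index
    while remaining > 0:
        progressed = False
        for key in order:
            if remaining == 0:
                break
            if alloc[key] < sizes[key]:
                alloc[key] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every stratum is exhausted
            break
    return alloc


alloc = largest_remainder_allocation({"A": 50, "B": 30, "C": 20}, n_total=7)
```

Here the ideal shares are 3.5, 2.1 and 1.4, so the floors sum to 6 and the one remaining row goes to `A`, which has the largest fractional part.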

## K-Modes for Categorical Crime Patterns


### Categorical Feature Preparation for K-Modes

K-Modes clustering is specifically designed for categorical data. We prepare our categorical features for pattern discovery in crime types, locations, and demographic information.
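
K-Modes adapts K-Means to categorical data by measuring distance as the number of mismatching attributes (simple matching) and by using per-feature modes instead of means as cluster centers. The sketch below illustrates those two building blocks in plain NumPy; it is not the `kmodes` library's implementation, and the toy records are made up.

```python
import numpy as np


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))


def column_modes(records):
    """Most frequent value per column; ties go to the value that sorts first."""
    records = np.asarray(records)
    modes = []
    for col in records.T:
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[np.argmax(counts)])
    return np.array(modes)


# Toy records: borough, offense, premises (values are illustrative).
cluster = np.array([
    ["BRONX", "ROBBERY", "STREET"],
    ["BRONX", "ASSAULT", "STREET"],
    ["QUEENS", "ROBBERY", "STREET"],
])
mode = column_modes(cluster)
cost = sum(matching_dissimilarity(row, mode) for row in cluster)
```

The `cost` here is the quantity K-Modes tries to minimize: the total number of mismatches between records and the mode of their cluster.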

In [41]:
# Define categorical features for K-Modes clustering
BASE_KMODES_FEATURES = ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']

# Additional categorical features directly useful for operations
EXTRA_CATEGORICAL_FEATURES = [
    # 'LAW_CAT_CD',               # Optional: not useful having OFNS_DESC
    # 'LOC_OF_OCCUR_DESC',        # Optional: not useful having PREM_TYP_DESC
    'TIME_BUCKET',              # Coarse time-of-day buckets
    'IS_WEEKEND', 'IS_HOLIDAY', # Operationally meaningful flags
    # 'WEEKDAY',               # Optional: more granular temporal, commented to reduce noise
    # 'IS_PAYDAY',            # Optional: typically low impact
    # 'SAME_AGE_GROUP',       # Optional: low operational value
    # 'SAME_SEX',             # Optional: low operational value
]

# Demographics: keep age/sex; exclude race to avoid bias and improve fairness
DEMOGRAPHIC_CATEGORICAL = [
    'SUSP_SEX', 'SUSP_AGE_GROUP',
    'VIC_SEX', 'VIC_AGE_GROUP',
    # 'SUSP_RACE', 'VIC_RACE',
]

# Numeric POI/distances transformed into interpretable categories
DISTANCE_COLS = [
    'METRO_DISTANCE',
    # 'BAR_DISTANCE', 'NIGHTCLUB_DISTANCE', 'ATM_DISTANCE',
    # 'MIN_POI_DISTANCE', 'AVG_POI_DISTANCE', 'MAX_POI_DISTANCE',
]
COUNT_COLS = [
    # 'TOTAL_POI_COUNT',
    # 'ATMS_COUNT', 'BARS_COUNT', 'BUS_STOPS_COUNT', 'METROS_COUNT',
    # 'NIGHTCLUBS_COUNT', 'SCHOOLS_COUNT',
]
SCORE_COLS = [
    'POI_DENSITY_SCORE',
    # 'POI_DIVERSITY',
]

print("=== CATEGORICAL FEATURE PREPARATION FOR K-MODES ===")

# Use the configured dataset
_df = df.copy()

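def top_k_map(series, k=10):
    # Keep the k most frequent values, map the rest to 'OTHER' and missing values to 'Unknown'
    if series.isna().all():
        return series.fillna('Unknown')
    vc = series.value_counts()
    top = set(vc.head(k).index)
    # Preserve NaN here so it is labeled 'Unknown' below instead of 'OTHER'
    mapped = series.where(series.isin(top) | series.isna(), 'OTHER')
    return mapped.fillna('Unknown').astype(str)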

# Use ColumnBinner instead of inline binning helpers
binner_config = {
    'METRO_DISTANCE': {
        'kind': 'distance',
        'bins': [-np.inf, 250, 1000, np.inf],
        'labels': ['Near', 'Mid', 'Far']
    },
    'POI_DENSITY_SCORE': {
        'kind': 'score',
        'quantiles': 4,
        'labels': ['Low', 'Medium', 'High', 'VeryHigh']
    }
}

# Instantiate and apply binner
column_binner = ColumnBinner(config=binner_config, suffix="_BIN", fill_unknown="Unknown")
column_binner.fit(_df)
_df = column_binner.transform(_df)

# Track created bins consistently
created_bins = [col for col in column_binner.get_feature_names_out() if col in _df.columns]

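# Ensure core categoricals are filled, then cast to strings.
# Fill before casting: astype(str) would turn NaN into the literal string 'nan'.
for col in BASE_KMODES_FEATURES + EXTRA_CATEGORICAL_FEATURES:
    if col in _df.columns:
        _col = _df[col].astype(object)
        _df[col] = _col.where(_col.notna(), 'Unknown').astype(str)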

# Compose the final feature list for K-Modes
CATEGORICAL_FEATURES_KMODES = BASE_KMODES_FEATURES + [
    c for c in EXTRA_CATEGORICAL_FEATURES if c in _df.columns
] + created_bins + [c for c in DEMOGRAPHIC_CATEGORICAL if c in _df.columns]

print("Base categorical features:", BASE_KMODES_FEATURES)
print("Added operational categorical features:", [c for c in EXTRA_CATEGORICAL_FEATURES if c in _df.columns])
print("Added POI/context bins:", created_bins)
print("Added demographics (race excluded):", [c for c in DEMOGRAPHIC_CATEGORICAL if c in _df.columns])

# Check feature availability
categorical_available = [col for col in CATEGORICAL_FEATURES_KMODES if col in _df.columns]
print(f"Total categorical features for K-Modes: {len(categorical_available)}")

# Prepare dataset holders used downstream
if not categorical_available:
    raise ValueError("No categorical features available for K-Modes clustering")

# Keep a wide copy for labeling/ops; drop rows only if core/base features are missing
# (to avoid losing useful non-feature columns like HOUR, IS_WEEKEND later)
df_kmodes_input = _df.copy()
required_for_row = [c for c in BASE_KMODES_FEATURES if c in df_kmodes_input.columns]
df_kmodes = df_kmodes_input.dropna(subset=required_for_row).copy() if required_for_row else df_kmodes_input.copy()

# Build X_categorical as the feature matrix used by the pipeline
# (fill missing values before casting so NaN becomes 'Unknown' rather than 'nan')
CATEGORICAL_FEATURES_KMODES_AVAILABLE = [c for c in CATEGORICAL_FEATURES_KMODES if c in df_kmodes.columns]
_X = df_kmodes[CATEGORICAL_FEATURES_KMODES_AVAILABLE].astype(object)
X_categorical = _X.where(_X.notna(), 'Unknown').astype(str)

print(f"Rows available for K-Modes after base-feature check: {len(df_kmodes):,}")
print(f"Feature matrix shape: {X_categorical.shape}")
print(f"First feature columns: {CATEGORICAL_FEATURES_KMODES_AVAILABLE[:5]}")

=== CATEGORICAL FEATURE PREPARATION FOR K-MODES ===
Base categorical features: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC']
Added operational categorical features: ['TIME_BUCKET', 'IS_WEEKEND', 'IS_HOLIDAY']
Added POI/context bins: [np.str_('METRO_DISTANCE_BIN'), np.str_('POI_DENSITY_SCORE_BIN')]
Added demographics (race excluded): ['SUSP_SEX', 'SUSP_AGE_GROUP', 'VIC_SEX', 'VIC_AGE_GROUP']
Total categorical features for K-Modes: 12
Rows available for K-Modes after base-feature check: 10,000
Feature matrix shape: (10000, 12)
First feature columns: ['BORO_NM', 'OFNS_DESC', 'PREM_TYP_DESC', 'TIME_BUCKET', 'IS_WEEKEND']
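
`ColumnBinner` is a project utility whose implementation is not shown here. As a rough idea of what the two configurations above (fixed distance edges and score quartiles) could correspond to, the sketch below uses `pd.cut` and `pd.qcut`. The helper names and toy values are hypothetical, and the real transformer may behave differently.

```python
import numpy as np
import pandas as pd


def bin_fixed(series, edges, labels, unknown="Unknown"):
    """Bin by explicit edges (pd.cut); missing values become `unknown`."""
    binned = pd.cut(series, bins=edges, labels=labels).astype(object)
    return binned.where(binned.notna(), unknown).astype(str)


def bin_quantiles(series, q, labels, unknown="Unknown"):
    """Bin into q equal-frequency groups (pd.qcut); missing values become `unknown`."""
    binned = pd.qcut(series, q=q, labels=labels).astype(object)
    return binned.where(binned.notna(), unknown).astype(str)


distance = pd.Series([100.0, 500.0, 2000.0, np.nan])
score = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)

distance_bin = bin_fixed(distance, [-np.inf, 250, 1000, np.inf], ["Near", "Mid", "Far"])
score_bin = bin_quantiles(score, 4, ["Low", "Medium", "High", "VeryHigh"])
```

Fixed edges keep the bins interpretable across datasets, while quantile bins adapt to the sample. Note that `pd.qcut` raises an error when heavy ties produce duplicate bin edges, so real score columns may need extra handling.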


In [42]:
# Snapshot the exact categorical dataset used for K-Modes to ensure consistency downstream
if 'X_categorical' in globals() and 'X_categorical_kmodes_ref' not in globals():
    X_categorical_kmodes_ref = X_categorical.copy()

In [43]:
if 'df_kmodes_input' in globals():
    for c in df_kmodes_input.columns:
        if str(df_kmodes_input[c].dtype) in ('object', 'category'):
            df_kmodes_input[c] = df_kmodes_input[c].fillna('Unknown').astype(str)

if 'X_categorical' in globals():
    for c in X_categorical.columns:
        if str(X_categorical[c].dtype) in ('object', 'category'):
            X_categorical[c] = X_categorical[c].fillna('Unknown').astype(str)

### K-Modes Pipeline Construction

As in SpatialHotspotAnalysis, we use a modular scikit-learn `Pipeline`: a `CategoricalPreprocessor` step followed by a `KModes` clusterer. The search grid over `n_clusters`, `init`, and `n_init` is defined in the same cell.

In [44]:
print("=== K-MODES PIPELINE CONSTRUCTION ===")

categorical_preprocessor = CategoricalPreprocessor(handle_missing='drop')

kmodes_pipeline = Pipeline([
    ('preprocess', categorical_preprocessor),
    ('cluster', KModes(n_clusters=5, init='Huang', n_init=5, verbose=1, random_state=RANDOM_STATE))
])

print("K-Modes pipeline constructed successfully")
print(f"Pipeline steps: {[step[0] for step in kmodes_pipeline.steps]}")

kmodes_param_grid = {
    'cluster__n_clusters': list(range(6, 21)),
    'cluster__init': ['Huang', 'Cao'],
    'cluster__n_init': [5, 10]
}

print(f"Parameter grid defined:")
print(f"  n_clusters: {kmodes_param_grid['cluster__n_clusters']}")
print(f"  init methods: {kmodes_param_grid['cluster__init']}")
print(f"  n_init: {kmodes_param_grid['cluster__n_init']}")
print(f"Total combinations: {len(list(ParameterGrid(kmodes_param_grid)))}")

=== K-MODES PIPELINE CONSTRUCTION ===
K-Modes pipeline constructed successfully
Pipeline steps: ['preprocess', 'cluster']
Parameter grid defined:
  n_clusters: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
  init methods: ['Huang', 'Cao']
  n_init: [5, 10]
Total combinations: 60
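
The grid above uses scikit-learn's `step__parameter` naming, which lets `Pipeline.set_params` route each value to the right step. The sketch below reproduces the combination count and shows the routing on a stand-in pipeline. `KMeans` and `StandardScaler` replace `KModes` and the custom preprocessor only so the example runs without the project code.

```python
from sklearn.cluster import KMeans
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

param_grid = {
    "cluster__n_clusters": list(range(6, 21)),  # 15 values
    "cluster__init": ["Huang", "Cao"],          # 2 values
    "cluster__n_init": [5, 10],                 # 2 values
}
grid = ParameterGrid(param_grid)
n_combinations = len(grid)          # 15 * 2 * 2
first_params = next(iter(grid))     # first value of every parameter

# Stand-in pipeline (KMeans instead of KModes) to show the step__param routing.
pipe = Pipeline([("scale", StandardScaler()), ("cluster", KMeans(n_clusters=5, n_init=10))])
pipe.set_params(cluster__n_clusters=first_params["cluster__n_clusters"])
routed_value = pipe.named_steps["cluster"].n_clusters
```

`ParameterGrid` supports `len()` directly, so the number of combinations can be read without materializing the grid as a list.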


### K-Modes Parameter Grid Search & Evaluation

As in SpatialHotspotAnalysis, we run a systematic grid search. Each parameter combination is fitted once and ranked by a composite score that rewards a low average mismatch to the cluster modes and penalizes unbalanced cluster sizes.

In [45]:
print("=== K-MODES PARAMETER GRID SEARCH ===")

kmodes_results = []
best_params = None
best_score = -np.inf

# Custom evaluation metric for categorical clustering
def evaluate_kmodes_clustering(X_encoded, labels, centroids):
    """Summarize cluster sizes and the average mismatch count to the assigned mode."""
    X_encoded = np.asarray(X_encoded)
    labels = np.asarray(labels)
    n_clusters = len(np.unique(labels))
    n_samples = len(labels)
    cluster_sizes = pd.Series(labels).value_counts().sort_index()
    min_cluster_size = cluster_sizes.min()
    max_cluster_size = cluster_sizes.max()
    # Ratio of largest to smallest cluster (1.0 = perfectly balanced)
    cluster_balance = max_cluster_size / max(min_cluster_size, 1)

    # Count attribute mismatches between each sample and its cluster mode.
    # Iterate over the labels actually present, in case label IDs are not contiguous.
    total_dissimilarity = 0
    for cluster_id in np.unique(labels):
        cluster_mask = labels == cluster_id
        total_dissimilarity += int(np.sum(X_encoded[cluster_mask] != centroids[cluster_id]))
    avg_dissimilarity = total_dissimilarity / max(n_samples, 1)

    return {
        'n_clusters': n_clusters,
        'min_cluster_size': min_cluster_size,
        'max_cluster_size': max_cluster_size,
        'cluster_balance': cluster_balance,
        'avg_dissimilarity': avg_dissimilarity
    }

print(f"Starting grid search with {len(list(ParameterGrid(kmodes_param_grid)))} parameter combinations...")

for i, params in enumerate(ParameterGrid(kmodes_param_grid)):
    print(f"\nTesting combination {i+1}: n_clusters={params['cluster__n_clusters']}, "
          f"init={params['cluster__init']}, n_init={params['cluster__n_init']}")
    try:
        kmodes_pipeline.set_params(**params)
        t0 = time.perf_counter()
        labels = kmodes_pipeline.fit_predict(X_categorical)
        runtime = time.perf_counter() - t0

        # Encode X using the same pipeline preprocessor to align with centroids' space
        X_encoded = kmodes_pipeline.named_steps['preprocess'].transform(X_categorical)
        X_encoded_arr = X_encoded.values if hasattr(X_encoded, 'values') else X_encoded

        centroids = kmodes_pipeline.named_steps['cluster'].cluster_centroids_

        eval_metrics = evaluate_kmodes_clustering(X_encoded_arr, labels, centroids)

        # Normalized composite score
        num_features = X_encoded_arr.shape[1] if hasattr(X_encoded_arr, 'shape') and len(X_encoded_arr.shape) == 2 else 1
        norm_dissim = eval_metrics['avg_dissimilarity'] / max(1.0, float(num_features))
        imbalance = max(0.0, (float(eval_metrics['cluster_balance']) - 1.0) / max(1.0, float(eval_metrics['n_clusters']) - 1.0))
        alpha = 0.3
        composite_score = (1.0 - norm_dissim) - alpha * imbalance

        result = {
            'n_clusters': params['cluster__n_clusters'],
            'init_method': params['cluster__init'],
            'n_init': params['cluster__n_init'],
            'runtime_s': runtime,
            'composite_score': composite_score,
            **eval_metrics
        }

        kmodes_results.append(result)
        if composite_score > best_score:
            best_score = composite_score
            best_params = params.copy()

        print(f"  Runtime: {runtime:.2f}s")
        print(f"  Clusters: {eval_metrics['n_clusters']}")
        print(f"  Cluster sizes: {eval_metrics['min_cluster_size']}-{eval_metrics['max_cluster_size']}")
        print(f"  Balance ratio: {eval_metrics['cluster_balance']:.2f}")
        print(f"  Avg dissimilarity: {eval_metrics['avg_dissimilarity']:.3f}")
        print(f"  Composite score: {composite_score:.4f}")

    except Exception as e:
        print(f"  Failed: {str(e)}")
        continue

# Convert results to DataFrame for analysis
df_kmodes_results = pd.DataFrame(kmodes_results)

if not df_kmodes_results.empty:
    print(f"\n=== K-MODES GRID SEARCH RESULTS ===")
    print(f"Total successful runs: {len(df_kmodes_results)}")
    print(f"Best composite score: {best_score:.4f}")
    print(f"Best parameters: {best_params}")

    top_results = df_kmodes_results.nlargest(5, 'composite_score')
    print(f"\nTop 5 parameter combinations:")
    print(top_results[['n_clusters', 'init_method', 'n_init', 'composite_score', 
                      'min_cluster_size', 'max_cluster_size', 'avg_dissimilarity']].round(4))
else:
    print("No successful K-Modes runs completed")

=== K-MODES PARAMETER GRID SEARCH ===
Starting grid search with 60 parameter combinations...

Testing combination 1: n_clusters=6, init=Huang, n_init=5
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 2325, cost: 51050.0
Run 1, iteration: 2/100, moves: 540, cost: 51050.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 3423, cost: 50380.0
Run 2, iteration: 2/100, moves: 1239, cost: 50096.0
Run 2, iteration: 3/100, moves: 125, cost: 50096.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 3248, cost: 50001.0
Run 3, iteration: 2/100, moves: 1334, cost: 49174.0
Run 3, iteration: 3/100, moves: 932, cost: 48779.0
Run 3, iteration: 4/100, moves: 211, cost: 48779.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 4, iteration: 1/100, moves: 3200, cost: 50887.0
Run 4, itera
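
The composite score used to rank runs in the grid search above combines two terms: the share of attributes that match the cluster mode, and a penalty for unequal cluster sizes weighted by `alpha = 0.3`. Isolated as a function, with illustrative inputs that are not results from this notebook:

```python
def composite_score(avg_dissimilarity, num_features, cluster_balance, n_clusters, alpha=0.3):
    """Higher is better: compact clusters (low mismatch share) with similar sizes."""
    # Share of attributes that differ from the cluster mode, in [0, 1].
    norm_dissim = avg_dissimilarity / max(1.0, float(num_features))
    # cluster_balance is max_size / min_size; 1.0 means equal sizes.
    imbalance = max(0.0, (float(cluster_balance) - 1.0) / max(1.0, float(n_clusters) - 1.0))
    return (1.0 - norm_dissim) - alpha * imbalance


# Illustrative numbers, not results from the notebook.
balanced = composite_score(avg_dissimilarity=5.0, num_features=12, cluster_balance=1.0, n_clusters=6)
skewed = composite_score(avg_dissimilarity=5.0, num_features=12, cluster_balance=3.0, n_clusters=6)
```

With equal compactness, the skewed solution scores lower (about 0.463 versus 0.583). The imbalance term is not capped, so a very small cluster can dominate the score when the size ratio exceeds the number of clusters.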

### K-Modes Results Analysis & Pattern Discovery

We analyze the discovered categorical crime patterns and create detailed cluster profiles.
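
A cluster profile of this kind can be summarized as the dominant value of each feature within a cluster and the share of records that carry it. A minimal sketch on toy data follows; the helper name `profile_clusters` and the category values are hypothetical, and the notebook's own profiling may report more detail.

```python
import pandas as pd


def profile_clusters(df, features, cluster_col="cluster"):
    """Per cluster: size, plus the dominant value and its share for each feature."""
    rows = []
    for cluster_id, group in df.groupby(cluster_col):
        row = {"cluster": cluster_id, "size": len(group)}
        for feature in features:
            shares = group[feature].value_counts(normalize=True)
            row[f"{feature}_top"] = shares.index[0]
            row[f"{feature}_share"] = round(float(shares.iloc[0]), 3)
        rows.append(row)
    return pd.DataFrame(rows).set_index("cluster")


toy = pd.DataFrame({
    "OFNS_DESC": ["ROBBERY", "ROBBERY", "ASSAULT", "BURGLARY", "BURGLARY"],
    "TIME_BUCKET": ["NIGHT", "NIGHT", "NIGHT", "MORNING", "MORNING"],
    "cluster": [0, 0, 0, 1, 1],
})
profiles = profile_clusters(toy, ["OFNS_DESC", "TIME_BUCKET"])
```

Reporting the share alongside the dominant value helps distinguish a cluster that is genuinely homogeneous from one where the mode barely leads.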

In [46]:
if 'df_kmodes_results' in locals() and not df_kmodes_results.empty:
    print("=== K-MODES FINAL MODEL & PATTERN ANALYSIS ===")
    
    print(f"Fitting final K-Modes model with best parameters...")
    print(f"Best parameters: {best_params}")
    
    final_kmodes_pipeline = Pipeline([
        ('preprocess', CategoricalPreprocessor(handle_missing='drop')),
        ('cluster', KModes(
            n_clusters=best_params['cluster__n_clusters'],
            init=best_params['cluster__init'],
            n_init=best_params['cluster__n_init'],
            verbose=1,
            random_state=RANDOM_STATE
        ))
    ])
    
    final_labels = final_kmodes_pipeline.fit_predict(X_categorical)
    final_centroids = final_kmodes_pipeline.named_steps['cluster'].cluster_centroids_

    # Feature names aligned with the preprocessor output
    try:
        feature_names = final_kmodes_pipeline.named_steps['preprocess'].get_feature_names_out().tolist()
    except Exception:
        feature_names = list(X_categorical.columns)

    df_kmodes_labeled = df_kmodes.copy()
    df_kmodes_labeled['cluster'] = final_labels
    
    print(f"Final model fitted successfully")
    print(f"Number of clusters: {len(np.unique(final_labels))}")
    print(f"Cluster distribution:")
    cluster_counts = pd.Series(final_labels).value_counts().sort_index()
    for cluster_id, count in cluster_counts.items():
        print(f"  Cluster {cluster_id}: {count} samples ({(count/len(final_labels))*100:.1f}%)")
    
    print(f"\n=== CATEGORICAL CRIME PATTERN PROFILES ===")
    
    cluster_profiles = []
    
    for cluster_id in sorted(np.unique(final_labels)):
        cluster_mask = final_labels == cluster_id
        cluster_data = df_kmodes_labeled[cluster_mask]
        cluster_size = cluster_mask.sum()
        
        print(f"\n--- CLUSTER {cluster_id} PROFILE ---")
        print(f"Size: {cluster_size} samples ({(cluster_size/len(final_labels))*100:.1f}%)")
        
        centroid = final_centroids[cluster_id]
        print(f"Centroid pattern:")
        for i, feature in enumerate(feature_names[:len(centroid)]):
            print(f"  {feature}: {centroid[i]}")
        
        summary_cols = [
            'BORO_NM', 'PREM_TYP_DESC', 'OFNS_DESC',
            'TIME_BUCKET', 'IS_WEEKEND', 'IS_HOLIDAY',
            'METRO_DISTANCE_BIN', 'POI_DENSITY_SCORE_BIN',
        ]
        summary_cols = [c for c in summary_cols if c in cluster_data.columns]
        for col in summary_cols:
            dist = cluster_data[col].value_counts(normalize=True).head(5)
            print(f"Top {col}:")
            for val, pct in dist.items():
                print(f"  {val}: {pct*100:.1f}%")
        
        profile = {
            'cluster': int(cluster_id),
            'size': int(cluster_size),
        }
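        for col in summary_cols:
            counts = cluster_data[col].value_counts()
            # value_counts() drops NaN, so an all-missing column has no mode
            top_val = counts.idxmax() if not counts.empty else 'Unknown'
            profile[f'top_{col.lower()}'] = str(top_val)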
        cluster_profiles.append(profile)
    
    df_cluster_profiles = pd.DataFrame(cluster_profiles)
    print("\nSample cluster profiles:")
    print(df_cluster_profiles.head())

=== K-MODES FINAL MODEL & PATTERN ANALYSIS ===
Fitting final K-Modes model with best parameters...
Best parameters: {'cluster__init': 'Huang', 'cluster__n_clusters': 18, 'cluster__n_init': 5}
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 2662, cost: 42696.0
Run 1, iteration: 2/100, moves: 384, cost: 42696.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 2947, cost: 43421.0
Run 2, iteration: 2/100, moves: 617, cost: 43331.0
Run 2, iteration: 3/100, moves: 67, cost: 43331.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 2732, cost: 43238.0
Run 3, iteration: 2/100, moves: 324, cost: 43238.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 4, iteration: 1/100, moves: 4065, cost: 42255.0
Run 4, iteration: 2/100, moves: 612, cost: 42255.0
Init: initializing centroi

In [47]:
# Preserve K-Modes best params/score for later summary
if 'best_params' in globals() and 'best_score' in globals():
    kmodes_best_params = best_params.copy()
    kmodes_best_score = float(best_score)
else:
    kmodes_best_params = None
    kmodes_best_score = None



# Police-Focused Crime Intelligence Analysis

This section transforms the K-Modes clustering results into **actionable intelligence** for law enforcement operations. We create operational crime profiles, tactical recommendations, and priority assessments for police deployment.

## Crime Pattern Intelligence Reports

Transform abstract clustering results into concrete operational insights for police commanders and patrol units.
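
One way to make this concrete is to reduce each cluster to two numbers: its volume and the share held by its most frequent offense. The sketch below computes both on toy data and assigns tiers from volume quantiles at the 80th, 60th, and 40th percentiles, the same cut points used later in this section. The toy values and the `tier` helper are illustrative assumptions.

```python
import pandas as pd

# Toy labeled incidents (illustrative; not project data)
toy = pd.DataFrame({
    "cluster":   [0] * 4 + [1] * 2 + [2] * 6 + [3] * 1 + [4] * 3,
    "OFNS_DESC": ["A", "A", "A", "B",
                  "A", "B",
                  "C", "C", "C", "C", "C", "A",
                  "B",
                  "A", "B", "C"],
})

g = toy.groupby("cluster")["OFNS_DESC"]
summary = pd.DataFrame({
    # Volume: number of incidents in the cluster
    "crime_count": g.size(),
    # Concentration: share of the single most frequent offense in the cluster
    "concentration_score": g.apply(lambda s: s.value_counts(normalize=True).max()),
}).rename_axis("cluster").reset_index()

# Tier thresholds come from the distribution of cluster volumes
q80, q60, q40 = summary["crime_count"].quantile([0.8, 0.6, 0.4])

def tier(n):
    """Map a cluster volume to a priority tier (volume only)."""
    if n >= q80:
        return "HIGH"
    if n >= q60:
        return "MEDIUM-HIGH"
    if n >= q40:
        return "MEDIUM"
    return "LOW"

summary["priority"] = summary["crime_count"].map(tier)
print(summary)
```

Note that the tier depends on volume alone; concentration only affects the ordering of clusters when the table is sorted.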

### Operational cluster summaries (concise tables)

The tables below summarize clusters for police operations without narrative: top patterns and cross-tabs by borough, crime type, time, premises, weekend/holiday, and demographics.
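
For readers unfamiliar with row-normalized cross-tabs, here is a small sketch on toy data. Each row sums to 1, so a cell reads as "the share of this borough's incidents that fall in this priority tier". Collapsing rare categories into `OTHER` keeps the table short; the toy values and the choice of `top_k = 2` are assumptions made for illustration.

```python
import pandas as pd

# Toy incidents with a priority already attached (illustrative; not project data)
toy = pd.DataFrame({
    "BORO_NM":  ["BRONX", "BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS", "STATEN ISLAND"],
    "priority": ["HIGH", "HIGH", "HIGH", "LOW", "HIGH", "LOW", "HIGH"],
})

top_k = 2
s = toy["BORO_NM"].astype(str)
# Keep the top_k most frequent categories and fold the rest into 'OTHER'
top = s.value_counts().head(top_k).index
s = s.where(s.isin(top), "OTHER")

# normalize='index' divides each row by its row total
table = pd.crosstab(s, toy["priority"], normalize="index").round(2)
print(table)
```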


In [48]:
# Build operational dataframe and produce concise police-focused tables
# Helper: mode with safe fallback (computes the mode once)
def _def(s, default='Unknown'):
    m = s.mode()
    return m.iloc[0] if not m.empty else default

# Build ops
ops = None
try:
    if 'df_operational' in locals() and isinstance(df_operational, pd.DataFrame) and not df_operational.empty:
        print("[ops] Using provided df_operational")
        ops = df_operational.copy()
    elif 'df_cluster_profiles' in locals() and isinstance(df_cluster_profiles, pd.DataFrame) and not df_cluster_profiles.empty:
        print("[ops] Building ops from df_cluster_profiles")
        # Derive a minimal ops from profiles
        dfp = df_cluster_profiles.copy()
        rename_map = {
            'cluster': 'cluster_id',
            'top_ofns_desc': 'primary_crime',
            'top_boro_nm': 'primary_borough',
            'top_prem_typ_desc': 'primary_premises',
            'top_time_bucket': 'primary_time_bucket',
            'crime_count': 'crime_count',
            'concentration_score': 'concentration_score'
        }
        for col in list(rename_map):
            if col not in dfp.columns:
                if col == 'crime_count' and 'size' in dfp.columns:
                    print("[ops] Inferring crime_count from size")
                    dfp['crime_count'] = dfp['size']
                elif col == 'concentration_score' and 'composite_score' in dfp.columns:
                    print("[ops] Inferring concentration_score from composite_score")
                    dfp['concentration_score'] = dfp['composite_score']
                elif col == 'top_ofns_desc' and 'OFNS_DESC' in dfp.columns:
                    print("[ops] Inferring top_ofns_desc from OFNS_DESC")
                    dfp['top_ofns_desc'] = dfp['OFNS_DESC']
                elif col == 'top_boro_nm' and 'BORO_NM' in dfp.columns:
                    print("[ops] Inferring top_boro_nm from BORO_NM")
                    dfp['top_boro_nm'] = dfp['BORO_NM']
                elif col == 'top_prem_typ_desc' and 'PREM_TYP_DESC' in dfp.columns:
                    print("[ops] Inferring top_prem_typ_desc from PREM_TYP_DESC")
                    dfp['top_prem_typ_desc'] = dfp['PREM_TYP_DESC']
                elif col == 'top_time_bucket' and 'TIME_BUCKET' in dfp.columns:
                    print("[ops] Inferring top_time_bucket from TIME_BUCKET")
                    dfp['top_time_bucket'] = dfp['TIME_BUCKET']
        ops = dfp.rename(columns={k: v for k, v in rename_map.items() if k in dfp.columns})
    elif 'df_kmodes_labeled' in locals() and isinstance(df_kmodes_labeled, pd.DataFrame) and not df_kmodes_labeled.empty:
        print("[ops] Building ops from df_kmodes_labeled (groupby)")
        g = df_kmodes_labeled.groupby('cluster')
        rows = []
        for cid, grp in g:
            rows.append({
                'cluster_id': int(cid),
                'primary_crime': _def(grp['OFNS_DESC']) if 'OFNS_DESC' in grp.columns else 'Unknown',
                'primary_borough': _def(grp['BORO_NM']) if 'BORO_NM' in grp.columns else 'Unknown',
                'primary_premises': _def(grp['PREM_TYP_DESC']) if 'PREM_TYP_DESC' in grp.columns else 'Unknown',
                'primary_time_bucket': _def(grp['TIME_BUCKET']) if 'TIME_BUCKET' in grp.columns else 'Unknown',
                'crime_count': int(len(grp))
            })
        ops = pd.DataFrame(rows)
    else:
        print("[ops] No sources available; creating empty DataFrame")
        ops = pd.DataFrame()
except Exception as e:
    print("[ops] Exception while building ops:", e)
    ops = pd.DataFrame()

# Normalize to canonical column names if imported schema differs
if not ops.empty:
    def ensure_col(dst, target, candidates, transform=None, default_val=np.nan):
        if target in dst.columns:
            return
        for c in candidates:
            if c in dst.columns:
                print(f"[ops] Filling missing column '{target}' from '{c}'")
                dst[target] = dst[c] if transform is None else transform(dst[c])
                return
        print(f"[ops] Column '{target}' not found; filling default")
        dst[target] = default_val

    ensure_col(ops, 'cluster_id', ['cluster', 'cid'], transform=lambda s: s.astype('Int64'))
    ensure_col(ops, 'primary_borough', ['BORO_NM', 'borough', 'top_boro_nm'])
    ensure_col(ops, 'primary_crime', ['OFNS_DESC', 'crime_type', 'top_ofns_desc'])
    ensure_col(ops, 'primary_premises', ['PREM_TYP_DESC', 'premises', 'top_prem_typ_desc'])
    ensure_col(ops, 'primary_time_bucket', ['TIME_BUCKET', 'time_bucket', 'top_time_bucket'])

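# Guarantee an integer crime_count column even when the source lacks one
if 'crime_count' not in ops.columns:
    ops['crime_count'] = 0
ops['crime_count'] = pd.to_numeric(ops['crime_count'], errors='coerce').fillna(0).astype(int)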
if 'concentration_score' not in ops.columns:
    if 'df_kmodes_labeled' in locals() and not df_kmodes_labeled.empty and 'cluster' in df_kmodes_labeled.columns:
        print("[ops] Estimating concentration_score from labeled data")
        conc_vals = []
        for _, r in ops.iterrows():
            cid = r.get('cluster_id', None)
            if pd.isna(cid): conc_vals.append(np.nan); continue
            grp = df_kmodes_labeled[df_kmodes_labeled['cluster'] == cid]
            if grp.empty or 'OFNS_DESC' not in grp.columns:
                conc_vals.append(np.nan); continue
            top_frac = grp['OFNS_DESC'].value_counts(normalize=True).max()
            conc_vals.append(top_frac)
        ops['concentration_score'] = conc_vals
    else:
        print("[ops] concentration_score unavailable; filling NaN")
        ops['concentration_score'] = np.nan

# Priority tiers from crime_count quantiles (concentration_score only orders clusters when sorting below)
q80 = ops['crime_count'].quantile(0.8) if len(ops) else 0
q60 = ops['crime_count'].quantile(0.6) if len(ops) else 0
q40 = ops['crime_count'].quantile(0.4) if len(ops) else 0

def pr_rank(row):
    if row['crime_count'] >= q80: return 'HIGH'
    if row['crime_count'] >= q60: return 'MEDIUM-HIGH'
    if row['crime_count'] >= q40: return 'MEDIUM'
    return 'LOW'

if not ops.empty:
    ops['priority'] = ops.apply(pr_rank, axis=1)
else:
    ops['priority'] = []

# Print concise tables
if not ops.empty:
    display_cols = ['cluster_id','priority','crime_count','primary_borough','primary_crime','primary_premises','primary_time_bucket']
    missing = [c for c in display_cols if c not in ops.columns]
    for c in missing:
        print(f"[ops] Missing display col '{c}', filling with 'Unknown'")
        ops[c] = 'Unknown'

    # Use ordered categorical to sort priority correctly
    pr_order = pd.CategoricalDtype(['HIGH','MEDIUM-HIGH','MEDIUM','LOW'], ordered=True)
    ops['priority_ord'] = ops['priority'].astype(pr_order)
    top_clusters = ops.sort_values(['priority_ord','crime_count','concentration_score'], ascending=[True, False, False])
    top_clusters = top_clusters[display_cols].head(12)
    print('Top clusters (by priority, volume, concentration):')
    print(top_clusters.to_string(index=False))

    if 'df_kmodes_labeled' in locals() and not df_kmodes_labeled.empty:
        pr_map = dict(zip(ops['cluster_id'], ops['priority']))
        base = df_kmodes_labeled.copy()
        base['priority'] = base['cluster'].map(pr_map).fillna('LOW')

        def ctab(col, top_k=None, title=None):
            if col not in base.columns:
                print(f"[ops] Skipping crosstab; missing '{col}'")
                return None
            s = base[col].astype(str)
            if top_k:
                top = s.value_counts().head(top_k).index
                s = s.where(s.isin(top), 'OTHER')
            t = pd.crosstab(s, base['priority'], normalize='index').round(2)
            if title:
                print('\n'+title)
            print(t)
            return t

        borough_priority = ctab('BORO_NM', title='Borough x Priority')
        crime_priority = ctab('OFNS_DESC', top_k=12, title='Top Crime Types x Priority (top 12)')
        premises_priority = ctab('PREM_TYP_DESC', top_k=12, title='Premises x Priority (top 12)')
        time_priority = ctab('TIME_BUCKET', title='Time Bucket x Priority')
        weekend_priority = ctab('IS_WEEKEND', title='Weekend x Priority')
        holiday_priority = ctab('IS_HOLIDAY', title='Holiday x Priority')
        suspsex_priority = ctab('SUSP_SEX', top_k=6, title='Suspect Sex x Priority')
        suspage_priority = ctab('SUSP_AGE_GROUP', top_k=6, title='Suspect Age x Priority')
        vicsex_priority = ctab('VIC_SEX', top_k=6, title='Victim Sex x Priority')
        vicage_priority = ctab('VIC_AGE_GROUP', top_k=6, title='Victim Age x Priority')

        ops_export = ops.copy()
        export_cols = display_cols + ['concentration_score']
        ops_export = ops_export[export_cols]
else:
    print('No operational data available for concise tables.')

[ops] Building ops from df_cluster_profiles
[ops] Inferring crime_count from size
[ops] Estimating concentration_score from labeled data
Top clusters (by priority, volume, concentration):
 cluster_id    priority  crime_count primary_borough                   primary_crime       primary_premises primary_time_bucket
          2        HIGH          916       MANHATTAN                   PETIT LARCENY                 STREET             MORNING
          4        HIGH          816           BRONX                   HARRASSMENT 2                 STREET           AFTERNOON
          0        HIGH          797        BROOKLYN OTHER OFFENSES RELATED TO THEFT   TRANSIT - NYC SUBWAY           AFTERNOON
          1        HIGH          777          QUEENS                   HARRASSMENT 2        RESIDENCE-HOUSE             EVENING
         12 MEDIUM-HIGH          727        BROOKLYN        VEHICLE AND TRAFFIC LAWS                 STREET             EVENING
          3 MEDIUM-HIGH          674        

In [49]:
try:
    csv_path = os.path.join(output_dir, 'police_operational_intelligence_enriched.csv')
    json_path = os.path.join(output_dir, 'executive_crime_summary_enriched.json')

    if 'ops_export' in locals() and not ops_export.empty:
        ops_export.to_csv(csv_path, index=False)
        print(f'CSV saved: {csv_path}')

    if 'executive_summary' in locals() and isinstance(executive_summary, dict):
        with open(json_path, 'w') as f:
            json.dump(executive_summary, f, indent=2)
        print(f'JSON saved: {json_path}')
except Exception as e:
    print('Export skipped/error:', e)


CSV saved: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\JupyterOutputs\Clustering (MultidimensionalClusteringAnalysis)\police_operational_intelligence_enriched.csv
JSON saved: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\JupyterOutputs\Clustering (MultidimensionalClusteringAnalysis)\executive_crime_summary_enriched.json


In [50]:
if 'X_categorical' not in globals():
    column_binner = ColumnBinner(binner_config, suffix="_BIN", fill_unknown="Unknown")
    X_categorical = column_binner.fit_transform(base_df)
    created_bins_km_context = getattr(column_binner, 'created_bins_', [])
    print(f"Created {len(created_bins_km_context)} binned features.")
else:
    print("Using existing X_categorical; skipping re-binning.")


Using existing X_categorical; skipping re-binning.


In [51]:
# Snapshot for K-Modes reference (independent copy; later cells should not modify it)
X_categorical_kmodes_ref = X_categorical.copy()
created_bins_kmodes_ref = list(created_bins_km_context) if 'created_bins_km_context' in globals() else []
print(f"Snapshot created with {len(X_categorical_kmodes_ref.columns)} columns.")


Snapshot created with 12 columns.


In [52]:
# === POLICE EXECUTIVE DASHBOARD ===
print("\n" + "="*60)
print("EXECUTIVE CRIME INTELLIGENCE DASHBOARD")
print("="*60)

def _safe_mode(s, default="Unknown"):
    try:
        m = s.mode(dropna=True)
        return m.iloc[0] if not m.empty else default
    except Exception:
        return default

# Build a compact signature using cluster-level top attributes
# Falls back to row fields when detailed data isn't available

def signature_for_cluster(cluster_id=None, row=None, base_df=None):
    # Initialize fields from row fallbacks
    crime = (row.get('primary_crime') if isinstance(row, dict) else getattr(row, 'primary_crime', None)) or "Unknown"
    borough = (row.get('primary_borough') if isinstance(row, dict) else getattr(row, 'primary_borough', None)) or "Unknown"
    premises = (row.get('primary_premises') if isinstance(row, dict) else getattr(row, 'primary_premises', None)) or None
    time_bucket = (row.get('primary_time_bucket') if isinstance(row, dict) else getattr(row, 'primary_time_bucket', None)) or None

    weekend_tok = None
    holiday_tok = None
    susp_sex = None
    susp_age = None
    vic_sex = None
    vic_age = None

    if base_df is not None and cluster_id is not None and 'cluster' in base_df.columns:
        g = base_df[base_df['cluster'] == cluster_id]
        if not g.empty:
            # Prefer actual cluster modes when available
            if 'OFNS_DESC' in g.columns:
                crime = _safe_mode(g['OFNS_DESC'], crime)
            if 'BORO_NM' in g.columns:
                borough = _safe_mode(g['BORO_NM'], borough)
            if 'PREM_TYP_DESC' in g.columns:
                premises = _safe_mode(g['PREM_TYP_DESC'], premises)
            if 'TIME_BUCKET' in g.columns:
                time_bucket = _safe_mode(g['TIME_BUCKET'], time_bucket)
            if 'IS_WEEKEND' in g.columns:
                w = _safe_mode(g['IS_WEEKEND'], None)
                if pd.isna(w):
                    weekend_tok = None
                else:
                    weekend_tok = 'Weekend' if str(w).lower() in ['1', 'true', 'yes'] else 'Weekday'
            if 'IS_HOLIDAY' in g.columns:
                h = _safe_mode(g['IS_HOLIDAY'], None)
                if pd.isna(h):
                    holiday_tok = None
                else:
                    holiday_tok = 'Holiday' if str(h).lower() in ['1', 'true', 'yes'] else None
            if 'SUSP_SEX' in g.columns:
                susp_sex = _safe_mode(g['SUSP_SEX'], None)
            if 'SUSP_AGE_GROUP' in g.columns:
                susp_age = _safe_mode(g['SUSP_AGE_GROUP'], None)
            if 'VIC_SEX' in g.columns:
                vic_sex = _safe_mode(g['VIC_SEX'], None)
            if 'VIC_AGE_GROUP' in g.columns:
                vic_age = _safe_mode(g['VIC_AGE_GROUP'], None)

    tokens = [str(crime), str(borough)]
    if premises and str(premises) != 'Unknown':
        tokens.append(str(premises))
    if time_bucket and str(time_bucket) != 'Unknown':
        tokens.append(str(time_bucket))
    if weekend_tok:
        tokens.append(weekend_tok)
    if holiday_tok:
        tokens.append(holiday_tok)

    demo_parts = []
    if susp_sex or susp_age:
        demo_parts.append(f"SUSP: {susp_sex or '?'} {susp_age or ''}".strip())
    if vic_sex or vic_age:
        demo_parts.append(f"VIC: {vic_sex or '?'} {vic_age or ''}".strip())
    if demo_parts:
        tokens.append(" | ".join(demo_parts))

    return " - ".join([t for t in tokens if t and str(t).strip()])

# Prefer df_operational; fallback to previously built 'ops'; then to saved enriched CSV
ops_df = None
if 'df_operational' in locals() and isinstance(df_operational, pd.DataFrame) and not df_operational.empty:
    ops_df = df_operational.copy()
elif 'ops' in locals() and isinstance(ops, pd.DataFrame) and not ops.empty:
    ops_df = ops.copy()
else:
    try:
        # Read from the directory the enriched CSV is written to (output_dir)
        csv_path = os.path.join(output_dir, 'police_operational_intelligence_enriched.csv')
        if os.path.exists(csv_path):
            ops_df = pd.read_csv(csv_path)
    except Exception as _e:
        ops_df = None

# Shared selector with fallback
_get_key_pattern = lambda df, key, fallback_col: (
    df.loc[df[key].astype(float).idxmax()] if key in df.columns and pd.api.types.is_numeric_dtype(df[key]) and df[key].notna().any()
    else df.loc[df[fallback_col].idxmax()] if fallback_col in df.columns and df[fallback_col].notna().any()
    else df.iloc[0]
)

if ops_df is not None and not ops_df.empty:
    # Ensure expected columns exist
    for col in ['cluster_id','primary_crime','primary_borough','primary_premises','primary_time_bucket','crime_count']:
        if col not in ops_df.columns:
            ops_df[col] = np.nan

    # Executive summary statistics
    total_crimes = int(ops_df.get('crime_count', pd.Series(dtype=int)).sum()) if 'crime_count' in ops_df.columns else int(len(ops_df))
    high_priority_patterns = int((ops_df.get('priority', pd.Series([])) == 'HIGH').sum()) if 'priority' in ops_df.columns else 0
    high_priority_crimes = int(ops_df[ops_df.get('priority','') == 'HIGH']['crime_count'].sum()) if 'crime_count' in ops_df.columns and 'priority' in ops_df.columns else 0

    # Identify key patterns using shared helper
    most_concentrated = _get_key_pattern(ops_df, 'concentration_score', 'crime_count')
    highest_volume = _get_key_pattern(ops_df, 'crime_count', 'crime_count')

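    # Base dataframe for cluster-level modes
    # NOTE: this rebinds the notebook-level `base_df` (the input frame used by the binning cell);
    # re-run the preprocessing cells before re-running any earlier cell that reads `base_df`.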
    base_df = df_kmodes_labeled.copy() if 'df_kmodes_labeled' in locals() and isinstance(df_kmodes_labeled, pd.DataFrame) and not df_kmodes_labeled.empty else None

    # Compact signatures
    mc_sig = signature_for_cluster(cluster_id=int(most_concentrated.get('cluster_id')) if 'cluster_id' in most_concentrated else None,
                                   row=most_concentrated.to_dict() if hasattr(most_concentrated, 'to_dict') else most_concentrated,
                                   base_df=base_df)
    hv_sig = signature_for_cluster(cluster_id=int(highest_volume.get('cluster_id')) if 'cluster_id' in highest_volume else None,
                                   row=highest_volume.to_dict() if hasattr(highest_volume, 'to_dict') else highest_volume,
                                   base_df=base_df)

    print(f"\nEXECUTIVE SUMMARY")
    print(f"   Total crimes analyzed: {total_crimes:,}")
    print(f"   Crime patterns identified: {len(ops_df)}")
    print(f"   High priority patterns: {high_priority_patterns}")
    if total_crimes > 0 and high_priority_crimes:
        print(f"   High priority crime volume: {high_priority_crimes:,} ({(high_priority_crimes/total_crimes)*100:.1f}%)")

    print(f"\nKEY INSIGHTS")
    if 'concentration_score' in ops_df.columns:
        conc_pct = most_concentrated.get('concentration_score')
        conc_str = f"{float(conc_pct):.0%}" if pd.notna(conc_pct) else "N/A"
        print(f"   Most concentrated pattern: {mc_sig}")
        print(f"      - {conc_str} concentration, {int(most_concentrated.get('crime_count', 0)):,} crimes")
    else:
        print(f"   Key pattern: {mc_sig}")
        print(f"      - {int(most_concentrated.get('crime_count', 0)):,} crimes")
    hv_pct = (highest_volume.get('crime_count', 0) / total_crimes * 100) if total_crimes else 0
    print(f"   Highest volume pattern: {hv_sig}")
    print(f"      - {int(highest_volume.get('crime_count', 0)):,} crimes ({hv_pct:.1f}% of total)")

    # Borough-level intelligence
    if base_df is not None and 'BORO_NM' in base_df.columns:
        print(f"\nBOROUGH CRIME INTELLIGENCE")
        # Map priority to labeled data
        pr_map = dict(zip(ops_df['cluster_id'], ops_df.get('priority', '')))
        cluster_data_all = base_df.copy()
        cluster_data_all['priority'] = cluster_data_all['cluster'].map(pr_map)

        borough_intelligence = cluster_data_all.groupby('BORO_NM').agg({
            'cluster': 'count',
            'priority': lambda x: (x == 'HIGH').sum()
        }).rename(columns={'cluster': 'total_crimes', 'priority': 'high_priority_crimes'})
        borough_intelligence['high_priority_pct'] = (borough_intelligence['high_priority_crimes'] /
                                                     borough_intelligence['total_crimes'] * 100).round(1)
        borough_intelligence = borough_intelligence.sort_values('high_priority_crimes', ascending=False)
        for borough, stats in borough_intelligence.iterrows():
            print(f"   {borough}:")
            print(f"      - Total: {int(stats['total_crimes']):,} | High priority: {int(stats['high_priority_crimes']):,} ({stats['high_priority_pct']:.1f}%)")

    # Operational recommendations summary
    print(f"\nIMMEDIATE ACTION ITEMS")
    high_priority_df = ops_df[ops_df.get('priority','') == 'HIGH'].head(3) if 'priority' in ops_df.columns else pd.DataFrame()
    if not high_priority_df.empty:
        print("   Deploy immediately:")
        for i, (_, pattern) in enumerate(high_priority_df.iterrows(), 1):
            sig = signature_for_cluster(cluster_id=int(pattern.get('cluster_id')) if 'cluster_id' in pattern else None,
                                        row=pattern.to_dict(), base_df=base_df)
            print(f"      {i}. {sig}")
            extra = []
            if 'crime_count' in pattern:
                extra.append(f"{int(pattern['crime_count']):,} crimes")
            if 'concentration_score' in pattern and pd.notna(pattern['concentration_score']):
                extra.append(f"{float(pattern['concentration_score']):.0%} concentration")
            if extra:
                print(f"         - {', '.join(extra)}")

    medium_priority_df = ops_df[ops_df.get('priority','').isin(['MEDIUM-HIGH', 'MEDIUM'])].head(2) if 'priority' in ops_df.columns else pd.DataFrame()
    if not medium_priority_df.empty:
        print("   Plan enhanced operations:")
        for i, (_, pattern) in enumerate(medium_priority_df.iterrows(), 1):
            sig = signature_for_cluster(cluster_id=int(pattern.get('cluster_id')) if 'cluster_id' in pattern else None,
                                        row=pattern.to_dict(), base_df=base_df)
            print(f"      {i}. {sig}")
            if 'crime_count' in pattern:
                print(f"         - {int(pattern['crime_count']):,} crimes")

    # Resource allocation recommendation
    print(f"\nRESOURCE ALLOCATION RECOMMENDATION")
    focus_mask = ops_df.get('priority','').isin(['HIGH', 'MEDIUM-HIGH']) if 'priority' in ops_df.columns else pd.Series(False, index=ops_df.index)
    total_budget_crimes = int(ops_df.loc[focus_mask, 'crime_count'].sum()) if 'crime_count' in ops_df.columns else 0
    print(f"   Focus 80% of resources on {int(focus_mask.sum())} patterns")
    if total_crimes:
        print(f"   Targeting {total_budget_crimes:,} crimes ({(total_budget_crimes/total_crimes)*100:.1f}% of total volume)")
    print(f"   Expected result: maximum impact with focused deployment")

    # Structured details for the two key patterns (where available)
    def structured_details(row_obj):
        cid = int(row_obj.get('cluster_id')) if 'cluster_id' in row_obj else None
        # Derive modes from base_df if possible
        details = {}
        if base_df is not None and cid is not None:
            g = base_df[base_df['cluster'] == cid]
            if not g.empty:
                details.update({
                    'crime_type': _safe_mode(g['OFNS_DESC']) if 'OFNS_DESC' in g.columns else row_obj.get('primary_crime', 'Unknown'),
                    'borough': _safe_mode(g['BORO_NM']) if 'BORO_NM' in g.columns else row_obj.get('primary_borough', 'Unknown'),
                    'premises': _safe_mode(g['PREM_TYP_DESC']) if 'PREM_TYP_DESC' in g.columns else row_obj.get('primary_premises', 'Unknown'),
                    'time_bucket': _safe_mode(g['TIME_BUCKET']) if 'TIME_BUCKET' in g.columns else row_obj.get('primary_time_bucket', 'Unknown'),
                    'is_weekend_mode': bool(str(_safe_mode(g['IS_WEEKEND'], 'False')).lower() in ['1','true','yes']) if 'IS_WEEKEND' in g.columns else None,
                    'is_holiday_mode': bool(str(_safe_mode(g['IS_HOLIDAY'], 'False')).lower() in ['1','true','yes']) if 'IS_HOLIDAY' in g.columns else None,
                    'suspect_sex_mode': _safe_mode(g['SUSP_SEX']) if 'SUSP_SEX' in g.columns else None,
                    'suspect_age_mode': _safe_mode(g['SUSP_AGE_GROUP']) if 'SUSP_AGE_GROUP' in g.columns else None,
                    'victim_sex_mode': _safe_mode(g['VIC_SEX']) if 'VIC_SEX' in g.columns else None,
                    'victim_age_mode': _safe_mode(g['VIC_AGE_GROUP']) if 'VIC_AGE_GROUP' in g.columns else None,
                })
        else:
            details.update({
                'crime_type': row_obj.get('primary_crime', 'Unknown'),
                'borough': row_obj.get('primary_borough', 'Unknown'),
                'premises': row_obj.get('primary_premises', 'Unknown'),
                'time_bucket': row_obj.get('primary_time_bucket', 'Unknown'),
            })
        # Always add volume metrics
        details['volume'] = int(row_obj.get('crime_count', 0))
        if 'concentration_score' in row_obj and pd.notna(row_obj['concentration_score']):
            details['concentration'] = f"{float(row_obj['concentration_score']):.0%}"
        return details

    most_concentrated = most_concentrated if isinstance(most_concentrated, dict) else most_concentrated.to_dict()
    highest_volume = highest_volume if isinstance(highest_volume, dict) else highest_volume.to_dict()
    mc_details = structured_details(most_concentrated)
    hv_details = structured_details(highest_volume)

    # Create executive summary for export
    executive_summary = {
        'analysis_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M'),
        'total_crimes_analyzed': int(total_crimes),
        'patterns_identified': int(len(ops_df)),
        'high_priority_patterns': int(high_priority_patterns),
        'high_priority_crime_percentage': round((high_priority_crimes/total_crimes)*100, 1) if total_crimes else 0.0,
        'most_concentrated_pattern_signature': mc_sig,
        'highest_volume_pattern_signature': hv_sig,
        'most_concentrated_pattern': mc_details,
        'highest_volume_pattern': hv_details,
        'immediate_deployment_needed': high_priority_patterns > 0,
        'resource_focus_patterns': int(focus_mask.sum()) if isinstance(focus_mask, pd.Series) else 0
    }

    # Save executive summary
    with open(os.path.join(output_dir, 'executive_crime_summary.json'), 'w') as f:
        json.dump(executive_summary, f, indent=2, default=str)

    print(f"\nExecutive summary saved to: executive_crime_summary.json")
    print(f"All police intelligence reports saved to: {output_dir}")

else:
    print("No operational data available for executive dashboard")


EXECUTIVE CRIME INTELLIGENCE DASHBOARD

EXECUTIVE SUMMARY
   Total crimes analyzed: 10,000
   Crime patterns identified: 18
   High priority patterns: 4
   High priority crime volume: 3,306 (33.1%)

KEY INSIGHTS
   Most concentrated pattern: PETIT LARCENY - MANHATTAN - CHAIN STORE - MORNING - Weekday - SUSP: M UNKNOWN | VIC: D UNKNOWN
      - 75% concentration, 531 crimes
   Highest volume pattern: PETIT LARCENY - MANHATTAN - STREET - MORNING - Weekday - SUSP: U UNKNOWN | VIC: M 45-64
      - 916 crimes (9.2% of total)

BOROUGH CRIME INTELLIGENCE
   BRONX:
      - Total: 2,230 | High priority: 1,031 (46.2%)
   QUEENS:
      - Total: 2,160 | High priority: 843 (39.0%)
   BROOKLYN:
      - Total: 2,747 | High priority: 675 (24.6%)
   MANHATTAN:
      - Total: 2,447 | High priority: 598 (24.4%)
   STATEN ISLAND:
      - Total: 416 | High priority: 159 (38.2%)

IMMEDIATE ACTION ITEMS
   Deploy immediately:
      1. OTHER OFFENSES RELATED TO THEFT - BROOKLYN - TRANSIT - NYC SUBWAY - AFTERN

In [53]:
# Helpers
_def = object()

def _get(name, default=None):
    return globals().get(name, default)

def _as_mapping(x):
    return dict(x) if isinstance(x, Mapping) else {}

def _to_int(x, default=None):
    try:
        if x is None: return default
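        # Accept NumPy scalars too (e.g. int64 from pandas aggregations), which are not Python ints
        if isinstance(x, (int, np.integer)):
            return int(x)
        if isinstance(x, (float, np.floating)) and math.isfinite(x):
            return int(x)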
        if isinstance(x, str) and x.strip():
            return int(float(x))
    except Exception:
        pass
    return default

def _to_pct(x, default=None):
    try:
        if x is None: return default
        if isinstance(x, (int, float)) and math.isfinite(x):
            return round(float(x), 1)
        s = str(x).strip()
        if s.endswith('%'):
            return round(float(s[:-1]), 1)
        return round(float(s), 1)
    except Exception:
        return default

def _build_signature(p):
    if not p:
        return None
    crime = p.get('crime_type') or p.get('CRIME') or p.get('OFNS_DESC') or p.get('PD_DESC') or p.get('LAW_CAT') or p.get('LAW_CAT_CD')
    borough = p.get('borough') or p.get('BORO_NM') or p.get('BOROUGH') or p.get('BORO')
    premises = p.get('premises') or p.get('LOC_OF_OCCUR') or p.get('PREM_TYP_DESC') or p.get('LOCATION')
    timeb = p.get('time_bucket') or p.get('TIME_BUCKET') or p.get('TIME') or p.get('TIME_BIN')
    is_weekend = bool(p.get('is_weekend_mode')) if 'is_weekend_mode' in p else bool(p.get('WEEKEND'))
    wkd = 'Weekend' if is_weekend else 'Weekday'
    susp_sex = p.get('suspect_sex_mode') or p.get('SUSP_SEX') or p.get('SUSPECT_SEX') or 'U'
    susp_age = p.get('suspect_age_mode') or p.get('SUSP_AGE') or p.get('SUSPECT_AGE') or 'UNKNOWN'
    vic_sex = p.get('victim_sex_mode') or p.get('VIC_SEX') or p.get('VICTIM_SEX') or 'U'
    vic_age = p.get('victim_age_mode') or p.get('VIC_AGE') or p.get('VICTIM_AGE') or 'UNKNOWN'
    parts = [crime, borough, premises, timeb]
    parts = [str(x) for x in parts if x not in (None, '', 'UNKNOWN')]
    left = ' - '.join(parts) if parts else 'Pattern'
    return f"{left} - {wkd} - SUSP: {susp_sex} {susp_age} | VIC: {vic_sex} {vic_age}"

def _coerce_pattern(src):
    p = _as_mapping(src)
    if not p:
        return None
    # Normalize selected fields, keep extras if already present
    out = {
        'crime_type': p.get('crime_type') or p.get('CRIME') or p.get('OFNS_DESC') or p.get('PD_DESC') or p.get('LAW_CAT') or p.get('LAW_CAT_CD'),
        'borough': p.get('borough') or p.get('BORO_NM') or p.get('BOROUGH') or p.get('BORO'),
        'premises': p.get('premises') or p.get('LOC_OF_OCCUR') or p.get('PREM_TYP_DESC') or p.get('LOCATION'),
        'time_bucket': p.get('time_bucket') or p.get('TIME_BUCKET') or p.get('TIME') or p.get('TIME_BIN'),
        'is_weekend_mode': bool(p.get('is_weekend_mode')) if 'is_weekend_mode' in p else bool(p.get('WEEKEND')),
        'is_holiday_mode': bool(p.get('is_holiday_mode')) if 'is_holiday_mode' in p else bool(p.get('HOLIDAY')),
        'suspect_sex_mode': p.get('suspect_sex_mode') or p.get('SUSP_SEX') or p.get('SUSPECT_SEX'),
        'suspect_age_mode': p.get('suspect_age_mode') or p.get('SUSP_AGE') or p.get('SUSPECT_AGE'),
        'victim_sex_mode': p.get('victim_sex_mode') or p.get('VIC_SEX') or p.get('VICTIM_SEX'),
        'victim_age_mode': p.get('victim_age_mode') or p.get('VIC_AGE') or p.get('VICTIM_AGE'),
    }
    # Carry over optional metrics if available
    for k in ('volume','count','cases','concentration','concentration_score'):
        if p.get(k) is not None:
            out[k] = p.get(k)
    # Prefer unified names
    if out.get('count') is not None and out.get('volume') is None:
        out['volume'] = out['count']
    if out.get('cases') is not None and out.get('volume') is None:
        out['volume'] = out['cases']
    return {k:v for k,v in out.items() if v is not None}

# Collect inputs
project_root = _get('project_root') or os.getcwd()
output_dir = os.path.join(project_root, 'JupyterOutputs', 'Clustering (MultidimensionalClusteringAnalysis)')
os.makedirs(output_dir, exist_ok=True)

# Totals and counts
_total_crimes = _to_int(_get('total_crimes'))
if _total_crimes is None:
    df = _get('df')
    try:
        _total_crimes = int(len(df)) if df is not None else None
    except Exception:
        _total_crimes = None

cluster_profiles = _get('cluster_profiles') or []
_patterns_identified = _to_int(len(cluster_profiles), 0)

_high_priority_patterns = _to_int(_get('high_priority_patterns'), 0)
_high_priority_crimes = _to_int(_get('high_priority_crimes'), None)

_pct = None
if _high_priority_crimes is not None and _total_crimes:
    _pct = round(100.0 * _high_priority_crimes / max(1, _total_crimes), 1)

# Patterns
mc = _coerce_pattern(_get('most_concentrated'))
hv = _coerce_pattern(_get('highest_volume'))

mc_sig = _build_signature(mc) if mc else None
hv_sig = _build_signature(hv) if hv else None

# Deployment heuristic
conc_val = None
if mc and mc.get('concentration') is not None:
    conc_val = _to_pct(mc.get('concentration'))

# conc_val already in percent units (0-100); compare to 60.0 threshold
_immediate = bool((_high_priority_patterns or 0) > 0 or (conc_val is not None and conc_val >= 60.0))
_focus = max(1, min(5, _high_priority_patterns or max(1, _patterns_identified // 3)))

exec_enriched = {
    'analysis_date': datetime.now().strftime('%Y-%m-%d %H:%M'),
    'total_crimes_analyzed': _total_crimes,
    'patterns_identified': _patterns_identified,
    'high_priority_patterns': _high_priority_patterns,
    'high_priority_crime_percentage': _pct,
    'most_concentrated_pattern_signature': mc_sig,
    'highest_volume_pattern_signature': hv_sig,
    'most_concentrated_pattern': mc,
    'highest_volume_pattern': hv,
    'immediate_deployment_needed': _immediate,
    'resource_focus_patterns': _focus,
}

json_path = os.path.join(output_dir, 'executive_crime_summary_enriched.json')
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(exec_enriched, f, indent=2, ensure_ascii=False)
print(f"JSON saved: {json_path}")

JSON saved: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\JupyterOutputs\Clustering (MultidimensionalClusteringAnalysis)\executive_crime_summary_enriched.json


In [54]:
expected_files = [
    os.path.join(output_dir, "executive_crime_summary.json"),
    os.path.join(output_dir, "executive_crime_summary_enriched.json"),
    os.path.join(output_dir, "police_operational_intelligence_enriched.csv")
]

files_info = []
for fp in expected_files:
    exists = os.path.exists(fp)
    size_kb = (os.path.getsize(fp) / 1024.0) if exists else 0.0
    files_info.append({
        "file": str(fp),
        "status": "OK" if exists else ("MISSING (optional)" if os.path.basename(fp).endswith(("_enriched.json", "_enriched.csv")) else "MISSING"),
        "size_kb": round(size_kb, 1)
    })

print("Deliverables summary:")
for info in files_info:
    print(info)

Deliverables summary:
{'file': 'c:\\UNIVERSITA MAG\\Data mining and Machine learning\\Progetto\\crime-analyzer\\JupyterOutputs\\Clustering (MultidimensionalClusteringAnalysis)\\executive_crime_summary.json', 'status': 'OK', 'size_kb': 1.3}
{'file': 'c:\\UNIVERSITA MAG\\Data mining and Machine learning\\Progetto\\crime-analyzer\\JupyterOutputs\\Clustering (MultidimensionalClusteringAnalysis)\\executive_crime_summary_enriched.json', 'status': 'OK', 'size_kb': 0.7}
{'file': 'c:\\UNIVERSITA MAG\\Data mining and Machine learning\\Progetto\\crime-analyzer\\JupyterOutputs\\Clustering (MultidimensionalClusteringAnalysis)\\police_operational_intelligence_enriched.csv', 'status': 'OK', 'size_kb': 1.5}


---

# Advanced Clustering Methods

This section evaluates two further clustering approaches, categorical dimensionality reduction followed by KMeans and Spectral Clustering, to look for crime patterns that the K-Modes analysis may miss. They complement the operational analysis above and are intended for exploratory and research use rather than day-to-day deployment.


## Categorical Dimensionality Reduction + Clustering

Categorical dimensionality reduction transforms high-cardinality categorical data into a lower-dimensional continuous space, enabling the application of distance-based clustering algorithms while preserving the essential categorical relationships. 

The Categorical Dimensionality Reduction Pipeline follows our established architecture and uses One-Hot encoding followed by PCA instead of traditional Multiple Correspondence Analysis (MCA):

`CategoricalPreprocessor → CategoricalDimensionalityReducer → KMeans`

Why OneHot + PCA instead of MCA?
- Numerical stability: no NaN values produced during transformation
- Robust implementation: well-tested sklearn components
- Consistent results: reproducible across different data distributions
- Better performance: more efficient and scalable for large datasets

In [55]:
def make_cd_pipeline(n_components=5, random_state=RANDOM_STATE):
    dimred = CategoricalDimensionalityReducer(n_components=n_components, random_state=random_state)
    cluster = KMeans(n_clusters=16, random_state=random_state, n_init=10)
    return Pipeline([
        ("dimred", dimred),
        ("cluster", cluster)
    ])

categorical_dimred_pipeline = make_cd_pipeline()

In [56]:
if 'X_eval' not in globals():
    if 'X_categorical' in globals():
        X_eval = X_categorical.copy()
    elif 'df_kmodes' in globals() and 'CATEGORICAL_FEATURES_KMODES_AVAILABLE' in globals():
        # Fill missing values before casting: astype(str) would turn NaN into the string 'nan'
        X_eval = df_kmodes[CATEGORICAL_FEATURES_KMODES_AVAILABLE].astype(object).fillna('Unknown').astype(str)
    elif 'categorical_available' in globals() and 'df' in globals():
        cols = [c for c in categorical_available if c in df.columns]
        X_eval = df[cols].astype(object).fillna('Unknown').astype(str)
    else:
        X_eval = pd.DataFrame()

if 'cat_dimred_param_grid' not in globals():
    cat_dimred_param_grid = {
        'dimred__n_components': [10, 20, 30],
        'cluster__n_clusters': [8, 12, 16, 20],
        'cluster__n_init': [10, 20]
    }

In [57]:
cd_best_score = -1.0
cd_best_params = None
cd_final = None

# Normalize/expand grid to a list of dicts (cartesian product for dict-of-lists)
if isinstance(cat_dimred_param_grid, Mapping):
    keys = list(cat_dimred_param_grid.keys())
    vals = []
    for k in keys:
        v = cat_dimred_param_grid[k]
        if isinstance(v, (list, tuple, np.ndarray)):
            vals.append(list(v))
        else:
            vals.append([v])
    grid_iter = [dict(zip(keys, combo)) for combo in product(*vals)]
elif isinstance(cat_dimred_param_grid, (list, tuple)):
    # Ensure each element is a dict
    for p in cat_dimred_param_grid:
        if not isinstance(p, Mapping):
            raise TypeError("Each entry in cat_dimred_param_grid must be a dict of parameters")
    grid_iter = list(cat_dimred_param_grid)
else:
    raise TypeError("cat_dimred_param_grid must be a dict or a list of dicts")

for params in grid_iter:
    # Clone so each candidate owns its estimators; reusing .steps would share (and overwrite) the same objects
    pipe_copy = clone(categorical_dimred_pipeline)

    # Allow parameters for both dimred and cluster
    prefixed = {}
    for k, v in params.items():
        if k.startswith("cluster__") or k.startswith("dimred__"):
            prefixed[k] = v
        else:
            # Default to cluster params if unprefixed common names appear
            prefixed[f"cluster__{k}"] = v

    pipe_copy.set_params(**prefixed)

    # Fit-transform through the dimred step and ensure numpy arrays for KMeans to avoid feature-name warnings
    X_val_transformed_df = pipe_copy.named_steps['dimred'].fit_transform(X_eval)
    X_val_transformed = np.asarray(X_val_transformed_df)
    labels = pipe_copy.named_steps['cluster'].fit_predict(X_val_transformed)

    # Silhouette is defined only for 2 <= n_labels <= n_samples - 1
    if len(np.unique(labels)) > 1 and len(np.unique(labels)) < len(labels):
        score = silhouette_score(X_val_transformed, labels)
    else:
        score = -1.0

    if score > cd_best_score:
        cd_best_score = score
        cd_best_params = dict(prefixed)
        cd_final = pipe_copy

print(f"Best CD+KMeans score: {cd_best_score:.4f}")
print(f"Best CD+KMeans params: {cd_best_params}")


Best CD+KMeans score: 0.1973
Best CD+KMeans params: {'dimred__n_components': 10, 'cluster__n_clusters': 16, 'cluster__n_init': 10}


In [58]:
if 'spectral_param_grid' not in globals():
    spectral_param_grid = {
        'n_clusters': [8, 12, 16],
        'affinity': ['rbf', 'nearest_neighbors'],
        'gamma': [0.1, 0.5, 1.0, 2.0],
        'n_neighbors': [10, 20, 30],
        'assign_labels': ['kmeans']
    }

### Spectral Clustering: pipeline overview

We apply Spectral Clustering on a mixed-feature representation assembled with a ColumnTransformer:
- Categorical features → GroupBalancedOneHotEncoder (each feature scaled by 1/√k across its one-hot group)
- Numerical features → StandardScaler

The resulting numeric matrix feeds SpectralClustering with a small grid over n_clusters and affinity/nearest-neighbors settings.

Pipeline: `ColumnTransformer(GroupBalancedOneHotEncoder for categoricals, StandardScaler for numericals) → SpectralClustering`
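
The group-balanced scaling can be sketched as follows. This is an illustrative stand-in, not the project's `GroupBalancedOneHotEncoder`: it assumes that k is the number of categories observed for a feature and that every column in that feature's one-hot group is multiplied by 1/√k. The helper name `group_balanced_one_hot` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def group_balanced_one_hot(frame: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode each column and scale its group by 1/sqrt(k).

    Assumption: k is the number of distinct categories of the column.
    """
    blocks = []
    for col in frame.columns:
        dummies = pd.get_dummies(frame[col], prefix=col, dtype=float)
        k = dummies.shape[1]
        blocks.append(dummies / np.sqrt(k))
    return pd.concat(blocks, axis=1)

toy = pd.DataFrame({
    "weekend": ["Y", "N", "N", "Y"],                      # k = 2
    "premises": ["STREET", "STORE", "TRANSIT", "PARK"],   # k = 4
})
encoded = group_balanced_one_hot(toy)

# Exactly one column per feature is active in each row, so the squared norm
# a feature contributes to a row is 1/k.
print(encoded.round(3))
```

Under this assumed scaling, two rows that differ on a feature are separated by a squared Euclidean distance of 2/k on that feature's group, so a mismatch on a many-valued feature weighs less than a mismatch on a binary one.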

In [59]:
# Spectral clustering evaluation with integrated group-balanced categorical encoding.

def _expand_param_grid(grid):
    if isinstance(grid, Mapping):
        keys = list(grid.keys())
        vals = []
        for k in keys:
            v = grid[k]
            if isinstance(v, (list, tuple, np.ndarray)):
                vals.append(list(v))
            else:
                vals.append([v])
        return [dict(zip(keys, combo)) for combo in product(*vals)]
    elif isinstance(grid, (list, tuple)):
        out = []
        for p in grid:
            if not isinstance(p, Mapping):
                raise TypeError("Each entry in spectral_param_grid must be a dict of parameters")
            out.append(p)
        return out
    else:
        raise TypeError("spectral_param_grid must be a dict or a list of dicts")

def _normalize_colnames(X_df, cols):
    df_cols_str = set(map(str, X_df.columns))
    normalized = []
    for c in cols or []:
        cs = str(c)
        if cs in df_cols_str and cs not in normalized:
            normalized.append(cs)
    return normalized

def _sanitize_params(params, n_samples):
    allowed_unprefixed = {
        'n_clusters','affinity','gamma','n_neighbors','assign_labels',
        'eigen_solver','n_init','n_components','random_state'
    }
    prefixed = {}
    for k, v in params.items():
        if k.startswith('dimred__'):
            continue
        if '__' in k:
            if k.startswith('cluster__'):
                key = k.split('__', 1)[1]
                if key in allowed_unprefixed:
                    prefixed[k] = v
        else:
            if k in allowed_unprefixed:
                prefixed[f"cluster__{k}"] = v
    nk_key = 'cluster__n_clusters'
    if nk_key in prefixed:
        try:
            nk = int(prefixed[nk_key])
            if nk <= 1 or nk >= n_samples:
                return None
        except Exception:
            return None
    aff_key = 'cluster__affinity'
    nn_key = 'cluster__n_neighbors'
    if prefixed.get(aff_key) == 'nearest_neighbors':
        nn = int(prefixed.get(nn_key, min(10, max(2, n_samples-1))))
        if nn <= 1:
            nn = 2
        if nn >= n_samples:
            nn = max(2, n_samples-1)
        prefixed[nn_key] = nn
    return prefixed

def spectral_clustering_evaluation(X_df, categorical_cols, numerical_cols, param_grid, random_state=RANDOM_STATE):
    if not isinstance(X_df, pd.DataFrame):
        raise TypeError("X_df must be a pandas DataFrame")

    categorical_cols = _normalize_colnames(X_df, categorical_cols)
    numerical_cols = _normalize_colnames(X_df, numerical_cols)

    if not categorical_cols and not numerical_cols:
        raise ValueError("No features available for SpectralClustering: categorical_cols and numerical_cols are both empty.")

    pre = ColumnTransformer(
        transformers=[
            ("cat", GroupBalancedOneHotEncoder(), categorical_cols),
            ("num", StandardScaler(with_mean=True, with_std=True), numerical_cols),
        ],
        remainder='drop'
    )

    pipe = Pipeline([
        ("pre", pre),
        ("cluster", SpectralClustering(random_state=random_state, assign_labels='kmeans')),
    ])

    best_score = -np.inf
    best_params = None
    best_pipe = None

    grid_iter = _expand_param_grid(param_grid)

    # Fit preprocessing once (deterministic embedding)
    X_emb_base = pipe.named_steps['pre'].fit_transform(X_df)
    n_samples = X_emb_base.shape[0]

    for params in grid_iter:
        prefixed = _sanitize_params(params, n_samples)
        if prefixed is None:
            continue
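        # Share the fitted preprocessor but clone the clusterer, so each candidate keeps its own parameters
        pipe_try = Pipeline([("pre", pipe.named_steps['pre']), ("cluster", clone(pipe.named_steps['cluster']))])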
        try:
            pipe_try.set_params(**prefixed)
            labels = pipe_try.named_steps['cluster'].fit_predict(X_emb_base)
            if len(np.unique(labels)) > 1 and len(np.unique(labels)) < len(labels):
                sil = silhouette_score(X_emb_base, labels, metric='euclidean')
            else:
                sil = -1.0
            if sil > best_score:
                best_score = sil
                best_params = dict(prefixed)
                best_pipe = pipe_try
        except Exception:
            continue

    if not np.isfinite(best_score):
        try:
            nk = min(8, max(2, n_samples-1))
            fallback = {'n_clusters': nk, 'affinity': 'rbf', 'assign_labels': 'kmeans'}
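            # Same pattern as in the grid loop: shared fitted preprocessor, fresh clusterer
            pipe_fallback = Pipeline([("pre", pipe.named_steps['pre']), ("cluster", clone(pipe.named_steps['cluster']))])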
            pipe_fallback.set_params(**{f"cluster__{k}": v for k, v in fallback.items()})
            labels = pipe_fallback.named_steps['cluster'].fit_predict(X_emb_base)
            if len(np.unique(labels)) > 1 and len(np.unique(labels)) < len(labels):
                sil = silhouette_score(X_emb_base, labels, metric='euclidean')
            else:
                sil = -1.0
            best_score = sil
            best_params = {f"cluster__{k}": v for k, v in fallback.items()}
            best_pipe = pipe_fallback
        except Exception:
            pass

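    if best_pipe is not None:
        # Reuse the already fitted preprocessor (contains balanced encoding).
        # Assign through .steps: named_steps returns a fresh Bunch, so item assignment on it has no effect.
        best_pipe.steps[0] = ("pre", pipe.named_steps['pre'])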

    return best_score, best_params, best_pipe, X_emb_base

# Run evaluation
categorical_cols = list(categorical_available) if 'categorical_available' in locals() else []
numerical_cols = list(available_numericals) if 'available_numericals' in locals() else []
param_grid = spectral_param_grid

best_spectral_score, best_spectral_params, spectral_pipeline, _X_spectral_embedding = spectral_clustering_evaluation(
    X_eval, categorical_cols, numerical_cols, param_grid, random_state=RANDOM_STATE
)

print(f"Best Spectral score: {best_spectral_score:.4f}")
print(f"Best Spectral params: {best_spectral_params}")
print("[Info] Group-balanced categorical encoding is always active.")

Best Spectral score: 0.1034
Best Spectral params: {'cluster__n_clusters': 8, 'cluster__affinity': 'nearest_neighbors', 'cluster__gamma': 0.1, 'cluster__n_neighbors': 30, 'cluster__assign_labels': 'kmeans'}
[Info] Group-balanced categorical encoding is always active.


### Spectral: search space and configuration

We evaluate SpectralClustering over a compact grid of parameters. Features are prepared with a ColumnTransformer combining GroupBalancedOneHotEncoder for categoricals and StandardScaler for numericals. The transformed matrix is then clustered via SpectralClustering.

Key search dimensions:
- n_clusters
- affinity (rbf with gamma vs nearest_neighbors with n_neighbors)
- assign_labels = 'kmeans'
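
In scikit-learn's `SpectralClustering`, `gamma` is ignored when `affinity='nearest_neighbors'` and `n_neighbors` is ignored when `affinity='rbf'`, so a full Cartesian product over a dict-of-lists grid refits several equivalent configurations. One way to avoid this, sketched here with values that mirror the default `spectral_param_grid`, is to give `ParameterGrid` a list of dictionaries, one per affinity. The name `conditional_grid` is hypothetical.

```python
from sklearn.model_selection import ParameterGrid

conditional_grid = [
    {   # rbf affinity: gamma matters, n_neighbors is ignored
        "n_clusters": [8, 12, 16],
        "affinity": ["rbf"],
        "gamma": [0.1, 0.5, 1.0, 2.0],
        "assign_labels": ["kmeans"],
    },
    {   # nearest_neighbors affinity: n_neighbors matters, gamma is ignored
        "n_clusters": [8, 12, 16],
        "affinity": ["nearest_neighbors"],
        "n_neighbors": [10, 20, 30],
        "assign_labels": ["kmeans"],
    },
]

candidates = list(ParameterGrid(conditional_grid))
full_product = 3 * 2 * 4 * 3 * 1  # size of the dict-of-lists grid
print(len(candidates), "candidates instead of", full_product)
```

Because `_expand_param_grid` accepts a list of parameter dicts, a list such as `candidates` could be passed to `spectral_clustering_evaluation` in place of the dict-of-lists grid.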

In [60]:
# Guard against names that exist but hold None (e.g. no valid configuration was found)
method_comparison = {
    "K-Modes": {
        "best_score": float(kmodes_best_score) if globals().get('kmodes_best_score') is not None else None,
        "best_params": dict(kmodes_best_params) if globals().get('kmodes_best_params') is not None else None,
    },
    "CategoricalDimRed+KMeans": {
        "best_score": float(cd_best_score) if globals().get('cd_best_score') is not None else None,
        "best_params": dict(cd_best_params) if globals().get('cd_best_params') is not None else None,
    },
    "Spectral": {
        "best_score": float(best_spectral_score) if globals().get('best_spectral_score') is not None else None,
        "best_params": dict(best_spectral_params) if globals().get('best_spectral_params') is not None else None,
    }
}

comparison_path = Path(project_root) / "JupyterOutputs" / "Clustering (MultidimensionalClusteringAnalysis)" / "pipeline_methods_comparison.json"
comparison_path.parent.mkdir(parents=True, exist_ok=True)
with open(comparison_path, 'w', encoding='utf-8') as f:
    json.dump(method_comparison, f, indent=2)

print("Saved:", comparison_path)


Saved: c:\UNIVERSITA MAG\Data mining and Machine learning\Progetto\crime-analyzer\JupyterOutputs\Clustering (MultidimensionalClusteringAnalysis)\pipeline_methods_comparison.json


In [61]:
# Build comparison with distinct variable names to avoid overwrites
comparison_summary = {}

if 'kmodes_best_score' in globals() and kmodes_best_score is not None:
    comparison_summary['K-Modes'] = {
        'best_score': kmodes_best_score,
        'best_params': kmodes_best_params
    }

if 'cd_best_score' in globals():  # categorical dimred + kmeans
    comparison_summary['CatDimRed+KMeans'] = {
        'best_score': float(cd_best_score),
        'best_params': cd_best_params
    }

if 'best_spectral_score' in globals():
    comparison_summary['Spectral'] = {
        'best_score': float(best_spectral_score),
        'best_params': best_spectral_params
    }

comparison_summary

{'K-Modes': {'best_score': 0.5866698350071736,
  'best_params': {'cluster__init': 'Huang',
   'cluster__n_clusters': 18,
   'cluster__n_init': 5}},
 'CatDimRed+KMeans': {'best_score': 0.19733085500583938,
  'best_params': {'dimred__n_components': 10,
   'cluster__n_clusters': 16,
   'cluster__n_init': 10}},
 'Spectral': {'best_score': 0.10340519954298237,
  'best_params': {'cluster__n_clusters': 8,
   'cluster__affinity': 'nearest_neighbors',
   'cluster__gamma': 0.1,
   'cluster__n_neighbors': 30,
   'cluster__assign_labels': 'kmeans'}}}

### Brief Interpretation (Spectral vs K-Modes)
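A low Spectral Clustering silhouette (≈0.10) is **not a failure** in itself. Spectral clustering separates points in the Laplacian eigenvector space, whereas the silhouette reported here is computed with Euclidean distance on the encoded mixed-feature matrix, where compactness is naturally weaker for smooth, overlapping crime patterns. The three scores are also measured in different spaces (categorical dissimilarity for K‑Modes, the PCA-reduced space for One‑Hot + PCA → KMeans, the encoded feature matrix for Spectral), so they rank the methods only loosely and are not strictly comparable. We adopt K‑Modes as the primary method because it is simpler, more directly interpretable (mode-based centroids), and achieves the highest silhouette (≈0.59) under categorical dissimilarity. Spectral remains a complementary exploratory tool rather than the operational default.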
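
The point that a low Euclidean silhouette need not signal poor spectral clusters can be illustrated on synthetic data. The sketch below is a toy example, not the crime dataset: it clusters two concentric rings with `SpectralClustering` and then compares agreement with the known ring membership (adjusted Rand index) against the Euclidean silhouette. The dataset and all parameter values are assumptions chosen for illustration.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two concentric rings: separable on a neighbor graph, not compact in Euclidean terms
X_rings, y_rings = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

spectral = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    eigen_solver="arpack",
    assign_labels="kmeans",
    random_state=0,
)
# scikit-learn may warn that the neighbor graph is not fully connected;
# that is expected when the rings do not touch.
ring_labels = spectral.fit_predict(X_rings)

ari = adjusted_rand_score(y_rings, ring_labels)   # agreement with the true rings
sil = silhouette_score(X_rings, ring_labels, metric="euclidean")
print(f"ARI = {ari:.2f}, Euclidean silhouette = {sil:.2f}")
```

When the rings are recovered, the outer ring's points are on average farther from each other than from the inner ring, which pulls the Euclidean silhouette down even though the partition matches the true structure.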

## Literature comparison

| Study (year)                 | Domain / Data                                                                 | Method                                                                                   | Evaluation metric(s)                                                             | Key findings                                                                                                                                            | Differences vs our approach                                                                                                                                                                                                                                              |
| ---------------------------- | ----------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Nath (2006)**              | Real crime incident records (sheriff’s office)                                | K-means clustering (with attribute weighting & semi-supervised refinement)               | Qualitative validation; improved analyst productivity                            | Clustering exposed repeat crime patterns (e.g., same suspect profiles across incidents) aiding faster crime resolution                                  | Used generic K-means requiring manual tuning for categorical data; our approach uses K-modes/Spectral to handle categoricals natively and excludes bias-prone features (e.g., race) not addressed in their work.                                                         |
| **Gharehchopogh & Haggi (2020)** | Community-level crime stats (UCI “Community and Crime” dataset, 1994 samples) | Hybrid K-modes clustering with Elephant Herding Optimization for centroid initialization | Cluster purity (unsupervised accuracy) = 91.45%                                  | Metaheuristic-guided K-modes achieved high-purity clusters of crimes by similarity                                                                      | Focuses on optimizing one clustering algorithm for aggregated data; our approach clusters individual incidents with richer features (time, location context, demographics) and compares multiple clustering methods rather than a single optimized workflow.             |
| **Duda (2021)**              | Urban traffic accidents (42k crashes, Pittsburgh 2010–2019)                   | K-modes clustering on categorical crash attributes (road type, time, etc.)               | No explicit metric (clusters evaluated via chi-square feature association tests) | Identified distinct accident clusters (e.g., nearly all accidents in one cluster occurred at intersections), enabling narrative profiles of crash types | Different domain (traffic accidents vs. crime incidents); uses only categorical clustering, whereas we cluster crime incidents and integrate additional context features (e.g., POI proximity, holiday flags) and multiple algorithms for a more comprehensive analysis. |
| **Al-Ibrahim & Kurdi (2024)**    | Unstructured crime reports (text documents, \~3.5k crime news narratives)     | Spectral graph clustering enhanced with GCN (graph neural network) embeddings            | Silhouette = 0.77; Davies–Bouldin Index = 0.51 (high cluster quality)            | Graph-based spectral clustering outperformed traditional methods, finding clearer crime report groupings despite sparse high-dimensional data           | Targets textual data with complex graph/deep learning techniques; our approach handles structured categorical data with emphasis on interpretability and does not require neural networks, focusing instead on operational features and simpler clustering models.       |


## Conclusion

**The Multidimensional Clustering Workflow**:
- Uncovers categorical and mixed‑feature crime patterns beyond geography (borough, premises, time bucket, demographics, POI/context)
- Compares complementary methods: K‑Modes, One‑Hot + PCA → KMeans, and Spectral with group‑balanced encoding
- Uses balanced encoding and interpretable binning to prevent high‑cardinality dominance and keep clusters readable
- Produces operational intelligence artifacts (executive summary, priority tables, enriched CSV) for police planning and briefings
- Remains modular and extensible across feature groups, encoders, reducers, clustering algorithms, and export/reporting

**What the results tell us**:
- Clusters form consistent “profiles” of crime: primary crime types, premises, time windows, and demographic modes per borough
- Priority ranking focuses resources on the highest‑volume and most concentrated patterns for maximum impact

**Future Work**:
- **Constraint‑based clustering**: inject domain knowledge (must‑/cannot‑link) for finer operational relevance
- **Temporal dynamics**: rolling windows, drift monitoring, and decay weighting to emphasize recent incidents
- **Significance and actionability**: simple permutation tests for concentration, counterfactual checks; integrate with Spatial Hotspots for joint targeting