# Hospital Anomalies: Isolation Forest Baseline Model

This notebook demonstrates the baseline Isolation Forest anomaly detection model with:
- **Deterministic NaN-free feature engineering**
- **Interactive anomaly date tables with Google search links**
- **Mathematical explanation of the Isolation Forest algorithm**
- **Annotated visualizations with anomaly dates**

In [None]:
import sys
from pathlib import Path

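# Make the `case_studies` package importable: its parent directory is
# three levels above the notebook's working directory.
sys.path.insert(0, str(Path.cwd().parent.parent.parent))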

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from case_studies.hospital_anomalies.src.io import load_config
from case_studies.hospital_anomalies.src.features import build_features
from case_studies.hospital_anomalies.src.models.isolation_forest import IsolationForestDetector
from case_studies.hospital_anomalies.src.evaluation import get_anomaly_dates, print_anomaly_dates_table
from case_studies.hospital_anomalies.src.visualize import plot_time_series_with_anomalies

%matplotlib inline

## Mathematical Explanation: Isolation Forest

### Overview
Isolation Forest is an unsupervised anomaly detection algorithm based on the principle that **anomalies are rare and different**, making them easier to isolate from normal data points.

### Algorithm

#### 1. **Isolation Trees**
The algorithm builds an ensemble of *isolation trees* (iTrees). Each tree is constructed by:

1. Randomly selecting a feature $q$ from the feature set
2. Randomly selecting a split value $p$ between the minimum and maximum values of $q$
3. Recursively partitioning the data until:
   - All points are isolated (external nodes), or
   - The tree reaches the maximum depth $\lceil \log_2(n) \rceil$, where $n$ is the number of points sampled for that tree

#### 2. **Path Length**
For each data point $x$, the **path length** $h(x)$ is the number of edges traversed from the root to the leaf node.

**Key Insight**: Anomalies have shorter average path lengths because they are easier to isolate.
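
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by `IsolationForestDetector`: the function names are hypothetical, every tree reuses the same sample instead of drawing a fresh subsample, and the path-length adjustment that the full algorithm applies at depth-limited leaves is omitted. Instead of storing a tree, it follows only the branch that contains $x$, which gives the same path length.

```python
import math
import random


def isolation_path_length(x, sample, rng, max_depth):
    """Count random splits until x is alone in its partition or max_depth is hit."""
    depth = 0
    while len(sample) > 1 and depth < max_depth:
        q = rng.randrange(len(x))                 # random feature q
        lo = min(row[q] for row in sample)
        hi = max(row[q] for row in sample)
        if lo == hi:                              # simplification: stop if q cannot split
            break
        p = rng.uniform(lo, hi)                   # random split value p
        # Keep only the side of the split that contains x.
        if x[q] < p:
            sample = [row for row in sample if row[q] < p]
        else:
            sample = [row for row in sample if row[q] >= p]
        depth += 1
    return depth


def mean_path_length(x, sample, rng, n_trees=200):
    """Average the path length of x over n_trees independent random trees."""
    max_depth = math.ceil(math.log2(len(sample)))
    total = sum(isolation_path_length(x, sample, rng, max_depth) for _ in range(n_trees))
    return total / n_trees


rng = random.Random(0)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(255)]
outlier = (8.0, 8.0)                              # far from the Gaussian cloud
sample = normal + [outlier]
inlier = min(normal, key=lambda r: r[0] ** 2 + r[1] ** 2)  # point nearest the center

outlier_mean = mean_path_length(outlier, sample, rng)
inlier_mean = mean_path_length(inlier, sample, rng)
print(f"outlier: {outlier_mean:.2f} splits, inlier: {inlier_mean:.2f} splits")
```

The outlier is separated after only a few splits, while the central point usually runs into the depth limit.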

#### 3. **Anomaly Score**
The anomaly score for a point $x$ is calculated as:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where:
- $E(h(x))$ = average path length of $x$ across all trees
- $c(n)$ = normalization constant: the average path length of an unsuccessful search in a binary search tree (BST) with $n$ nodes
- $c(n) = 2H(n-1) - \frac{2(n-1)}{n}$, where $H(i) \approx \ln(i) + 0.5772$ is the $i$-th harmonic number (0.5772 is the Euler-Mascheroni constant)
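
As a quick numerical check, the score formula can be written out directly. The function names below are illustrative, and the special cases for $n \le 2$ follow a common convention rather than anything stated in this notebook.

```python
import math

EULER_GAMMA = 0.5772156649


def harmonic(i):
    """Approximate the harmonic number H(i) as ln(i) + Euler-Mascheroni constant."""
    return math.log(i) + EULER_GAMMA


def c(n):
    """Normalization constant c(n) = 2H(n-1) - 2(n-1)/n."""
    if n <= 1:
        return 0.0   # convention: a single point has no path
    if n == 2:
        return 1.0   # convention: two points are separated by one split
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)); only meaningful for n >= 2."""
    return 2.0 ** (-mean_path_length / c(n))


n = 256
print(f"c({n}) = {c(n):.3f}")
for path in (1.0, c(n), 2 * c(n)):
    print(f"E(h) = {path:6.3f} -> s = {anomaly_score(path, n):.3f}")
```

A path length equal to $c(n)$ gives exactly $s = 0.5$, shorter paths push the score toward 1, and longer paths push it toward 0.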

#### 4. **Interpretation**
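- $s(x, n) \approx 1$: Anomaly (very short average path length)
- $s(x, n) \approx 0.5$: Average path length, $E(h(x)) \approx c(n)$; if every point scores near 0.5, the data contains no distinct anomalies
- $s(x, n)$ well below 0.5: Normal point (longer-than-average path), with $s \to 0$ for very long paths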

### Advantages
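1. **Low computational cost**: Training takes $O(t \psi \log \psi)$ for $t$ trees built on subsamples of size $\psi$, so with a fixed subsample size it does not grow with the dataset; scoring $n$ points is linear in $n$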
2. **No distance/density computations**: Works well in high dimensions
3. **Unsupervised**: No labeled data required
4. **Handles irrelevant features**: Random feature selection reduces impact

### Parameters
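Defaults below are those of scikit-learn's `IsolationForest`, whose parameter names these match.

- `n_estimators`: Number of isolation trees (default: 100)
- `max_samples`: Number of samples to draw for each tree (default: 'auto', which uses min(256, n_samples))
- `contamination`: Expected proportion of anomalies (default: 'auto'; this notebook falls back to 0.1 when the config does not set it)
- `random_state`: Seed for reproducibility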
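
For reference, this is how the same parameters are passed to scikit-learn's `IsolationForest` directly. Whether `IsolationForestDetector` wraps this class is an assumption, since the notebook does not show its implementation, and the toy data here is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 2))
X_demo[:5] += 8.0                        # shift five rows far away from the rest

model = IsolationForest(
    n_estimators=100,                    # number of isolation trees
    max_samples="auto",                  # min(256, n_samples) points per tree
    contamination=0.01,                  # expected share of anomalies
    random_state=42,                     # reproducible trees
)
model.fit(X_demo)

labels = model.predict(X_demo)           # +1 = inlier, -1 = anomaly
# score_samples returns the negated paper score, so flip the sign
# to get s(x, n), where higher means more anomalous.
scores = -model.score_samples(X_demo)

print("flagged rows:", int((labels == -1).sum()))
print("mean score, shifted rows:", round(float(scores[:5].mean()), 3))
print("mean score, other rows:  ", round(float(scores[5:].mean()), 3))
```

`contamination` only sets the threshold on the scores; it does not change the trees or the scores themselves.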

## 1. Load Data and Configuration

In [None]:
# Load configuration
config_path = Path.cwd().parent / 'config' / 'default.yaml'
config = load_config(config_path)
config_dict = config

# For demo purposes, create synthetic data
# In practice, load from your data pipeline
np.random.seed(42)
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
df = pd.DataFrame({
    'date': dates,
    'admissions': np.random.poisson(100, len(dates)) + np.random.randn(len(dates)) * 5,
    'occupancy_rate': np.random.uniform(0.7, 0.95, len(dates))
})

# Add some anomalies
anomaly_indices = [50, 51, 52, 150, 151, 250, 300]
df.loc[anomaly_indices, 'admissions'] = df.loc[anomaly_indices, 'admissions'] * 1.5
df.loc[anomaly_indices, 'occupancy_rate'] = 0.99

print(f"Loaded {len(df)} days of data")
df.head()

## 2. Feature Engineering with NaN Handling

`build_features()` guarantees a NaN-free feature matrix. The cell below verifies this by printing the total NaN count.
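
How `build_features()` meets that guarantee is not shown in this notebook. One possible approach, sketched below under that caveat, is to build lag and rolling features and then fill the warm-up rows deterministically. The function name, feature names, and fill strategy are all hypothetical.

```python
import numpy as np
import pandas as pd


def build_features_sketch(df, value_cols, lags=(1, 7), window=7):
    """Hypothetical stand-in for build_features(): lag and rolling features, no NaNs."""
    out = df.sort_values("date").reset_index(drop=True)
    for col in value_cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)      # NaN in the first `lag` rows
        rolling = out[col].rolling(window, min_periods=1)
        out[f"{col}_roll_mean{window}"] = rolling.mean()
        out[f"{col}_roll_std{window}"] = rolling.std()        # NaN in the first row
    # Deterministic fill: backfill the warm-up rows, then forward-fill, then zero.
    filled = out.drop(columns="date").bfill().ffill().fillna(0.0)
    filled.insert(0, "date", out["date"])
    return filled


demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=30, freq="D"),
    "admissions": np.arange(30, dtype=float),
})
features = build_features_sketch(demo, ["admissions"])
print("NaNs:", int(features.isna().sum().sum()))
print(list(features.columns))
```

Backfilling copies later values into the first rows, which is acceptable for unsupervised detection on a fixed history but would leak future information in a forecasting setup.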

In [None]:
# Build features with automatic NaN handling
features_df = build_features(df, config_dict)

print(f"Engineered {len(features_df.columns)} features")
print(f"Total NaNs in feature matrix: {features_df.isna().sum().sum()}")
print(f"\nFeature columns: {list(features_df.columns[:15])}...")

## 3. Train Isolation Forest

In [None]:
# Select features (exclude date and metadata columns)
feature_cols = [
    col for col in features_df.columns
    if col not in ['date', 'region', 'hospital_id', 'year', 'month', 'day', 
                   'quarter', 'week_of_year', 'day_of_week', 'day_of_year']
]

X = features_df[feature_cols]
print(f"Training on {len(feature_cols)} features")

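# Initialize and train Isolation Forest.
# Note: the 0.1 contamination fallback flags about 10% of rows, well above the
# roughly 2% (7 of 365 days) injected into the synthetic data above.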
if_config = config_dict.get('isolation_forest', {})
detector = IsolationForestDetector(
    n_estimators=if_config.get('n_estimators', 100),
    contamination=if_config.get('contamination', 0.1),
    random_state=42
)

detector.fit(X)
print("✓ Isolation Forest trained")

## 4. Detect Anomalies

In [None]:
# Get anomaly predictions
predictions = detector.get_anomalies(X)

print(f"Total anomalies detected: {predictions['is_anomaly'].sum()}")
print(f"Anomaly rate: {predictions['is_anomaly'].mean():.2%}")

# Combine with original data
results_df = features_df.copy()
results_df['is_anomaly'] = predictions['is_anomaly']
results_df['anomaly_score'] = predictions['anomaly_score']

## 5. Anomaly Dates with Google Search Links

**Interactive table with clickable links to investigate events on anomaly dates**
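
The link for each anomaly date can be built with the standard library alone. The helper below is a hypothetical sketch of that idea. The query format that `get_anomaly_dates()` actually uses is not shown in this notebook, and the search topic is an assumed default.

```python
from datetime import date
from urllib.parse import urlencode


def google_search_url(day, topic="hospital admissions"):
    """Build a Google search URL for a topic on a given date (hypothetical helper)."""
    query = f"{topic} {day.isoformat()}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# 2023-02-20 is row 50 of the synthetic series, the first injected anomaly.
print(google_search_url(date(2023, 2, 20)))
# → https://www.google.com/search?q=hospital+admissions+2023-02-20
```

`urlencode` handles the escaping, so topics containing spaces or punctuation still produce a valid URL.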

In [None]:
# Extract anomaly dates with Google search links
anomaly_dates = get_anomaly_dates(
    results_df,
    date_col='date',
    anomaly_col='is_anomaly',
    score_col='anomaly_score',
    value_cols=['admissions', 'occupancy_rate'],
    include_search_links=True,
    include_news_placeholder=True
)

# Display interactive table (clickable links in Jupyter)
print_anomaly_dates_table(anomaly_dates, max_rows=20)

## 6. Visualize Results with Date Labels

**Time series plot with anomalies highlighted and labeled with dates**

In [None]:
# Plot time series with anomaly labels
fig = plot_time_series_with_anomalies(
    results_df,
    date_col='date',
    value_col='admissions',
    anomaly_col='is_anomaly',
    title='Hospital Admissions with Anomaly Detection',
    annotate_dates=True,
    max_annotations=10
)
plt.show()

# Plot occupancy rate
fig = plot_time_series_with_anomalies(
    results_df,
    date_col='date',
    value_col='occupancy_rate',
    anomaly_col='is_anomaly',
    title='Occupancy Rate with Anomaly Detection',
    annotate_dates=True,
    max_annotations=10
)
plt.show()

## 7. Summary Statistics

In [None]:
# Summary of anomalies
print("=== Anomaly Detection Summary ===")
print(f"Total data points: {len(results_df)}")
print(f"Anomalies detected: {results_df['is_anomaly'].sum()}")
print(f"Anomaly rate: {results_df['is_anomaly'].mean():.2%}")
print("\nScore statistics:")
print(results_df['anomaly_score'].describe())