# Hidden Outlier Generation Tutorial

This notebook provides an interactive guide to generating hidden outliers using the `hog_bisect` library.

## What are Hidden Outliers?

Hidden outliers are data points whose outlier status depends on which feature subspace you examine:

- **H1 (Subspace Hidden)**: Outlier in some feature subspace but NOT in the full feature space
- **H2 (Full-space Hidden)**: Outlier in the full feature space but NOT in any subspace

These are useful for benchmarking outlier detection algorithms, especially subspace-aware methods.
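To make the H2 case concrete, here is a minimal NumPy-only sketch that does not use `hog_bisect`. It assumes two strongly correlated features, treats the two single-feature projections as the subspaces, and uses an illustrative threshold of 3 on z-scores and on the Mahalanobis distance. These choices are assumptions made for illustration, not the library's detection logic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated features: points lie close to the diagonal x1 = x2.
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=2000)

# A candidate point that sits off the diagonal.
p = np.array([1.5, -1.5])
diff = p - X.mean(axis=0)

# Subspace view: absolute z-score of each coordinate on its own.
marginal_z = np.abs(diff) / X.std(axis=0)

# Full-space view: Mahalanobis distance, which accounts for the correlation.
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
full_dist = float(np.sqrt(diff @ inv_cov @ diff))

threshold = 3.0  # illustrative cut-off, not taken from hog_bisect
outlier_in_subspaces = bool(np.any(marginal_z > threshold))
outlier_in_full_space = full_dist > threshold

print(f"marginal z-scores: {marginal_z.round(2)}")
print(f"full-space distance: {full_dist:.2f}")
print(f"H2-like point: {outlier_in_full_space and not outlier_in_subspaces}")
```

Each coordinate of the point looks ordinary on its own, but the combination contradicts the correlation between the features, so only the full-space view flags it.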

## Setup

First, let's import the necessary libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from pyod.models.lof import LOF
from pyod.models.knn import KNN

from hog_bisect import BisectHOGen, OutlierResultType

# For reproducibility
np.random.seed(42)

## 1. Basic Usage

Let's start with a simple example: generate hidden outliers from random data.

In [None]:
# Generate synthetic normal data
n_samples = 200
n_features = 5
data = np.random.randn(n_samples, n_features)

print(f"Dataset shape: {data.shape}")

In [None]:
# Create the hidden outlier generator
generator = BisectHOGen(
    data=data,
    outlier_detection_method=LOF,  # Local Outlier Factor
    seed=42
)

# Generate hidden outliers
hidden_outliers = generator.fit_generate(
    gen_points=50,  # Number of candidate points to generate
    n_jobs=1        # Use single core
)

print(f"Generated {len(hidden_outliers)} hidden outliers")
print(f"Shape: {hidden_outliers.shape}")

In [None]:
# Print summary
generator.print_summary()

## 2. Visualizing Hidden Outliers

Let's generate outliers in 2D so we can visualize them.

In [None]:
# Create 2D clustered data for visualization
cluster1 = np.random.randn(100, 2) * 0.5 + [0, 0]
cluster2 = np.random.randn(100, 2) * 0.5 + [4, 4]
data_2d = np.vstack([cluster1, cluster2])

# Generate hidden outliers
gen_2d = BisectHOGen(data=data_2d, outlier_detection_method=LOF, seed=42)
outliers_2d = gen_2d.fit_generate(gen_points=30, n_jobs=1)

print(f"Generated {len(outliers_2d)} hidden outliers")

In [None]:
# Visualize
fig, ax = plt.subplots(figsize=(10, 8))

# Plot original data
ax.scatter(data_2d[:, 0], data_2d[:, 1], c='blue', alpha=0.5, s=30, label='Original data')

# Plot hidden outliers
if len(outliers_2d) > 0:
    ax.scatter(outliers_2d[:, 0], outliers_2d[:, 1], c='red', marker='x',
               s=100, linewidths=2, label=f'Hidden outliers (n={len(outliers_2d)})')

ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Hidden Outlier Generation')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

## 3. Comparing Origin Methods

The origin is the starting point from which the bisection algorithm searches outward. Different origin strategies can yield different numbers and types of hidden outliers.
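The idea of searching outward from an origin can be sketched with a generic bisection along a ray. This is not the `hog_bisect` implementation. The helper names `is_outlier` and `bisect_boundary` are hypothetical, and the "detector" is a simple ball around a center, chosen only so the result is easy to check.

```python
import numpy as np

def is_outlier(point, center, radius):
    """Stand-in detector: a point is an outlier if it lies outside a ball."""
    return np.linalg.norm(point - center) > radius

def bisect_boundary(origin, direction, is_out, max_len=10.0, tol=1e-6):
    """Return the distance along `direction` at which `is_out` flips to True."""
    direction = direction / np.linalg.norm(direction)
    lo, hi = 0.0, max_len
    # Bisection needs an inlier at the origin and an outlier at the far end.
    if is_out(origin) or not is_out(origin + hi * direction):
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_out(origin + mid * direction):
            hi = mid  # boundary is closer to the origin
        else:
            lo = mid  # boundary is farther from the origin
    return 0.5 * (lo + hi)

center = np.zeros(2)
origin = np.array([0.5, 0.0])      # starting point inside the inlier region
direction = np.array([1.0, 0.0])   # search direction
boundary = bisect_boundary(origin, direction, lambda x: is_outlier(x, center, 2.0))
print(f"boundary at distance {boundary:.4f} from the origin")
```

Moving the origin changes where each ray crosses the inlier boundary, which is one way to see why the choice of origin can affect the points that are found.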

In [None]:
origin_methods = ['centroid', 'weighted', 'random', 'least outlier']
results = {}

for origin in origin_methods:
    gen = BisectHOGen(data=data, outlier_detection_method=LOF, seed=42)
    outliers = gen.fit_generate(gen_points=30, get_origin_type=origin, n_jobs=1)
    
    h1_count = np.sum(gen.hidden_x_type == 'H1') if gen.hidden_x_type is not None else 0
    h2_count = np.sum(gen.hidden_x_type == 'H2') if gen.hidden_x_type is not None else 0
    
    results[origin] = {'total': len(outliers), 'H1': h1_count, 'H2': h2_count}
    print(f"{origin:15} -> {len(outliers):3} outliers (H1: {h1_count}, H2: {h2_count})")

In [None]:
# Plot comparison
fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(len(origin_methods))
width = 0.35

h1_vals = [results[o]['H1'] for o in origin_methods]
h2_vals = [results[o]['H2'] for o in origin_methods]

ax.bar(x - width/2, h1_vals, width, label='H1 (subspace hidden)', color='steelblue')
ax.bar(x + width/2, h2_vals, width, label='H2 (full-space hidden)', color='coral')

ax.set_xlabel('Origin Method')
ax.set_ylabel('Count')
ax.set_title('Hidden Outliers by Origin Method')
ax.set_xticks(x)
ax.set_xticklabels(origin_methods)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.show()

## 4. Comparing Detection Methods

Each detector defines outlierness differently, so the hidden outliers found depend on the detection method passed to `BisectHOGen`.

In [None]:
from pyod.models.iforest import IForest

detectors = [
    ('LOF', LOF),
    ('KNN', KNN),
    ('IForest', IForest)
]

detector_results = {}

for name, detector in detectors:
    gen = BisectHOGen(data=data, outlier_detection_method=detector, seed=42)
    outliers = gen.fit_generate(gen_points=30, n_jobs=1)
    detector_results[name] = len(outliers)
    print(f"{name:10} -> {len(outliers):3} hidden outliers")

## 5. Understanding Outlier Types

The generator classifies each point into one of four categories:

| Type | Name | Description |
|------|------|-------------|
| H1 | Subspace Hidden | Outlier in some subspace, NOT in the full space |
| H2 | Full-space Hidden | Outlier in the full space, NOT in any subspace |
| OB | Outside Bounds | Outlier in both the full space and a subspace (not hidden) |
| IL | Inlier | Not an outlier anywhere |
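The table above can be read as a mapping from two boolean flags to a label. The following sketch spells that mapping out. The function `classify` is a hypothetical helper written for this tutorial, not part of `hog_bisect`, and it returns plain strings rather than `OutlierResultType` members.

```python
def classify(outlier_in_full_space: bool, outlier_in_some_subspace: bool) -> str:
    """Map the two outlier flags to the four categories from the table."""
    if outlier_in_some_subspace and not outlier_in_full_space:
        return "H1"  # hidden: visible only in a subspace
    if outlier_in_full_space and not outlier_in_some_subspace:
        return "H2"  # hidden: visible only in the full space
    if outlier_in_full_space and outlier_in_some_subspace:
        return "OB"  # outlier everywhere, so not hidden
    return "IL"      # inlier everywhere

for full in (False, True):
    for sub in (False, True):
        print(f"full={full!s:5} subspace={sub!s:5} -> {classify(full, sub)}")
```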

In [None]:
# Check the OutlierResultType enum
for result_type in OutlierResultType:
    print(f"{result_type.name}: hidden={result_type.is_hidden_outlier()}, indicator={result_type.indicator}")

## 6. Reproducibility

Using the same seed with the same data and parameters produces identical results.

In [None]:
# Run twice with same seed
gen1 = BisectHOGen(data=data, outlier_detection_method=LOF, seed=123)
result1 = gen1.fit_generate(gen_points=20, n_jobs=1)

gen2 = BisectHOGen(data=data, outlier_detection_method=LOF, seed=123)
result2 = gen2.fit_generate(gen_points=20, n_jobs=1)

print(f"Results identical: {np.array_equal(result1, result2)}")
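The same principle can be seen in isolation with NumPy's seeded generators. This sketch only illustrates seeding in general. How `BisectHOGen` consumes its `seed` internally is not shown in this notebook, and the helper `draw` is hypothetical.

```python
import numpy as np

def draw(seed, n=5):
    """Draw n standard-normal values from a freshly seeded generator."""
    return np.random.default_rng(seed).standard_normal(n)

same_a = draw(123)
same_b = draw(123)
other = draw(124)

print(f"same seed identical: {np.array_equal(same_a, same_b)}")
print(f"different seed identical: {np.array_equal(same_a, other)}")
```

Creating a fresh generator per run, rather than relying on global random state, keeps each run independent of whatever random calls happened earlier in the notebook.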

## Summary

Key takeaways:

1. **BisectHOGen** is the main class for generating hidden outliers
2. **Origin methods** affect which hidden outliers are found (`weighted` is recommended)
3. **Detection methods** from PyOD can be plugged in (LOF, KNN, IForest, etc.)
4. **H1** and **H2** are the two types of hidden outliers
5. Use **seed** for reproducible results

For more information, see the [GitHub repository](https://github.com/dschulmeist/hidden-outlier-generation).