# Experiment 3: What Do Hidden Outliers Look Like?

This experiment visually demonstrates the difference between generating hidden outliers in **latent space** versus **pixel space**.

## Motivation

A natural question arises: if we generate hidden outliers for image data, what do they actually look like?

The answer shows why generating in a low-dimensional latent representation works better than generating in raw pixels:
- **Latent space outliers**: Look like slightly distorted but recognizable digits
- **Pixel space outliers**: Look like random noise added to images

## Why the Difference?

When generating in latent space:
- BISECT finds outliers in a **meaningful** low-dimensional representation
- Decoding maps these back to the **manifold of plausible images**
- The decoder acts as a "denoiser," constraining outputs to look like real data

When generating directly in pixel space:
- BISECT can only examine a tiny fraction of 2^784 possible subspaces
- Perturbations spread across unexamined subspaces as random noise
- No constraint forces outputs to look like real images
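
To make the scale of the two search spaces concrete, the sketch below counts feature subspaces in each setting. It assumes the values used later in this notebook (a 12-dimensional latent space and a budget of 2^11 examined subspaces in pixel space). `n_subspaces` is a helper introduced here for illustration only and is not part of `hog_bisect`.

```python
import math

def n_subspaces(n_features: int) -> int:
    """Number of non-empty feature subsets of an n-dimensional space."""
    return 2 ** n_features - 1

latent_total = n_subspaces(12)   # assumed latent dimension
pixel_budget = 2 ** 11           # assumed subspace budget in pixel space

# log10 of the (approximate) fraction of pixel subspaces covered by the budget
log10_fraction = math.log10(pixel_budget) - 784 * math.log10(2)

print(f"Latent space: {latent_total} subspaces in total")
print(f"Pixel space: roughly 10^{log10_fraction:.0f} of all subspaces examined")
```

The latent count is small enough to enumerate, while the pixel-space budget covers a vanishing share of the possible subspaces.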

In [None]:
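import os

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from pyod.models.lof import LOF

from hog_bisect import BisectHOGen

np.random.seed(42)

# Figures are saved to ../results below, so make sure the directory exists
os.makedirs('../results', exist_ok=True)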

## 1. Load MNIST Data

In [None]:
try:
    from tensorflow.keras.datasets import mnist
except ImportError:
    from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten and normalize
x_train_flat = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test_flat = x_test.reshape(-1, 784).astype('float32') / 255.0

# Use a subset of digit '3' for this experiment
DIGIT = 3
N_SAMPLES = 500

X_data = x_train_flat[y_train == DIGIT][:N_SAMPLES]
print(f"Using {len(X_data)} samples of digit '{DIGIT}'")
print(f"Data shape: {X_data.shape}")

In [None]:
# Show some original samples
fig, axes = plt.subplots(2, 10, figsize=(15, 3))
fig.suptitle(f'Original MNIST Digit {DIGIT} Samples', fontsize=14)

for i in range(20):
    ax = axes[i // 10, i % 10]
    ax.imshow(X_data[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

plt.tight_layout()
plt.savefig('../results/original_digits.png', dpi=150, bbox_inches='tight')
plt.show()

## 2. Generate Hidden Outliers in Latent Space

First, we project to a low-dimensional latent space using PCA, generate hidden outliers there, then project back to pixel space.
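
The effect of this encode, perturb, decode round trip can be sketched on synthetic data. The example below is an illustration under stated assumptions, not the BISECT procedure: the data is random rather than MNIST, the perturbations are hand-made rather than generated hidden outliers, and `project` is a hypothetical helper. It compares how far a point ends up from the PCA subspace when it is perturbed in latent space versus directly in the ambient space.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for image data: 500 points near a 12-dim subspace of R^784
basis = rng.normal(size=(12, 784))
X = rng.normal(size=(500, 12)) @ basis + 0.01 * rng.normal(size=(500, 784))

pca = PCA(n_components=12, random_state=0).fit(X)

def project(points):
    """Encode with PCA and decode again (projection onto the PCA subspace)."""
    return pca.inverse_transform(pca.transform(points))

# Perturb one point in latent space, then decode it
z = pca.transform(X[:1])
z_shifted = z + 3.0 * np.sqrt(pca.explained_variance_)
decoded = pca.inverse_transform(z_shifted)

# Perturb the same point directly in the ambient space with isotropic noise
noisy = X[:1] + rng.normal(size=(1, 784))

# Distance of each perturbed point from the PCA subspace
off_decoded = np.linalg.norm(decoded - project(decoded))
off_noisy = np.linalg.norm(noisy - project(noisy))
print(f"Off-subspace distance, latent perturbation:  {off_decoded:.2e}")
print(f"Off-subspace distance, ambient perturbation: {off_noisy:.2f}")
```

The decoded point stays inside the subspace spanned by the principal components however large the latent shift is, whereas most of an ambient perturbation falls outside it.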

In [None]:
LATENT_DIM = 12  # Low enough for tractable subspace enumeration

pca = PCA(n_components=LATENT_DIM, random_state=42)
X_latent = pca.fit_transform(X_data)

explained_var = np.sum(pca.explained_variance_ratio_) * 100
print(f"Latent dimension: {LATENT_DIM}")
print(f"Variance explained: {explained_var:.1f}%")
print(f"Latent shape: {X_latent.shape}")

In [None]:
# Generate hidden outliers in latent space
gen_latent = BisectHOGen(
    data=X_latent,
    outlier_detection_method=LOF,
    seed=42
)

hidden_latent = gen_latent.fit_generate(
    gen_points=100,
    get_origin_type='weighted',
    n_jobs=1
)

print(f"Generated {len(hidden_latent)} hidden outliers in latent space")
gen_latent.print_summary()

In [None]:
# Decode back to pixel space using PCA inverse transform
hidden_latent_decoded = pca.inverse_transform(hidden_latent)

# Clip to valid range [0, 1]
hidden_latent_decoded = np.clip(hidden_latent_decoded, 0, 1)

print(f"Decoded hidden outliers shape: {hidden_latent_decoded.shape}")

In [None]:
# Show hidden outliers generated in latent space
n_show = min(20, len(hidden_latent_decoded))
rows = 2 if n_show > 10 else 1
cols = min(10, n_show)

fig, axes = plt.subplots(rows, cols, figsize=(15, 3 * rows), squeeze=False)
fig.suptitle('Hidden Outliers Generated in LATENT Space (then decoded)', fontsize=14)

# squeeze=False keeps `axes` two-dimensional for any rows/cols combination
for ax in axes.ravel():
    ax.axis('off')  # also blanks unused panels when n_show < rows * cols

for i in range(n_show):
    axes[i // cols, i % cols].imshow(hidden_latent_decoded[i].reshape(28, 28), cmap='gray')

plt.tight_layout()
plt.savefig('../results/hidden_outliers_from_latent.png', dpi=150, bbox_inches='tight')
plt.show()

## 3. Generate Hidden Outliers Directly in Pixel Space

Now let's generate hidden outliers directly in the 784-dimensional pixel space. Due to computational constraints, BISECT will only examine a subset of possible subspaces.

In [None]:
# Generate hidden outliers directly in pixel space
# Note: With 784 dimensions, BISECT will use random subspace sampling
# max_dimensions=11 means it samples 2^11 = 2048 subspaces (vs 2^784 total)

gen_pixel = BisectHOGen(
    data=X_data,
    outlier_detection_method=LOF,
    seed=42,
    max_dimensions=11  # Limit subspace enumeration
)

hidden_pixel = gen_pixel.fit_generate(
    gen_points=100,
    get_origin_type='weighted',
    n_jobs=1
)

print(f"Generated {len(hidden_pixel)} hidden outliers in pixel space")
gen_pixel.print_summary()

In [None]:
# Show hidden outliers generated directly in pixel space
n_show = min(20, len(hidden_pixel))
rows = 2 if n_show > 10 else 1
cols = min(10, n_show)

fig, axes = plt.subplots(rows, cols, figsize=(15, 3 * rows), squeeze=False)
fig.suptitle('Hidden Outliers Generated Directly in PIXEL Space (784 dims)', fontsize=14)

# squeeze=False keeps `axes` two-dimensional for any rows/cols combination
for ax in axes.ravel():
    ax.axis('off')  # also blanks unused panels when n_show < rows * cols

for i in range(n_show):
    img = np.clip(hidden_pixel[i], 0, 1)  # Clip to valid range
    axes[i // cols, i % cols].imshow(img.reshape(28, 28), cmap='gray')

plt.tight_layout()
plt.savefig('../results/hidden_outliers_from_pixel.png', dpi=150, bbox_inches='tight')
plt.show()

## 4. Side-by-Side Comparison

In [None]:
# Create side-by-side comparison figure
n_compare = min(10, len(hidden_latent_decoded), len(hidden_pixel))

fig, axes = plt.subplots(3, n_compare, figsize=(15, 5), squeeze=False)
fig.suptitle('Comparison: Original vs Latent-Space Outliers vs Pixel-Space Outliers', fontsize=14)

image_rows = [
    ('Original', X_data),
    ('Latent\nSpace', hidden_latent_decoded),
    ('Pixel\nSpace', np.clip(hidden_pixel, 0, 1)),
]

for r, (label, images) in enumerate(image_rows):
    for i in range(n_compare):
        ax = axes[r, i]
        ax.imshow(images[i].reshape(28, 28), cmap='gray')
        # Hide ticks and frame rather than calling axis('off'),
        # which would also hide the row label set below
        ax.set_xticks([])
        ax.set_yticks([])
        for spine in ax.spines.values():
            spine.set_visible(False)
        if i == 0:
            ax.set_ylabel(label, rotation=0, labelpad=50, fontsize=12)

plt.tight_layout()
plt.savefig('../results/hidden_outliers_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

## 5. Quantitative Analysis

In [None]:
def compute_stats(images, name):
    """Compute statistics about generated images."""
    images = np.clip(images, 0, 1)
    
    # Mean pixel value
    mean_pixel = np.mean(images)
    
    # Standard deviation of pixel values
    std_pixel = np.std(images)
    
    # Sparsity (fraction of near-zero pixels)
    sparsity = np.mean(images < 0.1)
    
    # Mean L2 distance from each image to its nearest original sample (first 50 images only)
    if len(images) > 0 and len(X_data) > 0:
        min_dists = []
        for img in images[:50]:  # Limit for speed
            dists = np.linalg.norm(X_data - img, axis=1)
            min_dists.append(np.min(dists))
        mean_min_dist = np.mean(min_dists)
    else:
        mean_min_dist = np.nan
    
    print(f"\n{name}:")
    print(f"  Mean pixel value: {mean_pixel:.3f}")
    print(f"  Std pixel value: {std_pixel:.3f}")
    print(f"  Sparsity (pixels < 0.1): {sparsity:.1%}")
    print(f"  Mean min L2 dist to original: {mean_min_dist:.3f}")

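# For the original data the nearest-sample distance is 0 by construction,
# because every sample is its own nearest neighbour
compute_stats(X_data, "Original Data")
compute_stats(hidden_latent_decoded, "Latent Space Outliers")
compute_stats(hidden_pixel, "Pixel Space Outliers")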
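
The nearest-sample distance computed in the loop above could also be obtained with scikit-learn's `NearestNeighbors`, which avoids the explicit Python loop. The sketch below is one possible formulation, not the notebook's own method. `mean_min_l2` is a hypothetical helper name, and the toy arrays are chosen only so the result can be checked by hand.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_min_l2(reference, queries):
    """Mean Euclidean distance from each query to its nearest reference point."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference)
    distances, _ = nn.kneighbors(queries)
    return float(distances.mean())

# Toy check: the first query coincides with a reference point (distance 0),
# the second is 3 away from [0, 0] and 4 away from [3, 4]
reference = np.array([[0.0, 0.0], [3.0, 4.0]])
queries = np.array([[0.0, 0.0], [3.0, 0.0]])
print(mean_min_l2(reference, queries))
```

In this notebook's terms the call would be `mean_min_l2(X_data, np.clip(images, 0, 1))`, which would use every generated image rather than only the first 50.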

## Conclusion

This experiment visually demonstrates a key insight:

### Latent Space Generation
- Hidden outliers look like **slightly distorted but recognizable digits**
- The PCA inverse transform (decoder) acts as a constraint: every decoded outlier lies in the 12-dimensional affine subspace spanned by the principal components, which serves here as a linear approximation of the manifold of plausible images
- These outliers are **semantically meaningful**: they represent variations that push the boundaries of what the model considers "normal"

### Pixel Space Generation
- Hidden outliers look like **noisy, degraded images**
- Because BISECT can only examine a tiny fraction of possible subspaces (2^11 out of 2^784), perturbations spread as random noise
- The "outlierness" is distributed across unexamined dimensions

### Implications

1. **Manifold learning makes hidden outlier generation tractable**: we work with 12 dimensions instead of 784, so subspaces can be enumerated rather than sampled
2. **The decoder constrains outputs to be realistic**: generated outliers inherit the structure of real data
3. **Quality vs computational cost**: latent space generation produces more meaningful outliers with less computation

This explains why the approach from Experiments 1 and 2 works: the latent space captures meaningful structure, and hidden outliers generated there represent genuine boundary cases rather than random noise.