# 06 - Gaussian Mixture Model (GMM) Anomaly Detection

## Method 4 of 4: Probabilistic Approach

### What is GMM?

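A GMM models the data as a mixture of several Gaussian distributions. Like K-Means it groups points into clusters, but it assigns each point a **probability** of belonging to each cluster instead of a single hard label.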

| K-Means | GMM |
|---------|-----|
| Each point belongs to ONE cluster | Each point has a probability for EACH cluster |
| Hard boundaries | Soft, overlapping boundaries |
| Spherical clusters | Elliptical clusters |
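
The contrast in the table can be sketched in plain Python. The mixture below is hypothetical (two 1D components with made-up weights, means, and standard deviations) and is not taken from the notebook's data. It only illustrates how a soft assignment differs from a nearest-centre hard assignment.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical two-component mixture (assumed values, for illustration only).
WEIGHTS = [0.5, 0.5]
MEANS = [0.0, 4.0]
STDS = [1.0, 1.0]

def soft_assignment(x):
    """GMM-style: probability that x belongs to each component."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(WEIGHTS, MEANS, STDS)]
    total = sum(joint)
    return [j / total for j in joint]

def hard_assignment(x):
    """K-Means-style: index of the nearest centre only."""
    distances = [abs(x - m) for m in MEANS]
    return distances.index(min(distances))

for x in (0.0, 1.9, 2.0):
    probs = soft_assignment(x)
    print(x, hard_assignment(x), [round(p, 3) for p in probs])
```

A point halfway between the two centres gets roughly equal probabilities, while the hard assignment still has to pick one side.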

### How to Detect Anomalies

After fitting, GMM gives each point a **probability** of belonging to each cluster. This notebook scores a point by the highest of those probabilities:

- High probability → Fits well → Normal
- Low probability → Doesn't fit → **Anomaly**
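
Another way to make "fits the data" concrete is the mixture density itself, p(x) = Σ w_k · N(x | μ_k, σ_k). The sketch below uses a hypothetical 1D mixture with assumed parameters and compares that density with the per-cluster membership probability scored in Step 4. It is an illustration, not the notebook's method.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical fitted mixture: (weight, mean, std) per component.
COMPONENTS = [(0.6, 0.0, 1.0), (0.4, 5.0, 1.5)]

def log_density(x):
    """Log of the mixture density at x; lower means a worse fit."""
    density = sum(w * normal_pdf(x, m, s) for w, m, s in COMPONENTS)
    return math.log(density)

def membership(x):
    """Per-component membership probabilities; they always sum to 1."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in COMPONENTS]
    total = sum(joint)
    return [j / total for j in joint]

typical, outlier = 0.2, 15.0
print(log_density(typical), max(membership(typical)))
print(log_density(outlier), max(membership(outlier)))
```

Membership probabilities are normalised, so a point far from every component can still have a high maximum membership while its log density is very low. The two scores answer different questions, which is worth keeping in mind when reading the results.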

### Pros & Cons

| Pros | Cons |
|------|------|
| Flexible cluster shapes | More complex than K-Means |
| Probability scores | Slower to train |
| Handles overlapping clusters | Sensitive to initialization |

---
## Step 1: Setup
---

In [None]:
# ═══════════════════════════════════════════════════════════════════
# IMPORTS
# ═══════════════════════════════════════════════════════════════════

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, udf, monotonically_increasing_id
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import GaussianMixture
import os

print("✓ Libraries imported")

In [None]:
# ═══════════════════════════════════════════════════════════════════
# CREATE SPARK SESSION
# ═══════════════════════════════════════════════════════════════════

spark = SparkSession.builder \
    .appName("GMM_AnomalyDetection") \
    .master("local[*]") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

print("✓ Spark Session created")

In [None]:
# ═══════════════════════════════════════════════════════════════════
# CONFIGURATION
# ═══════════════════════════════════════════════════════════════════

DATA_PATH = "../data/processed/BTCUSDT_1h_processed.csv"
OUTPUT_PATH = "../data/results/"

FEATURE_COLUMNS = [
    "return", "log_return", "volatility_24h",
    "volume_change", "volume_ratio", "price_range"
]

# GMM parameters
K_COMPONENTS = 3  # Number of Gaussian components

print("✓ Configuration set")
print(f"  Components: {K_COMPONENTS}")

---
## Step 2: Load and Prepare Data
---

In [None]:
# ═══════════════════════════════════════════════════════════════════
# LOAD DATA
# ═══════════════════════════════════════════════════════════════════

df = spark.read.csv(DATA_PATH, header=True, inferSchema=True)

print("✓ Data loaded")
print(f"  Rows: {df.count()}")

In [None]:
# ═══════════════════════════════════════════════════════════════════
# PREPARE FEATURES
# ═══════════════════════════════════════════════════════════════════

assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features_raw")
df_assembled = assembler.transform(df)

scaler = StandardScaler(inputCol="features_raw", outputCol="features", withMean=True, withStd=True)
scaler_model = scaler.fit(df_assembled)
df_scaled = scaler_model.transform(df_assembled)

print("✓ Features prepared")
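
What `StandardScaler` with `withMean=True` and `withStd=True` does to each feature can be sketched as a z-score: subtract the column mean, then divide by the sample standard deviation. The values below are hypothetical and only show why features on very different scales become comparable.

```python
import statistics

def standardize(values):
    """Z-score a column: subtract the mean, divide by the sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]

# Hypothetical raw values for two features on very different scales.
returns = [0.001, -0.002, 0.003, -0.001, 0.004]
volume_ratio = [0.8, 1.2, 3.5, 0.9, 1.1]

for name, column in (("return", returns), ("volume_ratio", volume_ratio)):
    scaled = standardize(column)
    print(name, [round(v, 2) for v in scaled])
```

After scaling, both columns have mean 0 and unit standard deviation, so no single feature dominates the Gaussian fit because of its units.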

---
## Step 3: Train GMM Model
---

In [None]:
# ═══════════════════════════════════════════════════════════════════
# TRAIN GMM MODEL
# ═══════════════════════════════════════════════════════════════════

gmm = GaussianMixture(
    k=K_COMPONENTS,
    featuresCol="features",
    predictionCol="cluster",
    probabilityCol="probability",
    seed=42
)

gmm_model = gmm.fit(df_scaled)

print("✓ GMM model trained")
print(f"  Components: {gmm_model.getK()}")

In [None]:
# ═══════════════════════════════════════════════════════════════════
# MAKE PREDICTIONS
# ═══════════════════════════════════════════════════════════════════

df_gmm = gmm_model.transform(df_scaled)

print("Cluster distribution:")
df_gmm.groupBy("cluster").count().orderBy("cluster").show()
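
The `fit` call in this step estimates the mixture with expectation maximization (EM). The sketch below runs EM on a hypothetical 1D dataset with two components to show the idea. Spark's implementation is multivariate with full covariance matrices, so treat this as a simplified illustration rather than Spark's actual code.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def em_step(data, weights, means, stds):
    """One EM iteration for a 1D Gaussian mixture."""
    k = len(weights)
    # E-step: responsibility of each component for each point.
    resp = []
    for x in data:
        joint = [weights[j] * normal_pdf(x, means[j], stds[j]) for j in range(k)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M-step: re-estimate parameters from responsibility-weighted data.
    new_w, new_m, new_s = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_s.append(max(math.sqrt(var), 1e-6))  # floor avoids a zero std
    return new_w, new_m, new_s

# Hypothetical data: two well-separated groups.
data = [-0.3, 0.1, 0.4, -0.2, 0.0, 4.8, 5.1, 5.3, 4.9]
params = ([0.5, 0.5], [1.0, 3.0], [1.0, 1.0])  # deliberately rough start
for _ in range(20):
    params = em_step(data, *params)

weights, means, stds = params
print([round(w, 3) for w in weights])
print([round(m, 3) for m in means])
```

Starting from rough guesses, the means move toward the centres of the two groups and the weights toward each group's share of the points. A different starting point can lead to a different solution, which is the sensitivity to initialization listed under the cons.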

---
## Step 4: Calculate Anomaly Score
---

In [None]:
# ═══════════════════════════════════════════════════════════════════
# EXTRACT MAXIMUM PROBABILITY
# ═══════════════════════════════════════════════════════════════════

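# The probability column holds one membership probability per cluster (they sum to 1)
# Max probability = how confidently the point is assigned to its most likely cluster
# Low max probability = point does not clearly belong to any one cluster → flagged as anomaly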

def get_max_prob(prob_vector):
    """Get maximum probability from probability vector"""
    return float(max(prob_vector))

max_prob_udf = udf(get_max_prob, DoubleType())

df_gmm = df_gmm.withColumn("max_probability", max_prob_udf(col("probability")))

print("Max probability distribution:")
df_gmm.select("max_probability").describe().show()
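
The score computed above can be illustrated on plain lists. The probability vectors below are hypothetical examples for K = 3 components, not values from the fitted model.

```python
def max_membership(probabilities):
    """Largest entry of a per-cluster probability vector."""
    return float(max(probabilities))

# Hypothetical probability vectors for K = 3 components.
confident = [0.97, 0.02, 0.01]
ambiguous = [0.40, 0.35, 0.25]
uniform = [1 / 3, 1 / 3, 1 / 3]

for name, probs in (("confident", confident), ("ambiguous", ambiguous), ("uniform", uniform)):
    print(name, round(max_membership(probs), 3))
```

Because the entries sum to 1, the maximum can never fall below 1/K. With three components the score therefore ranges from about 0.333 to 1, which helps when reading the `describe()` output.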

---
## Step 5: Flag Anomalies
---

In [None]:
# ═══════════════════════════════════════════════════════════════════
# DEFINE THRESHOLD: Bottom 5% of probabilities
# ═══════════════════════════════════════════════════════════════════

# Use approxQuantile to find the 5th percentile
threshold = df_gmm.approxQuantile("max_probability", [0.05], 0.01)[0]

print(f"5th percentile threshold: {threshold:.4f}")
print(f"\nPoints with max_probability < {threshold:.4f} are anomalies")
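
The thresholding idea can be sketched with an exact quantile on a small list. `approxQuantile` returns an approximation within the requested relative error, whereas the helper below (a hypothetical function, not a Spark API) uses the exact nearest-rank rule on made-up scores.

```python
import math

def quantile_nearest_rank(values, percent):
    """Exact percentile using the nearest-rank rule on a sorted copy."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical scores: 100 evenly spaced values, 0.01 through 1.00.
scores = [i / 100 for i in range(1, 101)]

threshold = quantile_nearest_rank(scores, 5)
flags = [1 if s < threshold else 0 for s in scores]

print(threshold)
print(sum(flags))
```

With a strict `<` comparison the point sitting exactly at the threshold is not flagged, so this example flags four of the 100 points rather than five. The flagged share in the notebook can likewise differ slightly from 5%.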

In [None]:
# ═══════════════════════════════════════════════════════════════════
# FLAG ANOMALIES
# ═══════════════════════════════════════════════════════════════════

df_gmm = df_gmm.withColumn(
    "is_anomaly",
    when(col("max_probability") < threshold, 1).otherwise(0)
)

# Count results
total = df_gmm.count()
anomalies = df_gmm.filter(col("is_anomaly") == 1).count()
normal = total - anomalies

print("\n" + "=" * 50)
print("GMM RESULTS")
print("=" * 50)
print(f"Total data points:  {total}")
print(f"Normal:             {normal} ({100*normal/total:.1f}%)")
print(f"Anomalies:          {anomalies} ({100*anomalies/total:.1f}%)")
print("=" * 50)

---
## Step 6: Examine Anomalies
---

In [None]:
# ═══════════════════════════════════════════════════════════════════
# VIEW DETECTED ANOMALIES
# ═══════════════════════════════════════════════════════════════════

print("Sample anomalies detected (sorted by probability):")
print("-" * 60)

df_gmm.filter(col("is_anomaly") == 1) \
    .select("timestamp", "close", "return", "volume_change", "cluster", "max_probability") \
    .orderBy(col("max_probability")) \
    .show(10, truncate=False)

---
## Step 7: Save Results
---

In [None]:
# ═══════════════════════════════════════════════════════════════════
# SAVE RESULTS
# ═══════════════════════════════════════════════════════════════════

os.makedirs(OUTPUT_PATH, exist_ok=True)

# row_id is unique and increasing, but not guaranteed to be consecutive
df_result = df_gmm.withColumn("row_id", monotonically_increasing_id())
df_result = df_result.select(
    "row_id", "timestamp", "close", "return", "volume_change",
    "cluster", "max_probability",
    col("is_anomaly").alias("anomaly_gmm")
)

# Spark writes a directory of this name containing a single part-*.csv file
output_file = os.path.join(OUTPUT_PATH, "gmm_results")
df_result.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_file)

print(f"✓ Results saved to: {output_file}")
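
The `row_id` values from `monotonically_increasing_id()` are unique but can have large gaps. The sketch below mimics the documented layout (partition index in the upper bits, record number within the partition in the lower 33 bits). The helper name and the partition sizes are hypothetical.

```python
def simulated_row_ids(partition_sizes):
    """Mimic the documented ID layout: partition index in the upper bits,
    record number within the partition in the lower 33 bits."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for record_number in range(size):
            ids.append((partition_index << 33) + record_number)
    return ids

# Hypothetical layout: two partitions holding three rows each.
ids = simulated_row_ids([3, 3])
print(ids)
```

IDs in the second partition start at 8589934592, so `row_id` should be treated as a unique key rather than a row counter when joining results in later notebooks.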

In [None]:
# ═══════════════════════════════════════════════════════════════════
# STOP SPARK
# ═══════════════════════════════════════════════════════════════════

spark.stop()
print("✓ Spark stopped")
print("\n" + "=" * 50)
print("GMM COMPLETE - Proceed to 07_comparison.ipynb")
print("=" * 50)