# 05 - Random Forest Anomaly Detection

## Method 3 of 4: Classification Approach

### What is Random Forest?

Random Forest is an **ensemble** of many decision trees.

**Decision Tree:** A series of yes/no questions leading to a prediction.

**Random Forest:** Build many trees, each trained on a slightly different sample of the data. The final prediction is the majority vote across all trees.
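
The voting idea can be sketched in plain Python. This is only an illustration: the three "trees" below are hypothetical hand-written rules with made-up thresholds, whereas a real forest learns its questions from the data.

```python
from collections import Counter

# Each "tree" is a hypothetical one-question rule, fixed by hand for illustration.
# The thresholds are invented and are not taken from the trained model.
def tree_a(row):
    return 1 if row["price_range"] > 0.05 else 0

def tree_b(row):
    return 1 if row["volume_ratio"] > 3.0 else 0

def tree_c(row):
    return 1 if abs(row["return"]) > 0.03 else 0

def forest_predict(row, trees):
    """Return the label that most trees vote for."""
    votes = [tree(row) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
quiet_hour = {"price_range": 0.01, "volume_ratio": 1.1, "return": 0.002}
spike_hour = {"price_range": 0.08, "volume_ratio": 4.5, "return": 0.01}

print(forest_predict(quiet_hour, trees))  # all three trees vote 0
print(forest_predict(spike_hour, trees))  # two of three trees vote 1
```

With an odd number of trees and two classes, a tie cannot occur, which keeps the vote unambiguous.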

### How to Use for Anomaly Detection

**Problem:** Random Forest is a supervised method, so it needs labeled data ("anomaly" or "normal").

**Solution:** Use the K-Means results as pseudo-labels.

1. Load the K-Means results (from the previous notebook)
2. Train a Random Forest on those labels
3. The model learns which feature patterns K-Means flagged as anomalies
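
This notebook does not show how the K-Means step decided which rows are anomalies. As a minimal sketch, assuming it flagged the points farthest from their cluster centroid, pseudo-labels could be produced like this. The function `pseudo_labels`, its `anomaly_fraction` parameter, and the distances are all hypothetical.

```python
def pseudo_labels(distances, anomaly_fraction):
    """Label the points farthest from their centroid as 1, the rest as 0."""
    n_anomalies = max(1, round(len(distances) * anomaly_fraction))
    # Indices sorted by distance, largest first
    ranked = sorted(range(len(distances)), key=lambda i: distances[i], reverse=True)
    flagged = set(ranked[:n_anomalies])
    return [1 if i in flagged else 0 for i in range(len(distances))]

# Hypothetical distances to the nearest centroid for ten data points
distances = [0.2, 0.3, 4.1, 0.1, 0.4, 0.25, 3.7, 0.35, 0.15, 0.3]
labels = pseudo_labels(distances, anomaly_fraction=0.2)
print(labels)
```

Whatever rule produced the labels, a classifier trained on them can at best reproduce that rule. The metrics later in this notebook therefore measure agreement with K-Means, not agreement with true anomalies.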

### Pros & Cons

| Pros | Cons |
|------|------|
| Handles complex patterns | Needs labeled data |
| Shows feature importance | Slower to train than a single tree |
| Resistant to overfitting | Less interpretable than a single tree |

---
## Step 1: Setup
---

In [1]:
# ═══════════════════════════════════════════════════════════════════
# IMPORTS
# ═══════════════════════════════════════════════════════════════════

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, monotonically_increasing_id
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
import os

print("✓ Libraries imported")

✓ Libraries imported


In [3]:
# ═══════════════════════════════════════════════════════════════════
# CREATE SPARK SESSION
# ═══════════════════════════════════════════════════════════════════

spark = SparkSession.builder \
    .appName("RandomForest_AnomalyDetection") \
    .master("local[*]") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

print("✓ Spark Session created")

✓ Spark Session created


In [4]:
# ═══════════════════════════════════════════════════════════════════
# CONFIGURATION
# ═══════════════════════════════════════════════════════════════════

DATA_PATH = "../data/processed/BTCUSDT_1h_processed.csv"
KMEANS_RESULTS = "../data/results/kmeans_results"
OUTPUT_PATH = "../data/results/"

FEATURE_COLUMNS = [
    "return", "log_return", "volatility_24h",
    "volume_change", "volume_ratio", "price_range"
]

# Random Forest parameters
NUM_TREES = 100
MAX_DEPTH = 5

print("✓ Configuration set")
print(f"  Trees: {NUM_TREES}")
print(f"  Max depth: {MAX_DEPTH}")

✓ Configuration set
  Trees: 100
  Max depth: 5


---
## Step 2: Load Data and K-Means Labels
---

In [5]:
# ═══════════════════════════════════════════════════════════════════
# LOAD ORIGINAL DATA
# ═══════════════════════════════════════════════════════════════════

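df = spark.read.csv(DATA_PATH, header=True, inferSchema=True)
# Note: monotonically_increasing_id() depends on row order and partitioning, so these IDs
# match the K-Means results only if that notebook read the same file the same way.
df = df.withColumn("row_id", monotonically_increasing_id())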

print("✓ Original data loaded")
print(f"  Rows: {df.count()}")

✓ Original data loaded
  Rows: 976


In [6]:
# ═══════════════════════════════════════════════════════════════════
# LOAD K-MEANS RESULTS (for labels)
# ═══════════════════════════════════════════════════════════════════

kmeans_df = spark.read.csv(KMEANS_RESULTS, header=True, inferSchema=True)

print("✓ K-Means results loaded")
print("\nLabel distribution:")
kmeans_df.groupBy("anomaly_kmeans").count().show()

✓ K-Means results loaded

Label distribution:
+--------------+-----+
|anomaly_kmeans|count|
+--------------+-----+
|             1|   30|
|             0|  946|
+--------------+-----+



In [7]:
# ═══════════════════════════════════════════════════════════════════
# JOIN DATA WITH LABELS
# ═══════════════════════════════════════════════════════════════════

# Get only the label column from K-Means results
labels = kmeans_df.select("row_id", col("anomaly_kmeans").alias("label"))

# Join with original data
df_labeled = df.join(labels, "row_id")

# Convert label to double (required by Spark ML)
df_labeled = df_labeled.withColumn("label", col("label").cast("double"))

print("✓ Data joined with labels")
print(f"  Rows: {df_labeled.count()}")

✓ Data joined with labels
  Rows: 976


---
## Step 3: Prepare Features
---

In [8]:
# ═══════════════════════════════════════════════════════════════════
# VECTORASSEMBLER + STANDARDSCALER
# ═══════════════════════════════════════════════════════════════════

# Combine features
assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features_raw")
df_assembled = assembler.transform(df_labeled)

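# Scale features (fitted on all rows before the train/test split,
# so test rows also influence the scaling statistics)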
scaler = StandardScaler(inputCol="features_raw", outputCol="features", withMean=True, withStd=True)
scaler_model = scaler.fit(df_assembled)
df_scaled = scaler_model.transform(df_assembled)

print("✓ Features prepared")

✓ Features prepared
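
With `withMean=True` and `withStd=True`, each feature is centered and divided by its standard deviation. A minimal pure-Python sketch of that transformation on one column, using hypothetical values and the sample standard deviation:

```python
from statistics import mean, stdev

def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - mu) / sigma for v in values]

volume_ratio = [0.8, 1.0, 1.2, 0.9, 4.1]  # hypothetical values
scaled = standardize(volume_ratio)
print([round(z, 3) for z in scaled])
```

Scaling changes the values but not their order. Tree splits depend only on the order of values within a feature, so this step should make little or no difference to the Random Forest itself. It mainly keeps the preprocessing consistent with distance-based methods such as K-Means.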


---
## Step 4: Train/Test Split
---

In [9]:
# ═══════════════════════════════════════════════════════════════════
# SPLIT DATA: 80% TRAIN, 20% TEST
# ═══════════════════════════════════════════════════════════════════

train_data, test_data = df_scaled.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} rows")
print(f"Test set:     {test_data.count()} rows")

Training set: 818 rows
Test set:     158 rows
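
With only 30 anomalies in 976 rows, a purely random split can leave very few anomalies in the test set. One alternative is a stratified split, which samples each class separately. The sketch below is plain Python on row indices, and the helper `stratified_split` is hypothetical. In PySpark, `DataFrame.sampleBy` may serve a similar purpose.

```python
import random

def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Same class counts as the K-Means label distribution: 946 normal, 30 anomalies
labels = [0] * 946 + [1] * 30
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

print(f"Training set: {len(train_idx)} rows")
print(f"Test set:     {len(test_idx)} rows")
print(f"Anomalies in test set: {sum(labels[i] for i in test_idx)}")
```

Each class contributes 20% of its rows to the test set, so the anomaly count there no longer depends on the luck of the draw.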


---
## Step 5: Train Random Forest
---

In [10]:
# ═══════════════════════════════════════════════════════════════════
# TRAIN RANDOM FOREST MODEL
# ═══════════════════════════════════════════════════════════════════

rf = RandomForestClassifier(
    numTrees=NUM_TREES,
    maxDepth=MAX_DEPTH,
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction",
    seed=42
)

rf_model = rf.fit(train_data)

print("✓ Random Forest model trained")
print(f"  Trees: {rf_model.getNumTrees}")

✓ Random Forest model trained
  Trees: 100


In [11]:
# ═══════════════════════════════════════════════════════════════════
# FEATURE IMPORTANCE
# ═══════════════════════════════════════════════════════════════════

print("Feature Importance:")
print("=" * 50)
print("(Which features matter most for detecting anomalies?)")
print()

importances = rf_model.featureImportances.toArray()
feature_importance = sorted(zip(FEATURE_COLUMNS, importances), key=lambda x: x[1], reverse=True)

for feature, importance in feature_importance:
    bar = "█" * int(importance * 40)
    print(f"{feature:<18} {importance:.4f} {bar}")

Feature Importance:
(Which features matter most for detecting anomalies?)

price_range        0.2797 ███████████
volume_change      0.2510 ██████████
volatility_24h     0.1932 ███████
volume_ratio       0.1439 █████
return             0.0990 ███
log_return         0.0332 █
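
The importances are relative shares that sum to 1, so they can be added across features. A quick check using the values printed above:

```python
# Importances copied from the output above
importances = {
    "price_range": 0.2797,
    "volume_change": 0.2510,
    "volatility_24h": 0.1932,
    "volume_ratio": 0.1439,
    "return": 0.0990,
    "log_return": 0.0332,
}

total = sum(importances.values())
top_two = importances["price_range"] + importances["volume_change"]
returns_combined = importances["return"] + importances["log_return"]

print(f"Sum of importances:  {total:.4f}")
print(f"Top two features:    {top_two:.4f}")
print(f"return + log_return: {returns_combined:.4f}")
```

`return` and `log_return` are nearly identical for small hourly moves, and correlated features tend to split importance between them, so their combined share may be the more meaningful figure. These importances describe what predicts the K-Means labels, not necessarily what drives real anomalies.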


---
## Step 6: Evaluate Model
---

In [12]:
# ═══════════════════════════════════════════════════════════════════
# MAKE PREDICTIONS ON TEST SET
# ═══════════════════════════════════════════════════════════════════

predictions = rf_model.transform(test_data)

print("Predictions on test set:")
predictions.groupBy("prediction").count().show()

Predictions on test set:
+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|  154|
|       1.0|    4|
+----------+-----+



In [13]:
# ═══════════════════════════════════════════════════════════════════
# CALCULATE METRICS
# ═══════════════════════════════════════════════════════════════════

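# Note: "weightedPrecision" and "weightedRecall" average the per-class values
# weighted by class size, so the large "normal" class dominates these numbers.
evaluator_acc = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")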
evaluator_prec = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
evaluator_rec = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")
evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")

accuracy = evaluator_acc.evaluate(predictions)
precision = evaluator_prec.evaluate(predictions)
recall = evaluator_rec.evaluate(predictions)
f1 = evaluator_f1.evaluate(predictions)

print("\n" + "=" * 50)
print("MODEL PERFORMANCE")
print("=" * 50)
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.1f}%)")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print("=" * 50)


MODEL PERFORMANCE
Accuracy:  0.9620 (96.2%)
Precision: 0.9635
Recall:    0.9620
F1 Score:  0.9543
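
Because the metrics above are weighted by class size, they say little about the anomaly class on its own. The notebook does not print a confusion matrix, so the one below is inferred. It is an assumption, chosen because it reproduces the reported figures: 158 test rows, 4 predicted anomalies, and the four scores above.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Inferred confusion matrix for the test set (an assumption, not notebook output)
tn, fp, fn, tp = 148, 0, 6, 4
n = tn + fp + fn + tp

# Anomaly class (label 1)
p1, r1, f1_1 = per_class_metrics(tp, fp, fn)
# Normal class (label 0): the roles swap, so its "true positives" are tn
p0, r0, f1_0 = per_class_metrics(tn, fn, fp)

# Class weights = share of test rows that truly belong to each class
w0 = (tn + fp) / n
w1 = (tp + fn) / n

accuracy = (tp + tn) / n
weighted_precision = w0 * p0 + w1 * p1
weighted_recall = w0 * r0 + w1 * r1
weighted_f1 = w0 * f1_0 + w1 * f1_1

print(f"Accuracy:           {accuracy:.4f}")
print(f"Weighted precision: {weighted_precision:.4f}")
print(f"Weighted recall:    {weighted_recall:.4f}")
print(f"Weighted F1:        {weighted_f1:.4f}")
print(f"Anomaly precision:  {p1:.4f}")
print(f"Anomaly recall:     {r1:.4f}")
```

Under this reading, every predicted anomaly agrees with K-Means, but the model recovers only 4 of the 10 K-Means anomalies in the test set. The high weighted scores mostly reflect how well the majority "normal" class is classified.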


---
## Step 7: Apply to Full Dataset & Save
---

In [14]:
# ═══════════════════════════════════════════════════════════════════
# APPLY MODEL TO FULL DATASET
# ═══════════════════════════════════════════════════════════════════

# Note: df_scaled includes the training rows, so these counts are not a held-out evaluation
df_full_predictions = rf_model.transform(df_scaled)

# Count results
total = df_full_predictions.count()
anomalies = df_full_predictions.filter(col("prediction") == 1).count()
normal = total - anomalies

print("\n" + "=" * 50)
print("RANDOM FOREST RESULTS (Full Dataset)")
print("=" * 50)
print(f"Total data points:  {total}")
print(f"Normal:             {normal} ({100*normal/total:.1f}%)")
print(f"Anomalies:          {anomalies} ({100*anomalies/total:.1f}%)")
print("=" * 50)


RANDOM FOREST RESULTS (Full Dataset)
Total data points:  976
Normal:             956 (98.0%)
Anomalies:          20 (2.0%)


In [15]:
# ═══════════════════════════════════════════════════════════════════
# SAVE RESULTS
# ═══════════════════════════════════════════════════════════════════

os.makedirs(OUTPUT_PATH, exist_ok=True)

df_result = df_full_predictions.select(
    "row_id", "timestamp", "close", "return", "volume_change",
    col("prediction").cast("integer").alias("anomaly_rf")
)

output_file = os.path.join(OUTPUT_PATH, "rf_results")
df_result.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_file)

print(f"✓ Results saved to: {output_file}")

✓ Results saved to: ../data/results/rf_results


In [16]:
# ═══════════════════════════════════════════════════════════════════
# STOP SPARK
# ═══════════════════════════════════════════════════════════════════

spark.stop()
print("✓ Spark stopped")
print("\n" + "=" * 50)
print("RANDOM FOREST COMPLETE - Proceed to 06_gmm.ipynb")
print("=" * 50)

✓ Spark stopped

RANDOM FOREST COMPLETE - Proceed to 06_gmm.ipynb
