# 03 - Isolation Forest Anomaly Detection

## What We're Doing

We have cryptocurrency data with 6 engineered features. Now we want to find **anomalies** (unusual hours).

**Algorithm:** Isolation Forest

**Approach:** 
- PySpark → Load and prepare data (Big Data tool)
- sklearn → Run the algorithm (ML tool)

**Why this combo?** Spark MLlib doesn't have Isolation Forest built-in, so we prepare the data in Spark and hand it to sklearn for the model. This hybrid approach is common in industry.

---
## Cell 1: Import Libraries & Create Spark Session

### What is an import?

Python doesn't have everything built-in. We need to load external tools.

**Java equivalent:**
```java
import java.util.ArrayList;
```

**Python:**
```python
import pandas
```
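Python has three common import forms, and this notebook uses all of them. A minimal sketch using the standard-library `math` module (picked here only because it needs no installation; it is not part of the notebook itself):

```python
# Form 1: import the whole module and use its full name
import math
root_a = math.sqrt(16)

# Form 2: import the module under a short alias (like `import numpy as np`)
import math as m
root_b = m.sqrt(16)

# Form 3: import one name directly (like `from pyspark.sql import SparkSession`)
from math import sqrt
root_c = sqrt(16)

print(root_a, root_b, root_c)
```

All three calls reach the same function; the forms differ only in how much you type at the call site.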

### What is SparkSession?

Think of it like a database connection in Java:
```java
Connection conn = DriverManager.getConnection(...);
```

SparkSession is your "connection" to Spark. Without it, you can't use Spark.

In [None]:
# ============================================================
# IMPORTS - Loading the tools we need
# ============================================================

# PySpark imports
# ---------------
# SparkSession: The entry point to Spark (like a database connection)
from pyspark.sql import SparkSession

# functions: Built-in functions for data manipulation (like SQL functions)
# Example: col("name") refers to a column called "name"
from pyspark.sql.functions import col

# VectorAssembler: Combines multiple columns into one vector column
# WHY? ML algorithms in Spark expect features as a single vector, not separate columns
# Example: [return, volatility, volume] → [0.5, 1.2, 0.8] (one vector)
from pyspark.ml.feature import VectorAssembler

# StandardScaler: Scales features to have mean=0 and std=1
# WHY? Features have different ranges. Without scaling, big numbers dominate.
from pyspark.ml.feature import StandardScaler

# sklearn imports
# ---------------
# IsolationForest: The anomaly detection algorithm
from sklearn.ensemble import IsolationForest

# numpy: For array operations (sklearn needs numpy arrays)
import numpy as np

# pandas: We'll convert Spark DataFrame to pandas at the end
# WHY? Easier to work with for small results
import pandas as pd

print("All libraries imported successfully!")

In [None]:
# ============================================================
# CREATE SPARK SESSION
# ============================================================

# SparkSession.builder = Start building a session
# .appName("...")     = Give it a name (shows up in Spark UI)
# .master("local[*]") = Run on local machine, use all CPU cores
#                       In a cluster, this would be the cluster address
# .getOrCreate()      = Create new session, or reuse if one exists

spark = SparkSession.builder \
    .appName("IsolationForest_AnomalyDetection") \
    .master("local[*]") \
    .getOrCreate()

# Reduce log verbosity (Spark logs A LOT by default)
spark.sparkContext.setLogLevel("WARN")

print("Spark Session created!")
print(f"Spark version: {spark.version}")

---
## Cell 2: Load the Processed Data

### What are we loading?

The CSV files you created in Session 2:
- `BTCUSDT_1h_processed.csv` (976 rows)
- `ETHUSDT_1h_processed.csv` (976 rows)

### How Spark reads CSV vs pandas

**pandas:**
```python
df = pd.read_csv("file.csv")
```

**PySpark:**
```python
df = spark.read.csv("file.csv", header=True, inferSchema=True)
```

Extra options in Spark:
- `header=True` → First row contains column names
- `inferSchema=True` → Automatically detect data types (int, float, string)
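
To make the two options concrete, here is a rough pure-Python sketch of the idea behind them. It is an illustration only, not Spark's implementation: the sample rows and the `infer_type` helper are made up for this example, and Spark's real inference may recognise more types (for example timestamps).

```python
import csv
import io

# Hypothetical two-row sample; column names mirror the processed files
sample_csv = (
    "timestamp,close,volume\n"
    "2024-01-01 00:00:00,42000.5,120\n"
    "2024-01-01 01:00:00,42100.0,95\n"
)

def infer_type(values):
    """Return the narrowest type (int, float, str) that parses every value."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster
        except ValueError:
            continue
    return str

rows = list(csv.reader(io.StringIO(sample_csv)))
header, data = rows[0], rows[1:]      # header=True: first row holds the names
columns = list(zip(*data))            # transpose rows into columns
schema = {name: infer_type(values).__name__
          for name, values in zip(header, columns)}
print(schema)
```

Without `inferSchema=True`, Spark would read every column as a string, and numeric operations on the features would need explicit casts.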

In [None]:
# ============================================================
# LOAD DATA FROM CSV
# ============================================================

# Path to our processed data files
# Note: "../" means "go up one folder" (from notebooks/ to project root)
btc_path = "../data/processed/BTCUSDT_1h_processed.csv"
eth_path = "../data/processed/ETHUSDT_1h_processed.csv"

# Read BTC data
# spark.read.csv() returns a Spark DataFrame (similar to a database table)
btc_df = spark.read.csv(
    btc_path,
    header=True,       # First row is column names
    inferSchema=True   # Auto-detect types (float, int, string)
)

# Read ETH data
eth_df = spark.read.csv(
    eth_path,
    header=True,
    inferSchema=True
)

print("Data loaded!")
print(f"BTC rows: {btc_df.count()}")
print(f"ETH rows: {eth_df.count()}")

In [None]:
# ============================================================
# INSPECT THE DATA
# ============================================================

# .printSchema() shows column names and data types
# Like DESCRIBE in SQL
print("=== BTC Data Schema ===")
btc_df.printSchema()

In [None]:
# .show(5) displays first 5 rows
# Like SELECT * FROM table LIMIT 5
print("=== First 5 rows of BTC data ===")
btc_df.show(5)

---
## Cell 3: Select Features for Anomaly Detection

### Which columns does Isolation Forest need?

Our data has 12 columns:
- **Original data:** timestamp, open, high, low, close, volume
- **Engineered features:** return, log_return, volatility_24h, volume_change, volume_ratio, price_range

**We only use the 6 engineered features** for anomaly detection.

### Why not use all columns?

- `timestamp` = Not a feature, just an identifier
- `open, high, low, close` = Raw prices (the engineered features are better)
- `volume` = Already captured in volume_change and volume_ratio

The 6 engineered features capture the **behavior**, not the raw values.

In [None]:
# ============================================================
# DEFINE FEATURE COLUMNS
# ============================================================

# These are the 6 features we engineered in Session 2
# We'll use ONLY these for anomaly detection

feature_columns = [
    "return",          # Price change %
    "log_return",      # Log return (adds up across consecutive hours, unlike simple return)
    "volatility_24h",  # How "crazy" the market was in last 24h
    "volume_change",   # Volume change %
    "volume_ratio",    # Current volume vs 24h average
    "price_range"      # (high - low) / close (intraday volatility)
]

print(f"Using {len(feature_columns)} features:")
for i, col_name in enumerate(feature_columns, 1):
    print(f"  {i}. {col_name}")

---
## Cell 4: Prepare Data for Machine Learning

### The Problem

Spark ML algorithms expect features in a **single vector column**, not separate columns.

**What we have:**
```
| return | log_return | volatility_24h | ... |
|--------|------------|----------------|-----|
| 0.5    | 0.49       | 1.2            | ... |
```

**What ML needs:**
```
| features                    |
|-----------------------------|
| [0.5, 0.49, 1.2, ...]       |
```

### The Solution: VectorAssembler

VectorAssembler takes multiple columns and combines them into one vector column.
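
Conceptually this is the same as stacking columns side by side into a 2D array. A NumPy sketch of the idea with made-up values (an illustration of the concept, not what Spark runs internally):

```python
import numpy as np

# Hypothetical values for three of the feature columns, three hours each
return_col = np.array([0.5, -0.2, 1.1])
log_return_col = np.array([0.49, -0.2, 1.09])
volatility_col = np.array([1.2, 1.3, 1.1])

# Stack the separate columns side by side: one row vector per hour
features = np.column_stack([return_col, log_return_col, volatility_col])

print(features.shape)   # rows = hours, columns = features
print(features[0])      # the first hour's feature vector
```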

### Also: Scaling

Remember, our features have different ranges:
- `return`: typically -5 to +5
- `volume_change`: can be -100 to +500

**StandardScaler** makes them all comparable by:
- Subtracting the mean (centering at 0)
- Dividing by standard deviation (same spread)
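
The two steps can be sketched in NumPy as follows. The values are made up, and `ddof=1` (sample standard deviation, dividing by n - 1) is used on the assumption that it matches what Spark's StandardScaler computes; sklearn's own scaler divides by n instead, so the two can differ slightly on small data.

```python
import numpy as np

# Hypothetical raw features: column 0 like `return`, column 1 like `volume_change`
X_raw = np.array([
    [0.5, 120.0],
    [-1.0, -40.0],
    [2.0, 300.0],
    [-0.5, 10.0],
])

mean = X_raw.mean(axis=0)            # per-column mean
std = X_raw.std(axis=0, ddof=1)      # per-column sample standard deviation

X_scaled = (X_raw - mean) / std      # center at 0, then scale to unit spread

print(X_scaled.mean(axis=0))         # approximately 0 for every column
print(X_scaled.std(axis=0, ddof=1))  # approximately 1 for every column
```

After scaling, both columns live on the same scale, so `volume_change` no longer dominates just because its raw numbers are larger.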

In [None]:
# ============================================================
# STEP 4A: COMBINE FEATURES INTO A VECTOR (VectorAssembler)
# ============================================================

# VectorAssembler configuration:
# - inputCols: List of columns to combine
# - outputCol: Name of the new vector column

assembler = VectorAssembler(
    inputCols=feature_columns,    # Our 6 feature columns
    outputCol="features_raw"      # Output column name
)

# Apply to BTC data
# .transform() applies the assembler and returns a NEW DataFrame with the extra column
btc_assembled = assembler.transform(btc_df)

# Let's see what it looks like
print("After VectorAssembler:")
print("(showing only timestamp and the new features_raw column)")
btc_assembled.select("timestamp", "features_raw").show(3, truncate=False)

In [None]:
# ============================================================
# STEP 4B: SCALE THE FEATURES (StandardScaler)
# ============================================================

# StandardScaler configuration:
# - inputCol: The vector column to scale
# - outputCol: Name of the scaled vector column
# - withMean=True: Subtract mean (center at 0)
# - withStd=True: Divide by standard deviation

scaler = StandardScaler(
    inputCol="features_raw",
    outputCol="features_scaled",
    withMean=True,   # Center the data at 0
    withStd=True     # Scale to unit variance
)

# Two-step process:
# 1. fit() - Calculate mean and std from the data
# 2. transform() - Apply the scaling

scaler_model = scaler.fit(btc_assembled)           # Learn the scaling parameters
btc_scaled = scaler_model.transform(btc_assembled)  # Apply scaling

print("After StandardScaler:")
print("(Notice the values are now centered around 0)")
btc_scaled.select("timestamp", "features_scaled").show(3, truncate=False)

---
## Cell 5: Run Isolation Forest

### The Handoff: Spark → sklearn

Now we need to:
1. Convert Spark DataFrame → pandas DataFrame → numpy array
2. Run sklearn's IsolationForest
3. Get results back

### Why this conversion?

- sklearn works with **numpy arrays** (in-memory arrays held by a single Python process)
- Spark DataFrames are **partitioned**, and on a cluster the partitions live on multiple machines
- We need to **collect** the data to one place for sklearn

### Is this okay for Big Data?

For 976 rows, absolutely fine. For millions of rows, we'd need a different approach.

In your paper, you can mention: "For larger datasets, we would use distributed implementations or sampling."
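
One possible shape of the sampling idea, as a hedged sketch: fit on a random sample, then label the full data chunk by chunk. The data here is synthetic, and the sample size and `chunk_size` are arbitrary choices for illustration. In practice each chunk would be pulled from Spark rather than sliced from an in-memory array.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Stand-in for a large scaled feature matrix (synthetic, NOT the crypto data)
X_large = rng.normal(size=(10_000, 6))

# Fit on a random sample that fits comfortably in memory
sample_idx = rng.choice(len(X_large), size=2_000, replace=False)
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X_large[sample_idx])

# Label the full data in chunks so only one chunk is processed at a time
chunk_size = 2_500
labels = np.concatenate([
    model.predict(X_large[start:start + chunk_size])
    for start in range(0, len(X_large), chunk_size)
])

print(labels.shape)
print(f"Flagged share: {np.mean(labels == -1):.3f}")
```

Because the cutoff is learned from the sample, the flagged share on the full data will be close to, but not exactly, the `contamination` value.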

In [None]:
# ============================================================
# STEP 5A: CONVERT SPARK → NUMPY
# ============================================================

# Extract the scaled features as a numpy array
# 
# Step by step:
# 1. .select("features_scaled") - Get only the features column
# 2. .collect() - Bring all data from Spark to local memory
# 3. List comprehension - Extract the vector from each row
# 4. np.array() - Convert to numpy array

# Get the scaled features column
features_rows = btc_scaled.select("features_scaled").collect()

# Convert to numpy array
# row[0] gets the vector from each row
# .toArray() converts Spark vector to numpy array
X = np.array([row[0].toArray() for row in features_rows])

print(f"Data shape: {X.shape}")
print(f"This means: {X.shape[0]} rows (hours) x {X.shape[1]} features")
print(f"\nFirst row (first hour's features):")
print(X[0])

In [None]:
# ============================================================
# STEP 5B: RUN ISOLATION FOREST
# ============================================================

# IsolationForest parameters:
#
# n_estimators=100
#   Number of trees in the forest
#   More trees = more stable results, but slower
#   100 is a good default
#
# contamination=0.05
#   Expected proportion of anomalies in the data
#   0.05 = we expect about 5% of data points to be anomalies
#   This affects the threshold for labeling something as anomaly
#
# random_state=42
#   Seed for random number generator
#   Makes results reproducible (same result every time you run)
#   42 is a common choice (Hitchhiker's Guide reference)

iso_forest = IsolationForest(
    n_estimators=100,      # Number of trees
    contamination=0.05,    # Expected 5% anomalies
    random_state=42        # For reproducibility
)

# Fit the model and predict
# fit_predict() does two things:
# 1. fit() - Learn what "normal" looks like
# 2. predict() - Label each point as normal (1) or anomaly (-1)

predictions = iso_forest.fit_predict(X)

print("Isolation Forest complete!")
print(f"\nPrediction values: {np.unique(predictions)}")
print("  1 = Normal")
print(" -1 = Anomaly")

In [None]:
# ============================================================
# STEP 5C: GET ANOMALY SCORES
# ============================================================

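# Besides labels (1 or -1), we can get continuous scores
# score_samples() returns the anomaly score for each point
#
# Score interpretation (sklearn convention):
# - Scores are negative, roughly between -1 and 0
# - More negative (closer to -1) = More anomalous
# - Closer to 0 = More normal
# - The cutoff is stored in iso_forest.offset_
#   With contamination=0.05 it is the 5th percentile of the training scores
#   (only contamination="auto" fixes it at -0.5)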

anomaly_scores = iso_forest.score_samples(X)

print(f"Score range: {anomaly_scores.min():.3f} to {anomaly_scores.max():.3f}")
print(f"\nMost anomalous score: {anomaly_scores.min():.3f}")
print(f"Most normal score: {anomaly_scores.max():.3f}")

---
## Cell 6: Analyze Results

Let's see:
1. How many anomalies were found?
2. Which hours are anomalies?
3. What makes them anomalous?

In [None]:
# ============================================================
# STEP 6A: COUNT ANOMALIES
# ============================================================

# Count how many of each label
n_normal = np.sum(predictions == 1)
n_anomaly = np.sum(predictions == -1)
total = len(predictions)

print("=== RESULTS SUMMARY ===")
print(f"Total data points: {total}")
print(f"Normal points:     {n_normal} ({100*n_normal/total:.1f}%)")
print(f"Anomalies:         {n_anomaly} ({100*n_anomaly/total:.1f}%)")

In [None]:
# ============================================================
# STEP 6B: CREATE RESULTS DATAFRAME
# ============================================================

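# Convert Spark DataFrame to pandas for easier analysis
# .toPandas() collects all data and converts to pandas DataFrame
# NOTE: The scores are attached by position, so this assumes the rows come
# back in the same order as the .collect() in Step 5A. Spark does not
# guarantee row order in general; for one small CSV in local mode it should hold.

btc_pandas = btc_df.toPandas()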

# Add our results as new columns
btc_pandas["anomaly_score"] = anomaly_scores
btc_pandas["anomaly_label"] = predictions

# Create a human-readable label
# Java's ternary (condition ? true_value : false_value) is written in Python as:
#            true_value if condition else false_value
btc_pandas["is_anomaly"] = btc_pandas["anomaly_label"].apply(
    lambda x: "ANOMALY" if x == -1 else "Normal"
)

print("Results DataFrame created!")
print(f"Columns: {list(btc_pandas.columns)}")

In [None]:
# ============================================================
# STEP 6C: LOOK AT THE ANOMALIES
# ============================================================

# Filter to only anomalies
anomalies = btc_pandas[btc_pandas["anomaly_label"] == -1]

# Sort by anomaly score (most anomalous first)
anomalies_sorted = anomalies.sort_values("anomaly_score")

print(f"=== TOP 10 MOST ANOMALOUS HOURS ===")
print()

# Show the most important columns for the top 10 anomalies
columns_to_show = ["timestamp", "close", "return", "volume_change", "volatility_24h", "anomaly_score"]
print(anomalies_sorted[columns_to_show].head(10).to_string(index=False))

In [None]:
# ============================================================
# STEP 6D: UNDERSTAND WHY THEY'RE ANOMALIES
# ============================================================

# Compare anomalies vs normal data
# This shows us what makes anomalies different

print("=== COMPARISON: Anomalies vs Normal ===")
print()

normal_data = btc_pandas[btc_pandas["anomaly_label"] == 1]
anomaly_data = btc_pandas[btc_pandas["anomaly_label"] == -1]

# Use col_name (not col) so the loop does not overwrite PySpark's col() import
for col_name in feature_columns:
    normal_mean = normal_data[col_name].mean()
    anomaly_mean = anomaly_data[col_name].mean()
    print(f"{col_name}:")
    print(f"  Normal avg:  {normal_mean:>10.3f}")
    print(f"  Anomaly avg: {anomaly_mean:>10.3f}")
    print()

---
## Cell 7: Save Results

We'll save the results to a CSV file so we can:
1. Use it later for comparison with other algorithms
2. Include it in your paper
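
For the later comparison, one simple measure is how much two algorithms' anomaly sets overlap. A small sketch with made-up labels (the arrays are hypothetical; in practice they would come from the saved `anomaly_label` columns of two results files):

```python
import numpy as np

# Hypothetical labels from two detectors for the same 8 hours (-1 = anomaly)
labels_a = np.array([1, -1, 1, 1, -1, 1, -1, 1])
labels_b = np.array([1, -1, 1, -1, -1, 1, 1, 1])

flagged_a = labels_a == -1
flagged_b = labels_b == -1

both = int(np.sum(flagged_a & flagged_b))     # hours flagged by both
either = int(np.sum(flagged_a | flagged_b))   # hours flagged by at least one
jaccard = both / either                       # 1.0 = identical sets, 0.0 = disjoint

print(f"Flagged by both: {both}, by either: {either}, Jaccard: {jaccard:.2f}")
```

Saving the labels alongside the timestamp is what makes this kind of row-by-row comparison possible.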

In [None]:
# ============================================================
# STEP 7: SAVE RESULTS TO CSV
# ============================================================

# Create results directory if it doesn't exist
import os
os.makedirs("../data/results", exist_ok=True)

# Save the full results
output_path = "../data/results/BTCUSDT_isolation_forest_results.csv"
btc_pandas.to_csv(output_path, index=False)

print(f"Results saved to: {output_path}")
print(f"Total rows: {len(btc_pandas)}")
print(f"Columns: {len(btc_pandas.columns)}")

In [None]:
# ============================================================
# STEP 8: STOP SPARK SESSION
# ============================================================

# Always stop Spark when done to free up resources
# Like closing a database connection

spark.stop()
print("Spark session stopped.")
print("\n" + "="*50)
print("ISOLATION FOREST COMPLETE!")
print("="*50)

---
## Summary

### What We Did

1. **Loaded data** using PySpark (Big Data tool)
2. **Prepared features** using VectorAssembler and StandardScaler
3. **Ran Isolation Forest** using sklearn
4. **Found anomalies** - unusual hours in the BTC data
5. **Saved results** for later use

### Key Findings

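- Isolation Forest flagged approximately 5% of hours as anomalies. This share follows from `contamination=0.05`, so it is a setting we chose, not a discovery about the data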
- These anomalies typically have extreme values in return, volume_change, or volatility
- The algorithm considers ALL features together (multivariate)

### Next Steps

- Run the same process for ETH data
- Implement One-Class SVM (next algorithm)
- Implement LOF (Local Outlier Factor)
- Compare all 4 algorithms