# Project 026: BGP Anomaly Detection (Route Leaks, Hijacks)

## Objective
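Build an anomaly detection model that identifies anomalous BGP update messages, such as those indicating a route leak or prefix hijack, by analyzing features of the BGP AS-path. The model itself is unsupervised; the dataset's labels are used only to select a normal training baseline and to evaluate the results.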

## Dataset
We'll use the **BGP Hijacking Detection Dataset** from Kaggle, which contains features extracted from real BGP update messages.

## Model
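**Isolation Forest** - A tree ensemble that isolates points with random splits. Points that can be isolated in few splits receive low scores and are treated as anomalous, which suits rare, unusual events in BGP routing data.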
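
To make the scoring idea concrete, here is a minimal sketch on synthetic data (not the Kaggle dataset). The cluster, the far-away point, and all variable names are assumptions made for illustration; only `IsolationForest` and `score_samples` come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for "normal" feature vectors: a tight 2-D cluster
normal_points = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
# One point far from the cluster, playing the role of an anomaly
far_point = np.array([[8.0, 8.0]])

forest = IsolationForest(n_estimators=100, random_state=0).fit(normal_points)

# score_samples: lower values mean the point is easier to isolate
normal_scores = forest.score_samples(normal_points)
outlier_score = forest.score_samples(far_point)[0]

print(f"median score of cluster points: {np.median(normal_scores):.3f}")
print(f"score of the far point:         {outlier_score:.3f}")
```

The far point should score below the typical cluster point, which is the property the rest of the notebook relies on.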

In [None]:
# ==================================================================================
#  Project 26: BGP Anomaly Detection
# ==================================================================================
#
# Objective:
# This notebook builds an unsupervised model to detect BGP anomalies by analyzing
# AS-path features, using a real-world BGP update dataset.
#
# To Run in Google Colab:
# 1. Have your `kaggle.json` API token ready.
# 2. Copy and paste this entire code block into a single cell.
# 3. Run the cell. You may be prompted to upload `kaggle.json`.

# ----------------------------------------
# 1. Setup Kaggle API and Download Data
# ----------------------------------------
import os

if not os.path.exists(os.path.expanduser('~/.kaggle/kaggle.json')):
    print("--- Setting up Kaggle API ---")
    !pip install -q kaggle
    from google.colab import files
    print("\nPlease upload your kaggle.json file:")
    uploaded = files.upload()
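    if 'kaggle.json' not in uploaded:
        # Raise instead of calling exit(), which can shut down the notebook kernel
        # rather than just stopping the cell.
        raise FileNotFoundError("kaggle.json was not uploaded.")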
    !mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
else:
    print("Kaggle API already configured.")

print("\n--- Downloading BGP Hijacking Detection Dataset from Kaggle ---")
!kaggle datasets download -d dprembath/bgp-hijacking-detection-dataset

print("\n--- Unzipping the dataset ---")
# -o overwrites existing files so that re-running the cell does not wait for a prompt
!unzip -q -o bgp-hijacking-detection-dataset.zip -d bgp_data
print("Dataset setup complete.")

In [None]:
# ----------------------------------------
# 2. Load and Prepare the Data
# ----------------------------------------
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

print("\n--- Loading and Preprocessing Data ---")

try:
    df = pd.read_csv('bgp_data/bgp_data.csv')
    print("Successfully loaded bgp_data.csv.")
except FileNotFoundError as e:
    # Stop with a clear error; exit() can shut down the notebook kernel.
    raise FileNotFoundError(f"Could not find dataset file: {e}") from e

# Drop the 'Timestamp' column as we're focusing on path features
df = df.drop(columns=['Timestamp'])

# Encode the target label for later evaluation: anomaly -> -1, normal -> 1
df['Label'] = df['Label'].apply(lambda x: -1 if x == 'anomaly' else 1)
print(f"Dataset loaded. Shape: {df.shape}")

print("\nClass Distribution:")
print(df['Label'].value_counts())

# Display first few rows
print("\nDataset preview:")
df.head()
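
The lambda used for label encoding maps every value other than the exact string `'anomaly'` to normal, including misspellings or missing values. The sketch below shows one way to make unexpected labels visible instead. The raw label strings here are hypothetical; the actual values in the dataset's `Label` column are not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw labels, including a differently cased value and an unexpected one
raw_labels = pd.Series(['normal', 'anomaly', 'Anomaly ', 'unknown'])

label_map = {'normal': 1, 'anomaly': -1}

# Normalize whitespace and case, then map; anything not in label_map becomes NaN
encoded = raw_labels.str.strip().str.lower().map(label_map)

# Collect the raw values that could not be mapped so they can be inspected
unmapped = raw_labels[encoded.isna()]

print(encoded.tolist())
print("Unmapped labels:", unmapped.tolist())
```

Rows with unmapped labels can then be dropped or corrected explicitly rather than being counted as normal.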

In [None]:
# ----------------------------------------
# 3. Feature Selection and Data Preparation
# ----------------------------------------
print("\n--- Preparing Data for Unsupervised Learning ---")

# These features describe the BGP AS-path behavior
feature_cols = [
    'AS_PATH_LEN', 'AS_PATH_AVG_LEN', 'AS_PATH_MAX_LEN', 'AS_PATH_MIN_LEN',
    'EDIT_DIST_AS_PATH', 'EDIT_DIST_PREFIX', 'PREFIX_LEN',
    'UNIQUE_AS_COUNT', 'RARE_AS_COUNT', 'STDEV_AS_PATH_LEN'
]

X = df[feature_cols]
y_true = df['Label']

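# --- CRITICAL STEP: fit on a normal-only baseline ---
# We train the model ONLY on rows labeled 'normal'. Because the labels are used to
# pick this baseline, this is novelty detection rather than purely unsupervised learning.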
X_train_normal = X[y_true == 1]
print(f"Training the model on {len(X_train_normal)} normal BGP updates.")

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_normal)

print("\nFeature statistics for normal BGP updates:")
X_train_normal.describe()
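
The notebook later evaluates on the same normal rows it trains on. A stricter setup holds out part of the data before fitting. The following is a sketch under assumptions: the data is synthetic, the 70/30 split is arbitrary, and the `_demo` names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 200 rows, 3 features, labels 1 (normal) / -1 (anomaly)
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(rng.random(200) < 0.15, -1, 1)

# Hold out 30% of all rows; stratify so both classes appear in each split
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Fit the scaler on the normal rows of the fitting split only
scaler_demo = StandardScaler().fit(X_fit[y_fit == 1])
X_fit_normal_scaled = scaler_demo.transform(X_fit[y_fit == 1])

# The evaluation split (normal and anomalous rows) is only transformed, never fitted on
X_eval_scaled = scaler_demo.transform(X_eval)

print("rows used for fitting:", X_fit_normal_scaled.shape[0])
print("rows held out for evaluation:", X_eval_scaled.shape[0])
```

Metrics computed on `X_eval_scaled` would then reflect rows the model has not seen.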

In [None]:
# ----------------------------------------
# 4. Model Training (Unsupervised)
# ----------------------------------------
print("\n--- Model Training ---")
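# `contamination` sets the decision threshold: it is the fraction of the TRAINING rows
# that will fall below the anomaly cutoff. The value 0.15 matches the dataset's
# approximate anomaly rate, but because we fit on normal-only data it also means that
# roughly 15% of the normal training rows are flagged. Lower it to reduce false alarms.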
model = IsolationForest(n_estimators=100, contamination=0.15, random_state=42, n_jobs=-1)

print("Training the Isolation Forest model...")
model.fit(X_train_scaled)
print("Training complete.")

print("\nModel parameters:")
print(f"- Number of estimators: {model.n_estimators}")
print(f"- Contamination rate: {model.contamination}")
print(f"- Random state: {model.random_state}")
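
The effect of `contamination` on the threshold can be checked directly. This sketch fits on synthetic all-normal data (an assumption, not the BGP dataset) and measures what share of the training rows is flagged for two settings; `flag_rates` is an illustrative name.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(2000, 4))  # synthetic training set with no injected anomalies

flag_rates = {}
for contamination in (0.01, 0.15):
    forest = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=42
    )
    predictions = forest.fit(X_clean).predict(X_clean)
    # Share of training rows predicted as anomalies (-1)
    flag_rates[contamination] = float((predictions == -1).mean())

for contamination, rate in flag_rates.items():
    print(f"contamination={contamination:.2f} -> {rate:.1%} of training rows flagged")
```

In both cases the flagged share tracks the `contamination` value, even though the training data contains no anomalies at all.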

In [None]:
# ----------------------------------------
# 5. Model Evaluation
# ----------------------------------------
print("\n--- Model Evaluation on the Full Dataset ---")

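# Score the entire dataset (normal and anomalous). The normal rows were also used for
# training, so results on them are optimistic; hold out normal rows for a stricter estimate.
X_all_scaled = scaler.transform(X)
y_pred = model.predict(X_all_scaled)  # predict returns 1 for normal, -1 for anomaly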

print("\nClassification Report (Focus on Recall for Anomaly):")
# We want to catch as many real anomalies as possible.
# Pass labels explicitly so the order of target_names is guaranteed to match.
print(classification_report(y_true, y_pred, labels=[-1, 1], target_names=['Anomaly (-1)', 'Normal (1)']))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds', 
           xticklabels=['Anomaly', 'Normal'], 
           yticklabels=['Anomaly', 'Normal'])
plt.title('Confusion Matrix for BGP Anomaly Detection')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
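
With `labels=[-1, 1]`, the first row and column of the confusion matrix refer to the anomaly class. The sketch below shows how anomaly precision and recall follow from the four cells. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: -1 = anomaly, 1 = normal
y_true_demo = np.array([-1, -1, -1, -1, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([-1, -1, -1, 1, 1, 1, 1, 1, -1, -1])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[-1, 1])

# Row 0: actual anomalies -> (caught, missed)
true_pos, false_neg = cm_demo[0]
# Row 1: actual normals -> (false alarms, correctly passed)
false_pos, true_neg = cm_demo[1]

recall_demo = true_pos / (true_pos + false_neg)      # share of real anomalies caught
precision_demo = true_pos / (true_pos + false_pos)   # share of alarms that were real

print(cm_demo)
print(f"recall={recall_demo:.2f}, precision={precision_demo:.2f}")
```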

In [None]:
# ----------------------------------------
# 6. Analysis of Detected Anomalies
# ----------------------------------------
print("\n--- Analyzing Feature Differences between Normal and Detected Anomalies ---")

df['prediction'] = y_pred
detected_anomalies = df[df['prediction'] == -1]
detected_normals = df[df['prediction'] == 1]

print(f"\nFlagged {len(detected_anomalies)} of {len(df)} total BGP updates as anomalous")
print(f"Flag rate (includes false positives): {len(detected_anomalies)/len(df)*100:.1f}%")

# Compare a key feature between the groups
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
sns.kdeplot(detected_normals['AS_PATH_LEN'], label='Predicted Normal', fill=True)
sns.kdeplot(detected_anomalies['AS_PATH_LEN'], label='Predicted Anomaly', fill=True, color='red')
plt.title('Distribution of AS_PATH_LEN')
plt.xlabel('AS Path Length')
plt.legend()
plt.grid(True)

plt.subplot(2, 2, 2)
sns.kdeplot(detected_normals['UNIQUE_AS_COUNT'], label='Predicted Normal', fill=True)
sns.kdeplot(detected_anomalies['UNIQUE_AS_COUNT'], label='Predicted Anomaly', fill=True, color='red')
plt.title('Distribution of UNIQUE_AS_COUNT')
plt.xlabel('Unique AS Count')
plt.legend()
plt.grid(True)

plt.subplot(2, 2, 3)
sns.kdeplot(detected_normals['EDIT_DIST_AS_PATH'], label='Predicted Normal', fill=True)
sns.kdeplot(detected_anomalies['EDIT_DIST_AS_PATH'], label='Predicted Anomaly', fill=True, color='red')
plt.title('Distribution of EDIT_DIST_AS_PATH')
plt.xlabel('Edit Distance AS Path')
plt.legend()
plt.grid(True)

plt.subplot(2, 2, 4)
sns.kdeplot(detected_normals['PREFIX_LEN'], label='Predicted Normal', fill=True)
sns.kdeplot(detected_anomalies['PREFIX_LEN'], label='Predicted Anomaly', fill=True, color='red')
plt.title('Distribution of PREFIX_LEN')
plt.xlabel('Prefix Length')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

In [None]:
# ----------------------------------------
# 7. Feature Importance Analysis
# ----------------------------------------
print("\n--- Analyzing Which Features Best Distinguish Anomalies ---")

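# Calculate mean feature values for normal vs anomalous predictions.
# These are raw (unscaled) differences, so features with larger numeric ranges dominate
# the ranking. Treat this as a descriptive comparison, not a true importance measure.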
feature_comparison = pd.DataFrame({
    'Normal_Mean': detected_normals[feature_cols].mean(),
    'Anomaly_Mean': detected_anomalies[feature_cols].mean()
})
feature_comparison['Difference'] = feature_comparison['Anomaly_Mean'] - feature_comparison['Normal_Mean']
feature_comparison['Abs_Difference'] = np.abs(feature_comparison['Difference'])
feature_comparison = feature_comparison.sort_values('Abs_Difference', ascending=False)

print("\nFeature comparison (Normal vs Detected Anomalies):")
print(feature_comparison)

# Plot the differences
plt.figure(figsize=(12, 6))
feature_comparison['Abs_Difference'].plot(kind='bar')
plt.title('Absolute Mean Difference Between Predicted Normal and Predicted Anomalous BGP Updates')
plt.xlabel('BGP Features')
plt.ylabel('Absolute Mean Difference')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
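
Raw mean differences favor features measured on large scales. One common alternative is a standardized mean difference, which divides each difference by a pooled standard deviation. The sketch below uses two made-up features with generic names; it is an illustration of the idea, not a result from the BGP data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one feature on a wide scale, one on a narrow scale
demo = pd.DataFrame({
    'wide_scale_feature': [100, 200, 300, 400, 150, 250, 350, 450],
    'narrow_scale_feature': [1, 2, 1, 2, 5, 6, 5, 6],
    'prediction': [1, 1, 1, 1, -1, -1, -1, -1],
})
cols = ['wide_scale_feature', 'narrow_scale_feature']

normal_rows = demo.loc[demo['prediction'] == 1, cols]
anomaly_rows = demo.loc[demo['prediction'] == -1, cols]

# Raw absolute difference of group means (depends on each feature's units)
raw_diff = (anomaly_rows.mean() - normal_rows.mean()).abs()

# Pooled standard deviation across the two groups, then the scale-free difference
pooled_std = np.sqrt((normal_rows.var() + anomaly_rows.var()) / 2)
std_diff = raw_diff / pooled_std

print(pd.DataFrame({'raw_diff': raw_diff, 'std_diff': std_diff}))
```

Here the raw difference ranks the wide-scale feature first, while the standardized difference ranks the narrow-scale feature first because its groups barely overlap.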

In [None]:
# ----------------------------------------
# 8. Anomaly Score Analysis
# ----------------------------------------
print("\n--- Analyzing Anomaly Scores ---")

# Get anomaly scores (lower scores = more anomalous)
anomaly_scores = model.decision_function(X_all_scaled)
df['anomaly_score'] = anomaly_scores

# Plot distribution of anomaly scores
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(df[df['Label'] == 1]['anomaly_score'], bins=50, alpha=0.7, label='Normal', color='blue')
plt.hist(df[df['Label'] == -1]['anomaly_score'], bins=50, alpha=0.7, label='True Anomaly', color='red')
plt.xlabel('Anomaly Score')
plt.ylabel('Frequency')
plt.title('Distribution of Anomaly Scores by True Label')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Show the most anomalous samples
most_anomalous = df.nsmallest(20, 'anomaly_score')[['AS_PATH_LEN', 'UNIQUE_AS_COUNT', 'anomaly_score', 'Label']]
print("\nTop 20 most anomalous BGP updates:")
print(most_anomalous)

# Scatter plot of anomaly score vs key feature
plt.scatter(df['AS_PATH_LEN'], df['anomaly_score'], 
           c=df['Label'], cmap='RdYlBu', alpha=0.6)
plt.xlabel('AS Path Length')
plt.ylabel('Anomaly Score')
plt.title('Anomaly Score vs AS Path Length')
plt.colorbar(label='True Label (-1=Anomaly, 1=Normal)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
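
The scores plotted above come from `decision_function`, which is `score_samples` shifted by the fitted `offset_` so that the cutoff sits at zero. The sketch below checks that relationship on synthetic data and shows how a stricter cutoff could be applied by hand. The 2% quantile and all `_demo`/`strict_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))  # synthetic stand-in, not the BGP dataset

forest = IsolationForest(n_estimators=50, contamination=0.1, random_state=0).fit(X_demo)

raw_scores = forest.score_samples(X_demo)    # unshifted scores, lower = more anomalous
shifted = forest.decision_function(X_demo)   # raw_scores minus forest.offset_
labels = forest.predict(X_demo)              # -1 exactly where shifted < 0

# A stricter, hand-picked cutoff: flag only the lowest 2% of shifted scores
strict_cutoff = np.quantile(shifted, 0.02)
strict_labels = np.where(shifted < strict_cutoff, -1, 1)

print("flagged by predict():      ", int((labels == -1).sum()))
print("flagged by stricter cutoff:", int((strict_labels == -1).sum()))
```

Working with the continuous score this way allows the alert threshold to be tuned without refitting the model.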

In [None]:
# ----------------------------------------
# 9. Performance Metrics Summary
# ----------------------------------------
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

print("\n--- Performance Metrics Summary ---")

# Calculate key metrics
precision = precision_score(y_true, y_pred, pos_label=-1)
recall = recall_score(y_true, y_pred, pos_label=-1)
f1 = f1_score(y_true, y_pred, pos_label=-1)
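# Higher decision_function scores mean "more normal", which matches the positive
# label (1) that roc_auc_score uses here, so no sign flip is needed.
auc = roc_auc_score(y_true, anomaly_scores)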

print(f"Precision (Anomaly Detection): {precision:.3f}")
print(f"Recall (Anomaly Detection): {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
print(f"AUC-ROC: {auc:.3f}")

# Create summary metrics table
metrics_summary = pd.DataFrame({
    'Metric': ['Precision', 'Recall', 'F1-Score', 'AUC-ROC'],
    'Score': [precision, recall, f1, auc],
    'Interpretation': [
        'Of flagged anomalies, how many are truly anomalous?',
        'Of true anomalies, how many did we catch?',
        'Harmonic mean of precision and recall',
        'Overall ability to distinguish normal from anomalous'
    ]
})

print("\nDetailed Metrics Summary:")
for _, row in metrics_summary.iterrows():
    print(f"{row['Metric']}: {row['Score']:.3f} - {row['Interpretation']}")
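
The AUC call above treats label `1` (normal) as the positive class and relies on higher scores meaning "more normal". The sketch below, using made-up labels and scores, shows that the same value results when the anomaly class is made positive and the scores are negated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = normal, -1 = anomaly) and scores (higher = more normal)
y_true_demo = np.array([1, 1, 1, 1, -1, -1])
scores_demo = np.array([0.20, 0.15, 0.10, -0.05, -0.10, 0.12])

# Orientation used in the notebook: normal is positive, scores as-is
auc_normal_positive = roc_auc_score(y_true_demo, scores_demo)

# Equivalent orientation: anomaly is positive, scores negated
auc_anomaly_positive = roc_auc_score(y_true_demo == -1, -scores_demo)

print(f"normal as positive:  {auc_normal_positive:.2f}")
print(f"anomaly as positive: {auc_anomaly_positive:.2f}")
```

If the scores were passed without matching the orientation of the positive class, the result would be one minus the intended AUC.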

In [None]:
# ----------------------------------------
# 10. Conclusion and Practical Applications
# ----------------------------------------
print("\n" + "="*80)
print("                              CONCLUSION")
print("="*80)

print("\nThe Isolation Forest model was fit on a normal-only baseline and used to flag anomalous BGP updates.")
print("\nKey Takeaways:")
print(f"- Recall for the 'Anomaly' class was {recall:.3f} and precision was {precision:.3f}. Recall matters most")
print("  for a security system designed to detect rare but critical events like BGP hijacks.")
print("- The key to this approach is training on a trusted baseline of 'normal' data. The model learns")
print("  the typical patterns of AS-path lengths, edit distances, and prefix lengths.")
print("- Any update that deviates significantly from this learned profile is flagged.")
print("- Check the feature distribution plots against the model's logic. Flagged updates with unusually")
print("  long AS paths are consistent with a classic symptom of a route leak or hijack.")

print("\nPractical Applications:")
print("- This type of anomaly detection system is vital for large network operators and ISPs")
print("- Can provide an early warning of attacks, allowing for rapid mitigation before widespread outages")
print("- Can be integrated with BGP monitoring systems for real-time threat detection")
print("- Helps protect address space and ensure the stability of internet routing")

print("\nNext Steps:")
print("- Deploy in production BGP monitoring infrastructure")
print("- Integrate with automated response systems")
print("- Add geographical and temporal features for enhanced detection")
print("- Implement ensemble methods combining multiple anomaly detection algorithms")

print("\n" + "="*80)
print("                         PROJECT COMPLETED SUCCESSFULLY")
print("="*80)