# Anomaly Detection for BNG Subscriber Traffic

This notebook demonstrates how to connect to the InfluxDB database, query subscriber traffic data, and use the **Isolation Forest** algorithm to detect anomalies.

In [None]:
import os
import pandas as pd
from influxdb_client import InfluxDBClient
from sklearn.ensemble import IsolationForest

### 1. Configure InfluxDB Connection

First, we initialize the InfluxDB client from connection details read out of environment variables, which `docker-compose` passes to the Jupyter container. This keeps real credentials out of the notebook and makes it portable across environments. The fallback values in the cell below are placeholders that apply only when a variable is unset.

In [None]:
# --- InfluxDB Connection Details ---
INFLUXDB_URL = os.getenv("INFLUXDB_URL", "http://localhost:8086")
INFLUXDB_TOKEN = os.getenv("INFLUXDB_TOKEN", "my-super-secret-bng-token")
INFLUXDB_ORG = os.getenv("INFLUXDB_ORG", "bng-telemetry-org")
INFLUXDB_BUCKET = os.getenv("INFLUXDB_BUCKET", "bng-bucket")

# --- Initialize Client ---
client = InfluxDBClient(url=INFLUXDB_URL, token=INFLUXDB_TOKEN, org=INFLUXDB_ORG)
query_api = client.query_api()

print("InfluxDB client initialized.")
print(f"URL: {INFLUXDB_URL}")
print(f"Org: {INFLUXDB_ORG}")
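
The cell above falls back to a built-in token when `INFLUXDB_TOKEN` is unset. A stricter variant is sketched below, under the assumption that a missing token should stop the notebook rather than silently use a placeholder. `InfluxSettings` and `load_influx_settings` are hypothetical names introduced for illustration; only the environment variable names and default values come from this notebook.

```python
import os
from typing import Mapping, NamedTuple


class InfluxSettings(NamedTuple):
    """Hypothetical container for the four connection values used in this notebook."""
    url: str
    token: str
    org: str
    bucket: str


def load_influx_settings(env: Mapping[str, str] = os.environ) -> InfluxSettings:
    """Read connection settings, refusing to continue without an explicit token."""
    token = env.get("INFLUXDB_TOKEN")
    if not token:
        raise RuntimeError("INFLUXDB_TOKEN is not set; refusing to fall back to a default")
    return InfluxSettings(
        url=env.get("INFLUXDB_URL", "http://localhost:8086"),
        token=token,
        org=env.get("INFLUXDB_ORG", "bng-telemetry-org"),
        bucket=env.get("INFLUXDB_BUCKET", "bng-bucket"),
    )


# Example with an explicit mapping instead of the real environment.
settings = load_influx_settings({"INFLUXDB_TOKEN": "example-token"})
print(settings.url, settings.org, settings.bucket)
```

Passing the mapping as an argument keeps the function easy to exercise without touching the real environment.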

### 2. Query Subscriber Data

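Next, we define a Flux query that retrieves the total `input_octets` for each subscriber (identified by `mac`) over the last hour. Because `input_octets` is a counter, `increase()` suits this: it returns the cumulative sum of non-negative differences between consecutive samples, so the final row of each series is that series' growth over the window.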
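
To make the counter arithmetic concrete, here is a minimal pure-Python sketch of what `increase()` computes for a single series. It assumes, following the Flux documentation for non-negative differences, that a drop in the counter is treated as a reset from zero. The `increase` helper and the readings are hypothetical and stand in for one subscriber's `input_octets` samples.

```python
from itertools import accumulate


def increase(samples):
    """Running sum of non-negative differences between consecutive samples.

    When a sample is lower than the previous one, the counter is assumed to
    have reset, so the new sample itself is taken as the growth.
    """
    diffs = [
        curr - prev if curr >= prev else curr
        for prev, curr in zip(samples, samples[1:])
    ]
    return list(accumulate(diffs))


# Hypothetical counter readings; the counter resets between 250 and 40.
readings = [100, 180, 250, 40, 90]
running = increase(readings)

print(running)       # [80, 150, 190, 240]
print(running[-1])   # 240: total growth over the window
print(sum(running))  # 660: adding up the running totals overstates the growth
```

The last value of the running output is the per-series total. Summing every row counts early growth more than once.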

In [None]:
# --- Define the Flux Query ---
# increase() emits a running total per series, so last() keeps each series'
# final value before the per-subscriber sum.
flux_query = f'''
from(bucket: "{INFLUXDB_BUCKET}")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "bng_subscriber_stats")
  |> filter(fn: (r) => r["_field"] == "input_octets")
  |> increase()
  |> last()
  |> group(columns: ["mac"])
  |> sum()
  |> keep(columns: ["_value", "mac"])
'''

print("Flux query defined:")
print(flux_query)

In [None]:
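# --- Execute the query and load into a Pandas DataFrame ---
print("Querying InfluxDB...")
result_df = query_api.query_data_frame(query=flux_query)

# query_data_frame() returns a list of DataFrames when the result tables
# differ in schema; merge them so the cleanup below sees one DataFrame.
if isinstance(result_df, list):
    result_df = pd.concat(result_df, ignore_index=True) if result_df else pd.DataFrame()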

# Clean up the DataFrame
if not result_df.empty:
    result_df = result_df.rename(columns={"_value": "total_input_octets"})
    result_df = result_df.drop(columns=["result", "table"], errors='ignore')
    result_df = result_df.set_index('mac')
    print(f"Successfully loaded {len(result_df)} records into a DataFrame.")
else:
    print("Warning: Query returned no data. Ensure the simulator is running and generating data.")

result_df.head()

### 3. Apply Isolation Forest for Anomaly Detection

**Isolation Forest** is an unsupervised learning algorithm well suited to anomaly detection. It "isolates" observations by repeatedly selecting a random feature and then a random split value between that feature's minimum and maximum. Because anomalies are "few and different," they take fewer splits to isolate, so they have shorter average path lengths across the random decision trees.

We apply this model to `total_input_octets`, the only feature here, to find subscribers whose traffic is unusually high or low compared to the majority.
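
The path-length idea can be illustrated with a small pure-Python sketch on one feature. This is a simplification for intuition only, not scikit-learn's implementation: it skips subsampling and tree construction and just counts how many random splits are needed to separate one value from the rest. The helper names and the totals are hypothetical.

```python
import random


def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed to separate `target` from the other values."""
    depth = 0
    current = list(values)
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # remaining values are identical; no further split is possible
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def mean_depth(values, target, trials=500, seed=0):
    """Average the isolation depth of `target` over many random split sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials


# Hypothetical per-subscriber totals: nine similar values and one extreme value.
totals = [980, 1000, 1010, 1020, 1030, 1040, 1050, 1060, 1080, 50_000]

print(mean_depth(totals, 50_000))  # close to 1: almost any first split isolates it
print(mean_depth(totals, 1030))    # larger: it sits in the middle of the cluster
```

The extreme value is separated almost immediately, while a value inside the cluster needs several splits. That gap in average depth is what the model turns into an anomaly decision.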

In [None]:
if not result_df.empty:
    # --- Prepare the data ---
    X = result_df[['total_input_octets']].values

    # --- Initialize and fit the model ---
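    # `contamination` is the expected proportion of outliers. 'auto' applies the
    # score threshold from the original Isolation Forest paper instead of a fixed proportion.
    model = IsolationForest(contamination='auto', random_state=42)
    # Despite the column name, fit_predict() returns class labels, not continuous scores.
    result_df['anomaly_score'] = model.fit_predict(X)

    # The labels are -1 for anomalies and 1 for inliers.
    print("Anomaly labels calculated. (-1 for anomalies, 1 for inliers)")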
    print(result_df['anomaly_score'].value_counts())
else:
    print("DataFrame is empty. Skipping model training.")

### 4. Identify and Display Anomalies

Finally, we filter the DataFrame to show only the subscribers that the model has flagged as anomalous.

In [None]:
if not result_df.empty:
    # --- Filter for anomalies ---
    anomalies = result_df[result_df['anomaly_score'] == -1]

    print("\n--- Detected Anomalies ---")
    if not anomalies.empty:
        print(f"Found {len(anomalies)} anomalous subscribers:")
        print(anomalies)
    else:
        print("No anomalies detected in this time window.")
else:
    print("DataFrame is empty. No anomalies to report.")
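
Because the model flags both unusually high and unusually low totals, it can help to tag each flagged subscriber with a direction. The sketch below is one possible approach, assuming the median total is a reasonable reference point. `tag_direction`, the MAC addresses, and the totals are hypothetical and use plain Python containers rather than the notebook's DataFrame.

```python
from statistics import median


def tag_direction(totals, flagged):
    """Label each flagged subscriber 'high' or 'low' relative to the median total."""
    reference = median(totals.values())
    return {mac: ("high" if totals[mac] > reference else "low") for mac in flagged}


# Hypothetical totals keyed by MAC address, and a hypothetical list of flagged MACs.
totals = {
    "aa:bb:cc:00:00:01": 1_000,
    "aa:bb:cc:00:00:02": 1_050,
    "aa:bb:cc:00:00:03": 990,
    "aa:bb:cc:00:00:04": 50_000,
    "aa:bb:cc:00:00:05": 3,
}
flagged = ["aa:bb:cc:00:00:04", "aa:bb:cc:00:00:05"]

directions = tag_direction(totals, flagged)
print(directions)
```

In the notebook, `result_df['total_input_octets'].to_dict()` and `anomalies.index` could supply the two arguments.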