# Energy Anomaly & Automated Power Theft Detection System  
### A Data Science Research Framework for Context-Aware Grid Intelligence

---

## Project Overview

Electricity utilities in **Kenya** face significant financial and operational strain due to non-technical losses arising from electricity theft, meter tampering, illegal connections, and irregular consumption behavior. While smart meter infrastructure generates large volumes of high-frequency time-series data, many utilities lack intelligent systems capable of converting raw consumption signals into actionable risk alerts.

This project develops an end-to-end data science framework that not only detects abnormal electricity usage patterns but also generates structured, automated risk notifications suitable for investigation workflows.

To simulate a realistic operational environment, a multi-household electricity dataset is constructed using high-resolution consumption measurements. Natural behavioral variability is preserved across households, while selected households are injected with synthetic theft-like patterns such as sustained consumption drops and altered load distributions. This enables controlled validation of anomaly detection techniques in the absence of real labeled fraud data.
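As a rough illustration of the injection step described above, the sketch below applies a sustained consumption drop to one simulated household. The function name `inject_sustained_drop`, the 40% drop, and the synthetic baseline are all assumptions made for this example, not the project's actual simulation code.

```python
import numpy as np
import pandas as pd


def inject_sustained_drop(daily_kwh: pd.Series, start: str,
                          drop_fraction: float = 0.4) -> pd.Series:
    """Return a copy in which consumption from `start` onward is scaled down.

    A drop_fraction of 0.4 means the meter reports 60% of the true usage.
    """
    tampered = daily_kwh.copy()
    tampered.loc[start:] = tampered.loc[start:] * (1.0 - drop_fraction)
    return tampered


# Hypothetical baseline: 60 days of roughly 10 kWh/day with natural noise.
rng = np.random.default_rng(seed=42)
dates = pd.date_range("2024-01-01", periods=60, freq="D")
baseline = pd.Series(10 + rng.normal(0, 1, size=60), index=dates, name="daily_kwh")

# Simulated tampering that begins on 1 February.
tampered = inject_sustained_drop(baseline, start="2024-02-01", drop_fraction=0.4)
```

Because the drop is multiplicative, the household's day-to-day variability is preserved in shape while its level shifts, which is the kind of pattern a change-based feature should pick up.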

The system integrates three core data layers:

1. **Electricity Consumption Data (Behavioral Signal Layer)**  
   Minute-level power and voltage readings aggregated into structured daily behavioral indicators.

2. **Weather Data (Environmental Context Layer)**  
   Temperature, precipitation, and wind speed variables used to explain legitimate demand variability and reduce false anomaly detection.

3. **Scheduled Outage Information (Operational Filter Layer)**  
   Official maintenance interruption records structured into daily indicators to prevent misclassification of planned supply disruptions.
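
One way the three layers could be brought together is sketched below: minute-level readings are aggregated to one row per meter per day, then joined to daily weather and outage tables. All column names (`power_kw`, `temp_c`, `planned_outage`, and so on) and the sample values are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level readings for one meter over two days.
idx = pd.date_range("2024-03-01", periods=2 * 24 * 60, freq="min")
rng = np.random.default_rng(0)
readings = pd.DataFrame(
    {
        "meter_id": "M001",
        "power_kw": rng.uniform(0.2, 2.0, size=len(idx)),
        "voltage": rng.normal(240, 2, size=len(idx)),
    },
    index=idx,
)

# Behavioral signal layer: one row per meter per day.
readings["date"] = readings.index.normalize()
daily = (
    readings.groupby(["meter_id", "date"])
    .agg(
        mean_kw=("power_kw", "mean"),
        peak_kw=("power_kw", "max"),
        std_kw=("power_kw", "std"),
        min_voltage=("voltage", "min"),
        n_readings=("power_kw", "count"),
    )
    .reset_index()
)
# Each reading covers one minute, so energy = sum(kW) / 60.
daily["energy_kwh"] = daily["mean_kw"] * daily["n_readings"] / 60.0

# Environmental context layer (illustrative values only).
days = pd.date_range("2024-03-01", periods=2, freq="D")
weather = pd.DataFrame(
    {"date": days, "temp_c": [24.5, 19.0], "precip_mm": [0.0, 12.3], "wind_ms": [3.1, 5.4]}
)

# Operational filter layer: a planned outage on the second day.
outages = pd.DataFrame({"meter_id": ["M001"], "date": days[1:], "planned_outage": [1]})

context = daily.merge(weather, on="date", how="left").merge(
    outages, on=["meter_id", "date"], how="left"
)
# Days without an outage record are treated as normal supply days.
context["planned_outage"] = context["planned_outage"].fillna(0).astype(int)
```

Left joins keep every household-day even when weather or outage records are missing, so gaps in the context layers never silently drop consumption data.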

The analytical pipeline transitions from raw time-series inputs to a structured intelligence system that:

- Engineers behavioral and change-based features  
- Adjusts signals using environmental and operational context  
- Applies unsupervised anomaly detection techniques  
- Assigns quantitative theft-risk scores  
- Triggers automated structured notification outputs for high-risk cases  

The final system moves beyond static classification by producing prioritized, investigation-ready alerts supported by explainable risk indicators. This framework demonstrates how utilities can transition from reactive inspection-based fraud handling to proactive, data-driven anomaly intelligence with automated alert generation.

---

## Business Problem

Electricity utilities operate in environments where revenue protection, grid reliability, and operational efficiency are critical. A major persistent challenge is the presence of non-technical losses caused by electricity theft and irregular consumption behavior.

These losses:

- Reduce utility revenue  
- Increase operational and inspection costs  
- Introduce uneven demand stress on distribution infrastructure  
- Compromise grid stability  

Traditional fraud detection approaches rely on:

- Manual inspections  
- Customer complaints  
- Rule-based heuristics  

These methods are reactive, costly, and inefficient.

Although smart meters provide high-frequency consumption data, most utilities lack structured systems capable of distinguishing legitimate variability (e.g., weather shifts, seasonal effects, scheduled outages) from suspicious behavioral anomalies. Furthermore, even when anomalies are detected, many utilities lack automated mechanisms to translate analytical outputs into actionable investigation alerts.

The central business problem addressed in this project is:

> How can utilities leverage integrated consumption, environmental, and operational data to proactively detect abnormal electricity behavior and automatically generate structured investigation notifications?

Specifically, the challenge involves:

- Detecting anomalous patterns without fully labeled theft data  
- Minimizing false positives caused by legitimate variability  
- Translating anomaly scores into explainable risk indicators  
- Automatically producing structured alerts to support investigation workflows  
- Designing a scalable, context-aware detection framework suitable for operational deployment  

This project addresses these challenges by developing a layered anomaly detection system with an embedded automated notification mechanism that flags high-risk consumption cases.

---

## Project Objectives

The primary objective of this project is to design, implement, and evaluate a context-aware anomaly detection framework capable of identifying potential power theft and generating automated risk notifications using time-series smart meter data.

### 1 Data Preparation & Simulation

- Construct a multi-household electricity consumption dataset from high-frequency readings.
- Introduce controlled behavioral diversity across simulated households.
- Inject theft-like consumption patterns to enable controlled anomaly validation.

### 2 Feature Engineering

- Aggregate minute-level consumption into structured daily indicators.
- Engineer statistical and volatility-based features.
- Create change-based indicators (rolling averages, percentage shifts).
- Integrate weather variables for contextual adjustment.
- Incorporate scheduled outage indicators as operational filters.
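
The change-based indicators listed above might look like the following minimal sketch. It assumes a daily table with hypothetical columns `meter_id`, `date`, and `energy_kwh`; the 7-day window is an arbitrary choice for illustration.

```python
import pandas as pd


def add_change_features(daily: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Add a trailing baseline and the relative shift from it, per meter."""
    out = daily.sort_values(["meter_id", "date"]).copy()
    grouped = out.groupby("meter_id")["energy_kwh"]
    # shift(1) excludes the current day, so the baseline uses past days only.
    out["rolling_mean"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=window).mean()
    )
    out["rolling_std"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=window).std()
    )
    # Relative deviation of today's energy from the trailing baseline.
    out["pct_shift"] = (out["energy_kwh"] - out["rolling_mean"]) / out["rolling_mean"]
    return out


# Hypothetical household: stable at 10 kWh/day, then an abrupt halving.
daily = pd.DataFrame(
    {
        "meter_id": "M001",
        "date": pd.date_range("2024-01-01", periods=10, freq="D"),
        "energy_kwh": [10.0] * 8 + [5.0] * 2,
    }
)
features = add_change_features(daily)
```

Excluding the current day from the baseline keeps a sudden drop from diluting its own reference level, at the cost of leaving the first `window` days of each meter without a value.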

### 3 Anomaly Detection Modeling

- Apply unsupervised anomaly detection techniques (e.g., Isolation Forest).
- Generate quantitative anomaly scores per household-day.
- Define risk thresholds to classify consumption into Low, Medium, and High-risk categories.
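
A minimal sketch of how anomaly scores could be turned into risk categories, using scikit-learn's `IsolationForest`. The two feature columns, the synthetic data, and the 0.5 and 0.8 cut-offs are assumptions for illustration; real thresholds would come from calibration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical household-day features: mostly normal behavior.
rng = np.random.default_rng(7)
n = 300
X = pd.DataFrame(
    {
        "energy_kwh": rng.normal(10, 1, n),
        "pct_shift": rng.normal(0, 0.05, n),
    }
)
# One theft-like household-day: very low energy and a sharp negative shift.
X.loc[n] = [3.0, -0.7]

model = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
model.fit(X)

# score_samples returns lower values for more abnormal points, so negate it
# and rescale to [0, 1] to obtain a score where higher means riskier.
raw = -model.score_samples(X)
scored = X.copy()
scored["risk_score"] = (raw - raw.min()) / (raw.max() - raw.min())

# Placeholder thresholds (assumed, not calibrated).
scored["risk_category"] = pd.cut(
    scored["risk_score"],
    bins=[0.0, 0.5, 0.8, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)
```

Min-max scaling makes the score relative to the batch being scored, so the cut-offs would need to be revisited whenever the population of household-days changes substantially.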

### 4 Evaluation & Validation

- Measure detection consistency across simulated theft scenarios.
- Analyze false positives resulting from weather or outage effects.
- Assess stability of anomaly detection across heterogeneous households.

### 5 Automated Risk Notification Layer

- Develop a structured alert-generation mechanism triggered by defined anomaly thresholds.
- Create investigation-ready outputs including:
  - Meter ID
  - Date
  - Risk score
  - Risk category
  - Supporting behavioral indicators
- Demonstrate how anomaly detection outputs can feed into downstream notification workflows (e.g., case export, dashboard alerting, automated email triggers).

Through these objectives, the project demonstrates how integrated data science techniques can power a proactive energy irregularity detection system that combines anomaly modeling with automated alert generation.
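
The alert layer could be as simple as the sketch below, which filters scored household-days by a threshold and serializes them as JSON for a downstream workflow. The `RiskAlert` class, the `build_alerts` helper, the 0.8 threshold, and the sample rows are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RiskAlert:
    """One investigation-ready alert for a single meter and day."""
    meter_id: str
    date: str
    risk_score: float
    risk_category: str
    indicators: dict = field(default_factory=dict)


def build_alerts(scored_rows, threshold=0.8):
    """Keep rows at or above the threshold, highest risk first."""
    alerts = []
    for row in scored_rows:
        if row["risk_score"] >= threshold:
            alerts.append(
                RiskAlert(
                    meter_id=row["meter_id"],
                    date=row["date"],
                    risk_score=round(row["risk_score"], 3),
                    risk_category=row["risk_category"],
                    # Supporting behavioral indicators, when present.
                    indicators={
                        k: row[k] for k in ("pct_shift", "rolling_mean") if k in row
                    },
                )
            )
    return sorted(alerts, key=lambda a: a.risk_score, reverse=True)


# Hypothetical scored household-days.
rows = [
    {"meter_id": "M001", "date": "2024-02-09", "risk_score": 0.93,
     "risk_category": "High", "pct_shift": -0.52, "rolling_mean": 10.1},
    {"meter_id": "M002", "date": "2024-02-09", "risk_score": 0.41,
     "risk_category": "Low", "pct_shift": -0.03, "rolling_mean": 8.7},
    {"meter_id": "M003", "date": "2024-02-09", "risk_score": 0.85,
     "risk_category": "High", "pct_shift": -0.44, "rolling_mean": 12.3},
]
alerts = build_alerts(rows)
payload = json.dumps([asdict(a) for a in alerts], indent=2)
```

The JSON payload is deliberately transport-agnostic: the same records could be written to a case file, pushed to a dashboard, or attached to an email.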

---





## Research Phase I: Anomaly Detection Engine Development (LEAD Dataset)

### Objective

Before deploying the electricity theft detection system on the Kenya dataset, we first develop and validate a robust anomaly detection engine using the LEAD (Large-scale Energy Anomaly Detection) dataset.

The LEAD dataset contains hourly smart meter readings from multiple buildings, along with labeled anomaly events. This makes it ideal for calibrating anomaly detection algorithms and evaluating performance before production deployment.

---

### Why We Start With LEAD

The Kenya deployment dataset does not contain confirmed theft labels. Therefore, building a model directly on it would make it difficult to evaluate performance objectively.

Using LEAD allows us to:

- Validate anomaly detection techniques on labeled data
- Evaluate precision and recall of detected anomalies
- Tune model parameters appropriately
- Avoid overfitting or under-sensitive detection in production

---

### What We Will Do in This Phase

1. Load and inspect the dataset structure  
2. Assess anomaly class imbalance  
3. Handle missing values  
4. Engineer time-based features  
5. Build an anomaly detection model (Isolation Forest)  
6. Evaluate model performance using anomaly labels  
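
For step 4, time-based features could be derived from the `timestamp` column roughly as follows. The specific feature set, including the cyclical hour encoding, is an assumption for illustration rather than the notebook's confirmed design.

```python
import numpy as np
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar features from a datetime `timestamp` column."""
    out = df.copy()
    ts = out["timestamp"]
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    out["month"] = ts.dt.month
    # Cyclical encoding so that hour 23 and hour 0 end up close together.
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    return out


# Small sample shaped like the LEAD columns (values are made up).
sample = pd.DataFrame(
    {
        "building_id": [1, 1, 1],
        "timestamp": pd.to_datetime(
            ["2016-01-01 00:00:00", "2016-01-02 06:00:00", "2016-01-04 18:00:00"]
        ),
        "meter_reading": [10.0, 12.5, 11.0],
    }
)
features_ts = add_time_features(sample)
```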

---

### Expected Outcome

At the end of this phase, we will have:

- A validated anomaly detection engine  
- Performance metrics (precision, recall, F1-score)  
- A calibrated detection threshold  
- A robust foundation for deployment in the Kenya electricity theft detection system  
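
With anomalies making up only a small share of the readings, accuracy alone says little, which is why precision, recall, and F1 are the target metrics. The sketch below computes them from scratch on a made-up label vector; the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (anomaly = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Made-up labels: 3 true anomalies, 3 flagged readings, 2 of them correct.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Sweeping the detection threshold and recomputing these three values at each step is one straightforward way to arrive at a calibrated threshold.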

In [18]:
import pandas as pd

lead_df = pd.read_csv("lead1.0-small.csv")

lead_df.head()

|   | building_id | timestamp | meter_reading | anomaly |
|---|---|---|---|---|
| 0 | 1 | 2016-01-01 00:00:00 | NaN | 0 |
| 1 | 32 | 2016-01-01 00:00:00 | NaN | 0 |
| 2 | 41 | 2016-01-01 00:00:00 | NaN | 0 |
| 3 | 55 | 2016-01-01 00:00:00 | NaN | 0 |
| 4 | 69 | 2016-01-01 00:00:00 | NaN | 0 |


In [19]:
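lead_df.info()
# A notebook cell echoes only its last expression, so these two results
# are computed but not shown; wrap them in print() to display them
lead_df.isnull().sum()
lead_df["anomaly"].value_counts()
# Number of distinct buildings (the value echoed below)
lead_df["building_id"].nunique()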

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1749494 entries, 0 to 1749493
Data columns (total 4 columns):
 #   Column         Dtype  
---  ------         -----  
 0   building_id    int64  
 1   timestamp      object 
 2   meter_reading  float64
 3   anomaly        int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 53.4+ MB


200

In [20]:
# Convert timestamp to datetime
lead_df["timestamp"] = pd.to_datetime(lead_df["timestamp"])

# Check missing percentage
missing_percent = lead_df["meter_reading"].isna().mean() * 100
print("Missing % in meter_reading:", round(missing_percent, 2), "%")

# Anomaly distribution
print("\nAnomaly distribution:")
print(lead_df["anomaly"].value_counts(normalize=True) * 100)

lead_df.head()

Missing % in meter_reading: 6.15 %

Anomaly distribution:
anomaly
0    97.868184
1     2.131816
Name: proportion, dtype: float64


|   | building_id | timestamp | meter_reading | anomaly |
|---|---|---|---|---|
| 0 | 1 | 2016-01-01 | NaN | 0 |
| 1 | 32 | 2016-01-01 | NaN | 0 |
| 2 | 41 | 2016-01-01 | NaN | 0 |
| 3 | 55 | 2016-01-01 | NaN | 0 |
| 4 | 69 | 2016-01-01 | NaN | 0 |


## Data Cleaning and Time-Series Preparation

The LEAD dataset contains approximately 6.15% missing values in the `meter_reading` column. Since this dataset represents hourly time-series data, dropping rows could disrupt temporal continuity.

Therefore, we perform:

- Sorting by `building_id` and `timestamp`
- Group-based interpolation to preserve temporal structure
- Validation to ensure missing values are resolved appropriately

Maintaining time-series integrity matters for anomaly detection: gaps in the hourly sequence would distort the time-based features engineered later, and models such as Isolation Forest work best when feature distributions are consistent across buildings.

In [21]:
# Sort by building and time so interpolation follows chronological order
lead_df = lead_df.sort_values(["building_id", "timestamp"])

# Interpolate missing meter readings per building
lead_df["meter_reading"] = (
    lead_df.groupby("building_id")["meter_reading"]
    .transform(lambda x: x.interpolate(method="linear"))
)

# Check remaining missing values
print("Remaining missing:", lead_df["meter_reading"].isna().sum())

Remaining missing: 28026


### Handling Edge Missing Values

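After interpolation, 28,026 missing values remain at the edges of building time series. With pandas' default forward direction, linear interpolation cannot fill readings that precede a building's first valid observation, because there is no earlier value to interpolate from.

To preserve time continuity, we apply forward-fill followed by backward-fill within each building group. Forward-fill carries the last known reading into any trailing gap, and backward-fill copies the first valid reading into any leading gap, so every building has a complete series before modeling.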

In [22]:
# Forward-fill, then backward-fill, within each building
lead_df["meter_reading"] = (
    lead_df.groupby("building_id")["meter_reading"]
    .transform(lambda x: x.ffill().bfill())
)

# Final check
print("Remaining missing after fill:", lead_df["meter_reading"].isna().sum())

Remaining missing after fill: 0
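
The two-step fill can be checked on a toy series. The sketch below uses a made-up five-row frame with the same column names as `lead_df`; it shows that a leading gap survives linear interpolation under pandas defaults and is then closed by the forward/backward fill.

```python
import numpy as np
import pandas as pd

# Toy building with a gap at the start, in the middle, and at the end.
toy = pd.DataFrame(
    {
        "building_id": [1, 1, 1, 1, 1],
        "meter_reading": [np.nan, 10.0, np.nan, 14.0, np.nan],
    }
)

# Step 1: linear interpolation within each building (pandas defaults).
interp = toy.groupby("building_id")["meter_reading"].transform(
    lambda x: x.interpolate(method="linear")
)

# Step 2: forward-fill, then backward-fill, to close any edge gaps.
filled = interp.groupby(toy["building_id"]).transform(lambda x: x.ffill().bfill())

print("Missing after interpolation:", int(interp.isna().sum()))
print("Missing after fill:", int(filled.isna().sum()))
```

The interior gap is replaced by the midpoint of its neighbors (12.0 here), while the leading gap takes the first valid reading. Edge values filled this way are flat copies rather than measurements, which is worth keeping in mind when interpreting anomalies near the start of a building's record.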
