# Smart Power Disconnection Analytics  
### An End-to-End Predictive Grid Intelligence Framework

## Project Overview

Electricity reliability is fundamental to economic stability, public safety, and household well-being. However, power distribution systems frequently experience interruptions caused by a combination of environmental conditions, infrastructure maintenance, and demand-related stress. Utilities often operate reactively, responding to disruptions after they occur, rather than proactively identifying high-risk conditions before outages happen.

This project develops a comprehensive, data-driven analytics framework designed to analyze electricity consumption behavior, integrate environmental risk factors, and leverage scheduled outage information to model and forecast power disconnection events.

At its core, the system transforms raw time-series energy consumption data into structured intelligence by combining three major components:

1. **High-frequency electricity consumption data**  
   Minute-level power readings capturing voltage, active power, reactive power, and current intensity.

2. **Exogenous environmental signals**  
   Weather variables such as rainfall, temperature, and wind speed that influence infrastructure stability and demand patterns.

3. **Scheduled outage information**  
   Officially announced maintenance interruptions structured into analyzable datasets.

By consolidating these data streams, the project moves beyond simple descriptive reporting of consumption and instead supports a predictive perspective on power interruptions. The resulting analytics framework is designed to (a) quantify operational stress signals from consumption and voltage behavior, (b) incorporate weather-driven risk factors, and (c) use scheduled outages as structured event indicators to distinguish planned interruptions from anomaly-driven disruptions.

A key technical challenge addressed in this project is that raw smart-meter-style datasets rarely include explicit outage or tampering labels. To support supervised learning and realistic evaluation, the project uses a hybrid event-labeling strategy:
- **Scheduled outages** are derived from Kenya Power (KPLC) planned interruption notices and aggregated into daily indicators.
- **Unexpected disruptions and abnormal behavior** are identified through time-series patterns such as sudden drops, extended near-zero consumption, and abnormal volatility, allowing the creation of event labels that mimic real operational scenarios.
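
The labeling rules listed above could be sketched on daily aggregates roughly as follows. This is a minimal illustration, not the project's actual labeling code: the function name `label_daily_events`, the thresholds (`near_zero_kw`, `drop_ratio`, `vol_z`), and the toy data are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def label_daily_events(daily, near_zero_kw=0.05, drop_ratio=0.5, vol_z=3.0):
    """Attach a hypothetical event label to each daily row.

    Expects columns: date, daily_mean_power, daily_std_power,
    scheduled_outage_today. All thresholds are illustrative assumptions.
    """
    out = daily.sort_values("date").reset_index(drop=True).copy()

    # Sudden drop: today's mean is far below the previous day's mean.
    prev_mean = out["daily_mean_power"].shift(1)
    sudden_drop = out["daily_mean_power"] < drop_ratio * prev_mean

    # Near-zero consumption across the whole day.
    near_zero = out["daily_mean_power"] < near_zero_kw

    # Abnormal volatility: z-score of the daily standard deviation.
    std = out["daily_std_power"]
    volatile = (std - std.mean()) / std.std(ddof=0) > vol_z

    anomaly = sudden_drop | near_zero | volatile
    scheduled = out["scheduled_outage_today"] == 1

    # Scheduled outages take priority over anomaly rules.
    out["event_label"] = np.select(
        [scheduled, anomaly],
        ["scheduled_outage", "anomaly_disruption"],
        default="normal",
    )
    return out


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "date": pd.date_range("2007-01-01", periods=5, freq="D"),
    "daily_mean_power": [1.2, 1.1, 0.01, 1.0, 1.15],
    "daily_std_power": [0.9, 1.0, 0.02, 0.95, 1.0],
    "scheduled_outage_today": [0, 0, 0, 1, 0],
})
labeled = label_daily_events(toy)
print(labeled[["date", "event_label"]])
```

Giving the scheduled flag priority keeps a planned interruption from being counted as an anomaly just because consumption dropped on that day.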

The project outputs a unified modeling table that can support both:
- **Disconnection event classification** (e.g., scheduled outage vs. anomaly-driven disruption), and  
- **Outage risk forecasting** (estimating elevated interruption likelihood based on demand patterns and environmental conditions).

Ultimately, this work demonstrates how a utility-focused analytics layer can be built on top of time-series power measurements, enabling proactive decision-making such as:
- identifying high-risk operating days,
- understanding weather-driven stress effects,
- prioritizing monitoring or maintenance,
- and improving outage preparedness through data-driven early warning signals.

## Problem Statement

Power distribution systems operate in dynamic environments where infrastructure performance is influenced by consumption patterns, environmental stressors, and scheduled operational activities. Despite the availability of large volumes of electricity usage data and external contextual information such as weather conditions, many utilities lack an integrated analytical framework capable of transforming these data streams into predictive intelligence.

Electricity interruptions may arise from:

- Scheduled maintenance operations  
- Infrastructure stress during high demand periods  
- Environmental conditions such as heavy rainfall and strong winds  
- Irregular system behavior or localized faults  

However, without structured modeling, utilities typically respond to disruptions only after customers report service loss. This reactive approach limits preparedness, increases downtime, and constrains reliability planning.

The core problem addressed in this project is:

> How can historical electricity consumption data, weather variables, and scheduled outage information be integrated into a unified modeling framework to analyze and predict power disconnection events?

Specifically, this project seeks to answer the following analytical questions:

1. Can consumption volatility or voltage variability signal elevated infrastructure stress?
2. Do environmental conditions such as rainfall and wind speed significantly increase outage risk?
3. How can scheduled maintenance events be distinguished from unexpected or anomaly-driven disruptions?
4. Can we construct a predictive model that estimates outage likelihood under defined environmental and operational conditions?

Addressing this problem requires transforming raw time-series power data into engineered features, integrating exogenous environmental factors, and applying statistical and machine learning methods to model disconnection risk.
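
As a rough illustration of what "engineered features" could mean here, the sketch below adds trailing rolling statistics per meter and a z-score of each day's mean power against its recent history. The function name `add_rolling_features`, the feature names (`roll_mean`, `roll_std`, `power_z`), the 7-day window, and the toy values are assumptions made for this example only.

```python
import pandas as pd


def add_rolling_features(daily, window=7):
    """Add per-meter trailing statistics (hypothetical feature names)."""
    out = daily.sort_values(["meter_id", "date"]).copy()
    grouped = out.groupby("meter_id")["daily_mean_power"]

    # shift(1) keeps each window strictly in the past, so a day's own
    # value never leaks into the baseline it is compared against.
    out["roll_mean"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=3).mean()
    )
    out["roll_std"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=3).std()
    )

    # z-score of today's mean against the trailing window
    out["power_z"] = (out["daily_mean_power"] - out["roll_mean"]) / out["roll_std"]
    return out


# Toy data (made up): MTR_001 collapses on the last day, MTR_002 stays steady.
dates = pd.date_range("2007-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "meter_id": ["MTR_001"] * 10 + ["MTR_002"] * 10,
    "date": list(dates) * 2,
    "daily_mean_power": (
        [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0, 0.1]
        + [2.0, 2.1, 1.9, 2.05, 1.95, 2.0, 2.1, 1.9, 2.0, 2.05]
    ),
})
feats = add_rolling_features(toy)
print(feats.groupby("meter_id").tail(1)[["meter_id", "date", "power_z"]])
```

The first three days of each meter have no baseline yet, so their rolling features are missing and would need to be dropped or imputed before modeling.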

The ultimate objective is to demonstrate how utilities can transition from descriptive monitoring to predictive outage intelligence, enabling improved grid resilience and operational planning.

## Loading the Datasets

In [1]:
import pandas as pd

power_df = pd.read_csv(
    "power_raw_cleaned.csv",
    parse_dates=["datetime"]
)

print("Power Dataset")
power_df.info()

Power Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 8 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   datetime               datetime64[ns]
 1   Global_active_power    float64       
 2   Global_reactive_power  float64       
 3   Voltage                float64       
 4   Global_intensity       float64       
 5   Sub_metering_1         float64       
 6   Sub_metering_2         float64       
 7   Sub_metering_3         float64       
dtypes: datetime64[ns](1), float64(7)
memory usage: 126.7 MB


In [2]:
power_df.head()

|   | datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 2006-12-16 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 2006-12-16 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


In [3]:
weather_df = pd.read_csv(
    "nairobi_weather_2007_2008.csv",
    parse_dates=["date"]
)

print("Weather Dataset")
weather_df.info()

Weather Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      731 non-null    datetime64[ns]
 1   tmax      731 non-null    float64       
 2   tmin      731 non-null    float64       
 3   prcp      731 non-null    float64       
 4   wspd_max  731 non-null    float64       
dtypes: datetime64[ns](1), float64(4)
memory usage: 28.7 KB


In [4]:
kplc_daily_df = pd.read_csv(
    "kplc_daily_schedule.csv",
    parse_dates=["date"]
)

print("KPLC Daily Schedule Dataset")
kplc_daily_df.info()

KPLC Daily Schedule Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   date                     5 non-null      datetime64[ns]
 1   scheduled_outage_today   5 non-null      int64         
 2   n_scheduled_events       5 non-null      int64         
 3   total_scheduled_minutes  5 non-null      float64       
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 292.0 bytes


In [5]:
kplc_events_df = pd.read_csv(
    "kplc_planned_outages.csv",
    parse_dates=["date"]
)

print("KPLC Planned Outages (Event-Level)")
kplc_events_df.info()

KPLC Planned Outages (Event-Level)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   area                13 non-null     object        
 1   date                13 non-null     datetime64[ns]
 2   start_time          13 non-null     object        
 3   end_time            13 non-null     object        
 4   affected_customers  13 non-null     object        
dtypes: datetime64[ns](1), object(4)
memory usage: 652.0+ bytes
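
The event-level table stores `start_time` and `end_time` as strings, while the daily schedule table carries `scheduled_outage_today`, `n_scheduled_events`, and `total_scheduled_minutes`. One way the daily table might be derived from the event table is sketched below. The function name `to_daily_schedule`, the `HH:MM` time format, and the toy rows are assumptions; the actual notices may use a different time format, in which case the `format` argument would need adjusting.

```python
import pandas as pd


def to_daily_schedule(events):
    """Aggregate event-level planned outages to one row per day.

    Assumes start_time and end_time are 'HH:MM' strings (an assumption,
    not confirmed by the source data).
    """
    ev = events.copy()
    start = pd.to_datetime(ev["start_time"], format="%H:%M")
    end = pd.to_datetime(ev["end_time"], format="%H:%M")
    ev["minutes"] = (end - start).dt.total_seconds() / 60.0

    daily = (
        ev.groupby("date")
        .agg(
            n_scheduled_events=("area", "size"),
            total_scheduled_minutes=("minutes", "sum"),
        )
        .reset_index()
    )
    # Every date present in the notices has at least one planned outage.
    daily["scheduled_outage_today"] = 1
    return daily[
        ["date", "scheduled_outage_today", "n_scheduled_events", "total_scheduled_minutes"]
    ]


# Toy notices (made up for illustration only)
toy_events = pd.DataFrame({
    "area": ["Area A", "Area B", "Area C"],
    "date": pd.to_datetime(["2007-03-01", "2007-03-01", "2007-03-04"]),
    "start_time": ["09:00", "10:00", "08:30"],
    "end_time": ["17:00", "14:00", "16:30"],
})
toy_daily = to_daily_schedule(toy_events)
print(toy_daily)
```

In this sketch, overlapping events on the same day are summed rather than merged, so `total_scheduled_minutes` measures total planned outage effort across areas, not wall-clock outage time.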


In [6]:
import numpy as np

# Number of simulated households
n_households = 10

households = []

# No random seed is set, so scale factors and noise differ between runs.
for i in range(n_households):
    # Each simulated household starts as a full copy of the single-meter data
    temp_df = power_df.copy()

    # Assign meter ID (MTR_001 ... MTR_010)
    temp_df["meter_id"] = f"MTR_{i+1:03d}"

    # Scale consumption by a random factor in [0.7, 1.3) so households differ in size
    scale_factor = np.random.uniform(0.7, 1.3)
    temp_df["Global_active_power"] *= scale_factor

    # Add Gaussian noise (mean 0, standard deviation 0.05) to each reading
    noise = np.random.normal(0, 0.05, size=len(temp_df))
    temp_df["Global_active_power"] += noise

    households.append(temp_df)

# Stack all households into one long table
multi_household_df = pd.concat(households, ignore_index=True)

multi_household_df.head()

|   | datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 | meter_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-12-16 17:24:00 | 3.540312 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 | MTR_001 |
| 1 | 2006-12-16 17:25:00 | 4.58199 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 | MTR_001 |
| 2 | 2006-12-16 17:26:00 | 4.622919 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 | MTR_001 |
| 3 | 2006-12-16 17:27:00 | 4.59504 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 | MTR_001 |
| 4 | 2006-12-16 17:28:00 | 3.148822 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 | MTR_001 |


In [7]:
multi_household_df["meter_id"].value_counts()

meter_id
MTR_001    2075259
MTR_002    2075259
MTR_003    2075259
MTR_004    2075259
MTR_005    2075259
MTR_006    2075259
MTR_007    2075259
MTR_008    2075259
MTR_009    2075259
MTR_010    2075259
Name: count, dtype: int64

In [8]:
# Select 2 meters to simulate theft
theft_meters = ["MTR_003", "MTR_007"]

for meter in theft_meters:
    mask = multi_household_df["meter_id"] == meter

    # Randomly pick 10% of this meter's rows (scattered in time, not one contiguous period)
    theft_idx = multi_household_df[mask].sample(frac=0.1).index

    # Reduce consumption by 60% on the selected rows
    multi_household_df.loc[theft_idx, "Global_active_power"] *= 0.4
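
The cell above scales a random 10% of rows, which scatters the reduced readings across the whole series and leaves no record of which rows were altered. If a sustained tampering episode with a ground-truth label is wanted instead, a contiguous window could be injected along these lines. The helper `inject_theft_window`, the `is_theft` column, and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def inject_theft_window(df, meter_id, start, days, factor=0.4):
    """Scale one meter's readings over a contiguous window and flag them.

    Hypothetical helper: adds an 'is_theft' column (0/1) as a ground-truth label.
    """
    out = df.copy()
    if "is_theft" not in out.columns:
        out["is_theft"] = 0

    start = pd.Timestamp(start)
    end = start + pd.Timedelta(days=days)

    # Rows for this meter whose timestamp falls in [start, end)
    mask = (
        (out["meter_id"] == meter_id)
        & (out["datetime"] >= start)
        & (out["datetime"] < end)
    )
    out.loc[mask, "Global_active_power"] *= factor
    out.loc[mask, "is_theft"] = 1
    return out


# Toy data (made up): two meters, six daily readings each, all 1.0 kW
toy = pd.DataFrame({
    "datetime": pd.date_range("2007-01-01", periods=6, freq="D").tolist() * 2,
    "meter_id": ["MTR_003"] * 6 + ["MTR_004"] * 6,
    "Global_active_power": [1.0] * 12,
})
tampered = inject_theft_window(toy, "MTR_003", "2007-01-03", days=2)
print(tampered[tampered["is_theft"] == 1])
```

Keeping the flag alongside the modified readings makes it possible to score a detector against known tampering periods later.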

In [9]:
power_df.columns

Index(['datetime', 'Global_active_power', 'Global_reactive_power', 'Voltage',
       'Global_intensity', 'Sub_metering_1', 'Sub_metering_2',
       'Sub_metering_3'],
      dtype='object')

In [10]:
multi_household_df.columns

Index(['datetime', 'Global_active_power', 'Global_reactive_power', 'Voltage',
       'Global_intensity', 'Sub_metering_1', 'Sub_metering_2',
       'Sub_metering_3', 'meter_id'],
      dtype='object')

In [11]:
multi_household_df.to_csv("power_multi_household.csv", index=False)

In [13]:
# Read the saved file back, parsing timestamps as in the earlier loads
power_multi = pd.read_csv("power_multi_household.csv", parse_dates=["datetime"])

In [14]:
import os
print("File exists:", os.path.exists("power_multi_household.csv"))
print("File size (MB):", round(os.path.getsize("power_multi_household.csv") / (1024**2), 2))

File exists: True
File size (MB): 1510.66


In [15]:
# Ensure the timestamp column has a datetime dtype
multi_household_df["datetime"] = pd.to_datetime(multi_household_df["datetime"])

# Calendar day of each reading, kept as datetime64 (midnight) so no later conversion is needed
multi_household_df["date"] = multi_household_df["datetime"].dt.normalize()

# Aggregate minute-level readings to one row per meter per day
daily_multi_df = (
    multi_household_df
    .groupby(["meter_id", "date"])
    .agg(
        daily_mean_power=("Global_active_power", "mean"),
        daily_std_power=("Global_active_power", "std"),
        daily_min_power=("Global_active_power", "min"),
        daily_max_power=("Global_active_power", "max"),
        voltage_mean=("Voltage", "mean"),
        voltage_std=("Voltage", "std"),
        intensity_mean=("Global_intensity", "mean")
    )
    .reset_index()
)

daily_multi_df.to_csv("power_multi_household_daily.csv", index=False)

daily_multi_df.head()

|   | meter_id | date | daily_mean_power | daily_std_power | daily_min_power | daily_max_power | voltage_mean | voltage_std | intensity_mean |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MTR_001 | 2006-12-16 | 2.617495 | 0.882098 | 0.222336 | 6.66046 | 236.243763 | 2.922896 | 13.082828 |
| 1 | MTR_001 | 2006-12-17 | 2.018833 | 1.032528 | 0.097898 | 6.024453 | 240.087028 | 4.051467 | 9.999028 |
| 2 | MTR_001 | 2006-12-18 | 1.311082 | 0.862724 | 0.065486 | 5.25853 | 241.231694 | 3.719576 | 6.421667 |
| 3 | MTR_001 | 2006-12-19 | 0.992268 | 1.063872 | 0.020072 | 6.75039 | 241.999313 | 3.069492 | 4.926389 |
| 4 | MTR_001 | 2006-12-20 | 1.325185 | 1.134982 | 0.026983 | 5.161571 | 242.308062 | 3.345704 | 6.467361 |
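
The overview describes a unified modeling table built from the daily power, weather, and scheduled-outage data. A minimal sketch of how such a table might be assembled with left joins on `date` follows. The toy frames stand in for `daily_multi_df`, `weather_df`, and `kplc_daily_df`, and the name `model_df` and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the three daily tables (values made up)
days = pd.date_range("2007-01-01", periods=3, freq="D")
daily_power = pd.DataFrame({
    "meter_id": ["MTR_001"] * 3 + ["MTR_002"] * 3,
    "date": list(days) * 2,
    "daily_mean_power": [1.2, 1.1, 0.9, 2.0, 2.1, 1.8],
})
weather = pd.DataFrame({
    "date": days,
    "tmax": [26.0, 27.5, 25.0],
    "tmin": [14.0, 15.0, 13.5],
    "prcp": [0.0, 12.4, 3.1],
    "wspd_max": [18.0, 25.0, 20.0],
})
schedule = pd.DataFrame({
    "date": [days[1]],
    "scheduled_outage_today": [1],
    "n_scheduled_events": [2],
    "total_scheduled_minutes": [720.0],
})

# Left joins keep every meter-day row from the power table.
model_df = (
    daily_power
    .merge(weather, on="date", how="left")
    .merge(schedule, on="date", how="left")
)

# Days without a notice are treated as "no scheduled outage".
schedule_cols = ["scheduled_outage_today", "n_scheduled_events", "total_scheduled_minutes"]
model_df[schedule_cols] = model_df[schedule_cols].fillna(0)
model_df["scheduled_outage_today"] = model_df["scheduled_outage_today"].astype(int)

print(model_df)
```

With left joins, any meter-day outside the date range of the weather table would carry missing weather values, so the overlap between the tables is worth checking before modeling.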


In [3]:
import pandas as pd
df = pd.read_csv("lead1.0-small.csv")
df.head()

|   | building_id | timestamp | meter_reading | anomaly |
|---|---|---|---|---|
| 0 | 1 | 2016-01-01 00:00:00 | NaN | 0 |
| 1 | 32 | 2016-01-01 00:00:00 | NaN | 0 |
| 2 | 41 | 2016-01-01 00:00:00 | NaN | 0 |
| 3 | 55 | 2016-01-01 00:00:00 | NaN | 0 |
| 4 | 69 | 2016-01-01 00:00:00 | NaN | 0 |


In [4]:
df1 = pd.read_csv("df.csv")
df1.head()

Unnamed: 0,0,Electricity:Facility [kW](Hourly),Fans:Electricity [kW](Hourly),Cooling:Electricity [kW](Hourly),Heating:Electricity [kW](Hourly),InteriorLights:Electricity [kW](Hourly),InteriorEquipment:Electricity [kW](Hourly),Gas:Facility [kW](Hourly),Heating:Gas [kW](Hourly),InteriorEquipment:Gas [kW](Hourly),Water Heater:WaterSystems:Gas [kW](Hourly),Class,theft
0,0,22.035977,3.586221,0.0,0.0,4.589925,8.1892,136.585903,123.999076,3.33988,9.246947,FullServiceRestaurant,Normal
1,1,14.649757,0.0,0.0,0.0,1.529975,7.4902,3.35988,0.0,3.33988,0.02,FullServiceRestaurant,Normal
2,2,14.669567,0.0,0.0,0.0,1.529975,7.4902,3.35988,0.0,3.33988,0.02,FullServiceRestaurant,Normal
3,3,14.677808,0.0,0.0,0.0,1.529975,7.4902,3.931932,0.0,3.33988,0.592052,FullServiceRestaurant,Normal
4,4,14.824794,0.0,0.0,0.0,1.529975,7.4902,3.35988,0.0,3.33988,0.02,FullServiceRestaurant,Normal
