# Household Power Consumption ML Predictor

This notebook builds an ML model that predicts a household's power consumption. It is also the final project for the Machine Learning Algorithms course of IPCA's Applied Machine Learning course.

The project requirements paper is available in this repository under the name of "Practical_Assessment_MAAI_MLA_2025_2026.pdf".

Course professor [*lufer*](https://github.com/luferIPCA)

Notebook made by [*Álvaro Terroso*](https://github.com/alvaroterroso)

Dataset is available at [*UC Irvine*](https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption)

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Dataset public information and import

This is a public dataset found on Kaggle. Its original authors are [*Georges Hébrail*](https://www.linkedin.com/in/georges-hebrail-582a0813/?originalSubdomain=fr) and [*Alice Bérard*](https://www.linkedin.com/in/aliceberard/).

This archive contains 2075259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010.

The dataset is distributed as a semicolon-separated TXT file, so we first load it with the CSV reader using the appropriate separator.

In [21]:
# Import from the original TXT (semicolon-separated)
df = pd.read_csv(
    "household_power_consumption.txt",
    sep=";",
    na_values=["?", "NA", ""],
    low_memory=False,  # Avoid mixed-dtype warnings
)

# Merge Date and Time into a single Datetime column (day-first European format)
date_time = df.pop("Date") + " " + df.pop("Time")
df.insert(0, "Datetime", pd.to_datetime(date_time, format="%d/%m/%Y %H:%M:%S"))



## Dataset details

In [22]:
df.shape

(2075259, 8)

The dataset has about 2 million records (2075259 rows) and 8 columns.

In [25]:
df.head(10)

|   | Datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 2006-12-16 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 2006-12-16 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |
| 5 | 2006-12-16 17:29:00 | 3.52 | 0.522 | 235.02 | 15.0 | 0.0 | 2.0 | 17.0 |
| 6 | 2006-12-16 17:30:00 | 3.702 | 0.52 | 235.09 | 15.8 | 0.0 | 1.0 | 17.0 |
| 7 | 2006-12-16 17:31:00 | 3.7 | 0.52 | 235.22 | 15.8 | 0.0 | 1.0 | 17.0 |
| 8 | 2006-12-16 17:32:00 | 3.668 | 0.51 | 233.99 | 15.8 | 0.0 | 1.0 | 17.0 |
| 9 | 2006-12-16 17:33:00 | 3.662 | 0.51 | 233.86 | 15.8 | 0.0 | 2.0 | 16.0 |


## Column meanings and purpose in the dataset

This dataset contains **minute-level electricity measurements** from a single household (Sceaux, France), collected over a long period (Dec 2006–Nov 2010). Each row is one timestamp, and the variables describe the household’s electrical load both at an overall level and for three specific appliance groups (sub-meterings). The dataset also contains a small proportion of missing measurements (~1.25%), meaning some timestamps exist but the sensor values may be absent.

### Datetime
- **Meaning:** The exact date and time of the measurement (one-minute resolution).
- **Purpose:** Enables time-series analysis and feature engineering (hour, day of week, seasonality, holidays), and supports forecasting (next minute/hour/day consumption).
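
The calendar features mentioned above could be derived roughly as follows. This is a minimal sketch on three timestamps taken from the head of the dataset; the helper name `add_time_features` and the new column names are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Three timestamps taken from the head of the dataset
sample = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2006-12-16 17:24:00",
        "2006-12-16 17:25:00",
        "2006-12-16 17:26:00",
    ])
})


def add_time_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with simple calendar features derived from Datetime."""
    out = frame.copy()
    out["hour"] = out["Datetime"].dt.hour
    out["day_of_week"] = out["Datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = out["Datetime"].dt.month
    out["is_weekend"] = out["day_of_week"] >= 5
    return out


features = add_time_features(sample)
print(features)
```

Holiday flags would need an external calendar, so they are left out of this sketch.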

### Global_active_power
- **Meaning:** Total **active power** consumed by the household at that minute (typically in **kW**).
- **Purpose:** This is usually the **main target variable** for forecasting/monitoring because it represents the real power used by appliances. It can also be converted into energy and cost estimates over time.
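
As a rough illustration of the energy and cost conversion, the sketch below assumes each reading is the average power over its minute, so dividing kW by 60 gives the kWh used in that minute. The tariff `PRICE_PER_KWH` is a hypothetical value chosen for the example and does not come from the dataset.

```python
import pandas as pd

# First five readings from the head of the dataset (kW)
power_kw = pd.Series(
    [4.216, 5.360, 5.374, 5.388, 3.666],
    index=pd.date_range("2006-12-16 17:24:00", periods=5, freq="min"),
    name="Global_active_power",
)

PRICE_PER_KWH = 0.20  # hypothetical tariff, not from the dataset

# One minute is 1/60 of an hour: kW averaged over a minute -> kWh for that minute
energy_kwh = power_kw / 60.0

total_energy_kwh = energy_kwh.sum()
estimated_cost = total_energy_kwh * PRICE_PER_KWH

print(f"Energy over 5 minutes: {total_energy_kwh:.4f} kWh")
print(f"Estimated cost: {estimated_cost:.4f}")
```

On the full dataframe the same per-minute energy could be summed per day or month with `resample` once `Datetime` is set as the index.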

### Global_reactive_power
- **Meaning:** Total **reactive power** at that minute (typically in **kVAR**).
- **Purpose:** Helps characterize the type of electrical load (inductive/capacitive appliances). It can provide additional predictive signal and insight into efficiency/power quality, even though it is not “useful work” energy.
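
One way to summarise the load type is the power factor, the ratio of active power to apparent power, where apparent power is the magnitude of the active and reactive components. The sketch below uses the first three rows of the dataset; the derived names `apparent_power_kva` and `power_factor` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# First three rows from the head of the dataset
sample = pd.DataFrame({
    "Global_active_power": [4.216, 5.360, 5.374],    # kW
    "Global_reactive_power": [0.418, 0.436, 0.498],  # kVAR
})

# Apparent power is the hypotenuse of the active/reactive power triangle
apparent_power_kva = np.hypot(
    sample["Global_active_power"], sample["Global_reactive_power"]
)

# Power factor close to 1 means the load is mostly resistive
power_factor = sample["Global_active_power"] / apparent_power_kva
print(power_factor.round(4))
```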

### Voltage
- **Meaning:** The household supply **voltage** at that minute (in **V**).
- **Purpose:** Captures grid/supply fluctuations that can affect current draw and power consumption. Useful for diagnosing abnormal behaviour and improving prediction accuracy.

### Global_intensity
- **Meaning:** Total **current intensity** drawn by the household at that minute (in **A**).
- **Purpose:** Another view of instantaneous load. Since power is related to voltage and current, this feature is strongly tied to consumption peaks and can help models detect high-load periods.
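
The link between power, voltage, and current can be checked on a few rows: the product of voltage and current gives the apparent power, which should sit close to the measured active power for a mostly resistive load. This is only a sketch on the first three rows; the column name `apparent_power_kva` is an assumption, and the ratio may land slightly above 1 on some rows, presumably because of rounding in the recorded values.

```python
import pandas as pd

# First three rows from the head of the dataset
sample = pd.DataFrame({
    "Global_active_power": [4.216, 5.360, 5.374],  # kW
    "Voltage": [234.84, 233.63, 233.29],           # V
    "Global_intensity": [18.4, 23.0, 23.0],        # A
})

# Apparent power S = V * I, converted from VA to kVA
sample["apparent_power_kva"] = sample["Voltage"] * sample["Global_intensity"] / 1000

# Ratio of measured active power to V * I
ratio = sample["Global_active_power"] / sample["apparent_power_kva"]
print(sample)
print(ratio.round(4))
```

Because `Global_intensity` is so tightly coupled to the target, including it as a feature when predicting `Global_active_power` at the same timestamp is close to leaking the answer; lagged values are a safer choice.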

### Sub_metering_1
- **Meaning:** **Active energy** consumed by the **kitchen** appliance group during that minute (in **Wh**).
- **Purpose:** Provides appliance-group breakdown of consumption, supporting more detailed behavioural analysis (cooking patterns) and enabling models to learn which activities drive peaks.

### Sub_metering_2
- **Meaning:** **Active energy** consumed by the **laundry room** appliance group during that minute (in **Wh**).
- **Purpose:** Helps identify energy usage linked to washing/drying routines and supports targeted insights (e.g., shifting laundry to cheaper hours).

### Sub_metering_3
- **Meaning:** **Active energy** consumed by **electric water heating and air conditioning** during that minute (in **Wh**).
- **Purpose:** Often linked to strong **seasonal effects** (heating/cooling). It is key for analysing winter/summer consumption patterns and improving forecasting.

---

### Extra derived consumption (not directly measured by sub-meterings)
The dataset notes that:
- **Other (unmetered) minute energy** can be estimated as:

  \[
  \text{Other\_energy\_Wh} = \left(\frac{\text{Global\_active\_power} \times 1000}{60}\right) - \text{Sub\_metering\_1} - \text{Sub\_metering\_2} - \text{Sub\_metering\_3}
  \]

- **Meaning:** Energy consumed by all equipment **not covered** by the three sub-meterings (e.g., lighting, electronics, small appliances).
- **Purpose:** Allows a more complete breakdown of household consumption and can be a useful additional feature/target for analysis.
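
The formula above can be applied column-wise. The sketch below uses the first three rows of the dataset; on the full dataframe the same expression works on `df` directly.

```python
import pandas as pd

# First three rows from the head of the dataset
sample = pd.DataFrame({
    "Global_active_power": [4.216, 5.360, 5.374],  # kW
    "Sub_metering_1": [0.0, 0.0, 0.0],             # Wh
    "Sub_metering_2": [1.0, 1.0, 2.0],             # Wh
    "Sub_metering_3": [17.0, 16.0, 17.0],          # Wh
})

# kW -> W (x1000), then W over one minute -> Wh (/60)
total_wh = sample["Global_active_power"] * 1000 / 60

# Energy already accounted for by the three sub-meters
sub_metered_wh = sample[
    ["Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]
].sum(axis=1)

sample["Other_energy_Wh"] = total_wh - sub_metered_wh
print(sample["Other_energy_Wh"].round(2))
```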


## Null values analysis

In [26]:
df.isnull().sum()

Datetime                     0
Global_active_power      25979
Global_reactive_power    25979
Voltage                  25979
Global_intensity         25979
Sub_metering_1           25979
Sub_metering_2           25979
Sub_metering_3           25979
dtype: int64

After this null assessment, we found that 25979 records contain null values. The count is identical for every measurement column.

Because the data is recorded minute by minute and these records represent only 1.25% of the dataset, we delete them.

In [28]:
# Drop rows with any null values
df = df.dropna().reset_index(drop=True)

# Quick check
df.isnull().sum()

Datetime                 0
Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64

In [29]:
df.shape

(2049280, 8)
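
The figures reported above can be cross-checked with simple arithmetic. The sketch below only reuses the row counts printed in this notebook; the variable names are illustrative.

```python
# Row counts reported by df.shape and df.isnull().sum() in this notebook
total_rows = 2_075_259
null_rows = 25_979

# Share of rows removed by dropna()
missing_share = null_rows / total_rows

# Rows expected to remain after dropping the null records
remaining_rows = total_rows - null_rows

print(f"Missing share: {missing_share:.2%}")
print(f"Remaining rows: {remaining_rows}")
```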

## Duplicate analysis

In [31]:
df.duplicated().sum()

0

This dataset has no duplicate data.
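
`df.duplicated()` compares entire rows. For a time series, a complementary check is whether any timestamp repeats and whether consecutive rows are exactly one minute apart. The sketch below runs on four timestamps from the head of the dataset; on the full dataframe, gaps are to be expected once the null rows have been dropped.

```python
import pandas as pd

# Four consecutive timestamps from the head of the dataset
timestamps = pd.Series(
    pd.to_datetime([
        "2006-12-16 17:24:00",
        "2006-12-16 17:25:00",
        "2006-12-16 17:26:00",
        "2006-12-16 17:27:00",
    ]),
    name="Datetime",
)

# Number of timestamps that appear more than once
duplicate_timestamps = int(timestamps.duplicated().sum())

# Number of consecutive pairs that are not exactly one minute apart
step = timestamps.diff().dropna()
gaps = int((step != pd.Timedelta(minutes=1)).sum())

print(f"Duplicate timestamps: {duplicate_timestamps}, gaps: {gaps}")
```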