<a href="https://colab.research.google.com/github/anandaditya07/Smart-Energy-Consumption-Analysis-and-Prediction-using-Machine-Learning-with-Device-Level-Insights/blob/main/Aditya_Anand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Milestone 1: Week 1-2**


**Module 1: Data Collection and Understanding**


1. **Define project scope and functional objectives for smart energy analysis.**



This project analyzes how much electricity each appliance in a smart home uses. Instead of seeing only one total on the electricity bill at the end of the month, we want to see which device uses how much power and when. This shows where energy is being wasted.

**Functional Objectives**

*   Track energy usage of each device and each room separately.
*   Show energy use in the form of graphs (hourly, daily, weekly).
*   Find which devices use the most power and at what time.
*   Use machine learning to predict future electricity use.
*   Help save electricity by giving suggestions to reduce unnecessary usage.


2. **Collect and structure the SmartHome Energy Monitoring Dataset**

In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

path = "/content/drive/MyDrive/HomeC_augmented.csv"
df = pd.read_csv(path)

Mounted at /content/drive


In [None]:
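# Work on a copy of the data loaded above instead of reading the CSV a second time
df_raw = df.copy()
print("Original shape:", df_raw.shape)

# Drop the leftover row-number column, if present; it is not an energy reading
df_raw = df_raw.drop(columns=["Unnamed: 0"], errors="ignore")
df_raw.head()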

In [None]:
# Column names
df.columns


In [None]:
# Basic info – data types, nulls, etc.
df.info()


In [None]:
# Basic statistics for numerical columns
df.describe().T


In [None]:
# Set this to the actual time column name (check df_raw.columns)
time_col = "time"   # e.g. "date", "time", "Datetime" etc.

# Convert to datetime
df_raw[time_col] = pd.to_datetime(df_raw[time_col], errors='coerce')

# Drop rows where timestamp could not be parsed
df_raw = df_raw.dropna(subset=[time_col])

# Sort by time
df_raw = df_raw.sort_values(time_col)

# Set timestamp as index
df = df_raw.set_index(time_col)

print("After setting time index:", df.shape)
df.head()
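
Besides the column name, the encoding of the time column is worth checking. If it stores Unix epoch seconds rather than date strings (an assumption about the CSV, not something this notebook confirms), `pd.to_datetime` needs `unit="s"`, because plain integers are otherwise read as nanoseconds. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical sample: epoch seconds stored as text, plus one unparseable entry
sample = pd.DataFrame({"time": ["1451624400", "1451624401", "bad"],
                       "use": [0.9, 0.8, 0.7]})

# Convert to numbers first; unparseable entries become NaN
seconds = pd.to_numeric(sample["time"], errors="coerce")

# Interpret the numbers as seconds since 1970-01-01 (UTC)
sample["time"] = pd.to_datetime(seconds, unit="s", errors="coerce")

# Drop rows whose timestamp could not be parsed, then sort and index by time
sample = sample.dropna(subset=["time"]).sort_values("time").set_index("time")
print(sample)
```

If `df.head()` above shows dates near 1970, this is the likely cause.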

In [None]:
# All devices/measurements (since time is now index)
device_cols = df.columns.tolist()
print("Device / sensor columns:", device_cols[:10])



3. **Verify data integrity, handle missing timestamps, and perform exploratory analysis.**



i. Check Data Integrity

We verify whether the dataset has:

*   Repeated timestamps
*   Empty/missing data

If it does, we fix them.




In [None]:
print("\n~~~~~~~ DATA INTEGRITY CHECK ~~~~~~~")
# Count repeated timestamps in the index
n_dupes = int(df.index.duplicated().sum())
print("Repeated timestamps:", n_dupes)

# Keep the first reading for each repeated timestamp
df = df[~df.index.duplicated(keep="first")]

# Count empty/missing values per column
print("Missing values per column:")
print(df.isna().sum())

# Fill missing values from neighbouring readings (forward fill, then backward fill)
df = df.ffill().bfill()
print("Missing values after filling:", int(df.isna().sum().sum()))

ii. Handle Missing Timestamps

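We rebuild a continuous timeline at the inferred sampling frequency so that no timestamp is missing,
and we fill the gaps this creates by copying the nearest earlier value (or the nearest later value at the start).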

In [None]:
print("\n~~~~~~~ MISSING TIMESTAMP HANDLING ~~~~~~~")
# reindex() fails on repeated index labels, so keep one row per timestamp
df = df[~df.index.duplicated(keep="first")]
# Try to guess the time gap between readings (like 1 hour / 5 min)
inferred_freq = pd.infer_freq(df.index[:100])
print("Inferred frequency:", inferred_freq)
# If frequency cannot be detected → assume 1 hour gap
if inferred_freq is None:
    inferred_freq = 'h'
# Create a new continuous timeline with no gaps
full_range = pd.date_range(start=df.index.min(),
                           end=df.index.max(),
                           freq=inferred_freq)
# Reindex so dataset follows this timeline
df = df.reindex(full_range)
df.index.name = "timestamp"
# Fill empty values created by reindexing
df = df.ffill().bfill()
print("Missing values after filling:")
print(df.isna().sum())

iii. Exploratory Data Analysis
* Minimum, maximum, average energy usage per device

* Graph that shows how energy usage changes with time


In [None]:
import matplotlib.pyplot as plt
import numpy as np
print("\n~~~~~~EXPLORATORY ANALYSIS ~~~~~~~")
# Show basic numeric statistics for all device columns
display(df.describe().T)
# Show how a few device values change over time
plt.figure(figsize=(12,4))
# Exclude 'Unnamed: 0' from sample_cols for better visualization of energy consumption
sample_cols = [col for col in df.select_dtypes(include=[np.number]).columns if col != 'Unnamed: 0'][:3]
for col in sample_cols:
    plt.plot(df.index, df[col], label=col)
plt.xlabel("Time")
plt.ylabel("Energy Consumption")
plt.title("Sample Energy Consumption Over Time")
plt.legend()
plt.show()
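
The per-device minimum, maximum and average listed above can also be collected in one small table, together with the time at which each device peaked. The sketch below uses made-up readings; the device names and values are placeholders, not columns from the dataset.

```python
import pandas as pd

# Hypothetical hourly readings for two devices
idx = pd.date_range("2016-01-01", periods=4, freq="h")
toy = pd.DataFrame({"Fridge": [0.1, 0.3, 0.2, 0.4],
                    "TV": [0.0, 0.5, 0.5, 0.0]}, index=idx)

# One row per device with its minimum, maximum and mean reading
summary = toy.agg(["min", "max", "mean"]).T

# Timestamp of the first occurrence of each device's maximum
summary["peak_time"] = toy.idxmax()
print(summary)
```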

iv. Organize energy readings by device, room, and timestamp.

In [None]:
# Make a copy to be safe
df_device = df.copy()
# Select all numeric columns as device columns
device_cols = df_device.select_dtypes(include=['number']).columns.tolist()
print("Device / sensor columns:", device_cols)
# Convert from wide format → long format
df_long = df_device.reset_index().melt(
    id_vars=["timestamp"],       # column that stays fixed (time)
    value_vars=device_cols,      # columns that will become 'device'
    var_name="device",           # new column name for device name
    value_name="energy"          # new column name for energy value
)
print("Long format shape:", df_long.shape)
df_long.head()

In [None]:
# Example device → room mapping
# IMPORTANT: change keys to match your real device names
room_map = {
    "Kitchen_Light": "Kitchen",
    "Fridge": "Kitchen",
    "AC_Bedroom": "Bedroom",
    "TV_LivingRoom": "Living Room",
    # Add more device: room pairs here...
}
# Create 'room' column using the mapping
df_long["room"] = df_long["device"].map(room_map).fillna("Unknown")
# Show first few organized rows
df_long.head()
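
Once each reading carries a room label, room-level totals per timestamp follow from a `groupby`. This is a minimal sketch on made-up rows; the device names reuse the example mapping above and may not match the real column names.

```python
import pandas as pd

# Hypothetical long-format readings for two timestamps
toy_long = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01 00:00"] * 3 + ["2016-01-01 01:00"] * 3),
    "device": ["Fridge", "Kitchen_Light", "AC_Bedroom"] * 2,
    "energy": [0.2, 0.1, 1.0, 0.3, 0.0, 1.2],
})
room_map = {"Fridge": "Kitchen", "Kitchen_Light": "Kitchen", "AC_Bedroom": "Bedroom"}
toy_long["room"] = toy_long["device"].map(room_map).fillna("Unknown")

# Sum the readings of all devices in a room at each timestamp,
# then pivot so that every room becomes its own column
room_totals = toy_long.groupby(["timestamp", "room"])["energy"].sum().unstack("room")
print(room_totals)
```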

**Module 2: Data Cleaning and Preprocessing**

i. Handle missing values and outliers in power consumption readings.



In [None]:
import numpy as np # Ensure numpy is imported for np.number, if not already
# Missing values check
print("Missing values before cleaning:")
print(df.isna().sum())

# Fill missing values using forward & backward fill
df = df.ffill().bfill()
print("Missing values after filling:")
print(df.isna().sum())

# Cap outliers at the 1st and 99th percentiles of each numeric column
# (clip() limits extreme values but keeps every row)
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
for col in num_cols:
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lower=low, upper=high)

print("Outliers clipped to the 1st-99th percentile range.")
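
To see what percentile clipping does to a single spike, the sketch below applies the same 1st/99th percentile limits to a made-up series. The values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical series: 99 ordinary readings between 0 and 1, plus one spike
s = pd.Series(np.append(np.linspace(0.0, 1.0, 99), 50.0))

# Same limits as in the cell above
low, high = s.quantile([0.01, 0.99])
clipped = s.clip(lower=low, upper=high)

# The spike is pulled down to the upper limit; no row is dropped
print("Rows before/after:", len(s), len(clipped))
print("Max before/after:", s.max(), clipped.max())
```

Because rows are kept, the time index stays continuous, which matters for the resampling and lag features later on.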

ii. Convert timestamps to datetime format and resample data (hourly/daily).

In [None]:
import numpy as np
# PART 2: RESAMPLE DATA (HOURLY / DAILY)

# Select only numeric columns for resampling
numeric_df = df.select_dtypes(include=[np.number])

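# Hourly average consumption
df_hourly = numeric_df.resample('h').mean()

# Daily average consumption
df_daily = numeric_df.resample('D').mean()

print("Hourly data shape:", df_hourly.shape)
print("Daily data shape:", df_daily.shape)
df_hourly.head()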

iii. Normalize or scale energy values for model compatibility.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# PART 3: NORMALIZATION / SCALING

# Select target and features later
df_scaled = df_hourly.copy()

scaler = MinMaxScaler()
df_scaled[df_hourly.columns] = scaler.fit_transform(df_hourly)

df_scaled.head()
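
One point to keep in mind: fitting the scaler on the full dataset lets the minimum and maximum of later periods influence the training data. A common alternative, sketched here on made-up values and not taken from this notebook, is to fit on the training rows only and reuse the fitted scaler for the rest.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical hourly data with an upward trend
idx = pd.date_range("2016-01-01", periods=10, freq="h")
toy = pd.DataFrame({"use": np.arange(10, dtype=float)}, index=idx)

# Chronological split before any scaling: first 7 rows train, last 3 test
n_train = 7
train, test = toy.iloc[:n_train], toy.iloc[n_train:]

# Learn the minimum and maximum from the training rows only
scaler = MinMaxScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            index=train.index, columns=train.columns)

# Reuse the same minimum and maximum for later rows; values may fall outside [0, 1]
test_scaled = pd.DataFrame(scaler.transform(test),
                           index=test.index, columns=test.columns)
print(test_scaled)
```

Scaled test values above 1 are expected here; they show that the test period exceeds anything seen during training.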

iv. Split dataset into training, validation, and testing sets.

In [None]:
# PART 4: TRAIN / VALIDATION / TEST SPLIT

# Select the main target column (CHANGE to your main power column)
target_col = df_scaled.columns[0]  # example: first numeric col
print("Using target:", target_col)

# Create X and y
X = df_scaled.drop(columns=[target_col])
y = df_scaled[target_col]

# Time-based splitting
train_size = int(len(df_scaled) * 0.7)
val_size = int(len(df_scaled) * 0.15)

X_train = X.iloc[:train_size]
y_train = y.iloc[:train_size]

X_val = X.iloc[train_size:train_size + val_size]
y_val = y.iloc[train_size:train_size + val_size]

X_test = X.iloc[train_size + val_size:]
y_test = y.iloc[train_size + val_size:]

print("Train size:", len(X_train))
print("Validation size:", len(X_val))
print("Test size:", len(X_test))
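
The same 70/15/15 time-ordered split is needed again after feature engineering, so it can be wrapped in a small helper. `chronological_split` is a hypothetical name introduced here for illustration; it is not part of pandas or scikit-learn.

```python
import pandas as pd

def chronological_split(frame, train_frac=0.7, val_frac=0.15):
    """Split rows in their existing order into train, validation and test parts."""
    n_train = int(len(frame) * train_frac)
    n_val = int(len(frame) * val_frac)
    train = frame.iloc[:n_train]
    val = frame.iloc[n_train:n_train + n_val]
    test = frame.iloc[n_train + n_val:]
    return train, val, test

# Hypothetical hourly frame with 100 rows
toy = pd.DataFrame({"x": range(100)},
                   index=pd.date_range("2016-01-01", periods=100, freq="h"))
train, val, test = chronological_split(toy)
print(len(train), len(val), len(test))
```

No shuffling is applied, so every validation row is later than every training row, and every test row is later than every validation row.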


**Milestone 2: Week 3-4**


**Module 3: Feature Engineering**

i. Extract relevant time-based features (hour, day, week, month trends).

In [None]:
# PART 1: TIME-BASED FEATURES
df_features = df_scaled.copy()

df_features["hour"] = df_features.index.hour
df_features["dayofweek"] = df_features.index.dayofweek   # 0=Monday
df_features["month"] = df_features.index.month

print("Time-based features added.")
df_features.head()
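
The heading also mentions day and week. A minimal sketch of how these could be read from a `DatetimeIndex` is shown below on made-up dates; the column names `day` and `week` are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical daily index that crosses a year boundary
idx = pd.date_range("2016-01-01", periods=5, freq="D")
toy = pd.DataFrame(index=idx)

# Day of the month (1-31)
toy["day"] = toy.index.day

# ISO week number; the first days of January can still belong to the last week of the previous year
toy["week"] = toy.index.isocalendar().week.astype(int)
print(toy)
```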

ii. Aggregate device-level consumption statistics.

In [None]:
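# PART 2: AGGREGATE DEVICE CONSUMPTION
# Sum only the original measurement columns; the hour/dayofweek/month features added above are not energy
measure_cols = df_scaled.columns.tolist()   # narrow this list if some columns are not device readings
df_features["total_energy"] = df_features[measure_cols].sum(axis=1)
df_features.head()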


iii. Create lag features and moving averages for time series learning.

In [None]:
# PART 3: LAG AND MOVING AVERAGE FEATURES
target_col = df_features.columns[0]  # change if needed
print("Target column:", target_col)

# Lag features (previous values)
for lag in [1, 6, 12, 24]:
    df_features[f"{target_col}_lag_{lag}"] = df_features[target_col].shift(lag)

# Rolling/moving averages over past values only
# (shift(1) keeps the current target value out of its own feature)
past_values = df_features[target_col].shift(1)
df_features["rolling_mean_6"] = past_values.rolling(6).mean()
df_features["rolling_mean_12"] = past_values.rolling(12).mean()
df_features["rolling_mean_24"] = past_values.rolling(24).mean()

# Drop rows created with NaN from shifting
df_features = df_features.dropna()

df_features.head()
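
To make the difference between a lag and a rolling mean concrete, the sketch below builds both on a made-up series. It also shows that a rolling window includes the current row unless the series is shifted first, which matters when the same column is the prediction target.

```python
import pandas as pd

# Hypothetical hourly target values
idx = pd.date_range("2016-01-01", periods=6, freq="h")
toy = pd.DataFrame({"use": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Value from one step earlier
toy["use_lag_1"] = toy["use"].shift(1)

# Mean of the current value and the two before it
toy["roll3_with_current"] = toy["use"].rolling(3).mean()

# Mean of the three previous values only
toy["roll3_past_only"] = toy["use"].shift(1).rolling(3).mean()
print(toy)
```

The first rows contain NaN because there is not enough history yet, which is why the rows are dropped afterwards.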


iv. Prepare final feature set for ML model input.

In [None]:

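# PART 4: FINAL ML FEATURE MATRIX
# total_energy is a same-hour sum that includes the target, so it is left out of the inputs
X = df_features.drop(columns=[target_col, "total_energy"], errors="ignore")
y = df_features[target_col]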

# Time-based splitting for model training
train_size = int(len(df_features) * 0.7)
val_size = int(len(df_features) * 0.15)

X_train = X.iloc[:train_size]
y_train = y.iloc[:train_size]

X_val = X.iloc[train_size:train_size+val_size]
y_val = y.iloc[train_size:train_size+val_size]

X_test = X.iloc[train_size+val_size:]
y_test = y.iloc[train_size+val_size:]

print("Training set:", X_train.shape)
print("Validation set:", X_val.shape)
print("Testing set:", X_test.shape)
