# **Energy Consumption Data Exploration and Visualization**
This project explores energy consumption data from households, using visualization and feature engineering to derive insights. The analysis includes initial data exploration, handling missing values, feature engineering, and detailed visualizations for data understanding.
# Dataset Overview
The dataset records electrical power consumption over time, with columns for:

1. Date and Time - Timestamps of measurements.
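2. Global Active Power - Minute-averaged active power drawn by the household, i.e. the power actually consumed by appliances (in kilowatts).
3. Global Reactive Power - Minute-averaged reactive power, the portion that flows back and forth between the supply and reactive loads such as motors without doing useful work (recorded in kilowatts in the dataset).
4. Voltage - Voltage measurements (in volts).
5. Global Intensity - Current intensity (in amperes).
6. Sub Metering 1, 2, 3 - Active energy readings for specific sub-meters (in watt-hours).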
# Step 1: Problem Definition and Objective
The goal of this project is to explore and analyze household electricity consumption, identify trends and patterns, and provide insights that help energy providers and policymakers plan and manage supply. The analysis is data-driven, focusing on visualizations and feature encoding.


---
# Step 2: Import Libraries and Load the Dataset
We start by importing essential Python libraries for data handling, visualization, and statistical analysis. Each library has a specific purpose:

1. pandas - Used for data manipulation and preprocessing.
2. numpy - Helpful for numerical operations.
3. matplotlib and seaborn - Libraries for generating informative plots.




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

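# Load dataset
# The URL is quoted so the shell does not try to interpret the '?' character
!gdown --fuzzy "https://drive.google.com/file/d/1bvaXJJqNObOCkX-i475BNxpidk024pyx/view?usp=sharing"
# Fields are separated by semicolons; low_memory=False reads each column in one pass for consistent dtypes
data = pd.read_csv('/content/household_power_consumption.txt', sep=";", low_memory=False)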


This code imports essential libraries for data handling (pandas, numpy) and visualization (matplotlib, seaborn). The dataset is loaded using the read_csv function, where the sep=";" argument specifies the delimiter.
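
If the raw file marks missing readings with a placeholder such as `?` (an assumption about this file; it is worth checking a few raw lines first), `read_csv` can treat that placeholder as missing at load time, so the measurement columns are parsed as floats straight away. A minimal sketch on a three-row inline sample rather than the real file:

```python
import io
import pandas as pd

# Tiny inline sample in the same layout as the dataset. The '?' placeholder
# for missing readings is an assumption to verify against the real file.
sample = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
    "16/12/2006;17:26:00;5.374;233.290\n"
)

# na_values turns '?' into NaN, so the numeric columns load as float64
sample_df = pd.read_csv(sample, sep=";", na_values=["?"])
print(sample_df.dtypes)
print(sample_df.isnull().sum())
```

Loading this way makes the later missing-value checks reflect the placeholder rows as well.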

# Step 3: Displaying Basic Information About the Data
Displaying the First and Last Few Rows of the Dataset
To gain an initial understanding, we'll view the first and last few records in the dataset. This provides insight into data structure, types of values, and potential irregularities.


In [None]:
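# A notebook cell only renders its last expression, so display() is used to show both tables
display(data.head())  # View first 5 rows
display(data.tail())  # View last 5 rows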


The .head() and .tail() functions display the first and last 5 records of the dataset, providing an initial glance at the data structure and contents.

# Basic Dataset Metrics

In [None]:
# Display the number of rows and columns in the dataset
data.shape

# Check Dataset Dimensions and Column Types

In [None]:
print("Dataset Dimensions:", data.shape)
data.info()  # Overview of columns and data types

The shape attribute displays the number of rows and columns, while .info() provides column types and non-null counts, helping identify any potential data type issues or missing values.

# Statistical Summary of Numerical Data

In [None]:
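# display() is needed because only the last expression in a cell is rendered
display(data.describe())  # Summary of numerical columns
display(data.describe(include='object'))  # Summary of non-numeric (object) columns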

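The describe() method summarizes numerical columns with the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum. The include='object' argument generates a summary of non-numeric columns instead: count, number of unique values, the most frequent value, and its frequency. Columns that have not yet been converted to a numeric type appear only in the second summary.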
# Step 4: Handling Missing Values
Checking for Null Values

In [None]:
# Check for missing values
data.isnull().any()

In [None]:
print("Missing Values Count:\n", data.isnull().sum())
missing_percentage = (data.isnull().mean() * 100).round(2)
print("Missing Values Percentage:\n", missing_percentage)

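These commands calculate missing values per column and their respective percentages, which help assess the extent of missing data in each feature. Values stored as non-numeric text are not counted here; they only become NaN once the columns are converted to numeric types.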
# Convert Data Types

In [None]:
columns_to_convert = ['Global_active_power', 'Global_reactive_power', 'Voltage',
                      'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']
data[columns_to_convert] = data[columns_to_convert].apply(pd.to_numeric, errors='coerce')
data.info()  # Confirm data type conversion


The pd.to_numeric function with errors='coerce' converts the specified columns to float and sets any value that cannot be parsed as a number to NaN, so the missing-value counts may increase after this step. Confirming the conversion with .info() ensures these columns are now suitable for numerical analysis.
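
The effect of `errors='coerce'` is easiest to see on a handful of values. The sketch below uses a made-up series in which `?` stands in for an unparseable entry; the placeholder is an assumption for illustration.

```python
import pandas as pd

# Hypothetical raw values as they might look before conversion
raw = pd.Series(["4.216", "?", "5.374"])

# Entries that cannot be parsed as numbers become NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")
print(converted)
print("Missing after conversion:", converted.isnull().sum())
```

Comparing `isnull().sum()` before and after the conversion shows how many values were coerced.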

# Fill Missing Values

In [None]:
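# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['Sub_metering_3'] = data['Sub_metering_3'].fillna(data['Sub_metering_3'].mean())
data.isnull().sum()  # Check which columns still contain missing values

Missing values in the Sub_metering_3 column are filled with the column mean using .fillna(). This step handles data gaps without dropping rows, preserving dataset size. Columns that gained NaN values during type conversion are not affected by this fill and still show up in the count.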

In [None]:
# Display statistical summary for numerical columns
data.describe()


# Handling Null Values
Replacing null values in Sub_metering_3 with the column's mean.

In [None]:
# Fill null values in 'Sub_metering_3' with the mean value
average = data['Sub_metering_3'].mean()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['Sub_metering_3'] = data['Sub_metering_3'].fillna(average)
data.isnull().sum()  # Check again for null values

All missing values in Sub_metering_3 are replaced with the mean, so this column no longer contains nulls. Any NaN values in the other columns remain and need to be handled separately.
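
If other numeric columns also contain NaN, the same mean-fill idea can be applied to all of them at once. This is a sketch on a made-up frame, not the notebook's own step, and whether mean imputation suits every column is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the converted dataset (values are made up)
toy_fill = pd.DataFrame({
    "Voltage": [240.0, np.nan, 236.0],
    "Sub_metering_3": [17.0, 16.0, np.nan],
})

# Select the numeric columns and fill each one with its own column mean
numeric_cols = toy_fill.select_dtypes(include="number").columns
toy_fill[numeric_cols] = toy_fill[numeric_cols].fillna(toy_fill[numeric_cols].mean())

print(toy_fill)
print(toy_fill.isnull().sum())
```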

# Step 5: Converting Data Types
Convert Object Data Types to Numeric
Convert the measurement columns from object to float (all columns except Date and Time) so that they can be used in mathematical operations and visualizations.

In [None]:
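# Dates in this dataset are day-first (dd/mm/yyyy), so the format is given explicitly
data['DateTime'] = pd.to_datetime(data['Date'] + ' ' + data['Time'], format='%d/%m/%Y %H:%M:%S')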
data['Year'] = data['DateTime'].dt.year
data['Month'] = data['DateTime'].dt.month
data['Day'] = data['DateTime'].dt.day
data['Hour'] = data['DateTime'].dt.hour
data['Minute'] = data['DateTime'].dt.minute


In [None]:
# Convert object data types to numeric (except 'Date' and 'Time')
data['Global_active_power'] = pd.to_numeric(data['Global_active_power'], errors='coerce')
data['Global_reactive_power'] = pd.to_numeric(data['Global_reactive_power'], errors='coerce')
data['Voltage'] = pd.to_numeric(data['Voltage'], errors='coerce')
data['Global_intensity'] = pd.to_numeric(data['Global_intensity'], errors='coerce')
data['Sub_metering_1'] = pd.to_numeric(data['Sub_metering_1'], errors='coerce')
data['Sub_metering_2'] = pd.to_numeric(data['Sub_metering_2'], errors='coerce')

# Verify changes in data types
data.info()


Combining the Date and Time columns into a single DateTime feature enables further extraction of Year, Month, Day, Hour, and Minute. These new columns help analyze seasonal and hourly usage trends.
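
A date string such as `1/2/2007` is ambiguous unless the parser knows the order of day and month. Assuming the dates are written day-first (dd/mm/yyyy), an explicit format removes the ambiguity. The two rows below are made up for illustration.

```python
import pandas as pd

# Two made-up rows; dates are assumed to be day-first (dd/mm/yyyy)
demo = pd.DataFrame({"Date": ["1/2/2007", "16/12/2006"],
                     "Time": ["00:05:00", "17:24:00"]})

# An explicit format avoids ambiguous parsing and is faster than inference
demo["DateTime"] = pd.to_datetime(demo["Date"] + " " + demo["Time"],
                                  format="%d/%m/%Y %H:%M:%S")
demo["Month"] = demo["DateTime"].dt.month
demo["Hour"] = demo["DateTime"].dt.hour
print(demo)
```

With the explicit format, the first row is read as 1 February rather than 2 January.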

# Identifying Weekends and Sunlight Hours

In [None]:
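def is_holiday(date):
    # Dates are day-first (dd/mm/yyyy); weekday() returns 5 for Saturday and 6 for Sunday
    return 1.0 if pd.to_datetime(date, format='%d/%m/%Y').weekday() >= 5 else 0.0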

data['Is_holiday'] = data['Date'].apply(is_holiday)
data['Light'] = data['Hour'].apply(lambda x: 1.0 if 6 <= x < 18 else 0.0)

The is_holiday function identifies weekends (Saturdays and Sundays) as holidays. Additionally, the Light feature categorizes hours from 6 AM to 6 PM as daylight, assisting in determining electricity usage patterns relative to sunlight.
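
Applying a Python function row by row can be slow on a large table. A vectorized alternative computes both flags directly from a datetime column. This is a sketch with made-up timestamps; the `stamps` frame is hypothetical.

```python
import pandas as pd

# Made-up timestamps: a Saturday morning and a Monday evening
stamps = pd.DataFrame({"DateTime": pd.to_datetime(
    ["2006-12-16 09:00:00", "2006-12-18 20:00:00"])})

# dt.weekday runs from Monday=0 to Sunday=6, so >= 5 means Saturday or Sunday
stamps["Is_holiday"] = (stamps["DateTime"].dt.weekday >= 5).astype(float)

# Hours 6 to 17 inclusive count as daylight, matching 6 <= hour < 18
stamps["Light"] = stamps["DateTime"].dt.hour.between(6, 17).astype(float)
print(stamps)
```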


# Step 6: Data Visualization
Distribution of Voltage
Plotting the distribution to understand voltage levels' frequency across different ranges.



In [None]:
plt.figure(figsize=(10, 6))
plt.hist(data['Voltage'], bins=50, color='skyblue', edgecolor='black')
plt.xlabel('Voltage (V)')
plt.ylabel('Frequency')
plt.title('Distribution of Voltage')
plt.show()

 Most voltage readings fall between 200 and 260 volts, indicating stable electricity supply levels.
# Relationship between Voltage and Global Active Power
A scatter plot to examine the relationship between voltage and active power.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Voltage', y='Global_active_power', data=data, color='purple')
plt.xlabel('Voltage (V)')
plt.ylabel('Global Active Power (kW)')
plt.title('Voltage vs. Global Active Power')
plt.show()


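The scatter plot shows how active power is spread across voltage levels. A trend in a plot this dense is hard to judge by eye, and because supply voltage tends to sag slightly as load increases, the direction of the relationship should be confirmed with the correlation coefficient between Voltage and Global_active_power.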
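
A visual impression of correlation can be checked numerically with the Pearson coefficient. The readings below are made up purely to show the call and say nothing about the real dataset.

```python
import pandas as pd

# Made-up readings for illustration only; not taken from the dataset
readings = pd.DataFrame({
    "Voltage": [241.0, 239.5, 238.0, 236.5, 235.0],
    "Global_active_power": [0.4, 1.1, 1.9, 3.2, 4.5],
})

# Series.corr computes the Pearson coefficient and skips pairs containing NaN
r = readings["Voltage"].corr(readings["Global_active_power"])
print(f"Pearson r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship, and the sign gives its direction.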
# Total Metering Usage
Visualizing the total usage across the three sub-metering areas.

In [None]:
# Sum up each sub_metering column
metering_data = data[['Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']].sum()

# Pie chart of total meter readings
plt.figure(figsize=(8, 8))
plt.pie(metering_data, labels=metering_data.index, autopct='%1.1f%%', startangle=90, explode=(0.1, 0, 0))
plt.title('Total Energy Consumption per Sub Meter')
plt.show()


Sub_metering_3 accounts for the highest usage, indicating it might represent the most energy-intensive area or activity.
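
The three sub-meters do not necessarily cover the whole household. Assuming Global_active_power is a one-minute average in kilowatts and the sub-meter columns are watt-hours per minute, multiplying the power by 1000/60 gives watt-hours per minute, and subtracting the three sub-meter readings leaves the energy used by equipment not on any sub-meter. The values and the `Other_consumption` column name below are hypothetical.

```python
import pandas as pd

unmetered = pd.DataFrame({
    "Global_active_power": [4.2, 0.6],   # kW, averaged over one minute
    "Sub_metering_1": [0.0, 0.0],        # Wh
    "Sub_metering_2": [1.0, 2.0],        # Wh
    "Sub_metering_3": [17.0, 0.0],       # Wh
})

# kW averaged over one minute -> Wh: multiply by 1000 W/kW and by 1/60 h
total_wh = unmetered["Global_active_power"] * 1000 / 60
unmetered["Other_consumption"] = (total_wh
                                  - unmetered["Sub_metering_1"]
                                  - unmetered["Sub_metering_2"]
                                  - unmetered["Sub_metering_3"])
print(unmetered)
```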
# Pair Plot
Pair plots provide insight into relationships between multiple variables.

In [None]:
sns.pairplot(data[['Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity']])
plt.show()


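The pair plot makes near-linear relationships easy to spot. Global_active_power and Global_intensity should track each other closely, since active power is approximately voltage times current and the voltage varies little. The relationship between Global_active_power and Global_reactive_power is harder to read visually and is better judged alongside the correlation heatmap.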

# Step 7: Encoding and Feature Engineering
Date and Time Feature Engineering
Convert Date and Time to separate components for year, month, day, and hour to capture time-based patterns.



In [None]:
# Convert 'Date' and 'Time' into a single DateTime column
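# Dates in this dataset are day-first (dd/mm/yyyy), so the format is given explicitly
data['DateTime'] = pd.to_datetime(data['Date'] + ' ' + data['Time'], format='%d/%m/%Y %H:%M:%S')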

# Extract year, month, day, hour, and minute
data['Year'] = data['DateTime'].dt.year
data['Month'] = data['DateTime'].dt.month
data['Day'] = data['DateTime'].dt.day
data['Hour'] = data['DateTime'].dt.hour
data['Minute'] = data['DateTime'].dt.minute
data.head()


Adding these new time-based features can help in identifying patterns and seasonal trends.
# Public Holiday and Daylight Encoding
Identify weekends as public holidays and assign a binary value for daylight.

In [None]:
# Function to identify weekends as holidays
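def is_holiday(date):
    # Dates are day-first (dd/mm/yyyy); weekday() returns 5 for Saturday and 6 for Sunday
    return 1.0 if pd.to_datetime(date, format='%d/%m/%Y').weekday() >= 5 else 0.0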

# Apply holiday and daylight encoding
data['Is_holiday'] = data['Date'].apply(is_holiday)
data['Light'] = data['Hour'].apply(lambda x: 1 if 6 <= x < 18 else 0)


These features could help capture variations in consumption based on holiday schedules and natural daylight availability.

# Time Series Analysis of Global Active Power
A time series line plot helps observe how electricity consumption (global active power) fluctuates over time. This can reveal daily, weekly, and seasonal patterns.

In [None]:
plt.figure(figsize=(14, 6))
plt.plot(data['DateTime'], data['Global_active_power'], color='darkblue', linewidth=0.5)
plt.title('Time Series of Global Active Power')
plt.xlabel('DateTime')
plt.ylabel('Global Active Power (kW)')
plt.show()


This plot provides a clear visualization of energy consumption trends over time. Peaks and valleys may correspond to higher usage during specific times (e.g., evenings or weekends) or holidays.
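
Minute-level data produces a very dense line, so resampling to a coarser frequency before plotting often makes trends easier to see. The sketch below uses a synthetic two-day series in place of the real column.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level series covering two days (values are made up)
idx = pd.date_range("2007-01-01", periods=2 * 24 * 60, freq="min")
series = pd.Series(np.linspace(0.5, 2.5, len(idx)), index=idx,
                   name="Global_active_power")

# Resample to one value per day; mean() skips NaN readings by default
daily_mean = series.resample("D").mean()
print(daily_mean)
```

The same pattern applies to the real data once DateTime is set as the index.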

# Hourly Electricity Consumption Pattern
Plotting average global active power for each hour of the day can show which times of day have higher energy usage.

In [None]:
hourly_usage = data.groupby('Hour')['Global_active_power'].mean()

plt.figure(figsize=(10, 6))
plt.plot(hourly_usage.index, hourly_usage.values, color='orange', marker='o')
plt.title('Average Hourly Electricity Consumption')
plt.xlabel('Hour of Day')
plt.ylabel('Average Global Active Power (kW)')
plt.grid()
plt.show()


This graph can show whether electricity usage is higher during specific hours, like early morning or evening, which is helpful for peak demand analysis.
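
The peak hour can also be read off programmatically instead of from the chart. A minimal sketch with made-up readings:

```python
import pandas as pd

# Made-up readings tagged with their hour of day
hour_demo = pd.DataFrame({
    "Hour": [7, 7, 13, 13, 20, 20],
    "Global_active_power": [1.2, 1.4, 0.6, 0.8, 2.1, 2.5],
})

hourly = hour_demo.groupby("Hour")["Global_active_power"].mean()
peak_hour = hourly.idxmax()  # index label (hour) with the largest mean
print(f"Peak hour: {peak_hour}, mean load: {hourly.max():.2f} kW")
```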



# Monthly Global Active Power Consumption
This bar plot illustrates the total consumption of electricity (global active power) by month, helping identify seasonal usage patterns.

In [None]:
monthly_usage = data.groupby('Month')['Global_active_power'].sum()

plt.figure(figsize=(10, 6))
plt.bar(monthly_usage.index, monthly_usage.values, color='skyblue')
plt.title('Total Monthly Global Active Power Consumption')
plt.xlabel('Month')
plt.ylabel('Total Global Active Power (kW)')
plt.xticks(range(1, 13))
plt.show()


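This plot highlights seasonal variations. For example, higher values in winter could indicate increased usage due to heating, while summer peaks may relate to air conditioning. Because each reading is a one-minute average in kilowatts, the summed values are proportional to energy: dividing a monthly sum by 60 gives kilowatt-hours.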
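
Summing minute-averaged kilowatt readings does not directly give energy. Assuming one reading per minute, each reading covers 1/60 of an hour, so the sum divided by 60 is energy in kilowatt-hours. The numbers below are made up.

```python
import pandas as pd

# Made-up minute-level readings in kW across two months
month_demo = pd.DataFrame({
    "Month": [1, 1, 1, 7, 7],
    "Global_active_power": [3.0, 3.0, 6.0, 1.2, 1.8],
})

# Each reading covers 1/60 hour, so kWh = sum of kW readings / 60
monthly_kwh = month_demo.groupby("Month")["Global_active_power"].sum() / 60
print(monthly_kwh)
```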

# Box Plot of Voltage by Hour
This box plot shows the distribution of voltage for each hour of the day, indicating any potential variability in voltage throughout the day.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='Hour', y='Voltage', data=data, palette='coolwarm')
plt.title('Voltage Distribution by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Voltage (V)')
plt.show()


This plot provides insights into the stability of voltage throughout the day. Sudden fluctuations could hint at demand changes or irregularities in supply.

# Correlation Heatmap
A heatmap can visualize correlations among all numerical variables, helping identify strong relationships between energy features like power, voltage, and sub-metering.



In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(data[['Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity',
                 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']].corr(),
            annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Energy Features')
plt.show()

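The heatmap reports the pairwise Pearson correlations among these features. The strongest relationship to expect is between Global_active_power and Global_intensity, because power is approximately voltage times current. The coefficient for global active versus reactive power shows how closely those two move together, which can be useful in understanding energy dynamics.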
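
To list the strongest relationships without reading them off the heatmap, the correlation matrix can be flattened and ranked by absolute value. The frame below is made up, with Global_intensity deliberately built to track Global_active_power.

```python
import numpy as np
import pandas as pd

# Made-up values; Global_intensity is constructed to follow Global_active_power
corr_demo = pd.DataFrame({
    "Global_active_power": [0.5, 1.0, 2.0, 3.5, 4.0],
    "Global_intensity": [2.1, 4.2, 8.5, 14.8, 17.0],
    "Voltage": [241.0, 240.0, 238.5, 239.0, 236.0],
})

corr = corr_demo.corr()
# Keep only the upper triangle so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# Rank pairs by the absolute value of their correlation
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```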

# Stacked Area Plot for Sub-metering Over Time
A stacked area plot shows cumulative sub-metering values over time, making it easy to see each sub-meter's contribution to total energy usage.

In [None]:
plt.figure(figsize=(14, 6))
plt.stackplot(data['DateTime'],
              data['Sub_metering_1'],
              data['Sub_metering_2'],
              data['Sub_metering_3'],
              labels=['Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3'],
              colors=['#ff9999', '#66b3ff', '#99ff99'])
plt.legend(loc='upper left')
plt.title('Sub-metering Energy Usage Over Time')
plt.xlabel('DateTime')
plt.ylabel('Energy Usage (Wh)')
plt.show()


This plot reveals which areas are consuming more energy and how each sub-metered area contributes to total energy usage over time, providing insights into possible energy-saving interventions.
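
Each sub-meter's contribution can also be expressed as a share of the daily metered total, which is lighter to plot than minute-level data. This is a sketch on synthetic readings, assuming one reading per minute.

```python
import pandas as pd

# Synthetic minute-level sub-meter readings over two days (made-up constants)
idx = pd.date_range("2007-01-01", periods=2 * 24 * 60, freq="min")
sub = pd.DataFrame({"Sub_metering_1": 0.0,
                    "Sub_metering_2": 1.0,
                    "Sub_metering_3": 3.0}, index=idx)

# Daily totals in Wh, then each meter's share of that day's metered total
daily = sub.resample("D").sum()
daily_share = daily.div(daily.sum(axis=1), axis=0)
print(daily_share)
```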