 **Household Power Consumption Dataset**

**Data Description**
This dataset contains measurements of household power consumption recorded over time, focusing on energy usage to predict future consumption patterns.

**Column Descriptions:**
**Date:** The date of the measurement in the format DD/MM/YYYY.  
**Time:** The time of the measurement in the format HH:MM:SS.  
**Global_active_power:** The total active power consumed by the household, measured in kilowatts (kW).  
**Global_reactive_power:** The total reactive power, which represents the power that does no useful work, measured in kilovolt-amperes reactive (kVAR).  
**Voltage:** The electrical potential measured at the household, in volts (V).  
**Global_intensity:** The overall current intensity flowing in the household, measured in amperes (A).  
**Sub_metering_1:** The energy consumption for kitchen appliances, measured in watt-hours (Wh).  
**Sub_metering_2:** The energy consumption for laundry appliances, measured in watt-hours (Wh).  
**Sub_metering_3:** The energy consumption for other appliances (like air conditioning), measured in watt-hours (Wh).
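
The Date and Time formats described above can be combined into a single timestamp, which is usually more convenient for time-series work. The following is a minimal sketch on two illustrative rows, not a step from the original notebook; the `sample` frame and the `timestamp` column name are assumptions made for the example.

```python
import pandas as pd

# Two illustrative rows that follow the documented DD/MM/YYYY and HH:MM:SS formats.
sample = pd.DataFrame({
    "Date": ["16/12/2006", "16/12/2006"],
    "Time": ["17:24:00", "17:25:00"],
})

# Join the two text columns and parse them with an explicit day-first format,
# so that 16/12/2006 is read as 16 December rather than guessed.
sample["timestamp"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"],
    format="%d/%m/%Y %H:%M:%S",
)
print(sample["timestamp"])
```

Passing an explicit `format` avoids ambiguous day/month parsing and is typically faster than letting pandas infer the format on millions of rows.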

**Import pandas for loading and manipulating the data, and NumPy for numerical operations.**

In [46]:
import pandas as pd
import numpy as np

**Load the data, using a semicolon (`;`) as the field separator**

In [47]:
df = pd.read_csv('household_power_consumption.txt',sep=";")
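
As a possible variation on the load step, `read_csv` can be told up front which marker denotes a missing reading, so the measurement columns are parsed as numbers straight away. The sketch below uses a tiny in-memory stand-in for the file; the second row and its `?` marker are illustrative assumptions about how a missing reading is laid out, not lines copied from the dataset.

```python
import io
import pandas as pd

# Tiny stand-in for household_power_consumption.txt:
# semicolon-separated, with "?" marking a missing reading (assumed layout).
raw = io.StringIO(
    "Date;Time;Global_active_power;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;17.0\n"
    "16/12/2006;17:25:00;?;\n"
)

# na_values turns "?" into NaN while parsing; low_memory=False makes pandas
# infer each column's dtype from the whole file instead of chunk by chunk.
sample = pd.read_csv(raw, sep=";", na_values=["?"], low_memory=False)
print(sample.dtypes)
```

With this approach the measurement columns arrive as `float64`, so a later conversion step is not strictly needed.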


**`df.shape` returns the dimensions of the dataset: 2075259 rows and 9 columns**

In [49]:
df.shape

(2075259, 9)

**head() shows first 5 rows of the dataset**

In [50]:
df.head()

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


**tail() shows last 5 rows of the dataset**

In [51]:
df.tail()

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
2075254,26/11/2010,20:58:00,0.946,0.0,240.43,4.0,0.0,0.0,0.0
2075255,26/11/2010,20:59:00,0.944,0.0,240.0,4.0,0.0,0.0,0.0
2075256,26/11/2010,21:00:00,0.938,0.0,239.82,3.8,0.0,0.0,0.0
2075257,26/11/2010,21:01:00,0.934,0.0,239.7,3.8,0.0,0.0,0.0
2075258,26/11/2010,21:02:00,0.932,0.0,239.55,3.8,0.0,0.0,0.0


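**`describe()` gives a statistical summary of the numerical columns: count, mean, standard deviation, min, quartiles, and max. At this stage only `Sub_metering_3` is numeric, so it is the only column summarized**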

In [52]:
df.describe()

Unnamed: 0,Sub_metering_3
count,2049280.0
mean,6.458447
std,8.437154
min,0.0
25%,0.0
50%,1.0
75%,17.0
max,31.0


**The `df.info()` method prints a concise summary of the DataFrame: the number of entries, each column's data type, and the memory usage. Here it shows that eight of the nine columns are stored as `object` rather than as numbers**

In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    object 
 3   Global_reactive_power  object 
 4   Voltage                object 
 5   Global_intensity       object 
 6   Sub_metering_1         object 
 7   Sub_metering_2         object 
 8   Sub_metering_3         float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB
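
The summary above lists dtypes but no non-null counts, which appears to be the default pandas behaviour for very large frames. A small sketch of how the counts can be requested explicitly with the `show_counts` parameter, run here on an illustrative three-row frame (the `sample` data is an assumption made for the example):

```python
import io
import numpy as np
import pandas as pd

# Illustrative frame: one text column with a "?" placeholder, one float column with a NaN.
sample = pd.DataFrame({
    "Voltage": ["234.84", "?", "233.29"],
    "Sub_metering_3": [17.0, np.nan, 17.0],
})

# show_counts=True forces the "Non-Null Count" column regardless of frame size.
# The summary is written to a buffer so it can be inspected as a string.
buffer = io.StringIO()
sample.info(buf=buffer, show_counts=True)
summary = buffer.getvalue()
print(summary)
```

On the full dataset the equivalent call would be `df.info(show_counts=True)`.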


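**`describe(include=object)` summarizes the non-numeric columns: count, number of unique values, most frequent value (`top`), and its frequency (`freq`). The `?` shown as the most frequent value in some columns is a placeholder for missing readings**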

In [54]:
df.describe(include = object)

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2
count,2075259,2075259,2075259,2075259.0,2075259,2075259.0,2075259.0,2075259.0
unique,1442,1440,6534,896.0,5168,377.0,153.0,145.0
top,6/12/2008,17:24:00,?,0.0,?,1.0,0.0,0.0
freq,1440,1442,25979,472786.0,25979,169406.0,1840611.0,1408274.0


**`df.isnull().any()` checks whether each column contains any missing (NaN) values. The `?` placeholders are ordinary strings, so they are not reported here**

In [55]:
df.isnull().any()

Date                     False
Time                     False
Global_active_power      False
Global_reactive_power    False
Voltage                  False
Global_intensity         False
Sub_metering_1           False
Sub_metering_2           False
Sub_metering_3            True
dtype: bool

**`df.isnull().sum()` calculates the total number of missing values in each column.**

In [56]:
df.isnull().sum()

Date                         0
Time                         0
Global_active_power          0
Global_reactive_power        0
Voltage                      0
Global_intensity             0
Sub_metering_1               0
Sub_metering_2               0
Sub_metering_3           25979
dtype: int64
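
Because `isnull()` only recognizes real NaN values, text placeholders such as `?` have to be counted separately. A minimal sketch on illustrative data (the `sample` frame is an assumption made for the example, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative frame: text columns use "?" for a missing reading,
# while the float column uses a real NaN.
sample = pd.DataFrame({
    "Global_active_power": ["4.216", "?", "5.374"],
    "Voltage": ["234.84", "?", "233.29"],
    "Sub_metering_3": [17.0, np.nan, 17.0],
})

nan_counts = sample.isnull().sum()            # real NaN values only
placeholder_counts = sample.eq("?").sum()     # "?" strings only

print(pd.DataFrame({"NaN": nan_counts, "'?'": placeholder_counts}))
```

Run against the full DataFrame, `df.eq("?").sum()` would be expected to report the placeholder count for each `object` column.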

**Percentage of missing values in each column**

In [57]:
null_percentage = (df.isnull().sum() / len(df)) * 100
print(null_percentage)

Date                     0.000000
Time                     0.000000
Global_active_power      0.000000
Global_reactive_power    0.000000
Voltage                  0.000000
Global_intensity         0.000000
Sub_metering_1           0.000000
Sub_metering_2           0.000000
Sub_metering_3           1.251844
dtype: float64
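
The same percentage can be written a little more compactly, since the mean of a boolean mask is the fraction of `True` values. A short sketch on illustrative data (the `sample` frame is an assumption made for the example):

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value out of four rows.
sample = pd.DataFrame({
    "Voltage": [234.84, 233.63, 233.29, 233.74],
    "Sub_metering_3": [17.0, np.nan, 17.0, 17.0],
})

# Explicit form: count of nulls divided by the number of rows.
by_division = sample.isnull().sum() / len(sample) * 100

# Compact form: the mean of the boolean mask is the null fraction.
by_mean = sample.isnull().mean() * 100

print(by_mean)
```

Both forms give the same result; the compact one simply avoids the separate `len()` call.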


**`nunique()` returns the number of unique values in each column**

In [58]:
# Count the distinct values in each column
unique_counts = df.nunique()
print("Number of unique values in each column:")
print(unique_counts)

Number of unique values in each column:
Date                     1442
Time                     1440
Global_active_power      6534
Global_reactive_power     896
Voltage                  5168
Global_intensity          377
Sub_metering_1            153
Sub_metering_2            145
Sub_metering_3             32
dtype: int64


**The `dropna()` function returns a new DataFrame without the rows that contain any null (NaN) values; the original `df` is left unchanged**

In [59]:
df_cleaned = df.dropna()
print("DataFrame after removing rows with null values:")
print(df_cleaned)

DataFrame after removing rows with null values:
               Date      Time Global_active_power Global_reactive_power  \
0        16/12/2006  17:24:00               4.216                 0.418   
1        16/12/2006  17:25:00               5.360                 0.436   
2        16/12/2006  17:26:00               5.374                 0.498   
3        16/12/2006  17:27:00               5.388                 0.502   
4        16/12/2006  17:28:00               3.666                 0.528   
...             ...       ...                 ...                   ...   
2075254  26/11/2010  20:58:00               0.946                   0.0   
2075255  26/11/2010  20:59:00               0.944                   0.0   
2075256  26/11/2010  21:00:00               0.938                   0.0   
2075257  26/11/2010  21:01:00               0.934                   0.0   
2075258  26/11/2010  21:02:00               0.932                   0.0   

         Voltage Global_intensity Sub_metering_1 Su
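
Note that `dropna()` only acts on real NaN values. In this dataset the `?` placeholders appear to fall in the same rows as the NaN values of `Sub_metering_3` (both are reported 25979 times), so they are removed together, but that is a property of the data rather than of `dropna()`. The sketch below uses a deliberately hypothetical row in which a `?` is not accompanied by a NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the last one has a "?" placeholder but no NaN,
# which is NOT how the real dataset seems to behave.
sample = pd.DataFrame({
    "Global_active_power": ["4.216", "?", "?"],
    "Sub_metering_3": [17.0, np.nan, 16.0],
})

cleaned = sample.dropna()

# The row with a NaN is gone, but the "?"-only row survives.
print(cleaned)
```

Converting the placeholders to NaN before calling `dropna()` avoids relying on that coincidence.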

**The `convert_dtypes()` function converts each column to the best-fitting pandas extension dtype (e.g., string or nullable integer). It does not parse text into numbers, so the measurement columns that still contain `?` remain non-numeric**

In [60]:
df_converted = df.convert_dtypes()
print(df_converted.dtypes)

Date                     string[python]
Time                     string[python]
Global_active_power              object
Global_reactive_power            object
Voltage                          object
Global_intensity                 object
Sub_metering_1                   object
Sub_metering_2                   object
Sub_metering_3                    Int64
dtype: object
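
The behaviour seen above can be reproduced on a small frame. This is a minimal sketch on illustrative data (the `sample` frame is an assumption made for the example):

```python
import numpy as np
import pandas as pd

# Illustrative frame: numbers stored as text (with a "?" placeholder)
# and a float column whose non-missing values are all whole numbers.
sample = pd.DataFrame({
    "Voltage": ["234.84", "?", "233.29"],
    "Sub_metering_3": [17.0, np.nan, 16.0],
})

converted = sample.convert_dtypes()

# Text stays text; whole-number floats become the nullable Int64 dtype.
print(converted.dtypes)
```

This is why `Sub_metering_3` is reported as `Int64` while the text columns are not turned into floats.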


**`apply(pd.to_numeric, errors='coerce')` attempts to convert specified columns to numeric types, coercing any non-numeric values to NaN**

In [61]:
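# Every column except Date and Time holds a measurement and should be numeric
columns_to_convert = df.columns.difference(['Date', 'Time'])
# errors='coerce' turns unparseable entries such as '?' into NaN
df[columns_to_convert] = df[columns_to_convert].apply(pd.to_numeric, errors='coerce')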

In [62]:
print(df.dtypes)

Date                      object
Time                      object
Global_active_power      float64
Global_reactive_power    float64
Voltage                  float64
Global_intensity         float64
Sub_metering_1           float64
Sub_metering_2           float64
Sub_metering_3           float64
dtype: object
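
The effect of `errors='coerce'` is easiest to see on a few rows. A minimal sketch on illustrative data (the `sample` frame is an assumption made for the example):

```python
import pandas as pd

# Illustrative frame: measurements stored as text, with "?" for a missing reading.
sample = pd.DataFrame({
    "Date": ["16/12/2006", "16/12/2006", "16/12/2006"],
    "Time": ["17:24:00", "17:25:00", "17:26:00"],
    "Global_active_power": ["4.216", "?", "5.374"],
    "Voltage": ["234.84", "?", "233.29"],
})

# Convert everything except the Date and Time columns.
columns_to_convert = sample.columns.difference(["Date", "Time"])
sample[columns_to_convert] = sample[columns_to_convert].apply(
    pd.to_numeric, errors="coerce"
)

print(sample.dtypes)
print(sample.isnull().sum())
```

After the conversion the `?` entries show up as NaN, so `isnull()` and `dropna()` treat them like any other missing value.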


**After converting the measurement columns to float, `describe()` summarizes all seven of them. Each count is 2049280, i.e. the 2075259 rows minus the 25979 rows whose `?` placeholders were coerced to NaN**

In [63]:
df.describe()

Unnamed: 0,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
count,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0
mean,1.091615,0.1237145,240.8399,4.627759,1.121923,1.29852,6.458447
std,1.057294,0.112722,3.239987,4.444396,6.153031,5.822026,8.437154
min,0.076,0.0,223.2,0.2,0.0,0.0,0.0
25%,0.308,0.048,238.99,1.4,0.0,0.0,0.0
50%,0.602,0.1,241.01,2.6,0.0,0.0,1.0
75%,1.528,0.194,242.89,6.4,0.0,1.0,17.0
max,11.122,1.39,254.15,48.4,88.0,80.0,31.0


In [64]:
df.shape

(2075259, 9)

**`df.describe(include=object)` displays summary statistics for columns of type object. After the conversion it shows only the "Date" and "Time" columns, because the remaining columns are now float64**

In [65]:
df.describe(include = object)

Unnamed: 0,Date,Time
count,2075259,2075259
unique,1442,1440
top,6/12/2008,17:24:00
freq,1440,1442
