<h1 align = 'center'>EDA, FE and Regression Models</h1><br>Household Power Consumption Dataset

### 1. EDA and FE
1. Data Profiling
2. Statistical Analysis
3. Graphical Analysis
4. Data Cleaning
5. Data Scaling

### 2. Models 
1. Linear Regression 
2. Ridge Regression
3. Lasso Regression
4. Elastic-Net Regression
5. Support Vector Regressor
6. Decision Tree Regressor
7. Random Forest Regressor
8. Bagging Regressor
9. Extra Tree Regressor
10. AdaBoost Regressor
11. Voting Regressor
12. GradientBoost Regressor
13. XGBoost Regressor

### 3. Performance Metrics
1. R2 Score
2. Adjusted R2 Score
3. Mean Square Error
4. Mean Absolute Error
5. Root Mean Square Error

**Dataset:** https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption

In [None]:
from IPython import display
display.Image("power.jpg")

**<h3 align="center">Importing Required Libraries</h3>**

In [None]:
import pandas as pd
import datetime
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline

import pymongo
import json


import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from category_encoders.binary import BinaryEncoder
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, ExtraTreesRegressor, VotingRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import r2_score, mean_absolute_error,mean_squared_error 
import pickle


import warnings
warnings.filterwarnings('ignore')

**<h3 align="center">Importing Dataset and Data Cleaning</h3>**

In [None]:
### importing original dataset
dataset=pd.read_csv('household_power_consumption.txt', sep=";",parse_dates = {'Datetime':['Date','Time']},
           infer_datetime_format = True)
dataset.head()

### Data Set Information:

**This archive contains 2,075,259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months).**

Notes:
1. (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) represents the active energy consumed every minute (in watt-hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3.
2. The dataset contains some missing values in the measurements (nearly 1.25% of the rows). All calendar timestamps are present in the dataset, but for some timestamps the measurement values are missing: a missing value is represented by the absence of a value between two consecutive semicolon attribute separators. For instance, the dataset shows missing values on April 28, 2007.
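
Note 1 can be checked on a single reading. The following is a minimal sketch, assuming a hypothetical helper name `unmetered_energy_wh` and illustrative input values rather than a specific row of the dataset:

```python
def unmetered_energy_wh(global_active_power_kw, sub_1_wh, sub_2_wh, sub_3_wh):
    """Active energy (Wh) used in one minute outside the three sub-meters."""
    # kW averaged over one minute -> Wh: multiply by 1000 (kW to W), divide by 60 (minute to hour)
    total_wh = global_active_power_kw * 1000 / 60
    return total_wh - sub_1_wh - sub_2_wh - sub_3_wh


# Illustrative reading: 4.216 kW with sub-meterings of 0, 1 and 17 Wh
example = unmetered_energy_wh(4.216, 0.0, 1.0, 17.0)
print(round(example, 2))  # about 52.27 Wh not covered by the sub-meters
```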


### Attribute Information:

1. date: Date in format dd/mm/yyyy
2. time: time in format hh:mm:ss
3. global_active_power: household global minute-averaged active power (in kilowatt)
4. global_reactive_power: household global minute-averaged reactive power (in kilowatt)
5. voltage: minute-averaged voltage (in volt)
6. global_intensity: household global minute-averaged current intensity (in ampere)
7. sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
8. sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
9. sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

In [None]:
### Getting shape of original dataset
dataset.shape

As we can see, the dataset is very large:

records : 2,075,259
columns : 8

In [None]:
### Checking Data types of features in original dataset
dataset.dtypes

In [None]:
### checking unique values in each feature to form data cleaning strategy if necessary

for feature in [feature for feature in dataset.columns if feature not in ['Datetime']]:
    print("feature {} has these {} unique values\n".format(feature, dataset[feature].unique()))

In [None]:
### checking number of records in each feature that have value as ?

for feature in [feature for feature in dataset.columns if feature not in ['Datetime']]:
    # count the matching rows instead of printing the shape tuple
    print("The feature {} has {} ? in it".format(feature, (dataset[feature] == '?').sum()))

In [None]:
### replacing ? values with nan values
dataset.replace('?', np.nan, inplace=True)
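
As an alternative to replacing `?` after loading, the marker can be declared as missing at read time. This is a hedged sketch on a tiny in-memory sample (the rows and the reduced column set are illustrative, not the real file), using the `na_values` parameter of `pd.read_csv`:

```python
import io

import pandas as pd

# Tiny illustrative sample in the same ';'-separated layout (reduced column set)
raw = io.StringIO(
    "Date;Time;Global_active_power;Voltage;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;234.840;17.0\n"
    "28/4/2007;00:21:00;?;?;\n"
    "16/12/2006;17:25:00;5.360;233.630;16.0\n"
)

# '?' is parsed as NaN, so the numeric columns load as float64 directly
df = pd.read_csv(raw, sep=";", na_values=["?"])
n_missing = int(df["Global_active_power"].isna().sum())
print(df.dtypes)
print("rows with missing Global_active_power:", n_missing)
```

With this approach the later `astype` conversion becomes unnecessary for the numeric columns, since they are never read as strings.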

In [None]:
### checking number of records in each feature that have value as ? after replacing them
for feature in [feature for feature in dataset.columns if feature not in ['Datetime']]:
    print("The feature {} has {} ? in it".format(feature, (dataset[feature] == '?').sum()))

In [None]:
#check the count of nan values
dataset.isna().sum()

In [None]:
# dropping nan values
dataset.dropna(inplace=True)

In [None]:
dataset.info()

In [None]:
dataset.sample(5)

In [None]:
#change the data types of all features
change_dtypes = {
    "Global_active_power":"float64","Global_reactive_power":"float64", "Voltage":"float64",
    "Global_intensity":"float64","Sub_metering_1":"float64","Sub_metering_2":"float64",
    "Sub_metering_3":"float64"
}

dataset = dataset.astype(change_dtypes)
dataset.dtypes

In [None]:
#Combine all the three sub-meters into one
dataset["power_consumed"] = dataset["Sub_metering_1"] + dataset["Sub_metering_2"] + dataset["Sub_metering_3"]

In [None]:
#Drop  Sub_metering features
dataset.drop(["Sub_metering_1","Sub_metering_2","Sub_metering_3"],axis = 1,inplace = True)

In [None]:
dataset.sample(5)

In [None]:
## checking for Duplicate values
dataset[dataset.duplicated()]

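Observation
1. After dropping the rows with missing values, there are no null values left in the dataset.
2. The original dataset has 2,075,259 records and 8 columns; after combining the three sub-meterings into `power_consumed`, 6 columns remain.
3. There are no duplicate observations in the dataset.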

<h2 align = 'center'> Analysis of Features </h2>

In [None]:
#classify time of the day into bins for better visualization
def time_of_day(x):
    if x in range(6,12):
        return "Morning"
    elif x in range(12,16):
        return "Afternoon"
    elif x in range(16,22):
        return "Evening"
    else:
        return "Late night"

In [None]:
dataset["Time_of_day"] = dataset['Datetime'].dt.hour.apply(time_of_day)
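
The same binning can also be expressed without a Python-level `apply`. The sketch below assumes `pd.cut` with right-closed bins and `ordered=False` (needed because "Late night" appears twice as a label); the `hours` series is illustrative:

```python
import pandas as pd

# Boundary hours of each bin, to check the edges
hours = pd.Series([0, 5, 6, 11, 12, 15, 16, 21, 22, 23])

# Right-closed bins: (-1, 5], (5, 11], (11, 15], (15, 21], (21, 23]
labels = ["Late night", "Morning", "Afternoon", "Evening", "Late night"]
binned = pd.cut(hours, bins=[-1, 5, 11, 15, 21, 23], labels=labels, ordered=False)

result = binned.astype(str).tolist()
print(dict(zip(hours.tolist(), result)))
```

On a frame with roughly two million rows, a vectorized cut of this kind is usually noticeably faster than calling a function per row.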

In [None]:
#time of day vs power  consumed
dataset.groupby("Time_of_day")[['power_consumed']].sum()

In [None]:
#The dataset is very large, so we take a random sample from the original dataset
sample_data = dataset.sample(n = 50000, ignore_index= True)
sample_data

In [None]:
#Power consumed with reference to time of the day

plt.figure(figsize = (15,8))
sns.barplot(x = 'Time_of_day', y = "power_consumed", data = sample_data, palette = "pastel")
plt.show()

Observation :-
The power consumption is higher in morning and afternoon

In [None]:
#Power Consumption with reference to months

In [None]:
# Extract month_name from the datetime
sample_data['month'] = sample_data['Datetime'].dt.month_name()

In [None]:
sample_data.groupby('month')[['power_consumed']].sum()

In [None]:
plt.figure(figsize = (15,8))
sns.barplot(x = "month", y = "power_consumed",data = sample_data, palette= "icefire_r" )


Observation :- The Power consumption is more in the months of December, February and January

In [None]:
#Power consumption with reference to year
# Extract year from the datetime
sample_data['year'] = sample_data['Datetime'].dt.year
plt.figure(figsize = (15,8))
sns.barplot(x = "year", y = "power_consumed",data = sample_data, palette= "rocket" )


Observation:- 
Maximum power consumption was in the year 2006
and Minimum in 2008

In [None]:
sample_data.drop(columns=['Time_of_day','month','year'],inplace= True)

In [None]:
sample_data.head()

In [None]:
#Lineplot voltage vs power consumption
sns.lineplot(x = "Voltage", y = "power_consumed", data=sample_data, color = "b")

In [None]:
#regplot of Global_active_power vs power_consumed
sns.regplot(x='Global_active_power' ,y='power_consumed' , data = sample_data)

In [None]:
#Lineplot of Global_reactive_power vs power_consumed
sns.lineplot(x='Global_reactive_power' ,y='power_consumed' , data = sample_data)

In [None]:
#Correlation between Features
sample_data.corr()

In [None]:
plt.figure(figsize = (15,10))
sns.heatmap(sample_data.corr(),annot=True)
plt.yticks(rotation = 25)

In [None]:
sample_data_copy = sample_data.copy() 
sample_data_copy.head(2)

In [None]:
sample_data_copy.drop("Datetime", axis = 1,inplace = True)

In [None]:
#Check the Outliers
plt.figure(figsize = (15,10))
plt.suptitle('BoxPlot of all features', fontsize = 25, fontweight = "bold", alpha = 0.8, y = 1.)

for i in range(0, len(sample_data_copy.columns)):
    plt.subplot(3,2,i+1)
    sns.boxplot(x= sample_data_copy[sample_data_copy.columns[i]], data = sample_data)
    plt.xlabel(sample_data_copy.columns[i],fontsize = 20)
    plt.tight_layout()

Observation:-
Most features contain many outliers, so we handle them by capping the values at the IQR fences.

In [None]:
#Handling the outliers
def handling_outliers(data,column):
    IQR = data[column].quantile(0.75) - data[column].quantile(0.25)
    lower_fence = data[column].quantile(0.25) - (1.5 * IQR)
    higher_fence = data[column].quantile(0.75) + (1.5 * IQR)
    print(column, "---", "IQR --->",IQR)
    print("Lower Fence:",lower_fence)
    print("Higher Fence:", higher_fence)
    print("______________________________________________")
    # values beyond a fence are replaced by the fence value (capped), not dropped
    data.loc[data[column] <=lower_fence, column] = lower_fence
    data.loc[data[column] >=higher_fence, column] = higher_fence
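
To make the capping rule concrete, here is a small standalone sketch on a toy array. The helper name `cap_at_iqr_fences` and the data are assumptions for illustration; it uses `np.percentile`, whose default linear interpolation matches the default of pandas `quantile`:

```python
import numpy as np


def cap_at_iqr_fences(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] and return the fences as well."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(values, lower, upper), (lower, upper)


# Toy data: Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7
capped, (lower, upper) = cap_at_iqr_fences([1, 2, 3, 4, 100])
print(lower, upper, capped)
```

The extreme value 100 is pulled down to the upper fence while the row itself is kept, which is why the sample size does not change after this step.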

In [None]:
for columns in sample_data_copy:
    handling_outliers(sample_data_copy,columns)

In [None]:
#Check the boxplot after capping the outliers
plt.figure(figsize = (15,10))
plt.suptitle('BoxPlot of all features', fontsize = 25, fontweight = "bold", alpha = 0.8, y = 1.)

for i in range(0, len(sample_data_copy.columns)):
    plt.subplot(3,2,i+1)
    sns.boxplot(x= sample_data_copy[sample_data_copy.columns[i]], data = sample_data)
    plt.xlabel(sample_data_copy.columns[i],fontsize = 20)
    plt.tight_layout()

In [None]:
client = pymongo.MongoClient("mongodb+srv://charan:charangowda@machinelearning.fqyneei.mongodb.net/?retryWrites=true&w=majority")
db = client.test
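
Connection strings with embedded credentials are easy to leak when a notebook is shared. One hedged option is to build the URI from environment variables. The variable names `MONGO_USER`, `MONGO_PASSWORD` and `MONGO_HOST` and the helper `mongo_uri_from_env` are hypothetical, and the demo values below are placeholders:

```python
import os
from urllib.parse import quote_plus


def mongo_uri_from_env(env=os.environ):
    """Build a mongodb+srv URI from (hypothetical) environment variables."""
    # quote_plus escapes characters such as '@' or ':' in the credentials
    user = quote_plus(env["MONGO_USER"])
    password = quote_plus(env["MONGO_PASSWORD"])
    host = env["MONGO_HOST"]
    return f"mongodb+srv://{user}:{password}@{host}/?retryWrites=true&w=majority"


# Placeholder values, only to show the resulting shape of the URI
demo_env = {"MONGO_USER": "u", "MONGO_PASSWORD": "p@ss", "MONGO_HOST": "h.example.net"}
uri = mongo_uri_from_env(demo_env)
print(uri)
```

The result can then be passed to the client in the usual way, for example `pymongo.MongoClient(mongo_uri_from_env())`.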

In [None]:
#Convert data into dict
data = sample_data_copy.to_dict(orient = "records")
data[:5]

In [None]:
database = client['Household_Comsumption']
database

In [None]:
# data_after_preprocessing is table name
collection = database["data_after_preprocessing"]
collection.insert_many(data)

In [None]:
#Retrieve data from MongoDB
all_record = collection.find()
list_record = list(all_record)
list_record[:5]

In [None]:
data_mongo = pd.DataFrame(list_record)
data_mongo.head()

In [None]:
data_mongo.drop("_id",axis = 1,inplace = True)

In [None]:
data_mongo.head()

In [None]:
data_mongo.shape

<h1 align = 'Center'> Model building </h1>

In [None]:
#independent (X) and dependent (y) features
X = data_mongo.drop('power_consumed',axis = 1)
y = data_mongo['power_consumed']

In [None]:
X.head()

In [None]:
y.head()

In [None]:
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
#Standardize Scaler
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler


In [None]:
### Using fit_transform to standardise Train data
X_train = scaler.fit_transform(X_train)

In [None]:
### Here using only transform to avoid data leakage
X_test = scaler.transform(X_test)
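
The reason for fitting on the training data only can be shown with plain NumPy. This sketch mirrors what standardization does (mean and population standard deviation taken from the training split); the arrays are toy values:

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics are estimated on the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0)

train_scaled = (train - mean) / std
# The test split reuses the training statistics, so nothing about it leaks into the fit
test_scaled = (test - mean) / std

print(train_scaled.ravel(), test_scaled.ravel())
```

Calling `fit_transform` on the test data instead would recompute the mean and standard deviation from the test split, so the model would be evaluated on inputs scaled differently from the ones it was trained on.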

In [None]:
Report = []

<h1 align = 'Center'> Linear Regression </h1>

In [None]:
## creating linear regression model
linear_reg = LinearRegression()

# Passing training data (X and y) to the model
linear_reg.fit(X_train, y_train)

# coefficients and intercept of best fit hyperplane
print("Linear Regression Coefficient",linear_reg.coef_)
print("Linear Regression Intercept",linear_reg.intercept_)

# Prediction of test data
linear_test_pred = linear_reg.predict(X_test)

# R Square score
lin_test_r2_score = r2_score(y_test,linear_test_pred)
print("Linear Regression r2:",lin_test_r2_score)

# Adjusted R Square score
lin_test_adjr2_score = 1 - (1-lin_test_r2_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
print("Adjusted R2:",lin_test_adjr2_score)

# Insert this information in Report list
Report.append({'Model':'Linear Regression',
              'Testing Accuracy r2':lin_test_r2_score,
               'Adjusted r2':lin_test_adjr2_score,
               'MSE_Test':mean_squared_error(y_test,linear_test_pred),
               'MAE_Test':mean_absolute_error(y_test,linear_test_pred),
               'RMSE_Test':np.sqrt(mean_squared_error(y_test,linear_test_pred)),
              })
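
The adjusted R2 expression used in these cells is `1 - (1 - R2) * (n - 1) / (n - p - 1)`, where `n` is the number of test samples and `p` the number of features. A minimal sketch as a reusable function (the name `adjusted_r2` and the numbers are illustrative):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative numbers: R2 = 0.9 on 101 samples with 4 features
adj = adjusted_r2(0.9, n_samples=101, n_features=4)
print(round(adj, 4))  # 0.8958
```

With tens of thousands of test rows and only four features, the correction factor is very close to 1, so adjusted R2 and R2 will be nearly identical in this notebook.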

<h1 align = 'Center'> Ridge Regression </h1>

In [None]:
## creating Ridge regression model
ridge_reg=Ridge()

### Passing training data(X and y) to the model
ridge_reg.fit(X_train, y_train)

### Printing co-efficients and intercept of best fit hyperplane
print("1. Co-efficients of independent features is {}".format(ridge_reg.coef_))
print("2. Intercept of best fit hyper plane is {}".format(ridge_reg.intercept_))

### Prediction of test data
ridge_reg_pred = ridge_reg.predict(X_test)

### R Square Score
Ridge_score = r2_score(y_test,ridge_reg_pred)
print('Ridge_r2_score:',Ridge_score)

### Adjusted R Square
Adjusted_R2 = 1 - (1-Ridge_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
print('Ridge_Adjusted_R2',Adjusted_R2)

# Insert this information in Report list
Report.append({'Model':'Ridge Regression',
              'Testing Accuracy r2':Ridge_score,
               'Adjusted r2':Adjusted_R2,
               'MSE_Test':mean_squared_error(y_test,ridge_reg_pred),
               'MAE_Test':mean_absolute_error(y_test,ridge_reg_pred),
               'RMSE_Test':np.sqrt(mean_squared_error(y_test,ridge_reg_pred)),
              })

<h1 align = 'Center'> Lasso regression </h1>

In [None]:
## creating Lasso regression model
lasso_reg = Lasso()

### Passing training data(X and y) to the model
lasso_reg.fit(X_train, y_train)

### Printing co-efficients and intercept of best fit hyperplane
print("1. Co-efficients of independent features is {}".format(lasso_reg.coef_))
print("2. Intercept of best fit hyper plane is {}".format(lasso_reg.intercept_))

### Prediction of test data
lasso_reg_pred = lasso_reg.predict(X_test)

## R Square
lasso_score = r2_score(y_test,lasso_reg_pred)
print('Lasso_R2:',lasso_score)

## Adjusted R2
Adjusted_r2 = 1 - (1-lasso_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
print("Lasso Adjusted R2:",Adjusted_r2)


# Insert this information in Report list
Report.append({'Model':'Lasso Regression',
              'Testing Accuracy r2':lasso_score,
               'Adjusted r2':Adjusted_r2,
               'MSE_Test':mean_squared_error(y_test,lasso_reg_pred),
               'MAE_Test':mean_absolute_error(y_test,lasso_reg_pred),
               'RMSE_Test':np.sqrt(mean_squared_error(y_test,lasso_reg_pred)),
              })
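
The outline also lists Elastic-Net Regression, which combines the L1 penalty of Lasso with the L2 penalty of Ridge. The sketch below follows the same fit, predict and score pattern on synthetic data; the data, the `alpha` and `l1_ratio` values and the variable names are assumptions, not results from this dataset:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

# Synthetic regression problem: 4 features, one of them irrelevant
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# l1_ratio=0.5 weights the L1 and L2 penalties equally
elastic_reg = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_reg.fit(X_demo[:150], y_demo[:150])

elastic_pred = elastic_reg.predict(X_demo[150:])
elastic_score = r2_score(y_demo[150:], elastic_pred)
print("ElasticNet coefficients:", elastic_reg.coef_)
print("ElasticNet R2:", elastic_score)
```

On the notebook's data the same calls would take `X_train`, `y_train` and `X_test` in place of the synthetic arrays.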

<h1 align = 'Center'> SVR model </h1>

In [None]:
# Hyper-parameter tuning the SVM model
param_grid = {'kernel':['rbf','linear','poly']}

grid = GridSearchCV(estimator = SVR(),
                    param_grid=param_grid,
                            cv=5,
                            n_jobs= -1)

grid.fit(X_train,y_train)
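
After a grid search it is useful to check which kernel won and by how much. This is a small sketch on synthetic data (the arrays and the `demo_grid` name are illustrative) that reads `best_params_` and `cv_results_` from a fitted `GridSearchCV`:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Small synthetic problem so the search runs quickly
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(60, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1]

demo_grid = GridSearchCV(SVR(), {"kernel": ["rbf", "linear", "poly"]}, cv=5)
demo_grid.fit(X_demo, y_demo)

# Kernel chosen by cross-validation, and the mean CV score of every candidate
best_kernel = demo_grid.best_params_["kernel"]
mean_scores = dict(
    zip(demo_grid.cv_results_["param_kernel"], demo_grid.cv_results_["mean_test_score"])
)
print("best kernel:", best_kernel)
print(mean_scores)
```

Since `refit=True` is the default, calling `predict` on the fitted search object uses the best estimator retrained on the full training data.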

In [None]:
#predicting data
svr_pred = grid.predict(X_test)

## r2 score
svr_r2Score = r2_score(y_test,svr_pred)
print("SVR R2 score:",svr_r2Score)

## Adjusted r2 score
Adjusted_r2 = 1 - (1-svr_r2Score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
print("SVR Adjusted R2:",Adjusted_r2)


# Insert this information in Report list
Report.append({'Model':'SVR Regression',
              'Testing Accuracy r2':svr_r2Score,
               'Adjusted r2':Adjusted_r2,
               'MSE_Test':mean_squared_error(y_test,svr_pred),
               'MAE_Test':mean_absolute_error(y_test,svr_pred),
               'RMSE_Test':np.sqrt(mean_squared_error(y_test,svr_pred)),
              })

<h1 align = 'Center'> Decision Tree Regressor </h1>

In [None]:
## creating Decision Tree regression model
decision_tree = DecisionTreeRegressor()

### Passing training data(X and y) to the model
decision_tree.fit(X_train, y_train)

### Prediction of test data
decision_tree_pred = decision_tree.predict(X_test)

## R Square
decision_score = r2_score(y_test,decision_tree_pred)
print('Decision Tree R2:',decision_score)

## Adjusted R2
Adjusted_r2 = 1 - (1-decision_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
print("Decision Tree Adjusted R2:",Adjusted_r2)


# Insert this information in Report list
Report.append({'Model':'Decision Tree Regressor',
              'Testing Accuracy r2':decision_score,
               'Adjusted r2':Adjusted_r2,
               'MSE_Test':mean_squared_error(y_test,decision_tree_pred),
               'MAE_Test':mean_absolute_error(y_test,decision_tree_pred),
               'RMSE_Test':np.sqrt(mean_squared_error(y_test,decision_tree_pred)),
              })

In [None]:
### creating dictionary containing model objects for different algorithms
models={
    "Random Forest Regressor":RandomForestRegressor(),
    # scikit-learn >= 1.2 names this parameter `estimator` (older versions: `base_estimator`)
    "Bagging Regressor": BaggingRegressor(estimator=LinearRegression()),
    "Extra Tree Regressor": ExtraTreesRegressor(), 
    "AdaBoost Regressor": AdaBoostRegressor(),
    "GradientBoost Regressor": GradientBoostingRegressor(),
    "XGBoost Regressor": XGBRegressor()
    
}
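
The outline lists a Voting Regressor, which is imported but is not part of the `models` dictionary. A hedged sketch of how one could be built is shown below on synthetic data; the choice of base estimators and all variable names are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem, only to exercise the API
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=200)

voting_reg = VotingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("dt", DecisionTreeRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ]
)
voting_reg.fit(X_demo[:150], y_demo[:150])
voting_pred = voting_reg.predict(X_demo[150:])

# Without weights, the ensemble prediction is the plain average of its members
member_mean = np.mean([est.predict(X_demo[150:]) for est in voting_reg.estimators_], axis=0)
print(np.allclose(voting_pred, member_mean))
```

An entry such as `"Voting Regressor": voting_reg` could then be added to the dictionary so that it goes through the same training and evaluation loop.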

In [None]:
### Creating function for model training
def model_trainer(model, X_train_data, y_train_data, X_test_data):
    """
    This function takes model object, X train data, y train data, and 
    X test data as argument, trains model and gives prediction for train data 
    and prediction for test data.
    """
    model.fit(X_train_data, y_train_data)
    y_train_pred=model.predict(X_train_data)
    pred_val=model.predict(X_test_data)
    return y_train_pred, pred_val

In [None]:
### Creating function that will evaluate model
def model_evaluator(actual_val, pred_val, X_test_val):
    """
    The function takes actual value, predicted value and X test value as 
    argument and returns Mean square error, Mean absolute error, Root 
    mean square error, r2 score and adjusted r2 score rounded to 3 decimal 
    places.
    """
    mse=round(mean_squared_error(actual_val, pred_val),3)
    mae=round(mean_absolute_error(actual_val, pred_val),3)
    rmse=round(np.sqrt(mean_squared_error(actual_val, pred_val)),3)
    r2_sco=round(r2_score(actual_val, pred_val),4)
    # use the features passed to the function instead of the global X_test
    adj_r2_sco=round(1-(1-r2_sco)*(len(actual_val)-1)/(len(actual_val)-X_test_val.shape[1]-1),4)
    return mse, mae, rmse, r2_sco, adj_r2_sco

In [None]:
### Training all models, getting their performance and storing it in the Report list
for model_name, model in models.items():
    ### getting training data prediction and test data prediction
    y_pred, pred_val=model_trainer(model,X_train, y_train, X_test)
    
    ### Getting model performance parameters for training data
    mse, mae, rmse, r2_sco, adj_r2_sco=model_evaluator(y_train,y_pred,X_train )
    print("{} Model\n".format(model_name))
    print("Model Performance for training dataset")
    print("Mean Square Error: {}\nMean Absolute Error: {}\nRoot Mean Square Error: {}\nR2 Score: {}\nAdjusted R2 Score: {}".format(mse,mae, rmse, r2_sco, adj_r2_sco))
    print("-"*50)
    
    ### Getting model performance parameters for test data
    mse, mae, rmse, r2_sco, adj_r2_sco=model_evaluator(y_test,pred_val,X_test )
    print("Model Performance for Test dataset")
    print("Mean Square Error: {}\nMean Absolute Error: {}\nRoot Mean Square Error: {}\nR2 Score: {}\nAdjusted R2 Score: {}".format(mse,mae, rmse, r2_sco, adj_r2_sco))
    
  

    # Insert this information in Report list
    Report.append({'Model':model_name,
              'Testing Accuracy r2':r2_sco,
               'Adjusted r2':adj_r2_sco,
               'MSE_Test':mse,
               'MAE_Test':mae,
               'RMSE_Test':rmse
              })
    

In [None]:
Report

In [None]:
report_df = pd.DataFrame.from_dict(Report)

In [None]:
report_df
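
To compare the models at a glance, the report can be ranked by test R2. The sketch below uses a two-row stand-in with placeholder numbers (they are not results from this notebook); on the real data the same `sort_values` call would be applied to `report_df`:

```python
import pandas as pd

# Placeholder rows with the same column names as the Report entries
demo_report = [
    {"Model": "Model A", "Testing Accuracy r2": 0.80, "RMSE_Test": 3.0},
    {"Model": "Model B", "Testing Accuracy r2": 0.90, "RMSE_Test": 2.0},
]
demo_df = pd.DataFrame(demo_report)

# Highest R2 first; reset the index so row 0 is the best model
ranked = demo_df.sort_values("Testing Accuracy r2", ascending=False).reset_index(drop=True)
best_model = ranked.loc[0, "Model"]
print(ranked)
```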