# **Project Name**    - Bike Sharing Demand Prediction


##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Upendra Pratap Singh

# **GitHub Link -**

https://github.com/UPENDRA555/Bike-Sharing-Demand-Prediction/tree/main

# **Problem Statement**


Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, because this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that the supply stays stable.

# ***Let's Begin !***

## ***Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt
%matplotlib inline

# hide warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/SeoulBikeData.csv', encoding='unicode_escape')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('The row & column count')
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
df.dtypes

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_value = df.duplicated().sum()
print('The number of duplicate rows in this data:', duplicate_value)
df.groupby(df.columns.tolist(),as_index=False).size().head()

In [None]:

# drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
def show_missing():
    missing = df.columns[df.isnull().any()].tolist()
    return missing

# Missing data counts and percentage
print('Missing Data Count')
print(df[show_missing()].isnull().sum().sort_values(ascending = False))
print('--'*50)
print('Missing Data Percentage')
print(round(df[show_missing()].isnull().sum().sort_values(ascending = False)/len(df)*100,2))

In [None]:
# Number of missing value in each column
df.isna().sum()

In [None]:
# Visualizing the missing values
!pip install missingno

In [None]:
import missingno as msno
msno.matrix(df)

In [None]:
msno.bar(df)

# Data Description

The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.






Attribute Information:


*   Date - year-month-day
*   Rented Bike count - Count of bikes rented at each hour
*   Hour - Hour of the day
*   Temperature - Temperature in Celsius
*   Humidity - %
*   Windspeed - m/s
*   Visibility - 10m
*   Dew point temperature - Celsius
*   Solar radiation - MJ/m2
*   Rainfall - mm
*   Snowfall - cm
*   Seasons - Winter, Spring, Summer, Autumn
*   Holiday - Holiday/No holiday
*   Functional Day - NoFunc (Non Functional Hours), Fun (Functional hours)















## ***Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

In [None]:
# Write your code to make your dataset analysis ready.
data = df.copy()
data.head()

In [None]:
data.dtypes

In [None]:
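# Convert the Date column to datetime
# Assumption: dates in this CSV are day-first (dd/mm/yyyy); adjust the format if yours differ
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')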


*   Split the 'Date' column into three new columns: 'Day', 'Month' and 'Year'.
*   Create another column, 'Weekend', that flags Saturday and Sunday, because we use it later in the analysis.



In [None]:
# Split the 'Date' column into 'Year', 'Month' and 'Day' columns
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data['Day'] = data['Date'].dt.day_name()

In [None]:
# Create a 'Weekend' column: 1 for Saturday and Sunday, 0 otherwise
data['Weekend'] = data['Day'].isin(['Saturday', 'Sunday']).astype(int)
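
The date-derived features above can be checked on a handful of made-up dates. This is a minimal sketch, assuming day-first date strings (dd/mm/yyyy); `dates_demo` and `features` are hypothetical names that are not part of the notebook.

```python
import pandas as pd

# Made-up sample dates, assumed to be day-first (dd/mm/yyyy)
dates_demo = pd.Series(['01/12/2017', '02/12/2017', '03/12/2017', '13/12/2017'])

# An explicit format avoids 01/12/2017 being read as 12 January
parsed = pd.to_datetime(dates_demo, format='%d/%m/%Y')

features = pd.DataFrame({
    'Month': parsed.dt.month,
    'Day': parsed.dt.day_name(),
    # dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend
    'Weekend': (parsed.dt.dayofweek >= 5).astype(int),
})
print(features)
```

Using `dt.dayofweek` instead of comparing day names gives the same flag without depending on the spelling of the day strings.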



*   We drop the 'Day', 'Year' and 'Date' columns.
*   The 'Day' column holds the day name for every date; we do not need that level of detail once 'Weekend' has been derived.
*   The 'Year' column contains only two unique values, 2017 and 2018, and the data covers a single year (December to November), so a 'Year' column is not needed.
*   The 'Date' column is dropped because 'Day', 'Month' and 'Year' have already been extracted from it, so it carries no further information for this analysis.








In [None]:
data = data.drop(columns=['Date', 'Day', 'Year'])

In [None]:
data.columns

In [None]:
data.dtypes

The 'Hour', 'Month' and 'Weekend' columns are stored as integers, but they are categorical in nature, so we convert them to the categorical type.

In [None]:
# Convert 'Hour', 'Month' and 'Weekend' to categorical columns
data['Hour']= data['Hour'].astype('category')
data['Month']= data['Month'].astype('category')
data['Weekend']= data['Weekend'].astype('category')

In [None]:
print(data['Hour'].value_counts())
print(data['Month'].value_counts())
print(data['Weekend'].value_counts())

In [None]:
categorical_features= data.select_dtypes(include=['object', 'category'])
categorical_features.head()

In [None]:
numerical_features= data.select_dtypes(include=['int64', 'float64'])
numerical_features.head()

## ***EDA (Exploratory Data Analysis)***


#### For Target variable

In [None]:
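# Split the independent and dependent variables
# Dropping the target by name keeps every feature (positional slicing from column 2 also dropped 'Hour')
X=data.drop(columns=['Rented Bike Count'])
y=data['Rented Bike Count']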

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Plot the distplot for Dependent variable
plt.figure(figsize=(10,6))
sns.distplot(data['Rented Bike Count'], hist= True)
plt.xlabel('Rented Bike Count')
plt.ylabel('Density')
plt.show()

The distribution plot above shows that the target is right-skewed, so we apply a square root to Rented Bike Count to reduce the skewness.

In [None]:
# Apply square root to reduce skewness
plt.figure(figsize=(10,6))
sns.distplot(np.sqrt(data['Rented Bike Count']), hist= True)
plt.xlabel('sqrt(Rented Bike Count)')
plt.ylabel('Density')
plt.show()
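
To put a number on the visual impression above, skewness can be computed before and after the square-root transform. The sketch below uses synthetic right-skewed data as a stand-in for the target; `sample_skewness` and `counts_demo` are hypothetical names, and the gamma distribution is only an assumption chosen to mimic a right-skewed count.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based sample skewness: third central moment over variance ** 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

# Synthetic right-skewed stand-in for 'Rented Bike Count' (not the real data)
rng = np.random.default_rng(42)
counts_demo = rng.gamma(shape=1.5, scale=400.0, size=5000)

skew_raw = sample_skewness(counts_demo)
skew_sqrt = sample_skewness(np.sqrt(counts_demo))
print("skewness before sqrt:", round(skew_raw, 3))
print("skewness after sqrt :", round(skew_sqrt, 3))
```

On the notebook's own data, the pandas call `data['Rented Bike Count'].skew()` gives a comparable figure.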

#### Categorical feature Analysis

In [None]:
# Count the occurrences of each unique value in the categorical features
for colm in categorical_features:
  print(data[colm].value_counts())

In [None]:
# Plot a bar graph of a categorical feature
for colm in categorical_features:
  data[colm].value_counts().plot(kind= 'bar', figsize= (10, 10), fontsize=10, width= 0.3)
  plt.xlabel(colm)
  plt.ylabel('Count')
  plt.show()

In [None]:
# Plot a barplot between a categorical feature and target feature
for col in categorical_features:
  plt.figure(figsize=(10,6))
  sns.barplot(x=data[col], y= data['Rented Bike Count'])
  plt.xlabel(col)
  plt.show()

In [None]:
# Box plots of the target for each categorical feature
for col in categorical_features:
  sns.boxplot(x=data[col],y=data["Rented Bike Count"])
  plt.show()

In [None]:
# plot average rent over categorical feature
for col in categorical_features:
  avg_rent = data.groupby(col)['Rented Bike Count'].mean()
  plt.figure(figsize=(20,4))
  a=avg_rent.plot(legend=True,marker='o',title="Average Bikes Rented")
  a.set_xticks(range(len(avg_rent)));
  a.set_xticklabels(avg_rent.index.tolist(), rotation=85);
  plt.show()



*   Rented bike demand is high in the morning (7 am to 9 am) and in the evening (5 pm to 7 pm).
*   Demand is high in summer and low in winter.
*   Demand decreases on holidays.
*   People do not use rented bikes on non-functioning days.
*   Demand is highest in June and lowest in January and February.
*   Demand decreases on weekends.







In [None]:
# Hourly demand by season
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=data,x='Hour',y='Rented Bike Count',hue='Seasons',ax=ax)
ax.set(title='Count of rented bikes according to season')



*   In summer, rented bike use is high, with peaks at 7 am to 9 am and 5 pm to 7 pm.
*   In winter, rented bike use is very low, likely because of snowfall.



In [None]:
# Hourly demand on holidays vs. non-holidays
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=data,x='Hour',y='Rented Bike Count',hue='Holiday',ax=ax)
ax.set(title='Count of rented bikes according to holiday')



*   The plot shows that on holidays people mostly use rented bikes from 2 pm to 8 pm.



In [None]:
# Hourly demand on functioning vs. non-functioning days
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=data,x='Hour',y='Rented Bike Count',hue='Functioning Day',ax=ax)
ax.set(title='Count of rented bikes according to functioning day')



*   The point plot compares rented bike use on functioning and non-functioning days, and it clearly shows that people do not use rented bikes on non-functioning days.




In [None]:
# Hourly demand on weekdays vs. weekends
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=data,x='Hour',y='Rented Bike Count',hue='Weekend',ax=ax)
ax.set(title='Count of rented bikes according to weekend')



*   Weekdays, shown in blue, have higher demand, most likely because of office commutes.
*   Peak times are 7 am to 9 am and 5 pm to 7 pm.
*   Weekends, shown in orange, have very low demand, especially in the morning hours, but demand increases slightly in the evening from 4 pm to 8 pm.






#### Numerical Feature Analysis

In [None]:
#plotting histogram

for col in numerical_features[:]:
  sns.histplot(data[col])
  plt.axvline(data[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
  plt.axvline(data[col].median(), color='cyan', linestyle='dashed', linewidth=2)
  plt.show()

In [None]:
# Regression plot of each numerical column against the rented bike count

for col in numerical_features:
  if col == 'Rented Bike Count':
    continue  # skip the target itself so no empty figure is drawn
  sns.regplot(x=data[col],y=data["Rented Bike Count"],line_kws={"color": "red"})
  plt.show()



*   The regression plots above show that 'Temperature', 'Wind speed', 'Visibility', 'Dew point temperature' and 'Solar Radiation' are positively related to the target variable, which means the rented bike count increases as these features increase.
*   'Rainfall', 'Snowfall' and 'Humidity' are negatively related to the target variable, which means the rented bike count decreases as these features increase.


Relationship between numerical variable and 'Rented Bike Count'

In [None]:
# Plot the mean rented bike count for each value of the numerical columns

for col in numerical_features:
  if col == 'Rented Bike Count':
    continue  # skip the target itself
  # Select the target before aggregating so non-numeric columns are not averaged
  data.groupby(col)['Rented Bike Count'].mean().plot()
  plt.show()



*   Temperature: people prefer to ride when it is fairly warm, around 25°C on average.
*   Humidity: people prefer to ride when humidity is low, in the range of 10-15%.
*   Wind speed: demand is roughly uniform across wind speeds, but it rises at around 7 m/s, which suggests that people like to ride when it is a little windy.
*   Visibility: people prefer to ride when visibility is above 500.
*   Dew point temperature: the pattern is almost the same as for 'Temperature', so the two features are similar.
*   Solar radiation: the number of rented bikes is high when there is solar radiation, at around 1000 rentals.
*   Rainfall: demand does not fall steadily as rain increases; for example, there is a large peak in rentals even at 20 mm of rain.
*   Snowfall: the number of rented bikes is very low, and it is much lower once there is more than 4 cm of snow.











#### Correlation Heatmap

In [None]:
# Correlation heatmap of the numerical columns
plt.figure(figsize=(25,20))
# numeric_only=True skips the object and category columns, which cannot be correlated
sns.heatmap(data.corr(numeric_only=True),annot=True,cmap="coolwarm")
plt.show()

In [None]:
# Multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

   # Calculating VIF
   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

   return(vif)

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['Rented Bike Count','Dew point temperature(°C)'] ]])




#### Pair Plot

In [None]:
# Pair plot of the numerical columns
# No hue is set: the target is continuous, so using it as hue would create one class per distinct count
sns.pairplot(data)

### One-Hot Encoding

In [None]:
#creating Dummy variable for categorical columns
dummy_categorical_feature= pd.get_dummies(categorical_features,drop_first=True)

In [None]:
dummy_categorical_feature.head()

In [None]:
# Concatenate the numeric columns and the dummy columns into the final DataFrame
final_df= pd.concat([dummy_categorical_feature,numerical_features],axis=1)

In [None]:
final_df.head()

#### Train Test split for regression

In [None]:
X=final_df.drop(['Rented Bike Count'],axis=1)
y=np.sqrt(final_df['Rented Bike Count'])

In [None]:
# Create train and test sets first so the scaler never sees the test rows
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=0)
print(X_train.shape)
print(X_test.shape)

In [None]:
# Scale the data: fit on the training set only, then apply the same transform to the test set
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Linear Regression

In [None]:
# Fit a linear regression model and predict on the train and test sets
from sklearn.linear_model import LinearRegression
reg= LinearRegression().fit(X_train, y_train)

# Predictions for the train and test sets
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)

# R2 score on the training set
print(reg.score(X_train, y_train))

print('---------------------------------------------------------------------------------------------------------------------')

# Check the coefficients
print(reg.coef_)

#### Train Data Score Chart for Linear Regression

In [None]:
#import the module
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_lr_train= mean_squared_error((y_train), (y_pred_train))
print("MSE :",MSE_lr_train)

#calculate RMSE
RMSE_lr_train=np.sqrt(MSE_lr_train)
print("RMSE :",RMSE_lr_train)

# import the module
from sklearn.metrics import mean_absolute_error
#calculate MAE
MAE_lr_train= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr_train)



#import the module
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (training metrics use the training-set shape)
r2_lr_train= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr_train)
Adjusted_R2_lr_train = 1-(1-r2_lr_train)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_lr_train)
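
The adjusted R2 expression is repeated for every model in this notebook, so it may help to see it as a small helper. This is a minimal sketch; `adjusted_r2` is a hypothetical function name, and the numbers in the example are made up for illustration. The sample count and feature count should come from the same split that produced the R2 value.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator

# Made-up illustration: R2 of 0.76 from 100 rows and 10 features
example = adjusted_r2(0.76, n_samples=100, n_features=10)
print("Adjusted R2 :", example)
```

With the notebook's variables, a call would look like `adjusted_r2(r2_lr_train, X_train.shape[0], X_train.shape[1])`.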

#### Test Data Score Chart for Linear Regression

In [None]:
#import the module
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_lr_test= mean_squared_error((y_test), (y_pred_test))
print("MSE :",MSE_lr_test)

#calculate RMSE
RMSE_lr_test=np.sqrt(MSE_lr_test)
print("RMSE :",RMSE_lr_test)

# import the module
from sklearn.metrics import mean_absolute_error
#calculate MAE
MAE_lr_test= mean_absolute_error((y_test), (y_pred_test))
print("MAE :",MAE_lr_test)



#import the module
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_lr_test= r2_score((y_test), (y_pred_test))
print("R2 :",r2_lr_test)
Adjusted_R2_lr_test = (1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
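
Because the models are fitted on the square root of the target, the MSE, RMSE and MAE above are expressed in square-root units rather than in bikes. The sketch below shows one way to report the error in bike counts by squaring the predictions first. The arrays are made-up values and `rmse` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally sized arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up true counts and predictions on the square-root scale
y_true_counts = np.array([100.0, 400.0, 900.0, 1600.0])
y_true_sqrt = np.sqrt(y_true_counts)
y_pred_sqrt = y_true_sqrt + np.array([1.0, -1.0, 2.0, -2.0])

# Clip at zero before squaring so a negative prediction cannot turn into a positive count
y_pred_counts = np.square(np.clip(y_pred_sqrt, 0, None))

rmse_sqrt_scale = rmse(y_true_sqrt, y_pred_sqrt)
rmse_count_scale = rmse(y_true_counts, y_pred_counts)
print("RMSE on sqrt scale :", rmse_sqrt_scale)
print("RMSE in bike counts:", rmse_count_scale)
```

The two figures are not comparable with each other; the count-scale value is the one to quote when describing how many bikes the prediction is off by.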

#### Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test),(y_test)-(y_pred_test))
plt.show()

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(y_pred_test)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

### LASSO REGRESSION

In [None]:
# Create an instance of Lasso Regression implementation
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0, max_iter=3000)
# Fit the Lasso model
lasso.fit(X_train, y_train)

# Predictions for the train and test sets
y_pred_train_lasso=lasso.predict(X_train)
y_pred_test_lasso=lasso.predict(X_test)

# R2 score on the test and train sets
print(lasso.score(X_test, y_test), lasso.score(X_train, y_train))

print('-----------------------------------------------------------------')

# Check the coefficients
print(lasso.coef_)

For Train Data

In [None]:
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_l_train= mean_squared_error((y_train), (y_pred_train_lasso))
print("MSE :",MSE_l_train)

#calculate RMSE
RMSE_l_train=np.sqrt(MSE_l_train)
print("RMSE :",RMSE_l_train)


#calculate MAE
MAE_l_train= mean_absolute_error(y_train, y_pred_train_lasso)
print("MAE :",MAE_l_train)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (training metrics use the training-set shape)
r2_l_train= r2_score(y_train, y_pred_train_lasso)
print("R2 :",r2_l_train)
Adjusted_R2_l_train = 1-(1-r2_l_train)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_l_train)

For Test Data

In [None]:
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_l_test= mean_squared_error(y_test, y_pred_test_lasso)
print("MSE :",MSE_l_test)

#calculate RMSE
RMSE_l_test=np.sqrt(MSE_l_test)
print("RMSE :",RMSE_l_test)


#calculate MAE
MAE_l_test= mean_absolute_error(y_test, y_pred_test_lasso)
print("MAE :",MAE_l_test)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_l_test= r2_score((y_test), (y_pred_test_lasso))
print("R2 :",r2_l_test)
Adjusted_R2_l_test=(1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

#### Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(np.array(y_pred_test_lasso))
plt.plot(np.array((y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_lasso),(y_test-y_pred_test_lasso))
plt.show()

#### Cross-Validation & Hyperparameter Tuning

In [None]:
# Use GridSearchCV for hyperparameter tuning of lasso regression
from sklearn.model_selection import GridSearchCV
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0,
                    10.0, 20, 50, 100, 500, 1000 ]}

grid_cv_lasso = GridSearchCV(estimator=lasso,
                       param_grid=params,
                       scoring='neg_mean_absolute_error',
                       cv=5,
                       return_train_score=True,
                       verbose=1)
grid_lasso_reg= grid_cv_lasso.fit(X_train, y_train)
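
After a grid search it is useful to see which `alpha` was selected and what cross-validated error it achieved. The sketch below also places the scaler inside a `Pipeline`, so that each cross-validation fold is scaled using only its own training portion. This is an illustrative variant on synthetic data, not the notebook's exact setup; `X_demo`, `y_demo`, `pipe` and `search` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; swap in the notebook's own features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Scaling is refitted inside every fold because it is a pipeline step
pipe = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso(max_iter=3000))])
param_grid = {"lasso__alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["lasso__alpha"]
best_mae = -search.best_score_  # the scorer is negated, so flip the sign back
print("best alpha:", best_alpha)
print("cross-validated MAE:", best_mae)
```

On the notebook's fitted object, `grid_lasso_reg.best_params_` and `grid_lasso_reg.best_score_` expose the same information.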

In [None]:
# Predictions for the train and test sets
y_pred_train_lasso_grid=grid_lasso_reg.predict(X_train)
y_pred_test_lasso_grid=grid_lasso_reg.predict(X_test)

# Model score; with scoring='neg_mean_absolute_error' this is the negative MAE, not R2
print(grid_lasso_reg.score(X_test, y_test), grid_lasso_reg.score(X_train, y_train))



For Train Data

In [None]:
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_l_train_grid= mean_squared_error((y_train), (y_pred_train_lasso_grid))
print("MSE :",MSE_l_train_grid)

#calculate RMSE
RMSE_l_train_grid=np.sqrt(MSE_l_train_grid)
print("RMSE :",RMSE_l_train_grid)


#calculate MAE
MAE_l_train_grid= mean_absolute_error(y_train, y_pred_train_lasso_grid)
print("MAE :",MAE_l_train_grid)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (training metrics use the training-set shape)
r2_l_train_grid= r2_score(y_train, y_pred_train_lasso_grid)
print("R2 :",r2_l_train_grid)
Adjusted_R2_l_train_grid = 1-(1-r2_l_train_grid)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_l_train_grid)

For Test Data

In [None]:
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_l_test_grid= mean_squared_error(y_test, y_pred_test_lasso_grid)
print("MSE :",MSE_l_test_grid)

#calculate RMSE
RMSE_l_test_grid=np.sqrt(MSE_l_test_grid)
print("RMSE :",RMSE_l_test_grid)


#calculate MAE
MAE_l_test_grid= mean_absolute_error(y_test, y_pred_test_lasso_grid)
print("MAE :",MAE_l_test_grid)


from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_l_test_grid= r2_score((y_test), (y_pred_test_lasso_grid))
print("R2 :",r2_l_test_grid)
Adjusted_R2_l_test_grid=(1-(1-r2_score((y_test), (y_pred_test_lasso_grid)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_lasso_grid)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### RIDGE REGRESSION

In [None]:
#import the packages
from sklearn.linear_model import Ridge

ridge= Ridge(alpha=0.1)

# Fit the Ridge model
ridge.fit(X_train, y_train)

# Predictions for the train and test sets
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)

# R2 score on the test and train sets
print(ridge.score(X_test, y_test), ridge.score(X_train, y_train))

print('-----------------------------------------------------------------')

# Check the coefficients
print(ridge.coef_)

For Train Data

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r_train= mean_squared_error((y_train), (y_pred_train_ridge))
print("MSE :",MSE_r_train)

#calculate RMSE
RMSE_r_train=np.sqrt(MSE_r_train)
print("RMSE :",RMSE_r_train)


#calculate MAE
MAE_r_train= mean_absolute_error(y_train, y_pred_train_ridge)
print("MAE :",MAE_r_train)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (training metrics use the training-set shape)
r2_r_train= r2_score(y_train, y_pred_train_ridge)
print("R2 :",r2_r_train)
Adjusted_R2_r_train = 1-(1-r2_r_train)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_r_train)

For Test Data

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r_test= mean_squared_error(y_test, y_pred_test_ridge)
print("MSE :",MSE_r_test)

#calculate RMSE
RMSE_r_test=np.sqrt(MSE_r_test)
print("RMSE :",RMSE_r_test)


#calculate MAE
MAE_r_test= mean_absolute_error(y_test, y_pred_test_ridge)
print("MAE :",MAE_r_test)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_r_test= r2_score((y_test), (y_pred_test_ridge))
print("R2 :",r2_r_test)
Adjusted_R2_r_test=(1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

#### Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot((y_pred_test_ridge))
plt.plot((np.array(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_ridge),(y_test)-(y_pred_test_ridge))
plt.show()

#### Cross-Validation & Hyperparameter Tuning

In [None]:
# Use a GridSearch CV for hyperparameter tuning of ridge regression
from sklearn.model_selection import GridSearchCV
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0,
                    10.0, 20, 50, 100, 500, 1000 ]}

grid_cv_ridge = GridSearchCV(estimator=ridge,
                       param_grid=params,
                       scoring='neg_mean_absolute_error',
                       cv=5,
                       return_train_score=True,
                       verbose=1)
# Fit the ridge search object (the lasso search was fitted here by mistake)
grid_ridge_reg= grid_cv_ridge.fit(X_train, y_train)

In [None]:
# Predictions for the train and test sets
y_pred_train_ridge_grid=grid_ridge_reg.predict(X_train)
y_pred_test_ridge_grid=grid_ridge_reg.predict(X_test)

# Model score; with scoring='neg_mean_absolute_error' this is the negative MAE, not R2
print(grid_ridge_reg.score(X_test, y_test), grid_ridge_reg.score(X_train, y_train))

For Train Data

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r_train_grid= mean_squared_error((y_train), (y_pred_train_ridge_grid))
print("MSE :",MSE_r_train_grid)

#calculate RMSE
RMSE_r_train_grid=np.sqrt(MSE_r_train_grid)
print("RMSE :",RMSE_r_train_grid)


#calculate MAE
MAE_r_train_grid= mean_absolute_error(y_train, y_pred_train_ridge_grid)
print("MAE :",MAE_r_train_grid)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (training metrics use the training-set shape)
r2_r_train_grid= r2_score(y_train, y_pred_train_ridge_grid)
print("R2 :",r2_r_train_grid)
Adjusted_R2_r_train_grid = 1-(1-r2_r_train_grid)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_r_train_grid)

For Test Data

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r_test_grid= mean_squared_error(y_test, y_pred_test_ridge_grid)
print("MSE :",MSE_r_test_grid)

#calculate RMSE
RMSE_r_test_grid=np.sqrt(MSE_r_test_grid)
print("RMSE :",RMSE_r_test_grid)


#calculate MAE
MAE_r_test_grid= mean_absolute_error(y_test, y_pred_test_ridge_grid)
print("MAE :",MAE_r_test_grid)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_r_test_grid= r2_score((y_test), (y_pred_test_ridge_grid))
print("R2 :",r2_r_test_grid)
Adjusted_R2_r_test_grid=(1-(1-r2_score((y_test), (y_pred_test_ridge_grid)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_ridge_grid)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### ELASTIC NET REGRESSION

In [None]:
#import the packages
# Elastic net combines both penalties: a * L1 + b * L2
# In scikit-learn's parameterisation: alpha = a + b and l1_ratio = a / (a + b)
from sklearn.linear_model import ElasticNet
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)

#FIT THE MODEL
elasticnet.fit(X_train,y_train)

# Predictions for the train and test sets
y_pred_train_en=elasticnet.predict(X_train)
y_pred_test_en=elasticnet.predict(X_test)

#check the score
elasticnet.score(X_train, y_train)
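
The comment in the cell above relates the two penalty weights `a` (L1) and `b` (L2) to the `alpha` and `l1_ratio` arguments of `ElasticNet`. The conversion can be written out as a short helper; `to_elasticnet_params` is a hypothetical name, and the weights in the example are chosen only to reproduce the `alpha=0.1, l1_ratio=0.5` setting used here.

```python
def to_elasticnet_params(a, b):
    """Convert penalty weights (a for L1, b for L2) to (alpha, l1_ratio)."""
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    alpha = a + b
    l1_ratio = a / (a + b)
    return alpha, l1_ratio

# Equal weights on both penalties give l1_ratio = 0.5
alpha, l1_ratio = to_elasticnet_params(a=0.05, b=0.05)
print("alpha:", alpha, "l1_ratio:", l1_ratio)
```

Setting `b = 0` gives `l1_ratio = 1`, which is a pure lasso penalty, while `a = 0` gives `l1_ratio = 0`, a pure ridge-style penalty.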

For Train Data

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_e_train= mean_squared_error((y_train), (y_pred_train_en))
print("MSE :",MSE_e_train)

#calculate RMSE
RMSE_e_train=np.sqrt(MSE_e_train)
print("RMSE :",RMSE_e_train)


#calculate MAE
MAE_e_train= mean_absolute_error(y_train, y_pred_train_en)
print("MAE :",MAE_e_train)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (training metrics use the training-set shape)
r2_e_train= r2_score(y_train, y_pred_train_en)
print("R2 :",r2_e_train)
Adjusted_R2_e_train = 1-(1-r2_e_train)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_e_train)

For Test Data

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_e_test= mean_squared_error(y_test, y_pred_test_en)
print("MSE :",MSE_e_test)

#calculate RMSE
RMSE_e_test=np.sqrt(MSE_e_test)
print("RMSE :",RMSE_e_test)


#calculate MAE
MAE_e_test= mean_absolute_error(y_test, y_pred_test_en)
print("MAE :",MAE_e_test)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_e_test= r2_score((y_test), (y_pred_test_en))
print("R2 :",r2_e_test)
Adjusted_R2_e_test=(1-(1-r2_score((y_test), (y_pred_test_en)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_en)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(np.array(y_pred_test_en))
plt.plot((np.array(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# Heteroscedasticity check: residuals against predicted values
plt.scatter((y_pred_test_en),(y_test)-(y_pred_test_en))
plt.show()

### Cross-Validation & Hyperparameter Tuning

In [None]:
# Use GridSearchCV for hyperparameter tuning of elastic net regression
from sklearn.model_selection import GridSearchCV
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0,
                    10.0, 20, 50, 100, 500, 1000 ]}

grid_cv_en = GridSearchCV(estimator=elasticnet,
                       param_grid=params,
                       scoring='neg_mean_absolute_error',
                       cv=5,
                       return_train_score=True,
                       verbose=1)
grid_en_reg= grid_cv_en.fit(X_train, y_train)

In [None]:
# Predictions for the train and test sets
y_pred_train_en_grid=grid_en_reg.predict(X_train)
y_pred_test_en_grid=grid_en_reg.predict(X_test)

# Model score; with scoring='neg_mean_absolute_error' this is the negative MAE, not R2
print(grid_en_reg.score(X_test, y_test), grid_en_reg.score(X_train, y_train))

For Train Data

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_e_train_grid= mean_squared_error((y_train), (y_pred_train_en_grid))
print("MSE :",MSE_e_train_grid)

#calculate RMSE
RMSE_e_train_grid=np.sqrt(MSE_e_train_grid)
print("RMSE :",RMSE_e_train_grid)


#calculate MAE
MAE_e_train_grid= mean_absolute_error(y_train, y_pred_train_en_grid)
print("MAE :",MAE_e_train_grid)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (training metrics use the training-set shape)
r2_e_train_grid= r2_score(y_train, y_pred_train_en_grid)
print("R2 :",r2_e_train_grid)
Adjusted_R2_e_train_grid = 1-(1-r2_e_train_grid)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 :",Adjusted_R2_e_train_grid)

For Test Data

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_e_test_grid= mean_squared_error(y_test, y_pred_test_en_grid)
print("MSE :",MSE_e_test_grid)

#calculate RMSE
RMSE_e_test_grid=np.sqrt(MSE_e_test_grid)
print("RMSE :",RMSE_e_test_grid)


#calculate MAE
MAE_e_test_grid= mean_absolute_error(y_test, y_pred_test_en_grid)
print("MAE :",MAE_e_test_grid)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_e_test_grid= r2_score((y_test), (y_pred_test_en_grid))
print("R2 :",r2_e_test_grid)
Adjusted_R2_e_test_grid=(1-(1-r2_score((y_test), (y_pred_test_en_grid)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_en_grid)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

### Evaluating the models

#### 1- Without Hyperparameter Tuning

Evaluation DataFrame for the train dataset

In [None]:
regression = ['LinearRegression', 'LASSO REGRESSION', 'RIDGE REGRESSION', 'ELASTIC NET REGRESSION']

MSE = [MSE_lr_train, MSE_l_train, MSE_r_train, MSE_e_train]
RMSE = [RMSE_lr_train, RMSE_l_train, RMSE_r_train, RMSE_e_train]
MAE = [MAE_lr_train, MAE_l_train, MAE_r_train, MAE_e_train]
R2 = [r2_lr_train, r2_l_train, r2_r_train, r2_e_train]
Adjusted_R2 = [Adjusted_R2_lr_train, Adjusted_R2_l_train, Adjusted_R2_r_train, Adjusted_R2_e_train]

In [None]:
pd.DataFrame({'regression_train':regression, 'MSE': MSE, 'RMSE': RMSE, 'MAE': MAE, 'R2': R2,'Adjusted_R2':Adjusted_R2})

Evaluation DataFrame for the test dataset

In [None]:
regression = ['LinearRegression', 'LASSO REGRESSION', 'RIDGE REGRESSION', 'ELASTIC NET REGRESSION']

MSE = [MSE_lr_test, MSE_l_test, MSE_r_test, MSE_e_test]
RMSE = [RMSE_lr_test, RMSE_l_test, RMSE_r_test, RMSE_e_test]
MAE = [MAE_lr_test, MAE_l_test, MAE_r_test, MAE_e_test]
R2 = [r2_lr_test, r2_l_test, r2_r_test, r2_e_test]
Adjusted_R2 = [Adjusted_R2_lr_test, Adjusted_R2_l_test, Adjusted_R2_r_test, Adjusted_R2_e_test]

In [None]:
pd.DataFrame({'regression_test':regression, 'MSE': MSE, 'RMSE': RMSE, 'MAE': MAE, 'R2': R2,'Adjusted_R2':Adjusted_R2})

#### 2- After Hyperparameter Tuning

Evaluation DataFrame for the train dataset

In [None]:
regression = ['LASSO REGRESSION', 'RIDGE REGRESSION', 'ELASTIC NET REGRESSION']

MSE = [MSE_l_train_grid, MSE_r_train_grid, MSE_e_train_grid]
RMSE = [RMSE_l_train_grid, RMSE_r_train_grid, RMSE_e_train_grid]
MAE = [MAE_l_train_grid, MAE_r_train_grid, MAE_e_train_grid]
R2 = [r2_l_train_grid, r2_r_train_grid, r2_e_train_grid]
Adjusted_R2 = [Adjusted_R2_l_train_grid, Adjusted_R2_r_train_grid, Adjusted_R2_e_train_grid]

In [None]:
pd.DataFrame({'regression_train_grid':regression, 'MSE': MSE, 'RMSE': RMSE, 'MAE': MAE, 'R2': R2,'Adjusted_R2':Adjusted_R2})

Evaluation DataFrame for the test dataset

In [None]:
regression = ['LASSO REGRESSION', 'RIDGE REGRESSION', 'ELASTIC NET REGRESSION']

MSE = [MSE_l_test_grid, MSE_r_test_grid, MSE_e_test_grid]
RMSE = [RMSE_l_test_grid, RMSE_r_test_grid, RMSE_e_test_grid]
MAE = [MAE_l_test_grid, MAE_r_test_grid, MAE_e_test_grid]
R2 = [r2_l_test_grid, r2_r_test_grid, r2_e_test_grid]
Adjusted_R2 = [Adjusted_R2_l_test_grid, Adjusted_R2_r_test_grid, Adjusted_R2_e_test_grid]

In [None]:
pd.DataFrame({'regression_test_grid':regression, 'MSE': MSE, 'RMSE': RMSE, 'MAE': MAE, 'R2': R2,'Adjusted_R2':Adjusted_R2})


# **Conclusion**



*   No overfitting is observed, since the training and test scores are close.
*   Linear Regression and Ridge Regression give the highest R2 scores: 75.9% on the training set and 76% on the test set.



Because this data is time dependent, the values of variables such as temperature, wind speed and solar radiation will not always be consistent, so there will be scenarios where the model does not perform well. Machine learning is also a rapidly evolving field, so we should be prepared for such contingencies and keep checking the model from time to time.
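
One way to check how the model copes with changing conditions is to hold out the most recent rows instead of a random sample. The sketch below is an illustration only, under the assumption that the rows are sorted by date and hour; `chronological_split` is a hypothetical helper and the row count is an assumed figure for one year of hourly data.

```python
def chronological_split(n_rows, test_fraction=0.2):
    """Return (train_idx, test_idx) where the test rows are the latest ones."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = round(n_rows * test_fraction)
    cutoff = n_rows - n_test
    train_idx = list(range(cutoff))
    test_idx = list(range(cutoff, n_rows))
    return train_idx, test_idx

# Assumed size: 365 days x 24 hours of records, sorted oldest to newest
train_idx, test_idx = chronological_split(n_rows=8760, test_fraction=0.2)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Evaluating on the latest block of rows gives a more realistic picture of deployment than a shuffled split, because the model is always asked about a period it has not seen.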
