# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, it requires information on the trends and patterns of the country's renewable and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
# 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [1]:
# Libraries for data loading, data manipulation and data visualisation

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Libraries for data preparation and model building

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Setting global constants to ensure notebook results are reproducible

dtr_random_state = 42
rfr_random_state = 0

# Libraries for pickling the model

import pickle

import warnings
warnings.filterwarnings('ignore')

In [None]:
#from scipy.stats import norm
#from scipy import stats
#from scipy.stats import pearsonr

<a id="two"></a>
# 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [2]:
df = pd.read_csv('df_train.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,...,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min,load_shortfall_3h
0,0,2015-01-01 03:00:00,0.666667,level_5,0.0,0.666667,74.333333,64.0,0.0,1.0,...,265.938,281.013,269.338615,269.338615,281.013,269.338615,274.254667,265.938,265.938,6715.666667
1,1,2015-01-01 06:00:00,0.333333,level_10,0.0,1.666667,78.333333,64.666667,0.0,1.0,...,266.386667,280.561667,270.376,270.376,280.561667,270.376,274.945,266.386667,266.386667,4171.666667
2,2,2015-01-01 09:00:00,1.0,level_9,0.0,1.0,71.333333,64.333333,0.0,1.0,...,272.708667,281.583667,275.027229,275.027229,281.583667,275.027229,278.792,272.708667,272.708667,4274.666667
3,3,2015-01-01 12:00:00,1.0,level_8,0.0,1.0,65.333333,56.333333,0.0,1.0,...,281.895219,283.434104,281.135063,281.135063,283.434104,281.135063,285.394,281.895219,281.895219,5075.666667
4,4,2015-01-01 15:00:00,1.0,level_7,0.0,1.0,59.0,57.0,2.0,0.333333,...,280.678437,284.213167,282.252063,282.252063,284.213167,282.252063,285.513719,280.678437,280.678437,6620.666667


The first column, `Unnamed: 0`, only repeats the row index, so it adds no value to the analysis or the model and is dropped.

In [4]:
df = df.drop(df.columns[0], axis=1)

In [5]:
df.shape

(8763, 48)
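
The cell above drops the column by position. As a minimal sketch, the same result can be reached by dropping it by name or by reading it as the index. The two-row CSV text below is a made-up stand-in for `df_train.csv`, and the variable names are illustrative.

```python
import io
import pandas as pd

# Tiny stand-in for df_train.csv; the real file has many more columns.
# The empty first header is what pandas names "Unnamed: 0".
csv_text = (
    ",time,Madrid_wind_speed,load_shortfall_3h\n"
    "0,2015-01-01 03:00:00,0.666667,6715.666667\n"
    "1,2015-01-01 06:00:00,0.333333,4171.666667\n"
)

# Option 1: drop the column by name, which states explicitly what is removed.
df_by_name = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

# Option 2: treat the first CSV column as the row index while reading.
df_by_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_by_name.columns.tolist())
print(df_by_index.columns.tolist())
```

Dropping by name fails loudly if the column is missing, whereas dropping by position would silently remove whatever column happens to come first.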

The test dataset contains the same `Unnamed: 0` column, so we drop it immediately after reading the CSV file and then inspect the first five observations.

In [6]:
df_test = pd.read_csv('df_test.csv')
df_test = df_test.drop(df_test.columns[0], axis=1)

In [7]:
df_test.head()

Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,...,Barcelona_temp_max,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min
0,2018-01-01 00:00:00,5.0,level_8,0.0,5.0,87.0,71.333333,20.0,3.0,0.0,...,287.816667,280.816667,287.356667,276.15,280.38,286.816667,285.15,283.15,279.866667,279.15
1,2018-01-01 03:00:00,4.666667,level_8,0.0,5.333333,89.0,78.0,0.0,3.666667,0.0,...,284.816667,280.483333,284.19,277.816667,281.01,283.483333,284.15,281.15,279.193333,278.15
2,2018-01-01 06:00:00,2.333333,level_7,0.0,5.0,89.0,89.666667,0.0,2.333333,6.666667,...,284.483333,276.483333,283.15,276.816667,279.196667,281.816667,282.15,280.483333,276.34,276.15
3,2018-01-01 09:00:00,2.666667,level_7,0.0,5.333333,93.333333,82.666667,26.666667,5.666667,6.666667,...,284.15,277.15,283.19,279.15,281.74,282.15,284.483333,279.15,275.953333,274.483333
4,2018-01-01 12:00:00,4.0,level_7,0.0,8.666667,65.333333,64.0,26.666667,10.666667,0.0,...,287.483333,281.15,286.816667,281.816667,284.116667,286.15,286.816667,284.483333,280.686667,280.15


In [8]:
df_test.shape

(2920, 47)

The test set has 47 columns, one fewer than the 48 in the training set, because it does not contain the target variable `load_shortfall_3h`.
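
A quick way to confirm which columns differ between two DataFrames is to compare their column indexes. The sketch below uses two short, made-up column lists; in the notebook the same calls could be applied to `df.columns` and `df_test.columns`.

```python
import pandas as pd

# Illustrative column sets; the real DataFrames have 48 and 47 columns.
train_cols = pd.Index(["time", "Madrid_wind_speed", "load_shortfall_3h"])
test_cols = pd.Index(["time", "Madrid_wind_speed"])

# Columns present in the training set but absent from the test set.
missing_from_test = train_cols.difference(test_cols).tolist()

# Columns present in the test set but absent from the training set.
extra_in_test = test_cols.difference(train_cols).tolist()

print(missing_from_test)
print(extra_in_test)
```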

<a id="three"></a>
# 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [None]:
# look at data statistics

In [None]:
# plot relevant feature interactions

In [None]:
# evaluate correlation

In [None]:
# have a look at feature distributions

<a id="four"></a>
# 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [None]:
# remove missing values/ features

In [None]:
# engineer existing features

In [None]:
# create new features

In [10]:
df_clean = df.copy()

Before we commence data cleaning and feature engineering, a copy of the training dataset is made.

## Imputing Data for Null Values

In [11]:
df_clean["Valencia_pressure"] = df_clean["Valencia_pressure"].fillna(df_clean["Valencia_pressure"].mode()[0])

Missing values in `Valencia_pressure` are replaced with the mode (the most frequent value) of that column.
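
The sketch below shows the same mode-based imputation on a small series and counts the missing values before and after. The pressure readings are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up pressure readings with two gaps.
pressure = pd.Series([1002.0, np.nan, 1015.0, 1015.0, np.nan],
                     name="Valencia_pressure")

n_missing_before = int(pressure.isnull().sum())

# mode() returns a Series of the most frequent values; take the first one.
fill_value = pressure.mode()[0]
filled = pressure.fillna(fill_value)

n_missing_after = int(filled.isnull().sum())
print(n_missing_before, fill_value, n_missing_after)
```

Checking the null count before and after makes it easy to verify that the imputation touched only the rows that were missing.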

## Dropping Features

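All `temp_min` and `temp_max` columns are dropped, keeping only each city's `temp` column. In the training rows displayed earlier, the Madrid and Bilbao minimum and maximum temperatures are identical to the corresponding `temp` values, which suggests that these columns add little information beyond `temp`.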

In [12]:
df_clean = df_clean.loc[:, ~df_clean.columns.str.contains("temp_min")]
df_clean = df_clean.loc[:, ~df_clean.columns.str.contains("temp_max")]

## Engineering Existing Features

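The following cell converts `Valencia_wind_deg` and `Seville_pressure`, both stored as text labels (for example `level_5`), to numbers by extracting their digits, and parses `time` into a datetime type.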

In [13]:
# Raw strings (r'...') keep the regex backslash from being read as a string escape
df_clean["Valencia_wind_deg"] = df_clean["Valencia_wind_deg"].str.extract(r'(\d+)')
df_clean["Valencia_wind_deg"] = pd.to_numeric(df_clean["Valencia_wind_deg"])

df_clean["Seville_pressure"] = df_clean["Seville_pressure"].str.extract(r'(\d+)')
df_clean["Seville_pressure"] = pd.to_numeric(df_clean["Seville_pressure"])

df_clean["time"] = pd.to_datetime(df_clean["time"])
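
As a small illustration of the digit extraction above, the sketch below applies the same pattern to three labels taken from the `Valencia_wind_deg` column shown earlier. Passing `expand=False` is a choice made here so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Three labels as they appear in the Valencia_wind_deg column.
wind = pd.Series(["level_5", "level_10", "level_9"], name="Valencia_wind_deg")

# Capture the run of digits in each label; expand=False yields a Series.
digits = wind.str.extract(r"(\d+)", expand=False)

# The captured digits are still text, so convert them to numbers.
wind_numeric = pd.to_numeric(digits)
print(wind_numeric.tolist())
```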

## Engineering New Features

In [14]:
df_clean["Hour"] = df_clean["time"].dt.hour
df_clean["Day"] = df_clean["time"].dt.day
df_clean["Weekday"] = df_clean["time"].dt.weekday
df_clean["Week"] = df_clean["time"].dt.isocalendar().week
df_clean['Week'] = df_clean['Week'].astype('int64')
df_clean["Month"] = df_clean["time"].dt.month
df_clean["Year"] = df_clean["time"].dt.year

df_clean = df_clean.drop(['time'], axis=1)

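The cell above splits the `time` column into `Hour`, `Day`, `Weekday`, `Week`, `Month` and `Year` features so that the models can pick up daily, weekly and seasonal patterns. The original `time` column is then dropped.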
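
The sketch below derives the same calendar features for two timestamps, chosen here purely for illustration. It also shows one property of the ISO week number that is worth knowing: a date at the very end of December can belong to week 1 of the following ISO year.

```python
import pandas as pd

# Two illustrative timestamps: the first training row and a late-December date.
times = pd.to_datetime(pd.Series(["2015-01-01 03:00:00", "2018-12-31 18:00:00"]))

features = pd.DataFrame({
    "Hour": times.dt.hour,
    "Day": times.dt.day,
    "Weekday": times.dt.weekday,            # Monday is 0, Sunday is 6
    "Week": times.dt.isocalendar().week.astype("int64"),  # ISO week number
    "Month": times.dt.month,
    "Year": times.dt.year,
})
print(features)
```

For 31 December 2018 the `Week` value is 1 even though `Month` is 12, because that Monday starts the first ISO week of 2019.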

In [15]:
seasons = {1: 'Winter',
           2: 'Winter',
           3: 'Spring',
           4: 'Spring',
           5: 'Spring',
           6: 'Summer',
           7: 'Summer',
           8: 'Summer',
           9: 'Autumn',
           10: 'Autumn',
           11: 'Autumn',
           12: 'Winter',
           }

df_clean['Season'] = df_clean['Month'].apply(lambda x: seasons[x])

df_dum = pd.get_dummies(df_clean)

df_dum.columns = [col.replace(" ","_") for col in df_dum.columns]

df_dum['Season_Winter'] = df_dum['Season_Winter'].astype('int64')
df_dum['Season_Spring'] = df_dum['Season_Spring'].astype('int64')
df_dum['Season_Summer'] = df_dum['Season_Summer'].astype('int64')
df_dum['Season_Autumn'] = df_dum['Season_Autumn'].astype('int64')

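Each observation is assigned a `Season` based on its month. `pd.get_dummies` then one-hot encodes the season into four `Season_*` indicator columns, which are cast to `int64`.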
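
A compact sketch of the same encoding on five illustrative months follows. It uses `Series.map` for the lookup and the `dtype` argument of `pd.get_dummies` to obtain integer columns directly, which is an alternative to casting each column afterwards.

```python
import pandas as pd

# Same month-to-season mapping as the notebook's `seasons` dictionary.
seasons = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}

months = pd.Series([1, 4, 7, 10, 12], name="Month")

# Look up the season of each month.
season = months.map(seasons).rename("Season")

# One indicator column per season, stored as integers.
dummies = pd.get_dummies(season, prefix="Season", dtype="int64")
print(dummies)
```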

<a id="five"></a>
# 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three hour load shortfall. |

---

The following section will detail the creation, training and predictions of three regression models as follows:
1. Multiple Linear Regressor
2. Decision Tree Regressor
3. Random Forest Regressor

In [None]:
# split data

In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models

In [None]:
# evaluate one or more ML models

First, we split the training data into the predictor and target variable datasets.

We then use the `train_test_split` function to divide the now separate predictor and target variables into training and validation datasets (i.e. X_train, X_valid, y_train, y_valid), holding back 20% of the observations for validation.

In [16]:
X = df_dum.drop(['load_shortfall_3h'], axis=1) # Predictor Variables
y = df_dum['load_shortfall_3h']                # Target Variable

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, random_state=1)
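
The sketch below shows what an 80/20 split does to a ten-row toy dataset. It also shows a chronological split with `shuffle=False`, which is not what the notebook does but may be worth considering because the observations are ordered in time. All array names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy observations with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Shuffled split, as used in the notebook.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1)

# Chronological alternative: the last 20% of rows become the validation set.
X_tr_c, X_va_c, y_tr_c, y_va_c = train_test_split(
    X_demo, y_demo, test_size=0.20, shuffle=False)

print(X_tr.shape, X_va.shape)
print(y_va_c)
```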

## 5.1 Model Creation

## Multiple Linear Regressor

In [17]:
mlr = LinearRegression()
mlr.fit(X_train, y_train)

LinearRegression()

## Decision Tree Regressor

In [18]:
dtr = DecisionTreeRegressor(random_state = dtr_random_state)
dtr.fit(X_train, y_train)

DecisionTreeRegressor(random_state=42)

## Random Forest Regressor

In [19]:
rfr = RandomForestRegressor(n_estimators = 100, random_state = rfr_random_state)
rfr.fit(X_train, y_train)

RandomForestRegressor(random_state=0)

All of the above models are trained only on the training split (X_train and y_train), while the validation set (X_valid and y_valid) is held back.

In [20]:
results_dict = {'Training RMSE':
                    {
                        "Multiple Linear Regression": np.sqrt(mean_squared_error(y_train, mlr.predict(X_train))),
                        "Decision Tree Regression": np.sqrt(mean_squared_error(y_train, dtr.predict(X_train))),
                        "Random Forest Regression": np.sqrt(mean_squared_error(y_train, rfr.predict(X_train)))
                    },
                'Validation RMSE':
                    {
                        "Multiple Linear Regression": np.sqrt(mean_squared_error(y_valid, mlr.predict(X_valid))),
                        "Decision Tree Regression": np.sqrt(mean_squared_error(y_valid, dtr.predict(X_valid))),
                        "Random Forest Regression": np.sqrt(mean_squared_error(y_valid, rfr.predict(X_valid)))
                    },
                'Training Rscore':
                    {
                        "Multiple Linear Regression": r2_score(y_train, mlr.predict(X_train)),
                        "Decision Tree Regression": r2_score(y_train, dtr.predict(X_train)),
                        "Random Forest Regression": r2_score(y_train, rfr.predict(X_train))
                    },
                'Validation Rscore':
                    {
                        "Multiple Linear Regression": r2_score(y_valid, mlr.predict(X_valid)),
                        "Decision Tree Regression": r2_score(y_valid, dtr.predict(X_valid)),
                        "Random Forest Regression": r2_score(y_valid, rfr.predict(X_valid)) 
                    }
                }

results_df = pd.DataFrame(data=results_dict)
results_df

Unnamed: 0,Training RMSE,Validation RMSE,Training Rscore,Validation Rscore
Multiple Linear Regression,4675.437541,4643.808937,0.200765,0.192969
Decision Tree Regression,0.0,3875.795353,1.0,0.437836
Random Forest Regression,1006.352531,2634.636329,0.962972,0.740234


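The table reports the root mean squared error (RMSE, lower is better) and the R² score (labelled Rscore, higher is better) on the training and validation sets. The multiple linear regression performs similarly on both sets (R² of roughly 0.20 and 0.19), so it is not overfitting but explains little of the variance. The decision tree fits the training data perfectly (RMSE 0.0, R² 1.0) yet reaches an R² of only 0.44 on the validation set, a clear sign of overfitting. The random forest also fits the training data more closely than the validation data (RMSE of about 1006 versus 2635), but it has the lowest validation RMSE and the highest validation R² (0.74) of the three models.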
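
To make the two metrics concrete, the sketch below computes RMSE and R² by hand with NumPy on four made-up values. These are the standard definitions that `mean_squared_error` (followed by a square root) and `r2_score` implement.

```python
import numpy as np

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

RMSE is expressed in the units of the target, while R² is unitless and compares the model against always predicting the mean.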

In [21]:
mlr.fit(X, y)
linearpreds = mlr.predict(X)

In [22]:
dtr.fit(X, y)
treepreds = dtr.predict(X)

In [23]:
rfr.fit(X, y)
forestpreds = rfr.predict(X)

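In the cells above, each model is refitted on all of the data in the original training dataset (X and y), and the RMSE and R² on that same data are reported below. Because these metrics are computed on the data used for fitting, they measure goodness of fit rather than performance on unseen data.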

In [24]:
results_dictX = {'RMSE for X':
                    {
                        "Multiple Linear Regression": np.sqrt(mean_squared_error(y, linearpreds)),
                        "Decision Tree Regression": np.sqrt(mean_squared_error(y, treepreds)),
                        "Random Forest Regression": np.sqrt(mean_squared_error(y, forestpreds))
                    },
                'Rscore for X':
                    {
                        "Multiple Linear Regression": r2_score(y, linearpreds),
                        "Decision Tree Regression": r2_score(y, treepreds),
                        "Random Forest Regression": r2_score(y, forestpreds)
                    }
                }

results_dfX = pd.DataFrame(data=results_dictX)
results_dfX

Unnamed: 0,RMSE for X,Rscore for X
Multiple Linear Regression,4667.021386,0.199957
Decision Tree Regression,0.0,1.0
Random Forest Regression,943.997753,0.967268
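
Since metrics computed on the fitting data tend to be optimistic, one possible complement is k-fold cross-validation, which scores each fold on data the model has not seen. The sketch below is not part of the original workflow. It uses synthetic data and a small forest so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends mainly on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# scikit-learn maximises scores, so RMSE is returned with a negative sign.
scores = cross_val_score(model, X_demo, y_demo, cv=5,
                         scoring="neg_root_mean_squared_error")
cv_rmse = -scores

print(cv_rmse.mean())
```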


## 5.2 Comparison of Predictions

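The table below places each model's predictions for the training data next to the known values of the target variable `load_shortfall_3h`. The decision tree's predictions match the target exactly, which is consistent with its training RMSE of 0.0.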

In [25]:
train_predictions = pd.concat([pd.DataFrame(linearpreds, columns=['Multiple Linear Regression']),
                               pd.DataFrame(treepreds, columns=['Decision Tree Regression']),
                               pd.DataFrame(forestpreds, columns=['Random Forest Regression']), y], axis =1)
train_predictions

Unnamed: 0,Multiple Linear Regression,Decision Tree Regression,Random Forest Regression,load_shortfall_3h
0,9746.125825,6715.666667,6145.710000,6715.666667
1,9158.339116,4171.666667,4159.906667,4171.666667
2,9513.514667,4274.666667,4488.166667,4274.666667
3,10079.326586,5075.666667,5604.960000,5075.666667
4,10377.215007,6620.666667,6513.820000,6620.666667
...,...,...,...,...
8758,11364.802872,-28.333333,1174.730000,-28.333333
8759,10993.212284,2266.666667,2775.543333,2266.666667
8760,9217.076488,822.000000,1222.063333,822.000000
8761,10371.789996,-760.000000,1109.246667,-760.000000


<a id="six"></a>
# 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

## Returning to the Test Dataset

Every transformation applied to the training dataset must be repeated for the test dataset so that the models receive the same features they were trained on.

In [26]:
df_t = df_test.copy()

# impute missing values and drop redundant features

df_t["Valencia_pressure"] = df_t["Valencia_pressure"].fillna(df_t["Valencia_pressure"].mode()[0]) 

df_t = df_t.loc[:, ~df_t.columns.str.contains("temp_min")]
df_t = df_t.loc[:, ~df_t.columns.str.contains("temp_max")]

# engineer existing features

# Raw strings (r'...') keep the regex backslash from being read as a string escape
df_t["Valencia_wind_deg"] = df_t["Valencia_wind_deg"].str.extract(r'(\d+)')
df_t["Valencia_wind_deg"] = pd.to_numeric(df_t["Valencia_wind_deg"])

df_t["Seville_pressure"] = df_t["Seville_pressure"].str.extract(r'(\d+)')
df_t["Seville_pressure"] = pd.to_numeric(df_t["Seville_pressure"])

df_t["time"] = pd.to_datetime(df_t["time"])

# Engineer new features

df_t["Hour"] = df_t["time"].dt.hour
df_t["Day"] = df_t["time"].dt.day
df_t["Weekday"] = df_t["time"].dt.weekday
df_t["Week"] = df_t["time"].dt.isocalendar().week
df_t['Week'] = df_t['Week'].astype('int64')
df_t["Month"] = df_t["time"].dt.month
df_t["Year"] = df_t["time"].dt.year

df_t = df_t.drop(['time'], axis=1)

df_t['Season'] = df_t['Month'].apply(lambda x: seasons[x])

df_dt = pd.get_dummies(df_t)

df_dt.columns = [col.replace(" ","_") for col in df_dt.columns]

df_dt['Season_Winter'] = df_dt['Season_Winter'].astype('int64')
df_dt['Season_Spring'] = df_dt['Season_Spring'].astype('int64')
df_dt['Season_Summer'] = df_dt['Season_Summer'].astype('int64')
df_dt['Season_Autumn'] = df_dt['Season_Autumn'].astype('int64')
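
Because the same steps are applied to both datasets, they could be wrapped in a single function so that the two cannot drift apart. The sketch below is one possible shape for such a helper. `prepare_features`, its `pressure_fill` parameter, the demo DataFrame and the `sp25` style labels for `Seville_pressure` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Month-to-season lookup, same mapping as the notebook's `seasons` dictionary.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}


def prepare_features(raw, pressure_fill):
    """Hypothetical helper that mirrors the notebook's cleaning steps."""
    # Drop the temp_min / temp_max columns and work on a copy.
    keep = ~raw.columns.str.contains("temp_min|temp_max")
    out = raw.loc[:, keep].copy()

    # Impute missing pressure readings with a value chosen by the caller.
    out["Valencia_pressure"] = out["Valencia_pressure"].fillna(pressure_fill)

    # Turn text labels into numbers by keeping only their digits.
    for col in ["Valencia_wind_deg", "Seville_pressure"]:
        out[col] = pd.to_numeric(out[col].str.extract(r"(\d+)", expand=False))

    # Derive calendar features, then drop the raw timestamp.
    time = pd.to_datetime(out["time"])
    out["Hour"] = time.dt.hour
    out["Day"] = time.dt.day
    out["Weekday"] = time.dt.weekday
    out["Week"] = time.dt.isocalendar().week.astype("int64")
    out["Month"] = time.dt.month
    out["Year"] = time.dt.year
    out["Season"] = out["Month"].map(SEASONS)
    out = out.drop(columns=["time"])

    # One-hot encode the remaining text column (Season).
    return pd.get_dummies(out, dtype="int64")


# Made-up two-row input with the columns the helper expects.
demo = pd.DataFrame({
    "time": ["2018-01-01 00:00:00", "2018-07-01 03:00:00"],
    "Valencia_wind_deg": ["level_8", "level_7"],
    "Seville_pressure": ["sp25", "sp3"],      # assumed label format
    "Valencia_pressure": [np.nan, 1010.0],
    "Madrid_temp": [279.87, 295.0],
    "Madrid_temp_min": [279.15, 294.0],
    "Madrid_temp_max": [280.82, 296.0],
})
prepared = prepare_features(demo, pressure_fill=1012.0)
print(prepared.columns.tolist())
```

Passing the fill value as an argument lets the caller reuse a value derived from the training data for both sets. The notebook instead uses each dataset's own mode.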

In [27]:
df_dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2920 entries, 0 to 2919
Data columns (total 46 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Madrid_wind_speed     2920 non-null   float64
 1   Valencia_wind_deg     2920 non-null   int64  
 2   Bilbao_rain_1h        2920 non-null   float64
 3   Valencia_wind_speed   2920 non-null   float64
 4   Seville_humidity      2920 non-null   float64
 5   Madrid_humidity       2920 non-null   float64
 6   Bilbao_clouds_all     2920 non-null   float64
 7   Bilbao_wind_speed     2920 non-null   float64
 8   Seville_clouds_all    2920 non-null   float64
 9   Bilbao_wind_deg       2920 non-null   float64
 10  Barcelona_wind_speed  2920 non-null   float64
 11  Barcelona_wind_deg    2920 non-null   float64
 12  Madrid_clouds_all     2920 non-null   float64
 13  Seville_wind_speed    2920 non-null   float64
 14  Barcelona_rain_1h     2920 non-null   float64
 15  Seville_pressure     

In [28]:
test_mlr = mlr.predict(df_dt)

daf_mlr = pd.DataFrame(test_mlr, columns=['Multiple Linear Regression'])

In [29]:
test_dtr = dtr.predict(df_dt)

daf_dtr = pd.DataFrame(test_dtr, columns=['Decision Tree Regression'])

In [30]:
test_rfr = rfr.predict(df_dt)

daf_rfr = pd.DataFrame(test_rfr, columns=['Random Forest Regression'])

In the cells above, the three fitted models are used to predict the load shortfall from the engineered test predictors.
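
Predicting on a DataFrame relies on its columns matching the ones used for fitting. As a defensive sketch, the test features can be aligned to the training columns with `reindex` before calling `predict`. The two small DataFrames below are illustrative. In the notebook the equivalent call would be `df_dt.reindex(columns=X.columns, fill_value=0)`.

```python
import pandas as pd

# Illustrative training and test feature frames with mismatched columns.
train_features = pd.DataFrame({"Hour": [3], "Season_Summer": [0], "Season_Winter": [1]})
test_features = pd.DataFrame({"Season_Winter": [1], "Hour": [0]})

# Reorder to the training layout; absent dummy columns are filled with 0.
aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned)
```

Filling with 0 is only appropriate for indicator columns such as the `Season_*` dummies. A missing numeric feature would need a different treatment.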

In [31]:
full_predictions = pd.concat([daf_mlr, daf_dtr, daf_rfr], axis =1)
full_predictions

Unnamed: 0,Multiple Linear Regression,Decision Tree Regression,Random Forest Regression
0,9920.840811,10246.333333,10603.870000
1,9833.990571,11881.666667,8100.506667
2,10598.591711,8320.000000,7852.406667
3,10866.829643,14774.666667,10336.593333
4,9871.704599,17855.333333,13223.313333
...,...,...,...
2915,13310.901282,16656.000000,14510.980000
2916,15208.550806,13247.666667,15424.905000
2917,16454.467053,15934.000000,15583.223333
2918,16457.750416,14083.333333,15841.803333


The table above shows the predictions that each model makes on the test dataset.

## Creating Submission File

The following code combines the test timestamps with the random forest predictions and writes them to a CSV file for submission.

In [32]:
output = pd.DataFrame({"time":df_test['time']})
submission = output.join(daf_rfr)
submission.rename(columns = {'Random Forest Regression':'load_shortfall_3h'}, inplace = True)
submission.to_csv("submission.csv", index=False)
submission

Unnamed: 0,time,load_shortfall_3h
0,2018-01-01 00:00:00,10603.870000
1,2018-01-01 03:00:00,8100.506667
2,2018-01-01 06:00:00,7852.406667
3,2018-01-01 09:00:00,10336.593333
4,2018-01-01 12:00:00,13223.313333
...,...,...
2915,2018-12-31 09:00:00,14510.980000
2916,2018-12-31 12:00:00,15424.905000
2917,2018-12-31 15:00:00,15583.223333
2918,2018-12-31 18:00:00,15841.803333
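
As a small variation on the cell above, the submission frame can also be built in a single step from the timestamps and the prediction array, without a join and rename. The sketch below uses two rounded, illustrative values and writes to an in-memory buffer instead of a file so that the result can be read back and checked.

```python
import io
import pandas as pd

# Two illustrative timestamps and predictions.
times = pd.Series(["2018-01-01 00:00:00", "2018-01-01 03:00:00"], name="time")
preds = [10603.87, 8100.51]

# Build the two-column submission frame directly.
submission_demo = pd.DataFrame({"time": times, "load_shortfall_3h": preds})

# Write without the row index, then read back to check the layout.
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.columns.tolist())
```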


## Pickling the Chosen Model

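The chosen random forest model is serialised (pickled) to `rfr_model.pkl` so that it can be reused without retraining. The model is then loaded back from the file and used to predict on the test features, which confirms that the saved model reproduces the predictions shown earlier.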

In [33]:
model_save_path = "rfr_model.pkl"
with open(model_save_path,'wb') as file:
    pickle.dump(rfr, file)

In [34]:
model_load_path = "rfr_model.pkl"
with open(model_load_path,'rb') as file:
    unpickled_model = pickle.load(file)

In [35]:
y_pred = unpickled_model.predict(df_dt)
y_pred

array([10603.87      ,  8100.50666667,  7852.40666667, ...,
       15583.22333333, 15841.80333333, 17402.19333333])
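
The round trip above can be checked programmatically rather than by eye. The sketch below pickles a small linear model in memory and compares its predictions before and after. The data and variable names are illustrative.

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_demo, y_demo)

# Serialise to bytes and restore, without touching the file system.
restored = pickle.loads(pickle.dumps(model))

# The restored model should give the same predictions as the original.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

A pickled model should only be loaded with a compatible scikit-learn version and from a trusted source, since unpickling can execute arbitrary code.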

<a id="seven"></a>
# 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic