# Initial Modelling
We will build some basic models to predict both the minimum and maximum demand for a day, using findings from the EDA phase of the project.
The models will start simple, using a few datasets as predictors, and then become more complex.
RMSE and MAE will be used to evaluate and compare the models.

In [1]:
# Import general packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error
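
Since every model will be compared on RMSE and MAE, a small helper keeps the evaluation consistent. This is a minimal sketch: the function name `evaluate` and the toy numbers are assumptions for illustration, not part of the project code or its results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

def evaluate(y_true, y_pred):
    """Return RMSE and MAE for one set of predictions."""
    # RMSE is the square root of the mean squared error
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    return {'rmse': rmse, 'mae': mae}

# Toy values for illustration only, not results from the project data
scores = evaluate([8900.0, 9300.0, 8300.0], [9000.0, 9100.0, 8300.0])
print(scores)
```

RMSE penalises large errors more heavily than MAE, so reporting both gives a sense of whether a model's error is dominated by a few bad days.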

## Load Data and Format for Modelling
To begin with we will load in the:
- Total Demand Dataset
- NSW Temperature Dataset
- NSW Residential Solar Data
- NSW Population Data

The Residential Solar Data only goes up to 2020 and the Demand Data only starts in 2010, so for the train/test split, data from 2010-2018 will be used for training and data from 2019 and 2020 for testing. Data outside this range will be discarded for now.

Data recorded at intervals longer than one day will be linearly interpolated to daily values for now, for simplicity.
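
As a rough illustration of the interpolation step, the sketch below upsamples a hypothetical two-point yearly series to daily values. The frame `yearly` and its numbers are made up for the example.

```python
import pandas as pd

# Hypothetical yearly series; the values are made up for illustration
yearly = pd.DataFrame(
    {'value': [100.0, 465.0]},
    index=pd.to_datetime(['2019-01-01', '2020-01-01']),
)

# Upsample to daily frequency (new rows are NaN), then fill them linearly
daily = yearly.resample('D').interpolate(method='linear')
print(daily.head())
```

Linear interpolation assumes the quantity changes at a constant rate between recordings, which is a simplification worth revisiting for series with seasonal patterns.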

In [2]:
# Import demand dataset
demand_df = pd.read_csv('../data/raw/totaldemand_nsw.csv', names=['datetime', 'region', 'demand'], header=0)
demand_df['datetime'] = pd.to_datetime(demand_df['datetime'])
demand_df = demand_df.resample('D', on='datetime')['demand'].agg(['min', 'max'])
demand_df.rename(columns={'min':'demand_min', 'max':'demand_max'}, inplace=True)
demand_df.head()

datetime,demand_min,demand_max
2010-01-01,6157.36,8922.42
2010-01-02,6112.73,9326.64
2010-01-03,6014.91,8277.85
2010-01-04,6023.79,9522.3
2010-01-05,6287.12,10728.72


In [3]:
# Import temperature data
temp_df = pd.read_csv('../data/raw/temperature_nsw.csv', names=['datetime', 'location', 'temp'], header=0)
temp_df['datetime'] = pd.to_datetime(temp_df['datetime'])
# Drop placeholder readings of -9999 or below
temp_df.drop(temp_df[temp_df['temp'] <= -9999].index, inplace=True)
temp_df = temp_df.resample('D', on='datetime')['temp'].agg(['min', 'max', 'mean'])
temp_df.rename(columns={'min':'temp_min', 'max':'temp_max', 'mean':'temp_mean'}, inplace=True)
temp_df.head()

datetime,temp_min,temp_max,temp_mean
2010-01-01,22.1,28.8,25.094
2010-01-02,21.6,29.4,24.765385
2010-01-03,17.9,21.5,19.429825
2010-01-04,17.9,23.9,20.625926
2010-01-05,15.4,27.7,22.660417


In [4]:
# Import solar data
solar_df = pd.read_csv('../data/raw/nsw_residential_solar.csv', names=['datetime', 'units', 'cum_units', 'output', 'cum_output'], header=0)
solar_df['datetime'] = pd.to_datetime(solar_df['datetime'])
# Interpolate to get daily data
solar_df = solar_df.set_index('datetime').resample('D', convention='end').interpolate(method='linear')
solar_df.head()

datetime,units,cum_units,output,cum_output
2008-01-01,127.0,1882.0,287.946,2710.745
2008-01-02,128.451613,1887.548387,286.983677,2719.071258
2008-01-03,129.903226,1893.096774,286.021355,2727.397516
2008-01-04,131.354839,1898.645161,285.059032,2735.723774
2008-01-05,132.806452,1904.193548,284.09671,2744.050032


In [5]:
# Import population data
pop_df = pd.read_csv('../data/raw/NSW_population.csv', usecols=['TIME_PERIOD: Time Period', 'OBS_VALUE'], header=0)
pop_df.rename(columns={'TIME_PERIOD: Time Period':'datetime', 'OBS_VALUE':'population'}, inplace=True)
pop_df['datetime'] = pd.to_datetime(pop_df['datetime'], format='%Y')
# Interpolate yearly figures to get daily data
pop_df = pop_df.set_index('datetime').resample('D', convention='start').interpolate(method='linear')
pop_df.head()

datetime,population
2001-01-01,6530349.0
2001-01-02,6530487.0
2001-01-03,6530625.0
2001-01-04,6530764.0
2001-01-05,6530902.0


In [6]:
# Merge data frames and split into train/test sets.
# Start with untransformed data, then add models with transformations, e.g. sqrt(temp), ln(solar output)
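
One possible shape for the merge and split is sketched below. It uses synthetic stand-ins (`demand`, `temp`) rather than the real frames loaded above, and the random values are for illustration only; the join strategy is an assumption, not the project's final approach.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the daily frames (random values, illustration only)
demand_idx = pd.date_range('2010-01-01', '2020-12-31', freq='D', name='datetime')
demand = pd.DataFrame({'demand_min': rng.normal(6000, 300, len(demand_idx)),
                       'demand_max': rng.normal(9500, 600, len(demand_idx))},
                      index=demand_idx)

# The predictor frame deliberately covers a wider date range
temp_idx = pd.date_range('2009-01-01', '2021-06-30', freq='D', name='datetime')
temp = pd.DataFrame({'temp_max': rng.normal(23, 5, len(temp_idx))}, index=temp_idx)

# Inner join on the date index keeps only days present in every frame
model_df = demand.join(temp, how='inner').dropna()

# Chronological split: 2010-2018 for training, 2019-2020 for testing
train = model_df.loc['2010':'2018']
test = model_df.loc['2019':'2020']
print(train.shape, test.shape)
```

Splitting by date rather than at random keeps the test period strictly after the training period, which mirrors how the models would be used to predict future demand.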

## Linear Regression Models
Now that all the data is loaded, let's build some preliminary linear regression models:
- untransformed models
    - add normalisation
- transformed models with normalisation
- models with interactions between variables, e.g. temp-solaroutput

Later, we will look at other ML algorithms and data that could improve the model.
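
The model variants listed above could be sketched with the statsmodels formula API roughly as follows. The data here is synthetic with a made-up linear relationship, and the column names `temp_max`, `solar_output` and `demand_max` are assumptions chosen to resemble the project's variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known linear relationship (illustration only)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({'temp_max': rng.uniform(10, 40, n),
                   'solar_output': rng.uniform(0, 500, n)})
df['demand_max'] = (8000 + 50 * df['temp_max'] - 2 * df['solar_output']
                    + rng.normal(0, 100, n))

# Normalise predictors to zero mean and unit variance
features = ['temp_max', 'solar_output']
df[features] = (df[features] - df[features].mean()) / df[features].std()

# Main effects only
base = smf.ols('demand_max ~ temp_max + solar_output', data=df).fit()

# 'a * b' expands to a + b + a:b, adding the interaction term
inter = smf.ols('demand_max ~ temp_max * solar_output', data=df).fit()

print(base.params)
print(inter.params)
```

With a real train/test split, the normalisation mean and standard deviation would normally be computed on the training set only and then applied to the test set, to avoid leaking test information into the model.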

In [7]:
import statsmodels.api as sm
import statsmodels.formula.api as smf


