## Modeling

### Prophet Research

**Install Notes:**
- Numpy
  - Installed Numpy is 1.23.0
  - scipy wants Numpy < 1.23.0
  - numba wants Numpy < 1.21
- daal4py
  - wants daal==2021.2.3 (which isn't installed)
- Plotly
  - may need to install to utilize some of the interactive figures

**DataFrame Prep:**
- Needs a Pandas dataframe with 2 columns:
  - 'ds': datestamp in a format that pandas recognizes
    - Docs say preferably YYYY-MM-DD HH:MM:SS for a timestamp; hoping it also accepts time zone aware values
  - 'y': the target variable we want to forecast
    - if these are the only inputs, then I wonder whether we can include temperatures :(
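
On the time zone question: as noted further down in this notebook, Prophet does not support tz-aware values in 'ds'. A minimal sketch of one way to prepare the two columns, assuming the data arrives with a tz-aware index (the three-row sample and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical three-hour sample standing in for the real load data.
idx = pd.date_range(
    "2010-01-01", periods=3, freq=pd.Timedelta(hours=1), tz="US/Central"
)
load = pd.Series([7931.2, 7775.5, 7704.8], index=idx, name="ercot_load")

prophet_df = pd.DataFrame({
    # Convert to UTC first (no repeated or skipped DST hours), then drop the tz info.
    "ds": load.index.tz_convert("UTC").tz_localize(None),
    # to_numpy() assigns by position, so the datetime index cannot misalign the values.
    "y": load.to_numpy(),
})
print(prophet_df)
```

Converting to UTC before dropping the time zone is a design choice: dropping it from Central time directly would leave duplicate and missing hours at the daylight saving transitions.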
    
**Hyperparameters:**

- growth='linear'
  - Leave as linear since we don't have a market cap (carrying capacity) to supply for logistic growth
- changepoints=None
  - Check out changepoints after the first run or so
  - One person recommended 1 changepoint per month
  - but this reads almost like I could add a Harvey changepoint in from the start
- n_changepoints=25
- changepoint_range=0.8
- yearly_seasonality='auto'
- weekly_seasonality='auto'
- daily_seasonality='auto'
  - One recommendation was to set these to False, then add our own seasonalities
  - definitely want an hourly representation (which may really mean daily, as the pattern repeats each day)
    - add_seasonality, probably with a daily period and fourier_order of 2-10 (probably on the lower side of that)
    - probably want something similar for yearly; based on the monthly plots, probably double whatever the appropriate daily fourier_order turns out to be
- holidays=None
  - NEEDS to be in hourly form
- seasonality_mode='additive'
  - stick with additive, but may want to check out multiplicative
- seasonality_prior_scale=10.0
- holidays_prior_scale=10.0
  - try multiple values here.  
- changepoint_prior_scale=0.05
- mcmc_samples=0
- interval_width=0.8
- uncertainty_samples=1000
- stan_backend=None
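
The seasonality notes above could be sketched as follows. This is a starting point under stated assumptions, not a tuned configuration: the Fourier orders are guesses (yearly is set to double the daily value, per the note above), and the weekly entry is my own addition for completeness.

```python
# Hypothetical starting values; the Fourier orders are guesses to tune, not recommendations.
BASE_KWARGS = {
    "growth": "linear",
    "yearly_seasonality": False,   # turn off the automatic seasonalities...
    "weekly_seasonality": False,
    "daily_seasonality": False,
    "seasonality_mode": "additive",
}

# ...and add our own instead. Periods are expressed in days.
CUSTOM_SEASONALITIES = [
    {"name": "daily", "period": 1, "fourier_order": 4},
    {"name": "weekly", "period": 7, "fourier_order": 3},        # assumption, not in the notes
    {"name": "yearly", "period": 365.25, "fourier_order": 8},   # double the daily order
]


def build_model():
    """Build an unfitted Prophet model with hand-specified seasonalities."""
    # Imported here so the config above can be inspected without prophet installed.
    from prophet import Prophet

    model = Prophet(**BASE_KWARGS)
    for seasonality in CUSTOM_SEASONALITIES:
        model.add_seasonality(**seasonality)
    return model
```

Keeping the settings in plain dictionaries makes it easy to log which configuration produced which score when comparing runs.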

**Cross-validation:**
- Parameters:
  - horizon: forecast time period (3 days?)
  - initial: initial training period (3 years? - minimum training window?)
  - period: spacing between cutoff dates
- Example:
  - from prophet.diagnostics import cross_validation
  - df_cv = cross_validation(m, initial='730 days', period='180 days', horizon = '365 days')
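
To get a feel for how many folds a given initial/period/horizon combination produces, here is a rough sketch that walks cutoff dates backward from the end of the data. It only approximates how the cutoffs are spaced and is not Prophet's own implementation; `sketch_cutoffs` is a hypothetical helper, and the 30-day period is an assumed value.

```python
import pandas as pd


def sketch_cutoffs(start, end, initial, period, horizon):
    """Approximate rolling-origin cutoffs, working backward from the end of the data."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    initial, period, horizon = (pd.Timedelta(x) for x in (initial, period, horizon))
    cutoffs = []
    cutoff = end - horizon               # the last fold must leave a full horizon to score
    while cutoff >= start + initial:     # every fold must keep at least `initial` of training data
        cutoffs.append(cutoff)
        cutoff -= period
    return cutoffs[::-1]                 # oldest cutoff first


# 3-year initial window and 3-day horizon from the notes; period is a guess.
cutoffs = sketch_cutoffs(
    "2010-01-01", "2022-06-30",
    initial="1095 days", period="30 days", horizon="3 days",
)
print(len(cutoffs), cutoffs[0], cutoffs[-1])
```

Each cutoff means one model fit, so a short period on twelve years of hourly data gets expensive quickly.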

  
    
**Regression:**
- Looks like we should be able to add weather data via add_regressor
  - Read up on [this article](https://towardsdatascience.com/forecast-model-tuning-with-additional-regressors-in-prophet-ffcbf1777dda)
  - Non-continuous variable: would need to bucket temperatures:
    - Maybe: is hot day? or is cold day?
    - Should the is hot/is cold flag depend on time of day? Would need to find the inflection point for that
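
A minimal sketch of the bucketing idea, assuming a temperature column such as `hs_temp`. The thresholds are placeholders in whatever unit that column uses, and `add_temp_flags` is a hypothetical helper, not something from `wrangle`:

```python
import pandas as pd

# Placeholder thresholds; the real values should come from the load-vs-temperature plots.
HOT_THRESHOLD = 85.0
COLD_THRESHOLD = 45.0


def add_temp_flags(df, temp_col="hs_temp"):
    """Return a copy of df with 0/1 is_hot and is_cold columns."""
    out = df.copy()
    out["is_hot"] = (out[temp_col] >= HOT_THRESHOLD).astype(int)
    out["is_cold"] = (out[temp_col] <= COLD_THRESHOLD).astype(int)
    return out


sample = pd.DataFrame({"hs_temp": [30.0, 60.0, 95.0]})
flagged = add_temp_flags(sample)
print(flagged)

# Each flag would then be registered before fitting, e.g.
#   model.add_regressor("is_hot")
#   model.add_regressor("is_cold")
# and the same columns must also be present in the dataframe passed to predict.
```

As far as I know, add_regressor also accepts plain numeric columns, so bucketing is a modeling choice rather than a requirement.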


In [1]:
import pandas as pd
import numpy as np

import wrangle

### Prophet Planning:

1. Create reshape dataframe function >> [jump to](#1)
2. Perform Basic Prophet model >> [jump to](#2) 
  - Use default parameters
  - Store performance variables
3. Add Prophet parameters >> [jump to](#3)
  - Holidays
  - Growth (ex: logistic) 
    - seems to be for market size info, probably inappropriate for us
    - want linear (default)
4. Perform Basic Prophet model w/ cross-validation
  - Use same parameters as #2
  - Add in cross-validation to get aggregated performance
  - sliding window if possible
5. Modify reshape function to add in a regressor
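
For the holidays part of step 3, one possible approach is to build the holidays dataframe from the existing `is_obs_holiday` column. The sketch below assumes that column is an hourly 0/1 flag on a datetime index and collapses it to one row per date with 'holiday' and 'ds' columns. Whether date-level rows are enough for hourly data, or hourly rows are needed, should be verified against the Prophet docs; `holidays_from_flag` and the holiday name are made up for illustration.

```python
import pandas as pd


def holidays_from_flag(df, flag_col="is_obs_holiday", name="observed_holiday"):
    """Collapse an hourly 0/1 flag into one row per flagged calendar date."""
    idx = df.index[df[flag_col].astype(bool).to_numpy()]
    if idx.tz is not None:
        idx = idx.tz_localize(None)      # keep the local calendar date, drop the tz
    dates = idx.normalize().unique()     # midnight of each flagged day, de-duplicated
    return pd.DataFrame({"holiday": name, "ds": dates})


# Hypothetical two-day sample: the first day is flagged, the second is not.
hours = pd.date_range("2010-01-01", periods=48, freq=pd.Timedelta(hours=1))
sample = pd.DataFrame({"is_obs_holiday": [1] * 24 + [0] * 24}, index=hours)
holidays = holidays_from_flag(sample)
print(holidays)
```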

### 1) Create a reshape dataframe <a class='anchor' id='1'></a>

[Next](#2)

- need 'ds' column (start with UTC)
- need 'y' column (ercot load)

In [2]:
df = wrangle.get_combined_df()
df.index

DatetimeIndex(['2010-01-01 00:00:00-06:00', '2010-01-01 01:00:00-06:00',
               '2010-01-01 02:00:00-06:00', '2010-01-01 03:00:00-06:00',
               '2010-01-01 04:00:00-06:00', '2010-01-01 05:00:00-06:00',
               '2010-01-01 06:00:00-06:00', '2010-01-01 07:00:00-06:00',
               '2010-01-01 08:00:00-06:00', '2010-01-01 09:00:00-06:00',
               ...
               '2022-06-30 14:00:00-05:00', '2022-06-30 15:00:00-05:00',
               '2022-06-30 16:00:00-05:00', '2022-06-30 17:00:00-05:00',
               '2022-06-30 18:00:00-05:00', '2022-06-30 19:00:00-05:00',
               '2022-06-30 20:00:00-05:00', '2022-06-30 21:00:00-05:00',
               '2022-06-30 22:00:00-05:00', '2022-06-30 23:00:00-05:00'],
              dtype='datetime64[ns, US/Central]', name='datetime', length=109535, freq=None)

In [3]:
df.columns

Index(['ercot_load', 'dow', 'is_weekday', 'is_obs_holiday', 'hs_temp',
       'hs_feelslike', 'hs_dew', 'hs_humidity', 'hs_precip', 'hs_windgust',
       'hs_windspeed', 'hs_winddir', 'hs_sealevelpressure', 'hs_cloudcover',
       'hs_visibility', 'hs_solarradiation', 'hs_solarenergy', 'hs_uvindex',
       'gv_temp', 'gv_feelslike', 'gv_dew', 'gv_humidity', 'gv_precip',
       'gv_windgust', 'gv_windspeed', 'gv_winddir', 'gv_sealevelpressure',
       'gv_cloudcover', 'gv_visibility', 'gv_solarradiation', 'gv_solarenergy',
       'gv_uvindex', 'pl_temp', 'pl_feelslike', 'pl_dew', 'pl_humidity',
       'pl_precip', 'pl_windgust', 'pl_windspeed', 'pl_winddir',
       'pl_sealevelpressure', 'pl_cloudcover', 'pl_visibility',
       'pl_solarradiation', 'pl_solarenergy', 'pl_uvindex', 'vc_temp',
       'vc_feelslike', 'vc_dew', 'vc_humidity', 'vc_precip', 'vc_windgust',
       'vc_windspeed', 'vc_winddir', 'vc_sealevelpressure', 'vc_cloudcover',
       'vc_visibility', 'vc_solarradiation', '

In [4]:
prophet_df = pd.DataFrame()
prophet_df['ds'] = df.index.copy()
prophet_df['y'] = df.ercot_load.copy()

In [5]:
prophet_df

Unnamed: 0,ds,y
0,2010-01-01 00:00:00-06:00,
1,2010-01-01 01:00:00-06:00,
2,2010-01-01 02:00:00-06:00,
3,2010-01-01 03:00:00-06:00,
4,2010-01-01 04:00:00-06:00,
...,...,...
109530,2022-06-30 19:00:00-05:00,
109531,2022-06-30 20:00:00-05:00,
109532,2022-06-30 21:00:00-05:00,
109533,2022-06-30 22:00:00-05:00,
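
The empty 'y' column above is most likely pandas index alignment at work: `prophet_df` has a 0..n integer index, while `df.ercot_load` is indexed by datetime, so no labels match and every value becomes NaN. A small reproduction with made-up values, plus a positional fix:

```python
import pandas as pd

# Hypothetical three-row stand-in for the combined dataframe.
idx = pd.date_range(
    "2010-01-01", periods=3, freq=pd.Timedelta(hours=1), tz="US/Central"
)
df = pd.DataFrame({"ercot_load": [7931.2, 7775.5, 7704.8]}, index=idx)

# Same pattern as the cell above: integer index on the left, datetime index on the right.
misaligned = pd.DataFrame()
misaligned["ds"] = df.index.copy()
misaligned["y"] = df.ercot_load.copy()        # aligned by label, nothing matches

# Assigning the raw values instead places them by position.
aligned = pd.DataFrame()
aligned["ds"] = df.index.copy()
aligned["y"] = df.ercot_load.to_numpy()

print(misaligned["y"].isna().all(), aligned["y"].notna().all())
```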


In [6]:
#probably an index problem above, so instead:
df2 = pd.DataFrame(df.ercot_load)

In [7]:
df2 #now just need to move datetime to a column and relabel columns

datetime,ercot_load
2010-01-01 00:00:00-06:00,7931.241900
2010-01-01 01:00:00-06:00,7775.456846
2010-01-01 02:00:00-06:00,7704.815982
2010-01-01 03:00:00-06:00,7650.575724
2010-01-01 04:00:00-06:00,7666.708317
...,...
2022-06-30 19:00:00-05:00,15040.841510
2022-06-30 20:00:00-05:00,14700.132848
2022-06-30 21:00:00-05:00,14637.633680
2022-06-30 22:00:00-05:00,14543.743791


In [8]:
df2.reset_index(drop=False, inplace=True)

In [9]:
df2.info() #cool, preserved tz aware datetime

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109535 entries, 0 to 109534
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype                     
---  ------      --------------   -----                     
 0   datetime    109535 non-null  datetime64[ns, US/Central]
 1   ercot_load  109535 non-null  float64                   
dtypes: datetime64[ns, US/Central](1), float64(1)
memory usage: 1.7 MB


In [10]:
df2.rename(columns = {'datetime':'ds','ercot_load':'y'},inplace=True)
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109535 entries, 0 to 109534
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype                     
---  ------  --------------   -----                     
 0   ds      109535 non-null  datetime64[ns, US/Central]
 1   y       109535 non-null  float64                   
dtypes: datetime64[ns, US/Central](1), float64(1)
memory usage: 1.7 MB


###### Drop into a function

In [31]:
def get_prophet_df():
    '''
    Retrieves a cleaned dataframe and formats it for input into
    the FB Prophet model.

    NOTE: Prophet does not support timezone-aware datetimes, so the
    data is pulled in UTC and then made tz naive.
    '''
    # Acquire combined dataframe (UTC index rather than Central)
    df = wrangle.get_combined_df(get_central=False)
    # Pull the load column (and its datetime index) into a new dataframe
    df2 = pd.DataFrame(df.ercot_load)
    # Move the datetime index out into a column
    df2 = df2.reset_index(drop=False)
    # Rename columns to the names Prophet expects
    df2 = df2.rename(columns={'datetime': 'ds', 'ercot_load': 'y'})
    # Drop the timezone while keeping the UTC clock times
    df2['ds'] = df2['ds'].dt.tz_localize(None)

    return df2
    
    

In [32]:
test = get_prophet_df()

In [33]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109535 entries, 0 to 109534
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   ds      109535 non-null  datetime64[ns]
 1   y       109535 non-null  float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 1.7 MB


In [34]:
test.head(2)

Unnamed: 0,ds,y
0,2010-01-01 06:00:00,7931.2419
1,2010-01-01 07:00:00,7775.456846


### 2) Basic Prophet Model <a class='anchor' id='2'></a>

In [14]:
df = wrangle.get_prophet_df(get_central=False)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109535 entries, 0 to 109534
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype              
---  ------  --------------   -----              
 0   ds      109535 non-null  datetime64[ns, UTC]
 1   y       109535 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(1)
memory usage: 1.7 MB


In [16]:
train = df[df.ds < '2018']
train

Unnamed: 0,ds,y
0,2010-01-01 06:00:00+00:00,7931.241900
1,2010-01-01 07:00:00+00:00,7775.456846
2,2010-01-01 08:00:00+00:00,7704.815982
3,2010-01-01 09:00:00+00:00,7650.575724
4,2010-01-01 10:00:00+00:00,7666.708317
...,...,...
70117,2017-12-31 19:00:00+00:00,11088.549792
70118,2017-12-31 20:00:00+00:00,11099.277737
70119,2017-12-31 21:00:00+00:00,11014.918563
70120,2017-12-31 22:00:00+00:00,10904.926336
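
The cell above only builds the training slice. A sketch of a helper that returns both sides of a chronological split, assuming a 'ds' column that may or may not be tz-aware (`split_by_date` and the name `holdout` are hypothetical, not part of `wrangle`):

```python
import pandas as pd


def split_by_date(df, cutoff, ds_col="ds"):
    """Split chronologically: rows before cutoff for training, the rest held out."""
    cutoff = pd.Timestamp(cutoff)
    tz = df[ds_col].dt.tz
    if tz is not None and cutoff.tzinfo is None:
        cutoff = cutoff.tz_localize(tz)   # compare like with like for tz-aware data
    train = df[df[ds_col] < cutoff]
    holdout = df[df[ds_col] >= cutoff]
    return train, holdout


# Hypothetical four-hour sample straddling the 2018 boundary.
sample = pd.DataFrame({
    "ds": pd.date_range("2017-12-31 22:00", periods=4, freq=pd.Timedelta(hours=1)),
    "y": [10904.9, 10850.0, 10800.0, 10750.0],
})
train_part, holdout = split_by_date(sample, "2018-01-01")
print(len(train_part), len(holdout))
```

Note that the `df` loaded in this section is still tz-aware (UTC), so the time zone would need to be dropped before the training slice is passed to fit.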


In [17]:
from prophet import Prophet

In [18]:
help(Prophet)

Help on class Prophet in module prophet.forecaster:

class Prophet(builtins.object)
 |  Prophet(growth='linear', changepoints=None, n_changepoints=25, changepoint_range=0.8, yearly_seasonality='auto', weekly_seasonality='auto', daily_seasonality='auto', holidays=None, seasonality_mode='additive', seasonality_prior_scale=10.0, holidays_prior_scale=10.0, changepoint_prior_scale=0.05, mcmc_samples=0, interval_width=0.8, uncertainty_samples=1000, stan_backend=None)
 |  
 |  Prophet forecaster.
 |  
 |  Parameters
 |  ----------
 |  growth: String 'linear', 'logistic' or 'flat' to specify a linear, logistic or
 |      flat trend.
 |  changepoints: List of dates at which to include potential changepoints. If
 |      not specified, potential changepoints are selected automatically.
 |  n_changepoints: Number of potential changepoints to include. Not used
 |      if input `changepoints` is supplied. If `changepoints` is not supplied,
 |      then n_changepoints potential changepoints are selec

In [19]:
model = Prophet()
type(model)

prophet.forecaster.Prophet

In [35]:
model.fit(test)

06:36:42 - cmdstanpy - INFO - Chain [1] start processing
06:37:08 - cmdstanpy - INFO - Chain [1] done processing


<prophet.forecaster.Prophet at 0x7fafab8f7070>
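
For the "store performance variables" item in the plan, a minimal sketch of two error metrics that could be computed once predictions are available. The function names are my own, and the inputs here are made-up numbers rather than model output; in practice `predicted` would come from the 'yhat' column of the forecast.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the load."""
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mape(actual, predicted):
    """Mean absolute percentage error; assumes actual values are never zero."""
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)


# Made-up numbers purely to show the call pattern.
performance = {
    "rmse": rmse([100.0, 200.0], [110.0, 180.0]),
    "mape": mape([100.0, 200.0], [110.0, 180.0]),
}
print(performance)
```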