In [None]:
from utils import feature_importances, remove_nans, split_with_gap
import featuretools as ft
from featuretools.primitives import RollingMean, NumericLag
import woodwork as ww
from evalml import AutoMLSearch
from evalml.model_understanding import graph_prediction_vs_actual_over_time


import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import median_absolute_error

In this demo, we'll work to predict future daily average temperatures using historical temperature data. This is a time series machine learning problem, which requires special considerations during preprocessing, feature engineering, and model building.

To highlight the process through which we can solve a time series problem, we'll build three models.

First, we'll build a baseline model; this will highlight the unique constraints for data-splitting and allow us to understand the problem definition. Then, we'll explore time series feature engineering by generating features with Featuretools. Finally, we'll use EvalML's time series modeling to greatly simplify the process through which these models are built.

## Understanding Time Series Problems

Time series forecasting is different from other machine learning problems in that there is an inherent temporal ordering to the data. The ordering comes from a time index column, so at a specific point in time, we may have knowledge about earlier observations but not later ones. If the data is unordered, it's hard to see any overall trend or seasonality, but when sorted by date, any relationships that exist in the data can be seen and used when making predictions (winter is cold; summer is hot!). Notice how this is different from non-time series data, which can be presented in any order without having an impact on the resulting predictions.

Other demos in this repository also explore this concept. The predict-remaining-useful-life, predict-olympic-medals, and predict-appointment-noshow demos all have time indices that play a large role in splitting the data for feature engineering. In those demos, we can set a `cutoff_time` for feature engineering, after which we do not have access to data. This is very useful for datasets that have multiple tables with relationships, because we can build features from aggregations across tables.

In this demo, we'll only have one table worth of data, but its temporal ordering means that we have access to a column's own historical data for feature engineering. When trying to determine tomorrow's temperature, knowing today's temperature may be the most predictive piece of information we can get. Realistically, we may not have data from so recent a time, but the concept stands; utilizing the most recent information we have is the bread and butter of time series modeling.

In a time series problem, our task is to predict the future values of our target variable. If we engineer the right features, we can use normal regression models; but we need to account for the temporal ordering of the data. 

## Load in Data

We’ll demonstrate how to build a time series model using the DailyDelhiClimateTrain dataset, which contains a `meantemp` target variable and a `date` time index. There are other columns, but for the purposes of simplicity, we'll only work with the target and time index columns. To include the others would bring this demo into the sphere of multivariate time series modeling, which brings its own host of complexity.

In [None]:
file_name = "DailyDelhiClimateTrain"
df = pd.read_csv(f"data/{file_name}.csv")

df

Now, we'll do a quick sanity check that the data has some temporal pattern that we can exploit for modeling purposes.

First, we'll use a Woodwork method to check whether there is any column with a uniform sampling frequency. This is important, because it means that there is a constant amount of time between observations. A dataset that does not have a uniform sampling frequency can still be used for time series modeling, but the existence of that frequency is a good indicator that this dataset is ripe for time series modeling. For datasets that have multiple datetime columns, checking for a frequency is also a good way to identify which column could be the time index.

In [None]:
df.ww.init()
df.ww.infer_temporal_frequencies()

Indeed, one of the columns, `date`, has a daily frequency; we'll move forward with it as our time index.
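
As a cross-check outside of Woodwork, pandas offers `pd.infer_freq`, which returns a frequency string when timestamps are evenly spaced and `None` otherwise. The sketch below uses a small synthetic date range rather than the Delhi data, so the dates are illustrative assumptions only.

```python
import pandas as pd

# Ten consecutive days: evenly spaced, so a frequency can be inferred
regular_dates = pd.date_range("2013-01-01", periods=10, freq="D")
regular_freq = pd.infer_freq(regular_dates)

# Drop one day from the middle: the spacing is no longer uniform
irregular_dates = regular_dates.delete(4)
irregular_freq = pd.infer_freq(irregular_dates)

print(regular_freq)
print(irregular_freq)
```

A daily result for the regular index and no result for the irregular one is the same signal we are looking for in the `date` column.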

Now, we’ll graph the data.

In [None]:
ts = df['meantemp']
ts.index = df['date']
ts.plot()

We can see a strong seasonality, which makes sense for temperature! In many places, the time of the year is indicative of what the weather will look like. Now, we'll build a baseline model that uses the most recently available data. 

But how do we define what data we have access to, and when? In many scenarios, this might be determined by how quickly we can get access to recent observations. Since we're building a model using training and test data that we already have access to, we'll need to set some parameters arbitrarily. But these parameters will let us define the problem more formally. We'll stick with this problem configuration throughout the rest of the demo.

## Problem Configuration
Here are a few concepts that give us our official problem configuration:

**forecast_horizon**: The number of time periods we are trying to forecast. In this example, we’re interested in predicting the mean temperature for the next 5 days, so the value is 5.

**gap**: The number of time periods between the end of the training set and the start of the test set. We’re going to make predictions using data from three days prior to each observation.

**max_delay**: The maximum number of rows to look in the past from the current row in order to compute features. Here, we’ll use a max delay of 20.

**time_index**: The column of the training dataset that contains the date corresponding to each observation. Here, it's the `date` column.

Our problem can then be described as trying to predict the mean temperature over the next five days using temperature data from 20 days prior. 
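
To make these parameters concrete, here is a minimal sketch of which rows could feed the features for a given row. `usable_feature_rows` is a hypothetical helper, not part of EvalML or Featuretools. It assumes that the newest usable row sits `forecast_horizon + gap` rows before the row being predicted, which matches how the later cells in this demo offset their lag and rolling features, and that we look back at most `max_delay` rows from there.

```python
def usable_feature_rows(t, gap, forecast_horizon, max_delay):
    """Return (oldest, newest) row positions usable as features for row t.

    Hypothetical helper: assumes the newest usable row is
    forecast_horizon + gap rows before t, and that we look back
    at most max_delay further rows from there.
    """
    newest = t - (forecast_horizon + gap)
    oldest = newest - max_delay
    # Row positions cannot be negative, so clip at the start of the data
    return max(oldest, 0), newest


oldest, newest = usable_feature_rows(t=100, gap=3, forecast_horizon=5, max_delay=20)
print(oldest, newest)
```

With the values used in this demo, the features for row 100 would draw on rows 72 through 92.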

In [None]:
# The only columns we'll want to use for modeling - makes this a univariate problem
time_index = "date"
target_col = 'meantemp'

# parameters as evalml uses them 
gap = 3
max_delay = 20
forecast_horizon = 5

Since we do not want to complicate the solution by performing multivariate time series modeling, we'll only use the time index column and target column for the rest of this demo.

In [None]:
univariate_df = df[[time_index, target_col]]

Additionally, we'll want to have our data split up into training and testing data. We'll use the same split for both our baseline and Featuretools run. 

In [None]:
split_point = int(univariate_df.shape[0]*.8)

# leave gap observations between training and test datasets
training_data = univariate_df[:split_point]
test_data = univariate_df[(split_point + gap):]
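
The same split can be wrapped in a small helper and run on stand-in data to make the effect of `gap` visible. `chronological_split` is a hypothetical name used for illustration; it is not the `split_with_gap` function imported from `utils`, whose implementation is not shown here.

```python
def chronological_split(rows, train_fraction=0.8, gap=0):
    """Split time-ordered rows into train and test, skipping `gap` rows between them."""
    split_point = int(len(rows) * train_fraction)
    return rows[:split_point], rows[split_point + gap:]


days = list(range(100))  # stand-in for 100 daily observations, oldest first
train, test = chronological_split(days, train_fraction=0.8, gap=3)

# Last training row, first test row, and number of test rows
print(train[-1], test[0], len(test))
```

Rows 80, 81, and 82 belong to neither set, which is exactly the buffer that `gap` describes.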

## Baseline Run

Our baseline run will include only one feature: the target shifted back to the first known value for each observation. We need to be careful that the test dataset's lag feature does not use values from before the test set begins, that is, from inside the training set.

The data is already split with `gap` observations between the train and test sets, so we compute the lag feature on each set separately.

In [None]:
# Add a delayed target feature to both training and test data
target_lag_training = training_data[target_col].shift(forecast_horizon + gap + 1)
target_lag_training.name = 'target_lag'
baseline_training = pd.concat([training_data, target_lag_training], axis=1)

target_lag_test = test_data[target_col].shift(forecast_horizon + gap + 1)
target_lag_test.name = 'target_lag'
baseline_test = pd.concat([test_data, target_lag_test], axis=1)

# Get rid of the time index column for modeling
baseline_training.drop(time_index, axis=1, inplace=True)
baseline_test.drop(time_index, axis=1, inplace=True)

# The lag feature introduces nans, so we remove those rows and pull out the target
X_train, y_train = remove_nans(baseline_training, target_col)
X_test, y_test = remove_nans(baseline_test, target_col)
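
To see why the lag feature introduces missing values, here is a toy example. The temperatures and the lag of 2 are made up for illustration, and `dropna` stands in for the `remove_nans` helper, whose implementation is not shown in this demo.

```python
import pandas as pd

temps = pd.Series([20.0, 21.0, 22.0, 23.0, 24.0, 25.0], name="meantemp")

# Each row receives the value observed two rows earlier
lagged = temps.shift(2).rename("target_lag")
frame = pd.concat([temps, lagged], axis=1)

# The first two rows have no lagged value and would be dropped before modeling
complete_rows = frame.dropna()

print(frame)
print(len(complete_rows))
```

Because the shift is applied within a single dataset, the first rows of the test set lose their lag values instead of borrowing them from the training set.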


In [None]:
target_lag_test

In [None]:
test_data

In [None]:
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train)

preds = reg.predict(X_test)
scores = median_absolute_error(y_test, preds)  # scikit-learn expects (y_true, y_pred)
print('Median Abs Error: {:.2f}'.format(scores))

high_imp_feats = feature_importances(X_train, reg, feats=10)
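
For intuition about the metric, the sketch below computes a median absolute error by hand on made-up numbers. `median_abs_error` is a hypothetical helper that uses only the standard library; it is meant to mirror the idea behind the scikit-learn function, not to replace it.

```python
from statistics import median


def median_abs_error(y_true, y_pred):
    """Median of the absolute differences between actual and predicted values."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


actual = [25.0, 27.0, 30.0, 31.0, 28.0]
predicted = [24.0, 27.5, 33.0, 21.0, 28.0]  # one prediction is off by 10 degrees

score = median_abs_error(actual, predicted)
print(score)
```

The single large miss barely moves the median, which is why this metric is forgiving of occasional outliers.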

We can build more features, some of which may be similar to the lag feature we used in the baseline model, but if we look back at the graph with the rolling mean, we recall that the rolling mean was a really good indicator of the mean temperature. So we'll want a way of including that as a feature without exposing our target. This is where Featuretools' time series primitives come into play. We'll also add some more standard datetime primitives that might have predictive power; for example, the month of the year is a very good indicator of what the temperature should be.
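
One way to picture a rolling mean that does not expose the target is to shift the series before averaging, so that each window ends several rows before the row being predicted. The following is a pandas sketch of that idea under toy values; `lagged_rolling_mean` is a hypothetical helper, and Featuretools' `RollingMean` primitive may handle edge cases differently.

```python
import pandas as pd


def lagged_rolling_mean(series, window, gap, min_periods):
    # Shift first so each window ends `gap` rows before the current row, then average
    return series.shift(gap).rolling(window, min_periods=min_periods).mean()


temps = pd.Series(range(1, 11), dtype="float64")
feature = lagged_rolling_mean(temps, window=3, gap=2, min_periods=3)

print(feature.tolist())
```

The value at position 4 is the mean of positions 0 through 2, so no row ever sees its own target or anything more recent than `gap` rows back.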

## Feature Engineering Run

### Split Data

We'll reuse the training and test sets that we created for the baseline run, so both runs are evaluated on exactly the same split.

### Feature Engineering with Featuretools

Now, we engineer some time series-specific features: lags of the target, a rolling mean of the target, and a few datetime features.


In [None]:
# parameters as featuretools will use them
rolling_gap = forecast_horizon + gap
rolling_window_length = int(.25*max_delay) + 1 # a quarter is a heuristic here 
rolling_min_periods = int(.25*max_delay) + 1

In [None]:
training_es = ft.EntitySet()
training_es.add_dataframe(training_data.copy(), 
                 dataframe_name='temperatures', 
                 index='id', 
                 make_index=True, 
                 time_index=time_index)

In [None]:
test_es = ft.EntitySet()
test_es.add_dataframe(test_data.copy(), 
                 dataframe_name='temperatures', 
                 index='id', 
                 make_index=True, 
                 time_index=time_index)

In [None]:
datetime_featureizer = ['Month', 'Hour', 'Year']
# Open question: how does EvalML determine which statistically significant lags to use?
# max_delay determines the number of lag features (an upper bound; any number up to it can be picked)
lagging_featureizer = [NumericLag(periods=t + forecast_horizon + gap) for t in range(forecast_horizon + gap + 1)]


train_fm, features = ft.dfs(
    entityset=training_es,
    target_dataframe_name='temperatures',
    max_depth=1,
    trans_primitives=datetime_featureizer + lagging_featureizer + [
        RollingMean(rolling_window_length,
                    gap=rolling_gap,
                    min_periods=rolling_min_periods)
    ],
)

X_train, y_train = remove_nans(train_fm, target_col)

X_train
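
To make the lag list concrete, this short sketch evaluates the same expression used for `lagging_featureizer` with this demo's values of `gap` and `forecast_horizon`, without creating any Featuretools primitives.

```python
gap = 3
forecast_horizon = 5

# The first usable lag is forecast_horizon + gap rows back
start = forecast_horizon + gap
lag_periods = [t + start for t in range(start + 1)]

print(lag_periods)
print(len(lag_periods))
```

So the feature matrix gets nine lag features, looking back between 8 and 16 rows.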

In [None]:
test_fm = ft.calculate_feature_matrix(features, test_es)

X_test, y_test = remove_nans(test_fm, target_col)


test_fm.ww

In [None]:
X_test

In [None]:
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train)

preds = reg.predict(X_test)
scores = median_absolute_error(y_test, preds)  # scikit-learn expects (y_true, y_pred)
print('Median Abs Error: {:.2f}'.format(scores))

high_imp_feats = feature_importances(X_train, reg, feats=100)


Looking at the feature importances above, we see that the rolling mean was, indeed, very predictive along with the Month feature. 

## Use Time Series Regression Problem From EvalML
We will now build a model that is very similar to the one we just built with the help of Featuretools. Under the hood, EvalML's time series regression problem type does the same kind of feature engineering that we just did by hand. On top of that, EvalML applies some other optimizations and searches over multiple pipelines.

In [None]:
import evalml

# Copy so that popping the target does not modify a slice of df
univariate_df = df[[time_index, target_col]].copy()

X = univariate_df
y = univariate_df.pop(target_col)

X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(
    X, y,
    problem_type='time series regression',
    test_size=.2,
    problem_configuration={"gap": gap, "max_delay": max_delay,
                           "forecast_horizon": forecast_horizon, "time_index": time_index},
)



In [None]:
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train, y_train, problem_type="time series regression",
                      max_batches=1,
                      problem_configuration={"gap": gap, "max_delay": max_delay,
                                             "forecast_horizon": forecast_horizon, "time_index": time_index},
                      allowed_model_families=["xgboost", "random_forest", "linear_model", "extra_trees",
                                              "decision_tree"],
                      objective='MedianAE'
                      )
automl.search()

In [None]:
y_train

In [None]:
automl.rankings

In [None]:
pipeline = automl.best_pipeline
pipeline.feature_importance

Look at how similar these feature importances are to the ones from the Featuretools run! The top three features are the same most of the time.

In [None]:
pipeline.fit(X_train, y_train)

best_pipeline_score = pipeline.score(X_test, y_test, ['MedianAE'], X_train, y_train)['MedianAE']
best_pipeline_score

In [None]:
baseline = automl.get_pipeline(0)
baseline.fit(X_train, y_train)
naive_baseline_score = baseline.score(X_test, y_test, ['MedianAE'], X_train, y_train)['MedianAE']
naive_baseline_score

In [None]:

fig = graph_prediction_vs_actual_over_time(pipeline, X_test, y_test, X_train, y_train, dates=X_test['date'])
fig

In [None]:
fig = graph_prediction_vs_actual_over_time(baseline, X_test, y_test, X_train, y_train, dates=X_test['date'])
fig