In [None]:
from utils import feature_importances
import featuretools as ft
from featuretools.primitives import RollingMean, NumericLag
import woodwork as ww
from evalml import AutoMLSearch
from evalml.model_understanding import graph_prediction_vs_actual_over_time


import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import median_absolute_error

## Time Series Explanation

Other demos use data that is temporally ordered: predict-remaining-useful-life, predict olympic medals, and predict-appointment-noshow. What these have in common is a time index column that gives the data a temporal ordering. In those demos, the implication of that temporal ordering is that, to obtain training and testing data, we need to split the data in a way that honors that boundary before we can perform feature engineering on it. In EntitySets with multiple dataframes, this means that aggregations need to take cutoff times into account.

Those other temporal demos are not time series problems, though: they ask a different question of their temporally ordered data.

In this demo, we’ll also be using data that has a temporal ordering. However, we’ll be solving a slightly different type of problem, a time series problem, and that will inform the feature engineering we perform. Featuretools and EvalML can be used to build time series machine learning models, and this demo will show how that can be done.

Before we can get to building our models, we will provide further explanations of what a time series regression problem actually is.

A time series regression model makes use of the inherent relationship between data points that are close to one another in time. There is a level of dependence between a data point and the ones that came before it, so the features used in modeling are built from the target column itself. An observation’s features will include information from previous observations (or rows) but cannot contain information from that observation itself. This makes including columns beyond the target and time index difficult, because the non-target columns must follow those same rules. A problem that includes such columns is multivariate in nature; this demo will focus on solving a univariate time series problem, one that uses just the time index and the target column.
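To make the idea of building features from the target concrete, here is a minimal sketch using plain pandas. The toy values and the column name `meantemp_lag_2` are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy series; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-01", periods=6, freq="D"),
        "meantemp": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    }
)

# Each row's feature is the target value observed two rows (days) earlier.
# shift(2) moves values down by two rows, so a row never sees its own target.
toy["meantemp_lag_2"] = toy["meantemp"].shift(2)

print(toy)
```

The first two rows have no earlier observation to draw from, so their lag value is missing; this is why lag features always cost a few rows at the start of a dataset.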

One important aspect of time series modeling is that the data must be ordered by its time index. If the data is unordered, it’d be hard to see any overall trend or seasonality, but when sorted by date, any relationships that exist in the data can be seen and used when making predictions (winter is cold; summer is hot!). Notice how this is different from non-time series data, which can be presented in any order without having an impact on the resulting predictions.
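As a small illustration of that requirement, the sketch below sorts a deliberately shuffled toy frame by its time index and checks the result. The data is hypothetical.

```python
import pandas as pd

# Hypothetical rows that arrive out of order.
shuffled = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-03", "2021-01-01", "2021-01-02"]),
        "meantemp": [12.0, 10.0, 11.0],
    }
)

# Sort by the time index and rebuild the row index so positions follow time.
ordered = shuffled.sort_values("date").reset_index(drop=True)

# True when every date is at or after the one before it.
print(ordered["date"].is_monotonic_increasing)
```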

In a time series problem, our task is to predict the future values of our target variable. If we engineer the right features, we can use normal regression models; but we need to account for the temporal ordering of the data. 


## Introduce Dataset

We’ll demonstrate how to build a time series model using the DailyDelhiClimateTrain dataset, which contains a meantemp target column and a date time index.

In [None]:
file_name = "DailyDelhiClimateTrain"
df = pd.read_csv(f"data/{file_name}.csv")

df

Let’s take a quick look at the data to confirm that it makes sense to use this dataset for time series modeling. First, we’ll check whether there is any column with a uniform sampling frequency. This is important, because it means that there is a constant amount of time between observations, and this lets us build features more efficiently. A dataset that does not have a uniform sampling frequency can still be used for time series modeling, but the existence of that frequency is a good indicator that this dataset is ripe for time series modeling. For datasets that have multiple datetime columns, checking for a frequency is also a good indicator of which column should be the time index.

In [None]:
df.ww.init()
df.ww.infer_temporal_frequencies()

Using Woodwork's `infer_temporal_frequencies` method, we see that one of the columns, `date`, has a daily frequency. This indicates to us that `date` will be our time index in the modeling process.
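For intuition about what a uniform sampling frequency means, here is a rough pandas-only sketch that uses `pd.infer_freq` on two hypothetical sets of dates. It is not what Woodwork does internally, only a comparable check.

```python
import pandas as pd

# Hypothetical dates: one evenly spaced set and one with missing days.
regular = pd.date_range("2013-01-01", periods=10, freq="D")
irregular = pd.DatetimeIndex(
    ["2013-01-01", "2013-01-02", "2013-01-04", "2013-01-05", "2013-01-09"]
)

# infer_freq returns a frequency string when the spacing is uniform
# and None when no single frequency fits the dates.
regular_freq = pd.infer_freq(regular)
irregular_freq = pd.infer_freq(irregular)

print(regular_freq, irregular_freq)
```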

Now, we’ll graph the data. We can see a strong seasonality, which makes sense for temperature, as in many places the time of year is indicative of what the weather will look like. First, the fact that the black line, rolling std, does not have any pattern _____. The second is that the rolling mean (red line) very closely matches the actual temperature. This will be important for model building, though of course, if we make a feature out of the rolling mean, we cannot include that day's temperature in each window, or we'd be exposing the target variable. We also see that there's no significant trend over the course of the dataset. This is important for time series modeling: if there were a significant trend, we would need to account for it in preprocessing. Even so, we may decide to account for seasonality in preprocessing in order to _____.

In [None]:
ts = df['meantemp']
ts.index = df['date']
ts.plot()
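The rolling mean and rolling standard deviation discussed above can be computed with pandas. The sketch below uses a synthetic seasonal series and an assumed 30-day window (neither comes from the dataset), and shifts the series by one day first so that no window contains the day it describes.

```python
import numpy as np
import pandas as pd

# Synthetic seasonal temperatures standing in for the real data (an assumption).
dates = pd.date_range("2013-01-01", periods=730, freq="D")
synthetic_temp = 25 + 8 * np.sin(2 * np.pi * np.arange(730) / 365.25)
series = pd.Series(synthetic_temp, index=dates, name="meantemp")

window = 30  # assumed window length, chosen only for illustration

# Shift by one day so each window ends on the previous day's value.
shifted = series.shift(1)
rolling_mean = shifted.rolling(window).mean()
rolling_std = shifted.rolling(window).std()

summary = pd.DataFrame(
    {"meantemp": series, "rolling_mean": rolling_mean, "rolling_std": rolling_std}
)
print(summary.dropna().head())
```

Calling `summary.plot()` would draw all three lines on one chart, which is one way to produce a graph like the one described.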

## Introduce Problem
Now that we’ve seen that the data is a good candidate for time series modeling, let’s figure out the exact problem we’ll be solving. To do that, we’ll need to introduce a few concepts that will have an impact on our feature engineering. 

**forecast_horizon**: The number of time periods we are trying to forecast. In this example, we’re interested in predicting the mean temperature for the next 5 days, so the value is 5.

**gap**: The number of time periods between the end of the training set and the start of the test set. We’re going to make predictions using data from three days prior to each observation.

**max_delay**: The maximum number of rows to look in the past from the current row in order to compute features. Here, we’ll use a max delay of 20.

**time_index**: The column of the training dataset that contains the date corresponding to each observation. Here, it's the `date` column.

Our problem can then be described as trying to predict the mean temperature over the next five days, with a gap of three days, using temperature data from up to 20 days prior.
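One way to read these parameters together is sketched below. It assumes that the newest value available to a row lies `forecast_horizon + gap` rows back (which matches how the lag and rolling features are built later in this demo) and that `max_delay` counts additional rows back from there. The helper `usable_lags` is hypothetical and not part of EvalML or Featuretools.

```python
def usable_lags(gap, forecast_horizon, max_delay):
    """Return the row offsets a feature may look back to (hypothetical helper)."""
    # Newest value that is known when predicting a given row (assumption).
    first_lag = forecast_horizon + gap
    # Look back up to max_delay further rows from that first known value.
    return list(range(first_lag, first_lag + max_delay + 1))


lags = usable_lags(gap=3, forecast_horizon=5, max_delay=20)
print(lags[0], lags[-1], len(lags))
```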

In [None]:
# The only columns we'll want to use for modeling - makes this a univariate problem
time_index = "date"
target_col = 'meantemp'

# parameters as evalml uses them 
gap = 3
max_delay = 20
forecast_horizon = 5

## Preprocessing

Since we do not want to complicate the solution by performing multivariate time series modeling, we'll only use the time index column and target column for the rest of this demo.

In [None]:
univariate_df = df[[time_index, target_col]]

### Baseline Run

Our baseline run will only include one feature that is shifted to the first known value for each observation. When splitting data, we'll need to be careful to not have the test dataset's lag feature use values that are technically before the test set begins or inside of the training set. 

First, let's split the data, leaving a `gap` number of observations between the train and test sets.

In [None]:
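def preprocess(time_target_fs):
    """Drop the leading rows that contain NaNs, then split features from target."""
    # Lag and rolling features leave NaNs in the first rows;
    # the column with the most NaNs tells us how many rows to drop
    max_nans = int(time_target_fs.isna().sum().max())

    # Copy so that pop() below does not modify the caller's dataframe
    X = time_target_fs.iloc[max_nans:].copy()

    y = X.pop(target_col)
    return X, y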
    
    

In [None]:
split_point = int(univariate_df.shape[0]*0.7)

# leave gap observations between training and test datasets  
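# copy so that adding the lag column later does not write to a slice of univariate_df
training_data = univariate_df[:split_point].copy()
test_data = univariate_df[(split_point + gap):].copy()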
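The split above can be wrapped in a small helper. This is a sketch on toy data; `split_with_gap` is a hypothetical name, not a function from any of the libraries used here.

```python
import pandas as pd


def split_with_gap(frame, train_fraction, gap):
    """Split a time-ordered frame, skipping `gap` rows between train and test."""
    split_point = int(len(frame) * train_fraction)
    train = frame.iloc[:split_point].copy()
    # Rows split_point .. split_point + gap - 1 belong to neither set.
    test = frame.iloc[split_point + gap:].copy()
    return train, test


# Hypothetical daily data with 20 rows.
toy = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-01", periods=20, freq="D"),
        "meantemp": range(20),
    }
)
train, test = split_with_gap(toy, train_fraction=0.5, gap=3)
print(train["date"].max(), test["date"].min())
```

Because the frame is sliced by position rather than sampled at random, every test row is later in time than every training row.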


In [None]:
# lag feature introduces nans, which we need to handle
training_data['lag'] = training_data[target_col].shift(forecast_horizon + gap + 1)
test_data['lag'] = test_data[target_col].shift(forecast_horizon + gap + 1)


training_data.drop(time_index, axis=1, inplace=True)
test_data.drop(time_index, axis=1, inplace=True)

X_train, y_train = preprocess(training_data)
X_test, y_test = preprocess(test_data)


In [None]:
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train)

preds = reg.predict(X_test)
scores = median_absolute_error(y_test, preds)  # signature is (y_true, y_pred)
print('Median Abs Error: {:.2f}'.format(scores))

high_imp_feats = feature_importances(X_train, reg, feats=10)
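The score printed above is the median absolute error: the median of the absolute differences between actual and predicted values. A minimal numpy sketch of the same calculation, with made-up numbers and a hypothetical function name, looks like this:

```python
import numpy as np


def median_abs_error_sketch(y_true, y_pred):
    """Median of |actual - predicted| (illustrative re-implementation)."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.median(errors))


# Hypothetical actual and predicted temperatures.
example_error = median_abs_error_sketch([1.0, 2.0, 3.0], [1.0, 4.0, 0.0])
print(example_error)
```

Because it uses the median rather than the mean, a handful of very large misses has little effect on this metric.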

We can build more features, some of which may be similar to the `lag` feature we used in the baseline model, but if we look back at the graph with the rolling mean, we remember that the rolling mean was a really good indicator of the mean temperature. So we'll want a way of including that as a feature without exposing our target. This is where Featuretools' time series primitives come into play. We'll also add some more standard datetime primitives that might have predictive power; for example, the month of the year is a very good indicator of what the temperature should be.

### Feature Engineering Run 

In [None]:
split_point = int(univariate_df.shape[0]*0.7)

# leave gap observations between training and test datasets  
training_data = univariate_df[:split_point]
test_data =  univariate_df[(split_point + gap):]

In [None]:
training_data

In [None]:
# parameters as featuretools will use them
rolling_gap = forecast_horizon + gap
rolling_window_length = int(.25*max_delay) + 1 # a quarter is a heuristic here 
rolling_min_periods = int(.25*max_delay) + 1
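To see what these parameters imply, the sketch below approximates a gapped rolling mean with plain pandas. It assumes the `RollingMean` primitive behaves roughly like a shift by `gap` rows followed by a rolling window, which is an approximation and not a statement about its exact implementation. The toy series is made up.

```python
import pandas as pd

gap, forecast_horizon, max_delay = 3, 5, 20
rolling_gap = forecast_horizon + gap
rolling_window_length = int(0.25 * max_delay) + 1
rolling_min_periods = int(0.25 * max_delay) + 1

# Hypothetical target values 0, 1, 2, ... so the arithmetic is easy to follow.
toy = pd.Series(range(30), dtype="float64")

# Approximation: move the series back by the gap, then average each window.
approx = (
    toy.shift(rolling_gap)
    .rolling(rolling_window_length, min_periods=rolling_min_periods)
    .mean()
)

# Rows without enough history are NaN and will be dropped before modeling.
leading_nans = int(approx.isna().sum())
print(rolling_gap, rolling_window_length, leading_nans)
```

Under this approximation, the number of unusable leading rows is `rolling_gap + rolling_min_periods - 1`.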

In [None]:
training_es = ft.EntitySet()
training_es.add_dataframe(training_data.copy(),  # training rows only, so no test data leaks into the features
                 dataframe_name='temperatures', 
                 index='id', 
                 make_index=True, 
                 time_index=time_index)

In [None]:
test_es = ft.EntitySet()
test_es.add_dataframe(test_data.copy(), 
                 dataframe_name='temperatures', 
                 index='id', 
                 make_index=True, 
                 time_index=time_index)

In [None]:
datetime_featureizer = ['Day', 'Month', 'Hour', "Year"]
# Open question: how does EvalML determine the statistically significant lags that set the number of lag features?
# max_delay determines the maximum number of lag features (any number up to that limit can be picked)
lagging_featureizer = [NumericLag(periods=t + forecast_horizon + gap) for t in range(forecast_horizon + gap + 1)]


train_fm, features = ft.dfs(entityset=training_es, 
               target_dataframe_name='temperatures', 
               max_depth=1,
               trans_primitives = datetime_featureizer + lagging_featureizer +[ 
                                   RollingMean(rolling_window_length, 
                                               gap=rolling_gap,
                                              min_periods=rolling_min_periods)]
              )

X_train, y_train = preprocess(train_fm)

train_fm.ww

In [None]:
test_fm = ft.calculate_feature_matrix(features, test_es)

X_test, y_test = preprocess(test_fm)


test_fm.ww

In [None]:
X_test

In [None]:
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train)

preds = reg.predict(X_test)
scores = median_absolute_error(y_test, preds)  # signature is (y_true, y_pred)
print('Median Abs Error: {:.2f}'.format(scores))

high_imp_feats = feature_importances(X_train, reg, feats=100)


Looking at the feature importances above, we see that the rolling mean was, indeed, very predictive along with the Month feature. 

## Use Time Series Regression Problem From EvalML
We will now build a model very similar to the one we just built with Featuretools. Under the hood, EvalML's time series regression problem type performs the same kind of feature engineering that we just did by hand. That, together with some other optimizations and the fact that it searches over multiple pipelines, shows the power of EvalML.

In [None]:
import evalml

# copy so that pop() below does not write to a slice of df
univariate_df = df[[time_index, target_col]].copy()

X = univariate_df
y = univariate_df.pop(target_col)

X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(
    X,
    y,
    problem_type="time series regression",
    test_size=0.3,
    problem_configuration={
        "gap": gap,
        "max_delay": max_delay,
        "forecast_horizon": forecast_horizon,
        "time_index": time_index,
    },
)



In [None]:
from evalml import AutoMLSearch

# X = pd.read_csv(f"data/{file_name}.csv")[[time_index, target_col]]
# X.ww.init()
# y = X.ww.pop(target_col)


# train_dates, test_dates = X[time_index] < "2016-08-08", X[time_index] >= "2016-08-08"
# X_train, y_train = X.ww.loc[train_dates], y.ww.loc[train_dates]
# X_test, y_test =  X.ww.loc[test_dates], y.ww.loc[test_dates]

automl = AutoMLSearch(X_train, y_train, problem_type="time series regression",
                      max_batches=1,
                      problem_configuration={"gap": gap, "max_delay": max_delay,
                                             "forecast_horizon": forecast_horizon, "time_index": time_index},
                      allowed_model_families=["xgboost", "random_forest", "linear_model", "extra_trees",
                                              "decision_tree"],
                      objective='MedianAE'
                      )
automl.search()

In [None]:
automl.rankings

In [None]:
pipeline = automl.best_pipeline
pipeline.feature_importance

Look at how similar these feature importances are to the ones from the Featuretools run! Most of the time, the top three features are the same.

In [None]:
pipeline.fit(X_train, y_train)

best_pipeline_score = pipeline.score(X_test, y_test, ['R2'], X_train, y_train)['R2']
best_pipeline_score
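The pipelines here are scored with R2. Assuming that objective follows the usual coefficient of determination, it can be sketched in numpy as follows; the function name and the numbers are hypothetical.

```python
import numpy as np


def r2_sketch(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


actual = [1.0, 2.0, 3.0, 4.0]
perfect = r2_sketch(actual, actual)
mean_only = r2_sketch(actual, [2.5, 2.5, 2.5, 2.5])
print(perfect, mean_only)
```

A score of 1 means the predictions match the actual values exactly, while a model that always predicts the mean of the actual values scores 0, so higher is better when comparing the best pipeline with the baseline.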

In [None]:
baseline = automl.get_pipeline(0)
baseline.fit(X_train, y_train)
naive_baseline_score = baseline.score(X_test, y_test, ['R2'], X_train, y_train)['R2']


In [None]:
fig = graph_prediction_vs_actual_over_time(pipeline, X_test, y_test, X_train, y_train, dates=X_test['date'])
fig

In [None]:
fig = graph_prediction_vs_actual_over_time(baseline, X_test, y_test, X_train, y_train, dates=X_test['date'])
fig