In [1]:
import warnings
warnings.filterwarnings('ignore')
import utils
import sklearn 

# Baseline Time Series Solution

In this notebook, we'll build a baseline model to predict future daily average temperatures using historical temperature data.

Time series forecasting is different from other machine learning problems in that there is an inherent temporal ordering to the data, which means that special considerations will need to be taken into account during preprocessing, feature engineering, and model building.

## Configure Problem

In [2]:
filepath = "dataset/DailyDelhiClimate.csv"

time_index = "date"
target_col = 'meantemp'

df = utils.read_data(filepath, time_index, target_col)

df.head(10)

Unnamed: 0,date,meantemp
0,2013-01-01,10.0
1,2013-01-02,7.4
2,2013-01-03,7.166667
3,2013-01-04,8.666667
4,2013-01-05,6.0
5,2013-01-06,7.0
6,2013-01-07,7.0
7,2013-01-08,8.857143
8,2013-01-09,14.0
9,2013-01-10,11.0


In this demo, as in many time series problems, we're trying to predict a sequence of values that are highly dependent on one another. We will exploit the fact that more recent observations tend to be more predictive than more distant ones: when trying to determine tomorrow's temperature, today's temperature may be the most predictive piece of information we can get.

In many scenarios, however, data does not arrive quickly enough for us to use yesterday's temperature for modeling. Consider an example where recorded data takes a week to ingest; the most recent data we have access to is from seven days ago, so seven days would be the constraint for our baseline feature.

In this demo, we do not naturally have any of these constraints, so we'll need to set a delay arbitrarily when formally defining the problem we're solving. Let's say we have a delay of nine days; since our data occurs at a daily frequency, this will be `9` rows. 

In [3]:
delay = 9

## Data Splitting

Additionally, we'll want to split our data into training and testing sets. Since the data has a strict temporal ordering, we split it at a defined point in time instead of randomly sampling rows.

In [4]:
training_data, test_data = utils.get_train_test(df, 3)
test_data.head()

Unnamed: 0,date,meantemp
1105,2016-01-11,15.75
1106,2016-01-12,18.0
1107,2016-01-13,18.266667
1108,2016-01-14,15.5625
1109,2016-01-15,13.0
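
The internals of `utils.get_train_test` are not shown in this notebook, so the following is only a minimal sketch of a time-ordered split, not the helper's actual implementation. The function name `split_by_time`, the `test_fraction` parameter, and the toy values are all hypothetical.

```python
import pandas as pd


def split_by_time(frame, time_col, test_fraction=0.25):
    """Hypothetical helper: earliest rows go to training, latest rows to testing."""
    ordered = frame.sort_values(time_col).reset_index(drop=True)
    # Everything before the cutoff position is training data.
    cutoff = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy data for illustration only (not the Delhi dataset).
toy = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=8, freq="D"),
    "meantemp": [10.0, 7.4, 7.2, 8.7, 6.0, 7.0, 7.0, 8.9],
})

train, test = split_by_time(toy, "date", test_fraction=0.25)

# Every training date precedes every test date, so no future data leaks backward.
print(train["date"].max(), "<", test["date"].min())
```

The key property is that the split is a single cut along the time axis, so the model is always evaluated on observations that come after everything it was trained on.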


## Feature Engineering
Our baseline run will include only one feature: a delayed value from the `meantemp` column. For each observation, that delayed value is the most recent target value we would have access to given the delay.

In [5]:
baseline_training = utils.add_delayed_feature(training_data, 
                                              col_to_delay=target_col, 
                                              delay_length=delay)
baseline_test = utils.add_delayed_feature(test_data, 
                                          col_to_delay=target_col, 
                                          delay_length=delay)

baseline_training.head(13)

Unnamed: 0,date,meantemp,target_delay
0,2013-01-01,10.0,
1,2013-01-02,7.4,
2,2013-01-03,7.166667,
3,2013-01-04,8.666667,
4,2013-01-05,6.0,
5,2013-01-06,7.0,
6,2013-01-07,7.0,
7,2013-01-08,8.857143,
8,2013-01-09,14.0,
9,2013-01-10,11.0,10.0


Notice how the `meantemp` value at index `0` is the same as the `target_delay` value at index `9`, the first non-null value. With a delay of nine rows, index `9` is the first row for which a delayed value exists in the original target column.
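
A delayed feature like this can be reproduced with pandas' `Series.shift`. The sketch below is an assumption about what `utils.add_delayed_feature` does rather than its actual code; it uses the first ten `meantemp` values shown earlier in the notebook.

```python
import pandas as pd

# First ten meantemp values from the dataset preview above.
toy = pd.DataFrame({
    "meantemp": [10.0, 7.4, 7.166667, 8.666667, 6.0,
                 7.0, 7.0, 8.857143, 14.0, 11.0],
})

delay = 9

# shift(delay) moves each value down by `delay` rows; the first `delay`
# rows have no earlier value to draw from, so they become NaN.
toy["target_delay"] = toy["meantemp"].shift(delay)

print(toy)
```

The first `delay` rows are null, and the value from row `0` reappears `delay` rows later as the first usable feature value.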

## Format data for modeling

We don't want the time index column, `date`, in our data for modeling. We also need to remove the null values that were introduced by the delayed target feature. Finally, we need to pull the target column out of the feature set.

In [6]:
# Get rid of the time index column for modeling
baseline_training.drop(time_index, axis=1, inplace=True)
baseline_test.drop(time_index, axis=1, inplace=True)

# The lag feature introduces NaNs, so we remove those rows and pull out the target
X_train = baseline_training.dropna()
y_train = X_train.pop(target_col)

X_test = baseline_test.dropna()
y_test = X_test.pop(target_col)

X_train.head()

Unnamed: 0,target_delay
9,10.0
10,7.4
11,7.166667
12,8.666667
13,6.0


## Model Building

Now that we've formatted our training and test data for modeling, we can use the training features, `X_train`, and the training target, `y_train`, to fit the random forest regressor we've chosen as our estimator. Then, we use the test features, `X_test`, to predict our target values and compare the predictions against `y_test`.

In [7]:
reg, baseline_score = utils.train_and_fit_random_forest_regressor(X_train, y_train, X_test, y_test)

Median Abs Error: 2.20
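
The body of `utils.train_and_fit_random_forest_regressor` is not shown here, so the following is a hedged sketch of what a fit-and-score step of this kind might look like with scikit-learn. It runs on synthetic stand-in data, and the hyperparameters (`n_estimators=100`, `random_state=0`) are assumptions, so its score is not comparable to the one above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import median_absolute_error

# Synthetic stand-in data (not the Delhi dataset): a noisy seasonal curve.
rng = np.random.default_rng(0)
days = np.arange(400)
temps = pd.Series(25 + 8 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1, days.size))

# Single lagged feature, mirroring the baseline setup with a delay of 9 rows.
frame = pd.DataFrame({"meantemp": temps, "target_delay": temps.shift(9)}).dropna()

# Time-ordered split: earlier rows train, later rows test.
train, test = frame.iloc[:300], frame.iloc[300:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[["target_delay"]], train["meantemp"])

preds = model.predict(test[["target_delay"]])
score = median_absolute_error(test["meantemp"], preds)
print(f"Median Abs Error: {score:.2f}")
```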


We're using median absolute error as our evaluation metric for determining how the model performs. **The closer the score is to zero, the more accurate our model is.**
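
To make the metric concrete, here is a small worked example using NumPy. The true and predicted values are made up for illustration. scikit-learn's `median_absolute_error` computes the same quantity.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([10.0, 7.4, 7.0, 14.0, 11.0])
y_pred = np.array([9.0, 8.4, 7.5, 10.0, 11.0])

abs_errors = np.abs(y_true - y_pred)

# Median absolute error: the middle value of the sorted absolute errors.
median_abs_error = np.median(abs_errors)

# Mean absolute error, for comparison; the single large miss pulls it upward.
mean_abs_error = abs_errors.mean()

print(f"median: {median_abs_error:.2f}, mean: {mean_abs_error:.2f}")
```

Because it takes the median rather than the mean, this metric is less sensitive to a few unusually large errors, such as the 4-degree miss in the example.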

We can also learn a bit more about our model by looking at the feature importances. 

In [8]:
high_imp_feats = utils.feature_importances(X_train, reg, feats=10)

1: target_delay [1.000]
-----

