## Setting

We have already started exploring the minimum temperature dataset during the sessions. In this assignment we will carry the feature engineering forward, create the following features, and fit a random forest regression model.

The full list of tasks is as follows.

1. For the entire dataset, add the following features
    - Day of the month
    - Month of the year
    - Year - 1981
    - Day of the year
        * write a custom function which computes day of the year from day of the month and month of year
        * apply the function in list comprehension
    - Add $lag_{1}$, $lag_{2}$, $lag_{3}$, $lag_{4}$, $lag_{5}$ features
2. Split the dataset into two parts
    - the first 9 years (training set)
    - the last (tenth) year (test set)
3. Write a function to fit a model to your training set (return model as an output)
4. Write a function to evaluate the model's performance on the test set
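
The custom day-of-year function from task 1 could look like the following minimal sketch. The name `day_of_year`, the optional `year` argument (needed only to handle leap years), and the sample dates are assumptions made for illustration and are not part of the assignment text.

```python
from calendar import isleap

# Days per month in a non-leap year, January first.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def day_of_year(day, month, year=None):
    """Return the 1-based day of the year for a day of month and month of year.

    `year` is an optional, illustrative argument: when it is given and is a
    leap year, dates after February are shifted by one day.
    """
    total = sum(DAYS_IN_MONTH[:month - 1]) + day
    if year is not None and isleap(year) and month > 2:
        total += 1
    return total

# Apply the function in a list comprehension over (day, month, year) tuples.
dates = [(1, 1, 1981), (1, 3, 1981), (1, 3, 1984), (31, 12, 1984)]
days = [day_of_year(d, m, y) for d, m, y in dates]
print(days)  # [1, 60, 61, 366]
```

On a dataframe with a `DatetimeIndex`, the same comprehension could iterate over `zip(index.day, index.month, index.year)`.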

## Here is code to reproduce the dataset that we discussed in class.

Please note that a few corrupt entries needed to be corrected.

In [1]:
from pprint import pprint
from pandas import read_csv
from pandas import Series
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

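series = read_csv('./data/daily_temp.csv',
                  header=0, parse_dates=[0], index_col=0)
# Overwrite the three corrupt entries by row position.
series.iloc[566] = 0.8
series.iloc[565] = 0.2
series.iloc[1290] = 0.1

# Place lagged (t-2, t-1) and lead (t+1) copies of the series next to the current value t.
dataframe = pd.concat([series.shift(2), series.shift(1), series, series.shift(-1)], axis=1)
dataframe.columns = ['t-2', 't-1', 't', 't+1']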
dataframe.head(5)

Date,t-2,t-1,t,t+1
1981-01-01,,,20.7,17.9
1981-01-02,,20.7,17.9,18.8
1981-01-03,20.7,17.9,18.8,14.6
1981-01-04,17.9,18.8,14.6,15.8
1981-01-05,18.8,14.6,15.8,15.8


## 1. Creating timestamp features

Write a function `timestamp_features` that creates the following features
* day of month
* month of year
* adjusted year (baselined at 1981)
* day of year
* day of week
* week of year
    
The function
* accepts:
    * the provided dataframe
* returns:
    * the dataframe with added timestamp features
   

In [2]:
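def timestamp_features(dataframe):
    # All features are derived from the DatetimeIndex; the frame is modified in place.
    dataframe["day"] = dataframe.index.day
    dataframe["month"] = dataframe.index.month
    dataframe["adj_year"] = dataframe.index.year - 1981  # years since 1981
    dataframe["day_of_year"] = dataframe.index.dayofyear
    dataframe["day_of_week"] = dataframe.index.dayofweek  # Monday=0 ... Sunday=6
    # DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar() (pandas >= 1.1) gives the ISO week.
    dataframe["week_of_year"] = dataframe.index.isocalendar().week.astype(int)
    return dataframe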

In [3]:
dataframe = timestamp_features(dataframe)
dataframe.head()

Date,t-2,t-1,t,t+1,day,month,adj_year,day_of_year,day_of_week,week_of_year
1981-01-01,,,20.7,17.9,1,1,0,1,3,1
1981-01-02,,20.7,17.9,18.8,2,1,0,2,4,1
1981-01-03,20.7,17.9,18.8,14.6,3,1,0,3,5,1
1981-01-04,17.9,18.8,14.6,15.8,4,1,0,4,6,1
1981-01-05,18.8,14.6,15.8,15.8,5,1,0,5,0,2
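
As a cross-check on the calendar columns above, the same features can be computed for a single date with the standard library alone. This is only a sketch: the helper name `calendar_features` is hypothetical, and it assumes that the week number in the table is the ISO 8601 week, which is what `date.isocalendar()` returns.

```python
from datetime import date

def calendar_features(d):
    """Return the calendar features for one date using only the standard library."""
    return {
        "day": d.day,
        "month": d.month,
        "adj_year": d.year - 1981,             # years since 1981
        "day_of_year": d.timetuple().tm_yday,  # 1-based day of the year
        "day_of_week": d.weekday(),            # Monday=0 ... Sunday=6
        "week_of_year": d.isocalendar()[1],    # ISO 8601 week number
    }

first = calendar_features(date(1981, 1, 1))
fifth = calendar_features(date(1981, 1, 5))
print(first)
print(fifth)
```

1 January 1981 was a Thursday in ISO week 1 and 5 January 1981 was a Monday in ISO week 2, which agrees with the `day_of_week` and `week_of_year` columns shown above.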


## 2. Creating Lag features

Write a function `lag_features` that creates the following features
* t-7
* t-15
    
The function
* accepts:
    * the provided dataframe
* returns:
    * the dataframe with added lag features
   

In [24]:
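def lag_features(dataframe):
    # Shift the observed column 't' of the frame itself instead of relying on the global `series`.
    dataframe["t-7"] = dataframe["t"].shift(7)    # value 7 days earlier
    dataframe["t-15"] = dataframe["t"].shift(15)  # value 15 days earlier
    return dataframe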

In [25]:
dataframe = lag_features(dataframe)
dataframe.head()

Date,t-2,t-1,t,t+1,day,month,adj_year,day_of_year,day_of_week,week_of_year,t-7,t-15
1981-01-01,,,20.7,17.9,1,1,0,1,3,1,,
1981-01-02,,20.7,17.9,18.8,2,1,0,2,4,1,,
1981-01-03,20.7,17.9,18.8,14.6,3,1,0,3,5,1,,
1981-01-04,17.9,18.8,14.6,15.8,4,1,0,4,6,1,,
1981-01-05,18.8,14.6,15.8,15.8,5,1,0,5,0,2,,
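
The same shifting idea extends to any set of lags, such as the five consecutive lags listed in the task overview. The sketch below is one possible generalisation, not the assignment's required solution: the helper name `add_lags`, its arguments, and the toy values are assumptions for illustration.

```python
import pandas as pd

def add_lags(frame, column, lags):
    """Return a copy of `frame` with one `<column>-<k>` column per lag k."""
    out = frame.copy()
    for k in lags:
        # shift(k) moves values k rows down, so row i holds the value from row i-k.
        out[f"{column}-{k}"] = out[column].shift(k)
    return out

# Synthetic values for illustration only, not the temperature data.
index = pd.date_range("1981-01-01", periods=6, freq="D")
toy = pd.DataFrame({"t": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)
lagged = add_lags(toy, "t", lags=[1, 2, 3])
print(lagged)
```

The first `k` rows of each lag column are `NaN` because no earlier observation exists, which is why the first rows of `t-7` and `t-15` above are empty.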


## 3. Create train test split

Write a function `train_test_split` that creates the train/test split. Keep in mind that time series data must be split without shuffling, so that the test set lies strictly after the training set in time.

Split the data so that the first 9 years are allocated to the training set and the last year is allocated to the test set.
    
The function 
* accepts:
    * Provided dataframe
* returns:
    * train_X, test_X, train_y, test_y 
   

In [26]:
def train_test_split(dataframe):
    # Chronological (unshuffled) split: adjusted years 0-8 for training, year 9 for testing.
    train = dataframe.loc[dataframe.adj_year <= 8]
    test = dataframe.loc[dataframe.adj_year == 9]
    # Caution: 't' is listed as a feature and is also the target below, so the
    # target leaks into the inputs; remove 't' here for an honest evaluation.
    feature_cols = ['t-2', 't-1', 't', 'day', 'month', 'adj_year', 'day_of_year',
                    'day_of_week', 'week_of_year', 't-7', 't-15']
    train_X, train_y = train[feature_cols], train[["t"]]
    test_X, test_y = test[feature_cols], test[["t"]]
    return train_X, test_X, train_y, test_y

In [27]:
X_train, X_test, y_train, y_test = train_test_split(dataframe)
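
Because the feature list in the split above contains the column `t`, which is also the target, the model can read the answer directly from its inputs. The following is a minimal sketch of a split that keeps the target out of the feature matrix. The helper name `chronological_split`, its arguments, and the two-year toy frame are assumptions for illustration.

```python
import pandas as pd

def chronological_split(frame, target, test_year):
    """Split by calendar year, keeping the target out of the feature matrix."""
    features = [c for c in frame.columns if c != target]
    train = frame[frame.index.year < test_year]
    test = frame[frame.index.year == test_year]
    return train[features], test[features], train[target], test[target]

# Synthetic two-year frame for illustration only.
index = pd.date_range("1989-01-01", "1990-12-31", freq="D")
toy = pd.DataFrame({"t": range(len(index))}, index=index, dtype=float)
toy["t-1"] = toy["t"].shift(1)
toy = toy.dropna()  # drop the first row, whose lag is undefined

train_X, test_X, train_y, test_y = chronological_split(toy, "t", 1990)
print(train_X.index.max(), test_X.index.min())
```

Every training date precedes every test date, and `t` appears only in the target series.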

## 4. Train the model and evaluate it

Write a function `model` that trains a random forest regressor on the given training set and calculates the mean squared error (MSE) on the test set

The function 
* accepts:
    * train_X, test_X, train_y, test_y 
* returns:
    * mse
    * trained model
   

In [30]:
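from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def model(X_train, X_test, y_train, y_test):
    # Named `regressor` so the local variable does not shadow this function.
    regressor = RandomForestRegressor()
    # ravel() passes the single-column target as the 1-D array scikit-learn expects.
    regressor.fit(X_train, y_train.values.ravel())
    y_pred = regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return mse, regressor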

In [31]:
# Skip the first 15 training rows, whose t-15 lag is NaN.
model(X_train[15:], X_test, y_train[15:], y_test)

  


(0.0001095890410958939,
 RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=False))
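
The hard-coded `[15:]` slice works only while 15 is the largest lag. A hedged alternative is to drop rows with undefined lags using `dropna()` and to fix `random_state` so that repeated runs give the same score. The sketch below runs on a synthetic seasonal series, not on the temperature data, so its MSE says nothing about the real dataset. The variable names and the choice of `n_estimators=50` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic seasonal series for illustration only, not the temperature data.
rng = np.random.default_rng(0)
index = pd.date_range("1981-01-01", "1984-12-31", freq="D")
season = 5 * np.sin(2 * np.pi * index.dayofyear.to_numpy() / 365.25)
toy = pd.DataFrame({"t": 10 + season + rng.normal(0, 1, len(index))}, index=index)
for k in (1, 7, 15):
    toy[f"t-{k}"] = toy["t"].shift(k)

# dropna() removes exactly the rows with undefined lags, whatever the largest lag is.
toy = toy.dropna()
features = [c for c in toy.columns if c != "t"]  # the target is not a feature
train, test = toy[toy.index.year < 1984], toy[toy.index.year == 1984]

# A fixed random_state makes the fitted forest, and therefore the MSE, reproducible.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(train[features], train["t"])
mse = mean_squared_error(test["t"], forest.predict(test[features]))
print(f"test MSE: {mse:.3f}")
```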