In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw8.ipynb")

# CPSC 330 - Applied Machine Learning 

## Homework 8: Time series
**Due date: See the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import r2_score
from sklearn.model_selection import cross_validate

## Submission instructions
<hr>
rubric={points:4}

You will receive marks for correctly submitting this assignment. To submit this assignment, follow the instructions below:

- **You may work on this assignment in a group (group size <= 4) and submit your assignment as a group.** 
- Below are some instructions on working as a group.  
    - The maximum group size is 4. 
    - You can choose your own group members. 
    - Use group work as an opportunity to collaborate and learn new things from each other. 
    - Be respectful to each other and make sure you understand all the concepts in the assignment well. 
    - It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. [Here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members) are some instructions on adding group members in Gradescope.  
- Upload the .ipynb file to Gradescope.
- **If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a pdf or html in addition to the .ipynb.** 
- Make sure that your plots/output are rendered properly in Gradescope.

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset. We will be forecasting the average avocado price for the next week. 

In [3]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [4]:
df.shape

(18249, 13)

In [5]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [6]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of 3.25 years or so. Let's split the data so that we have 6 months of test data.

In [7]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [8]:
assert len(df_train) + len(df_test) == len(df)

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series? 
rubric={points:4}

In the Rain in Australia dataset from lecture, we had different measurements for each Location. What about this dataset: for which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

The feature I would segment the time series by is `region`. Observations within a region are likely to be more closely related than observations in different regions, so a price trend within one region is more likely to be meaningful than artificial. Each region also has rows for both `conventional` and `organic` avocados in the `type` column, so `type` separates the measurements further.

In [9]:
# Number of regions
df_train['region'].unique().size

54
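
One way to check which columns separate the series is to test whether dates repeat within a group. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the avocado data) and only mirrors the column names.

```python
import pandas as pd

# Toy stand-in for the avocado data: 2 regions x 2 types x 3 weeks (hypothetical values).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18"] * 4),
    "region": ["Albany"] * 6 + ["Boise"] * 6,
    "type": (["conventional"] * 3 + ["organic"] * 3) * 2,
})

# Number of candidate series when grouping by both columns.
n_series = toy.groupby(["region", "type"]).ngroups

# If each (region, type) pair is its own series, no date repeats inside a pair.
dates_unique_within_pair = not toy.duplicated(subset=["region", "type", "Date"]).any()

# Grouping by region alone leaves repeated dates inside a group.
dates_unique_within_region = not toy.duplicated(subset=["region", "Date"]).any()

print(n_series, dates_unique_within_pair, dates_unique_within_region)
```

The same three lines could be pointed at `df_train` to see whether `region` alone is enough to identify a series.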

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements? 
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

The data set contains observations that occur every 7 days. 

When sorting the data set by date, removing duplicate dates, and calculating the column-wise difference we get a time interval of 7 days. The 7 day interval is observed in the head, tail, and a random continuous slice of the data set.

In [10]:
df_train['Date'].sort_values().drop_duplicates().head(n = 10).diff()

50      NaT
50   7 days
49   7 days
48   7 days
47   7 days
46   7 days
45   7 days
44   7 days
43   7 days
42   7 days
Name: Date, dtype: timedelta64[ns]

In [11]:
df_train.drop_duplicates().shape[0]

15441

In [12]:
dates = df_train['Date'].sort_values().drop_duplicates()

start_idx = np.random.randint(0, dates.size - 10)
end_idx = start_idx + 10

dates.iloc[start_idx:end_idx].diff()

47      NaT
46   7 days
45   7 days
44   7 days
43   7 days
42   7 days
41   7 days
40   7 days
39   7 days
38   7 days
Name: Date, dtype: timedelta64[ns]

In [13]:
df_train['Date'].sort_values().drop_duplicates().tail(n = 10).diff()

23      NaT
22   7 days
21   7 days
20   7 days
19   7 days
18   7 days
17   7 days
16   7 days
15   7 days
14   7 days
Name: Date, dtype: timedelta64[ns]
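
Instead of sampling the head, the tail, and a random slice, every gap can be counted at once. This is a minimal sketch on made-up dates (`toy_dates` is an assumption) with one week deliberately missing, to show what an irregular gap would look like.

```python
import pandas as pd

# Toy weekly dates with one missing week and one repeated date (hypothetical data).
toy_dates = pd.Series(pd.to_datetime(
    ["2015-01-04", "2015-01-11", "2015-01-18", "2015-02-01", "2015-01-11"]
))

# Sort, drop repeated dates, and take the difference between consecutive dates.
gaps = toy_dates.drop_duplicates().sort_values().diff().dropna()

# Count how often each gap length (in days) occurs.
gap_days = gaps.dt.days.value_counts().to_dict()
print(gap_days)
```

If only one gap length appears in the result, the distinct dates are equally spaced.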

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions 
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

There are regions that overlap.

The most evident overlap is between `TotalUS` and all other regions, since every other region is part of `TotalUS`. There are also states (e.g. `California`) alongside cities within those states (e.g. `SanFrancisco`). Moreover, some regions combine two places (e.g. `PhoenixTucson`, `WestTexNewMexico`).

In [14]:
df_train['region'].unique().tolist()

['Albany',
 'Atlanta',
 'BaltimoreWashington',
 'Boise',
 'Boston',
 'BuffaloRochester',
 'California',
 'Charlotte',
 'Chicago',
 'CincinnatiDayton',
 'Columbus',
 'DallasFtWorth',
 'Denver',
 'Detroit',
 'GrandRapids',
 'GreatLakes',
 'HarrisburgScranton',
 'HartfordSpringfield',
 'Houston',
 'Indianapolis',
 'Jacksonville',
 'LasVegas',
 'LosAngeles',
 'Louisville',
 'MiamiFtLauderdale',
 'Midsouth',
 'Nashville',
 'NewOrleansMobile',
 'NewYork',
 'Northeast',
 'NorthernNewEngland',
 'Orlando',
 'Philadelphia',
 'PhoenixTucson',
 'Pittsburgh',
 'Plains',
 'Portland',
 'RaleighGreensboro',
 'RichmondNorfolk',
 'Roanoke',
 'Sacramento',
 'SanDiego',
 'SanFrancisco',
 'Seattle',
 'SouthCarolina',
 'SouthCentral',
 'Southeast',
 'Spokane',
 'StLouis',
 'Syracuse',
 'Tampa',
 'TotalUS',
 'West',
 'WestTexNewMexico']

<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from Lecture 20, with some improvements.

In [15]:
def create_lag_feature(df, orig_feature, lag, groupby, new_feature_name=None, clip=False):
    """
    Creates a new feature that's a lagged version of an existing one.
    
    NOTE: assumes df is already sorted by the time columns and has unique indices.
    
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        The dataset.
    orig_feature : str
        The column name of the feature we're copying
    lag : int
        The lag; negative lag means values from the past, positive lag means values from the future
    groupby : list
        Column(s) to group by in case df contains multiple time series
    new_feature_name : str
        Override the default name of the newly created column
    clip : bool
        If True, remove rows with NaN values for the new feature
    
    Returns
    -------
    pandas.core.frame.DataFrame
        A new dataframe with the additional column added.
        
    TODO: could/should simplify this function by using `df.shift()`
    """
        
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = "%s_lag%d" % (orig_feature, -lag)
        else:
            new_feature_name = "%s_ahead%d" % (orig_feature, lag)
    
    new_df = df.assign(**{new_feature_name : np.nan})
    for name, group in new_df.groupby(groupby):        
        if lag < 0: # take values from the past
            new_df.loc[group.index[-lag:],new_feature_name] = group.iloc[:lag][orig_feature].values
        else:       # take values from the future
            new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][orig_feature].values
            
    if clip:
        new_df = new_df.dropna(subset=[new_feature_name])
        
    return new_df
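
The TODO in the docstring mentions `df.shift()`. A minimal sketch of that idea, using `groupby(...).shift(...)` on a made-up frame (`toy` and its values are assumptions), could look like the following. It is not a drop-in replacement for `create_lag_feature`, only an illustration of the `lag=+1`, `clip=True` case.

```python
import pandas as pd

# Toy frame with two series, already sorted by group and time (hypothetical data).
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "AveragePrice": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
})

# shift(-1) within each group pulls the next row's value up to the current row,
# which corresponds to lag=+1 (a value from the future) in create_lag_feature.
toy["AveragePriceNextWeek"] = toy.groupby("region")["AveragePrice"].shift(-1)

# Equivalent of clip=True: drop rows whose new feature is missing.
toy_clipped = toy.dropna(subset=["AveragePriceNextWeek"])
print(toy_clipped)
```

Because the shift happens inside each group, the last row of series `A` does not pick up the first value of series `B`.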

We first sort our dataframe properly:

In [16]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.

In [17]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,1.56


Our goal is to predict `AveragePriceNextWeek`. 

Let's split the data:

In [18]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline 
rubric={points:4}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of "AveragePriceNextWeek" exactly equal to "AveragePrice", assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good. 

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

The `AveragePrice`, the target variable, has its first value removed. We cannot use the first value in the target variable, since we do not have any value that precedes it. The `AveragePriceNextWeek` has its last value removed, since it does not precede any `AveragePrice` value.

The $R^2$ score on the training data is 0.9715, so the two shifted series agree closely.

In [19]:
from sklearn.metrics import r2_score

r2_score(df_train['AveragePrice'][1:], 
         df_train['AveragePriceNextWeek'][:-1])

0.9715177543827928
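
For comparison, the baseline as worded in the question can be sketched without any slicing: `create_lag_feature` already aligns `AveragePriceNextWeek` row by row with `AveragePrice`, so the prediction is the `AveragePrice` column and the truth is the `AveragePriceNextWeek` column. The values below are made up for illustration. On the real data the same call would be applied to `df_train` and `df_test` separately.

```python
import pandas as pd
from sklearn.metrics import r2_score

# Toy rows where the target column already holds next week's price (hypothetical values).
toy = pd.DataFrame({
    "AveragePrice":         [1.0, 1.1, 1.2, 1.3],
    "AveragePriceNextWeek": [1.1, 1.2, 1.3, 1.4],
})

# Baseline: predict that next week's price equals this week's price.
y_true = toy["AveragePriceNextWeek"]
y_pred = toy["AveragePrice"]

# r2_score takes the true values first and the predictions second.
baseline_r2 = r2_score(y_true, y_pred)
print(baseline_r2)
```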

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.

<div class="alert alert-warning">

Solution_1.5
    
</div>


In [20]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15441 entries, 0 to 18222
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  15441 non-null  datetime64[ns]
 1   AveragePrice          15441 non-null  float64       
 2   Total Volume          15441 non-null  float64       
 3   4046                  15441 non-null  float64       
 4   4225                  15441 non-null  float64       
 5   4770                  15441 non-null  float64       
 6   Total Bags            15441 non-null  float64       
 7   Small Bags            15441 non-null  float64       
 8   Large Bags            15441 non-null  float64       
 9   XLarge Bags           15441 non-null  float64       
 10  type                  15441 non-null  object        
 11  year                  15441 non-null  int64         
 12  region                15441 non-null  object        
 13  AveragePriceNext

### Data Splitting + Feature Engineering

Two date-derived features are added: the season and the fiscal quarter in which each observation was recorded. The month and a combined volume for the `4225` and `4770` avocado sizes are added as well.

1. Season: The yield and operational costs of avocado production may vary with the season (e.g. summer vs. winter). Lower yields or higher shipping costs may affect the price at the grocery store.
2. Fiscal quarter: Stakeholders may want to meet fiscal-year projections. If a sudden change in avocado sales threatens those projections, prices may change by the next quarter.

In [21]:
SEASON_MONTHS = {'winter': [12,  1,  2],
                 'spring': [ 3,  4,  5],
                 'summer': [ 6,  7,  8],
                 'fall':   [ 9, 10, 11]}

FISCAL_QUARTER = {'q1': [ 1,  2,  3],
                  'q2': [ 4,  5,  6],
                  'q3': [ 7,  8,  9],
                  'q4': [10, 11, 12]}

X_train, y_train = df_train.drop(columns = 'AveragePrice'), df_train['AveragePrice']
X_test, y_test = df_test.drop(columns = 'AveragePrice'), df_test['AveragePrice']

def replace_values(value, mapper):
    return next(k for k, v in mapper.items() if value in v)

X_train = X_train.assign(
    fiscal_quarter = X_train['Date'].apply(lambda x: replace_values(x.month, FISCAL_QUARTER)),
    season = X_train['Date'].apply(lambda x: replace_values(x.month, SEASON_MONTHS)),
    month = X_train['Date'].dt.month,
    l_and_xl_size = X_train['4225'] + X_train['4770']
)

X_test = X_test.assign(
    fiscal_quarter = X_test['Date'].apply(lambda x: replace_values(x.month, FISCAL_QUARTER)),
    season = X_test['Date'].apply(lambda x: replace_values(x.month, SEASON_MONTHS)),
    month = X_test['Date'].dt.month,
    l_and_xl_size = X_test['4225'] + X_test['4770']
)
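
The same quarter and season labels can also be derived with vectorized pandas operations instead of a row-wise `apply`. This is a sketch on made-up dates (`dates` is an assumption), using calendar quarters and December to February as winter, as in the dictionaries above.

```python
import pandas as pd

# Toy dates covering several months (hypothetical data).
dates = pd.Series(pd.to_datetime(["2015-01-04", "2015-04-12", "2015-08-30", "2015-12-27"]))

# Calendar quarter as a label, matching the q1..q4 keys used above.
fiscal_quarter = "q" + dates.dt.quarter.astype(str)

# (month % 12) // 3 maps 12,1,2 -> 0; 3,4,5 -> 1; 6,7,8 -> 2; 9,10,11 -> 3.
season_names = {0: "winter", 1: "spring", 2: "summer", 3: "fall"}
season = ((dates.dt.month % 12) // 3).map(season_names)

print(pd.DataFrame({"Date": dates, "fiscal_quarter": fiscal_quarter, "season": season}))
```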

### Preprocessing

The preprocessor applied to the training data is defined here. 

The numeric features (e.g. Total Volume, Total Bags) are standardized using the `StandardScaler` transformer. Standardization puts all numeric features on a similar scale, removing the influence that each feature's intrinsic magnitude may have.

Categorical variables are one-hot encoded. The year and month are one-hot encoded as well, since I suspect each year and month may contribute differently to the model.

In [22]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit

NUMERIC = X_train.select_dtypes(include = [np.float64, np.int64]).columns.tolist()
NOMINAL = ['type', 'region', 'season', 'fiscal_quarter', 'year', 'month']
DROP = ['Total Volume', 'Total Bags', 'Date', '4225', '4770']

preprocessor = make_column_transformer(
    (StandardScaler(), NUMERIC),
    (OneHotEncoder(handle_unknown = 'ignore'),  NOMINAL),
    ('drop', DROP)
)
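
When column lists are built automatically, it can be worth checking that they do not overlap. As far as I understand scikit-learn's `ColumnTransformer`, each entry is applied to its own column subset independently, so a column listed under both a transformer and `'drop'` is still emitted by the transformer. The sketch below shows this on a made-up two-column frame (`toy`, `ct_overlap`, and `ct_disjoint` are illustrative names).

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical data); column "b" is listed under both the scaler and 'drop'.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

ct_overlap = make_column_transformer(
    (StandardScaler(), ["a", "b"]),
    ("drop", ["b"]),
)
# "b" still comes out of the scaler: the 'drop' entry simply contributes no columns.
out_overlap = ct_overlap.fit_transform(toy)

ct_disjoint = make_column_transformer(
    (StandardScaler(), ["a"]),
    ("drop", ["b"]),
)
# With disjoint lists, only "a" is emitted.
out_disjoint = ct_disjoint.fit_transform(toy)

print(out_overlap.shape, out_disjoint.shape)
```

A quick `set(NUMERIC) & set(DROP)` style check on the real lists would reveal any such overlap.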

### Ridge

A simple ridge regression is fitted because it includes regularization, which helps prevent overfitting.

This model achieves the best test score of all the models in this notebook.

In [23]:
from sklearn.linear_model import RidgeCV

pipe_ridge = make_pipeline(
    preprocessor,
    RidgeCV(alphas = 10.0 ** np.arange(-5, 5, 1), cv = TimeSeriesSplit())
)
pipe_ridge.fit(X_train, y_train)

print(f'R^2: {r2_score(y_test, pipe_ridge.predict(X_test))}')

R^2: 0.8015849461338838
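
`TimeSeriesSplit` splits by row position, so its folds only respect time order if the rows are sorted chronologically. The sketch below checks this on a made-up frame sorted by series first and date second (`toy` and `folds_respect_time` are hypothetical names introduced for illustration).

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy frame sorted by series first, then date (hypothetical data).
toy = pd.DataFrame({
    "region": ["A"] * 4 + ["B"] * 4,
    "Date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18", "2015-01-25"] * 2),
})

def folds_respect_time(frame, n_splits=3):
    """Return True if, in every fold, no training date is later than any validation date."""
    ok = True
    for train_idx, valid_idx in TimeSeriesSplit(n_splits=n_splits).split(frame):
        latest_train = frame["Date"].iloc[train_idx].max()
        earliest_valid = frame["Date"].iloc[valid_idx].min()
        ok = ok and (latest_train <= earliest_valid)
    return ok

# Same rows, two different orderings.
sorted_by_series = folds_respect_time(toy)
sorted_by_date = folds_respect_time(toy.sort_values("Date").reset_index(drop=True))
print(sorted_by_series, sorted_by_date)
```

With the series-first ordering, one fold trains on all of series `A` and validates on the earliest weeks of series `B`, so the validation dates are not in the future relative to the training dates.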


### RandomForestRegressor

A random forest regressor is also tried, but it has the lowest test $R^2$ of the three models.

In [24]:
from sklearn.ensemble import RandomForestRegressor

pipe_rfr = make_pipeline(
    preprocessor,
    RandomForestRegressor(n_jobs = -1, random_state = 123)
)

search_rfr = GridSearchCV(pipe_rfr, n_jobs = -1, cv = TimeSeriesSplit(), return_train_score = True, param_grid = {
    'randomforestregressor__n_estimators': [25, 50, 75, 100],
    'randomforestregressor__max_depth': [5, 10, 20],
})

search_rfr.fit(X_train, y_train)

pd.DataFrame(search_rfr.cv_results_).filter(regex = '^((?!split).)*$')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_randomforestregressor__max_depth,param_randomforestregressor__n_estimators,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
0,0.681523,0.413588,0.047839,0.027316,5,25,"{'randomforestregressor__max_depth': 5, 'rando...",0.839295,0.023231,12,0.865195,0.0073
1,1.433677,0.800441,0.029151,0.011601,5,50,"{'randomforestregressor__max_depth': 5, 'rando...",0.840146,0.022951,11,0.865439,0.007615
2,2.445781,1.427057,0.042865,0.015332,5,75,"{'randomforestregressor__max_depth': 5, 'rando...",0.840173,0.022988,10,0.865612,0.007664
3,3.529075,1.788799,0.050978,0.011988,5,100,"{'randomforestregressor__max_depth': 5, 'rando...",0.84046,0.022427,9,0.865735,0.007714
4,2.664032,1.75913,0.021744,0.010057,10,25,"{'randomforestregressor__max_depth': 10, 'rand...",0.845423,0.024842,4,0.936785,0.00968
5,4.381915,2.788671,0.02573,0.003189,10,50,"{'randomforestregressor__max_depth': 10, 'rand...",0.846218,0.025002,2,0.937816,0.010055
6,6.196236,4.038729,0.035042,0.009741,10,75,"{'randomforestregressor__max_depth': 10, 'rand...",0.845931,0.025024,3,0.938209,0.010166
7,7.998486,4.980297,0.039601,0.006068,10,100,"{'randomforestregressor__max_depth': 10, 'rand...",0.846267,0.024708,1,0.938336,0.010206
8,7.296848,5.835468,0.017276,0.00223,20,25,"{'randomforestregressor__max_depth': 20, 'rand...",0.841771,0.026138,8,0.978702,0.001328
9,13.057718,10.394326,0.043557,0.028541,20,50,"{'randomforestregressor__max_depth': 20, 'rand...",0.843478,0.027012,7,0.980149,0.001349


In [25]:
print(f'R^2: {r2_score(y_test, search_rfr.predict(X_test))}')

R^2: 0.7714265423288609


### XGBoost

Finally, a gradient-boosted tree model is tested as an extension of the random forest approach. Rather than averaging many independent decision trees, XGBoost builds trees sequentially, each one fit to reduce the errors of its predecessors.

The model performs better than the random forest on the test set, but still slightly worse than Ridge.

In [26]:
from xgboost.sklearn import XGBRegressor

pipe_xgboost = make_pipeline(
    preprocessor,
    XGBRegressor(n_jobs = -1)
)
pipe_xgboost.fit(X_train, y_train)

search_xgb = GridSearchCV(pipe_xgboost, n_jobs = -1, cv = TimeSeriesSplit(), return_train_score = True, param_grid = {
    'xgbregressor__n_estimators': [25, 50, 75, 100],
    'xgbregressor__max_depth': [5, 10, 20],
    'xgbregressor__learning_rate': [0.1, 0.15, 0.2, 0.25, 0.3],
    
})

search_xgb.fit(X_train, y_train)

pd.DataFrame(search_xgb.cv_results_).filter(regex = '^((?!split).)*$').head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgbregressor__learning_rate,param_xgbregressor__max_depth,param_xgbregressor__n_estimators,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
0,0.30912,0.103781,0.008169,0.001458,0.1,5,25,"{'xgbregressor__learning_rate': 0.1, 'xgbregre...",0.807467,0.014104,54,0.855563,0.005601
1,0.759746,0.342692,0.009091,0.001556,0.1,5,50,"{'xgbregressor__learning_rate': 0.1, 'xgbregre...",0.843826,0.021777,2,0.915517,0.012584
2,1.233059,0.393166,0.014744,0.004088,0.1,5,75,"{'xgbregressor__learning_rate': 0.1, 'xgbregre...",0.843915,0.023407,1,0.928387,0.013668
3,1.61271,0.541726,0.01428,0.006437,0.1,5,100,"{'xgbregressor__learning_rate': 0.1, 'xgbregre...",0.841566,0.024221,4,0.937504,0.013789
4,0.900887,0.329,0.019486,0.013021,0.1,10,25,"{'xgbregressor__learning_rate': 0.1, 'xgbregre...",0.796428,0.017664,59,0.914681,0.004439


In [27]:
print(f'R^2: {r2_score(y_test, search_xgb.predict(X_test))}')

R^2: 0.800328179610198


<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: very short answer questions

Each question is worth 2 points.

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:4}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.

<div class="alert alert-warning">

Solution_2.1
    
</div>

1. Stock trades are not equally spaced in time: people may buy and sell at any moment while the market is open, so the gap between consecutive trades varies.
2. Lagged features would struggle the most. A lag assumes that the previous row is a fixed T units of time in the past. If the spacing is inconsistent, that assumption breaks: the same lag covers a different time difference from row to row, and multiple lag features will not all represent the same multiple of T. Encoding the date as features does not depend on the spacing between rows.
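
The effect on lag features can be seen on a small made-up example (`readings` and its values are assumptions): a one-row lag always copies the previous reading, but the age of that reading changes from row to row.

```python
import pandas as pd

# Toy readings taken at irregular times (hypothetical data).
readings = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-10", "2020-01-11"]),
    "value": [1.0, 2.0, 3.0, 4.0],
})

# A one-row lag copies the previous reading, whatever its age.
readings["value_lag1"] = readings["value"].shift(1)

# The actual time distance covered by that "lag 1" differs from row to row.
readings["lag1_age_days"] = readings["time"].diff().dt.days

print(readings)
```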

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Survival analysis
rubric={points:6}

The following questions pertain to Lecture 21 on survival analysis. We'll consider the use case of customer churn analysis.

1. What is the problem with simply labeling customers as "churned" or "not churned" and using standard supervised learning techniques?
2. Consider customer A who just joined last week vs. customer B who has been with the service for a year. Who do you expect will leave the service first: probably customer A, probably customer B, or we don't have enough information to answer?
3. If a customer's survival function is almost flat during a certain period, how do we interpret that?

<div class="alert alert-warning">

Solution_2.2
    
</div>

1. The data is right-censored: not all customers have completed their tenure, and we expect every customer to eventually stop using the service. If customers are simply labeled "churned" or "not churned", customers who have not churned yet are treated the same as customers who never will, which inflates the "not churned" class (more false negatives).
2. We do not have enough data to answer this question. What kind of service does the company provide? Is it a service that is only used for 1 year? What is the context of customer A and B? Is customer A trialing the product? If the service is a subscription model, did customer B forget to unsubscribe? Depending on the service, how active are customer A and B? Many more questions may need to be answered.
3. The customer's survival probability barely changes over that period, which means the customer is very unlikely to churn during it. It does not mean the customer will never churn, only that the risk is low within that window.
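
To make the censoring point concrete, here is a minimal hand-rolled Kaplan-Meier sketch on made-up churn data. The frame `toy` and the helper `kaplan_meier` are assumptions for illustration, not the lecture's implementation. Customers with `observed=False` were still subscribed when the data was collected.

```python
import pandas as pd

# Toy churn data (hypothetical): tenure in months and whether churn was observed.
toy = pd.DataFrame({
    "tenure":   [1, 2, 2, 3, 5, 5],
    "observed": [True, True, False, True, False, False],
})

def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of S(t) at each time where a churn event was observed."""
    survival = 1.0
    estimate = {}
    for t in sorted(durations[observed].unique()):
        at_risk = (durations >= t).sum()              # still subscribed just before t
        events = ((durations == t) & observed).sum()  # churned exactly at t
        survival *= 1 - events / at_risk
        estimate[int(t)] = float(survival)
    return pd.Series(estimate, name="survival")

km = kaplan_meier(toy["tenure"], toy["observed"])

# Naive view: fraction of customers without an observed churn, ignoring tenure.
naive_not_churned = 1 - toy["observed"].mean()

print(km)
print(naive_not_churned)
```

Censored customers leave the at-risk count once their tenure ends but never count as events, which is how the estimator uses them without pretending they will never churn.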

<!-- END QUESTION -->

<br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
4. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

![](img/eva-well-done.png)