In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw8.ipynb")

# CPSC 330 - Applied Machine Learning 

## Homework 8: Time series
**Due date: See the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score

from sklearn.model_selection import (
    TimeSeriesSplit,
    cross_val_score,
    cross_validate,
    train_test_split,
)

## Submission instructions
<hr>
rubric={points:4}

You will receive marks for correctly submitting this assignment. To submit this assignment, follow the instructions below:

- **You may work on this assignment in a group (group size <= 4) and submit your assignment as a group.** 
- Below are some instructions on working as a group.  
    - The maximum group size is 4. 
    - You can choose your own group members. 
    - Use group work as an opportunity to collaborate and learn new things from each other. 
    - Be respectful to each other and make sure you understand all the concepts in the assignment well. 
    - It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. [Here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members) are some instructions on adding group members in Gradescope.  
- Upload the .ipynb file to Gradescope.
- **If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a pdf or html in addition to the .ipynb.** 
- Make sure that your plots/output are rendered properly in Gradescope.

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset. We will be forecasting the average avocado price for the next week. 

In [3]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [4]:
df.shape

(18249, 13)

In [5]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [6]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of 3.25 years or so. Let's split the data so that we have 6 months of test data.

In [7]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [8]:
assert len(df_train) + len(df_test) == len(df)

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series? 
rubric={points:4}

In the Rain in Australia dataset from lecture, we had different measurements for each Location. What about this dataset: for which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

As in the Rain in Australia dataset, there are separate measurements for each location, here given by `region`: how many avocados sell, and at what price, depends on where they are sold. The tables below show that the regions cover the same date range in the training data, and that every region shown has both conventional and organic avocados.

There are also separate measurements for each `type`. There are two types, conventional and organic, and they are likely to sell differently from each other. So each (`region`, `type`) pair has its own time series.

I think we should one-hot encode both of these features.

In [9]:
list_cities = df_train['region'].unique()
cities_dict = {}
time_dict = {}
list_cities

array(['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
       'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
       'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
       'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
       'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
       'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
       'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
       'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
       'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
       'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
       'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
       'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
       'Tampa', 'TotalUS', 'West', 'WestTexNewMexico'], dtype=object)

In [10]:
list_types = df_train['type'].unique()
list_types

array(['conventional', 'organic'], dtype=object)

In [11]:
for city in list_cities:
    city_df = df_train.loc[df_train['region'] == city]
    min_date = city_df['Date'].min()
    max_date = city_df['Date'].max()
    min_max_arr = [min_date, max_date]
    time_dict[city] = min_max_arr

In [12]:
time_df = pd.DataFrame(time_dict).T
time_df.rename(columns={0: 'Earliest Date', 1: 'Latest Date'}, inplace=True)
time_df

Unnamed: 0,Earliest Date,Latest Date
Albany,2015-01-04,2017-09-24
Atlanta,2015-01-04,2017-09-24
BaltimoreWashington,2015-01-04,2017-09-24
Boise,2015-01-04,2017-09-24
Boston,2015-01-04,2017-09-24
BuffaloRochester,2015-01-04,2017-09-24
California,2015-01-04,2017-09-24
Charlotte,2015-01-04,2017-09-24
Chicago,2015-01-04,2017-09-24
CincinnatiDayton,2015-01-04,2017-09-24


In [13]:
for city in list_cities:
    city_df = df_train.loc[df_train['region'] == city]
    types = city_df['type'].unique()
    cities_dict[city] = types

In [14]:
cities_df = pd.DataFrame(cities_dict).T
cities_df

Unnamed: 0,0,1
Albany,conventional,organic
Atlanta,conventional,organic
BaltimoreWashington,conventional,organic
Boise,conventional,organic
Boston,conventional,organic
BuffaloRochester,conventional,organic
California,conventional,organic
Charlotte,conventional,organic
Chicago,conventional,organic
CincinnatiDayton,conventional,organic
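
One way to make the count of time series explicit is to group by both categorical columns at once. The sketch below uses a small made-up frame (`toy`, with hypothetical values) rather than the real avocado data; on the real data the same `groupby` call could be applied to `df_train`.

```python
import pandas as pd

# Toy stand-in for the avocado data (hypothetical values, not the real dataset).
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-04", "2015-01-11"] * 4),
        "region": ["Albany"] * 4 + ["Boston"] * 4,
        "type": (["conventional"] * 2 + ["organic"] * 2) * 2,
        "AveragePrice": [1.22, 1.24, 1.79, 1.77, 1.02, 1.10, 1.83, 1.94],
    }
)

# Each (region, type) pair is treated as its own time series.
series_sizes = toy.groupby(["region", "type"]).size()
n_series = len(series_sizes)

# Within a series, each date should appear at most once.
one_row_per_date = not toy.duplicated(subset=["region", "type", "Date"]).any()

print(series_sizes)
print("number of series:", n_series)
```

If `one_row_per_date` were `False`, some other column would also be distinguishing measurements, and the grouping would need to include it.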


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements? 
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

In [15]:
time_df

Unnamed: 0,Earliest Date,Latest Date
Albany,2015-01-04,2017-09-24
Atlanta,2015-01-04,2017-09-24
BaltimoreWashington,2015-01-04,2017-09-24
Boise,2015-01-04,2017-09-24
Boston,2015-01-04,2017-09-24
BuffaloRochester,2015-01-04,2017-09-24
California,2015-01-04,2017-09-24
Charlotte,2015-01-04,2017-09-24
Chicago,2015-01-04,2017-09-24
CincinnatiDayton,2015-01-04,2017-09-24


In [16]:
len(df_train['Date'].unique())

143

In [17]:
time_len_dict = {}
for city in list_cities:
    city_df = df_train.loc[df_train['region'] == city]
    amount_dates = [len(city_df['Date'].unique())]
    time_len_dict[city] = amount_dates

time_len_df = pd.DataFrame(time_len_dict).T
time_len_df

Unnamed: 0,0
Albany,143
Atlanta,143
BaltimoreWashington,143
Boise,143
Boston,143
BuffaloRochester,143
California,143
Charlotte,143
Chicago,143
CincinnatiDayton,143


In [18]:
def check_dates(df, time_list, city_list):
    time_list.sort()
    for city in city_list:
        city_df = df.loc[df['region'] == city]
        curr_time_list = city_df['Date'].unique()
        curr_time_list.sort()
        for i in range(len(curr_time_list)):
            if curr_time_list[i] != time_list[i]:
                return False
    
    return True

In [19]:
time_list = df_train['Date'].unique()
assert check_dates(df_train, time_list, list_cities) == True

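I believe the measurements in this dataset are equally spaced. Every region shown has the same number of unique dates in the training set (143), and `check_dates` above confirms that the dates match across all regions. The rows displayed earlier are one week apart, which is consistent with weekly measurements.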
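
The checks above compare the dates across regions; they do not look at the gaps between consecutive dates. A more direct check, sketched here on a made-up frame (`toy`, hypothetical values), is to compute the gap between consecutive measurements within each series and count the distinct gap sizes.

```python
import pandas as pd

# Toy stand-in (hypothetical values): Albany is weekly, Boston skips a week.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2015-01-04", "2015-01-11", "2015-01-18", "2015-01-04", "2015-01-18"]
        ),
        "region": ["Albany"] * 3 + ["Boston"] * 2,
        "type": ["conventional"] * 5,
    }
)

toy = toy.sort_values(["region", "type", "Date"])

# Gap between consecutive measurements within each (region, type) series.
gaps = toy.groupby(["region", "type"])["Date"].diff().dropna()

# Equal spacing would show up as exactly one distinct gap value.
gap_counts = gaps.value_counts()
print(gap_counts)
```

On equally spaced data, `gaps.nunique()` would be 1; here the skipped week in the toy Boston series produces a second gap size.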

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions 
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

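There appear to be overlapping regions. For instance, rows with region 'TotalUS' or 'West' will overlap with much of the rest of the data, so the regions are not all distinct. Similarly, 'California' would cover cities that are also listed separately, such as 'LosAngeles', 'SanDiego', 'SanFrancisco' and 'Sacramento'.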

In [20]:
list_cities

array(['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
       'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
       'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
       'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
       'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
       'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
       'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
       'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
       'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
       'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
       'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
       'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
       'Tampa', 'TotalUS', 'West', 'WestTexNewMexico'], dtype=object)

In [21]:
...

Ellipsis

In [22]:
...

Ellipsis

<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from Lecture 20, with some improvements.

In [23]:
def create_lag_feature(df, orig_feature, lag, groupby, new_feature_name=None, clip=False):
    """
    Creates a new feature that's a lagged version of an existing one.
    
    NOTE: assumes df is already sorted by the time columns and has unique indices.
    
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        The dataset.
    orig_feature : str
        The column name of the feature we're copying
    lag : int
        The lag; negative lag means values from the past, positive lag means values from the future
    groupby : list
        Column(s) to group by in case df contains multiple time series
    new_feature_name : str
        Override the default name of the newly created column
    clip : bool
        If True, remove rows with NaN values for the new feature
    
    Returns
    -------
    pandas.core.frame.DataFrame
        A new dataframe with the additional column added.
        
    TODO: could/should simplify this function by using `df.shift()`
    """
        
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = "%s_lag%d" % (orig_feature, -lag)
        else:
            new_feature_name = "%s_ahead%d" % (orig_feature, lag)
    
    new_df = df.assign(**{new_feature_name : np.nan})
    for name, group in new_df.groupby(groupby):        
        if lag < 0: # take values from the past
            new_df.loc[group.index[-lag:],new_feature_name] = group.iloc[:lag][orig_feature].values
        else:       # take values from the future
            new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][orig_feature].values
            
    if clip:
        new_df = new_df.dropna(subset=[new_feature_name])
        
    return new_df

We first sort our dataframe properly:

In [24]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.
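
The TODO in the docstring mentions `df.shift()`. As a minimal sketch of that idea (not the course's implementation), the same "next week" column can be built with `groupby(...).shift(-1)`, assuming the frame is already sorted by date within each group. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in (hypothetical values), already sorted by region, type, Date.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18"] * 2),
        "region": ["Albany"] * 3 + ["Boston"] * 3,
        "type": ["conventional"] * 6,
        "AveragePrice": [1.22, 1.24, 1.17, 1.02, 1.10, 1.05],
    }
)

# shift(-1) within each group pulls next week's price onto the current row.
toy["AveragePriceNextWeek"] = toy.groupby(["region", "type"])["AveragePrice"].shift(-1)

# Equivalent of clip=True: drop the last row of each series, which has no target.
toy_hastarget = toy.dropna(subset=["AveragePriceNextWeek"])
print(toy_hastarget)
```

Grouping before shifting matters: without it, the last Albany row would receive the first Boston price as its target.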

In [25]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,1.56


Our goal is to predict `AveragePriceNextWeek`. 

Let's split the data:

In [26]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline 
rubric={points:4}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of "AveragePriceNextWeek" exactly equal to "AveragePrice", assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good. 

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

I got 0.83 $R^2$ score for my training data and 0.76 $R^2$ score for my test data.

In [27]:
X_train = df_train.drop(columns='AveragePriceNextWeek')
y_train = df_train['AveragePriceNextWeek']
X_test = df_test.drop(columns='AveragePriceNextWeek')
y_test = df_test['AveragePriceNextWeek']

In [28]:
pred_avg_price_list = X_train['AveragePrice'].tolist()
real_avg_price_list = y_train.tolist()

In [29]:
pred_avg_price_list1 = X_test['AveragePrice'].tolist()
real_avg_price_list1 = y_test.tolist()

In [30]:
r2 = r2_score(real_avg_price_list, pred_avg_price_list)
r2_test = r2_score(real_avg_price_list1, pred_avg_price_list1)
print("Our baseline training R2 score is " + str(r2))
print("Our baseline test R2 score is " + str(r2_test))

Our baseline training R2 score is 0.8285800937261841
Our baseline test R2 score is 0.7631780188583048
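
To make explicit what this baseline score measures, here is a small sketch on a made-up price series (hypothetical values). It computes the "no change" baseline with `r2_score` and again by hand from the definition $R^2 = 1 - SS_{res}/SS_{tot}$.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical weekly prices for a single series (not the real dataset).
prices = np.array([1.22, 1.24, 1.17, 1.06, 0.99, 0.99, 1.10])

this_week = prices[:-1]  # baseline prediction: "no change"
next_week = prices[1:]   # what actually happened

baseline_r2 = r2_score(next_week, this_week)

# R^2 compares the squared error against always predicting the target's mean.
ss_res = np.sum((next_week - this_week) ** 2)
ss_tot = np.sum((next_week - next_week.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(baseline_r2, manual_r2)
```

Because prices change slowly from week to week, last week's price is a much better guess than the overall mean, which is why this simple baseline is hard to beat.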


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.

<div class="alert alert-warning">

Solution_1.5
    
</div>

In [31]:
numeric_features = ['AveragePrice', 'Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']
categorical_features = ['type', 'region']
drop_features = ['year', 'Date']

In [32]:
from sklearn.compose import ColumnTransformer, make_column_transformer
# Note: `sparse_output` replaced the `sparse` argument in scikit-learn 1.2.
preprocessor = make_column_transformer((StandardScaler(), numeric_features),
                                       (OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features),
                                       ('drop', drop_features))

In [33]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.ensemble import RandomForestRegressor

rf_pipe = make_pipeline(preprocessor, RandomForestRegressor(n_estimators = 100))
rf_pipe.fit(X_train, y_train)

In [34]:
scores = cross_validate(
    rf_pipe, X_train, y_train, cv=TimeSeriesSplit(), return_train_score=True
)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,1.281129,0.036766,0.77956,0.976956
1,2.958475,0.04688,0.785192,0.976502
2,5.047254,0.046871,0.86516,0.97602
3,7.465964,0.040003,0.821631,0.978264
4,10.643481,0.046871,0.828597,0.978516


In [35]:
r_pipe = make_pipeline(preprocessor, Ridge())
r_pipe.fit(X_train, y_train)

In [36]:
scores = cross_validate(
    r_pipe, X_train, y_train, cv=TimeSeriesSplit(), return_train_score=True
)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.015621,0.0,0.79658,0.838689
1,0.015627,0.0,0.804751,0.828481
2,0.015619,0.0,0.866903,0.82772
3,0.015584,0.0,0.839653,0.842096
4,0.015623,0.0,0.764186,0.843689


In [37]:
#taken from lecture 20
X_train_month = X_train.assign(
    Month=X_train["Date"].apply(lambda x: x.month_name())
)  # x.month_name() to get the actual string
X_test_month = X_test.assign(Month=X_test["Date"].apply(lambda x: x.month_name()))

In [38]:
# Note: `sparse_output` replaced the `sparse` argument in scikit-learn 1.2.
preprocessor_with_month = make_column_transformer((StandardScaler(), numeric_features),
                                       (OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features + ["Month"]),
                                       ('drop', drop_features))

In [39]:
r_pipe_with_month = make_pipeline(preprocessor_with_month, Ridge())
r_pipe_with_month.fit(X_train_month, y_train)
scores = cross_validate(
    r_pipe_with_month, X_train_month, y_train, cv=TimeSeriesSplit(), return_train_score=True
)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.01562,0.015622,0.796418,0.842274
1,0.026995,0.006981,0.80725,0.832252
2,0.015623,0.015621,0.865785,0.832493
3,0.015621,0.01562,0.842087,0.846083
4,0.031242,0.015618,0.769989,0.847829


In [40]:
rf_pipe_with_month = make_pipeline(preprocessor_with_month, RandomForestRegressor(n_estimators=100))
rf_pipe_with_month.fit(X_train_month, y_train)
scores = cross_validate(
    rf_pipe_with_month, X_train_month, y_train, cv=TimeSeriesSplit(), return_train_score=True
)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,1.43056,0.031278,0.78049,0.97663
1,3.371127,0.046878,0.791637,0.97652
2,5.581868,0.046878,0.86679,0.976665
3,8.145216,0.046869,0.823572,0.978927
4,11.290151,0.046863,0.836994,0.978873
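
One caveat about the cross-validation cells above: `TimeSeriesSplit` splits by row position, so it only respects time order if the rows are sorted by date. Here `X_train` comes from `df_sort`, which is ordered by region, then type, then date, so the folds may be splitting largely by region rather than by time. The sketch below illustrates the difference on a made-up frame; `toy` and the helper `leaks_future` are hypothetical names, not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in (hypothetical): 2 regions x 4 weeks, sorted by region first.
dates = pd.date_range("2015-01-04", periods=4, freq="7D")
toy = pd.DataFrame(
    {
        "region": ["Albany"] * 4 + ["Boston"] * 4,
        "Date": list(dates) * 2,
    }
)

def leaks_future(frame, n_splits=3):
    """Return True if any fold trains on dates later than its earliest validation date."""
    for train_idx, valid_idx in TimeSeriesSplit(n_splits=n_splits).split(frame):
        if frame["Date"].iloc[train_idx].max() > frame["Date"].iloc[valid_idx].min():
            return True
    return False

leak_region_sorted = leaks_future(toy)

# Re-sort by Date; a stable sort keeps the regions in a fixed order within a date.
toy_by_date = toy.sort_values("Date", kind="mergesort").reset_index(drop=True)
leak_date_sorted = leaks_future(toy_by_date)

print(leak_region_sorted, leak_date_sorted)
```

On the real data, sorting `X_train` and `y_train` by `Date` before calling `cross_validate` would be one way to get folds that train on the past and validate on the future.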


In [41]:
rf_pipe.score(X_test, y_test)

0.7540115581179774

In [42]:
r_pipe.score(X_test, y_test)

0.8018393461329533

In [43]:
r_pipe_with_month.score(X_test_month, y_test)

0.805145331368215

In [44]:
rf_pipe_with_month.score(X_test_month, y_test)

0.7637513954397476

To preprocess, I split the columns into three groups: numeric, categorical, and drop. I applied `StandardScaler` to the numeric features and one-hot encoding to the categorical features. One-hot encoding suits this problem because the categorical features are unordered labels, such as the region or the type of avocado, so the model can learn a separate effect for each one. I dropped `year` because I felt it would be irrelevant for predicting future results.

I tried two types of models: linear regression (`Ridge`) and random forest regression. To my surprise, the random forest performed quite poorly: its test score was below the baseline without the month feature (0.754 vs. 0.763) and only matched the baseline with it (0.764). Ridge did better (0.802) but still wasn't excellent. Adding a one-hot encoded `Month` feature slightly improved both models' test scores. I expected the month to be useful because avocados may be more popular in certain months than others, and the model could use that seasonality to make predictions. It didn't help as much as I had hoped, but the best approach was Ridge with the month feature, with a test $R^2$ of 0.805.

In the end, my final test $R^2$ scores were:

Linear Regression: 0.802

Random Forest: 0.754

Linear Regression with Month: 0.805

Random Forest with Month: 0.764
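
Besides a one-hot encoded month, there are other date encodings that could be tried. The sketch below shows two that were not used in this notebook, on made-up dates: a numeric "days since start" feature for trend, and a sine/cosine encoding of the month so that December and January end up close together.

```python
import numpy as np
import pandas as pd

# Hypothetical dates for illustration.
dates = pd.Series(pd.to_datetime(["2015-01-04", "2015-06-28", "2015-12-27"]))

encoded = pd.DataFrame(
    {
        # Numeric trend: days since the first observation.
        "days_since_start": (dates - dates.min()).dt.days,
        # Cyclical month encoding: months are placed on a circle.
        "month_sin": np.sin(2 * np.pi * (dates.dt.month - 1) / 12),
        "month_cos": np.cos(2 * np.pi * (dates.dt.month - 1) / 12),
    }
)
print(encoded)
```

One thing to keep in mind with "days since start" is that test dates lie beyond the training range, so tree-based models cannot extrapolate along that feature.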

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: very short answer questions

Each question is worth 2 points.

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:4}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.

<div class="alert alert-warning">

Solution_2.1
    
</div>

1. A time series could have unequally spaced time points when the data records events that happen at irregular times, such as earthquakes.
2. I think creating lagged versions of the features would struggle, while encoding the date should still be OK. Encoding the date still gives useful features such as the month or season, because these describe each measurement on its own regardless of the spacing. A lag feature holds the value at the previous data point, which is less informative when we don't know how long ago that was: it could have been a long time ago or very recent, and the model cannot tell the difference.
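
As a small illustration of the second point, the sketch below builds a lag feature on made-up, unequally spaced measurements. The lagged value is simply the previous row, whether that row is one day or several weeks old.

```python
import pandas as pd

# Hypothetical, unequally spaced measurements.
events = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-03-15"]),
        "value": [10.0, 12.0, 7.0],
    }
)

# shift(1) takes the previous row, regardless of how long ago it was measured.
events["value_lag1"] = events["value"].shift(1)

# Recording the gap alongside the lag makes "how long ago" explicit.
events["days_since_prev"] = events["Date"].diff().dt.days
print(events)
```

Adding a gap column like `days_since_prev` is one possible way to give the model the missing context; it is an assumption of this sketch, not something covered in the answer above.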

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Survival analysis
rubric={points:6}

The following questions pertain to Lecture 21 on survival analysis. We'll consider the use case of customer churn analysis.

1. What is the problem with simply labeling customers as "churned" or "not churned" and using standard supervised learning techniques?
2. Consider customer A who just joined last week vs. customer B who has been with the service for a year. Who do you expect will leave the service first: probably customer A, probably customer B, or we don't have enough information to answer?
3. If a customer's survival function is almost flat during a certain period, how do we interpret that?

<div class="alert alert-warning">

Solution_2.2
    
</div>

1. The major issue is that we do not have correct target values to train our model on. Some customers haven't churned yet, and we don't know whether they will churn soon or a long time from now, so their data is censored. You could assume everyone churns at the end of the dataset, or only look at the customers who did churn, but both approaches would underestimate how long customers stay.
2. I would expect the customer who has been with the company for a year to churn first. Typically the longer someone is with a company, the more likely they will be to leave. But there are other factors, such as contract type and length or payment method, that could be good indicators. So if I had to guess, it would be the customer who has been there for a year, but a deeper analysis of other features could show that the other customer is more likely to leave first.
3. It means the customer's survival probability barely drops during that period, so they are very unlikely to churn in it. This is ideal if the customer is already at a high survival probability, because they are not becoming more likely to leave.
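
The underestimation described in the first answer can be seen in a small simulation. The sketch below uses entirely made-up customers: each has a true tenure, but we only observe them for a limited time, so some churn dates are censored. The distributions and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated (hypothetical) customers: true tenure and how long each was observed.
true_tenure = rng.exponential(scale=10.0, size=n)
observed_for = rng.uniform(0.0, 15.0, size=n)

churned = true_tenure <= observed_for             # we saw the churn happen
duration = np.minimum(true_tenure, observed_for)  # what the dataset records

mean_true = true_tenure.mean()
mean_churned_only = duration[churned].mean()  # drop the censored customers
mean_all_as_churned = duration.mean()         # pretend everyone churned at the end

print(mean_true, mean_churned_only, mean_all_as_churned)
```

Both naive averages come out below the true mean tenure, which is the motivation for survival analysis methods that handle censoring explicitly.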

<!-- END QUESTION -->

<br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
4. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

![](img/eva-well-done.png)