# CPSC 330 - Applied Machine Learning

## Homework 8: Introduction to Computer Vision and Time Series

**Due date: see the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [1]:
from hashlib import sha1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
## Instructions
rubric={points}

You will earn points for following these instructions and successfully submitting your work on Gradescope.  

### Group work instructions

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2.
  
- Use group work as an opportunity to collaborate and learn new things from each other.
- Be respectful to each other and make sure you understand all the concepts in the assignment well.
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline.
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).
- If you would like to use late tokens for the homework, all group members must have the necessary late tokens available. Please note that the late tokens will be counted for all members of the group.   


### General submission instructions

- Please **carefully read the [Use of Generative AI policy](https://ubc-cs.github.io/cpsc330-2025W1/syllabus.html#use-of-generative-ai-in-the-course)** before starting the homework assignment.
- **Run all cells before submitting:** Go to `Kernel -> Restart Kernel and Clear All Outputs`, then select `Run -> Run All Cells`. This ensures your notebook runs cleanly from start to finish without errors.
  
- **Submit your files on Gradescope.**  
   - Upload only your `.ipynb` file **with outputs displayed** and any required output files.
     
   - Do **not** submit other files from your repository.  
   - If you need help, see the [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/).  
- **Check that outputs render properly.**  
   - Make sure all plots and outputs appear in your submission.
     
   - If your `.ipynb` file is too large and doesn't render on Gradescope, also upload a PDF or HTML version so the TAs can view your work.  
- **Keep execution order clean.**  
   - Execution numbers must start at "1" and increase in order.
     
   - Notebooks without visible outputs may not be graded.  
   - Out-of-order or missing execution numbers may result in mark deductions.  
- **Follow course submission guidelines:** Review the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2025W1/docs/homework_instructions.html) for detailed guidance on completing and submitting assignments.
   
</div>

_Points:_ 2

<!-- END QUESTION -->

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset and storing it under the `data` folder. We will be forecasting the average avocado price for the next week.

In [2]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [3]:
df.shape

(18249, 13)

In [4]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [5]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018, for a total of 3.25 years or so. Let's split the data so that we have 6 months of test data.

In [6]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [7]:
assert len(df_train) + len(df_test) == len(df)

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series?
rubric={points:4}

In the [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) dataset from the lecture demo, we had different measurements for each Location.

We want you to consider this for the avocado prices dataset. For which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

In [8]:
print("We have separate measurements for each combination of region and type. For example, on December 27, 2015, Albany and Atlanta report different prices and volumes.")
print("Even on the same day and in the same region, conventional and organic avocados (the two types) are measured separately. For example, on December 27, 2015 in Albany, conventional and organic avocados have different prices and volumes.")

We have separate measurements for each combination of region and type. For example, on December 27, 2015, Albany and Atlanta report different prices and volumes.
Even on the same day and in the same region, conventional and organic avocados (the two types) are measured separately. For example, on December 27, 2015 in Albany, conventional and organic avocados have different prices and volumes.


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements?
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

In [11]:
print("Yes, this dataset has equally spaced measurements (one measurement every 7 days). In other words, measurements are taken on the same day each week, from January 2015 through March 2018.")

Yes, this dataset has equally spaced measurements (one measurement every 7 days). In other words, measurements are taken on the same day each week, from January 2015 through March 2018.
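
One way to check a claim like this empirically, rather than by eye, is to count the gaps between consecutive dates within each `region`/`type` series. The sketch below runs on a small made-up frame (not the avocado data), and the helper name `date_gap_counts` is hypothetical; on the real data one would pass `df` instead of `toy`.

```python
import pandas as pd


def date_gap_counts(frame, group_cols, date_col="Date"):
    """Count how often each gap between consecutive dates occurs within each group."""
    ordered = frame.sort_values(group_cols + [date_col])
    gaps = ordered.groupby(group_cols)[date_col].diff().dropna()
    return gaps.value_counts()


# Toy data (made up for illustration): series A is weekly, series B skips a week.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2015-01-04", "2015-01-11", "2015-01-18", "2015-01-04", "2015-01-18"]
        ),
        "region": ["A", "A", "A", "B", "B"],
        "type": ["organic"] * 5,
    }
)

counts = date_gap_counts(toy, ["region", "type"])
print(counts)
```

If every gap is `7 days`, the series are equally spaced; any other value points to a missing or extra measurement in some series.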


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

In [18]:
print("No, I don't think the regions are all distinct. For example, there are broad regions called West and Southwest, and TotalUS. However, there are also individual cities that fall into these regions.")
print("For example, every other region in the dataset falls within TotalUS. In addition, some cities fall inside the broader regions (ex. Houston is a city in Texas, so it may overlap with the larger region WestTexNewMexico).")

No, I don't think the regions are all distinct. For example, there are broad regions called West and Southwest, and TotalUS. However, there are also individual cities that fall into these regions.
For example, every other region in the dataset falls within TotalUS. In addition, some cities fall inside the broader regions (ex. Houston is a city in Texas, so it may overlap with the larger region WestTexNewMexico).


<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from the lecture.

In [21]:
import pandas as pd


def create_lag_feature(
    df: pd.DataFrame,
    orig_feature: str,
    lag: int,
    groupby: list[str],
    new_feature_name: str | None = None,
    clip: bool = False,
) -> pd.DataFrame:
    """
    Create a lagged (or ahead) version of a feature, optionally per group.

    Assumes df is already sorted by time within each group and has unique indices.

    Parameters
    ----------
    df : pd.DataFrame
        The dataset.
    orig_feature : str
        Name of the column to lag.
    lag : int
        The lag:
          - negative → values from the past (t-1, t-2, ...)
          - positive → values from the future (t+1, t+2, ...)
    groupby : list of str
        Column(s) to group by if df contains multiple time series.
    new_feature_name : str, optional
        Name of the new column. If None, a name is generated automatically.
    clip : bool, default False
        If True, drop rows where the new feature is NaN.

    Returns
    -------
    pd.DataFrame
        A new dataframe with the additional column added.
    """
    if lag == 0:
        raise ValueError("lag cannot be 0 (no shift). Use the original feature instead.")

    # Default name if not provided
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = f"{orig_feature}_lag{abs(lag)}"
        else:
            new_feature_name = f"{orig_feature}_ahead{lag}"

    df = df.copy()

    # Map your convention (negative=past, positive=future) to pandas shift
    # pandas: shift(+k) → past, shift(-k) → future
    periods = abs(lag) if lag < 0 else -lag

    df[new_feature_name] = (
        df.groupby(groupby, sort=False)[orig_feature]
          .shift(periods)
    )

    if clip:
        df = df.dropna(subset=[new_feature_name])

    return df


We first sort our dataframe properly:

In [22]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.

In [23]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,1.56


Our goal is to predict `AveragePriceNextWeek`.

Let's split the data:

In [24]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline
rubric={points}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of "AveragePriceNextWeek" exactly equal to "AveragePrice", assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good.

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

In [25]:
train_r2 = None
df_train = df_train.sort_values(['region', 'type', 'Date'], ascending=[True, True, False])
df_train['PredictedAveragePriceNextWeek'] = df_train['AveragePrice']

train_r2 = r2_score(df_train['AveragePriceNextWeek'], df_train['PredictedAveragePriceNextWeek'])
print(train_r2)

0.8285800937261841


In [26]:
test_r2 = None

df_test = df_test.sort_values(['region', 'type', 'Date'], ascending=[True, True, False])
df_test['PredictedAveragePriceNextWeek'] = df_test['AveragePrice']

test_r2 = r2_score(df_test['AveragePriceNextWeek'], df_test['PredictedAveragePriceNextWeek'])

print(test_r2)

0.7631780188583048
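
To make the baseline concrete: $R^2$ compares the squared error of the predictions against the squared error of always predicting the mean of the targets. The sketch below computes it by hand on made-up prices (not the avocado data), using this week's price as the prediction for next week. `r2_by_hand` is a hypothetical helper that is meant to mirror the quantity `r2_score` reports.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up weekly prices for a single series.
prices = [1.00, 1.10, 1.20, 1.30, 1.40]

this_week = prices[:-1]  # baseline prediction: "no change"
next_week = prices[1:]   # target: the following week's price

baseline_r2 = r2_by_hand(next_week, this_week)
print(round(baseline_r2, 3))
```

Because prices usually move slowly from one week to the next, the "no change" prediction tends to have a small residual, which is why this baseline can be hard to beat.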


In [28]:
assert not train_r2 is None, "Are you using the correct variable name?"
assert not test_r2 is None, "Are you using the correct variable name?"
assert sha1(str(round(train_r2, 3)).encode('utf8')).hexdigest() == 'b1136fe2a8918904393ab6f40bfb3f38eac5fc39', "Your training score is not correct. Are you using the right features?"
assert sha1(str(round(test_r2, 3)).encode('utf8')).hexdigest() == 'cc24d9a9b567b491a56b42f7adc582f2eefa5907', "Your test score is not correct. Are you using the right features?"

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 10

For the model, I am using Ridge. Therefore, I will not be using POSIX encoding, because it doesn't capture the cyclical property of months/weeks/days. For the first approach, I tried encoding the dates as the week of the year (weeks 1-52), which gave an $R^2$ score of 0.78759. I scaled all numerical features except `AveragePrice` and `week`, since each of the remaining columns has a large range of values.

I also tried making month*day interactions, and that gave me 0.78659

Making month*year interactions gave me 0.75766

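The lag experiment gave 0.795, which is better than the previous scores. My intent was for the lag feature to help Ridge "remember" the previous value and make linear predictions based on it. Note, however, that the cell for this experiment never passes the lagged frame `lag_1` to the model, so the 0.795 actually comes from `region`, `type`, `year`, the scaled volume columns, and `AveragePrice`, with no date encoding and no lag column. The improvement therefore cannot be attributed to the lag feature.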

While I did get 0.795 using methods taught in lecture, I wanted to explore beyond lecture concepts and experiment with sin and cos (see below). The method below is what I ended up using for my final answer.

Lastly, while this was not covered in lecture, because time is a cyclical feature (it goes in a circle), I figured it would have the properties of a circle as well, so maybe sin and cos could apply here. If we can turn the integers into angles, this might make them more compatible with Ridge (ex. with 12 months, each month spans 360/12 = 30 degrees, so January is 30 degrees and December is 360 degrees, which wraps back around to 0). To turn the months and weeks into fractions of a cycle so we can calculate cos/sin, we divide month by 12 and week by 52. I got stuck on how to turn this into an angle, so I asked ChatGPT how to convert my fractional months and weeks into an angle, and it said "Every point on a circle can be represented by x and y coordinates:
$x = \cos(\theta)$, $y = \sin(\theta)$
Where $\theta$ is the angle representing the position in the cycle.
For months: $\theta = 2\pi \cdot \text{month}/12$
For weeks of the year: $\theta = 2\pi \cdot \text{week}/52$". I put that formula into my code, and nothing else changed. This gave me an $R^2$ score of 0.80.

I made a table of the sin and cos values for the weeks and months using the formula given above. I realized that some months had the same sin or cos value (ex. Jan and May both had 0.5 in the sin table). Therefore, I used both sin and cos, because using only one of them makes different months indistinguishable, and indeed when I removed cos and only used sin, my score went down to 0.79.

Effect on Ridge: Ridge has no way of knowing that January of this year and December of last year are actually very close to each other, because their raw numbers are 1 and 12 respectively. With sin and cos, January sits right next to December on the circle, and the distance between any two months reflects how far apart they are in the cycle. This means that Ridge is now able to "understand" the cyclical nature of time, similar to how we do when we convert the months into degrees.

While my entire explanation uses months, the exact same logic applies for weeks 1-52.
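
The two properties described above (sin alone is ambiguous, and December ends up next to January) can be checked with a short standard-library sketch. The helper names `cyclical_encode` and `circle_distance` are made up for illustration and are not part of the notebook.

```python
import math


def cyclical_encode(value, period):
    """Map a position in a cycle to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a, b, period):
    """Euclidean distance between two encoded positions."""
    sin_a, cos_a = cyclical_encode(a, period)
    sin_b, cos_b = cyclical_encode(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# January (1) and May (5) share the same sine, so sine alone cannot tell them apart.
jan_sin, jan_cos = cyclical_encode(1, 12)
may_sin, may_cos = cyclical_encode(5, 12)
print(round(jan_sin, 3), round(may_sin, 3))  # both 0.5
print(round(jan_cos, 3), round(may_cos, 3))  # 0.866 and -0.866

# December -> January is exactly as close as any other pair of neighbouring months.
dec_jan = circle_distance(12, 1, 12)
jun_jul = circle_distance(6, 7, 12)
print(round(dec_jan, 3) == round(jun_jul, 3))
```

The cosine is what separates January from May here, which is consistent with the drop in score when only the sine was kept.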

In [29]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

df_hastarget['week'] = df_hastarget['Date'].dt.isocalendar().week
categorical = ['region', 'type', 'year']
scale = ['Total Volume', '4046', '4225', '4770',
                      'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']
skip = ['AveragePrice', 'week']

X = df_hastarget[categorical + scale + skip]
y = df_hastarget['AveragePriceNextWeek']

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('numerical', StandardScaler(), scale),
        ('pass', 'passthrough', skip)
        ])

model = make_pipeline(preprocessor, Ridge())

training = df_hastarget['Date'] <= split_date
testing = df_hastarget['Date'] > split_date


X_train, X_test = X[training], X[testing]
y_train, y_test = y[training], y[testing]
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
print("Test R^2:", r2)

Test R^2: 0.7875938444179044


In [30]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

df_hastarget['month'] = df_hastarget['Date'].dt.month
df_hastarget['day'] = df_hastarget['Date'].dt.day
df_hastarget['month_day_interaction'] = df_hastarget['month'] * df_hastarget['day']

categorical = ['region', 'type', 'year']
scale = ['Total Volume', '4046', '4225', '4770',
         'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']
skip = ['AveragePrice', 'month', 'day', 'month_day_interaction']

X = df_hastarget[categorical + scale + skip]
y = df_hastarget['AveragePriceNextWeek']

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('numerical', StandardScaler(), scale),
        ('pass', 'passthrough', skip)
        ])

model = make_pipeline(preprocessor, Ridge())

training = df_hastarget['Date'] <= split_date
testing = df_hastarget['Date'] > split_date


X_train, X_test = X[training], X[testing]
y_train, y_test = y[training], y[testing]
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
print("Test R^2:", r2)

Test R^2: 0.7865943442746997


In [31]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

df_hastarget['month'] = df_hastarget['Date'].dt.month
df_hastarget['day'] = df_hastarget['Date'].dt.day
df_hastarget['month_day_interaction'] = df_hastarget['month'] * df_hastarget['day']
df_hastarget['month_year_interaction'] = df_hastarget['month'] * df_hastarget['year']

categorical = ['region', 'type']
scale = ['Total Volume', '4046', '4225', '4770',
         'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']
skip = ['AveragePrice', 'month', 'year', 'month_year_interaction']

X = df_hastarget[categorical + scale + skip]
y = df_hastarget['AveragePriceNextWeek']

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('numerical', StandardScaler(), scale),
        ('pass', 'passthrough', skip)
        ])

model = make_pipeline(preprocessor, Ridge())

training = df_hastarget['Date'] <= split_date
testing = df_hastarget['Date'] > split_date


X_train, X_test = X[training], X[testing]
y_train, y_test = y[training], y[testing]
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
print("Test R^2:", r2)


Test R^2: 0.7576628942839743


In [32]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

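def lag_df(df, lag, cols):
    # Adds columns "<col>-1", ..., "<col>-<lag>", each holding <col> shifted down by n rows.
    # Note: shift() is not grouped here, so values cross region/type boundaries.
    return df.assign(
        **{f"{col}-{n}": df[col].shift(n) for n in range(1, lag + 1) for col in cols}
    ) #pulled from lecture 19
# NOTE: lag_1 is not used below; X is built from df_hastarget, so the model never sees the lagged column.
lag_1 = lag_df(df_hastarget, 1, ["AveragePriceNextWeek"])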

categorical = ['region', 'type', 'year']
scale = ['Total Volume', '4046', '4225', '4770',
         'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']
skip = ['AveragePrice']
df_hastarget = df_hastarget.dropna()

X = df_hastarget[categorical + scale + skip]
y = df_hastarget['AveragePriceNextWeek']

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('numerical', StandardScaler(), scale),
        ('pass', 'passthrough', skip)
        ])

model = make_pipeline(preprocessor, Ridge())

training = df_hastarget['Date'] <= split_date
testing = df_hastarget['Date'] > split_date


X_train, X_test = X[training], X[testing]
y_train, y_test = y[training], y[testing]
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
print("Test R^2:", r2)

Test R^2: 0.7953358641906658
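
For reference, a lag feature that the model actually sees has to be added to the frame that `X` is built from, and it should be shifted within each `region`/`type` series so that values do not leak from one series into the next. Below is a minimal sketch on made-up data; the values are invented and the column name `AveragePrice_lag1` is just an illustrative choice.

```python
import pandas as pd

# Made-up data: two short series, already sorted by date within each group.
toy = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B", "B"],
        "type": ["organic"] * 6,
        "AveragePrice": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
    }
)

# Shift by one row within each (region, type) group: last week's price.
toy["AveragePrice_lag1"] = toy.groupby(["region", "type"])["AveragePrice"].shift(1)

# The first row of each series has no previous week, so it is NaN and gets dropped.
with_lag = toy.dropna(subset=["AveragePrice_lag1"])

# The lagged column must be listed among the features for the model to use it.
feature_cols = ["AveragePrice", "AveragePrice_lag1"]
X_toy = with_lag[feature_cols]
print(X_toy)
```

With the notebook's own helper, the equivalent call would be `create_lag_feature(df_sort, "AveragePrice", -1, ["region", "type"])`, which under that function's naming rule produces a column called `AveragePrice_lag1`.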


In [33]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

df_hastarget['month'] = df_hastarget['Date'].dt.month
df_hastarget['week'] = df_hastarget['Date'].dt.isocalendar().week

df_hastarget['month_sin'] = np.sin(2 * np.pi * df_hastarget['month'] / 12)
df_hastarget['month_cos'] = np.cos(2 * np.pi * df_hastarget['month'] / 12)

df_hastarget['week_sin'] = np.sin(2 * np.pi * df_hastarget['week'] / 52)
df_hastarget['week_cos'] = np.cos(2 * np.pi * df_hastarget['week'] / 52)

categorical = ['region', 'type']
scale = ['Total Volume', '4046', '4225', '4770',
         'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']
skip = ['AveragePrice', 'month_sin', 'month_cos', 'week_sin', 'week_cos', 'year']

X = df_hastarget[categorical + scale + skip]
y = df_hastarget['AveragePriceNextWeek']

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('numerical', StandardScaler(), scale),
        ('pass', 'passthrough', skip)
        ])

model = make_pipeline(preprocessor, Ridge())

training = df_hastarget['Date'] <= split_date
testing = df_hastarget['Date'] > split_date


X_train, X_test = X[training], X[testing]
y_train, y_test = y[training], y[testing]
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
print("Test R^2:", r2)

Test R^2: 0.8009005941577396


<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Short answer questions

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:6}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.
3. When studying time series modeling, we explored several ways to encode date information as a feature for the citibike dataset. When we used time of day as a numeric feature, the Ridge model was not able to capture the periodic pattern. Why? How did we tackle this problem? Briefly explain.

<div class="alert alert-warning">

Solution_2.1
    
</div>

_Points:_ 6

1. Real-time product purchases (ex. maybe my company had a customer today, but then no one purchased anything until 10 months later on Black Friday, when I suddenly have 100 customers, each purchasing my product anywhere from 30 seconds to 1 minute apart).
2. I think creating lagged versions of features would struggle the most, because it assumes that the measurements are equally spaced in time. For example, in the avocado dataset, if the first 500 rows were measured every week and the rest of the rows every 2 days, the lagged features would not be as accurate because of the inconsistency (the amount of change differs across intervals of different length). To clarify, let's say avocados went up in price by one cent every day. Now let's say the price in row 499 was \$1.00, row 500 was \$1.07 because it was measured a week later, and row 501 was \$1.09 because it was measured only 2 days after that. With lagged features, row 500 might be predicted correctly from row 499 based on what the model learned in training, but the prediction for row 501 might be off by 5 cents (because the difference between rows 500 and 501 was only 2 cents rather than the assumed 7 cents).
3. Similar to what I discussed earlier, Ridge is a linear model, so if time of day is encoded as a single integer it can only learn a straight-line relationship between the hour and the target, and it cannot capture a pattern that rises and falls over the day. To fix this, the time of day was encoded as a categorical variable, and hour*day interaction features were added, which improved the train/test scores.
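
The arithmetic behind the lag argument in answer 2 can be sketched with made-up numbers: a price that rises by exactly one cent per day, and a rule learned on weekly rows that assumes every step is worth 7 cents. Everything here (the helper `price_on`, the measurement days, the rule) is an illustrative assumption, not something taken from the avocado data.

```python
# Made-up setup: the price rises by exactly one cent per day.
def price_on(day):
    return round(1.00 + 0.01 * day, 2)


measurement_days = [0, 7, 9]  # a one-week gap, then a two-day gap
observed = [price_on(d) for d in measurement_days]

# A rule learned from weekly data: "the next row is 7 cents above the previous row".
weekly_step = 0.07
predicted_next = [round(p + weekly_step, 2) for p in observed[:-1]]

# Prediction error for each step: zero for the weekly gap, 5 cents for the 2-day gap.
errors = [round(pred - actual, 2) for pred, actual in zip(predicted_next, observed[1:])]
print(errors)  # [0.0, 0.05]
```

The lagged value itself is fine; what breaks is the implicit assumption that one row back always means the same amount of elapsed time.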

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Computer vision
rubric={points:6}

The following questions pertain to the lecture on multiclass classification and introduction to computer vision.

1. How many parameters (coefficients and intercepts) will `sklearn`'s `LogisticRegression()` model learn for a four-class classification problem, assuming that you have 10 features? Briefly explain your answer.
2. In Lecture 19, we briefly discussed how neural networks are sort of like `sklearn`'s pipelines, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
3. Imagine that you have a small dataset with ~1000 images containing pictures and names of 50 different Computer Science faculty members from UBC. Your goal is to develop a reasonably accurate multi-class classification model for this task. Describe which model/technique you would use and briefly justify your choice in one to three sentences.

<div class="alert alert-warning">

Solution_2.2
    
</div>

_Points:_ 6

1. `LogisticRegression` learns one coefficient per feature per class. With 10 features, each class has 10 coefficients, and with 4 classes that is 10*4 = 40 coefficients. This is because multiclass classification uses softmax (which turns the raw scores of all classes into probabilities that sum to 1) instead of sigmoid, so every class gets its own set of weights. Additionally, there is one intercept per class, so the model also learns 4 intercepts, for a total of 44 parameters.

2. The input gets passed through a neural network that contains hidden layers. As the input passes through each layer, the layer extracts certain features of the input data, and during training the network adjusts the weights of those features (and, as a result, its predictions) every time it is fed data. Training a model from scratch this way is resource-intensive. However, with a pre-trained neural network, the weights of the relevant features have already been learned. Because the layers are applied in sequence, like steps in a pipeline, other people can reuse the earlier layers as a ready-made feature extractor and only train a new final step for their own task, without retraining the whole model.

3. I would not use a feed-forward architecture on raw images because it learns too many parameters (i.e. the total number of coefficients and intercepts) and cannot combine the pixels of the image into larger patterns (it focuses on the finer details rather than the bigger picture). This fails here because there are multiple pictures of each faculty member (about 1000 pictures for 50 people), and different pictures of the same person would likely have very different pixel values, so a person cannot be recognized from individual pixels alone. Instead, I would use a CNN, whose filters do not depend on individual pixels: each filter focuses on a certain aspect of the image and responds only when it finds the pattern it is looking for (ex. maybe a certain colour such as skin or hair colour needs to be present to classify an image as person A).
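
The count in answer 1 can be written out as a small calculation. This is a plain-Python sketch with a hypothetical helper name; on a fitted multiclass `LogisticRegression`, the same numbers would correspond to the sizes of `coef_` (shape `(4, 10)`) and `intercept_` (shape `(4,)`).

```python
def softmax_regression_param_count(n_features, n_classes):
    """Count weights and intercepts when each class has its own linear score."""
    n_coefficients = n_classes * n_features  # one weight per feature per class
    n_intercepts = n_classes                 # one intercept per class
    return n_coefficients, n_intercepts, n_coefficients + n_intercepts


coefs, intercepts, total = softmax_regression_param_count(n_features=10, n_classes=4)
print(coefs, intercepts, total)  # 40 4 44
```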

<!-- END QUESTION -->

<br><br>

Before submitting your assignment, please make sure you have followed all the instructions in the Submission Instructions section at the top.

Here is a quick checklist before submitting:

- [ ] Restart kernel, clear outputs, and run all cells from top to bottom.  
- [ ] `.ipynb` file runs without errors and contains all outputs.  
- [ ] Only `.ipynb` and required output files are uploaded (no extra files).  
- [ ] Execution numbers start at **1** and are in order.  
- [ ] If `.ipynb` is too large and doesn't render on Gradescope, also upload a PDF/HTML version.  
- [ ] Reviewed the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2025W1/docs/homework_instructions.html).  

![](https://github.com/divagandhi/hw8/blob/main/img/eva-well-done.png?raw=1)