# Problem Session 5
## The Simpsons and Bicycles I

In the first of two time series problem sessions, we focus on the basics of time series forecasting. In particular, we will do some exploratory data analysis, test your understanding of data split adjustments, and build a couple of first-step models for two time series.

The problems in this notebook draw on the material from our `Time Series Forecasting` lectures, including:
- `What are Time Series and Forecasting`,
- `Adjustments for Time Series Data`,
- `Time and Dates in Python` and
- `Baseline Forecasts`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")

### The Simpsons

##### 1. Introducing the data

The first data set you will work with is the IMDB ratings of every Simpsons episode (as of May 6, 2022). If you recall, you pulled this data in `Problem Session 1` using the `Cinemagoer` package. For this notebook you will load a saved version of the data.

Load `simpsons_imdb.csv` from the `Data` folder and look at the first five observations.

In [None]:
simpsons = pd.read_csv("../Data/simpsons_imdb.csv")

In [None]:
simpsons.head()

Here are descriptions for the columns of this data set:
- `season` is the season of the Simpsons to which the episode belongs,
- `episode` gives the number of that episode with respect to when it aired in its season,
- `title` gives the name of the episode,
- `imdb_rating` is the average rating of the episode among IMDB's users.

##### 2. Train test split

We will use half a season as our forecast's horizon. Make a train test split that sets aside the last season (roughly two of our horizons) as a test set and uses the rest as a training set.
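
As a reminder of the general idea, here is a minimal sketch of a time-ordered split on a small made-up data frame. The frame `toy` and its values are hypothetical and are not the Simpsons data; the point is only that the most recent observations are held out and nothing is shuffled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 seasons of 4 episodes each, listed in air order
toy = pd.DataFrame({
    "season": np.repeat([1, 2, 3], 4),
    "episode": np.tile([1, 2, 3, 4], 3),
    "imdb_rating": [8.1, 7.9, 8.3, 8.0, 7.8, 7.7, 8.0, 7.5, 7.2, 7.4, 7.0, 7.1],
})

# Time series splits must respect time order, so we never shuffle.
# Here the final season is set aside and everything earlier is kept for training.
last_season = toy["season"].max()
toy_train = toy.loc[toy["season"] < last_season].copy()
toy_test = toy.loc[toy["season"] == last_season].copy()

print(len(toy_train), "training rows and", len(toy_test), "test rows")
```

Because the rows are in air order, every training row precedes every test row, which is the property a forecasting split needs.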

In [None]:
## Code here


In [None]:
## Code here


##### 3. EDA

Plot the training data using a scatter plot.

Does this time series seem to exhibit a trend? Does it seem to exhibit seasonality? If it exhibits either, do your best to describe what you see.

In [None]:
plt.figure(figsize=(16,8))





plt.xlabel()
plt.ylabel()

plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

plt.show()

##### Write here




##### 4. Choose baselines

Choose one or two baseline models that you could build on these data. Plot these models over the training observations.

<i>Note: if you choose the trend model as a baseline you should fit an `sklearn` `LinearRegression` model for that, not just estimate $\beta$ using first differences</i>.
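
For reference, the sketch below fits two common baselines, an average baseline and a linear trend with intercept, to a made-up series. The series `y` and the names `avg_pred`, `trend`, and `trend_pred` are hypothetical; this is one possible way to set things up, not the required solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy, downward-trending series (not the Simpsons data)
rng = np.random.default_rng(0)
t = np.arange(1, 101)
y = 8.5 - 0.01 * t + rng.normal(scale=0.2, size=len(t))

# Average baseline: predict the training mean at every time step
avg_pred = y.mean() * np.ones(len(t))

# Trend baseline: regress the series on time, with an intercept.
# sklearn expects a 2D feature array, hence the reshape.
trend = LinearRegression()
trend.fit(t.reshape(-1, 1), y)
trend_pred = trend.predict(t.reshape(-1, 1))

print("estimated slope:", trend.coef_[0])
print("estimated intercept:", trend.intercept_)
```

To forecast with the trend baseline you would call `predict` on the time steps that follow the training data, for example `np.arange(101, 112).reshape(-1, 1)` in this toy setup.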

<b>Baseline 1</b>

In [None]:
## Code here



In [None]:
## Code here



In [None]:
plt.figure(figsize=(16,8))

## Here is the training data
plt.scatter(range(1, len(simps_train)+1), 
            simps_train.imdb_rating,
            alpha=.5,
            label="Training Data")

## Fill this in with your baseline
plt.plot(,
            ,
            'r',
            label=)

plt.xlabel("Episode Number", fontsize=16)
plt.ylabel("IMDB Rating", fontsize=16)

plt.legend(fontsize=14)

plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

plt.show()

<b>Baseline 2</b>

A trend model with an intercept.

In [None]:
## Code here



In [None]:
## Code here



In [None]:
plt.figure(figsize=(16,8))

## Here is the training data
plt.scatter(range(1, len(simps_train)+1), 
            simps_train.imdb_rating,
            alpha=.5,
            label="Training Data")

## Fill this in with your baseline
plt.plot(,
            ,
            'r',
            linewidth=2,
            label=)

plt.xlabel("Episode Number", fontsize=16)
plt.ylabel("IMDB Rating", fontsize=16)

plt.legend(fontsize=14)

plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

plt.show()

##### 5. Baseline CV Average MSE

Calculate the average cross-validation mean squared error for these two baseline models. Set up this cross-validation so that there are five splits.
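
One way to build such splits is `sklearn`'s `TimeSeriesSplit`. The sketch below assumes that is the splitter you want and uses a placeholder array `y` rather than the Simpsons data. Each split trains on an initial stretch of the series and holds out the block of observations that immediately follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder series of 100 observations (hypothetical)
y = np.arange(100)

# Five splits, each holding out 11 consecutive observations
cv = TimeSeriesSplit(n_splits=5, test_size=11)

for train_index, test_index in cv.split(y):
    # Every training position comes before every holdout position
    print("train ends at", train_index[-1],
          "| holdout covers", test_index[0], "to", test_index[-1])
```

Note that the indices returned by `split` are positions, not index labels, so with a `pandas` object they pair naturally with `.iloc`.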

In [None]:
## Import here


from sklearn.metrics import mean_squared_error

In [None]:
## Make your cross-validation object here
## 11 is roughly half a season
cv = 

In [None]:
## this will hold the mses
mses = np.zeros((2, 5))

j = 0
for train_index, test_index in cv.split(simps_train):
    ## cv.split returns positional indices, so index with .iloc
    simps_tt = simps_train.iloc[train_index]
    simps_ho = simps_train.iloc[test_index]
    
    ## baseline 1
    pred1 = 
    
    mses[0,j] = mean_squared_error(simps_ho.imdb_rating.values, pred1)
    
    ## baseline 2
    pred2 = 
    
    mses[1,j] = mean_squared_error(simps_ho.imdb_rating.values, pred2)
    j = j + 1

In [None]:
print("The mean CV MSE for the average baseline is", 
          np.round(np.mean(mses[0,:]), 3))
print("The mean CV MSE for the trend baseline is",
          np.round(np.mean(mses[1,:]), 3))

We will return to these baseline performances in `Problem Session 6`.

### Google trends "bike" interest

The second data set you will work with in this problem session is a time series collected using <a href="https://trends.google.com/trends/?geo=US">Google Trends</a>.

<img src="bike.png" width="20%"></img>


##### 1. Load the data

Load the data stored in `bike_google_trends.csv` in the `Data` folder.

Look at the first five observations.

In [None]:
bike = pd.read_csv("../Data/bike_google_trends.csv", parse_dates=['Month'])

In [None]:
bike.head()

- The `Month` column of this data set gives the month and year that the interest was measured. 
- The `bike_interest` column of this data set gives the level of interest (in percent) based on Google searches for the term "bikes". Values are scaled so that the month with the greatest interest has a value of $100\%$, and every other month is recorded as a percent of that maximum.

##### 2. Identifying stakeholders

One thing you may need to get more practice with is identifying the <i>stakeholders</i> for a particular problem. The stakeholders are the people who are most interested in your problem and the outcome of your solution.

Thinking about this can help you frame your project goals and focus your thinking to provide a solution that most suits the stakeholders' wants/needs.

For this question, take some time to think about what kinds of people may most be interested in forecasting interest in bikes. Why might they be interested? How could this forecast best help them?

##### Write here




##### 3. Train test split

Make a train test split in the data. Set aside May 2021 to May 2022 as a test set.

<i>Hint: the `datetime` module could be useful.</i>
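
To illustrate the hint, here is a minimal sketch of a date-based split on a made-up monthly data frame. The frame `toy`, its `value` column, and the cutoff date are hypothetical; only the pattern of comparing a datetime column against a `datetime` object carries over.

```python
from datetime import datetime

import pandas as pd

# Hypothetical monthly series: 24 months starting January 2020
toy = pd.DataFrame({"Month": pd.date_range("2020-01-01", periods=24, freq="MS")})
toy["value"] = range(24)

# Everything strictly before the cutoff is training data,
# everything on or after the cutoff is held out.
cutoff = datetime(2021, 7, 1)
toy_train = toy.loc[toy["Month"] < cutoff].copy()
toy_test = toy.loc[toy["Month"] >= cutoff].copy()

print(toy_train["Month"].max(), toy_test["Month"].min())
```

This comparison works because the column holds datetime values; that is what `parse_dates` provides when the data are read in.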

In [None]:
## code here



In [None]:
## code here



##### 4. EDA I

Plot the training data.

Does this time series appear to exhibit a trend or seasonality?

In [None]:
from datetime import datetime, timedelta

plt.figure(figsize=(16,6))

## fill in the plot here
plt.plot()

plt.ylabel("Interest (Perc of Max)", fontsize=16)
plt.xlabel("Date", fontsize=16)

## one tick at January 1 of each year (avoids leap-year drift)
plt.xticks([datetime(2004 + i, 1, 1) for i in range(19)],
              range(2004, 2023),
              fontsize=14)

plt.yticks(fontsize=14)

plt.show()

##### 4. EDA II

One way to explore the number of time steps in a given season is to plot scatter plots of the time series against itself at given <i>lags</i>. Such plots place the time series on the horizontal axis and the time series at $\ell$ steps into the future on the vertical axis. Seasonal data should exhibit a high correlation between itself and lags at multiples of the season length.

Use the function below to make such scatter plots for lag values from $\ell=1$ to $\ell=25$. This function also returns the correlation between the time series and the lagged series. Record these correlations and plot them against the lag. Using this information how long would you say a season is?
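
To see what the correlations look like when the season length is known, the sketch below computes them for a made-up series with a 12-step season. The series and the helper `lag_corr` are hypothetical and separate from the plotting function provided in this notebook.

```python
import numpy as np

# Hypothetical seasonal series: a 12-step cycle plus a little noise
rng = np.random.default_rng(1)
n = 120
series = 50 + 20 * np.sin(2 * np.pi * np.arange(n) / 12) + rng.normal(scale=2, size=n)

def lag_corr(values, lag):
    """Correlation between the series and itself `lag` steps later."""
    return np.corrcoef(values[:-lag], values[lag:])[0, 1]

lags = list(range(1, 26))
corrs = [lag_corr(series, lag) for lag in lags]

# The lag with the highest correlation is a candidate season length
best_lag = lags[int(np.argmax(corrs))]
print("lag with the highest correlation:", best_lag)
```

In this toy series the correlation peaks at multiples of the season length and is strongly negative half a season away, which is the pattern to look for in your own plot.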

In [None]:
## This code will make lag plots for you
def make_lag_plot(lag):
    ## the original time series
    ts = bike_train.bike_interest.values[:-lag]
    
    ## the lagged time series
    ts_lagged = bike_train.bike_interest.values[lag:]
    
    
    ## Making the figure
    plt.figure(figsize=(6,6))
    
    plt.scatter(ts, ts_lagged, alpha=.7)
    
    ## line y=x for reference
    plt.plot([0,100], [0,100], 'k--')
    
    ## labeling the plot
    plt.title("Lag = " + str(lag), fontsize=16)
    
    plt.show()
    
    ## return the correlation coefficient
    return np.corrcoef(ts, ts_lagged)[0,1]

In [None]:
## code here




In [None]:
## Plot the correlation against the lag here
plt.figure(figsize=(10,6))

## input the lags and correlations
plt.scatter()

plt.xlabel("Lag", fontsize=16)
plt.ylabel("Correlation", fontsize=16)

plt.xticks(range(0,25,3), fontsize=13)
plt.yticks(fontsize=13)

plt.ylim([-1.1,1.1])

plt.show()

##### How long does a season last?

##### 5. Baselines

Select two baseline forecasts for these data.

Create a validation set of June 2020 to May 2021 and plot your baseline predictions alongside the actual validation data.
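
As a refresher, the sketch below builds two standard baseline forecasts, a naive forecast and a seasonal naive forecast, on a made-up monthly series. The series, the 12-step season length, and all variable names are hypothetical; which baselines suit the bike data is for you to decide.

```python
import numpy as np

# Hypothetical monthly series with a 12-month season
rng = np.random.default_rng(2)
months = np.arange(60)
series = 40 + 15 * np.sin(2 * np.pi * months / 12) + rng.normal(scale=1.5, size=60)

# Hold out the final 12 observations as a validation set
train, val = series[:-12], series[-12:]

# Naive forecast: repeat the last training observation
naive_pred = train[-1] * np.ones(len(val))

# Seasonal naive forecast: repeat the last full season of training data
season_length = 12
seasonal_pred = train[-season_length:]

mse_naive = np.mean((val - naive_pred) ** 2)
mse_seasonal = np.mean((val - seasonal_pred) ** 2)
print("naive MSE:", mse_naive, "| seasonal naive MSE:", mse_seasonal)
```

On strongly seasonal data like this toy series, the seasonal naive forecast tracks the validation set far more closely than the flat naive forecast.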

##### Baseline 1

In [None]:
## Code or write here




In [None]:
## Code or write here




In [None]:
## Plot here
plt.figure(figsize=(16, 6))

## The Training data
plt.plot(bike_tt.Month, 
            bike_tt.bike_interest,
            'b',
            label = "Training Data")

## The validation data
plt.plot(bike_val.Month,
            bike_val.bike_interest,
            'b--',
            label = "Validation Data")

## Insert your 1st baseline forecast here
plt.plot(bike_val.Month,
            ,
            'r--.',
            label=)

plt.legend(fontsize=14)

plt.ylabel("Bike Interest", fontsize=16)
plt.xlabel("Date", fontsize=16)

plt.show()

##### Baseline 2

In [None]:
## code or write here




In [None]:
## code or write here




In [None]:
## code or write here




In [None]:
## Plot here
plt.figure(figsize=(16, 6))

## The Training data
plt.plot(bike_tt.Month, 
            bike_tt.bike_interest,
            'b',
            label = "Training Data")

## The validation data
plt.plot(bike_val.Month,
            bike_val.bike_interest,
            'b--',
            label = "Validation Data")

## fill in the missing pieces to plot your 2nd baseline
plt.plot(bike_val.Month,
            ,
            'r--.',
            label=)

plt.legend(fontsize=14)

plt.ylabel("Bike Interest", fontsize=16)
plt.xlabel("Date", fontsize=16)

plt.show()

##### 6. Evaluating baselines

Visually speaking, how well do your two baselines appear to perform on the validation set? Does the observed interest level seem consistent with the rest of the time series? Do your best to explain your answer.

In light of your answer, should we include these data in our forecasting process? Why or why not?

##### Write here






We will return to these data in the time series `Practice Problems` notebooks.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).