# Hands-On Exercise 5.1: Forecasting using Prophet

## Objectives

In this exercise, you will learn how to do time series forecasting using Prophet.

## Overview

Prophet is a procedure and a set of libraries for forecasting time series. It was created at Meta (Facebook) to allow teams across the company to perform forecasting with minimal effort.

It makes use of the Stan probabilistic programming language to fit the time series models to the data.

You will use Prophet to forecast Canadian vehicle sales and COVID-19 cases.

## Load libraries

Load the libraries you'll be using in this exercise.

In [None]:
library(dplyr)
library(lubridate)
library(prophet)
library(readr)

## Forecasting vehicle sales

In RStudio, create a new script (e.g. `Ex5.1.R`). Add commands to the file according to the instructions that follow in this exercise, and execute each command as you move through the steps.

Read the new vehicle sales data in `data/new_vehicle_sales.csv` as `vehicle_sales_data`.

Transform `REF_DATE` to a date value named `ds` and rename `VALUE` to `y` (the column names Prophet requires).

Select only these two fields.

<font color="red">**Set the working directory to the course root folder using `setwd("/home/user/course/")`.**</font>

#### <font color="green">Solution...</font>

In [None]:
vehicle_sales_data <- read_csv("data/new_vehicle_sales.csv") |>
  mutate(ds=ym(REF_DATE)) |>
  select(ds, y=VALUE)
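
If you want to see the transformation in isolation, here is a minimal sketch on a few made-up rows. It assumes `REF_DATE` holds year-month text such as `"1946-01"`; the values and the `toy_` names are hypothetical and are not taken from the course data.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows standing in for the CSV; REF_DATE is assumed to be "YYYY-MM" text.
toy_sales <- tibble(
  REF_DATE = c("1946-01", "1946-02", "1946-03"),
  VALUE = c(100, 120, 150)
)

toy_prophet_input <- toy_sales |>
  mutate(ds = ym(REF_DATE)) |>  # parse year-month text; the day is set to the 1st
  select(ds, y = VALUE)         # keep only the two columns, renaming VALUE to y

str(toy_prophet_input)
```

`ym()` returns a `Date`, which Prophet accepts for the `ds` column.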

Review the data.

#### <font color="green">Solution...</font>

In [None]:
View(vehicle_sales_data)

This contains the number of motor vehicles sold in Canada for every month since 1946.

Create a Prophet model from the vehicle sales data. Assign it to `vehicle_sales_model`.

#### <font color="green">Solution...</font>

In [None]:
vehicle_sales_model <- prophet(vehicle_sales_data)

Create a Prophet future dataframe to hold predictions for another year (12 months) of sales. Assign it to `vehicle_sales_future`.

#### <font color="green">Solution...</font>

In [None]:
vehicle_sales_future <- make_future_dataframe(vehicle_sales_model, freq="month", periods=12)
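
To get a feel for what the future dataframe holds, the sketch below builds a rough base-R equivalent for a monthly series: the historical dates followed by 12 further month starts. The dates are made up, and this only illustrates the idea; it is not Prophet's own implementation.

```r
# Hypothetical last three months of history.
history_dates <- as.Date(c("2023-10-01", "2023-11-01", "2023-12-01"))

# Twelve further month starts after the last observed date
# (the first element of the sequence is the last observed date, so drop it).
future_dates <- seq(max(history_dates), by = "month", length.out = 13)[-1]

# History plus forecast horizon, in a single `ds` column.
toy_future <- data.frame(ds = c(history_dates, future_dates))
tail(toy_future)
```

The key point is that the dataframe passed to `predict()` covers both the history and the forecast horizon, so the forecast contains fitted values as well as predictions.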

Calculate the predicted vehicle sales for the next year. Assign the predictions to `vehicle_sales_forecast`.

#### <font color="green">Solution...</font>

In [None]:
vehicle_sales_forecast <- predict(vehicle_sales_model, vehicle_sales_future)

Examine the predictions for the last few months.

#### <font color="green">Solution...</font>

In [None]:
tail(vehicle_sales_forecast)
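
The forecast has many columns, so it can help to look at only the prediction and its uncertainty interval. The sketch below does this on a small synthetic monthly series (the `toy_` data is made up for illustration); the same column selection should work on `vehicle_sales_forecast`.

```r
library(prophet)

set.seed(1)
# Hypothetical monthly series: upward trend, a yearly cycle and some noise.
toy_monthly <- data.frame(
  ds = seq(as.Date("2015-01-01"), by = "month", length.out = 60),
  y = 100 + 1:60 + 10 * sin(2 * pi * (1:60) / 12) + rnorm(60)
)

toy_model <- prophet(toy_monthly)
toy_future <- make_future_dataframe(toy_model, periods = 12, freq = "month")
toy_forecast <- predict(toy_model, toy_future)

# yhat is the prediction; yhat_lower and yhat_upper bound the uncertainty interval.
tail(toy_forecast[, c("ds", "yhat", "yhat_lower", "yhat_upper")])
```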

Plot the time series and predictions.

#### <font color="green">Solution...</font>

In [None]:
plot(vehicle_sales_model, vehicle_sales_forecast, xlabel="Year", ylabel="Vehicles")

It's difficult to see what's going on as there's too much data. You could plot a subset of the data, but, instead, create an interactive plot.

#### <font color="green">Solution...</font>

In [None]:
dyplot.prophet(vehicle_sales_model, vehicle_sales_forecast)

Drag the lower range handle past 2020.

Note that the data is highly seasonal. Also note that Prophet is fairly confident in its prediction of next year's sales: the uncertainty interval does not widen much over the forecast period.

Plot the components that Prophet has discovered to review the trend and seasonality.

#### <font color="green">Solution...</font>

In [None]:
prophet_plot_components(vehicle_sales_model, vehicle_sales_forecast)

Growth in sales stagnated from around 1980 to 1995, before picking up again.

New car sales appear to be strongest in spring.

## Forecasting COVID-19 cases

Read the COVID-19 data (`data/covid_19.csv`).

Filter it so that it only contains data for the US (`iso_code` is `USA`). Select `date` as `ds` and `new_cases` as `y` (i.e. use the format required by Prophet).

Store the data as `usa_covid_data`.

#### <font color="green">Solution...</font>

In [None]:
usa_covid_data <- read_csv("data/covid_19.csv") |>
  filter(iso_code == "USA") |>
  select(ds=date, y=new_cases)

Review the data.

#### <font color="green">Solution...</font>

In [None]:
View(usa_covid_data)

This is the number of new COVID-19 cases recorded each day in the US.

Create a Prophet model from the COVID data. Assign it to `usa_covid_model`.

#### <font color="green">Solution...</font>

In [None]:
usa_covid_model <- prophet(usa_covid_data)

Create a Prophet future dataframe to hold predictions for another three months (90 days) of new cases. Assign it to `usa_covid_future`.

#### <font color="green">Solution...</font>

In [None]:
usa_covid_future <- make_future_dataframe(usa_covid_model, periods=90)

Calculate the predicted new COVID cases for the next three months. Assign the predictions to `usa_covid_forecast`.

#### <font color="green">Solution...</font>

In [None]:
usa_covid_forecast <- predict(usa_covid_model, usa_covid_future)

Plot the time series and the predicted three months of new cases.

Include the changepoints discovered by Prophet on the chart.

#### <font color="green">Solution...</font>

In [None]:
plot(usa_covid_model, usa_covid_forecast, xlabel="Day", ylabel="New cases") +
  add_changepoints_to_plot(usa_covid_model)

This model doesn't seem to be a particularly good fit. It misses rising infections in April 2020, July 2020 and April 2021. It also underestimates the peak around January 2021. It's missing some of the changepoints, including one later in the time series (April 2021).

The forecast for the next three months doesn't appear to be very sophisticated. It predicts an ever-declining number of new cases.

Create an interactive plot of the new cases (including predictions).

#### <font color="green">Solution...</font>

In [None]:
dyplot.prophet(usa_covid_model, usa_covid_forecast)

Zoom in and examine the seasonality. Prophet has identified the weekly seasonality in this data. 

However, this isn't of much interest to those tracking COVID infections. They are more interested in the longer term effects (e.g. rising infections, peaks, etc.).

Prophet may have overfitted to the seasonal effects.
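
The weekly component can also be read directly from the forecast. The sketch below uses a synthetic daily series with a built-in weekend dip (all `toy_` data is made up) and averages the `weekly` column of the forecast by day of the week.

```r
library(prophet)

set.seed(42)
days <- seq(as.Date("2021-01-01"), by = "day", length.out = 210)
# wday: 0 = Sunday ... 6 = Saturday (independent of locale).
weekend <- as.POSIXlt(days)$wday %in% c(0, 6)

# Hypothetical series: upward drift, lower counts at weekends, plus noise.
toy_daily <- data.frame(
  ds = days,
  y = 1000 + 2 * seq_along(days) - 300 * weekend + rnorm(length(days), sd = 20)
)

toy_model <- prophet(toy_daily)
toy_forecast <- predict(toy_model, make_future_dataframe(toy_model, periods = 14))

# Average weekly effect for each day of the week.
wday <- as.POSIXlt(toy_forecast$ds)$wday
weekly_effect <- tapply(toy_forecast$weekly, wday, mean)
round(weekly_effect)
```

In this toy series the weekend values should come out clearly lower than the weekday values, which is the kind of reporting pattern a weekly seasonality picks up.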

Create a new Prophet model of new US COVID cases with the following parameters.

- Weekly seasonality disabled (to reduce overfitting)
- Changepoints detected over the _entire_ time period
- Discover more changepoints

Assign the new model to `usa_covid_model_tuned`.

#### <font color="green">Solution...</font>

In [None]:
usa_covid_model_tuned <- prophet(
  usa_covid_data, 
  weekly.seasonality=FALSE, 
  changepoint.range=1,
  changepoint.prior.scale=0.75
)
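
To see what `changepoint.range` changes, the sketch below fits two models to a synthetic daily series and compares where the latest candidate changepoint falls. The data is made up, and the sketch assumes the fitted model object exposes its candidate changepoint dates as `$changepoints`. It only illustrates `changepoint.range`, not `changepoint.prior.scale`.

```r
library(prophet)

set.seed(3)
n <- 200
# Hypothetical random-walk series, so the trend changes direction often.
toy_cases <- data.frame(
  ds = seq(as.Date("2021-01-01"), by = "day", length.out = n),
  y = 500 + cumsum(rnorm(n, sd = 10))
)

# Default: candidate changepoints are placed in the first 80% of the history.
default_model <- prophet(toy_cases, weekly.seasonality = FALSE)

# changepoint.range = 1 spreads the candidates over the whole history.
full_range_model <- prophet(
  toy_cases,
  weekly.seasonality = FALSE,
  changepoint.range = 1
)

# Latest candidate changepoint in each model, next to the last observed date.
# (Assumes the fitted object stores the candidates in `$changepoints`.)
max(default_model$changepoints)
max(full_range_model$changepoints)
max(toy_cases$ds)
```

With the default range, a trend change in the most recent fifth of the data cannot be captured by a changepoint, which matters when the latest observations carry the most relevant signal.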

Create a Prophet future dataframe to hold (tuned) predictions for another three months (90 days) of new cases. Assign it to `usa_covid_future_tuned`.

#### <font color="green">Solution...</font>

In [None]:
usa_covid_future_tuned <- make_future_dataframe(usa_covid_model_tuned, periods=90)

Using the _tuned_ model, calculate the predicted new COVID cases for the next three months. Assign the predictions to `usa_covid_forecast_tuned`.

#### <font color="green">Solution...</font>

In [None]:
usa_covid_forecast_tuned <- predict(usa_covid_model_tuned, usa_covid_future_tuned)

Plot the time series and predictions---including the changepoints discovered by Prophet.

#### <font color="green">Solution...</font>

In [None]:
plot(usa_covid_model_tuned, usa_covid_forecast_tuned, xlabel="Day", ylabel="New cases") +
  add_changepoints_to_plot(usa_covid_model_tuned)

Note that the model is a _much_ better fit to the data. Prophet has identified many more changepoints---including ones later in the period covered.

However, look at the uncertainty in the prediction as it gets further out. Prophet has little confidence in its prediction of new COVID cases in the US. Hardly surprising, given the challenges of modelling infection rates.
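
One way to put a number on the widening interval is to compare its width at the start and end of the forecast period. The sketch below does this on a synthetic random-walk series (made up for illustration) using the same settings as the tuned model; the same calculation should work on `usa_covid_forecast_tuned`.

```r
library(prophet)

set.seed(7)
n <- 300
# Hypothetical random-walk series standing in for daily case counts.
toy_cases <- data.frame(
  ds = seq(as.Date("2021-01-01"), by = "day", length.out = n),
  y = 1000 + cumsum(rnorm(n, sd = 25))
)

toy_model <- prophet(
  toy_cases,
  weekly.seasonality = FALSE,
  changepoint.range = 1,
  changepoint.prior.scale = 0.75
)
toy_forecast <- predict(toy_model, make_future_dataframe(toy_model, periods = 90))

# Width of the uncertainty interval for each of the 90 forecast days.
horizon <- tail(toy_forecast, 90)
width <- horizon$yhat_upper - horizon$yhat_lower

# Compare the first and last week of the forecast period.
c(first_week = mean(head(width, 7)), last_week = mean(tail(width, 7)))
```

A flexible trend with late changepoints tends to produce intervals that grow quickly, because Prophet assumes future trend changes will be similar in size to those seen in the history.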

A lot more work would be required to make this a useful model. It's actually unlikely that time series modelling would be able to predict COVID infections with any accuracy. One of the main challenges in data science is selecting the correct modelling approach in the first place.

## Congratulations!

You have successfully forecast time series using Prophet.