# Time Series Forecasting

This example shows how to use [Prophet](https://facebook.github.io/prophet/) and Dask for scalable time series forecasting.

> Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.

As discussed in [*Forecasting at scale*](https://peerj.com/preprints/3190/), large datasets aren't the only type of scaling challenge teams run into. In this example we'll focus on one of the scaling challenges identified in that paper:

> in most realistic settings, a large number of forecasts will be created, necessitating efficient, automated means of evaluating and comparing them, as well as detecting when they are likely to be performing poorly. When hundreds or even thousands of forecasts are made, it becomes important to let machines do the hard work of model evaluation and comparison while efficiently using human feedback to fix performance problems.

That sounds like a perfect opportunity for Dask. We'll use Prophet and Dask together to parallelize the *diagnostics* stage of research. This example does not attempt to parallelize the training of the model itself.

In [None]:
# This example currently relies on Prophet master
!pip install 'git+https://github.com/facebook/prophet/#egg=fbprophet&subdirectory=python'

In [None]:
import pandas as pd
from fbprophet import Prophet

We'll walk through the example from the Prophet quickstart. These values represent the log daily page views for [Peyton Manning's wikipedia page](https://en.wikipedia.org/wiki/Peyton_Manning).

In [None]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_peyton_manning.csv',
    parse_dates=['ds']
)
df.head()

In [None]:
df.plot(x='ds', y='y');

Fitting the model takes a handful of seconds. Dask isn't involved at all here.

In [None]:
%%time
m = Prophet()
m.fit(df)

And we can make a forecast. Again, Dask isn't involved here.

In [None]:
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
m.plot(forecast);

## Parallel Diagnostics

Prophet includes a `cross_validation` function, which uses *simulated historical forecasts* to provide some idea of a model's quality.

> This is done by selecting cutoff points in the history, and for each of them fitting the model using data only up to that cutoff point. We can then compare the forecasted values to the actual values.

See https://facebook.github.io/prophet/docs/diagnostics.html for more.
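To make the idea of cutoff points concrete, here is a minimal sketch of how evenly spaced cutoffs could be derived from an initial training window, a spacing between cutoffs, and a forecast horizon. It is an illustration under assumptions, not Prophet's implementation: `sketch_cutoffs` is a hypothetical helper, the dates are made up, and it uses only the standard library.

```python
from datetime import datetime, timedelta


def sketch_cutoffs(start, end, initial, period, horizon):
    """Hypothetical helper: walk backwards from the last usable cutoff."""
    # The latest cutoff must leave a full horizon of actual values to compare against.
    cutoff = end - horizon
    # The earliest cutoff must leave at least `initial` of training history.
    earliest = start + initial
    cutoffs = []
    while cutoff >= earliest:
        cutoffs.append(cutoff)
        cutoff -= period
    return sorted(cutoffs)


# Made-up date range, chosen only for illustration.
example_cutoffs = sketch_cutoffs(
    start=datetime(2010, 1, 1),
    end=datetime(2016, 1, 1),
    initial=timedelta(days=730),
    period=timedelta(days=180),
    horizon=timedelta(days=365),
)
for c in example_cutoffs:
    print(c.date())
```

Each cutoff defines an independent fit-and-forecast task: train on everything up to the cutoff, then compare the forecast against the actual values that follow it.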

In [None]:
from fbprophet.diagnostics import cross_validation, performance_metrics

In [None]:
df_cv = cross_validation(m, initial="730 days", period="180 days", horizon="365 days")

Internally, `cross_validation` uses the user's parameters to determine a set of `cutoffs` at which to generate forecasts. In this case, we ended up with 11. *The historical forecasts for each cutoff can be done entirely in parallel*. So in order to distribute cross validation, we'll just generate those cutoffs ourselves and run one cross-validation forecast for each `cutoff`.

Note that the rest of this example depends on `fbprophet>=0.7` (currently in development).

In [None]:
import dask
from distributed import Client, performance_report

client = Client()
client

In [None]:
from fbprophet.diagnostics import generate_cutoffs

In [None]:
initial = pd.Timedelta("730 days")
period = pd.Timedelta("180 days")
horizon = pd.Timedelta("365 days")

cutoffs = generate_cutoffs(
    m.history.copy().reset_index(drop=True),
    horizon=horizon,
    initial=initial,
    period=period,
)
cutoffs

We'll use `dask.delayed` and `fbprophet.diagnostics.single_cutoff_forecast` to lazily do all the cross validations.

In [None]:
from fbprophet.diagnostics import single_cutoff_forecast

In [None]:
delayed_cv = dask.delayed(single_cutoff_forecast)

# df2 is a somewhat large object.
df2 = dask.delayed(m.history.copy().reset_index(drop=True))
predict_columns = ['ds', 'yhat', 'yhat_lower', 'yhat_upper']


cvs = [delayed_cv(df2, m, cutoff, horizon, predict_columns)
       for cutoff in cutoffs]
cvs

In [None]:
client.scatter(df2);  # pre-scatter large data.

Building `cvs` finished instantly, since nothing has been computed yet.
Passing those delayed objects to `dask.compute` puts the cluster to work.

In [None]:
# Watch the distributed Dashboard here
%time cvs = dask.compute(*cvs)

This returns a tuple of DataFrames, one per cutoff.

In [None]:
cvs[0]

Which we can concatenate to get the same result as the original.

In [None]:
df_cv = pd.concat(cvs, ignore_index=True)
df_cv

At this point, we're back to the same result as if we had done things without Dask.
We can compute `performance_metrics`.

In [None]:
from fbprophet.diagnostics import performance_metrics

df_p = performance_metrics(df_cv)
df_p

And plot a metric, e.g. the mean absolute percentage error (MAPE).

In [None]:
from fbprophet.plot import plot_cross_validation_metric

fig = plot_cross_validation_metric(df_cv, metric='mape')
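For readers unfamiliar with the metric, the following is a minimal sketch of the plain definition of mean absolute percentage error: the mean of `|y - yhat| / |y|`. The `mape` function and the numbers are made up for illustration. Prophet's `performance_metrics` reports metrics as a function of the forecast horizon, which this sketch does not attempt to reproduce.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Relative error per observation; assumes no actual value is zero.
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Made-up values, for illustration only.
y = [10.0, 20.0, 40.0]
yhat = [9.0, 22.0, 40.0]
example_mape = mape(y, yhat)
print(f"{example_mape:.4f}")
```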

## Things to improve

Currently, this requires a bit of specialized knowledge to use correctly.
It took me (a light user of Prophet) a bit of time to dig into Prophet's source code to determine the appropriate point of parallelization. And it takes a bit of Dask knowledge to know just what should be wrapped in `dask.delayed`.

Currently, Prophet supports parallel evaluation in `cross_validation` with `multiprocessing`. It would be great if Prophet could do things using an interface like `concurrent.futures`. Then users could choose their own parallelism.


```python
from concurrent.futures import ThreadPoolExecutor


# Parallelize with Threads
cross_validation(model, executor=ThreadPoolExecutor())

# Parallelize with Dask
cross_validation(model, executor=client)
```
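As a rough sketch of what such an executor-based design could look like from the caller's side, the hypothetical helper below submits one task per cutoff and gathers the results. Neither `run_per_cutoff` nor `forecast_for_cutoff` is part of Prophet; the latter is a stand-in for the real fit-and-forecast step, and the cutoff strings are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def forecast_for_cutoff(cutoff):
    """Hypothetical stand-in for fitting and forecasting at one cutoff."""
    return {"cutoff": cutoff, "n_forecasts": 1}


def run_per_cutoff(func, cutoffs, executor):
    """Submit one task per cutoff and gather results in the order given."""
    # Only `executor.submit` and `future.result` are used here.
    futures = [executor.submit(func, cutoff) for cutoff in cutoffs]
    return [future.result() for future in futures]


with ThreadPoolExecutor(max_workers=2) as pool:
    results = run_per_cutoff(forecast_for_cutoff, ["2014-01-01", "2014-07-01"], pool)
print(results)
```

The helper relies only on `submit` and on futures that have a `result` method. A `distributed.Client` offers both, so the assumption is that `client` could be passed in place of `pool` without changing the helper.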

However, there are a few issues that would need to be solved upstream in Python itself (see https://github.com/dask/distributed/issues/3695 for more). For now, this post serves as a reference for a way to achieve parallel, distributed diagnostics using Prophet's already great API.