<font size="+3"><strong>Time Series: Statistical Models</strong></font>

# Autoregression

Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. AR works in a similar way to **autocorrelation**: in both cases, we're comparing a series with an earlier part of itself. In other words, an AR model regresses a series on its own past values.
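
As a rough sketch of what "a regression equation on previous time steps" looks like, an AR model with $p$ lags is usually written as below. The symbols are assumptions introduced here for illustration and are not defined elsewhere in this lesson: $y_t$ is the reading at time step $t$, $\beta_0$ is an intercept, $\beta_1, \dots, \beta_p$ are the coefficients on the lagged readings, and $\varepsilon_t$ is the error term.

$$
y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p} + \varepsilon_t
$$

With $p = 1$, this reduces to an ordinary linear regression of each reading on the reading one step before it.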

## Cleaning the Data

Just like with linear regression, we'll start by bringing in some tools to help us along the way.

In [None]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from pymongo import MongoClient
from sklearn.metrics import mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg

warnings.simplefilter(action="ignore", category=FutureWarning)

Since we'll be working with the `"air-quality"` data again, we need to connect to the server, start our client, and grab the data we need.

In [None]:
client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
lagos = db["lagos"]

<font size="+1">Practice</font>

Just to make sure we're all on the same page, import all those libraries and get your database up and running. Remember that even though all the examples use the Site 3 data from the `lagos` collection, the practice sets should use Site 4 data from the `lagos` collection. Name the variable that holds your collection `lagos_prac`.

In [None]:
lagos_prac = ...

In order to get our data into a form we can use to build our model, we're going to need to transform it in several key ways. The first thing we need to do is to get the data we need, and save the results in a DataFrame. Since we're interested in predicting the changes in air quality over time, let's set the DataFrame's index to `"timestamp"`:

In [None]:
results = lagos.find(
    # Note that the `3` refers to Site 3.
    {"metadata.site": 3, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0},
)
df = pd.DataFrame(list(results)).set_index("timestamp")
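
If you don't have the database running, the sketch below shows roughly what the last line does. It assumes each document returned by the query is a dictionary with a `"P2"` value and a `"timestamp"` datetime, which is what the projection asks for. The `docs` list, the `df_toy` name, and the readings in it are made up for illustration.

```python
from datetime import datetime

import pandas as pd

# Hypothetical documents shaped like the projection above:
# only "P2" and "timestamp" are kept, and "_id" is dropped.
docs = [
    {"P2": 34.43, "timestamp": datetime(2018, 1, 1, 0, 0, 48)},
    {"P2": 30.53, "timestamp": datetime(2018, 1, 1, 0, 5, 3)},
    {"P2": 22.80, "timestamp": datetime(2018, 1, 1, 0, 10, 4)},
]

# Each dictionary becomes a row; "timestamp" then becomes the index.
df_toy = pd.DataFrame(docs).set_index("timestamp")
print(df_toy)
```

The result has a single `"P2"` column and a datetime index, which is the shape the rest of the lesson relies on.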

<font size="+1">Practice</font>

Try it yourself! Create a query result called `results_prac` that pulls the Site 4 data from the `lagos` collection, then save it in a DataFrame called `df_prac` with `"timestamp"` as its index.

## Localizing the Timezone

Because MongoDB stores all timestamps in `UTC`, we need to figure out a way to localize them. Having timestamps in UTC might be useful if we were trying to predict some kind of global trend, but since we're only interested in what's happening with the air in Lagos, we need to change the data from UTC to `Africa/Lagos`. Happily, pandas has a pair of tools to help us out: [`tz_localize`](https://pandas.pydata.org/docs/reference/api/pandas.Series.tz_localize.html) and [`tz_convert`](https://pandas.pydata.org/docs/reference/api/pandas.Series.tz_convert.html). The first labels our timezone-naive timestamps as UTC, and the second converts them to local time. We use those methods to transform our data like this:

In [None]:
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
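
To see the two steps at work without touching the real data, here is a small sketch on a made-up DataFrame. The `toy` name and its values are assumptions for illustration only. Lagos is one hour ahead of UTC, so each timestamp should move forward by one hour.

```python
import pandas as pd

# Hypothetical timezone-naive timestamps, like the ones the query returns.
naive_index = pd.to_datetime(["2018-01-01 00:00:00", "2018-01-01 12:30:00"])
toy = pd.DataFrame({"P2": [10.0, 20.0]}, index=naive_index)

# Step 1: declare that the naive timestamps are in UTC.
# Step 2: convert them to local time in Lagos.
toy.index = toy.index.tz_localize("UTC").tz_convert("Africa/Lagos")
print(toy.index)
```

Note the order: `tz_localize` only attaches a timezone to timestamps that have none, while `tz_convert` shifts timestamps that already have one.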

## Resampling Data

The most important kind of data in our time-series model is the data that deals with time. Our `"timestamp"` data tells us when each reading was taken, but in order to create a good predictive model, we need the readings to happen at regular intervals. Our data doesn't do that, so we need to figure out a way to change it so that it does. The [`resample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html) method does that for us. 

Let's resample our data to create 1-hour reading intervals by aggregating using the mean:

In [None]:
# `"1H"` represents our one-hour window
df = df["P2"].resample("1H").mean().ffill().to_frame()

Notice the second half of the code:

```python
ffill().to_frame()
```

That tells pandas to **forward-fill** any empty cells with **imputed** data, then turn the resulting Series back into a DataFrame. Forward-filling means that each empty cell is filled with the most recent earlier value that actually has data in it. This helps to keep the imputed data in line with the rest of the dataset.
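
Here is a small sketch of resampling and forward-filling on made-up readings taken at irregular times. The `toy`, `hourly`, and `filled` names and their values are assumptions for illustration. The window is written as `"60min"`, which means the same one-hour interval and is accepted across pandas versions.

```python
import pandas as pd

# Hypothetical readings: two in the first hour, none in the next two hours,
# and one in the fourth hour.
idx = pd.to_datetime(["2018-01-01 00:05", "2018-01-01 00:40", "2018-01-01 03:10"])
toy = pd.Series([10.0, 20.0, 40.0], index=idx, name="P2")

# Average the readings inside each one-hour window.
# Hours with no readings come out as NaN.
hourly = toy.resample("60min").mean()

# Forward-fill: each NaN takes the last value that came before it.
filled = hourly.ffill().to_frame()
print(filled)
```

The first hour becomes the mean of its two readings (15.0), and the two empty hours that follow are filled with that same value until the next real reading arrives.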

## Adding a Lag

We've spent some time elsewhere thinking about how two sets of data (apartment price and location, for example) compare to *each other*, but we haven't had any reason to consider how a dataset might compare to *itself*. If we're predicting the future, we want to know how much earlier readings tell us about later ones, so it's useful to put each reading side by side with the one that came before it. To do that, we need to add a **lag**.

Lagging data means shifting it by a fixed delay. In this case, each row will get a new column holding the reading from one hour earlier, so the model can use the previous hour's value to predict the current one. If those predictions are close to what actually happened, then it's a good model; if they aren't, then the model isn't a very good one.

So, let's add a one-hour lag to our dataset: 

In [None]:
# In `shift(1)`, the `1` is the lagged interval: one row, which is one hour here.
df["P2.L1"] = df["P2"].shift(1)
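
To make the shift concrete, here is a small sketch on made-up hourly readings. The `toy` and `toy_clean` names and the values are assumptions for illustration. Dropping the first row afterwards is one common follow-up, not necessarily the next step in this lesson.

```python
import pandas as pd

# Hypothetical hourly readings.
toy = pd.DataFrame(
    {"P2": [10.0, 12.0, 15.0, 11.0]},
    index=pd.date_range("2018-01-01", periods=4, freq="60min"),
)

# Each row's "P2.L1" is the "P2" value from the row one hour earlier.
toy["P2.L1"] = toy["P2"].shift(1)

# The first row has no earlier reading, so its lag is NaN.
# One common way to handle that is to drop the row.
toy_clean = toy.dropna()
print(toy)
```

Every row now pairs the current reading with the previous hour's reading, which is the pairing a one-lag AR model learns from.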