(time-series)=
# Time Series

In this chapter, we'll look at time series. If you haven't yet looked at the two sections on **pandas**, the [Data Quickstart](data-quickstart) and [Working with Data](working-with-data) chapters, it might be worth taking a quick spin through them first. You may also find it useful to be familiar with a few of the concepts from the previous [chapter on time](time-intro).

While we'll cover the basics here, the full set of time series functionality of **pandas** can be [found here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html).

This chapter has benefitted from the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas, and Tom Augspurger's [Effective Pandas](https://github.com/TomAugspurger/effective-pandas).

Let's import a few of the packages we'll need first.

In [None]:
import numpy as np
import pandas as pd
from rich import inspect
import matplotlib.pyplot as plt

# Plot settings
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)

## Time Series with **pandas**

[**pandas**](https://pandas.pydata.org/) is the workhorse of time series analysis in Python. The basic object is a *timestamp*. The `pd.to_datetime` function creates timestamps from strings that could reasonably represent datetimes. Let's see an example of using `pd.to_datetime` to create a timestamp and then inspect all of the methods and attributes of the created timestamp using **rich**'s `inspect` function.

In [None]:
date = pd.to_datetime("16th of February, 2020")
inspect(date)

This is of type `Timestamp` and you can see that it has many of the same properties as the built-in Python `datetime.datetime` class from the previous chapter. As with that, the default setting for `tz` (timezone) and `tzinfo` is `None`. There are some extra properties, though, such as `freq` for frequency, which will be very useful when it comes to manipulating time *series* as opposed to just one or two datetimes.
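As a small illustration of those properties, the sketch below pulls a few commonly used ones off a timestamp. The date is written in ISO format here purely to keep the example self-contained.

```python
import pandas as pd

date = pd.to_datetime("2020-02-16")

# Individual components are available as attributes
print(date.year, date.month, date.day)

# Some derived quantities are methods rather than attributes
print(date.day_name())

# No timezone is attached unless you ask for one
print(date.tz)
```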

### Creating Time Series

There are two main scenarios in which you might be creating time series using **pandas**: i) creating one from scratch or ii) reading in data from a file. Let's look at a few ways to do i) first. 

You can create a time series with **pandas** by taking a date as created above and extending it using the `pd.to_timedelta` function:

In [None]:
date + pd.to_timedelta(np.arange(12), "D")

This has created a datetime index of type `datetime64[ns]` (remember, an index is a special type of **pandas** column), where "ns" stands for nanosecond resolution.

Another method is to create a range of dates with `pd.date_range`. The default frequency is daily; pass a different one using the `freq=` keyword argument:

In [None]:
pd.date_range(start="2018/1/1", end="2018/1/8")
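The call above relies on the default daily frequency. As a sketch of what passing `freq=` explicitly looks like, the example below asks for every second day between the same two endpoints; the choice of `"2D"` is arbitrary.

```python
import pandas as pd

# Every second day between the two endpoints (endpoints are included if they fall on the grid)
every_other_day = pd.date_range(start="2018-01-01", end="2018-01-08", freq="2D")
print(every_other_day)
```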

Another way to create ranges is to specify the number of periods and the frequency:

In [None]:
pd.date_range("2018-01-01", periods=3, freq="H")

Following the discussion of timezones in the previous chapter, you can also localise and convert timezones directly in **pandas**:


In [None]:
dti = pd.date_range("2018-01-01", periods=3, freq="H").tz_localize("UTC")
dti.tz_convert("US/Pacific")
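The difference between the two timezone methods is easy to mix up, so here is a minimal sketch using two made-up timestamps: `tz_localize` attaches a timezone to naive datetimes without changing the clock time, while `tz_convert` changes the clock time but still refers to the same instant.

```python
import pandas as pd

naive = pd.to_datetime(["2018-01-01 00:00", "2018-01-01 01:00"])

# tz_localize attaches a timezone; the clock time is unchanged
utc = naive.tz_localize("UTC")

# tz_convert shifts the clock time to the new zone; the instant is unchanged
pacific = utc.tz_convert("US/Pacific")

print(utc[0], "->", pacific[0])
print(utc[0] == pacific[0])  # compares instants, not clock times
```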

Now let's see how to turn data that has been read in with a non-datetime type into a vector of datetimes. This happens *all the time* in practice. We'll read in some data on job vacancies for information and communication jobs, ONS code UNEM-JP9P, and then try to wrangle the given "date" column into a **pandas** datetime column.

In [None]:
import requests

url = "https://api.ons.gov.uk/timeseries/JP9P/dataset/UNEM/data"

# Get the data from the ONS API:
df = pd.DataFrame(pd.json_normalize(requests.get(url).json()["months"]))
df["value"] = pd.to_numeric(df["value"])
df = df[["date", "value"]]
df = df.rename(columns={"value": "Vacancies (Information & Communication)"})
df.head()

We have the data in. Let's look at the column types that arrived.

In [None]:
df.info()

The date column has the default 'object' type, but we want it to have `datetime64[ns]`, which is a datetime type. Again, we use `pd.to_datetime`:

In [None]:
df["date"] = pd.to_datetime(df["date"])
df["date"].head()

In this case, the conversion from the input format, for example "2001 MAY", to datetime worked out-of-the-box. `pd.to_datetime` will always take an educated guess at the format, but it won't always guess correctly.

What happens if we have a more tricky-to-read-in datetime column? This frequently occurs in practice so it's well worth exploring an example. Let's create some data with dates in an unusual format with month first, then year, then day, e.g. "1, '19, 22" and so on.

In [None]:
small_df = pd.DataFrame({"date": ["1, '19, 22", "1, '19, 23"], "values": ["1", "2"]})
small_df["date"]

Now, if we were to run this via `pd.to_datetime` with no further input, it would misinterpret, for example, the first date as `2022-01-19`. So we must pass a bit more info to `pd.to_datetime` to help it out. We can pass a `format=` keyword argument with the format that the datetime takes. Here, we'll use `%m` for month in number format, `%y` for year in 2-digit format, and `%d` for 2-digit day. We can also add in the other characters such as `'` and `,`. You can find a list of datetime format identifiers in the previous chapter or over at [https://strftime.org/](https://strftime.org/).

In [None]:
pd.to_datetime(small_df["date"], format="%m, '%y, %d")
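When checking a format string like this one, it can help to see what happens to an entry that does not match. The sketch below reuses the same format on a made-up series with one bad entry; passing `errors="coerce"` (an addition for illustration, not used elsewhere in this chapter) turns unparseable entries into `NaT` rather than raising an error.

```python
import pandas as pd

dates = pd.Series(["1, '19, 22", "1, '19, 23", "not a date"])

# Entries that do not match the format become NaT instead of raising
parsed = pd.to_datetime(dates, format="%m, '%y, %d", errors="coerce")
print(parsed)
```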

### Offsets

Our data, currently held in `df`, were read in as if they were from the *start* of the month but these data refer to the month that has passed and so should be for the *end* of the month. Fortunately, we can change this using a time offset.

In [None]:
df["date"] = df["date"] + pd.offsets.MonthEnd()
df.head()

While we used the `MonthEnd` offset here, there are many different offsets available. You can find a [full table of date offsets here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects).
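One detail of `MonthEnd` is worth seeing on a couple of made-up dates: a date that is not at a month end rolls forward to the end of its own month, but a date that is already at a month end moves to the end of the *next* month. Passing `0` as the argument leaves month-end dates where they are.

```python
import pandas as pd

start_of_month = pd.Timestamp("2020-01-01")
end_of_month = pd.Timestamp("2020-01-31")

# Not yet at a month end: rolls forward within the same month
print(start_of_month + pd.offsets.MonthEnd())

# Already at a month end: moves on to the next month end
print(end_of_month + pd.offsets.MonthEnd())

# n=0 only rolls forward if the date is not already a month end
print(end_of_month + pd.offsets.MonthEnd(0))

# A different offset: the start of the following month
print(start_of_month + pd.offsets.MonthBegin())
```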

### Creating a datetime index and setting the frequency

For the subsequent parts, we'll set the datetime column to be the index of the dataframe. *This is the standard setup you will likely want to use when dealing with time series.*

In [None]:
df = df.set_index("date")
df.head()
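One reason this setup is convenient is that rows can then be selected using date strings. The sketch below uses a made-up daily dataframe (the name `ts` and its contents are purely illustrative) rather than the vacancies data.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", periods=90, freq="D")
ts = pd.DataFrame({"value": np.arange(90)}, index=index)

# A partial date string selects every row in that month
february = ts.loc["2020-02"]
print(len(february))

# Slices of date strings include both endpoints
window = ts.loc["2020-01-30":"2020-02-02"]
print(window)
```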

Now, if we look at the first few entries of the dataframe's index (a datetime index), we'll see that the `freq=` parameter is set to `None`.

In [None]:
df.index[:5]

This can be set for the whole dataframe using the `asfreq` method:

In [None]:
df = df.asfreq("M")
df.index[:5]

Although most of the time it doesn't matter that `freq=None`, some aggregation operations need to know the frequency of the time series in order to work and it's good practice to set it if your data *are* regular. 

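Note that trying to set a frequency that your datetime index doesn't match will cause problems: `asfreq` conforms the data to the new frequency, so any dates on the regular grid that are missing from the original index appear as rows of missing values.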
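A minimal sketch of this situation, using made-up daily data with one day missing: `pd.infer_freq` finds no regular frequency, and `asfreq` fills the gap in the index with a missing value.

```python
import pandas as pd

# Daily data in which 3 January is missing
gappy = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]),
)

# No single frequency fits this index
print(pd.infer_freq(gappy.index))

# Forcing a daily frequency inserts a row for the missing day
regular = gappy.asfreq("D")
print(regular)
```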

A few useful frequencies to know about are in the table below; all of these can be used with `pd.date_range` too.

| Code  | Represents                                                          |
|-------|---------------------------------------------------------------------|
| D     | Calendar day                                                        |
| W     | Weekly                                                              |
| M     | Month end                                                           |
| Q     | Quarter end                                                         |
| A     | Year end                                                            |
| H     | Hours                                                               |
| T     | Minutes                                                             |
| S     | Seconds                                                             |
| B     | Business day                                                        |
| BM    | Business month end                                                  |
| BQ    | Business quarter end                                                |
| BA    | Business year end                                                   |
| BH    | Business hours                                                      |
| MS    | Month start                                                         |
| QS    | Quarter start                                                       |
| W-SUN | Weekly, anchored on Sundays (similar for other days)                |
| 2M    | Every 2 months (works with other combinations of numbers and codes) |
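
Several of these codes can be tried out quickly with `pd.date_range`. The sketch below uses an arbitrary start date and a handful of codes from the table.

```python
import pandas as pd

start = "2020-01-01"  # a Wednesday

# Month starts: 1 Jan, 1 Feb, 1 Mar
month_starts = pd.date_range(start, periods=3, freq="MS")

# Business days skip the weekend of 4-5 January
business_days = pd.date_range(start, periods=5, freq="B")

# Weekly dates that fall on Sundays
sundays = pd.date_range(start, periods=2, freq="W-SUN")

# A number in front of a code gives multiples of that frequency
every_two_days = pd.date_range(start, periods=3, freq="2D")

for rng in (month_starts, business_days, sundays, every_two_days):
    print(rng)
```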

### Making Quick Time Series Plots

Having managed to put your time series into a dataframe, perhaps converting a column of type string into a column of type datetime in the process, you often just want to see the thing! We can achieve this using the `plot` command, as long as we have a datetime index.


In [None]:
df.plot();