
# Chapter 8 Time Series Data

A _time series_ is simply a series of measurements indexed by time. In this chapter, we discuss strategies for analyzing and visualizing time series data.

# 8.1 Working with Time Series Data

In this lesson, we will work with [a data set from the National Oceanic & Atmospheric Administration (NOAA)](http://www.esrl.noaa.gov/gmd/ccgg/trends/) consisting of weekly measurements of atmospheric carbon dioxide ($\text{CO}_2$) at the Mauna Loa Observatory in Hawaii, dating back to 1974. Since atmospheric $\text{CO}_2$ is considered to be one of the primary drivers of climate change, this time series is highly relevant to climate policy.

First, let's read in the data set.

In [0]:
import pandas as pd
data_dir = "https://dlsun.github.io/pods/data/"
df_co2 = pd.read_csv(data_dir + "mauna_loa_co2_weekly.csv")
df_co2

By default, the dates are stored as strings. In order to make `pandas` recognize them as dates, we call `pd.to_datetime`.

In [0]:
pd.to_datetime(df_co2["date"])

Notice that the `dtype` of this `Series` is `datetime64[ns]`, which is a special type for storing dates and times. 

In this particular example, `pandas` was able to automatically infer the correct formatting of the dates; however, we can also specify the format explicitly, using the [standard format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

In [0]:
pd.to_datetime(df_co2["date"], format="%Y-%m-%d")
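An explicit format matters most when the date strings are ambiguous. The sketch below uses made-up day-first strings (they are not from the NOAA data) to show how the format codes pin down which field is the day and which is the month.

```python
import pandas as pd

# Hypothetical day/month/year strings, for illustration only.
raw = pd.Series(["03/01/1975", "10/01/1975", "17/01/1975"])

# %d = day of month, %m = month, %Y = four-digit year
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed.dt.month.tolist())
print(parsed.dt.day.tolist())
```

With this format string, every date lands in January; without it, a string like `03/01/1975` could just as plausibly be read as March 1.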

It makes sense to make the date the index of this `DataFrame`.

In [0]:
df_co2.index = pd.to_datetime(df_co2["date"], format="%Y-%m-%d")
df_co2

Another way to achieve (essentially) the same result is to read in the **date** column as the index and to specify that the values should be parsed as dates.

In [0]:
pd.read_csv(data_dir + "mauna_loa_co2_weekly.csv",
            index_col="date",
            parse_dates=True)
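The "essentially" above hides one small difference, which the following sketch tries to make visible. It reads a two-row stand-in for the real file (the values are invented) from an in-memory string, so it runs without a network connection.

```python
import io
import pandas as pd

# A tiny stand-in for the real file; the values are made up.
csv_text = "date,ppm\n1974-05-19,333.3\n1974-05-26,332.9\n"

df = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# The index is a DatetimeIndex, and "date" is no longer a regular column.
print(type(df.index).__name__)
print(list(df.columns))
```

When the index is assigned by hand, as in the previous cell, the original **date** column of strings stays in the `DataFrame`; with `index_col="date"`, it is consumed to build the index.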

## Visualizing Time Series

Time series are typically plotted as a line, with time on the $x$-axis and the measurement on the $y$-axis. Since our `DataFrame` is already indexed by time, we can simply select the variable we want to plot (**ppm**, the concentration of $\text{CO}_2$ in parts per million) and call `.plot.line()`.

In [0]:
df_co2["ppm"].plot.line()

Oops! It seems that there are some missing values in the data that are coded as $-999.99$.

In [0]:
df_co2[df_co2["ppm"] < 0]["ppm"]

Let's replace these values with `NaN`s and recreate the plot.

In [0]:
import numpy as np
df_co2 = df_co2.replace(-999.99, np.nan)
ppm = df_co2["ppm"]
ppm.plot.line()

The upward trend in this graph has been a cause of great concern.
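The sentinel replacement used above can be sketched on a toy `Series`. The readings below are invented; the point is only that once the placeholder value becomes `NaN`, `pandas` skips it in summaries instead of treating it as a real measurement.

```python
import numpy as np
import pandas as pd

# Made-up readings with one sentinel value standing in for a missing week.
toy = pd.Series([330.0, -999.99, 332.0])

cleaned = toy.replace(-999.99, np.nan)

print(toy.mean())      # distorted by the sentinel
print(cleaned.mean())  # NaN is skipped by default
```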

Let's take a closer look at how `pandas` handled the missing values that we replaced with `NaN`. We saw above that measurements were missing for all 4 weeks in December 1975. Let's zoom in on this region by restricting to dates before February 1976. We can use comparison operators (`<`, `>`, `==`, etc.) to compare dates, as long as we compare dates with dates, creating `datetime` objects as necessary.

In [0]:
from datetime import datetime

(ppm[ppm.index < datetime(1976, 2, 1)].
 plot.line(style="o-"))
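The date comparison above can be written in a few equivalent ways. The sketch below uses a toy weekly series with placeholder values (not the real measurements) and compares a `datetime` mask, a `pd.Timestamp` mask, and label-based slicing with a date string.

```python
from datetime import datetime
import pandas as pd

# Weekly toy series; the values are placeholders, not real measurements.
idx = pd.date_range("1975-11-02", periods=16, freq="7D")
toy = pd.Series(range(16), index=idx)

# Boolean mask built by comparing the index with a datetime object.
by_mask = toy[toy.index < datetime(1976, 2, 1)]

# The same mask using a pandas Timestamp parsed from a string.
by_timestamp = toy[toy.index < pd.Timestamp("1976-02-01")]

# Label-based slicing on a sorted DatetimeIndex; the endpoint is included.
by_label = toy.loc[:"1976-01-31"]

print(len(by_mask), len(by_timestamp), len(by_label))
```

All three select the same rows here. Note that label slicing includes its endpoint, which is why the last version stops at January 31 rather than February 1.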

Notice how `pandas` left the appropriate space between the measurement on 1975-11-30 and the next available measurement on 1976-01-04. This is one advantage of casting dates to `datetime`s, rather than simply leaving them as strings. If we had instead made a line plot using the **date** column (which stores the dates as strings), then the values would still have been plotted in the right order, but the points would be uniformly spaced, instead of spaced according to how far apart they are in time.

In [0]:
df_co2_no_na = df_co2.dropna()

(df_co2_no_na[df_co2_no_na.index < datetime(1976, 2, 1)].
 plot.line(x="date", y="ppm", style="o-"))

## Changing the Sampling Frequency

From the graph, there is a clear seasonal pattern in $\text{CO}_2$ levels. The levels increase in the winter (in the northern hemisphere), peaking around May of each year, and then decline in the summer. Plants are responsible for this seasonal pattern. In the summer months, plants absorb $\text{CO}_2$ from the atmosphere as part of photosynthesis, in order to grow flowers and leaves. In the winter months, these leaves fall to the ground, where they are broken down by microbes that emit $\text{CO}_2$ in the process. 

However, these seasonal fluctuations are dwarfed by the overall increasing trend, which is thought to be caused by human activities. To see the overall trend more clearly, we can calculate the yearly average, thereby smoothing over the seasonal fluctuations. Although this can be done manually, `pandas` provides a convenience method, `.resample()`, that changes the sampling frequency of the time series. The `.resample()` method works like `.groupby()`; you have to specify a column and an aggregation function. In the code below, we average the **ppm** in each year to obtain a time series with one observation per year.

In [0]:
# Note: pandas 2.2 deprecates the "Y" alias in favor of "YE" (year end).
ppm_1y = df_co2.resample("1Y")["ppm"].mean()
ppm_1y.plot.line()
ppm_1y
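To make the analogy with `.groupby()` concrete, here is a minimal sketch on two years of placeholder weekly values. It uses the `"YS"` alias (year start), which is a choice made for this example and differs from the cell above only in how each yearly bin is labeled.

```python
import pandas as pd

# Two years of weekly placeholder values on a DatetimeIndex.
idx = pd.date_range("2000-01-02", periods=104, freq="7D")
toy = pd.Series(range(104), index=idx, dtype=float)

# Resample into calendar-year bins, labeled by the start of each year.
yearly = toy.resample("YS").mean()

# The same averages via an explicit groupby on the year of each date.
by_year = toy.groupby(toy.index.year).mean()

print(yearly)
print(by_year)
```

The two results hold the same averages; the difference is that `.resample()` keeps a `DatetimeIndex`, so the output is still a time series, while the `.groupby()` version is indexed by plain integers.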

## Lags and Differences

Another way to remove the effect of seasonality is to take differences. If we take each measurement and subtract the measurement from a year earlier, then any seasonal effect should cancel out, since we are comparing winter measurements to winter measurements and summer measurements to summer measurements.

The easiest way to take this difference is to shift (or _lag_) all of the values in the `DataFrame`. Since each row in our `DataFrame` represents 1 week, we should shift the `DataFrame` by 52 rows so that every value is lined up with its value 1 year ago.

In [0]:
df_co2_lag = df_co2.shift(52)
df_co2_lag

By comparing the **date** index (which was not shifted) to the **date** column (which was shifted), we see that the values in the `DataFrame` correspond to approximately 1 year earlier than the date in the index. Therefore, if we subtract these lagged **ppm** values from the original **ppm** values, we obtain a `Series` of one-year changes in **ppm**.

In [0]:
diffs = df_co2["ppm"] - df_co2_lag["ppm"]
diffs.plot.line()
diffs
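The shift-and-subtract step can be checked on a small example. The values below are invented: a steady rise of 1 per step with a wiggle that repeats every 2 steps, so a lag of 2 plays the role that 52 plays for the weekly data. The sketch also shows `.diff()`, which performs the same computation in one call.

```python
import pandas as pd

# Placeholder values: steady rise plus a wiggle that repeats every 2 steps.
toy = pd.Series([10.0, 15.0, 12.0, 17.0, 14.0, 19.0])

lagged = toy.shift(2)    # each value moves down 2 rows; the first 2 become NaN
manual = toy - lagged    # change relative to 2 steps earlier

builtin = toy.diff(2)    # shift-and-subtract in one call

print(manual.tolist())
```

The first two differences are `NaN` because there is nothing to subtract; the same happens for the first 52 rows of `diffs`, and for any row where either measurement is missing.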

The seasonality is gone, but so is the trend. The fact that these differences hover around 2 tells us that **ppm** has been increasing by about 2 ppm per year.

In accounting, these types of metrics are called "Year-over-Year" (YoY) metrics. They are valuable precisely because they eliminate the seasonal effects that are common in many industries. For example, many retailers see an increase in sales during the holidays. To measure growth, we should measure the increase relative to the previous year.

## Exercises

Exercises 1-2 ask you to work with the Austin weather data set (https://dlsun.github.io/pods/data/austin_weather_2019.csv), which contains hourly measurements of the weather in Austin, TX in 2019. This data set was collected from the [NOAA](https://www.ncdc.noaa.gov/crn/qcdatasets.html). See the [data documentation](https://www1.ncdc.noaa.gov/pub/data/uscrn/products/hourly02/README.txt) for more information.

1\. Read in the data set. Plot the hourly temperature (**T_HR_AVG**) time series as a function of the local date and time (**LST_DATE**, **LST_TIME**).

2\. The hourly temperature plot is extremely noisy. Plot the daily average temperature and weekly average temperature to get a better sense of the climate in Austin, TX.