# Lecture 2.4: Time Series Data

[**Lecture slides**](https://docs.google.com/presentation/d/1Q8jO0RezXD3ZrJcZpOEAHSjEYRD-IOt3cSxmLVjSJ6w/edit?usp=sharing)

In this lecture, we are going to learn about pandas' [time series & date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-series-date-functionality) by exploring a Google Trends [dataset](https://www.kaggle.com/GoogleNewsLab/food-searches-on-google-since-2004) of popular food search terms.

**Learning goals:**
- List the main time classes and APIs in pandas
- Set a time index in a DataFrame
- Select dates and date ranges from a time index
- Pivot a stacked table
- Explore time series data with data visualization
- Shift a time series
- Calculate a rolling statistic
- Resample a time series
- Interpolate missing values

## 1. Introduction to Time Series

A time series is a sequence of data points indexed by _time_. In pandas, there are **three** main classes related to time series. According to the [official documentation](https://pandas.pydata.org/docs/user_guide/timeseries.html#overview):

>**Timestamp**: A specific date and time with timezone support  
**Timedelta**: An absolute time duration  
**Period**: A span of time defined by a point in time and its associated frequency  


### 1.1 Timestamp

🕗 The pandas [`Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html) represents a specific point in time.

In [None]:
import pandas as pd

pd.Timestamp('2012-12-21')

Like most other pandas classes, it wraps an efficient NumPy [dtype](https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#basic-datetimes) with useful methods and APIs. For example, it exposes [time components](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components):

In [None]:
dt = pd.Timestamp('2012-12-21')
print(dt.dayofweek)
print(dt.day_name())

The `Timestamp` constructor parses many datetime representations, including Python `datetime` and NumPy `datetime64`:

In [None]:
import datetime
import numpy as np

print(pd.Timestamp(2012, 12, 21))

print(pd.Timestamp(datetime.datetime(2012, 12, 21)))

print(pd.Timestamp(np.datetime64('2012-12-21')))


However, pandas offers the convenient [`pd.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function, which parses almost anything you throw at it. So there is rarely a need to call the `Timestamp` constructor directly!

In [None]:
print(pd.to_datetime('21-12-2012'))
print(pd.to_datetime('2012-12-21'))
print(pd.to_datetime('21st of December 2012'))
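
When a date string is ambiguous (is `01-02-2012` the 1st of February or the 2nd of January?), it can help to spell the format out. The cell below is a small sketch on made-up strings, not part of the dataset. It also uses `errors='coerce'`, which replaces unparseable entries with `NaT` instead of raising an error.

```python
import pandas as pd

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime('21-12-2012', format='%d-%m-%Y')

# errors='coerce' turns entries that cannot be parsed into NaT
messy = pd.Series(['2012-12-21', 'not a date'])
coerced = pd.to_datetime(messy, errors='coerce')

print(parsed)
print(coerced)
```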

### 1.2 Timedelta

⏱ The pandas [`Timedelta`](https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html) represents a duration of time.

In [None]:
pd.Timedelta('42 days 666 hours')

These durations can also be negative:

In [None]:
pd.Timedelta('-1 hr 3 min 3 s 7us')

Just like `Timestamp`, `Timedelta` parses many data types, including Python `timedelta` and NumPy `timedelta64`:

In [None]:
print(pd.Timedelta(days=42, hours=666))
print(pd.Timedelta(datetime.timedelta(days=42, hours=666)))
print(pd.Timedelta(np.timedelta64(1, 'ms')))

`Timedelta`s are particularly useful to carry out arithmetic operations on `Timestamp`s. For example:

In [None]:
day1 = pd.Timestamp('2012-12-21')
print(f'The 21st of December 2012 was a {day1.day_name()}')

day2 = day1 + pd.Timedelta('1 day')
print(f'The day after was a {day2.day_name()}')

td = pd.Timestamp.now() - pd.Timestamp('2012-12-21')
print(f'It has been {td.days} days since the end of the world! 🙀')
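
The `.days` attribute used above is only one view of a `Timedelta`. As a quick sketch, the cell below breaks the `42 days 666 hours` duration from earlier into its parts, and shows that dividing one `Timedelta` by another yields a plain number.

```python
import pandas as pd

td = pd.Timedelta('42 days 666 hours')

# 666 hours roll over into 27 extra days and 18 leftover hours
print(td.days)              # whole days in the duration
print(td.components.hours)  # leftover hours after removing whole days
print(td.total_seconds())   # the full duration expressed in seconds

# Dividing two Timedeltas gives a float: the duration measured in days
ratio = td / pd.Timedelta('1 day')
print(ratio)
```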

### 1.3 Period

⏳ [`Period`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Period.html) represents fixed-frequency intervals.

For example, let's make a period with one hour frequency, starting on the 21st of December 2012:

In [None]:
pd.Period('2012-12-21', freq='h')

You don't always have to specify the frequency. In fact, pandas will infer it from the date format used as argument:

In [None]:
print(repr(pd.Period('2011-01-01')))
print(repr(pd.Period('2011-01')))
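
To make the "span of time" idea more concrete, here is a minimal sketch that inspects the boundaries of a monthly `Period` and compares it with a single `Timestamp`. The variable names are only illustrative.

```python
import pandas as pd

month = pd.Period('2011-01')  # monthly frequency inferred from the string
print(month.start_time)       # first instant covered by the period
print(month.end_time)         # last instant covered by the period

# Adding an integer moves forward by whole periods (here, one month)
next_month = month + 1
print(next_month)

# A Timestamp is a single instant, which may fall inside a Period
instant = pd.Timestamp('2011-01-15')
inside = month.start_time <= instant <= month.end_time
print(inside)
```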

Periods aren't very useful on their own, but shine when used as a time series _index_. More on this in [this section](#2.-Time-Indexing).

## 2. Time Indexing

The `Timestamp`, `Timedelta`, and `Period` classes are quite fun for manipulating dates and time intervals, but this is a lecture about _time series_, i.e. sequences of time data. For this, we need to create time indices.

There are three main types of time indices, one for each of the time classes:
- `DatetimeIndex` is a sequence of `Timestamp`
- `TimedeltaIndex` is a sequence of `Timedelta`
- `PeriodIndex` is a sequence of `Period`

(More details in the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#overview))

Let's create a `DatetimeIndex`:

In [None]:
pd.DatetimeIndex(['2020-01-01', '2020-01-04', '2020-01-05', '2024-03-09']) 

As always, pandas offers a more convenient way to construct `DatetimeIndex`. Use `pd.date_range()` to create regularly spaced sequences of `Timestamp`s.

In [None]:
pd.date_range('1969-07-20', periods=4, freq='h')

We can also do the same with `TimedeltaIndex` and `PeriodIndex`:

In [None]:
pd.timedelta_range(0, periods=4, freq='h')

In [None]:
pd.period_range('1969-07-20', periods=4, freq='h')

🧠 Notice that the `period_range()` call looks exactly the same as the `date_range()` call... Can you explain the difference between the two types of indices they create?

We've created time indices... Now let's use them in a `Series`!


In [None]:
index = pd.date_range('2000-01-01', periods=60, freq='D')
ts = pd.Series(np.random.randn(len(index)), index=index)
ts.head()

Notice how the series is "aware" of its index frequency, `D` (one day).

Since our time series is still a pandas `Series`, we can use all the indexing tricks learnt in lecture 2.2:

In [None]:
# selecting by index label
ts['2000-01-05']

In [None]:
# selecting by index position (use .iloc: plain ts[4] is deprecated for positional access)
ts.iloc[4]

In [None]:
# list slicing 3rd element through to 5th element
ts[2:5]

In [None]:
# list slicing elements in steps of 4
ts[::4]

 Having a time index opens up many more possibilities with data selection. For example, we can...

In [None]:
# selecting by index label with datetime object
ts[datetime.datetime(2000, 1, 5)]

In [None]:
# selecting by index label with parsed datetime string
ts['01/05/2000']

In [None]:
# selecting by slice of parsed datetime string
ts['3rd of January 2000':'5th of January 2000']

In [None]:
# selecting by range of parsed datetime string
ts['February 2000']
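
Two details of string-based selection are easy to miss: slices include _both_ endpoints, and a partial date such as a month selects every row inside it. The sketch below checks this on a toy series with predictable values (the `demo` name is just for illustration).

```python
import numpy as np
import pandas as pd

# 60 daily values starting on the 1st of January 2000
index = pd.date_range('2000-01-01', periods=60, freq='D')
demo = pd.Series(np.arange(60), index=index)

# Unlike positional slicing, label slicing keeps the end point
window = demo.loc['2000-01-03':'2000-01-05']

# A partial string selects the whole month
february = demo.loc['2000-02']

print(len(window))
print(len(february))
```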

And, of course, all of this magic also applies to dataframes:

In [None]:
dft = pd.DataFrame(np.random.randn(666, 2), columns=['SAD', 'PEPE'], index=pd.date_range('20121221', periods=666, freq='D'))
dft.head()

In [None]:
dft.loc['10th March 2013':'20th March 2013', :].plot.bar()

🧠 Take the time to understand what's happening in the two cells above. What's used as arguments to the `DataFrame` constructor? What is being selected with the `.loc[]` operator?


## 3. Time Series Data Exploration

Now that we've conquered the basics of `TimeSeries`, let's put our skills to practice on some real data!

The `food_searches.csv` dataset tracks the [Google Trends](https://trends.google.com/trends) popularity of various foods and drinks from 2004 through 2016. Please bear in mind that most of this data is US-centric, so don't draw global conclusions on the trends just yet 🙃.

Just like in lecture 2.2, let's start with some summary statistics to get insight into the values:

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('food_searches.csv')
df.head()

In [None]:
df.describe(include='all')

The `id` column holds the `string` name of the food/drink, `googleTopic` seems to be a Google-specific ID, `week_id` is a date period, and `value` is the popularity of the search term. `value` has `min=0` and `max=100`, so we can expect a normalised value expressed as a percentage.

Notice how the `DataFrame` index is _not_ a time index:



In [None]:
df.index

This means we need to manually set the time index. However, `week_id` isn't a time data type...

In [None]:
df.info()

We don't want to end up with an `object` index! Then we won't be able to do all that fancy time indexing. So first things first: let's convert `week_id` to a datetime `dtype`. Remember how pandas makes date conversion easy with `pd.to_datetime()`? Well, this `week_id` is particularly strange, so we've got to help the parser out a little with the `format` argument. Specifying date string formats is a very common occurrence; a list of the symbols can be found in the Python [documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

ℹ️ Weird formats happen all the time with real-world data, and that's partly why pandas exists! Knowing how to deal with these scenarios is a crucial data science skill.

In [None]:
# First add the day of the week to the `week_id` string
df['datetime'] = df['week_id'] + '-1'
# Then indicate the date string format
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%W-%w')
df.info()
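
To see what the `'%Y-%W-%w'` format does in isolation, here is a minimal sketch on two made-up `week_id` strings (assumed to follow the same year-week pattern as the dataset). `%W` is the week number of the year with Monday as the first day of the week, and `%w` is the weekday, where `1` means Monday.

```python
import pandas as pd

# Made-up samples in 'year-week' form, not read from the CSV
week_ids = pd.Series(['2004-01', '2011-27'])

# Appending '-1' pins each week to its Monday
mondays = pd.to_datetime(week_ids + '-1', format='%Y-%W-%w')
print(mondays)
print(mondays.dt.day_name())
```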

Now that we have a `datetime64` column, we simply have to set it as our `DataFrame` index:

In [None]:
df = df.set_index('datetime')
df.loc['4th of July 2011']

Great! Now it will be much easier to explore the dataset with our shiny new time index. One thing still feels weird though... Ideally, we'd like to visualise and compare the trends of different foods and drinks. But in our `DataFrame`, those values are ["stacked"](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html) and differentiated using the column _value_ `id`. In previous lectures, when looking at different _features_ of a dataset, we had those separated into _columns_, not groups of _rows_. And that made it easy to select, update, calculate, and plot those _features_.

But fear not! Once again, pandas is here to rescue us. This is a common manipulation called "reshaping" or "pivoting" a table (more details in the [official documentation](https://pandas.pydata.org/docs/user_guide/reshaping.html)). We can disentangle this mess by using [`.pivot()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html):

In [None]:
df = df.pivot(columns='id', values='value')
df.head()
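
If the reshaping is hard to picture, the toy example below may help. It is a sketch on four invented rows that mimic the stacked layout: each distinct `id` becomes its own column, and the time index is kept as the row label.

```python
import pandas as pd

# Four invented rows in the same "stacked" layout as the dataset
stacked = pd.DataFrame({
    'datetime': pd.to_datetime(['2004-01-05', '2004-01-05',
                                '2004-01-12', '2004-01-12']),
    'id': ['apple', 'donut', 'apple', 'donut'],
    'value': [10, 20, 30, 40],
}).set_index('datetime')

# One column per food, one row per week
wide = stacked.pivot(columns='id', values='value')
print(wide)
```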

Much better! 😌 We can see some pesky `NaN`s though... they always sneak in our datasets! Just like in lecture 2.2, we want to get rid of them. They might ruin our beautiful plots! 🎨 

However, we don't want to get rid of an entire _week_ of data if one food's popularity value is missing. Instead, we'd like to get rid of the food _column_ if it contains any `NaN`. We can do this with the `axis` argument:

In [None]:
df = df.dropna(axis=1)
df.head()
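
The effect of the `axis` argument is easier to see on a tiny table. This sketch uses invented numbers; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'apple': [10.0, 30.0], 'apple-ru': [np.nan, 5.0]})

by_rows = toy.dropna()           # default axis=0: drop rows containing a NaN
by_columns = toy.dropna(axis=1)  # axis=1: drop columns containing a NaN

print(by_rows)
print(by_columns)
```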

Looks like `apple-ru` is gone! 🇷🇺 That's okay, we still have 185 trends to analyse... Speaking of 🍎, let's visualise the popularity of `apple` searches on Google between 2004 and 2016:

In [None]:
df['apple'].plot.area()

That looks highly seasonal! [Seasonality](https://en.wikipedia.org/wiki/Seasonality) is the presence of regular variations in data. In this case, we expect that the seasonality of apples follows... seasons! It's not easy to see which months are the most red delicious from the graph. Trying to gauge seasonality is a common challenge with time series data. 

The solution is **seasonal plots**. Let's make one by splitting this graph by year, and plotting the yearly trends individually. We'll put the code in a function so we can explore the seasonality of many foods easily:

In [None]:
# new `year` column
df['year'] = df.index.year
# new `week` column (ISO week number; `DatetimeIndex.week` was removed in pandas 2.0)
df['week'] = df.index.isocalendar().week

def plot_seasonal(df, food):
    # pivot the dataframe so years are columns
    pivot_df = df.pivot(index='week', columns='year', values=food)
    # plot one line per year
    pivot_df.plot.line(alpha=0.6, legend=False)

In [None]:
plot_seasonal(df, 'chocolate')

🍫 Chocolate sure is popular around the holiday season! Why do you think there is a peak in mid-February? 💝

Let's explore the trends of a food slightly less fit for special occasions: donuts 🍩

In [None]:
plot_seasonal(df, 'donut')

As expected, donuts are less festive than chocolate. I wonder what caused those peaks! Feel free to explore the data yourself 🤠

Those differences in trends are interesting, but we'd like to compare them more closely. We have a suspicion that some alcoholic drinks have very different seasonal profiles... For example, `champagne` is usually popping around New Year's Eve, whilst `mojito` is a classic summer cocktail.

**Average seasonal plots** allow us to compare the seasonality of two variables. Let's group the values by week, and plot the average values. The result is a graph of the average popularity of the drinks for each week. For example, the value of `champagne` at `week=20` is simply the mean of all the `champagne` values in `df` that fall on the 20th week of the year.

In [None]:
df.groupby('week').mean().plot.line(y=['champagne', 'mojito'])

🍾 Our instincts are confirmed: the New Year is a better time for bubbles than for rum!

Sometimes, we wish to track the cumulative trends of some variables as well as their relative trends. For example, we might be interested in learning the popularity of _beverages_, as well as the breakdown into `tea`, `coffee`, and a hip newcomer `cold-brew-coffee`. 

**Area charts** allow us to stack lines and observe the cumulative result. We're interested in long-term trends, so we'll use the yearly averages to remove some of the distracting seasonal variations we observed earlier.

In [None]:
df.groupby(df.index.year).mean()[['tea', 'coffee', 'cold-brew-coffee']].plot.area()

There are three key insights from this graph:
- beverages are on the rise since 2004
- coffee got a boost around 2010
- cold brews really took off around 2014

ℹ️ This high "density" of information is key in effective data visualization, as explained in the legendary book [The Visual Display of Quantitative Information](https://www.goodreads.com/book/show/17744.The_Visual_Display_of_Quantitative_Information). More on this in lecture 2.6!

💪 Create an area plot of the popularity of `ice-cream` from January to July 2006.

In [None]:
# INSERT YOUR CODE HERE

🧠 Notice how the seasonality of your graph isn't obvious because of the chosen time window. Can you think of ways to avoid missing this detail in a time series analysis?

## 4. Shifting

The broccoli search expert at Google made a calendar mistake... All his data is late by 8 weeks! We don't want to manually calculate a new column and reindex our `DataFrame`. We _certainly_ don't want to manually update the values. Instead, we can use the [`.shift()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html) method:

In [None]:
# shift values by 8 weeks and save in new column
df['apple_shift'] = df.shift(periods=-8*7, freq='D')['apple']
# plot old and shifted data
df.loc['2013', ['apple', 'apple_shift']].plot.line()

Phew, broccoli tragedy averted. 🥦 Because we passed a `freq` argument, `.shift()` moved the _index_ by 8 weeks and left the values untouched; without `freq`, it moves the _values_ along a fixed index instead. Older pandas versions also exposed index shifting as a separate `.tshift()` method, but it was deprecated in pandas 1.1 and removed in 2.0, so we stick with `.shift()` and `freq`:

In [None]:
df.shift(-1000, freq='W')['broccoli'].plot.line()

That's some old cabbage! More details on shifting values versus shifting the index in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html#Time-shifts).
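
The difference between moving the values and moving the index is easiest to see on a toy series. The sketch below uses four invented weekly values: calling `.shift()` without `freq` keeps the index and pushes the values down, while passing `freq` keeps the values and moves the index.

```python
import pandas as pd

# Four invented weekly observations, one per Monday
idx = pd.date_range('2020-01-06', periods=4, freq='W-MON')
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# No freq: same index, values move down one row, a NaN appears at the top
moved_values = s.shift(1)

# With freq: same values, every index label moves 7 days later
moved_index = s.shift(1, freq='7D')

print(moved_values)
print(moved_index)
```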

## 5. Windowing

🦐 We hear word that shrimps were highly fashionable around 2010, so we'd like to pinpoint this "shrimp chic" theory. 

In [None]:
df['shrimp'].plot.line()

There's a definite shellfish bump around 2009, but the [variance](https://en.wikipedia.org/wiki/Variance) of the popularity is muddling the graph. We could "smooth out" the line by plotting its moving average. These [rolling statistics](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html#Rolling-windows) can be calculated with pandas' [`.rolling()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html#pandas.Series.rolling) method. The API is similar to [`.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) (from lecture 2.3). The method returns a `Rolling` object, on which we must apply an _aggregation function_. In our case, we want a rolling _average_, so we'll use the [`.mean()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.mean.html) method:

In [None]:
# calculate rolling average of window size 12
df['shrimp_av'] = df['shrimp'].rolling(12).mean()
# plot at two different scales
df.loc[:, ['shrimp', 'shrimp_av']].plot.line()
df.loc['2013', ['shrimp', 'shrimp_av']].plot.line()

📈 Each point of `shrimp_av` is the average popularity over a 12-week window (the current week and the 11 weeks before it), which allows us to focus on longer-term trends.
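
A rolling window can be checked by hand on a short series. In this sketch (invented numbers, window of 3), the first two results are `NaN` because the window is not full yet; the optional `min_periods` argument relaxes that requirement.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each result averages the current value and the 2 values before it
full = s.rolling(3).mean()

# min_periods=1 also averages incomplete windows at the start
partial = s.rolling(3, min_periods=1).mean()

print(full)
print(partial)
```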

💪 Create a line graph of the rolling standard deviation of the popularity of `coconut` 🥥 with a window size of 10, for the year 2012.

In [None]:
# INSERT YOUR CODE HERE

🧠 Can you explain what these peaks and troughs represent?

## 6. Resampling

It is common to encounter datasets with irregular time indices, e.g. a `DatetimeIndex` where the `Timestamp`s are not regularly spaced. This can be hard to work with for certain downstream tasks, such as time series prediction. One solution is to _resample_ the data to a regular interval. All the old values found in a new interval need to be combined using an _aggregation function_. In this sense, resampling is similar to [windowing](#5.-Windowing), except the aggregation is done on fixed frequency intervals instead of a sliding window.

In pandas, resampling is done with [`.resample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html). For example, we can resample our google trends to a monthly frequency:


In [None]:
df.resample('M').mean().head()

The time index now steps in months, and not in weeks. All the values in this new `DataFrame` were _averaged_. More information on resampling and converting frequencies in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html#Resampling-and-converting-frequencies).
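
Our Google Trends index is already regular, so here is a sketch of the irregular case on three invented timestamps. Resampling to a daily frequency averages the two readings that share a day, and leaves a `NaN` for the day without any reading.

```python
import pandas as pd

# Three invented readings at irregular times
idx = pd.to_datetime(['2020-01-01 01:00', '2020-01-01 13:00', '2020-01-03 08:00'])
s = pd.Series([1.0, 3.0, 10.0], index=idx)

# One row per calendar day, aggregated with the mean
daily = s.resample('D').mean()
print(daily)
```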

## 7. Interpolation


In [None]:
def forget_data(df, food):
    np.random.seed(1337)
    missing_indices = np.random.randint(low=0, high=len(df), size=42)
    df.iloc[missing_indices, df.columns.get_loc(food)] = None
    
forget_data(df, 'long-island-iced-tea')

Oh no, someone had too many `long-island-iced-tea`s 🍹, and forgot some of the data...

In [None]:
df.loc['2012', 'long-island-iced-tea'].plot.line()

Our beautiful graph! 😭 How do we repair such a mess? Instead of throwing away everything because of a few forgotten values, we can try to _interpolate_ the missing data. [Interpolation](https://en.wikipedia.org/wiki/Linear_interpolation) is "guessing" what the missing values are, based on their neighbours. Linear interpolation is the most common kind.

💄 Let's make our graph pretty again with the [`.interpolate()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html) method:

In [None]:
df['long-island-iced-tea'] = df['long-island-iced-tea'].interpolate()
df.loc['2012', 'long-island-iced-tea'].plot.line()

Data repaired, hangover prevented. ⛑ Please note that the values are _not_ the same as before and there is no way to magically recover lost data. However, interpolation can enable a larger scale analysis that would otherwise fail, or act as a "better than nothing" solution. There are many interpolation methods, check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html) for more details!
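
As one example of an alternative method, the sketch below compares the default linear interpolation with `method='time'` on an invented series whose timestamps are unevenly spaced. The default treats the rows as equally spaced, whereas `method='time'` weights the guess by the actual time gaps.

```python
import numpy as np
import pandas as pd

# The missing value sits 1 day after the first point and 2 days before the last
idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04'])
s = pd.Series([0.0, np.nan, 3.0], index=idx)

linear = s.interpolate()                # ignores the index: halfway between 0 and 3
by_time = s.interpolate(method='time')  # uses the index: one third of the way

print(linear)
print(by_time)
```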

## 8. Summary

Today, we went on a tour of pandas' [time series & date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-series-date-functionality). We learned about the **`Timestamp`**, **`Timedelta`**, and **`Period`** classes, and their associated time index classes. Then, we selected and manipulated a `DataFrame` using **time-based indexing**. We loaded a Google trends [dataset](https://www.kaggle.com/GoogleNewsLab/food-searches-on-google-since-2004) of popular food search terms, and **pivoted** the table to access our ordered time index. We explored the data, and visualized some of its time-specific aspects, such as **seasonality** and **comparative trends**. We also learned time data cleaning transformations, such as **index shifting**, **rolling statistics**, **resampling**, and **interpolation**. Overall, we discovered the main techniques for time series data exploration, tested them on a real dataset, and got insights into some dietary trends.

# Resources
## Core Resources

- [**Slides**](https://docs.google.com/presentation/d/1Q8jO0RezXD3ZrJcZpOEAHSjEYRD-IOt3cSxmLVjSJ6w/edit?usp=sharing)
- [Kaggle dataset of popular food searches](https://www.kaggle.com/GoogleNewsLab/food-searches-on-google-since-2004)
- [Pandas time series & date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
- [Python Data Science Handbook - Time Series](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html)

## Additional Resources

- [Time formatting](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)
- [Exploring and visualizing time series](https://uc-r.github.io/ts_exploration)
- [Times series analysis with pandas](https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/)
- [Times series visualization with python](https://machinelearningmastery.com/time-series-data-visualization-with-python/)
- [7 types of temporal visualizations](https://humansofdata.atlan.com/2016/11/visualizing-time-series-data/)
- [11 stunning time series graphs](https://medium.com/@plotlygraphs/time-series-graphs-eleven-stunning-ways-you-can-use-them-cd1c1bcfe749)
- [Reshaping and pivot tables](https://pandas.pydata.org/docs/user_guide/reshaping.html)
- [Analysing time series data in pandas](https://towardsdatascience.com/analyzing-time-series-data-in-pandas-be3887fdd621)
