# Chapter 11 - Time Series

## 11.2 - Time Series Basics

In [1]:
from datetime import datetime as dt
from dateutil.parser import parse
import pandas as pd
import numpy as np

The basic time series object in `pandas` is a `Series` indexed by timestamps.

In [2]:
qs1 = []
# Build a list of datetimes: the first day of each quarter of 2019.
for i in range(1, 13, 3):
    t = dt(2019, i, 1)
    qs1.append(t)
display(qs1)

# Use the time series as the index to a Series
s1 = pd.Series([10,20,30,40], index=qs1)
display(s1)

[datetime.datetime(2019, 1, 1, 0, 0),
 datetime.datetime(2019, 4, 1, 0, 0),
 datetime.datetime(2019, 7, 1, 0, 0),
 datetime.datetime(2019, 10, 1, 0, 0)]

2019-01-01    10
2019-04-01    20
2019-07-01    30
2019-10-01    40
dtype: int64

Under the hood, the `datetime` objects are stored in a `DatetimeIndex`.

In [3]:
display(s1.index)
print(type(s1.index))

DatetimeIndex(['2019-01-01', '2019-04-01', '2019-07-01', '2019-10-01'], dtype='datetime64[ns]', freq=None)

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
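As a small illustration of what the index holds, the sketch below pulls a single element out of a `DatetimeIndex`. The names `dates`, `s`, and `first` are made up for this example. Each element comes back as a pandas `Timestamp`, which is a subclass of the standard-library `datetime`.

```python
from datetime import datetime
import pandas as pd

# Two plain datetime objects used as a Series index (illustrative data)
dates = [datetime(2019, 1, 1), datetime(2019, 4, 1)]
s = pd.Series([10, 20], index=dates)

# Scalars taken from a DatetimeIndex are pandas Timestamp objects
first = s.index[0]
print(type(first))

# Timestamp subclasses datetime, so it can be used wherever a datetime is expected
print(isinstance(first, datetime))
```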


As with any other `Series`, arithmetic between two time series is aligned on the index, which here means on the dates.

In [4]:
# Instantiate a new Series with Q1 - Q3 dates
qs2 = [parse('2019-01-01'), parse('2019-04-01'), parse('2019-07-01')]
s2 = pd.Series([25,45,65], index=qs2)
display(s2)

# Add s1 and s2 Series objects
display(s1 + s2) # s2 has no value for 2019-10-01, so the sum there is NaN

2019-01-01    25
2019-04-01    45
2019-07-01    65
dtype: int64

2019-01-01    35.0
2019-04-01    65.0
2019-07-01    95.0
2019-10-01     NaN
dtype: float64
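If the `NaN` for the unmatched date is not wanted, one option is `Series.add` with `fill_value`, which treats a missing entry on either side as the given value. This is a sketch only; the series are rebuilt here with `pd.to_datetime` so the block runs on its own, and `total` is a name chosen for the example.

```python
import pandas as pd

# Rebuild the two quarterly series from above
s1 = pd.Series([10, 20, 30, 40],
               index=pd.to_datetime(['2019-01-01', '2019-04-01',
                                     '2019-07-01', '2019-10-01']))
s2 = pd.Series([25, 45, 65],
               index=pd.to_datetime(['2019-01-01', '2019-04-01',
                                     '2019-07-01']))

# Treat the date missing from s2 as 0 instead of propagating NaN
total = s1.add(s2, fill_value=0)
print(total)
```

Whether filling with 0 is appropriate depends on the data; for measurements where a missing quarter means "unknown", keeping the `NaN` is usually the safer choice.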

### Indexing, Selection, Subsetting

You can look up a value by its date. The date can be written in any string format that `pandas` is able to parse.

In [5]:
display(s2)
print(s2['20190101'])
print(s2['07/01/2019'])

2019-01-01    25
2019-04-01    45
2019-07-01    65
dtype: int64

25
65
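To see why several spellings of the same date reach the same row, the sketch below parses three strings with `pd.Timestamp` and compares them. Note that a slash-separated string such as `'07/01/2019'` is read month-first by default, so it means 1 July, not 7 January. The names `a`, `b`, and `c` are only for this example.

```python
import pandas as pd

s2 = pd.Series([25, 45, 65],
               index=pd.to_datetime(['2019-01-01', '2019-04-01', '2019-07-01']))

# Three spellings of the same calendar day
a = pd.Timestamp('20190701')
b = pd.Timestamp('07/01/2019')   # parsed month-first: 1 July 2019
c = pd.Timestamp('2019-07-01')
print(a == b == c)

# An explicit Timestamp works as a label as well
print(s2.loc[c])
```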


Use `pd.date_range` to automatically generate dates with a start date and number of periods.

In [6]:
d1 = pd.Series(range(0,20), pd.date_range('2018/07/01', periods=20))
display(d1.head(10))

2018-07-01    0
2018-07-02    1
2018-07-03    2
2018-07-04    3
2018-07-05    4
2018-07-06    5
2018-07-07    6
2018-07-08    7
2018-07-09    8
2018-07-10    9
Freq: D, dtype: int64

In [7]:
d2 = pd.Series(range(0,400), pd.date_range('20190701', periods=400))
display(d2.head(5))
print('...')
display(d2.tail(3))

2019-07-01    0
2019-07-02    1
2019-07-03    2
2019-07-04    3
2019-07-05    4
Freq: D, dtype: int64

...


2020-08-01    397
2020-08-02    398
2020-08-03    399
Freq: D, dtype: int64
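`pd.date_range` also accepts an `end` date in place of `periods`, and a `freq` argument for steps other than one day. As a sketch, the quarterly dates built by hand at the start of this section could be generated with the quarter-start frequency `'QS'`; the variable names below are illustrative.

```python
import pandas as pd

# First day of each quarter of 2019 ('QS' = quarter start)
quarters = pd.date_range('2019-01-01', periods=4, freq='QS')
print(quarters)

# A daily range defined by start and end dates (both included)
july = pd.date_range(start='2019-07-01', end='2019-07-05')
print(july)
```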

Now, filtering can be performed on the `DatetimeIndex`.

In [8]:
display(d2['2019'].iloc[:5]) # Filter by year

2019-07-01    0
2019-07-02    1
2019-07-03    2
2019-07-04    3
2019-07-05    4
Freq: D, dtype: int64

In [9]:
display(d2['2019-09'].iloc[:5]) # Filter by year & month

2019-09-01    62
2019-09-02    63
2019-09-03    64
2019-09-04    65
2019-09-05    66
Freq: D, dtype: int64

In [10]:
display(d2[dt(2019,10, 3):].iloc[:5]) # Filter by datetime

2019-10-03    94
2019-10-04    95
2019-10-05    96
2019-10-06    97
2019-10-07    98
Freq: D, dtype: int64
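The filters above can also be written with `.loc`, and a slice can carry both a start and an end date. Unlike positional slicing, label-based date slices include the end point. A minimal sketch, with `window` and `september` as example names:

```python
import pandas as pd

d2 = pd.Series(range(400), index=pd.date_range('2019-07-01', periods=400))

# Both end points of a date slice are included
window = d2.loc['2019-10-03':'2019-10-05']
print(window)

# A year-month string selects the whole month
september = d2.loc['2019-09']
print(len(september))
```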

Filtering can also be done with the `truncate` method, which drops everything before or after a given date.

In [11]:
# Truncate (remove) all values before 1 Aug 2019
d2.truncate(before='2019-08-01').iloc[:10]

2019-08-01    31
2019-08-02    32
2019-08-03    33
2019-08-04    34
2019-08-05    35
2019-08-06    36
2019-08-07    37
2019-08-08    38
2019-08-09    39
2019-08-10    40
Freq: D, dtype: int64

In [12]:
# Truncate (remove) all values after 5 Jul 2019
d2.truncate(after='2019-07-05')

2019-07-01    0
2019-07-02    1
2019-07-03    2
2019-07-04    3
2019-07-05    4
Freq: D, dtype: int64
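`before` and `after` can be passed together to keep only a window of dates. On a sorted index this gives the same rows as a `.loc` slice, as the sketch below checks; `aug` and `same` are names chosen for the example.

```python
import pandas as pd

d2 = pd.Series(range(400), index=pd.date_range('2019-07-01', periods=400))

# Keep only 1 Aug 2019 through 3 Aug 2019 (both dates included)
aug = d2.truncate(before='2019-08-01', after='2019-08-03')
print(aug)

# The equivalent label-based slice
same = d2.loc['2019-08-01':'2019-08-03']
print(aug.equals(same))
```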

### Time Series with Duplicate Indices

Sometimes, there may be multiple data observations falling on the same timestamp.

In [13]:
dup_dates = pd.DatetimeIndex(['2020-07-28', '2020-07-29',
                              '2020-07-29', '2020-07-30', 
                              '2020-07-31'])
s3 = pd.Series(np.arange(5), index=dup_dates)
display(s3)
# index.is_unique is False when the index contains duplicate timestamps
print(s3.index.is_unique)

2020-07-28    0
2020-07-29    1
2020-07-29    2
2020-07-30    3
2020-07-31    4
dtype: int64

False


In [14]:
# Indexing into this time series returns a Series slice when the timestamp is duplicated
print(s3['20200729'])
# If there are no duplicates, this produces a scalar
print(s3['20200728'])
print(s3['20200731'])

2020-07-29    1
2020-07-29    2
dtype: int64
0
4


In [15]:
# To aggregate values that share a timestamp, group on the index with level=0
print(s3.groupby(level=0).sum())
print(s3.groupby(level=0).mean())

2020-07-28    0
2020-07-29    3
2020-07-30    3
2020-07-31    4
dtype: int64
2020-07-28    0.0
2020-07-29    1.5
2020-07-30    3.0
2020-07-31    4.0
dtype: float64
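Aggregating is not the only way to handle repeated timestamps. As a sketch of two alternatives, the block below counts the observations per timestamp and then keeps only the first observation for each one using `Index.duplicated`. Which approach is right depends on what the duplicates mean in the data; `counts` and `first_only` are names made up for this example.

```python
import numpy as np
import pandas as pd

dup_dates = pd.DatetimeIndex(['2020-07-28', '2020-07-29',
                              '2020-07-29', '2020-07-30',
                              '2020-07-31'])
s3 = pd.Series(np.arange(5), index=dup_dates)

# How many observations fall on each timestamp
counts = s3.groupby(level=0).count()
print(counts)

# Keep only the first observation per timestamp
first_only = s3[~s3.index.duplicated(keep='first')]
print(first_only)
```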


**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)