# Time Series Analysis PyCon 2017 by Aileen Nielsen

Covers basic techniques. Taking some notes and exploring a few things.

In [37]:
# imports
import pandas as pd
from pytz import common_timezones, all_timezones

## Basics

Generate a series of time stamps using `date_range`:

In [3]:
rng = pd.date_range("2020 Aug 9 17:15", periods=10, freq="M")
rng

DatetimeIndex(['2020-08-31 17:15:00', '2020-09-30 17:15:00',
               '2020-10-31 17:15:00', '2020-11-30 17:15:00',
               '2020-12-31 17:15:00', '2021-01-31 17:15:00',
               '2021-02-28 17:15:00', '2021-03-31 17:15:00',
               '2021-04-30 17:15:00', '2021-05-31 17:15:00'],
              dtype='datetime64[ns]', freq='M')

`Timestamp` parses ambiguous date strings in US style (`mm/dd/yyyy`) by default; ISO 8601 strings like the one below are unambiguous:

In [4]:
pd.Timestamp("2020-08-17")

Timestamp('2020-08-17 00:00:00')
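The month-first convention only matters for ambiguous, slash-separated strings. A minimal sketch of the difference, using the made-up date string `08/07/2020` (not from the talk) and the `dayfirst` option of `pd.to_datetime`:

```python
import pandas as pd

ambiguous = "08/07/2020"

# Default parsing reads the string month-first: 7 August 2020.
month_first = pd.Timestamp(ambiguous)

# to_datetime can be told to read the same string day-first: 8 July 2020.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first.month, month_first.day)
print(day_first.month, day_first.day)
```

Writing dates in ISO 8601 form (`yyyy-mm-dd`) sidesteps the question entirely.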

Precision goes down to the nanosecond (the output below shows only six decimal places because the remaining digits are zero):

In [5]:
pd.Timestamp("2020-08-17 17:15:08.156")

Timestamp('2020-08-17 17:15:08.156000')
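To see the nanosecond component directly, a small sketch with one extra nanosecond appended to the same time stamp (the value is illustrative):

```python
import pandas as pd

ts_ns = pd.Timestamp("2020-08-17 17:15:08.156000001")

print(ts_ns.microsecond)  # microsecond component
print(ts_ns.nanosecond)   # nanosecond component beyond the microseconds
```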

Timestamps can tell us what quarter, day of the week, etc. they belong to:

In [15]:
ts1 = pd.Timestamp("2020-08-23 17:15:08.156")
print(f"quarter of ts1: {ts1.quarter}")
print(f"day of week of ts1: {ts1.dayofweek}")
print(f"name of day of ts1: {ts1.day_name()}")

quarter of ts1: 3
day of week of ts1: 6
name of day of ts1: Sunday
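The same attributes are available on a whole `DatetimeIndex`, which is handy for filtering. A minimal sketch, assuming an illustrative week of daily time stamps; note that `dayofweek` counts from Monday as 0:

```python
import pandas as pd

# One week of daily time stamps starting on Monday, 17 August 2020.
week = pd.date_range("2020-08-17", periods=7, freq="D")

day_numbers = week.dayofweek         # Monday is 0, Sunday is 6
day_names = week.day_name()
weekend = week[week.dayofweek >= 5]  # keep Saturday and Sunday only

print(list(day_names))
print(weekend)
```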


Time offsets. For example, one day and one microsecond:

In [16]:
pd.Timedelta("1 day 1us")

Timedelta('1 days 00:00:00.000001')

Add an hour and a half to a time stamp:

In [17]:
pd.Timestamp("2018-08-17 22:00") + pd.Timedelta("1.5 hours")

Timestamp('2018-08-17 23:30:00')
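The arithmetic also works in the other direction: subtracting two time stamps gives a `Timedelta`. A short sketch using the same two points in time:

```python
import pandas as pd

start = pd.Timestamp("2018-08-17 22:00")
end = pd.Timestamp("2018-08-17 23:30")

elapsed = end - start  # Timestamp - Timestamp gives a Timedelta

print(elapsed)
print(elapsed.total_seconds())            # duration as a float number of seconds
print(elapsed / pd.Timedelta(minutes=1))  # duration expressed in minutes
```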

Time spans. Check if a time stamp is within a time period:

In [22]:
p = pd.Period("2020-08")
t = pd.Timestamp("2020-08-18")
print(f"period: {p}")
print(f"ts: {t}")
p.start_time <= t <= p.end_time

period: 2020-08
ts: 2020-08-18 00:00:00


True
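Whether the bounds are inclusive matters at the edges of a period: with strict inequalities, a time stamp that falls exactly on `start_time` counts as outside. A minimal sketch of an inclusive check, plus an alternative that converts the time stamp to a monthly period via `Timestamp.to_period` (the variable names are illustrative):

```python
import pandas as pd

p = pd.Period("2020-08")  # monthly period, frequency inferred from the string
t = pd.Timestamp("2020-08-18")
first_instant = pd.Timestamp("2020-08-01")

# Inclusive bounds also accept the very first instant of the period.
inside = p.start_time <= t <= p.end_time
boundary_inside = p.start_time <= first_instant <= p.end_time

# Alternative: map the time stamp to its month and compare periods.
same_month = t.to_period("M") == p

print(inside, boundary_inside, same_month)
```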

Generate a range of time periods. Each entry looks like a time stamp, but its type is an hourly `Period`, so `12:15` is aligned to the hour that contains it:

In [24]:
p_rng = pd.period_range("2020-01-01 12:15", freq="H", periods=10)
p_rng

PeriodIndex(['2020-01-01 12:00', '2020-01-01 13:00', '2020-01-01 14:00',
             '2020-01-01 15:00', '2020-01-01 16:00', '2020-01-01 17:00',
             '2020-01-01 18:00', '2020-01-01 19:00', '2020-01-01 20:00',
             '2020-01-01 21:00'],
            dtype='period[H]', freq='H')

With a frequency of 60 minutes instead, the periods keep the 12:15 offset:

In [26]:
p_rng = pd.period_range("2020-01-01 12:15", freq="60T", periods=10)
p_rng

PeriodIndex(['2020-01-01 12:15', '2020-01-01 13:15', '2020-01-01 14:15',
             '2020-01-01 15:15', '2020-01-01 16:15', '2020-01-01 17:15',
             '2020-01-01 18:15', '2020-01-01 19:15', '2020-01-01 20:15',
             '2020-01-01 21:15'],
            dtype='period[60T]', freq='60T')
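The two outputs above differ in where each period starts: an hourly frequency aligns the start to the full hour, while a 60-minute frequency counts in minutes and so keeps the quarter-past offset. A small sketch of that difference, written with the lowercase aliases `h` and `min` on the assumption that the installed pandas version accepts them (newer releases prefer them over `H` and `T`):

```python
import pandas as pd

hourly = pd.Period("2020-01-01 12:15", freq="h")
sixty_minutes = pd.Period("2020-01-01 12:15", freq="60min")

print(hourly.start_time)         # aligned to the start of the hour
print(sixty_minutes.start_time)  # keeps the quarter-past offset
```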

Combining offset aliases in a `date_range`:

In [31]:
pd.date_range("2020-01-01 12:15", freq="1h30min", periods=10)

DatetimeIndex(['2020-01-01 12:15:00', '2020-01-01 13:45:00',
               '2020-01-01 15:15:00', '2020-01-01 16:45:00',
               '2020-01-01 18:15:00', '2020-01-01 19:45:00',
               '2020-01-01 21:15:00', '2020-01-01 22:45:00',
               '2020-01-02 00:15:00', '2020-01-02 01:45:00'],
              dtype='datetime64[ns]', freq='90T')
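As the `freq` in the output suggests, a combined alias is just another way of writing a single step. A quick sketch that checks `1h30min` against `90min` (the comparison is my own addition, not from the talk):

```python
import pandas as pd

combined = pd.date_range("2020-01-01 12:15", freq="1h30min", periods=4)
ninety = pd.date_range("2020-01-01 12:15", freq="90min", periods=4)

print(combined.equals(ninety))    # both spell the same 90-minute step
print(combined[1] - combined[0])  # step between consecutive time stamps
```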

Create a `Series` indexed by time periods:

In [28]:
num_periods = 10
ts_pd = pd.Series(range(num_periods), pd.period_range("2020-08-01 11:15", freq="60T", periods=num_periods))
ts_pd

2020-08-01 11:15    0
2020-08-01 12:15    1
2020-08-01 13:15    2
2020-08-01 14:15    3
2020-08-01 15:15    4
2020-08-01 16:15    5
2020-08-01 17:15    6
2020-08-01 18:15    7
2020-08-01 19:15    8
2020-08-01 20:15    9
Freq: 60T, dtype: int64

You can then slice the `Series`:

In [30]:
ts_pd["2020-08-01 13":"2020-08-01 17"]

2020-08-01 13:15    2
2020-08-01 14:15    3
2020-08-01 15:15    4
2020-08-01 16:15    5
2020-08-01 17:15    6
Freq: 60T, dtype: int64

A `Series` indexed by time stamps:

In [34]:
num_periods = 10
ts_dt = pd.Series(range(num_periods), pd.date_range("2020-08-01 11:15", freq="60T", periods=num_periods))
ts_dt

2020-08-01 11:15:00    0
2020-08-01 12:15:00    1
2020-08-01 13:15:00    2
2020-08-01 14:15:00    3
2020-08-01 15:15:00    4
2020-08-01 16:15:00    5
2020-08-01 17:15:00    6
2020-08-01 18:15:00    7
2020-08-01 19:15:00    8
2020-08-01 20:15:00    9
Freq: 60T, dtype: int64

In [35]:
ts_dt["2020-08-01 13":"2020-08-01 17"]

2020-08-01 13:15:00    2
2020-08-01 14:15:00    3
2020-08-01 15:15:00    4
2020-08-01 16:15:00    5
2020-08-01 17:15:00    6
Freq: 60T, dtype: int64
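The slices above rely on partial string indexing: a string with hour resolution matches everything within that hour, and both ends of the slice are inclusive. A minimal sketch on a longer, made-up series, which also shows selecting a whole day with a single date string:

```python
import pandas as pd

# Three days of hourly values (72 points), as an illustrative series.
idx = pd.date_range("2020-08-01 00:15", periods=72, freq="60min")
ts = pd.Series(range(72), index=idx)

one_day = ts.loc["2020-08-02"]                       # every point on 2 August
afternoon = ts.loc["2020-08-02 13":"2020-08-02 17"]  # hours 13 to 17, both inclusive

print(len(one_day), len(afternoon))
```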

How do you decide between a `date_range` and a `period_range`? Compare "I counted $n$ visitors to my site at time $x$" with "I counted $n$ visitors to my site during time period $x$". Use time stamps for point-in-time measurements (I counted now, again a minute later, and again a minute after that) and periods for totals over an interval (I counted all visitors during this month).

You can convert between a `DatetimeIndex` and a `PeriodIndex` using:
```python
ts_dt.to_period()
ts_pd.to_timestamp()
```
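A self-contained sketch of the round trip on a small daily series (the data is made up for illustration). Converting back with `to_timestamp()` places each value at the start of its period by default:

```python
import pandas as pd

ts_dt = pd.Series(range(3), index=pd.date_range("2020-08-01", periods=3, freq="D"))

ts_as_periods = ts_dt.to_period()       # DatetimeIndex -> PeriodIndex (daily)
ts_back = ts_as_periods.to_timestamp()  # PeriodIndex -> DatetimeIndex (period starts)
ts_monthly = ts_dt.to_period("M")       # label each point with its month

print(ts_as_periods.index)
print(ts_monthly.index)
```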

## Time zones

`pytz` provides pretty good time zone support, including awareness of DST changes and a fairly up-to-date IANA tz database (at least on Linux and macOS). There is also [PEP 615](https://www.python.org/dev/peps/pep-0615/), which adds the `zoneinfo` module to the standard library from Python 3.9.
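For comparison, a minimal sketch of the standard-library `zoneinfo` module described in PEP 615. It assumes Python 3.9 or newer and that IANA time zone data is available on the system (or via the `tzdata` package); the dates are illustrative:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # standard library, Python 3.9 or newer

berlin = ZoneInfo("Europe/Berlin")

summer = datetime(2020, 8, 15, 12, 0, tzinfo=berlin)
winter = datetime(2020, 12, 15, 12, 0, tzinfo=berlin)

print(summer.utcoffset())  # summer time (CEST)
print(winter.utcoffset())  # standard time (CET)
```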

List of supported time zones:

In [44]:
print(len(all_timezones))

592


In [45]:
print(all_timezones)

['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Asmera', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo', 'Africa/Casablanca', 'Africa/Ceuta', 'Africa/Conakry', 'Africa/Dakar', 'Africa/Dar_es_Salaam', 'Africa/Djibouti', 'Africa/Douala', 'Africa/El_Aaiun', 'Africa/Freetown', 'Africa/Gaborone', 'Africa/Harare', 'Africa/Johannesburg', 'Africa/Juba', 'Africa/Kampala', 'Africa/Khartoum', 'Africa/Kigali', 'Africa/Kinshasa', 'Africa/Lagos', 'Africa/Libreville', 'Africa/Lome', 'Africa/Luanda', 'Africa/Lubumbashi', 'Africa/Lusaka', 'Africa/Malabo', 'Africa/Maputo', 'Africa/Maseru', 'Africa/Mbabane', 'Africa/Mogadishu', 'Africa/Monrovia', 'Africa/Nairobi', 'Africa/Ndjamena', 'Africa/Niamey', 'Africa/Nouakchott', 'Africa/Ouagadougou', 'Africa/Porto-Novo', 'Africa/Sao_Tome', 'Africa/Timbuktu', 'Africa/Tripoli', 'Africa/Tunis', 'Africa/Windhoek', 'Ameri

By default, Pandas time objects are not time zone aware:

In [39]:
rng = pd.date_range("2020-08-15 00:00", periods=15, freq="d")
rng

DatetimeIndex(['2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18',
               '2020-08-19', '2020-08-20', '2020-08-21', '2020-08-22',
               '2020-08-23', '2020-08-24', '2020-08-25', '2020-08-26',
               '2020-08-27', '2020-08-28', '2020-08-29'],
              dtype='datetime64[ns]', freq='D')

In [40]:
print(rng.tz)

None


In [41]:
rng = pd.date_range("2020-08-15 00:00", periods=15, freq="d", tz="Europe/London")
rng

DatetimeIndex(['2020-08-15 00:00:00+01:00', '2020-08-16 00:00:00+01:00',
               '2020-08-17 00:00:00+01:00', '2020-08-18 00:00:00+01:00',
               '2020-08-19 00:00:00+01:00', '2020-08-20 00:00:00+01:00',
               '2020-08-21 00:00:00+01:00', '2020-08-22 00:00:00+01:00',
               '2020-08-23 00:00:00+01:00', '2020-08-24 00:00:00+01:00',
               '2020-08-25 00:00:00+01:00', '2020-08-26 00:00:00+01:00',
               '2020-08-27 00:00:00+01:00', '2020-08-28 00:00:00+01:00',
               '2020-08-29 00:00:00+01:00'],
              dtype='datetime64[ns, Europe/London]', freq='D')

In [42]:
print(rng.tz)

Europe/London


Convert a time zone naive date-time index to a time zone aware one using `DatetimeIndex.tz_localize()`. `Series` and `Timestamp` have matching methods.

In [49]:
rng_naive = pd.date_range("2020-08-15 00:00", periods=15, freq="d")
rng_aware = rng_naive.tz_localize(tz="Europe/Berlin")
rng_aware

DatetimeIndex(['2020-08-15 00:00:00+02:00', '2020-08-16 00:00:00+02:00',
               '2020-08-17 00:00:00+02:00', '2020-08-18 00:00:00+02:00',
               '2020-08-19 00:00:00+02:00', '2020-08-20 00:00:00+02:00',
               '2020-08-21 00:00:00+02:00', '2020-08-22 00:00:00+02:00',
               '2020-08-23 00:00:00+02:00', '2020-08-24 00:00:00+02:00',
               '2020-08-25 00:00:00+02:00', '2020-08-26 00:00:00+02:00',
               '2020-08-27 00:00:00+02:00', '2020-08-28 00:00:00+02:00',
               '2020-08-29 00:00:00+02:00'],
              dtype='datetime64[ns, Europe/Berlin]', freq='D')
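The `Timestamp` variant works the same way. A short sketch, which also shows that passing `None` removes the time zone again while keeping the wall-clock time:

```python
from datetime import timedelta
import pandas as pd

naive = pd.Timestamp("2020-08-15 00:00")
aware = naive.tz_localize("Europe/Berlin")  # same wall-clock time, now tagged as Berlin time
stripped = aware.tz_localize(None)          # drop the zone again, keeping the wall-clock time

print(naive.tz)           # no zone attached
print(aware.utcoffset())  # offset from UTC in summer
```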

If an object already has a time zone associated with it, you can change it using `tz_convert()`:

In [51]:
rng_aware_ny = rng_aware.tz_convert("America/New_York")
rng_aware_ny

DatetimeIndex(['2020-08-14 18:00:00-04:00', '2020-08-15 18:00:00-04:00',
               '2020-08-16 18:00:00-04:00', '2020-08-17 18:00:00-04:00',
               '2020-08-18 18:00:00-04:00', '2020-08-19 18:00:00-04:00',
               '2020-08-20 18:00:00-04:00', '2020-08-21 18:00:00-04:00',
               '2020-08-22 18:00:00-04:00', '2020-08-23 18:00:00-04:00',
               '2020-08-24 18:00:00-04:00', '2020-08-25 18:00:00-04:00',
               '2020-08-26 18:00:00-04:00', '2020-08-27 18:00:00-04:00',
               '2020-08-28 18:00:00-04:00'],
              dtype='datetime64[ns, America/New_York]', freq='D')
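Unlike `tz_localize()`, `tz_convert()` does not change the instant in time, only the wall-clock representation. A minimal sketch on a single time stamp, including what happens when there is no time zone to convert from:

```python
import pandas as pd

berlin = pd.Timestamp("2020-08-15 00:00", tz="Europe/Berlin")
new_york = berlin.tz_convert("America/New_York")

print(new_york)            # wall-clock time in New York
print(berlin == new_york)  # same instant, so the comparison holds

# A naive time stamp has no zone to convert from, so tz_convert raises TypeError.
try:
    pd.Timestamp("2020-08-15 00:00").tz_convert("America/New_York")
    converted_naive = True
except TypeError:
    converted_naive = False

print(converted_naive)
```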

Time zone aware ranges take care of DST changes (note the offset switching from `-05:00` to `-04:00`):

In [53]:
rng = pd.date_range("2020-03-05", periods=10, tz="US/Eastern")
ts = pd.Series(range(10), index=rng)
ts

2020-03-05 00:00:00-05:00    0
2020-03-06 00:00:00-05:00    1
2020-03-07 00:00:00-05:00    2
2020-03-08 00:00:00-05:00    3
2020-03-09 00:00:00-04:00    4
2020-03-10 00:00:00-04:00    5
2020-03-11 00:00:00-04:00    6
2020-03-12 00:00:00-04:00    7
2020-03-13 00:00:00-04:00    8
2020-03-14 00:00:00-04:00    9
Freq: D, dtype: int64
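One consequence of the offset change in the output above is that the local day containing the switch to DST is only 23 hours long. A small sketch that measures this, using `America/New_York` as the canonical name for the same zone (my substitution for `US/Eastern`):

```python
import pandas as pd

# "America/New_York" is used here as the canonical name for US Eastern time.
rng = pd.date_range("2020-03-05", periods=10, freq="D", tz="America/New_York")

normal_day = rng[1] - rng[0]  # 5 March to 6 March, no DST change
short_day = rng[4] - rng[3]   # 8 March to 9 March, clocks spring forward

print(normal_day)
print(short_day)
```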

If you do not work in UTC, you might get ambiguous times. This can happen when you track time stamps across a DST change boundary (e.g., when the clock falls back by an hour, the same wall-clock time occurs twice). For example:

In [61]:
rng_hourly = pd.DatetimeIndex(data=["2011-11-06 00:00", "2011-11-06 01:00", "2011-11-06 01:00", "2011-11-06 02:00"])
rng_hourly

DatetimeIndex(['2011-11-06 00:00:00', '2011-11-06 01:00:00',
               '2011-11-06 01:00:00', '2011-11-06 02:00:00'],
              dtype='datetime64[ns]', freq=None)

In [63]:
rng_hourly.tz_localize("US/Central")

AmbiguousTimeError: Cannot infer dst time from %r, try using the 'ambiguous' argument

We can handle this error by letting Pandas infer how to handle the ambiguous time:

In [64]:
rng_hourly.tz_localize("US/Central", ambiguous="infer")

DatetimeIndex(['2011-11-06 00:00:00-05:00', '2011-11-06 01:00:00-05:00',
               '2011-11-06 01:00:00-06:00', '2011-11-06 02:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)
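Besides `"infer"`, the `ambiguous` argument also accepts explicit DST flags or `"NaT"`. A sketch of both options under the assumption that the first `01:00` is DST and the second is standard time; `America/Chicago` stands in for `US/Central` as the canonical zone name:

```python
import numpy as np
import pandas as pd

rng_hourly = pd.DatetimeIndex(
    ["2011-11-06 00:00", "2011-11-06 01:00", "2011-11-06 01:00", "2011-11-06 02:00"]
)

# "America/Chicago" is used here as the canonical name for US Central time.
inferred = rng_hourly.tz_localize("America/Chicago", ambiguous="infer")

# Explicit flags: True marks a DST time, False a standard time.
# The flag only matters for the two ambiguous 01:00 entries.
is_dst = np.array([True, True, False, False])
explicit = rng_hourly.tz_localize("America/Chicago", ambiguous=is_dst)

# Give up on the ambiguous entries and mark them as missing instead.
with_nat = rng_hourly.tz_localize("America/Chicago", ambiguous="NaT")

print(explicit)
print(with_nat)
```

`"infer"` depends on the order of the time stamps, so explicit flags are the safer choice when the data is not sorted.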

Best to work in UTC whenever possible:

In [67]:
rng_hourly.tz_localize("US/Central", ambiguous="infer").tz_convert("utc")

DatetimeIndex(['2011-11-06 05:00:00+00:00', '2011-11-06 06:00:00+00:00',
               '2011-11-06 07:00:00+00:00', '2011-11-06 08:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
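The reason UTC is safer: every instant has exactly one UTC representation, while local wall-clock times can repeat. A minimal sketch that starts from UTC and converts only for display (`America/Chicago` is again my stand-in for `US/Central`):

```python
import pandas as pd

utc = pd.date_range("2011-11-06 05:00", periods=4, freq="60min", tz="UTC")
local = utc.tz_convert("America/Chicago")  # convert for display only
wall = local.tz_localize(None)             # wall-clock times with the zone removed

print(utc.is_unique)   # UTC time stamps never repeat
print(wall.is_unique)  # the local 01:00 occurs twice
```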

## Resampling

In [70]:
# 72 hourly time stamps (three days) with the values 0..71
rng = pd.date_range("2011-01-01", periods=72, freq="H")
ts = pd.Series(range(len(rng)), index=rng)

In [71]:
ts.head()

2011-01-01 00:00:00    0
2011-01-01 01:00:00    1
2011-01-01 02:00:00    2
2011-01-01 03:00:00    3
2011-01-01 04:00:00    4
Freq: H, dtype: int64

In [74]:

converted_ffill = ts.asfreq("45min", method="ffill")

In [76]:
converted_ffill.head()

2011-01-01 00:00:00    0
2011-01-01 00:45:00    0
2011-01-01 01:30:00    1
2011-01-01 02:15:00    2
2011-01-01 03:00:00    3
Freq: 45T, dtype: int64

In [77]:
converted_bfill = ts.asfreq("45min", method="bfill")

In [78]:
converted_bfill.head()

2011-01-01 00:00:00    0
2011-01-01 00:45:00    1
2011-01-01 01:30:00    2
2011-01-01 02:15:00    3
2011-01-01 03:00:00    3
Freq: 45T, dtype: int64

In [79]:
converted = ts.asfreq("45min")

In [80]:
converted.head()

2011-01-01 00:00:00    0.0
2011-01-01 00:45:00    NaN
2011-01-01 01:30:00    NaN
2011-01-01 02:15:00    NaN
2011-01-01 03:00:00    3.0
Freq: 45T, dtype: float64
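In the cells above, `asfreq` only picks values at the new time stamps: without a fill method, every 45-minute point that does not coincide with a full hour becomes `NaN`, and `ffill`/`bfill` copy the previous or next hourly value instead. It does not aggregate. For aggregation, `resample` is the usual tool; a minimal sketch on the same kind of series, where the daily mean is my own choice of example:

```python
import pandas as pd

rng = pd.date_range("2011-01-01", periods=72, freq="60min")
ts = pd.Series(range(len(rng)), index=rng)

daily_mean = ts.resample("D").mean()   # one value per day: the mean of 24 hourly values
on_grid = ts.asfreq("45min").dropna()  # points where the 45-minute grid hits a full hour

print(daily_mean)
print(on_grid.head())
```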

## Topics without notes

Some topics I did not take notes on.

### Read and handle temporal data from files

Pandas can parse dates while reading data from files. You can tell it which columns in your file (or the resulting `DataFrame`) contain date information and ask it to infer the dates for you, which then allows you to invoke the methods of a `DatetimeIndex`. This can be quite convenient. However, if I remember correctly, letting Pandas infer date formats without specifying a format string can be expensive. According to the presentation, letting Pandas infer the dates is sometimes faster, so always test this on the data at hand.
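A minimal sketch of both approaches on an in-memory CSV. The file contents and column names are hypothetical; `parse_dates` and `index_col` are `read_csv` parameters, and `format` is a `to_datetime` parameter:

```python
import io
import pandas as pd

# Hypothetical file contents; the column names are made up for illustration.
csv_text = "date,visitors\n2016-10-30,10\n2016-10-31,12\n2016-11-01,9\n"

# Option 1: let read_csv parse the date column and use it as the index.
inferred = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# Option 2: read the column as text, then parse it with an explicit format.
explicit = pd.read_csv(io.StringIO(csv_text))
explicit["date"] = pd.to_datetime(explicit["date"], format="%Y-%m-%d")
explicit = explicit.set_index("date")

print(inferred.index)
print(explicit.index)
```

Which option is faster depends on the data, so timing both on a realistic sample is the only reliable guide.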

For monthly data (or any other interval data), you may want to convert the time stamps into periods using `to_period()`.

A useful method I was not aware of is `truncate()`. It drops all rows before and after the given dates from a date-time-indexed `DataFrame` and keeps the rows in between, for example: `ts.truncate(before="2016-10-31", after="2016-12-31")`.
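A short sketch of `truncate()` on a made-up daily series that extends a few days past both dates. Both boundary dates are kept, and the index has to be sorted:

```python
import pandas as pd

idx = pd.date_range("2016-10-29", "2017-01-02", freq="D")
ts = pd.Series(range(len(idx)), index=idx)

kept = ts.truncate(before="2016-10-31", after="2016-12-31")

print(kept.index[0], kept.index[-1])  # both boundary dates are retained
print(len(kept))                      # number of days from 31 October to 31 December
```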