<h1>Time Series</h1>

Anything that is observed or measured at many points in time forms a time series. Many time series are fixed frequency, which is to say that data points occur at regular intervals according to some rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can also be irregular, without a fixed unit of time or offset between units.

<b>Note: pandas also supports indexes based on timedeltas, which can be a useful way of representing experiment or elapsed time.</b>
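As a rough illustration of the note above, the following sketch indexes a few readings by elapsed time rather than by calendar time. The variable names and the values are made up for the example.

```python
import pandas as pd

# Hypothetical readings taken 0, 30, and 90 seconds into an experiment.
elapsed = pd.to_timedelta([0, 30, 90], unit="s")
readings = pd.Series([20.1, 20.7, 21.4], index=elapsed)

# Label-based slicing accepts Timedelta objects; the end point is inclusive.
first_minute = readings.loc[: pd.Timedelta(seconds=60)]
print(first_minute)
```

The index here is a TimedeltaIndex, so selection works on durations in the same way that a DatetimeIndex works on timestamps.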

pandas provides many built-in time series tools and data algorithms. We can efficiently work with very large time series and easily slice and dice, aggregate, and resample irregular- and fixed-frequency time series. Some of these tools are especially useful for financial and economics applications.

<h3>Date and Time Data Types and Tools</h3>

The Python standard library includes data types for date and time data, as well as calendar-related functionality. The <b>datetime, time, and calendar</b> modules are the main places to start.
The <b>datetime.datetime</b> type, or simply <b>datetime</b>, is widely used:

In [1]:
import pandas as pd
import numpy as np

In [2]:
from datetime import datetime

In [3]:
now = datetime.now()

In [4]:
now

datetime.datetime(2021, 1, 21, 12, 55, 6, 176238)

In [5]:
now.year, now.month, now.day

(2021, 1, 21)

<b>datetime</b> stores both the date and time down to the microsecond. <b>timedelta</b> represents the temporal difference between two datetime objects:

In [6]:
delta = datetime(2011,1,7) - datetime(2008,6,24,8,15)

In [7]:
delta

datetime.timedelta(days=926, seconds=56700)

In [8]:
delta.days

926

In [9]:
delta.seconds

56700

In [10]:
type(datetime.now())

datetime.datetime

In [11]:
type(delta)

datetime.timedelta
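One detail that is easy to miss in the cells above: `delta.seconds` holds only the part of the interval left over after whole days are removed. The sketch below, which reuses the same two dates, contrasts it with the standard-library `total_seconds()` method.

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# .seconds is only the remainder within the last day (0 <= seconds < 86400).
leftover = delta.seconds

# total_seconds() covers the whole interval, whole days included.
total = delta.total_seconds()

print(leftover, total)
```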

We can add a <b>timedelta</b>, or a multiple of one, to a <b>datetime</b> object (or subtract it) to yield a new shifted object:

In [12]:
from datetime import timedelta

In [13]:
start = datetime(2011,1,7)

In [14]:
type(start)

datetime.datetime

In [15]:
start + timedelta(12)

datetime.datetime(2011, 1, 19, 0, 0)

In [16]:
start - 2 * timedelta(12)

datetime.datetime(2010, 12, 14, 0, 0)

![alt Type](Images/TimeSeries/datetime.png)

<h3>Converting Between String and Datetime</h3>

We can format <b>datetime</b> objects and pandas <b>Timestamp</b> objects, which we'll introduce later, as strings using <b>str</b> or the <b>strftime</b> method, passing a format specification:

In [17]:
stamp = datetime(2011,1,3)

In [18]:
stamp

datetime.datetime(2011, 1, 3, 0, 0)

In [19]:
str(stamp)

'2011-01-03 00:00:00'

In [20]:
stamp.strftime('%Y-%m-%d')

'2011-01-03'

![alt Text](Images/TimeSeries/datetime_format1.png)

![alt Text](Images/TimeSeries/datetime_format2.png)

We can use these same format codes to convert strings to dates using datetime.strptime:

In [21]:
value = '2011-01-03'

In [22]:
datetime.strptime(value, '%Y-%m-%d')

datetime.datetime(2011, 1, 3, 0, 0)

In [23]:
datestrs = ['7/6/2011', '8/6/2011']

In [24]:
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
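Because `strftime` and `strptime` share the same format codes, formatting a value and parsing it back with one format string should recover the original. The sketch below checks this and also shows what happens when a string does not match the format; the variable names are illustrative only.

```python
from datetime import datetime

stamp = datetime(2011, 1, 3)
fmt = "%Y-%m-%d"

# Formatting and then parsing with the same codes recovers the original value.
text = stamp.strftime(fmt)
round_trip = datetime.strptime(text, fmt)

# A string that does not match the format raises ValueError.
try:
    datetime.strptime("03/01/2011", fmt)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(text, round_trip == stamp, mismatch_raised)
```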

<b>datetime.strptime</b> is a good way to parse a date with a known format. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, we can use the <b>parser.parse</b> method in the third-party <b>dateutil</b> package:

In [25]:
from dateutil.parser import parse

In [26]:
parse('2011-01-03')

datetime.datetime(2011, 1, 3, 0, 0)

<b>dateutil</b> is capable of parsing most human-intelligible date representations:

In [27]:
parse('Jan 31, 1997 10:45 PM')

datetime.datetime(1997, 1, 31, 22, 45)

In international locales, the day commonly appears before the month, so we can pass dayfirst=True to indicate this:

In [28]:
parse('6/12/2011', dayfirst = True)

datetime.datetime(2011, 12, 6, 0, 0)

pandas is generally oriented towards working with arrays of dates, whether used as an axis index or a column in a DataFrame. The <b>to_datetime</b> method parses many different kinds of date representations.

In [29]:
datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']

In [30]:
pd.to_datetime(datestrs)

DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)

It also handles values that should be considered missing (None, empty string, etc.):

In [31]:
idx = pd.to_datetime(datestrs + [None])

In [32]:
idx

DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)

In [33]:
idx[2]

NaT

In [34]:
pd.isnull(idx[2])

True

In [35]:
pd.isnull(idx)

array([False, False,  True])

<b>Note: NaT (Not a Time) is pandas's null value for timestamp data.</b>
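Real data often contains entries that cannot be parsed at all. A minimal sketch, assuming we would rather get NaT than an exception for such entries, uses the `errors="coerce"` option of `to_datetime`; the input list is invented for the example.

```python
import pandas as pd

raw = ["2011-07-06 12:00:00", "not a date", None]

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# isna() flags both the coerced entry and the original None.
missing_mask = parsed.isna()
print(parsed)
print(missing_mask)
```

Whether silently coercing bad values is appropriate depends on the data; the default behavior of raising is the safer choice when bad input should be noticed.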

<b>Note: dateutil.parser is a useful but imperfect tool. Notably, it will recognize some strings as dates that we might prefer it didn't; for example, '42' will be parsed as the year 2042 with today's calendar date.</b>

<b>datetime</b> objects also have a number of locale-specific formatting options for systems in other countries or languages. For example, the abbreviated month names will be different on German or French systems compared with English systems:

![alt Text](Images/TimeSeries/local_specific_df.png)

<h3>Time Series Basics</h3>

A basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside pandas as Python strings or datetime objects:

In [36]:
from datetime import datetime

In [37]:
dates = [
    datetime(2011,1,2), datetime(2011,1,5),
    datetime(2011,1,7), datetime(2011,1,8),
    datetime(2011,1,10), datetime(2011,1,12)
]

In [38]:
ts = pd.Series(np.random.randn(6),index = dates)

In [39]:
ts

2011-01-02   -0.963655
2011-01-05   -0.868587
2011-01-07   -1.337060
2011-01-08   -0.670504
2011-01-10    0.607814
2011-01-12    1.017511
dtype: float64

Under the hood, these datetime objects have been put in a DatetimeIndex:

In [40]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)

Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:

In [41]:
ts + ts[::2]

2011-01-02   -1.927310
2011-01-05         NaN
2011-01-07   -2.674121
2011-01-08         NaN
2011-01-10    1.215629
2011-01-12         NaN
dtype: float64

Recall that ts[::2] selects every second element in ts.
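When the NaN values produced by alignment are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one side as that value. The sketch below uses small fixed numbers instead of random data so the result is easy to follow.

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8)]
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# Plain addition aligns on dates; dates missing from ts[::2] give NaN.
aligned = ts + ts[::2]

# add() with fill_value=0 treats the missing side as zero instead.
filled = ts.add(ts[::2], fill_value=0)

print(aligned)
print(filled)
```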

pandas stores timestamps using NumPy's datetime64 data type at the nanosecond resolution:

In [42]:
ts.index.dtype

dtype('<M8[ns]')

Scalar values from a DatetimeIndex are pandas Timestamp objects:

In [43]:
stamp = ts.index[0]

In [44]:
stamp

Timestamp('2011-01-02 00:00:00')

A Timestamp can be substituted anywhere we would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversion and other kinds of manipulations.
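The substitutability mentioned above follows from Timestamp being a subclass of `datetime.datetime`. A short sketch of what that means in practice:

```python
from datetime import datetime, timedelta

import pandas as pd

stamp = pd.Timestamp("2011-01-02")

# Timestamp subclasses datetime.datetime, so standard-library code accepts it.
is_datetime = isinstance(stamp, datetime)

# Arithmetic with a standard-library timedelta still yields a Timestamp.
shifted = stamp + timedelta(days=3)

# to_pydatetime() converts explicitly to a plain datetime when one is required.
plain = stamp.to_pydatetime()

print(is_datetime, shifted, type(plain))
```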

<h3>Indexing, Selection, Subsetting</h3>

Time series behave like any other pandas.Series when we are indexing and selecting data based on label:

In [45]:
stamp = ts.index[2]

In [46]:
stamp

Timestamp('2011-01-07 00:00:00')

In [47]:
ts

2011-01-02   -0.963655
2011-01-05   -0.868587
2011-01-07   -1.337060
2011-01-08   -0.670504
2011-01-10    0.607814
2011-01-12    1.017511
dtype: float64

In [48]:
ts[stamp]

-1.3370603839942414

As a convenience, we can also pass a string that is interpretable as a date:

In [49]:
ts['1/10/2011']

0.607814292016424

For longer time series, a year or only a year and month can be passed to easily select slices of data:

In [50]:
longer_ts = pd.Series(np.random.randn(1000), 
                     index = pd.date_range('1/1/2000', periods = 1000))

In [51]:
longer_ts

2000-01-01    0.845901
2000-01-02    0.348283
2000-01-03    0.182119
2000-01-04   -0.407073
2000-01-05   -0.438410
                ...   
2002-09-22    1.056698
2002-09-23    1.071585
2002-09-24   -2.501620
2002-09-25   -0.105930
2002-09-26    1.948904
Freq: D, Length: 1000, dtype: float64

In [52]:
longer_ts['2001']

2001-01-01    0.319370
2001-01-02   -1.075469
2001-01-03   -0.006441
2001-01-04   -1.178455
2001-01-05   -0.630372
                ...   
2001-12-27    1.090459
2001-12-28    0.166974
2001-12-29   -0.419049
2001-12-30    0.207420
2001-12-31   -0.147472
Freq: D, Length: 365, dtype: float64

Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:

In [53]:
longer_ts['2001-05']

2001-05-01    0.048986
2001-05-02    0.830458
2001-05-03   -0.926056
2001-05-04   -0.668446
2001-05-05    2.019223
2001-05-06    1.386093
2001-05-07   -2.270726
2001-05-08    0.464756
2001-05-09   -1.105441
2001-05-10   -1.435237
2001-05-11   -0.339706
2001-05-12   -1.675635
2001-05-13    0.403043
2001-05-14    0.058218
2001-05-15    0.605584
2001-05-16   -1.175536
2001-05-17   -1.007603
2001-05-18    1.102512
2001-05-19    0.170447
2001-05-20    1.860692
2001-05-21   -0.593652
2001-05-22   -1.373767
2001-05-23    1.109634
2001-05-24    0.080935
2001-05-25   -0.172999
2001-05-26    1.033989
2001-05-27    0.561816
2001-05-28    1.327437
2001-05-29    1.336848
2001-05-30    0.949774
2001-05-31   -0.626000
Freq: D, dtype: float64

In [54]:
ts

2011-01-02   -0.963655
2011-01-05   -0.868587
2011-01-07   -1.337060
2011-01-08   -0.670504
2011-01-10    0.607814
2011-01-12    1.017511
dtype: float64

In [55]:
ts[datetime(2011,1,7):]

2011-01-07   -1.337060
2011-01-08   -0.670504
2011-01-10    0.607814
2011-01-12    1.017511
dtype: float64

Because most time series data is ordered chronologically, we can slice with timestamps not contained in a time series to perform a range query:

In [56]:
ts

2011-01-02   -0.963655
2011-01-05   -0.868587
2011-01-07   -1.337060
2011-01-08   -0.670504
2011-01-10    0.607814
2011-01-12    1.017511
dtype: float64

In [57]:
ts['1/6/2011': '1/11/2011']

2011-01-07   -1.337060
2011-01-08   -0.670504
2011-01-10    0.607814
dtype: float64
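The range query above depends on the index being sorted. The following sketch, with made-up values, checks that condition through the `is_monotonic_increasing` property and then slices with end points that are not in the index.

```python
import pandas as pd

ts = pd.Series(
    [10, 20, 30, 40],
    index=pd.to_datetime(["2011-01-02", "2011-01-07", "2011-01-08", "2011-01-12"]),
)

# Range queries with absent end points rely on a sorted index.
is_sorted = ts.index.is_monotonic_increasing

# Neither end point is in the index; both bounds are inclusive.
window = ts.loc["2011-01-06":"2011-01-11"]

print(is_sorted)
print(window)
```

If the index is not sorted, calling `sort_index()` first is one way to make this kind of slice well defined.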

As before, we can pass either a string date, datetime, or timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays. This means that no data is copied and modifications on the slice will be reflected in the original data. (When pandas's Copy-on-Write mode is enabled, modifying the slice no longer changes the original.)

There is an equivalent instance method, <b>truncate</b>, that slices a Series between two dates:

In [58]:
ts

2011-01-02   -0.963655
2011-01-05   -0.868587
2011-01-07   -1.337060
2011-01-08   -0.670504
2011-01-10    0.607814
2011-01-12    1.017511
dtype: float64

In [59]:
ts.truncate(after='1/9/2011')

2011-01-02   -0.963655
2011-01-05   -0.868587
2011-01-07   -1.337060
2011-01-08   -0.670504
dtype: float64

All of this holds true for DataFrame as well, indexing on its rows:

In [60]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')

In [61]:
long_df = pd.DataFrame(np.random.randn(100,4),
                      index = dates,
                      columns = ['Colorado', 'Texas', 'New York', 'Ohio'])

In [62]:
long_df

            Colorado     Texas  New York      Ohio
2000-01-05  0.166503 -0.788973 -1.754544  0.976552
2000-01-12  0.439641 -0.858747  3.296004 -0.398272
2000-01-19  0.866603 -2.377226 -0.000949 -0.049554
2000-01-26 -1.384804  1.436707  0.841157 -1.367809
2000-02-02 -0.840875  1.540792  0.094606  0.558995
...              ...       ...       ...       ...
2001-10-31 -0.002362 -0.608792 -1.552786 -0.685939
2001-11-07  0.427987  0.857004 -2.383852 -0.126432
2001-11-14  0.303351 -0.975651 -0.728750  0.017699
2001-11-21 -1.194590  0.579330  0.272940  0.315820


In [63]:
long_df.loc['5-2001']

            Colorado     Texas  New York      Ohio
2001-05-02 -1.079054 -1.347998 -0.934342  0.167494
2001-05-09 -0.752395  1.746979 -2.434367  0.412317
2001-05-16  0.368836  0.166948   1.09091 -0.084843
2001-05-23 -0.539539 -1.089492 -0.392484  0.853302
2001-05-30 -1.746952  0.522827  0.210384 -0.497027


<h3>Time Series with Duplicate Indices</h3>

In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:

In [64]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                         '1/2/2000', '1/3/2000'])

In [65]:
dup_ts = pd.Series(np.arange(5), index = dates)

In [66]:
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

We can tell that the index is not unique by checking its <b>is_unique</b> property:

In [67]:
dup_ts.index.is_unique

False

Indexing into this time series will now either produce scalar values or slices depending on whether a timestamp is duplicated:


In [68]:
dup_ts['1/3/2000']

4

In [69]:
dup_ts['1/2/2000']

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32

Suppose we wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level = 0:

In [70]:
grouped = dup_ts.groupby(level = 0)

In [71]:
grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32

In [72]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
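Two further ways of handling duplicated timestamps are sketched below: computing several statistics in one pass with `agg`, and keeping only the first observation per timestamp with `Index.duplicated`. Which one is appropriate depends on what the duplicates mean in the data at hand.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
dup_ts = pd.Series(range(5), index=dates)

# Several statistics per unique timestamp in one pass.
summary = dup_ts.groupby(level=0).agg(["mean", "count"])

# Alternatively, keep only the first observation for each timestamp.
first_only = dup_ts[~dup_ts.index.duplicated(keep="first")]

print(summary)
print(first_only)
```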

<h3>Date Ranges, Frequencies, and Shifting</h3>

Generic time series in pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it's often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into a time series. Fortunately, pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.

For example, we can convert the sample time series to be fixed daily frequency by calling resample:

In [73]:
ts

2011-01-02   -0.963655
2011-01-05   -0.868587
2011-01-07   -1.337060
2011-01-08   -0.670504
2011-01-10    0.607814
2011-01-12    1.017511
dtype: float64

In [74]:
resampler = ts.resample('D')

In [83]:
list(resampler)

[(Timestamp('2011-01-02 00:00:00', freq='D'), 2011-01-02   -0.963655
  dtype: float64),
 (Timestamp('2011-01-03 00:00:00', freq='D'), Series([], dtype: float64)),
 (Timestamp('2011-01-04 00:00:00', freq='D'), Series([], dtype: float64)),
 (Timestamp('2011-01-05 00:00:00', freq='D'), 2011-01-05   -0.868587
  dtype: float64),
 (Timestamp('2011-01-06 00:00:00', freq='D'), Series([], dtype: float64)),
 (Timestamp('2011-01-07 00:00:00', freq='D'), 2011-01-07   -1.33706
  dtype: float64),
 (Timestamp('2011-01-08 00:00:00', freq='D'), 2011-01-08   -0.670504
  dtype: float64),
 (Timestamp('2011-01-09 00:00:00', freq='D'), Series([], dtype: float64)),
 (Timestamp('2011-01-10 00:00:00', freq='D'), 2011-01-10    0.607814
  dtype: float64),
 (Timestamp('2011-01-11 00:00:00', freq='D'), Series([], dtype: float64)),
 (Timestamp('2011-01-12 00:00:00', freq='D'), 2011-01-12    1.017511
  dtype: float64)]
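The resampler above only groups the data; a method still has to be called on it to obtain a Series. As a small sketch with two invented observations, `asfreq()` leaves NaN on the days without data, while `ffill()` carries the last observation forward.

```python
import pandas as pd

ts = pd.Series([1.0, 2.0], index=pd.to_datetime(["2011-01-02", "2011-01-05"]))

# asfreq() keeps observed values and leaves NaN for days with no observation.
daily = ts.resample("D").asfreq()

# ffill() instead carries the last observation forward into the gaps.
daily_filled = ts.resample("D").ffill()

print(daily)
print(daily_filled)
```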

<h3>Generating Date Ranges</h3>

<b>pandas.date_range</b> is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency:

In [75]:
index = pd.date_range('2012-04-01', '2012-06-01')

In [76]:
index

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
      

By default, date_range generates daily timestamps. If we pass only a start or end date, we must pass a number of periods to generate:

In [77]:
pd.date_range(start = '2012-03-01', periods=20)

DatetimeIndex(['2012-03-01', '2012-03-02', '2012-03-03', '2012-03-04',
               '2012-03-05', '2012-03-06', '2012-03-07', '2012-03-08',
               '2012-03-09', '2012-03-10', '2012-03-11', '2012-03-12',
               '2012-03-13', '2012-03-14', '2012-03-15', '2012-03-16',
               '2012-03-17', '2012-03-18', '2012-03-19', '2012-03-20'],
              dtype='datetime64[ns]', freq='D')

In [78]:
pd.date_range(end= '2012-06-01', periods = 20)

DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

The start and end dates define strict boundaries for the generated date index. For example, if we wanted a date index containing the last business day of each month, we would pass the 'BM' frequency (business end of month) and only dates falling on or inside the date interval will be included:

In [79]:
pd.date_range('2000-01-01', '2000-12-01', freq='BM')

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')
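The same range can be built from the offset class instead of the string alias. The sketch below uses `BMonthEnd` from `pandas.tseries.offsets` and checks that every generated date lies within the requested interval.

```python
import pandas as pd
from pandas.tseries.offsets import BMonthEnd

# Passing the offset object avoids depending on a particular string alias.
rng = pd.date_range("2000-01-01", "2000-12-01", freq=BMonthEnd())

# December's business month end (2000-12-29) is after the end date, so it is excluded.
inside = (rng >= "2000-01-01") & (rng <= "2000-12-01")

print(len(rng), rng[-1], inside.all())
```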

![alt Text](Images/TimeSeries/base_time.png)

date_range by default preserves the time of the start or end timestamp:

In [85]:
pd.date_range('2012-5-02  12:56:31', periods=5)

DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

Sometimes we will have start or end dates with time information but want to generate a set of timestamps normalized to midnight by convention. To do this, there is a <b>normalize</b> option:

In [86]:
pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)

DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

<h3>Frequencies and Date Offsets</h3>

Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is an object, generally referred to as a date offset. For example, hourly frequency can be represented with the Hour class:

In [87]:
from pandas.tseries.offsets import Hour, Minute

In [88]:
hour = Hour()

In [89]:
hour

<Hour>

We can define a multiple of an offset by passing an integer:


In [90]:
four_hours = Hour(4)

In [91]:
four_hours

<4 * Hours>

In most applications we would never need to explicitly create one of these objects, instead using a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:

In [100]:
pd.date_range('2000-01-01', '2000-01-03 23:59', freq='4h')

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

Many offsets can be combined together by addition:

In [101]:
Hour(2) + Minute(30)

<150 * Minutes>

Similarly, we can pass frequency strings, like '1h30min', that will effectively be parsed to the same expression:

In [102]:
pd.date_range('2000-01-01', periods=10, freq='1h30min')

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
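To see that the string form and the combined offset objects describe the same frequency, one can parse the string with `to_offset` from `pandas.tseries.frequencies` and compare. This is a small sketch of that check.

```python
from pandas.tseries.frequencies import to_offset
from pandas.tseries.offsets import Hour, Minute

# Adding two fixed-length offsets gives a single offset in the smaller unit.
combined = Hour(1) + Minute(30)

# to_offset parses a frequency string into the corresponding offset object.
parsed = to_offset("1h30min")

print(combined, parsed, combined == parsed)
```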

Some frequencies describe points in time that are not evenly spaced. For example, 'M' (calendar month end) and 'BM' (last business/weekday of month) depend on the number of days in a month and, in the latter case, whether the month ends on a weekend or not. We refer to these as anchored offsets.

<h4>Week of month dates</h4>

One useful frequency class is "week of month", starting with WOM. This enables us to get dates like the third Friday of each month:

In [103]:
rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')

In [104]:
list(rng)

[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]
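A quick sanity check on the generated dates: each one should be a Friday, and the third occurrence of any weekday always falls between the 15th and the 21st of the month. The sketch below verifies both properties.

```python
import pandas as pd

rng = pd.date_range("2012-01-01", "2012-09-01", freq="WOM-3FRI")

# dayofweek counts Monday as 0, so Friday is 4.
all_fridays = (rng.dayofweek == 4).all()

# The third occurrence of a weekday always falls on days 15 through 21.
in_third_week = ((rng.day >= 15) & (rng.day <= 21)).all()

print(all_fridays, in_third_week)
```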

<h3>Shifting (Leading and Lagging) Data</h3>

"Shifting" refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified:

In [107]:
ts = pd.Series(np.random.randn(4),
              index = pd.date_range('1/1/2000', periods=4, freq='M'))

In [108]:
ts

2000-01-31    0.802652
2000-02-29    0.070153
2000-03-31   -1.446273
2000-04-30   -1.102354
Freq: M, dtype: float64

In [109]:
ts.shift(2)

2000-01-31         NaN
2000-02-29         NaN
2000-03-31    0.802652
2000-04-30    0.070153
Freq: M, dtype: float64

In [110]:
ts

2000-01-31    0.802652
2000-02-29    0.070153
2000-03-31   -1.446273
2000-04-30   -1.102354
Freq: M, dtype: float64

In [111]:
ts.shift(-2)

2000-01-31   -1.446273
2000-02-29   -1.102354
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

When we shift like this, missing data is introduced either at the end or the start of the time series.

A common use of shift is computing percent changes in a time series or multiple time series as DataFrame columns. This is expressed as:

In [112]:
ts/ts.shift(1) - 1

2000-01-31          NaN
2000-02-29    -0.912599
2000-03-31   -21.616125
2000-04-30    -0.237796
Freq: M, dtype: float64
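pandas also provides a `pct_change` method that computes the same quantity. The sketch below compares it with the manual expression on a few fixed values; `MonthEnd()` is passed as the frequency object rather than a string alias.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.date_range("2000-01-31", periods=3, freq=MonthEnd()),
)

# Manual percent change: current value over previous value, minus one.
manual = ts / ts.shift(1) - 1

# Built-in equivalent.
builtin = ts.pct_change()

# The first entry is NaN in both, so compare from the second entry onward.
same = np.allclose(manual.iloc[1:], builtin.iloc[1:])
print(manual)
print(same)
```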

Because naive shifts leave the index unmodified, some data is discarded. Thus if the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:

In [113]:
ts

2000-01-31    0.802652
2000-02-29    0.070153
2000-03-31   -1.446273
2000-04-30   -1.102354
Freq: M, dtype: float64

In [114]:
ts.shift(2,freq='M')

2000-03-31    0.802652
2000-04-30    0.070153
2000-05-31   -1.446273
2000-06-30   -1.102354
Freq: M, dtype: float64

Other frequencies can be passed, too, giving us some flexibility in how to lead and lag the data:

In [115]:
ts

2000-01-31    0.802652
2000-02-29    0.070153
2000-03-31   -1.446273
2000-04-30   -1.102354
Freq: M, dtype: float64

In [116]:
ts.shift(3, freq='D')

2000-02-03    0.802652
2000-03-03    0.070153
2000-04-03   -1.446273
2000-05-03   -1.102354
dtype: float64

In [117]:
ts.shift(1, freq='90T')

2000-01-31 01:30:00    0.802652
2000-02-29 01:30:00    0.070153
2000-03-31 01:30:00   -1.446273
2000-04-30 01:30:00   -1.102354
dtype: float64

<b>Note: Here, T stands for minutes</b>
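The difference between the two kinds of shift can be checked directly. In this sketch with fixed values, the naive shift loses two observations, while the shift with a frequency (given here as a `MonthEnd` offset object) keeps every value and moves the index instead.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2000-01-31", periods=4, freq=MonthEnd()),
)

# Naive shift: values move, the index stays, two values fall off the end.
naive = ts.shift(2)

# Shift with a frequency: the index moves, every value is kept.
relabelled = ts.shift(2, freq=MonthEnd())

print(naive)
print(relabelled)
```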

<h3>Shifting dates with offsets</h3>

The pandas date offsets can also be used with datetime or Timestamp objects:

In [118]:
from pandas.tseries.offsets import Day, MonthEnd

In [119]:
now = datetime(2011,11,17)

In [120]:
now + 3 * Day()

Timestamp('2011-11-20 00:00:00')

If we add an anchored offset like MonthEnd, the first increment will 'roll forward' a date to the next date according to the frequency rule:

In [121]:
now

datetime.datetime(2011, 11, 17, 0, 0)

In [122]:
now + MonthEnd()

Timestamp('2011-11-30 00:00:00')

In [123]:
now + MonthEnd(2)

Timestamp('2011-12-31 00:00:00')

Anchored offsets can explicitly 'roll' dates forward or backward by simply using their <b>rollforward</b> and <b>rollback</b> methods, respectively:

In [124]:
offset = MonthEnd()

In [125]:
offset

<MonthEnd>

In [127]:
now

datetime.datetime(2011, 11, 17, 0, 0)

In [126]:
offset.rollforward(now)

Timestamp('2011-11-30 00:00:00')

In [128]:
offset.rollback(now)

Timestamp('2011-10-31 00:00:00')
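Rolling and adding differ when the date already sits on the anchor. As a small sketch, a date that is already a month end is left unchanged by `rollforward`, whereas adding the offset moves it to the following month end.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

offset = MonthEnd()
on_anchor = pd.Timestamp("2011-11-30")

# Already a month end, so rolling forward leaves it unchanged.
rolled = offset.rollforward(on_anchor)

# Addition always moves to the next month end.
added = on_anchor + offset

print(rolled, added)
```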

A creative use of date offsets is to use these methods with groupby:

In [129]:
ts = pd.Series(np.random.randn(20), 
              index = pd.date_range('1/15/2000', periods=20, freq='4d'))

In [130]:
ts

2000-01-15    0.899913
2000-01-19   -2.760128
2000-01-23    1.101279
2000-01-27    0.429476
2000-01-31    1.459624
2000-02-04    0.069247
2000-02-08   -0.167437
2000-02-12    0.030936
2000-02-16    0.430243
2000-02-20    0.957644
2000-02-24    2.333566
2000-02-28    0.151845
2000-03-03   -0.814657
2000-03-07   -1.652192
2000-03-11    0.831210
2000-03-15   -0.326000
2000-03-19    0.272009
2000-03-23   -0.158663
2000-03-27    0.977133
2000-03-31   -0.963732
Freq: 4D, dtype: float64

In [131]:
ts.groupby(offset.rollforward).mean()

2000-01-31    0.226033
2000-02-29    0.543721
2000-03-31   -0.229361
dtype: float64

Of course, an easier and faster way to do this is using <b>resample</b>:

In [132]:
ts

2000-01-15    0.899913
2000-01-19   -2.760128
2000-01-23    1.101279
2000-01-27    0.429476
2000-01-31    1.459624
2000-02-04    0.069247
2000-02-08   -0.167437
2000-02-12    0.030936
2000-02-16    0.430243
2000-02-20    0.957644
2000-02-24    2.333566
2000-02-28    0.151845
2000-03-03   -0.814657
2000-03-07   -1.652192
2000-03-11    0.831210
2000-03-15   -0.326000
2000-03-19    0.272009
2000-03-23   -0.158663
2000-03-27    0.977133
2000-03-31   -0.963732
Freq: 4D, dtype: float64

In [133]:
ts.resample('M').mean()

2000-01-31    0.226033
2000-02-29    0.543721
2000-03-31   -0.229361
Freq: M, dtype: float64
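To confirm that the two approaches agree, the sketch below repeats them on deterministic data (the integers 0 through 19 rather than random numbers) and compares values and labels. `MonthEnd()` is passed to `resample` as an offset object.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.Series(
    np.arange(20, dtype=float),
    index=pd.date_range("2000-01-15", periods=20, freq="4D"),
)

# Group by the month end that each timestamp rolls forward to.
by_group = ts.groupby(MonthEnd().rollforward).mean()

# Resample to month-end frequency.
by_resample = ts.resample(MonthEnd()).mean()

same_values = np.allclose(by_group.to_numpy(), by_resample.to_numpy())
same_labels = list(by_group.index) == list(by_resample.index)
print(by_resample)
print(same_values, same_labels)
```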

<h3>Time Zone Handling</h3>

Working with time zones is generally considered one of the most unpleasant parts of time series manipulation. As a result, many time series users choose to work with time series in <b>coordinated universal time, or UTC</b>. Time zones are expressed as offsets from UTC; for example, New York is four hours behind UTC during daylight saving time and five hours behind the rest of the year.

In Python, time zone information comes from the third-party pytz library, which exposes the Olson database, a compilation of world time zone information. This is especially important for historical data because the daylight saving time (DST) transition dates (and even UTC offsets) have been changed numerous times depending on the whims of local governments.

In [135]:
import pytz

In [140]:
pytz.common_timezones[-5:]

['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

To get a time zone object from pytz, use pytz.timezone:

In [141]:
tz = pytz.timezone('America/New_York')

In [142]:
tz

<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>

<h3>Time Zone Localization and Conversion</h3>

By default, time series in pandas are time zone naive. For example, consider the following time series:

In [143]:
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')

In [144]:
ts = pd.Series(np.random.randn(len(rng)), index = rng)

In [145]:
ts

2012-03-09 09:30:00    1.046507
2012-03-10 09:30:00    1.201385
2012-03-11 09:30:00   -0.952695
2012-03-12 09:30:00    0.466923
2012-03-13 09:30:00    1.217817
2012-03-14 09:30:00   -0.525037
Freq: D, dtype: float64

The index's tz field is None:

In [147]:
print(ts.index.tz)

None


Date ranges can be generated with a time zone set:

In [148]:
pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')

DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

Conversion from naive to localized is handled by the tz_localize method:

In [149]:
ts

2012-03-09 09:30:00    1.046507
2012-03-10 09:30:00    1.201385
2012-03-11 09:30:00   -0.952695
2012-03-12 09:30:00    0.466923
2012-03-13 09:30:00    1.217817
2012-03-14 09:30:00   -0.525037
Freq: D, dtype: float64

In [150]:
ts_utc = ts.tz_localize('UTC')

In [151]:
ts_utc

2012-03-09 09:30:00+00:00    1.046507
2012-03-10 09:30:00+00:00    1.201385
2012-03-11 09:30:00+00:00   -0.952695
2012-03-12 09:30:00+00:00    0.466923
2012-03-13 09:30:00+00:00    1.217817
2012-03-14 09:30:00+00:00   -0.525037
Freq: D, dtype: float64

In [152]:
ts_utc.index

DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

Once a time series has been localized to a particular time zone, it can be converted to another time zone with tz_convert:

In [153]:
ts_utc.tz_convert('America/New_York')

2012-03-09 04:30:00-05:00    1.046507
2012-03-10 04:30:00-05:00    1.201385
2012-03-11 05:30:00-04:00   -0.952695
2012-03-12 05:30:00-04:00    0.466923
2012-03-13 05:30:00-04:00    1.217817
2012-03-14 05:30:00-04:00   -0.525037
Freq: D, dtype: float64
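Converting between time zones changes how each instant is displayed, not the instant itself. The sketch below rebuilds the same date range with fixed values and checks this, along with the change in UTC offset across the DST transition on 2012-03-11.

```python
from datetime import timedelta

import pandas as pd

rng = pd.date_range("2012-03-09 09:30", periods=6, freq="D")
ts_utc = pd.Series(range(6), index=rng).tz_localize("UTC")
ts_ny = ts_utc.tz_convert("America/New_York")

# The instants are identical; only the wall-clock representation differs.
same_instants = (ts_ny.index == ts_utc.index).all()

# The UTC offset changes across the DST transition.
offset_before = ts_ny.index[0].utcoffset()
offset_after = ts_ny.index[-1].utcoffset()

print(same_instants, offset_before, offset_after)
```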

In the case of the preceding time series, which straddles a DST transition in the America/New_York time zone, we could localize to US Eastern time and convert to, say, UTC or Berlin time:

In [154]:
ts_eastern = ts.tz_localize('America/New_York')

In [155]:
ts_eastern

2012-03-09 09:30:00-05:00    1.046507
2012-03-10 09:30:00-05:00    1.201385
2012-03-11 09:30:00-04:00   -0.952695
2012-03-12 09:30:00-04:00    0.466923
2012-03-13 09:30:00-04:00    1.217817
2012-03-14 09:30:00-04:00   -0.525037
dtype: float64

In [156]:
ts_eastern.tz_convert('UTC')

2012-03-09 14:30:00+00:00    1.046507
2012-03-10 14:30:00+00:00    1.201385
2012-03-11 13:30:00+00:00   -0.952695
2012-03-12 13:30:00+00:00    0.466923
2012-03-13 13:30:00+00:00    1.217817
2012-03-14 13:30:00+00:00   -0.525037
dtype: float64

In [157]:
ts_eastern.tz_convert('Europe/Berlin')

2012-03-09 15:30:00+01:00    1.046507
2012-03-10 15:30:00+01:00    1.201385
2012-03-11 14:30:00+01:00   -0.952695
2012-03-12 14:30:00+01:00    0.466923
2012-03-13 14:30:00+01:00    1.217817
2012-03-14 14:30:00+01:00   -0.525037
dtype: float64

<b>tz_localize and tz_convert</b> are also instance methods on DatetimeIndex:

In [158]:
ts.index.tz_localize('Asia/Shanghai')

DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
               '2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
               '2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
              dtype='datetime64[ns, Asia/Shanghai]', freq=None)

<b>Note: Localizing naive timestamps also checks for ambiguous or non-existent times around daylight saving time transitions. </b>
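By default, localizing a non-existent or ambiguous wall-clock time raises an error. The `nonexistent` and `ambiguous` arguments of `tz_localize` let us state how such times should be resolved; the sketch below shows one possible choice for each case in the America/New_York zone.

```python
from datetime import timedelta

import pandas as pd

tz = "America/New_York"

# 02:30 on 2012-03-11 never occurred locally (clocks jumped from 02:00 to 03:00).
gap = pd.Timestamp("2012-03-11 02:30").tz_localize(tz, nonexistent="shift_forward")

# 01:30 on 2012-11-04 occurred twice; ambiguous=True picks the DST reading,
# ambiguous=False the standard-time reading.
first = pd.Timestamp("2012-11-04 01:30").tz_localize(tz, ambiguous=True)
second = pd.Timestamp("2012-11-04 01:30").tz_localize(tz, ambiguous=False)

print(gap, first, second)
```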

<h3>Operations with Time Zone-Aware Timestamp Objects</h3>

Like time series and date ranges, individual Timestamp objects can be localized from naive to time zone-aware and converted from one time zone to another:

In [159]:
stamp = pd.Timestamp('2011-03-12 04:00')

In [160]:
stamp

Timestamp('2011-03-12 04:00:00')

In [161]:
stamp_utc = stamp.tz_localize('utc')

In [162]:
stamp_utc

Timestamp('2011-03-12 04:00:00+0000', tz='UTC')

In [163]:
stamp_utc.tz_convert('America/New_York')

Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')

We can also pass a time zone when creating the Timestamp:

In [164]:
stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz= 'Europe/Moscow')

In [165]:
stamp_moscow

Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')

Time zone-aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the Unix epoch (January 1, 1970); this UTC value is invariant between time zone conversions:

In [166]:
stamp_utc.value

1299902400000000000

In [167]:
stamp_utc.tz_convert('America/New_York').value

1299902400000000000
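
To see where that number comes from, here is a small cross-check that uses only the standard library. The helper ns_since_epoch is a hypothetical name introduced for this sketch, and the fixed UTC-5 offset stands in for New York winter time.

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
stamp_utc = datetime(2011, 3, 12, 4, 0, tzinfo=timezone.utc)

# The same instant expressed with a fixed UTC-5 offset.
stamp_minus5 = stamp_utc.astimezone(timezone(timedelta(hours=-5)))


def ns_since_epoch(dt):
    """Whole nanoseconds between the Unix epoch and an aware datetime."""
    delta = dt - epoch
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000


print(stamp_minus5)                  # 2011-03-11 23:00:00-05:00
print(ns_since_epoch(stamp_utc))     # 1299902400000000000
print(ns_since_epoch(stamp_minus5))  # 1299902400000000000
```

The wall-clock reading changes with the offset, but the distance from the epoch does not, which is why the stored value survives conversion.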

When performing time arithmetic using pandas's DateOffset objects, pandas respects daylight saving time transitions where possible. Here we construct timestamps that occur close to DST transitions. First, in March, around the switch to DST:

In [168]:
from pandas.tseries.offsets import Hour

In [169]:
stamp = pd.Timestamp('2012-03-12 01:30' , tz = 'US/Eastern')

In [170]:
stamp

Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')

In [171]:
stamp + Hour()

Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')

Then, 90 minutes before transitioning out of DST:

In [172]:
stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')

In [173]:
stamp

Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')

In [174]:
stamp + 2 * Hour()

Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')
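
In 2012 the US switch into DST took place on March 11 at 02:00 local time. As a hedged sketch, the following adds one hour to a timestamp 30 minutes before that jump, so the addition actually crosses the transition; the variable names are chosen for this illustration only.

```python
import pandas as pd
from pandas.tseries.offsets import Hour

# 30 minutes before clocks jump from 02:00 EST to 03:00 EDT on 2012-03-11.
before_jump = pd.Timestamp('2012-03-11 01:30', tz='US/Eastern')
after_jump = before_jump + Hour()

print(before_jump)  # 2012-03-11 01:30:00-05:00
print(after_jump)   # 2012-03-11 03:30:00-04:00
```

One hour of elapsed time moves the wall clock by two hours here, because the 02:00 to 03:00 hour is skipped.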

<h3>Operations Between Different Time Zones</h3>

If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion to happen:

In [175]:
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')

In [188]:
ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [189]:
ts

2012-03-07 09:30:00    3.144353
2012-03-08 09:30:00    0.601742
2012-03-09 09:30:00    1.112075
2012-03-12 09:30:00   -0.894807
2012-03-13 09:30:00    0.870856
2012-03-14 09:30:00    0.350866
2012-03-15 09:30:00   -0.537798
2012-03-16 09:30:00   -0.032265
2012-03-19 09:30:00    0.806165
2012-03-20 09:30:00    1.074257
Freq: B, dtype: float64

In [190]:
ts1 = ts[:7].tz_localize('Europe/London')

In [191]:
ts2 = ts1[2:].tz_convert('Europe/Moscow')

In [192]:
ts1

2012-03-07 09:30:00+00:00    3.144353
2012-03-08 09:30:00+00:00    0.601742
2012-03-09 09:30:00+00:00    1.112075
2012-03-12 09:30:00+00:00   -0.894807
2012-03-13 09:30:00+00:00    0.870856
2012-03-14 09:30:00+00:00    0.350866
2012-03-15 09:30:00+00:00   -0.537798
dtype: float64

In [193]:
ts2

2012-03-09 13:30:00+04:00    1.112075
2012-03-12 13:30:00+04:00   -0.894807
2012-03-13 13:30:00+04:00    0.870856
2012-03-14 13:30:00+04:00    0.350866
2012-03-15 13:30:00+04:00   -0.537798
dtype: float64

In [194]:
result = ts1 + ts2

In [195]:
result.index

DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
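
The index above shows only the labels. A variant with deterministic values (an assumption made here so the arithmetic is easy to follow) makes the alignment visible: rows are matched by instant, not by local clock time.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-03-07 09:30', periods=7, freq='B')
ts = pd.Series(np.arange(7, dtype=float), index=rng)

ts1 = ts.tz_localize('Europe/London')
ts2 = ts1.iloc[2:].tz_convert('Europe/Moscow')  # same instants, different zone

result = ts1 + ts2
print(result.index.tz)
print(result)  # the first two rows are NaN because ts2 lacks those instants
```

Where both series contain the instant, the values are added (for example 2.0 + 2.0 on 2012-03-09); elsewhere the result is missing.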

<h3>Periods and Period Arithmetic </h3>

Periods represent timespans, like days, months, quarters, or years. The Period class represents this data type, requiring a string or integer and a frequency:

In [196]:
p = pd.Period(2007, freq = 'A-DEC')

In [197]:
p

Period('2007', 'A-DEC')

In this case, the Period object represents the full timespan from January 1, 2007, to December 31, 2007, inclusive. Conveniently, adding and subtracting integers from periods has the effect of shifting by their frequency:

In [198]:
p + 5

Period('2012', 'A-DEC')

In [199]:
p - 2

Period('2005', 'A-DEC')

If two periods have the same frequency, their difference is the number of units between them:

In [200]:
pd.Period('2014', freq='A-DEC') - p

<7 * YearEnds: month=12>
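
As the output shows, this pandas version returns the difference as an offset object and not as a plain integer. Assuming that behavior, a minimal sketch with two hypothetical monthly periods reads the count of units from the offset's n attribute:

```python
import pandas as pd

start = pd.Period('2013-12', freq='M')
end = pd.Period('2014-03', freq='M')

diff = end - start
print(diff)    # an offset object spanning three monthly steps
print(diff.n)  # 3
```

Adding diff.n back to start recovers end, which is the integer shifting described earlier.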

Regular ranges of periods can be constructed with the period_range function:

In [201]:
rng = pd.period_range('2000-01-01', '2000-06-30', freq = 'M')

In [202]:
rng

PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:

In [203]:
pd.Series(np.random.randn(6), index=rng)

2000-01    0.313929
2000-02   -1.296426
2000-03    0.011889
2000-04    1.372549
2000-05    1.333078
2000-06    1.086136
Freq: M, dtype: float64

If we have an array of strings, we can also use the PeriodIndex class:

In [204]:
values = ['2001Q3', '2002Q2', '2003Q1']

In [205]:
index = pd.PeriodIndex(values, freq = 'Q-DEC')

In [206]:
index

PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

<h3>Period Frequency Conversion</h3>

Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. As an example, suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year. This is fairly straightforward:

In [207]:
p = pd.Period('2007', freq='A-DEC')

In [208]:
p

Period('2007', 'A-DEC')

In [209]:
p.asfreq('M', how = 'start')

Period('2007-01', 'M')

In [210]:
p.asfreq('M', how = 'end')

Period('2007-12', 'M')

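We can think of Period('2007', 'A-DEC') as being a sort of cursor pointing to a span of time, subdivided by monthly periods. For a fiscal year ending in a month other than December, the corresponding monthly subperiods are different: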

In [211]:
p = pd.Period('2007', freq = 'A-JUN')

In [212]:
p

Period('2007', 'A-JUN')

In [213]:
p.asfreq('M', 'start')

Period('2006-07', 'M')

In [214]:
p.asfreq('M', 'end')

Period('2007-06', 'M')

When we are converting from high to low frequency, pandas determines the superperiod depending on where the subperiod "belongs". For example, in A-JUN frequency, the month Aug-2007 is actually part of the 2008 period:

In [215]:
p = pd.Period('Aug-2007', 'M')

In [216]:
p.asfreq('A-JUN')

Period('2008', 'A-JUN')

Whole PeriodIndex objects or time series can be similarly converted with the same semantics:

In [219]:
rng = pd.period_range('2006', '2009', freq='A-DEC')

In [220]:
ts = pd.Series(np.random.randn(len(rng)), index = rng)

In [221]:
ts

2006   -0.307137
2007    0.273076
2008   -0.290661
2009    0.654858
Freq: A-DEC, dtype: float64

In [222]:
ts.asfreq('M', how='start')

2006-01   -0.307137
2007-01    0.273076
2008-01   -0.290661
2009-01    0.654858
Freq: M, dtype: float64

Here, the annual periods are replaced with monthly periods corresponding to the first month falling within each annual period. If we instead wanted the last business day of each year, we can use the 'B' frequency and indicate that we want the end of the period:

In [223]:
ts.asfreq('B', how='end')

2006-12-29   -0.307137
2007-12-31    0.273076
2008-12-31   -0.290661
2009-12-31    0.654858
Freq: B, dtype: float64

<h3>Quarterly Period Frequencies</h3>

Quarterly data is standard in accounting, finance, and other fields. Much quarterly data is reported relative to a fiscal year end, typically the last calendar or business day of one of the 12 months of the year. Thus, the period 2012Q4 has a different meaning depending on fiscal year end. 

pandas supports all 12 possible quarterly frequencies as Q-JAN through Q-DEC:

In [224]:
p = pd.Period('2012Q4', freq='Q-JAN')

In [225]:
p

Period('2012Q4', 'Q-JAN')

In the case of a fiscal year ending in January, 2012Q4 runs from November through January, which we can check by converting to daily frequency. For illustration:

![alt Text](Images/TimeSeries/quarterly_freq.png)

In [226]:
p.asfreq('D', 'start')

Period('2011-11-01', 'D')

In [227]:
p.asfreq('D', 'end')

Period('2012-01-31', 'D')
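
To make the dependence on the fiscal year end concrete, the following sketch compares the same label, 2012Q4, under two quarterly frequencies using the start_time and end_time attributes of Period. The variable names are chosen for this illustration.

```python
import pandas as pd

fy_jan = pd.Period('2012Q4', freq='Q-JAN')  # fiscal year ends in January
fy_dec = pd.Period('2012Q4', freq='Q-DEC')  # fiscal year ends in December

for p in (fy_jan, fy_dec):
    print(p.freqstr, p.start_time.date(), p.end_time.date())
# Q-JAN 2011-11-01 2012-01-31
# Q-DEC 2012-10-01 2012-12-31
```

The two periods share a label but barely overlap in calendar time, so the frequency has to travel with the data.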

Thus, it is possible to do easy period arithmetic; for example, to get the timestamp at 4 PM on the second-to-last business day of the quarter, we could do:

In [228]:
p4m = (p.asfreq('B', 'e') - 1).asfreq('T', 's')+16*60

In [229]:
p4m

Period('2012-01-30 16:00', 'T')

In [230]:
p4m.to_timestamp()

Timestamp('2012-01-30 16:00:00')

We can generate quarterly ranges using period_range. Arithmetic is identical, too:

In [231]:
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')

In [232]:
ts = pd.Series(np.arange(len(rng)), index=rng)

In [233]:
ts

2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32

<h3>Converting Timestamps to Periods (and Back)</h3>

Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method:

In [234]:
rng = pd.date_range('2000-01-01', periods = 3, freq='M')

In [235]:
rng

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31'], dtype='datetime64[ns]', freq='M')

In [236]:
ts = pd.Series(np.random.randn(3), index=rng)

In [237]:
ts

2000-01-31   -0.842077
2000-02-29    0.923627
2000-03-31    0.456855
Freq: M, dtype: float64

In [238]:
pts = ts.to_period()

In [239]:
pts

2000-01   -0.842077
2000-02    0.923627
2000-03    0.456855
Freq: M, dtype: float64

Since periods refer to non-overlapping timespans, a timestamp can only belong to a single period for a given frequency. While the frequency of the new PeriodIndex is inferred from the timestamps by default, we can specify any frequency we want. There is also no problem with having duplicate periods in the result:

In [240]:
rng = pd.date_range('1/29/2000', periods=6, freq='D')

In [241]:
rng

DatetimeIndex(['2000-01-29', '2000-01-30', '2000-01-31', '2000-02-01',
               '2000-02-02', '2000-02-03'],
              dtype='datetime64[ns]', freq='D')

In [242]:
ts2 = pd.Series(np.random.randn(6), index = rng)

In [243]:
ts2

2000-01-29    1.700939
2000-01-30    1.070403
2000-01-31   -0.305172
2000-02-01   -1.145902
2000-02-02   -0.094979
2000-02-03   -0.387212
Freq: D, dtype: float64

In [246]:
ts2.to_period('M')

2000-01    1.700939
2000-01    1.070403
2000-01   -0.305172
2000-02   -1.145902
2000-02   -0.094979
2000-02   -0.387212
Freq: M, dtype: float64
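
Duplicate periods are allowed, but they often signal that an aggregation is the next step. As a sketch with deterministic values (an assumption, so the sums can be checked by hand), grouping on the index collapses the duplicates into one row per month:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-29', periods=6, freq='D')
daily = pd.Series(np.arange(6), index=rng)

monthly = daily.to_period('M')           # index holds three copies of each month
totals = monthly.groupby(level=0).sum()  # one row per distinct period

print(monthly.index.is_unique)  # False
print(totals)                   # 2000-01 -> 3, 2000-02 -> 12
```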

To convert back to timestamps, use to_timestamp:

In [247]:
pts = ts2.to_period()

In [248]:
pts

2000-01-29    1.700939
2000-01-30    1.070403
2000-01-31   -0.305172
2000-02-01   -1.145902
2000-02-02   -0.094979
2000-02-03   -0.387212
Freq: D, dtype: float64

In [252]:
pts.to_timestamp(how='end')

2000-01-29 23:59:59.999999999    1.700939
2000-01-30 23:59:59.999999999    1.070403
2000-01-31 23:59:59.999999999   -0.305172
2000-02-01 23:59:59.999999999   -1.145902
2000-02-02 23:59:59.999999999   -0.094979
2000-02-03 23:59:59.999999999   -0.387212
Freq: D, dtype: float64

<h3>Creating a PeriodIndex from Arrays</h3>

Fixed frequency datasets are sometimes stored with timespan information spread across multiple columns. For example, in this macroeconomic dataset, the year and quarter are in different columns:

In [253]:
data = pd.read_csv('pydata-book-2nd-edition/examples/macrodata.csv')

In [255]:
data.head(5)

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [257]:
data.year.head()

0    1959.0
1    1959.0
2    1959.0
3    1959.0
4    1960.0
Name: year, dtype: float64

In [258]:
data.quarter.head()

0    1.0
1    2.0
2    3.0
3    4.0
4    1.0
Name: quarter, dtype: float64

By passing these arrays to PeriodIndex with a frequency, we can combine them to form an index for the DataFrame:

In [259]:
index = pd.PeriodIndex(year = data.year, quarter = data.quarter, freq='Q-DEC')

In [260]:
index

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')

In [262]:
data.index = index

In [263]:
data.infl

1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
          ... 
2008Q3   -3.16
2008Q4   -8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
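
Depending on the pandas version, the year and quarter keyword form may trigger a deprecation warning. One alternative that relies only on the string parsing shown earlier is to build labels such as '1959Q1' and pass them to PeriodIndex. The small DataFrame below is a hypothetical stand-in for the first rows of the macroeconomic dataset.

```python
import pandas as pd

# Hypothetical stand-in for the year and quarter columns of the dataset.
data = pd.DataFrame({'year': [1959.0, 1959.0, 1959.0, 1959.0, 1960.0],
                     'quarter': [1.0, 2.0, 3.0, 4.0, 1.0]})

# Build labels such as '1959Q1' and let PeriodIndex parse them.
labels = [f'{int(y)}Q{int(q)}' for y, q in zip(data['year'], data['quarter'])]
index = pd.PeriodIndex(labels, freq='Q-DEC')
print(index)
```

The int conversion is needed because the columns are stored as floats; without it the labels would read '1959.0Q1.0'.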

<h3>Resampling and Frequency Conversion</h3>

"Resampling" refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called <b>downsampling</b>, while converting lower frequency to higher frequency is called <b>upsampling</b>.  Not all resampling falls into either of these categories; for example, converting W-WED(weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.

pandas objects are equipped with a <b>resample</b> method, which is the workhorse function for all frequency conversion. <b>resample</b> has a similar API to groupby; we call <b>resample</b> to group the data, then call an aggregation function:

In [264]:
rng = pd.date_range('2000-01-01', periods=100, freq='D')

In [265]:
ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [266]:
ts

2000-01-01   -0.794757
2000-01-02   -1.441977
2000-01-03    0.627531
2000-01-04    0.436418
2000-01-05    0.894234
                ...   
2000-04-05    1.332205
2000-04-06    0.183241
2000-04-07   -2.463434
2000-04-08   -0.485544
2000-04-09    0.979797
Freq: D, Length: 100, dtype: float64

In [268]:
ts.resample('M').mean()

2000-01-31    0.092687
2000-02-29    0.273506
2000-03-31    0.124549
2000-04-30    0.041927
Freq: M, dtype: float64

In [269]:
ts.resample('M', kind = 'period').mean()

2000-01    0.092687
2000-02    0.273506
2000-03    0.124549
2000-04    0.041927
Freq: M, dtype: float64
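
The groupby analogy can be taken literally. As a sketch, the monthly means can also be produced by grouping each day on the calendar month it falls in; the values here are deterministic (an assumption) so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100, dtype=float), index=rng)

# Group each day by the calendar month it falls in, then aggregate.
monthly_mean = ts.groupby(ts.index.to_period('M')).mean()
print(monthly_mean)  # 15.0, 45.0, 75.0, 95.0 for January through April
```

January covers the values 0 through 30, so its mean is 15; the remaining months follow the same way.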

<b>resample</b> is a flexible and high-performance method that can be used to process very large time series.

![alt Text](Images/TimeSeries/resample.png)

<h3>Downsampling</h3>

Aggregating data to a regular, lower frequency is a pretty normal time series task. The data we're aggregating doesn't need to be fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', we need to chop up the data into one-month intervals. 

Each interval is said to be half-open; a data point can only belong to one interval, and the union of the intervals must make up the whole time frame. There are a couple of things to think about when using <b>resample</b> to downsample data:

<ul>
    <li>Which side of each interval is closed</li>
    <li>How to label each aggregated bin, either with the start of the interval or the end</li>
</ul>

To illustrate, let's look at some one-minute data:

In [270]:
rng = pd.date_range('2000-01-01', periods=12, freq='T')

In [271]:
ts = pd.Series(np.arange(12), index=rng)

In [272]:
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

Suppose we wanted to aggregate this data into five-minute chunks or bars by taking the sum of each group:

In [273]:
ts.resample('5min').sum()

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32

The frequency we pass defines the bin edges in five-minute increments. By default, the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 interval. Passing closed = 'right' changes the interval to be closed on the right:

In [274]:
ts.resample('5min', closed='right').sum()

1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

The resulting time series is labeled by the timestamps from the left side of each bin. 
By passing label = 'right' we can label them with the right bin edge:

In [275]:
ts.resample('5min', closed = 'right', label='right').sum()

2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32

The following figure illustrates minute-frequency data being resampled to five-minute frequency:

![alt Text](Images/TimeSeries/5minresample.png)
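
One way to check our understanding of closed and label is to rebuild the bins by hand. In this sketch, rounding each timestamp down to a five-minute boundary reproduces left-closed, left-labeled bins, and rounding up reproduces right-closed, right-labeled bins. The floor and ceil calls are a stand-in used for illustration, not how resample is implemented.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Left-closed, left-labeled bins: round each timestamp down to its bin start.
left = ts.groupby(ts.index.floor('5min')).sum()

# Right-closed, right-labeled bins: round each timestamp up to its bin end.
right = ts.groupby(ts.index.ceil('5min')).sum()

print(left)   # sums 10, 35, 21 labeled 00:00, 00:05, 00:10
print(right)  # sums 0, 15, 40, 11 labeled 00:00, 00:05, 00:10, 00:15
```

A timestamp that sits exactly on a boundary, such as 00:05, starts a new bin in the first case and ends the current bin in the second.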

Lastly, we might want to shift the result index by some amount, say subtracting one second from the right edge to make it more clear which interval the timestamp refers to. To do this, pass a string or date offset to loffset (newer pandas versions flag loffset as deprecated and recommend adjusting the index of the result instead):

In [277]:
ts.resample('5min', closed = 'right', label='right', loffset='-1s').sum()


1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int32
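
For pandas versions where loffset is unavailable, the same labels can be obtained by shifting the index after aggregating. This is a minimal sketch of that approach using to_offset, applied to the one-minute series from above:

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

result = ts.resample('5min', closed='right', label='right').sum()
# Shift the labels back by one second after aggregating.
result.index = result.index + to_offset('-1s')
print(result)
```

The values are untouched; only the labels move from 00:05:00 to 00:04:59 and so on.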

<h3>Open-High-Low-Close (OHLC) Resampling</h3>

In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. By using the ohlc aggregate function we will obtain a DataFrame having columns containing these four aggregates, which are efficiently computed in a single sweep of the data:

In [278]:
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

In [280]:
ts.resample('5min', closed='right', label='right').ohlc()

Unnamed: 0,open,high,low,close
2000-01-01 00:00:00,0,0,0,0
2000-01-01 00:05:00,1,5,1,5
2000-01-01 00:10:00,6,10,6,10
2000-01-01 00:15:00,11,11,11,11
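
To see what ohlc computes, the four statistics can also be requested one at a time with agg. This sketch checks that the two routes agree on the one-minute series; the column renaming is only there to line the results up.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

bins = ts.resample('5min', closed='right', label='right')
ohlc = bins.ohlc()

# The same four statistics requested one by one, in open/high/low/close order.
manual = bins.agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print((ohlc.values == manual.values).all())
```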


<h3>Upsampling and Interpolation</h3>

When converting from a low frequency to a higher frequency, no aggregation is needed. Let's consider a DataFrame with some weekly data:

In [281]:
frame = pd.DataFrame(np.random.randn(2,4),
                    index = pd.date_range('1/1/2000', periods=2,
                                         freq='W-Wed'),
                    columns = ['Colorado', 'Texas', 'New York', 'Ohio'])

In [282]:
frame

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,1.498757,0.832272,0.237795,-0.169644
2000-01-12,-0.914039,-0.090784,2.120439,1.442005


When we use an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the <b>asfreq</b> method to convert to the higher frequency without any aggregation:

In [283]:
df_daily = frame.resample('D').asfreq()

In [284]:
df_daily

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,1.498757,0.832272,0.237795,-0.169644
2000-01-06,,,,
2000-01-07,,,,
2000-01-08,,,,
2000-01-09,,,,
2000-01-10,,,,
2000-01-11,,,,
2000-01-12,-0.914039,-0.090784,2.120439,1.442005


Suppose we wanted to fill forward each weekly value on the non-Wednesdays. The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:

In [285]:
frame.resample('D').ffill()

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,1.498757,0.832272,0.237795,-0.169644
2000-01-06,1.498757,0.832272,0.237795,-0.169644
2000-01-07,1.498757,0.832272,0.237795,-0.169644
2000-01-08,1.498757,0.832272,0.237795,-0.169644
2000-01-09,1.498757,0.832272,0.237795,-0.169644
2000-01-10,1.498757,0.832272,0.237795,-0.169644
2000-01-11,1.498757,0.832272,0.237795,-0.169644
2000-01-12,-0.914039,-0.090784,2.120439,1.442005


We can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:

In [288]:
frame

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,1.498757,0.832272,0.237795,-0.169644
2000-01-12,-0.914039,-0.090784,2.120439,1.442005


In [291]:
frame.resample('D').ffill(limit = 2)

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,1.498757,0.832272,0.237795,-0.169644
2000-01-06,1.498757,0.832272,0.237795,-0.169644
2000-01-07,1.498757,0.832272,0.237795,-0.169644
2000-01-08,,,,
2000-01-09,,,,
2000-01-10,,,,
2000-01-11,,,,
2000-01-12,-0.914039,-0.090784,2.120439,1.442005


Notably, the new date index need not overlap with the old one at all:

In [292]:
frame.resample('W-THU').ffill()

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-06,1.498757,0.832272,0.237795,-0.169644
2000-01-13,-0.914039,-0.090784,2.120439,1.442005
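
Forward filling repeats the last observation. When the gaps should be bridged smoothly instead, one option is to upsample with asfreq and then interpolate. The sketch below uses a hypothetical two-row weekly frame with round numbers so the linear steps are easy to verify.

```python
import pandas as pd

weekly = pd.DataFrame({'Colorado': [0.0, 7.0], 'Texas': [10.0, 3.0]},
                      index=pd.date_range('2000-01-01', periods=2, freq='W-WED'))

# Upsample to daily frequency (introducing NaN), then interpolate linearly.
daily = weekly.resample('D').asfreq().interpolate(method='linear')
print(daily)
```

Colorado climbs by one per day from 0 to 7 and Texas falls by one per day from 10 to 3, filling the six days between the two Wednesdays.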


<h3>Resampling with Periods</h3>

Resampling data indexed by periods is similar to timestamps:

In [293]:
frame = pd.DataFrame(np.random.randn(24,4),
                    index = pd.period_range('1-2000', '12-2001',freq='M'),
                    columns = ['Colorado', 'Texas', 'New York', 'Ohio'])

In [294]:
frame[:5]

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01,-0.216255,-1.273938,1.032413,-0.077456
2000-02,-0.017401,0.608532,-0.873724,0.035283
2000-03,2.029037,0.466778,1.538876,0.216642
2000-04,0.478243,0.495075,-1.436132,0.329972
2000-05,0.611755,1.36858,-1.121545,-0.016993


In [295]:
annual_frame = frame.resample('A-DEC').mean()

In [296]:
annual_frame

Unnamed: 0,Colorado,Texas,New York,Ohio
2000,0.324212,0.113867,-0.139337,0.031556
2001,-0.45495,0.141913,-0.587019,0.126417


Upsampling is more nuanced, as we must make a decision about which end of the timespan in the new frequency to place the values before resampling, just like the asfreq method. The convention argument defaults to 'start' but can also be 'end':

In [297]:
annual_frame.resample('Q-DEC').ffill()

Unnamed: 0,Colorado,Texas,New York,Ohio
2000Q1,0.324212,0.113867,-0.139337,0.031556
2000Q2,0.324212,0.113867,-0.139337,0.031556
2000Q3,0.324212,0.113867,-0.139337,0.031556
2000Q4,0.324212,0.113867,-0.139337,0.031556
2001Q1,-0.45495,0.141913,-0.587019,0.126417
2001Q2,-0.45495,0.141913,-0.587019,0.126417
2001Q3,-0.45495,0.141913,-0.587019,0.126417
2001Q4,-0.45495,0.141913,-0.587019,0.126417


In [298]:
annual_frame.resample('Q-DEC', convention='end').ffill()

Unnamed: 0,Colorado,Texas,New York,Ohio
2000Q4,0.324212,0.113867,-0.139337,0.031556
2001Q1,0.324212,0.113867,-0.139337,0.031556
2001Q2,0.324212,0.113867,-0.139337,0.031556
2001Q3,0.324212,0.113867,-0.139337,0.031556
2001Q4,-0.45495,0.141913,-0.587019,0.126417


Since periods refer to timespans, the rules about upsampling and downsampling are more rigid:

<ul>
    <li>In downsampling, the target frequency must be a superperiod of the source frequency.</li>
    <li>In upsampling, the target frequency must be a subperiod of the source frequency.</li>
</ul>

If these rules are not satisfied, an exception will be raised. This mainly affects the quarterly, annual, and weekly frequencies; for example, the timespans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC:

In [299]:
annual_frame.resample('Q-MAR').ffill()

Unnamed: 0,Colorado,Texas,New York,Ohio
2000Q4,0.324212,0.113867,-0.139337,0.031556
2001Q1,0.324212,0.113867,-0.139337,0.031556
2001Q2,0.324212,0.113867,-0.139337,0.031556
2001Q3,0.324212,0.113867,-0.139337,0.031556
2001Q4,-0.45495,0.141913,-0.587019,0.126417
2002Q1,-0.45495,0.141913,-0.587019,0.126417
2002Q2,-0.45495,0.141913,-0.587019,0.126417
2002Q3,-0.45495,0.141913,-0.587019,0.126417
