# Chapter 11. Time Series

Time series data is an important form of structured data in many fields, such as finance, economics, ecology, and neuroscience.

Many time series are *fixed frequency*, meaning that data points occur at regular intervals.

Time series can also be *irregular*, meaning that the intervals between data points follow no fixed pattern. Depending on the application, the time dimension can be represented in one of several ways:

* Timestamps: specific instants in time.
* Fixed periods, such as the month of January 2007 or the full year 2010.
* Intervals of time, indicated by a start and end timestamp.
* Experiment or elapsed time, where each timestamp is a measure of time relative to a particular start time.

## Date and Time Data types and Tools

The Python standard library includes data types for date and time data, as well as calendar-related functionality. The datetime module is the one used most often here:

In [None]:
from datetime import datetime
now = datetime.now()
now

In [None]:
now.year

In [None]:
now.month

In [None]:
now.day

datetime objects store both the date and the time down to microsecond precision.

You can also do arithmetic on datetime objects. Adding a timedelta shifts a datetime by that amount; timedelta(2) below is two days:

In [None]:
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(2)
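
Subtraction works in the other direction: the difference between two datetime objects is a timedelta. The following is a small sketch using only the standard library; the specific dates are illustrative and not part of the original notebook.

```python
from datetime import datetime, timedelta

# The difference between two datetimes is a timedelta,
# stored as days, seconds, and microseconds.
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
print(delta.days, delta.seconds)

# A timedelta can be multiplied before it is added to a datetime.
start = datetime(2011, 1, 7)
shifted = start + 2 * timedelta(days=12)
print(shifted)
```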

### Converting between string and datetime

In [None]:
stamp = datetime(2011, 1, 3)
str(stamp)

In [None]:
stamp.strftime('%Y-%m-%d')

In [None]:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')
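
The same format codes work in both directions, so strptime and strftime can be combined to normalize a batch of date strings. A minimal sketch, assuming month/day/year input strings chosen for illustration:

```python
from datetime import datetime

# Illustrative input: month/day/year strings without zero padding.
datestrs = ['7/6/2011', '8/6/2011']

# Parse each string with an explicit format...
parsed = [datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

# ...and format the results back out; %m and %d are zero-padded on output.
roundtrip = [d.strftime('%m/%d/%Y') for d in parsed]
print(parsed)
print(roundtrip)
```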

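If you don't want to write a format string each time, the parse function from the third-party dateutil package can infer the format of most common date representations: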

In [None]:
from dateutil.parser import parse
parse('2011-01-03')

In [None]:
parse('Jan 31, 1997 10:45 PM')

For dates written with the day before the month, pass dayfirst=True:

In [None]:
parse('6/12/2011 10:45 PM', dayfirst=True)

pandas is generally oriented toward working with arrays of dates, whether used as an axis index or a column in a DataFrame. The *to_datetime* function parses many different kinds of date representations. Standard date formats like ISO 8601 can be parsed very quickly.

In [None]:
import pandas as pd
datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
pd.to_datetime(datestrs)
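
to_datetime also handles values that should be treated as missing. The sketch below appends None to the list above to show that it becomes NaT ("not a time"), the pandas null value for timestamp data; the variable names are illustrative.

```python
import pandas as pd

datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']

# None is converted to NaT rather than raising an error.
idx = pd.to_datetime(datestrs + [None])
print(idx)

# pd.isna flags the missing entry.
missing = pd.isna(idx)
print(missing)
```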

## Time Series Basics

A basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside pandas as Python strings or datetime objects:

In [None]:
from datetime import datetime
import numpy as np
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts

Under the hood, these datetime objects have been put in a DatetimeIndex:

In [None]:
ts.index

Like other Series, arithmetic operations between differently indexed time series automatically align on these dates:

In [None]:
ts + ts[::2]

Recall that [::2] selects every second element in ts.
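
Because the series above holds random values, the alignment is easier to see with fixed numbers. This sketch reuses the same dates with illustrative values: dates present in both operands are summed, and dates missing from one operand produce NaN.

```python
from datetime import datetime
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]
demo = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], index=dates)

# demo[::2] keeps the 1st, 3rd, and 5th dates only, so the
# other three dates have no partner and come out as NaN.
aligned = demo + demo[::2]
print(aligned)
```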

pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:

In [None]:
ts.index.dtype

Scalar values from a DatetimeIndex are pandas Timestamp objects:

In [None]:
stamp = ts.index[0]
stamp

### Indexing, Selection, Subsetting

A time series behaves like any other pandas Series when you index and select data based on label:

In [None]:
stamp = ts.index[2]
stamp

In [None]:
ts[stamp]

As a convenience, you can also pass a string that is interpretable as a date:

In [None]:
ts['1/10/2011']

In [None]:
ts['20110110']

For longer time series, a year or only a year and a month can be passed to easily select slices of data:

In [None]:
longer_ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
longer_ts

In [None]:
longer_ts['2001']

In [None]:
longer_ts['2001-05']

Indexing with datetime objects works as well:

In [None]:
ts[datetime(2011, 1, 7)]

Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:

In [None]:
ts['1/6/2011':'1/11/2011']
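
With fixed values it is easy to check which rows a range query returns. The sketch below uses the same six dates with illustrative integer values, and also shows the truncate method, which slices a Series between two dates in a similar way.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
demo = pd.Series(range(6), index=dates)

# Neither endpoint is in the index; rows between them are returned.
window = demo.loc['2011-01-06':'2011-01-11']
print(window)

# truncate drops everything after the given date.
head = demo.truncate(after='2011-01-09')
print(head)
```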

### Time series with duplicate indices

In [None]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/3/2000'])
dates.is_unique
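
When an index has repeated timestamps, label-based selection returns either a scalar or a slice depending on whether the label is duplicated. One common way to collapse duplicates is to group on the index level. A small sketch with illustrative values:

```python
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series([0, 1, 2, 3], index=dates)

# A label that occurs once yields a scalar...
single = dup_ts.loc[pd.Timestamp('2000-01-01')]

# ...while a repeated label yields a Series with every match.
repeated = dup_ts.loc[pd.Timestamp('2000-01-02')]

# Grouping on level 0 (the only index level) aggregates duplicates.
daily_mean = dup_ts.groupby(level=0).mean()
print(daily_mean)
```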

## Date Ranges, Frequencies and Shifting

In [None]:
ts

In [None]:
resampler = ts.resample('D')
resampler

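Calling resample('D') prepares the irregular sample series for conversion to a fixed daily frequency; the string 'D' is interpreted as daily frequency.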

### Generating Date Ranges

Although it was used earlier without explanation, pandas.date_range generates a *DatetimeIndex* of a given length according to a particular frequency:

In [None]:
index = pd.date_range('2011-04-01', '2012-06-01')
index

By default, date_range generates daily timestamps. If you pass only a start or an end date, you must also pass the number of periods to generate:

In [None]:
pd.date_range(start='2012-04-01', periods=20)

In [None]:
pd.date_range(end='2012-06-01', periods=20)

The start and end dates define strict boundaries for the generated date index. For example, if you want a date index containing the last business day of each month, pass the 'BM' frequency (business end of month; recent pandas releases rename this alias to 'BME'). Only dates falling on or inside the date interval are included. pandas supports many other frequency aliases.

In [None]:
pd.date_range('2000-01-01', '2000-12-01', freq='BM')
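
One detail worth knowing: if the start or end date carries a time of day, date_range keeps it by default. The normalize option sets the generated timestamps to midnight instead. A brief sketch with an illustrative start time:

```python
import pandas as pd

# The time component of the start date is preserved by default.
rng_raw = pd.date_range('2012-05-02 12:56:31', periods=5)
print(rng_raw)

# normalize=True snaps every generated timestamp to midnight.
rng_norm = pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
print(rng_norm)
```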

### Frequencies and Date Offsets

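Frequencies in pandas are composed of a *base frequency* and a multiplier. Base frequencies are typically referred to by a string alias, such as 'h' for hourly, and each one has a corresponding *date offset* object: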

In [None]:
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour

In [None]:
four_hours = Hour(4)
four_hours

In [None]:
pd.date_range('2000-01-01', '2000-01-03', freq='4h')
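
The string form and the offset object are interchangeable, and a frequency string can combine several units. The sketch below is illustrative; it assumes a pandas version that accepts the lowercase 'h' alias, as the cell above does.

```python
import pandas as pd
from pandas.tseries.offsets import Hour

# A multiplier in the string is equivalent to an offset object.
by_string = pd.date_range('2000-01-01', periods=4, freq='4h')
by_object = pd.date_range('2000-01-01', periods=4, freq=Hour(4))
print(by_string.equals(by_object))

# Units can be combined in a single frequency string.
ninety_min = pd.date_range('2000-01-01', periods=4, freq='1h30min')
print(ninety_min)
```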

## Time Zone Handling

Working with time zones is generally considered one of the most unpleasant parts of time series manipulation. For this reason, many users choose to work with time series in Coordinated Universal Time (UTC), the current international standard. In Python, time zone information comes from the third-party pytz library:

In [None]:
import pytz
pytz.common_timezones[-5:]

To get a time zone object from pytz, use pytz.timezone:

In [None]:
tz = pytz.timezone('America/New_York')
tz

### Time Zone Localization and Conversion

In [None]:
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [None]:
print(ts.index.tz)

In [None]:
pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')

Conversion from naive to localized is handled by the tz_localize method:

In [None]:
ts_utc = ts.tz_localize('UTC')
ts_utc

In [None]:
ts_utc.index

Once a time series has been localized to a particular time zone, it can be converted to another time zone with tz_convert:

In [None]:
ts_utc.tz_convert('Europe/Berlin')

The preceding time series straddles a DST transition in the America/New_York time zone. We could localize it to US Eastern time and convert to, say, UTC or Berlin time:

In [None]:
ts_eastern = ts.tz_localize('America/New_York')
ts_eastern.tz_convert('UTC')

In [None]:
ts_eastern.tz_convert('Europe/Berlin')
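
The DST transition shows up clearly when you look at the converted hours. In this sketch (fixed illustrative values, same date range as above), 9:30 Eastern maps to 14:30 UTC before the switch on March 11, 2012 and to 13:30 UTC afterward, while the underlying instants are unchanged by the conversion.

```python
import pandas as pd

rng = pd.date_range('2012-03-09 09:30', periods=6, freq='D')
naive = pd.Series(range(6), index=rng)

# tz_localize attaches a zone to naive timestamps (wall-clock time is kept).
eastern = naive.tz_localize('America/New_York')

# tz_convert re-expresses the same instants in another zone.
utc = eastern.tz_convert('UTC')
print(utc.index.hour.tolist())

# Comparing the two indexes compares instants, not wall-clock times.
same_instants = bool((eastern.index == utc.index).all())
print(same_instants)
```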

### Operations with Time Zone-Aware timestamp objects

Like time series and date ranges, individual *Timestamp* objects can be localized from naive to time zone-aware and converted from one time zone to another:

In [None]:
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')

You can also pass a time zone when creating the Timestamp:

In [None]:
stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
stamp_moscow

Time zone-aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the Unix epoch (January 1, 1970); this UTC value is invariant between time zone conversions:

In [None]:
stamp_utc.value

In [None]:
stamp_utc.tz_convert('America/New_York').value

When performing time arithmetic using pandas DateOffset objects, pandas respects daylight saving time transitions where possible. Here we construct timestamps that occur right before DST transitions (forward and backward). First, 30 minutes before transitioning to DST:

In [None]:
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-11 01:30', tz='US/Eastern')
stamp

In [None]:
stamp + Hour()

Then, 90 minutes before transitioning out of DST:

In [None]:
stamp = pd.Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
stamp

In [None]:
stamp + 2 * Hour()

### Operations between different time zones

If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion:

In [None]:
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

In [None]:
ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index

## Periods and Period Arithmetic

Periods represent timespans, like days, months, quarters, or years. The Period class represents this data type, requiring a string or integer and a frequency.

In [None]:
p = pd.Period(2007, freq='A-DEC')
p

In this case, the *Period* object represents the full timespan from January 1, 2007, to December 31, 2007, inclusive. Conveniently, adding and subtracting integers from periods has the effect of shifting by their frequency:

In [None]:
p + 5

In [None]:
p - 2
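
The same integer arithmetic applies at other frequencies. The following sketch uses a monthly period (an illustrative choice) and also shows the start_time and end_time attributes, which expose the boundaries of the span a period covers.

```python
import pandas as pd

month = pd.Period('2007-01', freq='M')

# Adding or subtracting an integer shifts by whole months.
later = month + 5
earlier = month - 2
print(later, earlier)

# Each period covers a span with a first and last instant.
print(month.start_time)
print(month.end_time)
```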

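If two periods have the same frequency, their difference is the number of units between them (an integer in older pandas versions; newer versions return the same count wrapped in a DateOffset object):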

In [None]:
pd.Period('2014', freq='A-DEC') - p

Regular ranges of periods can be constructed with the period_range function:

In [None]:
rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
rng

The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:

In [None]:
pd.Series(np.random.randn(6), index=rng)

If you have an array of strings, you can also use the PeriodIndex class:

In [None]:
values = ['2001Q3', '2002Q2', '2003Q1']
index = pd.PeriodIndex(values, freq='Q-DEC')
index

### Period Frequency Conversion

Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. 

In [None]:
p = pd.Period('2007', freq='A-DEC')
p

In [None]:
p.asfreq('M', how='start')

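You can think of Period('2007', 'A-DEC') as being a sort of cursor pointing to a span of time, subdivided by monthly periods. For a fiscal year ending in a month other than December, the corresponding monthly subperiods are different: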

In [None]:
p = pd.Period('2007', freq='A-JUN')
p

In [None]:
p.asfreq('M', 'start')

In [None]:
p.asfreq('M', 'end')

When you are converting from high to low frequency, pandas determines the superperiod depending on where the subperiod "belongs". For example, in A-JUN frequency, the month Aug-2007 is actually part of the 2008 period. 

In [None]:
p = pd.Period('Aug-2007', 'M')
p.asfreq('A-JUN')

Whole PeriodIndex objects or time series can be similarly converted with the same semantics:

In [None]:
rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

In [None]:
ts.asfreq('M', how='start')

In [None]:
ts.asfreq('B', how='end')

### Quarterly Period Frequencies 

Quarterly data is standard in accounting, finance, and other fields. Much quarterly data is reported relative to a *fiscal year end*, typically the last calendar or business day of one of the 12 months of the year. Thus, the period 2012Q4 has a different meaning depending on the fiscal year end. pandas supports all 12 possible quarterly frequencies as Q-JAN through Q-DEC:

In [None]:
p = pd.Period('2012Q4', freq='Q-JAN')
p

In the case of a fiscal year ending in January, 2012Q4 runs from November 2011 through January 2012, which you can check by converting to daily frequency:

In [None]:
p.asfreq('D', 'start')

In [None]:
p.asfreq('D', 'end')

This makes period arithmetic straightforward; for example, to get the timestamp at 4 PM on the second-to-last business day of the quarter, you could do:

In [None]:
p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
p4pm

In [None]:
p4pm.to_timestamp()

You can generate quarterly ranges using period_range. Arithmetic is identical, too:

In [None]:
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts = pd.Series(np.arange(len(rng)), index=rng)
ts

In [None]:
new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
ts.index = new_rng.to_timestamp()
ts

### Converting timestamps to periods (and back)

Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method:

In [None]:
rng = pd.date_range('2000-01-01', periods=3, freq='M')
ts = pd.Series(np.random.randn(3), index=rng)
ts

In [None]:
pts = ts.to_period()
pts
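
Because periods refer to non-overlapping timespans, each timestamp belongs to exactly one period at a given frequency, but several timestamps can map to the same period. The sketch below uses illustrative daily data spanning a month boundary, and then converts back with to_timestamp, which by default returns the start of each period.

```python
import pandas as pd

rng = pd.date_range('2000-01-29', periods=6, freq='D')
daily = pd.Series(range(6), index=rng)

# Several days fall in the same month, so the periods repeat.
monthly = daily.to_period('M')
print(monthly)

# Going back to timestamps uses the start of each period by default.
back = monthly.to_timestamp()
print(back)
```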

### Creating a PeriodIndex from Arrays

Fixed frequency datasets are sometimes stored with timespan information spread across multiple columns. For example, in this macroeconomic dataset, the year and quarter are in different columns:

In [None]:
data = pd.read_csv('examples/macrodata.csv')
data.head(5)

In [None]:
data.year

In [None]:
data.quarter

By passing these arrays to PeriodIndex with a frequency, you can combine them to form an index for the DataFrame:

In [None]:
index = pd.PeriodIndex(year=data.year, quarter=data.quarter, freq='Q-DEC')
index

In [None]:
data.index = index
data.infl

## Resampling and Frequency Conversion

Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called *downsampling*, while converting lower frequency to higher frequency is called *upsampling*.

Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.

Pandas objects are equipped with a *resample* method, which is the workhorse function for all frequency conversion. *resample* has a similar API to *groupby*; you call resample to group the data, then call an aggregation function:

In [None]:
rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

In [None]:
ts.resample('M').mean()

In [None]:
ts.resample('M', kind='period').mean()

### Downsampling

Aggregating data to a regular, lower frequency is a common time series task. The data you're aggregating doesn't need to be fixed frequency. The desired frequency defines bin edges that are used to slice the time series into pieces to aggregate.

In [None]:
rng = pd.date_range('2000-01-01', periods=12, freq='T')
ts = pd.Series(np.arange(12), index=rng)
ts

Suppose you wanted to aggregate this data into five-minute chunks or bars by taking the sum of each group:

In [None]:
ts.resample('5min').sum()

The frequency you pass defines bin edges in five-minute increments. By default, the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 interval. Passing closed='right' changes the interval to be closed on the right:

In [None]:
ts.resample('5min', closed='right').sum()

The resulting time series is labeled by the timestamps from the left side of each bin. By passing label='right' you can label them with the right bin edge:

In [None]:
ts.resample('5min', closed='right', label='right').sum()
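
To make the effect of closed and label concrete, this sketch rebuilds the twelve one-minute values (0 through 11) and compares the three variants. The variable names are illustrative, and 'min' is used as the minute alias.

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
minutes = pd.Series(range(12), index=rng)

# Default: bins are closed on the left, e.g. [00:00, 00:05).
left = minutes.resample('5min').sum()

# closed='right': bins become (23:55, 00:00], (00:00, 00:05], ...
right = minutes.resample('5min', closed='right').sum()

# label='right': same bins, but labeled by their right edge.
right_labeled = minutes.resample('5min', closed='right', label='right').sum()

print(left)
print(right)
print(right_labeled)
```

With right-closed bins the 00:00 value falls into a bin of its own that starts at 23:55 on the previous day, which is why that variant has four rows instead of three.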

### Upsampling and Interpolation

When converting from a low frequency to a higher frequency, no aggregation is needed. Let's consider a DataFrame with some weekly data:

In [None]:
frame = pd.DataFrame(np.random.randn(2, 4), index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame

In [None]:
df_daily = frame.resample('D').asfreq()
df_daily

In [None]:
frame.resample('D').ffill()
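
In the cells above, asfreq introduces the new daily rows with missing values, and ffill carries each weekly value forward into the gaps. The sketch below uses fixed illustrative values and two columns to show both, plus the limit argument, which caps how many consecutive rows are filled.

```python
import pandas as pd

weekly = pd.DataFrame(
    {'Colorado': [1.0, 2.0], 'Texas': [3.0, 4.0]},
    index=pd.date_range('2000-01-01', periods=2, freq='W-WED'))

# asfreq only changes the frequency; new rows are NaN.
daily = weekly.resample('D').asfreq()
print(daily)

# Forward fill, but only for the first two missing days after each value.
filled = weekly.resample('D').ffill(limit=2)
print(filled)
```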

## Moving Window Functions

An important class of array transformations used for time series operations are statistics and other functions evaluated over a sliding window or with exponentially decaying weights. This can be useful for smoothing noisy or gappy data.

In [None]:
close_px_all = pd.read_csv('examples/stock_px_2.csv', parse_dates=True, index_col=0)
close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]
close_px = close_px.resample('B').ffill()

In [None]:
close_px.AAPL.plot()
close_px.AAPL.rolling(250).mean().plot()

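The rolling call above behaves like groupby, but instead of grouping it creates a window of 250 observations (business days here) that slides across the series, and mean computes the average within each window. By default, a window produces a value only when all of its observations are non-missing; min_periods relaxes this requirement: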

In [None]:
appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()
appl_std250.plot()

To calculate an *expanding window mean*, where the window starts at the beginning of the series and grows until it covers the whole series, use the expanding operator:

In [None]:
expanding_mean = appl_std250.expanding().mean()

In [None]:
close_px.rolling(60).mean().plot(logy=True)
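
The stock price file is not needed to see how the window variants differ. This sketch applies them to a tiny illustrative series with a window of 3.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A full window is required by default, so the first two values are NaN.
strict = s.rolling(3).mean()

# min_periods=1 lets partial windows at the start produce a value.
lenient = s.rolling(3, min_periods=1).mean()

# expanding() grows the window from the start of the series.
growing = s.expanding().mean()

print(pd.DataFrame({'strict': strict, 'lenient': lenient, 'growing': growing}))
```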

### Exponentially Weighted Functions

An alternative to using a static window size with equally weighted observations is to specify a constant decay factor that gives more weight to more recent observations. This is done with exponentially weighted statistics, available through the ewm operator:

In [None]:
import matplotlib.pyplot as plt
aapl_px = close_px.AAPL['2006':'2007']
ma60 = aapl_px.rolling(30, min_periods=20).mean()
ewma60 = aapl_px.ewm(span=30).mean()
ma60.plot(style='k--', label='Simple MA')
ewma60.plot(style='k-', label='EW MA')
plt.legend()
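
For reference, a span corresponds to a smoothing factor of alpha = 2 / (span + 1). The sketch below checks this on a tiny illustrative series by comparing ewm with a hand-written recursion. It passes adjust=False so that the simple recursive form applies; the cell above uses the default adjust=True, which weights the early observations differently.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

# Recursive form: y[t] = alpha * x[t] + (1 - alpha) * y[t-1]
ewm_mean = s.ewm(span=span, adjust=False).mean()

manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

print(ewm_mean.tolist())
print(manual)
```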

### Binary Moving Window Functions

Some statistical operators, like correlation and covariance, need to operate on two time series. As an example, financial analysts are often interested in a stock's correlation to a benchmark index like the S&P 500. To have a look at this, we first compute the percent change for all of our time series of interest:

In [None]:
spx_px = close_px_all['SPX']
spx_rets = spx_px.pct_change()
returns = close_px.pct_change()
corr = returns.AAPL.rolling(125, min_periods=100).corr(spx_rets)
corr.plot()

In [None]:
corr = returns.rolling(125, min_periods=100).corr(spx_rets)
corr.plot()
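
A quick sanity check of rolling correlation, using made-up series rather than the stock data: a series that is an increasing linear function of another should have a rolling correlation of 1 in every full window, and a negated copy should give -1.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
y = 2 * x + 1   # perfectly positively related to x
z = -x          # perfectly negatively related to x

# Correlation over a sliding window of 3 observations.
pos = x.rolling(3).corr(y)
neg = x.rolling(3).corr(z)

print(pos)
print(neg)
```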

### User-Defined Moving Window Functions

The apply method on rolling and related methods provides a means to apply an array function of your own devising over a moving window. The only requirement is that the function produce a single value (a reduction) from each piece of the array. For example, the following computes the percentile rank of a 2% return within each 250-day window of AAPL returns:

In [None]:
from scipy.stats import percentileofscore

score_at_2percent = lambda x: percentileofscore(x, 0.02)
result = returns.AAPL.rolling(250).apply(score_at_2percent)
result.plot()
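
The same pattern works with any reduction and does not require SciPy. As a smaller sketch on illustrative data, the function below (a hypothetical helper named value_range) returns the spread between the largest and smallest value in each window; raw=True passes each window to the function as a NumPy array.

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])

def value_range(window):
    # Reduce one window (a NumPy array) to a single number.
    return window.max() - window.min()

spread = s.rolling(3).apply(value_range, raw=True)
print(spread)
```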