In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import pandas as pd
from pandas import DataFrame, Series
from datetime import datetime, timedelta

Interesting thing from the start: he lays out three ways of referring to time that he's going to discuss in the chapter (timestamps, fixed periods, and intervals, with periods being a special case of intervals), and then also mentions 'experiment or elapsed time', where each timestamp is a measure of time relative to a particular start time, like the number of hours since a machine first booted. He focuses on the first three, but says you can 'apply many of the techniques' in the chapter when the index is an integer or floating point number giving the elapsed time from the start of the experiment. Ok.
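To make the 'elapsed time' idea concrete for myself, here is a minimal sketch of a Series indexed by hours since a start point. The readings and variable names are made up for illustration; they are not from the book.

```python
import numpy as np
import pandas as pd

# Hypothetical experiment: one reading every half hour after a machine boots.
hours_since_boot = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
readings = pd.Series([20.1, 20.4, 21.0, 21.3, 21.9], index=hours_since_boot)

# Label-based slicing on the (sorted) float index includes both endpoints.
window = readings.loc[0.5:1.5]

# The same offsets can be converted to a TimedeltaIndex if that is more convenient.
elapsed = pd.to_timedelta(hours_since_boot, unit="h")
readings_td = pd.Series(readings.values, index=elapsed)
```

The float index keeps things simple; the timedelta version is only worth it if you later want to add the offsets to a real start datetime.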

# Date and time data types and tools

In [3]:
now = datetime.now()
now

datetime.datetime(2015, 9, 5, 8, 30, 37, 690169)

In [4]:
now.year, now.month, now.day

(2015, 9, 5)

In [5]:
delta = datetime(2011,1,7) - datetime(2008,6,24,8,15)
delta

datetime.timedelta(926, 56700)

In [6]:
delta.days, delta.seconds

(926, 56700)

In [7]:
start = datetime(2011,1,7)
start + timedelta(12) # 12 days

datetime.datetime(2011, 1, 19, 0, 0)

In [8]:
start - 2 * timedelta(12)

datetime.datetime(2010, 12, 14, 0, 0)

## Converting between string and datetime

In [9]:
stamp = datetime(2011, 1, 3)

In [10]:
str(stamp)

'2011-01-03 00:00:00'

In [11]:
stamp.strftime('%Y-%m-%d') # strftime = "string format time": datetime -> str

'2011-01-03'

In [12]:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d') # strptime = "string parse time": str -> datetime

datetime.datetime(2011, 1, 3, 0, 0)

In [13]:
datestrs = ['7/6/2011','8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]

datetime.strptime is the 'best' way to parse dates when the format is known in advance. The third-party dateutil package has a parser.parse function that is useful when the format isn't known (or isn't fully known).

The parse method can also sometimes return a date when you might not want it to - for example, it'll turn '42' into the year 2042 w/ today's calendar date.
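One way to avoid surprising guesses is to stick with strptime and try a short list of explicit formats. The sketch below uses only the standard library; `parse_known` and its default format list are hypothetical names I made up, not part of datetime or dateutil.

```python
from datetime import datetime

def parse_known(value, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Try each explicit format in turn; raise ValueError if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"no known format matches {value!r}")

parsed_iso = parse_known("2011-01-03")
parsed_us = parse_known("7/6/2011")

# A bare number matches neither format, so it is rejected instead of guessed at.
try:
    parse_known("42")
    rejected = False
except ValueError:
    rejected = True
```

The trade-off is that every accepted format has to be listed up front, which is exactly the situation where strptime is the right tool anyway.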

In [14]:
from dateutil.parser import parse

In [15]:
parse('2011-01-03')

datetime.datetime(2011, 1, 3, 0, 0)

In [16]:
parse('Jan 31, 1997 10:45 PM')

datetime.datetime(1997, 1, 31, 22, 45)

In [17]:
parse('6/12/2011', dayfirst=True)
# parse guesses the format and reads ambiguous dates month-first by default;
# dayfirst=True tells it the day comes first

datetime.datetime(2011, 12, 6, 0, 0)

In [18]:
datestrs

['7/6/2011', '8/6/2011']

In [19]:
pd.to_datetime(datestrs)

DatetimeIndex(['2011-07-06', '2011-08-06'], dtype='datetime64[ns]', freq=None, tz=None)

In [20]:
idx = pd.to_datetime(datestrs + [None])
idx

DatetimeIndex(['2011-07-06', '2011-08-06', 'NaT'], dtype='datetime64[ns]', freq=None, tz=None)

In [21]:
idx[2]

NaT

In [22]:
pd.isnull(idx)

array([False, False,  True], dtype=bool)

# Time series basics

In [25]:
dates = [datetime(2011,1,2), datetime(2011,1,5), 
         datetime(2011,1,7), datetime(2011,1,8),
         datetime(2011,1,10), datetime(2011,1,12)]
ts = Series(np.random.randn(6), index=dates)
ts

2011-01-02   -0.819811
2011-01-05    0.851267
2011-01-07    1.014123
2011-01-08    0.677080
2011-01-10    1.291552
2011-01-12   -0.981119
dtype: float64

In [26]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None, tz=None)

In [27]:
type(ts)

pandas.core.series.Series

In [28]:
ts[::2]

2011-01-02   -0.819811
2011-01-07    1.014123
2011-01-10    1.291552
dtype: float64

As with all Series, arithmetic between differently-indexed time series automatically aligns on the dates. (In the example below, the 1/5, 1/8, and 1/12 values have no partner in ts[::2], so each sum is <number> + missing, which is defined to be NaN.)

In [29]:
ts + ts[::2]

2011-01-02   -1.639623
2011-01-05         NaN
2011-01-07    2.028247
2011-01-08         NaN
2011-01-10    2.583105
2011-01-12         NaN
dtype: float64
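A smaller, deterministic version of the alignment above, plus one way to avoid the NaNs if that's what you want. This is a sketch with made-up values; `Series.add` with `fill_value` treats a missing partner as the given value instead of propagating NaN.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07"])
a = pd.Series([1.0, 2.0, 3.0], index=dates)
b = a.iloc[::2]                   # keeps 2011-01-02 and 2011-01-07 only

plain = a + b                     # 2011-01-05 has no partner in b, so it becomes NaN
filled = a.add(b, fill_value=0)   # treat the missing partner as 0 instead
```

Whether filling with 0 is sensible depends on the data; for measurements, NaN is often the more honest answer.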

In [30]:
ts.index.dtype

dtype('<M8[ns]')

In [31]:
type(ts.index[0])

pandas.tslib.Timestamp

As shown above, each value of the index is a Timestamp instance. Timestamp can be used anywhere a datetime would be used. It adds the ability 'to store frequency information' (explained below, I think), and knows how to do time zone conversions and 'other kinds of manipulations'.
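A quick check of the 'can be used anywhere a datetime would be used' claim, as a sketch assuming a reasonably recent pandas. Timestamp is a subclass of datetime, so the usual comparisons and arithmetic work, and it picks up time zone methods on top.

```python
from datetime import datetime
import pandas as pd

stamp = pd.Timestamp("2011-01-07")

is_datetime = isinstance(stamp, datetime)      # Timestamp subclasses datetime
same_instant = stamp == datetime(2011, 1, 7)   # compares equal to a plain datetime

# Time zone handling: attach UTC to the naive timestamp.
utc_stamp = stamp.tz_localize("UTC")

# Ordinary date arithmetic also works.
next_day = stamp + pd.Timedelta(days=1)
```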

## Indexing, selection, subsetting

In [34]:
stamp = ts.index[2]
stamp

Timestamp('2011-01-07 00:00:00')

In [33]:
ts[stamp]

1.0141233684971809

In [35]:
ts['1/10/2011']

1.2915524743071556

In [36]:
ts['20110110']

1.2915524743071556

In [37]:
ts['2011-01-10']

1.2915524743071556

In [41]:
longer_ts = Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
longer_ts[-6:]

2002-09-21    1.843199
2002-09-22    0.752557
2002-09-23    1.386551
2002-09-24   -0.326208
2002-09-25   -0.532932
2002-09-26   -1.215790
Freq: D, dtype: float64

In [42]:
longer_ts.tail()

2002-09-22    0.752557
2002-09-23    1.386551
2002-09-24   -0.326208
2002-09-25   -0.532932
2002-09-26   -1.215790
Freq: D, dtype: float64

A Series with a DatetimeIndex can be indexed as above with a string (in any of several formats) that resolves to a particular day. You can also use strings that resolve to only a year, or a year and a month; for these you'll get back multiple days. (This also means that if your DatetimeIndex includes multiple rows for a particular day, then even a string resolving to a single day returns multiple rows.)

In [43]:
longer_ts['2001']

2001-01-01    2.108166
2001-01-02    0.269148
2001-01-03    0.500549
2001-01-04   -1.676267
2001-01-05    0.558681
2001-01-06    0.209247
2001-01-07   -0.034517
2001-01-08   -0.887975
2001-01-09    0.324276
2001-01-10    0.737532
2001-01-11   -0.279957
2001-01-12   -0.364110
2001-01-13   -0.350812
2001-01-14   -0.007624
2001-01-15    0.390663
2001-01-16   -0.089332
2001-01-17   -0.940297
2001-01-18   -1.007832
2001-01-19   -0.154978
2001-01-20    0.233089
2001-01-21    0.782682
2001-01-22   -0.842832
2001-01-23    0.737489
2001-01-24    1.369528
2001-01-25    0.195031
2001-01-26   -0.959385
2001-01-27    0.279525
2001-01-28   -0.098430
2001-01-29    0.002582
2001-01-30    0.187159
                ...   
2001-12-02   -1.302635
2001-12-03   -0.080065
2001-12-04   -1.219857
2001-12-05   -0.604752
2001-12-06    0.014968
2001-12-07   -0.215877
2001-12-08   -0.379236
2001-12-09    1.821513
2001-12-10    0.555907
2001-12-11   -0.791237
2001-12-12   -1.331724
2001-12-13    1.297187
2001-12-14 

In [44]:
longer_ts['2001-05']

2001-05-01    1.159770
2001-05-02    0.288391
2001-05-03    0.877135
2001-05-04   -0.145547
2001-05-05    1.674836
2001-05-06    1.086258
2001-05-07   -0.899010
2001-05-08    0.007753
2001-05-09    0.805570
2001-05-10    0.856198
2001-05-11   -1.199865
2001-05-12    0.367740
2001-05-13   -0.283615
2001-05-14   -0.380956
2001-05-15   -0.496632
2001-05-16    0.168716
2001-05-17   -2.352882
2001-05-18    0.853919
2001-05-19    1.169345
2001-05-20   -1.458638
2001-05-21    0.453172
2001-05-22   -1.252340
2001-05-23   -0.263243
2001-05-24    0.006861
2001-05-25   -1.490155
2001-05-26    0.063191
2001-05-27   -0.122489
2001-05-28   -0.990888
2001-05-29   -0.279126
2001-05-30   -0.663833
2001-05-31   -0.860157
Freq: D, dtype: float64
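To check the point about a day-level string returning several rows, here is a sketch with a hypothetical hourly series (the data is just a counter). I use `.loc` and an explicit offset object for the frequency to stay version-neutral; both are my choices, not the book's.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering three full days (72 observations).
hourly = pd.Series(
    np.arange(72),
    index=pd.date_range("2000-01-01", periods=72, freq=pd.offsets.Hour()),
)

# A string at day resolution selects every row that falls on that day.
one_day = hourly.loc["2000-01-02"]

# A string at month resolution selects the whole month.
whole_month = hourly.loc["2000-01"]
```

So the string is matched at its own resolution: anything finer-grained in the index that falls inside that period comes back.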

You can slice w/ one or multiple timestamps, just like you would with integers or other key values in a non-time-index Series. Slicing, just like w/ NumPy arrays, produces a view, not a copy.

Here as with all use of index values, you can pass a string date, datetime, or Timestamp.

In [45]:
ts[datetime(2011,1,7):]

2011-01-07    1.014123
2011-01-08    0.677080
2011-01-10    1.291552
2011-01-12   -0.981119
dtype: float64

In [46]:
ts['1/6/2011':'1/11/2011']

2011-01-07    1.014123
2011-01-08    0.677080
2011-01-10    1.291552
dtype: float64

In [47]:
ts.truncate?
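Since the help lookup above leaves no output, here is a small sketch of what truncate does next to a plain slice. The values are a simple counter on the same six dates; `ts_demo` is my own name.

```python
from datetime import datetime
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts_demo = pd.Series(range(6), index=dates)

# Slice endpoints need not be present in the index (the index is sorted).
middle = ts_demo.loc["2011-01-06":"2011-01-11"]

# truncate drops everything outside the [before, after] window.
head = ts_demo.truncate(after="2011-01-09")
```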

All of the above works the same w/ DataFrames, but you're indexing on rows (typically).

In [48]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = DataFrame(np.random.randn(100, 4),
                    index=dates,
                    columns=['Colorado','Texas','New York','Ohio'])
long_df[:6]

            Colorado     Texas  New York      Ohio
2000-01-05  0.586784  0.858904 -0.550682 -1.823103
2000-01-12  1.193288  0.553716 -0.476428  1.552103
2000-01-19 -0.517572 -0.732031  1.394554  0.800763
2000-01-26 -0.094424 -0.864802  -2.42183  0.151886
2000-02-02 -1.084833  -1.10491 -0.144051  0.349795
2000-02-09  0.053263 -0.348154 -0.600494 -0.481574


In [49]:
long_df.loc['5-2001']  # .loc rather than .ix, which later pandas versions removed

            Colorado     Texas  New York      Ohio
2001-05-02 -1.301708 -2.022161 -2.156073 -0.405736
2001-05-09 -0.810789 -1.280961 -1.254807 -0.910124
2001-05-16 -0.034192  0.855185  1.535775   0.85605
2001-05-23 -1.655746 -1.453068 -0.557059  0.444331
2001-05-30 -1.107399  0.859973   -0.3324  0.112639


## Time series with duplicate indices

No problem to have multiple data observations with the same Timestamp.

In [50]:
dates = pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000',
                          '1/2/2000','1/3/2000'])
dup_ts = Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int64

In [53]:
dup_ts.index.is_unique

False

Since there are duplicate index values, when you index into the time series you'll get either scalar values or slices depending on whether the index value you use is duplicated or not.

In [55]:
print(type(dup_ts['1/3/2000']))
dup_ts['1/3/2000']

<class 'numpy.int64'>


4

In [56]:
print(type(dup_ts['1/2/2000']))
dup_ts['1/2/2000']

<class 'pandas.core.series.Series'>


2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64

You can aggregate all of the duplicate index values (here, and I think generally - not just w/ DatetimeIndex indices), using groupby and level=0.

In [57]:
grouped = dup_ts.groupby(level=0)

In [58]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
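To test my guess that this isn't specific to DatetimeIndex, a sketch with a plain string index (made-up labels and values). Grouping on level 0 collapses the duplicates, and any aggregation can follow.

```python
import pandas as pd

# Hypothetical Series with a duplicated string label.
dup = pd.Series([10, 20, 30, 40], index=["a", "b", "b", "c"])

means = dup.groupby(level=0).mean()     # one row per distinct label
counts = dup.groupby(level=0).count()   # how many rows each label had
```

The result has a unique index, so scalar lookups behave predictably again.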

# Date ranges, frequencies, and shifting

Generally time series are considered to be 'irregular' (or at least there's nothing stopping them from being this way): values have no fixed frequency, and you can miss days, miss hours, etc.

That said, sometimes it's helpful to have time series that _don't_ miss values - that are regular and have an entry (even if it's NaN, etc.) for every value.

Pandas supports this natively, enabling easy resampling, inference of frequencies, and generating fixed-frequency date ranges.

In [64]:
# first example - more comes later
# convert to a fixed frequency - one obs per day
ts

2011-01-02   -0.819811
2011-01-05    0.851267
2011-01-07    1.014123
2011-01-08    0.677080
2011-01-10    1.291552
2011-01-12   -0.981119
dtype: float64

In [65]:
ts.resample('D')

2011-01-02   -0.819811
2011-01-03         NaN
2011-01-04         NaN
2011-01-05    0.851267
2011-01-06         NaN
2011-01-07    1.014123
2011-01-08    0.677080
2011-01-09         NaN
2011-01-10    1.291552
2011-01-11         NaN
2011-01-12   -0.981119
Freq: D, dtype: float64
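The output above comes from an older pandas, where `resample('D')` handed back the reindexed Series directly. In later releases (from 0.18 onward, as far as I know) `resample` returns a deferred Resampler object and you have to say what to do with each bin. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07"])
sparse = pd.Series([1.0, 2.0, 3.0], index=dates)

# One row per day, NaN wherever nothing was observed.
daily = sparse.resample("D").asfreq()

# Same daily grid, but carry the last observation forward into the gaps.
daily_ffill = sparse.resample("D").ffill()
```

`asfreq()` reproduces the NaN-padded result shown above; `ffill()` is just one of several ways to fill the gaps.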

## Generating date ranges

In [66]:
index = pd.date_range('4/1/2012','6/1/2012')
index

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
      

In [67]:
# or you can pass start or end, and a number of periods
# (by default, date_range generates days)
pd.date_range(start='4/1/2012', periods=10)

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10'],
              dtype='datetime64[ns]', freq='D', tz=None)

In [68]:
pd.date_range(end='6/1/2012', periods=5)

DatetimeIndex(['2012-05-28', '2012-05-29', '2012-05-30', '2012-05-31',
               '2012-06-01'],
              dtype='datetime64[ns]', freq='D', tz=None)

In [70]:
# start/end dates are strict start/ending times - for ex, to 
# get a range w/ the last business day of each month, use the 
# 'BM' frequency and you'll only get dates _inside_ the specified
# range.
pd.date_range('1/1/2000','12/1/2000', freq='BM')

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM', tz=None)
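A small check of the 'strict boundaries' point. I pass an offset object (`pd.offsets.BMonthEnd()`) instead of the string alias, on the assumption that alias spellings can differ between pandas versions; the dates are the same as in the cell above.

```python
import pandas as pd

# Last business day of each month, restricted to the given start and end.
bm = pd.date_range("2000-01-01", "2000-12-01", freq=pd.offsets.BMonthEnd())

first, last = bm[0], bm[-1]

# Every generated date should fall on a weekday (Monday=0 ... Friday=4).
all_weekdays = all(d.weekday() < 5 for d in bm)
```

December's last business day falls after the end date, so it is left out and the range stops at the end of November.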

In [71]:
# if you pass time, it's retained by default
pd.date_range('5/2/2012 12:56:31', periods=5)

DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D', tz=None)

In [72]:
# or pass normalize=True to have them normalized to midnight
pd.date_range('5/2/2012 12:56:31', periods=5, normalize=True)

DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D', tz=None)

## Frequencies and date offsets