In today's lecture, we'll be looking at the time series and date functionality in pandas. Manipulating dates and times is quite flexible in pandas, and this lets us carry out analyses such as time series analysis, which we will talk about soon. In fact, pandas was originally created by Wes McKinney to handle date and time data when he worked as a consultant for hedge funds.

In [1]:
# Let's bring in pandas and numpy as usual
import pandas as pd
import numpy as np

### Timestamp

pandas has four main time-related classes: Timestamp, DatetimeIndex, Period, and PeriodIndex. Let's start with Timestamp, which represents a single timestamp and associates values with points in time.

For example, let's create a timestamp from the string 9/1/2019 10:05AM.

In [2]:
# Timestamp is interchangeable with Python's datetime in most cases.
pd.Timestamp('9/1/2019 10:05AM')

Timestamp('2019-09-01 10:05:00')

In [3]:
# We can also create a timestamp by passing separate arguments such as year, month, day, hour, and minute

pd.Timestamp(2019, 12, 20, 0, 0)

Timestamp('2019-12-20 00:00:00')

In [4]:
# Timestamp also has some useful methods, such as isoweekday(), which returns the day of the week,
# where 1 represents Monday and 7 represents Sunday

pd.Timestamp(2019, 12, 20, 0, 0).isoweekday()

5

In [5]:
# You can extract the specific year, month, day, hour, minute, and second from a Timestamp

pd.Timestamp(2019, 12, 20, 5, 2, 23).second

23
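
As a small illustration of the earlier remark that Timestamp is interchangeable with Python's datetime in most cases, the sketch below checks the relationship directly. The variable names are made up for this example.

```python
import datetime as dt
import pandas as pd

ts = pd.Timestamp(2019, 12, 20, 5, 2, 23)

# pd.Timestamp subclasses datetime.datetime, so it can be passed where a datetime is expected
is_datetime = isinstance(ts, dt.datetime)

# Convert explicitly when a library insists on the plain standard-library type
plain = ts.to_pydatetime()

# Components are attributes; isoweekday() and day_name() are methods
parts = (ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second)
print(is_datetime, plain, parts, ts.day_name())
```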

### Period

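Suppose we aren't interested in a specific point in time but rather in a span of time. This is where the Period class comes in. A Period represents a single time span, such as a specific day or month.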

In [6]:
# Here we are creating a period that is January 2016.
pd.Period('1/2016')

Period('2016-01', 'M')

In [7]:
# You'll notice when we print that out that the granularity of the period is M for month, since that was the
# finest grained piece we provided. Here's an example of a period that is March 5th, 2016.
pd.Period('3/5/2016')

Period('2016-03-05', 'D')

In [8]:
# Period objects represent the full timespan that you specify. Arithmetic on periods is easy and
# intuitive. For instance, if we want to find out 5 months after January 2016, we simply add 5
pd.Period('1/2016') + 5

Period('2016-06', 'M')

In [9]:
# From the result, you can see we get June 2016. If we want to find out two days before March 5th 2016, we
# simply subtract 2
pd.Period('3/5/2016') - 2

Period('2016-03-03', 'D')

The key here is that the Period object carries its granularity with it, so arithmetic is done in units of that granularity.
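
To make the idea of a span concrete, here is a minimal sketch that sets the frequency explicitly and inspects the bounds of a period through `start_time` and `end_time`. The variable names are chosen for illustration only.

```python
import pandas as pd

month = pd.Period("2016-01", freq="M")
day = pd.Period("2016-03-05", freq="D")

# A Period covers a whole span; start_time and end_time give its bounds as Timestamps
print(month.start_time, month.end_time)

# Adding or subtracting an integer moves by that many units of the period's own frequency
later_month = month + 5    # five months later
earlier_day = day - 2      # two days earlier
print(later_month, earlier_day)
```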

### DatetimeIndex and PeriodIndex

In [10]:
# A Series or DataFrame indexed by Timestamps has a DatetimeIndex. Let's look at a quick example. First, let's
# create our example series t1, using the Timestamps for September 1st, 2nd and 3rd of 2016. When we look at the
# series, each Timestamp is an index label and has a value associated with it, in this case a, b and c.

t1 = pd.Series(list('abc'), [pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-02'), 
                             pd.Timestamp('2016-09-03')])
t1

2016-09-01    a
2016-09-02    b
2016-09-03    c
dtype: object

In [11]:
# The dtype shown above (object) is the type of the data values, not of the index. Let's look at the index itself.
t1.index

DatetimeIndex(['2016-09-01', '2016-09-02', '2016-09-03'], dtype='datetime64[ns]', freq=None)

In [12]:
# Looking at the type of our series index, we see that it's DatetimeIndex.
type(t1.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [13]:
# Similarly, we can create a period-based index as well. 
t2 = pd.Series(list("def"), [pd.Period("2016-09"), pd.Period("10/2016"), pd.Period("2016-11")])
t2

2016-09    d
2016-10    e
2016-11    f
Freq: M, dtype: object

In [14]:
type(t2.index)

pandas.core.indexes.period.PeriodIndex
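
The two index types can also be converted into each other. The sketch below rebuilds a series like t1 and uses `to_period` and `to_timestamp`; it is a minimal illustration, not the only way to do the conversion.

```python
import pandas as pd

t1 = pd.Series(list("abc"),
               index=pd.to_datetime(["2016-09-01", "2016-09-02", "2016-09-03"]))

# to_period turns each timestamp into the period (here, a day) that contains it
as_periods = t1.to_period("D")

# to_timestamp goes the other way, using the start of each period by default
back = as_periods.to_timestamp()
print(type(as_periods.index).__name__, type(back.index).__name__)
```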

### Converting to Datetime

In [15]:
# Now, let's look into how to convert to Datetime. Suppose we have a list of dates as strings and we want to
# create a new dataframe

# I'm going to try a bunch of different date formats
d1 = ["2 June 2013", "Aug 29, 2014", "2015-06-26", "7/12/16"]

# And just some random data
t3 = pd.DataFrame(np.random.randint(10, 100, (4, 2)), index = d1, columns = list("ab"))
t3

               a   b
2 June 2013   97  82
Aug 29, 2014  29  49
2015-06-26    17  11
7/12/16       65  33


In [16]:
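# Using pandas to_datetime, pandas will try to convert these to Datetime and put them in a standard format.
# Note: the output below comes from an older pandas. In pandas 2.0 and later, a list that mixes formats
# like this one may need format="mixed" to parse.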

t3.index = pd.to_datetime(t3.index)
t3

             a   b
2013-06-02  97  82
2014-08-29  29  49
2015-06-26  17  11
2016-07-12  65  33


In [17]:
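# to_datetime can change the order in which it parses the date parts. For example, the European convention is
# day/month/year, which we request with "dayfirst = True".
# The first two entries in d1 spell out the month in letters, so they are unaffected. The third entry is
# also unambiguous and does not change.
# The last entry, 7/12/16, means December 7, 2016 in Europe, so dayfirst = True changes how it is read.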
print(d1)
pd.to_datetime(d1, dayfirst = True)

['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']


DatetimeIndex(['2013-06-02', '2014-08-29', '2015-06-26', '2016-12-07'], dtype='datetime64[ns]', freq=None)

In [18]:
# 4.7.12 is also an ambiguous date, and dayfirst = True changes how it is read.

pd.to_datetime("4.7.12", dayfirst = True)

Timestamp('2012-07-04 00:00:00')
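
When the layout of the strings is known in advance, an explicit `format` removes the ambiguity altogether. The following is a minimal sketch using standard strftime codes; the sample strings are made up for the example.

```python
import pandas as pd

raw = "7/12/16"

# Spell out the layout so the result does not depend on dayfirst guessing
us_style = pd.to_datetime(raw, format="%m/%d/%y")   # month/day/year
eu_style = pd.to_datetime(raw, format="%d/%m/%y")   # day/month/year
print(us_style, eu_style)

# errors="coerce" turns unparseable entries into NaT instead of raising an error
cleaned = pd.to_datetime(["2016-07-12", "not a date"],
                         format="%Y-%m-%d", errors="coerce")
print(cleaned)
```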

### Timedelta

In [19]:
# Timedeltas are differences in times. This is not the same as a period, but conceptually similar. For
# instance, if we want to take the difference between September 3rd and September 1st, we get a Timedelta of
# two days.

pd.Timestamp('9/3/2016') - pd.Timestamp("9/1/2016")

Timedelta('2 days 00:00:00')

In [20]:
# We can also do something like find what the date and time is for 12 days and three hours past September 2nd,
# at 8:10 AM.

pd.Timestamp("2016-9-2 8:10") + pd.Timedelta("12D 3H")

Timestamp('2016-09-14 11:10:00')
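
A Timedelta can also be built from keyword arguments and then inspected. This sketch repeats the 12-days-and-3-hours example that way; the names `delta`, `start` and `end` are illustrative.

```python
import pandas as pd

delta = pd.Timedelta(days=12, hours=3)

# components splits the duration into days, hours, minutes, ...;
# total_seconds() expresses it as a single number
print(delta.components.days, delta.components.hours, delta.total_seconds())

# Timestamp + Timedelta is a Timestamp, and Timestamp - Timestamp is a Timedelta
start = pd.Timestamp("2016-09-02 08:10")
end = start + delta
print(end, end - start == delta)
```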

### Offset

In [21]:
# An offset is similar to a Timedelta, but it follows calendar rules rather than a fixed duration. Offsets
# are flexible in the kinds of intervals they support: besides hour, day, week, month, etc., there are also
# business day, end of month, semi-month begin, and so on.

# Let's create a timestamp and see what day of the week it is.
# Note that isoweekday() returns 1-7 for Monday through Sunday, while weekday() returns 0-6 for Monday through Sunday
pd.Timestamp("9/4/2016").weekday()

6

In [22]:
# Now we can add one week to the timestamp
pd.Timestamp("9/4/2016") + pd.offsets.Week()

Timestamp('2016-09-11 00:00:00')

In [23]:
pd.Timestamp("9/4/2016") + pd.offsets.Week() == pd.Timestamp("9/4/2016") + pd.Timedelta("1W")

True

In [24]:
# Now let's add a month-end offset, which rolls the date forward to the last day of September
pd.Timestamp("9/4/2016") + pd.offsets.MonthEnd()

Timestamp('2016-09-30 00:00:00')
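
The difference between a fixed Timedelta and a calendar-aware offset shows up most clearly around weekends and month ends. Here is a small sketch using `pd.offsets.BDay` and `pd.offsets.MonthEnd`; the dates are chosen for illustration.

```python
import pandas as pd

friday = pd.Timestamp("2016-09-02")   # a Friday

# A Timedelta always adds a fixed amount of time
plus_one_day = friday + pd.Timedelta(days=1)    # lands on Saturday

# A business-day offset skips the weekend
plus_one_bday = friday + pd.offsets.BDay(1)     # lands on Monday

# MonthEnd rolls forward to the end of the month; applied to a date that is
# already a month end, it moves to the end of the next month
sept_end = pd.Timestamp("2016-09-04") + pd.offsets.MonthEnd()
oct_end = sept_end + pd.offsets.MonthEnd()
print(plus_one_day, plus_one_bday, sept_end, oct_end)
```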

### Working with Dates in a DataFrame
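
Suppose we need to run an experiment every other Sunday, nine times in total, starting in October 2016. Using date_range, we can build this DatetimeIndex. In date_range we need to specify a start date or an end date; a single date passed without a keyword is taken as the start date. We also need to specify the number of periods and a frequency. Here we pass "2W-SUN", which means every two weeks on Sunday.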

In [25]:
dates = pd.date_range(start = "10-01-2016", periods = 9, freq = "2W-SUN")
dates

DatetimeIndex(['2016-10-02', '2016-10-16', '2016-10-30', '2016-11-13',
               '2016-11-27', '2016-12-11', '2016-12-25', '2017-01-08',
               '2017-01-22'],
              dtype='datetime64[ns]', freq='2W-SUN')

In [26]:
# There are many other frequencies that you can specify. For example, you can do business day
pd.date_range("10-01-2016", periods = 9, freq = "B")

DatetimeIndex(['2016-10-03', '2016-10-04', '2016-10-05', '2016-10-06',
               '2016-10-07', '2016-10-10', '2016-10-11', '2016-10-12',
               '2016-10-13'],
              dtype='datetime64[ns]', freq='B')

In [27]:
# Or you can do quarterly, with the quarter start in June
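# This means that although we start from April 1, 2016, "QS-JUN" anchors the quarters so that one of them
# starts in June, so the first date is June 1. QS stands for quarter start, giving one date per quarter.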
pd.date_range("04-01-2016", periods = 12, freq = "QS-JUN")

DatetimeIndex(['2016-06-01', '2016-09-01', '2016-12-01', '2017-03-01',
               '2017-06-01', '2017-09-01', '2017-12-01', '2018-03-01',
               '2018-06-01', '2018-09-01', '2018-12-01', '2019-03-01'],
              dtype='datetime64[ns]', freq='QS-JUN')
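
date_range does not have to be driven by a start date and a number of periods. As a sketch of the other combinations, the example below uses a start and an end, and then an end with a number of periods; the date ranges are chosen for illustration.

```python
import pandas as pd

# start + end + freq: pandas works out how many dates fit in the range
sundays = pd.date_range(start="2016-10-01", end="2016-10-31", freq="W-SUN")

# end + periods + freq: count backwards from the end date
last_three_bdays = pd.date_range(end="2016-10-31", periods=3, freq="B")

print(sundays)
print(last_three_bdays)
```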

In [28]:
# Now, let's go back to our every-other-Sunday example and create a DataFrame using these dates and some
# random data, and see what we can do with it.

dates = pd.date_range("10-01-2016", periods = 9, freq = "2W-SUN")
dates

DatetimeIndex(['2016-10-02', '2016-10-16', '2016-10-30', '2016-11-13',
               '2016-11-27', '2016-12-11', '2016-12-25', '2017-01-08',
               '2017-01-22'],
              dtype='datetime64[ns]', freq='2W-SUN')

In [29]:
df = pd.DataFrame({"Count1": 100 + np.random.randint(-5, 10, 9).cumsum(),
                   "Count2": 120 + np.random.randint(-5, 10, 9)}, index = dates)
df

            Count1  Count2
2016-10-02      98     121
2016-10-16     100     119
2016-10-30     101     121
2016-11-13     100     117
2016-11-27     108     123
2016-12-11     116     115
2016-12-25     125     119
2017-01-08     126     119
2017-01-22     129     126


———————————————————————

In [30]:
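# np.cumsum(a, axis=None, dtype=None, out=None) returns a new array of cumulative sums along the given axis.
# axis = 0 accumulates down the rows, axis = 1 accumulates across the columns, and with no axis the array
# is treated as if it were flattened to one dimension. For example: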

kkk = np.array([[1, 2, 3], [4, 5, 6]])
print(kkk)
np.cumsum(kkk)

# No axis is given here, so kkk is treated as a one-dimensional array: 1=1, 3=1+2, 6=1+2+3, 10=1+2+3+4, and so on

[[1 2 3]
 [4 5 6]]


array([ 1,  3,  6, 10, 15, 21], dtype=int32)

In [31]:
kkk = np.array([[1, 2, 3], [4, 5, 6]])
print(kkk)
np.cumsum(kkk, axis = 0)

# With axis=0 the sums run down the rows: the first row is unchanged, and in the second row 5=1+4, 7=2+5, 9=3+6

[[1 2 3]
 [4 5 6]]


array([[1, 2, 3],
       [5, 7, 9]], dtype=int32)

In [32]:
kkk = np.array([[1, 2, 3], [4, 5, 6]])
print(kkk)
np.cumsum(kkk, axis = 1)

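# Likewise, axis=1 sums across the columns: the first column is unchanged, the second column is 3=1+2, 9=4+5,
# and the third column is 6=1+2+3, 15=4+5+6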

[[1 2 3]
 [4 5 6]]


array([[ 1,  3,  6],
       [ 4,  9, 15]], dtype=int32)

In [33]:
# cumsum() is also available as an array method, for example:
kkk = np.array([[1, 2, 3], [4, 5, 6]])
kkk.cumsum(axis = 1)

# Note that .cumsum() does not modify the original array; to keep the result, assign it back:
# kkk = kkk.cumsum(axis = 1)

array([[ 1,  3,  6],
       [ 4,  9, 15]], dtype=int32)
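
This is why cumsum appears in the Count1 column: adding a running total of changes to a starting value gives a series that drifts over time. Here is a minimal sketch with fixed, made-up steps in place of random ones.

```python
import numpy as np

steps = np.array([2, -1, 3, 0, -4])   # made-up changes from one date to the next
level = 100 + steps.cumsum()          # running total starting from 100
print(level)

# np.diff recovers the steps from the levels (all but the first one)
recovered = np.diff(level)
print(recovered)
```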

———————————————————————

In [34]:
# Continuing the example above
dates = pd.date_range("10-01-2016", periods = 9, freq = "2W-SUN")
df = pd.DataFrame({"Count1": 100 + np.random.randint(-5, 10, 9).cumsum(),
                   "Count2": 120 + np.random.randint(-5, 10, 9)}, index = dates)
df

            Count1  Count2
2016-10-02     106     126
2016-10-16     108     115
2016-10-30     104     118
2016-11-13     102     120
2016-11-27     110     122
2016-12-11     112     118
2016-12-25     117     115
2017-01-08     125     121
2017-01-22     133     117


In [35]:
# First, we can check which day of the week each date falls on. Here every date in the index is a Sunday,
# which matches the frequency we set.
df.index.day_name()

Index(['Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday',
       'Sunday', 'Sunday'],
      dtype='object')

In [36]:
# Or:
for i in df.index:
    print(i.isoweekday())

7
7
7
7
7
7
7
7
7


In [37]:
# We can also use diff() to find the difference between each date's value.
df.diff()

            Count1  Count2
2016-10-02     NaN     NaN
2016-10-16     2.0   -11.0
2016-10-30    -4.0     3.0
2016-11-13    -2.0     2.0
2016-11-27     8.0     2.0
2016-12-11     2.0    -4.0
2016-12-25     5.0    -3.0
2017-01-08     8.0     6.0
2017-01-22     8.0    -4.0
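
One way to read diff() is as the series minus a copy of itself shifted down by one row. The sketch below checks that on a short series and also shows the `periods` argument; the values are taken from the first rows of Count1 above.

```python
import pandas as pd

dates = pd.date_range("2016-10-01", periods=4, freq="2W-SUN")
s = pd.Series([106, 108, 104, 102], index=dates)

by_diff = s.diff()            # difference from the previous row
by_shift = s - s.shift(1)     # the same calculation written out by hand
two_back = s.diff(periods=2)  # difference from the row two steps earlier

print(by_diff)
print(two_back)
```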


In [38]:
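# Suppose we want to know the mean count for each month in our DataFrame. We can do this with resample.
# Converting from a higher frequency to a lower frequency is called downsampling.
# Note: newer pandas versions (2.2 and later) prefer the alias "ME" for month end and deprecate "M".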

df.resample("M").mean()

            Count1      Count2
2016-10-31   106.0  119.666667
2016-11-30   106.0       121.0
2016-12-31   114.5       116.5
2017-01-31   129.0       119.0
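
resample also works in the other direction, from a lower frequency to a higher one (upsampling). In that case the new rows have no data and need a fill rule. Here is a minimal sketch going from every other Sunday to every Sunday, with made-up values.

```python
import pandas as pd

dates = pd.date_range("2016-10-01", periods=3, freq="2W-SUN")
counts = pd.Series([106, 108, 104], index=dates)

# Upsampling to weekly creates new rows that have no data (NaN) ...
weekly = counts.resample("W-SUN").asfreq()

# ... which can be filled, for example by carrying the last known value forward
weekly_filled = counts.resample("W-SUN").ffill()

print(weekly)
print(weekly_filled)
```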


### Datetime Indexing and Slicing

For example, we can use partial string indexing to get the values for a particular year.

In [39]:
df.loc["2017"]

            Count1  Count2
2017-01-08     125     121
2017-01-22     133     117


In [40]:
# In the same way, we can get the values for a particular month.
df.loc['2016-12']

            Count1  Count2
2016-12-11     112     118
2016-12-25     117     115


In [41]:
# We can also slice on a range of dates. For example, suppose we only want the data from December 2016 onward.
df.loc["2016-12-11":]

            Count1  Count2
2016-12-11     112     118
2016-12-25     117     115
2017-01-08     125     121
2017-01-22     133     117
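
Partial strings can be used at both ends of a slice as well. The sketch below rebuilds the same date index with placeholder counts (not the random data above) and selects November through December; it also shows a boolean mask on the index as an alternative.

```python
import pandas as pd

dates = pd.date_range("2016-10-01", periods=9, freq="2W-SUN")
df = pd.DataFrame({"Count1": range(9)}, index=dates)   # placeholder values

# Both ends of a label slice are inclusive, and a partial string covers its whole period
nov_to_dec = df.loc["2016-11":"2016-12"]

# A boolean mask built from index attributes works too
january = df[df.index.month == 1]

print(nov_to_dec)
print(january)
```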
