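Time series data is an important form of structured data. Anything observed or measured at multiple points in time forms a time series. Depending on the use case, time series data can be indexed in one of the following ways:
- Timestamp: a specific instant in time.
- Fixed period: a span such as the whole of 2021.
- Interval: a span defined by a start timestamp and an end timestamp.
- Experiment or elapsed time: each time is a measurement relative to a particular start time.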
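
As a rough illustration of the four kinds listed above, pandas offers a scalar type that fits each one. This is a minimal sketch: the dates are arbitrary, and mapping elapsed time onto `Timedelta` is an assumption of this example rather than something the list prescribes.

```python
import pandas as pd

# Timestamp: a specific instant
ts_point = pd.Timestamp('2021-06-01 20:00:00')

# Period: a fixed span such as the whole of 2021 (annual frequency is inferred)
year_2021 = pd.Period('2021')

# Interval: explicit start and end timestamps (closed on the right by default)
span = pd.Interval(pd.Timestamp('2021-06-01'), pd.Timestamp('2021-06-18'))

# Elapsed time: an offset relative to a chosen start
elapsed = ts_point - pd.Timestamp('2021-06-01')

print(ts_point, year_2021, span, elapsed, sep='\n')
```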

In [11]:
import pandas as pd
import numpy as np

# 1. Date and Time Data Types
The most commonly used type in the Python standard library is `datetime.datetime`. The main modules are `datetime`, `time`, and `calendar`.

## 1.1 Datetime Format
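- %Y: four-digit year
- %y: two-digit year
- %m: two-digit month [01, 12]
- %d: two-digit day [01, 31]
- %H: hour, 24-hour clock [00, 23]
- %I: hour, 12-hour clock [01, 12]
- %M: two-digit minute [00, 59]
- %S: second [00, 61] (60 allows for leap seconds; 61 is accepted for historical reasons)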

---
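- %w: weekday as an integer [0 (Sunday), 6]
- %U: week number of the year [00, 53]. Sunday is treated as the first day of the week, and the days before the first Sunday of the year belong to week 0.
- %W: week number of the year [00, 53]. Monday is treated as the first day of the week, and the days before the first Monday of the year belong to week 0.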

---
- %F: shorthand for %Y-%m-%d, for example 2021-05-23
- %D: shorthand for %m/%d/%y, for example 05/23/21

---
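Locale-specific date formats

- %a: abbreviated weekday name
- %A: full weekday name
- %b: abbreviated month name
- %B: full month name
- %c: full date and time
- %p: the locale's equivalent of AM or PM
- %x: locale-appropriate date format
- %X: locale-appropriate time format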
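
To make the format codes concrete, the sketch below formats one arbitrary timestamp with several of the numeric codes and parses it back. It deliberately avoids the locale-specific codes, whose output depends on the environment.

```python
from datetime import datetime

stamp = datetime(2021, 5, 23, 14, 5, 9)  # a Sunday, chosen arbitrarily

# Locale-independent numeric codes
iso_like = stamp.strftime('%Y-%m-%d %H:%M:%S')   # four-digit year, 24-hour clock
us_short = stamp.strftime('%m/%d/%y')            # the layout that %D abbreviates
twelve_h = stamp.strftime('%I:%M')               # 12-hour clock
week_info = stamp.strftime('%w %U %W')           # weekday, Sunday-based week, Monday-based week

# strptime reverses strftime when given the same format string
parsed = datetime.strptime(iso_like, '%Y-%m-%d %H:%M:%S')

print(iso_like, us_short, twelve_h, week_info, parsed == stamp)
```

Because the date falls on a Sunday, `%U` has already moved on to a new week while `%W` has not, which is why the two week numbers differ.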

## 1.2 datetime.datetime

In [12]:
from datetime import datetime

In [14]:
now = datetime.now()
now

datetime.datetime(2021, 6, 1, 19, 22, 46, 84249)

In [15]:
# 1. Access its attributes
now.year, now.month, now.day

(2021, 6, 1)

In [18]:
now.hour, now.minute, now.second

(19, 22, 46)

## 1.3 datetime.timedelta 

In [21]:
# 2. Arithmetic on datetime objects
start = datetime(2020, 1, 20)
diff = now - start
diff

datetime.timedelta(days=498, seconds=69766, microseconds=84249)

In [22]:
diff.days

498

In [23]:
diff.seconds

69766

In [28]:
now

datetime.datetime(2021, 6, 1, 19, 22, 46, 84249)

In [31]:
from datetime import timedelta
now + timedelta(12) # the first positional argument is days

datetime.datetime(2021, 6, 13, 19, 22, 46, 84249)

In [34]:
timedelta?

timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)
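
One detail that is easy to miss: `.seconds` holds only the part of the difference left over after whole days, not the full duration. The sketch below reuses the dates from the cells above (with microseconds dropped) to contrast it with `total_seconds()`.

```python
from datetime import datetime, timedelta

start = datetime(2020, 1, 20)
end = datetime(2021, 6, 1, 19, 22, 46)
diff = end - start

# .days and .seconds are separate components; .seconds never exceeds 86399
print(diff.days, diff.seconds)

# total_seconds() folds days, seconds, and microseconds into one float
print(diff.total_seconds())

# Keyword arguments are normalized into days / seconds / microseconds
delta = timedelta(weeks=1, hours=36)
print(delta.days, delta.seconds)
```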

## 1.4 Converting Between Strings and datetime

In [37]:
# Parse a date string
sixone = '2021-6-01 20:00:00'

datetime.strptime(sixone, '%Y-%m-%d %H:%M:%S')

datetime.datetime(2021, 6, 1, 20, 0)

In [39]:
pd.to_datetime(sixone)

Timestamp('2021-06-01 20:00:00')

In [68]:
# Get the weekday of a given date as a number (0 = Sunday)
datetime.strptime(sixone, '%Y-%m-%d %H:%M:%S').strftime('%w')

'2'

In [69]:
# Get the week of the year for a given date (Monday-based)
datetime.strptime(sixone, '%Y-%m-%d %H:%M:%S').strftime('%W')

'22'

In [70]:
# Same week number, converted to an integer
int(datetime.strptime(sixone, '%Y-%m-%d %H:%M:%S').strftime('%W'))

22

In [71]:
# Get the weekday name for a given date (abbreviated, then full)
datetime.strptime(sixone, '%Y-%m-%d %H:%M:%S').strftime('%a')

'Tue'

In [72]:
datetime.strptime(sixone, '%Y-%m-%d %H:%M:%S').strftime('%A')

'Tuesday'

In [73]:
# Get the month name for a given date (abbreviated, then full)
datetime.strptime(sixone, '%Y-%m-%d %H:%M:%S').strftime('%b')

'Jun'

In [74]:
datetime.strptime(sixone, '%Y-%m-%d %H:%M:%S').strftime('%B')

'June'

## 1.5 `NaT` (Not a Time): the NA value for timestamp data in pandas

![1](./images/NaT.jpg)

In [40]:
rootdir = 'D:/Github/BigDataAnalysis/01 Data Analysis and Pre-processing/Dataset/'
filenames = ['Auxiliary_Info.xlsx']
au_info = pd.read_excel(rootdir + filenames[0])
au_info.head()

Unnamed: 0,Semester Start Week,Holiday Date,Attendance period Start,Attendance period End,Attendance period Description,Make up lessons ID,Original lessons Date,Make up lessons Date,Make up lessons Week,Make up lessons Weekday,Make up lessons Schedule,Make up lessons Description
0,2021-03-01,2021-05-01,07:00:00,08:30:00,上午第1节课前一小时内考勤均认为正常考勤，可以按照自己的逻辑修改。,ML2020001,2021-03-15,2021-05-28,13.0,Fri,1--2,补课日期，补第几周的课，补周几的课，补第几节课
1,NaT,2021-05-02,10:05:00,10:25:00,上午1-2节下课到第3节课上课之间的时间被认为正常考勤。,DL2020001,2021-03-22,2021-05-24,13.0,Mon,3--4,
2,NaT,2021-05-03,13:00:00,14:00:00,下午第1节课前一小时内考勤均认为正常考勤，可以按照自己的逻辑修改。,CV2020001,2021-05-24,2021-05-28,13.0,Fri,5--6,
3,NaT,2021-05-04,15:35:00,15:55:00,下午1-2节下课到第3节课上课之间的时间被认为正常考勤。,,NaT,NaT,,,,
4,NaT,2021-05-05,17:30:00,18:00:00,下午3-4节下课到晚上第1节课上课之间的时间被认为正常考勤。,,NaT,NaT,,,,
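
The `NaT` entries above come from empty cells in the spreadsheet. Since that file is not available here, the sketch below reproduces the same effect with a small hand-made Series (the values are illustrative only) and shows that `NaT` works with the usual missing-value helpers.

```python
import pandas as pd

# None (or an empty cell) in a datetime column becomes NaT
col = pd.to_datetime(pd.Series(['2021-03-01', None, None]))
print(col)

# NaT is detected by the usual missing-value helpers
print(col.isna().tolist())

# Missing timestamps can be dropped, or forward-filled if that suits the data
print(col.dropna().tolist())
print(col.ffill().tolist())
```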


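## 1.6 How pandas Relates to datetime
The most basic kind of time series in pandas is a Series indexed by timestamps, which are often supplied as Python strings or datetime objects. Those datetime objects are stored in a DatetimeIndex.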

In [44]:
ts = [1, 2, 3, 4, 5, 6]
ts[::2]

[1, 3, 5]

In [45]:
ts[1::2]

[2, 4, 6]

In [46]:
ts[3::2]

[4, 6]

In [54]:
# Random values in a given shape.
# rand(d0, d1, ..., dn)
np.random.rand?

In [55]:
np.random.rand(6, 1)

array([[0.53064704],
       [0.54018723],
       [0.43826432],
       [0.98806201],
       [0.18967486],
       [0.96387848]])

In [57]:
# Return a sample (or samples) from the "standard normal" distribution.
# randn(d0, d1, ..., dn)
np.random.randn?

In [58]:
np.random.randn(6)

array([-0.60655019,  2.11467588, -1.15069137, -0.02745732, -0.62381701,
        0.20075619])

In [60]:
dates = [datetime(2021, 6, 1), 
         datetime(2021, 6, 2), 
         datetime(2021, 6, 3), 
         datetime(2021, 6, 10), 
         datetime(2021, 6, 18), 
         datetime(2021, 6, 20), 
        ]

mock_value = np.random.randn(len(dates))
# Explicitly construct a pandas.Series object
# A list of datetime objects passed as the index is converted to a DatetimeIndex automatically.
ts = pd.Series(mock_value, index=dates)
ts

2021-06-01    0.784748
2021-06-02    0.452610
2021-06-03    1.949499
2021-06-10   -1.436581
2021-06-18    0.414006
2021-06-20   -0.503487
dtype: float64

In [61]:
type(ts)

pandas.core.series.Series

In [63]:
isinstance(ts, pd.core.series.Series)

True

In [65]:
# Timestamps are stored at nanosecond resolution
ts.index.dtype

dtype('<M8[ns]')

In [79]:
# Index into the DatetimeIndex by position
ts.index[0]

Timestamp('2021-06-01 00:00:00')

In [81]:
list(ts.index)

[Timestamp('2021-06-01 00:00:00'),
 Timestamp('2021-06-02 00:00:00'),
 Timestamp('2021-06-03 00:00:00'),
 Timestamp('2021-06-10 00:00:00'),
 Timestamp('2021-06-18 00:00:00'),
 Timestamp('2021-06-20 00:00:00')]

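## 1.7 Indexing, Selection, and Subsetting
A time series here is an ordinary Series with a DatetimeIndex (as `type(ts)` shows above), so indexing and selection behave just as they do for any other Series.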



### 1) Indexing

In [83]:
stamp = ts.index[2]
stamp

Timestamp('2021-06-03 00:00:00')

In [84]:
# Pass a Timestamp
ts[stamp]

1.9494987022879615

In [85]:
# Pass a string that can be interpreted as a date
ts['6/1/2021']

0.7847481402803347

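### 2) Slicing
<font color=red> The bracket shorthand here is shown for a Series only! For a DataFrame, use `.loc` for label-based row selection. </font>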

In [86]:
# Slice by date
ts[datetime(2021, 6, 3):]

2021-06-03    1.949499
2021-06-10   -1.436581
2021-06-18    0.414006
2021-06-20   -0.503487
dtype: float64

In [87]:
# Range query (both endpoints are included)
ts['6/1/2021':'6/3/2021']

2021-06-01    0.784748
2021-06-02    0.452610
2021-06-03    1.949499
dtype: float64
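
Date strings can also be partial: a year-month string selects every row in that month. The sketch below uses a hypothetical 40-day series (not the `ts` defined earlier) and spells the selections with `.loc`, which behaves the same way for a Series and for the rows of a DataFrame.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2021-05-25', periods=40, freq='D')
s = pd.Series(np.arange(40), index=idx)
df = pd.DataFrame({'value': np.arange(40)}, index=idx)

# A year-month string selects every row in that month
june = s.loc['2021-06']

# Label slices include both endpoints
first_days = s.loc['2021-06-01':'2021-06-03']

# The same row slice on a DataFrame, spelled with .loc to avoid a column lookup
df_days = df.loc['2021-06-01':'2021-06-03']

print(len(june), first_days.tolist(), df_days['value'].tolist())
```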

### 3) Subsetting

In [89]:
periods = 100
longer_ts = pd.Series(np.random.randn(periods), 
                      index=pd.date_range('6/1/2021', periods=periods))
longer_ts

2021-06-01    0.542161
2021-06-02   -1.620210
2021-06-03   -0.504591
2021-06-04    0.829834
2021-06-05    0.615478
                ...   
2021-09-04    1.015031
2021-09-05    0.526046
2021-09-06    1.559286
2021-09-07   -0.352586
2021-09-08    0.135214
Freq: D, Length: 100, dtype: float64

In [90]:
%page longer_ts

In [94]:
# Rows before the `before` date are dropped
# Rows after the `after` date are dropped
longer_ts.truncate(before='6/10/2021',
                   after='6/18/2021')

2021-06-10   -1.297002
2021-06-11    1.586373
2021-06-12    0.048455
2021-06-13   -0.586656
2021-06-14    0.240073
2021-06-15   -1.077827
2021-06-16   -0.244207
2021-06-17    1.077276
2021-06-18   -0.024610
Freq: D, dtype: float64

In [92]:
longer_ts.truncate?

```python
longer_ts.truncate(
    before=None,
    after=None,
    axis=None,
    copy: 'bool_t' = True,
) -> 'FrameOrSeries'
```

### 4) pd.date_range()
Pay attention to the `freq` parameter!

In [95]:
pd.date_range?

```python
pd.date_range(
    start=None,
    end=None,
    periods=None,
    freq=None,
    tz=None,
    normalize=False,
    name=None,
    closed=None,
    **kwargs,
) -> pandas.core.indexes.datetimes.DatetimeIndex
```

In [98]:
dates = pd.date_range('6/18/2021', 
                      periods=100, 
                      freq='W-WED')
dates

DatetimeIndex(['2021-06-23', '2021-06-30', '2021-07-07', '2021-07-14',
               '2021-07-21', '2021-07-28', '2021-08-04', '2021-08-11',
               '2021-08-18', '2021-08-25', '2021-09-01', '2021-09-08',
               '2021-09-15', '2021-09-22', '2021-09-29', '2021-10-06',
               '2021-10-13', '2021-10-20', '2021-10-27', '2021-11-03',
               '2021-11-10', '2021-11-17', '2021-11-24', '2021-12-01',
               '2021-12-08', '2021-12-15', '2021-12-22', '2021-12-29',
               '2022-01-05', '2022-01-12', '2022-01-19', '2022-01-26',
               '2022-02-02', '2022-02-09', '2022-02-16', '2022-02-23',
               '2022-03-02', '2022-03-09', '2022-03-16', '2022-03-23',
               '2022-03-30', '2022-04-06', '2022-04-13', '2022-04-20',
               '2022-04-27', '2022-05-04', '2022-05-11', '2022-05-18',
               '2022-05-25', '2022-06-01', '2022-06-08', '2022-06-15',
               '2022-06-22', '2022-06-29', '2022-07-06', '2022-07-13',
      

### 5) DataFrame.iloc

In [103]:
# `.ix` has been removed
pd.DataFrame.ix?

Object `pd.DataFrame.ix` not found.


In [104]:
pd.__version__

'1.2.4'

In [105]:
pd.DataFrame.iloc?

In [101]:
long_df = pd.DataFrame(np.random.randn(100, 4), 
                       index=dates, 
                       columns=['Colorado', 'Texas', 'New York', 'Califonia'])
long_df

Unnamed: 0,Colorado,Texas,New York,Califonia
2021-06-23,1.104667,-2.493659,-1.154782,0.372761
2021-06-30,0.445929,0.726355,0.925137,-1.875483
2021-07-07,1.937304,0.243263,0.711466,1.126121
2021-07-14,-0.530885,-0.915292,0.193696,-0.610519
2021-07-21,1.293584,0.333919,1.038634,-1.137250
...,...,...,...,...
2023-04-19,-0.452375,-0.297580,-1.209614,-0.756588
2023-04-26,-1.299277,-0.485290,-0.808197,-1.988203
2023-05-03,-0.977997,0.230503,-0.663490,0.486109
2023-05-10,0.838546,-1.123142,-0.766710,0.147016
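
With `.ix` gone, selection is split between `.iloc` (by position) and `.loc` (by label). The sketch below contrasts the two on a small weekly frame; the frame and its column names `a` and `b` are made up for the example and are not the `long_df` above.

```python
import numpy as np
import pandas as pd

weekly = pd.date_range('2021-06-18', periods=8, freq='W-WED')
frame = pd.DataFrame(np.arange(16).reshape(8, 2), index=weekly, columns=['a', 'b'])

# iloc: purely positional (first three rows, first column)
by_position = frame.iloc[:3, 0]

# loc: label-based; a date string is matched against the DatetimeIndex
by_label = frame.loc['2021-07', 'b']

print(by_position.tolist(), by_label.tolist())
```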


In [106]:
long_df.index

DatetimeIndex(['2021-06-23', '2021-06-30', '2021-07-07', '2021-07-14',
               '2021-07-21', '2021-07-28', '2021-08-04', '2021-08-11',
               '2021-08-18', '2021-08-25', '2021-09-01', '2021-09-08',
               '2021-09-15', '2021-09-22', '2021-09-29', '2021-10-06',
               '2021-10-13', '2021-10-20', '2021-10-27', '2021-11-03',
               '2021-11-10', '2021-11-17', '2021-11-24', '2021-12-01',
               '2021-12-08', '2021-12-15', '2021-12-22', '2021-12-29',
               '2022-01-05', '2022-01-12', '2022-01-19', '2022-01-26',
               '2022-02-02', '2022-02-09', '2022-02-16', '2022-02-23',
               '2022-03-02', '2022-03-09', '2022-03-16', '2022-03-23',
               '2022-03-30', '2022-04-06', '2022-04-13', '2022-04-20',
               '2022-04-27', '2022-05-04', '2022-05-11', '2022-05-18',
               '2022-05-25', '2022-06-01', '2022-06-08', '2022-06-15',
               '2022-06-22', '2022-06-29', '2022-07-06', '2022-07-13',
      

## 1.8 Time Series with Duplicate Indices
In some applications, multiple observations may fall on the same timestamp.

In [108]:
dates = pd.DatetimeIndex(['2021-06-23', 
                          '2021-06-30', 
                          '2021-06-30', 
                          '2021-06-30', 
                          '2021-07-07', 
                          '2021-07-14',
                          '2021-07-14',
                          '2021-07-14',
                          '2021-07-21'])
dates

DatetimeIndex(['2021-06-23', '2021-06-30', '2021-06-30', '2021-06-30',
               '2021-07-07', '2021-07-14', '2021-07-14', '2021-07-14',
               '2021-07-21'],
              dtype='datetime64[ns]', freq=None)

In [110]:
dup_ts = pd.Series(np.arange(len(dates)), index=dates)
dup_ts

2021-06-23    0
2021-06-30    1
2021-06-30    2
2021-06-30    3
2021-07-07    4
2021-07-14    5
2021-07-14    6
2021-07-14    7
2021-07-21    8
dtype: int32

In [111]:
# Check whether the index values are unique
dup_ts.index.is_unique

False

In [112]:
dup_ts['2021-06-30']

2021-06-30    1
2021-06-30    2
2021-06-30    3
dtype: int32

### Aggregating a non-unique index with groupby

In [114]:
grouped = dup_ts.groupby(level=0)
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000021A1F146FD0>

In [115]:
dup_ts.groupby?

```python
dup_ts.groupby(
    by=None,
    axis=0,
    level=None,
    as_index: bool = True,
    sort: bool = True,
    group_keys: bool = True,
    squeeze: bool = <object object at 0x0000021A19AE6530>,
    observed: bool = False,
    dropna: bool = True,
) -> 'SeriesGroupBy'
```

In [116]:
grouped.count()

2021-06-23    1
2021-06-30    3
2021-07-07    1
2021-07-14    3
2021-07-21    1
dtype: int64

In [117]:
grouped.mean()

2021-06-23    0
2021-06-30    2
2021-07-07    4
2021-07-14    6
2021-07-21    8
dtype: int32
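
Aggregation is one way to resolve duplicate timestamps; dropping the extra rows is another. The sketch below shows both on a shortened, hypothetical version of the index used above. `Index.duplicated` is a standard pandas method, but which observation to keep is a choice that depends on the data.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2021-06-23', '2021-06-30', '2021-06-30',
                          '2021-07-07', '2021-07-14', '2021-07-14'])
dup = pd.Series(np.arange(len(dates)), index=dates)

# Aggregate duplicates: one row per distinct timestamp
summed = dup.groupby(level=0).sum()

# Or drop duplicates, keeping the first observation at each timestamp
first_only = dup[~dup.index.duplicated(keep='first')]

print(summed.tolist(), first_only.tolist(), first_only.index.is_unique)
```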

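# 2. Date Ranges, Frequencies, and Shifting
pandas has a standard set of time series frequencies, along with tools for resampling, inferring frequencies, and generating fixed-frequency date ranges. `resample` converts a time series to one with a fixed frequency.

## 2.1 Generating date ranges with `pd.date_range()`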

In [118]:
# Daily frequency by default
index = pd.date_range('6/1/2021', '8/1/2021')
index

DatetimeIndex(['2021-06-01', '2021-06-02', '2021-06-03', '2021-06-04',
               '2021-06-05', '2021-06-06', '2021-06-07', '2021-06-08',
               '2021-06-09', '2021-06-10', '2021-06-11', '2021-06-12',
               '2021-06-13', '2021-06-14', '2021-06-15', '2021-06-16',
               '2021-06-17', '2021-06-18', '2021-06-19', '2021-06-20',
               '2021-06-21', '2021-06-22', '2021-06-23', '2021-06-24',
               '2021-06-25', '2021-06-26', '2021-06-27', '2021-06-28',
               '2021-06-29', '2021-06-30', '2021-07-01', '2021-07-02',
               '2021-07-03', '2021-07-04', '2021-07-05', '2021-07-06',
               '2021-07-07', '2021-07-08', '2021-07-09', '2021-07-10',
               '2021-07-11', '2021-07-12', '2021-07-13', '2021-07-14',
               '2021-07-15', '2021-07-16', '2021-07-17', '2021-07-18',
               '2021-07-19', '2021-07-20', '2021-07-21', '2021-07-22',
               '2021-07-23', '2021-07-24', '2021-07-25', '2021-07-26',
      

In [121]:
pd.date_range?

```python
Signature:
pd.date_range(
    start=None,
    end=None,
    periods=None,
    freq=None,
    tz=None,
    normalize=False,
    name=None,
    closed=None,
    **kwargs,
) -> pandas.core.indexes.datetimes.DatetimeIndex
Docstring:
Return a fixed frequency DatetimeIndex.
```

### Using the `freq` parameter
- BM (business end of month): the last business day of each month

In [123]:
# freq='BM': the last business day of each month
index = pd.date_range('1/1/2021', '1/1/2022', 
                      freq='BM')
index

DatetimeIndex(['2021-01-29', '2021-02-26', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-30', '2021-08-31',
               '2021-09-30', '2021-10-29', '2021-11-30', '2021-12-31'],
              dtype='datetime64[ns]', freq='BM')

### Using the `periods` parameter

In [126]:
index = pd.date_range('1/1/2021', '1/1/2022', 
                      periods=24)
index, len(index)

(DatetimeIndex([          '2021-01-01 00:00:00',
                '2021-01-16 20:52:10.434782608',
                '2021-02-01 17:44:20.869565217',
                '2021-02-17 14:36:31.304347826',
                '2021-03-05 11:28:41.739130435',
                '2021-03-21 08:20:52.173913044',
                '2021-04-06 05:13:02.608695652',
                '2021-04-22 02:05:13.043478262',
                '2021-05-07 22:57:23.478260870',
                '2021-05-23 19:49:33.913043478',
                '2021-06-08 16:41:44.347826088',
                '2021-06-24 13:33:54.782608696',
                '2021-07-10 10:26:05.217391304',
                '2021-07-26 07:18:15.652173914',
                '2021-08-11 04:10:26.086956524',
                '2021-08-27 01:02:36.521739132',
                '2021-09-11 21:54:46.956521740',
                '2021-09-27 18:46:57.391304348',
                '2021-10-13 15:39:07.826086956',
                '2021-10-29 12:31:18.260869568',
                '202

### Using the `normalize` parameter
Normalizes the timestamps to midnight (00:00).

In [129]:
index = pd.date_range('6/1/2021 11:11:11', periods=11, normalize=True)
index, len(index)

(DatetimeIndex(['2021-06-01', '2021-06-02', '2021-06-03', '2021-06-04',
                '2021-06-05', '2021-06-06', '2021-06-07', '2021-06-08',
                '2021-06-09', '2021-06-10', '2021-06-11'],
               dtype='datetime64[ns]', freq='D'),
 11)

In [130]:
index[0]

Timestamp('2021-06-01 00:00:00', freq='D')

## 2.2 Frequencies and Date Offsets
- M: month (anchored on the month end)
- H: hour

In [134]:
pd.date_range('6/1/2021', '12/11/2021', freq='4h')

DatetimeIndex(['2021-06-01 00:00:00', '2021-06-01 04:00:00',
               '2021-06-01 08:00:00', '2021-06-01 12:00:00',
               '2021-06-01 16:00:00', '2021-06-01 20:00:00',
               '2021-06-02 00:00:00', '2021-06-02 04:00:00',
               '2021-06-02 08:00:00', '2021-06-02 12:00:00',
               ...
               '2021-12-09 12:00:00', '2021-12-09 16:00:00',
               '2021-12-09 20:00:00', '2021-12-10 00:00:00',
               '2021-12-10 04:00:00', '2021-12-10 08:00:00',
               '2021-12-10 12:00:00', '2021-12-10 16:00:00',
               '2021-12-10 20:00:00', '2021-12-11 00:00:00'],
              dtype='datetime64[ns]', length=1159, freq='4H')

In [136]:
pd.date_range('6/1/2021', periods=10, freq='H')

DatetimeIndex(['2021-06-01 00:00:00', '2021-06-01 01:00:00',
               '2021-06-01 02:00:00', '2021-06-01 03:00:00',
               '2021-06-01 04:00:00', '2021-06-01 05:00:00',
               '2021-06-01 06:00:00', '2021-06-01 07:00:00',
               '2021-06-01 08:00:00', '2021-06-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')

In [137]:
pd.date_range('6/1/2021', periods=10, freq='M')

DatetimeIndex(['2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30',
               '2021-10-31', '2021-11-30', '2021-12-31', '2022-01-31',
               '2022-02-28', '2022-03-31'],
              dtype='datetime64[ns]', freq='M')

### Passing a frequency string

In [135]:
pd.date_range('6/1/2021', periods=10, freq='4h30min')

DatetimeIndex(['2021-06-01 00:00:00', '2021-06-01 04:30:00',
               '2021-06-01 09:00:00', '2021-06-01 13:30:00',
               '2021-06-01 18:00:00', '2021-06-01 22:30:00',
               '2021-06-02 03:00:00', '2021-06-02 07:30:00',
               '2021-06-02 12:00:00', '2021-06-02 16:30:00'],
              dtype='datetime64[ns]', freq='270T')

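### Base time series frequencies for the `freq` parameter

|Alias|Offset type|Description|
|:--|:--|:--|
|D|Day|Every calendar day|
|B|BusinessDay|Every business day|
|H|Hour|Every hour|
|T/min|Minute|Every minute|
|S|Second|Every second|
|L/ms|Milli|Every millisecond|
|U|Micro|Every microsecond|
|M|MonthEnd|Last calendar day of each month|
|BM|BusinessMonthEnd|Last business day of each month|
|MS|MonthBegin|First calendar day of each month|
|BMS|BusinessMonthBegin|First business day of each month|
|W-MON, W-TUE, ...|Week|Weekly, on the given day of the week (MON, TUE, WED, THU, FRI, SAT, SUN)|
|WOM-1MON, WOM-2MON, ...|WeekOfMonth|The given weekday in the first, second, third, or fourth week of each month. For example, WOM-3FRI is the third Friday of each month|
|Q-JAN, Q-FEB, ...|QuarterEnd|For a year ending in the given month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC), the last calendar day of the last month of each quarter|
|BQ-JAN, BQ-FEB, ...|BQuarterEnd|For a year ending in the given month, the last business day of the last month of each quarter|
|QS-JAN, QS-FEB, ...|QuarterBegin|For a year ending in the given month, the first calendar day of the last month of each quarter|
|BQS-JAN, BQS-FEB, ...|BQuarterBegin|For a year ending in the given month, the first business day of the last month of each quarter|
|A-JAN, A-FEB, ...|YearEnd|Last calendar day of the given month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC) each year|
|BA-JAN, BA-FEB, ...|BYearEnd|Last business day of the given month each year|
|AS-JAN, AS-FEB, ...|YearBegin|First calendar day of the given month each year|
|BAS-JAN, BAS-FEB, ...|BYearBegin|First business day of the given month each year|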
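
Each alias in the table corresponds to an offset class in `pandas.tseries.offsets` (also reachable as `pd.offsets`), and `freq` accepts either form. The sketch below mixes the two. Passing the class is also a way to sidestep alias spelling, since newer pandas releases have renamed some aliases (for example `ME` in place of `M`); treat the exact set of accepted strings as version-dependent.

```python
import pandas as pd

# Third Friday of each month, via the alias string from the table
third_fridays = pd.date_range('2021-06-01', periods=3, freq='WOM-3FRI')

# Month ends, via the offset class instead of its alias
month_ends = pd.date_range('2021-06-01', periods=3, freq=pd.offsets.MonthEnd())

# Last business day of each month
business_month_ends = pd.date_range('2021-06-01', periods=3, freq=pd.offsets.BMonthEnd())

# Offsets take a multiple: every 4 hours
every_4h = pd.date_range('2021-06-01', periods=3, freq=pd.offsets.Hour(4))

for rng in (third_fridays, month_ends, business_month_ends, every_4h):
    print(list(rng.strftime('%Y-%m-%d %H:%M')))
```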

In [138]:
# Example
# 'WOM-3FRI' means the third Friday of each month
rng = pd.date_range('6/1/2021','12/11/2021', freq='WOM-3FRI')
rng

DatetimeIndex(['2021-06-18', '2021-07-16', '2021-08-20', '2021-09-17',
               '2021-10-15', '2021-11-19'],
              dtype='datetime64[ns]', freq='WOM-3FRI')

In [140]:
rng = pd.date_range('6/1/2021','1/1/2022', freq='BQ-DEC')
rng

DatetimeIndex(['2021-06-30', '2021-09-30', '2021-12-31'], dtype='datetime64[ns]', freq='BQ-DEC')

In [141]:
pd.date_range?

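## 2.3 Shifting (Leading and Lagging) Data
Shifting means moving data backward or forward along the time axis. Both Series and DataFrame have a `.shift()` method that performs a plain forward or backward shift of the data while leaving the index unchanged.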

In [155]:
periods = 10
ts = pd.Series(np.random.randn(periods), 
               index=pd.date_range('6/1/2021', periods=periods, freq='M'))
ts

2021-06-30   -0.841412
2021-07-31   -0.617966
2021-08-31   -0.944509
2021-09-30    0.630180
2021-10-31   -0.203361
2021-11-30    0.913479
2021-12-31    0.082896
2022-01-31   -1.603623
2022-02-28    1.464227
2022-03-31   -0.350378
Freq: M, dtype: float64

In [150]:
ts.shift?

```python
ts.shift(periods=1, freq=None, axis=0, fill_value=None) -> 'Series'
```

In [156]:
ts.shift(1)

2021-06-30         NaN
2021-07-31   -0.841412
2021-08-31   -0.617966
2021-09-30   -0.944509
2021-10-31    0.630180
2021-11-30   -0.203361
2021-12-31    0.913479
2022-01-31    0.082896
2022-02-28   -1.603623
2022-03-31    1.464227
Freq: M, dtype: float64

In [157]:
ts.shift(1, freq='M')

2021-07-31   -0.841412
2021-08-31   -0.617966
2021-09-30   -0.944509
2021-10-31    0.630180
2021-11-30   -0.203361
2021-12-31    0.913479
2022-01-31    0.082896
2022-02-28   -1.603623
2022-03-31    1.464227
2022-04-30   -0.350378
Freq: M, dtype: float64

In [158]:
ts

2021-06-30   -0.841412
2021-07-31   -0.617966
2021-08-31   -0.944509
2021-09-30    0.630180
2021-10-31   -0.203361
2021-11-30    0.913479
2021-12-31    0.082896
2022-01-31   -1.603623
2022-02-28    1.464227
2022-03-31   -0.350378
Freq: M, dtype: float64

### Computing percent changes in one or more time series

In [159]:
ts / ts.shift(1) - 1

2021-06-30          NaN
2021-07-31    -0.265561
2021-08-31     0.528417
2021-09-30    -1.667204
2021-10-31    -1.322704
2021-11-30    -5.491898
2021-12-31    -0.909253
2022-01-31   -20.345109
2022-02-28    -1.913075
2022-03-31    -1.239292
Freq: M, dtype: float64
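
The shift-and-divide expression above is what the built-in `pct_change()` method computes. The sketch below checks this on a small series with made-up values, chosen so the results are easy to verify by hand.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 99.0],
                   index=pd.date_range('2021-06-01', periods=4, freq='D'))

# Manual form: current value over previous value, minus one
manual = prices / prices.shift(1) - 1

# Built-in equivalent
builtin = prices.pct_change()

print(manual.round(4).tolist())
print(builtin.round(4).tolist())
```

The first element is `NaN` in both cases because there is no earlier value to compare against.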

### Shifting dates with offsets

In [163]:
from pandas.tseries.offsets import Day, MonthEnd

In [164]:
now = datetime(2021, 6, 1)
now

datetime.datetime(2021, 6, 1, 0, 0)

In [165]:
Day?

In [166]:
now + 3 * Day()

Timestamp('2021-06-04 00:00:00')

In [167]:
MonthEnd?

In [169]:
offset = MonthEnd()
offset

<MonthEnd>

In [170]:
offset.rollforward(now)

Timestamp('2021-06-30 00:00:00')

In [171]:
offset.rollback(now)

Timestamp('2021-05-31 00:00:00')
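
Adding an anchored offset such as `MonthEnd` differs from `rollforward` when the date already sits on the anchor. The sketch below uses two arbitrary dates to show the difference.

```python
from datetime import datetime
from pandas.tseries.offsets import MonthEnd

mid_month = datetime(2021, 6, 15)
on_month_end = datetime(2021, 6, 30)

# Adding an anchored offset moves to the next anchor point
print(mid_month + MonthEnd())      # end of June
print(mid_month + MonthEnd(2))     # end of July

# rollforward leaves a date alone when it is already on an anchor,
# whereas addition always moves strictly forward
print(MonthEnd().rollforward(on_month_end))
print(on_month_end + MonthEnd())
```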

# 3. Periods and Period Arithmetic

In [175]:
p = pd.Period(2007, freq='A-DEC')
p

Period('2007', 'A-DEC')

In [177]:
pd.Period(2021, freq='A-DEC') - p

<14 * YearEnds: month=12>

In [180]:
rng = pd.period_range('6/1/2021', '5/31/2022', freq='M')
rng, len(rng)

(PeriodIndex(['2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11',
              '2021-12', '2022-01', '2022-02', '2022-03', '2022-04', '2022-05'],
             dtype='period[M]', freq='M'),
 12)

A PeriodIndex holds a sequence of Period objects and can serve as an axis index in any pandas data structure:

In [181]:
pd.Series(np.random.randn(len(rng)), index=rng)

2021-06   -0.028997
2021-07    2.037657
2021-08   -0.377063
2021-09   -0.039834
2021-10   -1.945070
2021-11   -1.081119
2021-12    0.935484
2022-01    0.903672
2022-02    0.458917
2022-03    1.330505
2022-04    2.759339
2022-05    1.038594
Freq: M, dtype: float64

In [184]:
values = ['2021Q3', '2021Q2', '2021Q1']
index = pd.PeriodIndex(values, freq = 'Q-DEC')
index

PeriodIndex(['2021Q3', '2021Q2', '2021Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

## 3.1 Period Frequency Conversion
`Period` and `PeriodIndex` objects can both be converted to another frequency with their `asfreq` method.

In [185]:
p = pd.Period(2007, freq='A-DEC')
p.asfreq('M', how='start')

Period('2007-01', 'M')
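
The `how` argument decides which end of the original span the result is taken from. The sketch below converts an annual period to monthly at both ends and then back again. It lets pandas infer the annual frequency from the string `'2007'` instead of naming an alias, which is a choice made for this example.

```python
import pandas as pd

year = pd.Period('2007')  # annual frequency is inferred from the string

# Annual to monthly: pick the first or the last month of the year
first_month = year.asfreq('M', how='start')
last_month = year.asfreq('M', how='end')

# Monthly back to annual: the result is the year containing the month
back_to_year = last_month.asfreq(year.freqstr)

print(first_month, last_month, back_to_year)
```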

In [186]:
p.asfreq?

```python

Docstring:
Convert Period to desired frequency, at the start or end of the interval.

Parameters
----------
freq : str
    The desired frequency.
how : {'E', 'S', 'end', 'start'}, default 'end'
    Start or end of the timespan.

Returns
-------
resampled : Period
Type:      builtin_function_or_method
```

## 3.2 Quarterly Period Frequencies

In [191]:
# With Q-DEC, the fourth quarter covers October, November, and December
p = pd.Period('2021Q4', freq='Q-DEC')
p

Period('2021Q4', 'Q-DEC')
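
The month in the quarterly alias names the month in which the fiscal year ends, so the same label `2021Q4` can cover different calendar months. The sketch below compares `Q-DEC` with `Q-JAN`; the choice of `Q-JAN` is only an illustration.

```python
import pandas as pd

# Calendar-year quarters: Q4 covers October through December
q4 = pd.Period('2021Q4', freq='Q-DEC')
print(q4.asfreq('M', how='start'), q4.asfreq('M', how='end'))

# Fiscal year ending in January: Q4 of fiscal 2021 covers Nov 2020 through Jan 2021
fiscal_q4 = pd.Period('2021Q4', freq='Q-JAN')
print(fiscal_q4.asfreq('D', how='start'), fiscal_q4.asfreq('D', how='end'))

# Period arithmetic steps in units of the period's own frequency
next_quarter = q4 + 1
print(next_quarter)
```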

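# 4. Resampling and Frequency Conversion
Resampling is the process of converting a time series from one frequency to another.
- Upsampling: converting from a lower frequency to a higher one
- Downsampling: converting from a higher frequency to a lower one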

In [192]:
periods = 10
ts = pd.Series(np.random.randn(periods), 
               index=pd.date_range('6/1/2021', periods=periods, freq='M'))
ts

2021-06-30    0.182690
2021-07-31   -0.592748
2021-08-31   -0.587611
2021-09-30    0.005664
2021-10-31    0.806200
2021-11-30    0.732487
2021-12-31   -1.499358
2022-01-31    1.078263
2022-02-28   -0.106380
2022-03-31   -0.649591
Freq: M, dtype: float64

In [193]:
ts.resample?

```python
Signature:
ts.resample(
    rule,
    axis=0,
    closed: 'Optional[str]' = None,
    label: 'Optional[str]' = None,
    convention: 'str' = 'start',
    kind: 'Optional[str]' = None,
    loffset=None,
    base: 'Optional[int]' = None,
    on=None,
    level=None,
    origin: 'Union[str, TimestampConvertibleTypes]' = 'start_day',
    offset: 'Optional[TimedeltaConvertibleTypes]' = None,
) -> 'Resampler'
Docstring:
Resample time-series data.

Convenience method for frequency conversion and resampling of time
series. Object must have a datetime-like index (`DatetimeIndex`,
`PeriodIndex`, or `TimedeltaIndex`), or pass datetime-like values
to the `on` or `level` keyword.

Parameters
----------
rule : DateOffset, Timedelta or str
    The offset string or object representing target conversion.
axis : {0 or 'index', 1 or 'columns'}, default 0
    Which axis to use for up- or down-sampling. For `Series` this
    will default to 0, i.e. along the rows. Must be
    `DatetimeIndex`, `TimedeltaIndex` or `PeriodIndex`.
closed : {'right', 'left'}, default None
    Which side of bin interval is closed. The default is 'left'
    for all frequency offsets except for 'M', 'A', 'Q', 'BM',
    'BA', 'BQ', and 'W' which all have a default of 'right'.
label : {'right', 'left'}, default None
    Which bin edge label to label bucket with. The default is 'left'
    for all frequency offsets except for 'M', 'A', 'Q', 'BM',
    'BA', 'BQ', and 'W' which all have a default of 'right'.
convention : {'start', 'end', 's', 'e'}, default 'start'
    For `PeriodIndex` only, controls whether to use the start or
    end of `rule`.
kind : {'timestamp', 'period'}, optional, default None
    Pass 'timestamp' to convert the resulting index to a
    `DateTimeIndex` or 'period' to convert it to a `PeriodIndex`.
    By default the input representation is retained.
loffset : timedelta, default None
    Adjust the resampled time labels.

    .. deprecated:: 1.1.0
        You should add the loffset to the `df.index` after the resample.
        See below.

base : int, default 0
    For frequencies that evenly subdivide 1 day, the "origin" of the
    aggregated intervals. For example, for '5min' frequency, base could
    range from 0 through 4. Defaults to 0.

    .. deprecated:: 1.1.0
        The new arguments that you should use are 'offset' or 'origin'.

on : str, optional
    For a DataFrame, column to use instead of index for resampling.
    Column must be datetime-like.
level : str or int, optional
    For a MultiIndex, level (name or number) to use for
    resampling. `level` must be datetime-like.
origin : {'epoch', 'start', 'start_day'}, Timestamp or str, default 'start_day'
    The timestamp on which to adjust the grouping. The timezone of origin
    must match the timezone of the index.
    If a timestamp is not used, these values are also supported:

    - 'epoch': `origin` is 1970-01-01
    - 'start': `origin` is the first value of the timeseries
    - 'start_day': `origin` is the first day at midnight of the timeseries

    .. versionadded:: 1.1.0

offset : Timedelta or str, default is None
    An offset timedelta added to the origin.

    .. versionadded:: 1.1.0

Returns
-------
Resampler object

See Also
--------
groupby : Group by mapping, function, label, or list of labels.
Series.resample : Resample a Series.
DataFrame.resample: Resample a DataFrame.

Notes
-----
See the `user guide
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling>`_
for more.

To learn more about the offset strings, please see `this link
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects>`__.

Examples
--------
Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values
of the timestamps falling into a bin.

>>> series.resample('3T').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
```

In [205]:
rng = pd.date_range('6/1/2021', periods=100, freq='D')

ts = pd.Series(data=np.random.randn(len(rng)), index=rng)
ts

2021-06-01   -1.140991
2021-06-02   -0.744261
2021-06-03   -0.177664
2021-06-04    1.711626
2021-06-05    1.293986
                ...   
2021-09-04    0.764660
2021-09-05    0.046588
2021-09-06   -0.229593
2021-09-07   -2.401894
2021-09-08   -0.475323
Freq: D, Length: 100, dtype: float64

In [206]:
ts.resample('M', kind='period').mean()

2021-06    0.122118
2021-07   -0.203875
2021-08    0.379899
2021-09   -0.436096
Freq: M, dtype: float64
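
The same monthly aggregation can be written without `resample` by converting each timestamp to its containing month and grouping on that. This is an alternative spelling, not the method used above; it may be handy where the `kind` argument is unavailable, which is an assumption about other pandas versions. The data are deterministic so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2021-06-01', periods=100, freq='D')
daily = pd.Series(np.arange(100, dtype=float), index=rng)

# Convert each timestamp to its containing month, then aggregate per month
monthly_mean = daily.groupby(daily.index.to_period('M')).mean()
print(monthly_mean)
```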

## 4.1 Downsampling

In [207]:
rng = pd.date_range('6/1/2021', periods=12, freq='T')

ts = pd.Series(data=np.arange(len(rng)), index=rng)
ts

2021-06-01 00:00:00     0
2021-06-01 00:01:00     1
2021-06-01 00:02:00     2
2021-06-01 00:03:00     3
2021-06-01 00:04:00     4
2021-06-01 00:05:00     5
2021-06-01 00:06:00     6
2021-06-01 00:07:00     7
2021-06-01 00:08:00     8
2021-06-01 00:09:00     9
2021-06-01 00:10:00    10
2021-06-01 00:11:00    11
Freq: T, dtype: int32

In [208]:
ts.resample('5min').sum()

2021-06-01 00:00:00    10
2021-06-01 00:05:00    35
2021-06-01 00:10:00    21
Freq: 5T, dtype: int32

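### The `closed` parameter
closed='left': each bin is closed on its left edge, so the left boundary is included and the right boundary is excluded.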

In [209]:
ts.resample('5min', closed='left').sum()

2021-06-01 00:00:00    10
2021-06-01 00:05:00    35
2021-06-01 00:10:00    21
Freq: 5T, dtype: int32

In [210]:
ts.resample('5min', closed='right').sum()

2021-05-31 23:55:00     0
2021-06-01 00:00:00    15
2021-06-01 00:05:00    40
2021-06-01 00:10:00    11
Freq: 5T, dtype: int32

### The `label` parameter
label='left': each bin is labeled with its left edge.

In [211]:
ts.resample('5min', closed='left', label='left').sum()

2021-06-01 00:00:00    10
2021-06-01 00:05:00    35
2021-06-01 00:10:00    21
Freq: 5T, dtype: int32
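
`closed` and `label` are independent: one decides which edge belongs to the bin, the other decides which edge names it. The sketch below rebuilds the 12-minute series and sets both to the same side in each call so the two effects can be compared.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2021-06-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Bins are [00:00, 00:05), [00:05, 00:10), ... and are labeled by their left edge
left = ts.resample('5min', closed='left', label='left').sum()

# Bins are (23:55, 00:00], (00:00, 00:05], ... and are labeled by their right edge
right = ts.resample('5min', closed='right', label='right').sum()

print(left)
print(right)
```

With right-closed bins, the value at 00:00 falls into a bin of its own, which is why the second result has four rows instead of three.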

### The `loffset` parameter

In [213]:
ts.resample('5min', loffset='-5s').sum()


This call emits a deprecation warning for `loffset` (deprecated since pandas 1.1.0, according to the `resample` docstring). The warning text suggests that

>>> df.resample(freq="3s", loffset="8H")

becomes:

>>> from pandas.tseries.frequencies import to_offset
>>> df = df.resample(freq="3s").mean()
>>> df.index = df.index.to_timestamp() + to_offset("8H")


2021-05-31 23:59:55    10
2021-06-01 00:04:55    35
2021-06-01 00:09:55    21
Freq: 5T, dtype: int32
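
Since `loffset` is deprecated, the same label shift can be applied to the index after resampling. The sketch below subtracts a `pd.Timedelta` from the resulting index; using `Timedelta` here instead of `to_offset` is a choice made for this example.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2021-06-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Resample first, then shift the bin labels by hand
shifted = ts.resample('5min').sum()
shifted.index = shifted.index - pd.Timedelta(seconds=5)
print(shifted)
```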

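## 4.2 OHLC Resampling
A resampling scheme common in finance that computes four values per bin: open (first), high (maximum), low (minimum), and close (last).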

In [214]:
ts.resample('5min').ohlc()

Unnamed: 0,open,high,low,close
2021-06-01 00:00:00,0,4,0,4
2021-06-01 00:05:00,5,9,5,9
2021-06-01 00:10:00,10,11,10,11


## 4.3 Resampling with `.groupby()`

In [215]:
rng = pd.date_range('6/1/2021', periods=100, freq='D')
ts = pd.Series(data=np.arange(len(rng)), index=rng)

ts

2021-06-01     0
2021-06-02     1
2021-06-03     2
2021-06-04     3
2021-06-05     4
              ..
2021-09-04    95
2021-09-05    96
2021-09-06    97
2021-09-07    98
2021-09-08    99
Freq: D, Length: 100, dtype: int32

In [216]:
ts.groupby(lambda x: x.weekday).mean()

0    51.5
1    49.0
2    50.0
3    47.5
4    48.5
5    49.5
6    50.5
dtype: float64

In [217]:
ts.groupby(lambda x: x.month).mean()

6    14.5
7    45.0
8    76.0
9    95.5
dtype: float64
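
The lambdas above can be replaced by passing an array derived from the index, which makes it explicit what is being grouped on. The sketch below is an equivalent spelling on the same deterministic data, not a change in method.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2021-06-01', periods=100, freq='D')
ts = pd.Series(np.arange(100), index=rng)

# Group by weekday number taken from the index: Monday=0 ... Sunday=6
by_weekday = ts.groupby(ts.index.weekday).mean()

# Group by calendar month number
by_month = ts.groupby(ts.index.month).mean()

print(by_weekday)
print(by_month)
```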

## 4.4 Upsampling and Interpolation

In [223]:
dates = pd.date_range('6/18/2021', 
                      periods=2, 
                      freq='W-WED')

long_df = pd.DataFrame(np.random.randn(2, 4), 
                       index=dates, 
                       columns=['Colorado', 'Texas', 'New York', 'Califonia'])
long_df

Unnamed: 0,Colorado,Texas,New York,Califonia
2021-06-23,-2.462441,-1.250257,-0.537822,0.194408
2021-06-30,0.949421,-1.097458,-1.301532,-0.073963


In [225]:
long_df.resample('D').mean()

Unnamed: 0,Colorado,Texas,New York,Califonia
2021-06-23,-2.462441,-1.250257,-0.537822,0.194408
2021-06-24,,,,
2021-06-25,,,,
2021-06-26,,,,
2021-06-27,,,,
2021-06-28,,,,
2021-06-29,,,,
2021-06-30,0.949421,-1.097458,-1.301532,-0.073963


In [230]:
long_df.resample('D').ffill()

Unnamed: 0,Colorado,Texas,New York,Califonia
2021-06-23,-2.462441,-1.250257,-0.537822,0.194408
2021-06-24,-2.462441,-1.250257,-0.537822,0.194408
2021-06-25,-2.462441,-1.250257,-0.537822,0.194408
2021-06-26,-2.462441,-1.250257,-0.537822,0.194408
2021-06-27,-2.462441,-1.250257,-0.537822,0.194408
2021-06-28,-2.462441,-1.250257,-0.537822,0.194408
2021-06-29,-2.462441,-1.250257,-0.537822,0.194408
2021-06-30,0.949421,-1.097458,-1.301532,-0.073963


In [231]:
long_df.resample('D').ffill(limit=2)

Unnamed: 0,Colorado,Texas,New York,Califonia
2021-06-23,-2.462441,-1.250257,-0.537822,0.194408
2021-06-24,-2.462441,-1.250257,-0.537822,0.194408
2021-06-25,-2.462441,-1.250257,-0.537822,0.194408
2021-06-26,,,,
2021-06-27,,,,
2021-06-28,,,,
2021-06-29,,,,
2021-06-30,0.949421,-1.097458,-1.301532,-0.073963


In [235]:
long_df.ffill?

In [236]:
long_df.resample('D').backfill()

Unnamed: 0,Colorado,Texas,New York,Califonia
2021-06-23,-2.462441,-1.250257,-0.537822,0.194408
2021-06-24,0.949421,-1.097458,-1.301532,-0.073963
2021-06-25,0.949421,-1.097458,-1.301532,-0.073963
2021-06-26,0.949421,-1.097458,-1.301532,-0.073963
2021-06-27,0.949421,-1.097458,-1.301532,-0.073963
2021-06-28,0.949421,-1.097458,-1.301532,-0.073963
2021-06-29,0.949421,-1.097458,-1.301532,-0.073963
2021-06-30,0.949421,-1.097458,-1.301532,-0.073963


In [238]:
long_df.resample('D').fillna(method='bfill')

Unnamed: 0,Colorado,Texas,New York,Califonia
2021-06-23,-2.462441,-1.250257,-0.537822,0.194408
2021-06-24,0.949421,-1.097458,-1.301532,-0.073963
2021-06-25,0.949421,-1.097458,-1.301532,-0.073963
2021-06-26,0.949421,-1.097458,-1.301532,-0.073963
2021-06-27,0.949421,-1.097458,-1.301532,-0.073963
2021-06-28,0.949421,-1.097458,-1.301532,-0.073963
2021-06-29,0.949421,-1.097458,-1.301532,-0.073963
2021-06-30,0.949421,-1.097458,-1.301532,-0.073963
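
The fills above copy existing values into the gaps. Interpolation instead estimates the intermediate values. The sketch below applies linear interpolation to a two-row weekly frame with made-up values, chosen so each daily step is easy to verify.

```python
import pandas as pd

dates = pd.date_range('2021-06-18', periods=2, freq='W-WED')
weekly = pd.DataFrame({'Colorado': [0.0, 7.0], 'Texas': [10.0, 3.0]}, index=dates)

# Upsample to daily rows, then fill the gaps by linear interpolation
daily = weekly.resample('D').interpolate(method='linear')
print(daily)
```

Each column moves in equal daily steps between the two known rows: `Colorado` rises by 1 per day and `Texas` falls by 1 per day.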
