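# Working with Time Series Data in pandas

In pandas, what distinguishes time series data from ordinary tabular data is the index (`Index`).

Ordinary tabular data can use arbitrary values as its index, whereas time series data uses one of the following classes as its index.

* `DatetimeIndex` : timestamps

* `PeriodIndex` : periods (time spans)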

## DatetimeIndex

`DatetimeIndex` is an index for time series data in timestamp form, that is, values recorded at specific instants. A timestamp index does not require the observations to be spaced at regular intervals.
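To make the "no regular spacing required" point concrete, here is a minimal sketch. The timestamps and values are hypothetical sample data, not taken from the examples in this document.

```python
import pandas as pd

# Timestamps recorded at irregular moments (hypothetical sample values).
stamps = ["2016-01-01 09:00", "2016-01-01 09:07", "2016-01-03 18:30"]
idx = pd.DatetimeIndex(stamps)
s = pd.Series([10, 20, 30], index=idx)

# No fixed frequency is attached, because the spacing is uneven.
print(idx.freq)

# The gaps between consecutive observations differ from one another.
gaps = idx[1:] - idx[:-1]
print(gaps)
```

The index is still a valid `DatetimeIndex`; only its `freq` attribute is left unset.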

An index of type `DatetimeIndex` is usually created in one of the following ways.

* the `pd.to_datetime` function
* the `pd.date_range` function

### to_datetime  

In [1]:
import numpy as np
import pandas as pd

date_str = ["2016, 1, 1", "2016, 1, 4", "2016, 1, 5", "2016, 1, 6"]
idx = pd.to_datetime(date_str)
idx

DatetimeIndex(['2016-01-01', '2016-01-04', '2016-01-05', '2016-01-06'], dtype='datetime64[ns]', freq=None)

In [2]:
np.random.seed(0)
s = pd.Series(np.random.randn(4), index=idx)
s

2016-01-01    1.764052
2016-01-04    0.400157
2016-01-05    0.978738
2016-01-06    2.240893
dtype: float64

In [3]:
s = pd.Series(np.random.randn(4), index=date_str)
s

2016, 1, 1    1.867558
2016, 1, 4   -0.977278
2016, 1, 5    0.950088
2016, 1, 6   -0.151357
dtype: float64
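The two series above look similar, but only the first has a `DatetimeIndex`; the second keeps the raw strings as labels. The sketch below illustrates the practical difference. The `format` string and the fixed `values` list are choices made for this illustration, not part of the original cells.

```python
import pandas as pd

date_str = ["2016, 1, 1", "2016, 1, 4", "2016, 1, 5", "2016, 1, 6"]
values = [1.0, 2.0, 3.0, 4.0]

# Passing an explicit format avoids relying on format inference.
idx = pd.to_datetime(date_str, format="%Y, %m, %d")

s_time = pd.Series(values, index=idx)       # index is a DatetimeIndex
s_text = pd.Series(values, index=date_str)  # index holds plain strings

print(type(s_time.index).__name__)
print(type(s_text.index).__name__)

# Date-based selection is available on the DatetimeIndex version.
selected = s_time["2016-01-04":"2016-01-05"]
print(selected)
```

With the string index, a label such as `"2016-01-04"` would simply not be found, because the stored labels are the literal strings `"2016, 1, 4"` and so on.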

### date_range 

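* Given a start date and an end date, or a start date and a number of periods, it generates a date/time index covering that range.
* The `freq` argument sets the frequency.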
  * http://pandas.pydata.org/pandas-docs/version/0.18.0/timeseries.html#offset-aliases

In [4]:
pd.date_range("2016-4-1", "2016-4-30")

DatetimeIndex(['2016-04-01', '2016-04-02', '2016-04-03', '2016-04-04',
               '2016-04-05', '2016-04-06', '2016-04-07', '2016-04-08',
               '2016-04-09', '2016-04-10', '2016-04-11', '2016-04-12',
               '2016-04-13', '2016-04-14', '2016-04-15', '2016-04-16',
               '2016-04-17', '2016-04-18', '2016-04-19', '2016-04-20',
               '2016-04-21', '2016-04-22', '2016-04-23', '2016-04-24',
               '2016-04-25', '2016-04-26', '2016-04-27', '2016-04-28',
               '2016-04-29', '2016-04-30'],
              dtype='datetime64[ns]', freq='D')

In [5]:
pd.date_range(start="2016-4-1", periods=30)

DatetimeIndex(['2016-04-01', '2016-04-02', '2016-04-03', '2016-04-04',
               '2016-04-05', '2016-04-06', '2016-04-07', '2016-04-08',
               '2016-04-09', '2016-04-10', '2016-04-11', '2016-04-12',
               '2016-04-13', '2016-04-14', '2016-04-15', '2016-04-16',
               '2016-04-17', '2016-04-18', '2016-04-19', '2016-04-20',
               '2016-04-21', '2016-04-22', '2016-04-23', '2016-04-24',
               '2016-04-25', '2016-04-26', '2016-04-27', '2016-04-28',
               '2016-04-29', '2016-04-30'],
              dtype='datetime64[ns]', freq='D')
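The two calls above produce the same index. A short check, as a sketch: the third form (`end` plus `periods`) does not appear in the text above and is included only for comparison.

```python
import pandas as pd

by_end = pd.date_range("2016-4-1", "2016-4-30")            # start and end
by_count = pd.date_range(start="2016-4-1", periods=30)     # start and length
by_end_count = pd.date_range(end="2016-4-30", periods=30)  # end and length

print(by_end.equals(by_count))
print(by_end.equals(by_end_count))
print(len(by_end))
```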

In [6]:
pd.date_range("2016-4-1", "2016-4-30", freq="B")

DatetimeIndex(['2016-04-01', '2016-04-04', '2016-04-05', '2016-04-06',
               '2016-04-07', '2016-04-08', '2016-04-11', '2016-04-12',
               '2016-04-13', '2016-04-14', '2016-04-15', '2016-04-18',
               '2016-04-19', '2016-04-20', '2016-04-21', '2016-04-22',
               '2016-04-25', '2016-04-26', '2016-04-27', '2016-04-28',
               '2016-04-29'],
              dtype='datetime64[ns]', freq='B')

In [7]:
pd.date_range("2016-4-1", "2016-12-31", freq="MS")

DatetimeIndex(['2016-04-01', '2016-05-01', '2016-06-01', '2016-07-01',
               '2016-08-01', '2016-09-01', '2016-10-01', '2016-11-01',
               '2016-12-01'],
              dtype='datetime64[ns]', freq='MS')

In [8]:
pd.date_range("2016-4-1", "2016-12-31", freq="M")

DatetimeIndex(['2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31',
               '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
               '2016-12-31'],
              dtype='datetime64[ns]', freq='M')

In [9]:
pd.date_range("2016-4-1", "2016-12-31", freq="BMS")

DatetimeIndex(['2016-04-01', '2016-05-02', '2016-06-01', '2016-07-01',
               '2016-08-01', '2016-09-01', '2016-10-03', '2016-11-01',
               '2016-12-01'],
              dtype='datetime64[ns]', freq='BMS')

In [10]:
pd.date_range("2016-4-1", "2016-12-31", freq="BM")

DatetimeIndex(['2016-04-29', '2016-05-31', '2016-06-30', '2016-07-29',
               '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
               '2016-12-30'],
              dtype='datetime64[ns]', freq='BM')

In [11]:
pd.date_range("2016-1-1", "2016-12-31", freq="W-MON")

DatetimeIndex(['2016-01-04', '2016-01-11', '2016-01-18', '2016-01-25',
               '2016-02-01', '2016-02-08', '2016-02-15', '2016-02-22',
               '2016-02-29', '2016-03-07', '2016-03-14', '2016-03-21',
               '2016-03-28', '2016-04-04', '2016-04-11', '2016-04-18',
               '2016-04-25', '2016-05-02', '2016-05-09', '2016-05-16',
               '2016-05-23', '2016-05-30', '2016-06-06', '2016-06-13',
               '2016-06-20', '2016-06-27', '2016-07-04', '2016-07-11',
               '2016-07-18', '2016-07-25', '2016-08-01', '2016-08-08',
               '2016-08-15', '2016-08-22', '2016-08-29', '2016-09-05',
               '2016-09-12', '2016-09-19', '2016-09-26', '2016-10-03',
               '2016-10-10', '2016-10-17', '2016-10-24', '2016-10-31',
               '2016-11-07', '2016-11-14', '2016-11-21', '2016-11-28',
               '2016-12-05', '2016-12-12', '2016-12-19', '2016-12-26'],
              dtype='datetime64[ns]', freq='W-MON')

In [12]:
pd.date_range("2016-1-1", "2016-12-31", freq="WOM-2THU")

DatetimeIndex(['2016-01-14', '2016-02-11', '2016-03-10', '2016-04-14',
               '2016-05-12', '2016-06-09', '2016-07-14', '2016-08-11',
               '2016-09-08', '2016-10-13', '2016-11-10', '2016-12-08'],
              dtype='datetime64[ns]', freq='WOM-2THU')

In [13]:
pd.date_range("2016-1-1", "2016-12-31", freq="Q-DEC")

DatetimeIndex(['2016-03-31', '2016-06-30', '2016-09-30', '2016-12-31'], dtype='datetime64[ns]', freq='Q-DEC')
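One way to confirm what a frequency alias means is to inspect the weekday and day-of-month of the generated dates. The following is a sketch along those lines; the variable names are illustrative. Newer pandas releases (2.2 and later) deprecate some of the aliases used above, such as `M`, `BM` and `Q-DEC`, in favor of `ME`, `BME` and `QE-DEC`, so this sketch only uses aliases that should behave the same across versions.

```python
import pandas as pd

start, end = "2016-1-1", "2016-12-31"

business_days = pd.date_range("2016-4-1", "2016-4-30", freq="B")
mondays = pd.date_range(start, end, freq="W-MON")
second_thursdays = pd.date_range(start, end, freq="WOM-2THU")
month_starts = pd.date_range(start, end, freq="MS")

# dayofweek uses Monday=0 ... Sunday=6.
print(business_days.dayofweek.max())               # largest weekday among business days
print(mondays.dayofweek.unique().tolist())         # weekdays produced by W-MON
print(second_thursdays.dayofweek.unique().tolist())
print(second_thursdays.day.min(), second_thursdays.day.max())
print(month_starts.day.unique().tolist())
```

A second Thursday always falls between the 8th and the 14th of the month, which is what the day-of-month range reflects.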

## PeriodIndex

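`PeriodIndex` is a time index used when the same kind of time span repeats; each entry refers to one span rather than to a single instant.

An index of type `PeriodIndex` is usually built with the following.

* the `pd.Period` class (a single period, the element type of a `PeriodIndex`)
* the `pd.period_range` function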

In [14]:
pd.Period("2007", freq="A")

Period('2007', 'A-DEC')

In [15]:
pd.period_range('1/1/2011', '1/1/2012', freq='M')

PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
             '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
             '2012-01'],
            dtype='int64', freq='M')
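Because a period stands for a whole span, it has both a start and an end, and several timestamps can fall into the same period. A minimal sketch, with dates chosen only for illustration:

```python
import pandas as pd

p = pd.Period("2011-01", freq="M")   # the whole month of January 2011
print(p.start_time)                  # first instant of the span
print(p.end_time.date())             # calendar date of the last instant
print(p + 1)                         # the following month

# Timestamps falling in the same month map to the same period.
stamps = pd.to_datetime(["2011-01-03", "2011-01-28", "2011-02-10"])
periods = stamps.to_period("M")
print(periods)

# period_range builds a PeriodIndex directly; both endpoints are included.
months = pd.period_range("2011-01", "2012-01", freq="M")
print(len(months))
```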

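## The shift operation

* `shift` moves the values relative to the dates or, when `freq` is given, moves the dates themselves.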

In [16]:
ts = pd.Series(np.random.randn(4), index=pd.date_range("2000-1-1", periods=4, freq="M"))
ts

2000-01-31   -0.103219
2000-02-29    0.410599
2000-03-31    0.144044
2000-04-30    1.454274
Freq: M, dtype: float64

In [17]:
ts.shift(1)

2000-01-31         NaN
2000-02-29   -0.103219
2000-03-31    0.410599
2000-04-30    0.144044
Freq: M, dtype: float64

In [18]:
ts.shift(-1)

2000-01-31    0.410599
2000-02-29    0.144044
2000-03-31    1.454274
2000-04-30         NaN
Freq: M, dtype: float64

In [19]:
ts.shift(1, freq="M")

2000-02-29   -0.103219
2000-03-31    0.410599
2000-04-30    0.144044
2000-05-31    1.454274
Freq: M, dtype: float64

In [20]:
ts.shift(1, freq="W")

2000-02-06   -0.103219
2000-03-05    0.410599
2000-04-02    0.144044
2000-05-07    1.454274
Freq: WOM-1SUN, dtype: float64
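The cells above show two different behaviors: without `freq`, the values slide along a fixed index and a `NaN` appears at the edge; with `freq`, the index itself is moved and no value is lost. The sketch below repeats this on a small daily series with hand-picked values so the effect is easy to follow.

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0, 4.0],
               index=pd.date_range("2000-01-01", periods=4, freq="D"))

lagged = ts.shift(1)            # values move down one row; index unchanged
moved = ts.shift(1, freq="D")   # index moves forward one day; values unchanged

print(lagged)
print(moved)

# A common use of the lagged series: change from the previous observation.
change = ts - ts.shift(1)
print(change)
```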

## Resampling

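* down-sampling: converting to a lower frequency, so each interval becomes longer (for example, daily to weekly); several values are aggregated into one
* up-sampling: converting to a higher frequency, so each interval becomes shorter (for example, minutes to seconds); new slots must be filled in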
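The two directions can be seen in the number of rows they produce. This is a small sketch on hypothetical per-minute data, using `mean` as the aggregation and `ffill` as the fill method; other choices are equally possible.

```python
import pandas as pd

# Ten per-minute values: 0.0, 1.0, ..., 9.0 (hypothetical data).
per_minute = pd.Series(range(10), dtype="float64",
                       index=pd.date_range("2000-01-01", periods=10, freq="min"))

down = per_minute.resample("5min").mean()   # down-sampling: aggregate each 5-minute bin
up = per_minute.resample("30s").ffill()     # up-sampling: fill the new 30-second slots

print(len(per_minute), len(down), len(up))
print(down)
```

Down-sampling shrinks the series and therefore needs an aggregation function; up-sampling grows it and therefore needs a rule for the values that did not exist before.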

In [35]:
ts = pd.Series(np.random.randn(100), index=pd.date_range("2000-1-1", periods=100, freq="D"))
ts.tail(20)

2000-03-21   -0.173464
2000-03-22   -0.510030
2000-03-23    1.392518
2000-03-24    1.037586
2000-03-25    0.018792
2000-03-26   -0.593777
2000-03-27   -2.011880
2000-03-28    0.589704
2000-03-29   -0.896370
2000-03-30   -1.962732
2000-03-31    1.584821
2000-04-01    0.647968
2000-04-02   -1.139008
2000-04-03   -1.214401
2000-04-04    0.870962
2000-04-05   -0.877971
2000-04-06    1.296150
2000-04-07    0.616459
2000-04-08    0.536597
2000-04-09    0.404695
Freq: D, dtype: float64

In [36]:
ts.resample('W').mean()

2000-01-02   -0.395784
2000-01-09    0.912642
2000-01-16    0.237267
2000-01-23   -0.308441
2000-01-30   -0.168457
2000-02-06    0.251602
2000-02-13    0.088259
2000-02-20    0.337014
2000-02-27    0.606611
2000-03-05    0.263323
2000-03-12    0.263994
2000-03-19   -0.071099
2000-03-26    0.327933
2000-04-02   -0.455357
2000-04-09    0.233213
Freq: W-SUN, dtype: float64

In [37]:
ts.resample('M').first()

2000-01-31    0.704111
2000-02-29   -1.171160
2000-03-31    0.770673
2000-04-30    0.647968
Freq: M, dtype: float64

In [44]:
ts = pd.Series(np.random.randn(60), index=pd.date_range("2000-1-1", periods=60, freq="T"))
ts.head(20)

2000-01-01 00:00:00    0.004175
2000-01-01 00:01:00   -1.483492
2000-01-01 00:02:00   -1.479796
2000-01-01 00:03:00    0.134687
2000-01-01 00:04:00   -0.667723
2000-01-01 00:05:00   -0.011556
2000-01-01 00:06:00    0.839491
2000-01-01 00:07:00   -0.173930
2000-01-01 00:08:00   -2.810668
2000-01-01 00:09:00   -0.150654
2000-01-01 00:10:00   -0.481044
2000-01-01 00:11:00   -0.234694
2000-01-01 00:12:00    0.899731
2000-01-01 00:13:00   -1.578530
2000-01-01 00:14:00    0.243957
2000-01-01 00:15:00    1.570304
2000-01-01 00:16:00   -0.625943
2000-01-01 00:17:00    0.472328
2000-01-01 00:18:00    0.966306
2000-01-01 00:19:00    0.210231
Freq: T, dtype: float64

In [45]:
ts.resample('10min').sum()

2000-01-01 00:00:00   -5.799465
2000-01-01 00:10:00    1.442645
2000-01-01 00:20:00    0.651518
2000-01-01 00:30:00    4.085638
2000-01-01 00:40:00   -3.496429
2000-01-01 00:50:00   -3.965800
Freq: 10T, dtype: float64

In [46]:
ts.resample('10min', closed="right").sum()

1999-12-31 23:50:00    0.004175
2000-01-01 00:00:00   -6.284684
2000-01-01 00:10:00    1.238592
2000-01-01 00:20:00    1.892161
2000-01-01 00:30:00    2.522538
2000-01-01 00:40:00   -3.702398
2000-01-01 00:50:00   -2.752276
Freq: 10T, dtype: float64
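The `closed` argument decides which edge of each bin is inclusive, which is why the second result above gains an extra bin labeled `1999-12-31 23:50:00` that holds only the very first value. A sketch with small integers makes the bin membership easy to verify by hand. The `label="right"` variant is not used in the cells above and is added here only for comparison.

```python
import pandas as pd

ts = pd.Series([0, 1, 2, 3, 4, 5],
               index=pd.date_range("2000-01-01", periods=6, freq="min"))

# Default (closed="left"): bins [00:00, 00:03) and [00:03, 00:06).
left = ts.resample("3min").sum()

# closed="right": bins (23:57, 00:00], (00:00, 00:03], (00:03, 00:06].
right = ts.resample("3min", closed="right").sum()

# Same bins, but each bin is labeled by its right edge instead of its left edge.
right_labeled = ts.resample("3min", closed="right", label="right").sum()

print(left.tolist())
print(right.tolist())
print(right.index[0], right_labeled.index[0])
```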

In [47]:
ts.resample('5min').ohlc()

| | open | high | low | close |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 0.004175 | 0.134687 | -1.483492 | -0.667723 |
| 2000-01-01 00:05:00 | -0.011556 | 0.839491 | -2.810668 | -0.150654 |
| 2000-01-01 00:10:00 | -0.481044 | 0.899731 | -1.57853 | 0.243957 |
| 2000-01-01 00:15:00 | 1.570304 | 1.570304 | -0.625943 | 0.210231 |
| 2000-01-01 00:20:00 | -0.685097 | 0.7438 | -0.786468 | -0.786468 |
| 2000-01-01 00:25:00 | -1.176473 | 2.360229 | -1.280807 | 2.360229 |
| 2000-01-01 00:30:00 | 0.555546 | 0.99915 | -0.966063 | -0.966063 |
| 2000-01-01 00:35:00 | 2.160013 | 2.160013 | -0.7034 | 1.092339 |
| 2000-01-01 00:40:00 | -1.007555 | 0.566869 | -1.007555 | -0.489482 |
| 2000-01-01 00:45:00 | 0.763541 | 0.763541 | -1.109073 | -0.84721 |
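`ohlc` summarizes each bin by its first value (open), maximum (high), minimum (low) and last value (close). The sketch below checks this on hand-picked numbers by comparing against an explicit aggregation; the data and variable names are illustrative.

```python
import pandas as pd

ts = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],
               index=pd.date_range("2000-01-01", periods=10, freq="min"))

bars = ts.resample("5min").ohlc()
manual = ts.resample("5min").agg(["first", "max", "min", "last"])

print(bars)

# The two tables hold the same numbers, column for column.
same = (bars.values == manual.values).all()
print(same)
```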


In [55]:
ts.resample('30s').ffill().head(20)

2000-01-01 00:00:00    0.004175
2000-01-01 00:00:30    0.004175
2000-01-01 00:01:00   -1.483492
2000-01-01 00:01:30   -1.483492
2000-01-01 00:02:00   -1.479796
2000-01-01 00:02:30   -1.479796
2000-01-01 00:03:00    0.134687
2000-01-01 00:03:30    0.134687
2000-01-01 00:04:00   -0.667723
2000-01-01 00:04:30   -0.667723
2000-01-01 00:05:00   -0.011556
2000-01-01 00:05:30   -0.011556
2000-01-01 00:06:00    0.839491
2000-01-01 00:06:30    0.839491
2000-01-01 00:07:00   -0.173930
2000-01-01 00:07:30   -0.173930
2000-01-01 00:08:00   -2.810668
2000-01-01 00:08:30   -2.810668
2000-01-01 00:09:00   -0.150654
2000-01-01 00:09:30   -0.150654
Freq: 30S, dtype: float64

In [56]:
ts.resample('30s').bfill().head(20)

2000-01-01 00:00:00    0.004175
2000-01-01 00:00:30   -1.483492
2000-01-01 00:01:00   -1.483492
2000-01-01 00:01:30   -1.479796
2000-01-01 00:02:00   -1.479796
2000-01-01 00:02:30    0.134687
2000-01-01 00:03:00    0.134687
2000-01-01 00:03:30   -0.667723
2000-01-01 00:04:00   -0.667723
2000-01-01 00:04:30   -0.011556
2000-01-01 00:05:00   -0.011556
2000-01-01 00:05:30    0.839491
2000-01-01 00:06:00    0.839491
2000-01-01 00:06:30   -0.173930
2000-01-01 00:07:00   -0.173930
2000-01-01 00:07:30   -2.810668
2000-01-01 00:08:00   -2.810668
2000-01-01 00:08:30   -0.150654
2000-01-01 00:09:00   -0.150654
2000-01-01 00:09:30   -0.481044
Freq: 30S, dtype: float64
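When up-sampling, the new time slots have no data of their own, so the fill method determines what appears there. A compact comparison on three hand-picked values follows; `asfreq` and `interpolate` are not used in the cells above and are shown only as points of reference.

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range("2000-01-01", periods=3, freq="min"))

no_fill = ts.resample("30s").asfreq()        # new slots stay empty (NaN)
forward = ts.resample("30s").ffill()         # carry the previous value forward
backward = ts.resample("30s").bfill()        # pull the next value backward
linear = ts.resample("30s").interpolate()    # linear interpolation between neighbors

print(no_fill.tolist())
print(forward.tolist())
print(backward.tolist())
print(linear.tolist())
```

`ffill` never uses information from the future, which is why it is often the safer choice when the series feeds a forecasting model; `bfill` and interpolation both look ahead.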