# Introduction to Pandas

> Basic introductory material (10 Minutes to Pandas): http://pandas.pydata.org/pandas-docs/stable/10min.html

## Module Import

- Rather than spelling out `pandas` and `numpy` in full, import them under the conventional aliases `pd` and `np`

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Creating Basic Objects

### `Series`

- A pandas data type that plays a role similar to a vector or a list
- Each column of a `DataFrame` is a `Series`

In [2]:
s1 = pd.Series([1, 3, 5, np.nan, 6, 8]) # the displayed dtype is the common (upcast) type: NaN forces the integers to float64
s1

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

> `np.nan`: Not a Number. Mainly used to represent a missing value
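
The upcast can be checked directly. The following is a minimal sketch (the variable names are illustrative and not part of the original notebook) showing that a single `np.nan` turns an integer list into a floating-point `Series`, and that missing values are best detected with `isna()`, since `NaN` never compares equal to itself.

```python
import numpy as np
import pandas as pd

with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])
without_nan = pd.Series([1, 3, 5, 6, 8])

# A single NaN forces the whole Series to a floating-point dtype.
print(with_nan.dtype, without_nan.dtype)

# NaN is not equal to itself, so `==` cannot find it; use isna() instead.
nan_equal = bool(np.nan == np.nan)
missing_positions = with_nan[with_nan.isna()].index.tolist()
print(nan_equal, missing_positions)
```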

In [3]:
s1_elem = s1.iat[1] 
print(s1_elem, type(s1_elem))

3.0 <class 'numpy.float64'>


In [4]:
s2 = pd.Series([1, 3, 5, 6, 8])
s2

0    1
1    3
2    5
3    6
4    8
dtype: int64

In [5]:
s3 = pd.Series([1, 'aa', 2.0])
s3

0     1
1    aa
2     2
dtype: object

In [6]:
s3_elem = s3.iat[0] 
print(s3_elem, type(s3_elem))

1 <class 'int'>


### `DataFrame`

In [7]:
dates = pd.date_range('20130101', '20130106') # arguments are start, end, and optionally a frequency (= step); equivalent to pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

- `np.random.randn(6,4)`: creates a 6x4 `numpy.ndarray` of `numpy.float64` random numbers drawn from the standard normal distribution

``` python
array([[ 0.42822299, -1.28919681, -1.39084655, -1.51130465],
       [ 0.9886781 ,  3.31895341, -0.46242458,  1.30657345],
       [-2.10034591, -2.34556215, -0.18027205, -0.44136575],
       [-0.30086497,  1.59777955,  0.31756746, -0.42374136],
       [ 0.21047147,  1.18928173,  0.31065258,  0.24569729],
       [ 3.21395634, -1.01428065,  0.84260331,  1.19876768]])
```

- `list('ABCD')` : `['A', 'B', 'C', 'D']`

In [9]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.790203,-0.017552,0.529122,0.941829
2013-01-02,-0.914385,1.437252,0.571389,-1.135716
2013-01-03,0.063952,0.557349,0.803096,-0.007955
2013-01-04,1.631663,1.377744,-0.442022,0.162752
2013-01-05,0.414766,-0.962596,-0.757042,0.986497
2013-01-06,-0.413272,-1.421938,0.120484,0.316631


In [10]:
df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                         'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                         'D' : np.array([3] * 4,dtype='int32'),
                         'E' : pd.Categorical(["test","train","test","train"]),
                         'F' : 'foo' })

In [11]:
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


- `.dtypes`: returns a `Series` holding the data type of each column

In [12]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [13]:
df2.__dict__ # mapping of the instance's attributes

{'_data': BlockManager
 Items: Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
 Axis 1: Int64Index([0, 1, 2, 3], dtype='int64')
 FloatBlock: slice(0, 1, 1), 1 x 4, dtype: float64
 FloatBlock: slice(2, 3, 1), 1 x 4, dtype: float32
 IntBlock: slice(3, 4, 1), 1 x 4, dtype: int32
 DatetimeBlock: slice(1, 2, 1), 1 x 4, dtype: datetime64[ns]
 ObjectBlock: slice(5, 6, 1), 1 x 4, dtype: object
 CategoricalBlock: slice(4, 5, 1), 1 x 4, dtype: category,
 '_iloc': <pandas.core.indexing._iLocIndexer at 0x7f2977080ba8>,
 '_item_cache': {},
 'is_copy': None}

In [14]:
# Help for a specific function or property
help(pd.DataFrame.iloc)

Help on property:

    Purely integer-location based indexing for selection by position.
    
    ``.iloc[]`` is primarily integer position based (from ``0`` to
    ``length-1`` of the axis), but may also be used with a boolean
    array.
    
    Allowed inputs are:
    
    - An integer, e.g. ``5``.
    - A list or array of integers, e.g. ``[4, 3, 0]``.
    - A slice object with ints, e.g. ``1:7``.
    - A boolean array.
    - A ``callable`` function with one argument (the calling Series, DataFrame
      or Panel) and that returns valid output for indexing (one of the above)
    
    ``.iloc`` will raise ``IndexError`` if a requested indexer is
    out-of-bounds, except *slice* indexers which allow out-of-bounds
    indexing (this conforms with python/numpy *slice* semantics).
    
    See more at :ref:`Selection by Position <indexing.integer>`



### Viewing / Simple Transformations

#### `head()`, `tail()`

In [15]:
df.head(2)

Unnamed: 0,A,B,C,D
2013-01-01,-0.790203,-0.017552,0.529122,0.941829
2013-01-02,-0.914385,1.437252,0.571389,-1.135716


In [16]:
df.tail(2)

Unnamed: 0,A,B,C,D
2013-01-05,0.414766,-0.962596,-0.757042,0.986497
2013-01-06,-0.413272,-1.421938,0.120484,0.316631


#### `index`, `columns`, `values`

In [17]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [18]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [19]:
df.values

array([[-0.79020336, -0.01755248,  0.5291217 ,  0.9418294 ],
       [-0.91438514,  1.43725219,  0.57138911, -1.13571558],
       [ 0.06395153,  0.55734924,  0.80309564, -0.00795506],
       [ 1.63166298,  1.3777442 , -0.44202239,  0.16275179],
       [ 0.41476574, -0.96259604, -0.75704233,  0.98649686],
       [-0.41327178, -1.42193774,  0.12048386,  0.31663065]])

#### `describe()`

- Provides per-column summary (descriptive) statistics

In [20]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.001247,0.16171,0.137504,0.210673
std,0.945489,1.188816,0.619837,0.775961
min,-0.914385,-1.421938,-0.757042,-1.135716
25%,-0.69597,-0.726335,-0.301396,0.034722
50%,-0.17466,0.269898,0.324803,0.239691
75%,0.327062,1.172645,0.560822,0.78553
max,1.631663,1.437252,0.803096,0.986497


#### `T`: swap rows and columns

- The transpose operation

In [21]:
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
A,-0.790203,-0.914385,0.063952,1.631663,0.414766,-0.413272
B,-0.017552,1.437252,0.557349,1.377744,-0.962596,-1.421938
C,0.529122,0.571389,0.803096,-0.442022,-0.757042,0.120484
D,0.941829,-1.135716,-0.007955,0.162752,0.986497,0.316631


####  `sort_index()`

- Sorts by row or column index labels (names)

In [22]:
df.sort_index()

Unnamed: 0,A,B,C,D
2013-01-01,-0.790203,-0.017552,0.529122,0.941829
2013-01-02,-0.914385,1.437252,0.571389,-1.135716
2013-01-03,0.063952,0.557349,0.803096,-0.007955
2013-01-04,1.631663,1.377744,-0.442022,0.162752
2013-01-05,0.414766,-0.962596,-0.757042,0.986497
2013-01-06,-0.413272,-1.421938,0.120484,0.316631


In [23]:
df.sort_index(axis=0, ascending=False) # descending sort on the index (row) labels

Unnamed: 0,A,B,C,D
2013-01-06,-0.413272,-1.421938,0.120484,0.316631
2013-01-05,0.414766,-0.962596,-0.757042,0.986497
2013-01-04,1.631663,1.377744,-0.442022,0.162752
2013-01-03,0.063952,0.557349,0.803096,-0.007955
2013-01-02,-0.914385,1.437252,0.571389,-1.135716
2013-01-01,-0.790203,-0.017552,0.529122,0.941829


In [24]:
df.sort_index(axis=1, ascending=False) # descending sort on the column labels

Unnamed: 0,D,C,B,A
2013-01-01,0.941829,0.529122,-0.017552,-0.790203
2013-01-02,-1.135716,0.571389,1.437252,-0.914385
2013-01-03,-0.007955,0.803096,0.557349,0.063952
2013-01-04,0.162752,-0.442022,1.377744,1.631663
2013-01-05,0.986497,-0.757042,-0.962596,0.414766
2013-01-06,0.316631,0.120484,-1.421938,-0.413272


####  `sort_values()`

- Sorts by the values of a column

In [25]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-06,-0.413272,-1.421938,0.120484,0.316631
2013-01-05,0.414766,-0.962596,-0.757042,0.986497
2013-01-01,-0.790203,-0.017552,0.529122,0.941829
2013-01-03,0.063952,0.557349,0.803096,-0.007955
2013-01-04,1.631663,1.377744,-0.442022,0.162752
2013-01-02,-0.914385,1.437252,0.571389,-1.135716


## Selection (Get / Set)

> - Access via `[]` borrows the style of `R`, which is well known for its `DataFrame` support
> - The optimized accessors are `.at`, `.iat`, `.loc`, and `.iloc` (`.ix` is deprecated and was removed in pandas 1.0)

### Getting Rows, Columns, and Specific Regions

#### Selecting a Single Column

- `df[column_label]` or `df.column_label`

In [26]:
df['A']

2013-01-01   -0.790203
2013-01-02   -0.914385
2013-01-03    0.063952
2013-01-04    1.631663
2013-01-05    0.414766
2013-01-06   -0.413272
Freq: D, Name: A, dtype: float64

In [27]:
df.A

2013-01-01   -0.790203
2013-01-02   -0.914385
2013-01-03    0.063952
2013-01-04    1.631663
2013-01-05    0.414766
2013-01-06   -0.413272
Freq: D, Name: A, dtype: float64

#### Slicing by Rows

In [28]:
df[0:3] # selects rows 0, 1, and 2

Unnamed: 0,A,B,C,D
2013-01-01,-0.790203,-0.017552,0.529122,0.941829
2013-01-02,-0.914385,1.437252,0.571389,-1.135716
2013-01-03,0.063952,0.557349,0.803096,-0.007955


In [29]:
df['20130102':'20130104'] # note that, unlike positional slicing, a label slice includes the end value

Unnamed: 0,A,B,C,D
2013-01-02,-0.914385,1.437252,0.571389,-1.135716
2013-01-03,0.063952,0.557349,0.803096,-0.007955
2013-01-04,1.631663,1.377744,-0.442022,0.162752
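
The difference between the two slicing styles is easiest to see side by side. The sketch below uses a small stand-in frame (`frame` and `idx` are illustrative names, and the explicit `.iloc` / `.loc` accessors are used instead of bare `[]`) with the same start and end on both sides.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20130101', '20130106')
frame = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

# Positional slice: the end position 3 is excluded, so positions 1 and 2 are returned.
by_position = frame.iloc[1:3]

# Label slice: the end label idx[3] (2013-01-04) is included.
by_label = frame.loc[idx[1]:idx[3]]

print(len(by_position), len(by_label))
```

With the same endpoints, the positional slice returns two rows and the label slice returns three.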


### Selection by Index / Column Label

#### Selecting One Row by Index Label

In [30]:
df.loc[dates[0]]

A   -0.790203
B   -0.017552
C    0.529122
D    0.941829
Name: 2013-01-01 00:00:00, dtype: float64

#### Selecting Several Column Labels

> Use `:` to select all rows

- A `list` of column labels

In [31]:
df.loc[:,['A', 'B']]

Unnamed: 0,A,B
2013-01-01,-0.790203,-0.017552
2013-01-02,-0.914385,1.437252
2013-01-03,0.063952,0.557349
2013-01-04,1.631663,1.377744
2013-01-05,0.414766,-0.962596
2013-01-06,-0.413272,-1.421938


- A range of column labels specified with `:`

In [32]:
df.loc[:,'A':'D']

Unnamed: 0,A,B,C,D
2013-01-01,-0.790203,-0.017552,0.529122,0.941829
2013-01-02,-0.914385,1.437252,0.571389,-1.135716
2013-01-03,0.063952,0.557349,0.803096,-0.007955
2013-01-04,1.631663,1.377744,-0.442022,0.162752
2013-01-05,0.414766,-0.962596,-0.757042,0.986497
2013-01-06,-0.413272,-1.421938,0.120484,0.316631


#### Specifying Endpoints for Both Rows and Columns: Slicing

In [33]:
df.loc['20130102':'20130104',['A','B']]

Unnamed: 0,A,B
2013-01-02,-0.914385,1.437252
2013-01-03,0.063952,0.557349
2013-01-04,1.631663,1.377744


#### Selecting a Single Label on One Axis (Row or Column)

- The returned object is a `Series`, so the result has one dimension fewer

In [34]:
df.loc['20130102',['A','B']]

A   -0.914385
B    1.437252
Name: 2013-01-02 00:00:00, dtype: float64

In [35]:
df.loc['20130102':'20130103','A']

2013-01-02   -0.914385
2013-01-03    0.063952
Freq: D, Name: A, dtype: float64

#### Selecting a Single Label on Both Axes

- Returns a scalar value

In [36]:
df.loc[dates[0],'A']

-0.79020336341641018

#### `.at[row, col]` 

- Same as above, but with faster access to a single value

In [37]:
df.at[dates[0],'A']

-0.79020336341641018

### Selection by Position

- Specify 0-based integer positions instead of labels
- More details: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer


#### `.iloc[n]`

- All column values of the row at the given integer position

In [38]:
df.iloc[3]

A    1.631663
B    1.377744
C   -0.442022
D    0.162752
Name: 2013-01-04 00:00:00, dtype: float64

#### `.iloc[r1:r2, c1:c2]`

- With integer slices like this, slicing works much as it does in numpy
- Note that the value at the end position of the row and column ranges is excluded

In [39]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,1.631663,1.377744
2013-01-05,0.414766,-0.962596


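#### `.iloc[[r0, r1, ...], [c0, c1, ...]]`

- Lists of positions can also be used for selection; unlike `numpy` fancy indexing, the row list and the column list are combined as a cross product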

In [40]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,-0.914385,0.571389
2013-01-03,0.063952,0.803096
2013-01-05,0.414766,-0.757042


In [41]:
df.iloc[range(4,-1,-1),[2,0]]

Unnamed: 0,C,A
2013-01-05,-0.757042,0.414766
2013-01-04,-0.442022,1.631663
2013-01-03,0.803096,0.063952
2013-01-02,0.571389,-0.914385
2013-01-01,0.529122,-0.790203


In [42]:
df.iloc[[4,2,1],[2,0]] # what happens if the order is changed like this?

Unnamed: 0,C,A
2013-01-05,-0.757042,0.414766
2013-01-03,0.803096,0.063952
2013-01-02,0.571389,-0.914385


#### `.iloc[r1:r2, :]`, `.iloc[:, c1:c2]`

- Slicing by row or by column
- As with `range()`, the end of a position range is not included

In [43]:
df.iloc[1:3, :] # or df.iloc[1:3]

Unnamed: 0,A,B,C,D
2013-01-02,-0.914385,1.437252,0.571389,-1.135716
2013-01-03,0.063952,0.557349,0.803096,-0.007955


In [44]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,-0.017552,0.529122
2013-01-02,1.437252,0.571389
2013-01-03,0.557349,0.803096
2013-01-04,1.377744,-0.442022
2013-01-05,-0.962596,-0.757042
2013-01-06,-1.421938,0.120484


#### `.iloc[r,c]`

- The scalar value at that position

In [45]:
df.iloc[1,1]

1.4372521869046888

In [46]:
df.iloc[1:2,1:2] # also a single element, but both indexers are slices (treated as ranges), so the result is a DataFrame

Unnamed: 0,B
2013-01-02,1.437252


#### `.iat[r,c]`

- Same as above (fast access)

In [47]:
df.iat[1,1]

1.4372521869046888

### Boolean Indexing

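- Indexing with Boolean values, i.e. filtering to keep only the values that satisfy a condition

#### Row-wise Boolean Indexing

- As below, a comparison on the selected columns yields a Boolean result for every row: a `DataFrame` for several columns, a `Series` for a single column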

In [48]:
df.loc[:,['A','B']] > 0

Unnamed: 0,A,B
2013-01-01,False,False
2013-01-02,False,True
2013-01-03,True,True
2013-01-04,True,True
2013-01-05,True,False
2013-01-06,False,False


In [49]:
df.A > 0

2013-01-01    False
2013-01-02    False
2013-01-03     True
2013-01-04     True
2013-01-05     True
2013-01-06    False
Freq: D, Name: A, dtype: bool

- Using the Boolean `Series` above, all rows that satisfy the condition can be retrieved

In [50]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-03,0.063952,0.557349,0.803096,-0.007955
2013-01-04,1.631663,1.377744,-0.442022,0.162752
2013-01-05,0.414766,-0.962596,-0.757042,0.986497
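
Conditions can also be combined. The following is a small sketch on made-up data (`frame` is an illustrative name) using the element-wise operators `&` (and), `|` (or), and `~` (not).

```python
import pandas as pd

frame = pd.DataFrame({'A': [-1.0, 0.5, 2.0, -0.3],
                      'B': [0.2, -0.7, 1.5, 0.9]})

# Each comparison needs its own parentheses because & and | bind tighter than > or <.
both_positive = frame[(frame['A'] > 0) & (frame['B'] > 0)]
either_positive = frame[(frame['A'] > 0) | (frame['B'] > 0)]
a_not_positive = frame[~(frame['A'] > 0)]

print(both_positive.index.tolist())
print(either_positive.index.tolist())
print(a_not_positive.index.tolist())
```

Python's `and` / `or` keywords do not work element-wise on a `Series`, which is why the bitwise operators are used here.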


#### Boolean Indexing over All Values

In [51]:
df > 0

Unnamed: 0,A,B,C,D
2013-01-01,False,False,True,True
2013-01-02,False,True,True,False
2013-01-03,True,True,True,False
2013-01-04,True,True,False,True
2013-01-05,True,False,False,True
2013-01-06,False,False,True,True


In [52]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,,0.529122,0.941829
2013-01-02,,1.437252,0.571389,
2013-01-03,0.063952,0.557349,0.803096,
2013-01-04,1.631663,1.377744,,0.162752
2013-01-05,0.414766,,,0.986497
2013-01-06,,,0.120484,0.316631


#### `.isin(value_list)`

- Boolean result indicating whether each element is one of the values in `value_list`

In [53]:
df2 = df.copy()

df2['E'] = ['one', 'one','two','three','four','three']
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.790203,-0.017552,0.529122,0.941829,one
2013-01-02,-0.914385,1.437252,0.571389,-1.135716,one
2013-01-03,0.063952,0.557349,0.803096,-0.007955,two
2013-01-04,1.631663,1.377744,-0.442022,0.162752,three
2013-01-05,0.414766,-0.962596,-0.757042,0.986497,four
2013-01-06,-0.413272,-1.421938,0.120484,0.316631,three


In [54]:
df2['E'].isin(['two','four'])

2013-01-01    False
2013-01-02    False
2013-01-03     True
2013-01-04    False
2013-01-05     True
2013-01-06    False
Freq: D, Name: E, dtype: bool

- The result above can be used for filtering

In [55]:
df2[df2['E'].isin(['two','four'])]

Unnamed: 0,A,B,C,D,E
2013-01-03,0.063952,0.557349,0.803096,-0.007955,two
2013-01-05,0.414766,-0.962596,-0.757042,0.986497,four


### Setting Values

In [56]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', '20130107'))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [57]:
df3 = df.copy()
df3['F'] = s1 # set the s1 Series as a column labelled 'F'
df3

Unnamed: 0,A,B,C,D,F
2013-01-01,-0.790203,-0.017552,0.529122,0.941829,
2013-01-02,-0.914385,1.437252,0.571389,-1.135716,1.0
2013-01-03,0.063952,0.557349,0.803096,-0.007955,2.0
2013-01-04,1.631663,1.377744,-0.442022,0.162752,3.0
2013-01-05,0.414766,-0.962596,-0.757042,0.986497,4.0
2013-01-06,-0.413272,-1.421938,0.120484,0.316631,5.0
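
The `NaN` in the first row and the missing sixth value come from index alignment: a `Series` assigned as a column is matched by label, not by position. A minimal sketch with made-up data (`frame` and `shifted` are illustrative names):

```python
import pandas as pd

idx = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [10, 20, 30]}, index=idx)

# The Series starts one day later than the frame.
shifted = pd.Series([1, 2, 3], index=pd.date_range('20130102', periods=3))

# Assignment aligns on index labels: 2013-01-01 has no match and becomes NaN,
# and the value for 2013-01-04 is dropped because the frame has no such row.
frame['F'] = shifted
print(frame)
```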


In [58]:
df3.at[dates[0],'A'] = 0 # set a value by label (2013-01-01, A)

In [59]:
df3.iat[0,1] = 0 # set a value by position (2013-01-01, B)

In [60]:
df3.loc[:,'D'] = np.array([5] * len(df)) # create a numpy array and assign it as the column values

In [61]:
df3

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.529122,5,
2013-01-02,-0.914385,1.437252,0.571389,5,1.0
2013-01-03,0.063952,0.557349,0.803096,5,2.0
2013-01-04,1.631663,1.377744,-0.442022,5,3.0
2013-01-05,0.414766,-0.962596,-0.757042,5,4.0
2013-01-06,-0.413272,-1.421938,0.120484,5,5.0


- Example: negate every value greater than 0

In [62]:
df3_inv = df3.copy()
df3_inv[df3_inv > 0] = -df3_inv
df3_inv

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.529122,-5,
2013-01-02,-0.914385,-1.437252,-0.571389,-5,-1.0
2013-01-03,-0.063952,-0.557349,-0.803096,-5,-2.0
2013-01-04,-1.631663,-1.377744,-0.442022,-5,-3.0
2013-01-05,-0.414766,-0.962596,-0.757042,-5,-4.0
2013-01-06,-0.413272,-1.421938,-0.120484,-5,-5.0
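
The same replacement can be written without in-place assignment. The sketch below is an alternative, not the notebook's own approach: `DataFrame.mask(cond, other)` returns a new object in which values where `cond` is true are taken from `other`. The data is made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'A': [0.0, -1.0, 2.0], 'D': [5, 5, 5]})

# Where frame > 0 holds, take the value from -frame; elsewhere keep the original.
negated = frame.mask(frame > 0, -frame)

print(negated)
print(frame)  # the original frame is left unchanged
```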


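### Missing Data

> `.reindex()`: creates a new `DataFrame` containing only the rows and columns named in the `index` / `columns` arguments; labels that did not exist before are filled with `NaN`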

In [63]:
df4 = df3.reindex(index=dates[0:4], columns=list(df3.columns) + ['E'])
df4

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,0.529122,5,,
2013-01-02,-0.914385,1.437252,0.571389,5,1.0,
2013-01-03,0.063952,0.557349,0.803096,5,2.0,
2013-01-04,1.631663,1.377744,-0.442022,5,3.0,


- The newly added `E` column is entirely `NaN` (meaning missing). Values are then set for only part of it

In [64]:
df4.loc[dates[0]:dates[1], 'E'] = 1
df4

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,0.529122,5,,1.0
2013-01-02,-0.914385,1.437252,0.571389,5,1.0,1.0
2013-01-03,0.063952,0.557349,0.803096,5,2.0,
2013-01-04,1.631663,1.377744,-0.442022,5,3.0,


#### `.dropna()`

- Removes rows that contain missing values

In [65]:
df4.dropna(how='any')

Unnamed: 0,A,B,C,D,F,E
2013-01-02,-0.914385,1.437252,0.571389,5,1.0,1.0
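
`dropna()` takes a few options that control which rows count as incomplete. A small sketch on made-up data (`frame` is an illustrative name):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                      'E': [1.0, np.nan, np.nan, 4.0]})

drop_any = frame.dropna(how='any')        # drop rows with at least one NaN
drop_all = frame.dropna(how='all')        # drop rows where every value is NaN
drop_subset = frame.dropna(subset=['A'])  # only look at column A

print(drop_any.index.tolist())
print(drop_all.index.tolist())
print(drop_subset.index.tolist())
```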


#### `.fillna(value=val)`

- Replaces missing values with the value of the `value` argument

In [66]:
df4.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,0.529122,5,5.0,1.0
2013-01-02,-0.914385,1.437252,0.571389,5,1.0,1.0
2013-01-03,0.063952,0.557349,0.803096,5,2.0,5.0
2013-01-04,1.631663,1.377744,-0.442022,5,3.0,5.0


#### `pandas.isnull(frame)`

- Returns a `DataFrame` indicating whether each element is `numpy.nan` (`NaN`)

In [67]:
pd.isnull(df4)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True


## Operations

### Statistical Operations

- Computes various descriptive statistics

> Missing values are excluded by default
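
The handling of missing values can be made explicit with the `skipna` parameter. A minimal sketch on made-up data (`values` is an illustrative name):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 3.0])

mean_default = values.mean()             # NaN ignored: (1 + 2 + 3) / 3
mean_strict = values.mean(skipna=False)  # NaN propagates into the result
count_valid = values.count()             # counts non-missing values only

print(mean_default, mean_strict, count_valid)
```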

In [68]:
df.mean() # mean of each column

A   -0.001247
B    0.161710
C    0.137504
D    0.210673
dtype: float64

In [69]:
df.mean(axis=1) # mean of each row

2013-01-01    0.165799
2013-01-02   -0.010365
2013-01-03    0.354110
2013-01-04    0.682534
2013-01-05   -0.079594
2013-01-06   -0.349524
Freq: D, dtype: float64

### Alignment

- The `.shift()` operation fills the vacated positions with `NaN`

In [70]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

- `.sub()` is subtraction: with `axis='index'`, a single `Series` is aligned on the row index and applied to every column

In [71]:
df3.sub(s, axis='index') 

Unnamed: 0,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-0.936048,-0.442651,-0.196904,4.0,1.0
2013-01-04,-1.368337,-1.622256,-3.442022,2.0,0.0
2013-01-05,-4.585234,-5.962596,-5.757042,0.0,-1.0
2013-01-06,,,,,


### `.apply(func)`

- Applies the given function `func` to each column

In [72]:
df3

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.529122,5,
2013-01-02,-0.914385,1.437252,0.571389,5,1.0
2013-01-03,0.063952,0.557349,0.803096,5,2.0
2013-01-04,1.631663,1.377744,-0.442022,5,3.0
2013-01-05,0.414766,-0.962596,-0.757042,5,4.0
2013-01-06,-0.413272,-1.421938,0.120484,5,5.0


- `numpy.cumsum()`: computes the cumulative sum of a `Series`. Below, it is applied to each column

In [73]:
df3.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.529122,5,
2013-01-02,-0.914385,1.437252,1.100511,10,1.0
2013-01-03,-0.850434,1.994601,1.903606,15,3.0
2013-01-04,0.781229,3.372346,1.461584,20,6.0
2013-01-05,1.195995,2.40975,0.704542,25,10.0
2013-01-06,0.782723,0.987812,0.825026,30,15.0


- Since `numpy.mean()` returns a scalar for each column, the result is a `Series`

In [74]:
df3.apply(np.mean) 

A    0.130454
B    0.164635
C    0.137504
D    5.000000
F    3.000000
dtype: float64

- `lambda` functions can also be used

In [75]:
df3.apply(lambda x: x.max() - x.min())

A    2.546048
B    2.859190
C    1.560138
D    0.000000
F    4.000000
dtype: float64
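
By default the function receives one column at a time; passing `axis=1` hands it one row at a time instead. A small sketch on made-up data (`frame` is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 4], 'B': [3, 10], 'C': [2, 7]})

# Default (axis=0): the lambda sees each column as a Series.
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the lambda sees each row as a Series.
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.to_dict())
print(row_range.tolist())
```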

### String Methods

- `Series` objects provide a range of vectorized string operations through the `str` attribute
- Many of these functions accept regular expressions
- Reference: [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods)

In [76]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
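
As an illustration of the regular-expression support, here is a short sketch using `str.contains` and `str.replace` on a list similar to the one above (`words` is an illustrative name).

```python
import numpy as np
import pandas as pd

words = pd.Series(['A', 'Aaba', 'Baca', np.nan, 'CABA', 'dog'])

# Does the string start with 'a' or 'A'? na=False reports missing values as False.
starts_with_a = words.str.contains(r'^a', case=False, na=False)

# Replace every run of lowercase 'a' characters with '_' using a regular expression.
replaced = words.str.replace(r'a+', '_', regex=True)

print(starts_with_a.tolist())
print(replaced.tolist())
```

Passing `regex=True` explicitly avoids depending on the default of `str.replace`, which has differed between pandas versions.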

## Merge

### `pandas.concat()`

In [77]:
df = pd.DataFrame(np.random.randn(10, 4))
df

Unnamed: 0,0,1,2,3
0,-0.417482,1.710685,-0.792482,-0.526517
1,-0.247316,0.704435,0.588487,-0.547979
2,0.392963,0.111145,-0.152219,0.828455
3,1.598796,-0.24352,1.283022,2.944093
4,2.292833,-0.195461,0.553132,-0.48471
5,-0.078717,-1.712367,0.474291,-0.27064
6,1.486077,0.354844,-0.145241,-1.123682
7,-1.198684,0.372384,1.764393,0.6452
8,-0.380074,0.473909,0.888729,1.729658
9,1.697569,0.559378,0.444881,1.716868


In [78]:
pieces = [df[:3], df[3:7], df[7:]]
pieces

[          0         1         2         3
 0 -0.417482  1.710685 -0.792482 -0.526517
 1 -0.247316  0.704435  0.588487 -0.547979
 2  0.392963  0.111145 -0.152219  0.828455,
           0         1         2         3
 3  1.598796 -0.243520  1.283022  2.944093
 4  2.292833 -0.195461  0.553132 -0.484710
 5 -0.078717 -1.712367  0.474291 -0.270640
 6  1.486077  0.354844 -0.145241 -1.123682,
           0         1         2         3
 7 -1.198684  0.372384  1.764393  0.645200
 8 -0.380074  0.473909  0.888729  1.729658
 9  1.697569  0.559378  0.444881  1.716868]

In [79]:
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,-0.417482,1.710685,-0.792482,-0.526517
1,-0.247316,0.704435,0.588487,-0.547979
2,0.392963,0.111145,-0.152219,0.828455
3,1.598796,-0.24352,1.283022,2.944093
4,2.292833,-0.195461,0.553132,-0.48471
5,-0.078717,-1.712367,0.474291,-0.27064
6,1.486077,0.354844,-0.145241,-1.123682
7,-1.198684,0.372384,1.764393,0.6452
8,-0.380074,0.473909,0.888729,1.729658
9,1.697569,0.559378,0.444881,1.716868


### `pandas.merge()`

- Plays the role of a `join` in a DBMS

In [80]:
left = pd.DataFrame({'key': ['foo', 'f'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'f'], 'rval': [4, 5]})

In [81]:
left

Unnamed: 0,key,lval
0,foo,1
1,f,2


In [82]:
right

Unnamed: 0,key,rval
0,foo,4
1,f,5


In [83]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,f,2,5
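
The example above is an inner join in which every key matches. The following sketch uses made-up frames whose keys only partly overlap to show how the `how` parameter selects the join type.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')                  # keys present in both frames
left_join = pd.merge(left, right, on='key', how='left')  # every key from `left`
outer = pd.merge(left, right, on='key', how='outer')     # union of the keys

print(inner)
print(left_join)
print(outer)
```

Rows without a partner on the other side get `NaN` in the columns that come from that side.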


### `.append()`

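- Adds rows to a `DataFrame` (`DataFrame.append()` was deprecated in pandas 1.4 and removed in 2.0, where `pandas.concat()` is used instead)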

In [84]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,0.826269,-0.435197,0.50354,0.512605
1,0.797082,0.646145,-1.773805,-0.497814
2,1.45012,-0.174771,-1.257036,0.807225
3,1.536859,0.255702,-0.320527,0.681855
4,0.844097,1.93506,-0.902201,0.894311
5,-0.409292,1.621974,-0.95518,0.407255
6,0.952465,0.928864,-1.214268,-0.155378
7,0.247295,1.191205,0.922235,0.458111


In [85]:
s = df.iloc[3:5]
df.append(s, ignore_index=True)

Unnamed: 0,A,B,C,D
0,0.826269,-0.435197,0.50354,0.512605
1,0.797082,0.646145,-1.773805,-0.497814
2,1.45012,-0.174771,-1.257036,0.807225
3,1.536859,0.255702,-0.320527,0.681855
4,0.844097,1.93506,-0.902201,0.894311
5,-0.409292,1.621974,-0.95518,0.407255
6,0.952465,0.928864,-1.214268,-0.155378
7,0.247295,1.191205,0.922235,0.458111
8,1.536859,0.255702,-0.320527,0.681855
9,0.844097,1.93506,-0.902201,0.894311
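
On pandas 2.0 and later, where `DataFrame.append()` no longer exists, the same result can be obtained with `pandas.concat()`. A minimal sketch with deterministic stand-in data (`frame`, `extra`, and `combined` are illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(32).reshape(8, 4), columns=['A', 'B', 'C', 'D'])
extra = frame.iloc[3:5]

# Rows 3 and 4 are added at the end and the index is renumbered 0..9.
combined = pd.concat([frame, extra], ignore_index=True)

print(combined.tail(3))
```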


## Grouping

- **Group By**
    - Splitting
    - Applying
    - Combining
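
The three steps can be written out by hand to see what `groupby` does. This is only a sketch on made-up data (`frame`, `partial`, `manual`, and `builtin` are illustrative names), not how pandas implements grouping internally.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

# Splitting: iterating a GroupBy yields (group key, sub-DataFrame) pairs.
# Applying: each sub-DataFrame is reduced to a single number.
partial = {key: group['C'].sum() for key, group in frame.groupby('A')}

# Combining: the per-group results are assembled into one Series.
manual = pd.Series(partial, name='C')

# groupby(...).sum() performs all three steps in one call.
builtin = frame.groupby('A')['C'].sum()

print(manual.to_dict())
print(builtin.to_dict())
```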

In [86]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                    'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                    'C' : np.random.randn(8),
                    'D' : np.random.randn(8)})
df

Unnamed: 0,A,B,C,D
0,foo,one,1.256523,-1.247212
1,bar,one,-0.589568,-0.056018
2,foo,two,-2.704794,0.635845
3,bar,three,1.426632,-0.527465
4,foo,two,-1.074233,0.5751
5,bar,two,-0.026146,-3.237808
6,foo,one,0.403499,1.450569
7,foo,three,-0.142425,-1.097315


In [87]:
grouped_1 = df.groupby('A').sum()
grouped_1

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,0.810918,-3.821291
foo,-2.261429,0.316987


In [88]:
grouped_1 = df.groupby('A').sum(numeric_only=False) # the default of numeric_only differs between aggregation methods
grouped_1

Unnamed: 0_level_0,B,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,onethreetwo,0.810918,-3.821291
foo,onetwotwoonethree,-2.261429,0.316987


In [89]:
grouped_1 = df.groupby('A').max()
grouped_1

Unnamed: 0_level_0,B,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,two,1.426632,-0.056018
foo,two,1.256523,1.450569


In [90]:
grouped_2d = df.groupby(['A','B']).sum()
grouped_2d

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.589568,-0.056018
bar,three,1.426632,-0.527465
bar,two,-0.026146,-3.237808
foo,one,1.660022,0.203357
foo,three,-0.142425,-1.097315
foo,two,-3.779026,1.210945


In [91]:
grouped_2d.index

MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=['A', 'B'])

In [92]:
pd.DataFrame(grouped_2d.index)

Unnamed: 0,0
0,"(bar, one)"
1,"(bar, three)"
2,"(bar, two)"
3,"(foo, one)"
4,"(foo, three)"
5,"(foo, two)"


In [93]:
for key, value in grouped_2d.iterrows():
    print('key:%s, value:%s' % (key, value))
type(value)

key:('bar', 'one'), value:C   -0.589568
D   -0.056018
Name: (bar, one), dtype: float64
key:('bar', 'three'), value:C    1.426632
D   -0.527465
Name: (bar, three), dtype: float64
key:('bar', 'two'), value:C   -0.026146
D   -3.237808
Name: (bar, two), dtype: float64
key:('foo', 'one'), value:C    1.660022
D    0.203357
Name: (foo, one), dtype: float64
key:('foo', 'three'), value:C   -0.142425
D   -1.097315
Name: (foo, three), dtype: float64
key:('foo', 'two'), value:C   -3.779026
D    1.210945
Name: (foo, two), dtype: float64


pandas.core.series.Series

## Reshaping

### Stack

In [94]:
list(zip(range(0,5), list("abcde")))

[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]

In [95]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
               ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))

In [96]:
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [97]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [98]:
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]

In [99]:
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-2.14685,-2.230374
bar,two,-0.124924,-1.670766
baz,one,-0.63983,1.2623
baz,two,1.509322,-0.047733


In [100]:
stacked = df2.stack()
stacked

first  second   
bar    one     A   -2.146850
               B   -2.230374
       two     A   -0.124924
               B   -1.670766
baz    one     A   -0.639830
               B    1.262300
       two     A    1.509322
               B   -0.047733
dtype: float64

In [101]:
stacked.index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two'], ['A', 'B']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second', None])

In [102]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-2.14685,-2.230374
bar,two,-0.124924,-1.670766
baz,one,-0.63983,1.2623
baz,two,1.509322,-0.047733


In [103]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,-2.14685,-0.124924
bar,B,-2.230374,-1.670766
baz,A,-0.63983,1.509322
baz,B,1.2623,-0.047733


In [104]:
stacked.unstack('second')

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,-2.14685,-0.124924
bar,B,-2.230374,-1.670766
baz,A,-0.63983,1.509322
baz,B,1.2623,-0.047733


In [105]:
stacked.unstack(0)

Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-2.14685,-0.63983
one,B,-2.230374,1.2623
two,A,-0.124924,1.509322
two,B,-1.670766,-0.047733


In [106]:
stacked.unstack([0,1])

first,bar,bar,baz,baz
second,one,two,one,two
A,-2.14685,-0.124924,-0.63983,1.509322
B,-2.230374,-1.670766,1.2623,-0.047733


In [107]:
stacked.unstack(None)

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-2.14685,-2.230374
bar,two,-0.124924,-1.670766
baz,one,-0.63983,1.2623
baz,two,1.509322,-0.047733


### Pivot Tables

In [108]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                       'B' : ['A', 'B', 'C'] * 4,
                       'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D' : np.random.randn(12),
                       'E' : np.random.randn(12)})
df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-0.24403,1.217429
1,one,B,foo,0.112285,-0.452343
2,two,C,foo,0.910474,-0.239796
3,three,A,bar,0.937343,0.143384
4,one,B,bar,-0.025901,1.055161
5,one,C,bar,0.356746,0.416997
6,two,A,foo,0.449616,-0.235839
7,three,B,foo,-0.759305,0.451718
8,one,C,foo,0.888418,1.157877
9,one,A,bar,-0.123872,-0.784133
10,two,B,bar,0.374205,0.143702
11,three,C,bar,-0.547283,-0.938091


In [109]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-0.123872,-0.24403
one,B,-0.025901,0.112285
one,C,0.356746,0.888418
three,A,0.937343,
three,B,,-0.759305
three,C,-0.547283,
two,A,,0.449616
two,B,0.374205,
two,C,,0.910474


In [110]:
pivoted = pd.pivot_table(df, values=['D','E'], index=['A', 'B'], columns=['C'])
pivoted

Unnamed: 0_level_0,Unnamed: 1_level_0,D,D,E,E
Unnamed: 0_level_1,C,bar,foo,bar,foo
A,B,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
one,A,-0.123872,-0.24403,-0.784133,1.217429
one,B,-0.025901,0.112285,1.055161,-0.452343
one,C,0.356746,0.888418,0.416997,1.157877
three,A,0.937343,,0.143384,
three,B,,-0.759305,,0.451718
three,C,-0.547283,,-0.938091,
two,A,,0.449616,,-0.235839
two,B,0.374205,,0.143702,
two,C,,0.910474,,-0.239796
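
In the tables above every (`A`, `B`, `C`) combination occurs at most once, so the aggregation step is not visible. The sketch below uses made-up data with a repeated combination to show that `pivot_table` averages by default, and that `aggfunc` and `fill_value` change the reduction and the placeholder for empty cells.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['one', 'one', 'one', 'two'],
                      'C': ['foo', 'foo', 'bar', 'foo'],
                      'D': [1.0, 3.0, 5.0, 7.0]})

# Default aggregation: the mean of the values that share an (A, C) pair.
mean_table = pd.pivot_table(frame, values='D', index='A', columns='C')

# Sum instead of mean, and 0 instead of NaN for combinations with no data.
sum_table = pd.pivot_table(frame, values='D', index='A', columns='C',
                           aggfunc='sum', fill_value=0)

print(mean_table)
print(sum_table)
```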


In [111]:
for item in pivoted.items():
    print(item)

(('D', 'bar'), A      B
one    A   -0.123872
       B   -0.025901
       C    0.356746
three  A    0.937343
       B         NaN
       C   -0.547283
two    A         NaN
       B    0.374205
       C         NaN
Name: (D, bar), dtype: float64)
(('D', 'foo'), A      B
one    A   -0.244030
       B    0.112285
       C    0.888418
three  A         NaN
       B   -0.759305
       C         NaN
two    A    0.449616
       B         NaN
       C    0.910474
Name: (D, foo), dtype: float64)
(('E', 'bar'), A      B
one    A   -0.784133
       B    1.055161
       C    0.416997
three  A    0.143384
       B         NaN
       C   -0.938091
two    A         NaN
       B    0.143702
       C         NaN
Name: (E, bar), dtype: float64)
(('E', 'foo'), A      B
one    A    1.217429
       B   -0.452343
       C    1.157877
three  A         NaN
       B    0.451718
       C         NaN
two    A   -0.235839
       B         NaN
       C   -0.239796
Name: (E, foo), dtype: float64)


## Time Series

In [112]:
rng = pd.date_range('2017-01-01', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print(len(rng))
ts.head()

100


2017-01-01 00:00:00     53
2017-01-01 00:00:01     89
2017-01-01 00:00:02    259
2017-01-01 00:00:03    298
2017-01-01 00:00:04    108
Freq: S, dtype: int64

In [113]:
ts.resample('30s').sum()

2017-01-01 00:00:00    6658
2017-01-01 00:00:30    6937
2017-01-01 00:01:00    7185
2017-01-01 00:01:30    2510
Freq: 30S, dtype: int64

In [114]:
ts.resample('30S').mean()

2017-01-01 00:00:00    221.933333
2017-01-01 00:00:30    231.233333
2017-01-01 00:01:00    239.500000
2017-01-01 00:01:30    251.000000
Freq: 30S, dtype: float64

In [115]:
ts.resample('5Min').sum()

2017-01-01    23290
Freq: 5T, dtype: int64
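
Because the series above is random, the bin sums are hard to verify by eye. The following sketch repeats the resampling with a series of ones (an assumption made purely for illustration), so that each sum equals the number of samples that fell into the bin.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2017-01-01', periods=100, freq='s')
ts = pd.Series(np.ones(len(rng), dtype=int), index=rng)

# 100 one-second samples: three full 30-second bins and a final bin of 10 samples.
per_30s = ts.resample('30s').sum()

# All 100 samples fall inside the first 5-minute bin.
per_5min = ts.resample('5min').sum()

print(per_30s.tolist())
print(per_5min.tolist())
```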

### Time Zone

In [116]:
rng = pd.date_range('2017-01-01 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2017-01-01    1.110801
2017-01-02   -1.158353
2017-01-03    1.562841
2017-01-04    0.213680
2017-01-05   -0.180826
Freq: D, dtype: float64

In [117]:
ts_utc = ts.tz_localize('UTC')
ts_utc

2017-01-01 00:00:00+00:00    1.110801
2017-01-02 00:00:00+00:00   -1.158353
2017-01-03 00:00:00+00:00    1.562841
2017-01-04 00:00:00+00:00    0.213680
2017-01-05 00:00:00+00:00   -0.180826
Freq: D, dtype: float64

In [118]:
ts_utc.tz_convert('US/Eastern')

2016-12-31 19:00:00-05:00    1.110801
2017-01-01 19:00:00-05:00   -1.158353
2017-01-02 19:00:00-05:00    1.562841
2017-01-03 19:00:00-05:00    0.213680
2017-01-04 19:00:00-05:00   -0.180826
Freq: D, dtype: float64

In [119]:
ts_utc.tz_convert('Asia/Seoul')

2017-01-01 09:00:00+09:00    1.110801
2017-01-02 09:00:00+09:00   -1.158353
2017-01-03 09:00:00+09:00    1.562841
2017-01-04 09:00:00+09:00    0.213680
2017-01-05 09:00:00+09:00   -0.180826
Freq: D, dtype: float64

### Converting between Time Span Representations

In [120]:
rng = pd.date_range('2017-01-01', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2017-01-31   -0.008218
2017-02-28   -1.401153
2017-03-31   -1.686510
2017-04-30   -0.021091
2017-05-31   -1.086826
Freq: M, dtype: float64

In [121]:
ps = ts.to_period()
ps

2017-01   -0.008218
2017-02   -1.401153
2017-03   -1.686510
2017-04   -0.021091
2017-05   -1.086826
Freq: M, dtype: float64

In [122]:
ps.to_timestamp()

2017-01-01   -0.008218
2017-02-01   -1.401153
2017-03-01   -1.686510
2017-04-01   -0.021091
2017-05-01   -1.086826
Freq: MS, dtype: float64

## Categorical Data (Category, Factor)

In [123]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

In [124]:
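df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"]) # rename the categories (direct assignment to .cat.categories was removed in pandas 2.0)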
df

Unnamed: 0,id,raw_grade,grade
0,1,a,very good
1,2,b,good
2,3,b,good
3,4,a,very good
4,5,a,very good
5,6,e,very bad


- Reorder the categories and add the missing ones

In [125]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

In [126]:
df.sort_values(by="grade")


Unnamed: 0,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good


In [127]:
df.groupby("grade").size()

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
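
Sorting and counting already follow the category order. If comparisons such as "at least good" are also needed, the categories can be declared as ordered. The sketch below is one way to do that with `pd.CategoricalDtype`; `grades`, `levels`, and `cat` are illustrative names.

```python
import pandas as pd

grades = pd.Series(['good', 'very bad', 'very good', 'good'])
levels = ['very bad', 'bad', 'medium', 'good', 'very good']

# ordered=True makes comparisons follow the order of `levels`.
cat = grades.astype(pd.CategoricalDtype(categories=levels, ordered=True))

sorted_values = cat.sort_values().tolist()
at_least_good = (cat >= 'good').tolist()
counts = cat.value_counts().to_dict()  # unused categories are reported with 0

print(sorted_values)
print(at_least_good)
print(counts)
```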

## I/O

In [128]:
env_ts_01 = pd.read_csv('data/env_00.csv')

In [129]:
env_ts_01.head()

Unnamed: 0,date,hour,day_of_week,timestamp,temperature,icon,humidity,visibility,summary,apparentTemperature,pressure,windSpeed,cloudCover,windBearing,precipIntensity,dewPoint,precipProbability
0,2014-01-01,0,2,1388534400,19.76,clear-night,0.48,10.0,Clear,7.42,1020.07,11.96,0.0,280,0.0,3.42,0.0
1,2014-01-01,1,2,1388538000,18.74,clear-night,0.49,10.0,Clear,7.44,1024.01,9.78,0.0,280,0.0,2.58,0.0
2,2014-01-01,2,2,1388541600,17.4,clear-night,0.53,10.0,Clear,7.74,1024.97,7.19,0.0,252,0.0,3.15,0.0
3,2014-01-01,3,2,1388545200,16.94,clear-night,0.54,10.0,Clear,7.0,1025.83,7.4,0.0,244,0.0,3.37,0.0
4,2014-01-01,4,2,1388548800,15.51,clear-night,0.6,10.0,Clear,7.16,1026.0,5.47,0.0,225,0.0,4.11,0.0


### List Comprehension

In [130]:
[(x,x+1) for x in range(0,5)]

[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

In [131]:
[(x,y) for x in range(0,5) for y in range(0,x)]

[(1, 0),
 (2, 0),
 (2, 1),
 (3, 0),
 (3, 1),
 (3, 2),
 (4, 0),
 (4, 1),
 (4, 2),
 (4, 3)]

In [132]:
frames = [pd.read_csv('data/env_%02d.csv' % idx) for idx in range(0,3)]
frame_env = pd.concat(frames)
frame_env.to_csv('data/env_merged.csv')
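
The merge above keeps each file's own row numbers and writes them into the output file as an extra unnamed column. The sketch below shows one way to avoid both, using small stand-in files in a temporary directory (the real `data/env_XX.csv` files are not reproduced here, and the column values are made up).

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write three small stand-in files named like the originals.
    for idx in range(3):
        part = pd.DataFrame({'hour': [0, 1], 'temperature': [10.0 + idx, 11.0 + idx]})
        part.to_csv(os.path.join(tmp, 'env_%02d.csv' % idx), index=False)

    frames = [pd.read_csv(os.path.join(tmp, 'env_%02d.csv' % idx)) for idx in range(3)]

    # ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1 per file.
    merged = pd.concat(frames, ignore_index=True)

    # index=False keeps the row index out of the file, so re-reading it
    # does not produce an 'Unnamed: 0' column.
    merged_path = os.path.join(tmp, 'env_merged.csv')
    merged.to_csv(merged_path, index=False)
    reloaded = pd.read_csv(merged_path)

print(reloaded)
```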