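# Chapter 5: Getting Started with pandas

## 5.1 Data Structures

### 5.1.1 Creating Series and DataFrame Objects

Pass in a nested dict: the outer keys become the columns and the inner keys become the row index.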

In [1]:
from pandas import Series,DataFrame
import pandas as pd
import numpy as np

In [2]:
pop={'Nevada':{2001:2.4,2002:2.9},'Ohio':{2000:1.5,2002:3.6}}
f1=DataFrame(pop)
f1

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,
2002,2.9,3.6


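### Index methods and attributes

Method|Description
:--|:--
`append` |Concatenates two `Index` objects and returns a **new Index**
`delete` |Removes the element at position `i` and returns a **new Index**
`is_monotonic` |Whether the index is monotonically increasing (removed in pandas 2.0; use `is_monotonic_increasing`)
`is_unique` |Returns `True` when the `Index` has no duplicate values
`unique` |Computes the unique values in the Index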
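
A minimal sketch of the entries in the table, using made-up labels. It is only meant to show that `append` and `delete` return new objects and leave the original `Index` untouched.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
other = pd.Index(['c', 'd'])

combined = idx.append(other)   # new Index: a, b, c, c, d
shorter = combined.delete(0)   # new Index without position 0: b, c, c, d

print(list(combined), list(shorter))
print(idx.is_monotonic_increasing)   # True
print(combined.is_unique)            # False: 'c' appears twice
print(list(combined.unique()))       # ['a', 'b', 'c', 'd']
print(list(idx))                     # ['a', 'b', 'c'], the original is unchanged
```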

## Essential Functionality

### reindex

#### Series

In [3]:
obj=Series(['blue','yellow','red'],index=[0,2,4])
obj2=obj.reindex(range(6),method='ffill')               

#### DataFrame
Pass one list to reindex the rows.<br>
Pass two lists to reindex the rows and the columns respectively.<br>
To reindex only the columns: `frame.reindex(columns=list)`

In [4]:
frame=DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],columns=['Ohio','Texas','California'])

**Reindexing the rows**

In [5]:
frame2=frame.reindex(['a','d','b','c'],method="ffill")  
frame2

Unnamed: 0,Ohio,Texas,California
a,0,1,2
d,6,7,8
b,3,4,5
c,6,7,8


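※ **Precondition for using `method`:** `index must be monotonic increasing or decreasing`
<br>With `method='ffill'`, a new label takes the values of the last existing label that sorts before it,
<br>so row `d` is copied from row `c`: `frame2.loc['d']` equals `frame.loc['c']`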
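
To make the forward-fill rule concrete, the sketch below rebuilds the frame from this section and compares reindexing with and without `method='ffill'`. The names `filled` and `plain` are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'b', 'c'],
                     columns=['Ohio', 'Texas', 'California'])

# The existing index has to be monotonic before method= can be used.
print(frame.index.is_monotonic_increasing)   # True

filled = frame.reindex(['a', 'd', 'b', 'c'], method='ffill')
plain = frame.reindex(['a', 'd', 'b', 'c'])  # no method: row 'd' is all NaN

print(filled.loc['d'].tolist())      # [6, 7, 8], the same values as frame.loc['c']
print(plain.loc['d'].isna().all())   # True
```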

In [6]:
frame2.columns=[1,2,3]
frame3=frame2.reindex(columns=[1,2,3,4,5,5],method='ffill')     # with bfill the new columns would be NaN
frame3.columns=['col1','col2','col3','col4','col5','col6']
frame3

Unnamed: 0,col1,col2,col3,col4,col5,col6
a,0,1,2,2,2,2
d,6,7,8,8,8,8
b,3,4,5,5,5,5
c,6,7,8,8,8,8


In [7]:
frame3.xs('col2',axis=1)  # xs returns a single row or a single column

a    1
d    7
b    4
c    7
Name: col2, dtype: int32

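**NOTES**
1. When `columns` holds ordered values, `method` can be used to choose how gaps are filled.
2. When two lists are passed to `reindex` rows and columns at once, `method` is applied only to the **rows** (as observed in the pandas version these notes were written against).
3. When `limit` is passed, `columns`/`index` must be monotonic.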
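
As a small illustration of `limit`, the sketch below forward-fills a Series but allows at most one consecutive fill per gap. The data reuses the colors from the Series example above; the longer target range is an assumption made to show where filling stops.

```python
import pandas as pd

obj = pd.Series(['blue', 'yellow', 'red'], index=[0, 2, 4])

# Both the existing index and the target are monotonic, so limit= is allowed.
limited = obj.reindex(range(8), method='ffill', limit=1)

print(limited.tolist()[:6])      # ['blue', 'blue', 'yellow', 'yellow', 'red', 'red']
print(limited.isna().tolist())   # only positions 6 and 7 stay missing
```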

### Dropping entries, `get/set values`

In [8]:
# drop a row
frame3.drop('a')
# drop a column
new_frame=frame3.drop('col6',axis=1)   # drop returns a new DataFrame

In [9]:
frame3.at['c','col5']=-1   # set_value was removed in pandas 1.0; .at sets a single cell in place
frame3

Unnamed: 0,col1,col2,col3,col4,col5,col6
a,0,1,2,2,2,2
d,6,7,8,8,8,8
b,3,4,5,5,5,5
c,6,7,8,8,-1,8


### Filling
1. Fill via the `reindex` `method` argument: [`ffill`](#reindex), `bfill`
2. Fill with a constant via `fill_value`
3. `fill_value` can also be passed to `reindex`
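
A short sketch of the third item, with made-up data: passing `fill_value` to `reindex` puts a constant, rather than `NaN`, at every label that did not exist before.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=['d', 'b', 'a'])

# 'c' is not in the original index, so it receives the fill value.
out = obj.reindex(['a', 'b', 'c', 'd'], fill_value=0)

print(out.tolist())   # [-5.3, 7.2, 0.0, 4.5]
```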

In [10]:
df1=DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'))
df2=DataFrame(np.arange(20).reshape(4,5),columns=list('abcde'))
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [11]:
df1.add(df2,fill_value=0)    # fill_value stands in where only one of df1/df2 is missing; positions missing in both stay NaN

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


**Arithmetic methods**


Method|Description
:--|:--
add|Addition
sub|Subtraction
mul|Multiplication
div|Division

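### Operations between DataFrame and Series
Broadcasting over rows (default)
```python
df-series    # series.index is matched against the columns, then broadcast down the rows
```
Broadcasting over columns
```python
df.sub(series,axis=0)  # axis is the one series.index should match; after matching axis 0, the operation broadcasts along axis 1
```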
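
The two broadcasting directions can be compared on a small frame. This is a sketch with made-up numbers; `row` and `col` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row = df.iloc[0]   # Series indexed by the column labels b, d, e
col = df['d']      # Series indexed by the row labels

by_row = df - row              # row is subtracted from every row
by_col = df.sub(col, axis=0)   # col is subtracted from every column

print(by_row.loc['Ohio'].tolist())     # [3.0, 3.0, 3.0]
print(by_col.loc['Oregon'].tolist())   # [-1.0, 0.0, 1.0]
```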

### Function application and mapping

In [12]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-1.345511,1.24106,-0.335556
Ohio,0.581443,0.386226,2.126242
Texas,-0.019844,-0.420891,-0.177956
Oregon,-0.247697,-0.661138,1.690165


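#### **`apply applymap`** methods
- apply()
 - `df.apply(func)`: applied to one row or one column at a time; the default is axis 0 (one column at a time)
 - `Series.apply(func)`: applied to each element
 
 
- applymap()
 - Only available on `df`; applied to each element<br>
 like doing `map(func, series)` for each series in the DataFrame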
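
The "map over each series" description can be spelled out directly. The sketch below applies `Series.map` to every column through `df.apply`, which gives an element-wise result without calling `applymap` (deprecated in pandas 2.1 in favor of `DataFrame.map`). The data and the name `fmt` are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame([[1.234, -0.5], [10.0, 2.718]], columns=['x', 'y'])
fmt = lambda v: '%.2f' % v

# Element-wise over the whole frame: Series.map on each column in turn.
per_column = frame.apply(lambda s: s.map(fmt))

# Element-wise over one column only.
one_column = frame['x'].apply(fmt)

print(per_column.values.tolist())   # [['1.23', '-0.50'], ['10.00', '2.72']]
print(one_column.tolist())          # ['1.23', '10.00']
```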

In [13]:
f=lambda x: x.max()-x.min()
frame.apply(f)
frame.apply(f,axis=1)

Utah      2.586572
Ohio      1.740016
Texas     0.401047
Oregon    2.351303
dtype: float64

In [15]:
f3=lambda x: Series([x.min(),x.max()],index=['min','max'])
frame.apply(f3)

Unnamed: 0,b,d,e
min,-1.345511,-0.661138,-0.335556
max,0.581443,1.24106,2.126242


Format every element

In [16]:
format=lambda x: '%.2f' %x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-1.35,1.24,-0.34
Ohio,0.58,0.39,2.13
Texas,-0.02,-0.42,-0.18
Oregon,-0.25,-0.66,1.69


Format a single column of `df`

In [3]:
ff=DataFrame(np.random.randn(20).reshape((4,5)))
formater=lambda x: '%.2f' %x
ff[2]=ff[2].apply(formater)
ff

Unnamed: 0,0,1,2,3,4
0,0.409225,-0.059035,-0.69,0.60801,-0.604662
1,-1.388213,-0.294574,1.03,1.250235,0.795981
2,-0.546295,0.716735,-0.68,0.875924,0.589751
3,-0.635919,1.229887,-1.29,-0.920333,1.507943


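### Sorting `sort`
See: [sorting in NumPy](ch04_01.ipynb#排序)
1. Sort by index: 
```python
Series.sort_index()
Df.sort_index()
Df.sort_index(axis=1)
```
2. Sort by value
```python
Series.sort_values()              # Series.order() in old releases, since removed
Df.sort_values(by='column_name')  # replaces the removed Df.sort_index(by=...)
Df.sort_values(by=['a','b'])    # sort by a first, then by b where a is tied
```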
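
A runnable sketch of the calls listed above, on a small made-up frame. Each call returns a new, sorted object and leaves `df` as it was.

```python
import pandas as pd

df = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]},
                  index=[3, 1, 2, 0])

by_index = df.sort_index()                  # rows ordered by label: 0, 1, 2, 3
by_columns = df.sort_index(axis=1)          # columns ordered by label: a, b
by_values = df.sort_values(by=['a', 'b'])   # a first, ties broken by b
ser_sorted = df['b'].sort_values()          # a Series sorted by value

print(list(by_index.index))     # [0, 1, 2, 3]
print(list(by_columns.columns)) # ['a', 'b']
print(list(by_values.index))    # [2, 3, 0, 1]
print(ser_sorted.tolist())      # [-3, 2, 4, 7]
```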

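### Ranking `ranking`

Returns the rank of each element within the array.<br>
How ties are handled:

method|Description
:--|:---
average|Default: equal values are assigned a rank that is the average of the ranks of those values
min|Use the lowest rank for the whole group
max|Use the highest rank for the whole group
first|Rank ties in the order they appear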
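
The four tie-breaking rules are easiest to compare on one Series that contains ties. The values below are made up for illustration.

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

print(s.rank().tolist())                 # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(s.rank(method='min').tolist())     # [6.0, 1.0, 6.0, 4.0, 3.0, 2.0, 4.0]
print(s.rank(method='max').tolist())     # [7.0, 1.0, 7.0, 5.0, 3.0, 2.0, 5.0]
print(s.rank(method='first').tolist())   # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
```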

In [17]:
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame.rank(axis=1,method='first')

Unnamed: 0,a,b,c
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,2.0,1.0,3.0
3,2.0,3.0,1.0


## Summarizing and Computing Descriptive Statistics

The `pandas.io.data` module has moved to `pandas_datareader` ( [`Doc`](http://pandas-datareader.readthedocs.io/en/latest/) )
```shell
conda install pandas-datareader
```

### Series.pct_change()
<br>In the returned Series, the entry at `index=2` is: $$\frac{value[2]-value[1]}{value[1]}$$

In [19]:
ss=Series([1,2,3])
ss.pct_change()

0    NaN
1    1.0
2    0.5
dtype: float64

### Covariance and correlation
`.corr` `cov` `corrwith`

In [23]:
df=DataFrame(np.random.randn(5,5),columns=[1,2,3,4,5])
df

Unnamed: 0,1,2,3,4,5
0,-0.085949,-0.069787,-0.447926,0.68555,0.070329
1,-0.762842,-0.938235,0.511513,0.727425,0.024499
2,-0.083003,0.170441,0.267452,-0.749128,0.67576
3,0.964924,-0.499859,1.107623,-0.899374,3.01626
4,1.299568,1.03569,1.041145,-0.328281,0.32681


In [24]:
C=df.corr()
C

Unnamed: 0,1,2,3,4,5
1,1.0,0.655582,0.645286,-0.645377,0.508995
2,0.655582,1.0,0.143303,-0.338167,-0.223311
3,0.645286,0.143303,1.0,-0.604673,0.565986
4,-0.645377,-0.338167,-0.604673,1.0,-0.718155
5,0.508995,-0.223311,0.565986,-0.718155,1.0


`df.corr()` returns the correlation matrix $C$<br>

$C_{ij}=$ `df[i].corr(df[j])` <br>

$C_{ij}=C_{ji}$
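
These identities can be checked numerically. The sketch below uses seeded random data, so its numbers differ from the outputs shown in this section; the seed and the shape are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(50, 3), columns=[1, 2, 3])

C = df.corr()
pair = df[1].corr(df[2])

print(np.isclose(C.loc[1, 2], pair))                        # C_ij equals df[i].corr(df[j])
print(np.allclose(C.values, C.values.T))                    # C is symmetric
print(np.allclose(np.diag(C.values), 1.0))                  # each column correlates perfectly with itself
print(np.allclose(df.corrwith(df[3]).values, C[3].values))  # corrwith(df[3]) reproduces column 3 of C
```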

In [25]:
df[2].corr(df[1])

0.65558222415170764

In [26]:
df.corrwith(df[3])

1    0.645286
2    0.143303
3    1.000000
4   -0.604673
5    0.565986
dtype: float64

### Common methods
1. Uniqueness
2. Counting
3. Membership

In [27]:
obj=Series(['a','c','d','a','a','b','c','c'])

In [28]:
obj.unique()                                       # unique values
obj.value_counts()                                 # counts per value
pd.value_counts(obj.values,sort=False)

a    3
d    1
c    3
b    1
dtype: int64

**Membership**

In [29]:
mask=obj.isin(['a','d'])
obj[mask]

0    a
2    d
3    a
4    a
dtype: object

 **DataFrame.apply**(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

    Applies function along input axis of DataFrame.

    Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1).

In [30]:
data=DataFrame({'qu1':[1,3,4,3,4],
               'qu2':[2,3,1,2,3],
               'que3':[1,5,2,4,4]})
data.apply(pd.value_counts).fillna(0)

Unnamed: 0,qu1,qu2,que3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


## Handling Missing Data
### Filling missing data

In [31]:
df=DataFrame(np.random.randn(7,3))
df.iloc[:3,1:]=np.nan    
df.iloc[[3,4],1]=np.nan
df

Unnamed: 0,0,1,2
0,-1.046038,,
1,1.880446,,
2,-0.038886,,
3,-0.153734,,-0.57706
4,-0.772119,,-0.228833
5,0.189213,-1.483098,-1.54174
6,1.139778,1.269114,-0.900236


#### Keep rows that have at least 2 non-missing values

In [32]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
3,-0.153734,,-0.57706
4,-0.772119,,-0.228833
5,0.189213,-1.483098,-1.54174
6,1.139778,1.269114,-0.900236


#### Fill different columns with different values
Pass a `dict`

In [33]:
df.fillna({1:1,2:2})

Unnamed: 0,0,1,2
0,-1.046038,1.0,2.0
1,1.880446,1.0,2.0
2,-0.038886,1.0,2.0
3,-0.153734,1.0,-0.57706
4,-0.772119,1.0,-0.228833
5,0.189213,-1.483098,-1.54174
6,1.139778,1.269114,-0.900236


#### Modifying the original object in place
`.fillna` returns a `copy` by default

To modify in place: `.fillna(value,inplace=True)`
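
A minimal sketch of the difference, on made-up data: the default call leaves `df` untouched, while `inplace=True` changes `df` itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})

filled = df.fillna(0)                             # new object; df still holds NaN
n_missing_before = int(df.isna().sum().sum())

df.fillna(0, inplace=True)                        # df itself is modified
n_missing_after = int(df.isna().sum().sum())

print(n_missing_before, n_missing_after)   # 2 0
print(filled.equals(df))                   # True
```

Assigning the result back, as in `df = df.fillna(0)`, gives the same end state without relying on `inplace`.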

## 5.5 Hierarchical Indexing

In [34]:
data=Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
data,data.index

(a  1   -0.036478
    2    0.696885
    3    0.233812
 b  1    0.038440
    2    1.558086
    3    1.520032
 c  1   -0.408008
    2   -0.272061
 d  2   -0.298656
    3   -0.681329
 dtype: float64, MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
            labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]]))

In [35]:
frame=DataFrame(np.random.rand(4,3),index=[['a','a','b','b'],[1,2,1,2]],columns=[['ohio','ohio','cali'],['green','red','green']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,ohio,ohio,cali
Unnamed: 0_level_1,Unnamed: 1_level_1,green,red,green
a,1,0.997447,0.864915,0.895538
a,2,0.286276,0.777191,0.331116
b,1,0.59963,0.987011,0.769337
b,2,0.87137,0.58962,0.63987


In [36]:
# m_index=pd.MultiIndex

In [37]:
frame.index.names=['key1','key2']
frame.columns.names=['state','color']

frame

Unnamed: 0_level_0,state,ohio,ohio,cali
Unnamed: 0_level_1,color,green,red,green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0.997447,0.864915,0.895538
a,2,0.286276,0.777191,0.331116
b,1,0.59963,0.987011,0.769337
b,2,0.87137,0.58962,0.63987


In [38]:
frame.swaplevel('key1','key2')

Unnamed: 0_level_0,state,ohio,ohio,cali
Unnamed: 0_level_1,color,green,red,green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0.997447,0.864915,0.895538
2,a,0.286276,0.777191,0.331116
1,b,0.59963,0.987011,0.769337
2,b,0.87137,0.58962,0.63987


In [39]:
frame

Unnamed: 0_level_0,state,ohio,ohio,cali
Unnamed: 0_level_1,color,green,red,green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0.997447,0.864915,0.895538
a,2,0.286276,0.777191,0.331116
b,1,0.59963,0.987011,0.769337
b,2,0.87137,0.58962,0.63987


In [2]:
df=DataFrame({'phase_def':{1:115,2:160,3:170,4:175,5:175},
            'phase_non_def':{1:56.5,2:55,3:55.1,4:55.08,5:55.08}})
df

Unnamed: 0,phase_def,phase_non_def
1,115,56.5
2,160,55.0
3,170,55.1
4,175,55.08
5,175,55.08


### Converting to a `MultiIndex`

`.unstack()` returns a new object

`.swaplevel(0,1)` returns a new object

`.sort_index(level=0)` returns a new object

In [5]:
df2se=df.unstack()
df2se

phase_def      1    115.00
               2    160.00
               3    170.00
               4    175.00
               5    175.00
phase_non_def  1     56.50
               2     55.00
               3     55.10
               4     55.08
               5     55.08
dtype: float64

In [6]:
df2se.index

MultiIndex(levels=[['phase_def', 'phase_non_def'], [1, 2, 3, 4, 5]],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]])

In [9]:
df2se.index.names=['phase','Number']
df2se_sw=df2se.swaplevel(0,1)        # returns a new object
df2se_sw

Number  phase        
1       phase_def        115.00
2       phase_def        160.00
3       phase_def        170.00
4       phase_def        175.00
5       phase_def        175.00
1       phase_non_def     56.50
2       phase_non_def     55.00
3       phase_non_def     55.10
4       phase_non_def     55.08
5       phase_non_def     55.08
dtype: float64

In [10]:
df2se_sw.sort_index(level=0)

Number  phase        
1       phase_def        115.00
        phase_non_def     56.50
2       phase_def        160.00
        phase_non_def     55.00
3       phase_def        170.00
        phase_non_def     55.10
4       phase_def        175.00
        phase_non_def     55.08
5       phase_def        175.00
        phase_non_def     55.08
dtype: float64

### 5.5.1 Reordering and Sorting Levels

### 5.5.2 Summary Statistics by Level

### 5.5.3 Using a DataFrame's Columns as the Index

## 5.6 Other pandas Topics

### 5.6.1 Integer Indexing

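`Series`<br>
When the index *contains integers*, selecting or slicing data with integers is always **label**-oriented.

When the index contains no integers, selecting or slicing data with integers is **position**-oriented.

`DataFrame`<br>
For a `df`, use `loc` for label-based indexing and `iloc` for position-based indexing.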
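
Because a bare integer can be ambiguous, the explicit accessors are the safer choice. The sketch below uses a made-up Series whose integer labels are deliberately out of order, so that the label lookup and the position lookup give different answers.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10., 20., 30.], index=[2, 0, 1])

by_label = ser.loc[0]       # the element whose label is 0
by_position = ser.iloc[0]   # the first element, whatever its label

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=[1, -1, 4])
col_label = df.loc[:, -1]   # the column whose label is -1
col_last = df.iloc[:, -1]   # the last column, labelled 4

print(by_label, by_position)                   # 20.0 10.0
print(col_label.tolist(), col_last.tolist())   # [1, 4] [2, 5]
```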

In [40]:
dff=DataFrame(np.random.rand(5,4),columns=[1,-1,4,5])
dff

Unnamed: 0,1,-1,4,5
0,0.8666,0.86595,0.890617,0.667237
1,0.487846,0.277667,0.106869,0.097285
2,0.251482,0.665738,0.116685,0.137195
3,0.616372,0.854474,0.062562,0.24347
4,0.12207,0.087875,0.134753,0.273231


In [41]:
dff[-1]

0    0.865950
1    0.277667
2    0.665738
3    0.854474
4    0.087875
Name: -1, dtype: float64

In [42]:
dff2=DataFrame(np.random.randn(5,4),columns=['a','f','b','d'])
dff2

Unnamed: 0,a,f,b,d
0,-1.086879,-1.109839,-0.07589,0.700439
1,1.150916,-0.978938,0.47199,-0.783764
2,-0.3168,-1.097486,-0.740895,1.310597
3,0.857949,-0.144822,-0.327928,0.539144
4,1.499346,1.927254,0.431937,1.122111


In [43]:
dff2.iloc[:,-1]

0    0.700439
1   -0.783764
2    1.310597
3    0.539144
4    1.122111
Name: d, dtype: float64

### 5.6.2 Panel Data