# Getting Started with Pandas
- http://pandas.pydata.org/pandas-docs/stable/

## Introduction to Pandas

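- Pandas is a third-party Python library that provides high-performance, easy-to-use data types and analysis tools
- Two data types: Series and DataFrame
- A range of operations built on these data types
- Basic operations, arithmetic operations, feature operations, and correlation operations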

In [1]:
import pandas as pd

In [2]:
d = pd.Series(range(20))

In [3]:
d

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
dtype: int32

In [4]:
d.cumsum()

0       0
1       1
2       3
3       6
4      10
5      15
6      21
7      28
8      36
9      45
10     55
11     66
12     78
13     91
14    105
15    120
16    136
17    153
18    171
19    190
dtype: int32

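|NumPy|Pandas|
|:----:|:---|
|Basic data types|Extended data types|
|Focuses on the structural representation of data|Focuses on the application-level representation of data|
|Dimensions: relationships among data|Relationships between data and index|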
## The Series Type in Pandas

- A Series consists of a set of data and its associated index

In [5]:
a = pd.Series([9,8,7,6])

In [6]:
a

0    9
1    8
2    7
3    6
dtype: int64

In [8]:
b = pd.Series([9,8,7,6], index=['a','b','c','d'])

In [9]:
b

a    9
b    8
c    7
d    6
dtype: int64

### A Series can be created from the following types:
- Python list
- Scalar value
- Python dictionary
- ndarray
- Other functions (for example `range()`)

**Creating from a scalar value**

In [10]:
c = pd.Series(25, index=['a','b','c','d'])
c

a    25
b    25
c    25
d    25
dtype: int64

**Creating from a dictionary**

In [11]:
d = pd.Series({'a':9,'b':7})

In [12]:
d

a    9
b    7
dtype: int64

In [14]:
e = pd.Series({'a':9,'b':8,'c':7},index=['a','b','c','d'])

In [15]:
e

a    9.0
b    8.0
c    7.0
d    NaN
dtype: float64

**Creating from an ndarray**

In [18]:
import numpy as np
n = pd.Series(np.arange(5))

In [19]:
n

0    0
1    1
2    2
3    3
4    4
dtype: int32

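### Basic operations on Series

- A Series has two parts: index and values
- Operations on a Series are similar to those on an ndarray
- Operations on a Series are similar to those on a Python dictionary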

In [20]:
b = pd.Series([9,8,7,6], index=['a','b','c','d'])

In [21]:
b

a    9
b    8
c    7
d    6
dtype: int64

In [22]:
b.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [23]:
b.values

array([9, 8, 7, 6], dtype=int64)

In [24]:
b['b']  # custom index (label)

8

In [26]:
b[1]  # automatic index (position); recent pandas versions prefer b.iloc[1]

8

In [25]:
b[['c','d',0]]  # the two index kinds cannot be mixed; recent pandas versions raise KeyError for the missing label 0

c    7.0
d    6.0
0    NaN
dtype: float64
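
Because label-based and position-based lookups cannot be mixed in one selection, one hedged alternative is to state the intent explicitly with `.loc` (labels) and `.iloc` (positions). The sketch below rebuilds the Series `b` so that it runs on its own; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

by_label = b.loc[['c', 'd']]     # select by custom index labels
by_position = b.iloc[[2, 3]]     # select by automatic integer positions

print(by_label.equals(by_position))  # both selections pick the same rows
print(b.iloc[1])                     # positional access to the second element
```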

In [27]:
b[:3]

a    9
b    8
c    7
dtype: int64

In [28]:
b[b > b.median()]

a    9
b    8
dtype: int64

In [29]:
np.exp(b)

a    8103.083928
b    2980.957987
c    1096.633158
d     403.428793
dtype: float64

In [31]:
'c' in b

True

In [32]:
0 in b

False

In [33]:
b.get('f', 100)

100

### Alignment operations on Series

In [34]:
a = pd.Series([1,2,3],index=['c','d','e'])

In [35]:
a + b

a    NaN
b    NaN
c    8.0
d    8.0
e    NaN
dtype: float64
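
Arithmetic between two Series aligns on index labels first, and a label present on only one side produces NaN. As a sketch of one way to avoid the NaN values, `Series.add` accepts a `fill_value`; the value 0 used here is an assumption chosen for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

aligned = a + b                   # union of labels; unmatched labels become NaN
filled = a.add(b, fill_value=0)   # a missing operand is treated as 0 before adding

print(aligned)
print(filled)
```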

### The name attribute of Series

- A Series object can be modified at any time, and the change takes effect immediately

In [36]:
b.name

In [37]:
b.name = 'Series对象'    # "Series object"
b.index.name = '索引列'  # "index column"

In [38]:
b

索引列
a    9
b    8
c    7
d    6
Name: Series对象, dtype: int64

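## The DataFrame Type in Pandas

- A DataFrame consists of a group of columns that share the same index

- A DataFrame is a tabular data type, and each column can hold a different value type
- A DataFrame has both a row index and a column index
- A DataFrame is commonly used for two-dimensional data, but it can also represent multidimensional data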

### A DataFrame can be created from the following types:
- A two-dimensional ndarray
- A dictionary of one-dimensional ndarrays, lists, dictionaries, tuples, or Series
- A Series
- Another DataFrame

**From a two-dimensional ndarray**

In [39]:
d = pd.DataFrame(np.arange(10).reshape(2,5))

In [40]:
d

| |0|1|2|3|4|
|:---:|:---|:---|:---|:---|:---|
|0|0|1|2|3|4|
|1|5|6|7|8|9|


**From a dictionary of one-dimensional ndarrays, lists, dictionaries, tuples, or Series**

In [41]:
dt = {'one':pd.Series([1,2,3],index=['c','d','e']),
     'two':pd.Series([9,8,7,6], index=['a','b','c','d'])}

In [44]:
d=pd.DataFrame(dt)

In [45]:
d

| |one|two|
|:---:|:---|:---|
|a|NaN|9.0|
|b|NaN|8.0|
|c|1.0|7.0|
|d|2.0|6.0|
|e|3.0|NaN|


In [46]:
dl = {'one':[1,2,3,4],'two':[9,8,7,6]}

In [47]:
d = pd.DataFrame(dl,index=['a','b','c','d'])

In [48]:
d

| |one|two|
|:---:|:---|:---|
|a|1|9|
|b|2|8|
|c|3|7|
|d|4|6|
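
The creation list for DataFrame also names a single Series and another DataFrame as sources, which the cells here do not demonstrate. A minimal sketch follows; the names `s`, `from_series`, and `from_frame` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'], name='two')

from_series = pd.DataFrame(s)            # one column, named after the Series
from_frame = pd.DataFrame(from_series)   # built from an existing DataFrame

print(from_series.columns.tolist())
print(from_frame.shape)
```

The Series index becomes the row index of the resulting DataFrame, and the Series name becomes the column label.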


## Operations on Pandas Data Types
- Adding or reordering: reindexing
- Deleting: drop

### Reindexing

- `.reindex()` can change or reorder the index of a Series or DataFrame

In [49]:
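# Column names: 城市 = city, 环比 = month-on-month, 同比 = year-on-year, 定基 = fixed-base
# Cities: Beijing, Shanghai, Guangzhou, Shenzhen, Shenyang
dl = {'城市':['北京','上海','广州','深圳','沈阳'],
     '环比':[101.5, 101.2, 101.3, 102.0, 100.1],
     '同比':[121.5, 121.2, 101.3, 102.0, 101.1],
     '定基':[121.7, 127.2, 120.3, 145.0, 100.1]}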

In [50]:
d = pd.DataFrame(dl, index=['c1','c2','c3','c4','c5'])

In [51]:
d

| |同比|城市|定基|环比|
|:---:|:---|:---|:---|:---|
|c1|121.5|北京|121.7|101.5|
|c2|121.2|上海|127.2|101.2|
|c3|101.3|广州|120.3|101.3|
|c4|102.0|深圳|145.0|102.0|
|c5|101.1|沈阳|100.1|100.1|


In [52]:
d = d.reindex(columns = ['城市','同比','环比','定基'])

In [53]:
d

| |城市|同比|环比|定基|
|:---:|:---|:---|:---|:---|
|c1|北京|121.5|101.5|121.7|
|c2|上海|121.2|101.2|127.2|
|c3|广州|101.3|101.3|120.3|
|c4|深圳|102.0|102.0|145.0|
|c5|沈阳|101.1|100.1|100.1|


**Parameters of `.reindex(index=None, columns=None, …)`**

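|Parameter|Description|
|:-----:|:----|
|index, columns|New custom labels for the row and column indexes|
|fill_value|Value used to fill the missing positions created by reindexing|
|method|Fill method: `ffill` carries the previous value forward, `bfill` fills backward from the next value|
|limit|Maximum number of consecutive elements to fill|
|copy|Defaults to True and produces a new object; when False, no copy is made if the new index equals the old one|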
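
To illustrate how `fill_value`, `method`, and `limit` interact, here is a small sketch on a Series with a sorted integer index (filling with `method` requires a monotonic index). The Series `s` and the result names are assumptions made for this example only.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

plain = s.reindex(range(6))                              # new labels 1, 3, 5 become NaN
constant = s.reindex(range(6), fill_value=0)             # new labels receive a constant
forward = s.reindex(range(6), method='ffill')            # carry the previous value forward
limited = s.reindex(range(8), method='ffill', limit=1)   # fill at most one step past a known label

print(plain)
print(constant)
print(forward)
print(limited)
```

With `limit=1`, label 5 is filled from label 4, while labels 6 and 7 lie more than one step past the last known label and stay NaN.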

In [56]:
newc = d.columns.insert(4, '新增')

In [57]:
newd = d.reindex(columns=newc, fill_value=200)

In [58]:
newd

| |城市|同比|环比|定基|新增|
|:---:|:---|:---|:---|:---|:---|
|c1|北京|121.5|101.5|121.7|200|
|c2|上海|121.2|101.2|127.2|200|
|c3|广州|101.3|101.3|120.3|200|
|c4|深圳|102.0|102.0|145.0|200|
|c5|沈阳|101.1|100.1|100.1|200|


In [60]:
d.index

Index(['c1', 'c2', 'c3', 'c4', 'c5'], dtype='object')

In [61]:
d.columns

Index(['城市', '同比', '环比', '定基'], dtype='object')

**Common Index methods**

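|Method|Description|
|:----:|:---|
|.append(idx)|Concatenates another Index object and returns a new Index object|
|.difference(idx)|Computes the set difference and returns a new Index object|
|.intersection(idx)|Computes the intersection|
|.union(idx)|Computes the union|
|.delete(loc)|Deletes the element at position loc|
|.insert(loc,e)|Inserts element e at position loc|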
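
The methods in the table can be tried on two small Index objects. This is a sketch with made-up labels (`left` and `right` are illustrative names); set difference is spelled `difference` in current pandas. An Index is immutable, so every call returns a new Index and leaves the original unchanged.

```python
import pandas as pd

left = pd.Index(['c1', 'c2', 'c3'])
right = pd.Index(['c3', 'c4'])

print(left.append(right))        # concatenation; duplicates are kept
print(left.difference(right))    # labels in left but not in right
print(left.intersection(right))  # labels present in both
print(left.union(right))         # labels present in either
print(left.delete(0))            # drop the element at position 0
print(left.insert(1, 'c0'))      # insert 'c0' at position 1
```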

In [62]:
nc = d.columns.delete(2)

In [63]:
ni = d.index.insert(5, 'c0')

In [64]:
nd = d.reindex(index=ni, columns=nc, method='ffill')

In [65]:
nd

| |城市|同比|定基|
|:---:|:---|:---|:---|
|c1|北京|121.5|121.7|
|c2|上海|121.2|127.2|
|c3|广州|101.3|120.3|
|c4|深圳|102.0|145.0|
|c5|沈阳|101.1|100.1|
|c0|NaN|NaN|NaN|


**`.drop()` removes the specified row or column labels from a Series or DataFrame**

In [66]:
a

c    1
d    2
e    3
dtype: int64

In [67]:
a.drop(['c'])

d    2
e    3
dtype: int64

In [68]:
d

| |城市|同比|环比|定基|
|:---:|:---|:---|:---|:---|
|c1|北京|121.5|101.5|121.7|
|c2|上海|121.2|101.2|127.2|
|c3|广州|101.3|101.3|120.3|
|c4|深圳|102.0|102.0|145.0|
|c5|沈阳|101.1|100.1|100.1|


In [69]:
d.drop(['c5'])

| |城市|同比|环比|定基|
|:---:|:---|:---|:---|:---|
|c1|北京|121.5|101.5|121.7|
|c2|上海|121.2|101.2|127.2|
|c3|广州|101.3|101.3|120.3|
|c4|深圳|102.0|102.0|145.0|


In [72]:
d.drop('同比', axis=1)

| |城市|环比|定基|
|:---:|:---|:---|:---|
|c1|北京|101.5|121.7|
|c2|上海|101.2|127.2|
|c3|广州|101.3|120.3|
|c4|深圳|102.0|145.0|
|c5|沈阳|100.1|100.1|
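
As the outputs suggest, `.drop()` returns a new object and leaves the original untouched. The sketch below uses a small made-up DataFrame and the `index=` and `columns=` keywords, which are an alternative spelling of the `axis` argument.

```python
import pandas as pd

d = pd.DataFrame({'one': [1, 2, 3], 'two': [9, 8, 7]}, index=['a', 'b', 'c'])

rows_dropped = d.drop(index=['a'])       # same as d.drop(['a'], axis=0)
cols_dropped = d.drop(columns=['two'])   # same as d.drop('two', axis=1)

print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
print(d.shape)   # the original frame is unchanged
```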


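## Arithmetic on Pandas Data Types

- Arithmetic operations align the operands on their row and column indexes before computing; results default to floating-point
- Positions missing after alignment are filled with NaN (null value)
- Operations between 2-D and 1-D objects, or between 1-D and 0-D objects, are broadcast

- Binary operations written with the `+ - * /` operators produce a new object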

In [75]:
a = pd.DataFrame(np.arange(12).reshape(3,4))

In [76]:
a

| |0|1|2|3|
|:---:|:---|:---|:---|:---|
|0|0|1|2|3|
|1|4|5|6|7|
|2|8|9|10|11|


In [77]:
b = pd.DataFrame(np.arange(20).reshape(4,5))

In [78]:
b

| |0|1|2|3|4|
|:---:|:---|:---|:---|:---|:---|
|0|0|1|2|3|4|
|1|5|6|7|8|9|
|2|10|11|12|13|14|
|3|15|16|17|18|19|


In [79]:
a + b

| |0|1|2|3|4|
|:---:|:---|:---|:---|:---|:---|
|0|0.0|2.0|4.0|6.0|NaN|
|1|9.0|11.0|13.0|15.0|NaN|
|2|18.0|20.0|22.0|24.0|NaN|
|3|NaN|NaN|NaN|NaN|NaN|


In [80]:
a * b

| |0|1|2|3|4|
|:---:|:---|:---|:---|:---|:---|
|0|0.0|1.0|4.0|9.0|NaN|
|1|20.0|30.0|42.0|56.0|NaN|
|2|80.0|99.0|120.0|143.0|NaN|
|3|NaN|NaN|NaN|NaN|NaN|


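|Method|Description|
|:---:|:----|
|`.add(d, **kwargs)`|Addition between types, with optional parameters|
|`.sub(d, **kwargs)`|Subtraction between types, with optional parameters|
|`.mul(d, **kwargs)`|Multiplication between types, with optional parameters|
|`.div(d, **kwargs)`|Division between types, with optional parameters|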

In [81]:
b.add(a, fill_value=100)

| |0|1|2|3|4|
|:---:|:---|:---|:---|:---|:---|
|0|0.0|2.0|4.0|6.0|104.0|
|1|9.0|11.0|13.0|15.0|109.0|
|2|18.0|20.0|22.0|24.0|114.0|
|3|115.0|116.0|117.0|118.0|119.0|


In [82]:
a.mul(b, fill_value=0)

| |0|1|2|3|4|
|:---:|:---|:---|:---|:---|:---|
|0|0.0|1.0|4.0|9.0|0.0|
|1|20.0|30.0|42.0|56.0|0.0|
|2|80.0|99.0|120.0|143.0|0.0|
|3|0.0|0.0|0.0|0.0|0.0|
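
The `.sub()` and `.div()` methods follow the same pattern as `.add()` and `.mul()`. The sketch below uses two small made-up frames of different shapes; the fill values 0 and 1 are assumptions chosen so that the missing operand leaves the other value unchanged.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(1, 5).reshape(2, 2))    # 2x2 frame holding 1..4
b = pd.DataFrame(np.arange(1, 10).reshape(3, 3))   # 3x3 frame holding 1..9

diff = b.sub(a, fill_value=0)    # cells missing from `a` act as 0
ratio = b.div(a, fill_value=1)   # cells missing from `a` act as 1

print(diff)
print(ratio)
```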


In [83]:
c = pd.Series(np.arange(4))

In [84]:
c

0    0
1    1
2    2
3    3
dtype: int32

In [85]:
c - 10

0   -10
1    -9
2    -8
3    -7
dtype: int32

In [88]:
b - c  # operations across dimensions are broadcast; a 1-D Series takes part along axis 1 by default

| |0|1|2|3|4|
|:---:|:---|:---|:---|:---|:---|
|0|0.0|0.0|0.0|0.0|NaN|
|1|5.0|5.0|5.0|5.0|NaN|
|2|10.0|10.0|10.0|10.0|NaN|
|3|15.0|15.0|15.0|15.0|NaN|


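> Operations between objects of different dimensions are broadcast; a one-dimensional Series takes part along axis 1 by default, meaning its index is matched against the column labels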

In [87]:
b.sub(c, axis=0)

| |0|1|2|3|4|
|:---:|:---|:---|:---|:---|:---|
|0|0|1|2|3|4|
|1|4|5|6|7|8|
|2|8|9|10|11|12|
|3|12|13|14|15|16|


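- Comparison operations only compare elements with the same index; no alignment padding is performed
- Operations between 2-D and 1-D objects, or between 1-D and 0-D objects, are broadcast
- Binary operations written with symbols such as `> < >= <= == !=` produce Boolean objects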
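
A short sketch of these comparison rules, using made-up frames (`a`, `b`, `c` and the result names are illustrative): identically labeled frames compare element by element, a scalar is broadcast, and frames with different labels are rejected rather than padded.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))          # values 0..11
b = pd.DataFrame(np.arange(12, 0, -1).reshape(3, 4))   # values 12..1

same_shape = a > b    # element-wise; both frames carry identical labels
with_scalar = a > 5   # the scalar is broadcast to every element

c = pd.DataFrame(np.arange(20).reshape(4, 5))
mismatch_error = None
try:
    a > c             # labels differ, so pandas refuses instead of padding
except ValueError as err:
    mismatch_error = str(err)

print(same_shape)
print(with_scalar)
print(mismatch_error)
```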