# The Pandas Library
![](media/pandas.png)
## Introduction

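NumPy excels at vectorized numerical computation, but it falls short on more flexible and complex data tasks, such as labeling data, handling missing values, grouping, and building pivot tables.

**Pandas, built on top of NumPy, provides high-level data structures and tools that make data analysis faster and simpler.**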

## 1. Creating Objects

### 1.1 The Pandas Series Object

A Series is a one-dimensional array of labeled data.

#### 1.1.1 Creating a Series

General form:
```
pd.Series(data, index=index, dtype=dtype)
```  
- `data`: the data; can be a list, a dictionary, or a NumPy array
- `index`: the index labels; optional
- `dtype`: the data type; optional

**1. From a list**

* If `index` is omitted, it defaults to an integer sequence

In [1]:
import pandas as pd

data = pd.Series([1.5, 3, 4.5, 6])
data

0    1.5
1    3.0
2    4.5
3    6.0
dtype: float64

* Specifying an `index`

In [2]:
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data

a    1.5
b    3.0
c    4.5
d    6.0
dtype: float64

* Specifying a data type

  If omitted, the type is inferred from the data passed in.

In [3]:
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])    
data

a    1
b    2
c    3
d    4
dtype: int64

In [4]:
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data

a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64

**Note: a Series can hold values of mixed types**

In [5]:
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data

a    1
b    2
c    3
d    4
dtype: object

In [6]:
data["a"]

1

In [7]:
data["c"]

'3'

**The data type can be forced**

In [8]:
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data

a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64

In [9]:
data["c"]

3.0

In [None]:
# Raises ValueError: the string "a" cannot be converted to float
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data
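
The last cell above fails because `"a"` cannot be parsed as a number. The following is a minimal sketch of one way to handle such data, assuming it is acceptable to turn unparseable entries into `NaN`. It uses `pd.to_numeric` with `errors="coerce"`; the names `raw`, `failed`, and `cleaned` are illustrative only.

```python
import pandas as pd

raw = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"])

# Forcing dtype=float on data that cannot be parsed raises ValueError.
try:
    pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
    failed = False
except ValueError:
    failed = True

# errors="coerce" replaces entries that cannot be parsed with NaN.
cleaned = pd.to_numeric(raw, errors="coerce")
print(failed)
print(cleaned)
```

Whether silently coercing to `NaN` is appropriate depends on the data; raising the error is often the safer default.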

**2. From a one-dimensional NumPy array**

In [10]:
import numpy as np

x = np.arange(5)
pd.Series(x)

0    0
1    1
2    2
3    3
4    4
dtype: int32

**3. From a dictionary**

* By default, the keys become the `index` and the values become the `data`

In [11]:
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict)    
population

BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
dtype: int64

* When an `index` is also given, its labels are looked up among the dictionary keys; labels that are not found get the value NaN

In [12]:
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])    
population

BeiJing     2154.0
HangZhou     981.0
c              NaN
d              NaN
dtype: float64
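
Note that the result above is `float64` even though the dictionary values are integers: `NaN` is a floating-point value, so the integers are upcast. A small sketch of this effect, with a shortened dictionary and the hypothetical name `s`:

```python
import pandas as pd

population_dict = {"BeiJing": 2154, "ShangHai": 2424}
s = pd.Series(population_dict, index=["BeiJing", "c"])

# NaN is a float, so the integer values are upcast to float64.
print(s.dtype)

# Assumption: labels without data should simply be dropped.
restored = s.dropna().astype("int64")
print(restored)
```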

**4. When `data` is a scalar**

In [13]:
pd.Series(5, index=[100, 200, 300])  # the scalar is repeated for every index label

100    5
200    5
300    5
dtype: int64

## 2. The Pandas DataFrame Object

A DataFrame is a two-dimensional array of labeled data, with labels on both rows and columns.

### 2.1 Creating a DataFrame

General form:
```
pd.DataFrame(data, index=index, columns=columns)
```  
- `data`: the data; can be a list, a dictionary, or a NumPy array
- `index`: the row labels; optional
- `columns`: the column labels; optional

**1. From a Series**

In [14]:
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }

population = pd.Series(population_dict)
pd.DataFrame(population)

Unnamed: 0,0
BeiJing,2154
ShangHai,2424
ShenZhen,1303
HangZhou,981


In [15]:
pd.DataFrame(population, columns=["population"])

Unnamed: 0,population
BeiJing,2154
ShangHai,2424
ShenZhen,1303
HangZhou,981


**2. From a dictionary of Series**

In [16]:
GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }

GDP = pd.Series(GDP_dict)
GDP

BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64

In [17]:
pd.DataFrame({"population": population,
              "GDP": GDP})

Unnamed: 0,population,GDP
BeiJing,2154,30320
ShangHai,2424,32680
ShenZhen,1303,24222
HangZhou,981,13468


**Note: a scalar value is automatically repeated to fill every row**

In [18]:
pd.DataFrame({"population": population,
              "GDP": GDP,
              "country": "China"})

Unnamed: 0,population,GDP,country
BeiJing,2154,30320,China
ShangHai,2424,32680,China
ShenZhen,1303,24222,China
HangZhou,981,13468,China


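**3. From a list of dictionaries**

* Each dictionary's position in the list becomes its `index` label, and the dictionary keys become the `columns`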

In [19]:
import numpy as np
import pandas as pd

data = [{"a": i, "b": 2*i} for i in range(3)]
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [20]:
data = pd.DataFrame(data)
data

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [16]:
data1 = data["a"].copy()
data1

0    0
1    1
2    2
Name: a, dtype: int64

In [17]:
data1[0] = 10
data1

0    10
1     1
2     2
Name: a, dtype: int64

In [18]:
data

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


* Keys missing from a dictionary are filled with NaN

In [21]:
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data

[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]

In [22]:
pd.DataFrame(data)

Unnamed: 0,a,b,c
0,1.0,1,
1,,3,4.0


**4. From a two-dimensional NumPy array**

In [23]:
data = np.random.randint(10, size=(3, 2))
data

array([[0, 8],
       [9, 7],
       [3, 7]])

In [24]:
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])

Unnamed: 0,foo,bar
a,0,8
b,9,7
c,3,7


### 2.2 DataFrame Properties

#### 2.2.1 Attributes

In [25]:
data = pd.DataFrame({"pop": population, "GDP": GDP})
data

Unnamed: 0,pop,GDP
BeiJing,2154,30320
ShangHai,2424,32680
ShenZhen,1303,24222
HangZhou,981,13468


**(1) `df.values`: the data as a NumPy array**

In [26]:
data.values

array([[ 2154, 30320],
       [ 2424, 32680],
       [ 1303, 24222],
       [  981, 13468]], dtype=int64)

**(2) `df.index`: the row index**

In [27]:
data.index

Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')

**(3) `df.columns`: the column index**

In [83]:
data.columns

Index(['pop', 'GDP'], dtype='object')

**(4) `df.shape`: the shape**

In [84]:
data.shape

(4, 2)

**(5) `df.size`: the total number of elements**

In [85]:
data.size

8

**(6) `df.dtypes`: the data type of each column**

In [86]:
data.dtypes

pop    int64
GDP    int64
dtype: object

#### 2.2.2 Indexing

In [28]:
data

Unnamed: 0,pop,GDP
BeiJing,2154,30320
ShangHai,2424,32680
ShenZhen,1303,24222
HangZhou,981,13468


**(1) Selecting columns**

* Dictionary style

In [29]:
data["pop"]

BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
Name: pop, dtype: int64

In [30]:
data[["GDP", "pop"]]

Unnamed: 0,GDP,pop
BeiJing,30320,2154
ShangHai,32680,2424
ShenZhen,24222,1303
HangZhou,13468,981


* Attribute style

In [31]:
data.GDP

BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
Name: GDP, dtype: int64

**(2) Selecting rows**

* Label-based indexing with `df.loc`

In [32]:
data.loc["BeiJing"]

pop     2154
GDP    30320
Name: BeiJing, dtype: int64

In [33]:
data.loc[["BeiJing", "HangZhou"]]

Unnamed: 0,pop,GDP
BeiJing,2154,30320
HangZhou,981,13468


* Position-based indexing with `df.iloc`

In [34]:
data

Unnamed: 0,pop,GDP
BeiJing,2154,30320
ShangHai,2424,32680
ShenZhen,1303,24222
HangZhou,981,13468


In [35]:
data.iloc[0]

pop     2154
GDP    30320
Name: BeiJing, dtype: int64

In [36]:
data.iloc[[1, 3]]

Unnamed: 0,pop,GDP
ShangHai,2424,32680
HangZhou,981,13468


**(3) Selecting a scalar**

In [37]:
data

Unnamed: 0,pop,GDP
BeiJing,2154,30320
ShangHai,2424,32680
ShenZhen,1303,24222
HangZhou,981,13468


In [38]:
data.loc["BeiJing", "GDP"]

30320

In [39]:
data.iloc[0, 1]

30320

In [40]:
data.values[0][1]

30320
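
For single values, pandas also offers the dedicated scalar accessors `.at` (by label) and `.iat` (by position). A brief sketch on a reduced copy of the table above:

```python
import pandas as pd

data = pd.DataFrame({"pop": [2154, 2424], "GDP": [30320, 32680]},
                    index=["BeiJing", "ShangHai"])

by_label = data.at["BeiJing", "GDP"]    # label-based scalar access
by_position = data.iat[0, 1]            # position-based scalar access
print(by_label, by_position)
```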

**(4) Indexing a Series**

In [41]:
type(data.GDP)

pandas.core.series.Series

In [42]:
GDP

BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64

In [43]:
GDP["BeiJing"]

30320

#### 2.2.3 Slicing

In [44]:
dates = pd.date_range(start='2019-01-01', periods=6)
dates

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')

In [45]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df

Unnamed: 0,A,B,C,D
2019-01-01,-0.92171,-0.172802,0.4112,1.807156
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503
2019-01-04,0.251244,-1.147055,-0.384193,0.416921
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147


**(1) Row slices**

In [46]:
df["2019-01-01": "2019-01-03"]

Unnamed: 0,A,B,C,D
2019-01-01,-0.92171,-0.172802,0.4112,1.807156
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503


In [47]:
df.loc["2019-01-01": "2019-01-03"]

Unnamed: 0,A,B,C,D
2019-01-01,-0.92171,-0.172802,0.4112,1.807156
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503


In [48]:
df.iloc[0: 3]

Unnamed: 0,A,B,C,D
2019-01-01,-0.92171,-0.172802,0.4112,1.807156
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503
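
The three cells above return the same rows, but the two slicing conventions differ: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position, as in ordinary Python slicing. A small sketch, using a stand-in `df` filled with `np.arange` values instead of the random data above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2019-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["2019-01-01":"2019-01-03"]   # end label is included
by_position = df.iloc[0:3]                      # end position is excluded
print(len(by_label), len(by_position))
```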


**(2) Column slices**

In [49]:
df

Unnamed: 0,A,B,C,D
2019-01-01,-0.92171,-0.172802,0.4112,1.807156
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503
2019-01-04,0.251244,-1.147055,-0.384193,0.416921
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147


In [50]:
df.loc[:, "A": "C"]

Unnamed: 0,A,B,C
2019-01-01,-0.92171,-0.172802,0.4112
2019-01-02,0.745224,-0.018015,-1.87469
2019-01-03,-1.323749,-0.234723,-1.768072
2019-01-04,0.251244,-1.147055,-0.384193
2019-01-05,0.80336,-0.486047,-0.808534
2019-01-06,-0.425767,1.956439,-0.145754


In [51]:
df.iloc[:, 0: 3]

Unnamed: 0,A,B,C
2019-01-01,-0.92171,-0.172802,0.4112
2019-01-02,0.745224,-0.018015,-1.87469
2019-01-03,-1.323749,-0.234723,-1.768072
2019-01-04,0.251244,-1.147055,-0.384193
2019-01-05,0.80336,-0.486047,-0.808534
2019-01-06,-0.425767,1.956439,-0.145754


**(3) Mixed selections**

In [52]:
df

Unnamed: 0,A,B,C,D
2019-01-01,-0.92171,-0.172802,0.4112,1.807156
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503
2019-01-04,0.251244,-1.147055,-0.384193,0.416921
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147


* Slicing both rows and columns

In [53]:
df.loc["2019-01-02": "2019-01-03", "C":"D"]

Unnamed: 0,C,D
2019-01-02,-1.87469,0.373795
2019-01-03,-1.768072,-0.454503


In [54]:
df.iloc[1: 3, 2:]

Unnamed: 0,C,D
2019-01-02,-1.87469,0.373795
2019-01-03,-1.768072,-0.454503


* Row slice with scattered columns

In [55]:
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]

Unnamed: 0,A,C
2019-01-04,0.251244,-0.384193
2019-01-05,0.80336,-0.808534
2019-01-06,-0.425767,-0.145754


In [56]:
df.iloc[3:, [0, 2]]

Unnamed: 0,A,C
2019-01-04,0.251244,-0.384193
2019-01-05,0.80336,-0.808534
2019-01-06,-0.425767,-0.145754


* Scattered rows with a column slice

In [57]:
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]

KeyError: "None of [['2019-01-02', '2019-01-06']] are in the [index]"

In [58]:
df.iloc[[1, 5], 0: 3]

Unnamed: 0,A,B,C
2019-01-02,0.745224,-0.018015,-1.87469
2019-01-06,-0.425767,1.956439,-0.145754


* Scattered rows and scattered columns

In [59]:
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]  # KeyError here: the date strings in the list are not converted to Timestamps

KeyError: KeyError("None of [Index(['2019-01-04', '2019-01-06'], dtype='object')] are in the [index]",)

In [60]:
df.iloc[[1, 5], [0, 3]]

Unnamed: 0,A,D
2019-01-02,0.745224,0.373795
2019-01-06,-0.425767,-0.538147
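
Scattered rows can also be selected by label. The `KeyError` shown above comes from passing plain strings in a list, which the pandas version that produced these outputs does not match against the `DatetimeIndex` (newer versions may accept the strings directly; that is an assumption worth checking). A minimal sketch that converts the labels explicitly with `pd.to_datetime`, again on a stand-in `df` built from `np.arange`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2019-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Convert the strings to Timestamps so they match the DatetimeIndex labels.
rows = pd.to_datetime(["2019-01-04", "2019-01-06"])
picked = df.loc[rows, ["A", "D"]]
print(picked)
```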


#### 2.2.4 Boolean Indexing

In [61]:
df

Unnamed: 0,A,B,C,D
2019-01-01,-0.92171,-0.172802,0.4112,1.807156
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503
2019-01-04,0.251244,-1.147055,-0.384193,0.416921
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147


In [62]:
df > 0

Unnamed: 0,A,B,C,D
2019-01-01,False,False,True,True
2019-01-02,True,False,False,True
2019-01-03,False,False,False,False
2019-01-04,True,False,False,True
2019-01-05,True,False,False,False
2019-01-06,False,True,False,False


In [63]:
df[df > 0]

Unnamed: 0,A,B,C,D
2019-01-01,,,0.4112,1.807156
2019-01-02,0.745224,,,0.373795
2019-01-03,,,,
2019-01-04,0.251244,,,0.416921
2019-01-05,0.80336,,,
2019-01-06,,1.956439,,


In [64]:
df.A > 0

2019-01-01    False
2019-01-02     True
2019-01-03    False
2019-01-04     True
2019-01-05     True
2019-01-06    False
Freq: D, Name: A, dtype: bool

In [65]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-04,0.251244,-1.147055,-0.384193,0.416921
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337
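
Several conditions can be combined into one mask. The sketch below uses a small hand-written frame (not the random data above); each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, 3.0], "B": [1.0, -2.0, 0.3, -0.1]})

# Use & (and), | (or), ~ (not) instead of the Python keywords.
both_positive = df[(df["A"] > 0) & (df["B"] > 0)]
either_negative = df[(df["A"] < 0) | (df["B"] < 0)]
print(both_positive)
print(either_negative)
```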


* The `isin()` method

In [66]:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

Unnamed: 0,A,B,C,D,E
2019-01-01,-0.92171,-0.172802,0.4112,1.807156,one
2019-01-02,0.745224,-0.018015,-1.87469,0.373795,one
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503,two
2019-01-04,0.251244,-1.147055,-0.384193,0.416921,three
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337,four
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147,three


In [67]:
ind = df2["E"].isin(["two", "four"])
ind     

2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
2019-01-06    False
Freq: D, Name: E, dtype: bool

In [140]:
df2[ind]

Unnamed: 0,A,B,C,D,E
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503,two
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337,four


#### 2.2.5 Assignment

In [68]:
df

Unnamed: 0,A,B,C,D
2019-01-01,-0.92171,-0.172802,0.4112,1.807156
2019-01-02,0.745224,-0.018015,-1.87469,0.373795
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503
2019-01-04,0.251244,-1.147055,-0.384193,0.416921
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147


* Adding a new column to a DataFrame

In [69]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1

2019-01-01    1
2019-01-02    2
2019-01-03    3
2019-01-04    4
2019-01-05    5
2019-01-06    6
Freq: D, dtype: int64

In [70]:
df["E"] = s1
df

Unnamed: 0,A,B,C,D,E
2019-01-01,-0.92171,-0.172802,0.4112,1.807156,1
2019-01-02,0.745224,-0.018015,-1.87469,0.373795,2
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503,3
2019-01-04,0.251244,-1.147055,-0.384193,0.416921,4
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337,5
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147,6


* Modifying values

In [71]:
df.loc["2019-01-01", "A"] = 0
df

Unnamed: 0,A,B,C,D,E
2019-01-01,0.0,-0.172802,0.4112,1.807156,1
2019-01-02,0.745224,-0.018015,-1.87469,0.373795,2
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503,3
2019-01-04,0.251244,-1.147055,-0.384193,0.416921,4
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337,5
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147,6


In [72]:
df.iloc[0, 1] = 0
df

Unnamed: 0,A,B,C,D,E
2019-01-01,0.0,0.0,0.4112,1.807156,1
2019-01-02,0.745224,-0.018015,-1.87469,0.373795,2
2019-01-03,-1.323749,-0.234723,-1.768072,-0.454503,3
2019-01-04,0.251244,-1.147055,-0.384193,0.416921,4
2019-01-05,0.80336,-0.486047,-0.808534,-0.037337,5
2019-01-06,-0.425767,1.956439,-0.145754,-0.538147,6


In [73]:
df["D"] = np.array([5]*len(df))   # can be simplified to df["D"] = 5
df

Unnamed: 0,A,B,C,D,E
2019-01-01,0.0,0.0,0.4112,5,1
2019-01-02,0.745224,-0.018015,-1.87469,5,2
2019-01-03,-1.323749,-0.234723,-1.768072,5,3
2019-01-04,0.251244,-1.147055,-0.384193,5,4
2019-01-05,0.80336,-0.486047,-0.808534,5,5
2019-01-06,-0.425767,1.956439,-0.145754,5,6


* Changing the index and columns

In [74]:
df.index = [i for i in range(len(df))]
df

Unnamed: 0,A,B,C,D,E
0,0.0,0.0,0.4112,5,1
1,0.745224,-0.018015,-1.87469,5,2
2,-1.323749,-0.234723,-1.768072,5,3
3,0.251244,-1.147055,-0.384193,5,4
4,0.80336,-0.486047,-0.808534,5,5
5,-0.425767,1.956439,-0.145754,5,6


In [75]:
df.columns = [i for i in range(df.shape[1])]
df

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,0.4112,5,1
1,0.745224,-0.018015,-1.87469,5,2
2,-1.323749,-0.234723,-1.768072,5,3
3,0.251244,-1.147055,-0.384193,5,4
4,0.80336,-0.486047,-0.808534,5,5
5,-0.425767,1.956439,-0.145754,5,6
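
Assigning to `df.index` and `df.columns` replaces every label in place. When only some labels should change, or the original frame should stay untouched, `rename` and `reset_index` are possible alternatives. A sketch on a tiny hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# rename returns a new DataFrame; labels not in the mapping stay unchanged.
renamed = df.rename(columns={"A": "alpha"}, index={"x": "first"})

# reset_index(drop=True) replaces the index with 0..n-1 and discards the old one.
renumbered = df.reset_index(drop=True)
print(renamed)
print(renumbered)
```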


## 3. Numerical Operations and Statistical Analysis

### 3.1 Inspecting Data

In [76]:
import pandas as pd
import numpy as np

dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df

Unnamed: 0,A,B,C,D
2019-01-01,1.21318,-0.222184,-0.684833,-0.639893
2019-01-02,0.429525,0.691497,-0.392205,-0.986018
2019-01-03,0.343462,0.043833,0.328261,1.252595
2019-01-04,-0.119875,2.306442,-0.068138,-0.969746
2019-01-05,-0.308247,1.316928,-0.558191,-0.365397
2019-01-06,0.104123,2.093969,-1.06826,0.075882


**(1) Viewing the first rows**

In [77]:
df.head()    # 5 rows by default

Unnamed: 0,A,B,C,D
2019-01-01,1.21318,-0.222184,-0.684833,-0.639893
2019-01-02,0.429525,0.691497,-0.392205,-0.986018
2019-01-03,0.343462,0.043833,0.328261,1.252595
2019-01-04,-0.119875,2.306442,-0.068138,-0.969746
2019-01-05,-0.308247,1.316928,-0.558191,-0.365397


In [78]:
df.head(2)

Unnamed: 0,A,B,C,D
2019-01-01,1.21318,-0.222184,-0.684833,-0.639893
2019-01-02,0.429525,0.691497,-0.392205,-0.986018


**(2) Viewing the last rows**

In [79]:
df.tail()    # 5 rows by default

Unnamed: 0,A,B,C,D
2019-01-02,0.429525,0.691497,-0.392205,-0.986018
2019-01-03,0.343462,0.043833,0.328261,1.252595
2019-01-04,-0.119875,2.306442,-0.068138,-0.969746
2019-01-05,-0.308247,1.316928,-0.558191,-0.365397
2019-01-06,0.104123,2.093969,-1.06826,0.075882


In [80]:
df.tail(3) 

Unnamed: 0,A,B,C,D
2019-01-04,-0.119875,2.306442,-0.068138,-0.969746
2019-01-05,-0.308247,1.316928,-0.558191,-0.365397
2019-01-06,0.104123,2.093969,-1.06826,0.075882


**(3) Viewing summary information**

In [81]:
df.iloc[0, 3] = np.nan
df

Unnamed: 0,A,B,C,D
2019-01-01,1.21318,-0.222184,-0.684833,
2019-01-02,0.429525,0.691497,-0.392205,-0.986018
2019-01-03,0.343462,0.043833,0.328261,1.252595
2019-01-04,-0.119875,2.306442,-0.068138,-0.969746
2019-01-05,-0.308247,1.316928,-0.558191,-0.365397
2019-01-06,0.104123,2.093969,-1.06826,0.075882


In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
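
`info()` prints its report rather than returning it. To get the missing-value counts as data, one option is `isna().sum()`; the sketch below uses a stand-in frame with a single `NaN` in column `D`, mirroring the cell above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4), columns=list("ABCD"))
df.iloc[0, 3] = np.nan

missing_per_column = df.isna().sum()    # number of NaN entries in each column
non_null_per_column = df.count()        # non-null entries, as reported by info()
print(missing_per_column)
print(non_null_per_column)
```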


### 3.2 NumPy Universal Functions Also Work on Pandas Objects

#### 3.2.1 Vectorized Operations

In [83]:
x = pd.DataFrame(np.arange(4).reshape(1, 4))
x

Unnamed: 0,0,1,2,3
0,0,1,2,3


In [84]:
x+5

Unnamed: 0,0,1,2,3
0,5,6,7,8


In [85]:
np.exp(x)

Unnamed: 0,0,1,2,3
0,1.0,2.718282,7.389056,20.085537


In [86]:
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y

Unnamed: 0,0,1,2,3
0,4,5,6,7


In [87]:
x*y

Unnamed: 0,0,1,2,3
0,0,5,12,21


#### 3.2.2 Matrix Operations

In [102]:
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))

* Transpose

In [103]:
z = x.T

In [104]:
np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))

In [105]:
x.dot(y)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,616,560,723,739,612,457,681,799,575,590,...,523,739,613,580,668,602,733,585,657,700
1,520,438,691,600,612,455,666,764,707,592,...,555,681,503,679,641,506,779,494,633,590
2,557,570,786,807,690,469,804,828,704,573,...,563,675,712,758,793,672,754,550,756,638
3,605,507,664,701,660,496,698,806,651,575,...,582,685,668,586,629,534,678,484,591,626
4,599,681,753,873,721,563,754,770,620,654,...,633,747,661,677,726,649,716,610,735,706
5,422,354,602,627,613,396,617,627,489,423,...,456,572,559,537,499,384,589,436,574,507
6,359,446,599,599,481,357,577,572,451,464,...,449,550,495,532,633,554,663,476,565,602
7,531,520,698,590,607,537,665,696,571,472,...,576,588,551,665,652,527,742,528,650,599
8,449,322,547,533,593,399,584,638,587,424,...,402,596,523,523,447,362,561,386,529,484
9,373,433,525,601,522,345,551,521,434,447,...,508,498,438,478,459,418,488,407,503,496


In [106]:
%timeit x.dot(y)

196 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [107]:
%timeit np.dot(x, y)

63.9 µs ± 4.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


* The same operation in NumPy and in Pandas, compared

In [108]:
x1 = np.array(x)

In [109]:
y1 = np.array(y)

In [110]:
%timeit x1.dot(y1)

20.1 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [111]:
%timeit np.dot(x1, y1)

20.4 µs ± 614 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [112]:
%timeit np.dot(x.values, y.values)

34.9 µs ± 7.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [113]:
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)

In [114]:
def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res          

In [115]:
%timeit f(x3, y3)

3.4 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**In general, pure computation runs faster in NumPy**

NumPy focuses on computation, while Pandas focuses on data handling.
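
The `%timeit` cells above only work inside IPython or Jupyter. A rough sketch of the same comparison as a plain script, using the standard-library `timeit` module, follows. The generator seed and the repeat count of 200 are arbitrary choices, and no timings are quoted because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.DataFrame(rng.integers(10, size=(30, 30)))
y = pd.DataFrame(rng.integers(10, size=(30, 30)))

# Both routes compute the same matrix product.
via_pandas = x.dot(y)
via_numpy = x.to_numpy() @ y.to_numpy()
same = np.array_equal(via_pandas.to_numpy(), via_numpy)

# Total time in seconds for 200 repetitions of each approach.
t_pandas = timeit.timeit(lambda: x.dot(y), number=200)
t_numpy = timeit.timeit(lambda: x.to_numpy() @ y.to_numpy(), number=200)
print(same, t_pandas, t_numpy)
```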

#### 3.2.3 Broadcasting

In [116]:
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x

Unnamed: 0,A,B,C
0,6,3,7
1,4,6,9
2,2,6,7


* Broadcasting across rows

In [117]:
x.iloc[0]

A    6
B    3
C    7
Name: 0, dtype: int32

In [118]:
x/x.iloc[0]

Unnamed: 0,A,B,C
0,1.0,1.0,1.0
1,0.666667,2.0,1.285714
2,0.333333,2.0,1.0


* Broadcasting across columns

In [119]:
x.A

0    6
1    4
2    2
Name: A, dtype: int32

In [120]:
x.div(x.A, axis=0)             # add sub div mul

Unnamed: 0,A,B,C
0,1.0,0.5,1.166667
1,1.0,1.5,2.25
2,1.0,3.0,3.5


In [121]:
x.div(x.iloc[0], axis=1)

Unnamed: 0,A,B,C
0,1.0,1.0,1.0
1,0.666667,2.0,1.285714
2,0.333333,2.0,1.0
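
The `axis` argument decides which labels the Series operand is matched against. The sketch below rebuilds the same 3×3 values by hand and compares the column-wise case with the equivalent explicit NumPy broadcast; the names `by_row`, `by_col`, and `by_col_np` are illustrative.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[6, 3, 7], [4, 6, 9], [2, 6, 7]], columns=list("ABC"))

# Default: the Series is matched against column labels, so each row is divided by row 0.
by_row = x / x.iloc[0]

# axis=0: the Series is matched against the row index, so each column is divided by column A.
by_col = x.div(x["A"], axis=0)

# The equivalent NumPy broadcast needs an explicit column vector of shape (3, 1).
by_col_np = x.to_numpy() / x["A"].to_numpy()[:, np.newaxis]
print(by_row)
print(by_col)
```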


#### 3.2.4 New Features

**(1) Index alignment**

In [122]:
A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A

Unnamed: 0,A,B
0,3,7
1,2,1


In [123]:
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B

Unnamed: 0,A,B,C
0,7,5,1
1,4,0,9
2,5,8,0


* Pandas automatically aligns the indexes of the two objects; positions missing from either one are filled with `np.nan`

In [124]:
A+B

Unnamed: 0,A,B,C
0,10.0,12.0,
1,6.0,1.0,
2,,,


* Missing values can instead be filled using `fill_value`

In [125]:
A.add(B, fill_value=0)

Unnamed: 0,A,B,C
0,10.0,12.0,1.0
1,6.0,1.0,9.0
2,5.0,8.0,0.0


In [126]:
A*B

Unnamed: 0,A,B,C
0,21.0,35.0,
1,8.0,0.0,
2,,,


**(2) Statistics**

* Counting distinct values

In [127]:
y = np.random.randint(3, size=20)
y

array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])

In [128]:
np.unique(y)

array([0, 1, 2])

In [129]:
from collections import Counter
Counter(y)

Counter({2: 11, 1: 5, 0: 4})

In [58]:
y1 = pd.DataFrame(y, columns=["A"])
y1

Unnamed: 0,A
0,2
1,2
2,2
3,1
4,2
5,1
6,1
7,2
8,1
9,2


In [59]:
np.unique(y1)

array([0, 1, 2])

In [60]:
y1["A"].value_counts()

2    11
1     5
0     4
Name: A, dtype: int64
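
`value_counts` has a few options worth knowing. This sketch reuses the 20 values shown above and computes relative frequencies with `normalize=True`, plus a version ordered by value rather than by frequency:

```python
import pandas as pd

y = pd.Series([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])

counts = y.value_counts()                   # sorted by frequency, descending
shares = y.value_counts(normalize=True)     # relative frequencies instead of counts
by_value = counts.sort_index()              # sorted by the value itself
print(shares)
print(by_value)
```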

* Deriving a new column and sorting by it

In [61]:
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict) 

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }
GDP = pd.Series(GDP_dict)

city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info

Unnamed: 0,population,GDP
BeiJing,2154,30320
ShangHai,2424,32680
ShenZhen,1303,24222
HangZhou,981,13468


In [62]:
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info

Unnamed: 0,population,GDP,per_GDP
BeiJing,2154,30320,14.076137
ShangHai,2424,32680,13.481848
ShenZhen,1303,24222,18.589409
HangZhou,981,13468,13.728848


Ascending sort

In [63]:
city_info.sort_values(by="per_GDP")

Unnamed: 0,population,GDP,per_GDP
ShangHai,2424,32680,13.481848
HangZhou,981,13468,13.728848
BeiJing,2154,30320,14.076137
ShenZhen,1303,24222,18.589409


Descending sort

In [64]:
city_info.sort_values(by="per_GDP", ascending=False)

Unnamed: 0,population,GDP,per_GDP
ShenZhen,1303,24222,18.589409
BeiJing,2154,30320,14.076137
HangZhou,981,13468,13.728848
ShangHai,2424,32680,13.481848


**Sorting along an axis**

In [65]:
data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data

Unnamed: 0,C,B,A,D
2,3,13,17,8
1,1,19,14,6
0,11,7,14,2


Sort by row labels

In [66]:
data.sort_index()

Unnamed: 0,C,B,A,D
0,11,7,14,2
1,1,19,14,6
2,3,13,17,8


Sort by column labels

In [68]:
data.sort_index(axis=1)

Unnamed: 0,A,B,C,D
2,17,13,3,8
1,14,19,1,6
0,14,7,11,2


In [69]:
data.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2,8,3,13,17
1,6,1,19,14
0,2,11,7,14


* Statistical methods

In [70]:
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
0,1.082198,3.557396,-3.060476,6.367969
1,13.113252,6.774559,2.874553,5.527044
2,-2.036341,-4.333177,5.094802,-0.152567
3,-3.386712,-1.522365,-2.522209,2.537716
4,4.328491,5.550994,5.577329,5.019991
5,1.171336,-0.49391,-4.032613,6.398588


Count of non-null values

In [71]:
df.count()

A    6
B    6
C    6
D    6
dtype: int64

Sum

In [72]:
df.sum()

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

In [73]:
df.sum(axis=1)

0     7.947086
1    28.289408
2    -1.427283
3    -4.893571
4    20.476806
5     3.043402
dtype: float64

Maximum and minimum

In [74]:
df.min()

A   -3.386712
B   -4.333177
C   -4.032613
D   -0.152567
dtype: float64

In [75]:
df.max(axis=1)

0     6.367969
1    13.113252
2     5.094802
3     2.537716
4     5.577329
5     6.398588
dtype: float64

In [77]:
df

Unnamed: 0,A,B,C,D
0,1.082198,3.557396,-3.060476,6.367969
1,13.113252,6.774559,2.874553,5.527044
2,-2.036341,-4.333177,5.094802,-0.152567
3,-3.386712,-1.522365,-2.522209,2.537716
4,4.328491,5.550994,5.577329,5.019991
5,1.171336,-0.49391,-4.032613,6.398588


In [76]:
df.idxmax()

A    1
B    1
C    4
D    5
dtype: int64

Mean

In [78]:
df.mean()

A    2.378704
B    1.588916
C    0.655231
D    4.283124
dtype: float64

Variance

In [79]:
df.var()

A    34.980702
B    19.110656
C    18.948144
D     6.726776
dtype: float64

Standard deviation

In [80]:
df.std()

A    5.914449
B    4.371574
C    4.352947
D    2.593603
dtype: float64

Median

In [81]:
df.median()

A    1.126767
B    1.531743
C    0.176172
D    5.273518
dtype: float64

Mode

In [82]:
data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data

Unnamed: 0,A,B
0,4,2
1,3,2
2,2,0
3,2,4
4,2,0
5,4,1
6,2,0
7,1,1
8,3,4
9,2,0


In [83]:
data.mode()

Unnamed: 0,A,B
0,2,0


75% quantile

In [84]:
df.quantile(0.75)

A    3.539202
B    5.052594
C    4.539740
D    6.157738
Name: 0.75, dtype: float64

Everything at once

In [85]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,2.378704,1.588916,0.655231,4.283124
std,5.914449,4.371574,4.352947,2.593603
min,-3.386712,-4.333177,-4.032613,-0.152567
25%,-1.256706,-1.265251,-2.92591,3.158284
50%,1.126767,1.531743,0.176172,5.273518
75%,3.539202,5.052594,4.53974,6.157738
max,13.113252,6.774559,5.577329,6.398588


In [86]:
data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
data_2

Unnamed: 0,A,B,C,D
0,a,a,c,d
1,c,a,c,b
2,a,a,d,c


In [87]:
data_2.describe()

Unnamed: 0,A,B,C,D
count,3,3,3,3
unique,2,1,2,3
top,a,a,c,d
freq,2,3,2,1


Correlation coefficients and covariance

In [88]:
df.corr()

Unnamed: 0,A,B,C,D
A,1.0,0.831063,0.33106,0.510821
B,0.831063,1.0,0.179244,0.719112
C,0.33106,0.179244,1.0,-0.450365
D,0.510821,0.719112,-0.450365,1.0


In [89]:
df.corrwith(df["A"])

A    1.000000
B    0.831063
C    0.331060
D    0.510821
dtype: float64
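
The cells above show only the correlation side. Covariance is available through `DataFrame.cov()`, and the two are related: correlation is covariance rescaled by the two standard deviations. A sketch that checks this relation on freshly generated data (the seed and the name `rebuilt` are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(2, 4, size=(6, 4)), columns=list("ABCD"))

cov = df.cov()      # pairwise sample covariance (normalized by N - 1)
corr = df.corr()    # pairwise Pearson correlation

# Dividing each covariance by the product of the two standard deviations gives the correlation.
std = df.std()
rebuilt = cov / np.outer(std, std)
print(np.allclose(rebuilt, corr))
```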

Custom output

Using `apply(method)`: by default, `method` is applied to each column in turn.

In [90]:
df

Unnamed: 0,A,B,C,D
0,1.082198,3.557396,-3.060476,6.367969
1,13.113252,6.774559,2.874553,5.527044
2,-2.036341,-4.333177,5.094802,-0.152567
3,-3.386712,-1.522365,-2.522209,2.537716
4,4.328491,5.550994,5.577329,5.019991
5,1.171336,-0.49391,-4.032613,6.398588


In [91]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D
0,1.082198,3.557396,-3.060476,6.367969
1,14.19545,10.331955,-0.185923,11.895013
2,12.159109,5.998778,4.908878,11.742447
3,8.772397,4.476413,2.386669,14.280162
4,13.100888,10.027406,7.963999,19.300153
5,14.272224,9.533497,3.931385,25.698741


In [94]:
df.apply(np.cumsum, axis=1)

Unnamed: 0,A,B,C,D
0,1.082198,4.639594,1.579117,7.947086
1,13.113252,19.887811,22.762364,28.289408
2,-2.036341,-6.369518,-1.274717,-1.427283
3,-3.386712,-4.909077,-7.431287,-4.893571
4,4.328491,9.879485,15.456814,20.476806
5,1.171336,0.677427,-3.355186,3.043402


In [95]:
df.apply(sum)

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

In [96]:
df.sum()

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

In [97]:
df.apply(lambda x: x.max()-x.min())

A    16.499965
B    11.107736
C     9.609942
D     6.551155
dtype: float64

In [98]:
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)

Unnamed: 0,A,B,C,D
Count,6.0,6.0,6.0,6.0
mean,2.378704,1.588916,0.655231,4.283124
max,13.113252,6.774559,5.577329,6.398588
idxmin,3.0,2.0,5.0,2.0
std,5.914449,4.371574,4.352947,2.593603
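
Two related variations, sketched on a small hand-written frame: passing `axis=1` to `apply` hands each row to the function instead of each column, and `agg` with a list of reduction names is a possible shortcut when the custom summary consists only of built-in reductions.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 4.0, 2.0], "B": [3.0, 0.0, 5.0]})

# axis=1 passes each row (as a Series) to the function instead of each column.
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

# agg applies several named reductions to every column at once.
summary = df.agg(["count", "mean", "max", "std"])
print(row_range)
print(summary)
```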
