# Pandas 
[Chinese documentation](https://www.pypandas.cn/)
[English documentation](https://pandas.pydata.org/)

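## Pandas overview

Pandas is a core Python library for data analysis. It provides fast, flexible, and expressive data structures designed to make working with relational or labeled data easy and intuitive. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python, and its broader goal is to become the most powerful and flexible open source data analysis tool available in any language. It is already well on its way toward that goal.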

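Pandas is well suited to the following kinds of data:

- Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet;
- Ordered and unordered (not necessarily fixed-frequency) time series data;
- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels;
- Any other form of observational or statistical data set. The data need not be labeled at all to be placed into a Pandas data structure.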

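The two primary data structures of Pandas, [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) (1-dimensional) and [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame provides and much more. Pandas is built on top of [NumPy](https://www.numpy.org/) and is intended to integrate well with other third-party scientific computing libraries.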

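Pandas is something of a Swiss Army knife. The list below covers only some of its strengths:

- Easy handling of **missing data** (represented as NaN) in floating point as well as non-floating point data;
- Size mutability: columns can be **inserted and deleted** from DataFrame and higher dimensional objects;
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series and DataFrame align the data automatically in computations;
- Powerful, flexible **group by** functionality to perform **split-apply-combine** operations on data sets, for both **aggregating and transforming** data;
- Easy conversion of ragged, differently indexed data in other Python and NumPy data structures into DataFrame objects;
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets;
- Intuitive **merging and joining** of data sets;
- Flexible **reshaping and pivoting** of data sets;
- Hierarchical labeling of **axes**: multiple labels per tick are possible;
- Robust IO tools for loading data from **flat files** (CSV and other delimited files), Excel files, and databases, and for saving / loading data in the ultrafast HDF5 format;
- **Time series** functionality: date range generation, frequency conversion, moving window statistics, moving window linear regressions, date shifting, and so on.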
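
The split-apply-combine idea from the list above can be sketched in a few lines. This is only an illustration, not part of the original tutorial: the `sales` frame and its `region` and `amount` columns are invented for the example.

```python
import pandas as pd

# Hypothetical sales records; the column names are made up for this sketch.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# Split the rows by region, apply a sum to each group, combine into one Series.
totals = sales.groupby("region")["amount"].sum()

# A transform keeps the original shape: each row receives its group's mean.
group_mean = sales.groupby("region")["amount"].transform("mean")

print(totals)
print(group_mean)
```

Aggregation collapses each group to one value, while a transform returns a result aligned with the original rows.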

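These features mainly address pain points found in other languages and scientific research environments. Data work typically falls into several stages: munging and cleaning data, analyzing and modeling it, then organizing the results for plotting or tabular display. Pandas is well suited to all of these stages.

Other notes:

- Pandas is fast. Many of its low-level algorithms have been tuned in Cython. However, generality always costs some performance, so a tool that focuses on a single feature can be made faster than Pandas.
- Pandas is a dependency of statsmodels, which makes it an important part of the statistical computing ecosystem in Python.
- Pandas is used extensively in financial applications.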

## Data structures

| Dimensions | Name | Description |
|------------|------|-------------|
| 1 | Series | 1D labeled, homogeneously typed array |
| 2 | DataFrame | 2D labeled, size-mutable table with potentially heterogeneously typed columns |

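### Why more than one data structure?

Pandas data structures act as containers for lower-dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. This makes it possible to insert and remove objects from these containers in a dictionary-like fashion.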
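
A minimal sketch of the dictionary-like behavior described above, using a small made-up frame (the names `frame`, `a`, and `b` are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3]})

# Insert a column the way you would add a key to a dict.
frame["b"] = frame["a"] * 10

# Looking up a key returns the contained Series.
column = frame["b"]

# Remove a column the way you would delete a dict key.
del frame["a"]

print(list(frame.columns))
```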

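In addition, the common API functions should have sensible defaults that take into account the typical orientation of time series and cross-sectional data sets. When multidimensional arrays store 2- or 3-dimensional data, the user has to keep track of the orientation of the data set when writing functions. The axes are treated as more or less equivalent, except where C- or Fortran-contiguity matters for performance. In Pandas, the axes are meant to give the data more semantic meaning: for a given data set there is usually a "right" way to orient it. The aim is to reduce the mental effort needed to write data transformation functions.

For tabular data such as a DataFrame, it is more helpful to think of the index (the rows) and the columns than of axis 0 and axis 1. Iterating over the columns of a DataFrame this way makes the code easier to read: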
```python
for col in df.columns:
    series = df[col]
    # do something with series
```

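## Mutability and copying of data

All Pandas data structures are value-mutable (the values they contain can be altered), but not all of them are size-mutable. The length of a Series cannot be changed, for example, whereas columns can be inserted into a DataFrame.

The vast majority of methods in Pandas do not modify the original input; they copy the data and produce a new object. In general, leaving the original input **unchanged** is the safer choice.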
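
A short sketch of that behavior, assuming a small made-up frame: `sort_values` returns a new object and leaves its input as it was.

```python
import pandas as pd

original = pd.DataFrame({"x": [3, 1, 2]})

# sort_values builds and returns a new DataFrame.
sorted_copy = original.sort_values(by="x")

print(original["x"].tolist())     # the input keeps its original order
print(sorted_copy["x"].tolist())  # the result is sorted
```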

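## 10 minutes to Pandas

This section is a short introduction for new Pandas users. More practical recipes can be found in the [Cookbook](https://www.pypandas.cn/docs/user_guide/cookbook.html).

This section imports Pandas and NumPy as follows: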

In [None]:
import numpy as np
import pandas as pd

### Object creation

See the [data structure intro](https://www.pypandas.cn/docs/getting_started/dsintro.html#dsintro) documentation.

When a [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) is created from a list of values, Pandas generates a default integer index:

In [4]:
s = pd.Series([1,2,3,4,5,np.nan, 6,8])
s

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
6    6.0
7    8.0
dtype: float64

Creating a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) from a NumPy array, with a datetime index and labeled columns:

In [20]:
df1 = pd.date_range(start='20150101', periods=6)
df1

DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06'],
              dtype='datetime64[ns]', freq='D')

In [22]:
df2 = pd.DataFrame(np.random.randn(6,4), index=df1, columns=list('ABCD'))
df2

Unnamed: 0,A,B,C,D
2015-01-01,0.42505,0.174167,-0.086784,0.437715
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002
2015-01-03,0.797813,-1.397892,1.773624,-0.569873
2015-01-04,-0.832057,-0.107309,0.539165,0.166025
2015-01-05,0.146565,-0.920169,-1.132986,-0.084363
2015-01-06,0.057463,-0.211505,0.399993,-0.970626


Creating a DataFrame from a dict of objects that can be converted to Series:

In [25]:
df3 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20150101'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3]* 4, dtype='int32'),
    'E': pd.Categorical(['test', 'train','test', 'train']),
    'F':'foo'
})
df3

Unnamed: 0,A,B,C,D,E,F
0,1.0,2015-01-01,1.0,3,test,foo
1,1.0,2015-01-01,1.0,3,train,foo
2,1.0,2015-01-01,1.0,3,test,foo
3,1.0,2015-01-01,1.0,3,train,foo


The columns of the resulting DataFrame have different [dtypes](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes).

In [26]:
df3.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

IPython supports tab completion for column names as well as public attributes. Here is a subset of the attributes that will be completed:

In [30]:
df2.<TAB>
# df2.A                  df2.bool
# df2.abs                df2.boxplot
# df2.add                df2.C
# df2.add_prefix         df2.clip
# df2.add_suffix         df2.clip_lower
# df2.align              df2.clip_upper
# df2.all                df2.columns
# df2.any                df2.combine
# df2.append             df2.combine_first
# df2.apply              df2.compound
# df2.applymap           df2.consolidate
# df2.D
# Columns A, B, C, and D are tab completed as well; only some attributes are shown here for brevity.

### Viewing data

See the [basics](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics) documentation.

The following code shows how to view the top and bottom rows of a DataFrame:

In [41]:
df2.head()

Unnamed: 0,A,B,C,D
2015-01-01,0.42505,0.174167,-0.086784,0.437715
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002
2015-01-03,0.797813,-1.397892,1.773624,-0.569873
2015-01-04,-0.832057,-0.107309,0.539165,0.166025
2015-01-05,0.146565,-0.920169,-1.132986,-0.084363


In [40]:
df2.tail(3)

Unnamed: 0,A,B,C,D
2015-01-04,-0.832057,-0.107309,0.539165,0.166025
2015-01-05,0.146565,-0.920169,-1.132986,-0.084363
2015-01-06,0.057463,-0.211505,0.399993,-0.970626


Display the index and the column names:

In [43]:
df2.index

DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06'],
              dtype='datetime64[ns]', freq='D')

In [44]:
df2.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

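[DataFrame.to_numpy()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy) gives a NumPy representation of the underlying data. Note that this can be an expensive operation when the DataFrame has columns with different data types, which comes down to a fundamental difference between Pandas and NumPy: **a NumPy array has one dtype for the entire array, while a DataFrame has one dtype per column**. When you call DataFrame.to_numpy(), Pandas looks for the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being `object`, which requires casting every value to a Python object.

For `df2` below, a DataFrame of all floating-point values, DataFrame.to_numpy() is fast and does not need to copy the data.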

In [45]:
df2.to_numpy()

array([[ 4.25049707e-01,  1.74166667e-01, -8.67841986e-02,
         4.37715329e-01],
       [ 8.44889963e-04, -8.95524331e-01, -1.75461575e-01,
        -1.45002426e-01],
       [ 7.97812759e-01, -1.39789163e+00,  1.77362401e+00,
        -5.69872973e-01],
       [-8.32057284e-01, -1.07309263e-01,  5.39165238e-01,
         1.66024968e-01],
       [ 1.46565394e-01, -9.20169073e-01, -1.13298619e+00,
        -8.43633483e-02],
       [ 5.74625068e-02, -2.11505449e-01,  3.99992930e-01,
        -9.70626263e-01]])

For `df3`, a DataFrame with several dtypes, DataFrame.to_numpy() is relatively expensive.

In [47]:
df3.to_numpy()

array([[1.0, Timestamp('2015-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2015-01-01 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2015-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2015-01-01 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

**Note**:
DataFrame.to_numpy() does not include the index or the column labels in its output.
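
The dtype behavior described in this section can be checked directly. The two small frames below are invented for this sketch and are separate from `df2` and `df3`:

```python
import numpy as np
import pandas as pd

# All columns share one dtype, so the array keeps that dtype.
floats = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Mixed column dtypes have no common numeric dtype.
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

float_array = floats.to_numpy()
mixed_array = mixed.to_numpy()

print(float_array.dtype)   # a floating-point dtype
print(mixed_array.dtype)   # falls back to object
print(float_array.shape)   # values only, no index or column labels
```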

[describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe) shows a quick statistical summary of the data:

In [48]:
df2.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.09928,-0.559706,0.219592,-0.194354
std,0.543164,0.601621,0.962194,0.507451
min,-0.832057,-1.397892,-1.132986,-0.970626
25%,0.014999,-0.914008,-0.153292,-0.463655
50%,0.102014,-0.553515,0.156604,-0.114683
75%,0.355429,-0.133358,0.504372,0.103428
max,0.797813,0.174167,1.773624,0.437715


Transposing the data:

In [57]:
print(df2)
print("\n-------------------After transposing-------------------------")
df2.T

                   A         B         C         D
2015-01-01  0.425050  0.174167 -0.086784  0.437715
2015-01-02  0.000845 -0.895524 -0.175462 -0.145002
2015-01-03  0.797813 -1.397892  1.773624 -0.569873
2015-01-04 -0.832057 -0.107309  0.539165  0.166025
2015-01-05  0.146565 -0.920169 -1.132986 -0.084363
2015-01-06  0.057463 -0.211505  0.399993 -0.970626

-------------------After transposing-------------------------


Unnamed: 0,2015-01-01,2015-01-02,2015-01-03,2015-01-04,2015-01-05,2015-01-06
A,0.42505,0.000845,0.797813,-0.832057,0.146565,0.057463
B,0.174167,-0.895524,-1.397892,-0.107309,-0.920169,-0.211505
C,-0.086784,-0.175462,1.773624,0.539165,-1.132986,0.399993
D,0.437715,-0.145002,-0.569873,0.166025,-0.084363,-0.970626


Sorting by an axis:

In [56]:
df2.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2015-01-01,0.437715,-0.086784,0.174167,0.42505
2015-01-02,-0.145002,-0.175462,-0.895524,0.000845
2015-01-03,-0.569873,1.773624,-1.397892,0.797813
2015-01-04,0.166025,0.539165,-0.107309,-0.832057
2015-01-05,-0.084363,-1.132986,-0.920169,0.146565
2015-01-06,-0.970626,0.399993,-0.211505,0.057463


Sorting by values:

In [60]:
df2.sort_values(by='B')

Unnamed: 0,A,B,C,D
2015-01-03,0.797813,-1.397892,1.773624,-0.569873
2015-01-05,0.146565,-0.920169,-1.132986,-0.084363
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002
2015-01-06,0.057463,-0.211505,0.399993,-0.970626
2015-01-04,-0.832057,-0.107309,0.539165,0.166025
2015-01-01,0.42505,0.174167,-0.086784,0.437715


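### Selection

**Note**:

Standard Python / NumPy expressions for selecting and setting are intuitive and convenient for interactive work, but for production code we recommend the optimized Pandas data access methods: `.at`, `.iat`, `.loc`, and `.iloc`.

See the [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing) and [MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced) documentation.

#### Getting

Selecting a single column, which yields a `Series`, equivalent to `df2.A`: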

In [62]:
df2['A']

2015-01-01    0.425050
2015-01-02    0.000845
2015-01-03    0.797813
2015-01-04   -0.832057
2015-01-05    0.146565
2015-01-06    0.057463
Freq: D, Name: A, dtype: float64

Selecting via `[]`, which slices the rows:

In [64]:
df2[0:3]

Unnamed: 0,A,B,C,D
2015-01-01,0.42505,0.174167,-0.086784,0.437715
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002
2015-01-03,0.797813,-1.397892,1.773624,-0.569873


In [68]:
df2['20150102': '20150103']

Unnamed: 0,A,B,C,D
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002
2015-01-03,0.797813,-1.397892,1.773624,-0.569873


#### Selection by label

See [selection by label](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label).

Getting one row of data by label:

In [66]:
df2.loc[df1[0]]

A    0.425050
B    0.174167
C   -0.086784
D    0.437715
Name: 2015-01-01 00:00:00, dtype: float64

Selecting multiple columns by label:

In [67]:
df2.loc[:,['A', 'B']]

Unnamed: 0,A,B
2015-01-01,0.42505,0.174167
2015-01-02,0.000845,-0.895524
2015-01-03,0.797813,-1.397892
2015-01-04,-0.832057,-0.107309
2015-01-05,0.146565,-0.920169
2015-01-06,0.057463,-0.211505


Slicing by label; both end points are included:

In [69]:
df2.loc['20150102': '20150104', ['A','B']]

Unnamed: 0,A,B
2015-01-02,0.000845,-0.895524
2015-01-03,0.797813,-1.397892
2015-01-04,-0.832057,-0.107309
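
The inclusive end point is the main difference from positional slicing. A small sketch, using a made-up frame rather than `df2`, compares the two:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20150101", periods=6)
frame = pd.DataFrame(np.arange(12).reshape(6, 2), index=dates, columns=["A", "B"])

# Label slice: both 2015-01-02 and 2015-01-04 are included.
by_label = frame.loc["20150102":"20150104"]

# Position slice: the stop position 3 is excluded, as in plain Python.
by_position = frame.iloc[1:3]

print(len(by_label), len(by_position))
```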


Reduction in the dimensions of the returned object:

In [71]:
df2.loc['20150102', ['A','B']]

A    0.000845
B   -0.895524
Name: 2015-01-02 00:00:00, dtype: float64

Getting a scalar value:

In [72]:
df2.loc[df1[0],'A']

0.4250497071849015

Getting fast access to a scalar (equivalent to the method above):

In [73]:
df2.at[df1[0], 'A']

0.4250497071849015

#### Selection by position

See [selection by position](http://pandas.pydata.org/Pandas-docs/stable/indexing.html#indexing-integer).

Selecting by integer position:

In [74]:
df2.iloc[3]

A   -0.832057
B   -0.107309
C    0.539165
D    0.166025
Name: 2015-01-04 00:00:00, dtype: float64

Slicing with integers, similar to NumPy/Python:

In [76]:
df2.iloc[3:5,0:2]

Unnamed: 0,A,B
2015-01-04,-0.832057,-0.107309
2015-01-05,0.146565,-0.920169


Selecting by lists of integer positions, similar to NumPy/Python:

In [77]:
df2.iloc[[1,2,4], [0,2]]

Unnamed: 0,A,C
2015-01-02,0.000845,-0.175462
2015-01-03,0.797813,1.773624
2015-01-05,0.146565,-1.132986


Slicing rows explicitly:

In [78]:
df2.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002
2015-01-03,0.797813,-1.397892,1.773624,-0.569873


Slicing columns explicitly:

In [79]:
df2.iloc[:,1:3]

Unnamed: 0,B,C
2015-01-01,0.174167,-0.086784
2015-01-02,-0.895524,-0.175462
2015-01-03,-1.397892,1.773624
2015-01-04,-0.107309,0.539165
2015-01-05,-0.920169,-1.132986
2015-01-06,-0.211505,0.399993


Getting a value explicitly:

In [80]:
df2.iloc[1,1]

-0.8955243305296425

Getting fast access to a scalar (equivalent to the method above):

In [82]:
df2.iat[1,1]

-0.8955243305296425

#### Boolean indexing

Using the values of a single column to select data:

In [92]:
df2[df2.A > 0]

Unnamed: 0,A,B,C,D
2015-01-01,0.42505,0.174167,-0.086784,0.437715
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002
2015-01-03,0.797813,-1.397892,1.773624,-0.569873
2015-01-05,0.146565,-0.920169,-1.132986,-0.084363
2015-01-06,0.057463,-0.211505,0.399993,-0.970626


Selecting the values of a DataFrame that meet a condition:

In [93]:
df2[df2 > 0]

Unnamed: 0,A,B,C,D
2015-01-01,0.42505,0.174167,,0.437715
2015-01-02,0.000845,,,
2015-01-03,0.797813,,1.773624,
2015-01-04,,,0.539165,0.166025
2015-01-05,0.146565,,,
2015-01-06,0.057463,,0.399993,


Filtering with [isin()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html#pandas.Series.isin):

In [98]:
df4 = df2.copy()
df4['E'] = ['one', 'two', 'three', 'four', 'three', 'one']
df4

Unnamed: 0,A,B,C,D,E
2015-01-01,0.42505,0.174167,-0.086784,0.437715,one
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002,two
2015-01-03,0.797813,-1.397892,1.773624,-0.569873,three
2015-01-04,-0.832057,-0.107309,0.539165,0.166025,four
2015-01-05,0.146565,-0.920169,-1.132986,-0.084363,three
2015-01-06,0.057463,-0.211505,0.399993,-0.970626,one


In [99]:
df4[df4['E'].isin(['two', 'four'])]

Unnamed: 0,A,B,C,D,E
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002,two
2015-01-04,-0.832057,-0.107309,0.539165,0.166025,four


### Setting

Setting a new column automatically aligns the data by index:

In [104]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20150101', periods=6))
df2['F'] = s1
print(df2)

                   A         B         C         D  F
2015-01-01  0.425050  0.174167 -0.086784  0.437715  1
2015-01-02  0.000845 -0.895524 -0.175462 -0.145002  2
2015-01-03  0.797813 -1.397892  1.773624 -0.569873  3
2015-01-04 -0.832057 -0.107309  0.539165  0.166025  4
2015-01-05  0.146565 -0.920169 -1.132986 -0.084363  5
2015-01-06  0.057463 -0.211505  0.399993 -0.970626  6
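
Alignment is easier to see when the indexes only partly overlap. In the sketch below, which uses made-up data, the new Series covers just two of the four dates plus one date the frame does not contain:

```python
import pandas as pd

dates = pd.date_range("20150101", periods=4)
frame = pd.DataFrame({"A": [0.1, 0.2, 0.3, 0.4]}, index=dates)

# Covers 2015-01-03 through 2015-01-05; the last date is not in the frame.
partial = pd.Series([10, 20, 30], index=pd.date_range("20150103", periods=3))

# Values are matched by label, not by position.
frame["F"] = partial

print(frame)
```

Rows without a matching label receive NaN, and labels that exist only in the Series are dropped, so the frame keeps its original four rows.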


Setting values by label:

In [106]:
df2.at[df1[0], 'A'] = 0
df2

Unnamed: 0,A,B,C,D,F
2015-01-01,0.0,0.174167,-0.086784,0.437715,1
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002,2
2015-01-03,0.797813,-1.397892,1.773624,-0.569873,3
2015-01-04,-0.832057,-0.107309,0.539165,0.166025,4
2015-01-05,0.146565,-0.920169,-1.132986,-0.084363,5
2015-01-06,0.057463,-0.211505,0.399993,-0.970626,6


Setting values by position:

In [107]:
df2.iat[0,1] = 0
df2

Unnamed: 0,A,B,C,D,F
2015-01-01,0.0,0.0,-0.086784,0.437715,1
2015-01-02,0.000845,-0.895524,-0.175462,-0.145002,2
2015-01-03,0.797813,-1.397892,1.773624,-0.569873,3
2015-01-04,-0.832057,-0.107309,0.539165,0.166025,4
2015-01-05,0.146565,-0.920169,-1.132986,-0.084363,5
2015-01-06,0.057463,-0.211505,0.399993,-0.970626,6


Setting by assigning a NumPy array:

In [109]:
df2.loc[:,"D"] = np.array([5]* len(df2))
df2

Unnamed: 0,A,B,C,D,F
2015-01-01,0.0,0.0,-0.086784,5,1
2015-01-02,0.000845,-0.895524,-0.175462,5,2
2015-01-03,0.797813,-1.397892,1.773624,5,3
2015-01-04,-0.832057,-0.107309,0.539165,5,4
2015-01-05,0.146565,-0.920169,-1.132986,5,5
2015-01-06,0.057463,-0.211505,0.399993,5,6


A `where` operation with setting:

In [110]:
df5 = df2.copy()
df5[df5 > 0] = -df5
df5

Unnamed: 0,A,B,C,D,F
2015-01-01,0.0,0.0,-0.086784,-5,-1
2015-01-02,-0.000845,-0.895524,-0.175462,-5,-2
2015-01-03,-0.797813,-1.397892,-1.773624,-5,-3
2015-01-04,-0.832057,-0.107309,-0.539165,-5,-4
2015-01-05,-0.146565,-0.920169,-1.132986,-5,-5
2015-01-06,-0.057463,-0.211505,-0.399993,-5,-6
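
The same conditional replacement can also be written without in-place assignment, using the `DataFrame.where` and `DataFrame.mask` methods. This is an alternative sketch on a small made-up frame, not the approach used in the cell above:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# where keeps values for which the condition holds and takes the
# replacement everywhere else.
negated = frame.where(frame <= 0, -frame)

# mask is the inverse: it replaces values for which the condition holds.
same_result = frame.mask(frame > 0, -frame)

print(negated)
```

Both calls return new objects, so `frame` itself is left unchanged.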


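### Missing data

Pandas primarily uses `np.nan` to represent missing data. By default, missing values are not included in computations. See the [missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) section.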
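
As a quick illustration of that default, here is a sketch with a three-element Series made up for the example. The `skipna` parameter controls whether missing values are skipped:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = values.mean(skipna=False)  # NaN propagates into the result
count = values.count()                   # counts non-missing values only

print(mean_default, mean_strict, count)
```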

Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data, so the original is left unchanged.

In [116]:
df6 = df2.reindex(index=df1[0:4], columns=list(df2.columns) + ['E'])
df6.loc[df1[0]:df1[1], 'E'] = 1
df6

Unnamed: 0,A,B,C,D,F,E
2015-01-01,0.0,0.0,-0.086784,5,1,1.0
2015-01-02,0.000845,-0.895524,-0.175462,5,2,1.0
2015-01-03,0.797813,-1.397892,1.773624,5,3,
2015-01-04,-0.832057,-0.107309,0.539165,5,4,


Dropping every row that contains missing data:

In [118]:
df6.dropna(how='any')

Unnamed: 0,A,B,C,D,F,E
2015-01-01,0.0,0.0,-0.086784,5,1,1.0
2015-01-02,0.000845,-0.895524,-0.175462,5,2,1.0


Filling missing data:

In [119]:
df6.fillna(value=4)

Unnamed: 0,A,B,C,D,F,E
2015-01-01,0.0,0.0,-0.086784,5,1,1.0
2015-01-02,0.000845,-0.895524,-0.175462,5,2,1.0
2015-01-03,0.797813,-1.397892,1.773624,5,3,4.0
2015-01-04,-0.832057,-0.107309,0.539165,5,4,4.0


Getting the boolean mask of where values are `nan`:

In [120]:
pd.isna(df6)

Unnamed: 0,A,B,C,D,F,E
2015-01-01,False,False,False,False,False,False
2015-01-02,False,False,False,False,False,False
2015-01-03,False,False,False,False,False,True
2015-01-04,False,False,False,False,False,True


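### Operations
See the [binary operations](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-binop) documentation.

#### Stats

Operations in general exclude missing data.

Computing a descriptive statistic: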

In [121]:
df2.mean()

A    0.028438
B   -0.588733
C    0.219592
D    5.000000
F    3.500000
dtype: float64

The same operation on the other axis (that is, across each row):

In [122]:
df2.mean(1)

2015-01-01    1.182643
2015-01-02    1.185972
2015-01-03    1.834709
2015-01-04    1.719960
2015-01-05    1.618682
2015-01-06    2.249190
Freq: D, dtype: float64

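Operating on objects of different dimensionality requires alignment first. In addition, Pandas automatically broadcasts along the specified dimension.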

In [125]:
s2 = pd.Series([1,3,5, np.nan, 6, 8], index=df1).shift(2)
s2

2015-01-01    NaN
2015-01-02    NaN
2015-01-03    1.0
2015-01-04    3.0
2015-01-05    5.0
2015-01-06    NaN
Freq: D, dtype: float64

In [126]:
df2.sub(s2, axis='index')

Unnamed: 0,A,B,C,D,F
2015-01-01,,,,,
2015-01-02,,,,,
2015-01-03,-0.202187,-2.397892,0.773624,4.0,2.0
2015-01-04,-3.832057,-3.107309,-2.460835,2.0,1.0
2015-01-05,-4.853435,-5.920169,-6.132986,0.0,0.0
2015-01-06,,,,,
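
A smaller sketch of the same alignment-then-broadcast behavior, with invented labels so the arithmetic is easy to follow:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)

# One value per row label; there is no entry for "z".
offsets = pd.Series([1.0, 2.0], index=["x", "y"])

# axis="index" matches the Series labels against the row labels, then the
# matched value is subtracted from every column of that row.
result = frame.sub(offsets, axis="index")

print(result)
```

Row `z` has no matching label in `offsets`, so every value in that row becomes NaN.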


#### Apply

Applying functions to the data:

In [128]:
df2.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2015-01-01,0.0,0.0,-0.086784,5,1
2015-01-02,0.000845,-0.895524,-0.262246,10,3
2015-01-03,0.798658,-2.293416,1.511378,15,6
2015-01-04,-0.0334,-2.400725,2.050543,20,10
2015-01-05,0.113166,-3.320894,0.917557,25,15
2015-01-06,0.170628,-3.5324,1.31755,30,21


In [130]:
df2.apply(lambda x: x.max() - x.min())

A    1.629870
B    1.397892
C    2.906610
D    0.000000
F    5.000000
dtype: float64

#### Histogramming

See [histogramming and discretization](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-discretization).

In [133]:
s3 = pd.Series(np.random.randint(0, 7, size=10))
s3

0    5
1    1
2    0
3    2
4    4
5    6
6    6
7    0
8    2
9    1
dtype: int32

In [134]:
s3.value_counts()

6    2
2    2
1    2
0    2
5    1
4    1
dtype: int64

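#### String methods

The `str` attribute of a Series provides a set of string processing methods, as shown in the code below. Note that pattern matching in `str` uses [regular expressions](https://docs.python.org/3/library/re.html) by default. See [vectorized string methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-string-methods).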

In [137]:
s4 = pd.Series(['A','B', 'C', 'Aaba','Baca', np.nan, 'CABA', 'dog', 'cat'])
s4.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
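
To show what the regular-expression default means in practice, here is a short sketch with `Series.str.contains`. The sample strings are made up, and `na=False` is passed so that missing entries count as non-matches:

```python
import numpy as np
import pandas as pd

words = pd.Series(["Aaba", "Baca", np.nan, "dog"])

# The pattern is a regular expression: match strings starting with A or B.
starts_with_a_or_b = words.str.contains(r"^[AB]", na=False)

# As a regex, "." matches any character, so "a.a" matches "aba" and "aca".
has_pattern = words.str.contains("a.a", na=False)

# With regex=False the same text is a literal substring and matches nothing here.
has_literal = words.str.contains("a.a", regex=False, na=False)

print(starts_with_a_or_b.tolist())
print(has_pattern.tolist())
print(has_literal.tolist())
```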

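### Merge

#### Concat
Pandas provides several ways to combine Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality for join and merge operations.

See the [merging](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging) section.

Concatenating Pandas objects with `concat()`: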

In [141]:
df7 = pd.DataFrame(np.random.randn(10, 4))
df7

Unnamed: 0,0,1,2,3
0,-1.296076,0.294882,-0.632433,-0.561483
1,-0.025625,-0.082323,0.421752,-0.055686
2,0.083312,1.422946,-0.369731,-1.30235
3,1.395324,0.38745,0.309097,-0.438859
4,1.040691,0.220567,0.257021,0.734074
5,-0.050505,-0.754979,0.401254,1.768321
6,-0.559591,-0.56452,-0.667834,0.657654
7,-0.986849,2.273356,0.665441,-1.051597
8,1.398413,0.339848,-0.372046,-0.196928
9,-0.033368,0.557387,-2.104741,0.372245


In [143]:
# break it into pieces
pieces = [df7[:3], df7[3:7],df7[7:]]
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,-1.296076,0.294882,-0.632433,-0.561483
1,-0.025625,-0.082323,0.421752,-0.055686
2,0.083312,1.422946,-0.369731,-1.30235
3,1.395324,0.38745,0.309097,-0.438859
4,1.040691,0.220567,0.257021,0.734074
5,-0.050505,-0.754979,0.401254,1.768321
6,-0.559591,-0.56452,-0.667834,0.657654
7,-0.986849,2.273356,0.665441,-1.051597
8,1.398413,0.339848,-0.372046,-0.196928
9,-0.033368,0.557387,-2.104741,0.372245
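
In the example above the pieces come from one frame, so their index labels never collide. When the inputs carry overlapping labels, `concat()` keeps them unless `ignore_index=True` is passed. A minimal sketch with two invented frames:

```python
import pandas as pd

first = pd.DataFrame({"v": [1, 2]})
second = pd.DataFrame({"v": [3, 4]})

# Original labels are preserved, so 0 and 1 each appear twice.
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows afresh.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```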


#### Join

SQL-style merges. See the [database style joining](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join) section.

In [150]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1,2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4,5]})
print(left)
print('-------------------')
print(right)

   key  lval
0  foo     1
1  foo     2
-------------------
   key  rval
0  foo     4
1  foo     5


In [151]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5


Another example:

In [152]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1,2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4,5]})
print(left)
print('-------------------')
print(right)

   key  lval
0  foo     1
1  bar     2
-------------------
   key  rval
0  foo     4
1  bar     5


In [153]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5
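
Both examples above use the default inner join, where only keys present on both sides survive. The sketch below uses invented keys to contrast that with an outer join; the `how` and `indicator` parameters of `pd.merge` are real, while the data is hypothetical:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar", "baz"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "bar", "qux"], "rval": [4, 5, 6]})

# Inner join (the default): only "foo" and "bar" appear in both frames.
inner = pd.merge(left, right, on="key")

# Outer join keeps every key; indicator=True adds a _merge column that
# records which side each row came from.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```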


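#### Append

Appending rows to a DataFrame. See the [appending](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-concatenation) documentation. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, where `pd.concat` does the same job.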

In [154]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,0.417606,1.152513,-0.993736,-0.762617
1,0.238766,-1.787532,-0.244937,1.351114
2,-1.56346,1.167297,-0.695212,1.748017
3,2.898314,0.341063,1.119913,0.0103
4,-0.185712,1.275648,0.862067,-0.353386
5,0.382568,-1.903187,0.388535,0.783452
6,1.254587,0.169247,0.766574,0.750564
7,0.759676,1.701502,-0.598657,0.349115


In [155]:
s = df.iloc[3]
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
pd.concat([df, s.to_frame().T], ignore_index=True)

Unnamed: 0,A,B,C,D
0,0.417606,1.152513,-0.993736,-0.762617
1,0.238766,-1.787532,-0.244937,1.351114
2,-1.56346,1.167297,-0.695212,1.748017
3,2.898314,0.341063,1.119913,0.0103
4,-0.185712,1.275648,0.862067,-0.353386
5,0.382568,-1.903187,0.388535,0.783452
6,1.254587,0.169247,0.766574,0.750564
7,0.759676,1.701502,-0.598657,0.349115
8,2.898314,0.341063,1.119913,0.0103
