# Chapter 5: Getting Started with pandas

## Series

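- A Series is a one-dimensional array-like object consisting of an array of data (of any NumPy data type) and an associated array of data labels, called its index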

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
# The index is on the left and the values on the right; since no index was specified, a default integer index from 0 to N-1 is created
obj = Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

- The values and index attributes of a Series give its array representation and its index object

In [3]:
obj.values

array([ 4,  7, -5,  3])

In [4]:
obj.index

RangeIndex(start=0, stop=4, step=1)

- You can specify the index of a Series yourself

In [5]:
obj2 = Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [6]:
obj2.index

Index([u'd', u'b', u'a', u'c'], dtype='object')

- Unlike a plain array, you can use index labels to select a single value or a set of values from a Series

In [7]:
obj2['a']

-5

In [8]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

- NumPy array operations preserve the link between index and value

In [9]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [10]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

- A Series can be thought of as a fixed-length, ordered dict, since it maps index values to data values; it can be used in many functions that expect a dict

In [11]:
'b' in obj2

True

- If the data is stored in a Python dict, you can create a Series directly from that dict

In [12]:
# When only a dict is passed, the Series index is the dict's keys (in sorted order in this pandas version)
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [13]:
# If a label has no matching key in the dict, its value in the Series is NaN
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [14]:
# Detect missing values
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [15]:
# Equivalent to the call above
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

- In arithmetic operations, Series automatically aligns data by index label

In [16]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [17]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [18]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

- The index of a Series can be modified in place by assignment

In [19]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [20]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
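
The Series behaviors above (dict-like construction, missing labels becoming NaN, and label alignment in arithmetic) can be sketched in one place. This is a minimal illustration, assuming a recent pandas; the names `by_dict` and `by_states` are made up for the example.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# Build one Series straight from the dict, and one restricted to chosen labels.
by_dict = pd.Series(sdata)
by_states = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# 'California' has no key in the dict, so its value is NaN.
missing = by_states[by_states.isnull()].index.tolist()

# Addition aligns on labels; labels present on only one side give NaN.
total = by_dict + by_states
print(missing)
print(total.dropna())
```

The result index is the union of both indexes, which is why 'Utah' and 'California' both appear in `total` even though each exists on one side only.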

## DataFrame

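- A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. It has both a row index and a column index, and can be thought of as a dict of Series that all share the same index. Row-oriented and column-oriented operations are treated roughly symmetrically. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or other collection of one-dimensional arrays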

- There are many ways to construct a DataFrame; one of the most common is to pass a dict of equal-length lists or NumPy arrays

In [21]:
# As with Series, an index is assigned automatically, and the columns are placed in sorted order
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


In [22]:
# If you specify a sequence of columns, the DataFrame's columns are arranged in that order
DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [23]:
# As with Series, a requested column that is not found in the data is filled with NaN
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


- A column of a DataFrame can be retrieved as a Series, either with dict-like notation or by attribute

In [24]:
# The returned Series has the same index as the DataFrame, and its name attribute is set appropriately
print(frame2['state'])
print("\n")
print(frame2['state'].name)
print("\n")
print(type(frame2['state']))

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object


state


<class 'pandas.core.series.Series'>


In [25]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

- Rows can be retrieved by position or by label; for example, label-based selection uses the loc indexer

In [26]:
print(frame2.loc['three'])
print("\n")
print(type(frame2.loc['three']))

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object


<class 'pandas.core.series.Series'>


- Columns can be modified by assignment

In [27]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [28]:
# When assigning a list or array to a column, its length must match the length of the DataFrame
frame2.debt = np.arange(5)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


In [29]:
# When assigning a Series, its labels are aligned exactly to the DataFrame's index, and any holes are filled with missing values
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'six'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,


- Assigning to a column that does not exist creates a new column; the del keyword deletes columns

In [30]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,,False


In [31]:
del frame2['eastern']
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

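- In the pandas version used here, a column returned by indexing is a view on the underlying data, not a copy; use the Series copy method to copy the column explicitly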

In [32]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,


In [33]:
state = frame2['state']
state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [34]:
state['one'] = 'test'
state

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


one        test
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [35]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,test,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,


In [36]:
pop = frame2['pop'].copy()
pop

one      1.5
two      1.7
three    3.6
four     2.4
five     2.9
Name: pop, dtype: float64

In [37]:
pop['one'] = '100'
pop

one      100.0
two        1.7
three      3.6
four       2.4
five       2.9
Name: pop, dtype: float64

In [38]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,test,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,
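
Whether a selected column behaves as a view or as a copy has differed between pandas versions, so code that depends on it is fragile. The sketch below shows two patterns that should behave the same regardless: call `copy` when the frame must stay untouched, and write through `loc` when the frame is meant to change. The name `frame_demo` is hypothetical and the data is a shortened version of the example above.

```python
import pandas as pd

frame_demo = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Nevada'], 'pop': [1.5, 1.7, 2.4]},
    index=['one', 'two', 'three'],
)

# Explicit copy: changes to the copy never reach the original frame.
pop_copy = frame_demo['pop'].copy()
pop_copy['one'] = 100.0

# Explicit write through loc: the frame itself is modified, with no ambiguity.
frame_demo.loc['one', 'state'] = 'test'

print(frame_demo)
```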


- A DataFrame can also be built from a nested dict: the outer keys become the column index and the inner keys become the row index

In [39]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [40]:
# A DataFrame can be transposed
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


- If the name attributes of a DataFrame's index and columns are set, they are displayed as well

In [41]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


- The values attribute returns the data in a DataFrame as a two-dimensional ndarray

In [42]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

In [43]:
# If the columns have different dtypes, the dtype of the values array is chosen to accommodate all of them
frame2.values

array([[2000, 'test', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, nan]], dtype=object)

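## Index Objects

- pandas Index objects hold the axis labels and other metadata (such as the axis name); any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index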

In [44]:
obj = Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [45]:
index = obj.index
index

Index([u'a', u'b', u'c'], dtype='object')

In [46]:
type(index)

pandas.core.indexes.base.Index

In [47]:
index[1:]

Index([u'b', u'c'], dtype='object')

In [48]:
pd.Index(np.arange(3))

Int64Index([0, 1, 2], dtype='int64')

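- Index objects are immutable (functionally similar to a fixed-size set), which is what makes it safe to share them among multiple data structures

- The main Index types in pandas are: Index (the most general Index object), Int64Index (a specialized Index for integers), MultiIndex (a hierarchical index representing multiple levels of indexing on a single axis), DatetimeIndex (stores nanosecond timestamps), and PeriodIndex (a specialized Index for Period data)

- Each Index has a number of methods and properties for set logic and for answering common questions about the data it contains, such as intersection and union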
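
As a small sketch of the set-style methods and of immutability, the example below combines two Index objects and then attempts an item assignment. The names `left`, `right`, and `mutated` are illustrative only.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['c', 'd'])

common = left.intersection(right)   # labels present in both
combined = left.union(right)        # labels present in either

# Index objects are immutable: item assignment raises TypeError.
try:
    left[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

print(list(common), list(combined), mutated)
```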

## Essential Functionality

### Reindexing

- An important method on pandas objects is reindex, which creates a new object conformed to a new index. First, consider a Series

In [49]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [50]:
# If an index value is not already present, a missing value is introduced
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [51]:
# Specify the value to use for missing entries
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [52]:
# Use the method parameter to fill missing values; ffill fills forward
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [53]:
# bfill fills backward
obj3.reindex(range(6), method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

- For a DataFrame, reindex can alter the row index, the columns, or both; when passed only a sequence, it reindexes the rows

In [54]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [55]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [56]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [57]:
# Reindex rows and columns together; a fill method can be applied afterwards, but the fill runs along the rows only (axis 0)
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states).ffill()

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,1.0,,2.0
c,4.0,,5.0
d,7.0,,8.0
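
To make the row-wise fill concrete, here is a minimal sketch that reindexes both axes and then forward-fills. It reuses the data from the cells above; the names `wide` and `filled` are made up for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex both axes at once; new labels ('b', 'Utah') start out as NaN.
wide = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])

# Forward-fill runs down each column, so row 'b' copies row 'a',
# while the all-NaN 'Utah' column has nothing to copy from.
filled = wide.ffill()
print(filled)
```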


### Dropping Entries from an Axis

- Dropping one or more entries from an axis is easy, given an index array or list

In [58]:
obj= Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [59]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [60]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

- With a DataFrame, index values can be dropped from either axis

In [61]:
data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [62]:
# By default, row labels are dropped (axis=0)
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [63]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


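### Indexing, Selection, and Filtering

- Series indexing works like NumPy array indexing, except that you can use the Series's index values and not only integers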

In [64]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [65]:
obj['b']

1.0

In [66]:
obj[1]

1.0

In [67]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [68]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [69]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [70]:
# Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [71]:
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

- Indexing into a DataFrame retrieves one or more columns

In [72]:
data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [73]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [74]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [75]:
# Selecting with a slice selects rows
data['Ohio':'Utah'] # data[:3] gives the same result

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11


In [76]:
# Selecting with a boolean array selects rows
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [77]:
# A boolean DataFrame can also be used for assignment
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


- For row indexing on a DataFrame, use loc or iloc

In [78]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [79]:
# loc indexes by row label; the endpoint is included
data.loc[:'Utah']

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11


In [80]:
# iloc indexes by integer position; the endpoint is excluded
data.iloc[:3]

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11


In [81]:
data.loc['Utah']

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [82]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [83]:
# Select rows and columns together; the two indexers are row labels and column labels
data.loc[:, ['one', 'three']]

Unnamed: 0,one,three
Ohio,0,0
Colorado,0,6
Utah,8,10
New York,12,14


In [84]:
# Select rows and columns together; the two indexers are row positions and column positions
data.iloc[:, [0, 2]]

Unnamed: 0,one,three
Ohio,0,0
Colorado,0,6
Utah,8,10
New York,12,14
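
The difference in endpoint handling between loc and iloc is easy to get wrong, so here is a short side-by-side sketch. It uses the same 4x4 frame as above before any values were overwritten; `by_label` and `by_position` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Ohio':'Utah', ['one', 'three']]   # 'Utah' is included
by_position = data.iloc[0:3, [0, 2]]                   # position 3 is excluded

# Both selections pick the same three rows and two columns.
print(by_label.equals(by_position))
```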


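### Arithmetic and Data Alignment

- One of the most important pandas features is arithmetic between objects with different indexes. When objects are added, if any index pairs differ, the index of the result is the union of the index pairs

- First, consider Series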

In [85]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [86]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [87]:
# NaN is introduced at labels that do not overlap
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

- For DataFrame, alignment is performed on both the rows and the columns

In [88]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [89]:
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [90]:
# The result's index and columns are the unions of those of the two DataFrames
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


### Arithmetic Methods with Fill Values

In [91]:
# Where one side has no value, 0 is used; where both sides have no value, the result is still NaN
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,3.0,1.0,6.0,5.0
Oregon,9.0,,10.0,11.0
Texas,9.0,4.0,12.0,8.0
Utah,0.0,,1.0,2.0


In [92]:
# A fill value can also be given when reindexing
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,b,d,e
Ohio,0.0,2.0,0
Texas,3.0,5.0,0
Colorado,6.0,8.0,0
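
The same fill_value idea applies to Series. The following sketch, with shortened versions of the earlier `s1` and `s2`, contrasts the plain operator with the `add` method; `plain` and `filled` are names chosen for the example.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4], index=['a', 'c', 'd'])
s2 = pd.Series([-2.1, 3.6, 4.0], index=['a', 'c', 'f'])

plain = s1 + s2                    # 'd' and 'f' become NaN
filled = s1.add(s2, fill_value=0)  # the missing side is treated as 0

print(plain)
print(filled)
```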


### Operations Between DataFrame and Series

- Operations between a DataFrame and a Series are carried out by broadcasting

In [93]:
frame = DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [94]:
series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [95]:
# By default, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [96]:
series2 = Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

In [97]:
# If the labels differ, the two objects are reindexed to form the union
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [98]:
# With the arithmetic methods, you can specify the axis to match on
frame.add(series2, axis=1)

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [99]:
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [100]:
frame.sub(series3, axis=0)

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
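
The two broadcasting directions can be compared directly. This is a minimal sketch using the same frame as above; `minus_first_row` and `minus_d_column` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: match the Series index to the columns, broadcast down the rows.
minus_first_row = frame - frame.iloc[0]

# axis=0: match the Series index to the row labels, broadcast across columns.
minus_d_column = frame.sub(frame['d'], axis=0)

print(minus_first_row)
print(minus_d_column)
```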


### Function Application and Mapping

- NumPy ufuncs (element-wise array methods) also work with pandas objects

In [101]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-0.281275,0.440579,1.993742
Ohio,-0.746468,-1.30483,0.667958
Texas,-0.345682,0.457166,0.637068
Oregon,-1.119292,-1.04046,0.849036


In [102]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.281275,0.440579,1.993742
Ohio,0.746468,1.30483,0.667958
Texas,0.345682,0.457166,0.637068
Oregon,1.119292,1.04046,0.849036


- DataFrame's apply method applies a function to the one-dimensional arrays formed by each row or column

In [103]:
f = lambda x: x.max() - x.min()
frame.apply(f, axis=1)

Utah      2.275017
Ohio      1.972788
Texas     0.982750
Oregon    1.968328
dtype: float64

In [104]:
frame.apply(f)

b    0.838017
d    1.761996
e    1.356674
dtype: float64

- Besides a scalar, the function passed to apply can return a Series with multiple values

In [105]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.119292,-1.30483,0.637068
max,-0.281275,0.457166,1.993742


- Element-wise Python functions can be used as well, with applymap

In [106]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.28,0.44,1.99
Ohio,-0.75,-1.3,0.67
Texas,-0.35,0.46,0.64
Oregon,-1.12,-1.04,0.85


- Series has a map method for applying an element-wise function

In [107]:
frame['e'].map(format)

Utah      1.99
Ohio      0.67
Texas     0.64
Oregon    0.85
Name: e, dtype: object
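
Because the cells above use random data, their numbers cannot be reproduced. The sketch below repeats the apply and map patterns on small fixed values so the results can be checked by hand; the frame contents and the helper name `spread` are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, 4.0, -2.0], 'd': [0.5, 0.25, 3.0]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    # x is one column (or one row when axis=1) passed in as a Series.
    return x.max() - x.min()

per_column = frame.apply(spread)          # one value per column
per_row = frame.apply(spread, axis=1)     # one value per row

# Element-wise formatting of a single column with Series.map.
formatted = frame['d'].map(lambda v: '%.2f' % v)

print(per_column, per_row, formatted, sep='\n')
```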

### Sorting and Ranking

- To sort by row or column index, use the sort_index method

In [108]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [109]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [110]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [111]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [112]:
# For a DataFrame, you can specify the axis
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [113]:
# Sorting is ascending by default; descending order can be requested
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


- You can also sort by value, using the sort_values method

In [114]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [115]:
# Missing values are placed at the end
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [116]:
frame = DataFrame({'b': [2, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,a,b
0,0,2
1,1,7
2,0,-3
3,1,2


In [117]:
frame.sort_values(by='b')

Unnamed: 0,a,b
2,0,-3
0,0,2
3,1,2
1,1,7


In [118]:
# You can sort by multiple columns, which amounts to a multi-level sort
frame.sort_values(by=['b', 'a'])

Unnamed: 0,a,b
2,0,-3
0,0,2
3,1,2
1,1,7


In [119]:
# Sort the columns by the values in the row labeled 2
frame.sort_values(by=2, axis=1)

Unnamed: 0,b,a
0,2,0
1,7,1
2,-3,0
3,2,1


- Ranking is closely related to sorting: it assigns ranks from 1 up to the number of valid data points in the array, and it can break ties according to a chosen rule

In [120]:
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [121]:
# By default, rank breaks ties by assigning each group the mean rank
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [122]:
# Ranks can also be assigned by the order in which values appear in the data, similar to a stable sort
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [123]:
# method='max' uses the highest rank in the whole group; ascending controls ascending or descending order (ascending by default)
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [124]:
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5


In [125]:
# By default, ranks are computed within each column
frame.rank()

Unnamed: 0,a,b,c
0,1.5,3.0,2.0
1,3.5,4.0,3.0
2,1.5,1.0,4.0
3,3.5,2.0,1.0


In [126]:
# Rank within each row
frame.rank(axis=1)

Unnamed: 0,a,b,c
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,2.0,1.0,3.0
3,2.0,3.0,1.0
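
The tie-breaking rules are easiest to compare side by side. The sketch below ranks the same Series with four of the available `method` values; `'min'` and `'dense'` are not used in the cells above but are standard options of `rank`.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                 # ties share the mean of their ranks
first = obj.rank(method='first')     # ties ranked by order of appearance
lowest = obj.rank(method='min')      # ties all get the lowest rank in the group
dense = obj.rank(method='dense')     # like 'min', but ranks have no gaps

print(pd.DataFrame({'average': average, 'first': first,
                    'min': lowest, 'dense': dense}))
```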


### Axis Indexes with Duplicate Labels

- Axis labels in pandas do not have to be unique

In [127]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [128]:
# Check whether the index values are unique
obj.index.is_unique

False

In [129]:
obj['a']

a    0
a    1
dtype: int64

In [130]:
obj['c']

4

In [131]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,0.95625,0.22504,-2.615496
a,-0.721335,0.560779,0.433583
b,-0.963332,-1.508413,0.363071
b,0.29566,1.140155,-0.564373


In [132]:
df.loc['a']

Unnamed: 0,0,1,2
a,0.95625,0.22504,-2.615496
a,-0.721335,0.560779,0.433583


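## Summarizing and Computing Descriptive Statistics

- pandas objects are equipped with a set of common mathematical and statistical methods. Most are reductions or summary statistics, which extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. NaN values are excluded from the computation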

In [133]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [134]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [135]:
# Sum across each row
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [136]:
# The skipna option controls whether NaN is excluded; with skipna=False, NaN propagates into the result
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
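
Note that the all-NaN row 'c' sums to 0.0 under the default settings, which can hide missing data. As a hedged sketch of the available choices, the example below compares the default, `skipna=False`, and the `min_count` parameter of `sum`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

default_sum = df.sum(axis=1)               # NaN skipped; all-NaN row 'c' gives 0.0
strict_sum = df.sum(axis=1, skipna=False)  # any NaN in a row gives NaN
needs_one = df.sum(axis=1, min_count=1)    # NaN only when no valid value exists

print(pd.DataFrame({'default': default_sum, 'strict': strict_sum,
                    'min_count=1': needs_one}))
```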

In [137]:
# Indirect statistics: return the index label at which the minimum or maximum value is attained
df.idxmax()

one    b
two    d
dtype: object

In [138]:
# Accumulations
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [139]:
# Summary statistics
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [140]:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [141]:
# For non-numeric data, describe produces a different kind of summary
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

### Correlation and Covariance

In [142]:
data = DataFrame(np.arange(1,19).reshape(6,3), index=['a', 'b', 'c', 'd', 'e', 'f'], columns=['one', 'two', 'three'])
data

Unnamed: 0,one,two,three
a,1,2,3
b,4,5,6
c,7,8,9
d,10,11,12
e,13,14,15
f,16,17,18


In [143]:
# Percent change down each column
data.pct_change()

Unnamed: 0,one,two,three
a,,,
b,3.0,1.5,1.0
c,0.75,0.6,0.5
d,0.428571,0.375,0.333333
e,0.3,0.272727,0.25
f,0.230769,0.214286,0.2


In [144]:
# Percent change across each row
data.pct_change(axis=1)

Unnamed: 0,one,two,three
a,,1.0,0.5
b,,0.25,0.2
c,,0.142857,0.125
d,,0.1,0.090909
e,,0.076923,0.071429
f,,0.0625,0.058824


- The corr method of Series computes the correlation of the overlapping, non-NA, index-aligned values in two Series; similarly, cov computes the covariance

In [145]:
data.one.corr(data.two)

1.0

In [146]:
data.one.cov(data.two)

31.5

- DataFrame's corr and cov methods return the full correlation or covariance matrix as a DataFrame

In [147]:
data.corr()

Unnamed: 0,one,two,three
one,1.0,1.0,1.0
two,1.0,1.0,1.0
three,1.0,1.0,1.0


In [148]:
data.cov()

Unnamed: 0,one,two,three
one,31.5,31.5,31.5
two,31.5,31.5,31.5
three,31.5,31.5,31.5


- With DataFrame's corrwith method, you can compute correlations between it and another Series or DataFrame

In [149]:
data.corrwith(data.one)

one      1.0
two      1.0
three    1.0
dtype: float64

In [150]:
# A Series can also be passed, but its index must line up with the DataFrame's row index
data.corrwith(Series(np.arange(6), index=['a', 'b', 'c', 'd', 'e', 'f']))

one      1.0
two      1.0
three    1.0
dtype: float64

In [151]:
data_2 = DataFrame(np.arange(19,37).reshape(6,3), index=['a', 'b', 'c', 'd', 'e', 'f'], columns=['one', 'two', 'three'])
data_2

Unnamed: 0,one,two,three
a,19,20,21
b,22,23,24
c,25,26,27
d,28,29,30
e,31,32,33
f,34,35,36


In [152]:
# All data points are aligned by label before the computation
data.corrwith(data_2)

one      1.0
two      1.0
three    1.0
dtype: float64

### Unique Values, Value Counts, and Membership

In [153]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [154]:
# Return the unique values, in order of first appearance
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [155]:
# Value counts, sorted by frequency in descending order
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [156]:
# pandas also has a top-level function for value counts
pd.value_counts(obj.values, sort=False)

a    3
c    3
b    2
d    1
dtype: int64

In [157]:
# Vectorized set membership test
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [158]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [159]:
data = DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [160]:
# Count values in each column; the row index becomes the set of values that appear in any column
data.apply(pd.value_counts).fillna(0)

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
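
The per-column count can also be written with the `Series.value_counts` method instead of the top-level function, which some newer pandas releases flag as deprecated. A minimal sketch under that assumption, with the result cast back to integers; `counts` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count occurrences within each column; the union of observed values
# becomes the row index, and absent combinations are filled with 0.
counts = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(counts)
```

Each column of `counts` sums to the number of rows in `data`, which is a quick sanity check on the result.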


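## Handling Missing Data

- pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays; it is simply a sentinel that is easy to detect

- Python's built-in None value is also treated as NA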

In [161]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [162]:
string_data[0] = None
string_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

In [163]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

In [164]:
string_data.notnull()

0    False
1     True
2    False
3     True
dtype: bool

### Filtering Out Missing Data

- For a Series, dropna returns a Series containing only the non-null data and their index values

In [165]:
data = Series([1, np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [166]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

- For a DataFrame, dropna by default drops any row containing a missing value

In [167]:
data = DataFrame([[1, 6.5, 3], [1, np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [168]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [169]:
# Drop only the rows that are all NaN
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [170]:
# Drop columns instead of rows; every column here contains a NaN, so only the index remains
data.dropna(axis=1)

0
1
2
3


In [171]:
# Keep only the rows with at least 2 non-NaN values
data.dropna(thresh=2)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
3,,6.5,3.0


### Filling In Missing Data

In [172]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [173]:
data.fillna(0)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,6.5,3.0


In [174]:
# Use a different fill value for each column
data.fillna({0: 100, 1: 1000, 2: 10000})

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,1000.0,10000.0
2,100.0,1000.0,10000.0
3,100.0,6.5,3.0


In [175]:
# Modify the object in place
data.fillna(0, inplace=True)
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,6.5,3.0


- The method parameter of reindex is also available for fillna

In [176]:
data = DataFrame([[1, 6.5, 4], [1, np.nan, np.nan], [np.nan, 6.5, np.nan], [np.nan, np.nan, 3]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,4.0
1,1.0,,
2,,6.5,
3,,,3.0


In [177]:
data.fillna(method='ffill')

Unnamed: 0,0,1,2
0,1.0,6.5,4.0
1,1.0,6.5,4.0
2,1.0,6.5,4.0
3,1.0,6.5,3.0


In [178]:
# Limit the number of consecutive values that may be filled
data.fillna(method='ffill', limit=1)

Unnamed: 0,0,1,2
0,1.0,6.5,4.0
1,1.0,6.5,4.0
2,1.0,6.5,
3,,6.5,3.0
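
Forward filling is also available as the dedicated `ffill` method, which newer pandas releases prefer over `fillna(method='ffill')`. The sketch below repeats the two fills above in that form; `filled_all` and `filled_once` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.5, 4], [1, np.nan, np.nan],
                     [np.nan, 6.5, np.nan], [np.nan, np.nan, 3]])

filled_all = data.ffill()          # propagate the last valid value down each column
filled_once = data.ffill(limit=1)  # fill at most one value per run of consecutive NaNs

print(filled_all)
print(filled_once)
```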


## Hierarchical Indexing

- Hierarchical indexing is an important pandas feature that lets you have multiple (two or more) index levels on a single axis. First, consider a Series

In [179]:
data = Series(np.random.randn(10), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

a  1    1.111620
   2   -0.575146
   3    0.109823
b  1   -1.043389
   2   -0.071187
   3   -0.225158
c  1    0.369948
   2   -0.180931
d  2   -0.484408
   3    0.619635
dtype: float64

In [180]:
data.index

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [181]:
data['b']

1   -1.043389
2   -0.071187
3   -0.225158
dtype: float64

In [182]:
data['b': 'c']

b  1   -1.043389
   2   -0.071187
   3   -0.225158
c  1    0.369948
   2   -0.180931
dtype: float64

In [183]:
data[:, 2]

a   -0.575146
b   -0.071187
c   -0.180931
d   -0.484408
dtype: float64

In [184]:
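# Hierarchical indexing plays an important role in reshaping data and in group-based operations such as building pivot tables; for example, unstack rearranges this data into a DataFrame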
data.unstack()

Unnamed: 0,1,2,3
a,1.11162,-0.575146,0.109823
b,-1.043389,-0.071187,-0.225158
c,0.369948,-0.180931,
d,,-0.484408,0.619635
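
Since the Series above holds random values, here is the same selection pattern on fixed data so that each result can be verified by hand. This is a sketch using `loc` for the partial and inner-level selections; the names `outer`, `inner`, and `table` are made up for the example.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer = data.loc['b']        # partial indexing on the outer level
inner = data.loc[:, 2]       # select on the inner level across all outer labels
table = data.unstack()       # inner level becomes the columns

print(outer.tolist(), inner.tolist())
print(table)
```

Combinations that do not exist in the index, such as ('c', 3) and ('d', 1), show up as NaN in the unstacked table.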


- For a DataFrame, either axis can have a hierarchical index

In [185]:
frame = DataFrame(np.arange(12).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [186]:
# Each level can have a name; do not confuse index names with axis labels
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [187]:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


In [188]:
# A MultiIndex can be created on its own and then reused
idx = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color'])
idx

MultiIndex(levels=[[u'Colorado', u'Ohio'], [u'Green', u'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=[u'state', u'color'])

In [189]:
frame_2 = DataFrame(np.arange(12).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=idx)
frame_2

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


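### Reordering and Sorting Levels

- Sometimes you need to rearrange the order of the levels on an axis, or sort the data by the values in one specific level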

In [190]:
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [191]:
# Level numbers are accepted as well
frame.swaplevel(0, 1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [192]:
# You can also sort the data by the values in a single level (the sort is stable)
frame.sort_index(level=1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


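### Summary Statistics by Level

- Many descriptive and summary statistics on DataFrame and Series have a level option, which specifies the level on a given axis to aggregate by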

In [193]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [194]:
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [195]:
frame.sum(level='color', axis=1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
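
Newer pandas releases no longer accept the level argument in reductions such as sum, so the same aggregation is usually written with groupby. The following is a hedged sketch of that equivalent form, not the notebook's original code; grouping the columns is done here by transposing, grouping, and transposing back.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']),
)

# Sum rows that share the same value in the 'key2' index level.
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same 'color': transpose, group, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```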


### Indexing with a DataFrame's Columns

- It is common to want to use one or more DataFrame columns as the row index, or to move the row index into the DataFrame's columns

In [196]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [197]:
# set_index converts one or more columns into the row index
frame_2 = frame.set_index(['c', 'd'])
frame_2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [198]:
# Keep the original columns in the DataFrame as well
frame_3 = frame.set_index(['c', 'd'], drop=False)
frame_3

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [199]:
# reset_index does the opposite of set_index
frame_2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
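
The round trip between columns and index levels can be checked directly. In this minimal sketch, `indexed`, `restored`, and `same` are illustrative names; the only difference after the round trip is the column order.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

indexed = frame.set_index(['c', 'd'])   # 'c' and 'd' become index levels
restored = indexed.reset_index()        # levels move back to leading columns

# Only the column order differs after the round trip.
same = restored[['a', 'b', 'c', 'd']].equals(frame)
print(list(indexed.index.names), list(restored.columns), same)
```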


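## Other pandas Topics

- pandas has a Panel data structure, which you can think of as a three-dimensional DataFrame (Panel was deprecated and later removed in newer pandas releases). Most pandas development work focuses on tabular data, because such data is more common and because hierarchical indexing also makes true N-dimensional arrays unnecessary in most cases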