Pandas is a Python library for data analysis. It ships with Anaconda, so installing Anaconda also installs pandas.

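# 1 Basic data structures
pandas provides two main data structures:
+ Series
+ DataFrame

A Series stores one-dimensional data, while a DataFrame stores more complex, multi-dimensional data.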

## Series
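A Series represents one-dimensional data and has a simple internal structure: two associated arrays. The main array holds the data, and each element of the main array has an associated label, stored in the index array.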

index | value
--- | ---
 0 | 12
 1 | -4
 2 | 22
 3 | 0
 

In [1]:
import numpy as np
import pandas as pd
a = pd.Series([12,-4,22,0])
a

0    12
1    -4
2    22
3     0
dtype: int64

In [2]:
# Replace the default labels with meaningful values
a = pd.Series([12,-4,22,0],index=['a','b','c','d'])
print(a)

a    12
b    -4
c    22
d     0
dtype: int64


In [3]:
print(a.values)
print(a.index)

[12 -4 22  0]
Index([u'a', u'b', u'c', u'd'], dtype='object')


### Selecting data

In [4]:
print(a.iloc[2])       # by position
print(a['b'])          # by label
print(a[0:2])          # positional slice
print(a[['b','d']])    # list of labels

22
-4
a    12
b    -4
dtype: int64
b   -4
d    0
dtype: int64


### Assigning values

In [5]:
a.iloc[2]=10
a['d']=-5

### Creating a Series from a NumPy array

In [6]:
arr =np.array([1,2,3,4])
s = pd.Series(arr)
s

0    1
1    2
2    3
3    4
dtype: int64

In [7]:
s1 = pd.Series(s)
s1[0]=-10
print(s)

0   -10
1     2
2     3
3     4
dtype: int64
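
Whether `pd.Series(s)` shares its data with `s`, as the output above suggests, may depend on the pandas version and its copy-on-write setting. A minimal sketch, assuming an independent Series is what is wanted (the names `original` and `independent` are illustrative only):

```python
import pandas as pd

original = pd.Series([1, 2, 3, 4])

# .copy() allocates new data, so edits to the copy never reach the original.
independent = original.copy()
independent[0] = -10

print(original.tolist())     # [1, 2, 3, 4]
print(independent.tolist())  # [-10, 2, 3, 4]
```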


### Filtering values

In [8]:
s[s>2]

2    3
3    4
dtype: int64

### Mathematical operations

In [9]:
s/2

0   -5.0
1    1.0
2    1.5
3    2.0
dtype: float64

In [10]:
np.log(s)

0         NaN
1    0.693147
2    1.098612
3    1.386294
dtype: float64

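### NaN
When a field is empty, or a result cannot be represented as a number (for example the logarithm of a negative value), pandas returns NaN (Not a Number).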

In [1]:
import numpy as np
import pandas as pd
s2 = pd.Series([5,-3,np.nan,14])
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [2]:
s2.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [4]:
s2.notnull()

0     True
1     True
2    False
3     True
dtype: bool

In [5]:
# Use the boolean mask as a filter condition
s2[s2.notnull()]

0     5.0
1    -3.0
3    14.0
dtype: float64

In [6]:
s2[s2.isnull()]

2   NaN
dtype: float64

### Creating a Series from a dictionary

In [8]:
mydic = {'red':200,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydic)
myseries

blue      1000
orange    1000
red        200
yellow     500
dtype: int64

In [9]:
colors = ['red','yellow','orange','blue','green']
mySeries = pd.Series(mydic, index=colors)
mySeries

red        200.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

### Operations between Series objects

In [10]:
mydict2 = {'red':400,'yellow':1000,'black':700}
mySeries2 = pd.Series(mydict2)
mySeries + mySeries2

black        NaN
blue         NaN
green        NaN
orange       NaN
red        600.0
yellow    1500.0
dtype: float64
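
Labels that appear in only one operand produce NaN. If a missing label should count as zero instead, `Series.add` accepts a `fill_value` argument. A minimal sketch with made-up data (`stock` and `delivery` are hypothetical names):

```python
import pandas as pd

stock = pd.Series({'red': 200, 'blue': 1000, 'yellow': 500})
delivery = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# The + operator yields NaN for labels present in only one operand.
plain_sum = stock + delivery

# add() with fill_value=0 treats a label missing on one side as 0.
filled_sum = stock.add(delivery, fill_value=0)
print(filled_sum)
```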

## DataFrame

A DataFrame is a tabular structure similar to an Excel spreadsheet. It was designed to extend the Series to multiple dimensions.

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6

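Unlike a Series, a DataFrame has two index arrays:
1. The first is associated with the rows and works like the index of a Series.
2. The second holds a set of labels, each associated with one column of data.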

In [3]:
import numpy as np
import pandas as pd
data = {'color':['blue','green','yellow'],'object':['ball','pen','pencil'],'price':[1.2,1.0,0.6]}
frame = pd.DataFrame(data)
frame

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6


You can also keep only the columns you are interested in.

In [3]:
frame2 = pd.DataFrame(data,columns=['object','price'])
frame2

index | object | price
--- | --- | ---
0 | ball | 1.2
1 | pen | 1.0
2 | pencil | 0.6


Setting the index labels

In [4]:
frame2 = pd.DataFrame(data,index=['one','two','three'])
frame2

index | color | object | price
--- | --- | --- | ---
one | blue | ball | 1.2
two | green | pen | 1.0
three | yellow | pencil | 0.6


Other ways to create a DataFrame

In [7]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame3

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
blue | 4 | 5 | 6 | 7
yellow | 8 | 9 | 10 | 11
white | 12 | 13 | 14 | 15


#### Selecting elements

In [9]:
frame.columns

Index([u'color', u'object', u'price'], dtype='object')

In [10]:
frame.index

RangeIndex(start=0, stop=3, step=1)

In [11]:
frame.values

array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6]], dtype=object)

In [12]:
frame['price']

0    1.2
1    1.0
2    0.6
Name: price, dtype: float64

In [13]:
frame.loc[2]

color     yellow
object    pencil
price        0.6
Name: 2, dtype: object

In [16]:
# Select several rows at once
frame.loc[[0,2]]

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
2 | yellow | pencil | 0.6


In [17]:
# Rows can be selected with slice syntax
frame[0:1]

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2


In [18]:
frame[0:3]

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6


In [19]:
# Select a single element
frame['object'][2]

'pencil'
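
Row selection can be done either by label or by integer position. The sketch below contrasts the two accessors, `loc` and `iloc`, on a small frame with string labels (the data is made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow'],
                      'price': [1.2, 1.0, 0.6]},
                     index=['one', 'two', 'three'])

# loc selects by label, iloc by integer position.
by_label = frame.loc['three']
by_position = frame.iloc[2]

# A single cell: row label and column label together.
cell = frame.loc['two', 'price']
print(by_label['color'], by_position['color'], cell)
```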

#### Assigning values

In [23]:
frame.index.name='id'
frame.columns.name='item'
frame

item | color | object | price
--- | --- | --- | ---
id | | |
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6


In [24]:
# Add a new column
frame['new']=12 # a scalar is used as the default value for every row
frame

item | color | object | price | new
--- | --- | --- | --- | ---
id | | | |
0 | blue | ball | 1.2 | 12
1 | green | pen | 1.0 | 12
2 | yellow | pencil | 0.6 | 12


In [25]:
frame['new'] = [3.0,1.3,2.2]
frame

item | color | object | price | new
--- | --- | --- | --- | ---
id | | | |
0 | blue | ball | 1.2 | 3.0
1 | green | pen | 1.0 | 1.3
2 | yellow | pencil | 0.6 | 2.2


In [26]:
# Alternatively, update the column from a Series
ser = pd.Series(np.arange(3))
frame['new']=ser
frame

item | color | object | price | new
--- | --- | --- | --- | ---
id | | | |
0 | blue | ball | 1.2 | 0
1 | green | pen | 1.0 | 1
2 | yellow | pencil | 0.6 | 2


#### Testing membership

In [30]:
frame.isin([1.0,'pen'])

item | color | object | price | new
--- | --- | --- | --- | ---
id | | | |
0 | False | False | False | False
1 | False | True | True | True
2 | False | False | False | False


In [31]:
frame[frame.isin([1.0,'pen'])]

item | color | object | price | new
--- | --- | --- | --- | ---
id | | | |
0 | NaN | NaN | NaN | NaN
1 | NaN | pen | 1.0 | 1.0
2 | NaN | NaN | NaN | NaN


#### Deleting a column

In [32]:
del frame['new']
frame

item | color | object | price
--- | --- | --- | ---
id | | |
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6


#### Filtering elements

In [33]:
frame[frame<3]

item | color | object | price
--- | --- | --- | ---
id | | |
0 | NaN | NaN | 1.2
1 | NaN | NaN | 1.0
2 | NaN | NaN | 0.6


#### Building a DataFrame from a nested dictionary
pandas interprets the outer keys as column names and the inner keys as index labels.

In [34]:
nestdict = {'red':{2012:22,2013:33},'white':{2011:13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:18}}
frame2 = pd.DataFrame(nestdict)
frame2

index | blue | red | white
--- | --- | --- | ---
2011 | 17 | NaN | 13
2012 | 27 | 22.0 | 22
2013 | 18 | 33.0 | 16


#### Transposing a DataFrame

In [36]:
frame2.T

index | 2011 | 2012 | 2013
--- | --- | --- | ---
blue | 17.0 | 27.0 | 18.0
red | NaN | 22.0 | 33.0
white | 13.0 | 22.0 | 16.0


### The Index object

In [3]:
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser.index

Index([u'red', u'blue', u'yellow', u'white', u'green'], dtype='object')

#### Methods that return index labels

In [4]:
ser.idxmin()

'blue'

In [5]:
ser.idxmax()

'white'

### An index with duplicate labels

In [6]:
serd = pd.Series(range(6),index=['white','white','blue','green','green','yellow'])
serd

white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64

In [7]:
serd['white']

white    0
white    1
dtype: int64

In [10]:
serd.index.is_unique

False

# Other uses of the index object
+ Reindexing
+ Dropping
+ Alignment

## Reindexing

In [11]:
ser = pd.Series([2,5,7,4],index=['one','two','three','four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

In [13]:
ser.reindex(['three','four','five','one'])

three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64

The reindex() call above drops the label 'two' and adds the label 'five', whose value is NaN.
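
If NaN is not wanted for newly introduced labels, `reindex` also takes a `fill_value` argument. A minimal sketch, assuming 0 is an acceptable default:

```python
import pandas as pd

ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])

# fill_value replaces the NaN that reindex would give to new labels.
reindexed = ser.reindex(['three', 'four', 'five', 'one'], fill_value=0)
print(reindexed)
```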

Filling missing labels automatically

In [5]:
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3

0    1
3    5
5    6
6    3
dtype: int64

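The index above is incomplete: the labels 1, 2 and 4 are missing. A common requirement is to fill these gaps to obtain a complete sequence, which the method option of reindex() controls.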

In [6]:
ser3.reindex(range(6),method='ffill')

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

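With method='ffill', every missing label takes the value of the nearest smaller label. To take the value of the nearest larger label instead, set method='bfill'.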

In [7]:
ser3.reindex(range(6),method='bfill')

0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64
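
Both `ffill` and `bfill` copy a neighbouring value. If a value between the neighbours is wanted instead, one possible approach (an assumption about the use case, not part of the original notebook) is to reindex without a method and then call `Series.interpolate`:

```python
import pandas as pd

ser = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# Step 1: reindex without a method, leaving NaN at the new labels.
sparse = ser.reindex(range(7))

# Step 2: fill the gaps linearly between the known neighbours.
linear = sparse.interpolate()
print(linear)
```

The default linear method treats the rows as equally spaced, so labels 1 and 2 are placed one third and two thirds of the way between the values at labels 0 and 3.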

## Dropping entries

In [8]:
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser

red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

In [9]:
ser.drop('yellow')

red      0.0
blue     1.0
white    3.0
dtype: float64

### Dropping from a DataFrame

In [11]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
frame

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
blue | 4 | 5 | 6 | 7
yellow | 8 | 9 | 10 | 11
white | 12 | 13 | 14 | 15


In [12]:
# Drop rows
frame.drop(['blue','yellow'])

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
white | 12 | 13 | 14 | 15


To drop columns, pass the column names and also set axis=1.

In [14]:
frame.drop(['pen','pencil'],axis=1)

index | ball | paper
--- | --- | ---
red | 0 | 3
blue | 4 | 7
yellow | 8 | 11
white | 12 | 15
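
`drop` also accepts a `columns=` keyword, which some readers find clearer than `axis=1`. The sketch below uses it and also checks that `drop` returns a new object instead of modifying the frame in place:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# columns=... is equivalent to passing the labels with axis=1.
trimmed = frame.drop(columns=['pen', 'pencil'])

# drop() returns a new object; the original frame keeps all four columns.
print(list(trimmed.columns))  # ['ball', 'paper']
print(list(frame.columns))    # ['ball', 'pen', 'pencil', 'paper']
```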


### Arithmetic and data alignment

In [15]:
s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
s1+s2

black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64

In [16]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index=['blue','green','white','yellow'],
                     columns=['mug','pen','ball'])
frame1+frame2

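index | ball | mug | paper | pen | pencil
--- | --- | --- | --- | --- | ---
blue | 6.0 | NaN | NaN | 6.0 | NaN
green | NaN | NaN | NaN | NaN | NaN
red | NaN | NaN | NaN | NaN | NaN
white | 20.0 | NaN | NaN | 20.0 | NaN
yellow | 19.0 | NaN | NaN | 19.0 | NaN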


# Operations between data structures

## Arithmetic methods
+ add()
+ sub()
+ div()
+ mul()

In [17]:
frame1.add(frame2)

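index | ball | mug | paper | pen | pencil
--- | --- | --- | --- | --- | ---
blue | 6.0 | NaN | NaN | 6.0 | NaN
green | NaN | NaN | NaN | NaN | NaN
red | NaN | NaN | NaN | NaN | NaN
white | 20.0 | NaN | NaN | 20.0 | NaN
yellow | 19.0 | NaN | NaN | 19.0 | NaN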


In [18]:
frame1.sub(frame2)

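index | ball | mug | paper | pen | pencil
--- | --- | --- | --- | --- | ---
blue | 2.0 | NaN | NaN | 4.0 | NaN
green | NaN | NaN | NaN | NaN | NaN
red | NaN | NaN | NaN | NaN | NaN
white | 4.0 | NaN | NaN | 6.0 | NaN
yellow | -3.0 | NaN | NaN | -1.0 | NaN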


In [19]:
frame1.div(frame2)

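index | ball | mug | paper | pen | pencil
--- | --- | --- | --- | --- | ---
blue | 2.0 | NaN | NaN | 5.0 | NaN
green | NaN | NaN | NaN | NaN | NaN
red | NaN | NaN | NaN | NaN | NaN
white | 1.5 | NaN | NaN | 1.857143 | NaN
yellow | 0.727273 | NaN | NaN | 0.9 | NaN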


In [20]:
frame1.mul(frame2)

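index | ball | mug | paper | pen | pencil
--- | --- | --- | --- | --- | ---
blue | 8.0 | NaN | NaN | 5.0 | NaN
green | NaN | NaN | NaN | NaN | NaN
red | NaN | NaN | NaN | NaN | NaN
white | 96.0 | NaN | NaN | 91.0 | NaN
yellow | 88.0 | NaN | NaN | 90.0 | NaN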
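
One practical difference between the methods and the plain operators is that the methods accept a `fill_value`. The sketch below, with two small made-up frames (`left` and `right` are hypothetical names), shows how it changes the result:

```python
import pandas as pd

left = pd.DataFrame({'ball': [0, 4], 'pen': [1, 5]}, index=['red', 'blue'])
right = pd.DataFrame({'ball': [2, 8], 'mug': [0, 6]}, index=['blue', 'white'])

# Operator form: any cell missing from either frame becomes NaN.
plain = left + right

# Method form: a cell missing from one frame counts as 0;
# a cell missing from both frames is still NaN.
filled = left.add(right, fill_value=0)
print(filled)
```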


## Operations between a DataFrame and a Series

In [21]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
frame

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
blue | 4 | 5 | 6 | 7
yellow | 8 | 9 | 10 | 11
white | 12 | 13 | 14 | 15


In [22]:
ser = pd.Series(np.arange(4),index=['ball','pen','pencil','paper'])
ser

ball      0
pen       1
pencil    2
paper     3
dtype: int32

In [23]:
frame - ser

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 0 | 0 | 0
blue | 4 | 4 | 4 | 4
yellow | 8 | 8 | 8 | 8
white | 12 | 12 | 12 | 12


In [24]:
ser['mug']=9
frame - ser

index | ball | mug | paper | pen | pencil
--- | --- | --- | --- | --- | ---
red | 0 | NaN | 0 | 0 | 0
blue | 4 | NaN | 4 | 4 | 4
yellow | 8 | NaN | 8 | 8 | 8
white | 12 | NaN | 12 | 12 | 12
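
In the examples above the Series index is matched against the column labels, so the Series is subtracted from every row. To match on the row labels instead, the arithmetic methods take an `axis` argument. A minimal sketch:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Default: the Series index is matched against the column labels.
row_wise = frame - frame.loc['red']

# axis=0 matches the Series index against the row labels instead.
col_wise = frame.sub(frame['ball'], axis=0)
print(col_wise)
```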


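# Function application and mapping

## Element-wise functions
NumPy's universal functions (ufuncs), which operate element by element, can be applied directly to pandas objects.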

In [25]:
frame

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
blue | 4 | 5 | 6 | 7
yellow | 8 | 9 | 10 | 11
white | 12 | 13 | 14 | 15


In [26]:
np.sqrt(frame)

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0.0 | 1.0 | 1.414214 | 1.732051
blue | 2.0 | 2.236068 | 2.44949 | 2.645751
yellow | 2.828427 | 3.0 | 3.162278 | 3.316625
white | 3.464102 | 3.605551 | 3.741657 | 3.872983


## Applying functions by row or column

In [27]:
f = lambda x:x.max()-x.min()
frame.apply(f)

ball      12
pen       12
pencil    12
paper     12
dtype: int64

In [28]:
# Apply the function to each row
frame.apply(f, axis=1)

red       3
blue      3
yellow    3
white     3
dtype: int64

The function passed to apply() does not have to return a scalar; it can also return a Series.

In [29]:
def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
min | 0 | 1 | 2 | 3
max | 12 | 13 | 14 | 15


In [30]:
frame.apply(f,axis=1)

index | min | max
--- | --- | ---
red | 0 | 3
blue | 4 | 7
yellow | 8 | 11
white | 12 | 15
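
`apply` hands a whole row or column to the function. For a plain Python function that should run once per element, `Series.map` is one option. A small sketch with made-up prices (the threshold and labels are illustrative only):

```python
import pandas as pd

prices = pd.Series([1.2, 1.0, 0.6], index=['ball', 'pen', 'pencil'])

# map() calls the function once per element and keeps the index.
labels = prices.map(lambda p: 'cheap' if p < 1.0 else 'regular')
print(labels)
```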


## Statistical functions

In [31]:
frame.sum()

ball      24
pen       28
pencil    32
paper     36
dtype: int64

In [33]:
frame.mean()

ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

In [34]:
frame.describe()

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
count | 4.0 | 4.0 | 4.0 | 4.0
mean | 6.0 | 7.0 | 8.0 | 9.0
std | 5.163978 | 5.163978 | 5.163978 | 5.163978
min | 0.0 | 1.0 | 2.0 | 3.0
25% | 3.0 | 4.0 | 5.0 | 6.0
50% | 6.0 | 7.0 | 8.0 | 9.0
75% | 9.0 | 10.0 | 11.0 | 12.0
max | 12.0 | 13.0 | 14.0 | 15.0


# Sorting

## Series
A Series has a single index, so sort_index() sorts by its labels.

In [38]:
ser =pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [39]:
ser.sort_index()

blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

## Sorting a DataFrame

A DataFrame can be sorted along either axis. sort_index() sorts by the row index by default; pass axis=1 to sort by the column labels.

In [40]:
frame

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
blue | 4 | 5 | 6 | 7
yellow | 8 | 9 | 10 | 11
white | 12 | 13 | 14 | 15


In [41]:
frame.sort_index()

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
blue | 4 | 5 | 6 | 7
red | 0 | 1 | 2 | 3
white | 12 | 13 | 14 | 15
yellow | 8 | 9 | 10 | 11


In [42]:
frame.sort_index(axis=1)

index | ball | paper | pen | pencil
--- | --- | --- | --- | ---
red | 0 | 3 | 1 | 2
blue | 4 | 7 | 5 | 6
yellow | 8 | 11 | 9 | 10
white | 12 | 15 | 13 | 14


## Sorting by values

In [46]:
ser.sort_values()

blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

In [47]:
frame.sort_values(by='pen')

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
blue | 4 | 5 | 6 | 7
yellow | 8 | 9 | 10 | 11
white | 12 | 13 | 14 | 15


In [48]:
# Sort by multiple columns
frame.sort_values(by=['pen','pencil'])

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
blue | 4 | 5 | 6 | 7
yellow | 8 | 9 | 10 | 11
white | 12 | 13 | 14 | 15
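
Because the frame above is already in ascending order, sorting it leaves the rows unchanged. The sketch below uses made-up unsorted data to show a multi-column sort with a different direction per column:

```python
import pandas as pd

frame = pd.DataFrame({'pen': [5, 1, 5, 3], 'pencil': [2, 9, 1, 4]},
                     index=['red', 'blue', 'yellow', 'white'])

# Sort by 'pen' descending; ties are broken by 'pencil' ascending.
ordered = frame.sort_values(by=['pen', 'pencil'], ascending=[False, True])
print(ordered)
```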


Ranking is closely related to sorting: it assigns each element of the sequence a position, starting from 1.

In [49]:
ser.rank()

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [50]:
ser.rank(method='first')

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [52]:
# Rank in descending order
ser.rank(ascending=False)

red       2.0
blue      5.0
yellow    4.0
white     1.0
green     3.0
dtype: float64
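
The series above has no repeated values, so `rank()` and `rank(method='first')` give the same result. The `method` option only matters when there are ties, as in this sketch with made-up scores:

```python
import pandas as pd

scores = pd.Series([7, 3, 7, 1], index=['a', 'b', 'c', 'd'])

average = scores.rank()               # ties share the mean of their positions
first = scores.rank(method='first')   # ties ranked in order of appearance
lowest = scores.rank(method='min')    # ties all get the lowest position
print(average.tolist(), first.tolist(), lowest.tolist())
```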

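# Correlation and covariance

Correlation and covariance are two important statistics; pandas computes them with corr() and cov().
+ Variance
$$D(X)=E([X - E(X)]^2)$$
+ Covariance
$$COV(X,Y)=E([X-E(X)][Y-E(Y)])$$
+ Correlation coefficient
$$\rho(X,Y)=\frac{COV(X,Y)}{\sqrt{D(X)} \times \sqrt{D(Y)}}$$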

In [53]:
seq1 = pd.Series([3,4,3,4,5,4,3,2],index=['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2 = pd.Series([1,2,3,4,4,3,2,1],index=['2006','2007','2008','2009','2010','2011','2012','2013'])
print(seq1.corr(seq2))
print(seq1.cov(seq2))

0.774596669241
0.857142857143
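
To connect these numbers to the formulas, the sketch below recomputes both statistics by hand. It assumes the sample estimators pandas uses by default, which divide by n - 1 instead of taking the population expectation:

```python
import numpy as np
import pandas as pd

x = pd.Series([3, 4, 3, 4, 5, 4, 3, 2])
y = pd.Series([1, 2, 3, 4, 4, 3, 2, 1])

# Sample covariance: products of deviations, divided by n - 1.
dx, dy = x - x.mean(), y - y.mean()
cov_manual = (dx * dy).sum() / (len(x) - 1)

# Correlation: covariance divided by both sample standard deviations.
corr_manual = cov_manual / (x.std() * y.std())

print(np.isclose(cov_manual, x.cov(y)), np.isclose(corr_manual, x.corr(y)))
```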


In [54]:
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame2

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 1 | 4 | 3 | 6
blue | 4 | 5 | 6 | 1
yellow | 3 | 3 | 1 | 5
white | 4 | 1 | 6 | 4


In [61]:
frame2.corr()

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
ball | 1.0 | -0.276026 | 0.57735 | -0.763763
pen | -0.276026 | 1.0 | -0.079682 | -0.361403
pencil | 0.57735 | -0.079682 | 1.0 | -0.692935
paper | -0.763763 | -0.361403 | -0.692935 | 1.0


In [62]:
frame2.cov()

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
ball | 2.0 | -0.666667 | 2.0 | -2.333333
pen | -0.666667 | 2.916667 | -0.333333 | -1.333333
pencil | 2.0 | -0.333333 | 6.0 | -3.666667
paper | -2.333333 | -1.333333 | -3.666667 | 4.666667


In [64]:
frame2.corrwith(ser)

ball     -0.140028
pen      -0.869657
pencil    0.080845
paper     0.595854
dtype: float64

In [65]:
frame2.corrwith(frame)

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

# NaN data

In [66]:
ser = pd.Series([0,1,2,np.nan,9],index=['red','blue','yellow','white','green'])
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

In [69]:
ser['green']=None
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     NaN
dtype: float64

## Filtering out NaN

In [70]:
ser.dropna()

red       0.0
blue      1.0
yellow    2.0
dtype: float64

In [71]:
ser[ser.notnull()]

red       0.0
blue      1.0
yellow    2.0
dtype: float64

In [72]:
frame3 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                     index=['blue','green','red'],
                     columns=['ball','mug','pen'])
frame3

index | ball | mug | pen
--- | --- | --- | ---
blue | 6.0 | NaN | 6.0
green | NaN | NaN | NaN
red | 2.0 | NaN | 5.0


By default, dropna() on a DataFrame removes every row that contains at least one NaN; with axis=1 it removes every such column instead.

In [73]:
frame3.dropna()

index | ball | mug | pen
--- | --- | --- | ---


In [76]:
# Better: drop only the rows in which every value is NaN
frame3.dropna(how='all')

index | ball | mug | pen
--- | --- | --- | ---
blue | 6.0 | NaN | 6.0
red | 2.0 | NaN | 5.0
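
The same `how='all'` option works along the column axis. The sketch below chains both calls so that the all-NaN row and the all-NaN column are removed (`cleaned` is an illustrative name):

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# Drop rows that are entirely NaN, then columns that are entirely NaN.
cleaned = frame3.dropna(how='all').dropna(axis=1, how='all')
print(cleaned)
```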


## Filling NaN values

In [75]:
frame3.fillna(0)

index | ball | mug | pen
--- | --- | --- | ---
blue | 6.0 | 0.0 | 6.0
green | 0.0 | 0.0 | 0.0
red | 2.0 | 0.0 | 5.0


In [77]:
frame3.fillna({'ball':1,'mug':0,'pen':99})

index | ball | mug | pen
--- | --- | --- | ---
blue | 6.0 | 0.0 | 6.0
green | 1.0 | 0.0 | 99.0
red | 2.0 | 0.0 | 5.0


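# Hierarchical indexing and levels
Hierarchical indexing is an important pandas feature: a single axis can carry several index levels, so multi-dimensional data can be handled as if it were two-dimensional.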

In [79]:
mser = pd.Series(np.random.rand(8),
                index=[['white','white','white','blue','blue','red','red','red'],
                      ['up','down','right','up','down','up','down','left']])
mser

white  up       0.913087
       down     0.055399
       right    0.629899
blue   up       0.310707
       down     0.638752
red    up       0.472957
       down     0.211078
       left     0.984104
dtype: float64

In [81]:
mser.index

MultiIndex(levels=[[u'blue', u'red', u'white'], [u'down', u'left', u'right', u'up']],
           labels=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])

In [82]:
mser['white']

up       0.913087
down     0.055399
right    0.629899
dtype: float64

In [83]:
mser[:,'up']

white    0.913087
blue     0.310707
red      0.472957
dtype: float64

In [85]:
mser['white','up']

0.91308667237103303

unstack() converts the Series into a DataFrame, turning the second index level into columns.

In [86]:
mser.unstack()

index | down | left | right | up
--- | --- | --- | --- | ---
blue | 0.638752 | NaN | NaN | 0.310707
red | 0.211078 | 0.984104 | NaN | 0.472957
white | 0.055399 | NaN | 0.629899 | 0.913087


The inverse operation, stack(), converts a DataFrame into a Series.

In [87]:
frame.stack()

red     ball       0
        pen        1
        pencil     2
        paper      3
blue    ball       4
        pen        5
        pencil     6
        paper      7
yellow  ball       8
        pen        9
        pencil    10
        paper     11
white   ball      12
        pen       13
        pencil    14
        paper     15
dtype: int32
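
The two operations can be chained to see that they undo each other. A small sketch on a made-up 2x3 frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['red', 'blue'],
                     columns=['ball', 'pen', 'paper'])

# stack() moves the column labels into an inner index level.
stacked = frame.stack()

# unstack() pivots that inner level back into columns.
restored = stacked.unstack()
print(restored)
```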

In a DataFrame, hierarchical indexes can be defined on both the rows and the columns.

In [88]:
mframe = pd.DataFrame(np.random.randn(16).reshape((4,4)),
                     index=[['white','white','red','red'],['up','down','up','down']],
                     columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe

| | | pen | pen | paper | paper |
| --- | --- | --- | --- | --- | --- |
| | | 1 | 2 | 1 | 2 |
| white | up | -1.729715 | 1.571391 | -1.139209 | 1.28939 |
| white | down | 0.560778 | -2.16403 | 0.162912 | -0.286249 |
| red | up | 0.828617 | -1.035014 | -1.417486 | 1.044125 |
| red | down | 0.416866 | -0.513328 | -0.439312 | -0.11989 |


## Reordering and sorting levels

In [91]:
# Name the levels of the index and of the columns
mframe.columns.names=['object','id']
mframe.index.names=['colors','status']
mframe

| | object | pen | pen | paper | paper |
| --- | --- | --- | --- | --- | --- |
| | id | 1 | 2 | 1 | 2 |
| colors | status | | | | |
| white | up | -1.729715 | 1.571391 | -1.139209 | 1.28939 |
| white | down | 0.560778 | -2.16403 | 0.162912 | -0.286249 |
| red | up | 0.828617 | -1.035014 | -1.417486 | 1.044125 |
| red | down | 0.416866 | -0.513328 | -0.439312 | -0.11989 |


In [92]:
mframe.swaplevel('colors','status')

| | object | pen | pen | paper | paper |
| --- | --- | --- | --- | --- | --- |
| | id | 1 | 2 | 1 | 2 |
| status | colors | | | | |
| up | white | -1.729715 | 1.571391 | -1.139209 | 1.28939 |
| down | white | 0.560778 | -2.16403 | 0.162912 | -0.286249 |
| up | red | 0.828617 | -1.035014 | -1.417486 | 1.044125 |
| down | red | 0.416866 | -0.513328 | -0.439312 | -0.11989 |


In [93]:
mframe.sort_index(level='colors')

| | object | pen | pen | paper | paper |
| --- | --- | --- | --- | --- | --- |
| | id | 1 | 2 | 1 | 2 |
| colors | status | | | | |
| red | down | 0.416866 | -0.513328 | -0.439312 | -0.11989 |
| red | up | 0.828617 | -1.035014 | -1.417486 | 1.044125 |
| white | down | 0.560778 | -2.16403 | 0.162912 | -0.286249 |
| white | up | -1.729715 | 1.571391 | -1.139209 | 1.28939 |


## Summary statistics by level

In [94]:
mframe.groupby(level='colors').sum()

| object | pen | pen | paper | paper |
| --- | --- | --- | --- | --- |
| id | 1 | 2 | 1 | 2 |
| colors | | | | |
| red | 1.245484 | -1.548342 | -1.856798 | 0.924236 |
| white | -1.168937 | -0.592638 | -0.976297 | 1.003141 |


To aggregate over a level of the columns instead:

In [95]:
mframe.T.groupby(level='id').sum().T

| | id | 1 | 2 |
| --- | --- | --- | --- |
| colors | status | | |
| white | up | -2.868924 | 2.860781 |
| white | down | 0.72369 | -2.450278 |
| red | up | -0.588869 | 0.009111 |
| red | down | -0.022446 | -0.633218 |
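
A deterministic sketch of level-wise aggregation with `groupby(level=...)`, using made-up integer data so the sums are easy to check by hand (`sales` and `per_color` are hypothetical names):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('white', 'up'), ('white', 'down'), ('red', 'up'), ('red', 'down')],
    names=['colors', 'status'])
sales = pd.DataFrame({'pen': [1, 2, 3, 4], 'paper': [10, 20, 30, 40]},
                     index=index)

# Sum the rows that share the same label on the 'colors' level.
per_color = sales.groupby(level='colors').sum()
print(per_color)
```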
