Pandas Data Wrangling - Transformation - Hierarchical Indexing
===

---

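Hierarchical indexing lets you have more than one index level on a single axis.

A multi-level index raises the dimensionality a structure can represent, so higher-dimensional data can be handled in a lower-dimensional form.

A Series or DataFrame with a multi-level index (MultiIndex) can therefore store information with two, three, or more dimensions.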

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1    0.025268
   2   -1.689894
   3   -0.149434
b  1   -0.664289
   3   -1.693823
c  1    0.031938
   2    0.640842
d  2    0.315597
   3    0.416073
dtype: float64

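A MultiIndex object holds two kinds of information: levels (the unique labels of each index level) and codes (for every row, the integer position of its label within the level). Older pandas versions called codes labels.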

In [3]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
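
To make the relationship between levels and codes concrete, the sketch below rebuilds each row's labels by looking up the codes in the levels. The small index is made up for illustration and is not the notebook's data.

```python
import pandas as pd

# Hypothetical two-level index, smaller than the one in the notebook
idx = pd.MultiIndex.from_arrays([['a', 'a', 'b'], [1, 2, 2]])

# levels: the unique labels of each level
# codes: for every row, the position of its label inside that level
rebuilt = [
    tuple(level[int(code)] for level, code in zip(idx.levels, row_codes))
    for row_codes in zip(*idx.codes)
]

print(rebuilt)
```

Storing each label once in levels and referring to it by position is what allows the display to omit repeated outer labels.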

One-dimensional data with a hierarchical index can represent two-dimensional data.

In [4]:
data.unstack()

Unnamed: 0,1,2,3
a,0.025268,-1.689894,-0.149434
b,-0.664289,,-1.693823
c,0.031938,0.640842,
d,,0.315597,0.416073


Querying a Series with a hierarchical index
---

In [5]:
data

a  1    0.025268
   2   -1.689894
   3   -0.149434
b  1   -0.664289
   3   -1.693823
c  1    0.031938
   2    0.640842
d  2    0.315597
   3    0.416073
dtype: float64

In [10]:
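# Positional indexing; iloc makes this explicit (integer keys in plain [] are deprecated as positions in recent pandas)
data.iloc[0]  # single value
data.iloc[[0, 3]]  # several values
data.iloc[1:4]  # slice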

a  2   -1.689894
   3   -0.149434
b  1   -0.664289
dtype: float64

In [14]:
# Label-based indexing; the outer level is queried by default
data['a']  # single label
data[['a', 'c']]  # several labels
data['a':'c']  # slice

a  1    0.025268
   2   -1.689894
   3   -0.149434
b  1   -0.664289
   3   -1.693823
c  1    0.031938
   2    0.640842
dtype: float64

In [19]:
# Label-based indexing, more complex queries
# Using loc
data.loc['a']  # outer level
data.loc['a', 2]  # outer level, inner level
data.loc[:, 2]  # every outer label, inner label 2

a   -1.689894
c    0.640842
d    0.315597
dtype: float64
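
Selecting on the inner level can also be written as a cross-section. The sketch below assumes the same index layout as the notebook's Series but uses fixed values instead of random ones, so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Same index layout as the notebook's Series, with fixed values 0.0 .. 8.0
data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

by_loc = data.loc[:, 2]      # every outer label, inner label 2
by_xs = data.xs(2, level=1)  # cross-section on the inner level

print(by_xs)
```

Both forms drop the inner level from the result; xs() may read more clearly when the level is referred to by position or name.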

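#### Converting a hierarchically indexed Series to a DataFrame

A Series with a two-level index carries the same dimensionality as a DataFrame; only the presentation differs.

Reshaping changes that presentation by moving labels between the row index and the column index.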

In [20]:
data

a  1    0.025268
   2   -1.689894
   3   -0.149434
b  1   -0.664289
   3   -1.693823
c  1    0.031938
   2    0.640842
d  2    0.315597
   3    0.416073
dtype: float64

In [21]:
data.unstack()

Unnamed: 0,1,2,3
a,0.025268,-1.689894,-0.149434
b,-0.664289,,-1.693823
c,0.031938,0.640842,
d,,0.315597,0.416073


In [25]:
data.unstack().stack()

a  1    0.025268
   2   -1.689894
   3   -0.149434
b  1   -0.664289
   3   -1.693823
c  1    0.031938
   2    0.640842
d  2    0.315597
   3    0.416073
dtype: float64

In [26]:
# Label-based indexing on the unstacked DataFrame
data.unstack()[1]  # select a column
data.unstack().loc['a']  # select a row
data.unstack().loc['a', 2]  # row, column

-1.689893771119204

Swapping the order of index levels

In [31]:
# Swapping for a DataFrame: transpose
data.unstack()
data.unstack().T

Unnamed: 0,a,b,c,d
1,0.025268,-0.664289,0.031938,
2,-1.689894,,0.640842,0.315597
3,-0.149434,-1.693823,,0.416073


In [37]:
# Swapping for a Series
data

data.unstack().T.stack()
data.unstack().unstack().dropna()

1  a    0.025268
   b   -0.664289
   c    0.031938
2  a   -1.689894
   c    0.640842
   d    0.315597
3  a   -0.149434
   b   -1.693823
   d    0.416073
dtype: float64
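
If the goal is only to reverse the level order of a Series, swaplevel() followed by sort_index() should give the same arrangement without a round trip through a DataFrame. The sketch below assumes the notebook's index layout with fixed values in place of the random ones.

```python
import numpy as np
import pandas as pd

# Same index layout as the notebook's Series, with fixed values 0.0 .. 8.0
data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# Swap the two levels, then sort so rows are grouped by the new outer level
swapped = data.swaplevel().sort_index()

print(swapped)
```

swaplevel() on its own only relabels the rows; the sort is what regroups them under the new outer level.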

Hierarchical indexing for DataFrames
---

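Turning a two-dimensional DataFrame into a multi-dimensional one.

### Indexing with a DataFrame's columns or rows (important)

Use set_index() to turn one or more columns of a DataFrame into the row index, and reset_index() to turn the row index back into columns.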

In [38]:
frame2 = pd.DataFrame(
    {'a': range(7), 'b': range(7, 0, -1),
     'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
     'd': [0, 1, 2, 0, 1, 2, 3]}
)
frame2

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


#### Turning ordinary columns into the row index

In [42]:
frame2.set_index('a')
frame3 = frame2.set_index(['c', 'd'])
frame3

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [43]:
frame2.set_index(['c', 'd'], append=True)  # append to the existing index instead of replacing it

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,a,b
Unnamed: 0_level_1,c,d,Unnamed: 3_level_1,Unnamed: 4_level_1
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


In [44]:
frame2.set_index(['c', 'd'], drop=False)  # keep the original columns after moving them into the index

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


#### Turning the row index back into ordinary columns

In [45]:
frame3

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [46]:
frame3.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
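
As a quick check that set_index() and reset_index() undo each other, the sketch below runs the round trip on the notebook's frame2 and compares the result with the original. The only difference expected is column order, which is restored explicitly.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {'a': range(7), 'b': range(7, 0, -1),
     'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
     'd': [0, 1, 2, 0, 1, 2, 3]}
)

indexed = frame2.set_index(['c', 'd'])  # columns c and d become index levels
restored = indexed.reset_index()        # index levels become columns again

# reset_index() puts the former index levels first, so restore the original order
restored = restored[frame2.columns]

print(restored.equals(frame2))
```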


#### Converting the column index

In [47]:
frame2

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [51]:
# Transpose the table, modify the row index, then transpose back
frame2.T.set_index(0, append=True).T

Unnamed: 0_level_0,a,b,c,d
Unnamed: 0_level_1,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


### Building a hierarchical index by assigning row and column index values

In [52]:
symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df2 = pd.DataFrame(data, index=symbol)

df2.name='美股'
df2.index.name = '代号'
df2

Unnamed: 0_level_0,行业,价格,交易量,雇员
代号,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BABA,电商,176.92,16175610,101550
JD,电商,25.95,27113291,175336
AAPL,科技,172.97,18913154,100000
MS,金融,41.79,10132145,60348
GS,金融,196.0,2626634,36600
WMT,零售,99.55,8086946,2200000


In [54]:
df2.index
df2.columns

Index(['行业', '价格', '交易量', '雇员'], dtype='object')

Building a hierarchical index by assigning to the index

In [56]:
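# Making the row index hierarchical

# Wrong: assigning a plain list of tuples does not create a hierarchy; the result is a flat Index whose labels are tuples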
df2.index
df2.index = [('中国公司','BABA'), ('中国公司','JD'), ('美国公司','AAPL'), ('美国公司','MS'), ('美国公司','GS'), ('美国公司','WMT')]
df2

Unnamed: 0,行业,价格,交易量,雇员
"(中国公司, BABA)",电商,176.92,16175610,101550
"(中国公司, JD)",电商,25.95,27113291,175336
"(美国公司, AAPL)",科技,172.97,18913154,100000
"(美国公司, MS)",金融,41.79,10132145,60348
"(美国公司, GS)",金融,196.0,2626634,36600
"(美国公司, WMT)",零售,99.55,8086946,2200000


In [59]:
df2.index

Index([('中国公司', 'BABA'),   ('中国公司', 'JD'), ('美国公司', 'AAPL'),   ('美国公司', 'MS'),
         ('美国公司', 'GS'),  ('美国公司', 'WMT')],
      dtype='object')

In [62]:
# df2.loc["('中国公司', 'BABA')"]  # fails: the row cannot be looked up this way
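
An index that already holds tuple labels can be converted rather than retyped. The sketch below uses a hypothetical miniature of the table (the column name price and the country codes are assumptions for illustration) and passes the existing index to pd.MultiIndex.from_tuples.

```python
import pandas as pd

# Hypothetical miniature of the notebook's table
df = pd.DataFrame({'price': [176.92, 25.95, 172.97]})
df.index = [('CN', 'BABA'), ('CN', 'JD'), ('US', 'AAPL')]

print(type(df.index).__name__)  # class of the index after plain assignment

# Convert the tuple labels into a real two-level index
df.index = pd.MultiIndex.from_tuples(df.index, names=['country', 'company'])

print(df.loc['CN'])  # selecting on the outer level now works
```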

In [65]:
# Correct: assign a MultiIndex
df2.index = pd.MultiIndex.from_tuples(
    [('中国公司','BABA'), ('中国公司','JD'), ('美国公司','AAPL'), ('美国公司','MS'), ('美国公司','GS'), ('美国公司','WMT')],
    names=('country', 'company'),
)
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,行业,价格,交易量,雇员
country,company,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
中国公司,BABA,电商,176.92,16175610,101550
中国公司,JD,电商,25.95,27113291,175336
美国公司,AAPL,科技,172.97,18913154,100000
美国公司,MS,金融,41.79,10132145,60348
美国公司,GS,金融,196.0,2626634,36600
美国公司,WMT,零售,99.55,8086946,2200000


In [66]:
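# Columns also accept a MultiIndex by direct assignment; here the frame is transposed so the row-index approach above can be reused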
df2x = df2.T
df2x

country,中国公司,中国公司,美国公司,美国公司,美国公司,美国公司
company,BABA,JD,AAPL,MS,GS,WMT
行业,电商,电商,科技,金融,金融,零售
价格,176.92,25.95,172.97,41.79,196,99.55
交易量,16175610,27113291,18913154,10132145,2626634,8086946
雇员,101550,175336,100000,60348,36600,2200000


In [68]:
df2x.index = pd.MultiIndex.from_tuples([('aaa', '行业'),('aaa','价格'),('bbb','交易量'),('bbb','雇员')], names=('ab', '情况'))
df2x.T

Unnamed: 0_level_0,ab,aaa,aaa,bbb,bbb
Unnamed: 0_level_1,情况,行业,价格,交易量,雇员
country,company,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
中国公司,BABA,电商,176.92,16175610,101550
中国公司,JD,电商,25.95,27113291,175336
美国公司,AAPL,科技,172.97,18913154,100000
美国公司,MS,金融,41.79,10132145,60348
美国公司,GS,金融,196.0,2626634,36600
美国公司,WMT,零售,99.55,8086946,2200000
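
The transpose is not strictly needed: a MultiIndex can be assigned to the columns directly, which also avoids the conversion to object dtype that transposing a mixed-type frame generally causes. The sketch below uses a hypothetical two-row table; the group labels aaa and bbb follow the notebook, while the other names are assumptions.

```python
import pandas as pd

# Hypothetical miniature of the notebook's table
df = pd.DataFrame(
    {'price': [176.92, 25.95], 'volume': [16175610, 27113291]},
    index=['BABA', 'JD'],
)

# Assign a two-level index to the columns directly, with no transpose
df.columns = pd.MultiIndex.from_tuples(
    [('aaa', 'price'), ('bbb', 'volume')],
    names=['group', 'field'],
)

print(df.dtypes)  # numeric dtypes are kept
```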


In [72]:
df2x.T.index

MultiIndex(levels=[['中国公司', '美国公司'], ['AAPL', 'BABA', 'GS', 'JD', 'MS', 'WMT']],
           codes=[[0, 0, 1, 1, 1, 1], [1, 3, 0, 4, 2, 5]],
           names=['country', 'company'])

### Creating a multi-index DataFrame directly (for reference)

In [70]:
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']],
)

frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [73]:
# Alternatively, build the MultiIndex objects explicitly
midx = pd.MultiIndex(
    levels=[['a', 'b'], [1, 2]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
    names=['key1', 'key2']
)
mcol = pd.MultiIndex(
    levels=[[ 'Colorado', 'Ohio'], ['Green', 'Red']],
    codes=[[1, 1, 0],[0, 1, 0]],
    names=['state', 'color']
)

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=midx,
    columns=mcol,
)

frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [74]:
frame.index
frame.columns

MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
           codes=[[1, 1, 0], [0, 1, 0]],
           names=['state', 'color'])
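
Writing levels and codes by hand is error-prone, so the constructor helpers are often more convenient. The sketch below builds what should be the same frame with pd.MultiIndex.from_product for the rows and pd.MultiIndex.from_arrays for the columns.

```python
import numpy as np
import pandas as pd

# Row index: every combination of the two key lists
midx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['key1', 'key2'])

# Column index: only three of the four possible pairs, so list them explicitly
mcol = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'],
)

frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index=midx, columns=mcol)
print(frame)
```

from_product fits when every combination of labels occurs; from_arrays or from_tuples fits when only some combinations do.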

### Querying a DataFrame with a hierarchical index

In [75]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [77]:
frame['Ohio']  # outer column level
frame['Ohio', 'Green']  # outer column, inner column

key1  key2
a     1       0
      2       3
b     1       6
      2       9
Name: (Ohio, Green), dtype: int32

In [82]:
frame.loc['a']  # outer row level
frame.loc['a', 2]  # outer row, inner row
frame.loc['a', 'Ohio']  # outer row, outer column

color,Green,Red
key2,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0,1
2,3,4


Combined usage

In [83]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [88]:
# Outer row, inner row, then outer column, inner column
frame.loc['a', 1]['Ohio', 'Red']

# Outer row, outer column, then inner row, inner column
frame.loc['a', 'Ohio'].loc[1, 'Red']

1
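
The chained lookups above can also be written as a single loc call, with one tuple for the row levels and one for the column levels. This is a sketch on a frame rebuilt with the same shape and labels as the notebook's.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color'],
    ),
)

# One tuple per axis: (outer row, inner row), (outer column, inner column)
value = frame.loc[('a', 1), ('Ohio', 'Red')]

# A full row key alone returns that row across all columns
row = frame.loc[('b', 2)]

print(value)
```

Writing the keys as explicit tuples removes the ambiguity over whether a second argument refers to an inner row level or to a column.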

### Reordering and sorting by level

You can rearrange the order of the levels on an axis, or sort the data by the values of a specific level.

In [89]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [95]:
frame.swaplevel()  # swaps the row index levels by default

frame.swaplevel('key2', 'key1')  # name the levels to swap
frame.swaplevel('key1', 'key2')  # argument order does not matter

frame.swaplevel(0, 1)  # or give the level positions
frame.swaplevel(1, 0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [91]:
frame.swaplevel(axis=1)  # swap the column index levels

Unnamed: 0_level_0,color,Green,Red,Green
Unnamed: 0_level_1,state,Ohio,Ohio,Colorado
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


#### Sorting by index level

In [96]:
frame.sort_index(ascending=False)  # sort rows by index (the outermost level takes priority)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
b,2,9,10,11
b,1,6,7,8
a,2,3,4,5
a,1,0,1,2


In [97]:
frame.sort_index(ascending=False, level='key2')  # sort by a specific index level

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
b,2,9,10,11
a,2,3,4,5
b,1,6,7,8
a,1,0,1,2


In [98]:
frame.sort_index(ascending=False, axis=1)  # sort the column index

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,1,0,2
a,2,4,3,5
b,1,7,6,8
b,2,10,9,11
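
Reordering and sorting are often combined: swaplevel() changes which level is outermost but leaves the rows where they are, so a sort_index() afterwards regroups them. A minimal sketch on a frame rebuilt with the notebook's labels:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color'],
    ),
)

# Make key2 the outer level, then sort so rows with the same key2 sit together
by_key2 = frame.swaplevel('key1', 'key2').sort_index()

print(by_key2)
```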


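### Summary statistics by level

In older pandas releases, many descriptive and summary statistics methods on DataFrame and Series accepted a level option that specified the level to aggregate over on an axis. That option was deprecated in pandas 1.3 and removed in 2.0.

It was a thin wrapper around pandas' groupby, so groupby(level=...) is the equivalent to use today.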

In [99]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [100]:
frame.sum()  # sum down the rows (axis=0): one total per column

state     color
Ohio      Green    18
          Red      22
Colorado  Green    26
dtype: int64

In [101]:
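# frame.sum(level='key1')  # pandas < 2.0 only: the level argument was removed in 2.0
frame.groupby(level='key1').sum()  # sum within each key1 group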

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,3,5,7
b,15,17,19


In [106]:
# Using groupby

# Group by the outer row level
frame.groupby('key1').sum()  # key given as the index level name
frame.groupby(level='key1').sum()  # key given through level
frame.groupby(['a', 'a', 'b', 'b']).sum()  # hand-built grouping key

state,Ohio,Ohio,Colorado
color,Green,Red,Green
a,3,5,7
b,15,17,19


In [107]:
# Group by the inner row level
frame.groupby('key2').sum()  # key given as the index level name
frame.groupby(level='key2').sum()  # key given through level
frame.groupby([1,2,1,2]).sum()  # hand-built grouping key

state,Ohio,Ohio,Colorado
color,Green,Red,Green
1,6,8,10
2,12,14,16


Summing across columns

In [112]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [110]:
frame.sum(axis=1)   # sum across the columns (axis=1): one total per row

key1  key2
a     1        3
      2       12
b     1       21
      2       30
dtype: int64

Grouping by the inner column level

In [111]:
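# frame.sum(axis=1, level='color')  # pandas < 2.0 only: the level argument was removed in 2.0
frame.T.groupby(level='color').sum().T  # adds the two Green columns together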

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


In [114]:
# The same with groupby
frame.groupby(['Green', 'Red', 'Green'], axis=1).sum()  # hand-built grouping key
frame.groupby(axis=1, level='color').sum()  # key given through level

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


Grouping by the outer column level

In [115]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [116]:
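# frame.sum(axis=1, level='state')  # pandas < 2.0 only: the level argument was removed in 2.0
frame.T.groupby(level='state', sort=False).sum().T  # adds the two Ohio columns together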

Unnamed: 0_level_0,state,Ohio,Colorado
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,2
a,2,7,5
b,1,13,8
b,2,19,11


In [123]:
frame.groupby(level='state', axis=1).sum().sort_index(ascending=False, axis=1)

# Difference from the line above: a hand-written grouping key replaces level=
# Raises an error: when grouping values are passed directly, the innermost column level is used by default
# frame.groupby(['Ohio', 'Ohio', 'Colorado'], axis=1).sum().sort_index(ascending=False, axis=1)

# Swap the column index levels first
frame.swaplevel(axis=1)
frame.swaplevel(axis=1).groupby(['Ohio', 'Ohio', 'Colorado'], axis=1).sum().sort_index(ascending=False, axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Colorado
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,2
a,2,7,5
b,1,13,8
b,2,19,11
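
The axis=1 argument of groupby has been deprecated since pandas 2.1, so on newer versions a column-wise grouping is usually written by transposing, grouping the rows, and transposing back. The sketch below applies that pattern to a frame rebuilt with the notebook's labels; it is one possible rewrite, not the notebook's own code.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color'],
    ),
)

# Transpose so the column levels become row levels, group, then transpose back
by_state = frame.T.groupby(level='state').sum().T

print(by_state)
```

groupby sorts the group keys by default, so Colorado comes before Ohio here; passing sort=False keeps the order in which the keys first appear.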
