In [8]:
import pandas as pd
import numpy as np

# MultiIndex

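This time we introduce MultiIndex, also called a hierarchical index.

MultiIndex is a key pandas feature: by labelling an axis with several levels at once, it makes it possible to represent higher-dimensional data (3-D or more) in low-dimensional structures (the 1-D Series and the 2-D DataFrame).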
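As a minimal sketch of that idea, the snippet below stores a small 2-D table in a 1-D Series whose index has two levels, then recovers the table with `unstack`. The values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2 dates x 3 codes, stored in a 1-D Series.
dates = pd.date_range("2000-01-01", periods=2, name="trade_date")
codes = ["000001.SZ", "000002.SZ", "000003.SZ"]
index = pd.MultiIndex.from_product([dates, codes], names=["trade_date", "code"])

s = pd.Series(np.arange(6.0), index=index)

# Moving the inner level ("code") to the columns recovers the 2-D table.
table = s.unstack("code")
print(table.shape)
```

The same trick applied to a DataFrame (two levels on one axis) is what lets it hold 3-D data.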

The indexes we usually see have only one level, as in the data below.

In [9]:
df = pd.DataFrame(np.random.randn(5, 2), 
                  index=pd.date_range("2000-01-01", periods=5, name="trade_date"),
                  columns=pd.Index(["000001.SZ", "000002.SZ"], name="code"))

In [10]:
df

code,000001.SZ,000002.SZ
trade_date,,
2000-01-01,-1.15425,-0.126325
2000-01-02,0.401441,0.404857
2000-01-03,-0.04476,1.182319
2000-01-04,0.087543,0.922531
2000-01-05,0.406262,2.104594


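Sometimes, however, we need to represent higher-dimensional data.

For example, we may want to store both the closing and the opening prices of several stocks (here 000001.SZ and 000002.SZ) as time series of length 5.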

In [11]:
df_close = pd.DataFrame(np.random.randn(5, 2), 
                  index=pd.date_range("2000-01-01", periods=5, name="trade_date"),
                  columns=pd.Index(["000001.SZ", "000002.SZ"], name="code"))

In [12]:
df_open = pd.DataFrame(np.random.randn(5, 2), 
                  index=pd.date_range("2000-01-01", periods=5, name="trade_date"),
                  columns=pd.Index(["000001.SZ", "000002.SZ"], name="code"))

In [13]:
df_close 

code,000001.SZ,000002.SZ
trade_date,,
2000-01-01,0.543519,-0.172963
2000-01-02,1.28506,0.26437
2000-01-03,-0.080431,-0.127875
2000-01-04,-0.919127,-0.170319
2000-01-05,-0.020811,1.140572


In [14]:
df_open

code,000001.SZ,000002.SZ
trade_date,,
2000-01-01,0.484584,-0.422358
2000-01-02,0.799314,0.135294
2000-01-03,-0.15814,0.026406
2000-01-04,-0.887081,0.737286
2000-01-05,-1.738284,-0.314605


One option is to store the two DataFrames in a list, or in a dict.

In [15]:
tmp_dict = {"open": df_open, "close": df_close}

In [16]:
tmp_dict

{'close': code        000001.SZ  000002.SZ
 trade_date                      
 2000-01-01   0.543519  -0.172963
 2000-01-02   1.285060   0.264370
 2000-01-03  -0.080431  -0.127875
 2000-01-04  -0.919127  -0.170319
 2000-01-05  -0.020811   1.140572, 'open': code        000001.SZ  000002.SZ
 trade_date                      
 2000-01-01   0.484584  -0.422358
 2000-01-02   0.799314   0.135294
 2000-01-03  -0.158140   0.026406
 2000-01-04  -0.887081   0.737286
 2000-01-05  -1.738284  -0.314605}

With this layout, selecting by data category (close, open) is convenient, but slicing by date or by stock code is very cumbersome.
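The sketch below illustrates the asymmetry. It rebuilds a dict of stand-in random DataFrames (the names `frames`, `one_day` and `one_stock` are illustrative) and shows that picking one date or one stock means visiting every entry of the dict.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=5, name="trade_date")
codes = pd.Index(["000001.SZ", "000002.SZ"], name="code")

# Stand-in random data with the same shape as df_open / df_close.
frames = {
    "open": pd.DataFrame(np.random.randn(5, 2), index=dates, columns=codes),
    "close": pd.DataFrame(np.random.randn(5, 2), index=dates, columns=codes),
}

# Selecting by field is a single lookup ...
close = frames["close"]

# ... but selecting one date or one stock requires a loop over all frames.
day = pd.Timestamp("2000-01-03")
one_day = {field: frame.loc[day] for field, frame in frames.items()}
one_stock = pd.DataFrame(
    {field: frame["000001.SZ"] for field, frame in frames.items()}
)
```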

We need a better way to store high-dimensional data.

This is where MultiIndex comes in.

In [17]:
df = pd.DataFrame(np.random.randn(5, 6), 
                  index=pd.date_range("2000-01-01", periods=5, name="trade_date"),
                  columns=pd.MultiIndex.from_product([["000001.SZ", "000002.SZ", "000003.SZ"], ["close", "open"]],
                                                     names=["code", "field"]))

In [18]:
df

code,000001.SZ,000001.SZ,000002.SZ,000002.SZ,000003.SZ,000003.SZ
field,close,open,close,open,close,open
trade_date,,,,,,
2000-01-01,-0.084868,1.311083,-1.683967,0.348613,-0.299127,0.246763
2000-01-02,-0.033918,0.823621,-0.586938,-1.161244,0.577677,-0.555554
2000-01-03,0.626138,-2.449616,-0.104334,0.133032,-0.697537,0.339072
2000-01-04,1.182504,0.115843,-0.474775,-0.131368,-0.085706,0.161225
2000-01-05,0.948927,1.960194,-0.419402,1.14674,-0.684873,-1.473225


## MultiIndex attributes

In [19]:
index1 = df.columns

In [20]:
index1

MultiIndex(levels=[['000001.SZ', '000002.SZ', '000003.SZ'], ['close', 'open']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['code', 'field'])

### names

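Every MultiIndex has a `names` attribute. It is analogous to the `name` attribute of a single-level Index, which holds the name we give to the index. Because a MultiIndex has several levels, each level needs its own name, so `names` is a list.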

In [21]:
index1.names

FrozenList(['code', 'field'])

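Specifying `names` when constructing a MultiIndex is optional. In most cases, though, giving each level a sensible name and selecting levels by name rather than by position makes the code more readable.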
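A small sketch of both situations, using an illustrative two-stock index: an index built without names, then the same index with names attached through `set_names`.

```python
import pandas as pd

# Built without names: every level name is None.
unnamed = pd.MultiIndex.from_product(
    [["000001.SZ", "000002.SZ"], ["close", "open"]]
)
print(list(unnamed.names))  # [None, None]

# set_names returns a new index; the original is left unchanged.
named = unnamed.set_names(["code", "field"])

# A level can now be addressed by name instead of by position.
by_name = named.get_level_values("field")
by_position = named.get_level_values(1)
```

Addressing the level as `"field"` keeps working if the level order is later changed, whereas the position `1` would not.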

### levels

The `levels` attribute is a list holding the unique values of each level of the MultiIndex.

In [22]:
index1.levels

FrozenList([['000001.SZ', '000002.SZ', '000003.SZ'], ['close', 'open']])

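Each entry of `levels` can be selected by position. To get the label of every position along a level, rather than only the unique values, use `get_level_values`.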

In [23]:
index1.levels[0]

Index(['000001.SZ', '000002.SZ', '000003.SZ'], dtype='object', name='code')

In [24]:
index1.levels[1]

Index(['close', 'open'], dtype='object', name='field')

In [25]:
index1.get_level_values(0)

Index(['000001.SZ', '000001.SZ', '000002.SZ', '000002.SZ', '000003.SZ',
       '000003.SZ'],
      dtype='object', name='code')

In [26]:
index1.get_level_values(1)

Index(['close', 'open', 'close', 'open', 'close', 'open'], dtype='object', name='field')

`get_level_values` also accepts a level name from `names` in place of the position.

In [27]:
index1.get_level_values("code")

Index(['000001.SZ', '000001.SZ', '000002.SZ', '000002.SZ', '000003.SZ',
       '000003.SZ'],
      dtype='object', name='code')

In [28]:
index1.get_level_values("field")

Index(['close', 'open', 'close', 'open', 'close', 'open'], dtype='object', name='field')
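The difference between `levels` and `get_level_values` is easy to miss, so here is a short sketch on the same column index. It also shows one consequence worth knowing: slicing a MultiIndex keeps the original `levels`, and `remove_unused_levels` is needed to trim them. The variable names are illustrative.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["000001.SZ", "000002.SZ", "000003.SZ"], ["close", "open"]],
    names=["code", "field"],
)

print(len(columns.levels[0]))            # 3 unique codes
print(len(columns.get_level_values(0)))  # 6, one label per column

# Slicing does not shrink `levels`: unused values are kept until
# remove_unused_levels() is called.
subset = columns[:2]  # only the two 000001.SZ columns remain
stale = list(subset.levels[0])
trimmed = list(subset.remove_unused_levels().levels[0])
```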

### labels

In [29]:
index1.codes  # named index1.labels before pandas 0.24

FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

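The `labels` attribute is how a MultiIndex is actually stored. It has one array per level, matching `levels` one-to-one, and each element is an integer giving the position, within the corresponding level, of the label at that location. In pandas 0.24 and later this attribute is named `codes`; `labels` was deprecated and later removed.

We will not go into further detail here.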
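For readers who do want to see the mechanism, this sketch rebuilds the full label vectors from `levels` and the integer positions. It assumes pandas 0.24 or later, where the attribute is called `codes`.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["000001.SZ", "000002.SZ", "000003.SZ"], ["close", "open"]],
    names=["code", "field"],
)

# Assumes pandas >= 0.24, where the positions are exposed as `codes`
# (older releases exposed the same data as `labels`).
rebuilt = [
    columns.levels[i].take(columns.codes[i])
    for i in range(columns.nlevels)
]
```

Each entry of `rebuilt` matches what `get_level_values` returns for that level.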

## How to construct a MultiIndex

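In practice, we rarely need to construct a MultiIndex by hand.

The MultiIndex-labelled DataFrames and Series we work with day to day usually get their index automatically from functions such as concat, groupby and stack.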

In [30]:
pd.concat({"open": df_open, "close": df_close}, axis=1)

,close,close,open,open
code,000001.SZ,000002.SZ,000001.SZ,000002.SZ
trade_date,,,,
2000-01-01,0.543519,-0.172963,0.484584,-0.422358
2000-01-02,1.28506,0.26437,0.799314,0.135294
2000-01-03,-0.080431,-0.127875,-0.15814,0.026406
2000-01-04,-0.919127,-0.170319,-0.887081,0.737286
2000-01-05,-0.020811,1.140572,-1.738284,-0.314605
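The outer column level produced by `concat` above has no name. The sketch below passes `names` to label it and also shows `stack` and `groupby` producing a MultiIndex. The DataFrames are random stand-ins, and `records` is a hypothetical long-format table invented for the example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=5, name="trade_date")
codes = pd.Index(["000001.SZ", "000002.SZ"], name="code")
df_open = pd.DataFrame(np.random.randn(5, 2), index=dates, columns=codes)
df_close = pd.DataFrame(np.random.randn(5, 2), index=dates, columns=codes)

# concat: the dict keys become a new outer column level; `names` labels it.
wide = pd.concat(
    {"open": df_open, "close": df_close}, axis=1, names=["field", "code"]
)

# stack: the column level moves into the row index, giving two row levels.
long = df_close.stack()

# groupby with several keys: the keys become the levels of the result index.
records = pd.DataFrame(
    {
        "code": ["000001.SZ", "000001.SZ", "000002.SZ", "000002.SZ"],
        "field": ["close", "open", "close", "open"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)
grouped = records.groupby(["code", "field"])["value"].mean()
```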


Even so, knowing the constructors helps in understanding MultiIndex.

### from_arrays

In [31]:
arrays = [['000001.SZ', '000001.SZ', '000002.SZ', '000002.SZ', '000003.SZ', '000003.SZ'], ['close', 'open', 'close', 'open', 'close', 'open']]

In [32]:
pd.MultiIndex.from_arrays(arrays, names=('codes', 'fields'))

MultiIndex(levels=[['000001.SZ', '000002.SZ', '000003.SZ'], ['close', 'open']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['codes', 'fields'])

### from_product

In [33]:
arrays2 = [['000001.SZ', '000002.SZ', '000003.SZ'], ['close', 'open']]

In [34]:
pd.MultiIndex.from_product(arrays2, names=('codes', 'fields'))

MultiIndex(levels=[['000001.SZ', '000002.SZ', '000003.SZ'], ['close', 'open']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['codes', 'fields'])

### from_tuples

In [42]:
tuples = [('000001.SZ', 'close'), 
          ('000001.SZ', 'open'),
          ('000002.SZ', 'close'), 
          ('000002.SZ', 'open'),
          ('000003.SZ', 'close'), 
          ('000003.SZ', 'open')]
pd.MultiIndex.from_tuples(tuples, names=('codes', 'fields'))

MultiIndex(levels=[['000001.SZ', '000002.SZ', '000003.SZ'], ['close', 'open']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['codes', 'fields'])

Take a moment to consider which scenario each of these three constructors suits best. We will not go into the details here.
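As one possible reading (a rule of thumb, not an official guideline), the sketch below pairs each constructor with the input shape it fits most naturally. The `sparse` index is a made-up example of an irregular set of combinations.

```python
import pandas as pd

codes = ["000001.SZ", "000002.SZ", "000003.SZ"]
fields = ["close", "open"]
names = ["code", "field"]

# from_product: every combination of the inputs is wanted.
full = pd.MultiIndex.from_product([codes, fields], names=names)

# from_arrays: the labels already exist as parallel, equal-length
# sequences, one per level (for example two columns of a table).
arrays = [[c for c in codes for _ in fields], fields * len(codes)]
from_arrays = pd.MultiIndex.from_arrays(arrays, names=names)

# from_tuples: the labels arrive record by record, which also covers
# irregular sets where not every combination is present.
from_tuples = pd.MultiIndex.from_tuples(list(full), names=names)
sparse = pd.MultiIndex.from_tuples(
    [("000001.SZ", "close"), ("000003.SZ", "open")], names=names
)
```

For a full grid, all three constructors give the same index; only `from_product` spares you from writing out the repetition.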

## Indexing and selection with a MultiIndex

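When an axis carries a MultiIndex, ordinary label selection is no longer enough on its own, because a single label can only address the outermost level. In that case we recommend selecting with `pd.IndexSlice`.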
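A brief sketch of what plain `[]` selection can and cannot do on MultiIndex columns, using a randomly filled DataFrame with the same column layout as `df`:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["000001.SZ", "000002.SZ", "000003.SZ"], ["close", "open"]],
    names=["code", "field"],
)
df = pd.DataFrame(np.random.randn(5, 6), columns=columns)

# A label from the outermost level works and drops that level.
outer = df["000001.SZ"]

# A label that exists only in an inner level is not found.
try:
    df["open"]
    inner_found = True
except KeyError:
    inner_found = False

# A full tuple addresses exactly one column.
single = df[("000001.SZ", "open")]
```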

We generally alias `pd.IndexSlice` as follows.

In [43]:
idx = pd.IndexSlice

In [44]:
df

code,000001.SZ,000001.SZ,000002.SZ,000002.SZ,000003.SZ,000003.SZ
field,close,open,close,open,close,open
trade_date,,,,,,
2000-01-01,-0.084868,1.311083,-1.683967,0.348613,-0.299127,0.246763
2000-01-02,-0.033918,0.823621,-0.586938,-1.161244,0.577677,-0.555554
2000-01-03,0.626138,-2.449616,-0.104334,0.133032,-0.697537,0.339072
2000-01-04,1.182504,0.115843,-0.474775,-0.131368,-0.085706,0.161225
2000-01-05,0.948927,1.960194,-0.419402,1.14674,-0.684873,-1.473225


In [45]:
df.loc[:, idx["000001.SZ", :]]

code,000001.SZ,000001.SZ
field,close,open
trade_date,,
2000-01-01,-0.084868,1.311083
2000-01-02,-0.033918,0.823621
2000-01-03,0.626138,-2.449616
2000-01-04,1.182504,0.115843
2000-01-05,0.948927,1.960194


In [46]:
df.loc[:, idx[:, "open"]]

code,000001.SZ,000002.SZ,000003.SZ
field,open,open,open
trade_date,,,
2000-01-01,1.311083,0.348613,0.246763
2000-01-02,0.823621,-1.161244,-0.555554
2000-01-03,-2.449616,0.133032,0.339072
2000-01-04,0.115843,-0.131368,0.161225
2000-01-05,1.960194,1.14674,-1.473225


In [47]:
df.loc[:, idx[["000001.SZ", "000002.SZ"], :]]

code,000001.SZ,000001.SZ,000002.SZ,000002.SZ
field,close,open,close,open
trade_date,,,,
2000-01-01,-0.084868,1.311083,-1.683967,0.348613
2000-01-02,-0.033918,0.823621,-0.586938,-1.161244
2000-01-03,0.626138,-2.449616,-0.104334,0.133032
2000-01-04,1.182504,0.115843,-0.474775,-0.131368
2000-01-05,0.948927,1.960194,-0.419402,1.14674
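The selections above only restrict the columns. As a further sketch, rows and columns can be restricted in the same `.loc` call, and a range slice on a level is safest on a sorted index. The DataFrame here is a random stand-in with the same layout as `df`, and sorting first is a precaution rather than something the selections above required.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
dates = pd.date_range("2000-01-01", periods=5, name="trade_date")
columns = pd.MultiIndex.from_product(
    [["000001.SZ", "000002.SZ", "000003.SZ"], ["close", "open"]],
    names=["code", "field"],
)
df = pd.DataFrame(np.random.randn(5, 6), index=dates, columns=columns)

# Restrict rows (a date range) and columns (one field) in one call.
window = df.loc["2000-01-02":"2000-01-04", idx[:, "close"]]

# Range slices on a level rely on sorted labels, so sort first if the
# column order is not known to be sorted.
shuffled = df.iloc[:, ::-1]
tidy = shuffled.sort_index(axis=1)
ranged = tidy.loc[:, idx["000001.SZ":"000002.SZ", :]]
```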


To select a cross-section, that is, a lower-dimensional DataFrame with the selected level dropped, use the `xs` method.

In [48]:
df.xs("000001.SZ", axis=1, level=0)

field,close,open
trade_date,,
2000-01-01,-0.084868,1.311083
2000-01-02,-0.033918,0.823621
2000-01-03,0.626138,-2.449616
2000-01-04,1.182504,0.115843
2000-01-05,0.948927,1.960194
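`xs` also accepts a level name, and its `drop_level` parameter controls whether the selected level is kept. A short sketch on a random stand-in DataFrame with the same column layout:

```python
import numpy as np
import pandas as pd

codes = ["000001.SZ", "000002.SZ", "000003.SZ"]
columns = pd.MultiIndex.from_product(
    [codes, ["close", "open"]], names=["code", "field"]
)
dates = pd.date_range("2000-01-01", periods=5, name="trade_date")
df = pd.DataFrame(np.random.randn(5, 6), index=dates, columns=columns)

# Select by level name; the selected level is dropped by default,
# leaving single-level columns keyed by code.
opens = df.xs("open", axis=1, level="field")

# drop_level=False keeps both column levels in the result.
opens_full = df.xs("open", axis=1, level="field", drop_level=False)
```

With `drop_level=False` the result keeps the same two-level columns that an `IndexSlice` selection of `"open"` would return.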
