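# Creating Multi-Level Indexes in `Pandas`

### A `Series` with a multi-level index

`Pandas` provides a `MultiIndex` type designed specifically for multi-level (hierarchical) indexing.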

In [2]:
import pandas as pd
import numpy as np

index = [('California', 2008), ('California', 2018),
         ('New York', 2008), ('New York', 2018),
         ('Texas', 2008), ('Texas', 2018)]
mul_index = pd.MultiIndex.from_tuples(index)
print(mul_index)

MultiIndex([('California', 2008),
            ('California', 2018),
            (  'New York', 2008),
            (  'New York', 2018),
            (     'Texas', 2008),
            (     'Texas', 2018)],
           )


Build a `Series` that uses this multi-level index.

In [4]:
population = [33870000, 37250000,
               18970000, 19370000,
               20850000, 25140000]

pop = pd.Series(population, index=mul_index)
print(pop)

California  2008    33870000
            2018    37250000
New York    2008    18970000
            2018    19370000
Texas       2008    20850000
            2018    25140000
dtype: int64


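In the output, the first two columns are the levels of the `Series` index, and the third column holds the data.

To get all the data for 2018, index on both levels at once: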

In [5]:
print(pop[:, 2018])

California    37250000
New York      19370000
Texas         25140000
dtype: int64
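
The bracket form `pop[:, 2018]` can also be written with explicit label-based tools. The following is a minimal sketch that rebuilds the same `pop` series so it runs on its own; `loc` and `xs` are standard pandas accessors, while the names `by_loc` and `by_xs` are only illustrative.

```python
import pandas as pd

index = [('California', 2008), ('California', 2018),
         ('New York', 2008), ('New York', 2018),
         ('Texas', 2008), ('Texas', 2018)]
population = [33870000, 37250000,
              18970000, 19370000,
              20850000, 25140000]
pop = pd.Series(population, index=pd.MultiIndex.from_tuples(index))

# Label-based selection: every state (first level), year 2018 (second level)
by_loc = pop.loc[:, 2018]

# Cross-section: pick 2018 on level 1 and drop that level from the result
by_xs = pop.xs(2018, level=1)

print(by_loc)
print(by_xs)
```

Using `loc` or `xs` makes it explicit that the selection is by label, which can be easier to read than a bare bracket expression.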


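### Converting between a multi-level `Series` and a regular `DataFrame`

The `unstack()` method converts a `Series` with a multi-level index into a regular `DataFrame`, and the `stack()` method performs the reverse operation.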

In [6]:
df_pop = pop.unstack()
print(df_pop)
print(df_pop.stack())

                2008      2018
California  33870000  37250000
New York    18970000  19370000
Texas       20850000  25140000
California  2008    33870000
            2018    37250000
New York    2008    18970000
            2018    19370000
Texas       2008    20850000
            2018    25140000
dtype: int64
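
When some index combinations are absent, `unstack()` still has to fill every cell of the resulting table. The sketch below assumes a hypothetical version of the data in which the `('Texas', 2008)` entry is missing (this gap is not in the original data) and shows the `fill_value` parameter of `Series.unstack`.

```python
import pandas as pd

# Same data as above, but with the ('Texas', 2008) entry left out on purpose
index = pd.MultiIndex.from_tuples([
    ('California', 2008), ('California', 2018),
    ('New York', 2008), ('New York', 2018),
    ('Texas', 2018)])
partial = pd.Series([33870000, 37250000, 18970000, 19370000, 25140000],
                    index=index)

wide = partial.unstack()                     # the missing cell becomes NaN
wide_filled = partial.unstack(fill_value=0)  # the missing cell becomes 0 instead

print(wide)
print(wide_filled)
```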


### A `DataFrame` with a multi-level index

In [7]:
index = [('California', 2008), ('California', 2018),
         ('New York', 2008), ('New York', 2018),
         ('Texas', 2008), ('Texas', 2018)]

mul_index = pd.MultiIndex.from_tuples(index)
population = [33870000, 37250000,18970000, 19370000, 20850000, 25140000]
under_18_pop = [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]

pop = pd.Series(population, index=mul_index)
pop_df = pd.DataFrame({'total':pop,
                       'under18':under_18_pop})
print(pop_df)

                    total  under18
California 2008  33870000  9267089
           2018  37250000  9284094
New York   2008  18970000  4687374
           2018  19370000  4318033
Texas      2008  20850000  5906301
           2018  25140000  6879014


A `DataFrame` with a multi-level index is still accessed by column in dictionary-key style, for example `pop_df['total']`.

In [8]:
print(pop_df['under18'])
print(pop_df['under18']/pop_df['total'])

California  2008    9267089
            2018    9284094
New York    2008    4687374
            2018    4318033
Texas       2008    5906301
            2018    6879014
Name: under18, dtype: int64
California  2008    0.273608
            2018    0.249237
New York    2008    0.247094
            2018    0.222924
Texas       2008    0.283276
            2018    0.273628
dtype: float64
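
A computed ratio like the one above can be stored as a new column and then looked up by row labels. This is a small sketch under the same data; the column name `under18_ratio` is an illustrative choice, not something defined in the original notebook.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('California', 2008), ('California', 2018),
    ('New York', 2008), ('New York', 2018),
    ('Texas', 2008), ('Texas', 2018)])
pop_df = pd.DataFrame(
    {'total': [33870000, 37250000, 18970000, 19370000, 20850000, 25140000],
     'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]},
    index=index)

# Store the share of under-18 residents as its own column
pop_df['under18_ratio'] = pop_df['under18'] / pop_df['total']

# A full row key is a tuple with one label per index level
texas_2018 = pop_df.loc[('Texas', 2018), 'under18_ratio']
print(texas_2018)

# A partial key (outer level only) returns all rows for that state
california = pop_df.loc['California']
print(california)
```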


### Ways to create a multi-level index

* **From nested lists (one array per level)**

In [9]:
mul_index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'],[1,2,1,2]])
print(mul_index)

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )


* **From a list of tuples**

In [10]:
mul_index = pd.MultiIndex.from_tuples([('a', 1), ('a', 0), ('b', 1), ('b', 0)])
print(mul_index)

MultiIndex([('a', 1),
            ('a', 0),
            ('b', 1),
            ('b', 0)],
           )


* **Directly from `levels` and `codes`** (the `codes` argument was named `labels` before pandas 0.24)

In [11]:
mul_index = pd.MultiIndex(levels=[['a', 'b'], [0, 1]],
           codes=[[0, 0, 1, 1], [1, 0, 1, 0]])
print(mul_index)

MultiIndex([('a', 1),
            ('a', 0),
            ('b', 1),
            ('b', 0)],
           )


* **From the Cartesian product of the level values**

In [12]:
mul_index = pd.MultiIndex.from_product([[2008,2018],[1,2]])
print(mul_index)

MultiIndex([(2008, 1),
            (2008, 2),
            (2018, 1),
            (2018, 2)],
           )
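
The four constructors listed above are different routes to the same kind of object. As a quick sanity check, this sketch builds one index in all four ways and compares the results; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
# levels hold the unique values; codes are positions into those levels
from_codes = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

candidates = [from_tuples, from_product, from_codes]
all_equal = all(from_arrays.equals(other) for other in candidates)
print(all_equal)

# Every MultiIndex stores its data in this levels/codes form internally
print(from_arrays.levels)
print(from_arrays.codes)
```

`from_product` tends to be the most compact when every combination is present, while `from_tuples` is convenient when only some combinations exist.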


***To make each index level easier to identify, we can give every level a name***

In [13]:
index = [('California', 2008), ('California', 2018),
         ('New York', 2008), ('New York', 2018),
         ('Texas', 2008), ('Texas', 2018)]
# Attach names to the index levels
mul_index = pd.MultiIndex.from_tuples(index, names=('state', 'year'))
population = [33870000, 37250000,18970000, 19370000, 20850000, 25140000]
under_18_pop = [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]

pop = pd.Series(population, index=mul_index)
# Alternatively, set the names afterwards: pop.index.names = ['state', 'year']
print(pop)

state       year
California  2008    33870000
            2018    37250000
New York    2008    18970000
            2018    19370000
Texas       2008    20850000
            2018    25140000
dtype: int64
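
Once the levels are named, most methods that take a `level` argument accept the name in place of the level number. A brief sketch, reusing the same data; the result variable names are illustrative.

```python
import pandas as pd

index = [('California', 2008), ('California', 2018),
         ('New York', 2008), ('New York', 2018),
         ('Texas', 2008), ('Texas', 2018)]
mul_index = pd.MultiIndex.from_tuples(index, names=['state', 'year'])
pop = pd.Series([33870000, 37250000, 18970000, 19370000, 20850000, 25140000],
                index=mul_index)

# Refer to levels by name instead of by position
by_year = pop.xs(2018, level='year')
states = pop.index.get_level_values('state').unique()
wide = pop.unstack(level='year')

print(by_year)
print(list(states))
print(wide)
```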


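### Example: multi-level row and column indexes

The columns can carry a two-level index as well. Combined with the two-level row index, this lets us represent four-dimensional data.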

In [14]:
index = pd.MultiIndex.from_product([[2008,2018],[1,2]],
                                      names=['year','visit'])
columns = pd.MultiIndex.from_product([['Tom','Bill'],[22,18]],
                                      names=['name','age'])
data = np.random.randn(4,4)
print(index)
print(columns)
df_data = pd.DataFrame(data,index=index,columns=columns)
print(df_data)

MultiIndex([(2008, 1),
            (2008, 2),
            (2018, 1),
            (2018, 2)],
           names=['year', 'visit'])
MultiIndex([( 'Tom', 22),
            ( 'Tom', 18),
            ('Bill', 22),
            ('Bill', 18)],
           names=['name', 'age'])
name             Tom                Bill          
age               22        18        22        18
year visit                                        
2008 1      1.948807 -0.277856  0.526454  0.377724
     2      0.640332  1.470054  0.725624  0.506353
2018 1      0.601660 -0.300470  0.769044 -0.303792
     2      1.103208 -0.020874 -2.765182  0.292247
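
Because the values come from `np.random.randn`, they change on every run. If reproducible numbers are wanted, one option is a seeded generator. This is a sketch, not part of the original notebook: the helper `make_frame` and the seed value `0` are hypothetical choices.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2008, 2018], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Tom', 'Bill'], [22, 18]],
                                     names=['name', 'age'])

def make_frame(seed):
    """Build the 4x4 frame from a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((4, 4)),
                        index=index, columns=columns)

first = make_frame(0)
second = make_frame(0)

print(first.equals(second))          # same seed, same values
print(first.index.nlevels, first.columns.nlevels)
```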


# Indexing, Slicing, and Computing on Multi-Indexed `Pandas` Objects

## A `Series` with a multi-level index

### Selecting values with a multi-level index

In [15]:
index = [('California', 2008), ('California', 2018),
         ('New York', 2008), ('New York', 2018),
         ('Texas', 2008), ('Texas', 2018)]
mul_index = pd.MultiIndex.from_tuples(index)

population = [33870000, 37250000,
               18970000, 19370000,
               20850000, 25140000]

pop = pd.Series(population, index=mul_index)
print(pop)

California  2008    33870000
            2018    37250000
New York    2008    18970000
            2018    19370000
Texas       2008    20850000
            2018    25140000
dtype: int64


* Specify a value for every index level to get a single element

In [16]:
print(pop['New York', 2008])

18970000


* Specify only one of the two index levels

In [18]:
print(pop['New York'])
print(pop[:, 2008])

2008    18970000
2018    19370000
dtype: int64
California    33870000
New York      18970000
Texas         20850000
dtype: int64


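### Slicing with a multi-level index

* First slice on the state level to see the population data from California through New York (label slices include both endpoints)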

In [19]:
print(pop.loc['California':'New York'])

California  2008    33870000
            2018    37250000
New York    2008    18970000
            2018    19370000
dtype: int64


* Then slice on the other level to see every state's population from 2008 through 2018

In [20]:
print(pop.loc[:, 2008:2018])

California  2008    33870000
            2018    37250000
New York    2008    18970000
            2018    19370000
Texas       2008    20850000
            2018    25140000
dtype: int64


* Boolean filtering

In [21]:
print(pop[pop > 30000000])

California  2008    33870000
            2018    37250000
dtype: int64


* Fancy indexing

In [22]:
print(pop[['California', 'Texas']])

California  2008    33870000
            2018    37250000
Texas       2008    20850000
            2018    25140000
dtype: int64
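
The label slices above work because the index is already sorted. On an unsorted `MultiIndex`, a slice such as `'California':'New York'` is rejected, and `sort_index()` is the usual remedy. The sketch below uses the same population figures in a deliberately shuffled order, which is an assumption made only for this illustration.

```python
import pandas as pd

# Same figures as above, listed in a deliberately unsorted order
unsorted = pd.Series(
    [20850000, 33870000, 18970000, 25140000, 37250000, 19370000],
    index=pd.MultiIndex.from_tuples([
        ('Texas', 2008), ('California', 2008), ('New York', 2008),
        ('Texas', 2018), ('California', 2018), ('New York', 2018)]),
)

try:
    unsorted.loc['California':'New York']
    slice_failed = False
except KeyError as err:  # pandas raises UnsortedIndexError, a KeyError subclass
    slice_failed = True
    print(type(err).__name__)

# Sorting the index makes label slicing well defined again
ordered = unsorted.sort_index()
subset = ordered.loc['California':'New York']
print(subset)
```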


## A `DataFrame` with a multi-level index

### Selecting values with a multi-level index

We reuse the four-dimensional `DataFrame` object `df_data` built earlier.

In [23]:
print(df_data)

name             Tom                Bill          
age               22        18        22        18
year visit                                        
2008 1      1.948807 -0.277856  0.526454  0.377724
     2      0.640332  1.470054  0.725624  0.506353
2018 1      0.601660 -0.300470  0.769044 -0.303792
     2      1.103208 -0.020874 -2.765182  0.292247


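Because the primary index of a `DataFrame` is its columns, bracket indexing is applied to the columns first.

Start by selecting `Tom`'s data, which reduces the data from four dimensions to three.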

In [24]:
print(df_data['Tom'])

age               22        18
year visit                    
2008 1      1.948807 -0.277856
     2      0.640332  1.470054
2018 1      0.601660 -0.300470
     2      1.103208 -0.020874


Going one step further, select `Tom`'s data at age `22`, which reduces the three-dimensional `DataFrame` to a two-dimensional `Series`.

In [25]:
print(df_data['Tom', 22])

year  visit
2008  1        1.948807
      2        0.640332
2018  1        0.601660
      2        1.103208
Name: (Tom, 22), dtype: float64


Then select the data for `2008`, which reduces the two-dimensional `Series` to one dimension.

In [26]:
print(df_data['Tom',22][2008])

visit
1    1.948807
2    0.640332
Name: (Tom, 22), dtype: float64


Finally, select the entry with `visit=1`, which pins down a single value.

In [27]:
print(df_data['Tom',22][2008,1])

1.9488074936644946
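
The step-by-step chain above can also be collapsed into a single `loc` call that takes one tuple per axis, and `xs` can fix an inner level on either axis. This sketch substitutes deterministic values (`np.arange(16)`) for the random data so the results are easy to check by eye; that substitution and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2008, 2018], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Tom', 'Bill'], [22, 18]],
                                     names=['name', 'age'])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# Chained lookup: column tuple first, then row tuple
chained = df['Tom', 22][2008, 1]

# Single lookup: loc[row key, column key], each key a full tuple
direct = df.loc[(2008, 1), ('Tom', 22)]
print(chained == direct)

# Fix an inner level on each axis: first visits only, age 22 only
first_visits_at_22 = df.xs(1, level='visit').xs(22, level='age', axis=1)
print(first_visits_at_22)
```

A single `loc` call avoids creating intermediate objects and is the form to prefer when assigning values, since chained indexing may operate on a copy.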


### Slicing with indexers

* Use the `iloc` indexer, which slices purely by integer position

In [29]:
print(df_data.iloc[:2, :3])     # iloc[rows, columns]

name             Tom                Bill
age               22        18        22
year visit                              
2008 1      1.948807 -0.277856  0.526454
     2      0.640332  1.470054  0.725624


* It is often convenient to select rows by position and columns by label. The `ix` indexer used to allow this mix, but it was deprecated and then removed in pandas 1.0, so here we chain `iloc` (rows by position) with a label lookup on the columns

In [30]:
print(df_data.iloc[:3]['Tom'])

age               22        18
year visit                    
2008 1      1.948807 -0.277856
     2      0.640332  1.470054
2018 1      0.601660 -0.300470


## Rearranging multi-level indexes

A multi-indexed `Series` can be unstacked into a `DataFrame` along either level.

In [32]:
pop.index.names = ['state', 'year']
print(pop.unstack(level=0))
print(pop.unstack(level=1))

state  California  New York     Texas
year                                 
2008     33870000  18970000  20850000
2018     37250000  19370000  25140000
year            2008      2018
state                         
California  33870000  37250000
New York    18970000  19370000
Texas       20850000  25140000


Alternatively, we can turn every index level into an ordinary column and fall back to the default integer index.

In [33]:
print(pop.reset_index(name='population'))

        state  year  population
0  California  2008    33870000
1  California  2018    37250000
2    New York  2008    18970000
3    New York  2018    19370000
4       Texas  2008    20850000
5       Texas  2018    25140000


Conversely, we can pick any two columns, for example `state` and `population`, to serve as the index levels.

In [36]:
pop_un = pop.reset_index(name='population')
print(pop_un.set_index(['state' ,'population']))

                       year
state      population      
California 33870000    2008
           37250000    2018
New York   18970000    2008
           19370000    2018
Texas      20850000    2008
           25140000    2018


## Row and column aggregation on multi-level indexes

In [37]:
print(df_data)

name             Tom                Bill          
age               22        18        22        18
year visit                                        
2008 1      1.948807 -0.277856  0.526454  0.377724
     2      0.640332  1.470054  0.725624  0.506353
2018 1      0.601660 -0.300470  0.769044 -0.303792
     2      1.103208 -0.020874 -2.765182  0.292247


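* Compute the mean for each year; this aggregates down the rows, grouping them by the `year` level (the `level` argument of `mean` was removed in pandas 2.0, so `groupby(level=...)` is used instead)

In [38]:
print(df_data.groupby(level='year').mean())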

name       Tom                Bill          
age         22        18        22        18
year                                        
2008  1.294570  0.596099  0.626039  0.442039
2018  0.852434 -0.160672 -0.998069 -0.005773


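* In the other direction, group the columns by age and sum the values for each age (transposing lets `groupby` work on the column level, and `sort=False` keeps the ages in their original order)

In [39]:
print(df_data.T.groupby(level='age', sort=False).sum().T)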

age               22        18
year visit                    
2008 1      2.475262  0.099869
     2      1.365955  1.976408
2018 1      1.370704 -0.604263
     2     -1.661974  0.271372
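
Level-wise aggregation is easier to verify on data whose values are known in advance. The sketch below replaces the random values with `np.arange(16)` (an assumption made only for this illustration) and aggregates once along the rows and once along the columns, this time grouping the columns by `name`.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2008, 2018], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Tom', 'Bill'], [22, 18]],
                                     names=['name', 'age'])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# Rows: average the two visits within each year
yearly_mean = df.groupby(level='year').mean()
print(yearly_mean)

# Columns: transpose, group by the 'name' level, sum, transpose back
per_name_total = df.T.groupby(level='name').sum().T
print(per_name_total)
```

Note that `groupby` sorts the group keys by default, so the columns of `per_name_total` come out as `Bill`, `Tom`; pass `sort=False` to keep the order of first appearance.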
