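**Preface: the data we collect usually arrives in separate pieces, but we want to bring related, comparable data together, so we need to combine those pieces. Different goals call for different ways of combining. This lesson covers how to combine such datasets.**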

In [1]:
import numpy as np
import pandas as pd

# Hierarchical indexing

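**Along a single axis, the same data can be labelled in more than one way. When several of these labellings are shown on one axis at the same time, the result is called a hierarchical index (a `MultiIndex` in pandas).**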

**Start with a Series**

In [2]:
data = pd.Series(['a','s','d','d','f','f','g','h'])
data

0    a
1    s
2    d
3    d
4    f
5    f
6    g
7    h
dtype: object

In [3]:
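# Build a second index level underneath the first one
# ('大', '小', '中' are just labels meaning big, small, medium)
data = pd.Series([1,23,4,5,6,7,8],index=[['大','大','大','小','小','小','中'],['a','s','d','f','g','h','t']]) # the index lists are given from the outermost level to the innermost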

In [4]:
data

大  a     1
   s    23
   d     4
小  f     5
   g     6
   h     7
中  t     8
dtype: int64

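**Indexing with different levels gives different results, and the levels interact: selecting on the outer level leaves the inner level as the index of the result.**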

In [5]:
data['大']

a     1
s    23
d     4
dtype: int64

In [6]:
data['大','a']

1
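
**The two cells above select by the outer level and by a full key. A minimal sketch of the remaining case, selecting by the inner level, using `Series.xs`. The ASCII labels `big`, `small`, `mid` are stand-ins chosen for this example, not the notebook's data.**

```python
import pandas as pd

# Same shape as the Series above; the labels are hypothetical stand-ins.
s = pd.Series(
    [1, 23, 4, 5, 6, 7, 8],
    index=[["big", "big", "big", "small", "small", "small", "mid"],
           ["a", "s", "d", "f", "g", "h", "t"]],
)

outer = s["big"]              # outer label only: a Series indexed by the inner level
scalar = s.loc[("big", "a")]  # full key (outer, inner): a single value
inner = s.xs("a", level=1)    # inner label only: a Series indexed by the outer level

print(outer)
print(scalar)
print(inner)
```

**`xs` takes the level by position here; it also accepts a level name once the levels are named.**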

**Notice that selecting values this way looks a lot like indexing a DataFrame. Indeed, `unstack()` turns the Series into a DataFrame.**

In [7]:
dstyle = data.unstack()
dstyle

Unnamed: 0,a,d,f,g,h,s,t
中,,,,,,,8.0
大,1.0,4.0,,,,23.0,
小,,,5.0,6.0,7.0,,


In [8]:
# and the reverse works too
dstyle.stack()

中  t     8.0
大  a     1.0
   d     4.0
   s    23.0
小  f     5.0
   g     6.0
   h     7.0
dtype: float64
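
**The round trip above turned the integers into floats, because `unstack()` fills the missing combinations with NaN. A small sketch of that behaviour, and of the `fill_value` argument that avoids it. The labels are stand-ins chosen for this example.**

```python
import pandas as pd

# Hypothetical stand-in for the Series used above.
s = pd.Series(
    [1, 23, 4, 5, 6, 7, 8],
    index=[["big", "big", "big", "small", "small", "small", "mid"],
           ["a", "s", "d", "f", "g", "h", "t"]],
)

wide = s.unstack()                # inner level becomes the columns; gaps are NaN
filled = s.unstack(fill_value=0)  # gaps are filled with 0 instead of NaN
by_outer = s.unstack(level=0)     # move the outer level to the columns instead

# Stack back to long form. dropna() removes the filler rows whether or not
# the installed pandas version drops them during stack().
back = wide.stack().dropna().astype("int64")

print(wide)
print(filled)
print(back)
```

**Whether `stack()` itself drops the NaN rows depends on the pandas version, which is why the sketch calls `dropna()` explicitly.**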

**A DataFrame can have a hierarchical index too, on either axis. It is built by passing a list of arrays as `index` or `columns`.**

In [9]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))))
frame

Unnamed: 0,0,1,2,3,4,5
0,407.0,945.0,386.0,983.0,538.0,854.0
1,381.0,991.0,387.0,606.0,719.0,565.0
2,356.0,132.0,738.0,550.0,331.0,631.0
3,165.0,255.0,797.0,611.0,263.0,848.0
4,555.0,837.0,214.0,927.0,927.0,248.0
5,919.0,677.0,989.0,142.0,745.0,356.0


In [10]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),index=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],columns=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,f,f,r
Unnamed: 0_level_1,Unnamed: 1_level_1,a,s,d,f,g,h
Dave,age,460.0,7.0,202.0,442.0,288.0,845.0
Wasa,age,235.0,275.0,354.0,657.0,57.0,272.0
Dave,money,918.0,429.0,852.0,992.0,224.0,325.0
Json,home,898.0,179.0,593.0,176.0,636.0,763.0
Json,grade,83.0,816.0,648.0,989.0,397.0,355.0
Honey,talent,23.0,231.0,129.0,16.0,28.0,792.0


In [11]:
frame['r']

Unnamed: 0,Unnamed: 1,h
Dave,age,845.0
Wasa,age,272.0
Dave,money,325.0
Json,home,763.0
Json,grade,355.0
Honey,talent,792.0
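
**`frame['r']` above selects every column under one outer label. A sketch of the other common selections on a frame with two levels on both axes. The index is built explicitly with `pd.MultiIndex.from_arrays`, which should be equivalent to passing the list of arrays directly; the random values and the seed are made up for the example.**

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_arrays(
    [["Dave", "Wasa", "Dave", "Json", "Json", "Honey"],
     ["age", "age", "money", "home", "grade", "talent"]]
)
cols = pd.MultiIndex.from_arrays(
    [["a", "a", "a", "f", "f", "r"],
     ["a", "s", "d", "f", "g", "h"]]
)

rng = np.random.default_rng(0)  # fixed seed, values are illustrative only
df = pd.DataFrame(rng.integers(1, 1000, size=(6, 6)), index=rows, columns=cols)

block = df["a"]                  # outer column label: DataFrame with columns a, s, d
col = df[("a", "s")]             # full column key: a single Series
row = df.loc[("Dave", "age"), :] # full row key: that row as a Series

print(block)
print(col)
print(row)
```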


**To make the meaning of each level clearer, we can name the levels by assigning to the `index.names` and `columns.names` attributes.**

In [12]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
                     index=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
                     columns=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame.index.names=['字母','瞎写']
frame.columns.names=['名字','标签']

frame

Unnamed: 0_level_0,名字,a,a,a,f,f,r
Unnamed: 0_level_1,标签,a,s,d,f,g,h
字母,瞎写,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Dave,age,67.0,555.0,126.0,229.0,863.0,445.0
Wasa,age,615.0,234.0,918.0,863.0,742.0,552.0
Dave,money,607.0,51.0,844.0,425.0,400.0,258.0
Json,home,833.0,397.0,400.0,851.0,57.0,48.0
Json,grade,569.0,21.0,203.0,771.0,589.0,539.0
Honey,talent,972.0,19.0,757.0,397.0,111.0,170.0
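
**Once levels have names, they can be referred to by name instead of by position. A minimal sketch, assuming the hypothetical level names `person` and `field` and made-up values.**

```python
import numpy as np
import pandas as pd

# Names can also be given at construction time instead of assigned afterwards.
idx = pd.MultiIndex.from_arrays(
    [["Dave", "Wasa", "Dave", "Json", "Json", "Honey"],
     ["age", "age", "money", "home", "grade", "talent"]],
    names=["person", "field"],  # hypothetical level names
)
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=["x", "y"])

people = df.index.get_level_values("person")  # all labels of one level
dave = df.xs("Dave", level="person")          # rows for one person, indexed by field

print(people)
print(dave)
```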


In [13]:
test = pd.DataFrame(np.ceil(np.random.uniform(1,100,(5,5))))

In [14]:
test

Unnamed: 0,0,1,2,3,4
0,86.0,71.0,25.0,15.0,26.0
1,31.0,53.0,100.0,28.0,39.0
2,89.0,37.0,31.0,70.0,55.0
3,76.0,42.0,23.0,81.0,4.0
4,8.0,5.0,25.0,56.0,75.0


## Reordering and sorting levels

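**We have set up the levels of a hierarchical index and given them names, but that layout is not fixed: levels may need to be swapped or changed. To swap the order of two levels, use `swaplevel()`.**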

In [29]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
                     columns=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
                     index=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame.index.names=['字母','瞎写']
frame.columns.names=['名字','标签']

frame.swaplevel(0,1)  

Unnamed: 0_level_0,名字,Dave,Wasa,Dave,Json,Json,Honey
Unnamed: 0_level_1,标签,age,age,money,home,grade,talent
瞎写,字母,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
a,a,673.0,531.0,755.0,997.0,140.0,810.0
s,a,317.0,259.0,120.0,403.0,969.0,429.0
d,a,165.0,912.0,251.0,352.0,231.0,789.0
f,f,395.0,561.0,913.0,517.0,428.0,572.0
g,f,577.0,240.0,990.0,243.0,481.0,328.0
h,r,127.0,650.0,907.0,246.0,80.0,590.0


In [30]:
frame.sort_index(level=0)   # level counts the index levels from the outside in (the outermost is 0)

Unnamed: 0_level_0,名字,Dave,Wasa,Dave,Json,Json,Honey
Unnamed: 0_level_1,标签,age,age,money,home,grade,talent
字母,瞎写,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
a,a,673.0,531.0,755.0,997.0,140.0,810.0
a,d,165.0,912.0,251.0,352.0,231.0,789.0
a,s,317.0,259.0,120.0,403.0,969.0,429.0
f,f,395.0,561.0,913.0,517.0,428.0,572.0
f,g,577.0,240.0,990.0,243.0,481.0,328.0
r,h,127.0,650.0,907.0,246.0,80.0,590.0
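
**As the two outputs above suggest, `swaplevel()` only changes the order of the levels and leaves the rows where they are, while `sort_index()` reorders the rows. A small sketch that chains the two, assuming the hypothetical level names `outer` and `inner`.**

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["a", "a", "a", "f", "f", "r"],
     ["a", "s", "d", "f", "g", "h"]],
    names=["outer", "inner"],  # hypothetical level names
)
df = pd.DataFrame({"v": range(6)}, index=idx)

swapped = df.swaplevel("outer", "inner")      # levels swapped, row order unchanged
sorted_swapped = swapped.sort_index(level=0)  # now sorted by the new outermost level

print(swapped)
print(sorted_swapped)
```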


## Summary statistics by level

**Put simply: given a multi-level index, we pick the level we are interested in and compute statistics grouped by it.**

**To aggregate by a level, two things need to be decided: 1. Which level? Set with `level=`. 2. Along which axis? Set with `axis=`.**

In [46]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,10,(5,5))),index=[['a','s','a','f','g'],['z','x','c','a','s']],columns=[[1,2,3,4,5],[1,'s','d','f','re']])
frame.index.names=['key1','key2']
frame.columns.names=['time','color']

In [47]:
frame

Unnamed: 0_level_0,time,1,2,3,4,5
Unnamed: 0_level_1,color,1,s,d,f,re
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
a,z,3.0,10.0,5.0,3.0,4.0
s,x,5.0,7.0,9.0,6.0,3.0
a,c,10.0,10.0,7.0,3.0,5.0
f,a,3.0,9.0,2.0,2.0,6.0
g,s,2.0,3.0,7.0,7.0,3.0


In [48]:
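# Sum over one specific level; keep the two points above in mind
# Note: sum(level=...), used in this cell and the next, was deprecated in pandas 1.3
# and removed in 2.0; on newer versions, group by the level instead.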
frame.sum(level='color',axis=1)

Unnamed: 0_level_0,color,1,s,d,f,re
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
a,z,3.0,10.0,5.0,3.0,4.0
s,x,5.0,7.0,9.0,6.0,3.0
a,c,10.0,10.0,7.0,3.0,5.0
f,a,3.0,9.0,2.0,2.0,6.0
g,s,2.0,3.0,7.0,7.0,3.0


In [49]:
frame.sum(level='key1')

time,1,2,3,4,5
color,1,s,d,f,re
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,13.0,20.0,12.0,6.0,9.0
s,5.0,7.0,9.0,6.0,3.0
f,3.0,9.0,2.0,2.0,6.0
g,2.0,3.0,7.0,7.0,3.0
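
**On recent pandas versions, where `sum(level=...)` is no longer available, the same per-level totals can be written with `groupby(level=...)`. A sketch under that assumption: the numbers are made up, and the column labels repeat a colour so the column-wise sum actually combines something.**

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["a", "s", "a", "f", "g"], ["z", "x", "c", "a", "s"]],
    names=["key1", "key2"],
)
cols = pd.MultiIndex.from_arrays(
    [[1, 1, 2], ["r", "g", "r"]],
    names=["time", "color"],
)
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]],
    index=idx, columns=cols,
)

# Row-wise: add up the rows that share a key1 label.
# sort=False keeps the labels in order of first appearance.
by_key1 = df.groupby(level="key1", sort=False).sum()

# Column-wise: transpose, group the (former) columns by color, transpose back.
by_color = df.T.groupby(level="color", sort=False).sum().T

print(by_key1)
print(by_color)
```

**The transpose is used for the column-wise case because grouping with `axis=1` is itself deprecated in newer pandas releases.**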


## Turning DataFrame columns into the index

**We can take one or more columns of a DataFrame and use their values as the row index. The method for this is `set_index()`.**

In [58]:
data = pd.DataFrame({'a':[1,2,3,4],'b':['one','two','three','four'],'c':range(4)})

In [59]:
data

Unnamed: 0,a,b,c
0,1,one,0
1,2,two,1
2,3,three,2
3,4,four,3


In [61]:
data.set_index(['c'])

Unnamed: 0_level_0,a,b
c,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1,one
1,2,two
2,3,three
3,4,four


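**`set_index()` also takes a `drop` parameter. It defaults to `True`, meaning the column used as the index is removed from the data; set it to `False` and the column is kept, as in this example.**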

In [63]:
data.set_index(['c'],drop=False)

Unnamed: 0_level_0,a,b,c
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1,one,0
1,2,two,1
2,3,three,2
3,4,four,3


**To go back, that is, to turn an index level into an ordinary column again, use `reset_index()`.**

In [64]:
data01 = data.set_index(['c'])
data01.reset_index(['c'])

Unnamed: 0,c,a,b
0,0,1,one
1,1,2,two
2,2,3,three
3,3,4,four
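
**Passing several columns to `set_index()` produces a hierarchical index, which ties this section back to the first one. A short sketch on the same small table, including a partial `reset_index()`.**

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": ["one", "two", "three", "four"],
                   "c": range(4)})

indexed = df.set_index(["b", "c"])        # two columns become a two-level index
restored = indexed.reset_index()          # both levels return as columns
partial = indexed.reset_index(level="c")  # only level c returns; b stays as the index
dropped = indexed.reset_index(drop=True)  # discard the index levels entirely

print(indexed)
print(restored)
print(partial)
print(dropped)
```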


# Combining datasets

**This part is a little convoluted: there are several ways to combine datasets, and which one to use depends on the need: `join()`, `concat()`, `merge()`.**