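**Preface: the data we collect arrives in separate pieces, but we often want to bring related, comparable data together, which means combining those pieces. Different goals call for different ways of combining. This lesson covers how to combine and reshape such data.**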

In [1]:
import numpy as np
import pandas as pd

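# Hierarchical indexing

**The data along one axis can be classified in more than one way, and each classification gives a different index. When several of these indexes appear together on the same axis of one table, the result is called a hierarchical index (a MultiIndex in pandas).**

**Let's start with a Series**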

In [2]:
data = pd.Series(['a','s','d','d','f','f','g','h'])
data

0    a
1    s
2    d
3    d
4    f
5    f
6    g
7    h
dtype: object

In [3]:
# Build a new index level underneath an existing one
data = pd.Series([1,23,4,5,6,7,8],index=[['大','大','大','小','小','小','中'],['a','s','d','f','g','h','t']]) # the lists passed to index run from the outer level to the inner level

In [4]:
data

大  a     1
   s    23
   d     4
小  f     5
   g     6
   h     7
中  t     8
dtype: int64

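**Indexing on different levels gives different results, and the levels interact: choosing a label on the outer level restricts which labels remain on the inner level.**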

In [5]:
data['大']

a     1
s    23
d     4
dtype: int64

In [6]:
data['大','a']

1
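
**The selections above go from the outer level inwards. As a minimal sketch of the other direction, the block below also selects by the inner level with `xs()`. The Series is a copy of the one above; the variable names `s`, `outer`, `inner` and `single` are only illustrative.**

```python
import pandas as pd

# Same two-level Series as in the cell above
s = pd.Series([1, 23, 4, 5, 6, 7, 8],
              index=[['大', '大', '大', '小', '小', '小', '中'],
                     ['a', 's', 'd', 'f', 'g', 'h', 't']])

outer = s['大']                # every row under the outer label '大'
inner = s.xs('f', level=1)     # rows whose inner label is 'f', whatever the outer label
single = s.loc[('小', 'g')]    # one value, addressed by the (outer, inner) pair

print(outer)
print(inner)
print(single)
```

**`xs()` with `level=1` drops the matched inner level, so what remains is indexed by the outer label only.**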

**The lookups above look a lot like selecting from a DataFrame. Indeed, the unstack() method turns this Series into a DataFrame.**

In [7]:
dstyle = data.unstack()
dstyle

Unnamed: 0,a,d,f,g,h,s,t
中,,,,,,,8.0
大,1.0,4.0,,,,23.0,
小,,,5.0,6.0,7.0,,


In [8]:
# And the other way around
dstyle.stack()

中  t     8.0
大  a     1.0
   d     4.0
   s    23.0
小  f     5.0
   g     6.0
   h     7.0
dtype: float64

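**A DataFrame can have a hierarchical index too, on either axis. It is built by passing a list of label arrays, one array per level.**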

In [9]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))))
frame

Unnamed: 0,0,1,2,3,4,5
0,140.0,871.0,819.0,24.0,683.0,474.0
1,240.0,194.0,478.0,508.0,131.0,53.0
2,345.0,393.0,79.0,184.0,237.0,920.0
3,116.0,55.0,412.0,518.0,396.0,579.0
4,743.0,733.0,504.0,21.0,90.0,93.0
5,89.0,364.0,358.0,761.0,86.0,126.0


In [10]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),index=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],columns=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,f,f,r
Unnamed: 0_level_1,Unnamed: 1_level_1,a,s,d,f,g,h
Dave,age,751.0,13.0,390.0,453.0,950.0,225.0
Wasa,age,843.0,856.0,284.0,979.0,496.0,813.0
Dave,money,809.0,442.0,739.0,687.0,800.0,927.0
Json,home,416.0,853.0,233.0,856.0,661.0,677.0
Json,grade,991.0,552.0,574.0,7.0,977.0,90.0
Honey,talent,702.0,846.0,969.0,479.0,496.0,166.0


In [11]:
frame['r']

Unnamed: 0,Unnamed: 1,h
Dave,age,225.0
Wasa,age,813.0
Dave,money,927.0
Json,home,677.0
Json,grade,90.0
Honey,talent,166.0


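**To make the meaning of each level clearer, we can name the levels by assigning to the index.names and columns.names attributes (they are attributes, not methods).**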

In [12]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
                     index=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
                     columns=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame.index.names=['字母','瞎写']
frame.columns.names=['名字','标签']

frame

Unnamed: 0_level_0,名字,a,a,a,f,f,r
Unnamed: 0_level_1,标签,a,s,d,f,g,h
字母,瞎写,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Dave,age,735.0,682.0,752.0,611.0,806.0,308.0
Wasa,age,48.0,160.0,150.0,145.0,841.0,954.0
Dave,money,132.0,577.0,270.0,17.0,619.0,100.0
Json,home,268.0,705.0,896.0,242.0,27.0,893.0
Json,grade,917.0,894.0,146.0,789.0,215.0,826.0
Honey,talent,417.0,160.0,658.0,394.0,777.0,861.0
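
**Level names can also be supplied when the index is built rather than assigned afterwards. The following is a small sketch using `pd.MultiIndex.from_arrays`; the labels, level names and values are made up for illustration.**

```python
import numpy as np
import pandas as pd

# Build named two-level indexes up front (illustrative labels and names)
rows = pd.MultiIndex.from_arrays([['Dave', 'Dave', 'Wasa'],
                                  ['age', 'money', 'age']],
                                 names=['name', 'label'])
cols = pd.MultiIndex.from_arrays([['a', 'a', 'f'],
                                  ['a', 's', 'f']],
                                 names=['outer', 'inner'])

frame = pd.DataFrame(np.arange(9).reshape(3, 3), index=rows, columns=cols)

print(frame)
print(frame.index.names, frame.columns.names)

# A single cell is addressed by one tuple per axis
cell = frame.loc[('Dave', 'money'), ('f', 'f')]
```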


In [13]:
test = pd.DataFrame(np.ceil(np.random.uniform(1,100,(5,5))))

In [14]:
test

Unnamed: 0,0,1,2,3,4
0,67.0,42.0,9.0,82.0,85.0
1,24.0,55.0,3.0,69.0,23.0
2,23.0,32.0,12.0,98.0,34.0
3,32.0,69.0,25.0,93.0,54.0
4,69.0,78.0,40.0,94.0,45.0


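## Reordering and sorting levels

**The levels of a hierarchical index are not fixed once they are set up; we may need to swap or rearrange them. To exchange the order of two levels, use swaplevel().**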

In [15]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
                     columns=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
                     index=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame.index.names=['字母','瞎写']
frame.columns.names=['名字','标签']

frame.swaplevel(0,1)  

Unnamed: 0_level_0,名字,Dave,Wasa,Dave,Json,Json,Honey
Unnamed: 0_level_1,标签,age,age,money,home,grade,talent
瞎写,字母,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
a,a,882.0,900.0,23.0,146.0,229.0,653.0
s,a,619.0,706.0,491.0,481.0,902.0,675.0
d,a,136.0,363.0,778.0,758.0,902.0,779.0
f,f,580.0,434.0,497.0,559.0,153.0,412.0
g,f,35.0,66.0,237.0,193.0,339.0,539.0
h,r,828.0,792.0,404.0,436.0,931.0,273.0


In [16]:
frame.sort_index(level=0)   # level counts the index levels from the outside in (the outermost is 0)

Unnamed: 0_level_0,名字,Dave,Wasa,Dave,Json,Json,Honey
Unnamed: 0_level_1,标签,age,age,money,home,grade,talent
字母,瞎写,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
a,a,882.0,900.0,23.0,146.0,229.0,653.0
a,d,136.0,363.0,778.0,758.0,902.0,779.0
a,s,619.0,706.0,491.0,481.0,902.0,675.0
f,f,580.0,434.0,497.0,559.0,153.0,412.0
f,g,35.0,66.0,237.0,193.0,339.0,539.0
r,h,828.0,792.0,404.0,436.0,931.0,273.0
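
**swaplevel() only exchanges the levels; it does not reorder the rows. A common pattern is therefore to sort right after swapping. This is a small sketch with made-up data; `k1`, `k2` and `v` are illustrative names.**

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays([['b', 'b', 'a', 'a'],
                                 [2, 1, 2, 1]],
                                names=['k1', 'k2'])
df = pd.DataFrame({'v': [1, 2, 3, 4]}, index=idx)

swapped = df.swaplevel('k1', 'k2')        # levels exchanged, row order unchanged
ordered = swapped.sort_index(level=0)     # sort by the new outer level, then the inner one

print(swapped)
print(ordered)
```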


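## Summary statistics by level

**Put simply: given a multi-level index, we pick one level we care about and compute statistics over it.**

**Two things need deciding when aggregating over a level: 1. Which level? Set it with level=. 2. Along which axis? Set it with axis=. Note that the level= argument of sum() was removed in pandas 2.0, so the cells below only run on older versions.**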

In [17]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,10,(5,5))),index=[['a','s','a','f','g'],['z','x','c','a','s']],columns=[[1,2,3,4,5],[1,'s','d','f','re']])
frame.index.names=['key1','key2']
frame.columns.names=['time','color']

In [18]:
frame

Unnamed: 0_level_0,time,1,2,3,4,5
Unnamed: 0_level_1,color,1,s,d,f,re
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
a,z,8.0,2.0,6.0,7.0,2.0
s,x,8.0,7.0,6.0,3.0,4.0
a,c,6.0,2.0,9.0,9.0,6.0
f,a,8.0,4.0,2.0,7.0,4.0
g,s,6.0,6.0,6.0,4.0,5.0


In [19]:
# Sum over one specific level; keep the two points above in mind
frame.sum(level='color',axis=1)

Unnamed: 0_level_0,color,1,s,d,f,re
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
a,z,8.0,2.0,6.0,7.0,2.0
s,x,8.0,7.0,6.0,3.0,4.0
a,c,6.0,2.0,9.0,9.0,6.0
f,a,8.0,4.0,2.0,7.0,4.0
g,s,6.0,6.0,6.0,4.0,5.0


In [20]:
frame.sum(level='key1')

time,1,2,3,4,5
color,1,s,d,f,re
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,14.0,4.0,15.0,16.0,8.0
s,8.0,7.0,6.0,3.0,4.0
f,8.0,4.0,2.0,7.0,4.0
g,6.0,6.0,6.0,4.0,5.0
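
**On pandas 2.0 and later, where `sum(level=...)` no longer exists, the same per-level totals can be obtained with `groupby(level=...)`. The sketch below uses a small made-up frame; for the column direction it transposes, groups, and transposes back, which is one possible approach rather than the only one.**

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=pd.MultiIndex.from_arrays([['a', 's', 'a'], ['z', 'x', 'c']],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([[1, 1], ['s', 'd']],
                                      names=['time', 'color']),
)

# Row direction: add up all rows that share the same key1 label
by_key1 = frame.groupby(level='key1').sum()

# Column direction: add up all columns that share the same time label
by_time = frame.T.groupby(level='time').sum().T

print(by_key1)
print(by_time)
```

**Unlike the old `sum(level=...)`, `groupby` sorts the group labels by default; pass `sort=False` to keep the order of first appearance.**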


## Turning DataFrame columns into an index

**We can take one or more columns of a DataFrame and use them directly as the row index. The method for this is set_index().**

In [21]:
data = pd.DataFrame({'a':[1,2,3,4],'b':['one','two','three','four'],'c':range(4)})

In [22]:
data

Unnamed: 0,a,b,c
0,1,one,0
1,2,two,1
2,3,three,2
3,4,four,3


In [23]:
data.set_index(['c'])

Unnamed: 0_level_0,a,b
c,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1,one
1,2,two
2,3,three
3,4,four


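**set_index() also takes a drop= parameter. It defaults to True, which removes the column once it has become the index; with drop=False the column is kept as well. For example:**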

In [24]:
data.set_index(['c'],drop=False)

Unnamed: 0_level_0,a,b,c
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1,one,0
1,2,two,1
2,3,three,2
3,4,four,3


**To go back, that is, to turn an index level into an ordinary column again, use reset_index().**

In [25]:
data01 = data.set_index(['c'])
data01.reset_index(['c'])

Unnamed: 0,c,a,b
0,0,1,one
1,1,2,two
2,2,3,three
3,3,4,four
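
**Passing several column names to set_index() produces a hierarchical index, which ties this section back to the first one. A minimal sketch, reusing the same small frame; `indexed` and `restored` are illustrative names.**

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4],
                     'b': ['one', 'two', 'three', 'four'],
                     'c': range(4)})

indexed = data.set_index(['b', 'c'])   # two columns become a two-level index
restored = indexed.reset_index()       # both levels return as ordinary columns

print(indexed)
print(restored)
```

**After the round trip the former index columns come first, so the column order is b, c, a rather than the original a, b, c.**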


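# Combining datasets

**This part can be confusing. There are several ways to combine datasets, and which one to use depends on the need: join(), concat(), and merge().**

## Database-style DataFrame merges

**In short, merge() performs SQL-style join operations on DataFrames.**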

In [26]:
df1 = pd.DataFrame({'key':['a','s','d','f','g','h'],'data1':range(6)}) 

In [27]:
df2 = pd.DataFrame({'key':['a','a','d'],'data2':range(3)})

In [28]:
# Example 1:
pd.merge(df1,df2)

Unnamed: 0,key,data1,data2
0,a,0,0
1,a,0,1
2,d,2,2


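**There is a default at work here: with no key given, merge() joins on the columns the two frames have in common, which here is key. It is better to state the key explicitly with on=.**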

In [29]:
pd.merge(df1,df2,on='key')

Unnamed: 0,key,data1,data2
0,a,0,0
1,a,0,1
2,d,2,2


In [30]:
# Other kinds of join
pd.merge(df1,df2,on='key',how='outer')

Unnamed: 0,key,data1,data2
0,a,0,0.0
1,a,0,1.0
2,s,1,
3,d,2,2.0
4,f,3,
5,g,4,
6,h,5,


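**As seen above, the how= parameter changes which keys are kept. With how='outer' the result uses the union of the keys; the default, how='inner', uses the intersection.**

> inner: use only the keys found in both tables

> outer: use all keys from both tables

> left: use all keys from the left table

> right: use all keys from the right table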

In [31]:
pd.merge(df1,df2,on='key',how='left')

Unnamed: 0,key,data1,data2
0,a,0,0.0
1,a,0,1.0
2,s,1,
3,d,2,2.0
4,f,3,
5,g,4,
6,h,5,


In [32]:
pd.merge(df1,df2,on='key',how='right')

Unnamed: 0,key,data1,data2
0,a,0,0
1,a,0,1
2,d,2,2
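
**To compare the four join types side by side, the sketch below counts the rows each one produces for the same df1 and df2, and uses `indicator=True` to show where each key came from. The variable names `sizes` and `flagged` are illustrative.**

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 's', 'd', 'f', 'g', 'h'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'a', 'd'], 'data2': range(3)})

# Number of result rows for each join type
sizes = {how: len(pd.merge(df1, df2, on='key', how=how))
         for how in ['inner', 'outer', 'left', 'right']}
print(sizes)

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
flagged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(flagged)
```

**Here every key of df2 also occurs in df1, so inner and right give the same rows, as do outer and left. The key a appears twice in df2, which is why it contributes two rows.**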


**One more case:**

In [33]:
left = pd.DataFrame({'key1':['a','s','d','f','g','h','j'],'val':['one','two','three','four','fiv','six','seven'],
                    'key2':[1,2,3,4,5,6,7]})

In [34]:
right =  pd.DataFrame({'key1':['a','s','d'],'val':['one','two','three'],
                    'key2':[1,2,2]})

In [35]:
pd.merge(left,right,on=['key1','key2'])

Unnamed: 0,key1,val_x,key2,val_y
0,a,one,1,one
1,s,two,2,two


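**Here two columns together act as the key. Conceptually, the values of the two columns are paired into tuples, and the tuples are then matched just as single keys were above. Because both frames also have a non-key column named val, the result tells them apart with the suffixes _x and _y.**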
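
**The default suffixes _x and _y can be replaced through the `suffixes=` parameter of merge(). A short sketch with smaller versions of the two frames; the suffix strings are arbitrary choices.**

```python
import pandas as pd

left = pd.DataFrame({'key1': ['a', 's', 'd'],
                     'key2': [1, 2, 3],
                     'val': ['one', 'two', 'three']})
right = pd.DataFrame({'key1': ['a', 's', 'd'],
                      'key2': [1, 2, 2],
                      'val': ['one', 'two', 'three']})

# Rows match only when BOTH key1 and key2 agree
merged = pd.merge(left, right, on=['key1', 'key2'],
                  suffixes=('_left', '_right'))
print(merged)
```

**The pair ('d', 3) on the left has no partner ('d', 2) on the right, so only two rows survive the inner join.**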

## Merging on index

**Sometimes the merge key lives in the index. Passing left_index=True and/or right_index=True to merge() uses the index as the key.**

In [36]:
left = pd.DataFrame({'NUM':range(4),'time':range(4,8)},index=['a','b','c','e'])

In [37]:
left

Unnamed: 0,NUM,time
a,0,4
b,1,5
c,2,6
e,3,7


In [38]:
right = pd.DataFrame({'code':range(6),'push':range(100,106),'key':['a','b','c','d','e','f']})
right

Unnamed: 0,code,push,key
0,0,100,a
1,1,101,b
2,2,102,c
3,3,103,d
4,4,104,e
5,5,105,f


In [39]:
pd.merge(left,right,right_on='key',left_index=True)  

Unnamed: 0,NUM,time,code,push,key
0,0,4,0,100,a
1,1,5,1,101,b
2,2,6,2,102,c
4,3,7,4,104,e


In [40]:
pd.merge(left,right,right_on='key',left_index=True,how='outer')

Unnamed: 0,NUM,time,code,push,key
0,0.0,4.0,0,100,a
1,1.0,5.0,1,101,b
2,2.0,6.0,2,102,c
4,3.0,7.0,4,104,e
3,,,3,103,d
5,,,5,105,f


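**With left_on and right_on you control which column serves as the key on each side; with left_index and right_index the index itself can serve as the key. Between them, these cover essentially every case.**

**Once single-level indexes work, hierarchical indexes follow by analogy.**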

In [41]:
left = pd.DataFrame({'name':['Dave','Json','Hash','Happy'],'year':['2000','2001','2002','2003'],'money':['1318','1551','15315','48644']})

In [42]:
right = pd.DataFrame({'age':['18','18','19','20']},index=[['Dave','Dave','Lashi','Hash'],['2000','2001','2004','2005']])

In [43]:
left

Unnamed: 0,name,year,money
0,Dave,2000,1318
1,Json,2001,1551
2,Hash,2002,15315
3,Happy,2003,48644


In [44]:
right

Unnamed: 0,Unnamed: 1,age
Dave,2000,18
Dave,2001,18
Lashi,2004,19
Hash,2005,20


In [45]:
pd.merge(left,right,left_on=['name','year'],right_index=True)

Unnamed: 0,name,year,money,age
0,Dave,2000,1318,18


**A special case: both frames are merged on their indexes.**

In [46]:
left = pd.DataFrame({'name1':['Dave','Json','Hash','Happy'],'year1':['2000','2001','2002','2003'],'money1':['1318','1551','15315','48644']},
                   index=['a','b','c','d'])

In [47]:
right = pd.DataFrame({'name2':['Andong','Json','Beihang','Happy'],'year2':['1955','2001','1999','2003'],'money2':['1318','1551','15315','48644']},
                   index=['a','e','g','d'])

In [48]:
left

Unnamed: 0,name1,year1,money1
a,Dave,2000,1318
b,Json,2001,1551
c,Hash,2002,15315
d,Happy,2003,48644


In [49]:
right

Unnamed: 0,name2,year2,money2
a,Andong,1955,1318
e,Json,2001,1551
g,Beihang,1999,15315
d,Happy,2003,48644


In [50]:
pd.merge(right,left,left_index=True,right_index=True)

Unnamed: 0,name2,year2,money2,name1,year1,money1
a,Andong,1955,1318,Dave,2000,1318
d,Happy,2003,48644,Happy,2003,48644


**For this case, the DataFrame join() method does the same job.**

In [51]:
left.join(right,how='inner')  # join() requires that the two frames have no overlapping column names (unless lsuffix/rsuffix are given)

Unnamed: 0,name1,year1,money1,name2,year2,money2
a,Dave,2000,1318,Andong,1955,1318
d,Happy,2003,48644,Happy,2003,48644
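
**To illustrate the overlap restriction: when both frames share a column name, join() raises an error unless suffixes are supplied. The frames below are made-up miniatures, and the suffix strings are arbitrary.**

```python
import pandas as pd

left = pd.DataFrame({'name': ['Dave', 'Json'], 'year': [2000, 2001]},
                    index=['a', 'b'])
right = pd.DataFrame({'name': ['Andong', 'Happy'], 'age': [18, 20]},
                     index=['a', 'd'])

# Both frames have a 'name' column, so a plain join is rejected
try:
    left.join(right)
    raised = False
except ValueError as err:
    raised = True
    print('join failed:', err)

# With suffixes the overlapping columns are renamed and the join succeeds
joined = left.join(right, lsuffix='_l', rsuffix='_r', how='inner')
print(joined)
```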


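## Concatenating along an axis

**My way of picturing concatenation along an axis: the tables are stacked on top of one another. Columns that exist in only one table are added and filled with NaN where values are missing, and columns with the same name are lined up. Repeated index labels are simply kept; no error is raised unless verify_integrity=True is passed.**

**The function for this is concat(); its parameters are introduced step by step below.**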

In [52]:
left = pd.DataFrame({'name1':['Dave','Json','Hash','Happy'],'year1':['2000','2001','2002','2003'],'money1':['1318','1551','15315','48644']},
                   index=['a','b','c','d'])

In [53]:
right = pd.DataFrame({'name2':['Andong','Json','Beihang','Happy'],'year2':['1955','2001','1999','2003'],'money2':['1318','1551','15315','48644']},
                   index=['a','e','g','d'])

In [54]:
pd.concat([left,right])


Unnamed: 0,money1,money2,name1,name2,year1,year2
a,1318.0,,Dave,,2000.0,
b,1551.0,,Json,,2001.0,
c,15315.0,,Hash,,2002.0,
d,48644.0,,Happy,,2003.0,
a,,1318.0,,Andong,,1955.0
e,,1551.0,,Json,,2001.0
g,,15315.0,,Beihang,,1999.0
d,,48644.0,,Happy,,2003.0
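
**A few concat() parameters control how the stacking above behaves. The sketch below, on two tiny made-up frames, shows `sort=False` (keep columns in order of appearance), `join='inner'` with `ignore_index=True` (keep only shared columns and renumber the rows), and `verify_integrity=True` (reject repeated index labels).**

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['a', 'b'])
b = pd.DataFrame({'y': [5, 6], 'z': [7, 8]}, index=['a', 'c'])

# Union of columns, kept in order of first appearance; index label 'a' repeats
stacked = pd.concat([a, b], sort=False)

# Only the shared column survives; rows are renumbered 0..3
shared = pd.concat([a, b], join='inner', ignore_index=True)

# Repeated index labels are an error only when explicitly checked
try:
    pd.concat([a, b], verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(stacked)
print(shared)
print(duplicate_rejected)
```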


**Starting with Series**

In [55]:
data1 = pd.Series(range(5))
data2 = pd.Series(range(7,12),index=range(5,10))

In [56]:
data1

0    0
1    1
2    2
3    3
4    4
dtype: int64


In [57]:
data2

5     7
6     8
7     9
8    10
9    11
dtype: int64

In [58]:
pd.concat([data1,data2])

0     0
1     1
2     2
3     3
4     4
5     7
6     8
7     9
8    10
9    11
dtype: int64

**To label the pieces so that the source Series can be told apart, pass the keys= parameter.**

In [59]:
a=pd.concat([data1,data2],keys=['key1','key2'])
a

key1  0     0
      1     1
      2     2
      3     3
      4     4
key2  5     7
      6     8
      7     9
      8    10
      9    11
dtype: int64

In [60]:
a.unstack()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
key1,0.0,1.0,2.0,3.0,4.0,,,,,
key2,,,,,,7.0,8.0,9.0,10.0,11.0


**Without really noticing, we have created a hierarchical index.**

In [61]:
pd.concat([data1,data2],axis=1,keys=['key1','key2'])  # this looks like adding columns, but the keys are really labelling the separate Series

Unnamed: 0,key1,key2
0,0.0,
1,1.0,
2,2.0,
3,3.0,
4,4.0,
5,,7.0
6,,8.0
7,,9.0
8,,10.0
9,,11.0


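**To choose which index labels appear in the result, older pandas versions accepted the join_axes= parameter. It was deprecated in pandas 0.25 and removed in 1.0, so the next cell only runs on old versions; on current versions, call reindex() on the result.**

In [62]:
pd.concat([data1,data2],axis=1,keys=['key1','key2'],join_axes=[[1,2,5,3]])  # keep only the index labels 1, 2, 5, 3, in that order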

Unnamed: 0,key1,key2
1,1.0,
2,2.0,
5,,7.0
3,3.0,
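
**On pandas 1.0 and later, where `join_axes=` is no longer available, the same selection can be sketched by concatenating first and then calling `reindex()` with the wanted labels. The names `wide` and `picked` are illustrative.**

```python
import pandas as pd

data1 = pd.Series(range(5))
data2 = pd.Series(range(7, 12), index=range(5, 10))

wide = pd.concat([data1, data2], axis=1, keys=['key1', 'key2'])

# Keep only these row labels, in this order
picked = wide.reindex([1, 2, 5, 3])
print(picked)
```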


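## Combining overlapping data

**So far we have combined data with merge and concat. A different problem arises when two datasets share the same index labels and one of them has gaps: we want to fill the missing values of one from the other.**

**For a Series with missing values, NumPy's where can fill them from another Series.**

In [63]:
# An example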
test_a = pd.Series(['a','s','d','f','g']) 

In [64]:
test_a[4]=np.nan
test_a

0      a
1      s
2      d
3      f
4    NaN
dtype: object

In [65]:
test_b = pd.Series(['q','w','e','r','t'])
test_b

0    q
1    w
2    e
3    r
4    t
dtype: object

In [66]:
np.info(np.where)

where(condition, [x, y])

Return elements chosen from `x` or `y` depending on `condition`.

.. note::
    When only `condition` is provided, this function is a shorthand for
    ``np.asarray(condition).nonzero()``. Using `nonzero` directly should be
    preferred, as it behaves correctly for subclasses. The rest of this
    documentation covers only the case where all three arguments are
    provided.

Parameters
----------
condition : array_like, bool
    Where True, yield `x`, otherwise yield `y`.
x, y : array_like
    Values from which to choose. `x`, `y` and `condition` need to be
    broadcastable to some shape.

Returns
-------
out : ndarray
    An array with elements from `x` where `condition` is True, and elements
    from `y` elsewhere.

See Also
--------
choose
nonzero : The function that is called when x and y are omitted

Notes
-----
If all the arrays are 1-D, `where` is equivalent to::

    [xv if c else yv
     for c, xv, yv in zip(condition, x, y)]

Examples
--------
>>>

In [67]:
np.where(pd.isnull(test_a),test_b,test_a)  # where test_a is null take the value from test_b, otherwise keep test_a

array(['a', 's', 'd', 'f', 't'], dtype=object)
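
**np.where returns a plain NumPy array and ignores index labels. As a sketch of pandas-native alternatives that keep the index, `Series.combine_first()` and `Series.fillna()` give the same filled values here. The names `a`, `b`, `filled` and `alt` are illustrative.**

```python
import numpy as np
import pandas as pd

a = pd.Series(['a', 's', 'd', 'f', np.nan])
b = pd.Series(['q', 'w', 'e', 'r', 't'])

filled = a.combine_first(b)   # take a, and fall back to b where a is missing
alt = a.fillna(b)             # same idea: fill the gaps in a from b, aligned on the index

print(filled)
print(alt)
```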

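**In the example above we filled in missing data by laying one set of values over the other and matching them position by position.**

**For DataFrames, the combine_first() method does the same kind of alignment and filling.**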

In [68]:
frame_a = pd.DataFrame({'a':['a','s','d','f'],'b':[np.nan,'x','c','r'],'c':['i','j','i',np.nan]})

In [69]:
frame_b = pd.DataFrame({'a':[np.nan,'c','d',np.nan],'b':['j','u','j',np.nan]})

In [70]:
frame_a

Unnamed: 0,a,b,c
0,a,,i
1,s,x,j
2,d,c,i
3,f,r,


In [71]:
frame_b

Unnamed: 0,a,b
0,,j
1,c,u
2,d,j
3,,


In [72]:
frame_a.combine_first(frame_b)  # gaps in frame_a are filled with the non-null values at the same positions in frame_b

Unnamed: 0,a,b,c
0,a,j,i
1,s,x,j
2,d,c,i
3,f,r,


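# Reshaping and pivoting

## Reshaping with hierarchical indexing

**Two methods were introduced above: stack() and unstack()**

> stack(): pivots the columns into the rows, DataFrame ----> Series

> unstack(): pivots the rows into the columns, Series ----> DataFrame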

In [73]:
data = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,6,7]},index=['z','x','c','v'])

In [74]:
data

Unnamed: 0,a,b
z,1,4
x,2,5
c,3,6
v,4,7


In [75]:
a = data.stack()
a

z  a    1
   b    4
x  a    2
   b    5
c  a    3
   b    6
v  a    4
   b    7
dtype: int64

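**At this point the axis has been rotated. Look at which way the data moved and where it ended up: the column labels have become the innermost level of the index.**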

In [76]:
a.unstack()

Unnamed: 0,a,b
z,1,4
x,2,5
c,3,6
v,4,7


In [77]:
test = pd.Series(range(5),index=([['a','s','d','f','g'],['a','x','d','f','g']]))

In [78]:
test

a  a    0
s  x    1
d  d    2
f  f    3
g  g    4
dtype: int64

In [79]:
test.unstack()

Unnamed: 0,a,d,f,g,x
a,0.0,,,,
d,,2.0,,,
f,,,3.0,,
g,,,,4.0,
s,,,,,1.0


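**By default, rotation always works on the innermost level.**

**Recall that levels can be named. To rotate a level other than the innermost one, pass stack() or unstack() either the number of the level (the outermost is level 0) or its name.**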

In [80]:
frame = pd.DataFrame(np.ceil(np.random.uniform(1,999,(4,4))),index=[['a','a','d','d'],['z','x','c','v']],
                     columns=[['haha','haha','lala','lala'],['q','w','e','r']])

In [81]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,haha,haha,lala,lala
Unnamed: 0_level_1,Unnamed: 1_level_1,q,w,e,r
a,z,592.0,909.0,848.0,155.0
a,x,111.0,702.0,381.0,235.0
d,c,774.0,406.0,427.0,623.0
d,v,373.0,424.0,406.0,19.0


In [82]:
frame.index.names=['key_a','state_a']
frame.columns.names=['key_b','state_b']

In [83]:
frame

Unnamed: 0_level_0,key_b,haha,haha,lala,lala
Unnamed: 0_level_1,state_b,q,w,e,r
key_a,state_a,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,z,592.0,909.0,848.0,155.0
a,x,111.0,702.0,381.0,235.0
d,c,774.0,406.0,427.0,623.0
d,v,373.0,424.0,406.0,19.0


In [84]:
frame.unstack()  # the innermost index level becomes the innermost column level

key_b,haha,haha,haha,haha,haha,haha,haha,haha,lala,lala,lala,lala,lala,lala,lala,lala
state_b,q,q,q,q,w,w,w,w,e,e,e,e,r,r,r,r
state_a,c,v,x,z,c,v,x,z,c,v,x,z,c,v,x,z
key_a,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3
a,,,111.0,592.0,,,702.0,909.0,,,381.0,848.0,,,235.0,155.0
d,774.0,373.0,,,406.0,424.0,,,427.0,406.0,,,623.0,19.0,,


In [85]:
frame.stack(0)  # the outermost column level moves to the innermost index level

Unnamed: 0_level_0,Unnamed: 1_level_0,state_b,e,q,r,w
key_a,state_a,key_b,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
a,z,haha,,592.0,,909.0
a,z,lala,848.0,,155.0,
a,x,haha,,111.0,,702.0
a,x,lala,381.0,,235.0,
d,c,haha,,774.0,,406.0
d,c,lala,427.0,,623.0,
d,v,haha,,373.0,,424.0
d,v,lala,406.0,,19.0,


In [86]:
frame.unstack(0) # the outermost index level moves to the innermost column level

key_b,haha,haha,haha,haha,lala,lala,lala,lala
state_b,q,q,w,w,e,e,r,r
key_a,a,d,a,d,a,d,a,d
state_a,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
c,,774.0,,406.0,,427.0,,623.0
v,,373.0,,424.0,,406.0,,19.0
x,111.0,,702.0,,381.0,,235.0,
z,592.0,,909.0,,848.0,,155.0,


In [87]:
frame.unstack(1)  # the inner index level (level 1) moves to the innermost column level, same as the default

key_b,haha,haha,haha,haha,haha,haha,haha,haha,lala,lala,lala,lala,lala,lala,lala,lala
state_b,q,q,q,q,w,w,w,w,e,e,e,e,r,r,r,r
state_a,c,v,x,z,c,v,x,z,c,v,x,z,c,v,x,z
key_a,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3
a,,,111.0,592.0,,,702.0,909.0,,,381.0,848.0,,,235.0,155.0
d,774.0,373.0,,,406.0,424.0,,,427.0,406.0,,,623.0,19.0,,


**The takeaway from all these examples: whichever level is moved, it always lands as the innermost level of the axis it moves to.**
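
**The examples above pass level numbers; level names work the same way and are often easier to read. A minimal sketch on a small made-up frame, reusing the level names key_a and state_a; the column name q and the values are illustrative.**

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays([['a', 'a', 'd', 'd'],
                                 ['z', 'x', 'z', 'x']],
                                names=['key_a', 'state_a'])
frame = pd.DataFrame({'q': [1, 2, 3, 4]}, index=idx)

by_name = frame.unstack('key_a')     # move the outer index level, chosen by name
by_number = frame.unstack(0)         # the same level, chosen by position
inner = frame.unstack('state_a')     # the inner level, which is also the default

print(by_name)
print(inner)
```

**In both results the moved level becomes the last (innermost) level of the columns, matching the rule stated above.**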

## Pivoting “long” format to “wide” format

**To see what the title means, let's load some data.**

In [88]:
data = pd.read_csv('E:/Datawhale数据分析/PythonForDataAnalysis-master/PythonForDataAnalysis-master/ch08/macrodata.csv') 

**The data can be downloaded from: https://github.com/wen-fei/PythonForDataAnalysis**

In [89]:
data.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [90]:
periods=pd.PeriodIndex(year=data.year,quarter=data.quarter,name='data')
periods  # a PeriodIndex holding one quarterly Period per row

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='data', length=203, freq='Q-DEC')

In [91]:
columns = pd.Index(['realgdp','infl','unemp'],name='item')  
columns

Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')

In [92]:
data = data.reindex(columns = columns)  # keep only the selected columns, in this order
data.head()

item,realgdp,infl,unemp
0,2710.349,0.0,5.8
1,2778.801,2.34,5.1
2,2775.488,2.74,5.3
3,2785.204,0.27,5.6
4,2847.699,2.31,5.2


In [93]:
data.index = periods.to_timestamp('D','end')
data.head()

item,realgdp,infl,unemp
data,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,2710.349,0.0,5.8
1959-06-30 23:59:59.999999999,2778.801,2.34,5.1
1959-09-30 23:59:59.999999999,2775.488,2.74,5.3
1959-12-31 23:59:59.999999999,2785.204,0.27,5.6
1960-03-31 23:59:59.999999999,2847.699,2.31,5.2


In [94]:
ldata = data.stack().reset_index()  # note how the shape of the data changes here

In [95]:
ldata.head()  # item is the name given to the columns Index earlier

Unnamed: 0,data,item,0
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.0
2,1959-03-31 23:59:59.999999999,unemp,5.8
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.34


In [96]:
ldata = data.stack().reset_index().rename(columns={0:'values'})
ldata.head()

Unnamed: 0,data,item,values
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.0
2,1959-03-31 23:59:59.999999999,unemp,5.8
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.34


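**As we can see, the item column now gathers the different variable names into a single column.**

**In a DataFrame we usually prefer each variable in its own column; pivot() does this.**

In [97]:
pivoted = ldata.pivot(index='data',columns='item',values='values')     # pandas 2.0 and later accept index, columns and values only as keyword arguments
pivoted.head()                                    # pivot() builds a new frame: row labels from index, column labels from columns, cells from values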

item,infl,realgdp,unemp
data,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2
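
**pivot() needs each (index, columns) pair to occur only once. As a sketch of what happens otherwise, the block below builds a tiny long table with made-up values, shows pivot() failing once a pair is repeated, and uses `pivot_table()` to aggregate the repeats instead. All names and numbers are illustrative.**

```python
import pandas as pd

long = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                     'item': ['x', 'y', 'x', 'y'],
                     'value': [1.0, 2.0, 3.0, 4.0]})

# Every (date, item) pair is unique, so pivot() works
wide = long.pivot(index='date', columns='item', values='value')
print(wide)

# Repeat the first row: the pair ('d1', 'x') now occurs twice
dup = pd.concat([long, long.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index='date', columns='item', values='value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table() aggregates repeated pairs (here by their mean)
averaged = dup.pivot_table(index='date', columns='item',
                           values='value', aggfunc='mean')
print(averaged)
```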


**Now add another column, value_a, to ldata**

In [98]:
ldata['value_a'] = np.random.randint(len(ldata))

In [99]:
ldata.head()

Unnamed: 0,data,item,values,value_a
0,1959-03-31 23:59:59.999999999,realgdp,2710.349,223
1,1959-03-31 23:59:59.999999999,infl,0.0,223
2,1959-03-31 23:59:59.999999999,unemp,5.8,223
3,1959-06-30 23:59:59.999999999,realgdp,2778.801,223
4,1959-06-30 23:59:59.999999999,infl,2.34,223


**If the last argument of pivot() is omitted for this DataFrame, the result has hierarchical columns**

In [100]:
pivoted = ldata.pivot(index='data',columns='item')

In [101]:
pivoted.head()

Unnamed: 0_level_0,values,values,values,value_a,value_a,value_a
item,infl,realgdp,unemp,infl,realgdp,unemp
data,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8,223,223,223
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,223,223,223
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,223,223,223
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,223,223,223
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,223,223,223


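## Pivoting “wide” format to “long” format

**Wide format spreads the distinct values of one column out into separate columns, as pivot() did above with item, giving each variable its own column.**

**Long format is the reverse: several columns are gathered into a single column, with another column recording which original column each value came from. The function for this is pd.melt().**

**pd.melt():** *https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html*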

*pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)*

In [102]:
frame = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,6,7]},index=['z','x','c','v'])

In [103]:
frame

Unnamed: 0,a,b
z,1,4
x,2,5
c,3,6
v,4,7


In [104]:
frame.melt()

Unnamed: 0,variable,value
0,a,1
1,a,2
2,a,3
3,a,4
4,b,4
5,b,5
6,b,6
7,b,7


In [105]:
test = pd.melt(frame,id_vars='a')

In [106]:
test

Unnamed: 0,a,variable,value
0,1,b,4
1,2,b,5
2,3,b,6
3,4,b,7


In [107]:
# Now put a and b together in one column
test = pd.melt(frame,value_vars=['a','b'])

In [108]:
test

Unnamed: 0,variable,value
0,a,1
1,a,2
2,a,3
3,a,4
4,b,4
5,b,5
6,b,6
7,b,7


**And what about converting back to wide format?**

In [112]:
test.reset_index(drop=True)

Unnamed: 0,variable,value
0,a,1
1,a,2
2,a,3
3,a,4
4,b,4
5,b,5
6,b,6
7,b,7


In [127]:
shape = test.pivot(index=None,columns='variable',values='value')  #DataFrame.pivot(index=None, columns=None, values=None)

In [128]:
shape

variable,a,b
0,1.0,
1,2.0,
2,3.0,
3,4.0,
4,,4.0
5,,5.0
6,,6.0
7,,7.0
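
**The result above is full of NaN because melt() discarded the original row labels, so pivot() has nothing to line the a and b values up on. One way to get a clean round trip, sketched below, is to keep the row labels as an identifier column before melting. The column name `index` is what reset_index() produces for an unnamed index; the other names are illustrative.**

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7]},
                     index=['z', 'x', 'c', 'v'])

# Wide -> long, keeping the row labels in an 'index' column
long = frame.reset_index().melt(id_vars='index')
print(long)

# Long -> wide, using the kept labels to align the values again
wide = long.pivot(index='index', columns='variable', values='value')

# pivot() sorts the row labels; restore the original order
wide = wide.reindex(frame.index)
print(wide)
```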
