Converting Tables Between Long and Wide Format
===

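Converting a table between long and wide format is also a kind of pivot operation.

* df.pivot()
    * Converts a long table into a wide table
* pd.melt()
    * Converts a wide table into a long table

The two are inverse operations of each other.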

---

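## Differences between pivot and pivot_table

* pivot raises an error if any (index, columns) combination appears more than once
* pivot_table aggregates duplicated combinations into a single value (the mean by default) instead of raising
    * If the data has no duplicates, pivot_table gives the same result as pivot
    * pivot_table is the more commonly used of the two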

In [1]:
import numpy as np
import pandas as pd

In [2]:
df_test = pd.DataFrame({
    'foo': ['one','one','one','two','two','two'],
    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    'baz': [1, 2, 3, 4, 5, 6]
})

df_test

Unnamed: 0,foo,bar,baz
0,one,A,1
1,one,B,2
2,one,C,3
3,two,A,4
4,two,B,5
5,two,C,6


In [3]:
df_test.pivot(index='foo', columns='bar', values='baz')  # no aggregation

bar,A,B,C
foo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,1,2,3
two,4,5,6


In [4]:
df_test.pivot_table(index='foo', columns='bar', values='baz')  # with aggregation (recommended)

bar,A,B,C
foo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,1,2,3
two,4,5,6


In [5]:
df_test = pd.DataFrame({
    'foo': ['one','one','one','one','two','two'],
    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    'baz': [1, 2, 3, 4, 5, 6]
})

df_test

Unnamed: 0,foo,bar,baz
0,one,A,1
1,one,B,2
2,one,C,3
3,one,A,4
4,two,B,5
5,two,C,6


In [6]:
df_test.pivot_table(index='foo', columns='bar', values='baz')

bar,A,B,C
foo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,2.5,2.0,3.0
two,,5.0,6.0


In [8]:
# df_test.pivot(index='foo', columns='bar', values='baz')  # no aggregation: duplicates raise ValueError (plain long-to-wide conversion)
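
The contrast described above can be checked directly. The following is a minimal sketch that rebuilds the second `df_test` under the hypothetical name `df_dup`, catches the error from `pivot`, and then shows that `pivot_table` accepts an explicit `aggfunc` in place of the default mean.

```python
import pandas as pd

# Same data as the second df_test: the pair ('one', 'A') appears twice.
df_dup = pd.DataFrame({
    'foo': ['one', 'one', 'one', 'one', 'two', 'two'],
    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    'baz': [1, 2, 3, 4, 5, 6],
})

# pivot() cannot decide which value belongs in the ('one', 'A') cell.
try:
    df_dup.pivot(index='foo', columns='bar', values='baz')
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print('pivot failed:', exc)

# pivot_table() aggregates the duplicates; the default aggfunc is the mean,
# and another function such as 'sum' can be passed explicitly.
by_mean = df_dup.pivot_table(index='foo', columns='bar', values='baz')
by_sum = df_dup.pivot_table(index='foo', columns='bar', values='baz', aggfunc='sum')
print(by_mean)
print(by_sum)
```

Cells with no source rows, such as ('two', 'A'), come out as NaN in both results.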

---

Pivoting "long format" to "wide format": pivot()
---

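Splits one column into several columns

* The values of one column become the row index
* The values of one column become the column index
* The remaining values fill the body of the table

Multiple time series are commonly stored in databases and CSV files in so-called "long" or "stacked" format.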

In [12]:
ldata = pd.read_csv('examples/pivot.csv', parse_dates=['date'])
ldata[:14]  # first 14 rows; one year = 4 quarters x 3 indicators = 12 rows

Unnamed: 0,date,item,value,value2
0,1959-03-31,realgdp,2710.349,-0.492362
1,1959-03-31,infl,0.0,0.938709
2,1959-03-31,unemp,5.8,-1.249725
3,1959-06-30,realgdp,2778.801,-1.299523
4,1959-06-30,infl,2.34,-0.60139
5,1959-06-30,unemp,5.1,0.958452
6,1959-09-30,realgdp,2775.488,-1.239327
7,1959-09-30,infl,2.74,0.90891
8,1959-09-30,unemp,5.3,-0.588005
9,1959-12-31,realgdp,2785.204,0.567531
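
If `examples/pivot.csv` is not at hand, a small stand-in can be typed in from the rows displayed above. This is only a sketch: the name `ldata_small` is hypothetical, it covers the first three quarters only, and `value2` is left out for brevity.

```python
import pandas as pd

# Stand-in for examples/pivot.csv, copied from the displayed rows
# (first three quarters of 1959, `value` column only).
ldata_small = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31'] * 3 + ['1959-06-30'] * 3 + ['1959-09-30'] * 3),
    'item': ['realgdp', 'infl', 'unemp'] * 3,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1, 2775.488, 2.74, 5.3],
})

# Long format: one row per (date, item) observation.
print(ldata_small.shape)

# Each (date, item) pair occurs exactly once, so pivot() can be used safely.
print(ldata_small.duplicated(['date', 'item']).any())
```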


In [13]:
ldata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 609 entries, 0 to 608
Data columns (total 4 columns):
date      609 non-null datetime64[ns]
item      609 non-null object
value     609 non-null float64
value2    609 non-null float64
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 19.1+ KB


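In pivot(), `index` and `columns` name the columns used for the row and column index; the optional `values` names the column whose values fill the table. Recent pandas versions accept these only as keyword arguments.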

In [16]:
ldata.pivot(index='date', columns='item', values='value').head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


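If `values` is omitted, every column other than the index and columns columns is pivoted, which produces hierarchical columns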

In [17]:
ldata.pivot(index='date', columns='item').head()

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31,0.0,2710.349,5.8,0.938709,-0.492362,-1.249725
1959-06-30,2.34,2778.801,5.1,-0.60139,-1.299523,0.958452
1959-09-30,2.74,2775.488,5.3,0.90891,-1.239327,-0.588005
1959-12-31,0.27,2785.204,5.6,-1.848965,0.567531,-0.532892
1960-03-31,2.31,2847.699,5.2,-0.899409,-0.876764,-0.214352


In [19]:
ldata.pivot(index='date', columns='item')['value'].head()  # same result as passing values='value'

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


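#### groupby plus unstack can generally reproduce any reshaping operation

The goal, without relying on the pivot method just introduced:

turn each date into a single row, and spread the item categories from one column into three columns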

In [20]:
ldata.head()

Unnamed: 0,date,item,value,value2
0,1959-03-31,realgdp,2710.349,-0.492362
1,1959-03-31,infl,0.0,0.938709
2,1959-03-31,unemp,5.8,-1.249725
3,1959-06-30,realgdp,2778.801,-1.299523
4,1959-06-30,infl,2.34,-0.60139


In [21]:
# Conversion with pivot, for reference
ldata.pivot(index='date', columns='item', values='value').head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


In [22]:
ldata.pivot_table('value', index='date', columns='item').head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


#### Reproducing pivot by hand with grouping and unstacking, without calling pivot

In [27]:
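# First move the columns of interest into the row index
# There are two ways to do this
# 1. Group, aggregate, then reshape
ldata.groupby(['date', 'item'])['value'].min().unstack().head()  # each (date, item) group holds a single row, so min() returns the original value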

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


In [32]:
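# Method 2, similar to grouping and aggregating
ldata.groupby(['date', 'item']).min().head()  # via groupby aggregation (group keys come out sorted)
ldata.set_index(['date', 'item']).head()  # moves both columns into the row index; same values, original row order kept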

Unnamed: 0_level_0,Unnamed: 1_level_0,value,value2
date,item,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,realgdp,2710.349,-0.492362
1959-03-31,infl,0.0,0.938709
1959-03-31,unemp,5.8,-1.249725
1959-06-30,realgdp,2778.801,-1.299523
1959-06-30,infl,2.34,-0.60139


In [35]:
ldata.set_index(['date', 'item']).unstack()['value'].head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2
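
As a quick sanity check, the four routes shown above can be compared on a small frame. This is a sketch on a hypothetical six-row stand-in called `long_df`, not on the full `ldata`; it assumes every (date, item) pair is unique, which is what makes the aggregating and non-aggregating routes agree.

```python
import pandas as pd

# Six rows copied from the head of the long table shown earlier.
long_df = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31'] * 3 + ['1959-06-30'] * 3),
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1],
})

via_pivot = long_df.pivot(index='date', columns='item', values='value')
via_pivot_table = long_df.pivot_table(values='value', index='date', columns='item')
via_groupby = long_df.groupby(['date', 'item'])['value'].min().unstack()
via_set_index = long_df.set_index(['date', 'item'])['value'].unstack()

# DataFrame.equals compares shape, dtypes, and element values.
same = [via_pivot.equals(other)
        for other in (via_pivot_table, via_groupby, via_set_index)]
print(same)
```

With duplicated (date, item) pairs the routes diverge: `pivot` and `set_index(...).unstack()` raise, while `pivot_table` and `groupby` aggregate.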


---

Pivoting "wide format" to "long format"
---

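Merges several columns into one column

The inverse of pivot is pandas.melt

Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input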

In [36]:
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df2

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


In [38]:
dfx = pd.melt(df2)  # collapses any number of columns into 2: one holding the original column names, one holding the original values
dfx

Unnamed: 0,variable,value
0,key,foo
1,key,bar
2,key,baz
3,A,1
4,A,2
5,A,3
6,B,4
7,B,5
8,B,6
9,C,7


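When using pandas.melt, it is best to state which columns (if any) are group indicators

Here the key column is the only group indicator, and the other columns are data values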

In [40]:
melted = pd.melt(df2, ['key'])
melted

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9


Applying pivot as the inverse operation reshapes the table back to its original layout

In [41]:
df2

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


In [42]:
melted

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9


In [47]:
xxx = melted.pivot(index='key', columns='variable', values='value').reset_index()  # reset_index() turns the row index into an ordinary column; set_index() does the reverse
xxx.columns.name = None  # remove the name of the column index
xxx

Unnamed: 0,key,A,B,C
0,bar,2,5,8
1,baz,3,6,9
2,foo,1,4,7
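
Note that the restored table has its rows ordered by `key`, because pivot sorts the index. A minimal sketch of the full round trip, using the hypothetical names `wide`, `long_df`, and `restored`, confirms that the result matches the original once that ordering is accounted for.

```python
import pandas as pd

wide = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                     'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Wide -> long, then long -> wide again.
long_df = pd.melt(wide, id_vars=['key'])
restored = long_df.pivot(index='key', columns='variable', values='value').reset_index()
restored.columns.name = None

# pivot() sorts the index, so compare against the original sorted by key.
expected = wide.sort_values('key').reset_index(drop=True)
print(restored.equals(expected))
```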


You can also specify which columns take part in the melt

In [49]:
pd.melt(df2, ['key'])
pd.melt(df2, ['key'], value_vars=['A', 'B'])  # column C is left out

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6


pandas.melt can also be used without any group indicators

In [50]:
pd.melt(df2, value_vars=['A', 'B'])

Unnamed: 0,variable,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6


In [51]:
df2

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


Combined exercise: specify several group indicators together with the participating columns

In [53]:
pd.melt(df2, ['key','A'])  # several group indicators
pd.melt(df2, ['key','A'], value_vars=['B'])  # several group indicators plus the participating columns

Unnamed: 0,key,A,variable,value
0,foo,1,B,4
1,bar,2,B,5
2,baz,3,B,6
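
The default output column names `variable` and `value` can be replaced through the `var_name` and `value_name` parameters of `pd.melt`. A short sketch follows; the names `metric` and `reading` are purely illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                    'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# `metric` and `reading` are illustrative names, not pandas defaults.
named = pd.melt(df2, id_vars=['key'], value_vars=['A', 'B'],
                var_name='metric', value_name='reading')
print(named.columns.tolist())
print(len(named))
```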
