## Contents
* [Overlap between two DataFrames or Series](#check-Pandas-Column-contains-a-particular-value)

* [Pivot tables](#透视表相关)


**======================================================**

### check Pandas Column contains a particular value

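* Many tasks require examining how two DataFrames relate to each other, for example whether the values in a column of one DataFrame also appear in a column of another.
* Depending on the requirement and the situation, there are several ways to do this.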

In [1]:
import pandas as pd
import numpy as np

Let's start with the simplest and most commonly used keyword, `in`.

In [2]:
s = pd.Series(list('abc'))
s

0    a
1    b
2    c
dtype: object

The first check that comes to mind is:


In [3]:
'a' in s

False

The result is not the expected `True`. This is because `in` on a Series checks whether the value is in the **index**, not in the values.

In [4]:
# So the following evaluates to True, because 1 is an index label

1 in s

True

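A somewhat roundabout approach is to convert the Series to an array first and then use `in`, for example via the `unique()` method.

This is wasteful, though, because `unique()` has to be computed first.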

In [5]:
s.unique()

array(['a', 'b', 'c'], dtype=object)

In [6]:
'a' in s.unique()

True

In [7]:
%timeit 'a' in s.unique()

The slowest run took 15.06 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 18.3 µs per loop


A better option is of course `s.values`:

In [8]:
%timeit 'a' in s.values

The slowest run took 12.81 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 4.01 µs per loop
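
The membership checks above can be compared side by side. This is a minimal sketch on the same three-letter Series; `(s == 'a').any()` and `s.isin(['a']).any()` are further alternatives that these notes do not cover, added here only for illustration.

```python
import pandas as pd

s = pd.Series(list('abc'))

# `in` on a Series looks at the index labels (0, 1, 2), not the values.
in_index = 'a' in s
label_in_index = 1 in s

# Checking the underlying array looks at the values instead.
in_values = 'a' in s.values

# Vectorised alternatives that also look at the values.
eq_any = (s == 'a').any()
isin_any = s.isin(['a']).any()

print(in_index, label_in_index, in_values, eq_any, isin_any)
```

Only the first check is `False`; every check that looks at the values finds `'a'`.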


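The approaches above test whether a single value appears in a column.

In real projects we often need to process a whole column; with the approaches above, that means combining them with `apply` to test row by row.

Series has a very handy method, `isin`, which tests the whole Series at once and returns a Series of the same shape (bool dtype).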

In [9]:
s1 = pd.Series(list('abcdads2er'))
s2 = pd.Series(list('dvadtqqtdd'))

In [10]:
s1.isin(s2)

0     True
1    False
2    False
3     True
4     True
5     True
6    False
7    False
8    False
9    False
dtype: bool

The argument to `isin()` does not have to be a Series; it can be any list-like object, such as a list or a set:

In [19]:
s1.isin(list('asdfasdf'))

0     True
1    False
2    False
3     True
4     True
5     True
6     True
7    False
8    False
9    False
dtype: bool

In [12]:
df = pd.DataFrame({'s1':s1, 's2': s2})

This boolean Series is useful in many ways:
* adding it as a new column
* selecting rows based on True/False

In [13]:
df

Unnamed: 0,s1,s2
0,a,d
1,b,v
2,c,a
3,d,d
4,a,t
5,d,q
6,s,q
7,2,t
8,e,d
9,r,d


In [14]:
df['s1_in_s2'] = df.s1.isin(df.s2)

In [15]:
df

Unnamed: 0,s1,s2,s1_in_s2
0,a,d,True
1,b,v,False
2,c,a,False
3,d,d,True
4,a,t,True
5,d,q,True
6,s,q,False
7,2,t,False
8,e,d,False
9,r,d,False


In [23]:
df[df.s1.isin(list('test'))]

Unnamed: 0,s1,s2,s1_in_s2
6,s,q,False
8,e,d,False


In [24]:
df[~df.s1.isin(list('test'))]

Unnamed: 0,s1,s2,s1_in_s2
0,a,d,True
1,b,v,False
2,c,a,False
3,d,d,True
4,a,t,True
5,d,q,True
7,2,t,False
9,r,d,False
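
The same pattern applies when the two columns live in different DataFrames, which is the case described at the start of this section. A minimal sketch, assuming two hypothetical toy tables (`orders` and `customers`, with made-up column names):

```python
import pandas as pd

# Hypothetical toy tables; the names are invented for this sketch.
orders = pd.DataFrame({'order_id': [1, 2, 3, 4],
                       'customer': ['ann', 'bob', 'cat', 'dan']})
customers = pd.DataFrame({'name': ['ann', 'cat', 'eve']})

# Boolean mask: True where orders.customer appears anywhere in customers.name.
known = orders['customer'].isin(customers['name'])

matched = orders[known]       # rows whose customer exists in the other table
unmatched = orders[~known]    # rows whose customer is missing from it

print(matched)
print(unmatched)
```

Note that `isin` matches on values only; the row positions of the two tables play no role.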


### Pivot tables <a id="透视表相关"></a>

Sample data: data/olive.csv


In [25]:
data = pd.read_csv('data/olive.csv')

In [26]:
data.head()

Unnamed: 0.1,Unnamed: 0,region,area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic
0,1.North-Apulia,1,1,1075,75,226,7823,672,36,60,29
1,2.North-Apulia,1,1,1088,73,224,7709,781,31,61,29
2,3.North-Apulia,1,1,911,54,246,8113,549,31,63,29
3,4.North-Apulia,1,1,966,57,240,7952,619,50,78,35
4,5.North-Apulia,1,1,1051,67,259,7771,672,50,80,46


In [27]:
data.region.unique()

array([1, 2, 3], dtype=int64)

In [28]:
data.area.unique()

array([1, 2, 3, 4, 5, 6, 9, 7, 8], dtype=int64)

In [29]:
pd.crosstab(data.area, data.region)

region,1,2,3
area,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,25,0,0
2,56,0,0
3,206,0,0
4,36,0,0
5,0,65,0
6,0,33,0
7,0,0,50
8,0,0,50
9,0,0,51


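Work through `pivot_table` in full, following http://pbpython.com/pandas-pivot-table-explained.html

data: data/sales-funnel.xlsx

The data is the sales data of one sales channel.

* What kinds of problems can a pivot table solve?
* How to build pivot tables efficiently with pandas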

In [30]:
df = pd.read_excel("data/sales-funnel.xlsx")
df.head()

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won


In [32]:
df.Status.unique()

array([u'presented', u'pending', u'declined', u'won'], dtype=object)

In [33]:
df["Status"] = df["Status"].astype("category")
# set_categories returns a new Series; the inplace argument is gone in recent pandas
df["Status"] = df["Status"].cat.set_categories(["won","pending","presented","declined"])

In [35]:
df.dtypes

Account        int64
Name          object
Rep           object
Manager       object
Product       object
Quantity       int64
Price          int64
Status      category
dtype: object

The simplest pivot table needs a DataFrame and an index. In this example we use the "Name" column as the index.

In [36]:
pd.pivot_table(df,index=["Name"])

Unnamed: 0_level_0,Account,Price,Quantity
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Barton LLC,740150,35000,1.0
"Fritsch, Russel and Anderson",737550,35000,1.0
Herman LLC,141962,65000,2.0
Jerde-Hilpert,412290,5000,2.0
"Kassulke, Ondricka and Metz",307599,7000,3.0
Keeling LLC,688981,100000,5.0
Kiehn-Spinka,146832,65000,2.0
Koepp Ltd,729833,35000,2.0
Kulas Inc,218895,25000,1.5
Purdy-Kunde,163416,30000,1.0


By default, the pivot table summarizes the data by averaging every int or float column, such as Account, Price and Quantity above.
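
The default averaging can be checked on a tiny example. This is a sketch on hypothetical toy data (not the sales-funnel file). `values` is passed explicitly so that only numeric columns are aggregated, since recent pandas versions may refuse to average text columns.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
toy = pd.DataFrame({'Name': ['A', 'A', 'B'],
                    'Price': [10, 30, 50],
                    'Quantity': [1, 2, 3]})

# With no aggfunc given, pivot_table averages each value column per group.
default_pt = pd.pivot_table(toy, index=['Name'], values=['Price', 'Quantity'])

# The same call with the aggregation spelled out.
explicit_pt = pd.pivot_table(toy, index=['Name'], values=['Price', 'Quantity'],
                             aggfunc='mean')

print(default_pt)
```

For group `A` the Price becomes (10 + 30) / 2 = 20 and the Quantity (1 + 2) / 2 = 1.5.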

You can also have multiple indexes. In fact, most `pivot_table` arguments accept multiple values via a list.

In [37]:
pd.pivot_table(df,index=["Name","Rep","Manager"])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Account,Price,Quantity
Name,Rep,Manager,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Barton LLC,John Smith,Debra Henley,740150,35000,1.0
"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,737550,35000,1.0
Herman LLC,Cedric Moss,Fred Anderson,141962,65000,2.0
Jerde-Hilpert,John Smith,Debra Henley,412290,5000,2.0
"Kassulke, Ondricka and Metz",Wendy Yule,Fred Anderson,307599,7000,3.0
Keeling LLC,Wendy Yule,Fred Anderson,688981,100000,5.0
Kiehn-Spinka,Daniel Hilton,Debra Henley,146832,65000,2.0
Koepp Ltd,Wendy Yule,Fred Anderson,729833,35000,2.0
Kulas Inc,Daniel Hilton,Debra Henley,218895,25000,1.5
Purdy-Kunde,Cedric Moss,Fred Anderson,163416,30000,1.0


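This is interesting but not particularly useful. What we probably want is to look at the results with "Manager" and "Rep" as the index. That is easy to do: just change the index.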

In [39]:
x = pd.pivot_table(df,index=["Manager","Rep"])

In [42]:
x.to_excel('D://pivot.xlsx', sheet_name='Sheet1')

In [44]:
# The order of the index matters; Manager should come before Rep here
pd.pivot_table(df,index=["Rep","Manager"])

Unnamed: 0_level_0,Unnamed: 1_level_0,Account,Price,Quantity
Rep,Manager,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Cedric Moss,Fred Anderson,196016.5,27500.0,1.25
Craig Booker,Debra Henley,720237.0,20000.0,1.25
Daniel Hilton,Debra Henley,194874.0,38333.333333,1.666667
John Smith,Debra Henley,576220.0,20000.0,1.5
Wendy Yule,Fred Anderson,614061.5,44250.0,3.0


The "Account" and "Quantity" columns are not really useful to us. We can drop them by explicitly naming the columns we care about with the `values` argument.

In [45]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])

Unnamed: 0_level_0,Unnamed: 1_level_0,Price
Manager,Rep,Unnamed: 2_level_1
Debra Henley,Craig Booker,20000
Debra Henley,Daniel Hilton,38333
Debra Henley,John Smith,20000
Fred Anderson,Cedric Moss,27500
Fred Anderson,Wendy Yule,44250


The "Price" column is averaged automatically, but we can also count or sum its entries. This is easy to do with `aggfunc` and `np.sum`.

In [46]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price
Manager,Rep,Unnamed: 2_level_1
Debra Henley,Craig Booker,80000
Debra Henley,Daniel Hilton,115000
Debra Henley,John Smith,40000
Fred Anderson,Cedric Moss,110000
Fred Anderson,Wendy Yule,177000


`aggfunc` can take a list of functions. Let's try it with NumPy's `mean` and the built-in `len`, which gives both the average and the count.

In [47]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price
Manager,Rep,Unnamed: 2_level_2,Unnamed: 3_level_2
Debra Henley,Craig Booker,20000,4
Debra Henley,Daniel Hilton,38333,3
Debra Henley,John Smith,20000,2
Fred Anderson,Cedric Moss,27500,4
Fred Anderson,Wendy Yule,44250,4


What if we need a figure derived from two columns, for example price * quantity = total?

One approach is to compute it as a new column before building the pivot table.

In [49]:
df['total'] = df['Quantity'] * df['Price']

In [50]:
df.head()

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status,total
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented,30000
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented,10000
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending,10000
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined,35000
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won,130000


In [53]:
def get_total(row):
    print(row)
    return row.Quantity * row.Price

In [None]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price", "Quantity"],aggfunc=[get_total])

# This turns out not to work: get_total is passed a single column at a time
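
If the figure really has to be computed from two columns at aggregation time, one option is to group first and work on each group's rows. A minimal sketch on hypothetical toy data, using `groupby(...).apply` in place of `pivot_table`; it is an alternative technique, not something `aggfunc` supports directly. The precomputed-column approach from above is shown next to it for comparison.

```python
import pandas as pd

# Hypothetical toy data reusing the column names of the sales data.
toy = pd.DataFrame({'Manager': ['M1', 'M1', 'M2'],
                    'Rep': ['R1', 'R1', 'R2'],
                    'Quantity': [1, 2, 3],
                    'Price': [100, 50, 10]})

# Each group arrives as a small DataFrame, so both columns are visible.
totals = (toy.groupby(['Manager', 'Rep'])[['Quantity', 'Price']]
             .apply(lambda g: (g['Quantity'] * g['Price']).sum()))

# The precomputed-column approach gives the same numbers.
via_column = pd.pivot_table(toy.assign(total=toy['Quantity'] * toy['Price']),
                            index=['Manager', 'Rep'], values='total',
                            aggfunc='sum')

print(totals)
print(via_column)
```

For larger data the precomputed column is usually the simpler choice, because the multiplication is vectorised over the whole frame.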

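I think one confusing aspect of `pivot_table` is the use of `columns` and `values`. Remember that `columns` is optional: it provides an additional way to **segment** the actual values you care about. The aggregation function `aggfunc`, however, is applied to the items listed in `values`.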

In [57]:
pd.pivot_table(df,index=["Manager","Rep"],values=["total"],
               columns=["Product"],aggfunc=[np.sum])

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,sum,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,total,total,total,total
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3
Debra Henley,Craig Booker,65000,10000.0,,10000.0
Debra Henley,Daniel Hilton,210000,,,10000.0
Debra Henley,John Smith,35000,10000.0,,
Fred Anderson,Cedric Moss,160000,5000.0,,10000.0
Fred Anderson,Wendy Yule,630000,21000.0,10000.0,


The NaNs are a bit distracting, though. To remove them, we can use `fill_value` to set them to 0.

In [59]:
pd.pivot_table(df,index=["Manager","Rep"],values=["total"],
               columns=["Product"],aggfunc=[np.sum],fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,sum,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,total,total,total,total
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3
Debra Henley,Craig Booker,65000,10000,0,10000
Debra Henley,Daniel Hilton,210000,0,0,10000
Debra Henley,John Smith,35000,10000,0,0
Fred Anderson,Cedric Moss,160000,5000,0,10000
Fred Anderson,Wendy Yule,630000,21000,10000,0


Interestingly, you can move items into the index to get a different visual representation. In the code below, "Product" is removed from `columns` and added to `index`.

In [61]:
pd.pivot_table(df,index=["Manager","Rep", "Product"],values=["total", "Quantity"],
               aggfunc=[np.sum],fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Quantity,total
Manager,Rep,Product,Unnamed: 3_level_2,Unnamed: 4_level_2
Debra Henley,Craig Booker,CPU,2,65000
Debra Henley,Craig Booker,Maintenance,2,10000
Debra Henley,Craig Booker,Software,1,10000
Debra Henley,Daniel Hilton,CPU,4,210000
Debra Henley,Daniel Hilton,Software,1,10000
Debra Henley,John Smith,CPU,1,35000
Debra Henley,John Smith,Maintenance,2,10000
Fred Anderson,Cedric Moss,CPU,3,160000
Fred Anderson,Cedric Moss,Maintenance,1,5000
Fred Anderson,Cedric Moss,Software,1,10000


What if I want to see some totals? `margins=True` does that for us.

It automatically adds a total row at the bottom.

In [62]:
pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["total","Quantity"],
               aggfunc=[np.sum,np.mean],fill_value=0,margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sum,sum,mean,mean
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Quantity,total,Quantity,total
Manager,Rep,Product,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Debra Henley,Craig Booker,CPU,2,65000,1.0,32500.0
Debra Henley,Craig Booker,Maintenance,2,10000,2.0,10000.0
Debra Henley,Craig Booker,Software,1,10000,1.0,10000.0
Debra Henley,Daniel Hilton,CPU,4,210000,2.0,105000.0
Debra Henley,Daniel Hilton,Software,1,10000,1.0,10000.0
Debra Henley,John Smith,CPU,1,35000,1.0,35000.0
Debra Henley,John Smith,Maintenance,2,10000,2.0,10000.0
Fred Anderson,Cedric Moss,CPU,3,160000,1.5,80000.0
Fred Anderson,Cedric Moss,Maintenance,1,5000,1.0,5000.0
Fred Anderson,Cedric Moss,Software,1,10000,1.0,10000.0
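
The total row is easier to see on a small table. A minimal sketch on hypothetical toy data; `All` is the default label pandas gives to the margins row.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
toy = pd.DataFrame({'Manager': ['M1', 'M1', 'M2'],
                    'total': [100, 200, 50]})

# margins=True appends an "All" row that aggregates over every group.
with_margins = pd.pivot_table(toy, index=['Manager'], values=['total'],
                              aggfunc='sum', margins=True)

print(with_margins)
```

The `All` row holds 350, the sum over both managers.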


Look at each Manager broken down by status:

In [63]:
pd.pivot_table(df,index=["Manager","Status"],values=["total"],
               aggfunc=[np.sum],fill_value=0,margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,total
Manager,Status,Unnamed: 2_level_2
Debra Henley,won,130000
Debra Henley,pending,100000
Debra Henley,presented,50000
Debra Henley,declined,70000
Fred Anderson,won,651000
Fred Anderson,pending,5000
Fred Anderson,presented,50000
Fred Anderson,declined,130000
All,,1186000


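A really handy feature is that you can pass a dictionary to `aggfunc` in order to **apply different functions to different values**. A side effect is that the column labels come out a little cleaner, because the function name no longer appears as a header level.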

In [64]:
pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","total"],
               aggfunc={"Quantity":len,"total":np.sum},fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,total,total,total,total,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Debra Henley,won,130000,0,0,0,1,0,0,0
Debra Henley,pending,80000,20000,0,0,1,2,0,0
Debra Henley,presented,30000,0,0,20000,1,0,0,2
Debra Henley,declined,70000,0,0,0,2,0,0,0
Fred Anderson,won,630000,21000,0,0,2,1,0,0
Fred Anderson,pending,0,5000,0,0,0,1,0,0
Fred Anderson,presented,30000,0,10000,10000,1,0,1,1
Fred Anderson,declined,130000,0,0,0,1,0,0,0


You can provide a list of aggfunctions to apply to each value too:

In [65]:
table = pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","total"],
               aggfunc={"Quantity":len,"total":[np.sum,np.mean]},fill_value=0)
table

Unnamed: 0_level_0,Unnamed: 1_level_0,total,total,total,total,total,total,total,total,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,mean,mean,mean,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3
Debra Henley,won,130000,0,0,0,130000,0,0,0,1,0,0,0
Debra Henley,pending,80000,10000,0,0,80000,20000,0,0,1,2,0,0
Debra Henley,presented,30000,0,0,10000,30000,0,0,20000,1,0,0,2
Debra Henley,declined,35000,0,0,0,70000,0,0,0,2,0,0,0
Fred Anderson,won,315000,21000,0,0,630000,21000,0,0,2,1,0,0
Fred Anderson,pending,0,5000,0,0,0,5000,0,0,0,1,0,0
Fred Anderson,presented,30000,0,10000,10000,30000,0,10000,10000,1,0,1,1
Fred Anderson,declined,130000,0,0,0,130000,0,0,0,1,0,0,0


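Putting all of this together at once may look daunting, but once you start working with the data and add items step by step, you will get a feel for how it works. My rule of thumb is: **once you use multiple `groupby` calls, evaluate whether a pivot table would be a better choice.**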

Here is a short example. Using the data above, I could use groupby like this:
df.groupby(['Name','Rep','Manager']).mean()

Which gives the same output as:
pd.pivot_table(df,index=["Name","Rep","Manager"])

The pivot table is essentially a **wrapper** around groupby's that also allows you to do more (as I show in subsequent steps).

In this case, I personally think it is easier to do my data manipulations using pivot_table than trying to extend my groupby syntax.

There is nothing inherently wrong with groupby. I just notice that sometimes I start two or three groupby's and suddenly realize I've manually created a pivot table.

In [102]:
df.groupby(['Manager', 'Rep']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Account,Quantity,Price,total
Manager,Rep,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Debra Henley,Craig Booker,720237.0,1.25,20000.0,21250.0
Debra Henley,Daniel Hilton,194874.0,1.666667,38333.333333,73333.333333
Debra Henley,John Smith,576220.0,1.5,20000.0,22500.0
Fred Anderson,Cedric Moss,196016.5,1.25,27500.0,43750.0
Fred Anderson,Wendy Yule,614061.5,3.0,44250.0,165250.0
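
The equivalence between `groupby(...).mean()` and a default `pivot_table` can be checked directly. A minimal sketch on hypothetical toy data; the numeric columns are selected explicitly on both sides so that no text column is averaged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data reusing the column names of the sales data.
toy = pd.DataFrame({'Manager': ['M1', 'M1', 'M2', 'M2'],
                    'Rep': ['R1', 'R1', 'R2', 'R3'],
                    'Price': [10, 30, 50, 70],
                    'Quantity': [1, 2, 3, 4]})

cols = ['Price', 'Quantity']
by_groupby = toy.groupby(['Manager', 'Rep'])[cols].mean()
by_pivot = pd.pivot_table(toy, index=['Manager', 'Rep'], values=cols)

# Compare the row labels and the numbers column by column.
same_index = by_groupby.index.equals(by_pivot.index)
same_values = np.allclose(by_groupby[cols].to_numpy(), by_pivot[cols].to_numpy())

print(same_index, same_values)
```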


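Filtering a pivot table:

You can filter it with the standard DataFrame filtering functions, because the resulting pivot table is itself an ordinary DataFrame.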

In [66]:
type(table)

pandas.core.frame.DataFrame

In [67]:
table.shape

(8, 12)

In [74]:
# Note that the shape above does not include the index or the column-name levels;
# so cell [0, 0] is the top-left value, 130000
table.iat[0,0]

130000

In [71]:
table.head(2)

Unnamed: 0_level_0,Unnamed: 1_level_0,total,total,total,total,total,total,total,total,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,mean,mean,mean,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3
Debra Henley,won,130000,0,0,0,130000,0,0,0,1,0,0,0
Debra Henley,pending,80000,10000,0,0,80000,20000,0,0,1,2,0,0


In [None]:
table.query('Manager == ["Debra Henley"]')

# The current version has a bug: http://stackoverflow.com/questions/30445044/having-trouble-with-multiple-groupby-with-a-variable-and-a-category-binned-da

In [None]:
table.query('Status == ["pending","won"]')
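
When `query` is not an option, the index levels can be filtered with a boolean mask instead. A minimal sketch on a hypothetical toy pivot table, assuming plain string labels instead of a categorical Status; it reuses `isin` from the first section.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
toy = pd.DataFrame({'Manager': ['M1', 'M1', 'M2', 'M2'],
                    'Status': ['won', 'pending', 'won', 'declined'],
                    'total': [100, 20, 300, 40]})
pt = pd.pivot_table(toy, index=['Manager', 'Status'], values=['total'],
                    aggfunc='sum')

# Pull each index level out as a flat Index and build masks from it.
managers = pt.index.get_level_values('Manager')
statuses = pt.index.get_level_values('Status')

m1_rows = pt[managers == 'M1']
open_rows = pt[statuses.isin(['pending', 'won'])]

print(m1_rows)
print(open_rows)
```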

In [88]:
test = pd.DataFrame(dict(A = np.random.rand(3),
                        B = pd.Series(['a', 'a', 'b'],dtype='category'),
                        C = pd.Series(['x', 'y', 'z'])))

In [89]:
test

Unnamed: 0,A,B,C
0,0.662318,a,x
1,0.513908,a,y
2,0.907551,b,z


In [81]:
test.query('B=="a"')

Unnamed: 0,A,B
0,0.886707,a
1,0.834855,a


In [97]:
test_pt = pd.pivot_table(test,index=["B", "C"], fill_value=0, aggfunc=[np.sum, len])

In [98]:
test_pt

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,len
Unnamed: 0_level_1,Unnamed: 1_level_1,A,A
B,C,Unnamed: 2_level_2,Unnamed: 3_level_2
a,x,0.662318,1
a,y,0.513908,1
a,z,0.0,0
b,x,0.0,0
b,y,0.0,0
b,z,0.907551,1


In [99]:
test_pt.query("B == ['a']")

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,len
Unnamed: 0_level_1,Unnamed: 1_level_1,A,A
B,C,Unnamed: 2_level_2,Unnamed: 3_level_2
a,x,0.662318,1
a,y,0.513908,1
a,z,0.0,0


#### Pivot table cheat sheet
<img src='img/pivot-table-datasheet.png'>

#### Pivot tables in practice
Requirement: break down the transaction records by bank and by date.

data: data/pivot_table.csv

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/pivot_table.csv')
df.shape

(17637, 5)

In [3]:
df.dtypes

txn_id               int64
bank_id              int64
currency_amount    float64
txn_status           int64
update_time          int64
dtype: object

In [4]:
df.head()

Unnamed: 0,txn_id,bank_id,currency_amount,txn_status,update_time
0,1,10000,100,0,1444118813
1,2,10000,100,0,1444118843
2,3,10000,100,0,1444118994
3,4,10006,100,0,1444121142
4,5,10001,100,0,1444121176


In [5]:
# Convert the timestamp to datetime
# using pd.to_datetime()
# http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_datetime.html
# The timestamps are in seconds
df['update_time'] = pd.to_datetime(df.update_time, unit='s')

In [6]:
df.head()

Unnamed: 0,txn_id,bank_id,currency_amount,txn_status,update_time
0,1,10000,100,0,2015-10-06 08:06:53
1,2,10000,100,0,2015-10-06 08:07:23
2,3,10000,100,0,2015-10-06 08:09:54
3,4,10006,100,0,2015-10-06 08:45:42
4,5,10001,100,0,2015-10-06 08:46:16
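
The `unit='s'` argument matters. A short sketch using the first two epoch values from the table above; without the unit, pandas reads integers as nanoseconds since the epoch.

```python
import pandas as pd

# Epoch seconds taken from the first two rows shown above.
stamps = pd.Series([1444118813, 1444118843])

# Interpreted as seconds since 1970-01-01.
as_seconds = pd.to_datetime(stamps, unit='s')
print(as_seconds)

# Interpreted as nanoseconds (the default), which lands just after 1970-01-01.
as_default = pd.to_datetime(stamps)
print(as_default)
```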


In [8]:
# The index here must be per day, so a Grouper is required
# The official docs describe this usage: http://pandas.pydata.org/pandas-docs/stable/reshaping.html
# Also, we want to count the records, not take the default mean
pd.pivot_table(df,index=pd.Grouper(freq='1D', key='update_time'),values=["txn_id"],
               columns=["bank_id"],aggfunc='count')

Unnamed: 0_level_0,update_time,update_time,update_time,txn_id,txn_id,txn_id
bank_id,10000,10001,10006,10000,10001,10006
update_time,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2015-10-06,9.0,11.0,9.0,9.0,11.0,9.0
2015-10-07,3.0,10.0,6.0,3.0,10.0,6.0
2015-10-08,4.0,9.0,3.0,4.0,9.0,3.0
2015-10-09,5.0,6.0,,5.0,6.0,
2015-10-12,3.0,4.0,10.0,3.0,4.0,10.0
2015-10-14,,3.0,,,3.0,
2015-10-15,1.0,1.0,1.0,1.0,1.0,1.0
2015-10-16,3.0,3.0,4.0,3.0,3.0,4.0
2015-10-20,,,15.0,,,15.0
2015-10-21,5.0,7.0,5.0,5.0,7.0,5.0
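
The daily grouping can be tried without the CSV file. A minimal sketch on hypothetical toy transactions; passing `values` and `columns` as plain strings and adding `fill_value=0` are choices made for this sketch, and they give one count column per bank.

```python
import pandas as pd

# Hypothetical toy transactions reusing the column names of the data above.
toy = pd.DataFrame({
    'txn_id': [1, 2, 3, 4, 5],
    'bank_id': [10000, 10000, 10001, 10000, 10001],
    'update_time': pd.to_datetime(['2015-10-06 08:06:53', '2015-10-06 09:00:00',
                                   '2015-10-06 10:00:00', '2015-10-07 01:00:00',
                                   '2015-10-07 02:00:00']),
})

# Bucket rows by calendar day, then count transactions per bank.
daily = pd.pivot_table(toy,
                       index=pd.Grouper(freq='D', key='update_time'),
                       columns='bank_id',
                       values='txn_id',
                       aggfunc='count',
                       fill_value=0)

print(daily)
```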
