## Idioms

These are some neat pandas idioms.

If-then and if-then-else on one column, with assignment to one or more other columns:

In [44]:
import pandas as pd
import numpy as np

In [45]:
df = pd.DataFrame({'AAA':[4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]}) ; df

Unnamed: 0,AAA,BBB,CCC
0,4,10,100
1,5,20,50
2,6,30,-30
3,7,40,-50


### if-then...

An if-then on one column

<font color=Gainsboro> Get the row with index label __0__ </font>

In [46]:
df.loc[0]

AAA      4
BBB     10
CCC    100
Name: 0, dtype: int64

<font color=Gainsboro> Get column __AAA__ </font>

In [47]:
df.AAA

0    4
1    5
2    6
3    7
Name: AAA, dtype: int64

<font color=Gainsboro> Build a __boolean Series__ marking where column __AAA__ is greater than __5__, and assign it to the variable __b__ </font>

In [48]:
b = df.AAA > 5 ; b

0    False
1    False
2     True
3     True
Name: AAA, dtype: bool

<font color=Gainsboro> Filter __df__ with the __boolean Series__ built from column __AAA__ </font>

In [49]:
df.loc[b]

Unnamed: 0,AAA,BBB,CCC
2,6,30,-30
3,7,40,-50


<font color=Gainsboro> Use the __boolean Series__ built from column __AAA__ to return column __BBB__ as a __Series__ </font>

In [50]:
df.loc[b, 'BBB']

2    30
3    40
Name: BBB, dtype: int64

<font color=Gainsboro> Where column __AAA__ is __greater than 5__, set column __BBB__ to __-1__ </font>

In [51]:
df.loc[b, 'BBB'] = -1 ; df

Unnamed: 0,AAA,BBB,CCC
0,4,10,100
1,5,20,50
2,6,-1,-30
3,7,-1,-50


An if-then with assignment to 2 columns:

<font color=Gainsboro> Where column __AAA__ is __greater than or equal to 5__, set columns __BBB__ and __CCC__ to __555__ </font>

In [52]:
df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555 ; df

Unnamed: 0,AAA,BBB,CCC
0,4,10,100
1,5,555,555
2,6,555,555
3,7,555,555


Or use pandas where() after you’ve set up a mask:

<font color=Gainsboro> Build a __mask__ with the same shape as __df__ in which column __AAA__ is all __True__, column __BBB__ is all __False__, and column __CCC__ alternates __True False__ </font>

In [53]:
mask = pd.DataFrame({'AAA' : [True] * len(df), 'BBB' : [False] * len(df), 'CCC' : [True, False] * 2}) ; mask

Unnamed: 0,AAA,BBB,CCC
0,True,False,True
1,True,False,False
2,True,False,True
3,True,False,False


<font color=Gainsboro> Filter __df__ with the mask above, setting every value outside the mask to -1000 </font>

In [54]:
df.where(mask, other=-1000)

Unnamed: 0,AAA,BBB,CCC
0,4,-1000,100
1,5,-1000,-1000
2,6,-1000,555
3,7,-1000,-1000
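
As a cross-check on how the mask is applied, here is a minimal sketch that rebuilds the frame as it stood before the `where` call and compares the result with `DataFrame.mask` used on the inverted mask. The names `kept` and `replaced` are illustrative only.

```python
import pandas as pd

# State of df after the two assignments above
df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 555, 555, 555],
                   'CCC': [100, 555, 555, 555]})
mask = pd.DataFrame({'AAA': [True] * len(df),
                     'BBB': [False] * len(df),
                     'CCC': [True, False] * 2})

# where() keeps values where the mask is True and substitutes `other` elsewhere
kept = df.where(mask, other=-1000)

# mask() is the complement: it replaces values where its condition is True
replaced = df.mask(~mask, other=-1000)

print(kept)
print(replaced)
```

Both calls should print the same frame, so choosing between them is mostly a matter of which condition reads more naturally.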


if-then-else using numpy’s where()

<font color=Gainsboro> Append a column __logic__ to __df__: where column __AAA__ is > 5 the new value is __high__, otherwise __low__ </font>

In [55]:
np.where(df.AAA > 5, 'high', 'low')

array(['low', 'low', 'high', 'high'],
      dtype='<U4')

In [56]:
df['logic'] = np.where(df.AAA > 5, 'high', 'low'); df

Unnamed: 0,AAA,BBB,CCC,logic
0,4,10,100,low
1,5,555,555,low
2,6,555,555,high
3,7,555,555,high
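
`np.where` handles exactly two branches. When more are needed, one option (not used in the original notebook) is `np.select`. The sketch below is an illustration under assumed thresholds; the column name `level` and the label `mid` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7]})

# Two branches: np.where(condition, value_if_true, value_if_false)
df['logic'] = np.where(df.AAA > 5, 'high', 'low')

# Three branches (hypothetical thresholds): conditions are checked in order,
# the first one that matches wins, and `default` covers everything else
conditions = [df.AAA > 6, df.AAA > 4]
choices = ['high', 'mid']
df['level'] = np.select(conditions, choices, default='low')

print(df)
```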


### Splitting

Split a frame with a boolean criterion

In [57]:
df = pd.DataFrame({'AAA':[4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]}) ; df

Unnamed: 0,AAA,BBB,CCC
0,4,10,100
1,5,20,50
2,6,30,-30
3,7,40,-50


<font color=Gainsboro> Split __df__ on the condition __AAA <= 5__ and assign the result to the variable __dflow__ </font>

In [58]:
dflow = df[df.AAA <= 5]; dflow

Unnamed: 0,AAA,BBB,CCC
0,4,10,100
1,5,20,50


<font color=Gainsboro> Split __df__ on the condition __AAA > 5__ and assign the result to the variable __dfhigh__ </font>

In [59]:
dfhigh = df[df.AAA > 5]; dfhigh

Unnamed: 0,AAA,BBB,CCC
2,6,30,-30
3,7,40,-50
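
A split is usually wanted as two complementary frames. As a minimal sketch, the criterion can be built once and negated with `~` for the second half. The name `is_low` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})

# Build the criterion once, then index with it and with its negation
is_low = df.AAA <= 5
dflow = df[is_low]
dfhigh = df[~is_low]

# Together the two pieces cover every original row exactly once
covered = sorted(dflow.index.tolist() + dfhigh.index.tolist())
print(covered == df.index.tolist())
```

Note that the boolean Series on its own is only the criterion; it has to be passed back into `df[...]` to obtain the rows.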

### Building Criteria

Select with multi-column criteria

...and (without assignment returns a Series)

<font color=Gainsboro> Filter __AAA__ on several conditions: __BBB < 25__ & __CCC >= -40__, producing __newseries__ </font>

In [60]:
newseries = df.loc[(df.BBB < 25) & (df.CCC >= -40), 'AAA']; newseries

0    4
1    5
Name: AAA, dtype: int64

...or (without assignment returns a Series)

<font color=Gainsboro> Filter __AAA__ on several conditions: __BBB < 25__ | __CCC >= -40__, producing __newseries__ </font>

In [61]:
newseries = df.loc[(df.BBB < 25) | (df.CCC >= -40), 'AAA']; newseries

0    4
1    5
2    6
Name: AAA, dtype: int64

...or (with assignment modifies the DataFrame.)

<font color=Gainsboro> Select __AAA__ on several conditions: __BBB > 25__ | __CCC >= 75__, and assign __0.1__ </font>

In [62]:
df.loc[(df.BBB > 25) | (df.CCC >= 75), 'AAA'] = 0.1; df

Unnamed: 0,AAA,BBB,CCC
0,0.1,10,100
1,5.0,20,50
2,0.1,30,-30
3,0.1,40,-50


Select rows with data closest to certain value using argsort

<font color=Gainsboro> Order the rows of __df__ by the __absolute value of CCC - 43.0__ </font>

In [63]:
df.loc[(df['CCC'] - 43).abs().argsort()]

Unnamed: 0,AAA,BBB,CCC
1,5.0,20,50
0,0.1,10,100
2,0.1,30,-30
3,0.1,40,-50
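
`argsort()` returns integer positions, so passing its result to `loc` only works here because the frame has the default 0..n-1 index. A hedged sketch of the same idea with `iloc`, using hypothetical string labels to make the difference visible:

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]},
                  index=['w', 'x', 'y', 'z'])  # hypothetical non-default labels

target = 43.0

# argsort() yields integer positions ordered by distance to the target
order = (df['CCC'] - target).abs().argsort()

# Positions belong with iloc; loc would search for the labels 0..3 instead
closest_first = df.iloc[order.to_numpy()]
print(closest_first)
```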


Dynamically reduce a list of criteria using binary operators

<font color=Gainsboro> Create three conditions, __AAA <= 5.5 , BBB == 10.0 , CCC > -40.0__, and assign the resulting __Series__ to the variables __crit1 , crit2 , crit3__ </font>

In [64]:
crit1 = df.AAA <= 5.5
crit2 = df.BBB == 10.0
crit3 = df.CCC > -40.0

One could hard code:

<font color=Gainsboro> __Hard code__ the combination of the three conditions with logical __and__, assigning the resulting __Series__ to the variable __all_crit__ </font>

In [65]:
all_crit = crit1 & crit2 & crit3; all_crit

0     True
1    False
2    False
3    False
dtype: bool

...Or it can be done with a list of dynamically built criteria

<font color=Gainsboro> Collect the three __Series__ conditions above in a __list__ and assign it to the variable __crit_list__ </font>

In [66]:
crit_list = [crit1, crit2, crit3]; crit_list

[0    True
 1    True
 2    True
 3    True
 Name: AAA, dtype: bool, 0     True
 1    False
 2    False
 3    False
 Name: BBB, dtype: bool, 0     True
 1     True
 2     True
 3    False
 Name: CCC, dtype: bool]

<font color=Gainsboro> Import __reduce__ </font>

In [67]:
from functools import reduce

<font color=Gainsboro> Build __criteria__ dynamically from the __list__ of __Series__ above </font>

In [68]:
criteria = reduce(lambda x, y: x & y, crit_list); criteria

0     True
1    False
2    False
3    False
dtype: bool

<font color=Gainsboro> Apply the __criteria__ above to __df__ </font>

In [69]:
df[criteria]

Unnamed: 0,AAA,BBB,CCC
0,0.1,10,100
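
The lambda passed to `reduce` is just the `&` operator, so the standard library's `operator.and_` can stand in for it. The sketch below also shows `np.logical_and.reduce` as an alternative that folds the whole list in one call. The frame is rebuilt with the values it holds at this point, and the name `criteria_np` is illustrative.

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd

df = pd.DataFrame({'AAA': [0.1, 5.0, 0.1, 0.1],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})

crit_list = [df.AAA <= 5.5, df.BBB == 10.0, df.CCC > -40.0]

# operator.and_ is the function form of `&`, so no lambda is needed
criteria = reduce(operator.and_, crit_list)

# NumPy can fold the same list in a single call; the result is a plain array
criteria_np = np.logical_and.reduce(crit_list)

print(df[criteria])
```

Swapping `operator.and_` for `operator.or_` would combine the same list with logical or instead.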


### New content

#### argsort()

``argsort`` returns the indices that would sort the data in ascending order.

In [70]:
x = np.array([3, 1, 2])
np.argsort(x)

array([1, 2, 0], dtype=int64)
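
To make the meaning of those indices concrete, this short sketch uses them to index the array itself. The names `ascending` and `descending` are illustrative.

```python
import numpy as np

x = np.array([3, 1, 2])
order = np.argsort(x)        # positions of the elements in ascending order

ascending = x[order]         # indexing with the positions sorts the array
descending = x[order[::-1]]  # reversing the positions gives descending order

print(order.tolist(), ascending.tolist(), descending.tolist())
```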

#### reduce()

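``reduce`` applies a __binary function__ cumulatively to all the items of a collection (list, tuple, etc.), as follows:

- The function __func__ passed to ``reduce`` must take two arguments.
- It is first applied to items __1 and 2__ of the collection.
- The result is then combined with the __next__ item using __func__, and so on, until a single result remains.
- Since __python 3__, ``reduce`` is no longer a __built-in function__; to use it, run ``from functools import reduce``.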

In [71]:
from functools import reduce

In [72]:
def myadd(x,y):  
    return x+y

# Computes 1+2+3+4+5+6+7, which is 28
total = reduce(myadd, (1, 2, 3, 4, 5, 6, 7)) ; total

28

In [73]:
# A lambda does the same job more concisely:
total = reduce(lambda x, y: x + y, (1, 2, 3, 4, 5, 6, 7)) ; total

28
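
The folding steps that `reduce` performs can be written out as an explicit loop. The helper `fold` below is hypothetical and only mirrors the three-argument form `reduce(func, items, initial)`.

```python
from functools import reduce


def fold(func, items, initial):
    """Hypothetical helper that mirrors reduce(func, items, initial)."""
    result = initial
    for item in items:
        # Combine the running result with the next item
        result = func(result, item)
    return result


numbers = (1, 2, 3, 4, 5, 6, 7)
total_loop = fold(lambda x, y: x + y, numbers, 0)
total_reduce = reduce(lambda x, y: x + y, numbers, 0)

print(total_loop, total_reduce)
```

Passing an initial value also gives `reduce` something to return when the collection is empty.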

#### np.where()

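__np.where()__ accepts either three arguments, in which case it acts as an element-wise ternary operation, or a single argument, in which case it returns the indices where the condition holds.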

In [86]:
a = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
# Return the indices where the condition holds
idx = np.where(a > 2) ; idx

(array([2, 5, 8], dtype=int64),)

In [85]:
# Turn odd numbers into even ones and even numbers into odd ones
y = np.where(a%2 == 0, a+1, a-1) ; y

array([0, 3, 2, 0, 3, 2, 0, 3, 2])
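
A short sketch of how the two forms are typically used together: the one-argument result is a tuple of index arrays (the same thing `np.nonzero` returns) and can index the array directly. The names `matches` and `clipped` are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])

# One argument: a tuple holding one index array per dimension
idx = np.where(a > 2)
same_as_nonzero = np.nonzero(a > 2)

# The tuple can be used directly to pull out the matching elements
matches = a[idx]

# Three arguments: choose element by element between the last two arguments
clipped = np.where(a > 2, 2, a)

print(idx[0].tolist(), matches.tolist(), clipped.tolist())
```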

#### where()

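The signature of DataFrame.where() differs from that of numpy.where(): pandas takes the condition and the replacement, while numpy takes the condition followed by both alternatives.  
``df.where(df < 0, -df)`` yields the same values as ``np.where(df < 0, df, -df)``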

In [93]:
df = pd.DataFrame(np.random.randn(8, 4), index=pd.date_range('1/1/2000', periods=8), columns=['A', 'B', 'C', 'D']) ; df.head(3)

Unnamed: 0,A,B,C,D
2000-01-01,-0.783122,-1.070786,0.32501,-0.046439
2000-01-02,-0.930241,-1.655944,0.892656,0.130739
2000-01-03,-0.579276,-0.463769,0.882423,1.385569


In [92]:
df = df.where(df < 0, other=-df) ; df.head(3)

Unnamed: 0,A,B,C,D
2000-01-01,-0.601673,-0.618839,-0.78484,-1.154798
2000-01-02,-0.022689,-2.149203,-0.844671,-1.004174
2000-01-03,-0.364793,-0.470375,-0.429521,-2.397059
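
The correspondence between the two calls can be checked directly. This is a minimal sketch on seeded random data; the seed and the names `pandas_result` and `numpy_result` are arbitrary choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
df = pd.DataFrame(rng.standard_normal((8, 4)), columns=['A', 'B', 'C', 'D'])

# pandas: keep values where the condition holds, otherwise take `other`
pandas_result = df.where(df < 0, other=-df)

# numpy: the condition comes first, then the "true" and "false" values
numpy_result = np.where(df < 0, df, -df)

same_values = np.array_equal(pandas_result.to_numpy(), numpy_result)
all_non_positive = bool((pandas_result <= 0).all().all())
print(same_values, all_non_positive)
```

The pandas version returns a DataFrame with the original index and columns, while the numpy version returns a bare array.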


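Boolean indexing on a __Series__ drops the entries that fail the condition, whereas __where()__ returns a Series of the same shape with those entries set to NaN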

In [96]:
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
s[s > 0]

3    1
2    2
1    3
0    4
dtype: int64

In [97]:
s.where(s > 0)

4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64
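
The difference between the two results, and the effect of supplying a replacement value, can be summarised in a short sketch. The replacement `-1` and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

filtered = s[s > 0]                # drops the entry that fails the condition
same_shape = s.where(s > 0)        # keeps it as NaN, so the dtype becomes float
filled = s.where(s > 0, other=-1)  # an explicit replacement avoids the NaN

print(len(filtered), len(same_shape), int(same_shape.isna().sum()))
print(filled.tolist())
```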