## Idioms

These are some neat pandas idioms.

if-then/if-then-else on one column, and assignment to another one or more columns:

<font color=Gainsboro> Import __pandas__ and __numpy__

In [48]:
import pandas as pd
import numpy as np

<font color=Gainsboro> Create a __DataFrame__ from {'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]} and assign it to the variable __df__

In [50]:
df = pd.DataFrame({'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]}); df

   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50


### if-then...

An if-then on one column

<font color=Gainsboro> Get the __first row__ as a __Series__

In [58]:
df.iloc[0]

AAA      4
BBB     10
CCC    100
Name: 0, dtype: int64

<font color=Gainsboro> Get the __first column__ as a __Series__

In [56]:
df['AAA']

0    4
1    5
2    6
3    7
Name: AAA, dtype: int64

<font color=Gainsboro> Build a __boolean Series__ that is True where column __AAA__ is greater than __5__, and assign it to the variable __b__

In [57]:
b = df.AAA > 5; b

0    False
1    False
2     True
3     True
Name: AAA, dtype: bool

<font color=Gainsboro> Filter __df__ with the __boolean Series__ built from column __AAA__

In [61]:
df.loc[b]

   AAA  BBB  CCC
2    6   30  -30
3    7   40  -50


<font color=Gainsboro> Use the __boolean Series__ built from column __AAA__ to return the matching rows of column __BBB__ as a __Series__

In [62]:
df.loc[b, 'BBB']

2    30
3    40
Name: BBB, dtype: int64

<font color=Gainsboro> Where column __AAA__ is __greater than 5__, set column __BBB__ to __-1__

In [63]:
df.loc[df.AAA > 5, 'BBB'] = -1; df

   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   -1  -30
3    7   -1  -50


An if-then with assignment to 2 columns:

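<font color=Gainsboro> Where column __AAA__ is __greater than 5__, select columns __BBB CCC__ ; assigning to this selection (for example __= 555__ ) would set both columns at once, but the cell below only selects and leaves __df__ unchanged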

In [64]:
df.loc[df.AAA > 5, ['BBB', 'CCC']]

   BBB  CCC
2   -1  -30
3   -1  -50
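
The cell above stops at the selection. As a minimal sketch of the assignment form, assuming a fresh frame with the hypothetical name `demo` so that the `df` used in the later cells is not modified:

```python
import pandas as pd

# Hypothetical copy of the example data; `demo` is not part of the notebook.
demo = pd.DataFrame({'AAA': [4, 5, 6, 7],
                     'BBB': [10, 20, 30, 40],
                     'CCC': [100, 50, -30, -50]})

# Rows where AAA >= 5 receive 555 in both BBB and CCC in one statement.
demo.loc[demo.AAA >= 5, ['BBB', 'CCC']] = 555
print(demo)
```

Passing a list of column labels as the second `.loc` argument is what lets one scalar be broadcast to several columns at once.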


Or use pandas where after you’ve set up a mask

<font color=Gainsboro> Build a boolean __DataFrame__ named __mask__ with the same shape as __df__ : column __AAA__ all __True__ , column __BBB__ all __False__ , and column __CCC__ alternating __True False__

In [66]:
mask = pd.DataFrame({'AAA': [True] * len(df), 'BBB': [False] * len(df), 'CCC': [True, False] * 2}); mask

    AAA    BBB    CCC
0  True  False   True
1  True  False  False
2  True  False   True
3  True  False  False


<font color=Gainsboro> Apply the mask above to __df__ , replacing every value where the mask is False with -1000

In [68]:
df.where(mask, other=-1000)

   AAA   BBB   CCC
0    4 -1000   100
1    5 -1000 -1000
2    6 -1000   -30
3    7 -1000 -1000
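
`where` keeps values where the mask is True. Its counterpart `DataFrame.mask` replaces values where the condition is True, so the two are interchangeable once the condition is inverted. A small sketch, using the hypothetical names `demo` and `keep` rather than the notebook's variables:

```python
import pandas as pd

demo = pd.DataFrame({'AAA': [4, 5, 6, 7],
                     'BBB': [10, 20, 30, 40],
                     'CCC': [100, 50, -30, -50]})
keep = pd.DataFrame({'AAA': [True] * 4,
                     'BBB': [False] * 4,
                     'CCC': [True, False] * 2})

kept = demo.where(keep, other=-1000)      # keep where True, replace elsewhere
replaced = demo.mask(~keep, other=-1000)  # replace where the inverted mask is True
print(kept.equals(replaced))
```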


if-then-else using numpy’s where()

<font color=Gainsboro> Append a column __logic__ to __df__ : __high__ where column __AAA__ is > 5, otherwise __low__

In [70]:
df['logic'] = np.where(df.AAA > 5, 'high', 'low'); df

   AAA  BBB  CCC logic
0    4   10  100   low
1    5   20   50   low
2    6   -1  -30  high
3    7   -1  -50  high
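
`np.where` covers a two-way branch. When more than two outcomes are needed, `np.select` is one option: it takes a list of conditions and a matching list of choices, and the first condition that holds wins. The thresholds and the `mid` label below are assumptions made up for illustration:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'AAA': [4, 5, 6, 7]})

# Conditions are checked in order; the first True one decides the label.
conditions = [demo.AAA > 6, demo.AAA > 4]
labels = ['high', 'mid']
demo['logic'] = np.select(conditions, labels, default='low')
print(demo)
```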


### Splitting

Split a frame with a boolean criterion

<font color=Gainsboro> Recreate the __DataFrame__ from {'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]} and assign it to the variable __df__

In [71]:
df = pd.DataFrame({'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]}); df

   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50


<font color=Gainsboro> Split __df__ on the condition __AAA <= 5__ and assign the result to the variable __dflow__

In [73]:
dflow = df.loc[df.AAA <= 5]; dflow

   AAA  BBB  CCC
0    4   10  100
1    5   20   50


<font color=Gainsboro> Split __df__ on the condition __AAA > 5__ and assign the result to the variable __dfhigh__

In [74]:
dfhigh = df.loc[df.AAA > 5]; dfhigh

   AAA  BBB  CCC
2    6   30  -30
3    7   40  -50
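
Writing the two conditions by hand (`<= 5` and `> 5`) works, but the halves only stay complementary as long as both are edited together. One possible alternative is to compute the criterion once and negate it with `~`; the names `demo`, `is_high`, `high` and `low` are hypothetical:

```python
import pandas as pd

demo = pd.DataFrame({'AAA': [4, 5, 6, 7],
                     'BBB': [10, 20, 30, 40],
                     'CCC': [100, 50, -30, -50]})

is_high = demo.AAA > 5                       # evaluate the criterion once
high, low = demo[is_high], demo[~is_high]    # ~ flips the boolean Series
print(len(low) + len(high) == len(demo))
```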


### Building Criteria

Select with multi-column criteria

...and (without assignment returns a Series)

<font color=Gainsboro> Select column __AAA__ where both conditions hold, __BBB < 25__ & __CCC >= -40__ , and assign the result to __newseries__

In [75]:
newseries = df.loc[(df.BBB < 25) & (df.CCC >= -40), 'AAA']; newseries

0    4
1    5
Name: AAA, dtype: int64

...or (without assignment returns a Series)

<font color=Gainsboro> Select column __AAA__ where either condition holds, __BBB < 25__ | __CCC >= -40__ , and assign the result to __newseries__

In [76]:
newseries = df.loc[(df.BBB < 25) | (df.CCC >= -40), 'AAA']; newseries

0    4
1    5
2    6
Name: AAA, dtype: int64

...or (with assignment modifies the DataFrame.)

<font color=Gainsboro> Where either condition holds, __BBB > 25__ | __CCC >= 75__ , assign __0.1__ to column __AAA__

In [77]:
df.loc[(df.BBB > 25) | (df.CCC >= 75), 'AAA'] = 0.1; df

   AAA  BBB  CCC
0  0.1   10  100
1  5.0   20   50
2  0.1   30  -30
3  0.1   40  -50


Select rows with data closest to certain value using argsort

<font color=Gainsboro> Reorder the rows of __df__ by the __absolute value of CCC - 43.0__ , closest first

In [78]:
df.loc[(df.CCC - 43.0).abs().argsort()]

   AAA  BBB  CCC
1  5.0   20   50
0  0.1   10  100
2  0.1   30  -30
3  0.1   40  -50
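
`argsort` returns integer positions, while `.loc` looks up labels. The cell above works because the default index labels happen to equal the positions. With any other index, a position-based lookup through `.iloc` is the safer choice. A sketch on a hypothetical frame with string labels:

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0..n-1.
demo = pd.DataFrame({'CCC': [100, 50, -30, -50]}, index=['w', 'x', 'y', 'z'])

order = (demo.CCC - 43.0).abs().argsort()   # positions, not labels
closest_first = demo.iloc[order.to_numpy()]  # position-based selection
print(closest_first)
```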


Dynamically reduce a list of criteria using binary operators

<font color=Gainsboro> Create three conditions, __AAA <= 5.5 , BBB == 10.0 , CCC > -40.0__ , and assign the resulting __Series__ to the variables __crit1 , crit2 , crit3__

In [79]:
crit1 = df.AAA <= 5.5
crit2 = df.BBB == 10.0
crit3 = df.CCC > -40.0

One could hard code:

<font color=Gainsboro> __Hard code__ the logical __and__ of the three conditions and assign the resulting __Series__ to the variable __all_crit__

In [80]:
all_crit = crit1 & crit2 & crit3; all_crit

0     True
1    False
2    False
3    False
dtype: bool

...Or it can be done with a list of dynamically built criteria

<font color=Gainsboro> Collect the three __Series__ conditions above in a __list__ and assign it to the variable __crit_list__

In [83]:
crit_list = [crit1, crit2, crit3]; crit_list

[0    True
 1    True
 2    True
 3    True
 Name: AAA, dtype: bool, 0     True
 1    False
 2    False
 3    False
 Name: BBB, dtype: bool, 0     True
 1     True
 2     True
 3    False
 Name: CCC, dtype: bool]

<font color=Gainsboro> Import __reduce__

In [84]:
from functools import reduce

<font color=Gainsboro> Build the combined __criteria__ dynamically from the __list__ of __Series__ above

In [85]:
allcrit = reduce(lambda x, y: x & y, crit_list); allcrit

0     True
1    False
2    False
3    False
dtype: bool

<font color=Gainsboro> Filter __df__ with the combined __criteria__ above

In [86]:
df[allcrit]

   AAA  BBB  CCC
0  0.1   10  100
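
The lambda passed to `reduce` can be swapped for the functions in the standard `operator` module, which also makes it easy to switch between "all criteria" and "any criterion". A sketch, assuming a hypothetical frame `demo` that mirrors the state of `df` at this point:

```python
from functools import reduce
import operator

import pandas as pd

demo = pd.DataFrame({'AAA': [0.1, 5.0, 0.1, 0.1],
                     'BBB': [10, 20, 30, 40],
                     'CCC': [100, 50, -30, -50]})
criteria = [demo.AAA <= 5.5, demo.BBB == 10.0, demo.CCC > -40.0]

all_of = reduce(operator.and_, criteria)  # same as c1 & c2 & c3
any_of = reduce(operator.or_, criteria)   # same as c1 | c2 | c3
print(demo[all_of])
```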


---

# .loc[]

Type:        property
String form: <property object at 0x076F2EA0>
Docstring:
Purely label-location based indexer for selection by label.  

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'`` (note that contrary
  to usual python slices, **both** the start and the stop are included!).
- A boolean array.
- A ``callable`` function with one argument (the calling Series, DataFrame
  or Panel) and that returns valid output for indexing (one of the above)

``.loc`` will raise a ``KeyError`` when the items are not found.
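
The allowed inputs listed above can be tried on a small Series. This is an illustrative sketch; the Series `s` and the result names are made up for the example:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

single = s.loc['b']                # single label, returns a scalar
several = s.loc[['a', 'c']]        # list of labels
sliced = s.loc['b':'d']            # label slice, both ends included
masked = s.loc[s > 25]             # boolean array
called = s.loc[lambda x: x < 25]   # callable returning a boolean Series

# A label that is not in the index raises KeyError.
try:
    s.loc['z']
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```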

See more at :ref:`Selection by Label <indexing.label>`

---

# argsort()

Signature: np.argsort(a, axis=-1, kind='quicksort', order=None)
Docstring:
Returns the indices that would sort an array.

Perform an indirect sort along the given axis using the algorithm specified
by the `kind` keyword. It returns an array of indices of the same shape as
`a` that index data along the given axis in sorted order.

Parameters
----------
a : array_like
    Array to sort.
axis : int or None, optional
    Axis along which to sort.  The default is -1 (the last axis). If None,
    the flattened array is used.
kind : {'quicksort', 'mergesort', 'heapsort'}, optional
    Sorting algorithm.
order : str or list of str, optional
    When `a` is an array with fields defined, this argument specifies
    which fields to compare first, second, etc.  A single field can
    be specified as a string, and not all fields need be specified,
    but unspecified fields will still be used, in the order in which
    they come up in the dtype, to break ties.

Returns
-------
index_array : ndarray, int
    Array of indices that sort `a` along the specified axis.
    If `a` is one-dimensional, ``a[index_array]`` yields a sorted `a`.

See Also
--------
sort : Describes sorting algorithms used.
lexsort : Indirect stable sort with multiple keys.
ndarray.sort : Inplace sort.
argpartition : Indirect partial sort.

Notes
-----
See `sort` for notes on the different sorting algorithms.

As of NumPy 1.4.0 `argsort` works with real/complex arrays containing
nan values. The enhanced sort order is documented in `sort`.

In [30]:
x = np.array([3, 1, 2])
np.argsort(x)

array([1, 2, 0], dtype=int32)
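
The index array is mostly useful for reordering something else. A short sketch of the `a[index_array]` property from the docstring, plus reordering a second, parallel array; `names` is a hypothetical companion array:

```python
import numpy as np

x = np.array([3, 1, 2])
order = np.argsort(x)

ascending = x[order]          # sorted copy of x
descending = x[order[::-1]]   # reverse the index array for descending order

# The same indices reorder any array aligned with x.
names = np.array(['c', 'a', 'b'])
names_by_x = names[order]
print(ascending, descending, names_by_x)
```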

---

# reduce()

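``reduce`` applies a __binary function__ cumulatively to the items of a collection (list, tuple, etc.), reducing them to a single value:

- The function __func__ passed to ``reduce`` must take two arguments.
- __func__ is first applied to items __1 and 2__ of the collection.
- The result is then combined with the __next__ item using __func__ , and so on until a single result remains.
- Since __Python 3__ , ``reduce`` is no longer a __built-in function__ ; import it with ``from functools import reduce``.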

In [31]:
from functools import reduce

In [32]:
def myadd(x,y):  
    return x+y

# The result is 1+2+3+4+5+6+7, i.e. 28
# (named `total` to avoid shadowing the built-in sum)
total = reduce(myadd, (1, 2, 3, 4, 5, 6, 7)); total

28

In [33]:
# A lambda does the same thing more concisely:
total = reduce(lambda x, y: x + y, (1, 2, 3, 4, 5, 6, 7)); total

28
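
To see the intermediate results that `reduce` folds through, `itertools.accumulate` can be run with the same function; its last element equals what `reduce` returns. The sketch also shows the optional third argument of `reduce`, an initial value, which is what gets returned for an empty collection:

```python
from functools import reduce
from itertools import accumulate
import operator

values = (1, 2, 3, 4, 5, 6, 7)

steps = list(accumulate(values, operator.add))  # running results of each step
total = reduce(operator.add, values)            # only the final result

# With an initial value, reducing an empty collection is well defined.
empty_total = reduce(operator.add, (), 0)
print(steps, total, empty_total)
```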

---

# np.where()

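__np.where()__ accepts either three arguments, in which case it acts as an element-wise ternary expression, or a single argument, in which case it returns the indices where the condition holds.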

In [34]:
a = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
# Return the indices that satisfy the condition
idx = np.where(a > 2) ; idx

(array([2, 5, 8], dtype=int32),)

In [35]:
# Turn odd numbers into even ones and even numbers into odd ones
y = np.where(a%2 == 0, a+1, a-1) ; y

array([0, 3, 2, 0, 3, 2, 0, 3, 2])
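
The single-argument form returns a tuple with one index array per dimension, which is why the 1-D result above is wrapped in a tuple. A sketch on a hypothetical 2-D array `grid`:

```python
import numpy as np

grid = np.array([[1, 5],
                 [7, 2]])

# One array of row indices and one of column indices.
rows, cols = np.where(grid > 4)
picked = grid[rows, cols]   # the values that satisfied the condition

# With a single argument, np.where gives the same result as np.nonzero.
same_rows, same_cols = np.nonzero(grid > 4)
print(rows, cols, picked)
```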

---

# where()

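``DataFrame.where()`` takes its arguments differently from ``numpy.where()``: it is called on the frame whose values are kept, so only the condition and the replacement are passed.  
Roughly, ``df.where(df < 0, -df)`` selects the same values as ``np.where(df < 0, df, -df)``.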

In [36]:
df = pd.DataFrame(np.random.randn(8, 4), index=pd.date_range('1/1/2000', periods=8), columns=['A', 'B', 'C', 'D']) ; df.head(3)

                   A         B         C         D
2000-01-01  1.616178  0.575991 -1.174710  2.292116
2000-01-02  0.392600 -2.356968  0.745577 -1.086068
2000-01-03  1.209799  0.100374 -0.320902  0.578457


In [37]:
df = df.where(df < 0, other=-df) ; df.head(3)

                   A         B         C         D
2000-01-01 -1.616178 -0.575991 -1.174710 -2.292116
2000-01-02 -0.392600 -2.356968 -0.745577 -1.086068
2000-01-03 -1.209799 -0.100374 -0.320902 -0.578457
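
The correspondence between the two calls can be checked on fixed values instead of random ones. In this sketch, `demo` is a hypothetical frame; note that the pandas call returns a DataFrame with its labels intact, while the NumPy call returns a plain array:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1.5, -2.0], 'B': [-0.5, 3.0]})

via_pandas = demo.where(demo < 0, -demo)     # DataFrame, index and columns kept
via_numpy = np.where(demo < 0, demo, -demo)  # ndarray, labels dropped

print(np.array_equal(via_pandas.to_numpy(), via_numpy))
```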


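Boolean indexing on a __Series__ , as in the cell below, returns only the matching subset as a new Series; __where()__ instead returns a Series with the same shape as the original.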

In [38]:
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
s[s > 0]

3    1
2    2
1    3
0    4
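
For comparison, a sketch of the same Series passed through `where()`. The result keeps all five labels, and the entry that fails the condition becomes NaN, which also turns the dtype into float:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

subset = s[s > 0]            # boolean indexing drops the non-matching row
same_shape = s.where(s > 0)  # where() keeps the row and fills it with NaN
print(same_shape)
```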
dtype: int64