# Pandas cookbook notes


The official pandas cookbook is [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook).

In [30]:
import pandas as pd
import numpy as np

## Sort rows based on closeness to certain value

In [2]:
df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
df

|   | AAA | BBB | CCC |
|---|-----|-----|-----|
| 0 | 4 | 10 | 100 |
| 1 | 5 | 20 | 50 |
| 2 | 6 | 30 | -30 |
| 3 | 7 | 40 | -50 |


To sort the rows in order of closeness to myval, do this:

In [15]:
myval = 35
df.loc[(df.CCC - myval).abs().argsort()]

|   | AAA | BBB | CCC |
|---|-----|-----|-----|
| 1 | 5 | 20 | 50 |
| 0 | 4 | 10 | 100 |
| 2 | 6 | 30 | -30 |
| 3 | 7 | 40 | -50 |
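
One thing worth keeping in mind: argsort returns integer positions, so passing its result to loc only works here because the frame has the default 0..n-1 index. Below is a minimal sketch of two variants that should also work with other indexes. The string index and the names df2, distance, by_position and by_label are made up for illustration.

```python
import pandas as pd

# Same data as above, but with a hypothetical string index
df2 = pd.DataFrame({'AAA': [4, 5, 6, 7],
                    'BBB': [10, 20, 30, 40],
                    'CCC': [100, 50, -30, -50]},
                   index=['w', 'x', 'y', 'z'])

myval = 34
distance = (df2.CCC - myval).abs()

# argsort yields positions, so pair it with positional indexing (iloc)
by_position = df2.iloc[distance.argsort().to_numpy()]

# Alternative that stays in label space: sort the distances
# and reuse the resulting index with loc
by_label = df2.loc[distance.sort_values().index]
```

Both variants give the same row order; the second is arguably easier to read because it never mixes positions and labels.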


A note of caution on argsort:

In [26]:
myval = 35
a = (df.CCC - myval).abs()
b = a.argsort()
a, b

(0    65
 1    15
 2    65
 3    85
 Name: CCC, dtype: int64,
 0    1
 1    0
 2    2
 3    3
 Name: CCC, dtype: int64)

In [27]:
myval = 34
a = (df.CCC - myval).abs()
b = a.argsort()
a, b

(0    66
 1    16
 2    64
 3    84
 Name: CCC, dtype: int64,
 0    1
 1    2
 2    0
 3    3
 Name: CCC, dtype: int64)

I tripped up here because I expected argsort to return a series containing the rank of each value in the original series. In the first case the two happen to coincide, so the result looked right; the second case then got me really confused. [This](https://github.com/numpy/numpy/issues/8757#issuecomment-355126992) post provided the explanation: argsort doesn't return a series of ranks but a series b of positions such that a[b] is a sorted version of a. Hence, the first value in b tells us that the 0th element of the sorted series a[b] is element 1 of the original series a.
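
To make the difference between sort order and rank concrete, here is a small sketch using the myval = 34 case. The variable names (ccc, order, ranks, sorted_a) are my own, and using rank with method='first' minus one is just one way to get zero-based ranks.

```python
import pandas as pd

ccc = pd.Series([100, 50, -30, -50])
a = (ccc - 34).abs()                  # 66, 16, 64, 84

# Positions that would sort a: "which element goes first, second, ..."
order = a.argsort()

# Zero-based rank of each element: "where does this element end up"
ranks = a.rank(method='first').astype(int) - 1

# Reordering a by the argsort result gives the sorted values
sorted_a = a.iloc[order.to_numpy()]

print(order.tolist())     # [1, 2, 0, 3]
print(ranks.tolist())     # [2, 0, 1, 3]
print(sorted_a.tolist())  # [16, 64, 66, 84]
```

Applying argsort to the result of argsort turns the sort order into ranks, so order.argsort() gives the same values as ranks here.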

## Compound boolean selection

Careful when using compound boolean conditions; it took me a moment to figure out why the result below (from the cookbook) is correct.

In [37]:
df

|   | AAA | BBB | CCC |
|---|-----|-----|-----|
| 0 | 4 | 10 | 100 |
| 1 | 5 | 20 | 50 |
| 2 | 6 | 30 | -30 |
| 3 | 7 | 40 | -50 |


In [38]:
df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]

|   | AAA | BBB | CCC |
|---|-----|-----|-----|
| 1 | 5 | 20 | 50 |
| 3 | 7 | 40 | -50 |


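What confused me was that row 1 is kept even though its AAA value, 5, is smaller than 6. The key thing to remember is that not (a & b) equals (not a) | (not b). A row is dropped only if both conditions hold, and row 1 fails the second one because its index is not in [0, 2, 4].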
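
A quick way to convince yourself of this equivalence (De Morgan's law) is to build both masks and compare them. This is only a sketch; the names small, picked, negated_and and or_of_negations are mine.

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})

small = df.AAA <= 6                  # True, True, True, False
picked = df.index.isin([0, 2, 4])    # True, False, True, False

# The mask from the cookbook example: not (a & b)
negated_and = ~(small & picked)

# The rewritten form: (not a) | (not b)
or_of_negations = ~small | ~picked

print(negated_and.tolist())                    # [False, True, False, True]
print((negated_and == or_of_negations).all())  # True
```

Row 1 survives because picked is False for it, regardless of the AAA condition.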

## Creating new columns based on existing ones using mappings

The following is a straightforward adaptation from the [cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#new-columns):

In [111]:
df = pd.DataFrame({'AAA': [1, 2, 1, 3],
                   'BBB': [1, 1, 4, 2],
                   'CCC': [2, 1, 3, 1]})

source_cols = ['AAA', 'BBB']
new_cols = [str(c) + '_cat' for c in source_cols]
cats = {1: 'One', 2: 'Two', 3: 'Three'}

dd = df.copy()
dd[new_cols] = df[source_cols].applymap(cats.get)
dd

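|   | AAA | BBB | CCC | AAA_cat | BBB_cat |
|---|-----|-----|-----|---------|---------|
| 0 | 1 | 1 | 2 | One | One |
| 1 | 2 | 1 | 1 | Two | One |
| 2 | 1 | 4 | 3 | One | None |
| 3 | 3 | 2 | 1 | Three | Two |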


But it made me wonder why applymap required the use of the get method while we can map values of a series like so:

In [100]:
s = pd.Series([1, 2, 3, 1])
s.map(cats)

0      One
1      Two
2    Three
3      One
dtype: object

or like so:

In [101]:
s.map(cats.get)

0      One
1      Two
2    Three
3      One
dtype: object

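The answer is simple: applymap requires a function as argument, while map takes functions or mappings. (As of pandas 2.1, DataFrame.applymap is deprecated in favour of DataFrame.map, which likewise takes a function.)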

One limitation of the cookbook solution above is that it doesn't seem to allow for default values (notice that 4 gets mapped to None).

One way around this is the following:

In [110]:
df[new_cols] = df[source_cols].applymap(lambda x: cats.get(x, 'Hello'))
df

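|   | AAA | BBB | CCC | AAA_cat | BBB_cat |
|---|-----|-----|-----|---------|---------|
| 0 | 1 | 1 | 2 | One | One |
| 1 | 2 | 1 | 1 | Two | One |
| 2 | 1 | 4 | 3 | One | Hello |
| 3 | 3 | 2 | 1 | Three | Two |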
