# Fluent Pandas

Various data exercises for regular practice.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

# Explore

## Create a table that shows the number of planets discovered by each method in each decade

In [16]:
df = sns.load_dataset('planets')
print(df.shape)
df.head(3)

(1035, 6)


|   | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.3 | 7.1 | 77.4 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.0 | 2.6 | 19.84 | 2011 |


In [29]:
decades = df.year // 10 * 10
decades = decades.astype(str) + 's'
decades.name = 'decade'

In [26]:
# using pivot table
df.pivot_table('number', columns=decades, index='method', aggfunc='sum').fillna(0)

| method / decade | 1980s | 1990s | 2000s | 2010s |
|---|---|---|---|---|
| Astrometry | 0.0 | 0.0 | 0.0 | 2.0 |
| Eclipse Timing Variations | 0.0 | 0.0 | 5.0 | 10.0 |
| Imaging | 0.0 | 0.0 | 29.0 | 21.0 |
| Microlensing | 0.0 | 0.0 | 12.0 | 15.0 |
| Orbital Brightness Modulation | 0.0 | 0.0 | 0.0 | 5.0 |
| Pulsar Timing | 0.0 | 9.0 | 1.0 | 1.0 |
| Pulsation Timing Variations | 0.0 | 0.0 | 1.0 | 0.0 |
| Radial Velocity | 1.0 | 52.0 | 475.0 | 424.0 |
| Transit | 0.0 | 0.0 | 64.0 | 712.0 |
| Transit Timing Variations | 0.0 | 0.0 | 0.0 | 9.0 |


In [28]:
# using groupby
df.groupby(['method', decades]).number.sum().unstack().fillna(0)

| method / decade | 1980s | 1990s | 2000s | 2010s |
|---|---|---|---|---|
| Astrometry | 0.0 | 0.0 | 0.0 | 2.0 |
| Eclipse Timing Variations | 0.0 | 0.0 | 5.0 | 10.0 |
| Imaging | 0.0 | 0.0 | 29.0 | 21.0 |
| Microlensing | 0.0 | 0.0 | 12.0 | 15.0 |
| Orbital Brightness Modulation | 0.0 | 0.0 | 0.0 | 5.0 |
| Pulsar Timing | 0.0 | 9.0 | 1.0 | 1.0 |
| Pulsation Timing Variations | 0.0 | 0.0 | 1.0 | 0.0 |
| Radial Velocity | 1.0 | 52.0 | 475.0 | 424.0 |
| Transit | 0.0 | 0.0 | 64.0 | 712.0 |
| Transit Timing Variations | 0.0 | 0.0 | 0.0 | 9.0 |
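
A third route to the same kind of table is `pd.crosstab`. The sketch below is only an illustration: it uses a small hand-made frame (the rows in `toy` are invented, not taken from the planets dataset) so that it runs without downloading anything, and it assumes the same `method`, `number`, and `year` columns as above.

```python
import pandas as pd

# Toy rows, invented for illustration; not the real planets data.
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging', 'Radial Velocity', 'Radial Velocity'],
    'number': [1, 2, 1, 3, 1],
    'year':   [2004, 2011, 2008, 1995, 2012],
})

# Same decade labels as above: 1995 -> '1990s', 2011 -> '2010s'.
decade = (toy.year // 10 * 10).astype(str) + 's'
decade.name = 'decade'

# Sum `number` per (method, decade); combinations with no rows become 0.
table = pd.crosstab(toy.method, decade, values=toy.number, aggfunc='sum').fillna(0)
print(table)
```

Like the pivot-table and groupby versions, this leaves missing method/decade combinations as `NaN` until `fillna(0)` is applied, which is why the counts come out as floats.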


# Sort and filter

## Sort by closeness of column `c` to myval

In [47]:
df = pd.DataFrame({'a': [4, 5, 6, 7],
                   'b': [10, 20, 30, 40],
                   'c': [100, 50, -30, -50]})
df

|   | a | b | c |
|---|---|---|---|
| 0 | 4 | 10 | 100 |
| 1 | 5 | 20 | 50 |
| 2 | 6 | 30 | -30 |
| 3 | 7 | 40 | -50 |


In [54]:
myval = 34

# argsort returns integer positions, so select rows with iloc
df.iloc[(df.c - myval).abs().argsort()]

|   | a | b | c |
|---|---|---|---|
| 1 | 5 | 20 | 50 |
| 2 | 6 | 30 | -30 |
| 0 | 4 | 10 | 100 |
| 3 | 7 | 40 | -50 |


Reminder of what happens here:

In [59]:
a = (df.c - myval).abs()  # distance of each value in c from myval
b = a.argsort()           # positions that would sort a in ascending order
a, b

(0    66
 1    16
 2    64
 3    84
 Name: c, dtype: int64,
 0    1
 1    2
 2    0
 3    3
 Name: c, dtype: int64)

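`argsort` returns a Series of integer positions: the first element of `b` is the position of the smallest element in `a`, the second is the position of the next smallest, and so on. Selecting rows in that order (`df.iloc[b]`, or equivalently `df.loc[b]` here, because the index is the default `0..n-1`) therefore returns the dataframe sorted by distance from `myval`.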
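
One caveat is worth sketching: because `argsort` yields positions rather than index labels, the two only coincide when the frame has the default integer index. The example below rebuilds the same data with a hypothetical string index (`'w'` to `'z'`, made up for illustration) and selects rows by position with `iloc`. As an alternative that avoids positions altogether, it also shows `sort_values` with a `key` function, which assumes pandas 1.1 or later.

```python
import pandas as pd

# Same data as above, but with a made-up, non-default index.
df2 = pd.DataFrame({'a': [4, 5, 6, 7],
                    'b': [10, 20, 30, 40],
                    'c': [100, 50, -30, -50]},
                   index=['w', 'x', 'y', 'z'])
myval = 34

# Positions that would sort the distances |c - myval| in ascending order.
order = (df2.c - myval).abs().argsort()

# argsort yields positions, so select rows with iloc rather than loc.
by_position = df2.iloc[order.to_numpy()]

# Alternative: let sort_values compute the distance through `key`.
by_key = df2.sort_values('c', key=lambda s: (s - myval).abs())

print(by_position)
```

With this index, `df2.loc[order]` would fail, since `1, 2, 0, 3` are not labels in `df2`; `iloc` and the `key`-based sort both give the rows in the order `x, y, w, z`.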
