# Pandas essentials

- hide: false
- toc: true
- comments: true
- categories: [python, pandas]

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

Load a sample dataset

In [2]:
cols = {
    'user_id': 'user',
    'transaction_date': 'date',
    'amount': 'amount',
    'transaction_description': 'desc',
    'merchant_name': 'merchant',
    'gender': 'gender',
    'year_of_birth': 'yob',
    'salary_range': 'salary',
}

def randomise_date(series):
    """Add noise to years for additional anonymisation."""
    # drop 29 Feb, which may not exist in the shifted year
    series = series[~(series.dt.month.eq(2) & series.dt.day.eq(29))]
    return pd.to_datetime({
        'year': series.dt.year - np.random.randint(0, 5, size=len(series)),
        'month': series.dt.month,
        'day': series.dt.day
    })

fp = './data/sample.parquet'
df = pd.read_parquet(fp, columns=list(cols)).rename(columns=cols)
df['date'] = randomise_date(df.date)
print(df.shape)
df.head(3)

(157287, 8)


Unnamed: 0,user,date,amount,desc,merchant,gender,yob,salary
0,777,2010-01-03,3.03,aviva pa - d/d,aviva,m,1969.0,20k to 30k
1,777,2010-01-03,6.68,"9572 31dec11 , tesco stores 3345 , warrington ...",tesco,m,1969.0,20k to 30k
2,777,2012-01-03,10.27,"9572 30dec11 , mcdonalds , restaurant , winwic...",mcdonalds,m,1969.0,20k to 30k


# Time series

## `groupby` vs `resample`

`resample` creates a row for every period in the date range, including periods with no observations, while `groupby` only returns the groups that are present in the data.

In [134]:
idx = pd.date_range('2020', freq='2d', periods=3)
data = pd.DataFrame({'col': range(len(idx))}, index=idx)
data

Unnamed: 0,col
2020-01-01,0
2020-01-03,1
2020-01-05,2


In [135]:
data.resample('d').sum()

Unnamed: 0,col
2020-01-01,0
2020-01-02,0
2020-01-03,1
2020-01-04,0
2020-01-05,2


In [139]:
data.groupby(level=0).sum()

Unnamed: 0,col
2020-01-01,0
2020-01-03,1
2020-01-05,2
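
To make the `groupby` result line up with the `resample` one, the missing days can be added back explicitly. The following is a minimal sketch using the same toy data as above; `grouped`, `filled` and `resampled` are names I introduce here for illustration only.

```python
import pandas as pd

# Three observations, two days apart (same toy data as above)
idx = pd.date_range('2020', freq='2D', periods=3)
data = pd.DataFrame({'col': range(len(idx))}, index=idx)

# groupby only returns the dates that are present in the data
grouped = data.groupby(level=0).sum()

# asfreq adds the missing days; fill_value=0 mimics the sum of an empty bin
filled = grouped.asfreq('D', fill_value=0)

# resample creates a bin for every day in the range
resampled = data.resample('D').sum()

print(filled['col'].tolist())
print(resampled['col'].tolist())
```

Both printed lists should contain five values, one per calendar day.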


## Timedeltas

In [181]:
df.date.max()

Timestamp('2020-07-31 00:00:00')

In [182]:
d = df.date.max() - df.date.min()
print(d)
d.days

4585 days 00:00:00


4585
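
Note that `days` only returns the whole-day component of a timedelta. A small sketch of other ways to express the same duration, using two hypothetical timestamps chosen for illustration (they are not taken from the dataset):

```python
import pandas as pd

# Hypothetical timestamps, chosen only for illustration
start = pd.Timestamp('2020-01-01 12:00')
end = pd.Timestamp('2020-01-10')
d = end - start                      # Timedelta of 8 days 12 hours

whole_days = d.days                  # whole days only, drops the 12 hours
seconds = d.total_seconds()          # full duration in seconds
hours = d / pd.Timedelta(hours=1)    # full duration in hours, as a float

print(d, whole_days, seconds, hours)
```

Dividing by another `Timedelta` is handy whenever the unit needed has no dedicated attribute.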

## Date offsets

Subtracting one period from another returns a [date offset](https://pandas.pydata.org/docs/reference/offset_frequency.html), whose `n` attribute holds the number of periods between the two.

In [197]:
d = df.date.max().to_period('M') - df.date.min().to_period('M')
print(d)
print(type(d))
d.n

<150 * MonthEnds>
<class 'pandas._libs.tslibs.offsets.MonthEnd'>


150
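
The same idea in a self-contained form. This is a sketch with hypothetical periods picked for illustration; `first`, `last` and `roundtrip` are my own names, and the last line shows an offset being added to a timestamp.

```python
import pandas as pd

# Hypothetical monthly periods, chosen only for illustration
first = pd.Period('2008-01', freq='M')
last = pd.Period('2020-07', freq='M')

diff = last - first             # a MonthEnd offset
n_months = diff.n               # number of months as a plain integer
roundtrip = first + n_months    # adding an integer shifts a period by that many months

# Offsets can also be added to timestamps: roll forward to the month end
month_end = pd.Timestamp('2020-01-15') + pd.offsets.MonthEnd(1)

print(n_months, roundtrip, month_end)
```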

# Aggregate

## `count` vs `size`

In [56]:
g = df.groupby('user')

# number of rows per group as a series
display(g.size().head(3))

# non-missing observations per group for each variable
g.count().head(3)

user
777      6302
14777    2101
20777    6452
dtype: int64

user,date,amount,desc,merchant,gender,yob,salary
777,6291,6302,6302,6302,6302,6302,6302
14777,2101,2101,2101,2101,2101,2101,0
20777,6443,6452,6452,6452,6452,6452,0
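
The difference is easiest to see on a tiny frame with a missing value. A minimal sketch, where `toy` is a made-up dataframe and not part of the sample data:

```python
import numpy as np
import pandas as pd

# Made-up data: user 1 has one missing amount
toy = pd.DataFrame({
    'user': [1, 1, 2],
    'amount': [1.0, np.nan, 3.0],
})
g = toy.groupby('user')

sizes = g.size()      # rows per group, missing values included
counts = g.count()    # non-missing values per group and column

print(sizes.loc[1], counts.loc[1, 'amount'])
```

For user 1, `size` reports two rows while `count` reports a single non-missing amount.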


# Filter

Different approaches to filtering data, in decreasing order of preference:

In [42]:
cutoff = 30_000
a = df.loc[df.amount > cutoff]
b = df.query('amount > @cutoff')
c = df[df.amount > cutoff]
# compare the frames themselves; all(a == b) would only iterate over column names
a.equals(b) and b.equals(c)

True
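
The three approaches also work with combined conditions. A sketch on a made-up frame (`toy` and its values are assumptions for illustration): boolean masks need parentheses and `&`, while `query` accepts plain `and`.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    'amount': [5.0, 20.0, 35.0],
    'merchant': ['tesco', 'aviva', 'tesco'],
})
cutoff = 10

# Boolean masks: wrap each condition in parentheses and combine with &
a = toy.loc[(toy.amount > cutoff) & (toy.merchant == 'tesco')]

# query: plain `and`, with @ to reference a local variable
b = toy.query("amount > @cutoff and merchant == 'tesco'")

c = toy[(toy.amount > cutoff) & (toy.merchant == 'tesco')]

same = a.equals(b) and b.equals(c)
print(same, a.index.tolist())
```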

# Categories

## Manual sort order

In [44]:
df = pd.DataFrame({
    'id':[1, 2, 3, 4, 5],
    'quality': ['good', 'excellent', 'very good', 'excellent', 'good']
})
df.sort_values('quality')

Unnamed: 0,id,quality
1,2,excellent
3,4,excellent
0,1,good
4,5,good
2,3,very good


In [43]:
from pandas.api.types import CategoricalDtype
quality_cat = CategoricalDtype(['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df.quality.astype(quality_cat)
df.sort_values('quality')

Unnamed: 0,id,quality
0,1,good
4,5,good
2,3,very good
1,2,excellent
3,4,excellent
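
An ordered categorical is useful beyond sorting, since it also supports comparisons and `min`/`max`. A short sketch reusing the categories from above; `quality` and the other variable names are introduced here for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quality_cat = CategoricalDtype(['good', 'very good', 'excellent'], ordered=True)
quality = pd.Series(
    ['good', 'excellent', 'very good', 'excellent', 'good']
).astype(quality_cat)

# Comparisons follow the category order, not the alphabet
at_least_very_good = quality >= 'very good'

# Integer position of each value in the category order
codes = quality.cat.codes.tolist()

best = quality.max()
print(at_least_very_good.tolist(), codes, best)
```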


# Dates and times

## Parsing string dates

Using `dateutil`

In [19]:
from dateutil.parser import parse
date = '1 Nov 2020'
print(parse(date))
parse(date).month

2020-11-01 00:00:00


11

Using `pandas`

In [21]:
print(pd.Timestamp(date))
pd.Timestamp(date).month

2020-11-01 00:00:00


11
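
For a whole column of strings, `pd.to_datetime` does the same job in one go. A minimal sketch with made-up dates; passing an explicit `format` is optional, and the one used here assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Made-up date strings for illustration
raw = pd.Series(['1 Nov 2020', '15 Dec 2020'])

# Explicit format: day, abbreviated month name, four-digit year
parsed = pd.to_datetime(raw, format='%d %b %Y')
months = parsed.dt.month.tolist()

print(parsed.tolist(), months)
```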

## Date and period ranges

In [61]:
# create a quarterly period index and convert it to daily frequency
idx = pd.period_range('2018-1', '2019-1', freq='Q-DEC')
s = pd.Series(np.random.randn(len(idx)), index=idx)
print(s)
s.asfreq('d', how='start')

2018Q1   -0.210888
2018Q2    0.217048
2018Q3    0.093228
2018Q4   -0.280792
2019Q1   -1.017585
Freq: Q-DEC, dtype: float64


2018-01-01   -0.210888
2018-04-01    0.217048
2018-07-01    0.093228
2018-10-01   -0.280792
2019-01-01   -1.017585
Freq: D, dtype: float64
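
The same conversions work on single periods and timestamps, which makes them easier to inspect. A small sketch; the dates are arbitrary examples.

```python
import pandas as pd

q = pd.Period('2018Q1', freq='Q-DEC')

first_day = q.to_timestamp(how='start')   # first day of the quarter as a timestamp
last_month = q.asfreq('M', how='end')     # last month of the quarter as a monthly period

# And back: the quarter that contains an arbitrary timestamp
back = pd.Timestamp('2018-05-17').to_period('Q-DEC')

print(first_day, last_month, back)
```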

In [35]:
# create 100-day series and resample to monthly
idx = pd.date_range('2000', periods=100)
s = pd.Series(np.random.randn(len(idx)), index=idx)
s.resample('M', kind='period').mean()

2000-01   -0.129504
2000-02   -0.040099
2000-03    0.210304
2000-04   -0.038681
Freq: M, dtype: float64
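
As far as I know, recent pandas versions (2.2 and later) deprecate the `kind` argument of `resample`. One alternative that should give the same period-indexed result is to convert the index to periods first and then group. A sketch under that assumption, with deterministic values so the means are easy to check:

```python
import pandas as pd

# 100 daily observations with values 0..99, starting 2000-01-01
idx = pd.date_range('2000', periods=100)
s = pd.Series(range(100), index=idx, dtype='float64')

# Convert the index to monthly periods, then average within each period
monthly = s.to_period('M').groupby(level=0).mean()

print(monthly)
```

January 2000 holds the values 0 to 30, so its mean is 15.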

In [60]:
# create hourly series, convert to daily open-high-low-close
idx = pd.date_range('2000', freq='H', periods=100)
s = pd.Series(np.random.randn(len(idx)), index=idx)
s.resample('d').ohlc()

Unnamed: 0,open,high,low,close
2000-01-01,-0.14835,2.901749,-2.153478,-0.657941
2000-01-02,-0.964828,1.569833,-1.415382,0.3997
2000-01-03,-0.545781,1.263261,-1.940718,-1.940718
2000-01-04,-0.406149,1.658944,-1.393457,-0.656099
2000-01-05,-1.839502,0.957588,-1.839502,-0.09254


# Grouping

Create a dictionary of dataframes, one for each group (here, one for each species):

In [10]:
df = sns.load_dataset('iris')
pieces = dict(list(df.groupby('species')))
pieces['setosa'].head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
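
Iterating over a `groupby` object yields `(name, group)` pairs, which is why `dict(list(...))` works. A self-contained sketch on a made-up frame (it avoids downloading the iris data), with a dict comprehension as an equivalent spelling:

```python
import pandas as pd

# Made-up stand-in for the iris data
toy = pd.DataFrame({
    'species': ['setosa', 'virginica', 'setosa'],
    'petal_length': [1.4, 6.0, 1.3],
})

# Each iteration yields the group key and the matching sub-frame
pieces = {name: group for name, group in toy.groupby('species')}

# For a single group, get_group avoids building the whole dictionary
setosa = toy.groupby('species').get_group('setosa')

print(sorted(pieces), len(setosa))
```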


# Mappings

## `apply` vs `map` vs `applymap`

- `apply` applies a function along an axis of a dataframe or to the values of a series
- `map` applies a correspondence (a function, dictionary, or series) to each value in a series
- `applymap` applies a function to each element of a dataframe
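
A compact side-by-side sketch of the three on a made-up frame. One assumption to flag: as far as I know, pandas 2.1 and later deprecate `applymap` in favour of `DataFrame.map`, so the code below picks whichever is available.

```python
import pandas as pd

# Made-up data for illustration
frame = pd.DataFrame({'a': [1, 2], 'b': [3, 5]})

# apply: the function receives a whole column (axis=0 is the default)
col_range = frame.apply(lambda col: col.max() - col.min())

# map: each value of a series is looked up in a dictionary
labels = frame['a'].map({1: 'one', 2: 'two'})

# Element-wise function on a dataframe: DataFrame.map if present, else applymap
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
scaled = elementwise(lambda x: x * 10)

print(col_range.tolist(), labels.tolist(), scaled.values.tolist())
```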

In [97]:
data = df.loc[:2, ['gender', 'merchant']]
gender = {'m': 'male', 'f': 'female'}
data

Unnamed: 0,gender,merchant
0,m,aviva
1,m,tesco
2,m,mcdonalds


In [99]:
data.apply(lambda x: x.map(gender))

Unnamed: 0,gender,merchant
0,male,
1,male,
2,male,


In [101]:
data.gender.map(gender)

0    male
1    male
2    male
Name: gender, dtype: object

In [106]:
data.applymap(gender.get)

Unnamed: 0,gender,merchant
0,male,
1,male,
2,male,


`get` turns a dictionary into a function that takes a key and returns the corresponding value if the key is in the dictionary and a default value (`None` unless specified otherwise) if it is not.
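
In plain Python, without any pandas involved, this looks as follows (`lookup` is just a name I use for the bound method):

```python
gender = {'m': 'male', 'f': 'female'}

# The bound method behaves like an ordinary one-argument function
lookup = gender.get

known = lookup('m')                         # key exists
missing = lookup('x')                       # key missing, default is None
with_default = gender.get('x', 'unknown')   # key missing, explicit default

print(known, missing, with_default)
```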

## Creating new columns based on existing ones using mappings

The example below is a straightforward adaptation from the [cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#new-columns):

In [111]:
df = pd.DataFrame({'AAA': [1, 2, 1, 3],
                   'BBB': [1, 1, 4, 2],
                   'CCC': [2, 1, 3, 1]})

source_cols = ['AAA', 'BBB']
new_cols = [str(c) + '_cat' for c in source_cols]
cats = {1: 'One', 2: 'Two', 3: 'Three'}

dd = df.copy()
dd[new_cols] = df[source_cols].applymap(cats.get)
dd

Unnamed: 0,AAA,BBB,CCC,AAA_cat,BBB_cat
0,1,1,2,One,One
1,2,1,1,Two,One
2,1,4,3,One,
3,3,2,1,Three,Two


This made me wonder why `applymap` requires the `get` method while we can map the values of a series like so:

In [100]:
s = pd.Series([1, 2, 3, 1])
s.map(cats)

0      One
1      Two
2    Three
3      One
dtype: object

or so

In [101]:
s.map(cats.get)

0      One
1      Two
2    Three
3      One
dtype: object

The answer is simple: `applymap` requires a function as its argument, while `map` accepts either a function or a mapping.

One limitation of the cookbook solution above is that it doesn't seem to allow for default values (notice that 4 gets substituted with "None").

One way around this is the following:

In [110]:
df[new_cols] = df[source_cols].applymap(lambda x: cats.get(x, 'Hello'))
df

Unnamed: 0,AAA,BBB,CCC,AAA_cat,BBB_cat
0,1,1,2,One,One
1,2,1,1,Two,One
2,1,4,3,One,Hello
3,3,2,1,Three,Two
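
For a single series there are two further options that I'm aware of. `Series.map` uses the default of a dict subclass that defines `__missing__`, such as `defaultdict`, and missing results can also be filled afterwards. A minimal sketch; the series `s` is made up for illustration.

```python
from collections import defaultdict

import pandas as pd

cats = {1: 'One', 2: 'Two', 3: 'Three'}
s = pd.Series([1, 2, 4])   # 4 has no entry in cats

# Plain dictionary: unmapped values become NaN
plain = s.map(cats)

# defaultdict: map uses the default for unmapped values
with_default = s.map(defaultdict(lambda: 'Hello', cats))

# Alternative: map first, then fill the gaps
filled = s.map(cats).fillna('Hello')

print(with_default.tolist(), filled.tolist())
```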


# Sources
- [Python for Data Analysis](https://www.oreilly.com/library/view/python-for-data/9781491957653/)
- [Python Data Science Handbook](https://www.oreilly.com/library/view/python-data-science/9781491912126/) (PDSH)
- [Pandas cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html)
