# Pandas essentials

- hide: true
- toc: true
- comments: true
- categories: [python, pandas]

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns

Load a sample dataset

In [137]:
cols = {
    'user_id': 'user',
    'transaction_date': 'date',
    'amount': 'amount',
    'transaction_description': 'desc',
    'merchant_name': 'merchant',
    'gender': 'gender',
    'year_of_birth': 'yob',
    'salary_range': 'salary',
}

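def randomise_date(series):
    """Add noise to years for additional anonymisation."""
    # Drop 29 Feb: shifting the year could land on a non-leap year and
    # produce an invalid date. Dropped rows become NaT on assignment below.
    series = series[~(series.dt.month.eq(2) & series.dt.day.eq(29))]
    return pd.to_datetime({
        # subtract a random number of years between 0 and 4
        'year': series.dt.year - np.random.randint(0, 5, size=len(series)),
        'month': series.dt.month,
        'day': series.dt.day
    })

fp = './data/sample.parquet'
# read only the columns of interest, then shorten their names
df = pd.read_parquet(fp, columns=list(cols)).rename(columns=cols)
df['date'] = randomise_date(df.date)
print(df.shape)
df.head(3)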

(157287, 8)


,user,date,amount,desc,merchant,gender,yob,salary
0,777,2012-01-03,3.03,aviva pa - d/d,aviva,m,1969.0,20k to 30k
1,777,2009-01-03,6.68,"9572 31dec11 , tesco stores 3345 , warrington ...",tesco,m,1969.0,20k to 30k
2,777,2011-01-03,10.27,"9572 30dec11 , mcdonalds , restaurant , winwic...",mcdonalds,m,1969.0,20k to 30k
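A side effect of `randomise_date` worth seeing in isolation: it drops 29 February before rebuilding the dates, so assigning the result back relies on index alignment and leaves those rows as `NaT`. A minimal sketch, assuming a hypothetical three-row series and a fixed one-year shift in place of the random noise:

```python
import pandas as pd

# Hypothetical dates, including one leap day (not taken from the dataset)
dates = pd.Series(pd.to_datetime(['2012-02-28', '2012-02-29', '2012-03-01']))

# Same filter as in randomise_date: drop 29 February
is_leap_day = dates.dt.month.eq(2) & dates.dt.day.eq(29)
kept = dates[~is_leap_day]

# Shift the remaining dates back by one year (fixed offset for reproducibility)
shifted = pd.to_datetime({
    'year': kept.dt.year - 1,
    'month': kept.dt.month,
    'day': kept.dt.day,
})

# Assignment aligns on the index, so the dropped row becomes NaT
frame = pd.DataFrame({'date': dates})
frame['date'] = shifted
print(frame['date'].isna().tolist())  # [False, True, False]
```

If missing dates are not acceptable downstream, the `NaT` rows can be removed with `dropna(subset=['date'])`.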


# Dates and times

## Parsing string dates

Using `dateutil`

In [19]:
from dateutil.parser import parse
date = '1 Nov 2020'
print(parse(date))
parse(date).month

2020-11-01 00:00:00


11

Using `pandas`

In [21]:
print(pd.Timestamp(date))
pd.Timestamp(date).month

2020-11-01 00:00:00


11
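For a whole column of date strings, `pd.to_datetime` does the same parsing in one vectorised call. A small sketch; the sample strings and the explicit format are illustrative assumptions:

```python
import pandas as pd

# Hypothetical sample strings, not taken from the dataset above
raw = pd.Series(['1 Nov 2020', '15 Dec 2020', '3 Jan 2021'])

# An explicit format removes ambiguity about day/month order
dates = pd.to_datetime(raw, format='%d %b %Y')

# The .dt accessor exposes the same attributes as a single Timestamp
print(dates.dt.month.tolist())  # [11, 12, 1]
```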

## Date and period ranges

In [33]:
# create quarterly periods and convert each to the first day of the quarter
idx = pd.period_range('2018-1', '2019-1', freq='Q-DEC')
s = pd.Series(np.random.randn(len(idx)), index=idx)
print(s)
s.asfreq('D', how='start')

2018Q1   -0.484997
2018Q2   -0.817007
2018Q3    2.018879
2018Q4   -0.176754
2019Q1   -1.085844
Freq: Q-DEC, dtype: float64


2018-01-01   -0.484997
2018-04-01   -0.817007
2018-07-01    2.018879
2018-10-01   -0.176754
2019-01-01   -1.085844
Freq: D, dtype: float64
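Two related conversions, sketched on a small deterministic series (the values are placeholders): `how='end'` picks the last day of each quarter, and `to_timestamp` turns the `PeriodIndex` into a `DatetimeIndex`.

```python
import pandas as pd

idx = pd.period_range('2018Q1', '2018Q4', freq='Q-DEC')
s = pd.Series(range(len(idx)), index=idx)

# how='end' maps each quarter to its last day instead of its first
end_days = s.asfreq('D', how='end')
print(end_days.index[0])  # 2018-03-31

# to_timestamp replaces the periods with timestamps at the period start
stamps = s.to_timestamp(how='start')
print(stamps.index[0])    # 2018-01-01 00:00:00
```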

In [35]:
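# create 100-day series and aggregate to monthly periods
idx = pd.date_range('2000', periods=100)
s = pd.Series(np.random.randn(len(idx)), index=idx)
# group on monthly periods; same result as resample('M', kind='period'),
# whose `kind` argument is deprecated in recent pandas releases
s.groupby(s.index.to_period('M')).mean()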

2000-01   -0.129504
2000-02   -0.040099
2000-03    0.210304
2000-04   -0.038681
Freq: M, dtype: float64

In [40]:
# create hourly series, downsample to daily open-high-low-close
# (lowercase 'h' replaces the 'H' alias deprecated in recent pandas releases)
idx = pd.date_range('2000', freq='h', periods=100)
s = pd.Series(np.random.randn(len(idx)), index=idx)
s.resample('D').ohlc()

,open,high,low,close
2000-01-01,1.478093,2.355484,-1.320692,-1.036503
2000-01-02,0.736884,1.764789,-2.652206,-0.965161
2000-01-03,-0.308438,2.355778,-1.502893,1.16236
2000-01-04,1.10993,1.786124,-1.424925,1.269311
2000-01-05,0.474575,0.487767,0.116583,0.369565
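`ohlc` is shorthand for four aggregations per bin: first, max, min and last. A sketch on a deterministic hourly series (the values 0 to 47 are placeholders) to make that visible:

```python
import numpy as np
import pandas as pd

# Two days of hourly data with values 0, 1, ..., 47
idx = pd.date_range('2000-01-01', periods=48, freq='h')
s = pd.Series(np.arange(48, dtype=float), index=idx)

ohlc = s.resample('D').ohlc()

# The same table built from the individual aggregations
manual = s.resample('D').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print(ohlc)
print(np.array_equal(ohlc.to_numpy(), manual.to_numpy()))  # True
```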


# Grouping

Create a dictionary of sub-DataFrames, one per value of the grouping column:

In [10]:
df = sns.load_dataset('iris')
pieces = dict(list(df.groupby('species')))
pieces['setosa'].head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
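The `dict(list(...))` idiom works because iterating over a groupby yields `(key, sub-frame)` pairs. A sketch with a small hand-made frame standing in for the iris data (the values are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for the iris data
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica'],
    'petal_length': [1.4, 1.3, 6.0],
})

# Iterating over a groupby yields (key, sub-frame) pairs
pieces = {key: group for key, group in toy.groupby('species')}
print(list(pieces))  # ['setosa', 'virginica']

# get_group fetches a single group without building the whole dictionary
setosa = toy.groupby('species').get_group('setosa')
print(setosa['petal_length'].tolist())  # [1.4, 1.3]
```

When only one or two groups are needed, `get_group` avoids materialising every sub-frame.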


# Sources
- [Python for Data Analysis](https://www.oreilly.com/library/view/python-for-data/9781491957653/)
- [Python Data Science Handbook](https://www.oreilly.com/library/view/python-data-science/9781491912126/)
