# Non-Mini Intro to Pandas 2
by Dr Liang Jin

Part of AcF701 Python Sessions: [github.com/drliangjin/mini-python-book](https://github.com/drliangjin/mini-python-book)

Official Pandas Doc: [pandas.pydata.org](https://pandas.pydata.org/)

In [2]:
import numpy as np
import pandas as pd

### Pandas -- Continued...
1. Data Transformation
2. Data Grouping & Aggregation
3. Time Series

## 1. Data Transformation

### Map
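
A minimal sketch of element-wise transformation, assuming this heading refers to the `Series.map` method. The series and the lookup dictionary below are hypothetical illustration data, not part of the original session.

```python
import pandas as pd

# hypothetical data for illustration only
keys = pd.Series(['A', 'B', 'C', 'A'])

# map() with a dict: each value is looked up; values without a match become NaN
labels = keys.map({'A': 'alpha', 'B': 'beta'})

# map() with a function: the function is applied element by element
lowered = keys.map(str.lower)

print(labels)
print(lowered)
```

A dict is convenient for recoding categories, while a function suits computed transformations.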

## 2. Data Grouping & Aggregation

<img src="img/split-apply-combine.svg">

In [35]:
# create the above dataset
data = {'key': ['A', 'B', 'C', 'A', 'B', 'C'], 
        'data': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data, columns=['key', 'data'])
df

  key  data
0   A     1
1   B     2
2   C     3
3   A     4
4   B     5
5   C     6


In [54]:
# groupby() returns a pandas GroupBy object.
# Nothing has been computed yet; it only records how to split the data into groups
grouped = df.groupby(by='key', as_index=False)
grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x112b5c550>

In [55]:
# to see what has been stored in the GroupBy object
# we can use iteration to print out data in each group
for name, group in grouped:
    print("Sub-group: {}".format(name))
    print(group)
    print("\n")

Sub-group: A
  key  data
0   A     1
3   A     4


Sub-group: B
  key  data
1   B     2
4   B     5


Sub-group: C
  key  data
2   C     3
5   C     6




In [56]:
# Now let's apply some functions and/or methods
# Note: sum() is applied to each group separately,
# and the results are then combined into a single DataFrame
grouped.sum() # <= the key stays a regular column because as_index=False was passed to groupby()

  key  data
0   A     5
1   B     7
2   C     9
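
To make the split-apply-combine steps concrete, here is a minimal sketch that performs them by hand and compares the result with a single `groupby` call. It rebuilds the same small dataset so that it runs on its own; the variable names `manual`, `direct` and `summary` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [1, 2, 3, 4, 5, 6]})

# split: one sub-frame per key; apply: sum the 'data' column;
# combine: collect the per-group results into one Series
manual = pd.Series({name: group['data'].sum()
                    for name, group in df.groupby('key')})

# the same result in a single call
direct = df.groupby('key')['data'].sum()

# several aggregations at once with agg()
summary = df.groupby('key')['data'].agg(['sum', 'mean', 'max'])

print(manual)
print(direct)
print(summary)
```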


## 3. Time Series

### Datetime Object

In [23]:
# built-in `datetime` module
from datetime import datetime, timedelta
from dateutil.parser import parse

# datetime stores both the date and time down to the microsecond
datetime.now()

datetime.datetime(2018, 5, 10, 21, 21, 38, 44927)

In [9]:
# we can compute the time difference between two datetime objects
# the result is a timedelta; .days gives the number of whole days
delta = datetime.now() - datetime(1949, 10, 1)
delta.days

25058
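
A `timedelta` stores its length as days plus leftover seconds and microseconds. The sketch below uses fixed dates instead of `datetime.now()` so that the values are reproducible; the noon timestamp is an arbitrary choice for illustration.

```python
from datetime import datetime

# fixed dates (an assumption for reproducibility) instead of datetime.now()
delta = datetime(2018, 5, 10, 12, 0) - datetime(1949, 10, 1)

print(delta.days)             # whole days in the difference
print(delta.seconds)          # leftover seconds within the last day (here: 12 hours)
print(delta.total_seconds())  # the whole span expressed in seconds
```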

In [11]:
# add a timedelta (here 12 days) to a datetime
start = datetime.now()
start + timedelta(12)

datetime.datetime(2018, 5, 22, 16, 44, 11, 341453)

### Converting between string and datetime

In [14]:
# convert a datetime object to a specific "human friendly" format
datetime.now().strftime('%d-%m-%Y') # <= str 'f'ormat time

'10-05-2018'

In [16]:
# convert string to datetime object
datetime.strptime('2018-05-09', '%Y-%m-%d') # <= str 'p'arse time

datetime.datetime(2018, 5, 9, 0, 0)
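
`strftime` and `strptime` are inverses when they share a format string. A small sketch of the round trip follows, with a fixed date chosen for illustration; it also shows that a string that does not match the format raises `ValueError`.

```python
from datetime import datetime

fmt = '%d-%m-%Y'
d = datetime(2018, 5, 9)

text = d.strftime(fmt)               # datetime -> str
back = datetime.strptime(text, fmt)  # str -> datetime, same format

# a string in a different layout does not match the format
try:
    datetime.strptime('2018-05-09', fmt)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(text, back, mismatch_raised)
```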

In [19]:
# use dateutil package
parse('May 09, 2018, 23:59')

datetime.datetime(2018, 5, 9, 23, 59)

In [23]:
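# pandas's to_datetime function parses a whole list at once; None becomes NaT
# (pandas 2.0 and later infer one format from the first string,
#  so mixed inputs like these may need format='mixed')
pd.to_datetime(['May 09, 2018, 23:59', '2018-05-09 23:59', None])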

DatetimeIndex(['2018-05-09 23:59:00', '2018-05-09 23:59:00', 'NaT'], dtype='datetime64[ns]', freq=None)
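
When the input may contain malformed entries, one option is to pass an explicit `format` together with `errors='coerce'`, so that anything unparseable becomes `NaT` instead of raising. A minimal sketch, where the `raw` list is hypothetical:

```python
import pandas as pd

# hypothetical input with one bad entry and one missing value
raw = ['2018-05-09 23:59', 'not a date', None]

# an explicit format avoids guessing; errors='coerce' turns failures into NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M', errors='coerce')

print(parsed)
```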

### Use datetime object as Index

In [29]:
# manually create timestamps
dates = [datetime(2018, 5, 10), datetime(2018, 5, 11), datetime(2018, 5, 12)]
# the list of dates is passed as the index
ts1 = pd.Series(np.random.randn(3), index=dates)
ts1.index

DatetimeIndex(['2018-05-10', '2018-05-11', '2018-05-12'], dtype='datetime64[ns]', freq=None)
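
One practical benefit of a `DatetimeIndex` is that rows can be selected with date strings. The sketch below uses a small hypothetical series with integer values so that the selections are easy to follow.

```python
import numpy as np
import pandas as pd

# hypothetical series: values 0..4 on five consecutive days
idx = pd.date_range('2018-05-10', periods=5)
ts = pd.Series(np.arange(5), index=idx)

one_day = ts.loc['2018-05-11']                # a single observation
window = ts.loc['2018-05-11':'2018-05-13']    # slice, both ends included
whole_month = ts.loc['2018-05']               # partial string: every row in May 2018

print(one_day)
print(window)
print(len(whole_month))
```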

In [34]:
# Pandas's date_range, by default, generates daily timestamps
pd.date_range('2018-05-10', '2018, May, 12') # accepts different formats...

DatetimeIndex(['2018-05-10', '2018-05-11', '2018-05-12'], dtype='datetime64[ns]', freq='D')

In [35]:
# specify a start (or end) date and the number of periods
pd.date_range(start='2018-05-10', periods=3)

DatetimeIndex(['2018-05-10', '2018-05-11', '2018-05-12'], dtype='datetime64[ns]', freq='D')

In [39]:
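# specify frequency
# 'M' is month end; other aliases include 'D' (daily) and 'Q' (quarter end)
# (pandas 2.2 and later rename 'M' and 'Q' to 'ME' and 'QE')
pd.date_range('2018-05-01', '2018-08-30', freq='M')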

DatetimeIndex(['2018-05-31', '2018-06-30', '2018-07-31'], dtype='datetime64[ns]', freq='M')
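
Two further frequency aliases, sketched with dates chosen for illustration: `'MS'` (month start) and `'B'` (business day, which skips weekends).

```python
import pandas as pd

# month-start timestamps
month_starts = pd.date_range('2018-05-01', periods=3, freq='MS')

# business days only: 11 May 2018 is a Friday, so the weekend is skipped
business = pd.date_range('2018-05-11', '2018-05-15', freq='B')

print(month_starts)
print(business)
```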

### Shift method

In [15]:
# create a dataset with month-start as index
index = pd.date_range('1/1/2000', periods=3, freq='MS')
ts2 = pd.Series(np.random.randn(3), index=index)

In [16]:
ts2

2000-01-01    0.551772
2000-02-01    0.415378
2000-03-01   -0.046669
Freq: MS, dtype: float64

What if we want to create `lead` or `lag` data?

In [17]:
# shift() moves the data points forward (positive) or backward (negative)
# and leaves the datetime index unmodified
ts2.shift(1) 

2000-01-01         NaN
2000-02-01    0.551772
2000-03-01    0.415378
Freq: MS, dtype: float64

In [20]:
# when a freq argument is passed, shift() moves the timestamps
# instead of the data
ts2.shift(9, freq='MS')

2000-10-01    0.551772
2000-11-01    0.415378
2000-12-01   -0.046669
Freq: MS, dtype: float64
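
A common use of lagged data is computing period-over-period changes. The following is a minimal sketch with a hypothetical price series (the values are made up for illustration); a negative argument to `shift()` produces the lead instead of the lag.

```python
import pandas as pd

index = pd.date_range('1/1/2000', periods=3, freq='MS')
prices = pd.Series([100.0, 110.0, 99.0], index=index)  # hypothetical values

lagged = prices.shift(1)    # previous month's value on each row
lead = prices.shift(-1)     # next month's value on each row

# simple month-over-month change: current / previous - 1
change = prices / lagged - 1

print(pd.DataFrame({'price': prices, 'lag': lagged, 'lead': lead, 'change': change}))
```

The first row of `lagged` and the last row of `lead` are `NaN` because there is no earlier or later observation to pull in.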

In [26]:
# Another handy tool for shifting datetimes:
# MonthEnd(0) rolls a date forward to the end of its month,
# which is especially helpful for aligning dates before merging datasets
from pandas.tseries.offsets import MonthEnd

datetime.now() + MonthEnd(0)

Timestamp('2018-05-31 21:46:24.857039')
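
As a sketch of the merging use case: suppose one table is dated mid-month and another at month end. Rolling the first table's dates to month end with `MonthEnd(0)` gives both tables a common key. The two DataFrames and their column names below are hypothetical.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# hypothetical tables: prices dated mid-month, fundamentals dated at month end
prices = pd.DataFrame({'date': pd.to_datetime(['2018-05-10', '2018-06-15']),
                       'price': [10.0, 11.0]})
funda = pd.DataFrame({'date': pd.to_datetime(['2018-05-31', '2018-06-30']),
                      'eps': [1.0, 1.2]})

# roll every date forward to the end of its month
# (dates already at month end are left unchanged by MonthEnd(0))
prices['date'] = prices['date'] + MonthEnd(0)

merged = prices.merge(funda, on='date')
print(merged)
```

Note that `MonthEnd(1)` behaves differently on a date that is already a month end: it moves to the end of the following month, which is why `MonthEnd(0)` is the safer choice for alignment.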