Categoricals in Series/DataFrame

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time, or rating via Likert scales.

The categorical data type is useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.

The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.

As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

In [5]:
df["grade"] = df["raw_grade"].astype("category")

In [6]:
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
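
The point about logical versus lexical order can be illustrated with a small sketch. The `sizes` series and its category order are made up for illustration; `pd.CategoricalDtype` with `ordered=True` is one way to declare the order.

```python
import pandas as pd

# Hypothetical data: these labels sort differently lexically and logically.
sizes = pd.Series(["two", "three", "one", "two"])

# Declare the logical order of the categories explicitly.
logical = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
sizes_cat = sizes.astype(logical)

lexical_sorted = sizes.sort_values().tolist()      # alphabetical order of the strings
logical_sorted = sizes_cat.sort_values().tolist()  # follows the declared category order

print(lexical_sorted)
print(logical_sorted)
print(sizes_cat.min(), sizes_cat.max())
```

With the ordered dtype, `min` and `max` follow the declared order rather than the alphabet.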

TimedeltaIndex/Scalar

Timedelta is a scalar type that subclasses datetime.timedelta and behaves in a similar manner, but it is also compatible with np.timedelta64 types and offers a host of custom representations.

In [8]:
tds = pd.Timedelta('31 days 5 min 3 sec')

In [14]:
tds.min  # class-level attribute: the smallest representable Timedelta, not a property of tds


Timedelta('-106752 days +00:12:43.145224')

In [12]:
tds.seconds

303

In [13]:
tds.days

31

In [15]:
tds.microseconds

0
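
As a rough sketch of the compatibility described above, the following checks the `datetime.timedelta` relationship and a round trip through NumPy. The variable names are illustrative only.

```python
import datetime

import numpy as np
import pandas as pd

tds = pd.Timedelta("31 days 5 min 3 sec")

# Timedelta subclasses datetime.timedelta, so stdlib type checks still pass.
is_stdlib_timedelta = isinstance(tds, datetime.timedelta)

# .seconds holds only the sub-day remainder; total_seconds() covers the whole span.
total = tds.total_seconds()

# Round trip through NumPy's timedelta64 scalar type.
as_np = tds.to_timedelta64()
round_trip = pd.Timedelta(as_np)

print(is_stdlib_timedelta, total, round_trip == tds)
```

Note the difference between `tds.seconds` (303, the remainder within the last day) and `tds.total_seconds()` (the full duration).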

In [28]:
df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]});

In [29]:
df

,AAA,BBB,CCC
0,4,10,100
1,5,20,50
2,6,30,-30
3,7,40,-50


In [32]:
# Chained assignment: both `a` and the selected BBB cells are set to -1
a = df.loc[df.AAA >= 5,'BBB'] = -1

In [33]:
a

-1

In [34]:
df

,AAA,BBB,CCC
0,4,10,100
1,5,-1,50
2,6,-1,-30
3,7,-1,-50


In [35]:
df.loc[df.AAA >= 5,['BBB','CCC']] = 555;

In [36]:
df

,AAA,BBB,CCC
0,4,10,100
1,5,555,555
2,6,555,555
3,7,555,555


In [37]:
df.loc[df.AAA < 5,['BBB','CCC']] = 2000;


In [38]:
df

,AAA,BBB,CCC
0,4,2000,2000
1,5,555,555
2,6,555,555
3,7,555,555


In [39]:
df_mask = pd.DataFrame({'AAA' : [True] * 4, 'BBB' : [False] * 4,'CCC' : [True,False] * 2})

In [40]:
df_mask

,AAA,BBB,CCC
0,True,False,True
1,True,False,False
2,True,False,True
3,True,False,False


In [41]:
df.where(df_mask,-1000)

,AAA,BBB,CCC
0,4,-1000,2000
1,5,-1000,-1000
2,6,-1000,555
3,7,-1000,-1000


In [42]:
df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]});



In [43]:
df

,AAA,BBB,CCC
0,4,10,100
1,5,20,50
2,6,30,-30
3,7,40,-50


In [44]:
dflow = df[df.AAA <= 5]

In [45]:
dflow

,AAA,BBB,CCC
0,4,10,100
1,5,20,50


In [46]:
dfhigh = df[df.AAA > 5]

In [47]:
dfhigh

,AAA,BBB,CCC
2,6,30,-30
3,7,40,-50


In [48]:
df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]});

In [49]:
df

,AAA,BBB,CCC
0,4,10,100
1,5,20,50
2,6,30,-30
3,7,40,-50


In [50]:
newseries = df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA'];

In [51]:
newseries

0    4
1    5
Name: AAA, dtype: int64

In [52]:
df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]});

In [53]:
df

,AAA,BBB,CCC
0,4,10,100
1,5,20,50
2,6,30,-30
3,7,40,-50


In [54]:
df[(df.AAA <= 6) & (df.index.isin([0,2,4]))]

,AAA,BBB,CCC
0,4,10,100
2,6,30,-30


In [55]:
data = {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}

In [56]:
data


{'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]}

In [57]:
df = pd.DataFrame(data=data,index=['foo','bar','boo','kar']);

In [58]:
df

,AAA,BBB,CCC
foo,4,10,100
bar,5,20,50
boo,6,30,-30
kar,7,40,-50


MultiIndexing

In [59]:
df = pd.DataFrame({'row' : [0,1,2],
'One_X' : [1.1,1.1,1.1],
'One_Y' : [1.2,1.2,1.2],
'Two_X' : [1.11,1.11,1.11],
'Two_Y' : [1.22,1.22,1.22]});

In [60]:
df

,One_X,One_Y,Two_X,Two_Y,row
0,1.1,1.2,1.11,1.22,0
1,1.1,1.2,1.11,1.22,1
2,1.1,1.2,1.11,1.22,2


In [61]:
df = df.set_index('row');

In [62]:
df

,One_X,One_Y,Two_X,Two_Y
row,,,,
0,1.1,1.2,1.11,1.22
1,1.1,1.2,1.11,1.22
2,1.1,1.2,1.11,1.22


In [63]:
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns]);

In [64]:
df

,One,One,Two,Two
,X,Y,X,Y
row,,,,
0,1.1,1.2,1.11,1.22
1,1.1,1.2,1.11,1.22
2,1.1,1.2,1.11,1.22


In [65]:
df = df.stack(0).reset_index(1);

In [66]:
df

,level_1,X,Y
row,,,
0,One,1.1,1.2
0,Two,1.11,1.22
1,One,1.1,1.2
1,Two,1.11,1.22
2,One,1.1,1.2
2,Two,1.11,1.22


In [67]:
df.columns = ['Sample','All_X','All_Y'];
df

,Sample,All_X,All_Y
row,,,
0,One,1.1,1.2
0,Two,1.11,1.22
1,One,1.1,1.2
1,Two,1.11,1.22
2,One,1.1,1.2
2,Two,1.11,1.22


In [70]:
# Performing arithmetic with a multi-index that needs broadcasting

cols = pd.MultiIndex.from_tuples([ (x,y) for x in ['A','B','C'] for y in ['O','I']])
cols

MultiIndex(levels=[['A', 'B', 'C'], ['I', 'O']],
           labels=[[0, 0, 1, 1, 2, 2], [1, 0, 1, 0, 1, 0]])

In [69]:
df = pd.DataFrame(np.random.randn(2,6),index=['n','m'],columns=cols); 
df

,A,A,B,B,C,C
,O,I,O,I,O,I
n,-2.625183,0.099223,-1.389805,1.166931,-0.49398,-0.585721
m,-0.233521,0.677587,-0.393202,0.228883,0.447193,1.075492


In [71]:
df = df.div(df['C'],level=1);
df

,A,A,B,B,C,C
,O,I,O,I,O,I
n,5.314354,-0.169403,2.813486,-1.9923,1.0,1.0
m,-0.522193,0.630025,-0.879267,0.212817,1.0,1.0
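
Because the cell above uses random data, the broadcasting is easier to check on fixed numbers. The sketch below uses made-up values and assumes the same column layout; `div(..., level=1)` matches each column to the `C` column sharing its second-level label.

```python
import pandas as pd

cols = pd.MultiIndex.from_product([["A", "B", "C"], ["O", "I"]])

# Hypothetical values chosen so the ratios are easy to verify by hand.
df = pd.DataFrame(
    [[2, 4, 6, 8, 2, 4],
     [3, 9, 6, 3, 3, 3]],
    index=["n", "m"],
    columns=cols,
    dtype=float,
)

# df["C"] has columns O and I; level=1 aligns them with the second column level.
ratio = df.div(df["C"], level=1)
print(ratio)
```

Every `O` column is divided by `('C', 'O')` and every `I` column by `('C', 'I')`, so the `C` block becomes all ones.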


In [72]:
# Slicing a multi-index with xs

In [73]:
coords = [('AA','one'),('AA','six'),('BB','one'),('BB','two'),('BB','six')]

In [74]:
coords

[('AA', 'one'), ('AA', 'six'), ('BB', 'one'), ('BB', 'two'), ('BB', 'six')]

In [75]:
index = pd.MultiIndex.from_tuples(coords)

In [76]:
index

MultiIndex(levels=[['AA', 'BB'], ['one', 'six', 'two']],
           labels=[[0, 0, 1, 1, 1], [0, 1, 0, 2, 1]])

In [77]:
df = pd.DataFrame([11,22,33,44,55],index,['MyData']);
df

,,MyData
AA,one,11
AA,six,22
BB,one,33
BB,two,44
BB,six,55


In [79]:
df.xs('BB',level=0,axis=0)

,MyData
one,33
two,44
six,55


In [80]:
df.xs('six',level=1,axis=0)

,MyData
AA,22
BB,55
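
By default `xs` drops the level it selects on, as the two results above show. A small sketch of the `drop_level` parameter, rebuilding the same frame so the block runs on its own:

```python
import pandas as pd

coords = [("AA", "one"), ("AA", "six"), ("BB", "one"), ("BB", "two"), ("BB", "six")]
df = pd.DataFrame(
    {"MyData": [11, 22, 33, 44, 55]},
    index=pd.MultiIndex.from_tuples(coords),
)

# Default: the selected level is removed from the result's index.
dropped = df.xs("six", level=1)

# drop_level=False keeps the full MultiIndex on the selected rows.
kept = df.xs("six", level=1, drop_level=False)

print(dropped.index.tolist())
print(kept.index.tolist())
```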


In [83]:
import itertools   
index = list(itertools.product(['Ada','Quinn','Violet'],['Comp','Math','Sci']))
index

[('Ada', 'Comp'),
 ('Ada', 'Math'),
 ('Ada', 'Sci'),
 ('Quinn', 'Comp'),
 ('Quinn', 'Math'),
 ('Quinn', 'Sci'),
 ('Violet', 'Comp'),
 ('Violet', 'Math'),
 ('Violet', 'Sci')]

In [84]:
headr = list(itertools.product(['Exams','Labs'],['I','II']))

In [85]:
headr

[('Exams', 'I'), ('Exams', 'II'), ('Labs', 'I'), ('Labs', 'II')]

In [86]:
indx = pd.MultiIndex.from_tuples(index,names=['Student','Course'])

In [87]:
indx

MultiIndex(levels=[['Ada', 'Quinn', 'Violet'], ['Comp', 'Math', 'Sci']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['Student', 'Course'])

In [88]:
cols = pd.MultiIndex.from_tuples(headr) #Notice these are un-named

In [89]:
cols

MultiIndex(levels=[['Exams', 'Labs'], ['I', 'II']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [90]:
data = [[70+x+y+(x*y)%3 for x in range(4)] for y in range(9)]

In [91]:
data

[[70, 71, 72, 73],
 [71, 73, 75, 74],
 [72, 75, 75, 75],
 [73, 74, 75, 76],
 [74, 76, 78, 77],
 [75, 78, 78, 78],
 [76, 77, 78, 79],
 [77, 79, 81, 80],
 [78, 81, 81, 81]]

In [92]:
df = pd.DataFrame(data,indx,cols);
df

,,Exams,Exams,Labs,Labs
,,I,II,I,II
Student,Course,,,,
Ada,Comp,70,71,72,73
Ada,Math,71,73,75,74
Ada,Sci,72,75,75,75
Quinn,Comp,73,74,75,76
Quinn,Math,74,76,78,77
Quinn,Sci,75,78,78,78
Violet,Comp,76,77,78,79
Violet,Math,77,79,81,80
Violet,Sci,78,81,81,81


In [96]:
All = slice(None)
All

slice(None, None, None)

In [95]:
df.loc['Violet']

,Exams,Exams,Labs,Labs
,I,II,I,II
Course,,,,
Comp,76,77,78,79
Math,77,79,81,80
Sci,78,81,81,81


In [97]:
df.loc[(All,'Math'),All]

,,Exams,Exams,Labs,Labs
,,I,II,I,II
Student,Course,,,,
Ada,Math,71,73,75,74
Quinn,Math,74,76,78,77
Violet,Math,77,79,81,80


In [98]:
df.loc[(slice('Ada','Quinn'),'Math'),All]

,,Exams,Exams,Labs,Labs
,,I,II,I,II
Student,Course,,,,
Ada,Math,71,73,75,74
Quinn,Math,74,76,78,77


In [99]:
df.loc[(All,'Math'),('Exams')]

,,I,II
Student,Course,,
Ada,Math,71,73
Quinn,Math,74,76
Violet,Math,77,79


In [100]:
df.loc[(All,'Math'),(All,'II')]

,,Exams,Labs
,,II,II
Student,Course,,
Ada,Math,73,74
Quinn,Math,76,77
Violet,Math,79,80
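
The `(All, 'Math')` tuples above can also be written with `pd.IndexSlice`, which some readers find easier to scan. This is a sketch of an equivalent selection, not a replacement for the cells above; it rebuilds the same frame with `from_product` so it runs on its own.

```python
import pandas as pd

students = ["Ada", "Quinn", "Violet"]
courses = ["Comp", "Math", "Sci"]
indx = pd.MultiIndex.from_product([students, courses], names=["Student", "Course"])
cols = pd.MultiIndex.from_product([["Exams", "Labs"], ["I", "II"]])

# Same synthetic scores as in the cells above.
data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
df = pd.DataFrame(data, index=indx, columns=cols)

idx = pd.IndexSlice

# All students, Math only; all assessment types, part II only.
math_second = df.loc[idx[:, "Math"], idx[:, "II"]]
print(math_second)
```

Slicing this way expects a sorted index, which `from_product` over sorted lists provides here.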


In [101]:
df.sort_values(by=('Labs', 'II'), ascending=False)

,,Exams,Exams,Labs,Labs
,,I,II,I,II
Student,Course,,,,
Violet,Sci,78,81,81,81
Violet,Math,77,79,81,80
Violet,Comp,76,77,78,79
Quinn,Sci,75,78,78,78
Quinn,Math,74,76,78,77
Quinn,Comp,73,74,75,76
Ada,Sci,72,75,75,75
Ada,Math,71,73,75,74
Ada,Comp,70,71,72,73


In [103]:
# Grouping

In [104]:
df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),'size': list('SSMMMLL'),
'weight': [8, 10, 11, 1, 20, 12, 12],
'adult' : [False] * 5 + [True] * 2}); 
df

,adult,animal,size,weight
0,False,cat,S,8
1,False,dog,S,10
2,False,cat,M,11
3,False,fish,M,1
4,False,dog,M,20
5,True,cat,L,12
6,True,cat,L,12


In [105]:
# For each animal, list the size of its heaviest individual.
df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])

animal
cat     L
dog     M
fish    M
dtype: object
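
The same lookup can be sketched without `apply`, by asking each group for the row label of its maximum weight and selecting those rows with `loc`. Variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": "cat dog cat fish dog cat cat".split(),
    "size": list("SSMMMLL"),
    "weight": [8, 10, 11, 1, 20, 12, 12],
})

# idxmax returns, per group, the index label of the first row with the largest weight.
heaviest_rows = df.loc[df.groupby("animal")["weight"].idxmax()]
size_of_heaviest = heaviest_rows.set_index("animal")["size"]

print(size_of_heaviest)
```

When several rows tie for the maximum (the two cats weighing 12), `idxmax` picks the first one.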

In [106]:
gb = df.groupby(['animal'])

In [108]:
gb

<pandas.core.groupby.DataFrameGroupBy object at 0x10e72d4e0>

In [109]:
gb.get_group('cat')

,adult,animal,size,weight
0,False,cat,S,8
2,False,cat,M,11
5,True,cat,L,12
6,True,cat,L,12


In [110]:
# Apply to different items in a group
def GrowUp(x):
    avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
    avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
    avg_weight += sum(x[x['size'] == 'L'].weight)
    avg_weight /= len(x)
    return pd.Series(['L',avg_weight,True], index=['size', 'weight', 'adult'])

In [111]:
expected_df = gb.apply(GrowUp)

In [112]:
expected_df

,size,weight,adult
animal,,,
cat,L,12.4375,True
dog,L,20.0,True
fish,L,1.25,True


In [113]:
S = pd.Series([i / 100.0 for i in range(1,11)])

In [114]:
S

0    0.01
1    0.02
2    0.03
3    0.04
4    0.05
5    0.06
6    0.07
7    0.08
8    0.09
9    0.10
dtype: float64

In [115]:
def CumRet(x,y):
    return x * (1 + y)

In [116]:
import functools
def Red(x):
    # Fold CumRet over the window, starting from an initial value of 1.0
    return functools.reduce(CumRet,x,1.0)

In [118]:
import functools
S.expanding().apply(Red)


0    1.010000
1    1.030200
2    1.061106
3    1.103550
4    1.158728
5    1.228251
6    1.314229
7    1.419367
8    1.547110
9    1.701821
dtype: float64

In [119]:
#Replacing some values with mean of the rest of a group

In [120]:
df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})

In [121]:
df

,A,B
0,1,1
1,1,-1
2,2,1
3,2,2


In [122]:
gb = df.groupby('A')

In [123]:
def replace(g):
    # Replace negative entries with the mean of the group's non-negative entries
    mask = g < 0
    return g.where(~mask, g[~mask].mean())

In [125]:
gb.transform(replace)

## https://pythonforbiologists.com/when-to-use-aggregatefiltertransform-in-pandas/

,B
0,1.0
1,1.0
2,1.0
3,2.0


In [126]:
df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
    'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
     'flag': [False, True] * 3})

In [128]:
code_groups = df.groupby('code')
code_groups

<pandas.core.groupby.DataFrameGroupBy object at 0x10e70f358>

In [129]:
agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')

In [131]:
sorted_df = df.loc[agg_n_sort_order.index]

In [132]:
sorted_df

,code,data,flag
1,bar,-0.21,True
4,bar,-0.59,False
0,foo,0.16,False
3,foo,0.45,True
2,baz,0.33,False
5,baz,0.62,True


In [133]:
## Create multiple aggregated columns

In [136]:
rng = pd.date_range(start="2014-10-07",periods=10,freq='2min')
rng

DatetimeIndex(['2014-10-07 00:00:00', '2014-10-07 00:02:00',
               '2014-10-07 00:04:00', '2014-10-07 00:06:00',
               '2014-10-07 00:08:00', '2014-10-07 00:10:00',
               '2014-10-07 00:12:00', '2014-10-07 00:14:00',
               '2014-10-07 00:16:00', '2014-10-07 00:18:00'],
              dtype='datetime64[ns]', freq='2T')

In [137]:
ts = pd.Series(data = list(range(10)), index = rng)
ts

2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
2014-10-07 00:08:00    4
2014-10-07 00:10:00    5
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2T, dtype: int64

In [138]:
def MyCust(x):
    # Scale the second sample of the bin; bins with two or fewer samples yield NaT
    if len(x) > 2:
        return x.iloc[1] * 1.234
    return pd.NaT


In [140]:
mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust}

In [141]:
mhc

{'Custom': <function __main__.MyCust>,
 'Max': <function numpy.core.fromnumeric.amax>,
 'Mean': <function numpy.core.fromnumeric.mean>}

In [142]:
ts.resample("5min").apply(mhc)

Custom  2014-10-07 00:00:00    1.234
        2014-10-07 00:05:00      NaT
        2014-10-07 00:10:00    7.404
        2014-10-07 00:15:00      NaT
Max     2014-10-07 00:00:00        2
        2014-10-07 00:05:00        4
        2014-10-07 00:10:00        7
        2014-10-07 00:15:00        9
Mean    2014-10-07 00:00:00        1
        2014-10-07 00:05:00      3.5
        2014-10-07 00:10:00        6
        2014-10-07 00:15:00      8.5
dtype: object
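
The dict-of-functions form passed to `apply` above may not be accepted by newer pandas releases. As a hedged alternative, the sketch below computes each aggregate separately on the same resampler and assembles them into columns. `second_scaled` is a hypothetical stand-in for `MyCust` that returns `np.nan` instead of `pd.NaT` so the column stays numeric.

```python
import numpy as np
import pandas as pd

rng = pd.date_range(start="2014-10-07", periods=10, freq="2min")
ts = pd.Series(range(10), index=rng)

def second_scaled(x):
    # Hypothetical helper mirroring MyCust: needs more than two samples in the bin.
    if len(x) > 2:
        return x.iloc[1] * 1.234
    return np.nan

bins = ts.resample("5min")

# One column per aggregate, all indexed by the 5-minute bin start.
summary = pd.DataFrame({
    "Mean": bins.mean(),
    "Max": bins.max(),
    "Custom": bins.apply(second_scaled),
})
print(summary)
```

This yields a wide frame with one row per bin rather than the stacked Series shown above.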

In [143]:
ts

2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
2014-10-07 00:08:00    4
2014-10-07 00:10:00    5
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2T, dtype: int64

In [144]:
df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
    'Value': [100, 150, 50, 50]}); 
df

,Color,Value
0,Red,100
1,Red,150
2,Red,50
3,Blue,50


In [145]:
df['Counts'] = df.groupby(['Color']).transform(len)

In [146]:
df

,Color,Value,Counts
0,Red,100,3
1,Red,150,3
2,Red,50,3
3,Blue,50,1


In [147]:
## Shift groups of the values in a column based on the index

In [148]:
df = pd.DataFrame(
{u'line_race': [10, 10, 8, 10, 10, 8],
u'beyer': [99, 102, 103, 103, 88, 100]},
    index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',
u'Paynter', u'Paynter', u'Paynter']); 
df

,beyer,line_race
Last Gunfighter,99,10
Last Gunfighter,102,10
Last Gunfighter,103,8
Paynter,103,10
Paynter,88,10
Paynter,100,8


In [149]:
df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)

In [150]:
df

,beyer,line_race,beyer_shifted
Last Gunfighter,99,10,NaN
Last Gunfighter,102,10,99.0
Last Gunfighter,103,8,102.0
Paynter,103,10,NaN
Paynter,88,10,103.0
Paynter,100,8,88.0


In [151]:
#Select row with maximum value from each group

df = pd.DataFrame({'host':['other','other','that','this','this'],
'service':['mail','web','mail','mail','web'],
'no':[1, 2, 1, 2, 1]}).set_index(['host', 'service'])

In [152]:
df

,,no
host,service,
other,mail,1
other,web,2
that,mail,1
this,mail,2
this,web,1


In [153]:
mask = df.groupby(level=0).agg('idxmax')

In [154]:
mask

,no
host,
other,"(other, web)"
that,"(that, mail)"
this,"(this, mail)"


In [155]:
df_count = df.loc[mask['no']].reset_index()

In [156]:
df_count

,host,service,no
0,other,web,2
1,that,mail,1
2,this,mail,2


In [157]:
df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])

In [158]:
df

,A
0,0
1,1
2,0
3,1
4,1
5,1
6,0
7,1
8,1


In [159]:
df.A.groupby((df.A != df.A.shift()).cumsum()).groups

{1: Int64Index([0], dtype='int64'),
 2: Int64Index([1], dtype='int64'),
 3: Int64Index([2], dtype='int64'),
 4: Int64Index([3, 4, 5], dtype='int64'),
 5: Int64Index([6], dtype='int64'),
 6: Int64Index([7, 8], dtype='int64')}

In [160]:
df

,A
0,0
1,1
2,0
3,1
4,1
5,1
6,0
7,1
8,1


In [161]:
df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()

0    0
1    1
2    0
3    1
4    2
5    3
6    0
7    1
8    2
Name: A, dtype: int64
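
The grouping key in the two cells above works by labelling consecutive runs: comparing each value with its predecessor marks the start of every run, and the cumulative sum of those marks gives each run its own id. A short sketch that spells out the intermediate steps, with illustrative variable names:

```python
import pandas as pd

a = pd.Series([0, 1, 0, 1, 1, 1, 0, 1, 1])

# True wherever the value differs from the previous one (the first row always counts).
starts_new_run = a != a.shift()

# Cumulative sum turns the run starts into a run id per row.
run_id = starts_new_run.cumsum()

# Length of each run, in order of appearance.
run_lengths = a.groupby(run_id).size().tolist()

print(run_id.tolist())
print(run_lengths)
```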

In [163]:
# Pivot

In [165]:
df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
'City' : ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
'Sales' : [13,6,16,8,4,3,1]})

In [166]:
df

,City,Province,Sales
0,Toronto,ON,13
1,Montreal,QC,6
2,Vancouver,BC,16
3,Calgary,AL,8
4,Edmonton,AL,4
5,Winnipeg,MN,3
6,Windsor,ON,1


In [167]:
table = pd.pivot_table(df,values=['Sales'],index=['Province'],columns=['City'],aggfunc=np.sum)

In [168]:
table

,Sales,Sales,Sales,Sales,Sales,Sales,Sales
City,Calgary,Edmonton,Montreal,Toronto,Vancouver,Windsor,Winnipeg
Province,,,,,,,
AL,8.0,4.0,NaN,NaN,NaN,NaN,NaN
BC,NaN,NaN,NaN,NaN,16.0,NaN,NaN
MN,NaN,NaN,NaN,NaN,NaN,NaN,3.0
ON,NaN,NaN,NaN,13.0,NaN,1.0,NaN
QC,NaN,NaN,6.0,NaN,NaN,NaN,NaN


In [169]:
table.stack('City')

,,Sales
Province,City,
AL,Calgary,8.0
AL,Edmonton,4.0
BC,Vancouver,16.0
MN,Winnipeg,3.0
ON,Toronto,13.0
ON,Windsor,1.0
QC,Montreal,6.0


In [170]:
## Timedeltas

In [171]:
s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))

In [172]:
s

0   2012-01-01
1   2012-01-02
2   2012-01-03
dtype: datetime64[ns]

In [173]:
s - s.max()

0   -2 days
1   -1 days
2    0 days
dtype: timedelta64[ns]

In [174]:
s.max() - s

0   2 days
1   1 days
2   0 days
dtype: timedelta64[ns]

In [176]:
import datetime
s - datetime.datetime(2011,1,1,3,5)


0   364 days 20:55:00
1   365 days 20:55:00
2   366 days 20:55:00
dtype: timedelta64[ns]

In [177]:
s + datetime.timedelta(minutes=5)

0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

In [178]:
deltas = pd.Series([ datetime.timedelta(days=i) for i in range(3) ])

In [179]:
deltas

0   0 days
1   1 days
2   2 days
dtype: timedelta64[ns]

In [181]:
df = pd.DataFrame(dict(A = s, B = deltas)); 
df

,A,B
0,2012-01-01,0 days
1,2012-01-02,1 days
2,2012-01-03,2 days


In [182]:
df['New Dates'] = df['A'] + df['B'];

In [183]:
df

,A,B,New Dates
0,2012-01-01,0 days,2012-01-01
1,2012-01-02,1 days,2012-01-03
2,2012-01-03,2 days,2012-01-05


In [184]:
df['Delta'] = df['A'] - df['New Dates']; 
df

,A,B,New Dates,Delta
0,2012-01-01,0 days,2012-01-01,0 days
1,2012-01-02,1 days,2012-01-03,-1 days
2,2012-01-03,2 days,2012-01-05,-2 days


In [185]:
df.dtypes

A             datetime64[ns]
B            timedelta64[ns]
New Dates     datetime64[ns]
Delta        timedelta64[ns]
dtype: object

In [186]:
y = s - s.shift(); 
y

0      NaT
1   1 days
2   1 days
dtype: timedelta64[ns]

In [187]:
y[1] = np.nan; 
y

0      NaT
1      NaT
2   1 days
dtype: timedelta64[ns]

In [190]:
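# Aliasing axis names. This relies on private pandas attributes
# (_AXIS_NUMBERS, _AXIS_ALIASES) that may not exist in newer pandas releases.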
def set_axis_alias(cls, axis, alias):
    if axis not in cls._AXIS_NUMBERS:
        raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
    cls._AXIS_ALIASES[alias] = axis

In [191]:
def clear_axis_alias(cls, axis, alias):
    if axis not in cls._AXIS_NUMBERS:
        raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
    cls._AXIS_ALIASES.pop(alias,None)

In [192]:
set_axis_alias(pd.DataFrame,'columns', 'myaxis2')

In [193]:
df2 = pd.DataFrame(np.random.randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])

In [194]:
df2.sum(axis='myaxis2')

i1    1.034794
i2    2.658676
i3   -0.285454
dtype: float64

Creating Example Data

To create a DataFrame from every combination of some given values, we can
build a dict whose keys are the column names and whose values are lists of the data values, then take the Cartesian product of those lists.

In [195]:
def expand_grid(data_dict):
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())

In [196]:
df = expand_grid(
    {'height': [60, 70],
    'weight': [100, 140, 180],
    'sex': ['Male', 'Female']})

In [197]:
df

,height,sex,weight
0,60,Male,100
1,60,Male,140
2,60,Male,180
3,60,Female,100
4,60,Female,140
5,60,Female,180
6,70,Male,100
7,70,Male,140
8,70,Male,180
9,70,Female,100
