In [1]:
%pylab inline
import pandas as pd
from pandas import DataFrame, Series

Populating the interactive namespace from numpy and matplotlib


In [2]:
df1 = DataFrame({'key': list('bbacaab'),
                 'data1': range(7)})
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [3]:
df2 = DataFrame({'key': list('abd'),
                 'data2': range(3)})
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,d


The following merge keeps rows whose 'key' values (the only column name the two DataFrames share) appear in both DataFrames, and combines the other columns - data1 and data2 here. Rows with a 'key' value that has no match in the other DataFrame - here, rows with 'key' values of 'c' and 'd' - are left out of the merged result. This is an inner join, which is merge's default. See below for other types of join.

In [4]:
pd.merge(df1, df2)

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


The merge method uses overlapping/matching column names as keys if no keys are specified. It's good practice to specify the keys.

In [5]:
pd.merge(df1, df2, on='key')

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


In [6]:
df3 = DataFrame({'lkey': list('bbacaab'),
                 'data1': range(7)})
df3

Unnamed: 0,data1,lkey
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [7]:
df4 = DataFrame({'rkey': list('abd'),
                 'data2': range(3)})
df4

Unnamed: 0,data2,rkey
0,0,a
1,1,b
2,2,d


In [8]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1,b
1,1,b,1,b
2,6,b,1,b
3,2,a,0,a
4,4,a,0,a
5,5,a,0,a


You can do more than just the default inner join, using the 'how' parameter. In this example, an outer join keeps the 'c' and 'd' key values and their associated data. An outer join here is a combination of the left and right joins, which can also be specified using 'how'. Considered differently, an inner join uses the intersection of the keys, and an outer join uses the union of the keys.

In [9]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,4.0,a,0.0
5,5.0,a,0.0
6,3.0,c,
7,,d,2.0
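
To make the intersection/union idea concrete, here's a minimal sketch that rebuilds the same two frames (named left and right here, my own names) and collects the set of key values that survives each value of 'how'.

```python
import pandas as pd

# Same data as df1/df2 in the merge examples above.
left = pd.DataFrame({'key': list('bbacaab'), 'data1': range(7)})
right = pd.DataFrame({'key': list('abd'), 'data2': range(3)})

# Collect the set of key values that survives each join type.
keys_by_how = {
    how: set(pd.merge(left, right, on='key', how=how)['key'])
    for how in ('inner', 'left', 'right', 'outer')
}

for how, keys in keys_by_how.items():
    print(how, sorted(keys))
```

The inner join keeps only the keys found on both sides, left and right keep all the keys of one side, and outer keeps every key from either side.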


Up to this point we've been doing many-to-one merges, because the second DataFrame has had only a single row for each key value. You can also do many-to-many merges.

In [10]:
df1 = DataFrame({'key': list('bbacab'),
                 'data1': range(6)})
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,b


In [11]:
df2 = DataFrame({'key': list('ababd'),
                 'data2': range(5)})
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,a
3,3,b
4,4,d


In [12]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,0,b,3.0
2,1,b,1.0
3,1,b,3.0
4,2,a,0.0
5,2,a,2.0
6,3,c,
7,4,a,0.0
8,4,a,2.0
9,5,b,1.0
10,5,b,3.0
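
A many-to-many merge forms every pairing of matching rows, so each key contributes (rows on the left) x (rows on the right) rows to the result. A small sketch to check the arithmetic, rebuilding the two frames under my own names left and right:

```python
import pandas as pd

left = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
right = pd.DataFrame({'key': list('ababd'), 'data2': range(5)})

merged = pd.merge(left, right, on='key', how='left')

# Rows per key in the merged result.
merged_counts = merged['key'].value_counts()

# 'b': 3 left rows x 2 right rows = 6; 'a': 2 x 2 = 4;
# 'c' has no match on the right, so the left join keeps its single row.
print(merged_counts.to_dict())
print(len(merged))
```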


There's more on pages 180 and 181 about merging on more than one key, and about handling overlapping column names by specifying text to append to them. (You can also rename axis labels, as explained later.)
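
A short sketch of both ideas. The frames and their values are made up for illustration (they are not the book's data): a list passed to 'on' merges on the combination of columns, and 'suffixes' sets the text appended to column names that clash.

```python
import pandas as pd

# Hypothetical frames, invented for this illustration.
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'val': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'val': [4, 5, 6, 7]})

# A list passed to `on` matches rows on the (key1, key2) combination.
both_keys = pd.merge(left, right, on=['key1', 'key2'])

# Merging on key1 alone leaves key2 and val in both frames;
# `suffixes` controls the text appended to the clashing names.
one_key = pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

print(list(both_keys.columns))
print(list(one_key.columns))
```

Without 'suffixes', pandas falls back to '_x' and '_y', which is what the 'val' column gets in both_keys.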

##Merging on index

By default the merge keys are assumed to be in columns. You can also merge using rows/indexes.

In [13]:
left1 = DataFrame({'key': list('abaabc'),
                   'value': range(6)})
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [14]:
right1 = DataFrame({'group_val': [3.5, 7]}, index=['a','b'])
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [15]:
# for the right side, use the row index as the key
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


In [16]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


###Merging hierarchically-indexed data

In [17]:
lefth = DataFrame({'key1': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
                   'key2': [2000, 2001, 2002, 2001, 2002],
                   'data': np.arange(5.)})
lefth

Unnamed: 0,data,key1,key2
0,0,Ohio,2000
1,1,Ohio,2001
2,2,Ohio,2002
3,3,Nevada,2001
4,4,Nevada,2002


In [18]:
righth = DataFrame(np.arange(12).reshape((6,2)),
                   index=[['Nevada','Nevada','Ohio','Ohio','Ohio','Ohio'],
                          [2001, 2000, 2000, 2000, 2001, 2002]],
                   columns=['event1','event2'])
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In [19]:
pd.merge(lefth, righth, left_on=['key1','key2'], right_index=True)

Unnamed: 0,data,key1,key2,event1,event2
0,0,Ohio,2000,4,5
0,0,Ohio,2000,6,7
1,1,Ohio,2001,8,9
2,2,Ohio,2002,10,11
3,3,Nevada,2001,0,1


In [20]:
pd.merge(lefth, righth, left_on=['key1','key2'],
         right_index=True, how='outer')

Unnamed: 0,data,key1,key2,event1,event2
0,0.0,Ohio,2000,4.0,5.0
0,0.0,Ohio,2000,6.0,7.0
1,1.0,Ohio,2001,8.0,9.0
2,2.0,Ohio,2002,10.0,11.0
3,3.0,Nevada,2001,0.0,1.0
4,4.0,Nevada,2002,,
4,,Nevada,2000,2.0,3.0


In [21]:
left2 = DataFrame([[1.,2.], [3.,4.], [5.,6.]],
                  index=list('ace'),
                  columns=['Ohio','Nevada'])
left2

Unnamed: 0,Ohio,Nevada
a,1,2
c,3,4
e,5,6


In [22]:
right2 = DataFrame([[7.,8.], [9.,10.], [11.,12.], [13.,14.]],
                   index=list('bcde'),
                   columns=['Missouri','Alabama'])
right2

Unnamed: 0,Missouri,Alabama
b,7,8
c,9,10
d,11,12
e,13,14


In [23]:
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


You can also use the 'more convenient' 'join' instance method to merge by index, and also to combine many DataFrame objects that have the same or similar indices but non-overlapping columns. You could do the previous example as follows:

In [24]:
left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


You can also join the index of the passed DataFrame on one of the columns of the calling DataFrame.

In [25]:
left1.join(right1, on='key')

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,
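
Note that the 'c' row survives here, whereas the earlier pd.merge call with left_on/right_index dropped it: join defaults to a left join, while merge defaults to an inner join. A minimal sketch comparing the two, rebuilding left1 and right1:

```python
import pandas as pd

left1 = pd.DataFrame({'key': list('abaabc'), 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

joined = left1.join(right1, on='key')

# The same operation spelled out with merge; how='left' must be explicit.
merged = pd.merge(left1, right1, left_on='key', right_index=True, how='left')

print(joined.equals(merged))
```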


Finally, to do a 'simple index-on-index merge', you can pass a list of DataFrame instances to join. This is an alternative to the more general concat function described later.

In [26]:
another = DataFrame([[7.,8.],[9.,10.],[11.,12.],[16.,17.]],
                    index=list('acef'),
                    columns=['New York','Oregon'])
another

Unnamed: 0,New York,Oregon
a,7,8
c,9,10
e,11,12
f,16,17


In [27]:
left2

Unnamed: 0,Ohio,Nevada
a,1,2
c,3,4
e,5,6


In [28]:
right2

Unnamed: 0,Missouri,Alabama
b,7,8
c,9,10
d,11,12
e,13,14


In [29]:
left2.join([right2, another])

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1,2,,,7,8
c,3,4,9.0,10.0,9,10
e,5,6,13.0,14.0,11,12


In [30]:
left2.join([right2, another], how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
b,,,7.0,8.0,,
c,3.0,4.0,9.0,10.0,9.0,10.0
d,,,11.0,12.0,,
e,5.0,6.0,13.0,14.0,11.0,12.0
f,,,,,16.0,17.0


##Concatenating along an axis - "concatenating", "binding", "stacking"

NumPy has a concatenate function.

In [31]:
arr = np.arange(12).reshape((3,4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [32]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [33]:
np.concatenate([arr, arr])

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Pandas objects - Series, DataFrame - have labeled axes, which 'further generalize' how you do array concatenation.

In [34]:
s1 = Series([0, 1], index=['a','b'])
s1

a    0
b    1
dtype: int64

In [35]:
s2 = Series([2, 3, 4], index=list('cde'))
s2

c    2
d    3
e    4
dtype: int64

In [36]:
s3 = Series([5, 6], index=['f','g'])
s3

f    5
g    6
dtype: int64

In [37]:
# glue together values and indexes
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default, concat works along axis=0 (rows) and produces another Series. Passing axis=1 gets you a DataFrame, because axis=1 is columns.

In [38]:
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


Above there's no overlap in the axis values; the resulting axis is the sorted union (outer join) of the indexes. To intersect them/do an inner join:

In [39]:
s1

a    0
b    1
dtype: int64

In [40]:
s4 = pd.concat([s1 * 5, s3])
s4

a    0
b    5
f    5
g    6
dtype: int64

In [41]:
pd.concat([s1, s4], axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,5
f,,5
g,,6


In [42]:
pd.concat([s1, s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,5


In the above examples, the concatenated pieces can't be identified in the result. "You might want a hierarchical index on the concatenation axis", which you can do with the 'keys' argument:

In [43]:
result = pd.concat([s1, s1, s3], keys=['one','two','three'])
result

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

In [44]:
# then unstack to rotate the inner part of the MultiIndex to columns
result.unstack()

Unnamed: 0,a,b,f,g
one,0.0,1.0,,
two,0.0,1.0,,
three,,,5.0,6.0


And if you combine Series along axis=1 (concatenate by adding columns), the keys become the DataFrame column headers.

In [45]:
pd.concat([s1, s2, s3], axis=1, keys=['one','two','three'])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


And the same applies when you concatenate DataFrames as below, instead of Series as above.

In [46]:
df1 = DataFrame(np.arange(6).reshape(3,2), index=list('abc'),
                columns=['one','two'])
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [47]:
df2 = DataFrame(np.arange(4).reshape(2,2), index=['a','c'],
                columns=['three','four'])
df2

Unnamed: 0,three,four
a,0,1
c,2,3


In [48]:
pd.concat([df1, df2], axis=1, keys=['level1','level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,0.0,1.0
b,2,3,,
c,4,5,2.0,3.0


There's more on p187 and 188, about how to affect how the hierarchical index is created, including use of 'names' and 'ignore_index' when the row index isn't meaningful in context.
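
As a rough sketch of those two options (the level names 'upper' and 'lower' are my own choice): 'names' labels the levels of the hierarchical index that 'keys' creates, and 'ignore_index=True' throws away the labels along the concatenation axis and renumbers from zero.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=list('abc'),
                   columns=['one', 'two'])
df2 = pd.DataFrame(np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

# `names` labels the two levels of the hierarchical column index.
labeled = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
                    names=['upper', 'lower'])

# `ignore_index=True` discards the row labels and renumbers 0..n-1.
stacked = pd.concat([df1, df1], ignore_index=True)

print(labeled.columns.names)
print(list(stacked.index))
```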

##Combining data with overlap

In [49]:
a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
           index=list('fedcba'))
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [50]:
b = Series(np.arange(len(a), dtype=np.float64),
           index=list('fedcba'))
b

f    0
e    1
d    2
c    3
b    4
a    5
dtype: float64

In [51]:
b[-1]

5.0

In [52]:
b[-1] = np.nan
b

f     0
e     1
d     2
c     3
b     4
a   NaN
dtype: float64

Where the value of a is null, take b's value, otherwise take a.

In [53]:
np.where(pd.isnull(a), b, a)

array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])

Or, you can use the Series combine_first method to do the same thing.

In [54]:
Series.combine_first?

In [55]:
a.combine_first(b)

f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64

For DataFrames, combine_first does the same thing column by column - 'you can think of it as patching missing data in the calling object with data from the object' that's passed.

In [56]:
list(range(2, 18, 4))

[2, 6, 10, 14]

In [57]:
df1 = DataFrame({'a': [1., np.nan, 5., np.nan],
                 'b': [np.nan, 2., np.nan, 6.],
                 'c': range(2, 18, 4)})
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [58]:
df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.],
                 'b': [np.nan, 3., 4., 6., 8.]})
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


As the following shows, it looks like combine_first does both:
- replace NaN in the first DataFrame with values from the second, if the values exist in the second (as expected from the previous example), AND
- add completely new rows if they exist in the second DataFrame but not in the first - in the example below, see the row with index '4'

In [59]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1,,2.0
1,4,2.0,6.0
2,5,4.0,10.0
3,3,6.0,14.0
4,7,8.0,


#Reshaping and pivoting

Pandas uses hierarchical indexing as a key (and 'consistent') way to rearrange data in DataFrames. Generally you stack - rotate/pivot data from the columns to rows, and unstack - rotate/pivot data from rows into columns.

In [60]:
data = DataFrame(np.arange(6).reshape((2,3)),
                 index=pd.Index(['Ohio','Colorado'], name='state'),
                 columns=pd.Index(['one','two','three'], name='number'))
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [61]:
# we stack - columns to rows - to produce a Series
result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

Note that the index is a MultiIndex
- level 0/outer is the original index values,
- level 1/inner is the new stacked/pivoted index values, which were the column index values

In [62]:
type(result.index)

pandas.core.index.MultiIndex

And you can unstack - pivot from row values into column values. By default, the values that are pivoted out are those of the innermost index level.

In [63]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


As noted above, by default the innermost level is unstacked (stack behaves the same way). Another level can be unstacked (or stacked) by passing a level number - starting w/ zero for the outermost level - or a name, if the name is defined. The default level is -1, meaning the last (highest-numbered, innermost) level.

In [64]:
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [68]:
result.unstack('state')

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [65]:
result.unstack(0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [66]:
result.unstack(1) # same as default here because we have two levels

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [67]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5
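
On the question of which level is the default: a minimal sketch confirming that unstack() with no argument behaves like unstack(-1), i.e. it unstacks the last level. To keep the example self-contained, the Series is built directly from a MultiIndex rather than via data.stack(), so the row order may differ from the output above.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['Ohio', 'Colorado'], ['one', 'two', 'three']],
    names=['state', 'number'])
result = pd.Series(range(6), index=index)

default = result.unstack()          # no level given
by_negative = result.unstack(-1)    # last (innermost) level
by_name = result.unstack('number')  # same level, by name

same = default.equals(by_negative) and default.equals(by_name)
print(same)

# The unstacked level ends up as the columns.
print(default.columns.name, result.unstack(0).columns.name)
```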


If you have the case where some of the values in the level aren't found/don't exist in each of the subgroups, then unstacking will add NaNs.

In [69]:
s1 = Series([0,1,2,3], index=list('abcd'))
s1

a    0
b    1
c    2
d    3
dtype: int64

In [70]:
s2 = Series([4,5,6], index=list('cde'))
s2

c    4
d    5
e    6
dtype: int64

In [72]:
data2 = pd.concat([s1, s2], keys=['one','two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [73]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2,3,
two,,,4,5,6.0


When you stack, NaNs are filtered out by default, so if you unstack (and add NaNs) and then stack, you get back what you started with... unless you choose to keep NAs when stacking.

In [74]:
data2.unstack().stack() # same as before

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: float64

In [76]:
data2.unstack().stack(dropna=False) # keeps the NAs when stacking

one  a     0
     b     1
     c     2
     d     3
     e   NaN
two  a   NaN
     b   NaN
     c     4
     d     5
     e     6
dtype: float64

Here's an example of how, when unstacking in a DataFrame, the level unstacked becomes the lowest level in the result. It also looks like when you stack, the level stacked becomes the lowest/innermost level of the index.

In [77]:
df = DataFrame({'left': result, 'right': result + 5},
                columns=pd.Index(['left','right'], name='side'))
df

Unnamed: 0_level_0,side,left,right
state,number,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,one,0,5
Ohio,two,1,6
Ohio,three,2,7
Colorado,one,3,8
Colorado,two,4,9
Colorado,three,5,10


In [78]:
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [79]:
df.unstack('state')

side,left,left,right,right
state,Ohio,Colorado,Ohio,Colorado
number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,0,3,5,8
two,1,4,6,9
three,2,5,7,10


In [80]:
df.unstack('state').stack('side')

Unnamed: 0_level_0,state,Ohio,Colorado
number,side,Unnamed: 2_level_1,Unnamed: 3_level_1
one,left,0,3
one,right,5,8
two,left,1,4
two,right,6,9
three,left,2,5
three,right,7,10


###Pivoting "long" to "wide" format

As shown on page 192, you can use the pivot method to pivot out data (similar to unstack), which is especially useful when you have data - for ex, from a database - in 'long' format, where the values for one 'entity' are spread over multiple rows. A DB might store data this way so that new types of data can be added w/o changing the schema.

In [84]:
ldata = DataFrame(
    {'date':['1959-03-31','1959-03-31','1959-03-31','1959-06-30','1959-06-30','1959-06-30','1959-09-30','1959-09-30','1959-09-30','1959-12-31'],
     'item':['realgdp','infl','unemp','realgdp','infl','unemp','realgdp','infl','unemp','realgdp'], 
     'value':[2710.349,0.,5.8,2778.801,2.340,5.1,2775.488,2.74,5.3,2785.204]})
ldata

Unnamed: 0,date,item,value
0,1959-03-31,realgdp,2710.349
1,1959-03-31,infl,0.0
2,1959-03-31,unemp,5.8
3,1959-06-30,realgdp,2778.801
4,1959-06-30,infl,2.34
5,1959-06-30,unemp,5.1
6,1959-09-30,realgdp,2775.488
7,1959-09-30,infl,2.74
8,1959-09-30,unemp,5.3
9,1959-12-31,realgdp,2785.204


In [85]:
pivoted = ldata.pivot('date','item','value')
pivoted

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,,2785.204,


The first param is the column whose values become the row index, the second is the column whose values become the column index, and the optional third is the column whose values fill the DataFrame. This means that you don't need to worry about specifying index or columns params manually using the DataFrame constructor - instead, just load the data, and use 'pivot' to get a DataFrame in the shape you want.
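
The three parameters are named index, columns and values, so the call can also be written with keywords, which makes the roles explicit. As far as I know, newer pandas releases accept only the keyword form for pivot. A minimal sketch on a cut-down copy of the data:

```python
import pandas as pd

# A shortened version of ldata: two dates, two items.
ldata = pd.DataFrame({
    'date': ['1959-03-31', '1959-03-31', '1959-06-30', '1959-06-30'],
    'item': ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.349, 0.0, 2778.801, 2.34],
})

# index -> row labels, columns -> column labels, values -> cell contents
pivoted = ldata.pivot(index='date', columns='item', values='value')
print(pivoted)
```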

In [87]:
# add a new column to the source data to demonstrate having two
# value columns to reshape at the same time
ldata['value2'] = np.random.randn(len(ldata))
ldata

Unnamed: 0,date,item,value,value2
0,1959-03-31,realgdp,2710.349,-0.871026
1,1959-03-31,infl,0.0,-1.338485
2,1959-03-31,unemp,5.8,-0.855564
3,1959-06-30,realgdp,2778.801,0.095763
4,1959-06-30,infl,2.34,0.53095
5,1959-06-30,unemp,5.1,1.22194
6,1959-09-30,realgdp,2775.488,-0.282222
7,1959-09-30,infl,2.74,0.032789
8,1959-09-30,unemp,5.3,-2.162751
9,1959-12-31,realgdp,2785.204,-0.401327


In [88]:
# since we don't specify the value argument, we get a DF with 
# hierarchical columns 
pivoted = ldata.pivot('date','item')
pivoted[:5]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31,0.0,2710.349,5.8,-1.338485,-0.871026,-0.855564
1959-06-30,2.34,2778.801,5.1,0.53095,0.095763,1.22194
1959-09-30,2.74,2775.488,5.3,0.032789,-0.282222,-2.162751
1959-12-31,,2785.204,,,-0.401327,


In [89]:
pivoted['value'][:5]

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,,2785.204,


It's interesting to note, and it reinforces the structure of a DataFrame, that the pivot method is a shortcut for creating a hierarchical index using set_index and then reshaping with unstack.

In [90]:
unstacked = ldata.set_index(['date','item']).unstack('item')
unstacked[:7]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31,0.0,2710.349,5.8,-1.338485,-0.871026,-0.855564
1959-06-30,2.34,2778.801,5.1,0.53095,0.095763,1.22194
1959-09-30,2.74,2775.488,5.3,0.032789,-0.282222,-2.162751
1959-12-31,,2785.204,,,-0.401327,


#Data transformation

The book calls everything up to here 'rearranging' data. It then uses the term 'data transformation' for 'filtering, cleaning, and other transformations'.

##Removing duplicates

In [91]:
data = DataFrame({'k1':['one']*3 + ['two']*4,
                  'k2':[1,1,2,3,3,4,4]})
data

Unnamed: 0,k1,k2
0,one,1
1,one,1
2,one,2
3,two,3
4,two,3
5,two,4
6,two,4


In [94]:
data.duplicated() 
# boolean Series that says whether each row (all of its values) duplicates an earlier row

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [95]:
# and then you can drop duplicate rows (w/o using 'del')
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
2,one,2
3,two,3
5,two,4


The duplicated and drop_duplicates methods consider all of the columns/the whole row, by default. You can pass column names to have them consider only a subset of columns.

In [96]:
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,one,1,1
2,one,2,2
3,two,3,3
4,two,3,4
5,two,4,5
6,two,4,6


In [97]:
data.drop_duplicates() # nothing dropped as no duplicate rows

Unnamed: 0,k1,k2,v1
0,one,1,0
1,one,1,1
2,one,2,2
3,two,3,3
4,two,3,4
5,two,4,5
6,two,4,6


In [98]:
data.drop_duplicates('k1')

Unnamed: 0,k1,k2,v1
0,one,1,0
3,two,3,3


By default the first row of each set of duplicates is kept; pass take_last=True to keep the last instead.

In [99]:
data.drop_duplicates(['k1','k2'], take_last=True)

Unnamed: 0,k1,k2,v1
1,one,1,1
2,one,2,2
4,two,3,4
6,two,4,6
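
The take_last argument comes from the pandas version these notes were run on. Later versions replace it with a 'keep' parameter that takes 'first' (the default) or 'last'. A minimal sketch of the same call in that form, assuming a pandas version that has 'keep':

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4],
                     'v1': range(7)})

# Keep the last row of each (k1, k2) combination ...
last_kept = data.drop_duplicates(['k1', 'k2'], keep='last')

# ... versus the default, which keeps the first.
first_kept = data.drop_duplicates(['k1', 'k2'])

print(list(last_kept.index))
print(list(first_kept.index))
```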


##Transforming data using a function or mapping

In [100]:
data = DataFrame({'food':['bacon','pulled pork','bacon','Pastrami',
                          'corned beef','Bacon','pastrami',
                          'honey ham','nova lox'],
                  'ounces':[4,3,12,6,7.5,8,3,5,6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [101]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

In [102]:
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [103]:
# or...
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

And, my own addition, as I seem to need to do it not infrequently, if you want to use the whole row - instead of just a column/Series as above - then you can do things with multiple fields in each row, by using the DataFrame apply method. By default apply here passes in each column, and so apply is called once for each column. If you want to pass in and process each row, then call with axis=1. 

Also, don't forget that applymap and map do the same thing, but are called on different objects - applymap is called on DataFrames, and map is called on Series. In both cases, your function gets each element of the object on which it's called. In contrast, the DataFrame apply method passes in a whole row or column at a time.

In [112]:
data.apply(lambda row: row['food'] + row['animal'], axis=1)

0          baconpig
1    pulled porkpig
2          baconpig
3       Pastramicow
4    corned beefcow
5          Baconpig
6       pastramicow
7      honey hampig
8    nova loxsalmon
dtype: object
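
A small sketch of what the function receives in each case, using a made-up two-column frame. I've left applymap out because, as far as I know, newer pandas versions rename it; the element-wise idea is the same as Series.map.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# Series.map: the function receives one element at a time.
doubled = df['x'].map(lambda v: v * 2)

# DataFrame.apply, default axis=0: the function receives each column as a Series.
col_sums = df.apply(lambda col: col.sum())

# DataFrame.apply with axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda row: row['x'] + row['y'], axis=1)

print(list(doubled))
print(col_sums.to_dict())
print(list(row_sums))
```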

##Replacing values

You can use your own function w/ map, apply, or applymap as outlined above, but replace is a simple way to do this.

In [113]:
data = Series([1.,-999.,2.,-999.,-1000.,3.])
data

0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: float64

In [115]:
# replace -999 sentinels w/ NAs, which pandas knows about
data.replace(-999, np.nan)

0       1
1     NaN
2       2
3     NaN
4   -1000
5       3
dtype: float64

In [116]:
data.replace([-999,-1000], np.nan)

0     1
1   NaN
2     2
3   NaN
4   NaN
5     3
dtype: float64

In [117]:
data.replace([-999,-1000],[np.nan, 0])

0     1
1   NaN
2     2
3   NaN
4     0
5     3
dtype: float64

In [118]:
data.replace({-999: np.nan, -1000: 0})

0     1
1   NaN
2     2
3   NaN
4     0
5     3
dtype: float64

##Renaming axis indexes

Index values are - in pandas - really part of the data, so it's reasonable to want to map/transform index values like you do w/ actual data values (using map, apply, applymap, replace, etc.). The Index object has a map method, and the DataFrame has a rename method that works on either axis. The columns attribute is also an Index, so map works there as well.

In [119]:
data = DataFrame(np.arange(12).reshape((3,4)),
                index=['Ohio','Colorado','New York'],
                columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [120]:
data.index.map(str.upper)

array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)

In [121]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [123]:
data.index = data.index.map(str.upper)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11
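
The column labels live in an Index object too, so the same map-and-assign pattern works on data.columns. A minimal sketch on a fresh copy of the frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# map returns the transformed labels; assigning them back changes the frame.
data.columns = data.columns.map(str.upper)
print(list(data.columns))
```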


Rename returns a new DataFrame object entirely, saving one from having to copy the DataFrame and then assign new values to the index and columns collections. It looks like the index and columns parameters take a function, which is then applied to each index value and each column value (respectively), when forming the new DataFrame.

In [128]:
data.rename(index=str.title, columns=str.upper)
# str.title returns a 'title-cased' version of the string

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


Or you can map using a dict.

In [129]:
data.rename(index={'OHIO': 'INDIANANANA'},
            columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANANANA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


You don't _have_ to get a new object, if you use the inplace parameter.

In [130]:
_ = data.rename(index={'OHIO': 'INDIANANA'}, inplace=True)

In [131]:
data

Unnamed: 0,one,two,three,four
INDIANANA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


In [132]:
_

Unnamed: 0,one,two,three,four
INDIANANA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


##Discretization and binning

In [133]:
ages = [20,22,25,27,21,23,37,31,61,45,41,32]

In [134]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

cats is a 'Categorical' object - it can be treated as an array of strings that holds the bin names. Here, a bin name is something like (18, 25], which means the bin runs from just above 18, which isn't included, up to and including 25. It's implemented as a categories array (formerly 'levels', now deprecated) holding the category names, and a codes array (formerly 'labels', also deprecated) that shows how each value in the original array maps to the categories.

In [135]:
cats.labels



array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [136]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [138]:
cats.levels



Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype='object')

In [139]:
cats.categories

Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype='object')

In [140]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
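
Age 25 is a handy test of which side of each bin is closed. The sketch below compares the default bins with right=False, which closes the left side instead, and also passes 'labels' to name the bins (the label strings are my own choice).

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

right_closed = pd.cut(ages, bins)              # (18, 25], (25, 35], ...
left_closed = pd.cut(ages, bins, right=False)  # [18, 25), [25, 35), ...

# ages[2] is 25: it falls in the first bin by default and in the
# second bin when the right edge is open.
print(right_closed.codes[2], left_closed.codes[2])

# `labels` replaces the interval names with names of your own.
named = pd.cut(ages, bins,
               labels=['youth', 'young adult', 'middle aged', 'senior'])
print(list(named)[8])  # ages[8] is 61
```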

Midway through p199.