In [14]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [1]:
from pandas import DataFrame, Series
import pandas as pd

#Series

Series objects are like one-dimensional arrays of values, with an associated array of labels called the index.

In [2]:
o = Series([4,7,-5,3])
o

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
o.values

array([ 4,  7, -5,  3])

In [5]:
o.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [6]:
o2 = Series([4,7,-5,3], index=['d','b','a','c'])
o2

d    4
b    7
a   -5
c    3
dtype: int64

In [7]:
o2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [8]:
o2[0]

4

In [9]:
o2['d']

4

In [10]:
o2[['c','a','b']]

c    3
a   -5
b    7
dtype: int64

NumPy array operations, like filtering with a boolean array, scalar multiplication, and applying math functions, preserve the link between the index and the resulting value.

In [11]:
o2[o2 > 0]

d    4
b    7
c    3
dtype: int64

In [12]:
o2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [15]:
np.exp(o2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A series can also be thought of as a fixed-length ordered dict, as it maps between index values and data values. You can pass a series into many functions that expect a dict.

In [16]:
'b' in o2

True

In [17]:
'e' in o2

False

In [18]:
data_in_dict = {'Ohio': 35000, 'Texas': 71000, 
                'Oregon': 16000, 'Utah': 5000}
o3 = Series(data_in_dict)
o3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

Here we pass in a dict and a list of index values. The constructor takes only the data values from the dict whose keys appear in the index values (so 'Utah' is dropped). Since 'California' is in the index values but isn't a key in the dict, there's no data to take for it - we still get an entry for the 'California' index value, but it has a NaN value.

In [19]:
states = ['California','Ohio','Oregon','Texas']
o4 = Series(data_in_dict, index=states)
o4

California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

In [20]:
pd.isnull(o4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [21]:
pd.notnull(o4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [22]:
o4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Series 'automatically aligns differently-indexed data in arithmetic operations'. That is, pandas uses the index values to decide which entries to combine. Here, we have two different Series that share some of the same index values. When we add the two Series instances, data values with the same index value are added together, and any index value that is missing from either Series (or whose value is NaN) gives NaN in the result.

In [23]:
o3 + o4

California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64

You can name the Series instance itself, and also name the index.

In [25]:
o4.name = 'population'
o4.index.name = 'state'
o4

state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64

To change the index, assign a new list of labels to the index attribute - this changes the Series in place.

In [26]:
o

0    4
1    7
2   -5
3    3
dtype: int64

In [27]:
o.index = ['Bob','Steve','Jeff','Ryan']
o

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

#DataFrames

Some high-level notes on DataFrames:
- Think of it as a dict of Series instances. Each Series instance shares the same index, which is the index of the DataFrame.
- Compared to R's data.frame, row- and column-oriented actions are treated roughly symmetrically.
- While it doesn't matter for work done with DataFrames, the data's stored as one or more 2D blocks and not as a list, dict, or other collection of 1D arrays.
- Even though the data's stored in 2D, it's 'easy' to represent higher-dimensional data using hierarchical indexing.

In [29]:
data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


In [30]:
DataFrame(data, columns=['year','state','pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


Just like w/ Series, a column that you ask for but that isn't in the data shows up filled with NaN.

In [31]:
frame2 = DataFrame(data, columns=['year','state','pop','debt'],
                   index=['one','two','three','four','five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [32]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [33]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

As shown above, the returned Series:
- is named after the column
- has the same index as the DataFrame

Accessing the columns is intuitive. Because the bracket and dot notations are already used for columns, they can't also be used to retrieve a row by its label. To access a row, use the .ix indexer and pass the index value of the particular row you care about (or use one of a few other approaches that are discussed later).

In [35]:
frame2.ix['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
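
The `.ix` indexer used in these notes was deprecated and later removed from pandas, so on a recent installation the row lookup above needs `.loc` (label-based) or `.iloc` (position-based) instead. A minimal sketch, assuming a recent pandas; the name `frame2_sketch` is made up for this example and simply rebuilds a frame shaped like frame2 without the debt column:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2_sketch = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                             index=['one', 'two', 'three', 'four', 'five'])

row_by_label = frame2_sketch.loc['three']   # select a row by its index label
row_by_position = frame2_sketch.iloc[2]     # select the same row by integer position

print(row_by_label)
print(row_by_label.equals(row_by_position))
```

Either way the result is a Series whose index is the DataFrame's columns and whose name is the row label.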

Accessing a column or a row gives you a Series instance. 

In [37]:
type(frame2['state'])

pandas.core.series.Series

In [36]:
type(frame2.ix['three'])

pandas.core.series.Series

There are a variety of different ways to modify columns. For example, you can assign a single value (which will be broadcast to every row) or an array of values whose length matches the DataFrame.

In [38]:
frame2.debt = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [41]:
arange(5.)

array([ 0.,  1.,  2.,  3.,  4.])

In [43]:
frame2.debt = arange(5.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


You can also assign a Series instance to a column. In this case the length of the Series doesn't need to match the length of the DataFrame (like it does if you just assign a bare array). Instead, the Series data will be used and matched up according to the index of the Series and the DataFrame: matching index values are used and any DataFrame index values that don't have a matching index value in the Series will result in NaN (index values in the Series that aren't in the DataFrame are ignored).

In [44]:
val = Series([-1.2, -1.5, -1.7], index=['two','four','five'])
frame2.debt = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


To create a new column, assign to a column that doesn't exist. You have to use the bracket syntax - you can't use the dot syntax to create a new column.

In [46]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


To delete a column, use 'del'. Again, dot syntax doesn't work - use bracket syntax.

In [48]:
del frame2['eastern']
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


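In this version of pandas, the column returned by indexing a DataFrame is a view on the underlying data, not a copy - you can modify the Series in place and it'll modify the DataFrame (pandas prints a warning when you do). To get an independent copy, call the Series's copy method.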

In [51]:
s = frame2.debt
s['four'] = -1.5555
frame2

A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from IPython.kernel.zmq import kernelapp as app


Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5555
five,2002,Nevada,2.9,-1.7
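
Whether a selected column is a view or a copy has changed across pandas releases, so relying on it is fragile. A minimal sketch of the explicit-copy route, using a small made-up frame rather than frame2; the only claim here is that changes to an explicit copy never reach the original:

```python
import pandas as pd

frame = pd.DataFrame({'debt': [1.0, 2.0, 3.0]}, index=['one', 'two', 'three'])

debt_copy = frame['debt'].copy()   # independent copy of the column
debt_copy['two'] = -99.0           # modify only the copy

print(frame)       # row 'two' still holds 2.0
print(debt_copy)   # row 'two' now holds -99.0
```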


Instead of using a dict w/ equal-sized arrays, you can also use a nested set of dicts - i.e., a dict of dicts. The keys of the first/outer dict are interpreted as columns, and the keys of the inner dicts are interpreted as row indices.

In [56]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


DataFrames can be transposed, just like NumPy arrays.

In [57]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


When no explicit index is provided, the keys of the inner dicts are unioned and sorted to form the index. Or, you can provide an explicit index.

In [58]:
DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


You can also use a dict w/ Series values.

In [60]:
frame3['Ohio']

2000    1.5
2001    1.7
2002    3.6
Name: Ohio, dtype: float64

In [59]:
frame3['Ohio'][:-1]

2000    1.5
2001    1.7
Name: Ohio, dtype: float64

In [64]:
frame3['Nevada']

2000    NaN
2001    2.4
2002    2.9
Name: Nevada, dtype: float64

In [65]:
frame3['Nevada'][:2]

2000    NaN
2001    2.4
Name: Nevada, dtype: float64

In [62]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pdata

{'Nevada': 2000    NaN
 2001    2.4
 Name: Nevada, dtype: float64, 'Ohio': 2000    1.5
 2001    1.7
 Name: Ohio, dtype: float64}

In [63]:
DataFrame(pdata)

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7


Don't forget that a DataFrame's index and columns can each be given a name.

In [66]:
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [67]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state,Nevada,Ohio
year,,
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


As with Series, the values attribute returns the data as a NumPy ndarray - a 2D one for a DataFrame.

In [68]:
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

##Index objects

In [69]:
o = Series(range(3), index=['a','b','c'])
i = o.index
i

Index(['a', 'b', 'c'], dtype='object')

In [70]:
i[1:]

Index(['b', 'c'], dtype='object')

In [71]:
i[1] = 'd' # they're immutable

TypeError: Indexes does not support mutable operations

As above, indices are immutable - this makes it possible to safely share them between different data structures.

In [72]:
i = pd.Index(np.arange(3))
o2 = Series([1.5, -2.5, 0], index=i)
o2.index is i

True

As page 121 shows, pandas provides multiple Index classes, not just the most general 'Index' object. Each more specific index class is specialized for particular kinds of index values. For example, there are Int64Index (specialized for integer values), MultiIndex (hierarchical/multiple levels of indexing on a single axis - it behaves like an array of tuples, where each tuple holds one label per level), DatetimeIndex (nanosecond timestamps), and PeriodIndex (period data - timespans).
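
To make the "array of tuples" description of MultiIndex concrete, here is a small sketch. The level names `letter` and `number` and the data values are made up for illustration:

```python
import pandas as pd

# Each label is a tuple: (outer level, inner level).
tuples = [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
mi = pd.MultiIndex.from_tuples(tuples, names=['letter', 'number'])
s = pd.Series([10, 20, 30, 40], index=mi)

print(s)
print(s.loc['a'])        # all entries whose outer label is 'a'
print(s.loc[('b', 2)])   # one fully specified label
```

Iterating over `mi` gives back the tuples, which is the sense in which the index is "like an array of tuples".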

Both the rows and columns have indices.

In [73]:
type(frame3.columns)

pandas.core.index.Index

In [75]:
type(frame3.index)  # rows

pandas.core.index.Int64Index

##"Essential" functionality, as per page 122 and on

"Reindexing" is "critical" and means to create a new object with the data _conformed_ to a new index.

In [76]:
o = Series([4.5, 7.2, -5.3, 3.6], index=['d','b','a','c'])
o

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex with a new index rearranges the data according to the new index, and inserts missing values if any index values aren't already present.

In [79]:
o2 = o.reindex(['a','b','c','d','e'])
o2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [80]:
o.reindex(['a','b','c','d','e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

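Filling works too - helpful for ordered data like time series. The method option tells reindex how to fill the gaps; 'ffill' forward-fills, carrying the last valid value forward.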

In [82]:
o3 = Series(['blue','purple','yellow'], index=[0, 2, 4])
o3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

With DataFrame instances, reindex can alter either the (row) index, columns, or both. 

In [83]:
d = np.arange(9).reshape(3,3)
d

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [86]:
f = DataFrame(d, 
              index=['a','c','d'],
              columns=['Ohio','Texas','California'])
f

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [88]:
f2 = f.reindex(['a','b','c','d'])  # reindex rows/the 'index'
f2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [89]:
states = ['Texas','Utah','California']
f.reindex(columns=states) # reindex columns

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


Both index/rows and columns at the same time.

In [90]:
f.reindex(index=['a','b','c','d'],
          columns=states)

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


"Label-indexing with ix" is a more succinct way to reindex. (ix is covered in more detail in the indexing section below.)

In [92]:
f.ix[['a','b','c','d'], states]

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


To remove entries from an axis, use the drop method. It returns a new object and leaves the original unchanged.

In [94]:
o = Series(np.arange(5.), index=['a','b','c','d','e'])
o

a    0
b    1
c    2
d    3
e    4
dtype: float64

In [95]:
new_o = o.drop('c')
new_o

a    0
b    1
d    3
e    4
dtype: float64

In [96]:
o.drop(['d','c'])

a    0
b    1
e    4
dtype: float64

In [97]:
d = DataFrame(np.arange(16).reshape((4,4)),
              index=['Ohio','Colorado','Utah','New York'],
              columns=['one','two','three','four'])
d

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [98]:
d.drop(['Colorado','Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [100]:
d.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [101]:
d.drop(['two','four'], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


##Indexing, selection, filtering

Series indexing works like NumPy array indexing, and you can also use the Series's index values instead of just integers.

In [102]:
o = Series(np.arange(4.), index=['a','b','c','d'])
o

a    0
b    1
c    2
d    3
dtype: float64

In [103]:
o['b']

1.0

In [104]:
o[1]

1.0

In [105]:
o[2:4]

c    2
d    3
dtype: float64

In [106]:
o[['b','a','d']]

b    1
a    0
d    3
dtype: float64

In [107]:
o[[1, 3]]

b    1
d    3
dtype: float64

Boolean indexing - a test is applied to each entry in the series, and a boolean is returned for each entry. (All of the booleans together form yet another Series instance.)

In [111]:
o < 2

a     True
b     True
c    False
d    False
dtype: bool

And you can use this, or any other boolean sequence, to index into the Series and select out values.

In [112]:
o[o < 2]

a    0
b    1
dtype: float64

When you slice with labels (instead of integer positions), the endpoint is inclusive, in contrast to normal Python slicing.

In [114]:
o[1:3] # integer slice: the endpoint is not inclusive, so positions 1 and 2 are returned

b    1
c    2
dtype: float64

In [115]:
o['b':'c'] # endpoint is inclusive; only specify two values to get two

b    1
c    2
dtype: float64
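
The same contrast can be written with the explicit indexers, which avoids any ambiguity about whether a slice is positional or label-based. A minimal sketch, assuming a pandas version that provides `.iloc` and `.loc`:

```python
import numpy as np
import pandas as pd

o = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = o.iloc[1:3]    # positions 1 and 2; the stop position 3 is excluded
by_label = o.loc['b':'c']    # labels 'b' through 'c'; the stop label is included

print(by_position)
print(by_label)
```

Both selections contain the entries labelled 'b' and 'c'.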

In [116]:
o['b':'c'] = 5
o

a    0
b    5
c    5
d    3
dtype: float64

In [118]:
d

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [120]:
d['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [121]:
d[['three','one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


DataFrame indexing has some special cases: brackets normally select columns, but a slice or a boolean array selects rows instead.

In [123]:
d[:2] # first two rows

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [125]:
d[d['three'] > 5] # rows where col named 'three' has values > 5 (the special case: a boolean array selects rows, not columns)

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [126]:
d < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [127]:
d[d < 5] = 0

In [128]:
d

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


The ix indexer (an attribute, not a method - it's used with square brackets) lets you select a subset of rows and columns from a DataFrame using NumPy-like notation plus axis labels. (This is also, as shown earlier, a less verbose way to reindex.)

In [130]:
d.ix['Colorado']

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [129]:
d.ix['Colorado', ['two','three']]

two      5
three    6
Name: Colorado, dtype: int64

In [131]:
d.ix[['Colorado','Utah'], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [132]:
d.ix[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

All rows up to and including the row with index 'Utah'. The endpoint is included because label-based slicing is inclusive, unlike integer-based slicing.

In [133]:
d.ix[:'Utah']

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11


In [134]:
d.ix[d.three > 5]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [135]:
d.ix[d.three > 5, :3]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


Page 128 has a table that summarizes the different ways to index into a DataFrame.

Interestingly, page 128 also has a note saying that "having to type frame[:, col] to select a column was too verbose (and error-prone), since column selection is one of the most common operations. Thus I made the design trade-off to push all of the rich label-indexing into ix."

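I think he means that w/ NumPy arrays you'd write a[:, j] to select a column, and he didn't want folks to have to type the equivalent frame[:, col] with DataFrames, since column selection is so common. So plain brackets select columns, and the richer row/column label indexing lives in ix.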

##Arithmetic and data alignment

Not sure why it's so important, but the author says that one of the most important Pandas features is 'behavior of arithmetic between objects with different indexes'.

When you add objects, if the indexes don't match exactly, then the index of the resulting object is the union of the indexes of the two things being added (or subtracted, etc.)

In [137]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [138]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [139]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

When you add DataFrames, alignment is performed on both rows and columns.

In [141]:
df1 = DataFrame(np.arange(9.).reshape((3,3)),
                columns=list('bcd'),
                index=['Ohio','Texas','Colorado'])
df1

Unnamed: 0,b,c,d
Ohio,0,1,2
Texas,3,4,5
Colorado,6,7,8


In [142]:
df2 = DataFrame(np.arange(12.).reshape((4,3)),
                columns=list('bde'),
                index=['Utah','Ohio','Texas','Oregon'])
df2

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


Adding the two frames gives us a DataFrame whose indexes are the union of the indexes - this applies to both the row and column indexes - of the underlying DataFrames.

NaN values propagate, so the only cells that keep actual numbers are those cells whose row and column index values exist in both of the original DataFrame instances.

In [143]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


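You can use the fill_value param of the arithmetic methods (add, sub, etc.) to avoid NaN in locations that don't overlap: when a cell exists in only one of the two frames, the missing side is replaced with fill_value before the operation.

Cells where there's nothing in either source are still NaN.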

In [144]:
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
Colorado,6,7.0,8,
Ohio,3,1.0,6,5.0
Oregon,9,,10,11.0
Texas,9,4.0,12,8.0
Utah,0,,1,2.0
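
The effect of fill_value is easiest to see on two tiny Series. This sketch uses made-up values, not df1 and df2:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

plain = s1 + s2                     # 'a' and 'c' are each missing from one side -> NaN
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0 before adding

print(plain)
print(filled)
```

With the operator, only 'b' gets a number; with fill_value=0, 'a' and 'c' keep the value from whichever Series has them.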


You can also do arithmetic with DataFrame and Series objects. This works like it does with NumPy arrays.

First, here's how things work with just NumPy arrays.  

In [145]:
a = np.arange(12.).reshape((3,4))
a

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [146]:
a[0]

array([ 0.,  1.,  2.,  3.])

This is 'broadcasting'. It's explained more in chapter 12. Here I think of it as the four values in a[0] being used for each row.

In [148]:
a - a[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

DataFrame and Series also use broadcasting and work similarly.

In [149]:
f = DataFrame(np.arange(12.).reshape((4,3)),
              columns=list('bde'),
              index=['Utah','Ohio','Texas','Oregon'])
f

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [150]:
s = f.ix[0]
s

b    0
d    1
e    2
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns and broadcasts down the rows.

In [151]:
f - s

Unnamed: 0,b,d,e
Utah,0,0,0
Ohio,3,3,3
Texas,6,6,6
Oregon,9,9,9


If an index value is not found in the DataFrame's columns or the Series's index, Pandas reindexes the objects to form the union.

In [152]:
s2 = Series(range(3), index=['b','e','f'])
f + s2

Unnamed: 0,b,d,e,f
Utah,0,,3,
Ohio,3,,6,
Texas,6,,9,
Oregon,9,,12,


To broadcast over the columns, matching on the rows, use one of the arithmetic methods (add, sub, etc.) instead of the operators and pass axis=0.

In [153]:
s3 = f['d']
s3

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: float64

In [154]:
f

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


So we take the [1,4,7,10] values and subtract them from each column in turn, matching/determining what to subtract from using the row indexes. So for the first column we subtract 1 from 0, 4 from 3, etc. Then we subtract 1 from 1, 4 from 4, etc. And finally, in the last column, we subtract 1 from 2, 4 from 5, 7 from 8, and 10 from 11.

In [155]:
f.sub(s3, axis=0)

Unnamed: 0,b,d,e
Utah,-1,0,1
Ohio,-1,0,1
Texas,-1,0,1
Oregon,-1,0,1
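
A common use of this row-matching form is subtracting each row's mean from that row. A minimal sketch on the same frame as above; the names `row_means` and `demeaned` are just for this example:

```python
import numpy as np
import pandas as pd

f = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                 columns=list('bde'),
                 index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = f.mean(axis=1)            # one mean per row, indexed by row label
demeaned = f.sub(row_means, axis=0)   # match on row labels, broadcast across columns

print(demeaned)
```

Writing `f - row_means` instead would try to match the row labels against the column names 'b', 'd', 'e', which is not what is wanted here.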


##Function application and mapping, page 132

In [159]:
f = DataFrame(np.random.randn(4,3), columns=list('bde'),
              index=['Utah','Ohio','Texas','Oregon'])
f

Unnamed: 0,b,d,e
Utah,0.688583,0.069777,0.655359
Ohio,1.049268,-1.604578,-0.530702
Texas,0.240976,0.426752,-0.91142
Oregon,1.86079,-1.475099,-0.169356


In [161]:
np.abs(f) # np.abs is a NumPy ufunc (element-wise array function)

Unnamed: 0,b,d,e
Utah,0.688583,0.069777,0.655359
Ohio,1.049268,1.604578,0.530702
Texas,0.240976,0.426752,0.91142
Oregon,1.86079,1.475099,0.169356


Or, you can apply your own function that takes a 1D array to either all rows or all columns.

In [162]:
# x is the 1D array - this returns the max value minus the min value
fun = lambda x: x.max() - x.min()

In [163]:
f.apply(fun) # by default, pass in each column

b    1.619814
d    2.031330
e    1.566778
dtype: float64

In [165]:
f.apply(fun, axis=1) # or specify axis=1 to pass in each row

Utah      0.618806
Ohio      2.653846
Texas     1.338171
Oregon    3.335889
dtype: float64

Don't forget that many of the most common array statistics - say, sum and mean - are actually also defined as instance methods on the DataFrame class, and so apply isn't necessary.

In [166]:
f.sum()

b    3.839616
d   -2.583149
e   -0.956119
dtype: float64

In [167]:
f.sum(axis=1)

Utah      1.413719
Ohio     -1.086012
Texas    -0.243693
Oregon    0.216334
dtype: float64

The lambda above returns a single scalar value, but it doesn't need to. It can also return a Series with multiple values.

In [168]:
def fun(x):
    return Series([x.min(), x.max()], index=['min','max'])

In [169]:
f.apply(fun)

Unnamed: 0,b,d,e
min,0.240976,-1.604578,-0.91142
max,1.86079,0.426752,0.655359


You can use 'element-wise Python functions' too - that is, functions that are applied to each individual element/cell (not to a Series that's originally a particular row or column). For a DataFrame, use applymap.

In [170]:
format = lambda x: '%.2f' % x
f.applymap(format)

Unnamed: 0,b,d,e
Utah,0.69,0.07,0.66
Ohio,1.05,-1.6,-0.53
Texas,0.24,0.43,-0.91
Oregon,1.86,-1.48,-0.17


It's called 'applymap' because of its similarity to the 'map' method on Series objects, which also applies a function element-wise, but to a Series.

In [173]:
f['e'].map(format)

Utah       0.66
Ohio      -0.53
Texas     -0.91
Oregon    -0.17
Name: e, dtype: object
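
The relationship between the two can be spelled out by combining apply with Series.map, which gives the same element-wise result as applymap. I believe newer pandas releases deprecate applymap in favor of a DataFrame method named map, so this sketch sticks to calls that exist in both old and new versions. The helper name `two_decimals` and the two-row frame are made up for the example:

```python
import pandas as pd

f = pd.DataFrame({'b': [0.688583, 1.049268], 'e': [0.655359, -0.530702]},
                 index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format a single number with two decimal places."""
    return '%.2f' % x

# apply hands each column (a Series) to the lambda; Series.map then calls
# two_decimals on every element of that column.
formatted = f.apply(lambda col: col.map(two_decimals))

print(formatted)
```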

##Sorting, ranking

To sort by row or column index, use sort_index. This sorts using the row index values or column names, using standard lexicographic order.

In [174]:
o = Series(range(4), index=list('dabc'))
o

d    0
a    1
b    2
c    3
dtype: int64

In [175]:
o.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [177]:
f = DataFrame(np.arange(8).reshape((2,4)), index=['three','one'],
              columns=list('dabc'))
f

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [178]:
f.sort_index() # sort by row indices by default

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [179]:
f.sort_index(axis=1) # or, by column names

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [180]:
f.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


And, you can sort by actual data values. With Series, use 'order'.

In [181]:
o = Series([4,7,-3,2])
o

0    4
1    7
2   -3
3    2
dtype: int64

In [182]:
o.order()

2   -3
3    2
0    4
1    7
dtype: int64

In [183]:
o.order(ascending=False)

1    7
0    4
3    2
2   -3
dtype: int64

In [184]:
o = Series([4,np.nan,7,np.nan,-3,2])
o

0     4
1   NaN
2     7
3   NaN
4    -3
5     2
dtype: float64

In [185]:
o.order()

4    -3
5     2
0     4
2     7
1   NaN
3   NaN
dtype: float64

With DataFrames you can sort by the values in one or multiple columns. (I assume you can also sort by values somehow/the same way in one or multiple rows?)

In [186]:
f = DataFrame({'b': [4,7,-3,2],
               'a': [0,1,0,1]})
f

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


You use sort_index (the same method used previously to sort the rows/columns by index values), but pass the 'by' parameter to tell it to sort by the values in one or more columns instead.

In [187]:
f.sort_index(by='b')

Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7


In [188]:
f.sort_index(by=['a','b']) # multiple columns

Unnamed: 0,a,b
2,0,-3
0,0,4
3,1,2
1,1,7
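
Later pandas releases dropped both Series.order and the 'by' parameter of sort_index in favor of a single sort_values method. A minimal sketch of the equivalent calls, assuming a pandas version that has sort_values:

```python
import pandas as pd

o = pd.Series([4, 7, -3, 2])
f = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

sorted_series = o.sort_values()               # stands in for o.order()
sorted_frame = f.sort_values(by=['a', 'b'])   # stands in for f.sort_index(by=['a', 'b'])

print(sorted_series)
print(sorted_frame)
```

The row order of `sorted_frame` matches the multi-column output shown above (index 2, 0, 3, 1).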


Ranking is the act of assigning a number to each item according to its value.

In [189]:
o = Series([7, -5, 7, 4, 2, 0, 4])
o

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [192]:
o.rank() # ties are broken using the mean rank for the group

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Or, break ties by the order in which the values appear in the data, so that equal values get distinct ranks instead of sharing the mean rank.

In [193]:
o.rank(method='first')

0    6
1    1
2    7
3    4
4    3
5    2
6    5
dtype: float64

There are more tie-breaking methods - see the 'rank' docs.
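
A quick way to compare the tie-breaking methods is to put them side by side for the same Series. This sketch assumes the method names 'min', 'max', 'first', and 'dense' as listed in the rank docs:

```python
import pandas as pd

o = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = pd.DataFrame({
    'average': o.rank(),               # default: tied values share the mean rank
    'min': o.rank(method='min'),       # tied values all get the lowest rank in the group
    'max': o.rank(method='max'),       # tied values all get the highest rank in the group
    'first': o.rank(method='first'),   # ties broken by order of appearance
    'dense': o.rank(method='dense'),   # like 'min', but ranks increase by 1 between groups
})

print(ranks)
```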

DataFrames can compute ranks over rows or columns - use f.rank() to rank each of the values in each column, and f.rank(axis=1) to rank each of the values in each row.

In [195]:
f = DataFrame({'b': [4.3,7,-3,2],
               'a': [0,1,0,1],
               'c': [-2,5,8,-2.5]})
f

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5


In [196]:
f.rank()

Unnamed: 0,a,b,c
0,1.5,3,2
1,3.5,4,3
2,1.5,1,4
3,3.5,2,1


In [197]:
f.rank(axis=1)

Unnamed: 0,a,b,c
0,2,3,1
1,1,3,2
2,2,1,3
3,2,3,1


##Axis indexes with duplicate values

You can have duplicate index values. Some things work the same, some things - like reindex - require unique index values, and some things - like data selection - work, but work differently.

In [198]:
o = Series(range(5), index=list('aabbc'))
o

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [200]:
o.index.is_unique

False

Data selection is different when you have duplicate index values. If you select an index value that has duplicates, you get back a Series; if you instead select an index value that doesn't have duplicates, you get back a scalar. (It seems bad that you get back two different kinds of objects from the same instance - shouldn't you get back a Series object of size 1 in the latter case?)

In [201]:
o['a']

a    0
a    1
dtype: int64

In [202]:
type(o['a'])

pandas.core.series.Series

In [203]:
o['c']

4

In [204]:
type(o['c'])

numpy.int64
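
One way to get a consistent return type, sketched here on the assumption that `.loc` is available, is to pass the label inside a list. A list of labels always selects a Series, whether or not the label is duplicated:

```python
import pandas as pd

o = pd.Series(range(5), index=list('aabbc'))

unique_label = o.loc[['c']]      # list of labels -> a Series, here of length 1
duplicate_label = o.loc[['a']]   # same call shape for a duplicated label -> length 2

print(type(unique_label), len(unique_label))
print(type(duplicate_label), len(duplicate_label))
```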

The same rules apply when indexing into a DataFrame with duplicate index values. If you use an index value that isn't duplicated, you get back a Series - since you're returning, for example, a single row or single column. If you use an index value that IS duplicated, then you're actually returning multiple rows or columns, so you get back a (smaller) DataFrame.

In [211]:
f = DataFrame(np.random.randn(5,3), index=list('aabbc'))
f

Unnamed: 0,0,1,2
a,-0.02606,-1.803653,1.046564
a,-0.227848,0.274482,-0.010228
b,1.486417,1.639465,-0.164596
b,0.242031,-1.191808,-0.843457
c,0.821127,1.308682,-0.002046


In [212]:
f.columns.is_unique

True

In [213]:
f.index.is_unique

False

In [210]:
f.ix['b']

Unnamed: 0,0,1,2
b,-0.579353,-0.723138,-0.078571
b,0.299266,-0.690381,1.199568


In [215]:
type(f.ix['b'])

pandas.core.frame.DataFrame

In [216]:
f.ix['c']

0    0.821127
1    1.308682
2   -0.002046
Name: c, dtype: float64

In [217]:
type(f.ix['c'])

pandas.core.series.Series

#Descriptive statistics

DataFrames and Series objects have a set of common mathematical and statistical methods. Interesting things about these methods:
- Most are reductions/summary statistics - they extract a single value (like a sum or mean) from a Series, or a Series of single values from the rows or columns of a DataFrame (for example, the mean of each column).
- NumPy also defines similar methods for NumPy arrays, but the Pandas methods are 'built from the ground up' to exclude missing data.
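
The difference in missing-data handling shows up with a single NaN. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

values = [1.4, np.nan, 0.75]

numpy_sum = np.array(values).sum()     # NaN propagates through the plain NumPy sum
pandas_sum = pd.Series(values).sum()   # pandas skips NaN by default (skipna=True)

print(numpy_sum)
print(pandas_sum)
```

The NumPy result is NaN, while the pandas result is the sum of the two non-missing values.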

In [218]:
df = DataFrame([[1.4,np.nan], [7.1,-4.5],
                [np.nan,np.nan], [0.75,-1.3]],
               index=list('abcd'),
               columns=['one','two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [219]:
df.sum() # sum each column - add up the rows, by column

one    9.25
two   -5.80
dtype: float64

In [220]:
df.sum(axis=1) # sum each row - add up the columns, by row

a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64

By default, rows/columns with some NAs still return values - the NAs are ignored and don't affect the result (rows/columns with ALL NAs still return NA). To get NA as the result whenever any of the values is NA, use skipna=False.

In [221]:
df.mean(axis=1) # mean of each row

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

In [222]:
df.mean(axis=1, skipna=False) # mean of each row, NA for any rows with NAs

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

Reduction methods - again, like sum, mean, etc. - also have a 'level' parameter that can be used with hierarchical (MultiIndex) indexes.

Some methods - idxmin and idxmax, for example - return 'indirect' statistics like the index value where min or max values are attained.

In [223]:
df.idxmax()

one    b
two    d
dtype: object

In [224]:
df.idxmin()

one    d
two    b
dtype: object

Finally, some methods are accumulations. cumsum is an accumulation: it returns the running total at each position, so the result has the same shape as the input, whereas sum is a reduction that collapses each column to a single value.

In [225]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8
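
The accumulation/reduction distinction on a single column, as a small sketch that reuses the values of column 'one' from df:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.4, 7.1, np.nan, 0.75], index=list('abcd'))

running = s.cumsum()   # same length as s: the running total at each position
total = s.sum()        # a single number

print(running)
print(total)
```

The NaN at 'c' stays NaN in the running total but is otherwise skipped, so the last running value equals the overall sum.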


Some methods - like 'describe' - are neither reductions nor accumulations, and don't return indirect statistics either; describe produces several summary statistics at once.

In [226]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


The describe method produces different output with non-numeric data.

In [227]:
o = Series(['a','a','b','c'] * 4)
o

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [228]:
o.describe()

count     16
unique     3
top        a
freq       8
dtype: object

Page 139 has a full list of these methods.

##Correlation and covariance

Some summary stats - like correlation and covariance - are computed from _pairs_ of arguments. 

In [229]:
import pandas.io.data as web

In [232]:
all_data = {}
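for ticker in ['AAPL','IBM','MSFT','GOOG']:
    # NOTE: 'MSFT' is hard-coded in this call, so every ticker ends up with
    # MSFT's data; pass `ticker` instead to download each symbol separately
    all_data[ticker] = web.get_data_yahoo('MSFT', '1/1/2000', '1/1/2010')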
    
price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})

In [234]:
all_data

{'AAPL':                   Open        High         Low       Close     Volume  \
 Date                                                                    
 2000-01-03  117.375000  118.625000  112.000000  116.562500   53228400   
 2000-01-04  113.562500  117.125000  112.250000  112.625000   54119000   
 2000-01-05  111.125000  116.375000  109.375000  113.812500   64059600   
 2000-01-06  112.187500  113.875000  108.375000  110.000000   54976600   
 2000-01-07  108.625000  112.250000  107.312500  111.437500   62013600   
 2000-01-10  113.437500  113.687500  111.375000  112.250000   44963600   
 2000-01-11  111.500000  114.250000  108.687500  109.375000   46743600   
 2000-01-12  108.500000  108.875000  104.437500  105.812500   66532400   
 2000-01-13  104.375000  108.625000  101.500000  107.812500   83144000   
 2000-01-14  107.187500  113.937500  105.750000  112.250000   73416400   
 2000-01-18  111.812500  116.500000  111.750000  115.312500   81483600   
 2000-01-19  110.500000  111.5

In [235]:
price.head()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-03,41.202972,41.202972,41.202972,41.202972
2000-01-04,39.811129,39.811129,39.811129,39.811129
2000-01-05,40.230892,40.230892,40.230892,40.230892
2000-01-06,38.883234,38.883234,38.883234,38.883234
2000-01-07,39.391367,39.391367,39.391367,39.391367


In [236]:
volume.head()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-03,53228400,53228400,53228400,53228400
2000-01-04,54119000,54119000,54119000,54119000
2000-01-05,64059600,64059600,64059600,64059600
2000-01-06,54976600,54976600,54976600,54976600
2000-01-07,62013600,62013600,62013600,62013600


The following computes the percent change between successive rows (index values), for each column; for example, for AAPL, the change from the value for 2000-01-03 to the value for 2000-01-04.

In [240]:
returns = price.pct_change()

In [241]:
returns.head()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-03,,,,
2000-01-04,-0.03378,-0.03378,-0.03378,-0.03378
2000-01-05,0.010544,0.010544,0.010544,0.010544
2000-01-06,-0.033498,-0.033498,-0.033498,-0.033498
2000-01-07,0.013068,0.013068,0.013068,0.013068
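
To check my reading of pct_change, here is a sketch on hypothetical prices (the numbers and column names are made up). It compares pct_change with the same calculation written out by hand using shift.

```python
import pandas as pd

# Hypothetical prices, chosen only for illustration.
price = pd.DataFrame({"X": [100.0, 110.0, 99.0],
                      "Y": [50.0, 50.0, 55.0]})

returns = price.pct_change()
manual = price / price.shift(1) - 1   # current row divided by previous row, minus 1

print(returns)
```

The first row is NaN in both versions because there is no previous row to compare against.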


corr here returns 1.0 regardless of which two Series I use, and cov likewise returns the same value for every pair. This is how the book shows using the corr and cov methods of the Series object, so the method calls themselves look fine. The cause is in the download loop above: it passes the literal 'MSFT' to get_data_yahoo instead of the loop variable ticker, so all four columns hold the same MSFT data (note the identical columns in price.head() and volume.head()), and identical series are perfectly correlated. The same thing shows up below with the DataFrame methods of the same name.

In [248]:
returns.GOOG.corr(returns.MSFT)

1.0

In [243]:
returns.MSFT.cov(returns.IBM)

0.00051615254382059823

There are also corr and cov methods on DataFrame. These 'return a full correlation or covariance matrix as a DataFrame'.

In [249]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1,1,1,1
GOOG,1,1,1,1
IBM,1,1,1,1
MSFT,1,1,1,1


In [250]:
returns.cov()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,0.000516,0.000516,0.000516,0.000516
GOOG,0.000516,0.000516,0.000516,0.000516
IBM,0.000516,0.000516,0.000516,0.000516
MSFT,0.000516,0.000516,0.000516,0.000516
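
One way to see why every correlation above comes out as 1.0 and every covariance is the same number: that is exactly what you get when the columns contain identical data. This sketch uses synthetic random 'returns' (not real stock data, and the names are mine) with a fixed seed.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)     # fixed seed so the sketch is repeatable
r = rng.randn(250)                 # stand-in for one series of daily returns

returns = pd.DataFrame({"A": r,
                        "B": r,                # same data as A
                        "C": rng.randn(250)})  # independent data

corr_same = returns["A"].corr(returns["B"])   # identical inputs
corr_diff = returns["A"].corr(returns["C"])   # different inputs
cov_same = returns["A"].cov(returns["B"])     # equals the variance of A

print(returns.corr())
```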


There's also corrwith, which per p140 lets you compute pairwise correlations between the columns or rows of a DataFrame and another Series or DataFrame.
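
A small sketch of corrwith on synthetic data (column names are my own). Each column of the frame is correlated against the same Series, and the result is a Series indexed by column name.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
base = pd.Series(rng.randn(100))

frame = pd.DataFrame({
    "same": base,              # identical to base
    "flipped": -base,          # perfectly anti-correlated with base
    "noise": rng.randn(100),   # unrelated to base
})

by_column = frame.corrwith(base)
print(by_column)
```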

##Unique values, value counts, membership

In [251]:
o = Series(list('cadaabbcc'))
o

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [253]:
uniques = o.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [255]:
uniques.sort()
uniques

array(['a', 'b', 'c', 'd'], dtype=object)

Use value_counts to get a Series with each unique value and the number of times it occurs:

In [257]:
o.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

Pandas also provides value_counts as a top-level function that can be used with any array or sequence:

In [258]:
pd.value_counts(o.values, sort=False)

c    3
b    2
a    3
d    1
dtype: int64

In [259]:
pd.value_counts(list('adjasldkfjasdlfhasd;fkajsdf;ahsdgkashdfaskdfaksdfasdfasd;f'))

d    11
a    11
s    10
f     9
k     5
h     3
;     3
j     3
l     2
g     1
dtype: int64

Use isin to easily filter a data set down to a particular subset of values in a Series or in a DataFrame column.

In [260]:
mask = o.isin(['b','c']) # boolean array that's True where 'b' or 'c'
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [261]:
o

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [262]:
o[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

One nice use for value_counts with DataFrames is to 'compute a histogram' (but not visually - here the text means just count each value).

In [263]:
d = DataFrame({'Qu1': [1,3,4,3,4],
               'Qu2': [2,3,1,2,3],
               'Qu3': [1,5,2,4,4]})
d

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


Now, 'apply' value_counts to each column, and use 0 where there are NAs.

In [265]:
result = d.apply(pd.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1,1,1
2,0,2,1
3,2,2,0
4,2,0,2
5,0,0,1
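
A variation on the cell above, as a sketch: it calls each column's own value_counts method (which I'd expect to be the safer spelling on recent Pandas releases, where I believe the top-level function is deprecated) and casts the counts back to integers after filling.

```python
import pandas as pd

d = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                  "Qu2": [2, 3, 1, 2, 3],
                  "Qu3": [1, 5, 2, 4, 4]})

# Count values per column, align the counts on the union of all values,
# fill the gaps with 0, and sort the values that form the row index.
counts = d.apply(lambda col: col.value_counts()).fillna(0).astype(int).sort_index()

print(counts)
```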


#Missing data

Pandas uses NaN, which is a floating-point value, as the missing-data sentinel in both floating-point and non-floating-point arrays.
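
One visible consequence of using a floating-point sentinel, sketched with made-up data: an integer Series that picks up a missing value is stored as floats.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])                 # no gaps: integer dtype
with_gap = pd.Series([1, 2, None])          # None is stored as NaN, so the dtype is float64
strings = pd.Series(["a", None, np.nan])    # None and NaN both count as missing

print(ints.dtype, with_gap.dtype)
print(strings.isnull())
```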

In [266]:
string_data = Series(['aardvark','artichoke',np.nan,'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [267]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [268]:
string_data[0] = None
string_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

In [269]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

In [270]:
from numpy import nan as NA

In [271]:
data = Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [272]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

You could also do the same as dropna above yourself, at least with a Series, where things are simple, using boolean indexing with notnull (or the negation of isnull, i.e. data[~data.isnull()]).

In [273]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [274]:
data.notnull()

0     True
1    False
2     True
3    False
4     True
dtype: bool

DataFrame's dropna rules are 'a little more complex', since you might want to drop rows or columns that are all NA, those that contain any NA, or something in between. By default, dropna drops every row that contains at least one NA.

In [276]:
d = DataFrame([[1.,6.5,3.], [1.,NA,NA],
              [NA,NA,NA], [NA,6.5,3.]])
d

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [277]:
cleaned = d.dropna()
cleaned

Unnamed: 0,0,1,2
0,1,6.5,3


In [278]:
d.dropna(how='all') # drop only the rows that are all NA

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [280]:
d[4] = NA
d

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [281]:
d.dropna(how='all', axis=1) # drop only the cols that are all NA

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [282]:
df = DataFrame(np.random.randn(7,3))
df.ix[:4, 1] = NA
df

Unnamed: 0,0,1,2
0,0.498883,,0.537973
1,0.167015,,-1.332263
2,-0.601805,,0.835373
3,-0.972492,,-0.265646
4,-1.04481,,1.136777
5,0.685937,0.056857,-1.684138
6,-0.574979,-0.265786,-1.495113


In [283]:
df.ix[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,0.498883,,
1,0.167015,,
2,-0.601805,,
3,-0.972492,,-0.265646
4,-1.04481,,1.136777
5,0.685937,0.056857,-1.684138
6,-0.574979,-0.265786,-1.495113


In [286]:
df.dropna(thresh=3) # keep only rows with three or more non-NA values

Unnamed: 0,0,1,2
5,0.685937,0.056857,-1.684138
6,-0.574979,-0.265786,-1.495113


In [287]:
df.dropna(thresh=2) # keep only rows with two or more non-NA values

Unnamed: 0,0,1,2
3,-0.972492,,-0.265646
4,-1.04481,,1.136777
5,0.685937,0.056857,-1.684138
6,-0.574979,-0.265786,-1.495113
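
To convince myself what thresh does, here is a sketch that reproduces it by hand on a tiny made-up frame: count the non-NA values per row, then keep the rows that reach the threshold.

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame([[1.0, NA, NA],
                   [2.0, NA, 3.0],
                   [4.0, 5.0, 6.0]])

non_na_per_row = df.notnull().sum(axis=1)   # number of non-NA values in each row
by_hand = df[non_na_per_row >= 2]           # rows with at least two non-NA values
built_in = df.dropna(thresh=2)              # the same selection via dropna

print(built_in)
```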


Instead of dropping NA data, you can replace it.

In [288]:
df

Unnamed: 0,0,1,2
0,0.498883,,
1,0.167015,,
2,-0.601805,,
3,-0.972492,,-0.265646
4,-1.04481,,1.136777
5,0.685937,0.056857,-1.684138
6,-0.574979,-0.265786,-1.495113


In [289]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.498883,0.0,0.0
1,0.167015,0.0,0.0
2,-0.601805,0.0,0.0
3,-0.972492,0.0,-0.265646
4,-1.04481,0.0,1.136777
5,0.685937,0.056857,-1.684138
6,-0.574979,-0.265786,-1.495113


In [290]:
df.fillna({1: 0.5, 3: -1}) # a different fill value per column; there is no column 3 here, so that entry has no effect

Unnamed: 0,0,1,2
0,0.498883,0.5,
1,0.167015,0.5,
2,-0.601805,0.5,
3,-0.972492,0.5,-0.265646
4,-1.04481,0.5,1.136777
5,0.685937,0.056857,-1.684138
6,-0.574979,-0.265786,-1.495113


In [291]:
_ = df.fillna(0, inplace=True) # modifies df itself; in current pandas the call returns None
df

Unnamed: 0,0,1,2
0,0.498883,0.0,0.0
1,0.167015,0.0,0.0
2,-0.601805,0.0,0.0
3,-0.972492,0.0,-0.265646
4,-1.04481,0.0,1.136777
5,0.685937,0.056857,-1.684138
6,-0.574979,-0.265786,-1.495113


In [292]:
df = DataFrame(np.random.randn(6,3))
df.ix[2:,1]=NA; df.ix[4:,2]=NA
df

Unnamed: 0,0,1,2
0,0.835084,-0.424054,-0.822014
1,1.385518,-0.174821,1.205587
2,0.934476,,-0.34519
3,-0.751307,,0.875368
4,0.296956,,
5,0.296152,,


In [293]:
df.fillna(method='ffill') # forward-fill: carry the last valid value down each column

Unnamed: 0,0,1,2
0,0.835084,-0.424054,-0.822014
1,1.385518,-0.174821,1.205587
2,0.934476,-0.174821,-0.34519
3,-0.751307,-0.174821,0.875368
4,0.296956,-0.174821,0.875368
5,0.296152,-0.174821,0.875368


In [294]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,0.835084,-0.424054,-0.822014
1,1.385518,-0.174821,1.205587
2,0.934476,-0.174821,-0.34519
3,-0.751307,-0.174821,0.875368
4,0.296956,,0.875368
5,0.296152,,0.875368
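
On more recent Pandas releases, my understanding is that fillna(method=...) is deprecated and the dedicated ffill method is the preferred spelling. A sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame({"x": [1.0, NA, NA, NA],
                   "y": [5.0, 6.0, NA, NA]})

filled = df.ffill()          # carry the last valid value all the way down
capped = df.ffill(limit=2)   # fill at most two consecutive gaps per column

print(capped)
```

With limit=2, the third consecutive gap in column x stays NaN, just like column 1 in the output above.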


You can do a lot of other things with fillna 'with a little creativity', like passing the mean or median of a Series as the fill value.

In [295]:
data = Series([1., NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [296]:
data.mean()

3.8333333333333335

In [297]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
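
The same trick extends to a DataFrame, as far as I can tell: df.mean() is a Series keyed by column name, so passing it to fillna fills each column with its own mean. A sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame({"x": [1.0, NA, 3.0],
                   "y": [NA, 10.0, 20.0]})

col_means = df.mean()           # x -> 2.0, y -> 15.0 (NAs skipped)
filled = df.fillna(col_means)   # each column's gaps get that column's mean

print(filled)
```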

#Hierarchical indexing

In [298]:
data = Series(np.random.randn(10),
              index=[list('aaabbbccdd'),
                     [1,2,3,1,2,3,1,2,2,3]])
data

a  1   -1.492356
   2   -0.046519
   3    0.354871
b  1    0.342971
   2   -0.598670
   3   -1.183749
c  1   -0.007644
   2   -0.075437
d  2    1.862357
   3   -0.215455
dtype: float64

In [300]:
type(data.index)

pandas.core.index.MultiIndex

In [301]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

Use 'partial indexing' to select subsets of the data using different parts of the MultiIndex.

In [303]:
data['b']

1    0.342971
2   -0.598670
3   -1.183749
dtype: float64

In [304]:
type(data['b'])

pandas.core.series.Series

In [305]:
type(data['b'].index)

pandas.core.index.Int64Index

In [306]:
data['b':'c']

b  1    0.342971
   2   -0.598670
   3   -1.183749
c  1   -0.007644
   2   -0.075437
dtype: float64

In [307]:
data.ix[['b','d']]

b  1    0.342971
   2   -0.598670
   3   -1.183749
d  2    1.862357
   3   -0.215455
dtype: float64

'In some cases' you can select from an 'inner' level - here, the one that has index values of 1-3.

In [308]:
data[:, 2]

a   -0.046519
b   -0.598670
c   -0.075437
d    1.862357
dtype: float64
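
A sketch of the same selections written with explicit accessors, which I find easier to read: loc for the outer level or a full key, and xs for an inner level. The Series uses the same index layout as above but with simple made-up values (0.0 through 9.0).

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.0),
                 index=[list("aaabbbccdd"),
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer_b = data.loc["b"]          # partial indexing on the outer level
inner_2 = data.xs(2, level=1)    # every outer key that has inner label 2
one_cell = data.loc[("d", 3)]    # a full key is a tuple of labels

print(inner_2)
```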

MultiIndex becomes common as soon as you start reshaping data and using group-based operations (including pivot tables).

In [309]:
# take the inner level and rotate it out so it turns into columns
data.unstack()

Unnamed: 0,1,2,3
a,-1.492356,-0.046519,0.354871
b,0.342971,-0.59867,-1.183749
c,-0.007644,-0.075437,
d,,1.862357,-0.215455


In [310]:
type(data.unstack())

pandas.core.frame.DataFrame

In [311]:
# and you can take columns and make them an (inner) part using a 
# MultiIndex - i.e., take columns and make them into rows - using
# stack()
data.unstack().stack()

a  1   -1.492356
   2   -0.046519
   3    0.354871
b  1    0.342971
   2   -0.598670
   3   -1.183749
c  1   -0.007644
   2   -0.075437
d  2    1.862357
   3   -0.215455
dtype: float64

Both rows and columns can have hierarchical indices.

In [312]:
f = DataFrame(np.arange(12).reshape((4,3)),
              index=[list('aabb'),[1,2,1,2]],
              columns=[['Ohio','Ohio','Colorado'],
                       ['Green','Red','Green']])
f

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


Hierarchical levels can have individual names (as strings or any Python object), and these show up in the output. These index names _aren't_ the same as axis labels. (As far as I can tell, 'axis labels' here means the actual index values, like 'a', 'b', 1 and 2, whereas a name identifies a whole level.)
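
A sketch of the names-versus-labels distinction on a tiny made-up frame: names belong to the levels, while labels are the values that sit in the index.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([["a", "b"], [1, 2]],
                                names=["key1", "key2"])
small = pd.DataFrame(np.arange(4).reshape((2, 2)), index=idx, columns=["x", "y"])

level_names = list(small.index.names)                     # names of the levels
labels = list(small.index)                                # the labels themselves, as tuples
key1_labels = list(small.index.get_level_values("key1"))  # labels of one level only

print(level_names, labels, key1_labels)
```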

In [317]:
f.index.names

FrozenList([None, None])

In [318]:
f.columns.names

FrozenList([None, None])

In [319]:
f.index.names = ['key1','key2']
f.columns.names = ['state','color']
f

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


As shown with the Series earlier, you can use partial indexing to select subsets of rows. You can also use partial indexing with columns.

In [321]:
f.ix['b']

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,7,8
2,9,10,11


In [320]:
f['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


You can create a MultiIndex by itself and then reuse it when creating multiple DataFrames.

In [323]:
pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],
                        ['Green','Red','Green']],
                      names=['state','color'])

MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=['state', 'color'])
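
A sketch of the reuse: build the MultiIndex once, keep it in a variable (cols is my name for it), and hand it to more than one DataFrame.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                                  ["Green", "Red", "Green"]],
                                 names=["state", "color"])

first = pd.DataFrame(np.zeros((2, 3)), columns=cols)
second = pd.DataFrame(np.ones((2, 3)), columns=cols)   # same column index, different data

print(second["Ohio"])
```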

##Reordering and sorting levels

The swaplevel method takes two level numbers or names and returns a new object with the levels exchanged, but the data otherwise unaltered. This makes sense: the prettified display shows each outer-level value only once, but that's just shorthand for the outer value applying to every inner item beneath it. In other words, a hierarchical index is really just several parallel columns of labels. However, Pandas cares about which level is first/outer and which is second/inner, etc., for at least some operations, so sometimes you want to swap the levels.

In [324]:
f

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [325]:
f.swaplevel('key1','key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


You can also sort by a single level; levels are numbered starting from 0 (the outermost).

In [327]:
f.sortlevel(1) # sort by the inner level here

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


The swaplevel method doesn't re-sort the rows, so in the swaplevel output above key2 doesn't collapse, since the two key2 values of 1 aren't next to each other. To get them grouped together, it's common to swap AND sort at the same time.

In [328]:
f.swaplevel(0, 1).sortlevel(0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


FWIW, 'data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level. That is, as a result of calling sortlevel(0) or sort_index().'
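
As far as I know, sortlevel is no longer available in recent Pandas releases and sort_index(level=...) plays the same role. A sketch of swap-then-sort with that spelling, on a simplified frame (plain column names x, y, z are my own), including a check of whether the index ends up sorted:

```python
import numpy as np
import pandas as pd

f = pd.DataFrame(np.arange(12).reshape((4, 3)),
                 index=[list("aabb"), [1, 2, 1, 2]],
                 columns=["x", "y", "z"])
f.index.names = ["key1", "key2"]

swapped = f.swaplevel("key1", "key2")    # levels exchanged, rows still in the old order
ordered = swapped.sort_index(level=0)    # sorted by key2, then by key1

print(swapped.index.is_monotonic_increasing)   # False: rows not yet in sorted order
print(ordered.index.is_monotonic_increasing)   # True after sorting
```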

Many descriptive and summary stats on DataFrame and Series have a level option that enables you to specify the level on which you want to apply the statistic, on a particular axis. Under the hood this functionality uses groupby, as discussed later.

In [329]:
f

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [331]:
# groups by key2 and then calls sum
# i think the resulting DataFrame here has a non-hierarchical index
# that includes just the index values that are part of 'key2'
f.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [332]:
# group the columns by their 'color' level, and then apply sum
f.sum(level='color', axis=1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
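
Since the level option is described as groupby under the hood, here is a sketch of the two sums above written with groupby directly. My understanding is that recent Pandas releases no longer accept level= on sum, so this is also the spelling that should work there. For the column-wise case I transpose, group the rows, and transpose back, which is just one way to do it.

```python
import numpy as np
import pandas as pd

f = pd.DataFrame(np.arange(12).reshape((4, 3)),
                 index=[list("aabb"), [1, 2, 1, 2]],
                 columns=[["Ohio", "Ohio", "Colorado"],
                          ["Green", "Red", "Green"]])
f.index.names = ["key1", "key2"]
f.columns.names = ["state", "color"]

by_key2 = f.groupby(level="key2").sum()         # rows grouped by the key2 level
by_color = f.T.groupby(level="color").sum().T   # columns grouped by the color level

print(by_key2)
print(by_color)
```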


##Using columns as row indices, or vice-versa

It's common to want to use one or more columns in an existing DataFrame as row indices, or to move a row index into columns.

In [333]:
f = DataFrame({'a': range(7), 'b': range(7, 0, -1),
               'c': ['one','one','one','two','two','two','two'],
               'd': [0,1,2,0,1,2,3]})
f

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


Use DataFrame.set_index to create a new DataFrame using one or more of the existing columns as the index. (Looks like the data stays the same, of course, and the columns that _aren't_ moved to be row indices stay the same, and the previous row indices are done away with.)

In [335]:
f2 = f.set_index(['c','d'])
f2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


You can also leave the columns in.

In [336]:
f.set_index(['c', 'd'], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


And, to go the other way, use reset_index - take row index values and move them into columns.

In [337]:
f2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [338]:
f2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
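
A sketch of the round trip on a smaller made-up frame: set_index moves columns into the row index, a tuple key then looks up a row, and reset_index moves the levels back out as columns (in a different column order).

```python
import pandas as pd

f = pd.DataFrame({"a": range(4),
                  "c": ["one", "one", "two", "two"],
                  "d": [0, 1, 0, 1]})

indexed = f.set_index(["c", "d"])   # c and d become a two-level row index
row = indexed.loc[("two", 1)]       # look up a row by its hierarchical key
restored = indexed.reset_index()    # c and d move back out into columns

print(restored)
```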


In [348]:
f = DataFrame(np.arange(12).reshape((4,3)),
              index=[list('aabb'),[1,2,1,2]],
              columns=[['Ohio','Ohio','Colorado'],
                       ['Green','Red','Green']])
f.index.names = ['key1','key2']
f.columns.names = ['state','color']
f

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [349]:
fr = f.reset_index()
fr

state,key1,key2,Ohio,Ohio,Colorado
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Green,Red,Green
0,a,1,0,1,2
1,a,2,3,4,5
2,b,1,6,7,8
3,b,2,9,10,11


In [350]:
fr.columns

MultiIndex(levels=[['Colorado', 'Ohio', 'key2', 'key1'], ['Green', 'Red', '']],
           labels=[[3, 2, 1, 1, 0], [2, 2, 0, 1, 0]],
           names=['state', 'color'])

#Other stuff

Pandas objects indexed by integers don't always work the same way as built-in Python objects.

In [351]:
s = Series(np.arange(3.))
s

0    0
1    1
2    2
dtype: float64

In [353]:
# this raises an error because Pandas can't infer whether the user
# wants label-based or position-based indexing
s[-1]

KeyError: -1

In [354]:
# in contrast, there's no potential for ambiguity with non-integer
# index values, like here, so -1 works fine

In [355]:
s2 = Series(np.arange(3.), index=list('abc'))
s2

a    0
b    1
c    2
dtype: float64

In [356]:
s2[-1]

2.0

There's more on page 152 about integer-based indexing and using iget_value, irow, and icol.
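
In later Pandas releases these helpers were, as far as I know, replaced by the iloc accessor (position-based), with loc as its label-based counterpart. A sketch of how that removes the ambiguity:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(3.0))                      # integer labels 0, 1, 2
s2 = pd.Series(np.arange(3.0), index=list("abc"))  # string labels

last_by_position = s.iloc[-1]   # position-based: always the last element
first_by_label = s.loc[0]       # label-based: the element whose label is 0
last_of_s2 = s2.iloc[-1]        # works the same way whatever the index holds

print(last_by_position, first_by_label, last_of_s2)
```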

There's also coverage of Panel, which is described as a 3D analogue of the DataFrame; the book doesn't cover it much, because a DataFrame with a MultiIndex is generally sufficient.