In [14]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [1]:
import numpy as np
from pandas import DataFrame, Series
import pandas as pd

#Series

A Series is like a one-dimensional array with an associated array of labels, called its index.

In [2]:
o = Series([4,7,-5,3])
o

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
o.values

array([ 4,  7, -5,  3])

In [5]:
o.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [6]:
o2 = Series([4,7,-5,3], index=['d','b','a','c'])
o2

d    4
b    7
a   -5
c    3
dtype: int64

In [7]:
o2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [8]:
o2[0]

4

In [9]:
o2['d']

4

In [10]:
o2[['c','a','b']]

c    3
a   -5
b    7
dtype: int64

NumPy array operations, like filtering with a boolean array, scalar multiplication, and applying math functions, preserve the link between the index and the resulting value.

In [11]:
o2[o2 > 0]

d    4
b    7
c    3
dtype: int64

In [12]:
o2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [15]:
np.exp(o2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series can also be thought of as a fixed-length, ordered dict, since it maps index values to data values. You can pass a Series to many functions that expect a dict.

In [16]:
'b' in o2

True

In [17]:
'e' in o2

False

In [18]:
data_in_dict = {'Ohio': 35000, 'Texas': 71000, 
                'Oregon': 16000, 'Utah': 5000}
o3 = Series(data_in_dict)
o3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

Here we pass both a dict and a list of index values. The constructor takes from the dict only the values whose keys appear in the index. Since 'California' isn't a key in the dict, there is no data for it: the result still has a 'California' entry, but its value is NaN. 'Utah' is in the dict but not in the index, so it is left out.

In [19]:
states = ['California','Ohio','Oregon','Texas']
o4 = Series(data_in_dict, index=states)
o4

California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

In [20]:
pd.isnull(o4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [21]:
pd.notnull(o4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [22]:
o4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Series 'automatically aligns differently-indexed data in arithmetic operations'. That is, pandas uses the index values to decide which data values to combine. Here, two Series share some index values. When we add them, data values with the same index value are added together, and an index value that appears in only one of them (or whose value is NaN) yields NaN in the result.

In [23]:
o3 + o4

California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64

You can name the Series instance itself, and also name the index.

In [25]:
o4.name = 'population'
o4.index.name = 'state'
o4

state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64

To change the index of a Series, assign a new sequence of labels to its index attribute; this modifies the Series in place.

In [26]:
o

0    4
1    7
2   -5
3    3
dtype: int64

In [27]:
o.index = ['Bob','Steve','Jeff','Ryan']
o

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

#DataFrames

Some high-level notes on DataFrames:
- Think of it as a dict of Series instances. Each Series instance shares the same index, which is the index of the DataFrame.
- Unlike similar structures such as R's data.frame, a DataFrame treats row- and column-oriented operations roughly symmetrically.
- Under the hood, the data is stored as one or more 2D blocks rather than as a list, dict, or other collection of 1D arrays; this rarely matters in everyday use.
- Even though the data is stored in 2D, hierarchical indexing makes it possible to represent higher-dimensional data.
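
The last bullet can be sketched briefly. This is only an illustration, assuming a few of the population figures used in these notes; the names `idx`, `pop_by_state_year`, `ohio`, and `wide` are made up here. A two-level MultiIndex lets a one-dimensional Series carry data keyed by state and year, and `unstack` pivots the inner level into columns.

```python
import pandas as pd

# Hypothetical two-level index: (state, year) pairs, kept in sorted order.
idx = pd.MultiIndex.from_tuples(
    [('Nevada', 2001), ('Ohio', 2000), ('Ohio', 2001)],
    names=['state', 'year'])
pop_by_state_year = pd.Series([2.4, 1.5, 1.7], index=idx)

# Selecting on the outer level returns a Series indexed by year.
ohio = pop_by_state_year.loc['Ohio']

# unstack() moves the inner level (year) into the columns, giving a
# state-by-year DataFrame with NaN where no data exists.
wide = pop_by_state_year.unstack()
```

Here `wide` has one row per state and one column per year, with NaN for Nevada in 2000.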

In [29]:
data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


In [30]:
DataFrame(data, columns=['year','state','pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


Just like with Series, a column that has no corresponding data (here, 'debt') is filled with NaN.

In [31]:
frame2 = DataFrame(data, columns=['year','state','pop','debt'],
                   index=['one','two','three','four','five'])
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN


In [32]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [33]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

As shown above, the returned Series:
- is named after the column
- has the same index as the DataFrame

Accessing columns is intuitive: both bracket and dot notation work. Because that syntax is reserved for columns, it can't also be used to retrieve rows. To access a row, use the .ix indexer with the index value of the row you care about (or use one of a few other approaches that are discussed later).

In [35]:
frame2.ix['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
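
The `.ix` indexer used above was deprecated in pandas 0.20 and removed in pandas 1.0. As a rough sketch of the same row lookup on newer versions, `.loc` selects by label and `.iloc` by position. The frame is rebuilt here so the example is self-contained; `frame2_modern`, `row_by_label`, and `row_by_position` are names introduced only for this illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2_modern = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                             index=['one', 'two', 'three', 'four', 'five'])

row_by_label = frame2_modern.loc['three']   # label-based lookup
row_by_position = frame2_modern.iloc[2]     # position-based lookup (third row)
```

Both lookups return the same row as a Series named 'three'.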

Accessing a column or a row gives you a Series instance. 

In [37]:
type(frame2['state'])

pandas.core.series.Series

In [36]:
type(frame2.ix['three'])

pandas.core.series.Series

There are several ways to modify a column. For example, you can assign a single value (which is broadcast to every row) or an array of values whose length matches the DataFrame.

In [38]:
frame2.debt = 16.5
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5


In [41]:
np.arange(5.)

array([ 0.,  1.,  2.,  3.,  4.])

In [43]:
frame2.debt = np.arange(5.)
frame2

       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4


You can also assign a Series instance to a column. In this case the length of the Series doesn't need to match the length of the DataFrame (like it does if you just assign a bare array). Instead, the Series data will be used and matched up according to the index of the Series and the DataFrame: matching index values are used and any DataFrame index values that don't have a matching index value in the Series will result in NaN (index values in the Series that aren't in the DataFrame are ignored).

In [44]:
val = Series([-1.2, -1.5, -1.7], index=['two','four','five'])
frame2.debt = val
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
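
The cell above shows unmatched DataFrame rows becoming NaN, but not the other direction. A small sketch of that case, using a made-up three-row frame and a hypothetical label 'six' that exists only in the Series:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# 'two' matches a row; 'six' does not exist in the frame and is ignored.
val = pd.Series([-1.2, -9.9], index=['two', 'six'])
frame['debt'] = val
```

After the assignment the frame still has three rows: 'two' holds -1.2, the other two rows hold NaN, and no 'six' row is created.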


To create a new column, assign to a column that doesn't exist. You have to use the bracket syntax - you can't use the dot syntax to create a new column.

In [46]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False


To delete a column, use 'del'. Again, dot syntax doesn't work - use bracket syntax.

In [48]:
del frame2['eastern']
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7


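In this version of pandas, the column returned by indexing a DataFrame is a view on the underlying data, not a copy: modifying the Series also modifies the DataFrame, although pandas emits the SettingWithCopyWarning shown in the output below.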

In [51]:
s = frame2.debt
s['four'] = -1.5555
frame2

A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from IPython.kernel.zmq import kernelapp as app


       year   state  pop     debt
one    2000    Ohio  1.5      NaN
two    2001    Ohio  1.7     -1.2
three  2002    Ohio  3.6      NaN
four   2001  Nevada  2.4  -1.5555
five   2002  Nevada  2.9     -1.7
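
Whether a selected column shares memory with its DataFrame has varied between pandas versions, so it may be safer not to rely on it. A minimal sketch of one way to be explicit, assuming a small made-up frame: take a `.copy()` when you want an independent Series, and assign through the frame with `.loc` when you want to change the frame.

```python
import pandas as pd

frame = pd.DataFrame({'debt': [0.0, 1.0, 2.0]},
                     index=['one', 'two', 'three'])

# An explicit copy is independent of the frame.
s = frame['debt'].copy()
s['two'] = -1.5555

# To change the frame itself, assign through the frame with .loc.
frame.loc['three', 'debt'] = 99.0
```

The edit to `s` leaves `frame` untouched, and the `.loc` assignment changes only `frame`.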


Instead of a dict of equal-length arrays, you can also pass a nested dict, i.e., a dict of dicts. The keys of the outer dict are interpreted as columns, and the keys of the inner dicts as row index values.

In [56]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


DataFrames can be transposed, just like NumPy arrays.

In [57]:
frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


When no explicit index is provided, the keys of the inner dicts are unioned and sorted to form the index. Or, you can provide an explicit index.

In [58]:
DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


You can also use a dict with Series values.

In [60]:
frame3['Ohio']

2000    1.5
2001    1.7
2002    3.6
Name: Ohio, dtype: float64

In [59]:
frame3['Ohio'][:-1]

2000    1.5
2001    1.7
Name: Ohio, dtype: float64

In [64]:
frame3['Nevada']

2000    NaN
2001    2.4
2002    2.9
Name: Nevada, dtype: float64

In [65]:
frame3['Nevada'][:2]

2000    NaN
2001    2.4
Name: Nevada, dtype: float64

In [62]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pdata

{'Nevada': 2000    NaN
 2001    2.4
 Name: Nevada, dtype: float64, 'Ohio': 2000    1.5
 2001    1.7
 Name: Ohio, dtype: float64}

In [63]:
DataFrame(pdata)

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7


Don't forget that a DataFrame's index and columns can each have a name.

In [66]:
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


In [67]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


Just like with Series, the values attribute returns the data, here as a 2D NumPy ndarray.

In [68]:
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

##Index objects

In [69]:
o = Series(range(3), index=['a','b','c'])
i = o.index
i

Index(['a', 'b', 'c'], dtype='object')

In [70]:
i[1:]

Index(['b', 'c'], dtype='object')

In [71]:
i[1] = 'd' # they're immutable

TypeError: Indexes does not support mutable operations

As above, indices are immutable - this makes it possible to safely share them between different data structures.

In [72]:
i = pd.Index(np.arange(3))
o2 = Series([1.5, -2.5, 0], index=i)
o2.index is i

True

As page 121 shows, pandas provides several Index classes, not just the most general 'Index'. Each more specific class is specialized for a particular kind of index value. For example, there are Int64Index (integer values), MultiIndex (hierarchical indexing, i.e., multiple levels on a single axis; it behaves like an array of tuples, with one tuple element per level), DatetimeIndex (nanosecond timestamps), and PeriodIndex (period data, i.e., timespans).
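
To make the specialized classes a little more concrete, here is a small sketch that builds three of them. The labels and dates are made up, and `multi`, `dates`, and `periods` are names introduced only for this illustration.

```python
import pandas as pd

# Hierarchical index: each label is a tuple, one element per level.
multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])

# Timestamps at daily frequency.
dates = pd.date_range('2000-01-01', periods=3, freq='D')

# Periods (timespans) at monthly frequency.
periods = pd.period_range('2000-01', periods=3, freq='M')
```

Indexing `multi[0]` returns the tuple `('a', 1)`, which is the sense in which a MultiIndex is "like an array of tuples".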

Both the rows and columns have indices.

In [73]:
type(frame3.columns)

pandas.core.index.Index

In [75]:
type(frame3.index)  # rows

pandas.core.index.Int64Index

##"Essential" functionality, as per page 122 and on

"Reindexing" is "critical" and means to create a new object with the data _conformed_ to a new index.

In [76]:
o = Series([4.5, 7.2, -5.3, 3.6], index=['d','b','a','c'])
o

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex with a new index rearranges the data according to the new index, and inserts missing values if any index values aren't already present.

In [79]:
o2 = o.reindex(['a','b','c','d','e'])
o2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [80]:
o.reindex(['a','b','c','d','e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

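reindex can also fill in missing values as it goes, which is helpful for ordered data like time series. Here method='ffill' forward-fills: each new label takes the value of the closest preceding label.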

In [82]:
o3 = Series(['blue','purple','yellow'], index=[0, 2, 4])
o3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
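
For comparison, a short sketch of forward fill next to its counterpart, backward fill (`method='bfill'`), on the same data. The names `colors`, `forward`, and `backward` are introduced here for illustration.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: carry the last seen value forward (matches the cell above).
forward = colors.reindex(range(6), method='ffill')

# Backward fill: pull the next value back; label 5 has no later value,
# so it stays missing.
backward = colors.reindex(range(6), method='bfill')
```

With backward fill, label 1 takes 'purple' and label 3 takes 'yellow', while label 5 is NaN because nothing follows it.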

With DataFrame instances, reindex can alter either the (row) index, columns, or both. 

In [83]:
d = np.arange(9).reshape(3,3)
d

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [86]:
f = DataFrame(d, 
              index=['a','c','d'],
              columns=['Ohio','Texas','California'])
f

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [88]:
f2 = f.reindex(['a','b','c','d'])  # reindex rows/the 'index'
f2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [89]:
states = ['Texas','Utah','California']
f.reindex(columns=states) # reindex columns

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


You can reindex both the rows (index) and the columns in one call.

In [90]:
f.reindex(index=['a','b','c','d'],
          columns=states)

   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0


"Label-indexing with ix" is a more succinct way to reindex (ix is covered in more detail below).

In [92]:
f.ix[['a','b','c','d'], states]

   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0


To remove entries from an axis, use the drop method. It returns a new object and leaves the original unchanged.

In [94]:
o = Series(np.arange(5.), index=['a','b','c','d','e'])
o

a    0
b    1
c    2
d    3
e    4
dtype: float64

In [95]:
new_o = o.drop('c')
new_o

a    0
b    1
d    3
e    4
dtype: float64

In [96]:
o.drop(['d','c'])

a    0
b    1
e    4
dtype: float64

In [97]:
d = DataFrame(np.arange(16).reshape((4,4)),
              index=['Ohio','Colorado','Utah','New York'],
              columns=['one','two','three','four'])
d

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [98]:
d.drop(['Colorado','Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [100]:
d.drop('two', axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [101]:
d.drop(['two','four'], axis=1)

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
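
A compact sketch of two points about drop that the cells above rely on: it returns a new object rather than changing the original, and columns can also be dropped with the `columns=` keyword instead of `axis=1`. The names `no_ohio` and `fewer_cols` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])

# drop returns a new object; d itself keeps all rows and columns.
no_ohio = d.drop('Ohio')

# Keyword form, equivalent to d.drop(['two', 'four'], axis=1).
fewer_cols = d.drop(columns=['two', 'four'])
```

After both calls `d` is still a 4x4 frame.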


##Indexing, selection, filtering

Series indexing works like NumPy array indexing, and you can also use the Series's index values instead of just integers.

In [102]:
o = Series(np.arange(4.), index=['a','b','c','d'])
o

a    0
b    1
c    2
d    3
dtype: float64

In [103]:
o['b']

1.0

In [104]:
o[1]

1.0

In [105]:
o[2:4]

c    2
d    3
dtype: float64

In [106]:
o[['b','a','d']]

b    1
a    0
d    3
dtype: float64

In [107]:
o[[1, 3]]

b    1
d    3
dtype: float64

Boolean indexing - a test is applied to each entry in the series, and a boolean is returned for each entry. (All of the booleans together form yet another Series instance.)

In [111]:
o < 2

a     True
b     True
c    False
d    False
dtype: bool

And you can use this, or any other boolean sequence, to index into the Series and select out values.

In [112]:
o[o < 2]

a    0
b    1
dtype: float64

In contrast to integer slicing, when you slice with labels the endpoint is included.

In [114]:
o[1:3] # integer slice: the endpoint is excluded, so this selects positions 1 and 2

b    1
c    2
dtype: float64

In [115]:
o['b':'c'] # endpoint is inclusive; only specify two values to get two

b    1
c    2
dtype: float64

In [116]:
o['b':'c'] = 5
o

a    0
b    5
c    5
d    3
dtype: float64
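
The difference between the two kinds of slice can be made explicit with `.iloc` (always positional) and `.loc` (always label-based). This is a small sketch on the original, unmodified values; `by_position` and `by_label` are names introduced here.

```python
import numpy as np
import pandas as pd

o = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = o.iloc[1:3]   # positions 1 and 2; the stop position 3 is excluded
by_label = o.loc['b':'c']   # labels 'b' through 'c'; the stop label is included
```

Both slices select the same two entries, 'b' and 'c'.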

In [118]:
d

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [120]:
d['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [121]:
d[['three','one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


DataFrame indexing has some special cases: although brackets normally select columns, a slice or a boolean array inside the brackets selects rows.

In [123]:
d[:2] # first two rows

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [125]:
d[d['three'] > 5] # rows where the column named 'three' has values > 5 (special because brackets normally select columns)

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [126]:
d < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [127]:
d[d < 5] = 0

In [128]:
d

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
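
The assignment above changes `d` in place. If you would rather keep the original frame, one option is the `mask` method, which returns a new DataFrame. A minimal sketch, with `zeroed` as a name introduced here:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])

# mask replaces values where the condition is True and returns a new frame;
# d itself is left unchanged.
zeroed = d.mask(d < 5, 0)
```

`zeroed` matches the frame printed above, while `d` keeps its original values.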


The ix indexer (an attribute used with square brackets, not a method) selects a subset of rows and columns from a DataFrame using NumPy-like notation plus axis labels. As shown earlier, this is also a less verbose way to reindex.

In [130]:
d.ix['Colorado']

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [129]:
d.ix['Colorado', ['two','three']]

two      5
three    6
Name: Colorado, dtype: int64

In [131]:
d.ix[['Colorado','Utah'], [3, 0, 1]]

          four  one  two
Colorado     7    0    5
Utah        11    8    9


In [132]:
d.ix[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

All rows up to and including the row labeled 'Utah'. The endpoint is included because label-based slicing is inclusive, unlike integer-based slicing.

In [133]:
d.ix[:'Utah']

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11


In [134]:
d.ix[d.three > 5]

          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [135]:
d.ix[d.three > 5, :3]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
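
On pandas 1.0 and later, where `.ix` no longer exists, the selections above can be written with `.loc` (labels) and `.iloc` (positions). The following is a sketch of one possible translation, not the book's code; the frame is rebuilt with the same zeroing step so the values match, and the result names are introduced here.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
d[d < 5] = 0   # same in-place step as earlier in these notes

row_cols = d.loc['Colorado', ['two', 'three']]   # one row, two columns, by label
third_row = d.iloc[2]                            # third row, by position
up_to_utah = d.loc[:'Utah']                      # label slice, 'Utah' included

# Mixing labels and positions: translate the positions to labels first.
mixed = d.loc[['Colorado', 'Utah'], d.columns[[3, 0, 1]]]

# Boolean row filter combined with the first three columns.
filtered = d.loc[d['three'] > 5, d.columns[:3]]
```

The main design difference is that `.loc` and `.iloc` never guess whether an integer is a label or a position, which is the ambiguity `.ix` had.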


Page 128 has a table that summarizes the different ways to index into a DataFrame.

Interestingly, page 128 also has a note saying that "having to type frame[:, col] to select a column was too verbose (and error-prone), since column selection is one of the most common operations. Thus I made the design trade-off to push all of the rich label-indexing into ix."

My reading: with a two-dimensional NumPy array you select a column with a[:, j], and the author didn't want DataFrame users to have to type something that long for such a common operation. So plain frame[col] selects a column, and the richer label-based indexing lives in ix.
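
A side-by-side sketch of the two spellings, assuming a small made-up array and column names chosen here for illustration:

```python
import numpy as np
import pandas as pd

a = np.arange(12).reshape((3, 4))
second_col_array = a[:, 1]         # NumPy: all rows, column position 1

frame = pd.DataFrame(a, columns=['one', 'two', 'three', 'four'])
second_col_frame = frame['two']    # DataFrame: just name the column
```

Both hold the values 1, 5, 9; the DataFrame version also carries the row index and the column name.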

##Arithmetic and data alignment

The author says that one of the most important pandas features is the 'behavior of arithmetic between objects with different indexes'.

When you add objects whose indexes don't match exactly, the index of the result is the union of the two indexes (the same holds for subtraction, etc.). Index values that appear in only one of the objects get NaN in the result.

In [137]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [138]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [139]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

When you add DataFrames, alignment is performed on both the rows and the columns.

In [141]:
df1 = DataFrame(np.arange(9.).reshape((3,3)),
                columns=list('bcd'),
                index=['Ohio','Texas','Colorado'])
df1

          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8


In [142]:
df2 = DataFrame(np.arange(12.).reshape((4,3)),
                columns=list('bde'),
                index=['Utah','Ohio','Texas','Oregon'])
df2

        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11


Adding the two frames gives us a DataFrame whose indexes are the union of the indexes - this applies to both the row and column indexes - of the underlying DataFrames.

NaN values propagate, so the only cells that keep actual numbers are those cells whose row and column index values exist in both of the original DataFrame instances.

In [143]:
df1 + df2

            b    c     d    e
Colorado  NaN  NaN   NaN  NaN
Ohio      3.0  NaN   6.0  NaN
Oregon    NaN  NaN   NaN  NaN
Texas     9.0  NaN  12.0  NaN
Utah      NaN  NaN   NaN  NaN


You can use the fill_value param to get something besides a NaN in locations that don't overlap.

Cells where there's nothing in either source are still NaN.

In [144]:
df1.add(df2, fill_value=0)

          b    c   d     e
Colorado  6  7.0   8   NaN
Ohio      3  1.0   6   5.0
Oregon    9  NaN  10  11.0
Texas     9  4.0  12   8.0
Utah      0  NaN   1   2.0
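
The same fill_value idea works for Series through the `add` method. A minimal sketch that rebuilds the s1 and s2 Series from earlier in this section; `total` is a name introduced here.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Labels present in only one Series are treated as 0 in the other,
# so 'd', 'f', and 'g' keep their single value instead of becoming NaN.
total = s1.add(s2, fill_value=0)
```

Compared with the plain `s1 + s2` result, the index is the same union of labels, but no entry is NaN.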


You can also do arithmetic with DataFrame and Series objects. This works like it does with NumPy arrays.

First, here's how things work with just NumPy arrays.  

In [145]:
a = np.arange(12.).reshape((3,4))
a

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [146]:
a[0]

array([ 0.,  1.,  2.,  3.])

This is 'broadcasting', which is explained further in chapter 12. Here, the four values in a[0] are subtracted from each row of a in turn.

In [148]:
a - a[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])
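
Broadcasting can also run in the other direction. As a rough sketch: to subtract one value per row instead of one value per column, the operand needs a trailing axis of length 1. The names `minus_first_row`, `first_col`, and `minus_first_col` are introduced here.

```python
import numpy as np

a = np.arange(12.).reshape((3, 4))

# Shape (4,) against (3, 4): a[0] is stretched across all three rows.
minus_first_row = a - a[0]

# To subtract one value per row, the operand needs shape (3, 1).
first_col = a[:, [0]]              # shape (3, 1)
minus_first_col = a - first_col    # stretched across all four columns
```

Every row of `minus_first_col` is [0, 1, 2, 3], since each row has its own first element subtracted.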

DataFrame and Series also use broadcasting and work similarly.

In [149]:
f = DataFrame(np.arange(12.).reshape((4,3)),
              columns=list('bde'),
              index=['Utah','Ohio','Texas','Oregon'])
f

        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11


In [150]:
s = f.ix[0]
s

b    0
d    1
e    2
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns and broadcasts down the rows.

In [151]:
f - s

        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9


If an index value is not found in the DataFrame's columns or the Series's index, Pandas reindexes the objects to form the union.

In [152]:
s2 = Series(range(3), index=['b','e','f'])
f + s2

        b    d   e    f
Utah    0  NaN   3  NaN
Ohio    3  NaN   6  NaN
Texas   6  NaN   9  NaN
Oregon  9  NaN  12  NaN


To broadcast over the columns, matching on the rows, use one of the arithmetic methods instead of the operators.

In [153]:
s3 = f['d']
s3

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: float64

In [154]:
f

        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11


So we take the [1,4,7,10] values and subtract them from each column in turn, matching/determining what to subtract from using the row indexes. So for the first column we subtract 1 from 0, 4 from 3, etc. Then we subtract 1 from 1, 4 from 4, etc. And finally, in the last column, we subtract 1 from 2, 4 from 5, 7 from 8, and 10 from 11.

In [155]:
f.sub(s3, axis=0)

         b  d  e
Utah    -1  0  1
Ohio    -1  0  1
Texas   -1  0  1
Oregon  -1  0  1
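
A common use of this row-matching form is centering each row on its own mean. The sketch below is an illustration rather than something from the book; `row_means` and `centered` are names introduced here.

```python
import numpy as np
import pandas as pd

f = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                 columns=list('bde'),
                 index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = f.mean(axis=1)             # one mean per row, indexed by state
centered = f.sub(row_means, axis=0)    # match on rows, broadcast across columns
```

For this particular frame the row means happen to equal column 'd', so `centered` matches the output above.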


##Function application and mapping, page 132