Author: Geoff Boeing <br />
Web: http://geoffboeing.com <br /> 
Date: 2015-10-07 <br /> 

# map(), apply(), and applymap() in pandas

These methods are useful for mapping/applying a function across elements, rows, and columns of a pandas DataFrame or Series. But they have some important and often confusing differences.

1. map() applies a function element-wise on a Series
2. apply() works on a row or column basis on a DataFrame (specify the axis!), or element-wise on a Series
3. applymap() works element-wise on an entire DataFrame

Let's see what that means in practice with some examples.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# create a new DataFrame with fake year data
df = pd.DataFrame({'start_year':[2001, 2002, 2005, 2005, 2006], 
                   'end_year':[2002, 2010, 2008, 2006, 2014]})
df

|   | end_year | start_year |
|---|---|---|
| 0 | 2002 | 2001 |
| 1 | 2010 | 2002 |
| 2 | 2008 | 2005 |
| 3 | 2006 | 2005 |
| 4 | 2014 | 2006 |


## you can iterate through a DataFrame using the .iterrows() method

In [3]:
# create a new series by adding 4 to each value in the end_year column
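# collect the values in a list, then build the Series once (growing a Series row by row is slow)
values = []
for _, row in df.iterrows():
    values.append(row['end_year'] + 4)
years = pd.Series(values, name='end_year')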
years

0    2006
1    2014
2    2012
3    2010
4    2018
Name: end_year, dtype: int64

## alternatively, .map() applies a function element-wise on a Series

In [4]:
# for example: series have a .astype() method, but you can map a type converter as well
df['end_year'].map(str)

0    2002
1    2010
2    2008
3    2006
4    2014
Name: end_year, dtype: object
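
As a quick check of the comment above, the sketch below compares .map(str) with .astype(str) on the same column. It rebuilds the example DataFrame so it runs on its own; the names `mapped` and `cast` are only illustrative.

```python
import pandas as pd

# same fake year data as in the example above
df = pd.DataFrame({'start_year': [2001, 2002, 2005, 2005, 2006],
                   'end_year': [2002, 2010, 2008, 2006, 2014]})

mapped = df['end_year'].map(str)    # calls str() on each element
cast = df['end_year'].astype(str)   # built-in type conversion on the whole Series

# compare the resulting values element by element
print(mapped.tolist() == cast.tolist())
```

For a plain type conversion, .astype() is probably the more conventional choice; .map() earns its keep when the per-element function is something pandas does not already offer.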

In [5]:
# here, .map() does the same thing we saw with iterrows
df['end_year'].map(lambda x: x + 4)

0    2006
1    2014
2    2012
3    2010
4    2018
Name: end_year, dtype: int64

Which technique is faster? Use %timeit to find out.

In [6]:
%timeit for _, row in df.iterrows(): row['end_year'] + 4

10000 loops, best of 3: 144 µs per loop


In [7]:
%timeit df['end_year'].map(lambda x: x + 4)

10000 loops, best of 3: 43.6 µs per loop


Mapping a function to the series is much more efficient than iterating through the rows: in the timings above, .map() takes roughly a third as long as the .iterrows() loop.
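
Outside of a notebook, where %timeit is not available, a similar comparison can be sketched with the standard library's timeit module. This is only a rough sketch: the helper names below are made up for the example, the plain arithmetic variant (`df['end_year'] + 4`) is an extra technique added here for comparison and is not part of the timings above, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# same fake year data as in the example above
df = pd.DataFrame({'start_year': [2001, 2002, 2005, 2005, 2006],
                   'end_year': [2002, 2010, 2008, 2006, 2014]})

def with_iterrows():
    # loop over the rows one at a time
    return [row['end_year'] + 4 for _, row in df.iterrows()]

def with_map():
    # call the lambda once per element of the Series
    return df['end_year'].map(lambda x: x + 4)

def with_arithmetic():
    # plain Series arithmetic, no Python-level function call per element
    return df['end_year'] + 4

runs = 200
for fn in (with_iterrows, with_map, with_arithmetic):
    total_seconds = timeit.timeit(fn, number=runs)
    print('{}: {:.1f} microseconds per call'.format(fn.__name__, total_seconds / runs * 1e6))
```

All three helpers compute the same values, so any difference in the printed timings comes from how the work is dispatched rather than from what is computed.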

In [8]:
# you can create a new column to contain the results of the function mapping
df['next_cycle'] = df['end_year'].map(lambda x: x + 4)
df

|   | end_year | start_year | next_cycle |
|---|---|---|---|
| 0 | 2002 | 2001 | 2006 |
| 1 | 2010 | 2002 | 2014 |
| 2 | 2008 | 2005 | 2012 |
| 3 | 2006 | 2005 | 2010 |
| 4 | 2014 | 2006 | 2018 |


In [9]:
# for each row, determine if end_year is after 2008
df['ended_after_2008'] = df['end_year'].map(lambda x: x > 2008)
df

|   | end_year | start_year | next_cycle | ended_after_2008 |
|---|---|---|---|---|
| 0 | 2002 | 2001 | 2006 | False |
| 1 | 2010 | 2002 | 2014 | True |
| 2 | 2008 | 2005 | 2012 | False |
| 3 | 2006 | 2005 | 2010 | False |
| 4 | 2014 | 2006 | 2018 | True |


In [10]:
# return the df to its original state
df.drop(labels=['next_cycle', 'ended_after_2008'], axis=1, inplace=True)
df

|   | end_year | start_year |
|---|---|---|
| 0 | 2002 | 2001 |
| 1 | 2010 | 2002 |
| 2 | 2008 | 2005 |
| 3 | 2006 | 2005 |
| 4 | 2014 | 2006 |


## .apply() works on a row or column basis on an entire DataFrame (specify the axis), or element-wise on a Series

In [11]:
# dataframes have built-in methods to do common stuff like calculating the mean value
df.mean()

end_year      2008.0
start_year    2003.8
dtype: float64

In [12]:
# and you can do the exact same thing by applying a function that calculates the mean
df.apply(np.mean)

end_year      2008.0
start_year    2003.8
dtype: float64
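
To make the difference between the two flavors of .apply() concrete, here is a small sketch. On a DataFrame, the function receives a whole column at a time; on a Series, it receives one element at a time. The names `arg_types` and `per_element` are just for illustration.

```python
import pandas as pd

# same fake year data as in the example above
df = pd.DataFrame({'start_year': [2001, 2002, 2005, 2005, 2006],
                   'end_year': [2002, 2010, 2008, 2006, 2014]})

# DataFrame.apply hands the function one whole column (a Series) at a time,
# so this records the type name of what the function was given
arg_types = df.apply(lambda col: type(col).__name__)
print(arg_types)

# Series.apply calls the function once per element, so each x is a single value
per_element = df['end_year'].apply(lambda x: x + 4)
print(per_element)
```

This is why a reducing function such as np.mean yields one value per column when applied to a DataFrame.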

With all of these techniques, you can apply either a regular named function or a lambda function.

In [13]:
# here, .apply() applies a function to calculate the difference between the min and max values in each column
def diff(x):
    difference = x.max() - x.min()
    return difference

df.apply(diff, axis=0)

end_year      12
start_year     5
dtype: int64

In [14]:
# exact same thing, only using a lambda function instead of a regularly named one
diff = lambda x: x.max() - x.min()
df.apply(diff, axis=0)

end_year      12
start_year     5
dtype: int64

In [15]:
# same thing again, using a lambda function inline as an argument... you commonly see this in pandas
df.apply(lambda x: x.max() - x.min(), axis=0)

end_year      12
start_year     5
dtype: int64

In [16]:
# here .apply() finds the difference between the min and max values in each row
df.apply(diff, axis=1)

0    1
1    8
2    3
3    1
4    8
dtype: int64

In the result, the first column is just the index; the second column is the difference between the max and min values in each row.
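
Because axis=1 passes each row to the function as a Series indexed by column name, the function can also pick out individual columns by label. A minimal sketch, assuming we want the number of years between start_year and end_year (the name `duration` is made up for this example):

```python
import pandas as pd

# same fake year data as in the example above
df = pd.DataFrame({'start_year': [2001, 2002, 2005, 2005, 2006],
                   'end_year': [2002, 2010, 2008, 2006, 2014]})

# with axis=1, each `row` is a Series whose index holds the column names
duration = df.apply(lambda row: row['end_year'] - row['start_year'], axis=1)
print(duration)
```

With this data the result happens to match the max-minus-min example, since end_year is never smaller than start_year.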

## .applymap() works element-wise on an entire DataFrame

In [17]:
# convert each element in the entire dataframe to a string with 2 decimal places
formatter = lambda x: '{:.2f}'.format(x)
df.applymap(formatter)

|   | end_year | start_year |
|---|---|---|
| 0 | 2002.00 | 2001.00 |
| 1 | 2010.00 | 2002.00 |
| 2 | 2008.00 | 2005.00 |
| 3 | 2006.00 | 2005.00 |
| 4 | 2014.00 | 2006.00 |
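
One caveat for newer installs: to the best of my knowledge, pandas 2.1 deprecated DataFrame.applymap() and renamed it DataFrame.map(). The sketch below picks whichever method is available, so it should run on both older and newer versions. The `elementwise` name is only for illustration.

```python
import pandas as pd

# same fake year data as in the example above
df = pd.DataFrame({'start_year': [2001, 2002, 2005, 2005, 2006],
                   'end_year': [2002, 2010, 2008, 2006, 2014]})

formatter = lambda x: '{:.2f}'.format(x)

# assumption: DataFrame.map exists on pandas 2.1 and later; fall back to applymap otherwise
elementwise = df.map if hasattr(df, 'map') else df.applymap
formatted = elementwise(formatter)
print(formatted)
```

Every element of the result is a string, so the columns no longer support numeric operations such as .mean().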


For a nice guide to modern Python string formatting, check out https://mkaz.github.io/2012/10/10/python-string-format/