Link to Medium blog post:
https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4

# 12 Ways to Apply a Function to Each Row in Pandas DataFrame

In [2]:
pip install Faker


In [3]:
# Generate a DataFrame using Faker with a million tasks.
# Each task is a row in the DataFrame. It consists of task_name (str), due_date (str in ISO YYYY-MM-DD format, as returned by fake.date()), and priority (str).
# Priority can be one of the three values: LOW, MEDIUM, HIGH
import pandas as pd
from faker import Faker
import random
import datetime

fake = Faker()

task_name = [fake.name() for _ in range(1000000)]
due_date = [fake.date() for _ in range(1000000)]
priority = [random.choice(['LOW', 'MEDIUM', 'HIGH']) for _ in range(1000000)]

test_data_set = pd.DataFrame({'task_name': task_name, 'due_date': due_date, 'priority': priority})

test_data_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   task_name  1000000 non-null  object
 1   due_date   1000000 non-null  object
 2   priority   1000000 non-null  object
dtypes: object(3)
memory usage: 22.9+ MB


## Optimize DataFrame Storage

In [5]:
# Instead of str, priority can be stored as Pandas categorical type
priority_dtype = pd.api.types.CategoricalDtype(
  categories=['LOW', 'MEDIUM', 'HIGH'],
  ordered=True
)
test_data_set['priority'] = test_data_set['priority'].astype(priority_dtype)

In [6]:
# Let’s check out the DataFrame size now
test_data_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column     Non-Null Count    Dtype   
---  ------     --------------    -----   
 0   task_name  1000000 non-null  object  
 1   due_date   1000000 non-null  object  
 2   priority   1000000 non-null  category
dtypes: category(1), object(2)
memory usage: 16.2+ MB
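
The `+` in the `info()` output signals that the reported size is only a lower bound for object columns. To see where the saving comes from, one can compare per-column memory directly with `memory_usage(deep=True)`. The sketch below uses a small made-up Series rather than the Faker data, so the sizes are illustrative only.

```python
import pandas as pd

# Small stand-in for the priority column (illustrative data, not the Faker set)
priority = pd.Series(['LOW', 'MEDIUM', 'HIGH'] * 1000)

priority_dtype = pd.api.types.CategoricalDtype(
    categories=['LOW', 'MEDIUM', 'HIGH'], ordered=True
)
priority_cat = priority.astype(priority_dtype)

# deep=True also counts the string data itself, not just the references to it
object_bytes = priority.memory_usage(deep=True)
category_bytes = priority_cat.memory_usage(deep=True)

# An ordered categorical supports comparisons that follow the category order
at_least_medium = priority_cat >= 'MEDIUM'
```

Because the categorical stores one small integer code per row plus a single copy of each label, `category_bytes` should come out well below `object_bytes`.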


## Eisenhower Action Function

In [7]:
# Given importance and urgency, eisenhower_action computes an integer value between 0 and 3
def eisenhower_action(is_important: bool, is_urgent: bool) -> int:
  return 2 * is_important + is_urgent

In [11]:
# For this exercise, we will assume that a task with HIGH priority is important
# If the due date is in the next two days, then the task is urgent
# The Eisenhower Action for a task (i.e. a row in the DataFrame) is computed by using the due_date and priority columns
cutoff_date = datetime.date.today() + datetime.timedelta(days=2)

eisenhower_action(
  test_data_set.loc[0].priority == 'HIGH',
  test_data_set.loc[0].due_date <= str(cutoff_date)
)

3
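
The comparison `due_date <= str(cutoff_date)` works on plain strings because the dates are stored as ISO-formatted `YYYY-MM-DD` text, and such strings sort in the same order as the dates they represent. A small sketch of that property, along with the four possible return values of the function; the fixed dates are made up for illustration.

```python
import datetime

def eisenhower_action(is_important: bool, is_urgent: bool) -> int:
    return 2 * is_important + is_urgent

# ISO 8601 date strings compare lexicographically in calendar order
cutoff = str(datetime.date(2024, 3, 10))
earlier_as_text = '2024-03-09' <= cutoff
earlier_as_date = datetime.date(2024, 3, 9) <= datetime.date(2024, 3, 10)
later_as_text = '2024-11-01' <= cutoff

# Every combination of (is_important, is_urgent) and its action value
outcomes = {
    (imp, urg): eisenhower_action(imp, urg)
    for imp in (False, True)
    for urg in (False, True)
}
```

This shortcut would not hold for other date formats such as `DD/MM/YYYY`, where converting the column with `pd.to_datetime` first would be the safer route.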

In [None]:
# We will evaluate 12 alternatives for applying the eisenhower_action function to DataFrame rows
# First, we will measure the time for a sample of 100k rows
# Then, we will measure and plot the time for up to a million rows

## Method 1. Loop Over All Rows of a DataFrame

In [16]:
# The simplest method is to process each row in a good old Python loop. It is also the slowest, so avoid it for anything but tiny DataFrames
def loop_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  result = []
  for i in range(len(df)):
    row = df.iloc[i]
    result.append(
      eisenhower_action(
        row.priority == 'HIGH', row.due_date <= str(cutoff_date))
    )
  return pd.Series(result, index=df.index)  # keep df's index so the assignment aligns row by row

In [17]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_loop'] = loop_impl(data_sample)

10.4 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
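
One detail worth checking when a result is assigned back to a sampled frame: pandas aligns a Series on index labels, not on position. The sketch below uses a tiny made-up frame to show how a Series with a fresh RangeIndex can turn into NaN on assignment, and how reusing the frame's own index avoids that.

```python
import pandas as pd

# Non-contiguous index, as produced by DataFrame.sample()
df = pd.DataFrame({'priority': ['HIGH', 'LOW', 'HIGH']}, index=[7, 3, 5])
flags = [p == 'HIGH' for p in df['priority']]

# A fresh RangeIndex (0, 1, 2) shares no labels with df's index (7, 3, 5)
df['misaligned'] = pd.Series(flags)

# Reusing df.index keeps every value on its own row
df['aligned'] = pd.Series(flags, index=df.index)
```

Returning a NumPy array or a plain list sidesteps alignment altogether, since those are assigned by position.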


In [28]:
pip install line_profiler


In [30]:
# Let’s find out what is taking so long using the line_profiler and %lprun, but for a smaller sample of 100 rows
%load_ext line_profiler
%lprun -f loop_impl loop_impl(data_sample[:100])

Timer unit: 1e-09 s

Total time: 0.086536 s
File: /var/folders/cz/dg_zmrb96h979_06qv2zwrf40000gn/T/ipykernel_26907/4164450296.py
Function: loop_impl at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           def loop_impl(df):
     3         1     678000.0 678000.0      0.8    cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
     4         1          0.0      0.0      0.0    result = []
     5       101     111000.0   1099.0      0.1    for i in range(len(df)):
     6       100   64676000.0 646760.0     74.7      row = df.iloc[i]
     7       200     245000.0   1225.0      0.3      result.append(
     8       200     313000.0   1565.0      0.4        eisenhower_action(
     9       100   12039000.0 120390.0     13.9          row.priority == 'HIGH', row.due_date <= str(cutoff_date))
    10                                               )
    11         1    8474000.0    8e+06      9.8    return pd.Series(resu

## Method 2. Iterate over rows with iterrows Function

In [31]:
# Instead of processing each row in a Python loop, let’s try Pandas iterrows function
def iterrows_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return pd.Series(
    [
      eisenhower_action(
        row.priority == 'HIGH', row.due_date <= str(cutoff_date))
      for index, row in df.iterrows()
    ],
    index=df.index  # keep df's index so the assignment aligns row by row
  )

In [32]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_iterrows'] = iterrows_impl(data_sample)

7.36 s ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Method 3. Iterate over rows with itertuples Function

In [35]:
# Pandas has another method, itertuples, that processes rows as tuples
def itertuples_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return pd.Series(
    [
      eisenhower_action(
        row.priority == 'HIGH', row.due_date <= str(cutoff_date))
      for row in df.itertuples()
    ],
    index=df.index  # keep df's index so the assignment aligns row by row
  )

In [36]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_itertuples'] = itertuples_impl(data_sample)

266 ms ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
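
Part of the gap between the two methods likely comes from `iterrows` building a full Series for every row, whereas `itertuples` yields lightweight namedtuples. A minimal sketch on a made-up two-row frame shows what each row looks like:

```python
import pandas as pd

df = pd.DataFrame({
    'priority': ['HIGH', 'LOW'],
    'due_date': ['2024-01-01', '2030-01-01'],
})

# Default: namedtuples called 'Pandas', with the index as the first field
rows = list(df.itertuples())
first = rows[0]

# index=False drops the index field; name=None yields plain tuples
plain = list(df.itertuples(index=False, name=None))
```

Attribute access such as `first.priority` only works when the column names are valid Python identifiers; otherwise the fields are renamed positionally.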


## Method 4. Pandas apply Function to every row

In [37]:
# The Pandas DataFrame apply function is quite versatile and is a popular choice. To make it process rows, you have to pass the axis=1 argument
def apply_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return df.apply(
      lambda row:
        eisenhower_action(
          row.priority == 'HIGH', row.due_date <= str(cutoff_date)),
      axis=1
  )

In [38]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_apply'] = apply_impl(data_sample)

1.84 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Method 5. Python List Comprehension

In [42]:
# A column in DataFrame is a Series that can be used as a list in a list comprehension expression:
#   [foo(x) for x in df['x']]
# If multiple columns are needed, then zip can be used to make a list of tuples
def list_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return pd.Series([
    eisenhower_action(priority == 'HIGH', due_date <= str(cutoff_date))
    for (priority, due_date) in zip(df['priority'], df['due_date'])
  ], index=df.index)  # keep df's index so the assignment aligns row by row

In [43]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_list'] = list_impl(data_sample)

144 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Method 6. Python map Function

In [46]:
# Python’s map function takes in a function and iterables of parameters, and yields results
def map_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return pd.Series(
    list(map(eisenhower_action,
      df['priority'] == 'HIGH',
      df['due_date'] <= str(cutoff_date))),
    index=df.index  # keep df's index so the assignment aligns row by row
  )

In [47]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_map'] = map_impl(data_sample)

78 ms ± 6.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Method 7. Vectorization

In [48]:
# The real power of Pandas shows up in vectorization. But it requires unpacking the function as a vector expression
def vec_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return (
    2*(df['priority'] == 'HIGH') + (df['due_date'] <= str(cutoff_date)))

In [49]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_vec'] = vec_impl(data_sample)

18.7 ms ± 972 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
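
When unpacking a scalar function into a vector expression, it helps to confirm on a small sample that both forms agree. The sketch below does that with a made-up four-row frame and a fixed cutoff string, both assumptions chosen so the result is reproducible.

```python
import pandas as pd

def eisenhower_action(is_important: bool, is_urgent: bool) -> int:
    return 2 * is_important + is_urgent

df = pd.DataFrame({
    'priority': ['HIGH', 'LOW', 'HIGH', 'MEDIUM'],
    'due_date': ['2024-01-01', '2024-01-01', '2030-01-01', '2030-01-01'],
})
cutoff = '2024-06-01'  # fixed instead of date.today() for reproducibility

# Whole-column boolean arithmetic: True/False behave as 1/0
vectorized = 2 * (df['priority'] == 'HIGH') + (df['due_date'] <= cutoff)

# Row-by-row reference result from the scalar function
row_wise = [
    eisenhower_action(p == 'HIGH', d <= cutoff)
    for p, d in zip(df['priority'], df['due_date'])
]
```

The vectorized form works here because the function body is plain arithmetic; logic with branches would need something like `np.where` or boolean masks instead.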


## Method 8. NumPy vectorize Function

In [53]:
pip install numpy


In [58]:
# NumPy offers alternatives for migrating from Python to NumPy through vectorization.
# For example, it has a vectorize() function that vectorizes any scalar function to accept and return NumPy arrays
import numpy as np

def np_vec_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return np.vectorize(eisenhower_action)(
    df['priority'] == 'HIGH',
    df['due_date'] <= str(cutoff_date)
  )

In [59]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_np_vec'] = np_vec_impl(data_sample)

50.7 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
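
Note that NumPy's documentation describes `vectorize` as a convenience wrapper that is essentially a Python-level loop, which may explain why it lands between the list comprehension and true vectorization. A sketch that also passes `otypes` to pin the output dtype up front; the sample arrays are invented for illustration.

```python
import numpy as np

def eisenhower_action(is_important: bool, is_urgent: bool) -> int:
    return 2 * is_important + is_urgent

is_important = np.array([True, False, True, False])
is_urgent = np.array([True, True, False, False])

# otypes fixes the output dtype instead of inferring it from the first call
vec_action = np.vectorize(eisenhower_action, otypes=[np.int64])
actions = vec_action(is_important, is_urgent)
```

Setting `otypes` also lets the wrapped function handle empty inputs, where there is no first element to infer a dtype from.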


## Method 9. Numba Decorators

In [61]:
pip install numba


In [64]:
# Numba is commonly used to speed up applying mathematical functions. It has various decorators for JIT compilation and vectorization
import numba

@numba.vectorize
def eisenhower_action(is_important: bool, is_urgent: bool) -> int:
  return 2 * is_important + is_urgent

def numba_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return eisenhower_action(
    (df['priority'] == 'HIGH').to_numpy(),
    (df['due_date'] <= str(cutoff_date)).to_numpy()
  )

In [65]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_numba'] = numba_impl(data_sample)

20.4 ms ± 9.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Method 10. Multiprocessing with pandarallel

In [66]:
pip install pandarallel


In [67]:
# The pandarallel package utilizes multiple CPUs and splits the work across multiple worker processes
from pandarallel import pandarallel
pandarallel.initialize()
def pandarallel_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return df.parallel_apply(
    lambda row: eisenhower_action(
      row.priority == 'HIGH', row.due_date <= str(cutoff_date)),
    axis=1
  )

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [68]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_pandarallel'] = pandarallel_impl(data_sample)

1.17 s ± 187 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Method 11. Parallelize with Dask

In [70]:
pip install dask


In [77]:
# Dask is a parallel computing library that supports scaling up NumPy, Pandas, Scikit-learn, and many other Python libraries. 
# It offers efficient infrastructure for processing a massive amount of data on multi-node clusters
import dask.dataframe as dd
def dask_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return dd.from_pandas(df, npartitions=2).apply(
    lambda row: eisenhower_action(
      row.priority == 'HIGH', row.due_date <= str(cutoff_date)),
    axis=1,
    meta=(int)
  ).compute()

In [78]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_dask'] = dask_impl(data_sample)



2.55 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Method 12. Opportunistic Parallelization with Swifter

In [82]:
pip install swifter


In [85]:
# Swifter automatically decides which is faster: to use Dask parallel processing or a simple Pandas apply.
# It is very simple to use: just add one word to how one uses the Pandas apply function: df.swifter.apply
import swifter

def swifter_impl(df):
  cutoff_date = datetime.date.today() + datetime.timedelta(days=2)
  return df.swifter.apply(
    lambda row: eisenhower_action(
      row.priority == 'HIGH', row.due_date <= str(cutoff_date)),
    axis=1
  )

In [86]:
# time for a sample of 100k rows
data_sample = test_data_set.sample(100000)
%timeit data_sample['action_swifter'] = swifter_impl(data_sample)

47.9 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Performance Comparison of Iterating over Pandas DataFrame rows

In [87]:
pip install perfplot


In [93]:
pip install ipywidgets


In [None]:
# Plotting is helpful in understanding the relative performance of alternatives over input size.
# Perfplot is a handy tool for that. It requires a setup function to generate input of a given size and a list of implementations to compare
import perfplot

kernels = [
  loop_impl,
  iterrows_impl,
  itertuples_impl,
  apply_impl,
  list_impl,
  map_impl,
  vec_impl,
  np_vec_impl,
  numba_impl,
  pandarallel_impl,
  dask_impl,
  swifter_impl
]

perfplot.show(
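  # sample n rows so that the x-axis reflects the actual input size
  setup=lambda n: test_data_set.sample(n),
  kernels=kernels,
  # 2**19 = 524,288 is the largest power of two that fits in the 1M-row frame
  n_range=[2 ** k for k in range(1, 20)],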
  xlabel='n_rows',
  logx=True,
  logy=True,
  equality_check=None
)


Applying an operation independently to every row of a Pandas DataFrame is a common need. Here are my recommendations:

    Vectorize DataFrame expression: Go for this whenever possible.

    NumPy vectorize: Its API is not very complicated. It does not require additional packages. It offers almost the best performance. Choose this if vectorizing the DataFrame expression is infeasible.

    List Comprehension: Opt for this alternative when you need only 2–3 DataFrame columns, and both DataFrame vectorization and NumPy vectorize are infeasible for some reason.

    Pandas itertuples function: Its API is like that of apply, but in the 100k-row timings above it ran about 7x faster than apply (266 ms vs. 1.84 s). It is the easiest and most readable option, and it offers reasonable performance. Do this if the previous three do not work out.

    Numba or Swifter: Use these to exploit JIT compilation or parallelization without much code complexity.
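
To check these recommendations against your own data, a small harness along the following lines can time candidate implementations and confirm that they agree. This is only a sketch: the two functions are simplified variants that take a fixed cutoff string, the sample data is random, and `timeit` stands in for the `%timeit` magic so that it runs outside a notebook.

```python
import timeit

import numpy as np
import pandas as pd

def vec_impl(df, cutoff):
    # Whole-column boolean arithmetic
    return (2 * (df['priority'] == 'HIGH') + (df['due_date'] <= cutoff)).to_numpy()

def list_impl(df, cutoff):
    # Row-by-row list comprehension over the two columns
    return np.array([
        2 * (p == 'HIGH') + (d <= cutoff)
        for p, d in zip(df['priority'], df['due_date'])
    ])

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'priority': rng.choice(['LOW', 'MEDIUM', 'HIGH'], size=10_000),
    'due_date': rng.choice(['2024-01-01', '2030-01-01'], size=10_000),
})
cutoff = '2024-06-01'

# Best-of-3 wall-clock time per implementation, in seconds
timings = {
    impl.__name__: min(timeit.repeat(lambda: impl(df, cutoff), number=1, repeat=3))
    for impl in (vec_impl, list_impl)
}
same_result = bool(np.array_equal(vec_impl(df, cutoff), list_impl(df, cutoff)))
```

Taking the minimum over repeats, rather than the mean, reduces the influence of unrelated background load on the comparison.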