Link to Medium blog post: 
    https://towardsdatascience.com/heres-the-most-efficient-way-to-iterate-through-your-pandas-dataframe-4dad88ac92ee

# Iterrows():

In [21]:
# let's use iterrows on a practice dataframe with 100000 rows and 4 columns
import pandas as pd
import numpy as np
from tqdm import tqdm

df = pd.DataFrame(np.random.randint(0,100,size=(100000, 4)), columns=list('ABCD'))

# iterrows() is a built-in Pandas method that iterates over a data frame and yields (index, row) pairs
# It is best avoided on large frames, because it is much slower than the other iteration techniques shown below
# Each row is built as a new Pandas Series, and constructing one Series per row (plus the function calls involved) is what makes it slow

def loop_with_iterrows(df):
    temp = 0
    for index, row in tqdm(df.iterrows()):
        temp = row['A'] + row['B']
        temp = temp **2
    return temp

import timeit
%timeit loop_with_iterrows(df)

4.25 s ± 42.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
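
To make the per-row cost concrete, here is a minimal sketch of what iterrows() hands back on each step. The two-row frame `small` is a hypothetical example, not part of the benchmark above; it mixes an integer and a float column to show that each row is packed into a single Series with one common dtype.

```python
import pandas as pd

# Hypothetical two-row frame with one integer and one float column
small = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})

# iterrows() yields (index, row) pairs; take the first pair
index, row = next(small.iterrows())

row_type = type(row).__name__   # each row arrives as a pandas Series
row_dtype = str(row.dtype)      # the integer from column A is upcast to float64

print(index, row_type, row_dtype)
```

Because a fresh Series is built for every row, dtypes are not preserved across the row, which is one more reason to prefer the alternatives in the following sections.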





# Itertuples():

In [22]:
# itertuples() is a built-in Pandas method that iterates over the rows of a data frame
# It makes fewer function calls than iterrows() and carries much less overhead
# Each row is yielded as a namedtuple, so columns are read as attributes (row.A, row.B)

def loop_with_tuples(df):
    temp = 0
    for row in tqdm(df.itertuples()):
        temp = row.A + row.B
        temp = temp **2
    return temp

import timeit
%timeit loop_with_tuples(df)

174 ms ± 42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
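
As a small illustration of what itertuples() yields, the sketch below inspects the first row of a hypothetical two-row frame (`small` is an assumption for the example, not the benchmark data). It also shows the `index` and `name` parameters, which control whether the index is included and whether namedtuples or plain tuples come back.

```python
import pandas as pd

# Hypothetical two-row frame
small = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Default behaviour: a namedtuple called "Pandas" with the index as first field
first = next(small.itertuples())
fields = first._fields            # ('Index', 'A', 'B')
total = first.A + first.B         # columns are plain attributes

# index=False drops the index field; name=None yields regular tuples
plain = list(small.itertuples(index=False, name=None))

print(type(first).__name__, fields, total, plain)
```

Plain tuples are accessed by position rather than by column name, so they trade a little readability for slightly less overhead.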





# Numpy Array Iteration:

In [23]:
# Iterating row by row defeats the purpose of using Pandas; vectorization is the better choice whenever the operation allows it
# When a loop is unavoidable, the df.values attribute (or df.to_numpy()) exposes the data as a 2-D NumPy array whose rows can be indexed by position

def loop_with_numpy_arrays(df):
    temp = 0
    for row in tqdm(df.values):
        temp = row[0] + row[1]
        temp = temp **2
    return temp

import timeit
%timeit loop_with_numpy_arrays(df)

105 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
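
Since the comment above recommends vectorization, here is a minimal sketch of the same arithmetic written without a Python-level loop. It rebuilds a frame of the same shape with a seeded generator (an assumption made so the example is reproducible); the variable names `squared`, `squared_np` and `last` are illustrative.

```python
import numpy as np
import pandas as pd

# Same shape as the practice frame: 100000 rows, 4 integer columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100_000, 4)), columns=list("ABCD"))

# One expression over whole columns, evaluated in compiled code
squared = (df["A"] + df["B"]) ** 2

# The same arithmetic directly on the underlying NumPy arrays
squared_np = (df["A"].to_numpy() + df["B"].to_numpy()) ** 2

# The loop functions in this notebook return only the last row's value
last = squared.iloc[-1]
```

Unlike the loops, this computes the result for every row at once, so it applies when the per-row operation can be expressed as column arithmetic.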





# Dictionary Iteration:

In [24]:
# Now for the approach reported as the most efficient way to iterate through the data frame
# df.to_dict('records') converts the data frame to a list of dictionaries, one per row, keyed by column name

def loop_with_dict(df):
    temp = 0
    for row in tqdm(df.to_dict('records')):
        temp = row['A'] + row['B']
        temp = temp **2
    return temp

import timeit
%timeit loop_with_dict(df)

169 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
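
For reference, this is a small sketch of the structure that to_dict('records') produces, using a hypothetical two-row frame (`small` is not part of the benchmark). The whole list is built up front, so the conversion cost is part of the timing above.

```python
import pandas as pd

# Hypothetical two-row frame
small = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# One plain dict per row, keyed by column name
records = small.to_dict("records")

# Iterating is then ordinary dictionary access, with no Pandas objects involved
results = [(row["A"] + row["B"]) ** 2 for row in records]

print(records)
print(results)
```

Because the records are held in memory as Python objects, this approach can use noticeably more memory than the frame itself on very large data.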





# Conclusion:

Using iterrows() to iterate through a data frame is not recommended: it builds a Series for every row and makes many function calls, which adds a lot of overhead. itertuples() yields each row as a namedtuple instead, which makes it considerably faster (174 ms versus 4.25 s in the runs above).

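Vectorization is the first and best choice whenever the operation can be expressed on whole columns. When a loop is unavoidable, converting the data frame to a NumPy array or to a list of dictionaries speeds up the iteration. Iterating over the dictionary records is reported to be the fastest approach, with a speedup of around 280x for 20 million records. In the 100000-row runs above, the NumPy array loop (105 ms) and the dictionary loop (169 ms) were close, so it is worth timing both on your own data.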
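
The comparison can also be reproduced outside a notebook. The following is a minimal sketch using the standard `timeit` module instead of the `%timeit` magic; the helper names (`make_frame`, `benchmark`, and the `with_*` functions) are hypothetical, and the progress bar is left out so that its overhead is not part of the measurement.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows):
    """Build a frame shaped like the practice data (hypothetical helper)."""
    rng = np.random.default_rng(0)
    return pd.DataFrame(rng.integers(0, 100, size=(n_rows, 4)), columns=list("ABCD"))


def with_iterrows(df):
    temp = 0
    for _, row in df.iterrows():
        temp = (row["A"] + row["B"]) ** 2
    return temp


def with_itertuples(df):
    temp = 0
    for row in df.itertuples():
        temp = (row.A + row.B) ** 2
    return temp


def with_numpy(df):
    temp = 0
    for row in df.to_numpy():
        temp = (row[0] + row[1]) ** 2
    return temp


def with_dict(df):
    temp = 0
    for row in df.to_dict("records"):
        temp = (row["A"] + row["B"]) ** 2
    return temp


def vectorized(df):
    # Whole-column arithmetic; return the last value to match the loops
    return ((df["A"] + df["B"]) ** 2).iloc[-1]


def benchmark(n_rows=10_000, repeat=3):
    """Return the best single-run time in seconds for each approach."""
    df = make_frame(n_rows)
    candidates = [with_iterrows, with_itertuples, with_numpy, with_dict, vectorized]
    timings = {}
    for func in candidates:
        runs = timeit.repeat(lambda: func(df), number=1, repeat=repeat)
        timings[func.__name__] = min(runs)
    return timings


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name:>16}: {seconds * 1000:8.2f} ms")
```

Absolute numbers depend on the machine, the Pandas version, and the column dtypes, so the ranking is best checked at the row counts you actually work with.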