## **Efficiently iterating over rows in a Pandas DataFrame**


In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/Ramakm/ai-hands-on/refs/heads/main/Data/test.txt')
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack_type,other
0,0,tcp,private,REJ,0,0,0,0,0,0,...,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00,neptune,21
1,0,tcp,private,REJ,0,0,0,0,0,0,...,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00,neptune,21
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00,normal,21
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00,saint,15
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,0.31,0.17,0.03,0.02,0.00,0.0,0.83,0.71,mscan,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,0,tcp,smtp,SF,794,333,0,0,0,0,...,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00,normal,21
22540,0,tcp,http,SF,317,938,0,0,0,0,...,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00,normal,21
22541,0,tcp,http,SF,54540,8314,0,0,0,2,...,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07,back,15
22542,0,udp,domain_u,SF,42,42,0,0,0,0,...,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00,normal,21


In [10]:
df.shape

(22544, 43)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22544 entries, 0 to 22543
Data columns (total 43 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     22544 non-null  int64  
 1   protocol_type                22544 non-null  object 
 2   service                      22544 non-null  object 
 3   flag                         22544 non-null  object 
 4   src_bytes                    22544 non-null  int64  
 5   dst_bytes                    22544 non-null  int64  
 6   land                         22544 non-null  int64  
 7   wrong_fragment               22544 non-null  int64  
 8   urgent                       22544 non-null  int64  
 9   hot                          22544 non-null  int64  
 10  num_failed_logins            22544 non-null  int64  
 11  logged_in                    22544 non-null  int64  
 12  num_compromised              22544 non-null  int64  
 13  root_shell      

So this dataset has **22544** rows and **43** columns.

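## **Iterrows**

`iterrows()` iterates “over the rows of a Pandas DataFrame as (index, Series) pairs”. It converts each row into a Series object, which causes two problems: it can change the dtypes of your data (all values in a row are coerced to a common dtype), and the conversion itself greatly degrades performance.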

In [4]:
%%timeit -n 10
# Iterrows
total = []
for index, row in df.iterrows():
    total.append(row['src_bytes'] + row['dst_bytes'])

1.29 s ± 38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
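
One known side effect of packing each row into a Series is that dtypes are not preserved across the row. Here is a minimal sketch of that behavior, using a hypothetical two-column frame (`toy`, with made-up values, not the dataset above) that mixes an integer column and a float column:

```python
import pandas as pd

# Hypothetical toy frame (not the dataset above): one int column, one float column
toy = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 1.5]})

# Take the first (index, Series) pair produced by iterrows()
_, first_row = next(toy.iterrows())

print(toy["count"].dtype)        # integer dtype in the original column
print(first_row.dtype)           # the row Series has a single common dtype
print(type(first_row["count"]))  # the integer value comes back as a float
```

In the benchmark above this doesn't change the result, since both `src_bytes` and `dst_bytes` are integers, but it is worth keeping in mind for mixed-type rows.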


## **For loop with .loc or .iloc (3× faster)**

This is what I used to do when I started: a basic for loop to select rows by index (with .loc or .iloc).

Why is it bad? Because DataFrames are not designed for this purpose. Selecting a whole row this way converts it into a Pandas Series, as with the previous method, and even the column-first lookups used below pay the cost of Pandas' indexing machinery on every single access, which degrades performance.

Interestingly enough, .iloc is faster than .loc. It makes sense since Pandas doesn’t have to resolve user-defined labels and can go straight to the row's integer position.

In [5]:
%%timeit -n 10
# For loop with .loc
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].loc[index] + df['dst_bytes'].loc[index])

479 ms ± 64.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
%%timeit -n 10
# For loop with .iloc
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].iloc[index] + df['dst_bytes'].iloc[index])

391 ms ± 62.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
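
To make the label-versus-position distinction concrete, here is a small sketch. The Series `src` and its string labels are hypothetical, chosen only so that labels and positions differ:

```python
import pandas as pd

# Hypothetical toy Series with string labels, to separate "label" from "position"
src = pd.Series([10, 20, 30], index=["a", "b", "c"], name="src_bytes")

by_label = src.loc["b"]     # lookup by index label
by_position = src.iloc[1]   # lookup by integer position

print(by_label, by_position)  # both refer to the second element
```

In the benchmark above, `df` has a default `RangeIndex`, so labels and positions coincide. That is why `range(len(df))` works with both `.loc` and `.iloc` there.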


## **Apply (6× faster)**

The `apply()` method is another popular choice to iterate over rows. It creates code that is easy to understand but at a cost: performance is nearly as bad as the previous for loop.

This is why I would strongly advise you to avoid this function for this specific purpose (it’s fine for other applications).

Note that I convert the resulting Series into a list using the `to_list()` method to obtain identical results.



In [7]:
%%timeit -n 10
# Apply
df.apply(lambda row: row['src_bytes'] + row['dst_bytes'], axis=1).to_list()

226 ms ± 42.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
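
As a rough illustration of what `apply()` does with `axis=1`, the sketch below records the type of the object passed to the function on each call. The frame `toy` and the helper `add_bytes` are hypothetical names introduced for this example only:

```python
import pandas as pd

# Hypothetical toy frame reusing the two column names from the benchmark
toy = pd.DataFrame({"src_bytes": [1, 2, 3], "dst_bytes": [10, 20, 30]})

row_types = []

def add_bytes(row):
    # Record what apply() hands us for each row, then do the addition
    row_types.append(type(row).__name__)
    return row["src_bytes"] + row["dst_bytes"]

result = toy.apply(add_bytes, axis=1)

print(set(row_types))         # every row arrives as a Series
print(type(result).__name__)  # the result is a Series, hence to_list()
print(result.to_list())
```

Each call receives a full Series, which is consistent with `apply()` landing in the same performance range as the loops above.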


## **Itertuples (7× faster)**

If you know about `iterrows()`, you probably know about `itertuples()`. According to the official documentation, it iterates “over the rows of a DataFrame as namedtuples of the values”. In practice, it means that rows are converted into tuples, which are much lighter objects than Pandas Series.

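This is why `itertuples()` is a better version of `iterrows()`. This time, we need to access the values with an attribute (or an index). If the column name is only available as a string, you can use the `getattr()` function instead. Be aware that column names that are not valid Python identifiers (e.g., names containing a space) are renamed to positional names such as `_1`, so those values have to be accessed by the positional name or by index.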

In [8]:
%%timeit -n 10
# Itertuples
total = []
for row in df.itertuples():
    total.append(row.src_bytes + row.dst_bytes)

The slowest run took 4.13 times longer than the fastest. This could mean that an intermediate result is being cached.
191 ms ± 110 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
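
The different ways of reading a value from an `itertuples()` row can be sketched as follows. The frame `toy` is hypothetical, and its `"dst bytes"` column (with a space) is made up to show how Pandas handles names that are not valid identifiers:

```python
import pandas as pd

# Hypothetical toy frame; "dst bytes" deliberately contains a space
toy = pd.DataFrame({"src_bytes": [1, 2], "dst bytes": [10, 20]})

row = next(toy.itertuples())

print(row._fields)                # field names of the namedtuple
print(row.src_bytes)              # attribute access
print(getattr(row, "src_bytes"))  # same value, name given as a string
print(row[2])                     # positional access: Index is 0, src_bytes is 1
print(row._2)                     # "dst bytes" was renamed to a positional name
```

The first field is always `Index` unless you pass `index=False` to `itertuples()`, which shifts the positions by one.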


## **List comprehensions (250× faster)**

List comprehensions are a fancy way to iterate over a list as a one-liner.

For instance, `[print(i) for i in range(10)]` prints numbers from 0 to 9 without any explicit for loop. I say “explicit” because Python actually processes it as a for loop if we look at the bytecode.

So why is it faster? Partly because we don’t call the `.append()` method in this version, and partly because `zip()` iterates directly over the two columns, so no row object has to be built for each iteration.

In [9]:
%%timeit -n 100
# List comprehension
[src + dst for src, dst in zip(df['src_bytes'], df['dst_bytes'])]

5.08 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
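
For comparison, here is a sketch of the same computation written both as an explicit loop with `.append()` and as a list comprehension. The frame `toy` is a hypothetical stand-in for the dataset above:

```python
import pandas as pd

# Hypothetical toy frame reusing the two column names from the benchmark
toy = pd.DataFrame({"src_bytes": [1, 2, 3], "dst_bytes": [10, 20, 30]})

# Explicit loop: one .append() call per element
total_loop = []
for src, dst in zip(toy["src_bytes"], toy["dst_bytes"]):
    total_loop.append(src + dst)

# List comprehension: same iteration, no explicit .append() call
total_comp = [src + dst for src, dst in zip(toy["src_bytes"], toy["dst_bytes"])]

print(total_loop == total_comp)
```

Both versions iterate over the columns with `zip()`, so neither one creates a Series or a namedtuple per row.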


## **Pandas vectorization (1300× faster)**

Until now, all the techniques used simply add up single values. Instead of adding single values, why not group them into vectors to sum them up? The difference between adding two numbers or two vectors is not significant for a CPU, which should speed things up.

On top of that, the addition runs in compiled code on whole arrays rather than in the Python interpreter, one element at a time. Note that this is not multi-core parallelism: a Series addition like this one generally runs on a single core.

The syntax is also the simplest imaginable: this solution is extremely intuitive. Under the hood, Pandas takes care of vectorizing our data with optimized C code using contiguous memory blocks.

In [13]:
%%timeit -n 1000
# Vectorization
(df['src_bytes'] + df['dst_bytes']).to_list()

993 µs ± 156 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
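
A minimal sketch of what the vectorized expression produces, on a hypothetical toy frame (`toy`) rather than the dataset above:

```python
import pandas as pd

# Hypothetical toy frame reusing the two column names from the benchmark
toy = pd.DataFrame({"src_bytes": [1, 2, 3], "dst_bytes": [10, 20, 30]})

# A single expression adds the two columns element by element
total = toy["src_bytes"] + toy["dst_bytes"]

print(type(total).__name__)  # the result is still a Series
print(total.to_list())       # converted to a list, as in the benchmark
```

The result keeps the index of the original columns, which is why `.to_list()` is only needed here to match the output of the earlier methods.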


## **NumPy vectorization (1400× faster)**

NumPy is designed to handle scientific computing. It has less overhead than Pandas methods since rows and dataframes all become np.array. It relies on the same optimizations as Pandas vectorization.

There are two ways of converting a Series into a np.array: using `.values` or `.to_numpy()`. The former is not formally deprecated, but the Pandas documentation has recommended `.to_numpy()` over it for years, which is why we’ll use `.to_numpy()` in this example.

In [14]:
%%timeit -n 1000
# Numpy vectorization
(df['src_bytes'].to_numpy() + df['dst_bytes'].to_numpy()).tolist()

913 µs ± 153 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
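
The conversion step can be sketched on a hypothetical toy frame (`toy`, not the dataset above) to show that both columns become plain NumPy arrays before the addition:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the two column names from the benchmark
toy = pd.DataFrame({"src_bytes": [1, 2, 3], "dst_bytes": [10, 20, 30]})

src = toy["src_bytes"].to_numpy()  # Series -> np.ndarray
dst = toy["dst_bytes"].to_numpy()

total = src + dst                  # element-wise addition on raw arrays

print(type(total).__name__)        # no Series or index involved anymore
print(total.tolist())              # ndarray uses tolist(), not to_list()
```

Note the method name: NumPy arrays expose `tolist()`, while Pandas Series offer `to_list()` (with `tolist()` as an alias).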


We found our winner with a technique that is roughly 1400 times faster than our first competitor (913 µs versus 1.29 s in the runs above).