# Pandas tips: #2 - Iterate over rows with .itertuples()


Vectorizing is often the fastest way to process data. There are, however, situations in which you still need to iterate through each row of a DataFrame, often because you want to solve a problem quick and dirty. This is fine for one-time processing, as long as the run time is in the order of minutes.

When iterating over each row, many use the .iterrows() method of the DataFrame. This method yields the index and a Series object for each row. Pandas cannot guarantee whether that Series is a view or a copy, so do not directly change the data you are iterating over: changes you make might not be preserved.

Pandas offers another method to iterate over each row: .itertuples(). The .itertuples() method yields a namedtuple object for each row and, depending on an option, includes the index. This method gives the same information as .iterrows() while not giving the illusion that you can change the original DataFrame through the iteration. But the best part is that it is roughly 14 times faster than .iterrows() in the benchmark below. Let's demonstrate this:
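As a minimal sketch of this pitfall, the example below uses a small, hypothetical mixed-dtype DataFrame (the name `frame` and its columns are made up for illustration). With mixed dtypes each row Series is built from a fresh array, so writing to it leaves the DataFrame untouched. A safer pattern is to collect the new values and assign them in one step afterwards.

```python
import pandas as pd

# Hypothetical example frame with mixed dtypes (int and str columns)
frame = pd.DataFrame({"A": [1, 2, 3], "label": ["x", "y", "z"]})

for ix, row in frame.iterrows():
    row["A"] = -1  # only changes the temporary row Series

# The original column still holds its values
print(frame["A"].tolist())

# Safer: collect results while iterating, then assign once afterwards
doubled = [row["A"] * 2 for _, row in frame.iterrows()]
frame["A_doubled"] = doubled
print(frame)
```

With a DataFrame that has a single dtype, the row may be a view instead, which is exactly why the behaviour should not be relied on either way.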

In [2]:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randint(0, 100, size=(100, 4)),
    columns=list('ABCD'),
)

df

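|     | A   | B   | C   | D   |
| --- | --- | --- | --- | --- |
| 0   | 96  | 55  | 94  | 58  |
| 1   | 84  | 22  | 48  | 81  |
| 2   | 56  | 30  | 35  | 49  |
| 3   | 22  | 33  | 66  | 15  |
| 4   | 3   | 69  | 40  | 98  |
| ... | ... | ... | ... | ... |
| 95  | 24  | 79  | 60  | 74  |
| 96  | 97  | 39  | 41  | 40  |
| 97  | 81  | 44  | 29  | 20  |
| 98  | 58  | 79  | 82  | 64  |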


Now let's use the `%%timeit` magic command from Jupyter to assess the speed of the iteration:

In [43]:
%%timeit

for ix, row in df.iterrows():
    pass

4.06 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [45]:
print('row.A = ', row.A)
print('\nFull row Series object:')
print(row)

row.A =  20

Full row Series object:
A    20
B    42
C    53
D    31
Name: 92, dtype: int64


It takes about 4 ms to iterate through the 100 rows, each having 4 columns.

Doing the same with .itertuples() is much faster, while providing the same information:

In [36]:
%%timeit

for nt in df.itertuples():
    pass

276 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
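Outside Jupyter, where the `%%timeit` magic is not available, a similar comparison could be sketched with the standard-library `timeit` module. The helper names `loop_iterrows` and `loop_itertuples` and the repeat count are assumptions chosen for illustration, and the absolute timings will differ per machine and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the DataFrame used above: 100 rows, 4 integer columns
frame = pd.DataFrame(
    np.random.randint(0, 100, size=(100, 4)),
    columns=list("ABCD"),
)


def loop_iterrows():
    # Iterate with .iterrows(); each row is a Series
    for ix, row in frame.iterrows():
        pass


def loop_itertuples():
    # Iterate with .itertuples(); each row is a namedtuple
    for nt in frame.itertuples():
        pass


n_runs = 100  # hypothetical repeat count, kept small for a quick check
t_rows = timeit.timeit(loop_iterrows, number=n_runs) / n_runs
t_tuples = timeit.timeit(loop_itertuples, number=n_runs) / n_runs

print(f"iterrows:   {t_rows * 1e6:.0f} µs per loop")
print(f"itertuples: {t_tuples * 1e6:.0f} µs per loop")
```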


The method is not only faster but also provides the exact same information. To show this, we take the first row from a fresh iterator. The iterator used in the loop above cannot be reused: in Python, the items of an iterator are consumed while looping, so when the loop is finished there are no items left.

In [54]:
nt = next(df.itertuples())
print('nt.A = ', nt.A)
print('\nFull namedtuple object:')
print(nt)

nt.A =  96

Full namedtuple object:
Pandas(Index=0, A=96, B=55, C=94, D=58)


The index gets the name 'Index' (with a capital I) and is accessible through `nt.Index`. There is a tiny speed-up when you tell Pandas to exclude the index. However, for debugging, it is often easier to keep the index. The index can be excluded using:

In [24]:
%%timeit

for row in df.itertuples(index=False):
    pass

269 µs ± 3.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
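To make the effect of the `index` option more concrete, here is a small sketch on a hypothetical two-row DataFrame (the name `frame`, its values, and its index labels are made up for illustration). It also uses the `name=None` option of .itertuples(), which yields plain tuples instead of namedtuples.

```python
import pandas as pd

# Hypothetical frame with a non-default index to make the index visible
frame = pd.DataFrame({"A": [96, 84], "B": [55, 22]}, index=[10, 11])

# Default: the index is the first field, named 'Index'
with_index = next(frame.itertuples())
print(with_index.Index, with_index.A)

# index=False: only the columns remain as fields
without_index = next(frame.itertuples(index=False))
print(without_index._fields)

# name=None: plain tuples without field names
plain = next(frame.itertuples(index=False, name=None))
print(plain)
```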


This is a simple method to increase the speed of your iterations. Of course, when writing code that is used often, invest the time in vectorizing your steps. If it is just a short and simple one-time event, use the faster of the two iteration methods shown here: .itertuples().
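As a rough illustration of what vectorizing a step can look like, the sketch below computes a row-wise sum of two columns once with an .itertuples() loop and once as a single column operation. The column sum is only a hypothetical stand-in for whatever your loop body does; not every loop translates this directly.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.random.randint(0, 100, size=(100, 4)),
    columns=list("ABCD"),
)

# Quick and dirty: loop over the rows and add two fields per row
totals_loop = [nt.A + nt.B for nt in frame.itertuples(index=False)]

# Vectorized: add the two columns in one operation
totals_vectorized = frame["A"] + frame["B"]

# Both approaches give the same values
print(totals_vectorized.tolist() == totals_loop)
```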

If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis).