# Pandas iterations run-time comparison



According to [this StackOverflow post](https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316), the options for iterating over a ``pandas`` ``DataFrame`` form a strict performance hierarchy, from fastest to slowest:

1. vectorization
2. using a custom cython routine
3. apply
4. itertuples
5. iterrows
6. updating an empty frame (e.g. using loc one-row-at-a-time)

[These comments](https://stackoverflow.com/a/44942739/3861108) sent me to try them all :)

[Another useful post](https://stackoverflow.com/a/36911306/2901002)


In [1]:
import pandas as pd
import numpy as np

### Here is a DataFrame with 50,000 rows of random numbers

In [2]:
df = pd.DataFrame(np.random.random((50000, 2)), columns=['a', 'b'])

In [3]:
df.head()

,a,b
0,0.515064,0.228086
1,0.030467,0.127614
2,0.705997,0.732451
3,0.331241,0.369426
4,0.707828,0.292173


### Our task will be to concatenate the rounded values of each a and b pair into a string with a ":" separator

### First - let's try the vectorized approach:

In [4]:
%%time

# Round each column (a to 1 decimal, b to 2), convert to str, then join element-wise
df['vec'] = df.a.round(1).map(str) + ':' + df.b.round(2).map(str)

Wall time: 97 ms


pretty fast :)

In [5]:
df.head()

,a,b,vec
0,0.515064,0.228086,0.5:0.23
1,0.030467,0.127614,0.0:0.13
2,0.705997,0.732451,0.7:0.73
3,0.331241,0.369426,0.3:0.37
4,0.707828,0.292173,0.7:0.29
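To make the expression above easier to follow, here is a minimal sketch that rebuilds it step by step on the five rows printed by ``df.head()``. The names ``sample``, ``a_txt`` and ``b_txt`` are introduced only for this illustration.

```python
import pandas as pd

# The five rows printed by df.head() above (values copied from the output).
sample = pd.DataFrame({
    'a': [0.515064, 0.030467, 0.705997, 0.331241, 0.707828],
    'b': [0.228086, 0.127614, 0.732451, 0.369426, 0.292173],
})

# Each step works on a whole column at once, with no explicit row loop.
a_txt = sample.a.round(1).map(str)   # round a to 1 decimal, convert each value to str
b_txt = sample.b.round(2).map(str)   # round b to 2 decimals, convert each value to str
sample['vec'] = a_txt + ':' + b_txt  # element-wise string concatenation

print(sample['vec'].tolist())
```

The result should match the ``vec`` column shown above.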


### Second - cython routine
We'll skip it for now.

In [6]:
## %%time
# cython

### Third - apply - my favourite, but not anymore :(

In [7]:
%%time
# {0} is x.a and {1} is x.b, so both columns end up in the string
df['app1'] = df.apply(lambda x: '{0:.1f}:{1:.1f}'.format(x.a, x.b), axis=1)

Wall time: 1.7 s


Let's give it another try:

In [8]:
%%time
df['app2'] = df.apply(lambda x: str(round(x.a, 1)) + ':' + str(round(x.b, 1)), axis=1)

Wall time: 2.12 s


**Busted!**
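One likely reason for the slowdown is that ``apply`` with ``axis=1`` hands the function a full ``Series`` object for every row. The sketch below records the type of each argument on a two-row frame; ``fmt`` and ``seen_types`` are names made up for this illustration.

```python
import pandas as pd

df_small = pd.DataFrame({'a': [0.515064, 0.030467], 'b': [0.228086, 0.127614]})

seen_types = set()

def fmt(row):
    # Record what kind of object apply() passes in for each row.
    seen_types.add(type(row).__name__)
    return '{0:.1f}:{1:.1f}'.format(row.a, row.b)

result = df_small.apply(fmt, axis=1)

print(seen_types)       # the row objects are pandas Series
print(result.tolist())
```

Building one ``Series`` per row is real work, and with 50,000 rows it adds up.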

### Fourth - itertuples

In [9]:
%%time
df['itertuples'] = pd.Series('{0:.1f}:{1:.1f}'.format(x.a, x.b) for x in df.itertuples())

Wall time: 184 ms


Much better than ``apply``
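A small sketch of what ``itertuples`` yields, plus one thing to watch for when wrapping the results in ``pd.Series``: a new ``Series`` gets a default 0..n-1 index, so the assignment only lines up when the frame has that index too. The frame ``df_idx`` and its index labels 10 and 20 are assumptions chosen to make the effect visible.

```python
import pandas as pd

# A frame whose index is NOT the default 0..n-1.
df_idx = pd.DataFrame({'a': [0.515064, 0.030467], 'b': [0.228086, 0.127614]},
                      index=[10, 20])

# Each row arrives as a lightweight namedtuple, not a Series.
first = next(df_idx.itertuples())
print(first._fields)  # ('Index', 'a', 'b')

labels = ['{0:.1f}:{1:.1f}'.format(x.a, x.b) for x in df_idx.itertuples()]

# Without an explicit index, the new Series is labelled 0, 1 and does not align.
df_idx['misaligned'] = pd.Series(labels)
# Passing the frame's own index keeps the rows matched up.
df_idx['aligned'] = pd.Series(labels, index=df_idx.index)

print(df_idx)
```

In the notebook above the frame uses the default index, so the shorter form works there.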

### Fifth - try iterrows:

In [10]:
%%time
df['iterrows'] = pd.Series('{0:.1f}:{1:.1f}'.format(row.a, row.b) for _, row in df.iterrows())

Wall time: 4.98 s


Worse than anything before...
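Like ``apply``, ``iterrows`` builds a ``Series`` for every row, and that ``Series`` has a single dtype. The sketch below uses a hypothetical mixed int/float frame (``df_mixed``, not part of the benchmark) to show the int column being upcast to float, while ``itertuples`` leaves it alone.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
df_mixed = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# iterrows yields (index, Series) pairs.
idx, row = next(df_mixed.iterrows())
print(type(row).__name__, row.dtype)  # one dtype for the whole row
print(row['n'])                       # the integer has become a float

# itertuples keeps each column's own type.
tup = next(df_mixed.itertuples(index=False))
print(tup.n, tup.x)
```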

### Sixth - updating empty df by loc:

In [None]:
%%time
dfnew = pd.DataFrame()
for i in range(len(df)):
    dfnew.loc[i, 'byloc'] = '{0:.1f}:{1:.1f}'.format(df.loc[i].a, df.loc[i].b)

Bad, bad, bad idea...
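If a row-by-row loop really is needed, a common alternative is to collect the values in a plain Python list and build the frame once at the end. This is a minimal sketch on a two-row frame; ``df_small`` and ``rows`` are illustrative names.

```python
import pandas as pd

df_small = pd.DataFrame({'a': [0.515064, 0.030467], 'b': [0.228086, 0.127614]})

# Appending to a list is cheap; growing a DataFrame with .loc on every step is not.
rows = []
for a, b in zip(df_small['a'], df_small['b']):
    rows.append('{0:.1f}:{1:.1f}'.format(a, b))

# Build the result in one go, reusing the original index.
dfnew = pd.DataFrame({'byloc': rows}, index=df_small.index)
print(dfnew)
```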

## Conclusions:

Try to vectorize. If you can't, ``itertuples`` is even better than ``apply``, at least in this case. If you have to, fall back to ``iterrows``, but don't you dare try the 6th option :)
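To repeat the comparison outside a notebook, something like the following could be used. It is only a sketch: the function names, the seed and the 5,000-row size are assumptions, every strategy rounds both columns to one decimal so they all aim at the same output, and the timings will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows, seed=0):
    # Random numbers in [0, 1), same shape as the frame used above.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.random((n_rows, 2)), columns=['a', 'b'])


def vectorized(df):
    return (df.a.round(1).map(str) + ':' + df.b.round(1).map(str)).tolist()


def with_apply(df):
    return df.apply(lambda x: '{0:.1f}:{1:.1f}'.format(x.a, x.b), axis=1).tolist()


def with_itertuples(df):
    return ['{0:.1f}:{1:.1f}'.format(x.a, x.b) for x in df.itertuples()]


def with_iterrows(df):
    return ['{0:.1f}:{1:.1f}'.format(row.a, row.b) for _, row in df.iterrows()]


def time_all(df):
    # Run each strategy once and record its wall time in seconds.
    timings = {}
    for func in (vectorized, with_apply, with_itertuples, with_iterrows):
        start = time.perf_counter()
        func(df)
        timings[func.__name__] = time.perf_counter() - start
    return timings


if __name__ == '__main__':
    for name, seconds in time_all(make_frame(5000)).items():
        print('{0:<16}{1:.4f} s'.format(name, seconds))
```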