# DataFrame Loop Optimization

Throughout the examples and exercises in this class's notebooks, we have focused on learning the different Pandas and NumPy methods rather than on optimization.

Now let's spend a little time on optimization, which is an important part of data science, especially as you begin to work with larger data sets. My hope is that this will also give you an idea of the different ways to loop through your data when you are working on your own projects.

Note: The initial idea and some of the code for this notebook come from Tanmay Chinmurkar's article titled "Hey Pandas, why you no fast loop?!?" on medium.com. Read the full article here: https://medium.com/analytics-vidhya/hey-pandas-why-you-no-fast-loop-e7226ed97322

In [1]:
# common imports
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Let's start by creating a DataFrame with 100,000 review scores, similar to the review scores we discussed in the Airbnb data.

In [2]:
# set random seed
np.random.seed(42)

# create sample DataFrame of 100,000 review scores
# (randint's upper bound is exclusive, so scores range from 0 to 99)
df = pd.DataFrame(np.random.randint(0, 100, size=100000),
                  columns=['review_score'])

df

|  | review_score |
| --- | --- |
| 0 | 51 |
| 1 | 92 |
| 2 | 14 |
| 3 | 71 |
| 4 | 60 |
| ... | ... |
| 99995 | 26 |
| 99996 | 35 |
| 99997 | 87 |
| 99998 | 53 |


Next, let's say that we want to add a new column called `review_stars` that uses the following mapping:

- 0 - 20: 'One Star'

- 21 - 40: 'Two Stars'

- 41 - 60: 'Three Stars'

- 61 - 80: 'Four Stars'

- 81 - 100: 'Five Stars'

Note: We will not worry about creating categorical data types for this example.
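The mapping above can be sketched as a small lookup, which is handy for checking the boundary values (20 vs. 21, 40 vs. 41, and so on) before timing anything. The names `UPPER_BOUNDS`, `LABELS`, and `score_to_stars` are hypothetical and only used for this illustration; the sketch assumes integer scores between 0 and 100.

```python
from bisect import bisect_left

# Inclusive upper bound of each bucket and its label, taken from the mapping above
# (hypothetical names, used only in this sketch)
UPPER_BOUNDS = [20, 40, 60, 80, 100]
LABELS = ['One Star', 'Two Stars', 'Three Stars', 'Four Stars', 'Five Stars']

def score_to_stars(score):
    """Return the star label for an integer score between 0 and 100."""
    # bisect_left finds the first bucket whose upper bound is >= score
    return LABELS[bisect_left(UPPER_BOUNDS, score)]

# check the values on both sides of every bucket boundary
boundary_checks = {s: score_to_stars(s) for s in (0, 20, 21, 40, 41, 60, 61, 80, 81, 100)}
for score, label in boundary_checks.items():
    print(score, label)
```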

## Basic Loops

Let's first use a basic Python loop to perform this task and calculate the total computation time.

In [3]:
def basic_loop(df):
    # create an empty column to use in the loop
    df['review_stars'] = ''
    
    # loop through review_score, filling in the new review_stars column
    # note: df['review_stars'].iloc[value] = ... is chained assignment; pandas
    # warns about it (silenced above), and it does not update df when
    # Copy-on-Write is enabled, where df.loc[value, 'review_stars'] = ... is needed
    for value in range(len(df)):
        if df['review_score'][value] >= 0 and df['review_score'][value] <= 20:
            df['review_stars'].iloc[value] = 'One Star'

        elif df['review_score'][value] >= 21 and df['review_score'][value] <= 40:
            df['review_stars'].iloc[value] = 'Two Stars'

        elif df['review_score'][value] >= 41 and df['review_score'][value] <= 60:
            df['review_stars'].iloc[value] = 'Three Stars'

        elif df['review_score'][value] >= 61 and df['review_score'][value] <= 80:
            df['review_stars'].iloc[value] = 'Four Stars'

        else:
            df['review_stars'].iloc[value] = 'Five Stars'

In [4]:
%%timeit
basic_loop(df)

25.4 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
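`%%timeit` is an IPython cell magic, so it only works inside a notebook or IPython session. Outside of one, a similar measurement can be sketched with the standard library's `timeit` module. The function `label_scores` and the frame `small_df` below are hypothetical stand-ins, kept small so the sketch runs quickly.

```python
import timeit

import numpy as np
import pandas as pd

def label_scores(frame):
    """Hypothetical stand-in for whichever function is being timed."""
    return ['One Star' if s <= 20 else 'Other' for s in frame['review_score']]

# small frame so the sketch finishes quickly
small_df = pd.DataFrame({'review_score': np.arange(1_000) % 101})

# repeat=7 mirrors the "7 runs" reported above; number=1 means one call per run
run_times = timeit.repeat(lambda: label_scores(small_df), repeat=7, number=1)

mean_time = np.mean(run_times)
std_time = np.std(run_times)
print(f'{mean_time:.6f} s ± {std_time:.6f} s per loop (mean ± std. dev. of 7 runs)')
```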


In [5]:
# check DataFrame
df

|  | review_score | review_stars |
| --- | --- | --- |
| 0 | 51 | Three Stars |
| 1 | 92 | Five Stars |
| 2 | 14 | One Star |
| 3 | 71 | Four Stars |
| 4 | 60 | Three Stars |
| ... | ... | ... |
| 99995 | 26 | Two Stars |
| 99996 | 35 | Two Stars |
| 99997 | 87 | Five Stars |
| 99998 | 53 | Three Stars |


## Iterrows

From the Pandas documentation:
- The Pandas method `iterrows` iterates over DataFrame rows as (index, Series) pairs.
- Because `iterrows` returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).
- To preserve dtypes while iterating over the rows, it is better to use `itertuples()`, which returns named tuples of the values and is generally faster than `iterrows`.
- You should never modify something you are iterating over. Depending on the data types, the iterator may return a copy rather than a view, so writing to it may have no effect.

In [6]:
# Documentation example to show dtypes
df2 = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
df2

|  | int | float |
| --- | --- | --- |
| 0 | 1 | 1.5 |


In [7]:
# check data types
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   int     1 non-null      int64  
 1   float   1 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 144.0 bytes


In [8]:
row = next(df2.iterrows())[1]
row

int      1.0
float    1.5
Name: 0, dtype: float64

In [9]:
# check data types again
print(row['int'].dtype)
print(df2['int'].dtype)

float64
int64
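For comparison, the same one-row frame can be read with `itertuples()`. This is a minimal sketch that rebuilds `df2` so it runs on its own; the variable names `row_series` and `row_tuple` are only for illustration. The integer column should come back as an integer here, while `iterrows` hands it back as a float.

```python
import pandas as pd

# same one-row frame as the documentation example
df2 = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

# iterrows packs each row into a single Series, so the int is upcast to float
_, row_series = next(df2.iterrows())

# itertuples yields a tuple per row, so each value keeps its own type
row_tuple = next(df2.itertuples(index=False))

print(type(row_series['int']))
print(type(row_tuple[0]))
```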


In [10]:
count = 0
for i, s in df.iterrows():
    print(i)
    print('----')
#     print(s)
    print(s['review_score'])
    print('--------')
    count += 1
    if count == 5:
        break

0
----
51
--------
1
----
92
--------
2
----
14
--------
3
----
71
--------
4
----
60
--------


In [11]:
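# create function to work with iterrows()
def iterrows_loop(i, s):
    # i is the row index (unused); s is the row as a Series
    # look the score up by label, since positional s[0] is deprecated in newer pandas
    score = s['review_score']
    if score >= 0 and score <= 20:
        return 'One Star'
    elif score >= 21 and score <= 40:
        return 'Two Stars'
    elif score >= 41 and score <= 60:
        return 'Three Stars'
    elif score >= 61 and score <= 80:
        return 'Four Stars'
    else:
        return 'Five Stars'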

In [12]:
%%timeit

# create empty list
output_list = []

# iterate over rows
for i, s in df.iterrows():
    output_list.append(iterrows_loop(i, s))

# add list as new column
df['review_stars2'] = output_list

3.85 s ± 235 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Using Apply()

In [13]:
# create function to work with apply
def buckets(x):
    if x >= 0 and x <= 20:
        return 'One Star'
    elif x >= 21 and x <= 40:
        return 'Two Stars'
    elif x >= 41 and x <= 60:
        return 'Three Stars'
    elif x >= 61 and x <= 80:
        return 'Four Stars'
    else:
        return 'Five Stars'

In [14]:
%%timeit
df['review_stars3'] = df.apply(lambda row: buckets(row['review_score']), axis=1)

674 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
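Because `buckets` only needs one column, a possible variation is to call `apply` on that column (a Series) instead of on the whole DataFrame with `axis=1`. The sketch below uses a small hypothetical frame, `demo`, and only checks that both forms give the same labels. The Series form avoids building a row object for every record, so it is usually the quicker of the two, though the timings above do not cover it.

```python
import numpy as np
import pandas as pd

def buckets(x):
    if x <= 20:
        return 'One Star'
    elif x <= 40:
        return 'Two Stars'
    elif x <= 60:
        return 'Three Stars'
    elif x <= 80:
        return 'Four Stars'
    else:
        return 'Five Stars'

# small hypothetical frame of scores between 0 and 100
rng = np.random.default_rng(0)
demo = pd.DataFrame({'review_score': rng.integers(0, 101, size=1_000)})

# row-wise: the lambda receives each row as a Series
row_wise = demo.apply(lambda row: buckets(row['review_score']), axis=1)

# column-wise: buckets receives each score directly
column_wise = demo['review_score'].apply(buckets)

same_labels = row_wise.tolist() == column_wise.tolist()
print(same_labels)
```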


## Itertuples()

`itertuples()` iterates over DataFrame rows as named tuples.
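As a quick sketch of what each named tuple looks like, the example below uses a small hypothetical frame, `demo`. The first field is the index, followed by one field per column, and the fields can be read by position or by name.

```python
import pandas as pd

# small hypothetical frame for illustration
demo = pd.DataFrame({'review_score': [51, 92, 14]})

# default: a named tuple whose first field is the index
first = next(demo.itertuples())
print(first.Index, first.review_score, first[1])

# index=False drops the index field; name=None returns plain tuples
plain = next(demo.itertuples(index=False, name=None))
print(plain)
```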

In [15]:
count = 0

for row in df.itertuples():
#     print(row)
    print(row[1])
    print('----')
    count += 1
    if count == 5:
        break

51
----
92
----
14
----
71
----
60
----


In [16]:
# create function to work with itertuples()
def itertuples_loop(r):
    # r[0] is the index; r[1] is the first column (review_score)
    if r[1] >= 0 and r[1] <= 20:
        return 'One Star'
    elif r[1] >= 21 and r[1] <= 40:
        return 'Two Stars'
    elif r[1] >= 41 and r[1] <= 60:
        return 'Three Stars'
    elif r[1] >= 61 and r[1] <= 80:
        return 'Four Stars'
    else:
        return 'Five Stars'

In [17]:
%%timeit

# create empty list
output_list = []

# iterate using itertuples
for row in df.itertuples():
    output_list.append(itertuples_loop(row))

# create new column using list
df['review_stars4'] = output_list

107 ms ± 6.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [18]:
df

|  | review_score | review_stars | review_stars2 | review_stars3 | review_stars4 |
| --- | --- | --- | --- | --- | --- |
| 0 | 51 | Three Stars | Three Stars | Three Stars | Three Stars |
| 1 | 92 | Five Stars | Five Stars | Five Stars | Five Stars |
| 2 | 14 | One Star | One Star | One Star | One Star |
| 3 | 71 | Four Stars | Four Stars | Four Stars | Four Stars |
| 4 | 60 | Three Stars | Three Stars | Three Stars | Three Stars |
| ... | ... | ... | ... | ... | ... |
| 99995 | 26 | Two Stars | Two Stars | Two Stars | Two Stars |
| 99996 | 35 | Two Stars | Two Stars | Two Stars | Two Stars |
| 99997 | 87 | Five Stars | Five Stars | Five Stars | Five Stars |
| 99998 | 53 | Three Stars | Three Stars | Three Stars | Three Stars |


## np.where()

In [19]:
%%timeit

df['review_stars5'] = np.where((df['review_score'] >= 0) & (df['review_score'] <= 20), 'One Star', 
                         np.where((df['review_score'] >= 21) & (df['review_score'] <= 40), 'Two Stars',
                         np.where((df['review_score'] >= 41) & (df['review_score'] <= 60),'Three Stars', 
                         np.where((df['review_score'] >= 61) & (df['review_score'] <= 80),'Four Stars',
                         'Five Stars'))))

20.6 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
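Nested `np.where` calls get hard to read as the number of buckets grows. One alternative, not used or timed in this notebook, is `np.select`, which takes a list of conditions and a matching list of choices and uses the first condition that is true for each element. The sketch below assumes non-negative scores and uses a small hypothetical frame, `demo`.

```python
import numpy as np
import pandas as pd

# small hypothetical frame covering every bucket boundary
demo = pd.DataFrame({'review_score': [0, 20, 21, 40, 41, 60, 61, 80, 81, 100]})
score = demo['review_score'].to_numpy()

# conditions are checked in order, so each one only needs an upper bound
conditions = [score <= 20, score <= 40, score <= 60, score <= 80]
choices = ['One Star', 'Two Stars', 'Three Stars', 'Four Stars']

# anything that matches no condition falls through to the default
demo['review_stars'] = np.select(conditions, choices, default='Five Stars')
print(demo)
```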


## Pandas Vectorization

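Vectorization applies an operation to an entire column or array at once instead of looping over individual elements. For background, see: https://stackoverflow.com/questions/1422149/what-is-vectorization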
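As a small illustration of the idea, the sketch below compares a Python-level loop with a single vectorized comparison on a hypothetical array of five scores. Both produce the same answers, but the vectorized form hands the whole array to NumPy in one call.

```python
import numpy as np

# hypothetical sample of scores
scores = np.array([51, 92, 14, 71, 60])

# element by element: one Python-level comparison per score
looped = [bool(s >= 61) for s in scores]

# vectorized: one comparison over the whole array, returning a boolean array
vectorized = scores >= 61

print(looped)
print(vectorized)
```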

In [21]:
def pd_vector(score):
    df.loc[(score >= 0) & (score <= 20), 'review_scores6'] = 'One Star'
    df.loc[(score >= 21) & (score <= 40), 'review_scores6'] = 'Two Stars'
    df.loc[(score >= 41) & (score <= 60), 'review_scores6'] = 'Three Stars'
    df.loc[(score >= 61) & (score <= 80), 'review_scores6'] = 'Four Stars'
    df.loc[(score >= 81),'review_scores6'] = 'Five Stars'

In [22]:
%%timeit
pd_vector(df['review_score'])

4.7 ms ± 79.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## NumPy Vectorization

In [23]:
def pd_vector_array(score):
    df.loc[(score >= 0) & (score <= 20), 'review_scores7'] = 'One Star'
    df.loc[(score >= 21) & (score <= 40), 'review_scores7'] = 'Two Stars'
    df.loc[(score >= 41) & (score <= 60), 'review_scores7'] = 'Three Stars'
    df.loc[(score >= 61) & (score <= 80), 'review_scores7'] = 'Four Stars'
    df.loc[(score >= 81),'review_scores7'] = 'Five Stars'

In [24]:
%%timeit
pd_vector_array(df['review_score'].values)

3.65 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
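The only difference from the previous section is that the comparisons run on a plain NumPy array rather than on a Pandas Series. A minimal sketch with a hypothetical three-row frame, `demo`, shows that both produce an equivalent boolean mask and that `df.loc` accepts either one. It uses `.to_numpy()`, which returns the same kind of array as `.values` for a column like this.

```python
import numpy as np
import pandas as pd

# small hypothetical frame for illustration
demo = pd.DataFrame({'review_score': [14, 35, 87]})

as_series = demo['review_score']
as_array = demo['review_score'].to_numpy()

# the same comparison gives a boolean Series or a boolean ndarray
series_mask = (as_series >= 21) & (as_series <= 40)
array_mask = (as_array >= 21) & (as_array <= 40)

# .loc accepts the ndarray mask; rows outside the mask are left missing
demo.loc[array_mask, 'review_stars'] = 'Two Stars'
print(demo)
```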


## Binning

In [25]:
%%timeit
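# include_lowest=True keeps a score of 0 in the first bin; by default the lowest edge is excluded
df.loc[:, 'review_stars8'] = pd.cut(x=df['review_score'], bins=[0, 20, 40, 60, 80, 100],
                                    labels=['One Star', 'Two Stars', 'Three Stars', 'Four Stars', 'Five Stars'],
                                    include_lowest=True)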

3 ms ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
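One detail worth checking with `pd.cut` is how the bin edges are treated. Bins are closed on the right by default, so the first bin is (0, 20] and a score of exactly 0 falls outside it unless `include_lowest=True` is passed. The sketch below uses a hypothetical Series of boundary scores to show the difference.

```python
import pandas as pd

# hypothetical scores sitting on the bin edges
scores = pd.Series([0, 20, 21, 100])
bins = [0, 20, 40, 60, 80, 100]
labels = ['One Star', 'Two Stars', 'Three Stars', 'Four Stars', 'Five Stars']

# default: right-closed bins, lowest edge excluded
default_cut = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True makes the first bin include its left edge
inclusive_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```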


## View Final DataFrame

In [26]:
df

|  | review_score | review_stars | review_stars2 | review_stars3 | review_stars4 | review_stars5 | review_scores6 | review_scores7 | review_stars8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 51 | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars |
| 1 | 92 | Five Stars | Five Stars | Five Stars | Five Stars | Five Stars | Five Stars | Five Stars | Five Stars |
| 2 | 14 | One Star | One Star | One Star | One Star | One Star | One Star | One Star | One Star |
| 3 | 71 | Four Stars | Four Stars | Four Stars | Four Stars | Four Stars | Four Stars | Four Stars | Four Stars |
| 4 | 60 | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 26 | Two Stars | Two Stars | Two Stars | Two Stars | Two Stars | Two Stars | Two Stars | Two Stars |
| 99996 | 35 | Two Stars | Two Stars | Two Stars | Two Stars | Two Stars | Two Stars | Two Stars | Two Stars |
| 99997 | 87 | Five Stars | Five Stars | Five Stars | Five Stars | Five Stars | Five Stars | Five Stars | Five Stars |
| 99998 | 53 | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars | Three Stars |
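Eyeballing the first and last rows is a reasonable spot check, but with 100,000 rows a programmatic comparison is more reliable. The sketch below uses a tiny hypothetical frame, `demo`, with three label columns and counts the distinct labels in each row; a count of 1 everywhere means every method agreed. Casting to `str` first lets categorical columns (such as the output of `pd.cut`) be compared with plain string columns.

```python
import pandas as pd

# tiny hypothetical stand-in for the final DataFrame
demo = pd.DataFrame({
    'review_score': [51, 92, 14],
    'review_stars': ['Three Stars', 'Five Stars', 'One Star'],
    'review_stars2': ['Three Stars', 'Five Stars', 'One Star'],
    'review_scores6': ['Three Stars', 'Five Stars', 'One Star'],
})

def labels_agree(frame):
    """Return True if every label column holds the same value in each row."""
    label_cols = [c for c in frame.columns if c != 'review_score']
    distinct_per_row = frame[label_cols].astype(str).nunique(axis=1)
    return bool(distinct_per_row.eq(1).all())

all_agree = labels_agree(demo)

# a deliberately altered copy shows that a mismatch is detected
broken = demo.copy()
broken.loc[0, 'review_stars2'] = 'One Star'
broken_agree = labels_agree(broken)

print(all_agree, broken_agree)
```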
