## Better Than Loops: Iterating is slow – Vectorization is fast

###### Imagine a homework problem that says: write a function that awards a bonus to a person based on these conditions:

1. If the average sleeping time is ≥ 6 hours and the annual income is between $5,000 and $10,000 (inclusive), they get a bonus of 15% of their current income.
2. If their age is between 60 and 90 (inclusive), they get a bonus of 20% if they are female and 18% if they are male.
3. In addition to any bonus earned above, everyone gets a 10% bonus.
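
As a quick check of the arithmetic, here is a minimal sketch that applies the three rules to one made-up person. The helper name `bonus_for` and the sample values are assumptions for illustration, not part of the homework statement.

```python
def bonus_for(age, avg_sleeping, gender, annual_income):
    """Hypothetical scalar helper: apply the three rules to one person."""
    bonus = 0
    # Rule 1: at least 6 hours of sleep and income in the 5,000-10,000 band -> 15%
    if avg_sleeping >= 6 and 5000 <= annual_income <= 10000:
        bonus += annual_income * 15 / 100
    # Rule 2: age 60-90 -> 20% for women, 18% for men
    if 60 <= age <= 90:
        if gender == 'Female':
            bonus += annual_income * 20 / 100
        elif gender == 'Male':
            bonus += annual_income * 18 / 100
    # Rule 3: everyone gets a further 10%
    bonus += annual_income * 10 / 100
    return bonus

# A 65-year-old woman who sleeps 7 hours and earns 8,000:
# 15% + 20% + 10% = 45% of 8,000
example = bonus_for(age=65, avg_sleeping=7, gender='Female', annual_income=8000)
print(example)
```

All three rules fire for this person, so the bonus is 45% of the income.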

Start with a cell that builds a dataframe of made-up data on age, average sleeping time, gender, annual income, phone number, and favorite food. Make 10 million rows of data.

In [None]:
! python --version

In [None]:
import pandas as pd
import numpy as np

def generate_df(size):
    
    df = pd.DataFrame()
    df['age'] = np.random.randint(1,100,size)
    df['avg_sleeping'] = np.random.randint(1,24, size)
    df['gender'] = np.random.choice(['Male','Female'], size)
    df['annual_income'] = np.random.randint(1000,100000, size)
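    # Request 64-bit integers explicitly: these bounds do not fit in a 32-bit int,
    # which is the default integer type on some platforms.
    df['phone_number'] = np.random.randint(1_111_111_111, 9_999_999_999, size, dtype=np.int64)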
    df['favorite_food'] = np.random.choice(['pizza', 'burger', 'chips', 'nachos'], size)
    
    return df

df = generate_df(10_000_000) # 10 million rows.
df.info()

###### We see that the dataframe is about 460 megabytes. Now write a reward function that takes a row as input and returns the bonus as output:


In [None]:
def reward_function(row):
    total_bonus = 0

    # Rule 1: at least 6 hours of sleep and income between 5,000 and 10,000 -> 15%
    if (row['avg_sleeping'] >= 6) and (5000 <= row['annual_income'] <= 10000):
        total_bonus += row['annual_income'] * 15/100

    # Rule 2: age between 60 and 90 -> 20% for women, 18% for men
    if (60 <= row['age'] <= 90) and row['gender'] == 'Female':
        total_bonus += row['annual_income'] * 20/100

    elif (60 <= row['age'] <= 90) and row['gender'] == 'Male':
        total_bonus += row['annual_income'] * 18/100

    # Rule 3: everyone gets an extra 10%
    total_bonus += row['annual_income'] * 10/100

    return total_bonus

# Here's a wrapper function which will make it easier to time the use of the function:

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped
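
The standard library's `functools.partial` does the same job as this wrapper, since `timeit.timeit` accepts any zero-argument callable. A small sketch, using a made-up `add` function as a stand-in for the real workload:

```python
import timeit
from functools import partial

def add(a, b):
    # Hypothetical stand-in for the function being timed.
    return a + b

def wrapper(func, *args, **kwargs):
    # Same wrapper as in the notebook: freeze the arguments into a no-argument callable.
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

wrapped_add = wrapper(add, 2, b=3)
partial_add = partial(add, 2, b=3)

# Both callables take no arguments and return the same result.
assert wrapped_add() == partial_add() == 5

# timeit.timeit returns the total number of seconds for `number` calls.
elapsed = timeit.timeit(partial_add, number=1000)
print(f'1000 calls took {elapsed:.6f} s')
```

Either form works for the timings in this notebook; the hand-written wrapper just makes the mechanism explicit.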

##### Now let's compare the speed of applying this function to every row. We can do it three ways. Start with the simplest idea, a for loop over the rows, and look at the timings for up to a million rows:

In [None]:
def loop_function(size):
    df = generate_df(size)
    for idx, row in df.iterrows():
        df.loc[idx, 'bonus'] = reward_function(row)
        
    return df

import timeit

sizes = ['10','50', '100','150','1_000','1_500','10_000','15_000','100_000','150_000','1_000_000']
time_loop = []

for size in sizes:
    
    size = int(size)
    
    wrap = wrapper(loop_function, size)
    n = timeit.timeit(wrap, number = 10)
    
    time_loop.append(n)
    
    
    print(f'Size: {size} | Time: {n}')

#### Note that it takes about 
##### Now try something a little faster than a for loop: the DataFrame method `apply`. With `axis=1` it calls the function once per row for us, so we no longer write the loop or the per-row `.loc` assignment ourselves.

In [None]:
def apply_function(size):
    df = generate_df(size)
    df['bonus'] = df.apply(reward_function, axis=1)  # same column name as in loop_function
    return df

import timeit

sizes = ['10','50', '100','150','1_000','1_500','10_000','15_000','100_000','150_000','1_000_000']
time_apply = []

for size in sizes:
    
    size = int(size)
    
    wrap = wrapper(apply_function, size)
    n = timeit.timeit(wrap, number = 10)
    
    time_apply.append(n)
    
    
    print(f'Size: {size} | Time: {n}')

##### Notice that the apply function is ~51x faster than a loop.
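
One way to see why `apply` with `axis=1` is still row-by-row work: the function receives each row as a pandas Series. The sketch below uses a tiny made-up dataframe and a hypothetical `record_type` helper to record what `apply` passes in.

```python
import pandas as pd

# Tiny made-up dataframe, just to inspect what apply hands to the function.
small = pd.DataFrame({'age': [25, 70], 'annual_income': [8000, 50000]})
seen_types = []

def record_type(row):
    # Hypothetical helper: remember the type of each argument, then return 10% of income.
    seen_types.append(type(row).__name__)
    return row['annual_income'] * 10 / 100

result = small.apply(record_type, axis=1)

print(set(seen_types))   # the type(s) of the objects passed to record_type
print(result.tolist())   # one value per row
```

Building a Series for every row has a cost of its own, which is part of what the next method avoids.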

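#### We can try a third method: vectorization. `np.vectorize` wraps a scalar function so that it accepts whole arrays and returns an array. NumPy's documentation notes that it is provided mainly for convenience and is implemented essentially as a for loop, so it does not run in parallel; the speedup here comes from passing plain column values to the function instead of building a pandas Series for every row.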

In [None]:
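def reward_function_part(avg_sleeping, annual_income, gender, age):
    # Scalar version of reward_function: takes plain values instead of a row.
    total_bonus = 0

    if (avg_sleeping >= 6) and (5000 <= annual_income <= 10000):
        total_bonus += annual_income * 15/100

    if (60 <= age <= 90) and gender == 'Female':
        total_bonus += annual_income * 20/100
    elif (60 <= age <= 90) and gender == 'Male':
        total_bonus += annual_income * 18/100

    total_bonus += annual_income * 10/100

    return total_bonus

def vectorize_function(size):
    df = generate_df(size)
    # np.vectorize calls reward_function_part once per element of the input columns.
    df['bonus'] = np.vectorize(reward_function_part)(df['avg_sleeping'], df['annual_income'], df['gender'], df['age'])
    return df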

import timeit

sizes = ['10','50', '100','150','1_000','1_500','10_000','15_000','100_000','150_000','1_000_000']
time_vector = []

for size in sizes:
    
    size = int(size)
    
    wrap = wrapper(vectorize_function, size)
    n = timeit.timeit(wrap, number = 10)
    
    time_vector.append(n)
    
    
    print(f'Size: {size} | Time: {n}')

#### In these timings, `np.vectorize` turns out to be 120x faster than the explicit loop and 24x faster than `apply`!
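
For comparison, the same rules can also be written directly as whole-column operations with boolean masks, so that no Python function is called per row. This is a sketch rather than part of the timings above: `reward_columns` and the four sample people are made up for illustration, and rule 1 uses the 15% stated in the problem.

```python
import numpy as np
import pandas as pd

def reward_columns(df):
    """Hypothetical column-wise version of the reward rules (not from the original notebook)."""
    income = df['annual_income']

    # Rule 1: boolean mask for "sleeps >= 6 hours and income in [5000, 10000]".
    rule1 = ((df['avg_sleeping'] >= 6) & income.between(5000, 10000)).to_numpy()

    # Rule 2: age in [60, 90]; the rate depends on gender.
    senior = df['age'].between(60, 90)
    female_senior = (senior & (df['gender'] == 'Female')).to_numpy()
    male_senior = (senior & (df['gender'] == 'Male')).to_numpy()
    rule2_rate = np.select([female_senior, male_senior], [20, 18], default=0)

    # Rule 3: flat 10% for everyone, plus whatever rules 1 and 2 contribute.
    rate = 10 + 15 * rule1 + rule2_rate

    return income * rate / 100

# Four made-up people covering the different rule combinations.
people = pd.DataFrame({
    'age': [65, 70, 30, 95],
    'avg_sleeping': [7, 5, 8, 6],
    'gender': ['Female', 'Male', 'Male', 'Female'],
    'annual_income': [8000, 50000, 6000, 20000],
})
people['bonus'] = reward_columns(people)
print(people)
```

Here each comparison and multiplication runs over an entire column inside NumPy and pandas, which is what "vectorized" usually means in this context.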