# Making Pandas faster by vectorization

Based on [Make Your Pandas Code Lightning Fast](https://youtu.be/SAFmrTnEHLg).

## Problem:

Apply a logical condition across every row of a DataFrame.
Assign the result to a new column.

**Solutions:**

*Level 1: **Looping:***

- Define a function that holds the reward logic for a single row (i.e., one person).
- Loop over the rows of the DataFrame, call the function on each row, and assign the result to that row's cell in the new column.

*Level 2: **Vectorization:***

- Evaluate the logical condition on whole columns at once, which yields one boolean per row.
- Assign the default value to the entire new column.
- Overwrite the rows where the condition is true with the alternative value.


## Example

The problem:

Given a population in which each person has the following characteristics:
`age`, `time_in_bed`, `percent_sleeping`, `favorite_food`, `hate_food`

create a new column holding either their `favorite_food` or their `hate_food` as a reward.

Reward logic:

    IF  (they were in bed for more than 1 hour
         AND if they slept for more than 10 %)
    OR
        if they are 90 years old or older,
    THEN
        give them their favorite food.
    ELSE
        give them their hate food.
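
To see how the AND/OR grouping behaves before touching any DataFrame, the rule can be sketched as a plain Python function and tried on a few hand-picked people. The function name `gets_favorite` and the sample values are hypothetical; the age threshold follows the `age >= 90` comparison used in the notebook's code.

```python
def gets_favorite(age, time_in_bed, percent_sleeping):
    """Return True when the person earns their favorite food (hypothetical helper)."""
    # percent_sleeping is a fraction, so 10 % is written as 0.1
    slept_enough = time_in_bed > 1 and percent_sleeping > 0.1
    return slept_enough or age >= 90


# Hand-picked cases (illustrative values, not taken from the generated data)
cases = [
    (30, 8, 0.50),  # long time in bed and slept well
    (30, 8, 0.05),  # long time in bed but barely slept
    (95, 0, 0.00),  # age alone qualifies
    (30, 1, 0.90),  # exactly 1 hour in bed does not count as "more than 1"
]
results = [gets_favorite(*case) for case in cases]
print(results)  # [True, False, True, False]
```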

In [1]:
import numpy as np
import pandas as pd

size = 10_000  # number of rows (people) in the DataFrame

## Generate DataFrame

In [2]:
def generate_data(size=10_000):
    df = pd.DataFrame()
    df['age'] = np.random.randint(0, 100, size)
    df['time_in_bed'] = np.random.randint(0, 9, size)
    df['percent_sleeping'] = np.random.rand(size)
    df['favorite_food'] = np.random.choice(
        ['+pizza', '+tacos', '+ice-cream'], size)
    df['hate_food'] = np.random.choice(
        ['-broccoli', '-potato', '-eggs'], size)
    return df
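
`generate_data` draws from NumPy's global random state, so each call returns different rows unless the global seed is set. When repeatable data is wanted, a seeded variant might look like the sketch below. The name `generate_data_seeded` and its `seed` parameter are assumptions added for illustration, and `numpy.random.default_rng` produces a different random stream than the `np.random.*` calls above, so the values will not match the notebook's.

```python
import numpy as np
import pandas as pd


def generate_data_seeded(size=10_000, seed=0):
    """Hypothetical seeded variant of generate_data: same columns, repeatable values."""
    rng = np.random.default_rng(seed)  # local generator; global state is untouched
    return pd.DataFrame({
        'age': rng.integers(0, 100, size),        # integers 0..99
        'time_in_bed': rng.integers(0, 9, size),  # integers 0..8
        'percent_sleeping': rng.random(size),     # floats in [0, 1)
        'favorite_food': rng.choice(['+pizza', '+tacos', '+ice-cream'], size),
        'hate_food': rng.choice(['-broccoli', '-potato', '-eggs'], size),
    })


df_a = generate_data_seeded(size=5, seed=42)
df_b = generate_data_seeded(size=5, seed=42)
print(df_a.equals(df_b))  # True: same seed, same data
```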

## Level 1 Looping

Create a function that holds the reward logic. It receives the data for one person and returns that person's `favorite_food` or `hate_food`, depending on `condition`:

    IF  (they were in bed for more than 1 hour
         AND if they slept for more than 10 %)
    OR
        if they are 90 years old or older,
    THEN
        give them their favorite food.
    ELSE
        give them their hate food

In [3]:
def reward(person):
    # Favorite food if they rested enough or are 90 or older; hate food otherwise.
    condition = (
        (person['time_in_bed'] > 1) and (person['percent_sleeping'] > 0.1)
    ) or (person['age'] >= 90)
    if condition:
        return person['favorite_food']
    return person['hate_food']

In [4]:
df = generate_data(size)
df_loop = df.copy()  # independent copy, so the loop does not modify df itself

Loop over each row of `df_loop` and apply the `reward` function to it.
Assign each result to that row's cell in the new `reward` column.

Note the execution time.

In [5]:
%%timeit
for index, person in df_loop.iterrows():
    df_loop.loc[index, 'reward'] = reward(person)

2.98 s ± 183 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
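
`%%timeit` is an IPython cell magic, so it is not available in a plain Python script. A rough way to time the same loop outside a notebook is sketched below with `time.perf_counter`. The frame `people`, its size `n`, and the fixed food labels are hypothetical, and the `reward` column is pre-created here (unlike in the cell above) so that each step only writes one cell. The measured time depends on the machine and is not comparable to the figure above.

```python
import time

import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values; only the columns the rule needs)
n = 500
rng = np.random.default_rng(0)
people = pd.DataFrame({
    'age': rng.integers(0, 100, n),
    'time_in_bed': rng.integers(0, 9, n),
    'percent_sleeping': rng.random(n),
    'favorite_food': '+pizza',
    'hate_food': '-eggs',
})
people['reward'] = None  # assumption of this sketch: create the column up front


def reward(person):
    # Same rule as the notebook's reward function
    rested = person['time_in_bed'] > 1 and person['percent_sleeping'] > 0.1
    if rested or person['age'] >= 90:
        return person['favorite_food']
    return person['hate_food']


start = time.perf_counter()
for index, person in people.iterrows():
    people.loc[index, 'reward'] = reward(person)
elapsed = time.perf_counter() - start

print(f"loop over {n} rows: {elapsed:.3f} s")
```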


## Level 2 Vectorization

Instead of looping over each row,
apply the logical conditions to whole columns of the DataFrame at once.
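
A minimal sketch of what "whole columns at once" means, using four hypothetical people chosen to hit each branch of the rule. The frame `people` and the intermediate names `in_bed`, `asleep`, and `elderly` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Four hypothetical people, one per branch of the rule
people = pd.DataFrame({
    'age': [30, 30, 95, 30],
    'time_in_bed': [8, 8, 0, 1],
    'percent_sleeping': [0.50, 0.05, 0.00, 0.90],
})

# Each comparison runs on a whole column and returns a boolean Series.
in_bed = people['time_in_bed'] > 1
asleep = people['percent_sleeping'] > 0.1
elderly = people['age'] >= 90

# & and | combine Series element-wise; Python's `and`/`or` would raise a
# ValueError here. When comparisons are written inline, each one needs its
# own parentheses, because & and | bind tighter than > and >=.
condition = (in_bed & asleep) | elderly

print(condition.tolist())  # [True, False, True, False]
```

The resulting boolean Series is what `.loc[condition, ...]` uses to select the rows to overwrite.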

In [6]:
df_vector = df.copy()  # independent copy of the data for the vectorized version

In [7]:
%%timeit
condition = (
    (df_vector['time_in_bed'] > 1) & (df_vector['percent_sleeping'] > 0.1)
) | (df_vector['age'] >= 90)

# Default for everyone, then overwrite the rows that meet the condition
df_vector['reward'] = df_vector['hate_food']
df_vector.loc[condition, 'reward'] = df_vector['favorite_food']

1.31 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
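
The default-then-overwrite assignment can also be written as a single expression. The sketch below swaps in `numpy.where`, which is a different technique from the one timed above, so no speed claim is made for it. The three-row frame `people` is hypothetical.

```python
import numpy as np
import pandas as pd

# Three hypothetical people
people = pd.DataFrame({
    'age': [30, 30, 95],
    'time_in_bed': [8, 8, 0],
    'percent_sleeping': [0.50, 0.05, 0.00],
    'favorite_food': ['+pizza', '+tacos', '+ice-cream'],
    'hate_food': ['-eggs', '-potato', '-eggs'],
})

condition = (
    (people['time_in_bed'] > 1) & (people['percent_sleeping'] > 0.1)
) | (people['age'] >= 90)

# np.where(mask, a, b) picks from a where mask is True and from b elsewhere
people['reward'] = np.where(condition, people['favorite_food'], people['hate_food'])

print(people['reward'].tolist())  # ['+pizza', '-potato', '+ice-cream']
```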


## Check that the two DataFrames are equal

In [8]:
print("Are the two DataFrames equal? ", df_loop.equals(df_vector))

Are the two DataFrames equal?  True
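
`DataFrame.equals` returns `True` trivially when both names refer to the same object, so the check is only informative when the two results live in separate frames. A self-contained sketch of such a check follows. The frame `base`, its size `n`, and the list-comprehension form of the loop are assumptions made to keep the example short. `pandas.testing.assert_frame_equal` is used because it reports where two frames differ instead of returning a bare boolean.

```python
import numpy as np
import pandas as pd

# Small illustrative data set (hypothetical size and seed)
n = 200
rng = np.random.default_rng(0)
base = pd.DataFrame({
    'age': rng.integers(0, 100, n),
    'time_in_bed': rng.integers(0, 9, n),
    'percent_sleeping': rng.random(n),
    'favorite_food': rng.choice(['+pizza', '+tacos', '+ice-cream'], n),
    'hate_food': rng.choice(['-broccoli', '-potato', '-eggs'], n),
})


def reward(person):
    # Same rule as the notebook's reward function
    rested = person['time_in_bed'] > 1 and person['percent_sleeping'] > 0.1
    if rested or person['age'] >= 90:
        return person['favorite_food']
    return person['hate_food']


# Row-by-row result, computed on its own copy
df_loop = base.copy()
df_loop['reward'] = [reward(person) for _, person in df_loop.iterrows()]

# Vectorized result, computed on a second, independent copy
df_vector = base.copy()
condition = (
    (df_vector['time_in_bed'] > 1) & (df_vector['percent_sleeping'] > 0.1)
) | (df_vector['age'] >= 90)
df_vector['reward'] = df_vector['hate_food']
df_vector.loc[condition, 'reward'] = df_vector['favorite_food']

assert df_loop is not df_vector                    # two distinct objects
pd.testing.assert_frame_equal(df_loop, df_vector)  # raises if any cell differs
print("loop and vectorized results match")
```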
