# Making Pandas faster by vectorization

Based on [Rob Mulla (2022) Make Your Pandas Code Lightning Fast](https://youtu.be/SAFmrTnEHLg).

## Problem:

Apply a logical condition across every row of a DataFrame.
Assign values to the column based on the logical value of the condition.

**Solutions:**

*Solution 1: **Looping:***

- Define a function with the logic for the condition for each row.
- Loop over the rows of the DataFrame. For each row, evaluate the condition and assign a value to a cell of the DataFrame based on the result.

*Solution 2: **Vectorization:***
- Apply the logical conditions to the whole DataFrame.
- Assign values from a Series to a column if the condition is satisfied.


## Example

The problem:

Given a population in which each person has the characteristics:
`age`, `time_in_bed`, `percent_sleeping`, `favorite_food`, `hate_food`

create a new column, named `reward`, that holds either their `favorite_food` or their `hate_food`, depending on the logical value of the `condition`.

Condition for reward:
- If a person is in bed for more than 1 hour and slept more than 10 % of that time, give the person their favorite food; otherwise, give them their hate food.

- If the person is 90 years old or older, give the person their favorite food regardless of sleep.

The condition for reward is equivalent to:

    IF  (they were in bed for more than 1 hour
         AND they slept for more than 10 %)
    OR
        they are 90 years old or older,
    THEN
        give them their favorite food.
    ELSE
        give them their hate food.
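
To make the grouping of `AND` and `OR` concrete, here is a minimal sketch of the condition for a single person. The function name `gets_favorite_food` and the three sample people are hypothetical and not part of the notebook; the age threshold `>= 90` follows the notebook's own code in the solutions below.

```python
def gets_favorite_food(age, time_in_bed, percent_sleeping):
    """Return True if the person is rewarded with their favorite food."""
    # `and` binds tighter than `or`; the parentheses only make that explicit.
    return (time_in_bed > 1 and percent_sleeping > 0.1) or age >= 90


# Hypothetical sample people, chosen to hit each branch of the condition.
print(gets_favorite_food(age=30, time_in_bed=8, percent_sleeping=0.5))  # True
print(gets_favorite_food(age=30, time_in_bed=1, percent_sleeping=0.5))  # False
print(gets_favorite_food(age=95, time_in_bed=0, percent_sleeping=0.0))  # True
```

The third person gets the favorite food on age alone, even though neither sleep criterion is met.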

In [1]:
import numpy as np
import pandas as pd

size = 10_000   # number of rows (people) in the DataFrame

## Generate DataFrame

Give random values to the elements of the DataFrame. 

In [2]:
def generate_data(size=10_000):
    df = pd.DataFrame()
    df['age'] = np.random.randint(0, 100, size)
    df['time_in_bed'] = np.random.randint(0, 9, size)
    df['percent_sleeping'] = np.random.rand(size)
    df['favorite_food'] = np.random.choice(
        ['+pizza', '+tacos', '+ice-cream'], size)
    df['hate_food'] = np.random.choice(
        ['-broccoli', '-potato', '-eggs'], size)
    return df

In [3]:
df = generate_data(size)

## Solution 1: Looping

Create a function for the condition to reward each person. The function gets the data for a person and returns `favorite_food` or `hate_food` based on the logical value of the `condition`.

In [4]:
def reward(person):
    condition = ((person['time_in_bed'] > 1
                  ) and (person['percent_sleeping'] > 0.1)
                 ) or person['age'] >= 90
    if condition:
        return person['favorite_food']
    else:
        return person['hate_food']

In [5]:
df_loop = df.copy()   # independent copy, so the loop does not modify df

Loop over each row of the DataFrame and apply the condition given in the function.

For each row, assign the result to a cell of the DataFrame.

Time the duration needed to run the code.

In [6]:
%%timeit
for index, person in df_loop.iterrows():
    df_loop.loc[index, 'reward'] = reward(person)

2.82 s ± 63.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
df_loop.head()

|   | age | time_in_bed | percent_sleeping | favorite_food | hate_food | reward |
|---|---|---|---|---|---|---|
| 0 | 73 | 2 | 0.297605 | +tacos | -broccoli | +tacos |
| 1 | 54 | 5 | 0.310736 | +ice-cream | -potato | +ice-cream |
| 2 | 61 | 4 | 0.700914 | +pizza | -eggs | +pizza |
| 3 | 99 | 4 | 0.327881 | +tacos | -potato | +tacos |
| 4 | 52 | 5 | 0.641889 | +ice-cream | -potato | +ice-cream |


## Solution 2: Vectorization

Instead of looping over each row,
apply the logical conditions to the whole DataFrame.

Then fill the column `reward` with `hate_food` as the default and overwrite it with the values of `favorite_food` in the rows where the `condition` is satisfied.

Time the duration needed to run the code.

In [8]:
df_vector = df.copy()   # independent copy, not a second name for df

In [9]:
%%timeit
condition = ((df_vector['time_in_bed'] > 1
              ) & (df_vector['percent_sleeping'] > 0.1)
             ) | (df_vector['age'] >= 90)

df_vector['reward'] = df_vector['hate_food']
df_vector.loc[condition, 'reward'] = df_vector['favorite_food']

1.53 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
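
The same vectorized selection can also be written in one step with `np.where`, which picks element-wise from two arrays according to a boolean mask. This is an alternative formulation, not the one used in the notebook, and the three-row `toy` DataFrame below is a hypothetical example made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same columns as the notebook's DataFrame.
toy = pd.DataFrame({
    'age': [30, 95, 40],
    'time_in_bed': [8, 0, 1],
    'percent_sleeping': [0.5, 0.0, 0.9],
    'favorite_food': ['+pizza', '+tacos', '+ice-cream'],
    'hate_food': ['-broccoli', '-potato', '-eggs'],
})

# Boolean Series with one entry per row; note `&` and `|` instead of `and`/`or`.
condition = ((toy['time_in_bed'] > 1) & (toy['percent_sleeping'] > 0.1)
             ) | (toy['age'] >= 90)

# Take favorite_food where the condition holds, hate_food elsewhere.
toy['reward'] = np.where(condition, toy['favorite_food'], toy['hate_food'])
print(toy['reward'].tolist())  # ['+pizza', '+tacos', '-eggs']
```

Whether this is faster than the `.loc` assignment is not measured here; the point is only that both operate on whole columns at once.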


In [10]:
df_vector.head()

|   | age | time_in_bed | percent_sleeping | favorite_food | hate_food | reward |
|---|---|---|---|---|---|---|
| 0 | 73 | 2 | 0.297605 | +tacos | -broccoli | +tacos |
| 1 | 54 | 5 | 0.310736 | +ice-cream | -potato | +ice-cream |
| 2 | 61 | 4 | 0.700914 | +pizza | -eggs | +pizza |
| 3 | 99 | 4 | 0.327881 | +tacos | -potato | +tacos |
| 4 | 52 | 5 | 0.641889 | +ice-cream | -potato | +ice-cream |


## Compare the time needed for each solution

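For the 10,000-row DataFrame timed above, the vectorized code takes about 1.5 ms and the loop about 2.8 s, so vectorization is roughly 1,800 times faster. The exact ratio depends on the `size` of the DataFrame and on the machine: for sizes of the order of thousands or tens of thousands of rows, vectorization is hundreds to thousands of times faster than looping.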
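
Outside a notebook, where `%%timeit` is not available, the comparison can be approximated with `time.perf_counter`. The sketch below is an assumption-laden illustration, not the notebook's measurement: the helper names `loop_version` and `vector_version`, the seed, and the smaller size of 2,000 rows are all hypothetical choices. The loop also collects results in a list instead of writing cell by cell, and a single timed run is much noisier than `%%timeit`.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed, chosen only for repeatability
n = 2_000                        # smaller than the notebook's 10_000 rows
df = pd.DataFrame({
    'age': rng.integers(0, 100, n),
    'time_in_bed': rng.integers(0, 9, n),
    'percent_sleeping': rng.random(n),
    'favorite_food': rng.choice(['+pizza', '+tacos', '+ice-cream'], n),
    'hate_food': rng.choice(['-broccoli', '-potato', '-eggs'], n),
})


def reward(person):
    """Row-wise rule, same logic as the notebook's `reward` function."""
    condition = (person['time_in_bed'] > 1
                 and person['percent_sleeping'] > 0.1) or person['age'] >= 90
    return person['favorite_food'] if condition else person['hate_food']


def loop_version(data):
    """Evaluate the rule one row at a time with iterrows."""
    out = data.copy()
    out['reward'] = [reward(person) for _, person in data.iterrows()]
    return out


def vector_version(data):
    """Evaluate the rule on whole columns at once."""
    out = data.copy()
    condition = ((out['time_in_bed'] > 1) & (out['percent_sleeping'] > 0.1)
                 ) | (out['age'] >= 90)
    out['reward'] = out['hate_food']
    out.loc[condition, 'reward'] = out['favorite_food']
    return out


start = time.perf_counter()
looped = loop_version(df)
t_loop = time.perf_counter() - start

start = time.perf_counter()
vectorized = vector_version(df)
t_vec = time.perf_counter() - start

same = looped['reward'].tolist() == vectorized['reward'].tolist()
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s, same result: {same}")
```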

## Check if the two DataFrames are equal

In [11]:
print("Are the two DataFrames equal? ", df_loop.equals(df_vector))

Are the two DataFrames equal?  True
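
A comparison like this is only informative if the two DataFrames are separate objects. In pandas, plain assignment binds a second name to the same object, whereas `.copy()` creates an independent DataFrame. A minimal sketch with a hypothetical `toy` frame (the names `alias` and `independent` are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3]})
alias = toy                # second name for the same object
independent = toy.copy()   # separate object with its own data

alias['y'] = 0             # the new column is also visible through `toy`

print(alias is toy)                 # True
print('y' in toy.columns)           # True
print('y' in independent.columns)   # False
```

If both solutions write to the same underlying object, `equals` compares that object with itself and returns `True` regardless of what either solution computed.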
