In [2]:
import numpy as np 
import pandas as pd
import time 
import random 

Vectorisation means replacing explicit Python loops with high-level operations that act on whole collections (such as arrays or DataFrame columns) at once.  
These operations are handled by libraries like NumPy and Pandas, which run the loop in compiled code rather than in the Python interpreter, so the code is usually both faster and shorter.

For example:

In [3]:
# Instead of this: a Python-level loop over a list
numbers =[1,2,3,4,5]
doubled = [x*2 for x in numbers]
doubled

[2, 4, 6, 8, 10]

In [4]:
# do this - more efficient

numbers = np.array([1,2,3,4,5])
#numbers = np.array(range(6))
doubled = numbers *2
doubled

array([ 2,  4,  6,  8, 10])

**Why is vectorisation important?**
1. Performance: often much faster than an explicit loop, sometimes up to 100x
2. Clarity: code is more concise and readable
3. Scalability: operations over thousands of rows run in milliseconds, which is useful for the large datasets we deal with at LBG
4. Expressiveness: write what to do, not how to loop over the data
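
The performance point can be checked directly. The sketch below is a minimal illustration rather than a benchmark: the array size and repeat count are arbitrary assumptions, and the speed-up you see will depend on the machine and the operation.

```python
import timeit

import numpy as np

# Arbitrary illustrative size: 100,000 integers
values = list(range(100_000))
array = np.arange(100_000)

# Time 20 repetitions of each approach
loop_time = timeit.timeit(lambda: [x * 2 for x in values], number=20)
vec_time = timeit.timeit(lambda: array * 2, number=20)

print(f"List comprehension: {loop_time:.4f} s")
print(f"NumPy multiply:     {vec_time:.4f} s")

# Both approaches produce the same values
same_result = [x * 2 for x in values] == (array * 2).tolist()
print(f"Same result: {same_result}")
```

`timeit` repeats the call several times, which gives a steadier figure than a single `time.time()` difference.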

## **Vectorisation in Pandas**
- Pandas is built on top of NumPy.
- It supports efficient, compiled operations over **entire datasets**.
- Vectorised Pandas methods perform better than row-by-row alternatives.
- For example, `.apply()` with `axis=1` and explicit loops operate row by row, which is fine for small data but scales poorly.

In [5]:

df = pd.DataFrame({
    'drink': np.random.choice(['Latte','Espresso','Cold Brew','Matcha','Chai'],size =1000000),
    'size':np.random.choice(['Small','Medium','Large'],size = 1000000), 
    'base_price':np.random.uniform(2,6,size=1000000).round(2), 
    'tip':np.random.uniform(0,2,size=1000000).round(2), 
    'peak_hours':np.random.choice([True,False],size=1000000)
})

In [6]:


number = round(random.uniform(1,10),2)
number

6.07

### Value mapping with `.map()`
It's common to use `.apply()` for conditionals and lookups.  
A more efficient, vectorised approach is to use `.map()` with a dictionary.

In [8]:
# Non vectorised code
start = time.time()

def adjusted_price(row):
    if row['size'] =='Medium':
        return row ['base_price'] + 0.5
    elif row['size'] == 'Large':
        return row['base_price'] + 1.0 
    return row['base_price']

df['price_apply'] = df.apply(adjusted_price,axis =1)

print(f'Non Vectorised time: {time.time()-start} Seconds') # This calculates the elapsed time by subtracting the starting timestamp from the current timestamp

Non Vectorised time: 12.158571481704712 Seconds
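
`time.time()` is fine for rough comparisons like the one above. As an optional alternative, `time.perf_counter()` is the standard-library clock intended for measuring short intervals. The sketch below wraps it in a small context manager; the `timed` helper and the `timings` dictionary are hypothetical names introduced only for this illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock time of the with-block in results[label]."""
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start


timings = {}  # hypothetical container for the measurements

with timed('sum of squares', timings):
    total = sum(i * i for i in range(1000))

print(f"sum of squares: {timings['sum of squares']:.6f} seconds")
```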


In [9]:
#Vectorised code 

start = time.time()

df['price_vec'] = df['base_price'] + df['size'].map({
    'Small':0, 
    'Medium':0.5,
    'Large': 1.0
})

print(f'Vectorised time: {time.time()-start} seconds')

Vectorised time: 0.22067832946777344 seconds
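
One thing to watch with `.map()` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. The sketch below uses a small made-up Series (the `'Extra Large'` size is an assumption, not part of the dataset above) to show the behaviour and one possible way to handle it with `fillna()`.

```python
import pandas as pd

# 'Extra Large' is a hypothetical value with no entry in the lookup
sizes = pd.Series(['Small', 'Medium', 'Large', 'Extra Large'])
surcharge = {'Small': 0.0, 'Medium': 0.5, 'Large': 1.0}

mapped = sizes.map(surcharge)              # unmapped value becomes NaN
filled = sizes.map(surcharge).fillna(0.0)  # treat unmapped sizes as no surcharge

print(mapped)
print(filled)
```

Whether an unmapped value should default to zero or be flagged as an error depends on the data, so the `fillna(0.0)` here is only one option.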


In [10]:
df

Unnamed: 0,drink,size,base_price,tip,peak_hours,price_apply,price_vec
0,Matcha,Small,4.56,1.40,True,4.56,4.56
1,Latte,Small,2.25,1.83,True,2.25,2.25
2,Latte,Large,4.10,1.85,False,5.10,5.10
3,Espresso,Small,4.45,1.58,False,4.45,4.45
4,Matcha,Small,2.30,0.11,True,2.30,2.30
...,...,...,...,...,...,...,...
999995,Chai,Medium,4.26,1.06,True,4.76,4.76
999996,Latte,Small,5.85,1.12,False,5.85,5.85
999997,Espresso,Medium,2.17,0.79,True,2.67,2.67
999998,Cold Brew,Small,5.05,1.59,True,5.05,5.05


### Conditional assignment with `np.where()`

> Instead of writing `if` conditions inside `apply()`,  
> `np.where()` can be used for fast, element-wise conditionals.

In [11]:
# non vectorised

start = time.time()
df['tipper_apply'] = df.apply(lambda x: 'regular' if x['tip'] < 1.0 else 'generous',axis =1)

print(f'Non Vectorised time: {time.time()-start} Seconds')


Non Vectorised time: 14.603986740112305 Seconds


In [12]:
# vectorised

df['tipper_vectorised'] = np.where(df['tip'] < 1.0 ,'regular','generous')
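
A quick way to gain confidence in a vectorised rewrite is to compare it with the row-by-row version on a few hand-picked rows, including the boundary value. The sketch below does this on a small made-up frame (`tips` is a hypothetical name), using the same 1.0 threshold as above.

```python
import numpy as np
import pandas as pd

# Small hand-picked sample, including the boundary value 1.00
tips = pd.DataFrame({'tip': [0.10, 0.99, 1.00, 1.75]})

# Row-by-row version
tips['tipper_apply'] = tips.apply(
    lambda row: 'regular' if row['tip'] < 1.0 else 'generous', axis=1
)

# Vectorised version
tips['tipper_vectorised'] = np.where(tips['tip'] < 1.0, 'regular', 'generous')

# True only if every row agrees
labels_match = bool((tips['tipper_apply'] == tips['tipper_vectorised']).all())
print(labels_match)
```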


**Multiple conditions using `np.select()`**

**Multiple Conditions with np.where**
-----

NumPy's `np.where()` can only evaluate a single condition directly, but you can handle multiple conditions in several ways: 

**Method 1: Nested np.where calls**
-------
```python
# Three conditions example
result = np.where(condition1, 
          value1, 
          np.where(condition2, 
                value2,
                np.where(condition3,
                      value3,
                      default_value)))
```

This creates a chain of if-else statements, but can become hard to read with many conditions.
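
As a concrete, runnable version of the nested pattern, the sketch below labels a few made-up tip values. The thresholds and the `'low'` label are illustrative assumptions.

```python
import numpy as np

tip = np.array([0.05, 0.50, 1.00, 1.60])

# Read inside-out: below 0.2 -> 'low', else up to 1.0 -> 'regular', else 'generous'
label = np.where(tip < 0.2, 'low',
                 np.where(tip <= 1.0, 'regular', 'generous'))

print(label)
```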


**Method 2: Combining conditions with logical operators**
--------
```python

# Using & (and), | (or), ~ (not)
condition = (df['column1'] > 10) & (df['column2'] == 'value')
result = np.where(condition, True_value, False_value)
```
This works well when you need to apply the same result to combinations of conditions.
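
Here is a small runnable sketch of combined conditions on a made-up frame (`orders` and the `'yes'`/`'no'` labels are assumptions for illustration). Each comparison needs its own parentheses because `&`, `|` and `~` bind more tightly than `==` or `>` in Python.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'size': ['Large', 'Large', 'Small', 'Medium'],
    'peak_hours': [True, False, True, True],
})

# AND: both conditions must hold for the row
is_peak_large = orders['peak_hours'] & (orders['size'] == 'Large')
orders['peak_large'] = np.where(is_peak_large, 'yes', 'no')

# NOT: invert a boolean Series with ~
not_small = ~(orders['size'] == 'Small')

print(orders)
print(not_small.tolist())
```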


**Method 3: Use np.select for multiple conditions**
-------
```python 

conditions = [
    df['size'] == 'Small',
    df['size'] == 'Medium',
    df['size'] == 'Large'
]

choices = [
    df['base_price'] * 0.8,  # Small gets 20% discount
    df['base_price'] * 1.1,  # Medium gets 10% markup
    df['base_price'] * 1.2   # Large gets 20% markup
]

df['adjusted_price'] = np.select(conditions, choices, default=df['base_price'])
```

`np.select()` evaluates conditions in order and returns the value from `choices` for the first True condition.     
The optional `default` parameter specifies what to return when no conditions are met.  
This is the preferred approach for multiple conditions with different outcomes.
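
Because the first True condition wins, the order of the conditions matters whenever they overlap. The sketch below uses made-up tip values and deliberately overlapping conditions to show the effect; the labels are illustrative assumptions.

```python
import numpy as np

tip = np.array([0.05, 0.50, 1.50])

# Overlapping conditions: every tip below 0.2 is also below 1.0
broad_first = np.select(
    [tip < 1.0, tip < 0.2],
    ['regular', 'low'],
    default='generous',
)

# Putting the narrower condition first gives it a chance to match
narrow_first = np.select(
    [tip < 0.2, tip < 1.0],
    ['low', 'regular'],
    default='generous',
)

print(broad_first)
print(narrow_first)
```

With the broad condition first, `'low'` is never returned, so it is usually safest to list the most specific conditions first or to make the conditions mutually exclusive.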

In [13]:


start = time.time()
conditions = [
    df['tip'] < 0.2, 
    (df['tip'] >= 0.2) & (df['tip'] <= 1.0),
    df['tip'] > 1.0
]

# Define corresponding choices
choices = [
    'Greedy bastard', 
    'regular', 
    'generous'
]

# Use np.select instead of np.where
df['tipper_enhanced'] = np.select(conditions, choices, default='unknown')

print(f'Vectorised time: {time.time()-start} seconds')

Vectorised time: 0.3285396099090576 seconds


In [14]:
# Non vectorised 
def calculate_revenue(row):
    discount = 1.0
    if row['peak_hours']:
        if row['size'] == 'Small':
            discount = 0.95
        elif row['size'] == 'Medium':
            discount = 0.90
        elif row['size'] =='Large':
            discount = 0.85
    return round(((row['base_price'] + row['tip'])*discount),2)

start= time.time()
df['Revenue_apply'] = df.apply(calculate_revenue,axis =1)
print(f'Non Vectorised time: {time.time()-start} seconds')

Non Vectorised time: 21.610181093215942 seconds


In [15]:
# vectorised version - my attempt
# Note: this ignores peak_hours, so off-peak rows are discounted too (corrected in the next cell)

start = time.time()

conditions = [df['size'] == 'Small',
              df['size'] == 'Medium',
              df['size'] =='Large']

results = [round((df['base_price'] + df['tip'])* 0.95 , 2),
           round((df['base_price'] + df['tip'])* 0.90 , 2),
           round((df['base_price'] + df['tip'])* 0.85 , 2)
    
]

df['Revenue_vec'] = np.select(condlist= conditions, choicelist= results ,default=(df['base_price'] + df['tip']))


print(f'Vectorised time: {time.time()-start} seconds')

Vectorised time: 0.37608766555786133 seconds


In [16]:
## vectorised - my attempt, improved (now accounts for peak_hours)

start = time.time()

# Step 1: Calculate the discount factors first
conditions = [
    (df['peak_hours']) & (df['size'] == 'Small'),
    (df['peak_hours']) & (df['size'] == 'Medium'),
    (df['peak_hours']) & (df['size'] == 'Large')
]

discount_factors = [0.95, 0.90, 0.85]

# Apply the discount factor using np.select
discounts = np.select(condlist=conditions, choicelist=discount_factors, default=1.0)

# Step 2: Calculate the revenue with the discounts
base_amount = df['base_price'] + df['tip']
df['Revenue_vec_improved'] = round(base_amount * discounts, 2)

print(f'Vectorised time (improved): {time.time()-start} seconds')

Vectorised time (improved): 0.3451662063598633 seconds


In [17]:
# vectorised version - suggested solution

start = time.time()

discounts = np.select(

    condlist=[(df['peak_hours']) & (df['size'] == 'Small'),
              (df['peak_hours']) &(df['size'] == 'Medium'),
              (df['peak_hours']) & (df['size'] =='Large') ],
    choicelist=[ 0.95,0.90,0.85],
    default=1.0
)

df['revenue_vec'] = round(((df['base_price'] + df['tip'])*discounts),2)

print(f'Vectorised time: {time.time()-start} seconds')

Vectorised time: 0.2875537872314453 seconds


In [18]:
df

Unnamed: 0,drink,size,base_price,tip,peak_hours,price_apply,price_vec,tipper_apply,tipper_vectorised,tipper_enhanced,Revenue_apply,Revenue_vec,Revenue_vec_improved,revenue_vec
0,Matcha,Small,4.56,1.40,True,4.56,4.56,generous,generous,generous,5.66,5.66,5.66,5.66
1,Latte,Small,2.25,1.83,True,2.25,2.25,generous,generous,generous,3.88,3.88,3.88,3.88
2,Latte,Large,4.10,1.85,False,5.10,5.10,generous,generous,generous,5.95,5.06,5.95,5.95
3,Espresso,Small,4.45,1.58,False,4.45,4.45,generous,generous,generous,6.03,5.73,6.03,6.03
4,Matcha,Small,2.30,0.11,True,2.30,2.30,regular,regular,Greedy bastard,2.29,2.29,2.29,2.29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,Chai,Medium,4.26,1.06,True,4.76,4.76,generous,generous,generous,4.79,4.79,4.79,4.79
999996,Latte,Small,5.85,1.12,False,5.85,5.85,generous,generous,generous,6.97,6.62,6.97,6.97
999997,Espresso,Medium,2.17,0.79,True,2.67,2.67,regular,regular,regular,2.66,2.66,2.66,2.66
999998,Cold Brew,Small,5.05,1.59,True,5.05,5.05,generous,generous,generous,6.31,6.31,6.31,6.31
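
Differences between columns, such as `Revenue_vec` on the off-peak rows above, are easy to miss by eye in a million-row frame. The sketch below shows one possible check on a small made-up frame: it compares a row-by-row result with two vectorised versions using `np.isclose()`. The frame, the column names and the one-penny tolerance are assumptions for illustration.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'size': ['Small', 'Large', 'Medium', 'Large'],
    'base_price': [4.56, 4.10, 2.17, 3.00],
    'tip': [1.40, 1.85, 0.79, 0.40],
    'peak_hours': [True, False, True, True],
})
peak_discount = {'Small': 0.95, 'Medium': 0.90, 'Large': 0.85}


def calculate_revenue(row):
    # Reference implementation: discount applies only during peak hours
    discount = peak_discount[row['size']] if row['peak_hours'] else 1.0
    return round((row['base_price'] + row['tip']) * discount, 2)


orders['revenue_apply'] = orders.apply(calculate_revenue, axis=1)
base_amount = orders['base_price'] + orders['tip']

# Vectorised version that respects peak_hours
discounts = np.select(
    [orders['peak_hours'] & (orders['size'] == s) for s in peak_discount],
    list(peak_discount.values()),
    default=1.0,
)
orders['revenue_vec'] = (base_amount * discounts).round(2)

# Version that ignores peak_hours, to show the check catching a mistake
orders['revenue_size_only'] = (base_amount * orders['size'].map(peak_discount)).round(2)

# Assumed tolerance of one penny, to allow for rounding differences
all_match = bool(np.isclose(orders['revenue_apply'], orders['revenue_vec'], atol=0.01).all())
mismatch_rows = orders.index[
    ~np.isclose(orders['revenue_apply'], orders['revenue_size_only'], atol=0.01)
].tolist()

print(all_match)
print(mismatch_rows)
```

Running the same comparison on the full DataFrame would list every row where a vectorised column disagrees with the `.apply()` result.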


In [19]:
## The Mac is a lot quicker than both of the Windows computers at crunching large datasets