<a href="https://colab.research.google.com/github/cagBRT/PerformanceEnhancement/blob/main/1_a_Alternative_to_for_Loops.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Vectorization means expressing a computation as operations on whole (NumPy) arrays rather than on individual elements. Behind the scenes, the operation is applied to every element of an array or Series in optimized, compiled code, unlike a ‘for’ loop, which handles one element or row at a time in the Python interpreter.<br>


**Vectorized operations in Python are typically much faster than explicit loops and should be preferred whenever we are working with very large datasets.**
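As a minimal illustration of the idea, the sketch below doubles a small array twice: once with an explicit loop and once with a single array expression. The array `values` and the result names are made up for this example.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: one Python-level operation per element.
doubled_loop = np.empty_like(values)
for i in range(len(values)):
    doubled_loop[i] = values[i] * 2

# Vectorized version: one expression applied to the whole array.
doubled_vec = values * 2

# Both approaches produce the same result.
print(np.array_equal(doubled_loop, doubled_vec))
```

The two versions give identical values; the difference lies only in where the per-element work happens.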





In [8]:
import numpy as np
import pandas as pd

## for loops vs Vectorization

Compare summing 1.5 million integers with a for loop and with np.sum().

**Using a for loop**

In [2]:
%%time
# iterative sum
total = 0
# iterating through 1.5 Million numbers
for item in range(0, 1500000):
    total = total + item

print('sum is:' + str(total))

sum is:1124999250000
CPU times: user 384 ms, sys: 2.26 ms, total: 386 ms
Wall time: 398 ms


**Using vectorization**

In [6]:
%%time
print('sum is: ',np.sum(np.arange(1500000)))

sum is:  1124999250000
CPU times: user 5.46 ms, sys: 4.09 ms, total: 9.55 ms
Wall time: 21.1 ms


Time for the for loop: 398 ms<br>
Time for the vectorization: 21.1 ms<br>

In this run, vectorization took roughly 19x less time than iterating with the range function (398 ms vs. 21.1 ms).

The difference becomes far larger when working with a pandas DataFrame.
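A single `%%time` measurement can be noisy, so one possible way to cross-check the comparison is with the standard `timeit` module. The sketch below is only an illustration: the helper names `loop_sum` and `vectorized_sum` are hypothetical, and `dtype=np.int64` is set explicitly on the assumption that some platforms may default to 32-bit integers, which would overflow for this sum. It also checks both results against the closed form n(n-1)/2.

```python
import timeit
import numpy as np

n = 1_500_000

def loop_sum(n):
    # Iterative sum over 0..n-1 in pure Python.
    total = 0
    for item in range(n):
        total += item
    return total

def vectorized_sum(n):
    # Same sum computed by NumPy on a 64-bit integer array.
    return int(np.sum(np.arange(n, dtype=np.int64)))

# Closed form for 0 + 1 + ... + (n - 1).
expected = n * (n - 1) // 2
assert loop_sum(n) == vectorized_sum(n) == expected

# Best of three single runs for each approach.
loop_time = min(timeit.repeat(lambda: loop_sum(n), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_sum(n), number=1, repeat=3))
print(f"loop: {loop_time * 1e3:.1f} ms, vectorized: {vec_time * 1e3:.1f} ms")
```

Absolute timings depend on the machine, so the printed numbers will differ from the ones recorded above.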

## Mathematical Operations (on DataFrame)

In data science, developers working with a pandas DataFrame often use loops to create new derived columns from mathematical operations.<br>

The following example shows how easily such loops can be replaced with vectorization.

**Create a dataframe of 5 million rows**

In [11]:
df = pd.DataFrame(np.random.randint(1, 50, size=(5000000, 4)), columns=('a','b','c','d'))
print('shape=',df.shape)
df.head()

shape= (5000000, 4)


|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 16 | 2 | 1 | 23 |
| 1 | 16 | 19 | 26 | 9 |
| 2 | 27 | 37 | 27 | 34 |
| 3 | 5 | 33 | 12 | 42 |
| 4 | 6 | 19 | 37 | 12 |


### Creating a new column

Create a new column ‘ratio’ that expresses column ‘d’ as a percentage of column ‘c’ (100 × d / c).

**Using a for loop to create a new column**<br>
Be patient, this can take up to 7 minutes

In [12]:
%%time
# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column
    df.at[idx,'ratio'] = 100 * (row["d"] / row["c"])



CPU times: user 6min 51s, sys: 747 ms, total: 6min 52s
Wall time: 6min 57s


**Using vectorization to create a new column**

In [13]:
%%time
df["ratio"] = 100 * (df["d"] / df["c"])

CPU times: user 83.3 ms, sys: 30.9 ms, total: 114 ms
Wall time: 169 ms


The improvement on a DataFrame is dramatic: the vectorized operation took 169 ms, while the Python loop took almost 7 minutes, a speedup of roughly 2,500x in this run.
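Before trusting the faster version, it can help to confirm that both approaches produce the same values. The sketch below does this on a hypothetical 1,000-row frame named `small`, built the same way as `df` but small enough that the loop finishes quickly; the seed and the frame size are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 5-million-row frame (size and seed are arbitrary).
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(1, 50, size=(1000, 4)),
                     columns=["a", "b", "c", "d"])

# Loop version: one row at a time via iterrows.
ratio_loop = [100 * (row["d"] / row["c"]) for _, row in small.iterrows()]

# Vectorized version: whole columns at once.
ratio_vec = 100 * (small["d"] / small["c"])

# The two results should match element by element.
print(np.allclose(ratio_loop, ratio_vec.to_numpy()))
```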

### If-else Statements (on DataFrame)

Many operations require ‘if-else’ logic. This logic can often be replaced with vectorized operations in Python.

Look at the following example to understand it better (we will be using the 5 million row DataFrame that we created above):

Imagine we want to create a new column ‘e’ based on some conditions on the existing column ‘a’.

**Creating a new column using a for loop and if-else**<br>
Be patient, this will take a while (~10min)

In [14]:
%%time
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx,'e'] = row.d
    elif 0 < row.a <= 25:
        df.at[idx,'e'] = (row.b)-(row.c)
    else:
        df.at[idx,'e'] = row.b + row.c

CPU times: user 9min 43s, sys: 1.86 s, total: 9min 45s
Wall time: 10min 1s


**Create a new column using vectorization**

In [17]:
%%time
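# default case (a > 25)
df['e'] = df['b'] + df['c']
# rows with a <= 25
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
# rows with a == 0 (applied last so it overrides the previous rule)
df.loc[df['a'] == 0, 'e'] = df['d']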

CPU times: user 385 ms, sys: 169 ms, total: 554 ms
Wall time: 580 ms


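### Solving Machine Learning/Deep Learning Networks

Calculate the following equation for every row of the input:<br>

$y = m_1x_1 + m_2x_2 + m_3x_3 + m_4x_4 + m_5x_5 + c$

m is an array of 5 values (the coefficients)<br>
x is an array of 5 million rows, each holding 5 values<br>
The code below computes only the weighted sum and leaves out the constant c.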

In [18]:
# setting initial values of m
m = np.random.rand(1,5)

# input values for 5 million rows
x = np.random.rand(5000000,5)

In [20]:
m

array([[0.00296342, 0.79620141, 0.64724538, 0.67998852, 0.90720993]])

In [21]:
x

array([[0.7558599 , 0.58921781, 0.62781798, 0.52153714, 0.94009674],
       [0.76352618, 0.38960832, 0.96926088, 0.79082271, 0.59274708],
       [0.16778951, 0.67113249, 0.04751701, 0.67924639, 0.96667438],
       ...,
       [0.58868249, 0.90064331, 0.97105094, 0.64843349, 0.49664189],
       [0.78117006, 0.96739011, 0.29250178, 0.51529084, 0.72206629],
       [0.62443378, 0.43816913, 0.57197688, 0.37947644, 0.66667625]])

**Solving the equation with for loops**

In [23]:
%%time
for i in range(0,5000000):
    # weighted sum for row i (overwritten on the next row; only the timing matters here)
    total = 0
    for j in range(0,5):
        total = total + x[i][j]*m[0][j]

CPU times: user 24.4 s, sys: 35.9 ms, total: 24.4 s
Wall time: 24.8 s


**Solving the equation with vectorization**

In [25]:
%%time
# dot product of x (5,000,000 x 5) with m transposed (5 x 1)
np.dot(x,m.T)

CPU times: user 88.4 ms, sys: 26.9 ms, total: 115 ms
Wall time: 77.5 ms


array([[2.08523264],
       [2.01531537],
       [1.90446538],
       ...,
       [2.23883202],
       [1.96733029],
       [1.58378567]])

For loops: 24.8 seconds<br>
Vectorization: 77.5 ms (roughly 320x faster in this run)
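The equation at the top of this section includes a constant term c that the timed code does not compute. A minimal sketch of the full expression follows, assuming a made-up intercept `c = 0.5` and a reduced 1,000-row input so that the loop version can be checked quickly against the vectorized one.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.random((1, 5))       # coefficients m_1..m_5
x = rng.random((1000, 5))    # reduced from 5 million rows for a quick check
c = 0.5                      # hypothetical intercept, not from the original

# Vectorized: matrix product plus a broadcast scalar, shape (1000, 1).
y_vec = x @ m.T + c

# Loop version that stores one result per row.
y_loop = np.empty((x.shape[0], 1))
for i in range(x.shape[0]):
    total = c
    for j in range(x.shape[1]):
        total += x[i, j] * m[0, j]
    y_loop[i, 0] = total

print(y_vec.shape, np.allclose(y_loop, y_vec))
```

Adding the scalar `c` relies on NumPy broadcasting, so it adds no extra loop to the vectorized version.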