<a href="https://colab.research.google.com/github/cagBRT/PerformanceEnhancement/blob/main/1_a_Alternative_to_for_Loops.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Vectorization is the technique of applying (NumPy) array operations to a whole dataset at once.

Behind the scenes, the operation runs over all the elements of an array or Series in compiled code, unlike a Python `for` loop, which handles one element (or row) at a time.<br>


**Vectorization in Python is usually much faster than an explicit loop and should be preferred whenever we are working with large datasets.**
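
As a minimal sketch of what "in one go" means, the snippet below doubles every element of a small array twice: once with an explicit loop and once with a single array expression. The names `values`, `doubled_loop`, and `doubled_vec` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# Loop version: Python handles one element per iteration.
doubled_loop = []
for v in values.tolist():
    doubled_loop.append(v * 2)

# Vectorized version: one expression applied to the whole array.
doubled_vec = values * 2

print(doubled_loop)
print(doubled_vec.tolist())
```

Both versions produce the same values; only the place where the loop runs differs.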





### Benefits of Vectorization<br>


1. **Efficiency:** Vectorized operations are optimized for performance and are much faster than traditional loop-based operations, especially on large datasets.
2. **Clarity:** Vectorized code is often more concise and easier to read compared to code with explicit loops.
3. **Ease of Use:** You can apply operations to entire columns or Series with a single line of code, reducing the complexity of your scripts.
4. **Compatibility:** pandas integrates seamlessly with other data science libraries like NumPy and scikit-learn, allowing you to work with vectorized data efficiently in your data analysis and machine learning projects.

### How Vectorization Speeds Up Your Code


1. **Reduced Loop Overheads:** In a Python loop, every iteration pays interpreter overhead: advancing the iterator, checking types, and dispatching each operation. With vectorization, the loop runs in compiled code, so that overhead is paid once per array rather than once per element.
2. **Optimized Low-Level Instructions:** Libraries like NumPy use optimized low-level instructions (e.g., SIMD instructions on modern CPUs) to perform operations on arrays, taking full advantage of hardware capabilities. This can result in significant speed improvements.
3. **Parallelism:** Some vectorized operations can be parallelized, meaning that modern processors can execute multiple operations simultaneously. This parallelism further accelerates computation.
4. **Simplicity:** Vectorized code is often more concise and easier to read than equivalent loop-based code, making it easier to maintain and understand.
5. **Interoperability:** Libraries like NumPy integrate seamlessly with other data science and scientific computing libraries, allowing you to build complex data analysis and numerical computing workflows efficiently.
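
One reason the low-level optimizations in the list above are possible is that a NumPy array stores elements of a single type in one block of memory, whereas a list holds references to separate Python objects. The sketch below inspects that layout; `arr` and `as_list` are illustrative names.

```python
import numpy as np

arr = np.arange(10, dtype=np.int64)   # ten 64-bit integers
as_list = arr.tolist()                # ten separate Python int objects

print(arr.dtype)                   # one element type shared by all entries
print(arr.itemsize)                # bytes per element
print(arr.nbytes)                  # total bytes in the data buffer
print(arr.flags["C_CONTIGUOUS"])   # whether the buffer is one contiguous block
print({type(v) for v in as_list})  # element types found in the list
```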

In [18]:
import numpy as np
import pandas as pd

## for loops vs Vectorization<br>

Compare summing 1.5 million numbers with a for loop and with `np.sum()`.

**Using a for loop**

In [19]:
%%time
# iterative sum
total = 0
# iterating through 1.5 Million numbers
for item in range(0, 1500000):
    total = total + item

print('sum is:' + str(total))

sum is:1124999250000
CPU times: user 214 ms, sys: 988 µs, total: 215 ms
Wall time: 218 ms


**Using vectorization**

In [20]:
%%time
print('sum is: ',np.sum(np.arange(1500000)))

sum is:  1124999250000
CPU times: user 3.91 ms, sys: 988 µs, total: 4.89 ms
Wall time: 6.77 ms
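
The printed total can be checked against the closed form for the sum of the integers 0 through n-1, which is n(n-1)/2. The sketch below also passes `dtype=np.int64` explicitly. That is a precaution rather than something the cell above does: on a platform where NumPy's default integer type is 32 bits wide, this total would not fit. The names `n`, `expected`, and `total_np` are illustrative.

```python
import numpy as np

n = 1_500_000

# Closed form for 0 + 1 + ... + (n - 1)
expected = n * (n - 1) // 2

# Explicit 64-bit integers so the total cannot overflow a 32-bit default
total_np = np.sum(np.arange(n, dtype=np.int64))

print(expected, int(total_np))
```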


Time for the for loop: 218 ms (wall time)<br>
Time for the vectorization: 6.77 ms (wall time)<br>

In the run shown above, vectorization was ~32x faster than iterating with the range function.

This difference becomes even larger when working with pandas DataFrames.

**Assignment:** <br>
How small must the number of values be so that the difference between the two techniques is negligible?
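
One possible way to explore this question is to time both approaches over a range of sizes with the standard-library `timeit` module. This is a sketch, not a definitive benchmark: the helper names `loop_sum` and `vectorized_sum`, the list of sizes, and the repeat count of 20 are all assumptions, and the timings will vary by machine.

```python
import timeit
import numpy as np

def loop_sum(n):
    """Sum 0..n-1 with an explicit Python loop."""
    total = 0
    for item in range(n):
        total += item
    return total

def vectorized_sum(n):
    """Sum 0..n-1 with NumPy (array creation is included in the timing)."""
    return int(np.sum(np.arange(n, dtype=np.int64)))

# Time 20 calls of each approach at several sizes
for n in (10, 100, 1_000, 10_000, 100_000):
    t_loop = timeit.timeit(lambda: loop_sum(n), number=20)
    t_vec = timeit.timeit(lambda: vectorized_sum(n), number=20)
    print(f"n={n:>7}  loop={t_loop:.6f}s  numpy={t_vec:.6f}s")
```

The same pattern can be adapted to the DataFrame and dot-product comparisons later in the notebook by swapping in the corresponding loop and vectorized functions.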

## Mathematical Operations (on DataFrame)<br>

In data science work with pandas DataFrames, developers often use loops to create new derived columns from mathematical operations.<br>

In the following example, we can see how easily the loops can be replaced with Vectorization for such use cases.

**Create a dataframe of 5 million rows**

In [21]:
df = pd.DataFrame(np.random.randint(1, 50, size=(5000000, 4)), columns=('a','b','c','d'))
print('shape=',df.shape)
df.head()

shape= (5000000, 4)


|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 16 | 9 | 48 | 2 |
| 1 | 3 | 37 | 37 | 18 |
| 2 | 12 | 26 | 7 | 18 |
| 3 | 33 | 6 | 46 | 29 |
| 4 | 1 | 5 | 9 | 27 |


### creating a new column<br>
Create a new column ‘ratio’ that holds column ‘d’ as a percentage of column ‘c’.

**Using a for loop to create a new column**<br>
Be patient, this can take up to 7 minutes

In [22]:
%%time
# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column
    df.at[idx,'ratio'] = 100 * (row["d"] / row["c"])

CPU times: user 6min 27s, sys: 462 ms, total: 6min 27s
Wall time: 6min 30s


**Using vectorization to create a new column**

In [23]:
%%time
df["ratio"] = 100 * (df["d"] / df["c"])

CPU times: user 77.9 ms, sys: 26 ms, total: 104 ms
Wall time: 68.3 ms


The improvement with a DataFrame is significant: in the runs shown above, the vectorized operation took 68.3 ms of wall time, compared with 6 min 30 s for the loop.
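
To confirm that the two approaches compute the same thing, they can be compared on a frame small enough to run instantly. The sketch below uses a four-row frame named `small` (an illustrative name) whose values are copied from the first rows of the `df.head()` output shown earlier.

```python
import numpy as np
import pandas as pd

# Four rows of 'c' and 'd' copied from the df.head() output
small = pd.DataFrame({"c": [48, 37, 7, 46], "d": [2, 18, 18, 29]})

# Loop version, following the iterrows cell
small["ratio_loop"] = np.nan
for idx, row in small.iterrows():
    small.at[idx, "ratio_loop"] = 100 * (row["d"] / row["c"])

# Vectorized version, following the single-line cell
small["ratio_vec"] = 100 * (small["d"] / small["c"])

print(np.allclose(small["ratio_loop"], small["ratio_vec"]))
```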

**Assignment:**<br>
How small must the DataFrame be so that the difference between the two techniques is negligible?

### If-else Statements (on DataFrame)

Many operations require if-else logic. This logic can also be replaced with vectorized operations in Python.

Consider the following example, which uses the 5-million-row DataFrame created above.

Suppose we want to create a new column ‘e’ based on conditions on the existing column ‘a’.

**Creating a new column using a for loop and if-else**<br>
Be patient, this will take a while (~10min)

In [None]:
%%time
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx,'e'] = row.d
    elif 0 < row.a <= 25:
        df.at[idx,'e'] = (row.b)-(row.c)
    else:
        df.at[idx,'e'] = row.b + row.c

**Create a new column using vectorization**

In [8]:
%%time
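# else branch (default): b + c
df['e'] = df['b'] + df['c']
# elif branch: a <= 25 -> b - c
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
# if branch, applied last so it takes precedence: a == 0 -> d
df.loc[df['a'] == 0, 'e'] = df['d']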

CPU times: user 259 ms, sys: 113 ms, total: 372 ms
Wall time: 310 ms
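
The chained `.loc` assignments above rely on being applied in the right order. An alternative that keeps the if/elif/else shape visible is nested `np.where`, sketched below on a small frame. This is a substitute technique, not the notebook's method; the frame `small` and its values are made up so that every branch is exercised.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (values chosen to hit every branch)
small = pd.DataFrame({
    "a": [0, 10, 25, 40],
    "b": [5, 6, 7, 8],
    "c": [1, 2, 3, 4],
    "d": [9, 9, 9, 9],
})

# np.where(condition, value_if_true, value_if_false), nested like if/elif/else
small["e"] = np.where(
    small["a"] == 0,
    small["d"],                      # if a == 0
    np.where(
        small["a"] <= 25,
        small["b"] - small["c"],     # elif a <= 25
        small["b"] + small["c"],     # else
    ),
)

print(small["e"].tolist())
```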


**Assignment:**<br>
How small must the DataFrame be so that the difference between the two techniques is negligible?

### Solving Machine Learning/Deep Learning Networks<br>

Calculate the following equation <br>

$y=m_1x_1+m_2x_2+m_3x_3+m_4x_4+m_5x_5+c$


m is an array of 5 values<br>
x is an array of 5 million rows, each with 5 values<br>
The cells below compute only the weighted sum; the constant c is not added.

In [9]:
# setting initial values of m
m = np.random.rand(1,5)

# input values for 5 million rows
x = np.random.rand(5000000,5)

In [10]:
m

array([[0.95200946, 0.43928255, 0.93047783, 0.70494671, 0.31859983]])

In [11]:
x

array([[0.37805996, 0.62819801, 0.32690518, 0.61344767, 0.47223032],
       [0.48225794, 0.64857166, 0.55032148, 0.43736745, 0.91035641],
       [0.79857815, 0.4199085 , 0.14098982, 0.42027611, 0.62189277],
       ...,
       [0.90147354, 0.44011535, 0.67637978, 0.66136312, 0.79183972],
       [0.08146755, 0.90763156, 0.00393991, 0.18816702, 0.99221964],
       [0.59738234, 0.91777501, 0.23676719, 0.14013995, 0.24832371]])

**Solving the equation with for loops**

In [12]:
%%time
y = []
for i in range(0,5000000):
    total = 0
    for j in range(0,5):
        total = total + x[i][j]*m[0][j]
    # keep each row's result instead of overwriting it on the next pass
    y.append(total)

KeyboardInterrupt: 

**Solving the equation with vectorization**

In [13]:
%%time
#dot product
np.dot(x,m.T)

CPU times: user 75.1 ms, sys: 21 ms, total: 96 ms
Wall time: 61.4 ms


array([[1.52295152],
       [1.85444241],
       [1.57030751],
       ...,
       [2.39940848],
       [0.9286993 ],
       [1.37008989]])

For loops: 24.8 seconds<br>
Vectorization: 77.5 ms
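
The equation above includes an intercept c that the cells do not add. The sketch below shows one way it could be included and checks the vectorized result against the loop on a smaller input. The value `c = 0.5`, the 1000-row size, and the seeded generator `rng` are assumptions made only so the example runs quickly and repeatably.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
m = rng.random((1, 5))           # coefficients m_1..m_5, shape (1, 5)
x = rng.random((1000, 5))        # 1000 rows stand in for the 5 million above
c = 0.5                          # hypothetical intercept value

# Loop version: one y value per row
y_loop = []
for i in range(x.shape[0]):
    total = c
    for j in range(x.shape[1]):
        total += x[i, j] * m[0, j]
    y_loop.append(total)

# Vectorized version: (1000, 5) @ (5, 1) -> (1000, 1), then c is added to every row
y_vec = x @ m.T + c

print(y_vec.shape)
print(np.allclose(y_vec.ravel(), y_loop))
```

Adding the scalar `c` to the array is itself a vectorized operation, so no extra loop is needed.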

**Assignment:**<br>
How small must x be so that the difference between the two techniques is negligible?

## Compare lists to arrays

**Create two NumPy arrays and two lists for the comparison**

In [14]:

array1 = np.random.randint(1, 100, size=5000000)
array2 = np.random.randint(1, 100, size=5000000)
list1 = list(array1)
list2 = list(array2)

### Applying Functions
Vectorization also allows you to apply custom functions to columns.

**Define the two addition functions**

In [15]:
# Vectorized processing with NumPy
def numpy_vectorized():
    result = array1 + array2

# Traditional loop-based processing
def loop_based():
    result = []
    for i in range(len(list1)):
        result.append(list1[i] + list2[i])

**Compare the CPU time between the two data structures**

In [16]:
%%time
numpy_vectorized()

CPU times: user 9.64 ms, sys: 16 ms, total: 25.6 ms
Wall time: 30.2 ms


In [17]:
%%time
loop_based()

CPU times: user 1.05 s, sys: 149 ms, total: 1.2 s
Wall time: 1.21 s
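
One detail that may affect the list timing: `list(array1)` keeps each element as a NumPy integer scalar, whereas `array1.tolist()` converts each element to a built-in `int`. Arithmetic on NumPy scalars is typically slower than on built-in integers, so the choice could change the measured gap. The sketch below shows the difference in element type on a tiny array; all names in it are illustrative.

```python
import numpy as np

small_array = np.array([1, 2, 3])

as_numpy_scalars = list(small_array)    # elements stay NumPy integer scalars
as_python_ints = small_array.tolist()   # elements become built-in int

print(type(as_numpy_scalars[0]))
print(type(as_python_ints[0]))

# zip pairs the elements directly instead of indexing both lists each iteration
summed = [p + q for p, q in zip(as_python_ints, as_python_ints)]
print(summed)
```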


**Assignment:**<br>
How small must the arrays be so that the difference between the two techniques is negligible?