# Vectorization Over Python Loops

# Table Of Contents
[1. Introduction](#1.Introduction)\
[2. NumPy](#2.NumPy)\
[3. Vectorization](#3.-Vectorization)\
[4. Use Cases](#4.-Use_Cases)\
[5. Conclusion](#5.-Conclusion)


# 1. Introduction
Applications often have to process large volumes of data. Non-optimized functions can slow the whole algorithm down and lead to long execution times for the model.

To speed up our code, we can use vectorization: relying on standard mathematical functions that operate on entire arrays of data at once, without explicit loops.

### Why Python Loops Are Slow
Loops are a common suspect when analyzing Python code for performance bottlenecks. The main reason is that Python is dynamically typed. Python compiles the source code into bytecode, which the interpreter then executes one instruction at a time. When a loop iterates over a list, the interpreter does not know the type of each element (integer, string, or float) until it reaches that element.

The type information is stored within each object, and Python cannot determine it in advance. On every iteration of the loop, Python therefore has to perform a series of checks, such as:
- determining the type of the variable,
- resolving its scope, and
- checking for invalid operations.

This additional overhead can slow down the performance of the loop.
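
As a small illustration of this point, the sketch below shows that each list element carries its own type, so the same loop body can mean a different operation on every iteration. The list `mixed` and the helper `double` are hypothetical names chosen for this example.

```python
# Hypothetical example data: one list holding four different types
mixed = [1, 2.5, "3", [4]]

# Each object stores its own type; Python looks it up at run time
type_names = [type(x).__name__ for x in mixed]
print(type_names)  # ['int', 'float', 'str', 'list']


def double(x):
    # "+" is resolved on every call, based on the run-time type of x
    return x + x


results = [double(x) for x in mixed]
print(results)  # [2, 5.0, '33', [4, 4]]
```

The interpreter cannot skip this lookup even when every element happens to be an integer, because nothing guarantees that in advance.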

# 2. NumPy - Numerical Python
NumPy is a library that provides such vectorized functions, making it possible to improve algorithm performance. It is a core package for efficient scientific computing and data analysis in the Python ecosystem, and a foundation for higher-level tools such as Pandas and Scikit-learn.

NumPy's speed advantage comes from its vectorized implementation: many of its core functions are written in C. Unlike Python lists, NumPy arrays are homogeneous and densely packed. A Python list, on the other hand, is an array of pointers to objects, even when all elements share the same type. This distinction lets NumPy arrays benefit from locality of reference. Because the loops run in C, NumPy avoids the overhead of Python-level loops, pointer indirection, and per-element dynamic type checking. The size of the speedup depends on the specific operations being executed.
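
The difference in storage can be inspected directly. The following is a minimal sketch, with variable names (`py_list`, `arr`, `coerced`) chosen only for illustration, that contrasts a heterogeneous list with a single-dtype, contiguous NumPy array.

```python
import numpy as np

# A Python list stores references to separate objects of any type
py_list = [1, 2.5, "three"]

# A NumPy array stores raw values of one dtype in a contiguous buffer
arr = np.arange(5, dtype=np.int64)

print(arr.dtype)                  # int64
print(arr.itemsize)               # 8 bytes per element
print(arr.nbytes)                 # 40 bytes for 5 elements
print(arr.flags['C_CONTIGUOUS'])  # True

# Mixed numeric input is coerced to one common dtype
coerced = np.array([1, 2.5])
print(coerced.dtype)              # float64
```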

NumPy derives its strength in computation from three main concepts: 
- Vectorization,
- Broadcasting, and
- Indexing.
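
A rough sketch of the three concepts on a tiny array follows; the array values and variable names are assumptions made for this example.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Vectorization: one expression operates on every element
squared = a * a                # [1, 4, 9, 16]

# Broadcasting: shapes (4, 1) and (3,) are stretched to (4, 3)
col = a.reshape(4, 1)
row = np.array([10, 20, 30])
grid = col + row               # grid[i, j] == a[i] + row[j]

# Indexing: a boolean mask selects elements without a loop
evens = a[a % 2 == 0]          # [2, 4]

print(squared, grid.shape, evens)
```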


# 3. What is Vectorization?
Loops are a common approach for repetitive tasks in programming, but they can lead to poor performance and long execution times when the number of iterations is large, such as millions or billions of rows. In such scenarios, vectorization can greatly improve efficiency and spare us from waiting on slow processes.

### Why Vectorization?
Vectorization improves the performance of Python code by replacing explicit loops with operations on whole arrays. This can significantly reduce execution time.

Iterating over an array or any other data structure in Python carries significant per-element overhead. With vectorized NumPy operations, the looping is delegated to highly optimized C and Fortran functions, resulting in faster and more efficient Python code.

NumPy provides many vectorized functions that replace hand-written loops, for example:

- The outer(a, b) function calculates the outer product of two vectors.
- The multiply(a, b) function calculates the element-wise product of two arrays.
- The dot(a, b) function calculates the dot product of two arrays.
- The zeros((n, m)) function creates a matrix of the specified shape and type, filled with zeros.
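
The four functions can be tried on small inputs. This is a minimal sketch with made-up vectors; note in particular that `np.multiply` works element by element, while `np.dot` reduces two vectors to a single number.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

outer = np.outer(a, b)           # shape (3, 3); outer[i, j] == a[i] * b[j]
elementwise = np.multiply(a, b)  # [4, 10, 18]
dot = np.dot(a, b)               # 1*4 + 2*5 + 3*6 = 32
zeros = np.zeros((2, 3))         # 2x3 matrix of 0.0 (float64 by default)

print(outer)
print(elementwise, dot)
print(zeros)
```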


# 4. Use Cases
### 1. Calculating the Sum of Numbers
Let's start with a basic scenario: calculating the sum of a range of numbers, first with a loop and then with vectorization.

In [3]:
# Using Loop

import time 
start = time.time()

# iterative sum
total = 0
# iterating through 1.5 Million numbers
for item in range(0, 1500000):
    total = total + item

print('sum is:' + str(total))
end = time.time()
print(end - start)

sum is:1124999250000
0.249786376953125


In [6]:
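# Using Vectorization

import time
import numpy as np

start = time.time()
# vectorized sum - using numpy for vectorization
# np.arange creates the sequence of numbers from 0 to 1499999;
# dtype=np.int64 is set explicitly because the total does not fit in
# 32 bits: with a 32-bit default integer (as on some platforms) the sum
# silently overflows to -282181552
print(np.sum(np.arange(1500000, dtype=np.int64)))
end = time.time()
print(end - start)

1124999250000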
0.0029926300048828125
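
The exact total, 1124999250000, is larger than the largest 32-bit signed integer (2147483647). Where NumPy's integer type is 32 bits wide, the vectorized sum wraps around silently and returns -282181552 instead. The sketch below forces a 32-bit accumulator to show the effect on any platform; the variable names are chosen for this example only.

```python
import numpy as np

n = 1_500_000
exact = n * (n - 1) // 2  # Python ints have arbitrary precision

# Force 32-bit storage and a 32-bit accumulator to reproduce the wraparound
wrapped = np.arange(n, dtype=np.int32).sum(dtype=np.int32)

# A 64-bit dtype is wide enough for this total
correct = np.arange(n, dtype=np.int64).sum()

print(exact)         # 1124999250000
print(int(wrapped))  # -282181552
print(int(correct))  # 1124999250000
```

Comparing a vectorized result against a known-good value, as done here with `exact`, is a cheap way to catch this kind of silent overflow.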


##### Observation: Vectorization was significantly faster, roughly 80 times quicker than iterating with the range function (about 0.250 s versus 0.003 s). This performance gap is even more pronounced when working with a Pandas DataFrame.


### 2. Mathematical Operations on a DataFrame
Data science developers frequently apply mathematical operations to DataFrames, and they often resort to loops to generate new derived columns.

The example below shows how vectorization can efficiently replace loops in these scenarios.

### Generating the DataFrame
- A DataFrame represents tabular data structured in rows and columns.
- We construct a pandas DataFrame with random integers ranging from 0 to 49.

In [11]:
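import numpy as np
import pandas as pd
# np.random.randint(0, 50, ...) draws integers from 0 to 49 (50 is excluded)
df = pd.DataFrame(np.random.randint(0, 50, size=(100000, 4)), columns=['a', 'b', 'c', 'd'])
df.shape
# (100000, 4)
df.head()

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 17 | 34 | 26 | 15 |
| 1 | 4 | 8 | 16 | 0 |
| 2 | 8 | 44 | 20 | 41 |
| 3 | 1 | 5 | 49 | 14 |
| 4 | 8 | 42 | 29 | 49 |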


In [12]:
# Using Loop
import time
start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column
    # note: 'c' can be 0, so some ratios are inf or NaN and the
    # division emits RuntimeWarnings
    df.at[idx, 'ratio'] = 100 * (row['d'] / row['c'])
end = time.time()
print(end - start)

9.743719339370728
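
The vectorized counterpart of this loop is a single column expression. The sketch below is an assumption about what that cell looks like, shown on a small hypothetical DataFrame (`small_df`) so the result can be checked by hand; no timing is reported for it here.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the DataFrame used in this use case
small_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6],
                         'c': [10, 20, 0], 'd': [5, 5, 7]})

# One expression computes the ratio for every row at once
small_df['ratio'] = 100 * (small_df['d'] / small_df['c'])

# Rows where 'c' is 0 produce inf (or NaN for 0 / 0), not an exception
print(small_df['ratio'].tolist())  # [50.0, 25.0, inf]
```

Applied to the full DataFrame, the same expression would read `df['ratio'] = 100 * (df['d'] / df['c'])`.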


##### Observation: The improvement is even larger with a DataFrame: the vectorized operation is nearly 1000 times faster than the Python loop.

### 3. If-else Statements
Data analysis tasks often require 'if-else' logic. In Python, vectorized operations can express this logic without an explicit loop.

To illustrate this concept, let's consider the following example using the DataFrame from use case 2:

Suppose we need to generate a new column 'e' by applying certain conditions to the existing column 'a'.

In [14]:
# Using Loop

import time 
start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx, 'e'] = row.d
    elif 0 < row.a <= 25:
        df.at[idx, 'e'] = row.b - row.c
    else:
        df.at[idx, 'e'] = row.b + row.c
end = time.time()
print(end - start)

13.707904577255249


In [16]:
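# Using Vectorization

start = time.time()
# Start with the default ("else") branch, then overwrite the rows that
# match each narrower condition; later assignments take precedence
df['e'] = df['b'] + df['c']
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
df.loc[df['a'] == 0, 'e'] = df['d']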
end = time.time()
print(end - start)

0.015050649642944336
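
An alternative way to write the same branching is `numpy.select`, which takes a list of conditions and a matching list of values. This is only a sketch of that option, not the approach timed above, and it uses a small hypothetical DataFrame (`small_df`) so the output can be verified by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature DataFrame with one row per branch (two for elif)
small_df = pd.DataFrame({'a': [0, 10, 25, 40],
                         'b': [1, 2, 3, 4],
                         'c': [5, 6, 7, 8],
                         'd': [9, 9, 9, 9]})

# Conditions are checked in order; the first match wins, like if/elif
conditions = [small_df['a'] == 0, small_df['a'] <= 25]
choices = [small_df['d'], small_df['b'] - small_df['c']]

# default covers the final "else" branch
small_df['e'] = np.select(conditions, choices,
                          default=small_df['b'] + small_df['c'])

print(small_df['e'].tolist())  # [9, -4, -4, 12]
```

Because the first matching condition wins, the branches can be listed in the same order as the original `if` / `elif` / `else`, which some readers find easier to follow than a sequence of overwriting assignments.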


##### Observation: The vectorized operation is roughly 900 times faster than the Python loop with if-else statements (about 13.7 s versus 0.015 s).

### 4. Dot Product
The dot product, also referred to as the inner product, is an algebraic operation that multiplies two vectors of equal length element by element and sums the results into a single scalar value.

For two column vectors a and b of the same length, this is equivalent to taking the transpose of the first vector, denoted a', and performing matrix multiplication with the second vector, b.
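
The definition can be checked on a small example. The sketch below, using made-up three-element vectors, computes the same value from the element-wise definition, from `np.dot`, and from the transpose form with explicit column vectors.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Definition: multiply matching elements and add them up
by_definition = sum(x * y for x, y in zip(a.tolist(), b.tolist()))

# Same value from NumPy's dot product
by_dot = np.dot(a, b)

# Column-vector form: a' (1x3) times b (3x1) gives a 1x1 matrix
a_col = a.reshape(3, 1)
b_col = b.reshape(3, 1)
by_transpose = a_col.T @ b_col

print(by_definition)  # 32
print(by_dot)         # 32
print(by_transpose)   # [[32]]
```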


In [31]:
import time
import numpy
import array

# a and b are arrays of 8-byte signed integers (type code 'q')
a = array.array('q')
for i in range(50000):
    a.append(i)

b = array.array('q')
for i in range(50000, 100000):
    b.append(i)

# classic dot product of vectors implementation
start = time.process_time()
classic_dot_product = 0.0

for i in range(len(a)):
    classic_dot_product += a[i] * b[i]

end = time.process_time()
print("classic_dot_product = " + str(classic_dot_product))
print("Computation time using loops = " + str(1000 * (end - start)) + "ms")

# vectorized dot product using numpy
# note: time.process_time() can have coarse resolution on some platforms,
# so a very short run may be reported as 0.0
vectorised_start_time = time.process_time()
vectorised_dot_product = numpy.dot(a, b)
vectorised_end_time = time.process_time()
print("\nvectorised_dot_product = " + str(vectorised_dot_product))
print("Computation time using vectorization = " + str(1000 * (vectorised_end_time - vectorised_start_time)) + "ms")

classic_dot_product = 104164166675000.0
Computation time using loops = 31.25ms

vectorised_dot_product = 104164166675000
Computation time using vectorization = 0.0ms
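
Because both measurements above are very short, a higher-resolution timer gives a more reliable comparison. The sketch below is one possible way to do that with the standard-library `timeit` module; the helper names `loop_dot` and `vector_dot` and the repeat count of 5 are assumptions for this example, and no specific timings are claimed.

```python
import timeit
import numpy as np

a = np.arange(50000, dtype=np.int64)
b = np.arange(50000, 100000, dtype=np.int64)


def loop_dot():
    # Plain Python loop over the element pairs
    total = 0
    for x, y in zip(a.tolist(), b.tolist()):
        total += x * y
    return total


def vector_dot():
    # Single vectorized call
    return int(np.dot(a, b))


# timeit uses time.perf_counter by default and repeats each call
loop_seconds = timeit.timeit(loop_dot, number=5) / 5
vector_seconds = timeit.timeit(vector_dot, number=5) / 5

print(loop_dot() == vector_dot())  # True
print(f"loop: {loop_seconds * 1000:.3f} ms, "
      f"vectorized: {vector_seconds * 1000:.3f} ms")
```

Averaging over several runs also smooths out one-off delays, which matters when the vectorized call finishes in a fraction of a millisecond.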


# 5. Conclusion
Inefficient code is a burden that cannot be ignored, whether the task is disease detection, electric grid modeling, or any other data processing or analysis performed in Python. Debugging even minor changes can consume a great deal of time if each run takes 30 minutes or more. Slow-running tasks waste time during development, hurt the user experience, and increase computational costs. Making our code more efficient through vectorization speeds up iteration, keeps users satisfied, and helps us stay within budget.

