<a href="https://colab.research.google.com/github/cagBRT/PerformanceEnhancement/blob/main/1b_Pandas_Performance_Enhancement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Vectorization?<br>

Vectorization is the process of applying operations to entire arrays or Series of data, instead of iterating through each element individually.

In Pandas, this means that you can perform operations on entire columns or Series without writing explicit loops. This highly efficient approach leverages optimized libraries under the hood, making your code faster and more concise.
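
As a minimal sketch of the difference, the loop and the vectorized expression below compute the same result. The DataFrame `df_demo`, its `price` column, and the 10% markup are made up for illustration and are not part of the datasets used later in this notebook.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df_demo = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# Explicit loop: one Python-level multiplication per element
looped = [p * 1.1 for p in df_demo["price"]]

# Vectorized: one expression applied to the whole column at once
vectorized = df_demo["price"] * 1.1

print(vectorized.tolist())
```

Both versions produce the same numbers; the vectorized one hands the per-element work to compiled code instead of the Python interpreter.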

## When to use for loops<br>

While for-loop syntax in Python is flexible and provides wonderful utility, each iteration over an element is essentially a single step in the route through all elements of the container object. This step-through processing is useful when the order of operation matters (e.g., returning the first item in a list that meets a certain condition).
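
A small sketch of a case where the order matters and a loop is the natural fit: returning the first item that meets a condition and stopping there. The helper name `first_match` is hypothetical.

```python
def first_match(items, predicate):
    """Return the first item for which predicate(item) is true, or None."""
    for item in items:
        if predicate(item):
            return item  # stop as soon as a match is found
    return None

print(first_match([3, 8, 12, 5, 20], lambda x: x > 10))  # → 12
```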

## Why Vectorization is faster

**Vectorized processing may be applied when the order of processing does not matter.**

The built-in methods in NumPy and Pandas are implemented in compiled C code, which is what makes vectorization possible. Vectorized operations are almost always faster: execution time still grows with the number of elements, but with far less overhead per element than an equivalent Python loop.

**Parallel Processing**<br>
Vectorized operations can process several elements at once. Within a single core this happens through SIMD instructions (described later in this notebook).

Note that most element-wise NumPy and Pandas operations run on a single core. Only some routines, such as linear algebra backed by a multithreaded BLAS library, spread their work across several cores.

**Like-Datatypes**

Each NumPy array has a single datatype. <br>

Likewise, each Pandas Series (column) has a single dtype, such as int, float, datetime, or object (commonly used for strings).<br>

This allows data processing to be optimized, because every element in the container can be handled in the same way.<br>

**This is not the case with Python’s built-in container data-types, such as lists, sets, and dictionaries.** These types allow you to store a variety of types within them at the same time. A list may contain strings, ints, floats, other lists, etc.
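
A quick way to see this for yourself (a sketch; the variable names are illustrative):

```python
import numpy as np

# A list can hold any mix of types
mixed_list = [1, 2.5, "three"]
print([type(x).__name__ for x in mixed_list])  # → ['int', 'float', 'str']

# A NumPy array has exactly one dtype for all of its elements
int_array = np.array([1, 2, 3])
print(int_array.dtype)  # an integer dtype; the exact width depends on the platform

# Mixing ints and floats upcasts everything to a shared dtype
float_array = np.array([1, 2.5, 3])
print(float_array.dtype)  # → float64
```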

**Locality**<br>
NumPy stores an array in one contiguous block of memory. Because neighboring elements sit next to each other, they can be operated on faster.<br>

In contrast, the contents of a Python list may be scattered across memory.
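
NumPy exposes this layout directly. The sketch below checks that a small array is contiguous and prints its strides, the number of bytes to step to reach the next row and the next column.

```python
import numpy as np

# 2 rows x 3 columns of 8-byte integers
arr = np.arange(6, dtype=np.int64).reshape(2, 3)

print(arr.flags["C_CONTIGUOUS"])  # → True
print(arr.strides)                # → (24, 8)
```

The next column is 8 bytes away (one `int64`) and the next row is 24 bytes away (three `int64` values), so the six values sit back to back in memory.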

## The Mechanism Behind Vectorization — SISD vs SIMD

Two computer-architecture classifications are relevant to understanding vectorization:

**SISD — Single Instruction, Single Data**<br>
This is how Python for-loops are processed: one instruction operates on one data element at a time to produce one result.

The advantage is that it is flexible: you may implement any operation on your data. <br>
The drawback is that it is not optimal for processing large amounts of data.



**SIMD—Single Instruction, Multiple Data**<br>
This is how NumPy and Pandas vectorized operations are processed: one instruction operates on multiple data elements at a time to produce multiple results.

Contemporary CPUs have SIMD units in each core, so a single core can process several data elements in parallel.

# Vectorized Pandas Series and Arrays

In this notebook we compare `iterrows`, Series `apply`, and Series `map` on datasets of up to 100 million rows. <br>
We then compare their performance with that of vectorized Series and arrays.

In [None]:
import pandas as pd
import numpy as np

**Create a dataset of 100 million rows**

In [None]:
np.random.seed(3)
# Integer bounds (100, 1000, 10000) instead of float literals such as 10e1
random_numbers_1=np.random.randint(100, size=100_000_000)
random_numbers_2=np.random.randint(1_000, size=100_000_000)
random_numbers_3=np.random.randint(10_000, size=100_000_000)

df = pd.DataFrame({
    'random_numbers_1':random_numbers_1,
    'random_numbers_2':random_numbers_2,
    'random_numbers_3':random_numbers_3,
})
df

DataFrame Iterrows:
**`iterrows()` lets you iterate through a pandas DataFrame row by row, and it is usually an approach to avoid.** In this case, the code could not even finish within the time limit we set.

The following code takes a very long time to run. **Don't use iterrows for large datasets.**

Create a second, smaller dataset of 10 million rows and call it df_small.<br>
In the assignment we will modify the size of this dataset.

In [None]:
np.random.seed(43)
# Integer bounds (100, 1000, 10000) instead of float literals such as 10e1
random_numbers_1=np.random.randint(100, size=10_000_000)
random_numbers_2=np.random.randint(1_000, size=10_000_000)
random_numbers_3=np.random.randint(10_000, size=10_000_000)

df_small = pd.DataFrame({
    'random_numbers_1':random_numbers_1,
    'random_numbers_2':random_numbers_2,
    'random_numbers_3':random_numbers_3,
})
df_small

Now that we have a large dataset, let's compare different pandas functions to find the fastest one.

### Iterrows

In [None]:
%%time
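def standard_scalar_iterrows(pandas_df: pd.DataFrame,
                             mean_ps: float,
                             std_ps: float,
                             ) -> pd.DataFrame:
  """Iterate through the rows of the Pandas DataFrame and do the standard scaler calculation row by row"""
  scaled_values = []
  for index, row in pandas_df.iterrows():
    # Scale one row at a time (slow by design)
    scaled_values.append((row['random_numbers_2'] - mean_ps) / std_ps)
  # Assign the new column once, after the loop, instead of overwriting it on every iteration
  pandas_df['scaled_random_numbers_2'] = scaled_values
  return pandas_df

# Compute the statistics from df_small itself, the DataFrame being scaled
mean_ps = np.mean(df_small['random_numbers_2'])
std_ps = np.std(df_small['random_numbers_2'])

df_small = standard_scalar_iterrows(df_small,
                                    mean_ps,
                                    std_ps)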

### apply to pd.Series

**Times out, don't use this method on large datasets**

In [None]:
%%time
def standard_scalar_apply(
    pandas_element: int,
    mean_ps: float,
    std_ps: float,
    ) -> float:
    """Scale one element; pd.Series.apply() calls this function once per element of the Pandas Series
    """
    # Use the std_ps argument rather than a global variable
    scaled_pandas_element = (pandas_element - mean_ps) / std_ps
    return scaled_pandas_element

mean_ps=np.mean(df_small['random_numbers_2'])
std_pandas_series=np.std(df_small['random_numbers_2'])
df_small['scaled_random_numbers_2']=df_small['random_numbers_2'].apply(standard_scalar_apply,
                                                                     args=(mean_ps,std_pandas_series))

### Series Map<br>

Map the function over each element of the Pandas Series. This is somewhat faster than Series Apply, but still relatively slow.

In [None]:
%%time
def standard_scalar_map(
    pandas_element: int,
    mean_pandas_series: float,
    std_pandas_series: float
    )-> float:
    """Use pd.Series.map() to map through the elements in Pandas Series
    """
    scaled_pandas_element=(pandas_element-mean_pandas_series)/ std_pandas_series
    return scaled_pandas_element

# Compute the statistics from df_small, the DataFrame being scaled
mean_pandas_series=np.mean(df_small['random_numbers_2'])
std_pandas_series=np.std(df_small['random_numbers_2'])
df_small['scaled_random_numbers_2'] = df_small['random_numbers_2'].map(lambda x:
                                                           standard_scalar_map(x,
                                                                               mean_pandas_series=mean_pandas_series,
                                                                               std_pandas_series=std_pandas_series
                                                                               ))

**Assignment 1**<br>
1. Change the df_small dataset to have fewer rows.
2. What size of df_small can be used with the three previous functions so that they do not time out?



---

We have seen that using iterrows, apply, and map on large datasets leads to the notebook crashing or taking a very long time. <br>

Now let's look at what we can use for large datasets.

# **Vectorized** Series<br>

According to the official NumPy documentation, vectorization means being “able to delegate the task of performing mathematical operations on the array’s contents to optimized, compiled C code.” Instead of looping through rows, columns, or elements, this allows us to apply one set of instructions to multiple elements at the same time.

Here we use the built-in vectorized operations of a pandas Series together with NumPy. Many data operations can and should be vectorized. Even when pandas has no built-in vectorized operation for a complex custom function, you can often still find a suitable vectorized operation in NumPy.

In [None]:
%%time
def standard_scalar_vectorized_series(
    pandas_series: pd.Series
  )->pd.Series:
  """Vectorized operation across Pandas Series
  """
  scaled_series=(pandas_series-np.mean(pandas_series))/np.std(pandas_series)
  return scaled_series

df['scaled_random_numbers_2']=standard_scalar_vectorized_series(df['random_numbers_2'])
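
As one illustration of the vectorized operations available in NumPy, `np.where` replaces an element-by-element `if`/`else`. This is a sketch on a small made-up Series; the threshold of 100 and the labels are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
s = pd.Series([5, 50, 500, 5000])

# Element-wise conditional without an explicit loop
labels = np.where(s > 100, "large", "small")

print(labels.tolist())  # → ['small', 'small', 'large', 'large']
```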

# Vectorized array<br>

Working on the underlying NumPy array directly (a Pandas Series exposes it through the `.values` attribute) can be faster still than the vectorized Series.

In [None]:
%%time
def standard_scaler_vectorized_array(
    numpy_array: np.ndarray
    )->np.ndarray:
    """Vectorized operation across numpy arrays
    """
    scaled_array=(numpy_array - np.mean(numpy_array))/np.std(numpy_array)
    return scaled_array

df['scaled_random_numbers_2']=standard_scaler_vectorized_array(df['random_numbers_2'].values)
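
The cell above uses `.values`. The pandas documentation recommends `Series.to_numpy()` for the same purpose; the short sketch below, on a made-up Series, shows that both return a NumPy array with the same contents.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
s = pd.Series([1, 2, 3])

via_values = s.values        # attribute used in this notebook
via_to_numpy = s.to_numpy()  # method recommended by the pandas documentation

print(type(via_to_numpy).__name__)               # → ndarray
print(np.array_equal(via_values, via_to_numpy))  # → True
```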

**Assignment 2:** <br>
How large of a dataset can be used with the vectorized series and arrays before it times out?



---

**For 100 million rows of data:**<br>

Iterrows: timed out<br>
Series Apply: timed out<br>
Series Map: timed out<br>
Vectorized Series: 1.39 s CPU time<br>
Vectorized Arrays: 1.07 s CPU time
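
To reproduce this kind of comparison quickly, the sketch below times three of the approaches on a deliberately small Series of 100,000 rows. The helper `timed` and the variable names are hypothetical, and the timings will differ from machine to machine, so no expected timings are shown.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 1000, size=100_000))  # small on purpose
mean, std = s.mean(), s.std(ddof=0)  # ddof=0 matches np.std

def timed(label, fn):
    """Run fn once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.4f} s")
    return result

via_map = timed("Series.map", lambda: s.map(lambda x: (x - mean) / std))
via_series = timed("vectorized Series", lambda: (s - mean) / std)
via_array = timed("vectorized array", lambda: (s.to_numpy() - mean) / std)
```

All three produce the same scaled values; only the time taken differs.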