# Vectorization

Vectorization is an important technique for making code more efficient, especially when working with large datasets. The key idea is to operate on entire vectors or matrices at once instead of applying an operation sequentially to each element (e.g. through for loops). NumPy is often the foundation for these applications.

## When to Think About Vectorization

When working with small amounts of data, vectorization makes little difference (see the examples below). For more complicated tasks with much larger amounts of data, particularly in machine learning or deep learning applications, it becomes an essential tool. Apart from data size, two key questions help you decide whether vectorization applies:

1. Am I applying the same function uniformly across some data?
2. Can I use matrix operations?
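
As a rough illustration of these two questions, the sketch below uses made-up data. The names `temps_c`, `features`, and `weights` are hypothetical and are not part of the examples that follow. The first part applies the same conversion to every element; the second replaces a nested loop of weighted sums with a single matrix-vector product.

```python
import numpy as np

# Question 1: the same function applied uniformly to every element.
# Hypothetical data: temperatures in Celsius.
temps_c = np.array([0.0, 25.0, 100.0])
temps_f = temps_c * 9 / 5 + 32  # one expression converts the whole array

# Question 2: a computation that can be written as a matrix operation.
# Hypothetical data: 3 samples with 2 features each, one weight per feature.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
weights = np.array([0.5, 2.0])

# Loop version: one weighted sum per sample.
scores_loop = np.zeros(features.shape[0])
for i in range(features.shape[0]):
    for j in range(features.shape[1]):
        scores_loop[i] += features[i, j] * weights[j]

# Matrix version: a single matrix-vector product gives the same scores.
scores_matrix = features @ weights

print(temps_f)
print(np.allclose(scores_loop, scores_matrix))
```

If the answer to either question is yes, the loop is usually a candidate for replacement with a NumPy expression.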

## **Example: Multiplication**

Below is an introductory example showing how to vectorize multiplying an entire list of numbers by a constant.

In [4]:
## Required Imports
import numpy as np
import time
import timeit
np.random.seed(0) # Fix the seed so the random values are reproducible

In [5]:
# Multiplication using an explicit for loop
def explicit_multi(num_list, sample_num):

    list_explicit = np.zeros(sample_num) # Storage

    t1_explicit = time.time() # Start timer

    # Sequentially multiply each value
    for i in range(sample_num):
        list_explicit[i] = num_list[i]*5
    t2_explicit = time.time() # End timer

    print('Explicit: {}'.format(t2_explicit - t1_explicit))
    return list_explicit

# Multiplication using a vectorized implementation
def vectorized_multi(num_list, sample_num):
    # sample_num is unused here; it is kept so both functions share a signature

    t1_vect = time.time() # Start timer

    # Instead of a for loop, let NumPy apply the multiplication
    # to the whole array in a single operation
    list_vectorized = num_list*5

    t2_vect = time.time() # End timer
    print('Vectorized: {}'.format(t2_vect - t1_vect))
    return list_vectorized

With small datasets, the vectorized version is roughly as fast as, or only slightly faster than, the explicit loop.

In [11]:
# 20 Numbers
number_list = np.random.randint(0, 10, size=20)
exp = explicit_multi(number_list, 20)
vect = vectorized_multi(number_list, 20)

# Verify that the two are the same:
print('Arrays Equal: {}'.format(np.array_equal(exp, vect)))

Explicit: 1.621246337890625e-05
Vectorized: 1.2874603271484375e-05
Arrays Equal: True


With larger datasets, the savings from vectorization grow rapidly.

In [12]:
# 10000 Numbers
print('10,000 Numbers')
number_list = np.random.randint(0, 10, size=10000)
exp = explicit_multi(number_list, 10000)
vect = vectorized_multi(number_list, 10000)

# 1000000 Numbers
print('1,000,000 Numbers')
number_list = np.random.randint(0, 10, size=1000000)
exp = explicit_multi(number_list, 1000000)
vect = vectorized_multi(number_list, 1000000)

10,000 Numbers
Explicit: 0.0037522315979003906
Vectorized: 0.00019502639770507812
1,000,000 Numbers
Explicit: 0.3107171058654785
Vectorized: 0.0034859180450439453
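
A single `time.time()` difference can be noisy, especially at the microsecond scale of the 20-number case. One possible way to get steadier numbers is the standard-library `timeit` module, which is already imported above. The sketch below is an assumption about how such a measurement could be set up and is not how the timings above were produced. The helper names `loop_multiply` and `vectorized_multiply` and the repeat counts are chosen for illustration only.

```python
import timeit
import numpy as np

np.random.seed(0)
number_list = np.random.randint(0, 10, size=10000)

def loop_multiply(values):
    # Hypothetical helper: element-by-element multiplication in a Python loop
    result = np.zeros(len(values))
    for i in range(len(values)):
        result[i] = values[i] * 5
    return result

def vectorized_multiply(values):
    # Hypothetical helper: the same multiplication as one NumPy expression
    return values * 5

# Run each version 10 times per measurement, repeat the measurement 5 times,
# and keep the fastest repeat to reduce the effect of background noise.
runs = 10
loop_time = min(timeit.repeat(lambda: loop_multiply(number_list),
                              number=runs, repeat=5)) / runs
vect_time = min(timeit.repeat(lambda: vectorized_multiply(number_list),
                              number=runs, repeat=5)) / runs

print('Explicit (best of 5): {}'.format(loop_time))
print('Vectorized (best of 5): {}'.format(vect_time))
```

Taking the minimum over several repeats is a common convention, since slower repeats mostly reflect interference from other processes rather than the code being measured.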


## **Example: Row Sums**

Let's look at summing multiple lists of numbers at once, using built-in NumPy functions.

In [16]:
np.random.seed(0)

# 1000 samples with 100 numbers each
samples = np.random.randint(0, 10, (1000,100))

# Looking for an array of 1000 sums (one per row)

def explicit_sums(samples):
    
    sum_explicit = np.zeros(samples.shape[0]) # Storage

    t1_explicit = time.time()

    # Sequentially sum
    for i in range(samples.shape[0]):
        curr_row = samples[i,:]
        temp = 0
        for j in range(len(curr_row)):
            temp += curr_row[j] # Sum the row
        sum_explicit[i] = temp
    t2_explicit = time.time() # End Timer

    print('Explicit: {}'.format(t2_explicit - t1_explicit))
    

# Row sums using a vectorized implementation
def vectorized_sums(samples):

    t1_vect = time.time() # Start timer
    # Use built-in NumPy functions for the vectorized implementation
    sum_vectorized = np.sum(samples, axis=1) # axis=1 sums across each row

    t2_vect = time.time()
    print('Vectorized: {}'.format(t2_vect - t1_vect))


explicit_sums(samples)
vectorized_sums(samples)

Explicit: 0.018610000610351562
Vectorized: 0.00030112266540527344
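
Neither function above returns its result, so the two approaches are never compared directly. As a small check, assuming the same seed and shape as above, the sketch below builds the row sums both ways and confirms that they match. It also shows how the `axis` argument changes the shape of the result. The variable names are illustrative.

```python
import numpy as np

np.random.seed(0)
samples = np.random.randint(0, 10, (1000, 100))

# Row sums with a Python loop over rows (built-in sum over each row)
row_sums_loop = np.array([sum(row) for row in samples])

# Row sums with NumPy: axis=1 collapses the columns, leaving one sum per row
row_sums = np.sum(samples, axis=1)

# axis=0 collapses the rows instead, leaving one sum per column
col_sums = np.sum(samples, axis=0)

print(row_sums.shape)   # (1000,)
print(col_sums.shape)   # (100,)
print(np.array_equal(row_sums_loop, row_sums))  # True
```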
