<a href="https://colab.research.google.com/github/vanderbilt-data-science/p4ai-essentials/blob/main/5_vectorization_solns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectorization
> Rounding out Python knowledge with vectorization

As you know from your coursework with DataCamp, numpy is optimized for fast, vectorized numerical operations, including linear algebra. Let's check this out.

In [19]:
import numpy as np
import pandas as pd
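To get a feel for what "fast vectorized operations" means, here is a minimal sketch that doubles every element of an array twice: once with a Python loop and once with a single numpy expression. The array size and the helper names (`loop_double`, `vectorized_double`) are made up for illustration, and the exact timings will vary from machine to machine.

```python
import timeit
import numpy as np

# Hypothetical data: 100,000 values, chosen only for illustration
values = np.arange(100_000, dtype=np.float64)

def loop_double(arr):
    # Pure-Python loop: one multiplication per iteration
    return [v * 2.0 for v in arr]

def vectorized_double(arr):
    # A single numpy expression; the looping happens in compiled code
    return arr * 2.0

# Time three runs of each approach
loop_time = timeit.timeit(lambda: loop_double(values), number=3)
vec_time = timeit.timeit(lambda: vectorized_double(values), number=3)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")

# Both approaches produce the same numbers
same = np.allclose(loop_double(values), vectorized_double(values))
print(same)
```

The vectorized version is typically much faster, which is why the rest of this notebook avoids explicit loops where it can.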

# Introduction to numpy
We've already seen some numpy operations through DataCamp. Let's remind ourselves of the functionality using some of our previous examples.

In [20]:
#@markdown We'll just hide the work that we're going to do to generate the dataset.
#@markdown It's a simple, inelegant way of generating just a bit more data.
#@markdown Make sure to execute this cell to have access to the dog_data dataset.

#original data
weight_kgs = [25.0, 20.22, 17.83, 10.22, 8.05]
height_cm = [68.0, 57.99, 45.21, 36.2, 10.22]
neck_circ_cm = [45.2, 50.35, 55.2, 40.88, 5.06]
back_length_cm = [63.2, 50.25, 43.8, 50.1, 12.5]
chest_circ_cm = [78.2, 86.92, 53.9, 71.2, 25.5]
breed = ['Afghan Hound', 'Airedale Terrier', 'Staffordshire Terrier', 'Australian Shepherd', 'Toy Poodle']

dog_data_sm = {'weight_kgs': weight_kgs,
            'height_cm': height_cm,
            'neck_circ_cm': neck_circ_cm,
            'back_length_cm': back_length_cm,
            'chest_circ_cm' : chest_circ_cm,
            'breed': breed}

#extend to 10 elements
dog_data = {key : value + (np.array(value) + 2*np.random.rand(len(value))).tolist()
            for key, value in dog_data_sm.items() if key!='breed'}

#fix breed to be twice as long
dog_data['breed'] = breed + breed

#for visualization purposes
pd.DataFrame(dog_data)

,weight_kgs,height_cm,neck_circ_cm,back_length_cm,chest_circ_cm,breed
0,25.0,68.0,45.2,63.2,78.2,Afghan Hound
1,20.22,57.99,50.35,50.25,86.92,Airedale Terrier
2,17.83,45.21,55.2,43.8,53.9,Staffordshire Terrier
3,10.22,36.2,40.88,50.1,71.2,Australian Shepherd
4,8.05,10.22,5.06,12.5,25.5,Toy Poodle
5,25.584352,69.842679,45.438713,64.725289,79.788767,Afghan Hound
6,22.118486,58.610651,51.782727,51.460148,88.177771,Airedale Terrier
7,19.59765,46.060522,55.627905,45.713229,54.991344,Staffordshire Terrier
8,11.84157,37.6354,41.525984,51.276206,72.949965,Australian Shepherd
9,8.335528,10.734125,5.915546,12.589517,26.155257,Toy Poodle


## Creation of numpy arrays

In [21]:
#Let's practice dictionary comprehensions to remove the breed key
dog_data_num = {key:value for key, value in dog_data.items() if key != 'breed'}

In [22]:
#Let's try an unnecessarily difficult approach to make this dataset
column_names, data = list(zip(*dog_data_num.items()))
column_names

('weight_kgs', 'height_cm', 'neck_circ_cm', 'back_length_cm', 'chest_circ_cm')
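If the `zip(*...)` idiom above looks cryptic, this small sketch may help. It uses a made-up two-key dictionary (`toy`) in place of `dog_data_num`; the `*` unpacks the `(key, value)` pairs as separate arguments, and `zip` then regroups them into one tuple of keys and one tuple of values.

```python
# A tiny hypothetical dictionary standing in for dog_data_num
toy = {'a': [1, 2], 'b': [3, 4]}

# toy.items() yields ('a', [1, 2]) and ('b', [3, 4]).
# The * passes those pairs to zip as separate arguments, and zip
# groups the first elements together and the second elements together.
keys, values = zip(*toy.items())

print(keys)    # ('a', 'b')
print(values)  # ([1, 2], [3, 4])
```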

In [23]:
#Make target vector
weight_kgs_np = np.array(data[0])
weight_kgs_np

array([25.        , 20.22      , 17.83      , 10.22      ,  8.05      ,
       25.58435213, 22.11848622, 19.59764978, 11.84157034,  8.33552815])

In [24]:
#Create data
np_data = np.array(data).T
np_data

array([[25.        , 68.        , 45.2       , 63.2       , 78.2       ],
       [20.22      , 57.99      , 50.35      , 50.25      , 86.92      ],
       [17.83      , 45.21      , 55.2       , 43.8       , 53.9       ],
       [10.22      , 36.2       , 40.88      , 50.1       , 71.2       ],
       [ 8.05      , 10.22      ,  5.06      , 12.5       , 25.5       ],
       [25.58435213, 69.84267927, 45.43871341, 64.72528904, 79.78876721],
       [22.11848622, 58.61065074, 51.78272726, 51.46014787, 88.17777128],
       [19.59764978, 46.06052162, 55.62790459, 45.71322926, 54.99134387],
       [11.84157034, 37.6354004 , 41.52598441, 51.27620597, 72.94996527],
       [ 8.33552815, 10.73412518,  5.9155463 , 12.58951745, 26.15525682]])

In [25]:
#Remove target column from data
np_data = np_data[:, 1:]
np_data

array([[68.        , 45.2       , 63.2       , 78.2       ],
       [57.99      , 50.35      , 50.25      , 86.92      ],
       [45.21      , 55.2       , 43.8       , 53.9       ],
       [36.2       , 40.88      , 50.1       , 71.2       ],
       [10.22      ,  5.06      , 12.5       , 25.5       ],
       [69.84267927, 45.43871341, 64.72528904, 79.78876721],
       [58.61065074, 51.78272726, 51.46014787, 88.17777128],
       [46.06052162, 55.62790459, 45.71322926, 54.99134387],
       [37.6354004 , 41.52598441, 51.27620597, 72.94996527],
       [10.73412518,  5.9155463 , 12.58951745, 26.15525682]])

## Learning about the numpy array

In [26]:
# number of rows and columns
print(np_data.shape)

# number of dimensions of the data
print(np_data.ndim)

# data type of elements
np_data.dtype

(10, 4)
2


dtype('float64')

## Indexing
We can use one index per dimension to access data along each dimension.

In [27]:
# row 0 and all columns
np_data[0,:]

array([68. , 45.2, 63.2, 78.2])

In [28]:
# row 5, column 2
np_data[5, 2]

64.72528904123999

In [29]:
# boolean indexing: get rows where the first column (height_cm, since weight was removed) is greater than 20 cm
np_data[np_data[:,0]>20, :]

array([[68.        , 45.2       , 63.2       , 78.2       ],
       [57.99      , 50.35      , 50.25      , 86.92      ],
       [45.21      , 55.2       , 43.8       , 53.9       ],
       [36.2       , 40.88      , 50.1       , 71.2       ],
       [69.84267927, 45.43871341, 64.72528904, 79.78876721],
       [58.61065074, 51.78272726, 51.46014787, 88.17777128],
       [46.06052162, 55.62790459, 45.71322926, 54.99134387],
       [37.6354004 , 41.52598441, 51.27620597, 72.94996527]])
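Boolean indexing happens in two steps, which the sketch below separates out. The small array `arr` is hypothetical and only stands in for `np_data`.

```python
import numpy as np

# Hypothetical 3x2 array, for illustration only
arr = np.array([[68.0, 45.2],
                [10.2,  5.1],
                [36.2, 40.9]])

# Step 1: comparing a column with a scalar yields one True/False per row
mask = arr[:, 0] > 20
print(mask)  # True for rows 0 and 2, False for row 1

# Step 2: indexing with the mask keeps only the rows where it is True
selected = arr[mask, :]
print(selected)
```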

In [30]:
# prepend a column of ones so that w_0 acts as the intercept in the model below
np_data_mdl = np.hstack((np.ones_like(np_data[:,0]).reshape(-1,1), np_data))
np_data_mdl

array([[ 1.        , 68.        , 45.2       , 63.2       , 78.2       ],
       [ 1.        , 57.99      , 50.35      , 50.25      , 86.92      ],
       [ 1.        , 45.21      , 55.2       , 43.8       , 53.9       ],
       [ 1.        , 36.2       , 40.88      , 50.1       , 71.2       ],
       [ 1.        , 10.22      ,  5.06      , 12.5       , 25.5       ],
       [ 1.        , 69.84267927, 45.43871341, 64.72528904, 79.78876721],
       [ 1.        , 58.61065074, 51.78272726, 51.46014787, 88.17777128],
       [ 1.        , 46.06052162, 55.62790459, 45.71322926, 54.99134387],
       [ 1.        , 37.6354004 , 41.52598441, 51.27620597, 72.94996527],
       [ 1.        , 10.73412518,  5.9155463 , 12.58951745, 26.15525682]])
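The column of ones is what lets a single dot product include an intercept term: the first weight is multiplied by 1 for every row. Here is a minimal sketch of the same construction on a hypothetical three-row array (`features`), building the ones column from the row count instead of a hard-coded number.

```python
import numpy as np

# Hypothetical feature matrix with 3 rows and 2 columns
features = np.array([[68.0, 45.2],
                     [57.99, 50.35],
                     [45.21, 55.2]])

# Build the column of ones from the number of rows in the data
ones = np.ones((features.shape[0], 1))

# Stack the ones column to the left of the features
with_intercept = np.hstack((ones, features))
print(with_intercept.shape)  # (3, 3)
```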

### Example 1: Converting to weight_lbs
Instead of using a for loop to create a new list, we can let numpy broadcast the multiplication across the whole array. The new array `weight_lbs_np` is the conversion of `weight_kgs_np`, where 1 kg ≈ 2.2 lbs.

In [31]:
# using broadcasting/vectorization
weight_lbs_np = weight_kgs_np*2.2
weight_lbs_np

array([55.        , 44.484     , 39.226     , 22.484     , 17.71      ,
       56.28557469, 48.66066968, 43.11482951, 26.05145476, 18.33816192])
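For comparison, here is a sketch of the same conversion written both as a loop and as a broadcast multiplication, using only the five original weights. The variable names are chosen for this example.

```python
import numpy as np

# The five original weights from the dataset
weight_kgs = [25.0, 20.22, 17.83, 10.22, 8.05]

# Loop version: convert one element at a time
weight_lbs_loop = []
for kg in weight_kgs:
    weight_lbs_loop.append(kg * 2.2)

# Vectorized version: the scalar 2.2 is broadcast across the whole array
weight_lbs_vec = np.array(weight_kgs) * 2.2

# Both give the same values
print(np.allclose(weight_lbs_loop, weight_lbs_vec))
```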

### Example 2: A basic linear regression
Let's make a REALLY terrible predictor. We'll try to predict the weight based on the height, circumferences, and back lengths. What is this operation?

$$ \widehat{\text{dog weight}} = w_0 + (w_1 \cdot \text{height}) + (w_2 \cdot \text{neck circumference}) + (w_3 \cdot \text{back length}) + (w_4 \cdot \text{chest circumference}) $$
<center><br><img src = 'https://algebra1course.files.wordpress.com/2013/02/dot-product-visual.jpg' /></center>

In [32]:
# let's just start by making a list of arbitrary (hand-picked) weights
w = [1, 0.3, -0.4, 2.2, 0.5]

# Let's turn this into a numpy array
np_w = np.array(w)

# Explore the shape and dimension
print(np_w, 'shape:', np_w.shape, 'dim:', np_w.ndim)

[ 1.   0.3 -0.4  2.2  0.5] shape: (5,) dim: 1


We can think of this problem as a particular linear algebra operation. Let's look at the [numpy linear algebra documentation](https://numpy.org/doc/stable/reference/routines.linalg.html).

In [33]:
# let's do the calculation for a single "row" or "dog"
dogweight_pred = np.dot(np_w, np_data_mdl[0,:])

# view the result
dogweight_pred

181.46000000000004

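How can we do this for all of the data? When given a 2-D array and a 1-D array, `np.dot` computes a matrix-vector product, which is the dot product of each row with the weights. (Element-wise multiplication with [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html), followed by a sum across each row, gives the same result.)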

In [34]:
#np.dot with a 2-D array and a 1-D array applies the dot product to each row
dogweight_preds = np.dot(np_data_mdl, np_w)

#view the result
dogweight_preds

array([181.46      , 152.267     , 115.793     , 141.328     ,
        42.292     , 186.06733791, 155.17131527, 120.63177095,
       144.96286214,  42.62858584])
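The per-row dot product can be written in a few equivalent ways. The sketch below uses the first two rows of the data (with the ones column already prepended) and the same weights, and compares `np.dot`, the `@` operator, and an explicit broadcast-then-sum. The names `X` and `preds_*` are chosen for this example.

```python
import numpy as np

# First two rows of the model data (ones column included) and the weights
X = np.array([[1.0, 68.0, 45.2, 63.2, 78.2],
              [1.0, 57.99, 50.35, 50.25, 86.92]])
w = np.array([1, 0.3, -0.4, 2.2, 0.5])

# Matrix-vector product: one dot product per row
preds_dot = np.dot(X, w)

# The @ operator performs the same matrix-vector product
preds_matmul = X @ w

# Broadcasting: w is stretched across the rows of X for the element-wise
# multiply, then each row is summed
preds_broadcast = (X * w).sum(axis=1)

print(preds_dot)  # approximately [181.46, 152.267]
```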

Yay, us! With a few lines (but more knowledge of mathematics), we've created a terrible prediction model!

In [35]:
#@title Try it Yourself: Normalizing data
#@markdown Let's try a vectorized approach to normalizing data. You'll normalize the `weight_kgs_np` column, and use only the first
#@markdown 5 elements (to match the previous example with iteration to check your work).
#@markdown The calculation for normalization that we will use is:
#@markdown $$ \frac{value - mean(column)}{std(column)} $$

#standardize: subtract the mean, then divide by the standard deviation
weight_kgs_norm = (weight_kgs_np[:5] - np.mean(weight_kgs_np[:5]))/ np.std(weight_kgs_np[:5])
weight_kgs_norm

array([ 1.38677247,  0.62798442,  0.24859039, -0.95943828, -1.303909  ])
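One quick way to check a normalization like this is to confirm that the result has a mean of roughly 0 and a standard deviation of roughly 1. This is a small sketch using the five original weights; the variable names are chosen for this example.

```python
import numpy as np

# The five original weights from the dataset
weights = np.array([25.0, 20.22, 17.83, 10.22, 8.05])

# Subtract the mean and divide by the (population) standard deviation
normalized = (weights - weights.mean()) / weights.std()

# After normalization, the mean should be ~0 and the standard deviation ~1
print(normalized.mean(), normalized.std())
```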

In [36]:
# your answer here

### Example 3: A basic neural network
Recall the REALLY terrible predictor from Example 2, which tries to predict the weight based on the height, circumferences, and back lengths:

$$ \widehat{\text{dog weight}} = w_0 + (w_1 \cdot \text{height}) + (w_2 \cdot \text{neck circumference}) + (w_3 \cdot \text{back length}) + (w_4 \cdot \text{chest circumference}) $$

Consider the behavior of a dot product.
<center><br><img src = 'https://algebra1course.files.wordpress.com/2013/02/dot-product-visual.jpg' /></center>

Now, consider that a layer of nodes in a neural network is just a bunch of "models" stacked up.
<center><br><img width='40%' src = 'https://media.geeksforgeeks.org/wp-content/uploads/20200702205951/nn.PNG' /></center>

How can we extend what we've done here to this model?

In [39]:
#recall our original data
np_data

array([[68.        , 45.2       , 63.2       , 78.2       ],
       [57.99      , 50.35      , 50.25      , 86.92      ],
       [45.21      , 55.2       , 43.8       , 53.9       ],
       [36.2       , 40.88      , 50.1       , 71.2       ],
       [10.22      ,  5.06      , 12.5       , 25.5       ],
       [69.84267927, 45.43871341, 64.72528904, 79.78876721],
       [58.61065074, 51.78272726, 51.46014787, 88.17777128],
       [46.06052162, 55.62790459, 45.71322926, 54.99134387],
       [37.6354004 , 41.52598441, 51.27620597, 72.94996527],
       [10.73412518,  5.9155463 , 12.58951745, 26.15525682]])

In [48]:
# create random weights for a layer of 3 nodes (one row per input feature, one column per node)
layer_node_w = np.random.rand(4,3)
layer_node_w

array([[0.06129003, 0.78033796, 0.4179302 ],
       [0.75368389, 0.12791234, 0.60439259],
       [0.31939977, 0.82546567, 0.94938638],
       [0.73851387, 0.12671102, 0.0297295 ]])

In [47]:
# get one single input (row of data)
x = np_data[0,:]
x

array([68. , 45.2, 63.2, 78.2])

In [49]:
# get outputs of all nodes
layer_output = np.dot(x, layer_node_w)
layer_output

array([116.17208414, 120.92285089, 118.06386502])

Now, we'll do essentially the same thing for our output node.

In [52]:
output_w = np.random.rand(3)
output_w

array([0.55909116, 0.08656277, 0.30616329])

In [53]:
nn_pred = np.dot(layer_output, output_w)
nn_pred

111.56502391663022

Yay!! Vectorization has helped us quickly compute the forward pass of a neural network!
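The same two steps extend from one row to every row at once. The sketch below assumes a seeded random generator (only so the example is repeatable) and the first three rows of the data. Like the cells above, it leaves out bias terms and activation functions, so it illustrates only the matrix products and is not a complete network. The names `X`, `hidden_w`, and `preds` are chosen for this example.

```python
import numpy as np

# Seeded generator, used here only so the sketch is repeatable
rng = np.random.default_rng(0)

# First three rows of the original data (4 features each)
X = np.array([[68.0, 45.2, 63.2, 78.2],
              [57.99, 50.35, 50.25, 86.92],
              [45.21, 55.2, 43.8, 53.9]])

hidden_w = rng.random((4, 3))  # 4 inputs feeding 3 nodes
output_w = rng.random(3)       # 3 node outputs feeding 1 output node

# All rows at once: (3, 4) @ (4, 3) -> (3, 3), then (3, 3) @ (3,) -> (3,)
hidden = X @ hidden_w
preds = hidden @ output_w

# One row at a time, for comparison with the cells above
row_preds = np.array([np.dot(np.dot(x, hidden_w), output_w) for x in X])

print(np.allclose(preds, row_preds))
```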