Let's first import the necessary libraries.

In [1]:
# Imported for proper rendering of LaTeX in Colab.
from IPython.display import display, Math, Latex
import numpy as np

# Import for generating plots
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


### N2 : Model

##### a. Quick recap
1. The training data contains features and a label that is a real number.

2. A linear regression model uses a linear combination of the features to obtain the labels. In vectorized form, this can be written as $\textbf y = \textbf {Xw}$

##### b. Objective
The objective of this notebook is to implement the model and the inference component of the linear regression model.


**Note :**
* The model is parameterized by its weight vector.

* It is described by its mathematical form and weight vector.

#### **Implementation**

The general vectorized form is as follows:
\begin{equation} 
\textbf y_{n\times 1} =\textbf X_{n \times (m+1)} \textbf w_{(m+1)\times 1}
\end{equation}

where:
* **n** is number of examples in dataset (train/test/validation).

* **m** is the number of features.

* **X** is a feature matrix containing $(m+1)$ features for $n$ examples along its rows. (Notice the uppercase bold **X** used for a matrix.)

* **w** is a weight vector containing $(m+1)$ weights, one for each feature. (Notice the lowercase bold **w** used for a vector.)

* **y** is a label vector containing labels for $n$ examples with shape $(n,)$.
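To make the shapes concrete, here is a small sketch with assumed sizes $n = 5$ and $m = 3$. The sizes and the names `X_demo`, `w_demo` and `y_demo` are chosen only for illustration; they are not part of the notebook's dataset. The leading column of ones is the extra (dummy) feature that brings the feature count to $m+1$.

```python
import numpy as np

n, m = 5, 3  # assumed sizes, for illustration only

rng = np.random.default_rng(0)
features = rng.random((n, m))                      # n examples, m features
X_demo = np.column_stack((np.ones(n), features))   # shape (n, m+1)
w_demo = np.ones(m + 1)                            # shape (m+1,)
y_demo = X_demo @ w_demo                           # shape (n,)

print(X_demo.shape, w_demo.shape, y_demo.shape)
```

With a one-dimensional weight vector, the product is a one-dimensional label vector of shape $(n,)$ rather than an $(n, 1)$ column.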

In [2]:
def predict(X, w):
    assert X.shape[-1] == w.shape[0], "X and w don't have compatible dimensions"
    # returns the predicted_label vector
    return X @ w
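The dimension check can be exercised directly. The sketch below repeats `predict` so that it runs on its own and passes a weight vector that is one entry short; `X_ok`, `w_bad` and `caught` are hypothetical names used only for this illustration.

```python
import numpy as np

def predict(X, w):
    # Same check and computation as the function above.
    assert X.shape[-1] == w.shape[0], "X and w don't have compatible dimensions"
    return X @ w

X_ok = np.array([[1, 3, 2, 5], [1, 9, 4, 7]])   # shape (2, 4)
w_bad = np.array([1, 1, 1])                      # shape (3,): one weight short

try:
    predict(X_ok, w_bad)
    caught = False
except AssertionError as err:
    caught = True
    print("AssertionError:", err)
```

Note that `assert` statements are skipped when Python runs with the `-O` flag, so code that must always validate its inputs would typically raise a `ValueError` instead.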

We test this function with the following feature matrix $\textbf X_{2\times (3+1)}:$

\begin{equation} 
\textbf X_{2 \times (3+1)} = \begin{bmatrix} 
1&3&2&5\\ 
1&9&4&7\end{bmatrix} 
\end{equation}
and weight vector $\textbf w$ :
\begin{equation} 
\textbf w_{4\times 1}= \begin{bmatrix}1\\1\\1\\1 \end{bmatrix} 
\end{equation}

Let's perform matrix-vector multiplication between the feature matrix $\textbf X$ and a weight vector $\textbf w$ to obtain labels for all examples:
\begin{align}
\textbf y &= \textbf {Xw} \\
&= \begin{bmatrix} 1&3&2&5\\ 1&9&4&7 \end{bmatrix} \begin{bmatrix} 1\\1\\1\\1 \end{bmatrix} \\
&= \begin{bmatrix} 1\times 1+3\times 1+2\times 1+5\times 1 \\ 1\times 1+9\times 1+4\times 1+7\times 1 \end{bmatrix} \\
&= \begin{bmatrix} 11\\21 \end{bmatrix}
\end{align}

In [3]:
import unittest
class TestPredict(unittest.TestCase):
    '''Test case for the predict function of linear regression.'''
    def test_predict(self):
        # set up
        train_matrix = np.array([[1,3,2,5],[1,9,4,7]])
        weight_vector =np.array([1,1,1,1])
        expected_label_vector=np.array([11,21])

        # call
        predicted_label_vector = predict(train_matrix,weight_vector)

        # asserts
        # test the shape
        self.assertEqual(predicted_label_vector.shape, (2,))
        
        # and contents
        np.testing.assert_array_equal(expected_label_vector,predicted_label_vector) 

unittest.main(argv=[''],defaultTest='TestPredict',verbosity=2, exit=False)

test_predict (__main__.TestPredict) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


<unittest.main.TestProgram at 0x1e70ba74100>

### Demonstration on synthetic dataset


In [4]:
def add_dummy_feature(X):
    return np.column_stack((np.ones(X.shape[0]) ,X))
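As a quick check of what this helper does, the sketch below applies it to a one-dimensional feature array and to a two-feature matrix. The input arrays `x_single` and `x_multi` are made up for illustration.

```python
import numpy as np

def add_dummy_feature(X):
    # Same helper as above: prepend a column of ones.
    return np.column_stack((np.ones(X.shape[0]), X))

x_single = np.array([2.0, 4.0, 6.0])            # one feature, shape (3,)
x_multi = np.array([[2.0, 3.0], [4.0, 5.0]])    # two features, shape (2, 2)

X_single = add_dummy_feature(x_single)          # shape (3, 2)
X_multi = add_dummy_feature(x_multi)            # shape (2, 3)

print(X_single)
print(X_multi)
```

Because `np.column_stack` treats one-dimensional inputs as columns, the same helper works for a single feature vector and for a full feature matrix.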

In [5]:
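n = 100                        # number of examples
w = np.array([5,6])            # weights used to generate the labels
X = 10*np.random.rand(n)       # single feature drawn uniformly from [0, 10)
X = add_dummy_feature(X)       # prepend the column of ones
noise = np.random.rand(n,)     # uniform noise in [0, 1)
y = X@w + noise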

In [6]:
print(X[:5],y[:5])

[[1.         8.26291599]
 [1.         6.3233168 ]
 [1.         4.24764423]
 [1.         9.56568747]
 [1.         0.70380504]] [54.7493043  43.4791561  31.03635453 62.53363381  9.77963824]


Preprocessing: Dummy feature and train-test split


In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20, random_state=42)
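Note that `train_test_split` returns its outputs in the order `X_train, X_test, y_train, y_test`. To illustrate the idea behind a shuffled 80/20 split, here is a NumPy-only sketch on a tiny made-up dataset. It is not scikit-learn's implementation, and all names in it (`X_demo`, `y_demo`, `X_tr`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_demo = 10
X_demo = np.column_stack((np.ones(n_demo), rng.random(n_demo)))
y_demo = X_demo @ np.array([5.0, 6.0])

# Shuffle the row indices, then hold out the first 20% of them as the test set.
indices = rng.permutation(n_demo)
n_test = int(0.2 * n_demo)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Index X and y with the same indices so each row stays paired with its label.
X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

The key point is that features and labels are indexed with the same shuffled indices, so every example keeps its own label after the split.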

Since we have not yet trained our model, let's use a random weight vector to get predictions from our model for the given dataset.

In [8]:
wt_model = np.random.rand(2,)
wt_model

array([0.31607789, 0.52993199])

In [9]:
y_hat = predict(X_train, wt_model)
print('Predicted Labels are:\n', y_hat[:10])
print()
print('Actual Labels are:\n', y_train[:10])

Predicted Labels are:
 [3.01953067 2.0224964  0.33416342 5.00302977 1.92563714 4.86362332
 0.87135491 5.11689399 4.8810301  4.71712963]

Actual Labels are:
 [36.18546102 24.94491657  5.70523776 58.99725074 23.9936738  56.96673461
 11.62803818 60.24328421 57.52479088 55.44323749]


Since we used a random weight vector $\textbf w$ here, the predicted labels are far from the actual labels.
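One way to put a number on this mismatch is the mean absolute error between predictions and labels. The sketch below regenerates a dataset in the same way as above, but with its own seed, and compares random weights against the weights used to generate the labels. The metric and the names `w_true`, `w_random`, `mae_random` and `mae_true` are assumptions made for illustration; the notebook itself does not define an error measure at this point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo = 100
X_demo = np.column_stack((np.ones(n_demo), 10 * rng.random(n_demo)))
w_true = np.array([5.0, 6.0])            # weights used to generate the labels
y_demo = X_demo @ w_true + rng.random(n_demo)

w_random = rng.random(2)                 # untrained, random weights in [0, 1)

# Mean absolute error of each weight vector on the same data.
mae_random = np.mean(np.abs(X_demo @ w_random - y_demo))
mae_true = np.mean(np.abs(X_demo @ w_true - y_demo))

print(f"MAE with random weights:     {mae_random:.3f}")
print(f"MAE with generating weights: {mae_true:.3f}")
```

With the generating weights, the only remaining error is the added noise, so the error stays below 1, whereas the random weights miss by a much larger margin.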

#### Comparison of vectorized and non-vectorized versions of model inference

In [10]:
def non_vectorized_predict(X, w):
    ''' Prediction of output label for a given input.

        Args:
            X: Feature matrix of shape (n,m+1)
            w: Weight vector of shape (m+1,)
        Returns: 
            y: Predicted label vector of shape (n,)
    '''
    y = []
    for i in range(0, X.shape[0]):
        y_hat_i = 0
        for j in range(0, X.shape[1]):
            y_hat_i += X[i][j]*w[j]
        y.append(y_hat_i)
    return np.array(y)


In [11]:
import unittest


class TestPredictNonVectorized(unittest.TestCase):
    def test_predict_non_vectorized(self):

        #set up
        train_matrix = np.array([[1, 3, 2, 5], [1, 9, 4, 7]])
        weight_vector = np.array([1, 1, 1, 1])
        expected_label_vector = np.array([11, 21])

        #call
        predicted_label_vector = non_vectorized_predict(
            train_matrix, weight_vector)

        #asserts
        #test the shape
        self.assertEqual(predicted_label_vector.shape, (2,))

        #and contents
        np.testing.assert_array_equal(
            expected_label_vector, predicted_label_vector)


unittest.main(
    argv=[''], defaultTest='TestPredictNonVectorized', verbosity=2, exit=False)


test_predict_non_vectorized (__main__.TestPredictNonVectorized) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


<unittest.main.TestProgram at 0x1e70be336a0>

Comparing the run time of the vectorized and non-vectorized versions on the small dataset (the training split of the 100 generated data points).

In [12]:
import time
start_time = time.time()
y_hat_vectorized = predict(X_train, w)
end_time = time.time()

print("Total time incurred in vectorized inference is: %0.7f s" %
      (end_time-start_time))

start_time = time.time()
y_hat_non_vectorized = non_vectorized_predict(X_train, w)
end_time = time.time()
print("Total time incurred in non-vectorized inference is: %0.7f s" %
      (end_time - start_time))

np.testing.assert_array_equal(y_hat_vectorized, y_hat_non_vectorized)


Total time incurred in vectorized inference is: 0.0000000 s
Total time incurred in non-vectorized inference is: 0.0000000 s
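Both timings print as zero, most likely because each call finishes in less time than `time.time()` can resolve on the machine that produced this output. A sketch of one alternative is the standard-library `timeit` module, which repeats a call many times and uses `time.perf_counter` as its default timer. The array `X_small`, the weights `w_demo` and the repetition count are assumptions chosen for illustration, and both functions are repeated here so that the block runs on its own.

```python
import timeit
import numpy as np

def predict(X, w):
    return X @ w

def non_vectorized_predict(X, w):
    # Explicit double loop over examples and features.
    y = []
    for i in range(X.shape[0]):
        y_hat_i = 0
        for j in range(X.shape[1]):
            y_hat_i += X[i][j] * w[j]
        y.append(y_hat_i)
    return np.array(y)

rng = np.random.default_rng(0)
X_small = np.column_stack((np.ones(80), 10 * rng.random(80)))
w_demo = np.array([5.0, 6.0])

repeats = 1000  # assumed repetition count
t_vec = timeit.timeit(lambda: predict(X_small, w_demo), number=repeats) / repeats
t_loop = timeit.timeit(lambda: non_vectorized_predict(X_small, w_demo), number=repeats) / repeats

print(f"Vectorized:     {t_vec:.3e} s per call")
print(f"Non-vectorized: {t_loop:.3e} s per call")
```

Averaging over many repetitions gives a usable per-call estimate even when a single call is too fast to time on its own.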


Comparing the run time of the vectorized and non-vectorized versions on a large dataset of 1 million data points.

In [13]:
def generate_data(n=1000_000):
    X = 10*np.random.rand(n)
    X = add_dummy_feature(X)
    noise = np.random.rand(n,)
    y = X@w + noise  # uses the global weight vector w defined earlier
    return X,y 

def preprocess(X,y):
    # train_test_split returns X_train, X_test, y_train, y_test (in that order)
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42) 
    return X_train, y_train, X_test, y_test

In [14]:
X, y = generate_data(n=1000_000)
X_train, y_train, X_test, y_test = preprocess(X, y)

# Vectorized version
start_time = time.time()
y_hat_vectorized = predict(X_train, w)
end_time = time.time()

print("Total time incurred in vectorized inference is : %0.6f s" %(end_time-start_time))

# Non-vectorized version
start_time = time.time()
y_hat_non_vectorized = non_vectorized_predict(X_train, w)
end_time = time.time()
print('Total time incurred in non-vectorized inference is: %0.6f s' %(end_time-start_time))

np.testing.assert_array_equal(y_hat_vectorized, y_hat_non_vectorized)

Total time incurred in vectorized inference is : 0.002018 s
Total time incurred in non-vectorized inference is: 1.079978 s


Note that non-vectorized inference takes orders of magnitude longer than vectorized inference: about 1.08 s versus 0.002 s in this run, which is roughly 500 times slower.
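To see how this gap behaves as the dataset grows, the following sketch times both versions at a few assumed sizes and records the ratio of the two run times. The sizes, the names `w_demo` and `speedups`, and the use of `time.perf_counter` are choices made for this illustration, and the measured ratios will vary from machine to machine. The results are compared with `np.testing.assert_allclose`, since floating-point rounding may differ slightly between the two code paths.

```python
import time
import numpy as np

def predict(X, w):
    return X @ w

def non_vectorized_predict(X, w):
    # Explicit double loop over examples and features.
    y = []
    for i in range(X.shape[0]):
        y_hat_i = 0
        for j in range(X.shape[1]):
            y_hat_i += X[i][j] * w[j]
        y.append(y_hat_i)
    return np.array(y)

rng = np.random.default_rng(0)
w_demo = np.array([5.0, 6.0])
speedups = {}

for n_rows in (1_000, 10_000, 100_000):   # assumed sizes
    X_demo = np.column_stack((np.ones(n_rows), 10 * rng.random(n_rows)))

    start = time.perf_counter()
    y_vec = predict(X_demo, w_demo)
    t_vec = time.perf_counter() - start

    start = time.perf_counter()
    y_loop = non_vectorized_predict(X_demo, w_demo)
    t_loop = time.perf_counter() - start

    # Tolerance-based comparison instead of exact equality.
    np.testing.assert_allclose(y_vec, y_loop)

    # Guard against a zero denominator on very coarse clocks.
    speedups[n_rows] = t_loop / max(t_vec, 1e-9)
    print(f"n={n_rows:>7}: loop/vectorized time ratio = {speedups[n_rows]:.0f}")
```

The vectorized version delegates the arithmetic to compiled NumPy routines, while the loop version pays Python interpreter overhead for every multiplication and addition.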