## Homework 01 

**David Clapp**

**DSCI 35600 - Machine Learning**

**Due Friday, Feb. 1**

In this assignment, you will create two different implementations of the predict and loss functions for a linear regression model, where the loss is the sum of squared errors (SSE). The first implementation will use lists and only tools that are found in base Python. The second implementation will make use of NumPy arrays. This assignment will give you experience writing Python code, working with NumPy, calculating loss for regression problems, and working with Scikit-Learn.

### Part 1: List Implementation

In the cell below, write a function called `predict1`. It should take two parameters, `beta` and `X`. 
* `beta` is intended to be a list of (not necessarily optimal) model parameters $b_0, b_1, b_2, ..., b_p$. 
* `X` will be a list of lists. Each sublist will represent feature values $x^{(1)}_i, x^{(2)}_i, ..., x^{(p)}_i$ for one of the $n$ observations. 

The function should return a list `y_hat` that contains the predicted $y$ values for each of the $n$ observations. These predicted values are calculated as follows: $\hat{y}_i = b_0 + b_1 x^{(1)}_i + b_2 x^{(2)}_i + ... + b_p x^{(p)}_i$. Pseudocode and some hints will be provided in the HW 01 supplement notebook.

In [1]:
def predict1(beta, X):
    # Initialize y_hat to be an empty list.
    y_hat = []
    # Get each row from X.
    for i in range(0, len(X)):
        # Make a copy of the current row.
        row = X[i].copy()
        # Insert a 1 at the start of the copied row.
        row.insert(0, 1)
        # Initialize the total of the sum of the products to 0.
        total = 0
        # Multiply each element in the copied row with the corresponding element in beta.
        for j in range(0, len(row)):
            # Add the product to the total.
            total = total + row[j]*beta[j]
        # Append the sum to the end of y_hat.
        y_hat.append(total)
    return y_hat

Two lists called `X_list` and `beta_list` are defined in the cell below. Pass these to the function `predict1` and print the results. 

In [2]:
X_list = [[7, 3, 8], [1, 5, 6], [3, 5, 9], [6, 1, 7], [2, 3, 1]]
beta_list = [2.4, 0.1, -0.4, 0.3]

y_hat = predict1(beta_list, X_list)
print(y_hat)

[4.3, 2.3, 3.4, 4.7, 1.7]
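As a cross-check on the loop-based version, the same formula $\hat{y}_i = b_0 + b_1 x^{(1)}_i + ... + b_p x^{(p)}_i$ can be sketched more compactly with `zip` and `sum`, still using only base Python. The name `predict1_zip` is hypothetical and not part of the assignment; it treats `beta[0]` as the intercept instead of inserting a 1 into each row.

```python
def predict1_zip(beta, X):
    """Return the predicted y value for each row of X (a list of lists)."""
    intercept, coefs = beta[0], beta[1:]
    # For each row, add the intercept to the sum of coefficient * feature products.
    return [intercept + sum(b * x for b, x in zip(coefs, row)) for row in X]


X_list = [[7, 3, 8], [1, 5, 6], [3, 5, 9], [6, 1, 7], [2, 3, 1]]
beta_list = [2.4, 0.1, -0.4, 0.3]

y_hat_zip = predict1_zip(beta_list, X_list)
# Round only for display; the underlying values are ordinary floats.
print([round(v, 4) for v in y_hat_zip])
```

Because no row is copied or modified, this variant leaves `X` untouched by construction.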


In the cell below, create a function called `sse1`. This function should take three parameters: `beta`, `X`, and `y`. 
* The parameters `beta` and `X` will be defined as they were in the function `predict1`. 
* `y` should be a list with $n$ entries containing the true $y$ values for each of the $n$ observations. 

This function should calculate and return the SSE loss for the model defined by `beta`, as calculated on the dataset given by `X` and `y`. Pseudocode is provided in the supplement.

In [3]:
def sse1(beta, X, y):
    # Initialize sse to 0.
    sse = 0
    # Calculate y_hat.
    y_hat = predict1(beta, X)
    # Add the squared error for each observation to sse.
    for i in range(0, len(y)):
        sse = sse + (y[i] - y_hat[i]) ** 2
    return sse

A list called `y_list` is defined in the cell below. Pass `beta_list`, `X_list`, and `y_list` to `sse1` and print the results.

In [4]:
y_list = [5, 2, 5, 4, 3]

sse1_results = sse1(beta_list, X_list, y_list)
print(sse1_results)

5.320000000000001
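The SSE loss is $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. A minimal base-Python sketch of that sum, written with a generator expression, is shown below. The name `sse1_zip` is hypothetical, and the prediction step is repeated inline only so the block runs on its own.

```python
def sse1_zip(beta, X, y):
    """Return the sum of squared errors for the model defined by beta."""
    # Predicted value for each row: intercept plus the weighted sum of features.
    y_hat = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in X]
    # Sum the squared differences between true and predicted values.
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))


X_list = [[7, 3, 8], [1, 5, 6], [3, 5, 9], [6, 1, 7], [2, 3, 1]]
beta_list = [2.4, 0.1, -0.4, 0.3]
y_list = [5, 2, 5, 4, 3]

sse_zip = sse1_zip(beta_list, X_list, y_list)
print(sse_zip)
```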


### Part 2: Array Implementation

In this part, you will be asked to implement the predict and loss functions of a linear regression model using NumPy arrays. Begin by importing `numpy` under the alias `np`.

In [5]:
import numpy as np

In the cell below, use the lists `X_list`, `y_list`, and `beta_list` to create arrays `X_array`, `y_array`, and `beta_array`. Print the shape of each of these arrays, along with some text explaining which shape is associated with which array.

In [6]:
X_array = np.array(X_list)
y_array = np.array(y_list)
beta_array = np.array(beta_list)
print("X_array: ", X_array.shape)
print("y_array: ", y_array.shape)
print("beta_array: ", beta_array.shape)

X_array:  (5, 3)
y_array:  (5,)
beta_array:  (4,)


In the cell below, write a function called `predict2`. It should take two parameters, `beta` and `X`. 
* `beta` is intended to be an array of (not necessarily optimal) model parameters $b_0, b_1, b_2, ..., b_p$. 
* `X` will be a 2D feature array with one row for each of the $n$ observations, and one column for each of the $p$ features. 

The function should return an array `y_hat` that contains the predicted $y$ values for each of the $n$ observations. This array should have the same shape as `y`, which is `(n,)`. **This function should make use of NumPy, and should not use any loops.** Pseudocode and some hints will be provided in the HW 01 supplement notebook.

In [7]:
def predict2(beta, X):
    # Make a copy of X.
    X_copy = X.copy()
    # Create a column of 1's with one entry per observation.
    ones = np.ones((X.shape[0], 1))
    # Insert the column of 1's at the start of X.
    X_copy = np.hstack((ones, X_copy))
    # Multiply the (n, p+1) feature matrix by beta as a (p+1, 1) column vector.
    y_hat = np.dot(X_copy, beta.reshape(-1, 1))
    # Flatten y_hat from shape (n, 1) to a 1D array of shape (n,).
    y_hat = y_hat.reshape(-1,)
    return y_hat

Pass `X_array` and `beta_array` to `predict2` and print the results. 

In [8]:
y_hat = predict2(beta_array, X_array)
print(y_hat)

[4.3 2.3 3.4 4.7 1.7]
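An alternative that avoids building the column of ones is to split `beta` into its intercept and its feature coefficients. The sketch below assumes `beta` has shape `(p+1,)` and `X` has shape `(n, p)`; the name `predict2_slice` is hypothetical.

```python
import numpy as np


def predict2_slice(beta, X):
    """Return predictions of shape (n,) without any explicit loops."""
    # beta[0] is the intercept; beta[1:] holds one coefficient per feature column.
    # X @ beta[1:] has shape (n,), and the scalar intercept broadcasts across it.
    return beta[0] + X @ beta[1:]


X_array = np.array([[7, 3, 8], [1, 5, 6], [3, 5, 9], [6, 1, 7], [2, 3, 1]])
beta_array = np.array([2.4, 0.1, -0.4, 0.3])

y_hat_slice = predict2_slice(beta_array, X_array)
print(y_hat_slice.shape)
print(y_hat_slice)
```

Since the result of a 2D-by-1D matrix product is already one-dimensional, no reshape is needed.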


In the cell below, create a function called `sse2`. This function should take three parameters: `beta`, `X`, and `y`. 
* The parameters `beta` and `X` will be defined as they were in the function `predict2`. 
* `y` should be an array with $n$ entries containing the true $y$ values for each of the $n$ observations. 

This function should calculate and return the SSE loss for the model defined by `beta`, as calculated on the dataset given by `X` and `y`. **This function should make use of NumPy, and should not use any loops.** Pseudocode is provided in the supplement.

In [9]:
def sse2(beta, X, y):
    # Calculate y_hat.
    y_hat = predict2(beta, X)
    # Calculate the squared errors.
    error = (y-y_hat)**2
    # Sum up the squared-error array and store it in sse.
    sse = np.sum(error)
    return sse

Pass `beta_array`, `X_array`, and `y_array` to `sse2` and print the results.

In [10]:
sse2_results = sse2(beta_array, X_array, y_array)
print(sse2_results)

5.320000000000001
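The sum of squared residuals is also the dot product of the residual vector with itself, which gives another loop-free way to write the loss. This is a sketch under the same shape assumptions as above; `sse2_dot` is a hypothetical name, and the prediction is computed inline so the block is self-contained.

```python
import numpy as np


def sse2_dot(beta, X, y):
    """Return the SSE as the dot product of the residual vector with itself."""
    # Residuals: true values minus predictions (intercept plus X times coefficients).
    residuals = y - (beta[0] + X @ beta[1:])
    # r @ r equals the sum of r_i ** 2 for a 1D array r.
    return residuals @ residuals


X_array = np.array([[7, 3, 8], [1, 5, 6], [3, 5, 9], [6, 1, 7], [2, 3, 1]])
y_array = np.array([5, 2, 5, 4, 3])
beta_array = np.array([2.4, 0.1, -0.4, 0.3])

sse_dot = sse2_dot(beta_array, X_array, y_array)
print(sse_dot)
```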


### Part 3: Minimizing SSE

Create an array called `beta_array2` with shape `(4,)`. The values in this array should be similar to those in `beta_array`, but each entry should be shifted upward or downward by 0.1. Pass this new array, along with `X_array` and `y_array` to the function `sse2`. 

Your goal is to find a new coefficient array (i.e. model) that produces a lower value for the loss. Set the values of `beta_array2` to try to get the loss as small as possible, considering only arrays of the form $\left[ (2.4 \pm 0.1), (0.1 \pm 0.1), (-0.4\pm 0.1), (0.3\pm 0.1) \right]$

In [11]:
beta_array2 = np.array([2.5, 0.2, -0.3, 0.2])
sse3_results = sse2(beta_array2, X_array, y_array)
print(sse3_results)

4.159999999999999
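Since each of the four entries can only move up or down by 0.1, there are just $2^4 = 16$ candidate arrays, so the search can be done exhaustively rather than by trial and error. The sketch below uses `itertools.product` to enumerate them; the helper `sse` and the variable names are hypothetical, and the loss is re-implemented inline so the block runs on its own.

```python
import itertools

import numpy as np

X_array = np.array([[7, 3, 8], [1, 5, 6], [3, 5, 9], [6, 1, 7], [2, 3, 1]])
y_array = np.array([5, 2, 5, 4, 3])
beta_array = np.array([2.4, 0.1, -0.4, 0.3])


def sse(beta, X, y):
    """Sum of squared errors for the linear model defined by beta."""
    residuals = y - (beta[0] + X @ beta[1:])
    return float(residuals @ residuals)


# Every combination of -0.1 / +0.1 shifts, one shift per coefficient.
shifts = list(itertools.product([-0.1, 0.1], repeat=len(beta_array)))
candidates = [beta_array + np.array(s) for s in shifts]
losses = [sse(b, X_array, y_array) for b in candidates]

# Pick the candidate with the smallest loss.
best_index = int(np.argmin(losses))
best_beta, best_sse = candidates[best_index], losses[best_index]
print(best_beta, best_sse)
```

If the search reports a loss below the one obtained from a hand-picked array, the hand-picked array was not the best of the 16 candidates.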


Replace the blanks in the code below to use Scikit-Learn to find the optimal parameter values. Print these. 

In [12]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_array, y_array)
print(model.intercept_)
print(model.coef_)

0.34582132564841395
[0.52305476 0.39193084 0.02161383]


In the cell below, create a single array called `beta_opt` that contains the optimal parameter values found by Scikit-Learn. Pass `beta_opt`, `X_array`, and `y_array` to the function `sse2`. 

In [13]:
beta_opt = np.array([model.intercept_, model.coef_[0], model.coef_[1], model.coef_[2]])
sse4_results = sse2(beta_opt, X_array, y_array)
print(sse4_results)

2.080691642651298
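As an independent check that does not rely on Scikit-Learn, the same least-squares problem can be solved with `np.linalg.lstsq` on a feature matrix that has a leading column of ones. This is a sketch for comparison only, not the method the assignment asks for; the names `A`, `beta_lstsq`, and `sse_lstsq` are hypothetical.

```python
import numpy as np

X_array = np.array([[7, 3, 8], [1, 5, 6], [3, 5, 9], [6, 1, 7], [2, 3, 1]])
y_array = np.array([5, 2, 5, 4, 3])

# Prepend a column of ones so the first fitted value plays the role of the intercept.
A = np.hstack((np.ones((X_array.shape[0], 1)), X_array))

# lstsq returns (solution, residuals, rank, singular values); only the solution is used.
beta_lstsq, *_ = np.linalg.lstsq(A, y_array, rcond=None)

# Compute the SSE of the fitted model directly from its residuals.
residuals = y_array - A @ beta_lstsq
sse_lstsq = float(residuals @ residuals)
print(beta_lstsq)
print(sse_lstsq)
```

The fitted coefficients and loss should agree with the Scikit-Learn values printed above up to floating-point rounding, and the loss should be no larger than that of any hand-picked coefficient array.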
