# Gradient Descent Implementation Challenge!!

## Use gradient descent to find the optimal parameters of a **multiple** regression model. (We only showed an implementation for a bivariate model during lecture.)

A note: Implementing gradient descent in any context is not trivial. In particular, the step where we calculate the gradient changes with the number of parameters that we're trying to optimize. You will need to research what the gradient of a multiple regression model looks like. This challenge is pretty open-ended but I hope it will be thrilling. Please work together, help each other, share resources and generally expand your understanding of gradient descent as you try to achieve this implementation.

## Suggestions:

Start off with a model that has just two $x$ variables. You can use any dataset that has at least two $x$ variables. Potential candidates might be the blood pressure dataset that we used during lecture on Monday: [HERE](https://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/excel/mlr02.xls) or any of the housing datasets. You would just need to select the two $x$ variables and the one $y$ variable that you most want to work with.

Use Sklearn to find the optimal parameters of your model first (like we did during the lecture), so that you can compare the parameter estimates of your gradient-descent linear regression to the estimates of OLS linear regression. If implemented correctly they should be nearly identical.

Becoming a Data Scientist is all about striking out into the unknown, getting stuck and then researching and fighting and learning until you get yourself unstuck. Work together! And fight to take your own learning-rate fueled step towards your own optimal understanding of gradient descent! 


In [9]:
# Incantations
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# To display all cell outputs instead of just last
# No more needing to always print()
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# OLS Solution

In [5]:
# Importing Monday's blood pressure dataset.
df = pd.read_excel('https://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/excel/mlr02.xls')
df = df.rename(index=str, columns={"X1": "y", "X2": "age", "X3": "weight"})
print(df.shape)
df.head()

*** No CODEPAGE record, no encoding_override: will use 'ascii'
(11, 3)


Unnamed: 0,y,age,weight
0,132,52,173
1,143,59,184
2,153,67,194
3,162,73,211
4,154,64,196


In [26]:
y = df.loc[:, ['y']].values
X = df.loc[:, ['age','weight']].values

model = LinearRegression()
model.fit(X, y)

beta_0 = model.intercept_
beta_i = model.coef_[0]

print("\nIntercept Value, beta_0: \n", beta_0)
print("\nSlope Coefficients, beta_i: \n", beta_i)
print(f'\nBlood pressure, y: \n{y}')
print(f'\nIndependent variables, X: \n{X}')

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)


Intercept Value, beta_0: 
 [30.99410295]

Slope Coefficients, beta_i: 
 [0.86141469 0.3348592 ]

Blood pressure, y: 
[[132]
 [143]
 [153]
 [162]
 [154]
 [168]
 [137]
 [149]
 [159]
 [128]
 [166]]

Independent variables, X: 
[[ 52 173]
 [ 59 184]
 [ 67 194]
 [ 73 211]
 [ 64 196]
 [ 74 220]
 [ 54 188]
 [ 61 188]
 [ 65 207]
 [ 46 167]
 [ 72 217]]


# Gradient descent

The gradient descent formula is given by:
![Gradient Descent formula](http://www.ryanleeallred.com/wp-content/uploads/2019/01/gradient-descent-formula.png)
Each new theta_j results from subtracting from the current theta_j a term composed of alpha (the learning rate) times the slope of the cost function J with respect to theta_j, evaluated at the current theta. Theta is the vector of coefficients that go into the linear regression: Theta = [beta_0, beta_i[0], ..., beta_i[total dimensions - 1]].

The slope is something we'll determine soon; it is based on the errors of the predictions made with the current theta.
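
For reference, one common way to write that slope for linear regression is sketched below. It assumes the cost J is half the mean squared error. The symbols $m$ (number of rows) and $x_{ij}$ (entry of the design matrix, with $x_{i0} = 1$ for the intercept column) are notation introduced here, not taken from the lecture.

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^2, \qquad \hat{y}_i = \sum_{j} \theta_j x_{ij}
$$

$$
\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right) x_{ij}
\quad\Longrightarrow\quad
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right) x_{ij}
$$

In matrix form the whole gradient is $\frac{1}{m} X^T (X\theta - y)$, so every theta_j can be updated at once.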

## Implemented part-by-part
I'll do this by parts first, to make sure that I understand what's going on before putting everything into a single function.

In [55]:
# theta_null is the starting point for the descent, in this case chosen
# as a rough guess in the neighborhood of the global minimum given by OLS.
theta_null = [60, 5, 5]
print(f'theta_null: \n{theta_null}')

theta_null: 
[60, 5, 5]


In [57]:
# I first normalize X

# StandardScaler subtracts mean, normalizes to stdev=1
scaler = StandardScaler() 
# I apply the normalization to X. I changed the type to prevent a 
# dtype change warning from StandardScaler
X_norm = scaler.fit_transform(X.astype('float64'))
print(f'X_norm: \n{X_norm}')

X_norm: 
[[-1.20301838 -1.33261043]
 [-0.39751912 -0.66630522]
 [ 0.52305147 -0.0605732 ]
 [ 1.21347941  0.96917122]
 [ 0.1778375   0.0605732 ]
 [ 1.32855074  1.51433004]
 [-0.97287574 -0.42401241]
 [-0.16737647 -0.42401241]
 [ 0.29290882  0.72687842]
 [-1.89344632 -1.69604964]
 [ 1.09840809  1.33261043]]


In [29]:
# I then add a column of 1s at the beginning, destined to later be
# multiplied by beta_0. I use NumPy's 'np.c_' indexing helper, which
# concatenates arrays along the second axis (columns) and creates such
# an axis if it doesn't exist (as happens here with the 1-D np.ones
# vector, whose length equals the number of rows in X_norm)
X_linalg = np.c_[np.ones(X_norm.shape[0]), X_norm]
print(f'X_linalg: \n{X_linalg}')

X_linalg: 
[[ 1.         -1.20301838 -1.33261043]
 [ 1.         -0.39751912 -0.66630522]
 [ 1.          0.52305147 -0.0605732 ]
 [ 1.          1.21347941  0.96917122]
 [ 1.          0.1778375   0.0605732 ]
 [ 1.          1.32855074  1.51433004]
 [ 1.         -0.97287574 -0.42401241]
 [ 1.         -0.16737647 -0.42401241]
 [ 1.          0.29290882  0.72687842]
 [ 1.         -1.89344632 -1.69604964]
 [ 1.          1.09840809  1.33261043]]


In [50]:
# I then make an array of predictions, y_hat, which for any 
# datapoint (row) in X_linalg is the result of multiplying all the terms
# by the regression equation coefficients in theta and adding up
# these products.  That is, the regression equation is evaluated at each
# row in X, which may contain multiple dimensions. The simple way to 
# calculate that is with the dot product. I'll include these calculations
# in a general function later, but for now I'll use the initialization
# values.

theta = theta_null
y_hat = np.dot(X_linalg, theta).reshape(-1,1)
print(f'y_hat: \n{y_hat}')

y_hat: 
[[47.32185592]
 [54.68087833]
 [62.31239135]
 [70.91325318]
 [61.19205351]
 [74.21440387]
 [53.01555927]
 [57.04305559]
 [65.09893621]
 [42.05252017]
 [72.15509261]]


In [58]:
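# The errors come from subtracting the actual y values from our
# predictions, y_hat. Note that y_hat was reshaped into a column vector
# above so that it matches y's shape (11, 1) and the subtraction happens
# elementwise; a flat (11,) array would broadcast to an (11, 11) matrix.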
errors = y_hat - y
print(f'errors: \n{errors}')

errors: 
[[-84.67814408]
 [-88.31912167]
 [-90.68760865]
 [-91.08674682]
 [-92.80794649]
 [-93.78559613]
 [-83.98444073]
 [-91.95694441]
 [-93.90106379]
 [-85.94747983]
 [-93.84490739]]
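
The part-by-part walkthrough stops at the errors. As a rough sketch of what the next step might look like, the cell below turns those errors into a gradient and takes a single update step. The data are copied from the outputs above. The learning rate `alpha = 0.1` and the names `gradient`, `theta_next`, `cost_before` and `cost_after` are assumptions made for illustration, and the cost is assumed to be half the mean squared error.

```python
import numpy as np

# Data copied from the cells above (age, weight -> blood pressure).
X = np.array([[52, 173], [59, 184], [67, 194], [73, 211], [64, 196],
              [74, 220], [54, 188], [61, 188], [65, 207], [46, 167],
              [72, 217]], dtype=float)
y = np.array([132, 143, 153, 162, 154, 168, 137, 149, 159, 128, 166],
             dtype=float).reshape(-1, 1)

# Standardize the way StandardScaler does (population std, ddof=0),
# then prepend the column of 1s for the intercept.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
X_linalg = np.c_[np.ones(X_norm.shape[0]), X_norm]

theta = np.array([60.0, 5.0, 5.0]).reshape(-1, 1)  # theta_null as a column
alpha = 0.1  # learning rate: an assumed value, not from the lecture
m = len(y)

errors = X_linalg @ theta - y            # same as y_hat - y above
gradient = X_linalg.T @ errors / m       # one partial derivative per theta_j
theta_next = theta - alpha * gradient    # a single gradient descent step

# Half mean squared error before and after the step.
cost_before = float((errors ** 2).sum() / (2 * m))
errors_next = X_linalg @ theta_next - y
cost_after = float((errors_next ** 2).sum() / (2 * m))

print(f'gradient: \n{gradient}')
print(f'theta_next: \n{theta_next}')
print(f'cost before: {cost_before}, cost after: {cost_after}')
```

Repeating the last three computations in a loop, feeding `theta_next` back in as `theta`, is what the full algorithm does.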


## Stretch Goals

If you happen upon the most useful resources for accomplishing this challenge first, I want you to spend time today studying other variations of Gradient Descent-Based Optimizers.

- Try to write a function that can perform gradient descent for arbitrarily large (in dimensionality) multiple regression models.
- Create a notebook for yourself exploring these topics
- How do they differ from the "vanilla" gradient descent we explored today?
- How do these different gradient descent-based optimizers seek to overcome the challenge of finding the global minimum among various local minima?
- Write a blog post that reteaches what you have learned about these other gradient descent-based optimizers.
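
For the first stretch goal, here is a minimal sketch of a function that runs plain batch gradient descent for any number of features. It is one possible approach under stated assumptions, not a reference solution: the function name `gradient_descent`, the defaults `alpha=0.1` and `n_iter=10_000`, the zero starting point, and the choice to standardize internally and then convert the coefficients back to the original scale are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iter=10_000):
    """Batch gradient descent for linear regression (hypothetical helper).

    X: (m, n) feature matrix WITHOUT the column of ones.
    y: (m,) or (m, 1) target values.
    Returns (intercept, coefficients) on the original feature scale.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    m = X.shape[0]

    # Standardize features so one learning rate suits every dimension.
    mean, std = X.mean(axis=0), X.std(axis=0)
    X_b = np.c_[np.ones(m), (X - mean) / std]
    theta = np.zeros((X_b.shape[1], 1))

    for _ in range(n_iter):
        errors = X_b @ theta - y               # predictions minus targets
        theta -= alpha * (X_b.T @ errors) / m  # step against the gradient

    # Undo the standardization to recover coefficients for the raw X.
    coefs = theta[1:, 0] / std
    intercept = theta[0, 0] - np.sum(coefs * mean)
    return intercept, coefs

# Usage with the blood pressure data from the OLS section.
X = np.array([[52, 173], [59, 184], [67, 194], [73, 211], [64, 196],
              [74, 220], [54, 188], [61, 188], [65, 207], [46, 167],
              [72, 217]], dtype=float)
y = np.array([132, 143, 153, 162, 154, 168, 137, 149, 159, 128, 166],
             dtype=float)

intercept, coefs = gradient_descent(X, y)

# Closed-form least squares via NumPy, for comparison.
beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(y)), X], y, rcond=None)

print(f'gradient descent: {intercept}, {coefs}')
print(f'least squares:    {beta}')
```

If the loop has converged, both lines should land close to the sklearn estimates printed in the OLS section. Because age and weight are strongly correlated in this dataset, convergence is slow along one direction, which is why the assumed iteration count is generous.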

[Overview of GD-based optimizers](http://ruder.io/optimizing-gradient-descent/)

[Siraj Raval - Evolution of Gradient Descent-Based Optimizers](https://youtu.be/nhqo0u1a6fw)