<a href="https://colab.research.google.com/github/extrajp2014/DS-Unit-2-Sprint-2-Linear-Regression/blob/master/module3-gradient-descent/Gradient_Descent_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Descent Implementation Challenge!!

## Use gradient descent to find the optimal parameters of a **multiple** regression model. (We only showed an implementation for a bivariate model during lecture.)

A note: Implementing gradient descent in any context is not trivial, particularly the step where we calculate the gradient will change based on the number of parameters that we're trying to optimize for. You will need to research what the gradient of a multiple regression model looks like. This challenge is pretty open-ended but I hope it will be thrilling. Please work together, help each other, share resources and generally expand your understanding of gradient descent as you try and achieve this implementation. 

## Suggestions:

Start off with a model that has just two $X$ variables You can use any datasets that have at least two $x$ variables. Potential candidates might be the blood pressure dataset that we used during lecture on Monday: [HERE](https://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/excel/mlr02.xls) or any of the housing datasets. You would just need to select from them the two $x$ variables and one $y$ variable that you want to work with. 

Use Sklearn to find the optimal parameters of your model first. (like we did during the lecture.) So that you can compare the parameter estimates of your gradient-descent linear regression to the estimates of OLS linear regression. If implemented correctly they should be nearly identical.

Becoming a Data Scientist is all about striking out into the unknown, getting stuck and then researching and fighting and learning until you get yourself unstuck. Work together! And fight to take your own learning-rate fueled step towards your own optimal understanding of gradient descent! 


In [1]:

from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import math
# !pip install --upgrade seaborn
import seaborn as sns
sns.__version__
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

pd.set_option('display.height', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)
pd.options.display.float_format = '{:,}'.format

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv') 
df.head().T

Unnamed: 0,0,1,2,3,4
Id,1,2,3,4,5
MSSubClass,60,20,60,70,60
MSZoning,RL,RL,RL,RL,RL
LotFrontage,65.0,80.0,68.0,60.0,84.0
LotArea,8450,9600,11250,9550,14260
Street,Pave,Pave,Pave,Pave,Pave
Alley,,,,,
LotShape,Reg,Reg,IR1,IR1,IR1
LandContour,Lvl,Lvl,Lvl,Lvl,Lvl
Utilities,AllPub,AllPub,AllPub,AllPub,AllPub


In [28]:
# "Multiple Variables" Gradient Descent
################################ 
# https://medium.com/we-are-orb/multivariate-linear-regression-in-python-without-scikit-learn-7091b1d45905

#1 normalize the features using mean normalization
temp = df[list(df.select_dtypes(include=[np.number]))]
temp.fillna(0, inplace=True)
temp = (temp - temp.mean())/temp.std()

#2 Create matrices and set hyperparameters 
# Set X, y, theta values for Gradient Descent - Multivariate
X = temp[list(temp.select_dtypes(include=[np.number]))]
ones = np.ones([X.shape[0],1])
X = np.concatenate((ones,X),axis=1)
y = temp.loc[:, ['SalePrice']].values
theta = np.zeros([1,X.shape[1]]) # set 0 for null theta

# set hyper parameters
alpha = 0.02
iters = 1000
# Other Important Variables
n = y.size
np.random.seed(42)

#3 computecost
# gradient descent to minimize this cost
def computeCost(X,y,theta):
    tobesummed = np.power(((X @ theta.T)-y),2)
    return np.sum(tobesummed)/(2 * len(X))

#4 gradient descent
# function from blog
def gradientDescent(X,y,theta,iters,alpha):
    cost = np.zeros(iters)
    for i in range(iters):
        theta = theta - (alpha/len(X)) * np.sum(X * (X @ theta.T - y), axis=0)
        cost[i] = computeCost(X, y, theta)
    return theta,cost
# function from lecture
def gradient_descent(x, y, theta, iterations, alpha):
  past_costs = []
  for i in range(iterations):
    prediction = np.dot(X, theta.T)
    error = prediction - y
    theta = theta - (alpha * (1/n)*np.dot(X.T, error))
    cost = np.dot(error.T, error)
    past_costs.append(cost)
  return theta, past_costs

# Results Comparison
theta1,cost1 = gradient_descent(X,y,theta,iters,alpha)
theta2,cost2 = gradientDescent(X,y,theta,iters,alpha)
print('np.array(cost1).shape:',np.array(cost1).shape)
print('cost2.shape:',cost2.shape)
print('theta1:',theta1)
print('theta2:',theta2)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)


np.array(cost1).shape: (1000,)
cost2.shape: (1000,)
theta1: [[-4.68732255e+08  1.00848822e+22  1.90597787e+23 ...  2.27149267e+23
  -1.78158168e+23  2.53397492e+24]
 [ 1.38479040e+22 -2.65065955e+35 -5.16553238e+36 ... -5.11693205e+36
   4.03532508e+36 -5.46719874e+37]
 [ 9.66804558e+20  5.07212632e+35 -1.90782521e+37 ... -5.09447486e+37
   3.73492160e+37 -6.12850225e+38]
 ...
 [-3.08349848e+23  6.77909688e+36  7.13730127e+37 ...  3.49945038e+36
  -8.68796100e+36 -1.06320417e+38]
 [ 2.16027066e+23 -4.79245051e+36 -4.83230169e+37 ...  2.00075961e+36
   2.82135392e+36  1.28021408e+38]
 [-3.95475743e+24  8.58539476e+37  9.55631300e+38 ...  1.52792906e+38
  -1.90524953e+38 -6.51087205e+37]]
theta2: [[ 3.17234823e-17 -1.48817558e-04 -6.31194202e-03 -6.23396943e-04
   2.56931521e-03  1.78165196e-02  5.40334264e-03  1.13426838e-02
  -3.22625946e-03  1.70721365e-03  1.00569322e-03 -1.98528472e-04
  -1.16362638e-03 -1.99449959e-04  5.99692489e-03  8.78815883e-03
  -3.43565435e-04  1.16805762e-0

In [29]:
model = LinearRegression()
model.fit(X, y)
# Assign coefficient and intercept to variables (beta_1, and beta_0)
beta_1 = model.coef_[0]
beta_0 = model.intercept_
print("beta_1: ", beta_1)
print("beta_0: ", beta_0)

beta_1:  [ 6.63031576e-08  2.77555756e-17  2.01429577e-16 -1.95121220e-17
 -9.40047820e-17 -1.11162570e-16 -1.46915064e-16  4.88343309e-16
 -1.67489583e-17 -8.01472459e-17  4.74027312e-03  1.67660735e-03
  4.59236750e-03 -4.55950836e-03 -7.06114493e-04 -7.97332727e-04
 -8.88115662e-05  9.59806218e-04  1.57887312e-16 -9.06016143e-18
  1.45853565e-17  2.53998453e-17  5.16820917e-17 -4.14748229e-17
 -7.49780566e-17  2.51369782e-16  1.77622515e-17 -6.00433706e-17
  3.58868329e-16 -3.99276737e-17 -5.22734855e-17  2.71053508e-16
 -2.43144207e-16  3.38271078e-17 -5.35595873e-17 -1.36175793e-16
 -2.08166817e-17  1.08420217e-18  1.00000000e+00]
beta_0:  [-6.63031576e-08]


## Stretch Goals

If you happen upon the most useful resources for accomplishing this challenge first, I want you to spend time today studying other variations of Gradient Descent-Based Optimizers. A good list of the most common optimizers can be found in the Keras Documentation: <https://keras.io/optimizers/>

- Try and write a function that can perform gradient descent for arbitarily large (in dimensionality) multiple regression models. 
- Create a notebook for yourself exploring the different gradient descent based optimizers.
- How do the above differ from the "vanilla" gradient descent we explored today?
- How do these different gradient descent-based optimizers seek to overcome the challenge of finding the global minimum among various local minima?
- Write a blog post that reteaches what you have learned about these other gradient descent-based optimizers.

[Overview of GD-based optimizers](http://ruder.io/optimizing-gradient-descent/)

[Siraj Raval - Evolution of Gradient Descent-Based Optimizers](https://youtu.be/nhqo0u1a6fw)