# Gradient Descent Implementation Challenge!!

## Use gradient descent to find the optimal parameters of a **multiple** regression model. (We only showed an implementation for a bivariate model during lecture.)

A note: Implementing gradient descent in any context is not trivial. In particular, the step where we calculate the gradient changes with the number of parameters that we're trying to optimize. You will need to research what the gradient of a multiple regression model looks like. This challenge is pretty open-ended, but I hope it will be thrilling. Please work together, help each other, share resources, and generally expand your understanding of gradient descent as you try to achieve this implementation.
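As a starting point for that research, here is a sketch of one common form of the gradient. It assumes a mean-squared-error cost with a conventional factor of $\frac{1}{2}$, $n$ samples, $p$ features, weights $w_0, \dots, w_p$, and a learning rate $\eta$. These symbols are assumptions chosen for illustration, not notation from the lecture.

$$J(w) = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \hat{y}_i = w_0 + \sum_{j=1}^{p} w_j x_{ij}$$

$$\frac{\partial J}{\partial w_0} = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right), \qquad \frac{\partial J}{\partial w_j} = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)x_{ij}$$

$$w_0 \leftarrow w_0 + \frac{\eta}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right), \qquad w_j \leftarrow w_j + \frac{\eta}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)x_{ij}$$

Each weight gets the same kind of update, so adding a feature only adds one more partial derivative of the same shape.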

## Suggestions:

Start off with a model that has just two $x$ variables. You can use any dataset that has at least two $x$ variables. Potential candidates are the blood pressure dataset that we used during lecture on Monday: [HERE](https://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/excel/mlr02.xls) or any of the housing datasets. You just need to select the two $x$ variables and the one $y$ variable that you most want to work with.

Use Sklearn to find the optimal parameters of your model first (like we did during the lecture), so that you can compare the parameter estimates of your gradient-descent linear regression to the estimates of OLS linear regression. If implemented correctly, they should be nearly identical.
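The comparison can be rehearsed on synthetic data before touching a real dataset. The sketch below is one possible setup, not the lecture's implementation: the function name `gradient_descent`, the synthetic data, and the hyperparameters are all assumptions, and NumPy's least-squares solver stands in for Sklearn as the OLS reference so that the block depends only on NumPy.

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_iter=5000):
    """Batch gradient descent for linear regression with any number of features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    # Prepend a column of ones so the intercept is learned like any other weight.
    Xb = np.column_stack([np.ones(n_samples), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        residuals = y - Xb @ w
        # Step against the gradient of the mean squared error.
        w += eta * (Xb.T @ residuals) / n_samples
    return w

# Hypothetical data: two features in [0, 1], known weights, a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 2))
true_w = np.array([3.0, 2.0, -1.0])  # intercept, then one weight per feature
y_demo = true_w[0] + X_demo @ true_w[1:] + rng.normal(0.0, 0.1, size=200)

w_gd = gradient_descent(X_demo, y_demo)

# OLS reference computed in closed form by least squares.
Xb_demo = np.column_stack([np.ones(len(X_demo)), X_demo])
w_ols, *_ = np.linalg.lstsq(Xb_demo, y_demo, rcond=None)

print("gradient descent:", w_gd)
print("least squares:   ", w_ols)
print("match:", np.allclose(w_gd, w_ols, atol=1e-4))
```

Because the intercept is folded into the weight vector, the same function handles two features or twenty without changes.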

Becoming a Data Scientist is all about striking out into the unknown, getting stuck and then researching and fighting and learning until you get yourself unstuck. Work together! And fight to take your own learning-rate fueled step towards your own optimal understanding of gradient descent! 


In [0]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zipfile
from sklearn.linear_model import LinearRegression

import cufflinks
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
plotly.tools.set_credentials_file(username='zangell', api_key='bs2CJxqOA2hlrJXKyeM9')

In [0]:
# gradient descent class
class GradientDescent(object):
    """Gradient Descent regressor.
    
    Parameters
    ----------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset
    precision : float
        Convergence tolerance (stored, but not currently used by fit)
    random_state : int
        Random number generator seed for random weight initialization
    
    Attributes
    ----------
    w_ : 1d-array
        Weights after fitting
    errors_ : list
        Mean squared error recorded at the start of each epoch
    
    """
    
    def __init__(self, eta=0.1, n_iter=500, precision=0.00001, random_state=1):
        self.eta=eta
        self.n_iter=n_iter
        self.precision = precision
        self.random_state=random_state
    
    def fit(self, X, y):
        """ Fit training data
        
        Parameters
        ---------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors where n_samples is the number of
            samples and n_features is the number of features
        y : array-like, shape = [n_samples]
            Target values
            
        Returns
        ------
        self : object
        
        """
        
        rgen = np.random.RandomState(self.random_state) # initialize random number gen
        
        self.w_ = rgen.normal(loc=0.0, # initialize random weights close to 0
                              scale=0.01,
                              size = 1 + X.shape[1])
        
        self.errors_ = []
        
        for _ in range(self.n_iter): # loop through for specified iterations

            # calculate and append the mean squared error for this iteration
            y_pred = self.predict(X)
            residuals = y - y_pred
            mse = np.sum(residuals**2) / len(y)
            self.errors_.append(mse)

            # step every weight against the gradient, using the same residuals
            self.w_[1:] += self.eta * np.dot(X.T, residuals) / len(y)
            self.w_[0] += self.eta * np.sum(residuals) / len(y)
              
              
        return self
    
    def net_input (self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]
    
    def predict(self, X): # can be altered in other implementations
        return self.net_input(X)
        

Let's test this class out on a Bike Sharing Dataset from UCI, found here https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

The data is in a .zip file format, so we'll need a little work here.

In [64]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip

--2019-01-17 00:42:55--  https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 279992 (273K) [application/zip]
Saving to: ‘Bike-Sharing-Dataset.zip.1’


2019-01-17 00:42:55 (870 KB/s) - ‘Bike-Sharing-Dataset.zip.1’ saved [279992/279992]



In [0]:
# extract every file in the archive into the working directory
with zipfile.ZipFile('Bike-Sharing-Dataset.zip', 'r') as zip_ref:
    zip_ref.extractall()

In [66]:
df_bike = pd.read_csv('hour.csv')
df_bike.head()

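| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |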


We will use linear regression to predict the hourly count of rented bikes (`cnt`) from the weather: feels-like temperature (`atemp`), humidity (`hum`), and wind speed (`windspeed`).

In [0]:
features = ['atemp', 'hum', 'windspeed']
target = 'cnt'

X = df_bike[features]
y = df_bike[target]

First, let's see what coefficients linear regression comes up with.

In [68]:
lr = LinearRegression()
lr.fit(X, y)
print ('Intercept:', lr.intercept_)
print ('Coefficients', lr.coef_)

Intercept: 158.6952193250187
Coefficients [ 409.22508774 -275.86319204   47.86020375]


Now let's see how our GradientDescent class does.

In [71]:
gd = GradientDescent(eta=0.75, n_iter=2000)
gd.fit(X,y)
gd.w_

array([ 158.69521696,  409.2250888 , -275.86319034,   47.8602078 ])

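We got pretty close: the gradient descent weights match the Sklearn intercept and coefficients to within about 1e-5. I have also run this a few times with different learning rates and iteration counts, and it converged in those runs as well.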
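The `GradientDescent` class accepts a `precision` argument, but its `fit` loop always runs for the full `n_iter` epochs. One way such a tolerance could be used is to stop as soon as the error stops improving. The sketch below illustrates that idea on synthetic data. The function name `fit_until_converged`, the stopping rule (change in mean squared error between epochs), and the data are assumptions made for illustration, not part of the class above.

```python
import numpy as np

def fit_until_converged(X, y, eta=0.1, max_iter=10000, precision=1e-8):
    """Batch gradient descent that stops once the MSE improves by less than `precision`."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    Xb = np.column_stack([np.ones(n_samples), X])  # intercept column first
    w = np.zeros(Xb.shape[1])
    errors = []
    for _ in range(max_iter):
        residuals = y - Xb @ w
        mse = np.mean(residuals ** 2)
        # Compare this epoch's error with the previous one before updating.
        converged = bool(errors) and abs(errors[-1] - mse) < precision
        errors.append(mse)
        if converged:
            break
        w += eta * (Xb.T @ residuals) / n_samples
    return w, errors

# Hypothetical data: two features in [0, 1] with a known linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 2))
y_demo = 3.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(0.0, 0.1, size=200)

w_tol, mse_history = fit_until_converged(X_demo, y_demo)
print("weights:", w_tol)
print("epochs run:", len(mse_history))
```

A tighter `precision` buys more accurate weights at the cost of more epochs, so it is worth trying a few values against the OLS estimates.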

Let's see how our errors look over training epochs.

In [72]:
epochs = np.arange(len(gd.errors_))

trace = go.Scatter(
    x = epochs,
    y = gd.errors_
)

layout = dict(title = 'MSE Across Training Periods',
              xaxis = dict(title = 'Training Period'),
              yaxis = dict(title = 'Mean Squared Error'),
              )

data = [trace]

fig = dict(data=data, layout=layout)

py.iplot(fig)

## Stretch Goals

If you find the resources you need quickly and finish this challenge early, I want you to spend time today studying other variations of gradient descent-based optimizers.

- Try to write a function that can perform gradient descent for arbitrarily large (in dimensionality) multiple regression models.
- Create a notebook for yourself exploring these topics.
- How do they differ from the "vanilla" gradient descent we explored today?
- How do these different gradient descent-based optimizers seek to overcome the challenge of finding the global minimum among various local minima?
- Write a blog post that reteaches what you have learned about these other gradient descent-based optimizers.
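As one entry point into those variations, here is a minimal sketch of gradient descent with classical momentum, applied to the same kind of linear regression problem. The function name `gd_momentum`, the momentum coefficient `beta`, and the synthetic data are assumptions chosen for illustration. The result is checked against NumPy's least-squares solution.

```python
import numpy as np

def gd_momentum(X, y, eta=0.1, beta=0.9, n_iter=1000):
    """Batch gradient descent with classical momentum for linear regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    Xb = np.column_stack([np.ones(n_samples), X])  # intercept column first
    w = np.zeros(Xb.shape[1])
    velocity = np.zeros_like(w)
    for _ in range(n_iter):
        # Gradient of the mean squared error (with the conventional 1/2 factor).
        grad = -(Xb.T @ (y - Xb @ w)) / n_samples
        # Keep a decaying running sum of past steps and move along it.
        velocity = beta * velocity - eta * grad
        w += velocity
    return w

# Hypothetical data: two features in [0, 1] with a known linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 2))
y_demo = 3.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(0.0, 0.1, size=200)

w_momentum = gd_momentum(X_demo, y_demo)

Xb_demo = np.column_stack([np.ones(len(X_demo)), X_demo])
w_ols, *_ = np.linalg.lstsq(Xb_demo, y_demo, rcond=None)
print("momentum:     ", w_momentum)
print("least squares:", w_ols)
```

Setting `beta=0` recovers the vanilla update, which makes it easy to compare how many iterations each version needs on the same data.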

[Overview of GD-based optimizers](http://ruder.io/optimizing-gradient-descent/)

[Siraj Raval - Evolution of Gradient Descent-Based Optimizers](https://youtu.be/nhqo0u1a6fw)