Logistic regression represents its model as an equation: <br/>
<img src='LR_equation.jpg' align='left' /> <br/>
<img src='http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1534281070/linear_vs_logistic_regression_edxw03.png' align='left' />

<img src="https://miro.medium.com/max/4048/1*9VZGNrRAsU6LcdZvZeKlmg.png" height="700" width="700"/>
<img src="https://miro.medium.com/max/4350/1*dE5mZ46yrWUzcwzz2ZzNqQ.png" height="700" width="700" />
<img src="https://miro.medium.com/max/3492/1*U-Lzf0oDxnaeqmlFhXyzVQ.png" height="700" width="700" />

New slope = old slope of the line of best fit − α × partial derivative of the cost function, evaluated at the current slope m

<img src="https://miro.medium.com/max/4692/1*4yCgvwtAAUPADwttYAniRA.png" height="700" width="700"/>

α (alpha) is the learning rate in the gradient descent algorithm. <br/>
If α is too large, the algorithm overshoots on each iteration (as shown in the left graph) and may never reach the minimum. Conversely, if α is too small, it takes too long to reach the minimum. α must therefore lie between these two extremes so that neither case occurs.
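
The effect of α can be seen on a toy problem. The sketch below is illustrative only: the cost function f(m) = m² (with derivative 2m), the starting point, and the three α values are assumptions chosen for the demonstration, not values taken from the figures.

```python
def gradient_descent(alpha, n_steps=20, start=1.0):
    """Run n_steps of gradient descent on the toy cost f(m) = m**2."""
    m = start
    for _ in range(n_steps):
        gradient = 2.0 * m          # derivative of m**2 at the current m
        m = m - alpha * gradient    # new m = old m - alpha * gradient
    return m

# Hypothetical step sizes chosen to show the three regimes
too_big = gradient_descent(alpha=1.1)      # overshoots further on every step
too_small = gradient_descent(alpha=0.001)  # barely moves from the start
moderate = gradient_descent(alpha=0.3)     # approaches the minimum at m = 0

print("alpha=1.1   -> m = %.4f" % too_big)
print("alpha=0.001 -> m = %.4f" % too_small)
print("alpha=0.3   -> m = %.4f" % moderate)
```

With these values the largest α moves away from the minimum at m = 0, the smallest leaves m close to its starting value of 1.0, and the moderate one ends very near 0.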

<img src="https://miro.medium.com/max/4610/1*7SO9EC_SqZusmvLyfvbepA.png" width="700" height="700"/>

<img src="https://miro.medium.com/max/3476/1*DDjCOEPSHLsU7tff7LmYUQ.png" />

# Stochastic Gradient Descent

Gradient descent is the process of minimizing a function by following the slope, or gradient, of
that function. In machine learning, we can use stochastic gradient descent, a technique that
evaluates and updates the coefficients on every iteration, to minimize the error of a model
on our training data.

This optimization algorithm shows each training instance to the model one at a time. The
model makes a prediction for the training instance, the error is calculated, and the model
is updated to reduce the error for the next prediction.

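This procedure can be used to find the set of coefficients in a model that results in the smallest
error for the model on the training data. On each iteration, the coefficients (b), called weights in
machine learning language, are updated using the equation: <br/><br/>
b = b − learning rate × error × x <br/>
Here error is the prediction error yhat − y for the current training instance, and x is the input value that the coefficient weights.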
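
As a minimal sketch of one such update, assume a simple linear model yhat = b0 + b1 × x and take error = yhat − y. The function name `sgd_step` and the sample values are hypothetical, chosen only to illustrate the equation.

```python
def sgd_step(b0, b1, x, y, learning_rate):
    """Apply one stochastic gradient descent update for yhat = b0 + b1 * x."""
    yhat = b0 + b1 * x                    # prediction for this training instance
    error = yhat - y                      # predicted minus expected
    b0 = b0 - learning_rate * error       # intercept: no input term
    b1 = b1 - learning_rate * error * x   # b = b - learning rate * error * x
    return b0, b1

# One illustrative instance: x = 2, y = 4, starting from zero coefficients
b0, b1 = sgd_step(0.0, 0.0, x=2.0, y=4.0, learning_rate=0.1)
print(b0, b1)
```

Starting from zero, the prediction is 0, so error = −4 and both coefficients move upward, toward values that reduce the error on this instance.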

Logistic regression uses gradient descent to update the coefficients. On each gradient descent iteration, the coefficients (b) are updated using the equation: <br/>
b = b + learning rate × (y − yhat) × yhat × (1 − yhat) × x
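
The arithmetic of a single update can be traced by hand. The sketch below assumes all coefficients start at 0.0, a learning rate of 0.3, and one training row with two inputs and class 0; the helper name `logistic_update` is hypothetical.

```python
from math import exp

def logistic_update(coef, row, l_rate):
    """Return new coefficients after one update on a single training row.

    row holds the input values followed by the class label (0 or 1).
    """
    *inputs, y = row
    z = coef[0] + sum(b * x for b, x in zip(coef[1:], inputs))
    yhat = 1.0 / (1.0 + exp(-z))              # logistic prediction
    step = l_rate * (y - yhat) * yhat * (1.0 - yhat)
    # Intercept gets the bare step; every other coefficient scales it by its input
    return [coef[0] + step] + [b + step * x for b, x in zip(coef[1:], inputs)]

row = [2.7810836, 2.550537003, 0]
new_coef = logistic_update([0.0, 0.0, 0.0], row, l_rate=0.3)
print(new_coef)
```

With zero coefficients the prediction is 0.5, so the step is 0.3 × (0 − 0.5) × 0.5 × 0.5 = −0.0375, and each coefficient moves by that step times its input.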

In [3]:
from math import exp

In [4]:
# Make a prediction with coefficients
def predict(row, coefficients):
    yhat = coefficients[0]
    for i in range(len(row)-1):
        yhat += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-yhat))

In [5]:
# test predictions
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]
coef = [-0.406605464, 0.852573316, -1.104746259]
for row in dataset:
    yhat = predict(row, coef)
    print("Expected=%.3f, Predicted=%.3f [%d]" % (row[-1], yhat, round(yhat)))

Expected=0.000, Predicted=0.299 [0]
Expected=0.000, Predicted=0.146 [0]
Expected=0.000, Predicted=0.085 [0]
Expected=0.000, Predicted=0.220 [0]
Expected=0.000, Predicted=0.247 [0]
Expected=1.000, Predicted=0.955 [1]
Expected=1.000, Predicted=0.862 [1]
Expected=1.000, Predicted=0.972 [1]
Expected=1.000, Predicted=0.999 [1]
Expected=1.000, Predicted=0.905 [1]


<img src='LR_equation.jpg' align='left' /> 

We can estimate the coefficient values for our training data using stochastic gradient descent.
Stochastic gradient descent requires two parameters:

- **Learning Rate**: Used to limit the amount each coefficient is corrected each time it is updated.
- **Epochs**: The number of times to run through the training data while updating the coefficients.

<img src="SGD.png" />

There is one coefficient to weight each input attribute, and these are updated in a
consistent way, for example: <br/>
**b1(t + 1) = b1(t) + learning rate × (y(t) − yhat(t)) × yhat(t) × (1 − yhat(t)) × x1(t)** <br/><br/>
The special coefficient at the beginning of the list, also called the intercept, is updated in a
similar way, except without an input as it is not associated with a specific input value: <br/>
**b0(t + 1) = b0(t) + learning rate × (y(t) − yhat(t)) × yhat(t) × (1 − yhat(t))**
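
One way to see where the yhat × (1 − yhat) factor comes from is to assume that each update minimizes half the squared error of the logistic prediction on a single instance. That loss is an assumption made here for illustration; the text above states only the update rules. Writing α for the learning rate and z for the weighted sum of inputs:

```latex
\begin{aligned}
\hat{y} &= \frac{1}{1 + e^{-z}}, \qquad z = b_0 + b_1 x_1 + \dots + b_n x_n, \qquad E = \tfrac{1}{2}\,(y - \hat{y})^2 \\
\frac{\partial \hat{y}}{\partial z} &= \hat{y}\,(1 - \hat{y}) \\
\frac{\partial E}{\partial b_i} &= -(y - \hat{y})\,\hat{y}\,(1 - \hat{y})\,x_i \\
b_i &\leftarrow b_i - \alpha\,\frac{\partial E}{\partial b_i} = b_i + \alpha\,(y - \hat{y})\,\hat{y}\,(1 - \hat{y})\,x_i
\end{aligned}
```

For the intercept b0 the input is effectively 1, which gives the second rule.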

In [6]:
# Estimate logistic regression coefficients using stochastic gradient descent
def coefficients_sgd(train, l_rate, n_epoch):
    # One coefficient per input plus the intercept at index 0
    coef = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0  # sum of squared errors over this epoch
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat  # expected minus predicted
            sum_error += error**2
            # The intercept update has no input term
            coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
            for i in range(len(row)-1):
                coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
    return coef

In [8]:
# Calculate coefficients
l_rate = 0.3
n_epoch = 100
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)

>epoch=0, lrate=0.300, error=2.217
>epoch=1, lrate=0.300, error=1.613
>epoch=2, lrate=0.300, error=1.113
>epoch=3, lrate=0.300, error=0.827
>epoch=4, lrate=0.300, error=0.623
>epoch=5, lrate=0.300, error=0.494
>epoch=6, lrate=0.300, error=0.412
>epoch=7, lrate=0.300, error=0.354
>epoch=8, lrate=0.300, error=0.310
>epoch=9, lrate=0.300, error=0.276
>epoch=10, lrate=0.300, error=0.248
>epoch=11, lrate=0.300, error=0.224
>epoch=12, lrate=0.300, error=0.205
>epoch=13, lrate=0.300, error=0.189
>epoch=14, lrate=0.300, error=0.174
>epoch=15, lrate=0.300, error=0.162
>epoch=16, lrate=0.300, error=0.151
>epoch=17, lrate=0.300, error=0.142
>epoch=18, lrate=0.300, error=0.134
>epoch=19, lrate=0.300, error=0.126
>epoch=20, lrate=0.300, error=0.119
>epoch=21, lrate=0.300, error=0.113
>epoch=22, lrate=0.300, error=0.108
>epoch=23, lrate=0.300, error=0.103
>epoch=24, lrate=0.300, error=0.098
>epoch=25, lrate=0.300, error=0.094
>epoch=26, lrate=0.300, error=0.090
>epoch=27, lrate=0.300, error=0.087
...

You can see how the error continues to drop even in the final epoch. We could probably train for
a lot longer (more epochs) or increase the amount by which the coefficients are corrected on each
update (higher learning rate).
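
The suggestion to train longer can be checked directly. The sketch below re-implements the training loop from this section without the per-epoch printing and compares the sum of squared errors in the last epoch of a short and a long run. The name `final_epoch_error` and the choice of 1000 epochs are assumptions for illustration.

```python
from math import exp

def predict(row, coefficients):
    """Logistic prediction; the last element of row is the class label."""
    yhat = coefficients[0]
    for i in range(len(row) - 1):
        yhat += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-yhat))

def final_epoch_error(train, l_rate, n_epoch):
    """Train with SGD and return the sum of squared errors of the last epoch."""
    coef = [0.0] * len(train[0])
    sum_error = 0.0
    for _ in range(n_epoch):
        sum_error = 0.0
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            sum_error += error ** 2
            step = l_rate * error * yhat * (1.0 - yhat)
            coef[0] += step                      # intercept update
            for i in range(len(row) - 1):
                coef[i + 1] += step * row[i]     # input coefficient update
    return sum_error

dataset = [[2.7810836, 2.550537003, 0],
           [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0],
           [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0],
           [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1],
           [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1],
           [7.673756466, 3.508563011, 1]]

error_100 = final_epoch_error(dataset, l_rate=0.3, n_epoch=100)
error_1000 = final_epoch_error(dataset, l_rate=0.3, n_epoch=1000)
print("100 epochs:  error=%.4f" % error_100)
print("1000 epochs: error=%.4f" % error_1000)
```

This measures error on the training data only, so a lower value here says nothing about how the model would perform on unseen data.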