This notebook accompanies a [Quora question](https://www.quora.com/unanswered/In-scikit-learn-what-is-the-difference-between-SGDClassifer-with-log-loss-and-logistic-regression) and aims to explain the difference between logistic regression and the SGD classifier in sklearn.

## Implementation of SGD Classifier and Logistic Regression

In [12]:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# Note that the iris dataset is available in sklearn by default.
# This data is also conveniently preprocessed.
iris = datasets.load_iris()
X = iris["data"]
Y = iris["target"]

numFolds = 10
kf = KFold(n_splits = numFolds, shuffle=True)

# These are class objects, not instances. For each class, estimate the
# mean accuracy through 10-fold cross-validation.
Models = [LogisticRegression, SGDClassifier]
# loss="log" selects logistic loss (spelled "log_loss" from scikit-learn 1.1 on).
params = [{}, {"loss": "log", "penalty": "l2"}]
for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf.split(X):

        train_X = X[train_indices, :]; train_Y = Y[train_indices]
        test_X = X[test_indices, :]; test_Y = Y[test_indices]

        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)
    accuracy = total / numFolds
    print("Accuracy score of {0}: {1}".format(Model.__name__, accuracy))

Accuracy score of LogisticRegression: 0.9533333333333334
Accuracy score of SGDClassifier: 0.6533333333333333


### Explanation
The two estimators are not equivalent and will not necessarily reach the same accuracy on the same data. In practice, you can narrow the gap by tuning the learning rate and the number of epochs of the SGD classifier.

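**These algorithms differ in how they fit the model: LogisticRegression uses a batch solver that works on the full training set at every iteration (such as liblinear or lbfgs, depending on the scikit-learn version), whereas SGDClassifier uses stochastic gradient descent and updates the weights one sample at a time. On a small dataset the batch solver typically converges more reliably and yields better results. However, as the size of the dataset increases, SGDClassifier should approach the accuracy of logistic regression. The hyperparameters of the two estimators also mean different things (for example, regularization is set through C in LogisticRegression and through alpha in SGDClassifier), so you should tune them separately.**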
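To make the contrast between the two update rules concrete, here is a minimal pure-Python sketch of full-batch gradient descent next to stochastic gradient descent for a one-feature logistic model. It is an illustration only: the function names, the hand-made toy data, and the fixed learning rate are assumptions made for this example, and plain batch gradient descent stands in for a full-batch solver rather than reproducing what sklearn runs internally.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(w, b, xs, ys):
    # Mean binary cross-entropy of the model p(y=1|x) = sigmoid(w*x + b).
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def batch_gd(xs, ys, lr=0.1, epochs=200):
    # One weight update per epoch, using the gradient averaged over ALL samples.
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def sgd(xs, ys, lr=0.1, epochs=200, seed=0):
    # One weight update per SAMPLE, visiting samples in a reshuffled order each epoch.
    rng = random.Random(seed)
    w = b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = sigmoid(w * xs[i] + b) - ys[i]
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b


# Hypothetical toy data (not iris): two overlapping classes on a line.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 0.0, 1.0, 1.5, 2.0, -0.2]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

print("initial loss :", log_loss(0.0, 0.0, xs, ys))
print("batch GD loss:", log_loss(*batch_gd(xs, ys), xs, ys))
print("SGD loss     :", log_loss(*sgd(xs, ys), xs, ys))
```

With a constant learning rate the stochastic version keeps jittering around the minimum instead of settling on it, which is one reason the number of passes and the learning rate matter more for SGD than for a batch solver.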

One way to get similar results in *sklearn* is to change the number of iterations.

The default SGDClassifier n_iter is 5 in the sklearn version used here, meaning it performs 5 * num_rows steps in weight space. The sklearn rule of thumb is roughly 1 million steps for typical data. For this example, just set it to 1000; it may reach the tolerance first. The accuracy is lower with SGDClassifier because it hits the iteration limit before reaching the tolerance, so you are effectively "early stopping".
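The arithmetic behind that rule of thumb can be sketched in a few lines. The helper name `suggested_epochs` is hypothetical, and the figure of 135 training rows assumes the 10-fold split of the 150-row iris data used in this notebook.

```python
import math


def suggested_epochs(n_train_samples, target_updates=10**6):
    # Smallest number of full passes so that passes * samples >= target_updates.
    return math.ceil(target_updates / n_train_samples)


n_samples = 150                   # rows in the iris dataset
n_train = n_samples * 9 // 10     # rows in each training split of 10-fold CV

print(5 * n_train)                # updates with 5 passes      -> 675
print(1000 * n_train)             # updates with 1000 passes   -> 135000
print(suggested_epochs(n_train))  # passes for ~1 million updates -> 7408
```

So 5 passes amount to only a few hundred weight updates on this dataset, far short of the rule of thumb, while 1000 passes get within an order of magnitude of it.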

## Implementation after changing the number of iterations

In [11]:
# Added n_iter here (later scikit-learn releases use max_iter instead).
params = [{}, {"loss": "log", "penalty": "l2", 'n_iter':1000}]

for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf.split(X):
        train_X = X[train_indices, :]; train_Y = Y[train_indices]
        test_X = X[test_indices, :]; test_Y = Y[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)

    accuracy = total / numFolds
    print("Accuracy score of {0}:{1}".format(Model.__name__, accuracy))


Accuracy score of LogisticRegression:0.9533333333333334
Accuracy score of SGDClassifier:0.9666666666666666
