#### Run the below code to import all libraries required to run sample code within this notebook

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import plotly.graph_objs as go
from plotly.offline import  init_notebook_mode, plot, iplot
import plotly.figure_factory as ff

init_notebook_mode(connected=True)


### Solution code

```python
# Just run above code
```


So far we have looked at gradient descent, an optimization method used to fit a machine learning model, using the simple example of linear regression. In this notebook we look at a variant of gradient descent called stochastic gradient descent. The topics we will cover are:

1)  What is stochastic gradient descent? 
2)  A simple comparison using sklearn
3)  Where is stochastic gradient descent used? 

Let's start with the first part.
    
## What is stochastic gradient descent? 

The goal of any optimization method like gradient descent is to find the minimum of the loss function. When we do this, we find the best parameters for our machine learning algorithm. Take gradient descent applied to linear regression as an example: we are trying to find the values of the slope and the intercept of the line that best fits the data.

\begin{equation} 
\hat y =  \theta_0 + \theta_1 x 
\end{equation}

where  $\theta_0 \text{ and } \theta_1$ are the parameters (intercept and slope) <br>
$x$ is our training data <br>
$\hat y$ is our model's prediction <br>

To find the optimum values of $\theta_0 \text{ and } \theta_1$ we use gradient descent, where we update both parameters by minimizing the cost function. To do this, we:

1) Calculate $\hat y$ <br>
2) Find the error using $\hat y$ and $y$, where $y$ is the target in our supervised dataset <br>
3) Calculate the cost function using the error <br>
4) Minimize the cost function using gradient descent <br>

When we carry out step 4) we do it over the whole dataset, so the update for $\theta_1$ is:


\begin{equation} 
\theta_1 =  \theta_1 - \alpha \big(\dfrac{1}{N}\sum_{i=1}^{N} \dfrac {\partial }{\partial \theta_1}C(\hat y_i - y_{i}) \big)
\end{equation}

The above equation can be read as

new_parameter = old_parameter - learning_rate * (average change in the cost function over all examples)

So we are calculating the change in the cost function and using it to update our parameter $\theta_1$. The parameter $\theta_0$ is updated in the same way.
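Steps 1) to 4) and the full-dataset update can be sketched in NumPy. This is only a minimal sketch: it assumes a squared-error cost $C = \frac{1}{2}(\hat y_i - y_i)^2$ (the text above does not fix a particular cost), and the function name `batch_gradient_step`, the learning rate and the toy data are all hypothetical.

```python
import numpy as np

def batch_gradient_step(theta0, theta1, x, y, alpha):
    """One full-batch update, assuming the cost C = 0.5 * (y_hat - y)**2."""
    y_hat = theta0 + theta1 * x      # step 1: predictions for all N points
    error = y_hat - y                # step 2: error against the targets
    grad0 = np.mean(error)           # average of dC/dtheta0 over all N examples
    grad1 = np.mean(error * x)       # average of dC/dtheta1 over all N examples
    # step 4: move both parameters against their averaged gradients
    return theta0 - alpha * grad0, theta1 - alpha * grad1

# Hypothetical toy data generated from y = 1 + 2x (no noise)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x

theta0, theta1 = 0.0, 0.0
for _ in range(5000):
    theta0, theta1 = batch_gradient_step(theta0, theta1, x, y, alpha=0.5)

print(theta0, theta1)   # should end up close to the intercept 1 and slope 2
```

Note that every single update touches all 50 points. That is cheap here, but it is exactly the part that becomes expensive as the dataset grows.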

For small datasets, this procedure from step 1) to step 4) works really well: we can minimize the cost function over the whole dataset. But when we start to get into large datasets, this whole approach breaks down. In such situations we can modify the gradient descent procedure: rather than computing the gradient of the cost function over the whole dataset before updating our parameters, we compute it for a single data point and update the parameters immediately.

Mathematically this would look like- 

\begin{equation} 
\theta_1 =  \theta_1 - \alpha \big( \dfrac {\partial }{\partial \theta_1}C(\hat y_i - y_{i}) \big)
\end{equation}

If you compare equations (2) and (3), you will see that setting $N=1$ in equation (2) gives equation (3).

This approach of using only 1 training example to update the weights is called stochastic gradient descent. 
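The single-example update can be sketched the same way. Again this is a rough illustration rather than a reference implementation: it assumes the squared-error cost $C = \frac{1}{2}(\hat y_i - y_i)^2$, and `sgd_step`, the learning rate and the noise-free toy data are hypothetical choices.

```python
import numpy as np

def sgd_step(theta0, theta1, x_i, y_i, alpha):
    """Update from a single example (x_i, y_i), assuming C = 0.5 * (y_hat - y)**2."""
    error = (theta0 + theta1 * x_i) - y_i
    # no sum and no 1/N: the gradient comes from this one example only
    return theta0 - alpha * error, theta1 - alpha * error * x_i

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x                    # hypothetical data from y = 1 + 2x

theta0, theta1 = 0.0, 0.0
for epoch in range(200):
    for i in rng.permutation(len(x)):    # visit the examples in random order
        theta0, theta1 = sgd_step(theta0, theta1, x[i], y[i], alpha=0.1)

print(theta0, theta1)   # should end up close to the intercept 1 and slope 2
```

The parameters are updated 50 times per pass over the data instead of once, and each update needs only one row of the dataset.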

Providing worked examples of stochastic gradient descent with traditional algorithms is a bit harder. Thankfully, scikit-learn gives us a ready-made one, which is the content of our next section.

## A simple comparison using sklearn

Sklearn provides us with a stochastic gradient descent classifier and regressor (`SGDClassifier` and `SGDRegressor`). 
In this section we are going to use the iris dataset, train a logistic regression model on it, and then compare it to the SGDClassifier provided by sklearn. Let us start by importing the data and creating a training and a testing set. 


In [2]:
# import some data to play with
iris = datasets.load_iris()

# we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

### Solution code

```python
# Just run above code
```

When you are generating training and testing sets, it's always a good idea to use the train_test_split function provided by sklearn. It shuffles the data by default and gives you control over the proportion of test data. Not to mention, it saves you from writing ten or more lines of code! 
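To see the shuffling and the proportion control in isolation, here is a small sketch on made-up data. The arrays `X_demo` and `y_demo` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, labels in sorted order
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)   # the split sizes follow test_size
print(y_te)                     # test labels come from shuffled positions
```

Fixing `random_state` makes the split reproducible, which is why the notebook passes `random_state=42` above.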


Once we have the train and test sets, we fit the model to the training data. We then predict the output on the test set and compute the accuracy score of the model. We want the accuracy so that we can compare it with the accuracy of the SGDClassifier. 

In [3]:
# we are fitting the data using a logistic regression model  
log_clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
log_clf.fit(x_train, y_train)
log_pred = log_clf.predict(x_test)
accuracy_log = np.round(100*(accuracy_score(y_test, log_pred)),2)
print("accuracy of the logistic regression model is : {}" .format(accuracy_log))
      


accuracy of the logistic regression model is : 83.02


### Solution code

```python
# Just run above code
```

We find the accuracy to be ~83%. Now let's fit the SGDClassifier to the same data and see what we get.


In [4]:
# loss="log" gives logistic regression; scikit-learn 1.1 and later name this loss "log_loss"
sgd_clf = SGDClassifier(loss="log", penalty="l2", max_iter=100)
sgd_clf.fit(x_train,y_train)
sgd_pred = sgd_clf.predict(x_test)
accuracy_sgd = np.round(100*(accuracy_score(y_test, sgd_pred)),2)
print("accuracy of the SGD classifier is : {}" .format(accuracy_sgd))


accuracy of the SGD classifier is : 83.02


### Solution code

```python
# Just run above code
```

You will get a different accuracy every time you run this cell, because SGDClassifier shuffles the training data randomly and no random_state is set. So let us run it a thousand times and see what happens.

In [5]:
def sgdclassifier(x_data, y_data, x_test_data, y_test_data):
    """Fit a fresh SGDClassifier and return its test accuracy in percent."""
    # no random_state is set, so every call shuffles the data differently
    sgd_clf = SGDClassifier(loss="log", penalty="l2", max_iter=100, tol=0.19)
    sgd_clf.fit(x_data, y_data)
    sgd_pred = sgd_clf.predict(x_test_data)
    accuracy_sgd = np.round(100*(accuracy_score(y_test_data, sgd_pred)),2)
    return accuracy_sgd

# repeat the fit 1000 times and collect the accuracies
accuracy_values = []
for _ in range(1000):
    accuracy_values.append(sgdclassifier(x_train, y_train, x_test, y_test))
    

hist_data = [accuracy_values]
group_labels = ['Distribution of Accuracy']


fig = ff.create_distplot(hist_data, group_labels)
fig.layout.xaxis.range=[40,90]
fig.layout.yaxis.range=[0,0.5]

iplot(fig, filename='Distribution of accuracies')


### Solution code

```python
# Just run above code
```

You can see we have a nicely spread out distribution with a peak at around 67% accuracy. This distribution will change if you run it again. All of this is due to the small change that we made in how we update our weights, i.e. the effect of using equation (3) rather than equation (2). 

The accuracy keeps changing because we update the weights after every single example, and the examples are visited in random order. If we were to plot the value of the cost function each time we updated the weights, we would find a lot of variation in the cost over the iterations. It settles down to a stable value, but only after a lot of variation. This is where stochastic gradient descent gets its name: the variation in the cost function is stochastic. 
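The variation in the cost can be made visible with a small experiment. The sketch below is an illustration under assumptions, not the notebook's own method: it uses a hypothetical noisy line, the squared-error cost $\frac{1}{2}(\hat y - y)^2$, and a helper `run` written only for this comparison. It records the full-dataset cost after every update, for full-batch and for single-example updates.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=200)   # hypothetical noisy line

def cost(theta0, theta1):
    """Mean of 0.5 * (y_hat - y)**2 over the whole dataset."""
    return 0.5 * np.mean((theta0 + theta1 * x - y) ** 2)

def run(alpha, n_updates, stochastic):
    """Return the cost after every update, starting from theta0 = theta1 = 0."""
    theta0, theta1 = 0.0, 0.0
    history = [cost(theta0, theta1)]
    for _ in range(n_updates):
        # one random example for SGD, every example for full-batch descent
        idx = rng.integers(len(x), size=1) if stochastic else np.arange(len(x))
        error = theta0 + theta1 * x[idx] - y[idx]
        theta0 -= alpha * np.mean(error)
        theta1 -= alpha * np.mean(error * x[idx])
        history.append(cost(theta0, theta1))
    return np.array(history)

batch_costs = run(alpha=0.1, n_updates=300, stochastic=False)
sgd_costs = run(alpha=0.1, n_updates=300, stochastic=True)

# Count the updates after which the full-dataset cost went up
batch_increases = int(np.sum(np.diff(batch_costs) > 1e-12))
sgd_increases = int(np.sum(np.diff(sgd_costs) > 1e-12))
print("batch increases:", batch_increases)
print("sgd increases:  ", sgd_increases)
```

With this step size the full-batch cost only ever goes down, while the single-example cost goes up after some updates and down after others. Plotting `batch_costs` and `sgd_costs` with matplotlib shows a smooth curve next to a jagged one.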

Of course, one might wonder why we use stochastic gradient descent in the first place. Where is this approach useful? That's the topic of our next section.

## Where is stochastic gradient descent useful? 


The real strength of stochastic gradient descent lies in two areas:

1) Hyper parameter tuning
2) Dealing with large data and sparse data



Below is an example of 1). We have taken the same SGDClassifier model as above, changed a number of hyperparameters, and run the model again. The axes of the plots above and below are the same. What you can see is that although the most probable accuracy is still around 67%, we gain access to higher-accuracy models when we change hyperparameters like the learning rate and regularization. While you can change many parameters when doing gradient descent, you generally have more control over how you minimize the cost function when you use stochastic gradient descent. 


In [6]:
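def sgdclassifier(x_data, y_data, x_test_data, y_test_data):
    # changed from the previous run: L1 penalty, more iterations, stronger regularization (alpha)
    # note: epsilon only affects the huber and epsilon-insensitive losses, so it has no effect with loss="log"
    sgd_clf = SGDClassifier(loss="log", penalty="l1", max_iter=1000, tol=0.19, alpha=0.01, epsilon=0.5, shuffle=True)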
    sgd_clf.fit(x_data,y_data)
    sgd_pred = sgd_clf.predict(x_test_data)
    accuracy_sgd = np.round(100*(accuracy_score(y_test_data, sgd_pred)),2)
    return accuracy_sgd

accuracy_values = []
for i in range(0,1000): 
    accuracy_values.append(sgdclassifier(x_train,y_train, x_test, y_test))
    

hist_data = [accuracy_values]
group_labels = ['Distribution of Accuracy']


fig = ff.create_distplot(hist_data, group_labels)
fig.layout.xaxis.range=[40,90]
fig.layout.yaxis.range=[0,0.5]

iplot(fig, filename='Distribution of accuracies')

### Solution code

```python
# Just run above code
```


This control over the minimization matters a lot when you are dealing with large datasets that can only be fit using complex models like neural networks. Neural networks can have millions of parameters, so optimization can be really hard. With such large datasets you cannot use plain gradient descent, since you cannot compute the gradient over all your data at once. In such situations, you use either stochastic gradient descent or a variation of it, usually called mini-batch gradient descent, where you compute the gradient on a small number of data points, update the parameters, and repeat until you have gone over the whole dataset. 
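The variation that works on a small group of data points at a time can be sketched as follows. This is a minimal illustration that assumes the squared-error cost $\frac{1}{2}(\hat y - y)^2$ on a straight-line model. The function `minibatch_epoch`, the batch size of 10 and the toy data are hypothetical.

```python
import numpy as np

def minibatch_epoch(theta0, theta1, x, y, alpha, batch_size, rng):
    """One pass over the data in shuffled batches, assuming C = 0.5 * (y_hat - y)**2."""
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]        # a small group of points
        error = theta0 + theta1 * x[idx] - y[idx]
        # average the gradient over this group only, then update right away
        theta0 = theta0 - alpha * np.mean(error)
        theta1 = theta1 - alpha * np.mean(error * x[idx])
    return theta0, theta1

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = 1.0 + 2.0 * x                                    # hypothetical data from y = 1 + 2x

theta0, theta1 = 0.0, 0.0
for _ in range(500):
    theta0, theta1 = minibatch_epoch(theta0, theta1, x, y,
                                     alpha=0.5, batch_size=10, rng=rng)

print(theta0, theta1)   # should end up close to the intercept 1 and slope 2
```

Setting `batch_size=1` recovers the single-example update, and setting it to `len(x)` recovers the full-dataset update, so the batch size is the knob that trades gradient noise against the cost of each update.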

One place where stochastic gradient descent is especially useful with non-neural-network models is when we have memory constraints. Suppose you have a dataset that is 5GB and only 4GB of RAM. You can still fit a model with SGDClassifier or SGDRegressor by feeding the data in chunks through their `partial_fit` method, so the whole dataset never has to be in memory at once, the way it normally would with gradient descent based optimization. SGDClassifier also provides different loss options for different models. For example, we used "log" loss, which means that the SGDClassifier is a logistic regression model optimized using stochastic gradient descent. We can build a linear support vector machine using "hinge" loss in the SGDClassifier. There are other options as well, so check out the documentation for SGDClassifier and SGDRegressor to get a better idea of which algorithms can be used. 
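Chunk-by-chunk training can be sketched with the `partial_fit` method of SGDClassifier. The sketch below is an illustration under assumptions: the iris data stands in for a dataset too large for memory, the chunk size of 30 and the 20 passes are arbitrary, and in a real setting each chunk would be read from disk instead of sliced from an array. It uses the "hinge" loss mentioned above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
classes = np.unique(y)                  # partial_fit needs the full label set up front

# shuffle once so that every chunk contains a mix of classes
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
X, y = X[order], y[order]

sgd_clf = SGDClassifier(loss="hinge", random_state=0)
chunk_size = 30                         # hypothetical chunk size
for epoch in range(20):
    for start in range(0, len(X), chunk_size):
        # stand-in for loading the next chunk from disk
        chunk = slice(start, start + chunk_size)
        sgd_clf.partial_fit(X[chunk], y[chunk], classes=classes)

print(sgd_clf.coef_.shape)              # one weight vector per class
```

Each call to `partial_fit` continues from the current weights, so the model only ever needs one chunk in memory at a time.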

To understand the full applicability of stochastic gradient descent we need to see training examples using neural networks, which is out of the scope of this section. Even so, you can see that SGD can give you access to better-optimized models. As we close this notebook, remember that it's not just the accuracy you get with a model that is important; how you got it matters as well. 



