### Stochastic Gradient Descent
### Divya Handa
## GitHub link - https://github.com/divyahanda219/SVM


## Stochastic Gradient Descent (SGD) is an iterative method for minimizing a loss function. The term gradient descent literally translates to slope descent: the algorithm repeatedly steps downhill until it reaches a low point of the function. SGD does this by starting from initial weights and updating them at each step, using the gradient estimated from a single training sample (or a small batch) rather than from the whole dataset.
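
To make the update rule concrete, here is a minimal pure-Python sketch of SGD on the L2-regularized hinge loss. It is an illustration only, not scikit-learn's implementation: the function name `sgd_hinge`, the fixed learning rate `lr`, and the values of `alpha`, `epochs` and `seed` are all assumptions chosen for readability, and labels are encoded as -1/+1.

```python
import random

def sgd_hinge(X, y, lr=0.1, alpha=0.0001, epochs=50, seed=0):
    """Fit a linear classifier with plain SGD on the L2-regularized hinge loss.

    Labels in y must be -1 or +1. All hyperparameters are illustrative.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])          # weights start at zero
    b = 0.0                        # intercept
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)         # visit the samples in random order
        for i in order:
            xi, yi = X[i], y[i]
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The gradient of the L2 penalty shrinks the weights slightly.
            w = [wj - lr * alpha * wj for wj in w]
            if margin < 1:
                # The sample violates the margin, so the hinge gradient is -yi * xi.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def score(x):
    """Signed score w . x + b of a single sample under the fitted model."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

X = [[0.0, 0.0], [1.0, 1.0]]
y = [-1, 1]                        # -1/+1 encoding of the two classes
w, b = sgd_hinge(X, y)
print(score([2.0, 2.0]) > 0)       # is the new point on the positive side?
```

Each update looks at a single sample, which is what distinguishes SGD from batch gradient descent.
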
### Here, we first create two arrays: X of shape [n_samples, n_features], which holds the training samples, and y of shape [n_samples], which holds the class labels. We use SGDClassifier, which is a linear classifier. Setting the loss function to 'hinge' yields a linear SVM. Other loss options are available, e.g. 'log' and 'huber'; see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html. For now, we use 'hinge' and set the regularization term (the penalty) to 'l2', which is the standard regularizer for linear SVM models. The number of epochs (max_iter) is set to 5 for this example; for a large dataset, it can be increased accordingly. We then fit the model.

In [67]:
from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
clf.fit(X, y)   




SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=5,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

### Now, we can use the model to predict new values. 

In [63]:
clf.predict([[2., 2.]])

array([1])

### Use coef_ and intercept_ to determine the model parameters.

In [64]:
clf.coef_                                         


array([[9.91080278, 9.91080278]])

In [68]:
clf.intercept_ 

array([-9.97004991])
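
As a rough illustration of how these two parameters define the classifier, the sketch below recomputes a prediction by hand as the sign of w . x + b. The numbers are copied from the outputs printed above; the cell numbers suggest they may come from different runs of a randomized fit, so treat them as illustrative. The helper names `linear_score` and `predict` are hypothetical.

```python
# Values copied from the printed clf.coef_ and clf.intercept_ outputs.
# They may come from different fits, so they are illustrative only.
w = [9.91080278, 9.91080278]
b = -9.97004991

def linear_score(x):
    """Return w . x + b for a single sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    """Assign class 1 if the score is positive, otherwise class 0."""
    return 1 if linear_score(x) > 0 else 0

print(predict([2.0, 2.0]))
print(predict([0.0, 0.0]))
```
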

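### Let us now compute the decision function value for our sample point [2, 2]. This is the signed score w . x + b, which is proportional to the distance from the point to the hyperplane (dividing by the norm of w gives the geometric distance). The sign tells us on which side of the hyperplane the point lies, and therefore the direction of classification: a positive value means the sample is assigned to the positive class (here, class 1).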

In [65]:
clf.decision_function([[2., 2.]])

array([29.65318117])
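
The value returned by decision_function is a raw score, not a Euclidean distance. As a small sketch, assuming the coefficients and score printed in this notebook belong to the same fit, the geometric distance can be recovered by dividing the score by the length of the weight vector:

```python
import math

w = [9.91080278, 9.91080278]   # clf.coef_[0] as printed earlier
score = 29.65318117            # decision_function value for [2., 2.] as printed above

norm_w = math.sqrt(sum(wi * wi for wi in w))   # Euclidean length of w
distance = score / norm_w                      # signed geometric distance to the hyperplane
print(round(distance, 3))
```
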

### We created a linear SVM and computed the decision score of a sample point. Now, we will fit a logistic regression model by setting loss='log' (this option is named 'log_loss' in scikit-learn 1.1 and later). With this loss the model can output class probabilities, so it takes into account the uncertainty of our prediction and gives us a more nuanced view of the model's output.

In [66]:
clf = SGDClassifier(loss="log", max_iter=5).fit(X, y)
clf.predict_proba([[1., 1.]])                      




array([[4.97248476e-07, 9.99999503e-01]])

### The clf.predict_proba function returns the estimated probability P(y|x) of each class for our sample point [1, 1]: the first entry is the probability of class 0 and the second is the probability of class 1.
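
For a binary model trained with the log loss, these probabilities are the logistic (sigmoid) function applied to the decision score. The sketch below shows that relationship in plain Python. The helper names are hypothetical, and the score `z` is not taken from the notebook: it is back-computed from the probabilities printed above.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def proba_from_score(z):
    """Return [P(y=0|x), P(y=1|x)] for a binary logistic model with score z."""
    p1 = sigmoid(z)
    return [1.0 - p1, p1]

# Probabilities printed by predict_proba for the sample [1., 1.].
p0, p1 = 4.97248476e-07, 9.99999503e-01

# Implied decision score: the log-odds of class 1 (back-computed, not from the notebook).
z = math.log(p1 / p0)
print(proba_from_score(z))
```

A score of 0 corresponds to a point on the decision boundary and gives a probability of 0.5 for each class.
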