## Stochastic Gradient Descent

We compute the loss and gradient from only **one sample** at a time. (Not recommended, because the updates are noisy and unstable.)

$$\frac{\partial J}{\partial \theta_j} = (h^{(i)}-y^{(i)})x_j^{(i)}$$

## Mini-Batch Gradient Descent

We compute the loss and gradient from **a subset of samples**. (Recommended; this is the standard in deep learning.)

$$\frac{\partial J}{\partial \theta_j} = \sum_{i \in \mathcal{B}}(h^{(i)}-y^{(i)})x_j^{(i)}$$

where $\mathcal{B}$ is the set of sample indices in the current mini-batch.
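
To make the two gradient formulas above concrete, here is a minimal NumPy sketch. The toy data and the helper names `grad_single` and `grad_batch` are my own assumptions for illustration; they are not part of the class used later.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # 6 toy samples, 3 features (not the diabetes data)
y = rng.normal(size=6)
theta = np.zeros(3)

def grad_single(x_i, y_i, theta):
    """Gradient from one sample: (h - y) * x, shape (n,)."""
    return (x_i @ theta - y_i) * x_i

def grad_batch(X_b, y_b, theta):
    """Gradient summed over a batch of rows: X^T (h - y), shape (n,)."""
    return X_b.T @ (X_b @ theta - y_b)

g_sto  = grad_single(X[0], y[0], theta)    # stochastic: sample 0 only
g_mini = grad_batch(X[:3], y[:3], theta)   # mini-batch: samples 0, 1, 2

# the mini-batch gradient is just the sum of the per-sample gradients
g_sum = sum(grad_single(X[i], y[i], theta) for i in range(3))
print(np.allclose(g_mini, g_sum))
```

The matrix form `X_b.T @ (...)` is the same sum written without a Python loop, which is why one function can serve all three methods.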

In [1]:
#experiment tracking
import mlflow
mlflow.set_tracking_uri("http://la.cs.ait.ac.th")
# mlflow.create_experiment(name="chaky-diabetes-example")  #create it first if you haven't created it yet
mlflow.set_experiment(experiment_name="chaky-diabetes-example")

<Experiment: artifact_location='mlflow-artifacts:/307462513559759924', creation_time=1689306072647, experiment_id='307462513559759924', last_update_time=1689306072647, lifecycle_stage='active', name='chaky-diabetes-example', tags={}>

In [2]:
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
from time import time

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
m = X.shape[0]  #number of samples
n = X.shape[1]  #number of features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# alternatively, np.insert can prepend the column of ones in one call:
# X_train = np.insert(X_train, 0, 1, axis=1)
intercept = np.ones((X_train.shape[0], 1))
X_train   = np.concatenate((intercept, X_train), axis=1)
intercept = np.ones((X_test.shape[0], 1))
X_test    = np.concatenate((intercept, X_test), axis=1)

Here, I want to demonstrate some techniques:
- **Class** - it's much better to write your code as a class, since you can modularize it for future use, so please get comfortable with it
- **Early stopping** - it's expensive to always run the model for a fixed number of iterations; instead, we can stop when the validation loss stops improving
- **Cross validation** - we rarely write cross validation from scratch, so here I show you how to do it
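
As a rough sketch of the early-stopping idea, the loop below stops as soon as the loss is (numerically) the same as in the previous epoch. The function name `run_with_early_stopping` and the loss values are hypothetical, made up only to show the control flow.

```python
import numpy as np

def run_with_early_stopping(val_losses):
    """Return how many epochs run before the loss stops changing."""
    old_loss = np.inf              # nothing seen yet, so any real loss differs from it
    epochs_run = 0
    for new_loss in val_losses:    # stands in for "train one epoch, then validate"
        epochs_run += 1
        if np.allclose(new_loss, old_loss):
            break                  # no change since last epoch: stop early
        old_loss = new_loss
    return epochs_run

hypothetical_losses = [10.0, 6.0, 4.5, 4.5, 4.5, 4.5]
stopped_at = run_with_early_stopping(hypothetical_losses)
print(stopped_at)   # stops at the 4th epoch, the first one that repeats 4.5
```

A stricter variant would wait for several unchanged epochs in a row before stopping; this sketch stops at the first repeat.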

Experiment tracking:
- **MLflow** - this is a popular experiment tracking tool that many people love and use.  Here I demonstrate only a very simple usage - please continue to self-study.

Some terms worth mentioning:
- **Epoch** - a popular term in ML/DL - one epoch is one full pass of training (and validation) over all the training data with one model.
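
To connect "epoch" to the three methods, here is a small sketch that counts how many parameter updates happen in one epoch. The helper `updates_per_epoch` and the sample count of 247 are assumptions for illustration only.

```python
import math

def updates_per_epoch(n_samples, method, batch_size=50):
    """Number of gradient updates in one full pass over the data."""
    if method == "sto":
        return n_samples                          # one update per sample
    if method == "mini":
        return math.ceil(n_samples / batch_size)  # the last batch may be smaller
    return 1                                      # batch: one update from all samples

n_samples = 247  # hypothetical size of one training fold
for method in ("sto", "mini", "batch"):
    print(method, updates_per_epoch(n_samples, method))
```

In every case one epoch still touches each training sample exactly once; only the number of updates differs.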

Some coding practice worth mentioning:
- Notice how I put `_` in front of some function names; it has no effect on how Python runs them, and only tells other coders that the function is not meant for outside use (it should not be called from outside the class, e.g., from `__main__`)
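
A tiny sketch of the underscore convention, using a made-up `Counter` class (not from this notebook):

```python
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        """Public method: this is what outside code is expected to call."""
        self._bump(1)
        return self.count

    def _bump(self, step):
        """Helper marked as internal by the leading underscore."""
        self.count += step

c = Counter()
c.increment()   # intended usage
c._bump(5)      # still works: Python does not enforce the convention
print(c.count)
```

So the underscore is a signal to readers, not a lock.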

In [3]:
from sklearn.model_selection import KFold

class LinearRegression(object):
    
    #in this class, we add cross validation as well for some spicy code....
    kfold = KFold(n_splits=5)
            
    def __init__(self, alpha=0.001, num_epochs=5, batch_size=50, method='batch', cv=kfold):
        self.alpha      = alpha
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.method     = method
        self.cv         = cv
    
    def mse(self, ytrue, ypred):
        return ((ypred - ytrue) ** 2).sum() / ytrue.shape[0]
    
    def fit(self, X_train, y_train):
            
        #create a list of kfold scores
        self.kfold_scores = list()
        
        #reset val loss (np.inf is used because the np.infty alias was removed in NumPy 2.0)
        self.val_loss_old = np.inf

        #kfold.split in the sklearn.....
        #5 splits
        for fold, (train_idx, val_idx) in enumerate(self.cv.split(X_train)):
            
            X_cross_train = X_train[train_idx]
            y_cross_train = y_train[train_idx]
            X_cross_val   = X_train[val_idx]
            y_cross_val   = y_train[val_idx]
            
            #create self.theta here
            self.theta = np.zeros(X_cross_train.shape[1])
            
            #define X_cross_train as only a subset of the data
            #how big is this subset?  => mini-batch size ==> 50
            
            #one epoch will exhaust the WHOLE training set
            for epoch in range(self.num_epochs):
            
                #sampling with replacement would draw random indices, so a sample may repeat
                #sampling without replacement visits each sample once: 0:50, 50:100, 100:150, ...
                #here we shuffle the indices, then slice without replacement
                perm = np.random.permutation(X_cross_train.shape[0])
                        
                X_cross_train = X_cross_train[perm]
                y_cross_train = y_cross_train[perm]
                
                if   self.method == 'sto':
                    for batch_idx in range(X_cross_train.shape[0]):
                        X_method_train = X_cross_train[batch_idx].reshape(1, -1) #(11,) ==> (1, 11) ==> (m, n)
                        y_method_train = y_cross_train[batch_idx]                    
                        self._train(X_method_train, y_method_train)
                elif self.method == 'mini':
                    for batch_idx in range(0, X_cross_train.shape[0], self.batch_size):
                        #batch_idx = 0, 50, 100, 150
                        X_method_train = X_cross_train[batch_idx:batch_idx+self.batch_size, :]
                        y_method_train = y_cross_train[batch_idx:batch_idx+self.batch_size]
                        self._train(X_method_train, y_method_train)
                else:
                    X_method_train = X_cross_train
                    y_method_train = y_cross_train
                    self._train(X_method_train, y_method_train)
                    
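                #validate at the end of every epoch
                yhat_val = self.predict(X_cross_val)
                val_loss_new = self.mse(y_cross_val, yhat_val)

                #early stopping: leave the epoch loop once the validation loss stops changing
                if np.allclose(val_loss_new, self.val_loss_old):
                    break
                self.val_loss_old = val_loss_new

            #record this fold's final validation loss, then reset for the next fold
            self.kfold_scores.append(val_loss_new)
            print(f"Fold {fold}: {val_loss_new}")
            self.val_loss_old = np.inf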
                    
    def _train(self, X, y):
        yhat = self.predict(X)
        grad = X.T @(yhat - y)
        self.theta = self.theta - self.alpha * grad
    
    def predict(self, X):
        return X @ self.theta  #===>(m, n) @ (n, )
    
    def _coef(self):
        return self.theta[1:]  #recall that theta is (w0, w1, w2, w3, w4.....wn)
                               #w0 is the bias or the intercept
                               #w1....wn are the weights / coefficients / theta
    def _bias(self):
        return self.theta[0]
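
To see what `self.cv.split(X_train)` hands to the loop in `fit`, here is a small sketch on a toy array. The name `X_toy` is hypothetical; `KFold` is used without shuffling, as in the class above.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
cv = KFold(n_splits=5)

# split() yields one (train indices, validation indices) pair per fold
folds = list(cv.split(X_toy))
for fold, (train_idx, val_idx) in enumerate(folds):
    print(fold, train_idx, val_idx)
```

With 10 rows and 5 splits, each fold holds out the next contiguous block of two rows for validation and trains on the other eight.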

In [4]:
lr = LinearRegression(method="batch") #<==try method="sto" or "mini" too
lr.fit(X_train, y_train)
yhat = lr.predict(X_test)
mse  = lr.mse(y_test, yhat)  #argument order is (ytrue, ypred)

# print the mse
print("Test MSE: ", mse)

Fold 0: 3990.7435685751593
Fold 1: 4924.366277323169
Fold 2: 4329.149051087357
Fold 3: 4052.190647395748
Fold 4: 5648.755870932162
Test MSE:  5509.353824615312


## Experiment

### Batch

In [5]:
mlflow.start_run(run_name="experiment-batch")

#######
params = {"method": "batch", "alpha": 0.01}
lr = LinearRegression(**params)
lr.fit(X_train, y_train)
yhat = lr.predict(X_test)
mse  = lr.mse(y_test, yhat)  #argument order is (ytrue, ypred)

mlflow.log_params(params=params)
mlflow.log_metric('mse', mse)

# mlflow.log_figure(fig.figure_, f"XXX.png")  #you can log figure too in case you have one...
#mlflow.sklearn.log_model(sk_model, "sk_models", signature=signature)  #you can also let mlflow save model

#######

mlflow.end_run()


Fold 0: 8820768048950.701
Fold 1: 3580841319844.273
Fold 2: 6380667824783.314
Fold 3: 11227811335787.287
Fold 4: 6417509937430.679
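
The fold losses above are in the trillions, which looks like divergence rather than a bad fit. One likely cause, though I have not verified it on this exact run, is that `_train` sums the gradient over every row it receives, so in batch mode the effective step grows with the number of samples. The sketch below uses synthetic data and a hypothetical `run` helper (neither is from the notebook) to compare a summed gradient against an averaged one at the same `alpha=0.01`.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500                                                      # synthetic sample count
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])   # intercept + 3 features
y = X @ np.array([1.0, 2.0, -1.0, 0.5])                      # noiseless linear target

def mse(theta):
    return ((X @ theta - y) ** 2).mean()

def run(alpha, average, num_epochs=5):
    """Full-batch gradient descent; optionally divide the gradient by m."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        grad = X.T @ (X @ theta - y)     # summed over all m samples
        if average:
            grad = grad / m              # mean gradient instead of sum
        theta = theta - alpha * grad
    return mse(theta)

loss_start    = mse(np.zeros(X.shape[1]))
loss_summed   = run(alpha=0.01, average=False)
loss_averaged = run(alpha=0.01, average=True)
print(loss_start, loss_summed, loss_averaged)
```

In this sketch the summed gradient ends far above the starting loss, while the averaged one ends below it. Using a smaller `alpha` for batch mode is another way to get a stable step.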


### Mini-batch

In [6]:
mlflow.start_run(run_name="experiment-mini")

#######
params = {"method": "mini", "alpha": 0.01}
lr = LinearRegression(**params)
lr.fit(X_train, y_train)
yhat = lr.predict(X_test)
mse  = lr.mse(y_test, yhat)  #argument order is (ytrue, ypred)

mlflow.log_params(params=params)
mlflow.log_metric('mse', mse)

#######

mlflow.end_run()


Fold 0: 78837.08682827152
Fold 1: 2646.300208375495
Fold 2: 2792.062951504067
Fold 3: 6530.007454750576
Fold 4: 9957.29334330124


## Group Workshop - Check your understanding

Answer the following questions:

Instructions:  Gather in your group.  I will randomly pick groups to present.

1.  Explain why Chaky teaches stochastic and mini-batch gradient descent.  Why should we care?  Explain in your own words.
2.  What's the shape of `X_train` and `y_train`?
3.  What does `batch_size=50` mean?
4.  What is `np.infty` (an alias of `np.inf`; NumPy 2.0 removed the `np.infty` spelling)?  Why is `val_loss_old` set to it?
5.  What is `enumerate(self.cv.split(X_train))`?
6.  What's the shape of `X_cross_train` and `y_cross_train`?
7.  What's the shape of `X_method_train` and `y_method_train` in the case of stochastic gradient descent?
8.  What's the shape of `X_method_train` and `y_method_train` in the case of mini-batch gradient descent?
9.  What's the shape of `X_method_train` and `y_method_train` in the case of batch gradient descent?
10. In https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection, there are many ways to split, and `KFold` is the simplest.  Try to `enumerate` the `ShuffleSplit` and tell us what indices it gives you.   If you are confused, see the examples shown in https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit
11. Change our code so that it uses `ShuffleSplit`.  Report the Test MSE.
12. Please learn about `TimeSeriesSplit` by yourself, as it is used in time series data.
13. In one `epoch`, we train a model using all the training data.  Which line(s) of code make sure we use all the training data?
14. What is `np.random.permutation` and what does it give you?  Demonstrate your answer with trial code.
15. In the line `X_method_train = X_cross_train[batch_idx].reshape(1, -1)`, why do we need to reshape?
16. In the line `yhat_val = self.predict(X_cross_val)`, when should validation happen?
17. What does this line do: `if np.allclose(val_loss_new, self.val_loss_old): break`?
18. Explain why the coefficients can be obtained via `return self.theta[1:]`.
19. Perform an experiment comparing these parameters using cross-validation: `alpha: {0.1, 0.01, 0.001}`, `method:{'batch', 'sto', 'mini'}`.  Write a short paragraph report.