<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M1_sml_byhand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Digression: SML by hand

We might not do this again after this exercise, but it is still worth reviewing how to code up a simple SML model by hand.

## Linear regression by hand
Based on an example adapted from [here](https://www.kdnuggets.com/linear-regression-from-scratch-with-numpy).

In [None]:
# Generate random dataset
from sklearn import datasets
X, y = datasets.make_regression(
        n_samples=500, n_features=1, noise=15, random_state=4)

Reminder: The general equation of a straight line is:

y = b0 + b1*X

* X is numeric and single-valued. b1 and b0 represent the gradient (slope) and the y-intercept (or bias).
* These are unknowns, and different values of them generate different lines. In machine learning, the X and y values are given by the data.
We only have control over b0 and b1, which act as our model parameters.
* We aim to find the values of these two parameters that generate the line minimizing the difference between predicted and actual y values.
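To make this concrete, here is a minimal sketch that evaluates two candidate lines on toy data. The data values and the helper names `line` and `mse` are assumptions made for illustration; the mean squared difference is used as one possible measure of how far predictions are from the actual values.

```python
import numpy as np

# Toy data generated from a known line (hypothetical values): y = 2 + 3 * x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x


def line(x, b0, b1):
    """Evaluate y = b0 + b1 * x for every element of x."""
    return b0 + b1 * x


def mse(y_true, y_hat):
    """Mean squared difference between actual and predicted values."""
    return np.mean((y_true - y_hat) ** 2)


# Two candidate parameter pairs: a poor guess and the true parameters
for b0, b1 in [(0.0, 1.0), (2.0, 3.0)]:
    print(f"b0={b0}, b1={b1}, MSE={mse(y, line(x, b0, b1)):.2f}")
```

The poor guess gives a large error, while the parameters that generated the data give an error of zero. Finding such parameters automatically is the goal of the rest of this section.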

But how do we know the optimal values of the bias (b0) and the weight (b1)? Well, we don't. But we can find them iteratively using gradient descent: we start with random values and change them slightly over many steps until we get close to the optimal values.

First, let us initialize the linear regression model; we will go over the optimization process in greater detail later.

#### Initializing

```python
class LinearRegression:
    def __init__(self, lr: float = 0.01, n_iters: int = 1000) -> None:
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None
```

* We use two hyperparameters, a learning rate and a number of iterations, which will be explained later.
* The weights and biases are set to None because the number of weight parameters depends on the input features within the data.
* We do not have access to the data yet, so we initialize them to None for now.

In [None]:
# Initialize the LinearRegression object
model = LinearRegression(lr=0.01, n_iters=1000)

In [None]:
model

#### The Fit Method
In the fit method, we are provided with the data (X) and their associated target values (y). We can now use these to initialize our weights and then train the model to find the optimal weights.

```python
def fit(self, X, y):
    num_samples, num_features = X.shape          # X shape [N, f]
    self.weights = np.random.rand(num_features)  # W shape [f]
    self.bias = 0
```

* Most ML tasks are usually performed with ```sklearn```, which offers a uniform API.
* New algorithms developed outside the ```sklearn``` framework usually follow the same established conventions, which makes it easy to switch between them or to use them in combination with, e.g., tools for performance evaluation.

#### Predicting Y Values

We use the line equation discussed above to calculate predicted y values. However, instead of an iterative approach to sum all values, we can follow a vectorized approach for faster computation. Given that the weights and X values are NumPy arrays, we can use matrix multiplication to get predictions.

X has shape (num_samples, num_features) and weights have shape (num_features, ). We want the predictions to be of shape (num_samples, ) matching the original y values. Therefore we can multiply X with weights, or (num_samples, num_features) x (num_features, ) to obtain predictions of shape (num_samples, ).

The bias value is added at the end of each prediction. This can simply be implemented in a single line.
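The shape reasoning above can be checked with a small sketch. The array sizes and values below are arbitrary assumptions chosen for illustration; the explicit loop is only there to confirm that the vectorized version computes the same numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # (num_samples, num_features)
weights = np.array([1.5, -2.0])    # (num_features,)
bias = 0.5

# Iterative version: sum weight * feature for each sample, then add the bias
y_loop = np.array([
    sum(X[i, j] * weights[j] for j in range(X.shape[1])) + bias
    for i in range(X.shape[0])
])

# Vectorized version: one matrix-vector product plus a broadcast bias
y_vec = np.dot(X, weights) + bias

print(y_vec.shape)                 # (5,)
print(np.allclose(y_loop, y_vec))  # True
```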


In [None]:
# y_pred shape should be (N,)
y_pred = np.dot(X, self.weights) + self.bias

* However, are these predictions correct? Obviously not. We are using randomly initialized values for the weights and bias, so the predictions will also be random.
* How do we get the optimal values? With **gradient descent** (a little digression), which is an optimization exercise: we minimize the mean squared error (MSE).

#### Gradient descent


* For our predictions to be as close to the original targets as possible, we minimize a loss function, here the mean squared error (MSE) between predictions and targets. The loss is at its minimum where the gradient is zero. As we can only adjust the weights and the bias, we take the partial derivatives of the MSE with respect to the weights and the bias.

![](https://editor.analyticsvidhya.com/uploads/631731_P7z2BKhd0R-9uyn9ThDasA.png)

* We take the gradient (= derivative) with respect to each weight and then move the weight in the direction opposite to the gradient.
* If the gradient is positive, we decrease the weight (and vice versa). This pushes the loss J(W) towards its minimum.
* The learning rate (or alpha) controls the size of each step. We only make a small change at a time, for stable movement towards the minimum.
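As a sketch of where these gradients come from, assume the loss is the mean squared error over N samples, with predictions ŷ = XW + b (the symbols N, ŷ and α are notation introduced here for illustration):

$$
J(W, b) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \qquad \hat{y} = XW + b
$$

$$
\frac{\partial J}{\partial W} = \frac{2}{N} X^\top (\hat{y} - y), \qquad \frac{\partial J}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)
$$

$$
W \leftarrow W - \alpha \frac{\partial J}{\partial W}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}
$$

The constant factor 2 is often dropped in implementations, since it only rescales the learning rate α.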


* If we simplify the derivative using basic algebra, it becomes very simple to implement.
* We implement the derivatives in two lines of code:

In [None]:
# X -> [ N, f ]
# y_pred -> [ N ]
# dw -> [ f ]
dw = (1 / num_samples) * np.dot(X.T, y_pred - y)
db = (1 / num_samples) * np.sum(y_pred - y)
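One way to gain confidence in these two lines is a numerical gradient check. The sketch below is an illustration under stated assumptions: the data are synthetic, the helper `half_mse` is hypothetical, and the loss is taken as half the MSE so that its gradient matches the expressions above, which carry no factor 2.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # synthetic features, shape [N, f]
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0     # synthetic targets, shape [N]
w = rng.normal(size=3)                       # arbitrary weights to test at
b = 0.0
num_samples = X.shape[0]


def half_mse(w, b):
    """0.5 * MSE; its exact gradient is (1/N) * X.T @ (y_pred - y)."""
    return 0.5 * np.mean((X @ w + b - y) ** 2)


# Analytical gradients, as in the two lines above
y_pred = np.dot(X, w) + b
dw = (1 / num_samples) * np.dot(X.T, y_pred - y)
db = (1 / num_samples) * np.sum(y_pred - y)

# Numerical gradients via central finite differences
eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    dw_num[j] = (half_mse(w + step, b) - half_mse(w - step, b)) / (2 * eps)
db_num = (half_mse(w, b + eps) - half_mse(w, b - eps)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-5), np.isclose(db, db_num, atol=1e-5))
```

If both values print as `True`, the analytical and numerical gradients agree at the tested point.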

* dw again has shape (num_features, ), so we have a separate derivative for each weight and update each weight separately. db is a single value.
* To optimize the values, we move them in the opposite direction of the gradient using simple subtraction.

In [None]:
self.weights = self.weights - self.lr * dw
self.bias = self.bias - self.lr * db

* Again, this is only a single step.
* We only make a small change to the randomly initialized values. We now repeatedly perform the same steps, to converge towards a minimum.
* The complete loop is as follows:

In [None]:
for i in range(self.n_iters):

    # y_pred shape should be (N,)
    y_pred = np.dot(X, self.weights) + self.bias

    # X -> [N, f]
    # y_pred -> [N]
    # dw -> [f]
    dw = (1 / num_samples) * np.dot(X.T, y_pred - y)
    db = (1 / num_samples) * np.sum(y_pred - y)

    # Step against the gradient, scaled by the learning rate
    self.weights = self.weights - self.lr * dw
    self.bias = self.bias - self.lr * db
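To see how the learning rate and the number of iterations interact, the loop can be wrapped in a standalone function that also records the loss. This is a sketch under assumptions: the function name `gradient_descent`, the synthetic data, and the zero initialization (instead of the random initialization used in the class) are choices made here for a reproducible illustration.

```python
import numpy as np


def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Run the update loop and record the MSE before every update."""
    num_samples, num_features = X.shape
    weights = np.zeros(num_features)   # assumption: zeros for reproducibility
    bias = 0.0
    losses = []
    for _ in range(n_iters):
        y_pred = np.dot(X, weights) + bias
        losses.append(np.mean((y_pred - y) ** 2))
        dw = (1 / num_samples) * np.dot(X.T, y_pred - y)
        db = (1 / num_samples) * np.sum(y_pred - y)
        weights = weights - lr * dw
        bias = bias - lr * db
    return weights, bias, losses


# Synthetic data from a known line with a little noise: y = 2 + 3 * x
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)

for lr in (0.001, 0.01, 0.1):
    w, b, losses = gradient_descent(X, y, lr=lr, n_iters=1000)
    print(f"lr={lr}: first MSE={losses[0]:.3f}, last MSE={losses[-1]:.3f}")
```

A smaller learning rate needs more iterations to reach the same loss, so the two hyperparameters should be chosen together.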

#### Prediction
* We predict the same way as we did during training.
* However, we now use the trained weights and bias, which should be close to optimal.
* The predicted values should therefore be close to the original values.



In [None]:
def predict(self, X):
    return np.dot(X, self.weights) + self.bias

#### Investigate

In [None]:
import numpy as np


class LinearRegression:
    def __init__(self, lr: float = 0.01, n_iters: int = 1000) -> None:
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        num_samples, num_features = X.shape     # X shape [N, f]
        self.weights = np.random.rand(num_features)  # W shape [f]
        self.bias = 0

        for i in range(self.n_iters):

            # y_pred shape should be (N,)
            y_pred = np.dot(X, self.weights) + self.bias

            # X -> [N,f]
            # y_pred -> [N]
            # dw -> [f]
            dw = (1 / num_samples) * np.dot(X.T, y_pred - y)
            db = (1 / num_samples) * np.sum(y_pred - y)

            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db

        return self

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

In [None]:
# Initialize the LinearRegression object
model = LinearRegression(lr=0.01, n_iters=1000)

In [None]:
# Fit the model to the example data
model.fit(X, y)
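One way to investigate the fitted model is to compare its parameters with the closed-form least-squares solution. The sketch below is self-contained and rests on assumptions: it repeats a compact copy of the class above, uses synthetic stand-in data (slope 40, intercept 5, noise scale 15) rather than the `make_regression` dataset, and uses `np.linalg.lstsq` as the reference.

```python
import numpy as np


class LinearRegression:
    """Compact copy of the class above so the sketch runs on its own."""

    def __init__(self, lr=0.01, n_iters=1000):
        self.lr, self.n_iters = lr, n_iters
        self.weights, self.bias = None, None

    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.weights = np.random.rand(num_features)
        self.bias = 0
        for _ in range(self.n_iters):
            y_pred = np.dot(X, self.weights) + self.bias
            dw = (1 / num_samples) * np.dot(X.T, y_pred - y)
            db = (1 / num_samples) * np.sum(y_pred - y)
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db
        return self

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias


# Synthetic stand-in data (assumed values, not the notebook's dataset)
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 1))
y = 40.0 * X[:, 0] + 5.0 + rng.normal(scale=15.0, size=500)

model = LinearRegression(lr=0.01, n_iters=1000).fit(X, y)
mse = np.mean((model.predict(X) - y) ** 2)

# Reference: closed-form least squares on [X, 1] (last column fits the bias)
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print("gradient descent:", model.weights, model.bias, "MSE:", mse)
print("least squares:   ", coef[:-1], coef[-1])
```

If gradient descent has converged, both lines should report nearly the same weight and bias; a visible gap suggests more iterations or a different learning rate are needed.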