## Building Linear Neural Networks for Machine Learning Regression

In [2]:
import torch

In [10]:
X = torch.randn(10, 10)
y = torch.randn(10)
X, y

(tensor([[ 0.0105, -0.3524, -0.8405,  0.1248,  1.1378, -0.4270, -1.3653,  1.2809,
          -0.6189,  0.0762],
         [ 0.6146, -0.6661,  0.2844,  2.7747, -2.3469, -0.5257, -0.4112, -0.1374,
          -1.6002, -2.6443],
         [ 0.1022,  0.5095,  0.3900, -1.2284,  2.3054, -0.5433, -0.3673, -1.1440,
          -0.4079,  0.5785],
         [ 0.8475,  1.0511,  0.2029, -0.5822, -1.0615, -1.7253,  0.5514,  1.7176,
           0.7520, -0.6288],
         [-0.2764,  0.6178, -0.4972, -2.1459, -0.7055,  1.5842,  1.6377, -0.6056,
           1.8395, -0.3558],
         [ 0.0044, -0.4361,  0.3105, -0.1385,  0.6621,  1.5718,  0.5231,  0.3330,
           1.4507, -0.4212],
         [ 0.1110,  0.4385,  1.6466, -0.5679, -1.1906, -0.9303, -1.8452,  2.0042,
           0.2904,  0.5847],
         [ 0.5209,  0.0863,  0.9270, -1.7087,  0.2636, -1.4006,  1.3385,  0.0141,
           0.4232,  2.2421],
         [-0.5329, -1.7276,  1.1089, -2.0264,  1.3249,  0.5282,  1.3302,  0.7906,
          -1.2325,  0.1138],
 

In [7]:
import math
import time
import numpy as np
from d2l import torch as d2l

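$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ is the formula for linear regression. The superscript $\top$ denotes the transpose, so $\mathbf{w}^\top \mathbf{x}$ is the dot product of the weight vector $\mathbf{w}$ and the sample vector $\mathbf{x}$; adding the bias $b$ gives the prediction $\hat{y}$.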

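$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$ gives the predictions for all samples in the matrix $\mathbf{X}$ at once. Here $\mathbf{X}\mathbf{w}$ is a matrix-vector product, so $\hat{\mathbf{y}}$ is a vector with one prediction per sample (one per row of $\mathbf{X}$), and the scalar $b$ is added to every entry.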
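
As a quick check on the two formulas above, the sketch below computes one prediction with a dot product and all predictions with a single matrix-vector product. The names `X_demo`, `w`, and `b` and their values are assumptions chosen for illustration; they are not fitted parameters.

```python
import torch

torch.manual_seed(0)
X_demo = torch.randn(4, 3)            # 4 samples, 3 features (illustrative sizes)
w = torch.tensor([0.5, -1.0, 2.0])    # assumed weights, not learned
b = 0.1                               # assumed bias

# Single sample: y_hat = w^T x + b
y_hat_first = torch.dot(w, X_demo[0]) + b

# All samples at once: y_hat = X w + b (the scalar b is broadcast over samples)
y_hat_all = torch.mv(X_demo, w) + b

print(y_hat_all.shape)                              # one prediction per row of X_demo
print(torch.allclose(y_hat_all[0], y_hat_first))    # both routes agree on sample 0
```

The batched form is the one normally used in practice, since it replaces a Python loop over samples with one vectorized operation.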

The goal of training is to find the parameters $\mathbf{w}$ and $b$ such that, given a new unseen sample drawn from the same distribution as the rows of $\mathbf{X}$, the model predicts a value $\hat{y}$ that is close to the true target $y$ for that sample.

$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$ is the loss function we use to measure how well the model fits sample $i$ in a regression problem. This is known as the squared error loss.

In [None]:
def squared_error(y_pred, y_actual):
    return 0.5 * ((y_pred - y_actual)**2)

In [None]:
def mean_squared_error(y_pred, y_actual):
    # Both sequences must describe the same samples
    if len(y_actual) != len(y_pred):
        raise ValueError("Length of y_pred and y_actual must be equal")

    # Per-sample losses (each already includes the 1/2 factor)
    squared_errors = [
        squared_error(pred, actual) for pred, actual in zip(y_pred, y_actual)
    ]

    # Average over all samples
    return sum(squared_errors) / len(squared_errors)

$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}$ is the analytic solution. We can use it to find the weights that minimize the squared error for a given problem, but requiring an analytic solution is so restrictive that it would not suffice for solving the most exciting problems in deep learning.

In [18]:
# finding the appropriate weights using the analytical solution

def analytic_solution(X, y):
    # w* = (X^T X)^{-1} X^T y
    gram = torch.mm(X.T, X)                      # X^T X, shape (features, features)
    gram_inv = torch.linalg.inv(gram)            # (X^T X)^{-1}; requires X^T X to be invertible
    return torch.mv(gram_inv, torch.mv(X.T, y))  # matrix-vector products, not element-wise *

In [19]:
result = analytic_solution(X, y)
result  # 1-D tensor with one weight per column of X
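
One way to sanity-check a closed-form solver is to generate data from known weights and confirm that they are recovered. The sketch below does this on noise-free synthetic data. The names `X_syn`, `y_syn`, and `true_w`, the sample count, and the cross-check with `torch.linalg.solve` are all assumptions made for illustration. No bias term is included, matching the formula for $\mathbf{w}^*$.

```python
import torch

torch.manual_seed(0)
n_samples, n_features = 100, 3
X_syn = torch.randn(n_samples, n_features, dtype=torch.float64)
true_w = torch.tensor([2.0, -3.4, 0.5], dtype=torch.float64)  # assumed ground truth
y_syn = X_syn @ true_w                                         # noise-free targets

# Normal equations: w* = (X^T X)^{-1} X^T y
w_star = torch.linalg.inv(X_syn.T @ X_syn) @ (X_syn.T @ y_syn)

# Cross-check: solve (X^T X) w = X^T y without forming the inverse explicitly
w_solve = torch.linalg.solve(X_syn.T @ X_syn, X_syn.T @ y_syn)

print(torch.allclose(w_star, true_w))
print(torch.allclose(w_star, w_solve))
```

Solving the linear system directly is generally preferred over computing an explicit inverse for numerical reasons, although both give the same weights here.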

In practice, the parameters are found by gradually adjusting the weights so that the loss decreases, until further adjustments no longer reduce it. This iterative process is known as gradient descent: at each step we move the parameters in the direction of the negative gradient of the loss function (also called the objective function), descending toward its minimum.
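
The loop below is a minimal sketch of full-batch gradient descent for linear regression, assuming synthetic data and a hand-derived gradient of the mean half squared error. The data, `true_w`, `true_b`, the learning rate `lr`, and the step count are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
X_syn = torch.randn(200, 2)
true_w = torch.tensor([2.0, -3.4])   # assumed ground-truth weights
true_b = 4.2                         # assumed ground-truth bias
y_syn = X_syn @ true_w + true_b

w = torch.zeros(2)
b = torch.zeros(1)
lr = 0.1                             # assumed learning rate

def half_mse(w, b):
    """Mean of the per-sample losses 0.5 * (y_hat - y)^2."""
    return 0.5 * ((X_syn @ w + b - y_syn) ** 2).mean()

initial_loss = half_mse(w, b).item()
for step in range(200):
    error = X_syn @ w + b - y_syn            # y_hat - y for every sample
    grad_w = X_syn.T @ error / len(y_syn)    # d(loss)/dw
    grad_b = error.mean()                    # d(loss)/db
    w = w - lr * grad_w                      # step against the gradient
    b = b - lr * grad_b
final_loss = half_mse(w, b).item()

print(initial_loss, final_loss)
print(w, b)
```

Every update here uses all 200 samples, which is the cost that motivates the batching strategies discussed next.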

Evaluating the loss as described so far means computing it over every sample in the dataset before making a single update to the weights, which can be extremely slow. One alternative is to update the weights after computing the loss for a single sample. However, single-sample updates have both statistical and computational drawbacks, so the usual solution is a middle ground: compute each update from a small batch of samples. This is minibatch stochastic gradient descent.
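
A minimal sketch of minibatch stochastic gradient descent follows, assuming synthetic data with a little Gaussian noise. The values of `lr`, `batch_size`, and `num_epochs`, as well as `true_w` and `true_b`, are illustrative assumptions rather than recommended settings.

```python
import torch

torch.manual_seed(0)
X_syn = torch.randn(1000, 2)
true_w = torch.tensor([2.0, -3.4])   # assumed ground-truth weights
true_b = 4.2                         # assumed ground-truth bias
y_syn = X_syn @ true_w + true_b + 0.01 * torch.randn(1000)

w = torch.zeros(2)
b = torch.zeros(1)
lr, batch_size, num_epochs = 0.03, 10, 5   # assumed hyperparameters

for epoch in range(num_epochs):
    perm = torch.randperm(len(y_syn))              # reshuffle the samples each epoch
    for start in range(0, len(y_syn), batch_size):
        idx = perm[start:start + batch_size]       # indices of one minibatch
        Xb, yb = X_syn[idx], y_syn[idx]
        error = Xb @ w + b - yb
        w = w - lr * (Xb.T @ error) / len(idx)     # gradient averaged over the batch
        b = b - lr * error.mean()
    epoch_loss = 0.5 * ((X_syn @ w + b - y_syn) ** 2).mean()
    print(f"epoch {epoch + 1}, loss {epoch_loss.item():.6f}")
```

With a batch size of 10, each pass over the 1000 samples performs 100 parameter updates instead of one.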

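The partial derivative of the loss with respect to each weight tells us how the loss changes as that weight changes. To reduce the loss, each weight is moved a small step in the opposite direction of its derivative, with the step size scaled by the learning rate.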
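
To make the role of the derivative concrete, the sketch below lets PyTorch's autograd compute the gradient of the loss with respect to each weight, compares it with the hand-derived expression, and checks that one step against the gradient lowers the loss. The data and the learning rate `lr` are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
X_syn = torch.randn(50, 3)
y_syn = torch.randn(50)

w = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

loss = 0.5 * ((X_syn @ w + b - y_syn) ** 2).mean()
loss.backward()                      # fills w.grad and b.grad

# Hand-derived gradient of the same loss with respect to w, for comparison
with torch.no_grad():
    manual_grad_w = X_syn.T @ (X_syn @ w + b - y_syn) / len(y_syn)
print(torch.allclose(w.grad, manual_grad_w, atol=1e-5))

lr = 0.1                             # assumed learning rate
with torch.no_grad():
    new_w = w - lr * w.grad          # move each weight against its derivative
    new_b = b - lr * b.grad
    new_loss = 0.5 * ((X_syn @ new_w + new_b - y_syn) ** 2).mean()
print(new_loss.item() < loss.item())
```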

Tunable parameters that are not updated in the training loop are called hyperparameters; examples include the learning rate and the batch size. They can be tuned automatically by a number of techniques, such as Bayesian optimization.
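
As a rough illustration of automatic tuning, the sketch below runs a plain grid search over the learning rate and scores each candidate on a held-out validation split. This is a simpler technique than Bayesian optimization and is used here only because it fits in a few lines. The helper `train_and_validate`, the candidate grid, and the data are all hypothetical.

```python
import torch

torch.manual_seed(0)
X_syn = torch.randn(300, 2)
y_syn = X_syn @ torch.tensor([2.0, -3.4]) + 4.2 + 0.01 * torch.randn(300)
X_train, y_train = X_syn[:200], y_syn[:200]   # used to fit w and b
X_val, y_val = X_syn[200:], y_syn[200:]       # used only to score a learning rate

def train_and_validate(lr, num_steps=50):
    """Run full-batch gradient descent with `lr` and return the validation loss."""
    w, b = torch.zeros(2), torch.zeros(1)
    for _ in range(num_steps):
        error = X_train @ w + b - y_train
        w = w - lr * (X_train.T @ error) / len(y_train)
        b = b - lr * error.mean()
    return (0.5 * ((X_val @ w + b - y_val) ** 2).mean()).item()

candidate_lrs = [0.001, 0.01, 0.1, 0.5]       # assumed search grid
val_losses = {lr: train_and_validate(lr) for lr in candidate_lrs}
best_lr = min(val_losses, key=val_losses.get)

print(val_losses)
print("best learning rate:", best_lr)
```

The learning rate is never touched inside the training loop itself; it is chosen by an outer search, which is what distinguishes a hyperparameter from the parameters `w` and `b`.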