### Chapter 5 : The Mechanics of Learning

##### The problem :
We just got back from a trip and brought home a thermometer. The problem is that its units are not given. So we will take readings from this thermometer together with the corresponding temperatures in Celsius, and fit a model that converts one into the other by making the error as small as possible. We will use the same method that Kepler used, but with an additional tool: `PyTorch`.

In [2]:
import torch
import plotly.express as px

In [3]:
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]       # readings in Celsius
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]    # readings in unknown units
t_c = torch.tensor(t_c)         # converted to tensor   
t_u = torch.tensor(t_u)         # converted to tensor

In [4]:
# Now first we would plot and visualize the data
px.scatter(x = t_u, y = t_c, labels={'x':'measurement', 'y':'celsius'}, title='Thermometer readings')

##### Choosing a linear model in our first try
In the absence of further knowledge a linear model is worth a try (just like in the case of Kepler). We multiply each reading in the unknown units by a factor (the weight) and add a constant (the bias), and this should give us the temperature in Celsius (up to an error that we omit here).
```
                                                    t_c = w * t_u + b
```
So now we need to estimate the weight and the bias of our model. We will run the temperatures t_u through the model and check whether the predicted temperatures are close to the measured Celsius values. We will go through this with PyTorch, and we will see that training a neural network essentially involves swapping this model for a more complex one with more parameters.\
Now we also need to define a measure of the error. This measure is called the `Loss Function`. Its value should be high if the error is high and low if the error is low. So our optimization goal is to find the values of `w` and `b` for which the loss is as low as possible.
1. With `t_p` the model's prediction, we can use `| t_p - t_c |` or `( t_p - t_c ) ** 2` as the loss. Both are non-negative even when the raw error is negative. Both have their minimum at zero and increase monotonically as the predicted value moves away from the true value in either direction. Both are also convex, and a convex function is relatively easy to minimize. The loss of a neural network, however, is generally not convex in its parameters.
2. We will not use `| t_p - t_c |` because its derivative is undefined at `t_p = t_c`, which is exactly the minimum we are trying to converge to.
3. This problem does not apply to `( t_p - t_c ) ** 2`: its derivative is defined everywhere and shrinks smoothly to zero as we approach the minimum. It also penalizes large errors more heavily than small ones. So we will go with the squared error.
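
To make the comparison between the two candidate losses concrete, here is a minimal sketch that evaluates both of them for the untrained model (w = 1, b = 0) on our readings. The names `abs_loss` and `sq_loss` are our own and only serve this illustration.

```python
import torch

# Same readings as in the notebook
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

# Predictions of the untrained model: w = 1, b = 0
t_p = 1.0 * t_u + 0.0

abs_loss = (t_p - t_c).abs().mean()     # mean absolute error
sq_loss = ((t_p - t_c) ** 2).mean()     # mean squared error

print(f"absolute loss: {abs_loss.item():.4f}")
print(f"squared loss : {sq_loss.item():.4f}")
```

Both values are positive, but the squared loss is much larger because every individual error is large here and squaring amplifies large errors.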

In [5]:
# Now as we have already created the tensors for input of the data. Now we will use pytorch to build our model
def model(t_u, w, b):
    return w*t_u + b

# In our model the parameters are PyTorch scalars; broadcasting is used to compute the result for all readings at once
# Now we define our loss
def loss(t_p, t_c):
    squared_error = (t_p - t_c) ** 2
    return squared_error.mean()

In [6]:
# Now we would initialize the parameters of w and b and invoke the model and calculate the loss

# Initialize the parameter
w = torch.ones(())
b = torch.zeros(())

t_p = model(t_u, w, b)

print(w, b)

print(f"The predicted results are : {t_p}")

# calculate the loss
squared_error = loss(t_p, t_c)
print(f" The loss is : {squared_error} ")

tensor(1.) tensor(0.)
The predicted results are : tensor([35.7000, 55.9000, 58.2000, 81.9000, 56.3000, 48.9000, 33.9000, 21.8000,
        48.4000, 60.4000, 68.4000])
 The loss is : 1763.884765625 


Now we have calculated the loss, but we need to change our parameters w and b in such a way that the loss decreases. To do this we can nudge w and b by a small amount and see how the loss behaves. The idea is then to move the parameters in the direction in which the loss decreases.

In [7]:
delta = 0.1

# Central difference: divide the whole difference in losses by 2*delta
loss_rate_of_change_of_w = (
    loss(model(t_u, w + delta, b), t_c) - loss(model(t_u, w - delta, b), t_c)
) / (2.0 * delta)
print(loss_rate_of_change_of_w)

This prints a value of approximately 4517.3. Because the loss is quadratic in w, the central difference agrees with the exact derivative at w = 1, b = 0 up to floating-point error. The value is positive, so the loss grows when w grows and we need to decrease w.

The rate of change tells us how the loss responds to a change in w. If it is negative we need to increase w to reduce the loss, and vice versa. But by how much should we change w? A change proportional to the rate of change of the loss is a good idea, especially when the loss depends on several parameters: we then change most those parameters that have the most significant effect on the loss. However, the rate of change can be dramatically different at a distance from the current value of w, so we should only take a small step. Therefore we typically scale the rate of change by a small factor, which in ML language is called the learning rate.

In [8]:
learning_rate = 1e-2
w = w - learning_rate * loss_rate_of_change_of_w
print(f"New w : {w}")

# We could do the same for b also
loss_rate_of_change_of_b = (
    loss(model(t_u, w, b + delta), t_c) - loss(model(t_u, w, b - delta), t_c)
) / (2.0 * delta)
print(loss_rate_of_change_of_b)

b = b - (learning_rate * loss_rate_of_change_of_b)
print(f"New b : {b}")

Running this cell moves w from 1.0 to roughly -44.2 and b from 0.0 to roughly 46 (the exact digits depend on floating-point rounding). Steps this large suggest that this learning rate is too big for the unscaled inputs.


The above is one iteration of gradient descent. To minimize the loss we choose a small enough learning rate and repeat the process for both the weight and the bias. But this way of updating works for just two parameters, and real networks have millions, in some cases billions, of parameters. So we need a more scalable method of calculating the `rate of change of the loss w.r.t. the parameters (the gradient)`.
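
To see why probing the loss does not scale, here is a sketch that generalizes the finite-difference estimate to a whole parameter vector. The helper `numerical_grad` and the name `loss_fn` are hypothetical names chosen for this illustration. Note that it needs two full loss evaluations per parameter.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def numerical_grad(params, delta=0.1):
    """Central-difference estimate of the gradient (hypothetical helper)."""
    grads = torch.zeros_like(params)
    for i in range(params.numel()):
        step = torch.zeros_like(params)
        step[i] = delta                      # perturb one parameter at a time
        loss_plus = loss_fn(model(t_u, *(params + step)), t_c)
        loss_minus = loss_fn(model(t_u, *(params - step)), t_c)
        grads[i] = (loss_plus - loss_minus) / (2.0 * delta)
    return grads

params = torch.tensor([1.0, 0.0])            # w = 1, b = 0
grad = numerical_grad(params)
print(grad)
```

For our two parameters this costs four loss evaluations per update. For a model with a million parameters it would cost two million, which is why we look for something better.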

##### Calculating the Analytical Gradient
1. Estimating the rate of change of the loss by repeatedly evaluating the model, in order to probe the behaviour of the loss function in the neighbourhood of w and b, does not scale to many parameters. It is also not always clear how large the neighbourhood should be. We chose delta = 0.1, but in practice a suitable value depends on the shape of the loss as a function of w and b. If the loss changes too quickly compared to delta, we will not get a good idea of the direction in which the loss decreases the most.
2. But what if we could make the neighbourhood infinitesimally small? This is exactly what happens when we take the derivative of the loss w.r.t. a parameter analytically. In a model with two or more parameters, like the one we are dealing with, we compute the individual derivatives of the loss w.r.t. each parameter and put them in a vector called the `gradient`.
3. In order to compute the derivative of the loss w.r.t. a parameter we can apply the chain rule:
```
                                    dloss_fn/dw = dt_p/dw * dloss_fn/dt_p
```
4. Using the chain rule we can calculate the derivative of the loss w.r.t. w and b for every data point and sum these per-sample contributions to get the gradient vector over all the training examples.
5. Updating the parameters with this gradient is one iteration of the process. When we repeat it for a number of iterations we can converge to a point where the loss stops decreasing and, if the model suits the data, is small. That is when we consider the model fitted: given a new input, it should produce an output consistent with the data it has seen.
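
The chain-rule recipe above can be sketched in code as follows. The function names (`dloss_fn`, `dmodel_dw`, `dmodel_db`, `grad_fn`) are our own naming for this illustration and are not part of PyTorch.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def dloss_fn(t_p, t_c):
    # d/dt_p of mean((t_p - t_c)**2); the mean contributes the 1/N factor
    return 2 * (t_p - t_c) / t_p.size(0)

def dmodel_dw(t_u, w, b):
    return t_u          # d(w * t_u + b)/dw

def dmodel_db(t_u, w, b):
    return 1.0          # d(w * t_u + b)/db

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = dloss_fn(t_p, t_c)
    dloss_dw = dloss_dtp * dmodel_dw(t_u, w, b)   # chain rule, per sample
    dloss_db = dloss_dtp * dmodel_db(t_u, w, b)
    # sum the per-sample contributions into one gradient vector
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])

w = torch.ones(())
b = torch.zeros(())
grad = grad_fn(t_u, t_c, model(t_u, w, b), w, b)
print(grad)
```

At w = 1, b = 0 this gives approximately [4517.3, 82.6] with a single forward evaluation, and no delta has to be chosen.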


##### Overtraining
1. Sometimes the loss goes to infinity. This means our params are receiving updates that are too large: their values start oscillating back and forth across the minimum, each update overshoots, and the next one overcorrects even more, so the loss can grow exponentially until it becomes infinite. Instead of converging, the optimization diverges. Ideally we want to see smaller and smaller updates to the parameters as we approach the minimum.
2. So we need to figure out how to limit the magnitude of learning_rate*grad. The simplest option is to choose a smaller learning rate: the gradient at the current point is fixed (it changes only when the parameters change), so the learning rate is the knob that controls the step size and keeps us from overshooting.
3. We usually change the learning rate by an order of magnitude at a time, e.g. from 1e-2 to 1e-3.
4. We can also make the learning rate adaptive: large for the first few iterations, so that we make fast progress towards the minimum, and then gradually smaller so that we settle into the minimum instead of jumping over it. This lets us converge in less time.
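
The effect of the learning rate can be checked with a short experiment. The sketch below runs plain gradient descent on the unscaled readings with two learning rates. The helper `training_run` is a hypothetical name, and the two learning rates are just example values.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def training_run(learning_rate, n_epochs=100):
    """Plain gradient descent on w and b; returns the final loss (hypothetical helper)."""
    params = torch.tensor([1.0, 0.0])
    for _ in range(n_epochs):
        w, b = params
        err = (w * t_u + b) - t_c
        # analytical gradient of the mean squared error w.r.t. w and b
        grad = torch.stack([(2 * err * t_u).mean(), (2 * err).mean()])
        params = params - learning_rate * grad
    w, b = params
    return (((w * t_u + b) - t_c) ** 2).mean()

loss_large = training_run(1e-2)   # updates too large: the loss blows up
loss_small = training_run(1e-4)   # two orders of magnitude smaller: stable

print(f"lr = 1e-2 -> final loss {loss_large.item()}")
print(f"lr = 1e-4 -> final loss {loss_small.item()}")
```

With the larger learning rate the loss ends up as `inf` or `nan`, while the smaller one stays finite and ends well below the starting loss.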

##### Normalizing the Inputs
The gradients for the weight and the bias can live on very different scales, because the gradient w.r.t. the weight is multiplied by the input values while the gradient w.r.t. the bias is not. This means that a learning rate which is good for one parameter can be far too large for the other. To combat this problem we could do one of two things:
1. Have separate learning rates for separate parameters, which would be too bothersome for a large number of parameters.
2. Or normalize the inputs so that they fall in a similar range, for example roughly between -1.0 and 1.0. This also makes the gradients more manageable.
3. For small networks we can get away with not normalizing the inputs, but for large NNs it is a crucial tool for speedy convergence.
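
As a quick check of the scale problem, the sketch below compares the gradient at w = 1, b = 0 for the raw inputs and for standardized inputs (zero mean, unit standard deviation), which is one possible way of normalizing. The helper `grad_at_start` is a hypothetical name.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def grad_at_start(inputs):
    """Analytical MSE gradient [dloss/dw, dloss/db] at w = 1, b = 0 (hypothetical helper)."""
    err = (1.0 * inputs + 0.0) - t_c
    return torch.stack([(2 * err * inputs).mean(), (2 * err).mean()])

t_un = (t_u - t_u.mean()) / t_u.std()   # standardized inputs

raw_grad = grad_at_start(t_u)
norm_grad = grad_at_start(t_un)

print(f"raw inputs         : {raw_grad}")
print(f"standardized inputs: {norm_grad}")
```

With the raw inputs the weight gradient is about 50 times larger than the bias gradient. After standardizing, both components have a similar magnitude, so a single learning rate suits both.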

#### Pytorch's Autograd : Backpropagating All things

Previously we computed the gradients of the loss w.r.t. the parameters by hand, using the chain rule of calculus to propagate derivatives backwards through the model, and then updated the parameters by the gradient scaled by the learning rate. The only requirement is that all the functions we are evaluating can be differentiated analytically. If that is the case, we can compute the gradients w.r.t. all the parameters in one sweep. Even if the model is composed of a million parameters, the process of writing down the analytical gradient and evaluating it is the same.

##### Computing Gradients automatically
PyTorch has another component called `autograd`. The advantage of this component is that a PyTorch tensor can remember its `parents (the tensors it was created from)` and the `operation (by which the tensor was created)`, and it can provide the chain of derivatives of such operations with respect to their inputs. Given a forward expression, no matter how nested, PyTorch can automatically provide the gradient of that expression w.r.t. its inputs.
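
A minimal sketch of this bookkeeping, using a toy expression of our own (not the thermometer model):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)   # leaf tensor tracked by autograd
y = x ** 2 + 2 * x                          # forward pass; y remembers how it was built

print(y.grad_fn)    # the backward node of the last operation (an addition)
y.backward()        # walk the recorded graph backwards
print(x.grad)       # dy/dx = 2*x + 2 = 8 at x = 3
```

The `grad_fn` attribute is the link from a tensor to the operation that created it; leaf tensors such as `x` have `grad_fn` set to `None`.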

##### Applying Autograd
Now we will rewrite the thermometer code using autograd. But let's first recall our model and loss function.

In [9]:
def model(input, w, b):
    return w * input + b

def loss_func(predicted, actual):
    squared_diff = (predicted - actual) ** 2
    return squared_diff.mean()

In [10]:
# let's initialize a parameters tensor
params  = torch.tensor([1.0,0.0], requires_grad= True)

The `requires_grad = True` argument to the tensor is telling pytorch to track down the family tree of the tensor `params`. In other words any tensor that will have params as an ancestor will have a chain of all the functions that were called to get from params to that tensor.  In case these functions are differentiable then pytorch will automatically compute the gradients and populate the grad attribute of the params tensor.

In [11]:
# Lets see if params has the grad attribute. This shows that the grad attribute of params is assigned as None.
params.grad is None

True

All we have to do is create the parameter tensors with requires_grad = True, compute the loss tensor by calling the model and the loss function, and then call .backward() on the loss tensor to populate the grad attribute of all the participating leaf tensors.

In [12]:
# Calculate the loss tensor
loss = loss_func(model(t_u, *params), t_c)
# Perform the backward pass
loss.backward()
print(params.grad)

tensor([4517.2969,   82.6000])


At this point the grad attribute of the params tensor contains the values of the derivative of loss w.r.t each of the elements of params tensor.
When we compute our loss, PyTorch not only performs the computations but also builds a graph with the tensors and the operations on them as nodes. When .backward() is called, PyTorch traverses this graph in reverse order and computes the gradients.

##### Accumulating Gradients
When .backward() is called on the loss tensor, the gradients are computed through the whole graph and `accumulated (summed on top of whatever is already stored)` in the grad attribute of the leaf nodes, i.e. the tensors such as params from which the graph was built.\
\
`Warning : As the gradients are accumulated on the leaf nodes, we need to zero them out explicitly after using them for the parameter update.`

In [13]:
# We can do zeroing by
params.grad.zero_()

tensor([0., 0.])

`You might be curious why we need to zero the gradients manually instead of .backward() doing it for us automatically. Doing it this way provides more flexibility and control when working with more complex models.`
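
The accumulation is easy to observe on a toy problem. In the sketch below the data (`t_u_small`, `t_c_small`) and the helper `toy_loss` are made up for this illustration.

```python
import torch

# Toy data, made up for this illustration
t_u_small = torch.tensor([1.0, 2.0, 3.0])
t_c_small = torch.tensor([2.0, 4.0, 6.0])

params = torch.tensor([1.0, 0.0], requires_grad=True)

def toy_loss(params):
    w, b = params
    return ((w * t_u_small + b - t_c_small) ** 2).mean()

toy_loss(params).backward()
first = params.grad.clone()

toy_loss(params).backward()      # no zeroing in between: gradients add up
second = params.grad.clone()

params.grad.zero_()              # explicit reset
toy_loss(params).backward()
third = params.grad.clone()

print(first, second, third)
```

The second gradient is exactly twice the first even though the parameters did not change, and only after `zero_()` do we get the correct value again.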

In [14]:
def model(input, w, b):
    return w * input + b

def loss_func(predicted, actual):
    squared_diff = (predicted - actual) ** 2
    return squared_diff.mean()
    
# Now with this thing lets build our training loop from start to finish
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs+1):
        if params.grad is not None:
            # Zero out the accumulated gradients at the start of each epoch
            params.grad.zero_()

        # carry out the forward pass
        loss_tensor = loss_func(model(t_u, *params), t_c)

        # carry out the backward pass
        loss_tensor.backward()

        # Now do the parameter update (done in place, with autograd switched off)
        with torch.no_grad():
            params -= learning_rate*params.grad
        
        # Print the loss after the epochs
        if epoch % 50 == 0:
            print(f" Epoch: {epoch} | Loss: {float(loss_tensor)} ")
    return params


##### There are a few caveats in the above code:
1. First, the update is wrapped in the torch.no_grad() context using a with statement. This means that inside the with block autograd does not record operations, i.e. it does not add edges to the graph. `In this bit of code the forward graph is consumed by the .backward() call, leaving us with the params leaf node. With params -= learning_rate*params.grad we are changing that leaf node in place before we start building a fresh forward graph on top of it.`
2. We update params in place by subtracting the update from it. When using autograd we usually avoid in-place updates of tensors, because the autograd engine might need the values we would be modifying for its backward pass. Here, however, we are operating with autograd switched off, and updating in place is useful because it keeps the same params tensor across iterations.
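
To see what the no_grad context protects us from, here is a small sketch on a toy loss of our own. It first tries the in-place update with autograd active, which PyTorch rejects for a leaf tensor that requires gradients, and then repeats it inside `torch.no_grad()`.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss = (params ** 2).sum()       # toy loss, gradient is 2 * params
loss.backward()                  # params.grad is now [2., 0.]

try:
    params -= 0.1 * params.grad  # in-place update while autograd is tracking
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

with torch.no_grad():            # autograd switched off: the update is allowed
    params -= 0.1 * params.grad

print(params)                    # still the same leaf tensor, now updated
```

After the update `params` is still a leaf tensor with `requires_grad=True`, so the next forward pass builds a fresh graph on top of it.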

In [15]:
# Now let's see that in practice, on standardized inputs
t_u_mean = torch.mean(t_u)
t_u_std = torch.std(t_u)
t_un = (t_u - t_u_mean)/t_u_std
params_vect = torch.tensor([1.0,0.0], requires_grad = True)
training_loop(n_epochs = 1000, learning_rate = 0.01, params = params_vect, t_u = t_un, t_c = t_c)

 Epoch: 50 | Loss: 27.870668411254883 
 Epoch: 100 | Loss: 6.498063564300537 
 Epoch: 150 | Loss: 3.443054437637329 
 Epoch: 200 | Loss: 3.0026862621307373 
 Epoch: 250 | Loss: 2.938664436340332 
 Epoch: 300 | Loss: 2.9292776584625244 
 Epoch: 350 | Loss: 2.9278886318206787 
 Epoch: 400 | Loss: 2.9276816844940186 
 Epoch: 450 | Loss: 2.9276506900787354 
 Epoch: 500 | Loss: 2.9276463985443115 
 Epoch: 550 | Loss: 2.927645683288574 
 Epoch: 600 | Loss: 2.927644968032837 
 Epoch: 650 | Loss: 2.9276459217071533 
 Epoch: 700 | Loss: 2.927645206451416 
 Epoch: 750 | Loss: 2.927645206451416 
 Epoch: 800 | Loss: 2.927645206451416 
 Epoch: 850 | Loss: 2.927645206451416 
 Epoch: 900 | Loss: 2.927645206451416 
 Epoch: 950 | Loss: 2.927645206451416 
 Epoch: 1000 | Loss: 2.927645206451416 


tensor([ 9.0349, 10.5000], requires_grad=True)

This is the result that we get. Although we are capable of computing the derivatives by hand, we do not need to, because autograd takes care of that for us.

#### Optimizers
So far we have implemented vanilla gradient descent, which was fine for our simple case, but there are other optimization strategies that can help us optimize better and faster. Most of them are variants of gradient descent with some refinements. The `torch.optim` submodule provides these optimizers as ready-made implementations, so we do not have to go through the hassle of updating the parameters ourselves.

In [16]:
import torch.optim as optim
dir(optim)

['ASGD',
 'Adadelta',
 'Adagrad',
 'Adam',
 'AdamW',
 'Adamax',
 'LBFGS',
 'NAdam',
 'Optimizer',
 'RAdam',
 'RMSprop',
 'Rprop',
 'SGD',
 'SparseAdam',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_functional',
 '_multi_tensor',
 'lr_scheduler',
 'swa_utils']

The above is the list of names exposed by `torch.optim`, which includes the optimization algorithms implemented in PyTorch.

Every optimizer constructor takes a list of parameters (PyTorch tensors, typically with requires_grad = True) as its first argument. The optimizer keeps a reference to these parameters, so that it can update their values and access their grad attributes.\
Each optimizer exposes two methods:
1. zero_grad : zeros the grad attribute of each of the parameters.
2. step : updates the values of the parameters according to the logic of that optimizer.
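
To make the two methods less of a black box, here is a stripped-down stand-in for `optim.SGD` without momentum or weight decay. The class `TinySGD`, the helper `one_step` and the toy loss are hypothetical names for this sketch. It is not how PyTorch implements its optimizers, only the same idea in a few lines.

```python
import torch
import torch.optim as optim

class TinySGD:
    """Hypothetical, minimal stand-in for optim.SGD (no momentum, no weight decay)."""

    def __init__(self, params, lr):
        self.params = list(params)   # keep references to the parameter tensors
        self.lr = lr

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

    def step(self):
        with torch.no_grad():        # the update itself must not be tracked
            for p in self.params:
                if p.grad is not None:
                    p -= self.lr * p.grad

def one_step(optimizer_cls):
    """Run one optimization step on a toy loss and return the new parameters."""
    params = torch.tensor([1.0, 0.0], requires_grad=True)
    optimizer = optimizer_cls([params], lr=1e-2)
    loss = ((params - torch.tensor([3.0, -1.0])) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return params.detach()

ours = one_step(TinySGD)
reference = one_step(optim.SGD)
print(ours, reference)
```

Both versions produce the same parameters after one step. One difference to be aware of: in recent PyTorch versions `zero_grad` may reset the gradients to `None` instead of filling them with zeros.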

In [17]:
# let's create a params tensor and instantiate an SGD optimizer
params = torch.tensor([1.0,0.0], requires_grad = True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr = learning_rate)  # Stochastic Gradient Descent

The above code instantiates an optimizer with a learning rate. This is vanilla gradient descent as long as the momentum parameter is left at 0.0, which is the default. The name stochastic gradient descent comes from the fact that the gradient is typically computed by averaging over a random subset of the samples, called a `minibatch`, rather than over the whole dataset. The optimizer itself does not know which of the two we did; here we pass in all the samples at every step.\
Let's use our optimizer.

In [18]:
print(f"params started with : {params}")
t_p = model(t_un, *params)       # calculate the predictions
loss_t = loss_func(t_p, t_c)      # Calculate the loss
loss_t.backward()                 # Backpropagate to get the gradients
optimizer.step()                # Update the parameters
print(f"params after one step of the optimizer: {params}")

params started with : tensor([1., 0.], requires_grad=True)
params after one step of the optimizer: tensor([1.1461, 0.2100], requires_grad=True)


What is happening here is that the parameters get updated without us performing the update ourselves. The optimizer looks at params.grad and updates params according to the update rule of stochastic gradient descent.\
But we cannot yet put this code in a training loop, because `we are not zeroing the gradients`: they would accumulate from one iteration to the next and gradient descent would not perform the correct update. 

In [19]:
# Now we would add our loop ready code to zero out the gradients.
print(f"params started with : {params}")
t_p = model(t_un, *params)       # calculate the predictions
loss_t = loss_func(t_p, t_c)      # Calculate the loss
optimizer.zero_grad()              # Zero out the gradients before every epoch
loss_t.backward()                 # Backpropagate to get the gradients
optimizer.step()                # Update the parameters
print(f"params after one step of the optimizer: {params}")

params started with : tensor([1.1461, 0.2100], requires_grad=True)
params after one step of the optimizer: tensor([1.2895, 0.4158], requires_grad=True)


In [20]:
# Now lets update our training loop accordingly
def training_loop(n_epochs, optimizer, params, t_u, t_c): # The new training loop would contain the optimizer instead of learning rate
    for epoch in range(1, n_epochs+1):
        # carry the forward pass
        loss_tensor = loss_func(model(t_u, *params), t_c)

        # zero out the gradients
        optimizer.zero_grad()

        # carry out the backward pass
        loss_tensor.backward()

        # Now do the parameter update.
        optimizer.step()
    
        
        # Print the loss after the epochs
        if epoch % 50 == 0:
            print(f" Epoch: {epoch} | Loss: {float(loss_tensor)} ")
    return params

In [21]:
params = torch.tensor([1.0,0.0], requires_grad= True)
learning_rate = 1e-2
optimizer_sgd = optim.SGD(lr = learning_rate, params = [params]) # do not forget to pass params inside a list (an iterable of tensors)
training_loop(500, optimizer_sgd, params, t_un, t_c)

 Epoch: 50 | Loss: 27.870668411254883 
 Epoch: 100 | Loss: 6.498063564300537 
 Epoch: 150 | Loss: 3.443054437637329 
 Epoch: 200 | Loss: 3.0026862621307373 
 Epoch: 250 | Loss: 2.938664436340332 
 Epoch: 300 | Loss: 2.9292776584625244 
 Epoch: 350 | Loss: 2.9278886318206787 
 Epoch: 400 | Loss: 2.9276816844940186 
 Epoch: 450 | Loss: 2.9276506900787354 
 Epoch: 500 | Loss: 2.9276463985443115 


tensor([ 9.0341, 10.4996], requires_grad=True)

##### Testing other optimizers
1. To test other optimizers we just instantiate a different optimizer class, for example `Adam`. Adam adapts the learning rate for each parameter and is much less sensitive to the scaling of the parameters, so much so that we can pass in the unscaled inputs and still get a good fit.

In [22]:
params = torch.tensor([1.0,0.0], requires_grad= True)
learning_rate = 1e-1
optimizer_adam = optim.Adam(lr = learning_rate, params = [params]) # do not forget to pass params inside a list (an iterable of tensors)
training_loop(5000, optimizer_adam, params, t_u, t_c)  # unnormalized inputs and Adam optimizer with a bigger learning rate.

 Epoch: 50 | Loss: 32.484798431396484 
 Epoch: 100 | Loss: 23.898942947387695 
 Epoch: 150 | Loss: 21.53290557861328 
 Epoch: 200 | Loss: 19.082260131835938 
 Epoch: 250 | Loss: 16.669700622558594 
 Epoch: 300 | Loss: 14.398392677307129 
 Epoch: 350 | Loss: 12.332925796508789 
 Epoch: 400 | Loss: 10.508145332336426 
 Epoch: 450 | Loss: 8.936264991760254 
 Epoch: 500 | Loss: 7.612900257110596 
 Epoch: 550 | Loss: 6.522254467010498 
 Epoch: 600 | Loss: 5.641384601593018 
 Epoch: 650 | Loss: 4.943643093109131 
 Epoch: 700 | Loss: 4.401312351226807 
 Epoch: 750 | Loss: 3.9875364303588867 
 Epoch: 800 | Loss: 3.6775741577148438 
 Epoch: 850 | Loss: 3.449573516845703 
 Epoch: 900 | Loss: 3.284883975982666 
 Epoch: 950 | Loss: 3.168062448501587 
 Epoch: 1000 | Loss: 3.086700439453125 
 Epoch: 1050 | Loss: 3.0310614109039307 
 Epoch: 1100 | Loss: 2.9937093257904053 
 Epoch: 1150 | Loss: 2.969097852706909 
 Epoch: 1200 | Loss: 2.9531853199005127 
 Epoch: 1250 | Loss: 2.9430925846099854 
 Epoch:

tensor([  0.5368, -17.3048], requires_grad=True)

#### Shuffling a Dataset : Dividing it into training and validation sets
To check whether the model fits or overfits the data, and to understand its generalizing capabilities (making correct predictions on data it has not seen before), we can divide the data into training and validation sets and see how the loss behaves on each of them.
1. If the loss goes down on the training set but not on the validation set, the model has started to overfit the data and is not generalizing to the validation set. We need a less complex model, or we need more data.
2. If the loss does not go down on the training set, the model is not able to fit the data (underfitting) and we need a more complex model that can approximate the data points more closely.
To divide the dataset into training and validation sets we need to shuffle the data and then split it.

In [23]:
# Shuffling the elements of a tensor amounts to finding a permutation of its indices. The `randperm` function is used for that.
n_samples = t_u.shape[0]
n_vals = int(0.2*n_samples)

shuffled_indices = torch.randperm(n_samples)
train_indices = shuffled_indices[: -n_vals]
val_indices = shuffled_indices[-n_vals:]

train_indices, val_indices

(tensor([ 7, 10,  5,  3,  9,  0,  2,  6,  4]), tensor([1, 8]))

We now have index tensors that we can use to build the training and validation sets from the data tensors.

In [24]:
train_t_u = t_u[train_indices]
val_t_u = t_u[val_indices]

train_t_c = t_c[train_indices]
val_t_c = t_c[val_indices]

train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u

Our training loop does not really change; we just want to evaluate the validation loss in addition to the training loss, so we only need to add the statements for the validation loss to the training loop.

In [28]:
def training_loop(n_epochs, optimizer, params, t_u, t_c, val_t_u, val_t_c):
    for epoch in range(n_epochs):
        # carry on the forward pass
        pred = model(t_u, *params)

        # perform the predictions on the validation set
        val_pred = model(val_t_u, *params)

        # calculate the loss
        loss = loss_func(pred, t_c)

        # calculate the validation set loss
        val_loss = loss_func(val_pred, val_t_c)

        # zero out the accumulated gradients
        optimizer.zero_grad()

        # perform the backward pass
        loss.backward()

        # perform the parameter update
        optimizer.step()

        # Print out the training and the validation losses
        if epoch %500 == 0:
            print(f"Epochs: {epoch} | Training Loss: {loss} | Validation Loss: {val_loss}")

params = torch.tensor([1.0,0.0], requires_grad = True)
learning_rate = 1e-2
optimizer_adam = optim.Adam(lr = learning_rate, params=[params])
training_loop(10000, optimizer_adam, params, train_t_un, train_t_c, val_t_un, val_t_c)



Epochs: 0 | Training Loss: 90.21489715576172 | Validation Loss: 36.03684616088867
Epochs: 500 | Training Loss: 26.91730499267578 | Validation Loss: 13.232043266296387
Epochs: 1000 | Training Loss: 16.059350967407227 | Validation Loss: 10.746343612670898
Epochs: 1500 | Training Loss: 9.059348106384277 | Validation Loss: 8.640144348144531
Epochs: 2000 | Training Loss: 5.197316646575928 | Validation Loss: 7.065887451171875
Epochs: 2500 | Training Loss: 3.412196397781372 | Validation Loss: 5.985672950744629
Epochs: 3000 | Training Loss: 2.7639241218566895 | Validation Loss: 5.3152642250061035
Epochs: 3500 | Training Loss: 2.5950632095336914 | Validation Loss: 4.953847408294678
Epochs: 4000 | Training Loss: 2.5668890476226807 | Validation Loss: 4.7947587966918945
Epochs: 4500 | Training Loss: 2.564258337020874 | Validation Loss: 4.741969108581543
Epochs: 5000 | Training Loss: 2.5641422271728516 | Validation Loss: 4.7299675941467285
Epochs: 5500 | Training Loss: 2.564138650894165 | Validatio

From the training output above we can infer a few things:
1. The validation loss decreases along with the training loss but levels off at a higher value. That is okay: our validation set contains only two samples, so its loss is a noisy estimate.
2. `The good thing happening here is that our validation loss is decreasing along with our training loss. This means that the model is learning general features of the data and is generalizing well.`
3. Ideally the validation loss and the training loss should stay as close to each other as possible. A growing gap between them is a sign that the model is starting to overfit.

#### Autograd: Finer points and Switching it off
1. If we calculate both the training loss and the validation loss, two separate computation graphs are generated and evaluated independently. Even though they use the same model and the same loss function, separate tensors are passed through them, which produces separate graphs.
2. The only tensor the two computation graphs have in common is params. As long as we do not call .backward() on the val_loss tensor, it does not accumulate gradients on the leaf nodes. If by mistake we did call .backward() on it, the gradients of val_loss w.r.t. params would be accumulated (added) in params.grad on top of the gradients from the training loss. `In that case we would effectively train our model on the whole dataset instead of just the training set, because the gradients would come from both the training and the validation sets.`
3. One more point of discussion: if we never call val_loss.backward(), then why are we building its computation graph at all? We could just evaluate the model and the loss as plain functions without tracking the computation. Tracking has an overhead, which grows when the model has millions of parameters, and it can be avoided.
4. To solve the problem described in point 3, PyTorch allows us to switch off autograd when we do not need it, using the `torch.no_grad` context manager. This does not give a significant boost on our small model, but it is very useful for large models.
5. We can check that point 4 works as intended by inspecting the `requires_grad` attribute of the `val_loss` tensor.

In [29]:
def training_loop(n_epochs, optimizer, params, t_u, t_c, val_t_u, val_t_c):
    for epoch in range(n_epochs):
        # carry on the forward pass
        pred = model(t_u, *params)

        

        # calculate the loss
        loss = loss_func(pred, t_c)

        

        with torch.no_grad():
            # perform the predictions on the validation set
            val_pred = model(val_t_u, *params)

            # calculate the validation set loss
            val_loss = loss_func(val_pred, val_t_c)

            assert val_loss.requires_grad == False


        # zero out the accumulated gradients
        optimizer.zero_grad()

        # perform the backward pass
        loss.backward()

        # perform the parameter update
        optimizer.step()

        # Print out the training and the validation losses
        if epoch %500 == 0:
            print(f"Epochs: {epoch} | Training Loss: {loss} | Validation Loss: {val_loss}")

params = torch.tensor([1.0,0.0], requires_grad = True)
learning_rate = 1e-2
optimizer_sgd = optim.SGD(lr = learning_rate, params=[params])
training_loop(10000, optimizer_sgd, params, train_t_un, train_t_c, val_t_un, val_t_c)

Epochs: 0 | Training Loss: 90.21489715576172 | Validation Loss: 36.03684616088867
Epochs: 500 | Training Loss: 6.590072154998779 | Validation Loss: 7.833559036254883
Epochs: 1000 | Training Loss: 3.094032049179077 | Validation Loss: 5.738351821899414
Epochs: 1500 | Training Loss: 2.633883237838745 | Validation Loss: 5.079338073730469
Epochs: 2000 | Training Loss: 2.573317289352417 | Validation Loss: 4.853560924530029
Epochs: 2500 | Training Loss: 2.565347194671631 | Validation Loss: 4.773406028747559
Epochs: 3000 | Training Loss: 2.564296007156372 | Validation Loss: 4.74454927444458
Epochs: 3500 | Training Loss: 2.5641589164733887 | Validation Loss: 4.734123229980469
Epochs: 4000 | Training Loss: 2.564141273498535 | Validation Loss: 4.730332374572754
Epochs: 4500 | Training Loss: 2.5641393661499023 | Validation Loss: 4.728978633880615
Epochs: 5000 | Training Loss: 2.5641379356384277 | Validation Loss: 4.728473663330078
Epochs: 5500 | Training Loss: 2.564138650894165 | Validation Loss: 

Because the code above, with the torch.no_grad context and the assert statement, ran without any errors, we can conclude that no computation graph was built for the validation part of the code.

` Note : We should not think that using torch.no_grad() necessarily implies that every tensor we get inside it has requires_grad set to False. There are particular instances where tensor.requires_grad stays True even in the no_grad() context. If you want to be completely sure, use detach() to detach the tensor entirely from the computation graph.  `
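
A small sketch of the cases mentioned in the note, with variable names of our own choosing:

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)

with torch.no_grad():
    derived = params * 2                              # result of an operation: not tracked
    same_leaf = params                                # existing tensor: its flag is unchanged
    fresh = torch.tensor([3.0], requires_grad=True)   # explicitly requested: still True

print(derived.requires_grad)     # False
print(same_leaf.requires_grad)   # True
print(fresh.requires_grad)       # True

# detach() gives a tensor that is cut off from the graph, inside or outside no_grad
detached = (params * 2).detach()
print(detached.requires_grad)    # False
```

So no_grad only affects the results of operations executed inside it. Tensors that already require gradients, or that are created with `requires_grad=True`, keep their flag.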

Using the related set_grad_enabled context manager we can also run code with autograd enabled or disabled according to a Boolean expression, typically one indicating whether we are in training or inference mode. For example, we could define a function that takes an is_train Boolean argument and runs the forward pass with or without autograd accordingly.

In [30]:
def cal_forward(t_u, t_c, params, is_train):
    with torch.set_grad_enabled(is_train):
        loss = loss_func(model(t_u, *params), t_c)
    return loss
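
A possible way to use such a function is sketched below. It repeats the definitions so that it runs on its own, and uses inputs scaled by 0.1 as in the train/validation split.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_un = 0.1 * t_u

def model(input, w, b):
    return w * input + b

def loss_func(predicted, actual):
    return ((predicted - actual) ** 2).mean()

def cal_forward(t_u, t_c, params, is_train):
    # autograd is enabled only when is_train is True
    with torch.set_grad_enabled(is_train):
        loss = loss_func(model(t_u, *params), t_c)
    return loss

params = torch.tensor([1.0, 0.0], requires_grad=True)

train_loss = cal_forward(t_un, t_c, params, is_train=True)
val_loss = cal_forward(t_un, t_c, params, is_train=False)

print(train_loss.requires_grad)   # True: a graph was recorded, backward() is possible
print(val_loss.requires_grad)     # False: plain computation, no graph
```

The two calls return the same loss value; they differ only in whether a computation graph is attached to the result.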

In the next chapter we will use a better model from the `torch.nn` module to train our neural network.