# a leaf Variable that requires grad is being used in an in-place operation

## Leaf Tensor

By convention, all Tensors that have `requires_grad` set to False are leaf Tensors. Tensors that have `requires_grad` set to True are leaf Tensors if they were created by the user (e.g. the weights of your neural network). This means that they are not the result of an operation, and so their `grad_fn` is None.

Basically, if `requires_grad` is False, the tensor is a leaf tensor. If `requires_grad` is True and the tensor was created by the user, it is also a leaf tensor.
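As a quick check of the rules above, the sketch below inspects `is_leaf` and `grad_fn` on a few tensors. The tensor names and values are made up for illustration.

```python
import torch

x = torch.ones(3)                      # requires_grad=False -> leaf by convention
w = torch.ones(3, requires_grad=True)  # created by the user -> leaf
y = w * 2                              # result of an operation on w -> NOT a leaf
z = x * 2                              # no gradient tracking involved -> still a leaf

for name, t in [("x", x), ("w", w), ("y", y), ("z", z)]:
    # Leaf tensors have grad_fn == None; y carries the node that produced it.
    print(name, "is_leaf:", t.is_leaf, "grad_fn:", t.grad_fn)
```

Only `y` reports `is_leaf` as False, because it is the only tensor that was produced by an operation on a tensor that requires grad.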

## In-Place Operation

An in-place operation updates the value of the same tensor object it is called on, instead of creating a new tensor.

In [1]:
import torch
import math
torch.manual_seed(0)

a = torch.randn( (), requires_grad=False)
initial_address = a.data_ptr()
a += 5  #in-place operation
print(initial_address == a.data_ptr())
a = a + 5 #out-of-place operation
print(initial_address == a.data_ptr())

True
False


So, `variable += any_thing` is in-place, but `variable = variable + any_thing` is NOT in-place: it creates a new tensor and rebinds the name to it.

Now, let’s see what happens when we change `requires_grad` to True when we initialize a.

In [2]:
import torch
import math
torch.manual_seed(0)

a = torch.randn( (), requires_grad=True)
initial_address = a.data_ptr()
a += 5  #in-place operation
print(initial_address == a.data_ptr())
a = a + 5 #out-of-place operation
print(initial_address == a.data_ptr())

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

## How to solve this issue?

You can wrap the update operations in a `with torch.no_grad():` block, which tells PyTorch not to track the operations performed inside it, so the in-place update is allowed.
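Applied to the failing cell above, a minimal sketch of the fix could look like this. The variable `address_before` is just an illustrative name.

```python
import torch

torch.manual_seed(0)

a = torch.randn((), requires_grad=True)
address_before = a.data_ptr()

# Inside no_grad() autograd does not record the operation, so the
# in-place update on a leaf tensor that requires grad is allowed.
with torch.no_grad():
    a += 5

# a is still the same leaf tensor, at the same memory address.
print(a.is_leaf, a.requires_grad, a.data_ptr() == address_before)
```

The tensor keeps `requires_grad=True` after the block, so it can still take part in later forward and backward passes.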

## Examples

In [3]:
import torch
import math
torch.manual_seed(0)
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For a third order polynomial, we need
# 4 weights: y = a + b x + c x^2 + d x^3
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y using operations on Tensors.
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss using operations on Tensors.
    # Now loss is a scalar (0-dimensional) Tensor of shape ()
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(id(a))
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call a.grad, b.grad, c.grad and d.grad will be Tensors holding
    # the gradient of the loss with respect to a, b, c, d respectively.
    loss.backward()
    # Manually update weights using gradient descent. This version does NOT
    # wrap the update in torch.no_grad(): each assignment rebinds the name to
    # a new, non-leaf tensor, which is what triggers the error shown below.

    a = a - learning_rate * a.grad
    b = b - learning_rate * b.grad
    c = c - learning_rate * c.grad
    d = d - learning_rate * d.grad
    # Manually zero the gradients after updating weights
    a.grad = None
    b.grad = None
    c.grad = None
    d.grad = None

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

  a = a - learning_rate * a.grad


TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'

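When you update a weight with `a = a - learning_rate * a.grad`, the name `a` is rebound to a NEW tensor object that is the result of a mathematical operation on your original tensor. That makes the new `a` an intermediate (non-leaf) tensor. The first iteration succeeds because `a` is still a leaf at that point; the error appears on the second iteration.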

For an intermediate tensor, PyTorch does not accumulate the gradient in the tensor's `.grad` attribute, as it would for a leaf tensor.

So, since the weight/parameter that you are updating is no longer a leaf tensor, its `.grad` is None after `loss.backward()`, and `learning_rate * a.grad` fails with the TypeError above.
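The same behaviour can be reproduced in a few lines. This is a reduced sketch with a made-up scalar weight and loss, not the polynomial example itself.

```python
import warnings
import torch

a = torch.tensor(1.0, requires_grad=True)

# First step: a is a leaf, so a.grad is populated by backward().
(a * 2).backward()
a = a - 0.1 * a.grad  # out-of-place update: the name a now points to a non-leaf

# Second step: backward() runs, but nothing is stored on the non-leaf a.
(a * 2).backward()
with warnings.catch_warnings():
    # PyTorch emits a warning when .grad of a non-leaf tensor is read.
    warnings.simplefilter("ignore")
    grad_after = a.grad

print(a.is_leaf, grad_after)
```

Multiplying a float by `grad_after` at this point would raise the same TypeError as in the cell above.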

### retain_grad

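`retain_grad()` tells autograd to keep the gradient of a non-leaf tensor in its `.grad` attribute instead of leaving it as None. It must be called before `backward()`; on a leaf tensor it has no effect.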
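Before the full example, here is a small sketch of what `retain_grad()` changes for a single intermediate tensor. The names `w`, `h` and `loss` are illustrative only.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # leaf
h = w * 3                                  # intermediate (non-leaf) tensor
h.retain_grad()                            # ask autograd to keep h.grad

loss = h ** 2
loss.backward()

# d(loss)/dh = 2 * h = 12, and d(loss)/dw = 2 * h * 3 = 36
print(h.grad, w.grad)
```

Without the `h.retain_grad()` call, `w.grad` would be the same but `h.grad` would stay None.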

In [4]:
import torch
import math
torch.manual_seed(0)
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For a third order polynomial, we need
# 4 weights: y = a + b x + c x^2 + d x^3
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y using operations on Tensors.
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss using operations on Tensors.
    # Now loss is a scalar (0-dimensional) Tensor of shape ()
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(id(a))
        print(t, loss.item())

    a.retain_grad()
    b.retain_grad()
    c.retain_grad()
    d.retain_grad()
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call a.grad, b.grad, c.grad and d.grad will be Tensors holding
    # the gradient of the loss with respect to a, b, c, d respectively.
    loss.backward()
    # Manually update weights using gradient descent. The updates are still
    # out-of-place (no torch.no_grad()), so each one creates a new non-leaf
    # tensor; retain_grad() above is what keeps .grad populated for it.
    
    a = a - learning_rate * a.grad
    b = b - learning_rate * b.grad
    c = c - learning_rate * c.grad
    d = d - learning_rate * d.grad
    # Manually zero the gradients after updating weights
    a.grad = None
    b.grad = None
    c.grad = None
    d.grad = None

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

140347014812352
99 3238.300537109375
140347015472064
199 2245.867431640625
140347015930816
299 1559.936279296875
140347018197504
399 1085.354248046875
140347014811264
499 756.6680908203125
140347014145024
599 528.8026733398438
140346854566720
699 370.6806335449219
140346853855488
799 260.8537292480469
140346853619136
899 184.50198364257812
140346853387200
999 131.3756866455078
140347015638656
1099 94.37876892089844
140347015706752
1199 68.59319305419922
140347015075456
1299 50.60741424560547
140347015536320
1399 38.052650451660156
140347014838720
1499 29.282575607299805
140347015457344
1599 23.151979446411133
140346851508096
1699 18.863677978515625
140346851321152
1799 15.862088203430176
140346851097088
1899 13.759873390197754
140346850856896
1999 12.286673545837402
Result: y = 0.058623336255550385 + 0.8372474908828735 x + -0.01011350192129612 x^2 + -0.09055762737989426 x^3


<b>This works, but it is really slow compared with doing the updates as `in-place` operations under `torch.no_grad()`</b>.

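Each update is an out-of-place operation, so every iteration creates new tensor objects holding the updated values (note how the printed `id(a)` keeps changing). Each new weight also stays linked to the previous one in the autograd graph, so the chain that `backward()` has to walk gets longer with every iteration.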
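One way to see this cost is to count how many autograd nodes sit between the current weight and the original leaf. The sketch below uses a made-up scalar weight and loss, and `update_chain_length` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import torch

def update_chain_length(t):
    """Count autograd nodes between t and the original leaf (illustrative helper)."""
    n = 0
    fn = t.grad_fn
    # Walk back along the first input of each node; the walk ends at the
    # leaf's gradient accumulator, which has no further inputs.
    while fn is not None and fn.next_functions:
        n += 1
        fn = fn.next_functions[0][0]
    return n

w = torch.tensor(1.0, requires_grad=True)
lengths = []
for _ in range(3):
    loss = (w * 3.0) ** 2
    w.retain_grad()          # needed once w is no longer a leaf
    loss.backward()
    w = w - 0.01 * w.grad    # out-of-place update: adds one more node
    lengths.append(update_chain_length(w))

print(lengths)
```

The count grows by one per update, which suggests why 2000 out-of-place updates take noticeably longer than 2000 in-place ones.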

### torch.no_grad()

In [5]:
import torch
import math
torch.manual_seed(0)
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For a third order polynomial, we need
# 4 weights: y = a + b x + c x^2 + d x^3
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y using operations on Tensors.
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss using operations on Tensors.
    # Now loss is a scalar (0-dimensional) Tensor of shape ()
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call a.grad, b.grad, c.grad and d.grad will be Tensors holding
    # the gradient of the loss with respect to a, b, c, d respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    #
    # Without torch.no_grad(), these in-place updates on leaf tensors would
    # raise the RuntimeError shown earlier:
    # a -= learning_rate * a.grad
    # b -= learning_rate * b.grad
    # c -= learning_rate * c.grad
    # d -= learning_rate * d.grad
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients after updating weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

99 3238.300537109375
199 2245.867431640625
299 1559.936279296875
399 1085.354248046875
499 756.6680908203125
599 528.8026733398438
699 370.6806335449219
799 260.8537292480469
899 184.50198364257812
999 131.3756866455078
1099 94.37876892089844
1199 68.59319305419922
1299 50.60741424560547
1399 38.052650451660156
1499 29.282575607299805
1599 23.151979446411133
1699 18.863677978515625
1799 15.862088203430176
1899 13.759873390197754
1999 12.286673545837402
Result: y = 0.058623336255550385 + 0.8372474908828735 x + -0.01011350192129612 x^2 + -0.09055762737989426 x^3


<b>This is much faster!</b>

With `torch.no_grad()`, the weight tensors you end up with are the very same objects you initialized, because you only performed `in-place` operations on them. They remain leaf tensors throughout, so their `.grad` is populated on every backward pass.

## References

[Understanding the Error:- A leaf Variable that requires grad is being used in an in-place operation.](https://medium.com/@mrityu.jha/understanding-the-grad-of-autograd-fc8d266fd6cf)