Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug report] SRL training mechanism is wrong: validation set is used to update weights during training. #48

Open
ncble opened this issue May 28, 2019 · 2 comments

Comments

@ncble
Copy link
Collaborator

ncble commented May 28, 2019

Describe the bug
I hope I'm wrong but it seems that the current training mechanism uses the validation set to update the weights during training which is unwanted. The problem comes from the fact that we calculate the gradient by loss.backward() in both training and validation.

The curring training mechanism can be reduce to the following pseudo code (in srl-zoo/models/learner.py)

        for sample in dataloader:
            if sample in train_set: ## training mode
                model.train()
            else: ## validation mode
                model.eval()
            optimizer.zero_grad()
            Y_pred = model(X)
            loss = compute_loss(Y_pred, Y)
            loss.backward() # <-- [Wrong] We backpropagate the gradient in both train/valid mode
            if sample in train_set:
                optimizer.step()
                loss = loss.item()
            else:
                # We don't update the weights at this iteration, but the gradients of loss on validation
                # samples are calculated and stored and will be used the next time we call optimizer.step()
                loss = loss.item()

The common way to validate a model in Pytorch should look like the following:

        for epoch in range(epochs):
            model.train()
            for sample in dataloader_train:
                optimizer.zero_grad()
                loss = compute_loss(...)
                loss.backward()
                loss = loss.item() ## release tensor
            model.eval()
            with torch.no_grad(): ## It mandatory to call both model.eval() and torch.no_grad()
                for sample in dataloader_valid:
                    loss = compute_loss(...)
                    loss = loss.item() ## release tensor

But in the toolbox, with torch.no_grad() is not called during validation ! Besides, there are the other downsides when calling loss.backward() in the validation mode. Not only it's wrong but also it's a waste of time to do backpropagation when we don't need the gradient.

Code example

I write a code to mimic the training of srl-zoo, and demonstrate that the current training mechanism will use the gradient of loss on the validation data to update the model weights.

import numpy as np 
from time import time
import torch
try:
    from torchsummary import summary ## pip install torchsummary
except:
    pass
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # self.layer = nn.Linear(10, 1)
        self.model = nn.Sequential(
            nn.Linear(10, 1)
        )
    def forward(self, x):
        # x = self.layer(x)
        x = self.model(x)
        return x


if __name__=="__main__":
    print("Start")
    model = MyModel()
    try:
        summary(model, (10,))
    except:
        pass
    CASE = 0
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    X = np.random.rand(100, 10).astype(np.float32)
    X = torch.tensor(X).to(device)
    # Y = np.random.rand(100, 1).astype(np.float32)  # torch.ones((100, 1))
    # Y = torch.tensor(Y).to(device)
    Y = torch.ones((100, 1)).to(device)
    Z = torch.zeros((100, 1)).to(device)
    
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.5, 0.9))
    model.to(device)
    loss = 0
    val_loss = 0
    for ind in range(10000):
        
        model.train()
        optimizer.zero_grad()
        X_pred = model(X)
        loss = nn.MSELoss()(X_pred, Y)
        loss.backward()
        optimizer.step()
        loss = loss.item()

        if CASE == 0:
            ## No validation
            pass
        else:
            model.eval()
            optimizer.zero_grad()
            Z_pred = model(X)
            loss = nn.MSELoss()(Z_pred, Z)
            loss.backward() # <--  It's wrong !
            # no optimizer.step() here ! 
            val_loss = loss.item()
        print("\r Iter {} Train Loss: {:.10f} | Val Loss: {:.10f}".format(ind+1, loss, val_loss), end='')
    print()

By switching between CASE = 0 (no validation) or CASE = 1 (with wrong validation mechanism), you will see one linear layer is sufficient to learn the task (CASE = 0 e.g. Train Loss: 0.0000001010). However, in the CASE = 1, the gradient of validation set will affect the training and the model will not converge (e.g. Train Loss: 1.0033100843 | Val Loss: 1.0033100843).

SOLUTION
It's easy, just remove loss.backward() in the validation mode and add torch.set_grad_enabled(False/True) at the beginning/end of the validation. This will also provide around 5-10% speed-up depending on the model.

@araffin
Copy link
Owner

araffin commented May 28, 2019

Hello,
The gradients are computed but no gradient step is taken: optimizer.zero_grad() is called before it happens.
However, I agreee that having two dataloaders would be cleaner.

@ncble
Copy link
Collaborator Author

ncble commented May 31, 2019

I'm working on a pull request for all the issues I posted. This problem (two dataloaders) has been addressed, although I need more time to make sure everything is fine before I pull request. For the impatient people, you can take a look at my fork: https://github.com/ncble/srl-zoo/tree/adv_srl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants