Completely different execution results with CPU or GPU in Google Colab #345

Closed
SuperbTUM opened this issue Dec 14, 2021 · 2 comments
Labels: help wanted (Extra attention is needed)

@SuperbTUM commented Dec 14, 2021

Describe the bug

I am using TabNetClassifier for classification tasks in Google Colab. Something goes wrong when I switch from CPU to GPU. The code itself runs without errors, and when executing on GPU the program reports that CUDA is detected, so I assume it is actually running on the GPU. However, the intermediate training output is completely different from the CPU run. I did not modify anything; I just switched the device from CPU to GPU and ran it again. I am attaching the code here, but I can't paste everything since I am working on a project.

import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier


def train(five_fold_data, X_pretrain):
    # Self-supervised pretraining on the unlabeled features
    unsupervised_model = TabNetPretrainer(
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        mask_type='entmax'  # "sparsemax"
    )
    boundary = int(0.8 * len(X_pretrain))
    unsupervised_model.fit(
        X_train=X_pretrain[:boundary],
        eval_set=[X_pretrain[boundary:]],
        pretraining_ratio=0.8,
    )

    clf = TabNetClassifier(
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        scheduler_params={"step_size": 10,  # how to use learning rate scheduler
                          "gamma": 0.9},
        scheduler_fn=torch.optim.lr_scheduler.StepLR,
        mask_type='entmax'  # This will be overwritten if using pretrain model
    )

    # Five-fold cross-validation, warm-starting from the pretrained model
    model_sets = list()
    for i, [(x_train, y_train), (x_eval, y_eval)] in enumerate(five_fold_data):
        clf.fit(
            X_train=x_train, y_train=y_train,
            eval_set=[(x_eval, y_eval)],
            eval_name=['valid'],
            eval_metric=['auc', 'accuracy'],
            max_epochs=120,
            num_workers=2,
            loss_fn=None,
            from_unsupervised=unsupervised_model
        )
        model_sets.append(clf)
    return model_sets

What is the current behavior?
The CPU version returns satisfactory results, whereas the GPU version does not.

If the current behavior is a bug, please provide the steps to reproduce.

I am not sure whether it is a bug, but I am willing to share the entire code with you privately if you want.

Expected behavior

I expected both versions to return exactly identical results.

Screenshots

Other relevant information:
poetry version: NA
python version: 3.7
Operating System: Linux
Additional tools: NA

Additional context

SuperbTUM added the bug (Something isn't working) label on Dec 14, 2021
Optimox added the help wanted (Extra attention is needed) label and removed the bug (Something isn't working) label on Dec 15, 2021
@Optimox (Collaborator) commented Dec 15, 2021

Hello @SuperbTUM,

  • About CPU/GPU differences: pytorch-tabnet performs reproducible training, i.e. with the same configuration and random seed you should end up with the same results. However, switching from CPU to GPU changes the configuration, so you cannot reproduce exactly the same run on CPU and GPU even with the same seed. This is not a limitation of pytorch-tabnet but of PyTorch, and more generally of GPU computing. Nothing can be done about this, but your results should be SIMILAR. If you see a huge difference, it means your problem is very sensitive to the random seed; try changing the seed on CPU and see whether you observe the same behavior (see the first sketch after this list).
  • The same CPU/GPU seed issue exists for the pretrainer. A poor pretrained model can cause a large difference in the end, so you should also try your pipeline without pretraining, on both CPU and GPU, and see if the results are very different.
  • Your for loop is also leaky: the current latest version of pytorch-tabnet does not follow the warm-start policy of scikit-learn (the fix is here: feat: add warm_start matching scikit-learn #340, coming in the next release). Since clf is defined outside your for loop, you are essentially fine-tuning the same model on different splits, which means the model has already been trained on the current validation fold. This leads to overconfident CV results; please define the model inside the for loop to avoid this (see the second sketch after this list).
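To illustrate the seed-sensitivity check, here is a minimal sketch (the helper name seed_sensitivity_check and the hyperparameters are placeholders, not part of your pipeline). It retrains a fresh TabNetClassifier on CPU with a few different seeds and compares the best validation AUC taken from clf.history:

import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

def seed_sensitivity_check(x_train, y_train, x_eval, y_eval, seeds=(0, 1, 2)):
    """Retrain on CPU with several seeds and report the best validation AUC."""
    scores = {}
    for seed in seeds:
        # Fix the global seeds as well as TabNet's own `seed` argument.
        np.random.seed(seed)
        torch.manual_seed(seed)
        clf = TabNetClassifier(
            optimizer_fn=torch.optim.Adam,
            optimizer_params=dict(lr=2e-2),
            mask_type='entmax',
            seed=seed,            # TabNet's random seed
            device_name='cpu',    # force CPU so only the seed changes
        )
        clf.fit(
            X_train=x_train, y_train=y_train,
            eval_set=[(x_eval, y_eval)],
            eval_name=['valid'],
            eval_metric=['auc'],
            max_epochs=50,
        )
        scores[seed] = max(clf.history['valid_auc'])
    return scores

If the scores already vary a lot across seeds on CPU alone, the CPU/GPU gap is most likely plain seed sensitivity rather than a bug.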
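And here is a sketch of the same cross-validation loop without the warm-start leak: the only structural change compared to the code in the issue is that TabNetClassifier is created inside the fold loop, so each fold starts from the pretrained weights only (train_cv is a hypothetical name for the reworked function):

import torch
from pytorch_tabnet.tab_model import TabNetClassifier

def train_cv(five_fold_data, unsupervised_model):
    model_sets = []
    for (x_train, y_train), (x_eval, y_eval) in five_fold_data:
        # A fresh classifier per fold: no state carried over from previous folds.
        clf = TabNetClassifier(
            optimizer_fn=torch.optim.Adam,
            optimizer_params=dict(lr=2e-2),
            scheduler_params={"step_size": 10, "gamma": 0.9},
            scheduler_fn=torch.optim.lr_scheduler.StepLR,
            mask_type='entmax',
        )
        clf.fit(
            X_train=x_train, y_train=y_train,
            eval_set=[(x_eval, y_eval)],
            eval_name=['valid'],
            eval_metric=['auc', 'accuracy'],
            max_epochs=120,
            num_workers=2,
            from_unsupervised=unsupervised_model,  # warm-start from pretraining only
        )
        model_sets.append(clf)
    return model_sets

Each fold then reports an honest validation score, because no fold's model has seen its own validation data during training.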

Let me know if this makes things clearer.

Best

@Optimox (Collaborator) commented Dec 20, 2021

@SuperbTUM,

I'm temporarily closing this as there's no way to reproduce your reported problem and we did not hear back from you.
Please feel free to reopen once you have more information to share.
