Completely different execution results with CPU or GPU in Google Colab #345

Closed
SuperbTUM opened this issue Dec 14, 2021 · 2 comments
Labels: help wanted (Extra attention is needed)

@SuperbTUM commented Dec 14, 2021

Describe the bug

I am using TabNetClassifier for classification tasks in Google Colab. Something goes wrong when I switch from CPU to GPU. The code itself runs without errors, and when executing on GPU the program reports that CUDA is detected, so I assume it is actually running on the GPU. However, the intermediate training output is completely different from the CPU run. I did not modify anything; I just switched the device from CPU to GPU and ran it again. I am attaching the code here, but I can't paste everything since I am working on a project.

import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier


def train(five_fold_data, X_pretrain):
    # Self-supervised pretraining on the unlabeled features
    unsupervised_model = TabNetPretrainer(
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        mask_type='entmax'  # "sparsemax"
    )
    boundary = int(0.8 * len(X_pretrain))
    unsupervised_model.fit(
        X_train=X_pretrain[:boundary],
        eval_set=[X_pretrain[boundary:]],
        pretraining_ratio=0.8,
    )

    clf = TabNetClassifier(
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        scheduler_params={"step_size": 10,  # how to use learning rate scheduler
                          "gamma": 0.9},
        scheduler_fn=torch.optim.lr_scheduler.StepLR,
        mask_type='entmax'  # This will be overwritten if using pretrain model
    )

    # Five-fold cross-validation, warm-starting from the pretrained model
    model_sets = list()
    for i, [(x_train, y_train), (x_eval, y_eval)] in enumerate(five_fold_data):
        clf.fit(
            X_train=x_train, y_train=y_train,
            eval_set=[(x_eval, y_eval)],
            eval_name=['valid'],
            eval_metric=['auc', 'accuracy'],
            max_epochs=120,
            num_workers=2,
            loss_fn=None,
            from_unsupervised=unsupervised_model
        )
        model_sets.append(clf)
    return model_sets

What is the current behavior?
The CPU version returns satisfactory results, whereas the GPU version does not.

If the current behavior is a bug, please provide the steps to reproduce.

I am not sure whether it is a bug, but I am willing to share the entire code with you privately if you want.

Expected behavior

I expected both versions to return exactly identical results.

Screenshots

Other relevant information:
poetry version: NA
python version: 3.7
Operating System: Linux
Additional tools: NA

Additional context

SuperbTUM added the bug (Something isn't working) label on Dec 14, 2021
Optimox added the help wanted (Extra attention is needed) label and removed the bug (Something isn't working) label on Dec 15, 2021
@Optimox (Collaborator) commented Dec 15, 2021

Hello @SuperbTUM,

  • About CPU/GPU differences: pytorch-tabnet performs reproducible training, i.e. with the same configuration and random seed you should end up with the same results. However, switching from CPU to GPU changes the configuration, so you cannot reproduce exactly the same run on CPU and GPU even with the same seed. This is not a limitation of pytorch-tabnet but of PyTorch, and more generally of GPU computing. Nothing can be done about this, but your results should be SIMILAR. If you see a huge difference, it means your problem is very sensitive to the random seed; try changing the seed on CPU and see whether you observe the same behavior (see the first sketch after this list).
  • The same CPU/GPU seed issue exists for the pretrainer. A poor pretrained model can cause a large difference in the end, so you should also try your pipeline without pretraining, on both CPU and GPU, and see if the results are very different.
  • Your for loop is also leaky: the current latest version of pytorch-tabnet does not follow the warm-start policy of scikit-learn (the fix is here: feat: add warm_start matching scikit-learn #340, coming in the next release). Since clf is defined outside your for loop, you are essentially fine-tuning the same model on different splits, which means the model has already been trained on the current validation fold. This leads to overconfident CV results; please define the model inside the for loop to avoid this (see the second sketch after this list).
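To illustrate the seed-sensitivity check, here is a minimal sketch (the helper name seed_sensitivity_check and the hyperparameters are placeholders, not part of your pipeline). It retrains a fresh TabNetClassifier on CPU with a few different seeds and compares the best validation AUC taken from clf.history:

import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

def seed_sensitivity_check(x_train, y_train, x_eval, y_eval, seeds=(0, 1, 2)):
    """Retrain on CPU with several seeds and report the best validation AUC."""
    scores = {}
    for seed in seeds:
        # Fix the global seeds as well as TabNet's own `seed` argument.
        np.random.seed(seed)
        torch.manual_seed(seed)
        clf = TabNetClassifier(
            optimizer_fn=torch.optim.Adam,
            optimizer_params=dict(lr=2e-2),
            mask_type='entmax',
            seed=seed,            # TabNet's random seed
            device_name='cpu',    # force CPU so only the seed changes
        )
        clf.fit(
            X_train=x_train, y_train=y_train,
            eval_set=[(x_eval, y_eval)],
            eval_name=['valid'],
            eval_metric=['auc'],
            max_epochs=50,
        )
        scores[seed] = max(clf.history['valid_auc'])
    return scores

If the scores already vary a lot across seeds on CPU alone, the CPU/GPU gap is most likely plain seed sensitivity rather than a bug.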
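And here is a sketch of the same cross-validation loop without the warm-start leak: the only structural change compared to the code in the issue is that TabNetClassifier is created inside the fold loop, so each fold starts from the pretrained weights only (train_cv is a hypothetical name for the reworked function):

import torch
from pytorch_tabnet.tab_model import TabNetClassifier

def train_cv(five_fold_data, unsupervised_model):
    model_sets = []
    for (x_train, y_train), (x_eval, y_eval) in five_fold_data:
        # A fresh classifier per fold: no state carried over from previous folds.
        clf = TabNetClassifier(
            optimizer_fn=torch.optim.Adam,
            optimizer_params=dict(lr=2e-2),
            scheduler_params={"step_size": 10, "gamma": 0.9},
            scheduler_fn=torch.optim.lr_scheduler.StepLR,
            mask_type='entmax',
        )
        clf.fit(
            X_train=x_train, y_train=y_train,
            eval_set=[(x_eval, y_eval)],
            eval_name=['valid'],
            eval_metric=['auc', 'accuracy'],
            max_epochs=120,
            num_workers=2,
            from_unsupervised=unsupervised_model,  # warm-start from pretraining only
        )
        model_sets.append(clf)
    return model_sets

Each fold then reports an honest validation score, because no fold's model has seen its own validation data during training.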

Let me know if this makes things clearer.

Best

@Optimox (Collaborator) commented Dec 20, 2021

@SuperbTUM,

I'm temporarily closing this as there's no way to reproduce your reported problem and we did not hear back from you.
Please feel free to reopen once you have more information to share.
