
Running out of memory during training #431

Closed
Kayne88 opened this issue Sep 6, 2022 · 13 comments

Labels: help wanted (Extra attention is needed)

Kayne88 commented Sep 6, 2022

When training with a custom eval metric (Pearson correlation), my Colab session runs out of memory after the first evaluation.

What is the current behavior?
Training of TabNetRegressor starts fine, but after the first evaluation round I run out of memory. I am training the model on a 16 GB GPU with approximately 40 GB of free RAM. RAM consumption steadily increases during training.
I am training on a fairly large dataset (11 GB).

Expected behavior

I would expect RAM consumption to stay more or less constant during training once the model is initialized.

Screenshots

import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetRegressor
from pytorch_tabnet.metrics import Metric

def corr_score(y_true, y_pred):
    # Flatten to 1-D before calling np.corrcoef: on (n, 1) column vectors,
    # corrcoef treats every row as its own variable and allocates a
    # (2n, 2n) matrix, which is the likely source of the RAM blow-up.
    return "score", np.corrcoef(y_true.ravel(), y_pred.ravel())[0, 1], True

class PearsonCorrMetric(Metric):
    def __init__(self):
        self._name = "pearson_corr"
        self._maximize = True

    def __call__(self, y_true, y_score):
        return corr_score(y_true, y_score)[1]

max_epochs = 2
batch_size = 1028
model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-2),
)

model.fit(
    X_train=factors_train[features].to_numpy(),
    y_train=factors_train.target.to_numpy().reshape((-1, 1)),
    eval_set=[(factors_test[features].to_numpy(),
               factors_test.target.to_numpy().reshape((-1, 1)))],
    eval_name=['test'],
    eval_metric=[PearsonCorrMetric],
    max_epochs=max_epochs,
    patience=5,
    batch_size=batch_size,
    virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
)

Other relevant information:
poetry version: ?
python version: 3.8
Operating System: Ubuntu
Additional tools:

Additional context

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Kayne88 added the bug (Something isn't working) label Sep 6, 2022
Optimox added the help wanted (Extra attention is needed) label and removed the bug (Something isn't working) label Sep 6, 2022
Optimox (Collaborator) commented Sep 6, 2022

Memory consumption should peak at the end of an epoch.

Do you manage to get the Pearson correlation score for the first epoch?

Does it work if you reduce the batch size?

Kayne88 (Author) commented Sep 6, 2022

> Do you manage to get the Pearson correlation score for the first epoch?

No, I can't see the evaluation of the first epoch.

> Does it work if you reduce the batch size?

I initially tried with a batch_size of 256; that also ran out of RAM.

Optimox (Collaborator) commented Sep 6, 2022

Is it GPU OOM or RAM OOM?

Kayne88 (Author) commented Sep 6, 2022

RAM OOM. Consumption basically jumps from 30 GB to over 52 GB.

Could this be related to the custom metric? Might it help if I implement the metric with torch rather than np, so it can use the GPU?

Optimox (Collaborator) commented Sep 6, 2022

> Could this be related to the custom metric?

I think it's unlikely, but you can try RMSE and see if it solves the problem.

What is the size of your train/test sets, in number of rows and columns?

Kayne88 (Author) commented Sep 6, 2022

TRAIN (1914562, 1214) - TEST (476390, 1214)
RMSE actually works :)

Optimox (Collaborator) commented Sep 6, 2022

I'd be happy to know if you get competitive results on your dataset with TabNet. Please leave a comment if you can :)

Kayne88 (Author) commented Sep 6, 2022

With pleasure. However, I first need to make the correlation metric work; RMSE is not appropriate for my problem. Also, I would eventually like to use a custom loss.

I can then give a comparison of my current CatBoost benchmark scores against TabNet.

Optimox (Collaborator) commented Sep 6, 2022

Can't you use a simple Pearson correlation?

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html

Kayne88 (Author) commented Sep 6, 2022

I tried sklearn's r2_score; also OOM. I suspect the problem is that during eval-metric calculation the model, tensors, and data are moved to the CPU. One option could be to explicitly transfer them to CUDA inside the metric calculation.

What is working for me now is the implementation here:
https://torchmetrics.readthedocs.io/en/stable/regression/pearson_corr_coef.html
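
For reference, a minimal sketch of how that torchmetrics implementation can be wrapped into a pytorch-tabnet Metric (the class name is mine; pytorch-tabnet hands the metric numpy arrays, here of shape (n, 1)):

import torch
from pytorch_tabnet.metrics import Metric
from torchmetrics.functional import pearson_corrcoef

class TorchPearsonCorrMetric(Metric):
    def __init__(self):
        self._name = "pearson_corr"
        self._maximize = True

    def __call__(self, y_true, y_score):
        # Flatten the (n, 1) arrays to 1-D tensors; pearson_corrcoef
        # expects (preds, target).
        preds = torch.from_numpy(y_score).float().flatten()
        target = torch.from_numpy(y_true).float().flatten()
        return pearson_corrcoef(preds, target).item()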

First runs look very promising: after only 15 epochs I come close to CatBoost performance (which is hyperparameter-optimized). The real comparison will come on the full validation set (separate from train and test), which is almost as large as the whole train set.

One drawback of TabNet is that hyperparameter optimization (with Optuna) will take a very long time even for 100 trials. I need to see how best to approach that.

I'll keep you updated.

PS: What I observe during training with a fixed LR is that the eval metric "oscillates" for 1-2 epochs and then makes a significant improvement. I am not very experienced with LR schedulers, but I decided to give OneCycleLR a try. Maybe it smooths the training.

Optimox (Collaborator) commented Sep 6, 2022

Yes, I would advise decaying the learning rate with OneCycleLR. This will make the model converge in fewer epochs.
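
A minimal sketch of wiring OneCycleLR into TabNetRegressor, assuming a pytorch-tabnet release that supports the is_batch_level flag in scheduler_params (the lr values and n_train_rows are illustrative):

import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetRegressor

max_epochs = 50
batch_size = 1024
# OneCycleLR needs the total number of optimizer steps in advance.
steps_per_epoch = int(np.ceil(n_train_rows / batch_size))  # n_train_rows: your train set size

model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
    scheduler_params=dict(
        is_batch_level=True,  # step the scheduler every batch rather than every epoch
        max_lr=2e-2,
        epochs=max_epochs,
        steps_per_epoch=steps_per_epoch,
    ),
)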

Thanks for the updates!

Kayne88 (Author) commented Sep 11, 2022

Here are some intermediate results and a comparison with the CatBoost benchmark. I've applied shallow hyperparameter optimization to TabNet. Things to note: the dataset has a very low signal-to-noise ratio; it comes from a financial context, where the target is some performance measure of an asset to be predicted. Adequate basic metrics for such a problem are different kinds of correlations.
The comparisons are done on a large validation set, which is almost the size of the train set. The task is regression.

CATBOOST

PREDS - pearson correlation 0.031141676801666244 - feature neutral correlation 0.02642358893294882
PREDS NEUTRALIZED - pearson correlation 0.028844560064221897 - feature neutral correlation 0.026562891162170366

TABNET

PREDS - pearson correlation 0.02533170902252626 - spearman corr 0.02516739791397788 - fnc 0.021378226358012287
PREDS NEUTRALIZED - pearson correlation 0.021450115592071817 - spearman corr 0.020884041941114838 - fnc 0.020596744887857364

We can see that the metrics fall off by quite some margin; however, TabNet achieves the best performance among the other deep learning architectures I tried (TabTransformer, ResNet). Another thing to note is that the Pearson correlation between the CatBoost predictions and the TabNet predictions is roughly 0.66, which is not tremendously high. So it seems that TabNet learns a different signal than CatBoost.

Current flaws:

  • I don't understand in detail how TabNet learns the feature masking yet, but it seems quite general. A characteristic of the dataset is that the correlations between the features change by quite some degree from batch to batch. What I would like to achieve is that the masking is learned based on these correlations. Roughly speaking, features with a high average correlation to the other features are masked out, whereas features with a low average correlation to the other features are used for predictions. This would be a major advantage over CatBoost and might improve the performance drastically (a rough pre-filtering sketch follows below).
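
A rough sketch of what such a correlation-based pre-filter could look like as a preprocessing step (the function name and the 0.5 threshold are illustrative, not from the thread):

import numpy as np

def low_avg_corr_features(X, threshold=0.5):
    # Average absolute correlation of each feature with all other features.
    corr = np.abs(np.corrcoef(X, rowvar=False))  # shape (n_features, n_features)
    np.fill_diagonal(corr, 0.0)
    avg_corr = corr.mean(axis=1)
    # Keep only the features whose average correlation stays below the threshold.
    return np.where(avg_corr < threshold)[0]

keep_idx = low_avg_corr_features(factors_train[features].to_numpy())
X_train_filtered = factors_train[features].to_numpy()[:, keep_idx]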

Current hyperparam grid:

param_grid = {
    "optimizer_fn": torch.optim.AdamW,
    "optimizer_params": dict(lr=0.017),
    "scheduler_fn": torch.optim.lr_scheduler.CosineAnnealingWarmRestarts,
    "scheduler_params": dict(T_0=200, T_mult=1, eta_min=1e-4, last_epoch=-1, verbose=False),
    "n_d": 8,
    "n_a": 8,
    "n_steps": 7,
    "gamma": 2.0,
    "n_independent": 4,
    "n_shared": 3,
    "momentum": 0.17,
    "lambda_sparse": 0,
    "verbose": 1,
    "mask_type": "entmax",
}
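
Presumably this grid is unpacked straight into the constructor, i.e. (a usage sketch, not the thread's verbatim code):

model = TabNetRegressor(**param_grid)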

@Optimox

Optimox (Collaborator) commented Sep 12, 2022

@Kayne88 thank you very much for sharing your results.

The model learns to pay attention to specific features in order to minimize the loss function. Some features might end up masked out if they correlate too much with a better feature; however, you have no guarantee that this will be the case. You could simply remove those features before training.

However, you can play with the hyperparameters to get closer to what you want:

  • lambda_sparse: the bigger this is, the sparser your masks will be. So setting it to a value > 0 might ensure that the model won't look at two correlated features.
  • gamma: a large gamma (I'd recommend keeping gamma between 1 and 5 at most) will forbid the model from reusing the same features at different steps. So if you don't want weakly correlated features to be used by the model, you can set a high gamma.
  • n_steps: the more steps, the more features your model will be able to pick at some point.

All these recommendations come with no guarantee of working; this is just my general understanding, so you should experiment with them and see how it goes. A sketch of a sparsity-oriented configuration along these lines follows below.
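
A sketch of such a sparsity-oriented configuration (the values are illustrative, not tuned):

from pytorch_tabnet.tab_model import TabNetRegressor

model = TabNetRegressor(
    lambda_sparse=1e-3,  # > 0 encourages sparser feature masks
    gamma=1.5,           # discourages reusing the same features across steps
    n_steps=5,           # more steps, more chances to attend to different features
    mask_type="entmax",
)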

Good luck!

Optimox closed this as completed Sep 20, 2022