
Hyperparameter Tuning #382

Closed
balazsgonczy opened this issue Apr 7, 2022 · 14 comments
@balazsgonczy

Hi,

I would like to know whether it's worth fine-tuning the hyperparameters of TabNet for a binary classification task.
And if it is, which approach would you suggest taking?

Best,

Balázs

@balazsgonczy balazsgonczy added the enhancement New feature or request label Apr 7, 2022
@Optimox
Collaborator

Optimox commented Apr 8, 2022

Hello @balazsgonczy,

Sure, it's worth trying to tune the parameters.

There are plenty of examples out there on how to tune TabNet.

Here is just one example with Optuna:
https://www.kaggle.com/code/neilgibbons/tuning-tabnet-with-optuna/notebook
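
For reference, here is a minimal sketch of what such an Optuna search can look like for binary classification (the search ranges, the `X_train`/`y_train`/`X_valid`/`y_valid` arrays, and the trial count are illustrative placeholders, not taken from the notebook above):

```python
# Minimal sketch: tuning TabNet with Optuna for binary classification.
# X_train, y_train, X_valid, y_valid are assumed to be numpy arrays you already have.
import optuna
import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.metrics import roc_auc_score

def objective(trial):
    clf = TabNetClassifier(
        n_d=trial.suggest_int("n_d", 8, 64),
        n_steps=trial.suggest_int("n_steps", 3, 7),
        gamma=trial.suggest_float("gamma", 1.0, 2.0),
        lambda_sparse=trial.suggest_float("lambda_sparse", 1e-6, 1e-3, log=True),
        mask_type=trial.suggest_categorical("mask_type", ["sparsemax", "entmax"]),
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        verbose=0,
    )
    clf.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        eval_metric=["auc"],
        max_epochs=100,
        patience=20,
    )
    # Maximize validation AUC.
    return roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best parameters:", study.best_params)
```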

@balazsgonczy
Author

There are these 2 parameters called "cat_idxs" and "cat_dims". What do they represent? The documentation is not entirely clear to me. I have categorical variables, but I haven't specified these parameters and the model still performs nicely. Is this an issue? (I am doing binary classification.)

@Optimox
Collaborator

Optimox commented Apr 8, 2022

Those parameters are used for categorical embeddings, but you can't tune them (they are fixed parameters that depend on your dataset). If you don't specify them, the model won't create embeddings and will treat categorical variables as numerical.

If you specify them, then `cat_emb_dim` becomes a tunable parameter (1 should be fine if you don't have a huge number of categories).

You can have a look at how those parameters are used in the example notebooks of the repo.
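
For instance, here is a minimal sketch of how these parameters are typically derived from a pandas DataFrame (the column names and values are hypothetical):

```python
# Sketch: deriving cat_idxs / cat_dims from a hypothetical, already label-encoded DataFrame.
import pandas as pd
from pytorch_tabnet.tab_model import TabNetClassifier

df = pd.DataFrame({
    "age":    [25, 32, 47, 51],          # numerical
    "city":   [0, 1, 2, 1],              # categorical, label-encoded
    "income": [30e3, 45e3, 60e3, 52e3],  # numerical
})
categorical_columns = ["city"]
cat_idxs = [df.columns.get_loc(c) for c in categorical_columns]  # column positions -> [1]
cat_dims = [int(df[c].nunique()) for c in categorical_columns]   # unique values    -> [3]

clf = TabNetClassifier(cat_idxs=cat_idxs, cat_dims=cat_dims, cat_emb_dim=1)
```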

@balazsgonczy
Author

So let me ask it like this:

I have a table with categorical features in the 1st and 3rd columns. The 1st feature has 3 levels (0, 1, 2), and the 3rd one has 2 levels (0, 1).

- cat_idxs -> [1, 3]
- cat_dims -> [3, 2]
- cat_emb_dim -> [3, 2] # Here I am just guessing; I need further explanation on how to choose the cat_emb_dim list items.

Am I understanding the parameters correctly?

@Optimox
Collaborator

Optimox commented Apr 9, 2022

Yes, you are right, except that indexes in Python start at 0, so it should be [0, 2], not [1, 3].

For the embedding dimensions, with low modalities like this you can set them all to 1.
Embeddings can follow the rule of thumb n_emb = log(n_mod). You can search the internet for what the best dimensions are, but in my experience, below 10 or 50 modalities you can just set the embeddings to 1 or 2 and you'll get full power.
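
Concretely, for the table described above, that gives (a sketch, assuming both features are already label-encoded as integers starting at 0):

```python
import math
from pytorch_tabnet.tab_model import TabNetClassifier

cat_idxs = [0, 2]  # the 1st and 3rd columns, 0-indexed
cat_dims = [3, 2]  # 3 and 2 unique values respectively
# Rule of thumb n_emb ~ log(n_mod); with this few modalities, 1 per feature is enough.
cat_emb_dim = [max(1, round(math.log(d))) for d in cat_dims]  # -> [1, 1]

clf = TabNetClassifier(cat_idxs=cat_idxs, cat_dims=cat_dims, cat_emb_dim=cat_emb_dim)
```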

@balazsgonczy
Author

Last question: what do you mean by modality here? Do you mean the type of data source in the table, like images, text, etc.? Sorry, as it stands it tells me nothing. :D

@balazsgonczy
Author

I think I'll drop these, because I have already done the encoding and also had to log-scale a few of the variables because they are not normally distributed. But thank you very much! If you have time, please answer my last question, and then you can close this thread!
("What do you mean by modality here?")

@Optimox
Collaborator

Optimox commented Apr 10, 2022

What I mean by modalities is the number of unique values in one categorical column.
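
In pandas terms (reusing the hypothetical `df` from the sketch above):

```python
# "Modalities" = number of unique values in one categorical column.
n_modalities = df["city"].nunique()  # -> 3 for a column with values {0, 1, 2}
```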

@balazsgonczy
Author

Oh, so you meant the size of the set of unique values. Thanks! :)

Sorry, one more question:

I have run my Optuna fine-tuning algorithm and it returned something like this:

```
Best parameters: {'mask_type': 'entmax', 'n_d': 60, 'n_steps': 7, 'gamma': 1.0, 'n_independent': 2, 'n_shared': 5, 'momentum': 0.35000000000000003, 'lambda_sparse': 3.907960748444502e-06, 'optimizer_fn': , 'patienceScheduler': 9, 'patience': 25}
```

My question is: where should I put the patienceScheduler and patience parameters during model initialization? I think the latter goes somewhere here:

```python
TabNetClassifier(...
    scheduler_params=dict(mode="min", patience=25),
...)
```

But where does the patienceScheduler parameter go?

@Optimox
Collaborator

Optimox commented Apr 11, 2022

It does not make sense to me to try to optimize patience.

Patience is there to save you some time: if an experiment does not seem to be getting any better after 5 or 10 epochs (out of a total of 50 or 100), just move on to another experiment and don't waste your time on this one.

Trying to optimize patience completely defeats this purpose, so just have a look at a few training logs and decide whether, after 5 (or 10, or 50, whatever) epochs with no improvement, it's worth training any longer.

Moreover, there is the main patience, which performs early stopping (saving you time), and there is a patience parameter in some learning rate schedulers (which lowers the learning rate when things are not improving). You need the scheduler's patience to be lower than the early stopping patience, otherwise you'll never decay your learning rate at all.
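
In code, that means the two patiences live in different places; a sketch assuming a `ReduceLROnPlateau` scheduler (which is what a `mode="min"`/`patience` pair in `scheduler_params` suggests), reusing the placeholder arrays from the earlier sketches:

```python
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

clf = TabNetClassifier(
    scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,
    # The scheduler's own patience ("patienceScheduler"): decay the LR
    # after 9 epochs without improvement.
    scheduler_params=dict(mode="min", patience=9, factor=0.5),
)
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    max_epochs=100,
    patience=25,  # early-stopping patience: must be larger than the scheduler's
)
```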

Please also note that you can perfectly well train a TabNet model without either of those two patience parameters: use a OneCycle learning rate scheduler (like in this notebook: https://www.kaggle.com/code/optimo/the-beauty-of-tabnet-a-simple-baseline) and disable early stopping by setting patience to -1. The only important parameters then become the number of training epochs and the learning rate: try to minimize the number of epochs so you can run as many experiments as possible, as quickly as possible.
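
A sketch of that setup, using PyTorch's `OneCycleLR` (the numbers are placeholders, and this assumes a pytorch-tabnet version that supports the `is_batch_level` flag in `scheduler_params`):

```python
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

batch_size, max_epochs = 1024, 50
clf = TabNetClassifier(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
    scheduler_params=dict(
        max_lr=5e-2,
        epochs=max_epochs,
        steps_per_epoch=len(X_train) // batch_size + 1,
        is_batch_level=True,  # step the scheduler after every batch
    ),
)
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    max_epochs=max_epochs,
    batch_size=batch_size,
    patience=-1,  # disable early stopping, as suggested above
)
```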

In the end, I think you won't gain much from blind hyperparameter optimization; you need to understand what each parameter does and get a feeling for what's happening before doing any random grid search properly.

For example, without knowing your dataset or your problem, I'm almost certain that (though I might be wrong on your specific case):

  • n_shared is too big in your experiment
  • n_steps is too big
  • you do not need to change the momentum
  • switch to a OneCycle learning rate scheduler and you'll improve your scores

Good luck!

@balazsgonczy
Author

Could you please explain why you think these 4 recommendations should work (with citations if possible)?

"

  • n_shared is too big in your experiment
  • n_steps is too big
  • you do not need to change the momentum
  • switch to a OneCycle learning rate scheduler and you'll improve your scores

"

@Optimox
Collaborator

Optimox commented Apr 14, 2022

Not really; those are just my personal feelings, and I may be wrong. Nothing scientific here.

Please share the results of your experiments if you try those suggestions, so that we can all benefit from your results.

@Optimox Optimox closed this as completed Apr 14, 2022
@balazsgonczy
Author

Hi Optimox,

I would like to create a hyperparameter range for my thesis, and I wonder what the value range of "lambda_sparse" is. For me it starts from 0.01, with a step size of 0.000001. So I suppose the minimum value is around 0, and the max would be 1?

I look forward to your feedback!

Best,

Balázs

@Optimox
Collaborator

Optimox commented Jun 20, 2022

@balazsgonczy lambda_sparse is a multiplicative weight for the sparsity loss.
0 means that you don't add any constraint on sparsity. The maximum acceptable value for lambda_sparse depends on the loss function you are using for your problem. If your RMSE is around 10K and the sparsity loss has scores around 0.1, then you can set a high weight if you want; but if your average loss is around 1e-5, then a weight of 0.1 might already be too high.
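
So for a thesis-style search, a log-scale range (rather than a fixed additive step) is the usual choice; for example, with Optuna, inside an objective like the one sketched earlier (the bounds are illustrative):

```python
# Search lambda_sparse on a log scale: useful values span several orders of magnitude.
lambda_sparse = trial.suggest_float("lambda_sparse", 1e-6, 1e-1, log=True)
```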
