Models don't accept model_name, saving_path #136

Closed
rmitsch opened this issue Jun 8, 2020 · 23 comments

@rmitsch

rmitsch commented Jun 8, 2020

Describe the bug

Models don't accept model_name, saving_path as initialization arguments.

What is the current behavior?

See above.

If the current behavior is a bug, please provide the steps to reproduce.

clf: TabNetClassifier = TabNetClassifier(saving_path="/home/user123/dev/", device_name="cpu")

Expected behavior

Models should accept model_name, saving_path as initialization arguments as specified in the documentation.


Additional context

On a related note: how can models be persisted? The mentioned init parameters strongly suggest that it is possible, but I couldn't find any information on this, either in the documentation or in the code.

@rmitsch rmitsch added the bug Something isn't working label Jun 8, 2020
@Optimox
Collaborator

Optimox commented Jun 8, 2020

hey @rmitsch,

Thanks for creating this issue. model_name and saving_path are actually deprecated; we should remove them and update the README.

Saving a TabNet model follows the same rules as saving a PyTorch or XGBoost model.
Either you save it with pickle, just like an XGBoost model, or you use the PyTorch-specific save methods: https://pytorch.org/tutorials/beginner/saving_loading_models.html

The approach I would recommend:

  • torch.save(clf_tabnet.network.state_dict(), PATH) to save your model clf_tabnet
  • when you want to use it later: you'll need to redefine your TabNet model with the same params, clf_tabnet = TabNetClassifier(**your_params), and then clf_tabnet.network.load_state_dict(torch.load(PATH))

@Optimox Optimox added documentation Improvements or additions to documentation and removed bug Something isn't working labels Jun 8, 2020
@rmitsch
Author

rmitsch commented Jun 9, 2020

Thanks for the quick response!
Your recommended approach to saving the model unfortunately yields AttributeError: 'TabNetClassifier' object has no attribute 'network'.

@Optimox
Collaborator

Optimox commented Jun 9, 2020

hello @rmitsch,

Actually you are right, what I said does not work because the network is instantiated only after a fit, which is not very useful in that case (we might change that behaviour in the future).

Try this instead; it should work:

import pickle

# Save the model wherever you want
with open("./AMODEL.pkl", 'wb') as model_file:
    pickle.dump(clf, model_file)

# Load the model later to make prediction
with open("./AMODEL.pkl", 'rb') as model_file:
    new_clf = pickle.load(model_file)
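
The reloaded new_clf behaves like the original classifier, so inference should work directly on it, for example (X_test here is just a placeholder for your own feature matrix):

# Use the reloaded classifier for predictions
preds = new_clf.predict(X_test)
probas = new_clf.predict_proba(X_test)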

@eduardocarvp
Collaborator

Indeed, @Optimox, I have noticed that, and I probably even have the change locally where I instantiate the network in the class __init__(). I think it is better that way.
I'm willing to work on this and I can also fix the model_name/saving_path on the way; it should be simple.

@Optimox
Collaborator

Optimox commented Jun 9, 2020

@eduardocarvp I think the problem with this is that before the fit we do not know either input_dim or output_dim; it's nice to have these computed automatically, so I'm not sure how to bypass this.

I think the best way would probably be to have a method set_network (or something better named) that would be called automatically during fit but could also be called manually in order to instantiate everything.
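
Something along these lines, just as a rough sketch (the exact attribute names read from the classifier and the TabNet constructor arguments would need checking, so treat this as an assumption rather than the actual API):

import torch
from pytorch_tabnet.tab_network import TabNet

def set_network(clf, input_dim, output_dim):
    # Hypothetical helper: build the underlying TabNet module up front so
    # that clf.network exists (and can be saved/loaded) before any fit().
    # The attributes read from clf below are assumptions about its internals.
    use_cuda = clf.device_name in ("auto", "cuda") and torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    clf.input_dim = input_dim
    clf.output_dim = output_dim
    clf.network = TabNet(input_dim=input_dim,
                         output_dim=output_dim,
                         n_d=clf.n_d,
                         n_a=clf.n_a).to(device)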

As for saving, I don't know if we want to package something or just give a few methods on how to save and reuse a TabNet model.

@rmitsch
Author

rmitsch commented Jun 9, 2020

@Optimox Plain old pickling worked, thanks!
My two cents on whether to offer functionality to save and load models: IMO that would be reasonable, even if it's just a very simple wrapper - that way I, as a user, don't have to worry about whether to use PyTorch's save, pickle, etc.
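
Even something as thin as the following would already help (purely hypothetical names, it just wraps the pickle approach from above):

import pickle

def save_model(clf, path):
    # Hypothetical convenience wrapper around the pickle approach above
    with open(path, "wb") as model_file:
        pickle.dump(clf, model_file)

def load_model(path):
    # Counterpart that returns the unpickled classifier
    with open(path, "rb") as model_file:
        return pickle.load(model_file)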

@xywust2014

xywust2014 commented Jun 10, 2020

hello @Optimox

> Actually you are right, what I said does not work because the network is instantiated only after a fit, which is not very useful in that case (we might change that behaviour in the future).
>
> Try this instead; it should work:
>
> import pickle
>
> # Save the model wherever you want
> with open("./AMODEL.pkl", 'wb') as model_file:
>     pickle.dump(clf, model_file)
>
> # Load the model later to make prediction
> with open("./AMODEL.pkl", 'rb') as model_file:
>     new_clf = pickle.load(model_file)

I tried this method, but it gives me an error: PicklingError: Can't pickle <class 'pytorch_tabnet.tab_model.TabNetClassifier'>: it's not the same object as pytorch_tabnet.tab_model.TabNetClassifier. Could you take a look at what might be the reason? Thanks!

It looks like I can use this to save the model. However, after I call the .fit method on clf with the training dataset, the above-mentioned error occurs.

@Optimox
Collaborator

Optimox commented Jun 11, 2020

hello @xywust2014, could you please clarify a bit when the error occurs?
Sharing some code would help us too, but from your error message it looks like you might be missing parentheses; what you want to save is your clf defined like this: clf = pytorch_tabnet.tab_model.TabNetClassifier().

@xywust2014

> hello @xywust2014, could you please clarify a bit when the error occurs?
> Sharing some code would help us too, but from your error message it looks like you might be missing parentheses; what you want to save is your clf defined like this: clf = pytorch_tabnet.tab_model.TabNetClassifier().

Thanks a lot for the help. Here is the code.

clf = TabNetClassifier(
    **best_hyperparams, 
    optimizer_fn=torch.optim.Adam,
    scheduler_params = {"gamma": 0.95,
                     "step_size": 20},
    scheduler_fn=torch.optim.lr_scheduler.StepLR, epsilon=1e-15,
    device_name = 'auto'
)

max_epochs = 100
clf.fit(X_train = train_x, y_train = train_y , 
        X_valid = test_x, y_valid = test_y, 
        max_epochs = max_epochs, patience = 1, 
        batch_size = 512, virtual_batch_size = 256
        )

import pickle 
with open("./AMODEL.pkl","wb") as model_file:
    pickle.dump(clf, model_file)

with open("./AMODEL.pkl","wb") as model_file:
    new_clf = pickle.load(model_file)

@Optimox
Collaborator

Optimox commented Jun 11, 2020

Hmm, could you try reading the file instead of writing in the second with open statement:

instead of this

with open("./AMODEL.pkl","wb") as model_file:
    new_clf = pickle.load(model_file)

try this

with open("./AMODEL.pkl","rb") as model_file:
    new_clf = pickle.load(model_file)

@xywust2014

xywust2014 commented Jun 11, 2020

> Hmm, could you try reading the file instead of writing in the second with open statement:
>
> instead of this
>
> with open("./AMODEL.pkl","wb") as model_file:
>     new_clf = pickle.load(model_file)
>
> try this
>
> with open("./AMODEL.pkl","rb") as model_file:
>     new_clf = pickle.load(model_file)

Thanks a lot! :) I will try that.
Well, unfortunately, even after I changed that, it still gives me the same error.

@athewsey
Contributor

Just chipping in to say - pickle dump/load is working for me... But I've noticed that it means it's not possible to e.g. train the model on a CUDA-enabled machine but then deploy for inference on a CPU-only environment. I suspect there might also be some constraints about porting the model between Python versions or other environment changes?

I'd advocate for either of these, if possible:

  • Adding dedicated load/save methods to this library's API and working on their flexibility, or
  • Adding a (documented) way to use PyTorch's load/save methods

...Not sure if it belongs in a separate enhancement Issue or is OK to tackle here though!

@Optimox
Collaborator

Optimox commented Jun 14, 2020

@athewsey
I guess improving things so that the pickle option is not the only one would allow easier porting across different environments; pickle has this inherent flaw.

However, have you tried explicitly switching to CPU for inference after loading your model by doing clf.device_name = "cpu"?

@athewsey
Contributor

Thanks for the quick response @Optimox! ...But I'm afraid I don't think it'll work 😔

The error below is thrown on pickle.load("classifier.pkl"), so it's not possible to unpickle first and then check and override:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I also tried unpickling the trained file on a GPU-enabled instance (which showed device_name = "auto", by the way), setting the prop to cpu, then re-pickling to classifier.cpu.pkl and seeing if that would load in the CPU-only environment. It raised the same error even when trying a range of different updates, e.g.:

model.device_name = "cpu"
model.device = torch.device(model.device_name)
model.network.device = model.device
model.network.to(model.network.device)
model.network.cpu()
# Still won't unpickle on a non-CUDA env
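
One route that might work instead (untested on my side) would be to avoid pickling the classifier altogether and only move the weights, along the lines of the torch error message. The catch, as discussed above, is that network only exists after a fit, so the CPU-side classifier would need a short dummy fit first. your_params, the dummy arrays and the file path below are placeholders, not a verified recipe:

import torch
from pytorch_tabnet.tab_model import TabNetClassifier

# On the GPU machine: save only the network weights
torch.save(clf.network.state_dict(), "tabnet_weights.pt")

# On the CPU-only machine: rebuild the classifier with the same parameters,
# run a minimal fit so that clf.network gets created (the dummy data must
# have the same number of features and classes as the original training
# data), then load the weights onto the CPU explicitly via map_location.
clf = TabNetClassifier(**your_params, device_name="cpu")
clf.fit(X_train=X_dummy, y_train=y_dummy,
        X_valid=X_dummy, y_valid=y_dummy,
        max_epochs=1)
state_dict = torch.load("tabnet_weights.pt", map_location=torch.device("cpu"))
clf.network.load_state_dict(state_dict)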

@Optimox
Collaborator

Optimox commented Jun 15, 2020

Yep, it was worth a try. I think @eduardocarvp will come up with a better solution pretty soon.

@xywust2014

> Just chipping in to say - pickle dump/load is working for me... But I've noticed that it means it's not possible to e.g. train the model on a CUDA-enabled machine but then deploy for inference on a CPU-only environment. I suspect there might also be some constraints about porting the model between Python versions or other environment changes?
>
> I'd advocate for either of these, if possible:
>
>   • Adding dedicated load/save methods to this library's API and working on their flexibility, or
>   • Adding a (documented) way to use PyTorch's load/save methods
>
> ...Not sure if it belongs in a separate enhancement Issue or is OK to tackle here though!

Thanks for the response. It looks like I can use the pickle method to save & load the model clf before fitting on the training dataset. However, after I call the .fit method on clf with the training dataset, the following error occurs:

PicklingError: Can't pickle <class 'pytorch_tabnet.tab_network.TabNet'>: it's not the same object as pytorch_tabnet.tab_network.TabNet

@Optimox
Collaborator

Optimox commented Jun 15, 2020

@xywust2014 could you share a minimal code sample to reproduce your error?

Because the pickling option seems to be working as long as you stay in the same environment. Without a reproducible bug we can't help you.

@xywust2014

xywust2014 commented Jun 15, 2020

> @xywust2014 could you share a minimal code sample to reproduce your error?
>
> Because the pickling option seems to be working as long as you stay in the same environment. Without a reproducible bug we can't help you.

Thanks, Optimox. I am using Python 3.7 in Spyder. This machine has a CUDA environment, but for the code below I chose device_name = 'cpu'. (I set this to 'auto' before, which generates the same error.)

If I skip the clf.fit() call, there are no errors.
However, if I run it, the error occurs.

best_hyperparams = {"clip_value": 4.0, "gamma": 0.6666666666666666, "lr": 0.23, 
                    "momentum": 0.45, "n_a": 8, "n_d": 48, "n_independent": 6, "n_shared": 2}

clf = TabNetClassifier(
    **best_hyperparams, 
    optimizer_fn=torch.optim.Adam,
    scheduler_params = {"gamma": 0.95,
                     "step_size": 20},
    scheduler_fn=torch.optim.lr_scheduler.StepLR, epsilon=1e-15,
    device_name = 'cpu')

max_epochs = 100
clf.fit(X_train = train_x, y_train = train_y , 
        X_valid = test_x, y_valid = test_y, 
        max_epochs = max_epochs, patience = 2, 
        batch_size = 1024, virtual_batch_size = 256
        )

###################### Save the trained model ####################
#joblib.dump(clf, filename = "Model"+"\\" + model_n +'.plk')
#torch.save(clf.network.state_dict(),"C:\DataScientist\HC\"  + "Model"+"\\" + model_n)
import pickle 
with open("./AMODEL.pkl","wb") as model_file:
    pickle.dump(clf, model_file)

with open("./AMODEL.pkl","rb") as model_file:
    new_clf = pickle.load(model_file)

Thanks guys for the help.

@Optimox
Collaborator

Optimox commented Jun 15, 2020

Well, I just ran this code on my local machine on census income and it runs without problems...

A few notes though: if you are performing hyperparameter tuning, you could refer to the README on the front page of the repo to see "typical" values from the research paper.

  • momentum seems a bit high
  • clip_value is for gradient clipping; not sure it's worth searching over this, but why not
  • typically n_d = n_a works well, so this could lower your search space
  • gamma is mathematically supposed to be greater than 1; values below 1 would make the masks behave a bit strangely, I guess. Maybe searching between 1 and 2 would yield better results.

I know this does not solve your problem, but everything is running just fine on my computer... I'm not sure where this comes from.

@xywust2014

> Well, I just ran this code on my local machine on census income and it runs without problems...
>
> A few notes though: if you are performing hyperparameter tuning, you could refer to the README on the front page of the repo to see "typical" values from the research paper.
>
>   • momentum seems a bit high
>   • clip_value is for gradient clipping; not sure it's worth searching over this, but why not
>   • typically n_d = n_a works well, so this could lower your search space
>   • gamma is mathematically supposed to be greater than 1; values below 1 would make the masks behave a bit strangely, I guess. Maybe searching between 1 and 2 would yield better results.
>
> I know this does not solve your problem, but everything is running just fine on my computer... I'm not sure where this comes from.

Thanks a lot, Optimox, for the advice.

@DoDzilla-ai

DoDzilla-ai commented Jun 22, 2020

> hey @rmitsch,
>
> Thanks for creating this issue. model_name and saving_path are actually deprecated; we should remove them and update the README.
>
> Saving a TabNet model follows the same rules as saving a PyTorch or XGBoost model.
> Either you save it with pickle, just like an XGBoost model, or you use the PyTorch-specific save methods: https://pytorch.org/tutorials/beginner/saving_loading_models.html
>
> The approach I would recommend:
>
>   • torch.save(clf_tabnet.network.state_dict(), PATH) to save your model clf_tabnet
>   • when you want to use it later: you'll need to redefine your TabNet model with the same params, clf_tabnet = TabNetClassifier(**your_params), and then clf_tabnet.network.load_state_dict(torch.load(PATH))

Models also don't accept mask_type. Is this deprecated as well?

PS: Btw. lr (learning rate) is not documented in the readme. Just saying...

@Optimox
Collaborator

Optimox commented Jun 22, 2020

@DoDzilla-ai

Hello, well you are actually looking at the develop branch README (maybe we should find a way of defaulting to the master branch), so mask_type is actually a new feature, not a deprecated one; but if you installed the code from pip then you are using the master branch code, which does not accept mask_type.

The same thing is happening with lr; we changed this recently in order to give more flexibility to end users.

The development branch always has some advanced features that the master branch does not have yet; they will match at the next release in the coming weeks. In the meantime, please refer to the master branch README for the current documentation.

@Optimox
Collaborator

Optimox commented Jul 3, 2020

Release 1.2 should solve all these problems; feel free to open a new issue if you encounter another problem.

@Optimox Optimox closed this as completed Jul 3, 2020