
Save Japanese NER model by using nlp.to_disk #1557

Closed
buivietan opened this issue Nov 13, 2017 · 9 comments
Labels
lang / ja Japanese language data and models

Comments


buivietan commented Nov 13, 2017

I got the error "AttributeError: 'JapaneseTokenizer' object has no attribute 'to_disk'" when trying to save a Japanese NER model in spaCy 2.0.2. Can you help me fix this error? Thanks so much!

Environment

  • Operating System: Win 10 Pro
  • Python Version Used: 3.6.3
  • spaCy Version Used: 2.0.2
ines added the lang / ja label Nov 13, 2017
ines (Member) commented Nov 13, 2017

Thanks for the report! The reason this happens is that the Japanese tokenizer is a custom implementation built on the Janome library, and it doesn't use spaCy's serializable Tokenizer class. So when you call nlp.to_disk(), spaCy calls the to_disk() methods of all pipeline components and the tokenizer – which fails in this case.
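
For reference, a minimal reproduction looks roughly like this (sketch; assumes the janome library is installed, and the output path is just a placeholder):

import spacy

# A blank Japanese pipeline uses the custom Janome-based JapaneseTokenizer
nlp = spacy.blank('ja')
# Raises: AttributeError: 'JapaneseTokenizer' object has no attribute 'to_disk'
nlp.to_disk('/tmp/ja_model')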

Possible solutions for now:

  • Use Pickle. This is probably the easiest and safest way – but if you're looking to create a spaCy model package, you'd have to modify the package's __init__.py to load your pickle file instead of initialising the model the standard way (see here for details).
  • Set nlp.tokenizer = None before saving out the model. This is not so nice – but in this case, it shouldn't really matter, since spaCy currently doesn't ship with any Japanese language data that you'd want to serialize with the model anyway. (Haven't tested this approach, but pretty sure it works!)

We should probably allow disabling the tokenizer via the disable keyword argument on Language.to_disk, too (which is currently only possible for pipeline components). Will think about the best way to solve this.

Btw, curious to hear about your results on training Japanese NER – sounds very exciting!

buivietan (Author) commented

@ines Thanks so much for your quick reply. I'll try your solution and give you my feedback on training Japanese NER :)

buivietan (Author) commented

Hi @ines,
Sorry for bothering you. I tried your solution but got this error when using Pickle: "AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'". Below is my sample code:
import pickle

# save model to output directory
print("Saving model...")
nlp.tokenizer = None
ner_model = pickle.dumps(nlp)
Moreover, I tried to build simple Chinese and Thai NER models, and I could save those models successfully using the nlp.to_disk method. I wonder if there is something wrong with Japanese in spaCy 2.0.2 that causes the error "AttributeError: 'JapaneseTokenizer' object has no attribute 'to_disk'".
Can you help me out with this? Thanks so much.

ines (Member) commented Nov 15, 2017

Hmm, this is strange! I think the difference between Japanese and Thai/Chinese is that Japanese provides a create_tokenizer method (see here), while the others only overwrite make_doc (see here).

What happens if you don't use Pickle and the regular nlp.to_disk() method, but tell it to disable the tokenizer, for example:

nlp.to_disk('/path/to/model', disable=['tokenizer'])

If this works, the only problem here is that you'll also need to set disable=['tokenizer'] when you load the model back in using nlp.from_disk(). So packaging a model and loading it via spacy.load() won't work out-of-the-box.
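
Untested sketch of the full round trip under those assumptions (the paths are placeholders, and I'm assuming the model only contains an ner pipe – note that you'd have to recreate the pipes on the fresh nlp object before from_disk has anything to fill):

from spacy.lang.ja import Japanese

# Save everything except the unserializable tokenizer
nlp.to_disk('/path/to/model', disable=['tokenizer'])

# Load back: create a blank Japanese pipeline, recreate the pipes,
# then deserialize everything except the tokenizer
nlp2 = Japanese()
nlp2.add_pipe(nlp2.create_pipe('ner'))
nlp2.from_disk('/path/to/model', disable=['tokenizer'])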

We'll think about a good way to solve this in the future. When saving out a model, spaCy should probably check if the tokenizer is serializable and if not, show a warning, but serialize anyway.

Nice to hear that Chinese and Thai worked well – this is really cool!

buivietan (Author) commented Nov 15, 2017

@ines
Oh yeah, it worked! I could save the Japanese NER model successfully. Thanks so much :)

ines closed this as completed in c9d72de Nov 15, 2017
ines (Member) commented Nov 15, 2017

Just pushed a fix to Japanese that implements "dummy" serialization methods on the tokenizer to prevent the error. I also found another small bug that caused the Japanese vocab to not set the lang correctly (meaning that the saved out model's meta.json had "lang": "" set, which causes an error when loading the model back in).
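
The fix essentially gives JapaneseTokenizer no-op serialization hooks, along these lines (a rough sketch of the idea, not the literal commit):

# Dummy serialization methods on JapaneseTokenizer, so nlp.to_disk()
# and friends no longer fail:
def to_bytes(self, **exclude):
    return b''

def from_bytes(self, bytes_data, **exclude):
    return self

def to_disk(self, path, **exclude):
    return None

def from_disk(self, path, **exclude):
    return self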

Just tested it locally, and both to/from disk and to/from bytes now work correctly. This means you should also be able to package your Japanese model as a Python package using the spacy package command.

yarongon commented Jan 30, 2018

I have a similar problem that I could not fix: I've trained a custom NER model that I'd like to save to disk, and since I'm using a custom tokenizer, I don't want to save the tokenizer. Here's what I did:

import spacy

nlp = spacy.load("en")
nlp.tokenizer = some_custom_tokenizer
# Train the NER model...
nlp.tokenizer = None
nlp.to_disk('/tmp/my_model', disable=['tokenizer'])

(Due to this thread, I did not package the model.)
When I try to load it, the pipeline is empty and, surprisingly, it has the default spaCy tokenizer.

nlp = spacy.blank('en').from_disk('/tmp/my_model', disable=['tokenizer'])

I need to load the model without the tokenizer but with the full pipeline. Any ideas? Thanks.
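
One workaround I'm considering is to recreate the saved pipes on the blank pipeline before deserializing, and then re-attach my tokenizer afterwards (sketch – some_custom_tokenizer stands in for my own code, and I'm assuming the saved "en" model contains tagger, parser and ner). Is something like this the intended approach?

import spacy

nlp = spacy.blank('en')
# Recreate the pipes that were saved, so from_disk has something to fill
for pipe_name in ['tagger', 'parser', 'ner']:
    nlp.add_pipe(nlp.create_pipe(pipe_name))
nlp.from_disk('/tmp/my_model', disable=['tokenizer'])
# Re-attach the custom tokenizer
nlp.tokenizer = some_custom_tokenizer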

yarongon commented
More about this issue: when I tried to load the model like this:

loaded_nlp = spacy.load('/model/directory', disable=['tokenizer'])

I got an error:

FileNotFoundError: [Errno 2] No such file or directory: '/model/directory/tokenizer'

I looked at the code of util.load_model_from_path and I think I found a bug there. Line 158 is:

return nlp.from_disk(model_path)

If the disable parameter were passed through to this call, it would be possible to use spacy.load directly for loading models without specific parts:

return nlp.from_disk(model_path, disable=disable)

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018