add custom tokenizer #34
Comments
Yes, you are correct that the current method to load a text file does not accept a custom tokenizer. A workaround is to tokenize the text yourself and load the words directly:

```python
from spacy.tokenizer import Tokenizer
# set up the tokenizer...
words = tokenizer("l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée")
spell.word_frequency.load_words(words)
```

But you are probably correct that we should have a way to pass in a tokenizing function; it would just have to take a string as input. Thoughts?
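Concretely, a minimal sketch of that workaround without the spaCy dependency (the `french_tokenizer` name and the regex are illustrative stand-ins, not part of either library):

```python
import re

def french_tokenizer(text):
    # Keep apostrophes and hyphens inside words so tokens like
    # "rendez-vous" and "l'attaque" survive as single words.
    return re.findall(r"[\w'-]+", text)

words = french_tokenizer(
    "l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée"
)
# words can then be fed to spell.word_frequency.load_words(words)
```

Here `re.findall` with a character class containing `'` and `-` keeps hyphenated and elided forms intact instead of splitting on every non-letter.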
I put up a branch that accepts a tokenizer. See if that does what you are looking for.
Yes, I did the same locally and it does the trick perfectly! Thanks
Done! This will be in the next release. |
Hello,
Thanks for your great project!
I would like to know if it is possible to change the tokenizer used for computing frequencies.
For example, I use the spaCy tokenizer for French. For the following French sentence:
"l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée"
the spaCy tokenizer will output a word sequence in which rendez-vous remains a single token.
As a consequence, pyspellchecker will always try to correct hyphenated words like rendez-vous, since it tokenizes them differently (i.e. as ["rendez", "vous"], which has a completely different meaning).

Would it be possible to include a custom tokenizer when loading a custom frequency file? E.g.:

```python
spell.word_frequency.load_text_file("./my_txt_file", tokenizer=spacy_tokenizer)
```
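To illustrate what such a `tokenizer` parameter could do, here is a stdlib-only sketch (`load_text_file_with_tokenizer` is a hypothetical helper mimicking the requested behavior, not pyspellchecker's actual implementation; the regex tokenizer and file contents are just examples):

```python
import re
import tempfile
from collections import Counter

def load_text_file_with_tokenizer(path, tokenizer):
    # Read the file and count token frequencies using the
    # caller-supplied tokenizer instead of a built-in one.
    with open(path, encoding="utf-8") as fh:
        return Counter(tokenizer(fh.read()))

# Illustrative tokenizer that keeps hyphens and apostrophes inside words.
def tokenize(text):
    return re.findall(r"[\w'-]+", text.lower())

# Write a small sample corpus to a temporary file for the demo.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("un rendez-vous, puis un autre rendez-vous")
    path = tmp.name

freq = load_text_file_with_tokenizer(path, tokenize)
# freq now maps "rendez-vous" to 2 instead of counting "rendez" and "vous" separately
```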
I did not try it, but maybe simply loading a custom word-frequency dictionary will do the job?

```python
spell.word_frequency.load_dictionary('./path-to-my-word-frequency.json')
```
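If that route works, building such a JSON file with a tokenizer of your choice might look like this (a sketch assuming load_dictionary accepts a word-to-count JSON mapping; the regex tokenizer stands in for spaCy):

```python
import json
import re
from collections import Counter

corpus = "l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée"

# Placeholder tokenizer that keeps hyphenated and elided words intact.
tokens = re.findall(r"[\w'-]+", corpus.lower())
freq = Counter(tokens)

# Counter is a dict subclass, so it serializes directly to JSON.
with open("path-to-my-word-frequency.json", "w", encoding="utf-8") as fh:
    json.dump(freq, fh, ensure_ascii=False)
```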
Thanks for your help,
Armand