
add custom tokenizer #34

Closed
ArmandGiraud opened this issue Mar 5, 2019 · 4 comments


@ArmandGiraud

Hello,
Thanks for your great project!

I would like to know if it is possible to change the tokenizer used for computing frequencies.

For example I use the spaCy tokenizer for french.

For the following French sentence:

"l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée"
the spaCy tokenizer will output the following token sequence:

["l'",
 'attaque',
 'de',
 'non-concurrence',
 ':',
 'un',
 'rendez-vous',
 'pour',
 'cette',
 'nouvelle',
 'rentrée']

As a consequence, pyspellchecker will always try to correct hyphenated words like rendez-vous, since they get tokenized differently (i.e. as ["rendez", "vous"], which has a completely different meaning).
Would it be possible to accept a custom tokenizer when loading a custom frequency file? e.g.

spell.word_frequency.load_text_file("./my_txt_file", tokenizer=spacy_tokenizer)

I did not try it, but maybe simply loading a custom word_frequency dictionary would do the job?
spell.word_frequency.load_dictionary('./path-to-my-word-frequency.json')
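For anyone trying this route: assuming load_dictionary reads a plain {word: count} JSON mapping (as the pyspellchecker docs describe), you can build such a file from your own tokens with the standard library. The file name and token list below are purely illustrative:

```python
import json
from collections import Counter

# Hypothetical tokens produced by your own tokenizer (e.g. spaCy).
tokens = ["rendez-vous", "rendez-vous", "non-concurrence", "rentrée"]

# Count occurrences and write them out as a {word: count} JSON object.
freq = Counter(tokens)
with open("word-frequency.json", "w", encoding="utf-8") as fh:
    json.dump(freq, fh, ensure_ascii=False)

# Then: spell.word_frequency.load_dictionary("word-frequency.json")
```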

thanks for your help
Armand

@barrust
Owner

barrust commented Mar 6, 2019

Yes, you are correct that the current method to load a text file (load_text_file) splits words on whitespace. You can always tokenize the text yourself and use the load_words function to load the terms into the word_frequency:

from spacy.tokenizer import Tokenizer
# set up the tokenizer...
doc = tokenizer("l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée")
words = [token.text for token in doc]  # load_words expects strings, not spaCy Token objects
spell.word_frequency.load_words(words)

But you are probably right that we should have a way to pass in a tokenizer function; it would just need to take a string as input. Thoughts?
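A tokenizer of that shape can be any plain function from a string to a list of tokens. As a stdlib-only sketch (no spaCy, name and regex are my own), one that keeps internal hyphens and apostrophes together could look like:

```python
import re

def hyphen_aware_tokenize(text):
    # Match runs of letters joined by internal hyphens or apostrophes,
    # so "rendez-vous" and "l'attaque" each survive as a single token
    # instead of being split apart.
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text)
```

For example, hyphen_aware_tokenize("un rendez-vous pour cette rentrée") keeps "rendez-vous" intact as one token.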

@barrust
Owner

barrust commented Mar 6, 2019

I put up a branch that accepts a tokenizer. See if it does what you are looking for:

feature/tokenizer
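For readers of this thread, the hook boils down to something like the following stdlib-only sketch; load_text here is a hypothetical stand-in, not the library's actual code:

```python
from collections import Counter

def load_text(text, tokenizer=None):
    # Split on whitespace by default, mirroring the existing
    # load_text_file behaviour, or defer to a caller-supplied
    # tokenizer that takes a string and returns a list of tokens.
    tokens = tokenizer(text) if tokenizer is not None else text.split()
    return Counter(token.lower() for token in tokens)

freq = load_text("un rendez-vous un rendez-vous",
                 tokenizer=lambda s: s.split())
```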

@ArmandGiraud
Author

Yes, I did the same locally and it does the trick perfectly! Thanks

@barrust
Owner

barrust commented Mar 8, 2019

Done! This will be in the next release.
