
add custom tokenizer #34

Closed
ArmandGiraud opened this issue Mar 5, 2019 · 4 comments


@ArmandGiraud

Hello,
Thanks for your great project!

I would like to know if it is possible to change the tokenizer used for computing frequencies.

For example I use the spaCy tokenizer for french.

For the following French sentence:

"l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée"
the spaCy tokenizer will output the following token sequence:

["l'",
 'attaque',
 'de',
 'non-concurrence',
 ':',
 'un',
 'rendez-vous',
 'pour',
 'cette',
 'nouvelle',
 'rentrée']

As a consequence, pyspellchecker will always try to correct hyphenated words like rendez-vous, since they get tokenized differently (i.e. as ["rendez", "vous"], which has a completely different meaning).
Would it be possible to accept a custom tokenizer when loading a custom frequency file? e.g.

spell.word_frequency.load_text_file("./my_txt_file", tokenizer=spacy_tokenizer)

I did not try it, but maybe simply loading a custom word_frequency dictionary would do the job?
spell.word_frequency.load_dictionary('./path-to-my-word-frequency.json')
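For anyone trying this route: assuming load_dictionary reads a plain {word: count} JSON mapping (as the pyspellchecker docs describe), you can build such a file from your own tokens with the standard library. The file name and token list below are purely illustrative:

```python
import json
from collections import Counter

# Hypothetical tokens produced by your own tokenizer (e.g. spaCy).
tokens = ["rendez-vous", "rendez-vous", "non-concurrence", "rentrée"]

# Count occurrences and write them out as a {word: count} JSON object.
freq = Counter(tokens)
with open("word-frequency.json", "w", encoding="utf-8") as fh:
    json.dump(freq, fh, ensure_ascii=False)

# Then: spell.word_frequency.load_dictionary("word-frequency.json")
```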

thanks for your help
Armand

@barrust
Owner

barrust commented Mar 6, 2019

Yes, you are correct that the current method to load a text file (load_text_file) splits words on whitespace. You can always tokenize the text yourself and use the load_words function to load the terms into the word_frequency:

from spacy.tokenizer import Tokenizer
# set up the tokenizer...
doc = tokenizer("l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée")
words = [token.text for token in doc]  # load_words expects strings, not spaCy Token objects
spell.word_frequency.load_words(words)

But you are probably right that we should have a way to pass in a tokenizer function; it would just need to take a string as input. Thoughts?
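A tokenizer of that shape can be any plain function from a string to a list of tokens. As a stdlib-only sketch (no spaCy, name and regex are my own), one that keeps internal hyphens and apostrophes together could look like:

```python
import re

def hyphen_aware_tokenize(text):
    # Match runs of letters joined by internal hyphens or apostrophes,
    # so "rendez-vous" and "l'attaque" each survive as a single token
    # instead of being split apart.
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text)
```

For example, hyphen_aware_tokenize("un rendez-vous pour cette rentrée") keeps "rendez-vous" intact as one token.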

@barrust
Owner

barrust commented Mar 6, 2019

I put up a branch that accepts a tokenizer. See if it does what you are looking for:

feature/tokenizer
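For readers of this thread, the hook boils down to something like the following stdlib-only sketch; load_text here is a hypothetical stand-in, not the library's actual code:

```python
from collections import Counter

def load_text(text, tokenizer=None):
    # Split on whitespace by default, mirroring the existing
    # load_text_file behaviour, or defer to a caller-supplied
    # tokenizer that takes a string and returns a list of tokens.
    tokens = tokenizer(text) if tokenizer is not None else text.split()
    return Counter(token.lower() for token in tokens)

freq = load_text("un rendez-vous un rendez-vous",
                 tokenizer=lambda s: s.split())
```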

@ArmandGiraud
Author

Yes, I did the same locally and it does the trick perfectly! Thanks

@barrust
Owner

barrust commented Mar 8, 2019

Done! This will be in the next release.
