How to add new language #26

MukhtarShaima · 2018-11-17T17:25:39Z

Could you please give me clear instructions or steps so that I can add the Urdu language? I'm not able to download the Urdu file from the link you mentioned.

barrust · 2018-11-17T19:50:59Z

Sure, the steps to generate a new language are fairly straightforward:

Download the set of words that should be added to the dictionary
- I have used the WordFrequency Project, but I am not sure all languages are present there
Load that file into a dictionary; how you do this will depend on your source. The key is to turn it into a dictionary of the form key=word, val=frequency (as an int).

If your data is already in dictionary form (for example, a JSON file of word-to-frequency pairs), you can load it like so:

from spellchecker import SpellChecker
spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary(file_to_dictionary)
spell.export(location_for_export)
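
For a frequency list that ships as plain text rather than as a dictionary, one possible conversion is sketched below. It assumes each line holds a word and an integer count separated by whitespace; that layout, the helper name `frequency_file_to_json`, and the file paths are illustrative assumptions, not part of spellchecker. The resulting JSON file could then be passed to `load_dictionary` as shown above.

```python
import json


def frequency_file_to_json(txt_path, json_path, encoding="utf-8"):
    """Convert 'word count' lines into a JSON object of word -> int frequency.

    Hypothetical helper: the one-pair-per-line layout is an assumption
    about the input file, not something the library requires.
    """
    frequencies = {}
    with open(txt_path, encoding=encoding) as handle:
        for line in handle:
            parts = line.split()
            # Skip blank or malformed lines instead of failing on them
            if len(parts) != 2 or not parts[1].isdecimal():
                continue
            word, count = parts[0].lower(), int(parts[1])
            # Merge entries that differ only by letter case
            frequencies[word] = frequencies.get(word, 0) + count
    with open(json_path, "w", encoding=encoding) as handle:
        # ensure_ascii=False keeps non-Latin scripts readable in the file
        json.dump(frequencies, handle, ensure_ascii=False)
    return frequencies
```

Summing duplicate entries is a design choice for lists that contain the same word in different casings; drop the `.lower()` call if case matters for your language.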

If you only have plain-text files of words or running text, you can load them directly and have spellchecker build the word frequency for you:

from spellchecker import SpellChecker
spell = SpellChecker(language=None)
spell.word_frequency.load_text_file(path_to_text_file)
spell.export(location_for_export)
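
To illustrate what "building the word frequency" amounts to, here is a rough standalone sketch that counts words in raw text. It is not the library's implementation: the helper name `build_word_frequency` and the simple `\w+` tokenization are assumptions made for the example, and the library's own tokenizer may split text differently.

```python
import re
from collections import Counter


def build_word_frequency(text):
    """Count how often each word occurs in raw text (illustrative only)."""
    # \w+ matches runs of Unicode word characters; scripts that rely on
    # combining marks may need a more careful tokenizer than this
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


sample = "The cat sat on the mat. The end."
frequency = build_word_frequency(sample)
print(frequency["the"])  # 3
```

The larger and more representative the text, the more useful the resulting counts are for ranking corrections.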

Once you have exported the dictionary (really a word frequency list), you can then load that dictionary when you wish to use spellchecker:

from spellchecker import SpellChecker
spell = SpellChecker(language=None, local_dictionary=location_from_export)

MukhtarShaima · 2018-11-19T20:42:07Z

Thanks for the clear instructions; I have successfully loaded my text file.
Now the problem is that it does not give me correct answers, e.g.:
for word in misspelled:
    # Get the one most likely answer
    print(spell.correction(word))
It should return the correct or most likely word, but sometimes it gives me a wrong word for the misspelled one, or it returns the whole misspelled string unchanged.
Thank you.

barrust · 2018-11-20T00:56:07Z

That is likely due to a few different possible issues.

If you do not have frequency information, i.e., every word's count is set to 1 (or to the same value), the most likely candidate cannot be ranked meaningfully. Try something like:

# return those that are within the specified distance
print(spell.candidates(word))
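
As a rough illustration of why flat counts matter, the sketch below ranks a set of candidates by frequency. It is a simplification, not the library's code: `most_likely` and the sample counts are made up for the example, and the tie-breaking shown here (alphabetical order) is an assumption.

```python
def most_likely(candidates, frequency):
    """Pick the candidate with the highest frequency.

    Candidates are sorted first, so ties fall to the alphabetically
    first word (an assumption made for this sketch).
    """
    return max(sorted(candidates), key=lambda word: frequency.get(word, 0))


candidates = {"cat", "car", "cab"}
real_counts = {"cat": 500, "car": 900, "cab": 40}
flat_counts = {"cat": 1, "car": 1, "cab": 1}

print(most_likely(candidates, real_counts))  # car
print(most_likely(candidates, flat_counts))  # cab
```

With real counts the most common word wins; with flat counts the result depends only on the tie-breaking rule, which is why the suggestion can look wrong.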

If the edit distance between the misspelled word and the word you expect is greater than 2, then it will not work and it will return the word as is.
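
To check whether a particular typo is simply too far from its intended word, a plain Levenshtein distance can serve as a quick estimate. This is a minimal sketch, not the library's own routine: `edit_distance` is a hypothetical helper, and it counts a swap of two adjacent letters as two edits, so it can report a larger number than a measure that treats a transposition as a single edit.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(edit_distance("speling", "spelling"))     # 1
print(edit_distance("korrectud", "corrected"))  # 2
print(edit_distance("splng", "spelling"))       # 3
```

In the last case the distance exceeds 2, so no candidate would be found and the input would come back unchanged.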

Honestly, I have never tried this with non-Latin-script languages, so I am unsure how it will perform.

barrust · 2019-01-06T01:15:32Z

@MukhtarShaima Let me know if you are still having issues, otherwise, I am going to close this one!

Thanks!

ryuzakinho · 2019-03-28T10:43:08Z

Hi,

From my understanding, we can load JSON formatted dictionaries or text documents that will be used for building the frequency list.

I would like to directly use the word frequency lists available here (Word Frequency): https://github.com/hermitdave/FrequencyWords/tree/master/content/2018/fi

These are txt files containing frequencies. Is there a way to directly load such files or do I need to convert them to JSON first?

Thanks for your help!

barrust added the "question" (Further information is requested) label Nov 19, 2018

barrust closed this as completed Jan 6, 2019
