
Ignore word function #53

Closed
David-Baron opened this issue Dec 28, 2022 · 10 comments

Comments

@David-Baron

Hello.
I can't find a function to ignore a word, and in some cases we need one.
An example with the English dictionary:
autocorrect changes srai -> sry
but I need srai left as-is.

@David-Baron
Author

David-Baron commented Dec 29, 2022

Something like:

# Proposed patch to Speller in autocorrect/__init__.py (re, Word,
# word_regexes and load_from_tar are already imported there).
class Speller:
    def __init__(
        self, lang="en", threshold=0, nlp_data=None, fast=False,
        only_replacements=False, ignore=None
    ):
        self.lang = lang
        self.threshold = threshold
        self.nlp_data = load_from_tar(lang) if nlp_data is None else nlp_data
        self.fast = fast
        self.only_replacements = only_replacements
        # avoid a mutable default argument; a set also gives O(1) lookups
        self.ignore = set(ignore) if ignore else set()

        if threshold > 0:
            # print(f'Original number of words: {len(self.nlp_data)}')
            self.nlp_data = {k: v for k, v in self.nlp_data.items() if v >= threshold}
            # print(f'After applying threshold: {len(self.nlp_data)}')

    def existing(self, words):
        """{'the', 'teh'} => {'the'}"""
        return {word for word in words if word in self.nlp_data}

    def get_candidates(self, word):
        w = Word(word, self.lang, self.only_replacements)
        if self.fast:
            candidates = self.existing([word]) or self.existing(w.typos()) or [word]
        else:
            candidates = (
                self.existing([word])
                or self.existing(w.typos())
                or self.existing(w.double_typos())
                or [word]
            )
        return [(self.nlp_data.get(c, 0), c) for c in candidates]

    def autocorrect_word(self, word):
        """most likely correction for everything up to a double typo"""
        if word == "":
            return ""

        # leave words the user asked to ignore untouched
        if word in self.ignore:
            return word

        candidates = self.get_candidates(word)

        # in case the word is capitalized
        if word[0].isupper():
            decapitalized = word[0].lower() + word[1:]
            candidates += self.get_candidates(decapitalized)

        best_word = max(candidates)[1]

        if word[0].isupper():
            best_word = best_word[0].upper() + best_word[1:]
        return best_word

    def autocorrect_sentence(self, sentence):
        return re.sub(
            word_regexes[self.lang],
            lambda match: self.autocorrect_word(match.group(0)),
            sentence,
        )

    __call__ = autocorrect_sentence

It runs, but there is no test (I don't know how to write one and have no time at the moment).

If a developer more qualified than me could write the unit test and open a PR, thank you.
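A standalone sketch of the behavior the `ignore` parameter adds (the `MiniSpeller` class, the toy frequency dict, and the one-letter "correction" rule here are all hypothetical simplifications, not the real autocorrect internals):

```python
# Standalone sketch of the proposed `ignore` parameter.
class MiniSpeller:
    def __init__(self, nlp_data, ignore=None):
        self.nlp_data = nlp_data
        self.ignore = set(ignore or [])  # set -> O(1) membership test

    def autocorrect_word(self, word):
        if word in self.ignore:  # the user asked to leave this word alone
            return word
        if word in self.nlp_data:  # already a known word
            return word
        # toy "correction": most frequent known word differing by one letter
        candidates = [
            w for w in self.nlp_data
            if len(w) == len(word)
            and sum(a != b for a, b in zip(w, word)) == 1
        ]
        return max(candidates, key=self.nlp_data.get) if candidates else word

print(MiniSpeller({"sry": 100}).autocorrect_word("sra"))  # corrected to "sry"
print(MiniSpeller({"sry": 100}, ignore=["sra"]).autocorrect_word("sra"))  # stays "sra"
```

The ignore check runs before any candidate generation, so ignored words cost only one set lookup.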

@filyp
Owner

filyp commented Dec 31, 2022

Yeah, the code looks legit, but I also don't have the time to write the tests and all.

Thanks for figuring this out. I'm leaving the issue open in case someone else has a similar use case.

@filyp
Owner

filyp commented Dec 31, 2022

Ah, I remembered there actually is already a way to ignore words (although a bit roundabout).

The nlp_data parameter lets you pass your own word-frequency dictionary. If you want to use the default dictionary but ignore a few words, you can modify that dictionary: assign some non-zero frequency to the words you wish to ignore. (I know this is quite a hacky way to do it; your implementation is cleaner.)

@David-Baron
Author

@filyp Yes indeed, but I find that approach a little too cumbersome, since the goal is only to ignore a few words. In addition, it modifies the base data, which is a problem when using autocorrect in several projects.

@filyp
Owner

filyp commented Jan 3, 2023

I didn't mean modifying files but something like:

spell = Speller()
spell.nlp_data.update(words_to_ignore_dict)
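For completeness, a runnable sketch of that workaround; the frequency dict here is a stand-in for the one the package loads, and the words are illustrative. With the real package you would call `spell.nlp_data.update(...)` the same way:

```python
# Stand-in for Speller's word-frequency dictionary (the real one is
# loaded by the autocorrect package when Speller is created).
nlp_data = {"the": 23135851162, "sry": 100}

# Map each word to ignore to a non-zero frequency; the speller then
# treats it as a known word and never "corrects" it.
words_to_ignore = ["srai", "kiyaverse"]
nlp_data.update({w: 1 for w in words_to_ignore})

print("srai" in nlp_data and nlp_data["srai"] > 0)  # True: now a "known" word
```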

@David-Baron
Author

It's a possibility; the difference is that you have to subtract the ignored words from the nlp_data array, which I think means longer processing time. (not tested)

@charlietiwari

Hi, can you please help me add some words? The words below are not spell-checked correctly. I also tried adding them as per your code above, but it is not working. Could you add some words to your vocabulary, or suggest a tested method to add these words?

metaverse
kiyaverse
metachamber
metaroom

Thanks

@David-Baron
Author

@charlietiwari
I don't think this belongs to this issue.

In addition, kiyaverse, metachamber, and metaroom are company-specific names, so it seems to me they will never be added to a dictionary. Metaverse, however, will certainly be added soon, as it enters more and more into the common language.
See the readme https://github.com/filyp/autocorrect#custom-word-sets for the correction of your particular words.

Please create a new issue if your question does not match the current one.

@filyp
Owner

filyp commented Jan 9, 2023

@David-Baron ah, no, you don't subtract but rather add them to this nlp_data. This way, they are treated as real words and not corrected. Regarding performance, modifying a dictionary is pretty efficient: it has complexity O(n), where n is the number of new entries (here, words). And you do it just once, during initialization of the Speller. In later usage, there should be no increase in processing time, because dictionary lookup time doesn't depend on the number of items in the dictionary.
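A quick sketch of that point (the dictionary size and words are illustrative): the update is a one-time O(k) cost for k new words, and later membership checks are average O(1) hash lookups regardless of how large the dictionary is:

```python
# Stand-in frequency dict with many entries, to make the size point.
nlp_data = {str(i): i + 1 for i in range(1_000_000)}

# One-time cost at initialization: O(k) for k new words (here k = 2).
nlp_data.update({"srai": 1, "metaroom": 1})

# Per-word cost during correction: an average O(1) hash lookup,
# independent of how many entries the dictionary holds.
print("srai" in nlp_data)   # True
print("zzzz" in nlp_data)   # False
```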

@David-Baron
Copy link
Author

@filyp You are right, this works like a charm!
