
Ignore word function #53

Closed
David-Baron opened this issue Dec 28, 2022 · 10 comments

Comments

@David-Baron

Hello.
I can't find a function to ignore a word, and in some cases we need one.
An example with the English dictionary:
autocorrect changes srai -> sry
but I need srai left as-is.

@David-Baron
Author

David-Baron commented Dec 29, 2022

Something like:

# Proposed patch to Speller in autocorrect/__init__.py (re, Word,
# word_regexes and load_from_tar are already imported there).
class Speller:
    def __init__(
        self, lang="en", threshold=0, nlp_data=None, fast=False,
        only_replacements=False, ignore=None
    ):
        self.lang = lang
        self.threshold = threshold
        self.nlp_data = load_from_tar(lang) if nlp_data is None else nlp_data
        self.fast = fast
        self.only_replacements = only_replacements
        # avoid a mutable default argument; a set also gives O(1) lookups
        self.ignore = set(ignore) if ignore else set()

        if threshold > 0:
            # print(f'Original number of words: {len(self.nlp_data)}')
            self.nlp_data = {k: v for k, v in self.nlp_data.items() if v >= threshold}
            # print(f'After applying threshold: {len(self.nlp_data)}')

    def existing(self, words):
        """{'the', 'teh'} => {'the'}"""
        return {word for word in words if word in self.nlp_data}

    def get_candidates(self, word):
        w = Word(word, self.lang, self.only_replacements)
        if self.fast:
            candidates = self.existing([word]) or self.existing(w.typos()) or [word]
        else:
            candidates = (
                self.existing([word])
                or self.existing(w.typos())
                or self.existing(w.double_typos())
                or [word]
            )
        return [(self.nlp_data.get(c, 0), c) for c in candidates]

    def autocorrect_word(self, word):
        """most likely correction for everything up to a double typo"""
        if word == "":
            return ""

        # leave words the user asked to ignore untouched
        if word in self.ignore:
            return word

        candidates = self.get_candidates(word)

        # in case the word is capitalized
        if word[0].isupper():
            decapitalized = word[0].lower() + word[1:]
            candidates += self.get_candidates(decapitalized)

        best_word = max(candidates)[1]

        if word[0].isupper():
            best_word = best_word[0].upper() + best_word[1:]
        return best_word

    def autocorrect_sentence(self, sentence):
        return re.sub(
            word_regexes[self.lang],
            lambda match: self.autocorrect_word(match.group(0)),
            sentence,
        )

    __call__ = autocorrect_sentence

It runs, but there is no test (I don't know how to write one and have no time at the moment).

If a developer more qualified than me could write the unit test and open a PR, thank you.
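A standalone sketch of the behavior the `ignore` parameter adds (the `MiniSpeller` class, the toy frequency dict, and the one-letter "correction" rule here are all hypothetical simplifications, not the real autocorrect internals):

```python
# Standalone sketch of the proposed `ignore` parameter.
class MiniSpeller:
    def __init__(self, nlp_data, ignore=None):
        self.nlp_data = nlp_data
        self.ignore = set(ignore or [])  # set -> O(1) membership test

    def autocorrect_word(self, word):
        if word in self.ignore:  # the user asked to leave this word alone
            return word
        if word in self.nlp_data:  # already a known word
            return word
        # toy "correction": most frequent known word differing by one letter
        candidates = [
            w for w in self.nlp_data
            if len(w) == len(word)
            and sum(a != b for a, b in zip(w, word)) == 1
        ]
        return max(candidates, key=self.nlp_data.get) if candidates else word

print(MiniSpeller({"sry": 100}).autocorrect_word("sra"))  # corrected to "sry"
print(MiniSpeller({"sry": 100}, ignore=["sra"]).autocorrect_word("sra"))  # stays "sra"
```

The ignore check runs before any candidate generation, so ignored words cost only one set lookup.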

@filyp
Owner

filyp commented Dec 31, 2022

Yeah, the code looks legit, but I also don't have the time to write the tests and all.

Thanks for figuring this out. I'm leaving the issue open in case someone else has a similar use case.

@filyp
Owner

filyp commented Dec 31, 2022

Ah, I remembered there actually is already a way to ignore words (although a bit roundabout).

The nlp_data parameter lets you pass your own word-frequency dictionary. If you want to use the default dictionary but ignore a few words, you can modify that dictionary: assign some non-zero frequency to the words you wish to ignore. (I know this is quite a hacky way to do it; your implementation is cleaner.)

@David-Baron
Author

@filyp Yes indeed, but I find that approach a little too cumbersome, since the goal is only to ignore a few words. In addition, it modifies the base data, which is a problem when using autocorrect in several projects.

@filyp
Owner

filyp commented Jan 3, 2023

I didn't mean modifying files but something like:

spell = Speller()
spell.nlp_data.update(words_to_ignore_dict)
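For completeness, a runnable sketch of that workaround; the frequency dict here is a stand-in for the one the package loads, and the words are illustrative. With the real package you would call `spell.nlp_data.update(...)` the same way:

```python
# Stand-in for Speller's word-frequency dictionary (the real one is
# loaded by the autocorrect package when Speller is created).
nlp_data = {"the": 23135851162, "sry": 100}

# Map each word to ignore to a non-zero frequency; the speller then
# treats it as a known word and never "corrects" it.
words_to_ignore = ["srai", "kiyaverse"]
nlp_data.update({w: 1 for w in words_to_ignore})

print("srai" in nlp_data and nlp_data["srai"] > 0)  # True: now a "known" word
```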

@David-Baron
Author

It's a possibility; the difference is that you have to subtract the ignored words from the nlp_data array, which I think means longer processing time. (not tested)

@charlietiwari

Hi, can you please help me add some words? The words below are not spell-checked correctly. I also tried adding them as per your code above, but it is not working. Could you add some words to your vocabulary, or suggest a tested method to add these words?

metaverse
kiyaverse
metachamber
metaroom

Thanks

@David-Baron
Author

@charlietiwari
I don't think this belongs to this issue.

In addition, kiyaverse, metachamber, and metaroom are company-specific names, so it seems to me they will never be added to a dictionary. Metaverse, however, will certainly be added soon, as it enters more and more into the common language.
See the readme https://github.com/filyp/autocorrect#custom-word-sets for the correction of your particular words.

Please create a new issue if your question does not match the current one.

@filyp
Owner

filyp commented Jan 9, 2023

@David-Baron ah, no, you don't subtract but rather add them to this nlp_data. This way, they are treated as real words and not corrected. Regarding performance, modifying a dictionary is pretty efficient: it has complexity O(n), where n is the number of new entries (here, words). And you do it just once, during initialization of the Speller. In later usage, there should be no increase in processing time, because dictionary lookup time doesn't depend on the number of items in the dictionary.
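A quick sketch of that point (the dictionary size and words are illustrative): the update is a one-time O(k) cost for k new words, and later membership checks are average O(1) hash lookups regardless of how large the dictionary is:

```python
# Stand-in frequency dict with many entries, to make the size point.
nlp_data = {str(i): i + 1 for i in range(1_000_000)}

# One-time cost at initialization: O(k) for k new words (here k = 2).
nlp_data.update({"srai": 1, "metaroom": 1})

# Per-word cost during correction: an average O(1) hash lookup,
# independent of how many entries the dictionary holds.
print("srai" in nlp_data)   # True
print("zzzz" in nlp_data)   # False
```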

@David-Baron
Copy link
Author

@filyp You are right, this works like a charm!
