"to and fro" is correct #410

Open
EdwardBetts opened this issue Mar 22, 2018 · 19 comments
Assignees
Labels
dictionary Changes to the dictionary

Comments

@EdwardBetts
Contributor

codespell suggests replacing "fro" with "for". Can we have an exception for the phrase "to and fro"?

https://en.wiktionary.org/wiki/to_and_fro

@larsoner
Member

Sure, feel free to add it

@EdwardBetts
Contributor Author

Thanks @larsoner. I couldn't find an existing mechanism to specify multi-word exceptions like this. Does it exist and I've missed it, or would it need to be added to handle this case?

@larsoner
Member

Usually people just add the word to the list like:

fro->for, fro

Right @luzpaz ?

@larsoner
Member

(I don't think there is a multi-word way in particular, I think it's okay to just use this single-word one in the meantime)

@luzpaz
Collaborator

luzpaz commented Mar 23, 2018

Actually, you're missing a comma at the end:
fro->for, fro,

Sorry, did you mean something like:
https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary.txt#L655
fro->for, fro is correct if it's in the context of 'to and fro'
something like that ?
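
To make the entry format under discussion concrete, here is a minimal sketch of how a line such as `fro->for, fro,` could be read. The `parse_entry` helper and the `Entry` type are hypothetical names used for illustration, not codespell's actual parser, and the sketch assumes the entry carries no free-text reason after the last comma.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    typo: str
    candidates: List[str]
    self_correcting: bool  # True when the typo is also offered as a correction


def parse_entry(line: str) -> Entry:
    """Split 'typo->a, b,' into the typo and its candidate corrections."""
    typo, arrow, rest = line.strip().partition("->")
    if not arrow:
        raise ValueError(f"not a dictionary entry: {line!r}")
    typo = typo.strip()
    # A trailing comma leaves an empty last piece, which is dropped here.
    candidates = [c.strip() for c in rest.split(",") if c.strip()]
    return Entry(typo, candidates, typo in candidates)


print(parse_entry("fro->for, fro,"))
print(parse_entry("preform->perform"))
```

With this reading, `fro->for, fro,` is flagged as self-correcting because `fro` appears among its own candidates, while `preform->perform` is not.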

@peternewman
Collaborator

So I've just sort of hit another of these, preform->perform. Well actually in this case it was a typo, but preform is a valid word too. But there are others like Cristal.

cristal->crystal, cristal,

Essentially, we currently have some entries where the misspelling is also listed as a valid correction. There are two schools of thought here, I guess: first, that it's nonsense and we should remove the entry; or, more considered, that cristal is most likely a misspelling, but in some (rare) circumstances you may really mean it (see also fro, until we support multi-word corrections, #255).

What's the general feeling? I've not used interactive mode, but a "did you really mean?" prompt there, and in manual/automatic mode an option to skip/ignore all potential ones, seems like it might be sensible. Essentially, treat likely typos differently from definite typos.

@larsoner
Member

I've not used interactive mode, but a "did you really mean?" prompt there,

That is more or less what these entries seem to do. I agree we could add a parameter to be more suggestive (default) or less suggestive (if the "error" is in the list of corrections, don't prompt or report)

@peternewman
Collaborator

I guess one of the things I've always liked about Codespell is the fact that the dictionary is a curated list of typos, rather than a list of valid words, and hence doesn't normally trip up on valid but obscure/technical words. It feels to me that adding valid (although admittedly rarely used) words to the dictionary goes against that ethos. That holds even if the other variant (i.e. the "typo") is listed as a correction, and even more so when it's not.

@larsoner
Member

We probably need a new argument for this, that (I agree) should disable these by default.

--strict LEVEL, where for now 0 means include all and 1 (default) means exclude such self corrections? That leaves the option open for other sorts of strictness types later.
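
A minimal sketch of what the proposed option could do, assuming the entries are already loaded into a dict that maps each typo to its list of candidates. `--strict` is only a proposal in this thread; `filter_entries` and the meaning of each level are hypothetical and simply follow the description above.

```python
from typing import Dict, List


def filter_entries(entries: Dict[str, List[str]], strict: int = 1) -> Dict[str, List[str]]:
    """Drop self-correcting entries unless strict == 0 (hypothetical semantics)."""
    if strict == 0:
        return dict(entries)  # level 0: include everything
    # Level 1 (proposed default): skip entries that list the typo as a valid correction.
    return {typo: fixes for typo, fixes in entries.items() if typo not in fixes}


entries = {
    "fro": ["for", "fro"],
    "cristal": ["crystal", "cristal"],
    "preform": ["perform"],
}
print(sorted(filter_entries(entries, strict=0)))
print(sorted(filter_entries(entries, strict=1)))
```

Under these assumptions, only `preform` would still be reported at the default level.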

@larsoner
Member

If so, any volunteer to implement this?

@peternewman
Collaborator

Would a bitmask make more sense, in case someone doesn't want one of the future strict levels? I'll have to pass on implementing it for now.

It probably also needs a dictionary test that we don't just have wit->wit, i.e. there is at least one alternative replacement.
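
Such a dictionary test could look roughly like the sketch below. It assumes the same typo-to-candidates mapping as before; `entries_without_alternative` is a hypothetical helper, not part of codespell's test suite.

```python
from typing import Dict, List


def entries_without_alternative(entries: Dict[str, List[str]]) -> List[str]:
    """Return typos whose candidate list offers nothing other than the typo itself."""
    return [
        typo
        for typo, fixes in entries.items()
        if not any(fix != typo for fix in fixes)
    ]


bad = entries_without_alternative({"wit": ["wit"], "fro": ["for", "fro"]})
print(bad)
```

A test would then assert that the returned list is empty for the shipped dictionary.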

@larsoner
Member

Yes the idea is to use binary values so it would be a bitmask (it's just a trivial, future compatible one for now)
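
One way to picture the bitmask idea is with Python's `enum.IntFlag`. This is only a sketch: the `Strict` class, its single bit, and `should_report` are hypothetical names, and codespell may well use plain integers instead.

```python
import enum


class Strict(enum.IntFlag):
    """Hypothetical strictness bits; only one is defined for now."""

    NONE = 0
    SKIP_SELF_CORRECTIONS = 1  # proposed default
    # Future checks would take 2, 4, 8, ... so they can be combined freely.


def should_report(typo, fixes, strict=Strict.SKIP_SELF_CORRECTIONS):
    """Report a typo unless the self-correction bit is set and applies."""
    if strict & Strict.SKIP_SELF_CORRECTIONS and typo in fixes:
        return False
    return True


print(should_report("fro", ["for", "fro"]))
print(should_report("fro", ["for", "fro"], Strict.NONE))
```

Because each check owns one bit, a user could later switch off a single future check without affecting the others.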

@peternewman
Collaborator

Not that this covers multi-word examples, but the other rare stuff can go in https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary_rare.txt

@bl-ue bl-ue self-assigned this Jun 13, 2021
@yarikoptic
Contributor

I was able to work around this limitation (see https://framagit.org/medoc92/recoll/-/merge_requests/23#note_1999939) via

ignore-regex = \bto and fro\b

I think it would be valuable to collect and support such phrases, which should be whitelisted even though the individual word (fro) should still be considered a typo. I can't recall any others at the moment, but I remember hitting them.
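
The effect of that pattern can be checked with Python's `re` module. This is only a sketch of the idea: blanking the matched phrase before looking for the typo stands in for whatever codespell does internally with `ignore-regex`, and `find_fro` is a hypothetical helper.

```python
import re

IGNORE = re.compile(r"\bto and fro\b")
FRO = re.compile(r"\bfro\b")


def find_fro(text: str) -> list:
    """Return offsets of a standalone 'fro' outside the ignored phrase."""
    # Replace each ignored match with spaces so offsets stay unchanged.
    masked = IGNORE.sub(lambda m: " " * len(m.group()), text)
    return [m.start() for m in FRO.finditer(masked)]


print(find_fro("pacing to and fro"))
print(find_fro("a gift fro you"))
```

The first call finds nothing to report, while the second still flags the stray `fro`.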

@DimitriPapadopoulos
Collaborator

DimitriPapadopoulos commented Aug 16, 2023

A good idea indeed, however:

  1. Currently codespell splits text into words before processing them. It does not support n-grams.
  2. The dictionary of typos does not support spaces in possible typos.

Item 1 looks like the most complex to address – but then I haven't done my homework. Nowadays, I am not certain it is useful to start supporting n-grams without using deep learning to process them.

@yarikoptic
Contributor

couldn't it be just pretty much "pre-feed ignore-regex with all the phrases surrounded with \b"?

@DimitriPapadopoulos
Collaborator

Do you mean you would add an invisible backspace to your text, just to please codespell?

@yarikoptic
Contributor

no, I mean that codespell could just pre-craft a regex for all the phrases. \b in Python's re module is a word boundary:

\b
Matches the empty string, but only at the beginning or end of a word.
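
As a rough sketch of that suggestion, the phrases could be joined into a single alternation wrapped in `\b` and removed before the text is split into words. `PHRASES`, `build_phrase_regex`, and `words_to_check` are hypothetical names, and the word-splitting pattern is a simplification rather than codespell's own.

```python
import re

PHRASES = ["to and fro"]  # hypothetical allow-list of multi-word phrases


def build_phrase_regex(phrases):
    """Join escaped phrases into one alternation wrapped in word boundaries."""
    # Longest first, so a longer phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")


def words_to_check(text, phrase_regex):
    """Drop allow-listed phrases, then split what is left into words."""
    return re.findall(r"[A-Za-z']+", phrase_regex.sub(" ", text))


regex = build_phrase_regex(PHRASES)
print(words_to_check("It swung to and fro, fro no reason.", regex))
```

The `fro` inside the phrase never reaches the word list, while the standalone `fro` does and can still be flagged.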

@DimitriPapadopoulos
Collaborator

So codespell would apply a limited set of regexes for very common expressions of this kind, prior to splitting the text into words, removing the matched words from further checks. I suspect this would have a perceptible impact on performance, but I can't tell how maintainers would react to it.


7 participants