Allow for pluggable spell-checking modelled after Linter #74

dmoonfire · 2015-07-20T18:13:16Z

I wasn't entirely sure where the best place to put this was, but this seems like the most likely one. I'm curious about thoughts and designs on making spell-check pluggable, much like how linter provides a framework for linting and different lint packages provide the actual processing. This is related to using other dictionaries, but it is more than just choosing a different file.

This would allow supporting Emacs-style "LocalWords" in a file or directory-specific word lists (both via Atom packages). The latter is important to me because I write novels and short stories. In most of my novels, there are hundreds of project-specific words that I don't want in my system dictionary but I do want checked into Git (saves me adding the words back in every time my machine explodes). I also have genre and world-specific lists ("mage" for example for fantasy genre, "Fedran" for my world).

For Emacs, I wrote caspell which got me the bulk of this functionality but I'd like to switch over to Atom. I already miss it.

After working with the linter framework, it seems like the same could be done for spell-check. Create a framework that gathers up all the packages that provide the spell-check service, and then query them to see if a given word is correct in any of them. If it isn't, then gather up suggestions with a given weight and display the top X unique values. Most of the spell-check already provides that; I'd just like a framework for writing my per-directory word lists that coordinates with the system dictionary.
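A minimal sketch of that query-then-suggest flow, in TypeScript. Everything here is hypothetical and not taken from spell-check or linter: the SpellingProvider shape, the wordListProvider helper, the naive first-letter suggestion rule, and the assumption that a higher weight ranks first.

```typescript
// Hypothetical shapes for illustration only.
interface Suggestion {
  word: string;
  weight: number;
}

interface SpellingProvider {
  isCorrect(word: string): boolean;
  suggest(word: string): Suggestion[];
}

// Builds a provider from a plain word list. The suggestion rule (same first
// letter) is a naive stand-in for real suggestion logic.
function wordListProvider(words: string[], weight: number): SpellingProvider {
  return {
    isCorrect: (word: string) => words.indexOf(word) >= 0,
    suggest: (word: string) =>
      words
        .filter((w) => w.charAt(0) === word.charAt(0))
        .map((w) => ({ word: w, weight: weight })),
  };
}

// A word is correct if ANY provider accepts it. Otherwise, collect the
// suggestions from every provider, rank by weight (higher first, an
// assumption), drop duplicates, and keep the top `limit`.
function checkWord(
  word: string,
  providers: SpellingProvider[],
  limit: number
): { correct: boolean; suggestions: string[] } {
  if (providers.some((p) => p.isCorrect(word))) {
    return { correct: true, suggestions: [] };
  }
  let all: Suggestion[] = [];
  providers.forEach((p) => {
    all = all.concat(p.suggest(word));
  });
  all.sort((a, b) => b.weight - a.weight);
  const unique: string[] = [];
  all.forEach((s) => {
    if (unique.indexOf(s.word) < 0) unique.push(s.word);
  });
  return { correct: false, suggestions: unique.slice(0, limit) };
}
```

With this shape, a per-directory word list and the system dictionary are just two entries in the providers array, and neither needs to know about the other.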

Linter could also be used for spell-checking in general and I considered writing the framework in that to just call spell-check, but I wasn't sure how closely you'd want those packages tied together. Linter doesn't have the ability to display fixes (Correct Spelling..., context menu), but it is possible to add that if that would be a better framework to consider.

These are other things I'm aiming for in the future, which may or may not be applicable for the framework (but probably not spell-check directly):

It could also provide dictionary and thesaurus features. With word lists, this wouldn't be available (maybe via a providesSpelling or providesDictionary function), but that would allow for things like dictionary.com or Google lookup of definitions.
Also, a "describe this" function that brings up information about the word (that would probably be a different framework though). I'd love to have a Control-F1 (Visual Studio's Describe This) that brought up information about the item under the cursor.

This is something I want, so I'm willing to do coding toward it. But, better to get feedback before blindly writing something. :)

dmoonfire · 2016-01-18T20:00:23Z

I started messing with this, mainly because I really want to use it for some of my upcoming projects that I'm stalling on because I don't want to use Emacs (it just isn't pretty enough anymore).

Looking at some of the other items (#11 and #21 in particular), I can see two ways of doing this, but I'm not entirely sure which one would make sense for the long run. In both cases, I'm suggesting making spell-check provider-based, so other packages provide the actual spell checking (and so I can also have local-words, project-specific, and package-specific dictionaries).

The reason I'm having the system dictionary as a separate package is that there are places where I think it is good not to have the system dictionary involved (some publishers/customers dictate which dictionary is allowed and you don't want pollution; translators might want to see only the language they are translating to).

One Package Per Language

The first approach is to create one apm package per language and then let the user decide which ones to use. So they would install spell-check-en-us and spell-check-de and it would do checking against both. That way, users who have multiple dictionaries can just install the ones they care about. In the code, it would just be using the spellchecker module with different dictionary paths, which is a tad inefficient, but it would let us use language-specific dictionaries that don't use Hunspell (I believe there are Gaelic and Polish ones that don't use that library).

Also, I'm proposing multiple packages to handle the other word sources: spell-check-config (this is where I'd put the "github" and "Github" spellings along with my last name), spell-check-project, spell-check-file, and spell-check-fantasy. This would keep the code simple, since it only has to iterate through the plugins to get providers.

The drawback of this approach is that it would be harder to use system-specific dictionaries unless you also added a spell-check-system which guessed at the user's dictionary.

This would also make adding words easier. With separate packages, I think we could have some dictionaries that accept added words and others that don't (package-based ones), with each provider giving a flag for that (canAddWords).

One Package for Spellchecker

The second approach is to have a single package, spell-check-system, with a configuration that lists all the languages that should be checked and that instantiates a dictionary instance for each one. This would make it easier to support the system user's language.

It would just be more complicated code to maintain and may add complexity to everything else.

Considerations

One of the goals is to let someone turn off a dictionary for a given project. So, if I have the German, English, and fantasy dictionaries involved, I want to be able to turn off any or all of them depending on the needs of the project. I figured the default would be "use all of them" unless there is something to turn it off. (Or have a config for each package in the first option that determines if it is automatic or not).

Not for this, I was thinking a separate APM project that provides feedback (spell-check-project).

etiktin · 2016-01-18T20:17:31Z

The first approach sounds better to me.

dmoonfire · 2016-01-21T18:08:31Z

I figured it was a good time for an update with my work over at dmoonfire/spell-check. I will be squashing the commits before I submit the PR, but I'm a very noisy/frequent committer.

The system can now identify incorrect words from multiple system dictionaries. I implemented spell-check-en-us and spell-check-de-de but then realized it was creating a lot of noise without a lot of gain. This version uses a configuration setting to get a list of languages and then creates the plugin objects and adds them. If someone wants to have multiple dictionaries (related to issue #11) or a different default (#21), they just add them as a comma-separated list in the config (e.g., "en-US, de-DE"). If we decide to come up with a different, specialized dictionary, we can remove it from that list and add it as a separate package and it will also Just Work™ since they are all funneled through the same logic.
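As a rough illustration of the comma-separated setting, the parsing step might look like the sketch below. The parseLocales name is made up, and trimming, skipping empty entries, and dropping duplicates are all assumptions rather than what the package is known to do.

```typescript
// Hypothetical helper: turns a setting such as "en-US, de-DE" into a clean
// list of locale names, one per dictionary to load.
function parseLocales(setting: string): string[] {
  const locales: string[] = [];
  setting.split(",").forEach((part) => {
    const locale = part.trim();
    // Skip empty entries (e.g. a trailing comma) and repeated locales so
    // the same dictionary is not loaded twice.
    if (locale !== "" && locales.indexOf(locale) < 0) {
      locales.push(locale);
    }
  });
  return locales;
}
```

Each resulting locale would then get its own checker object, which is what lets a specialized dictionary move out of the list and into its own package later.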

Dictionaries can now be positive matches (ignore words) or negative matches (incorrect words).

The drawback of using a lot of dictionaries is the ~500 ms/dictionary loading time. I haven't solved that yet, but I have some ideas. It does use the listener, so it reloads the dictionaries when those configuration values change. It doesn't recheck the document yet.

I also have it so the user can list dictionary paths. I don't have the Windows one in yet and I haven't tested the Windows 8 logic, but Linux works pretty well with /usr/share/hunspell and /usr/share/myspell/dicts.

I'm also using navigator.language as a default, which may pick up the user's native language (related to issue #44). I'll have to test it better, but I'm hoping it works. Unfortunately, I use the default of en-US myself, so this will be harder to test.

I have a second plugin for ignoring known words. We had "GitHub" and "github" both listed in the single dictionary. That is now a configuration option, so anyone can add other entries, such as their name (most dictionaries don't like my last name of "Moonfire"). The ignore is based on regexp, so if you put "GitHub" it converts it into /GitHub/, if you put "/GitHub/i" it will use that. All lowercase entries are automatically case insensitive. I haven't tested the performance with large files yet.
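The entry-to-regexp rules described above could be sketched like this. The toIgnorePattern name is hypothetical, and escaping regex metacharacters in plain entries is an assumption (the comment only says "GitHub" becomes /GitHub/).

```typescript
// Hypothetical helper that converts one "known words" entry into a RegExp.
function toIgnorePattern(entry: string): RegExp {
  // Entries written as /pattern/flags are used as given. Only the common
  // flag letters are accepted here; anything else is treated as plain text.
  const literal = /^\/(.+)\/([gimsuy]*)$/.exec(entry);
  if (literal) {
    return new RegExp(literal[1] || "", literal[2] || "");
  }
  // Plain entries: escape metacharacters (an assumption) so "a.b" only
  // matches a literal dot.
  const escaped = entry.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // All-lowercase entries are case insensitive; anything else is exact.
  const flags = entry === entry.toLowerCase() ? "i" : "";
  return new RegExp(escaped, flags);
}
```

Note that the patterns are unanchored, as in the /GitHub/ example, so a real checker would likely test them against a single word rather than a whole line.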

Current plans until this weekend when I have to actually use it:

The suggestions aren't working yet, but those shouldn't be too hard.
Doing project-level settings (spell-check-project) isn't working; this is what I need for this weekend.
Add to dictionary is a per-plugin implementation, but not finished.
Lots of refactoring, cleaning, and code documentation.
Unit tests, mainly because I haven't figured out the Atom/Coffeescript way of doing them.
Documentation on how to use all of this.

dmoonfire · 2016-01-22T20:55:48Z

Finished my development for the week, so here is the status until I can wander back (hopefully in a week or so). I'm behind with deadlines, but I got the two packages up to the point I can write a bunch of words and see where it's painful to use.

Projects

Project plugin (dmoonfire/spell-check-project) is now functional. This uses a language.json file in the project root.

{
  "localWords": [
    "word",
    "/wordmustbelowercase/",
    "/wordCanBeLowercase/i"
  ]
}

The project files are aware of the multiple project paths, so a given file will use its own project's language.json instead of the one from another loaded project path. If there is no language.json, the plugin will automatically disable itself.
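One way to pick the right project for a file is sketched below. It is not the plugin's code: findProjectRoot is a made-up name, paths are assumed to use "/" separators, and choosing the longest matching root for nested projects is an assumption.

```typescript
// Hypothetical helper: returns the project path that contains filePath, or
// null when the file sits outside every loaded project.
function findProjectRoot(filePath: string, projectPaths: string[]): string | null {
  let best: string | null = null;
  for (let i = 0; i < projectPaths.length; i++) {
    const root = projectPaths[i] || "";
    if (root === "") continue;
    // Compare against "root/" so "/home/u/novel" does not claim files that
    // live under "/home/u/novel-notes".
    const prefix = root.charAt(root.length - 1) === "/" ? root : root + "/";
    if (filePath.indexOf(prefix) !== 0) continue;
    // Prefer the deepest root when projects are nested (an assumption).
    if (best === null || root.length > best.length) {
      best = root;
    }
  }
  return best;
}
```

The language.json to read would then be the one directly under the returned root, and a null result corresponds to the plugin disabling itself for that file.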

If language.json changes, it will be reloaded on the next spell check.

Rewriting language.json uses tabs; I need to figure out a generic way to determine the writer's preferred settings.

Suggestions

The biggest improvement is that suggestions now work across all dictionaries. They are gathered from every dictionary that provides suggestions and then interspersed based on each plugin's getPriority(): number result. The order is priority + index of result, so if you have ignore words (priority 10), project (priority 25), en-US (priority 100), and de-DE (priority 100), the ignore word suggestions will show up first, then the project ones (assuming fewer than 75 project suggestions), and then en-US and de-DE will alternate their suggestions. This prevents the less desired en-US options from all coming before the preferred de-DE ones.
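The "priority + index" ordering can be sketched as follows. The RankedSource shape and mergeSuggestions name are hypothetical, and removing duplicates while relying on a stable sort for ties is an assumption about how ties are handled.

```typescript
// Hypothetical shape: one entry per plugin, carrying its getPriority()
// value and its suggestions in the plugin's own order.
interface RankedSource {
  priority: number;
  suggestions: string[];
}

// Ranks every suggestion by (plugin priority + position in that plugin's
// list), lowest first, and removes duplicates. Sources with equal priority
// end up alternating because their suggestions share the same ranks.
function mergeSuggestions(sources: RankedSource[]): string[] {
  const ranked: { word: string; rank: number }[] = [];
  sources.forEach((source) => {
    source.suggestions.forEach((word, index) => {
      ranked.push({ word: word, rank: source.priority + index });
    });
  });
  // Array.prototype.sort is stable in current JavaScript engines, so ties
  // keep the order in which the sources were listed.
  ranked.sort((a, b) => a.rank - b.rank);
  const merged: string[] = [];
  ranked.forEach((entry) => {
    if (merged.indexOf(entry.word) < 0) merged.push(entry.word);
  });
  return merged;
}
```

Under this scheme a low-priority-number plugin only stays ahead of the system dictionaries while its list is shorter than the gap between the priorities, which is where the "fewer than 75 project suggestions" caveat comes from.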

Suggestions for regex-based items (ignoreWords and project) will fake what will be replaced with the suggestion. Eventually, it should do the Emacs thing (if the compared word starts with a capital, make the suggestion start with one too). Right now, whatever you put in the regex is given as the replacement.

Both ignoreWords and the project dictionary use natural to calculate the Jaro–Winkler string distance so only "similar" words are suggested. I picked an arbitrary distance of 0.90 or higher (1.00 is exact match, 0 is non-match).
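For readers unfamiliar with the metric, here is a self-contained sketch of a Jaro-Winkler filter with a 0.90 cutoff. It is a hand-rolled stand-in written for illustration, not the natural library's implementation, and the function names are made up.

```typescript
// Jaro similarity: 1 for identical strings, 0 when no characters match.
function jaro(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  // Characters only count as matching within this distance of each other.
  const range = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aHit: boolean[] = [];
  const bHit: boolean[] = [];
  for (let i = 0; i < a.length; i++) aHit.push(false);
  for (let j = 0; j < b.length; j++) bHit.push(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const hi = Math.min(b.length - 1, i + range);
    for (let j = Math.max(0, i - range); j <= hi; j++) {
      if (!bHit[j] && a.charAt(i) === b.charAt(j)) {
        aHit[i] = true;
        bHit[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that line up in a different order.
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aHit[i]) continue;
    while (!bHit[k]) k++;
    if (a.charAt(i) !== b.charAt(k)) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;
  return (matches / a.length + matches / b.length + (matches - transpositions) / matches) / 3;
}

// Winkler adjustment: boost the score for a shared prefix of up to four
// characters, using the conventional scaling factor of 0.1.
function jaroWinkler(a: string, b: string): number {
  const base = jaro(a, b);
  let prefix = 0;
  while (
    prefix < 4 &&
    prefix < a.length &&
    prefix < b.length &&
    a.charAt(prefix) === b.charAt(prefix)
  ) {
    prefix++;
  }
  return base + prefix * 0.1 * (1 - base);
}

// Keeps only the candidates that score at or above the threshold.
function similarWords(word: string, candidates: string[], threshold: number): string[] {
  return candidates.filter((candidate) => jaroWinkler(word, candidate) >= threshold);
}
```

A call such as similarWords(misspelled, projectWords, 0.9) would then play the role of the "similar words only" filter described above.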

Adding

While I wasn't planning on adding to the system dictionary, both the ignoreWords and project allow adding to their dictionaries. In both cases, the option to do so shows up as the last few items in the suggestion list (such as "Add to Project (case-sensitive)") in italic. If that is selected, it will either add it to the spell-check config for ignoreWords, or update the language.json file which is reloaded on the next check.

In both cases, the file is not rechecked for spelling because I haven't figured out how to do it yet.

Performance

It still adds a reasonable amount of time to the startup, about 250 ms + 400 ms/dictionary.

With 89 project words and editing a 6k word file on my laptop, performance was pretty reasonable (no really obvious delays or slowdowns) at about 80 wpm. It also handled accented characters fairly well, which is good because I use them heavily in my novels.

* Changed the package to allow for external packages to provide additional checking. (Closes atom#74)
  - Disabled the task-based handling because of passing plugins.
  - Two default plugins are included: system-based dictionaries and "known words".
  - Suggestions and "add to dictionary" are also provided via interfaces. (Closes atom#11)
  - Modified various calls so they are aware of where the buffer is located.
* Modified system to allow for multiple plugins/checkers to identify correctness.
  - Incorrect words must be incorrect for all checkers.
  - Any checker that treats a word as valid is considered valid for the buffer.
* Extracted system-based dictionary support into separate checker.
  - System dictionaries can now check across multiple system locales.
  - Locale selection can be changed via package settings. (Closes atom#21)
  - External search paths can be used for Linux and OS X.
  - Default language is based on Chromium settings.
* Extracted hard-coded approved list into a separate checker.
  - User can add additional "known words" via settings.
  - Added an option to add more known words via the suggestion dialog.
* Updated ignore files and added EditorConfig settings for development.
* Various coffee-centric formatting.

* Changed the package to allow for external packages to provide additional checking. (Closes atom#74)
  - Disabled the task-based handling because of passing plugins.
  - Two default plugins are included: system-based dictionaries and "known words".
  - Suggestions and "add to dictionary" are also provided via interfaces. (Closes atom#10)
  - Modified various calls so they are aware of where the buffer is located.
* Modified system to allow for multiple plugins/checkers to identify correctness.
  - Incorrect words must be incorrect for all checkers.
  - Any checker that treats a word as valid is considered valid for the buffer.
* Extracted system-based dictionary support into separate checker.
  - System dictionaries can now check across multiple system locales.
  - Locale selection can be changed via package settings. (Closes atom#21)
  - Multiple locales can be selected. (Closes atom#11)
  - External search paths can be used for Linux and OS X.
  - Default language is based on the process environment, with a fallback to the browser, before finally using `en-US` as a fallback.
* Extracted hard-coded approved list into a separate checker.
  - User can add additional "known words" via settings.
  - Added an option to add more known words via the suggestion dialog.
* Updated ignore files and added EditorConfig settings for development.
* Various coffee-centric formatting.

rugk mentioned this issue Sep 6, 2015

Use linter? #80

Open

izuzak added the enhancement label Sep 13, 2015

dmoonfire mentioned this issue Mar 12, 2016

Changed spell-checking to be plugin-based. #120

Merged

as-cii closed this as completed in 8b7ab9c Aug 19, 2016
