
Jupyter notebook filter to only spell check inside cell inputs #2138

Open
psychemedia opened this issue Nov 6, 2021 · 15 comments

Comments

@psychemedia

The Jupyter notebook .ipynb document format combines markdown text, code, and code outputs (embedded images, data tables) inside a JSON document structure.

Spell checking notebooks with cleared output cells generally works fine, but in some workflows it may be more convenient to spell check notebooks that have already been run. However, testing cell outputs may generate a large number of essentially false positive spelling errors.

It would be useful to be able to invoke a codespell switch (for example, --ipynb) that would ensure that only cell inputs are spell checked. (The nbformat package provides a parser that can be used to parse notebooks.)

Notebook cells may also include cell tag metadata. It might also be useful to be able to specify cell tags to control the spell checking at both cell input and output level. For example:

  • codespell-ignore: ignore this cell;
  • codespell-check-output: also check the output of this (code) cell.
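The tag handling proposed above could be sketched roughly as follows. This is a hypothetical illustration, not codespell code: the tag names (`codespell-ignore`, `codespell-check-output`) and the function `extract_checkable_text` are only the suggestion from this issue, and plain `json` is used for brevity where `nbformat` would also validate the notebook schema.

```python
import json

def extract_checkable_text(nb_json):
    """Hypothetical sketch: return only the text codespell should see,
    honouring the suggested cell tags."""
    spans = []
    for cell in json.loads(nb_json).get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if "codespell-ignore" in tags:
            continue  # skip the whole cell
        spans.append("".join(cell.get("source", [])))
        # Only check outputs when the author explicitly opts in.
        if cell.get("cell_type") == "code" and "codespell-check-output" in tags:
            for out in cell.get("outputs", []):
                spans.append("".join(out.get("data", {}).get("text/plain", [])))
    return spans

# Minimal notebook: a markdown cell plus an ignored code cell.
nb_json = json.dumps({"cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["Some *speling* example"]},
    {"cell_type": "code", "metadata": {"tags": ["codespell-ignore"]},
     "source": ["0xDEADBEEF"], "outputs": []},
]})
print(extract_checkable_text(nb_json))  # only the markdown source survives
```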
@matkoniecz
Contributor

However, testing cell outputs may generate a large number of essentially false positive spelling errors

For example? In which situations would there be something detectable as a misspelling? Is it about binary data?

@peternewman
Collaborator

For example? In which situations would there be something detectable as a misspelling? Is it about binary data?

Not from the looks of things, it seems to be a way to document and run example code, so presumably the documentation, the code and the inputs to the code could all have typos.

I wonder if this is better written into the nbformat package as an extension, with some sort of API provided by codespell, rather than expecting codespell to import and handle the limitless number of document formats.

@matkoniecz
Contributor

presumably the documentation, the code and the inputs to the code could all have typos

But in such a case it would be beneficial to catch them with codespell, right?

If I had an example in the documentation then, except in rare cases, I would want to find typos there as well.

@peternewman
Collaborator

Yes, but I think the point was there might be stuff you don't want to check (e.g. actual hex). I don't really know anything about Jupyter notebook other than 2 minutes on their website (and that they can't spell Jupiter... 😆 ).

@psychemedia
Author

The Jupyter notebook .ipynb file type is a JSON document with a cell-based structure (https://nbformat.readthedocs.io/en/latest/format_description.html#cell-types).

Markdown cells contain markdown source that is rendered as HTML by the notebook UI. Code cells have a cell input that contains source code, and cell outputs that contain content relating to the rendering of an object returned from the last line of the executed code, including data tables, object descriptions, images, embedded audio or video files, HTML files, etc.

Notebooks can be saved in a state where all the cell outputs are cleared, so you don't get any false positives from codespell finding hashes in megabytes of raw image output. But sometimes it's more convenient to be able to run a spell checker over a notebook that does include cell outputs. In which case, it would just be much cleaner to run codespell over content you know is code or markdown (the code and markdown cell inputs) and not any other cruft that happens to be in the .ipynb file.
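The "cleared outputs" workflow described above amounts to a small transformation of the notebook JSON. As a minimal sketch (not codespell's or nbformat's own code; `clear_outputs` is a hypothetical helper name), it looks like:

```python
# Drop outputs and execution counts from every code cell so only
# author-written text remains to be spell checked.
def clear_outputs(nb):
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# A one-cell notebook with a stored output, as a plain dict.
nb = {"cells": [{"cell_type": "code", "execution_count": 3,
                 "source": ["1 + 1"],
                 "outputs": [{"data": {"text/plain": ["2"]}}]}]}
cleared = clear_outputs(nb)
```

In practice the same effect is usually achieved with existing tooling (e.g. `jupyter nbconvert --clear-output`) rather than hand-rolled code.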

@psychemedia
Author

Re: "can't spell Jupiter" - it's a portmanteau: Ju-Py-te-R, representing the three language kernels supported by the original IPython notebooks when the project was renamed. (There are now kernels for many languages folk might want to use.)

@peternewman
Collaborator

Thanks for the background @psychemedia . Although personally I don't think my opinion of how best to implement this has changed:

I wonder if this is better written into the nbformat package as an extension, with some sort of API provided by codespell, rather than expecting codespell to import and handle the limitless number of document formats.

There is a near-infinite number of formats which may want special treatment (such as #2135). Pushing them all into the core codespell means more stuff to support when people might not have the experience of, let alone use, those tools, as well as providing lots of bloat or dependencies when people just want to spellcheck a plain text file.

@psychemedia
Author

@peternewman Understood. Is there an example anywhere of writing an extension for codespell?

@peternewman
Collaborator

@peternewman Understood. Is there an example anywhere of writing an extension for codespell?

Not currently; no such thing exists yet.

I pondered a bit more in here:
#2135 (comment)

Perhaps start with a little tool that can iterate all the cells in a notebook via nbformat and output their text? As a hacky version you could just pass that text into codespell via STDIN and go from there (depending on how much you want a solution to the problem versus a full tool).
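Such a filter could be as small as the following sketch. This is a hypothetical `filter.py`, not an existing tool: it uses plain `json` for brevity where `nbformat` would also validate the notebook, and simply emits every cell's source (markdown and code alike) to stdout.

```python
#!/usr/bin/env python3
"""Hypothetical filter.py sketch: read a .ipynb on stdin and write only
the cell sources to stdout, so codespell never sees cell outputs."""
import json
import sys

def cell_sources(nb):
    # A cell's "source" may be a single string or a list of lines.
    for cell in nb.get("cells", []):
        src = cell.get("source", [])
        yield src if isinstance(src, str) else "".join(src)

if __name__ == "__main__":
    print("\n".join(cell_sources(json.load(sys.stdin))))
```

It could then be piped through codespell in exactly the `cat foo.ipynb | python3 filter.py | codespell -` fashion.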

Also what would you want to do about config in a standalone tool?

The code which actually does the checking is here, I guess we need to turn that into a standalone function.

for i, line in enumerate(lines):
    if line in exclude_lines:
        continue
    fixed_words = set()
    asked_for = set()
    # If all URI spelling errors will be ignored, erase any URI before
    # extracting words. Otherwise, apply ignores after extracting words.
    # This ensures that if a URI ignore word occurs both inside a URI and
    # outside, it will still be a spelling error.
    if "*" in uri_ignore_words:
        line = uri_regex.sub(' ', line)
    check_words = extract_words(line, word_regex, ignore_word_regex)
    if "*" not in uri_ignore_words:
        apply_uri_ignore_words(check_words, line, word_regex,
                               ignore_word_regex, uri_regex,
                               uri_ignore_words)
    for word in check_words:
        lword = word.lower()
        if lword in misspellings:
            context_shown = False
            fix = misspellings[lword].fix
            fixword = fix_case(word, misspellings[lword].data)
            if options.interactive and lword not in asked_for:
                if context is not None:
                    context_shown = True
                    print_context(lines, i, context)
                fix, fixword = ask_for_word_fix(
                    lines[i], word, misspellings[lword],
                    options.interactive)
                asked_for.add(lword)
            if summary and fix:
                summary.update(lword)
            if word in fixed_words:  # can skip because of re.sub below
                continue
            if options.write_changes and fix:
                changed = True
                lines[i] = re.sub(r'\b%s\b' % word, fixword, lines[i])
                fixed_words.add(word)
                continue
            # otherwise warning was explicitly set by interactive mode
            if (options.interactive & 2 and not fix and not
                    misspellings[lword].reason):
                continue
            cfilename = "%s%s%s" % (colors.FILE, filename, colors.DISABLE)
            cline = "%s%d%s" % (colors.FILE, i + 1, colors.DISABLE)
            cwrongword = "%s%s%s" % (colors.WWORD, word, colors.DISABLE)
            crightword = "%s%s%s" % (colors.FWORD, fixword, colors.DISABLE)
            if misspellings[lword].reason:
                if options.quiet_level & QuietLevels.DISABLED_FIXES:
                    continue
                creason = " | %s%s%s" % (colors.FILE,
                                         misspellings[lword].reason,
                                         colors.DISABLE)
            else:
                if options.quiet_level & QuietLevels.NON_AUTOMATIC_FIXES:
                    continue
                creason = ''
            # If we get to this point (uncorrected error) we should change
            # our bad_count and thus return value
            bad_count += 1

@peternewman
Collaborator

@psychemedia , I wonder if I've found an even easier way to integrate the two while allowing cross-language working...

See:
#1147 (comment)

@psychemedia
Author

Ah, interesting... so just extract comments and then spell check those. I think my original motivation was to be able to make use of the Jupyter notebook structure when trying to spell check .ipynb docs.

The approach I use now is just to clear cell outputs and then spell check the notebook as is. An alternative approach is to convert .ipynb to a text format (e.g. markdown or .py) using jupytext and then spell check that. (That gets rid of all the cell outputs as well as the JSON structure in the .ipynb.) Your newly suggested approach could improve that text pipeline route: .ipynb -> .py -> extract comments -> codespell

@peternewman
Collaborator

Ah, interesting... so just extract comments and then spell check those.

I'd think of it more as removing stuff that isn't comments rather than extracting comments. If you're able to strip just the stuff that would trip up codespell, but keep the comments in exactly the same places, then with some fairly minor fudging of codespell or its output you could use the annotation info coming out as-is: my main draw was that if codespell said an error was on line 5, it would be in exactly that place in the original notebook.

I think my original motivation was to be able to make use of the Jupyter notebook structure when trying to spell check .ipynb docs.

I don't really follow this; would it give you something more human-friendly to edit within the notebook itself, like one question per section (or whatever is relevant to these)?

The approach I use now is just to clear cell outputs and then spell check the notebook as is. An alternative approach is to convert .ipynb to a text format (e.g. markdown or .py) using jupytext and then spell check that. (That gets rid of all the cell outputs as well as the JSON structure in the .ipynb.) Your newly suggested approach could improve that text pipeline route: .ipynb -> .py -> extract comments -> codespell

In the short term, if you write something that takes a .ipynb file as input on stdin and returns the cleaned version on stdout, then for now you can do something like:
cat foo.ipynb | python3 filter.py | codespell -

It will say:
speling ==> spelling

And you'll get some useful output, and then with fairly minimal changes to codespell we could let you do something like:
codespell --filters="ipynb=python3 filter.py" foo.ipynb

And it would instead say:
foo.ipynb:42: speling ==> spelling

Which obviously doesn't make much difference for a single file, but would be more important when you're checking a whole folder of them!

@psychemedia
Author

psychemedia commented Dec 6, 2021

Re: "more human friendly", this is the view I tend to edit files in, e.g. in VS Code:

[screenshot: VS Code notebook view]

or in Jupyter notebook:

[screenshot: Jupyter notebook view]

Using the jupytext extension, I can edit light- or percent-format .py files in a notebook UI, so, for example, comments can be styled in markdown cells and code split out into code cells.

@psychemedia
Author

psychemedia commented Dec 6, 2021

Re: the pipeline - yep, one approach would be to do something like:

cat $MY_NB_PATH | jupytext --from ipynb --to py:percent | codespell -

Though the line numbers aren't totally meaningful without access to the (dynamically created) .py file.

@EwoutH

EwoutH commented Oct 18, 2022

I also have an issue where Jupyter notebook images become corrupted by codespell. In this commit several images were modified and broken when using codespell v2.2.1.

A solution where codespell ignores content in "data": {"image/*"} fields by default would be ideal.
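The suggested default could be implemented as a pre-filter on the notebook JSON. As a hypothetical sketch (`strip_image_data` is an illustrative name, not an existing codespell or nbformat function), blanking the `image/*` payloads before any write-changes run would protect them from corruption:

```python
# Blank out any image/* payloads in cell outputs so a write-changes run
# of a spell checker cannot corrupt the base64 image data.
def strip_image_data(nb):
    for cell in nb.get("cells", []):
        for out in cell.get("outputs", []):
            data = out.get("data", {})
            for mime in list(data):
                if mime.startswith("image/"):
                    data[mime] = ""
    return nb

# One code cell whose output carries both a PNG and a text repr.
nb = {"cells": [{"cell_type": "code", "outputs": [
    {"data": {"image/png": "iVBORw0KGgo...", "text/plain": ["<Figure>"]}}]}]}
stripped = strip_image_data(nb)
```

Only the binary payloads are dropped; textual output such as `text/plain` is left untouched and could still be checked.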
