
Jupyter notebook filter to only spell check inside cell inputs #2138

Open
psychemedia opened this issue Nov 6, 2021 · 15 comments

Comments

@psychemedia

The Jupyter notebook .ipynb document format combines markdown text, code, and code outputs (embedded images, data tables) inside a JSON document structure.

Spell checking notebooks with cleared output cells generally works fine, but in some workflows it may be more convenient to spell check notebooks that have already been run. However, testing cell outputs may generate a large number of essentially false positive spelling errors.

It would be useful to be able to invoke a codespell switch (for example, --ipynb) that would ensure that only cell inputs are spell checked. (The nbformat package provides a parser that can be used to parse notebooks.)

Notebook cells may also include cell tag metadata. It might also be useful to be able to specify cell tags to control the spell checking at both cell input and output level. For example:

  • codespell-ignore: ignore this cell;
  • codespell-check-output: also check the output of this (code) cell.
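The tag handling proposed above could be sketched roughly as follows. This is a hypothetical illustration, not codespell code: the tag names (`codespell-ignore`, `codespell-check-output`) and the function `extract_checkable_text` are only the suggestion from this issue, and plain `json` is used for brevity where `nbformat` would also validate the notebook schema.

```python
import json

def extract_checkable_text(nb_json):
    """Hypothetical sketch: return only the text codespell should see,
    honouring the suggested cell tags."""
    spans = []
    for cell in json.loads(nb_json).get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if "codespell-ignore" in tags:
            continue  # skip the whole cell
        spans.append("".join(cell.get("source", [])))
        # Only check outputs when the author explicitly opts in.
        if cell.get("cell_type") == "code" and "codespell-check-output" in tags:
            for out in cell.get("outputs", []):
                spans.append("".join(out.get("data", {}).get("text/plain", [])))
    return spans

# Minimal notebook: a markdown cell plus an ignored code cell.
nb_json = json.dumps({"cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["Some *speling* example"]},
    {"cell_type": "code", "metadata": {"tags": ["codespell-ignore"]},
     "source": ["0xDEADBEEF"], "outputs": []},
]})
print(extract_checkable_text(nb_json))  # only the markdown source survives
```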
@matkoniecz
Contributor

However, testing cell outputs may generate a large number of essentially false positive spelling errors

For example? In which situations would there be something detectable as a misspelling? Is it about binary data?

@peternewman
Collaborator

For example? In which situations would there be something detectable as a misspelling? Is it about binary data?

Not from the looks of things, it seems to be a way to document and run example code, so presumably the documentation, the code and the inputs to the code could all have typos.

I wonder if this is better written into the nbformat package as an extension, with some sort of API provided by codespell, rather than expecting codespell to import and handle the limitless number of document formats.

@matkoniecz
Contributor

presumably the documentation, the code and the inputs to the code could all have typos

But in such a case it would be beneficial to catch them with codespell, right?

If I had an example in the documentation then, except in rare cases, I would want to find typos there as well.

@peternewman
Collaborator

Yes, but I think the point was there might be stuff you don't want to check (e.g. actual hex). I don't really know anything about Jupyter notebook other than 2 minutes on their website (and that they can't spell Jupiter... 😆 ).

@psychemedia
Author

The Jupyter notebook .ipynb file type is a JSON document with a cell-based structure (https://nbformat.readthedocs.io/en/latest/format_description.html#cell-types).

Markdown cells contain markdown source that is rendered as HTML by the notebook UI. Code cells have a cell input that contains source code, and cell outputs that contain content relating to the rendering of an object returned from the last line of the executed code, including data tables, object descriptions, images, embedded audio or video files, HTML files, etc.

Notebooks can be saved in a state where all the cell outputs are cleared, so you don't get any false positives from codespell finding hashes in megabytes of raw image output. But sometimes it's more convenient to be able to run a spell checker over a notebook that does include cell outputs. In which case, it would just be much cleaner to run codespell over content you know is code or markdown (the code and markdown cell inputs) and not any other cruft that happens to be in the .ipynb file.
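The "cleared outputs" workflow described above amounts to a small transformation of the notebook JSON. As a minimal sketch (not codespell's or nbformat's own code; `clear_outputs` is a hypothetical helper name), it looks like:

```python
# Drop outputs and execution counts from every code cell so only
# author-written text remains to be spell checked.
def clear_outputs(nb):
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# A one-cell notebook with a stored output, as a plain dict.
nb = {"cells": [{"cell_type": "code", "execution_count": 3,
                 "source": ["1 + 1"],
                 "outputs": [{"data": {"text/plain": ["2"]}}]}]}
cleared = clear_outputs(nb)
```

In practice the same effect is usually achieved with existing tooling (e.g. `jupyter nbconvert --clear-output`) rather than hand-rolled code.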

@psychemedia
Author

Re: "can't spell Jupiter" - it's a portmanteau: Ju-Py-te-R, representing the three language kernels supported by the original IPython notebooks when the project was renamed. (There are now kernels for many languages folk might want to use.)

@peternewman
Collaborator

Thanks for the background @psychemedia . Although personally I don't think my opinion of how best to implement this has changed:

I wonder if this is better written into the nbformat package as an extension, with some sort of API provided by codespell, rather than expecting codespell to import and handle the limitless number of document formats.

There is a near-infinite number of formats which may want special treatment (such as #2135). Pushing them all into the core codespell means more stuff to support when people might not have the experience of, let alone use, those tools, as well as providing lots of bloat or dependencies when people just want to spellcheck a plain text file.

@psychemedia
Author

@peternewman Understood. Is there an example anywhere of writing an extension for codespell?

@peternewman
Collaborator

@peternewman Understood. Is there an example anywhere of writing an extension for codespell?

Not currently; no such thing exists yet.

I pondered a bit more in here:
#2135 (comment)

Perhaps start with a little tool that can iterate all the cells in a notebook via nbformat and output their text? As a hacky version you could just pass that text into codespell via STDIN and go from there (depending on how much you want a solution to the problem versus a full tool).
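Such a filter could be as small as the following sketch. This is a hypothetical `filter.py`, not an existing tool: it uses plain `json` for brevity where `nbformat` would also validate the notebook, and simply emits every cell's source (markdown and code alike) to stdout.

```python
#!/usr/bin/env python3
"""Hypothetical filter.py sketch: read a .ipynb on stdin and write only
the cell sources to stdout, so codespell never sees cell outputs."""
import json
import sys

def cell_sources(nb):
    # A cell's "source" may be a single string or a list of lines.
    for cell in nb.get("cells", []):
        src = cell.get("source", [])
        yield src if isinstance(src, str) else "".join(src)

if __name__ == "__main__":
    print("\n".join(cell_sources(json.load(sys.stdin))))
```

It could then be piped through codespell in exactly the `cat foo.ipynb | python3 filter.py | codespell -` fashion.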

Also what would you want to do about config in a standalone tool?

The code which actually does the checking is here, I guess we need to turn that into a standalone function.

for i, line in enumerate(lines):
    if line in exclude_lines:
        continue
    fixed_words = set()
    asked_for = set()
    # If all URI spelling errors will be ignored, erase any URI before
    # extracting words. Otherwise, apply ignores after extracting words.
    # This ensures that if a URI ignore word occurs both inside a URI and
    # outside, it will still be a spelling error.
    if "*" in uri_ignore_words:
        line = uri_regex.sub(' ', line)
    check_words = extract_words(line, word_regex, ignore_word_regex)
    if "*" not in uri_ignore_words:
        apply_uri_ignore_words(check_words, line, word_regex,
                               ignore_word_regex, uri_regex,
                               uri_ignore_words)
    for word in check_words:
        lword = word.lower()
        if lword in misspellings:
            context_shown = False
            fix = misspellings[lword].fix
            fixword = fix_case(word, misspellings[lword].data)
            if options.interactive and lword not in asked_for:
                if context is not None:
                    context_shown = True
                    print_context(lines, i, context)
                fix, fixword = ask_for_word_fix(
                    lines[i], word, misspellings[lword],
                    options.interactive)
                asked_for.add(lword)
            if summary and fix:
                summary.update(lword)
            if word in fixed_words:  # can skip because of re.sub below
                continue
            if options.write_changes and fix:
                changed = True
                lines[i] = re.sub(r'\b%s\b' % word, fixword, lines[i])
                fixed_words.add(word)
                continue
            # otherwise warning was explicitly set by interactive mode
            if (options.interactive & 2 and not fix and not
                    misspellings[lword].reason):
                continue
            cfilename = "%s%s%s" % (colors.FILE, filename, colors.DISABLE)
            cline = "%s%d%s" % (colors.FILE, i + 1, colors.DISABLE)
            cwrongword = "%s%s%s" % (colors.WWORD, word, colors.DISABLE)
            crightword = "%s%s%s" % (colors.FWORD, fixword, colors.DISABLE)
            if misspellings[lword].reason:
                if options.quiet_level & QuietLevels.DISABLED_FIXES:
                    continue
                creason = " | %s%s%s" % (colors.FILE,
                                         misspellings[lword].reason,
                                         colors.DISABLE)
            else:
                if options.quiet_level & QuietLevels.NON_AUTOMATIC_FIXES:
                    continue
                creason = ''
            # If we get to this point (uncorrected error) we should change
            # our bad_count and thus return value
            bad_count += 1

@peternewman
Collaborator

@psychemedia , I wonder if I've found an even easier way to integrate the two while allowing cross-language working...

See:
#1147 (comment)

@psychemedia
Author

Ah, interesting... so just extract comments and then spell check those. I think my original motivation was to be able to make use of the Jupyter notebook structure when trying to spell check .ipynb docs.

The approach I use now is just to clear cell outputs and then spell check the notebook as is. An alternative approach is to convert .ipynb to a text format (e.g. markdown or .py) using jupytext and then spell check that. (That gets rid of all the cell outputs as well as the JSON structure in the .ipynb.) Your newly suggested approach could improve that text pipeline route: .ipynb -> .py -> extract comments -> codespell

@peternewman
Collaborator

Ah, interesting... so just extract comments and then spell check those.

I'd think of it more as removing stuff that isn't comments rather than extracting comments. If you're able to strip just the stuff that would trip up codespell, but keep the comments in exactly the same places, then with some fairly minor fudging of codespell or its output you could use the annotation info coming out as-is: my main draw was that if codespell said an error was on line 5, it would be in exactly that place in the original notebook.

I think my original motivation was to be able to make use of the Jupyter notebook structure when trying to spell check .ipynb docs.

I don't really follow this; would it give you something more human-friendly to edit within the notebook itself, like one question per section (or whatever is relevant to these)?

The approach I use now is just to clear cell outputs and then spell check the notebook as is. An alternative approach is to convert .ipynb to a text format (e.g. markdown or .py) using jupytext and then spell check that. (That gets rid of all the cell outputs as well as the JSON structure in the .ipynb.) Your newly suggested approach could improve that text pipeline route: .ipynb -> .py -> extract comments -> codespell

In the short term, if you write something that takes a .ipynb file as input on stdin and returns the cleaned version on stdout, then for now you can do something like:
cat foo.ipynb | python3 filter.py | codespell -

It will say:
speling ==> spelling

And you'll get some useful output, and then with fairly minimal changes to codespell we could let you do something like:
codespell --filters="ipynb=python3 filter.py" foo.ipynb

And it would instead say:
foo.ipynb:42: speling ==> spelling

Which obviously doesn't make much difference for a single file, but would be more important when you're checking a whole folder of them!

@psychemedia
Author

psychemedia commented Dec 6, 2021

Re: "more human friendly", this is the view I tend to edit files in, e.g. in VS Code:

[screenshot: VS Code notebook view]

or in Jupyter notebook:

[screenshot: Jupyter notebook view]

Using the jupytext extension, I can edit light- or percent-format .py files in a notebook UI, so, for example, comments can be styled in markdown cells and code split out into code cells.

@psychemedia
Author

psychemedia commented Dec 6, 2021

Re: the pipeline - yep, one approach would be to do something like:

cat $MY_NB_PATH | jupytext --from ipynb --to py:percent | codespell -

Though the line numbers aren't totally meaningful without access to the (dynamically created) .py file.

@EwoutH

EwoutH commented Oct 18, 2022

I also have an issue where Jupyter notebook images become corrupted by codespell. In this commit several images were modified and broken when using codespell v2.2.1.

A solution where codespell ignores content in "data": {"image/*"} fields by default would be ideal.
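The suggested default could be implemented as a pre-filter on the notebook JSON. As a hypothetical sketch (`strip_image_data` is an illustrative name, not an existing codespell or nbformat function), blanking the `image/*` payloads before any write-changes run would protect them from corruption:

```python
# Blank out any image/* payloads in cell outputs so a write-changes run
# of a spell checker cannot corrupt the base64 image data.
def strip_image_data(nb):
    for cell in nb.get("cells", []):
        for out in cell.get("outputs", []):
            data = out.get("data", {})
            for mime in list(data):
                if mime.startswith("image/"):
                    data[mime] = ""
    return nb

# One code cell whose output carries both a PNG and a text repr.
nb = {"cells": [{"cell_type": "code", "outputs": [
    {"data": {"image/png": "iVBORw0KGgo...", "text/plain": ["<Figure>"]}}]}]}
stripped = strip_image_data(nb)
```

Only the binary payloads are dropped; textual output such as `text/plain` is left untouched and could still be checked.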
