-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jupyter notebook filter to only spell check inside cell inputs #2138
Comments
For example? In which situation there will be something detectable as misspelling? Is it about binary data? |
Not from the looks of things, it seems to be a way to document and run example code, so presumably the documentation, the code and the inputs to the code could all have typos. I wonder if this is better written into the nbformat package as an extension, with some sort of API provided by codespell, rather than expecting codespell to import and handle the limitless number of document formats. |
But in such case it would be beneficial to catch them with codespell, right? If I would have example in documentation then except rare cases I would find typos also there. |
Yes, but I think the point was there might be stuff you don't want to check (e.g. actual hex). I don't really know anything about Jupyter notebook other than 2 minutes on their website (and that they can't spell Jupiter... 😆 ). |
The Jupyter notebook Mardown cells contain markdown source that is rendered as HTML by the notebook UI. Code cells have a cell input that contains source code, and cell outputs that contain content relating to a rendering of an object returned from the last line of the executed code, including data tables, object descriptions, images, embedded audio or video files, HTML files etc. Notebooks can be saved in a state where all the cell outputs are cleared, so you don't get any false positives from codespell finding hashes in megabytes of raw image output. But sometimes it's more convenient to be able to run a spell checker over a notebook that does include cell outputs. In which case, it would just be much cleaner to run |
Re: "can't spell Jupiter" - it's a portmanteau: |
Thanks for the background @psychemedia . Although personally I don't think my opinion of how best to implement this has changed:
There is a near-infinite number of formats which may want special treatment (such as #2135). Pushing them all into the core codespell means more stuff to support when people might not have the experience of, let alone use, those tools, as well as providing lots of bloat or dependencies when people just want to spellcheck a plain text file. |
@peternewman Understood. Is there an example anywhere of writing an extension for |
Not currently, as such a thing doesn't currently exist. I pondered a bit more in here: Perhaps start with a little tool that can iterate all the cells in a notebook via nbformat and output their text? As a hacky version you could just pass that text into codespell via STDIN and go from there (depending on how much you want a solution to the problem versus a full tool... Also what would you want to do about config in a standalone tool? The code which actually does the checking is here, I guess we need to turn that into a standalone function. codespell/codespell_lib/_codespell.py Lines 640 to 712 in 99f39bd
|
@psychemedia , I wonder if I've found an even easier way to integrate the two while allowing cross-language working... See: |
Ah, insteresting.. so just extract comments and then spell check those. I think my original motivation was to be able to make use of the Jupyter notebook structure when trying to spell check The approach I use now is to just to clear cell outputs and then spell check the notebook as is. An alternative approach is to convert |
I'd think of it more as remove stuff that isn't comments rather than extract comments. If you're able to just strip the stuff that will trip up codespell, but keep the comments in exactly the same places, then with some fairly minor fudging of codespell or it's output, my main draw was you could use the annotation info coming out as-is, i.e. if codespell said it was on line 5, it would be in exactly that place in the original notebook.
I don't really follow this, would it provide something more human friendly to edit within the notebook itself, like example one question section (or whatever is relevant to these)?
In the short term, if you write something that takes a .ipynb file as input on stdin and returns the cleaned version on stdout, then for now you can do something like: It will say: And you'll get some useful output, and then with fairly minimal changes to codespell we could let you do something like: And it would instead say: Which obviously doesn't make much difference for a single file, but would be more important when you're checking a whole folder of them! |
Re: the pipeline - yep, one approach would be to do s/thing like:
Though the line numbers aren't totally meanigful without access to the (dynamically created) |
I also have an issue where Jupyter Notebook images become corrupted due to codespell. In this commit serveral images are modified and breaking when using Codespell v2.2.1. A solution where codespell ignores content in |
The Jupyter notebook
.ipynb
document format combines markdown text, code, and code outputs (embedded images, data tables) inside a JSON document structure.Spelling checking notebooks with cleared output cells generally works fine, but in some workflows it may be more convenient to spell check run notebooks. However, testing cell outputs may generate a large number of essentially false positive spelling errors.
It would be useful to be able to invoke a
codespell
switch (for example,--ipynb
) that would ensure that only cell inputs are spell checked. (Thenbformat
package provides a parser that can be used to parse notebooks.)Notebook cells may also include cell tag metadata. It might also be useful to be able to specify cell tags to control the spell checking at both cell input and output level. For example:
codespell-ignore
: ignore this cell;codespell-check-output
: also check the output of this (code) cell.The text was updated successfully, but these errors were encountered: