-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor SourceKind
to store file content
#6640
Conversation
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
PR Check ResultsEcosystem✅ ecosystem check detected no changes. BenchmarkLinux
Windows
|
use crate::jupyter::Notebook; | ||
|
||
#[derive(Clone, Debug, PartialEq, is_macro::Is)] | ||
pub enum SourceKind { | ||
Python, | ||
Python(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to store the PySourceType
on the SourceKind
as well (making it a Program
or SourceFile
entity). However, it is a bit awkward because PySourceType
has a Jupyter
variant, but that variant should only be used to Jupyter
...
@@ -63,7 +63,7 @@ pub struct FixerResult<'a> { | |||
/// The result returned by the linter, after applying any fixes. | |||
pub result: LinterResult<(Vec<Message>, Option<ImportMap>)>, | |||
/// The resulting source code, after applying any fixes. | |||
pub transformed: Cow<'a, str>, | |||
pub transformed: Cow<'a, SourceKind>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change here is that the linter now returns the transformed SourceKind
which aligns the handling of Notebooks with regular python files in the sense that it doesn't mutate the source file in place.
@@ -374,7 +374,7 @@ pub fn lint_only( | |||
&directives, | |||
settings, | |||
noqa, | |||
source_kind, | |||
Some(source_kind), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this is optional, but wanted to avoid increasing the scope of this even more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this is optional is because there's no --add-noqa
support for Jupyter Notebooks yet. The logic for add_noqa_to_path
basically calls into check_path
and as far as I recall there were some circular dependency problems. Maybe it could be solved now with the recent changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could add_noqa_to_path
detect the PySourceType
and exit early (with a warning) if it is a Jupyter notebook to avoid that all downstream code has to handle optional source kinds?
pub(crate) fn updated(&self, new_source: String, source_map: &SourceMap) -> Self { | ||
match self { | ||
SourceKind::Jupyter(notebook) => { | ||
let mut cloned = notebook.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This for sure is more expensive than the old "update in place".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my main reason to update in place was to avoid cloning the entire notebook. Actually, the only thing we need to update with each linting cycle are the cell markers and the transformed source code can be looped over to the next iteration cycle. The cell content can be updated at the end before we actually update the Notebook that too only when --fix
is passed.
Using the above idea, we could create a NotebookView
which would only contain the necessary fields (for now I think it's cell markers only and the concatenated source code?) and should be much cheaper to clone. We would not even clone it and use an updated
method on the view with the SourceMap
and transformed source code and create a new view with the updated source code and cell markers. Once the linter iteration is complete, we can apply the view onto the Notebook, update the cell content then and write the Notebook back to the disk.
I think the abstraction would be to have a view into the source code be it Python or Jupyter and the view would have updated
method to update the view. At the end the view will be applied to the concrete type and saved onto the disk.
This is just a rough idea based on my morning walk :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I'm able to follow. Do you want to take a stab at this together with performing the renaming that you suggested? Or is this something that needs changing before landing this PR?
I tried to change raw
to an Rc
to get cheap cloning. However, it turns out that update_cell_content
mutates the raw cells. The only larger structure that doesn't seem to get updated is valid_code_cells
.
I think we could also move out index
from the Notebook
because it is only used in Diagnostics
and instead have a JupyterIndex
that is either Lazy(Notebook)
or Built(Index)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to take a stab at this together with performing the renaming that you suggested?
Yeah, I can give it a go next week.
Or is this something that needs changing before landing this PR?
No
The only larger structure that doesn't seem to get updated is
valid_code_cells
.
What do you mean here by "larger structure"? As valid_code_cells
is just a vector of u32
.
However, it turns out that
update_cell_content
mutates the raw cells.
Yeah, this update isn't really required for every linter loop. We can avoid doing that and only update it at the end. This basically uses the concatenated source code and cell markers to update the cell content.
I think we could also move out
index
from theNotebook
because it is only used inDiagnostics
and instead have aJupyterIndex
that is eitherLazy(Notebook)
orBuilt(Index)
.
Yeah, this is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean here by "larger structure"? As valid_code_cells is just a vector of u32.
With large, I mean any non-fixed size heap allocated data structure that requires a deep clone.
@@ -64,7 +64,7 @@ pub(crate) struct Diagnostics { | |||
pub(crate) messages: Vec<Message>, | |||
pub(crate) fixed: FxHashMap<String, FixTable>, | |||
pub(crate) imports: ImportMap, | |||
pub(crate) source_kind: FxHashMap<String, SourceKind>, | |||
pub(crate) notebooks: FxHashMap<String, Notebook>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable. So this allows us to keep the string content on SourceKind::Python
, since we no longer store the Python source kind on here anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm supportive of this. It's really nice to avoid the &mut
on the source kind.
@dhruvmanila any opinion on this? Any recommendations on how to test the changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making these changes :)
In the near future we should move towards finalizing an abstraction layer between different sources and the engines (linter, formatter, etc.).
pub(crate) fn updated(&self, new_source: String, source_map: &SourceMap) -> Self { | ||
match self { | ||
SourceKind::Jupyter(notebook) => { | ||
let mut cloned = notebook.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my main reason to update in place was to avoid cloning the entire notebook. Actually, the only thing we need to update with each linting cycle are the cell markers and the transformed source code can be looped over to the next iteration cycle. The cell content can be updated at the end before we actually update the Notebook that too only when --fix
is passed.
Using the above idea, we could create a NotebookView
which would only contain the necessary fields (for now I think it's cell markers only and the concatenated source code?) and should be much cheaper to clone. We would not even clone it and use an updated
method on the view with the SourceMap
and transformed source code and create a new view with the updated source code and cell markers. Once the linter iteration is complete, we can apply the view onto the Notebook, update the cell content then and write the Notebook back to the disk.
I think the abstraction would be to have a view into the source code be it Python or Jupyter and the view would have updated
method to update the view. At the end the view will be applied to the concrete type and saved onto the disk.
This is just a rough idea based on my morning walk :)
Ok(test_contents(&SourceKind::Python(contents), &path, settings).0) | ||
} | ||
|
||
pub(crate) struct TestedNotebook { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good. Thanks for doing this :)
My main concern would be to test the linter loop (> 2 iterations) with Cellsimport random
import math try:
print()
except ValueError:
pass import random
import pprint
random.randint(10, 20) foo = 1
if foo is 2:
raise ValueError(f"Invalid foo: {foo is 1}") Command:
|
Thanks @dhruvmanila for the extensive test plan. I played through the commands (showing diagnostics, diff, and fixing) and it works as expected |
77ca065
to
83eab8e
Compare
SourceKind
to store file content
fbdec00
to
a928045
Compare
a928045
to
49904dc
Compare
Summary
This PR changes two things. You may want to take a look at the individual commits. I'm not entirely convinced whether we should land this change and it certainly needs more testing:
Diagnostics
to store the notebooks rather than theSourceKind
. The sole reason is that we don't need the kind. We only need the notebook index to resolve cells. Ideally, this would be handled by a custom diagnostic kind, but we aren't there yetSourceKind
to store the content for bothNotebooks
and regular Python files. This allows us to alignlint_fix
to never return the input in place, and instead return the updated result as aCow<'a, SourceKind>
. I find this easier to reason about than having to remember that notebooks are updated in place, but regular source code is not. However, this has the downside that we now need to clone the Notebooks, which may be rather expensive (multiple large vecs).I'm looking for general feedback. Also happy if someone with more understanding on jupyter notebooks wants to take this over.
Test Plan