-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid cloning source code multiple times #6629
Conversation
PR Check ResultsEcosystem✅ ecosystem check detected no changes. BenchmarkLinux
Windows
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this should reduce memory usage quiet a bit and may even improve the cpython benchmark.
The new solution still feels a bit awkward to me. Maybe it is because I don't understand the separation between Sources
, SourceType
, and SourceKind
well enough. Maybe we don't have these abstractions right yet which makes this more complicated than it needs to be? Could it also help to avoid storing the Notebook
contents twice?
crates/ruff_cli/src/diagnostics.rs
Outdated
@@ -549,6 +521,102 @@ pub(crate) fn lint_stdin( | |||
}) | |||
} | |||
|
|||
#[derive(Debug)] | |||
struct Sources<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find Sources
a confusing name, because it only contains a single source file.
Could we maybe marry this with our SourceFile
implementation that also gives us cheap cloning?
crates/ruff_cli/src/diagnostics.rs
Outdated
} else { | ||
Ok(Sources { | ||
source_type, | ||
source_kind: SourceKind::Python, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we instead change SourceKind
to:
enum SourceKind<'a> {
Python(&'a str),
Notebook(Notebook)
}
impl SourceKind<'_> {
fn source_code(&self) -> &str {
match self {
Self::Python(source) => source,
Self::Notebook(notebook) => notebook.contents()
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using source_code
here (or source
) because I find contents
a too generic term (I can't even tell from its name if it is a string)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was my instinct too. I tried it but it didn’t work, because the Diagnostics struct includes the SourceKind so that it can reconstruct the cell ranges when it reports diagnostics at the end. So we’d have to add lifetimes to Diagnostics and then keep all the source code in memory for the life of the program.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I see. I have to take a closer look at this tomorrow. I wonder if SourceKind
(or something similar) should be a trait and / or if we can structure these types differently to also avoid the overlap between SourceKind
and SourceType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diagnostics only needs access to the index
though, not the file contents, which may help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's another source mapping (the reverse after applying fixes). It would be nice to have a more formal concept for that rather than special casing jupyter notebooks in the Emitter
. Because we'll have the same problem when linting markdown files: We need to map back the relative code block indices to the absolute file indices, potentially customizing the message.
It feels awkward to me too, but I didn’t want to do anything more invasive. I want to see if we can remove the contents from Notebook, perhaps, which would make the responsibilities clearer. |
ee92fff
to
7f35448
Compare
11dfb35
to
1554c36
Compare
1554c36
to
ccef889
Compare
@MichaReiser - I made some improvements based on your suggestions but LMK how you want to proceed (e.g., whether you want to spend time on this, or want me to do so, or want to merge and revisit later). |
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I like the improvements and the improved naming.
We should follow up on notebooks. I think it would be nice if SourceKind
could store the source code as well, to avoid passing two arguments everywhere. However, I didn't manage to do that because of some lifetime issues when fixing notebooks (and updating notebook indices)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks for doing this :)
Summary
In working on #6628, I noticed that we clone the source code contents, potentially multiple times, prior to linting. The issue is that
SourceKind::Python
takes aString
, so we first have to provide it with aString
. In the stdin case, that means cloning. However, on top of this, we then have to clonesource_kind.contents()
becauseSourceKind
gets mutated. So for stdin, we end up cloning twice. For non-stdin, we end up cloning once, but unnecessarily (since the contents don't get mutated, only the kind).This PR removes the
String
fromsource_kind
, instead requiring that we parse it out elsewhere. It reduces the number of clones down to 1 for Jupyter Notebooks, and zero otherwise.