Avoid cloning source code multiple times #6629

charliermarsh · 2023-08-16T19:25:52Z

Summary

In working on #6628, I noticed that we clone the source code contents, potentially multiple times, prior to linting. The issue is that SourceKind::Python takes a String, so we first have to provide it with a String. In the stdin case, that means cloning. However, on top of this, we then have to clone source_kind.contents() because SourceKind gets mutated. So for stdin, we end up cloning twice. For non-stdin, we end up cloning once, but unnecessarily (since the contents don't get mutated, only the kind).

This PR removes the String from source_kind, instead requiring that we parse it out elsewhere. It reduces the number of clones down to 1 for Jupyter Notebooks, and zero otherwise.
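A minimal sketch of the change described above, not the actual diff: `OwnedSourceKind`, the simplified `Notebook`, `count_long_lines`, `lint_owned`, and `lint_borrowed` are all hypothetical stand-ins. It only illustrates why an owning `Python(String)` variant forces a clone, while a data-free `Python` variant lets the caller lend out a `&str`.

```rust
/// Hypothetical stand-in for a parsed Jupyter notebook.
#[derive(Debug)]
struct Notebook {
    /// Concatenated source of all code cells.
    contents: String,
}

/// Old shape (sketch): the Python variant owns its text, so a caller that
/// already holds the contents has to clone them to build the value.
#[allow(dead_code)]
#[derive(Debug)]
enum OwnedSourceKind {
    Python(String),
    Notebook(Notebook),
}

/// New shape (sketch): the Python variant carries no text at all.
#[derive(Debug)]
enum SourceKind {
    Python,
    Notebook(Notebook),
}

/// Toy "lint": counts lines longer than `limit` characters.
fn count_long_lines(contents: &str, limit: usize) -> usize {
    contents
        .lines()
        .filter(|line| line.chars().count() > limit)
        .count()
}

/// Old call path: one allocation and copy just to construct the kind.
fn lint_owned(stdin: &str, limit: usize) -> usize {
    let kind = OwnedSourceKind::Python(stdin.to_owned());
    match &kind {
        OwnedSourceKind::Python(contents) => count_long_lines(contents, limit),
        OwnedSourceKind::Notebook(notebook) => count_long_lines(&notebook.contents, limit),
    }
}

/// New call path: the contents are borrowed; only notebooks bring their own text.
fn lint_borrowed(contents: &str, kind: &SourceKind, limit: usize) -> usize {
    match kind {
        SourceKind::Python => count_long_lines(contents, limit),
        SourceKind::Notebook(notebook) => count_long_lines(&notebook.contents, limit),
    }
}
```

Both paths return the same result for Python input; the borrowed path simply never copies the buffer.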

github-actions · 2023-08-16T19:59:55Z

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      4.0±0.16ms    10.3 MB/sec    1.04      4.1±0.20ms     9.8 MB/sec
formatter/numpy/ctypeslib.py               1.00   818.3±39.71µs    20.3 MB/sec    1.00   814.7±38.52µs    20.4 MB/sec
formatter/numpy/globals.py                 1.01     85.1±5.47µs    34.7 MB/sec    1.00     84.4±4.46µs    35.0 MB/sec
formatter/pydantic/types.py                1.00  1632.6±80.37µs    15.6 MB/sec    1.05  1716.3±137.12µs    14.9 MB/sec
linter/all-rules/large/dataset.py          1.00     12.9±0.52ms     3.1 MB/sec    1.02     13.2±0.53ms     3.1 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      3.4±0.11ms     4.8 MB/sec    1.00      3.4±0.12ms     4.9 MB/sec
linter/all-rules/numpy/globals.py          1.00   497.4±21.70µs     5.9 MB/sec    1.03   514.3±20.54µs     5.7 MB/sec
linter/all-rules/pydantic/types.py         1.00      6.8±0.31ms     3.7 MB/sec    1.00      6.8±0.26ms     3.7 MB/sec
linter/default-rules/large/dataset.py      1.00      6.8±0.22ms     6.0 MB/sec    1.01      6.9±0.26ms     5.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01  1481.6±59.80µs    11.2 MB/sec    1.00  1461.6±55.76µs    11.4 MB/sec
linter/default-rules/numpy/globals.py      1.00   185.1±10.97µs    15.9 MB/sec    1.02    188.6±8.34µs    15.6 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.1±0.14ms     8.4 MB/sec    1.00      3.1±0.11ms     8.4 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      4.1±0.05ms     9.9 MB/sec    1.01      4.2±0.10ms     9.7 MB/sec
formatter/numpy/ctypeslib.py               1.00   820.6±17.11µs    20.3 MB/sec    1.00   819.4±11.63µs    20.3 MB/sec
formatter/numpy/globals.py                 1.00     85.0±4.03µs    34.7 MB/sec    1.00     84.7±2.46µs    34.8 MB/sec
formatter/pydantic/types.py                1.00  1658.8±26.36µs    15.4 MB/sec    1.01  1680.5±28.82µs    15.2 MB/sec
linter/all-rules/large/dataset.py          1.01     12.8±0.23ms     3.2 MB/sec    1.00     12.7±0.11ms     3.2 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      3.5±0.07ms     4.8 MB/sec    1.00      3.5±0.03ms     4.8 MB/sec
linter/all-rules/numpy/globals.py          1.00   443.4±10.99µs     6.7 MB/sec    1.00    441.6±7.35µs     6.7 MB/sec
linter/all-rules/pydantic/types.py         1.00      6.5±0.09ms     3.9 MB/sec    1.00      6.6±0.10ms     3.9 MB/sec
linter/default-rules/large/dataset.py      1.00      7.0±0.06ms     5.8 MB/sec    1.00      7.0±0.06ms     5.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01  1491.0±18.54µs    11.2 MB/sec    1.00  1481.3±25.24µs    11.2 MB/sec
linter/default-rules/numpy/globals.py      1.01    177.3±4.66µs    16.6 MB/sec    1.00    175.9±6.45µs    16.8 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.1±0.03ms     8.2 MB/sec    1.00      3.1±0.05ms     8.2 MB/sec

MichaReiser

Nice, this should reduce memory usage quite a bit and may even improve the cpython benchmark.

The new solution still feels a bit awkward to me. Maybe it is because I don't understand the separation between Sources, SourceType, and SourceKind well enough. Maybe we don't have these abstractions right yet which makes this more complicated than it needs to be? Could it also help to avoid storing the Notebook contents twice?

MichaReiser · 2023-08-16T19:49:48Z

crates/ruff_cli/src/diagnostics.rs

@@ -549,6 +521,102 @@ pub(crate) fn lint_stdin(
    })
 }

+#[derive(Debug)]
+struct Sources<'a> {


I find Sources a confusing name, because it only contains a single source file.

Could we maybe marry this with our SourceFile implementation that also gives us cheap cloning?
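For readers unfamiliar with the "cheap cloning" idea, here is a rough sketch using a reference-counted buffer. `SharedSource` is a hypothetical type for illustration and is not Ruff's `SourceFile`; the assumption is only that a shared pointer such as `Arc<str>` sits behind the handle, so cloning bumps a counter instead of copying the text.

```rust
use std::sync::Arc;

/// Hypothetical source handle; not Ruff's actual `SourceFile` type.
#[derive(Debug, Clone)]
struct SharedSource {
    name: Arc<str>,
    code: Arc<str>,
}

impl SharedSource {
    /// Copies the text once, into a reference-counted allocation.
    fn new(name: &str, code: &str) -> Self {
        Self {
            name: Arc::from(name),
            code: Arc::from(code),
        }
    }

    fn name(&self) -> &str {
        &self.name
    }

    fn source_code(&self) -> &str {
        &self.code
    }

    /// True if both handles point at the same underlying text allocation.
    fn shares_buffer_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.code, &other.code)
    }
}
```

Cloning a `SharedSource` yields a second handle to the same allocation, which is what would let diagnostics hold on to a source without duplicating it.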

MichaReiser · 2023-08-16T20:00:40Z

crates/ruff_cli/src/diagnostics.rs

+    } else {
+        Ok(Sources {
+            source_type,
+            source_kind: SourceKind::Python,


Could we instead change SourceKind to:

    enum SourceKind<'a> {
        Python(&'a str),
        Notebook(Notebook),
    }

    impl SourceKind<'_> {
        fn source_code(&self) -> &str {
            match self {
                Self::Python(source) => source,
                Self::Notebook(notebook) => notebook.contents(),
            }
        }
    }

I'm using source_code here (or source) because I find contents too generic a term (I can't even tell from its name whether it is a string).

This was my instinct too. I tried it but it didn’t work, because the Diagnostics struct includes the SourceKind so that it can reconstruct the cell ranges when it reports diagnostics at the end. So we’d have to add lifetimes to Diagnostics and then keep all the source code in memory for the life of the program.
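The lifetime problem can be sketched as follows. This is an illustration under simplifying assumptions, not the real types: `Diagnostics` here is a hypothetical two-field struct, the notebook is reduced to a `String`, and `lint` is a toy check for trailing whitespace. The point is that once `SourceKind` borrows the text, everything that stores it inherits the lifetime.

```rust
/// Sketch of the proposed borrowed variant.
#[allow(dead_code)]
#[derive(Debug)]
enum SourceKind<'a> {
    Python(&'a str),
    /// Stand-in: the notebook is reduced to its concatenated text here.
    Notebook(String),
}

impl SourceKind<'_> {
    fn source_code(&self) -> &str {
        match self {
            Self::Python(source) => source,
            Self::Notebook(contents) => contents,
        }
    }
}

/// Hypothetical, heavily simplified `Diagnostics`.
#[derive(Debug)]
struct Diagnostics<'a> {
    messages: Vec<String>,
    /// Storing the kind forces the lifetime parameter onto this struct.
    source_kind: SourceKind<'a>,
}

/// Toy lint: reports lines that end in a space.
fn lint<'a>(source: &'a str) -> Diagnostics<'a> {
    let messages = source
        .lines()
        .enumerate()
        .filter(|(_, line)| line.ends_with(' '))
        .map(|(row, _)| format!("{}: trailing whitespace", row + 1))
        .collect();
    Diagnostics {
        messages,
        source_kind: SourceKind::Python(source),
    }
}

// Does not compile: the result would borrow from `source`, which is dropped
// when the function returns. The caller has to keep every source alive for
// as long as its diagnostics exist.
//
// fn lint_path(path: &std::path::Path) -> Diagnostics<'static> {
//     let source = std::fs::read_to_string(path).unwrap();
//     lint(&source)
// }
```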

Hmm I see. I have to take a closer look at this tomorrow. I wonder if SourceKind (or something similar) should be a trait and / or if we can structure these types differently to also avoid the overlap between SourceKind and SourceType

Diagnostics only needs access to the index though, not the file contents, which may help.

Yeah, it's another source mapping (the reverse after applying fixes). It would be nice to have a more formal concept for that rather than special casing jupyter notebooks in the Emitter. Because we'll have the same problem when linting markdown files: We need to map back the relative code block indices to the absolute file indices, potentially customizing the message.
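One possible shape for such a mapping, sketched under assumptions: `BlockOrigin` and `SourceMap` are hypothetical names, lines are zero-based, and the embedded blocks (notebook cells or Markdown code fences) are assumed to be concatenated in order before linting. The map owns only indices, no source text, so it carries no lifetime.

```rust
/// Hypothetical record of where one embedded code block sits in its host file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockOrigin {
    /// Zero-based host-file line of the block's first line.
    first_line: usize,
    /// Number of lines in the block.
    line_count: usize,
}

/// Hypothetical index-only mapping from concatenated lines to host-file lines.
#[derive(Debug, Default)]
struct SourceMap {
    blocks: Vec<BlockOrigin>,
}

impl SourceMap {
    /// Registers the next block, in the order blocks were concatenated.
    fn push(&mut self, first_line: usize, line_count: usize) {
        self.blocks.push(BlockOrigin {
            first_line,
            line_count,
        });
    }

    /// Maps a zero-based line in the concatenated source to
    /// `(block index, host-file line)`, or `None` if it is out of range.
    fn to_host_line(&self, mut line: usize) -> Option<(usize, usize)> {
        for (index, block) in self.blocks.iter().enumerate() {
            if line < block.line_count {
                return Some((index, block.first_line + line));
            }
            line -= block.line_count;
        }
        None
    }
}
```

An emitter could use the returned block index to customize the message (for example, "cell 2, line 1") without ever touching the file contents.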

charliermarsh · 2023-08-16T20:10:13Z

It feels awkward to me too, but I didn’t want to do anything more invasive. I want to see if we can remove the contents from Notebook, perhaps, which would make the responsibilities clearer.

charliermarsh · 2023-08-16T20:40:45Z

@MichaReiser - I made some improvements based on your suggestions but LMK how you want to proceed (e.g., whether you want to spend time on this, or want me to do so, or want to merge and revisit later).

MichaReiser · 2023-08-17T07:51:59Z

Current dependencies on/for this PR:

main
- PR Avoid cloning source code multiple times #6629 👈
  - PR Refactor SourceKind to store file content #6640

This comment was auto-generated by Graphite.

MichaReiser

Thanks. I like the improvements and the improved naming.

We should follow up on notebooks. I think it would be nice if SourceKind could store the source code as well, to avoid passing two arguments everywhere. However, I didn't manage to do that because of some lifetime issues when fixing notebooks (and updating notebook indices)

dhruvmanila

Looks good! Thanks for doing this :)

charliermarsh added the internal (An internal refactor or improvement) and performance (Potential performance improvement) labels Aug 16, 2023

charliermarsh marked this pull request as ready for review August 16, 2023 19:26

charliermarsh requested a review from dhruvmanila as a code owner August 16, 2023 19:26

zanieb approved these changes Aug 16, 2023

MichaReiser reviewed Aug 16, 2023

charliermarsh force-pushed the charlie/jupyter branch from ee92fff to 7f35448 Compare August 16, 2023 20:26

Base automatically changed from charlie/jupyter to main August 16, 2023 20:34

charliermarsh force-pushed the charlie/cow branch 3 times, most recently from 11dfb35 to 1554c36 Compare August 16, 2023 20:40

Avoid cloning source code multiple times

ccef889

charliermarsh force-pushed the charlie/cow branch from 1554c36 to ccef889 Compare August 16, 2023 20:40

MichaReiser mentioned this pull request Aug 17, 2023

Refactor SourceKind to store file content #6640

Merged

MichaReiser approved these changes Aug 17, 2023

dhruvmanila approved these changes Aug 17, 2023

charliermarsh merged commit 2aeb273 into main Aug 18, 2023
17 checks passed

charliermarsh deleted the charlie/cow branch August 18, 2023 13:32
