
Move file-level rule exemption to lexer-based approach #5567

Merged
charliermarsh merged 3 commits into main from charlie/exemption-parser on Jul 7, 2023

Conversation

charliermarsh
Member

Summary

In addition to # noqa codes, we also support file-level exemptions, which look like:

  • # flake8: noqa (ignore all rules in the file, for compatibility)
  • # ruff: noqa (all rules in the file)
  • # ruff: noqa: F401 (ignore F401 in the file, Flake8 doesn't support this)

This PR moves that logic to something that looks a lot more like our # noqa parser. Performance is actually quite a bit worse than the previous approach (lexing # flake8: noqa goes from 2ns to 11ns; lexing # ruff: noqa: F401, F841 is about the same; lexing # type: ignore # noqa: E501 goes from 4ns to 6ns), but the numbers are very small, so it's... maybe worth it?

The primary benefit here is that we now properly support flexible whitespace, like: #flake8:noqa. Previously, we required exact string matching, and we also didn't support all case-insensitive variants of noqa.
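
To make the shape of the change concrete, here is a minimal, self-contained sketch of the lexer-style approach (not the merged implementation): try_extract and lex_whitespace mirror names from the diff below, while the Exemption enum, lex_char, lex_keyword, and the code-splitting logic are illustrative stand-ins. The point is that each step consumes optional whitespace and matches keywords case-insensitively, which is what makes variants like #flake8:noqa work.

#[derive(Debug, PartialEq)]
enum Exemption {
    All,
    Codes(Vec<String>),
}

fn lex_whitespace(line: &str) -> &str {
    line.trim_start()
}

fn lex_char(line: &str, c: char) -> Option<&str> {
    line.strip_prefix(c)
}

fn lex_keyword<'a>(line: &'a str, keyword: &str) -> Option<&'a str> {
    // Case-insensitive ASCII prefix match; `get` returns None if the cut would
    // fall outside the string or off a char boundary.
    let prefix = line.get(..keyword.len())?;
    if prefix.eq_ignore_ascii_case(keyword) {
        Some(&line[keyword.len()..])
    } else {
        None
    }
}

fn try_extract(line: &str) -> Option<Exemption> {
    let line = lex_whitespace(line);
    let line = lex_char(line, '#')?;
    let line = lex_whitespace(line);
    // Accept either prefix; the real parser distinguishes the two forms
    // (Flake8 ignores trailing codes, Ruff honors them), which is omitted here.
    let line = lex_keyword(line, "flake8").or_else(|| lex_keyword(line, "ruff"))?;
    let line = lex_whitespace(line);
    let line = lex_char(line, ':')?;
    let line = lex_whitespace(line);
    let line = lex_keyword(line, "noqa")?;
    let line = lex_whitespace(line);
    match lex_char(line, ':') {
        None => Some(Exemption::All),
        Some(rest) => {
            let codes: Vec<String> = rest
                .split(|c: char| c == ',' || c.is_whitespace())
                .filter(|code| !code.is_empty())
                .map(str::to_string)
                .collect();
            Some(Exemption::Codes(codes))
        }
    }
}

fn main() {
    assert_eq!(try_extract("#flake8:noqa"), Some(Exemption::All));
    assert_eq!(
        try_extract("# ruff: noqa: F401, F841"),
        Some(Exemption::Codes(vec!["F401".to_string(), "F841".to_string()]))
    );
    assert_eq!(try_extract("# just a comment"), None);
}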

@charliermarsh
Member Author

I thought this might end up being faster, but given that it's slower, IDK, open to not merging. I do think handling #flake8: noqa (for example) is nice though.

@github-actions
Contributor

github-actions bot commented Jul 6, 2023

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      7.9±0.03ms     5.2 MB/sec    1.00      7.9±0.03ms     5.2 MB/sec
formatter/numpy/ctypeslib.py               1.00   1753.3±2.03µs     9.5 MB/sec    1.00   1756.2±3.49µs     9.5 MB/sec
formatter/numpy/globals.py                 1.00    197.9±0.85µs    14.9 MB/sec    1.01    199.6±0.52µs    14.8 MB/sec
formatter/pydantic/types.py                1.01      3.8±0.01ms     6.7 MB/sec    1.00      3.8±0.00ms     6.8 MB/sec
linter/all-rules/large/dataset.py          1.06     14.5±1.27ms     2.8 MB/sec    1.00     13.6±0.06ms     3.0 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.02      3.5±0.10ms     4.8 MB/sec    1.00      3.4±0.01ms     4.9 MB/sec
linter/all-rules/numpy/globals.py          1.00    437.9±0.47µs     6.7 MB/sec    1.00    437.4±1.19µs     6.7 MB/sec
linter/all-rules/pydantic/types.py         1.01      6.0±0.03ms     4.2 MB/sec    1.00      6.0±0.02ms     4.3 MB/sec
linter/default-rules/large/dataset.py      1.01      6.8±0.02ms     6.0 MB/sec    1.00      6.7±0.02ms     6.0 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01   1481.7±3.54µs    11.2 MB/sec    1.00   1468.8±3.68µs    11.3 MB/sec
linter/default-rules/numpy/globals.py      1.00    169.7±0.21µs    17.4 MB/sec    1.00    170.5±1.10µs    17.3 MB/sec
linter/default-rules/pydantic/types.py     1.01      3.1±0.02ms     8.3 MB/sec    1.00      3.0±0.01ms     8.4 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      7.6±0.03ms     5.3 MB/sec    1.00      7.6±0.03ms     5.3 MB/sec
formatter/numpy/ctypeslib.py               1.00  1592.3±13.92µs    10.5 MB/sec    1.01  1601.8±11.61µs    10.4 MB/sec
formatter/numpy/globals.py                 1.00    170.8±1.46µs    17.3 MB/sec    1.01    172.0±3.35µs    17.2 MB/sec
formatter/pydantic/types.py                1.00      3.5±0.01ms     7.2 MB/sec    1.00      3.6±0.02ms     7.2 MB/sec
linter/all-rules/large/dataset.py          1.04     13.0±0.15ms     3.1 MB/sec    1.00     12.6±0.04ms     3.2 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      3.3±0.02ms     5.1 MB/sec    1.00      3.2±0.01ms     5.2 MB/sec
linter/all-rules/numpy/globals.py          1.00    347.7±3.18µs     8.5 MB/sec    1.00    346.2±6.27µs     8.5 MB/sec
linter/all-rules/pydantic/types.py         1.00      5.5±0.04ms     4.7 MB/sec    1.00      5.5±0.02ms     4.6 MB/sec
linter/default-rules/large/dataset.py      1.02      6.7±0.10ms     6.0 MB/sec    1.00      6.6±0.02ms     6.2 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01  1339.7±13.14µs    12.4 MB/sec    1.00  1330.5±20.62µs    12.5 MB/sec
linter/default-rules/numpy/globals.py      1.01    147.8±2.21µs    20.0 MB/sec    1.00    145.7±1.00µs    20.2 MB/sec
linter/default-rules/pydantic/types.py     1.00      2.9±0.01ms     8.8 MB/sec    1.00      2.9±0.01ms     8.8 MB/sec

@charliermarsh charliermarsh force-pushed the charlie/exemption-parser branch 2 times, most recently from 3c26030 to 92f34e5 on July 6, 2023 20:16
@MichaReiser MichaReiser (Member) left a comment

Neat. As mentioned in the other PR, an alternative would have been to write a more "traditional" lexer that returns a sequence of Tokens (with their ranges) instead. See SimpleTokenizer for an example. But this works too.
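
For contrast, a rough sketch of what that alternative could look like (illustrative only, not Ruff's SimpleTokenizer): a lexer that yields discrete tokens with their byte ranges, over which a parser could then match the Hash, Word, Colon, Word shape and compare the Word ranges case-insensitively against "flake8"/"ruff"/"noqa".

#[derive(Debug)]
enum TokenKind {
    Hash,
    Colon,
    Comma,
    Word,
}

#[derive(Debug)]
struct Token {
    kind: TokenKind,
    range: std::ops::Range<usize>,
}

fn tokenize(line: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut iter = line.char_indices().peekable();
    while let Some((start, c)) = iter.next() {
        // Whitespace is skipped rather than emitted as tokens.
        if c.is_whitespace() {
            continue;
        }
        let kind = match c {
            '#' => TokenKind::Hash,
            ':' => TokenKind::Colon,
            ',' => TokenKind::Comma,
            _ => TokenKind::Word,
        };
        let mut end = start + c.len_utf8();
        if matches!(kind, TokenKind::Word) {
            // Greedily extend Word tokens until whitespace or punctuation.
            while let Some(&(i, next)) = iter.peek() {
                if next.is_whitespace() || matches!(next, '#' | ':' | ',') {
                    break;
                }
                end = i + next.len_utf8();
                iter.next();
            }
        }
        tokens.push(Token { kind, range: start..end });
    }
    tokens
}

fn main() {
    // For "#flake8:noqa": Hash 0..1, Word 1..7 ("flake8"), Colon 7..8, Word 8..12 ("noqa").
    for token in tokenize("#flake8:noqa") {
        println!("{:?} -> {:?}", token.kind, token.range);
    }
}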

@@ -256,38 +257,148 @@ enum ParsedFileExemption<'a> {
 impl<'a> ParsedFileExemption<'a> {
     /// Return a [`ParsedFileExemption`] for a given comment line.
     fn try_extract(line: &'a str) -> Option<Self> {
-        let line = line.trim_whitespace_start();
+        let line = ParsedFileExemption::lex_whitespace(line);
Member

nit: ParsedFileExemption -> Self

let mut chars = line.chars();
if chars
    .next()
    .map_or(false, |c| c.to_ascii_lowercase() == 'n')
Member

I'm surprised there isn't a better way to write this, but at least I found the issue for the missing method: rust-lang/rfcs#2566

Member

We could match on the byte slice, because we only compare against byte (ASCII) characters. But then you still need to uphold the UTF-8 invariant when converting back, which is annoying?

match line.as_bytes() {
    [b'n' | b'N', b'o' | b'O', b'q' | b'Q', b'a' | b'A', ..] => tada,
    _ => nope
}

Member

The problem is that we want to select the string after "noqa", and converting back from bytes isn't safe.

Which reminds me, would line.strip_prefix("noqa") work?

Member Author

I don't think line.strip_prefix("noqa") matches case-insensitively, right?

Member

The problem is that we want to select the string after "noqa", and converting back from bytes isn't safe.

I think it would be safe here, right? We know that all variations of noqa have an exact length of 4 bytes.

Anyway, I think the current code is fine. It's verbose, but it does the job.

Member Author

I like this, I changed it.
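
For reference, a small sketch of the byte-slice idea as discussed above (the helper name and examples are illustrative, not the merged code): because every casing of noqa is exactly four ASCII bytes, re-slicing the original &str at offset 4 after a byte-level match stays on a valid UTF-8 boundary.

// Illustrative helper: match the first four bytes case-insensitively, then
// slice the original &str, which is safe because "noqa" in any casing is
// exactly four ASCII bytes.
fn strip_noqa(line: &str) -> Option<&str> {
    match line.as_bytes() {
        [b'n' | b'N', b'o' | b'O', b'q' | b'Q', b'a' | b'A', ..] => Some(&line[4..]),
        _ => None,
    }
}

fn main() {
    assert_eq!(strip_noqa("NoQA: F401"), Some(": F401"));
    assert_eq!(strip_noqa("nosec"), None);
}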

@konstin
Member

konstin commented Jul 7, 2023

How did you compute those ns timings?

@charliermarsh
Member Author

charliermarsh commented Jul 7, 2023

How did you compute those ns timings?

I use cargo bench in the ruff crate. I add something like:

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

use ruff::noqa::ParsedFileExemption;

pub fn directive_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Directive");
    for i in [
        "# noqa: F401",
        "# noqa: F401, F841",
        "# flake8: noqa: F401, F841",
        "# ruff: noqa: F401, F841",
        "# flake8: noqa",
        "# ruff: noqa",
        "# noqa",
        "# type: ignore # noqa: E501",
        "# type: ignore # nosec",
        "# some very long comment that # is interspersed with characters but # no directive",
    ]
    .iter()
    {
        group.bench_with_input(BenchmarkId::new("Regex", i), i, |b, _i| {
            b.iter(|| ParsedFileExemption::try_regex(black_box(i)))
        });
        group.bench_with_input(BenchmarkId::new("Lexer", i), i, |b, _i| {
            b.iter(|| ParsedFileExemption::try_extract(black_box(i)))
        });
    }
    group.finish();
}

criterion_group!(benches, directive_benchmark);
criterion_main!(benches);

In crates/ruff/benches/benchmark.rs. Then add:

[[bench]]
name = "benchmark"
harness = false

To crates/ruff/Cargo.toml, along with `criterion = "0.5.1"`.

Then cargo bench in crates/ruff gives you benchmarks!

@charliermarsh charliermarsh enabled auto-merge (squash) July 7, 2023 15:34
@charliermarsh charliermarsh merged commit 5640c31 into main Jul 7, 2023
14 checks passed
@charliermarsh charliermarsh deleted the charlie/exemption-parser branch July 7, 2023 15:41
charliermarsh added a commit that referenced this pull request Jul 11, 2023
## Summary

Similar to #5567, we can remove the use of regex, plus simplify the
representation (use `Option`), add snapshot tests, etc.

This is about 100x faster than using a regex for cases that match (2.5ns
vs. 250ns). It's obviously not a hot path, but I prefer the consistency
with other similar comment-parsing. I may DRY these up into some common
functionality later on.
konstin pushed a commit that referenced this pull request Jul 19, 2023