Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace row/column based Location with byte-offsets. #3931

Merged
merged 21 commits into from
Apr 26, 2023

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Apr 11, 2023

Summary

This PR changes ruff to use our own fork of RustPython that replaces Location { row: u32, column: u32 } with TextSize astral-sh/RustPython#4. The main motivation for this change is to ship the logical line rules. Enabling the logical line rules regresses performance by as much as 50% because the rules need to slice into the source string, which requires building and querying the LineIndex. Using byte offsets everywhere trades the need from having to build the LineIndex to inspect the source text in lint rules with re-computing the row and column information when rendering diagnostics. This is a favourable trade because most projects using ruff only have very few diagnostics.

Notable Changes

SourceCodeFile

It is now necessary to always include the source code when passing Messages because the source text is necessary to re-compute the row and column positions for byte offsets (TextSize). Previously, the source text was only included when using --show-source. This results in a noticeable slowdown in projects with many (ten thousand) diagnostics.

Locator

The Locator now exposes methods to:

  • For a byte offset, compute its line's start and end positions. Useful to compute the indent or when replacing a full line.
  • For a text range, compute the offset of the first line and the last line in that range. Useful when replacing all lines in a range.

The computations are performed on demand without querying the LineIndex.

The Locator still has a lazy computed LineIndex because we have a few diagnostics that use a line number as part of their message.

SourceCode

SourceCode now provides methods to compute the SourceLocation (row column information) given an offset.

UniversalNewline

The UniversalNewline iterator now returns Line items instead of &str. This is necessary because many lints need to know the offset of the nth line and summing the text.text_len() doesn't give you the right result because the text does not include the trailing newline character:

def f:
    pass

The text len of the first line is 6 bytes because the line does not include the trailing newline character.

The Line struct provides methods o get a line's start offset, end offset, range, and text. It also provides methods to get the text, end offset, and range, including the trailing newline character.

Use TextRange for ranges

Consistently uses TextRange in favor of: (Location, Location) and start: Location, end: Location because TextRange better communicates that the two offsets are related.

Replaces all references of Range with TextRange and deletes Range.

Use TextSize instead of Location

Replaces all references to Location with TextSize.

Stylist

This PR removes the lazy computations for indention and quote because slicing into the source string is now cheap.

Indexer

The Indexer used to store the line numbers of commented lines, lines with continuations, and lines with multiline strings. This is no longer feasible because it would require computing the line numbers. The new implementation stores the line-start offset for continuous lines and the TextRange for comments and multiline strings.

Storing the TextRange instead of line numbers helped to fix a false-negative where a mixed spaces-tab indent at the start of a multiline string was not reported because the analysis incorrectly assumed that it is part of the multiline string.

Noqa

This PR now stores the TextRange of the line for each noqa comment sorted in ascending order by the start position. Testing whether a diagnostic is suppressed requires a binary search on the ranges to test if any range contains the diagnostics start location.

This PR further replaces the mapping to suppress some syntaxes on other lines by a TextRange vector where every entry means that a position falling into that range should be remapped to the end of the range.

isort directives

Similar to noqa. It now stores the TextRanges instead of the line numbers for the areas where sorting is disabled. This PR now only stores the TextSize for split positions as this proves to be sufficient.

Benchmark

TLDR: 10% performance improvement for projects with few diagnostics. Identical performance or small regression for projects with many diagnostics. The new implementation with logical-lines enabled outperforms main with logical-lines disabled.

Micro Benchmarks

This PR improves the default-rules benchmark by 6-15% and the all-rules benchmark by 4-8%. More importantly, ruff with logical-lines enabled is as fast or even faster than main. This should allow us to ship logical lines without causing a runtime regression.

group                                      bytes                                  bytes-logical                          main                                   main-logical
-----                                      -----                                  -------------                          ----                                   ------------
linter/all-rules/large/dataset.py          1.00      8.6±0.08ms     4.7 MB/sec    1.03      8.9±0.17ms     4.6 MB/sec    1.04      8.9±0.12ms     4.6 MB/sec    1.10      9.4±0.01ms     4.3 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.0±0.07ms     8.3 MB/sec    1.05      2.1±0.03ms     7.9 MB/sec    1.05      2.1±0.03ms     7.9 MB/sec    1.13      2.3±0.00ms     7.4 MB/sec
linter/all-rules/numpy/globals.py          1.00    220.7±4.63µs    13.4 MB/sec    1.04    228.8±2.74µs    12.9 MB/sec    1.08    238.8±1.17µs    12.4 MB/sec    1.16    256.8±1.25µs    11.5 MB/sec
linter/all-rules/pydantic/types.py         1.00      3.5±0.04ms     7.3 MB/sec    1.05      3.7±0.10ms     6.9 MB/sec    1.07      3.7±0.02ms     6.8 MB/sec    1.14      4.0±0.02ms     6.4 MB/sec
linter/default-rules/large/dataset.py      1.00      4.3±0.07ms     9.5 MB/sec    1.07      4.6±0.03ms     8.8 MB/sec    1.08      4.6±0.05ms     8.8 MB/sec    1.18      5.1±0.08ms     8.1 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00    870.1±5.69µs    19.1 MB/sec    1.12    971.9±3.08µs    17.1 MB/sec    1.15   1002.2±4.42µs    16.6 MB/sec    1.27   1108.6±3.25µs    15.0 MB/sec
linter/default-rules/numpy/globals.py      1.00     95.9±0.72µs    30.8 MB/sec    1.08    103.2±1.69µs    28.6 MB/sec    1.06    102.0±1.01µs    28.9 MB/sec    1.19    114.5±0.47µs    25.8 MB/sec
linter/default-rules/pydantic/types.py     1.00  1883.0±10.58µs    13.5 MB/sec    1.12      2.1±0.01ms    12.1 MB/sec    1.10      2.1±0.00ms    12.3 MB/sec    1.23      2.3±0.01ms    11.0 MB/sec

It's worth pointing out that the relative slowdown introduced by enabling the logical lines lint rules remains unchanged. I'm surprised by this because it doesn't show the improvement I expected from removing the LineIndex computation from the linting path.

group                                      main                                   main-logical
-----                                      ----                                   ------------
linter/all-rules/large/dataset.py          1.00      8.9±0.12ms     4.6 MB/sec    1.06      9.4±0.01ms     4.3 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.1±0.03ms     7.9 MB/sec    1.07      2.3±0.00ms     7.4 MB/sec
linter/all-rules/numpy/globals.py          1.00    238.8±1.17µs    12.4 MB/sec    1.08    256.8±1.25µs    11.5 MB/sec
linter/all-rules/pydantic/types.py         1.00      3.7±0.02ms     6.8 MB/sec    1.07      4.0±0.02ms     6.4 MB/sec
linter/default-rules/large/dataset.py      1.00      4.6±0.05ms     8.8 MB/sec    1.09      5.1±0.08ms     8.1 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1002.2±4.42µs    16.6 MB/sec    1.11   1108.6±3.25µs    15.0 MB/sec
linter/default-rules/numpy/globals.py      1.00    102.0±1.01µs    28.9 MB/sec    1.12    114.5±0.47µs    25.8 MB/sec
linter/default-rules/pydantic/types.py     1.00      2.1±0.00ms    12.3 MB/sec    1.12      2.3±0.01ms    11.0 MB/sec
group                                      bytes                                  bytes-logical
-----                                      -----                                  -------------
linter/all-rules/large/dataset.py          1.00      8.6±0.08ms     4.7 MB/sec    1.03      8.9±0.17ms     4.6 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.0±0.07ms     8.3 MB/sec    1.05      2.1±0.03ms     7.9 MB/sec
linter/all-rules/numpy/globals.py          1.00    220.7±4.63µs    13.4 MB/sec    1.04    228.8±2.74µs    12.9 MB/sec
linter/all-rules/pydantic/types.py         1.00      3.5±0.04ms     7.3 MB/sec    1.05      3.7±0.10ms     6.9 MB/sec
linter/default-rules/large/dataset.py      1.00      4.3±0.07ms     9.5 MB/sec    1.07      4.6±0.03ms     8.8 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00    870.1±5.69µs    19.1 MB/sec    1.12    971.9±3.08µs    17.1 MB/sec
linter/default-rules/numpy/globals.py      1.00     95.9±0.72µs    30.8 MB/sec    1.08    103.2±1.69µs    28.6 MB/sec
linter/default-rules/pydantic/types.py     1.00  1883.0±10.58µs    13.5 MB/sec    1.12      2.1±0.01ms    12.1 MB/sec

CPython

This benchmark measures the worst-case performance: A project with many violations.

  • Performance regresses for loading cached results (except when using --show-source). This is expected because printing diagnostics now always requires storing the source text and computing the source locations adds some overhead as well.
  • Slight improvement for --no-cache as seen in the micro benchmarks
  • 10% performance improvement when using silent (-s). This shows the potential of the refactor for projects with few or no diagnostics. Silent still pays the overhead for storing the source text for every diagnostic, but the implementation doesn't compute the LineIndex.
Benchmark results
Command Mean [ms] Min [ms] Max [ms] Relative
./ruff-bytes ./crates/ruff/resources/test/cpython/ -e 44.4 ± 2.0 40.3 49.0 1.17 ± 0.08
./ruff-main ./crates/ruff/resources/test/cpython/ -e 37.9 ± 1.8 34.1 43.5 1.00
./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --no-cache 200.2 ± 4.0 194.5 210.1 5.28 ± 0.28
./ruff-main ./crates/ruff/resources/test/cpython/ -e --no-cache 215.7 ± 6.5 203.5 228.0 5.69 ± 0.32
./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --select=ALL 401.6 ± 9.6 391.4 420.1 10.60 ± 0.57
./ruff-main ./crates/ruff/resources/test/cpython/ -e --select=ALL 395.6 ± 8.8 383.8 412.5 10.44 ± 0.55
./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --no-cache --select=ALL 691.1 ± 9.6 677.6 709.3 18.24 ± 0.91
./ruff-main ./crates/ruff/resources/test/cpython/ -e --no-cache --select=ALL 716.7 ± 11.3 700.7 736.3 18.91 ± 0.96
./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --show-source 85.7 ± 2.4 81.7 93.2 2.26 ± 0.13
./ruff-main ./crates/ruff/resources/test/cpython/ -e --show-source 85.0 ± 2.0 81.9 90.4 2.24 ± 0.12
./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --no-cache --show-source 245.7 ± 4.4 238.5 251.2 6.48 ± 0.33
./ruff-main ./crates/ruff/resources/test/cpython/ -e --no-cache --show-source 259.8 ± 5.1 251.3 272.6 6.86 ± 0.36
./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --select=ALL --show-source 1507.1 ± 18.6 1474.6 1531.3 39.77 ± 1.98
./ruff-main ./crates/ruff/resources/test/cpython/ -e --select=ALL --show-source 1459.6 ± 23.5 1426.0 1500.0 38.52 ± 1.96
./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --no-cache --select=ALL --show-source 1770.1 ± 21.8 1747.0 1822.1 46.71 ± 2.32
./ruff-main ./crates/ruff/resources/test/cpython/ -e --no-cache --select=ALL --show-source 1799.7 ± 31.7 1762.7 1856.2 47.49 ± 2.44

Homeassitant

Best case benchmark: A project with very few diagnostics (10).

  • Performance is unchanged when running with caching
  • Performance improves by about 10% when caching is disabled.
Command Mean [ms] Min [ms] Max [ms] Relative
../ruff/ruff-bytes . -e 51.4 ± 2.0 44.6 55.7 1.00
../ruff/ruff-main . -e 51.4 ± 2.3 45.4 57.3 1.00 ± 0.06
../ruff/ruff-bytes . -e --no-cache 360.0 ± 3.0 355.4 364.0 7.01 ± 0.28
../ruff/ruff-main . -e --no-cache 387.4 ± 5.9 380.1 396.2 7.54 ± 0.32

Enabling logical lines introduces many new errors (2000), no longer showing the best case. But the new implementation still outperforms the old with logical-lines enabled and remains about 10% faster.

Command Mean [ms] Min [ms] Max [ms] Relative
../ruff/ruff-bytes-logical . -e 53.7 ± 2.3 49.1 59.8 1.03 ± 0.06
../ruff/ruff-main-logical . -e 52.1 ± 2.4 46.6 56.6 1.00
../ruff/ruff-bytes-logical . -e --no-cache 383.2 ± 2.7 378.9 387.6 7.35 ± 0.34
../ruff/ruff-main-logical . -e --no-cache 417.7 ± 6.3 411.8 433.6 8.01 ± 0.38

Test Plan

Breaking Changes

This PR changes the column numbers of fixes in the JSON output to be one indexed to align the column numbers with the Diagnostic start and end columns. I can undo this change but I got it "for free" by using SourceLocation consistently.

I reverted the change in this PR and extracted it into #4007

@evanrittenhouse
Copy link
Contributor

Just for my own curiosity, what's the context for this? Why's it necessary?

@MichaReiser
Copy link
Member Author

MichaReiser commented Apr 12, 2023

Just for my own curiosity, what's the context for this? Why's it necessary?

Strictly speaking, it isn't necessary from a functional point of view, but using byte offsets helps to improve performance and reduce memory consumption.

I started investigating switching to byte offsets because enabling the pycodestyle (logical line) rules #3689 results in a 20%-50% performance regression, even tough I already improved the performance of the rules themselves. A key observation is that benchmarks for the default-rules regress more than for the all rules benchmarks.

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.00     13.5±0.06ms     3.0 MB/sec    1.20     16.3±0.09ms     2.5 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      3.5±0.01ms     4.7 MB/sec    1.18      4.2±0.01ms     4.0 MB/sec
linter/all-rules/numpy/globals.py          1.00    490.6±2.52µs     6.0 MB/sec    1.16    567.3±0.81µs     5.2 MB/sec
linter/all-rules/pydantic/types.py         1.00      6.0±0.02ms     4.3 MB/sec    1.23      7.4±0.02ms     3.5 MB/sec
linter/default-rules/large/dataset.py      1.00      7.2±0.01ms     5.7 MB/sec    1.38      9.9±0.04ms     4.1 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1615.1±3.91µs    10.3 MB/sec    1.39      2.3±0.01ms     7.4 MB/sec
linter/default-rules/numpy/globals.py      1.00    180.5±0.32µs    16.3 MB/sec    1.52    274.0±4.45µs    10.8 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.4±0.01ms     7.6 MB/sec    1.41      4.8±0.02ms     5.4 MB/sec

This is because the pycodestyle rules are the first rules in the default set that inspect the source code (trivia). The challenge with inspecting the source code is that you can't slice a string with a row/column location. This isn't possible: string[row=1, column=5..10]. Ruff works around this by building a LineIndex that maps row and column locations to byte offsets (somewhat expensive), and than uses that index (lookup is cheap) to retrieve the string locations.

My goal with using byte-offsets is to remove the need to build a LineIndex from the linting phase. It will still be necessary to build the line index when we emit diagnostics, because, as a user, I strongly prefer row/column numbers over byte indices ;). Having to build a LineIndex for all files with diagnostics may result in a performance regression for projects where most files have diagnostics, but this is rare for projects using ruff that tend to have zero or only few diagnostics.

There are other, non-pycodestyle specific reasons why I want to adopt byte offsets:

  • Code size reduction: Comparing a byte-offset is a single u32 comparison, compared to two u32 comparisons for a row/column based source location. I further made end_location mandatory on Located which helps removing some unwrap code paths. The result for ruff_python_formatter is a small code size reduction: ~4.4MB → 4.3
  • Reduced memory consumption: A row/column based Location uses 8 bytes (4 bytes for the row and column). A byte offset only uses half of that (single u32). This means, we reduce the size of every LexResult, Located, and AST node by 8 bytes (-4 bytes for the start, and -4 bytes for the end locations).
  • Performance improvements: Writing and reading less data, and fewer CPU instructions should help to improve overall performance. A preliminary benchmark comparing the parser and visitor (counting all statements) performance shows a 10-15% performance improvement. Whether we're able to reap all these improvements in Ruff depends on how efficiently we can rewrite the logic that uses row information in the linting phase today (noqa, isort comments, commented lines, etc)
parser/numpy/globals.py time:   [65.752 µs 65.844 µs 65.973 µs]
                        thrpt:  [44.725 MiB/s 44.813 MiB/s 44.876 MiB/s]
                 change:
                        time:   [-5.8033% -5.5729% -5.3542%] (p = 0.00 < 0.05)
                        thrpt:  [+5.6571% +5.9018% +6.1609%]
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  15 (15.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
parser/pydantic/types.py
                        time:   [1.4442 ms 1.4453 ms 1.4466 ms]
                        thrpt:  [17.630 MiB/s 17.646 MiB/s 17.659 MiB/s]
                 change:
                        time:   [-12.393% -12.225% -11.904%] (p = 0.00 < 0.05)
                        thrpt:  [+13.512% +13.927% +14.146%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
parser/numpy/ctypeslib.py
                        time:   [647.58 µs 650.18 µs 652.90 µs]
                        thrpt:  [25.503 MiB/s 25.610 MiB/s 25.713 MiB/s]
                 change:
                        time:   [-14.351% -14.154% -13.948%] (p = 0.00 < 0.05)
                        thrpt:  [+16.209% +16.488% +16.756%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  17 (17.00%) high mild
  1 (1.00%) high severe
parser/large/dataset.py time:   [3.5024 ms 3.5104 ms 3.5195 ms]
                        thrpt:  [11.559 MiB/s 11.589 MiB/s 11.616 MiB/s]
                 change:
                        time:   [-11.825% -11.603% -11.374%] (p = 0.00 < 0.05)
                        thrpt:  [+12.834% +13.126% +13.411%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

Other compilers using byte-offsts:

@evanrittenhouse
Copy link
Contributor

evanrittenhouse commented Apr 13, 2023

Thanks for the detailed explanation @MichaReiser! If you don't mind - where do the offsets come from? Column offset makes sense (obviously just offsetting from index 0), but how are rows represented? Or is a total byte-offset calculated from what is effectively row 0, column 0?

E: Never mind, just found locator.rs. Seems like we essentially treat the file as one giant string, then define rows as being delimited by \n/\r and then the columns as the offsets from that offset?

@github-actions
Copy link
Contributor

github-actions bot commented Apr 14, 2023

PR Check Results

Ecosystem

ℹ️ ecosystem check detected changes. (+0, -16, 0 error(s))

airflow (+0, -7)

- airflow/api_connexion/endpoints/task_instance_endpoint.py:274:12: RET504 Unnecessary variable assignment before `return` statement
- airflow/providers/amazon/aws/secrets/systems_manager.py:200:16: RET504 Unnecessary variable assignment before `return` statement
- airflow/providers/docker/operators/docker.py:479:16: RET504 Unnecessary variable assignment before `return` statement
- airflow/providers/oracle/hooks/oracle.py:42:12: RET504 Unnecessary variable assignment before `return` statement
- airflow/security/utils.py:83:12: RET504 Unnecessary variable assignment before `return` statement
- airflow/www/extensions/init_appbuilder.py:359:16: RET504 Unnecessary variable assignment before `return` statement
- tests/test_utils/gcp_system_helpers.py:65:12: RET504 Unnecessary variable assignment before `return` statement

bokeh (+0, -1)

- src/bokeh/core/property/datetime.py:165:16: RET504 Unnecessary variable assignment before `return` statement

zulip (+0, -8)

- zerver/data_import/rocketchat.py:141:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/lib/message.py:186:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/lib/narrow.py:891:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/lib/url_preview/oembed.py:50:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/models.py:184:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/webhooks/basecamp/view.py:115:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/webhooks/bitbucket2/view.py:436:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/webhooks/zendesk/view.py:14:12: RET504 Unnecessary variable assignment before `return` statement

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.00     14.9±0.06ms     2.7 MB/sec    1.00     14.9±0.09ms     2.7 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      3.6±0.04ms     4.6 MB/sec    1.00      3.6±0.01ms     4.6 MB/sec
linter/all-rules/numpy/globals.py          1.00    380.3±1.31µs     7.8 MB/sec    1.00    378.7±1.49µs     7.8 MB/sec
linter/all-rules/pydantic/types.py         1.00      6.2±0.01ms     4.1 MB/sec    1.00      6.2±0.01ms     4.1 MB/sec
linter/default-rules/large/dataset.py      1.00      7.6±0.01ms     5.3 MB/sec    1.00      7.6±0.03ms     5.3 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1596.6±2.95µs    10.4 MB/sec    1.00   1602.0±2.59µs    10.4 MB/sec
linter/default-rules/numpy/globals.py      1.00    174.3±0.29µs    16.9 MB/sec    1.00    174.0±0.63µs    17.0 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.4±0.01ms     7.5 MB/sec    1.00      3.4±0.01ms     7.5 MB/sec
parser/large/dataset.py                    1.01      5.9±0.00ms     6.8 MB/sec    1.00      5.9±0.01ms     6.9 MB/sec
parser/numpy/ctypeslib.py                  1.00   1144.2±2.10µs    14.6 MB/sec    1.00   1138.6±2.20µs    14.6 MB/sec
parser/numpy/globals.py                    1.00    116.8±0.22µs    25.3 MB/sec    1.00    117.1±0.19µs    25.2 MB/sec
parser/pydantic/types.py                   1.01      2.5±0.00ms    10.2 MB/sec    1.00      2.5±0.00ms    10.3 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.04     25.1±0.86ms  1657.1 KB/sec    1.00     24.3±0.86ms  1717.6 KB/sec
linter/all-rules/numpy/ctypeslib.py        1.02      6.2±0.44ms     2.7 MB/sec    1.00      6.1±0.30ms     2.7 MB/sec
linter/all-rules/numpy/globals.py          1.00   707.5±38.28µs     4.2 MB/sec    1.00   710.4±43.20µs     4.2 MB/sec
linter/all-rules/pydantic/types.py         1.00     10.1±0.47ms     2.5 MB/sec    1.02     10.3±0.40ms     2.5 MB/sec
linter/default-rules/large/dataset.py      1.00     12.3±0.87ms     3.3 MB/sec    1.00     12.2±0.46ms     3.3 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00      2.5±0.13ms     6.6 MB/sec    1.02      2.6±0.13ms     6.4 MB/sec
linter/default-rules/numpy/globals.py      1.00   300.2±17.16µs     9.8 MB/sec    1.02   305.2±19.67µs     9.7 MB/sec
linter/default-rules/pydantic/types.py     1.00      5.4±0.25ms     4.7 MB/sec    1.01      5.5±0.25ms     4.7 MB/sec
parser/large/dataset.py                    1.00      9.9±0.48ms     4.1 MB/sec    1.01      9.9±0.35ms     4.1 MB/sec
parser/numpy/ctypeslib.py                  1.00  1864.2±78.93µs     8.9 MB/sec    1.01  1881.3±62.66µs     8.9 MB/sec
parser/numpy/globals.py                    1.00    193.7±9.50µs    15.2 MB/sec    1.02   196.8±15.61µs    15.0 MB/sec
parser/pydantic/types.py                   1.00      4.1±0.17ms     6.2 MB/sec    1.01      4.1±0.18ms     6.2 MB/sec

crates/ruff/src/autofix/actions.rs Outdated Show resolved Hide resolved
crates/ruff/src/checkers/imports.rs Outdated Show resolved Hide resolved
crates/ruff/src/directives.rs Outdated Show resolved Hide resolved
crates/ruff/src/doc_lines.rs Show resolved Hide resolved
crates/ruff/src/docstrings/definition.rs Outdated Show resolved Hide resolved
crates/ruff/src/rules/pandas_vet/fixes.rs Outdated Show resolved Hide resolved
crates/ruff/src/rules/pandas_vet/rules/inplace_argument.rs Outdated Show resolved Hide resolved
crates/ruff/src/rules/pycodestyle/rules/errors.rs Outdated Show resolved Hide resolved
@MichaReiser
Copy link
Member Author

MichaReiser commented Apr 15, 2023

Thanks for the detailed explanation @MichaReiser! If you don't mind - where do the offsets come from? Column offset makes sense (obviously just offsetting from index 0), but how are rows represented? Or is a total byte-offset calculated from what is effectively row 0, column 0?

E: Never mind, just found locator.rs. Seems like we essentially treat the file as one giant string, then define rows as being delimited by \n/\r and then the columns as the offsets from that offset?

@evanrittenhouse, sorry for the late reply.

The RustPython Lexer generates the offsets. The old implementation counted the rows and columns (from the start of the row). The lexer increments the current row index and resets the column to zero for every new line character.

Byte offsets don't use row or columns. Instead, it's an offset from the beginning of the file. Think of the string as a byte array and the byte offset is the index into that array:

def f(): pass
x = 20

The position of the identifier f:

  • row/column representation: Location { row: 1, column 4 }
  • byte offsets: TextSize::from(4)

The position of the = sign

  • row/column representation: Location { row: 2, column: 2 }
  • byte offsets: TextSize::from(16)

@@ -282,6 +282,16 @@ W19.py:133:1: W191 Indentation contains tabs
137 | def test_keys(self):
|

W19.py:136:1: W191 Indentation contains tabs
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that this was a false negative because Indexer.strings incorrectly suppressed this violation because it is on a line with a string. This now gets correctly reported because we test if the tab is inside of a string range (rather than on a line)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right.

Base automatically changed from add-parser-benchmark to main April 17, 2023 14:44
@MichaReiser MichaReiser enabled auto-merge (squash) April 26, 2023 18:06
@MichaReiser MichaReiser merged commit cab65b2 into main Apr 26, 2023
13 of 28 checks passed
@MichaReiser MichaReiser deleted the byte-offset-parser branch April 26, 2023 18:11
renovate bot added a commit to ixm-one/pytest-cmake-presets that referenced this pull request May 2, 2023
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [ruff](https://togithub.com/charliermarsh/ruff) | `^0.0.263` ->
`^0.0.264` |
[![age](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/age-slim)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/adoption-slim)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/compatibility-slim/0.0.263)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/confidence-slim/0.0.263)](https://docs.renovatebot.com/merge-confidence/)
|

---

### Release Notes

<details>
<summary>charliermarsh/ruff</summary>

###
[`v0.0.264`](https://togithub.com/charliermarsh/ruff/releases/tag/v0.0.264)

[Compare
Source](https://togithub.com/charliermarsh/ruff/compare/v0.0.263...v0.0.264)

<!-- Release notes generated using configuration in .github/release.yml
at 8cb76f85eba1c970a8c800348fd1e0c874621a57 -->

#### What's Changed

##### Rules

- Autofix `EM101`, `EM102`, `EM103` if possible by
[@&#8203;dhruvmanila](https://togithub.com/dhruvmanila) in
[astral-sh/ruff#4123
- Add bugbear immutable functions as allowed in dataclasses by
[@&#8203;mosauter](https://togithub.com/mosauter) in
[astral-sh/ruff#4122

##### Settings

- Add support for providing command-line arguments via `argfile` by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4087

##### Bug Fixes

- Make D410/D411 autofixes mutually exclusive by
[@&#8203;evanrittenhouse](https://togithub.com/evanrittenhouse) in
[astral-sh/ruff#4110
- Remove `pyright` comment prefix from PYI033 checks by
[@&#8203;evanrittenhouse](https://togithub.com/evanrittenhouse) in
[astral-sh/ruff#4152
- Fix F811 false positive with match by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4161
- Fix `E713` and `E714` false positives for multiple comparisons by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4083
- Fix B023 shadowed variables in nested functions by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#4111
- Preserve star-handling special-casing for force-single-line by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4129
- Respect parent-scoping rules for `NamedExpr` assignments by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4145
- Fix UP032 auto-fix by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4165
- Allow boolean parameters for `pytest.param` by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4176

##### Internal

- Replace row/column based `Location` with byte-offsets. by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#3931
- perf(logical-lines): Various small perf improvements by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#4022
- Use `memchr` to speedup newline search on x86 by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#3985
- Remove `ScopeStack` in favor of child-parent `ScopeId` pointers by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4138

**Full Changelog**:
astral-sh/ruff@v0.0.263...v0.0.264

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://app.renovatebot.com/dashboard#github/ixm-one/pytest-cmake-presets).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS42Ni4zIiwidXBkYXRlZEluVmVyIjoiMzUuNjYuMyIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==-->

Signed-off-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
renovate bot added a commit to allenporter/flux-local that referenced this pull request May 3, 2023
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [ruff](https://togithub.com/charliermarsh/ruff) | `==0.0.263` ->
`==0.0.264` |
[![age](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/age-slim)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/adoption-slim)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/compatibility-slim/0.0.263)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/confidence-slim/0.0.263)](https://docs.renovatebot.com/merge-confidence/)
|

---

### Release Notes

<details>
<summary>charliermarsh/ruff</summary>

###
[`v0.0.264`](https://togithub.com/charliermarsh/ruff/releases/tag/v0.0.264)

[Compare
Source](https://togithub.com/charliermarsh/ruff/compare/v0.0.263...v0.0.264)

<!-- Release notes generated using configuration in .github/release.yml
at 8cb76f85eba1c970a8c800348fd1e0c874621a57 -->

#### What's Changed

##### Rules

- Autofix `EM101`, `EM102`, `EM103` if possible by
[@&#8203;dhruvmanila](https://togithub.com/dhruvmanila) in
[astral-sh/ruff#4123
- Add bugbear immutable functions as allowed in dataclasses by
[@&#8203;mosauter](https://togithub.com/mosauter) in
[astral-sh/ruff#4122

##### Settings

- Add support for providing command-line arguments via `argfile` by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4087

##### Bug Fixes

- Make D410/D411 autofixes mutually exclusive by
[@&#8203;evanrittenhouse](https://togithub.com/evanrittenhouse) in
[astral-sh/ruff#4110
- Remove `pyright` comment prefix from PYI033 checks by
[@&#8203;evanrittenhouse](https://togithub.com/evanrittenhouse) in
[astral-sh/ruff#4152
- Fix F811 false positive with match by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4161
- Fix `E713` and `E714` false positives for multiple comparisons by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4083
- Fix B023 shadowed variables in nested functions by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#4111
- Preserve star-handling special-casing for force-single-line by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4129
- Respect parent-scoping rules for `NamedExpr` assignments by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4145
- Fix UP032 auto-fix by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4165
- Allow boolean parameters for `pytest.param` by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4176

##### Internal

- Replace row/column based `Location` with byte-offsets. by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#3931
- perf(logical-lines): Various small perf improvements by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#4022
- Use `memchr` to speedup newline search on x86 by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#3985
- Remove `ScopeStack` in favor of child-parent `ScopeId` pointers by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4138

**Full Changelog**:
astral-sh/ruff@v0.0.263...v0.0.264

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://app.renovatebot.com/dashboard#github/allenporter/flux-local).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS42OS4zIiwidXBkYXRlZEluVmVyIjoiMzUuNjkuMyIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
renovate bot added a commit to allenporter/pyrainbird that referenced this pull request May 3, 2023
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [ruff](https://togithub.com/charliermarsh/ruff) | `==0.0.263` ->
`==0.0.264` |
[![age](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/age-slim)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/adoption-slim)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/compatibility-slim/0.0.263)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://badges.renovateapi.com/packages/pypi/ruff/0.0.264/confidence-slim/0.0.263)](https://docs.renovatebot.com/merge-confidence/)
|

---

### Release Notes

<details>
<summary>charliermarsh/ruff</summary>

###
[`v0.0.264`](https://togithub.com/charliermarsh/ruff/releases/tag/v0.0.264)

[Compare
Source](https://togithub.com/charliermarsh/ruff/compare/v0.0.263...v0.0.264)

<!-- Release notes generated using configuration in .github/release.yml
at 8cb76f85eba1c970a8c800348fd1e0c874621a57 -->

#### What's Changed

##### Rules

- Autofix `EM101`, `EM102`, `EM103` if possible by
[@&#8203;dhruvmanila](https://togithub.com/dhruvmanila) in
[astral-sh/ruff#4123
- Add bugbear immutable functions as allowed in dataclasses by
[@&#8203;mosauter](https://togithub.com/mosauter) in
[astral-sh/ruff#4122

##### Settings

- Add support for providing command-line arguments via `argfile` by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4087

##### Bug Fixes

- Make D410/D411 autofixes mutually exclusive by
[@&#8203;evanrittenhouse](https://togithub.com/evanrittenhouse) in
[astral-sh/ruff#4110
- Remove `pyright` comment prefix from PYI033 checks by
[@&#8203;evanrittenhouse](https://togithub.com/evanrittenhouse) in
[astral-sh/ruff#4152
- Fix F811 false positive with match by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4161
- Fix `E713` and `E714` false positives for multiple comparisons by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4083
- Fix B023 shadowed variables in nested functions by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#4111
- Preserve star-handling special-casing for force-single-line by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4129
- Respect parent-scoping rules for `NamedExpr` assignments by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4145
- Fix UP032 auto-fix by
[@&#8203;JonathanPlasse](https://togithub.com/JonathanPlasse) in
[astral-sh/ruff#4165
- Allow boolean parameters for `pytest.param` by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4176

##### Internal

- Replace row/column based `Location` with byte-offsets. by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#3931
- perf(logical-lines): Various small perf improvements by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#4022
- Use `memchr` to speedup newline search on x86 by
[@&#8203;MichaReiser](https://togithub.com/MichaReiser) in
[astral-sh/ruff#3985
- Remove `ScopeStack` in favor of child-parent `ScopeId` pointers by
[@&#8203;charliermarsh](https://togithub.com/charliermarsh) in
[astral-sh/ruff#4138

**Full Changelog**:
astral-sh/ruff@v0.0.263...v0.0.264

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://app.renovatebot.com/dashboard#github/allenporter/pyrainbird).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS42OS4zIiwidXBkYXRlZEluVmVyIjoiMzUuNjkuMyIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants