
Remove type parameter from parse_* methods #9466

Merged · 1 commit merged into main from token-source on Jan 11, 2024

Conversation

@MichaReiser (Member) commented on Jan 11, 2024

This PR removes the I (token iterator) type parameter from all parse_* methods. This is done in preparation for #9152, to avoid accidentally monomorphizing the parser more than once.

The Parser struct defined in #9152 is parametrized by the underlying tokens iterator. This is dangerous because Rust will monomorphize the entire parser for each distinct I type parameter.
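To make the hazard concrete, here is a minimal, hypothetical sketch (not Ruff's actual Parser) of a parser generic over its token iterator. Each distinct iterator type at a call site forces the compiler to emit a separate copy of the struct and every one of its methods:

```rust
// Hypothetical illustration: a parser generic over its token iterator `I`.
// Every distinct `I` used below causes rustc to monomorphize a separate
// copy of `Parser<I>` and all of its methods into the binary.
struct Parser<I: Iterator<Item = u32>> {
    tokens: I,
}

impl<I: Iterator<Item = u32>> Parser<I> {
    fn new(tokens: I) -> Self {
        Parser { tokens }
    }

    // Stand-in for a real parse method; counts the tokens it consumes.
    fn count_tokens(mut self) -> usize {
        let mut n = 0;
        while self.tokens.next().is_some() {
            n += 1;
        }
        n
    }
}

fn main() {
    // Two call sites with different iterator types => two full
    // instantiations of `Parser` and its methods.
    let from_vec = Parser::new(vec![1u32, 2, 3].into_iter()).count_tokens();
    let from_filtered = Parser::new((0u32..5).filter(|t| t % 2 == 0)).count_tokens();
    assert_eq!(from_vec, 3);
    assert_eq!(from_filtered, 3);
}
```

For a small helper this duplication is harmless; for an entire recursive-descent parser it multiplies compile time and binary size per iterator type.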

This PR removes the I type parameters from all parse_* methods and introduces a new TokenSource type in the parser that:

  • Filters out trivia tokens
  • Allows lookahead without the need for MultiPeek (which has an overhead when reading tokens)
  • Is a single type used by our Parser to read tokens.
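The three responsibilities above can be sketched as follows. This is an illustrative, simplified version (the token names and method signatures are assumptions, not Ruff's actual TokenSource API):

```rust
// Minimal sketch of a TokenSource over already-lexed tokens. It filters
// trivia and supports arbitrary lookahead without itertools' MultiPeek.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tok {
    Name,
    Number,
    Comment,           // trivia
    NonLogicalNewline, // trivia
}

struct TokenSource {
    tokens: Vec<Tok>,
    pos: usize,
}

impl TokenSource {
    fn new(tokens: Vec<Tok>) -> Self {
        TokenSource { tokens, pos: 0 }
    }

    fn is_trivia(tok: &Tok) -> bool {
        matches!(tok, Tok::Comment | Tok::NonLogicalNewline)
    }

    /// Look ahead to the n-th non-trivia token without consuming anything.
    fn peek_nth(&self, n: usize) -> Option<Tok> {
        self.tokens[self.pos..]
            .iter()
            .filter(|t| !Self::is_trivia(t))
            .nth(n)
            .copied()
    }

    /// Consume and return the next non-trivia token.
    fn next_token(&mut self) -> Option<Tok> {
        while self.pos < self.tokens.len() {
            let tok = self.tokens[self.pos];
            self.pos += 1;
            if !Self::is_trivia(&tok) {
                return Some(tok);
            }
        }
        None
    }
}

fn main() {
    let mut source = TokenSource::new(vec![
        Tok::Name,
        Tok::Comment,
        Tok::NonLogicalNewline,
        Tok::Number,
    ]);
    // Lookahead skips trivia: the second non-trivia token is the Number.
    assert_eq!(source.peek_nth(1), Some(Tok::Number));
    assert_eq!(source.next_token(), Some(Tok::Name));
    assert_eq!(source.next_token(), Some(Tok::Number));
    assert_eq!(source.next_token(), None);
}
```

Because the type owns a concrete Vec rather than a generic iterator, the parser built on top of it is a single non-generic type.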

The downside is that all parse_* methods now take a Vec<LexResult>, which requires collecting the lexer output into a Vec before parsing. Our micro-benchmarks show that this is slower than streaming the tokens one by one.

However, it turns out that both the linter and the formatter already collect the tokens before parsing them into an AST. That means the path exercised by the micro-benchmark isn't used in the actual product, and the introduced regression doesn't affect users.

You may wonder why this is only a problem now and wasn't a problem with lalrpop. The problem existed with lalrpop as well, but to a much smaller extent, because lalrpop separates the parser into two parts:

  • A state machine parametrized by I that consumes the tokens and calls into the parser methods (only passing the tokens)
  • The actual parsing methods

Only the state machine gets monomorphized, which is fine because it is limited in size. Another way to think about it: lalrpop pushes the tokens into the parser (so the parser is decoupled from I), whereas the handwritten parser pulls the tokens (and is therefore generic over I).
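The push-vs-pull distinction can be sketched in a few lines. This is an illustrative toy, not lalrpop's or Ruff's real structure: in the push style only a thin driver is generic over I, while in the pull style the entire parser is:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Tok(u32);

// Push style (lalrpop-like): the parser itself is not generic; only the
// small `drive` function gets monomorphized per iterator type `I`.
struct PushParser {
    sum: u32,
}

impl PushParser {
    fn accept(&mut self, tok: Tok) {
        self.sum += tok.0;
    }
}

fn drive<I: Iterator<Item = Tok>>(tokens: I) -> u32 {
    let mut parser = PushParser { sum: 0 };
    for tok in tokens {
        parser.accept(tok); // tokens are pushed into the parser
    }
    parser.sum
}

// Pull style (handwritten-parser-like): the parser owns the iterator, so
// every parsing method is generic over `I` and duplicated per instantiation.
struct PullParser<I: Iterator<Item = Tok>> {
    tokens: I,
    sum: u32,
}

impl<I: Iterator<Item = Tok>> PullParser<I> {
    fn parse(mut self) -> u32 {
        while let Some(tok) = self.tokens.next() {
            self.sum += tok.0; // tokens are pulled by the parser
        }
        self.sum
    }
}

fn main() {
    let toks = vec![Tok(1), Tok(2), Tok(3)];
    assert_eq!(drive(toks.clone().into_iter()), 6);
    let pull = PullParser { tokens: toks.into_iter(), sum: 0 };
    assert_eq!(pull.parse(), 6);
}
```

Both compute the same result, but in the pull style the code duplicated per I is the whole parser rather than a small driver loop.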

Alternatives

Alternatives that may be less affected by the performance regression:

  • A TokenSource enum with two variants: One storing the lexed tokens and the other storing the lexer to consume the tokens lazily.
  • Storing a Box<impl Iterator> in the handwritten parser

I went with the above approach because, outside the parser benchmark, Ruff doesn't need the flexibility of choosing between lazy and eager lexing. Thus, using a Vec<LexResult> seemed the simplest solution.
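For reference, the first alternative could look roughly like this. The names are illustrative, not a real Ruff type; the point is that one non-generic enum hides both strategies behind a runtime branch instead of monomorphization:

```rust
// Sketch of the rejected TokenSource-enum alternative: one variant holds
// pre-lexed tokens, the other pulls from a boxed lazy lexer.
enum TokenSource {
    Eager { tokens: Vec<u32>, pos: usize },
    Lazy(Box<dyn Iterator<Item = u32>>),
}

impl TokenSource {
    fn next_token(&mut self) -> Option<u32> {
        match self {
            TokenSource::Eager { tokens, pos } => {
                let tok = tokens.get(*pos).copied();
                *pos += 1;
                tok
            }
            // Dynamic dispatch replaces a generic `I` parameter here.
            TokenSource::Lazy(iter) => iter.next(),
        }
    }
}

fn main() {
    let mut eager = TokenSource::Eager { tokens: vec![1, 2], pos: 0 };
    assert_eq!(eager.next_token(), Some(1));
    assert_eq!(eager.next_token(), Some(2));
    assert_eq!(eager.next_token(), None);

    let mut lazy = TokenSource::Lazy(Box::new((3u32..5).map(|t| t * 10)));
    assert_eq!(lazy.next_token(), Some(30));
    assert_eq!(lazy.next_token(), Some(40));
}
```

The cost is a branch (or virtual call) per token; the Vec-only design avoids that at the price of always lexing eagerly.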

CC: @LaBatata101

codspeed-hq bot commented on Jan 11, 2024

CodSpeed Performance Report

Merging #9466 will degrade performance by 6.76%

Comparing token-source (9d17580) with main (14d3fe6)

Summary

❌ 5 (👁 5) regressions
✅ 25 untouched benchmarks

Benchmarks breakdown

Benchmark | main | token-source | Change
--- | --- | --- | ---
👁 parser[unicode/pypinyin.py] | 4 ms | 4.3 ms | -6.76%
👁 parser[pydantic/types.py] | 25.7 ms | 27.3 ms | -5.9%
👁 parser[numpy/globals.py] | 1.1 ms | 1.1 ms | -4.87%
👁 parser[numpy/ctypeslib.py] | 11.6 ms | 12.2 ms | -5.49%
👁 parser[large/dataset.py] | 67.4 ms | 70.8 ms | -4.82%

@MichaReiser force-pushed the token-source branch 2 times, most recently from 6552632 to 7403d2d, on January 11, 2024 12:08

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser MichaReiser changed the title Add TokenSource Remove type parameter from parse_* methods Jan 11, 2024
@MichaReiser MichaReiser added internal An internal refactor or improvement parser Related to the parser labels Jan 11, 2024
@MichaReiser MichaReiser marked this pull request as ready for review January 11, 2024 13:52
@charliermarsh (Member) left a comment

This seems reasonable to me given the explanation (especially around the perceived regression) in the summary.

@BurntSushi (Member) left a comment

I buy what you're selling.

```diff
  let marker_token = (Tok::start_marker(mode), TextRange::default());
- let lexer = std::iter::once(Ok(marker_token)).chain(lxr);
+ let lexer = std::iter::once(Ok(marker_token)).chain(TokenSource::new(tokens));
```
Member commented on the diff:

I was trying to figure out why this is a dedicated type instead of a tokens.into_iter().filter(...), but couldn't come up with a reason. (Usually it's because you want to name the type somewhere, but maybe I'm missing that.)

@MichaReiser (Member, Author) replied:

It's mainly to have a named type in #9152, and a place where we can implement lookahead without using PeekMany.

Member replied:

Ah, gotcha, don't mind me then. :)

@LaBatata101 (Contributor) commented:

LGTM. I was actually going to do something like this later, glad you saved me the work 😊

@MichaReiser MichaReiser merged commit f192c72 into main Jan 11, 2024
17 checks passed
@MichaReiser MichaReiser deleted the token-source branch January 11, 2024 18:41