Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pep440: some minor refactoring, mostly around error types #780

Merged
merged 10 commits into from
Jan 4, 2024

Conversation

BurntSushi
Copy link
Member

This PR does a bit of refactoring to the pep440 crate, and in
particular around the erorr types. This PR is meant to be a precursor
to another PR that does some surgery (both in parsing and in Version
representation) that benefits somewhat from this refactoring.

As usual, please review commit-by-commit.

@BurntSushi BurntSushi requested review from charliermarsh and konstin and removed request for charliermarsh January 4, 2024 14:57
crates/pep440-rs/src/version.rs Show resolved Hide resolved
crates/pep440-rs/src/version_specifier.rs Outdated Show resolved Hide resolved
crates/pep440-rs/src/version_specifier.rs Show resolved Hide resolved
crates/pep440-rs/src/version_specifier.rs Show resolved Hide resolved
crates/pep440-rs/src/version_specifier.rs Show resolved Hide resolved
I'll like expand this comment a little bit in subsequent commits, but I
felt like this was useful to put in the beginning of this patch series
to "set the stage." That is, this data will motivate some of the
optimizations we'll do for version parsing, version representation and
comparisons.
This moves LocalSegment and PreRelease after Version, and also keeps
their impl blocks grouped together.

No code is changed here.
It turns out that the current parser permits non-ASCII whitespace in
places. We can be sneaky here to cause the existing implementation to
produce incorrect offsets.

The core issue here is that `unicode-width` is used to compute codepoint
offsets, but its actual purpose is to compute the *visual width* of a
codepoint that has been rendered. Some codepoints use more than 1 unit
of visual width.

While this establishes a mismatch between the implementation and the
documented behavior of `Pep440Error`, we will also address this by
switching to byte offsets instead of codepoint offsets. (Codepoint
offsets are almost never what you want.)
We still use unicode-width, but only when rendering the error message.

We also add the error message itself to the Display impl.
This error type was only be used when one attempted to parse a string as
a `VersionSpecifiers`. We'll want to introduce more structured error
types for parsing `Version` and `VersionSpecifier` as well, so renaming
the error type helps make room for that.

We also make the error type opaque. Nothing (in puffin at least) seemed
to need its internals. We can always add accessor methods in the future
if something else needs it. It's overall rare to need to expose the
entire internal representation of an error type.
Since it's specifier to version specifier parsing, it makes sense to
have it live with the corresponding type. Putting it in lib.rs I think
makes it seem like an error type that applies to other things as well.
This smoothes out the error handling to using something a bit more
structured. I did this because I intend to do the same for Version
(eventually), and it seems good to be consistent. It also lets us nest
errors a bit more easily and scrutinize the different error classes at a
glance.
This makes the top-level error type small, just as the others we added
in the previous commit are. This was specifically flagged by Clippy.
Prior to the refactoring in previous commits, tests were generally
written against the rendered error message instead of the structured
message. This was in part because many of the error types themselves
were just a `String`, but also to explicitly test what an end user
actually sees. That's valuable.

While we keep most of the tests as rewritten to target the new
structured error representation, we add some new tests that captures the
value of testing the messages than humans will actually see.
@BurntSushi BurntSushi merged commit d7c9b15 into main Jan 4, 2024
3 checks passed
@BurntSushi BurntSushi deleted the ag/version-refactor branch January 4, 2024 17:28
BurntSushi added a commit that referenced this pull request Jan 5, 2024
This PR builds on #780 by making both version parsing faster, and
perhaps more importantly, making version comparisons much faster.
Overall, these changes result in a considerable improvement for the
`boto3.in` workload. Here's the status quo:

```
$ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
Resolved 31 packages in 34.56s

real    34.579
user    34.004
sys     0.413
maxmem  2867 MB
faults  0
```

And now with this PR:

```
$ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
Resolved 31 packages in 9.20s

real    9.218
user    8.919
sys     0.165
maxmem  463 MB
faults  0
```

This particular workload gets stuck in pubgrub doing resolution, and
thus benefits mightily from a faster `Version::cmp` routine. With that
said, this change does also help a fair bit with "normal" runs:

```
$ hyperfine -w10 \
    "puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in" \
    "puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in"
Benchmark 1: puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
  Time (mean ± σ):     337.5 ms ±   3.9 ms    [User: 310.5 ms, System: 73.2 ms]
  Range (min … max):   333.6 ms … 343.4 ms    10 runs

Benchmark 2: puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
  Time (mean ± σ):     189.8 ms ±   3.0 ms    [User: 168.1 ms, System: 78.4 ms]
  Range (min … max):   185.0 ms … 196.2 ms    15 runs

Summary
  puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in ran
    1.78 ± 0.03 times faster than puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
```

There is perhaps some future work here (detailed in the commit
messages), but I suspect it would be more fruitful to explore ways of
making resolution itself and/or deserialization faster.

Fixes #373, Closes #396
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants