
pep440: rewrite the parser and make version comparisons cheaper #789

Merged
merged 8 commits into from
Jan 5, 2024

Conversation

BurntSushi
Member

@BurntSushi BurntSushi commented Jan 4, 2024

This PR builds on #780 by making version parsing faster and, perhaps
more importantly, version comparisons much faster. Overall, these
changes result in a considerable improvement on the boto3.in workload.
Here's the status quo:

$ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
Resolved 31 packages in 34.56s

real    34.579
user    34.004
sys     0.413
maxmem  2867 MB
faults  0

And now with this PR:

$ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
Resolved 31 packages in 9.20s

real    9.218
user    8.919
sys     0.165
maxmem  463 MB
faults  0

This particular workload gets stuck in pubgrub doing resolution, and
thus benefits mightily from a faster Version::cmp routine. With that
said, this change does also help a fair bit with "normal" runs:

$ hyperfine -w10 \
    "puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in" \
    "puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in"
Benchmark 1: puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
  Time (mean ± σ):     337.5 ms ±   3.9 ms    [User: 310.5 ms, System: 73.2 ms]
  Range (min … max):   333.6 ms … 343.4 ms    10 runs

Benchmark 2: puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
  Time (mean ± σ):     189.8 ms ±   3.0 ms    [User: 168.1 ms, System: 78.4 ms]
  Range (min … max):   185.0 ms … 196.2 ms    15 runs

Summary
  puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in ran
    1.78 ± 0.03 times faster than puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in

There is perhaps some future work here (detailed in the commit
messages), but I suspect it would be more fruitful to explore ways of
making resolution itself and/or deserialization faster.

Fixes #373, Closes #396

Member

@charliermarsh charliermarsh left a comment


Can't wait to read this, I feel like I'm gonna learn a lot.

@charliermarsh charliermarsh self-requested a review January 4, 2024 19:58
@@ -241,7 +242,7 @@ impl std::fmt::Display for OperatorParseError {
/// ```
#[derive(Clone)]
pub struct Version {
inner: VersionInner,
inner: Arc<VersionInner>,
Member


I will ask a dumb question: why Arc and not Box?

Member


Oh, is it because Arc allows for cheap cloning...?

Member

@konstin konstin Jan 5, 2024


Do we want the Arc here or in PubGrubVersion? That is, would it have any disadvantages to only use it in PubGrubVersion so Version itself remains the "real" type / does the copy on write make a noticeable impact?

Member Author


Yes, Arc makes cloning essentially free.

I think Version is fine with an Arc inside of it, and it's hard for me to conceive of a way that copy-on-write would hurt things. I would expect that the vast majority of the time, when Arc::make_mut is called, an extra copy is not made. (Since it's done as part of building a new Version.) And even then, in the cases where Arc::make_mut does make a copy, the caller would have had to make a copy anyway if Arc weren't used.

I think there are really only two downsides to using Arc here. Firstly, it introduces an alloc in the happy path of building/parsing a version. Secondly, it requires using either alloc or std. I don't think the first makes too much of a difference, and I think the latter doesn't matter for this crate (unless you want to take it in the direction of core-only, which would be quite tricky).
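To make the "cloning is essentially free" point concrete, here is a minimal illustration (not the crate's actual code; `Inner` is a stand-in for the private `VersionInner`): cloning an `Arc`-wrapped value copies one pointer and bumps a reference count, leaving the heap payload shared.

```rust
use std::sync::Arc;

// A stand-in for the crate's private `VersionInner`; the fields here
// are illustrative, not the real layout.
#[derive(Clone)]
struct Inner {
    release: Vec<u64>,
}

#[derive(Clone)]
struct Version {
    inner: Arc<Inner>,
}

// Cloning a `Version` copies one pointer and increments a refcount;
// the payload is shared, not duplicated.
fn clone_is_shallow() -> (usize, bool) {
    let v1 = Version {
        inner: Arc::new(Inner { release: vec![1, 2, 3] }),
    };
    let v2 = v1.clone();
    (Arc::strong_count(&v1.inner), Arc::ptr_eq(&v1.inner, &v2.inner))
}
```

After the clone, both handles point at the same allocation and the strong count is 2, which is exactly why pubgrub-style workloads that clone versions constantly benefit.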

Member Author

@BurntSushi BurntSushi Jan 5, 2024


does the copy on write make a noticeable impact?

It might help to clarify here: Arc::make_mut isn't quite just copy-on-write. It only makes a copy when the reference count is greater than 1. But if it's 1, and I suspect it usually will be in this context, then a mutable borrow with no copying can be handed out safely because exclusivity has been proven.

Typically copy-on-write implies something closer to "whenever one needs to write, you always copy first." And indeed, if that were the case here, it might make Arc-usage a little more suspicious.
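The distinction can be demonstrated directly with the standard library (a sketch, not the crate's code): `Arc::make_mut` hands out `&mut` in place when the refcount is 1, and clones the payload only when the value is actually shared.

```rust
use std::sync::Arc;

// `Arc::make_mut` only clones the payload when the value is shared;
// with a refcount of 1 it proves exclusivity and mutates in place.
fn make_mut_behavior() -> (bool, bool) {
    let mut unique = Arc::new(vec![1, 2, 3]);
    let before = Arc::as_ptr(&unique);
    Arc::make_mut(&mut unique).push(4); // count == 1: no copy made
    let no_copy = Arc::as_ptr(&unique) == before;

    let mut shared = Arc::new(vec![1, 2, 3]);
    let other = Arc::clone(&shared);
    Arc::make_mut(&mut shared).push(4); // count == 2: payload cloned
    let copied = !Arc::ptr_eq(&shared, &other);

    (no_copy, copied)
}
```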

Member

@charliermarsh charliermarsh left a comment


This is really quite an incredible PR, and you did a great job of breaking down some very sophisticated optimizations and decision-making. I loved reading it, and the results are remarkable!

Member

@konstin konstin left a comment


woah

crates/pep440-rs/src/version.rs
This starts to untangle the dual use of Version for both acting as a
literal version and a pattern for matching versions. We still retain
reuse in the implementation, but are more explicit about the difference
in the public API. This also makes it a little smoother to build a
`VersionSpecifier` instead of carrying an extra boolean around.
The `Option<Vec<LocalSegment>>` representation creates redundant case
analysis. That is, there doesn't appear to be a case where it can both
be `Some` and have zero segments. Instead, we can just use the case of
zero segments to mean "no local component."
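A sketch of that representation change (with a simplified stand-in for `LocalSegment`): collapsing `Option<Vec<_>>` into a plain `Vec<_>` removes the unrepresentable `Some(vec![])` state, so there is one case analysis instead of two.

```rust
// Simplified stand-in for the crate's `LocalSegment`.
#[derive(Clone, Debug, PartialEq, Eq)]
enum LocalSegment {
    String(String),
    Number(u64),
}

struct Version {
    // An empty Vec now means "no local component"; previously this was
    // `Option<Vec<LocalSegment>>`, where `None` and `Some(vec![])`
    // could never both occur but both had to be handled.
    local: Vec<LocalSegment>,
}

impl Version {
    fn has_local(&self) -> bool {
        !self.local.is_empty()
    }
}
```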
This rewrites the parser to avoid regexes. We do things by hand directly
on a `&[u8]`, and also use more structured errors internally (although
we don't expose those yet). This is good for about a 1.1x improvement:

    $ hyperfine -w10 \
        'puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in' \
        'puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in'
    Benchmark 1: puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in
      Time (mean ± σ):     317.3 ms ±  18.1 ms    [User: 268.4 ms, System: 106.8 ms]
      Range (min … max):   297.4 ms … 360.1 ms    10 runs

    Benchmark 2: puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in
      Time (mean ± σ):     345.0 ms ±  12.4 ms    [User: 293.5 ms, System: 109.7 ms]
      Range (min … max):   329.9 ms … 372.8 ms    10 runs

    Summary
      puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in ran
        1.09 ± 0.07 times faster than puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in

The improvement here is rather small because the fast path that was
added recently was doing a lot of the heavy lifting.
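The flavor of parsing "by hand directly on a `&[u8]`" can be sketched with a toy helper (illustrative only, not the crate's actual parser): consume a run of ASCII digits without any regex machinery and return the number plus the unconsumed tail.

```rust
// Toy illustration of regex-free parsing on `&[u8]`: read a run of
// ASCII digits, returning the parsed number and the remaining bytes.
fn parse_number(bytes: &[u8]) -> Option<(u64, &[u8])> {
    let end = bytes
        .iter()
        .position(|b| !b.is_ascii_digit())
        .unwrap_or(bytes.len());
    if end == 0 {
        return None; // no leading digits at all
    }
    let mut n: u64 = 0;
    for &b in &bytes[..end] {
        // checked arithmetic guards against overflow on huge inputs
        n = n.checked_mul(10)?.checked_add(u64::from(b - b'0'))?;
    }
    Some((n, &bytes[end..]))
}
```

Structured errors would replace the `Option` here; the point is only that each component is recognized with direct byte inspection.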

In subsequent commits, we will:

* Expose the error type so that we don't just return a `String`.
* Add back a fast path for parsing just `w.x.y.z`.
* Tweak the representation of `Version` to make its `Ord` impl faster.
  In particular, this is still showing up as a bottleneck from
  resolution.

Fixes #373
This adds a simple fast path to version parsing that just looks for
versions that only have a release component with 4 or fewer numbers.

This does lead to a small ~1.03x improvement on the benchmark mentioned
in a previous commit, but the main reason to do this is to set ourselves
up for version parsing without any allocation at all. Namely, in the
next commit, we're going to split the representation of Version in two.
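A hedged sketch of such a fast path (not the crate's exact code): accept only `w[.x[.y[.z]]]` where every segment is ASCII digits, and bail out (so the caller falls back to the general parser) for anything else.

```rust
// Fast-path sketch: parse only release-component-only versions with at
// most four dotted numeric segments; return None for everything else
// so the general parser can take over.
fn parse_fast(s: &str) -> Option<[u64; 4]> {
    let mut release = [0u64; 4];
    for (i, part) in s.split('.').enumerate() {
        if i >= 4 || part.is_empty() || !part.bytes().all(|b| b.is_ascii_digit()) {
            return None; // not a plain w.x.y.z version
        }
        release[i] = part.parse().ok()?;
    }
    Some(release)
}
```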
The high-level goal of this commit is to make `Version::cmp` lower to the
equivalent of `u64::cmp` in the vast majority of cases.

We achieve this by carving out a large subset (~90%) of all possible
versions on PyPI and devising a small fixed representation for them.
This is called `VersionSmall`. Everything that doesn't fit into this
representation is put into `VersionFull`. The former is, essentially, a
`u64`. The latter is what the status quo looked like before this commit.

This change overall makes a rather large difference when puffin gets
stuck in version resolution. Namely, pubgrub does a lot of version
comparisons, and by changing each of those to a simple u64 comparison,
we get a huge win:

    $ time puffin-cmp pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
    Resolved 31 packages in 10.90s

    real    10.923
    user    10.406
    sys     0.384
    maxmem  2398 MB
    faults  0

    $ time puffin-base pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
    Resolved 31 packages in 34.56s

    real    34.579
    user    34.004
    sys     0.413
    maxmem  2867 MB
    faults  0

Future work on this point might investigate removing some of the extra
storage on `VersionSmall` (see code comments). And if we did that, it
might make sense to box a `Version`.
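The shape of the split representation can be sketched like this (field names and layout are simplified from the PR's description): `VersionSmall` packs the common case into a `u64` whose integer order matches version order, so comparing two small versions is one integer comparison.

```rust
use std::cmp::Ordering;
use std::sync::Arc;

// Simplified stand-in for the general heap-allocated representation.
struct VersionFull {
    release: Vec<u64>,
    // ...epoch, pre/post/dev, local segments, and so on.
}

enum VersionInner {
    // ~90% of PyPI versions: a u64 laid out so that integer order
    // equals PEP 440 version order.
    Small { repr: u64 },
    // Everything else keeps the general form.
    Full { full: Arc<VersionFull> },
}

fn cmp(a: &VersionInner, b: &VersionInner) -> Ordering {
    match (a, b) {
        // The overwhelmingly common case lowers to u64::cmp.
        (VersionInner::Small { repr: x }, VersionInner::Small { repr: y }) => x.cmp(y),
        // Mixed or full-full comparisons take a slow path (elided here).
        _ => unimplemented!("cmp_slow"),
    }
}
```

The hard part, which this sketch elides, is choosing the byte layout of `repr` so that the ordering invariant actually holds for epochs, pre-releases, and the rest.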
This puts a `Version` inside of an `Arc` internally. This "works" pretty
transparently because `Version` does not expose any methods for direct
in-place mutation.

While this doesn't *obviously* speed anything up directly, it does
result in a fairly sizeable improvement to the `boto3` requirements
list:

    $ time puffin-cmp pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
    Resolved 31 packages in 10.84s

    real    10.858
    user    10.272
    sys     0.444
    maxmem  2402 MB
    faults  0

    $ time puffin-cmparc pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
    Resolved 31 packages in 9.20s

    real    9.218
    user    8.919
    sys     0.165
    maxmem  463 MB
    faults  0

My hypothesis for why this leads to an improvement is that this shrinks
the size of a `Version` from 104 bytes(!) to 8 bytes. Since it looks
like pubgrub builds in-memory collections of `Version` values, this in
turn substantially shrinks heap usage of those collections and also
likely makes things more cache friendly.
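The size claim is easy to check with `std::mem::size_of` (a sketch with a stand-in payload; the real `VersionInner` fields differ, but the 104-byte figure is mirrored here): behind an `Arc`, the handle type is a single pointer.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Stand-in payload sized to match the 104 bytes quoted for the
// pre-Arc representation (13 * 8 = 104).
struct BigInner {
    _bytes: [u64; 13],
}

// The handle is one non-null pointer: 8 bytes on 64-bit targets,
// regardless of how large the payload behind it is.
struct Version {
    _inner: Arc<BigInner>,
}
```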

It's plausible that an even better approach here would be to only wrap a
`VersionFull` in an `Arc`. But to make this most effective, we'd want to
truly represent a `VersionSmall` as just a single `u64`. Perhaps even a
`NonZeroU64` to create a niche. But to do that, I think we need to:

Represent the number of release segments in `VersionSmall::repr`. This
should be easy enough to do, I think, by commandeering the two least
significant bits of the release component (currently using the 5 most
significant bytes of the u64 representation). This shouldn't impact
comparisons since the lengths are at the least significant part, and
thus, will be equivalent if a comparison needs to consider them. If we
borrow the two bits from the first release segment (which is a u16),
then it should still be able to represent calver since 2^14=16384.
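A hypothetical sketch of that bit-borrowing step (the field widths here are illustrative, chosen to match the reasoning above): store the segment count in the two low bits of the 16-bit field, leaving 14 bits for the segment value itself, so ordinary integer comparison still orders by segment first.

```rust
// Hypothetical packing: 14 bits of segment value (enough for calver,
// since 2^14 = 16384) plus a 2-bit segment count in the low bits.
fn pack(segment: u16, count: u8) -> u16 {
    debug_assert!(segment < (1 << 14) && count < 4);
    (segment << 2) | u16::from(count)
}

fn unpack(packed: u16) -> (u16, u8) {
    (packed >> 2, (packed & 0b11) as u8)
}
```

Because the count occupies the least significant bits, two packed values with different segments compare by segment; the count only breaks ties.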

Then `Version` could be defined like this:

    struct Version(VersionInner);

    enum VersionInner {
        Small(u64),
        Full(Arc<VersionFull>),
    }

Although this does raise the size of `Version` from the 8 bytes in this
PR to 16. One might be able to reduce it back to 8 with a niche
optimization, but I can't think of any obvious way of doing that in safe
code.
@BurntSushi BurntSushi merged commit 6c98ae9 into main Jan 5, 2024
3 checks passed
@BurntSushi BurntSushi deleted the ag/version-opt branch January 5, 2024 16:57
konstin added a commit that referenced this pull request Jan 5, 2024
Looking at the profile for tf-models-nightly after #789, `compare_release` is by far the slowest function. By adding a fast path, we avoid paying the cost of padding releases with 0s when they are the same length.

```
$ hyperfine --warmup 1 --runs 3 "target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt"
 "target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt"
Benchmark 1: target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt
  Time (mean ± σ):     11.963 s ±  0.225 s    [User: 11.478 s, System: 0.451 s]
  Range (min … max):   11.747 s … 12.196 s    3 runs

Benchmark 2: target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt
  Time (mean ± σ):     10.317 s ±  0.720 s    [User: 9.885 s, System: 0.404 s]
  Range (min … max):    9.501 s … 10.860 s    3 runs

Summary
  target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt ran
    1.16 ± 0.08 times faster than target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt
```
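The fast path described above can be sketched as follows (assuming release segments are slices of `u64`; not the crate's exact code): equal-length releases compare lexicographically with no padding, and only mismatched lengths treat missing trailing segments as zero.

```rust
use std::cmp::Ordering;

// Fast path: same-length releases need no zero-padding; plain slice
// comparison (lexicographic over u64) is the PEP 440 release order.
fn compare_release(a: &[u64], b: &[u64]) -> Ordering {
    if a.len() == b.len() {
        return a.cmp(b);
    }
    // Slow path: pad the shorter release with trailing zeros.
    let n = a.len().max(b.len());
    for i in 0..n {
        let x = a.get(i).copied().unwrap_or(0);
        let y = b.get(i).copied().unwrap_or(0);
        match x.cmp(&y) {
            Ordering::Equal => continue,
            ord => return ord,
        }
    }
    Ordering::Equal
}
```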
charliermarsh pushed a commit that referenced this pull request Jan 5, 2024
Looking at the profile for tf-models-nightly after #789,
`compare_release` is the single biggest item. By adding a fast path, we
avoid paying the cost of padding releases with 0s when they are the
same length, resulting in a 16% improvement for this pathological case.
Note that this mainly happens because tf-models-nightly is almost all
large dev releases that hit the slow path.

**Before**


![image](https://github.com/astral-sh/puffin/assets/6826232/0d2b4553-da69-4cdb-966b-0894a6dd5d94)

**After**


![image](https://github.com/astral-sh/puffin/assets/6826232/6d484808-9d16-408d-823e-a12d321802a5)

@@ -639,6 +615,7 @@ impl std::fmt::Debug for Version {
}

impl PartialEq<Self> for Version {
#[inline]
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == Ordering::Equal
Member


Should we add a fast path here that avoids calling into cmp_slow if both self.inner and other.inner point to the same Arc?

Member Author

@BurntSushi BurntSushi Jan 9, 2024


Not sure. cmp_slow is designed to be called very infrequently (in practice), and I don't know the portion of comparisons that enter cmp_slow that have pointer equality internally. My prior is that the portion would be rather small, but I could be wrong. Especially for particular benchmarks that might hit cmp_slow more often than is intended. For example: #799
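For reference, the suggested shortcut would look roughly like this (illustrative stand-in types, not the crate's code): two handles to the same allocation are equal by definition, so the structural comparison can be skipped.

```rust
use std::sync::Arc;

// Illustrative stand-ins; the real comparison would dispatch to
// Version's Ord impl rather than comparing a single field.
struct Inner {
    release: Vec<u64>,
}

struct Version {
    inner: Arc<Inner>,
}

// Pointer equality proves value equality for shared allocations; only
// distinct allocations fall through to the structural comparison.
fn eq_fast(a: &Version, b: &Version) -> bool {
    Arc::ptr_eq(&a.inner, &b.inner) || a.inner.release == b.inner.release
}
```

As the reply notes, whether this pays off depends on how often comparisons that reach the slow path actually share an allocation.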
