-
Notifications
You must be signed in to change notification settings - Fork 670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pep440: rewrite the parser and make version comparisons cheaper #789
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't wait to read this, I feel like I'm gonna learn a lot.
@@ -241,7 +242,7 @@ impl std::fmt::Display for OperatorParseError { | |||
/// ``` | |||
#[derive(Clone)] | |||
pub struct Version { | |||
inner: VersionInner, | |||
inner: Arc<VersionInner>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will ask a dumb question: why Arc
and not Box
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, is it because Arc
allows for cheap cloning...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the Arc
here or in PubGrubVersion
? That is, would it have any disadvantages to only use it in PubGrubVersion
so Version
itself remains the "real" type / does the copy on write make a noticeable impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Arc
makes cloning essentially free.
I think Version
is fine with an Arc
inside of it, and it's hard for me to conceive of a way that copy-on-write would hurt things. I would expect that the vast majority of the time, when Arc::make_mut
is called, an extra copy is not made. (Since it's done as part of building a new Version
.) And even then, in the cases where Arc::make_mut
does make a copy, the caller would have had to make a copy anyway if Arc
weren't used.
I think there are really only two downsides to using Arc
here. Firstly, it introduces an alloc in the happy path of building/parsing a version. Secondly, it requires using either alloc
or std
. I don't think the first makes too much of a difference, and I think the latter doesn't matter for this crate (unless you want to take in the direction of core
-only, which would be quite tricky).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the copy on write make a noticeable impact?
It might help to clarify here: Arc::make_mut
isn't quite just copy-on-write. It only makes a copy when the reference count is greater than 1. But if it's 1, and I suspect it usually will be in this context, then a mutable borrow with no copying can be handed out safely because exclusivity has been proven.
Typically copy-on-write implies something closer to "whenever one needs to write, you always copy first." And indeed, if that were the case here, it might make Arc
-usage a little more suspicious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really quite an incredible PR, and you did a great job of breaking down some very sophisticated optimizations and decision-making. I loved reading it, and the results are remarkable!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woah
@@ -241,7 +242,7 @@ impl std::fmt::Display for OperatorParseError { | |||
/// ``` | |||
#[derive(Clone)] | |||
pub struct Version { | |||
inner: VersionInner, | |||
inner: Arc<VersionInner>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the Arc
here or in PubGrubVersion
? That is, would it have any disadvantages to only use it in PubGrubVersion
so Version
itself remains the "real" type / does the copy on write make a noticeable impact?
This starts to untangle the dual use of Version for both acting as a literal version and a pattern for matching versions. We still retain reuse in the implementation, but are more explicit about the difference in the public API. This also makes it a little smoother to build a `VersionSpecifier` instead of carrying an extra boolean around.
The `Option<Vec<LocalSegment>>` representation creates redundant case analysis. That is, there doesn't appear to be a case where it can both be `Some` and have zero segments. Instead, we can just use the case of zero segments to mean "no local component."
This rewrites the parser to avoid regexes. We do things by hand directly on a `&[u8]`, and also use more structured errors internally (although we don't expose those yet). This is good for about a 1.1x improvement: $ hyperfine -w10 \ 'puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in' \ 'puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in' Benchmark 1: puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in Time (mean ± σ): 317.3 ms ± 18.1 ms [User: 268.4 ms, System: 106.8 ms] Range (min … max): 297.4 ms … 360.1 ms 10 runs Benchmark 2: puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in Time (mean ± σ): 345.0 ms ± 12.4 ms [User: 293.5 ms, System: 109.7 ms] Range (min … max): 329.9 ms … 372.8 ms 10 runs Summary puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in ran 1.09 ± 0.07 times faster than puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in The improvement here is rather small because the fast path that was added recently was doing a lot of the heavy lifting. In subsequent commits, we will: * Expose the error type so that we don't just return a `String`. * Add back a fast path for parsing just `w.x.y.z`. * Tweak the representation of `Version` to make its `Ord` impl faster. In particular, this is still showing up as a bottleneck from resolution. Fixes #373
This adds a simple fast path to version parsing that just looks for versions that only have a release component with 4 or fewer numbers. This does lead to a small ~1.03x improvement on the benchmark mentioned in a previous commit, but the main reason to do this to set ourselves up for version parsing without any allocation at all. Namely, in the next commit, we're going to split the representation of Version in two.
The high level of this commit is to make `Version::cmp` lower to the equivalent of `u64::cmp` in the vast majority of cases. We achieve this by carving out a large subset (~90%) of all possible versions on PyPI and devising a small fixed representation for them. This is called `VersionSmall`. Everything that doesn't fit into this representation is put into `VersionFull`. The former is, essentially, a `u64`. The latter is what the status quo looked like before this commit. This change overall makes a rather large difference when puffin gets stuck in version resolution. Namely, pubgrub does a lot of version comparisons, and by changing that a simple u64 comparison, we get a huge win: $ time puffin-cmp pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in Resolved 31 packages in 10.90s real 10.923 user 10.406 sys 0.384 maxmem 2398 MB faults 0 $ time puffin-base pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in Resolved 31 packages in 34.56s real 34.579 user 34.004 sys 0.413 maxmem 2867 MB faults 0 Future work on this point might investigate removing some of the extra storage on `VersionSmall` (see code comments). And if we did that, it might make sense to box a `Version`.
This puts a `Version` inside of an `Arc` internally. This "works" pretty transparently because `Version` does not expose any methods for direct in-place mutation. While this doesn't *obviously* speed anything up directly, it does result in a fairly sizeable improvement to the `boto3` requirements list: $ time puffin-cmp pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in Resolved 31 packages in 10.84s real 10.858 user 10.272 sys 0.444 maxmem 2402 MB faults 0 $ time puffin-cmparc pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in Resolved 31 packages in 9.20s real 9.218 user 8.919 sys 0.165 maxmem 463 MB faults 0 My hypothesis for why this leads to an improvement is that this shrinks the size of a `Version` from 104 bytes(!) to 8 bytes. Since it looks like pubgrub builds in-memory collections of `Version` values, this in turn substantially shrinks heap usage of those collections and also likely makes things more cache friendly. It's plausible that an even better approach here would be to only wrap a `VersionFull` in an `Arc`. But to make this most effective, we'd want to truly represent a `VersionSmall` as just a single `u64`. Perhaps even a `NonZeroU64` to create a niche. But to do that, I think we need to: Represent the number of release segments in `VersionSmall::repr`. This should be easy enough to do, I think, by commandeering the two least significant bits of the release component (currently using the 5 most significant bytes of the u64 representation). This shouldn't impact comparisons since the lengths are at the least significant part, and thus, will be equivalent if a comparison needs to consider them. If we borrow the two bits from the first release segment (which is a u16), then it should still be able to represent calver since 2^14=16384. Then `Version` could be defined like this: struct Version(VersionInner); enum VersionInner { Small(u64), Full(Arc<VersionFull>), } Although this does raise the size of `Version` from the 8 bytes in this PR to 16. One might be able to reduce it back to 8 with a niche optimization, but I can't think of any obvious way of doing that in safe code.
9d51b22
to
106af07
Compare
Looking at the profile for tf-models-nightly after #789, `compare_release` is by far the slowest function. Adding a fast path, we avoid paying the cost for padding releases with 0s when they are the same length. ``` $ hyperfine --warmup 1 --runs 3 "target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt" "target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt" Benchmark 1: target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt Time (mean ± σ): 11.963 s ± 0.225 s [User: 11.478 s, System: 0.451 s] Range (min … max): 11.747 s … 12.196 s 3 runs Benchmark 2: target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt Time (mean ± σ): 10.317 s ± 0.720 s [User: 9.885 s, System: 0.404 s] Range (min … max): 9.501 s … 10.860 s 3 runs Summary target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt ran 1.16 ± 0.08 times faster than target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt ```
Looking at the profile for tf-models-nightly after #789, `compare_release` is the single biggest item. Adding a fast path, we avoid paying the cost for padding releases with 0s when they are the same length, resulting in a 16% for this pathological case. Note that this mainly happens because tf-models-nightly is almost all large dev releases that hit the slow path. **Before** ![image](https://github.com/astral-sh/puffin/assets/6826232/0d2b4553-da69-4cdb-966b-0894a6dd5d94) **After** ![image](https://github.com/astral-sh/puffin/assets/6826232/6d484808-9d16-408d-823e-a12d321802a5) ``` $ hyperfine --warmup 1 --runs 3 "target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt" "target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt" Benchmark 1: target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt Time (mean ± σ): 11.963 s ± 0.225 s [User: 11.478 s, System: 0.451 s] Range (min … max): 11.747 s … 12.196 s 3 runs Benchmark 2: target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt Time (mean ± σ): 10.317 s ± 0.720 s [User: 9.885 s, System: 0.404 s] Range (min … max): 9.501 s … 10.860 s 3 runs Summary target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt ran 1.16 ± 0.08 times faster than target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt ```
@@ -639,6 +615,7 @@ impl std::fmt::Debug for Version { | |||
} | |||
|
|||
impl PartialEq<Self> for Version { | |||
#[inline] | |||
fn eq(&self, other: &Self) -> bool { | |||
self.cmp(other) == Ordering::Equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a fast path here that avoids calling into cmp_slow
if both self.inner
and other.inner
point to the same Arc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. cmp_slow
is designed to be called very infrequently (in practice), and I don't know the portion of comparisons that enter cmp_slow
that have pointer equality internally. My prior is that the portion would be rather small, but I could be wrong. Especially for particular benchmarks that might hit cmp_slow
more often than is intended. For example: #799
This PR builds on #780 by making both version parsing faster, and
perhaps more importantly, making version comparisons much faster.
Overall, these changes result in a considerable improvement for the
boto3.in
workload. Here's the status quo:And now with this PR:
This particular workload gets stuck in pubgrub doing resolution, and
thus benefits mightily from a faster
Version::cmp
routine. With thatsaid, this change does also help a fair bit with "normal" runs:
There is perhaps some future work here (detailed in the commit
messages), but I suspect it would be more fruitful to explore ways of
making resolution itself and/or deserialization faster.
Fixes #373, Closes #396