
pep440: rewrite the parser and make version comparisons cheaper #789

Merged
merged 8 commits into from
Jan 5, 2024

Conversation

BurntSushi
Member

@BurntSushi BurntSushi commented Jan 4, 2024

This PR builds on #780 by making version parsing faster and, perhaps
more importantly, version comparisons much faster. Overall, these
changes result in a considerable improvement on the boto3.in workload.
Here's the status quo:

$ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
Resolved 31 packages in 34.56s

real    34.579
user    34.004
sys     0.413
maxmem  2867 MB
faults  0

And now with this PR:

$ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
Resolved 31 packages in 9.20s

real    9.218
user    8.919
sys     0.165
maxmem  463 MB
faults  0

This particular workload gets stuck in pubgrub doing resolution, and
thus benefits mightily from a faster Version::cmp routine. With that
said, this change does also help a fair bit with "normal" runs:

$ hyperfine -w10 \
    "puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in" \
    "puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in"
Benchmark 1: puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
  Time (mean ± σ):     337.5 ms ±   3.9 ms    [User: 310.5 ms, System: 73.2 ms]
  Range (min … max):   333.6 ms … 343.4 ms    10 runs

Benchmark 2: puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
  Time (mean ± σ):     189.8 ms ±   3.0 ms    [User: 168.1 ms, System: 78.4 ms]
  Range (min … max):   185.0 ms … 196.2 ms    15 runs

Summary
  puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in ran
    1.78 ± 0.03 times faster than puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in

There is perhaps some future work here (detailed in the commit
messages), but I suspect it would be more fruitful to explore ways of
making resolution itself and/or deserialization faster.

Fixes #373, Closes #396

Member

@charliermarsh charliermarsh left a comment


Can't wait to read this, I feel like I'm gonna learn a lot.

@charliermarsh charliermarsh self-requested a review January 4, 2024 19:58
@@ -241,7 +242,7 @@ impl std::fmt::Display for OperatorParseError {
/// ```
#[derive(Clone)]
pub struct Version {
inner: VersionInner,
inner: Arc<VersionInner>,
Member


I will ask a dumb question: why Arc and not Box?

Member


Oh, is it because Arc allows for cheap cloning...?

Member

@konstin konstin Jan 5, 2024


Do we want the Arc here or in PubGrubVersion? That is, would it have any disadvantages to only use it in PubGrubVersion so Version itself remains the "real" type / does the copy on write make a noticeable impact?

Member Author


Yes, Arc makes cloning essentially free.

I think Version is fine with an Arc inside of it, and it's hard for me to conceive of a way that copy-on-write would hurt things. I would expect that the vast majority of the time, when Arc::make_mut is called, an extra copy is not made. (Since it's done as part of building a new Version.) And even then, in the cases where Arc::make_mut does make a copy, the caller would have had to make a copy anyway if Arc weren't used.

I think there are really only two downsides to using Arc here. Firstly, it introduces an alloc in the happy path of building/parsing a version. Secondly, it requires using either alloc or std. I don't think the first makes too much of a difference, and I think the latter doesn't matter for this crate (unless you want to take it in the direction of core-only, which would be quite tricky).
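To make the "cloning is essentially free" point concrete, here is a minimal illustration (not the crate's actual code; `Inner` is a stand-in for the private `VersionInner`): cloning an `Arc`-wrapped value copies one pointer and bumps a reference count, leaving the heap payload shared.

```rust
use std::sync::Arc;

// A stand-in for the crate's private `VersionInner`; the fields here
// are illustrative, not the real layout.
#[derive(Clone)]
struct Inner {
    release: Vec<u64>,
}

#[derive(Clone)]
struct Version {
    inner: Arc<Inner>,
}

// Cloning a `Version` copies one pointer and increments a refcount;
// the payload is shared, not duplicated.
fn clone_is_shallow() -> (usize, bool) {
    let v1 = Version {
        inner: Arc::new(Inner { release: vec![1, 2, 3] }),
    };
    let v2 = v1.clone();
    (Arc::strong_count(&v1.inner), Arc::ptr_eq(&v1.inner, &v2.inner))
}
```

After the clone, both handles point at the same allocation and the strong count is 2, which is exactly why pubgrub-style workloads that clone versions constantly benefit.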

Member Author

@BurntSushi BurntSushi Jan 5, 2024


does the copy on write make a noticeable impact?

It might help to clarify here: Arc::make_mut isn't quite just copy-on-write. It only makes a copy when the reference count is greater than 1. But if it's 1, and I suspect it usually will be in this context, then a mutable borrow with no copying can be handed out safely because exclusivity has been proven.

Typically copy-on-write implies something closer to "whenever one needs to write, you always copy first." And indeed, if that were the case here, it might make Arc-usage a little more suspicious.
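The distinction can be demonstrated directly with the standard library (a sketch, not the crate's code): `Arc::make_mut` hands out `&mut` in place when the refcount is 1, and clones the payload only when the value is actually shared.

```rust
use std::sync::Arc;

// `Arc::make_mut` only clones the payload when the value is shared;
// with a refcount of 1 it proves exclusivity and mutates in place.
fn make_mut_behavior() -> (bool, bool) {
    let mut unique = Arc::new(vec![1, 2, 3]);
    let before = Arc::as_ptr(&unique);
    Arc::make_mut(&mut unique).push(4); // count == 1: no copy made
    let no_copy = Arc::as_ptr(&unique) == before;

    let mut shared = Arc::new(vec![1, 2, 3]);
    let other = Arc::clone(&shared);
    Arc::make_mut(&mut shared).push(4); // count == 2: payload cloned
    let copied = !Arc::ptr_eq(&shared, &other);

    (no_copy, copied)
}
```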

Member

@charliermarsh charliermarsh left a comment


This is really quite an incredible PR, and you did a great job of breaking down some very sophisticated optimizations and decision-making. I loved reading it, and the results are remarkable!

Member

@konstin konstin left a comment


woah

crates/pep440-rs/src/version.rs
This starts to untangle the dual use of Version for both acting as a
literal version and a pattern for matching versions. We still retain
reuse in the implementation, but are more explicit about the difference
in the public API. This also makes it a little smoother to build a
`VersionSpecifier` instead of carrying an extra boolean around.
The `Option<Vec<LocalSegment>>` representation creates redundant case
analysis. That is, there doesn't appear to be a case where it can both
be `Some` and have zero segments. Instead, we can just use the case of
zero segments to mean "no local component."
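A sketch of that representation change (with a simplified stand-in for `LocalSegment`): collapsing `Option<Vec<_>>` into a plain `Vec<_>` removes the unrepresentable `Some(vec![])` state, so there is one case analysis instead of two.

```rust
// Simplified stand-in for the crate's `LocalSegment`.
#[derive(Clone, Debug, PartialEq, Eq)]
enum LocalSegment {
    String(String),
    Number(u64),
}

struct Version {
    // An empty Vec now means "no local component"; previously this was
    // `Option<Vec<LocalSegment>>`, where `None` and `Some(vec![])`
    // could never both occur but both had to be handled.
    local: Vec<LocalSegment>,
}

impl Version {
    fn has_local(&self) -> bool {
        !self.local.is_empty()
    }
}
```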
This rewrites the parser to avoid regexes. We do things by hand directly
on a `&[u8]`, and also use more structured errors internally (although
we don't expose those yet). This is good for about a 1.1x improvement:

    $ hyperfine -w10 \
        'puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in' \
        'puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in'
    Benchmark 1: puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in
      Time (mean ± σ):     317.3 ms ±  18.1 ms    [User: 268.4 ms, System: 106.8 ms]
      Range (min … max):   297.4 ms … 360.1 ms    10 runs

    Benchmark 2: puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in
      Time (mean ± σ):     345.0 ms ±  12.4 ms    [User: 293.5 ms, System: 109.7 ms]
      Range (min … max):   329.9 ms … 372.8 ms    10 runs

    Summary
      puffin pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in ran
        1.09 ± 0.07 times faster than puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null /m/astral/tmp/requirements.in

The improvement here is rather small because the fast path that was
added recently was doing a lot of the heavy lifting.
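The flavor of parsing "by hand directly on a `&[u8]`" can be sketched with a toy helper (illustrative only, not the crate's actual parser): consume a run of ASCII digits without any regex machinery and return the number plus the unconsumed tail.

```rust
// Toy illustration of regex-free parsing on `&[u8]`: read a run of
// ASCII digits, returning the parsed number and the remaining bytes.
fn parse_number(bytes: &[u8]) -> Option<(u64, &[u8])> {
    let end = bytes
        .iter()
        .position(|b| !b.is_ascii_digit())
        .unwrap_or(bytes.len());
    if end == 0 {
        return None; // no leading digits at all
    }
    let mut n: u64 = 0;
    for &b in &bytes[..end] {
        // checked arithmetic guards against overflow on huge inputs
        n = n.checked_mul(10)?.checked_add(u64::from(b - b'0'))?;
    }
    Some((n, &bytes[end..]))
}
```

Structured errors would replace the `Option` here; the point is only that each component is recognized with direct byte inspection.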

In subsequent commits, we will:

* Expose the error type so that we don't just return a `String`.
* Add back a fast path for parsing just `w.x.y.z`.
* Tweak the representation of `Version` to make its `Ord` impl faster.
  In particular, this is still showing up as a bottleneck from
  resolution.

Fixes #373
This adds a simple fast path to version parsing that just looks for
versions that only have a release component with 4 or fewer numbers.

This does lead to a small ~1.03x improvement on the benchmark mentioned
in a previous commit, but the main reason to do this is to set ourselves
up for version parsing without any allocation at all. Namely, in the
next commit, we're going to split the representation of Version in two.
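A hedged sketch of such a fast path (not the crate's exact code): accept only `w[.x[.y[.z]]]` where every segment is ASCII digits, and bail out (so the caller falls back to the general parser) for anything else.

```rust
// Fast-path sketch: parse only release-component-only versions with at
// most four dotted numeric segments; return None for everything else
// so the general parser can take over.
fn parse_fast(s: &str) -> Option<[u64; 4]> {
    let mut release = [0u64; 4];
    for (i, part) in s.split('.').enumerate() {
        if i >= 4 || part.is_empty() || !part.bytes().all(|b| b.is_ascii_digit()) {
            return None; // not a plain w.x.y.z version
        }
        release[i] = part.parse().ok()?;
    }
    Some(release)
}
```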
The high-level goal of this commit is to make `Version::cmp` lower to the
equivalent of `u64::cmp` in the vast majority of cases.

We achieve this by carving out a large subset (~90%) of all possible
versions on PyPI and devising a small fixed representation for them.
This is called `VersionSmall`. Everything that doesn't fit into this
representation is put into `VersionFull`. The former is, essentially, a
`u64`. The latter is what the status quo looked like before this commit.

This change overall makes a rather large difference when puffin gets
stuck in version resolution. Namely, pubgrub does a lot of version
comparisons, and by changing each of those to a simple u64 comparison,
we get a huge win:

    $ time puffin-cmp pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
    Resolved 31 packages in 10.90s

    real    10.923
    user    10.406
    sys     0.384
    maxmem  2398 MB
    faults  0

    $ time puffin-base pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
    Resolved 31 packages in 34.56s

    real    34.579
    user    34.004
    sys     0.413
    maxmem  2867 MB
    faults  0

Future work on this point might investigate removing some of the extra
storage on `VersionSmall` (see code comments). And if we did that, it
might make sense to box a `Version`.
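The shape of the split representation can be sketched like this (field names and layout are simplified from the PR's description): `VersionSmall` packs the common case into a `u64` whose integer order matches version order, so comparing two small versions is one integer comparison.

```rust
use std::cmp::Ordering;
use std::sync::Arc;

// Simplified stand-in for the general heap-allocated representation.
struct VersionFull {
    release: Vec<u64>,
    // ...epoch, pre/post/dev, local segments, and so on.
}

enum VersionInner {
    // ~90% of PyPI versions: a u64 laid out so that integer order
    // equals PEP 440 version order.
    Small { repr: u64 },
    // Everything else keeps the general form.
    Full { full: Arc<VersionFull> },
}

fn cmp(a: &VersionInner, b: &VersionInner) -> Ordering {
    match (a, b) {
        // The overwhelmingly common case lowers to u64::cmp.
        (VersionInner::Small { repr: x }, VersionInner::Small { repr: y }) => x.cmp(y),
        // Mixed or full-full comparisons take a slow path (elided here).
        _ => unimplemented!("cmp_slow"),
    }
}
```

The hard part, which this sketch elides, is choosing the byte layout of `repr` so that the ordering invariant actually holds for epochs, pre-releases, and the rest.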
This puts a `Version` inside of an `Arc` internally. This "works" pretty
transparently because `Version` does not expose any methods for direct
in-place mutation.

While this doesn't *obviously* speed anything up directly, it does
result in a fairly sizeable improvement to the `boto3` requirements
list:

    $ time puffin-cmp pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
    Resolved 31 packages in 10.84s

    real    10.858
    user    10.272
    sys     0.444
    maxmem  2402 MB
    faults  0

    $ time puffin-cmparc pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
    Resolved 31 packages in 9.20s

    real    9.218
    user    8.919
    sys     0.165
    maxmem  463 MB
    faults  0

My hypothesis for why this leads to an improvement is that this shrinks
the size of a `Version` from 104 bytes(!) to 8 bytes. Since it looks
like pubgrub builds in-memory collections of `Version` values, this in
turn substantially shrinks heap usage of those collections and also
likely makes things more cache friendly.
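The size claim is easy to check with `std::mem::size_of` (a sketch with a stand-in payload; the real `VersionInner` fields differ, but the 104-byte figure is mirrored here): behind an `Arc`, the handle type is a single pointer.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Stand-in payload sized to match the 104 bytes quoted for the
// pre-Arc representation (13 * 8 = 104).
struct BigInner {
    _bytes: [u64; 13],
}

// The handle is one non-null pointer: 8 bytes on 64-bit targets,
// regardless of how large the payload behind it is.
struct Version {
    _inner: Arc<BigInner>,
}
```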

It's plausible that an even better approach here would be to only wrap a
`VersionFull` in an `Arc`. But to make this most effective, we'd want to
truly represent a `VersionSmall` as just a single `u64`. Perhaps even a
`NonZeroU64` to create a niche. But to do that, I think we need to:

Represent the number of release segments in `VersionSmall::repr`. This
should be easy enough to do, I think, by commandeering the two least
significant bits of the release component (currently using the 5 most
significant bytes of the u64 representation). This shouldn't impact
comparisons since the lengths are at the least significant part, and
thus, will be equivalent if a comparison needs to consider them. If we
borrow the two bits from the first release segment (which is a u16),
then it should still be able to represent calver since 2^14=16384.
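A hypothetical sketch of that bit-borrowing step (the field widths here are illustrative, chosen to match the reasoning above): store the segment count in the two low bits of the 16-bit field, leaving 14 bits for the segment value itself, so ordinary integer comparison still orders by segment first.

```rust
// Hypothetical packing: 14 bits of segment value (enough for calver,
// since 2^14 = 16384) plus a 2-bit segment count in the low bits.
fn pack(segment: u16, count: u8) -> u16 {
    debug_assert!(segment < (1 << 14) && count < 4);
    (segment << 2) | u16::from(count)
}

fn unpack(packed: u16) -> (u16, u8) {
    (packed >> 2, (packed & 0b11) as u8)
}
```

Because the count occupies the least significant bits, two packed values with different segments compare by segment; the count only breaks ties.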

Then `Version` could be defined like this:

    struct Version(VersionInner);

    enum VersionInner {
        Small(u64),
        Full(Arc<VersionFull>),
    }

Although this does raise the size of `Version` from the 8 bytes in this
PR to 16. One might be able to reduce it back to 8 with a niche
optimization, but I can't think of any obvious way of doing that in safe
code.
@BurntSushi BurntSushi merged commit 6c98ae9 into main Jan 5, 2024
3 checks passed
@BurntSushi BurntSushi deleted the ag/version-opt branch January 5, 2024 16:57
konstin added a commit that referenced this pull request Jan 5, 2024
Looking at the profile for tf-models-nightly after #789, `compare_release` is by far the slowest function. By adding a fast path, we avoid paying the cost of padding releases with 0s when they are the same length.

```
$ hyperfine --warmup 1 --runs 3 "target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt"
 "target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt"
Benchmark 1: target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt
  Time (mean ± σ):     11.963 s ±  0.225 s    [User: 11.478 s, System: 0.451 s]
  Range (min … max):   11.747 s … 12.196 s    3 runs

Benchmark 2: target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt
  Time (mean ± σ):     10.317 s ±  0.720 s    [User: 9.885 s, System: 0.404 s]
  Range (min … max):    9.501 s … 10.860 s    3 runs

Summary
  target/profiling/puffin pip-compile -q scripts/requirements/tf-models-nightly.txt ran
    1.16 ± 0.08 times faster than target/profiling/main pip-compile -q scripts/requirements/tf-models-nightly.txt
```
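The fast path described above can be sketched as follows (assuming release segments are slices of `u64`; not the crate's exact code): equal-length releases compare lexicographically with no padding, and only mismatched lengths treat missing trailing segments as zero.

```rust
use std::cmp::Ordering;

// Fast path: same-length releases need no zero-padding; plain slice
// comparison (lexicographic over u64) is the PEP 440 release order.
fn compare_release(a: &[u64], b: &[u64]) -> Ordering {
    if a.len() == b.len() {
        return a.cmp(b);
    }
    // Slow path: pad the shorter release with trailing zeros.
    let n = a.len().max(b.len());
    for i in 0..n {
        let x = a.get(i).copied().unwrap_or(0);
        let y = b.get(i).copied().unwrap_or(0);
        match x.cmp(&y) {
            Ordering::Equal => continue,
            ord => return ord,
        }
    }
    Ordering::Equal
}
```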
charliermarsh pushed a commit that referenced this pull request Jan 5, 2024
Looking at the profile for tf-models-nightly after #789,
`compare_release` is the single biggest item. By adding a fast path, we
avoid paying the cost of padding releases with 0s when they are the
same length, resulting in a 16% improvement for this pathological case.
Note that this mainly happens because tf-models-nightly is almost all
large dev releases that hit the slow path.

**Before**


![image](https://github.com/astral-sh/puffin/assets/6826232/0d2b4553-da69-4cdb-966b-0894a6dd5d94)

**After**


![image](https://github.com/astral-sh/puffin/assets/6826232/6d484808-9d16-408d-823e-a12d321802a5)

@@ -639,6 +615,7 @@ impl std::fmt::Debug for Version {
}

impl PartialEq<Self> for Version {
#[inline]
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == Ordering::Equal
Member


Should we add a fast path here that avoids calling into cmp_slow if both self.inner and other.inner point to the same Arc?

Member Author

@BurntSushi BurntSushi Jan 9, 2024


Not sure. cmp_slow is designed to be called very infrequently (in practice), and I don't know the portion of comparisons that enter cmp_slow that have pointer equality internally. My prior is that the portion would be rather small, but I could be wrong. Especially for particular benchmarks that might hit cmp_slow more often than is intended. For example: #799
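For reference, the suggested shortcut would look roughly like this (illustrative stand-in types, not the crate's code): two handles to the same allocation are equal by definition, so the structural comparison can be skipped.

```rust
use std::sync::Arc;

// Illustrative stand-ins; the real comparison would dispatch to
// Version's Ord impl rather than comparing a single field.
struct Inner {
    release: Vec<u64>,
}

struct Version {
    inner: Arc<Inner>,
}

// Pointer equality proves value equality for shared allocations; only
// distinct allocations fall through to the structural comparison.
fn eq_fast(a: &Version, b: &Version) -> bool {
    Arc::ptr_eq(&a.inner, &b.inner) || a.inner.release == b.inner.release
}
```

As the reply notes, whether this pays off depends on how often comparisons that reach the slow path actually share an allocation.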
