Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --jobs (parallel git fetch) #539

Merged
merged 14 commits into from
Mar 22, 2022

Conversation

m-o-e
Copy link
Contributor

@m-o-e m-o-e commented Jan 6, 2022

Resolves #371
Resolves #239

This PR parallelizes git fetching which gives me a ~40% speed-up
for shards install on average.

$ git clone https://github.com/amberframework/amber.git
$ cd amber

$ time shards --jobs=1
18.579s total

$ time shards --jobs=8
10.143s total
  • Only the fetching is parallelized, post-install still runs in serial.
    I've experimented with parallelizing that too but the speed gains
    were negligible and I'm not sure if it's always safe.

  • It's only implemented for git so far (no hg)

  • No test-coverage, yet!
    Although parallelism is enabled during test runs, the existing specs don't seem to actually exercise it.

src/config.cr Outdated Show resolved Hide resolved
src/resolvers/git.cr Outdated Show resolved Hide resolved
@straight-shoota
Copy link
Member

It would be great to have specs for parallel fetch.

m-o-e and others added 2 commits January 9, 2022 23:48
Co-authored-by: Johannes Müller <straightshoota@gmail.com>
@m-o-e
Copy link
Contributor Author

m-o-e commented Jan 9, 2022

It would be great to have specs for parallel fetch.

Agree.

Now that it's enabled in spec it should get plenty coverage.
Do you think that is sufficient or should there be an isolated test?
(not sure how to create a meaningful one tbh, ideas welcome 🤔)

@m-o-e
Copy link
Contributor Author

m-o-e commented Feb 6, 2022

After using this PR-version for a while I realized it would only parallelize
when a shard.lock exists, but not for the initial run after cloning a repo
without that file.

This is now fixed.

(had missed it in my previous testing because I'd muck
with shard.yml, delete lib, but never the shard.lock...)

src/molinillo_solver.cr Outdated Show resolved Hide resolved
docs/shards.adoc Outdated
@@ -74,6 +74,7 @@ doesn't generate a conflict, thus generating a new _shard.lock_ file.
--production:: same as _--frozen_ and _--without-development_
--skip-postinstall:: Does not run postinstall of dependencies.
--skip-executables:: Does not install executables.
--parallel-fetch:: Number of git fetches to perform in parallel (default: 8).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this works for every resolver (implementing update_local_cache), not just git?
If so, I'd rename it to parallel-jobs or sth like that.

Copy link
Contributor Author

@m-o-e m-o-e Feb 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought fetch would make the most sense for all resolvers where this applies
since it's about the data-fetching and nothing else - each VCS might call that step
something different but ultimately it's a "fetch".

It's only implemented for git atm, that's why I mentioned that one in the description.
(would replace "git" with "VCS" later, when other impls are added)

We could change the arg-name but to me parallel-jobs feels less to the point than parallel-fetch? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only implemented for git atm, that's why I mentioned that one in the description.

I don't follow. You're calling Resolver#update_local_cache for every resolver regardless, so how come you state it's implemented only for git?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heck... you are right.

I got confused because my first iteration was git-specific,
I totally forgot that it now indeed applies to all of them. 🙈

Ouch, that means this PR probably breaks hg.
The hg-side will need a change similar to this.

Sigh. This makes things more complicated.

I can port the change from git.cr to hg.cr - but I haven't wrapped my head
around the spec-suite enough to see how to make a test to validate it.

It looks like the current integration test only uses git URLs (no hg_url() in there).
That may be the reason why it didn't catch the breakage - or maybe
it just doesn't exercise the parallelism at all.

(I thought it does because it uses multiple deps in some places -
but the fact that it didn't break for hg... confuses me 🤔)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. The best I could offer for now would be to actually exclude hg from
parallelism (like I thought it already was), by doing something like:

          begin
-           dep.resolver.update_local_cache
+           dep.resolver.update_local_cache if dep.resolver.is_a? GitResolver
            ch.send(nil)
            active -= 1
          rescue ex : Exception
            ch.send(ex)
          end
        end

But I feel like if the spec's didn't catch the hg issue then they
might also not sufficiently cover the git-side. So even though it works
splendidly for me under real usage, spec-coverage should probably
be addressed before merging this.

If someone more familiar with the spec-suite and/or hg
has an idea on how to best do that, please don't hesitate. 😬

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, but that will need test-coverage first.
For git I have some confidence that it works fine since I've been using it daily for a while, but
I've never encountered a hg-shard in the wild, so don't even have an URL to test that side manually.

Not sure when I'll have time to continue on this, so excluding Hg for now.

Copy link
Member

@straight-shoota straight-shoota Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's totally fine. Concurrent fetches are just an optimization without functional differences. There's no need to have this implemented for all resolvers immediately.
Wording should be more generic, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving this convo as I believe all points are addressed.

  • Parallelism is restricted to only apply to GitResolver (because hg not implemented, yet)
  • Flag has been renamed to --jobs to keep it generic (for consistency with bundler
    and for when we add hg in the future)
  • Help-text still mentions git explicitly - but I believe that makes sense as to not
    confuse users into thinking it would apply to all resolvers.

Please re-open or reply on main thread if any of these need further discussion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep the wording more general, but mention that it only works for git currently. For example:

--jobs:: Number of repository downloads to perform in parallel (default: 8). Currently only functions for git resolver.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, updated in 670cfb7

Dropped "function" and "resolver" to make it shorter.

@carlhoerberg
Copy link

carlhoerberg commented Feb 23, 2022

I purpose that you rename the argument to -j/--jobs, like make, bundler and #545 uses.

@m-o-e
Copy link
Contributor Author

m-o-e commented Feb 25, 2022

Good idea @carlhoerberg!

Renamed in bac393b

Fwiw, I left the help-screen as Number of git fetches to perform in parallel (default: 8)
to keep it clear what it actually does. If anyone would prefer a different text just let me know. :)

I'm torn on whether -j should also be added, left it out here for now.
(I guess for consistency with make/bundler we could add it - but otoh I don't think
anyone would blindly rely on -j to be present, and it may stand out a bit weirdly
as most of the other flags don't have a short version)

begin
dep.resolver.update_local_cache if dep.resolver.is_a? GitResolver
ch.send(nil)
active -= 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that this is not safe from races. Probably OK if shards is not compiled with preview_mt, but I wonder if we shouldn't use a Mutex here to be on the safe side. @bcardiff WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. Thanks for catching this!
Maybe Atomic can help here (assuming it is MT-safe?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes! Atomic is what we need here. Do you want to do the change? Let me know otherwise and we'll take care.

Copy link
Member

@straight-shoota straight-shoota Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine. Shards is currently not intended to be built with multithreading. And I wouldn't even see a good reason to even go for that.
With any luck we should eventually get a tool in stdlib for abstracing such simple conccurrency workflows (cf. crystal-lang/crystal#6468). There should really be no need to implement this from scratch every time you want to run a couple tasks concurrently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the change seems easy enough, so I've added it here.
Better safe than sorry.

Copy link
Contributor Author

@m-o-e m-o-e Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in fact, another fix right here.
I think this latter one could actually have struck even without MT.

@m-o-e m-o-e changed the title Add --parallel-fetch Add --jobs (parallel git fetch) Mar 21, 2022
rescue ex : Exception
ch.send(ex)
ensure
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@straight-shoota straight-shoota added this to the v0.17.0 milestone Mar 22, 2022
@straight-shoota straight-shoota merged commit 82c30a3 into crystal-lang:master Mar 22, 2022
@m-o-e m-o-e deleted the add-parallel-fetch branch March 22, 2022 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-blocking dependencies fetching Run git fetches and post installs in parallel
5 participants