[Question]: parallelism within refurb? #279

Open
jamesbraza opened this issue Aug 14, 2023 · 7 comments
Labels
enhancement New feature or request mypy An upstream issue with Mypy

Comments

@jamesbraza

jamesbraza commented Aug 14, 2023

Does refurb have any support for parallelism (e.g. multiprocessing)?

I am getting this tool adopted across the company I work at, and for larger repos it can take 10+ minutes.

jamesbraza added the enhancement (New feature or request) label Aug 14, 2023
@dosisod
Owner

dosisod commented Aug 14, 2023

Thank you for bringing this up! Refurb is built on top of Mypy, and as such, all caching and processing is done by Mypy. For reasons I have yet to find out, Refurb (and thus Mypy) is not reusing the cache on subsequent builds. Your question is about speeding up both the initial and subsequent builds using multiprocessing, which is a separate speedup that will also need looking into.

This is a known issue and is long overdue for a fix, so I'll go ahead and open an issue on Mypy today to get a conversation started and then go from there!

For my curiosity, how long does it take to run mypy on the repos that are taking 10+ minutes (after removing the .mypy_cache folder)? I strongly suspect it's the Mypy caching/speed issue, but it could be something unrelated that I haven't encountered yet.

dosisod added the mypy (An upstream issue with Mypy) label Aug 14, 2023
@jamesbraza
Author

Thanks for responding, and yeah it would be good to support parallelism. Thanks for being willing to implement this.

For timings running mypy repo-wide, using mypy 1.4.1 (compiled: yes):

  • With pre-made cache: 4 seconds
  • Without pre-made cache: 22 seconds
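
For anyone who wants to reproduce a cold-versus-warm comparison like this, here is a minimal sketch of a timing helper. The helper names (`time_command`, `cold_and_warm`) are hypothetical, and it assumes the cache lives in a directory such as `.mypy_cache` that can be safely deleted before the first run.

```python
import shutil
import subprocess
import sys
import time
from pathlib import Path


def time_command(cmd: list[str]) -> float:
    """Run cmd to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    # check=False: type checkers exit non-zero when they report errors,
    # and that should not abort the measurement.
    subprocess.run(cmd, check=False, capture_output=True)
    return time.perf_counter() - start


def cold_and_warm(cmd: list[str], cache_dir: Path) -> tuple[float, float]:
    """Time cmd once with the cache removed, then again immediately after."""
    shutil.rmtree(cache_dir, ignore_errors=True)  # force a cold start
    cold = time_command(cmd)
    warm = time_command(cmd)  # reuses whatever cache the first run wrote
    return cold, warm


# Assumed usage, run from the repository root:
#   cold, warm = cold_and_warm([sys.executable, "-m", "mypy", "."], Path(".mypy_cache"))
#   print(f"cold: {cold:.1f}s, warm: {warm:.1f}s")
```

A single pair of runs is noisy, so repeating the measurement a few times gives a more trustworthy picture.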

@dosisod
Owner

dosisod commented Aug 15, 2023

Hi @jamesbraza, I just released a new version of Refurb to address some of the speed issues. How long does Refurb take with the new version (v1.20.0)?

If Refurb is still running slow, take a look at the timing stats using the new --timing-stats flag (see the docs for more info). This will output a JSON file with some info that might explain why Refurb is running so slow. Post the results here (or in a pastebin if it's too big) so that I can take a look at it. Note that this file will contain module names from your codebase (no source code, only file names), so if that is sensitive you might want to redact it first.

Thanks!

@jamesbraza
Author

jamesbraza commented Aug 15, 2023

Thanks for doing a performance improvement! I was a little bored of making a massive unit test, so I went ahead and just did this right now.

Running refurb==1.20.0 for three consecutive invocations, I found it took 12, 11, and 11 seconds on one of the bigger modules within our monorepo. Here's the timing-stats.json.zip from a run with no .mypy_cache previously present.

@dosisod
Owner

dosisod commented Aug 15, 2023

Thank you for this data! Here are some stats from what I can see:

  • Total runtime: 11-12 seconds
    • Time spent in Mypy: 8.3 seconds
    • Time spent in Refurb: 0.1 seconds
    • Unknown: 3-4 seconds (I probably need to add more timing info in different places to figure out where we're losing time)

Most of the runtime is spent in Mypy, loading, parsing and type checking all the files and dependencies. There are a few ways we can mitigate this:

  1. Make Mypy load things faster
  2. Tell Mypy not to load certain packages

Number 1 is better overall, but might be harder. Number 2 might be easier to do, but could lead to important type info being lost, reducing Refurb's ability to check certain types. I'll keep looking into this.

Also, does it still take 10+ minutes to run Refurb on the whole repository? I don't need the timing stats for it, I'm just curious how much of an impact my speedup change made.

@jamesbraza
Author

jamesbraza commented Aug 16, 2023

Thank you for the breakdown! That helps me understand things.

I am slowly figuring this out on my end too, now running refurb on the full repo (with macOS Monterey version 12.6):

  • Inside a Docker container (with a Docker compose volume mount for .mypy_cache) takes really long (10+ mins)
  • Running locally outside of Docker it's much faster (12 seconds)

I came across https://mypyc.readthedocs.io/en/latest/performance_tips_and_tricks.html, and blocklisting modules/packages from mypy as suggested is sort of listed there too.

Does mypy support parallelism? E.g. running one checker process per core, instead of just one checker process. I don't see anything about it in the mypy docs right now, and from python/mypy#933 it seems to still be outstanding.

@dosisod
Owner

dosisod commented Aug 18, 2023

Thank you for the response! It's unfortunate that Docker runs so slow on macOS, but it's good to know that it's Docker, not Refurb, that's taking 10 minutes. Note to self, I should probably ask what environment people are running Refurb in when it comes to speed/performance issues.

Like I said before, blacklisting certain modules means that you won't be able to get type info from them. For certain checks this isn't an issue, but for checks that require type info, choosing which modules to include/exclude might be hard. This is something that I should look into nonetheless.

And from what I can tell, Mypy does not support parallelism. I've taken a look at the module loading/parsing code, but there's a lot to take in, so I don't know how hard it would be to parallelize this process. The issue you mentioned is almost 8 years old, which probably means that this is either really hard/time consuming, or has been on the back burner for a while (or a mix of both). I've been using Mypy for years and it's always been a bit slow, so it would be super cool (and super fast) if it were to have support for parallelism!
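
Until something like that exists, one coarse workaround is to split a repository into groups of paths and launch a separate checker process per group. The sketch below is only an illustration of that idea, not a Refurb or Mypy feature: `run_in_parallel` and `run_one` are hypothetical helpers, and the approach assumes the command accepts paths as trailing arguments. Each process still loads and type checks its own dependencies, and concurrent runs may contend for the same cache directory, so this could easily end up no faster than a single run.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(command: list[str], paths: list[str]) -> subprocess.CompletedProcess:
    """Run the checker command on one group of paths and capture its output."""
    return subprocess.run(
        [*command, *paths], capture_output=True, text=True, check=False
    )


def run_in_parallel(
    command: list[str], path_groups: list[list[str]], max_workers: int = 4
) -> list[subprocess.CompletedProcess]:
    """Run one child process per path group, at most max_workers at a time.

    Threads are enough here because the real work happens in the child
    processes; results come back in the same order as path_groups.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda group: run_one(command, group), path_groups))


# Assumed usage (the package paths are placeholders):
#   results = run_in_parallel(["refurb"], [["pkg_a/"], ["pkg_b/", "pkg_c/"]])
#   for result in results:
#       print(result.stdout, end="")
```

Grouping by top-level package keeps related modules in the same process, which should limit how much dependency loading gets repeated across workers.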
