
Lint multiple files in parallel [$500] #3565

Open
ilyavolodin opened this issue Aug 28, 2015 · 111 comments

Comments

@ilyavolodin (Member) commented Aug 28, 2015

This is a discussion issue for adding ability to run eslint in parallel for multiple files.

The idea is that ESLint is mostly CPU bound, not IO bound, so creating multiple threads (on machines with multiple cores) might (and probably would) increase performance in a meaningful way. The downside is that ESLint's codebase is currently synchronous, so this would require rewriting everything up to and including eslint.js to be asynchronous, which would be a major effort.

I played with this a little while ago and found a few libraries for Node that handle thread pool, including detection of number of cores available on the machine.

  • Node-threads-a-gogo - seems pretty good, but looks dead.
  • nPool - seems actively in development, but has native components (C++)
  • Node WebWorkers - seems pretty dead too.
  • Parallel - seems dead, and no pool implementation.
  • Node Clusters - not stable yet, and probably isn't going to be available on Node v0.10
  • WebWorkers - seems that they are only implemented in io.js
And there are a ton of other libraries out there for this.

If anyone has any experience writing multithreaded applications for Node.js and would like to suggest alternatives or comment on the above list, please feel free.

P.S. https://www.airpair.com/javascript/posts/which-async-javascript-libraries-should-i-use

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

@ilyavolodin ilyavolodin self-assigned this Aug 28, 2015
@ilyavolodin ilyavolodin added this to the v2.0.0 milestone Aug 28, 2015
@nzakas nzakas mentioned this issue Aug 28, 2015
@BYK (Member) commented Aug 28, 2015

Nice. I'm very interested in trying this myself too.

@ilyavolodin (Member, Author) commented Aug 28, 2015

Another question I have in mind: if we need to rewrite everything to be async, should we use the callback pattern or promises? If promises, which library: Q or Bluebird? I personally would prefer promises to callback hell.

@IanVS (Member) commented Aug 28, 2015

I vote for promises. Bluebird is fast, but it makes me nervous because it adds methods to the native Promise, which is likely not a great idea. I think Q might be the best bet. There is also Asequence, but I have no personal experience with it.

@gyandeeps (Member) commented Aug 28, 2015

Why not use built in promises? Just a question as I have no experience with promises yet.

@IanVS (Member) commented Aug 28, 2015

They're not supported in node 0.10, to my knowledge.

Besides that, the libraries give some nice "sugar" methods when working with Promises.

@btmills (Member) commented Aug 28, 2015

I've had plenty of success using native promises (or a polyfill when native promises aren't supported). That seems like a good starting point to me; if we need more than they provide we could probably swap out something that's API-compatible.

@nzakas (Member) commented Aug 28, 2015

I think we're putting the cart before the horse here. Let's hold off on promises vs. callbacks until we're at least ready to prototype. Get something working with callbacks and let's see how bad it is (or not).

@lo1tuma (Member) commented Aug 28, 2015

> The idea is that ESLint is mostly CPU bound, not IO bound

ESLint also does a lot of IO (directory traversal, reading source files), so I think we would also benefit here if we rewrote ESLint to do non-blocking IO.

@ilyavolodin (Member, Author) commented Aug 28, 2015

@lo1tuma I haven't profiled it yet, but in my mind the amount of IO we do is negligible compared to the amount of CPU cycles we eat. I will try to profile it and post results here if I get anything meaningful.

@pnstickne commented Aug 29, 2015

Using something like NodeClusters - or most other per-process implementations - would avoid the issue of needing to [vastly] rewrite ESLint. (Such implementations are strictly not threading, but allow managed parallel process execution.)

It would mostly just need to IPC-in/out the current ESLint; ESLint parallelism would then be able to work freely over different files in a per-process manner, but it could not (without more rework) run concurrently over a single file.

Thus if the goal is to run ESLint over different files in parallel I would urge such a 'simple' per-process concurrency approach. If the goal is to make ESLint parallel across the same source/AST then .. that is a more complicated can of worms as it changes the 'divergence' point.

If there is a strict v0.10 Node target for ESLint, maybe offer this as a feature only when running a compatible Node version.

@mysticatea (Member) commented Aug 29, 2015

My idea is:

  • The master process does queue control and collects results.
  • Worker processes have a short queue (about 2-4 entries deep).
    A worker:
    • Sends the result of the last file and requests the path of the next-next-next file from the master.
    • Asynchronously reads the next-next file.
    • Lints the next file.

Source files vary a lot in size, so I think pull-style queue control is important.

Library.... I think child_process.fork is enough.
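The master side of that pull-style queue control could be sketched in-process like this (`createMaster` is a name invented here; the worker side, which would live in a `child_process.fork` child, is stubbed out as a plain function call):

```javascript
"use strict";
// Master-side queue control for pull-style scheduling: each worker reports
// its last result and pulls the next path, so fast workers naturally end up
// taking more files than slow ones.
function createMaster(files) {
  const queue = files.slice();
  const results = [];
  return {
    // Called (via IPC messages in a real implementation) by a worker.
    // `lastResult` is null on the worker's first request.
    next(lastResult) {
      if (lastResult) results.push(lastResult);
      return queue.length > 0 ? queue.shift() : null; // null => worker exits
    },
    results,
  };
}
```

With child_process.fork, `next()` would be driven by `process.on("message")` in the master and `process.send()` in each worker, but the scheduling logic stays the same.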

@gyandeeps (Member) commented Aug 29, 2015

Sorry for this question: based on all the comments, do we think the reward of this functionality is worth the effort/changes/complications? (Like I said, just a question.)
Or is this functionality too early (like Google Glass) to implement without any actual estimates or demand?

@pnstickne commented Aug 29, 2015

@gyandeeps My projects are not big enough, or computer slow enough, for me to really care either way.

In cases where there are sizable projects of many files, on computers with several cores and unconstrained I/O, I imagine that it could lead to significantly reduced wall-clock time, approaching Amdahl's law. I would be less optimistic about this gain with fewer, larger files, even with 'actual threading' or post-AST handling, but that is what performance profiles are for.

Of course another option is to only lint 'changed' files and provide some sort of result cache, but that comes with additional data-management burdens.

@ilyavolodin (Member, Author) commented Aug 29, 2015

@gyandeeps To answer your question: we do not have definitive information on this. Right now my assumption is that we are CPU-bound, not IO-bound. In that case, utilizing more cores should have a significant impact on larger projects (as @pnstickne mentioned, the impact will be small, or there might even be a negative impact, on a few large files).
I think the first step of the process would be to prove or disprove my assumption on CPU vs. IO. Granted, if it turns out I'm wrong and we are IO-bound, then changing ESLint's code to async would improve performance anyway.

@pnstickne Thanks for the insights. I'm really not familiar with Node clusters; I just know that they exist and do not require everything to be async. We are definitely not going to try to make ESLint run AST analysis in parallel; that would be a huge change, and a lot of our rules expect nodes to be reported in the right order, so we would not only need to rewrite pretty much the whole core, we would also need to rewrite a lot of rules.
I would be interested in learning more about how we could incorporate code that would only execute on Node 0.12 and not on earlier versions. I'm not sure that's the approach I would like to take, but it's an interesting option nevertheless.

@ilyavolodin (Member, Author) commented Aug 30, 2015

So I did some very unscientific performance analysis on both Windows and OSX.
I specifically chose to lint a very large number of files (TodoMVC: 1,597 files, 854 folders).
On Windows, results were very inconclusive. Basically, my CPU usage never went over 15% (on an 8-core machine, none of the cores were overtaxed at any point). But my I/O never hit anything over 10 MB/s either, on an SSD that's theoretically capable of 550 MB/s. So I have no idea why it didn't run any faster.
On OSX it never hit more than 15% CPU (I have no idea how many cores my MacBook has, probably 8 too), but I/O was pretty taxed. It looked like there were a few spikes that reached 100% disk throughput. So maybe my assumption was wrong and we are not CPU bound?

@pnstickne commented Aug 30, 2015

@ilyavolodin Try this:

Start running 8 different eslint processes (over the same set of files should be fine, although it would be 'more fair' to break up the files into 8 different equally-sized sets).

Compare the wall-clock times it takes for 1 process to complete and 8 processes to complete.

The 8 processes will have done 8x the work (if using the same set of files for each process as for the single process) or the same amount of work (if having split the source files among them), but in how much x the time?

This very crudely should show an approximate gain - if any - for using multi-process concurrency.

@platinumazure (Member) commented Aug 30, 2015

Late to the conversation, but... is anyone opposed to someone starting a pull request to implement the callback-hell approach (i.e., make ESLint async across the board, including for all I/O operations)? It seems like it would make the eventual parallelization easier anyway.

@IanVS (Member) commented Aug 30, 2015

> Of course another option is to only lint 'changed' files and provide some sort of result cache, but comes with additional data management burdens.
> -@pnstickne

This is essentially what ESLinter from @royriojas does.

@pnstickne commented Aug 30, 2015

@IanVS That's pretty cool... now if only it were built into my eslint grunt task :}

(Okay, I could get it shimmed in pretty easily, but it'd still be nice to see a 'done package'.)

@ilyavolodin (Member, Author) commented Aug 30, 2015

@pnstickne #2998
@platinumazure I think I would like to first prove that there's something to gain by going async/multithreaded, just so nobody wastes their time creating a very large PR that then gets closed because there is no gain.

@platinumazure (Member) commented Aug 30, 2015

@ilyavolodin That's fair enough. I'm wondering, though, would it be worth creating a separate issue with the goal of making ESLint's I/O operations asynchronous? Synchronous I/O is a bit of an anti-pattern in Node, and it might help us figure out just how much of a breaking change parallelization would likely be.

@ilyavolodin (Member, Author) commented Aug 30, 2015

@platinumazure While sync code is a Node anti-pattern, in our case we can't do anything with the code (as in parse it into an AST) until we've read the whole file. So if improving performance is off the table, changing the code to async would increase its complexity without gaining us anything. It's worth testing out, and I still think there's a performance gain we can get from parallel execution, but I would like to get some proof of that first.

@lo1tuma (Member) commented Aug 30, 2015

@ilyavolodin Reading files asynchronously doesn’t mean you read them chunk-by-chunk. While you are waiting for one file to be read, you can lint a different file which has already been read.

@ilyavolodin (Member, Author) commented Sep 20, 2018

My goal is to work on this at the beginning of next year, if nobody gets to it before then. I think this is one of the highest-impact changes we can introduce, and there's been a lot of interest in it (however, it's not easy to implement).

@nzakas (Member) commented Sep 21, 2018

@ljharb (Contributor) commented Sep 21, 2018

esprint requires an additional tool, config, and server to work. jest-eslint-runner requires using Jest and setting up a separate config. The Jest runner (which I’ve helped maintain, since we use it at Airbnb) is constrained by the limitations of Jest itself; for example, we can’t even use it for auto-fixing at this time. In other words, our existing solution isn’t sufficient.

It’s great that these tools exist, but it’s a very large obstacle to have to use a different tool run by a different team that has different maintenance behavior, may run on different Node versions, has different documentation and different priorities, and gains support for core eslint features on a different and delayed timeline.

@ilyavolodin (Member, Author) commented Sep 22, 2018

Also, as mentioned before, external tools can't match the speed improvement that could be achieved in the core, due to the fact that all of the external tools have to resolve all of the configs multiple times, whereas the core can do it once and share the result.

@nzakas (Member) commented Sep 24, 2018

@nzakas (Member) commented Nov 19, 2018

The ESLint team has just created an RFC process for complicated changes that require designs. This issue was marked as "needs design" and so falls into this new RFC process. You can read more about the RFC process here:

https://github.com/eslint/rfcs/

Thanks!

@xcombelle commented May 7, 2019

I'm totally an outsider, but I found this comment strange: #3565 (comment). I don't see any reason why reading and processing the config files only once should add a measurable speed improvement, especially taking into account that the config files would be in the OS cache, so reading them would be a no-op from an I/O point of view. Has anyone verified this claim? I feel like anything fundamentally different from a script that 1) splits the files to process into workloads of the same number of files and 2) runs one instance of eslint on each workload would be uselessly overengineering the problem.

@isiahmeadows commented May 7, 2019

@xcombelle Not on the ESLint team, but I feel qualified to explain despite that. (I've dealt with config loading in a few cases.)

That would mean you're loading all the configs and their plugins, once for every file. Instead of 1-2 syscalls per file, you're now making about a dozen, and that's after you resolve all of the internal require calls. So in reality, you'd easily have hundreds per file instead of a few. This might seem insignificant, but you're blocked by syscalls, hardware I/O, and initialization, all of which is far slower than the actual linting process (which is itself slow due to multiple traversals of a large tree). You might get small gains on a quad-core system, but nothing truly significant without a dedicated module that avoids much of the file system overhead and duplicates much of ESLint's built-in functionality. You also have to write a custom reporter to avoid output getting mangled and crossed up.

Also, I'm not even 100% sure ESLint's CLI program supports parallel execution in separate Node instances. It might conflict with other CLI instances when writing stuff back to the local .eslintcache.

@Gaafar commented Aug 30, 2019

Hey @nzakas, I'd like to pick up this one.

I made a quick-and-dirty PoC here: #12191

The main idea is to split the files into X chunks that can be linted separately on X workers. I used workerpool for the workers, but that will most likely change. The main issue I have now is passing the ConfigArray object, as it breaks when serialized/deserialized. I think it'd make sense to pass it around as a plain array and construct the class right before linting a file, but it might take me some time to refactor that as I'm new to the project.

Please have a look at the PR and let me know what you think of the general idea.

UPDATE: I found a workaround by passing the ConfigArray instance, then rehydrating the non-serializable members in the workers: https://github.com/eslint/eslint/pull/12191/files#diff-660ea0590a55a93f96e9f6979144e554R445
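The chunk-splitting step described above can be sketched like this (`chunkFiles` is a name invented for illustration; the PoC's actual splitting may differ):

```javascript
"use strict";
// Split a file list into at most `workers` chunks of near-equal size,
// one chunk per worker, dealing files round-robin.
function chunkFiles(files, workers) {
  const count = Math.min(workers, files.length);
  const chunks = Array.from({ length: count }, () => []);
  files.forEach((file, i) => chunks[i % count].push(file));
  return chunks;
}
```

Round-robin dealing keeps chunk sizes within one file of each other, though as noted earlier in the thread, file sizes vary a lot, so equal counts do not guarantee equal work per worker.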

@tunnckoCore commented Jan 19, 2020

@isiahmeadows:
> That would mean you're loading all the configs and their plugins, once for every file. Instead of 1-2 syscalls per file, you're now making about a dozen, and that's after you resolve all of the internal require calls. So in reality, you'd easily have hundreds per file instead of a few. This might seem insignificant, but you're blocked by syscalls, hardware I/O, and initialization, all of which is far slower than the actual linting process

Totally. It seems that the complexity is O(n). We have a huge config, and execution through the CLI is insanely slow, even for a single file. When we just switch to eslint-config-airbnb (which is our base), everything is faster.

So there are two problems: first, it happens even when using executeOnText, and second, heavy plugins/presets. But that probably proves there is also a problem in the linting itself.

I'm currently playing with both executeOnText and executeOnFiles called from a worker. It doesn't seem to be faster; the problem is just in the ESLint core.

@btmills (Member) commented May 29, 2020

As a status update for anyone following this issue, ESLint v7 shipped with a new public Node.js API. The previous CLIEngine API was entirely synchronous, which prevented parallel linting. The ESLint public API class has asynchronous methods that will allow us to implement parallel linting once we've settled on the right design. If you're curious about the design of the new API, you can read more in the RFC.

@nzakas (Member) commented Aug 28, 2020

This is the current RFC in development for this feature:
eslint/rfcs#42
