
Lint multiple files in parallel [$500] #3565

Open
ilyavolodin opened this issue Aug 28, 2015 · 142 comments
Assignees
Labels
accepted bounty cli feature needs bikeshedding needs design

Comments

@ilyavolodin
Member

@ilyavolodin ilyavolodin commented Aug 28, 2015

This is a discussion issue for adding ability to run eslint in parallel for multiple files.

The idea is that ESLint is mostly CPU-bound, not IO-bound, so creating multiple threads (on machines with multiple cores) might (and probably will) increase performance in a meaningful way. The downside is that ESLint's codebase is currently synchronous, so this would require rewriting everything up to and including eslint.js to be asynchronous, which would be a major effort.

I played with this a little while ago and found a few libraries for Node that handle thread pool, including detection of number of cores available on the machine.

  • Node-threads-a-gogo - seems pretty good, but looks dead.
  • nPool - seems actively in development, but has native components (C++)
  • Node WebWorkers - seems pretty dead too.
  • Parallel - seems dead, and no pool implementation.
  • Node Clusters - not stable yet, and probably isn't going to be available on Node v0.10
  • WebWorkers - seems that they are only implemented in io.js
And there are a ton of other libraries out there for this.

If anyone had any experience writing multithreaded applications for node.js and would like to suggest alternatives or comment on the above list, please feel free.

P.S. https://www.airpair.com/javascript/posts/which-async-javascript-libraries-should-i-use

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

@ilyavolodin ilyavolodin added the needs bikeshedding label Aug 28, 2015
@ilyavolodin ilyavolodin self-assigned this Aug 28, 2015
@ilyavolodin ilyavolodin added this to the v2.0.0 milestone Aug 28, 2015
@nzakas nzakas mentioned this issue Aug 28, 2015
@BYK
Member

@BYK BYK commented Aug 28, 2015

Nice. I'm very interested in trying this myself too.

@ilyavolodin
Member Author

@ilyavolodin ilyavolodin commented Aug 28, 2015

Another question I had in mind: if we need to rewrite everything to be async, should we use the callback pattern? Promises? If so, which library: Q or Bluebird? I personally would prefer promises to callback hell.

@IanVS
Member

@IanVS IanVS commented Aug 28, 2015

I vote for promises. Bluebird is fast, but makes me nervous because it adds methods to the native Promise, which is likely not a great idea. I think Q might be the best bet. There is also Asequence, but I have no personal experience with it.

@gyandeeps
Member

@gyandeeps gyandeeps commented Aug 28, 2015

Why not use built in promises? Just a question as I have no experience with promises yet.

@IanVS
Member

@IanVS IanVS commented Aug 28, 2015

They're not supported in node 0.10, to my knowledge.

Besides that, the libraries give some nice "sugar" methods when working with Promises.

@btmills
Member

@btmills btmills commented Aug 28, 2015

I've had plenty of success using native promises (or a polyfill when native promises aren't supported). That seems like a good starting point to me; if we need more than they provide we could probably swap out something that's API-compatible.

@nzakas
Member

@nzakas nzakas commented Aug 28, 2015

I think we're putting the cart before the horse here. Let's hold off on promises vs. callbacks until we're at least ready to prototype. Get something working with callbacks and let's see how bad it is (or not).

@nzakas nzakas added cli accepted feature labels Aug 28, 2015
@lo1tuma
Member

@lo1tuma lo1tuma commented Aug 28, 2015

The idea is that ESLint is mostly CPU bound, not IO bound

ESLint also does a lot of IO (directory traversal, reading source files), so I think we would also profit here if we rewrote ESLint to do non-blocking IO.

@ilyavolodin
Member Author

@ilyavolodin ilyavolodin commented Aug 28, 2015

@lo1tuma I haven't profiled it yet, but in my mind, the amount of IO we do is negligible compared to the amount of CPU cycles we eat. I will try to profile it and post results here if I get anything meaningful.

@pnstickne

@pnstickne pnstickne commented Aug 29, 2015

Using something like NodeClusters - or most other per-process implementations - would avoid the issue of needing to [vastly] rewrite ESLint. (Such implementations are strictly not threading, but allow managed parallel process execution.)

It would mostly just need to IPC-in/out the current ESLint; ESLint parallelism would then be able to work freely over different files in a per-process manner, but it could not (without more rework) run concurrently over a single file.

Thus if the goal is to run ESLint over different files in parallel, I would urge such a 'simple' per-process concurrency approach. If the goal is to make ESLint parallel across the same source/AST, then that is a more complicated can of worms, as it changes the 'divergence' point.

If there is a strict Node v0.10 target for ESLint, maybe offer this as a feature only when running a compatible Node version.

@mysticatea
Member

@mysticatea mysticatea commented Aug 29, 2015

My idea is:

  • The master process does queue control and collects results.
  • Worker processes keep a short queue (length about 2-4). A worker:
    • Sends the master the result of the last file and requests the path of the file three ahead.
    • Asynchronously reads the file two ahead.
    • Lints the next file.

Source files vary widely in size, so I think pull-style queue control is important.

Library.... I think child_process.fork is enough.

@gyandeeps
Member

@gyandeeps gyandeeps commented Aug 29, 2015

Sorry for this question: based on all the comments, do we think the reward of this functionality is worth the effort/changes/complications? (Like I said, just a question.)
Or is this functionality too early (like Google Glass) to implement without any actual estimates or asks?

@pnstickne

@pnstickne pnstickne commented Aug 29, 2015

@gyandeeps My projects are not big enough, nor my computer slow enough, for me to really care either way.

In cases where there are sizable projects with many files, on computers with several cores and unsaturated I/O, I imagine it could lead to significantly reduced wall-clock time, approaching Amdahl's law. I would be less optimistic about this gain with fewer, larger files, even with 'actual threading' or post-AST handling - but that is what performance profiles are for.

Of course, another option is to only lint 'changed' files and provide some sort of result cache, but that comes with additional data-management burdens.

@ilyavolodin
Member Author

@ilyavolodin ilyavolodin commented Aug 29, 2015

@gyandeeps To answer your question - we do not have definitive information on this. Right now my assumption is that we are CPU-bound, not IO-bound. In that case, utilizing more cores should have a significant impact on larger projects (as @pnstickne mentioned, the impact will be small, or there might even be a negative impact, on a few large files).
I think the first step of the process would be to prove or disprove my assumption on CPU vs IO. Granted, if it turns out I'm wrong and we are IO-bound, then changing ESLint code to async would improve performance anyway.

@pnstickne Thanks for the insights. I'm really not familiar with NodeClusters; I just know that they exist and do not require everything to be async. We are definitely not going to try to make ESLint run AST analysis in parallel: that would be a huge change, and a lot of our rules expect nodes to be reported in the right order, so we would need to rewrite not only pretty much the whole core but also a lot of rules.
I would be interested in learning more about how we could incorporate code that only executes on Node 0.12 and not on earlier versions. I'm not sure that's the approach I would like to take, but it's an interesting option nevertheless.

@ilyavolodin
Member Author

@ilyavolodin ilyavolodin commented Aug 30, 2015

So I did some very unscientific performance analysis on both Windows and OSX.
I specifically chose to lint a very large number of files (TodoMVC, 1597 files, 854 folders).
On Windows, results were very inconclusive. Basically, my CPU usage never went over 15% (on an 8-core machine, none of the cores were overtaxed at any point). But my I/O never hit anything over 10MB/s on an SSD that's theoretically capable of 550MB/s, so I have no idea why it didn't run any faster.
On OSX it never hit more than 15% CPU (I have no idea how many cores my MacBook has, probably 8 too), but I/O was pretty taxed. It looked like there were a few spikes reaching 100% disk throughput. So maybe my assumption was wrong and we are not CPU-bound?

@pnstickne

@pnstickne pnstickne commented Aug 30, 2015

@ilyavolodin Try this:

Start running 8 different eslint processes (over the same set of files should be fine, although it would be 'more fair' to break up the files into 8 different equally-sized sets).

Compare the wall-clock times it takes for 1 process to complete and 8 processes to complete.

The 8 processes will have done 8x the work (if using the same set of files for each process as for the single process) or the same amount of work (if having split the source files among them), but in how much x the time?

This very crudely should show an approximate gain - if any - for using multi-process concurrency.
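That experiment is easy to script. Here is a crude shell sketch with `sleep 1` standing in for a single eslint invocation (substitute the real `eslint <files>` command): if eight concurrent runs finish in roughly the wall-clock time of one, the workload parallelizes well.

```shell
# One "lint run"; `sleep 1` stands in for an actual `eslint <files>` invocation.
time sleep 1

# Eight processes in parallel: near-linear speedup shows up as a wall-clock
# time close to a single run, not eight times longer.
time ( for i in 1 2 3 4 5 6 7 8; do sleep 1 & done; wait )
```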

@platinumazure
Member

@platinumazure platinumazure commented Aug 30, 2015

Late to the conversation, but... Is anyone opposed to someone starting up a pull request to implement the callback hell approach (i.e., make ESLint async across the board, including for all I/O operations)? Seems like it would make the eventual parallelization easier anyway.

@IanVS
Member

@IanVS IanVS commented Aug 30, 2015

Of course another option is to only lint 'changed' files and provide some sort of result cache, but comes with additional data management burdens.
-@pnstickne

This is essentially what ESLinter from @royriojas does.

@pnstickne

@pnstickne pnstickne commented Aug 30, 2015

@IanVS That's pretty cool .. now if only it was built-in to my eslint grunt task :}

(Okay, I could get it shimmed in pretty easy - but it'd still be nice to see a 'done package'.)

@ilyavolodin
Member Author

@ilyavolodin ilyavolodin commented Aug 30, 2015

@pnstickne #2998
@platinumazure I think I would like to first prove that there's something to gain by going async/multithreaded, just so nobody wastes their time creating a very large PR that then gets closed because there is no gain.

@platinumazure
Member

@platinumazure platinumazure commented Aug 30, 2015

@ilyavolodin That's fair enough. I'm wondering, though, would it be worth creating a separate issue with the goal of making ESLint's I/O operations asynchronous? Synchronous I/O is a bit of an anti-pattern in Node, and it might help us figure out just how much of a breaking change parallelization would likely be.

@ilyavolodin
Member Author

@ilyavolodin ilyavolodin commented Aug 30, 2015

@platinumazure While sync code is a Node anti-pattern, in our case we can't do anything with the code (as in, parse it into an AST) until we read the whole file. So if improving performance is off the table, then changing the code to async would increase its complexity without gaining us anything. It's worth testing out, and I still think there's a performance gain we can get from parallel execution, but I would like to get some proof of that first.

@lo1tuma
Member

@lo1tuma lo1tuma commented Aug 30, 2015

@ilyavolodin reading files asynchronously doesn’t mean you read them chunk-by-chunk. While you are waiting for one file to be read, you can lint a different file which has been read already.

@leoswing

@leoswing leoswing commented Jul 17, 2021

@leoswing maybe worker_threads is worth a try, send the lint task to worker threads and then notify back to master when the task is resolved, however whether the notify-back action is required or not, just depends on if the service is required to respond the results to its consumer

Thanks~ appreciated for your quick reply, I've just tried with jest-worker and it runs well

@jaysonwu991

@jaysonwu991 jaysonwu991 commented Oct 11, 2021

I've just tried with jest-worker and it runs well

@leoswing How did you use jest-worker with eslint, if you just write a script to execute eslint command? Thanks

@jaysonwu991

@jaysonwu991 jaysonwu991 commented Oct 11, 2021

  1. The CLIEngine class is deprecated (replaced by the ESLint class) and considered for removal - https://eslint.org/blog/2020/02/whats-coming-in-eslint-7.0.0
  2. CLIEngine has been removed - https://eslint.org/docs/user-guide/migrating-to-8.0.0

With ESLint v8, does anyone have an idea how to implement parallel linting? Thanks

@nzakas
Member

@nzakas nzakas commented Oct 12, 2021

Parallel linting is still on the roadmap. We just have some other higher priority work going on right now.

@leoswing

@leoswing leoswing commented Oct 12, 2021

I've just tried with jest-worker and it runs well

@leoswing How did you use jest-worker with eslint, if you just write a script to execute eslint command? Thanks

I've created a repo which supports ESLint with v7+, and it tests well with fast performance. I will try to update the codebase from my workspace to github these days. @jaysonwu991

@jaysonwu991

@jaysonwu991 jaysonwu991 commented Oct 12, 2021

I've created a repo which supports ESLint with v7+, and it tests well with fast performance. I will try to update the codebase from my workspace to github these days.

@leoswing Thanks for the guidance, you have done a great job.

@gajus
Contributor

@gajus gajus commented Dec 19, 2021

If anyone is interested in working on this in the near future, Contra will add USD 1000 to the bounty (expires end of January)

paulbrimicombe added a commit to paulbrimicombe/eslint that referenced this issue Jan 10, 2022
added a concurrency option that sets the number of threads to be used when linting files.

fixes eslint#3565
@nzakas
Member

@nzakas nzakas commented Jan 12, 2022

For clarity: we are redoing some fundamental pieces of ESLint right now, and as such, it doesn't make sense to work on parallel linting until those pieces are finalized. (Primarily the new config system).

@borisyordanov

@borisyordanov borisyordanov commented Jan 26, 2022

@paulbrimicombe @nzakas Does this mean this feature is delayed until the next major version is out?

@paulbrimicombe

@paulbrimicombe paulbrimicombe commented Jan 27, 2022

@paulbrimicombe @nzakas Does this mean this feature is delayed until the next major version is out?

@borisyordanov unfortunately I'm not a maintainer on this project so I can't give you any details about the release schedule — I just raised a PR that addressed this issue. It looks like this work is blocked on approval of the RFC plus some in-flight refactor work that @nzakas mentioned above. In its current state, the RFC requires breaking changes (which I'm not convinced is at all necessary — certainly my PR doesn't have any breaking changes in it). Please comment on the RFC if you'd like to.

@nzakas
Member

@nzakas nzakas commented Jan 28, 2022

@borisyordanov There is no timeline for parallel linting right now. We have a bunch of other core work that must be completed first. When that core work is done, we can evaluate where we are in terms of parallel linting.

@jinlong

@jinlong jinlong commented Mar 22, 2022

I've just tried with jest-worker and it runs well

@leoswing How did you use jest-worker with eslint, if you just write a script to execute eslint command? Thanks

I've created a repo which supports ESLint with v7+, and it tests well with fast performance. I will try to update the codebase from my workspace to github these days. @jaysonwu991

Hi @leoswing, may I learn from your code example? I have the same problem that linting multiple files is very slow. Thanks a lot.

@gajus
Contributor

@gajus gajus commented Mar 22, 2022

FYI, you can achieve this using https://github.com/jest-community/jest-runner-eslint

Our ESLint time has gone down from 13 minutes to 2 minutes
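A jest-runner-eslint setup like the one described is just a dedicated Jest project entry. This is an illustrative sketch (paths and the project name are mine; see that project's README for the authoritative options):

```javascript
// jest.config.js (illustrative): a dedicated "lint" project whose runner
// is jest-runner-eslint, so Jest's worker pool parallelizes linting.
module.exports = {
  projects: [
    {
      displayName: "lint",
      runner: "jest-runner-eslint",
      testMatch: ["<rootDir>/src/**/*.js"],
    },
  ],
};
```

Running `jest --selectProjects lint` then lints the matched files across Jest's workers.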

@paulbrimicombe

@paulbrimicombe paulbrimicombe commented Mar 22, 2022

When I was testing my parallel eslint implementation (now sadly binned because of the ongoing config refactor) I found that enabling caching made a huge difference (see https://eslint.org/docs/user-guide/command-line-interface for details on the CLI flags).

@gajus
Contributor

@gajus gajus commented Mar 22, 2022

You cannot / should not use cache in production CI. Otherwise, yeah, it helps a lot.

I would still encourage adopting Jest runner if you are at a point where this is a bottleneck. Chances are, you have other bottlenecks that can be helped with this (like tsc).

@FlorianWendelborn

@FlorianWendelborn FlorianWendelborn commented Mar 22, 2022

@gajus can you elaborate on why you think you shouldn’t use cache in CI?

@gajus
Contributor

@gajus gajus commented Mar 22, 2022

I would need to go back to reading the ESLint documentation, but I am pretty sure that their cache does not use a hash to identify whether files have changed, but rather metadata (presumably the last-modified date). I would not trust this information to carry over through git accurately.

@FlorianWendelborn

@FlorianWendelborn FlorianWendelborn commented Mar 22, 2022

@gajus you can configure that:

--cache-strategy String Strategy to use for detecting changed files in the cache - either: metadata or content - default: metadata
https://eslint.org/docs/user-guide/command-line-interface

@gajus
Contributor

@gajus gajus commented Mar 22, 2022

Great. Use it if that fits your bill. In our case, parallelization was preferred. We keep CI jobs stateless.

@bmish
Sponsor Member

@bmish bmish commented Mar 23, 2022

Regarding caching: I implemented ESLint caching on CI for a large repository at one point but scrapped it because it can result in lint violations getting through to master/main in rare situations:

It is technically possible that a change in one file can cause a linting violation in another unchanged file. For example, eslint-plugin-import does things like validate that imported modules are actually exported from the file they’re being imported from. Unfortunately, I believe this defeats any attempt at avoiding a full eslint run.

Note that aside from this edge case, caching on CI can work successfully most of the time, as long as you use:

  1. eslint --cache --cache-strategy content (the default metadata strategy is useless after a fresh clone, since git does not preserve file mtimes)
  2. A cache file (for --cache-location) with a name that includes a hash of anything that can affect your linting:
    • The list of eslint-related dependencies installed and their versions (#12578 (comment))
    • The implementation of any internal/local lint rules in a monorepo (which can change without the version of the internal eslint package changing)

For the record, here's the imperfect eslint CI caching script I put together. I don't really recommend anyone tries to use it given the limitations but am including it for context.

imperfect eslint CI caching script
#!/usr/bin/env bash

EXIT_STATUS=0

if [[ "$ENVIRONMENT" == "CI" ]]; then
  # Use kochiku's shared cache to cache the results of running ESLint between builds.
  # Depending on the number of files changed, caching can reduce lint running time from 10+ minutes to just seconds.

  # Generate a hash based on:
  # * The installed eslint package versions
  # * The code in eslint-plugin-dashboard (since this package is not versioned)
  # We use this to invalidate the cache if any lint packages are updated since
  # eslint caching itself does not take this into consideration: https://github.com/eslint/eslint/issues/12578
  HASH_ESLINT_PACKAGES="$(yarn list --pattern *eslint* --depth=0 | grep eslint | md5sum | awk '{print $1}')"
  HASH_ESLINT_PLUGIN_DASHBOARD="$(find ../packages/eslint-plugin-dashboard/lib -type f -name "*.js" -exec md5sum {} + | awk '{print $1}' | sort | md5sum | awk '{print $1}')"
  HASH="$HASH_ESLINT_PACKAGES-$HASH_ESLINT_PLUGIN_DASHBOARD"
  echo "ESLINT CACHING: Generated hash of eslint packages: $HASH"

  PATH_CACHE_ESLINT_DIR="/mnt/nfs/shared-cache/dashboard/eslint/"
  PATH_CACHE_ESLINT_FILE="$PATH_CACHE_ESLINT_DIR/$HASH"
  PATH_CACHE_ESLINT_TEMP_FILE="$(mktemp -u)"
  if [ "$GIT_BRANCH" == "master" ]; then
    # Every master build performs a full eslint run and caches the result for non-master builds to use.
    # Note: we use `--cache-strategy content` because the default caching strategy of mtime is useless to us (git does not maintain file mtime, so mtime will always be the time when the current build cloned the repository).
    yarn eslint --cache --cache-strategy content --cache-location "$PATH_CACHE_ESLINT_TEMP_FILE" . $@ || EXIT_STATUS=1
    rm -rf $PATH_CACHE_ESLINT_DIR # Delete any old cache files.
    mkdir $PATH_CACHE_ESLINT_DIR
    cp $PATH_CACHE_ESLINT_TEMP_FILE $PATH_CACHE_ESLINT_FILE
    echo "ESLINT CACHING: Generated new cache."
  else
    # Non-master builds use the cache generated by the last master build.
    # Note: a temporary file is used to avoid overwriting the master cache.
    if [ -f "$PATH_CACHE_ESLINT_FILE" ]; then
      echo "ESLINT CACHING: Found cache to use."
      cp $PATH_CACHE_ESLINT_FILE $PATH_CACHE_ESLINT_TEMP_FILE
    else
      echo "ESLINT CACHING: No cache found."
    fi
    # Note: we use `--cache-strategy content` because the default caching strategy of mtime is useless to us (git does not maintain file mtime, so mtime will always be the time when the current build cloned the repository).
    yarn eslint --cache --cache-strategy content --cache-location "$PATH_CACHE_ESLINT_TEMP_FILE" . $@ || EXIT_STATUS=1
  fi
  rm $PATH_CACHE_ESLINT_TEMP_FILE
else
  yarn eslint --cache . $@ || EXIT_STATUS=1
fi

exit $EXIT_STATUS
