📎 Implement more targeted file scanner #5636

### Description

With Biome 2, we have a file scanner that is responsible for indexing a repository for the purpose of multi-file analysis. But our initial implementation is rather naive and has some unintended implications.

The important thing to understand is that the file scanner needs to index `node_modules`, because it needs to be able to extract type information from dependencies. But it’s not just type information: We also use the file scanner for building our module graph, which we use for module resolution, cycle detection, and other features. For these features to work properly we need to index files that may be ignored in user configurations, such as generated files, `package.json` files, and indeed any dependencies of your code in general.

But there are also files that we know we never need to index: directories such as `.git`, `.nx`, `.cache` and more.

Our current strategy is to index _everything_ except the files/directories on a hardcoded exclusion list. It works for most projects, but obviously it can miss things and end up indexing things it shouldn’t. Most of the time, scanning too much isn’t a big deal, but of course it impacts performance. Sometimes unacceptably so. And even for projects where it works acceptably, it can hardly be called optimal.
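
To make the current strategy concrete, here is a minimal sketch of an "index everything except an exclusion list" walk. It is written in Python purely for illustration and is not Biome's actual implementation. The names `EXCLUDED_DIRS` and `scan` are hypothetical, and the list only contains the directories named in this issue.

```python
import os
from typing import Iterator

# Hypothetical exclusion list: only `.git`, `.nx`, and `.cache` are named in
# this issue; the real hardcoded list is assumed to be longer.
EXCLUDED_DIRS = frozenset({".git", ".nx", ".cache"})


def scan(root: str, excluded: frozenset = EXCLUDED_DIRS) -> Iterator[str]:
    """Yield every file under `root`, never descending into excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning `dirnames` in place stops os.walk from entering those
        # directories at all, so their contents are never visited.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded)
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

With a list like this, anything not explicitly named (for example `dist`, `build`, or the whole of `node_modules`) is still walked, which is where the cost comes from.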

So here is a list of ideas for improvement:
- Currently `dist` and `build` are not on the exclude list. By convention we can be pretty sure they’re unneeded. The only reason I didn’t add them was caution: Maybe they are used for legitimate sources in some projects? We can easily reconsider this.
- I’ve considered swapping the logic around: Instead of using an exclude list, we could start with `files.includes` and just add `node_modules` so we grab the dependencies too. It looks appealing on the surface, but quickly becomes even more complex to navigate: What if someone ignores `package.json` files in their `files.includes`? What about generated files they’ve excluded, but which are still relevant for type extraction? How does `.gitignore` factor into this scenario? And for all that complexity we still don’t solve another major issue: Indexing of irrelevant dependencies in `node_modules`.
- We can give users control of the exclude list in their config files. We’re not fans of this approach as it would not be very user-friendly. It’s confusing and hard to explain why, in addition to `files.includes` (which already supports exclusions), we’d have another mechanism for excluding files, with different behaviour that’s harder to observe.
- We can speed up the processing of `node_modules` by skipping all `.js` files that have a corresponding `.d.ts` file. This is probably something we want to do regardless, so it would make a good subtask of this one.
- We can first scan the files we want to process and extract which dependencies they reference. Then we scan those dependencies, check which transitive dependencies they have, scan those, and so on. This is possible, but comes with a lot of complexity:
  - We still need to scan all `package.json` files (including inside `node_modules`) for the module resolution to work.
  - We need separate strategies for the LSP (which wants to index the entire project) and the CLI (which wants to look only at the files for a given invocation).
  - It is harder to parallelise the work, as you get “choke points” during dependency resolution.
  - File watching becomes harder, because changing a file means its dependencies may change, meaning the scope of what needs to be scanned may change at any time.
- We could take a shortcut on all of this, and just distinguish between `dependencies` and `devDependencies` when it comes to scanning the `node_modules`. Many projects have a lot more dev dependencies than they have real dependencies, so we could restrict ourselves to scanning dependencies only. There would still be some complexity in figuring out transitive dependencies, but it would be a lot easier than handling everything per file. The downside is that scripts that use dev dependencies may have reduced functionality with some lint rules, but that may be an acceptable compromise.
- Other ideas/a combination of the above.
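
The `.d.ts` idea from the list above could look roughly like the following sketch. It is an illustration in Python, not Biome code. The function name `drop_js_shadowed_by_dts` is hypothetical, and it assumes the scanner already has a list of normalised paths for a directory tree.

```python
def drop_js_shadowed_by_dts(paths: list) -> list:
    """Drop every `.js` file that has a sibling `.d.ts` file with the same stem."""
    present = set(paths)
    kept = []
    for path in paths:
        if path.endswith(".js") and path[: -len(".js")] + ".d.ts" in present:
            # Type information comes from the declaration file instead.
            continue
        kept.append(path)
    return kept
```

Only the `.js`/`.d.ts` pair is handled here; other pairs such as `.mjs`/`.d.mts` are left out for brevity. Whether skipping the `.js` file is safe for every feature that uses the module graph is an open question this sketch does not answer.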
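
The `dependencies` versus `devDependencies` shortcut from the list above can be sketched as a reachability walk over parsed `package.json` manifests. This is a simplified illustration in Python under loud assumptions: `packages_to_scan` and the `installed` mapping are hypothetical, each package name is assumed to resolve to exactly one installed manifest (no nested `node_modules` or multiple versions), and `peerDependencies` and `optionalDependencies` are ignored.

```python
from collections import deque


def packages_to_scan(root_manifest: dict, installed: dict) -> set:
    """Return names of installed packages reachable from the root's `dependencies`.

    `installed` maps a package name to its parsed `package.json` (an assumption
    of this sketch; real resolution has to look at the directory layout).
    """
    queue = deque(root_manifest.get("dependencies", {}))
    reachable = set()
    while queue:
        name = queue.popleft()
        if name in reachable or name not in installed:
            continue
        reachable.add(name)
        # Follow only `dependencies` of each package; `devDependencies` of the
        # root and of every dependency are deliberately left out.
        queue.extend(installed[name].get("dependencies", {}))
    return reachable
```

For a root manifest that lists one runtime dependency and one dev dependency, only the runtime dependency and its transitive `dependencies` end up in the result, so the dev tooling subtree would never be scanned. As the list above notes, all `package.json` files may still need to be read for module resolution to work.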

Feedback and ideas are welcome on this thread, but please check which suggestions have already been made before posting. Giving thumbs up/down to suggestions gives us a better overview of which ideas are best to pursue.
