Description
With Biome 2, we have a file scanner that is responsible for indexing a repository for the purpose of multi-file analysis. But our initial implementation is rather naive and has some unintended implications.
The important thing to understand is that the file scanner needs to index node_modules, because it needs to be able to extract type information from dependencies. But it’s not just type information: We also use the file scanner for building our module graph, which we use for module resolution, cycle detection, and other features. For these features to work properly we need to index files that may be ignored in user configurations, such as generated files, package.json files, and indeed any dependencies of your code in general.
But there are also files that we know we never need to index: directories such as .git, .nx, .cache and more.
Our current strategy is to index everything except the files/directories on a hardcoded exclusion list. It works for most projects, but obviously it can miss things and end up indexing things it shouldn’t. Most of the time, scanning too much isn’t a big deal, but of course it impacts performance. Sometimes unacceptably so. And even for projects where it works acceptably, it can hardly be called optimal.
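To make the current behaviour concrete, here is a minimal sketch of an exclude-list scan. It is not Biome's actual implementation: the function name `scan` and the contents of `EXCLUDED_DIRS` are assumptions for illustration, using only the directories named above.

```python
import os
import tempfile

# Hypothetical exclusion list; the real hardcoded list may differ.
EXCLUDED_DIRS = {".git", ".nx", ".cache"}


def scan(root, excluded=EXCLUDED_DIRS):
    """Return the relative paths of all files under `root`,
    skipping any directory whose name is on the exclusion list."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in excluded]
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


# Usage: build a throwaway project tree and scan it.
with tempfile.TemporaryDirectory() as root:
    for rel in (
        ".git/config",
        "node_modules/pkg/package.json",
        "src/index.ts",
        "dist/out.js",
    ):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    result = scan(root)

print(result)
```

Note that `dist/out.js` is still picked up here: anything not explicitly listed gets indexed, which is exactly the weakness described above.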
So here is a list of ideas for improvement:
- Currently `dist` and `build` are not on the exclude list. By convention we can be pretty sure they’re unneeded; the only reason I didn’t add them was caution: Maybe they are used for legitimate sources in some projects? We can easily reconsider this.
- I’ve considered swapping the logic around: Instead of using an exclude list, we could start with `files.includes` and just add `node_modules` so we grab the dependencies too. It looks appealing on the surface, but quickly becomes even more complex to navigate: What if someone ignores `package.json` files in their `files.includes`? What about generated files they’ve excluded, but which are still relevant for type extraction? How does `.gitignore` factor into this scenario? And for all that complexity we still don’t solve another major issue: Indexing of irrelevant dependencies in `node_modules`.
- We can give users control of the exclude list in their config files. We’re not a fan of this approach as it would not be very user-friendly. It’s confusing and hard to explain why, in addition to `files.includes` (which already supports exclusions), we’d have another mechanism for excluding files with different behaviour that’s harder to observe.
- We can speed up the processing of `node_modules` by skipping all `.js` files that have a corresponding `.d.ts` file. This is probably something we want to do regardless, so it would make a good subtask of this one.
- We can first scan the files we want to process, extract which dependencies they reference, and scan those dependencies, check which transitive dependencies those have, then scan those and so on. This is possible, but comes with a lot of complexity:
  - We still need to scan all `package.json` files (including inside `node_modules`) for the module resolution to work.
  - We need separate strategies for the LSP (which wants to index the entire project) and the CLI (which wants to look only at the files for a given invocation).
  - It is harder to parallelise the work, as you get “choke points” during dependency resolution.
  - File watching becomes harder, because changing a file means its dependencies may change, meaning the scope of what needs to be scanned may change at any time.
- We could take a shortcut on all of this, and just distinguish between `dependencies` and `devDependencies` when it comes to scanning the `node_modules`. Many projects have a lot more dev dependencies than they have real dependencies, so we could restrict ourselves to scanning dependencies only. There would still be some complexity in figuring out transitive dependencies, but it would be a lot easier than handling everything per file. The downside is that scripts that use dev dependencies may have reduced functionality with some lint rules, but that may be an acceptable compromise.
- Other ideas/a combination of the above.
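Two of the ideas above are concrete enough to sketch: skipping `.js` files that have a sibling `.d.ts`, and restricting the scan to the transitive closure of `dependencies`. The following is only an illustration under simplifying assumptions, not a proposed implementation. The function names, the `manifests` mapping (standing in for `package.json` files read from `node_modules`), and the package names are all made up, and the sketch ignores nested `node_modules`, version resolution, and peer/optional dependencies.

```python
from collections import deque


def runtime_closure(root_manifest, manifests):
    """Return the names of packages reachable through `dependencies` only.

    root_manifest: parsed package.json of the project (a dict).
    manifests: hypothetical mapping of package name -> parsed package.json.
    """
    seen = set()
    queue = deque(root_manifest.get("dependencies", {}))
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        # Follow only the runtime dependencies of each dependency;
        # devDependencies are ignored at every level.
        queue.extend(manifests.get(name, {}).get("dependencies", {}))
    return seen


def skip_shadowed_js(files):
    """Drop `.js` files that have a `.d.ts` file with the same stem."""
    present = set(files)
    return [
        f for f in files
        if not (f.endswith(".js") and f[:-3] + ".d.ts" in present)
    ]


# Usage with made-up packages.
root = {
    "dependencies": {"lib-a": "^1.0.0"},
    "devDependencies": {"test-runner": "^2.0.0"},
}
manifests = {
    "lib-a": {"dependencies": {"lib-b": "^1.0.0"}},
    "lib-b": {},
    "test-runner": {"dependencies": {"lib-c": "^3.0.0"}},
}
packages = runtime_closure(root, manifests)
files = skip_shadowed_js(
    ["lib-a/index.js", "lib-a/index.d.ts", "lib-b/index.js"]
)

print(sorted(packages))
print(files)
```

In this example `test-runner` and its dependency `lib-c` are never visited, which is where the savings would come from, and also where lint rules on scripts that import dev dependencies would lose information.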
Feedback and ideas are welcome on this thread, but please check which suggestions have already been made first. Giving a thumbs up/down to suggestions gives us a better overview of which ideas are best to pursue.