📎 Implement more targeted file scanner #5636

@arendjr

Description

With Biome 2, we have a file scanner that is responsible for indexing a repository for the purpose of multi-file analysis. But our initial implementation is rather naive and has some unintended implications.

The important thing to understand is that the file scanner needs to index node_modules, because it needs to be able to extract type information from dependencies. But it’s not just type information: We also use the file scanner for building our module graph, which we use for module resolution, cycle detection, and other features. For these features to work properly we need to index files that may be ignored in user configurations, such as generated files, package.json files, and indeed any dependencies of your code in general.

But there are also files that we know we never need to index: directories such as .git, .nx, .cache and more.

Our current strategy is to index everything except the files/directories on a hardcoded exclusion list. It works for most projects, but obviously it can miss things and end up indexing things it shouldn’t. Most of the time, scanning too much isn’t a big deal, but of course it impacts performance. Sometimes unacceptably so. And even for projects where it works acceptably, it can hardly be called optimal.
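The exclude-list strategy described above can be sketched roughly as follows. This is a minimal illustration in Python, not Biome's actual implementation: the `scan` function and the `EXCLUDED_DIRS` constant are hypothetical names, and the list only contains the directories mentioned in this description.

```python
from __future__ import annotations

import os
from typing import Iterator

# Hypothetical exclusion list: only the directory names mentioned in this
# issue. The real hardcoded list may contain more entries.
EXCLUDED_DIRS = frozenset({".git", ".nx", ".cache"})


def scan(root: str, excluded: frozenset[str] = EXCLUDED_DIRS) -> Iterator[str]:
    """Yield every file under `root`, pruning excluded directories by name."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Mutating `dirnames` in place stops os.walk from descending into
        # the pruned directories at all, so their contents are never read.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded)
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Note that anything not named on the list, such as node_modules or dist, is still walked in full, which is exactly the behaviour this issue wants to improve on.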

So here is a list of ideas for improvement:

  • Currently dist and build are not on the exclude list. By convention we can be pretty sure they’re unneeded; the only reason I didn’t add them was caution: Maybe they are used for legitimate sources in some projects? We can easily reconsider this.
  • I’ve considered swapping the logic around: Instead of using an exclude list, we could start with files.includes and just add node_modules so we grab the dependencies too. It looks appealing on the surface, but quickly becomes even more complex to navigate: What if someone ignores package.json files in their files.includes? What about generated files they’ve excluded, but which are still relevant for type extraction? How does .gitignore factor into this scenario? And for all that complexity we still don’t solve another major issue: Indexing of irrelevant dependencies in node_modules.
  • We can give users control of the exclude list in their config files. We’re not fans of this approach, as it would not be very user-friendly. It’s confusing and hard to explain why, in addition to files.includes (which already supports exclusions), we’d have another mechanism for excluding files, with different behaviour that’s harder to observe.
  • We can speed up the processing of node_modules by skipping all .js files that have a corresponding .d.ts file. This is probably something we want to do regardless, so it would make a good subtask of this one.
  • We can first scan the files we want to process, extract which dependencies they reference, and scan those dependencies, check which transitive dependencies those have, then scan those and so on. This is possible, but comes with a lot of complexity:
    • We still need to scan all package.json files (including inside node_modules) for the module resolution to work.
    • We need separate strategies for the LSP (which wants to index the entire project) and the CLI (which wants to look only at the files for a given invocation).
    • It is harder to parallelise the work, as you get “choke points” during dependency resolution.
    • File watching becomes harder, because changing a file means its dependencies may change, meaning the scope of what needs to be scanned may change at any time.
  • We could take a shortcut on all of this, and just distinguish between dependencies and devDependencies when it comes to scanning the node_modules. Many projects have a lot more dev dependencies than they have real dependencies, so we could restrict ourselves to scanning dependencies only. There would still be some complexity in figuring out transitive dependencies, but it would be a lot easier than handling everything per file. The downside is that scripts that use dev dependencies may have reduced functionality with some lint rules, but that may be an acceptable compromise.
  • Other ideas/a combination of the above.
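Two of the ideas above lend themselves to a small sketch: skipping .js files that have a sibling .d.ts file, and restricting the node_modules scan to packages reachable through dependencies rather than devDependencies. The Python below is only an illustration under simplifying assumptions. The function names `drop_shadowed_js` and `production_closure` are hypothetical, the package names in the usage example are made up, and packages are looked up by name in a flat mapping, which ignores nested node_modules directories and multiple installed versions of the same package.

```python
from __future__ import annotations

from collections import deque


def drop_shadowed_js(files: list[str]) -> list[str]:
    """Drop `foo.js` whenever `foo.d.ts` appears in the same listing."""
    present = set(files)
    return [
        f for f in files
        if not (f.endswith(".js") and f[: -len(".js")] + ".d.ts" in present)
    ]


def production_closure(root_manifest: dict, manifests: dict[str, dict]) -> set[str]:
    """Return package names reachable via "dependencies" only.

    `root_manifest` is the parsed package.json of the project itself and
    `manifests` maps a package name to its parsed package.json (an assumed,
    simplified stand-in for real module resolution).
    """
    seen: set[str] = set()
    queue = deque(root_manifest.get("dependencies", {}))
    while queue:
        name = queue.popleft()
        if name in seen:
            continue  # already handled, also guards against cycles
        seen.add(name)
        # Packages missing from `manifests` simply contribute no edges.
        queue.extend(manifests.get(name, {}).get("dependencies", {}))
    return seen
```

A usage example with hypothetical package names:

```python
root = {
    "dependencies": {"app-lib": "*"},
    "devDependencies": {"test-runner": "*"},
}
manifests = {
    "app-lib": {"dependencies": {"lib-core": "*"}},
    "lib-core": {},
    "test-runner": {"dependencies": {"assert-kit": "*"}},
    "assert-kit": {},
}
to_scan = production_closure(root, manifests)  # app-lib and lib-core only
```

Even in this simplified form, every package.json still has to be read to build the mapping, which matches the caveat above that package.json files need to be scanned regardless.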

Feedback and ideas are welcome on this thread, but please check which suggestions have already been made. Giving thumbs up/down to suggestions gives us a better overview of which ideas are best to pursue.

Metadata

Assignees

No one assigned

    Labels

    A-Core (Area: core), A-Project (Area: project), S-Enhancement (Status: Improve an existing feature), S-Help-wanted (Status: you're familiar with the code base and want to help the project)
