red-knot: `VfsFile` input ingredient and a `Vfs` #11802

MichaReiser · 2024-06-08T15:22:33Z

This PR adds the foundation for red-knot's Salsa integration. It starts by defining a virtual file system that is Salsa's view of the files on disk (metadata only for now; I'll add support for content in the next PR).

VFS

The Vfs supports operating on files from the local filesystem (disk, editor, WASM, in memory) or vendored stub files (this PR only adds a stub implementation). It is virtual because it supports multiple sources and it is Salsa's view of the file system state.

The most notable change in terms of what we discussed for vendored files is that this PR exposes two methods on the Vfs:

file: To "intern" a file system path.
vendored: To intern a vendored path.

The motivation behind this is that the methods differ in their return type (and argument, but that's less important). For file, it's important that the implementation returns a File object even for files that don't exist, so that salsa can track the fact that this query needs to be re-executed if such a file gets created later. This isn't necessary for vendored files because the vendored file system is read-only. Returning None in that case reduces the number of dependencies that salsa has to track and, in turn, can result in better performance. Splitting the methods also has the benefit that they're easier to call. I suspect that code paths will work exclusively with either file system paths or vendored paths. It would be cumbersome if callers needed to wrap the path in a VfsPath just for the interning.
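To make the difference in return types concrete, here is a minimal sketch that uses only the standard library. `FileId`, `with_vendored`, and the string-keyed map are hypothetical stand-ins for the salsa ingredient and the real interner; this is not the PR's implementation.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the salsa `VfsFile` ingredient: an interned id.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct FileId(u32);

#[derive(Default)]
pub struct Vfs {
    /// Interned paths (file system and vendored share one table in this sketch).
    by_path: HashMap<String, FileId>,
    /// The vendored file set is fixed up front because it is read-only.
    vendored_paths: HashSet<String>,
}

impl Vfs {
    /// Assumed constructor for the sketch: registers the known vendored paths.
    pub fn with_vendored(paths: &[&str]) -> Self {
        Self {
            by_path: HashMap::new(),
            vendored_paths: paths.iter().map(|p| p.to_string()).collect(),
        }
    }

    fn intern(&mut self, key: String) -> FileId {
        let next = FileId(self.by_path.len() as u32);
        *self.by_path.entry(key).or_insert(next)
    }

    /// Always returns a handle, even for a path that does not exist on disk,
    /// so a dependent query can be invalidated if the file is created later.
    pub fn file(&mut self, path: &str) -> FileId {
        self.intern(format!("fs:{path}"))
    }

    /// Returns `None` for unknown vendored paths: the set never changes,
    /// so there is nothing to track for a missing file.
    pub fn vendored(&mut self, path: &str) -> Option<FileId> {
        if self.vendored_paths.contains(path) {
            Some(self.intern(format!("vendored:{path}")))
        } else {
            None
        }
    }
}
```

Interning the same file system path twice yields the same handle, while a lookup of an unknown vendored path yields `None` without creating any entry.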

`FileSystem`

This PR also introduces a new FileSystem trait that abstracts away how filesystem files are read. The goal of this is that we can support different environments:

WASM: Ruff might not have access to a full file system, or everything is in memory, as is the case in our playground today.
LSP: The LSP does support reading from files, but unsaved changes take precedence over the content on disk.
tests: Ideally, tests don't need to write the content to disk. Instead, they can use an in-memory file system.
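The environments listed above can be sketched as implementations of one small trait. The method set, the `&str` paths, and the `line_count` helper below are assumptions chosen for brevity; the actual trait in the PR may look different.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical, reduced version of a file system abstraction.
pub trait FileSystem {
    fn read_to_string(&self, path: &str) -> io::Result<String>;
    fn exists(&self, path: &str) -> bool;
}

/// Reads from the real disk via `std::fs`.
pub struct OsFileSystem;

impl FileSystem for OsFileSystem {
    fn read_to_string(&self, path: &str) -> io::Result<String> {
        std::fs::read_to_string(path)
    }

    fn exists(&self, path: &str) -> bool {
        std::path::Path::new(path).exists()
    }
}

/// Keeps all files in memory; useful for tests and WASM-like hosts.
#[derive(Default)]
pub struct MemoryFileSystem {
    files: HashMap<String, String>,
}

impl MemoryFileSystem {
    pub fn write_file(&mut self, path: &str, content: &str) {
        self.files.insert(path.to_string(), content.to_string());
    }
}

impl FileSystem for MemoryFileSystem {
    fn read_to_string(&self, path: &str) -> io::Result<String> {
        self.files.get(path).cloned().ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, format!("{path} not found"))
        })
    }

    fn exists(&self, path: &str) -> bool {
        self.files.contains_key(path)
    }
}

/// Example caller that does not care which implementation it gets.
pub fn line_count(fs: &dyn FileSystem, path: &str) -> io::Result<usize> {
    Ok(fs.read_to_string(path)?.lines().count())
}
```

A test can then populate a `MemoryFileSystem` and pass it to `line_count` without touching the disk.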

Crate name

I ended up creating a new crate because I couldn't find an existing crate that fits well. I first wanted to use ruff_source_file which is very similar. However, ruff_notebook depends on it and I suspect that this crate might depend on ruff_notebook in the long term (unless we use some Arc<dyn Trait + Eq> to represent file content).

The other reason why I didn't put it in ruff_source_file is that that crate is intended to store low-level data structures for files. A Vfs and a dependency on salsa feel like overkill for such a crate.

Open Questions

dir walking: It's not entirely clear to me how to implement dir walking because walkdir doesn't support virtual filesystems. Ideally, both the memory file system and the real implementation would use the same dir walking mechanism to avoid differences in behavior.
File watching: I think file systems either need to watch files for changes automatically or need an API that allows setting up file watchers. They probably need a second API that allows pulling all changes (file added, removed, changed events). The host would call an apply_changes method that takes a mutable self whenever it observes file system changes, so that the Vfs can update its metadata.
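The file-watching idea could take roughly the following shape. This is only a sketch of the open question, not a settled design: `FileChange`, `FileStatus`, the `revision` counter, and the `status` accessor are all hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical change events that a host (CLI watcher, LSP, tests) collects.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FileChange {
    Created { path: String, revision: u64 },
    Modified { path: String, revision: u64 },
    Deleted { path: String },
}

/// Metadata-only view of a file, as in this PR (no content yet).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileStatus {
    Exists { revision: u64 },
    Deleted,
}

#[derive(Default)]
pub struct Vfs {
    metadata: HashMap<String, FileStatus>,
}

impl Vfs {
    /// Takes `&mut self`, as described above, and folds the observed
    /// changes into the tracked metadata.
    pub fn apply_changes(&mut self, changes: impl IntoIterator<Item = FileChange>) {
        for change in changes {
            match change {
                FileChange::Created { path, revision }
                | FileChange::Modified { path, revision } => {
                    self.metadata.insert(path, FileStatus::Exists { revision });
                }
                FileChange::Deleted { path } => {
                    // Keep the entry so dependents can observe the deletion.
                    self.metadata.insert(path, FileStatus::Deleted);
                }
            }
        }
    }

    pub fn status(&self, path: &str) -> Option<FileStatus> {
        self.metadata.get(path).copied()
    }
}
```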

crates/ruff_db/src/vfs/path.rs

MichaReiser · 2024-06-08T15:24:58Z

crates/ruff_db/src/vfs.rs

+                    permission,
+                })
+            }
+            VfsPath::Vendored(_) => todo!(),


@AlexWaygood what I have in mind here is that the implementation calls into the vendored crate to read the file metadata (vendored_path.last_modification_time(), vendored_path.exists(), and ruff_vendored::read_to_string(vendored_path)?).

github-actions · 2024-06-08T15:52:53Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

MichaReiser · 2024-06-09T13:55:44Z

crates/ruff_db/src/lib.rs

+    /// Interns a path to a vendored file and returns a salsa `File` ingredient.
+    fn vendored_file(&self, path: &camino::Utf8Path) -> Option<VfsFile>;


@AlexWaygood I think we ultimately want two different entry functions for regular files, and vendored files. Mainly because I think it's a bit annoying if you're working with a vendored path and you then need to convert it to a VfsPath just to get the VfsFile. Funnily, the first thing that the implementation would do is to dispatch on the path type.

MichaReiser · 2024-06-10T06:32:05Z

crates/ruff_db/src/file_system.rs

+// TODO support untitled files for the LSP use case. Wrap a `str` and `String`
+//    The main question is how `as_std_path` would work for untitled files, that can only exist in the LSP case
+//    but there's no compile time guarantee that a [`OsFileSystem`] never gets an untitled file path.


CC: @snowsignal I don't plan to support this as part of this PR, but it's something we need to figure out for red-knot/LSP.

Thanks for pinging me on this!

codspeed-hq · 2024-06-10T14:35:10Z

CodSpeed Performance Report

Merging #11802 will not alter performance

Comparing salsa-files-vfs (ce952b3) with salsa-files-vfs (a98db3d)

Summary

✅ 30 untouched benchmarks

MichaReiser · 2024-06-10T16:55:55Z

Okay, not having a db.file(VfsPath) method is going to be problematic. I realized this when implementing the module resolver where we have a path_to_module(db, VfsFilePath) function. I could match on the path and call the two different functions but that's rather annoying.

I have to figure out how we handle the signature difference between vendored paths and regular paths. I would very much like the possibility that vendored returns None if the file doesn't exist.

I plan to address this as a separate PR to reduce my rebasing work.

Edit: This is solved in #11826
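For illustration, the dispatch described in this comment could be hidden behind one entry point like the sketch below. How #11826 actually resolves it is not shown in this thread, so `vfs_path_to_file`, `FileHandle`, and the `Db` struct are hypothetical.

```rust
/// Hypothetical path type covering both kinds of paths.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum VfsPath {
    FileSystem(String),
    Vendored(String),
}

/// Hypothetical stand-in for the interned `VfsFile`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileHandle(String);

pub struct Db {
    /// Known vendored paths; fixed because the vendored file system is read-only.
    pub vendored: Vec<String>,
}

impl Db {
    /// File system paths always produce a handle, even if the file is missing.
    pub fn file(&self, path: &str) -> FileHandle {
        FileHandle(format!("fs:{path}"))
    }

    /// Vendored paths produce `None` when the file does not exist.
    pub fn vendored(&self, path: &str) -> Option<FileHandle> {
        self.vendored
            .iter()
            .any(|p| p.as_str() == path)
            .then(|| FileHandle(format!("vendored:{path}")))
    }

    /// Single entry point: callers holding a `VfsPath` no longer match themselves.
    /// The unified return type has to be `Option` to keep the vendored behavior.
    pub fn vfs_path_to_file(&self, path: &VfsPath) -> Option<FileHandle> {
        match path {
            VfsPath::FileSystem(p) => Some(self.file(p)),
            VfsPath::Vendored(p) => self.vendored(p),
        }
    }
}
```

The cost of this shape is that callers with a file system path get an `Option` they never need, which is the signature tension mentioned above.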

carljm

Strong +1 on the decision to make this its own crate. I think in general making new crates for red-knot functionality should be preferred over squeezing it into existing crates; I think the latter will cause more confusion. In particular I don't think we should try to put red-knot's semantic model into the existing ruff_python_semantic crate.

crates/ruff_db/src/file_system.rs

crates/ruff_db/src/lib.rs

carljm · 2024-06-11T15:46:36Z

crates/ruff_db/src/vfs.rs

+    /// The unix permissions of the file. Only supported on unix systems. Always 0 on Windows
+    /// or when the file has been deleted.


0 or None?

Should this be an Option<NonZeroU32>?

It should be None.

The FS API uses a u32 as its return type. I'm not aware that 0 is ever a valid permission, but I want to avoid conversion errors, and using a NonZeroU32 doesn't give us much here (other than the size savings).
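A small sketch of the `Option<u32>` variant discussed in this thread, using only `std`. The function names are hypothetical; `PermissionsExt::mode` is the standard Unix-only accessor that returns a plain `u32`.

```rust
use std::fs::Metadata;
use std::path::Path;

/// On Unix, expose the raw mode bits as returned by the standard library.
#[cfg(unix)]
pub fn permissions(metadata: &Metadata) -> Option<u32> {
    use std::os::unix::fs::PermissionsExt;
    Some(metadata.permissions().mode())
}

/// On other platforms there is no Unix mode, so report `None` instead of 0.
#[cfg(not(unix))]
pub fn permissions(_metadata: &Metadata) -> Option<u32> {
    None
}

/// A deleted or unreadable file also maps to `None`.
pub fn permissions_of(path: &Path) -> Option<u32> {
    std::fs::metadata(path).ok().and_then(|m| permissions(&m))
}
```

Keeping `u32` inside the `Option` avoids any conversion from the value the standard library hands back.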

crates/ruff_db/src/vfs/path.rs

carljm · 2024-06-11T16:00:55Z

crates/ruff_db/src/vfs.rs

+/// ## Why do both the [`Vfs`] and [`FileSystem`](crate::FileSystem) trait exist?
+///
+/// It would have been an option to define [`FileSystem`](crate::FileSystem) in a way that all its operation accept
+/// a [`VfsPath`]. This would have allowed to unify most of [`Vfs`] and [`FileSystem`](crate::FileSystem). The reason why they are
+/// separate is that not all operations are supported for all [`VfsPath`]s:
+///
+/// * The only relevant operations for [`VendoredPath`]s are testing for existence and reading the content.
+/// * The vendored file system is immutable and doesn't support writing nor does it require watching for changes.
+/// * There's no requirement to walk the vendored file system.
+///
+/// The other reason is that most operations know if they are working with vendored or file system paths.
+/// Requiring them to convert the path to a `VfsPath` to test if the file exists is cumbersome.


I do find this layering pretty confusing in trying to understand the code. I think partly it's just the naming: what we call FileSystem is already "virtual" in that it can represent on-disk, memory, etc. And then we have another "virtual file system" layer on top of that.

I'm also not sure of the accuracy of this statement:

> The other reason is that most operations know if they are working with vendored or file system paths.

Maybe it will turn out to be the case? It's just not clear to me.

I can say from using the API that it's working pretty well so far. But I admit that the terminology is confusing. Do you have any suggestions on how the naming could be improved, or what specifically do you find confusing?

I can try to improve the documentation. The way I think about FileSystem is less that it is a virtual file system and more that it is how we access the file system. What would you think of FileSystemDriver?

I spent some more time looking at it and I think it's fine, if it seems to you to be working out well. I'm not sure that a rename of FileSystem to FileSystemDriver would make much difference; I think "Driver" ends up being kind of an empty filler word there that doesn't really communicate much.

It was hard for me to figure out the model for how these two things interact, since neither one encapsulates the other. It seems the model is that a Db has both a Vfs and a FileSystem, and it will use the FileSystem for looking up filesystem paths, and manage looking up vendored paths itself, though that hasn't been implemented yet, just stubbed.

Maybe what's confusing here is the naming of Vfs? Maybe it should just be VfsFiles?

Anyway, I do think your concern is valid and I went back and forth a couple of times between FileSystem handling all VfsPaths and the FileSystem only handling specific paths. And maybe my reasoning that vendored paths are different enough to not pass them through FileSystem is unjustified, considering that calling write_file with a path pointing to a directory also fails. So some extra care is needed anyway when handling paths that aren't checked at compile time. But I find it kind of nice if we can catch some of them at compile time.

The way I think about it is that FileSystem is a replacement for std::fs, that's it. It doesn't add support for using multiple file systems at once (vendored and the regular one), which is what Vfs does.

An entirely different design (not fully thought through) would be to ditch the FileSystem trait altogether. The main motivation for it is that we can support unsaved files and untitled files in the LSP use case. But we could support this in another way by adding open_file and close_file functions to Vfs that allow manually overriding the content of a file.

The only functionality we would lose is that we can't "mock out" the file system during tests with an in-memory file system. But maybe that's not worth going through all that trouble. The WASM integration story would then require providing a "stub" FS implementation. It's not a big deal, and it's something that e.g. emscripten provides out of the box, but it comes with a bit more boilerplate code than simply using a memory file system.
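The alternative design mentioned here can be sketched as an overlay of editor buffers on top of `std::fs`. Only the names open_file and close_file come from the comment; the `read` method, the `overrides` map, and the string paths are assumptions for this example.

```rust
use std::collections::HashMap;
use std::io;

/// Sketch of a Vfs without a FileSystem trait: disk access goes straight
/// through `std::fs`, and open editor buffers shadow the content on disk.
#[derive(Default)]
pub struct Vfs {
    overrides: HashMap<String, String>,
}

impl Vfs {
    /// Registers unsaved (or untitled) content for a path.
    pub fn open_file(&mut self, path: &str, content: String) {
        self.overrides.insert(path.to_string(), content);
    }

    /// Drops the override; later reads fall back to the disk.
    pub fn close_file(&mut self, path: &str) {
        self.overrides.remove(path);
    }

    /// Unsaved content takes precedence over the content on disk.
    pub fn read(&self, path: &str) -> io::Result<String> {
        match self.overrides.get(path) {
            Some(content) => Ok(content.clone()),
            None => std::fs::read_to_string(path),
        }
    }
}
```

As the comment notes, the trade-off is that tests can no longer swap the disk for an in-memory implementation, because the fallback is hard-wired to `std::fs`.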

Either way, I don't consider the Vfs / FileSystem design that I proposed here as the final design. It requires some more iterating and I'm open to redesigning it completely.

We merge what we have now, with the risk that we might need to refactor some of the calling code.

We strip out the FileSystem trait and revisit the design when implementing LSP support.

I'm leaning towards keeping what we have here. There's not a lot of downstream code depending on FileSystem, so a refactor should be fairly painless.

MichaReiser · 2024-06-11T18:44:12Z

> Strong +1 on the decision to make this its own crate. I think in general making new crates for red-knot functionality should be preferred over squeezing it into existing crates; I think the latter will cause more confusion. In particular I don't think we should try to put red-knot's semantic model into the existing ruff_python_semantic crate.

I'm a bit surprised by this comment because that's what I proposed in discord and I didn't see any objection to doing this for ruff_python_semantic.

I can see how it can cause confusion. Mainly for imports when you get both Symbols (the ruff and red_knot one).

I don't think we can avoid mixing them long term. The red knot crates at least will have to depend on their corresponding ruff crates to use their functionality, and it then becomes less clear to me what the benefit of keeping them separate really is. I also think that sharing crates is a motivation to integrate early and more often, compared to building this out entirely in different crates.

carljm · 2024-06-12T04:22:51Z

> I'm a bit surprised by this comment because that's what I proposed in discord and I didn't see any objection to doing this for ruff_python_semantic

Oh sorry! I think I do remember that Discord comment now; I guess I didn't think very carefully about it then.

> I can see how it can cause confusion. Mainly for imports when you get both Symbols (the ruff and red_knot one).

I think it will just be generally hard to figure out how to structure things within one crate to make it clear what is red-knot and what is "v1". E.g. there is a top-level definition.rs in ruff_python_semantic, but red-knot will have its own version of Definitions; where do they go? Perhaps it will help me envision how this can work if I see the overall structure you have in mind within the crate to avoid just having a random assortment of sub-modules, some of which are v1 and some of which are red-knot, with no obvious way to tell which is which. If this structure is the outcome, I don't think that's a good outcome.

> I don't think we can avoid mixing them long term. The red knot crates at least will have to depend on their corresponding ruff crates to use their functionality, and it then becomes less clear to me what the benefit of keeping them separate really is. I also think that sharing crates is a motivation to integrate early and more often, compared to building this out entirely in different crates.

I agree that red knot will probably have to depend on v1, but the dependency should not go in the other direction, and to me this is enough reason to keep them separate.

I'm not clear what sort of "integration" we actually envision between the v1 semantic model and the red-knot semantic model that this would encourage. Can you clarify what kind of integration you are thinking of? My understanding is that they will always remain separate, and rules will have to explicitly be ported in some way.

MichaReiser · 2024-06-12T06:02:52Z

> I think it will just be generally hard to figure out how to structure things within one crate to make it clear what is red-knot and what is "v1". E.g. there is a top-level definition.rs in ruff_python_semantic, but red-knot will have its own version of Definitions; where do they go? Perhaps it will help me envision how this can work if I see the overall structure you have in mind within the crate to avoid just having a random assortment of sub-modules, some of which are v1 and some of which are red-knot, with no obvious way to tell which is which. If this structure is the outcome, I don't think that's a good outcome.

I agree with this, and it's something I have started thinking about as well. I don't think what I've done in my follow-up PRs is ideal and I wanted to iterate on it. For now, the red knot code is gated behind the red_knot feature. That means that, by default, the ruff_python_semantic crate doesn't compile with any red knot functionality. This prevents any v1 code in ruff from accessing red_knot code.

I think what could be improved is that we move all red_knot code that isn't shared between v1 and red_knot into a red_knot module. Although it then quickly becomes unclear what should be in there. Should we move the db out as soon as a single v1 API uses any red knot code? What about code that doesn't exist in v1 at all, e.g. the module resolver?

Regardless, this would then be very close to having two separate crates. The difference I see is that moving files within a crate tends to be easier, and we can also already make use of the right crate visibilities (we don't need to make anything pub just so that we can access it from the red_knot crate).

I overall don't have a strong preference and I think I would just go with one approach for now. We don't need to get it right just now, it's easy to split the code out later.

> I'm not clear what sort of "integration" we actually envision between the v1 semantic model and the red-knot semantic model that this would encourage. Can you clarify what kind of integration you are thinking of? My understanding is that they will always remain separate, and rules will have to explicitly be ported in some way.

I at least want to try to come up with a rule API that would work for both v1 and red_knot, at least for a majority of the rules. I would strongly prefer if we can avoid copying rules from ruff to the red_knot crate because that would come close to a rule-freeze for a couple of months. But this requires that ruff_linter has access to both red_knot and v1 code.

MichaReiser · 2024-06-12T07:01:40Z

I'll go ahead and merge this PR. This does not mean that we made a decision on how to proceed with FileSystem and Vfs or where we want to place the red_knot code in ruff_python_semantic. I just don't consider these two decisions as merge blocking and I would prefer to do this refactor as separate PRs on top of my entire stack anyway, because I don't enjoy suffering through rebases.

carljm · 2024-06-12T13:27:50Z

I do see that visibilities could be a reason to share a crate, if there's a lot of use of "old" internals from red-knot code.

Ultimately I think sharing a crate could work fine, too, as long as we have a structure that makes it clear what is v1 and what is red-knot.

AlexWaygood

Sorry for the slow review here. It took me a little bit of time to get my head round some of this, but it LGTM. Happy to open my own followup PR for some of my docs nits, unless there are any you particularly disagree with!

crates/ruff_db/src/file_system.rs

AlexWaygood · 2024-06-10T16:29:16Z

crates/ruff_db/src/file_system.rs

+
+/// Path to a file or directory stored in [`FileSystem`].
+///
+/// The path is guaranteed to be valid UTF-8.


Does this mean that we won't provide type-checking for Python scripts if their filename contains non-utf8 characters? Is this a limit Ruff already has when linting Python?

I think we can guarantee that vendored files are always going to be valid utf-8 but I'm not sure about non-vendored files.

Yes, it limits us to UTF8 paths only, but Ruff already assumes UTF8 today (we have so many path.to_str().unwrap() calls). Also, non-UTF8 paths are extremely uncommon. I think all modern file systems support UTF8 today (or at least, paths can be encoded to UTF8).
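As a sketch of what a UTF-8-only path type implies, the wrapper below converts from a `std::path::Path` fallibly instead of calling `to_str().unwrap()`. The type name `Utf8PathSketch` and its methods are hypothetical and are not the type added in this PR.

```rust
use std::path::Path;

/// Hypothetical owned path that is guaranteed to be valid UTF-8.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Utf8PathSketch(String);

impl Utf8PathSketch {
    /// Returns `None` for non-UTF-8 paths instead of panicking,
    /// because `Path::to_str` yields `None` in that case.
    pub fn from_std_path(path: &Path) -> Option<Self> {
        path.to_str().map(|s| Self(s.to_string()))
    }

    /// Cheap access as `&str`; always valid because of the invariant above.
    pub fn as_str(&self) -> &str {
        &self.0
    }

    /// Converting back to a standard path cannot fail.
    pub fn as_std_path(&self) -> &Path {
        Path::new(&self.0)
    }
}
```

With this shape, the UTF-8 check happens once at the boundary, and everything downstream can treat paths as strings.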

crates/ruff_db/src/file_system.rs

crates/ruff_db/src/file_system/memory.rs

crates/ruff_db/src/vfs.rs

MichaReiser added the red-knot Multi-file analysis & type inference label Jun 8, 2024

MichaReiser changed the title ~~red-knot: First draft of the File input ingredient and a Vfs~~ red-knot: File input ingredient and a Vfs Jun 8, 2024


MichaReiser force-pushed the salsa-files-vfs branch from c730d27 to 9c450dd Compare June 8, 2024 15:32

MichaReiser changed the title ~~red-knot: File input ingredient and a Vfs~~ red-knot: VfsFile input ingredient and a Vfs Jun 9, 2024


MichaReiser force-pushed the salsa-files-vfs branch 4 times, most recently from fa3bb86 to 95d1254 Compare June 10, 2024 06:30

MichaReiser requested review from carljm and AlexWaygood June 10, 2024 06:30


MichaReiser marked this pull request as ready for review June 10, 2024 06:32

MichaReiser changed the base branch from main to set-minimal-rust-to-175 June 10, 2024 11:16

Base automatically changed from set-minimal-rust-to-175 to main June 10, 2024 12:39

MichaReiser added 3 commits June 10, 2024 16:30

Draft files and VFS

70b3285

Rename File to VfsFile, remove FileSystem, too many unknowns

8cd8e6d

Restore FileSystem

a98db3d

MichaReiser force-pushed the salsa-files-vfs branch from 95d1254 to a98db3d Compare June 10, 2024 14:30

MichaReiser changed the title ~~red-knot: VfsFile input ingredient and a Vfs~~ red-knot: VfsFile input ingredient and a Vfs [salsa 1..] Jun 11, 2024

MichaReiser changed the title ~~red-knot: VfsFile input ingredient and a Vfs [salsa 1..]~~ red-knot(salsa part 1): VfsFile input ingredient and a Vfs Jun 11, 2024

MichaReiser changed the title ~~red-knot(salsa part 1): VfsFile input ingredient and a Vfs~~ red-knot[salsa part 1]: VfsFile input ingredient and a Vfs Jun 11, 2024

carljm approved these changes Jun 11, 2024


Code review feedback

ce952b3

MichaReiser changed the title ~~red-knot[salsa part 1]: VfsFile input ingredient and a Vfs~~ red-knot: VfsFile input ingredient and a Vfs Jun 12, 2024

MichaReiser enabled auto-merge (squash) June 12, 2024 07:03

MichaReiser merged commit 93973b9 into main Jun 12, 2024
19 checks passed

MichaReiser deleted the salsa-files-vfs branch June 12, 2024 07:06

AlexWaygood reviewed Jun 12, 2024


MichaReiser added a commit that referenced this pull request Jun 13, 2024

Address review feedback from #11802

4e18150

MichaReiser mentioned this pull request Jun 13, 2024

[red-knot] Improve Vfs and FileSystem documentation #11856

Merged
