
Salsa based red-knot prototype #11338

Draft · wants to merge 9 commits into main
Conversation

MichaReiser (Member) commented May 8, 2024

This is a prototype that uses Salsa for our red-knot prototype 😆

The PR implements cross-module type inference invalidation based on Salsa. What makes this hard is that

  1. Salsa query arguments need to be ingredients. That means we can no longer pass arbitrary arguments to queries.
  2. Salsa skips invalidation if the result of a query compares equal to the value it returned previously. We need to remodel our query return values to make the best use of that and avoid, for example, that all types are invalidated by a single whitespace change (the values we return from type queries should be location-independent).

I'll go through the important data models jar by jar.

Source

The source jar gives access to files, the text of a file, and a file's AST.

File

#[salsa::input(jar=Jar)]
pub struct File {
    #[return_ref]
    pub path: PathBuf,
    pub permissions: u32,
    pub revision: FileRevision,
    pub status: FileStatus,
    #[allow(unused)]
    count: Count<File>,
}

The File stores basic metadata about a file but not the file's content. This is mainly because of persistent caching: restoring the database from disk requires that we restore all files. If the source were stored on the file, we would have to read the content of every file, which would be very expensive (we want reading the source to happen lazily). That's why the file only stores basic metadata.

Note: We may decide long-term to have a configuration option that lets users select whether they want to use mtime or the file's hash for change detection. In that case, I think we would have a source: Option<String> on File so that the source_text query avoids re-reading the file from disk.

Files are salsa inputs. Salsa doesn't know how to compute files. Instead, we need to tell salsa which files exist and when they change. That's why files are resolved using db.file(path) where we perform our own mapping from Path -> File (Salsa inputs have no identity other than their instance).
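The Path -> File mapping can be pictured with a small standalone sketch (hypothetical code, not the PR's actual implementation): the database owns an interning table so that asking for the same path twice yields the same File identity.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

// Stand-in for the Salsa input id; inputs have no identity beyond their instance,
// so the database must keep its own Path -> File map.
#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug)]
struct File(u32);

#[derive(Default)]
struct Files {
    by_path: HashMap<PathBuf, File>,
}

impl Files {
    // Returns the existing `File` for `path`, or registers a new one.
    // Calling this twice with the same path must yield the same `File`.
    fn file(&mut self, path: &Path) -> File {
        let next = File(self.by_path.len() as u32);
        *self.by_path.entry(path.to_path_buf()).or_insert(next)
    }
}

fn main() {
    let mut files = Files::default();
    let a1 = files.file(Path::new("src/main.py"));
    let a2 = files.file(Path::new("src/main.py"));
    let b = files.file(Path::new("src/foo.py"));
    assert_eq!(a1, a2); // same path -> same File identity
    assert_ne!(a1, b);
    println!("{a1:?} {b:?}");
}
```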

SourceText

The source_text(file: File) -> SourceText query allows retrieving a file's source text. The source text isn't very exciting. It just stores the file's content.

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct SourceText {
    pub text: Arc<str>,
    count: Count<SourceText>,
}

Some notes about the implementation:

  • The query calls file.revision() (equal to the file's mtime) to inform Salsa that the query should rerun whenever the file is modified. It actually doesn't need the value.
  • It might happen that the file has been deleted between calling db.file(path) and source_text(db, file). In that case, we just assume that the file is empty. That's the best we can do without dealing with awkward results in all caller paths.
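The deleted-file fallback described above can be sketched without any Salsa machinery (source_text here is a plain function, not the tracked query): if the file vanished between discovery and reading, treat it as empty rather than surfacing an error to every caller.

```rust
use std::fs;
use std::path::Path;
use std::sync::Arc;

// Simplified sketch: read the file's text, falling back to an empty string
// when the file has been deleted (or is otherwise unreadable) in the
// window between `db.file(path)` and the read.
fn source_text(path: &Path) -> Arc<str> {
    match fs::read_to_string(path) {
        Ok(text) => Arc::from(text.as_str()),
        Err(_) => Arc::from(""), // deleted or unreadable: pretend it's empty
    }
}

fn main() {
    let text = source_text(Path::new("definitely/missing/file.py"));
    assert_eq!(&*text, "");
    println!("len = {}", text.len());
}
```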

parse

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar, no_eq)]
pub fn parse(db: &dyn Db, file: File) -> Arc<Parsed<ModModule>> {
    let source = file.source(db);
    let result = ruff_python_parser::parse_unchecked_source(source.text(), PySourceType::Python);
    Arc::new(result)
}

The parse query is almost boring. It retrieves the file text and calls the parser. We opt out of Salsa's eq optimization because the parse tree is guaranteed to change whenever the source text changes (and our AST doesn't implement Eq because of floats).

Semantic

This is where it gets interesting.

AstIds

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar, no_eq, return_ref)]
pub fn ast_ids(db: &dyn Db, file: File) -> AstIds {
    let parsed = parse(db.upcast(), file);
    AstIds::from_parsed(&parsed)
}

AstIds are a location-independent representation that allows mapping from Id -> AstNode and from AstNode -> Id. The implementation tries to assign stable IDs by first giving IDs to the module-level statements and expressions, and only then traversing into the function or class level.

a = 10 # statement-id: 0

def test(a): # statement-id: 1
	if a: # statement-id: 4
		pass # statement-id 5

print(a) # statement-id: 2

class Test: # statement-id: 3
	pass # statement-id: 6

This way, the IDs of top-level statements remain unchanged when changes are made only inside a function's body. Having stable top-level IDs is important because they are referenced from other modules.
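The numbering scheme above can be sketched as a two-pass traversal (an illustrative toy, not the PR's AstIds implementation): fully number the statements of one level before descending into any nested body, so edits inside a body never shift the IDs of the level above.

```rust
// Toy statement: just a label and an optional nested body.
#[derive(Debug)]
struct Stmt {
    name: &'static str,
    body: Vec<Stmt>,
}

// Pass 1 numbers every statement of the current level; pass 2 only then
// recurses into nested bodies, sharing the same counter.
fn assign_ids(stmts: &[Stmt], next: &mut u32, out: &mut Vec<(&'static str, u32)>) {
    for stmt in stmts {
        out.push((stmt.name, *next));
        *next += 1;
    }
    for stmt in stmts {
        assign_ids(&stmt.body, next, out);
    }
}

fn main() {
    // Mirrors the Python example: a, def test (if -> pass), print, class Test (pass).
    let module = vec![
        Stmt { name: "a = 10", body: vec![] },
        Stmt {
            name: "def test",
            body: vec![Stmt { name: "if a", body: vec![Stmt { name: "pass", body: vec![] }] }],
        },
        Stmt { name: "print(a)", body: vec![] },
        Stmt { name: "class Test", body: vec![Stmt { name: "pass (class)", body: vec![] }] },
    ];
    let mut next = 0;
    let mut out = Vec::new();
    assign_ids(&module, &mut next, &mut out);
    assert_eq!(out[1], ("def test", 1)); // top-level IDs assigned first
    assert_eq!(out[4], ("if a", 4)); // nested statements numbered afterwards
    println!("{out:?}");
}
```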

symbols, cfg

semantic_index

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar, return_ref)]
pub fn semantic_index(db: &dyn Db, file: File) -> SemanticIndex {
    let root_scope_id = SymbolTable::root_scope_id();
    let mut indexer = SemanticIndexer {
        db,
        file,
        symbol_table_builder: SymbolTableBuilder::new(),
        flow_graph_builder: FlowGraphBuilder::new(),
        scopes: vec![ScopeState {
            scope_id: root_scope_id,
            current_flow_node_id: FlowGraph::start(),
        }],
        current_definition: None,
    };

    let parsed = parse(db.upcast(), file);
    indexer.visit_body(&parsed.syntax().body);
    indexer.finish()
}

The semantic_index query computes a single file's symbol table and control flow graph. It shouldn't be used directly because the semantic_index changes every time the AST changes.

symbol_table

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar)]
pub fn symbol_table(db: &dyn Db, file: File) -> Arc<SymbolTable> {
    semantic_index(db, file).symbol_table.clone()
}

The query itself just calls into semantic_index. The trick here is that the symbol table itself doesn't contain any data that references the AST. Instead, all data uses AstIds. What this query enables is that Salsa can avoid running queries that depend on symbol_table if the constructed symbol table hasn't changed. For example, a comment-only change doesn't invalidate the symbol table.
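Salsa's equal-result optimization can be modeled with a toy memo table (an illustration of the principle, not Salsa's actual internals): when a recomputation produces an equal value, the memo keeps its old "changed at" revision, so dependent queries see the value as unchanged and skip rerunning.

```rust
// A memoized query result with the revision at which its value last changed.
struct Memo<T: PartialEq> {
    value: T,
    changed_at: u64,
}

impl<T: PartialEq> Memo<T> {
    // Store a value recomputed at `revision`; keep the old `changed_at`
    // if the new value compares equal (so dependents are not invalidated).
    fn update(&mut self, new_value: T, revision: u64) {
        if self.value != new_value {
            self.value = new_value;
            self.changed_at = revision;
        }
    }
}

fn main() {
    // Revision 1: symbol table contains only `x`.
    let mut memo = Memo { value: vec!["x"], changed_at: 1 };
    // Revision 2: a comment-only edit; the table recomputes to an equal value.
    memo.update(vec!["x"], 2);
    assert_eq!(memo.changed_at, 1); // dependents see "unchanged since revision 1"
    // Revision 3: a new symbol appears; now dependents must rerun.
    memo.update(vec!["x", "y"], 3);
    assert_eq!(memo.changed_at, 3);
    println!("changed_at = {}", memo.changed_at);
}
```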

flow_graph

#[allow(unused)]
#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar)]
pub fn flow_graph(db: &dyn Db, file: File) -> Arc<FlowGraph> {
    semantic_index(db, file).flow_graph.clone()
}

We apply the same trick for the control flow graph.

Typing

typing_scopes

Typing is where the code changes the most. The existing implementation does type inference per expression. I don't think that per-expression type inference will be fast in Salsa because storing a query result has some overhead. Salsa also limits each query to at most u32::MAX memoized results. I think large projects could reach that limit, especially when the server runs for a long time.

That's why this PR changes inference to happen per TypingScope instead. For now, a typing scope is either a Module, Function, or Class. So this PR infers all types per module, class, or function (but the module doesn't traverse into function or class bodies).

The reason we don't perform type inference for the whole module at once is to get more fine-grained dependency tracking across files. Type checking of a dependent file only needs to rerun if the types of the scope where the symbol is defined have changed. If the types remain unchanged (for example, because the public interface didn't change), then type checking of dependents doesn't need to rerun.

The first step to make this possible is to create a FunctionTypingScope and a ClassTypingScope for every Function and Class in the file and store them in Salsa so they can be used as query arguments.

#[salsa::tracked(jar=Jar, return_ref)]
pub(crate) fn typing_scopes(db: &dyn Db, file: File) -> TypingScopes {
    let ast_ids = ast_ids(db, file);

    let functions = ast_ids
        .functions()
        .map(|(id, _)| (id, FunctionTypingScope::new(db, file, id)))
        .collect();

    let classes = ast_ids
        .classes()
        .map(|(id, _)| (id, ClassTypingScope::new(db, file, id)))
        .collect();

    TypingScopes { functions, classes }
}

infer_*_body

The other important queries are infer_module_body, infer_function_body, and infer_class_body. They perform type inference for a single module, function or class, but without traversing into nested classes or functions.

#[salsa::tracked(jar=Jar, return_ref)]
pub fn infer_function_body(db: &dyn Db, scope: FunctionTypingScope) -> TypeInference {
    let function = scope.node(db);
    let mut builder = TypeInferenceBuilder::new(db, scope.into());
    builder.lower_function_body(&function);
    builder.finish()
}

#[salsa::tracked(jar=Jar, return_ref)]
pub fn infer_class_body(db: &dyn Db, scope: ClassTypingScope) -> TypeInference {
    let class = scope.node(db);
    let mut builder = TypeInferenceBuilder::new(db, scope.into());
    builder.lower_class_body(&class);
    builder.finish()
}

#[salsa::tracked(jar=Jar, return_ref)]
pub fn infer_module_body(db: &dyn Db, file: File) -> TypeInference {
    let parsed = parse(db.upcast(), file);
    let mut builder = TypeInferenceBuilder::new(db, file.into());
    builder.lower_module(&parsed.syntax());
    dbg!(builder.finish())
}

Doing type checking per scope introduces some complexity. Resolving the type data for a TypeId requires knowing not just the file from which the data must be read, but also the typing scope. There's one extra wrinkle: there are cases where we want to resolve the type for a type_id while we are still building up that typing table. I solved this by introducing TypingContext and passing it to TypeId::ty. The TypingContext can carry a local override so that lookups for a specific typing scope are resolved directly without calling into the database.

pub struct TypingContext<'a> {
    db: &'a dyn Db,
    local: Option<(TypingScope, &'a TypeInference)>,
}

impl<'a> TypingContext<'a> {
    pub fn local(db: &'a dyn Db, local_scope: TypingScope, types: &'a TypeInference) -> Self {
        Self {
            db,
            local: Some((local_scope, types)),
        }
    }

    pub fn global(db: &'a dyn Db) -> Self {
        Self { db, local: None }
    }

    pub fn db(&self) -> &'a dyn Db {
        self.db
    }

    pub fn types(&self, scope: TypingScope) -> &'a TypeInference {
        if let Some((local_scope, types)) = self.local {
            if local_scope == scope {
                return types;
            }
        }

        infer_types(self.db, scope)
    }
}

Public API

The public API for types should be limited to:

pub fn infer_expression_type(db: &dyn Db, expression_id: GlobalId<ExpressionId>) -> Type {
    let typing_scope = TypingScope::for_expression(db, expression_id);
    let types = infer_types(db, typing_scope);
    types.expression_ty(expression_id.local())
}

#[tracing::instrument(level = "debug", skip(db))]
pub fn resolve_global_symbol(db: &dyn Db, file: File, name: &str) -> Option<GlobalSymbolId> {
    let symbol_table = symbol_table_query(db, file);
    let symbol_id = symbol_table.root_symbol_id_by_name(name)?;
    Some(GlobalSymbolId::new(file, symbol_id))
}

#[tracing::instrument(level = "debug", skip(db))]
pub fn global_symbol_type(db: &dyn Db, symbol: GlobalSymbolId) -> Type {
    let typing_scope = TypingScope::for_symbol(db, symbol);
    let types = infer_types(db, typing_scope);
    types.symbol_ty(symbol.local())
}

#[tracing::instrument(level = "debug", skip(db))]
pub fn global_symbol_type_by_name(db: &dyn Db, module: File, name: &str) -> Option<Type> {
    let symbols = symbol_table_query(db, module);
    let symbol = symbols.root_symbol_id_by_name(name)?;
    Some(global_symbol_type(db, GlobalSymbolId::new(module, symbol)))
}

Module Resolver

The module resolver remains mostly unchanged, although I did some renaming.

Module

I think the naming could be better. Module is mainly a ModuleName but interned into salsa so that it can be used as a query argument.

#[salsa::interned(jar=Jar)]
pub struct Module {
#[return_ref]
name: ModuleName,
}

I didn't want to intern ModuleName directly because I think there are places where we want to use it without the need for having it in Salsa. But maybe that's the wrong call and we should just intern ModuleName directly.

resolve_module

The main query remains resolve_module.

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar)]
pub fn resolve_module(db: &dyn Db, module: Module) -> Option<ResolvedModule> {
    let name = module.name(db);
    let (root_path, resolved_file, kind) = resolve_module_path(db, name)?;

    let normalized = resolved_file
        .path(db.upcast())
        .canonicalize()
        .map(|path| db.file(path))
        .unwrap_or_else(|_| resolved_file);

    Some(ResolvedModule {
        inner: Arc::new(ResolveModuleInner {
            module,
            kind,
            search_path: root_path,
            file: normalized,
        }),
    })
}

What changed is that it now accepts a Module and returns an Option<ResolvedModule>. Again, I'm open to suggestions for better naming. The idea is that a ResolvedModule represents what a module name resolves to. I'm considering renaming it to ResolvedModulePath because I think that's really what it is.

I think the implementation became much simpler because the module resolver now uses File and File::exists internally. This has the advantage that Salsa will automatically invalidate the resolve_module result if a relevant file gets added or removed.

file_to_module

pub fn file_to_module(db: &dyn Db, file: File) -> Option<ResolvedModule> {

Resolves a file to Some(ResolvedModule) if it is a module, and to None otherwise. This is mostly unchanged.

module_search_paths and set_module_search_paths

#[allow(unused)]
pub fn set_module_search_paths(db: &mut dyn Db, search_paths: Vec<ModuleSearchPath>) {
    if let Some(existing) = ModuleSearchPaths::try_get(db) {
        existing.set_paths(db).to(search_paths);
    } else {
        ModuleSearchPaths::new(db, search_paths);
    }
}

These queries shouldn't exist long term, but they were a "quick" way to allow setting the module search paths without supporting settings. I'll adapt this to @AlexWaygood's most recent changes by having a set_module_resolver_settings short term (with fields for the different lookup paths). The long-term goal is that the module resolver queries the settings and constructs the search paths from them (it probably should remain a query).

github-actions bot commented May 8, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.


codspeed-hq bot commented May 28, 2024

CodSpeed Performance Report

Merging #11338 will not alter performance

Comparing red-knot-salsa (06ee178) with main (2e0a975)

Summary

✅ 30 untouched benchmarks

Review comment on the ResolvedModule tracked struct — MichaReiser (Member Author):
@AlexWaygood this is where I'm currently landing on a Salsa design for a module resolver. I think it would simplify a lot for you because you no longer need to think about invalidation, Salsa will take care of that for you. The only thing necessary for this to work is that you use db.file(path).exists() to test if a file exists.

But check out resolve_module, it's now almost empty!

AlexWaygood (Member):
Ah, thanks for the ping. Yes, this indeed does make the code look a lot cleaner! It was making my head hurt a little bit to see all the cache-checking stuff right alongside the search-path semantics in resolve_module()

Comment on lines +184 to +221
pub fn path_to_module(db: &dyn Db, path: &Path) -> Option<ResolvedModule> {
let file = db.file(path.to_path_buf());
MichaReiser (Member Author):

It's a bit weird that path_to_module converts the path to a file as the very first thing only so that file_to_module then reads the path. However, for file_to_module to be a salsa query, it can only accept an ingredient as an argument and file is an ingredient but path isn't.

MichaReiser force-pushed the red-knot-salsa branch 5 times, most recently from 1843679 to 06ee178, on June 5, 2024.
Commit message: A query result only needs to be a tracked struct if we intend to use it as a query ingredient. It's unclear to me whether this is the case for `ResolvedModule`, that's why I make it a regular struct for now. We can easily make it a tracked struct later on.

Commit message: Tracked structs are only necessary when the struct should be used as an argument to a derived Salsa query. I don't expect that the lint results themselves should be used as query arguments, therefore normal structs do just fine.
MichaReiser changed the title from "Restore Salsa DB for exploring Salsa further" to "Salsa based red-knot prototype" on Jun 7, 2024.
MichaReiser (Member Author) commented Jun 7, 2024

There's one limitation with the current model where the invalidation isn't as good as it could be and it is due to the fact that we build the entire symbol table at once (we don't have to and we could refactor that later).

Let's say we start with

# main.py
import foo

x = foo.x

# foo.py
x = 10

def foo(): 
	pass

And we infer the type of x in main. To do this, the implementation runs

  • It parses main, builds its symbol table and cfg, and calls into infer_module_body
  • It resolves foo when reaching import foo
  • It calls ModuleType.member when reaching foo.x. This fans out to parse foo, build its symbol table and control flow graph, and then runs module-level type inference for foo
  • ...

When we now change the content of foo to

x = 10

def foo(): 
	y = 10

What I expected is that type inference for main wouldn't re-run because the module-level types of foo remain unchanged. However, that's not the case. The reason is that resolving foo.x reads the symbol table of foo.py, and that symbol table has changed because we introduced y in the scope of foo. We have the same problem when a flag of an enclosing symbol changes: for example, if the body of foo.foo is changed to y = x, the symbol table of that module changes because x now has the flag used.

We can avoid this by building the symbol table per scope rather than once globally, or by having a query that reduces the global symbol table to just the global symbols. I do think something like that would be nice to get more fine-grained invalidation.
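The "reduce the global symbol table to just the global symbols" idea can be sketched with hypothetical types (not red-knot's actual structures): callers that only care about module-level names depend on the reduced view, which compares equal when a change is confined to a nested scope — exactly the property Salsa's equal-result optimization needs to cut invalidation.

```rust
use std::collections::BTreeMap;

// Toy model: scope 0 is the module (root) scope; other keys are nested scopes.
#[derive(Clone, PartialEq, Debug)]
struct SymbolTable {
    symbols_by_scope: BTreeMap<u32, Vec<String>>,
}

// The reduced view: only the root-scope symbols, which is all that
// cross-module consumers can observe.
fn root_symbols(table: &SymbolTable) -> Vec<String> {
    table.symbols_by_scope.get(&0).cloned().unwrap_or_default()
}

fn main() {
    let before = SymbolTable {
        symbols_by_scope: BTreeMap::from([(0, vec!["x".into(), "foo".into()]), (1, vec![])]),
    };
    // Adding `y` inside `foo`'s scope changes the full table...
    let mut after = before.clone();
    after.symbols_by_scope.insert(1, vec!["y".into()]);
    assert_ne!(before, after);
    // ...but the module-level view compares equal, so queries that depend only
    // on it would not need to rerun.
    assert_eq!(root_symbols(&before), root_symbols(&after));
    println!("{:?}", root_symbols(&after));
}
```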

MichaReiser added the red-knot label (multi-file analysis & type inference) on Jun 7, 2024.
carljm (Contributor) commented Jun 7, 2024

This is a great write-up, thanks for taking the time! A few thoughts from the write-up, before I dive into the code:

  1. One caveat with building symbol tables per-scope is that the contents of nested functions can affect the symbol table of the enclosing function. (If a nested function uses nonlocal x and assigns to x, the symbol kind for x may change based on the knowledge that a nested function might assign to it.) So we can build symbol tables separately for module scope, class scope, and function scope -- but building a function symbol table needs to build all enclosed symbol tables (which may involve N nested class and function scopes, to arbitrary depth.) I think in practice this is still fine, though, and worth doing the split. Nesting isn't that common, and by definition the symbols inside a function scope aren't ones that another module can depend on, anyway.
  2. My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.
  3. I think if we are doing type inference per-scope rather than per-expression that will probably also recommend changes to how type inference works in the first place. We can probably just walk the AST for the scope, assigning types to expressions as we go, and similarly tracking narrowed local types for local symbols as we go. This will mean we don't even need FlowGraph anymore (a lot of the concepts developed in building it will still carry over, it will just be resolved eagerly instead.)

MichaReiser (Member Author):

One caveat with building symbol tables per-scope is that the contents of nested functions can affect the symbol table of the enclosing function.

The part that's unclear to me is how we would compute flags like USED when building the symbol table lazily per scope, because a symbol might be used in a child scope. But we also have the option to build the entire symbol table at once and then split it per scope (similar to the trick with SymbolIndex: build it once, then have sub-queries that only return a slice of it).

My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.

Yeah, but that would require that ModuleName becomes a salsa ingredient. It might be fine, but it makes ModuleName a bit more awkward to use. But if we keep ModuleName a regular struct, then what would you call the module thing that we pass to resolve_module? I might be overthinking this, because we can make the argument to resolve_module private and only expose a resolve_module(db: &dyn Db, name: &str) function that internally converts the &str to that Module thing, and we can then keep our existing terminology. Maybe that's for the best (it also hides complexity).

My thinking in calling Module { name: ModuleName } a module is that I don't think the existence of the file on disk is what makes a module. When we have import foo, there's an import of the foo module, regardless of whether that module exists or not. That's why I think a module's identity is really defined by its name.

We can probably just walk the AST for the scope, assigning types to expressions as we go, and similarly tracking narrowed local types for local symbols as we go. This will mean we don't even need FlowGraph anymore (a lot of the concepts developed in building it will still carry over, it will just be resolved eagerly instead.)

This is what the new implementation does. But I must say, it would make me sad to see your CFG go away. I think it could be useful for other things than just typing, like an unreachable rule.

carljm (Contributor) commented Jun 8, 2024

The part that's unclear to me is how we would compute flags like USED when building the symbol table lazily per scope, because a symbol might be used in a child scope.

The USED flag only means "used in the current scope", so it's not an issue. The only analysis that crosses scopes is the one Patrick was working on, and that's the case I discussed in my comment; but it only applies to nested scopes within function scopes, so handling a function scope and all its nested scopes together is one way to handle it; module scopes and class scopes not nested in functions can be fully independent.

build it once, then have sub-queries that only return a slice of that.

This could work, too.

When we have import foo, then there's an import of the foo module, regardless if that module exists or not. That's why I think that a module's identity is really defined by its name.

This is a good point. I think it's a solid enough reason for the current naming. We will need to be able to track dependencies on nonexistent modules.

This is what the new implementation does.

"tracking narrowed local types for local symbols as we go" is the part that would specifically replace the CFG; I don't think the implementation here does that yet.

it would make me sad to see your CFG go away. I think it could be useful for other things than just typing, like an unreachable rule.

The eager version of the same logic would also have the ability to discover unreachable branches.

AlexWaygood (Member) left a comment:

Thanks for the great PR writeup. Overall this looks good to me, though it looks like it's missing some of my recent changes to module.rs.

I left a bunch of comments below, mostly pretty minor. I think many of them may also apply to the existing red-knot codebase -- I wouldn't yet consider myself an expert in the crate overall -- so please feel free to ignore any that you don't feel are useful. I think @carljm will probably be a much better reviewer for this in general :/

@@ -37,5 +40,6 @@ tracing-tree = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }


AlexWaygood (Member): nit ;)

Suggested change: remove the added blank line.
crates/red_knot/src/salsa_db/semantic.rs (resolved review thread)
pub type GlobalSymbolId = GlobalId<SymbolId>;

#[derive(Debug, Eq, PartialEq)]
pub struct SemanticIndex {
AlexWaygood (Member):
Would be nice to have a docstring for this type as well

Comment on lines +106 to +117
let root_scope_id = SymbolTable::root_scope_id();
let mut indexer = SemanticIndexer {
    db,
    file,
    symbol_table_builder: SymbolTableBuilder::new(),
    flow_graph_builder: FlowGraphBuilder::new(),
    scopes: vec![ScopeState {
        scope_id: root_scope_id,
        current_flow_node_id: FlowGraph::start(),
    }],
    current_definition: None,
};
AlexWaygood (Member):

Maybe this should go into a new associated method for SemanticIndexer (or an implementation of the Default trait)?

impl SemanticIndexer {
    fn new(db: &dyn Db, file: File) -> Self {
        Self {
            db,
            file,
            symbol_table_builder: SymbolTableBuilder::new(),
            flow_graph_builder: FlowGraphBuilder::new(),
            scopes: vec![ScopeState {
                scope_id: SymbolTable::root_scope_id(),
                current_flow_node_id: FlowGraph::start(),
            }],
            current_definition: None,
        }
    }
}

Comment on lines +17 to +22
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub enum FileRevision {
    LastModified(FileTime),
    #[allow(unused)]
    ContentHash(u128),
}
AlexWaygood (Member):

The second variant of this enum is so that we can also detect when the "revision" of a vendored source file changes?

MichaReiser (Member Author):

No, this was mainly to explore how and if we could support file revisions e.g. based on a file's hash rather than the last modified timestamp. But this isn't used right now.

Comment on lines +262 to +269
#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub struct GlobalTypeId<T>
where
    T: LocalTypeId,
{
    scope: TypingScope,
    local_id: T,
}
AlexWaygood (Member):

Some docstrings for these *Id types would be really helpful

Comment on lines +388 to +395
let mut table = SymbolTable {
    scopes_by_id: IndexVec::new(),
    symbols_by_id: IndexVec::new(),
    defs: FxHashMap::default(),
    scopes_by_node: FxHashMap::default(),
    dependencies: Vec::new(),
    expression_scopes: IndexVec::default(),
};
AlexWaygood (Member):

I think if we derived Default on the SymbolTable struct, this could just be

Suggested change
let mut table = SymbolTable {
    scopes_by_id: IndexVec::new(),
    symbols_by_id: IndexVec::new(),
    defs: FxHashMap::default(),
    scopes_by_node: FxHashMap::default(),
    dependencies: Vec::new(),
    expression_scopes: IndexVec::default(),
};
let mut table = SymbolTable::default();

right?

carljm (Contributor):

I intentionally avoided that, IIRC, because I don't want it to be possible (and especially not easy!) to create a SymbolTable without the root scope.

}

impl<'a> ReachableDefinitionsIterator<'a> {
#[allow(unused)]
AlexWaygood (Member):

I think this shouldn't be necessary since the function's pub?

Suggested change
#[allow(unused)]

Comment on lines +57 to +64
#[allow(unused)]
pub fn functions(&self) -> impl Iterator<Item = (FunctionId, &StmtFunctionDef)> {
    self.statements
        .iter_enumerated()
        .filter_map(|(index, stmt)| Some((FunctionId(index), stmt.as_function_def_stmt()?)))
}

#[allow(unused)]
AlexWaygood (Member):

I think these #[allow(unused)] shouldn't be needed because they're pub

Suggested change
#[allow(unused)]
pub fn functions(&self) -> impl Iterator<Item = (FunctionId, &StmtFunctionDef)> {
self.statements
.iter_enumerated()
.filter_map(|(index, stmt)| Some((FunctionId(index), stmt.as_function_def_stmt()?)))
}
#[allow(unused)]
pub fn functions(&self) -> impl Iterator<Item = (FunctionId, &StmtFunctionDef)> {
self.statements
.iter_enumerated()
.filter_map(|(index, stmt)| Some((FunctionId(index), stmt.as_function_def_stmt()?)))
}

Comment on lines +26 to +35
pub struct AstIds {
    expressions: IndexVec<ExpressionId, AstNodeRef<Expr>>,

    /// Maps expressions to their expression id. Uses `NodeKey` because it avoids cloning [`Parsed`].
    expressions_map: FxHashMap<NodeKey, ExpressionId>,

    statements: IndexVec<StatementId, AstNodeRef<Stmt>>,

    statements_map: FxHashMap<NodeKey, StatementId>,
}
AlexWaygood (Member):

Did you consider using something like https://docs.rs/bimap/latest/bimap/ here, instead of having one mapping for ID-to-expression, and another mapping for expression-to-ID (IIUC)?

MichaReiser (Member Author):

I haven't, and I wasn't aware of that data structure. I prefer our implementation because we use an IndexVec for statements and expressions, where a lookup is just an array offset, whereas BiMap would require a hash-map lookup.

AlexWaygood (Member):

> Module Resolver
>
> The module resolver remains mostly unchanged, although I did some renaming.
>
> Module

> My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.

> Yeah, but that would require that ModuleName becomes a salsa ingredient. It might be fine but it makes ModuleName a bit more awkward to use. But if we keep ModuleName a regular struct, then what would you call the module thing that we pass to resolve_module? I might be overthinking this because we can make the argument to resolve_module private and only expose a resolve_module(db: &dyn Db, name: &str) function that internally converts the &str to that Module thing and we can then keep our existing terminology. Maybe that's for the best (it also hides complexity).

I wonder if what's currently called Module could be renamed to ModuleRequest. The user "requests" a module (and the request is represented with a ModuleRequest instance) by importing a module with a certain ModuleName, but they might not actually get a module back, because the module might not actually exist. Unlike the Module type on the main branch, your Module type in crates/red_knot/src/salsa_db/semantic/module.rs doesn't really feel like a Module to me, as you can't query any information about the module directly from the type -- you have to resolve it first, and it feels like the module object is the thing you get given at the end of the resolution process.

@MichaReiser
Member Author

Thanks @AlexWaygood for the feedback. I don't plan to incorporate any of the code changes into this PR because I don't plan on merging. I'll incorporate your changes when working on the specific areas before pulling them into ruff.

Contributor

@carljm carljm left a comment

I looked over all the code. This was a lot of work to translate all this, thanks for doing this! I don't see anything here that I think can't work in the new approach. I think overall on the semantic side this PR now has kind of a mish-mash of the old approach (per expression laziness) and the new approach (per scope typing) that is probably more complex and less efficient than we could achieve, so I expect that over the next few weeks we'll want to re-work and simplify a fair bit of it. But it makes sense to land something working with Salsa and iterate from there.

@@ -4,7 +4,7 @@ resolver = "2"

[workspace.package]
edition = "2021"
-rust-version = "1.74"
+rust-version = "1.73"
Contributor

Why are we dropping our rust version in this PR? Did you add a dependency here that doesn't work with 1.74?

hashbrown = { workspace = true }
indexmap = { workspace = true }
notify = { workspace = true }
parking_lot = { workspace = true }
rayon = { workspace = true }
rustc-hash = { workspace = true }
salsa = { git = "https://github.com/salsa-rs/salsa.git", package = "salsa-2022", rev = "05b4e3ebdcdc47730cdd359e7e97fb2470527279" }
Contributor

Does this incorporate any of Niko's newest work on "v3" yet? Or are those changes we'll have to adapt to yet in the future?

Member Author

Not yet, v3 is only a PR at this point. I scanned through the code and v3 is fairly close to v2022, so we're using that for now. But yes, we'll probably have to adapt some code.


impl salsa::Database for Database {
fn salsa_event(&self, event: Event) {
if matches!(event.kind, EventKind::WillCheckCancellation) {
Contributor

What does this event mean?

Member Author

It's possible to create multiple snapshots of the database that then each can run in isolation (they still share the underlying caches). This is useful when using salsa in a multithreaded context.

Now, Salsa cancels any pending snapshots (other threads) when you want to make changes to the database. The way this works is that each query tests whether cancellation was requested and, if so, panics with a specific error. The WillCheckCancellation event indicates that Salsa is about to perform that cancellation check.

I removed the log because it is very noisy. I think I often saw 2-3 of these logs per query. Maybe something that can be optimized later to reduce it to just one. Removing it made the log a bit denser and easier to read through.
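
As a concept sketch only (this is not Salsa's actual API; Cancelled, Db, and run_cancellable are made-up names), the mechanism described above can be illustrated with an atomic flag, a sentinel panic, and a catch at the query boundary:

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Sentinel payload for the cancellation panic (Salsa uses its own internal type).
struct Cancelled;

// Hypothetical database holding the cancellation flag.
struct Db {
    cancellation_requested: AtomicBool,
}

impl Db {
    // Queries call this at various points; each call is roughly what the
    // WillCheckCancellation event reports.
    fn unwind_if_cancelled(&self) {
        if self.cancellation_requested.load(Ordering::SeqCst) {
            panic::panic_any(Cancelled);
        }
    }

    fn some_query(&self) -> u32 {
        self.unwind_if_cancelled();
        42
    }
}

// Catch the sentinel panic at the outermost boundary and report cancellation.
fn run_cancellable(db: &Db) -> Option<u32> {
    match panic::catch_unwind(panic::AssertUnwindSafe(|| db.some_query())) {
        Ok(value) => Some(value),
        Err(payload) if payload.is::<Cancelled>() => None,
        Err(payload) => panic::resume_unwind(payload),
    }
}
```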

#[salsa::tracked(jar=Jar)]
pub fn check_syntax(db: &dyn Db, file: File) -> SyntaxCheck {
// TODO I haven't looked into how many rules are pure syntax checks.
// It may be necessary to at least give access to a simplified semantic model.
Contributor

I'm not sure why we would bother with a simplified semantic model (unless it's extremely simple). It seems better to just give the rules that need semantic information access to the full semantic model, and avoid inconsistency.

Member Author

Neither am I, but it probably also depends on what we refer to as the semantic model. Is it any information that isn't part of the AST? If so, maybe exposing the parent expression or statement is something that we can support even for syntax rules. But yeah, I don't know if it's worth it. I think this comment was copied over from the existing implementation.

let typing_scope = TypingScope::for_symbol(db, symbol);
let types = infer_types(db, typing_scope);

types.symbol_ty(symbol.local())
Contributor

Should we be consistent about using _ty vs _type in APIs?

Member Author

Probably. ty is somewhat common in the Rust ecosystem and has the advantage that it isn't a keyword (it also works for variables).

}
}

/// Infers the type of a location definition.
Contributor

Not sure what a "location definition" is?

Comment on lines +577 to +579
// The fact that the interner is local to a body means that we can't reuse the same union type
// across different call sites. But that's something we aren't doing yet anyway. Our interner doesn't
// deduplicate union types that are identical.
Contributor

We do need a place to add this deduplication (as well as the flattening/simplification that I already added in PRs since you translated this to Salsa); it's not clear to me where in this structure that should happen.

Member Author

Yeah, agreed. I think we probably want methods on the TypeInferenceBuilder, because we only need to track e.g. the reverse map from already-created unions back to their type ids during construction; we won't need it once type inference is complete (and we won't create any new types).
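
A minimal sketch of that idea, with hypothetical names (TypeId, intern_union): the builder owns a reverse map from normalized element lists to union ids while inference runs, and finish drops it.

```rust
use std::collections::HashMap;

// Hypothetical interned type id for illustration.
#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug, PartialOrd, Ord)]
struct TypeId(u32);

#[derive(Default)]
struct TypeInferenceBuilder {
    // Interned unions, addressed by index.
    unions: Vec<Vec<TypeId>>,
    // Reverse map for deduplication; only needed during construction.
    union_ids: HashMap<Vec<TypeId>, usize>,
}

impl TypeInferenceBuilder {
    fn intern_union(&mut self, mut elements: Vec<TypeId>) -> usize {
        // Normalize so identical unions compare equal regardless of order.
        elements.sort();
        elements.dedup();
        if let Some(&id) = self.union_ids.get(&elements) {
            return id;
        }
        let id = self.unions.len();
        self.unions.push(elements.clone());
        self.union_ids.insert(elements, id);
        id
    }

    // Once inference is complete, drop the reverse map and keep only the unions.
    fn finish(self) -> Vec<Vec<TypeId>> {
        self.unions
    }
}
```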

}
}

enum DefinitionType {
Contributor

I'm not sure what a DefinitionType is supposed to represent that is different from a Type, or why it needs to exist at all. It seems like all it does is intern unions? (And with narrowing it will probably have to intern intersections, too.) But infer_definitions already has a TypeInference -- why can't it do the interning itself?

Member Author

The enum is a borrow-checker hack. infer_definition takes &self as an argument, so it can't intern a new union type.

The fact that it is a read-only reference is important in finish, where we iterate over self.symbol_table.

        for symbol in self.symbol_table.symbol_ids_for_scope(self.enclosing_scope) {
            let definition_type = self.typing_context().infer_definitions(
                symbol_table
                    .definitions(symbol)
                    .iter()
                    .map(|definition| ReachableDefinition::Definition(*definition)),
                GlobalId::new(self.file, self.enclosing_scope),
            );

            public_symbol_types.insert(symbol, definition_type.into_type(&mut self.result));
        }

Taking a &mut self wouldn't compile because Rust couldn't prove that symbol_table doesn't get mutated (a method taking &mut self can mutate any field). By explicitly passing &mut self.result to into_type, Rust can prove that self.symbol_table is never borrowed mutably.
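
A self-contained illustration of that borrow-splitting trick (Builder, DefinitionType, and into_type are simplified stand-ins, not the real red-knot code):

```rust
struct Builder {
    symbol_table: Vec<String>,
    result: Vec<String>,
}

enum DefinitionType {
    Single(String),
    Union(Vec<String>),
}

impl DefinitionType {
    // Borrows only the one field it mutates, not the whole builder.
    fn into_type(self, result: &mut Vec<String>) -> String {
        match self {
            DefinitionType::Single(ty) => ty,
            DefinitionType::Union(elements) => {
                let ty = elements.join(" | ");
                result.push(ty.clone());
                ty
            }
        }
    }
}

impl Builder {
    fn finish(&mut self) -> Vec<String> {
        let mut public_types = Vec::new();
        // The loop holds a shared borrow of `self.symbol_table`...
        for symbol in &self.symbol_table {
            let definition =
                DefinitionType::Union(vec![symbol.clone(), "None".to_string()]);
            // ...while `&mut self.result` mutably borrows a *different* field,
            // which the borrow checker accepts. If `into_type` instead took
            // `&mut Builder`, this would not compile.
            public_types.push(definition.into_type(&mut self.result));
        }
        public_types
    }
}
```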

Comment on lines +86 to +89
// TODO: This is going to be somewhat slow because we need to map the AST node to the expression id for
// every expression in the body. That's a lot of hash map lookups.
// We can't use an `IndexVec` here because a) expression ids are per module and b) the type inference
// builder visits the expressions in evaluation order and not in pre-order.
Contributor

The location of this comment seems odd; it's not clear what code it is referring to.

Comment on lines +158 to +161
ImportDefinition {
import: import_id,
name: u32::try_from(i).unwrap(),
}
Contributor

It's strange to me that we build definitions in SemanticIndexer, but now we're rebuilding definitions from scratch here as well. This seems like duplication we probably don't want.

Member Author

Is your concern just about the ImportDefinition creation that is used as a key? Because there's a difference: here we associate a definition with its type.

I don't think we can avoid this much without having a way to iterate over the AST and definitions at the same time.

Labels
red-knot Multi-file analysis & type inference

3 participants