
Salsa based red-knot prototype #11338

Draft · wants to merge 9 commits into main
Conversation

MichaReiser (Member) commented May 8, 2024

This is a prototype that uses Salsa for our red-knot prototype 😆

The PR implements cross-module type inference invalidation based on Salsa. What makes this hard is that

  1. Salsa query arguments need to be ingredients. That means we can no longer pass arbitrary arguments to queries.
  2. Salsa skips invalidation if the result of a query compares equal to the value it returned previously. We need to remodel our query return values to make the best use of that and avoid, for example, that all types are invalidated by a single whitespace change (the values we return from type queries should be location-independent).

I'll go through the important data models jar by jar.

Source

The source jar gives access to files, the text of a file, and a file's AST.

File

#[salsa::input(jar=Jar)]
pub struct File {
    #[return_ref]
    pub path: PathBuf,
    pub permissions: u32,
    pub revision: FileRevision,
    pub status: FileStatus,
    #[allow(unused)]
    count: Count<File>,
}

The File stores basic metadata about a file but not the file's content. This is mainly because of persistent caching: restoring the database from disk requires that we restore all files. If the source were stored on the file, we would have to read the content of every file, which would be very expensive (we want reading the source to happen lazily). That's why the file only stores basic metadata.

Note: We may decide long-term to have a configuration option that lets users select whether they want to use mtime or the file's hash for change detection. In that case, I think we would have a source: Option<String> on File so that the source_text query avoids re-reading the file from disk.

Files are salsa inputs. Salsa doesn't know how to compute files. Instead, we need to tell salsa which files exist and when they change. That's why files are resolved using db.file(path) where we perform our own mapping from Path -> File (Salsa inputs have no identity other than their instance).
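The Path -> File mapping can be pictured with a small standalone sketch (hypothetical code, not the PR's actual implementation): the database owns an interning table so that asking for the same path twice yields the same File identity.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

// Stand-in for the Salsa input id; inputs have no identity beyond their instance,
// so the database must keep its own Path -> File map.
#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug)]
struct File(u32);

#[derive(Default)]
struct Files {
    by_path: HashMap<PathBuf, File>,
}

impl Files {
    // Returns the existing `File` for `path`, or registers a new one.
    // Calling this twice with the same path must yield the same `File`.
    fn file(&mut self, path: &Path) -> File {
        let next = File(self.by_path.len() as u32);
        *self.by_path.entry(path.to_path_buf()).or_insert(next)
    }
}

fn main() {
    let mut files = Files::default();
    let a1 = files.file(Path::new("src/main.py"));
    let a2 = files.file(Path::new("src/main.py"));
    let b = files.file(Path::new("src/foo.py"));
    assert_eq!(a1, a2); // same path -> same File identity
    assert_ne!(a1, b);
    println!("{a1:?} {b:?}");
}
```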

SourceText

The source_text(file: File) -> SourceText query allows retrieving a file's source text. The source text isn't very exciting. It just stores the file's content.

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct SourceText {
    pub text: Arc<str>,
    count: Count<SourceText>,
}

Some notes about the implementation:

  • The query calls file.revision() (equal to the file's mtime) to inform Salsa that the query should rerun whenever the file is modified. It actually doesn't need the value.
  • It might happen that the file has been deleted between calling db.file(path) and source_text(db, file). In that case, we just assume that the file is empty. That's the best we can do without dealing with awkward results in all caller paths.
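The deleted-file fallback described above can be sketched without any Salsa machinery (source_text here is a plain function, not the tracked query): if the file vanished between discovery and reading, treat it as empty rather than surfacing an error to every caller.

```rust
use std::fs;
use std::path::Path;
use std::sync::Arc;

// Simplified sketch: read the file's text, falling back to an empty string
// when the file has been deleted (or is otherwise unreadable) in the
// window between `db.file(path)` and the read.
fn source_text(path: &Path) -> Arc<str> {
    match fs::read_to_string(path) {
        Ok(text) => Arc::from(text.as_str()),
        Err(_) => Arc::from(""), // deleted or unreadable: pretend it's empty
    }
}

fn main() {
    let text = source_text(Path::new("definitely/missing/file.py"));
    assert_eq!(&*text, "");
    println!("len = {}", text.len());
}
```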

parse

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar, no_eq)]
pub fn parse(db: &dyn Db, file: File) -> Arc<Parsed<ModModule>> {
    let source = file.source(db);
    let result = ruff_python_parser::parse_unchecked_source(source.text(), PySourceType::Python);
    Arc::new(result)
}

The parse query is almost boring. It retrieves the file text and calls the parser. We opt out of Salsa's eq optimization because the parse tree is guaranteed to change whenever the source text changes (and our AST doesn't implement Eq because of floats).

Semantic

This is where it gets interesting.

AstIds

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar, no_eq, return_ref)]
pub fn ast_ids(db: &dyn Db, file: File) -> AstIds {
    let parsed = parse(db.upcast(), file);
    AstIds::from_parsed(&parsed)
}

AstIds are a location-independent representation that allows mapping from Id -> AstNode and from AstNode -> Id. The implementation tries to assign stable IDs by first giving IDs to the module-level statements and expressions, and only then traversing into the function or class level.

a = 10 # statement-id: 0

def test(a): # statement-id: 1
	if a: # statement-id: 4
		pass # statement-id 5

print(a) # statement-id: 2

class Test: # statement-id: 3
	pass # statement-id: 6

This way, the IDs of top-level statements remain unchanged when changes are made only inside a function's body. Having stable top-level IDs is important because they are referenced from other modules.
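The numbering scheme above can be sketched as a two-pass traversal (an illustrative toy, not the PR's AstIds implementation): fully number the statements of one level before descending into any nested body, so edits inside a body never shift the IDs of the level above.

```rust
// Toy statement: just a label and an optional nested body.
#[derive(Debug)]
struct Stmt {
    name: &'static str,
    body: Vec<Stmt>,
}

// Pass 1 numbers every statement of the current level; pass 2 only then
// recurses into nested bodies, sharing the same counter.
fn assign_ids(stmts: &[Stmt], next: &mut u32, out: &mut Vec<(&'static str, u32)>) {
    for stmt in stmts {
        out.push((stmt.name, *next));
        *next += 1;
    }
    for stmt in stmts {
        assign_ids(&stmt.body, next, out);
    }
}

fn main() {
    // Mirrors the Python example: a, def test (if -> pass), print, class Test (pass).
    let module = vec![
        Stmt { name: "a = 10", body: vec![] },
        Stmt {
            name: "def test",
            body: vec![Stmt { name: "if a", body: vec![Stmt { name: "pass", body: vec![] }] }],
        },
        Stmt { name: "print(a)", body: vec![] },
        Stmt { name: "class Test", body: vec![Stmt { name: "pass (class)", body: vec![] }] },
    ];
    let mut next = 0;
    let mut out = Vec::new();
    assign_ids(&module, &mut next, &mut out);
    assert_eq!(out[1], ("def test", 1)); // top-level IDs assigned first
    assert_eq!(out[4], ("if a", 4)); // nested statements numbered afterwards
    println!("{out:?}");
}
```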

symbols, cfg

semantic_index

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar, return_ref)]
pub fn semantic_index(db: &dyn Db, file: File) -> SemanticIndex {
    let root_scope_id = SymbolTable::root_scope_id();
    let mut indexer = SemanticIndexer {
        db,
        file,
        symbol_table_builder: SymbolTableBuilder::new(),
        flow_graph_builder: FlowGraphBuilder::new(),
        scopes: vec![ScopeState {
            scope_id: root_scope_id,
            current_flow_node_id: FlowGraph::start(),
        }],
        current_definition: None,
    };

    let parsed = parse(db.upcast(), file);
    indexer.visit_body(&parsed.syntax().body);
    indexer.finish()
}

The semantic_index query computes a single file's symbol table and control flow graph. It shouldn't be used directly because the semantic_index changes every time the AST changes.

symbol_table

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar)]
pub fn symbol_table(db: &dyn Db, file: File) -> Arc<SymbolTable> {
    semantic_index(db, file).symbol_table.clone()
}

The query itself just calls into semantic_index. The trick here is that the symbol table itself doesn't contain any data that references the AST. Instead, all data uses AstIds. What this query enables is that Salsa can avoid running queries that depend on symbol_table if the constructed symbol table hasn't changed. For example, a comment-only change doesn't invalidate the symbol table.
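Salsa's equal-result optimization can be modeled with a toy memo table (an illustration of the principle, not Salsa's actual internals): when a recomputation produces an equal value, the memo keeps its old "changed at" revision, so dependent queries see the value as unchanged and skip rerunning.

```rust
// A memoized query result with the revision at which its value last changed.
struct Memo<T: PartialEq> {
    value: T,
    changed_at: u64,
}

impl<T: PartialEq> Memo<T> {
    // Store a value recomputed at `revision`; keep the old `changed_at`
    // if the new value compares equal (so dependents are not invalidated).
    fn update(&mut self, new_value: T, revision: u64) {
        if self.value != new_value {
            self.value = new_value;
            self.changed_at = revision;
        }
    }
}

fn main() {
    // Revision 1: symbol table contains only `x`.
    let mut memo = Memo { value: vec!["x"], changed_at: 1 };
    // Revision 2: a comment-only edit; the table recomputes to an equal value.
    memo.update(vec!["x"], 2);
    assert_eq!(memo.changed_at, 1); // dependents see "unchanged since revision 1"
    // Revision 3: a new symbol appears; now dependents must rerun.
    memo.update(vec!["x", "y"], 3);
    assert_eq!(memo.changed_at, 3);
    println!("changed_at = {}", memo.changed_at);
}
```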

flow_graph

#[allow(unused)]
#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar)]
pub fn flow_graph(db: &dyn Db, file: File) -> Arc<FlowGraph> {
    semantic_index(db, file).flow_graph.clone()
}

We apply the same trick for the control flow graph.

Typing

typing_scopes

Typing is where the code changes the most. The existing implementation does type inference per expression. I don't think that per-expression type inference will be fast in Salsa because storing a query result has some overhead. Salsa also limits each query to at most u32::MAX memoized results. I think large projects could reach that limit, especially when the server runs for a long time.

That's why this PR changes inference to happen per TypingScope instead. For now, a typing scope is either a Module, Function, or Class. So this PR infers all types per module, class, or function (but the module doesn't traverse into function or class bodies).

The reason we don't perform type inference for the whole module at once is to get more fine-grained dependency tracking across files. Type checking of a dependent file only needs to rerun if the types of the scope where the symbol is defined have changed. If the types remain unchanged (for example, because the public interface didn't change), then type checking of dependents doesn't need to rerun.

The first step to make this possible is to create a FunctionTypingScope and a ClassTypingScope for every Function and Class in the file and store them in Salsa so they can be used as query arguments.

#[salsa::tracked(jar=Jar, return_ref)]
pub(crate) fn typing_scopes(db: &dyn Db, file: File) -> TypingScopes {
    let ast_ids = ast_ids(db, file);

    let functions = ast_ids
        .functions()
        .map(|(id, _)| (id, FunctionTypingScope::new(db, file, id)))
        .collect();

    let classes = ast_ids
        .classes()
        .map(|(id, _)| (id, ClassTypingScope::new(db, file, id)))
        .collect();

    TypingScopes { functions, classes }
}

infer_*_body

The other important queries are infer_module_body, infer_function_body, and infer_class_body. They perform type inference for a single module, function or class, but without traversing into nested classes or functions.

#[salsa::tracked(jar=Jar, return_ref)]
pub fn infer_function_body(db: &dyn Db, scope: FunctionTypingScope) -> TypeInference {
    let function = scope.node(db);
    let mut builder = TypeInferenceBuilder::new(db, scope.into());
    builder.lower_function_body(&function);
    builder.finish()
}

#[salsa::tracked(jar=Jar, return_ref)]
pub fn infer_class_body(db: &dyn Db, scope: ClassTypingScope) -> TypeInference {
    let class = scope.node(db);
    let mut builder = TypeInferenceBuilder::new(db, scope.into());
    builder.lower_class_body(&class);
    builder.finish()
}

#[salsa::tracked(jar=Jar, return_ref)]
pub fn infer_module_body(db: &dyn Db, file: File) -> TypeInference {
    let parsed = parse(db.upcast(), file);
    let mut builder = TypeInferenceBuilder::new(db, file.into());
    builder.lower_module(&parsed.syntax());
    dbg!(builder.finish())
}

Doing type checking per scope introduces some complexity. Resolving the type data for a TypeId requires knowing not just the file from which the data must be read, but also the typing scope. There's one extra wrinkle: there are cases where we want to resolve the type for a type_id while we are still building up that typing table. I solved this by introducing TypingContext and passing it to TypeId::ty. The TypingContext can carry a local override so that lookups for a specific typing scope are resolved directly without calling into the database.

pub struct TypingContext<'a> {
    db: &'a dyn Db,
    local: Option<(TypingScope, &'a TypeInference)>,
}

impl<'a> TypingContext<'a> {
    pub fn local(db: &'a dyn Db, local_scope: TypingScope, types: &'a TypeInference) -> Self {
        Self {
            db,
            local: Some((local_scope, types)),
        }
    }

    pub fn global(db: &'a dyn Db) -> Self {
        Self { db, local: None }
    }

    pub fn db(&self) -> &'a dyn Db {
        self.db
    }

    pub fn types(&self, scope: TypingScope) -> &'a TypeInference {
        if let Some((local_scope, types)) = self.local {
            if local_scope == scope {
                return types;
            }
        }

        infer_types(self.db, scope)
    }
}

Public API

The public API for types should be limited to:

pub fn infer_expression_type(db: &dyn Db, expression_id: GlobalId<ExpressionId>) -> Type {
    let typing_scope = TypingScope::for_expression(db, expression_id);
    let types = infer_types(db, typing_scope);
    types.expression_ty(expression_id.local())
}

#[tracing::instrument(level = "debug", skip(db))]
pub fn resolve_global_symbol(db: &dyn Db, file: File, name: &str) -> Option<GlobalSymbolId> {
    let symbol_table = symbol_table_query(db, file);
    let symbol_id = symbol_table.root_symbol_id_by_name(name)?;
    Some(GlobalSymbolId::new(file, symbol_id))
}

#[tracing::instrument(level = "debug", skip(db))]
pub fn global_symbol_type(db: &dyn Db, symbol: GlobalSymbolId) -> Type {
    let typing_scope = TypingScope::for_symbol(db, symbol);
    let types = infer_types(db, typing_scope);
    types.symbol_ty(symbol.local())
}

#[tracing::instrument(level = "debug", skip(db))]
pub fn global_symbol_type_by_name(db: &dyn Db, module: File, name: &str) -> Option<Type> {
    let symbols = symbol_table_query(db, module);
    let symbol = symbols.root_symbol_id_by_name(name)?;
    Some(global_symbol_type(db, GlobalSymbolId::new(module, symbol)))
}

Module Resolver

The module resolver remains mostly unchanged, although I did some renaming.

Module

I think the naming could be better. Module is mainly a ModuleName but interned into salsa so that it can be used as a query argument.

#[salsa::interned(jar=Jar)]
pub struct Module {
#[return_ref]
name: ModuleName,
}

I didn't want to intern ModuleName directly because I think there are places where we want to use it without the need for having it in Salsa. But maybe that's the wrong call and we should just intern ModuleName directly.

resolve_module

The main query remains resolve_module.

#[tracing::instrument(level = "debug", skip(db))]
#[salsa::tracked(jar=Jar)]
pub fn resolve_module(db: &dyn Db, module: Module) -> Option<ResolvedModule> {
    let name = module.name(db);
    let (root_path, resolved_file, kind) = resolve_module_path(db, name)?;

    let normalized = resolved_file
        .path(db.upcast())
        .canonicalize()
        .map(|path| db.file(path))
        .unwrap_or_else(|_| resolved_file);

    Some(ResolvedModule {
        inner: Arc::new(ResolveModuleInner {
            module,
            kind,
            search_path: root_path,
            file: normalized,
        }),
    })
}

What changed is that it now accepts a Module and returns an Option<ResolvedModule>. Again, I'm open to suggestions for better naming. The idea is that a ResolvedModule represents what a module name resolves to. I'm considering renaming it to ResolvedModulePath because I think that's really what it is.

I think the implementation became much simpler because the module resolver now uses File and File::exists internally. This has the advantage that Salsa will automatically invalidate the resolve_module result if a relevant file gets added or removed.

file_to_module

pub fn file_to_module(db: &dyn Db, file: File) -> Option<ResolvedModule> {

Resolves a file to Some(ResolvedModule) if it is a module, and to None otherwise. This is mostly unchanged.

module_search_paths and set_module_search_paths

#[allow(unused)]
pub fn set_module_search_paths(db: &mut dyn Db, search_paths: Vec<ModuleSearchPath>) {
    if let Some(existing) = ModuleSearchPaths::try_get(db) {
        existing.set_paths(db).to(search_paths);
    } else {
        ModuleSearchPaths::new(db, search_paths);
    }
}

These queries shouldn't exist long term, but they were a "quick" way to allow setting the module search paths without supporting settings. I'll adapt this to @AlexWaygood's most recent changes by having a set_module_resolver_settings short term (with fields for the different lookup paths). The long-term goal is that the module resolver queries the settings and constructs the search paths from them (it probably should remain a query).

github-actions bot commented May 8, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.


codspeed-hq bot commented May 28, 2024

CodSpeed Performance Report

Merging #11338 will not alter performance

Comparing red-knot-salsa (06ee178) with main (2e0a975)

Summary

✅ 30 untouched benchmarks

Review comment on the ResolvedModule tracked struct — MichaReiser (Member Author):
@AlexWaygood this is where I'm currently landing on a Salsa design for a module resolver. I think it would simplify a lot for you because you no longer need to think about invalidation, Salsa will take care of that for you. The only thing necessary for this to work is that you use db.file(path).exists() to test if a file exists.

But check out resolve_module, it's now almost empty!

AlexWaygood (Member):
Ah, thanks for the ping. Yes, this indeed does make the code look a lot cleaner! It was making my head hurt a little bit to see all the cache-checking stuff right alongside the search-path semantics in resolve_module()

Comment on lines +184 to +221
pub fn path_to_module(db: &dyn Db, path: &Path) -> Option<ResolvedModule> {
let file = db.file(path.to_path_buf());
MichaReiser (Member Author):

It's a bit weird that path_to_module converts the path to a file as the very first thing only so that file_to_module then reads the path. However, for file_to_module to be a salsa query, it can only accept an ingredient as an argument and file is an ingredient but path isn't.

MichaReiser force-pushed the red-knot-salsa branch 5 times, most recently from 1843679 to 06ee178, on June 5, 2024.
Commit message: A query result only needs to be a tracked struct if we intend to use it as a query ingredient. It's unclear to me whether this is the case for `ResolvedModule`, that's why I make it a regular struct for now. We can easily make it a tracked struct later on.

Commit message: Tracked structs are only necessary when the struct should be used as an argument to a derived Salsa query. I don't expect that the lint results themselves should be used as query arguments, therefore normal structs do just fine.
MichaReiser changed the title from "Restore Salsa DB for exploring Salsa further" to "Salsa based red-knot prototype" on Jun 7, 2024.
MichaReiser (Member Author) commented Jun 7, 2024

There's one limitation with the current model where the invalidation isn't as good as it could be and it is due to the fact that we build the entire symbol table at once (we don't have to and we could refactor that later).

Let's say we start with

# main.py
import foo

x = foo.x

# foo.py
x = 10

def foo(): 
	pass

And we infer the type of x in main. To do this, the implementation runs

  • It parses main, builds its symbol table and cfg, and calls into infer_module_body
  • It resolves foo when reaching import foo
  • It calls ModuleType.member when reaching foo.x. This fans out to parse foo, build its symbol table and control flow graph, and then runs module-level type inference for foo
  • ...

When we now change the content of foo to

x = 10

def foo(): 
	y = 10

What I expected is that type inference for main wouldn't re-run because the module-level types of foo remain unchanged. However, that's not the case. The reason is that resolving foo.x reads the symbol table of foo.py, and that symbol table has changed because we introduced y in the scope of foo. We have the same problem when a flag of an enclosing symbol changes: for example, if the body of foo.foo is changed to y = x, the symbol table of that module changes because x now has the flag used.

We can avoid this by building the symbol table per scope rather than once globally, or by having a query that reduces the global symbol table to just the global symbols. I do think something like that would be nice to get more fine-grained invalidation.
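The "reduce the global symbol table to just the global symbols" idea can be sketched with hypothetical types (not red-knot's actual structures): callers that only care about module-level names depend on the reduced view, which compares equal when a change is confined to a nested scope — exactly the property Salsa's equal-result optimization needs to cut invalidation.

```rust
use std::collections::BTreeMap;

// Toy model: scope 0 is the module (root) scope; other keys are nested scopes.
#[derive(Clone, PartialEq, Debug)]
struct SymbolTable {
    symbols_by_scope: BTreeMap<u32, Vec<String>>,
}

// The reduced view: only the root-scope symbols, which is all that
// cross-module consumers can observe.
fn root_symbols(table: &SymbolTable) -> Vec<String> {
    table.symbols_by_scope.get(&0).cloned().unwrap_or_default()
}

fn main() {
    let before = SymbolTable {
        symbols_by_scope: BTreeMap::from([(0, vec!["x".into(), "foo".into()]), (1, vec![])]),
    };
    // Adding `y` inside `foo`'s scope changes the full table...
    let mut after = before.clone();
    after.symbols_by_scope.insert(1, vec!["y".into()]);
    assert_ne!(before, after);
    // ...but the module-level view compares equal, so queries that depend only
    // on it would not need to rerun.
    assert_eq!(root_symbols(&before), root_symbols(&after));
    println!("{:?}", root_symbols(&after));
}
```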

MichaReiser added the red-knot label (multi-file analysis & type inference) on Jun 7, 2024.
carljm (Contributor) commented Jun 7, 2024

This is a great write-up, thanks for taking the time! A few thoughts from the write-up, before I dive into the code:

  1. One caveat with building symbol tables per-scope is that the contents of nested functions can affect the symbol table of the enclosing function. (If a nested function uses nonlocal x and assigns to x, the symbol kind for x may change based on the knowledge that a nested function might assign to it.) So we can build symbol tables separately for module scope, class scope, and function scope -- but building a function symbol table needs to build all enclosed symbol tables (which may involve N nested class and function scopes, to arbitrary depth.) I think in practice this is still fine, though, and worth doing the split. Nesting isn't that common, and by definition the symbols inside a function scope aren't ones that another module can depend on, anyway.
  2. My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.
  3. I think if we are doing type inference per-scope rather than per-expression that will probably also recommend changes to how type inference works in the first place. We can probably just walk the AST for the scope, assigning types to expressions as we go, and similarly tracking narrowed local types for local symbols as we go. This will mean we don't even need FlowGraph anymore (a lot of the concepts developed in building it will still carry over, it will just be resolved eagerly instead.)

MichaReiser (Member Author):

One caveat with building symbol tables per-scope is that the contents of nested functions can affect the symbol table of the enclosing function.

The part that's unclear to me is how we would compute flags like USED when building the symbol table lazily per scope, because a symbol might be used in a child scope. But we also have the option to build the entire symbol table at once and then split it per scope (similar to the trick with SymbolIndex: build it once, then have sub-queries that only return a slice of it).

My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.

Yeah, but that would require that ModuleName becomes a salsa ingredient. It might be fine, but it makes ModuleName a bit more awkward to use. But if we keep ModuleName a regular struct, then what would you call the module thing that we pass to resolve_module? I might be overthinking this, because we can make the argument to resolve_module private and only expose a resolve_module(db: &dyn Db, name: &str) function that internally converts the &str to that Module thing, and we can then keep our existing terminology. Maybe that's for the best (it also hides complexity).

My thinking in calling Module { name: ModuleName } a module is that I don't think the existence of the file on disk is what makes a module. When we have import foo, there's an import of the foo module, regardless of whether that module exists or not. That's why I think a module's identity is really defined by its name.

We can probably just walk the AST for the scope, assigning types to expressions as we go, and similarly tracking narrowed local types for local symbols as we go. This will mean we don't even need FlowGraph anymore (a lot of the concepts developed in building it will still carry over, it will just be resolved eagerly instead.)

This is what the new implementation does. But I must say, it would make me sad to see your CFG go away. I think it could be useful for other things than just typing, like an unreachable rule.

carljm (Contributor) commented Jun 8, 2024

The part that's unclear to me is how we would compute flags like USED when building the symbol table lazily per scope, because a symbol might be used in a child scope.

The USED flag only means "used in the current scope", so it's not an issue. The only analysis that crosses scopes is the one Patrick was working on, and that's the case I discussed in my comment; but it only applies to nested scopes within function scopes, so handling a function scope and all its nested scopes together is one way to handle it; module scopes and class scopes not nested in functions can be fully independent.

build it once, then have sub-queries that only return a slice of that.

This could work, too.

When we have import foo, then there's an import of the foo module, regardless if that module exists or not. That's why I think that a module's identity is really defined by its name.

This is a good point. I think it's a solid enough reason for the current naming. We will need to be able to track dependencies on nonexistent modules.

This is what the new implementation does.

"tracking narrowed local types for local symbols as we go" is the part that would specifically replace the CFG; I don't think the implementation here does that yet.

it would make me sad to see your CFG go away. I think it could be useful for other things than just typing, like an unreachable rule.

The eager version of the same logic would also have the ability to discover unreachable branches.

AlexWaygood (Member) left a comment:

Thanks for the great PR writeup. Overall this looks good to me, though it looks like it's missing some of my recent changes to module.rs.

I left a bunch of comments below, mostly pretty minor. I think many of them may also apply to the existing red-knot codebase -- I wouldn't yet consider myself an expert in the crate overall -- so please feel free to ignore any that you don't feel are useful. I think @carljm will probably be a much better reviewer for this in general :/

@@ -37,5 +40,6 @@ tracing-tree = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }


AlexWaygood (Member): nit ;)

Suggested change: remove the added blank line.
crates/red_knot/src/salsa_db/semantic.rs (resolved review thread)
pub type GlobalSymbolId = GlobalId<SymbolId>;

#[derive(Debug, Eq, PartialEq)]
pub struct SemanticIndex {
AlexWaygood (Member):
Would be nice to have a docstring for this type as well

Comment on lines +106 to +117
let root_scope_id = SymbolTable::root_scope_id();
let mut indexer = SemanticIndexer {
    db,
    file,
    symbol_table_builder: SymbolTableBuilder::new(),
    flow_graph_builder: FlowGraphBuilder::new(),
    scopes: vec![ScopeState {
        scope_id: root_scope_id,
        current_flow_node_id: FlowGraph::start(),
    }],
    current_definition: None,
};
AlexWaygood (Member):

Maybe this should go into a new associated method for SemanticIndexer (or an implementation of the Default trait)?

impl SemanticIndexer {
    fn new(db: &dyn Db, file: File) -> Self {
        Self {
            db,
            file,
            symbol_table_builder: SymbolTableBuilder::new(),
            flow_graph_builder: FlowGraphBuilder::new(),
            scopes: vec![ScopeState {
                scope_id: SymbolTable::root_scope_id(),
                current_flow_node_id: FlowGraph::start(),
            }],
            current_definition: None,
        }
    }
}

Comment on lines +17 to +22
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub enum FileRevision {
    LastModified(FileTime),
    #[allow(unused)]
    ContentHash(u128),
}
AlexWaygood (Member):

The second variant of this enum is so that we can also detect when the "revision" of a vendored source file changes?

MichaReiser (Member Author):

No, this was mainly to explore how and if we could support file revisions e.g. based on a file's hash rather than the last modified timestamp. But this isn't used right now.

Comment on lines +262 to +269
#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub struct GlobalTypeId<T>
where
    T: LocalTypeId,
{
    scope: TypingScope,
    local_id: T,
}
AlexWaygood (Member):

Some docstrings for these *Id types would be really helpful

Comment on lines +388 to +395
let mut table = SymbolTable {
    scopes_by_id: IndexVec::new(),
    symbols_by_id: IndexVec::new(),
    defs: FxHashMap::default(),
    scopes_by_node: FxHashMap::default(),
    dependencies: Vec::new(),
    expression_scopes: IndexVec::default(),
};
AlexWaygood (Member):

I think if we derived Default on the SymbolTable struct, this could just be

Suggested change
let mut table = SymbolTable {
    scopes_by_id: IndexVec::new(),
    symbols_by_id: IndexVec::new(),
    defs: FxHashMap::default(),
    scopes_by_node: FxHashMap::default(),
    dependencies: Vec::new(),
    expression_scopes: IndexVec::default(),
};
let mut table = SymbolTable::default();

right?

carljm (Contributor):

I intentionally avoided that, IIRC, because I don't want it to be possible (and especially not easy!) to create a SymbolTable without the root scope.

}

impl<'a> ReachableDefinitionsIterator<'a> {
#[allow(unused)]
AlexWaygood (Member):

I think this shouldn't be necessary since the function's pub?

Suggested change
#[allow(unused)]

Comment on lines +57 to +64
#[allow(unused)]
pub fn functions(&self) -> impl Iterator<Item = (FunctionId, &StmtFunctionDef)> {
    self.statements
        .iter_enumerated()
        .filter_map(|(index, stmt)| Some((FunctionId(index), stmt.as_function_def_stmt()?)))
}

#[allow(unused)]
AlexWaygood (Member):

I think these #[allow(unused)] shouldn't be needed because they're pub

Suggested change
#[allow(unused)]
pub fn functions(&self) -> impl Iterator<Item = (FunctionId, &StmtFunctionDef)> {
self.statements
.iter_enumerated()
.filter_map(|(index, stmt)| Some((FunctionId(index), stmt.as_function_def_stmt()?)))
}
#[allow(unused)]
pub fn functions(&self) -> impl Iterator<Item = (FunctionId, &StmtFunctionDef)> {
self.statements
.iter_enumerated()
.filter_map(|(index, stmt)| Some((FunctionId(index), stmt.as_function_def_stmt()?)))
}

Comment on lines +26 to +35
pub struct AstIds {
    expressions: IndexVec<ExpressionId, AstNodeRef<Expr>>,

    /// Maps expressions to their expression id. Uses `NodeKey` because it avoids cloning [`Parsed`].
    expressions_map: FxHashMap<NodeKey, ExpressionId>,

    statements: IndexVec<StatementId, AstNodeRef<Stmt>>,

    statements_map: FxHashMap<NodeKey, StatementId>,
}
AlexWaygood (Member):

Did you consider using something like https://docs.rs/bimap/latest/bimap/ here, instead of having one mapping for ID-to-expression, and another mapping for expression-to-ID (IIUC)?

MichaReiser (Member Author):

I haven't, and I wasn't aware of that data structure. I prefer our implementation because we use an IndexVec for statements and expressions, where a lookup is just an array offset, whereas BiMap would require a hash-map lookup.

AlexWaygood (Member):

> Module Resolver
>
> The module resolver remains mostly unchanged, although I did some renaming.
>
> Module

> My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.

> Yeah, but that would require that ModuleName becomes a salsa ingredient. It might be fine but it makes ModuleName a bit more awkward to use. But if we keep ModuleName a regular struct, then what would you call the module thing that we pass to resolve_module? I might be overthinking this because we can make the argument to resolve_module private and only expose a resolve_module(db: &dyn Db, name: &str) function that internally converts the &str to that Module thing and we can then keep our existing terminology. Maybe that's for the best (it also hides complexity).

I wonder if what's currently called Module could be renamed to ModuleRequest. The user "requests" a module (and the request is represented with a ModuleRequest instance) by importing a module with a certain ModuleName, but they might not actually get a module back, because the module might not actually exist. Unlike the Module type on the main branch, your Module type in crates/red_knot/src/salsa_db/semantic/module.rs doesn't really feel like a Module to me, as you can't query any information about the module directly from the type -- you have to resolve it first, and it feels like the module object is the thing you get given at the end of the resolution process.

@MichaReiser
Member Author

Thanks @AlexWaygood for the feedback. I don't plan to incorporate any of the code changes into this PR because I don't plan on merging. I'll incorporate your changes when working on the specific areas before pulling them into ruff.

Contributor

@carljm carljm left a comment

I looked over all the code. This was a lot of work to translate all this, thanks for doing this! I don't see anything here that I think can't work in the new approach. I think overall on the semantic side this PR now has kind of a mish-mash of the old approach (per expression laziness) and the new approach (per scope typing) that is probably more complex and less efficient than we could achieve, so I expect that over the next few weeks we'll want to re-work and simplify a fair bit of it. But it makes sense to land something working with Salsa and iterate from there.

@@ -4,7 +4,7 @@ resolver = "2"

[workspace.package]
edition = "2021"
-rust-version = "1.74"
+rust-version = "1.73"
Contributor

Why are we dropping our rust version in this PR? Did you add a dependency here that doesn't work with 1.74?

hashbrown = { workspace = true }
indexmap = { workspace = true }
notify = { workspace = true }
parking_lot = { workspace = true }
rayon = { workspace = true }
rustc-hash = { workspace = true }
salsa = { git = "https://github.com/salsa-rs/salsa.git", package = "salsa-2022", rev = "05b4e3ebdcdc47730cdd359e7e97fb2470527279" }
Contributor

Does this incorporate any of Niko's newest work on "v3" yet? Or are those changes we'll have to adapt to yet in the future?

Member Author

Not yet, v3 is only a PR at this point. I scanned through the code and v3 is fairly close to v2022, so we're using that for now. But yes, we'll probably have to adapt some code.


impl salsa::Database for Database {
fn salsa_event(&self, event: Event) {
if matches!(event.kind, EventKind::WillCheckCancellation) {
Contributor

What does this event mean?

Member Author

It's possible to create multiple snapshots of the database that then each can run in isolation (they still share the underlying caches). This is useful when using salsa in a multithreaded context.

Now, Salsa cancels any pending snapshots (other threads) when you want to make changes to the database. The way this works is that each query tests whether cancellation was requested and, if so, panics with a specific error. The WillCheckCancellation event indicates that Salsa is about to perform that cancellation check.

I removed the log because it is very noisy. I think I often saw 2-3 of these logs per query. Maybe something that can be optimized later to reduce it to just one. Removing it made the log a bit denser and easier to read through.
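
As a concept sketch only (this is not Salsa's actual API; Cancelled, Db, and run_cancellable are made-up names), the mechanism described above can be illustrated with an atomic flag, a sentinel panic, and a catch at the query boundary:

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Sentinel payload for the cancellation panic (Salsa uses its own internal type).
struct Cancelled;

// Hypothetical database holding the cancellation flag.
struct Db {
    cancellation_requested: AtomicBool,
}

impl Db {
    // Queries call this at various points; each call is roughly what the
    // WillCheckCancellation event reports.
    fn unwind_if_cancelled(&self) {
        if self.cancellation_requested.load(Ordering::SeqCst) {
            panic::panic_any(Cancelled);
        }
    }

    fn some_query(&self) -> u32 {
        self.unwind_if_cancelled();
        42
    }
}

// Catch the sentinel panic at the outermost boundary and report cancellation.
fn run_cancellable(db: &Db) -> Option<u32> {
    match panic::catch_unwind(panic::AssertUnwindSafe(|| db.some_query())) {
        Ok(value) => Some(value),
        Err(payload) if payload.is::<Cancelled>() => None,
        Err(payload) => panic::resume_unwind(payload),
    }
}
```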

#[salsa::tracked(jar=Jar)]
pub fn check_syntax(db: &dyn Db, file: File) -> SyntaxCheck {
// TODO I haven't looked into how many rules are pure syntax checks.
// It may be necessary to at least give access to a simplified semantic model.
Contributor

I'm not sure why we would bother with a simplified semantic model (unless it's extremely simple). It seems better to just give the rules that need semantic information access to the full semantic model, and avoid inconsistency.

Member Author

Neither am I, but it probably also depends on what we refer to as the semantic model. Is it any information that isn't part of the AST? If so, maybe exposing the parent expression or statement is something that we can support even for syntax rules. But yeah, I don't know if it's worth it. I think this comment was copied over from the existing implementation.

let typing_scope = TypingScope::for_symbol(db, symbol);
let types = infer_types(db, typing_scope);

types.symbol_ty(symbol.local())
Contributor

Should we be consistent about using _ty vs _type in APIs?

Member Author

Probably. ty is somewhat common in the Rust ecosystem and has the advantage that it isn't a keyword (it also works for variables).

}
}

/// Infers the type of a location definition.
Contributor

Not sure what a "location definition" is?

Comment on lines +577 to +579
// The fact that the interner is local to a body means that we can't reuse the same union type
// across different call sites. But that's something we aren't doing yet anyway. Our interner doesn't
// deduplicate union types that are identical.
Contributor

We do need a place to add this deduplication (as well as the flattening/simplification that I already added in PRs since you translated this to Salsa); it's not clear to me where in this structure that should happen.

Member Author

Yeah, agreed. I think we probably want methods on the TypeInferenceBuilder, because we only need to track e.g. the reverse map from already-created unions back to their type ids during construction; we won't need it once type inference is complete (and we won't create any new types).
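
A minimal sketch of that idea, with hypothetical names (TypeId, intern_union): the builder owns a reverse map from normalized element lists to union ids while inference runs, and finish drops it.

```rust
use std::collections::HashMap;

// Hypothetical interned type id for illustration.
#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug, PartialOrd, Ord)]
struct TypeId(u32);

#[derive(Default)]
struct TypeInferenceBuilder {
    // Interned unions, addressed by index.
    unions: Vec<Vec<TypeId>>,
    // Reverse map for deduplication; only needed during construction.
    union_ids: HashMap<Vec<TypeId>, usize>,
}

impl TypeInferenceBuilder {
    fn intern_union(&mut self, mut elements: Vec<TypeId>) -> usize {
        // Normalize so identical unions compare equal regardless of order.
        elements.sort();
        elements.dedup();
        if let Some(&id) = self.union_ids.get(&elements) {
            return id;
        }
        let id = self.unions.len();
        self.unions.push(elements.clone());
        self.union_ids.insert(elements, id);
        id
    }

    // Once inference is complete, drop the reverse map and keep only the unions.
    fn finish(self) -> Vec<Vec<TypeId>> {
        self.unions
    }
}
```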

}
}

enum DefinitionType {
Contributor

I'm not sure what a DefinitionType is supposed to represent that is different from a Type, or why it needs to exist at all. It seems like all it does is intern unions? (And with narrowing it will probably have to intern intersections, too.) But infer_definitions already has a TypeInference -- why can't it do the interning itself?

Member Author

The enum is a borrow-checker hack. infer_definition takes &self as an argument, so it can't intern a new union type.

The fact that it is a read-only reference is important in finish, where we iterate over self.symbol_table.

        for symbol in self.symbol_table.symbol_ids_for_scope(self.enclosing_scope) {
            let definition_type = self.typing_context().infer_definitions(
                symbol_table
                    .definitions(symbol)
                    .iter()
                    .map(|definition| ReachableDefinition::Definition(*definition)),
                GlobalId::new(self.file, self.enclosing_scope),
            );

            public_symbol_types.insert(symbol, definition_type.into_type(&mut self.result));
        }

Taking a &mut self wouldn't compile because Rust couldn't prove that symbol_table doesn't get mutated (a method taking &mut self can mutate any field). By explicitly passing &mut self.result to into_type, Rust can prove that self.symbol_table is never borrowed mutably.
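
A self-contained illustration of that borrow-splitting trick (Builder, DefinitionType, and into_type are simplified stand-ins, not the real red-knot code):

```rust
struct Builder {
    symbol_table: Vec<String>,
    result: Vec<String>,
}

enum DefinitionType {
    Single(String),
    Union(Vec<String>),
}

impl DefinitionType {
    // Borrows only the one field it mutates, not the whole builder.
    fn into_type(self, result: &mut Vec<String>) -> String {
        match self {
            DefinitionType::Single(ty) => ty,
            DefinitionType::Union(elements) => {
                let ty = elements.join(" | ");
                result.push(ty.clone());
                ty
            }
        }
    }
}

impl Builder {
    fn finish(&mut self) -> Vec<String> {
        let mut public_types = Vec::new();
        // The loop holds a shared borrow of `self.symbol_table`...
        for symbol in &self.symbol_table {
            let definition =
                DefinitionType::Union(vec![symbol.clone(), "None".to_string()]);
            // ...while `&mut self.result` mutably borrows a *different* field,
            // which the borrow checker accepts. If `into_type` instead took
            // `&mut Builder`, this would not compile.
            public_types.push(definition.into_type(&mut self.result));
        }
        public_types
    }
}
```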

Comment on lines +86 to +89
// TODO: This is going to be somewhat slow because we need to map the AST node to the expression id for
// every expression in the body. That's a lot of hash map lookups.
// We can't use an `IndexVec` here because a) expression ids are per module and b) the type inference
// builder visits the expressions in evaluation order and not in pre-order.
Contributor

The location of this comment seems odd; it's not clear what code it is referring to.

Comment on lines +158 to +161
ImportDefinition {
import: import_id,
name: u32::try_from(i).unwrap(),
}
Contributor

It's strange to me that we build definitions in SemanticIndexer, but now we're rebuilding definitions from scratch here as well. This seems like duplication we probably don't want.

Member Author

Is your concern just about the ImportDefinition creation that is used as a key? Because there's a difference: here we associate a definition with its type.

I don't think we can avoid this much without having a way to iterate over the AST and definitions at the same time.

Labels
red-knot Multi-file analysis & type inference

3 participants