Skip to content

Conversation

@nickrolfe
Copy link
Contributor

@nickrolfe nickrolfe commented Oct 10, 2025

This adds overlay support to the extractor, along with corresponding QL discard predicates. I have not attempted to add overlay[local] annotations to any libraries, meaning that the 'overlay frontier' is currently at the dbscheme.

Some additional notes:

  • Although the extractor already supported path transformers, it needed a little tweaking to work correctly (and to not be overridden in certain cases).
  • There are a few (pretty rare) cases where analysis of an overlay database will not perfectly match behaviour of full analysis, since we discard anything that is @locatable, but some entities that don't have locations (like types) don't get discarded.
  • There's a bit of a hack to always re-extract cgo-processed files (and, to match that, always to discard entities in such files), since, as far as I can tell, there's no way for the extractor to know which original source file a cgo-processed file came from.

Commit-by-commit review is recommended.

@github-actions github-actions bot added the Go label Oct 10, 2025
@nickrolfe nickrolfe force-pushed the nickrolfe/go-extractor-overlay branch from ffc7fee to 199b8fc Compare October 23, 2025 15:03
@nickrolfe nickrolfe marked this pull request as ready for review October 24, 2025 10:19
@nickrolfe nickrolfe requested a review from a team as a code owner October 24, 2025 10:19
Copilot AI review requested due to automatic review settings October 24, 2025 10:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds foundational support for database overlay functionality in the Go extractor, enabling incremental analysis by reusing a base database and overlaying only the changes from modified files.

Key changes:

  • Introduces new database schema relations (databaseMetadata and overlayChangedFiles) to support overlay metadata
  • Adds logic to skip extraction of unchanged files when building an overlay database
  • Updates path handling to use transformed paths consistently across extraction and trap file generation

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
go/ql/lib/upgrades/b3da71c3ac204b557c86e9d9c26012360bdbdccb/upgrade.properties Defines upgrade metadata for adding overlay support relations
go/ql/lib/upgrades/b3da71c3ac204b557c86e9d9c26012360bdbdccb/old.dbscheme Baseline schema snapshot before overlay changes
go/ql/lib/upgrades/b3da71c3ac204b557c86e9d9c26012360bdbdccb/go.dbscheme Updated schema with overlay support relations
go/ql/lib/semmle/go/Overlay.qll Defines entity discard predicates for overlay analysis
go/ql/lib/semmle/go/Locations.qll Imports overlay module for location handling
go/ql/lib/qlpack.yml Enables overlay evaluation compilation
go/ql/lib/go.dbscheme Adds databaseMetadata and overlayChangedFiles relations
go/extractor/util/overlays.go Implements overlay detection and changed file handling
go/extractor/util/BUILD.bazel Adds overlays.go to build sources
go/extractor/trap/labels.go Updates file label generation to use transformed paths
go/extractor/srcarchive/srcarchive.go Refactors path transformer initialization
go/extractor/srcarchive/projectlayout.go Adds environment-based project layout loader and exports fields
go/extractor/srcarchive/BUILD.bazel Adds util dependency
go/extractor/gomodextractor.go Skips extraction of unchanged go.mod files in overlay mode
go/extractor/extractor.go Integrates overlay change tracking and skips unchanged file extraction
go/extractor/dbscheme/tables.go Adds overlay relations to schema definition
go/extractor/cli/go-extractor/go-extractor.go Adds --source-root argument parsing
go/extractor/cli/go-autobuilder/go-autobuilder.go Passes source root to extractor and writes overlay metadata
go/extractor/cli/go-autobuilder/BUILD.bazel Adds srcarchive dependency
go/downgrades/b1341734d6870b105e5c9d168ce7dec25d7f72d0/upgrade.properties Defines downgrade operation for overlay relations
go/downgrades/b1341734d6870b105e5c9d168ce7dec25d7f72d0/old.dbscheme Schema with overlay relations for downgrade reference
go/downgrades/b1341734d6870b105e5c9d168ce7dec25d7f72d0/go.dbscheme Target schema without overlay relations
go/codeql-extractor.yml Declares overlay support version

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

if err != nil {
return nil, err
}
pathTransformer, err = LoadProjectLayout(ptf)
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pathTransformer variable being assigned here is a package-level variable, but in the context of this function, a local variable should be used instead. The function returns a *ProjectLayout, so the assignment should be to a local variable that is then returned, not to the package-level pathTransformer.

Suggested change
pathTransformer, err = LoadProjectLayout(ptf)
pathTransformer, err := LoadProjectLayout(ptf)

Copilot uses AI. Check for mistakes.
// extract a file if it's not in the package directory.
if extraction.OverlayChanges != nil &&
!extraction.OverlayChanges[path] &&
strings.HasPrefix(path+string(filepath.Separator), pkg.Dir) {
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path prefix check is incorrect. It should use pkg.Dir+string(filepath.Separator) as the prefix to check, not append the separator to path. The current logic checks if path concatenated with a separator starts with pkg.Dir, which is backwards.

Suggested change
strings.HasPrefix(path+string(filepath.Separator), pkg.Dir) {
strings.HasPrefix(path, pkg.Dir+string(filepath.Separator)) {

Copilot uses AI. Check for mistakes.
@owen-mc
Copy link
Contributor

owen-mc commented Oct 28, 2025

Are you aware of slog? It is a structured logging library in the go standard library. It was added to the standrad library relatively recently. Ideally we'd want to move the go extractor over to it, because it allows you to have different logging levels. I believe it was used in the recently-created file go/extractor/util/registryproxy.go. You might want to consider using it to allow debug-level logging, while this code is fresh in your mind. But there is no need to use it - I mention it in case you would find it useful.

@owen-mc
Copy link
Contributor

owen-mc commented Oct 29, 2025

Another comment: is it worth adding some tests for any of the new functionality, like respecting the existing path transformer env var? I see we already have go/extractor/srcarchive/projectlayout_test.go - you could add some tests for LoadProjectLayoutFromEnv there.

@owen-mc
Copy link
Contributor

owen-mc commented Oct 30, 2025

I've looked into whether everything in the discard predicate has a star ID. There are two issues:

  1. @xmllocatable from the xml extractor.
  2. @file, which gets a global ID.

Have you looked into @xmllocatable before for other languages? And what is the issue with @file? In the most common case we will just extract it again, but in the case of deleted files I wonder if we want to discard the db row for that file or if we want to keep it around as other things in the base db will refer to it?

@nickrolfe
Copy link
Contributor Author

I've looked into whether everything in the discard predicate has a star ID. There are two issues:

  1. @xmllocatable from the xml extractor.
  2. @file, which gets a global ID.

Have you looked into @xmllocatable before for other languages? And what is the issue with @file? In the most common case we will just extract it again, but in the case of deleted files I wonder if we want to discard the db row for that file or if we want to keep it around as other things in the base db will refer to it?

That's a good point — I see that Go uses the HTML extractor, which I assume is similar to, and uses the same tables as, the XML extractor. We've already handled @xmllocatable for Java, so I should be able to copy the discard predicates for that.

Regarding @file, I think the situation is fine. Since the compiler/evaluator compute the set (base(x) minus discard(x)) union overlay(x), discarding the @file entity will do the right thing:

  • If the file's been deleted in the overlay, there won't be a definition of the @file entity in the overlay db, so the set will not contain that file.
  • If the file's been modified in the overlay, there will be a definition of the @file in the overlay db. So, even though the @file entity is selected by the discard predicate and therefore not in the LHS of the union, it will still be in the RHS of the union.

@jbj
Copy link
Contributor

jbj commented Nov 4, 2025

I'm not sure @file is fine. The argument about (base(x) minus discard(x)) union overlay(x) is valid for a single-column predicate, but the problem appears with multiple columns: if there's a local predicate p(file, other) where other should not be discarded, the whole tuple will be discarded if file is discarded. The tuple will only be recovered in union overlay(x) if the overlay produces it. The predicate p here could be in the dbscheme or in any local QL code, now or in the future.

As a concrete example, imagine a local predicate imports(@file importer, @file importee). If an importee is modified, all import relationships into it would be discarded from the base. The only ones that are recovered by the union are the ones where the importer happens to be in the overlay too.

For Java, entities with named ids are only discarded when they are fully deleted. I think something similar should happen here, or alternatively don't attempt to discard @file at all. It's worth taking a look at whether we're treating @file the same in all languages (Java, Python, Ruby), but I haven't done that.

@nickrolfe
Copy link
Contributor Author

In practice, I don't think the Go extractor will ever produce a @file ref for a file other than the one currently being extracted.

I don't think it would hurt to exclude @file in the discard predicate as a precaution, however.

@jbj
Copy link
Contributor

jbj commented Nov 5, 2025

In practice, I don't think the Go extractor will ever produce a @file ref for a file other than the one currently being extracted.

Yes, so I think all is good as long as the local-to-global frontier is at the dbscheme level like it is now. In the future, when more overlay[local] annotations are added, discarding can potentially happen at any local predicate, and therefore we should already now account for hypothetical local predicates like my imports example above.

@owen-mc
Copy link
Contributor

owen-mc commented Nov 5, 2025

To clarify my point about @xmllocatable - I didn't look into it any further, because I assume you will have done that for other languages. So it may be that they are all star IDs.

When in overlay mode, extractFile will exit early if the file isn't in
the list of files that changed since the base was extracted.
This ensures the extractor can resolve the relative paths for files
changed in the overlay.
This is adapted from the implementation for Java.

Since the HTML/XML extractor is not (yet) incremental, it will extract
files that were not in the diff. These discard predicates are intended
to cope with that, while also being robust against a future version
where the extractor *is* overlay-aware.
...since they are shared between base and overlay
@nickrolfe nickrolfe force-pushed the nickrolfe/go-extractor-overlay branch from 199b8fc to e32a5ca Compare November 7, 2025 16:54
@nickrolfe nickrolfe requested a review from a team as a code owner November 7, 2025 16:54
@nickrolfe nickrolfe force-pushed the nickrolfe/go-extractor-overlay branch from d7525cf to fa18d0f Compare November 11, 2025 15:46
@nickrolfe nickrolfe force-pushed the nickrolfe/go-extractor-overlay branch from fa18d0f to e5ba414 Compare November 11, 2025 15:48
@nickrolfe
Copy link
Contributor Author

I think this is now ready for re-review. A DCA experiment shows that tuple-count accuracy in overlay mode is still good. I'll run a regular (non-overlay) experiment just to re-check the effects of the compileForOverlayEval flag (since this affects the compiler/optimiser output, and could potentially cause a slowdown).

@nickrolfe nickrolfe merged commit 86465b3 into main Nov 12, 2025
23 checks passed
@nickrolfe nickrolfe deleted the nickrolfe/go-extractor-overlay branch November 12, 2025 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants