Skip to content

Conversation

kaspersv
Copy link
Contributor

@kaspersv kaspersv commented Sep 18, 2025

Java property extraction and XML extraction is currently non-incremental, meaning that all property and XML files are extracted in the overlay. The current Java config and XML discard predicates makes use of this fact and discard all config locatables from base if any config locatables have been extracted in the overlay and likewise for xml locatables. This is sound, but will require an update of the discard predicate in lock-step with an extractor change if we later make Java property or XML extraction incremental. This PR updates the Java config and XML discard predicates to discard based on overlayChangedFiles instead. This is future-proof if we later make Java property or XML extraction incremental.

Commit 1 future-proofs Java config discarding. All Java config entities use * TRAP ids and can therefore be discarded from base if they appear in files extracted in the overlay.

Commit 2 future-proofs Java XML discarding. All XML entities except namespaces use * TRAP ids and can therefore be discarded from base if they appear in files extracted in the overlay. XML namespace entities use TRAP keys of the form @"$URI;xmlnamespace" that are not file-local and cannot safely be discarded from base.

With the current non-incremental extraction, unchanged Java property and XML files will be extracted in both base and overlay and will not be discarded from either. This causes apparent accuracy regressions in DCA experiments (variant vs baseline for java/maven/non-https-url, java/maven/dependency-upon-bintray and java/spring-boot-exposed-actuators-config. I've inspected the java/maven/non-https-url regressions locally and they occur because the query selects DeclaredRepository which extend @xmlelement entities that are extracted in both base and overlay with different ids. This causes an apparent regression at the bqrs-level, but at the SARIF level the duplicate results collapse to a single alert. The DCA alert-comparison which compares SARIF alerts shows no regressions (variant vs. baseline).

@github-actions github-actions bot added the Java label Sep 18, 2025
@kaspersv kaspersv force-pushed the kaspersv/future-proof-java-discarding branch from bb2ef45 to 1cc0c63 Compare September 18, 2025 08:46
@kaspersv kaspersv force-pushed the kaspersv/future-proof-java-discarding branch from 1cc0c63 to bb9e773 Compare September 18, 2025 08:57
@kaspersv kaspersv force-pushed the kaspersv/future-proof-java-discarding branch from bb9e773 to dbb9a26 Compare September 18, 2025 09:37
@kaspersv kaspersv requested a review from alexet September 19, 2025 10:57
@kaspersv kaspersv marked this pull request as ready for review September 19, 2025 10:57
@kaspersv kaspersv requested a review from a team as a code owner September 19, 2025 10:57
@Copilot Copilot AI review requested due to automatic review settings September 19, 2025 10:57
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR future-proofs overlay discarding for Java config and XML by switching from “any locatable exists in overlay” checks to using overlayChangedFiles, making it robust if extraction becomes incremental.

  • getRawFile now supports XML locatables via xmllocations
  • Java config discard now keys off overlayChangedFiles using per-entity raw file resolution
  • XML module introduces DiscardableXml* classes and imports Overlay to support discard behavior

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
java/ql/lib/semmle/code/xml/XML.qll Adds Overlay import and DiscardableXml* classes to enable file-based discard behavior while excluding namespaces.
java/ql/lib/semmle/code/java/Overlay.qll Updates getRawFile to support XML locatables and reworks config discard to be based on overlayChangedFiles with a helper to resolve config raw files.

Comment on lines +74 to +82
private class DiscardableXmlAttribute extends DiscardableLocatable, @xmlattribute { }

private class DiscardableXmlElement extends DiscardableLocatable, @xmlelement { }

private class DiscardableXmlComment extends DiscardableLocatable, @xmlcomment { }

private class DiscardableXmlCharacters extends DiscardableLocatable, @xmlcharacters { }

private class DiscardableXmlDtd extends DiscardableLocatable, @xmldtd { }
Copy link
Preview

Copilot AI Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider adding a brief comment explaining why these XML entities are marked discardable and that XML namespaces are intentionally excluded due to non-file-local TRAP keys. This will help future readers understand the scope and rationale of discard behavior.

Copilot uses AI. Check for mistakes.

Comment on lines 19 to 24
string getRawFile(@locatable el) {
exists(@location loc, @file file |
hasLocation(el, loc) and
(hasLocation(el, loc) or xmllocations(el, loc)) and
locations_default(loc, file, _, _, _, _) and
files(file, result)
)
Copy link
Preview

Copilot AI Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Please add a short doc comment clarifying that getRawFile now handles XML locatables via xmllocations in addition to standard locations, since this broadens the function’s scope and improves discoverability.

Copilot uses AI. Check for mistakes.

Comment on lines +76 to +83
/** Gets the raw file for a configLocatable. */
overlay[local]
private predicate baseConfigLocatable(@configLocatable l) { not isOverlay() and exists(l) }

overlay[local]
private predicate overlayHasConfigLocatables() {
isOverlay() and
exists(@configLocatable el)
}

overlay[discard_entity]
private predicate discardBaseConfigLocatable(@configLocatable el) {
// The properties extractor is currently not incremental, so if
// the overlay contains any config locatables, the overlay should
// contain a full extraction and all config locatables from base
// should be discarded.
baseConfigLocatable(el) and overlayHasConfigLocatables()
}

overlay[local]
private predicate baseXmlLocatable(@xmllocatable l) {
not isOverlay() and not files(l, _) and not xmlNs(l, _, _, _)
private string getRawFileForConfig(@configLocatable el) {
exists(@location loc, @file file |
configLocations(el, loc) and
locations_default(loc, file, _, _, _, _) and
files(file, result)
)
Copy link
Preview

Copilot AI Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider making the doc a bit more explicit, e.g., “Returns the raw file path for a config locatable using configLocations; used to drive overlayChangedFiles-based discarding.” This clarifies how the function is used and why configLocations is the source.

Copilot uses AI. Check for mistakes.

@kaspersv kaspersv added the no-change-note-required This PR does not need a change note label Sep 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants