@tdruez tdruez commented Sep 9, 2025

Background

The documentDescribes field should describe the "root" software artifact(s) represented in the SPDX document—typically just one, such as the top-level project or container—but not all scanned packages or files, as previously implemented in ScanCode.io: aboutcode-org/scancode.io#564

This migration of documentDescribes content is required to ensure that SPDX output generated by ScanCode.io can be reliably consumed as input by other Software Composition Analysis (SCA) tools. A concrete example is the ORT integration work tracked in #1727, where downstream tools expect documentDescribes to reference only the root package rather than all discovered elements.
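As a rough illustration of the target shape, the sketch below builds a minimal SPDX 2.x-style JSON structure in which documentDescribes lists only the root package, while the other packages are reachable through relationships. The helper name `build_spdx_document`, the `SPDXRef-...` identifiers, and the choice of the DEPENDS_ON relationship are assumptions made for this example, not ScanCode.io's actual implementation.

```python
import json


def build_spdx_document(root, dependencies):
    """Build a minimal SPDX 2.x-style dict that describes only ``root``.

    ``root`` and each item of ``dependencies`` are (spdx_id, name) tuples.
    Illustrative helper only; it is not part of ScanCode.io.
    """
    root_id, _ = root
    packages = [
        {"SPDXID": spdx_id, "name": name}
        for spdx_id, name in [root, *dependencies]
    ]
    # Non-root packages are linked to the root instead of being "described".
    relationships = [
        {
            "spdxElementId": root_id,
            "relationshipType": "DEPENDS_ON",
            "relatedSpdxElement": dep_id,
        }
        for dep_id, _ in dependencies
    ]
    return {
        "spdxVersion": "SPDX-2.3",
        "SPDXID": "SPDXRef-DOCUMENT",
        "name": "example",
        # Only the root element is listed here, not every discovered package.
        "documentDescribes": [root_id],
        "packages": packages,
        "relationships": relationships,
    }


document = build_spdx_document(
    root=("SPDXRef-root", "my-app"),
    dependencies=[("SPDXRef-pkg-a", "pkg-a"), ("SPDXRef-pkg-b", "pkg-b")],
)
print(json.dumps(document["documentDescribes"]))  # prints ["SPDXRef-root"]
```

All three packages still appear under `packages`; only the list of described elements shrinks to the root.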

Changes made

  • Added support for downloading results as SPDX 2.2 (WebUI, REST API, CLI).
  • Used the project's input as the root element that the SPDX document describes.
    Note: for projects with multiple inputs, a root SPDX package (project_as_root_package) is used for documentDescribes instead.
  • Updated the SPDX.Document and its as_dict() serialization logic to follow this model.
  • Adjusted sample JSON fixtures to comply with the intended structure.
  • Revised tests (test_output.py, test_spdx.py) to assert correct behavior and updated expected counts and data.
  • Enhanced doc comments for clarity on the purpose and usage of documentDescribes.
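The root-selection rule from the list above can be sketched roughly as follows, assuming the project's inputs are available as a list of names. `RootPackage`, `to_spdx_id`, and `select_root_package` are hypothetical names introduced for this example; the real logic lives in the SPDX output code changed by this PR and may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RootPackage:
    """Hypothetical stand-in for the SPDX package used as the document root."""

    spdx_id: str
    name: str


def to_spdx_id(value):
    # SPDX 2.x identifiers allow only letters, digits, "." and "-"
    # after the "SPDXRef-" prefix, so other characters are replaced.
    return "SPDXRef-" + re.sub(r"[^A-Za-z0-9.-]", "-", value)


def select_root_package(project_name, input_names):
    """Pick the element that documentDescribes should reference.

    Assumption for this sketch: a single input becomes the root; any other
    number of inputs falls back to a package representing the project.
    """
    if len(input_names) == 1:
        name = input_names[0]
    else:
        name = project_name
    return RootPackage(spdx_id=to_spdx_id(name), name=name)


single = select_root_package("demo", ["app-1.0.tar.gz"])
multiple = select_root_package("demo", ["from.zip", "to.zip"])
print(single.name, multiple.name)
```

Either way, the resulting document describes exactly one element, which is what downstream consumers are said to expect.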

Impact

This aligns code with SPDX best practices, improves clarity for consumers, and ensures reliable test coverage.
It also enables interoperability with other SCA tools that depend on SPDX output following this convention.

Related Issues/Discussions

@tdruez tdruez requested a review from tsteenbe September 10, 2025 07:38
files_analyzed=True,
)

packages_as_spdx = [project_as_root_package]
@tsteenbe tsteenbe Sep 11, 2025

@tdruez Using project_as_root_package is incorrect imo: it's possible in ScanCode.io to upload multiple archives, so there would be multiple root packages. The variable should be projects_as_root_packages, since documentDescribes should be an array of SPDX packages (one for each archive). E.g., upload 5 archives to a ScanCode.io project, then there should be 5 SPDX root elements in documentDescribes of the resulting SPDX file.

In case ScanCode.io is given a single PURL for a code repository as its project input, such as pkg:github/package-url/purl-spec@244fd47e07d1004f0aed9c, then documentDescribes is still an array but should contain only a single package for the code repository that was scanned; see the comment from SPDX maintainer Rose: spdx/spdx-spec#395 (comment).

If a single SPDX or CycloneDX SBOM was provided as the ScanCode.io input for a project, then documentDescribes should point to the SPDX package for the provided SBOM imo.
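The "one root package per input" model proposed in this comment could look roughly like the sketch below. The function names and the `SPDXRef-input-N` identifier scheme are made up for this example and do not reflect how ScanCode.io names its SPDX elements.

```python
def root_packages_for_inputs(input_names):
    """Return one root-package dict per project input, in input order."""
    return [
        # The index keeps identifiers unique even if two inputs share a name.
        {"SPDXID": f"SPDXRef-input-{index}", "name": name}
        for index, name in enumerate(input_names, start=1)
    ]


def document_describes(root_packages):
    """Collect the SPDX IDs that documentDescribes would list."""
    return [package["SPDXID"] for package in root_packages]


archives = [f"archive-{n}.tar.gz" for n in range(1, 6)]
roots = root_packages_for_inputs(archives)
print(len(document_describes(roots)))  # prints 5
```

With a single PURL as the only input, the same helpers yield a one-element array, matching the second case described above.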

Contributor Author


> In case ScanCode.io is given a single PURL for a code repository as its project input such as pkg:github/package-url/purl-spec@244fd47e07d1004f0aed9c then documentDescribes is still an array but should only contain a single package for the code repository that was scanned,
> If a single SPDX or CycloneDX SBOM was provided as ScanCode.io input for a project then documentDescribes should point to the SPDX package for the provided SBOM imo.

@tsteenbe The code was adjusted to use the Project's input as the root package, addressing those 2 points.

The following forms of input are supported:

  • Input manually copied to Project's inputs directory
  • Input uploaded
  • Input fetched (download_url, purl, docker, git, ...)

> Using project_as_root_package is incorrect imo as it's possible in ScanCode.io to upload multiple archives, so there would be multiple root packages; the variable should be projects_as_root_packages, as documentDescribes should be an array of SPDX packages (one for each archive). E.g., upload 5 archives to a ScanCode.io project, then there should be 5 SPDX root elements in documentDescribes of the resulting SPDX file.

Now, for the multiple inputs case, this will require additional design work and likely some changes in the SCIO architecture to properly track CodebaseResource and DiscoveredPackage objects back to their input origin.

This will be handled in a separate PR, since it first requires further discussion.

Also, note that projects with multiple inputs (e.g. when using the deploy_to_develop pipeline) are not expected to fetch SPDX documents.

Signed-off-by: tdruez <tdruez@nexb.com>
@tdruez tdruez changed the title from "Set documentDescribes to reference the root SPDX element(s) only" to "Support for SPDX 2.2 and update documentDescribes to reference root element only" Sep 12, 2025
@tdruez tdruez changed the title from "Support for SPDX 2.2 and update documentDescribes to reference root element only" to "SPDX 2.2 support and documentDescribes update to reference root element only" Sep 12, 2025
tdruez commented Sep 12, 2025

Added support for downloading results as SPDX 2.2 (WebUI, REST API, CLI).

Signed-off-by: tdruez <tdruez@nexb.com>

# Conflicts:
#	scanpipe/tests/pipes/test_output.py
Signed-off-by: tdruez <tdruez@nexb.com>
Signed-off-by: tdruez <tdruez@nexb.com>

# Conflicts:
#	scanpipe/pipes/spdx.py
@tdruez tdruez merged commit 30c23a3 into main Sep 15, 2025
15 checks passed
@tdruez tdruez deleted the 1727-sca-integration-ort-spdx branch September 15, 2025 08:59