Classify licenses based on file contents #656

Open
@wagoodman

What would you like to be added:
The ability to read a file's entire contents (or just the top X bytes) and classify them as a particular license (e.g. MIT, Apache 2.0, etc.). This is a larger addition than #565 (which only covers SPDX identifiers), but the two should be designed together. Discovered license content could optionally be persisted in the final SBOM (SPDX supports this).
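
To make the "top X bytes" idea concrete, here is a minimal sketch in Python. It is not how licenseclassifier works: it only looks for one hand-picked marker phrase per license, where a real classifier would compare against full license texts. The names `LICENSE_MARKERS` and `classify_license` and the 4096-byte default are assumptions made up for illustration.

```python
import re
from typing import Optional

# Hypothetical marker phrases (assumption): one distinctive, lowercased
# sentence fragment per license.
LICENSE_MARKERS = {
    "MIT": "permission is hereby granted, free of charge, to any person obtaining a copy",
    "Apache-2.0": "licensed under the apache license, version 2.0",
}


def normalize(text: str) -> str:
    """Lowercase the text and collapse whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def classify_license(data: bytes, max_bytes: int = 4096) -> Optional[str]:
    """Return the ID of the first marker found in the first max_bytes, else None."""
    # Only the head of the file is inspected, mirroring "the top X bytes".
    head = data[:max_bytes].decode("utf-8", errors="replace")
    text = normalize(head)
    for license_id, marker in LICENSE_MARKERS.items():
        if marker in text:
            return license_id
    return None
```

Limiting the read to the head of the file keeps the cost bounded for large source files, at the price of missing license text that appears later in the file.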

Why is this needed:
Keeping a curated list of licenses for your dependencies is a common use case for SBOMs.

Additional context:
Consider using https://github.com/google/licenseclassifier for the heavy lifting.

As a start, this could key off file extensions to filter down to source files (.py, .go, .c, etc.) or off filenames (e.g. "license", "LICENSE", "license.*", etc.) to keep the search scope reasonable.
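
A rough sketch of that pre-filter in Python, assuming a case-insensitive match on the filename and a small fixed extension set. `SOURCE_EXTENSIONS`, `LICENSE_NAME_PATTERNS`, and `is_candidate` are hypothetical names, not anything that exists in Syft.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Assumed starting sets, taken from the examples in this issue.
SOURCE_EXTENSIONS = {".py", ".go", ".c"}
LICENSE_NAME_PATTERNS = ["license", "license.*"]


def is_candidate(path: str) -> bool:
    """Return True if the path looks like a source file or a license file."""
    p = PurePosixPath(path)
    if p.suffix.lower() in SOURCE_EXTENSIONS:
        return True
    # Compare the lowercased basename so "LICENSE" and "License.txt" both match.
    name = p.name.lower()
    return any(fnmatch(name, pattern) for pattern in LICENSE_NAME_PATTERNS)
```

With these sets, `src/main.go`, `LICENSE`, and `docs/License.txt` would be searched, while `README.md` would be skipped.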

This could be implemented as its own cataloger that is only responsible for finding licenses in files. This would make the configuration easily accessible, for example:

license:
  cataloger:
    enabled: true
    scope: "squashed"
  
  # keep the license content in the final SBOM
  capture-content: true

  # only search in the following files (by glob)
  globs:
    - "license*"
    - "License*"
    - "*.c"
    - "*.go"
    - "*.py"
    - "*.ts"
    - "*.tsx"
    ...
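
For illustration, the options above could map onto a typed structure roughly like this. It is a Python sketch that takes the already-parsed YAML as a plain mapping. `LicenseCatalogerConfig`, `from_mapping`, `make_finding`, and the default values are all assumptions, not Syft's actual configuration model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Mapping


@dataclass
class LicenseCatalogerConfig:
    # Defaults are assumptions; the issue does not specify any.
    enabled: bool = False
    scope: str = "squashed"
    capture_content: bool = False
    globs: List[str] = field(default_factory=list)

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "LicenseCatalogerConfig":
        """Build a config from the parsed YAML document (a nested mapping)."""
        section = raw.get("license", {})
        cataloger = section.get("cataloger", {})
        return cls(
            enabled=bool(cataloger.get("enabled", False)),
            scope=str(cataloger.get("scope", "squashed")),
            capture_content=bool(section.get("capture-content", False)),
            globs=list(section.get("globs", [])),
        )


def make_finding(path: str, license_id: str, content: str,
                 cfg: LicenseCatalogerConfig) -> Dict[str, str]:
    """Describe one discovered license; keep the text only if capture-content is set."""
    finding = {"path": path, "license": license_id}
    if cfg.capture_content:
        finding["content"] = content
    return finding
```

Keeping `capture-content` as a single switch means the license text is dropped at the point where a finding is recorded, so it never reaches the SBOM unless the user asked for it.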

More thought is needed as to how this is organized in the Syft JSON output. That is, does this show up as snippets under packages? Snippets under files? Maybe they get their own section? How does this relate to the licenses field under a package? (will it change? relate to another field? or something else?).


Labels: enhancement (New feature or request), license (relating to software licensing)
