Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1577: update and clean license list generation to return more SPDXID for more inputs #1691

Merged
merged 6 commits into from
Mar 23, 2023

Conversation

spiffcs
Copy link
Contributor

@spiffcs spiffcs commented Mar 22, 2023

Summary

Update the license_list.go to have more permissible inputs for greater SPDXID matching.

The spdxlicense package contains the ID method which interacts with the generated file license_list.go

The current implementation contains - in the keys. This PR removes these - and sanitizes the inputs so that we can match on a wider range of inputs found in the wild.

spdxlicense.ID has also been changed to only consider the generated list and return if a value exists

The logic of how this information is encoded has been temporarily been moved to the format helpers.

Note this is temporary. #1554 will be used to update our license parsing logic so that license creation is done at the same time as package creation.

In a follow up PR encoders or other middle layers of syft should no longer have any concerns surrounding updating/finding the correct SPDXID or Expression as this will be done when packages are created at the cataloger level.

Example:

GPL3 gpl3 gpl-3 and GPL-3 can all map to GPL-3.0-only

By moving all strings to lower and removing the - we're able to return valid SPDX license ID for a greater diversity of input strings.

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@github-actions
Copy link

github-actions bot commented Mar 22, 2023

Benchmark Test Results

Benchmark results from the latest changes vs base branch
goos: linux
goarch: amd64
pkg: github.com/anchore/syft/test/integration
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                                          │ ./.tmp/benchmark-ea57438.txt │
                                                          │            sec/op            │
ImagePackageCatalogers/alpmdb-cataloger-2                                   11.78m ±  1%
ImagePackageCatalogers/ruby-gemspec-cataloger-2                             803.4µ ± 15%
ImagePackageCatalogers/python-package-cataloger-2                           2.998m ±  0%
ImagePackageCatalogers/php-composer-installed-cataloger-2                   644.5µ ±  0%
ImagePackageCatalogers/javascript-package-cataloger-2                       336.2µ ±  1%
ImagePackageCatalogers/dpkgdb-cataloger-2                                   467.3µ ±  1%
ImagePackageCatalogers/rpm-db-cataloger-2                                   434.8µ ±  3%
ImagePackageCatalogers/java-cataloger-2                                     10.94m ±  1%
ImagePackageCatalogers/graalvm-native-image-cataloger-2                     7.733µ ±  2%
ImagePackageCatalogers/apkdb-cataloger-2                                    501.4µ ±  1%
ImagePackageCatalogers/go-module-binary-cataloger-2                         17.90µ ±  1%
ImagePackageCatalogers/dotnet-deps-cataloger-2                              948.1µ ±  1%
ImagePackageCatalogers/portage-cataloger-2                                  284.7µ ±  1%
ImagePackageCatalogers/sbom-cataloger-2                                     104.1µ ±  1%
ImagePackageCatalogers/binary-cataloger-2                                   184.7µ ±  1%
geomean                                                                     442.8µ

                                                          │ ./.tmp/benchmark-ea57438.txt │
                                                          │             B/op             │
ImagePackageCatalogers/alpmdb-cataloger-2                                   5.062Mi ± 0%
ImagePackageCatalogers/ruby-gemspec-cataloger-2                             123.8Ki ± 0%
ImagePackageCatalogers/python-package-cataloger-2                           947.0Ki ± 0%
ImagePackageCatalogers/php-composer-installed-cataloger-2                   155.8Ki ± 0%
ImagePackageCatalogers/javascript-package-cataloger-2                       91.07Ki ± 0%
ImagePackageCatalogers/dpkgdb-cataloger-2                                   144.5Ki ± 0%
ImagePackageCatalogers/rpm-db-cataloger-2                                   170.9Ki ± 0%
ImagePackageCatalogers/java-cataloger-2                                     2.726Mi ± 0%
ImagePackageCatalogers/graalvm-native-image-cataloger-2                     1.555Ki ± 0%
ImagePackageCatalogers/apkdb-cataloger-2                                    129.3Ki ± 0%
ImagePackageCatalogers/go-module-binary-cataloger-2                         3.133Ki ± 0%
ImagePackageCatalogers/dotnet-deps-cataloger-2                              314.1Ki ± 0%
ImagePackageCatalogers/portage-cataloger-2                                  75.53Ki ± 0%
ImagePackageCatalogers/sbom-cataloger-2                                     13.08Ki ± 0%
ImagePackageCatalogers/binary-cataloger-2                                   29.06Ki ± 0%
geomean                                                                     108.4Ki

                                                          │ ./.tmp/benchmark-ea57438.txt │
                                                          │          allocs/op           │
ImagePackageCatalogers/alpmdb-cataloger-2                                    86.71k ± 0%
ImagePackageCatalogers/ruby-gemspec-cataloger-2                              2.049k ± 0%
ImagePackageCatalogers/python-package-cataloger-2                            15.48k ± 0%
ImagePackageCatalogers/php-composer-installed-cataloger-2                    3.458k ± 0%
ImagePackageCatalogers/javascript-package-cataloger-2                        1.214k ± 0%
ImagePackageCatalogers/dpkgdb-cataloger-2                                    2.646k ± 0%
ImagePackageCatalogers/rpm-db-cataloger-2                                    3.759k ± 0%
ImagePackageCatalogers/java-cataloger-2                                      38.26k ± 0%
ImagePackageCatalogers/graalvm-native-image-cataloger-2                       40.00 ± 0%
ImagePackageCatalogers/apkdb-cataloger-2                                     3.437k ± 0%
ImagePackageCatalogers/go-module-binary-cataloger-2                           101.0 ± 0%
ImagePackageCatalogers/dotnet-deps-cataloger-2                               5.010k ± 0%
ImagePackageCatalogers/portage-cataloger-2                                   1.487k ± 0%
ImagePackageCatalogers/sbom-cataloger-2                                       392.0 ± 0%
ImagePackageCatalogers/binary-cataloger-2                                     872.0 ± 0%
geomean                                                                      2.219k

"apl1": "APL-1.0",
"apl1.0": "APL-1.0",
"apl1.0.0": "APL-1.0",
"apps2.0.0p": "App-s2p",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is now detecting versions to expand from the simplified string: We can talk through if we want to handle this case s2p where 2 is not a version. I don't think it would be App-s3p in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be a couple of these cases which has caused the list to grow by 18 lines --- potentially 6 cases of a string that are now expanding version permutations which were not doing it before this PR

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs spiffcs marked this pull request as ready for review March 22, 2023 22:35
@@ -35,7 +33,7 @@ var licenseIDs = map[string]string{
}
`))

var versionMatch = regexp.MustCompile(`-([0-9]+)\.?([0-9]+)?\.?([0-9]+)?\.?`)
var versionMatch = regexp.MustCompile(`([0-9]+)\.?([0-9]+)?\.?([0-9]+)?\.?`)
Copy link
Contributor Author

@spiffcs spiffcs Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Causing us to match on numbers in licenses that would otherwise not be considered versions, but I think the trade off of being able to match on more strings out in the wild and return a correct SPDX ID is a good one here.

// so we need to guarantee the order they are created to avoid mapping them wrongly. So we use a sorted list.
// To overwrite deprecated licenses during the first pass we would later on rely on map order,
// [which in go is not consistent by design](https://stackoverflow.com/a/55925880).
// The order of variations/permutations of a license ID matter.
Copy link
Contributor Author

@spiffcs spiffcs Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified this logic down to a single pass after we sort the list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented out the log messages to reduce the noise since it's just a script we run to generate our license list and not part of the larger syft program.

You can turn those on to see how the sort prevents things like alpm1 from mapping to later versions

if l.Deprecated {
return false
}

// We want to replace deprecated licenses with non-deprecated counterparts
// For more information, see: https://github.com/spdx/license-list-XML/issues/1676
if other.Deprecated {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findReplacementLicense already assumes a deprecated input

return l.ID == other.ID
}

func (ll LicenseList) findReplacementLicense(deprecated License) *License {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved above the function canReplace for readability

return "", false
}

func cleanLicenseID(id string) string {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to duplicate this between generate and the actual package. If there is a better way to share the code happy to update to that!

true,
},
// the below few cases are NOT expected, however, seem unavoidable given the current approach
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer returning true 🥳

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:chef-kiss:

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs spiffcs marked this pull request as draft March 22, 2023 22:56
@spiffcs
Copy link
Contributor Author

spiffcs commented Mar 23, 2023

Things to look at in the AM:

  • bzip2-1 <-- This test case is odd when removing the -. We might have to make the capture groups a little better to handle this case
  • Cyclonedx License regression for made-up
  • CycloneDX test updates with name carried through
  • Encoder updates with new license updates as a consequence of ID change
  • CLI regressions

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs spiffcs marked this pull request as ready for review March 23, 2023 14:48
Copy link
Contributor

@wagoodman wagoodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really nice work 🥇

GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
Update the license_list.go to have more permissible inputs for greater SPDXID matching.
EX:
GPL3 gpl3 gpl-3 and GPL-3 can all map to GPL-3.0-only

By moving all strings to lower and removing the "-" we're able to return valid SPDX license ID for a greater diversity of input strings.
---------

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants