don't split licenses from `License` field from python packaging #5204

DmitriyLewen · 2023-09-18T07:13:46Z

Discussed in #5155

License field in .dist-info/METADATA files contains textual information about license/exceptions/specific versions/etc... - https://packaging.python.org/en/latest/specifications/core-metadata/#license

We can't split this text correctly.
Save this field without division.

The text was updated successfully, but these errors were encountered:

dus7eh · 2024-02-07T15:11:42Z

I'm not sure if this this the way to handle such case.
The mentioned structlog 23.1.0 contains a License field with full license. The sbom and probably similarly license scans generate license name not license text. If we had the full text as the output would we be able to add some custom classifications for such licenses in trivy config at all?

Thankfully there are additional classifier fields in METADATA file which can serve such info. Part of the file from structlog:

Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python

Maybe it would be a good option to use Classifier: License fields when present and fallback to License field otherwise.

I stumbled on a similar issue with scrapli package and this approach would solve the issue there as well. License scan output:

DmitriyLewen · 2024-02-08T04:04:04Z

Hello @dus7eh
Thanks for your information!

We already check Classifier: License - https://github.com/aquasecurity/go-dep-parser/blob/4f19ab402b0babc81876dc518be0ed82f93f0255/pkg/python/packaging/parse.go#L51
I also thought about checking Classifier: License before License, but we can miss some moments - aquasecurity/go-dep-parser#256 (comment)

Perhaps we can try to using https://github.com/google/licenseclassifier for License field. But i am not sure that License field contains text of license.

But License and Classifier: License fields are deprecated - https://peps.python.org/pep-0639/#deprecate-license-field.
Such problems will decrease over time.

dus7eh · 2024-02-08T08:09:58Z

Didn't notice earlier that these fields are deprecated. Still it will take some time before they actually go out of use.

For the scrapli package case I've mentioned there's a tab character in the new line after the MIT License statement which messes things up. I've checked that removing it fixes how the License field is parsed.
Are you open for a PR with tab trimming on processing this field?

DmitriyLewen · 2024-02-08T08:21:30Z

Can you send License field for METADATA file for scrapli please?

dus7eh · 2024-02-08T08:53:44Z

Sure thing, here is the field with some context around it:

Author-email: Carl Montanari <carl.r.montanari@gmail.com>
License: MIT License
        
        Copyright (c) 2021 Carl Montanari
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Changelog, https://carlmontanari.github.io/scrapli/changelog
Project-URL: Changelog, https://carlmontanari.github.io/scrapli/changelog
Project-URL: Docs, https://carlmontanari.github.io/scrapli/
Project-URL: Homepage, https://github.com/carlmontanari/scrapli
Keywords: arista,automation,cisco,eos,iosxe,iosxr,juniper,junos,netconf,network,nxos,ssh,telnet
Classifier: License :: OSI Approved :: MIT License

Although I was slightly mistaken, it's a series of 8 spaces rather than a tab that messes things. When you remove 4 or all 8 spaces in the line between MIT License and Copyright... trivy simply outputs MIT License in the report which is expected. So this actually might be an issue that exactly 4 leading spaces are trimmed (and not a multiply of 4).

DmitriyLewen · 2024-02-08T11:25:01Z

It's just a lucky coincidence that License field starts with the license name.
Your solution doesn't help in this case - #5155 (comment)

Also we still have a lot of meaningless licenses ( e.g. copy, merge, etc.).

#5211 doesn't separate licenses from License field. I also thought about using only first x (e.g. 30) characters for License field to avoid bit size of license name.

dus7eh · 2024-02-26T15:04:50Z

But License and Classifier: License fields are deprecated - https://peps.python.org/pep-0639/#deprecate-license-field.
Such problems will decrease over time.

Just a quick remartk about the article you've mentioned (regarding License-expression). It's in draft status for almost 4 years now. So I wouldn't state that License and Classifier: License are deprecated yet :)

Another thing is that according to the description and usage (https://packaging.python.org/en/latest/specifications/core-metadata/#license) of License field I'd say tha Classifier type should have higher precedence. If Classifier: License is present it defines the license type with some modifications or additional specifications which might be described in License field. And yes, some nuances may apply but maybe they could be denoted in the output? For example somehow translated into lower confidence level.

If Classifier is not present, then License from METADATA could treated as license. This is similar to how other python specific tools handle the case currently: pip-licenses.

I'm not sure how the full flow works but you've mentioned about using licenseclassifier for recognizing licenses for this case. Python packages in the [package-name]-dist-info dir often contain LICENSE file which could be analyzed as well. Are they taken into account?

DmitriyLewen · 2024-02-27T05:53:30Z

Just a quick remartk about the article you've mentioned (regarding License-expression). It's in draft status for almost 4 years now. So I wouldn't state that License and Classifier: License are deprecated yet :)

Hm... i missed status for PEP639.

Then we can try to change fields priority.

And yes, some nuances may apply but maybe they could be denoted in the output? For example somehow translated into lower confidence level.

I have another idea - we can add new field with text from LICENSE field.
By default, we will include in report license from Classifier: License and add note indicating that this license may have been updated. User can check this in json format ( we will add new filed to json format).
I will try this idea when will move PR from go-dep-parser to Trivy.

Thanks for your help and information about this case!

I'm not sure how the full flow works but you've mentioned about using licenseclassifier for recognizing licenses for this case. Python packages in the [package-name]-dist-info dir often contain LICENSE file which could be analyzed as well. Are they taken into account?

licenseclassifier uses a lot of resources. So we only check LICENSE file if License-File field is present.

dus7eh · 2024-02-27T08:15:51Z

I've adjusted the implementation to:

if l := h.Get("License-Expression"); l != "" {
	licenses = types.LicensesFromString(l, types.LicenseTypeName)
	// } else if l := h.Get("License"); l != "" {
} else {
	// read license name from license classifier
	for _, classifier := range h.Values("Classifier") {
		if strings.HasPrefix(classifier, "License :: ") {
			values := strings.Split(classifier, " :: ")
			// there can be several classifiers with licenses
			licensename := values[len(values)-1]
			// According to the classifier list https://pypi.org/classifiers/ there is one grouping classifier
			// without a specific license definition (Classifier: License :: OSI Approved) - it is skipped
			if licensename != "OSI Approved" {
				licenses = append(licenses, types.LicensesFromString(licensename, types.LicenseTypeName)...)
			}
		}
	}
	// The license field can contain information about different licenses, license exceptions , etc.:
	// https://packaging.python.org/en/latest/specifications/core-metadata/#license,
	// but it is impossible to define a delimiter to separate them.
	// Mark them, so we don't have to separate them later.
        if l := h.Get("License"); l != "" && len(licenses) == 0 {
		licenses = types.LicensesFromString(l, types.LicenseTypeNonSeparable)
	}
}
if len(licenses) == 0 && h.Get("License-File") != "" {
	licenses = types.LicensesFromStringSlice(h.Values("License-File"), types.LicenseTypeFile)
}

which seems to work quite well. If you handle the go-dep-parser migration I could prepare a PR with proper changes. For now I started off from your dont-split-python-licenses branch and managed to get more expected python licenses results.

DmitriyLewen · 2024-02-27T09:28:46Z

It takes time to update my PR and integrate these changes.
Therefore, you can open a new PR (if you have time) with only the priority changed (use the master branch as a base).

DmitriyLewen added the kind/bug Categorizes issue or PR as related to a bug. label Sep 18, 2023

DmitriyLewen self-assigned this Sep 18, 2023

DmitriyLewen linked a pull request Sep 19, 2023 that will close this issue

feat(license): add struct with types for licenses #5211

Draft

7 tasks

dus7eh mentioned this issue Feb 28, 2024

fix(license): reorder logic of how python package licenses are acquired #6220

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

don't split licenses from `License` field from python packaging #5204

don't split licenses from `License` field from python packaging #5204

DmitriyLewen commented Sep 18, 2023

dus7eh commented Feb 7, 2024

DmitriyLewen commented Feb 8, 2024 •

edited

Loading

dus7eh commented Feb 8, 2024

DmitriyLewen commented Feb 8, 2024 •

edited

Loading

dus7eh commented Feb 8, 2024 •

edited

Loading

DmitriyLewen commented Feb 8, 2024

dus7eh commented Feb 26, 2024 •

edited

Loading

DmitriyLewen commented Feb 27, 2024

dus7eh commented Feb 27, 2024 •

edited

Loading

DmitriyLewen commented Feb 27, 2024

don't split licenses from License field from python packaging #5204

don't split licenses from License field from python packaging #5204

Comments

DmitriyLewen commented Sep 18, 2023

Discussed in #5155

dus7eh commented Feb 7, 2024

DmitriyLewen commented Feb 8, 2024 • edited Loading

dus7eh commented Feb 8, 2024

DmitriyLewen commented Feb 8, 2024 • edited Loading

dus7eh commented Feb 8, 2024 • edited Loading

DmitriyLewen commented Feb 8, 2024

dus7eh commented Feb 26, 2024 • edited Loading

DmitriyLewen commented Feb 27, 2024

dus7eh commented Feb 27, 2024 • edited Loading

DmitriyLewen commented Feb 27, 2024

don't split licenses from `License` field from python packaging #5204

don't split licenses from `License` field from python packaging #5204

DmitriyLewen commented Feb 8, 2024 •

edited

Loading

DmitriyLewen commented Feb 8, 2024 •

edited

Loading

dus7eh commented Feb 8, 2024 •

edited

Loading

dus7eh commented Feb 26, 2024 •

edited

Loading

dus7eh commented Feb 27, 2024 •

edited

Loading