Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define an order of precedence + provide a means to override it (REUSE.yaml) #779

Closed
carmenbianca opened this issue Jun 21, 2023 · 15 comments · Fixed by #863
Closed

Define an order of precedence + provide a means to override it (REUSE.yaml) #779

carmenbianca opened this issue Jun 21, 2023 · 15 comments · Fixed by #863
Labels
discussion needed documentation Missing or wrong documentation enhancement New feature or request help wanted Extra attention is needed spec Requires specification change

Comments

@carmenbianca
Copy link
Member

carmenbianca commented Jun 21, 2023

Help, I got here because I got a PendingDeprecationWarning

PendingDeprecationWarning: Copyright and licensing information for 'my-project/foo.py' has been found in both 'my-project/foo.py' and in the DEP5 file located at '.reuse/dep5'. The information for these two sources has been aggregated. In the future this behaviour will change, and you will need to explicitly enable aggregation. See #779. You need do nothing yet. Run with --suppress-deprecation to hide this warning.

Don't worry, there's nothing you need to do quite yet, except be vigilant of upcoming changes.

The reason you're getting this warning is because of the following scenario. You have a file my-project/foo.py which contains the following header:

# SPDX-FileCopyrightText: 2023 Jane Doe
#
# SPDX-License-Identifier: GPL-3.0-or-later

But you also have a .reuse/dep5 file which contains the following section:

Files: my-project/*.py
Copyright: 2020 Example NGO
License: 0BSD

The problem: Under which licence is the file? Who are the copyright holders?

Currently, we err on the side of caution, and just aggregate the results. The answer to both questions is 'both', as far as the tool is concerned.

However, that behaviour is not actually specified in the REUSE Specification, and there is a consensus among the maintainers of REUSE that the current behaviour isn't great. So we want to change it.

A naïve proposal + some history

We want to define an order of precedence for copyright and licensing information. Here is a concrete proposal:

Copyright and Licensing Information is considered according to the
following order of precedence:

  1. Information defined in the .license file.
  2. Information defined in the Commentable File.
  3. Information defined in .reuse/dep5.

There is no merging of information from different sources. Only the
source with the highest precedence is considered.

In fact, this proposal is so concrete that—for a few hours—it was in REUSE Specification 3.1 and tool version 2.0.0! However, because of quick negative feedback, this update to the specification was promptly reverted, and tool version 2.0.0 was yanked. A little embarrassing on our part, but we're thankful for the constructive feedback.

Copied from the change log:

While the intention of the breaking change was sound (don't mix information sources; define a single source of truth), there were legitimate use-cases that were broken as a result of this.

The legitimate use-case is the following scenario: You copy a project Foo wholesale into your own project as a static dependency. Foo is not REUSE-compliant, but does contain copyright statements in some code headers. You write a section into .reuse/dep5 broadly declaring that static/Foo/* is under its declared licence, and attribute The Foo Authors as the copyright holders. However, because the DEP5 file is now no longer applied to the files that contain copyright statements, REUSE will complain that these files do not have a declared licence.

Within the restrictions of the above proposal, there is no good workable solution to this use-case. You could manually edit the headers (not great, especially when Foo is big, or you regularly need to update it), or you could manually add .license files, which may be a huge task.

An actual but not-so-concrete proposal

We still want to define an order of precedence. But we must provide a way to force aggregation (current behaviour) or hard-coding precedence (e.g., prefer .reuse/dep5 over the file contents).

There does not yet exist a concrete way of doing this, but you may think of it like this. Given the example .reuse/dep5 section at the start of this issue, we could instead write this:

Files: my-project/*.py
Copyright: 2020 Example NGO
License: 0BSD
Precedence: [file (default)|aggregate|dep5]

The problem, however, is that DEP5 does not support this field, and we don't want to make it support this field.

So we want to pivot away from DEP5 and adopt a different configuration method. We've been brainstorming this since 2021 (volunteer projects aren't very fast), and we're internally referring to it as REUSE.yaml (although the YAML part is a bit up in the air).

An actual for-real-this-time concrete proposal

Find below a real and actual concrete proposal:

# The version of the TOML schema. A simple integer should be fine.
# Mandatory.
version = 1

[[annotations]]
# The path (or paths) that are covered, relative to REUSE.toml's directory. Mandatory.
path = [
     ".bumpversion.cfg",
     "setup.cfg"
]
# A string that defines the precedence of copyright and licensing information.
# The choices are:
#
# - "toml" -> Treat the information in this file as the ultimate authority of
#   the copyright and licensing of the covered files. If multiple nested
#   REUSE.tomls have this precedence for the same file, then the topmost REUSE.toml
#   is authoritative.
# - "closest" -> Use the information closest to the file (including inside the file)
#   if available. If no such information is found, then the information inside this
#   REUSE.toml is applied to that file. TODO: what if there is only partial information
#   inside the file?
# - "aggregate" -> Aggregate the information from this file with the information
#   inside of the covered files.
#
# Not mandatory. Defaults to "closest".
precedence = "toml"
# The copyright notice (or notices) that apply to the above paths. Mandatory.
SPDX-FileCopyrightText = "2017 Free Software Foundation Europe e.V. <https://fsfe.org>"
# The license expression (or expressions) that apply to the above paths.
# Mandatory.
SPDX-License-Identifier = "GPL-3.0-or-later"

# Subsequent tables override previous tables. This does NOT interact with the
# 'precedence' key.
[[annotations]]
# Can contain globs. I would define the globbing behaviour as equal to Python's
# glob.glob (recursive=True, include_hidden=True), but that would require copying
# the glob.glob code and tweaking it to work without filesystem reads. So instead,
# I used the wcmatch.glob 3rd-party library, with GLOBSTAR, MATCHBASE, and DOTGLOB,
# which gives a behaviour very similar to .gitignore and glob.glob (but not precisely
# matching either).
path = "docs/reuse*.rst"
precedence = "toml"
SPDX-FileCopyrightText = [
    "2017 Free Software Foundation Europe e.V. <https://fsfe.org>",
    "2023 Jane Doe",
]
# These SHOULD be joined with AND, but files support multiple separate
# SPDX-License-Identifier tags, so let's support it here as well.
SPDX-License-Identifier = [
    "CC-BY-SA-4.0",
    "GPL-3.0-or-later"
]

Some notes on the implementation:

  • I chose TOML. I had previously been partial to YAML, but reading over the discussions in the linked issues, and having worked a little more with TOML recently, it's a lot more fool-proof to write, especially as concerns indentation. It's not very good for nesting of data, but we're not doing that, so it's fine. We could bikeshed this choice further, but I propose that we just go with it.
  • path, SPDX-FileCopyrightText, and SPDX-License-Identifier can be either a single string or a list of strings. This (partially) matches DEP5 behaviour, making it easy to port. It's also convenient to not mandate lists; we'll probably convert string values into single-value lists in the under-the-hood implementation (edit: that is exactly what I did).
  • I chose to use the full SPDX-[...] key names. This works better in TOML than in YAML (because the semicolon in YAML messes with this tool's parsing). It's a bit more annoying to type, but it's also very consistent, and means that the user has to memorise less.
  • The version key doesn't do much of anything precently. I'm not sure if it'll ever become important, but if it does, it'll be good to have.

Some notes about the file itself:

  • I intend to support exclusively REUSE.toml, and NOT .REUSE.toml. People will be peeved by this choice (they don't like random tools littering their clean workspace), but I propose that we stand by this choice. By allowing dotfiles, we would run the risk that the licensing information is hidden on some computers. Licensing information should not be hidden, ergo let's not do dotfiles.
  • REUSE.toml files can be nested! You can place them anywhere in the project, not just in the root directory. They use the same closest/aggregate/toml precedence system. closest resolves to the file itself OR to the nearest REUSE.toml (can be self). aggregate just aggregates the REUSE.toml's information always, and then behaves like closest. toml is an aggressive "ignore everything underneath me; I am the ultimate authority here" precedence setting. The topmost REUSE.toml with toml is authoritative.

Related issues

Here are some issues of relevance (in order of relevance, feel free to reference more):

This issue will exist as a sort of meta issue to refer back to and track work in other issues.

@carmenbianca carmenbianca changed the title Define an order of precedence + provide a means to override it. Define an order of precedence + provide a means to override it (REUSE.yaml) Jun 21, 2023
@carmenbianca carmenbianca added enhancement New feature or request help wanted Extra attention is needed spec Requires specification change documentation Missing or wrong documentation discussion needed labels Jun 21, 2023
@linozen linozen pinned this issue Jun 22, 2023
@floriansnow
Copy link
Contributor

Just documenting this here as well. I think having a precedence implies there is some sort of legal precedence as well. This is a problem if a repo supplies licensing information in a funny way, i.e. adds information in one way and then in another way while keeping both.

So I think it would be good to keep on aggregating and issue a warning whenever that happens because that means licensing information is unclear and that's not what we want. In later versions, we could perhaps even make the warning a failure. This would improve licensing overall by making it clearer.

However, the legitimate use case Carmen describes above, is an issue with my suggestion. Something we should figure out is how common static dependencies are and if they are typically necessary. If the answers are very common and necessary, then perhaps there is a way to declare something a static dependency and in that case be able to override included licensing information? Or perhaps we could ignore something that is declared a dependency? However, there is an underlying question we should also debate: Can a repo be REUSE compliant with static dependencies that aren't?

@DerMolly

This comment was marked as resolved.

@carmenbianca
Copy link
Member Author

@DerMolly This mechanism exists :D See https://github.com/fsfe/reuse-tool/blob/6b81d04dfa198423b3a1adc0483368af7b73fc7c/docs/usage.rst#ignoring-parts-of-a-file

Between .license files and the ignore tags, that use-case is well-covered at the moment. It doesn't, however, address the problem of static dependencies.

rpatterson added a commit to rpatterson/project-structure that referenced this issue Oct 20, 2023
    /opt/venv/lib/python3.11/site-packages/reuse/project.py:224: PendingDeprecationWarning: Copyright and licensing information for 'newsfragments/.gitignore' has been found in both 'newsfragments/.gitignore' and in the DEP5 file located at '.reuse/dep5'. The information for these two sources has been aggregated. In the future this behaviour will change, and you will need to explicitly enable aggregation. See <fsfe/reuse-tool#779>. You need do nothing yet. Run with `--suppress-deprecation` to hide this warning.
      warnings.warn(
vszakats added a commit to vszakats/libssh2 that referenced this issue Nov 8, 2023
```
/opt/venv/lib/python3.10/site-packages/reuse/project.py:224: PendingDeprecationWarning:
Copyright and licensing information for 'tests/openssh_server/Dockerfile' has been found
in both 'tests/openssh_server/Dockerfile' and in the DEP5 file located at '.reuse/dep5'.
The information for these two sources has been aggregated. In the future this behaviour
will change, and you will need to explicitly enable aggregation. See
<fsfe/reuse-tool#779>. You need do nothing yet.
Run with `--suppress-deprecation` to hide this warning.
```
Ref: https://github.com/libssh2/libssh2/actions/runs/6789274955/job/18456085964#step:4:4
@carmenbianca
Copy link
Member Author

I added a proposal for REUSE.toml in the section 'An actual for-real-this-time concrete proposal' of the issue.

@mxmehl @silverhook @nicorikken

@silverhook
Copy link
Contributor

I support the TOML choice and also that we avoid having a dotfile. REUSE.toml should be plenty fine.

Regarding using SPDX- tags, I’m fine with it, but there is a question whether it is a benefit or a problem for (other) tools. IMHO it’s a benefit, as the content is correct, albeit in a different file than tools might suspect (the same is true for *.license files too, so nothing new). but it might make sense to reach out to ScanCode and FOSSology for feedback. (also regarding merging this with ScanCode’s ABOUT files)

The example TOML looks good to me. I would expect that that using SPDX annotation within the field would also work fine. e.g.:

[[annotations]]

path = "docs/reuse*.rst"
precedence = "toml"
SPDX-FileCopyrightText = [
    "2017 Free Software Foundation Europe e.V. <https://fsfe.org>",
    "2023 Jane Doe",
]

SPDX-License-Identifier = [
    "(CC-BY-SA-4.0 AND GPL-3.0-or-later) OR MIT"
]

I have some concerns about the precedence field and allowing the REUSE.toml to be anywhere. There is room for misuse there, but on the other hand, people will game whatever they get. So I’m not too concerned. Still perhaps something to keep in mind, as it does complicate things quite a bit and makes it a bit less predictable.

@carmenbianca
Copy link
Member Author

carmenbianca commented Nov 30, 2023

I no longer intend to implement REUSE.toml files in multiple locations. This is non-trivial to implement, and there are so many permutations of potentially conflicting information.

@carmenbianca
Copy link
Member Author

Let's no longer make 'precedence' mandatory -> default to file.

@mxmehl
Copy link
Member

mxmehl commented Nov 30, 2023

I no longer intend to implement REUSE.toml files in multiple locations. This is non-trivial to implement, and there are so many permutations of potentially conflicting information.

After the discussion today, do you think differently? I think we need this feature, that was one of the USPs of REUSE.toml/yaml.

@carmenbianca
Copy link
Member Author

I wrote that just before the meeting. My mind was changed during the meeting :)

@andreashaerter
Copy link

There is also an issue as the deprecation warning is getting printed when there is nothing to aggregate / both .reuse/dep5 and the header are not in conflict, at least with reuse version 2.1.0. Example with a wildcard .reuse/dep5 and two files with license headers:

$ tree -a -I '.git' /tmp/issue779/
/tmp/issue779/
├── foobar.html
├── foobar.yaml
├── LICENSES
│   └── GPL-3.0-or-later.txt
└── .reuse
    └── dep5
$ cat /tmp/issue779/.reuse/dep5 
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: Issue 779
Upstream-Contact: ACME Inc. <https://example.com>
Source: <URL of the public repository, https://github.com/...>

Files: *
Copyright: ACME Inc. <https://example.com>
License: GPL-3.0-or-later
$ cat /tmp/issue779/foobar.yaml 
# Test
#
# SPDX-License-Identifier: GPL-3.0-or-later
# SPDX-FileCopyrightText: ACME Inc. <https://example.com>
---
foo: bar
$ cat /tmp/issue779/foobar.html 
<html>
<!--
SPDX-License-Identifier: GPL-3.0-or-later
SPDX-FileCopyrightText: ACME Inc. <https://example.com>
-->

I would expect no warnings as the information is redundant but in sync, no licenses to aggregate and therefore no precedence issues. However, a warning gets printed:

$ reuse --version lint
reuse 2.1.0

$ reuse --root /tmp/issue779/ lint
/tmp/.venv/issue779/lib64/python3.12/site-packages/reuse/project.py:224: PendingDeprecationWarning: Copyright and licensing information for '/tmp/issue779/foobar.html' has been found in both '/tmp/issue779/foobar.html' and in the DEP5 file located at '.reuse/dep5'. The information for these two sources has been aggregated. In the future this behaviour will change, and you will need to explicitly enable aggregation. See <https://github.com/fsfe/reuse-tool/issues/779>. You need do nothing yet. Run with `--suppress-deprecation` to hide this warning.
  warnings.warn(
/tmp/.venv/issue779/lib64/python3.12/site-packages/reuse/project.py:224: PendingDeprecationWarning: Copyright and licensing information for '/tmp/issue779/foobar.yaml' has been found in both '/tmp/issue779/foobar.yaml' and in the DEP5 file located at '.reuse/dep5'. The information for these two sources has been aggregated. In the future this behaviour will change, and you will need to explicitly enable aggregation. See <https://github.com/fsfe/reuse-tool/issues/779>. You need do nothing yet. Run with `--suppress-deprecation` to hide this warning.
  warnings.warn(

# SUMMARY

* Bad licenses: 0
* Deprecated licenses: 0
* Licenses without file extension: 0
* Missing licenses: 0
* Unused licenses: 0
* Used licenses: GPL-3.0-or-later
* Read errors: 0
* files with copyright information: 2 / 2
* files with license information: 2 / 2

Congratulations! Your project is compliant with version 3.0 of the REUSE Specification :-)

Is this expected behavior? If so, senseful wildcards would effectively prevent users to add .license files or header comments in files.

@raforg
Copy link

raforg commented Jan 1, 2024

I agree. The code just concatenates data from both sources without any attempt to avoid duplication, and emits multiple warnings. I logged an issue offering to fix this four months ago but there was no response. So I don't think there's any interest in that aproach. It's possible to suppress the warnings, but it's not possible to suppress the duplicates (in spdx output). Maybe it's messy but harmless. But I wish it was at least possible to instruct the code to not scan the source when there is a .reuse/dep5 file. That seems to me to be a very sensible thing to do, but it is also possible to scan both and cull duplicates (it's just less efficient if there is total agreement - where there is disagreement targeted warnings could help to achieve agreement). I think we have to wait until a decision is made about a configuration file that instructs reuse what to do in this regard. I just hope that the decision supports projects that want to continue to specify all of the licensing information (for REUSE purposes) in a single file separate to the source files (i.e., .reuse/dep5).

@carmenbianca
Copy link
Member Author

@andreashaerter That is an interesting bug/feature/behaviour that probably shouldn't be too difficult to fix. However, #863 should fix this class of issues, so it's a touch low on the priority list. I'll create an issue for it nonetheless.

Interestingly, the code does actually avoid duplication, ultimately. But at that exact stage in the code, not quite.

@raforg
Copy link

raforg commented Jan 2, 2024

If there is duplicate removal somewhere, it would be good if it occurred as part of or before the spdx subcommand. Its output contains all the duplicates.

@carmenbianca
Copy link
Member Author

carmenbianca commented Jan 3, 2024

I updated the first post with some new details after doing some tinkering work in #863.

  • file precedence renamed to closest.
  • closest precedence is now the implicit default, and the precedence key is no longer mandatory.
  • REUSE.toml files can now be nested in a project. Or rather, you can plop down REUSE.toml files anywhere. They use the same closest/aggregate/toml precedence system. closest resolves to the file itself OR to the nearest REUSE.toml (can be self). aggregate just aggregates the REUSE.toml's information always, and then behaves like closest. toml is an aggressive "ignore everything underneath me; I am the ultimate authority here" precedence setting. The topmost REUSE.toml with toml is authoritative.
  • The path globbing behaviour was defined. Copying: I would define the globbing behaviour as equal to Python's glob.glob (recursive=True, include_hidden=True), but that would require copying the glob.glob code and tweaking it to work without filesystem reads. So instead, I used the wcmatch.glob 3rd-party library, with GLOBSTAR (match **), MATCHBASE (match directory-less pattern also in subdirectories), and DOTGLOB (match .foo.py with *.py), which gives a behaviour very similar to .gitignore and glob.glob (but not precisely matching either).

Some notes:

  • The precedence handling of the nesting of REUSE.toml files is simple, but appears complete to me. However:
  • What to do if the REUSE.toml precedence is closest, but the file only contains a copyright line? Presently, this means that the REUSE.toml's information is never applied, and the file has no licence applied to it. There are three options:
    1. This behaviour is desirable actually, and shouldn't be changed.
    2. If the file contains exclusively copyright info or exclusively licensing info, then apply all the REUSE.toml information. This one seems good to me.
    3. Selectively apply exclusively the licensing information if there is only copyright information in the file, and vice versa. This seems needlessly complicated to me.
  • While I'm fairly happy with how the globbing works, it's going to be a pain to describe in the spec. We could cheat and say 'globbing works like wcmatch.glob with GLOBSTAR, MATCHBASE, and DOTGLOB', but that's a bit weird.

@silverhook
Copy link
Contributor

2. If the file contains exclusively copyright info or exclusively licensing info, then apply all the REUSE.toml information. This one seems good to me.

Practically speaking, this sounds like a good trade-off. But the tool should raise a warning if that happens.

If someone is already using REUSE.toml to override info, then it’s on them to make sure the override is exact.

(It’s going to be fun explaining it in the spec.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion needed documentation Missing or wrong documentation enhancement New feature or request help wanted Extra attention is needed spec Requires specification change
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants