This repository has been archived by the owner on May 27, 2024. It is now read-only.

Define precedence of information with REUSE.yaml #70

Open
mxmehl opened this issue Jul 31, 2020 · 14 comments
Labels
spec Specification

Comments

@mxmehl
Member

mxmehl commented Jul 31, 2020

We should decide on a precedence chain that lets the tool (and our spec) determine which licensing/copyright information is authoritative when there are conflicts, e.g. between the source code file and the respective .license file.

All of this is based on the assumption that we will introduce a REUSE.yaml file.

There are three options that make sense in their own ways. Which one do you prefer?

Option 1: most intuitive

Rationale: The closer to the file, the more weight the information should have

  1. file content
  2. file.license
  3. adjacent REUSE.yaml (no matter if explicit or glob)
  4. faraway REUSE.yaml

Option 2: override compromise

Rationale: Allow overriding file contents, e.g. if there are misleading strings that confuse the REUSE tool. Otherwise the same as Option 1

  1. file.license
  2. file content
  3. adjacent REUSE.yaml (no matter if explicit or glob)
  4. faraway REUSE.yaml

Option 3: most tooling robust

Rationale: Similar to Option 2, but distinguish between explicit and glob file definitions

  1. file.license
  2. explicit in adjacent yaml (so file directly covered)
  3. explicit in faraway yaml
  4. file content
  5. glob in adjacent yaml (file covered e.g. as part of *.png)
  6. glob in faraway yaml
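
To make the comparison concrete, here is a minimal sketch of the three chains as ordered tiers. It is not how the REUSE tool works; the source labels (`file_content`, `glob_adjacent_yaml`, and so on) and the `resolve` helper are hypothetical names invented for illustration, and the tie-break inside a tier is an assumption because the options above do not define one.

```python
from typing import Dict, Optional, Tuple

# Each chain is a list of tiers, highest priority first.
# The source labels are hypothetical names made up for this sketch.
ADJACENT = ("explicit_adjacent_yaml", "glob_adjacent_yaml")
FARAWAY = ("explicit_faraway_yaml", "glob_faraway_yaml")

OPTION_1 = [("file_content",), ("license_file",), ADJACENT, FARAWAY]
OPTION_2 = [("license_file",), ("file_content",), ADJACENT, FARAWAY]
OPTION_3 = [
    ("license_file",),
    ("explicit_adjacent_yaml",),
    ("explicit_faraway_yaml",),
    ("file_content",),
    ("glob_adjacent_yaml",),
    ("glob_faraway_yaml",),
]


def resolve(chain, found: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (source, license) from the highest-ranked tier that has information."""
    for tier in chain:
        for source in tier:  # assumption: ties within a tier go to the first label
            if source in found:
                return source, found[source]
    return None


# Information found for one file, keyed by where it was found.
found = {
    "file_content": "LGPL-3.0",
    "explicit_faraway_yaml": "Apache-2.0",
    "glob_adjacent_yaml": "MIT",
}
for name, chain in [("Option 1", OPTION_1), ("Option 2", OPTION_2), ("Option 3", OPTION_3)]:
    print(name, resolve(chain, found))
```

With this input, Options 1 and 2 both pick the file content, while Option 3 picks the explicit entry in the faraway YAML file. The adjacent glob never wins, because the file content outranks it in all three chains.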
@oddhack

oddhack commented Sep 21, 2020

I like #3. I've been doing REUSE compliance for some repositories which import multiple other repos which are not REUSE-compliant and aren't likely to be, so globbing is important.

@mxmehl
Member Author

mxmehl commented Sep 21, 2020

> I like #3. I've been doing REUSE compliance for some repositories which import multiple other repos which are not REUSE-compliant and aren't likely to be, so globbing is important.

Ah, I think there's a misunderstanding. Globbing would be possible in each scenario. Option 3 just distinguishes whether a file is defined explicitly or via a glob, and how its priority changes depending on the distance of the YAML file. I tried to make that a bit clearer above.

@oddhack

oddhack commented Sep 22, 2020

I wasn't entirely clear either. A confusing thing I kept running into was how .reuse/dep5 is parsed such that the last match is the controlling one - someone in another issue pointed to dep5 documentation on how that works, but I didn't see it in REUSE documentation.

So having the processing stages clearly laid out such that explicit matches happen first and in a separate stage from glob matching would be more sensible for me, and better documented (even if just at the level of the comment above). However you do it, I encourage being really clear about how multiple matches to the same file are handled.
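
As a rough illustration of the difference, the sketch below contrasts the last-match-wins behaviour described above with a two-stage lookup where explicit paths are checked before any glob. Both functions and the `(pattern, license)` entry layout are hypothetical, not the REUSE tool's actual code, and using Python's `fnmatchcase` for the glob syntax is an assumption.

```python
from fnmatch import fnmatchcase


def last_match_wins(path, entries):
    """Scan every (pattern, license) entry; the last one that matches applies."""
    result = None
    for pattern, license_id in entries:
        if fnmatchcase(path, pattern):
            result = license_id
    return result


def explicit_then_glob(path, entries):
    """Stage 1: exact path entries. Stage 2: glob patterns, only as a fallback."""
    for pattern, license_id in entries:
        if pattern == path:
            return license_id
    for pattern, license_id in entries:
        # Assumption: the first matching glob wins; the thread does not define this.
        if fnmatchcase(path, pattern):
            return license_id
    return None


entries = [("src/logo.png", "CC0-1.0"), ("src/*", "MIT")]
print(last_match_wins("src/logo.png", entries))
print(explicit_then_glob("src/logo.png", entries))
```

With these entries, the first call yields MIT because the glob comes last, while the second yields CC0-1.0 because the explicit path is handled in its own stage regardless of ordering.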

@robinkrahl

robinkrahl commented Sep 22, 2020

For the use case I described in this comment, I’d prefer option 3:

> I’m including the source code of a third-party library in a project I maintain. The library uses the deprecated LGPL-3.0 identifier which reuse does not accept. I would like to overwrite these annotations in the dep5 file, but reuse still parses the library files and reports the incorrect license identifier.

Alternatively, would it be possible to add an override field to the entries in REUSE.yaml to disable checking the matched files?

@mxmehl
Member Author

mxmehl commented Sep 22, 2020

> For the use case I described in this comment, I’d prefer option 3:

Thank you. May I ask why option 2 would not work for you?

> Alternatively, would it be possible to add an override field to the entries in REUSE.yaml to disable checking the matched files?

If possible, my preference would be to have the precedence chain make such manual overrides unnecessary, so that we do not fall into the trap of option overload.

@robinkrahl

> May I ask why option 2 would not work for you?

You are right, it would work too. I just would have to generate multiple .license files which is rather tedious, especially when I have to merge in new versions from the upstream project. Specifying the information in a single location would be easier.

@silverhook
Collaborator

silverhook commented Sep 22, 2020

For practical reasons, I would prefer №3.
It is more complex, but that complexity does bring with it flexibility to make REUSE actually useful in more complex (dare I say real life) scenarios where 3rd party code mixes with 1st party code and a tonne of non-editable files.

Still, in the spec and FAQ (perhaps even in the tool’s output) we should continue to emphasise the importance of having the licensing/copyright info in the files themselves, if at all possible. That way, this info does not get lost if a file is taken out of its home context.

@mxmehl
Member Author

mxmehl commented Sep 24, 2020

> You are right, it would work too. I just would have to generate multiple .license files which is rather tedious, especially when I have to merge in new versions from the upstream project. Specifying the information in a single location would be easier.

Ah, but in options 1 and 2, you would also have the glob available. Compared to option 2, option 3 only distinguishes by the type of coverage (explicit path vs. glob) when information overlaps.

Excuse me for asking you so much about your rationale ;)

@mxmehl
Member Author

mxmehl commented Sep 24, 2020

> For practical reasons, I would prefer №3.
> It is more complex, but that complexity does bring with it flexibility to make REUSE actually useful in more complex (dare I say real life) scenarios where 3rd party code mixes with 1st party code and a tonne of non-editable files.

I am starting to get the same feeling. I am just not so happy with having the file content ranked that low, given our actual priority and the usefulness for human readers.

@oddhack

oddhack commented Sep 24, 2020

While there are a couple of different ways of presenting it, for me #3 boils down to

  • Explicit override of file, when necessary (special case / one-off license / bad file contents; we have run into the last of these)
  • File contents
  • Globbing for e.g. imported repositories whose content can't be modified

That may make it a little more clear why I'm in favor? Having 3 ways to specify the explicit override does seem a bit overkill-ish. Though in our usage thus far we have completely avoided .license files as they clutter the repository - a lot, if you have e.g. hundreds of images.

@robinkrahl

robinkrahl commented Sep 24, 2020 via email

@mxmehl
Member Author

mxmehl commented Sep 25, 2020

> Yes, but in options 1 and 2, the information in the file has precedence over the glob, right? My problem was that the annotations in the file use an outdated license specifier, so I don’t want the files to be parsed.

Yes, but the same applies to option 3 where the glob has a lower priority than the file content. To override a file content, option 2 and 3 would require an explicit override (while in option 1 this is not possible, and therefore not practical). Either as a .license file (option 2 and 3), or as an explicit mention in the YAML file (option 3).

Say that you want to override the content in src/code.py for any reason:

Option 1

Not possible as file content ("as close to the file as possible") is authoritative.

Option 2

  1. Create src/code.py.license

Option 3

  1. Create src/code.py.license
  2. Write in your YAML file (again, this format is not specified yet, but on the roadmap):
- src/code.py:
    license:
      SPDX-License-Identifier: MIT
    copyright: |
      SPDX-FileCopyrightText: 2020 Me

In this option, a glob would still not override the file, so the following things will not work:

- "*":
    license:
      SPDX-License-Identifier: MIT
    copyright: |
      SPDX-FileCopyrightText: 2020 Me

- src/*:
    license:
      SPDX-License-Identifier: MIT
    copyright: |
      SPDX-FileCopyrightText: 2020 Me

- src/*.py:
    license:
      SPDX-License-Identifier: MIT
    copyright: |
      SPDX-FileCopyrightText: 2020 Me

The reason for that: REUSE is designed to allow devs to re-use code by others, who ideally also applied REUSE to their code. So if I just copy someone else's file into my own repo, e.g. in src/, overriding the existing information should be a deliberate step. With a glob, this can happen accidentally, which is why globs have the lowest priority in option 3.
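
The rule that only an explicit entry may override file content can be sketched as a small check. This is an illustration under assumptions: the YAML format is not specified yet, so treating `*`, `?` and `[` as the glob characters is a guess, and both function names are hypothetical.

```python
GLOB_CHARS = "*?["  # assumption: the wildcard syntax is not specified yet


def is_glob(pattern: str) -> bool:
    """Treat any entry containing a wildcard character as a glob."""
    return any(ch in pattern for ch in GLOB_CHARS)


def overrides_file_content(pattern: str, path: str) -> bool:
    """Option 3: only an entry naming the exact path outranks the file's own header."""
    return not is_glob(pattern) and pattern == path


for pattern in ["src/code.py", "*", "src/*", "src/*.py"]:
    print(pattern, overrides_file_content(pattern, "src/code.py"))
```

Only the first entry, which names src/code.py directly, overrides the header in the file; the three globs do not, even though they all cover the file.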

As a reminder: the main idea of REUSE is that devs write the copyright/license information in the file headers as this best preserves this kind of information, also for non-REUSE tooling.

@robinkrahl

> Yes, but the same applies to option 3 where the glob has a lower priority than the file content.

I see, thanks for the explanation! In this case, just ignore my previous comments.

@buxtonpaul

Option 3 for me!
I think this, in combination with multiple REUSE.yaml files, will cover almost all of my current issues.

In addition, I think that if there are multiple explicit definitions for the same file (either in YAML or through a .license file), a warning should be generated.
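
A minimal sketch of such a check, assuming the tool has already collected which sources explicitly define each file. The `explicit_conflicts` function and its input layout are hypothetical and only show the idea.

```python
from typing import Dict, List


def explicit_conflicts(definitions: Dict[str, List[str]]) -> List[str]:
    """Return one warning per file that has more than one explicit definition.

    `definitions` maps a file path to the sources that explicitly define it
    (hypothetical layout, e.g. a .license file and a REUSE.yaml entry).
    """
    warnings = []
    for path, sources in sorted(definitions.items()):
        if len(sources) > 1:
            warnings.append(
                f"{path}: multiple explicit definitions ({', '.join(sources)})"
            )
    return warnings


definitions = {
    "src/code.py": ["src/code.py.license", "REUSE.yaml"],
    "img/logo.png": ["REUSE.yaml"],
}
for message in explicit_conflicts(definitions):
    print(message)
```

Here only src/code.py is reported, since it is defined explicitly in two places; img/logo.png has a single explicit definition and passes silently.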

@mxmehl mxmehl changed the title Define precedence of information Define precedence of information with REUSE.yaml Jun 21, 2023