Strip Markdown markup prior to comparision #247

Closed
eine opened this issue Dec 14, 2017 · 13 comments

Comments

@eine

eine commented Dec 14, 2017

It seems that gpl-2.0.txt is properly detected (e.g. torvalds/linux/blob/master/COPYING), but gpl-2.0.md is not (e.g. 1138-4EB/license-test/blob/master/LICENSE.md).

Hope I am doing nothing wrong, as LICENSE.md is a valid name according to help.github.com/articles/adding-a-license-to-a-repository, and the content is an exact match to a known license.

@benbalter
Contributor

The issue is that the linked file is Markdown formatted, whereas the known license is not. We can theoretically strip markdown formatting from licenses prior to comparison, as it would not be legally significant.

If you clone this repo locally and run script/git-repo https://github.com/1138-4EB/license-test --license gpl-2.0 you should be able to see the difference.
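
As a rough illustration of the stripping idea above, the following Ruby sketch removes a few Markdown constructs that appear in the diff further down this thread (headings, bold markers, inline links, backslash escapes). `strip_markdown` is a hypothetical helper, not a licensee method, and it handles only a small subset of Markdown.

```ruby
# Hypothetical helper (an assumption, not part of licensee's API): remove a
# few common Markdown constructs so a formatted license and its plain-text
# counterpart normalize to the same string.
def strip_markdown(text)
  text
    .gsub(/^[ ]{0,3}[#]{1,6}[ \t]+/, '')      # ATX headings: "### Preamble" -> "Preamble"
    .gsub(/\[([^\]]+)\]\([^)]+\)/, '\1')      # inline links: "[text](url)" -> "text"
    .gsub(/(\*\*|__)(.+?)\1/, '\2')           # bold: "**0.**" -> "0."
    .gsub(/\\([`*_#\[\]])/, '\1')             # backslash escapes: "\`" -> "`"
end

plain    = "Preamble\n0. This License applies to any program"
markdown = "### Preamble\n**0.** This License applies to any program"

puts strip_markdown(markdown) == plain
```

A regex pass like this is cheap but easy to fool (nested emphasis, reference-style links, and tables are not handled), which is the trade-off discussed later in the thread.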

@benbalter changed the title from "gpl-2.0.md not detected" to "Strip Markdown markup prior to comparision" on Dec 14, 2017
@benbalter
Contributor

We'll likely want to remove the non-word characters someplace in https://github.com/benbalter/licensee/blob/master/lib/licensee/content_helper.rb.

@eine
Author

eine commented Dec 14, 2017

> The issue is that the linked file is Markdown formatted, whereas the known license is not.

What's the point in suggesting that the file can be named LICENSE.md if using Markdown format will prevent it from being detected? Note that the Markdown version is as well known as the plain-text one, since both are provided by the same source. Is this limitation related to GitHub rather than to the library?

> We can theoretically strip markdown formatting from licenses prior to comparison, as it would not be legally significant.

Indeed, it seems that it is partially done already: https://github.com/benbalter/licensee/blob/master/lib/licensee/content_helper.rb#L145-L152

> If you clone this repo locally and run script/git-repo https://github.com/1138-4EB/license-test --license gpl-2.0 you should be able to see the difference.

I tried, but it is not easy to see it:

./script/git-repo https://github.com/1138-4EB/license-test --license gpl-2.0
License: Other
Matched files: ["LICENSE.md"]
LICENSE.md:
  Content hash: dd2e94bd5dfa5ecd80572192e5f41682075579a8
  Attribution: Copyright (C) 1989, 1991 Free Software Foundation, Inc.
  License: Other
  Closest licenses:
    * LPPL-1.3c similarity: 49.63%
    * CC-BY-SA-4.0 similarity: 39.24%
Comparing to GNU General Public License v2.0:
  Input length: 17583
  License length: 14697
  Similarity: 93.19%

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got 'root@e6bd66637988.(none)')
diff --git a/LICENSE b/LICENSE
index 715b803..5e9d019 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,9 +1,9 @@
51 franklin street, fifth floor, boston, ma [-02110-1301-]{+02110-1301,+} usa everyone is
permitted to copy and distribute verbatim copies of this license document, but
changing it is not allowed. {+###+} preamble the licenses for most software are
designed to take away your freedom to share and change it. by contrast, the gnu
general public license is intended to guarantee your freedom to share and change
free software--to make sure the software is free for all its users. this general
public license applies to most of the free software foundation's software and to
any other program whose authors commit to using it. (some other free software
foundation software is covered by the gnu lesser general public license
@@ -32,49 +32,49 @@ redistributors of a free program will individually obtain patent licenses, in
effect making the program proprietary. to prevent this, we have made it clear
that any patent must be licensed for everyone's free use or not licensed at all.
the precise terms and conditions for copying, distribution and modification
follow. [-gnu general public license-]{+###+} terms and conditions for copying,
distribution and modification
[-0.-]{+**0.**+} this license applies to any program or other work which contains a notice
placed by the copyright holder saying it may be distributed under the terms of
this general public license. the "program", below, refers to any such program or
work, and a "work based on the program" means either the program or any
derivative work under copyright law: that is to say, a work containing the
program or a portion of it, either verbatim or with modifications and/or
translated into another language. (hereinafter, translation is included without
limitation in the term "modification".) each licensee is addressed as "you".
activities other than copying, distribution and modification are not covered by
this license; they are outside its scope. the act of running the program is not
restricted, and the output from the program is covered only if its contents
constitute a work based on the program (independent of having been made by
running the program). whether that is true depends on what the program does.
[-1.-]{+**1.**+} you may copy and distribute verbatim copies of the program's source code
as you receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice and
disclaimer of warranty; keep intact all the notices that refer to this license
and to the absence of any warranty; and give any other recipients of the program
a copy of this license along with the program. you may charge a fee for the
physical act of transferring a copy, and you may at your option offer warranty
protection in exchange for a fee. [-2.-]{+**2.**+} you may modify your copy or copies of
the program or any portion of it, thus forming a work based on the program, and
copy and distribute such modifications or work under the terms of section 1
above, provided that you also meet all of these conditions: [-a)-]{+**a)**+} you must
cause the modified files to carry prominent notices stating that you changed the
files and the date of any change. [-b)-]{+**b)**+} you must cause any work that you
distribute or publish, that in whole or in part contains or is derived from the
program or any part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this license. [-c)-]{+**c)**+} if the modified program normally
reads commands interactively when run, you must cause it, when started running
for such interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a notice that there
is no warranty (or else, saying that you provide a warranty) and that users may
redistribute the program under these conditions, and telling the user how to
view a copy of this license. (exception: if the program itself is interactive
but does not normally print such an announcement, your work based on the program
is not required to print an announcement.) these requirements apply to the
modified work as a whole. if identifiable sections of that work are not derived
from the program, and can be reasonably considered independent and separate
works in themselves, then this license, and its terms, do not apply to those
sections when you distribute them as separate works. but when you distribute the
same sections as part of a whole which is a work based on the program, the
distribution of the whole must be on the terms of this license, whose
permissions for other licensees extend to the entire whole, and thus to each and
every part regardless of who wrote it. thus, it is not the intent of this
@@ -83,52 +83,52 @@ rather, the intent is to exercise the right to control the distribution of
derivative or collective works based on the program. in addition, mere
aggregation of another work not based on the program with the program (or with a
work based on the program) on a volume of a storage or distribution medium does
not bring the other work under the scope of this license. [-3.-]{+**3.**+} you may copy
and distribute the program (or a work based on it, under section 2) in object
code or executable form under the terms of sections 1 and 2 above provided that
you also do one of the following: [-a)-]{+**a)**+} accompany it with the complete
corresponding machine-readable source code, which must be distributed under the
terms of sections 1 and 2 above on a medium customarily used for software
interchange; or, [-b)-]{+**b)**+} accompany it with a written offer, valid for at least
three years, to give any third party, for a charge no more than your cost of
physically performing source distribution, a complete machine-readable copy of
the corresponding source code, to be distributed under the terms of sections 1
and 2 above on a medium customarily used for software interchange; or, [-c)-]{+**c)**+}
accompany it with the information you received as to the offer to distribute
corresponding source code. (this alternative is allowed only for noncommercial
distribution and only if you received the program in object code or executable
form with such an offer, in accord with subsection b above.) the source code for
a work means the preferred form of the work for making modifications to it. for
an executable work, complete source code means all the source code for all
modules it contains, plus any associated interface definition files, plus the
scripts used to control compilation and installation of the executable. however,
as a special exception, the source code distributed need not include anything
that is normally distributed (in either source or binary form) with the major
components (compiler, kernel, and so on) of the operating system on which the
executable runs, unless that component itself accompanies the executable. if
distribution of executable or object code is made by offering access to copy
from a designated place, then offering equivalent access to copy the source code
from the same place counts as distribution of the source code, even though third
parties are not compelled to copy the source along with the object code. [-4.-]{+**4.**+}
you may not copy, modify, sublicense, or distribute the program except as
expressly provided under this license. any attempt otherwise to copy, modify,
sublicense or distribute the program is void, and will automatically terminate
your rights under this license. however, parties who have received copies, or
rights, from you under this license will not have their licenses terminated so
long as such parties remain in full compliance. [-5.-]{+**5.**+} you are not required to
accept this license, since you have not signed it. however, nothing else grants
you permission to modify or distribute the program or its derivative works.
these actions are prohibited by law if you do not accept this license.
therefore, by modifying or distributing the program (or any work based on the
program), you indicate your acceptance of this license to do so, and all its
terms and conditions for copying, distributing or modifying the program or works
based on it. [-6.-]{+**6.**+} each time you redistribute the program (or any work based on
the program), the recipient automatically receives a license from the original
licensor to copy, distribute or modify the program subject to these terms and
conditions. you may not impose any further restrictions on the recipients'
exercise of the rights granted herein. you are not responsible for enforcing
compliance by third parties to this license. [-7.-]{+**7.**+} if, as a consequence of a
court judgment or allegation of patent infringement or for any other reason (not
limited to patent issues), conditions are imposed on you (whether by court
order, agreement or otherwise) that contradict the conditions of this license,
they do not excuse you from the conditions of this license. if you cannot
@@ -150,13 +150,13 @@ system in reliance on consistent application of that system; it is up to the
author/donor to decide if he or she is willing to distribute software through
any other system and a licensee cannot impose that choice. this section is
intended to make thoroughly clear what is believed to be a consequence of the
rest of this license. [-8.-]{+**8.**+} if the distribution and/or use of the program is
restricted in certain countries either by patents or by copyrighted interfaces,
the original copyright holder who places the program under this license may add
an explicit geographical distribution limitation excluding those countries, so
that distribution is permitted only in or among countries not thus excluded. in
such case, this license incorporates the limitation as if written in the body of
this license. [-9.-]{+**9.**+} the free software foundation may publish revised and/or new
versions of the general public license from time to time. such new versions will
be similar in spirit to the present version, but may differ in detail to address
new problems or concerns. each version is given a distinguishing version number.
@@ -165,13 +165,13 @@ and "any later version", you have the option of following the terms and
conditions either of that version or of any later version published by the free
software foundation. if the program does not specify a version number of this
license, you may choose any version ever published by the free software
foundation. [-10.-]{+**10.**+} if you wish to incorporate parts of the program into other
free programs whose distribution conditions are different, write to the author
to ask for permission. for software which is copyrighted by the free software
foundation, write to the free software foundation; we sometimes make exceptions
for this. our decision will be guided by the two goals of preserving the free
status of all derivatives of our free software and of promoting the sharing and
reuse of software generally. [-no-]{+**no+} warranty [-11.-]{+11.**+} because the program is licensed
free of charge, there is no warranty for the program, to the extent permitted by
applicable law. except when otherwise stated in writing the copyright holders
and/or other parties provide the program "as is" without warranty of any kind,
@@ -179,11 +179,48 @@ either expressed or implied, including, but not limited to, the implied
warranties of merchantability and fitness for a particular purpose. the entire
risk as to the quality and performance of the program is with you. should the
program prove defective, you assume the cost of all necessary servicing, repair
or correction. [-12.-]{+**12.**+} in no event unless required by applicable law or agreed
to in writing will any copyright holder, or any other party who may modify
and/or redistribute the program as permitted above, be liable to you for
damages, including any general, special, incidental or consequential damages
arising out of the use or inability to use the program (including but not
limited to loss of data or data being rendered inaccurate or losses sustained by
you or third parties or a failure of the program to operate with any other
programs), even if such holder or other party has been advised of the
possibility of such damages. {+### end of terms and conditions ### how to apply+}
{+these terms to your new programs if you develop a new program, and you want it+}
{+to be of the greatest possible use to the public, the best way to achieve this+}
{+is to make it free software which everyone can redistribute and change under+}
{+these terms. to do so, attach the following notices to the program. it is safest+}
{+to attach them to the start of each source file to most effectively convey the+}
{+exclusion of warranty; and each file should have at least the "copyright" line+}
{+and a pointer to where the full notice is found. one line to give the program's+}
{+name and an idea of what it does. copyright (c) yyyy name of author this program+}
{+is free software; you can redistribute it and/or modify it under the terms of+}
{+the gnu general public license as published by the free software foundation;+}
{+either version 2 of the license, or (at your option) any later version. this+}
{+program is distributed in the hope that it will be useful, but without any+}
{+warranty; without even the implied warranty of merchantability or fitness for a+}
{+particular purpose. see the gnu general public license for more details. you+}
{+should have received a copy of the gnu general public license along with this+}
{+program; if not, write to the free software foundation, inc., 51 franklin+}
{+street, fifth floor, boston, ma 02110-1301, usa. also add information on how to+}
{+contact you by electronic and paper mail. if the program is interactive, make it+}
{+output a short notice like this when it starts in an interactive mode:+}
{+gnomovision version 69, copyright (c) year name of author gnomovision comes with+}
{+absolutely no warranty; for details type `show w'. this is free software, and+}
{+you are welcome to redistribute it under certain conditions; type `show c' for+}
{+details. the hypothetical commands \`show w' and \`show c' should show the+}
{+appropriate parts of the general public license. of course, the commands you use+}
{+may be called something other than \`show w' and \`show c'; they could even be+}
{+mouse-clicks or menu items--whatever suits your program. you should also get+}
{+your employer (if you work as a programmer) or your school, if any, to sign a+}
{+"copyright disclaimer" for the program, if necessary. here is a sample; alter+}
{+the names: yoyodyne, inc., hereby disclaims all copyright interest in the+}
{+program `gnomovision' (which makes passes at compilers) written by james hacker.+}
{+signature of ty coon, 1 april 1989 ty coon, president of vice this general+}
{+public license does not permit incorporating your program into proprietary+}
{+programs. if your program is a subroutine library, you may consider it more+}
{+useful to permit linking proprietary applications with the library. if this is+}
{+what you want to do, use the [gnu lesser general public+}
{+license](https://www.gnu.org/licenses/lgpl.html) instead of this license.+}

@benbalter
Contributor

> What's the point in suggesting that the filename can be named LICENSE.md if using Markdown format will prevent it from being detected?

A file that has a .md extension, but is plain text, will render as styled HTML, whereas a .txt file will render as a <PRE> block.

wking added a commit to wking/license-list-XML that referenced this issue Dec 14, 2017
Upstream is not consistent about this:

  $ curl -s https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html | grep USA
  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA
  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
  $ curl -s https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt | grep USA
   51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
      51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

so support both forms.  I've stuck with our old comma version as
canonical.

Reported by 1138-4EB [1].

[1]: licensee/licensee#247 (comment)
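
The two upstream spellings shown in the commit message differ only in spacing after "MA" and in the comma before "USA". A minimal Ruby sketch of a pattern that tolerates both follows; `FSF_ADDRESS` is a hypothetical name used for illustration and is not taken from license-list-XML or licensee.

```ruby
# Hypothetical pattern (an assumption for illustration): accept the FSF
# address with or without the comma before "USA", and with one or more
# spaces after "MA".
FSF_ADDRESS = /51 Franklin Street, Fifth Floor, Boston, MA\s+02110-1301,? USA/

html_form = '51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA'
text_form = '51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA'

# Both upstream variants should satisfy the same pattern.
puts [html_form, text_form].all? { |line| line.match?(FSF_ADDRESS) }
```
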
@wking
Contributor

wking commented Dec 14, 2017 via email

@js-choi

js-choi commented Dec 19, 2017

The WHATWG has run into what is probably this issue, during the transition of its specifications for HTML, etc. from CC0 to CC-BY (whatwg/sg#51). They were unable to get GitHub to correctly display those specifications’ repository licenses as not CC0 using Markdown license files; after realizing what was happening, they are switching to plain-text license files instead.

@benbalter
Contributor

Thinking through how to implement this, rendering the Markdown to HTML and stripping tags feels too heavyweight, and using regex feels too flimsy.

I believe we can strip all non-word characters, as they shouldn't have legal significance for length comparison purposes (and are stripped prior to the wordset anyway).
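
A minimal sketch of that idea, assuming a hypothetical `normalize_for_length` helper (this is not licensee's actual implementation): everything that is not a letter, digit, or whitespace is dropped before lengths are compared, so `**0.**` and `0.` contribute the same number of characters.

```ruby
# Hypothetical helper (an assumption, not licensee's code): keep only
# letters, digits and whitespace, then collapse whitespace, so that markup
# characters do not affect the compared length.
def normalize_for_length(text)
  text.downcase
      .gsub(/[^[:alnum:][:space:]]/, '')  # drop punctuation and markup characters
      .gsub(/[[:space:]]+/, ' ')          # collapse runs of whitespace
      .strip
end

plain    = "Preamble\n0. This License applies to any program"
markdown = "### Preamble\n**0.** This License applies to any program"

puts normalize_for_length(plain).length == normalize_for_length(markdown).length
```

Note that this also removes ordinary punctuation such as commas and hyphens, which is acceptable only if, as the comment argues, those characters are not significant for the length comparison.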

@benbalter
Contributor

With #249, this is now detected, but with two odd changes:

➜  licensee git:(strip-markdown) ✗ script/git-repo https://github.com/1138-4EB/license-test --license gpl-2.0
License: GNU General Public License v2.0
Matched files: ["LICENSE.md"]
LICENSE.md:
  Content hash: 1b4871ae29d7f3bcbf940c9a030e75d374dc066f
  Attribution: Copyright (C) 1989, 1991 Free Software Foundation, Inc.
  Confidence: 100.00%
  Matcher: Licensee::Matchers::Dice
  License: GNU General Public License v2.0
Comparing to GNU General Public License v2.0:
  Input length: 14612
  License length: 14638
  Similarity: 100.00%
diff --git a/LICENSE b/LICENSE
index 4f35c7f..74a8c0e 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,8 +1,8 @@
51 franklin street, fifth floor, boston, ma [-02110-1301-]{+02110-1301,+} usa everyone is
permitted to copy and distribute verbatim copies of this license document, but
changing it is not allowed. preamble the licenses for most software are designed
to take away your freedom to share and change it. by contrast, the gnu general
public license is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. this general
public license applies to most of the free software foundation's software and to
any other program whose authors commit to using it. some other free software
@@ -32,82 +32,82 @@ redistributors of a free program will individually obtain patent licenses, in
effect making the program proprietary. to prevent this, we have made it clear
that any patent must be licensed for everyone's free use or not licensed at all.
the precise terms and conditions for copying, distribution and modification
follow.[-gnu general public license-] terms and conditions for copying, distribution and modification 0. this
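
The output above names `Licensee::Matchers::Dice` as the matcher. As a rough illustration of the underlying idea only, here is a Sørensen-Dice coefficient computed over sets of unique words. The helper names `wordset` and `dice_similarity` are assumptions made for this sketch, and the real matcher may apply further adjustments (for example for length differences) that are not shown here.

```ruby
require 'set'

# Hypothetical helper: split text into a set of unique lowercase words,
# ignoring punctuation and markup characters.
def wordset(text)
  text.downcase.scan(/[[:alnum:]]+/).to_set
end

# Sørensen-Dice coefficient of the two word sets, expressed as a percentage:
# 2 * |A ∩ B| / (|A| + |B|) * 100.
def dice_similarity(a, b)
  set_a = wordset(a)
  set_b = wordset(b)
  total = set_a.size + set_b.size
  return 100.0 if total.zero?

  2.0 * (set_a & set_b).size / total * 100
end

plain    = '0. this license applies to any program'
markdown = '**0.** this license applies to any program'

puts dice_similarity(plain, markdown)
```

Because the word sets ignore `*` and `#`, Markdown emphasis alone does not lower the score in this sketch; added or removed words do.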

@eine
Author

eine commented Dec 28, 2017

Thanks @benbalter! Just a couple of questions:

  • Does the fix apply to GPLv2 only, or is it extended to any license?
  • Shall I drop an email to GitHub support (as a user) to let them know that I'd like this feature updated? Or do you already have some 'internal' notification process?

@benbalter
Contributor

> Does the fix apply to GPLv2 only, or is it extended to any license?

Any license

> Shall I drop an email to GitHub support (as a user) to let them know that I'd like this feature updated? Or do you already have some 'internal' notification process?

We'll need to update Licensee on GitHub.com, which we do on a regular basis.

@eine
Author

eine commented Dec 28, 2017

Great. Thanks again for the fast and effective response.

@Elioty

Elioty commented Jan 30, 2018

Has this gone live on GitHub?!

I have those licenses in markdown format on some of my projects:

Or any tip on making licensee recognise my GPL-v3 markdown-formatted license?

@eine
Author

eine commented Apr 8, 2018

@Elioty, you should have a look at idleberg/Creative-Commons-Markdown#10. It seems that, although licensee can properly detect the license in some repos, GitHub is failing to display it.

However, this is not the case for your markdown formatted GPL-v3. On the one hand, the format you use is not exactly the same as the markdown version available at gnu.org:

On the other hand, licensee currently detects neither of them when executed in the root of the repo. However, the one from GNU is properly detected if provided as an argument (quite weird):

/# git clone https://github.com/xenomorphales/xenocad/
...
/# cd xenocad/
/xenocad# licensee
License: BSD 3-Clause "New" or "Revised" License
Matched files: ["LICENSE.md"]
LICENSE.md:
  Content hash: a30dc8484c208e0089e81e338659160b15287c3a
  Attribution: Copyright (c) 2017, Xenomorphales
  Confidence: 98.72%
  Matcher: Licensee::Matchers::Dice
  License: BSD 3-Clause "New" or "Revised" License
  Closest licenses:
    * BSD-3-Clause similarity: 98.72%

/xenocad# cd ..

/# git clone https://github.com/xenomorphales/hermes
...
/# cd hermes/
/hermes# licensee
License: Other
Matched files: ["LICENSE.md"]
LICENSE.md:
  Content hash: 390127b4a7314953bdc4aaa67a3faa201f057d14
  Attribution: Copyright (C) 2007 [Free Software Foundation, Inc.](http://fsf.org/)
  License: Other

/hermes# wget https://www.gnu.org/licenses/gpl.md
...
gpl.md              100%[===================>]  34.10K  --.-KB/s    in 0.1s
2018-04-08 02:26:54 (258 KB/s) - 'gpl.md' saved [34916/34916]

/hermes# licensee
License: Other
Matched files: ["LICENSE.md"]
LICENSE.md:
  Content hash: 390127b4a7314953bdc4aaa67a3faa201f057d14
  Attribution: Copyright (C) 2007 [Free Software Foundation, Inc.](http://fsf.org/)
  License: Other

/hermes# mv LICENSE.md LICENSE.old
/hermes# mv gpl.md LICENSE.md

/hermes# licensee
License: Other
Matched files: ["LICENSE.md"]
LICENSE.md:
  Content hash: 390127b4a7314953bdc4aaa67a3faa201f057d14
  Attribution: Copyright (C) 2007 [Free Software Foundation, Inc.](http://fsf.org/)
  License: Other

/hermes# licensee LICENSE.md
License: GNU General Public License v3.0
Matched files: ["LICENSE.md"]
LICENSE.md:
  Content hash: 2787d013cb5b3a5c49d07628b98206e8146aaefb
  Attribution: Copyright (C) 2007 Free Software Foundation, Inc.
  Confidence: 99.85%
  Matcher: Licensee::Matchers::Dice
  License: GNU General Public License v3.0
  Closest licenses:
    * GPL-3.0 similarity: 99.85%
    * AGPL-3.0 similarity: 95.58%

/hermes# licensee LICENSE.old
License: Other
Matched files: ["LICENSE.old"]
LICENSE.old:
  Content hash: 390127b4a7314953bdc4aaa67a3faa201f057d14
  Attribution: Copyright (C) 2007 [Free Software Foundation, Inc.](http://fsf.org/)
  License: Other
