Add Licence Filter to searchcode server #96

Open
boyter opened this Issue Apr 9, 2017 · 14 comments

Owner

boyter commented Apr 9, 2017

It would be very useful to be able to filter by licence on the code. This is a complex one to implement though. Perhaps just having a simple bit of logic to identify the 10 most common licences.

https://www.blackducksoftware.com/top-open-source-licenses

What would be very cool is that, given the above implementation, we could label a project if it is ever in breach, say by using the guide at http://www.gnu.org/licenses/license-list.html for GPL-compatible software. See http://stackoverflow.com/questions/1978511/is-there-a-chart-of-which-oss-license-is-compatible-with-which for more details.

boyter Apr 9, 2017

Owner

Might be able to leverage https://www.tldrlegal.com/ in this as well.

boyter Apr 9, 2017

Owner

Probably worth reaching out to Nuno of TripleCheck, as it may be possible to leverage and/or integrate the existing databases he has.

boyter Apr 12, 2017

Owner

https://spdx.org/licenses/ is a good place to get the text for each licence

@boyter boyter added the enhancement label Apr 28, 2017

@boyter boyter self-assigned this Apr 28, 2017

pombredanne May 9, 2017

@boyter Hey! Someone pointed to your posting at http://www.boyter.org/2017/05/identify-software-licenses-python-vector-space-search-ngram-keywords/
we discussed a long while ago... If you want decently accurate license detection I suggest you also check out my https://github.com/nexB/scancode-toolkit/

boyter May 9, 2017

Owner

@pombredanne Thanks. I will have a look at it. I would like something more accurate than what I currently have working.

sschuberth May 22, 2017

I'm interested in this feature, too. Just for my understanding, are you looking into filtering on the declared license of a project (e.g. as listed in the POM file of a Maven project), or the actual licenses as referred to from the source code? I see a lot of projects declaring themselves as e.g. Apache-2.0 licensed, but if you really scan the source code you end up finding incompatible GPL-licensed code...

sschuberth May 22, 2017

Also, you could have saved yourself the SPDX web site crawling and HTML to JSON conversion as the license list is already available in JSON format (and others) from https://github.com/spdx/license-list-data 😉
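
As a rough illustration of consuming that JSON, here is a minimal Python sketch. It assumes the list has the shape I understand `json/licenses.json` in that repository to have (a top-level `licenses` array whose entries carry `licenseId`, `name` and `isDeprecatedLicenseId`); the inline sample and the function name `load_licence_names` are made up for the example, so check the field names against the real file before relying on them.

```python
import json

# Tiny inline sample shaped like json/licenses.json (assumed layout).
# The real file lists several hundred licences and more fields per entry.
SAMPLE = """
{
  "licenseListVersion": "sample",
  "licenses": [
    {"licenseId": "MIT", "name": "MIT License",
     "isOsiApproved": true, "isDeprecatedLicenseId": false},
    {"licenseId": "GPL-2.0", "name": "GNU General Public License v2.0 only",
     "isOsiApproved": true, "isDeprecatedLicenseId": true}
  ]
}
"""


def load_licence_names(raw_json, include_deprecated=False):
    """Return a dict mapping SPDX licence id to its full name."""
    data = json.loads(raw_json)
    names = {}
    for entry in data["licenses"]:
        # Skip ids that the list marks as deprecated unless asked otherwise.
        if entry.get("isDeprecatedLicenseId") and not include_deprecated:
            continue
        names[entry["licenseId"]] = entry["name"]
    return names


licences = load_licence_names(SAMPLE)
```

In practice the string would come from a local copy of the file rather than an inline sample, which also avoids any network access during indexing.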

sschuberth May 22, 2017

For reference, if you want something simple that only looks at LICENSE files and friends, https://github.com/benbalter/licensee is a nice project. It's the code that powers GitHub's license information in the repository overview.

boyter May 22, 2017

Owner

@sschuberth So the implementation is still something I am playing around with at this point. Currently it scans through looking for licence files indicating the project's licence, and then additionally scans each file's header to see if there is a header indicating the licence type. So the latter in your question.

As for the SPDX JSON :) Yep, a few people pointed it out to me after the article I posted did the rounds. No harm in switching over to that though.

I had a look at a few implementations. The reason I even started building my own is that I want searchcode server to be a single-install application. That's actually the reason I wrote it in Java to begin with, because every library I needed already existed. To fit in with the indexing pipeline I need more control over how this works, hence trying to write my own parser. It's something I am trying to work in with the idea that it can expand out though.

This is still work in progress at this point. I need to modify how the indexing works to support this and as such am starting to break up that code to be more modular and testable.

sschuberth May 23, 2017

I do get the point about using Java. We basically have the same issue as many building blocks of our internal pipeline are already JVM-based. And while I'm very confident about the scan quality of @pombredanne's ScanCode (written in Python), I see that's why you're mentioning @maxbrito's (Greetings!) TripleCheck engine (written in Java).

boyter May 23, 2017

Owner

Indeed. The biggest issue I ever had with installing software was the dependencies. I did consider using Go to get around this by producing a single binary file, but it was missing a lot of the libraries I wanted, hence choosing Java with the only requirement being the JRE.

For a 3rd party tool, I recommend whatever works, but in my case my goals are for everything to be integrated into an easy to install application where possible.

@boyter boyter added this to In Progress in Release 1.3.13 Feb 13, 2018

boyter Feb 13, 2018

Owner

Notes.

  • Pre-scanning the whole directory and then matching up what licences exist is going to slow down the indexing process too much. As such it's not a good idea to integrate https://github.com/boyter/lc/
  • Work backwards. Check if there are any SPDX indicators, then check if the current folder contains licence files and then drop directories each time.
  • The above can be cached to speed up processing for each file.
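
The "work backwards and cache" idea in the notes above could look roughly like the following Python sketch. It is not the searchcode server implementation (which is Java); the function names, the list of licence filenames and the crude `guess_from_licence_file` check are all hypothetical placeholders chosen for illustration.

```python
import os
import re

# Matches a simple SPDX short identifier; compound expressions such as
# "MIT OR Apache-2.0" would only yield their first id with this pattern.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")

# Hypothetical set of filenames treated as licence files.
LICENCE_FILENAMES = {"license", "license.txt", "license.md",
                     "licence", "licence.txt", "copying"}


def guess_from_licence_file(path):
    """Placeholder guess; a real version would match against known licence texts."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        text = handle.read().lower()
    if "mit license" in text:
        return "MIT"
    if "apache license" in text and "version 2.0" in text:
        return "Apache-2.0"
    return None


def licences_for_directory(directory, root, cache):
    """Look for licence files here, then in each parent up to root, caching results."""
    directory = os.path.abspath(directory)
    root = os.path.abspath(root)
    if directory in cache:
        return cache[directory]
    found = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if name.lower() in LICENCE_FILENAMES and os.path.isfile(full):
            guess = guess_from_licence_file(full)
            if guess:
                found.append(guess)
    if not found and directory != root:
        parent = os.path.dirname(directory)
        if parent != directory:  # stop at the filesystem root
            found = licences_for_directory(parent, root, cache)
    cache[directory] = found
    return found


def licences_for_file(path, root, cache):
    """SPDX identifiers in the file header win; otherwise fall back to directories."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        head = handle.read(4096)  # only the top of the file is inspected
    ids = SPDX_RE.findall(head)
    if ids:
        return ids
    return licences_for_directory(os.path.dirname(path), root, cache)
```

Because the cache is keyed by directory, each folder is listed at most once per indexing run, so the per-file cost is mostly the header read.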

Need to investigate the adding of multiple facets for a single document. I think this should work but might need to modify how the facets are being added.

boyter Feb 28, 2018

Owner

Probably not going to add the vector space calculation for this, for performance reasons. Keyword matching and SPDX identifiers should be enough.
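
To make "keyword matching and SPDX identifiers" concrete, here is a small Python sketch of one way the two could be combined. The keyword table is hand-picked for illustration and is not taken from searchcode server or lc; `identify`, `match_by_keywords` and the `min_ratio` threshold are likewise assumptions.

```python
import re

# Simple SPDX short identifier; compound expressions are not handled here.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")

# Hypothetical keyword table: lowercase phrases that appear in each licence text.
KEYWORDS = {
    "MIT": ["permission is hereby granted, free of charge",
            'the software is provided "as is"'],
    "Apache-2.0": ["apache license", "version 2.0", "www.apache.org/licenses"],
    "GPL-3.0": ["gnu general public license", "version 3"],
}


def normalise(text):
    """Lowercase and collapse whitespace so phrases match across line breaks."""
    return re.sub(r"\s+", " ", text.lower())


def match_by_keywords(text, min_ratio=0.6):
    """Return the licence whose phrases are best covered, or None if too weak."""
    clean = normalise(text)
    best_id, best_ratio = None, 0.0
    for licence_id, phrases in KEYWORDS.items():
        hits = sum(1 for phrase in phrases if phrase in clean)
        ratio = hits / len(phrases)
        if ratio > best_ratio:
            best_id, best_ratio = licence_id, ratio
    return best_id if best_ratio >= min_ratio else None


def identify(text):
    """Prefer an explicit SPDX identifier, then fall back to keyword matching."""
    spdx = SPDX_RE.search(text)
    if spdx:
        return spdx.group(1)
    return match_by_keywords(text)
```

Both checks are plain substring or regex scans over the text, which is much cheaper than building and comparing vectors for every candidate licence.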

boyter Jun 4, 2018

Owner

Parking this for the moment while improving the matching code in https://github.com/boyter/lc/

@boyter boyter removed this from In Progress in Release 1.3.13 Jun 4, 2018
