Detect unknown licenses and indirect license references in ScanCode
This project has two goals: first, to improve the detection of unknown licenses and license references; second, to improve license detection speed.
Some unknown licenses may not be detected correctly. Detecting them requires some creative thinking and research. One possibility is to build an index of n-grams over all the available license texts and license rules, and to use it to find longer sequences of license-like text.
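The n-gram idea can be illustrated with a minimal sketch. It is not ScanCode's implementation: the function names (`tokenize`, `build_index`, `find_license_like_spans`), the word-level tokenization, the n-gram size of 4 and the minimum span length are all assumptions chosen for illustration. The index is a plain set of word n-grams seen in known license texts, and a span is reported wherever enough consecutive n-grams of the scanned text are found in that set.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric words only.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    # All contiguous n-token windows, as hashable tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(texts, n=4):
    # Set of every n-gram seen in the known license texts and rules.
    index = set()
    for text in texts:
        index.update(ngrams(tokenize(text), n))
    return index


def find_license_like_spans(text, index, n=4, min_length=6):
    """
    Return (start, end) token spans (end exclusive) covered by runs of
    consecutive n-grams present in the index, at least min_length tokens long.
    """
    grams = ngrams(tokenize(text), n)
    spans = []
    run_start = None
    for i, gram in enumerate(grams):
        if gram in index:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The run ended at n-gram i - 1, which covers tokens up to i - 1 + n.
            spans.append((run_start, i - 1 + n))
            run_start = None
    if run_start is not None:
        spans.append((run_start, len(grams) - 1 + n))
    return [(start, end) for start, end in spans if end - start >= min_length]


known_texts = [
    "Permission is hereby granted, free of charge, to any person "
    "obtaining a copy of this software",
]
index = build_index(known_texts, n=4)
scanned = "foo bar permission is hereby granted free of charge to any person baz qux"
spans = find_license_like_spans(scanned, index, n=4, min_length=6)
```

A real index would likely need to handle near matches, stop words and memory use over thousands of rules, which this sketch ignores.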
Some license references such as "see license in file LICENSE.txt" are reported as unknown license references. Instead, we could follow the reference to the named file and report the license that was detected there. The license rule YAML data files already contain a "referenced_filenames" attribute that was added to support this feature.
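Following a reference could look roughly like the sketch below. Only the "referenced_filenames" attribute name comes from the rule data; everything else is an assumption for illustration: the function name `resolve_referenced_license`, the `licenses_by_path` mapping from scanned paths to detected license expressions, and the choice to search the referencing file's directory first and then each parent directory.

```python
import posixpath


def resolve_referenced_license(source_path, referenced_filenames, licenses_by_path):
    """
    Return (path, license_expression) for the nearest scanned file whose name
    is one of referenced_filenames, or None if no such file was scanned.
    The upward directory search is an assumed lookup strategy.
    """
    directory = posixpath.dirname(source_path)
    while True:
        for name in referenced_filenames:
            candidate = posixpath.join(directory, name)
            if candidate != source_path and candidate in licenses_by_path:
                return candidate, licenses_by_path[candidate]
        parent = posixpath.dirname(directory)
        if parent == directory:
            # Reached the top ("" for relative paths, "/" for absolute ones).
            return None
        directory = parent


# Hypothetical scan results: path -> detected license expression.
licenses_by_path = {
    "project/LICENSE.txt": "mit",
    "project/src/main.c": "unknown-license-reference",
}
resolved = resolve_referenced_license(
    "project/src/main.c", ["LICENSE.txt"], licenses_by_path
)
missing = resolve_referenced_license(
    "project/src/main.c", ["COPYING"], licenses_by_path
)
```

Here `resolved` points at `project/LICENSE.txt`, so the unknown reference in `main.c` could be reported as the license found in that file, while `missing` is `None` and the detection would stay an unknown reference.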
As a bonus, license detection speed could be improved, possibly by porting some critical code sections to C or Rust and/or by using Cython. This would first require proper profiling, followed by selective optimization driven by the profiling results.
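A profiling pass of this kind can start with the standard library `cProfile` and `pstats` modules, as in the sketch below. The helper `profile_call` and the stand-in workload `count_tokens` are hypothetical; in practice the profiled callable would be whichever ScanCode detection entry point is under study.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """
    Run func under cProfile and return (result, report), where report lists
    the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def count_tokens(text):
    # Stand-in workload (hypothetical); replace with the code being profiled.
    return len(text.lower().split())


result, report = profile_call(count_tokens, "Permission is hereby granted " * 1000)
print(report)
```

Functions that dominate the cumulative time in such a report are the natural candidates for a port to Cython, C or Rust; the rest can usually stay in Python.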
See these tickets for reference:
- https://github.com/nexB/scancode-toolkit/issues/1675 [RFC] Revamp "unknown" license detection: it contains detailed design directions
- https://github.com/nexB/scancode-toolkit/issues/2257 "Report separately unknown licenses and related non-conclusive license detections"
Level
- Advanced
Tech
- Python
- Potentially Cython, C/C++, Rust, Go
Mentors
- @mjherzog https://github.com/mjherzog
- @pombredanne https://github.com/pombredanne