
Project Ideas Improve Copyright Detection Accuracy and Speed

Philippe Ombredanne edited this page Feb 15, 2023 · 4 revisions

Improve Copyright detection speed and accuracy in ScanCode

Copyright detection is the slowest scanner in ScanCode. It is based on Pygmars, which is derived from NLTK part-of-speech (PoS) tagging and is used to build a copyright lexer and grammar. It makes extensive use of regular expressions.
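To give a feel for what a regex-driven lexer does, here is a minimal sketch using only the standard `re` module. It is not the Pygmars API or the actual ScanCode rules: the `TAG_RULES` table, the tag names and the `tag_tokens` function are hypothetical and only illustrate the general idea of tagging each token with the first matching pattern.

```python
import re

# Hypothetical (pattern, tag) rules, tried in order: the first match wins.
# The real ScanCode rule set is far larger and more subtle than this.
TAG_RULES = [
    (re.compile(r"^(?:[Cc]opyright|\([Cc]\)|©)$"), "COPY"),
    (re.compile(r"^(?:19|20)\d{2},?$"), "YEAR"),
    (re.compile(r"^[A-Z][\w.-]*,?$"), "NAME"),
]


def tag_tokens(text):
    """Split text on whitespace and tag each token with the first matching rule."""
    tagged = []
    for token in text.split():
        for pattern, tag in TAG_RULES:
            if pattern.match(token):
                tagged.append((token, tag))
                break
        else:
            # No rule matched this token.
            tagged.append((token, "OTHER"))
    return tagged
```

Every token is tested against the rules until one matches, so the cost grows with both the number of tokens and the number of patterns. This is one reason why the speed of the regex engine matters for this scanner.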

The goal of this project is to:

  • improve Copyright detection speed by using the Google re2 high performance regex engine and its Python bindings
  • add support for collecting copyright names with unicode accents (such as German umlauts)
  • using re2, experiment with ahead-of-time regex compilation with the re2 Set feature
  • still ensure that the code works in a slower, degraded fashion without re2 installed (e.g., using the standard re module)
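The fallback and unicode goals above could be sketched as follows. This is only an illustration under stated assumptions: it assumes the installed re2 binding is importable as `re2` and exposes a `re`-like `compile()` and `search()`, and the `HOLDER` pattern and `find_holder` function are hypothetical, not ScanCode code. Note that RE2 treats `\w` as ASCII-only and uses `\p{L}` for unicode letters, while the standard `re` module does not support `\p{L}`, so the letter class is picked per engine.

```python
import re

try:
    # Assumption: an re2 binding importable as "re2" with a re-like API.
    import re2 as engine
    LETTER = r"\p{L}"  # unicode letter class understood by RE2
    HAVE_RE2 = True
except ImportError:
    # Degraded mode: slower, but no extra dependency.
    engine = re
    LETTER = r"[^\W\d_]"  # unicode letters in re: word chars minus digits and "_"
    HAVE_RE2 = False

# Hypothetical pattern: "Copyright", optional "(c)", a year, then a holder name
# made of one or more letter-only words (accented letters included).
HOLDER = engine.compile(
    r"[Cc]opyright\s+(?:\([Cc]\)\s+)?\d{4}\s+"
    r"(" + LETTER + r"+(?:\s" + LETTER + r"+)*)"
)


def find_holder(text):
    """Return the holder name found in text, or None if there is no match."""
    match = HOLDER.search(text)
    return match.group(1) if match else None
```

For example, `find_holder("Copyright (c) 2023 Jürgen Müller")` is meant to return the full accented name with either engine. Keeping the engine choice in one place means the rest of the code does not need to know which engine is active.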

This requires embracing the surprising complexity and ambiguity of parsing what look, on the surface, like simple copyright statements. Luckily, to help you in your quest, we have over 3000 tests available and potentially 9M scans to use as a safety net.
