Project Ideas Improve Copyright Detection Accuracy and Speed

Improve Copyright detection speed and accuracy in ScanCode

Copyright detection is the slowest scanner in ScanCode. It is based on Pygmars, a library derived from NLTK's part-of-speech (PoS) tagging, which is used to build a copyright lexer and grammar, and it makes extensive use of regular expressions.
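
To make the lexer-plus-grammar idea concrete, here is a minimal sketch using only the standard `re` module. It is not Pygmars' API and not ScanCode's actual rule set: the tag names, the `TAG_RULES` table, and the `tag_tokens` and `find_copyrights` helpers are all hypothetical, chosen only to illustrate how regex-based tagging feeds a small grammar.

```python
import re

# Hypothetical tag rules for illustration only; the first matching rule wins.
TAG_RULES = [
    (re.compile(r"^(?:[Cc]opyright|\([Cc]\)|©)$"), "COPY"),  # copyright marker
    (re.compile(r"^(?:19|20)\d\d,?$"), "YR"),                # four-digit year
    (re.compile(r"^[A-Z][\w.&-]*,?$"), "NNP"),               # capitalized word
]


def tag_tokens(text):
    """Tag each whitespace-separated token with the first matching rule."""
    tagged = []
    for token in text.split():
        for pattern, tag in TAG_RULES:
            if pattern.match(token):
                tagged.append((token, tag))
                break
        else:
            tagged.append((token, "NN"))  # default tag when nothing matches
    return tagged


def find_copyrights(tagged):
    """Toy grammar: one or more COPY, then any YR, then one or more NNP."""
    statements = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        while i < n and tagged[i][1] == "COPY":
            i += 1
        if i == start:  # no copyright marker here, move on
            i += 1
            continue
        while i < n and tagged[i][1] == "YR":
            i += 1
        holder_start = i
        while i < n and tagged[i][1] == "NNP":
            i += 1
        if i > holder_start:  # keep only statements that name a holder
            statements.append(" ".join(tok for tok, _ in tagged[start:i]))
    return statements


tagged = tag_tokens("Copyright (c) 2023 Example Corp. and others")
print(find_copyrights(tagged))
```

Every token is tested against the rule list in turn, so the number of regex matches grows with both the text size and the number of rules. This is one reason the speed of the regex engine matters for this scanner.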

The goal of this project is to:

- improve Copyright detection speed by using the Google re2 high-performance regex engine and its Python bindings
- add support for collecting copyright names with Unicode accents (such as German umlauts)
- experiment with ahead-of-time regex compilation using the re2 Set feature
- still ensure that the code works in a slower, degraded fashion without re2 installed (e.g., using the standard re module)
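
The fallback and Unicode goals above interact, which the following sketch tries to show. It assumes an optional `re2` module is importable and exposes a `compile()` function whose result has a `search()` method, as the standard `re` module does. The `letter_class` and `compile_holder_pattern` helpers and the pattern itself are hypothetical, not ScanCode code. The engines differ on Unicode letters: in RE2 syntax `\w` is ASCII-only and `\p{L}` matches any letter, while the standard `re` module does not support `\p{L}` but treats `\w` as Unicode-aware for `str` patterns.

```python
import re

try:
    import re2  # optional dependency, assumed to provide compile()
except ImportError:
    re2 = None  # degrade to the slower standard library engine


def letter_class(engine):
    """Return a character class matching one Unicode letter for this engine."""
    if engine is re:
        # Stdlib re lacks \p{L}; a word character that is neither a digit
        # nor an underscore is a letter, accented letters included.
        return r"[^\W\d_]"
    # RE2 syntax supports Unicode property classes.
    return r"\p{L}"


def compile_holder_pattern(engine=None):
    """Compile a toy pattern capturing a holder name after a copyright marker."""
    engine = engine or re2 or re
    letter = letter_class(engine)
    pattern = (
        r"(?i)copyright\s+(?:\(c\)\s+)?(?:\d{4}\s+)?"
        r"(" + letter + r"+(?: " + letter + r"+)*)"
    )
    return engine.compile(pattern)


# Force the stdlib engine here so the result does not depend on re2.
holder_re = compile_holder_pattern(engine=re)
match = holder_re.search("Copyright (c) 2021 Jörg Müller GmbH")
print(match.group(1))
```

Keeping the engine choice behind one small function means the rest of the code can stay engine-agnostic, as long as the patterns avoid features RE2 does not support, such as lookarounds and backreferences.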

This requires embracing the surprising complexity and ambiguity of parsing what look, on the surface, like simple copyright statements. Luckily, to help you in your quest, we have over 3000 tests available and potentially 9M scans to use as a safety net.

Level
- Advanced
Tech
- Python + re2 (C/C++)
URLs
- https://github.com/nexB/scancode-toolkit/tree/develop/src/cluecode
Mentors
- @JonoYang https://github.com/JonoYang
- @pombredanne https://github.com/pombredanne

http://aboutcode.org/
