Description
I've set up purldb and scancode.io locally, where I run make run_visit and make run_map to visit and map Maven packages. I then run make request_scans and make process_scans to get information on the Resources in the packages we visit and map. I've noticed that the scan requests we send off to scancode.io are for multiple versions of the same package. This causes a few problems:
- We cannot possibly scan every single package in the Maven index to create our matching index
- In the case of the directory structure fingerprint, if there is little or no difference between the versions of a package, then we are essentially scanning and indexing the same package repeatedly
- We can't populate our matching index with a single scancode.io instance; we need multiple instances so we can have multiple scans going at once
For the first two issues, we need to come up with a new way to group and index fingerprints. A starting idea would be to make the index a bit more general. Currently, we create directory fingerprints for every package we map. If two packages we index are the same package but different versions, then we may store the same fingerprints twice. We could do something along the lines of indexing fingerprints to a package in general, rather than to a specific package version.
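To make the idea of indexing to a package rather than a package version concrete, here is a rough sketch. It is not how purldb stores fingerprints today; `split_purl` and `FingerprintIndex` are hypothetical names, the purl parsing is deliberately simplified (plain string splitting, no validation), and the fingerprint values are made up.

```python
from collections import defaultdict


def split_purl(purl):
    """Split a purl string into (versionless purl, version).

    Simplified assumption: drop any '#subpath' and '?qualifiers',
    then treat everything after the last '@' as the version.
    """
    base = purl.split("#", 1)[0].split("?", 1)[0]
    head, sep, version = base.rpartition("@")
    if not sep:
        return base, None
    return head, version


class FingerprintIndex:
    """Hypothetical in-memory index keyed by fingerprint.

    Each fingerprint maps to the versionless packages that contain it,
    and each package keeps the set of versions where it was seen, so an
    identical fingerprint across versions is stored only once.
    """

    def __init__(self):
        self._entries = defaultdict(lambda: defaultdict(set))

    def add(self, purl, fingerprint):
        package, version = split_purl(purl)
        self._entries[fingerprint][package].add(version)

    def match(self, fingerprint):
        found = self._entries.get(fingerprint, {})
        return {
            package: sorted(v for v in versions if v is not None)
            for package, versions in found.items()
        }

    def __len__(self):
        # Number of distinct fingerprints stored.
        return len(self._entries)


index = FingerprintIndex()
# Made-up fingerprint value shared by two versions of the same package.
index.add("pkg:maven/org.apache.commons/commons-lang3@3.11", "aaaa0001")
index.add("pkg:maven/org.apache.commons/commons-lang3@3.12.0", "aaaa0001")
print(len(index), index.match("aaaa0001"))
```

With this shape, a match reports the package first and the versions second, which might also let us decide to skip scanning a new version when its fingerprints are already present.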
For the third issue, we will have to flip the current scan queue request model. purldb will have a queue of packages that it wants scanned, and it will be up to scancode.io to poll purldb to see what needs to be scanned. scancode.io would poll purldb, get the package that needs to be scanned, scan and fingerprint it, then send the results back to purldb. This issue is tracked at #14.
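As a sketch of the flipped model, the snippet below keeps the queue on the purldb side and lets any number of workers pull from it. Everything here is an assumption for illustration: `ScanQueue`, `get_next`, `submit_results`, and `run_worker` are hypothetical names, the queue is in-memory rather than exposed over HTTP, and the `scan` callable stands in for a real scancode.io scan.

```python
from collections import deque


class ScanQueue:
    """Hypothetical purldb-side queue of packages waiting to be scanned."""

    def __init__(self):
        self._pending = deque()
        self._in_progress = {}  # purl -> id of the worker that claimed it
        self.results = {}       # purl -> scan data sent back by a worker

    def enqueue(self, purl):
        # Skip packages that are already queued, claimed, or scanned.
        if purl in self._pending or purl in self._in_progress or purl in self.results:
            return False
        self._pending.append(purl)
        return True

    def get_next(self, worker_id):
        # Called by a polling worker; returns None when nothing is waiting.
        if not self._pending:
            return None
        purl = self._pending.popleft()
        self._in_progress[purl] = worker_id
        return purl

    def submit_results(self, worker_id, purl, scan_data):
        # Only the worker that claimed the package may report on it.
        if self._in_progress.get(purl) != worker_id:
            raise ValueError(f"{purl} is not claimed by {worker_id}")
        del self._in_progress[purl]
        self.results[purl] = scan_data


def run_worker(queue, worker_id, scan):
    """Poll the queue until it is empty; return how many packages were scanned."""
    scanned = 0
    while (purl := queue.get_next(worker_id)) is not None:
        queue.submit_results(worker_id, purl, scan(purl))
        scanned += 1
    return scanned


queue = ScanQueue()
queue.enqueue("pkg:maven/org.apache.commons/commons-lang3@3.12.0")
queue.enqueue("pkg:maven/org.apache.commons/commons-lang3@3.12.0")  # duplicate, ignored
run_worker(queue, "worker-1", scan=lambda purl: {"purl": purl, "fingerprints": []})
print(queue.results)
```

Because workers claim packages one at a time, adding more scancode.io instances would only mean more callers of `get_next`; a real version would also need timeouts so that a package claimed by a crashed worker returns to the queue.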