winnow

Software source-code similarity detection, similar to Moss. It is based on incremental winnowing of commits.

Our approach

We want to detect similar code between files in different git repositories over the set of all commits. That is, we want to identify whether any two files in any two different repositories share code, considering every commit in those repositories.

TODO:

  • Since we are winnowing hunks rather than files, fingerprints will not match across hunk boundaries. A temporary workaround is to use small n-grams until this is handled more carefully.

Definitions:

  • A document is a set of additive changes corresponding to a hunk in a diff for a particular file in a git repository. A git repository contains a set of commits, each of which contains a number of files, themselves containing a number of hunks, or individual sets of changes. The additive subset of all hunks across all files and all commits gives the set of all documents in a repository.
  • A location uniquely identifies the line number of a fingerprint in a document, given by the tuple (repository, filename, commit hash, line number).
  • A fingerprint is a hash resulting from the application of the winnowing algorithm to a subset (window) of a specific document. Fingerprints are always associated with the documents they derive from by means of their location. A minimal sketch of these definitions as Rust types follows.
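
The definitions above might map to Rust types roughly as follows. This is a minimal sketch for orientation; the names and exact shapes are illustrative rather than the structs this repository actually uses.

```rust
/// Where a fingerprint was found: the (repository, filename, commit hash,
/// line number) tuple from the definition above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Location {
    pub repository: String,
    pub filename: String,
    pub commit: String,
    pub line: usize,
}

/// A hash selected by winnowing, tied back to the document it came from.
#[derive(Debug, Clone)]
pub struct Fingerprint {
    pub hash: u64,
    pub location: Location,
}

/// A document: the additive changes of one hunk in one file of one commit.
#[derive(Debug, Clone)]
pub struct Document {
    pub repository: String,
    pub filename: String,
    pub commit: String,
    pub start_line: usize, // first line of the hunk in the post-change file
    pub text: String,      // the added lines of the hunk, concatenated
}
```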

Algorithm:

  1. Given a set of repository URLs and the popularity cutoffs minp and maxp
  2. Load the set of all documents
    • Pull all repositories
    • Per repository, read all commits
    • Per commit, read all hunks
    • Per hunk, treat the additive changes as a document and continue with step 3
  3. Compute the set of all fingerprints
    • Initialize an empty vector of fingerprints and location tuples, fv
    • Given a hunk, perform winnowing on its text, giving a set of fingerprints (see the winnowing sketch after this list)
    • Per resulting fingerprint, construct a location given the current context of repository, filename (from commit), commit, and line number (from hunk)
    • Store in fv
  4. Construct a reverse index on the set of fingerprints fv (steps 4-6 are sketched in code after this list)
    • Initialize an empty nested hashmap from fingerprint hash to a hashmap of repository name to locations, fi
    • Per element of fv, insert the fingerprint into fi keyed by its hash and then by the repository from its location, storing the location itself
  5. Prune set of fingerprints
    • Initialize an empty hashmap of (repository, filename, commit hash) tuple to vector of fingerprint hash, fd
    • Per (fingerprint hash, location) element of fv, use the fingerprint hash to query the index fi
    • Compute the fingerprint popularity p as the number of keys in the returned (nested) map (the number of unique repositories in which this fingerprint hash appears) minus one (discounting the current repository)
    • If p < the minimum popularity cutoff minp, there are not enough matches across the set of documents for this fingerprint to be useful for identifying similarity. If p > the maximum popularity cutoff maxp, this fingerprint is likely a language keyword, boilerplate code, or something else shared amongst almost all files/repositories. In either case, skip the fingerprint
    • Otherwise, insert an element into fd: the key is the (repository, filename, commit hash) tuple taken from the location, and the fingerprint hash is appended to that key's vector
  6. Perform quadratic (pairwise) document comparison
    • Initialize an empty output map of (document, document) tuple to vector of fingerprints, out
    • Iterate through keys in fd as document a
    • Per document a, iterate through other keys in fd as document b
    • Use fi to determine fingerprints existing in both a and b
    • Save the tuple (a, b) and the matching fingerprints in out
  7. Return rank-ordered map out, sorted by the length of the value (the number of matches for a given pair of documents)
  8. Compute useful metrics from out
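
To make step 3 concrete, the winnowing pass over a hunk's text might look roughly like the sketch below, assuming byte-level k-grams, a window of w consecutive k-gram hashes, and rightmost-minimum selection as in the winnowing paper. The function name, the use of DefaultHasher instead of a rolling hash, and returning k-gram indices rather than line numbers are all illustrative choices, not necessarily what src/ implements.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash every k-gram of `text`, slide a window over `w` consecutive k-gram
/// hashes, and select the minimum hash in each window (rightmost on ties).
/// Returns (hash, k-gram index) pairs; a position selected by several
/// overlapping windows is recorded only once.
fn winnow(text: &str, k: usize, w: usize) -> Vec<(u64, usize)> {
    let bytes = text.as_bytes();
    if k == 0 || w == 0 || bytes.len() < k {
        return Vec::new();
    }

    // Hash each k-gram. A real implementation would use a rolling hash;
    // DefaultHasher keeps this sketch dependency-free.
    let hashes: Vec<u64> = bytes
        .windows(k)
        .map(|gram| {
            let mut h = DefaultHasher::new();
            gram.hash(&mut h);
            h.finish()
        })
        .collect();

    let mut selected = Vec::new();
    let mut last_pos = None;
    for (start, window) in hashes.windows(w).enumerate() {
        // Rightmost minimum: scan the window in reverse and take the first
        // minimum encountered.
        let (offset, &min) = window
            .iter()
            .enumerate()
            .rev()
            .min_by_key(|&(_, &h)| h)
            .expect("window is non-empty");
        let pos = start + offset;
        // Selected positions never decrease, so comparing against the last
        // selection is enough to avoid recording duplicates.
        if last_pos != Some(pos) {
            selected.push((min, pos));
            last_pos = Some(pos);
        }
    }
    selected
}
```

Mapping a selected k-gram index back to a line number for the Location tuple is then a matter of counting newlines up to that byte offset in the hunk.

Steps 4-6 could be sketched as follows, reusing the Fingerprint and Location types from the Definitions sketch; minp and maxp are the cutoffs from step 1, and the names build_index, prune, and compare are invented for this illustration (the intersection in compare is taken directly from fd rather than via fi).

```rust
use std::collections::HashMap;

/// (repository, filename, commit hash): the document a pruned fingerprint
/// belongs to.
type DocKey = (String, String, String);
/// fi: fingerprint hash -> repository name -> locations.
type Index = HashMap<u64, HashMap<String, Vec<Location>>>;

/// Step 4: build the reverse index over all fingerprints.
fn build_index(fv: &[Fingerprint]) -> Index {
    let mut fi: Index = HashMap::new();
    for fp in fv {
        fi.entry(fp.hash)
            .or_default()
            .entry(fp.location.repository.clone())
            .or_default()
            .push(fp.location.clone());
    }
    fi
}

/// Step 5: keep a fingerprint for a document only if its cross-repository
/// popularity p falls within [minp, maxp].
fn prune(fv: &[Fingerprint], fi: &Index, minp: usize, maxp: usize) -> HashMap<DocKey, Vec<u64>> {
    let mut fd: HashMap<DocKey, Vec<u64>> = HashMap::new();
    for fp in fv {
        // Number of *other* repositories in which this hash appears.
        let p = fi[&fp.hash].len().saturating_sub(1);
        if p < minp || p > maxp {
            continue; // too rare to be interesting, or keyword/boilerplate
        }
        let key = (
            fp.location.repository.clone(),
            fp.location.filename.clone(),
            fp.location.commit.clone(),
        );
        fd.entry(key).or_default().push(fp.hash);
    }
    fd
}

/// Step 6: quadratic comparison; record the fingerprints shared by each pair.
fn compare(fd: &HashMap<DocKey, Vec<u64>>) -> HashMap<(DocKey, DocKey), Vec<u64>> {
    let mut out = HashMap::new();
    let docs: Vec<&DocKey> = fd.keys().collect();
    for (i, a) in docs.iter().enumerate() {
        for b in &docs[i + 1..] {
            let shared: Vec<u64> = fd[*a]
                .iter()
                .copied()
                .filter(|h| fd[*b].contains(h))
                .collect();
            if !shared.is_empty() {
                out.insert(((*a).clone(), (*b).clone()), shared);
            }
        }
    }
    out
}
```

Step 7 is then out sorted in descending order of each value's length, i.e. the number of shared fingerprints per document pair. For readability compare uses Vec::contains; building a HashSet per document would avoid the repeated linear scans.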

Past work

Winnowing paper

Link to paper

For this application, positional information (document and line number) is stored with each selected fingerprint. The first step builds an index mapping fingerprints to locations for all documents, much like the inverted index built by search engines mapping words to positions in documents. In the second step, each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document. Now the list of matching fingerprints for a document d may contain fingerprints from many different documents d1, d2, .... In the next step, the list of matching fingerprints for each document d is sorted by document and the matches for each pair of documents (d, d1), (d, d2), ... is formed. Matches between documents are rank-ordered by size (number of fingerprints) and the largest matches are reported to the user. Note that up until this last step, no explicit consideration of pairs of documents is required. This is very important, as we could not hope to carry out copy detection by comparing each pair of documents in a large corpus. By postponing the quadratic computation to the last step, we can optimize it by never materializing the matching for a pair of documents if it falls below some user-specified threshold.

Presentation of the results is another important issue for users. Statistics such as reporting the percentage of overlap between two documents are useful, but not nearly as useful as actually showing the matches marked-up in the original text. MOSS uses the fingerprints to determine where the longest matching sequences are; in particular, if a1 in document 1 matches a2 in document 2, and b1 in document 1 matches b2 in document 2, and furthermore a1 and b1 are consecutive in document 1 and a2 and b2 are consecutive in document 2, then we have discovered a longer match across documents consisting of a followed by b. While this merging of matches is easy to implement, k-grams are naturally coarse and some of the match is usually lost at the beginning and the end of the match. It is possible that once a pair of similar documents are detected using fingerprinting that it would be better to use a suffix-tree algorithm [15] to find maximal matches in just that pair of documents.
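
The match-merging idea in the last part of this excerpt (consecutive fingerprint matches fusing into one longer match) could be sketched as below; positions are fingerprint indices within each document, and the function is an illustration of the described idea rather than code from MOSS or from this repository.

```rust
/// Merge matched fingerprint positions (i in document 1, j in document 2)
/// into runs: (i, j) followed by (i + 1, j + 1) belongs to the same,
/// longer match. `matches` must be sorted by i, ties broken by j.
fn merge_runs(matches: &[(usize, usize)]) -> Vec<(usize, usize, usize)> {
    // Each output entry is (start_i, start_j, length).
    let mut runs: Vec<(usize, usize, usize)> = Vec::new();
    for &(i, j) in matches {
        match runs.last_mut() {
            // Extend the previous run when both positions are consecutive.
            Some((si, sj, len)) if i == *si + *len && j == *sj + *len => *len += 1,
            _ => runs.push((i, j, 1)),
        }
    }
    runs
}
```

For example, the matches [(3, 10), (4, 11), (7, 2)] merge into the runs (3, 10, length 2) and (7, 2, length 1).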

Others

Preprocessing

"It does this by preprocessing the source code files, calculating a numeric fingerprint for each file, and then performing a longest common sequence search on the two fingerprints. The preprocessing stage replaces all function and variable names with a single token, and removes all comments and whitespace from the source code. The fingerprint stage calculates hash values for windows of characters in the resulting file, preserving the lowest hash values as the file’s fingerprint" Engels et al. 2007

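A crude illustration of the preprocessing described in the quote, assuming only identifier collapsing and whitespace removal; comment stripping and keyword handling are left out, and this is not how winnow itself preprocesses input.

```rust
/// Collapse every identifier-like token to a single placeholder and drop
/// whitespace, so that renaming variables or reformatting does not change
/// the fingerprint input. Keywords and comments get no special treatment.
fn normalize(source: &str) -> String {
    let mut out = String::new();
    let mut in_token = false;
    for c in source.chars() {
        if c.is_alphanumeric() || c == '_' {
            if !in_token {
                out.push('v'); // one placeholder per identifier or number
                in_token = true;
            }
        } else {
            in_token = false;
            if !c.is_whitespace() {
                out.push(c); // keep operators and punctuation
            }
        }
    }
    out
}
```

With this, normalize("total += count1;") yields "v+=v;", so the two sides of a rename still hash identically.
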
Notes
