Skip to content
Plagiarism detection for programming exercises
TypeScript JavaScript CSS
Branch: develop
Clone or download
Latest commit 123ae72 Nov 19, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.circleci also build in circle ci Jul 17, 2019
samples/js Fix failing test Nov 5, 2019
src Improve directory structure Nov 19, 2019
typings Initial implementation of the hash function. Jun 28, 2019
.gitignore added benchmark files Sep 6, 2019
LICENSE Initial commit Jun 23, 2019
README.md fixed README Aug 28, 2019
jest.config.js remove the spotify dependency Jul 8, 2019
package.json Make dolos installable Nov 19, 2019
prettier.config.js run prettier Jul 6, 2019
tsconfig.json
tslint.json add additional filters Jul 6, 2019
yarn.lock Bump @types/jest from 24.0.19 to 24.0.20 Oct 28, 2019

README.md

Dolos

Plagiarism detection for computer programming assignments.

Right now, C#, Haskell, Java, JavaScript and Python are officially supported. Additional languages for which a tree-sitter grammar is available might also work.

How does Dolos work?

Tokenization

To be immune to plagiarism where variables and functions are renamed, Dolos doesn't run directly on the source code subjected to the test. First a tokenization step is performed using tree-sitter. Tree-sitter can generate syntax trees for many languages and converts source code to a more structured form, free of naming variabilities.

For example, the code

function sum(a, b) {
  return a + b;
}

is converted to something like this:

program ([0, 0] - [3, 0])
  function ([0, 0] - [2, 1])
    identifier ([0, 9] - [0, 12])
    formal_parameters ([0, 12] - [0, 18])
      identifier ([0, 13] - [0, 14])
      identifier ([0, 16] - [0, 17])
    statement_block ([0, 19] - [2, 1])
      return_statement ([1, 2] - [1, 15])
        binary_expression ([1, 9] - [1, 14])
          identifier ([1, 9] - [1, 10])
          identifier ([1, 13] - [1, 14])

Next, we can start looking for similarities in the submitted files.

Substring matching

To measure similarities between (converted) files, Dolos tries to find common substrings between them. We use substrings of a fixed length called k-mers. To efficiently make these comparisons and reduce the memory usage, all k-mers are hashed using a rolling hash function (the one used by Rabin-Karp in their string matching algorithm).

To further reduce the memory usage, only a subset of all hashes are stored. The selection of which hashes to store is done by the Winnowing algorithm as described in http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf.

Using an index

Because we want to compare all files with each other, it is more efficient to first create an index containing the hashes of all files. For each of the hash values encountered in any of the files, we store the file and corresponding line number where we encountered that hash.

Afterwards, if we want to check a file against the rest, we can simply compute all hash values for that file and look them up in the index. We can then aggregate the results of all matches and report on the most similar files.

How to get this code running

  • Make sure Node 10 is installed. Node 12 is not supported by tree-sitter, one of our dependencies.
  • Clone the repository
  • Run yarn install to install all dependencies
  • If you want to generate plain JS files, run yarn build and the dist folder should be created
  • Running yarn test will run all tests
  • Running yarn lint will run the linter
  • Running yarn start will compile everything and run the app.js file. Additional arguments are required, use the --help option for more information.
You can’t perform that action at this time.