[experimental] Add ability to ignore template code or frequently occurring fingerprints #1524

Merged
rien merged 11 commits into main from feature/ignore-templates on May 31, 2024

Conversation

rien
Member

@rien rien commented May 16, 2024

This PR adds the ability to ignore template code by manually specifying ignore files or by setting a maximum count or percentage of files a code fragment can occur in before it is ignored.

Note: this feature is currently experimental. We're not convinced of the initial results and will be performing more tests to see whether this functionality would actually improve plagiarism reports or not.

This option is not yet available in the web server; we are still considering how to implement it there (see #1535).

Closes #1213, #716, #1163

The following changes have been made to the dolos, dolos-core, and dolos-lib npm packages:

API changes

CLI

  • Added a new option -i, --ignore <path> to ignore matches with that file in the analysis
  • Fixed the options -m, --max-fingerprint-count <integer> and -M, --max-fingerprint-percentage <fraction> so that matches are ignored when the code occurs in more than that count or percentage of files

Core

  • FingerprintIndex can now ignore files, or fingerprints occurring in more than a specified number of files.
    • constructor: added optional argument maxFingerprintFileCount which can be used to set the maximum number of files a fingerprint can occur in before it is ignored.
    • new function addIgnoredFile(file: TokenizedFile): void can be used to ignore all the fingerprints in a file.
    • new function ignoredEntries(): Array<FileEntry> to retrieve all ignored files.
    • new functions getMaxFingerprintFileCount(): number and updateMaxFingerprintFileCount(maxFingerprintFileCount: number | undefined) to retrieve and update the maxFingerprintFileCount. An update is applied to the index immediately.
    • new function addIgnoredHashes(hashes: Array<Hash>) which can be used to manually ignore certain hashes.
  • interface FileEntry: added field ignored: Set<SharedFingerprints> to track ignored fingerprints and field isIgnored: boolean to indicate whether this file is an ignored file.
  • SharedFingerprint now has a boolean ignored to reflect whether this shared fingerprint is ignored or not.
    • new function includesFile(file: TokenizedFile): boolean to check whether this fingerprint is included in the given file.
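To make the two ignore mechanisms above concrete, here is a minimal, self-contained sketch of a toy fingerprint index. It is not the dolos-core implementation: `MiniIndex` and all of its members are hypothetical names chosen for illustration, files are represented by plain strings, and fingerprints by plain numbers.

```typescript
// Toy model only: every name below is hypothetical, not the dolos-core API.
type Hash = number;

class MiniIndex {
  // hash -> names of the files that contain it
  private readonly files = new Map<Hash, Set<string>>();
  // hashes ignored explicitly (for example, all hashes of a template file)
  private readonly ignoredHashes = new Set<Hash>();

  constructor(private maxFileCount: number = Number.MAX_SAFE_INTEGER) {}

  addFile(name: string, hashes: Hash[]): void {
    for (const hash of hashes) {
      let names = this.files.get(hash);
      if (names === undefined) {
        names = new Set<string>();
        this.files.set(hash, names);
      }
      names.add(name);
    }
  }

  // Ignore every fingerprint that occurs in a template file.
  addIgnoredFile(hashes: Hash[]): void {
    for (const hash of hashes) {
      this.ignoredHashes.add(hash);
    }
  }

  // Passing undefined removes the limit again.
  updateMaxFileCount(max: number | undefined): void {
    this.maxFileCount = max ?? Number.MAX_SAFE_INTEGER;
  }

  // A hash is ignored if it was ignored explicitly or occurs in too many files.
  isIgnored(hash: Hash): boolean {
    const count = this.files.get(hash)?.size ?? 0;
    return this.ignoredHashes.has(hash) || count > this.maxFileCount;
  }

  // Fingerprints present in both files that are not ignored.
  shared(a: string, b: string): Hash[] {
    const result: Hash[] = [];
    this.files.forEach((names, hash) => {
      if (names.has(a) && names.has(b) && !this.isIgnored(hash)) {
        result.push(hash);
      }
    });
    return result;
  }
}

const index = new MiniIndex();
index.addFile("a.java", [1, 2, 3, 4]);
index.addFile("b.java", [1, 2, 3, 5]);
index.addFile("c.java", [1, 6, 7]);

const before = index.shared("a.java", "b.java");
index.addIgnoredFile([2]);   // pretend the template contains hash 2
const afterTemplate = index.shared("a.java", "b.java");
index.updateMaxFileCount(2); // hash 1 occurs in 3 files, so it is now ignored
const afterLimit = index.shared("a.java", "b.java");
```

In this sketch the ignore decision is evaluated lazily on every lookup, so changing the limit takes effect immediately without rebuilding anything. Whether the real index does this or eagerly updates its entries is not something this sketch claims to mirror.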

Lib

  • Dolos class now has the option to ignore a file, or to ignore fingerprints occurring in more than a specified number or percentage of files
    • The options maxFingerprintCount and maxFingerprintPercentage now have an effect (they were previously ignored): code matching more than this count or percentage of files will be ignored
    • analyzePaths has an extra optional parameter ignore?: string which can be set to the path of the file to ignore
    • analyze has an extra optional parameter ignoredFile?: File which can be set to the File to ignore
  • Report class now has an extra function ignoredEntries(): Array<FileEntry> to retrieve the files that have been ignored
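As an illustration of how the two options could be reduced to the single file-count limit that the index accepts, consider the sketch below. It rests on assumptions: `resolveMaxFileCount` and `IgnoreOptions` are hypothetical names, and giving the absolute count precedence over the percentage is an arbitrary choice made for this example, not documented dolos-lib behaviour.

```typescript
// Hypothetical helper, not part of dolos-lib.
interface IgnoreOptions {
  maxFingerprintCount?: number;      // absolute number of files
  maxFingerprintPercentage?: number; // fraction of files, between 0 and 1
}

// Returns the maximum number of files a fingerprint may occur in,
// or undefined when no limit is configured.
function resolveMaxFileCount(
  options: IgnoreOptions,
  fileCount: number
): number | undefined {
  // Assumption: an explicit count wins over a percentage.
  if (options.maxFingerprintCount !== undefined) {
    return options.maxFingerprintCount;
  }
  if (options.maxFingerprintPercentage !== undefined) {
    return options.maxFingerprintPercentage * fileCount;
  }
  return undefined;
}

// With 40 submissions, a fraction of .75 corresponds to a limit of 30 files.
const limit = resolveMaxFileCount({ maxFingerprintPercentage: 0.75 }, 40);
```

The product is deliberately left unrounded: a fingerprint is ignored when its file count is strictly greater than the limit, and that comparison works with a fractional limit as well.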

Experimental results

To observe the effects of ignoring template code, we've run Dolos on a recent case of plagiarism.

The cases with confirmed plagiarism appear in the baseline comparison with a high similarity (79%) and fall within one of the four clusters.

Across all configurations, these cases are present in the identified clusters. However, the similarities decrease with the aggressiveness of the -M option, and the other clusters vary a little.

Even with -M .25, the confirmed cases remain at the top of the highest-ranking submissions, and the comparison between them does not differ much.

Baseline (no ignoring)

image

Ignore template code (-i boilerplate.java)

image

Ignore fingerprints occurring in more than 75% of files (-M .75)

image

Ignore fingerprints occurring in more than 50% of files (-M .50)

image

Ignore fingerprints occurring in more than 25% of files (-M .25)

image

Ignore template code AND fingerprints occurring in more than 75% of files (-i boilerplate.java -M .75)

image

@rien rien force-pushed the feature/ignore-templates branch from 225dd48 to 1104f12 Compare May 27, 2024 12:37
@rien rien marked this pull request as ready for review May 28, 2024 09:39
@rien rien added the enhancement New feature or request label May 28, 2024
@rien rien force-pushed the feature/ignore-templates branch from 827cb53 to dfac9e1 Compare May 29, 2024 12:23
@rien rien changed the title from "Add ability to ignore template code or frequently occurring fingerprints" to "[experimental] Add ability to ignore template code or frequently occurring fingerprints" May 30, 2024
@rien rien merged commit 0aacf06 into main May 31, 2024
26 checks passed
@rien rien deleted the feature/ignore-templates branch May 31, 2024 07:37
Development

Successfully merging this pull request may close these issues.

Creating templates to ignore sections of code
1 participant