A simple Plagiarism detector for code, made in C and Rust.

YJDoc2/Plagiarism-Detector

Plagiarism Detector

Version Alpha

This is a plagiarism detector for code. It first trains on sample code, building an index of it in the form of a trie, and then matches the code under test against the values in that index.

File Structure :

.
├── lexer
│   ├── lexer.l :           flex file for generating the lexer
│   └── lexer.c :           lexer generated by flex
├── src
│   ├── build.rs :          build file for linking the library built from C
│   ├── ffi.rs :            contains the safe wrapper for the unsafe FFI with C
│   ├── main.rs :           main function, train function and eval function
│   ├── reference.rs :      contains the struct used to store references
│   └── trie.rs :           contains the trie-like struct which is used as the index
├── Cargo.toml :            TOML file of the Cargo config
├── Cargo.lock :            lockfile generated by Cargo
├── sample-syntax :         syntax for the sample file
├── makefile :              builds the C library and the Rust library; run it after any change in the lexer folder
├── sample_test.txt :       sample data for testing
├── sample_train.txt :      sample data for training
├── LICENCE :               copy of the GNU GPL v3
└── README.md :             Readme file

Requirements

  • C compiler for compiling the generated lexer (gcc/clang preferred)
  • Cargo and Rust for compiling and running the Rust code
  • GNU Make for using the makefile
  • GNU flex, only if you want to change the lexer. After editing the lexer.l file, make must be run.

Building

Cargo takes care of compiling the Rust files. To compile the C code, use the makefile. If you don't have flex, remove the first part of the statement and use the rest. The C library must be compiled before running cargo build/run.

Usage

If you are not running through cargo (for example, after generating a production binary), replace 'cargo run' with the binary name.

  • cargo run train input_file_path : trains on the file at input_file_path, and saves the index as index.json in the folder the command is executed from.
  • cargo run update index_file_path input_file_path : updates the index in file index_file_path by training on file at input_file_path and re-writes the index file.
  • cargo run test index_file_path input_file_path : loads index from index_file_path and tests for matches in file input_file_path

Working :

Indexing :

Indexing parses the input token by token, as defined by the lexer in the lexer folder, and builds a trie-like structure of type : Trie{ token_number -> { set of values encountered for the token, Trie of tokens encountered after this } }
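The structure described above can be sketched in Rust as follows. This is a minimal illustration, not the code in trie.rs: the names Node, Trie and insert, the u32 token numbers and the String values are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Everything recorded for one token type at a given position.
/// (Hypothetical layout; the real struct lives in trie.rs.)
#[derive(Debug, Default)]
struct Node {
    /// Concrete values seen for this token type at this position.
    values: HashSet<String>,
    /// Sub-trie of the tokens encountered after this one.
    next: Trie,
}

/// Trie keyed by token number (assumed here to be a u32 from the lexer).
#[derive(Debug, Default)]
struct Trie {
    children: HashMap<u32, Node>,
}

impl Trie {
    /// Insert one token sequence, given as (token_number, value) pairs.
    fn insert(&mut self, tokens: &[(u32, &str)]) {
        let mut current = self;
        for &(token, value) in tokens {
            // Create the node for this token type if it is new.
            let node = current.children.entry(token).or_default();
            node.values.insert(value.to_string());
            // Descend: the next token is stored under this one.
            current = &mut node.next;
        }
    }
}
```

Inserting the sequences (1, "int"), (2, "x") and (1, "int"), (2, "y") would give a single root entry for token 1, whose sub-trie holds token 2 with the value set {"x", "y"}.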

Testing :

For each token, it searches in the index. If the token is found, the next index for searching is set to the Trie of that token; if not, it skips tokens until an EOL token is encountered. For each matched token it adds token_score to the total, and if the value of the token is in the value set of that token in the index, it adds the max score to the total as well. The final score is (total score / number of tokens from the last EOL to the current EOL) * 100 %
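As a rough sketch of this scoring step for a single line, under stated assumptions: the constants TOKEN_SCORE and MAX_SCORE are hypothetical and set to 0.5 each so that a full match scores 100 %, and the Node, Trie and score_line names are illustrative rather than taken from main.rs.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct Node {
    values: HashSet<String>,
    next: Trie,
}

#[derive(Default)]
struct Trie {
    children: HashMap<u32, Node>,
}

// Hypothetical weights; tuning the real values is listed as a TODO.
const TOKEN_SCORE: f64 = 0.5; // added when the token type matches
const MAX_SCORE: f64 = 0.5; // added on top when the value matches too

/// Score one line (the tokens between two EOL tokens) against the index.
/// Returns a percentage.
fn score_line(index: &Trie, line: &[(u32, &str)]) -> f64 {
    if line.is_empty() {
        return 0.0;
    }
    let mut current = index;
    let mut total = 0.0;
    for &(token, value) in line {
        match current.children.get(&token) {
            Some(node) => {
                total += TOKEN_SCORE;
                if node.values.contains(value) {
                    total += MAX_SCORE;
                }
                // Continue the search in the sub-trie of this token.
                current = &node.next;
            }
            // Not found: the rest of the line is skipped until the EOL.
            None => break,
        }
    }
    total / line.len() as f64 * 100.0
}
```

With these weights, a line whose token types all match but whose values all differ scores 50 %, and every matching value raises the score toward 100 %.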

After scanning the whole file, it filters the matched lines, keeping only those with score > cutoff score. The remaining matches are reported along with their % score and the source in which the line was matched.
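The filtering step might look like the sketch below. The LineMatch struct, its field names and the filter_matches function are assumptions made for illustration; only the "score > cutoff" condition comes from the description above.

```rust
/// A scored line from the file under test (illustrative field names).
#[derive(Debug, Clone, PartialEq)]
struct LineMatch {
    line_number: usize,
    /// Score as a percentage.
    score: f64,
    /// Reference of the training sample the line was matched against.
    source: String,
}

/// Keep only the matches whose score is strictly above the cutoff.
fn filter_matches(matches: Vec<LineMatch>, cutoff: f64) -> Vec<LineMatch> {
    matches.into_iter().filter(|m| m.score > cutoff).collect()
}
```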

Test data file :

The file should follow this format. Each sample should start with :

-----> START SAMPLE
ref : (reference name, website, etc., without spaces)
code

[next sample]

A simple sample can be seen in sample_train.txt (for training) and sample_test.txt (for testing).
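A reader for such a file could be sketched as below. It assumes that the marker "-----> START SAMPLE" sits on its own line, that the next line is "ref : <name>", and that everything up to the next marker is the sample's code; the exact layout should be checked against sample_train.txt. The Sample struct and parse_samples function are hypothetical names, not part of the repository.

```rust
/// One sample: a reference name and the code that belongs to it.
#[derive(Debug, PartialEq)]
struct Sample {
    reference: String,
    code: String,
}

// Assumed to appear alone on the first line of every sample.
const START_MARKER: &str = "-----> START SAMPLE";

/// Split the contents of a sample file into samples.
fn parse_samples(input: &str) -> Vec<Sample> {
    let mut samples = Vec::new();
    let mut current: Option<Sample> = None;
    let mut lines = input.lines();
    while let Some(line) = lines.next() {
        if line.trim() == START_MARKER {
            // A new marker closes the previous sample, if any.
            if let Some(done) = current.take() {
                samples.push(done);
            }
            // The line after the marker is assumed to be `ref : <name>`;
            // the name is whatever follows the first colon.
            let reference = lines
                .next()
                .and_then(|l| l.split_once(':'))
                .map(|(_, name)| name.trim().to_string())
                .unwrap_or_default();
            current = Some(Sample { reference, code: String::new() });
        } else if let Some(sample) = current.as_mut() {
            sample.code.push_str(line);
            sample.code.push('\n');
        }
    }
    if let Some(done) = current.take() {
        samples.push(done);
    }
    samples
}
```

Lines that appear before the first marker are ignored in this sketch.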

TODOs:

Currently there is only one lexer, for C files, and it has not been tested on real data. Fine-tuning is needed for: the lexer (should all keywords be a single type of token with different values, or should each keyword have its own token?); the metric function used to match the values; and the values used for the token score, the max score, etc.
