Detecting (crosslingual) text reuse

This repo contains simple Python utilities for identifying crosslingual textual reuse. Quickstart:

git clone 
cd detect_reuse 
cd text_cleaning_resources
gunzip normalized_stats_one_million.txt
cd ../
python sample/encyclopedie_volume05_translated.txt sample/goldsmith_animated_nature_full_unsplit.txt 8 4 4

This command looks for textual reuse between "sample/encyclopedie_volume05_translated.txt" and "sample/goldsmith_animated_nature_full_unsplit.txt", using a window size of 8, a step size of 4, and an ngram size of 4.

Translation Utility

translate_texts/ uses goslate (pip install goslate) to translate all texts into a common language. Usage:

python encyclopedie_volume05.txt "en" "utf-8"

The arguments, in order, are: the text to be translated, the language into which the text should be translated, and the encoding of the input.

Running the command above transforms Volume V of the French Encyclopédie into English: "L'Encyclopédie vient de faire une excellente acquisition en la personne de M. Bourgelat , Ecuyer du Roi, chef de son Académie à Lyon ..." becomes "The Encyclopedia has made a great acquisition in the person of Mr. Bourgelat, Esquire of the King, the captain of his Academy in Lyons ..."

Detecting Textual Reuse

One can search for textual reuse between two files by running:

python {text_one} {text_two} {window size} {step size} {ngram size}

{window size} = the size of the sliding window to be created
{step size}   = the number of words to advance the sliding window each time it moves
{ngram size}  = the number of words to include in each ngram
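
To make the three parameters concrete, here is a minimal sketch of how a sliding window and word ngrams could be computed and compared. It is not the repository's actual implementation: the function names (sliding_windows, ngrams, shared_ngram_count) and the two sample sentences are hypothetical, and the sketch assumes the window size is measured in words and that trailing words too few to fill a window are dropped.

```python
def sliding_windows(words, window_size, step_size):
    """Yield windows of `window_size` words, advancing `step_size` words each time.

    Assumption: trailing words that do not fill a complete window are dropped.
    """
    last_start = max(len(words) - window_size, 0)
    for start in range(0, last_start + 1, step_size):
        yield words[start:start + window_size]


def ngrams(words, ngram_size):
    """Return the set of word ngrams (as tuples) contained in `words`."""
    return {tuple(words[i:i + ngram_size])
            for i in range(len(words) - ngram_size + 1)}


def shared_ngram_count(window_one, window_two, ngram_size):
    """Count the distinct ngrams that appear in both windows."""
    return len(ngrams(window_one, ngram_size) & ngrams(window_two, ngram_size))


# Hypothetical sample sentences, invented for illustration only.
text_one = "the quick brown fox jumps over the lazy dog near the river bank".split()
text_two = "a quick brown fox jumps over the sleepy dog near the old bridge".split()

# Same parameter values as the quickstart command: 8 4 4.
windows_one = list(sliding_windows(text_one, window_size=8, step_size=4))
windows_two = list(sliding_windows(text_two, window_size=8, step_size=4))

for w1 in windows_one:
    for w2 in windows_two:
        count = shared_ngram_count(w1, w2, ngram_size=4)
        if count:
            print(count, " ".join(w1), " ".join(w2), sep="\t")
```

A smaller step size makes consecutive windows overlap more, which reduces the chance that a reused passage is split across a window boundary, at the cost of more comparisons.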

The output will contain data in the following format:

path_to_text_one {tab} path_to_text_two {tab} number_of_shared_ngrams {tab} sentence_from_text_one {tab} sentence_from_text_two {newline}

Sorting the output by the third column (number_of_shared_ngrams) gives an estimate of textual similarity between the passages: the more ngrams two passages share, the more similar they are likely to be.
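
One possible way to rank the output by its third column is sketched below. It assumes each line has exactly the five tab-separated fields described above. The function name rank_matches and the sample rows are hypothetical, made up only to show the shape of the data.

```python
import csv
import io


def rank_matches(tsv_lines):
    """Return five-column rows sorted by shared ngram count, highest first."""
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip blank or malformed lines rather than failing on them.
    rows = [row for row in reader if len(row) == 5]
    return sorted(rows, key=lambda row: int(row[2]), reverse=True)


# Invented sample rows standing in for real output (assumption, not real results).
sample_output = io.StringIO(
    "a.txt\tb.txt\t2\tfirst passage\tfirst match\n"
    "a.txt\tb.txt\t5\tsecond passage\tsecond match\n"
    "a.txt\tb.txt\t3\tthird passage\tthird match\n"
)

ranked = rank_matches(sample_output)
for row in ranked:
    print(row[2], row[3], row[4], sep="\t")
```

To use this on real output, pass an open file handle in place of sample_output, for example one opened with newline="" as the csv module recommends.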

