Skip to content

duhaime/visualize-text-reuse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

###Dependencies

This code requires the following Python libraries: numpy, annoy, and nltk. If you don't already have numpy, I recommend the Anaconda distribution of Python, which ships with numpy, scipy, and other libraries that can prove difficult to compile manually. Once the numpy dependency is settled, one can satisfy the other dependencies by running:

pip install nltk  
pip install annoy

###Quickstart

This repository contains utilities for detecting and visualizing text reuse. To get started, run:

git clone https://github.com/duhaime/visualize-text-reuse.git
cd visualize-text-reuse/utils
python detect_reuse.py
cd ../
python -m SimpleHTTPServer 8000

Then open a browser (Chrome is recommended) and navigate to localhost:8000 to see the results of the analysis.

###Process custom dataset

To process a dataset other than the files contained in data/full_text/, just open up config.json and provide a new glob path to the infile_glob parameter, as well as a new metadata file to the metadata parameter. Make sure that the metadata file you provide is formatted as data/metadata/corpus_metadata.tsv is formatted:

filename display title publication year file id file author

Change runtime parameters

One can change the following runtime parameters within config.json:

infile_glob: Glob path to the plaintext files to be processed

metadata: Metadata file that corresponds to the files in infile_glob

persist_index: {0,1} Controls whether the index used to detect text reuse will be saved to disk.

load_index: {0,1} Controls whether the algorithm will load an already saved index from disk.

knn: An integer that controls the number of nearest neighbors to find for each passage of each file.

print_nn: {0,1} If set to 1, running utils/detect_reuse.py will print the nearest neighbors of each passage to the terminal.

n_trees: An integer that controls the number of trees to build within the index. Increasing this value requires more memory but improves search precision.

search_k: An integer that determines the number of searches to be executed for each nearest neighbor lookup. Increasing this value requires more runtime but improves search precision.

maximum_processes: An integer that indicates the maximum number of processes to be run concurrently.

minimum_similarity: {0:1} A float value indicating the minimum similarity an alignment must have in order to be persisted on disk.

About

Visualize text reuse with D3.js

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published