Install dependencies (use sudo for a system-wide installation):
pip install scikit-learn nltk
Download the required NLTK resources:
python -m nltk.downloader punkt averaged_perceptron_tagger wordnet
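To confirm that both packages can be imported before continuing, a check along the following lines may help. This is only a sketch and not part of the repository: `check_module` is a hypothetical helper name, and the snippet assumes `python3` is the interpreter that `pip` installed into (override it through the `PYTHON` variable otherwise).

```shell
# Hypothetical helper (not part of this repository): succeeds if the given
# Python module can be imported by the interpreter named in $PYTHON.
PYTHON="${PYTHON:-python3}"

check_module() {
    "$PYTHON" -c "import $1" >/dev/null 2>&1
}

# scikit-learn is imported under the name "sklearn".
for module in sklearn nltk; do
    if check_module "$module"; then
        echo "ok: $module"
    else
        echo "missing: $module"
    fi
done
```

A "missing" line suggests that the corresponding `pip install` step did not reach the interpreter being used.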
Install Sampled-MinHashing (see README at https://github.com/gibranfp/Sampled-MinHashing).
The script 'scripts/prepare_db.sh' downloads the following resources/corpora:
- 20 Newsgroups
- English Wikipedia
- Spanish Wikipedia
- Reuters (optional): if you have access to a copy of the corpus, the script will prompt you for its path.
To run the experiments, execute the following from the main directory:
bash scripts/prepare_db.sh -a
bash scripts/run_all.sh
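The two commands above can also be chained so that the experiments start only if the preparation step exits successfully. The wrapper below is a minimal sketch under that assumption; `run_experiments` is a hypothetical name and is not provided by the repository. Its optional argument is the directory containing the two scripts (default: `scripts`).

```shell
# Hypothetical wrapper (not part of this repository): runs prepare_db.sh -a,
# then run_all.sh, and skips the second step if the first one fails.
run_experiments() {
    script_dir="${1:-scripts}"
    bash "$script_dir/prepare_db.sh" -a && bash "$script_dir/run_all.sh"
}
```

From the main directory, calling `run_experiments` with no arguments is equivalent to running the two commands in order.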