Topic discovery with Sampled Min-Hashing

Installation

Install dependencies (use sudo for system-wide installation):

pip install scikit-learn nltk
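
To confirm the installation worked before going further, a quick check along these lines may help. It is a minimal sketch, not part of this repository: the helper name `missing_modules` is hypothetical, and it only tests whether the top-level modules can be found, not whether their versions are suitable.

```python
import importlib.util

# Import names, not pip package names: scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["sklearn", "nltk"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If the script reports a missing module, rerun the pip command above in the same Python environment.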

Download the required NLTK resources:

python -m nltk.downloader punkt averaged_perceptron_tagger wordnet

The script 'scripts/prepare_db.sh' downloads the following resources/corpora:

20 Newsgroups
English Wikipedia
Spanish Wikipedia

If you have access to a copy of the Reuters corpus, the script will prompt you to add the path to it.

To run the experiments, execute the following from the main directory:

bash scripts/prepare_db.sh -a
bash scripts/run_all.sh
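
The two steps above can also be driven from Python so that the experiments do not start if corpus preparation fails. This is only a sketch: the `run_pipeline` helper and its `runner` parameter are hypothetical and not part of this repository; the commands themselves are the ones listed above.

```python
import subprocess

# The two commands from this README, in the order they must run.
COMMANDS = [
    ["bash", "scripts/prepare_db.sh", "-a"],
    ["bash", "scripts/run_all.sh"],
]


def run_pipeline(commands=COMMANDS, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        # check=True makes subprocess.run raise CalledProcessError
        # when a command exits with a non-zero status.
        runner(cmd, check=True)


# Usage (from the repository's main directory):
# run_pipeline()
```

The `runner` argument exists only so the command sequence can be inspected or dry-run without launching the scripts.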
