# Pipeline for Mathematical namespace discovery

Input:

* MLP output and DBpedia category information (see the Datasets section below)

Output:

* namespaces

## Running It

```
git clone https://github.com/alexeygrigorev/namespacediscovery-pipeline.git
cd namespacediscovery-pipeline/src
python pipeline.py
```

Modify `luigi.cfg` to set different configuration parameters.

At a minimum, you need to change the following parameters (a sample config follows the list):

* `[MlpResultsReadTask]/mlp_results` - path to the output of MLP
* `[MlpResultsReadTask]/categories_processed` - path to the category information
* (optional) `[DEFAULT]/intermediate_result_dir` - path to the directory where pre-calculated results are stored
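
A minimal sketch of what these entries might look like in `luigi.cfg`; the paths below are placeholders, not real defaults:

```
# placeholder paths - adjust to your own setup
[DEFAULT]
intermediate_result_dir=/tmp/namespace-discovery/intermediate

[MlpResultsReadTask]
mlp_results=/data/mlp/output
categories_processed=/data/dbpedia/categories_processed
```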

Other parameters, all in the `[DEFAULT]` section (an illustrative scikit-learn sketch follows the list):

* `isv_type` - identifier vector space model; can be `nodef`, `weak` or `strong`
* `vectorizer_dim_red` - type of dimensionality reduction; can be `none`, `svd`, `nmf` or `random`
* `clustering_algorithm` - currently only `kmeans` is implemented
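
The pipeline's internals are not shown here, but the `svd`/`nmf` and `kmeans` options correspond to standard scikit-learn components. A minimal, hypothetical sketch of the `svd` + `kmeans` combination (the data and variable names are illustrative, not taken from the pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy "documents": in the pipeline these would be identifier/definition
# texts extracted by MLP, not hard-coded strings.
docs = ["sigma variance distribution", "eigenvalue matrix vector",
        "variance sigma mean", "matrix eigenvector eigenvalue"]

# Build a vector space model (the pipeline's isv_type controls how
# identifiers and definitions are combined; plain tf-idf here).
X = TfidfVectorizer().fit_transform(docs)

# vectorizer_dim_red=svd corresponds to truncated SVD (LSA);
# nmf would use sklearn.decomposition.NMF instead.
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

# clustering_algorithm=kmeans: cluster the reduced vectors.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_reduced)
print(labels)
```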

## Dependencies

* python2
* numpy
* scipy
* scikit-learn
* nltk
* python-Levenshtein
* fuzzywuzzy
* rdflib
* luigi

For PyData stack libraries such as numpy, scipy, scikit-learn and nltk, it's best to use the Anaconda installer.
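
If you use Miniconda rather than the full Anaconda distribution (an assumption, not a project requirement), these packages can also be installed explicitly:

```
conda install numpy scipy scikit-learn nltk
```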

Not all dependencies come pre-installed with Anaconda; use pip to install the rest:

```
pip install python-Levenshtein
pip install fuzzywuzzy
pip install luigi
pip install rdflib
```

We also need to download some data for nltk: the list of stopwords and the model for tokenization. Run the following in the Python console to install them:

```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
```
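
Equivalently, the same data can be fetched from the shell via nltk's downloader module:

```
python -m nltk.downloader stopwords punkt
```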

See `SETUP.md` for an example of how to set up the environment.

## Datasets

We use the following datasets as input:

* mlp ...
* DBpedia category information

Classification schemes:

The classification schemes datasets are already available in the `data` directory.