Mathematical Language Processing

Run

  • compile the maven project

  • run the jar (mathosphere-core-3.0.0-SNAPSHOT-jar-with-dependencies.jar)

  • try java -jar mathosphere-core-3.0.0-SNAPSHOT-jar-with-dependencies.jar to see the list of possible commands

  • try java -jar mathosphere-core-3.0.0-SNAPSHOT-jar-with-dependencies.jar list -in *in file* --tex to extract the identifiers from a Wikipedia article

  • try java -jar mathosphere-core-3.0.0-SNAPSHOT-jar-with-dependencies.jar extract -in *in file* --tex to extract the identifiers and the corresponding set of definitions from a Wikipedia article

  • test data can be found in mathosphere-core\src\test\resources\com\formulasearchengine\mathosphere\mlp\performance and mathosphere-core\src\test\resources\com\formulasearchengine\mathosphere\mlp\gold

  • this project can also be run on a Flink cluster for fast analysis of large text corpora

How to use the machine learning classifier

train & test

  • use test data from the eval_dataset.xml and gold.json

  • java -jar mathosphere-core-3.0.0-SNAPSHOT-jar-with-dependencies.jar ml -in eval_dataset.xml -out . --goldFile gold.json --tex --threads 10 --writeSvmModel --svmGamma 0.0185881361 --svmCost 1.0

  • this yields an SVM model and a string2vector filter, both of which are needed to classify instances, plus detailed statistics of the 10-fold cross-validation that precedes the training process.

  • if you provide several values for cost and gamma, all possible combinations will be trained and tested (e.g. --svmGamma 0.018 --svmGamma 0.019 --svmCost 1.0 --svmCost 2.0). This can be used to find optimal SVM parameters.

  • for faster TeX extraction it is advisable to install mathoid locally and pass its address via the --texvcinfo parameter, e.g. add --texvcinfo http://localhost:10044/texvcinfo to the execution parameters.
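
The parameter sweep described above can be sketched as a small shell script. This is only an illustration: the helper repeat_flag and the variable names are hypothetical, the gamma and cost values are the ones from the example above, and the script prints the command rather than running it, so it makes no claim about how the tool handles the combination of --writeSvmModel with several values.

```shell
#!/bin/sh
# Dry-run sketch: build the repeated --svmGamma / --svmCost flags for a sweep.
# Nothing is executed; the assembled command is only printed.

JAR="mathosphere-core-3.0.0-SNAPSHOT-jar-with-dependencies.jar"
GAMMAS="0.018 0.019"   # candidate gamma values (from the example above)
COSTS="1.0 2.0"        # candidate cost values (from the example above)

# Hypothetical helper: repeat_flag FLAG V1 V2 ... prints "FLAG V1 FLAG V2 ...".
repeat_flag() {
    flag="$1"; shift
    out=""
    for value in "$@"; do
        out="$out $flag $value"
    done
    # Drop the leading space added by the first iteration.
    printf '%s' "${out# }"
}

# Count the words passed as arguments.
count_words() { echo "$#"; }

# $GAMMAS and $COSTS are left unquoted on purpose so they split into words.
GAMMA_FLAGS=$(repeat_flag --svmGamma $GAMMAS)
COST_FLAGS=$(repeat_flag --svmCost $COSTS)

# Every gamma is paired with every cost.
combos=$(( $(count_words $GAMMAS) * $(count_words $COSTS) ))

CMD="java -jar $JAR ml -in eval_dataset.xml -out . --goldFile gold.json --tex --threads 10 --writeSvmModel $GAMMA_FLAGS $COST_FLAGS"

echo "Combinations to train and test: $combos"
echo "$CMD"
```

Since each combination is preceded by its own 10-fold cross-validation, it is worth checking the printed number of combinations before starting a long run.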

classify

  • assuming your machine has n cores, we advise you to use n threads.
  • java -jar mathosphere-core-3.0.0-SNAPSHOT-jar-with-dependencies.jar classify -in *wikipedia dump*.xml -out . --tex --threads 10 --stringFilter string_filter_c_1.0_gamma_0.0185881361.model --svmModel svm_model_c_1.0_gamma_0.0185881361.model
  • The result of this will be a folder called extractedDefiniens/json containing as many files as threads were used, each holding classifications in JSON format.

Remarks:

  • eval_dataset.xml can be used to test the classification, but since the model was trained on that same data, this is circular reasoning.
  • Always use a string2vector filter and a model that were trained in the same run; the results are otherwise undefined.
  • As before, it is advisable to use a local mathoid instance.
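
The advice on thread count and on matching filter and model files can be combined in a short wrapper. The following is a sketch under assumptions: the file naming pattern is inferred from the two example file names above and is not documented behaviour, wikipedia_dump.xml is a placeholder input name, and model_suffix is a hypothetical helper. The command is printed, not executed.

```shell
#!/bin/sh
# Dry-run sketch: assemble a classify command with one thread per core and
# a filter/model pair derived from the same cost and gamma values.

JAR="mathosphere-core-3.0.0-SNAPSHOT-jar-with-dependencies.jar"
INPUT="wikipedia_dump.xml"   # placeholder; replace with your dump file
COST="1.0"
GAMMA="0.0185881361"

# Number of online cores; fall back to 1 if getconf cannot report a number.
THREADS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case "$THREADS" in
    ''|*[!0-9]*) THREADS=1 ;;
esac

# Hypothetical helper. The pattern c_<cost>_gamma_<gamma>.model is an
# assumption inferred from the example file names in this README.
model_suffix() {
    printf 'c_%s_gamma_%s.model' "$1" "$2"
}

# Deriving both names from one suffix keeps filter and model from the same run.
SUFFIX=$(model_suffix "$COST" "$GAMMA")
STRING_FILTER="string_filter_$SUFFIX"
SVM_MODEL="svm_model_$SUFFIX"

CMD="java -jar $JAR classify -in $INPUT -out . --tex --threads $THREADS --stringFilter $STRING_FILTER --svmModel $SVM_MODEL"
echo "$CMD"
```

Building both file names from a single pair of values makes it harder to accidentally mix a filter from one training run with a model from another.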