-
Notifications
You must be signed in to change notification settings - Fork 1
Ensemble Learning for Anaphora- and Coreference-Resolution-Systems
License
akullpp/ELAC
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
License: Apache Usage Instructions Please read INSTALL for installation instructions. Important notes: 1. BART is a very complex tool with very specific options. Therefore, you have to set it up and run it on the corpora specified in ./config/config.xml as PathToTrainingFiles and PathToTestFiles with your individual options. Furthermore, we suggest that you train BART itself otherwise its results may be to bad to be considered by the classifiers. This is a process our program does not want to do automatically since many external factors and the user's intention must be considered when running BART. By running this program, we assume that you did the necessary steps with BART and the output of BART is located in the markables folder. 2. We are sorry for the huge console output. This is a general problem of MMAX2 and was reported by us. A concise output of the results will be printed after the specific process has finished. 3. Please read about our findings in the ./poster.png Instructions: Start the CLI via "java -Xmx1024M -jar ELAC.jar" (1) starts the JavaRap preprocessing unit. This is a necessary prerequisite for the training (4) and testing (5) process! It creates a javarap subfolder in the Basedata folder specified by PathToTrainingFiles/PathToTestFiles in the config.xml with the JavaRap results. Please select whether you have JavaRap installed locally (1-2) or you need to access it via an SSH connection (1-1). You will be prompted for your credentials if you choose to run JavaRap via SSH. It is very important that you forward X in order to get prompted for your credentials. (2) starts the feature distribution analysis tool. This is a standalone tool. Please specify whether you want the feature distribution of the test corpus (2-1) or training corpus (2-2) which have to be specified by PathToTrainingFiles/PathToTestFiles in the config.xml. After the process is finished, you will find a file in the ResultOutputDir (see config.xml) with the total/average distribution of each feature in the selected corpus. (3) starts the Salsa converter. This is a standalone tool. It converts an XML file in Tiger/XML format which you have to provide to an XML file with an inline annotation that can be processed by MMAX2. It will be written to the folder specified by ResultOutputDir in the config. (4) starts the training of the classifier. You will need the results from the JavaRapPreProcessing unit (1) in order to run this process! Furthermore, you will need to set following options in your .config/config.xml: a) PathToTrainingFiles: Please enter the path to the training corpus. b) Runner: Please enter the ACR-Systems you want to use. In order to use an ACR-System it must be specified in de.uniheidelberg.cl.swp.testacr. Currently BART, JavaRap and LingPipe are supported. Please feel free to add your ACR-System. c) EvaluationMethod: MUC6 is the default to choose. If you consider to use DIRECTNEIGHBORSONLY please read the extended documentation. d) FeatureFilter: Enter the features which are to be extracted. The features have to be specified in de.uniheidelberg.cl.swp.featureExtraction.features. Please note that every feature which is specified in FeatureFilter has to have its specific entry with its custom value or value range. The results of each individual ACR-System and the generated results.arff file can be found at the location specified by ResultOutputDir in the config. The results.arff is used in the testing process (5). Notes: 1. Multiple options like Runner or FeatureFilter as well as the specifc entries for each feature have to be separated by ";". 2. Depending on the size of the corpus, this process might take a long time. (5) starts the testing/machine-learning process. You will need a WEKA compatible ARFF file in order to run this process, which is either created by the training process (4) or can be provided in form of a custom ARFF file (5-1)! Furthermore, you will need to set the following options in your .config/config.xml: a) PathToTestFiles: Please enter the path to the test corpus. b) classifier/subclassifier: Specify the classifier/subclassifier. We currently support: J48 as classifier and subclassifier BAYES as classifier and subclassifier BAGGING as classifier which needs a subclassifier ADABOOST as classifier which needs a subclassifier NEARESTNEIGHBOR as subclassifier for stacking ZEROR as subclassifier for stacking KSTAR as subclassifier for stacking BFTREE as subclassifier for stacking c) stacking: True or false. If you want to use stacking, you will have to specify a classifier for level 1 and a subclassifier for level 0. d) options: Options for the (sub-)classifier. Please read the WEKA documentation for an extensive list. The result of the testing process can be found at the location specified by ResultOutputDir in the config. Note, depending on the size of the corpus, this process might take a long time. (6) starts the Ablation Testing tool. This is a standalone tool. You will have to specify the name for the new file, output will be written to the folder specified by ResultOutputDir in the config. Furthermore, you will have to specify the correct path to the test corpus in the PathToTestFiles entry in the config. Note, depending on the size of the corpus, this process might take a long time.
About
Ensemble Learning for Anaphora- and Coreference-Resolution-Systems
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published