IERCT extracts key trial information (patient group, treatments, outcome measure, results) from the abstracts of Randomised Clinical Trial (RCT) reports.
take_abstract.take(pmid, path)
retrieves the abstract of an RCT report (identified by "pmid") from the PubMed website and saves it to a disk location ("path") in a format readable by the preprocessing module.
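A minimal usage sketch (the PMID and the target path below are placeholders):

    from ierct.src import take_abstract

    # Fetch the abstract of a PubMed record and save it on disk
    # (both the PMID and the path are illustrative)
    take_abstract.take("12345678", "./data/abstracts/12345678.txt")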
preprocessing_functions.preprocess_file(filename)
preprocesses the abstract given in "filename", performing part-of-speech
tagging, text normalization, chunking and semantic categorization.
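For example, on the abstract saved above:

    from ierct.src import preprocessing_functions

    # Part-of-speech tag, normalize, chunk and semantically
    # categorize a single saved abstract
    preprocessing_functions.preprocess_file("./data/abstracts/12345678.txt")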
preprocessing_functions.preprocess_data(preprocess_from, read_dat)
batch-preprocesses the abstracts in "preprocess_from".
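A hedged sketch of batch preprocessing; the expected value of "read_dat" is not documented above, so None is only a placeholder, and capturing the result in "data" for the examples below is likewise an assumption:

    # Batch-preprocess every abstract saved under a directory
    # (the directory and the None argument are guesses)
    data = preprocessing_functions.preprocess_data("./data/abstracts", None)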
classifier_functions.apply_features(data, feature_extractor)
returns a list of feature sets, one for each preprocessed abstract in "data", where the features are extracted by the function "feature_extractor" (i.e. classifier_functions.feature_extractor).
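Continuing the sketch (that apply_features returns the feature sets together with the "excluded" indexes is inferred from the batch_tagger description below):

    from ierct.src import classifier_functions

    # One feature set per preprocessed abstract, plus the indexes
    # of tokens the model should ignore
    featuresets, excluded = classifier_functions.apply_features(
        data, classifier_functions.feature_extractor)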
The class classifier_functions.Classifier(feature_extractor)
contains the main methods for training the model and identifying the information items.
classifier_functions.Classifier.train(train_data)
trains the model on the abstracts in "train_data", as returned by classifier_functions.apply_features.
classifier_functions.Classifier.batch_tagger(test_data, excluded)
identifies the information items by tagging the abstracts given in "test_data". "excluded" is a list of indexes of tokens that the model should not consider; it is returned by classifier_functions.apply_features along with the feature sets.
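A training-and-tagging sketch reusing the names from the previous example; the 80/20 split of the feature sets is purely illustrative:

    # Train on the first part of the data, tag the rest
    # (the returned value is assumed to hold the tagged abstracts)
    classifier = classifier_functions.Classifier(classifier_functions.feature_extractor)
    classifier.train(featuresets[:80])
    tagged = classifier.batch_tagger(featuresets[80:], excluded[80:])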
For a demo output, run:

    from ierct.src import classifier_functions
    classifier_functions.demo()
The classes classifier_evaluation.HoldOut(classifier, test_set, excluded)
and classifier_evaluation.CrossValidation(classifier, data, excluded, folds)
perform evaluation routines for the classifier instance "classifier" (a classifier_functions.Classifier), either on the test data in "test_set" (classifier_evaluation.HoldOut) or over a number of cross-validation folds ("folds") on "data" (classifier_evaluation.CrossValidation).
classifier_evaluation.HoldOut.tabulate_evaluation_measures()
and classifier_evaluation.CrossValidation.tabulate_evaluation_measures()
print the results.
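An evaluation sketch reusing the names from the previous examples (whether the constructors run the evaluation themselves or tabulate_evaluation_measures() does is an assumption):

    from ierct.src import classifier_evaluation

    # Hold-out evaluation on the held-out portion of the data
    holdout = classifier_evaluation.HoldOut(classifier, featuresets[80:], excluded[80:])
    holdout.tabulate_evaluation_measures()

    # 10-fold cross-validation over the full dataset
    cv = classifier_evaluation.CrossValidation(classifier, featuresets, excluded, 10)
    cv.tabulate_evaluation_measures()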
Run testing.py to replicate the results with the datasets in "./data".
Run demo.py to see a demo output.
Requirements:

- Python 2.7 or higher
- scipy
- numpy
- nltk (*)
- re
- GeniaTagger 3.0.1 (**)
- sklearn
- gurobipy
- beautifulsoup4
- html5lib
- json
- urllib
- shelve
(*) The NLTK stopwords corpus is needed; it can be downloaded with nltk.download("stopwords").
(**) Install the Genia Tagger files in "%ProgramFiles%\geniatagger-3.0.1". A Windows port is available (many thanks to Syeed Ibn Faiz for this).