Combining lexical and context features for automatic ontology extension
This repository contains the scripts used to build and train the prediction models, together with the scripts for evaluating their performance.
The corpus consists of full-text PMC articles from Europe PMC.
To install the Python dependencies, run:
pip install -r requirements.txt
- To remove stopwords and punctuation from your corpus/article file, use cleanText.py, which applies an extended list of stopwords and punctuation marks chosen for our needs (the complete list is written inside the code).
- Reasoner.groovy identifies the infectious and anatomical disease classes and their corresponding subclasses.
- labelSynoExtraction.groovy extracts ontology synonyms/labels.
- Diseases.py uses all of the generated vectors to train an ANN with different hidden layer sizes to classify whether a term refers to a disease class or is a non-disease term.
- inecAnato.py trains an ANN to classify whether a term refers to one of the infectious disease sub-classes (bacterial, fungal, parasitic or viral) using 2resINF.txt, to one of the anatomical disease sub-classes (12 different sub-classes) using 2resANA.txt, or to a sub-class from both groups combined using 2resboth.txt.
- The evaluation for each case was done using F-score and AUC functions within
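As an illustration of the two metrics named above, here is a minimal pure-Python sketch of F-score and ROC AUC for binary labels. It is not taken from the scripts in this repository, which may compute the metrics differently (for example with a library). The function names f_score and roc_auc, the 0.5 threshold, and the sample labels and scores are all hypothetical.

```python
def f_score(y_true, y_pred, beta=1.0):
    """F-beta score for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction: score is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = disease term, 0 = non-disease term.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]  # made-up classifier outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("F-score:", f_score(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

The pairwise AUC loop is quadratic in the number of examples, which is fine for a small check but slow on large test sets.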
STEP1. Clean the Full-text PMC articles by running
STEP2. Annotate the full-text PMC articles with disease names using Whatizit.
STEP3. Generate the word embeddings for the annotated text using word2vec.
STEP4. Run the Diseases.py script, specifying the name of the file containing the embeddings produced in STEP3 (CleanAllVectors.txt in our case) together with DiseasVectors.txt, in order to predict whether a word is a disease or not.
STEP5. Run the inecAnato.py script, specifying the name of the file containing the embeddings (2resINF.txt, 2resANA.txt or 2resboth.txt), in order to predict new sub-classes within infectious diseases, anatomical diseases or both.
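To make the cleaning in STEP1 more concrete, here is a minimal sketch of stopword and punctuation removal. It is not the code in cleanText.py: the stopword set is a tiny illustrative sample (the extended list lives in the script itself), lowercasing the text is an assumption, and the name clean_text is hypothetical.

```python
import string

# Tiny illustrative stopword sample; the extended list used in this
# repository is defined inside cleanText.py and is not reproduced here.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "with"}

# Translation table mapping every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase the text (an assumption), replace punctuation with
    spaces, and drop any token found in the stopword set."""
    no_punct = text.lower().translate(PUNCT_TABLE)
    tokens = no_punct.split()
    return " ".join(tok for tok in tokens if tok not in stopwords)


print(clean_text("Malaria is a parasitic disease, spread by mosquitoes."))
```

Replacing punctuation with spaces rather than deleting it splits hyphenated terms into separate tokens, which may or may not be wanted for disease names.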
For any comments or help needed with how to run the scripts, please send an email to: firstname.lastname@example.org