Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
PostScript
C
Makefile
HTML
Other
Shell
Other
Cannot retrieve the latest commit at this time.
| Failed to load latest commit information. | |||
|
|
classes | ||
|
|
data | ||
|
|
etc | ||
|
|
lib | ||
|
|
ncbo-v4-validation | ||
|
|
pi-knowtator-xml | ||
|
|
scripts | ||
|
|
src | ||
|
|
NOTES-ON-PORTING-TO-NCBO-V4 | ||
|
|
README | ||
|
|
build.xml | ||
README
UNIVERSITY OF PITTBURGH SPL DRUG NAME ENTITY RECOGNIZER Author: Richard D Boyce, PhD, Greg Gardner, Yifan Ning Copyright (c) 2011-2015 Richard D Boyce, PhD, University of Pittsburgh All Rights Reserved Use, reproduction, and preparation of derivative works are permitted. Any copy of this software or of any derivative work must include the above copyright notice and this paragraph. Any distribution of this software or derivative works must comply with all applicable United States export control laws. This software is made available as is, and Richard D Boyce and the University of Pittsburgh disclaim all warranties, express or implied, including without limitation the implied warranties of merchantability and fitness for a particular purpose, and notwithstanding any other provision contained herein, any liability for damages resulting from the software or its use is expressly disclaimed, whether arising in contract, tort (including negligence) or strict liability, even if Richard D Boyce and the University of Pittsburgh is advised of the possibility of such damages. -------------------------------------------------------------------------------- ---------- HOW TO RUN ---------- ** To run the NER program 1) Place the files you want to annotate in the 'textfiles' sub-folder 2) Change the following BASEPATH to your local configuration and export: BASEPATH="/home/rdb20/u-of-pitt-SPL-drug-NER" 3) Configure the 'dictionary_path' variable in etc/file_properties.xml so the the path to the WordNet dictionary resolves correctly 4) Configure the BASEPATH and PIPATH path assignments in src/groovy/util/PItoXML.groovy 5) export this variable CLASSPATH=$BASEPATH/lib/xml-apis-1.4.01.jar:$BASEPATH/lib/jena-iri-0.9.2.jar:$BASEPATH/lib/httpcore-4.1.3.jar:$BASEPATH/lib/extjwnl-1.6.4.jar:$BASEPATH/lib/commons-codec-1.5.jar:$BASEPATH/lib/jena-arq-2.9.2.jar:$BASEPATH/lib/log4j-1.2.16.jar:$BASEPATH/lib/commons-httpclient-3.0.1.jar:$BASEPATH/lib/commons-logging-1.1.1.jar:$BASEPATH/lib/concurrentlinkedhashmap-lru-1.2.jar:$BASEPATH/lib/jena-tdb-0.9.2.jar:$BASEPATH/lib/httpclient-4.1.2.jar:$BASEPATH/lib/extjwnl-utilities-1.6.4.jar:$BASEPATH/lib/jena-core-2.7.2.jar:$BASEPATH/lib/mysql-connector-java-5.1.17.jar:$BASEPATH/lib/xercesImpl-2.10.0.jar:$BASEPATH/lib/http-builder-0.5.2.jar:$BASEPATH/lib/jackson-core-2.1.1.jar:$BASEPATH/lib/jackson-databind-2.1.1.jar:$BASEPATH/lib/jackson-annotations-2.1.1.jar:$BASEPATH/target/classes/: 5) Run $ groovy -cp $CLASSPATH src/groovy/util/PItoXML.groovy 6) Output will be written to 'processed-output' -- ** Test run of the Annotator 1) Set the BASEPATH and CLASSPATH as above 2) compile using 'ant compile' 3) cd target/classes 4) java -cp $CLASSPATH ncbo/AnnotatorClient There is a default string that will be sent to the annotator for processing. You should see results from MESH and RXNORM. ------------------------------------------------------------ Validation procedure ------------------------------------------------------------ 1) remove any text files from the folder 'textfiles' 2) copy all sections located in 'sample-package-insert-statements' to the 'textfiles' folder 3) run the NER program. This will produce output XML files in the 'processed-output' folder 4) make a temporary folder for analysis e.g, /tmp/ner-performance-analysis-09152014/ 5) now run the following series of commands from within the ZSH shell $ cd textfiles $ for i in *.txt echo $i && python ../src/python/calculate-metrics-for-a-single-note.py -n ../processed-output/$i-PROCESSED.xml -k ../pi-knowtator-xml/$i.knowtator.xml > <path to analysis folder>/$i-ner-performance.txt 5) Use these grep commands for getting lists of recall, precision, and F values that can then be copied into R vectors for summary statistics grep -r "Recall:" ./ grep -r "F-measure:" ./ grep -r "precision:" ./ NOTE: when this analysis was ran on 9.15.2014 to detect changes in performance due to migration to Bioportal v4, the following sections had to be edited for the NER to complete: - package-insert-section-20.txt - package-insert-section-55.txt In both cases, the string 'magnesium stearate" had to be removed from the orignal text so the the NER program did not generate the following exception: java.io.IOException: Server returned HTTP response code: 500 for URL: http://data.bioontology.org/ontologies/MESH/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FMESH%2FC031183 ------------------------------------------------------------------------------------------ COMMAND FOR IDENTIFYING XML ENCODING PROBLEMS...... ------------------------------------------------------------------------------------------ for i in *PROCESSED* xmllint --encode UTF-8 $i > $i-LINTED.xml ----------------------------------------------------------------------------------- How to converts NER outputs to the format that feeds domeo loading program ----------------------------------------------------------------------------------- (1) run section text parsing program at "analyses/ddi-table-parsing/scripts/retrieveSPLSquery.py" Inputs: list of setIds at file "analyses/ddi-table-parsing/scripts/setIDs.txt" Outputs: dailymed section text at folder "analyses/ddi-table-parsing/scripts/outfiles" Copy txt files in outfiles folder to "u-of-pitt-SPL-drug-NER/textfiles/" as section label sources (2) run NER program to get NER dataset in XML (3) run converting script "convertNER_2_JSON_DOMEO.py" to get ner data set in json format that feeds domeo annotation pre-loading program $ convertNER_2_JSON_DOMEO.py > NER-outputs.json domeo loading program at "u-of-pitt-spl-ddi-v2.0/PK-DDI-NLP-pipeline/loadElasticSearch"