No description, website, or topics provided.
PostScript C Makefile HTML Other Shell Other
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
classes
data
etc
lib
ncbo-v4-validation
pi-knowtator-xml
scripts
src
NOTES-ON-PORTING-TO-NCBO-V4
README
build.xml

README

UNIVERSITY OF PITTBURGH SPL DRUG NAME ENTITY RECOGNIZER

Author: Richard D Boyce, PhD, Greg Gardner, Yifan Ning

Copyright (c) 2011-2015 Richard D Boyce, PhD, University of Pittsburgh 
All Rights Reserved

Use, reproduction, and preparation of derivative works are permitted.
Any copy of this software or of any derivative work must include the
above copyright notice and this paragraph.  Any distribution of this
software or derivative works must comply with all applicable United
States export control laws.  This software is made available as is,
and Richard D Boyce and the University of Pittsburgh disclaim all
warranties, express or implied, including without limitation the
implied warranties of merchantability and fitness for a particular
purpose, and notwithstanding any other provision contained herein, any
liability for damages resulting from the software or its use is
expressly disclaimed, whether arising in contract, tort (including
negligence) or strict liability, even if Richard D Boyce and the
University of Pittsburgh is advised of the possibility of such
damages.

--------------------------------------------------------------------------------

----------
HOW TO RUN
----------

** To run the NER program

1) Place the files you want to annotate in the 'textfiles' sub-folder

2) Change the following BASEPATH to your local configuration and export:

BASEPATH="/home/rdb20/u-of-pitt-SPL-drug-NER"

3) Configure the 'dictionary_path' variable in etc/file_properties.xml so the the path to the WordNet dictionary resolves correctly

4) Configure the BASEPATH and PIPATH path assignments in src/groovy/util/PItoXML.groovy

5) export this variable

CLASSPATH=$BASEPATH/lib/xml-apis-1.4.01.jar:$BASEPATH/lib/jena-iri-0.9.2.jar:$BASEPATH/lib/httpcore-4.1.3.jar:$BASEPATH/lib/extjwnl-1.6.4.jar:$BASEPATH/lib/commons-codec-1.5.jar:$BASEPATH/lib/jena-arq-2.9.2.jar:$BASEPATH/lib/log4j-1.2.16.jar:$BASEPATH/lib/commons-httpclient-3.0.1.jar:$BASEPATH/lib/commons-logging-1.1.1.jar:$BASEPATH/lib/concurrentlinkedhashmap-lru-1.2.jar:$BASEPATH/lib/jena-tdb-0.9.2.jar:$BASEPATH/lib/httpclient-4.1.2.jar:$BASEPATH/lib/extjwnl-utilities-1.6.4.jar:$BASEPATH/lib/jena-core-2.7.2.jar:$BASEPATH/lib/mysql-connector-java-5.1.17.jar:$BASEPATH/lib/xercesImpl-2.10.0.jar:$BASEPATH/lib/http-builder-0.5.2.jar:$BASEPATH/lib/jackson-core-2.1.1.jar:$BASEPATH/lib/jackson-databind-2.1.1.jar:$BASEPATH/lib/jackson-annotations-2.1.1.jar:$BASEPATH/target/classes/:


5) Run

$ groovy -cp $CLASSPATH src/groovy/util/PItoXML.groovy

6) Output will be written to 'processed-output' -- 


** Test run of the Annotator

1) Set the BASEPATH and CLASSPATH as above

2) compile using 'ant compile'

3) cd target/classes

4) java -cp $CLASSPATH  ncbo/AnnotatorClient

There is a default string that will be sent to the annotator for
processing. You should see results from MESH and RXNORM.

------------------------------------------------------------
Validation procedure
------------------------------------------------------------

1) remove any text files from the folder 'textfiles'

2) copy all sections located in 'sample-package-insert-statements' to the 'textfiles' folder

3) run the NER program. This will produce output XML files in the 'processed-output' folder

4) make a temporary folder for analysis e.g, /tmp/ner-performance-analysis-09152014/

5) now run the following series of commands from within the ZSH shell

$ cd textfiles

$ for i in *.txt
echo $i && python ../src/python/calculate-metrics-for-a-single-note.py -n ../processed-output/$i-PROCESSED.xml -k ../pi-knowtator-xml/$i.knowtator.xml > <path to analysis folder>/$i-ner-performance.txt

5) Use these grep commands for getting lists of recall, precision, and F values that can then be copied into R vectors for summary statistics

grep -r "Recall:" ./
grep -r "F-measure:" ./
grep -r "precision:" ./


NOTE: when this analysis was ran on 9.15.2014 to detect changes in
performance due to migration to Bioportal v4, the following sections
had to be edited for the NER to complete:

- package-insert-section-20.txt
- package-insert-section-55.txt

In both cases, the string 'magnesium stearate" had to be removed from
the orignal text so the the NER program did not generate the following
exception:

java.io.IOException: Server returned HTTP response code: 500 for URL: http://data.bioontology.org/ontologies/MESH/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FMESH%2FC031183



------------------------------------------------------------------------------------------
COMMAND FOR IDENTIFYING XML ENCODING PROBLEMS......
------------------------------------------------------------------------------------------

for i in *PROCESSED*                
xmllint --encode UTF-8 $i > $i-LINTED.xml



-----------------------------------------------------------------------------------
How to converts NER outputs to the format that feeds domeo loading program
-----------------------------------------------------------------------------------

(1) run section text parsing program at "analyses/ddi-table-parsing/scripts/retrieveSPLSquery.py"

Inputs: list of setIds at file "analyses/ddi-table-parsing/scripts/setIDs.txt"
Outputs: dailymed section text at folder "analyses/ddi-table-parsing/scripts/outfiles"
Copy txt files in outfiles folder to "u-of-pitt-SPL-drug-NER/textfiles/" as section label sources

(2) run NER program to get NER dataset in XML

(3) run converting script "convertNER_2_JSON_DOMEO.py" to get ner data set in json format that feeds domeo annotation pre-loading program

$ convertNER_2_JSON_DOMEO.py > NER-outputs.json

domeo loading program at "u-of-pitt-spl-ddi-v2.0/PK-DDI-NLP-pipeline/loadElasticSearch"