bioresources

Data resources from the biomedical domain

Information for developers

Extending grounding resource files

The src/main/resources/org/clulab/reach/kb folder contains a number of tab-separated value (TSV) files that hold grounding entries. Several of these files have corresponding automated update scripts in the scripts folder. If an update script exists, do not edit the corresponding TSV file by hand; instead, make the change in the script and re-run it.

Most TSV files contain primary grounding entries from a given source. Additionally, the NER-Grounding-Override.tsv file contains manually curated groundings that are used to apply overrides.

Note that the files are version-controlled in a gzipped form and therefore need to be decompressed for editing and then compressed again for checking in to version control.
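The decompress, edit, recompress cycle described above can be sketched in Python using only the standard library. This is a minimal sketch, assuming each KB is stored under a name ending in `.tsv.gz`; the helper names `decompress_kb` and `compress_kb` are hypothetical and not part of bioresources.

```python
import gzip
import shutil
from pathlib import Path


def decompress_kb(gz_path: Path) -> Path:
    """Write a plain-text copy of a gzipped KB next to it and return its path."""
    tsv_path = gz_path.with_suffix("")  # "x.tsv.gz" -> "x.tsv"
    with gzip.open(gz_path, "rb") as src, open(tsv_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return tsv_path


def compress_kb(tsv_path: Path) -> Path:
    """Gzip an edited TSV back to "<name>.tsv.gz" and return that path."""
    gz_path = tsv_path.with_name(tsv_path.name + ".gz")
    with open(tsv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```

Calling `decompress_kb` on a file in the kb folder yields an editable TSV; once the edits are done, `compress_kb` overwrites the gzipped file that is checked in. The command-line tools `gunzip` and `gzip` achieve the same result.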

Note also that adding a new TSV file or a new entity type requires corresponding changes in several places in the Reach code base. For an example of the changes made in Reach when mesh_disease.tsv was added to bioresources, see https://github.com/clulab/reach/pull/686/files.

Updating the NER files

Once edits have been made to one or more files in the kb folder, the NER files need to be regenerated. For this, the reach repo needs to be cloned into the same parent folder as the bioresources repo. Then run the ner_kb.sh script, which converts the KBs in org.clulab.reach.kb into the format expected by the BioNLPProcessor NER. Re-run this script every time a grounding file changes or the tokenization algorithm in BioNLPProcessor changes.
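The sibling-folder requirement above can be checked before launching the script. The following is a minimal sketch, assuming both repos were cloned under their default folder names (`bioresources` and `reach`) and that ner_kb.sh is run with bash from the bioresources root without arguments; `sibling_reach_dir` and `regenerate_ner` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def sibling_reach_dir(bioresources_dir: Path) -> Path:
    """Return the reach clone that sits next to bioresources, or raise if missing."""
    reach_dir = bioresources_dir.resolve().parent / "reach"
    if not reach_dir.is_dir():
        raise FileNotFoundError(f"expected a reach clone at {reach_dir}")
    return reach_dir


def regenerate_ner(bioresources_dir: Path) -> None:
    """Run ner_kb.sh from the bioresources root after checking the layout."""
    sibling_reach_dir(bioresources_dir)
    # Assumption: the script is invoked with bash and takes no arguments.
    subprocess.run(["bash", "ner_kb.sh"], cwd=bioresources_dir, check=True)
```

Failing early with a clear message is cheaper than discovering a missing reach clone partway through regeneration.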

The ner_kb.sh script uses ner_kb.config as its configuration input. If only a small number of KBs were modified, edit the file and keep only the modified KBs to avoid regenerating all KBs. The config file also controls which organisms' gene/protein synonyms are included in the NER resources. By default, only human proteins are included, but additional organism names can be listed in ner_kb.config to extend NER to those organisms.

Testing bioresources updated with Reach

To test changes in bioresources, first build and locally publish bioresources using sbt:

sbt publishLocal

Next, check version.sbt to get the current published version of bioresources, typically something like x.x.x-SNAPSHOT. Then navigate to the reach repo, edit processors/build.sbt, and change the bioresources version to the one just published. Reach will then use the locally published bioresources.
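Looking up the published version can also be scripted. The sketch below pulls the first quoted string after `:=` out of version.sbt; it assumes the file holds a single sbt version setting of the form `... := "x.x.x-SNAPSHOT"`, and the function names `parse_sbt_version` and `read_sbt_version` are hypothetical.

```python
import re
from pathlib import Path

# Matches the quoted value on the right-hand side of an sbt `:=` assignment.
_VERSION_RE = re.compile(r':=\s*"([^"]+)"')


def parse_sbt_version(text: str) -> str:
    """Return the version string assigned in the given version.sbt contents."""
    match = _VERSION_RE.search(text)
    if match is None:
        raise ValueError("no quoted version assignment found")
    return match.group(1)


def read_sbt_version(path: Path = Path("version.sbt")) -> str:
    """Read version.sbt (by default from the current directory) and parse it."""
    return parse_sbt_version(path.read_text(encoding="utf-8"))
```

The returned string is the value to enter as the bioresources version in processors/build.sbt in the reach repo.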

It is also possible to automatically build a custom branch of bioresources and Reach, and then run Reach tests using Docker. This is documented here: https://github.com/clulab/reach/tree/master/docker.

