A collection of repositories and links for biomedical corpora available on the Internet.
https://github.com/spyysalo/bc2gm-corpus
Provides a corpus of scientific texts used for BioCreative, a competition in which participants are given well-defined text-mining or information-extraction tasks in the biological domain. The BC2GM corpus consists mainly of the training and testing corpora from BioCreative I; the testing corpus for the current task consists of an additional 5,000 sentences that were held 'in reserve'.
https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/
https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC4CHEMD-IOBES
https://github.com/cambridgeltl/MTL-Bioinformatics-2016
The 2015 CDR challenge has been completed; the overview paper is cited below:
Wei CH, Peng Y, Leaman R, Davis AP, Mattingly CJ, Li J, Wiegers TC, and Lu Z. Overview of the BioCreative V Chemical Disease Relation (CDR) Task. Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pp. 154-166
https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC5CDR-chem-IOB
https://github.com/wonjininfo/CollaboNet
The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.
https://github.com/spyysalo/genia-pos
https://github.com/spyysalo/s800
SPECIES: a standalone command line application capable of identifying taxonomic mentions in documents and mapping them to corresponding NCBI Taxonomy database entries.
Given a folder of plain text files, SPECIES uses its dictionary of taxonomic names and synonyms to report each taxonomic mention (start and end position in each document), the detected term, and the corresponding NCBI Taxonomy database record identifier.
Besides binomials following the Linnaean naming convention, recognised taxonomic mentions include acronyms, common names and abbreviations, as well as misspellings and the rest of the naming types supported by the NCBI Taxonomy.
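The dictionary lookup described above can be illustrated with a minimal sketch. This is not the SPECIES implementation: the function name `find_taxon_mentions`, the toy dictionary `TAXON_DICT`, and the choice of exclusive end offsets are all assumptions made for illustration. Only the reported fields (start, end, detected term, NCBI Taxonomy identifier) follow the description.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: lowercase name or synonym -> NCBI Taxonomy ID.
# A real tagger would load a far larger name and synonym list.
TAXON_DICT: Dict[str, int] = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "escherichia coli": 562,
    "e. coli": 562,
}


def find_taxon_mentions(
    text: str, dictionary: Dict[str, int]
) -> List[Tuple[int, int, str, int]]:
    """Return (start, end, term, taxonomy_id) for each dictionary hit.

    Offsets are character positions; `end` is exclusive (an assumption,
    the real tool may report positions differently).
    """
    # Try longer names first so multi-word names win over shorter entries.
    names = sorted(dictionary, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds keep matches on word boundaries ("human" but not "humanity").
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Human and Mus musculus samples were compared with E. coli."
    for start, end, term, tax_id in find_taxon_mentions(sample, TAXON_DICT):
        print(start, end, term, tax_id)
```

Exact case-insensitive matching like this covers scientific names, common names and abbreviations that appear in the dictionary. Handling misspellings, as the description mentions, would need approximate matching on top of it.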
https://arxiv.org/abs/1901.10219
https://github.com/spyysalo/jnlpba
https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/
The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.