Improve Orphanet disease description knowledge through automated phenotype recognition using scraping toolkits.
Representatives: David Lagorce, Marc Hanauer
Orphanet INSERM US14 - Elixir FR - Excelerate WP8 Rare diseases
Orphanet is a website dedicated to rare diseases. It provides several kinds of information, such as nomenclature, classifications, textual information and disorder/gene relations, as well as dedicated resources in the field (expert centres, diagnostic tests, clinical trials, orphan drugs, registries and biobanks, support groups, etc.) for more than 40 countries. The site has a large audience, around 1 million unique visitors per month, and is available in 8 languages.
The database content is linked to the Orphanet nomenclature. We produce or aggregate textual information that is expert-reviewed and manually curated. The clinical description of each disease uses HPO (Human Phenotype Ontology) terms.
Orphanet also produces the Orphanet Rare Disease Ontology. Each disease concept has a unique, stable identifier (Orphacode) that can be used to identify diseases in health information systems. The Orphacode has been integrated in several countries.
Orphanet provides more than 60,000 disease/HPO annotations, curated by experts, for around 3,000 described diseases. Using several toolkits, this proposal aims to:
- Speed up the disease-HPO annotation process by applying text-mining recognition to Orphanet textual information and/or PubMed publications, extracting HPO terms automatically.
- Improve the curation process by comparing the automated recognition with the annotations already provided by Orphanet (e.g. detecting missing terms).
To this end, we propose a dedicated pipeline that text-mines data from our database and/or from other sources (URLs, text files) in order to extract HPO terms.
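The comparison step in the second goal can be illustrated with a minimal sketch. It assumes that both the curated Orphanet annotations and the text-mined results are available as sets of HPO identifiers for one disease; the function name `compare_annotations` and the example identifiers are hypothetical and not part of any existing Orphanet tool.

```python
from typing import Dict, Set


def compare_annotations(curated: Set[str], recognized: Set[str]) -> Dict[str, Set[str]]:
    """Compare curated HPO terms with automatically recognized ones.

    Hypothetical helper: both inputs are sets of HPO identifiers
    for a single disease.
    """
    return {
        # Terms found by both the curators and the text-mining step.
        "confirmed": curated & recognized,
        # Terms found only by text mining: candidates for curator review.
        "candidate_missing": recognized - curated,
        # Curated terms the text-mining step did not find in the text.
        "not_recognized": curated - recognized,
    }


# Illustrative identifiers only, not real Orphanet data.
curated_terms = {"HP:0001250", "HP:0000252"}
recognized_terms = {"HP:0001250", "HP:0004322"}
report = compare_annotations(curated_terms, recognized_terms)
```

Terms in `candidate_missing` would still need expert validation before being added, since automated recognition can pick up negated or unrelated mentions.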
Approaches to reach goals
Extracts information from life-science websites and texts, and generates phenopackets containing the extracted information with correct external ontology references.
A search browser written in ReactJS that provides a user interface for the Phenomics backend services (Phantom).
MER is a Named-Entity Recognition tool which, given any lexicon and any input text, returns the list of terms recognized in the text, including their exact location (annotations), and links the entities to a given ontology.
Framework for identifying Human Phenotype entities
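The lexicon-based recognition described for MER above can be sketched in a few lines. This is not MER's own code or interface, only a minimal illustration of the idea under stated assumptions: the lexicon is a small hypothetical dictionary mapping lower-case labels to HPO identifiers, and matching is a case-insensitive whole-word search. The names `LEXICON`, `Annotation` and `annotate` are introduced here for illustration.

```python
import re
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    start: int    # offset of the first character of the match
    end: int      # offset just past the last character of the match
    text: str     # matched text as it appears in the input
    term_id: str  # ontology identifier linked to the matched label


# Hypothetical toy lexicon: lower-case label -> HPO identifier.
LEXICON: Dict[str, str] = {
    "seizure": "HP:0001250",
    "short stature": "HP:0004322",
    "microcephaly": "HP:0000252",
}


def annotate(text: str, lexicon: Dict[str, str]) -> List[Annotation]:
    """Return every lexicon label found in text, with its location."""
    # Try longer labels first so multi-word terms are preferred.
    labels = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    return [
        Annotation(m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sample = "Patients present with microcephaly and short stature."
annotations = annotate(sample, LEXICON)
```

Each returned annotation carries the character offsets of the match, so `sample[a.start:a.end]` gives back the matched text. A real pipeline would also need to handle synonyms, plurals and negation, which this sketch ignores.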
Programmers and developers: Python, ReactJS, Web APIs; XML, JSON and RDF/OWL file formats.
Expected hacking days: 4 days
Related works and references
GitHub or any other public repositories of your FOSS products (if any)
Link to the hacking project https://github.com/d-salgado/PhenoMarker
- Toyofumi Fujiwara
- David Salgado