

Improve Orphanet disease description knowledge through automated phenotype recognition using scraping toolkits.

Representatives: David Lagorce, Marc Hanauer


Orphanet INSERM US14 - Elixir FR - Excelerate WP8 Rare diseases


Background information

Orphanet is a website dedicated to rare diseases. It provides several kinds of information, such as nomenclature, classifications, textual information and disorder/gene relations, as well as dedicated resources in the field (expert centres, diagnostic tests, clinical trials, orphan drugs, registries and biobanks, support groups, etc.) for more than 40 countries. The site has a large audience, around 1 million unique visitors per month, and is available in 8 languages.


The database content is linked to the Orphanet nomenclature. We produce or aggregate textual information that is expert-reviewed and manually curated. The clinical description of each disease is made using HPO (Human Phenotype Ontology) terms.


Orphanet also produces the Orphanet Rare Disease Ontology. Each disease concept has a unique, stable identifier (ORPHAcode) which can be used to identify diseases in health information systems. The ORPHAcode has been integrated into health information systems in several countries.


Expected outcomes

Orphanet provides more than 60,000 expert-curated disease/HPO annotations covering around 3,000 described diseases. Using several toolkits, this proposal aims to:

  1. Speed up the process of disease-HPO annotation by applying text-mining recognition to Orphanet textual information and/or PubMed publications. We will try to extract HPO terms from publications automatically.
  2. Improve the curation process by comparing the automated recognition with the annotations already provided by Orphanet (detecting missing terms, etc.).

To this end, we propose a dedicated pipeline that text-mines data from our database and/or from other sources (URLs, text files) in order to extract HPO terms.
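The comparison step in goal 2 can be sketched as a set comparison between the HPO identifiers already curated for a disease and those returned by an automated recognizer. This is only a minimal illustration: the function name `compare_annotations`, the result keys and the sample identifiers are assumptions made for this example, not part of any existing Orphanet tool or dataset.

```python
def compare_annotations(curated, recognized):
    """Compare curated and automatically recognized HPO IDs for one disease.

    Both arguments are iterables of HPO identifiers such as "HP:0001250".
    The names and the output layout are hypothetical, chosen for illustration.
    """
    curated, recognized = set(curated), set(recognized)
    return {
        # terms found by both the curators and the text-mining step
        "confirmed": sorted(curated & recognized),
        # terms found only by text mining: candidates for curator review
        "candidate_missing": sorted(recognized - curated),
        # curated terms the text-mining step did not find in the text
        "not_recognized": sorted(curated - recognized),
    }


# Illustrative identifiers only; not taken from real Orphanet annotations.
curated_ids = ["HP:0001250", "HP:0001249"]
recognized_ids = ["HP:0001249", "HP:0000252"]
report = compare_annotations(curated_ids, recognized_ids)
print(report["candidate_missing"])
```

Terms listed under `candidate_missing` would still need expert validation before being added, since automated recognition can pick up phenotypes that are mentioned in a text but not characteristic of the disease.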

Approaches to reach goals

  1. Phenopacket-Scrapper

Extracts information from life-science websites and texts, generating phenopackets that contain the extracted information together with the correct external ontology references.

  2. Phenomics-hippo

A search browser written in React JS that provides a user interface for the Phenomics backend services (Phantom).

  3. MER

MER is a Named-Entity Recognition tool which, given any lexicon and any input text, returns the list of terms recognized in the text, including their exact location (annotations), and links the recognized entities to a given ontology.

  4. IHP

A framework for identifying Human Phenotype entities.
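The lexicon-based recognition described for MER (a lexicon and an input text go in, recognized terms with their positions and ontology links come out) can be illustrated with a small sketch. This is not MER's implementation or interface; it is a simplified regular-expression matcher, and the function names, the tiny lexicon and the sample sentence are assumptions made for this example.

```python
import re


def build_pattern(labels):
    """Compile one case-insensitive pattern matching any label as a whole word."""
    # Longest labels first, so multi-word terms win over shorter overlapping ones.
    ordered = sorted(labels, key=len, reverse=True)
    alternatives = "|".join(re.escape(label) for label in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def recognize(text, lexicon):
    """Return (start, end, matched_text, hpo_id) for each lexicon term in text.

    `lexicon` maps a term label to an ontology identifier (hypothetical layout).
    """
    lookup = {label.lower(): hpo_id for label, hpo_id in lexicon.items()}
    pattern = build_pattern(lookup)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# A three-entry lexicon for illustration; a real run would load all HPO labels.
lexicon = {
    "seizure": "HP:0001250",
    "microcephaly": "HP:0000252",
    "intellectual disability": "HP:0001249",
}
text = "Patients present with microcephaly and intellectual disability."
for start, end, term, hpo_id in recognize(text, lexicon):
    print(start, end, term, hpo_id)
```

Exact matching of this kind misses synonyms, plurals and spelling variants, which is one reason to run dedicated tools such as those listed above against the full ontology rather than a hand-written matcher.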

Expected audience

Programmers and developers familiar with Python, ReactJS, Web APIs, and the XML, JSON and RDF/OWL file formats.

Expected hacking days: 4 days

Related works and references

GitHub or any other public repositories of your FOSS products (if any)

Link to the hacking project


  • Toyofumi Fujiwara
  • David Salgado