Join the chat at https://gitter.im/bh2018paris/14-Orphanet-PhenoScrapper

Improve Orphanet disease description knowledge through automated phenotype recognition using scraping toolkits.

Representatives: David Lagorce, Marc Hanauer

Community


Orphanet INSERM US14 - Elixir FR - Excelerate WP8 Rare diseases

Leads


Background information


Orphanet is a website dedicated to rare diseases. It provides several kinds of information, such as nomenclature, classifications, textual information, and disorder/gene relations, as well as dedicated resources in the field (expert centres, diagnostic tests, clinical trials, orphan drugs, registries and biobanks, support groups, etc.) for more than 40 countries. The site has a large audience, around 1 million unique visitors per month, and is available in 8 languages.

Orphanet

The database content is linked to the Orphanet nomenclature. We produce or aggregate textual information that is expert-reviewed and manually curated. Clinical descriptions of diseases use HPO (Human Phenotype Ontology) terms.

Orphanet_Map

Orphanet also produces the Orphanet Rare Disease Ontology (ORDO). Each disease concept has a unique, stable identifier (Orphacode), which can be used to identify diseases in health information systems. The Orphacode has been integrated in several countries.

Codification

Expected outcomes


Orphanet provides more than 60,000 disease-HPO annotations, curated by experts, for around 3,000 described diseases. Using several toolkits, this proposal aims to:

  1. Speed up disease-HPO annotation by applying text-mining recognition to Orphanet textual information and/or PubMed publications. We will try to extract HPO terms from publications automatically.
  2. Improve the curation process by comparing the automated recognition with the annotations already provided by Orphanet (detecting missing terms, etc.).

To this end, we propose a dedicated pipeline that text-mines data from our database and/or from other sources (URLs, text files) in order to extract HPO terms.
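The comparison in goal 2 can be pictured as set arithmetic over HPO term identifiers. The following is a minimal sketch, not part of any toolkit listed here: the function name `compare_annotations`, the bucket names, and the example identifiers are all assumptions chosen for illustration.

```python
from typing import Dict, Set


def compare_annotations(curated: Set[str], mined: Set[str]) -> Dict[str, Set[str]]:
    """Split HPO term IDs for one disease into agreement and disagreement buckets.

    curated: term IDs already annotated by experts (hypothetical input).
    mined:   term IDs recognized automatically in text (hypothetical input).
    """
    return {
        # Terms found by both the curators and the text-mining step.
        "confirmed": curated & mined,
        # Terms found only by text mining: candidates for missing annotations.
        "candidate_missing": mined - curated,
        # Curated terms the text-mining step did not recover.
        "not_recovered": curated - mined,
    }


# Illustrative identifiers only; real input would come from Orphanet data
# and from the output of a recognition tool.
curated = {"HP:0001250", "HP:0001263"}
mined = {"HP:0001250", "HP:0000252"}

report = compare_annotations(curated, mined)
for bucket, terms in report.items():
    print(bucket, sorted(terms))
```

The `candidate_missing` bucket is the one a curator would review first, since automated recognition can produce false positives and should not be merged without expert validation.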

Approaches to reach goals

  1. Phenopacket-Scraper

Extracts information from life-science websites and texts, generating phenopackets with the extracted information and correct external ontology references.

https://github.com/monarch-initiative/phenopacket-scraper-core

https://github.com/monarch-initiative/phenopacket-scraper-webapp

https://github.com/monarch-initiative/phenopacket-scraper-api

  2. Phenomics-hippo

This is a search browser written in React JS that provides a user interface for the Phenomics backend services (Phantom).

https://github.com/KCCG/phenomics-hippo

  3. MER

MER is a Named-Entity Recognition tool which, given any lexicon and any input text, returns the list of terms recognized in the text, including their exact location (annotations), and links the entities to a given ontology.

http://labs.rd.ciencias.ulisboa.pt/mer/

https://github.com/lasigeBioTM/MER
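To make the idea of lexicon-based recognition concrete, here is a minimal Python sketch of dictionary matching that returns each recognized term with its character offsets and a linked identifier. It is not MER's code or interface; the function `annotate`, the tiny lexicon, and its label-to-identifier pairs are assumptions for illustration and should be checked against an actual HPO release.

```python
import re
from typing import Dict, List, Tuple


def annotate(text: str, lexicon: Dict[str, str]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched_text, term_id) for each lexicon label found in text."""
    annotations = []
    for label, term_id in lexicon.items():
        # Case-insensitive match on word boundaries; the label is escaped so it is
        # treated as literal text rather than as a regular expression.
        pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append((match.start(), match.end(), match.group(0), term_id))
    # Sort by position so annotations follow reading order.
    return sorted(annotations)


# Hypothetical two-entry lexicon mapping labels to ontology identifiers.
lexicon = {"seizures": "HP:0001250", "microcephaly": "HP:0000252"}
text = "Patients present with microcephaly and seizures."

for start, end, matched, term_id in annotate(text, lexicon):
    print(start, end, matched, term_id)
```

A real lexicon would hold thousands of labels and synonyms, so exact matching like this misses spelling variants and inflected forms; handling those is where dedicated tools earn their place.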

  4. IHP

Framework for identifying Human Phenotype entities

https://github.com/lasigeBioTM/IHP

Expected audience


Programmers and developers familiar with Python, ReactJS, Web APIs, and the XML, JSON, and RDF/OWL file formats.

Expected hacking days: 4 days

Related works and references


GitHub or any other public repositories of your FOSS products (if any)


https://github.com/Orphanet/Orphadata.org

Link to the hacking project: https://github.com/d-salgado/PhenoMarker

Hackers


  • Toyofumi Fujiwara
  • David Salgado