## Point prevalence, prevalence at birth, lifetime prevalence, annual incidence, number of cases and/or families

http://www.orphadata.org/data/xml/en_product2_prev.xml

http://www.orphadata.org/cgi-bin/docs/userguide2014.pdf

DisorderType: can be either Disease, Clinical syndrome, Malformation syndrome,
Biological anomaly, Morphological anomaly, Group of phenomes, Etiological subtype,
Clinical subtype, Histopathological subtype or Particular clinical situation in a disease
or syndrome.

PrevalenceList count: total number of epidemiological records (`Prevalence` elements) for a given entry.

PrevalenceType: can be either “Point prevalence”, “birth prevalence”, “lifelong
prevalence”, “incidence”, “cases/families”.

PrevalenceQualification: can be either “Value and Class”, “Class only”, “Case” or “Family”.

PrevalenceClass: estimated prevalence of a given entry. There are eight possible values:
\>1 / 1,000, 1-5 / 10,000, 6-9 / 10,000, 1-9 / 100,000, 1-9 / 1,000,000, <1 / 1,000,000, Not yet documented and Unknown.

ValMoy: Mean value of a given prevalence type. By default, the mean value is 0.0 when only a class is documented.
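The `PrevalenceClass` labels encode numeric ranges, which can be useful when `ValMoy` carries no mean value. The following is a minimal sketch of turning a label into lower and upper bounds; the helper name `class_to_bounds` is hypothetical, and it assumes labels follow the `a-b / N` pattern listed above, with either commas or spaces as thousands separators.

```python
import re

# Matches labels such as "1-5 / 10,000", "<1 / 1 000 000" or ">1 / 1,000".
_CLASS_RE = re.compile(r"\s*([<>]?)\s*(\d+)(?:\s*-\s*(\d+))?\s*/\s*(\d[\d ,]*)\s*")

def class_to_bounds(label):
    """Return (low, high) as population fractions, or None for non-numeric classes.

    An open upper bound (">" labels) is reported as None.
    """
    match = _CLASS_RE.fullmatch(label)
    if match is None:
        return None  # e.g. "Unknown" or "Not yet documented"
    sign, low, high, denom = match.groups()
    denominator = int(re.sub(r"[ ,]", "", denom))
    low = int(low)
    if sign == "<":
        return (0.0, low / denominator)
    if sign == ">":
        return (low / denominator, None)
    high = int(high) if high is not None else low
    return (low / denominator, high / denominator)

for label in ["1-5 / 10,000", "<1 / 1 000 000", ">1 / 1,000", "Unknown"]:
    print(label, "->", class_to_bounds(label))
```

Returning `None` for the non-numeric classes keeps them distinguishable from a documented range of zero.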

PrevalenceGeographic: geographic area of a given prevalence type.

Source: source of information of a given prevalence type.

PrevalenceValidationStatus: can be either Validated or Not yet validated.
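To make the field list concrete, here is a small sketch that reads a single `Prevalence` element. The XML fragment is invented for illustration (its element names follow the paths used by the parsing cells in this notebook, but the values are not taken from the Orphadata file), and the helpers `text_or_none` and `mean_value` are hypothetical.

```python
import xml.etree.ElementTree as et

# Hypothetical fragment: structure mirrors the element paths used in this
# notebook; the values are made up.
SAMPLE = """
<Prevalence id="1">
  <Source>EXAMPLE[PMID]</Source>
  <PrevalenceType><Name lang="en">Point prevalence</Name></PrevalenceType>
  <PrevalenceQualification><Name lang="en">Class only</Name></PrevalenceQualification>
  <PrevalenceClass><Name lang="en">&lt;1 / 1 000 000</Name></PrevalenceClass>
  <ValMoy>0.0</ValMoy>
  <PrevalenceGeographic><Name lang="en">Worldwide</Name></PrevalenceGeographic>
  <PrevalenceValidationStatus><Name lang="en">Validated</Name></PrevalenceValidationStatus>
</Prevalence>
"""

def text_or_none(node, path):
    """Return the text at `path` under `node`, or None if the element is absent."""
    found = node.find(path)
    return found.text if found is not None else None

def mean_value(node):
    """Treat the default ValMoy of 0.0 as 'no mean value documented'."""
    raw = text_or_none(node, "ValMoy")
    if raw is None:
        return None
    value = float(raw)
    return value if value != 0.0 else None

prev = et.fromstring(SAMPLE)
record = {
    "prevalence_type": text_or_none(prev, "PrevalenceType/Name"),
    "prevalence_class": text_or_none(prev, "PrevalenceClass/Name"),
    "mean_value": mean_value(prev),
}
print(record)
```

Looking up optional elements through a helper avoids an `AttributeError` when a node such as `PrevalenceClass` is missing.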

In [1]:
import xml.etree.ElementTree as et
from collections import defaultdict

In [2]:
tree = et.parse('/home/gstupp/projects/biothings/mydisease/mydisease/data/en_product2_prev.xml')
root = tree.getroot()

In [3]:
# Nested mapping: Orphanet ID -> field name -> list of records
d = defaultdict(lambda: defaultdict(list))
for disease in root.find("DisorderList"):
    orpha = "orphanet:" + disease.find("OrphaNumber").text
    for prev in disease.findall("PrevalenceList/Prevalence"):
        valmoy = prev.find("ValMoy").text
        prev_d = {
            'source': prev.find("Source").text,
            'prevalence_type': prev.find("PrevalenceType/Name").text,
            'prevalence_qualification': prev.find("PrevalenceQualification/Name").text,
            'prevalence_geographic': prev.find("PrevalenceGeographic/Name").text,
            'prevalence_validation_status': prev.find("PrevalenceValidationStatus/Name").text,
            # ValMoy defaults to 0.0 when only a class is documented
            'mean_value': float(valmoy) if valmoy != '0.0' else None,
        }
        # PrevalenceClass is optional
        prev_class = prev.find("PrevalenceClass/Name")
        if prev_class is not None:
            prev_d['prevalence_class'] = prev_class.text
        d[orpha]['prevalence'].append(prev_d)

In [4]:
d['orphanet:166024']

defaultdict(list,
            {'prevalence': [{'mean_value': 4.0,
               'prevalence_geographic': 'Worldwide',
               'prevalence_qualification': 'Case',
               'prevalence_type': 'Cases/families',
               'prevalence_validation_status': 'Validated',
               'source': '11389160[PMID]_9689990[PMID]_ [EXPERT]'},
              {'mean_value': None,
               'prevalence_class': '<1 / 1 000 000',
               'prevalence_geographic': 'Worldwide',
               'prevalence_qualification': 'Class only',
               'prevalence_type': 'Point prevalence',
               'prevalence_validation_status': 'Validated',
               'source': 'ORPHANET_11389160[PMID]_9689990[PMID]'}]})

## Type of inheritance, average age of onset and average age of death

http://www.orphadata.org/data/xml/en_product2_ages.xml


AverageAgeOfOnset: classes based on the estimated average age of onset for a given entry.
The listed population age groups are: Antenatal, Neonatal, Infancy,
Childhood, Adolescence, Adult, Elderly, All ages and No data available.

AverageAgeOfDeath: classes based on the estimated average age at death for a
given entry. There are twelve different population age groups: Embryofoetal, Stillbirth,
Infantile, Early Childhood, Late Childhood, Adolescent, Young adult, Adult, Elderly,
Any age, Normal life expectancy and No data available.

TypeOfInheritance: type(s) of inheritance associated with a given disease. There are
thirteen different types of inheritance: Autosomal dominant, Autosomal recessive,
X-linked dominant, X-linked recessive, Chromosomal, Mitochondrial inheritance,
Multigenic/multifactorial, Oligogenic, Semi-dominant, Y-linked, No data available, Not
applicable and Not yet documented.
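Because these fields use a closed vocabulary, parsed values can be checked against the documented labels. The sketch below does this for the thirteen inheritance types; the helper `unexpected_values` is hypothetical, and the exact spelling of the labels in the XML file is an assumption that may differ from the wording used here.

```python
# The thirteen inheritance types as worded in the description above.
# Assumption: the XML may spell some of these differently.
INHERITANCE_TYPES = {
    "Autosomal dominant",
    "Autosomal recessive",
    "X-linked dominant",
    "X-linked recessive",
    "Chromosomal",
    "Mitochondrial inheritance",
    "Multigenic/multifactorial",
    "Oligogenic",
    "Semi-dominant",
    "Y-linked",
    "No data available",
    "Not applicable",
    "Not yet documented",
}

def unexpected_values(values, allowed):
    """Return the sorted values that are not part of the allowed vocabulary."""
    return sorted(set(values) - allowed)

print(unexpected_values(["Autosomal recessive"], INHERITANCE_TYPES))
print(unexpected_values(["Autosomal recessive", "X-linked"], INHERITANCE_TYPES))
```

Running such a check over a whole file is a cheap way to spot label drift between the user guide and the data.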

In [5]:
tree = et.parse('/home/gstupp/projects/biothings/mydisease/mydisease/data/en_product2_ages.xml')
root = tree.getroot()

In [6]:
for disease in root.find("DisorderList"):    
    orpha = "orphanet:" + disease.find("OrphaNumber").text
    aoo = [x.find("Name").text for x in disease.findall("AverageAgeOfOnsetList/AverageAgeOfOnset")]
    aod = [x.find("Name").text for x in disease.findall("AverageAgeOfDeathList/AverageAgeOfDeath")]
    toi = [x.find("Name").text for x in disease.findall("TypeOfInheritanceList/TypeOfInheritance")]
    ages_d = {'ave_age_of_onset': aoo, 'ave_age_of_death': aod, 'type_of_inheritance': toi}
    ages_d = {k:v for k,v in ages_d.items() if v}
    d[orpha].update(ages_d)

In [7]:
d['orphanet:166024']

defaultdict(list,
            {'ave_age_of_onset': ['Infancy', 'Neonatal'],
             'prevalence': [{'mean_value': 4.0,
               'prevalence_geographic': 'Worldwide',
               'prevalence_qualification': 'Case',
               'prevalence_type': 'Cases/families',
               'prevalence_validation_status': 'Validated',
               'source': '11389160[PMID]_9689990[PMID]_ [EXPERT]'},
              {'mean_value': None,
               'prevalence_class': '<1 / 1 000 000',
               'prevalence_geographic': 'Worldwide',
               'prevalence_qualification': 'Class only',
               'prevalence_type': 'Point prevalence',
               'prevalence_validation_status': 'Validated',
               'source': 'ORPHANET_11389160[PMID]_9689990[PMID]'}],
             'type_of_inheritance': ['Autosomal recessive']})

## Phenotypes associated with rare disorders

http://www.orphadata.org/cgi-bin/inc/product4.inc.php

http://www.orphadata.org/data/xml/en_product4_HPO.xml

Frequencies:
- Obligate: the phenotype is always present and the diagnosis cannot be made in its absence;
- Very frequent: the phenotype is present in 80 to 99% of the patient population;
- Frequent: the phenotype is present in 30 to 79% of the patient population;
- Occasional: the phenotype is present in 5 to 29% of the patient population;
- Very rare: the phenotype is present in 1 to 4% of the patient population;
- Excluded: the phenotype is always absent AND is an exclusion criterion for diagnosing the disorder.
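In the XML, frequency labels carry their range in parentheses, for example `Frequent (79-30%)`. A minimal sketch of splitting such a label into a name and numeric bounds follows; `parse_frequency` is a hypothetical helper, and labels without a parenthesised range are assumed to be returned unchanged with no bounds.

```python
import re

# Matches "<name> (<a>-<b>%)", e.g. "Very frequent (99-80%)".
_FREQ_RE = re.compile(r"(.*?)\s*\((\d+)\s*-\s*(\d+)\s*%\)")

def parse_frequency(label):
    """Return (name, low_percent, high_percent); bounds are None if no range is present."""
    match = _FREQ_RE.fullmatch(label.strip())
    if match is None:
        return (label.strip(), None, None)
    name, first, second = match.group(1), int(match.group(2)), int(match.group(3))
    # The labels list the higher bound first, so order the pair explicitly.
    return (name, min(first, second), max(first, second))

print(parse_frequency("Very frequent (99-80%)"))
print(parse_frequency("Frequent (79-30%)"))
```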

Diagnostic criterion: A diagnostic criterion is a phenotypic abnormality used consensually to
assess the diagnosis of a disorder. Multiple sets of diagnostic criteria are necessary to
achieve the diagnosis. Orphanet indicates only diagnostic criteria that are consensually
accepted by the experts of the medical domain AND published in the medical literature.
Depending on the medical consensus, they could be further qualified as minor, major,
etc. This level of precision is not yet recorded in the Orphanet dataset.

Pathognomonic sign: A pathognomonic phenotype is a feature sufficient by itself to establish
definitively and beyond any doubt the diagnosis of the disease concerned (e.g. heliotrope
erythema for dermatomyositis).

Files are available in 7 different languages.

<HPODisorderAssociation id="10225">
  <HPO id="166">
    <HPOId>HP:0001945</HPOId>
    <HPOTerm>Fever</HPOTerm>
  </HPO>
  <HPOFrequency id="28419">
    <OrphaNumber>453312</OrphaNumber>
    <Name lang="en">Frequent (79-30%)</Name>
  </HPOFrequency>
  <DiagnosticCriteria id="28447">
    <OrphaNumber>453316</OrphaNumber>
    <Name lang="en">Pathognomonic sign</Name>
  </DiagnosticCriteria>
</HPODisorderAssociation>
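The association element above can be read directly with `ElementTree`. This sketch parses that same fragment in isolation and builds the kind of dictionary the notebook stores per phenotype; treating `DiagnosticCriteria` as optional is an assumption based on how the parsing cell handles it.

```python
import xml.etree.ElementTree as et

ASSOCIATION_XML = """
<HPODisorderAssociation id="10225">
  <HPO id="166">
    <HPOId>HP:0001945</HPOId>
    <HPOTerm>Fever</HPOTerm>
  </HPO>
  <HPOFrequency id="28419">
    <OrphaNumber>453312</OrphaNumber>
    <Name lang="en">Frequent (79-30%)</Name>
  </HPOFrequency>
  <DiagnosticCriteria id="28447">
    <OrphaNumber>453316</OrphaNumber>
    <Name lang="en">Pathognomonic sign</Name>
  </DiagnosticCriteria>
</HPODisorderAssociation>
"""

assoc = et.fromstring(ASSOCIATION_XML)
pheno = {
    "phenotype_id": assoc.find("HPO/HPOId").text.lower(),
    "phenotype_name": assoc.find("HPO/HPOTerm").text,
    "frequency": assoc.find("HPOFrequency/Name").text,
}
# DiagnosticCriteria is only added when the element is present.
criteria = assoc.find("DiagnosticCriteria/Name")
if criteria is not None:
    pheno["diagnostic_criteria"] = criteria.text

print(pheno)
```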

In [8]:
tree = et.parse('/home/gstupp/projects/biothings/mydisease/mydisease/data/en_product4_HPO.xml')
root = tree.getroot()

In [9]:
for disease in root.find("DisorderList"):
    orpha = "orphanet:" + disease.find("OrphaNumber").text
    associations = disease.findall("HPODisorderAssociationList/HPODisorderAssociation")
    for assoc in associations:
        hpo_id = assoc.find("HPO/HPOId").text
        hpo_name = assoc.find("HPO/HPOTerm").text
        frequency = assoc.find("HPOFrequency/Name").text
        # HPO IDs are stored lower-cased, e.g. 'hp:0000256'
        pheno_d = {'phenotype_id': hpo_id.lower(), 'phenotype_name': hpo_name,
                   'frequency': frequency}
        # DiagnosticCriteria is optional
        criteria = assoc.find("DiagnosticCriteria/Name")
        if criteria is not None:
            pheno_d['diagnostic_criteria'] = criteria.text
        d[orpha]['phenotypes'].append(pheno_d)

In [10]:
d['orphanet:166024']

defaultdict(list,
            {'ave_age_of_onset': ['Infancy', 'Neonatal'],
             'phenotypes': [{'frequency': 'Very frequent (99-80%)',
               'phenotype_id': 'hp:0000256',
               'phenotype_name': 'Macrocephaly'},
              {'frequency': 'Very frequent (99-80%)',
               'phenotype_id': 'hp:0000272',
               'phenotype_name': 'Malar flattening'},
              {'frequency': 'Very frequent (99-80%)',
               'phenotype_id': 'hp:0000316',
               'phenotype_name': 'Hypertelorism'},
              {'frequency': 'Very frequent (99-80%)',
               'phenotype_id': 'hp:0000369',
               'phenotype_name': 'Low-set ears'},
              {'frequency': 'Very frequent (99-80%)',
               'phenotype_id': 'hp:0000470',
               'phenotype_name': 'Short neck'},
              {'frequency': 'Very frequent (99-80%)',
               'phenotype_id': 'hp:0000767',
               'phenotype_name': 'Pectus excavatum'},
         

## Rare diseases with their associated genes

http://www.orphadata.org/cgi-bin/inc/product6.inc.php

http://www.orphadata.org/data/xml/en_product6.xml

DisorderList count: total number of disorders, group of disorders and subtypes in the XML file.

OrphaNumber: unique identifying number assigned by Orphanet to a given entry (disorder, group of disorders, subtype or gene).

Name: preferred name of a given entry (disorder, group of disorders, subtype or gene).

GeneList count: number of genes associated with a given entry.

Symbol: official HGNC-approved gene symbol.

Synonym list: list of synonyms for a given gene, including past symbols.

GeneType: can be either gene with protein product, locus or non-coding RNA.

GeneLocus: gene chromosomal location.

DisorderGeneAssociationType: gene-disease relationship. It can be one of Role in the phenotype of, Disease-causing germline mutation(s) (loss of function) in, Disease-causing germline mutation(s) (gain of function) in, Disease-causing somatic mutation(s) in, Modifying somatic mutation in, Part of a fusion gene in, Major susceptibility factor in or Candidate gene tested in.

DisorderGeneAssociationStatus: can be either Validated or Not validated.

External Reference List: list of references in HGNC, OMIM, GenAtlas, UniProtKB, Ensembl, Reactome and IUPHAR associated with a given gene.

Source: HGNC, OMIM, GenAtlas or UniProtKB.

Reference: listed reference for a given source associated with a gene.
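Once gene associations are collected per disorder, a reverse index from gene symbol to disorders can be handy for lookups. The sketch below assumes records shaped like the `disease_gene_associations` entries this notebook builds; the helper `genes_to_disorders` and the placeholder ID `orphanet:0000000` are hypothetical.

```python
from collections import defaultdict

# Records shaped like the notebook's per-disorder dictionaries. The KIF7 entry
# mirrors the example disorder used in this notebook, trimmed to two fields.
records = {
    "orphanet:166024": {
        "disease_gene_associations": [
            {"gene_symbol": "KIF7",
             "dga_type": "Disease-causing germline mutation(s) in"},
        ]
    },
    # Hypothetical placeholder: a disorder without any gene association.
    "orphanet:0000000": {},
}

def genes_to_disorders(records):
    """Map each gene symbol to the list of Orphanet IDs it is associated with."""
    index = defaultdict(list)
    for orpha_id, record in records.items():
        for dga in record.get("disease_gene_associations", []):
            index[dga["gene_symbol"]].append(orpha_id)
    return dict(index)

gene_index = genes_to_disorders(records)
print(gene_index)
```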

In [11]:
tree = et.parse('/home/gstupp/projects/biothings/mydisease/mydisease/data/en_product6.xml')
root = tree.getroot()

In [12]:
gene_d = {}                # gene XML id -> synonyms, gene type and loci
dga_d = defaultdict(list)  # Orphanet ID -> disorder-gene associations
for disease in root.find("DisorderList"):
    orpha = "orphanet:" + disease.find("OrphaNumber").text
    # First pass: index every gene of this disorder by its XML id attribute
    for gene in disease.findall("GeneList/Gene"):
        synonyms = [x.text for x in gene.findall("SynonymList/Synonym")]
        gene_type = gene.find("GeneType/Name").text
        loci = [x.find("GeneLocus").text for x in gene.findall("LocusList/Locus")]
        gene_d[gene.attrib['id']] = {'synonyms': synonyms, 'gene_type': gene_type,
                                     'loci': loci}
    # Second pass: attach the indexed gene details to each association
    for dga in disease.findall("DisorderGeneAssociationList/DisorderGeneAssociation"):
        gene = dga.find("Gene")
        gene_info = gene_d[gene.attrib['id']]
        this_dga = {'gene_name': gene.find("Name").text,
                    'gene_symbol': gene.find("Symbol").text,
                    'dga_type': dga.find("DisorderGeneAssociationType/Name").text,
                    'dga_status': dga.find("DisorderGeneAssociationStatus/Name").text,
                    'gene_type': gene_info['gene_type'],
                    'loci': gene_info['loci']}
        dga_d[orpha].append(this_dga)
        d[orpha]['disease_gene_associations'].append(this_dga)

In [13]:
# Convert the nested defaultdicts to plain dicts and store each Orphanet ID as the document _id
d = {k: dict(v) for k, v in d.items()}
for k, v in d.items():
    v['_id'] = k
dlist = list(d.values())

In [14]:
d['orphanet:166024']

{'_id': 'orphanet:166024',
 'ave_age_of_onset': ['Infancy', 'Neonatal'],
 'disease_gene_associations': [{'dga_status': 'Assessed',
   'dga_type': 'Disease-causing germline mutation(s) in',
   'gene_name': 'kinesin family member 7',
   'gene_symbol': 'KIF7',
   'gene_type': 'gene with protein product',
   'loci': ['15q26.1']}],
 'phenotypes': [{'frequency': 'Very frequent (99-80%)',
   'phenotype_id': 'hp:0000256',
   'phenotype_name': 'Macrocephaly'},
  {'frequency': 'Very frequent (99-80%)',
   'phenotype_id': 'hp:0000272',
   'phenotype_name': 'Malar flattening'},
  {'frequency': 'Very frequent (99-80%)',
   'phenotype_id': 'hp:0000316',
   'phenotype_name': 'Hypertelorism'},
  {'frequency': 'Very frequent (99-80%)',
   'phenotype_id': 'hp:0000369',
   'phenotype_name': 'Low-set ears'},
  {'frequency': 'Very frequent (99-80%)',
   'phenotype_id': 'hp:0000470',
   'phenotype_name': 'Short neck'},
  {'frequency': 'Very frequent (99-80%)',
   'phenotype_id': 'hp:0000767',
   'phenotype_

### mongo

In [20]:
from pymongo import MongoClient
client = MongoClient()
# Despite its name, `db` is the `orphanet` collection of the `mydisease` database
db = client.mydisease.orphanet
db.find_one('orphanet:166024')

{'_id': 'orphanet:166024',
 'alternative_term': ['Multiple epiphyseal dysplasia-macrocephaly-distinctive facies syndrome'],
 'definition': 'Multiple epiphyseal dysplasia, Al-Gazali type is a skeletal dysplasia characterized by multiple epiphyseal dysplasia (see this term), macrocephaly and facial dysmorphism.',
 'definition_citation': 'orphanet',
 'definitions': 'Multiple epiphyseal dysplasia, Al-Gazali type is a skeletal dysplasia characterized by multiple epiphyseal dysplasia (see this term), macrocephaly and facial dysmorphism.',
 'mapping': {'E': ['omim:607131'], 'NTBT': ['icd10cm:Q77.3']},
 'parents': ['orphanet:377788'],
 'part_of': ['orphanet:251'],
 'preferred_label': 'Multiple epiphyseal dysplasia, Al-Gazali type',
 'synonyms': ['Multiple epiphyseal dysplasia-macrocephaly-distinctive facies syndrome'],
 'tree_view': ['orphanet:251'],
 'xref': {'omim': ['607131']}}

In [21]:
for dd in dlist:
    db.update_one({'_id':dd['_id']}, {'$set': dd}, upsert=True)
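A variation on the loop above, sketched without a live database connection: the filter and the `$set` payload are built separately so that `_id` appears only in the filter, which already pins it. The helper `build_upsert` is hypothetical; the pair it returns is meant to be passed on as `update_one(filter, update, upsert=True)`.

```python
def build_upsert(doc):
    """Split a document into an (_id filter, $set update) pair for an upsert."""
    filter_ = {"_id": doc["_id"]}
    # Every field except _id goes into $set, so existing fields written by
    # other loaders are merged with, not replaced by, the new ones.
    fields = {k: v for k, v in doc.items() if k != "_id"}
    return filter_, {"$set": fields}

example = {"_id": "orphanet:166024", "ave_age_of_onset": ["Infancy", "Neonatal"]}
filter_, update = build_upsert(example)
print(filter_)
print(update)
```

Because `$set` only touches the named fields, fields already present in the stored document (such as `definition` or `mapping`) are left in place, which matches the merged result shown in the next cell.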

In [22]:
db.find_one('orphanet:166024')

{'_id': 'orphanet:166024',
 'alternative_term': ['Multiple epiphyseal dysplasia-macrocephaly-distinctive facies syndrome'],
 'ave_age_of_onset': ['Infancy', 'Neonatal'],
 'definition': 'Multiple epiphyseal dysplasia, Al-Gazali type is a skeletal dysplasia characterized by multiple epiphyseal dysplasia (see this term), macrocephaly and facial dysmorphism.',
 'definition_citation': 'orphanet',
 'definitions': 'Multiple epiphyseal dysplasia, Al-Gazali type is a skeletal dysplasia characterized by multiple epiphyseal dysplasia (see this term), macrocephaly and facial dysmorphism.',
 'disease_gene_associations': [{'dga_status': 'Assessed',
   'dga_type': 'Disease-causing germline mutation(s) in',
   'gene_name': 'kinesin family member 7',
   'gene_symbol': 'KIF7',
   'gene_type': 'gene with protein product',
   'loci': ['15q26.1']}],
 'mapping': {'E': ['omim:607131'], 'NTBT': ['icd10cm:Q77.3']},
 'parents': ['orphanet:377788'],
 'part_of': ['orphanet:251'],
 'phenotypes': [{'frequency': 'Ve

In [23]:
db.find_one("orphanet:98306")

{'_id': 'orphanet:98306',
 'alternative_term': ['FPLD'],
 'definition': 'Familial partial lipodystrophy (FPLD) is a group of rare genetic lipodystrophic syndromes characterized, in most cases, by fat loss from the limbs and buttocks, from childhood or early adulthood, and often associated with acanthosis nigricans, insulin resistance, diabetes, hypertriglyceridemia and liver steatosis.',
 'definition_citation': 'orphanet',
 'definitions': 'Familial partial lipodystrophy (FPLD) is a group of rare genetic lipodystrophic syndromes characterized, in most cases, by fat loss from the limbs and buttocks, from childhood or early adulthood, and often associated with acanthosis nigricans, insulin resistance, diabetes, hypertriglyceridemia and liver steatosis.',
 'mapping': {'E': ['mesh:D052496', 'umls_cui:C0271694'],
  'NTBT': ['icd10cm:E88.1']},
 'parents': ['orphanet:98305', 'orphanet:377794'],
 'preferred_label': 'Familial partial lipodystrophy',
 'prevalence': [{'mean_value': 2.0,
   'preval