<H1>Sulfite Oxidase Deficiency SUOX</H1>
<P>Data from <a href="https://pubmed.ncbi.nlm.nih.gov/36303223/" target="_blank">Li JT, et al. (2022) Mutation analysis of SUOX in isolated sulfite oxidase deficiency with ectopia lentis as the presenting feature: insights into genotype-phenotype correlation. Orphanet J Rare Dis. 17(1):392. PMID:36303223</a>.</P>
<P>We transferred information from Additional Files 5, 6, and 7 to two Excel files to parse the data.</P>

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from IPython.display import display, HTML
from pyphetools.creation import *
from pyphetools.visualization import *
from pyphetools.validation import *
import pyphetools
print(f"Using pyphetools version {pyphetools.__version__}")

Using pyphetools version 0.9.8


In [2]:
PMID = "PMID:36303223"
title = "Mutation analysis of SUOX in isolated sulfite oxidase deficiency with ectopia lentis as the presenting feature: insights into genotype-phenotype correlation"
cite = Citation(pmid=PMID, title=title)
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
hpo_ontology = parser.get_ontology()
metadata = MetaData(created_by="ORCID:0000-0003-2598-6622", citation=cite)
metadata.default_versions_with_hpo(version=hpo_version)
print(f"HPO version {hpo_version}")

HPO version 2023-10-09


<H2>SUOX variants</H2>
<P>The file Li-SUOX-Variants.xlsx has one variant per line, assigned to the patient ID.</P>
<p>Note that one of the reported variants is rejected by VariantValidator as erroneous:</p>
<pre>NM_001032386.2:c.1355C>A: Variant reference (C) does not agree with reference sequence (G)</pre>
<p>The variant is shown here as it was reported in the original publication. We changed the reference base from C to G
(<tt>c.1355G>A</tt>), which yields the same amino acid change as reported in the original publication: <tt>NP_001027558.1:p.(G452D)</tt>.</p>
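
<p>The repair can be sanity-checked with codon arithmetic alone. The sketch below assumes standard HGVS coding numbering (c.1 is the A of the start codon) and does not consult the SUOX reference sequence; the helper names are hypothetical. It shows that c.1355 is the second base of codon 452, that every glycine codon carries a G at that position, and that a G>A change there can yield aspartate, which is consistent with p.(G452D).</p>

```python
# Subset of the standard genetic code: glycine codons and their
# second-position G>A derivatives.
CODON_TO_AA = {
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
}
GLYCINE_CODONS = ["GGT", "GGC", "GGA", "GGG"]


def codon_and_offset(cdna_pos):
    """Map a 1-based coding position to (codon number, 1-based position within codon)."""
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3 + 1


def substitute(codon, offset, alt):
    """Return the codon with the base at the 1-based offset replaced by alt."""
    return codon[:offset - 1] + alt + codon[offset:]


codon_number, offset = codon_and_offset(1355)
# Reference bases found at that offset across all glycine codons
ref_bases = {codon[offset - 1] for codon in GLYCINE_CODONS}
# Amino acids reachable from glycine by substituting A at that offset
outcomes = {CODON_TO_AA[substitute(codon, offset, "A")] for codon in GLYCINE_CODONS}
print(codon_number, offset, ref_bases, outcomes)
```

<p>A reference base of C at this position would be incompatible with a glycine codon, so the corrected form <tt>c.1355G>A</tt> is the one that agrees with the reported protein change. Which glycine codon is actually present cannot be determined from this check.</p>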

In [3]:
variant_df = pd.read_excel('input/Li-SUOX-Variants.xlsx', na_values=['n.a.'])
variant_df.head()

Unnamed: 0,Proband ID,ID,Nucleotide,Amino acid,Exon,Domain,Status
0,1,M1,c.433delC,p.Q145Sfs*16,EX6,Cyt-b5 domain,Homo
1,2,M2,c.650G>A,p.R217Q,EX6,Moco domain,Homo
2,3,M3,c.794C>A,p.A265D,EX6,Moco domain,Com het
3,3,M4,c.1280C>A,p.S427*,EX6,Homodimerization domain,
4,4,M5,c.733_736delCTTT,p.L245Pfs*27,EX6,Moco domain,Homo


In [4]:
from collections import defaultdict
hg38 = "hg38"
SUOX_transcript = 'NM_001032386.2'
vvalidator = VariantValidator(genome_build=hg38, transcript=SUOX_transcript)
patient_id_to_variant_list_d = defaultdict(list)
all_variants = set()
for _, row in variant_df.iterrows():
    proband = row['Proband ID']
    individual_id = f"individual_{proband}"
    var = row['Nucleotide']
    if var == "c.1355C>A":
        var = "c.1355G>A" # repair error (see above)
    patient_id_to_variant_list_d[individual_id].append(var)
    all_variants.add(var)
variant_d = {}
for v in all_variants:
    var = vvalidator.encode_hgvs(v)
    variant_d[v] = var
print(f"Extracted information for {len(variant_d)} variants with Variant Validator")

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001032386.2%3Ac.352C>T/NM_001032386.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001032386.2%3Ac.803G>A/NM_001032386.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001032386.2%3Ac.1348T>C/NM_001032386.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001032386.2%3Ac.284_285insC/NM_001032386.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001032386.2%3Ac.1521_1524delTTGT/NM_001032386.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001032386.2%3Ac.649C>G/NM_001032386.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001032386.2%3Ac.400_403delGAGC/N

<H1>Clinical data</H1>

In [5]:
df_clinical = pd.read_excel("input/Li-Suox-Clinical.xlsx")
df_clinical.head()

Unnamed: 0,Proband ID,PMID,Ethnicity,Gender,Parental consanguity,Age at onset (months),Variants,Amino acid,status,Typical type/Mild type,...,Homocys (umol/L) NR: 5-15,Cys (umol/L) NR: 20-70,UA (umol/L) NR: 210-430,Sulfite (mg/L) NR: 0,Thiosulfate NR: 0,Urine SSC (umol/mmolCr) NR: 0.1-10,Urine Taurine (mmol/molCr) NR: 12-150,Urine XA NR: 0-0.46mmol/L or <40umol/mmolCr or <0.29XA/Cr,Urine HypoXA NR: 0-0.18mmol/L or <8umol/mmolCr or <0.5HypoXA/Cr,Urine UA NR: 0.44-4.50mmol/L or 50-980umol/mmolCr
0,1,9050047,EUR,M,No,0,c.(433delC); (433delC),p.(Q145Sfs*16); (Q145Sfs*16),Homo,T,...,n.a.,2,n.a.,20-25,n.a.,320umol/L,95,n.a.,n.a.,n.a.
1,2,9600976,EUR,F,Yes,5,c.(650G>A); (650G>A),p.(R217Q); (R217Q),Homo,T,...,n.a.,n.a.,n.a.,0.108-0.211,0.297-1.632mmol/L,240umol/L,n.a.,0.04mmol/L,0.05mmol/L,0.14mmol/L
2,3,10519592,NAM,M,No,0,c.(794C>A); (1280C>A),p.(A265D); (S427*),Com het,T,...,n.a.,n.a.,normal,80-100,n.a.,690umol/L,n.a.,normal,normal,normal
3,4,12112661,n.a.,n.a.,Yes,n.a.,c.(733_736delCTTT); (733_736delCTTT),p.(L245Pfs*27); (L245Pfs*27),Homo,T,...,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.
4,5,12112661,n.a.,n.a.,Yes,n.a.,c.(284_285insC); (1126C>T),p.(E97*); (R376C),Com het,T,...,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.


In [6]:
df_clinical.columns

Index(['Proband ID', 'PMID', 'Ethnicity', 'Gender', 'Parental consanguity',
       'Age at onset (months)', 'Variants', 'Amino acid', 'status',
       'Typical type/Mild type', 'Death', 'Age of death (months)',
       'Prodromal infection', 'Developmental delay', 'Regression', 'Seizure',
       'Extrapyramidal symptoms', 'Hypertonia', 'Hypotonia', 'Microcephaly',
       'Ectopia lentis', 'Age of diagnosis of ophthalmic manifestations',
       'Proband ID.1', 'Resource (PMID)', 'SSC (umol/L) NR: 0',
       'Taurine (umol/L) NR: 15-145', 'Homocys (umol/L) NR: 5-15',
       'Cys (umol/L) NR: 20-70', 'UA (umol/L) NR: 210-430',
       'Sulfite (mg/L) NR: 0', 'Thiosulfate NR: 0 ',
       'Urine SSC (umol/mmolCr) NR: 0.1-10',
       'Urine Taurine (mmol/molCr) NR: 12-150',
       'Urine XA NR: 0-0.46mmol/L or <40umol/mmolCr or <0.29XA/Cr',
       'Urine HypoXA NR: 0-0.18mmol/L or <8umol/mmolCr or <0.5HypoXA/Cr',
       'Urine UA NR: 0.44-4.50mmol/L or 50-980umol/mmolCr'],
      dtype='object')

In [7]:
column_mapper_d = {}

items = {
    'Developmental delay': ['Neurodevelopmental delay', 'HP:0012758'],
    'Regression': ['Cognitive regression', 'HP:0034332'],
    'Seizure': ['Seizure', 'HP:0001250'],
    'Extrapyramidal symptoms': ['Abnormality of extrapyramidal motor function', 'HP:0002071'],
    'Hypertonia':['Hypertonia','HP:0001276'],
    'Hypotonia': ['Hypotonia','HP:0001252'],
    'Microcephaly':['Microcephaly', 'HP:0000252'],
    'Ectopia lentis':['Ectopia lentis', 'HP:0001083'],
}

item_column_mapper_d = hpo_cr.initialize_simple_column_maps(column_name_to_hpo_label_map=items, 
                                                            observed='+',
                                                            excluded='-')
print(f"We created {len(item_column_mapper_d)} simple column mappers")
# Transfer to column_mapper_d
for k, v in item_column_mapper_d.items():
    column_mapper_d[k] = v

We created 8 simple column mappers


<H2>Threshold mappers</H2>
<p>The data contain information about biochemical abnormalities, reported as measured values alongside reference ranges. We can capture these using threshold mappers.</p>
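
<p>As a rough illustration of the idea, the following plain-Python sketch shows the decision a threshold mapper has to make for each cell. It is not the pyphetools implementation: the function name <tt>classify_by_threshold</tt> is hypothetical, and treating a value equal to the threshold as "not abnormal" is an assumption.</p>

```python
from typing import Optional


def classify_by_threshold(value, threshold: float, call_if_above: bool) -> Optional[bool]:
    """Return True (term observed), False (term excluded) or None (not measured)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Cells such as 'n.a.' or ranges such as '20-25' cannot be compared
        return None
    if number != number:  # NaN, e.g. an empty spreadsheet cell
        return None
    if call_if_above:
        return number > threshold
    return number < threshold


# Values that occur in the clinical table
print(classify_by_threshold(28, threshold=0, call_if_above=True))       # SSC, NR: 0
print(classify_by_threshold(2, threshold=20, call_if_above=False))      # Cys, NR: 20-70
print(classify_by_threshold(95, threshold=150, call_if_above=True))     # Urine taurine, NR: 12-150
print(classify_by_threshold("n.a.", threshold=15, call_if_above=True))  # not measured
```

<p>For terms that describe an elevated level, the threshold is the upper limit of the reference range and <tt>call_if_above</tt> is true; for terms that describe a decreased level, the threshold is the lower limit and <tt>call_if_above</tt> is false.</p>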

<h3>SSC (umol/L) NR: 0</h3>
<p>SSC refers to S-sulfocysteine. The normal value is zero, so any detectable amount is abnormal. The corresponding
HPO term is Elevated circulating S-sulfocysteine concentration HP:0034745.</p>

In [8]:
df_clinical['SSC (umol/L) NR: 0'].unique()

array(['n.a.', 28, 14], dtype=object)

In [9]:
sscMapper = ThresholdedColumnMapper(hpo_id="HP:0034745",
                                    hpo_label="Elevated circulating S-sulfocysteine concentration",
                                    threshold=0,
                                    call_if_above=True)
sscMapper.preview_column(df_clinical['SSC (umol/L) NR: 0'])
column_mapper_d['SSC (umol/L) NR: 0'] = sscMapper

In [10]:
df_clinical['Taurine (umol/L) NR: 15-145'].unique()

array([197, 'n.a.', 46], dtype=object)

In [11]:
# Hypertaurinemia HP:0500181
taurineMapper = ThresholdedColumnMapper(hpo_id="HP:0500181",
                                        hpo_label="Hypertaurinemia",
                                        threshold=145,
                                        call_if_above=True)
taurineMapper.preview_column(df_clinical['Taurine (umol/L) NR: 15-145'])
column_mapper_d['Taurine (umol/L) NR: 15-145'] = taurineMapper

In [12]:
# 'Homocys (umol/L) NR: 5-15' -- Hyperhomocystinemia HP:0002160
homocysteineMapper = ThresholdedColumnMapper(hpo_id="HP:0002160",
                                             hpo_label="Hyperhomocystinemia",
                                             threshold=15,
                                             call_if_above=True)
homocysteineMapper.preview_column(df_clinical['Homocys (umol/L) NR: 5-15'])
column_mapper_d['Homocys (umol/L) NR: 5-15'] = homocysteineMapper

In [13]:
# 'Cys (umol/L) NR: 20-70' --
# Note that this manifests as low circulating cystine (not cysteine)
# Hypocystinemia HP:0500152

cystineMapper = ThresholdedColumnMapper(hpo_id="HP:0500152",
                                        hpo_label="Hypocystinemia",
                                        threshold=20,
                                        call_if_above=False)
cystineMapper.preview_column(df_clinical['Cys (umol/L) NR: 20-70'])
column_mapper_d['Cys (umol/L) NR: 20-70'] = cystineMapper

In [14]:
# 'UA (umol/L) NR: 210-430'  -- Hypouricemia HP:0003537
uricAcidMapper = ThresholdedColumnMapper(hpo_id="HP:0003537",
                                         hpo_label="Hypouricemia",
                                         threshold=210,
                                         call_if_above=False)
uricAcidMapper.preview_column(df_clinical['UA (umol/L) NR: 210-430'])
column_mapper_d['UA (umol/L) NR: 210-430'] = uricAcidMapper

In [15]:
# 'Sulfite (mg/L) NR: 0' 
# df_clinical['Sulfite (mg/L) NR: 0']
# requires new HPO term

In [16]:
# 'Thiosulfate NR: 0 ' -- requires new HPO term

In [17]:
# 'Urine SSC (umol/mmolCr) NR: 0.1-10' -- Sulfocysteinuria HP:0032350
urineSscMapper = ThresholdedColumnMapper(hpo_id="HP:0032350",
                                         hpo_label="Sulfocysteinuria",
                                         threshold=10,
                                         call_if_above=True)
urineSscMapper.preview_column(df_clinical['Urine SSC (umol/mmolCr) NR: 0.1-10'])
column_mapper_d['Urine SSC (umol/mmolCr) NR: 0.1-10'] = urineSscMapper

In [18]:
# 'Urine Taurine (mmol/molCr) NR: 12-150'  -- Increased urinary taurine HP:0003166
urineTaurineMapper = ThresholdedColumnMapper(hpo_id="HP:0003166",
                                            hpo_label="Increased urinary taurine",
                                            threshold=150,
                                            call_if_above=True)
urineTaurineMapper.preview_column(df_clinical['Urine Taurine (mmol/molCr) NR: 12-150'])
column_mapper_d['Urine Taurine (mmol/molCr) NR: 12-150'] = urineTaurineMapper

In [19]:
# 'Urine XA NR: 0-0.46mmol/L or <40umol/mmolCr or <0.29XA/Cr'
# Here we need to use an OptionColumnMapper because three different measurement ranges are used
df_clinical['Urine XA NR: 0-0.46mmol/L or <40umol/mmolCr or <0.29XA/Cr'].unique()
urine_xa_d = { '11.7umol/mmolCr':"Xanthinuria",
       '1.7umol/mmolCr':"Xanthinuria" }
urine_not_xa_d = {'0.04mmol/L': "Xanthinuria", 
                  "1.6umol/mmolCr": "Xanthinuria", 
                  "0.0214XA/Cr": "Xanthinuria",
                 "normal": "Xanthinuria"}
urineXAmapper = OptionColumnMapper(concept_recognizer=hpo_cr, 
                                   option_d=urine_xa_d, 
                                   excluded_d=urine_not_xa_d)
urineXAmapper.preview_column(df_clinical['Urine XA NR: 0-0.46mmol/L or <40umol/mmolCr or <0.29XA/Cr'])
column_mapper_d['Urine XA NR: 0-0.46mmol/L or <40umol/mmolCr or <0.29XA/Cr'] = urineXAmapper

In [20]:
# 'Urine HypoXA NR: 0-0.18mmol/L or <8umol/mmolCr or <0.5HypoXA/Cr',
# Increased urinary hypoxanthine level HP:0011814
urine_hxa_d = {
    '8umol/mmolCr': "Increased urinary hypoxanthine level",
}
urine_hxa_excluded_d = {
    'normal': "Increased urinary hypoxanthine level",
    '0.05mmol/L': "Increased urinary hypoxanthine level",
    '0.0264HypoXA/Cr': "Increased urinary hypoxanthine level",
}
urineHXAmapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=urine_hxa_d, excluded_d=urine_hxa_excluded_d)
urineHXAmapper.preview_column(df_clinical['Urine HypoXA NR: 0-0.18mmol/L or <8umol/mmolCr or <0.5HypoXA/Cr'])
column_mapper_d['Urine HypoXA NR: 0-0.18mmol/L or <8umol/mmolCr or <0.5HypoXA/Cr'] = urineHXAmapper

In [21]:
# 'Urine UA NR: 0.44-4.50mmol/L or 50-980umol/mmolCr'
# Hyperuricosuria HP:0003149
# Decreased urinary urate HP:0011935
# Abnormality of urinary uric acid level HP:0012610
df_clinical['Urine UA NR: 0.44-4.50mmol/L or 50-980umol/mmolCr'].unique()

array(['n.a.', '0.14mmol/L', 'normal', '385umol/mmolCr', '21umol/mmolCr',
       '430umol/mmolCr'], dtype=object)

In [22]:
urine_ua_d = {'0.14mmol/L': "Decreased urinary urate",
             '21umol/mmolCr': "Decreased urinary urate",}
urine_ua_excluded_d = {'normal' : "Abnormality of urinary uric acid level",
                      '385umol/mmolCr': "Abnormality of urinary uric acid level",
                      '430umol/mmolCr': "Abnormality of urinary uric acid level",}
urineUaMapper = OptionColumnMapper(concept_recognizer=hpo_cr,
                                  option_d=urine_ua_d,
                                  excluded_d=urine_ua_excluded_d)
urineUaMapper.preview_column(df_clinical['Urine UA NR: 0.44-4.50mmol/L or 50-980umol/mmolCr'])
column_mapper_d['Urine UA NR: 0.44-4.50mmol/L or 50-980umol/mmolCr'] = urineUaMapper
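
<p>The option dictionaries for urinary uric acid were written by hand. As a cross-check, the sketch below parses each cell into a number and a unit and compares it with the reference range given for that unit in the column header. The function name <tt>classify_urine_ua</tt> and the regular expression are assumptions made for this illustration and are not part of pyphetools.</p>

```python
import re
from typing import Optional

# Reference ranges copied from the column header, keyed by the unit suffix used in the cells
URINE_UA_RANGES = {
    "mmol/L": (0.44, 4.50),
    "umol/mmolCr": (50.0, 980.0),
}
# A number immediately followed by a unit, e.g. '21umol/mmolCr'
_VALUE_WITH_UNIT = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([A-Za-z/]+)\s*$")


def classify_urine_ua(cell) -> Optional[str]:
    """Return 'low', 'normal' or 'high', or None if the cell cannot be interpreted."""
    if not isinstance(cell, str):
        return None
    if cell.strip().lower() == "normal":
        return "normal"
    match = _VALUE_WITH_UNIT.match(cell)
    if match is None:
        return None  # e.g. 'n.a.'
    value, unit = float(match.group(1)), match.group(2)
    if unit not in URINE_UA_RANGES:
        return None
    low, high = URINE_UA_RANGES[unit]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


for cell in ['n.a.', '0.14mmol/L', 'normal', '385umol/mmolCr', '21umol/mmolCr', '430umol/mmolCr']:
    print(cell, classify_urine_ua(cell))
```

<p>With the ranges from the header, the two values classified as low are the ones mapped to Decreased urinary urate above, and the remaining measured values fall within the reference range.</p>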

<H2>Putting it all together</H2>
<p>First, let's create a new individual ID column that includes the PMID of the source publication.</p>

In [23]:
def get_individual_id(arr):
    """Combine the proband number and the PMID of the source publication into one ID."""
    iid = arr.iloc[0]
    pmid = arr.iloc[1]
    if pmid == "our patient":
        return "individual_35_PMID_36303223" # from current manuscript
    return f"individual_{iid}_PMID_{pmid}"
df_clinical['individual_id'] = df_clinical[['Proband ID', 'Resource (PMID)']].apply(get_individual_id, axis=1)

In [24]:
ageMapper = AgeColumnMapper.by_month(column_name="Age at onset (months)")
ageMapper.preview_column(df_clinical['Age at onset (months)'])

Unnamed: 0,original column contents,age
0,0,P0D
1,5,P5M
2,n.a.,NOT_PROVIDED
3,0.8,P24D
4,0.1,P3D
5,12,P1Y
6,0.5,P15D
7,1.3,P1M9D
8,0.7,P21D
9,16,P1Y4M
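
<p>The preview above appears consistent with converting fractional months to ISO 8601 durations using a 30-day month. The sketch below reproduces the ten values shown under that assumption. The helper <tt>months_to_iso8601</tt> is hypothetical and is not the pyphetools code, and cells such as 'n.a.' would need to be filtered out before calling it.</p>

```python
def months_to_iso8601(months, days_per_month=30):
    """Convert an age in (possibly fractional) months to an ISO 8601 duration string."""
    whole_months = int(months)
    # Fractional part of a month expressed in days, assuming days_per_month days per month
    days = round((months - whole_months) * days_per_month)
    years, remaining_months = divmod(whole_months, 12)
    parts = []
    if years:
        parts.append(f"{years}Y")
    if remaining_months:
        parts.append(f"{remaining_months}M")
    if days:
        parts.append(f"{days}D")
    # An age of zero is written as zero days
    return "P" + "".join(parts) if parts else "P0D"


for age in [0, 5, 0.8, 0.1, 12, 0.5, 1.3, 0.7, 16]:
    print(age, months_to_iso8601(age))
```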


In [25]:
sexMapper = SexColumnMapper(male_symbol='M', female_symbol='F', column_name='Gender', unknown_symbol='n.a.')
#sexMapper.preview_column(df_clinical['Gender'])

In [26]:
individual_column_name = 'individual_id'

encoder = CohortEncoder(df=df_clinical, 
                        hpo_cr=hpo_cr, 
                        column_mapper_d=column_mapper_d, 
                        individual_column_name=individual_column_name,
                        agemapper=ageMapper, 
                        sexmapper=sexMapper,
                        metadata=metadata)

sod = Disease(disease_id='OMIM:272300', disease_label='Sulfite oxidase deficiency')
encoder.set_disease(sod)

In [27]:
individuals = encoder.get_individuals()

In [28]:
for indi in individuals:
    # indi.id is something like individual_1_PMID_9050047,
    # whereas the variant dictionary built above is keyed by individual_1
    fields = indi.id.split("_PMID")
    indi_id = fields[0]
    # fall back to an empty list so that a missing ID raises the ValueError below
    var_list = patient_id_to_variant_list_d.get(indi_id, [])
    if len(var_list) == 1:
        homozygous_var = variant_d.get(var_list[0])
        homozygous_var.set_homozygous()
        indi.add_variant(homozygous_var)
    elif len(var_list) == 2:
        het_var_1 = variant_d.get(var_list[0])
        het_var_2 = variant_d.get(var_list[1])
        het_var_1.set_heterozygous()
        het_var_2.set_heterozygous()
        indi.add_variant(het_var_1)
        indi.add_variant(het_var_2)
    else:
        raise ValueError(f"Expected 1 or 2 variants for {indi_id} but found {len(var_list)}")

In [29]:
cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1, allelic_requirement=AllelicRequirement.BI_ALLELIC)
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))

Level,Error category,Count
INFORMATION,NOT_MEASURED,275


<H2>Visualization</H2>

In [30]:
table = PhenopacketTable(individual_list=individuals, metadata=metadata)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
individual_1_PMID_9050047 (MALE; P0D),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.433del (homozygous),Neurodevelopmental delay (HP:0012758); Seizure (HP:0001250); Abnormality of extrapyramidal motor function (HP:0002071); Hypertonia (HP:0001276); Hypotonia (HP:0001252); Microcephaly (HP:0000252); Ectopia lentis (HP:0001083); Hypertaurinemia (HP:0500181); Hypocystinemia (HP:0500152); excluded: Cognitive regression (HP:0034332); excluded: Increased urinary taurine (HP:0003166)
individual_2_PMID_9600976 (FEMALE; P5M),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.650G>A (homozygous),Neurodevelopmental delay (HP:0012758); Seizure (HP:0001250); Abnormality of extrapyramidal motor function (HP:0002071); Hypertonia (HP:0001276); Hypotonia (HP:0001252); Ectopia lentis (HP:0001083); Decreased urinary urate (HP:0011935); excluded: Cognitive regression (HP:0034332); excluded: Xanthinuria (HP:0010934); excluded: Increased urinary hypoxanthine level (HP:0011814)
individual_3_PMID_10519592 (MALE; P0D),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.794C>A (heterozygous) NM_001032386.2:c.1280C>A (heterozygous),Seizure (HP:0001250); Abnormality of extrapyramidal motor function (HP:0002071); Hypertonia (HP:0001276); Ectopia lentis (HP:0001083); excluded: Neurodevelopmental delay (HP:0012758); excluded: Cognitive regression (HP:0034332); excluded: Hypotonia (HP:0001252); excluded: Microcephaly (HP:0000252); excluded: Xanthinuria (HP:0010934); excluded: Increased urinary hypoxanthine level (HP:0011814); excluded: Abnormality of urinary uric acid level (HP:0012610)
individual_4_PMID_12112661 (UNKNOWN; ),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.734_737del (homozygous),Seizure (HP:0001250)
individual_5_PMID_12112661 (UNKNOWN; ),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.287dup (heterozygous) NM_001032386.2:c.1126C>T (heterozygous),Seizure (HP:0001250)
individual_6_PMID_12112661 (UNKNOWN; ),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.772A>C (homozygous),Seizure (HP:0001250)
individual_7_PMID_12112661 (UNKNOWN; ),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.803G>A (homozygous),Seizure (HP:0001250)
individual_8_PMID_12112661 (UNKNOWN; ),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.1200C>G (homozygous),Seizure (HP:0001250)
individual_9_PMID_12112661 (UNKNOWN; ),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.1261C>T (homozygous),Seizure (HP:0001250)
individual_10_PMID_12112661 (UNKNOWN; ),Sulfite oxidase deficiency (OMIM:272300),NM_001032386.2:c.1084G>A (homozygous),Seizure (HP:0001250)


In [31]:
Individual.output_individuals_as_phenopackets(individual_list=individuals, 
                                              metadata=metadata)

We output 35 GA4GH phenopackets to the directory phenopackets


<H2>Validation</H2>

<p>The phenopackets were also validated with phenopacket-tools:</p>
<pre>pxf validate --hpo hp.json *.json</pre>
<p>No errors were found.</p>