# SCO2
Variants in SCO2 are associated with [Mitochondrial complex IV deficiency, nuclear type 2](https://omim.org/entry/604377) as well as
[Myopia 6](https://omim.org/entry/608908).  This notebook collects clinical data from several publications.


Note that we have revised incorrect HGVS nomenclature from the original publications as follows:
- 12–base pair (bp) deletion at c.1519-1530 (PMID:23407777): NM_005138.3:c.402_413del
- c.1541G A (PMID:23407777):  c.418G>A
- c.398G> (PMID:14994243): c.398G>A
- c.17INS19bp (PMID:20159436): NM_005138.3(SCO2):c.16_17insAGCATGCAGCAGTGACTCA (p.Arg6fs) (See clinvar variation id 222816).
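
The revisions listed above can also be kept as a small lookup table, so that the published strings are translated in one place. This is only a sketch: the name `REVISED_HGVS` and the helper `revise_hgvs` are hypothetical and are not used elsewhere in this notebook. The keys are copied verbatim from the list above.

```python
# Hypothetical lookup table: (PMID, nomenclature as published) -> revised HGVS.
# Keys are copied from the list above; values are the revised expressions.
REVISED_HGVS = {
    ("PMID:23407777", "12–base pair (bp) deletion at c.1519-1530"): "NM_005138.3:c.402_413del",
    ("PMID:23407777", "c.1541G A"): "c.418G>A",
    ("PMID:14994243", "c.398G>"): "c.398G>A",
    ("PMID:20159436", "c.17INS19bp"): "NM_005138.3:c.16_17insAGCATGCAGCAGTGACTCA",
}


def revise_hgvs(pmid: str, published: str) -> str:
    """Return the revised HGVS string, or the input unchanged if no revision is recorded."""
    return REVISED_HGVS.get((pmid, published), published)


print(revise_hgvs("PMID:14994243", "c.398G>"))
```

Strings without a recorded revision pass through unchanged, so the helper can be applied to a whole column without special cases.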

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
from IPython.display import display, HTML
import pyphetools
from pyphetools.creation import *
from pyphetools.visualization import *
from pyphetools.validation import CohortValidator
print(f"Using pyphetools version {pyphetools.__version__}")

Using pyphetools version 0.9.15


In [2]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
parser = HpoParser(hpo_json_file="../hp.json")
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
hpo_ontology = parser.get_ontology()
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199")
metadata.default_versions_with_hpo(version=hpo_version)
print(f"HPO version {hpo_version}")

HPO version 2023-10-09


In [3]:
df = pd.read_excel('input/SCO2_curation.xlsx')
df = df.astype(str)

In [4]:
df['patient_id'] = df['id']
df.set_index('patient_id', inplace=True)

In [5]:
df['Variant annotation'].unique()

array(['c.361G>C(p.Gly121Arg)', 'c.763C>T(p.Arg255Trp)',
       'c.418G>A(p.Glu140Lys)/c.505C>A(p.Pro169Thr)',
       'c.404A>G(p.Asp135Gly)/c.512G>A(p.Arg171Gln)',
       'c.179G>A p.(Arg60Gln)/c.577G>A p.(Gly193Ser)',
       'c.334C>T(p.Arg112Trp)', 'c.358C>T(p.Arg120Trp)',
       'c.479T>C(p.Val160Ala)/c.697C>A(p.Pro233Thr)',
       'c.418G>A(p.Glu140Lys)/c.402_413del',
       'c.157C>T(p.Gln53*)/c.418G>A(p.Glu140Lys)',
       'c.418G>A(p.Glu140Lys)/c.674C>T(p.Ser225Phe)',
       'c.667G>A(p.Asp223Asn)/chr22:g.49275958_49362964del(NCBI Build 36.1)',
       'c.418G>A(p.Glu140Lys)/c.16_17insAGCATGCAGCAGTGACTCA',
       'c.418G>A(p.Glu140Lys)/c.398G>A(p.Cys133Tyr)',
       'c.418G>A(p.Glu140Lys)/c.157C>T(p.Gln53Ter)',
       'c.418G>A(p.Glu140Lys)',
       'c.418G>A(p.Glu140Lys)/c.479T>G(p.Val160Gly)',
       'c.418G>A(p.Glu140Lys)/c.107G>A(p.Trp36*)',
       'c.577G>A(p.Gly193Ser)', 'c.157C>T (p.Gln53*)',
       'c.341G>A(p.Arg114His)', 'c.776C>T(p.Ala259Val)',
       'c.418G>A(p.Glu1

In [6]:
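def extract_cdna(variant):
    """
    Split strings like c.772G>T(p.Gly258*) on the open-parenthesis symbol and return the first part (the c. notation).
    """
    v = variant.split("(")[0]
    # entries written as 'c.179G>A p.(Arg60Gln)' leave a trailing ' p.' after the split; remove it
    v = v.replace(" ", "").replace("p.","")
    return v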
    
def extract_variant_1(variants):
    """
    Split on the slash ("/") and return the first part (or entire string for homozygous)
    """
    if isinstance(variants, float):
        return "nan"
    v1 = variants.split("/")[0]
    return extract_cdna(v1)

def extract_variant_2(variants):
    """
    Split on the slash ("/") and return the second part (or entire string for homozygous)
    """
    if isinstance(variants, float):
        return "nan"
    fields = variants.split("/")
    if len(fields) == 2:
        return extract_cdna(fields[1])
    else:
        # there was only one variant
        return extract_cdna(variants)

df["var1"] = df['Variant annotation'].apply(extract_variant_1)
df["var2"] = df['Variant annotation'].apply(extract_variant_2)
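
To make the behaviour of the extraction step easier to check, the following self-contained sketch applies the same splitting rules to three annotation strings taken from the `Variant annotation` column. The names `cdna_only` and `split_genotype` are hypothetical stand-ins for the functions in the cell above.

```python
from typing import Tuple


def cdna_only(variant: str) -> str:
    """Keep the c. part: cut at the first '(' and drop spaces and a dangling 'p.'."""
    head = variant.split("(")[0]
    return head.replace(" ", "").replace("p.", "")


def split_genotype(annotation: str) -> Tuple[str, str]:
    """Return (allele 1, allele 2); a single variant is reported for both alleles."""
    parts = annotation.split("/")
    first = cdna_only(parts[0])
    second = cdna_only(parts[1]) if len(parts) == 2 else first
    return first, second


examples = [
    "c.361G>C(p.Gly121Arg)",                         # one variant listed
    "c.179G>A p.(Arg60Gln)/c.577G>A p.(Gly193Ser)",  # space before 'p.('
    "c.418G>A(p.Glu140Lys)/c.402_413del",            # second allele has no protein part
]
for annotation in examples:
    print(annotation, "->", split_genotype(annotation))
```

When only one variant is listed, both alleles receive the same string, which is what the later genotype step relies on to recognise homozygous individuals.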

In [7]:
var1_list = df["var1"].unique()
var2_list = df["var2"].unique()
var_set = set()
var_set.update(var1_list)
var_set.update(var2_list)
variant_d = {}
hg38 = "hg38"
SCO2_transcript = "NM_005138.3"

vvalidator = VariantValidator(genome_build=hg38, transcript=SCO2_transcript)
for v in var_set:
    #print(f"{v}")
    try:
        if v.startswith("chr22:g.49275958_49362964del"):
            var = StructuralVariant.chromosomal_deletion(cell_contents="chr22:g.49275958_49362964del(NCBI Build 36.1)",
                                                        gene_id="HGNC:10604", gene_symbol="SCO2")
        else:
            var = vvalidator.encode_hgvs(v)
        variant_d[v] = var
    except Exception as excpt:
        print(v + ': -- ' +str(excpt))
print(f"extracted {len(variant_d)} variants with VariantValidator")

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_005138.3%3Ac.157C>T/NM_005138.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_005138.3%3Ac.334C>T/NM_005138.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_005138.3%3Ac.402_413del/NM_005138.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_005138.3%3Ac.268C>T/NM_005138.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_005138.3%3Ac.358C>T/NM_005138.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_005138.3%3Ac.361G>C/NM_005138.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_005138.3%3Ac.16_17insAGCATGCAGCAGTGACTCA/NM_005138.3?content-type=application%2F
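
The request URLs logged above follow a regular pattern: genome build, then the transcript-qualified HGVS expression with the colon percent-encoded, then the transcript again. The sketch below only reconstructs that pattern from the log for illustration. It is not the pyphetools implementation, the function name `variantvalidator_url` is hypothetical, and no request is sent.

```python
from urllib.parse import quote

BASE = "https://rest.variantvalidator.org/VariantValidator/variantvalidator"


def variantvalidator_url(hgvs_c: str, transcript: str = "NM_005138.3", build: str = "hg38") -> str:
    """Rebuild a URL of the form seen in the log above (assumed pattern)."""
    # Percent-encode the colon; '>' is left unencoded to match the logged URLs.
    query = quote(f"{transcript}:{hgvs_c}", safe=">")
    content_type = quote("application/json", safe="")
    return f"{BASE}/{build}/{query}/{transcript}?content-type={content_type}"


print(variantvalidator_url("c.157C>T"))
```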

In [8]:
import re
def convert_year(y):
    """
    Convert an age string such as '6 years' or '19 months' to an ISO 8601 duration.
    Return "n/a" if the string cannot be parsed.
    """
    if y == '19 months':
        return "P1Y7M"
    elif y == '28 months':
        return "P2Y4M"
    elif y == "birth":
        return "P1D"
    match = re.search(r'(\d+) year', y)
    if match:
        years = match.group(1)
        return f"P{years}Y"
    match = re.search(r'(\d+) month', y)
    if match:
        months = match.group(1)
        return f"P{months}M"
    match = re.search(r'(\d+) week', y)  # matches both "week" and "weeks"
    if match:
        w = match.group(1)
        return f"P{w}W"
    return "n/a"

df['age'] = df['Age of diagnosis'].apply(lambda x: convert_year(x))
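
The cell above special-cases '19 months' and '28 months'. A more general variant, sketched here under the assumption that month counts of 12 or more should always be expressed as years plus months, uses `divmod` instead of hard-coded strings. The function name `age_to_iso8601` is hypothetical, and day counts are an addition that the cell above does not handle.

```python
import re


def age_to_iso8601(text: str) -> str:
    """Convert strings such as '19 months' or '6 years' to an ISO 8601 duration."""
    if text.strip().lower() == "birth":
        return "P1D"  # same convention as the cell above
    match = re.search(r"(\d+)\s*(year|month|week|day)s?", text)
    if match is None:
        return "n/a"
    count, unit = int(match.group(1)), match.group(2)
    if unit == "month":
        years, months = divmod(count, 12)
        # Emit only the non-zero components, e.g. 19 months -> P1Y7M, 12 months -> P1Y.
        year_part = f"{years}Y" if years else ""
        month_part = f"{months}M" if months or not years else ""
        return "P" + year_part + month_part
    return f"P{count}{unit[0].upper()}"


for example in ["19 months", "28 months", "6 years", "4 months", "birth", "nan"]:
    print(example, "->", age_to_iso8601(example))
```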

In [9]:
ageMapper = AgeColumnMapper.iso8601(column_name='age')
# ageMapper.preview_column(df['age'])

In [10]:
sexMapper = SexColumnMapper(male_symbol="M", female_symbol="F", column_name='Gender')
# sexMapper.preview_column(df["Gender"])

In [11]:
df.columns

Index(['id', 'omim_id', 'omim_title', 'omim_name', 'hgnc_id', 'gene_symbol',
       'Zigosity', 'Location', 'Variant annotation', 'Consequence', 'Refseq',
       'Protein ID', 'Unnamed: 12', 'ACGM classification', 'Protein structure',
       'Age of diagnosis', 'Gender', 'Age at death', 'Phenotype',
       'Prenatal ultrasound phenotype', 'MRI phenotype', 'Cardiac phenotype',
       'Family history', 'Source', 'PMID', 'title', 'var1', 'var2', 'age'],
      dtype='object')

In [12]:
mapper_d = {}
phenotypeColumnMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
phenotypeColumnMapper.preview_column(df['Phenotype'])
mapper_d['Phenotype'] = phenotypeColumnMapper
# phenotypeColumnMapper.preview_column(df['Phenotype'])

In [13]:
prenatalUSmapper =  OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
prenatalUSmapper.preview_column(df['Prenatal ultrasound phenotype'])
mapper_d['Prenatal ultrasound phenotype'] = prenatalUSmapper

In [14]:
mriMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
mriMapper.preview_column(df['MRI phenotype'])
mapper_d['MRI phenotype'] = mriMapper

In [15]:
# No entries for cardiac
#cardiacMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
#cardiacMapper.preview_column(df['Cardiac phenotype'])

In [16]:
aod_d = {
    "5 months": "P5M",
    "9 months": "P9M",
    "10 months": "P10M",
    "45 days": "P1M15D",
    "7 months": "P7M",
    "6 months": "P6M",
    "20 months": "P1Y8M",
    "13 weeks": "P3M1W",
    "4 weeks": "P1M",
    "53 days": "P1M3W2D",
    "8 months": "P8M",
    "12 months": "P1Y",
    "15 months": "P1Y3M",
    "7 weeks": "P1M3W",
    "8 weeks": "P2M",
    "4 months": "P4M",
    "25 days": "P3W4D",
    "1 month": "P1M",
    "14 months": "P1Y2M",
    "6.5 months": "P6M2W",
    "4.5 months": "P4M2W",
    "11 months": "P11M",
    "16 months": "P1Y4M",
    "22 months": "P1Y10M",
    "30 months": "P2Y6M",
    "28 months": "P2Y4M",
}
aodMapper = AgeOfDeathColumnMapper(column_name='Age at death', string_to_iso_d=aod_d)
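
Hand-written duration strings are easy to get wrong, so a quick arithmetic check can help. The sketch below converts a duration back to an approximate number of days and compares it with the source text. The conversion factors (30 days per month, 7 per week, 365 per year) are assumptions chosen for this check, and `approx_days` is a hypothetical helper, not part of pyphetools.

```python
import re

# Rough conversion factors, assumed for this sanity check only.
DAYS_PER_UNIT = {"Y": 365, "M": 30, "W": 7, "D": 1}


def approx_days(iso: str) -> int:
    """Approximate length in days of a duration such as 'P1M3W2D'."""
    if not re.fullmatch(r"P(?=\d)(\d+Y)?(\d+M)?(\d+W)?(\d+D)?", iso):
        raise ValueError(f"not a recognised duration: {iso!r}")
    return sum(int(n) * DAYS_PER_UNIT[u] for n, u in re.findall(r"(\d+)([YMWD])", iso))


# Entries from the mapping above whose keys are given in days.
for text, iso in {"45 days": "P1M15D", "53 days": "P1M3W2D", "25 days": "P3W4D"}.items():
    print(text, iso, approx_days(iso))
```

The pattern deliberately accepts weeks next to months and days, because the mapping above combines them.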

In [17]:
MC4DN2 = Disease(disease_id="OMIM:604377", disease_label="Mitochondrial complex IV deficiency, nuclear type 2")
disease_d = {}
disease_d["604377"] = MC4DN2
myopia6 = Disease(disease_id="OMIM:608908", disease_label="Myopia 6")
disease_d["608908"] = myopia6
diseaseMapper = DiseaseIdColumnMapper(column_name="omim_id", disease_id_map=disease_d)

In [18]:
encoder = MixedCohortEncoder(df=df,
                            hpo_cr=hpo_cr,
                             column_mapper_d=mapper_d,
                             individual_column_name="patient_id",
                             disease_id_mapper=diseaseMapper,
                             pmid_column="PMID",
                             title_column="title",
                             sexmapper=sexMapper,
                             agemapper=ageMapper,
                             age_of_death_mapper=aodMapper,
                             metadata=metadata
                        )

In [19]:
individuals = encoder.get_individuals()

In [20]:
# retrieve the variant strings and add Variant objects to each individual
# the individual id (i.id) is also the index of the pandas dataframe
for i in individuals:
    row = df.loc[i.id] 
    v1 = row['var1']
    v2 = row['var2']
    
    #print(f"{i.id}: v1={v1} and v2={v2}")
    if v1 == v2:
        var1 = variant_d.get(v1)
        var1.set_homozygous()
        i.add_variant(var1)
    else:
        var1 = variant_d.get(v1)
        var2 = variant_d.get(v2)
        var1.set_heterozygous()
        var2.set_heterozygous()
        i.add_variant(var1)
        i.add_variant(var2)
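
The loop above assumes that every string in `var1` and `var2` has an entry in `variant_d`. If a variant failed validation earlier, `variant_d.get(...)` returns `None` and the following method call raises an `AttributeError`. The genotype rule itself can be sketched without pyphetools as follows. The function `assign_zygosity` is hypothetical and works on plain strings rather than `Variant` objects.

```python
def assign_zygosity(v1, v2, known):
    """Return [(variant, zygosity), ...] for one individual.

    known: collection of variant strings that were successfully encoded.
    """
    missing = sorted(v for v in {v1, v2} if v not in known)
    if missing:
        # Fail with a clear message instead of an AttributeError on None.
        raise KeyError(f"variant(s) not encoded: {missing}")
    if v1 == v2:
        return [(v1, "homozygous")]
    return [(v1, "heterozygous"), (v2, "heterozygous")]


known = {"c.418G>A", "c.505C>A", "c.361G>C"}
print(assign_zygosity("c.361G>C", "c.361G>C", known))
print(assign_zygosity("c.418G>A", "c.505C>A", known))
```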

In [21]:
cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1, allelic_requirement=AllelicRequirement.BI_ALLELIC)
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))

Level,Error category,Count
ERROR,INSUFFICIENT_HPOS,1
WARNING,REDUNDANT,5

ID,Level,Category,Message,HPO Term
PMID_23838601_23838601_P1,ERROR,INSUFFICIENT_HPOS,Minimum HPO terms required 1 but only 0 found,


In [22]:
individuals = cvalidator.get_error_free_individual_list()
table = PhenopacketTable(individual_list=individuals, metadata=metadata)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
35112411_P1 (MALE; P14Y),"Mitochondrial complex IV deficiency, nuclear type 2 (OMIM:604377)",NM_005138.3:c.361G>C (homozygous),Frequent falls (HP:0002359); Limited ankle dorsiflexion (HP:0033526); Distal lower limb amyotrophy (HP:0008944); Hand muscle atrophy (HP:0009130); Tremor (HP:0001337); Diffuse white matter abnormalities (HP:0007204)
35112411_P2 (MALE; P6Y),"Mitochondrial complex IV deficiency, nuclear type 2 (OMIM:604377)",NM_005138.3:c.361G>C (homozygous),Motor delay (HP:0001270); Tremor (HP:0001337)
31844624_P1 (MALE; P1Y7M),"Mitochondrial complex IV deficiency, nuclear type 2 (OMIM:604377)",NM_005138.3:c.763C>T (homozygous),Frequent falls (HP:0002359); Gait ataxia (HP:0002066); Tremor (HP:0001337); Distal amyotrophy (HP:0003693); Dysmetria (HP:0001310); Peripheral axonal neuropathy (HP:0003477); Increased serum lactate (HP:0002151); Increased CSF lactate (HP:0002490); Cerebellar atrophy (HP:0001272)
31844624_P2 (FEMALE; P2Y4M),"Mitochondrial complex IV deficiency, nuclear type 2 (OMIM:604377)",NM_005138.3:c.763C>T (homozygous),Gait ataxia (HP:0002066); Tremor (HP:0001337); Distal amyotrophy (HP:0003693); Strabismus (HP:0000486); Cerebellar atrophy (HP:0001272)
29351582_P1 (FEMALE; P6M),"Mitochondrial complex IV deficiency, nuclear type 2 (OMIM:604377)",NM_005138.3:c.418G>A (heterozygous) NM_005138.3:c.505C>A (heterozygous),Ptosis (HP:0000508); Weakness of facial musculature (HP:0030319); Facial-lingual fasciculations (HP:0007089); Dysarthria (HP:0001260); Delayed gross motor development (HP:0002194); Frequent falls (HP:0002359); Skeletal muscle atrophy (HP:0003202); Motor axonal neuropathy (HP:0007002); Areflexia (HP:0001284); Pes cavus (HP:0001761)
29351582_P2 (MALE; P12Y),"Mitochondrial complex IV deficiency, nuclear type 2 (OMIM:604377)",NM_005138.3:c.404A>G (heterozygous) NM_005138.3:c.512G>A (heterozygous),Lower limb muscle weakness (HP:0007340); Areflexia (HP:0001284); Impaired vibratory sensation (HP:0002495); Mixed demyelinating and axonal polyneuropathy (HP:0007327); Distal lower limb amyotrophy (HP:0008944); Pes planus (HP:0001763)
34746378_P1 (MALE; P48Y),"Mitochondrial complex IV deficiency, nuclear type 2 (OMIM:604377)",NM_005138.3:c.179G>A (heterozygous) NM_005138.3:c.577G>A (heterozygous),Visual impairment (HP:0000505); Hemeralopia (HP:0012047); Memory impairment (HP:0002354); Lower limb pain (HP:0012514); Ataxia (HP:0001251); Areflexia (HP:0001284); Positive Romberg sign (HP:0002403); Hearing impairment (HP:0000365); Abnormal circulating creatine kinase concentration (HP:0040081); Abnormal cerebral white matter morphology (HP:0002500); Cerebral cortical atrophy (HP:0002120)
25525168_P1 (FEMALE; P32Y),Myopia 6 (OMIM:608908),NM_005138.3:c.334C>T (homozygous),Abnormal fundus morphology (HP:0001098); High myopia (HP:0011003)
25525168_P2 (FEMALE; P6Y),Myopia 6 (OMIM:608908),NM_005138.3:c.358C>T (homozygous),Abnormal fundus morphology (HP:0001098); High myopia (HP:0011003)
23364397_P1 (MALE; P4M),"Mitochondrial complex IV deficiency, nuclear type 2 (OMIM:604377)",NM_005138.3:c.479T>C (heterozygous) NM_005138.3:c.697C>A (heterozygous),Malignant hyperthermia (HP:0002047)


In [23]:
MixedCohortEncoder.output_individuals_as_phenopackets(individual_list=individuals)

We output 36 GA4GH phenopackets to the directory phenopackets
