## Gene-related conditions/disorders

* Pull out all traits associated with this submission/submitter (?)
    * Needs to be by [submitter ID](https://www.ncbi.nlm.nih.gov/clinvar/submitters/239772/)
    * Shouldn't use this as a long-term solution, just to ensure we get the right "ground truth" set
* Does regex like `[0-9a-zA-Z]+-related .*` work?
    * Precision/recall over this set
* How many such traits have EFO/MONDO/HP terms?
* How many such traits have MedGen terms?
* How many such traits have records from other submitters?
* How many associated variants have other records?
    * Note we already know 99% of targets are covered by other records
* Iterate on the regex as needed
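
As a minimal sketch of the precision/recall check in the plan above: the helper name `regex_precision_recall` and the labelled sample names are hypothetical, chosen only for illustration, and are not taken from the real submission.

```python
import re


def regex_precision_recall(pattern, names, positives):
    """Return (precision, recall) of `pattern` over `names`.

    `positives` is the hand-labelled set of names that should be matched.
    """
    compiled = re.compile(pattern)
    predicted = {name for name in names if compiled.fullmatch(name)}
    true_positives = predicted & set(positives)
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(positives) if positives else 0.0
    return precision, recall


# Hypothetical labelled sample (illustrative only)
sample_names = [
    'ASAH1-related disorder',
    'BRCA2-related condition',
    'Sandhoff disease',
    'Age-related macular degeneration',
]
sample_positives = {'ASAH1-related disorder', 'BRCA2-related condition'}

precision, recall = regex_precision_recall(
    r'[0-9a-zA-Z]+-related .*', sample_names, sample_positives
)
print(f'precision={precision:.2f} recall={recall:.2f}')
```

In this toy sample the candidate pattern also matches 'Age-related macular degeneration', which is the kind of false positive the precision figure is meant to surface.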

In [4]:
import sys
sys.path.append('..')

In [5]:
from filter_clinvar_xml import filter_xml, pprint, iterate_cvs_from_xml

from cmat.clinvar_xml_io import *
from cmat.clinvar_xml_io.xml_parsing import *

import gzip
import os
import re
import pandas as pd

In [7]:
data_dir = os.getenv('WORK_DIR')
full_clinvar_xml = os.path.join(data_dir, 'full-clinvar.xml.gz')
prevention_xml = os.path.join(data_dir, 'prevention-records.xml.gz')

In [None]:
prevention_id = '239772'

header = b'''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ReleaseSet Dated="." xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Type="full" xsi:noNamespaceSchemaLocation="http://ftp.ncbi.nlm.nih.gov/pub/clinvar/xsd_public/clinvar_public_2.0.xsd">
'''
count = 0
with gzip.open(prevention_xml, 'wb') as output_file:
    output_file.write(header)
    for raw_cvs_xml in iterate_cvs_from_xml(full_clinvar_xml):
        
        # 1. Record must have at least one trait with a valid name
        rcv = find_mandatory_unique_element(raw_cvs_xml, 'ReferenceClinVarAssertion')
        record = ClinVarRecord(rcv, 2.0)
        if len(record.traits_with_valid_names) == 0:
            continue
        
        # 2. Record must have a PreventionGenetics submission
        subs = find_elements(raw_cvs_xml, 'ClinVarAssertion/ClinVarAccession')
        for s in subs:
            org_id = s.attrib['OrgID']
            if str(org_id) == prevention_id:
                output_file.write(ElementTree.tostring(raw_cvs_xml))
                count += 1
                # Write each record at most once, even if it has several matching submissions
                break
                
    output_file.write(b'</ReleaseSet>')
print(f'Records written: {count}')

# Records written: 112101

In [9]:
prevention_dataset = ClinVarDataset(prevention_xml)

In [10]:
prevention_trait_names = set()
for r in prevention_dataset:
    for t in r.traits_with_valid_names:
        prevention_trait_names.add(t.preferred_or_other_valid_name)

In [11]:
len(prevention_trait_names)

5745

In [12]:
prevention_trait_names = list(prevention_trait_names)

In [27]:
GENE_RELATED_DISORDER = r'^\S+-related disorder$'

In [28]:
not_related_disorder = [name for name in prevention_trait_names if not re.match(GENE_RELATED_DISORDER, name)]

In [29]:
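# These names do not match the pattern, so filtering on submitter alone will not isolate the generic traits.
# 'ASAH1-related disorders' (plural) is listed only because the pattern requires the singular form.
# Note also that "*-related condition" no longer appears at all; nothing stops anyone from using
# another phrase for something similarly generic.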
not_related_disorder

['Early-onset progressive diffuse brain atrophy-microcephaly-muscle weakness-optic atrophy syndrome',
 'Niemann-Pick disease, type A',
 'Congenital heart defects, dysmorphic facial features, and intellectual developmental disorder',
 'Sandhoff disease',
 'Spinocerebellar ataxia type 12',
 'Von Hippel-Lindau syndrome',
 'Polycystic kidney disease, adult type',
 'Tyrosinase-negative oculocutaneous albinism',
 'Mitral valve prolapse, myxomatous 2',
 'Gaucher disease type I',
 'Gaucher disease type III',
 'Distinctive facial features',
 'Multiple congenital anomalies',
 'Gaucher disease type II',
 'Van Maldergem syndrome 1',
 'Anterior segment dysgenesis 7',
 'Developmental delay',
 'ASAH1-related disorders']

I don't think this is the right set to be looking at. We could instead use the submission name (SUB14299258) or date (2024-03-08) to target the specific problematic submission, but that would not address the root cause either.

This also raises the question of whether we should filter "gene-related condition" as well as "disorder". Even though it has disappeared from this specific case, it is still arguably not specific enough to be annotated properly. It may be best to leave it until it becomes a concrete problem, though.
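
If we did decide to widen the filter, one possible sketch is below. The pattern `GENERIC_GENE_RELATED` and the helper `is_generic_gene_related` are hypothetical, and 'GBA-related condition' is an invented example name; the pattern simply adds "condition" and an optional plural to the existing `GENE_RELATED_DISORDER` expression.

```python
import re

# Hypothetical broader pattern: also accepts "condition" and plural forms
GENERIC_GENE_RELATED = re.compile(r'^\S+-related (disorder|condition)s?$')


def is_generic_gene_related(name):
    """True if the trait name looks like a generic '<gene>-related ...' label."""
    return GENERIC_GENE_RELATED.match(name) is not None


examples = [
    'ASAH1-related disorder',
    'ASAH1-related disorders',           # plural form seen in the list above
    'GBA-related condition',             # invented name, for illustration
    'Gaucher disease type I',
    'Age-related macular degeneration',
]
flags = {name: is_generic_gene_related(name) for name in examples}
for name, flagged in flags.items():
    print(f'{flagged!s:5} {name}')
```

Anchoring on the full name keeps specific terms such as 'Age-related macular degeneration' out of the filter, at the cost of missing any generic phrasing we have not anticipated.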

In [19]:
# How many *-related disorder terms have EFO/MONDO/HP or MedGen terms? => postponed until OLS/EFO are back up
# MedGen terms can be checked within ClinVar, though

In [32]:
related_disorder_xml = os.path.join(data_dir, 'disorder-records.xml.gz')

In [None]:
# Take another set: this time, any record with a trait name matching "*-related disorder"

def has_related_disorder_trait(x: ClinVarRecord):
    for t in x.traits_with_valid_names:
        if re.match(GENE_RELATED_DISORDER, t.preferred_or_other_valid_name):
            return True
    return False


filter_xml(
    input_xml=full_clinvar_xml,
    output_xml=related_disorder_xml,
    filter_fct=has_related_disorder_trait,
    max_num=None,
)

In [None]:
# Of the "related disorder" records, how many have a MedGen or any other term within ClinVar?
# How many have a submitter besides PreventionGenetics? (not sure we actually care about this)
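
A rough sketch of the MedGen count, using only the standard library rather than the `cmat` API. It assumes that cross-references appear as `<XRef DB="MedGen" .../>` children of each `Trait` in the reference assertion; the inline XML, the gene names and the helper `count_traits_with_xref` are hypothetical stand-ins, not real ClinVar data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for the filtered XML file
SAMPLE_XML = '''
<ReleaseSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <TraitSet Type="Disease">
        <Trait Type="Disease">
          <Name><ElementValue Type="Preferred">GENE1-related disorder</ElementValue></Name>
          <XRef ID="CN000001" DB="MedGen"/>
        </Trait>
      </TraitSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <TraitSet Type="Disease">
        <Trait Type="Disease">
          <Name><ElementValue Type="Preferred">GENE2-related disorder</ElementValue></Name>
        </Trait>
      </TraitSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
</ReleaseSet>
'''


def count_traits_with_xref(root, db='MedGen'):
    """Return (traits with an XRef to `db`, total traits) over reference assertions."""
    traits = root.findall('./ClinVarSet/ReferenceClinVarAssertion/TraitSet/Trait')
    with_xref = sum(
        1 for trait in traits
        if any(xref.get('DB') == db for xref in trait.findall('XRef'))
    )
    return with_xref, len(traits)


root = ET.fromstring(SAMPLE_XML)
with_medgen, total = count_traits_with_xref(root)
print(f'{with_medgen} of {total} traits have a MedGen cross-reference')
```

For the real file, the same counting logic would need to run over a streamed parse of `related_disorder_xml` rather than an in-memory string.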
