The [original ClinVar paper](https://academic.oup.com/nar/article/42/D1/D980/1051029?login=true) describes the data that NCBI adds to submissions, in particular functional/molecular consequences, genes and location information:
> Some of the data ClinVar reports related to variation are added by NCBI. These data are reported only as part of the aggregate record (accession starting with RCV), and can include alternate HGVS expressions, allele frequencies from the 1000 Genomes project (7) or GO-ESP (8), identifiers from dbSNP or dbVar, molecular consequences (e.g. nonsense/missense/frameshift) and location data (splice site, untranslated regions, cytogenetic band, gene symbols and names). Values for molecular consequence, type of variation and location relative to a gene are standardized by reference to identifiers from the Sequence Ontology (9).

The [full submission spreadsheet](https://www.ncbi.nlm.nih.gov/clinvar/docs/submit/#spreadsheet) has no field for molecular / functional consequences, which is consistent with the statement above that NCBI adds these.
It does, however, have an optional gene symbol field:
> This must be the preferred symbol in NCBI's Gene database, which corresponds to HGNC's preferred symbol when one is available. Use only gene symbols that represent functional genes; do not include symbols for pseudogenes or regulatory regions. Separate multiple symbols with a semicolon.
> Gene symbol should only be provided to indicate the gene-disease relationship supporting the variant interpretation. Gene symbol is not expected for CNVs or cytogenetic variants except to make a statement that a specific gene within the variant has a relationship to the interpreted condition.

## Functional consequences

* What functional consequences are already in ClinVar?
* Are these directly from the submitters or annotated by ClinVar?
* How often are they present?
* How does the quality and coverage of the annotations compare with our annotations?

In [3]:
import sys

sys.path.append('..')

In [4]:
from collections import Counter
import csv
import multiprocessing
import requests

from filter_clinvar_xml import filter_xml, pprint, iterate_cvs_from_xml
from clinvar_xml_io.clinvar_xml_io import *
from eva_cttv_pipeline.trait_mapping.oxo import OntologyUri

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt

In [6]:
# January 2023
clinvar_xml = '/home/april/projects/opentargets/clinvar.xml.gz'
clinvar = ClinVarDataset(clinvar_xml)

In [23]:
# Count: has target gene, has SO term, has neither
counts = Counter()
for record in clinvar:
    annotated = False
    if record.measure:
        so_elts = find_elements(record.measure.measure_xml, './AttributeSet/XRef[@DB="Sequence Ontology"]')
        if so_elts:
            # so_term = so_elts[0].attrib['ID']
            counts['has_consequences'] += 1
            annotated = True
        if record.measure.hgnc_ids:
            counts['has_hgnc'] += 1
            annotated = True
    if not annotated:
        counts['none'] += 1

In [40]:
def get_values(counter, keys):
    return [counter.get(k, 0) for k in keys]


def print_counter(counter, keys=None):
    if not keys:
        keys = counter.keys()
    # Pad every key to the width of the longest one so the values line up
    width = max(len(k) for k in keys)
    for k, v in zip(keys, get_values(counter, keys)):
        print(f'{k: <{width}} {v}')
        

def plot_counter(counter, title, keys=None):
    if not keys:
        keys = counter.keys()
    plt.figure(figsize=(15,10))
    plt.title(title)
    bars = plt.bar(keys, get_values(counter, keys))
    plt.bar_label(bars, padding=3)

In [41]:
print_counter(counts)

has_consequences 2198642
has_hgnc         2285631
none             16665
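
The three counts above are not mutually exclusive: a record can carry both an SO term and an HGNC ID, so they sum to more than the number of records. A minimal sketch of the inclusion-exclusion arithmetic follows. It assumes the loop visited exactly the 2,302,323 RCV records quoted from the metrics in the next cell; that equality is an assumption and is not checked here. The helper name `overlap` is hypothetical.

```python
def overlap(has_a, has_b, neither, total):
    """Split two overlapping counts into both / only-a / only-b via inclusion-exclusion."""
    at_least_one = total - neither
    both = has_a + has_b - at_least_one
    return {
        'at_least_one': at_least_one,
        'both': both,
        'only_a': has_a - both,
        'only_b': has_b - both,
    }


# Counts copied from the output above.
# ASSUMPTION: the number of records iterated equals the RCV count from the metrics.
breakdown = overlap(has_a=2198642, has_b=2285631, neither=16665, total=2302323)
for name, value in breakdown.items():
    print(f'{name: <12} {value}')
```

Here `only_a` is the number of records with an SO term but no HGNC ID, and `only_b` the reverse.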


In [20]:
# from metrics
num_rcvs = 2302323  # verified same as count direct from XML
missing_conseq = 20598

In [25]:
# ClinVar's coverage of consequences
counts['has_consequences'] / num_rcvs

0.9549667878920551

In [21]:
# Our coverage of consequences
(num_rcvs - missing_conseq) / num_rcvs

0.9910533839083395
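
To put the two coverage figures side by side, the sketch below restates them as record counts and percentage points. It uses only the numbers already shown in this notebook. The two figures come from different sources (a pass over the XML versus the pipeline metrics), so the difference is indicative rather than exact.

```python
num_rcvs = 2302323           # from metrics
clinvar_covered = 2198642    # counts['has_consequences'] from the output above
missing_conseq = 20598       # from metrics

ours_covered = num_rcvs - missing_conseq
extra_records = ours_covered - clinvar_covered
# Difference between the two coverage fractions, in percentage points
extra_points = 100 * extra_records / num_rcvs

print(f'ClinVar: {clinvar_covered} records ({clinvar_covered / num_rcvs:.2%})')
print(f'Ours:    {ours_covered} records ({ours_covered / num_rcvs:.2%})')
print(f'Gap:     {extra_records} records ({extra_points:.2f} percentage points)')
```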

#### Notes on our processing

* We use the HGNC ID when we query BioMart (repeat expansion pipeline), but not for structural or SNP consequences.
* We need to do our own processing here even if ClinVar's coverage were 100%, because our downstream usage requires Ensembl gene IDs.
* See also [this (brief) discussion](https://github.com/EBIvariation/eva-opentargets/issues/189#issuecomment-786486333).

In [13]:
term_counts = Counter()
for record in clinvar:
    if record.measure:
        so_elts = find_elements(record.measure.measure_xml, './AttributeSet/XRef[@DB="Sequence Ontology"]')
        if so_elts:
            # Count every SO cross-reference on the record, so per-term totals can exceed the record count
            so_terms = [elt.attrib['ID'] for elt in so_elts]
            term_counts.update(so_terms)

In [15]:
# What's the relationship between these & microsatellites/CNVs/etc (variant types)?
repeat_terms = ['SO:0002162', 'SO:0002165']

In [14]:
term_counts

Counter({'SO:0001583': 6607031,
         'SO:0001623': 231681,
         'SO:0001589': 1583489,
         'SO:0001619': 420869,
         'SO:0001587': 821551,
         'SO:0001627': 1944452,
         'SO:0001574': 75734,
         'SO:0001575': 100717,
         'SO:0001822': 60273,
         'SO:0001819': 3028604,
         'SO:0001821': 22691,
         'SO:0001624': 333397,
         'SO:0002317': 9,
         'SO:0002054': 832,
         'SO:1000117': 86,
         'SO:1000064': 34,
         'SO:0001582': 10610,
         'SO:0001820': 74447,
         'SO:0002073': 4002,
         'SO:0001578': 4697,
         'SO:0001536': 403,
         'SO:1000071': 940,
         'SO:0001986': 457,
         'SO:0001555': 21,
         'SO:0001542': 2,
         'SO:0002153': 690,
         'SO:0002152': 168,
         'SO:0002220': 503,
         'SO:1000054': 243,
         'SO:0001559': 4,
         'SO:0001987': 93,
         'SO:1000072': 56,
         'SO:0001786': 6,
         'SO:0002218': 2038,
         'SO:1000

In [16]:
len(term_counts)

44

In [18]:
vep_terms = {
    'SO:0001893',
    'SO:0001574',
    'SO:0001575',
    'SO:0001587',
    'SO:0001589',
    'SO:0001578',
    'SO:0002012',
    'SO:0001889',
    'SO:0001821',
    'SO:0001822',
    'SO:0001583',
    'SO:0001818',
    'SO:0001630',
    'SO:0001787',
    'SO:0002170',
    'SO:0002169',
    'SO:0001626',
    'SO:0002019',
    'SO:0001567',
    'SO:0001819',
    'SO:0001580',
    'SO:0001620',
    'SO:0001623',
    'SO:0001624',
    'SO:0001792',
    'SO:0001627',
    'SO:0001621',
    'SO:0001619',
    'SO:0001631',
    'SO:0001632',
    'SO:0001895',
    'SO:0001892',
    'SO:0001782',
    'SO:0001894',
    'SO:0001891',
    'SO:0001907',
    'SO:0001566',
    'SO:0001906',
    'SO:0001628',
}

In [19]:
cv_terms = set(term_counts.keys())

In [20]:
len(vep_terms)

39

In [21]:
len(cv_terms)

44

In [24]:
cv_terms - vep_terms

{'SO:0000683',
 'SO:0001536',
 'SO:0001541',
 'SO:0001542',
 'SO:0001555',
 'SO:0001559',
 'SO:0001561',
 'SO:0001565',
 'SO:0001582',
 'SO:0001786',
 'SO:0001820',
 'SO:0001986',
 'SO:0001987',
 'SO:0002052',
 'SO:0002053',
 'SO:0002054',
 'SO:0002073',
 'SO:0002152',
 'SO:0002153',
 'SO:0002218',
 'SO:0002219',
 'SO:0002220',
 'SO:0002316',
 'SO:0002317',
 'SO:1000054',
 'SO:1000064',
 'SO:1000070',
 'SO:1000071',
 'SO:1000072',
 'SO:1000117',
 'SO:1000184'}

In [25]:
vep_terms - cv_terms

{'SO:0001566',
 'SO:0001567',
 'SO:0001580',
 'SO:0001620',
 'SO:0001621',
 'SO:0001626',
 'SO:0001628',
 'SO:0001630',
 'SO:0001631',
 'SO:0001632',
 'SO:0001782',
 'SO:0001787',
 'SO:0001792',
 'SO:0001818',
 'SO:0001889',
 'SO:0001891',
 'SO:0001892',
 'SO:0001893',
 'SO:0001894',
 'SO:0001895',
 'SO:0001906',
 'SO:0001907',
 'SO:0002012',
 'SO:0002019',
 'SO:0002169',
 'SO:0002170'}

In [28]:
cv_terms.intersection(vep_terms)

{'SO:0001574',
 'SO:0001575',
 'SO:0001578',
 'SO:0001583',
 'SO:0001587',
 'SO:0001589',
 'SO:0001619',
 'SO:0001623',
 'SO:0001624',
 'SO:0001627',
 'SO:0001819',
 'SO:0001821',
 'SO:0001822'}
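
The set comparisons above weight every term equally, although the counts range from a handful to millions. A complementary view is the share of all SO annotations that use a term found in both sets. The sketch below is one way to compute that. The function name `shared_usage` is hypothetical, and the toy inputs are a small subset of the counts shown earlier, used only to make the block self-contained.

```python
from collections import Counter


def shared_usage(term_counts, reference_terms):
    """Return the shared terms and the fraction of all annotations that use one of them."""
    shared = set(term_counts) & set(reference_terms)
    total = sum(term_counts.values())
    shared_total = sum(term_counts[term] for term in shared)
    fraction = shared_total / total if total else 0.0
    return shared, fraction


# Toy subset for illustration only, NOT the full term_counts / vep_terms
toy_counts = Counter({'SO:0001583': 6607031, 'SO:0001820': 74447, 'SO:0002073': 4002})
toy_reference = {'SO:0001583', 'SO:0001566'}

shared, fraction = shared_usage(toy_counts, toy_reference)
print(sorted(shared), f'{fraction:.1%}')
```

In the notebook itself the call would be `shared_usage(term_counts, vep_terms)`.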

Ontology

- if ClinVar's term is obsolete, replacing it with ours is fine
- if not, ours needs to be "better": EFO, more specific/accurate, etc. (do this later)

Sequence Ontology / consequences

- we can check whether ClinVar's term is obsolete
- if not, compare it with our annotations
- e.g. sample 1000 records, take the molecular consequence from ClinVar and the functional consequence from ours, and visually inspect differences in precision, etc.
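
The dataset is read as a stream, so one way to draw the 1000-record sample without loading everything into memory is reservoir sampling (Algorithm R). This is a minimal sketch of that technique, not something the pipeline already does. `reservoir_sample` is a hypothetical helper, and `range(10_000)` stands in for the record iterator.

```python
import random


def reservoir_sample(items, k, seed=0):
    """Uniformly sample up to k items from an iterable of unknown length in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


picked = reservoir_sample(range(10_000), k=1000, seed=42)
print(len(picked))
```

In the notebook, `clinvar` would be passed in place of `range(10_000)`, assuming it can be iterated in a single pass as in the loops above. Fixing the seed keeps the sample reproducible between runs.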


In [46]:
# Inspect how consequence text and SO ID are paired, using the first record only (note the break)
for record in clinvar:
    if record.measure:
        for attr_elt in find_elements(record.measure.measure_xml, './AttributeSet'):
            for conseq_elt in find_elements(attr_elt, './Attribute[@Type="MolecularConsequence"]'):
                conseq_text = conseq_elt.text
                so_elt = find_mandatory_unique_element(attr_elt, './XRef[@DB="Sequence Ontology"]')
                so_term = so_elt.attrib['ID']
                print(conseq_text)
                print(so_term)
    break

missense variant
SO:0001583
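
Since each consequence label sits in the same `AttributeSet` as its SO cross-reference, the pairs can be collected into a lookup table, which makes the bare SO IDs in `term_counts` easier to read. The sketch below uses the standard library's `xml.etree.ElementTree` in place of the notebook's `find_elements` helpers. The XML fragment is hypothetical and only mirrors the element and attribute names used in the XPath queries above.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL fragment: structure inferred from the XPath expressions used in this notebook
toy_measure = ET.fromstring('''
<Measure>
  <AttributeSet>
    <Attribute Type="MolecularConsequence">missense variant</Attribute>
    <XRef DB="Sequence Ontology" ID="SO:0001583"/>
  </AttributeSet>
  <AttributeSet>
    <Attribute Type="SomethingElse">placeholder</Attribute>
  </AttributeSet>
</Measure>
''')


def so_labels(measure_xml):
    """Map SO ID -> consequence text for every AttributeSet that has both."""
    labels = {}
    for attr_set in measure_xml.findall('./AttributeSet'):
        conseq = attr_set.find('./Attribute[@Type="MolecularConsequence"]')
        xref = attr_set.find('./XRef[@DB="Sequence Ontology"]')
        if conseq is not None and xref is not None:
            labels[xref.attrib['ID']] = conseq.text
    return labels


print(so_labels(toy_measure))
```

Unlike `find_mandatory_unique_element`, this version silently skips an `AttributeSet` that lacks the cross-reference instead of raising.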
