## Functional consequences

* What functional consequences are already in ClinVar?
* Are these directly from the submitters or annotated by ClinVar?
* How often are they present?
* How does the quality and coverage of the annotations compare with our annotations?

### Background info

From the [original paper](https://academic.oup.com/nar/article/42/D1/D980/1051029?login=true), on the data that NCBI adds to submissions (in particular functional/molecular consequences, genes and location info):
> Some of the data ClinVar reports related to variation are added by NCBI. These data are reported only as part of the aggregate record (accession starting with RCV), and can include alternate HGVS expressions, allele frequencies from the 1000 Genomes project (7) or GO-ESP (8), identifiers from dbSNP or dbVar, molecular consequences (e.g. nonsense/missense/frameshift) and location data (splice site, untranslated regions, cytogenetic band, gene symbols and names). Values for molecular consequence, type of variation and location relative to a gene are standardized by reference to identifiers from the Sequence Ontology (9).

The [full submission spreadsheet](https://www.ncbi.nlm.nih.gov/clinvar/docs/submit/#spreadsheet) has no field for molecular or functional consequences, which is consistent with the statement above that NCBI adds these.
It does, however, have an optional gene symbol field:
> This must be the preferred symbol in NCBI's Gene database, which corresponds to HGNC's preferred symbol when one is available. Use only gene symbols that represent functional genes; do not include symbols for pseudogenes or regulatory regions. Separate multiple symbols with a semicolon.
Gene symbol should only be provided to indicate the gene-disease relationship supporting the variant interpretation. Gene symbol is not expected for CNVs or cytogenetic variants except to make a statement that a specific gene within the variant has a relationship to the interpreted condition.

### Coverage in ClinVar

In [2]:
import sys

sys.path.append('..')

In [3]:
from collections import Counter
import csv
import multiprocessing
from random import random
import requests

from filter_clinvar_xml import filter_xml, pprint, iterate_cvs_from_xml
from eva_cttv_pipeline.clinvar_xml_io.clinvar_xml_io import *
from eva_cttv_pipeline.trait_mapping.oxo import OntologyUri

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt

In [5]:
# January 2023
clinvar_xml = '/home/april/projects/opentargets/clinvar.xml.gz'
clinvar = ClinVarDataset(clinvar_xml)

In [64]:
def get_values(counter, keys):
    return [counter.get(k, 0) for k in keys]


def print_counter(counter, keys=None):
    if not keys:
        keys = counter.keys()
    # Pad every key to the width of the longest one so the counts line up
    width = max((len(k) for k in keys), default=0)
    for k, v in zip(keys, get_values(counter, keys)):
        print(f'{k: <{width}} {v}')
        

def plot_counter(counter, title, keys=None):
    if not keys:
        keys = counter.keys()
    plt.figure(figsize=(15,10))
    plt.title(title)
    bars = plt.bar(keys, get_values(counter, keys))
    plt.bar_label(bars, padding=3)

In [62]:
# Note that
#  - not all molecular consequence terms are SO
#  - (possibly) not all SO terms are molecular consequences

annotation_counts = Counter()
for record in clinvar:
    has_molec = False
    has_func = False
    has_gene = False
    if record.measure:
        for attr_elt in find_elements(record.measure.measure_xml, './AttributeSet'):
            # These are usually Sequence Ontology
            if find_elements(attr_elt, './Attribute[@Type="MolecularConsequence"]'):
                has_molec = True
            # These are usually Variation Ontology (but sometimes SO)
            if find_elements(attr_elt, './Attribute[@Type="FunctionalConsequence"]'):
                has_func = True
        if record.measure.hgnc_ids:
            has_gene = True
    if has_molec:
        annotation_counts['molecular_consequences'] += 1
    if has_func:
        annotation_counts['functional_consequences'] += 1
    if has_gene:
        annotation_counts['has_gene'] += 1
    if not (has_molec or has_func or has_gene):
        annotation_counts['none'] += 1

In [65]:
print_counter(annotation_counts)

molecular_consequences  2198584
has_gene                2285631
functional_consequences 16091
none                    16652


In [68]:
# from metrics
num_rcvs = 2302323  # verified same as count direct from XML
missing_conseq = 20598

In [69]:
# ClinVar's coverage of consequences
annotation_counts['molecular_consequences'] / num_rcvs

0.9549415959446177

In [21]:
# Our coverage of consequences
(num_rcvs - missing_conseq) / num_rcvs

0.9910533839083395
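
The two ratios above can be computed side by side for every annotation type. This is a minimal sketch using the counts reported earlier in this notebook; the helper name `coverage` is hypothetical and not part of the pipeline.

```python
from collections import Counter

# Counts copied from the cells above (January 2023 ClinVar XML).
num_rcvs = 2302323
missing_conseq = 20598
annotation_counts = Counter({
    'molecular_consequences': 2198584,
    'has_gene': 2285631,
    'functional_consequences': 16091,
    'none': 16652,
})


def coverage(count, total):
    """Fraction of records (between 0 and 1) that carry a given annotation."""
    return count / total


# ClinVar's coverage per annotation type, and ours for consequences
clinvar_coverage = {k: coverage(v, num_rcvs) for k, v in annotation_counts.items()}
our_coverage = coverage(num_rcvs - missing_conseq, num_rcvs)

for name, frac in clinvar_coverage.items():
    print(f'{name: <24} {frac:.2%}')
print(f'{"ours (consequences)": <24} {our_coverage:.2%}')
```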

#### Notes on our processing

* we use the HGNC ID when we query BioMart (repeat expansion pipeline) but not for structural or SNP consequences
* we definitely need to do our own processing here even if ClinVar's coverage were 100%, because our downstream usage requires Ensembl gene IDs
* see also [this (brief) discussion](https://github.com/EBIvariation/eva-opentargets/issues/189#issuecomment-786486333)


For ontologies:
* if ClinVar's term is obsolete, ok
* if not, ours need to be "better" (EFO, more specific/accurate, etc.); do this later

For consequences:
* can also check if obsolete
* if not, compare our annotations
    * e.g. sample 1000 records, take the molecular consequences from ClinVar and the functional consequences from our pipeline, and visually inspect differences in precision, etc.

### SO Terms analysis

In [5]:
term_counts = Counter()
for record in clinvar:
    if record.measure:
        so_elts = find_elements(record.measure.measure_xml, './AttributeSet/XRef[@DB="Sequence Ontology"]')
        # Count every SO cross-reference, so one record can contribute several terms
        term_counts.update(elt.attrib['ID'] for elt in so_elts)

In [7]:
term_counts

Counter({'SO:0001583': 6607031,
         'SO:0001623': 231681,
         'SO:0001589': 1583489,
         'SO:0001619': 420869,
         'SO:0001587': 821551,
         'SO:0001627': 1944452,
         'SO:0001574': 75734,
         'SO:0001575': 100717,
         'SO:0001822': 60273,
         'SO:0001819': 3028604,
         'SO:0001821': 22691,
         'SO:0001624': 333397,
         'SO:0002317': 9,
         'SO:0002054': 832,
         'SO:1000117': 86,
         'SO:1000064': 34,
         'SO:0001582': 10610,
         'SO:0001820': 74447,
         'SO:0002073': 4002,
         'SO:0001578': 4697,
         'SO:0001536': 403,
         'SO:1000071': 940,
         'SO:0001986': 457,
         'SO:0001555': 21,
         'SO:0001542': 2,
         'SO:0002153': 690,
         'SO:0002152': 168,
         'SO:0002220': 503,
         'SO:1000054': 243,
         'SO:0001559': 4,
         'SO:0001987': 93,
         'SO:1000072': 56,
         'SO:0001786': 6,
         'SO:0002218': 2038,
         'SO:1000

In [8]:
# Our processing includes these as well as repeat expansion terms
vep_terms = {'SO:0001893', 'SO:0001574', 'SO:0001575', 'SO:0001587', 'SO:0001589', 'SO:0001578', 'SO:0002012', 'SO:0001889', 'SO:0001821', 'SO:0001822', 'SO:0001583', 'SO:0001818', 'SO:0001630', 'SO:0001787', 'SO:0002170', 'SO:0002169', 'SO:0001626', 'SO:0002019', 'SO:0001567', 'SO:0001819', 'SO:0001580', 'SO:0001620', 'SO:0001623', 'SO:0001624', 'SO:0001792', 'SO:0001627', 'SO:0001621', 'SO:0001619', 'SO:0001631', 'SO:0001632', 'SO:0001895', 'SO:0001892', 'SO:0001782', 'SO:0001894', 'SO:0001891', 'SO:0001907', 'SO:0001566', 'SO:0001906', 'SO:0001628'}

In [9]:
cv_terms = set(term_counts.keys())

In [10]:
len(vep_terms)

39

In [11]:
len(cv_terms)

44

In [70]:
cv_terms - vep_terms

{'SO:0000683',
 'SO:0001536',
 'SO:0001541',
 'SO:0001542',
 'SO:0001555',
 'SO:0001559',
 'SO:0001561',
 'SO:0001565',
 'SO:0001582',
 'SO:0001786',
 'SO:0001820',
 'SO:0001986',
 'SO:0001987',
 'SO:0002052',
 'SO:0002053',
 'SO:0002054',
 'SO:0002073',
 'SO:0002152',
 'SO:0002153',
 'SO:0002218',
 'SO:0002219',
 'SO:0002220',
 'SO:0002316',
 'SO:0002317',
 'SO:1000054',
 'SO:1000064',
 'SO:1000070',
 'SO:1000071',
 'SO:1000072',
 'SO:1000117',
 'SO:1000184'}
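
The set difference shows which ClinVar terms VEP never emits, but not how often those terms occur. A sketch of weighting the difference by occurrence follows. The counts are a small subset copied from the `term_counts` output above (that output is truncated), so the resulting fraction is illustrative only, and `uncovered_share` is a hypothetical helper.

```python
from collections import Counter


def uncovered_share(term_counts, covered_terms):
    """Fraction of term occurrences whose term is not in covered_terms."""
    total = sum(term_counts.values())
    if total == 0:
        return 0.0
    uncovered = sum(n for term, n in term_counts.items() if term not in covered_terms)
    return uncovered / total


# Subset of the counts shown above, for illustration only.
example_counts = Counter({
    'SO:0001583': 6607031,  # in vep_terms
    'SO:0001627': 1944452,  # in vep_terms
    'SO:0001820': 74447,    # not in vep_terms
    'SO:0001582': 10610,    # not in vep_terms
})
example_vep_terms = {'SO:0001583', 'SO:0001627'}

print(f'{uncovered_share(example_counts, example_vep_terms):.4%}')
```

In the notebook itself the call would be `uncovered_share(term_counts, vep_terms)`.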

In [15]:
CURRENT = 'current'
OBSOLETE = 'obsolete'


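def is_current(term):
    """Return True if OLS knows the term and does not flag it as obsolete."""
    url = f'https://www.ebi.ac.uk/ols/api/terms?id={term}'
    response = requests.get(url)
    response.raise_for_status()
    try:
        data = response.json()['_embedded']['terms'][0]
        return not data['is_obsolete']
    except (KeyError, IndexError, ValueError):
        # Terms missing from the OLS response are reported and treated as not current
        print('Not found', term)
        return False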

In [16]:
# check obsolescence for clinvar (none of ours should be obsolete!)
clinvar_obsolete = Counter()
for term in term_counts:
    if is_current(term):
        clinvar_obsolete[CURRENT] += term_counts[term]
    else:
        clinvar_obsolete[OBSOLETE] += term_counts[term]

In [17]:
clinvar_obsolete

Counter({'current': 15334503, 'obsolete': 1367})

Notes:

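- very few ClinVar term occurrences are obsolete (1367 of about 15.3 million), as we'd expect; terms not found in OLS are also counted as obsolete here
- we assume none of the VEP terms are obsolete (not checked here)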
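
The assumption about VEP terms could be checked the same way as the ClinVar terms. In this minimal sketch, `find_obsolete` is a hypothetical helper that takes the lookup function as a parameter, so that `is_current` from above can be passed in. The stub lookup and its made-up obsolete accession exist only to exercise the helper without network access.

```python
def find_obsolete(terms, is_current_fn):
    """Return the sorted list of terms for which is_current_fn is False."""
    return sorted(t for t in terms if not is_current_fn(t))


# Stub lookup for demonstration; in the notebook, pass is_current instead.
# The accession below is invented purely to exercise the helper.
fake_obsolete = {'SO:9999999'}


def stub_is_current(term):
    return term not in fake_obsolete


print(find_obsolete({'SO:0001583', 'SO:9999999'}, stub_is_current))
```

In the notebook this would be `find_obsolete(vep_terms, is_current)`, where an empty list would confirm the assumption.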

### Annotations comparison

In [25]:
from eva_cttv_pipeline.evidence_string_generation.clinvar_to_evidence_strings import get_consequence_types
from eva_cttv_pipeline.evidence_string_generation.consequence_type import process_consequence_type_file

In [19]:
sample_xml = '/home/april/projects/opentargets/sample.xml.gz'

In [24]:
# Sample ~1000 random records from current clinvar
filter_xml(
    input_xml=clinvar_xml,
    output_xml=sample_xml,
    filter_fct=lambda _: random() < (1000/2000000),
    max_num=2000
)

INFO:filter_clinvar_xml:Records written: 1144
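
The filter keeps each record independently with probability 1000/2000000, so the sample size follows a binomial distribution. A quick sanity check that 1144 records is in line with expectation, assuming `filter_xml` iterates over the same `num_rcvs` records counted in the coverage section:

```python
import math

num_rcvs = 2302323   # total RCV records, from the coverage section
p = 1000 / 2000000   # per-record keep probability used in filter_fct
observed = 1144      # "Records written" reported by filter_xml

# Mean and standard deviation of a binomial(num_rcvs, p) sample size
expected = num_rcvs * p
std_dev = math.sqrt(num_rcvs * p * (1 - p))
z = (observed - expected) / std_dev

print(f'expected {expected:.1f}, std dev {std_dev:.1f}, z-score {z:.2f}')
```

The expected size is roughly 1151 with a standard deviation of about 34, so the observed 1144 is well within one standard deviation.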


In [46]:
sample_clinvar = ClinVarDataset(sample_xml)

In [27]:
variant_to_gene_mappings = process_consequence_type_file('/home/april/projects/opentargets/consequences_combined.tsv')

INFO:eva_cttv_pipeline.evidence_string_generation:Loading mapping rs -> ENSG/SOterms
INFO:eva_cttv_pipeline.evidence_string_generation:1577073 rs->ENSG/SOterms mappings loaded


In [52]:
def get_clinvar_consequences(measure):
    # Maps SO accession -> consequence text; only SO terms here
    results = {}
    for attr_elt in find_elements(measure.measure_xml, './AttributeSet'):
        for conseq_elt in find_elements(attr_elt, './Attribute[@Type="MolecularConsequence"]'):
            conseq_text = conseq_elt.text
            so_elt = find_mandatory_unique_element(attr_elt, './XRef[@DB="Sequence Ontology"]')
            so_term = so_elt.attrib['ID']
            results[so_term] = conseq_text
    return results

In [57]:
for record in sample_clinvar:
    if record.measure:
        clinvar_consequences = get_clinvar_consequences(record.measure)
        our_consequences = get_consequence_types(record.measure, variant_to_gene_mappings)
        our_consequences = {
            c.so_term.accession.replace('_', ':'): c.so_term.so_name.replace('_', ' ')
            for c in our_consequences
        }

        if clinvar_consequences.keys() != our_consequences.keys():
            print(record.accession)
            print('ClinVar:', clinvar_consequences)
            print('Ours:   ', our_consequences)
            print('=============')

RCV000050183
ClinVar: {'SO:0001583': 'missense variant', 'SO:0001619': 'non-coding transcript variant'}
Ours:    {'SO:0001583': 'missense variant'}
RCV000161803
ClinVar: {}
Ours:    {'SO:0001627': 'intron variant'}
RCV000250371
ClinVar: {'SO:0001627': 'intron variant'}
Ours:    {'SO:0002169': 'splice polypyrimidine tract variant'}
RCV000303006
ClinVar: {'SO:0001624': '3 prime UTR variant', 'SO:0001619': 'non-coding transcript variant'}
Ours:    {'SO:0001624': '3 prime UTR variant'}
RCV000309958
ClinVar: {'SO:0001623': '5 prime UTR variant', 'SO:0001619': 'non-coding transcript variant'}
Ours:    {'SO:0001623': '5 prime UTR variant'}
RCV000310470
ClinVar: {'SO:0001623': '5 prime UTR variant', 'SO:0001627': 'intron variant'}
Ours:    {'SO:0001623': '5 prime UTR variant'}
RCV000330816
ClinVar: {'SO:0001624': '3 prime UTR variant', 'SO:0001619': 'non-coding transcript variant'}
Ours:    {'SO:0001627': 'intron variant', 'SO:0001624': '3 prime UTR variant'}
RCV000407652
ClinVar: {'SO:0001627

RCV000552047
ClinVar: {'SO:0001627': 'intron variant'}
Ours:    {'SO:0002169': 'splice polypyrimidine tract variant'}
RCV000624821
ClinVar: {'SO:0001583': 'missense variant', 'SO:0001619': 'non-coding transcript variant'}
Ours:    {'SO:0001583': 'missense variant'}
RCV000806155
ClinVar: {'SO:0001627': 'intron variant'}
Ours:    {'SO:0002170': 'splice donor region variant'}
RCV001168107
ClinVar: {'SO:0001624': '3 prime UTR variant', 'SO:0001619': 'non-coding transcript variant'}
Ours:    {'SO:0001624': '3 prime UTR variant'}
RCV001248688
ClinVar: {'SO:0001583': 'missense variant', 'SO:0001619': 'non-coding transcript variant'}
Ours:    {'SO:0001583': 'missense variant'}
RCV001305520
ClinVar: {'SO:0001627': 'intron variant', 'SO:0001583': 'missense variant'}
Ours:    {'SO:0001583': 'missense variant'}
RCV001493744
ClinVar: {'SO:0001619': 'non-coding transcript variant', 'SO:0001819': 'synonymous variant'}
Ours:    {'SO:0001819': 'synonymous variant'}
RCV001712884
ClinVar: {'SO:0001627': 

RCV000047838
ClinVar: {'SO:0001589': 'frameshift variant', 'SO:0001627': 'intron variant', 'SO:0001619': 'non-coding transcript variant'}
Ours:    {'SO:0001589': 'frameshift variant'}
RCV000196528
ClinVar: {}
Ours:    {'SO:0001631': 'upstream gene variant'}
RCV000345033
ClinVar: {}
Ours:    {'SO:0001623': '5 prime UTR variant'}
RCV000355199
ClinVar: {}
Ours:    {'SO:0001631': 'upstream gene variant'}
RCV000355884
ClinVar: {'SO:0001627': 'intron variant', 'SO:0001583': 'missense variant'}
Ours:    {'SO:0001583': 'missense variant'}
RCV000359974
ClinVar: {}
Ours:    {'SO:0001623': '5 prime UTR variant'}
RCV000369616
ClinVar: {'SO:0001582': 'initiatior codon variant', 'SO:0001583': 'missense variant'}
Ours:    {'SO:0002012': 'start lost'}
RCV000389845
ClinVar: {'SO:0001624': '3 prime UTR variant', 'SO:0001627': 'intron variant'}
Ours:    {'SO:0001624': '3 prime UTR variant'}
RCV000412601
ClinVar: {'SO:0001589': 'frameshift variant', 'SO:0001627': 'intron variant', 'SO:0001619': 'non-codin

#### More notes:

* in ClinVar, consequences are associated with HGVS expressions as well, e.g. [RCV000050183](https://www.ncbi.nlm.nih.gov/clinvar/RCV000050183/)
* our consequences are already filtered to include only the most severe
    * ClinVar may apply some filtering as well, but we don't know; if anything it generally seems to include more, roughly one consequence per HGVS expression (for a particular sequence identifier)
* found at least one consequence from [Variation Ontology](http://www.variationontology.org/) rather than Sequence Ontology: [RCV000196528](https://www.ncbi.nlm.nih.gov/clinvar/RCV000196528/)
* ClinVar uses "nonsense" instead of "stop gained", but these are apparently [synonyms for the same SO term](http://sequenceontology.org/browser/current_release/term/SO:0001587)
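
The printed differences fall into a few recurring patterns: ClinVar lists extra terms, ClinVar has no terms at all, or the two sources give different terms. The following sketch tallies these patterns instead of relying on inspection by eye. `classify_difference` and its category names are hypothetical, and the example pairs are copied from the output above.

```python
from collections import Counter


def classify_difference(clinvar_terms, our_terms):
    """Label how two sets of SO accessions relate to each other."""
    clinvar_terms, our_terms = set(clinvar_terms), set(our_terms)
    if clinvar_terms == our_terms:
        return 'same'
    if not clinvar_terms:
        return 'clinvar_empty'
    if not our_terms:
        return 'ours_empty'
    if our_terms < clinvar_terms:
        return 'clinvar_superset'
    if clinvar_terms < our_terms:
        return 'ours_superset'
    if clinvar_terms & our_terms:
        return 'partial_overlap'
    return 'disjoint'


# (ClinVar terms, our terms) pairs taken from the output above.
examples = {
    'RCV000050183': ({'SO:0001583', 'SO:0001619'}, {'SO:0001583'}),
    'RCV000161803': (set(), {'SO:0001627'}),
    'RCV000250371': ({'SO:0001627'}, {'SO:0002169'}),
    'RCV000330816': ({'SO:0001624', 'SO:0001619'}, {'SO:0001627', 'SO:0001624'}),
}

tally = Counter(classify_difference(cv, ours) for cv, ours in examples.values())
for category, count in tally.items():
    print(category, count)
```

Applied inside the comparison loop, the function would take `clinvar_consequences.keys()` and `our_consequences.keys()` as its two arguments.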