In [1]:
import sys

sys.path.append('..')

In [2]:
from filter_clinvar_xml import filter_xml, pprint, iterate_cvs_from_xml

In [7]:
import gzip
import os
import pandas as pd
from cmat.clinvar_xml_io import *
from cmat.clinvar_xml_io.xml_parsing import *

In [6]:
work_dir = '/home/april/projects/opentargets/sept-investigation'
july_clinvar_xml = os.path.join(work_dir, 'ClinVarRCVRelease_2024-07.xml.gz')
missing_rcvs_tsv = os.path.join(work_dir, 'missing_rcvs.tsv.gz')
missing_rcvs_output = os.path.join(work_dir, 'missing_rcvs_still_in_clinvar.xml.gz')

In [10]:
missing_rcvs = set()

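with gzip.open(missing_rcvs_tsv, 'rt') as fh:
    # skip header
    next(fh)
    for line in fh:
        # lines keep their trailing newline, so strip before testing for blank lines
        if line.strip():
            # the RCV accession is the first tab-separated column
            missing_rcvs.add(line.split('\t')[0].strip())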

In [11]:
len(missing_rcvs)

359701

In [13]:
filter_xml(
    input_xml=july_clinvar_xml,
    output_xml=missing_rcvs_output,
    filter_fct=lambda r: r.accession in missing_rcvs,
    max_num=len(missing_rcvs)
)

INFO:filter_clinvar_xml:Records written: 106


In [17]:
from cmat.clinvar_xml_io.clinical_classification import MultipleClinicalClassificationsError

In [21]:
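# Check that these all have clinical significance "no classifications from unflagged records"
# or are dropped for multiple classifications.
# Any record printed below is an exception that neither reason explains.
for record in ClinVarDataset(missing_rcvs_output):
    try:
        # every clinical significance is valid, so the record was dropped for some other reason
        if record.clinical_significance_list == record.valid_clinical_significances:
            pprint(record.rcv)
    except MultipleClinicalClassificationsError:
        # multiple classifications are not yet supported, which explains the drop
        continue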

<ReferenceClinVarAssertion ID="3959282" DateLastUpdated="2024-06-23" DateCreated="2021-11-20">
    <ClinVarAccession Acc="RCV001779975" Version="2" Type="RCV" DateUpdated="2024-06-23" DateCreated="2021-11-20" />
    <RecordStatus>current</RecordStatus>
    <Classifications>
      <GermlineClassification>
        <ReviewStatus>no assertion criteria provided</ReviewStatus>
        <Description SubmissionCount="1">Pathogenic</Description>
      </GermlineClassification>
    </Classifications>
    <Assertion Type="variation to disease" />
    <ObservedIn>
      <Sample>
        <Origin>germline</Origin>
        <Species TaxonomyId="9606">human</Species>
        <AffectedStatus>yes</AffectedStatus>
      </Sample>
      <Method>
        <MethodType>research</MethodType>
      </Method>
      <ObservedData ID="158753121">
        <Attribute integerValue="1" Type="VariantChromosomes" />
      </ObservedData>
    </ObservedIn>
    <MeasureSet Type="Variant" ID="1321891" Acc="VCV001321891" Vers

This one is dropped because it covers too many genes:
```
evidence_string_generation_2742194.err:WARNING:root:Skipping variant NC_000001.11:g.173888460_174138926del with 4 target genes
```
Note that it was also missing from the past 2 submissions, so it's unclear why it's in this list.

Summary of the list from Irene:
* Out of **359,701** missing RCVs
  * All but **106** were removed entirely from the ClinVar XML
    * Of these 106, all but **1** have been modified so they were dropped with good reason (either a submission was flagged for insufficient evidence, or multiple classifications (somatic/oncogenic) were added, which are not yet supported)
      * The 1 exception is dropped for being a large structural variant (too many target genes) and AFAICT was missing in March and June as well, so not sure how it ended up in the list

From this, we can see that ClinVar deleted (at least) 359701 - 106 = **359,595** RCVs but had a net gain of **214,802** RCVs (see [metrics](https://docs.google.com/spreadsheets/d/1g_4tHNWP4VIikH7Jb0ui5aNr0PiFgvscZYOe69g191k/edit?usp=sharing)). Since net gain = additions - deletions, at least 359,595 + 214,802 = **574,397** RCVs were added.
These counts are consistent with the assumption that:
* the deletions include the **225,537** RCVs that previously produced evidence strings ("done" in the metrics spreadsheet)
* the additions include the **440,301** new RCVs that are dropped for having invalid trait names ("fatal")
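As a quick sanity check, the arithmetic behind these bounds can be reproduced in a few lines. This is only a sketch using the figures quoted above; the variable names are made up for illustration and are not part of the pipeline or the metrics spreadsheet.

```python
# Figures quoted in the summary above (variable names are illustrative only)
missing_rcvs = 359_701        # RCVs in the list of missing records
still_in_clinvar = 106        # of those, still present in the July ClinVar XML
net_gain = 214_802            # net change in RCV count, from the metrics spreadsheet
previously_done = 225_537     # RCVs that previously produced evidence strings ("done")
new_fatal = 440_301           # new RCVs dropped for invalid trait names ("fatal")

# Lower bound on deletions: missing RCVs that are no longer in the XML at all
deleted_min = missing_rcvs - still_in_clinvar

# net_gain = added - deleted, so added = net_gain + deleted
added_min = net_gain + deleted_min

# The assumptions hold only if each subset fits within its bound
deletions_consistent = previously_done <= deleted_min
additions_consistent = new_fatal <= added_min

print(deleted_min, added_min, deletions_consistent, additions_consistent)
```

Both checks only show that the counts are compatible with the assumptions; they don't prove that the deleted RCVs are the same ones that previously produced evidence strings.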

So this can account for both the drop in the number of evidence strings and the increase in invalid trait names. In particular, there's no evidence that any RCV switched from having a valid to an invalid trait name.

This still leaves the question of why so many records are being added with non-specific trait names, but I think it confirms the pipeline is working as expected.