# MaveDB Mapping Method Analysis
This notebook assesses the performance of the mapping method by comparing the reference sequences of pre-mapped and post-mapped VRS objects. It also computes the number of unique VRS alleles generated across the examined score sets.

## Load Relevant Libraries
Run the cell below to load the libraries used in the analysis.

In [1]:
import json

import pandas as pd
from Bio.Seq import Seq
from cool_seq_tool.schemas import Strand

## Load List of Examined Score Sets

In [2]:
mave_metadata_dat = pd.read_csv("analysis_files/mave_dat.csv", index_col=[0])
score_sets = mave_metadata_dat["urn"].to_list()[:-5]

In [3]:
import pickle
from pathlib import Path

from dcd_mapping.schemas import AlignmentResult

# Open the pickled BLAT output through a Path instance and load it
with Path("analysis_files/mave_blat_output.pickle").open("rb") as f:
    mave_blat_temp = pickle.load(f)
align_results = {}
for scoreset in score_sets:
    if scoreset != "urn:mavedb:00000105-a-1":
        align_results[scoreset] = AlignmentResult(**mave_blat_temp[scoreset])

## Determine the MaveDB IDs of Variants with Reference Mismatches
Run the cell below to generate a dictionary listing the MaveDB IDs of variants with reference mismatches. The dictionary is keyed by score set URN.

In [4]:
diff_vars_dict = {}
var_count = 0
counter = 0
var_count_dict_new = {}

for key in score_sets:
    if key != "urn:mavedb:00000072-a-1" and key != "urn:mavedb:00000105-a-1": # No mapping for these score sets
        # Use a context manager so the mapping file is closed after loading
        with open(f"analysis_files/mappings/{key[11::]}.json") as f:
            dat = json.load(f)
        seq_type = dat["computed_reference_sequence"]["sequence_type"]
        dat = dat["mapped_scores"]

        diff_vars = []
        strand = align_results[key].strand

        for j in range(len(dat)):
            if "members" not in dat[j]["pre_mapped"]:
                var_count += 1
                seq_pre = dat[j]["pre_mapped"]["vrs_ref_allele_seq"]
                seq_post = dat[j]["post_mapped"]["vrs_ref_allele_seq"]
                seq_pre_rv = str(Seq(seq_pre).reverse_complement())

                if seq_type == "protein":
                    if seq_pre != seq_post:
                        diff_vars.append(j)
                else:
                    if strand == Strand.POSITIVE:
                        if seq_pre != seq_post:
                            diff_vars.append(j)
                    else:
                        if seq_post != seq_pre_rv:
                            diff_vars.append(j)

            else:
                for k in range(len(dat[j]["pre_mapped"]["members"])):
                    var_count += 1
                    seq_pre = dat[j]["pre_mapped"]["members"][k]["vrs_ref_allele_seq"]
                    seq_post = dat[j]["post_mapped"]["members"][k]["vrs_ref_allele_seq"]
                    seq_pre_rv = str(Seq(seq_pre).reverse_complement())

                    if seq_type == "protein":
                        if seq_pre != seq_post:
                            diff_vars.append(j)
                    else:
                        if strand == Strand.POSITIVE:
                            if seq_pre != seq_post:
                                diff_vars.append(j)
                        else:
                            if seq_post != seq_pre_rv:
                                diff_vars.append(j)

        diff_vars_dict[key] = diff_vars
        var_count_dict_new[key] = var_count
f"The number of examined variant pairs is: {var_count}"

'The number of examined variant pairs is: 2499044'
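
The comparison rule applied in the cell above can be summarized as a single predicate: protein sequences and positive-strand nucleotide sequences are compared directly, while negative-strand nucleotide sequences are compared against the reverse complement of the pre-mapped sequence. The sketch below illustrates that rule on toy strings. The helper names `reverse_complement` and `is_reference_mismatch` are hypothetical and do not come from the notebook or from `dcd_mapping`, and a plain-Python reverse complement stands in for `Bio.Seq` so the example has no dependencies.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part
# of the notebook, Biopython, or dcd_mapping.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a plain DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_reference_mismatch(seq_pre: str, seq_post: str, seq_type: str,
                          on_positive_strand: bool) -> bool:
    """Apply the same comparison rule as the notebook cell.

    Protein sequences and positive-strand alignments are compared directly;
    negative-strand alignments compare the post-mapped sequence against the
    reverse complement of the pre-mapped sequence.
    """
    if seq_type == "protein" or on_positive_strand:
        return seq_pre != seq_post
    return seq_post != reverse_complement(seq_pre)


# Toy inputs (not taken from any score set)
no_mismatch = is_reference_mismatch("TG", "CA", "dna", on_positive_strand=False)
mismatch = is_reference_mismatch("TG", "TG", "dna", on_positive_strand=False)
print(no_mismatch, mismatch)
```

Factoring the rule into one function would also remove the duplicated branch logic between the single-allele case and the `members` case.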

### Inspect the Mismatch Dictionary
Uncomment the line in the cell below to examine the output of the cell above.

In [5]:
#diff_vars_dict

### Examine Example Mismatch
Run the cell below to view an example of a reference mismatch caused by a MAVE variant that spans an alignment block.

In [6]:
with open("analysis_files/mappings/00000005-a-6.json") as f:
    dat = json.load(f)
dat["mapped_scores"][5601]

{'pre_mapped': {'id': 'ga4gh:VA.mmZEpS0H-V-dGn4n8V1uPWy18MY18PrR',
  'type': 'VariationDescriptor',
  'variation': {'id': 'ga4gh:VA.mmZEpS0H-V-dGn4n8V1uPWy18MY18PrR',
   'type': 'Allele',
   'location': {'id': None,
    'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.sVMC1jmTXRvuzBCDJ8aoBmZ_Uu35YFj7',
    'interval': {'type': 'SequenceInterval',
     'start': {'type': 'Number', 'value': 316},
     'end': {'type': 'Number', 'value': 318}}},
   'state': {'type': 'LiteralSequenceExpression', 'sequence': 'CT'}},
  'vrs_ref_allele_seq': 'TG'},
 'post_mapped': {'id': 'ga4gh:VA.th7X__sHHeffw9AGpRyCUfg4jN-hM2mE',
  'type': 'VariationDescriptor',
  'variation': {'id': 'ga4gh:VA.th7X__sHHeffw9AGpRyCUfg4jN-hM2mE',
   'type': 'Allele',
   'location': {'id': None,
    'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.5ZUqxCmDDgN4xTRbaSjN8LwgZironmB8',
    'interval': {'type': 'SequenceInterval',
     'start': {'type': 'Number', 'value': 43068507},
     'end': {'type': 'Number', 'val

## Compute the Proportion of Reference Mismatches

The cell below computes the proportion of reference mismatches among all MAVE variants that have been mapped.

In [7]:
mm_count = 0
for key in diff_vars_dict:
    mm_count = mm_count + len(diff_vars_dict[key])
f"There are {mm_count} instances of reference mismatch. This corresponds to a percentage of {round(100*mm_count/var_count,3)} ({mm_count}/{var_count})"

'There are 34832 instances of reference mismatch. This corresponds to a percentage of 1.394 (34832/2499044)'
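
As a cross-check on the arithmetic, the same calculation can be written as a small function and applied both to a toy dictionary and to the two totals reported above. This is only a sketch: `mismatch_percentage` and the toy URNs are hypothetical names introduced for illustration.

```python
def mismatch_percentage(diff_vars: dict, total_variants: int) -> float:
    """Percentage of variants flagged as mismatches, rounded to 3 decimals.

    `diff_vars` maps a score set URN to a list of mismatching indices,
    mirroring the shape of diff_vars_dict in the notebook.
    """
    mismatches = sum(len(indices) for indices in diff_vars.values())
    return round(100 * mismatches / total_variants, 3)


# Toy data (hypothetical URNs and counts)
toy_diff_vars = {"urn:mavedb:example-1": [0, 4, 7], "urn:mavedb:example-2": []}
toy_percentage = mismatch_percentage(toy_diff_vars, total_variants=200)

# Recomputing the figure from the totals reported in the notebook output
reported_percentage = round(100 * 34832 / 2499044, 3)
print(toy_percentage, reported_percentage)
```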

## Compute the Number of Unique Pre-Mapped and Post-Mapped MAVE Variants

The cell below computes the number of unique pre-mapped and post-mapped MAVE variants that have been processed using VRS.

In [8]:
# Determine total number of paired VRS alleles in data set
allele_count = 0
var_count = 0
allele_ac_list_pre = []
allele_ac_list_post = []
for key in score_sets:
    if key != "urn:mavedb:00000072-a-1" and key != "urn:mavedb:00000105-a-1":
        # Use a context manager so the mapping file is closed after loading
        with open(f"analysis_files/mappings/{key[11::]}.json") as f:
            dat = json.load(f)
        dat = dat["mapped_scores"]

        for j in range(len(dat)):
            var_count += 1
            if "members" not in dat[j]["post_mapped"]:
                allele_count += 1
                allele_ac_list_pre.append(dat[j]["pre_mapped"]["id"])
                allele_ac_list_post.append(dat[j]["post_mapped"]["id"])
            else:
                for k in range(len(dat[j]["post_mapped"]["members"])):
                    allele_count += 1
                    allele_ac_list_pre.append(dat[j]["pre_mapped"]["members"][k]["id"])
                    allele_ac_list_post.append(dat[j]["post_mapped"]["members"][k]["id"])


f"The number of unique pre-mapped MAVE variants is {len(set(allele_ac_list_pre))}. The number of unique post-mapped MAVE variants is {len(set(allele_ac_list_post))}."

'The number of unique pre-mapped MAVE variants is 363294. The number of unique post-mapped MAVE variants is 349972.'
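
The counting step above relies on two ideas: flattening each mapped score into one or more (pre-mapped ID, post-mapped ID) pairs, depending on whether a `members` list is present, and then de-duplicating the IDs with sets. The sketch below shows both on toy records. `iter_allele_id_pairs` is a hypothetical helper, the IDs are placeholders, and the sketch assumes that the pre-mapped and post-mapped `members` lists have the same length and order.

```python
def iter_allele_id_pairs(record: dict):
    """Yield (pre_id, post_id) for a single allele or for each member.

    Hypothetical helper; assumes the pre- and post-mapped `members` lists
    are aligned one-to-one.
    """
    pre, post = record["pre_mapped"], record["post_mapped"]
    if "members" in post:
        for pre_member, post_member in zip(pre["members"], post["members"]):
            yield pre_member["id"], post_member["id"]
    else:
        yield pre["id"], post["id"]


# Toy records with placeholder IDs (not real VRS identifiers)
toy_records = [
    {"pre_mapped": {"id": "pre-A"}, "post_mapped": {"id": "post-X"}},
    {"pre_mapped": {"id": "pre-B"}, "post_mapped": {"id": "post-X"}},
    {
        "pre_mapped": {"members": [{"id": "pre-A"}, {"id": "pre-C"}]},
        "post_mapped": {"members": [{"id": "post-X"}, {"id": "post-Y"}]},
    },
]

unique_pre, unique_post = set(), set()
for record in toy_records:
    for pre_id, post_id in iter_allele_id_pairs(record):
        unique_pre.add(pre_id)
        unique_post.add(post_id)

print(len(unique_pre), len(unique_post))
```

Adding IDs to sets as they are read avoids building the full lists first, which may matter when the number of allele pairs runs into the millions.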

## Generate VRS Allele ID Dictionary
Run the cell below to generate a VRS allele ID dictionary for the MAVE variants. The dictionary is keyed by post-mapped VRS allele ID, and each value is a list of the distinct pre-mapped VRS allele IDs that map to that key.

In [9]:
allele_list_dict = {}
for key in score_sets:
    if key != "urn:mavedb:00000072-a-1" and key != "urn:mavedb:00000105-a-1":
        # Use a context manager so the mapping file is closed after loading
        with open(f"analysis_files/mappings/{key[11::]}.json") as f:
            dat = json.load(f)
        dat = dat["mapped_scores"]

        for j in range(len(dat)):
            if "members" not in dat[j]["post_mapped"]:
                va = dat[j]["post_mapped"]["id"]
                if va not in allele_list_dict:
                    allele_list_dict[va] = [dat[j]["pre_mapped"]["id"]]
                else:
                    if dat[j]["pre_mapped"]["id"] in allele_list_dict[va]:
                        continue
                    else:
                        tmp = allele_list_dict[va]
                        tmp.append(dat[j]["pre_mapped"]["id"])
                        allele_list_dict[va] = tmp
            else:
                for k in range(len(dat[j]["post_mapped"]["members"])):
                    va = dat[j]["post_mapped"]["members"][k]["id"]
                    if va not in allele_list_dict:
                        allele_list_dict[va] = [dat[j]["pre_mapped"]["members"][k]["id"]]
                    else:
                        if dat[j]["pre_mapped"]["members"][k]["id"] in allele_list_dict[va]:
                            continue
                        tmp = allele_list_dict[va]
                        tmp.append(dat[j]["pre_mapped"]["members"][k]["id"])
                        allele_list_dict[va] = tmp
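
A more compact way to build the same kind of mapping is `collections.defaultdict`, which removes the explicit "key not yet present" branch. The sketch below is an illustration on toy (pre-mapped ID, post-mapped ID) pairs rather than the notebook's data; the variable names and IDs are hypothetical.

```python
from collections import defaultdict

# Toy (pre_id, post_id) pairs with placeholder IDs; the third pair repeats
# the first to show that duplicates are not stored twice.
pairs = [
    ("pre-A", "post-X"),
    ("pre-B", "post-X"),
    ("pre-A", "post-X"),
    ("pre-C", "post-Y"),
]

# Keyed by post-mapped ID; each value lists the distinct pre-mapped IDs
post_to_pre = defaultdict(list)
for pre_id, post_id in pairs:
    if pre_id not in post_to_pre[post_id]:
        post_to_pre[post_id].append(pre_id)

# Count post-mapped IDs that have two or more pre-mapped IDs
multi_mapped = sum(1 for pre_ids in post_to_pre.values() if len(pre_ids) > 1)
print(dict(post_to_pre), multi_mapped)
```

Lists keep the insertion order of the pre-mapped IDs. If order is not needed, a `defaultdict(set)` would make the membership check unnecessary.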

### Summary Statistics for VRS Allele ID Dictionary

In [10]:
count = 0
for key in allele_list_dict:
    if len(allele_list_dict[key]) > 1:
        count += 1
f"There are {len(allele_list_dict)} post-mapped MAVE variants in the dictionary. {count} post-mapped MAVE variants have 2 or more corresponding pre-mapped MAVE variants."

'There are 349972 post-mapped MAVE variants in the dictionary. 9553 post-mapped MAVE variants have 2 or more corresponding pre-mapped MAVE variants.'