# MaveDB Mapping Method Analysis
This notebook assesses the performance of the mapping method by comparing the reference sequences of pre-mapped and post-mapped VRS objects. It also computes the number of unique VRS alleles generated across the examined score sets.

## Load Relevant Libraries
Run the cell below to load the libraries used in the analysis.

In [1]:
import pandas as pd
import json
from Bio.Seq import Seq

## Load List of Examined Score Sets

In [2]:
mave_metadata_dat = pd.read_csv('analysis_files/mave_dat.csv', index_col=[0])
score_sets = mave_metadata_dat["urn"].to_list()[:-5]

## Determine the MaveDB IDs of Variants with Reference Mismatches
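Run the cell below to generate a dictionary, keyed by score set URN, that lists the variants with reference mismatches. Each variant is recorded by its index in the score set's `mapped_scores` list. A variant counts as a mismatch when the post-mapped reference sequence equals neither the pre-mapped reference sequence nor its reverse complement.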

In [3]:
diff_vars_dict = {}
var_count = 0

for key in score_sets:
    if key != 'urn:mavedb:00000072-a-1' and key != 'urn:mavedb:00000105-a-1': # No mapping for these sequences
        # Use a context manager so the file handle is closed after loading
        with open(f'mappings/{key[11:]}.json') as f:
            dat = json.load(f)
        dat = dat['mapped_scores']

        diff_vars = []

        for j in range(len(dat)):
            if 'members' not in dat[j]['pre_mapped'].keys():
                var_count += 1
                seq_pre = dat[j]['pre_mapped']['vrs_ref_allele_seq']
                seq_post = dat[j]['post_mapped']['vrs_ref_allele_seq']
                seq_pre_rv = str(Seq(seq_pre).reverse_complement())

                if seq_pre != seq_post and seq_post != seq_pre_rv:
                    diff_vars.append(j)

            else:
                for k in range(len(dat[j]['pre_mapped']['members'])):
                    var_count += 1
                    seq_pre = dat[j]['pre_mapped']['members'][k]['vrs_ref_allele_seq']
                    seq_post = dat[j]['post_mapped']['members'][k]['vrs_ref_allele_seq']
                    seq_pre_rv = str(Seq(seq_pre).reverse_complement())

                    if seq_pre != seq_post and seq_post != seq_pre_rv:
                        diff_vars.append(j)

        diff_vars_dict[key] = diff_vars  # Record once per score set, after all variants are checked
var_count

2499036
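
The comparison rule applied in the cell above can be isolated into a small helper. The sketch below is illustrative and not part of the original analysis: `reverse_complement` is a plain-Python stand-in for `Bio.Seq.Seq.reverse_complement` that handles only unambiguous DNA bases, and `is_reference_mismatch` is a hypothetical name.

```python
# Hypothetical helpers illustrating the mismatch rule; these names are not from the notebook.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Reverse-complement an unambiguous DNA string (stand-in for Bio.Seq)."""
    return seq.translate(_COMPLEMENT)[::-1]

def is_reference_mismatch(seq_pre: str, seq_post: str) -> bool:
    """True when seq_post equals neither seq_pre nor its reverse complement."""
    return seq_pre != seq_post and seq_post != reverse_complement(seq_pre)

print(is_reference_mismatch("ATG", "ATG"))  # same strand
print(is_reference_mismatch("ATG", "CAT"))  # opposite strand
print(is_reference_mismatch("ATG", "ATC"))  # differs on both strands
```

Only the third call reports a mismatch. This stand-in leaves any character other than A, C, G, or T unchanged, which may differ from how Biopython treats IUPAC ambiguity codes, so results for protein reference sequences should be checked against the notebook's own cell.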

### The cell below can be uncommented to examine the output of the above cell

In [4]:
#diff_vars_dict

### Examine Example Mismatch
Run the cell below to view an example of a reference mismatch. In this example, the pre-mapped reference amino acid is arginine (R) and the post-mapped reference amino acid is proline (P).

In [5]:
with open('mappings/00000068-b-1.json') as f:
    dat = json.load(f)
dat['mapped_scores'][6680]

{'pre_mapped': {'id': 'ga4gh:VA.1U4Pma7BqWjhhhFwUsnQ2Yty8aX7Bcj4',
  'type': 'VariationDescriptor',
  'variation': {'id': 'ga4gh:VA.1U4Pma7BqWjhhhFwUsnQ2Yty8aX7Bcj4',
   'type': 'Allele',
   'location': {'id': None,
    'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.JtEWOMSBOOCAxy6RBZNVl9NAKRb4t2iw',
    'interval': {'type': 'SequenceInterval',
     'start': {'type': 'Number', 'value': 71},
     'end': {'type': 'Number', 'value': 72}}},
   'state': {'type': 'LiteralSequenceExpression', 'sequence': '*'}},
  'vrs_ref_allele_seq': 'R'},
 'post_mapped': {'id': 'ga4gh:VA.d5iaRnWv2-QrP5Ay9ZDNbi3LGenWppF1',
  'type': 'VariationDescriptor',
  'extensions': [{'type': 'Extension',
    'name': 'transcript_accession',
    'value': {'transcript': 'NM_000546.6', 'status': 'MANE Select'}}],
  'variation': {'id': 'ga4gh:VA.d5iaRnWv2-QrP5Ay9ZDNbi3LGenWppF1',
   'type': 'Allele',
   'location': {'id': None,
    'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.KAxM06sYzBF6zFftFaYq9E_18w

## Compute the Proportion of Reference Mismatches

In [6]:
# Total number of mismatch instances across all score sets
mm_count = sum(len(diff_vars) for diff_vars in diff_vars_dict.values())
f"There are {mm_count} instances of reference mismatch. This corresponds to a percentage of {round(100*mm_count/var_count,3)} ({mm_count}/{var_count})"

'There are 24878 instances of reference mismatch. This corresponds to a percentage of 0.996 (24878/2499036)'
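
To see which score sets contribute most to the total, the per-set counts can be ranked alongside the overall percentage. The sketch below uses a toy dictionary with invented URNs and indices in place of `diff_vars_dict`; `mismatch_summary` is a hypothetical helper, and the numbers are not results from this analysis.

```python
# Toy stand-in for diff_vars_dict; the URNs and indices are invented for illustration.
example_diff_vars = {
    "urn:mavedb:example-a-1": [3, 7, 7],
    "urn:mavedb:example-b-1": [],
    "urn:mavedb:example-c-1": [12],
}

def mismatch_summary(diff_vars_by_urn, total_variants):
    """Return per-score-set mismatch counts (largest first) and the overall percentage."""
    counts = {urn: len(indices) for urn, indices in diff_vars_by_urn.items()}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    return ranked, round(100 * total / total_variants, 3)

ranked, pct = mismatch_summary(example_diff_vars, total_variants=400)
print(ranked)
print(pct)
```

With the real data, the same function could be called as `mismatch_summary(diff_vars_dict, var_count)`, provided `var_count` still holds the value from the mismatch cell.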

## Compute the Number of Unique Pre-Mapped and Post-Mapped VRS Alleles

In [7]:
# Determine total number of paired VRS alleles in data set
allele_count = 0
var_count = 0
allele_ac_list_pre = []
allele_ac_list_post = []
for key in score_sets:
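    if key != 'urn:mavedb:00000072-a-1' and key != 'urn:mavedb:00000105-a-1': # No mapping for these sequences
        # Use a context manager so the file handle is closed after loading
        with open(f'mappings/{key[11:]}.json') as f:
            dat = json.load(f)
        dat = dat['mapped_scores']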

        for j in range(len(dat)):
            var_count += 1
            if 'members' not in dat[j]['post_mapped'].keys():
                allele_count += 1
                allele_ac_list_pre.append(dat[j]['pre_mapped']['id'])
                allele_ac_list_post.append(dat[j]['post_mapped']['id'])
            else:
                for k in range(len(dat[j]['post_mapped']['members'])):
                    allele_count += 1
                    allele_ac_list_pre.append(dat[j]['pre_mapped']['members'][k]['id'])
                    allele_ac_list_post.append(dat[j]['post_mapped']['members'][k]['id'])


f"The number of unique pre-mapped VRS alleles is {len(set(allele_ac_list_pre))}. The number of unique post-mapped VRS alleles is {len(set(allele_ac_list_post))}."

'The number of unique pre-mapped VRS alleles is 363294. The number of unique post-mapped VRS alleles is 349972.'

## Generate VRS Allele ID Dictionary
Run the cell below to generate a VRS allele ID dictionary. The dictionary is keyed by post-mapped VRS allele ID, and each value is a list of the corresponding pre-mapped VRS allele IDs, with duplicates omitted.

In [8]:
allele_list_dict = {}
for key in score_sets:
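    if key != 'urn:mavedb:00000072-a-1' and key != 'urn:mavedb:00000105-a-1': # No mapping for these sequences
        # Use a context manager so the file handle is closed after loading
        with open(f'mappings/{key[11:]}.json') as f:
            dat = json.load(f)
        dat = dat['mapped_scores']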

        for j in range(len(dat)):
            if 'members' not in dat[j]['post_mapped'].keys():
                va = dat[j]['post_mapped']['id']
                if va not in allele_list_dict:
                    allele_list_dict[va] = [dat[j]['pre_mapped']['id']]
                elif dat[j]['pre_mapped']['id'] not in allele_list_dict[va]:
                    # Append only pre-mapped IDs not already recorded for this key
                    allele_list_dict[va].append(dat[j]['pre_mapped']['id'])
            else:
                for k in range(len(dat[j]['post_mapped']['members'])):
                    va = dat[j]['post_mapped']['members'][k]['id']
                    if va not in allele_list_dict:
                        allele_list_dict[va] = [dat[j]['pre_mapped']['members'][k]['id']]
                    elif dat[j]['pre_mapped']['members'][k]['id'] not in allele_list_dict[va]:
                        # Append only pre-mapped IDs not already recorded for this key
                        allele_list_dict[va].append(dat[j]['pre_mapped']['members'][k]['id'])

### Summary Statistics for VRS Allele ID Dictionary

In [9]:
count = 0
for key in allele_list_dict:
    if len(allele_list_dict[key]) > 1:
        count += 1
f"There are {len(allele_list_dict)} keys in the dictionary. {count} post-mapped VRS alleles map to 2 or more pre-mapped VRS alleles."

'There are 349972 keys in the dictionary. 9553 post-mapped VRS alleles map to 2 or more pre-mapped VRS alleles.'
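
The grouping and the many-to-one count can also be expressed with `collections.defaultdict` and sets. This is a minimal sketch on invented ID pairs, not the notebook's implementation: `example_pairs` and `group_pre_by_post` are hypothetical names, and real IDs have the form `ga4gh:VA.<digest>`.

```python
from collections import defaultdict

# Invented (pre_id, post_id) pairs used only to illustrate the grouping.
example_pairs = [
    ("pre:A", "post:X"),
    ("pre:B", "post:X"),
    ("pre:A", "post:X"),  # duplicate pair, absorbed by the set
    ("pre:C", "post:Y"),
]

def group_pre_by_post(pairs):
    """Map each post-mapped ID to the set of pre-mapped IDs that produce it."""
    grouped = defaultdict(set)
    for pre_id, post_id in pairs:
        grouped[post_id].add(pre_id)
    return grouped

grouped = group_pre_by_post(example_pairs)
# Count post-mapped IDs that correspond to two or more distinct pre-mapped IDs
many_to_one = sum(1 for pre_ids in grouped.values() if len(pre_ids) > 1)
print(len(grouped), many_to_one)
```

Using a set for each value avoids the repeated list membership checks, which may matter when a dictionary holds several hundred thousand keys.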