# MaveDB Mapping Analysis

This notebook demonstrates how data from score sets in [MaveDB](https://mavedb.org/) can be mapped to human reference sequences and represented using the [Variation Representation Specification (VRS)](https://vrs.ga4gh.org/) of the [Global Alliance for Genomics and Health (GA4GH)](https://www.ga4gh.org/), as described in "Mapping MAVE data for use in human genomics applications" (Arbesfeld et al.).

Each step of the mapping workflow is demonstrated under the relevant header and accompanied by example data pulled from MaveDB scoreset `urn:mavedb:00000041-a-1`. After each step, data is saved to a local [pickle](https://docs.python.org/3/library/pickle.html) checkpoint file for easy use in later steps.

## Setup

First, initialize environment parameters to enable access to required resources:

* Universal Transcript Archive (UTA): see [README](https://github.com/biocommons/uta?tab=readme-ov-file#installing-uta-locally) for setup instructions. Users with access to Docker on their local devices can use the available Docker image; otherwise, start a relatively recent (version 14+) PostgreSQL instance and add data from the `20210129b` database dump.
* SeqRepo: see [README](https://github.com/biocommons/biocommons.seqrepo?tab=readme-ov-file#requirements) for setup instructions. Experiments here were run using the `2024-02-20` snapshot.
  * Note that `dcd_map` requires writing to SeqRepo's sequence databases. This means the user must have write permissions on the data directory. See [here](https://github.com/biocommons/biocommons.seqrepo/blob/main/docs/store.rst) for more information if using locally-available data.
* Gene Normalizer: see [documentation](https://gene-normalizer.readthedocs.io/0.3.0-dev1/install.html) for installation instructions.
  * This notebook was run using Gene Normalizer PostgreSQL data checkpointed from 2024-05-29. To sync local data against this snapshot, follow [instructions for PostgreSQL setup](https://gene-normalizer.readthedocs.io/0.3.0-dev1/managing_data/postgresql.html#local-setup) and then use the `gene_norm_update_remote` command:

```shell
$ gene_norm_update_remote --data_url="https://vicc-normalizers.s3.us-east-2.amazonaws.com/gene_normalization/postgresql/gene_norm_20240529154335.sql.tar.gz"
```

* blat: Must be available on the local PATH and executable by the user. Otherwise, its location can be set manually with the `BLAT_BIN_PATH` environment variable. See the [UCSC Genome Browser FAQ](https://genome.ucsc.edu/FAQ/FAQblat.html#blat3) for download instructions. For our experiments, we placed the binary in the same directory as these notebooks.



In [1]:
import warnings
from os import environ
from pathlib import Path

from tqdm import tqdm
import pandas as pd

warnings.filterwarnings("ignore")

# Set external resources; adjust these values to match the location of your local data.
environ["GENE_NORM_DB_URL"] = "postgresql://postgres@localhost:5432/gene_normalizer"
environ["UTA_DB_URL"] = "postgresql://uta_admin:uta@localhost:5432/uta/uta_20210129b"
environ["BLAT_BIN_PATH"] = str(Path("blat").absolute())
environ["SEQREPO_ROOT_DIR"] = "/usr/local/share/seqrepo/2024-02-20" 
environ["DCD_MAPPING_RESOURCES_DIR"] = str(Path("./mavedb_files").absolute())
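
Before running the workflow, it can help to confirm that the filesystem-based settings point at something usable. The following is a minimal sketch and not part of `dcd_mapping`: the helper name `check_local_resources` is hypothetical, and it only inspects the blat binary and the SeqRepo directory (which, as noted above, must be writable). It does not contact UTA or the Gene Normalizer database.

```python
import os
from pathlib import Path
from typing import Mapping


def check_local_resources(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems found in filesystem-based settings.

    Hypothetical helper for illustration only.
    """
    problems = []

    blat = env.get("BLAT_BIN_PATH")
    if not blat or not Path(blat).is_file():
        problems.append("BLAT_BIN_PATH does not point to a file")
    elif not os.access(blat, os.X_OK):
        problems.append("blat binary is not executable by this user")

    seqrepo = env.get("SEQREPO_ROOT_DIR")
    if not seqrepo or not Path(seqrepo).is_dir():
        problems.append("SEQREPO_ROOT_DIR does not point to a directory")
    elif not os.access(seqrepo, os.W_OK):
        problems.append("SeqRepo directory is not writable by this user")

    return problems


# Usage: check the current process environment and report anything suspicious.
for problem in check_local_resources(os.environ):
    print(f"WARNING: {problem}")
```

An empty result only means the paths look plausible; it does not guarantee that the data inside them matches the snapshots listed above.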

### Create Output Directory

Output from this notebook will be stored in a directory named `analysis_files`:

In [2]:
analysis_files_dir = Path("analysis_files")
analysis_files_dir.mkdir(exist_ok=True)

### Get experiment data from MaveDB

Get metadata for the examined MaveDB score sets (209 in total). Each captures the following:

* `urn`: The score set identifier
* `target_gene_name`: The listed target for the score set (e.g. Src catalytic domain, CXCR4)
* `target_sequence`: The target sequence for the score set
* `target_sequence_type`: Whether the target sequence is a DNA or protein sequence
* `target_uniprot_ref`: The UniProt ID associated with the score set, if available
* `target_gene_category`: The target type associated with the score set (e.g. Regulatory)

In [3]:
scoresets_input = Path("experiment_scoresets.txt")
with scoresets_input.open() as f:
    scoresets = [scoreset.strip() for scoreset in f.readlines()]
example_scoreset = "urn:mavedb:00000041-a-1"
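
When reading URNs from a plain-text file, a stray blank line or typo would otherwise only surface later as a failed lookup. The sketch below shows one way to filter and validate the input up front. The regular expression is an assumption inferred from URNs such as `urn:mavedb:00000041-a-1` and is not taken from MaveDB documentation; `clean_scoreset_urns` is a hypothetical helper name.

```python
import re

# Assumed pattern, inferred from URNs such as "urn:mavedb:00000041-a-1";
# it is not taken from the MaveDB specification.
SCORESET_URN_PATTERN = re.compile(r"^urn:mavedb:\d{8}-[a-z]+-\d+$")


def clean_scoreset_urns(lines):
    """Strip whitespace, skip blank lines, and reject malformed URNs."""
    urns = []
    for line in lines:
        urn = line.strip()
        if not urn:
            continue  # tolerate blank lines in the input file
        if not SCORESET_URN_PATTERN.match(urn):
            raise ValueError(f"Unexpected score set URN: {urn!r}")
        urns.append(urn)
    return urns


cleaned = clean_scoreset_urns(
    ["urn:mavedb:00000041-a-1\n", "\n", "urn:mavedb:00000105-a-1\n"]
)
print(cleaned)
```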

In [4]:
from dcd_mapping.mavedb_data import get_scoreset_metadata

metadata = {}
for scoreset in tqdm(scoresets):
    metadata[scoreset] = get_scoreset_metadata(scoreset)

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 209/209 [00:00<00:00, 4087.62it/s]


In [5]:
metadata[example_scoreset].model_dump()

{'urn': 'urn:mavedb:00000041-a-1',
 'target_gene_name': 'Src catalytic domain',
 'target_gene_category': <TargetType.PROTEIN_CODING: 'Protein coding'>,
 'target_sequence': 'CTGCGGCTGGAGGTCAAGCTGGGCCAGGGCTGCTTTGGCGAGGTGTGGATGGGGACCTGGAACGGTACCACCAGGGTGGCCATCAAAACCCTGAAGCCTGGCACGATGTCTCCAGAGGCCTTCCTGCAGGAGGCCCAGGTCATGAAGAAGCTGAGGCATGAGAAGCTGGTGCAGTTGTATGCTGTGGTTTCAGAGGAGCCCATTTACATCGTCACGGAGTACATGAGCAAGGGGAGTTTGCTGGACTTTCTCAAGGGGGAGACAGGCAAGTACCTGCGGCTGCCTCAGCTGGTGGACATGGCTGCTCAGATCGCCTCAGGCATGGCGTACGTGGAGCGGATGAACTACGTCCACCGGGACCTTCGTGCAGCCAACATCCTGGTGGGAGAGAACCTGGTGTGCAAAGTGGCCGACTTTGGGCTGGCTCGGCTCATTGAAGACAATGAGTACACGGCGCGGCAAGGTGCCAAATTCCCCATCAAGTGGACGGCTCCAGAAGCTGCCCTCTATGGCCGCTTCACCATCAAGTCGGACGTGTGGTCCTTCGGGATCCTGCTGACTGAGCTCACCACAAAGGGACGGGTGCCCTACCCTGGGATGGTGAACCGCGAGGTGCTGGACCAGGTGGAGCGGGGCTACCGGATGCCCTGCCCGCCGGAGTGTCCCGAGTCCCTGCACGACCTCATGTGCCAGTGCTGGCGGAAGGAGCCTGAGGAGCGGCCCACCTTCGAGTACCTGCAGGCCTTCCTG',
 'target_sequence_type': <TargetSequenceType.DNA: 'dna'>,
 'target_uniprot

Additionally, get the corresponding experiment scores from MaveDB. Mirroring the information provided by the `/scores` API endpoint, this provides the following data for each score in a score set:

* `hgvs_pro`: variant description with respect to the amino acid target sequence
* `hgvs_nt`: variant description with respect to the nucleotide target sequence
* `score`: raw reported score
* `accession`: accession identifier for the specific experiment, e.g. `urn:mavedb:00000041-a-1#548`

In [6]:
from dcd_mapping.mavedb_data import get_scoreset_records

scores = {}
for urn in tqdm(scoresets):
    try:
        scores[urn] = get_scoreset_records(urn)
    except Exception as e:  # avoid a bare `except`, which would also trap KeyboardInterrupt
        print(f"{urn}: {e}")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 209/209 [00:07<00:00, 27.45it/s]


Each score "row" consists of nucleotide and/or protein [MAVE-HGVS expressions](https://www.mavedb.org/docs/mavehgvs/spec.html), a score, and an identifier:

In [7]:
scores[example_scoreset][0].model_dump()

{'hgvs_pro': 'p.Tyr170Gly',
 'hgvs_nt': 'NA',
 'score': '0.753146338',
 'accession': 'urn:mavedb:00000041-a-1#36'}
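
To make the structure of a protein expression such as `p.Tyr170Gly` concrete, the sketch below splits a simple substitution into its reference residue, alternate residue, and an inter-residue (0-based, half-open) interval. It is an illustration only: the regular expression handles just single-residue substitutions among the 20 standard amino acids, the function name `parse_protein_substitution` is hypothetical, and it is not how `dcd_mapping` parses MAVE-HGVS.

```python
import re

# Three-letter to one-letter amino acid codes (20 standard residues only).
AA3_TO_AA1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches only simple substitutions such as "p.Tyr170Gly".
SUBSTITUTION = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_protein_substitution(hgvs_pro: str) -> dict:
    """Parse a single-residue substitution into one-letter codes and an interval."""
    match = SUBSTITUTION.match(hgvs_pro)
    if match is None:
        raise ValueError(f"Not a simple substitution: {hgvs_pro!r}")
    ref3, position, alt3 = match.groups()
    if ref3 not in AA3_TO_AA1 or alt3 not in AA3_TO_AA1:
        raise ValueError(f"Unsupported residue code in {hgvs_pro!r}")
    pos = int(position)
    return {
        "ref": AA3_TO_AA1[ref3],
        "alt": AA3_TO_AA1[alt3],
        "start": pos - 1,  # HGVS positions are 1-based; the interval is 0-based
        "end": pos,
    }


parsed = parse_protein_substitution("p.Tyr170Gly")
print(parsed)
```

For the example record, this yields the interval 169 to 170 and the alternate residue `G`, which matches the pre-mapped allele shown later in this notebook.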

## Part 1: MaveDB Metadata to BLAT Alignment Data

During this step, the target sequence for each score set is run through BLAT, allowing for genomic coordinates to be linked with the target sequence.

### Generate BLAT Output for each Score Set

Generate BLAT alignment output for each examined score set:

In [8]:
from dcd_mapping.align import AlignmentError, align
from dcd_mapping.mavedb_data import get_scoreset_metadata

align_results = {}
failed_alignment_scoresets = []

for scoreset, meta in tqdm(metadata.items()):
    try:
        align_results[scoreset] = align(meta, silent=True)
    except AlignmentError:
        failed_alignment_scoresets.append(scoreset)

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 209/209 [2:33:28<00:00, 44.06s/it]


During our experiments, we found that one scoreset, `urn:mavedb:00000105-a-1`, fails to return a BLAT hit against the reference genome:

In [9]:
print(failed_alignment_scoresets)

['urn:mavedb:00000105-a-1']


The result of the alignment phase is a structured description of the best BLAT result for the input sequence.

In [10]:
align_results[example_scoreset].model_dump()

{'chrom': 'chr20',
 'strand': <Strand.POSITIVE: 1>,
 'coverage': 100.0,
 'ident_pct': 99.86666666666666,
 'query_range': {'start': 0, 'end': 750},
 'query_subranges': [{'start': 0, 'end': 52},
  {'start': 52, 'end': 232},
  {'start': 232, 'end': 309},
  {'start': 309, 'end': 463},
  {'start': 463, 'end': 595},
  {'start': 595, 'end': 750}],
 'hit_range': {'start': 37397802, 'end': 37403325},
 'hit_subranges': [{'start': 37397802, 'end': 37397854},
  {'start': 37400114, 'end': 37400294},
  {'start': 37401601, 'end': 37401678},
  {'start': 37402434, 'end': 37402588},
  {'start': 37402748, 'end': 37402880},
  {'start': 37403170, 'end': 37403325}]}
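
The subranges in this result can be sanity-checked with a few lines of arithmetic. The sketch below copies the values from the example output above and confirms that each aligned query block spans the same number of bases on the genome. The coverage formula (aligned query bases divided by query length) is an assumption about how the reported value is defined, and interpreting the genomic gaps as introns is likewise an interpretation rather than something the alignment result states.

```python
# Values copied from the example alignment result above.
query_subranges = [(0, 52), (52, 232), (232, 309), (309, 463), (463, 595), (595, 750)]
hit_subranges = [
    (37397802, 37397854),
    (37400114, 37400294),
    (37401601, 37401678),
    (37402434, 37402588),
    (37402748, 37402880),
    (37403170, 37403325),
]
query_length = 750  # end of `query_range` in the example


def block_lengths(ranges):
    """Length of each half-open (start, end) range."""
    return [end - start for start, end in ranges]


query_blocks = block_lengths(query_subranges)
hit_blocks = block_lengths(hit_subranges)

# Each aligned query block should cover the same number of genomic bases.
assert query_blocks == hit_blocks

# Assumed definition: coverage = aligned query bases / query length.
coverage = 100 * sum(query_blocks) / query_length

# Unaligned genomic stretches between consecutive blocks
# (plausibly introns when the target is a coding sequence).
gaps = [nxt[0] - cur[1] for cur, nxt in zip(hit_subranges, hit_subranges[1:])]

print(query_blocks)  # [52, 180, 77, 154, 132, 155]
print(coverage)      # 100.0
print(gaps)          # [2260, 1307, 756, 160, 290]
```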

### Save BLAT Output

Save a checkpoint for the BLAT results:

In [11]:
import pickle

mave_blat_to_save = {}
for scoreset, result in align_results.items():
    mave_blat_to_save[scoreset] = result.model_dump(exclude_none=True)
with (analysis_files_dir / "mave_blat_output.pickle").open("wb") as fn:
    pickle.dump(mave_blat_to_save, fn, protocol=pickle.HIGHEST_PROTOCOL)
del mave_blat_to_save
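
The same save-then-reload pattern recurs for every checkpoint in this notebook, so it could be factored into a pair of small helpers. The sketch below is one possible shape, using plain dictionaries and a temporary directory so that it runs without any of the external resources; `save_checkpoint` and `load_checkpoint` are hypothetical names, not `dcd_mapping` functions.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(data: dict, path: Path) -> None:
    """Write a dictionary of plain Python objects to a pickle file."""
    with path.open("wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_checkpoint(path: Path) -> dict:
    """Read a checkpoint back.

    Only unpickle files you created yourself: loading a pickle can execute
    arbitrary code.
    """
    with path.open("rb") as f:
        return pickle.load(f)


# Round-trip a small stand-in for the dumped alignment results.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "example.pickle"
    original = {"urn:mavedb:00000041-a-1": {"chrom": "chr20", "coverage": 100.0}}
    save_checkpoint(original, checkpoint)
    restored = load_checkpoint(checkpoint)

print(restored == original)
```

Dumping models to dictionaries before pickling, as the cell above does, keeps the checkpoint readable even if the model classes change between library versions.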

## Part 2: Transcript and Offset Selection for MaveDB Score Sets

In this phase, a human transcript is chosen for each protein-coding score set, and an offset is computed when the target sequence does not occur at the start of the human reference sequence. For regulatory/other non-coding score sets, a transcript is not chosen and the chromosomal sequence is selected as the reference sequence. 
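
The idea of an offset can be illustrated with toy data. In the sketch below, both sequences are invented for illustration, `find_offset` is a hypothetical helper, and the search assumes the target occurs as an exact substring of the reference. Real target sequences may not match exactly, so this should not be read as the selection logic used by `dcd_mapping`.

```python
def find_offset(reference_protein: str, target_protein: str) -> int:
    """Return the 0-based position where the target begins in the reference."""
    offset = reference_protein.find(target_protein)
    if offset == -1:
        raise ValueError("Target sequence not found in reference sequence")
    return offset


# Toy sequences, invented for illustration only.
toy_reference = "MKTAYIAKQRQISFVK"
toy_target = "AYIAKQ"

offset = find_offset(toy_reference, toy_target)
print(offset)  # 3

# Residue i of the target corresponds to residue i + offset of the reference.
assert toy_reference[offset : offset + len(toy_target)] == toy_target
```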

### Load BLAT output

Load checkpointed BLAT output for the examined MaveDB score sets:

In [12]:
import pickle

from dcd_mapping.schemas import AlignmentResult

with (analysis_files_dir / "mave_blat_output.pickle").open("rb") as fn:
    mave_blat_temp = pickle.load(fn)
align_results = {}
for scoreset in scoresets:
    align_result = mave_blat_temp.get(scoreset)
    if align_result:
        align_results[scoreset] = AlignmentResult(**align_result)

### Generate Transcript Mappings File

Generate a transcript mapping for each relevant score set containing the following data:

* `nm`: A RefSeq transcript accession
* `np`: A RefSeq protein sequence accession
* `start`: An integer containing the offset for the target sequence with respect to the selected human reference sequence
* `transcript_mode`: The set of [MANE annotations](https://www.ncbi.nlm.nih.gov/refseq/MANE/) in which the selected transcript is included. See the [CoolSeqTool docs](https://coolseqtool.readthedocs.io/0.4.0-dev3/transcript_selection.html#representative-transcript-priority) for additional information
* `sequence`: The translated protein reference sequence
* `is_full_match`: Whether the reference sequence is a complete match for the target sequence

In [13]:
import asyncio

import nest_asyncio

from dcd_mapping.transcripts import TxSelectError, select_transcript

nest_asyncio.apply()
failed_tx_select_scoresets = [] 
tx_selection = {}
for ss in tqdm(scoresets):
    if ss in align_results:
        try:
            tx_selection[ss] = asyncio.run(
                select_transcript(
                    metadata[ss],
                    scores[ss],
                    align_results[ss],
                )
            )
        except TxSelectError:
            failed_tx_select_scoresets.append(ss)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 209/209 [00:19<00:00, 10.82it/s]


This phase should be completed without encountering any new errors:

In [14]:
failed_tx_select_scoresets

[]

Transcript selection and offset data is stored for each scoreset:

In [15]:
tx_selection[example_scoreset].model_dump()

{'nm': 'NM_198291.3',
 'np': 'NP_938033.1',
 'start': 269,
 'is_full_match': True,
 'transcript_mode': <TranscriptPriority.MANE_SELECT: 'mane_select'>,
 'sequence': 'LRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFL'}

### Save Transcript Mappings Output

Save a checkpoint for the `transcript_mappings` data:

In [16]:
import pickle

transcript_mappings_to_save = {}
for ss in tx_selection:
    if tx_selection[ss]:
        transcript_mappings_to_save[ss] = tx_selection[ss].model_dump(exclude_none=True)
with (analysis_files_dir / "transcript_mappings.pickle").open("wb") as fn:
    pickle.dump(transcript_mappings_to_save, fn, protocol=pickle.HIGHEST_PROTOCOL)
del transcript_mappings_to_save

## Part 3: Mapping MAVE Variants using the GA4GH Variation Representation Specification (VRS)

During this phase, MAVE variants are converted into VRS objects, generating pre-mapped and post-mapped computable representations for each variant. The functional effect score for each variant pair and the associated MaveDB ID are also stored in separate dictionaries.

### Load Alignment and Transcript Selection Data

Load checkpointed alignment and transcript selection data:

In [17]:
import pickle

from dcd_mapping.schemas import AlignmentResult, TxSelectResult

with (analysis_files_dir / "mave_blat_output.pickle").open("rb") as fn:
    mave_blat_temp = pickle.load(fn)
align_results = {}
for ss in mave_blat_temp:
    align_results[ss] = AlignmentResult(**mave_blat_temp[ss])
del mave_blat_temp

with (analysis_files_dir / "transcript_mappings.pickle").open("rb") as fn:
    transcript_mappings_temp = pickle.load(fn)
tx_selection = {}
for ss in transcript_mappings_temp:
    tx_selection[ss] = TxSelectResult(**transcript_mappings_temp[ss])
del transcript_mappings_temp



### Convert MaveDB Variants to VRS Alleles

Convert MaveDB variants to VRS objects.

In [18]:
from dcd_mapping.lookup import get_seqrepo
from dcd_mapping.vrs_map import vrs_map
from dcd_mapping.mavedb_data import get_scoreset_metadata, get_scoreset_records

mave_vrs_mappings = {}

for ss in tqdm(align_results):
    mave_vrs_mappings[ss] = vrs_map(
        metadata[ss],
        align_results[ss],
        scores[ss],
        tx_selection.get(ss)
    )

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [2:45:36<00:00, 47.77s/it]


For each score in a MaveDB scoreset, VRS objects are generated from both the original (pre-mapped) MAVE variation descriptions as well as the variations that have been mapped to reference transcripts:

In [19]:
mave_vrs_mappings[example_scoreset][0].model_dump(exclude_none=True)

{'accession_id': 'urn:mavedb:00000041-a-1#36',
 'annotation_layer': <AnnotationLayer.PROTEIN: 'p'>,
 'score': '0.753146338',
 'pre_mapped': {'id': 'ga4gh:VA.FLe4-pSUs7vjdVtVD4TmUNL4JhrBbqTd',
  'type': 'Allele',
  'digest': 'FLe4-pSUs7vjdVtVD4TmUNL4JhrBbqTd',
  'location': {'type': 'SequenceLocation',
   'digest': 'DCfpyPamywb6xZ_YuqheLIUUna9idFdK',
   'sequenceReference': {'type': 'SequenceReference',
    'refgetAccession': 'SQ.PyX9IDu95_tYLg1Jz9JpW5xpQkwn6bpB'},
   'start': 169,
   'end': 170},
  'state': {'type': 'LiteralSequenceExpression', 'sequence': 'G'}},
 'post_mapped': {'id': 'ga4gh:VA.rKyjzmt0czvrVFeRsvCxH-aE4GSoMzUS',
  'type': 'Allele',
  'digest': 'rKyjzmt0czvrVFeRsvCxH-aE4GSoMzUS',
  'location': {'type': 'SequenceLocation',
   'digest': 'F_PJZIrk2lQaj2CLaS-TbsWdeJjwAsCu',
   'sequenceReference': {'type': 'SequenceReference',
    'refgetAccession': 'SQ.uJDQo_HaTNFL2-0-6K5dVzVcweigexye'},
   'start': 438,
   'end': 439},
  'state': {'type': 'LiteralSequenceExpression', 'se
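
In this example, the relationship between the two locations can be checked by hand. The pre-mapped interval is 169 to 170 on the target sequence, the transcript selection step reported `start: 269`, and the post-mapped interval is 438 to 439. The sketch below shows that these numbers are consistent with a simple shift by the offset. This is an observation about this one protein-level example and not a description of everything `vrs_map` does; `shift_interval` is a hypothetical helper.

```python
def shift_interval(start: int, end: int, offset: int) -> tuple[int, int]:
    """Shift a half-open interval by a fixed offset."""
    return start + offset, end + offset


pre_mapped = (169, 170)  # from the pre-mapped allele above
tx_offset = 269          # `start` from the transcript selection result

post_mapped = shift_interval(*pre_mapped, tx_offset)
print(post_mapped)  # (438, 439)
```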

### Save VRS Mappings Dictionary

Save a checkpoint of the VRS mappings dictionary to `analysis_files`:

In [20]:
import pickle

tmp_mave_vrs_mappings = {}
for ss, mappings in mave_vrs_mappings.items():
    if mappings:
        tmp_mave_vrs_mappings[ss] = [m.model_dump(exclude_none=True, exclude_unset=True) for m in mappings]
    else:
        tmp_mave_vrs_mappings[ss] = mappings
with (analysis_files_dir / "mave_vrs_mappings.pickle").open("wb") as fn:
    pickle.dump(tmp_mave_vrs_mappings, fn, protocol=pickle.HIGHEST_PROTOCOL)
del tmp_mave_vrs_mappings



### Generate Annotations

Finally, annotate MaveDB scoreset metadata with the pre- and post-mapped VRS objects, as well as two additional data points:

1. `vrs_ref_allele_seq`: The sequence between the start and end positions indicated in the variant
2. `hgvs`: An HGVS string describing the variant (only included for post-mapped variants)
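
The first of these annotations can be pictured as a plain slice of the underlying sequence. The sketch below assumes inter-residue (0-based, half-open) coordinates and uses the first 170 residues of the translated target sequence from the transcript selection example, split into groups of ten so the position is easy to verify. `ref_allele_seq` is a hypothetical helper; the real workflow retrieves sequences from SeqRepo instead of an in-memory string.

```python
# First 170 residues of the example target protein, in groups of ten.
chunks = [
    "LRLEVKLGQG", "CFGEVWMGTW", "NGTTRVAIKT", "LKPGTMSPEA", "FLQEAQVMKK",
    "LRHEKLVQLY", "AVVSEEPIYI", "VTEYMSKGSL", "LDFLKGETGK", "YLRLPQLVDM",
    "AAQIASGMAY", "VERMNYVHRD", "LRAANILVGE", "NLVCKVADFG", "LARLIEDNEY",
    "TARQGAKFPI", "KWTAPEAALY",
]
target_prefix = "".join(chunks)


def ref_allele_seq(sequence: str, start: int, end: int) -> str:
    """Return the residues between inter-residue positions start and end."""
    return sequence[start:end]


# Pre-mapped location of p.Tyr170Gly: start=169, end=170.
ref = ref_allele_seq(target_prefix, 169, 170)
print(ref)  # Y
```

The result agrees with the `Tyr` reference residue in `p.Tyr170Gly`.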

In [21]:
from dcd_mapping.annotate import annotate

annotated_vrs_mappings = {}
for urn, mapping in tqdm(mave_vrs_mappings.items()):
    if mapping:
        annotated_vrs_mappings[urn] = annotate(mapping, tx_selection.get(urn), metadata[urn])

100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [58:09<00:00, 16.78s/it]


The final product provided to our reported integration projects includes VRS 1.3-compliant alleles:

In [22]:
annotated_vrs_mappings[example_scoreset][0].model_dump(exclude_none=True)

{'pre_mapped': {'id': 'ga4gh:VA.p1kr99gs8Zg2mPjLO2d_pEwPnUCgX_Hb',
  'type': 'VariationDescriptor',
  'variation': {'type': 'Allele',
   'id': 'ga4gh:VA.p1kr99gs8Zg2mPjLO2d_pEwPnUCgX_Hb',
   'location': {'id': 'z2nVPPq4_GUMglZu8A8QumXgwLiMbnb1',
    'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.PyX9IDu95_tYLg1Jz9JpW5xpQkwn6bpB',
    'interval': {'type': 'SequenceInterval',
     'start': {'type': 'Number', 'value': 169},
     'end': {'type': 'Number', 'value': 170}}},
   'state': {'type': 'LiteralSequenceExpression', 'sequence': 'G'}},
  'expressions': [],
  'vrs_ref_allele_seq': 'Y',
  'extensions': []},
 'post_mapped': {'id': 'ga4gh:VA.kyyRBeK2TehmHkcr54TVvkTsfqIfpGCc',
  'type': 'VariationDescriptor',
  'variation': {'type': 'Allele',
   'id': 'ga4gh:VA.kyyRBeK2TehmHkcr54TVvkTsfqIfpGCc',
   'location': {'id': '-iuHgmV7-c61mmcp693fH4d_xRC_ZKYU',
    'type': 'SequenceLocation',
    'sequence_id': 'ga4gh:SQ.uJDQo_HaTNFL2-0-6K5dVzVcweigexye',
    'interval': {'type': 'Sequence
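
Comparing this output with the earlier `vrs_map` output shows how the location is laid out differently: the earlier form uses integer `start`/`end` values and a `sequenceReference` with a `refgetAccession`, while the VRS 1.3 form uses a `sequence_id` and a `SequenceInterval` of `Number` objects. The sketch below reshapes one into the other using only the fields visible in those outputs. It is an illustration of the structural difference, not the conversion code used by `dcd_mapping`, and it leaves out the `id` fields because those are computed digests.

```python
def to_vrs_1_3_location_shape(location: dict) -> dict:
    """Reshape a newer-style SequenceLocation dict into the VRS 1.3 layout.

    Hypothetical helper for illustration; the 1.3 `id` is omitted because it
    is a computed digest and cannot be derived by reshaping alone.
    """
    return {
        "type": "SequenceLocation",
        "sequence_id": "ga4gh:" + location["sequenceReference"]["refgetAccession"],
        "interval": {
            "type": "SequenceInterval",
            "start": {"type": "Number", "value": location["start"]},
            "end": {"type": "Number", "value": location["end"]},
        },
    }


# Pre-mapped location copied from the earlier `vrs_map` example output.
newer_location = {
    "type": "SequenceLocation",
    "sequenceReference": {
        "type": "SequenceReference",
        "refgetAccession": "SQ.PyX9IDu95_tYLg1Jz9JpW5xpQkwn6bpB",
    },
    "start": 169,
    "end": 170,
}

older_location = to_vrs_1_3_location_shape(newer_location)
print(older_location["sequence_id"])
```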

### Save VRS mappings output in JSON files

Run the cell below to save the VRS mappings for each score set to a JSON file in `analysis_files/mappings`.

In [23]:
from dcd_mapping.annotate import save_mapped_output_json

mappings_dir = analysis_files_dir / "mappings"
mappings_dir.mkdir(exist_ok=True)

for urn, mappings in tqdm(annotated_vrs_mappings.items()):
    output_file = mappings_dir / f"{urn}_mappings.json"
    save_mapped_output_json(
        urn,
        mappings,
        align_results[urn],
        tx_selection.get(urn),
        include_vrs_2=False,
        output_path=output_file
    )

100%|███████████████████████████████████████████████████████████████████████████████████████████| 207/207 [34:49<00:00, 10.09s/it]
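
One portability note on the file names used above: score set URNs contain colons, which are not permitted in file names on Windows. If the output needs to be created or copied there, one option is to substitute the colons when building the name. The sketch below is a hypothetical naming scheme, not something `dcd_mapping` does, and any downstream tooling that expects the original names would need the same change.

```python
from pathlib import Path


def mapping_filename(urn: str) -> str:
    """Build a file name for a score set, replacing colons with underscores.

    Hypothetical naming scheme for illustration only.
    """
    return urn.replace(":", "_") + "_mappings.json"


output_file = Path("analysis_files") / "mappings" / mapping_filename(
    "urn:mavedb:00000041-a-1"
)
print(output_file.name)  # urn_mavedb_00000041-a-1_mappings.json
```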
