# Recalculating the MAPS score using gnomAD v3 data

### MAPS (mutability-adjusted proportion of singletons) measures selection against classes of variants in a population, on the assumption that natural selection keeps the more damaging classes of variants at lower frequencies.

In the simplest terms, a higher MAPS value means that a class of variants is enriched for low-frequency variants, which suggests that the class is more deleterious: natural selection keeps variants that seriously harm the carrier's health from becoming common. ("Higher values indicate an enrichment of lower frequency variants, which suggests increased deleteriousness", gnomAD paper).

Such deleterious variants, which as noted tend to be low-frequency, increase an individual’s susceptibility or predisposition to a certain disease or disorder. When such a variant (or mutation) is inherited, development of symptoms is more likely, but not certain. Such a variant is also called a disease-causing mutation, pathogenic variant, predisposing mutation, or susceptibility gene mutation.

In summary, MAPS indicates how deleterious (as described above) a given functional class of variants is in a particular population. The score is not comparable between cohorts.
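
As a compact reference, the score can be written as below. This is a sketch in our own notation, not taken from the gnomAD paper: $c$ is a functional class, $k$ a mutational context, $n_{c,k}$ and $s_{c,k}$ the numbers of variants and singletons, $\mu_k$ the mutation rate of context $k$, and $a$, $b$ the intercept and slope of a linear model fitted on synonymous variants. It assumes the subtraction form (observed minus expected), which is how gnomAD's public code computes the score.

$$
\mathrm{PS}_c = \frac{\sum_k s_{c,k}}{\sum_k n_{c,k}}, \qquad
\widehat{\mathrm{PS}}_c = \frac{\sum_k n_{c,k}\,(a + b\,\mu_k)}{\sum_k n_{c,k}}, \qquad
\mathrm{MAPS}_c = \mathrm{PS}_c - \widehat{\mathrm{PS}}_c
$$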

### Basic step-by-step guide to calculate MAPS:
1. Divide variants into functional classes (this is VEP's `consequence`).
2. For each class, count the singletons and the total number of variants, and calculate the proportion of singletons.
3. To correct for each variant's mutational class (transitions, transversions, CpGs), also calculate the proportion of singletons for the synonymous functional class alone.
4. Train a linear model on synonymous variation, weighted by the number of observations in each mutational context. This step corrects MAPS for mutability, for example for transitions being more common than transversions. Mutation rate estimates were downloaded from supplementary_dataset_10 of the gnomAD paper.
5. Use the trained model to predict the expected proportion of singletons for each functional class.
6. Derive MAPS for each functional class from its observed and expected proportions of singletons.

**To do (step 6): the supplementary text calls the expected proportion of singletons MAPS, but the code subtracts it from the observed proportion; check which definition applies.**
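
The steps above can be sketched on toy numbers as follows. This is a minimal NumPy illustration, not the gnomAD implementation: the function names `fit_singleton_model` and `maps_score` and all counts and mutation rates are made up for the example, and it assumes that MAPS is the observed minus the expected proportion of singletons (the subtraction mentioned for step 6).

```python
import numpy as np


def fit_singleton_model(mu, n_variants, n_singletons):
    """Fit proportion_singleton ~ intercept + slope * mu on synonymous variants.

    Each mutational context is one observation, weighted by its variant count.
    """
    mu = np.asarray(mu, dtype=float)
    n_variants = np.asarray(n_variants, dtype=float)
    ps = np.asarray(n_singletons, dtype=float) / n_variants
    # np.polyfit minimises sum((w * residual)**2), so w = sqrt(n) weights each
    # squared residual by the number of variants in that context.
    slope, intercept = np.polyfit(mu, ps, deg=1, w=np.sqrt(n_variants))
    return slope, intercept


def maps_score(slope, intercept, mu, n_variants, n_singletons):
    """Observed minus expected proportion of singletons for one functional class."""
    mu = np.asarray(mu, dtype=float)
    n_variants = np.asarray(n_variants, dtype=float)
    n_singletons = np.asarray(n_singletons, dtype=float)
    # Expected singletons: model prediction per context times variants in it.
    expected_singletons = np.sum(n_variants * (intercept + slope * mu))
    return (np.sum(n_singletons) - expected_singletons) / np.sum(n_variants)


# Toy synonymous data, one entry per mutational context (hypothetical values).
syn_mu = [1e-9, 2e-9, 1e-8]
syn_n = [1000, 2000, 500]
syn_singletons = [590, 1160, 250]

slope, intercept = fit_singleton_model(syn_mu, syn_n, syn_singletons)

# Toy stop_gained data in two of the same contexts (hypothetical values).
stop_maps = maps_score(slope, intercept, [1e-9, 1e-8], [100, 100], [70, 60])
syn_maps = maps_score(slope, intercept, syn_mu, syn_n, syn_singletons)

print("stop_gained MAPS:", round(stop_maps, 3))
print("synonymous MAPS: ", round(abs(syn_maps), 3))
```

With these toy numbers the synonymous class comes out at approximately 0 by construction, because the model was fitted on it, while the made-up stop_gained class comes out at about 0.105, i.e. more singletons than its mutability alone would predict.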
    
    

## 1. Import packages

In [1]:
import hail as hl
from bokeh.io import output_notebook, show

## 2. Import data

In [2]:
ht = hl.read_table('gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht')

Initializing Hail with default parameters...


2022-09-18 13:46:04 WARN  Utils:69 - Your hostname, MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.181 instead (on interface en0)
2022-09-18 13:46:04 WARN  Utils:69 - Set SPARK_LOCAL_IP if you need to bind to another address




2022-09-18 13:46:05 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2022-09-18 13:46:07 WARN  Utils:69 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Running on Apache Spark version 3.1.3
SparkUI available at http://192.168.0.181:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.99-57537fea08d4
LOGGING: writing to /Users/adrian/BroadIS/01_maps/hail-20220918-1346-0.2.99-57537fea08d4.log


In [16]:
ht_mu = hl.import_table('data/supplementary_dataset_10_mutation_rates.tsv.gz',
                delimiter='\t', impute=True, force_bgz=True)

2022-09-18 23:23:40 Hail: INFO: Reading table to impute column types
2022-09-18 23:23:41 Hail: INFO: Finished type imputation
  Loading field 'context' as type str (imputed)
  Loading field 'ref' as type str (imputed)
  Loading field 'alt' as type str (imputed)
  Loading field 'methylation_level' as type int32 (imputed)
  Loading field 'mu_snp' as type float64 (imputed)


## 3. Retrieve the variant annotations

### Also subset the data to prototype locally

In [3]:
ht = ht.head(100000)

ht.count()

100000

In [28]:
ht.describe()

----------------------------------------
Global fields:
    'freq_meta': array<dict<str, str>> 
    'freq_index_dict': dict<str, int32> 
    'faf_index_dict': dict<str, int32> 
    'faf_meta': array<dict<str, str>> 
    'vep_version': str 
    'vep_csq_header': str 
    'dbsnp_version': str 
    'filtering_model': struct {
        model_name: str, 
        score_name: str, 
        snv_cutoff: struct {
            bin: float64, 
            min_score: float64
        }, 
        indel_cutoff: struct {
            bin: float64, 
            min_score: float64
        }, 
        model_id: str, 
        snv_training_variables: array<str>, 
        indel_training_variables: array<str>
    } 
    'age_distribution': struct {
        bin_edges: array<float64>, 
        bin_freq: array<int32>, 
        n_smaller: int32, 
        n_larger: int32
    } 
    'freq_sample_count': array<int32> 
----------------------------------------
Row fields:
    'locus': locus<GRCh38> 
    'alleles': array<s

In [5]:
ht.aggregate(hl.agg.counter(ht.vep.most_severe_consequence))



frozendict({'3_prime_UTR_variant': 692, '5_prime_UTR_variant': 5, 'downstream_gene_variant': 4308, 'frameshift_variant': 14, 'inframe_insertion': 1, 'intergenic_variant': 5365, 'intron_variant': 65324, 'mature_miRNA_variant': 51, 'missense_variant': 256, 'non_coding_transcript_exon_variant': 16198, 'regulatory_region_variant': 99, 'splice_acceptor_variant': 56, 'splice_donor_variant': 83, 'splice_region_variant': 548, 'start_lost': 3, 'stop_gained': 7, 'stop_lost': 1, 'synonymous_variant': 103, 'upstream_gene_variant': 6886})
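
To put these counts in perspective, the plain-Python sketch below (no Hail needed) converts them to shares of the subset. The dictionary is copied from the output above; the variable names are ours. It shows that only 103 of the 100,000 rows are synonymous, which is probably too few to fit the synonymous model reliably, so this subset looks suitable for drafting code rather than for estimating MAPS.

```python
# Counts copied from the notebook output above (100,000-row subset).
consequence_counts = {
    '3_prime_UTR_variant': 692, '5_prime_UTR_variant': 5,
    'downstream_gene_variant': 4308, 'frameshift_variant': 14,
    'inframe_insertion': 1, 'intergenic_variant': 5365,
    'intron_variant': 65324, 'mature_miRNA_variant': 51,
    'missense_variant': 256, 'non_coding_transcript_exon_variant': 16198,
    'regulatory_region_variant': 99, 'splice_acceptor_variant': 56,
    'splice_donor_variant': 83, 'splice_region_variant': 548,
    'start_lost': 3, 'stop_gained': 7, 'stop_lost': 1,
    'synonymous_variant': 103, 'upstream_gene_variant': 6886,
}

total = sum(consequence_counts.values())
shares = {csq: n / total for csq, n in consequence_counts.items()}

# Print the classes from most to least frequent with their share of the subset.
for csq, n in sorted(consequence_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{csq:36s} {n:6d} {shares[csq]:8.3%}")

n_syn = consequence_counts['synonymous_variant']
print(f"synonymous variants: {n_syn} of {total}")
```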