# Expected variants
This notebook calculates the expected number of variants in each region.
Two kinds of region are covered:
- Transcripts
- NMD regions

## Import modules

In [1]:
import numpy as np
import pandas as pd
from collections import defaultdict

## Load data

### Synonymous variants observed in UKB
Rare synonymous variants are the basis for the model. We drop synonymous variants arising from CpG transitions.

In [2]:
# Rare synonymous variants per variant context
syn = pd.read_csv("../outputs/observed_variants_stats_synonymous.tsv", sep="\t")

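# Proportion of possible variants ("pos") that were observed ("obs")
syn["prop_obs"] = syn["obs"] / syn["pos"]

# Exclude CpG transitions.
# Despite its name, syn_cpg holds the contexts that remain after the CpG rows are dropped.
syn_cpg = syn[syn["variant_type"] != "CpG"].copy()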

### Variants observed in UKB, per transcript and NMD region
Expected numbers of variants will be predicted for these regions.

In [3]:
# Variants observed per transcript
enst = pd.read_csv("../outputs/observed_variants_stats_transcript_no_cpg.tsv", sep="\t")

In [4]:
# Variants observed per NMD region
nmd = pd.read_csv("../outputs/observed_variants_stats_nmd_no_cpg.tsv", sep="\t")

In [5]:
# Concatenate the transcript-level and region-level data
enst = enst.assign(region="transcript")
nmd = nmd.rename(columns={"nmd": "region"})

df = pd.concat([nmd, enst]).sort_values(["region", "enst", "csq"])
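
As a rough illustration of the concatenation step, the sketch below applies the same `assign`, `rename`, and `concat` calls to two toy tables. The transcript ID, the NMD region labels, and the counts are hypothetical and are not taken from the real input files.

```python
import pandas as pd

# Toy tables; the transcript ID and NMD region labels are hypothetical
toy_enst = pd.DataFrame(
    {"enst": ["ENST_A"], "csq": ["missense"], "n_obs": [5]}
)
toy_nmd = pd.DataFrame(
    {
        "enst": ["ENST_A", "ENST_A"],
        "csq": ["missense", "missense"],
        "nmd": ["region_2", "region_1"],
        "n_obs": [2, 3],
    }
)

# Give both tables a shared "region" column, as in the cell above
toy_enst = toy_enst.assign(region="transcript")
toy_nmd = toy_nmd.rename(columns={"nmd": "region"})

# Stack the rows and sort so that each region's rows sit together
toy_df = pd.concat([toy_nmd, toy_enst]).sort_values(["region", "enst", "csq"])
print(toy_df)
```

After this step, whole-transcript rows and NMD-region rows share one table and are told apart only by the value in `region`.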

## Linear model for expected proportion of variants

Based on expecation_model_choices.ipynb, the best model for predicting the expected proportion of variants (excluding CpG transitions) appears to be a simple linear regression of `prop_obs` on `mu`, weighted by the number of possible variants per context (`pos`).

In [6]:
# Linear model
fit = np.polyfit(syn_cpg["mu"], syn_cpg["prop_obs"], 1, w=syn_cpg["pos"])
lm_p = np.poly1d(fit)
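
One detail of `np.polyfit` may be worth keeping in mind here: its `w` argument is applied to the unsquared residuals, so the fit minimises `sum((w * (y - y_hat)) ** 2)`. The sketch below checks this on toy data (all values are hypothetical) by solving the same problem by hand. It also shows the `w=np.sqrt(pos)` variant, which would weight each squared residual linearly by `pos`. Which of the two the original analysis intends is not stated in this notebook, so this is only an illustration of the semantics.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the UKB outputs)
mu = np.array([0.1, 0.2, 0.3, 0.4])
prop_obs = np.array([0.12, 0.19, 0.33, 0.38])
pos = np.array([1000, 4000, 2500, 500])

# Same call shape as the cell above: degree-1 fit with w=pos
slope, intercept = np.polyfit(mu, prop_obs, 1, w=pos)

# Solve the equivalent least-squares problem by hand:
# scale each row of the design matrix and each target by its weight
X = np.column_stack([mu, np.ones_like(mu)])
manual, *_ = np.linalg.lstsq(X * pos[:, None], prop_obs * pos, rcond=None)

# Alternative: weight the squared residuals linearly by pos
slope_sqrt, intercept_sqrt = np.polyfit(mu, prop_obs, 1, w=np.sqrt(pos))

print("w=pos:      ", slope, intercept)
print("by hand:    ", manual)
print("w=sqrt(pos):", slope_sqrt, intercept_sqrt)
```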

## Calculate expected variants per transcript and context
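
In symbols, the calculation in this section can be written as follows, where $a$ and $b$ stand for the slope and intercept of the fitted line (symbols introduced here for illustration) and $\hat{p}$ corresponds to `prop_exp`:

$$
\begin{aligned}
\hat{p} &= a \cdot \mu + b \\
n_{\text{exp}} &= \operatorname{round}\left(n_{\text{pos}} \cdot \hat{p},\ 2\right) \\
\text{oe} &= \frac{n_{\text{obs}}}{n_{\text{exp}}}
\end{aligned}
$$

As a worked example with hypothetical numbers: if $a = 2$, $b = 0.01$, and $\mu = 0.05$, then $\hat{p} = 0.11$. With $n_{\text{pos}} = 200$ this gives $n_{\text{exp}} = 22$, and $n_{\text{obs}} = 11$ gives $\text{oe} = 0.5$.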

In [7]:
# Find the expected number of variants for each row.
# Note: if n_exp is 0 (for example when n_pos is 0), oe is inf or NaN.
df = df.assign(
    prop_obs=lambda x: x["n_obs"] / x["n_pos"],
    prop_exp=lambda x: lm_p(x["mu"]),
    n_exp=lambda x: np.round(x["n_pos"] * x["prop_exp"], 2),
    oe=lambda x: x["n_obs"] / x["n_exp"],
)

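# Region / transcript / consequence combinations with no variants are absent
# from the input, so they would otherwise be missing from the output.
# Unstacking and restacking adds them back as rows.
# Note: fill_value=0 is applied to every column (mu, oe, ...), not only the counts.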
df = df.set_index(["region", "enst", "csq"]).unstack(fill_value=0).stack().reset_index()

# Keep relevant columns
df = df[
    [
        "region",
        "enst",
        "csq",
        "mu",
        "n_pos",
        "n_obs",
        "n_exp",
        "oe",
        "prop_obs",
        "prop_exp",
    ]
]
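
The `unstack(fill_value=0).stack()` round trip in the cell above can be sketched on a toy table. The transcript IDs, consequences, and values below are hypothetical. `ENST_B` has no `nonsense` row in the input, and the round trip adds one with zeros in every column.

```python
import pandas as pd

# Toy input: ENST_B has no "nonsense" row (all values are hypothetical)
toy = pd.DataFrame(
    {
        "region": ["transcript", "transcript", "transcript"],
        "enst": ["ENST_A", "ENST_A", "ENST_B"],
        "csq": ["missense", "nonsense", "missense"],
        "mu": [0.02, 0.01, 0.03],
        "n_obs": [5, 1, 3],
    }
)

# Pivot csq into the columns, fill the gaps with 0, then pivot back
filled = (
    toy.set_index(["region", "enst", "csq"])
    .unstack(fill_value=0)
    .stack()
    .reset_index()
)

# The missing ENST_B / nonsense combination is now present
print(filled)
```

Only consequence values are completed this way: a transcript or region that is absent from the input altogether is not added. The zero fill also reaches non-count columns such as `mu`, so a filled row may need separate handling if those columns are used downstream.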

## Write to output

In [8]:
df.to_csv(
    "../outputs/expected_variants_all_regions_no_cpg.tsv", sep="\t", index=False
)