# Regional constraint vs gnomAD constraint
Exploratory data analysis comparing regional constraint annotations with gnomAD constraint data.

In [1]:
import pandas as pd
import numpy as np
from src import constants as C

Get the constraint summary data.

In [2]:
df = pd.read_csv(C.REGIONAL_NONSENSE_CONSTRAINT, sep="\t")
df.sample(3)

Unnamed: 0,enst,region,csq,n_pos,n_obs,n_exp,oe,prop_obs,prop_exp,mu,chi2,z,p,fdr_p,pli,loeuf,gnomad_flags,syn_z,constraint
12888,ENST00000393409,distal_nmd,stop_gained,19.0,2.0,3.182,0.628536,0.105263,0.167474,5.488922e-08,,,,,0.60692,0.486,[],-0.699559,
40551,ENST00000262238,nmd_target,stop_gained,93.0,1.0,16.213,0.061679,0.010753,0.174333,2.819781e-07,17.28867,-4.157965,3.2e-05,0.000215,1.0,0.2,[],0.343072,constrained
44582,ENST00000319098,nmd_target,stop_gained,0.0,0.0,0.0,,,,,,,,,,,"[""no_exp_lof""]",,


## Genes which are not constrained in gnomAD, but are constrained at the transcript level in our analysis.

In [3]:
# Masks
m1 = df["region"] == "transcript"
m2 = df["constraint"] == "constrained"
m3 = df["pli"] < 0.9
m4 = df["pli"].isna()
m5 = df["loeuf"] > 0.6
m6 = df["loeuf"].isna()

new = df[m1 & m2 & (m3 | m4) & (m5 | m6)].copy()
new.shape

(364, 19)

There are 364 "newly-constrained" transcripts from our analysis.

Why might they be constrained for nonsense variants, but not constrained in gnomAD?

- **Splicing variants** pLI and LOEUF are calculated using pLoF SNVs. These include both nonsense and canonical splice site SNVs. Our metric is calculated only for nonsense variants. Splicing variants are perhaps less likely to cause true LoF than nonsense variants. E.g. ENST00000005260.
- **Noncoding transcripts** E.g. ENST00000683988. We have constraint annotations for non-coding transcripts. For example, this is the canonical transcript for a non-coding gene (NMD) which overlaps KMT2D. This appears to be the most common reason for the "no_exp_lof" flag.
- **LOEUF may be more conservative** The error margin around the LOEUF score makes it a more conservative measure. E.g. ENST00000693108 has LOEUF 0.61, but O/E (in gnomAD) 0.38, and O/E (here) 0.31.
- **Missing in gnomAD** e.g. ENST00000683779. No constraint statistics are available for this transcript in gnomAD, and the reasons for this have not been flagged. It appears to have quite poor coverage.
- **Flagged in gnomAD**  Constraint scores have not been calculated in gnomAD because the transcript has been flagged. e.g. ENST00000248071. This is an interesting example. The transcript is marked with the "outlier_syn" flag in gnomAD. But in our analysis the transcript-level synonymous Z score is 1.98. This difference may be due to different coverage cutoffs. The gnomAD constraint work is done on sites with median coverage >30. Ours is on sites with median coverage >30. Furthermore, we only use a synonymous Z score cutoff, and then only for lower-than-expected synonymous values (Z < -1) 
- **Other** There may be other important factors I have not considered. Those mentioned above are apparent from a skim of the data.

Let's limit this list to protein-coding transcripts.

In [4]:
gene_list = pd.read_csv(C.CANONICAL_CDS_GENE_IDS, sep="\t")

new_coding = new[new.enst.isin(gene_list.transcript_id)].copy()
new_coding.shape

(364, 19)

All of the transcripts are found in the canonical gene list. This is problematic and suggests an error in transcript curation. 

I am currently filtering for transcripts in which the "gene_type" is "protein_coding". However, a "protein_coding" gene_type does not equate to a protein coding canonical transcript. Many genes are potentially protein coding, but their canonical transcripts are subject to NMD and lack a protein-coding tag.

I should switch to filtering for transcripts in which the "transcript_type" is protein coding. This will involve rerunning most of my pipeline.

A similar approach is to exclude transcripts with the "no_exp_lof" flag in gnomAD.

In [5]:
m7 = new.gnomad_flags.fillna("").str.contains("no_exp_lof")

new_coding = new[~m7].copy()
new_coding.shape

(343, 19)

We drop 21 transcripts with the "no_exp_lof" flag.

We should also drop those transcripts which are missing constraint annotations in gnomAD. Possibly this is due to limited coverage.

In [6]:
new_coding = new_coding.dropna()
new_coding.shape

(338, 19)

A further five transcripts are dropped.

In [7]:
new_coding.gnomad_flags.value_counts()

[]                               325
["outlier_syn"]                    7
["outlier_mis","outlier_syn"]      5
["outlier_mis"]                    1
Name: gnomad_flags, dtype: int64

Thirteen of the transcripts are outliers for synonymous / missense variants in gnomAD. 

In total, we find 325 "newly-constrained" transcripts.

These are likely to be accounted for by:
1) The inclusion of splicing variants in gnomAD constraint.
2) The more conservative nature of LOEUF.

## Genes which are not constrained in gnomAD, but which have regional nonsense constraint.
How many genes are not constrained in gnomAD, but have regional nonsense constraint?

In [8]:
m8 = df["region"] != "transcript"

new_regional = df[m8 & m2 & m3 & m5].copy() # Drops NaN values in pli & loeuf columns
new_regional.enst.nunique()

605

In [9]:
new_regional.region.value_counts()

nmd_target        307
distal_nmd        244
long_exon          58
start_proximal     24
Name: region, dtype: int64

In [10]:
new_regional.gnomad_flags.value_counts()

[]                               612
["outlier_mis","outlier_syn"]     11
["outlier_syn"]                   10
Name: gnomad_flags, dtype: int64

21 are flagged as outliers for missense / synonymous variants in gnomAD.

## Number of genes with regional nonsense constraint

How many transcripts have a regional nonsense constraint annotation?

In [11]:
m11 = df.constraint == "constrained"
m12 = df.region.isin(["distal", "nmd_target", "long_exon", "start_proximal"])

df[m11 & m12].enst.nunique()

1967

How many transcripts are *specifically* constrained in NMD escape regions? (I.e. not constrained in NMD target regions.)

In [12]:
constrained_nmd_target = df[(df.region == "nmd_target") & m11].enst
# constrained_transcripts = df[(df.region == "transcript") & m11].enst

m13 = df.enst.isin(constrained_nmd_target)
# m14 = df.enst.isin(constrained_transcripts)

df[m11 & m12 & ~m13].enst.nunique()

320