# Regional constraint vs gnomAD constraint
Exploratory data analysis comparing regional constraint annotations with gnomAD constraint data.

In [1]:
import pandas as pd
from src import constants as C

Get the constraint summary data.

In [2]:
df = pd.read_csv(C.REGIONAL_NONSENSE_CONSTRAINT, sep="\t")
df.sample(3)

| | enst | region | csq | n_pos | n_obs | n_exp | oe | prop_obs | prop_exp | mu | chi2 | z | p | fdr_p | pli | loeuf | gnomad_flags | syn_z | constraint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60194 | ENST00000255499 | start_proximal | stop_gained | 13.0 | 0.0 | 3.381 | 0.0 | 0.0 | 0.260077 | 6.40622e-08 | | | | | | | | -1.939672 | |
| 61004 | ENST00000263549 | start_proximal | stop_gained | 11.0 | 5.0 | 2.393 | 2.089427 | 0.454545 | 0.217545 | 4.35825e-08 | | | | | 5.2449e-17 | 1.023 | [] | 2.323923 | |
| 45984 | ENST00000326427 | nmd_target | stop_gained | 57.0 | 8.0 | 12.138 | 0.659087 | 0.140351 | 0.212947 | 3.128969e-07 | 1.79238 | -1.338798 | 0.180636 | 0.28249 | 4.2564e-05 | 1.167 | [] | 0.587984 | unconstrained |


## Genes which are not constrained in gnomAD, but are constrained at the transcript level in our analysis.

In [3]:
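# Masks
m1 = df["region"] == "transcript"  # transcript-level rows only
m2 = df["constraint"] == "constrained"  # constrained in our analysis
m3 = df["pli"] < 0.9  # below the pLI cutoff (False where pLI is NaN)
m4 = df["pli"].isna()  # pLI missing
m5 = df["loeuf"] > 0.6  # above the LOEUF cutoff (False where LOEUF is NaN)
m6 = df["loeuf"].isna()  # LOEUF missing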

new = df[m1 & m2 & (m3 | m4) & (m5 | m6)].copy()
new.shape

(377, 19)

There are 377 "newly-constrained" transcripts from our analysis.
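The `isna()` masks matter because comparisons against `NaN` evaluate to `False` in pandas, so `df["pli"] < 0.9` on its own would silently exclude transcripts with no gnomAD score. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from the real table):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"pli": [0.99, 0.10, np.nan]})

below = toy["pli"] < 0.9  # the NaN row compares as False
below_or_missing = below | toy["pli"].isna()  # keeps the NaN row as well

print(below.tolist())
print(below_or_missing.tolist())
```

The second mask is the pattern used for `(m3 | m4)` and `(m5 | m6)`.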

Why might they be constrained for nonsense variants, but not constrained in gnomAD?

- **Splicing variants** pLI and LOEUF are calculated using pLoF SNVs. These include both nonsense and canonical splice site SNVs. Our metric is calculated only for nonsense variants. Splicing variants are perhaps less likely to cause true LoF than nonsense variants. E.g. ENST00000005260.
- **Non-coding transcripts** We have constraint annotations for non-coding transcripts. E.g. ENST00000683988, which is the canonical transcript for a non-coding gene (NMD) which overlaps KMT2D. This appears to be the most common reason for the "no_exp_lof" flag.
- **LOEUF may be more conservative** The error margin around the LOEUF score makes it a more conservative measure. E.g. ENST00000693108 has LOEUF 0.61, but O/E (in gnomAD) 0.38, and O/E (here) 0.31.
- **Missing in gnomAD** E.g. ENST00000683779. No constraint statistics are available for this transcript in gnomAD, and the reasons for this have not been flagged. It appears to have quite poor coverage.
- **Flagged in gnomAD** Constraint scores have not been calculated in gnomAD because the transcript has been flagged. E.g. ENST00000248071, which is an interesting example. The transcript is marked with the "outlier_syn" flag in gnomAD, but in our analysis the transcript-level synonymous Z score is 1.98. This difference may be due to different coverage cutoffs. The gnomAD constraint work is done on sites with median coverage >30. Ours is on sites with median coverage >30. Furthermore, we only use a synonymous Z score cutoff, and then only for lower-than-expected synonymous values (Z < -1).
- **Other** There may be other important factors I have not considered. Those mentioned above are apparent from a skim of the data.
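To illustrate why LOEUF is more conservative than the O/E point estimate, here is a rough sketch. It assumes, as I understand the gnomAD description, that LOEUF is the upper bound of a 90% confidence interval around O/E under a Poisson model. The functions `poisson_pmf` and `oe_upper_bound` are hypothetical helpers written for this note; this is a simplified approximation, not gnomAD's code.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing k events when lam are expected."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def oe_upper_bound(n_obs, n_exp, alpha=0.05, step=0.001, max_oe=2.0):
    """Approximate upper bound of a (1 - 2 * alpha) interval on O/E.

    Assumption: the density of n_obs is evaluated over a grid of candidate
    O/E values, normalised, and the bound is read off the cumulative sum.
    """
    n_steps = round(max_oe / step)
    grid = [i * step for i in range(n_steps + 1)]
    density = [poisson_pmf(n_obs, n_exp * oe) for oe in grid]
    total = sum(density)
    cumulative = 0.0
    for oe, d in zip(grid, density):
        cumulative += d
        if cumulative / total >= 1 - alpha:
            return oe
    return max_oe


# Same O/E point estimate, ten times the data: the upper bound tightens.
small = oe_upper_bound(n_obs=8, n_exp=12.138)
large = oe_upper_bound(n_obs=80, n_exp=121.38)
print(round(8 / 12.138, 3), round(small, 3), round(large, 3))
```

With few expected variants the upper bound sits well above the point estimate, so a transcript can have a low O/E and still fail a LOEUF cutoff.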

Let's limit this list to protein-coding transcripts.

In [4]:
gene_list = pd.read_csv(C.CANONICAL_CDS_GENE_IDS, sep="\t")

new_coding = new[new.enst.isin(gene_list.transcript_id)].copy()
new_coding.shape

(377, 19)

All of the transcripts are found in the canonical gene list. This is problematic and suggests an error in transcript curation. 

I am currently filtering for transcripts in which the "gene_type" is "protein_coding". However, a "protein_coding" gene_type does not equate to a protein coding canonical transcript. Many genes are potentially protein coding, but their canonical transcripts are subject to NMD and lack a protein-coding tag.

I should switch to filtering for transcripts in which the "transcript_type" is protein coding. This will involve rerunning most of my pipeline.
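A minimal sketch of the difference between the two filters, using a made-up annotation table. The column names mirror the "gene_type" and "transcript_type" attributes mentioned above; the transcript IDs and rows are hypothetical.

```python
import pandas as pd

# Hypothetical annotation table (illustrative rows only)
annotation = pd.DataFrame(
    {
        "transcript_id": ["ENST_A", "ENST_B", "ENST_C"],
        "gene_type": ["protein_coding", "protein_coding", "lncRNA"],
        "transcript_type": ["protein_coding", "nonsense_mediated_decay", "lncRNA"],
    }
)

# Current approach: filter on the gene-level biotype
by_gene = annotation[annotation["gene_type"] == "protein_coding"]

# Proposed approach: filter on the transcript-level biotype
by_transcript = annotation[annotation["transcript_type"] == "protein_coding"]

print(by_gene["transcript_id"].tolist())
print(by_transcript["transcript_id"].tolist())
```

Here the gene-level filter keeps the NMD transcript, whereas the transcript-level filter drops it.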

A similar approach is to exclude transcripts with the "no_exp_lof" flag in gnomAD.

In [5]:
m7 = new.gnomad_flags.fillna("").str.contains("no_exp_lof")

new_coding = new[~m7].copy()
new_coding.shape

(356, 19)

We drop 21 transcripts with the "no_exp_lof" flag.

We should also drop those transcripts which are missing constraint annotations in gnomAD. Possibly this is due to limited coverage.

In [6]:
# Note: dropna() without arguments drops rows with a NaN in any column
new_coding = new_coding.dropna()
new_coding.shape

(350, 19)

A further six transcripts are dropped.
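If the intent is to drop only rows that lack gnomAD annotations, `dropna` accepts a `subset` argument. The sketch below uses a toy frame with made-up values to show how the two calls can differ; whether they differ on the real data depends on which other columns contain missing values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "enst": ["ENST_A", "ENST_B", "ENST_C"],
        "chi2": [4.2, np.nan, 6.1],
        "pli": [0.2, 0.5, np.nan],
        "loeuf": [0.9, 0.8, np.nan],
    }
)

any_missing_dropped = toy.dropna()  # drops ENST_B and ENST_C
gnomad_missing_dropped = toy.dropna(subset=["pli", "loeuf"])  # drops ENST_C only

print(any_missing_dropped["enst"].tolist())
print(gnomad_missing_dropped["enst"].tolist())
```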

In [7]:
new_coding.gnomad_flags.value_counts()

[]                               337
["outlier_syn"]                    7
["outlier_mis","outlier_syn"]      5
["outlier_mis"]                    1
Name: gnomad_flags, dtype: int64

Thirteen of the transcripts are outliers for synonymous / missense variants in gnomAD.

Excluding these, we find 337 "newly-constrained" transcripts.

These are likely to be accounted for by:
1) The inclusion of splicing variants in gnomAD constraint.
2) The more conservative nature of LOEUF.
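The count of 337 corresponds to the transcripts with an empty flag list. A sketch of that filter on a toy frame, assuming the flags are stored as strings such as `"[]"` (as the `value_counts()` output suggests); the transcript IDs are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical transcripts; flag strings in the format shown above)
toy = pd.DataFrame(
    {
        "enst": ["ENST_A", "ENST_B", "ENST_C", "ENST_D"],
        "gnomad_flags": [
            "[]",
            '["outlier_syn"]',
            '["outlier_mis","outlier_syn"]',
            "[]",
        ],
    }
)

# Keep only transcripts with no gnomAD flags
unflagged = toy[toy["gnomad_flags"] == "[]"].copy()
print(unflagged["enst"].tolist())
```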

## Genes which are not constrained in gnomAD, but which have regional nonsense constraint.

In [8]:
m8 = df["region"] != "transcript"

new_regional = df[m8 & m2 & m3 & m5].copy() # Drops NaN values in pli & loeuf columns
new_regional.enst.nunique()

622

There are 622 transcripts which are not constrained in gnomAD, but which have regional nonsense constraint.

In [9]:
new_regional.region.value_counts()

nmd_target        316
distal_nmd        251
long_exon          59
start_proximal     24
Name: region, dtype: int64

These 622 transcripts contain 650 constrained regions. 316 are NMD target regions. 334 are NMD escape regions (distal_nmd, long_exon, and start_proximal).
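The arithmetic behind these figures can be checked directly from the `value_counts()` output. The grouping of `distal_nmd`, `long_exon`, and `start_proximal` as NMD escape regions is my reading of the totals; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
region_counts = {
    "nmd_target": 316,
    "distal_nmd": 251,
    "long_exon": 59,
    "start_proximal": 24,
}

# Assumed grouping of NMD escape regions
NMD_ESCAPE = {"distal_nmd", "long_exon", "start_proximal"}

n_target = region_counts["nmd_target"]
n_escape = sum(n for region, n in region_counts.items() if region in NMD_ESCAPE)
n_regions = sum(region_counts.values())

print(n_target, n_escape, n_regions)
```

The total number of regions exceeds the 622 unique transcripts because a transcript can be constrained in more than one region.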

In [10]:
new_regional.gnomad_flags.value_counts()

[]                               629
["outlier_mis","outlier_syn"]     11
["outlier_syn"]                   10
Name: gnomad_flags, dtype: int64

21 of the 650 constrained regions are in transcripts flagged as outliers for missense / synonymous variants in gnomAD.
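To tally individual flags rather than flag combinations, the flag strings can be parsed as lists. This sketch assumes the strings are valid JSON, which holds for the three values shown above, and uses the reported counts rather than the real data frame.

```python
import json
from collections import Counter

# Row counts per flag string, copied from the value_counts() output above
flag_string_counts = {
    "[]": 629,
    '["outlier_mis","outlier_syn"]': 11,
    '["outlier_syn"]': 10,
}

# Count how many rows carry each individual flag
per_flag = Counter()
for flag_string, n_rows in flag_string_counts.items():
    for flag in json.loads(flag_string):
        per_flag[flag] += n_rows

print(dict(per_flag))
```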