# Regional constraint vs gnomAD constraint
Exploratory data analysis comparing regional constraint annotations with gnomAD constraint data.

In [1]:
# Imports
import pandas as pd
from src import constants as C

In [2]:
# Load the constraint data
df = pd.read_csv(C.REGIONAL_NONSENSE_CONSTRAINT, sep="\t")

## Comparing transcript-level constraint here and in gnomAD.

In [3]:
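# Masks
m1 = df["region"] == "transcript"  # transcript-level rows
m2 = df["constraint"] == "constrained"  # constrained in this analysis
m3 = df["pli"] < 0.9  # not constrained by pLI
m4 = df["pli"].isna()  # pLI missing in gnomAD
m5 = df["loeuf"] > 0.6  # not constrained by LOEUF
m6 = df["loeuf"].isna()  # LOEUF missing in gnomAD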

How many transcripts are constrained?

In [4]:
constrained_transcripts = df[m1 & m2]
print(f"Constrained transcripts: {len(constrained_transcripts)}")

Constrained transcripts: 1852


How many transcripts are not constrained in gnomAD, but show transcript-level constraint in our analysis?

In [5]:
new = df[m1 & m2 & (m3 | m4) & (m5 | m6)].copy()
print(f"Newly constrained transcripts: {len(new)}")

Newly constrained transcripts: 347
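
The explicit `isna()` masks matter because a comparison against a missing value evaluates to false, so a transcript with no gnomAD score would otherwise be dropped. The following is a minimal sketch of the same selection logic in plain Python, using hypothetical transcript names and scores (`math.nan` stands in for a missing gnomAD value).

```python
import math

# Hypothetical (pLI, LOEUF) pairs; math.nan stands in for a missing gnomAD score.
scores = {
    "tx_a": (0.99, 0.25),          # constrained by both metrics
    "tx_b": (0.10, 1.20),          # unconstrained by both metrics
    "tx_c": (math.nan, math.nan),  # no gnomAD constraint statistics
    "tx_d": (0.95, 0.70),          # constrained by pLI only
}


def not_constrained_in_gnomad(pli, loeuf):
    """Mirror (m3 | m4) & (m5 | m6): each metric is unconstrained or missing."""
    pli_ok = pli < 0.9 or math.isnan(pli)
    loeuf_ok = loeuf > 0.6 or math.isnan(loeuf)
    return pli_ok and loeuf_ok


# With the missing-value checks, tx_c is retained.
selected = [tx for tx, (pli, loeuf) in scores.items()
            if not_constrained_in_gnomad(pli, loeuf)]

# Without them, NaN comparisons are False and tx_c is silently lost.
naive = [tx for tx, (pli, loeuf) in scores.items()
         if pli < 0.9 and loeuf > 0.6]

print(selected)
print(naive)
```

Note that a transcript must be unconstrained (or missing) on both metrics to count as "not constrained in gnomAD", so `tx_d` is excluded in either case.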


Why might these transcripts be constrained in our analysis, but not in gnomAD?

- **Splicing variants.** pLI and LOEUF are calculated using pLoF SNVs, which include both nonsense and canonical splice-site SNVs. Our metric is calculated for nonsense variants only. Splicing variants are perhaps less likely than nonsense variants to cause true LoF. E.g. ENST00000005260.
- **LOEUF may be more conservative.** LOEUF is the upper bound of the confidence interval around O/E, which makes it a more conservative measure than O/E itself. E.g. ENST00000693108 has LOEUF 0.61, but O/E (in gnomAD) is 0.38, and O/E (here) is 0.31.
- **Missing in gnomAD.** E.g. ENST00000683779. No constraint statistics are available for this transcript in gnomAD, and the reasons for this have not been flagged. It appears to have quite poor coverage.
- **Flagged in gnomAD.** Constraint scores have not been calculated in gnomAD because the transcript has been flagged. E.g. ENST00000248071. This is an interesting example. The transcript is marked with the "outlier_syn" flag in gnomAD, but in our analysis the transcript-level synonymous Z score is 1.98. This difference may be due to different coverage cutoffs: the gnomAD constraint work is done on sites with median coverage >30, whereas ours is done on sites with median coverage >20. Furthermore, we only use a synonymous Z score cutoff, and then only for lower-than-expected synonymous values (Z < -1).
- **Other.** There may be other important factors I have not considered. Those mentioned above are apparent from a skim of the data.
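
To illustrate the LOEUF point, here is a rough sketch of how an upper bound on O/E can sit above 0.6 even when the point estimate is well below it. It assumes a simple Poisson model for the observed count and a flat grid of candidate O/E ratios. The function name `oe_upper_bound`, the grid settings, and the counts are all hypothetical, and this is not the gnomAD implementation.

```python
import math


def oe_upper_bound(obs, exp, alpha=0.05, grid_max=2.0, steps=2000):
    """Approximate upper bound on O/E under an assumed Poisson model.

    Evaluates the Poisson likelihood of `obs` for each candidate ratio r
    (mean = r * exp), normalises over the grid, and returns the first ratio
    at which the cumulative weight reaches 1 - alpha.
    """
    grid = [grid_max * i / steps for i in range(1, steps + 1)]
    # Work in log space to keep the likelihood numerically stable.
    dens = [
        math.exp(obs * math.log(r * exp) - r * exp - math.lgamma(obs + 1))
        for r in grid
    ]
    total = sum(dens)
    cumulative = 0.0
    for r, d in zip(grid, dens):
        cumulative += d / total
        if cumulative >= 1 - alpha:
            return r
    return grid_max


# Hypothetical counts: 5 observed against 16 expected nonsense variants.
obs, exp = 5, 16
point_estimate = obs / exp
upper = oe_upper_bound(obs, exp)
print(f"O/E = {point_estimate:.2f}, upper bound = {upper:.2f}")
```

With few expected variants the interval is wide, so a transcript can have a low O/E and still fail a cutoff applied to the upper bound.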

### The "no_exp_lof" flag
This is a common flag in the gnomAD constraint data, flagging transcripts in which no pLoF variants are expected.

Do we see any constrained transcripts with this flag?

In [6]:
m7 = new.gnomad_flags.fillna("").str.contains("no_exp_lof")

new[m7]

Unnamed: 0,enst,region,csq,n_pos,n_obs,n_exp,oe,prop_obs,prop_exp,mu,chi2,z,p,fdr_p,pli,loeuf,gnomad_flags,syn_z,constraint
86891,ENST00000377741,transcript,stop_gained,66.0,5.0,15.887,0.314723,0.075758,0.240712,3.879214e-07,9.825804,-3.134614,0.001720805,0.004624145,,,"[""no_exp_lof""]",-0.533593,constrained
93210,ENST00000650528,transcript,stop_gained,356.0,21.0,89.02,0.235902,0.058989,0.250056,1.969831e-06,69.303782,-8.324889,8.440758e-17,2.351972e-15,,,"[""no_exp_lof""]",1.935121,constrained


**ENST00000377741** has a single coding exon, plus a 5' UTR exon. All pLoFs in gnomAD are flagged as low-confidence due to "END_TRUNC" (end truncation). In our analysis it is constrained for nonsense variants.

**ENST00000650528** is a single-exon gene. All pLoFs in gnomAD are flagged as "END_TRUNC", and also by LOFTEE as belonging to a single-exon gene. Interestingly, it is a known morbid gene (MAGEL2), in which heterozygous pLoF variants cause an autosomal dominant (AD) developmental disorder called Schaaf-Yang syndrome (MIM 615547).

#### Non-coding transcripts

Earlier in our analysis, I had mistakenly included some non-coding transcripts in our transcript curation.

I had previously filtered for transcripts in which the "gene_type" is "protein_coding". However, a "protein_coding" gene_type does not guarantee that the canonical transcript is protein coding. Many genes are potentially protein coding, but their canonical transcripts are subject to NMD and lack a protein-coding tag.

E.g. ENST00000683988 is a non-coding transcript which appeared to be constrained. It is no longer included in our transcript set. It is the canonical transcript for a non-coding gene (NMD) which overlaps KMT2D, which explains the apparent paucity of LoF variants. This scenario appears to be the most common reason for the "no_exp_lof" flag.

Accordingly, I have now re-run my analysis pipeline using transcripts in which the "transcript_type" is protein coding.
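
The difference between the two filters can be sketched on a few hypothetical annotation records. The transcript IDs and the record layout below are illustrative assumptions and do not reflect the actual pipeline code. Only the "gene_type" and "transcript_type" field names come from the text above.

```python
# Hypothetical annotation records, one per canonical transcript.
transcripts = [
    {"enst": "ENST_A", "gene_type": "protein_coding",
     "transcript_type": "protein_coding"},
    {"enst": "ENST_B", "gene_type": "protein_coding",
     "transcript_type": "nonsense_mediated_decay"},
    {"enst": "ENST_C", "gene_type": "lncRNA",
     "transcript_type": "lncRNA"},
]

# Previous filter: keeps ENST_B even though the transcript itself is not coding.
by_gene_type = [
    t["enst"] for t in transcripts if t["gene_type"] == "protein_coding"
]

# Current filter: keeps only transcripts that are themselves protein coding.
by_transcript_type = [
    t["enst"] for t in transcripts if t["transcript_type"] == "protein_coding"
]

print(by_gene_type)
print(by_transcript_type)
```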

### gnomAD flags in newly constrained transcripts

In [7]:
print("gnomAD flag value counts:\n")
print(new.gnomad_flags.value_counts(dropna=False))

gnomAD flag value counts:

[]                               327
["outlier_syn"]                    7
NaN                                5
["outlier_mis","outlier_syn"]      5
["no_exp_lof"]                     2
["outlier_mis"]                    1
Name: gnomad_flags, dtype: int64


In [8]:
print(
    f"Newly constrained transcripts with no gnomAD flags: "
    f"{(new.gnomad_flags == '[]').sum()}"
)

Newly constrained transcripts with no gnomAD flags: 327
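
The flag column holds JSON-style list strings, so combined values such as `["outlier_mis","outlier_syn"]` are counted separately from the single flags. As a small sketch, the value counts reported for cell 7 can be re-tallied per individual flag by parsing each string. The counts are copied from that output, and `None` stands in for NaN.

```python
import json
from collections import Counter

# Raw flag strings and counts copied from the cell 7 output; None stands in for NaN.
raw_counts = {
    "[]": 327,
    '["outlier_syn"]': 7,
    None: 5,
    '["outlier_mis","outlier_syn"]': 5,
    '["no_exp_lof"]': 2,
    '["outlier_mis"]': 1,
}

per_flag = Counter()
for raw, n in raw_counts.items():
    # Missing values carry no flags; everything else parses as a JSON list.
    flags = json.loads(raw) if raw is not None else []
    for flag in flags:
        per_flag[flag] += n

total_transcripts = sum(raw_counts.values())
print(dict(per_flag))
print(total_transcripts)
```

The total matches the 347 newly constrained transcripts reported earlier, which is a useful sanity check that no rows were dropped.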


## Regional nonsense constraint versus gnomAD constraint
How many transcripts are not constrained in gnomAD, but have regional nonsense constraint?

In [9]:
# Masks
m1 = df["region"] == "transcript"
m2 = df["constraint"] == "constrained"
m3 = df["pli"] < 0.9
m4 = df["pli"].isna()
m5 = df["loeuf"] > 0.6
m6 = df["loeuf"].isna()

new_regional = df[~m1 & m2 & (m3 | m4) & (m5 | m6)].copy()

print(f"Unique transcripts with new regional constraint: {new_regional.enst.nunique()}")

Unique transcripts with new regional constraint: 625


In [10]:
print(f"New constrained regions value counts:\n\n{new_regional.region.value_counts()}")

New constrained regions value counts:

nmd_target        310
distal_nmd        253
long_exon          60
start_proximal     32
Name: region, dtype: int64


In [11]:
print(
    f"Value counts of gnomAD flags in transcripts with new regional constraint:\n"
    f"{new_regional.drop_duplicates(subset='enst').gnomad_flags.value_counts(dropna=False)}"
)

Value counts of gnomAD flags in transcripts with new regional constraint:
[]                               587
NaN                               17
["outlier_syn"]                    9
["outlier_mis","outlier_syn"]      9
["no_exp_lof"]                     3
Name: gnomad_flags, dtype: int64


## Specific regional nonsense constraint

How many transcripts have a regional nonsense constraint annotation?

How many transcripts are *specifically* constrained in NMD escape regions (i.e. not constrained in NMD target regions)?

In [12]:
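m1 = df.constraint == "constrained"
# NOTE: the region value counts in cell 10 use the label "distal_nmd". If
# "distal" is not a value in df.region, distal regions are silently excluded
# from m2 and m4 below.
m2 = df.region.isin(["distal", "nmd_target", "long_exon", "start_proximal"])  # any region
m3 = df.region == "nmd_target"  # NMD target region
m4 = df.region.isin(["distal", "long_exon", "start_proximal"])  # NMD escape regions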

constrained_nmd_target = df[m1 & m3].enst
constrained_nmd_escape = df[m1 & m4].enst

m5 = df.enst.isin(constrained_nmd_target)
m6 = df.enst.isin(constrained_nmd_escape)

print(f"Transcripts with a constrained region: {df[m1 & m2].enst.nunique()}")
print(
    f"Transcripts constrained in NMD target regions: "
    f"{constrained_nmd_target.nunique()}"
)
print(
    f"Transcripts constrained in NMD escape regions: "
    f"{constrained_nmd_escape.nunique()}"
)
print(
    f"Transcripts constrained in NMD target regions, but not NMD escape regions: "
    f"{df[m1 & m3 & ~m6].enst.nunique()}"
)
print(
    f"Transcripts constrained in NMD-escape regions, but not NMD target regions: "
    f"{df[m1 & m4 & ~m5].enst.nunique()}"
)

Transcripts with a constrained region: 1947
Transcripts constrained in NMD target regions: 1629
Transcripts constrained in NMD escape regions: 548
Transcripts constrained in NMD target regions, but not NMD escape regions: 1399
Transcripts constrained in NMD-escape regions, but not NMD target regions: 318
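
As a quick consistency check on these figures, the "target only", "escape only", and "both" groups should partition the full set of transcripts with a constrained region. The sketch below uses only the counts printed above.

```python
# Counts copied from the output above.
n_any = 1947          # transcripts with a constrained region
n_target = 1629       # constrained in NMD target regions
n_escape = 548        # constrained in NMD escape regions
n_target_only = 1399  # target regions, but not escape regions
n_escape_only = 318   # escape regions, but not target regions

# Transcripts constrained in both, derived two independent ways.
both_from_target = n_target - n_target_only
both_from_escape = n_escape - n_escape_only

# The three disjoint groups should add up to the overall total.
partition_total = n_target_only + n_escape_only + both_from_target

print(both_from_target, both_from_escape, partition_total)
```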


How many transcripts are constrained in >1 region?

In [25]:
occurrence_counts = df[m1 & m2]["enst"].value_counts().value_counts()
print(f"Transcripts by number of constrained regions:\n{occurrence_counts}")

Transcripts by number of constrained regions:
1    1715
2     228
3       4
Name: enst, dtype: int64
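
The chained `value_counts().value_counts()` call first counts constrained regions per transcript, then counts how many transcripts share each region count. Here is a toy sketch of the same two-step tally with `collections.Counter`, using hypothetical transcript IDs.

```python
from collections import Counter

# Hypothetical rows: one entry per constrained region, labelled by transcript.
rows = ["tx1", "tx1", "tx2", "tx3", "tx3", "tx3"]

# Step 1: number of constrained regions per transcript.
regions_per_transcript = Counter(rows)

# Step 2: number of transcripts having each region count.
transcripts_by_n_regions = Counter(regions_per_transcript.values())

print(dict(regions_per_transcript))
print(dict(transcripts_by_n_regions))
```

Applied to the output above, the three groups sum to 1715 + 228 + 4 = 1947, the same total reported for transcripts with a constrained region.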


How many transcripts with regional nonsense constraint are constrained at the transcript level?

In [30]:
_region = df[m1 & m2].enst.drop_duplicates()
_transcript = df[m1 & (df.region == "transcript")].enst.drop_duplicates()
_rt = _region[_region.isin(_transcript)]

print(f"Transcripts with regional constraint: {len(_region)}")
print(f"Transcripts with regional and transcript-level constraint: {len(_rt)}")

Transcripts with regional constraint: 1947
Transcripts with regional and transcript-level constraint: 1206
