# NMD annotations
This script labels genomic positions in coding sequence with nonsense-mediated decay (NMD) annotations. The annotations are:
1) Start-proximal (<150nt from the translation start site)
2) Long exons (>400nt upstream of the splice donor site)
3) Last exon
4) 50nt rule (within the 3'-most 50nt of the penultimate exon)

There are two approaches I could take.
First, I could annotate every position in the exome against these criteria.
Alternatively, I could construct a BED file of the regions that meet each criterion.

I favour per-site annotation for maximum flexibility.
So, for each position, I will need to annotate:
1) The distance from the translation start site
2) The distance upstream from the splice donor site
3) Whether it is in the last exon
4) Whether it is within the last 50nt of the penultimate exon
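
As a rough illustration of why per-site annotation is flexible, the four columns can later be collapsed into a single label per position. The sketch below runs on a toy table. The function name `label_nmd`, the label strings, and the precedence order (last exon, then 50nt rule, then start-proximal, then long exon) are assumptions for illustration and are not part of the original analysis. The 150nt and 400nt thresholds are those listed above.

```python
import numpy as np
import pandas as pd


def label_nmd(sites: pd.DataFrame) -> pd.Series:
    """Collapse per-site NMD columns into one label (hypothetical helper)."""
    conditions = [
        sites["last_exon"] == 1,
        sites["fifty_nt_rule"] == 1,
        sites["start_distance"] < 150,
        sites["splice_donor_distance"] > 400,  # NaN compares as False
    ]
    names = ["last_exon", "fifty_nt", "start_proximal", "long_exon"]
    # The first matching condition wins; the precedence order is an assumption
    values = np.select(conditions, names, default="nmd_target")
    return pd.Series(values, index=sites.index)


# Toy per-site table with the four annotation columns
sites = pd.DataFrame(
    {
        "start_distance": [10, 200, 900, 1200, 1500],
        "splice_donor_distance": [120.0, 450.0, 60.0, 30.0, np.nan],
        "last_exon": [0, 0, 0, 0, 1],
        "fifty_nt_rule": [0, 0, 0, 1, 0],
    }
)
labels = label_nmd(sites)
print(labels.tolist())
```

A BED file of regions could be derived from the same table afterwards, whereas the reverse is not possible without recomputing the distances.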

## Import modules

In [1]:
conda install -c bioconda gtfparse -y

Installed gtfparse-1.2.1 (bioconda/noarch::gtfparse-1.2.1-pyh864c0ab_0) into /opt/conda.

Note: you may need to restart the kernel to use updated packages.


In [2]:
%%bash
dx download -o ../data/ data/gencode.v39.annotation.gtf

In [3]:
# Import the relevant modules
import numpy as np
import pandas as pd
import gtfparse

In [7]:
def get_gencode_gtf(path):
    """Read a GENCODE .gtf into memory with gtfparse"""
    gtf = gtfparse.read_gtf(path)
    return gtf

In [111]:
def get_canonical_cds(gtf):
    """Get the CDS features of each Ensembl_canonical transcript in GENCODE"""

    # Subset to exon and CDS features of Ensembl_canonical transcripts in protein-coding genes
    canonical_cds = gtf[
        ((gtf.feature == "exon") | (gtf.feature == "CDS"))
        & (gtf.tag.str.contains("Ensembl_canonical", na=False))
        & (gtf.gene_type == "protein_coding")
    ].copy()

    # Find the number of exons per transcript
    canonical_cds["exon_number"] = canonical_cds["exon_number"].astype(int)
    canonical_cds["exon_count"] = canonical_cds.groupby("transcript_id")["exon_number"].transform("max")

    # Find exon start and end positions
    exons = canonical_cds[canonical_cds.feature == "exon"]
    exons = exons[["exon_id", "start", "end"]]
    exons = exons.rename(columns={"start": "exon_start", "end": "exon_end"})
    exons = exons.drop_duplicates()

    # Subset to CDS only
    canonical_cds = canonical_cds[canonical_cds.feature == "CDS"]

    # Merge with exon start and end positions (exon_id is the only shared column)
    canonical_cds = canonical_cds.merge(exons, how="left", on="exon_id")
    canonical_cds = canonical_cds.drop_duplicates(["transcript_id", "exon_id", "start", "end"])

    # Number the CDS-containing exons within each canonical transcript, in exon order
    canonical_cds["cds_number"] = (
        canonical_cds.groupby("transcript_id")["exon_number"].rank().astype(int)
    )
    return canonical_cds
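
To make the difference between `exon_count`, `exon_number` and `cds_number` concrete, here is a minimal sketch on a made-up transcript whose first exon is entirely 5' UTR. The toy table and its values are assumptions for illustration. Only the `groupby` calls mirror the function above.

```python
import pandas as pd

# Hypothetical transcript: 4 exons, coding sequence only in exons 2-4
toy = pd.DataFrame(
    {
        "transcript_id": ["tx1"] * 7,
        "feature": ["exon", "exon", "exon", "exon", "CDS", "CDS", "CDS"],
        "exon_number": [1, 2, 3, 4, 2, 3, 4],
    }
)

# exon_count is computed while the exon rows are still present
toy["exon_count"] = toy.groupby("transcript_id")["exon_number"].transform("max")

# cds_number is computed after subsetting to CDS rows
toy_cds = toy[toy.feature == "CDS"].copy()
toy_cds["cds_number"] = (
    toy_cds.groupby("transcript_id")["exon_number"].rank().astype(int)
)
print(toy_cds[["exon_number", "exon_count", "cds_number"]])
```

Here the first coding exon has `exon_number` 2 but `cds_number` 1, while `exon_count` stays at 4 because it was taken from the exon rows.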

In [None]:
if __name__ == "__main__":
    # Read GTF data
    gencode_path = "../data/gencode.v39.annotation.gtf"
    gtf = get_gencode_gtf(gencode_path)

    # Define regions of interest
    cds = get_canonical_cds(gtf)
    cds = cds[
        [
            "seqname", "start", "end", "exon_end", "strand", "transcript_id",
            "exon_id", "exon_count", "exon_number", "cds_number",
        ]
    ]
    cds = cds.set_index(["seqname", "transcript_id", "exon_id"])

    # Expand each CDS interval (GTF coordinates are 1-based, inclusive) to one row per position
    cds["pos"] = cds.apply(lambda x: list(range(x["start"], x["end"] + 1)), axis=1)
    cds = cds.explode("pos")
    cds["pos"] = cds["pos"].astype(int)  # explode() leaves an object column

    # Coding length per transcript, and coding length within each exon
    cds["cds_len"] = cds.groupby(level="transcript_id")["pos"].transform("count")
    cds["exon_len"] = cds.groupby(level="exon_id")["pos"].transform("count")
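
The interval-to-position expansion can be checked on a small example. This is a sketch on a made-up two-exon transcript. The coordinates are assumptions, and a list comprehension stands in for `apply` purely for brevity.

```python
import pandas as pd

# Hypothetical CDS intervals (1-based, inclusive, as in GTF)
toy = pd.DataFrame(
    {
        "transcript_id": ["tx1", "tx1"],
        "exon_id": ["e1", "e2"],
        "start": [100, 200],
        "end": [102, 201],
    }
)

# One list of positions per interval, then one row per position
toy["pos"] = [list(range(s, e + 1)) for s, e in zip(toy["start"], toy["end"])]
toy = toy.explode("pos")
toy["pos"] = toy["pos"].astype(int)

# Number of coding positions per transcript and per exon
toy["cds_len"] = toy.groupby("transcript_id")["pos"].transform("count")
toy["exon_len"] = toy.groupby("exon_id")["pos"].transform("count")
print(toy)
```

Note that `exon_len` computed this way counts coding positions within the exon, not the full exon length including any UTR.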

## NMD annotations for + transcripts 

In [66]:
fwd = cds[cds["strand"]=="+"].copy()

### Start proximal (distance from start codon)

In [67]:
fwd["start_distance"] = fwd.groupby(level="transcript_id")["pos"].rank(ascending=True).astype(int)
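
Ranking positions within a transcript gives a distance in coding nucleotides, so introns are skipped automatically. A minimal sketch on a made-up + strand transcript (the coordinates are assumptions):

```python
import pandas as pd

# Hypothetical + strand CDS: positions 100-102 in exon 1, 200-201 in exon 2
toy = pd.DataFrame(
    {"transcript_id": ["tx1"] * 5, "pos": [100, 101, 102, 200, 201]}
).set_index("transcript_id")

toy["start_distance"] = (
    toy.groupby(level="transcript_id")["pos"].rank(ascending=True).astype(int)
)
print(toy)
```

Position 200 is 100 bases from the start in genomic terms, but it is the fourth coding nucleotide, which is the quantity the start-proximal rule needs.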

### Long exons (distance upstream from splice junction)

In [68]:
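# Cast to float so that the column can hold NaN
fwd["splice_donor_distance"] = ((fwd["exon_end"] - fwd["pos"]) + 1).astype(float)
# Where the CDS is in the last exon (i.e. contiguous with the 3' UTR) there is no downstream
# splice donor, so we should drop the "splice_donor_distance" annotation
fwd.loc[fwd["exon_number"] == fwd["exon_count"], "splice_donor_distance"] = np.nan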
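
A small worked example of the + strand calculation, assuming a made-up two-exon transcript. The last exon is masked by comparing `exon_number` with `exon_count`, the same comparison used for the "-" strand and for the `last_exon` flag.

```python
import numpy as np
import pandas as pd

# Hypothetical + strand transcript: exon 1 ends at 102, exon 2 (last) ends at 250
toy = pd.DataFrame(
    {
        "exon_number": [1, 1, 1, 2, 2],
        "exon_count": [2] * 5,
        "exon_end": [102, 102, 102, 250, 250],
        "pos": [100, 101, 102, 200, 201],
    }
)

# 1-based distance to the 3' end of the exon; float so the column can hold NaN
toy["splice_donor_distance"] = (toy["exon_end"] - toy["pos"] + 1).astype(float)

# The last exon has no downstream splice donor
toy.loc[toy["exon_number"] == toy["exon_count"], "splice_donor_distance"] = np.nan
print(toy)
```

The final base of exon 1 gets a distance of 1, and both positions in the last exon are left as missing.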

### Last exon

In [73]:
fwd["last_exon"] = np.where(fwd["exon_count"] == fwd["exon_number"], 1, 0) 

### 50nt rule

In [74]:
fwd["fifty_nt_rule"] = np.where((fwd["exon_number"] == fwd["exon_count"] - 1)
                            & ((fwd["end"] - fwd["pos"]) + 1 <= 50),
                            1,
                            0
                           ) 
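
As a sanity check on the `<= 50` boundary, the same expression can be applied to a made-up 60nt penultimate exon. The coordinates are assumptions. Exactly the 50 positions closest to the 3' end should be flagged.

```python
import numpy as np
import pandas as pd

# Hypothetical penultimate exon (exon 2 of 3) covering positions 1001-1060 on the + strand
toy = pd.DataFrame({"pos": range(1001, 1061)})
toy["end"] = 1060
toy["exon_number"] = 2
toy["exon_count"] = 3

toy["fifty_nt_rule"] = np.where(
    (toy["exon_number"] == toy["exon_count"] - 1)
    & ((toy["end"] - toy["pos"]) + 1 <= 50),
    1,
    0,
)
flagged = toy.loc[toy["fifty_nt_rule"] == 1, "pos"]
print(flagged.min(), flagged.max(), len(flagged))
```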

## NMD annotations for - transcripts 

In [83]:
rev = cds[cds["strand"]=="-"].copy()

### Start proximal (distance from start codon)

In [84]:
rev["start_distance"] = rev.groupby(level="transcript_id")["pos"].rank(ascending=False).astype(int)

### Long exons (distance upstream from splice junction)

In [86]:
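# On the - strand the 3' end of a feature is its lowest genomic coordinate ("start")
# Cast to float so that the column can hold NaN
rev["splice_donor_distance"] = ((rev["pos"] - rev["start"]) + 1).astype(float)
# The last exon has no downstream splice donor
rev.loc[rev["exon_number"] == rev["exon_count"], "splice_donor_distance"] = np.nan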

### Last exon

In [73]:
rev["last_exon"] = np.where(rev["exon_count"] == rev["exon_number"], 1, 0) 

### 50nt rule

In [74]:
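# On the - strand the 3' end of a feature is its lowest genomic coordinate ("start"),
# so the distance is measured from "start" rather than "end"
rev["fifty_nt_rule"] = np.where((rev["exon_number"] == rev["exon_count"] - 1)
                            & ((rev["pos"] - rev["start"]) + 1 <= 50),
                            1,
                            0
                           )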
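
Because GTF `start` is always the lower coordinate regardless of strand, the 3' end of a feature is `end` on the + strand and `start` on the - strand. One possible way to avoid duplicating the logic per strand is to pick the reference coordinate with `np.where`. The helper name `three_prime_distance` and the toy coordinates below are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def three_prime_distance(sites: pd.DataFrame) -> pd.Series:
    """1-based distance from each position to the 3' end of its feature (hypothetical helper)."""
    dist = np.where(
        sites["strand"] == "+",
        sites["end"] - sites["pos"] + 1,    # + strand: 3' end is "end"
        sites["pos"] - sites["start"] + 1,  # - strand: 3' end is "start"
    )
    return pd.Series(dist, index=sites.index)


# The same 61nt interval on each strand, sampled at its two ends
toy = pd.DataFrame(
    {
        "strand": ["+", "+", "-", "-"],
        "start": [100, 100, 100, 100],
        "end": [160, 160, 160, 160],
        "pos": [100, 160, 100, 160],
    }
)
toy["dist_3p"] = three_prime_distance(toy)
print(toy)
```

With a column like this, the 50nt rule reduces to `dist_3p <= 50` in the penultimate exon for both strands, and the per-strand tables could be recombined with `pd.concat([fwd, rev])` if separate handling is kept.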