# How many cryptic PAS are supported by polyA tail containing reads?

Strategy for analysis:
- Predicted PAS from GTF file
- Subset quant GTF to cryptics
- pr.k_nearest on each interval (pick relaxed, allow overlaps + either direction)

Q1: What %age of events are supported by reads at given distance threshold?
- e.g. 0, 10, 25, 50, 100
- Each le_id can contain multiple 3'ends per event - first collapse to min distance as representative for each le_id
- Then apply range of cut-offs - number of events in each window


Q2: Are distinct isoforms supported for each event?
- e.g. STMN2s and ARHGAP32s of the cryptic world
- perform naive clustering/collapsing of pas by le_id e.g. 25/50nt 
  - How many PAS have multiple unique predicted isoforms?
  - How many have junction read support?




TO CONSIDER:

3-5 overhang could be missing some lower confidence/shorter supporting reads. Repeat counting of lower confidence PAS by running count_pas with BED file containing cryptic PAS from PAPA (all PAS for each le_id) AND/OR (e.g. separate runs) PAS clusters defined using the 3-5/6+ criteria (i.e. for the same clusters, is there additional support from short overhangs?)

In [5]:
import pyranges as pr
import pandas as pd
import numpy as np
from typing import Union, List, Optional, Tuple, Dict
import os

In [15]:
def nearest_threshold_wrapper(gr: pr.PyRanges,
                              gr2: pr.PyRanges,
                              thresholds: Union[int, List[int]],
                              id_col: str = "le_id",
                              nearest_only: bool = True,
                              return_grs: bool = False,
                              nearest_kwargs: dict = {"k": 10,
                                                      "strandedness": "same",
                                                      "overlap": True,
                                                      "how": None},
                              ) -> Tuple[Dict[str, set], Optional[Dict[str, pr.PyRanges]]]:
    '''Get events with nearest overlap passing a maximum distance threshold(s)

    Parameters
    ----------
    gr : pr.PyRanges
        _description_
    gr2 : pr.PyRanges
        _description_
    thresholds : Union[int, List[int]]
        _description_
    id_col : str, optional
        _description_, by default "le_id"
    nearest_only : bool, optional
        _description_, by default True
    return_grs : bool, optional
        _description_, by default False
    nearest_kwargs : _type_, optional
        _description_, by default {"k": 10, "strandedness": "same", "overlap": True, "how": None}

    Returns
    -------
    Tuple[Dict[str, set], Optional[Dict[str, pr.PyRanges]]]
        return {threshold: set(id_col values)} and optionally {threshold: pr.PyRanges} (if return_grs=True)
    '''
    

    if nearest_only:
        if "k" in nearest_kwargs.keys():
            kwargs = {key: v for key, v in nearest_kwargs.items()if key != "k"}
        gr_nr = gr.nearest(gr2, **kwargs)

    else:
        assert "k" in nearest_kwargs.keys()
        gr_nr = gr.k_nearest(gr2, **nearest_kwargs)


    # convert to absolute values (don't care if up/downstream rn)
    gr_nr = gr_nr.assign("DistanceAbs", lambda df: df.Distance.abs())
    
    # if nearest_only:
    #     # # for each le_id, select the smallest distance
    #     gr_nr = gr_nr.apply(lambda df: df.sort_values(by=[id_col, "DistanceAbs"]).drop_duplicates(subset=[id_col])).sort()

    if isinstance(thresholds, int):
        # subset to those pasing distance threshold
        pass_grs = {str(thresholds): gr_nr.subset(lambda df: df.DistanceAbs.le(thresholds))}

    else:
        pass_grs = {str(threshold): gr_nr.subset(lambda df: df.DistanceAbs.le(threshold)) for threshold in thresholds}

    # return set of ids at each threshold
    pass_ids = {threshold: set(gr.as_df()[id_col]) for threshold, gr in pass_grs.items()}

    if return_grs:
        return pass_ids, pass_grs
    else:
        return pass_ids


    


In [6]:
# BED files of polyA-tail containing reads - 3-5 nt overhang + 100 % As | 6+nt & >80 % As
# Pooling across all KD samples from all experiments
pas_all = pr.read_bed("../data/bulk_polya_reads/tdp_ko_collection/pas_clusters/condition__TDP43KD/two_class_simple/polya_clusters.bed")
# pooling x KD samples from i3Neuron datasets
pas_i3 = pr.read_bed("../data/bulk_polya_reads/tdp_ko_collection/pas_clusters/cell_type__i3_cortical___condition__TDP43KD/two_class_simple/polya_clusters.bed")
# pooling x KD samples from SH-SY5Y datasets
pas_shsy5y = pr.read_bed("../data/bulk_polya_reads/tdp_ko_collection/pas_clusters/cell_type__shsy5y___condition__TDP43KD/two_class_simple/polya_clusters.bed")
# poolin x KD samples from SK-N-BE2 datasets
pas_sknbe2 = pr.read_bed("../data/bulk_polya_reads/tdp_ko_collection/pas_clusters/cell_type__sknbe2___condition__TDP43KD/two_class_simple/polya_clusters.bed")

# GTF file of predicted last exons
quant_le = pr.read_gtf("../data/novel_ref_combined.last_exons.gtf")

# Table with cryptic le_ids (prior to mv filtering)
cryptics_df = pd.read_csv("../processed/2023-12-10_cryptics_summary_all_events.tsv", sep="\t")
# after manual validation (just to get some idea)
cryptics_mv_df = pd.read_csv("../processed/2023-12-10_cryptics_summary_all_events_bleedthrough_manual_validation.tsv", sep="\t")


In [7]:
# cryptic IDs
cryp_le_ids = set(cryptics_df.le_id)
cryp_le_ids_mv = set(cryptics_mv_df.le_id)
quant_le_cryp = quant_le.subset(lambda df: df.le_id.isin(cryp_le_ids))
quant_le_cryp_mv = quant_le.subset(lambda df: df.le_id.isin(cryp_le_ids_mv))
quant_le_cryp_pas = quant_le_cryp.three_end()
quant_le_cryp_mv_pas = quant_le_cryp_mv.three_end()
quant_le_cryp_pas

Unnamed: 0,Chromosome,Source,Feature,Start,End,Score,Strand,Frame,gene_id,gene_name,...,region_rank,Start_ref,End_ref,transcript_id_ref,3p_extension_length,event_type,ref_gene_id,ref_gene_name,le_number,le_id
0,chr1,.,exon,15940476,15940477,.,+,.,PAPA.chx_tdp_DOX_ctrl_2.224,,...,,,,,,last_exon_extension,ENSG00000065526.12,SPEN,4.0,ENSG00000065526.12_4
1,chr1,.,exon,147623355,147623356,.,+,.,PAPA.doxconc_DOX_0075_2.1123,,...,,,,,,internal_exon_extension,ENSG00000116128.12,BCL9,2.0,ENSG00000116128.12_2
2,chr1,.,exon,147622692,147622693,.,+,.,PAPA.chx_tdp_DOX_ctrl_2.1186,,...,,,,,,internal_exon_extension,ENSG00000116128.12,BCL9,2.0,ENSG00000116128.12_2
3,chr1,.,exon,147623700,147623701,.,+,.,PAPA.doxconc_DOX_0075_1.1139,,...,,,,,,internal_exon_extension,ENSG00000116128.12,BCL9,2.0,ENSG00000116128.12_2
4,chr1,.,exon,21457149,21457150,.,+,.,PAPA.TDP-1.345,,...,,,,,,last_exon_extension,ENSG00000142794.19,NBPF3,3.0,ENSG00000142794.19_3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
970,chrX,.,exon,40653676,40653677,.,-,.,PAPA.TDP43_ctrl_3.26410,,...,,,,,,last_exon_extension,ENSG00000180182.11,MED14,3.0,ENSG00000180182.11_3
971,chrX,.,exon,131823775,131823776,.,-,.,PAPA.TDP43-F_S6.20505,,...,,,,,,internal_exon_extension,ENSG00000213468.7,FIRRE,1.0,ENSG00000213468.7_1
972,chrX,.,exon,131823777,131823778,.,-,.,PAPA.TDP43_ctrl_2.26896,,...,,,,,,internal_exon_extension,ENSG00000213468.7,FIRRE,1.0,ENSG00000213468.7_1
973,chrX,.,exon,108267641,108267642,.,-,.,ENSG00000197565.17,COL4A6,...,last,108221374108221374108221374,108310747108310747108310747,"ENST00000372216.8,ENST00000538570.5,ENST000006...","NULL,NULL,NULL",internal_exon_spliced,ENSG00000197565.17,COL4A6,1.0,ENSG00000197565.17_1


In [8]:
# remove junky columns
quant_le_cryp_pas = quant_le_cryp_pas[["le_id", "ref_gene_name", "ref_gene_id", "event_id"]]
quant_le_cryp_pas

Unnamed: 0,Chromosome,Start,End,Strand,le_id,ref_gene_name,ref_gene_id
0,chr1,15940476,15940477,+,ENSG00000065526.12_4,SPEN,ENSG00000065526.12
1,chr1,147623355,147623356,+,ENSG00000116128.12_2,BCL9,ENSG00000116128.12
2,chr1,147622692,147622693,+,ENSG00000116128.12_2,BCL9,ENSG00000116128.12
3,chr1,147623700,147623701,+,ENSG00000116128.12_2,BCL9,ENSG00000116128.12
4,chr1,21457149,21457150,+,ENSG00000142794.19_3,NBPF3,ENSG00000142794.19
...,...,...,...,...,...,...,...
970,chrX,40653676,40653677,-,ENSG00000180182.11_3,MED14,ENSG00000180182.11
971,chrX,131823775,131823776,-,ENSG00000213468.7_1,FIRRE,ENSG00000213468.7
972,chrX,131823777,131823778,-,ENSG00000213468.7_1,FIRRE,ENSG00000213468.7
973,chrX,108267641,108267642,-,ENSG00000197565.17_1,COL4A6,ENSG00000197565.17


In [9]:
# overlap with jnc reads from all PAS
cryp_pas_kn = quant_le_cryp_pas.k_nearest(pas_all, k=10, strandedness="same", overlap=True, how=None)
# convert to absolute values (don't care if up/downstream rn)
cryp_pas_kn = cryp_pas_kn.assign("DistanceAbs", lambda df: df.Distance.abs())

# for each le_id, select the smallest distance
cryp_pas_kn_nr = cryp_pas_kn.apply(lambda df: df.sort_values(by=["le_id", "DistanceAbs"]).drop_duplicates(subset=["le_id"])).sort()

# how 
n_cryp_supported = {str(d): cryp_pas_kn_nr.subset(lambda df: df.DistanceAbs.le(d)).le_id.nunique() for d in [0,10,25,50,100,200,500]}
print(n_cryp_supported)
{d: n / len(cryp_le_ids) for d,n in n_cryp_supported.items()}
# cryp_pas_kn_nr.as_df()


{'0': 54, '10': 112, '25': 122, '50': 131, '100': 138, '200': 151, '500': 176}


{'0': 0.20930232558139536,
 '10': 0.43410852713178294,
 '25': 0.4728682170542636,
 '50': 0.5077519379844961,
 '100': 0.5348837209302325,
 '200': 0.5852713178294574,
 '500': 0.6821705426356589}

In [10]:
# overlap with jnc reads from all PAS
cryp_pas_mv_kn = quant_le_cryp_pas.k_nearest(pas_all, k=10, strandedness="same", overlap=True, how=None)
# convert to absolute values (don't care if up/downstream rn)
cryp_pas_mv_kn = cryp_pas_mv_kn.assign("DistanceAbs", lambda df: df.Distance.abs())

# for each le_id, select the smallest distance
cryp_pas_mv_kn_nr = cryp_pas_mv_kn.apply(lambda df: df.sort_values(by=["le_id", "DistanceAbs"]).drop_duplicates(subset=["le_id"])).sort()

# how 
n_cryp_supported_mv = {str(d): cryp_pas_mv_kn_nr.subset(lambda df: df.DistanceAbs.le(d)).le_id.nunique() for d in [0,10,25,50,100,200,500]}
print(n_cryp_supported_mv)
{d: n / len(cryp_le_ids_mv) for d,n in n_cryp_supported_mv.items()}


{'0': 54, '10': 112, '25': 122, '50': 131, '100': 138, '200': 151, '500': 176}


{'0': 0.23788546255506607,
 '10': 0.4933920704845815,
 '25': 0.5374449339207048,
 '50': 0.5770925110132159,
 '100': 0.6079295154185022,
 '200': 0.6651982378854625,
 '500': 0.775330396475771}

For manually validated (i.e. bleedthrough IR artefacts removed):
~ 49 % +/- 10nt, rising ~ 61 % +/- 100nt

vs all initial cryptics:
~ 43 % +/- 10nt, rising ~ 53 % +/- 100nt

Haven't checked read support...