# Extract splice junctions for AS-ALE (/ last exon spliced) events.

Due to a pretty silly move on my part, my pipeline does not output splice junctions for novel spliced last exons (nor last exons)
Given want to use SJs to quantify these events in NYGC data, need to extract using only last exon coordinates.

In pipeline, approach to define last exons:
- 3'end doesn't overlap any annotated exon
- terminal splice junction must match an annotated splice junction at the 5'end. 

Can exploit these criteria to grab SJs 'in reverse'

Strategy for novel last exons:
- Extract introns (SJs) from annotated transcripts
- find introns in which last exons are completely contained
- (w/ some form of pr.join), construct a SJ with start coordinate of annotated intron and end coordinate = start of exon coordinate
- As a sanity check, compare with AL's provided list of cryptic SJs inferred with MAJIQ - for common genes, do we get the same SJ coordinates

Strategy for annotated last exons:
- Any spliced last exon that is not captured with novel LE strategy
- pr.join with slack=1 (allow bookended intervals to overlap) annotated SJs with last exon coordinates
- Subset for exact matches at SJ 3'end/last exon 5'end
- Retain the SJs for downstream analysis

In [20]:
import pyranges as pr
import pandas as pd
import numpy as np

In [3]:
majiq_sj = pr.read_bed("../data/Everything_minus_ferguson_liu_recent_new_cryptics.junctions.bed")
ref_gtf = pr.read_gtf("../data/reference_filtered.gtf")
le = pr.read_bed("../data/2023-07-04_papa_cryptic_spliced.last_exons.bed") # generated in motifs/ subdir, all cryptic last exon coords used for iCLIP analysis

CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 3.58 µs


In [6]:
# ref gtf has lots of attributes don't need, subset to minimal cols for analysis
ref_gtf = ref_gtf[["Feature", "gene_id", "transcript_id", "gene_name"]]

In [13]:
# subset le for cryptic LEs only
le_cryp = le.subset(lambda df: df.Name.str.contains("cryptic", regex=False))
ids_le_cryp = set(le_cryp.Name)
le_cryp

Unnamed: 0,Chromosome,Start,End,Name,Score,Strand
0,chr1,76871267,76871821,ENSG00000117069.15_2|ST6GALNAC5|spliced|cryptic,.,+
1,chr1,61824444,61825501,ENSG00000132849.22_1|PATJ|spliced|cryptic,.,+
2,chr1,54634687,54639192,ENSG00000162390.18_4|ACOT11|spliced|cryptic,.,+
3,chr1,245464258,245471621,ENSG00000162849.16_2|KIF26B|spliced|cryptic,.,+
4,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+
...,...,...,...,...,...,...
102,chr22,37464524,37465718,ENSG00000100060.18_2|MFNG|spliced|cryptic,.,-
103,chrX,102721363,102724864,ENSG00000198908.12_1|BHLHB9|spliced|cryptic,.,+
104,chrX,98679426,98679978,ENSG00000281566.3_1|ENSG00000281566|spliced|cr...,.,+
105,chrX,17835910,17837395,ENSG00000131831.18_1|RAI2|spliced|cryptic,.,-


In [7]:
# extract introns (SJs) for each transcript
ref_introns = ref_gtf.features.introns(by="transcript")
# store a df noting which transcript IDs share the same intron coordinates (useful if collapse later and want to track/retain)
intron2tx = ref_introns.as_df()[["Chromosome", "Start", "End", "Strand", "transcript_id"]]



In [18]:
# first extract LEs completely contained within annotated introns
le_cryp_cont = le_cryp.overlap(ref_introns, strandedness="same", how="containment")

# now overlap-join introns with LEs
ref_introns_le = ref_introns.join(le_cryp_cont, strandedness="same")
ref_introns_le

Unnamed: 0,Chromosome,Feature,Start,End,Strand,gene_id,transcript_id,gene_name,Start_b,End_b,Name,Score,Strand_b
0,chr1,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+
1,chr1,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+
2,chr1,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+
3,chr1,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+
4,chr1,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1492,chrX,intron,108221374,108310747,-,ENSG00000197565.17,ENST00000538570.5,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-
1493,chrX,intron,108221374,108310747,-,ENSG00000197565.17,ENST00000538570.5,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-
1494,chrX,intron,108221374,108310747,-,ENSG00000197565.17,ENST00000621266.4,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-
1495,chrX,intron,108221374,108310747,-,ENSG00000197565.17,ENST00000621266.4,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-


In [23]:
# If + strand
# Start coord = intron start (i.e. 'Start' column), End coord = exon start (i.e. 'Start_b' column)
# If - strand
# Start coord = Exon start (i.e. 'End_b' column), End coord = intron start (i.e. 'End' column)

def _df_intron_exon_to_sj(df: pd.DataFrame) -> pd.DataFrame:
    '''Convert joined intron + overlapping exon df of PyRanges object to splice junction coordinates (the first bin between the two intervals)

    If + strand
    Start coord = intron start (i.e. 'Start' column), End coord = exon start (i.e. 'Start_b' column)
    If - strand
    Start coord = Exon start (i.e. 'End_b' column), End coord = intron start (i.e. 'End' column)
    
    Parameters
    ----------
    df : pd.DataFrame
        _description_

    Returns
    -------
    pd.DataFrame
        _description_

    Raises
    ------
    Exception
        Strand column of df doesn't contain all '+' or '-'
    '''

    df[["Start_intron", "End_intron"]] = df[["Start", "End"]]

    if (df.Strand == "+").all():
        df["End"] = df["Start_b"]

    elif (df.Strand == "-").all():
        df["Start"] = df["End_b"]

    else:
        raise Exception("Strand column must only contain all '+' or '-'")
    
    return df


le_sj = ref_introns_le.apply(_df_intron_exon_to_sj)
le_sj

Unnamed: 0,Chromosome,Feature,Start,End,Strand,gene_id,transcript_id,gene_name,Start_b,End_b,Name,Score,Strand_b,Start_intron,End_intron
0,chr1,intron,1616614,1616614,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,1616614,1616614
1,chr1,intron,1616614,1616614,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,1616614,1616614
2,chr1,intron,1616614,1616614,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,1616614,1616614
3,chr1,intron,1616614,1616614,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,1616614,1616614
4,chr1,intron,1616614,1616614,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,1616614,1616614
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1492,chrX,intron,108269152,108310747,-,ENSG00000197565.17,ENST00000538570.5,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-,108269152,108310747
1493,chrX,intron,108269152,108310747,-,ENSG00000197565.17,ENST00000538570.5,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-,108269152,108310747
1494,chrX,intron,108269152,108310747,-,ENSG00000197565.17,ENST00000621266.4,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-,108269152,108310747
1495,chrX,intron,108269152,108310747,-,ENSG00000197565.17,ENST00000621266.4,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-,108269152,108310747


In [25]:
le_sj.subset(lambda df: df.End - df.Start == 0).drop_duplicate_positions() # perhaps these two are annotated in some way...

Unnamed: 0,Chromosome,Feature,Start,End,Strand,gene_id,transcript_id,gene_name,Start_b,End_b,Name,Score,Strand_b,Start_intron,End_intron
0,chr1,intron,1616614,1616614,+,ENSG00000197530.13,ENST00000355826.10,MIB2,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,1616614,1616614
1,chr1,intron,20776596,20776596,-,ENSG00000127483.19,ENST00000312239.10,HP1BP3,20775449,20776596,ENSG00000127483.19_1|HP1BP3|spliced|cryptic,.,-,20776596,20776596


In [26]:
# remove 0 length SJs for now as likely some annotation issue - check back at end after mined annotated SJs
le_sj = le_sj.subset(lambda df: df.Start != df.End)
le_sj

Unnamed: 0,Chromosome,Feature,Start,End,Strand,gene_id,transcript_id,gene_name,Start_b,End_b,Name,Score,Strand_b,Start_intron,End_intron
0,chr1,intron,245419745,245464258,+,ENSG00000162849.16,ENST00000407071.7,KIF26B,245464258,245471621,ENSG00000162849.16_2|KIF26B|spliced|cryptic,.,+,245419745,245464258
1,chr1,intron,61823079,61824444,+,ENSG00000132849.22,ENST00000459752.5,PATJ,61824444,61825501,ENSG00000132849.22_1|PATJ|spliced|cryptic,.,+,61823079,61824444
2,chr1,intron,61823079,61824444,+,ENSG00000132849.22,ENST00000459752.5,PATJ,61824444,61825501,ENSG00000132849.22_1|PATJ|spliced|cryptic,.,+,61823079,61824444
3,chr1,intron,61823079,61824444,+,ENSG00000132849.22,ENST00000459752.5,PATJ,61824444,61825501,ENSG00000132849.22_1|PATJ|spliced|cryptic,.,+,61823079,61824444
4,chr1,intron,76868742,76871267,+,ENSG00000117069.15,ENST00000477717.6,ST6GALNAC5,76871267,76871821,ENSG00000117069.15_2|ST6GALNAC5|spliced|cryptic,.,+,76868742,76871267
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1358,chrX,intron,108269152,108310747,-,ENSG00000197565.17,ENST00000538570.5,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-,108269152,108310747
1359,chrX,intron,108269152,108310747,-,ENSG00000197565.17,ENST00000538570.5,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-,108269152,108310747
1360,chrX,intron,108269152,108310747,-,ENSG00000197565.17,ENST00000621266.4,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-,108269152,108310747
1361,chrX,intron,108269152,108310747,-,ENSG00000197565.17,ENST00000621266.4,COL4A6,108267641,108269152,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.,-,108269152,108310747


In [36]:
# finally generate a cleaned BED-ready file of SJs
le_sj_clean_nov = le_sj[["Name"]]
le_sj_clean_nov.Score = "."
le_sj_clean_nov

Unnamed: 0,Chromosome,Start,End,Strand,Name,Score
0,chr1,245419745,245464258,+,ENSG00000162849.16_2|KIF26B|spliced|cryptic,.
1,chr1,61823079,61824444,+,ENSG00000132849.22_1|PATJ|spliced|cryptic,.
2,chr1,61823079,61824444,+,ENSG00000132849.22_1|PATJ|spliced|cryptic,.
3,chr1,61823079,61824444,+,ENSG00000132849.22_1|PATJ|spliced|cryptic,.
4,chr1,76868742,76871267,+,ENSG00000117069.15_2|ST6GALNAC5|spliced|cryptic,.
...,...,...,...,...,...,...
1358,chrX,108269152,108310747,-,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.
1359,chrX,108269152,108310747,-,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.
1360,chrX,108269152,108310747,-,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.
1361,chrX,108269152,108310747,-,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.


In [33]:
# now to get SJs for annotated ALEs
# overlap-join SJs and LEs, allowing bookended/directly adjacent to overlap
# subset for direct match at intron 3'end, last exon 5'EncodingWarning
le_sj_ref = (ref_introns.join(le_cryp, strandedness="same", slack=1)
 .subset(lambda df: ((df.Strand == "+") & (df["End"] == df["Start_b"])) |
         ((df.Strand == "-") & (df["Start"] == df["End_b"]))
         )
         .drop_duplicate_positions()
 )

le_sj_ref

Unnamed: 0,Chromosome,Feature,Start,End,Strand,gene_id,transcript_id,gene_name,Start_b,End_b,Name,Score,Strand_b
0,chr1,intron,54630886,54634687,+,ENSG00000162390.18,ENST00000371316.3,ACOT11,54634687,54639192,ENSG00000162390.18_4|ACOT11|spliced|cryptic,.,+
1,chr1,intron,112542090,112550600,-,ENSG00000007341.19,ENST00000343210.11,ST7L,112540178,112542090,ENSG00000007341.19_4|ST7L|spliced|cryptic,.,-
2,chr1,intron,112542090,112555867,-,ENSG00000007341.19,ENST00000360743.8,ST7L,112540178,112542090,ENSG00000007341.19_4|ST7L|spliced|cryptic,.,-
3,chr1,intron,33592489,33600864,-,ENSG00000121904.19,ENST00000373380.5,CSMD2,33591806,33592489,ENSG00000121904.19_3|CSMD2|spliced|cryptic,.,-
4,chr1,intron,163305313,163306232,-,ENSG00000232995.7,ENST00000439699.1,ENSG00000232995,163304210,163305313,ENSG00000232995.7_2|ENSG00000232995|spliced|cr...,.,-
5,chr2,intron,205766818,205776245,+,ENSG00000118257.17,ENST00000272849.7,NRP2,205776245,205779714,ENSG00000118257.17_5|NRP2|spliced|cryptic,.,+
6,chr2,intron,205766803,205776245,+,ENSG00000118257.17,ENST00000357118.8,NRP2,205776245,205779714,ENSG00000118257.17_5|NRP2|spliced|cryptic,.,+
7,chr2,intron,3457693,3457794,+,ENSG00000171853.16,ENST00000441099.5,TRAPPC12,3457794,3458181,ENSG00000171853.16_4|TRAPPC12|spliced|cryptic,.,+
8,chr3,intron,3129188,3129918,+,ENSG00000072756.17,ENST00000339437.11,TRNT1,3129918,3130679,ENSG00000072756.17_1|TRNT1|spliced|cryptic,.,+
9,chr3,intron,114339426,114350273,-,ENSG00000181722.17,ENST00000357258.8,ZBTB20,114314500,114339426,ENSG00000181722.17_5|ZBTB20|spliced|cryptic,.,-


In [34]:
# finally generate a cleaned BED-ready file of ref SJs
le_sj_clean_ref = le_sj_ref[["Name"]]
le_sj_clean_ref.Score = "."
le_sj_clean_ref

Unnamed: 0,Chromosome,Start,End,Strand,Name,Score
0,chr1,54630886,54634687,+,ENSG00000162390.18_4|ACOT11|spliced|cryptic,.
1,chr1,112542090,112550600,-,ENSG00000007341.19_4|ST7L|spliced|cryptic,.
2,chr1,112542090,112555867,-,ENSG00000007341.19_4|ST7L|spliced|cryptic,.
3,chr1,33592489,33600864,-,ENSG00000121904.19_3|CSMD2|spliced|cryptic,.
4,chr1,163305313,163306232,-,ENSG00000232995.7_2|ENSG00000232995|spliced|cr...,.
5,chr2,205766818,205776245,+,ENSG00000118257.17_5|NRP2|spliced|cryptic,.
6,chr2,205766803,205776245,+,ENSG00000118257.17_5|NRP2|spliced|cryptic,.
7,chr2,3457693,3457794,+,ENSG00000171853.16_4|TRAPPC12|spliced|cryptic,.
8,chr3,3129188,3129918,+,ENSG00000072756.17_1|TRNT1|spliced|cryptic,.
9,chr3,114339426,114350273,-,ENSG00000181722.17_5|ZBTB20|spliced|cryptic,.


In [38]:
# now just need to check that all les are represented
le_sj_clean = pr.concat([le_sj_clean_nov.drop_duplicate_positions(), le_sj_clean_ref])
le_sj_clean

Unnamed: 0,Chromosome,Start,End,Strand,Name,Score
0,chr1,245419745,245464258,+,ENSG00000162849.16_2|KIF26B|spliced|cryptic,.
1,chr1,61823079,61824444,+,ENSG00000132849.22_1|PATJ|spliced|cryptic,.
2,chr1,76868742,76871267,+,ENSG00000117069.15_2|ST6GALNAC5|spliced|cryptic,.
3,chr1,54630886,54634687,+,ENSG00000162390.18_4|ACOT11|spliced|cryptic,.
4,chr1,243613034,243613670,-,ENSG00000117020.19_2|AKT3|spliced|cryptic,.
...,...,...,...,...,...,...
153,chrX,98644574,98679426,+,ENSG00000281566.3_1|ENSG00000281566|spliced|cr...,.
154,chrX,17837395,17837536,-,ENSG00000131831.18_1|RAI2|spliced|cryptic,.
155,chrX,108269152,108310747,-,ENSG00000197565.17_1|COL4A6|spliced|cryptic,.
156,chrX,17837395,17861097,-,ENSG00000131831.18_1|RAI2|spliced|cryptic,.


In [55]:
# check overlap with AL's cryptics BED
# get gene names from le_sj_id_clean
# Name field = ENSG00000162849.16_2|KIF26B|spliced|cryptic	
le_cryp_gn = set(le_sj_clean.Name.str.split("|", expand=True)[1])

# Name field = PTPRU|novel_donor|2	
majiq_sj_gn = set(majiq_sj.Name.str.split("|", expand=True)[0])

# get matching genes
le_majiq_m_gn = le_cryp_gn.intersection(majiq_sj_gn)

print(f"Fraction of cryptics where gene has SJ found by MAJIQ - {len(le_majiq_m_gn) / len(le_cryp_gn)}")
#


Fraction of cryptics where gene has SJ found by MAJIQ - 0.7526881720430108


In [66]:
# extract gene_name as column
le_sj_clean = le_sj_clean.assign("gene_name", lambda df: df.Name.str.split("|", expand=True)[1])
majiq_sj = majiq_sj.assign("gene_name", lambda df: df.Name.str.split("|", expand=True)[0])

# overlap-join le inferred SJs with MAJIQ SJs 
sj_le_majiq = (le_sj_clean.subset(lambda df: df.gene_name.isin(le_majiq_m_gn))
 .join(majiq_sj.subset(lambda df: df.gene_name.isin(le_majiq_m_gn)),
       how="left",
       strandedness="same"
        )
 )

# calculate difference between 3' coordinates
(sj_le_majiq.assign("end3_diff",
                    lambda df: abs(df.End_b - df.End) if (df.Strand == "+").all() else abs(df.Start_b - df.Start))
                    .apply(lambda df: df.loc[df.groupby("Name")["end3_diff"].idxmin(), :])
                    .end3_diff.describe(percentiles=[0.1 * i for i in range(0,11,1)])
)


count    7.400000e+01
mean     1.789569e+07
std      4.010586e+07
min      0.000000e+00
0%       0.000000e+00
10%      0.000000e+00
20%      0.000000e+00
30%      1.000000e+00
40%      1.000000e+00
50%      1.000000e+00
60%      1.689400e+03
70%      1.410880e+04
80%      2.557911e+07
90%      7.430823e+07
100%     2.057762e+08
max      2.057762e+08
Name: end3_diff, dtype: float64

In [68]:
(sj_le_majiq.assign("end5_diff",
                    lambda df: abs(df.Start_b - df.Start) if (df.Strand == "+").all() else abs(df.End_b - df.End))
                    .apply(lambda df: df.loc[df.groupby("Name")["end5_diff"].idxmin(), :]) # select smallest per gene (so prioritises match)
                    .end5_diff.describe(percentiles=[0.1 * i for i in range(0,11,1)])
)

count    7.400000e+01
mean     1.789371e+07
std      4.010648e+07
min      0.000000e+00
0%       0.000000e+00
10%      0.000000e+00
20%      0.000000e+00
30%      0.000000e+00
40%      1.000000e+00
50%      1.000000e+00
60%      1.000000e+00
70%      2.867100e+03
80%      2.557979e+07
90%      7.430702e+07
100%     2.057668e+08
max      2.057668e+08
Name: end5_diff, dtype: float64

20-30 % are exactly matching at 3'end (start of last exon), but further 20-30 % differ by 1 nucleotide... One off error?

5'end matching is slightly better...

Let's IGV browse a few 

In [70]:
# Add 5' and 3' difference to df
sj_le_majiq = (sj_le_majiq.assign("end3_diff",
                                                      lambda df: abs(df.End_b - df.End) if (df.Strand == "+").all() else abs(df.Start_b - df.Start))
                    .assign("end5_diff",
                                                lambda df: abs(df.Start_b - df.Start) if (df.Strand == "+").all() else abs(df.End_b - df.End))
                    )

In [87]:
sj_le_majiq.subset(lambda df: df.end3_diff + df.end5_diff <= 1).subset(lambda df: df.gene_name.isin(["STMN2", "ARHGAP32", "SYNJ2", "GLE1", "ONECUT1"])).drop_duplicate_positions()

Unnamed: 0,Chromosome,Start,End,Strand,Name,Score,gene_name,Start_b,End_b,Name_b,Score_b,Strand_b,gene_name_b,end3_diff,end5_diff
0,chr6,158017290,158019983,+,ENSG00000078269.15_1|SYNJ2|spliced|cryptic,.,SYNJ2,158017290,158019984,SYNJ2|novel_acceptor|5,0,+,SYNJ2,1,0
1,chr8,79611214,79616821,+,ENSG00000104435.14_1|STMN2|spliced|cryptic,.,STMN2,79611214,79616822,STMN2|novel_acceptor|23,0,+,STMN2,1,0
2,chr8,79611791,79616821,+,ENSG00000104435.14_1|STMN2|spliced|cryptic,.,STMN2,79611791,79616822,STMN2|novel_acceptor|1,0,+,STMN2,1,0
3,chr11,128992046,128998318,-,ENSG00000134909.19_1|ARHGAP32|spliced|cryptic,.,ARHGAP32,128992046,128998319,ARHGAP32|novel_acceptor|21,0,-,ARHGAP32,0,1
4,chr15,52777690,52788779,-,ENSG00000169856.9_1|ONECUT1|spliced|cryptic,.,ONECUT1,52777690,52788780,ONECUT1|novel_acceptor|5,0,-,ONECUT1,0,1


In [84]:
# matches seem to have a difference of 1 at either 5'/3' end
# I think this is due to End coordinates being systematically shifted
# If so, then when subset for matches the one-off difference should be perfectly split by strand
# (+ strand - 3'end = End coordinate, - strand - 5'end = End coordinate)
(sj_le_majiq.subset(lambda df: df.end3_diff + df.end5_diff <= 1)
 .drop_duplicate_positions()
 .assign("end_diff", lambda df: abs(df.End_b - df.End))
 .as_df()[["Strand", "Name", "end_diff"]]
 .groupby("Strand")
 .end_diff.describe(percentiles=[0.1 * i for i in range(0,11,1)])
 )

Unnamed: 0_level_0,count,mean,std,min,0%,10%,20%,30%,40%,50%,60%,70%,80%,90%,100%,max
Strand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
+,21.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
-,20.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [86]:
# output current SJ BED just so I can vis in IGV
le_sj_clean.drop("gene_name").to_bed("../processed/2023-09-01_putative_as_cle_sjs.bed")

In [81]:
# # potentially a strand & End coordinate problem? AL's BED is always +1?
# (sj_le_majiq.assign("end_diff", lambda df: abs(df.End_b - df.End))
#  .apply(lambda df: df.loc[df.groupby("Name")["end_diff"].idxmin(), :]) # select smallest per gene (so prioritises match)
#  .as_df()
#  [["Name", "end_diff","Strand"]]
#  .groupby("Strand")
#  .end_diff.describe(percentiles=[0.1 * i for i in range(0,11,1)])
# )

Unnamed: 0_level_0,count,mean,std,min,0%,10%,20%,30%,40%,50%,60%,70%,80%,90%,100%,max
Strand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
+,35.0,15551240.0,41823120.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1478.0,8860.4,647852.6,73130860.8,205776246.0,205776246.0
-,39.0,19998860.0,38929560.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,15867.0,45014407.6,77149927.4,151409427.0,151409427.0


In [42]:
# which le_ids do not have a SJ?
ids_le_cryp.difference(set(le_sj_clean.Name))

{'ENSG00000002746.15_3|HECW1|spliced|cryptic',
 'ENSG00000100060.18_2|MFNG|spliced|cryptic',
 'ENSG00000111665.12_4|CDCA3|spliced|cryptic',
 'ENSG00000127483.19_1|HP1BP3|spliced|cryptic',
 'ENSG00000151692.15_6|RNF144A|spliced|cryptic',
 'ENSG00000151692.15_7|RNF144A|spliced|cryptic',
 'ENSG00000183570.17_3|PCBP3|spliced|cryptic',
 'ENSG00000184347.15_7|SLIT3|spliced|cryptic',
 'ENSG00000197530.13_1|MIB2|spliced|cryptic',
 'ENSG00000216895.9_4|ENSG00000216895|spliced|cryptic'}

In [44]:
le_cryp.subset(lambda df: df.Name.isin(ids_le_cryp.difference(set(le_sj_clean.Name)))).join(ref_introns, strandedness="same", slack=1)

Unnamed: 0,Chromosome,Start,End,Name,Score,Strand,Feature,Start_b,End_b,Strand_b,gene_id,transcript_id,gene_name
0,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000355826.10,MIB2
1,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000378712.5,MIB2
2,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000479659.5,MIB2
3,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623430,+,ENSG00000197530.13,ENST00000514363.1,MIB2
4,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623388,+,ENSG00000197530.13,ENST00000489635.5,MIB2
5,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623388,+,ENSG00000197530.13,ENST00000504599.6,MIB2
6,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623388,+,ENSG00000197530.13,ENST00000505820.7,MIB2
7,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623388,+,ENSG00000197530.13,ENST00000507229.5,MIB2
8,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623388,+,ENSG00000197530.13,ENST00000511502.5,MIB2
9,chr1,1616614,1619210,ENSG00000197530.13_1|MIB2|spliced|cryptic,.,+,intron,1616614,1623388,+,ENSG00000197530.13,ENST00000518681.6,MIB2


In [None]:
# check overlap