# Identify STRs that are at splice sites
Following the ENCODE paper, define a splice site as any position within 100 bp of an exon. Everything below uses the GRCh37 assembly (GENCODE v19 annotations), as in the paper.
Link to the [GENCODE data](https://www.gencodegenes.org/human/release_37lift37.html)
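
The 100 bp definition can be made concrete as a pair of flanking windows per exon. The sketch below is one reading of it, not the paper's method: it assumes 0-based, half-open (BED-style) coordinates and treats the 100 bp immediately outside each exon boundary as the splice-site region. The helper name `exon_flanks` and the toy exons are hypothetical.

```python
import pandas as pd

def exon_flanks(exons: pd.DataFrame, flank: int = 100) -> pd.DataFrame:
    """Return the `flank`-bp windows directly left and right of each exon.

    Assumes 0-based, half-open coordinates in columns Start and End.
    """
    left = exons.copy()
    left["End"] = exons["Start"]
    left["Start"] = (exons["Start"] - flank).clip(lower=0)  # do not run past position 0

    right = exons.copy()
    right["Start"] = exons["End"]
    right["End"] = exons["End"] + flank

    out = pd.concat([left, right], ignore_index=True)
    # Drop empty windows, e.g. the left flank of an exon that starts at position 0.
    return out[out["End"] > out["Start"]].reset_index(drop=True)

# Toy exons for illustration only.
toy_exons = pd.DataFrame({
    "Chromosome": ["chr1", "chr1"],
    "Start": [11873, 12612],
    "End": [12227, 12721],
    "Strand": ["+", "+"],
})
flanks = exon_flanks(toy_exons, flank=100)
print(flanks)
```

A frame like `flanks` has the `Chromosome`, `Start`, `End`, `Strand` columns that `pr.PyRanges` expects, so it could be wrapped and overlapped with the STRs in the same way as the other tables in this notebook.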

In [1]:
import pyranges as pr
import numpy as np
import pandas as pd

Load the STR data

In [2]:
col_names = ["Chromosome", "Start", "End", "class", 
            "length", "Strand", "num_units", 
            "actual_repeat", "gene", "gene_start",
            "gene_stop", "gene_strand", "annotation",
            "promoter", "dist_to_tss"]
# np.str and np.int were removed in NumPy 1.24; the Python builtins are equivalent here
dtype_dict = {"Chromosome": str, "Start": int, "End": int, "length": int,
             "Strand": str} #, "gene_start": int, "gene_stop": int, "dist_to_tss": int}
str_df = pd.read_csv("data/msdb_data.tsv", sep = '\t', dtype = dtype_dict, names = col_names, index_col = False, na_values = "")
str_df.head()


Unnamed: 0,Chromosome,Start,End,class,length,Strand,num_units,actual_repeat,gene,gene_start,gene_stop,gene_strand,annotation,promoter,dist_to_tss
0,chr1,10000,10108,AACCCT,108,+,18,TAACCC,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1766
1,chr1,10108,10149,AACCCT,41,+,6,AACCCT,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1725
2,chr1,10147,10179,AACCCT,32,+,5,CCCTAA,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1695
3,chr1,10172,10184,AACCT,12,+,2,CCTAA,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1690
4,chr1,10177,10233,AACCCT,56,+,9,CCTAAC,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1641


In [13]:
str_pr = pr.PyRanges(str_df)
str_pr.head()

Unnamed: 0,Chromosome,Start,End,class,length,Strand,num_units,actual_repeat,gene,gene_start,gene_stop,gene_strand,annotation,promoter,dist_to_tss
0,chr1,10000,10108,AACCCT,108,+,18,TAACCC,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1766
1,chr1,10108,10149,AACCCT,41,+,6,AACCCT,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1725
2,chr1,10147,10179,AACCCT,32,+,5,CCCTAA,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1695
3,chr1,10172,10184,AACCT,12,+,2,CCTAA,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1690
4,chr1,10177,10233,AACCCT,56,+,9,CCTAAC,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1641
5,chr1,10231,10249,AACCCT,18,+,3,CCCTAA,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1625
6,chr1,10255,10290,AACCCT,35,+,5,AACCCT,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1584
7,chr1,10285,10326,AACCCC,41,+,6,AACCCC,uc001aaa.3,11874,14409,+,Intergenic,Non-Promoter,-1548


Load the UCSC exons

In [20]:
col_names = ["Chromosome", "Start", "End", "ucsc_id", "Strand"]
exon_df = pd.read_csv("data/ucsc_exons.bed", sep = '\t', names = col_names, index_col = False, skiprows = [0], usecols = [0,1,2,3,5])
exon_df.head()

Unnamed: 0,Chromosome,Start,End,ucsc_id,Strand
0,chr1,11873,12227,uc001aaa.3_exon_0_0_chr1_11874_f,+
1,chr1,12612,12721,uc001aaa.3_exon_1_0_chr1_12613_f,+
2,chr1,13220,14409,uc001aaa.3_exon_2_0_chr1_13221_f,+
3,chr1,11873,12227,uc010nxr.1_exon_0_0_chr1_11874_f,+
4,chr1,12645,12697,uc010nxr.1_exon_1_0_chr1_12646_f,+


In [21]:
exon_pr = pr.PyRanges(exon_df)
exon_pr.head()

Unnamed: 0,Chromosome,Start,End,ucsc_id,Strand
0,chr1,11873,12227,uc001aaa.3_exon_0_0_chr1_11874_f,+
1,chr1,12612,12721,uc001aaa.3_exon_1_0_chr1_12613_f,+
2,chr1,13220,14409,uc001aaa.3_exon_2_0_chr1_13221_f,+
3,chr1,11873,12227,uc010nxr.1_exon_0_0_chr1_11874_f,+
4,chr1,12645,12697,uc010nxr.1_exon_1_0_chr1_12646_f,+
5,chr1,13220,14409,uc010nxr.1_exon_2_0_chr1_13221_f,+
6,chr1,11873,12227,uc010nxq.1_exon_0_0_chr1_11874_f,+
7,chr1,12594,12721,uc010nxq.1_exon_1_0_chr1_12595_f,+


Next, get the portions of the short tandem repeats that lie in introns. In the resulting PyRanges, `Start` and `End` cover only the part of each repeat that overlaps an intron, not the whole repeat; the full repeat length is retained in the `length` column.

In [6]:
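# intron_pr is assumed to be a PyRanges of UCSC introns built the same way as exon_pr (its loading cell is not shown here)
str_intron_intersect = str_pr.intersect(intron_pr, strandedness=False)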
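
To see why the intersected `Start` and `End` can differ from the original repeat, here is a minimal pure-Python sketch of clipping one interval to its overlap with another. It assumes half-open (BED-style) intervals and only illustrates the arithmetic; it is not the PyRanges implementation, and `clip_to_overlap` is a hypothetical helper.

```python
def clip_to_overlap(a_start, a_end, b_start, b_end):
    """Return the overlapping part of two half-open intervals, or None if they are disjoint."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if start < end else None

# Hypothetical repeat [100, 150) against a hypothetical intron [120, 400):
clipped = clip_to_overlap(100, 150, 120, 400)
print(clipped)  # only the part of the repeat inside the intron survives

# Intervals that merely touch share no bases under the half-open convention.
disjoint = clip_to_overlap(100, 150, 150, 400)
print(disjoint)
```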

In [11]:
# drop duplicate rows in the intersection PyRanges
str_intron_intersect_nodup = str_intron_intersect.drop_duplicate_positions()
str_intron_intersect_nodup.head()

Unnamed: 0,Chromosome,Start,End,class,length,Strand,num_units,actual_repeat,gene,gene_start,gene_stop,gene_strand,annotation,promoter,dist_to_tss
0,chr1,15240,15255,AGGGCC,15,+,2,GGGCCA,uc009viv.2,14407,29370,-,Exon,Non-Promoter,14115
1,chr1,15383,15395,AGGCGC,12,+,2,GCAGGC,uc009viv.2,14407,29370,-,Exon,Non-Promoter,13975
2,chr1,17476,17488,AGCCG,12,+,2,CCGAG,uc009vjc.1,16858,17751,-,Exon,Promoter,263
3,chr1,17801,17814,ATCCC,13,+,2,CCATC,uc009vjd.2,15796,18061,-,Exon,Promoter,247
4,chr1,18453,18466,AGGCC,13,+,2,GCCAG,uc009vit.3,14362,19759,-,Intron,Non-Promoter,1293
5,chr1,19412,19424,AGGGGG,12,+,2,GGGAGG,uc009vit.3,14362,19759,-,Exon,Promoter,335
6,chr1,19516,19528,AAAGCC,12,+,2,AAGCCA,uc001aai.1,16858,19759,-,Exon,Promoter,231
7,chr1,22811,22823,AAAGG,12,+,2,AGGAA,uc001aac.4,14362,29370,-,Intron,Non-Promoter,6547
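
Duplicates can plausibly arise when one repeat overlaps introns from several transcripts, so the same clipped interval is reported more than once. As a rough pandas analogue of dropping duplicate positions, the sketch below keeps the first row per (Chromosome, Start, End, Strand). The toy rows and gene labels are made up, and whether PyRanges includes the strand in its comparison depends on its settings, so treat the column choice as an assumption.

```python
import pandas as pd

# Toy intersection result: the first two rows share a position but differ in annotation.
toy = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1"],
    "Start": [15240, 15240, 15383],
    "End": [15255, 15255, 15395],
    "Strand": ["+", "+", "+"],
    "gene": ["geneA", "geneB", "geneA"],
})

position_cols = ["Chromosome", "Start", "End", "Strand"]
# Keep the first row seen for every distinct position.
dedup = toy.drop_duplicates(subset=position_cols, keep="first").reset_index(drop=True)
print(dedup)
```

Note that this keeps only the first annotation for a position; if the per-transcript information matters later, it would need to be aggregated before deduplicating.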


Now, get the STRs that are within 300kb of an intron. First, find the nearest intron to each STR:

In [14]:
nearest_intron_pr = str_pr.nearest(intron_pr)
nearest_intron_pr.head()

Unnamed: 0,Chromosome,Start,End,class,length,Strand,num_units,actual_repeat,gene,gene_start,...,gene_strand,annotation,promoter,dist_to_tss,Start_b,End_b,ucsc_id,unk,Strand_b,Distance
0,chr1,15240,15255,AGGGCC,15,+,2,GGGCCA,uc009viv.2,14407,...,-,Exon,Non-Promoter,14115,14829,15795,uc009viq.3_intron_0_0_chr1_14830_r,0,-,0
1,chr1,15383,15395,AGGCGC,12,+,2,GCAGGC,uc009viv.2,14407,...,-,Exon,Non-Promoter,13975,14829,15795,uc009viq.3_intron_0_0_chr1_14830_r,0,-,0
2,chr1,17476,17488,AGCCG,12,+,2,CCGAG,uc009vjc.1,16858,...,-,Exon,Promoter,263,17055,17605,uc009viq.3_intron_3_0_chr1_17056_r,0,-,0
3,chr1,17801,17814,ATCCC,13,+,2,CCATC,uc009vjd.2,15796,...,-,Exon,Promoter,247,17368,17914,uc009vix.2_intron_3_0_chr1_17369_r,0,-,0
4,chr1,18453,18466,AGGCC,13,+,2,GCCAG,uc009vit.3,14362,...,-,Intron,Non-Promoter,1293,18061,24737,uc009vix.2_intron_4_0_chr1_18062_r,0,-,0
5,chr1,19412,19424,AGGGGG,12,+,2,GGGAGG,uc009vit.3,14362,...,-,Exon,Promoter,335,18061,24737,uc009vix.2_intron_4_0_chr1_18062_r,0,-,0
6,chr1,19516,19528,AAAGCC,12,+,2,AAGCCA,uc001aai.1,16858,...,-,Exon,Promoter,231,18061,24737,uc009vix.2_intron_4_0_chr1_18062_r,0,-,0
7,chr1,22811,22823,AAAGG,12,+,2,AGGAA,uc001aac.4,14362,...,-,Intron,Non-Promoter,6547,18061,24737,uc009vix.2_intron_4_0_chr1_18062_r,0,-,0
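
Since the `nearest` result carries a `Distance` column (see the output above), a distance cutoff can be applied as an ordinary filter. The sketch below works on a plain pandas DataFrame with made-up rows; the helper `within_distance` is hypothetical, and the 300,000 bp threshold is taken from the text above and passed as a parameter so it is easy to change.

```python
import pandas as pd

def within_distance(nearest: pd.DataFrame, max_dist: int) -> pd.DataFrame:
    """Keep rows whose nearest feature is at most `max_dist` bp away."""
    return nearest[nearest["Distance"] <= max_dist].reset_index(drop=True)

# Toy stand-in for the nearest-intron table (values are illustrative only).
toy_nearest = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1"],
    "Start": [15240, 40000, 900000],
    "End": [15255, 40012, 900015],
    "Distance": [0, 250, 500000],
})

close = within_distance(toy_nearest, max_dist=300_000)
print(close)
```

A `Distance` of 0 means the repeat overlaps the intron, so rows like those in the output above pass any non-negative cutoff.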


Then, get everything 

Next, we want to get the eCLIP peaks that fall within these STRs. To start, let's try