# Transformer Model

My idea here is to use transcription factor (TF) expression, regulatory element (RE) accessibility, and the distance from each RE to the target gene (TG) transcription start site (TSS) as the input to a transformer model that predicts TG expression. The output of the transformer will be the predicted expression of each TG, which will be compared to the observed expression in the dataset.

In [1]:
!hostnamectl

   Static hostname: psh01com1hcom34
         Icon name: computer-server
           Chassis: server
        Machine ID: 4232d1115a5548e982021ba5a27af5c3
           Boot ID: 7722405a22e04dd3b701e19ac5a96705
  Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
            Kernel: Linux 4.18.0-553.22.1.el8_10.x86_64
      Architecture: x86-64


In [14]:

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
import pybedtools
from grn_inference import utils

torch.manual_seed(1)
np.random.seed(42)

project_dir = "/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER"
mm10_genome_dir = os.path.join(project_dir, "data/reference_genome/mm10")
mm10_gene_tss_file = os.path.join(project_dir, "data/genome_annotation/mm10/mm10_TSS.bed")
ground_truth_dir = os.path.join(project_dir, "ground_truth_files")
sample_input_dir = os.path.join(project_dir, "input/mESC/filtered_L2_E7.5_rep1")
output_dir = os.path.join(project_dir, "output/transformer_testing_output")

### Splitting the mm10 genome into ranges

Peak locations differ between samples. For the method to work across multiple samples, I need to map peaks onto fixed genomic ranges that the model can learn from. If a peak overlaps two genomic ranges, it is assigned to the range that contains the majority of the peak. If a peak is split evenly between two ranges, it is assigned to one of them at random.

#### Read in the mm10 gene TSS bed file

In [4]:
mm10_fasta_file = os.path.join(mm10_genome_dir, "chr1.fa")
mm10_chrom_sizes_file = os.path.join(mm10_genome_dir, "chrom.sizes")

In [48]:
print("Reading and formatting TSS bed file")
mm10_gene_tss_bed = pybedtools.BedTool(mm10_gene_tss_file)
gene_tss_df = (
    mm10_gene_tss_bed
    .filter(lambda x: x.chrom == "chr1")
    .saveas(os.path.join(mm10_genome_dir, "mm10_ch1_gene_tss.bed"))
    .to_dataframe()
    .sort_values(by="start", ascending=True)
    )
gene_tss_df.head()



Reading and formatting TSS bed file


Unnamed: 0,chrom,start,end,name,score,strand
0,chr1,3671498,3671498,Xkr4,.,-
1,chr1,4360303,4360303,Rp1,.,-
2,chr1,4360314,4360314,Rp1,.,-
3,chr1,4409241,4409241,Rp1,.,-
4,chr1,4497354,4497354,Sox17,.,-


#### Read in the scATAC-seq dataset

We also need the scATAC-seq dataset that we will use for training, so we load the scATAC-seq counts CSV file.

In [43]:
mesc_atac_data = pd.read_csv(
    os.path.join(sample_input_dir, "mESC_filtered_L2_E7.5_rep1_ATAC.csv"), 
    header=0, 
    index_col=0
    )
mesc_atac_peak_loc = mesc_atac_data.index
mesc_atac_peak_loc = utils.format_peaks(mesc_atac_peak_loc)
mesc_atac_peak_loc = mesc_atac_peak_loc[mesc_atac_peak_loc["chromosome"] == "chr1"]
mesc_atac_peak_loc = mesc_atac_peak_loc.rename(columns={"chromosome":"chrom"})
mesc_atac_peak_loc.head()

Unnamed: 0,chrom,start,end,strand,peak_id
0,chr1,3035602,3036202,.,chr1:3035602-3036202
1,chr1,3062653,3063253,.,chr1:3062653-3063253
2,chr1,3072313,3072913,.,chr1:3072313-3072913
3,chr1,3191496,3192096,.,chr1:3191496-3192096
4,chr1,3340575,3341175,.,chr1:3340575-3341175


We will also restrict the scATAC-seq data to only use chromatin accessibility data for chromosome 1 for now.

In [47]:
mesc_atac_data_chr1 = mesc_atac_data[mesc_atac_data.index.isin(mesc_atac_peak_loc.peak_id)]
mesc_atac_data_chr1.iloc[:5, :3].head()

Unnamed: 0,E7.5_rep1#AAACATGCAATGAATG-1,E7.5_rep1#AAACATGCAATTATGC-1,E7.5_rep1#AAACATGCACCTATAG-1
chr1:3035602-3036202,0,0,0
chr1:3062653-3063253,0,0,0
chr1:3072313-3072913,0,0,0
chr1:3191496-3192096,0,0,0
chr1:3340575-3341175,0,0,0


#### Read in the scRNA-seq dataset

In addition to the ATAC-seq dataset, we will also need the corresponding gene expression from the scRNA-seq counts csv file.

In [40]:
mesc_rna_data = pd.read_csv(
    os.path.join(sample_input_dir, "mESC_filtered_L2_E7.5_rep1_RNA.csv"),
    header=0,
    index_col=0
)
mesc_rna_data.iloc[0:5, 0:3].head()

Unnamed: 0,E7.5_rep1#AAACATGCAATGAATG-1,E7.5_rep1#AAACATGCAATTATGC-1,E7.5_rep1#AAACATGCACCTATAG-1
Xkr4,0,0,2
Sox17,0,0,0
Mrpl15,1,0,0
Lypla1,0,1,0
Tcea1,1,1,1


#### Create genomic windows

Next, we will tile the mm10 genome using the mm10 `chrom.sizes` file and a window size of 1 kb. We will use this to create our genomic ranges for mapping peaks to potential TGs.

In [8]:
window_size = 1000
mm10_genome_windows = pybedtools.bedtool.BedTool().window_maker(g=mm10_chrom_sizes_file, w=window_size)
mm10_chr1_windows = (
    mm10_genome_windows
    .filter(lambda x: x.chrom == "chr1")
    .saveas(os.path.join(mm10_genome_dir, f"mm10_chr1_windows_{window_size // 1000}kb.bed"))
    .to_dataframe()
    )
mm10_chr1_windows

Unnamed: 0,chrom,start,end
0,chr1,0,1000
1,chr1,1000,2000
2,chr1,2000,3000
3,chr1,3000,4000
4,chr1,4000,5000
...,...,...,...
195467,chr1,195467000,195468000
195468,chr1,195468000,195469000
195469,chr1,195469000,195470000
195470,chr1,195470000,195471000
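
As a cross-check on the tiling, the same kind of windows can be built in plain pandas. This is only a sketch: `tile_chromosome` is a hypothetical helper and the chromosome length is a toy value, not one read from `chrom.sizes`. It produces half-open BED-style intervals and clips the last window to the chromosome length when the length is not a multiple of the window size.

```python
import numpy as np
import pandas as pd

def tile_chromosome(chrom: str, chrom_len: int, window_size: int) -> pd.DataFrame:
    """Tile [0, chrom_len) into half-open windows of at most `window_size` bp."""
    starts = np.arange(0, chrom_len, window_size)
    # Clip the final window so it never runs past the chromosome end
    ends = np.minimum(starts + window_size, chrom_len)
    return pd.DataFrame({"chrom": chrom, "start": starts, "end": ends})

# Toy chromosome of 3,500 bp tiled into 1 kb windows (hypothetical values)
toy_windows = tile_chromosome("chrToy", chrom_len=3500, window_size=1000)
print(toy_windows)
```

With a fixed window size, the window index of any position is simply `position // window_size`, which is what makes the bins comparable across samples.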


#### Calculate the distance score between peaks and potential target genes

Now that we have the ATAC peak locations and the gene TSS locations, we can calculate the distance between each peak and each gene TSS. We only keep peak-TG rows where that distance is at most 1 Mb.

In [None]:
peak_bed = pybedtools.BedTool.from_dataframe(
    mesc_atac_peak_loc[["chrom", "start", "end", "peak_id"]]
    ).saveas(os.path.join(output_dir, "peak_tmp.bed"))

tss_bed = pybedtools.BedTool.from_dataframe(
    gene_tss_df[["chrom", "start", "end", "name"]]
    ).saveas(os.path.join(output_dir, "tss_tmp.bed"))

genes_near_peaks = utils.find_genes_near_peaks(peak_bed, tss_bed)
genes_near_peaks = genes_near_peaks[genes_near_peaks["TSS_dist"] <= 1e6]

  peak_chr  peak_start  peak_end               peak_id gene_chr  gene_start  \
0     chr1     3035602   3036202  chr1:3035602-3036202     chr1     3671498   
1     chr1     3062653   3063253  chr1:3062653-3063253     chr1     3671498   
2     chr1     3072313   3072913  chr1:3072313-3072913     chr1     3671498   
3     chr1     3191496   3192096  chr1:3191496-3192096     chr1     3671498   
4     chr1     3340575   3341175  chr1:3340575-3341175     chr1     3671498   

   gene_end target_id  
0   3671498      Xkr4  
1   3671498      Xkr4  
2   3671498      Xkr4  
3   3671498      Xkr4  
4   3671498      Xkr4  


Unnamed: 0,peak_chr,peak_start,peak_end,peak_id,gene_chr,gene_start,gene_end,target_id,TSS_dist
400,chr1,4496754,4497354,chr1:4496754-4497354,chr1,4497354,4497354,Sox17,0
87149,chr1,74542289,74542889,chr1:74542289-74542889,chr1,74542888,74542888,Plcd4,1
333683,chr1,190169693,190170293,chr1:190169693-190170293,chr1,190170295,190170295,Prox1os,2
50111,chr1,55363149,55363749,chr1:55363149-55363749,chr1,55363753,55363753,Boll,4
110386,chr1,84839235,84839835,chr1:84839235-84839835,chr1,84839840,84839840,Fbxo36,5


Now that we have the distance between each peak and gene TSS, we will calculate an exponential drop-off score.

In [26]:
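# Scale the TSS distance using an exponential drop-off function, exp(-dist / scale).
# Same functional form as the LINGER cis-regulatory potential calculation:
# https://github.com/Durenlab/LINGER
# NOTE: the code below divides by 250000, while the drop-off formula in these notes
# uses 25000. The outputs shown here were produced with 250000.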
genes_near_peaks = genes_near_peaks.copy()
genes_near_peaks["TSS_dist_score"] = np.exp(-genes_near_peaks["TSS_dist"] / 250000)
genes_near_peaks.head()

Unnamed: 0,peak_chr,peak_start,peak_end,peak_id,gene_chr,gene_start,gene_end,target_id,TSS_dist,TSS_dist_score
400,chr1,4496754,4497354,chr1:4496754-4497354,chr1,4497354,4497354,Sox17,0,1.0
87149,chr1,74542289,74542889,chr1:74542289-74542889,chr1,74542888,74542888,Plcd4,1,0.999996
333683,chr1,190169693,190170293,chr1:190169693-190170293,chr1,190170295,190170295,Prox1os,2,0.999992
50111,chr1,55363149,55363749,chr1:55363149-55363749,chr1,55363753,55363753,Boll,4,0.999984
110386,chr1,84839235,84839835,chr1:84839235-84839835,chr1,84839840,84839840,Fbxo36,5,0.99998
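
To get a feel for how quickly the score decays, the sketch below evaluates the drop-off at a few distances for two scale values: 250000 (the value in the cell above) and 25000 (the value written alongside it in these notes). `tss_distance_score` is a hypothetical helper, not part of `grn_inference`.

```python
import numpy as np

def tss_distance_score(dist_bp, scale_bp):
    """Exponential drop-off: 1.0 at the TSS, decaying with distance."""
    return np.exp(-np.asarray(dist_bp, dtype=float) / scale_bp)

# A few peak-to-TSS distances in bp, up to the 1 Mb cutoff
distances = np.array([0, 1, 5, 25_000, 250_000, 1_000_000])

for scale in (25_000, 250_000):
    scores = tss_distance_score(distances, scale)
    print(f"scale={scale}: {scores}")
```

The choice of scale matters at the 1 Mb cutoff: with a scale of 250000 a peak at the cutoff still scores exp(-4), roughly 0.018, while with a scale of 25000 it scores exp(-40), which is effectively zero.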


#### Find TF-peak binding sites using Homer

Next, we format the peaks in the Homer peak file format so that Homer can identify TF motifs within each peak.

In [35]:
# Select and rename the columns to match the Homer peak file layout.
# .copy() avoids assigning into a view of genes_near_peaks.
homer_peaks = genes_near_peaks[["peak_id", "peak_chr", "peak_start", "peak_end"]].copy()
homer_peaks = homer_peaks.rename(columns={
    "peak_id":"PeakID", 
    "peak_chr":"chr",
    "peak_start":"start",
    "peak_end":"end"
    })
homer_peaks["strand"] = "."  # ATAC peaks are unstranded
homer_peaks["start"] = homer_peaks["start"].astype(int)
homer_peaks["end"] = homer_peaks["end"].astype(int)
# Each peak appears once per nearby gene, so keep one row per peak
homer_peaks = homer_peaks.drop_duplicates(subset="PeakID")

os.makedirs(os.path.join(output_dir, "tmp"), exist_ok=True)
homer_peak_path = os.path.join(output_dir, "tmp/homer_peaks.txt")
homer_peaks.to_csv(homer_peak_path, sep="\t", header=False, index=False)
homer_peaks.head()

Unnamed: 0,PeakID,chr,start,end,strand
400,chr1:4496754-4497354,chr1,4496754,4497354,.
87149,chr1:74542289-74542889,chr1,74542289,74542889,.
333683,chr1:190169693-190170293,chr1,190169693,190170293,.
50111,chr1:55363149-55363749,chr1,55363149,55363749,.
110386,chr1:84839235-84839835,chr1,84839235,84839835,.


In [36]:
print(len(homer_peaks))
print(len(homer_peaks.drop_duplicates(subset="PeakID")))

12940
12940


Next, we need to run Homer on these peaks. I created a `run_homer.sh` script that runs `findMotifsGenome.pl`, `annotatePeaks.pl`, and the pipeline script `homer_tf_peak_motifs.py`. The last of these calculates Homer TF-to-peak scores by counting the number of motifs found in each peak for a given TF in the output file from `annotatePeaks.pl`.

In [38]:
!sbatch /gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/dev/testing_scripts/run_homer.sh

Submitted batch job 3389915


Once Homer has finished running, we will load the results. We are interested in the number of binding sites for each TF in each peak.

In [50]:
homer_results = pd.read_parquet(os.path.join(output_dir, "homer_tf_to_peak.parquet"), engine="pyarrow")
homer_results = homer_results.reset_index(drop=True)
homer_results["source_id"] = homer_results["source_id"].str.capitalize()
homer_results.head()

Unnamed: 0,peak_id,source_id,tf_motifs_in_peak,homer_binding_score
0,chr1:22805812-22806412,Amyb,1.0,0.000148
1,chr1:36251342-36251942,Amyb,1.0,0.000148
2,chr1:36021079-36021679,Amyb,1.0,0.000148
3,chr1:119082498-119083098,Amyb,2.0,0.000295
4,chr1:178684866-178685466,Amyb,1.0,0.000148


Let's review what we have so far:

- scRNA-seq gene expression of the TFs and TGs
- scATAC-seq chromatin accessibility of the peaks
- TSS position of each gene
- A distance score between each peak and TG within 1 Mb of one another
- A TF-peak binding score from Homer
- Genomic bins of 1 kb

Next, we need to combine these datasets.

In [74]:
print(f"\nmesc_rna_data\n{mesc_rna_data.iloc[0:5, 0:1].head()}")
print(f"\nmesc_atac_data_chr1\n{mesc_atac_data_chr1.iloc[0:5, :1].head()}")
print(f"\ngene_tss_df\n{gene_tss_df.head()}")
print(f"\ngenes_near_peaks\n{genes_near_peaks.head()}")
print(f"\nhomer_results\n{homer_results.head()}")
print(f"\nmm10_chr1_windows\n{mm10_chr1_windows.head()}")


mesc_rna_data
        E7.5_rep1#AAACATGCAATGAATG-1
Xkr4                               0
Sox17                              0
Mrpl15                             1
Lypla1                             0
Tcea1                              1

mesc_atac_data_chr1
                      E7.5_rep1#AAACATGCAATGAATG-1
chr1:3035602-3036202                             0
chr1:3062653-3063253                             0
chr1:3072313-3072913                             0
chr1:3191496-3192096                             0
chr1:3340575-3341175                             0

gene_tss_df
  chrom    start      end   name score strand
0  chr1  3671498  3671498   Xkr4     .      -
1  chr1  4360303  4360303    Rp1     .      -
2  chr1  4360314  4360314    Rp1     .      -
3  chr1  4409241  4409241    Rp1     .      -
4  chr1  4497354  4497354  Sox17     .      -

genes_near_peaks
       peak_chr  peak_start   peak_end                   peak_id gene_chr  \
400        chr1     4496754    4497354      chr1:449

We only want genes that are in either the potential TG list or the unique TF list, and cells that are present in both the RNA and ATAC datasets.

In [64]:
atac_cell_barcodes = mesc_atac_data_chr1.columns.to_list()
rna_cell_barcodes = mesc_rna_data.columns.to_list()
atac_in_rna_shared_barcodes = [i for i in atac_cell_barcodes if i in rna_cell_barcodes]

# Make sure that the cell names are in the same order and in both datasets
shared_barcodes = sorted(set(atac_in_rna_shared_barcodes))

mesc_atac_data_chr1_shared = mesc_atac_data_chr1[shared_barcodes]
mesc_rna_data_shared = mesc_rna_data[shared_barcodes]
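
The barcode matching above can also be written with pandas index operations, which avoids the list-membership scan. The sketch below uses toy DataFrames; the names `rna`, `atac`, and the cell labels are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the RNA and ATAC count matrices (features x cells)
rna = pd.DataFrame([[1, 0, 2]], index=["GeneA"], columns=["cell3", "cell1", "cell2"])
atac = pd.DataFrame([[0, 4, 1]], index=["peak1"], columns=["cell2", "cell4", "cell1"])

# Index.intersection keeps barcodes present in both matrices;
# sorting gives both matrices the same column order
shared = rna.columns.intersection(atac.columns).sort_values()

rna_shared = rna[shared]
atac_shared = atac[shared]
print(list(shared))
```

Either approach gives the same result; the important property is that column `i` refers to the same cell in both matrices.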

In [None]:
rna_first_cell = mesc_rna_data_shared.iloc[:, 0]
atac_first_cell = mesc_atac_data_chr1_shared.loc[:, rna_first_cell.name] # Makes sure the barcodes match
print(rna_first_cell.name, atac_first_cell.name)

E7.5_rep1#AAACATGCAATGAATG-1 E7.5_rep1#AAACATGCAATGAATG-1


In [69]:
potential_tgs = genes_near_peaks["target_id"].unique()
print(f"Number of potential TGs: {len(potential_tgs)}")

unique_tfs = homer_results["source_id"].unique()
print(f"Number of unique TFs: {len(unique_tfs)}")

unique_peaks = homer_results["peak_id"].unique()
print(f"Number of unique peaks: {len(unique_peaks)}")

Number of potential TGs: 1425
Number of unique TFs: 298
Number of unique peaks: 12940


The scores that we are interested in are the:

- `TSS_dist_score` between each peak and each potential TG from `genes_near_peaks`
- `tf_motifs_in_peak` between each TF and each peak from `homer_results`
- RNA expression values for TFs and potential TGs from `rna_first_cell`
- ATAC accessibility values for peaks from `atac_first_cell`

We want to build:
- Matrix 1: (RE accessibility * RE distance score) mapped to the genomic windows
- Matrix 2: (RE accessibility * RE distance score) x potential TG
- Matrix 3: (TF expression * Homer binding score) x Matrix 2


In [82]:
print("Peak to Potential TG Distance Score")
print(genes_near_peaks[["peak_id", "target_id", "TSS_dist_score"]].head())

print("\nHomer TF to Peak Binding Motifs")
print(homer_results[["source_id", "peak_id", "tf_motifs_in_peak"]].head())

print("\nscRNA-seq Gene Expression")
print(rna_first_cell.head())

print("\nscATAC-seq Peak Accessibility")
print(atac_first_cell.head())

Peak to Potential TG Distance Score
                         peak_id target_id  TSS_dist_score
400         chr1:4496754-4497354     Sox17        1.000000
87149     chr1:74542289-74542889     Plcd4        0.999996
333683  chr1:190169693-190170293   Prox1os        0.999992
50111     chr1:55363149-55363749      Boll        0.999984
110386    chr1:84839235-84839835    Fbxo36        0.999980

Homer TF to Peak Binding Motifs
  source_id                   peak_id  tf_motifs_in_peak
0      Amyb    chr1:22805812-22806412                1.0
1      Amyb    chr1:36251342-36251942                1.0
2      Amyb    chr1:36021079-36021679                1.0
3      Amyb  chr1:119082498-119083098                2.0
4      Amyb  chr1:178684866-178685466                1.0

scRNA-seq Gene Expression
Xkr4      0
Sox17     0
Mrpl15    1
Lypla1    0
Tcea1     1
Name: E7.5_rep1#AAACATGCAATGAATG-1, dtype: int64

scATAC-seq Peak Accessibility
chr1:3035602-3036202    0
chr1:3062653-3063253    0
chr1:3072313-307

#### TF-peak Binding Potential

We calculate the TF-peak binding potential as the Homer TF-to-peak binding score multiplied by the gene expression of each TF.

In [89]:
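# Attach this cell's expression to each TF-peak row; TFs without RNA data are dropped
tf_peak_binding_potential = pd.merge(homer_results, rna_first_cell, left_on="source_id", right_index=True, how="inner")
# The merged expression column is named after the cell barcode, so select it by name rather than by position
tf_peak_binding_potential["tf_peak_binding_potential"] = tf_peak_binding_potential["homer_binding_score"] * tf_peak_binding_potential[rna_first_cell.name]
tf_peak_binding_potential = tf_peak_binding_potential[["source_id", "peak_id", "tf_peak_binding_potential"]]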
tf_peak_binding_potential.head()

Unnamed: 0,source_id,peak_id,tf_peak_binding_potential
7792,Atf4,chr1:33026581-33027181,0.002384
7793,Atf4,chr1:107878563-107879163,0.002384
7794,Atf4,chr1:37698121-37698721,0.002384
7795,Atf4,chr1:23263694-23264294,0.002384
7796,Atf4,chr1:72867374-72867974,0.002384


#### Peak-TG Regulatory Potential

We calculate the peak-TG regulatory potential as the peak accessibility multiplied by the distance score between the peak and the potential TG TSS.

In [88]:
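# Attach this cell's accessibility to each peak-TG row; peaks without ATAC data are dropped
peak_tg_regulatory_potential = pd.merge(genes_near_peaks, atac_first_cell, left_on="peak_id", right_index=True, how="inner")
# The merged accessibility column is named after the cell barcode, so select it by name rather than by position
peak_tg_regulatory_potential["peak_tg_regulatory_potential"] = peak_tg_regulatory_potential["TSS_dist_score"] * peak_tg_regulatory_potential[atac_first_cell.name]
peak_tg_regulatory_potential = peak_tg_regulatory_potential[["peak_id", "target_id", "peak_tg_regulatory_potential"]]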
peak_tg_regulatory_potential.head()

Unnamed: 0,peak_id,target_id,peak_tg_regulatory_potential
400,chr1:4496754-4497354,Sox17,0.0
87149,chr1:74542289-74542889,Plcd4,0.0
333683,chr1:190169693-190170293,Prox1os,0.0
50111,chr1:55363149-55363749,Boll,0.0
110386,chr1:84839235-84839835,Fbxo36,3.99992


#### TF-Peak-TG Regulatory Potential

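Finally, we get the TF-peak-TG regulatory potential by multiplying the TF-peak binding potential by the peak-TG regulatory potential for each shared peak. Because the merge uses `how="outer"`, a peak present in only one of the two tables produces rows with a missing `source_id` or `target_id` and a missing score; those rows are dropped by the `groupby` in the aggregation step.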

In [91]:
tf_peak_tg_regulatory_potential = pd.merge(tf_peak_binding_potential, peak_tg_regulatory_potential, on="peak_id", how="outer")
tf_peak_tg_regulatory_potential["tf_peak_tg_score"] = tf_peak_tg_regulatory_potential["tf_peak_binding_potential"] * tf_peak_tg_regulatory_potential["peak_tg_regulatory_potential"]
tf_peak_tg_regulatory_potential = tf_peak_tg_regulatory_potential[["source_id", "peak_id", "target_id", "tf_peak_tg_score"]]
print(tf_peak_tg_regulatory_potential.head())

  source_id                 peak_id target_id  tf_peak_tg_score
0     Hnf4a  chr1:10007124-10007724   Ppp1r42               0.0
1     Hnf4a  chr1:10007124-10007724     Cops5               0.0
2     Hnf4a  chr1:10007124-10007724     Cspp1               0.0
3     Hnf4a  chr1:10007124-10007724     Cspp1               0.0
4     Hnf4a  chr1:10007124-10007724     Tcf24               0.0


#### Aggregate the TF-peak-TG scores into genomic coordinate windows

We will now aggregate the peaks based on the 1 kb windows that we created earlier. The peak scores within a window are summed to get a final score for that window. By aggregating the peaks into windows with a static size, we ensure that the transformer architecture will work with any mm10 data.

In [95]:
# Parse peak_id into genomic coords (chrom, start, end)
coords = tf_peak_tg_regulatory_potential["peak_id"].str.extract(
    r"(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)"
).astype({"start":"int64","end":"int64"})

df = pd.concat([tf_peak_tg_regulatory_potential, coords], axis=1)

df = df[df["chrom"] == "chr1"].copy()

window_size = int((mm10_chr1_windows["end"] - mm10_chr1_windows["start"]).mode().iloc[0])

# Build a quick lookup of window_id strings from window indices
# window index k -> [start=k*w, end=(k+1)*w)
win_lut = {}
for _, row in mm10_chr1_windows.iterrows():
    k = row["start"] // window_size
    win_lut[k] = f'{row["chrom"]}:{row["start"]}-{row["end"]}'

# --- Assign each unique peak to the window with maximal overlap (random ties) ---
rng = np.random.default_rng(0)  # set a seed for reproducibility; change/remove if you want different random choices

peaks_unique = (
    df.loc[:, ["peak_id", "chrom", "start", "end"]]
      .drop_duplicates(subset=["peak_id"])
      .reset_index(drop=True)
)

def assign_best_window(start, end, w):
    # windows indices spanned by the peak (inclusive)
    i0 = start // w
    i1 = (end - 1) // w  # subtract 1 so exact boundary end==k*w goes to k-1 window
    if i1 < i0:
        i1 = i0
    # compute overlaps with all spanned windows
    overlaps = []
    for k in range(i0, i1 + 1):
        bin_start = k * w
        bin_end = bin_start + w
        ov = max(0, min(end, bin_end) - max(start, bin_start))
        overlaps.append((k, ov))
    # choose the k with max overlap; break ties randomly
    ov_vals = [ov for _, ov in overlaps]
    max_ov = max(ov_vals)
    candidates = [k for (k, ov) in overlaps if ov == max_ov]
    if len(candidates) == 1:
        return candidates[0]
    else:
        return rng.choice(candidates)

peak_to_window_idx = peaks_unique.apply(
    lambda r: assign_best_window(r["start"], r["end"], window_size), axis=1
)
peaks_unique["window_idx"] = peak_to_window_idx
peaks_unique["window_id"] = peaks_unique["window_idx"].map(win_lut)

# Map window assignment back to the full TF–peak–gene table and aggregate
df = df.merge(
    peaks_unique.loc[:, ["peak_id", "window_id"]],
    on="peak_id",
    how="left"
)

# Aggregate scores per TF × window × gene
binned_scores = (
    df.groupby(["source_id", "window_id", "target_id"], observed=True)["tf_peak_tg_score"]
      .sum()
      .reset_index()
).rename(columns={"tf_peak_tg_score":"tf_window_tg_score"})

print(binned_scores.head())


  source_id               window_id      target_id  tf_window_tg_score
0      Atf1  chr1:10009000-10010000  1700034P13Rik                 0.0
1      Atf1  chr1:10009000-10010000  2610203C22Rik                 0.0
2      Atf1  chr1:10009000-10010000         Adhfe1                 0.0
3      Atf1  chr1:10009000-10010000        Arfgef1                 0.0
4      Atf1  chr1:10009000-10010000          Cops5                 0.0
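
Since the peaks here are 600 bp and the windows are 1 kb, each peak touches at most two adjacent windows, so the assignment can be vectorized. The following is a sketch under that assumption (peak length no greater than the window size, and `start < end`); `assign_windows` is a hypothetical helper, not the function used above.

```python
import numpy as np

def assign_windows(starts, ends, window_size, rng):
    """Assign each peak to the window it overlaps most (random choice on ties).

    Assumes every peak is no longer than `window_size`, so it touches at most
    two adjacent windows.
    """
    starts = np.asarray(starts, dtype=np.int64)
    ends = np.asarray(ends, dtype=np.int64)
    left_idx = starts // window_size
    boundary = (left_idx + 1) * window_size        # end of the left window
    left_ov = np.minimum(ends, boundary) - starts  # bp inside the left window
    right_ov = np.maximum(ends - boundary, 0)      # bp spilling into the next window
    # Coin flip used only where the two overlaps are equal
    coin = rng.integers(0, 2, size=starts.shape).astype(bool)
    go_right = (right_ov > left_ov) | ((right_ov == left_ov) & coin)
    return left_idx + go_right.astype(np.int64)

rng = np.random.default_rng(0)
# chr1:3035602-3036202 overlaps window 3035 by 398 bp and window 3036 by 202 bp
# chr1:4496754-4497354 overlaps window 4496 by 246 bp and window 4497 by 354 bp
idx = assign_windows([3035602, 4496754], [3036202, 4497354], 1000, rng)
print(idx)  # [3035 4497]
```

For peaks that can span more than two windows, the loop-based `assign_best_window` above remains the general solution.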


Next, we need to pivot the long dataframe into a 3D TF x window x TG NumPy array.

In [None]:
# Get unique IDs
tfs = binned_scores["source_id"].unique()
windows = binned_scores["window_id"].unique()
genes = binned_scores["target_id"].unique()

# Create index maps
tf_idx = {tf: i for i, tf in enumerate(tfs)}
window_idx = {p: i for i, p in enumerate(windows)}
gene_idx = {g: i for i, g in enumerate(genes)}

# Initialize 3D matrix
tensor_np = np.zeros((len(tfs), len(windows), len(genes)), dtype=float)

# Fill values
for _, row in binned_scores.iterrows():
    i = tf_idx[row["source_id"]]
    j = window_idx[row["window_id"]]
    k = gene_idx[row["target_id"]]
    tensor_np[i, j, k] = row["tf_window_tg_score"]

print(tensor_np.shape)  # (n_TFs, n_windows, n_genes)
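
The row-by-row fill can be slow on a large table. A possible alternative, sketched here on a toy table with the same column names as `binned_scores`, is to convert the labels to integer codes with `pd.factorize` and fill the array with one indexed assignment. This relies on each (TF, window, TG) combination appearing only once, which holds after the `groupby` aggregation.

```python
import numpy as np
import pandas as pd

# Toy long-format table with the same columns as `binned_scores`
toy_scores = pd.DataFrame({
    "source_id": ["TF1", "TF1", "TF2"],
    "window_id": ["w1", "w2", "w1"],
    "target_id": ["G1", "G1", "G2"],
    "tf_window_tg_score": [0.5, 1.5, 2.0],
})

# factorize returns integer codes plus the unique labels in order of first appearance
tf_codes, tf_labels = pd.factorize(toy_scores["source_id"])
win_codes, win_labels = pd.factorize(toy_scores["window_id"])
tg_codes, tg_labels = pd.factorize(toy_scores["target_id"])

toy_tensor = np.zeros((len(tf_labels), len(win_labels), len(tg_labels)), dtype=float)
# One indexed assignment replaces the row-by-row loop
toy_tensor[tf_codes, win_codes, tg_codes] = toy_scores["tf_window_tg_score"].to_numpy()
print(toy_tensor.shape)  # (2, 2, 2)
```

The `*_labels` arrays play the role of the `tf_idx`, `window_idx`, and `gene_idx` lookups: position `i` in `tf_labels` is the TF stored at `toy_tensor[i]`.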

To recap, the scores in this array represent the TF-peak regulatory potentials for each TG, based on the Homer TF-peak binding scores, the peak-TG distance scores, the peak accessibility values, and the TF RNA expression values.

### Genomic range distance to TG TSS

The distance between each genomic range and each potential TG TSS within 1 Mb of the range will be calculated using the exponential drop-off function:

$$\text{Scaling Factor} = e^{-\frac{\text{peak to TSS Distance}}{25000}}$$

This will down-weight the regulatory effect that peaks in a genomic range can exert on a potential TG as the distance to the TSS increases. The values in this matrix will be multiplied by the log1p-normalized and min-max-scaled RE accessibility.
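
A minimal sketch of the log1p normalization followed by min-max scaling is shown below. `log1p_minmax` is a hypothetical helper, and the sketch scales a single vector; whether the scaling should be applied per cell or across all cells is not decided here.

```python
import numpy as np

def log1p_minmax(counts):
    """log1p-transform counts, then rescale to the [0, 1] range."""
    logged = np.log1p(np.asarray(counts, dtype=float))
    span = logged.max() - logged.min()
    if span == 0:
        return np.zeros_like(logged)  # constant input: nothing to rescale
    return (logged - logged.min()) / span

# Toy accessibility counts for five peaks in one cell
toy_accessibility = np.array([0, 0, 1, 4, 9])
scaled = log1p_minmax(toy_accessibility)
print(np.round(scaled, 3))
```

The log1p step compresses large counts so that a few highly accessible peaks do not dominate, and the min-max step puts accessibility on the same 0 to 1 range as the distance score.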

### TF-RE Binding Potential

Homer will be used to calculate the ability of a TF to bind to each peak. TF-RE edges where the TF is not predicted to bind will have a value of 0. We will map the TF-RE binding potential to the genomic ranges by taking the average TF binding potential for peaks within a genomic range. The TF-RE binding potential matrix will be multiplied by a log1p-normalized and min-max-scaled vector of TF expression.
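
The averaging step could look something like the sketch below, which uses a toy table with a precomputed peak-to-window assignment. The column name `binding_score` and all values are made up for illustration.

```python
import pandas as pd

# Toy TF-peak binding scores with a precomputed peak-to-window assignment
toy_binding = pd.DataFrame({
    "source_id": ["TF1", "TF1", "TF1", "TF2"],
    "peak_id": ["p1", "p2", "p3", "p1"],
    "window_id": ["w1", "w1", "w2", "w1"],
    "binding_score": [0.2, 0.4, 0.1, 0.6],
})

# Mean over the peaks that carry a score for each TF within each window
window_binding = (
    toy_binding
    .groupby(["source_id", "window_id"])["binding_score"]
    .mean()
    .unstack(fill_value=0.0)  # TF x window matrix; 0 where no binding is predicted
)
print(window_binding)
```

Note that this averages only over peaks that appear in the table for a given TF. Averaging over all peaks in the window, counting unbound peaks as 0, would give lower values and is a separate design choice.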

### Combining TF-RE binding potential with RE regulatory potential values

The (genomic range accessibility * genomic range distance) x TG and (TF expression * TF-genomic range binding potential) x TG matrices will be matrix multiplied to get the final TF x genomic range x TG matrix. This matrix will be the input to the transformer model.
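
One point to pin down is what "matrix multiplied" means when the result has three axes: an ordinary matrix product of a TF x window matrix and a window x TG matrix sums over the window axis and returns a TF x TG matrix. One reading, sketched below with random toy matrices, is a per-window product that keeps the window axis. The shapes and names are assumptions for illustration.

```python
import numpy as np

n_tfs, n_windows, n_tgs = 2, 3, 4
rng = np.random.default_rng(0)

# Toy inputs: TF x window binding potential and window x TG regulatory potential
tf_window = rng.random((n_tfs, n_windows))
window_tg = rng.random((n_windows, n_tgs))

# Keep the window axis: T[f, w, g] = tf_window[f, w] * window_tg[w, g]
tf_window_tg = np.einsum("fw,wg->fwg", tf_window, window_tg)

# Summing over windows recovers the ordinary matrix product (TF x TG)
tf_tg = tf_window_tg.sum(axis=1)
assert np.allclose(tf_tg, tf_window @ window_tg)

print(tf_window_tg.shape)  # (2, 3, 4)
```

Keeping the window axis preserves where along the genome each TF-TG contribution comes from, which is the positional information the transformer would attend over.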
