I want to compare Homer's TF-peak annotations (the nearest gene promoter and the peak location) against our ChIP-seq ground truth, which assumes that each peak regulates only the closest gene. Running `annotatePeaks.pl` produces one file per TF with information about that TF's binding potential at each peak in the dataset. The columns are:

- PeakID
- Chromosome
- Peak start position
- Peak end position
- Strand
- Peak Score
- FDR/Peak Focus Ratio/Region Size
- Annotation (e.g. Exon, Intron, ...)
- Detailed Annotation (Exon, Intron etc. + CpG Islands, repeats, etc.)
- Distance to nearest RefSeq TSS
- Nearest TSS: Native ID of annotation file
- Nearest TSS: Entrez Gene ID
- Nearest TSS: Unigene ID
- Nearest TSS: RefSeq ID
- Nearest TSS: Ensembl ID
- Nearest TSS: Gene Symbol
- Nearest TSS: Gene Aliases
- Nearest TSS: Gene description
- Additional columns depend on options selected when running the program.

I am interested in the following columns:

- PeakID
- Chromosome
- Peak start position
- Peak end position
- Annotation (e.g. Exon, Intron, ...)
- Detailed Annotation (Exon, Intron etc. + CpG Islands, repeats, etc.)
- Distance to nearest RefSeq TSS
- Nearest TSS: Gene Symbol
- CpG%
- GC%
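
The descriptions above do not match the header names in the file exactly (for example, "Nearest TSS: Gene Symbol" appears as `Gene Name` in the sample output). The sketch below shows one way to select the columns while failing loudly if a header is missing. The `DOC_TO_HEADER` mapping and the `select_columns` helper are my own hypothetical names, and the header names are taken from this sample file only; other Homer versions or options may produce different headers.

```python
import pandas as pd

# Assumed mapping from the documented column descriptions to the header
# names seen in this sample file; not guaranteed for other Homer runs.
DOC_TO_HEADER = {
    "Chromosome": "Chr",
    "Peak start position": "Start",
    "Peak end position": "End",
    "Annotation": "Annotation",
    "Detailed Annotation": "Detailed Annotation",
    "Distance to nearest RefSeq TSS": "Distance to TSS",
    "Nearest TSS: Gene Symbol": "Gene Name",
    "CpG%": "CpG%",
    "GC%": "GC%",
}


def select_columns(df, headers=tuple(DOC_TO_HEADER.values())):
    """Return a copy of df restricted to `headers`; raise if any header is absent."""
    missing = [h for h in headers if h not in df.columns]
    if missing:
        raise KeyError(f"Missing expected Homer columns: {missing}")
    return df.loc[:, list(headers)].copy()


# One-row frame built from the first peak of the sample output (PeakID is the index)
demo = pd.DataFrame(
    {
        "Chr": ["chr16"],
        "Start": [93932426],
        "End": [93933224],
        "Strand": ["+"],
        "Annotation": ["intron (NM_019500, intron 2 of 2)"],
        "Detailed Annotation": ["intron (NM_019500, intron 2 of 2)"],
        "Distance to TSS": [-3008],
        "Gene Name": ["Cldn14"],
        "CpG%": [0.018797],
        "GC%": [0.518148],
    },
    index=pd.Index(["peak75613"], name="PeakID"),
)

subset = select_columns(demo)
print(subset.columns.tolist())
```

Checking for missing headers up front avoids a silent mismatch when a file was produced with different `annotatePeaks.pl` options.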

The TF name is embedded in the header of the final column, which is structured as:

> CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer Distance From Peak(sequence,strand,conservation)

The literature suggests that how likely a peak is to regulate its nearest TSS depends on where the peak falls, e.g. in an intron, at the promoter/TSS, or in an intergenic region.
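
To look at peaks by location, the `Annotation` strings first need to be collapsed into broad categories, because Homer appends transcript details in parentheses. The following is a minimal sketch under that assumption. The `annotation_category` helper is a hypothetical name of mine, and the five rows are the peaks shown in the sample output, so the summary is illustrative only.

```python
import pandas as pd


def annotation_category(annotation):
    """Collapse 'intron (NM_019500, intron 2 of 2)' to 'intron'."""
    if not isinstance(annotation, str):
        # Missing annotations (NaN) are grouped separately
        return "unknown"
    return annotation.split(" (")[0].strip().lower()


# Annotation and distance values copied from the five sample peaks
peaks = pd.DataFrame(
    {
        "Annotation": [
            "intron (NM_019500, intron 2 of 2)",
            "promoter-TSS (NM_001036293)",
            "Intergenic",
            "Intergenic",
            "promoter-TSS (NM_001102615)",
        ],
        "Distance to TSS": [-3008, -111, -92933, -22047, -89],
    }
)

peaks["Category"] = peaks["Annotation"].map(annotation_category)

# Number of peaks and median absolute distance to the TSS per category
summary = (
    peaks.assign(abs_distance=peaks["Distance to TSS"].abs())
    .groupby("Category")["abs_distance"]
    .agg(["count", "median"])
)
print(summary)
```

With the categories in place, any downstream comparison against the ground truth can be stratified by `Category` rather than pooled across all peaks.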

In [19]:
import pandas as pd
import os

data_dir = "/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/dev/notebooks/sample_data/"

homer_df = pd.read_csv(
    os.path.join(data_dir, "sample_homer_annotatePeak_output.tsv"), 
    sep="\t", 
    header=0, 
    index_col=0 # Setting the PeakID column as the index
    )
homer_df.index.names = ["PeakID"]
homer_df.head()

PeakID,Chr,Start,End,Strand,Peak Score,Focus Ratio/Region Size,Annotation,Detailed Annotation,Distance to TSS,Nearest PromoterID,...,Nearest Unigene,Nearest Refseq,Nearest Ensembl,Gene Name,Gene Alias,Gene Description,Gene Type,CpG%,GC%,"CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer Distance From Peak(sequence,strand,conservation)"
peak75613,chr16,93932426,93933224,+,0,,"intron (NM_019500, intron 2 of 2)","intron (NM_019500, intron 2 of 2)",-3008,NM_001165926,...,,NM_019500,ENSMUSG00000047109,Cldn14,-,claudin 14,protein-coding,0.018797,0.518148,
peak17901,chr10,67284946,67285839,+,0,,promoter-TSS (NM_001036293),promoter-TSS (NM_001036293),-111,NM_001036293,...,,NM_001036293,ENSMUSG00000075000,Nrbf2,NRBF-2,nuclear receptor binding factor 2,protein-coding,0.081747,0.63311,
peak151815,chr6,98435152,98436223,+,0,,Intergenic,Intergenic,-92933,NM_001128092,...,,NM_001128092,ENSMUSG00000090667,Mdfic2,Gm765,MyoD family inhibitor domain containing 2,protein-coding,0.014006,0.455224,
peak52902,chr13,113270778,113271469,+,0,,Intergenic,Intergenic,-22047,NM_001378698,...,,NM_001378698,ENSMUSG00000021763,Cspg4b,-,chondroitin sulfate proteoglycan 4B,protein-coding,0.011577,0.455202,
peak34483,chr11,114764860,114765741,+,0,,promoter-TSS (NM_001102615),promoter-TSS (NM_001102615),-89,NM_001102615,...,,NM_001102615,ENSMUSG00000010021,Kif19a,Kif19,kinesin family member 19A,protein-coding,0.057889,0.626984,


In [21]:
homer_df["Peak Location"] = homer_df["Chr"].astype(str) + ":" + homer_df["Start"].astype(str) + "-" + homer_df["End"].astype(str)
homer_df.head()

PeakID,Chr,Start,End,Strand,Peak Score,Focus Ratio/Region Size,Annotation,Detailed Annotation,Distance to TSS,Nearest PromoterID,...,Nearest Refseq,Nearest Ensembl,Gene Name,Gene Alias,Gene Description,Gene Type,CpG%,GC%,"CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer Distance From Peak(sequence,strand,conservation)",Peak Location
peak75613,chr16,93932426,93933224,+,0,,"intron (NM_019500, intron 2 of 2)","intron (NM_019500, intron 2 of 2)",-3008,NM_001165926,...,NM_019500,ENSMUSG00000047109,Cldn14,-,claudin 14,protein-coding,0.018797,0.518148,,chr16:93932426-93933224
peak17901,chr10,67284946,67285839,+,0,,promoter-TSS (NM_001036293),promoter-TSS (NM_001036293),-111,NM_001036293,...,NM_001036293,ENSMUSG00000075000,Nrbf2,NRBF-2,nuclear receptor binding factor 2,protein-coding,0.081747,0.63311,,chr10:67284946-67285839
peak151815,chr6,98435152,98436223,+,0,,Intergenic,Intergenic,-92933,NM_001128092,...,NM_001128092,ENSMUSG00000090667,Mdfic2,Gm765,MyoD family inhibitor domain containing 2,protein-coding,0.014006,0.455224,,chr6:98435152-98436223
peak52902,chr13,113270778,113271469,+,0,,Intergenic,Intergenic,-22047,NM_001378698,...,NM_001378698,ENSMUSG00000021763,Cspg4b,-,chondroitin sulfate proteoglycan 4B,protein-coding,0.011577,0.455202,,chr13:113270778-113271469
peak34483,chr11,114764860,114765741,+,0,,promoter-TSS (NM_001102615),promoter-TSS (NM_001102615),-89,NM_001102615,...,NM_001102615,ENSMUSG00000010021,Kif19a,Kif19,kinesin family member 19A,protein-coding,0.057889,0.626984,,chr11:114764860-114765741


In [None]:
cols_of_interest = [
    "Peak Location",
    "Annotation",
    "Distance to TSS",
    "Gene Name",
    "Gene Type",
    "CpG%",
    "GC%"
]

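# Find the motif column by name instead of by position: its header contains
# "Distance From Peak", and its position shifts whenever a column is added
TF_column = next(col for col in homer_df.columns if "Distance From Peak" in col)

# Extract the TF name from the motif column name,
# e.g. "CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer ..." -> "CTCF"
TF_name = TF_column.split('/')[0].split('(')[0].split(':')[0]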
print(f"TF name = {TF_name}")

homer_df_subset = homer_df[cols_of_interest]
homer_df_subset.head()


TF name = CTCF


PeakID,Peak Location,Annotation,Distance to TSS,Gene Name,Gene Type,CpG%,GC%
peak75613,chr16:93932426-93933224,"intron (NM_019500, intron 2 of 2)",-3008,Cldn14,protein-coding,0.018797,0.518148
peak17901,chr10:67284946-67285839,promoter-TSS (NM_001036293),-111,Nrbf2,protein-coding,0.081747,0.63311
peak151815,chr6:98435152-98436223,Intergenic,-92933,Mdfic2,protein-coding,0.014006,0.455224
peak52902,chr13:113270778-113271469,Intergenic,-22047,Cspg4b,protein-coding,0.011577,0.455202
peak34483,chr11:114764860-114765741,promoter-TSS (NM_001102615),-89,Kif19a,protein-coding,0.057889,0.626984


In [36]:
import numpy as np
# Build the TF -> nearest-gene edge list: one row per peak.
# pandas assigns the default 0..n-1 index, so no explicit index is needed.
tf_to_nearest_gene = pd.DataFrame({
    "source_id": [TF_name] * len(homer_df_subset),
    "target_id": homer_df_subset["Gene Name"].to_list(),
})
tf_to_nearest_gene.head()

,source_id,target_id
0,CTCF,Cldn14
1,CTCF,Nrbf2
2,CTCF,Mdfic2
3,CTCF,Cspg4b
4,CTCF,Kif19a
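
With the edge list in this `source_id`/`target_id` form, one possible way to compare it against the ChIP-seq ground truth is a left merge with an indicator column. This is only a sketch: I am assuming the ground truth is also a table with `source_id` and `target_id` columns, and that gene symbols can be matched after upper-casing. The `label_against_ground_truth` helper is a hypothetical name, and the `truth` frame below is made up purely to exercise the function, not real ChIP-seq data.

```python
import pandas as pd


def label_against_ground_truth(pred, truth):
    """Flag each predicted TF-gene edge as present or absent in the ground truth."""
    keys = ["source_id", "target_id"]
    # Upper-case both sides so 'Cldn14' and 'CLDN14' match (assumed convention)
    pred_norm = pred[keys].dropna().apply(lambda s: s.str.upper()).drop_duplicates()
    truth_norm = truth[keys].dropna().apply(lambda s: s.str.upper()).drop_duplicates()
    merged = pred_norm.merge(truth_norm, on=keys, how="left", indicator=True)
    merged["in_ground_truth"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Predicted edges: the five TF-gene pairs from the table above
pred = pd.DataFrame(
    {
        "source_id": ["CTCF"] * 5,
        "target_id": ["Cldn14", "Nrbf2", "Mdfic2", "Cspg4b", "Kif19a"],
    }
)

# HYPOTHETICAL ground truth, invented only to demonstrate the function
truth = pd.DataFrame(
    {
        "source_id": ["CTCF", "CTCF"],
        "target_id": ["NRBF2", "KIF19A"],
    }
)

labeled = label_against_ground_truth(pred, truth)
# Fraction of predicted edges found in the (toy) ground truth
precision = labeled["in_ground_truth"].mean()
print(labeled)
print(f"precision = {precision:.2f}")
```

Duplicate edges are dropped before merging so that a gene with several nearby peaks counts once. Whether that is the right choice depends on how the ground truth itself was built.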
