## Map the ChIP-Atlas TF-peak file to nearest TSS and see if it overlaps with the BEELINE ground truth

The BEELINE paper states that the ground truth for its mESC ChIP-seq network uses TF-TG annotations from ChIP-Atlas. However, it does not specify which files were downloaded or how peaks were mapped to target genes. We are validating our TF-peak scoring methods against ChIP-seq TF-peak binding data from ChIP-Atlas, so we want to check whether the dataset we downloaded matches the BEELINE ground truth.

### ChIP-Atlas data

The ChIP-Atlas dataset we are using can be downloaded from:

`wget https://chip-atlas.dbcls.jp/data/mm10/assembled/Oth.Emb.05.AllAg.AllCell.bed`

This file corresponds to the following settings in the Peak Browser:
- Species -> M. musculus (mm10)
- Track type class -> ChIP: TFs and others
- Cell type class -> Embryo
- Threshold for Significance -> 50
- Track type -> All
- Cell type -> all

### BEELINE Networks

The BEELINE paper's supplementary table 4 contains information about the networks used for their ground truth. The mESC dataset contains the following entry:

| Source | #TFs | #Genes (incl. TFs) | #Edges | Density | Gene expression dataset |
|:------:|:----:|:------------------:|:------:|:-------:|:-----------------------:|
| mESC, ESCAPE + ChIP-Atlas | 247 | 25,703 | 6,348,394 | 0.154 | mESC, Hayashi et al.<sup>2</sup> |

2. Hayashi, T. et al. Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs. Nat. Commun. 9, 619 (2018).
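
One way to compare a downloaded network file against a row of this table is to recompute the summary columns from the edge list. The sketch below is a minimal illustration on toy data. It assumes the `Gene1`/`Gene2` column names of the BEELINE CSV, and it assumes density is defined as the number of edges divided by `#TFs × (#Genes − 1)`; that definition is an assumption here and should be checked against the paper before drawing conclusions.

```python
import pandas as pd


def network_summary(edges: pd.DataFrame,
                    tf_col: str = "Gene1",
                    target_col: str = "Gene2") -> dict:
    """Recompute #TFs, #genes, #edges and density from an edge list.

    Density is ASSUMED to be edges / (TFs * (genes - 1)), i.e. every TF
    may regulate every gene except itself.
    """
    unique_edges = edges[[tf_col, target_col]].drop_duplicates()
    tfs = set(unique_edges[tf_col])
    genes = tfs | set(unique_edges[target_col])  # genes include the TFs
    possible = len(tfs) * (len(genes) - 1)
    return {
        "n_tfs": len(tfs),
        "n_genes": len(genes),
        "n_edges": len(unique_edges),
        "density": len(unique_edges) / possible if possible else float("nan"),
    }


# Toy edge list (hypothetical gene names, not real data)
toy_edges = pd.DataFrame({"Gene1": ["A", "A", "B"], "Gene2": ["B", "C", "C"]})
toy_summary = network_summary(toy_edges)
print(toy_summary)
```

On the real file, the same function could be called as `network_summary(pd.read_csv(beeline_path))` once `beeline_path` is defined.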

I am also going to re-download the ground truth network from the BEELINE paper to ensure that our RN111 ground truth file and their file are the same.

The `BEELINE-Networks.zip` file containing the `mESC-ChIP-seq-network.csv` file can be downloaded from `https://zenodo.org/records/3701939/files/BEELINE-Networks.zip?download=1`

### Downloading the BEELINE ChIP-seq network for mESC

In [None]:
GROUND_TRUTH_DIR="/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/ground_truth_files"

!wget "https://zenodo.org/records/3701939/files/BEELINE-Networks.zip?download=1" -O "$GROUND_TRUTH_DIR/beeline_networks.zip"
!unzip "$GROUND_TRUTH_DIR/beeline_networks.zip" -d "$GROUND_TRUTH_DIR/beeline_networks"
!mv "$GROUND_TRUTH_DIR/beeline_networks/Networks/mouse/mESC-ChIP-seq-network.csv" "$GROUND_TRUTH_DIR/mESC_beeline_ChIP-seq.csv"
!rm -rf "$GROUND_TRUTH_DIR/beeline_networks"
!rm "$GROUND_TRUTH_DIR/beeline_networks.zip"

In [None]:
import os
import pandas as pd
import csv
import pybedtools

ground_truth_dir = "/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/ground_truth_files"

mesc_rn111_path = "/gpfs/Labs/Uzun/DATA/PROJECTS/2024.SC_MO_TRN_DB.MIRA/REPOSITORY/CURRENT/REFERENCE_NETWORKS/RN111_ChIPSeq_BEELINE_Mouse_ESC.tsv"
chip_atlas_bed_file = os.path.join(ground_truth_dir, "Oth.Emb.05.AllAg.AllCell.bed")
chip_atlas_path = os.path.join(ground_truth_dir, "chipatlas_mESC.csv")
beeline_path = os.path.join(ground_truth_dir, "mESC_beeline_ChIP-seq.csv")

## Comparing BEELINE to mESC RN111

In [None]:
mesc_rn111 = pd.read_csv(mesc_rn111_path, sep='\t', quoting=csv.QUOTE_NONE, on_bad_lines='skip', header=0)
mesc_rn111 = mesc_rn111.rename(columns={"Source": "source_id", "Target": "target_id"})
mesc_rn111 = mesc_rn111[["source_id", "target_id"]]
mesc_rn111.head()

In [None]:
beeline = pd.read_csv(beeline_path)
beeline = beeline.rename(columns={"Gene1": "source_id", "Gene2": "target_id"})
beeline.head()

In [None]:
merged_df = pd.merge(beeline, mesc_rn111, on=["source_id", "target_id"], how="outer", indicator=True)
beeline_only = len(merged_df[merged_df["_merge"] == "left_only"])
rn111_only = len(merged_df[merged_df["_merge"] == "right_only"])
both = len(merged_df[merged_df["_merge"] == "both"])
print(f"Edges only in BEELINE {beeline_only}")
print(f"Edges only in RN111 {rn111_only}")
print(f"Edges present in both {both}")

Seven edges were not found in both datasets. On closer inspection, this is because some BEELINE entries contain multiple gene names separated by commas, and these were mishandled when the gene names were standardized.
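
If the comma-separated entries should be treated as separate genes, one option is to split them into one edge per gene name before merging. The sketch below shows this on toy data with hypothetical gene names. Whether each name in such an entry really represents its own edge is an assumption, and the helper `split_multi_gene_edges` is illustrative rather than part of the original pipeline.

```python
import pandas as pd


def split_multi_gene_edges(edges: pd.DataFrame,
                           cols=("source_id", "target_id"),
                           sep: str = ",") -> pd.DataFrame:
    """Expand entries such as "GENE1,GENE2" into one row per gene name."""
    out = edges.copy()
    for col in cols:
        # Turn each cell into a list of names, then give each name its own row
        out[col] = out[col].astype(str).str.split(sep)
        out = out.explode(col, ignore_index=True)
        out[col] = out[col].str.strip()
    return out.drop_duplicates().reset_index(drop=True)


# Toy example (hypothetical gene names)
toy = pd.DataFrame({
    "source_id": ["NANOG", "SOX2"],
    "target_id": ["GENE1,GENE2", "GENE3"],
})
expanded = split_multi_gene_edges(toy)
print(expanded)
```

Applying the same function to both data frames before the merge would put multi-name entries on an equal footing in the comparison.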

In [None]:
merged_df[merged_df["_merge"] == "left_only"]

In [None]:
merged_df[merged_df["_merge"] == "right_only"]

Aside from these gene-name formatting differences, all of the edges are the same between BEELINE and RN111.

## Comparing ChIP-Atlas to BEELINE

First, we need to map each peak in the ChIP-Atlas BED file to its nearest gene TSS.

The mm10 tss_reference_file can be downloaded from here:

[RefGenie mm10 TSS annotation file](http://awspds.refgenie.databio.org/refgenomes.databio.org/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1/ensembl_gtf__default/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1_ensembl_gene_body.bed)
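
To make the nearest-TSS idea concrete, here is a small pure-Python sketch for a single chromosome. It is a simplification: it measures distance from the peak midpoint to each TSS position, whereas `bedtools closest` measures the gap between interval edges. All coordinates and gene names are made up for illustration.

```python
from bisect import bisect_left


def nearest_tss(peak_mid, tss_positions, tss_genes):
    """Return (gene, distance) for the TSS closest to a peak midpoint.

    tss_positions must be non-empty and sorted ascending for one
    chromosome; tss_genes is the parallel list of gene names.
    """
    i = bisect_left(tss_positions, peak_mid)
    # Only the TSSs immediately left and right of the peak can be nearest
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    best = min(candidates, key=lambda j: abs(tss_positions[j] - peak_mid))
    return tss_genes[best], abs(tss_positions[best] - peak_mid)


# Toy TSS annotation and peaks on one chromosome (hypothetical values)
tss_positions = [1_000, 5_000, 12_000]
tss_genes = ["GeneA", "GeneB", "GeneC"]
peaks = {"peak1": (900, 1_100), "peak2": (7_000, 7_400), "peak3": (20_000, 20_200)}

assignments = {
    name: nearest_tss((start + end) // 2, tss_positions, tss_genes)
    for name, (start, end) in peaks.items()
}
print(assignments)
```

A genome-wide version would repeat this per chromosome, which is what `bedtools closest` handles internally.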

In [None]:

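# bedtools closest expects both inputs sorted by chromosome and start position
chip_bed = pybedtools.BedTool(chip_atlas_bed_file).sort()
tss_bed = pybedtools.BedTool("/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/data/genome_annotation/mm10/mm10_TSS.bed").sort()
# d=True appends the distance from each peak to its closest TSS as the last column
chip_closest_tss = chip_bed.closest(tss_bed, d=True)
# If the output has more columns than the names listed here, pandas moves the
# leading unnamed columns into the index (they are recovered later with reset_index)
chip_closest_tss_df = chip_closest_tss.to_dataframe(
    names=["peak_chr", "peak_start", "peak_end", "peak_gene",
           "tss_chr", "tss_start", "tss_end", "tss_gene", "strand", "strand2", "distance"]
)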


We need to extract the TF name from the index (the ChIP-Atlas attribute column, which becomes `level_3` after resetting the index) and the TG name from the `tss_gene` column.
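
The sketch below illustrates how a TF name could be pulled out of a ChIP-Atlas-style attribute string. The example string is hypothetical and only mimics the `Name=<TF>%20(@%20<cell type>)` pattern implied by the regular expression used in this notebook; the real attribute column may contain additional fields. Unlike the notebook's regex, this version decodes the percent-encoding first, so a TF name that itself contains an encoded character would not be truncated.

```python
import re
from urllib.parse import unquote

NAME_PATTERN = re.compile(r"Name=([^;]+)")


def tf_from_attributes(attributes: str):
    """Extract an upper-cased TF name from a 'key=value;' attribute string."""
    match = NAME_PATTERN.search(attributes)
    if match is None:
        return None
    decoded = unquote(match.group(1))  # e.g. "Pou5f1 (@ ES cells)"
    # Drop the "(@ cell type)" suffix, assuming that layout
    return decoded.split(" (@", 1)[0].strip().upper()


# Hypothetical attribute string, not copied from the real file
example = "ID=SRX000000;Name=Pou5f1%20(@%20ES%20cells);Title=example;"
tf_name = tf_from_attributes(example)
print(tf_name)
```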

In [None]:
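# Move the unnamed leading columns out of the index; they become level_0, level_1, ...
chip_closest_tss_df = chip_closest_tss_df.reset_index()
# Capture the text after "Name=" up to the first percent-encoded character
chip_closest_tss_df["source_id"] = (
    chip_closest_tss_df["level_3"]
    .str.extract(r'Name=([^%]+)', expand=False)
)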

In [None]:
chip_closest_tss_df["source_id"] = chip_closest_tss_df["source_id"].str.upper()
chip_closest_tss_df["target_id"] = chip_closest_tss_df["tss_gene"].str.upper()

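# Keep one row per unique TF-TG pair so the merge counts edges rather than peaks
chip_closest_tss_df = chip_closest_tss_df[["source_id", "target_id"]].dropna().drop_duplicates()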
chip_closest_tss_df

In [None]:
merged_df = pd.merge(beeline, chip_closest_tss_df, on=["source_id", "target_id"], how="outer", indicator=True)
beeline_only = len(merged_df[merged_df["_merge"] == "left_only"])
chip_atlas_only = len(merged_df[merged_df["_merge"] == "right_only"])
both = len(merged_df[merged_df["_merge"] == "both"])
print(f"Edges only in BEELINE {beeline_only}")
print(f"Edges only in ChIP-Atlas {chip_atlas_only}")
print(f"Edges present in both {both}")
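
Because a merge counts rows rather than unique TF-TG pairs, a set-based comparison can serve as a cross-check on the counts above. The following is a minimal sketch on toy edges with hypothetical gene names. The helper `edge_overlap` and the "fraction of reference recovered" summary are illustrative choices, not part of the BEELINE method.

```python
def edge_overlap(reference_edges, candidate_edges) -> dict:
    """Compare two collections of (TF, target) pairs as sets."""
    ref = set(reference_edges)
    cand = set(candidate_edges)
    shared = ref & cand
    return {
        "reference_only": len(ref - cand),
        "candidate_only": len(cand - ref),
        "shared": len(shared),
        "fraction_of_reference_recovered": len(shared) / len(ref) if ref else float("nan"),
    }


# Toy edges; the duplicated pair in chip_edges is counted once
beeline_edges = [("NANOG", "GENE1"), ("NANOG", "GENE2"), ("SOX2", "GENE3")]
chip_edges = [("NANOG", "GENE1"), ("NANOG", "GENE1"), ("SOX2", "GENE4")]

overlap = edge_overlap(beeline_edges, chip_edges)
print(overlap)
```

With the data frames in this notebook, the inputs could be built as `zip(beeline["source_id"], beeline["target_id"])` and `zip(chip_closest_tss_df["source_id"], chip_closest_tss_df["target_id"])`.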