## Plotting Sliding Window Score AUROC and AUPRC

When evaluating the feature score distributions of True / False values after adding MIRA RP scores to our combined DataFrame, we noticed that the `sliding_window_score` feature has a good separation in score distribution between True and False values. The True and False values are from the **RN115 LOGOF ESCAPE** mESC knockout dataset.

In [None]:
!hostnamectl

In [None]:
from IPython.display import Image, display
Image(
    "/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/figures/mm10/DS011/xgboost_feature_score_hist_by_label.png", 
    width=800, 
    height=800
    )

Now, we are interested in assessing the AUROC and AUPRC for just the sliding window scores.

The large number of scores at exactly 0 and 1 turned out to be caused by `clip_and_normalize_log1p_pandas()`. This function clips scores below the 5th percentile and above the 95th percentile to those thresholds, and the subsequent min-max normalization rescales the distribution to the range 0-1. As a result, many scores pile up at 0 (the lower threshold) and 1 (the upper threshold).
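
To make the pile-up concrete, here is a minimal sketch of percentile clipping followed by min-max normalization on toy data. The helper `clip_and_minmax` is a hypothetical stand-in written for illustration, not the actual `clip_and_normalize_log1p_pandas()` implementation, and it leaves out the log1p step.

```python
import numpy as np
import pandas as pd

def clip_and_minmax(scores, lower_q=0.05, upper_q=0.95):
    """Hypothetical stand-in: clip to percentile thresholds, then min-max scale."""
    lower = scores.quantile(lower_q)
    upper = scores.quantile(upper_q)
    clipped = scores.clip(lower=lower, upper=upper)
    # After clipping, the minimum is `lower` and the maximum is `upper`
    return (clipped - lower) / (upper - lower)

toy_scores = pd.Series(np.arange(100, dtype=float))  # toy data: 0, 1, ..., 99
normalized = clip_and_minmax(toy_scores)

n_at_zero = int((normalized == 0).sum())
n_at_one = int((normalized == 1).sum())
print(f"Scores at exactly 0: {n_at_zero}, at exactly 1: {n_at_one}")
```

On this toy series, every value below the lower threshold collapses onto 0 and every value above the upper threshold collapses onto 1, which is the same build-up seen in the real scores.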

First, we will look at the AUROC and AUPRC as it stands now.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
output_dir = "/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/output/DS011_mESC/DS011_mESC_sample1/"

In [None]:
inferred_df = pd.read_parquet(os.path.join(output_dir, "labeled_inferred_grn.parquet"), engine="pyarrow")
sliding_window_score_df = inferred_df[["source_id", "peak_id", "target_id", "sliding_window_score", "label"]]
sliding_window_score_df

We first need to balance the number of True and False rows so that class imbalance does not skew the curves. The precision-recall curve in particular depends on the ratio of True to False rows.
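
One way to make the subsampling reproducible is to draw the same number of rows per label with a fixed seed. The sketch below uses toy data; `balance_by_label` and its `random_state` default are illustrative choices, not part of the pipeline.

```python
import pandas as pd

def balance_by_label(df, label_col="label", random_state=0):
    """Illustrative helper: downsample every class to the size of the smallest one."""
    n_per_class = df[label_col].value_counts().min()
    # GroupBy.sample draws n rows from each label group; the seed makes it repeatable
    return df.groupby(label_col).sample(n=n_per_class, random_state=random_state)

# Toy data: four False rows and two True rows
toy_df = pd.DataFrame({
    "sliding_window_score": [0.10, 0.25, 0.40, 0.55, 0.80, 0.95],
    "label": [0, 0, 0, 0, 1, 1],
})

toy_balanced = balance_by_label(toy_df)
print(toy_balanced["label"].value_counts().to_dict())
```

With a fixed seed, rerunning the notebook gives the same balanced subset, so the histograms and curves do not shift between runs.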

In [None]:
def balance_dataset(df):
    true_rows = df[df["label"] == 1]
    false_rows = df[df["label"] == 0]

    print("Before Balancing:")
    print(f"  - Number of True values: {len(true_rows)}")
    print(f"  - Number of False values: {len(false_rows)}")

    min_rows = min(len(true_rows), len(false_rows))
    print(f"\nSubsampling down to {min_rows} rows")

    true_rows_sampled = true_rows.sample(min_rows)
    false_rows_sampled = false_rows.sample(min_rows)

    balanced_df = pd.concat([true_rows_sampled, false_rows_sampled])

    balanced_true_rows = balanced_df[balanced_df["label"] == 1]
    balanced_false_rows = balanced_df[balanced_df["label"] == 0]

    print("\nAfter Balancing:")
    print(f"  - Number of True values: {len(balanced_true_rows)}")
    print(f"  - Number of False values: {len(balanced_false_rows)}")

    return balanced_df

In [None]:
balanced_df = balance_dataset(sliding_window_score_df)

Let's look at the True / False sliding window score histogram.

In [None]:
def plot_true_false_feature_histogram(
    df,
    feature_col,
    limit_x = True
):

    fig = plt.figure(figsize=(5, 4))
    
    true_values = df[df["label"] == 1]
    false_values = df[df["label"] == 0]

    plt.hist(
        false_values[feature_col].dropna(),
        bins=50, alpha=0.7,
        color="#747474", edgecolor="#2D2D2D",
        label="False",
    )
    plt.hist(
        true_values[feature_col].dropna(),
        bins=50, alpha=0.7,
        color='#1682b1', edgecolor="#032b5f",
        label="True",
    )

    # set titles/labels on the same ax
    plt.title(feature_col, fontsize=14)
    plt.xlabel(feature_col, fontsize=14)
    plt.ylabel("Frequency", fontsize=14)
    if limit_x:
        plt.xlim(0, 1)

    fig.legend(
        loc="lower center",
        ncol=2,
        fontsize=14,
        bbox_to_anchor=(0.5, -0.02)
    )
    fig.tight_layout(rect=[0, 0.05, 1, 1])
    plt.show()

In [None]:
plot_true_false_feature_histogram(balanced_df, "sliding_window_score")

In [None]:
def plot_auroc_auprc(inferred_df):
    # Subset the relevant columns
    df = inferred_df[["sliding_window_score", "label"]].dropna()

    # Get true labels and predicted scores
    y_true = df["label"]
    y_scores = df["sliding_window_score"]

    # --- ROC Curve ---
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)

    # --- PR Curve ---
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    avg_precision = average_precision_score(y_true, y_scores)

    # --- Plot ---
    plt.figure(figsize=(10, 4))

    # ROC
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, label=f"AUROC = {roc_auc:.2f}")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()

    # PR
    plt.subplot(1, 2, 2)
    plt.plot(recall, precision, label=f"AUPRC = {avg_precision:.2f}")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-Recall Curve")
    plt.legend()

    plt.tight_layout()
    plt.show()
    
plot_auroc_auprc(balanced_df)

As expected, there are a large number of artificial 0 and 1 values, which come from scores below the 5th percentile clipping threshold and above the 95th percentile clipping threshold. Let's see what happens if we filter out sliding window scores with values of 0 or 1.

In [None]:
filtered_balanced_df = balanced_df[
    (balanced_df["sliding_window_score"] > 0) &
    (balanced_df["sliding_window_score"] < 1)
]

In [None]:
plot_true_false_feature_histogram(filtered_balanced_df, "sliding_window_score")

In [None]:
plot_auroc_auprc(filtered_balanced_df)

---

### Investigating the bimodal distribution of True scores

In the sliding window score distribution, the False scores have a single broad peak around 0.6, while the True scores are bimodal with a lower peak around 0.375 - 0.425 and a higher peak around 0.8 - 0.9. We want to determine what is causing the True values to have multiple peaks.

First, let's look at the number of TFs in the True and False scores.

In [None]:
total_true_tfs = sliding_window_score_df[sliding_window_score_df["label"] == 1]["source_id"].unique()
total_false_tfs = sliding_window_score_df[sliding_window_score_df["label"] == 0]["source_id"].unique()

print(f"Total TFs in the True values: {len(total_true_tfs)}")
print(f"Total TFs in the False values: {len(total_false_tfs)}")

There are many more TFs in the False values. How many edges does each TF have for the True and False groups?

To start answering this question, we can count the number of sliding window scores per TF in each group.
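
As a compact alternative to building two separate aggregations, the per-TF counts for both labels can be tabulated in one step. This is a sketch on toy data with hypothetical TF names; the column names `num_false` and `num_true` are chosen here for illustration.

```python
import pandas as pd

# Toy edges with hypothetical TF names
toy_edges = pd.DataFrame({
    "source_id": ["TF_A", "TF_A", "TF_A", "TF_B", "TF_B", "TF_C"],
    "label":     [1, 0, 0, 1, 0, 0],
})

# Rows are TFs, columns are labels, values are the number of scores
toy_counts = (
    pd.crosstab(toy_edges["source_id"], toy_edges["label"])
    .rename(columns={0: "num_false", 1: "num_true"})
)
print(toy_counts)
```

A TF with no True scores still appears in the table with a count of 0, which makes it easy to see which TFs only occur in the False group.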

In [None]:

true_sorted_agg_df = (
    sliding_window_score_df[sliding_window_score_df["label"] == 1][["source_id", "sliding_window_score"]]
    .groupby("source_id")
    .count()
    .sort_values(by="sliding_window_score", ascending=False)
    .reset_index()
    .rename(columns={"sliding_window_score":"num_scores"})
    )

false_sorted_agg_df = (
    sliding_window_score_df[sliding_window_score_df["label"] == 0][["source_id", "sliding_window_score"]]
    .groupby("source_id")
    .count()
    .sort_values(by="sliding_window_score", ascending=False)
    .reset_index()
    .rename(columns={"sliding_window_score":"num_scores"})
    )


Now, we can plot the number of scores per TF.

#### Number of False sliding window scores by TF

In [None]:
fig = plt.figure(figsize=(15,5))
plt.bar(x=false_sorted_agg_df["source_id"], height=false_sorted_agg_df["num_scores"], color="blue")
plt.title("Number of False sliding window scores by TF")
plt.ylabel("Number of False \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=9)
plt.tight_layout()
plt.show()

#### Number of True sliding window scores by TF

In [None]:
fig = plt.figure(figsize=(5,3))
plt.bar(x=true_sorted_agg_df["source_id"], height=true_sorted_agg_df["num_scores"], color="blue")
plt.title("Number of True sliding window scores by TF")
plt.ylabel("Number of True \nsliding window scores")
plt.xticks(rotation=55, fontsize=10)
plt.tight_layout()
plt.show()

It looks like the True values make up a relatively small proportion of the scores compared with the False values. Let's look at the number of scores for TFs with both True and False values.

In [None]:
grouped_values = pd.merge(true_sorted_agg_df, false_sorted_agg_df, on="source_id", how="inner")
grouped_values = grouped_values.rename(columns={
    "num_scores_x": "Num True Scores",
    "num_scores_y": "Num False Scores"
})
grouped_values

In [None]:
grouped_values.plot.bar(x="source_id", stacked=False, figsize=(7,4))
plt.title("Number of True and False sliding window scores for TFs with True values")
plt.ylabel("Number of \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=10)
plt.legend()
plt.tight_layout()
plt.show()

There are many more scores for the False values. Let's look to see if there are any differences in the True vs False score distributions for each TF.

In [None]:
import math

def plot_tf_score_distributions(df, tf_name_list, score_col, title):
    ncols = 4
    nrows = math.ceil(len(tf_name_list) / ncols)
    # squeeze=False keeps `ax` two-dimensional even when there is only one row of plots
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False)
    fig.set_figwidth(ncols * 3)
    fig.set_figheight(nrows * 3)

    for i, tf_name in enumerate(tf_name_list):
        plot_row = i // ncols
        plot_col = i % ncols
        
        tf_scores = df[df["source_id"] == tf_name][score_col]
        
        ax[plot_row, plot_col].hist(tf_scores, bins=25)
        ax[plot_row, plot_col].set_title(tf_name, fontsize=10)
        ax[plot_row, plot_col].tick_params(axis='x', labelsize=9)
        ax[plot_row, plot_col].tick_params(axis='y', labelsize=9)
        ax[plot_row, plot_col].set_xbound((0, 1))

    # Hide the extra plots
    n_tfs = len(tf_name_list)
    n_figs = ncols * nrows

    for i in range(n_tfs, n_figs):
        row = i // ncols
        col = i % ncols
        ax[row, col].axis("off")
        
    plt.suptitle(title)
    plt.tight_layout(rect=[0.05, 0.05, 1, 1])
    
    fig.text(0.5, 0.04, 'Sliding Window Score', ha='center', fontsize=12)
    fig.text(0.04, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=12)
    
    plt.show()

#### Distribution of True sliding window scores per TF

In [None]:
true_df = sliding_window_score_df[sliding_window_score_df["label"] == 1]
tfs_with_true_scores = true_df["source_id"].unique()
plot_tf_score_distributions(
    df=true_df, 
    tf_name_list=tfs_with_true_scores, 
    score_col="sliding_window_score",
    title="True sliding window score distributions by TF"
    )

#### Distribution of False sliding window scores per TF

In [None]:
false_df = sliding_window_score_df[sliding_window_score_df["label"] == 0]
plot_tf_score_distributions(
    df=false_df, 
    tf_name_list=tfs_with_true_scores, 
    score_col="sliding_window_score",
    title="False sliding window score distributions by TF"
    )

The True and False scores appear to have similar distributions for each of the TFs.

For the True scores, only the TFs **SOX2**, **SOX9**, and **TCF3** have more than 10 scores, so these three TFs appear to drive the peaks that we see in the True score distribution. Let's see how the distributions of these scores match up when plotted together.

>Note: The numbers of True and False scores per TF are balanced for the following plot; otherwise the False edges overwhelm the True edges.

In [None]:
import random
import math

def plot_true_false_tf_score_distributions(true_df, false_df, tf_name_list, score_col, title):
    ncols = min(len(tf_name_list), 4)
    nrows = math.ceil(len(tf_name_list) / ncols)
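    # squeeze=False guarantees an array of axes, so flatten() also works for a single TF
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False)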
    fig.set_figwidth(ncols * 3)
    fig.set_figheight(nrows * 3)
    
    ax = ax.flatten()

    for i, tf_name in enumerate(tf_name_list):
        plot_row = i // ncols
        plot_col = i % ncols
        
        true_tf_scores = true_df[true_df["source_id"] == tf_name][score_col]
        false_tf_scores = false_df[false_df["source_id"] == tf_name][score_col]
        
        min_scores = min(len(true_tf_scores), len(false_tf_scores))
        true_tf_scores = true_tf_scores.sample(min_scores)
        false_tf_scores = false_tf_scores.sample(min_scores)
        
        ax[i].hist(true_tf_scores, bins=25, alpha=0.7, label="True Scores")
        ax[i].hist(false_tf_scores, bins=25, alpha=0.7, label="False Scores")
        
        ax[i].set_title(tf_name, fontsize=10)
        ax[i].tick_params(axis='x', labelsize=9)
        ax[i].tick_params(axis='y', labelsize=9)
        ax[i].set_xbound((0, 1))

    for j in range(len(tf_name_list), len(ax)):
        ax[j].axis("off")
    
    handles, labels = ax[0].get_legend_handles_labels()
    fig.legend(
        handles,
        labels,
        loc='upper center',
        bbox_to_anchor=(0.5, 1.0),
        ncol=2,
        fontsize=10,
        frameon=False
    )
        
    plt.suptitle(title, y=1.05)
    plt.tight_layout(rect=[0.05, 0.05, 1, 0.98])
        
    fig.text(0.5, 0.04, 'Sliding Window Score', ha='center', fontsize=11)
    fig.text(0.04, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=11)
    
    plt.show()

In [None]:
tf_names = ["SOX2", "SOX9", "TCF3"]

plot_true_false_tf_score_distributions(
    true_df=true_df,
    false_df=false_df,
    tf_name_list=tf_names,
    score_col="sliding_window_score",
    title="Sliding window score distributions for TFs with True values"
)

Let's plot the balanced distributions again, but this time we can color SOX2, SOX9, and TCF3 differently so we can see if they truly are the majority of scores.

In [None]:
def plot_tfs_of_interest(
    df,
    feature_col,
    tfs_of_interest,
    limit_x = True
):

    fig = plt.figure(figsize=(7, 5))
    
    true_values = df[df["label"] == 1]
    false_values = df[df["label"] == 0]

    plt.hist(
        false_values[feature_col].dropna(),
        bins=50, alpha=0.3,
        color="#757575",
        label="False Scores"
    )
    
    y_cmap = plt.get_cmap("Dark2")
    
    for x, tf in enumerate(tfs_of_interest):
        
        
        true_tfs = true_values[true_values["source_id"] == tf]
        
        percent_total_true_edges = len(true_tfs[feature_col]) / len(true_values[feature_col])
        
        nbins=max(10, math.ceil(50*percent_total_true_edges))
        
        plt.hist(
            true_tfs[feature_col].dropna(),
            bins=nbins, alpha=0.8,
            color=y_cmap.colors[x],
            label=tf
        )

    # set titles/labels on the same ax
    plt.title("Sliding window score distribution colored by TFs of interest", fontsize=12)
    plt.xlabel("Sliding Window Score", fontsize=12)
    plt.ylabel("Frequency", fontsize=12)
    if limit_x:
        plt.xlim(0, 1)

    fig.legend(
        loc="lower center",
        ncol=1,
        fontsize=10,
        bbox_to_anchor=(1.10, 0.60)
    )
    fig.tight_layout(rect=[0, 0, 1, 1])
    plt.show()

In [None]:
plot_tfs_of_interest(balanced_df, "sliding_window_score", tf_names, limit_x=False)

In [None]:
plot_true_false_feature_histogram(balanced_df, "sliding_window_score", limit_x=False)

Looking at the plot of only the values of the main True TFs, we can see that the three main distributions of True values stem from those TFs having distinct distributions and representing the majority of True edges.

Also, if we don't balance the True and False edges, we can see that the True values represent a small minority of the total number of scores. This shows that balancing makes the True and False scores appear more separated than they truly are.
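
A short calculation shows why the class ratio matters. The ROC curve is built from the true positive rate and false positive rate, which do not depend on how many True and False rows there are, but precision does. The rates and class sizes below are hypothetical values chosen only for illustration.

```python
def precision_from_rates(tpr, fpr, n_true, n_false):
    """Precision at a threshold with the given true/false positive rates."""
    true_positives = tpr * n_true
    false_positives = fpr * n_false
    return true_positives / (true_positives + false_positives)

# Hypothetical rates and class sizes, chosen only for illustration
tpr, fpr = 0.8, 0.2

balanced_precision = precision_from_rates(tpr, fpr, n_true=100, n_false=100)
imbalanced_precision = precision_from_rates(tpr, fpr, n_true=100, n_false=10_000)

print(f"Balanced (1:1):     precision = {balanced_precision:.3f}")
print(f"Imbalanced (1:100): precision = {imbalanced_precision:.3f}")
```

The same threshold with the same rates gives a much lower precision once the False rows outnumber the True rows, so an AUPRC computed on balanced data should be read as optimistic relative to the full dataset.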

In [None]:
plot_true_false_feature_histogram(sliding_window_score_df, "sliding_window_score", limit_x=False)

## Comparing the knockout ground truth to the ChIP-seq ground truth
Now I want to answer the following questions:

1. How many TFs and TGs are in the sliding window scores before merging with the knockout ground truth?
2. How many TFs and TGs are in the knockout ground truth?
3. What does the sliding window analysis look like if I use ChIP-seq data instead of knockout data?
    - How many TFs and TGs are in the ChIP-seq ground truth?
    - How many True and False TFs match with the ChIP-seq ground truth?
    - How many TGs are there for each TF in the ChIP-seq ground truth?
  
### 1. Number of TFs and TGs in the sliding window scores before merging with the knockout ground truth

In [None]:
raw_sliding_window_scores = pd.read_parquet(os.path.join(output_dir, "no_norm_sliding_window_tf_to_peak_score.parquet"), engine="pyarrow")
raw_sliding_window_scores

In [None]:
raw_sliding_window_scores.hist("sliding_window_score", bins=50, grid=False)

In [None]:
print(f"Number of TFs in the raw sliding window scores: {len(raw_sliding_window_scores['source_id'].drop_duplicates())}")

To find the number of putative TGs, we need to map the peaks to the closest gene TSS. We have the distances from each peak to nearby genes saved in `peaks_near_genes.parquet`, so we will use the data from that file to build our list of peak to TG targets.

We start by reading in the file and, for each peak, keeping the closest gene (the one with the highest `TSS_dist_score` value).
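
A minimal sketch of the "best gene per peak" selection on toy data is shown here. It uses `idxmax` to pick the whole row with the highest score for each peak, as an alternative to sorting and taking the first row per group. The peak and gene names are made up for illustration.

```python
import pandas as pd

# Toy peak-to-gene table with hypothetical IDs
toy_peaks = pd.DataFrame({
    "peak_id":        ["peak1", "peak1", "peak2", "peak2", "peak3"],
    "target_id":      ["GeneA", "GeneB", "GeneB", "GeneC", "GeneC"],
    "TSS_dist_score": [0.2, 0.9, 0.7, 0.4, 0.5],
})

# Row label of the highest TSS_dist_score within each peak
best_row_labels = toy_peaks.groupby("peak_id")["TSS_dist_score"].idxmax()

# Keep those rows, then reduce to the peak -> closest gene mapping
toy_closest_gene = (
    toy_peaks.loc[best_row_labels, ["peak_id", "target_id"]]
    .reset_index(drop=True)
)
print(toy_closest_gene)
```

One reason to consider this form is that `GroupBy.first()` returns the first non-null value per column, so if a column contains missing values the result can mix values from different rows, whereas `idxmax` always selects a single complete row.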

In [None]:
peaks_near_genes_df = pd.read_parquet(os.path.join(output_dir, "peaks_near_genes.parquet"), engine="pyarrow")
closest_gene_to_peak_df = peaks_near_genes_df.sort_values("TSS_dist_score", ascending=False).groupby("peak_id").first()
closest_gene_to_peak_df = closest_gene_to_peak_df[["target_id"]].reset_index()
closest_gene_to_peak_df

Now we can merge the sliding window scores with the closest peak to TG DataFrame to add the closest gene as the target for each row in the sliding window scores.

In [None]:
sliding_window_with_targets = pd.merge(raw_sliding_window_scores, closest_gene_to_peak_df, on=["peak_id"], how="left")

sliding_window_tf_tg = sliding_window_with_targets[["source_id", "target_id", "sliding_window_score"]].drop_duplicates()


In [None]:
num_tfs = sliding_window_tf_tg["source_id"].nunique()
num_tgs = sliding_window_tf_tg["target_id"].nunique()

tf_tg_edges = sliding_window_tf_tg[["source_id", "target_id"]].drop_duplicates()

print(f"Number of TFs: {num_tfs:,}")
print(f"Number of TGs: {num_tgs:,}")
print(f"Number of TF-TG Edges: {len(tf_tg_edges):,}")

### Finding the number of TFs and TGs in the ground truth

#### RN115 LOGOF ESCAPE:

In [None]:
rn115_ko_ground_truth = pd.read_csv("/gpfs/Labs/Uzun/DATA/PROJECTS/2024.SC_MO_TRN_DB.MIRA/REPOSITORY/CURRENT/REFERENCE_NETWORKS/RN115_LOGOF_ESCAPE_Mouse_ESC.tsv", sep="\t")
rn115_ko_ground_truth = rn115_ko_ground_truth[["Source", "Target"]].rename(columns={"Source":"source_id", "Target":"target_id"})
print(f"Number of TFs: {rn115_ko_ground_truth['source_id'].nunique():,}")
print(f"Number of TGs: {rn115_ko_ground_truth['target_id'].nunique():,}")
print(f"Number of Edges: {len(rn115_ko_ground_truth):,}")

In [None]:
rn115_ko_ground_truth.groupby("source_id").count()

In [None]:
sliding_window_tf_tg.groupby("source_id")["sliding_window_score"].count()

In [None]:
sliding_window_tf_tg.groupby("target_id")["sliding_window_score"].count()

In [None]:
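merged_edges = pd.merge(sliding_window_tf_tg, rn115_ko_ground_truth, on=["source_id", "target_id"], how="inner")
merged_targets = pd.merge(sliding_window_tf_tg, rn115_ko_ground_truth, on="target_id", how="inner")  # assumed definition (missing from the notebook): merging on target only gives the source_id_x / source_id_y columns used below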


In [None]:
merged_edges

In [None]:
merged_targets[merged_targets["target_id"] == "Sfpi1"]

In [None]:
merged_targets[merged_targets["target_id"] == "Esx1"]

In [None]:
merged_targets["target_id"].drop_duplicates()

In [None]:
same_edges = merged_targets[merged_targets["source_id_x"] == merged_targets["source_id_y"]]

In [None]:
same_edges

In [None]:
same_edges.hist("sliding_window_score", bins=150, grid=False)

#### RN111 BEELINE ChIP-seq:

In [None]:
from grn_inference.utils import read_ground_truth
rn111_chipseq_ground_truth = read_ground_truth("/gpfs/Labs/Uzun/DATA/PROJECTS/2024.SC_MO_TRN_DB.MIRA/REPOSITORY/CURRENT/REFERENCE_NETWORKS/RN111_ChIPSeq_BEELINE_Mouse_ESC.tsv")
rn111_chipseq_ground_truth['source_id'] = rn111_chipseq_ground_truth['source_id'].str.capitalize()
rn111_chipseq_ground_truth['target_id'] = rn111_chipseq_ground_truth['target_id'].str.capitalize()
print(f"Number of TFs: {rn111_chipseq_ground_truth['source_id'].nunique():,}")
print(f"Number of TGs: {rn111_chipseq_ground_truth['target_id'].nunique():,}")
print(f"Number of Edges: {len(rn111_chipseq_ground_truth):,}")

Let's see how many of the TFs and TGs are shared between the sliding window scores and the ground truths.
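
Shared TFs and shared TGs do not guarantee shared edges, since an edge only matches when both the TF and the TG agree in the same row. The toy sketch below flags matching edges with a left merge and `indicator=True`; the TF and gene names and the `in_ground_truth` column are hypothetical.

```python
import pandas as pd

# Toy inferred edges and toy ground truth with hypothetical names
toy_inferred = pd.DataFrame({
    "source_id": ["TF_A", "TF_A", "TF_B"],
    "target_id": ["Gene1", "Gene2", "Gene1"],
})
toy_truth = pd.DataFrame({
    "source_id": ["TF_A", "TF_B", "TF_C"],
    "target_id": ["Gene2", "Gene3", "Gene1"],
})

# A left merge keeps every inferred edge; `_merge` records whether it was found
toy_overlap = toy_inferred.merge(
    toy_truth.drop_duplicates(),
    on=["source_id", "target_id"],
    how="left",
    indicator=True,
)
toy_overlap["in_ground_truth"] = toy_overlap["_merge"] == "both"
print(toy_overlap[["source_id", "target_id", "in_ground_truth"]])
```

Here both inferred TFs and both inferred TGs occur somewhere in the toy ground truth, yet only one of the three edges is an exact TF-TG match.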

In [None]:
# Sliding Window vs RN115 LOGOF
sliding_window_vs_rn115_tfs = tf_tg_edges[tf_tg_edges["source_id"].isin(rn115_ko_ground_truth["source_id"])]["source_id"].drop_duplicates()
sliding_window_vs_rn115_tgs = tf_tg_edges[tf_tg_edges["target_id"].isin(rn115_ko_ground_truth["target_id"])]["target_id"].drop_duplicates()
sliding_window_vs_rn115_edges = pd.merge(tf_tg_edges, rn115_ko_ground_truth, on=["source_id", "target_id"], how="inner")

# Sliding Window vs RN111 ChIP-seq
sliding_window_vs_rn111_tfs = tf_tg_edges[tf_tg_edges["source_id"].isin(rn111_chipseq_ground_truth["source_id"])]["source_id"].drop_duplicates()
sliding_window_vs_rn111_tgs = tf_tg_edges[tf_tg_edges["target_id"].isin(rn111_chipseq_ground_truth["target_id"])]["target_id"].drop_duplicates()
sliding_window_vs_rn111_edges = pd.merge(tf_tg_edges, rn111_chipseq_ground_truth, on=["source_id", "target_id"], how="inner")

print(f"Sliding Window vs RN115 LOGOF Knockout Ground Truth")
print(f"  - Shared TFs: {len(sliding_window_vs_rn115_tfs):,} / {num_tfs:,}")
print(f"  - Shared TGs: {len(sliding_window_vs_rn115_tgs):,} / {num_tgs:,}")
print(f"  - Shared Edges: {len(sliding_window_vs_rn115_edges):,} / {len(tf_tg_edges):,}")

print(f"\nSliding Window vs RN111 BEELINE ChIP-seq Ground Truth")
print(f"  - Shared TFs: {len(sliding_window_vs_rn111_tfs):,} / {num_tfs:,}")
print(f"  - Shared TGs: {len(sliding_window_vs_rn111_tgs):,} / {num_tgs:,}")
print(f"  - Shared Edges: {len(sliding_window_vs_rn111_edges):,} / {len(tf_tg_edges):,}")

In [None]:
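# Row-wise edge match; DataFrame.isin(DataFrame) aligns on index and column labels rather than comparing TF-TG pairs
tf_tg_edges.merge(rn115_ko_ground_truth.drop_duplicates(), on=["source_id", "target_id"], how="inner")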

In [None]:
rn111_chipseq_ground_truth["source_id"].drop_duplicates()

In [None]:
tf_tg_edges[tf_tg_edges["source_id"].isin(rn115_ko_ground_truth["source_id"])]["source_id"].drop_duplicates()

## Clustering TF-TG Edges

To see how the sliding window score varies between TF-TG edges, we can create a clustermap of the average sliding window score for each TF-TG edge. The TGs will be along the x-axis and the TFs along the y-axis, with the mean sliding window score as values. 

If there are clear score differences between vertical clusters, this indicates that differences in the sliding window scores are driven by TGs. 

If there are clear score differences between horizontal clusters, this indicates that differences in the sliding window scores are driven by TFs.  

In [None]:
# Mean sliding window score per TF-TG pair
sliding_window_tf_tg_pivot = (
    sliding_window_tf_tg
    .groupby(["source_id", "target_id"])[["sliding_window_score"]]
    .mean()
    .rename(columns={"sliding_window_score": "mean_score"})
    .sort_values("mean_score", ascending=False)
    ).reset_index().pivot(index="source_id", columns="target_id", values="mean_score")
ax = sns.clustermap(sliding_window_tf_tg_pivot, linewidth=0)
plt.show()

We can now map the TF and TG clusters back to the main dataframe.

In [None]:
from scipy.cluster.hierarchy import fcluster

# Extract linkage matrices
row_linkage = ax.dendrogram_row.linkage
col_linkage = ax.dendrogram_col.linkage

# Use existing linkage
tf_clusters = fcluster(row_linkage, t=5, criterion='maxclust')  # cluster TFs
tg_clusters = fcluster(col_linkage, t=5, criterion='maxclust')  # cluster TGs

# Create a DataFrame mapping labels to cluster IDs
tf_cluster_df = pd.DataFrame({
    'source_id': sliding_window_tf_tg_pivot.index,
    'tf_cluster': tf_clusters
})

tg_cluster_df = pd.DataFrame({
    'target_id': sliding_window_tf_tg_pivot.columns,
    'tg_cluster': tg_clusters
})

# Merge the TF and TG cluster dataframes with the sliding window TF-TG dataframe
sliding_window_tf_tg_clustered = pd.merge(sliding_window_tf_tg, tf_cluster_df, on="source_id", how="left").merge(tg_cluster_df, on="target_id", how='left')
sliding_window_tf_tg_clustered

Now that we have each TF and TG labeled by their cluster, we can see which TF-TG clusters have the highest average sliding window scores.

In [None]:
sliding_window_by_cluster = sliding_window_tf_tg_clustered.groupby(["tf_cluster", "tg_cluster"])["sliding_window_score"].mean().reset_index()
sliding_window_by_cluster_pivot = sliding_window_by_cluster.pivot(index="tf_cluster", columns="tg_cluster", values="sliding_window_score")
sns.heatmap(sliding_window_by_cluster_pivot, linewidth=0)
plt.show()

This shows which groups have the highest and lowest sliding window scores. In this case, rows where the TF is in TF cluster 3 and the TG is in TG cluster 3 have the highest sliding window scores. We can plot the TF-TG edges in this cluster pair as a heatmap to see which TFs and TGs have the highest scores.

In [None]:
cluster_tf3_tg3 = sliding_window_tf_tg_clustered[
    (sliding_window_tf_tg_clustered["tf_cluster"] == 3) &
    (sliding_window_tf_tg_clustered["tg_cluster"] == 3)
    ]
cluster_tf3_tg3_edges = (
    cluster_tf3_tg3[["source_id", "target_id", "sliding_window_score"]]
    .groupby(["source_id", "target_id"])[["sliding_window_score"]]
    .mean()
    .rename(columns={"sliding_window_score":"mean_score"})
    .reset_index()
    )
cluster_tf3_tg3_edges_pivot = cluster_tf3_tg3_edges.pivot(index="source_id", columns="target_id", values="mean_score")
sns.heatmap(cluster_tf3_tg3_edges_pivot, linewidths=0)
plt.show()

In [None]:
ahctf1_sliding_window_scores = sliding_window_with_targets[sliding_window_with_targets["source_id"] == "Ahctf1"]
ahctf1_sliding_window_scores

In [None]:
# Mean sliding window score per TF-TG pair
ahctf1_mean_tf_tg_scores = (
    ahctf1_sliding_window_scores
    .groupby(["source_id", "target_id"])[["sliding_window_score"]]
    .mean()
    .rename(columns={"sliding_window_score": "mean_score"})
    )

# Number of sliding window scores per TF-TG pair
ahctf1_count_tf_tg_scores = (
    ahctf1_sliding_window_scores
    .groupby(["source_id", "target_id"])
    .count()
    .drop(columns="peak_id")
    .rename(columns={"sliding_window_score": "num_scores"})
    )

ahctf1_agg = pd.merge(ahctf1_mean_tf_tg_scores, ahctf1_count_tf_tg_scores, left_index=True, right_index=True, how='inner').reset_index()
ahctf1_agg

In [None]:
ahctf1_agg_pivot = ahctf1_agg.pivot(index="source_id", columns="target_id", values="mean_score")
ahctf1_agg_pivot

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(ahctf1_agg_pivot, cmap="hot")


In [None]:
ahctf1_count_tf_tg_scores

In [None]:
ahctf1_sliding_window_scores.hist("sliding_window_score", bins=50, grid=False)

In [None]:
ahctf1_scores_per_tg = (
    ahctf1_sliding_window_scores
    .groupby(["source_id", "target_id"])
    .count()
    .rename(columns={"sliding_window_score":"Number of peaks Per TF-TG pair"})
    )

fig = plt.figure(figsize=(8,5))
plt.hist(ahctf1_scores_per_tg["Number of peaks Per TF-TG pair"], color="blue", bins=50)
plt.title("Sliding Window Scores - Number of peaks for each target of Ahctf1")
plt.ylabel("Number of unique TGs", fontsize=12)
plt.tight_layout()
plt.show()
ahctf1_scores_per_tg.hist("Number of peaks Per TF-TG pair", bins=50, grid=False)

In [None]:
# Number of unique TGs per TF, indexed by TF so the bar plot below can use the index as x
num_tgs_per_tf_agg = sliding_window_with_targets.groupby("source_id")[["target_id"]].nunique()
num_tgs_per_tf_agg = num_tgs_per_tf_agg.sort_values("target_id", ascending=False)
num_tgs_per_tf_agg

In [None]:
fig = plt.figure(figsize=(15,5))
plt.bar(x=num_tgs_per_tf_agg.index, height=num_tgs_per_tf_agg["target_id"], color="blue")
plt.title("Sliding Window Scores - Number of TGs per TF")
plt.ylabel("Number of unique TGs", fontsize=12)
plt.xticks(rotation=55, fontsize=9)
plt.tight_layout()
plt.show()

---

## Testing the non-normalized sliding window scores using RN111 ChIP-seq as the ground truth

We used a knockout dataset as the ground truth for our predictions. Let's test whether the True / False score distributions separate when we use the RN111 ChIP-seq dataset as the ground truth instead.

### Label sliding window score edges using RN111
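
The labeling step can be sketched on toy data as follows. The helper `label_edges` is hypothetical: it compares upper-cased TF-TG pairs against a set of ground truth pairs and returns a new DataFrame, so the input edges are left untouched. All names and values are made up for illustration.

```python
import pandas as pd

def label_edges(edges, truth):
    """Hypothetical helper: label = 1 when the TF-TG pair is in the ground truth."""
    truth_pairs = set(zip(
        truth["source_id"].str.upper(),
        truth["target_id"].str.upper(),
    ))
    edge_pairs = zip(
        edges["source_id"].str.upper(),
        edges["target_id"].str.upper(),
    )
    # assign() returns a copy, so the caller's DataFrame is not modified
    return edges.assign(label=[int(pair in truth_pairs) for pair in edge_pairs])

# Toy inputs whose gene symbol casing differs between the two sources
toy_edges = pd.DataFrame({
    "source_id": ["Tf_a", "Tf_b"],
    "target_id": ["Gene1", "Gene2"],
    "sliding_window_score": [0.7, 0.3],
})
toy_truth = pd.DataFrame({"source_id": ["TF_A"], "target_id": ["GENE1"]})

toy_labeled = label_edges(toy_edges, toy_truth)
print(toy_labeled)
```

Comparing on upper-cased copies avoids having to rewrite the ID columns before labeling and re-capitalize them afterwards.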

In [None]:
rn111_chipseq_ground_truth

In [None]:
sliding_window_tf_tg

In [None]:

ground_truth_pairs = set(zip(
    rn111_chipseq_ground_truth["source_id"].str.upper(),
    rn111_chipseq_ground_truth["target_id"].str.upper()
))


sliding_window_tf_tg["source_id"] = sliding_window_tf_tg["source_id"].str.upper()
sliding_window_tf_tg["target_id"] = sliding_window_tf_tg["target_id"].str.upper()

def label_partition(df):
    df = df.copy()  # <-- avoids SettingWithCopyWarning
    tf_tg_tuples = list(zip(df["source_id"], df["target_id"]))
    df.loc[:, "label"] = [1 if pair in ground_truth_pairs else 0 for pair in tf_tg_tuples]
    return df

labeled_df = label_partition(sliding_window_tf_tg)

labeled_df["source_id"] = labeled_df["source_id"].str.capitalize()
labeled_df["target_id"] = labeled_df["target_id"].str.capitalize()
labeled_df

In [None]:
plot_true_false_feature_histogram(labeled_df, "sliding_window_score", limit_x=False)

### Evaluating the True / False sliding window scores by TF against RN111

Let's look at the number of TFs in the True vs False score distributions in RN111 compared to the RN115 knockout ground truth.

In [None]:
total_true_tfs = labeled_df[labeled_df["label"] == 1]["source_id"].unique()
total_false_tfs = labeled_df[labeled_df["label"] == 0]["source_id"].unique()

print(f"Total TFs in the True values: {len(total_true_tfs)}")
print(f"Total TFs in the False values: {len(total_false_tfs)}")

These are much better numbers - rather than 7 TFs in the True values, we have 75.

Let's again look at the number of True and False sliding window scores by TF.

In [None]:
true_sorted_agg_df = (
    labeled_df[labeled_df["label"] == 1][["source_id", "sliding_window_score"]]
    .groupby("source_id")
    .count()
    .sort_values(by="sliding_window_score", ascending=False)
    .reset_index()
    .rename(columns={"sliding_window_score":"num_scores"})
    )

false_sorted_agg_df = (
    labeled_df[labeled_df["label"] == 0][["source_id", "sliding_window_score"]]
    .groupby("source_id")
    .count()
    .sort_values(by="sliding_window_score", ascending=False)
    .reset_index()
    .rename(columns={"sliding_window_score":"num_scores"})
    )

#### Number of False sliding window scores by TF

In [None]:
fig = plt.figure(figsize=(15,5))
plt.bar(x=false_sorted_agg_df["source_id"], height=false_sorted_agg_df["num_scores"], color="blue")
plt.title("Number of False sliding window scores by TF")
plt.ylabel("Number of False \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=2)
plt.tight_layout()
plt.show()

#### Number of True Sliding Window Scores by TF

In [None]:
fig = plt.figure(figsize=(12,3))
plt.bar(x=true_sorted_agg_df["source_id"], height=true_sorted_agg_df["num_scores"], color="blue")
plt.title("Number of True sliding window scores by TF")
plt.ylabel("Number of True \nsliding window scores")
plt.xticks(rotation=55, fontsize=6)
plt.tight_layout()
plt.show()

#### Number of True vs False scores per TF
Let's again compare the number of True vs False scores for TFs that have both True and False scores.

In [None]:
grouped_values = pd.merge(true_sorted_agg_df, false_sorted_agg_df, on="source_id", how="inner")
grouped_values = grouped_values.rename(columns={
    "num_scores_x": "Num True Scores",
    "num_scores_y": "Num False Scores"
})
grouped_values

In [None]:
grouped_values.plot.bar(x="source_id", stacked=False, figsize=(12,6))
plt.title("Number of True and False sliding window scores for TFs with True values")
plt.ylabel("Number of \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=8)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
labeled_df_balanced = balance_dataset(labeled_df)

In [None]:
plot_true_false_feature_histogram(labeled_df_balanced, "sliding_window_score", limit_x=False)

In [None]:
balanced_true_tfs = labeled_df_balanced[labeled_df_balanced["label"] == 1]["source_id"].unique()
balanced_false_tfs = labeled_df_balanced[labeled_df_balanced["label"] == 0]["source_id"].unique()

print(f"Total TFs in the True values: {len(balanced_true_tfs)}")
print(f"Total TFs in the False values: {len(balanced_false_tfs)}")

In [None]:
balanced_true_sorted_agg_df = (
    labeled_df_balanced[labeled_df_balanced["label"] == 1][["source_id", "sliding_window_score"]]
    .groupby("source_id")
    .count()
    .sort_values(by="sliding_window_score", ascending=False)
    .reset_index()
    .rename(columns={"sliding_window_score":"num_scores"})
    )

balanced_false_sorted_agg_df = (
    labeled_df_balanced[labeled_df_balanced["label"] == 0][["source_id", "sliding_window_score"]]
    .groupby("source_id")
    .count()
    .sort_values(by="sliding_window_score", ascending=False)
    .reset_index()
    .rename(columns={"sliding_window_score":"num_scores"})
)

In [None]:
fig = plt.figure(figsize=(15,5))
plt.bar(x=balanced_false_sorted_agg_df["source_id"], height=balanced_false_sorted_agg_df["num_scores"], color="blue")
plt.title("Number of False sliding window scores by TF")
plt.ylabel("Number of False \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=2)
plt.tight_layout()
plt.show()

In [None]:
fig = plt.figure(figsize=(15,5))
plt.bar(x=balanced_true_sorted_agg_df["source_id"], height=balanced_true_sorted_agg_df["num_scores"], color="blue")
plt.title("Number of True sliding window scores by TF")
plt.ylabel("Number of True \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=10)
plt.tight_layout()
plt.show()

In [None]:
balanced_grouped_values = pd.merge(balanced_true_sorted_agg_df, balanced_false_sorted_agg_df, on="source_id", how="inner")
balanced_grouped_values = balanced_grouped_values.rename(columns={
    "num_scores_x": "Num True Scores",
    "num_scores_y": "Num False Scores"
})
balanced_grouped_values

In [None]:
balanced_grouped_values.plot.bar(x="source_id", stacked=False, figsize=(12,6), color=["#4195df","#747474"], width=0.8)
plt.title("Number of True and False sliding window scores for TFs with True values")
plt.ylabel("Number of \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=8)
plt.legend()
plt.tight_layout()
plt.show()