## Plotting Sliding Window Score AUROC and AUPRC

When evaluating the feature score distributions of True / False values after adding MIRA RP scores to our combined DataFrame, we noticed that the `sliding_window_score` feature has a good separation in score distribution between True and False values. The True and False values are from the **RN115 LOGOF ESCAPE** mESC knockout dataset.

In [None]:
!hostnamectl

In [None]:
from IPython.display import Image, display
Image(
    "/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/figures/mm10/DS011/xgboost_feature_score_hist_by_label.png", 
    width=800, 
    height=800
    )

Now, we are interested in assessing the AUROC and AUPRC for just the sliding window scores.

I found that the large number of scores equal to 1 was due to `clip_and_normalize_log1p_pandas()`. This function clips scores below the 5th percentile and above the 95th percentile by setting them equal to the respective threshold, and the min-max normalization that follows rescales the distribution to the range 0-1. As a result, many scores pile up at exactly 0 (the lower threshold) and exactly 1 (the upper threshold).
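
The pile-up can be reproduced on synthetic data. The sketch below is not the project's `clip_and_normalize_log1p_pandas()` implementation, which is not shown in this notebook. `clip_and_minmax` is a hypothetical stand-in that assumes the steps are log1p, clipping at the 5th and 95th percentiles, and min-max scaling between those two thresholds.

```python
import numpy as np
import pandas as pd

def clip_and_minmax(scores: pd.Series, lower_q: float = 0.05, upper_q: float = 0.95) -> pd.Series:
    """Hypothetical stand-in: clip to the given quantiles, then min-max scale to [0, 1]."""
    lo, hi = scores.quantile(lower_q), scores.quantile(upper_q)
    clipped = scores.clip(lower=lo, upper=hi)
    return (clipped - lo) / (hi - lo)

# Synthetic, right-skewed raw scores (an assumption, not the real data)
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

normalized = clip_and_minmax(np.log1p(raw))

# Every score clipped to a threshold lands exactly on 0 or 1 after scaling
frac_at_zero = (normalized == 0).mean()
frac_at_one = (normalized == 1).mean()
print(f"Fraction at 0: {frac_at_zero:.3f}, fraction at 1: {frac_at_one:.3f}")
```

Under these assumptions, roughly 5% of the scores end up at each boundary regardless of the shape of the raw distribution, because the clipping is defined by percentiles.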

First, we will look at the AUROC and AUPRC as it stands now.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

In [None]:
inferred_df = pd.read_parquet("/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/output/DS011_mESC/DS011_mESC_sample1/labeled_inferred_grn.parquet", engine="pyarrow")
sliding_window_score_df = inferred_df[["source_id", "peak_id", "target_id", "sliding_window_score", "label"]]
sliding_window_score_df

We first need to balance the number of True and False rows so that class imbalance does not skew the ROC and precision-recall curves as much.

In [None]:
def balance_dataset(df):
    true_rows = df[df["label"] == 1]
    false_rows = df[df["label"] == 0]

    print("Before Balancing:")
    print(f"  - Number of True values: {len(true_rows)}")
    print(f"  - Number of False values: {len(false_rows)}")

    min_rows = min(len(true_rows), len(false_rows))
    print(f"\nSubsampling down to {min_rows} rows")

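    # Fixed seed so the subsample (and the curves built from it) is reproducible
    true_rows_sampled = true_rows.sample(min_rows, random_state=42)
    false_rows_sampled = false_rows.sample(min_rows, random_state=42)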

    balanced_df = pd.concat([true_rows_sampled, false_rows_sampled])

    balanced_true_rows = balanced_df[balanced_df["label"] == 1]
    balanced_false_rows = balanced_df[balanced_df["label"] == 0]

    print("\nAfter Balancing:")
    print(f"  - Number of True values: {len(balanced_true_rows)}")
    print(f"  - Number of False values: {len(balanced_false_rows)}")

    return balanced_df

In [None]:
balanced_df = balance_dataset(sliding_window_score_df)

Let's look at the True / False sliding window score histogram.

In [None]:
def plot_true_false_feature_histogram(
    df,
    feature_col,
    limit_x = True
):

    fig = plt.figure(figsize=(5, 4))
    
    true_values = df[df["label"] == 1]
    false_values = df[df["label"] == 0]

    plt.hist(
        false_values[feature_col].dropna(),
        bins=50, alpha=0.7,
        color='#1682b1', edgecolor="#032b5f",
        label="False",
    )
    plt.hist(
        true_values[feature_col].dropna(),
        bins=50, alpha=0.7,
        color="#cb5f17", edgecolor="#b13301",
        label="True",
    )

    # set titles/labels on the same ax
    plt.title(feature_col, fontsize=14)
    plt.xlabel(feature_col, fontsize=14)
    plt.ylabel("Frequency", fontsize=14)
    if limit_x:
        plt.xlim(0, 1)

    fig.legend(
        loc="lower center",
        ncol=2,
        fontsize=14,
        bbox_to_anchor=(0.5, -0.02)
    )
    fig.tight_layout(rect=[0, 0.05, 1, 1])
    plt.show()

In [None]:
plot_true_false_feature_histogram(balanced_df, "sliding_window_score")

In [None]:
def plot_auroc_auprc(inferred_df):
    # Subset the relevant columns
    df = inferred_df[["sliding_window_score", "label"]].dropna()

    # Get true labels and predicted scores
    y_true = df["label"]
    y_scores = df["sliding_window_score"]

    # --- ROC Curve ---
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)

    # --- PR Curve ---
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    avg_precision = average_precision_score(y_true, y_scores)

    # --- Plot ---
    plt.figure(figsize=(10, 4))

    # ROC
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, label=f"AUROC = {roc_auc:.2f}")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()

    # PR
    plt.subplot(1, 2, 2)
    plt.plot(recall, precision, label=f"AUPRC = {avg_precision:.2f}")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-Recall Curve")
    plt.legend()

    plt.tight_layout()
    plt.show()
    
plot_auroc_auprc(balanced_df)

As expected, a large number of scores sit at exactly 0 or 1. These are artifacts of clipping: scores below the 5th percentile threshold were set to 0, and scores above the 95th percentile threshold were set to 1. Let's see what happens if we filter out sliding window scores with values of 0 or 1.

In [None]:
filtered_balanced_df = balanced_df[
    (balanced_df["sliding_window_score"] > 0) &
    (balanced_df["sliding_window_score"] < 1)
]
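
Before reading the filtered plots, it can help to know how many rows the filter removes from each class. The following is a minimal sketch on a toy DataFrame; `summarize_boundary_scores` and `toy_df` are hypothetical names introduced for illustration, and the column names are assumed to match `balanced_df`.

```python
import pandas as pd

def summarize_boundary_scores(df, score_col="sliding_window_score", label_col="label"):
    """Count, per label, how many scores sit exactly at 0 or 1."""
    at_boundary = df[score_col].isin([0.0, 1.0])
    summary = (
        df.assign(at_boundary=at_boundary)
        .groupby(label_col)["at_boundary"]
        .agg(n_rows="size", n_boundary="sum")
    )
    summary["frac_boundary"] = summary["n_boundary"] / summary["n_rows"]
    return summary

# Toy data standing in for balanced_df
toy_df = pd.DataFrame({
    "sliding_window_score": [0.0, 0.2, 1.0, 1.0, 0.5, 0.0, 0.7, 0.9],
    "label":                [0,   0,   0,   0,   1,   1,   1,   1],
})

boundary_summary = summarize_boundary_scores(toy_df)
print(boundary_summary)
```

If the boundary fraction differs a lot between True and False rows, the filtered data is no longer balanced, which is worth keeping in mind when comparing the curves before and after filtering.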

In [None]:
plot_true_false_feature_histogram(filtered_balanced_df, "sliding_window_score")

In [None]:
plot_auroc_auprc(filtered_balanced_df)

---

### Investigating the bimodal distribution of True scores

In the sliding window score distribution, the False scores have a single broad peak around 0.6, while the True scores are bimodal with a lower peak around 0.375 - 0.425 and a higher peak around 0.8 - 0.9. We want to determine what is causing the True values to have multiple peaks.
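
The peak locations above were read off the histogram by eye. One way to estimate them numerically is to smooth the scores with a kernel density estimate and pick out prominent local maxima. This is only a sketch on synthetic data: `find_score_peaks`, the grid size, and the 10% prominence cutoff are assumptions chosen for illustration, not values used elsewhere in this analysis.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

def find_score_peaks(scores, grid_size=512, min_rel_prominence=0.1):
    """Return the x-positions of prominent modes in a smoothed score density."""
    scores = np.asarray(scores, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)
    density = gaussian_kde(scores)(grid)
    # Keep only maxima whose prominence is a sizeable fraction of the tallest peak
    peak_idx, _ = find_peaks(density, prominence=min_rel_prominence * density.max())
    return grid[peak_idx]

# Synthetic stand-in for the True scores: two modes near 0.4 and 0.85
rng = np.random.default_rng(0)
toy_scores = np.concatenate([
    rng.normal(0.40, 0.03, size=2000),
    rng.normal(0.85, 0.04, size=2000),
])
toy_scores = toy_scores[(toy_scores >= 0) & (toy_scores <= 1)]

peaks = find_score_peaks(toy_scores)
print(peaks)
```

On the real data this could be called as `find_score_peaks(true_scores.dropna())`, where `true_scores` is the `sliding_window_score` column of the True rows. Scores piled up at 0 and 1 would need to be filtered first, since `find_peaks` does not report maxima at the ends of the grid.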

First, let's look at the number of TFs in the True and False scores.

In [None]:
total_true_tfs = sliding_window_score_df[sliding_window_score_df["label"] == 1]["source_id"].unique()
total_false_tfs = sliding_window_score_df[sliding_window_score_df["label"] == 0]["source_id"].unique()

print(f"Total TFs in the True values: {len(total_true_tfs)}")
print(f"Total TFs in the False values: {len(total_false_tfs)}")

There are many more TFs in the False values. How many edges does each TF have for the True and False groups?

To start answering this question, we can count the number of sliding window scores per TF in each group.

In [None]:

true_sorted_agg_df = (
    sliding_window_score_df[sliding_window_score_df["label"] == 1][["source_id", "sliding_window_score"]]
    .groupby("source_id")
    .count()
    .sort_values(by="sliding_window_score", ascending=False)
    .reset_index()
    .rename(columns={"sliding_window_score":"num_scores"})
    )

false_sorted_agg_df = (
    sliding_window_score_df[sliding_window_score_df["label"] == 0][["source_id", "sliding_window_score"]]
    .groupby("source_id")
    .count()
    .sort_values(by="sliding_window_score", ascending=False)
    .reset_index()
    .rename(columns={"sliding_window_score":"num_scores"})
    )


Now, we can plot the number of scores per TF.

#### Number of False sliding window scores by TF

In [None]:
fig = plt.figure(figsize=(15,5))
plt.bar(x=false_sorted_agg_df["source_id"], height=false_sorted_agg_df["num_scores"], color="blue")
plt.title("Number of False sliding window scores by TF")
plt.ylabel("Number of False \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=9)
plt.tight_layout()
plt.show()

#### Number of True sliding window scores by TF

In [None]:
fig = plt.figure(figsize=(5,3))
plt.bar(x=true_sorted_agg_df["source_id"], height=true_sorted_agg_df["num_scores"], color="blue")
plt.title("Number of True sliding window scores by TF")
plt.ylabel("Number of True \nsliding window scores")
plt.xticks(rotation=55, fontsize=10)
plt.tight_layout()
plt.show()

It looks like the True values make up a relatively small proportion of the scores compared with the False values. Let's look at the number of scores for TFs with both True and False values.

In [None]:
grouped_values = pd.merge(true_sorted_agg_df, false_sorted_agg_df, on="source_id", how="inner")
grouped_values = grouped_values.rename(columns={
    "num_scores_x": "Num True Scores",
    "num_scores_y": "Num False Scores"
})
grouped_values
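
As an alternative to building two aggregated tables and merging them, the same per-TF counts can be produced in one step with `pd.crosstab`. This is a sketch on toy data; `toy_edges` and its TF assignments are placeholders. Note that `crosstab` counts rows, whereas `.count()` above skips rows with a missing `sliding_window_score`.

```python
import pandas as pd

# Toy edge table standing in for sliding_window_score_df
toy_edges = pd.DataFrame({
    "source_id": ["SOX2", "SOX2", "SOX2", "TCF3", "TCF3", "KLF4"],
    "label":     [1,      0,      0,      1,      0,      0],
})

# One row per TF, one column per label value
counts = (
    pd.crosstab(toy_edges["source_id"], toy_edges["label"])
    .rename(columns={1: "Num True Scores", 0: "Num False Scores"})
)

# Keep only TFs that have both True and False edges, mirroring the inner merge
counts_with_both = counts[(counts > 0).all(axis=1)]
print(counts_with_both)
```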

In [None]:
grouped_values.plot.bar(x="source_id", stacked=False, figsize=(7,4))
plt.title("Number of True and False sliding window scores for TFs with True values")
plt.ylabel("Number of \nsliding window scores", fontsize=12)
plt.xticks(rotation=55, fontsize=10)
plt.legend()
plt.tight_layout()
plt.show()

There are many more scores for the False values. Let's look to see if there are any differences in the True vs False score distributions for each TF.

In [None]:
import math

def plot_tf_score_distributions(df, tf_name_list, score_col, title):
    ncols = 4
    nrows = math.ceil(len(tf_name_list) / ncols)
    # squeeze=False keeps ax two-dimensional even when there is only one row of plots
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False)
    fig.set_figwidth(ncols * 3)
    fig.set_figheight(nrows * 3)

    for i, tf_name in enumerate(tf_name_list):
        plot_row = i // ncols
        plot_col = i % ncols
        
        tf_scores = df[df["source_id"] == tf_name][score_col]
        
        ax[plot_row, plot_col].hist(tf_scores, bins=25)
        ax[plot_row, plot_col].set_title(tf_name, fontsize=10)
        ax[plot_row, plot_col].tick_params(axis='x', labelsize=9)
        ax[plot_row, plot_col].tick_params(axis='y', labelsize=9)
        ax[plot_row, plot_col].set_xbound((0, 1))

    # Hide the extra plots
    n_tfs = len(tf_name_list)
    n_figs = ncols * nrows

    for i in range(n_tfs, n_figs):
        row = i // ncols
        col = i % ncols
        ax[row, col].axis("off")
        
    plt.suptitle(title)
    plt.tight_layout(rect=[0.05, 0.05, 1, 1])
    
    fig.text(0.5, 0.04, 'Sliding Window Score', ha='center', fontsize=12)
    fig.text(0.04, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=12)
    
    plt.show()

#### Distribution of True sliding window scores per TF

In [None]:
true_df = sliding_window_score_df[sliding_window_score_df["label"] == 1]
tfs_with_true_scores = true_df["source_id"].unique()
plot_tf_score_distributions(
    df=true_df, 
    tf_name_list=tfs_with_true_scores, 
    score_col="sliding_window_score",
    title="True sliding window score distributions by TF"
    )

#### Distribution of False sliding window scores per TF

In [None]:
false_df = sliding_window_score_df[sliding_window_score_df["label"] == 0]
plot_tf_score_distributions(
    df=false_df, 
    tf_name_list=tfs_with_true_scores, 
    score_col="sliding_window_score",
    title="False sliding window score distributions by TF"
    )

The True and False scores appear to have similar distributions for each of the TFs.
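
The visual impression of similarity can be complemented with a two-sample Kolmogorov-Smirnov test per TF. The sketch below uses `scipy.stats.ks_2samp` on toy data; `ks_by_tf`, the `min_n` cutoff of 10, and the `TF_A`/`TF_B`/`TF_C` names are assumptions for illustration. Edges that share a TF are unlikely to be independent, so the p-values are better treated as descriptive than as a formal test.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_by_tf(true_df, false_df, tf_names, score_col="sliding_window_score", min_n=10):
    """Compare True vs. False score distributions per TF with a two-sample KS test."""
    rows = []
    for tf in tf_names:
        t = true_df.loc[true_df["source_id"] == tf, score_col].dropna()
        f = false_df.loc[false_df["source_id"] == tf, score_col].dropna()
        if len(t) < min_n or len(f) < min_n:
            continue  # too few scores for a meaningful comparison
        res = ks_2samp(t, f)
        rows.append({
            "source_id": tf, "n_true": len(t), "n_false": len(f),
            "ks_stat": res.statistic, "p_value": res.pvalue,
        })
    return pd.DataFrame(rows)

# Toy data: TF_A has matching distributions, TF_B does not, TF_C has too few True scores
rng = np.random.default_rng(0)
toy_true = pd.DataFrame({
    "source_id": ["TF_A"] * 200 + ["TF_B"] * 200 + ["TF_C"] * 3,
    "sliding_window_score": np.concatenate([
        rng.uniform(0.0, 1.0, 200), rng.uniform(0.5, 1.0, 200), rng.uniform(0.0, 1.0, 3),
    ]),
})
toy_false = pd.DataFrame({
    "source_id": ["TF_A"] * 200 + ["TF_B"] * 200 + ["TF_C"] * 200,
    "sliding_window_score": np.concatenate([
        rng.uniform(0.0, 1.0, 200), rng.uniform(0.0, 0.5, 200), rng.uniform(0.0, 1.0, 200),
    ]),
})

ks_table = ks_by_tf(toy_true, toy_false, ["TF_A", "TF_B", "TF_C"])
print(ks_table)
```

With the notebook's data, the call would be `ks_by_tf(true_df, false_df, tfs_with_true_scores)`. A small KS statistic for a TF supports the observation that its True and False scores are distributed similarly.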

For the True scores, only the TFs **SOX2**, **SOX9**, and **TCF3** have more than 10 scores. These three TFs appear to drive the separate peaks that we see in the True score distribution. Let's see how their True and False score distributions compare when plotted together.

>Note: The number of True and False scores per TF is balanced for the following plot; otherwise, the False edges overwhelm the True edges.

In [None]:
import random
import math

def plot_true_false_tf_score_distributions(true_df, false_df, tf_name_list, score_col, title):
    ncols = min(len(tf_name_list), 4)
    nrows = math.ceil(len(tf_name_list) / ncols)
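    # squeeze=False always returns an array of Axes, so .flatten() also works for a single TF
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False)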
    fig.set_figwidth(ncols * 3)
    fig.set_figheight(nrows * 3)
    
    ax = ax.flatten()

    for i, tf_name in enumerate(tf_name_list):
        
        true_tf_scores = true_df[true_df["source_id"] == tf_name][score_col]
        false_tf_scores = false_df[false_df["source_id"] == tf_name][score_col]
        
        min_scores = min(len(true_tf_scores), len(false_tf_scores))
        true_tf_scores = true_tf_scores.sample(min_scores)
        false_tf_scores = false_tf_scores.sample(min_scores)
        
        ax[i].hist(true_tf_scores, bins=25, alpha=0.7, label="True Scores")
        ax[i].hist(false_tf_scores, bins=25, alpha=0.7, label="False Scores")
        
        ax[i].set_title(tf_name, fontsize=10)
        ax[i].tick_params(axis='x', labelsize=9)
        ax[i].tick_params(axis='y', labelsize=9)
        ax[i].set_xbound((0, 1))

    for j in range(len(tf_name_list), len(ax)):
        ax[j].axis("off")
    
    handles, labels = ax[0].get_legend_handles_labels()
    fig.legend(
        handles,
        labels,
        loc='upper center',
        bbox_to_anchor=(0.5, 1.0),
        ncol=2,
        fontsize=10,
        frameon=False
    )
        
    plt.suptitle(title, y=1.05)
    plt.tight_layout(rect=[0.05, 0.05, 1, 0.98])
        
    fig.text(0.5, 0.04, 'Sliding Window Score', ha='center', fontsize=11)
    fig.text(0.04, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=11)
    
    plt.show()

In [None]:
tf_names = ["SOX2", "SOX9", "TCF3"]

plot_true_false_tf_score_distributions(
    true_df=true_df,
    false_df=false_df,
    tf_name_list=tf_names,
    score_col="sliding_window_score",
    title="Sliding window score distributions for TFs with True values"
)

Let's plot the balanced distributions again, but this time we can color SOX2, SOX9, and TCF3 differently so we can see if they truly are the majority of scores.

In [None]:
def plot_tfs_of_interest(
    df,
    feature_col,
    tfs_of_interest,
    limit_x = True
):

    fig = plt.figure(figsize=(7, 5))
    
    true_values = df[df["label"] == 1]
    false_values = df[df["label"] == 0]

    plt.hist(
        false_values[feature_col].dropna(),
        bins=50, alpha=0.3,
        color="#757575",
        label="False Scores"
    )
    
    y_cmap = plt.get_cmap("Dark2")
    
    for x, tf in enumerate(tfs_of_interest):
        
        
        true_tfs = true_values[true_values["source_id"] == tf]
        
        percent_total_true_edges = len(true_tfs[feature_col]) / len(true_values[feature_col])
        
        nbins=max(10, math.ceil(50*percent_total_true_edges))
        
        plt.hist(
            true_tfs[feature_col].dropna(),
            bins=nbins, alpha=0.8,
            color=y_cmap.colors[x],
            label=tf
        )

    # set titles/labels on the same ax
    plt.title("Sliding window score distribution colored by TFs of interest", fontsize=12)
    plt.xlabel("Sliding Window Score", fontsize=12)
    plt.ylabel("Frequency", fontsize=12)
    if limit_x:
        plt.xlim(0, 1)

    fig.legend(
        loc="lower center",
        ncol=1,
        fontsize=10,
        bbox_to_anchor=(1.10, 0.60)
    )
    fig.tight_layout(rect=[0, 0, 1, 1])
    plt.show()

In [None]:
plot_tfs_of_interest(balanced_df, "sliding_window_score", tf_names, limit_x=False)

In [None]:
plot_true_false_feature_histogram(balanced_df, "sliding_window_score", limit_x=False)

Looking at the plot of only the main True TFs, we can see that the separate peaks in the True scores stem from those TFs having distinct score distributions and accounting for the majority of True edges.

Also, if we don't balance the True and False edges, we can see that the True values represent a small minority of the total number of scores. This shows that balancing the scores makes it appear that the True and False scores are more separated than they truly are.

In [None]:
plot_true_false_feature_histogram(sliding_window_score_df, "sliding_window_score", limit_x=False)
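
The effect of balancing also shows up in the metrics. The sketch below uses synthetic Gaussian scores (an assumption, not the real data) to compare a 1:1 and a 1:20 ratio of True to False rows: AUROC is largely insensitive to the class ratio, while average precision, which is what `plot_auroc_auprc` reports as AUPRC, drops sharply as False rows are added.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_pos = 1000
# True scores are shifted slightly above False scores (synthetic assumption)
pos_scores = rng.normal(0.6, 0.15, size=n_pos)

results = {}
for ratio in (1, 20):  # number of False rows per True row
    neg_scores = rng.normal(0.5, 0.15, size=n_pos * ratio)
    y_true = np.concatenate([np.ones(n_pos), np.zeros(n_pos * ratio)])
    y_score = np.concatenate([pos_scores, neg_scores])
    results[ratio] = {
        "prevalence": y_true.mean(),
        "auroc": roc_auc_score(y_true, y_score),
        "auprc": average_precision_score(y_true, y_score),
    }

for ratio, metrics in results.items():
    print(f"1:{ratio}  " + "  ".join(f"{k}={v:.3f}" for k, v in metrics.items()))
```

Because the precision of a random classifier equals the fraction of True rows, an AUPRC computed on the balanced subset should not be read as the performance to expect on the full, unbalanced set of edges.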