## Plotting Sliding Window Score AUROC and AUPRC

When evaluating the feature score distributions of True / False values after adding MIRA RP scores to our combined DataFrame, we noticed that the `sliding_window_score` feature has a good separation in score distribution between True and False values. The True and False values are from the **RN115 LOGOF ESCAPE** mESC knockout dataset.

In [None]:
!hostnamectl

In [None]:
from IPython.display import Image, display
Image(
    "/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/figures/mm10/DS011/xgboost_feature_score_hist_by_label.png", 
    width=800, 
    height=800
    )

Now, we are interested in assessing the AUROC and AUPRC for just the sliding window scores.

I figured out that the large number of scores at exactly 1 was due to `clip_and_normalize_log1p_pandas()`. This function clips scores below the 5th percentile and above the 95th percentile by setting them equal to the respective threshold, and the min-max normalization then rescales the distribution to the range 0-1. As a result, a lot of scores pile up at 0 (the bottom threshold) and 1 (the top threshold).
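
To illustrate the pile-up, here is a minimal sketch of a clip-then-normalize step applied to synthetic scores. The function name `clip_log1p_minmax` and the order of operations (clip, min-max, log1p, min-max) are assumptions made for illustration; this is not the actual body of `clip_and_normalize_log1p_pandas()`.

```python
import numpy as np
import pandas as pd


def clip_log1p_minmax(scores: pd.Series, lower_q: float = 0.05, upper_q: float = 0.95) -> pd.Series:
    """Hypothetical stand-in: clip to percentile thresholds, then rescale to [0, 1]."""
    lower, upper = scores.quantile(lower_q), scores.quantile(upper_q)
    clipped = scores.clip(lower=lower, upper=upper)  # tails collapse onto the thresholds
    scaled = (clipped - lower) / (upper - lower)     # min-max to [0, 1]
    logged = np.log1p(scaled)                        # log1p keeps the ordering
    return (logged - logged.min()) / (logged.max() - logged.min())


rng = np.random.default_rng(0)
raw_scores = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000))  # synthetic scores
normalized = clip_log1p_minmax(raw_scores)

# Every score beyond a threshold ends up at exactly 0 or exactly 1
frac_at_zero = (normalized == 0).mean()
frac_at_one = (normalized == 1).mean()
print(f"Fraction at 0: {frac_at_zero:.3f}, fraction at 1: {frac_at_one:.3f}")
```

With 5th / 95th percentile thresholds, roughly 5% of all scores land on each boundary, whatever the shape of the original distribution.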

First, we will look at the AUROC and AUPRC as it stands now.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

In [None]:
inferred_df = pd.read_parquet("/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/output/DS011_mESC/DS011_mESC_sample1/labeled_inferred_grn.parquet", engine="pyarrow")
sliding_window_score_df = inferred_df[["source_id", "peak_id", "target_id", "sliding_window_score", "label"]]
sliding_window_score_df

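We first need to balance the number of True and False rows so that class imbalance does not skew the curves. The AUPRC is especially sensitive to it, because the precision of an uninformative classifier equals the fraction of True rows.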
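
To make the effect of imbalance concrete, the sketch below scores synthetic labels with pure noise. The 10% positive rate and all variable names are made up for illustration. For uninformative scores the AUROC stays near 0.5 regardless of class balance, while the average precision lands near the fraction of positives, so AUPRC values are only comparable between datasets with the same class ratio.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_rows = 20_000
positive_fraction = 0.10  # assumed imbalance, for illustration only

y_true = (rng.random(n_rows) < positive_fraction).astype(int)
y_random = rng.random(n_rows)  # scores that carry no information about the label

random_auroc = roc_auc_score(y_true, y_random)
random_ap = average_precision_score(y_true, y_random)
print(
    f"AUROC = {random_auroc:.3f}, average precision = {random_ap:.3f}, "
    f"positive fraction = {y_true.mean():.3f}"
)
```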

In [None]:
def balance_dataset(df, random_state=None):
    """Subsample the larger class so True and False rows are equally frequent.

    Pass an integer `random_state` to make the subsampling reproducible.
    """
    true_rows = df[df["label"] == 1]
    false_rows = df[df["label"] == 0]

    print("Before Balancing:")
    print(f"  - Number of True values: {len(true_rows)}")
    print(f"  - Number of False values: {len(false_rows)}")

    # Size of the smaller class; both classes are subsampled down to this
    min_rows = min(len(true_rows), len(false_rows))
    print(f"\nSubsampling down to {min_rows} rows")

    true_rows_sampled = true_rows.sample(min_rows, random_state=random_state)
    false_rows_sampled = false_rows.sample(min_rows, random_state=random_state)

    balanced_df = pd.concat([true_rows_sampled, false_rows_sampled])

    balanced_true_rows = balanced_df[balanced_df["label"] == 1]
    balanced_false_rows = balanced_df[balanced_df["label"] == 0]

    print("\nAfter Balancing:")
    print(f"  - Number of True values: {len(balanced_true_rows)}")
    print(f"  - Number of False values: {len(balanced_false_rows)}")

    return balanced_df

In [None]:
balanced_df = balance_dataset(sliding_window_score_df)

Let's look at the True / False sliding window score histogram.

In [None]:
def plot_true_false_feature_histogram(
    df,
    feature_col,
    limit_x = True
):

    fig = plt.figure(figsize=(5, 4))
    
    true_values = df[df["label"] == 1]
    false_values = df[df["label"] == 0]

    plt.hist(
        false_values[feature_col].dropna(),
        bins=50, alpha=0.7,
        color='#1682b1', edgecolor="#032b5f",
        label="False",
    )
    plt.hist(
        true_values[feature_col].dropna(),
        bins=50, alpha=0.7,
        color="#cb5f17", edgecolor="#b13301",
        label="True",
    )

    # Set the title and axis labels on the current axes
    plt.title(feature_col, fontsize=14)
    plt.xlabel(feature_col, fontsize=14)
    plt.ylabel("Frequency", fontsize=14)
    if limit_x:
        plt.xlim(0, 1)

    fig.legend(
        loc="lower center",
        ncol=2,
        fontsize=14,
        bbox_to_anchor=(0.5, -0.02)
    )
    fig.tight_layout(rect=[0, 0.05, 1, 1])
    plt.show()

In [None]:
plot_true_false_feature_histogram(balanced_df, "sliding_window_score")

In [None]:
def plot_auroc_auprc(inferred_df):
    # Subset the relevant columns
    df = inferred_df[["sliding_window_score", "label"]].dropna()

    # Get true labels and predicted scores
    y_true = df["label"]
    y_scores = df["sliding_window_score"]

    # --- ROC Curve ---
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)

    # --- PR Curve ---
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    avg_precision = average_precision_score(y_true, y_scores)

    # --- Plot ---
    plt.figure(figsize=(10, 4))

    # ROC
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, label=f"AUROC = {roc_auc:.2f}")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()

    # PR
    plt.subplot(1, 2, 2)
    plt.plot(recall, precision, label=f"AUPRC = {avg_precision:.2f}")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-Recall Curve")
    plt.legend()

    plt.tight_layout()
    plt.show()
    
plot_auroc_auprc(balanced_df)

As expected, the large number of scores clipped to 0 and 1 is hurting both curves. Let's see what happens if we filter out sliding window scores with values of exactly 0 or 1.

In [None]:
filtered_balanced_df = balanced_df[
    (balanced_df["sliding_window_score"] > 0) &
    (balanced_df["sliding_window_score"] < 1)
]
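
Filtering on the score can remove different numbers of True and False rows, so the filtered DataFrame is not guaranteed to stay balanced. A quick check like the following, shown on a small made-up DataFrame (`toy_df` is hypothetical), reports how many rows of each label survive the filter.

```python
import pandas as pd

# Toy stand-in for the balanced DataFrame; the values are invented for illustration
toy_df = pd.DataFrame({
    "sliding_window_score": [0.0, 0.0, 0.3, 1.0, 0.6, 0.8, 1.0, 0.0],
    "label":                [0,   0,   0,   0,   1,   1,   1,   1],
})

# Same condition as the filter above: keep scores strictly between 0 and 1
inside = (toy_df["sliding_window_score"] > 0) & (toy_df["sliding_window_score"] < 1)

# Rows kept and removed per label
summary = (
    toy_df.assign(kept=inside)
    .groupby("label")["kept"]
    .agg(kept="sum", total="count")
)
summary["removed"] = summary["total"] - summary["kept"]
print(summary)
```

If the surviving counts differ a lot between labels, re-balancing after the filter keeps the AUPRC comparable with the unfiltered result.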

In [None]:
plot_true_false_feature_histogram(filtered_balanced_df, "sliding_window_score")

In [None]:
plot_auroc_auprc(filtered_balanced_df)

That improves the AUROC and AUPRC by quite a lot. The histogram above shows the score distribution with the 0 and 1 values removed.

Now, I need to look into what is causing scores to have a value of 0 or 1. I am re-running the sliding window score method for the dataset without the min-max normalization to 0-1, to get a better look at how the raw scores are distributed without normalization or clipping.

In [None]:
sliding_window_raw_df = pd.read_parquet("/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/output/DS011_mESC/DS011_mESC_sample1/no_norm_sliding_window_tf_to_peak_score.parquet", engine="pyarrow")
sliding_window_raw_df = sliding_window_raw_df.reset_index(drop=True)
sliding_window_raw_df["source_id"] = sliding_window_raw_df["source_id"].str.upper()

In [None]:
sliding_window_raw_df

Now that we have the raw sliding window scores, we need to label the edges with the ground truth labels. However, the sliding window score file only contains TF-to-peak edges (`source_id`, `peak_id`), while the labels are defined for TF-to-TG edges. To transfer the labels, we can match the sliding window TF-peak edges to the labeled inferred score DataFrame.

In [None]:
inferred_edges = inferred_df[["source_id", "peak_id", "label"]]
labeled_sliding_window_raw_df = pd.merge(inferred_edges, sliding_window_raw_df, on=["source_id", "peak_id"], how="left")
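
With `how="left"`, edges from the labeled DataFrame that are missing from the raw score file are kept with a `NaN` score. A sketch of how to quantify this with the `indicator=True` option of `pd.merge`, using two small invented DataFrames (the TF names, peaks, and scores are placeholders):

```python
import pandas as pd

# Invented example edges; TF and peak names are placeholders
labeled_edges = pd.DataFrame({
    "source_id": ["SOX2", "SOX2", "NANOG"],
    "peak_id": ["chr1:100-200", "chr2:300-400", "chr1:100-200"],
    "label": [1, 0, 1],
})
raw_scores = pd.DataFrame({
    "source_id": ["SOX2", "NANOG"],
    "peak_id": ["chr1:100-200", "chr1:100-200"],
    "sliding_window_score": [1523.0, 87.5],
})

merged = pd.merge(
    labeled_edges,
    raw_scores,
    on=["source_id", "peak_id"],
    how="left",
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)

# Labeled edges without a matching row in the raw score table
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{n_unmatched} of {len(merged)} labeled edges have no raw score")
```

Rows with a `NaN` score are silently dropped later by `.dropna()` in the plotting functions, so a large unmatched count would change the effective class balance.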

We need to re-balance this dataset as well.

In [None]:
balanced_df = balance_dataset(labeled_sliding_window_raw_df)

Now we can plot the True / False score histogram and AUROC / AUPRC plots for the balanced labeled raw sliding window scores.

In [None]:
plot_true_false_feature_histogram(balanced_df, "sliding_window_score", limit_x=False)

In [None]:
balanced_df_above_0 = balanced_df[balanced_df["sliding_window_score"] > 0]

In [None]:
plot_true_false_feature_histogram(balanced_df_above_0, "sliding_window_score", limit_x=False)

A lot of the scores were clipped, which led to only the middle of the distributions being kept. The actual distribution is a lot messier than it looks with the log1p normalization and percentile clipping.

In [None]:
plot_auroc_auprc(balanced_df_above_0)

As expected, this reduced the AUROC and AUPRC. The ROC curve resembles the one from the clipped scores between false positive rates of 0.2 and 0.4, since the scores above and below that range were the ones that had been clipped.
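
The effect of clipping on the ROC curve can be reproduced on synthetic data. Clipping makes every score beyond a threshold tie, and tied scores cannot be ranked against each other, so the curve becomes a straight segment across each tied region while the middle of the curve is unchanged. The distributions below are arbitrary and only serve as an illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
n_per_class = 5_000

# Synthetic scores: True edges score somewhat higher on average than False edges
y_true = np.concatenate([np.ones(n_per_class, dtype=int), np.zeros(n_per_class, dtype=int)])
y_raw = np.concatenate([rng.normal(1.0, 1.0, n_per_class), rng.normal(0.0, 1.0, n_per_class)])

# Clip to the 5th / 95th percentiles, which creates ties at both thresholds
lower, upper = np.quantile(y_raw, [0.05, 0.95])
y_clipped = np.clip(y_raw, lower, upper)

# drop_intermediate=False keeps one ROC point per distinct score value
fpr_raw, tpr_raw, _ = roc_curve(y_true, y_raw, drop_intermediate=False)
fpr_clip, tpr_clip, _ = roc_curve(y_true, y_clipped, drop_intermediate=False)

auroc_raw = roc_auc_score(y_true, y_raw)
auroc_clipped = roc_auc_score(y_true, y_clipped)
print(f"ROC points: raw = {len(fpr_raw)}, clipped = {len(fpr_clip)}")
print(f"AUROC: raw = {auroc_raw:.3f}, clipped = {auroc_clipped:.3f}")
```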

## Investigating the True interactions below the bottom 5th percentile

We see an enrichment for True scores below a score of 10,000. There are several potential reasons why we might see this pattern, so let's start by investigating the following:

1) Are the scores dominated by a few TFs?
2) Are the scores dominated by a few TGs?
3) Are the scores mainly located on a certain chromosome?
4) Are the scores mainly short / long range interactions?
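
Once the low-scoring True edges have been isolated, the first three questions could be approached with simple counts. The sketch below uses an invented DataFrame and assumes that `peak_id` is formatted as `chr:start-end`; both are assumptions for illustration. The fourth question needs TSS coordinates for each target gene, which are not part of these DataFrames, so it is left out here.

```python
import pandas as pd

# Invented low-scoring True edges; gene names and peaks are placeholders
low_true_edges = pd.DataFrame({
    "source_id": ["SOX2", "SOX2", "SOX2", "NANOG", "POU5F1"],
    "peak_id": ["chr1:100-200", "chr1:500-600", "chr3:50-150", "chr1:100-200", "chr7:10-90"],
    "target_id": ["KLF4", "ESRRB", "KLF4", "KLF4", "TBX3"],
})

# 1) and 2): share of edges contributed by each TF and each TG
tf_share = low_true_edges["source_id"].value_counts(normalize=True)
tg_share = low_true_edges["target_id"].value_counts(normalize=True)

# 3): chromosome parsed from the peak ID (assumes "chr:start-end" formatting)
chrom = low_true_edges["peak_id"].str.split(":").str[0]
chrom_share = chrom.value_counts(normalize=True)

print(tf_share.head(), tg_share.head(), chrom_share.head(), sep="\n\n")
```

Comparing these shares against the same shares computed on all True edges would show whether a TF, TG, or chromosome is over-represented in the low-score group rather than just common overall.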

First, we need to separate the True and False scores into three groups: below the 5th percentile, between the 5th and 95th percentiles, and above the 95th percentile.

**Loading in the sliding window score file:**

In [None]:
norm_sliding_window_scores_full = pd.read_parquet("output/DS011_mESC/DS011_mESC_sample1/sliding_window_tf_to_peak_score.parquet", engine="pyarrow")
norm_sliding_window_scores_full["source_id"] = norm_sliding_window_scores_full["source_id"].str.upper()
norm_sliding_window_scores_full

**Loading in the combined score DataFrame:**

In [None]:
combined_score_df = pd.read_parquet("output/DS011_mESC/DS011_mESC_sample1/labeled_inferred_grn.parquet", engine="pyarrow")
combined_score_balanced = balance_dataset(combined_score_df)
# Use the balanced combined-score rows here, not the earlier raw-score `balanced_df`
inferred_edges = combined_score_balanced[["source_id", "peak_id", "sliding_window_score", "label"]]
inferred_edges

**Merging the score file and the combined score DataFrame:**

In [None]:
labeled_norm_sliding_window_scores_df = pd.merge(inferred_edges, norm_sliding_window_scores_full, on=["source_id", "peak_id"], how="inner")

In [None]:
labeled_norm_sliding_window_scores_df

The `sliding_window_score` distributions look different between the combined inferred GRN score DataFrame and the sliding window score file, even when I keep only the shared edges. Why is that?

**Combined score DataFrame sliding window scores:**

In [None]:
plot_true_false_feature_histogram(inferred_edges, "sliding_window_score", limit_x=True)

**Scores directly from the sliding window score file, edges matched with the combined DataFrame:**

In [None]:
plot_true_false_feature_histogram(labeled_norm_sliding_window_scores_df, "sliding_window_score_y", limit_x=True)

Let's look at the `inferred_score_df_full.parquet` file created by `explore_score_dataframe_sizes.ipynb`:

In [None]:
inferred_score_full_df = pd.read_parquet("/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/output/DS011_mESC/DS011_mESC_sample1/inferred_grns/inferred_score_df_full.parquet", engine="pyarrow")
inferred_score_full_df = inferred_score_full_df[["source_id", "peak_id", "target_id", "sliding_window_score"]]
inferred_score_full_df

Let's compare the sliding window scores in the combined score DataFrame with those in the sliding window score file for the shared edges, and count how many of them differ.

In [None]:
inferred_score_labeled_df = pd.merge(
    labeled_norm_sliding_window_scores_df, 
    inferred_score_full_df, 
    on=["source_id", "peak_id"], 
    how="inner"
    )
inferred_score_labeled_df

In [None]:
# Compare the two score columns element-wise.
# NaN never equals NaN, so rows where both scores are missing count as different.
rows_diff_scores = (
    inferred_score_labeled_df["sliding_window_score_x"]
    != inferred_score_labeled_df["sliding_window_score_y"]
)
diff_scores = int(rows_diff_scores.sum())
matching_scores = len(rows_diff_scores) - diff_scores
print(f"{matching_scores} / {matching_scores + diff_scores} scores are the same between the combined df and the sliding window score df")
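
Exact comparison flags floating point values that differ only by rounding, and it also treats two missing values as different. If the two score columns went through different I/O or arithmetic paths, a tolerance-based comparison may be more informative. A sketch with invented values, using `numpy.isclose`:

```python
import numpy as np
import pandas as pd

# Invented score pairs: one rounding-level difference, one real difference, one NaN pair
scores = pd.DataFrame({
    "sliding_window_score_x": [0.25, 0.1 + 0.2, 0.70, np.nan],
    "sliding_window_score_y": [0.25, 0.3,       0.90, np.nan],
})

exact_match = scores["sliding_window_score_x"] == scores["sliding_window_score_y"]
close_match = np.isclose(
    scores["sliding_window_score_x"],
    scores["sliding_window_score_y"],
    rtol=1e-9,
    atol=0.0,
    equal_nan=True,  # count NaN vs NaN as a match
)

print(f"Exact matches: {int(exact_match.sum())} / {len(scores)}")
print(f"Matches within tolerance: {int(close_match.sum())} / {len(scores)}")
```

If most of the mismatches disappear under a tolerance, the difference is numerical noise; if they remain, the two files really do contain different scores for the same edges.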

---

In [None]:
balanced_labeled_norm_sliding_window_scores_df = balance_dataset(labeled_norm_sliding_window_scores_df)

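# The merge suffixed the score columns: `_x` comes from `inferred_edges`, `_y` from the sliding window score file
plot_true_false_feature_histogram(balanced_labeled_norm_sliding_window_scores_df, "sliding_window_score_y", limit_x=False)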

In [None]:
bottom_threshold = balanced_df["sliding_window_score"].quantile(0.05)
top_threshold = balanced_df["sliding_window_score"].quantile(0.95)

scores_below_5th_percentile = balanced_df[balanced_df["sliding_window_score"] <= bottom_threshold]
scores_between_thresholds = balanced_df[
    (balanced_df["sliding_window_score"] > bottom_threshold) &
    (balanced_df["sliding_window_score"] < top_threshold)
    ]
scores_above_95th_percentile = balanced_df[balanced_df["sliding_window_score"] >= top_threshold]

In [None]:
fig = plt.figure(figsize=(5, 4))

# Extract the True values within each threshold
true_values_below_bottom_thresh = scores_below_5th_percentile[scores_below_5th_percentile["label"] == 1]
true_values_between_thresh = scores_between_thresholds[scores_between_thresholds["label"] == 1]
true_values_above_top_thresh = scores_above_95th_percentile[scores_above_95th_percentile["label"] == 1]

# Extract the False values within each threshold
false_values_below_bottom_thresh = scores_below_5th_percentile[scores_below_5th_percentile["label"] == 0]
false_values_between_thresh = scores_between_thresholds[scores_between_thresholds["label"] == 0]
false_values_above_top_thresh = scores_above_95th_percentile[scores_above_95th_percentile["label"] == 0]

# Plot the True / False feature scores BELOW the bottom 5th percentile threshold
plt.hist(
    false_values_below_bottom_thresh["sliding_window_score"].dropna(),
    bins=20, alpha=0.7,
    color="#759fb1", edgecolor="#1e3b60",
)
plt.hist(
    true_values_below_bottom_thresh["sliding_window_score"].dropna(),
    bins=20, alpha=0.7,
    color="#ca9f83", edgecolor="#ab5938",
)

plt.axvline(x=bottom_threshold, linestyle="--")

# Plot the True / False feature scores BETWEEN the 5th and 95th percentiles
plt.hist(
    false_values_between_thresh["sliding_window_score"].dropna(),
    bins=50, alpha=0.7,
    color='#1682b1', edgecolor="#032b5f",
    label="False",
)
plt.hist(
    true_values_between_thresh["sliding_window_score"].dropna(),
    bins=50, alpha=0.7,
    color="#cb5f17", edgecolor="#b13301",
    label="True",
)

plt.axvline(x=top_threshold, linestyle="--")

# Plot the True / False feature scores ABOVE the top 95th percentile threshold
plt.hist(
    false_values_above_top_thresh["sliding_window_score"].dropna(),
    bins=20, alpha=0.7,
    color="#759fb1", edgecolor="#1e3b60",
)
plt.hist(
    true_values_above_top_thresh["sliding_window_score"].dropna(),
    bins=20, alpha=0.7,
    color="#ca9f83", edgecolor="#ab5938",
)


# Set the title and axis labels on the current axes
plt.title("Scores below the 5th and above the 95th percentile", fontsize=14)
plt.xlabel("Sliding Window Score", fontsize=14)
plt.ylabel("Frequency", fontsize=14)

fig.legend(
    loc="lower center",
    ncol=2,
    fontsize=14,
    bbox_to_anchor=(0.5, -0.02)
)
fig.tight_layout(rect=[0, 0.05, 1, 1])
plt.show()
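
One caveat when reading this figure: each `plt.hist` call chooses its own bin edges, so the bars in the three regions have different widths and their heights are not directly comparable. A possible alternative, sketched here on synthetic scores, is to compute one set of bin edges for the full score range with `numpy.histogram_bin_edges` and reuse it for every group. The same array can be passed as `bins=` to `plt.hist`.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.lognormal(mean=8.0, sigma=1.0, size=5_000)  # synthetic scores

# Split into the same three regions as above
lower, upper = np.quantile(scores, [0.05, 0.95])
below = scores[scores <= lower]
middle = scores[(scores > lower) & (scores < upper)]
above = scores[scores >= upper]

# One shared set of 60 equal-width bins spanning the full score range
shared_edges = np.histogram_bin_edges(scores, bins=60)

counts_below, _ = np.histogram(below, bins=shared_edges)
counts_middle, _ = np.histogram(middle, bins=shared_edges)
counts_above, _ = np.histogram(above, bins=shared_edges)

# Every score falls in exactly one region, so the per-region counts add up to the total
total_counts, _ = np.histogram(scores, bins=shared_edges)
```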

As a sanity check, let's make sure that the clipping and normalization steps reproduce the same histogram.

**Clipping:**

In [None]:
clipped_score_df = balanced_df.copy()
clipped_score_df["sliding_window_score"] = clipped_score_df["sliding_window_score"].clip(lower=bottom_threshold, upper=top_threshold)
plot_true_false_feature_histogram(clipped_score_df, "sliding_window_score", limit_x=False)

**Min-max normalization (before log1p):**

In [None]:
norm_clipped_score_df = clipped_score_df.copy()
norm_clipped_score_df["sliding_window_score"] = (norm_clipped_score_df["sliding_window_score"] - bottom_threshold) / (top_threshold - bottom_threshold)
plot_true_false_feature_histogram(norm_clipped_score_df, "sliding_window_score", limit_x=False)

**Log1p Normalization:**

In [None]:
import numpy as np
log1p_norm_clipped_score_df = norm_clipped_score_df.copy()
log1p_norm_clipped_score_df["sliding_window_score"] = np.log1p(log1p_norm_clipped_score_df["sliding_window_score"])
plot_true_false_feature_histogram(log1p_norm_clipped_score_df, "sliding_window_score", limit_x=False)

**Min-max normalization (after log1p):**

In [None]:
minmax_log1p_norm_clipped_score_df = log1p_norm_clipped_score_df.copy()
min_val = minmax_log1p_norm_clipped_score_df["sliding_window_score"].min()
max_val = minmax_log1p_norm_clipped_score_df["sliding_window_score"].max()
minmax_log1p_norm_clipped_score_df.loc[:, "sliding_window_score"] = (minmax_log1p_norm_clipped_score_df["sliding_window_score"] - min_val) / (max_val - min_val)
plot_true_false_feature_histogram(minmax_log1p_norm_clipped_score_df, "sliding_window_score", limit_x=False)
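
A useful property to keep in mind when interpreting these steps: the AUROC depends only on the ranking of the scores, so strictly increasing transforms such as min-max scaling and `log1p` leave it unchanged. Of the steps above, only the clipping can change the curves, because it introduces ties. The sketch below checks this on synthetic scores; the distributions are arbitrary and the helper `minmax` is defined only for this example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n_per_class = 2_000

y_true = np.concatenate([np.ones(n_per_class, dtype=int), np.zeros(n_per_class, dtype=int)])
raw = np.concatenate([
    rng.lognormal(mean=8.5, sigma=1.0, size=n_per_class),  # synthetic True scores
    rng.lognormal(mean=8.0, sigma=1.0, size=n_per_class),  # synthetic False scores
])


def minmax(x):
    """Rescale an array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


# Monotonic steps only: min-max, log1p, min-max (no clipping)
monotonic = minmax(np.log1p(minmax(raw)))

# Same steps, but with percentile clipping first
lower, upper = np.quantile(raw, [0.05, 0.95])
clipped = minmax(np.log1p(minmax(np.clip(raw, lower, upper))))

auroc_raw = roc_auc_score(y_true, raw)
auroc_monotonic = roc_auc_score(y_true, monotonic)
auroc_clipped = roc_auc_score(y_true, clipped)
print(f"raw = {auroc_raw:.4f}, monotonic = {auroc_monotonic:.4f}, clipped = {auroc_clipped:.4f}")
```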