## Plotting Sliding Window Score AUROC and AUPRC

When evaluating the per-feature score distributions of True and False values after adding MIRA RP scores to our combined DataFrame, we noticed that the `sliding_window_score` feature separates the two classes well. The True and False labels come from the **RN115 LOGOF ESCAPE** mESC knockout dataset.

In [None]:
from IPython.display import Image, display
Image(
    "/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/figures/mm10/DS011/xgboost_feature_score_hist_by_label.png", 
    width=800, 
    height=800
    )

Now, we are interested in assessing the AUROC and AUPRC for just the sliding window scores. We also need to assess why a large number of False values have a score of 1 and a large number of True values have a score of 0.
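
One way to quantify that pile-up is to count, per label, how many rows sit exactly at 0, exactly at 1, or strictly in between. The sketch below runs on a small made-up DataFrame that only borrows the `sliding_window_score` and `label` column names; the helper `count_boundary_scores` is hypothetical and not part of the pipeline.

```python
import pandas as pd

def count_boundary_scores(df, score_col="sliding_window_score", label_col="label"):
    """Count rows per label whose score is exactly 0, exactly 1, or in between."""
    # Drop missing scores so they are not counted as interior values
    df = df.dropna(subset=[score_col])
    # Classify each score as lower boundary, upper boundary, or interior
    position = pd.Series("interior", index=df.index)
    position[df[score_col] == 0] = "zero"
    position[df[score_col] == 1] = "one"
    # Cross-tabulate label (rows) against boundary position (columns)
    return pd.crosstab(df[label_col], position)

# Toy data for illustration only (not taken from the DS011 dataset)
toy_df = pd.DataFrame({
    "sliding_window_score": [0.0, 0.0, 0.4, 1.0, 1.0, 0.7, 0.0, 1.0],
    "label":                [1,   0,   0,   0,   1,   1,   1,   0],
})
boundary_counts = count_boundary_scores(toy_df)
print(boundary_counts)
```

Running the same helper on the real DataFrame would show how many False rows sit at 1 and how many True rows sit at 0.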

First, we will look at the AUROC and AUPRC as it stands now.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

In [None]:
inferred_df = pd.read_parquet("/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/output/DS011_mESC/DS011_mESC_sample1/labeled_inferred_grn.parquet", engine="pyarrow")
sliding_window_score_df = inferred_df[["source_id", "peak_id", "target_id", "sliding_window_score", "label"]]
sliding_window_score_df

We first need to balance the number of True and False rows so that class imbalance does not skew the ROC and precision-recall curves.

In [None]:
true_rows = inferred_df[inferred_df["label"] == 1]
false_rows = inferred_df[inferred_df["label"] == 0]

print("Before Balancing:")
print(f"  - Number of True values: {len(true_rows)}")
print(f"  - Number of False values: {len(false_rows)}")

min_rows = min(len(true_rows), len(false_rows))
print(f"\nSubsampling down to {min_rows} rows")

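# Fixed seed (arbitrary value) so the subsample is reproducible between runs
true_rows_sampled = true_rows.sample(min_rows, random_state=42)
false_rows_sampled = false_rows.sample(min_rows, random_state=42)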

balanced_df = pd.concat([true_rows_sampled, false_rows_sampled])

balanced_true_rows = balanced_df[balanced_df["label"] == 1]
balanced_false_rows = balanced_df[balanced_df["label"] == 0]

print("\nAfter Balancing:")
print(f"  - Number of True values: {len(balanced_true_rows)}")
print(f"  - Number of False values: {len(balanced_false_rows)}")
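
To illustrate why balancing matters more for the AUPRC than for the AUROC, the sketch below scores synthetic labels with uninformative random scores. It is only an illustration on made-up data: the helper `random_score_metrics` and the class sizes are assumptions, not values from this dataset. For random scores the AUROC stays near 0.5 regardless of class balance, while the average precision falls to roughly the fraction of positives.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)  # fixed seed so the illustration is repeatable

def random_score_metrics(n_pos, n_neg):
    """Return (AUROC, average precision) for uninformative random scores."""
    y_true = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    y_scores = rng.random(n_pos + n_neg)  # scores carry no information about y_true
    return roc_auc_score(y_true, y_scores), average_precision_score(y_true, y_scores)

# Balanced classes versus a 5% positive rate (hypothetical sizes)
balanced_auroc, balanced_ap = random_score_metrics(5000, 5000)
skewed_auroc, skewed_ap = random_score_metrics(500, 9500)

print(f"Balanced: AUROC = {balanced_auroc:.2f}, AP = {balanced_ap:.2f}")
print(f"Skewed:   AUROC = {skewed_auroc:.2f}, AP = {skewed_ap:.2f}")
```

With balanced classes, an AUPRC near 0.5 is therefore the chance-level baseline to compare against.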

Let's look at the True / False sliding window score histogram.

In [None]:
def plot_true_false_feature_histogram(
    df,
    feature_col,
):

    fig = plt.figure(figsize=(5, 4))
    
    true_values = df[df["label"] == 1]
    false_values = df[df["label"] == 0]

    # Plot each class with the legend label that matches its data
    plt.hist(
        false_values[feature_col].dropna(),
        bins=50, alpha=0.7,
        color='#1682b1', edgecolor="#032b5f",
        label="False",
    )
    plt.hist(
        true_values[feature_col].dropna(),
        bins=50, alpha=0.7,
        color="#cb5f17", edgecolor="#b13301",
        label="True",
    )

    # Set the title and axis labels on the current axes
    plt.title(feature_col, fontsize=14)
    plt.xlabel(feature_col, fontsize=14)
    plt.ylabel("Frequency", fontsize=14)
    plt.xlim(0, 1)

    fig.legend(
        loc="lower center",
        ncol=2,
        fontsize=14,
        bbox_to_anchor=(0.5, -0.02)
    )
    fig.tight_layout(rect=[0, 0.05, 1, 1])
    plt.show()

plot_true_false_feature_histogram(balanced_df, "sliding_window_score")

In [None]:
def plot_auroc_auprc(inferred_df):
    # Subset the relevant columns
    df = inferred_df[["sliding_window_score", "label"]].dropna()

    # Get true labels and predicted scores
    y_true = df["label"]
    y_scores = df["sliding_window_score"]

    # --- ROC Curve ---
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)

    # --- PR Curve ---
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    avg_precision = average_precision_score(y_true, y_scores)

    # --- Plot ---
    plt.figure(figsize=(10, 4))

    # ROC
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, label=f"AUROC = {roc_auc:.2f}")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()

    # PR
    plt.subplot(1, 2, 2)
    plt.plot(recall, precision, label=f"AUPRC = {avg_precision:.2f}")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-Recall Curve")
    plt.legend()

    plt.tight_layout()
    plt.show()
    
plot_auroc_auprc(balanced_df)

As expected, the large number of mislabeled-looking scores at exactly 0 and 1 (False values at 1 and True values at 0) is hurting both curves. Let's see what happens if we filter out sliding window scores with values of 0 or 1.

In [None]:
filtered_balanced_df = balanced_df[
    (balanced_df["sliding_window_score"] > 0) &
    (balanced_df["sliding_window_score"] < 1)
]
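
Filtering on the score can remove more rows from one label than the other, so the filtered DataFrame may no longer be balanced, which in turn shifts the AUPRC baseline. The sketch below, on made-up data, checks the per-label counts after filtering and downsamples again. The helper `rebalance_by_label` and the seed are hypothetical choices, not part of the original analysis.

```python
import pandas as pd

def rebalance_by_label(df, label_col="label", seed=0):
    """Downsample every label group to the size of the smallest group."""
    smallest = df[label_col].value_counts().min()
    return df.groupby(label_col).sample(n=smallest, random_state=seed)

# Toy balanced data for illustration only: four rows per label
toy_balanced_df = pd.DataFrame({
    "sliding_window_score": [0.0, 0.2, 0.5, 0.9, 1.0, 1.0, 0.1, 0.0],
    "label":                [1,   1,   1,   1,   0,   0,   0,   0],
})

# Same filter as above: keep scores strictly between 0 and 1
toy_filtered_df = toy_balanced_df[
    (toy_balanced_df["sliding_window_score"] > 0) &
    (toy_balanced_df["sliding_window_score"] < 1)
]
filtered_counts = toy_filtered_df["label"].value_counts()
print(filtered_counts)  # the two labels are no longer the same size

toy_rebalanced_df = rebalance_by_label(toy_filtered_df)
rebalanced_counts = toy_rebalanced_df["label"].value_counts()
print(rebalanced_counts)
```

If the counts on the real data differ after filtering, comparing metrics before and after the filter is fairer on a rebalanced subset.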

In [None]:
plot_auroc_auprc(filtered_balanced_df)

That improves the AUROC and AUPRC by quite a lot. Let's also look at the score distribution with 0 and 1 values removed.

In [None]:
plot_true_false_feature_histogram(filtered_balanced_df, "sliding_window_score")

Now, I need to look into what is causing scores to have a value of exactly 0 or 1. I am re-running the sliding window score method for the dataset without min-max normalizing the scores to the 0-1 range, to get a better look at the raw scores.
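
One property of min-max normalization worth keeping in mind while inspecting the raw scores: the minimum and maximum of whatever set is normalized always map to exactly 0 and 1. The sketch below shows this on made-up raw scores. Whether the sliding window method normalizes globally or within groups is not stated here, so the per-TF grouping, the `raw_score` column, and the `TF_A` / `TF_B` names are purely assumptions for illustration.

```python
import pandas as pd

def minmax_normalize(scores):
    """Scale a Series to the 0-1 range; a constant Series is mapped to 0."""
    score_range = scores.max() - scores.min()
    if score_range == 0:
        return pd.Series(0.0, index=scores.index)
    return (scores - scores.min()) / score_range

# Hypothetical raw scores for two transcription factors
raw = pd.DataFrame({
    "source_id": ["TF_A", "TF_A", "TF_A", "TF_B", "TF_B", "TF_B"],
    "raw_score": [2.0, 5.0, 11.0, 100.0, 130.0, 400.0],
})

# Global normalization: only the overall min and max land on 0 and 1
raw["global_norm"] = minmax_normalize(raw["raw_score"])

# Per-group normalization: every group contributes its own 0 and 1
raw["per_tf_norm"] = raw.groupby("source_id")["raw_score"].transform(minmax_normalize)

n_global_boundary = int(raw["global_norm"].isin([0.0, 1.0]).sum())
n_per_tf_boundary = int(raw["per_tf_norm"].isin([0.0, 1.0]).sum())
print(raw)
print(n_global_boundary, n_per_tf_boundary)
```

If the real pipeline normalizes within small groups, boundary values would be expected to be common, and the un-normalized scores should make that visible.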

In [None]:
sliding_window_raw_df = pd.read_parquet("/gpfs/Labs/Uzun/SCRIPTS/PROJECTS/2024.SINGLE_CELL_GRN_INFERENCE.MOELLER/output/DS011_mESC/DS011_mESC_sample1/sliding_window_tf_to_peak_score.parquet", engine="pyarrow")
sliding_window_raw_df = sliding_window_raw_df.reset_index(drop=True)

In [None]:
sliding_window_raw_df