## Analysis of Prediction Results

We read the confusion matrices from the .csv files into pandas DataFrames. We then define a function that computes accuracy, precision, recall, and related metrics for each of the four datasets.

In [2]:
import numpy as np 
import pandas as pd

In [6]:
# Drop-seq
HCC_drop_pr_df = pd.read_csv("HCC1806_drop_table_pr.csv")
MCF_drop_pr_df = pd.read_csv("MCF7_drop_table_pr.csv")

# Smart-seq
HCC_smart_pr_df = pd.read_csv("HCC1806_smart_table_pr.csv")
MCF_smart_pr_df = pd.read_csv("MCF7_smart_table_pr.csv")

HCC_drop_pr_df.head()

| Unnamed: 0 | 0 | 1 |
|---|---|---|
| 0 | 1405 | 119 |
| 1 | 49 | 2098 |


In [7]:
MCF_drop_pr_df.head()

| Unnamed: 0 | 0 | 1 |
|---|---|---|
| 0 | 3170 | 69 |
| 1 | 45 | 2122 |


In [8]:
HCC_smart_pr_df.head()

| Unnamed: 0 | 0 | 1 |
|---|---|---|
| 0 | 25 | 0 |
| 1 | 1 | 19 |


In [9]:
MCF_smart_pr_df.head()

| Unnamed: 0 | 0 | 1 |
|---|---|---|
| 0 | 32 | 0 |
| 1 | 0 | 31 |
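
The `Unnamed: 0` header in the tables above is the label pandas gives to a saved index column when a file is read without `index_col`. The sketch below shows the difference on an in-memory stand-in for one file. The CSV text is an assumption reconstructed from the displayed HCC1806 Drop-seq table, not the actual file contents.

```python
import io

import pandas as pd

# Stand-in for the file contents (assumed layout: a saved index column
# followed by one column per predicted class).
csv_text = ",0,1\n0,1405,119\n1,49,2098\n"

# Default read: the unnamed first column is kept as data ("Unnamed: 0").
raw_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row index instead,
# leaving only the four counts as data.
cm_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw_df.shape)
print(cm_df.shape)
```

If the real files share this layout, reading them with `index_col=0` yields a DataFrame that holds only the 2x2 counts.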


### A few remarks on the Metrics

Precision: Precision measures the proportion of correctly predicted positive instances (true positives) out of all instances predicted as positive (true positives + false positives). It indicates how reliable the positive predictions are. 

Recall (Sensitivity or True Positive Rate): Recall measures the proportion of correctly predicted positive instances (true positives) out of all actual positive instances (true positives + false negatives). It indicates how effectively the model can identify positive instances.

F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balanced measure between precision and recall. F1-score is useful when there is an imbalance between the positive and negative classes.

Support: Support represents the number of instances in each class. It indicates the number of actual occurrences of the class in the dataset.

Accuracy: Accuracy measures the proportion of correctly predicted instances (true positives + true negatives) out of all instances. It provides an overall performance measure of the model.

False Positive Rate (FPR): FPR measures the proportion of actual negative instances that are incorrectly predicted as positive (false positives) out of all actual negative instances (true negatives + false positives). It indicates how often negative instances are falsely identified as positive. Since the recall of class 0 is TN / (TN + FP), the FPR equals 1 - recall of class 0.

False Negative Rate (FNR): FNR measures the proportion of actual positive instances that are incorrectly predicted as negative (false negatives) out of all actual positive instances (true positives + false negatives). It indicates how often positive instances are falsely identified as negative. Since the recall of class 1 is TP / (TP + FN), the FNR equals 1 - recall of class 1.
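
As a worked example, the definitions above can be applied by hand to the HCC1806 Drop-seq counts shown earlier. This is a minimal sketch that assumes rows are actual classes and columns are predicted classes, which is the orientation implied by the indexing in the function defined in this section.

```python
# Counts from the HCC1806 Drop-seq table
# (assumed orientation: rows = actual class, columns = predicted class).
TN, FP = 1405, 119
FN, TP = 49, 2098

# Metrics for the positive class (class 1)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

# Overall accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN)

# Error rates computed directly from their definitions
fpr = FP / (FP + TN)
fnr = FN / (FN + TP)

# Recall of class 0, used to check the "1 - recall" shortcut
recall_0 = TN / (TN + FP)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.2%} FPR={fpr:.2%} FNR={fnr:.2%}")
```

The direct definitions and the shortcuts agree: `fpr` matches `1 - recall_0` and `fnr` matches `1 - recall`.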

In [57]:
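def calculate_classification_metrics(confusion_matrix_df):
    """Return per-class metrics, accuracy, FPR and FNR for a 2x2 confusion matrix.

    Rows are treated as actual classes and columns as predicted classes.
    """
    # Drop the saved index column if read_csv loaded it as data ("Unnamed: 0"),
    # then convert the pandas DataFrame to a numpy array
    counts_df = confusion_matrix_df.drop(columns="Unnamed: 0", errors="ignore")
    confusion_matrix = counts_df.to_numpy()

    # Check if the confusion matrix has the correct shape
    if confusion_matrix.shape != (2, 2):
        raise ValueError("Confusion matrix must be a 2x2 matrix.")

    # Row index = actual class, column index = predicted class
    TN = confusion_matrix[0, 0]
    FP = confusion_matrix[0, 1]
    FN = confusion_matrix[1, 0]
    TP = confusion_matrix[1, 1]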

    # Calculate metrics for negative class (class 0)
    precision_0 = TN / (TN + FN)
    recall_0 = TN / (TN + FP)
    f1_score_0 = 2 * (precision_0 * recall_0) / (precision_0 + recall_0)
    support_0 = TN + FP

    # Calculate metrics for positive class (class 1)
    precision_1 = TP / (TP + FP)
    recall_1 = TP / (TP + FN)
    f1_score_1 = 2 * (precision_1 * recall_1) / (precision_1 + recall_1)
    support_1 = TP + FN

    # Calculate overall accuracy
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    # Calculate false positive rate (FPR)
    FPR = 1 - recall_0

    # Calculate false negative rate (FNR)
    FNR = 1 - recall_1

    # Create a dictionary to store the metrics
    metrics = {
        'precision': [precision_0, precision_1],
        'recall': [recall_0, recall_1],
        'f1-score': [f1_score_0, f1_score_1],
        'support': [support_0, support_1]
    }

    # Create a DataFrame for the metrics
    metrics_df = pd.DataFrame(metrics, index=['Negative (class 0)', 'Positive (class 1)'])

    return metrics_df, accuracy, FPR, FNR
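
The ratios in the function are undefined when a denominator is zero, for example when a classifier never predicts one of the classes. This matters most for small test sets. The helper below is a hypothetical sketch, not part of the original notebook, showing one way such cases could be guarded. Both the name `safe_ratio` and the example counts are illustrative assumptions.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    if denominator == 0:
        return math.nan
    return numerator / denominator


# Hypothetical edge case: the classifier never predicts class 1.
TN, FP, FN, TP = 25, 0, 20, 0

precision_1 = safe_ratio(TP, TP + FP)  # undefined: no positive predictions
recall_1 = safe_ratio(TP, TP + FN)     # every actual positive was missed

print(precision_1, recall_1)
```

Returning NaN keeps the undefined metric visible in the report instead of raising an error or silently reporting a zero.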



### Drop-seq

#### HCC1806

In [58]:
confusion_matrix_df = HCC_drop_pr_df
metrics_df, accuracy, FPR, FNR = calculate_classification_metrics(confusion_matrix_df)

# Print the classification report-style output
print(metrics_df.to_string(float_format='%.4f'))

# Print overall accuracy
accuracy_percentage = accuracy * 100
print("Accuracy: {:.2f}%".format(accuracy_percentage))

# Print false positive rate and false negative rate
FPR_percentage = FPR * 100
FNR_percentage = FNR * 100
print("False Positive Rate (FPR): {:.2f}%".format(FPR_percentage))
print("False Negative Rate (FNR): {:.2f}%".format(FNR_percentage))

                    precision  recall  f1-score  support
Negative (class 0)     0.9663  0.9219    0.9436     1524
Positive (class 1)     0.9463  0.9772    0.9615     2147
Accuracy: 95.42%
False Positive Rate (FPR): 7.81%
False Negative Rate (FNR): 2.28%


#### MCF7

In [59]:
confusion_matrix_df = MCF_drop_pr_df
metrics_df, accuracy, FPR, FNR = calculate_classification_metrics(confusion_matrix_df)

# Print the classification report-style output
print(metrics_df.to_string(float_format='%.4f'))

# Print overall accuracy
accuracy_percentage = accuracy * 100
print("Accuracy: {:.2f}%".format(accuracy_percentage))

# Print false positive rate and false negative rate
FPR_percentage = FPR * 100
FNR_percentage = FNR * 100
print("False Positive Rate (FPR): {:.2f}%".format(FPR_percentage))
print("False Negative Rate (FNR): {:.2f}%".format(FNR_percentage))

                    precision  recall  f1-score  support
Negative (class 0)     0.9860  0.9787    0.9823     3239
Positive (class 1)     0.9685  0.9792    0.9738     2167
Accuracy: 97.89%
False Positive Rate (FPR): 2.13%
False Negative Rate (FNR): 2.08%


### Smart-seq

#### HCC1806

In [61]:
confusion_matrix_df = HCC_smart_pr_df
metrics_df, accuracy, FPR, FNR = calculate_classification_metrics(confusion_matrix_df)

# Print the classification report-style output
print(metrics_df.to_string(float_format='%.4f'))

# Print overall accuracy
accuracy_percentage = accuracy * 100
print("Accuracy: {:.2f}%".format(accuracy_percentage))

# Print false positive rate and false negative rate
FPR_percentage = FPR * 100
FNR_percentage = FNR * 100
print("False Positive Rate (FPR): {:.2f}%".format(FPR_percentage))
print("False Negative Rate (FNR): {:.2f}%".format(FNR_percentage))

                    precision  recall  f1-score  support
Negative (class 0)     0.9615  1.0000    0.9804       25
Positive (class 1)     1.0000  0.9500    0.9744       20
Accuracy: 97.78%
False Positive Rate (FPR): 0.00%
False Negative Rate (FNR): 5.00%


#### MCF7

In [62]:
confusion_matrix_df = MCF_smart_pr_df
metrics_df, accuracy, FPR, FNR = calculate_classification_metrics(confusion_matrix_df)

# Print the classification report-style output
print(metrics_df.to_string(float_format='%.4f'))

# Print overall accuracy
accuracy_percentage = accuracy * 100
print("Accuracy: {:.2f}%".format(accuracy_percentage))

# Print false positive rate and false negative rate
FPR_percentage = FPR * 100
FNR_percentage = FNR * 100
print("False Positive Rate (FPR): {:.2f}%".format(FPR_percentage))
print("False Negative Rate (FNR): {:.2f}%".format(FNR_percentage))

                    precision  recall  f1-score  support
Negative (class 0)     1.0000  1.0000    1.0000       32
Positive (class 1)     1.0000  1.0000    1.0000       31
Accuracy: 100.00%
False Positive Rate (FPR): 0.00%
False Negative Rate (FNR): 0.00%
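
To compare the four datasets side by side, the headline numbers can be recomputed in one loop. The sketch below hard-codes the counts from the confusion matrices shown in this section, again assuming rows are actual classes and columns are predicted classes. The dictionary layout and variable names are illustrative.

```python
# Confusion-matrix counts from the tables in this section, as (TN, FP, FN, TP).
matrices = {
    "HCC1806 Drop-seq": (1405, 119, 49, 2098),
    "MCF7 Drop-seq": (3170, 69, 45, 2122),
    "HCC1806 Smart-seq": (25, 0, 1, 19),
    "MCF7 Smart-seq": (32, 0, 0, 31),
}

summary = {}
for name, (tn, fp, fn, tp) in matrices.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "n": total,                       # number of test instances
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

for name, row in summary.items():
    print(
        f"{name:<18} n={row['n']:>5}  acc={row['accuracy']:.2%}  "
        f"FPR={row['fpr']:.2%}  FNR={row['fnr']:.2%}"
    )
```

The `n` column is worth reading alongside the rates: the Smart-seq test sets contain only 45 and 63 instances, so a single misclassification moves their percentages by several points.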
