### DATA PREDICTION MODEL: Cervical Cancer Binary Classification

##### Workflow:
1. Load preprocessed data
2. Split into training/testing sets
3. Scale features
4. Train multiple classification models
5. Evaluate models using metrics like Accuracy, Precision, Recall, F1 Score, AUC
6. Visualize confusion matrices and ROC curves
7. Compare model performance

In [None]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

In [None]:
"""TO READ FILE"""
import os
os.chdir(r"C:\Users\Deonne\OneDrive - Nanyang Technological University\Desktop\Y2S1\Biohackathon")
#os.chdir(r"path")
print("Current working directory is:", os.getcwd())

#load cleaned data
clean_data = pd.read_csv('cleaned_cervical_cancer.csv')

In [None]:
#Define target variable
target = 'Dx:Cancer'
X = clean_data.drop(columns=[target])
Y = clean_data[target]

In [None]:
#Split dataset into training and testing sets (80-20)
X_training, X_testing, Y_training, Y_testing = train_test_split(X, Y, test_size=0.2, random_state=42)

#Standardise features (important for SVM and Logistic Regression)
scaler = StandardScaler()
X_training_scaled = scaler.fit_transform(X_training)
X_testing_scaled = scaler.transform(X_testing)
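
With a rare positive class, a plain random split can leave the test set with very few positive cases. The sketch below uses synthetic labels (`X_demo` and `y_demo` are made up for illustration and are not the cervical cancer data) to show how the `stratify` argument of `train_test_split` keeps the class ratio the same in both splits. Whether this is appropriate for the notebook's own split is an assumption to check against the actual class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 90 negatives and 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 10% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate (train):", y_tr.mean())
print("Positive rate (test): ", y_te.mean())
```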

### Possible MODELS:
- Logistic Regression (linear classifier, interpretable)
- Random Forest (ensemble method, handles non-linearity well)
- Support Vector Machine (robust with high-dimensional data)
- XGBoost (gradient boosting, powerful for structured data)

In [None]:
#Define models
models = {
    "Logistic Regression": LogisticRegression(class_weight='balanced'),
    "Random Forest": RandomForestClassifier(class_weight='balanced', random_state=42),
    "SVM": SVC(probability=True, class_weight='balanced'),
    "XGBoost": XGBClassifier(eval_metric='logloss')
}
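
Three of the models above use `class_weight='balanced'`. As a rough illustration of what that setting does, the sketch below computes the balanced weights for a synthetic label vector (`y_demo` is hypothetical, not the notebook's target). scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the rarer class gets the larger weight. The negative-to-positive ratio printed at the end is the value commonly passed to XGBoost's `scale_pos_weight`, which is an option to consider because `XGBClassifier` does not take `class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, illustrative labels: 8 negatives and 2 positives
y_demo = np.array([0] * 8 + [1] * 2)

# Same formula that class_weight='balanced' applies inside the estimators
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
print("Balanced class weights:", dict(zip([0, 1], weights)))

# Ratio of negatives to positives, a common choice for scale_pos_weight
neg_pos_ratio = (y_demo == 0).sum() / (y_demo == 1).sum()
print("Negative/positive ratio:", neg_pos_ratio)
```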

### TRAIN & EVALUATE MODELS

In [None]:
#train + evaluate models
result = []
fitted_models = {}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_training_scaled, Y_training)
    fitted_models[name] = model

    Y_prediction = model.predict(X_testing_scaled)
    if hasattr(model, "predict_proba"):
        Y_probability = model.predict_proba(X_testing_scaled)[:, 1]
    else:
        Y_probability = None

    print(f"Results for {name}:")
    print(f"Accuracy: {accuracy_score(Y_testing, Y_prediction):.4f}")
    print(f"Precision: {precision_score(Y_testing, Y_prediction):.4f}")
    print(f"Recall: {recall_score(Y_testing, Y_prediction):.4f}")
    print(f"F1 Score: {f1_score(Y_testing, Y_prediction):.4f}")

    print(classification_report(Y_testing, Y_prediction))
    print("-----")

### Confusion Matrix
It shows how many True Positives, True Negatives, False Positives and False Negatives each model produces.
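
As a small illustration of how the four counts are read out of the matrix, the sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical). For binary labels 0/1, `confusion_matrix` puts true labels in rows and predicted labels in columns, so `.ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (1 = cancer positive)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```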

In [None]:
#CONFUSION MATRIX
from sklearn.metrics import confusion_matrix
for name, model in fitted_models.items():
    Y_predicted = model.predict(X_testing_scaled)
    confusion_mtx = confusion_matrix(Y_testing, Y_predicted)
    
    #plot as heatmap for visualisation
    plt.figure(figsize=(6, 4))
    sns.heatmap(
        confusion_mtx, 
        annot=True, 
        fmt='d', 
        cmap="Blues", 
        xticklabels=['Cancer negative', 'Cancer positive'], 
        yticklabels=['Cancer negative', 'Cancer positive']
    )
    plt.title(f"Confusion Matrix: {name}")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

### ROC Curve & AUC

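ROC Curve: shows the trade-off between the true positive rate (recall) and the false positive rate as the decision threshold varies

AUC (Area Under the Curve): summarises how well a model separates the two classes; 0.5 corresponds to random guessing and 1.0 to perfect separation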
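
To make the two definitions concrete, here is a minimal sketch on four made-up samples (`y_demo` and `scores_demo` are hypothetical, not model output from this notebook). `roc_curve` sweeps the threshold over the scores and returns the false and true positive rates at each step, and `auc` integrates that curve.

```python
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up labels and predicted probabilities for the positive class
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_demo, scores_demo)
roc_auc_demo = auc(fpr_demo, tpr_demo)

print(f"AUC = {roc_auc_demo:.2f}")  # AUC = 0.75

# roc_auc_score computes the same area directly from labels and scores
assert abs(roc_auc_demo - roc_auc_score(y_demo, scores_demo)) < 1e-12
```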

In [None]:
#ROC Curve
from sklearn.metrics import roc_curve, auc

for name, model in fitted_models.items():
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_testing_scaled)[:, 1]
        fpr, tpr, _ = roc_curve(Y_testing, y_prob)
        roc_auc = auc(fpr, tpr)

        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
        plt.plot([0, 1], [0, 1], 'k--')
        plt.title(f"ROC Curve: {name}")
        plt.xlabel("False Positive Rate")
        plt.ylabel("True Positive Rate")
        plt.legend()
        plt.grid(True)
        plt.show()

### Model Performances
Summary of all the evaluation metrics (Accuracy, Precision, Recall, F1 Score, AUC) in a DataFrame for easy comparison

In [None]:
from sklearn.metrics import roc_auc_score

metrics_summary = []

for name, model in fitted_models.items():
    y_pred = model.predict(X_testing_scaled)
    y_prob = model.predict_proba(X_testing_scaled)[:, 1] if hasattr(model, "predict_proba") else None

    metrics_summary.append({
        "Model": name,
        "Accuracy": accuracy_score(Y_testing, y_pred),
        "Precision": precision_score(Y_testing, y_pred),
        "Recall": recall_score(Y_testing, y_pred),
        "F1 Score": f1_score(Y_testing, y_pred),
        "AUC": roc_auc_score(Y_testing, y_prob) if y_prob is not None else None
    })

results_df = pd.DataFrame(metrics_summary)
results_df = results_df.sort_values(by="Recall", ascending=False)
print(results_df)


### Model Comparisons:

Key evaluation metrics:
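- **Recall (TP rate)** -- share of actual cancer cases that are detected; high recall means few false negatives
- **Precision** -- share of predicted cancer cases that are correct; high precision means few false positives
- **F1 Score** -- harmonic mean of Precision & Recall
- **AUC** -- overall ability to separate the classes across all thresholds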
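
The first three metrics follow directly from confusion-matrix counts. The sketch below works through the arithmetic with made-up counts (the numbers are hypothetical, chosen so that one third of the positive cases are missed) and shows why accuracy alone can look high while recall is modest on imbalanced data.

```python
# Made-up confusion-matrix counts for illustration only
tp, fp, fn, tn = 6, 2, 3, 89

precision = tp / (tp + fp)   # correct positives among predicted positives
recall = tp / (tp + fn)      # detected positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Precision: {precision:.4f}")  # 0.7500
print(f"Recall:    {recall:.4f}")     # 0.6667
print(f"F1 Score:  {f1:.4f}")         # 0.7059
print(f"Accuracy:  {accuracy:.4f}")   # 0.9500
```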


#### Observations:
- **Logistic Regression**: Perfect scores on all metrics on the test set
- **Random Forest**: High precision but lower recall (misses about one third of cancer cases)
- **SVM/XGBoost**: Trade-off between precision and recall; worth hyperparameter tuning
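
One possible way to approach the hyperparameter tuning mentioned above is a cross-validated grid search that optimises recall. This is only a sketch under stated assumptions: the data comes from `make_classification` rather than the cervical cancer file, and the parameter grid values are illustrative guesses, not tuned recommendations. Putting the scaler inside a `Pipeline` means it is refitted on each training fold, so no information from the validation fold leaks into scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Scaling happens inside each cross-validation fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(class_weight="balanced")),
])

# Illustrative grid; the values are assumptions, not recommendations
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=cv)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```

Scoring on recall matches the sort order used in the results table; `scoring="f1"` would be an alternative if false positives also matter.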

##### Conclusion: Logistic Regression appears most promising
