<a href="https://colab.research.google.com/github/coolkat2000/Classification-Model-CS5260/blob/main/thyroidpredictionassingment3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Thyroid Cancer Prediction

This script loads patient data from an Excel file, preprocesses it,
splits the data into training and validation sets (75/25 split),
and trains a Random Forest classifier to predict thyroid cancer.

### Why Random Forest Classifier?
The Random Forest classifier was chosen for the following reasons:
- **Robustness & Accuracy**: Random Forest is an ensemble method that reduces overfitting and improves predictive accuracy.
- **Works With Mixed Data Types**: Once categorical columns are encoded as numbers, it can use the categorical and numerical features in the dataset together.
- **Feature Importance Evaluation**: It provides insights into which features contribute most to the prediction.
- **Works Well With Small to Medium-Sized Datasets**: Given the dataset size, Random Forest is computationally efficient.
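- **Tolerates Noisy Data**: Because it averages many decision trees, individual noisy values have less influence on the final prediction. Missing values should still be checked before training, since support for them depends on the scikit-learn version.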

Author: Micky Simwenyi

## Loading All the Required Libraries

In [None]:
import pandas as pd
import numpy as np
from google.colab import files
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Upload the file manually in Google Colab
The cell below prompts you to upload the dataset to Colab for processing and model building.


In [None]:
print("Please upload the thyroid_cancer_risk_data.xlsx file.")
uploaded = files.upload()
file_path = list(uploaded.keys())[0]

Please upload the thyroid_cancer_risk_data.xlsx file.


Saving thyroid_cancer_risk_data.xlsx to thyroid_cancer_risk_data (1).xlsx


The function below loads the dataset from the uploaded Excel file.

In [None]:
def load_data(file_path):
    return pd.read_excel(file_path, sheet_name="thyroid_cancer_risk_data")

The function below preprocesses the dataset by encoding categorical variables and normalizing numerical features:
    
*   Drops the `Patient_ID` column as it is non-informative.
*   Encodes categorical features using Label Encoding.
*   Encodes the target variable `Diagnosis`.
*   Scales numerical features using StandardScaler.

In [None]:
def preprocess_data(df):
    df = df.drop(columns=["Patient_ID"])  # Remove non-informative column

    # Encode categorical variables
    categorical_cols = ["Gender", "Country", "Ethnicity", "Family_History", "Radiation_Exposure",
                        "Iodine_Deficiency", "Smoking", "Obesity", "Diabetes", "Thyroid_Cancer_Risk"]

    label_encoders = {}
    for col in categorical_cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        label_encoders[col] = le

    # Encode target variable
    target_encoder = LabelEncoder()
    df["Diagnosis"] = target_encoder.fit_transform(df["Diagnosis"])

    # Normalize numerical features
    numerical_cols = ["Age", "TSH_Level", "T3_Level", "T4_Level", "Nodule_Size"]
    scaler = StandardScaler()
    df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

    return df, label_encoders
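
`preprocess_data` fits the encoders and the scaler on the full dataset before the split, so statistics from the validation rows influence the transformation. A minimal sketch of one alternative, assuming scikit-learn's `ColumnTransformer` and `Pipeline`, is shown here. It swaps `LabelEncoder` for `OrdinalEncoder` (the encoder scikit-learn provides for feature columns) and fits everything on the training rows only. The toy frame and its values are made up for illustration and reuse only a few of the notebook's column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Tiny made-up frame (hypothetical values) reusing a few of the notebook's column names.
toy = pd.DataFrame({
    "Gender": ["Male", "Female"] * 10,
    "Smoking": ["No", "No", "Yes", "No"] * 5,
    "Age": list(range(30, 50)),
    "TSH_Level": [0.5 + 0.1 * i for i in range(20)],
    "Diagnosis": ["Benign", "Benign", "Benign", "Malignant"] * 5,
})

X = toy.drop(columns=["Diagnosis"])
y = (toy["Diagnosis"] == "Malignant").astype(int)  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Encode categorical columns and scale numerical ones inside the pipeline,
# so both transformers are fitted on the training rows only.
preprocess = ColumnTransformer([
    ("cat",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["Gender", "Smoking"]),
    ("num", StandardScaler(), ["Age", "TSH_Level"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)
```

Because the transformers live inside the pipeline, calling `pipeline.predict` on new rows applies the same encoding and scaling that were learned from the training set.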

The function below splits the dataset into training (75%) and validation (25%) sets, stratified on the target so both sets keep the same class balance.

In [None]:
def split_data(df):
    X = df.drop(columns=["Diagnosis"])
    y = df["Diagnosis"]
    return train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
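
To illustrate what `stratify=y` does in `split_data`, here is a small sketch on made-up labels with roughly the class balance seen in this notebook's validation set (about 23% positives). The arrays are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 230 positives and 770 negatives (hypothetical data).
y = np.array([1] * 230 + [0] * 770)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# With stratification, the positive rate is nearly identical in every subset.
print(f"overall: {y.mean():.3f}, train: {y_train.mean():.3f}, test: {y_test.mean():.3f}")
```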

The cell below trains a Random Forest classifier.

The Random Forest algorithm is an ensemble learning method that combines multiple decision trees, each trained on a bootstrap sample of the data. Averaging their votes reduces overfitting relative to a single tree and usually improves accuracy. The model is trained with 100 estimators (trees).

In [None]:
def train_model(X_train, y_train):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model
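
Once a forest is trained, the feature importances mentioned earlier can be read from the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data in which only the first column determines the label; the feature names are hypothetical and not taken from the thyroid dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column carries signal (hypothetical example).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
feature_names = ["signal", "noise_1", "noise_2", "noise_3"]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances are normalized to sum to 1; sort them descending.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

With the notebook's own model, the same loop could be run over `X_train.columns` and `model.feature_importances_`.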

The function below evaluates the model on the validation set and prints:

*   Model Accuracy
*   Confusion Matrix
*   Classification Report

In [None]:
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print("Confusion Matrix:")
    print(conf_matrix)
    print("Classification Report:")
    print(class_report)

The main function below runs the entire pipeline in Google Colab.


In [None]:
def main():
    """Main function to run the entire pipeline in Google Colab.

    Steps:
    1. Load the dataset from the uploaded file (`file_path` is set by the upload cell).
    2. Preprocess the data.
    3. Split into training and validation sets.
    4. Train the Random Forest classifier.
    5. Evaluate the trained model.
    """
    df = load_data(file_path)
    df, label_encoders = preprocess_data(df)
    X_train, X_test, y_train, y_test = split_data(df)
    model = train_model(X_train, y_train)
    evaluate_model(model, X_test, y_test)

if __name__ == "__main__":
    main()

Accuracy: 0.83
Confusion Matrix:
[[38515  2284]
 [ 6879  5495]]
Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.94      0.89     40799
           1       0.71      0.44      0.55     12374

    accuracy                           0.83     53173
   macro avg       0.78      0.69      0.72     53173
weighted avg       0.82      0.83      0.81     53173



### Analysis

**Thyroid Cancer Prediction Model Performance Report**

**1. Introduction**
This report evaluates the performance of a **Random Forest Classifier** trained to predict thyroid cancer based on patient data. The evaluation metrics include accuracy, precision, recall, F1-score, and confusion matrix analysis.

---

**2. Model Performance Overview**

- **Accuracy**: **83%**
  - The model correctly classified **83%** of the total instances in the dataset.

- **Confusion Matrix**:
  |                  | Predicted Benign (0) | Predicted Malignant (1) |
  |------------------|----------------------|--------------------------|
  | **Actual Benign (0)**  | 38,515                 | 2,284                    |
  | **Actual Malignant (1)** | 6,879                  | 5,495                    |
  
  - **True Negatives (TN)**: **38,515** (Correct benign predictions)
  - **False Positives (FP)**: **2,284** (Benign cases misclassified as malignant)
  - **False Negatives (FN)**: **6,879** (Malignant cases misclassified as benign)
  - **True Positives (TP)**: **5,495** (Correct malignant predictions)

---

**3. Precision, Recall, and F1-Score**

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **Benign (0)** | **0.85**  | **0.94** | **0.89** | **40,799** |
| **Malignant (1)** | **0.71**  | **0.44** | **0.55** | **12,374** |

- **Precision**: Of the cases predicted as a given class, the proportion that truly belong to it.
  - **Benign (0) Precision = 85%** → 85% of predicted benign cases were truly benign.
  - **Malignant (1) Precision = 71%** → 71% of predicted malignant cases were truly malignant.

- **Recall**: The proportion of actual cases correctly identified.
  - **Benign (0) Recall = 94%** → The model detected **94%** of actual benign cases.
  - **Malignant (1) Recall = 44%** → The model detected **only 44%** of malignant cases.

- **F1-Score**: Harmonic mean of precision and recall.
  - **Benign (0) F1-Score = 0.89** (Strong balance between precision and recall)
  - **Malignant (1) F1-Score = 0.55** (Lower performance due to low recall)
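
The per-class figures above follow directly from the four confusion-matrix counts. The sketch below recomputes them in plain Python. It also computes the accuracy of always predicting benign, a comparison the report does not make, to put the 83% figure in context.

```python
# Counts from the confusion matrix reported above.
tn, fp, fn, tp = 38515, 2284, 6879, 5495
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Malignant (1) is the positive class.
precision_malignant = tp / (tp + fp)
recall_malignant = tp / (tp + fn)
f1_malignant = (2 * precision_malignant * recall_malignant
                / (precision_malignant + recall_malignant))

# For the benign class, a correct benign prediction plays the role of a "true positive".
precision_benign = tn / (tn + fn)
recall_benign = tn / (tn + fp)
f1_benign = (2 * precision_benign * recall_benign
             / (precision_benign + recall_benign))

# Extra context (not in the original report): accuracy of always predicting benign.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.2f}, baseline={baseline_accuracy:.2f}")
print(f"benign:    P={precision_benign:.2f} R={recall_benign:.2f} F1={f1_benign:.2f}")
print(f"malignant: P={precision_malignant:.2f} R={recall_malignant:.2f} F1={f1_malignant:.2f}")
```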

---

**4. Overall Model Evaluation**

| Metric          | Precision | Recall | F1-Score |
|----------------|-----------|--------|----------|
| **Macro Avg**  | **0.78**  | **0.69** | **0.72** |
| **Weighted Avg** | **0.82**  | **0.83** | **0.81** |

- **Macro Average**: The unweighted mean of precision, recall, and F1-score across both classes, so each class counts equally.
- **Weighted Average**: The mean weighted by each class's support, so the larger benign class contributes more.
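
The two averaging schemes can be reproduced from the confusion-matrix counts. The helper `prf` below is a hypothetical name introduced for this sketch and is not part of the notebook.

```python
# Counts from the confusion matrix reported earlier.
tn, fp, fn, tp = 38515, 2284, 6879, 5495

def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For the benign class the roles flip: a benign hit is its "true positive".
benign = prf(tn, fn, fp)
malignant = prf(tp, fp, fn)
support = {"benign": tn + fp, "malignant": tp + fn}
total = sum(support.values())

# Macro: plain mean over classes. Weighted: mean weighted by class support.
macro = [(b + m) / 2 for b, m in zip(benign, malignant)]
weighted = [
    (b * support["benign"] + m * support["malignant"]) / total
    for b, m in zip(benign, malignant)
]

print("macro    (P, R, F1):", [round(v, 2) for v in macro])
print("weighted (P, R, F1):", [round(v, 2) for v in weighted])
```

The weighted recall equals the overall accuracy, which is why both appear as 0.83 in the report.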

---

**5. Key Insights & Recommendations**

**Strengths:**
- High **accuracy (83%)** and strong **performance for benign cases (94% recall)**.
- Model achieves a **weighted-average precision of 0.82**, though this figure is dominated by the larger benign class.

**Weaknesses:**
- **Low recall (44%) for malignant cases**, meaning **many cancerous cases are being missed**.
- **False Negatives (6,879)** could lead to **missed diagnoses**, posing a **serious medical risk**.

**Potential Improvements:**
1. **Balance the Dataset**:
   - Apply **SMOTE (Synthetic Minority Over-sampling Technique)** to improve recall for malignant cases.
   - Consider **undersampling benign cases** to reduce class imbalance.

2. **Use Alternative Metrics**:
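   - **ROC-AUC score** to evaluate how well the model ranks malignant above benign cases, independent of any single decision threshold.
   - **Precision-Recall curves** to assess the trade-off between precision and sensitivity (recall) for the malignant class.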

3. **Hyperparameter Tuning**:
   - Adjust the **number of trees, max depth, and feature selection** for improved classification.
   - Test **Gradient Boosting** or **XGBoost**, which may cope better with imbalanced data, particularly when combined with class weighting.

4. **Use Different Model Architectures**:
   - Try **Logistic Regression** for interpretability.
   - Use **Neural Networks** for enhanced feature learning.
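
As a rough illustration of the alternative metrics in item 2, the sketch below scores a Random Forest by its predicted probabilities rather than its hard labels. It runs on synthetic data from `make_classification` (an assumption for the example, not the thyroid dataset) and also shows how lowering the decision threshold trades precision for recall. The 0.3 cut-off is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 77% negatives and 23% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.77, 0.23], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Probability of the positive class for each validation row.
proba = model.predict_proba(X_test)[:, 1]

# Threshold-independent summaries of ranking quality.
roc_auc = roc_auc_score(y_test, proba)
avg_precision = average_precision_score(y_test, proba)

# Recall at a 0.5 cut-off versus a lower, arbitrary 0.3 cut-off.
recall_default = recall_score(y_test, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_test, (proba >= 0.3).astype(int))

print(f"ROC-AUC: {roc_auc:.3f}, average precision: {avg_precision:.3f}")
print(f"recall at 0.5: {recall_default:.3f}, recall at 0.3: {recall_lowered:.3f}")
```

Lowering the threshold can only keep or raise recall, but it typically costs precision, so any cut-off would need to be chosen against the clinical cost of each error type.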

---

**6. Conclusion**
The Random Forest model demonstrated strong predictive capabilities for benign cases, achieving a high recall of 94%. However, its ability to detect malignant cases remains limited, with a recall of only 44%, meaning a substantial number of actual thyroid cancer cases are being misclassified as benign.

This high number of false negatives (6,879 cases) represents a significant clinical risk, as undiagnosed thyroid cancer could lead to delayed treatments and worsening patient outcomes. Therefore, future iterations of the model should focus on enhancing sensitivity towards malignant cases, ensuring that high-risk patients receive timely and accurate diagnoses.

To address these concerns, balancing the dataset using oversampling techniques like SMOTE, undersampling the majority class, or using cost-sensitive learning should be explored. Additionally, fine-tuning hyperparameters and evaluating alternative algorithms such as Gradient Boosting, XGBoost, or Neural Networks may improve recall without significantly compromising precision.
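
One way to try the cost-sensitive learning mentioned above is scikit-learn's built-in `class_weight="balanced"` option, sketched below on synthetic data. SMOTE lives in the separate imbalanced-learn package and is not shown here. The function name `train_weighted_model` is hypothetical, and whether this improves malignant recall on the thyroid data would need to be measured.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

def train_weighted_model(X_train, y_train):
    """Variant of train_model that reweights classes inversely to their frequency."""
    model = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    )
    model.fit(X_train, y_train)
    return model

# Synthetic imbalanced data standing in for the real dataset (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.77, 0.23], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

weighted_model = train_weighted_model(X_train, y_train)
minority_recall = recall_score(y_test, weighted_model.predict(X_test))
print(f"minority-class recall with class weighting: {minority_recall:.3f}")
```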

By implementing these enhancements, the model can become a more reliable tool for thyroid cancer screening, aiding in early detection and improving patient outcomes.

