# COGS 188 - Project Proposal for Group 21

# Predictive Modeling for Breast Cancer Recurrence Using Machine Learning

## Group members

- Alex Park
- Sandra Lin
- Derrick Lin

# Abstract 

Breast cancer is a critical health issue worldwide, necessitating robust predictive models to assist in early diagnosis and treatment planning. This project aims to develop a machine learning model to predict breast cancer recurrence using a dataset from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. By employing algorithms such as logistic regression, support vector machines, and random forests, this study seeks to identify significant predictors of recurrence and improve the accuracy of predictions. The expected outcome is a reliable predictive model that can assist healthcare providers in making informed decisions and improving patient outcomes.

# Background

Breast cancer is one of the most prevalent cancers among women worldwide. Early detection and prompt treatment are crucial for successful outcomes. However, traditional diagnostic methods relying on manual interpretation of histopathological images by healthcare providers are prone to subjectivity and variability. These limitations highlight the demand for more objective and efficient diagnostic approaches. 

Several studies have explored the use of machine learning and artificial intelligence techniques to improve the accuracy and efficiency of cancer diagnosis and prognosis. For example, Kim et al. (2019) used deep learning to predict survival in oral cancer patients <a name="Kim, Dong Wook"></a>[<sup>[2]</sup>](#Kim). Similarly, Esteva et al. (2017) trained deep neural networks that classify skin cancer from images at a dermatologist level <a name="Esteva, Andre"></a>[<sup>[3]</sup>](#Esteva). These studies demonstrate the potential of machine learning in cancer care and motivate applying similar methods to breast cancer.

Furthermore, recent studies have demonstrated the value of feature selection techniques in optimizing machine learning models on medical data. Ebtisam et al. (2018) analyzed particle swarm optimization for feature selection on medical data <a name="Ebtisam, Hamed"></a>[<sup>[4]</sup>](#Ebtisam). Additionally, Amazona et al. (2019) used Support Vector Machine Recursive Feature Elimination (SVM-RFE) to select microRNA expression features of breast cancer <a name="Amazona, Adorada"></a>[<sup>[5]</sup>](#Amazona).

Despite these achievements, there is still a need for comprehensive research on predicting breast cancer recurrence from routinely collected clinical data. Through meticulous data preprocessing, feature engineering, and model evaluation, this research seeks to advance the objectivity, efficiency, and accuracy of breast cancer prognosis, ultimately improving patient outcomes.

# Problem Statement

The problem addressed in this study is predicting breast cancer recurrence from clinical attributes using machine learning. To address this challenge, we plan to use algorithms such as Support Vector Machines, Random Forests, and Neural Networks to analyze the breast cancer dataset from the University Medical Centre, Institute of Oncology, Ljubljana, which includes attributes such as age, tumor size, and degree of malignancy. Our goal is to develop predictive models that can help medical professionals identify patients at higher risk of recurrence. The problem is quantifiable: the target is a binary label (recurrence or no recurrence), so model performance can be measured with accuracy, precision, recall, F1 score, and ROC-AUC score. Through quantitative assessment and comparison with established benchmarks, our study aims to deliver a reliable and reproducible approach to recurrence prediction.

# Data

For this project, we will utilize the breast cancer dataset obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. This dataset provides comprehensive clinical data, which is crucial for developing predictive models for breast cancer recurrence.

### Dataset Characteristics
- **Source**: [Datasets: Breast Cancer (GitHub) - University Medical Centre, Institute of Oncology](https://github.com/datasets/breast-cancer/tree/master)
- **Number of Instances**: 272
- **Number of Features**: 10, including the class attribute
- **Missing Values**: Yes, specifically in the 'falsede-caps' and 'breast-quad' features

### Description of Data
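The dataset includes the following attributes. Column names and values are shown as they appear in the CSV; `mefalsepause`, `inv-falsedes`, `falsede-caps`, `premefalse`, and `false-recurrence-events` appear to be corrupted forms of `menopause`, `inv-nodes`, `node-caps`, `premeno`, and `no-recurrence-events`.

1. **age**: Age group of the patient (e.g., 40-49, 50-59)
2. **mefalsepause**: Menopause status (e.g., premefalse, ge40)
3. **tumor-size**: Tumor size as a binned range (e.g., 15-19, 35-39)
4. **inv-falsedes**: Number of involved lymph nodes as a binned range (e.g., 0-2, 3-5)
5. **falsede-caps**: Boolean (True/False), contains missing values
6. **deg-malig**: Degree of malignancy, integer (1-3)
7. **breast**: Affected breast (left or right)
8. **breast-quad**: Breast quadrant (e.g., left_up, left_low, right_up, central), contains missing values
9. **irradiat**: Whether the patient received radiation therapy, Boolean (True/False)
10. **class**: Recurrence status (recurrence-events or false-recurrence-events)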
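
Several of these attributes are stored as range strings rather than numbers. As a rough illustration of one way to make them numeric, the sketch below maps each range to its midpoint. The helper `range_midpoint` and the toy rows are hypothetical and not part of the original notebook, and the sketch assumes every value follows the `low-high` pattern seen in the preview.

```python
import pandas as pd


def range_midpoint(value: str) -> float:
    """Hypothetical helper: map a binned range such as '15-19' to its midpoint."""
    low, high = value.split("-")
    return (int(low) + int(high)) / 2


# Toy rows in the same format as the dataset preview (not the real data).
sample = pd.DataFrame({
    "age": ["40-49", "50-59"],
    "tumor-size": ["15-19", "35-39"],
    "inv-falsedes": ["0-2", "3-5"],
})

# Apply the helper to every cell of every range-valued column.
midpoints = sample.apply(lambda col: col.map(range_midpoint))
print(midpoints)
```

Midpoints preserve the ordering of the bins, which a plain label encoding of the strings does not guarantee.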

### Critical Variables
- **falsede-caps**: This feature contains missing values and will require careful imputation. Its presence or absence is clinically significant in diagnosing cancer.
- **deg-malig**: The degree of malignancy is a critical variable, indicating the severity of cancerous formations.

### Data Handling

The dataset contains information about breast cancer patients, including various clinical features. To prepare the data for analysis, the following steps were taken:

1. **Loading and Initial Analysis**:
   - The dataset was loaded and its structure was examined using summary statistics and data types.

2. **Handling Missing Values**:
   - Missing values in numerical columns were imputed using the mean of each column. This ensures that no data is lost due to missing values.
   - For categorical columns, the mode of each column can be used to fill missing values (if applicable).

3. **Standardization**:
   - Numerical features were standardized to have a mean of 0 and a standard deviation of 1. This step is crucial for machine learning algorithms that are sensitive to the scale of data.

4. **Saving Cleaned Data**:
   - The cleaned dataset was saved to a new CSV file for further analysis.

This dataset, with its comprehensive feature set and real-world relevance, provides a solid foundation for developing and evaluating machine learning models to predict breast cancer recurrence. By ensuring the data is clean and well-prepared, we can improve the accuracy and reliability of our predictive models.
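
The handling steps above could also be bundled into a single scikit-learn preprocessing object. The following is a minimal sketch under stated assumptions, not the notebook's actual code: it uses a toy frame with three of the dataset's columns, and it swaps in one-hot encoding for the categorical columns, which is an assumption on our part.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with missing values (not the real data).
toy = pd.DataFrame({
    "deg-malig": [3, 1, 2, np.nan],
    "breast": ["right", "left", "left", "right"],
    "breast-quad": ["left_up", np.nan, "left_low", "left_low"],
})

numeric_cols = ["deg-malig"]
categorical_cols = ["breast", "breast-quad"]

# Numeric columns: mean imputation, then standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: mode imputation, then one-hot encoding (assumed choice).
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0,
)

features = preprocess.fit_transform(toy)
print(features.shape)
```

Because the imputer and scaler live inside one object, they can be fitted on the training split only and then applied unchanged to the test split.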


In [4]:
import pandas as pd

# Load the dataset
data = pd.read_csv('breast-cancer.csv')

# Display the first few rows of the dataset to understand its structure
print(data.head())

# Display summary statistics of the dataset
print(data.describe())

# Check for missing values
print("Missing values in each column:")
print(data.isnull().sum())

# Check the data types of each column
print("Data types of each column:")
print(data.dtypes)


     age mefalsepause tumor-size inv-falsedes falsede-caps  deg-malig breast  \
0  40-49   premefalse      15-19          0-2         True          3  right   
1  50-59         ge40      15-19          0-2        False          1  right   
2  50-59         ge40      35-39          0-2        False          2   left   
3  40-49   premefalse      35-39          0-2         True          3  right   
4  40-49   premefalse      30-34          3-5         True          2   left   

  breast-quad  irradiat                    class  
0     left_up     False        recurrence-events  
1     central     False  false-recurrence-events  
2    left_low     False        recurrence-events  
3    left_low      True  false-recurrence-events  
4    right_up     False        recurrence-events  
        deg-malig
count  272.000000
mean     2.058824
std      0.736649
min      1.000000
25%      2.000000
50%      2.000000
75%      3.000000
max      3.000000
Missing values in each column:
age             0
me

In [6]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = pd.read_csv('breast-cancer.csv')

# Display the first few rows of the dataset to understand its structure
print(data.head())

# Display summary statistics of the dataset
print(data.describe())

# Check for missing values
print("Missing values in each column:")
print(data.isnull().sum())

# Check the data types of each column
print("Data types of each column:")
print(data.dtypes)

# Impute missing values for numerical columns
imputer = SimpleImputer(strategy='mean')
numeric_features = data.select_dtypes(include=[np.number]).columns
data[numeric_features] = imputer.fit_transform(data[numeric_features])

# Impute missing values for categorical columns with each column's mode
categorical_features = data.select_dtypes(include=['object']).columns
data[categorical_features] = data[categorical_features].apply(lambda x: x.fillna(x.mode()[0]))

# Confirm that there are no more missing values
print("Missing values after imputation:")
print(data.isnull().sum())

# Standardize the numerical features
scaler = StandardScaler()
data[numeric_features] = scaler.fit_transform(data[numeric_features])

# Display the first few rows of the cleaned dataset
print(data.head())

# Save the cleaned dataset to a new CSV file (optional)
data.to_csv('cleaned_breast_cancer_data.csv', index=False)


     age mefalsepause tumor-size inv-falsedes falsede-caps  deg-malig breast  \
0  40-49   premefalse      15-19          0-2         True          3  right   
1  50-59         ge40      15-19          0-2        False          1  right   
2  50-59         ge40      35-39          0-2        False          2   left   
3  40-49   premefalse      35-39          0-2         True          3  right   
4  40-49   premefalse      30-34          3-5         True          2   left   

  breast-quad  irradiat                    class  
0     left_up     False        recurrence-events  
1     central     False  false-recurrence-events  
2    left_low     False        recurrence-events  
3    left_low      True  false-recurrence-events  
4    right_up     False        recurrence-events  
        deg-malig
count  272.000000
mean     2.058824
std      0.736649
min      1.000000
25%      2.000000
50%      2.000000
75%      3.000000
max      3.000000
Missing values in each column:
age             0
me

# Proposed Solution

Breast cancer is one of the most prevalent cancers among women worldwide. Early detection and accurate diagnosis are crucial for effective treatment and management. The dataset chosen for this project contains features such as age, menopause status, tumor size, and other relevant attributes. The task is to predict whether breast cancer will recur based on these features.

## Proposed Algorithm: Support Vector Machine (SVM)

Support Vector Machines (SVMs) are powerful supervised learning models that are particularly effective for classification tasks. The relevance of using SVM for predicting breast cancer recurrence can be justified through the following points:
- **High Dimensionality Handling:** SVMs are effective in high-dimensional spaces, which is beneficial given the multiple features present in the breast cancer dataset. These features include age, tumor size, degree of malignancy, and other characteristics crucial for distinguishing between recurrence and no-recurrence events.


- **Margin Maximization:** SVMs aim to find the hyperplane that maximizes the margin between the two classes. This characteristic is particularly useful in medical diagnosis tasks where it’s important to have a clear boundary to minimize false positives and false negatives, thereby reducing the chances of misdiagnosis.


- **Kernel Trick:** SVMs can employ the kernel trick to handle non-linear relationships between features, making it adaptable to the complexity of biological data. For instance, the Radial Basis Function (RBF) kernel can map the data into a higher-dimensional space where a linear separation is possible.


- **Robustness to Overfitting:** By adjusting the regularization parameter, SVMs can balance the trade-off between achieving a low error on training data and avoiding overfitting, ensuring better generalization on unseen data.
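
To make the kernel trick more concrete, the sketch below computes the RBF kernel value, exp(-gamma * ||x - z||^2), by hand and compares it with scikit-learn's `rbf_kernel`. The three points and the `gamma` value are arbitrary illustrations, not values from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf(x, z, gamma):
    """RBF kernel between two vectors: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


# Three arbitrary 2-D points (illustrative only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
gamma = 0.1

# Pairwise kernel matrix computed by hand, then by the library.
manual = np.array([[rbf(a, b, gamma) for b in X] for a in X])
library = rbf_kernel(X, gamma=gamma)

print(np.round(manual, 4))
print("Matches scikit-learn:", np.allclose(manual, library))
```

Each entry acts as a similarity score between two patients: it equals 1 for identical points and decays toward 0 as they move apart, with `gamma` controlling how quickly.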

## Implementation Details

1. **Data Preprocessing:**

   - **Normalization:** Since SVMs are sensitive to the scale of data, features will be normalized to ensure each feature contributes equally to the distance calculations.

   - **Train-Test Split:** The dataset will be split into training and test sets to evaluate the model’s performance on unseen data.


2. **Model Training:**

   - **Kernel Selection:** We will experiment with linear, polynomial, and RBF kernels to determine the best fit for the data.

   - **Hyperparameter Tuning:** Using techniques such as grid search with cross-validation, we will tune hyperparameters like the regularization parameter (C) and kernel-specific parameters (gamma for RBF).


3. **Evaluation Metrics:**

   - **Accuracy:** The proportion of correctly classified instances.

   - **Precision and Recall:** Precision measures the proportion of true positives among the instances classified as positive, while recall measures the proportion of true positives among all actual positives. These metrics are critical in medical diagnosis to ensure that recurrence cases are correctly identified.

   - **F1-Score:** The harmonic mean of precision and recall, providing a single metric that balances the two.

   - **ROC-AUC:** The Area Under the Receiver Operating Characteristic Curve, which provides a measure of the model’s ability to discriminate between classes.


4. **Model Interpretation:**

   - **Support Vectors:** Analyzing the support vectors can provide insights into which data points are most critical in defining the decision boundary.

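   - **Feature Importance:** When a linear kernel is used, the coefficients of the SVM model indicate which features are most influential in the classification decision. For non-linear kernels such as polynomial or RBF, coefficients are not available in the input feature space, so a model-agnostic method such as permutation importance would be needed.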

By using SVM, we aim to develop a robust and interpretable model that can effectively predict breast cancer recurrence, aiding in early and accurate diagnosis.
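
As a hedged illustration of the model interpretation step, the sketch below fits a linear-kernel SVM on synthetic data and ranks features by the absolute value of their coefficients. The data come from `make_classification` and the feature names are placeholders, so the ranking says nothing about the breast cancer dataset itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the breast cancer dataset).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Scale features so coefficient magnitudes are comparable.
X_scaled = StandardScaler().fit_transform(X)
model = SVC(kernel="linear", C=1.0).fit(X_scaled, y)

# coef_ is only defined for the linear kernel; shape (1, n_features) for two classes.
weights = model.coef_.ravel()
ranking = sorted(zip(feature_names, np.abs(weights)), key=lambda p: p[1], reverse=True)

for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
print("Support vectors per class:", model.n_support_)
```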

# Evaluation Metrics

1. **Accuracy:** The proportion of correctly classified instances.

2. **Precision and Recall:** Precision measures the proportion of true positives among the instances classified as positive, while recall measures the proportion of true positives among all actual positives. These metrics are critical in medical diagnosis to ensure that recurrence cases are correctly identified.

3. **F1-Score:** The harmonic mean of precision and recall, providing a single metric that balances the two.

4. **ROC-AUC:** The Area Under the Receiver Operating Characteristic Curve, which provides a measure of the model’s ability to discriminate between classes.
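
The metrics above can be computed with `sklearn.metrics`. The sketch below uses eight made-up labels and scores (1 = recurrence, 0 = no recurrence), not results from our model, and assumes a 0.5 decision threshold.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy ground truth and hypothetical model scores (not real results).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

# Hard predictions from the scores, using an assumed 0.5 threshold.
y_pred = [int(score >= 0.5) for score in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # uses scores, not hard labels
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On this toy data there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so accuracy, precision, recall, and F1 all equal 0.75, while ROC-AUC is 0.9375 because 15 of the 16 positive-negative pairs are ranked correctly.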

# Results

This section presents the results of our analysis and model training for predicting breast cancer recurrence using Support Vector Machines (SVM). The results are organized into several subsections, each focusing on different aspects of model evaluation.

### Model Training and Hyperparameter Tuning

We trained the SVM model using the preprocessed dataset and performed hyperparameter tuning to find the best model parameters.


In [11]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load the dataset
df = pd.read_csv("cleaned_breast_cancer_data.csv")

# Encode categorical variables
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le

# Handle missing values
df.fillna(df.median(), inplace=True)

# Define features and target
X = df.drop(columns=['class'])
y = df['class']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the SVM model and grid search parameters
svm = SVC()
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(svm, param_grid, refit=True, verbose=2, cv=5)
grid_search.fit(X_train, y_train)

# Print the best parameters and the corresponding cross-validation score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best CV Score: {grid_search.best_score_:.3f}')


Fitting 5 folds for each of 64 candidates, totalling 320 fits
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.0s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.0s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.0s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.0s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=

**Response:** The grid search for hyperparameter tuning completed successfully. The best parameters identified for the Support Vector Machine (SVM) model are `C: 1`, `gamma: 0.1`, and `kernel: 'poly'`. These parameters were selected by 5-fold cross-validation over 64 candidate combinations, so they are the best-scoring configuration within the searched grid rather than a guaranteed global optimum.
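
One possible refinement, offered as a sketch rather than as what was run: `gamma` has no effect on the linear kernel, so a single dictionary grid repeats each linear fit once per `gamma` value. `GridSearchCV` also accepts a list of dictionaries, which lets the search skip those redundant candidates. The names `full_grid` and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100]
gamma_values = [1, 0.1, 0.01, 0.001]

# Grid as a single dictionary: every kernel is crossed with every gamma.
full_grid = {
    "C": C_values,
    "gamma": gamma_values,
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
}

# Grid as a list of dictionaries: gamma is searched only where it matters.
split_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["poly", "rbf", "sigmoid"], "C": C_values, "gamma": gamma_values},
]

print(len(ParameterGrid(full_grid)))   # 64
print(len(ParameterGrid(split_grid)))  # 52
```

Passing `split_grid` as `param_grid` would evaluate 52 candidates instead of 64 while covering the same distinct models.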

# Discussion

### Limitations

1. ***Data Limitations***: The dataset of breast cancer, sourced from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia, may not represent the global population. The dataset consists of a limited number of instances and may lack diversity in terms of patient demographics and clinical characteristics. This limitation could impact the generalizability of the predictive model.


2. ***Feature Selection and Hyperparameters***: While various algorithms such as logistic regression, support vector machines, and random forests were employed, the scope of hyperparameter tuning was limited due to time constraints. Hyperparameter tuning is critical for optimizing model performance, and exploring a wider range of hyperparameters could potentially improve the accuracy and reliability of the predictions.


3. ***External Validation***: The models were validated mainly with internal cross-validation. External validation on independent datasets from different institutions would provide a more rigorous assessment of the model’s generalizability and performance in real-world clinical settings. Without it, we cannot confirm that the model will perform well on unseen data from different sources, which is essential for clinical applicability.

### Ethics & Privacy

1. ***Data Privacy:*** Since we are using sensitive medical data, it is imperative to ensure the privacy and confidentiality of patient information. We must adhere to all relevant data protection regulations, such as HIPAA (Health Insurance Portability and Accountability Act) in the United States, and similar regulations in other jurisdictions. The dataset sourced from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia, is expected to be anonymized to protect patient identities. Proper consent for data usage and sharing must be obtained, and we must comply with the data protection policies specified by the data repository. According to the dataset's terms of use on GitHub, users can browse and download datasets without providing personal information, ensuring compliance with privacy standards.


2. ***Bias and Fairness:*** We acknowledge the potential for bias in healthcare data, influenced by factors such as demographics, socioeconomic status, and access to healthcare. Our team will rigorously evaluate our machine learning model for any biases present in the dataset or introduced by the algorithms used. We will implement strategies to mitigate bias and ensure fairness in our predictions, striving to avoid favoring or disadvantaging any specific demographic group or individual. Additionally, we will actively monitor model performance across different subgroups to identify and address any disparities in predictive accuracy. By doing so, we aim to develop a model that provides equitable healthcare insights for all patient groups.


3. ***Informed Consent:*** Given that the Breast Cancer dataset sourced from GitHub is publicly available and anonymized, explicit consent from individual contributors may not be required. However, we remain committed to upholding ethical standards by ensuring transparency regarding the dataset's origin and usage within our project. We will provide clear communication about the dataset's purpose and how it will be utilized for research purposes. This involves making the objectives and potential impacts of the research accessible and understandable to the public, thereby maintaining trust and accountability in our research practices.
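
The subgroup monitoring described under Bias and Fairness could be sketched as follows. The rows, the choice of `age` as the grouping column, and the use of recall as the per-group metric are all illustrative assumptions; no real predictions are shown.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Toy predictions grouped by age band (made-up values, not model output).
results = pd.DataFrame({
    "age": ["40-49", "40-49", "40-49", "50-59", "50-59", "50-59"],
    "y_true": [1, 1, 0, 1, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 1],
})

# Recall per subgroup: the share of actual recurrences that the model catches.
recall_by_group = {
    group: recall_score(rows["y_true"], rows["y_pred"])
    for group, rows in results.groupby("age")
}

for group, value in recall_by_group.items():
    print(f"{group}: recall = {value:.2f}")
```

In this toy example recall is 0.50 for the 40-49 group and 1.00 for the 50-59 group, the kind of gap that would prompt a closer look at the smaller or worse-served group.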

# Footnotes
<a name="uci-note"></a>1.[^](#uci): Wolberg, W. (14 Jul 1992). Breast Cancer Wisconsin (Original). UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)<br>

<a name="Kim, Dong Wook"></a>2.[^](#Kim): Kim, Dong Wook, et al. (6 May 2019) “Deep Learning-Based Survival Prediction of Oral Cancer Patients.” *Nature News*, Nature Publishing Group, https://www.nature.com/articles/s41598-019-43372-7<br> 

<a name="Esteva, Andre"></a>3.[^](#Esteva): Esteva, Andre, et al. (25 Jan 2017) "Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks." *Nature News*, Nature Publishing Group, https://www.nature.com/articles/nature21056.<br> 

<a name="Ebtisam, Hamed"></a>4.[^](#Ebtisam): Ebtisam, Hamed, et al. (21 June 2018) "An Analysis of Particle Swarm Optimization for Feature Selection on Medical Data." *IEEE Xplore*, IEEE, https://ieeexplore.ieee.org/document/8389840.<br> 

<a name="Amazona, Adorada"></a>5.[^](#Amazona): Amazona, Adorada, et al. (24 Jan 2019) "Support Vector Machine - Recursive Feature Elimination (SVM - RFE) for Selection of MicroRNA Expression Features of Breast Cancer." *IEEE Xplore*, IEEE, https://ieeexplore.ieee.org/abstract/document/8621708.<br> 
