# COGS 188 - Project Proposal for Group 21

# project name

## Group members

- Alex Park
- Sandra Lin
- Derrick Lin

# Abstract 

Breast cancer remains one of the most prevalent and deadly cancers among women worldwide. Early detection and accurate diagnosis are critical for improving patient outcomes. This study aims to develop a robust predictive model that classifies breast masses as benign or malignant. Using the Breast Cancer Wisconsin (Original) dataset from the UCI Machine Learning Repository, which contains nine cytological features measured on fine needle aspirate samples, we employ machine learning algorithms to predict malignancy. Our approach includes data preprocessing, feature selection, and the implementation of classification algorithms such as Support Vector Machines (SVM), Random Forest, and Neural Networks. These algorithms are relevant because they can model complex relationships between features and provide accurate predictions, which is essential for medical diagnostics. This project contributes to ongoing efforts to improve early breast cancer detection through advanced data analysis techniques.

# Background

Breast cancer is one of the most prevalent cancers among women worldwide. Early detection and prompt treatment are crucial for successful outcomes. However, traditional diagnostic methods relying on manual interpretation of histopathological images by healthcare providers are prone to subjectivity and variability. These limitations highlight the demand for more objective and efficient diagnostic approaches. 

Several studies have explored the use of machine learning and artificial intelligence techniques to improve the accuracy and efficiency of cancer diagnosis and prognosis. For example, Kim et al. (2019) used deep learning to predict the survival of oral cancer patients <a name="Kim, Dong Wook"></a>[<sup>[2]</sup>](#Kim). Similarly, Esteva et al. (2017) developed deep neural networks that classify skin cancer from images at a dermatologist level <a name="Esteva, Andre"></a>[<sup>[3]</sup>](#Esteva). Although these studies address other cancers, they demonstrate the potential of machine learning to enhance diagnostic accuracy and efficiency, which motivates applying similar methods to breast cancer diagnosis.

Furthermore, recent studies have demonstrated the value of feature selection techniques in optimizing machine learning models for medical data. Ebtisam et al. (2018) analyzed particle swarm optimization as a feature selection method on medical data <a name="Ebtisam, Hamed"></a>[<sup>[4]</sup>](#Ebtisam). Additionally, Amazona et al. (2019) used Support Vector Machine Recursive Feature Elimination (SVM-RFE) to select microRNA expression features of breast cancer <a name="Amazona, Adorada"></a>[<sup>[5]</sup>](#Amazona).

Despite these achievements, there is still a need for comprehensive research on optimizing machine learning pipelines for breast cancer diagnosis. Through meticulous data preprocessing, feature engineering, and model evaluation, this research seeks to advance the objectivity, efficiency, and accuracy of breast cancer diagnosis, ultimately improving patient outcomes.

# Problem Statement

The problem addressed in this study is to improve breast cancer diagnosis by using machine learning techniques to accurately predict malignancy in breast mass samples. To address this challenge, we plan to use algorithms such as Support Vector Machines, Random Forests, and Neural Networks to analyze cell features measured on fine needle aspirate samples. Our focus will be on the Breast Cancer Wisconsin (Original) dataset, which includes attributes like Clump Thickness, Uniformity of Cell Size, and Marginal Adhesion. Through this analysis, our goal is to develop predictive models that can assist medical professionals in early and accurate diagnosis. The problem is quantifiable, as we can measure model performance using evaluation metrics such as accuracy, precision, recall, F1 score, and ROC-AUC score. Through quantitative assessment and comparison with established benchmarks, our study aims to deliver a reliable and reproducible solution for enhancing breast cancer diagnosis.

# Data

For our project, we will be using the Breast Cancer Wisconsin (Original) dataset, available from the UCI Machine Learning Repository <a name="uci"></a>[<sup>[1]</sup>](#uci-note). This dataset is a classic in medical data analysis, particularly for machine learning applications that classify breast masses as benign or malignant from diagnostic measurements.

### Dataset Characteristics
- **Source**: [Breast Cancer Wisconsin (Original) - UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original))
- **Number of Instances**: 699
- **Number of Features**: 9, plus the class attribute
- **Missing Values**: Yes, specifically in the 'Bare Nuclei' feature

### Description of Data
The dataset comprises cytological measurements taken from fine needle aspirate (FNA) samples of breast masses. Each instance corresponds to one sample, with the following attributes (excluding the ID number, which we will disregard for analysis):
1. **Clump Thickness**: Integer (1-10)
2. **Uniformity of Cell Size**: Integer (1-10)
3. **Uniformity of Cell Shape**: Integer (1-10)
4. **Marginal Adhesion**: Integer (1-10)
5. **Single Epithelial Cell Size**: Integer (1-10)
6. **Bare Nuclei**: Integer (1-10), contains missing values
7. **Bland Chromatin**: Integer (1-10)
8. **Normal Nucleoli**: Integer (1-10)
9. **Mitoses**: Integer (1-10)

### Class Labels
- **Benign**: Encoded as 2
- **Malignant**: Encoded as 4

### Critical Variables
- **Bare Nuclei**: This feature contains missing values and will require imputation. Given its clinical significance in cancer diagnosis, how we handle its missing data could significantly influence model performance.
- **Clump Thickness** and **Uniformity of Cell Size**: These features are particularly noteworthy as they can indicate the severity of cancerous formations.

### Data Handling
Due to the presence of missing values in the 'Bare Nuclei' feature, we will need to perform data cleaning. Options include statistical imputation (for example, replacing missing entries with the median or mode) or model-based imputation (for example, k-nearest neighbors). We will also recode the class labels from 2 and 4 to 0 and 1. Additionally, although all features share the 1-10 range, standardizing or normalizing the data might still be necessary to improve the performance of scale-sensitive algorithms.
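
The cleaning steps above could be sketched roughly as follows. This is a minimal sketch, not our final pipeline: the column names are hypothetical (the raw file has no header row), the four rows are made up for illustration, and we assume missing 'Bare Nuclei' entries are stored as the string `?`. Median imputation is only one of the options mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical column names chosen for readability.
FEATURES = [
    "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
    "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses",
]


def clean(raw: pd.DataFrame) -> tuple[np.ndarray, np.ndarray]:
    """Return (X, y): imputed, standardized features and 0/1 labels."""
    df = raw.drop(columns=["id"])
    # Assumption: missing values appear as "?"; coerce them to NaN.
    X = df[FEATURES].apply(pd.to_numeric, errors="coerce")
    # Recode the class labels: 2 (benign) -> 0, 4 (malignant) -> 1.
    y = (df["class"].astype(int) == 4).astype(int).to_numpy()
    # Fill missing entries with the column median, then standardize.
    X = SimpleImputer(strategy="median").fit_transform(X)
    X = StandardScaler().fit_transform(X)
    return X, y


# Made-up rows in the dataset's layout: id, nine features, class.
toy = pd.DataFrame(
    [
        [1, 5, 1, 1, 1, 2, "1", 3, 1, 1, 2],
        [2, 5, 4, 4, 5, 7, "10", 3, 2, 1, 2],
        [3, 8, 4, 5, 1, 2, "?", 7, 3, 2, 4],
        [4, 8, 10, 10, 8, 7, "10", 9, 7, 1, 4],
    ],
    columns=["id", *FEATURES, "class"],
)
X, y = clean(toy)
print(X.shape, y)
```

In the actual project the imputer and scaler should be fit on the training split only, so that no information from the test set leaks into preprocessing.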

This dataset, with its rich feature set and real-world applicability, offers a robust platform for developing and testing machine learning models aimed at classifying cancerous conditions, thus potentially improving diagnostic processes in medical practice.


# Proposed Solution

Early detection and accurate diagnosis of breast cancer are crucial for effective treatment and management. The dataset chosen for this project contains features measured on fine needle aspirates (FNAs) of breast masses. The task is to classify the masses as benign or malignant based on these features.

**Proposed Algorithm: Support Vector Machine (SVM)**

Support Vector Machines (SVMs) are powerful supervised learning models that are particularly effective for classification tasks. The key idea behind SVMs is to find a hyperplane that best separates the data points of different classes. The relevance of using SVM for this problem can be justified through the following points:
- **High Dimensionality Handling:** SVMs are effective in high-dimensional spaces and also work well on compact feature sets such as the nine features in the Breast Cancer Wisconsin (Original) dataset. These features include Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei, and other characteristics that are crucial for distinguishing between benign and malignant masses.

- **Margin Maximization:** SVMs aim to find the hyperplane that maximizes the margin between the two classes. This characteristic is particularly useful in medical diagnosis tasks where it’s important to have a clear boundary to minimize false positives and false negatives, thereby reducing the chances of misdiagnosis.

- **Kernel Trick:** SVMs can employ the kernel trick to handle non-linear relationships between features, making it adaptable to the complexity of biological data. For instance, the Radial Basis Function (RBF) kernel can map the data into a higher-dimensional space where a linear separation is possible.

- **Robustness to Overfitting:** By adjusting the regularization parameter, SVMs can balance the trade-off between achieving a low error on training data and avoiding overfitting, ensuring better generalization on unseen data.
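
To illustrate the kernel trick mentioned above, the following sketch compares a linear and an RBF SVM on synthetic data that is not linearly separable. The concentric-circles data from `make_circles` is a stand-in chosen for illustration, not the breast cancer dataset, and the values of `C` and `gamma` are arbitrary assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Mean 5-fold cross-validated accuracy for each kernel.
linear_acc = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=5).mean()
rbf_acc = cross_val_score(
    SVC(kernel="rbf", C=1.0, gamma="scale"), X, y, cv=5
).mean()

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

On data like this the RBF kernel should clearly outperform the linear kernel. Whether the same holds for our dataset is an empirical question that the kernel comparison in the implementation plan is meant to answer.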

**Implementation Details**

1. **Data Preprocessing:**

   - **Normalization:** Since SVMs are sensitive to the scale of data, features will be normalized to ensure each feature contributes equally to the distance calculations.

   - **Train-Test Split:** The dataset will be split into training and test sets to evaluate the model’s performance on unseen data.

2. **Model Training:**

   - **Kernel Selection:** We will experiment with linear, polynomial, and RBF kernels to determine the best fit for the data.

   - **Hyperparameter Tuning:** Using techniques such as grid search with cross-validation, we will tune hyperparameters like the regularization parameter (C) and kernel-specific parameters (gamma for RBF).

3. **Evaluation Metrics:**

   - See the Evaluation Metrics section below.

4. **Model Interpretation:**

   - **Support Vectors:** Analyzing the support vectors can provide insights into which data points are most critical in defining the decision boundary.

   - **Feature Importance:** By examining the coefficients of an SVM trained with the linear kernel, we can identify which features are most influential in the classification decision. Coefficients of this kind are not available for the polynomial and RBF kernels.
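
The preprocessing and training steps above could be combined along the following lines. This is a sketch under stated assumptions: the data is a synthetic nine-feature stand-in generated with `make_classification`, and the grid values for `C`, `gamma`, and `degree` are illustrative rather than the values we will finally use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with nine features (not the real dataset).
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_redundant=2,
    random_state=0,
)

# Hold out a stratified test set for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# One sub-grid per kernel, so only relevant parameters are searched.
param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10]},
    {"svm__kernel": ["rbf"], "svm__C": [0.1, 1, 10],
     "svm__gamma": ["scale", 0.01, 0.1]},
    {"svm__kernel": ["poly"], "svm__C": [0.1, 1, 10],
     "svm__degree": [2, 3]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# With scoring="f1", score() reports F1 on the held-out test set.
test_f1 = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print(f"test F1: {test_f1:.3f}")
```

Keeping the scaler inside the pipeline means the cross-validation folds never see statistics computed from their own validation data.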

By using SVM, we aim to develop a robust and interpretable model that can effectively distinguish between benign and malignant breast masses, aiding in early and accurate diagnosis.
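
The interpretation step could look roughly like this for a linear SVM. Again this is a sketch on synthetic stand-in data, and the feature names `feature_0` to `feature_8` are placeholders; the `coef_` attribute used here exists only when the kernel is linear.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with nine features (not the real dataset).
X, y = make_classification(
    n_samples=200, n_features=9, n_informative=3, n_redundant=0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

model = SVC(kernel="linear", C=1.0).fit(X, y)

# support_ holds the indices of the training points that are
# support vectors, i.e. the points that define the boundary.
n_support = model.support_.size

# For a binary linear SVM, coef_ has shape (1, n_features).
weights = model.coef_.ravel()

# Rank features by the absolute size of their weight.
ranking = np.argsort(-np.abs(weights))

print(f"{n_support} of {len(X)} training points are support vectors")
for idx in ranking:
    print(f"feature_{idx}: weight = {weights[idx]:+.3f}")
```

Because the features are standardized first, the magnitudes of the weights are comparable across features, which is what makes this ranking meaningful.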

# Evaluation Metrics

1. **Accuracy:** The proportion of correctly classified instances.

2. **Precision and Recall:** Precision measures the proportion of true positives among the instances classified as positive, while recall measures the proportion of true positives among all actual positives. These metrics are critical in medical diagnosis to ensure that malignant cases are correctly identified.

3. **F1-Score:** The harmonic mean of precision and recall, providing a single metric that balances the two.
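
As a small worked example of these definitions, the sketch below computes each metric by hand from the confusion-matrix counts and checks it against scikit-learn. The ten labels are invented for illustration, with 1 standing for malignant and 0 for benign.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score,
)

# Invented labels: 1 = malignant (positive), 0 = benign (negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are right
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with scikit-learn.
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here one malignant case is missed (a false negative) and one benign case is flagged (a false positive), so accuracy is 0.80 while precision, recall, and F1 are all 0.75.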

# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned. Consider deleting things that are not important to your narrative. If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

### Subsection 1

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?



# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that. Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points. You probably want 2-5 sentences per point.

### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy

1. ***Data Privacy:*** Since we are using sensitive medical data, we must ensure the privacy and confidentiality of patient information. We must comply with all relevant data protection regulations, such as HIPAA (Health Insurance Portability and Accountability Act) in the United States, and follow the usage terms of the UCI Machine Learning Repository, which allows users to browse the site and download datasets without providing personal information such as a name, mailing address, email address, phone number, or social security number.


2. ***Bias and Fairness:*** We acknowledge the potential for bias in healthcare data, influenced by factors such as demographics, socioeconomic status, and access to healthcare. Our team will evaluate our machine learning model for any biases in the dataset or introduced by the algorithms used. We will implement strategies to mitigate bias and ensure fairness in our predictions, striving to avoid favoring or disadvantaging any specific demographic group or individual. Additionally, we will actively monitor model performance across different subgroups to identify and address any disparities in predictive accuracy.


3. ***Informed Consent:*** Given that the Breast Cancer Wisconsin dataset sourced from the UCI Machine Learning Repository is publicly available and anonymized, obtaining explicit consent from individual patients may not be applicable. However, we remain committed to upholding ethical standards by being transparent about the dataset's origin and how it is used within our project for research purposes.

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name="uci-note"></a>1.[^](#uci): Wolberg, W. (14 Jul 1992). Breast Cancer Wisconsin (Original). UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)<br>

<a name="Kim, Dong Wook"></a>2.[^](#Kim): Kim, Dong Wook, et al. (6 May 2019). "Deep Learning-Based Survival Prediction of Oral Cancer Patients." *Scientific Reports*, Nature Publishing Group, https://www.nature.com/articles/s41598-019-43372-7<br> 

<a name="Esteva, Andre"></a>3.[^](#Esteva): Esteva, Andre, et al. (25 Jan 2017). "Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks." *Nature*, Nature Publishing Group, https://www.nature.com/articles/nature21056<br> 

<a name="Ebtisam, Hamed"></a>4.[^](#Ebtisam): Ebtisam, Hamed, et al. (21 June 2018). "An Analysis of Particle Swarm Optimization for Feature Selection on Medical Data." *IEEE Xplore*, IEEE, https://ieeexplore.ieee.org/document/8389840<br> 

<a name="Amazona, Adorada"></a>5.[^](#Amazona): Amazona, Adorada, et al. (24 Jan 2019). "Support Vector Machine - Recursive Feature Elimination (SVM - RFE) for Selection of MicroRNA Expression Features of Breast Cancer." *IEEE Xplore*, IEEE, https://ieeexplore.ieee.org/abstract/document/8621708<br> 
