Breast Cancer Prediction
- What are the strongest predictors of breast cancer?
- Type of question = predictive
The goal of this project is to discover the strongest predictors of breast cancer in the Breast Cancer Coimbra Data Set. The dataset includes 64 records of breast cancer patients and 52 records of healthy controls. It has 9 features that can be used to predict breast cancer, and the project aims to identify which of them are the strongest predictors.
Motivation for analysis
Cancer remains an open problem and one of the biggest research areas in medical science. Many types of cancer are becoming increasingly common. An estimated 41,400 deaths from breast cancer (40,920 women and 480 men) will occur in 2018. We were keen to use machine learning models to dig into this dataset and explore breast cancer prediction.
The data used in the analysis is the Breast Cancer Coimbra Data Set from the UCI Machine Learning Repository. The data comprises nine predictors and a binary dependent variable indicating the presence or absence of breast cancer. All nine predictors are quantitative variables with positive values.
Quantitative attributes and their description:
Age (years) : Age of the individual.
BMI (kg/m2) : Body mass index of the individual.
Glucose (mg/dL) : Glucose level of the individual.
Insulin (µU/mL) : Insulin level of the individual. Insulin is a hormone made by the pancreas that allows your body to use sugar (glucose) from carbohydrates in the food that you eat for energy or to store glucose for future use.
HOMA : Homeostasis model assessment used to detect insulin resistance and identify patients at high risk of breast cancer development.
Leptin (ng/mL) : Leptin, "the hormone of energy expenditure", is a hormone predominantly made by adipose cells that helps to regulate energy balance by inhibiting hunger. Leptin is opposed by the actions of the hormone ghrelin, the "hunger hormone". Both hormones act on receptors in the arcuate nucleus of the hypothalamus.
Adiponectin (µg/mL) : Adiponectin (also referred to as GBP-28, apM1, AdipoQ and Acrp30) is a protein hormone which is involved in regulating glucose levels as well as fatty acid breakdown. In humans, it is encoded by the ADIPOQ gene and it is produced in adipose tissue.
Resistin (ng/mL) : Resistin, also known as adipose tissue-specific secretory factor (ADSF) or C/EBP-epsilon-regulated myeloid-specific secreted cysteine-rich protein (XCP1), is a cysteine-rich adipose-derived peptide hormone that in humans is encoded by the RETN gene.
MCP-1 (pg/dL) : The chemokine (C-C motif) ligand 2 (CCL2) is also referred to as monocyte chemoattractant protein 1 (MCP1) and small inducible cytokine A2. CCL2 is a small cytokine that belongs to the CC chemokine family.
Labels: 1 denotes Healthy controls and 2 denotes Patients.
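For context on the HOMA attribute above: HOMA-IR is commonly computed from fasting glucose and fasting insulin as glucose (mg/dL) × insulin (µU/mL) / 405. The sketch below assumes that convention. The function name `homa_ir` is hypothetical, and whether the dataset's HOMA column was derived with exactly this constant is an assumption, not something this project states.

```python
def homa_ir(glucose_mg_dl: float, insulin_uU_ml: float) -> float:
    """Approximate HOMA-IR from fasting glucose (mg/dL) and insulin (µU/mL).

    Assumes the common convention glucose * insulin / 405; the dataset
    may have used a slightly different unit conversion.
    """
    return glucose_mg_dl * insulin_uU_ml / 405.0


# Illustrative inputs only, not values taken from the dataset.
example = homa_ir(90.0, 10.0)
print(f"HOMA-IR: {example:.2f}")
```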
Plan of action
- Use supervised learning to build a predictive model.
- 80% of the data will be used to train the predictive model, and 20% will be used to test the predictive model. To avoid over-fitting in the model, use cross validation.
- Visualize distributions of training data features using histograms to identify better predictors from the data set.
- Use decision tree classification to build the predictive model.
- Visualise the test data predictions.
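The plan above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's `analysis.py`: it uses synthetic stand-in data with the same shape as the Coimbra data (116 rows, 9 features), and `max_depth=3`, `cv=5` and the random seeds are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the Coimbra data
# (116 rows, 9 numeric features); these are NOT the real measurements.
X, y = make_classification(
    n_samples=116, n_features=9, n_informative=4, n_redundant=0, random_state=0
)

# 80% train / 20% test, stratified so both splits keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# max_depth=3 is a hypothetical setting; a shallow tree limits over-fitting.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation on the training split only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training split, then score once on the held-out test split.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"mean CV accuracy: {cv_scores.mean():.2f}")
print(f"test accuracy:    {test_accuracy:.2f}")
```

Cross-validating only on the training split keeps the test split untouched until the final evaluation, so the reported test accuracy is not influenced by model selection.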
Choosing decision tree classification
We chose decision tree classification for our analysis because a fitted decision tree assigns an importance score to each feature. In our attempt to build a model that ranks features by importance, decision tree classification considers all of the features and the complete training data when picking the strongest predictors. Decision trees are non-parametric, as is K-Nearest Neighbours, but K-Nearest Neighbours does not provide a built-in ranking of features by importance and so would not directly answer our analysis question.
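As a rough sketch of how a fitted tree yields a feature ranking, scikit-learn's `DecisionTreeClassifier` exposes one importance score per feature through `feature_importances_`. The example below attaches the dataset's feature names to synthetic columns purely for illustration, so the printed ranking says nothing about the real data; `max_depth=3` is a hypothetical setting.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Feature names from the dataset description, attached here to
# synthetic columns for illustration only.
feature_names = ["Age", "BMI", "Glucose", "Insulin", "HOMA",
                 "Leptin", "Adiponectin", "Resistin", "MCP-1"]

# Synthetic stand-in data (116 rows, 9 features); NOT the real measurements.
X, y = make_classification(
    n_samples=116, n_features=9, n_informative=4, n_redundant=0, random_state=0
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranking = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranking:
    print(f"{name:12s} {importance:.3f}")
```

Features that the tree never splits on receive an importance of zero, so a shallow tree can leave several features unranked relative to each other.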
Steps without Docker:
Clone this repo, and using the command line, navigate to the root of this project.
- Run the following commands:
python scripts/read_clean.py data/breast_cancer.csv results/breast_cancer_new.csv
python scripts/eda.py results/breast_cancer_new.csv img/plot
python scripts/analysis.py results/breast_cancer_new.csv results/detailed.csv
python scripts/analysis.py results/breast_cancer_new.csv results/importance.csv
python scripts/plot.py results/importance.csv results/results.png
Rscript -e "rmarkdown::render('doc/report.Rmd')"
- The Makefile runs all of the above commands with the following command:
make all
- To erase all analysis output files created by make all, use the following command:
make clean
Follow the steps to run this analysis using Docker:
- Clone/download this repository and run the following command:
docker pull talhaadnan100/breast-cancer-prediction
- Now, use the command line to navigate to the root of this project on your computer, and then run the following command (replacing <PATH-ON-YOUR-COMPUTER> with the absolute path to the root of this project on your computer):
Mac/Linux users run:
docker run --rm -v <PATH-ON-YOUR-COMPUTER>:/home/breast-cancer-prediction talhaadnan100/breast-cancer-prediction make -C '/home/breast-cancer-prediction' all
Windows users run:
docker run --rm -v <PATH-ON-YOUR-COMPUTER>:/home/breast-cancer-prediction talhaadnan100/breast-cancer-prediction make -C /home/breast-cancer-prediction all
- To clean up the analysis use the following command:
Mac/Linux users run:
docker run --rm -v <PATH-ON-YOUR-COMPUTER>:/home/breast-cancer-prediction talhaadnan100/breast-cancer-prediction make -C '/home/breast-cancer-prediction' clean
Windows users run:
docker run --rm -v <PATH-ON-YOUR-COMPUTER>:/home/breast-cancer-prediction talhaadnan100/breast-cancer-prediction make -C /home/breast-cancer-prediction clean
Python version 3.6.5 and the following Python packages:
- numpy (version 1.14.3)
- pandas (version 0.23.0)
- sklearn (version 0.19.1)
- matplotlib (version 3.0.0)
- seaborn (version 0.9.0)
- argparse (version 1.0.10)
R version 3.5.1 and the following R packages:
- tidyverse (version 1.2.1)
- cowsay (version 0.7.0)
- gridExtra (version 2.3)
- png (version 0.1-7)
- here (version 0.1)
Result Summary and Visualization
The final report presents a visualization of the importance of all the features and the accuracy of the predictive model on the test data set.
|Result files|Link to file|
|---|---|
|File after cleaning the data|breast_cancer_new.csv|
|File that includes all the predictions|detailed.csv|
|File for importance of features|importance.csv|
|Plot for the result|results.png|
Relevant research link
This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcomed.
Wikipedia (for basic terms of medical attributes and their importance in the cancer research).
Patrício, M., Pereira, J., Crisóstomo, J., Matafome, P., Gomes, M., Seiça, R., & Caramelo, F. (2018). Using Resistin, glucose, age and BMI to predict the presence of breast cancer.
Breast cancer image by ConsenSys Media