# _Breast Cancer Predictor Using Clinical and Anthropometric Data._

#### Names of contributors: Anene Ifeanyi, Atymtayeva Saule, Bhatraju Aditya, Kuriyedath Rahul.

#### Date: 2020-12-08

# Table of contents

1. [Summary](#Summary)
2. [Methods](#Methods)
    1. [Data](#Data)
    2. [Analysis and Results](#Analysis-and-Results)
        1. [Class distributions of some important features](#class-distribution)
        2. [Cross validation results for different classifiers](#cvresults)
3. [Future Improvements](#FutureImprovements)
4. [References](#References)

## Summary <a name="Summary"></a>

- Given the clinical and anthropometric data of a new patient, we attempt to predict the presence or absence of breast cancer. These predictions can assist medical staff in choosing an appropriate course of treatment.
- We chose `Recall` as our metric because minimizing `False Negatives` was our priority (i.e., we care most about correctly identifying patients who have breast cancer).
- Our best classifier for the chosen metric, Logistic Regression, achieved a recall of `0.7142`.
- This means that the model correctly identified only `71.42%` of the cases that had breast cancer; in the remaining `28.58%`, it incorrectly predicted that the patient does not have breast cancer.
- To achieve better results in the future, we recommend using recursive feature elimination and training on more data.
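For reference, the relationship between the two percentages above follows from the standard definition of recall, where $TP$ is the number of true positives and $FN$ the number of false negatives (the symbols are introduced here for illustration and do not appear in the original analysis):

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad 1 - \text{Recall} = \frac{FN}{TP + FN}
$$

With the reported recall of $0.7142$, the share of missed cancer cases is $1 - 0.7142 = 0.2858$.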

## Methods <a name="Methods"></a>

### Data <a name="Data"></a>

The dataset used in this project consists of anthropometric data and parameters gathered in a standard blood analysis. It was created by Miguel Patrício, José Pereira, Joana Crisóstomo, Paulo Matafome, Raquel Seiça and Francisco Caramelo, all from the Faculty of Medicine of the University of Coimbra, and Manuel Gomes from the University Hospital Centre of Coimbra (Patrício et al., 2018). The dataset was sourced from the UCI Machine Learning Repository (Dua and Graff, 2017) and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra), specifically in [this file](https://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv). Each row holds the measurements for an individual patient and each column represents a variable. There are 116 observations and 9 features, all of which are numerical. Neither class contains observations with missing values. The target column is a binary variable that indicates the presence (Classification = 2) or absence (Classification = 1) of breast cancer.

Because there are no missing values and the data contain only numerical features, the only feature transformation applied was `Scaling`.
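As a minimal sketch of the scaling step, the example below standardizes a few numeric columns with scikit-learn's `StandardScaler`. The report only states that scaling was used, so the choice of `StandardScaler` is an assumption, and the small arrays are made-up values for illustration, not rows from `dataR2.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up illustrative values (NOT taken from dataR2.csv).
# Columns: Age, BMI, Glucose
X_train = np.array(
    [
        [45.0, 22.0, 80.0],
        [60.0, 27.5, 95.0],
        [52.0, 30.1, 110.0],
        [70.0, 24.3, 88.0],
    ]
)
X_test = np.array([[58.0, 26.0, 101.0]])

# Assumption: standardization to zero mean and unit variance.
scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only ...
X_train_scaled = scaler.fit_transform(X_train)

# ... and reuse those statistics for unseen data to avoid leaking information.
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training portion only keeps the test data out of the preprocessing statistics.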

### Analysis and Results <a name="Analysis-and-Results"></a>

#### Class distributions of some important features <a name="class-distribution"></a>

![output_16_2.png](attachment:8f0ff12b-df96-4943-928e-f4a73e8716dd.png)

![output_16_3.png](attachment:b8f4dbd3-a668-4e79-a2ef-78d49b52444b.png)

![output_16_4.png](attachment:f60d5ed8-07c6-4284-8b57-335b624eb32d.png)

Figure 1: Frequency distributions of some important features (Glucose, Insulin and HOMA) by class, where class 1 = absence of breast cancer and class 2 = presence of breast cancer.

- In the initial exploratory data analysis, Glucose, Insulin, and HOMA appear to be some of the most useful features for differentiating class 2 (presence of breast cancer) from class 1 (absence of breast cancer). We assume these three features are useful because, in the diagrams above, observations beyond a certain threshold (e.g., 26 µU/mL for insulin, 120 mg/dL for glucose, and 8.5 for HOMA) appear to belong predominantly to class 2, so a model should be able to predict class 2 easily in that range. However, all other features are also included when developing the model because they may be predictive in conjunction with these three features.
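The threshold observation above can be expressed as a small check. This is only an illustration of the exploratory finding, not a diagnostic rule and not part of the fitted model; the helper name and the example patient values are hypothetical.

```python
# Approximate cut-offs read from Figure 1, as quoted in the text above.
THRESHOLDS = {"Insulin": 26.0, "Glucose": 120.0, "HOMA": 8.5}


def features_above_threshold(patient):
    """Return the names of features whose value exceeds its cut-off.

    Hypothetical helper for illustration only; `patient` maps feature
    names to numeric values.
    """
    return [name for name, cutoff in THRESHOLDS.items() if patient[name] > cutoff]


# Made-up patient record, not a row from the dataset.
example_patient = {"Glucose": 131.0, "Insulin": 12.0, "HOMA": 3.9}
flagged = features_above_threshold(example_patient)
print(flagged)  # ['Glucose']
```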

![](../figures/01_PairPlot_HOMA-Insulin-Glucose.png)

Figure 2: Pair plots of Glucose, Insulin and HOMA.

#### Cross validation results for different classifiers <a name="cvresults"></a>


![different_classifiers.png](attachment:8ff6fed8-d91a-40a7-9334-ce57e9426ee2.png)

Figure 3: Cross-Validation results using different Machine Learning Classifiers. 

- DummyClassifier was used as the baseline model.

- All scaled numeric features were used to fit the Decision tree, k-Nearest Neighbours, Support Vector Machine (SVM) using the Radial Basis Function (RBF) kernel, Naive Bayes, Logistic Regression, and Random Forest classifiers.

- The Decision Tree and Random Forest classifiers have high training accuracy but much lower validation accuracy, which indicates overfitting to the training data. Hence, we eliminated these classifiers.

- Compared to all other classifiers, the `Logistic Regression` model achieved the highest validation recall, with a regularization hyperparameter of `C` = 100, and was therefore chosen for this prediction task.

- **We have achieved 0.625 accuracy and 0.714 recall.**

- The code used to perform the analysis and create this report can be found here: https://github.com/UBC-MDS/dsci522-group16.
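The comparison described in the bullets above could be sketched roughly as follows. This is not the code from the linked repository: it uses synthetic data from `make_classification` in place of the real dataset, only three of the classifiers, and an assumed 5-fold split. Because the positive class is labelled 2 rather than 1, the sketch builds a recall scorer with `pos_label=2`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 116 rows, 9 numeric features.
X, y01 = make_classification(n_samples=116, n_features=9, random_state=0)
y = y01 + 1  # relabel 0/1 as 1/2 to mimic the Classification column

# Recall for class 2 (presence of breast cancer).
scoring = {"accuracy": "accuracy", "recall": make_scorer(recall_score, pos_label=2)}

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(C=100, max_iter=1000),
}

cv_results = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so it is re-fit on each training fold.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_validate(pipe, X, y, cv=5, scoring=scoring, return_train_score=True)
    cv_results[name] = {
        "train_accuracy": scores["train_accuracy"].mean(),
        "valid_accuracy": scores["test_accuracy"].mean(),
        "train_recall": scores["train_recall"].mean(),
        "valid_recall": scores["test_recall"].mean(),
    }

for name, result in cv_results.items():
    print(name, {metric: round(value, 3) for metric, value in result.items()})
```

A large gap between the training and validation scores of a model, as typically seen with an unpruned decision tree, is the overfitting signal mentioned above.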

![confusion_matrix.png](attachment:94c4a9e3-f449-4abd-9edd-12ae5be4588f.png)

Figure 4: Confusion matrix of the Logistic Regression Model on test data. 

Figure 4 shows the number of true positives (top left), true negatives (bottom right), false positives (top right) and false negatives (bottom left) predicted by our model. We are particularly interested in reducing the number of false negatives (patients with breast cancer who are predicted to be healthy); improving the recall score is therefore imperative.
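To make the four cells of a confusion matrix concrete, the sketch below counts them by hand and derives recall and accuracy from the counts. The labels are made up for illustration and are not the test-set predictions behind Figure 4; class 2 is treated as the positive class, as in the dataset.

```python
def confusion_counts(y_true, y_pred, positive=2):
    """Count TP, FN, FP and TN for a binary problem (illustrative helper)."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, prediction in zip(y_true, y_pred):
        if truth == positive and prediction == positive:
            counts["TP"] += 1  # cancer present, predicted present
        elif truth == positive:
            counts["FN"] += 1  # cancer present, predicted absent
        elif prediction == positive:
            counts["FP"] += 1  # cancer absent, predicted present
        else:
            counts["TN"] += 1  # cancer absent, predicted absent
    return counts


# Made-up labels (1 = absence, 2 = presence), not the project's results.
y_true = [2, 2, 2, 2, 1, 1, 1, 1]
y_pred = [2, 2, 2, 1, 1, 1, 2, 1]

counts = confusion_counts(y_true, y_pred)
recall = counts["TP"] / (counts["TP"] + counts["FN"])
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)

print(counts)            # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
print(recall, accuracy)  # 0.75 0.75
```

Lowering the `FN` count while holding `TP + FN` fixed is what raises recall.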

## Future Improvements <a name="FutureImprovements"></a>

The score obtained by the `Logistic Regression` model for our key evaluation metric (`Recall`) is much better than the scores of the baseline model (Dummy Classifier) and the other models. However, there is room for improvement. Some suggestions include:

- Performing extensive feature engineering by applying concepts such as recursive feature elimination. 

- Obtaining more data to train the model. 

- Applying ensemble classifiers, particularly model stacking and averaging.
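Two of the suggestions above, recursive feature elimination and model stacking, could be prototyped along the following lines. This is a sketch under stated assumptions rather than a tested improvement: the data are synthetic, the number of features to keep (4) and the choice of base models are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 116 rows, 9 numeric features.
X, y01 = make_classification(n_samples=116, n_features=9, random_state=0)
y = y01 + 1  # labels 1/2, as in the Classification column

# Recursive feature elimination: repeatedly drop the weakest feature
# until 4 remain (placeholder value), then fit the final classifier.
rfe_pipe = make_pipeline(
    StandardScaler(),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=4),
    LogisticRegression(max_iter=1000),
)
rfe_pipe.fit(X, y)
selected_mask = rfe_pipe.named_steps["rfe"].support_  # True for kept features

# Stacking: a final estimator learns from the base models'
# cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
stacked_predictions = stack.predict(X)
```

In practice, both variants would need to be compared against the current model using cross-validated recall before drawing any conclusion.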

## References <a name="References"></a>

Patrício, M., Pereira, J., Crisóstomo, J., Matafome, P., Gomes, M., Seiça, R. and Caramelo, F., 2018. Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer, 18(1). https://doi.org/10.1186/s12885-017-3877-1

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

de Jonge, E., 2020. CRAN - Package Docopt. [online] Cran.r-project.org. Available at: <https://cran.r-project.org/web/packages/docopt/index.html> [Accessed 29 November 2020].

Oliphant, T.E., 2006. A guide to NumPy, Trelgol Publishing USA.

McKinney, W. & others, 2010. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference. pp. 51–56.

Waskom, M. et al., 2017. mwaskom/seaborn: v0.8.1 (September 2017), Zenodo. Available at: https://doi.org/10.5281/zenodo.883859.

Van Rossum, G. & Drake, F.L., 2009. Python 3 Reference Manual, Scotts Valley, CA: CreateSpace.

Hunter, J.D., 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), pp.90–95.

Pedregosa, F. et al., 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

Pérez, F. & Granger, B.E., 2007. IPython: a system for interactive scientific computing. Computing in Science & Engineering, 9(3).

Kluyver, T. et al., 2016. Jupyter Notebooks – a publishing format for reproducible computational workflows. In F. Loizides & B. Schmidt, eds. Positioning and Power in Academic Publishing: Players, Agents and Agendas. pp. 87–90.

Anon, 2020. Anaconda Software Distribution, Anaconda Inc. Available at: https://docs.anaconda.com/.

Ronacher, A., 2020. Jinja. [online] Pallets. Available at: <https://palletsprojects.com/p/jinja/> [Accessed 29 November 2020].