# Breast Cancer Wisconsin (Diagnostic) Data Set

## Introduction

For our project, our group wanted to focus on the health industry and find a dataset where predictions could support a patient's diagnosis. While browsing for datasets, we had to decide which disease to study, and we chose cancer because it is so widely studied today. This motivated us to analyze the breast cancer dataset publicly available on Kaggle. We believe that our analysis of this dataset could be helpful to doctors and patients who want to find out whether the cancer is benign or malignant.

Because our goal is to classify whether the cancer is benign or malignant, we chose a classification model. We ran several models, including logistic regression, decision trees, and random forests, and picked the best one based on its performance on the validation set. We then evaluated our final model on the test set. Furthermore, we assessed the significance of our model parameters using confidence intervals, and we performed hypothesis testing using the parametric Two-Sample T-Test and the non-parametric Wilcoxon Rank Sum Test to see whether our results are statistically significant at a significance level of 5%. We would like the results of our hypothesis testing to serve as initial indicators of a potential cancer diagnosis for doctors. We hope that our final model can serve as a strong basis for predicting whether the cancer is benign or malignant based on its characteristics.

## Data Description

The raw data is from Kaggle and can be found [here](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data). The data was given to us as a csv file with 569 observations, which is hard to interpret in raw form, so we loaded it into a pandas dataframe. Each row of the data represents a specific patient: 357 patients have benign cancer and 212 have malignant cancer. The dataset contains 32 columns, 30 of which are features representing the tumor's characteristics. We treated the diagnosis column as our response variable. All of the features are quantitative (float64); the only other columns are the id column (int64) and the diagnosis ('M' or 'B'). Because the diagnosis is categorical, we one-hot encoded it into 1s and 0s, which allows us to run our classification models. The diagnosis is read as follows: 0 for benign and 1 for malignant. We also dropped the last column as it was empty.
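The cleaning steps described above can be sketched as follows. This is only a minimal illustration on a three-row stand-in for the raw file, not the code we actually ran: the values, the toy column set, and the use of `dropna` and `map` are assumptions made for the example.

```python
import io

import pandas as pd

# Tiny stand-in for the raw CSV (illustrative values only). The trailing
# comma on every line creates an empty last column, as in the raw file.
raw_csv = io.StringIO(
    "id,diagnosis,radius_mean,\n"
    "101,M,17.99,\n"
    "102,B,11.42,\n"
    "103,B,12.45,\n"
)
raw = pd.read_csv(raw_csv)

# Drop every column that contains no values at all (the empty last column).
clean = raw.dropna(axis="columns", how="all")

# Encode the diagnosis: 0 for benign ('B'), 1 for malignant ('M').
clean = clean.assign(diagnosis=clean["diagnosis"].map({"B": 0, "M": 1}))

print(clean)
```

Mapping the two labels explicitly keeps the meaning of 0 and 1 visible in the code, which matters later when reading the confusion matrix.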

For each patient, a digitized image of a fine needle aspirate (FNA) of a breast mass is taken. Ten real-valued features are then computed for each cell nucleus present in the image. To summarize the findings for each patient, the table below contains the mean, the standard error, and the "worst" (mean of the three largest values) of each of the ten features, giving a total of thirty features per patient.
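To make the three summaries concrete, here is a small sketch that computes them for one feature of one image. The function name `summarize_feature` and the sample radii are hypothetical, and the use of the sample standard deviation (`ddof=1`) for the standard error is an assumption; the dataset documentation does not spell out that detail.

```python
import numpy as np


def summarize_feature(values):
    """Return (mean, standard error, worst) for one feature of one image."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Standard error of the mean, based on the sample standard deviation.
    se = values.std(ddof=1) / np.sqrt(values.size)
    # "Worst": the mean of the three largest values.
    worst = np.sort(values)[-3:].mean()
    return mean, se, worst


# Hypothetical radii measured on five nuclei in a single image.
radii = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, se, worst = summarize_feature(radii)
print(mean, se, worst)
```

Applying the same function to each of the ten per-nucleus features yields the thirty columns described above.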

Below is the cleaned dataset that we generated:

In [1]:
import pandas as pd
from PIL import Image

data = pd.read_csv('data/clean.csv')
data.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 1.0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | 1.0 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | 1.0 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | 1.0 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | 1.0 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


### Exploratory Data Analysis

### Data Analysis

## Prediction

### Logistic Regression 

In this section, we will perform logistic regression analysis.

Description: Logistic regression works by calculating posterior probabilities for the data using the logistic function and then classifying each data point based on those probabilities. The logistic function is given below:

$$ s(x) = \frac{1}{1+ e^{-x}} $$

Limitations and Assumptions: An important thing to note about logistic regression is that, in its basic form, it is a binary classifier; multiclass problems require an extension such as one-vs-rest or multinomial logistic regression. The basic form follows the rule that any point whose probability of being in a class is more than 50% is assigned to that class. If not, it is assigned to the other class.
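The logistic function and the 50% rule can be sketched in a few lines. The weights, bias, and feature values below are hypothetical and chosen only to illustrate the mechanics; they are not the fitted parameters of our model.

```python
import math


def logistic(x):
    """Logistic function s(x) = 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def classify(features, weights, bias, threshold=0.5):
    """Return 1 (malignant) if the predicted probability exceeds the threshold."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if logistic(score) > threshold else 0


# Hypothetical parameters for two standardized features.
weights = [1.5, -0.5]
bias = -0.2

print(logistic(0.0))                          # 0.5, the decision boundary
print(classify([2.0, 0.0], weights, bias))    # score 2.8, so class 1
print(classify([-1.0, 1.0], weights, bias))   # score -2.2, so class 0
```

Because s(0) = 0.5, the 50% rule is equivalent to checking whether the linear score is positive.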

Preprocessing: To create our feature matrix, we dropped the diagnosis column (since the model should not see the labels) and the id column (since an identifier is not a predictive feature). We then evaluated the model using various metrics. The calculated metrics and the preprocessing can be found in logistic_reg.ipynb. Here are the plots:

In [3]:
conf_image = Image.open('figures/confusion_matrix_logistic.png')
prec_recall_image = Image.open('figures/precision_recall_curve_logistic.png')
roc_curve_image = Image.open('figures/roc_curve_logistic.png')
display(conf_image)
display(prec_recall_image)
display(roc_curve_image)

FileNotFoundError: [Errno 2] No such file or directory: 'figures/confusion_matrix_logistic.png'

Evaluation: The classifier seems to be performing well, with very few false positives (falsely classifying the cancer as malignant) and false negatives (falsely classifying the cancer as benign). In addition, the AUC of the classifier is 0.92, which means that there is a 92% chance that the classifier assigns a randomly chosen malignant case a higher score than a randomly chosen benign case.
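For readers who want to see how these quantities are defined, the sketch below counts the confusion-matrix cells and computes the AUC directly from its ranking interpretation: the fraction of (malignant, benign) pairs in which the malignant case gets the higher score. The labels and scores are made up for illustration and are not our model's outputs; the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) with 1 = malignant and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def auc_by_ranking(y_true, scores):
    """AUC as the share of (malignant, benign) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities of malignancy.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
y_pred = [1 if s > 0.5 else 0 for s in scores]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)
auc = auc_by_ranking(y_true, scores)
print(tn, fp, fn, tp, accuracy, auc)
```

In this toy example the accuracy is 5/6 while the AUC is 8/9, which shows that the two numbers measure different things: accuracy depends on the chosen threshold, whereas AUC summarizes the ranking over all thresholds.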

### Decision Tree

### Final Model

## Assess significance of features using LR

## Hypothesis testing

While classification helps to distinguish malignant breast cancer patients from benign cancer patients, we next want to see whether some features differ between the two populations in a way that cannot be attributed to chance in the observed sample. More specifically, using parametric and non-parametric hypothesis testing, we identify features that have a significantly different average value in the two populations, which doctors may then use as initial indicators of malignant vs. benign cancer prior to actual diagnosis. Rather than immediately identifying patients with either malignant or benign cancer, which may require health-sensitive and cost-intensive procedures, doctors may use the results of the hypothesis testing to better guide patients in the early stages of their treatment. Additionally, if some of the input features required by the models we developed above are not available, doctors may fall back on the results of the hypothesis testing.

We identified that the 'area worst' feature, which is the mean of the three highest values of the area feature for each patient, had the largest average difference between the two populations and a large difference in their sample variances as well. On the other hand, 'texture se', which is the standard error of the texture feature, had the smallest average difference between the two populations as well as similar sample variances. We used both the Two Sample T Test and the Wilcoxon Rank Sum Test to conduct the hypothesis tests, because normality of the two samples cannot be assumed with high confidence.
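A comparison of this kind can be sketched with a pandas `groupby`. The six-row dataframe below is a toy example with made-up values, and ranking by the absolute difference in group means is an assumption about how such a comparison could be done; the actual procedure lives in two_populations_analysis.ipynb.

```python
import pandas as pd

# Toy data: 0 = benign, 1 = malignant (values are illustrative only).
toy = pd.DataFrame({
    "diagnosis": [0, 0, 0, 1, 1, 1],
    "area_worst": [500.0, 560.0, 620.0, 1400.0, 1500.0, 1600.0],
    "texture_se": [1.20, 1.25, 1.15, 1.22, 1.18, 1.23],
})

grouped = toy.groupby("diagnosis")
group_means = grouped.mean()
group_vars = grouped.var()  # sample variances per population

# Absolute difference between the malignant and benign means, per feature.
mean_diff = (group_means.loc[1] - group_means.loc[0]).abs()

largest = mean_diff.idxmax()
smallest = mean_diff.idxmin()
print(largest, smallest)
print(group_vars)
```

Note that a raw difference in means depends on each feature's scale, so features measured in large units (such as area) naturally rank high.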



For each of the two features identified, we performed the following hypothesis test (with both the parametric Two Sample T Test and the non-parametric Wilcoxon Rank Sum Test).
$$H_0: \mu_0 = \mu_1$$

$$H_1: \mu_0 \neq \mu_1$$
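As a rough sketch of how such a pair of tests can be run, the example below applies `scipy.stats.ttest_ind` and `scipy.stats.ranksums` to two simulated samples. The simulated data, the sample sizes, and the choice of the default pooled-variance t-test are assumptions for illustration; this is not the code in twosample.py, and passing `equal_var=False` (Welch's test) is an option when the sample variances differ substantially.

```python
import numpy as np
from scipy import stats

# Simulated feature values for the two populations (illustrative only).
rng = np.random.default_rng(0)
benign = rng.normal(loc=560.0, scale=160.0, size=100)
malignant = rng.normal(loc=1420.0, scale=600.0, size=60)

# Parametric: two-sample t-test (pooled variance by default).
t_stat, t_p = stats.ttest_ind(benign, malignant)

# Non-parametric: Wilcoxon rank sum test.
w_stat, w_p = stats.ranksums(benign, malignant)

alpha = 0.05  # significance level of 5%
print(f"t-test:   statistic={t_stat:.3f}, p={t_p:.3g}, reject H0: {t_p < alpha}")
print(f"rank sum: statistic={w_stat:.3f}, p={w_p:.3g}, reject H0: {w_p < alpha}")
```

Both statistics are negative here because the first sample (benign) has the smaller values, so the sign only reflects the order in which the samples are passed.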

In the table below, we see that the 'area worst' feature has a highly statistically significant p value, below 1%, for both the parametric and the non-parametric test. Hence, we reject the null hypothesis that the two populations have the same mean value for the 'area worst' feature. On the other hand, for the 'texture se' feature, the p value is above 5% for both the parametric and the non-parametric test, so we fail to reject the null hypothesis that the two populations have the same mean value for the 'texture se' feature. Doctors may choose to use the 'area worst' feature as a potential early indicator of malignant vs. benign cancer.

In [4]:
ht_results = pd.read_pickle("tables/ht_results.pkl")
ht_results

| Feature and test | T Statistic | P Value |
|---|---|---|
| area worst parametric | -20.570814 | 4.937924e-54 |
| area worst non-parametric | -18.754029 | 1.794645e-78 |
| texture se parametric | 0.197724 | 0.843332 |
| texture se non-parametric | -0.462805 | 0.643504 |


## Discussion & Conclusion

## Author Contributions

Kshitij Chauhan (TJ): Created the logistic_reg.ipynb notebook to conduct the logistic regression analysis and wrote the tests in test_logistic_reg.py to test its functions. Also wrote the introduction, data description, and logistic regression sections of this notebook.

Neha Haq: Created the two_populations_analysis.ipynb notebook to conduct the parametric Two Sample T Tests and the non-parametric Wilcoxon Rank Sum Tests. Set up the diagnosis Python package and wrote the methods in twosample.py and the tests in test_twosample.py. Created the Jupyter Book and the GitHub Actions. Wrote the hypothesis testing section of main.ipynb.