# Breast Cancer Prediction Using Daimensions Attribute Ranking

In this notebook, we'll work with a dataset from the University of California, Irvine Machine Learning Repository. It has nine attribute columns describing various aspects of cells and one classification column that labels each cell as benign or malignant. More information about the data can be found at: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

We have two goals: to build a model that predicts whether a cell is benign or malignant on future cell data, and to use attribute rank to learn which attributes of the cell matter most for that prediction. Daimensions' attribute rank option is useful for biomedical data such as cancer cells because we often want to know not only which cells are cancerous but also what might have caused the cancer. Attribute rank helps here by telling us which attributes most closely correlate with a cell's classification. Correlation alone does not establish cause, but it deepens our understanding of the data and can guide us toward probable causes.

Here is a look at our training data and the attributes we're using. For the target column, 2 is benign and 4 is malignant.

In [1]:
! head data/cancer_train.csv
# For Windows command prompt:
# type data\cancer_train.csv | more

Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
5,1,1,1,2,1,3,1,1,2
5,4,4,5,7,10,3,2,1,2
3,1,1,1,2,2,3,1,1,2
6,8,8,1,3,4,3,7,1,2
4,1,1,3,2,1,3,1,1,2
8,10,10,8,7,10,9,7,1,4
1,1,1,1,2,10,3,1,1,2
2,1,2,1,2,1,3,1,1,2
2,1,1,1,2,1,1,1,5,2
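
The same inspection can be done from Python. Below is a minimal sketch using only the standard library; the inline `SAMPLE` string copies the rows printed above so the block runs on its own, and the names `LABELS` and `count_labels` are illustrative helpers, not part of Brainome.

```python
import csv
import io
from collections import Counter

# Target encoding stated in this notebook: 2 = benign, 4 = malignant.
LABELS = {"2": "benign", "4": "malignant"}

# The rows printed by `head` above, inlined so the example is self-contained.
SAMPLE = """\
Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
5,1,1,1,2,1,3,1,1,2
5,4,4,5,7,10,3,2,1,2
3,1,1,1,2,2,3,1,1,2
6,8,8,1,3,4,3,7,1,2
4,1,1,3,2,1,3,1,1,2
8,10,10,8,7,10,9,7,1,4
1,1,1,1,2,10,3,1,1,2
2,1,2,1,2,1,3,1,1,2
2,1,1,1,2,1,1,1,5,2
"""


def count_labels(handle, target="Class"):
    """Count rows per diagnosis label in an open CSV file object."""
    reader = csv.DictReader(handle)
    return Counter(LABELS[row[target]] for row in reader)


counts = count_labels(io.StringIO(SAMPLE))
total = sum(counts.values())
for label, n in counts.items():
    print(f"{label}: {n} ({n / total:.1%})")
```

To run it against the full training file, pass an open file handle instead of the inline sample, for example `with open("data/cancer_train.csv", newline="") as f: counts = count_labels(f)`.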


### Installing Brainome via Pip
Uncomment and run the cell below to install Brainome and make the `brainome` command available in the terminal.

In [2]:
# ! pip install brainome

## 1. Get Measurements

We always want to measure our data before building our predictor in order to ensure we are building the right model. For more information about how to use Daimensions and why we want to measure our data beforehand, check out the Titanic notebook.

In [2]:
! brainome -measureonly data/cancer_train.csv


Brainome Table Compiler v1.005-7-prod
Copyright (c) 2019-2021 Brainome, Inc. All Rights Reserved.
Licensed to:                 Alexander Makhratchev  (Evaluation)
Expiration Date:             2021-08-31   26 days left
Maximum File Size:           30 GB
Maximum Instances:           unlimited
Maximum Attributes:          unlimited
Maximum Classes:             unlimited
Connected to:                daimensions.brainome.ai  (local execution)

Command:
    btc -measureonly data/cancer_train.csv

Start Time:                 08/05/2021, 16:43 PDT

Cleaning...done. 
Splitting into training and validation...done. 
Pre-training measurements...done. 


Pre-training Measurements
Data:
    Input:                      data/cancer_train.csv
    Target Column:              Class
    Number of instances:        559
    Number of attributes:         9 out of 9
    Number of classes:            2

Class Balance:                
                               2: 63.15%
  

## 2. Build the Predictor

Based on our measurements, Daimensions recommends a decision tree, which carries a lower risk of overfitting and generalizes better for this dataset. We also pass -rank so that Daimensions prioritizes the most informative attributes in our data; we'll look at which attributes it selected later.

In [3]:
! brainome -v -v -f DT data/cancer_train.csv -o cancer_predict.py -e 10 -rank --yes


Brainome Table Compiler v1.005-7-prod
Copyright (c) 2019-2021 Brainome, Inc. All Rights Reserved.
Licensed to:                 Alexander Makhratchev  (Evaluation)
Expiration Date:             2021-08-31   26 days left
Maximum File Size:           30 GB
Maximum Instances:           unlimited
Maximum Attributes:          unlimited
Maximum Classes:             unlimited
Connected to:                daimensions.brainome.ai  (local execution)

Command:
    btc -v -v -f DT data/cancer_train.csv -o cancer_predict.py -e 10 -rank --yes

Start Time:                 08/05/2021, 16:43 PDT

Cleaning...done. < 1s
Ranking attributes...done. < 1s

Attribute Ranking:
    Columns selected:           Uniformity_of_Cell_Shape, Bare_Nuclei, Clump_Thickness, Normal_Nucleoli, Uniformity_of_Cell_Size
    Risk of coincidental column correlation:    0.0%
    Ignoring columns:           Marginal_Adhesion, Single_Epithelial_Cell_Size, Bland_Chromatin, Mitoses
    Test Accuracy Pr

## 3. Validate the Model

Now we can validate our model on a separate set of data that wasn't used for training.

In [4]:
! python3 cancer_predict.py -validate data/cancer_valid.csv

Classifier Type:                    Decision Tree
System Type:                        2-way classifier

Accuracy:
    Best-guess accuracy:            75.00%
    Model accuracy:                 99.28% (139/140 correct)
    Improvement over best guess:    24.28% (of possible 25.0%)

Model capacity (MEC):               7 bits
Generalization ratio:               16.10 bits/bit

Confusion Matrix:

      Actual |Predicted
    ------------------
           2 |104   1
           4 |  0  35

Accuracy by Class:

      target |  TP FP  TN FN     TPR     TNR     PPV     NPV      F1      TS
    -------- | --- - --- - ------- ------- ------- ------- ------- -------
           2 | 104 0  35 1  99.05% 100.00% 100.00%  97.22%  99.52%  99.05%
           4 |  35 1 104 0 100.00%  99.05%  97.22% 100.00%  98.59%  97.22%
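
The figures in this report can be recomputed from the confusion matrix alone. The sketch below is one way to do that, assuming rows are actual classes and columns are predicted classes, and that TS is the threat score TP / (TP + FP + FN), which matches the values printed above. The function name `per_class_metrics` is illustrative and is not part of the generated predictor.

```python
def per_class_metrics(tp, fp, tn, fn):
    """Return the 'Accuracy by Class' rates as fractions between 0 and 1."""
    return {
        "TPR": tp / (tp + fn),              # sensitivity (recall)
        "TNR": tn / (tn + fp),              # specificity
        "PPV": tp / (tp + fp),              # precision
        "NPV": tn / (tn + fn),              # negative predictive value
        "F1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of PPV and TPR
        "TS": tp / (tp + fp + fn),          # threat score (assumed meaning of TS)
    }


# Confusion matrix from the validation output: matrix[actual][predicted].
matrix = {2: {2: 104, 4: 1}, 4: {2: 0, 4: 35}}

total = sum(sum(row.values()) for row in matrix.values())
correct = sum(matrix[label][label] for label in matrix)
accuracy = correct / total

# Best-guess accuracy: always predict the most common actual class.
best_guess = max(sum(row.values()) for row in matrix.values()) / total

# Treat malignant (4) as the positive class.
malignant = per_class_metrics(
    tp=matrix[4][4], fp=matrix[2][4], tn=matrix[2][2], fn=matrix[4][2]
)

print(f"accuracy   = {correct}/{total} = {accuracy:.4f}")
print(f"best guess = {best_guess:.4f}")
for name, value in malignant.items():
    print(f"{name:>3} = {value:.4f}")
```

Note that 139/140 is about 0.99286, so ordinary rounding would give 99.29%; the 99.28% in the report is consistent with the tool truncating rather than rounding.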


## 4. Learn From Attribute Rank

From validating the model, we can see that the predictor has 99.28% accuracy on the validation set (139 of 140 correct). This is great for making predictions on future data. However, what might be of greater interest is the output from building our predictor, specifically the attributes that Daimensions decided to use. Under the section of output called "Attribute Ranking," Daimensions lists the columns it selected: Uniformity_of_Cell_Shape, Bare_Nuclei, Clump_Thickness, Normal_Nucleoli, and Uniformity_of_Cell_Size. It ignored Marginal_Adhesion, Single_Epithelial_Cell_Size, Bland_Chromatin, and Mitoses. Knowing which attributes were the best predictors of malignant cells is valuable to scientists looking for the causes of this cancer.
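
As a quick consistency check, the selected and ignored columns reported in the build output should together account for all nine attributes. The sketch below copies both lists from that output and also shows one way to reduce a data row to the selected attributes for further analysis. All variable and function names here are illustrative.

```python
# Column names from the header of data/cancer_train.csv.
header = [
    "Clump_Thickness", "Uniformity_of_Cell_Size", "Uniformity_of_Cell_Shape",
    "Marginal_Adhesion", "Single_Epithelial_Cell_Size", "Bare_Nuclei",
    "Bland_Chromatin", "Normal_Nucleoli", "Mitoses", "Class",
]

# Lists copied from the "Attribute Ranking" section of the build output.
selected = [
    "Uniformity_of_Cell_Shape", "Bare_Nuclei", "Clump_Thickness",
    "Normal_Nucleoli", "Uniformity_of_Cell_Size",
]
ignored = [
    "Marginal_Adhesion", "Single_Epithelial_Cell_Size",
    "Bland_Chromatin", "Mitoses",
]

attributes = [name for name in header if name != "Class"]
covers_all = set(selected) | set(ignored) == set(attributes)
overlap = set(selected) & set(ignored)


def project(row, keep=tuple(selected), target="Class"):
    """Keep only the selected attributes plus the target from one data row."""
    record = dict(zip(header, row))
    return {name: record[name] for name in (*keep, target)}


# First data row of the training file shown earlier in this notebook.
first_row = project([5, 1, 1, 1, 2, 1, 3, 1, 1, 2])
print(covers_all, overlap)
print(first_row)
```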

## Citation
This breast cancer database was obtained from Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison.

Sources:
- Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA
- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu), received by David W. Aha (aha@cs.jhu.edu)
- Date: 15 July 1992