<a href="https://colab.research.google.com/github/andrybrew/sma-health/blob/master/01_structured_data_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Predicting Breast Cancer Diagnosis**

Classification is the problem of identifying which of a set of categories (sub-populations) a new observation belongs to, based on a training set of observations (or instances) whose category membership is known. In this section, we use breast cancer diagnosis as our classification case study.

Breast cancer is cancer that develops from breast tissue. After skin cancer, breast cancer is the most common cancer diagnosed in women in the United States. Breast cancer can occur in both men and women, but it's far more common in women.

We will build a classification model from historical data on benign and malignant cases to predict whether a new patient's tumor is likely to be benign or malignant.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Attribute Information:

1.   ID number
2.   Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

1.   radius (mean of distances from center to points on the perimeter)
2.   texture (standard deviation of gray-scale values)
3.   perimeter
4.   area
5.   smoothness (local variation in radius lengths)
6.   compactness (perimeter^2 / area - 1.0)
7.   concavity (severity of concave portions of the contour)
8.   concave points (number of concave portions of the contour)
9.   symmetry
10.   fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
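
The field layout described above can be sketched in a few lines. The snake_case names below are illustrative assumptions and may not match the column headers in the CSV file exactly; only the ordering (ID, diagnosis, ten means, ten standard errors, ten worst values) comes from the description.

```python
# The ten base measurements, in the order listed above.
# NOTE: these identifiers are assumed for illustration, not read from the CSV.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
statistics = ["mean", "se", "worst"]

# Field 1 is the ID and field 2 the diagnosis; the 30 features follow,
# grouped by statistic (all means, then all standard errors, then all worsts).
fields = ["id", "diagnosis"] + [
    f"{name}_{stat}" for stat in statistics for name in base_features
]

print(len(fields) - 2)                    # 30 feature columns
print(fields[2], fields[12], fields[22])  # fields 3, 13 and 23 (1-based)
```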

All feature values are recoded with four significant digits.

Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

#### **Install and Import Libraries**

Before we implement our classifier, we need to import the libraries used throughout this notebook.

***Import Libraries***

In [None]:
# Import Library for Data Manipulation
import pandas as pd

# Import Library for Machine Learning
import sklearn.metrics as metrics

# Import Library for Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#### **Import Dataset**

Next, we import the breast cancer dataset into this notebook using the Pandas library, then inspect its structure and descriptive statistics.

***Breast Cancer Data***

In [None]:
# Import Dataset
df_cancer = pd.read_csv('https://raw.githubusercontent.com/andrybrew/sma-health/master/data/breast_cancer.csv', sep =',')
df_cancer

In [None]:
# Prints the Dataset Information
df_cancer.info()

In [None]:
# Prints Descriptive Statistics
df_cancer.describe().transpose()

#### **Explore the Dataset**

We need to visualize the data before implementing our classifier. Data visualization is the act of taking information (data) and placing it into a visual context, such as a map or graph.

***Visualize Correlation between Features***

In [None]:
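# Draw Correlation Map
# The diagnosis column still holds text labels at this point,
# so restrict the correlation to numeric columns (pandas >= 1.5)
sns.clustermap(df_cancer.corr(numeric_only=True), center=0, cmap='vlag', linewidths=.75)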

#### **Preprocess the Data**

We should transform the raw data into a format the model can use. Raw data often cannot be passed to a model directly because missing values or text labels would cause errors. That is why we preprocess the data before modeling.

***Handling Missing Values***

In [None]:
# Check for Missing Values
df_cancer.isnull().sum()

***Replace Values***

In [None]:
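# Replace Class Values (B -> 0, M -> 1)
# Assign the result back rather than using inplace=True on a column selection,
# which recent pandas versions deprecate
df_cancer['diagnosis'] = df_cancer['diagnosis'].map({'B': 0, 'M': 1})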

# Show Data
df_cancer

***Select Feature and Target***

Features are the individual independent variables that act as the input to the model, while the target is the output we want to predict from those inputs.

In [None]:
# Select Features
feature = df_cancer.drop(['id', 'diagnosis'], axis=1)
feature

In [None]:
# Select Target
target = df_cancer['diagnosis']
target

***Set Training and Testing Data***

The next step is to split our data into train and test sets. For this purpose, we use scikit-learn's train_test_split function.

In [None]:
# Import Module
from sklearn.model_selection import train_test_split, cross_val_score

# Set Training and Testing Data (70:30)
feature_train, feature_test, target_train, target_test = train_test_split(
    feature, target, shuffle=True, test_size=0.3, random_state=1)

# Show the Training and Testing Data
print(feature_train.shape)
print(feature_test.shape)
print(target_train.shape)
print(target_test.shape)
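
As a rough illustration of what a shuffled 70:30 split does, here is a minimal pure-Python sketch. The function name `simple_train_test_split` is hypothetical, rounding the test count up is an assumption made for the sketch, and the shuffle will not reproduce the exact rows that scikit-learn selects for the same seed.

```python
import math
import random

def simple_train_test_split(n_samples, test_size=0.3, seed=1):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle repeatable, like random_state
    random.Random(seed).shuffle(indices)
    # Assumption for this sketch: round the test count up
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_train_test_split(10)
print(len(train_idx), len(test_idx))
```

Every row index lands in exactly one of the two lists, which is the property that keeps the test set unseen during training.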

#### **Modeling**

##### **Decision Tree Classifier**

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).
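
To make the idea of branches and leaves concrete, the sketch below walks a small hand-written tree. The feature names and thresholds are invented for illustration only; they are not learned from the dataset and are not the tree that scikit-learn will build.

```python
# A hand-written tree: internal nodes test one feature against a threshold,
# leaves carry a class label. All thresholds here are made up for illustration.
toy_tree = {
    "feature": "radius_mean", "threshold": 15.0,
    "left": {"label": "Benign"},
    "right": {
        "feature": "concavity_mean", "threshold": 0.10,
        "left": {"label": "Benign"},
        "right": {"label": "Malignant"},
    },
}

def predict_one(node, observation):
    """Follow branches until a leaf is reached, then return its label."""
    while "label" not in node:
        goes_left = observation[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]

print(predict_one(toy_tree, {"radius_mean": 12.3, "concavity_mean": 0.05}))  # Benign
print(predict_one(toy_tree, {"radius_mean": 18.1, "concavity_mean": 0.20}))  # Malignant
```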

***Build Model***

In [None]:
# Import library
from sklearn import tree

# Modeling Decision Tree
dtree = tree.DecisionTreeClassifier(min_impurity_decrease=0.01)
dtree.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_dtree = dtree.predict(feature_test)
target_predicted_dtree

In [None]:
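# Visualize Tree
# sklearn.externals.six was removed from scikit-learn; use the standard library instead
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Take the names from the feature table so they match the training columns
feature_names = list(feature.columns)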

dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,
                class_names = ['Benign', 'Malignant'],
                feature_names = feature_names)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

***Model Evaluation***

In [None]:
# Confusion Matrix
cm_dtree = metrics.confusion_matrix(target_test, target_predicted_dtree)
cm_dtree

In [None]:
# Accuracy, Precision, Recall
acc_dtree = metrics.accuracy_score(target_test, target_predicted_dtree)
prec_dtree = metrics.precision_score(target_test, target_predicted_dtree)
rec_dtree = metrics.recall_score(target_test, target_predicted_dtree)
f1_dtree = metrics.f1_score(target_test, target_predicted_dtree)
kappa_dtree = metrics.cohen_kappa_score(target_test, target_predicted_dtree)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_dtree )
print('Precision:', prec_dtree)
print('Recall:', rec_dtree)
print('F1 Score:', f1_dtree)
print('Cohens Kappa Score:', kappa_dtree)
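
The scores printed above can all be derived from the four cells of a binary confusion matrix. The sketch below shows the arithmetic with made-up counts (not results from this dataset); the helper name `binary_metrics` is hypothetical, and it assumes every denominator is non-zero.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute common scores from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted positives, how many are right
    recall = tp / (tp + fn)      # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa compares observed agreement with chance agreement
    chance_pos = ((tp + fp) / total) * ((tp + fn) / total)
    chance_neg = ((tn + fn) / total) * ((tn + fp) / total)
    expected = chance_pos + chance_neg
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}

# Made-up counts, laid out like scikit-learn's matrix [[tn, fp], [fn, tp]]
scores = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In a diagnosis setting, recall on the malignant class is often the score to watch, since a false negative means a missed malignant case.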

In [None]:
# Configure Plot Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve
target_predicted_dtree_prob = dtree.predict_proba(feature_test)[:, 1]
fp_rate_dtree, tp_rate_dtree, _ = metrics.roc_curve(target_test,  target_predicted_dtree_prob)
auc_dtree = metrics.roc_auc_score(target_test, target_predicted_dtree_prob)
plt.plot(fp_rate_dtree, tp_rate_dtree, label='Decision Tree, auc='+str(auc_dtree))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()
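
One way to read the AUC value in the legend: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way on a tiny made-up example; `auc_by_ranking` is a hypothetical helper, not the routine scikit-learn uses internally.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_true, y_score))  # 0.75
```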

##### **K-Nearest Neighbor Classifier**

K-nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). KNN has been used as a non-parametric technique in statistical estimation and pattern recognition since the early 1970s.
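
The idea can be sketched in plain Python: measure the distance from the new case to every stored case, keep the k closest, and take a majority vote. The toy points and the helper `knn_predict` are assumptions for illustration; Euclidean distance is used here as one common choice of similarity measure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-feature data: class 0 clusters near (1.5, 1.5), class 1 near (6.5, 6.5)
toy_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_labels, (2.0, 2.0)))  # 0
print(knn_predict(toy_points, toy_labels, (6.0, 6.0)))  # 1
```

Because the vote depends on raw distances, features measured on large scales (such as area) can dominate features measured on small scales (such as smoothness).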

***Build Model***

In [None]:
# Import Module
from sklearn.neighbors import KNeighborsClassifier

# Modeling K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors=71)
knn.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_knn = knn.predict(feature_test)
target_predicted_knn

***Model Evaluation***

In [None]:
# Confusion Matrix
cm_knn = metrics.confusion_matrix(target_test, target_predicted_knn)
cm_knn

In [None]:
# Accuracy, Precision, Recall
acc_knn = metrics.accuracy_score(target_test, target_predicted_knn)
prec_knn = metrics.precision_score(target_test, target_predicted_knn)
rec_knn = metrics.recall_score(target_test, target_predicted_knn)
f1_knn = metrics.f1_score(target_test, target_predicted_knn)
kappa_knn = metrics.cohen_kappa_score(target_test, target_predicted_knn)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_knn)
print('Precision:', prec_knn)
print('Recall:', rec_knn)
print('F1 Score:', f1_knn)
print('Cohens Kappa Score:', kappa_knn)

In [None]:
# Configure Plot Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve
target_predicted_knn_prob = knn.predict_proba(feature_test)[:, 1]
fp_rate_knn, tp_rate_knn, _ = metrics.roc_curve(target_test,  target_predicted_knn_prob)
auc_knn = metrics.roc_auc_score(target_test, target_predicted_knn_prob)
plt.plot(fp_rate_knn, tp_rate_knn, label='KNN, auc='+str(auc_knn))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

##### **Naive Bayes Classifier**

Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.
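
A minimal sketch of the Gaussian variant, assuming each feature is normally distributed within each class: estimate a prior, mean and variance per class, then pick the class with the highest log posterior. The helpers `fit_gaussian_nb` and `predict_gaussian_nb` and the toy data are hypothetical, and the small constant added to the variance is a simplification rather than scikit-learn's own smoothing rule.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the prior, per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        # Population variance plus a tiny constant to avoid division by zero
        variances = [
            sum((x - m) ** 2 for x in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior plus log likelihood."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log densities simply add up
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
toy_classes = [0, 0, 0, 1, 1, 1]
toy_model = fit_gaussian_nb(toy_rows, toy_classes)

print(predict_gaussian_nb(toy_model, (1.2, 1.4)))  # 0
print(predict_gaussian_nb(toy_model, (6.8, 6.2)))  # 1
```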

***Build Model***

In [None]:
# Import Module
from sklearn.naive_bayes import GaussianNB 

# Modeling Naive Bayes
nb = GaussianNB()
nb.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_nb = nb.predict(feature_test)
target_predicted_nb

***Model Evaluation***

In [None]:
# Confusion Matrix
cm_nb = metrics.confusion_matrix(target_test, target_predicted_nb)
cm_nb

In [None]:
# Accuracy, Precision, Recall
acc_nb = metrics.accuracy_score(target_test, target_predicted_nb)
prec_nb = metrics.precision_score(target_test, target_predicted_nb)
rec_nb = metrics.recall_score(target_test, target_predicted_nb)
f1_nb = metrics.f1_score(target_test, target_predicted_nb)
kappa_nb = metrics.cohen_kappa_score(target_test, target_predicted_nb)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_nb)
print('Precision:', prec_nb)
print('Recall:', rec_nb)
print('F1 Score:', f1_nb)
print('Cohens Kappa Score:', kappa_nb)

In [None]:
# Configure Plot Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve
target_predicted_nb_prob = nb.predict_proba(feature_test)[:, 1]
fp_rate_nb, tp_rate_nb, _ = metrics.roc_curve(target_test,  target_predicted_nb_prob)
auc_nb = metrics.roc_auc_score(target_test, target_predicted_nb_prob)
plt.plot(fp_rate_nb, tp_rate_nb, label='Naive Bayes, auc='+str(auc_nb))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

#### **Evaluating Models**

***Compare Model Performance***

In [None]:
# Comparing Model Performance
print('Decision Tree Accuracy =',acc_dtree)
print('Decision Tree Precision =',prec_dtree)
print('Decision Tree Recall =',rec_dtree)
print('Decision Tree F1-Score =', f1_dtree)
print('_______________________')
print('k-NN Accuracy =', acc_knn)
print('k-NN Precision =', prec_knn)
print('k-NN Recall =', rec_knn)
print('k-NN F1-Score =', f1_knn)
print('_______________________')
print('Naive Bayes Accuracy =', acc_nb)
print('Naive Bayes Precision =', prec_nb)
print('Naive Bayes Recall =', rec_nb)
print('Naive Bayes F1-Score =', f1_nb)

***Compare ROC Curve***

In [None]:
# Comparing ROC Curve
plt.plot(fp_rate_dtree,tp_rate_dtree,label='Decision Tree, auc='+str(auc_dtree))
plt.plot(fp_rate_knn,tp_rate_knn,label='K-NN, auc='+str(auc_knn))
plt.plot(fp_rate_nb,tp_rate_nb,label='Naive Bayes, auc='+str(auc_nb))
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()

#### **Predict New Data**

***Import New Breast Cancer Data***

In [None]:
# Import New Dataset
df_cancer_new = pd.read_csv('https://raw.githubusercontent.com/andrybrew/sma-health/master/data/breast_cancer_new.csv', sep =',')
df_cancer_new

***Check for Missing Values***

In [None]:
# Check for Missing Values
df_cancer_new.isnull().sum()

In [None]:
# Select Features
new_feature = df_cancer_new.drop(['id'], axis=1)
new_feature

***Predict New Patient Data***

In [None]:
# Predict using Decision Tree Classifier
new_predicted_dtree = pd.DataFrame(dtree.predict(new_feature), columns = ['diagnosis_dtree'])
new_predicted_dtree

In [None]:
# Predict using K-Nearest Neighbor Classifier
new_predicted_knn = pd.DataFrame(knn.predict(new_feature), columns = ['diagnosis_knn'])
new_predicted_knn

In [None]:
# Predict using Naive Bayes Classifier
new_predicted_nb = pd.DataFrame(nb.predict(new_feature), columns = ['diagnosis_nb'])
new_predicted_nb

***Show Prediction Comparison***

In [None]:
# Show Prediction Result
pred_cancer = pd.concat([df_cancer_new, new_predicted_dtree, new_predicted_knn, new_predicted_nb], axis=1)
pred_cancer

***Save Prediction Result***

In [None]:
# Save Prediction Result
pred_cancer.to_csv('cancer_prediction.csv', index=False)