# Project 2, Part 2

Ashton Cole

AVC687

COE 379L: Software Design for Responsive Intelligent Systems

## Description

We use a breast cancer patient dataset to build a model that predicts relapse (a recurrence event) from personal and tumor characteristics.

In [1]:
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

In [2]:
bc = pd.read_csv('bc_clean.csv')
bc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               262 non-null    int64  
 1   age                      262 non-null    int64  
 2   tumor-size               262 non-null    float64
 3   inv-nodes                262 non-null    int64  
 4   class_recurrence-events  262 non-null    bool   
 5   menopause_premeno        262 non-null    bool   
 6   node-caps_yes            262 non-null    bool   
 7   deg-malig_2              262 non-null    bool   
 8   deg-malig_3              262 non-null    bool   
 9   breast_right             262 non-null    bool   
 10  breast-quad_left_low     262 non-null    bool   
 11  breast-quad_left_up      262 non-null    bool   
 12  breast-quad_right_low    262 non-null    bool   
 13  breast-quad_right_up     262 non-null    bool   
 14  irradiat_yes             2

In [3]:
bc.drop('Unnamed: 0', axis=1, inplace=True)
bc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      262 non-null    int64  
 1   tumor-size               262 non-null    float64
 2   inv-nodes                262 non-null    int64  
 3   class_recurrence-events  262 non-null    bool   
 4   menopause_premeno        262 non-null    bool   
 5   node-caps_yes            262 non-null    bool   
 6   deg-malig_2              262 non-null    bool   
 7   deg-malig_3              262 non-null    bool   
 8   breast_right             262 non-null    bool   
 9   breast-quad_left_low     262 non-null    bool   
 10  breast-quad_left_up      262 non-null    bool   
 11  breast-quad_right_low    262 non-null    bool   
 12  breast-quad_right_up     262 non-null    bool   
 13  irradiat_yes             262 non-null    bool   
dtypes: bool(11), float64(1), i

## Train-Test Split

In [4]:
X = bc.drop('class_recurrence-events', axis=1, inplace=False)
y = bc['class_recurrence-events']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

## K-Nearest Neighbor

This model classifies a data point by polling the $k$ closest training data points and taking the majority label. We will try to find the best hyperparameter $k$ using k-fold cross-validation.
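To make the polling idea concrete, here is a minimal sketch of a majority vote among the $k$ nearest points, written with NumPy only. The function name `knn_predict` and the toy arrays are hypothetical and exist only for this illustration; it assumes Euclidean distance and unweighted votes, which match the defaults of `KNeighborsClassifier`, but it is not how scikit-learn implements the search internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the boolean labels of those neighbors
    votes = y_train[nearest]
    return bool(votes.sum() * 2 > k)

# Hypothetical toy data: three negatives near the origin, two positives near (1, 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([False, False, False, True, True])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_near_ones = knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3)
```

An odd $k$ avoids ties in a two-class problem, which is one reason small odd values are common choices.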

In [5]:
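# Search k = 1..49 with 5-fold cross-validation on the training set.
# No `scoring` is given, so candidates are ranked by the classifier's default score (accuracy).
model_knn = KNeighborsClassifier()
param_grid_knn = {'n_neighbors': np.arange(1, 50)}
gscv_knn = GridSearchCV(model_knn, param_grid_knn, cv=5)
gscv_knn.fit(X_train, y_train)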

In [6]:
model_knn = gscv_knn.best_estimator_

In [7]:
gscv_knn.best_params_

{'n_neighbors': 7}

## Naive Bayes

Here we use the Multinomial Naive Bayes variant. It is designed for discrete, non-negative features, which suits a data set of booleans and binned scales. The Gaussian variant is not appropriate, because it assumes that each input variable is normally distributed within each class. The Bernoulli variant models every input as binary, so it would be appropriate only if the binned scales were represented with one-hot encoding.
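The difference between the two discrete variants can be sketched on synthetic data. Everything below (`size_bin`, `flag`, `labels`, and the random values) is hypothetical and unrelated to `bc_clean.csv`; the point is only to show the input shape each variant expects, not to compare their accuracy.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

rng = np.random.default_rng(0)
size_bin = rng.integers(0, 4, size=40)  # hypothetical binned scale with values 0..3
flag = rng.integers(0, 2, size=40)      # hypothetical boolean feature stored as 0/1
labels = rng.integers(0, 2, size=40)    # hypothetical binary target

# MultinomialNB accepts the non-negative binned values directly
X_binned = np.column_stack([size_bin, flag])
nb_multi = MultinomialNB().fit(X_binned, labels)

# For BernoulliNB, expand the binned scale into one indicator column per bin.
# Without this step BernoulliNB would threshold the scale at its `binarize`
# parameter (0.0 by default) and lose the distinction between bins 1, 2, and 3.
size_onehot = np.eye(4, dtype=int)[size_bin]
X_onehot = np.column_stack([size_onehot, flag])
nb_bern = BernoulliNB().fit(X_onehot, labels)
```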

In [8]:
model_nb = MultinomialNB()
model_nb.fit(X_train, y_train)

## Random Forest

In [None]:
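# Grid-search the forest's size, tree depth, leaf size, and class weighting.
model_rf = RandomForestClassifier()
param_grid_rf = {
    "n_estimators": np.arange(start=10, stop=100, step=2),
    "max_depth": np.arange(start=2, stop=20),
    "min_samples_leaf": np.arange(start=1, stop=5),
    "class_weight": [{0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5}, {0: 0.7, 1: 0.3}],
}
# Rank candidates by recall, the metric emphasized under Performance Metrics.
# n_jobs=12 runs fits in parallel; adjust it to the number of available cores.
gscv_rf = GridSearchCV(model_rf, param_grid_rf, cv=5, n_jobs=12, scoring="recall")
gscv_rf.fit(X_train, y_train)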

In [None]:
model_rf = gscv_rf.best_estimator_

In [None]:
gscv_rf.best_params_
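Before running a search of this size, it can help to count how many models it will fit. The sketch below rebuilds the same grid and uses scikit-learn's `ParameterGrid` to count candidates; the variable names `n_candidates` and `n_fits` are introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid_rf = {
    "n_estimators": np.arange(start=10, stop=100, step=2),  # 45 values
    "max_depth": np.arange(start=2, stop=20),                # 18 values
    "min_samples_leaf": np.arange(start=1, stop=5),          # 4 values
    "class_weight": [{0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5}, {0: 0.7, 1: 0.3}],
}

# Every combination of the four parameters is one candidate
n_candidates = len(ParameterGrid(param_grid_rf))  # 45 * 18 * 4 * 3 = 9720
# Each candidate is fit once per fold with cv=5
n_fits = n_candidates * 5                         # 48600
```

If that is too slow, a coarser `n_estimators` step or a randomized search over the same ranges would be a possible way to shrink the budget.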

## Performance Metrics

In this data set, a false positive means predicting a recurrence event that does not happen. A false negative means predicting full remission when a relapse does happen. False negatives are far worse: a false positive can usually be ruled out by further examination, while a false negative may give a false sense of security.
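The two error types can be counted with scikit-learn's `confusion_matrix`. The labels below are hypothetical toy values (1 = recurrence, 0 = no recurrence), not predictions from any of the models in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions; 1 = recurrence, 0 = no recurrence
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0])

# With labels ordered [0, 1], rows are true classes and columns are predictions,
# so ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
```

Here `fn` counts the missed relapses, the error this project most wants to avoid, and `fp` counts the false alarms.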

### Definitions

- accuracy: the proportion of all points predicted correctly
- recall: true positives divided by the sum of true positives and false negatives, i.e. the share of actual relapses the model catches; as discussed above, this is the more important metric here
- precision: true positives divided by the sum of true positives and false positives, i.e. the share of predicted relapses that are real
- f1: the harmonic mean of recall and precision
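As a check on these definitions, the sketch below computes each metric by hand from the four outcome counts and compares it with the corresponding scikit-learn function. The label arrays are hypothetical toy values (1 = recurrence), chosen so that every count is non-zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels; 1 = recurrence, 0 = no recurrence
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# Count the four outcomes directly
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with scikit-learn's
matches = [
    np.isclose(accuracy, accuracy_score(y_true, y_pred)),
    np.isclose(recall, recall_score(y_true, y_pred)),
    np.isclose(precision, precision_score(y_true, y_pred)),
    np.isclose(f1, f1_score(y_true, y_pred)),
]
```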

### K-Nearest Neighbor

In [None]:
y_train_predict_knn = model_knn.predict(X_train)
y_test_predict_knn = model_knn.predict(X_test)

accuracy_knn_train = accuracy_score(y_train, y_train_predict_knn)
accuracy_knn_test = accuracy_score(y_test, y_test_predict_knn)
recall_knn_train = recall_score(y_train, y_train_predict_knn)
recall_knn_test = recall_score(y_test, y_test_predict_knn)
precision_knn_train = precision_score(y_train, y_train_predict_knn)
precision_knn_test = precision_score(y_test, y_test_predict_knn)
f1_knn_train = f1_score(y_train, y_train_predict_knn)
f1_knn_test = f1_score(y_test, y_test_predict_knn)

In [None]:
[[accuracy_knn_train, accuracy_knn_test],
 [recall_knn_train, recall_knn_test],
 [precision_knn_train, precision_knn_test],
 [f1_knn_train, f1_knn_test]]

### Multinomial Naive Bayes

In [None]:
y_train_predict_nb = model_nb.predict(X_train)
y_test_predict_nb = model_nb.predict(X_test)

accuracy_nb_train = accuracy_score(y_train, y_train_predict_nb)
accuracy_nb_test = accuracy_score(y_test, y_test_predict_nb)
recall_nb_train = recall_score(y_train, y_train_predict_nb)
recall_nb_test = recall_score(y_test, y_test_predict_nb)
precision_nb_train = precision_score(y_train, y_train_predict_nb)
precision_nb_test = precision_score(y_test, y_test_predict_nb)
f1_nb_train = f1_score(y_train, y_train_predict_nb)
f1_nb_test = f1_score(y_test, y_test_predict_nb)

In [None]:
[[accuracy_nb_train, accuracy_nb_test],
 [recall_nb_train, recall_nb_test],
 [precision_nb_train, precision_nb_test],
 [f1_nb_train, f1_nb_test]]

### Random Forest

In [None]:
y_train_predict_rf = model_rf.predict(X_train)
y_test_predict_rf = model_rf.predict(X_test)

accuracy_rf_train = accuracy_score(y_train, y_train_predict_rf)
accuracy_rf_test = accuracy_score(y_test, y_test_predict_rf)
recall_rf_train = recall_score(y_train, y_train_predict_rf)
recall_rf_test = recall_score(y_test, y_test_predict_rf)
precision_rf_train = precision_score(y_train, y_train_predict_rf)
precision_rf_test = precision_score(y_test, y_test_predict_rf)
f1_rf_train = f1_score(y_train, y_train_predict_rf)
f1_rf_test = f1_score(y_test, y_test_predict_rf)

In [None]:
[[accuracy_rf_train, accuracy_rf_test],
 [recall_rf_train, recall_rf_test],
 [precision_rf_train, precision_rf_test],
 [f1_rf_train, f1_rf_test]]
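The three metric cells above repeat the same pattern, so one possible refactoring is a small helper that returns the train and test scores as a table. The function name `score_table` is hypothetical, and the demonstration uses synthetic data from `make_classification` rather than the project's data set, so the numbers it produces say nothing about the models above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def score_table(model, X_train, y_train, X_test, y_test):
    """Return train/test accuracy, recall, precision, and f1 for a fitted binary classifier."""
    metrics = {
        "accuracy": accuracy_score,
        "recall": recall_score,
        "precision": precision_score,
        "f1": f1_score,
    }
    splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
    # Predict once per split, then reuse the predictions for every metric
    predictions = {name: (y, model.predict(X)) for name, (X, y) in splits.items()}
    rows = {
        metric: {name: fn(y, y_hat) for name, (y, y_hat) in predictions.items()}
        for metric, fn in metrics.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")

# Synthetic stand-in data, only to show the helper running end to end
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)
demo_model = KNeighborsClassifier(n_neighbors=7).fit(Xd_train, yd_train)
demo_table = score_table(demo_model, Xd_train, yd_train, Xd_test, yd_test)
```

In this notebook the same call would take `model_knn`, `model_nb`, or `model_rf` together with `X_train`, `y_train`, `X_test`, and `y_test`.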