# Mini-Kaggle Project 1: Breast Cancer Classification
Don Krapohl

## Summary

I used looping to try out many hyperparameters to choose the best model over the preprocessed data. Ultimately I submitted a Random Forest model that did not do as well as expected so I did predictions using the Logistic Regression algorithm to complete my final submission.


## Requirements

1. Joining the Kaggle Competition: Use this linkLinks to an external site. to join. Ensure your Kaggle team name matches your Canvas name. This is crucial as it's the basis for your grade. Once you've made a submission, verify your name appears correctly on the Leaderboard.

2. Data Preprocessing: Split the given training dataset. Remember to exclude non-informative columns like ID.

3. Model Development: Train the following classifiers: Perceptron, Logistic Regression, SVM, Decision Trees, KNN, and Random Forest. Document and evaluate each model's performance.

4. Kaggle Submission: You can develop either locally or within Kaggle. Once you're satisfied with your model, create the submission.csv and submit it on Kaggle to receive a score based on the hidden test set.

5. Performance: Aim to surpass a benchmark performance of 98%. If your performance is lower than 94%, you will lose the entire 70% of the grade allocated for the coding portion of this assignment.

6. Notebook Submission: Whether developed locally or on Kaggle, download your notebook (or respective files) and upload it to Canvas for peer-reviews. Clearly mention your chosen classifier for the final Kaggle submission. 

7. Due: Submit your submission file on Kaggle and notebook on Canvas by Nov 10, 2024, 11:59 PM.

### We will be training, testing, and using the following models:
* Perceptron
* Logistic Regression
* SVM
* Decision Trees
* Random Forest
* KNN

Ref: https://scikit-learn.org/1.5/supervised_learning.html

## Approach

I will load from train.csv, explore the data for quality and distribution, remove and encode some columns, and split the data into train and test sets. I'll train multiple models for each of the classifier algorithms and capture the one of each type that has the highest accuracy. After all are trained and tested I'll select the one with the highest accuracy and highest AUC as my submission.  I'll then predict over the test.csv file and submit the results.

## Environment Setup

### Establish environment
1. Download the python venv for the project from https://github.com/dkrapohl/uwf-venv-breast-cancer/tree/main
2. Activate the environment using the README from that repo
3. Set the Jupyter environment to use this kernel (top right of this window)

If we need to reproduce the environment the private repo for this notebook has a pip requirements.txt

## Data Exploration

### Import required libraries.


In [1]:
# basic dataframe and operations
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# manipulation and preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score

# models
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# measuring results
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# warning suppression
import warnings
from sklearn.exceptions import ConvergenceWarning

In [2]:
# Suppress ConvergenceWarnings and UserWarning.  They're noise here.
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)

### Create a few collections to capture the info on our different models

To explore the accuracies of multiple model hyperparameters I'll be training multiple models and keeping the model from each type of classifier that has the highest accuracy.  I suspect there's an easier way to accomplish this but this is what I can do at this point.


In [3]:
# collections we'll use for our best of each type of model
best_models = []                    # List of model instances that are our best for final evaluation
model_accuracies = []               # The accuracies of our best models in a key-value dictionary

Here I'll load the data into an initial dataframe to be used for exploration and the start of preprocessing.


In [4]:
# Import the csv training dataset to a pandas dataframe
# The dataset is expected in the same directory as this notebook
#   under a subfolder path datasets/breast-cancer-wisconsin-data/
data_train = pd.read_csv('datasets/breast-cancer-wisconsin-data/train.csv')
# Show the shape of the dataset
data_train.shape


(455, 32)

### Exploratory Data Analysis

Output the first 5 rows of the data to see the general character and nature of the data like missing values, obvious dirty data, features with very large ranges, etc.

In [5]:
# Display a few rows from the training data
data_train.head()

Unnamed: 0,id,label,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,90524101,M,17.99,20.66,117.8,991.7,0.1036,0.1304,0.1201,0.08824,...,21.08,25.41,138.1,1349.0,0.1482,0.3735,0.3301,0.1974,0.306,0.08503
1,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
2,89346,B,9.0,14.4,56.36,246.3,0.07005,0.03116,0.003681,0.003472,...,9.699,20.07,60.9,285.5,0.09861,0.05232,0.01472,0.01389,0.2991,0.07804
3,902975,B,12.21,14.09,78.78,462.0,0.08108,0.07823,0.06839,0.02534,...,13.13,19.29,87.65,529.9,0.1026,0.2431,0.3076,0.0914,0.2677,0.08824
4,904969,B,12.34,14.95,78.29,469.1,0.08682,0.04571,0.02109,0.02054,...,13.18,16.85,84.11,533.1,0.1048,0.06744,0.04921,0.04793,0.2298,0.05974


### Look at the data to make sure we don't have null or missing data

This will give the count of null values for each column to see if we need to handle missing data.

In [6]:
# Get the count of nulls per column
data_train.isnull().sum()

id                         0
label                      0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

In [7]:
# Get the count of nulls per column
data_train.isna().sum()

id                         0
label                      0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

### I prefer to verify object data types. Not important here but at scale it definitely is.

Sometimes numeric data come in as object, which can make lookups and indexing inefficient.

In [8]:
# Verify data types to see if there's a better explicit cast for any feature
# We're looking specifically for anything marked "object" as potential for casting
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       455 non-null    int64  
 1   label                    455 non-null    object 
 2   radius_mean              455 non-null    float64
 3   texture_mean             455 non-null    float64
 4   perimeter_mean           455 non-null    float64
 5   area_mean                455 non-null    float64
 6   smoothness_mean          455 non-null    float64
 7   compactness_mean         455 non-null    float64
 8   concavity_mean           455 non-null    float64
 9   concave points_mean      455 non-null    float64
 10  symmetry_mean            455 non-null    float64
 11  fractal_dimension_mean   455 non-null    float64
 12  radius_se                455 non-null    float64
 13  texture_se               455 non-null    float64
 14  perimeter_se             4

### Get the statistics about the data and their distribution.

I'm looking here for any columns with differing counts and any outrageous outliers.

In [9]:
# Display basic metrics about each feature, like count, mean, std, min/max, and IQR
data_train.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,...,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0
mean,34944290.0,14.213492,19.354374,92.572791,664.583077,0.096372,0.105059,0.089651,0.04959,0.181131,...,16.411787,25.705165,108.253319,900.190549,0.132138,0.256131,0.272104,0.11582,0.288476,0.083636
std,138782600.0,3.617912,4.399626,24.993837,362.603052,0.013746,0.051977,0.080264,0.039412,0.027257,...,5.01379,6.289274,34.849813,595.178062,0.02219,0.154821,0.204274,0.06703,0.058845,0.016646
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1566,0.05521
25%,869583.5,11.705,16.17,75.085,421.95,0.08673,0.06588,0.02886,0.020335,0.162,...,12.98,20.97,83.68,511.05,0.11785,0.14965,0.1109,0.064985,0.2508,0.07209
50%,905978.0,13.4,18.87,86.87,551.1,0.09639,0.09661,0.06387,0.03483,0.1799,...,14.92,25.27,97.66,684.6,0.1316,0.2186,0.2322,0.101,0.2815,0.08009
75%,8910375.0,16.09,21.83,105.4,801.55,0.1049,0.13055,0.13235,0.074975,0.1949,...,19.185,29.915,126.9,1122.5,0.1448,0.3418,0.3857,0.1661,0.3152,0.09195
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,0.9379,1.17,0.291,0.5774,0.1486


In [10]:
# For the label column in the training set
# Show the unique values of the training labels
print(data_train['label'].unique())

['M' 'B']


## Preprocessing

### Adjust columns as needed

We need to remove label and ID from the training data and encode the class labels from string to integers.

In [11]:
# remove the result column from the input parameters
# also remove the ID column. It carries no signal.
X_train_without_label = data_train.drop('label', axis=1).drop('id', axis=1)

# Assign class labels for the input data
y_labels = data_train['label']      # assign the labels we'll encode in the next block


In [12]:
# Encode the labels in y_train. Note we have not done train/test split yet.
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_labels)

### Split and scale the data

Here I'll be splitting the data to be 70% training set, 30% test set. I then scale the data using the Standard Scaler to make all the data within the same range having 0 as the mean and standard deviation of 1.

In [13]:
# Do train test split
X_train, X_test, y_train, y_test =    train_test_split(X_train_without_label, y_train,
    test_size=0.3, 
    random_state=17, stratify=y_train)

In [14]:
# Scale the features
# We split before we scale so the scaler has no knowledge of the test set
#   This helps to verify that the scaling is likely to be appropriate for the range of real-world values.

# Scaling will be important especially for Perceptron, Logistic Regression, and KNN
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)



## Perceptron Train and Test

Train the Perceptron with different hyperparameters. I'm going from smallest eta to largest because if this was used at scale I'd prefer the largest eta I can get for the best accuracy. This is based on my understanding that lower eta values take longer to converge on a solution.

This algorithm will be one of the simplest to train but the data have to be linearly separable or we'll never converge.

In [15]:
# In a loop over 15 values, train a Perceptron with different etas and capture the accuracies
etas = [0.005, 0.01, 0.03, 0.05, 0.07, 0.1, 0.14, 0.18, 0.25, 0.4, 0.6, 0.8, 1.0, 2.0, 5.0]  # eta values to try

# I'll also keep track of the highest accuracy for the highest eta as our "best" model
preceptron_highest_accuracy=0
perceptron_highest_test=0
perceptron_highest_eta=0

for eta in etas:
    perc_model = Perceptron(eta0=eta, max_iter = 1000 )   # create a perceptron
    perc_model.fit(X_train_scaled, y_train)                                    # train it
    
    # Collect info on training results if desired
    #model_train_score = perc_model.score(X_train_scaled, y_train)                    # get the model accuracy
    #print("{}\t{}".format("Perceptron train (eta={0})".format(eta), "{:.4f}".format(model_train_score)))

    # Collect info on test results
    y_pred = perc_model.predict(X_test_scaled)  
    model_test_score = accuracy_score(y_test, y_pred)                    # get the model accuracy
  
    # Print the accuracies for each of the model params so far.   
    print("{}\t{}".format("Perceptron (eta={0})".format(eta), "{:.4f}".format(model_test_score)))
        
    # Also change the "best" model info if appropriate
    if model_test_score >= perceptron_highest_test:  # only update if the score is better or equal to
        preceptron_highest_accuracy = model_test_score      # store the test score for this model
        perceptron_highest_eta = eta                        # store the best eta
        test_predictions = y_pred                           # store test predictions
        perceptron_best_model = perc_model                  # store the best model
        perceptron_highest_test = model_test_score          # update our highest score
        
best_models.append(perceptron_best_model)                   # add to our "best model" collection
model_accuracies.append(preceptron_highest_accuracy)        # add the accuracy

Perceptron (eta=0.005)	0.9635
Perceptron (eta=0.01)	0.9635
Perceptron (eta=0.03)	0.9635
Perceptron (eta=0.05)	0.9635
Perceptron (eta=0.07)	0.9635
Perceptron (eta=0.1)	0.9635
Perceptron (eta=0.14)	0.9635
Perceptron (eta=0.18)	0.9635
Perceptron (eta=0.25)	0.9635
Perceptron (eta=0.4)	0.9635
Perceptron (eta=0.6)	0.9635
Perceptron (eta=0.8)	0.9635
Perceptron (eta=1.0)	0.9635
Perceptron (eta=2.0)	0.9635
Perceptron (eta=5.0)	0.9635


After each algorithm section I will output the accuracy, confusion matrix, and F1 score for both classes to verify that the results are reasonable and further inform a final decision.

This being a health study I would also want to determine how much we can accept false positives and false negatives but that's going to be out of scope for this study.

In [16]:
# Print the best model info
print("Best model: accuracy {}, eta {}".format("{:.4f}".format(preceptron_highest_accuracy), perceptron_highest_eta))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))


Best model: accuracy 0.9635, eta 5.0
Model Confusion Matrix:
 [[84  2]
 [ 3 48]]
Train data F1-Score for class '1': 0.9504950495049505
Train data F1-Score for class '0': 0.9710982658959537


## Logistic Regression Train and Test

As with Perceptron I'm looping through hyperparameters and training multiple models, keeping the "best" based on accuracy. The secondary consideration is that I will prefer the lowest C value as lower C values make simpler models.

This model assumes linearity in independent variables and performs well if the data are linearly separable. It'll also give use proababilities on the class predictions.

In [17]:
# In a loop over several values, train an SVM with different C values and capture the accuracies
# We prefer lower C for better power so we'll only update if accuracy is higher and C lower

solvers = ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']
cs = [100.0, 50.0, 20.0, 10.0, 5.0, 3.0, 2.0, 1.0, 0.75, 0.5, 0.1, 0.01, 0.001, 0.0001]  # C values to try

# I'll also keep track of the highest accuracy for the lowest C as our "best" model
lr_highest_accuracy=0
lr_lowest_c=1000.0

# Try all SVM kernels
for solver in solvers:
    # Try the range of C values
    for c in cs:
        # Initialize and train the SVM model
        logreg_model = LogisticRegression(solver=solver, C=c, random_state = 17)
        logreg_model.fit(X_train_scaled, y_train)
        
        # Collect info on training results if desired
        # model_train_score = logreg_model.score(X_train_scaled, y_train)                    # get the model accuracy
        # print("{}\t{}".format("Logistic Regression (solver={})".format(solver), "{:.4f}".format(model_test_score)))
            
        # Make predictions using training data
        y_pred = logreg_model.predict(X_test_scaled)
        model_test_score = accuracy_score(y_test, y_pred)

        # Print the accuracies for each of the model params so far.   
        print("{}\t{}".format("Logistic Regression (solver={})".format(solver), "{:.4f}".format(model_test_score)))
            
        # we want the lowest C for better generalization so only keep
        #   accuracy if it's better but C is lower
        if model_test_score >= lr_highest_accuracy: # we're in a list with decreasing values so don't need to check C
            test_predictions = y_pred                   # store test predictions
            lr_lowest_c = c                             # store the lowest C
            lr_best_model = logreg_model                # store the best model
            lr_best_solver = solver                     # store the best solver
            lr_highest_accuracy = model_test_score       # update our highest score

best_models.append(lr_best_model)                       # add to our "best model" collection      
model_accuracies.append(lr_highest_accuracy)                # add the accuracy      

Logistic Regression (solver=lbfgs)	0.9781
Logistic Regression (solver=lbfgs)	0.9781
Logistic Regression (solver=lbfgs)	0.9708
Logistic Regression (solver=lbfgs)	0.9708
Logistic Regression (solver=lbfgs)	0.9708
Logistic Regression (solver=lbfgs)	0.9781
Logistic Regression (solver=lbfgs)	0.9781
Logistic Regression (solver=lbfgs)	0.9781
Logistic Regression (solver=lbfgs)	0.9781
Logistic Regression (solver=lbfgs)	0.9781
Logistic Regression (solver=lbfgs)	0.9781
Logistic Regression (solver=lbfgs)	0.9562
Logistic Regression (solver=lbfgs)	0.8759
Logistic Regression (solver=lbfgs)	0.6350
Logistic Regression (solver=liblinear)	0.9781
Logistic Regression (solver=liblinear)	0.9708
Logistic Regression (solver=liblinear)	0.9708
Logistic Regression (solver=liblinear)	0.9708
Logistic Regression (solver=liblinear)	0.9708
Logistic Regression (solver=liblinear)	0.9781
Logistic Regression (solver=liblinear)	0.9781
Logistic Regression (solver=liblinear)	0.9781
Logistic Regression (solver=liblinear)	0.978

In [18]:
# Print the best model info
print("Best model: {} solver, c {}, accuracy {}".format(lr_best_solver, c, "{:.4f}".format(lr_highest_accuracy)))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

Best model: saga solver, c 0.0001, accuracy 0.9927
Model Confusion Matrix:
 [[86  0]
 [ 1 50]]
Train data F1-Score for class '1': 0.9900990099009901
Train data F1-Score for class '0': 0.9942196531791907


## SVM Train and Test

As with Logistic Regression, looping to train multiple models, keeping the one with the highest accuracy for the lowest C value.  C value again we prefer lower C as it makes for simpler models.

This model will be computationally more expensive than the others, especially since I'm testing multiple C values on all kernels. The algorithm is more flexible in its ability to model both linear and non-linear data.

In [19]:
# In a loop over several values, train an SVM with different C values and capture the accuracies
# We prefer lower C for better power so we'll only update if accuracy is higher and C lower
cs = [100.0, 50.0, 20.0, 10.0, 5.0, 3.0, 2.0, 1.0, 0.75, 0.5, 0.1, 0.01, 0.001, 0.0001]  # C values to try
kernels = ['linear', 'rbf', 'poly', 'sigmoid']

# I'll also keep track of the highest accuracy for the highest eta as our "best" model
svm_highest_accuracy=0
svm_lowest_c=1000

# Try all SVM kernels
for kernel in kernels:
    # Loop through C values largest to smallest, train and test each
    for c in cs:
        # Initialize and train the SVM model
        linear_svm = SVC(kernel=kernel, C=c, random_state=17)
        linear_svm.fit(X_train_scaled, y_train)
        
        # Collect info on training results if desired
        # model_train_score = linear_svm.score(X_train_scaled, y_train)                    # get the model accuracy
        # print("{}\t{}".format("SVM Train (kernel={}, C={})".format(kernel, c), "{:.4f}".format(model_test_score)))
            
        # Make predictions using training data
        y_pred = linear_svm.predict(X_test_scaled)
        model_test_score = accuracy_score(y_test, y_pred)

        # Print the accuracies for each of the model params so far.   
        print("{}\t{}".format("SVM (kernel={}, C={})".format(kernel, c), "{:.4f}".format(model_test_score)))
            
        # we want the lowest C for better generalization so only keep
        #   accuracy if it's better but C is lower
        if model_test_score >= svm_highest_accuracy: # we're in a list with decreasing values so don't need to check C
            svm_lowest_c = c                            # store lowest C
            test_predictions = y_pred                   # store test predictions
            svm_best_model = linear_svm                 # store the best model
            svm_best_kernel = kernel                    # store the best kernel
            svm_highest_accuracy = model_test_score     # update our highest score

best_models.append(svm_best_model)                      # add to our "best model" collection   
model_accuracies.append(svm_highest_accuracy)           # add the accuracy

SVM (kernel=linear, C=100.0)	0.9708
SVM (kernel=linear, C=50.0)	0.9708
SVM (kernel=linear, C=20.0)	0.9781
SVM (kernel=linear, C=10.0)	0.9635
SVM (kernel=linear, C=5.0)	0.9708
SVM (kernel=linear, C=3.0)	0.9708
SVM (kernel=linear, C=2.0)	0.9708
SVM (kernel=linear, C=1.0)	0.9781
SVM (kernel=linear, C=0.75)	0.9781
SVM (kernel=linear, C=0.5)	0.9781
SVM (kernel=linear, C=0.1)	0.9854
SVM (kernel=linear, C=0.01)	0.9635
SVM (kernel=linear, C=0.001)	0.9562
SVM (kernel=linear, C=0.0001)	0.6350
SVM (kernel=rbf, C=100.0)	0.9781
SVM (kernel=rbf, C=50.0)	0.9781
SVM (kernel=rbf, C=20.0)	0.9781
SVM (kernel=rbf, C=10.0)	0.9854
SVM (kernel=rbf, C=5.0)	0.9854
SVM (kernel=rbf, C=3.0)	0.9854
SVM (kernel=rbf, C=2.0)	0.9854
SVM (kernel=rbf, C=1.0)	0.9708
SVM (kernel=rbf, C=0.75)	0.9708
SVM (kernel=rbf, C=0.5)	0.9708
SVM (kernel=rbf, C=0.1)	0.9489
SVM (kernel=rbf, C=0.01)	0.6277
SVM (kernel=rbf, C=0.001)	0.6277
SVM (kernel=rbf, C=0.0001)	0.6277
SVM (kernel=poly, C=100.0)	0.9635
SVM (kernel=poly, C=50.0)	0.9708

In [20]:
# Print the best model info
print("Best model: {} kernel, accuracy {}, C {}".format(svm_best_kernel, "{:.4f}".format(svm_highest_accuracy), 
                                                        svm_lowest_c))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

Best model: rbf kernel, accuracy 0.9854, C 2.0
Model Confusion Matrix:
 [[85  1]
 [ 1 50]]
Train data F1-Score for class '1': 0.9803921568627451
Train data F1-Score for class '0': 0.9883720930232558


## Decision Trees Train and Test

With Decision Trees we're testing different splitting algorithms as well as model depth keeping the one with the highest accuracy and the lowest depth.  Lowest depth is selected as it makes for the simpler model.

This algorithm should be the easiest to interpret and captures non-linear relationships but it's prone to overfitting.  It's also at risk of creating biased trees if the classes are imbalanced but the classes here are relatively equally represented.

In [21]:
# In a loop over several values, train a decision tree with different split criteria and depths and capture the accuracies
# We prefer lower C for better power so we'll only update if accuracy is higher and depth lower
criteria = ['gini', 'entropy', 'log_loss']

# I'll also keep track of the highest accuracy for the lowest depth as our "best" model
dtree_highest_accuracy=0
dtree_lowest_depth=1000
dtree_best_criterion=''

# Try all SVM kernels
for criterion in criteria:
    # Loop through depth values smallest to largest, train and test each
    for depth in range(50, 1, -1):   # Allow up to depth 5 starting at 5 and down to 1
        # Initialize and train the decision tree model
        decision_tree = DecisionTreeClassifier(criterion=criterion, max_depth=depth, random_state=17)
        decision_tree.fit(X_train_scaled, y_train)
        
        # Collect info on training results if desired
        # model_train_score = decision_tree.score(X_train_scaled, y_train)                    # get the model accuracy
        # print("{}\t{}".format("Decision Tree Train (criterion={}, depth={})".format(criterion, depth), "{:.4f}".format(model_test_score)))
            
        # Make predictions using training data
        y_pred = decision_tree.predict(X_test_scaled)
        model_test_score = accuracy_score(y_test, y_pred)

        # Print the accuracies for each of the model params so far.   
        print("{}\t{}".format("Decision Tree (criterion={}, depth={})".format(criterion, depth), "{:.4f}".format(model_test_score)))
            
        # we want the lowest C for better generalization so only keep
        #   accuracy if it's better but C is lower
        if model_test_score >= dtree_highest_accuracy:
            dtree_lowest_depth = depth              # store lowest depth
            test_predictions = y_pred               # store test predictions
            dtree_best_model = decision_tree        # store the best model
            dtree_best_criterion = criterion        # store the best kernel
            dtree_highest_accuracy = model_test_score   # update our highest score

best_models.append(dtree_best_model)                # add to our "best model" collection
model_accuracies.append(dtree_highest_accuracy)     # add the accuracy

Decision Tree (criterion=gini, depth=50)	0.9270
Decision Tree (criterion=gini, depth=49)	0.9270
Decision Tree (criterion=gini, depth=48)	0.9270
Decision Tree (criterion=gini, depth=47)	0.9270
Decision Tree (criterion=gini, depth=46)	0.9270
Decision Tree (criterion=gini, depth=45)	0.9270
Decision Tree (criterion=gini, depth=44)	0.9270
Decision Tree (criterion=gini, depth=43)	0.9270
Decision Tree (criterion=gini, depth=42)	0.9270
Decision Tree (criterion=gini, depth=41)	0.9270
Decision Tree (criterion=gini, depth=40)	0.9270
Decision Tree (criterion=gini, depth=39)	0.9270
Decision Tree (criterion=gini, depth=38)	0.9270
Decision Tree (criterion=gini, depth=37)	0.9270
Decision Tree (criterion=gini, depth=36)	0.9270
Decision Tree (criterion=gini, depth=35)	0.9270
Decision Tree (criterion=gini, depth=34)	0.9270
Decision Tree (criterion=gini, depth=33)	0.9270
Decision Tree (criterion=gini, depth=32)	0.9270
Decision Tree (criterion=gini, depth=31)	0.9270
Decision Tree (criterion=gini, depth=30)

In [22]:
# Print the best model info
print("Best model: {} criterion, depth {}, accuracy {}".format(dtree_best_criterion, dtree_lowest_depth, 
                                                               "{:.4f}".format(dtree_highest_accuracy)))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

Best model: gini criterion, depth 6, accuracy 0.9489
Model Confusion Matrix:
 [[80  6]
 [ 1 50]]
Train data F1-Score for class '1': 0.9345794392523364
Train data F1-Score for class '0': 0.9580838323353293


## Random Forest Train and Test

This is very much like Decision Trees except we're tuning splitting criteria as well as the number of estimators used.  We'll prefer fewer estimators for simpler models.

The Random Forest algorithm is harder to interpret because it's an ensemble of decision trees but it's less likely to overfit. It can be harder to compute than a single or a few decision trees.

In [23]:
# In a loop over several values, train a random forest with different split criteria and estimator count and capture the accuracies
criteria = ['gini', 'entropy', 'log_loss']

# I'll also keep track of the highest accuracy for the lowest number of estimators as our "best" model
rforest_highest_accuracy=0
rforest_lowest_estimators=1000
rforest_best_criterion=''

# Try all random forest split criteria
for criterion in criteria:
    # Loop through depth values smallest to largest, train and test each
    for estimators in range(500, 50, -50):   # Allow up to 500 estimators decreasing by 50 each loop
        # Initialize and train the decision tree model
        random_forest = RandomForestClassifier(criterion=criterion, n_estimators=estimators, random_state=17)
        random_forest.fit(X_train_scaled, y_train)
        
        # Collect info on training results if desired
        # model_train_score = random_forest.score(X_train_scaled, y_train)                    # get the model accuracy
        # print("{}\t{}".format("Random Forest Train (criterion={}, estimators={})".format(criterion, estimators), "{:.4f}".format(model_test_score)))
            
        # Make predictions using training data
        y_pred = random_forest.predict(X_test_scaled)
        model_test_score = accuracy_score(y_test, y_pred)

        # Print the accuracies for each of the model params so far.   
        print("{}\t{}".format("Random Forest Test (criterion={}, estimators={})".format(criterion, estimators), "{:.4f}".format(model_test_score)))
            
        # we want the lowest number of estimators for better generalization 
        if model_test_score >= rforest_highest_accuracy:
            rforest_lowest_estimators = estimators  # store lowest number of estimators
            test_predictions = y_pred               # store test predictions
            rforest_best_model = random_forest      # store the best model
            rforest_best_criterion = criterion      # store the best criterion
            rforest_highest_accuracy = model_test_score # update our highest score
            
best_models.append(rforest_best_model)              # add to our "best model" collection        
model_accuracies.append(rforest_highest_accuracy)   # add the accuracy    


Random Forest Test (criterion=gini, estimators=500)	0.9854
Random Forest Test (criterion=gini, estimators=450)	0.9854
Random Forest Test (criterion=gini, estimators=400)	0.9854
Random Forest Test (criterion=gini, estimators=350)	0.9854
Random Forest Test (criterion=gini, estimators=300)	0.9854
Random Forest Test (criterion=gini, estimators=250)	0.9854
Random Forest Test (criterion=gini, estimators=200)	0.9854
Random Forest Test (criterion=gini, estimators=150)	0.9781
Random Forest Test (criterion=gini, estimators=100)	0.9708
Random Forest Test (criterion=entropy, estimators=500)	0.9854
Random Forest Test (criterion=entropy, estimators=450)	0.9854
Random Forest Test (criterion=entropy, estimators=400)	0.9854
Random Forest Test (criterion=entropy, estimators=350)	0.9854
Random Forest Test (criterion=entropy, estimators=300)	0.9854
Random Forest Test (criterion=entropy, estimators=250)	0.9781
Random Forest Test (criterion=entropy, estimators=200)	0.9781
Random Forest Test (criterion=entro

In [24]:
# Print the best model info
print("Best model: {} criterion, estimators {}, accuracy {}".format(rforest_best_criterion, rforest_lowest_estimators, 
                                                                    "{:.4f}".format(rforest_highest_accuracy)))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

Best model: log_loss criterion, estimators 300, accuracy 0.9854
Model Confusion Matrix:
 [[84  2]
 [ 0 51]]
Train data F1-Score for class '1': 0.9807692307692307
Train data F1-Score for class '0': 0.9882352941176471


## KNN Train and Test

This is simpler as I'm only tuning the number of neighbors and keeping the one with the highest accuracy. We'll prefer the model with the fewest neighbors.

KNN, while computational a bit expensive is simple for this problem and doesn't appear to require much tuning.

In [25]:
# In a loop over several values, train a knn movel with different number of neighbors

# I'll also keep track of the highest accuracy for the lowest # nieghbors as our "best" model
knn_highest_accuracy=0
knn_lowest_neighbors=1

# Loop through neighbor values, train and test each
for n_neighbors in range(2, 50):   # Allow up to 50 neighbors increasing by 1 each loop
    # Initialize and train the decision tree model
    knn_model = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_model.fit(X_train_scaled, y_train)
    
    # Collect info on training results if desired
    # model_train_score = knn_model.score(X_train_scaled, y_train)                    # get the model accuracy
    # print("{}\t{}".format("KNN Train (neighbors={})".format(n_neighbors), "{:.4f}".format(model_test_score)))
        
    # Make predictions using training data
    y_pred = knn_model.predict(X_test_scaled)
    model_test_score = accuracy_score(y_test, y_pred)

    # Print the accuracies for each of the model params so far.   
    print("{}\t{}".format("KNN Test (neighbors={})".format(n_neighbors), "{:.4f}".format(model_test_score)))
        
    # we want the lowest number of estimators for better generalization 
    if model_test_score >= knn_highest_accuracy:
        knn_lowest_neighbors = n_neighbors      # store lowest number of estimators
        test_predictions = y_pred               # store test predictions
        knn_best_model = knn_model              # store the best model
        knn_highest_accuracy = model_test_score # update our highest score

best_models.append(knn_best_model)              # add to our "best model" collection  
model_accuracies.append(knn_highest_accuracy)   # add the accuracy    

KNN Test (neighbors=2)	0.9489
KNN Test (neighbors=3)	0.9635
KNN Test (neighbors=4)	0.9635
KNN Test (neighbors=5)	0.9708
KNN Test (neighbors=6)	0.9416
KNN Test (neighbors=7)	0.9635
KNN Test (neighbors=8)	0.9562
KNN Test (neighbors=9)	0.9708
KNN Test (neighbors=10)	0.9635
KNN Test (neighbors=11)	0.9635
KNN Test (neighbors=12)	0.9635
KNN Test (neighbors=13)	0.9708
KNN Test (neighbors=14)	0.9635
KNN Test (neighbors=15)	0.9635
KNN Test (neighbors=16)	0.9489
KNN Test (neighbors=17)	0.9489
KNN Test (neighbors=18)	0.9489
KNN Test (neighbors=19)	0.9489
KNN Test (neighbors=20)	0.9489
KNN Test (neighbors=21)	0.9635
KNN Test (neighbors=22)	0.9635
KNN Test (neighbors=23)	0.9562
KNN Test (neighbors=24)	0.9489
KNN Test (neighbors=25)	0.9489
KNN Test (neighbors=26)	0.9489
KNN Test (neighbors=27)	0.9562
KNN Test (neighbors=28)	0.9562
KNN Test (neighbors=29)	0.9562
KNN Test (neighbors=30)	0.9562
KNN Test (neighbors=31)	0.9562
KNN Test (neighbors=32)	0.9489
KNN Test (neighbors=33)	0.9489
KNN Test (neighb

In [26]:
# Print the best model info
print("Best model: {} neighbors, accuracy {}".format(knn_lowest_neighbors, "{:.4f}".format(knn_highest_accuracy)))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

Best model: 13 neighbors, accuracy 0.9708
Model Confusion Matrix:
 [[86  0]
 [ 4 47]]
Train data F1-Score for class '1': 0.9591836734693877
Train data F1-Score for class '0': 0.9772727272727273


## Final Model Selection

Here I enumerate through the best model for each type of classifier; do cross validation; and output AUC, accuracy, and info on the hyperparameters of our best of each type of model.  I'll use this to select the best model.

The best model I will choose is the one with the highest AUC that has the lowest standard deviation and the one with the highest accuracy.

After this I will load the test.csv we need to predict over, do predictions, encode the class label outputs of the predictions, and save the final csv.

In [27]:
# Do some cross validation on all models using AUC
for model, accuracy in zip(best_models, model_accuracies):
        score = cross_val_score(estimator=model,
                                X=X_train,
                                y=y_train,
                                cv=10,
                                scoring='roc_auc')
        print(f'ROC AUC: {score.mean():.2f} '
                f'(+/- {score.std():.2f}) [Accuracy {accuracy:.4f}]', model)

ROC AUC: 0.94 (+/- 0.05) [Accuracy 0.9635] Perceptron(eta0=5.0)
ROC AUC: 0.91 (+/- 0.07) [Accuracy 0.9927] LogisticRegression(C=0.75, random_state=17, solver='saga')
ROC AUC: 0.97 (+/- 0.03) [Accuracy 0.9854] SVC(C=2.0, random_state=17)
ROC AUC: 0.91 (+/- 0.07) [Accuracy 0.9489] DecisionTreeClassifier(max_depth=6, random_state=17)
ROC AUC: 0.99 (+/- 0.01) [Accuracy 0.9854] RandomForestClassifier(criterion='log_loss', n_estimators=300, random_state=17)
ROC AUC: 0.97 (+/- 0.05) [Accuracy 0.9708] KNeighborsClassifier(n_neighbors=13)


## Predict over our model and save the result

My first submission to Kaggle was based on the best AUC of 0.99 +/- 0.01 with an expected accuracy of 0.9854 and used the Random Forest Classifier using the log_loss criterion with 300 estimators.  Below is from my second submission, which used the Logistic Regression model to get my final submission.

I then did predictions using the choice model, which was saved as the lr_best_model. Then we do the inverse transform of the class labels on the output of the prediction, cobble together a resulting dataset for submission, and save it.

In [29]:
# Do predictions on the submission test set and save the output as csv
data_test_input = pd.read_csv('datasets/breast-cancer-wisconsin-data/test.csv') # get the test inputs

output_ids = data_test_input['id']                  # set the IDs we'll output but don't predict on them
data_test=data_test_input.drop(labels='id', axis=1) # remove the zero-information column "id"

X_final_test = scaler.transform(data_test)          # scale the data with the pre-existing scaler values

y_pred = lr_best_model.predict(X_final_test)   # use the model to predict outcomes
decoded_labels = encoder.inverse_transform(y_pred)  # reverse the label encoding to get B/M on the result

output_df = pd.DataFrame(output_ids)                # prep a dataframe for our output
output_df = output_df.assign(label=decoded_labels)  # append the predictions to the IDs

output_df.to_csv("DonKrapohl_project1_submission.csv", index=False) # write the csv

print(output_df)

print('csv written.')                               # complete

           id label
0      906564     B
1       85715     M
2      891670     B
3      874217     M
4      905680     M
..        ...   ...
109     87164     M
110  84348301     M
111    859471     B
112    911150     B
113  90944601     B

[114 rows x 2 columns]
csv written.


## Conclusion

My choice of models used Logistic Regression with C=0.75 employing the SAGA solver. Submitted to Kaggle it achieved 0.98245 accuracy.  My first submission used Random Forest and, while it scored well locally, it only got to 0.95614 over the final test set.