# Mini-Kaggle Project 2: Adult Income Classification
Don Krapohl

## Summary

I used looping to try out many hyperparameters to choose the best model over the preprocessed data. 


## Requirements

Predict the income level of individuals based on various demographic and personal information.

- Split the provided dataset into a suitable training set and testing set.
- Train various classifiers on your training set, including Logistic Regression, SVM, Decision Trees, KNN, and Random Forest.
- Compare the performance of the classifiers and submit your best score.
- Aim to beat the benchmark performance of 78% on the hidden test dataset. If your performance is lower than 78%, you will lose the entire 70% of the grade allocated for the coding portion of this assignment.
- Note that the provided test dataset (`test.csv`) does not have target variables and is solely for testing your submission within our Kaggle system. Your training dataset (`train.csv`) is all you have, so split it appropriately for training and evaluation purposes.
- Submit your final notebook to Canvas for peer-review.
- In your notebook, use markdown to explain your steps, rationale, and exploration of the model's performance on various classifiers. Clearly mention which classifier you have decided on and your rationale to submit it as your final submission on Kaggle.
- Your Kaggle team name should be exactly identical to your name in Canvas.
- Your notebook submitted on Canvas will be peer-reviewed for further evaluation.
- Submit your submission file on Kaggle and notebook on Canvas by Nov 17, 2024, 11:59 PM.

### We will be training, testing, and using the following models:

* Logistic Regression
* SVM
* Decision Trees
* Random Forest
* KNN

Ref: https://scikit-learn.org/1.5/supervised_learning.html

## Approach

I will:
* load from train.csv
* explore the data for quality and distribution
* discretize the continous variable "income"
* scale the features
* do feature selection to reduce dimensionality
* remove and encode string columns
* split the data into train and test sets
* train all the models and select the one with the best accuracy
    
I'll train multiple models for each of the classifier algorithms and capture the one of each type that has the highest accuracy. After all are trained and tested I'll select the one with the highest accuracy and highest AUC as my submission.  I'll then predict over the test.csv file and submit the results.

## Environment Setup

### Establish environment
1. Download the python venv for the project from https://github.com/dkrapohl/uwf-venv-breast-cancer/tree/main (I'm using the same one from project 1)
2. Activate the environment using the README from that repo
3. Set the Jupyter environment to use this kernel (top right of this window)

If we need to reproduce the environment the private repo for this notebook has a pip requirements.txt

## Data Exploration

### Import required libraries.


In [15]:
# basic dataframe and operations
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# manipulation and preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score

# models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# measuring results
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# warning suppression
import warnings
from sklearn.exceptions import ConvergenceWarning

In [2]:
# Suppress ConvergenceWarnings and UserWarning.  They're noise here.
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)

### Create a few collections to capture the info on our different models

To explore the accuracies of multiple model hyperparameters I'll be training multiple models and keeping the model from each type of classifier that has the highest accuracy.  I suspect there's an easier way to accomplish this but this is what I can do at this point.


In [3]:
# collections we'll use for our best of each type of model
best_models = []                    # List of model instances that are our best for final evaluation
model_accuracies = []               # The accuracies of our best models in a key-value dictionary

Here I'll load the data into an initial dataframe to be used for exploration and the start of preprocessing.


In [6]:
# Import the csv training dataset to a pandas dataframe
# The dataset is expected in the same directory as this notebook
#   under a subfolder path datasets/breast-cancer-wisconsin-data/
data_train = pd.read_csv('datasets/train.csv')
# Show the shape of the dataset
data_train.shape


(39073, 16)

### Exploratory Data Analysis

Output the first 5 rows of the data to see the general character and nature of the data like missing values, obvious dirty data, features with very large ranges, etc.

In [9]:
# Display a few rows from the training data
data_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,id
0,78,Private,111189,7th-8th,4,Never-married,Machine-op-inspct,Not-in-family,White,Female,0,0,35,Dominican-Republic,0,26052
1,49,Self-emp-inc,122066,Some-college,10,Divorced,Sales,Not-in-family,White,Male,0,0,25,United-States,0,47049
2,62,Self-emp-not-inc,168682,7th-8th,4,Married-civ-spouse,Sales,Husband,White,Male,0,0,5,United-States,0,33915
3,18,Private,110230,10th,6,Never-married,Other-service,Own-child,White,Male,0,0,11,United-States,0,22132
4,40,Private,373050,12th,8,Married-civ-spouse,Other-service,Husband,White,Male,0,0,40,?,0,46452


### Look at the data to make sure we don't have null or missing data

This will give the count of null values for each column to see if we need to handle missing data.

In [7]:
# Get the count of nulls per column
data_train.isnull().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
id                 0
dtype: int64

In [8]:
# Get the count of nulls per column
data_train.isna().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
id                 0
dtype: int64

### I prefer to verify object data types. Not important here but at scale it definitely is.

Sometimes numeric data come in as object, which can make lookups and indexing inefficient.

In [None]:
# Verify data types to see if there's a better explicit cast for any feature or if we need to encode anything
data_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39073 entries, 0 to 39072
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              39073 non-null  int64 
 1   workclass        39073 non-null  object
 2   fnlwgt           39073 non-null  int64 
 3   education        39073 non-null  object
 4   educational-num  39073 non-null  int64 
 5   marital-status   39073 non-null  object
 6   occupation       39073 non-null  object
 7   relationship     39073 non-null  object
 8   race             39073 non-null  object
 9   gender           39073 non-null  object
 10  capital-gain     39073 non-null  int64 
 11  capital-loss     39073 non-null  int64 
 12  hours-per-week   39073 non-null  int64 
 13  native-country   39073 non-null  object
 14  income           39073 non-null  int64 
 15  id               39073 non-null  int64 
dtypes: int64(8), object(8)
memory usage: 4.8+ MB



Need to encode (index):
> workclass - 1  
> education - 3  
> marital-status - 4  
> occupation - 6  
> relationship - 7  
> race - 8  
> gender - 9  
> native-country - 13  
    
The other features need to be scaled (index 0, 2, 4, 10, 11, 12, 14, 15)  

Income probably needs to be discretized (14)  

ID needs to be removed from training (15)  



### Get the statistics about the data and their distribution.

I'm looking here for any columns with differing counts and any outrageous outliers.

In [50]:
# Display basic metrics about each feature, like count, mean, std, min/max, and IQR
data_train.describe()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,income,id
count,39073.0,39073.0,39073.0,39073.0,39073.0,39073.0,39073.0,39073.0
mean,38.588207,190071.4,10.072556,1067.195327,86.108796,40.390269,0.238144,24465.705372
std,13.695509,105983.9,2.570352,7426.475044,399.34239,12.335446,0.425953,14072.213508
min,17.0,13492.0,1.0,0.0,0.0,1.0,0.0,1.0
25%,28.0,117556.0,9.0,0.0,0.0,40.0,0.0,12319.0
50%,37.0,178478.0,10.0,0.0,0.0,40.0,0.0,24492.0
75%,48.0,238367.0,12.0,0.0,0.0,45.0,0.0,36610.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0,1.0,48842.0


I see that some of the features, such as capital gain/loss have huge variability with the IQR being 0 but the max is 4356.

## Preprocessing

### Adjust columns as needed

1. Remove ID and make a dataframe of our target predictions. We need to remove income and ID from the training data. The former is the has no information and the latter is the value we're predicting.

In [None]:
# remove the result column from the input parameters
# also remove the ID column. It carries no signal.
X_train_without_label = data_train.drop('income', axis=1).drop('id', axis=1)

# Assign the variable we're targeting for the input data
y_values = data_train['income']      # assign the income continuous variable we'll discretize

0    0
1    0
2    0
3    0
4    0
Name: income, dtype: int64

2. Encode categorical values - I'll use the ColumnTransformer to encode multiple columns here

### Split and scale the data

Here I'll be splitting the data to be 70% training set, 30% test set. I then scale the data using the Standard Scaler to make all the data within the same range having 0 as the mean and standard deviation of 1.

In [30]:
# Do train test split
X_train, X_test, y_train, y_test =    train_test_split(X_train_without_label, y_values,
    test_size=0.3, 
    random_state=17)

In [61]:
# Make a list of string categorical cols we need to encode
categorical_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country']
# Make a list of features we need to scale
numeric_features = ['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']

ohc = OneHotEncoder()           # creating a one hot encoder to do our column transforms
ss = StandardScaler()           # creating a scaler for transforming

feature_encoder = ColumnTransformer(
    transformers = [('onehot', ohc, categorical_features), 
                    ('scaler', ss, numeric_features)],
    remainder = 'passthrough'   # we have the numeric columns so we want to preserve those for scaling
)

X_transformed = feature_encoder.fit_transform(X_train)

df_transformed = pd.DataFrame(
    X_transformed,
    columns=feature_encoder.get_feature_names_out()
)
X_test_clean = feature_encoder.fit(X_test)

df_transformed.head()

ValueError: Shape of passed values is (27351, 1), indices imply (27351, 108)

## Logistic Regression Train and Test

I'm looping through hyperparameters and training multiple models, keeping the "best" based on accuracy. The secondary consideration is that I will prefer the lowest C value as lower C values make simpler models.

This model assumes linearity in independent variables and performs well if the data are linearly separable. It'll also give use proababilities on the class predictions.

In [58]:
# In a loop over several values, train an SVM with different C values and capture the accuracies
# We prefer lower C for better power so we'll only update if accuracy is higher and C lower

solvers = ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']
cs = [100.0, 50.0, 20.0, 10.0, 5.0, 3.0, 2.0, 1.0, 0.75, 0.5, 0.1, 0.01, 0.001, 0.0001]  # C values to try

# I'll also keep track of the highest accuracy for the lowest C as our "best" model
lr_highest_accuracy=0
lr_lowest_c=1000.0

# Try all SVM kernels
for solver in solvers:
    # Try the range of C values
    for c in cs:
        # Initialize and train the SVM model
        logreg_model = LogisticRegression(solver=solver, C=c, random_state = 17)
        logreg_model.fit(X_train_clean, y_train)
        
        # Collect info on training results if desired
        # model_train_score = logreg_model.score(X_train_scaled, y_train)                    # get the model accuracy
        # print("{}\t{}".format("Logistic Regression (solver={})".format(solver), "{:.4f}".format(model_test_score)))
            
        # Make predictions using training data
        y_pred = logreg_model.predict(X_test_clean)
        model_test_score = accuracy_score(y_test, y_pred)

        # Print the accuracies for each of the model params so far.   
        print("{}\t{}".format("Logistic Regression (solver={})".format(solver), "{:.4f}".format(model_test_score)))
            
        # we want the lowest C for better generalization so only keep
        #   accuracy if it's better but C is lower
        if model_test_score >= lr_highest_accuracy: # we're in a list with decreasing values so don't need to check C
            test_predictions = y_pred                   # store test predictions
            lr_lowest_c = c                             # store the lowest C
            lr_best_model = logreg_model                # store the best model
            lr_best_solver = solver                     # store the best solver
            lr_highest_accuracy = model_test_score       # update our highest score

best_models.append(lr_best_model)                       # add to our "best model" collection      
model_accuracies.append(lr_highest_accuracy)                # add the accuracy      

ValueError: Expected 2D array, got scalar array instead:
array=ColumnTransformer(remainder='passthrough',
                  transformers=[('onehot', OneHotEncoder(),
                                 ['workclass', 'education', 'marital-status',
                                  'occupation', 'relationship', 'race',
                                  'gender', 'native-country']),
                                ('num', StandardScaler(),
                                 ['age', 'fnlwgt', 'educational-num',
                                  'capital-gain', 'capital-loss',
                                  'hours-per-week'])]).
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [None]:
# Print the best model info
print("Best model: {} solver, c {}, accuracy {}".format(lr_best_solver, c, "{:.4f}".format(lr_highest_accuracy)))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

## SVM Train and Test

As with Logistic Regression, looping to train multiple models, keeping the one with the highest accuracy for the lowest C value.  C value again we prefer lower C as it makes for simpler models.

This model will be computationally more expensive than the others, especially since I'm testing multiple C values on all kernels. The algorithm is more flexible in its ability to model both linear and non-linear data.

In [None]:
# In a loop over several values, train an SVM with different C values and capture the accuracies
# We prefer lower C for better power so we'll only update if accuracy is higher and C lower
cs = [100.0, 50.0, 20.0, 10.0, 5.0, 3.0, 2.0, 1.0, 0.75, 0.5, 0.1, 0.01, 0.001, 0.0001]  # C values to try
kernels = ['linear', 'rbf', 'poly', 'sigmoid']

# I'll also keep track of the highest accuracy for the highest eta as our "best" model
svm_highest_accuracy=0
svm_lowest_c=1000

# Try all SVM kernels
for kernel in kernels:
    # Loop through C values largest to smallest, train and test each
    for c in cs:
        # Initialize and train the SVM model
        linear_svm = SVC(kernel=kernel, C=c, random_state=17)
        linear_svm.fit(X_train_scaled, y_train)
        
        # Collect info on training results if desired
        # model_train_score = linear_svm.score(X_train_scaled, y_train)                    # get the model accuracy
        # print("{}\t{}".format("SVM Train (kernel={}, C={})".format(kernel, c), "{:.4f}".format(model_test_score)))
            
        # Make predictions using training data
        y_pred = linear_svm.predict(X_test_scaled)
        model_test_score = accuracy_score(y_test, y_pred)

        # Print the accuracies for each of the model params so far.   
        print("{}\t{}".format("SVM (kernel={}, C={})".format(kernel, c), "{:.4f}".format(model_test_score)))
            
        # we want the lowest C for better generalization so only keep
        #   accuracy if it's better but C is lower
        if model_test_score >= svm_highest_accuracy: # we're in a list with decreasing values so don't need to check C
            svm_lowest_c = c                            # store lowest C
            test_predictions = y_pred                   # store test predictions
            svm_best_model = linear_svm                 # store the best model
            svm_best_kernel = kernel                    # store the best kernel
            svm_highest_accuracy = model_test_score     # update our highest score

best_models.append(svm_best_model)                      # add to our "best model" collection   
model_accuracies.append(svm_highest_accuracy)           # add the accuracy

In [None]:
# Print the best model info
print("Best model: {} kernel, accuracy {}, C {}".format(svm_best_kernel, "{:.4f}".format(svm_highest_accuracy), 
                                                        svm_lowest_c))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

## Decision Trees Train and Test

With Decision Trees we're testing different splitting algorithms as well as model depth keeping the one with the highest accuracy and the lowest depth.  Lowest depth is selected as it makes for the simpler model.

This algorithm should be the easiest to interpret and captures non-linear relationships but it's prone to overfitting.  It's also at risk of creating biased trees if the classes are imbalanced but the classes here are relatively equally represented.

In [None]:
# In a loop over several values, train a decision tree with different split criteria and depths and capture the accuracies
# We prefer lower C for better power so we'll only update if accuracy is higher and depth lower
criteria = ['gini', 'entropy', 'log_loss']

# I'll also keep track of the highest accuracy for the lowest depth as our "best" model
dtree_highest_accuracy=0
dtree_lowest_depth=1000
dtree_best_criterion=''

# Try all SVM kernels
for criterion in criteria:
    # Loop through depth values smallest to largest, train and test each
    for depth in range(50, 1, -1):   # Allow up to depth 5 starting at 5 and down to 1
        # Initialize and train the decision tree model
        decision_tree = DecisionTreeClassifier(criterion=criterion, max_depth=depth, random_state=17)
        decision_tree.fit(X_train_scaled, y_train)
        
        # Collect info on training results if desired
        # model_train_score = decision_tree.score(X_train_scaled, y_train)                    # get the model accuracy
        # print("{}\t{}".format("Decision Tree Train (criterion={}, depth={})".format(criterion, depth), "{:.4f}".format(model_test_score)))
            
        # Make predictions using training data
        y_pred = decision_tree.predict(X_test_scaled)
        model_test_score = accuracy_score(y_test, y_pred)

        # Print the accuracies for each of the model params so far.   
        print("{}\t{}".format("Decision Tree (criterion={}, depth={})".format(criterion, depth), "{:.4f}".format(model_test_score)))
            
        # we want the lowest C for better generalization so only keep
        #   accuracy if it's better but C is lower
        if model_test_score >= dtree_highest_accuracy:
            dtree_lowest_depth = depth              # store lowest depth
            test_predictions = y_pred               # store test predictions
            dtree_best_model = decision_tree        # store the best model
            dtree_best_criterion = criterion        # store the best kernel
            dtree_highest_accuracy = model_test_score   # update our highest score

best_models.append(dtree_best_model)                # add to our "best model" collection
model_accuracies.append(dtree_highest_accuracy)     # add the accuracy

In [None]:
# Print the best model info
print("Best model: {} criterion, depth {}, accuracy {}".format(dtree_best_criterion, dtree_lowest_depth, 
                                                               "{:.4f}".format(dtree_highest_accuracy)))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

## Random Forest Train and Test

This is very much like Decision Trees except we're tuning splitting criteria as well as the number of estimators used.  We'll prefer fewer estimators for simpler models.

The Random Forest algorithm is harder to interpret because it's an ensemble of decision trees but it's less likely to overfit. It can be harder to compute than a single or a few decision trees.

In [None]:
# In a loop over several values, train a random forest with different split criteria and estimator count and capture the accuracies
criteria = ['gini', 'entropy', 'log_loss']

# I'll also keep track of the highest accuracy for the lowest number of estimators as our "best" model
rforest_highest_accuracy=0
rforest_lowest_estimators=1000
rforest_best_criterion=''

# Try all random forest split criteria
for criterion in criteria:
    # Loop through depth values smallest to largest, train and test each
    for estimators in range(500, 50, -50):   # Allow up to 500 estimators decreasing by 50 each loop
        # Initialize and train the decision tree model
        random_forest = RandomForestClassifier(criterion=criterion, n_estimators=estimators, random_state=17)
        random_forest.fit(X_train_scaled, y_train)
        
        # Collect info on training results if desired
        # model_train_score = random_forest.score(X_train_scaled, y_train)                    # get the model accuracy
        # print("{}\t{}".format("Random Forest Train (criterion={}, estimators={})".format(criterion, estimators), "{:.4f}".format(model_test_score)))
            
        # Make predictions using training data
        y_pred = random_forest.predict(X_test_scaled)
        model_test_score = accuracy_score(y_test, y_pred)

        # Print the accuracies for each of the model params so far.   
        print("{}\t{}".format("Random Forest Test (criterion={}, estimators={})".format(criterion, estimators), "{:.4f}".format(model_test_score)))
            
        # we want the lowest number of estimators for better generalization 
        if model_test_score >= rforest_highest_accuracy:
            rforest_lowest_estimators = estimators  # store lowest number of estimators
            test_predictions = y_pred               # store test predictions
            rforest_best_model = random_forest      # store the best model
            rforest_best_criterion = criterion      # store the best criterion
            rforest_highest_accuracy = model_test_score # update our highest score
            
best_models.append(rforest_best_model)              # add to our "best model" collection        
model_accuracies.append(rforest_highest_accuracy)   # add the accuracy    


In [None]:
# Print the best model info
print("Best model: {} criterion, estimators {}, accuracy {}".format(rforest_best_criterion, rforest_lowest_estimators, 
                                                                    "{:.4f}".format(rforest_highest_accuracy)))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

## KNN Train and Test

This is simpler as I'm only tuning the number of neighbors and keeping the one with the highest accuracy. We'll prefer the model with the fewest neighbors.

KNN, while computational a bit expensive is simple for this problem and doesn't appear to require much tuning.

In [None]:
# In a loop over several values, train a knn movel with different number of neighbors

# I'll also keep track of the highest accuracy for the lowest # nieghbors as our "best" model
knn_highest_accuracy=0
knn_lowest_neighbors=1

# Loop through neighbor values, train and test each
for n_neighbors in range(2, 50):   # Allow up to 50 neighbors increasing by 1 each loop
    # Initialize and train the decision tree model
    knn_model = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_model.fit(X_train_scaled, y_train)
    
    # Collect info on training results if desired
    # model_train_score = knn_model.score(X_train_scaled, y_train)                    # get the model accuracy
    # print("{}\t{}".format("KNN Train (neighbors={})".format(n_neighbors), "{:.4f}".format(model_test_score)))
        
    # Make predictions using training data
    y_pred = knn_model.predict(X_test_scaled)
    model_test_score = accuracy_score(y_test, y_pred)

    # Print the accuracies for each of the model params so far.   
    print("{}\t{}".format("KNN Test (neighbors={})".format(n_neighbors), "{:.4f}".format(model_test_score)))
        
    # we want the lowest number of estimators for better generalization 
    if model_test_score >= knn_highest_accuracy:
        knn_lowest_neighbors = n_neighbors      # store lowest number of estimators
        test_predictions = y_pred               # store test predictions
        knn_best_model = knn_model              # store the best model
        knn_highest_accuracy = model_test_score # update our highest score

best_models.append(knn_best_model)              # add to our "best model" collection  
model_accuracies.append(knn_highest_accuracy)   # add the accuracy    

In [None]:
# Print the best model info
print("Best model: {} neighbors, accuracy {}".format(knn_lowest_neighbors, "{:.4f}".format(knn_highest_accuracy)))

# Print the confusion matrix for the best model
print("Model Confusion Matrix:\n", confusion_matrix(y_test, test_predictions))
print("Train data F1-Score for class '1':", f1_score(y_test, test_predictions, pos_label=1))
print("Train data F1-Score for class '0':", f1_score(y_test, test_predictions, pos_label=0))

## Final Model Selection

Here I enumerate through the best model for each type of classifier; do cross validation; and output AUC, accuracy, and info on the hyperparameters of our best of each type of model.  I'll use this to select the best model.

The best model I will choose is the one with the highest AUC that has the lowest standard deviation and the one with the highest accuracy.

After this I will load the test.csv we need to predict over, do predictions, encode the class label outputs of the predictions, and save the final csv.

In [None]:
# Do some cross validation on all models using AUC
for model, accuracy in zip(best_models, model_accuracies):
        score = cross_val_score(estimator=model,
                                X=X_train,
                                y=y_train,
                                cv=10,
                                scoring='roc_auc')
        print(f'ROC AUC: {score.mean():.2f} '
                f'(+/- {score.std():.2f}) [Accuracy {accuracy:.4f}]', model)

## Predict over our model and save the result

My first submission to Kaggle was based on the best AUC of 0.99 +/- 0.01 with an expected accuracy of 0.9854 and used the Random Forest Classifier using the log_loss criterion with 300 estimators.  Below is from my second submission, which used the Logistic Regression model to get my final submission.

I then did predictions using the choice model, which was saved as the lr_best_model. Then we do the inverse transform of the class labels on the output of the prediction, cobble together a resulting dataset for submission, and save it.

In [None]:
# Do predictions on the submission test set and save the output as csv
data_test_input = pd.read_csv('datasets/breast-cancer-wisconsin-data/test.csv') # get the test inputs

output_ids = data_test_input['id']                  # set the IDs we'll output but don't predict on them
data_test=data_test_input.drop(labels='id', axis=1) # remove the zero-information column "id"

X_final_test = scaler.transform(data_test)          # scale the data with the pre-existing scaler values

y_pred = lr_best_model.predict(X_final_test)   # use the model to predict outcomes
decoded_labels = encoder.inverse_transform(y_pred)  # reverse the label encoding to get B/M on the result

output_df = pd.DataFrame(output_ids)                # prep a dataframe for our output
output_df = output_df.assign(label=decoded_labels)  # append the predictions to the IDs

output_df.to_csv("DonKrapohl_project1_submission.csv", index=False) # write the csv

print(output_df)

print('csv written.')                               # complete

## Conclusion

My choice of models used Logistic Regression with C=0.75 employing the SAGA solver. Submitted to Kaggle it achieved 0.98245 accuracy.  My first submission used Random Forest and, while it scored well locally, it only got to 0.95614 over the final test set.