Classifier for breast cancer data from Kaggle. 

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
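The field numbering described above can be sketched as a small lookup. This is only an illustration: the helper name `field_to_column` is hypothetical, and the column spellings (including the `_worst` suffix and the space in `concave points`) are assumed to follow the Kaggle CSV header.

```python
# Ten base measurements in the order listed above (Kaggle spelling assumed).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Fields 3-12 hold the means, 13-22 the standard errors, 23-32 the "worst" values.
STATISTICS = ["mean", "se", "worst"]


def field_to_column(field):
    """Map a 1-based field number (3 to 32) to its assumed column name."""
    if not 3 <= field <= 32:
        raise ValueError("feature fields run from 3 to 32")
    stat_index, feature_index = divmod(field - 3, len(BASE_FEATURES))
    return f"{BASE_FEATURES[feature_index]}_{STATISTICS[stat_index]}"


for field in (3, 13, 23):
    print(field, field_to_column(field))
# 3 radius_mean
# 13 radius_se
# 23 radius_worst
```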

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

In [43]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

Read in the data

In [44]:
data = pd.read_csv("data.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non

Split the data into training and testing sets

In [45]:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=21)
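
As a quick sanity check on the split sizes, the arithmetic can be sketched without scikit-learn. This assumes `train_test_split` rounds the test count up when `test_size` is a fraction, which is consistent with the 114 test rows that appear in the test-set confusion matrix later in the notebook.

```python
import math

n_samples = 569   # rows reported by data.info()
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up, and training gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 455 114
```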

Create NumPy arrays of the diagnoses, then boolean target vectors (True = malignant) for classification

In [46]:
diag_train = train_set["diagnosis"].to_numpy()
diag_test = test_set["diagnosis"].to_numpy()

train_cf = (diag_train == 'M')
test_cf = (diag_test == 'M')

Build arrays for the attributes to feed to the classifier

In [47]:
# Drop unwanted columns
train_data_pure = train_set.drop(["id","diagnosis","Unnamed: 32"], axis=1)
test_data_pure = test_set.drop(["id","diagnosis","Unnamed: 32"], axis=1)

# Cast to a numpy array for the classifier
train_arr = train_data_pure.to_numpy()
test_arr = test_data_pure.to_numpy()

Build and test the classifier

In [48]:
sgd_clf = SGDClassifier(random_state=21)
sgd_clf.fit(train_arr, train_cf)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=21, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [49]:
# How do we do on 3 cross-validation folds?
cross_val_score(sgd_clf, train_arr, train_cf, cv=3, scoring="accuracy")

array([0.88815789, 0.94078947, 0.9205298 ])

In [50]:
# Check the confusion matrix (rows = actual class, columns = predicted class)
# TN FP
# FN TP
train_pred = cross_val_predict(sgd_clf, train_arr, train_cf, cv=3)
confusion_matrix(train_cf, train_pred)

array([[273,   9],
       [ 29, 144]])

In [51]:
# Precision - fraction of predicted positives that are actually positive: TP / (TP + FP)
precision_score(train_cf, train_pred)

0.9411764705882353

In [52]:
# Recall - fraction of actual positives we correctly identified: TP / (TP + FN)
recall_score(train_cf, train_pred)

0.8323699421965318

In [53]:
# F1 - harmonic mean of precision and recall
f1_score(train_cf, train_pred)

0.8834355828220859
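
The three scores above can be reproduced by hand from the cross-validated confusion matrix, which helps confirm which cell is which. The sketch below assumes scikit-learn's layout of `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes); the helper name `scores_from_confusion_matrix` is hypothetical.

```python
def scores_from_confusion_matrix(matrix):
    """Return (precision, recall, f1, accuracy) for a 2x2 matrix.

    Assumes the layout [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy


# Matrix reported by cross_val_predict on the training set.
cv_matrix = [[273, 9], [29, 144]]
precision, recall, f1, accuracy = scores_from_confusion_matrix(cv_matrix)
print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# 0.9412 0.8324 0.8834 0.9165
```

The pooled accuracy, 417 correct out of 455, is also consistent with the three fold accuracies reported by `cross_val_score`.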

Build a function to compute these three metrics (precision, recall, F1) as we adjust our classifier

In [54]:
# Input - two NumPy arrays of the same length: the target vector and the classifier's predictions
# Output - tuple of (precision, recall, f1)
def test_classifier(target, results):
    p = precision_score(target, results)
    r = recall_score(target, results)
    f1 = f1_score(target, results)
    return (p, r, f1)

Try it out on the test set

In [55]:
test_pred = sgd_clf.predict(test_arr)
n_correct = sum(test_pred == test_cf)
print(n_correct / len(test_pred))

0.9122807017543859


In [56]:
confusion_matrix(test_cf, test_pred)

array([[74,  1],
       [ 9, 30]])

In [57]:
test_classifier(test_cf, test_pred)

(0.967741935483871, 0.7692307692307693, 0.8571428571428572)