# Predicting if a Breast Lump is Malignant or Benign with K Nearest Neighbors

In this notebook, we use a K Nearest Neighbors model to predict whether a breast lump is malignant or benign, based on characteristics of the lump, the cells that make it up, and the nuclei of those cells. Each lump has been assigned a class: 2 if it is benign, 4 if it is malignant. This is the label for our model; that is, we will use K Nearest Neighbors to assign this 'class' value to the test data.
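
To make the idea concrete, here is a minimal sketch of what a K Nearest Neighbors classifier does: it finds the k training points closest to a new point and takes a majority vote of their labels. The function `knn_predict` and the toy data are made up for illustration; the notebook itself relies on scikit-learn's implementation, which is more efficient and handles ties and edge cases differently.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict a label for x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two features, labels 2 = benign, 4 = malignant
X_toy = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]])
y_toy = np.array([2, 2, 2, 4, 4, 4])

print(knn_predict(X_toy, y_toy, np.array([2, 2]), k=3))  # 2
print(knn_predict(X_toy, y_toy, np.array([8, 8]), k=3))  # 4
```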

Something to note: This data uses ? in place of null values. This will have to be cleaned before we start to train our model.

Let's start by importing some libraries we will need and taking a look at the data.

In [1]:
import numpy as np
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd
import pickle

df = pd.read_csv('breast-cancer-wisconsin.csv')
df.head()

Unnamed: 0,id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei_,bland_chrom,norm_nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
id                         699 non-null int64
clump_thickness            699 non-null int64
 unif_cell_size            699 non-null int64
 unif_cell_shape           699 non-null int64
 marg_adhesion             699 non-null int64
 single_epith_cell_size    699 non-null int64
 bare_nuclei_              699 non-null object
 bland_chrom               699 non-null int64
 norm_nucleoli             699 non-null int64
 mitoses                   699 non-null int64
 class                     699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


It looks like the data has no null values, but that is only because ? stands in for the missing values. A hint is that bare_nuclei_ is read as object rather than int64. We will have to clean this before we can get started.
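
One way to confirm this is to count the ? entries directly. The sketch below uses a small made-up DataFrame rather than the real file, and the column names are only assumed to match; on the real data the same comparison would be applied to `df`.

```python
import pandas as pd

# Small made-up sample that mimics the bare_nuclei_ column
sample = pd.DataFrame({
    "clump_thickness": [5, 5, 3, 6],
    "bare_nuclei_": ["1", "10", "?", "4"],
})

# info() counts "?" as a valid string, so check for it explicitly
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)
```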

In [3]:
df.describe()

Unnamed: 0,id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bland_chrom,norm_nucleoli,mitoses,class
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,1071704.0,4.41774,3.134478,3.207439,2.806867,3.216023,3.437768,2.866953,1.589413,2.689557
std,617095.7,2.815741,3.051459,2.971913,2.855379,2.2143,2.438364,3.053634,1.715078,0.951273
min,61634.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,870688.5,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0
50%,1171710.0,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0
75%,1238298.0,6.0,5.0,5.0,4.0,4.0,5.0,4.0,1.0,4.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


Let's do some data prep. First, we will deal with the nulls (the data here has ?s in place of nulls).

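A common trick is to replace the missing values with an extreme sentinel such as -99999. Many algorithms will effectively treat it as an outlier, and using it means we keep the other values in the row instead of dropping the whole row. Let's convert the ? values to this.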

This dataset has only 16 ? values, so we could also drop those rows (for example, by converting ? to NaN and calling df.dropna) without losing much data.

In [4]:
df.replace('?', -99999, inplace=True)
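
As an alternative to the sentinel value, the rows containing ? could be dropped instead. This is a minimal sketch on a small made-up DataFrame (the names `sample` and `cleaned` are illustrative, not part of the notebook); it is not what the rest of this notebook does.

```python
import numpy as np
import pandas as pd

# Small made-up sample; the real data has more columns and 699 rows
sample = pd.DataFrame({
    "clump_thickness": [5, 5, 3, 6],
    "bare_nuclei_": ["1", "10", "?", "4"],
    "class": [2, 2, 2, 4],
})

# Turn the "?" placeholders into real missing values, then drop those rows
cleaned = sample.replace("?", np.nan).dropna()
# The column was read as strings, so convert it back to integers
cleaned = cleaned.astype({"bare_nuclei_": int})

print(len(sample), "->", len(cleaned))
print(cleaned.dtypes)
```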

id is an irrelevant field, so we can drop it.

In [5]:
# Pass axis by keyword; the positional form is no longer accepted in pandas 2.0+
df.drop(['id'], axis=1, inplace=True)

The name of the class column has a leading space, so let's rename it.

In [6]:
df = df.rename(columns={' class': 'class'})

Now let's set our X and y values. Remember, 'class' is our label, so this will be our y value.

In [7]:
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

Let's split the data and train a model.

In [8]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

And we'll take a look to see how the model did.

In [9]:
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.9642857142857143


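That's not bad, though for something as serious as a test for breast cancer, we would like to see something like 99.9% accuracy. Keep in mind that train_test_split shuffles the data randomly, so the exact score changes from run to run. Regardless, this model appears to be a good predictor.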
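
A single accuracy number hides which kind of mistake the model makes, and for a cancer test a malignant lump predicted as benign matters most. The sketch below is one way to look at that, using synthetic stand-in data (not the real dataset) so that it runs on its own. The names `X_demo`, `y_demo`, and `clf_demo` are illustrative, and `random_state=42` is an arbitrary choice that only makes the split repeatable.

```python
import numpy as np
from sklearn import model_selection, neighbors
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (not the real dataset): 9 integer features per sample
rng = np.random.default_rng(0)
X_benign = rng.integers(1, 5, size=(60, 9))      # values 1..4
X_malignant = rng.integers(5, 11, size=(40, 9))  # values 5..10
X_demo = np.vstack([X_benign, X_malignant])
y_demo = np.array([2] * 60 + [4] * 40)

# random_state fixes the shuffle so the split, and therefore the score, is repeatable
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
clf_demo = neighbors.KNeighborsClassifier().fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_te, clf_demo.predict(X_te), labels=[2, 4])
print(cm)
```

In this layout, `cm[1, 0]` counts malignant samples that were predicted benign.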

Let's save it for later so we don't have to retrain it every time.

In [10]:
with open('KNearestNeighbors.pickle', 'wb') as f:
    pickle.dump(clf, f)
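
Loading the model back is the mirror image of saving it. Here is a minimal round-trip sketch on a tiny made-up training set; the file name `knn_demo.pickle` and the variables `model` and `restored` are hypothetical and exist only for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import neighbors

# Tiny made-up training set, only to have a fitted model to save
X_small = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]])
y_small = np.array([2, 2, 2, 4, 4, 4])
model = neighbors.KNeighborsClassifier(n_neighbors=3).fit(X_small, y_small)

# Save to a temporary path (hypothetical file name) and load it back
path = os.path.join(tempfile.mkdtemp(), "knn_demo.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model predicts without being retrained
print(restored.predict(np.array([[2, 2], [8, 8]])))
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.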

Now we can use the model to predict new values. The input has to be laid out the same way as the training data, since that is what the model was trained on: a 2-D array with one row per sample and the nine feature columns in the same order. A single sample can easily be reshaped into that form.

In [11]:
example_measures = np.array([4,2,1,1,1,2,3,2,1])
example_measures = example_measures.reshape(1,-1)

prediction = clf.predict(example_measures)

print(prediction)

[2]
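
To see what the reshape in the cell above does, it helps to print the array shapes. This small sketch uses the same nine values; the names `sample_1d` and `sample_2d` are just for illustration.

```python
import numpy as np

sample_1d = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
print(sample_1d.shape)  # (9,): nine values, no notion of rows

# One row, and -1 lets NumPy work out the number of columns
sample_2d = sample_1d.reshape(1, -1)
print(sample_2d.shape)  # (1, 9): one sample with nine features
```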


What if we had 2 samples we would like to predict for? Easy: we reshape the array to two rows and feed it to the model.

In [12]:
#What if we had 2 samples?
#we reshape for 2
example_measures = np.array([[4,2,1,1,1,2,3,2,1], [4,2,1,2,2,2,3,2,1]])
example_measures = example_measures.reshape(2,-1)

prediction = clf.predict(example_measures)

print(prediction)

[2 2]


Actually, there's an even easier way that works for any number of samples we want predicted.

In [13]:
#What if we have a bunch? We don't want to hard code them all for reshaping
example_measures = example_measures.reshape(len(example_measures),-1)

prediction = clf.predict(example_measures)

print(prediction)

[2 2]
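
One caveat: `len()` gives the number of samples only when the array is already 2-D. For a single 1-D sample it returns the number of features instead. A possible alternative, sketched here with illustrative variable names, is NumPy's `np.atleast_2d`, which handles both cases.

```python
import numpy as np

one_sample = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
two_samples = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1],
                        [4, 2, 1, 2, 2, 2, 3, 2, 1]])

# len() of a 1-D array is the number of features, not the number of samples
print(one_sample.reshape(len(one_sample), -1).shape)  # (9, 1)

# np.atleast_2d leaves 2-D input alone and turns 1-D input into one row
print(np.atleast_2d(one_sample).shape)   # (1, 9)
print(np.atleast_2d(two_samples).shape)  # (2, 9)
```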


Looks like these two examples are predicted to be benign.