# Breast Cancer Predictions with SVM

In this workbook, we will build an SVM model that attempts to predict whether a breast lump is benign or malignant based on certain measured characteristics of the lump.

We will start by importing the libraries we will be using and loading the data into a dataframe.

In [1]:
import numpy as np
from sklearn import preprocessing, model_selection, neighbors, svm
import pandas as pd
import pickle

df = pd.read_csv('breast-cancer-wisconsin.csv')

Let's take a look at the data and explore a bit.

In [2]:
df.head()

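| | id | clump_thickness | unif_cell_size | unif_cell_shape | marg_adhesion | single_epith_cell_size | bare_nuclei_ | bland_chrom | norm_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |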


'class' indicates whether the tumor is malignant (value of 4) or benign (value of 2), so this will be our y value, or label.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
id                         699 non-null int64
clump_thickness            699 non-null int64
 unif_cell_size            699 non-null int64
 unif_cell_shape           699 non-null int64
 marg_adhesion             699 non-null int64
 single_epith_cell_size    699 non-null int64
 bare_nuclei_              699 non-null object
 bland_chrom               699 non-null int64
 norm_nucleoli             699 non-null int64
 mitoses                   699 non-null int64
 class                     699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


It looks like the data has no null values, but missing values are actually recorded as ?, which is why bare_nuclei_ is read as an object column rather than int64. We will have to clean this before we can get started.
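One way to confirm where the ? placeholders sit is to count them per column. The snippet below is a minimal sketch on a hypothetical miniature frame (`toy` and its values are made up for illustration, not taken from the CSV); the same `isin` call should work on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "clump_thickness": [5, 3, 6, 4],
    "bare_nuclei": ["1", "?", "4", "?"],
})

# isin() marks every cell equal to '?'; summing the booleans counts them per column.
placeholder_counts = toy.isin(["?"]).sum()
print(placeholder_counts)

# Columns stored as `object` are the ones most likely to hide placeholders.
print(toy.dtypes)
```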

In [4]:
df.describe()

| | id | clump_thickness | unif_cell_size | unif_cell_shape | marg_adhesion | single_epith_cell_size | bland_chrom | norm_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


Let's do some data prep. First, we will deal with the missing values (the data here has ? in place of nulls).

Replacing each ? with an extreme value such as -99999 lets us keep the rest of the row instead of dropping it; many algorithms will effectively treat such a value as an outlier, although some models are sensitive to it. Let's convert the ? values to this.

This dataset has only 16 ? values, so we could instead drop the affected rows (for example by converting ? to NaN and calling df.dropna()) without losing much data.

In [5]:
df.replace('?', -99999, inplace=True)
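The alternative of dropping the affected rows could look roughly like the sketch below. It runs on a hypothetical miniature frame (`toy`, with made-up values) rather than the real CSV, and it is only one possible approach, not the one this workbook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "clump_thickness": [5, 3, 6, 4],
    "bare_nuclei": ["1", "?", "4", "10"],
    "class": [2, 2, 4, 4],
})

# Turn the placeholder into a real missing value, then drop the rows that contain one.
cleaned = toy.replace("?", np.nan).dropna()

# The column was read as strings, so cast it back to numbers.
cleaned = cleaned.assign(bare_nuclei=pd.to_numeric(cleaned["bare_nuclei"]))

print(len(toy), "rows before,", len(cleaned), "rows after")
print(cleaned.dtypes)
```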

The id column carries no predictive information, so we drop it.

In [6]:
df.drop(columns=['id'], inplace=True)  # keyword form; the positional axis argument was removed in pandas 2.0

Like several other columns in the df.info() output above, the class column has a leading space in its name, so let's rename it.

In [7]:
df = df.rename(columns={' class': 'class'})

Now let's set our X and y values.

In [8]:
X = np.array(df.drop(columns=['class']))  # features: every column except the label
y = np.array(df['class'])

Let's split the data and train a model.

In [9]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
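Because the split above is random, the accuracy will vary from run to run. A hedged sketch of a reproducible, class-balanced split follows; it uses the `random_state` and `stratify` parameters of `train_test_split`, and the arrays `X_demo` and `y_demo` are synthetic stand-ins, not the real data.

```python
import numpy as np
from sklearn import model_selection

# Synthetic stand-in data: 100 samples, 9 features scored 1-10, labels 2 and 4.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([2] * 65 + [4] * 35)

# random_state fixes the shuffle; stratify keeps the 2/4 ratio equal in both parts.
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("malignant share in test set:", (y_te == 4).mean())
```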

In [10]:
clf = svm.SVC()  # SVC accepts many parameters; we use the defaults here
clf.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
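As an illustration of the parameters mentioned in the comment, the sketch below sets `kernel`, `C`, and `gamma` explicitly and adds feature scaling in front of the classifier. The data is synthetic and the labelling rule is invented for the demo, so the score says nothing about the real dataset; treat it as one possible configuration rather than a recommendation.

```python
import numpy as np
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 9 features scored 1-10.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(200, 9)).astype(float)
# Invented labelling rule for the demo: "malignant" (4) when the mean score is high.
y_demo = np.where(X_demo.mean(axis=1) > 5.5, 4, 2)

# Scaling first keeps any single feature from dominating the RBF kernel distance.
model = make_pipeline(
    StandardScaler(),
    svm.SVC(kernel="rbf", C=1.0, gamma="scale"),
)
model.fit(X_demo, y_demo)

print("training accuracy:", model.score(X_demo, y_demo))
```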

We can now test the accuracy of the model.

In [11]:
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.95
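A single random split gives only one estimate of accuracy. A rough way to see how much it varies is k-fold cross-validation with `model_selection.cross_val_score`; the sketch below runs on synthetic stand-in data with an invented labelling rule, so its numbers are not comparable to the result above.

```python
import numpy as np
from sklearn import model_selection, svm

# Synthetic stand-in data with an invented labelling rule.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(150, 9))
y_demo = np.where(X_demo.mean(axis=1) > 5.5, 4, 2)

# Five folds: train on four fifths of the data, score on the remaining fifth, five times.
scores = model_selection.cross_val_score(svm.SVC(), X_demo, y_demo, cv=5)

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```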


An accuracy of 0.95 on the held-out test set looks good. We can save this model as a pickle so we don't have to retrain it every time we want to use it in the future.

In [12]:
with open('svm_breast_cancer.pickle', 'wb') as f:  # named after the model it holds (an SVC, not k-nearest neighbors)
    pickle.dump(clf, f)
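Loading the model back is the mirror image of saving it. The sketch below is self-contained: it trains a small throwaway classifier on synthetic data, writes it to a temporary file (the path and the name `model_demo.pickle` are made up for the demo), and checks that the restored model predicts the same labels.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import svm

# Throwaway model trained on synthetic data, only so there is something to save.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(60, 9))
y_demo = np.where(X_demo.sum(axis=1) > 49, 4, 2)
clf_demo = svm.SVC().fit(X_demo, y_demo)

# Save to a temporary location.
path = os.path.join(tempfile.mkdtemp(), "model_demo.pickle")
with open(path, "wb") as f:
    pickle.dump(clf_demo, f)

# Load it back; note 'rb' (read binary) instead of 'wb'.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(clf_demo.predict(X_demo), restored.predict(X_demo))
print("restored model gives identical predictions:", same)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.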

Now we can use the model to predict new values. The input must have the same layout as the training data (the same nine features in the same order), since that is what the model was trained on. scikit-learn also expects a 2-D array with one row per sample, so we reshape a single sample into one row.

In [13]:
example_measures = np.array([4,2,1,1,1,2,3,2,1])
example_measures = example_measures.reshape(1,-1)

prediction = clf.predict(example_measures)

print(prediction)

[2]
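To see what the reshape does, it can help to print the array shapes before and after. This is a small NumPy-only sketch; the variable names `single` and `row` are chosen for the demo.

```python
import numpy as np

single = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
print(single.shape)  # (9,): a flat vector of nine values

# reshape(1, -1) means "one row, and infer the number of columns".
row = single.reshape(1, -1)
print(row.shape)     # (1, 9): one sample with nine features
```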


What if we had two samples we would like to predict for? Easy: we put both in one array, with one row per sample, and feed it to the model.

In [14]:
example_measures = np.array([[4,2,1,1,1,2,3,2,1], [4,2,1,2,2,2,3,2,1]])
example_measures = example_measures.reshape(2,-1)

prediction = clf.predict(example_measures)

print(prediction)

[2 2]


Actually, there's an even easier way that works for any number of samples, as long as they are given as a list of rows: let len() supply the number of rows.

In [15]:
example_measures = example_measures.reshape(len(example_measures),-1)

prediction = clf.predict(example_measures)

print(prediction)

[2 2]


Looks like these two examples are predicted to be benign.
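One caveat with the len() approach: for a single flat sample, len() returns the number of features (9), not the number of samples, so the reshape produces nine rows of one value each. A hedged alternative is NumPy's `np.atleast_2d`, wrapped here in a hypothetical helper `as_samples` that is not part of the original workbook.

```python
import numpy as np

def as_samples(measures):
    """Return a 2-D array with one row per sample (hypothetical helper)."""
    return np.atleast_2d(np.asarray(measures))

one = as_samples([4, 2, 1, 1, 1, 2, 3, 2, 1])
two = as_samples([[4, 2, 1, 1, 1, 2, 3, 2, 1], [4, 2, 1, 2, 2, 2, 3, 2, 1]])
print(one.shape, two.shape)  # (1, 9) (2, 9)

# For comparison, the len()-based reshape on a flat sample gives the wrong layout.
flat = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
print(flat.reshape(len(flat), -1).shape)  # (9, 1)
```

Either `one` or `two` could then be passed to `clf.predict` directly.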