# Classification of Wisconsin Breast Cancer Database


In this project, I worked with the Wisconsin Breast Cancer dataset. A logistic regression model was trained to predict the diagnosis, with malignant (`M`) coded as 1 and everything else as 0.

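A baseline model was first trained on three fields (*f1*, *f2*, *f3*). The following additional experiments were then conducted on features scaled with `StandardScaler`:
1. The model was built with all features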
2. Experiment (1) was repeated, including an intercept
3. Experiment (1) was repeated, using randomSplit([0.7, 0.3])
4. Experiment (2) was repeated, using randomSplit([0.7, 0.3])

The results of (1) vs (2) and the results of (3) vs (4) are discussed at the end of the notebook.

In [1]:
# load modules
from pyspark.sql import SparkSession
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.feature import VectorAssembler 
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.evaluation import MulticlassMetrics

import os

In [2]:
# param init
infile = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

In [3]:
# read data into dataframe
df = spark.read.csv(infile, inferSchema=True, header = True)

In [4]:
df.count()

569

### Fields *f1*, *f2*, *f3* combined into a single *features* column using `VectorAssembler`  

In [5]:
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features") 
transformed = assembler.transform(df)

### Select the *diagnosis* and *features* fields for modeling. Convert to RDD.

In [6]:
dataRdd = transformed.select("diagnosis", "features").rdd.map(tuple)

In [7]:
# look at some data
dataRdd.take(2)

[('M', DenseVector([17.99, 10.38, 122.8])),
 ('M', DenseVector([20.57, 17.77, 132.9]))]

In [8]:
# map label to binary values, then convert to LabeledPoint
lp = dataRdd.map(lambda row:(1 if row[0]=='M' else 0, Vectors.dense(row[1])))    \
                    .map(lambda row: LabeledPoint(row[0], row[1]))

In [9]:
# look at some data
lp.take(2)

[LabeledPoint(1.0, [17.99,10.38,122.8]),
 LabeledPoint(1.0, [20.57,17.77,132.9])]

### Split data approximately into training (60%) and test (40%) using `seed=314` 

In [10]:
# Split data approximately into training (60%) and test (40%)
training, test = lp.randomSplit([0.6, 0.4], seed=314)

In [11]:
# count records in datasets
(training.count(), test.count(), lp.count())

(356, 213, 569)

In [12]:
(training.count()/lp.count(), test.count()/lp.count(), lp.count()/lp.count())

(0.6256590509666081, 0.37434094903339193, 1.0)
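
The split comes out at about 62.6% / 37.4% rather than exactly 60% / 40% because `randomSplit` assigns rows at random instead of cutting at a fixed index. The sketch below imitates that behaviour in plain Python, under the assumption that each row is assigned independently by one uniform draw. The function `random_split` is a hypothetical stand-in, not Spark's implementation, so its counts will not match Spark's for the same seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent uniform draw (illustrative only)."""
    total = sum(weights)
    # cumulative upper bounds of each split's share of [0, 1)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # the last split also catches any rounding slack in the bounds
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(569))  # same number of rows as the dataset
training_rows, test_rows = random_split(rows, [0.6, 0.4], seed=314)
print(len(training_rows), len(test_rows))
```

The two sizes always add up to 569, but the training share only approximates 60%, and it is reproducible only for a fixed seed.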

### Train the model

In [13]:
# Build the model
model = LogisticRegressionWithLBFGS.train(training)

### Evaluate the model by computing the accuracy on the test data.

In [14]:
# Evaluating the model on test data
labelsAndPreds_te = test.map(lambda p: (p.label, model.predict(p.features)))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy (test): {}'.format(accuracy_te))

model accuracy (test): 0.8732394366197183


### For each of the next experiments, accuracy and confusion matrix are computed for the test set.  

#### Experiment 1: Build the model using all features.

In [15]:
# use all of the fields as features
assembler = VectorAssembler(inputCols=[i for i in df.columns if i[0]=='f'], outputCol="features") 
transformed = assembler.transform(df)

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(transformed)
scaledData = scalerModel.transform(transformed)

# convert to RDD
dataRdd = scaledData.select("diagnosis","scaledFeatures").rdd.map(tuple)

# map label to binary values, then create LabeledPoints
lp = dataRdd.map(lambda row: LabeledPoint(1 if row[0]=='M' else 0, Vectors.dense(row[1])))

# Split data approximately into training (60%) and test (40%)
training, test = lp.randomSplit([0.6, 0.4], seed=314)

# Build the model
model = LogisticRegressionWithLBFGS.train(training)

# Evaluating the model on test data

# make sure predictions are floats
labelsAndPreds_te = test.map(lambda p: (p.label, float(model.predict(p.features))))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy (test): {}'.format(accuracy_te))

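# NOTE: MulticlassMetrics expects (prediction, label) pairs; these pairs are (label, prediction),
# so rows of the printed matrix index the model's predictions and columns the true labels.
# Accuracy is unaffected by the order.
metrics = MulticlassMetrics(labelsAndPreds_te)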
labelsAndPreds_te.take(3)
print("Confusion Matrix:\n{}".format(metrics.confusionMatrix().toArray()))

model accuracy (test): 0.9624413145539906
Confusion Matrix:
[[135.   4.]
 [  4.  70.]]
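
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. A minimal plain-Python sketch of that column-wise operation follows. The first two rows are taken from the notebook output above; the third row is made up for illustration, and `scale_to_unit_std` is a hypothetical helper, not a Spark function.

```python
import statistics

rows = [
    [17.99, 10.38, 122.8],   # from the notebook output
    [20.57, 17.77, 132.9],   # from the notebook output
    [15.00, 14.00, 100.0],   # made-up row for illustration
]

def scale_to_unit_std(rows):
    """Divide each column by its sample standard deviation; no centering."""
    columns = list(zip(*rows))
    stds = [statistics.stdev(col) for col in columns]
    return [[value / std for value, std in zip(row, stds)] for row in rows]

scaled = scale_to_unit_std(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

Each scaled column has unit standard deviation, but because the mean is not removed, the values stay positive here rather than being centered on zero.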


#### Experiment 2: Repeat experiment 1, with an intercept.

In [16]:
assembler = VectorAssembler(inputCols=[i for i in df.columns if i[0]=='f'], outputCol="features") 
transformed = assembler.transform(df)

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(transformed)
scaledData = scalerModel.transform(transformed)

dataRdd = scaledData.select("diagnosis","scaledFeatures").rdd.map(tuple)

lp = dataRdd.map(lambda row: LabeledPoint(1 if row[0]=='M' else 0, Vectors.dense(row[1])))

training, test = lp.randomSplit([0.6, 0.4], seed=314)

model = LogisticRegressionWithLBFGS.train(training, intercept=True)

labelsAndPreds_te = test.map(lambda p: (p.label, float(model.predict(p.features))))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy (test): {}'.format(accuracy_te))

metrics = MulticlassMetrics(labelsAndPreds_te)
labelsAndPreds_te.take(3)
print("Confusion Matrix:\n{}".format(metrics.confusionMatrix().toArray()))

model accuracy (test): 0.9671361502347418
Confusion Matrix:
[[135.   3.]
 [  4.  71.]]
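
The only change in Experiment 2 is `intercept=True`. In logistic regression the intercept is a constant added to the weighted sum of the features before the logistic function is applied. The sketch below shows the effect with made-up weights and features (not the fitted model) and a 0.5 decision threshold. Without an intercept, an all-zero feature vector always gets probability 0.5.

```python
import math

def predict_proba(features, weights, intercept=0.0):
    """Logistic regression probability of class 1 for one example."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

def predict(features, weights, intercept=0.0, threshold=0.5):
    """Return 1 if the probability exceeds the threshold, else 0."""
    return 1 if predict_proba(features, weights, intercept) > threshold else 0

weights = [0.8, -0.3, 0.5]   # hypothetical weights, not the fitted ones
x = [1.2, 0.7, 2.0]          # hypothetical scaled features

print(predict_proba(x, weights))                  # no intercept
print(predict_proba(x, weights, intercept=-2.0))  # hypothetical intercept shifts the margin
print(predict_proba([0.0, 0.0, 0.0], weights))    # 0.5 whenever the margin is zero
```

With these illustrative numbers the intercept alone moves the example from class 1 to class 0, which is the kind of shift that can change a handful of borderline test predictions.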


#### Experiment 3: Repeat experiment 1, with randomSplit([0.7, 0.3]).

In [17]:
# use all of the fields as features
assembler = VectorAssembler(inputCols=[i for i in df.columns if i[0]=='f'], outputCol="features") 
transformed = assembler.transform(df)

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(transformed)
scaledData = scalerModel.transform(transformed)

# convert to RDD
dataRdd = scaledData.select("diagnosis","scaledFeatures").rdd.map(tuple)

# map label to binary values, then create LabeledPoints
lp = dataRdd.map(lambda row: LabeledPoint(1 if row[0]=='M' else 0, Vectors.dense(row[1])))

# Split data approximately into training (70%) and test (30%)
training, test = lp.randomSplit([0.7, 0.3], seed=314)

# Build the model
model = LogisticRegressionWithLBFGS.train(training)

# Evaluating the model on test data

# make sure predictions are floats
labelsAndPreds_te = test.map(lambda p: (p.label, float(model.predict(p.features))))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy (test): {}'.format(accuracy_te))

metrics = MulticlassMetrics(labelsAndPreds_te)
labelsAndPreds_te.take(3)
print("Confusion Matrix:\n{}".format(metrics.confusionMatrix().toArray()))

model accuracy (test): 0.9433962264150944
Confusion Matrix:
[[98.  6.]
 [ 3. 52.]]
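
Accuracy hides how the errors split between the two classes. The sketch below derives precision and recall for the malignant class (label 1) from the Experiment 3 matrix. It assumes the orientation implied by the argument order in the code: `MulticlassMetrics` documents its input as (prediction, label) pairs, while the notebook passes (label, prediction), so rows are read here as model predictions and columns as true labels. If that assumption is wrong, precision and recall swap.

```python
# Experiment 3 confusion matrix as printed in the notebook
matrix = [[98, 6],
          [3, 52]]

# Assumed orientation: matrix[predicted][actual]
true_neg = matrix[0][0]    # predicted 0, actually 0
false_neg = matrix[0][1]   # predicted 0, actually 1
false_pos = matrix[1][0]   # predicted 1, actually 0
true_pos = matrix[1][1]    # predicted 1, actually 1

total = sum(sum(row) for row in matrix)
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print('accuracy:  {:.4f}'.format(accuracy))
print('precision: {:.4f}'.format(precision))
print('recall:    {:.4f}'.format(recall))
```

The column sums (101 and 58) are identical in Experiments 3 and 4, which use the same seed. That is consistent with the columns holding the true labels.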


#### Experiment 4: Repeat experiment 2, with randomSplit([0.7, 0.3]).

In [18]:
assembler = VectorAssembler(inputCols=[i for i in df.columns if i[0]=='f'], outputCol="features") 
transformed = assembler.transform(df)

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(transformed)
scaledData = scalerModel.transform(transformed)

dataRdd = scaledData.select("diagnosis","scaledFeatures").rdd.map(tuple)

lp = dataRdd.map(lambda row: LabeledPoint(1 if row[0]=='M' else 0, Vectors.dense(row[1])))

training, test = lp.randomSplit([0.7, 0.3], seed=314)

model = LogisticRegressionWithLBFGS.train(training, intercept=True)

labelsAndPreds_te = test.map(lambda p: (p.label, float(model.predict(p.features))))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy (test): {}'.format(accuracy_te))

metrics = MulticlassMetrics(labelsAndPreds_te)
labelsAndPreds_te.take(3)
print("Confusion Matrix:\n{}".format(metrics.confusionMatrix().toArray()))

model accuracy (test): 0.9371069182389937
Confusion Matrix:
[[99.  8.]
 [ 2. 50.]]


#### Discussion of results:

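All four models are accurate on the test set, ranging from about 93.7% to 96.7%. What's interesting, however, is that including an intercept increased accuracy only when the train/test partition was 60/40 (0.962 to 0.967); when it was 70/30, accuracy decreased slightly (0.943 to 0.937). In both cases the difference amounts to a single test example (8 vs. 7 errors out of 213, and 9 vs. 10 out of 159), so these results alone do not show that the intercept helps or hurts.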
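
To put the differences in perspective, the error counts can be read off the confusion matrices, and a rough binomial standard error can be attached to each accuracy. This is a back-of-the-envelope check that treats the test examples as independent draws; it is not part of the original analysis.

```python
import math

# Confusion matrices as printed for experiments 1 to 4
matrices = {
    1: [[135, 4], [4, 70]],
    2: [[135, 3], [4, 71]],
    3: [[98, 6], [3, 52]],
    4: [[99, 8], [2, 50]],
}

summary = {}
for exp, m in matrices.items():
    total = sum(sum(row) for row in m)
    correct = m[0][0] + m[1][1]          # diagonal entries are correct predictions
    accuracy = correct / total
    # binomial standard error of an accuracy estimated from `total` examples
    std_err = math.sqrt(accuracy * (1 - accuracy) / total)
    summary[exp] = (total - correct, total, accuracy, std_err)
    print('experiment {}: {} errors / {}, accuracy {:.4f} +/- {:.4f}'.format(
        exp, total - correct, total, accuracy, std_err))
```

In both pairs (1 vs. 2 and 3 vs. 4) the accuracy gap is smaller than one standard error, so under this rough estimate the effect of the intercept cannot be distinguished from sampling noise.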