### University of Virginia
### DS 5110: Big Data Systems

### Classification of Wisconsin Breast Cancer Database
### Last updated: June 17, 2021

**Instructions** 

In this project, you will work with the Wisconsin Breast Cancer dataset.  You will train a logistic regression model to predict the diagnosis.  First, you will work through this example, **filling in the missing cells.**  Then you will make modifications and run the code, collecting results at the bottom of the notebook.

The following experiments should be conducted:
1. Three features were used in the original model. **Build the model using all features.** Before training the model, apply scaling to the features using the `StandardScaler` transformer. Then train the model, and compute and show the accuracy and confusion matrix, **measured on the test set.**

   **Hint**: While the data is in a dataframe, this might be helpful:
   ```
   from pyspark.ml.feature import StandardScaler
   ```

2. Repeat step (1), including an intercept.
3. Repeat step (1), using `randomSplit([0.7, 0.3])` but NO intercept.
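
To see what standard scaling does to a single feature column, here is a minimal plain-Python sketch. It is not the Spark API: `standardize` and the toy values are hypothetical, and the flags only mirror the `withMean` / `withStd` parameters of `StandardScaler` (which default to `False` and `True`, so by default values are divided by the sample standard deviation without being centered).

```python
import statistics


def standardize(values, with_mean=False, with_std=True):
    """Scale one feature column (hypothetical helper, not the Spark API)."""
    mean = statistics.mean(values) if with_mean else 0.0
    # statistics.stdev is the sample (n - 1) standard deviation
    std = statistics.stdev(values) if with_std else 1.0
    if std == 0.0:
        # constant column: skip the division (a choice made for this sketch)
        std = 1.0
    return [(v - mean) / std for v in values]


toy_column = [2.0, 4.0, 6.0]  # made-up values, not taken from the dataset

scaled_only = standardize(toy_column)                   # divide by std only
centered = standardize(toy_column, with_mean=True)      # subtract mean, then divide

print(scaled_only)
print(centered)
```

Scaling matters here because the features span very different ranges (compare *f4* with *f15* in the table shown later), which can slow or distort the optimizer.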

**Total Possible Points: 10**

In [1]:
# load modules
from pyspark.sql import SparkSession
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.feature import VectorAssembler 
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.evaluation import MulticlassMetrics

import os

In [2]:
# param init
infile = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

In [3]:
# read data into dataframe
df = spark.read.csv(infile, inferSchema=True, header=True)

In [4]:
df.count()

569

In [6]:
df.show(3)

+--------+---------+-----+-----+-----+------+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+-------+-------+--------+-----+-----+-----+------+------+------+------+------+------+-------+
|      id|diagnosis|   f1|   f2|   f3|    f4|     f5|     f6|    f7|     f8|    f9|    f10|   f11|   f12|  f13|  f14|     f15|    f16|    f17|    f18|    f19|     f20|  f21|  f22|  f23|   f24|   f25|   f26|   f27|   f28|   f29|    f30|
+--------+---------+-----+-----+-----+------+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+-------+-------+--------+-----+-----+-----+------+------+------+------+------+------+-------+
|  842302|        M|17.99|10.38|122.8|1001.0| 0.1184| 0.2776|0.3001| 0.1471|0.2419|0.07871| 1.095|0.9053|8.589|153.4|0.006399|0.04904|0.05373|0.01587|0.03003|0.006193|25.38|17.33|184.6|2019.0|0.1622|0.6656|0.7119|0.2654|0.4601| 0.1189|
|  842517|        M|20.57|17.77|132.9|1326.0|0.08474|0.0

**(1 PT)** Combine fields *f1*, *f2*, *f3* into a single *features* column using `VectorAssembler`  
Name the resulting dataframe *transformed*

Select the *diagnosis* and *features* fields for modeling.  
We will do the remaining steps with RDDs, so we convert the dataframe to an RDD.

In [None]:
dataRdd = transformed.select("diagnosis", "features").rdd.map(tuple)

In [None]:
# look at some data
dataRdd.take(2)

In [None]:
# map the label to a binary value (1 for 'M', otherwise 0) and wrap each row in a LabeledPoint
lp = dataRdd.map(lambda row: LabeledPoint(1 if row[0] == 'M' else 0, Vectors.dense(row[1])))

In [None]:
# look at some data
lp.take(2)

**(1 PT)** Split data approximately into training (60%) and test (40%) using `seed=314`  
The RDDs that are output from the splitting should be named *training*, *test*, respectively
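
The split is only *approximately* 60/40 because records are assigned at random rather than by counting. The plain-Python sketch below illustrates the idea under a simplifying assumption: each record independently draws a uniform number and goes to the first split whose cumulative weight exceeds it. `random_split` is a hypothetical helper, not Spark's `randomSplit`, and it will not reproduce Spark's assignments for the same seed.

```python
import random


def random_split(records, weights, seed):
    """Randomly assign each record to one split (illustrative sketch only)."""
    total = sum(weights)
    # cumulative, normalized upper bounds, e.g. [0.6, 1.0] for weights [0.6, 0.4]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for rec in records:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(rec)
                break
    return splits


records = list(range(569))  # stand-in for the 569 rows of the dataset
training_demo, test_demo = random_split(records, [0.6, 0.4], seed=314)

# the sizes are close to, but generally not exactly, 60% and 40%
print(len(training_demo), len(test_demo))
```

Fixing the seed makes the assignment reproducible from run to run, which is why the exercise specifies one.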

In [None]:
# count records in datasets
(training.count(), test.count(), lp.count())

In [None]:
# percentage of records in datasets
(training.count()/lp.count(), test.count()/lp.count(), lp.count()/lp.count())

**(1 PT)** Train model `LogisticRegressionWithLBFGS`, naming it *model*
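
For intuition about what a trained binary logistic regression model computes, and what adding an intercept changes, here is a minimal plain-Python sketch. The `predict` function, the weights, and the feature values are all made up for illustration; the 0.5 threshold is an assumption that matches the usual default.

```python
import math


def predict(weights, features, intercept=0.0, threshold=0.5):
    """Return 1 if the predicted probability exceeds the threshold, else 0."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    # numerically stable logistic (sigmoid) function
    if margin >= 0:
        probability = 1.0 / (1.0 + math.exp(-margin))
    else:
        probability = math.exp(margin) / (1.0 + math.exp(margin))
    return 1 if probability > threshold else 0


toy_weights = [0.8, -0.5]   # made-up coefficients
toy_features = [1.0, 2.0]   # made-up (already scaled) feature values

# margin = 0.8*1.0 - 0.5*2.0 = -0.2, so the probability is below 0.5
without_intercept = predict(toy_weights, toy_features)
# the intercept shifts the margin to 0.3, which moves the point across the boundary
with_intercept = predict(toy_weights, toy_features, intercept=0.5)

print(without_intercept, with_intercept)
```

Without an intercept the decision boundary is forced through the origin of the feature space, so whether the features are centered can affect the fit.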

**(1 PT)**  Evaluate the model by computing the accuracy on the **test data**. Print the accuracy.
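
Accuracy is the fraction of test records whose predicted label equals the true label. A minimal plain-Python sketch on made-up `(prediction, label)` pairs follows; `accuracy` is a hypothetical helper, and in the notebook the pairs would come from applying the model to the test RDD.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs where the two values agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("accuracy is undefined for an empty set of predictions")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)


# made-up (prediction, label) pairs: three correct out of five
toy_pairs = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0)]
toy_accuracy = accuracy(toy_pairs)

print(toy_accuracy)
```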

**SOLUTIONS**  
For parts 1-3, compute and show for the test set: (1) the accuracy and (2) the confusion matrix.  
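
As a reminder of how to read a confusion matrix, the plain-Python sketch below tallies made-up `(prediction, label)` pairs into a 2x2 table with true classes in rows and predicted classes in columns, ordered by ascending label. This is the layout `MulticlassMetrics.confusionMatrix()` is documented to use, but `confusion_matrix` here is a hypothetical helper, not the Spark call.

```python
def confusion_matrix(pairs, labels=(0, 1)):
    """Count (prediction, label) pairs: rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for prediction, label in pairs:
        matrix[index[label]][index[prediction]] += 1
    return matrix


# made-up (prediction, label) pairs, with 1 = malignant and 0 = benign
toy_pairs = [(1, 1), (0, 0), (1, 0), (0, 0), (0, 1)]
toy_matrix = confusion_matrix(toy_pairs)

for row in toy_matrix:
    print(row)

# accuracy is the sum of the diagonal divided by the total count
diagonal = sum(toy_matrix[i][i] for i in range(len(toy_matrix)))
total = sum(sum(row) for row in toy_matrix)
print(diagonal / total)
```

The off-diagonal cells separate the two kinds of error: the top-right cell counts benign cases predicted malignant, and the bottom-left cell counts malignant cases predicted benign.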

Enter solution for Part 1 (2 POINTS)

Enter solution for Part 2  (2 POINTS)

Enter solution for Part 3  (2 POINTS)