# Cancer Diagnosis

### Dataset Description

 Here is a description of the dataset we will be using:

 Breast Cancer Wisconsin (Diagnostic) Database

 Notes
 -----
 Data Set Characteristics:
     :Number of Instances: 569

     :Number of Attributes: 30 numeric, predictive attributes and the class

     :Attribute Information:
         - radius (mean of distances from center to points on the perimeter)
         - texture (standard deviation of gray-scale values)
         - perimeter
         - area
         - smoothness (local variation in radius lengths)
         - compactness (perimeter^2 / area - 1.0)
         - concavity (severity of concave portions of the contour)
         - concave points (number of concave portions of the contour)
         - symmetry
         - fractal dimension ("coastline approximation" - 1)

         The mean, standard error, and "worst" or largest (mean of the three
         largest values) of these features were computed for each image,
         resulting in 30 features.  For instance, field 3 is Mean Radius, field
         13 is Radius SE, field 23 is Worst Radius.

         - class:
                 - WDBC-Malignant
                 - WDBC-Benign

     :Summary Statistics:

     ===================================== ======= ========
                                            Min     Max
     ===================================== ======= ========
     radius (mean):                         6.981   28.11
     texture (mean):                        9.71    39.28
     perimeter (mean):                      43.79   188.5
     area (mean):                           143.5   2501.0
     smoothness (mean):                     0.053   0.163
     compactness (mean):                    0.019   0.345
     concavity (mean):                      0.0     0.427
     concave points (mean):                 0.0     0.201
     symmetry (mean):                       0.106   0.304
     fractal dimension (mean):              0.05    0.097
     radius (standard error):               0.112   2.873
     texture (standard error):              0.36    4.885
     perimeter (standard error):            0.757   21.98
     area (standard error):                 6.802   542.2
     smoothness (standard error):           0.002   0.031
     compactness (standard error):          0.002   0.135
     concavity (standard error):            0.0     0.396
     concave points (standard error):       0.0     0.053
     symmetry (standard error):             0.008   0.079
     fractal dimension (standard error):    0.001   0.03
     radius (worst):                        7.93    36.04
     texture (worst):                       12.02   49.54
     perimeter (worst):                     50.41   251.2
     area (worst):                          185.2   4254.0
     smoothness (worst):                    0.071   0.223
     compactness (worst):                   0.027   1.058
     concavity (worst):                     0.0     1.252
     concave points (worst):                0.0     0.291
     symmetry (worst):                      0.156   0.664
     fractal dimension (worst):             0.055   0.208
     ===================================== ======= ========

     :Missing Attribute Values: None

     :Class Distribution: 212 - Malignant, 357 - Benign

     :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

     :Donor: Nick Street

     :Date: November, 1995


### Creating a Spark Session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("cancer").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577783867767)
SparkSession available as 'spark'


2019-12-31 14:47:57 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@ab81fb


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Using Spark to read the cancer data set

In [3]:
val data = spark.read.options(Map("header" -> "true", "inferSchema" -> "true")).csv("Cancer_Data")

data: org.apache.spark.sql.DataFrame = [mean radius: int, mean texture: double ... 28 more fields]


### Count

In [4]:
data.count()

res1: Long = 569


### Schema

In [5]:
data.printSchema()

root
 |-- mean radius: integer (nullable = true)
 |-- mean texture: double (nullable = true)
 |-- mean perimeter: double (nullable = true)
 |-- mean area: double (nullable = true)
 |-- mean smoothness: double (nullable = true)
 |-- mean compactness: double (nullable = true)
 |-- mean concavity: double (nullable = true)
 |-- mean concave points: double (nullable = true)
 |-- mean symmetry: double (nullable = true)
 |-- mean fractal dimension: double (nullable = true)
 |-- radius error: double (nullable = true)
 |-- texture error: double (nullable = true)
 |-- perimeter error: double (nullable = true)
 |-- area error: double (nullable = true)
 |-- smoothness error: double (nullable = true)
 |-- compactness error: double (nullable = true)
 |-- concavity error: double (nullable = true)
 |-- concave points error: double (nullable = true)
 |-- symmetry error: double (nullable = true)
 |-- fractal dimension error: double (nullable = true)
 |-- worst radius: double (nullable = true)
 |-- worst

### Setting up PCA

In [6]:
import org.apache.spark.ml.feature.{PCA,StandardScaler,VectorAssembler}
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vectors


### Using VectorAssembler to combine the input columns of the cancer data into a single vector column called "features"

In [7]:
val assembler = new VectorAssembler().setInputCols(data.columns).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_4df0293e6b5d


In [8]:
val output = assembler.transform(data)

output: org.apache.spark.sql.DataFrame = [mean radius: int, mean texture: double ... 29 more fields]


In [9]:
output.printSchema

root
 |-- mean radius: integer (nullable = true)
 |-- mean texture: double (nullable = true)
 |-- mean perimeter: double (nullable = true)
 |-- mean area: double (nullable = true)
 |-- mean smoothness: double (nullable = true)
 |-- mean compactness: double (nullable = true)
 |-- mean concavity: double (nullable = true)
 |-- mean concave points: double (nullable = true)
 |-- mean symmetry: double (nullable = true)
 |-- mean fractal dimension: double (nullable = true)
 |-- radius error: double (nullable = true)
 |-- texture error: double (nullable = true)
 |-- perimeter error: double (nullable = true)
 |-- area error: double (nullable = true)
 |-- smoothness error: double (nullable = true)
 |-- compactness error: double (nullable = true)
 |-- concavity error: double (nullable = true)
 |-- concave points error: double (nullable = true)
 |-- symmetry error: double (nullable = true)
 |-- fractal dimension error: double (nullable = true)
 |-- worst radius: double (nullable = true)
 |-- worst

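When using PCA, it is often a good idea to first normalize each feature to unit standard deviation and/or zero mean, so that features measured on large scales (such as area) do not dominate the components. This is a preprocessing step rather than part of PCA itself, and it is not always necessary.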

### Using StandardScaler to normalize the data

In [15]:
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false)

scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_d70bad30cee5


### Compute summary statistics by fitting the StandardScaler: calling scaler.fit() on the output of the VectorAssembler creates a new object, scaler_model

In [16]:
val scaler_model = scaler.fit(output)

scaler_model: org.apache.spark.ml.feature.StandardScalerModel = stdScal_d70bad30cee5


### Normalize each feature to have unit standard deviation by calling transform() on scaler_model to create scaledData

In [17]:
val scaledData = scaler_model.transform(output).select("features","scaledFeatures")

scaledData: org.apache.spark.sql.DataFrame = [features: vector, scaledFeatures: vector]


In [18]:
scaledData.show(5)

+--------------------+--------------------+
|            features|      scaledFeatures|
+--------------------+--------------------+
|[0.0,17.99,10.38,...|[0.0,5.1049235941...|
|[1.0,20.57,17.77,...|[0.00608270930682...|
|[2.0,19.69,21.25,...|[0.01216541861364...|
|[3.0,11.42,20.38,...|[0.01824812792047...|
|[4.0,20.29,14.34,...|[0.02433083722729...|
+--------------------+--------------------+
only showing top 5 rows
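
With `setWithStd(true)` and `setWithMean(false)`, each feature is divided by its standard deviation and is not shifted. The plain-Scala sketch below (no Spark required; `UnitStdSketch` and its helper names are made up for illustration) assumes the corrected sample standard deviation, dividing by n - 1, which is what Spark's StandardScaler documentation describes. Applied to a column holding 0, 1, ..., 568, it gives roughly 0.0060827 for the value 1.0, which agrees with the first entry of the second row in the output above. That suggests the first assembled feature is a row index rather than a measurement, so it may be worth excluding that column before assembling.

```scala
object UnitStdSketch {
  // Corrected sample standard deviation (divides by n - 1).
  def sampleStd(xs: Seq[Double]): Double = {
    require(xs.size > 1, "need at least two values")
    val mean = xs.sum / xs.size
    val sumSq = xs.map(x => (x - mean) * (x - mean)).sum
    math.sqrt(sumSq / (xs.size - 1))
  }

  // Divide by the standard deviation without subtracting the mean
  // (the withStd = true, withMean = false setting).
  // Mapping a constant column to zeros is an assumption of this sketch.
  def scaleToUnitStd(xs: Seq[Double]): Seq[Double] = {
    val std = sampleStd(xs)
    if (std == 0.0) xs.map(_ => 0.0) else xs.map(_ / std)
  }

  def main(args: Array[String]): Unit = {
    // A column that looks like a row index: 0, 1, ..., 568
    val indexColumn = (0 until 569).map(_.toDouble)
    println(scaleToUnitStd(indexColumn).take(3).mkString(", "))
  }
}
```

After scaling, the column has unit standard deviation but its mean is unchanged relative to its spread, because nothing is subtracted.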



### Create a new PCA() object that takes in scaledFeatures and outputs pcaFeatures with 4 principal components, then fit it to scaledData

In [19]:
val pca = new PCA().setInputCol("scaledFeatures").setOutputCol("pcaFeatures").setK(4).fit(scaledData)

2019-12-31 15:03:10 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-31 15:03:10 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
2019-12-31 15:03:11 WARN  LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
2019-12-31 15:03:11 WARN  LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK


pca: org.apache.spark.ml.feature.PCAModel = pca_8f535bf16297
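
To make concrete what fitting PCA computes, here is a minimal two-feature sketch in plain Scala. It is not Spark's implementation: the object `TinyPca` and its methods are hypothetical, and it uses the closed-form eigen-decomposition of a 2x2 covariance matrix, which only works for two features. It returns the direction of largest variance and the fraction of total variance that direction explains.

```scala
object TinyPca {
  // Entries (a, b, c) of the sample covariance matrix [[a, b], [b, c]].
  def covariance(points: Seq[(Double, Double)]): (Double, Double, Double) = {
    require(points.size > 1, "need at least two points")
    val n = points.size
    val mx = points.map(_._1).sum / n
    val my = points.map(_._2).sum / n
    val a = points.map { case (x, _) => (x - mx) * (x - mx) }.sum / (n - 1)
    val b = points.map { case (x, y) => (x - mx) * (y - my) }.sum / (n - 1)
    val c = points.map { case (_, y) => (y - my) * (y - my) }.sum / (n - 1)
    (a, b, c)
  }

  // Returns (unit vector of the first principal component,
  //          fraction of total variance explained by it).
  def firstComponent(points: Seq[(Double, Double)]): ((Double, Double), Double) = {
    val (a, b, c) = covariance(points)
    val halfDiff = (a - c) / 2
    val root = math.sqrt(halfDiff * halfDiff + b * b)
    val lambda1 = (a + c) / 2 + root // larger eigenvalue
    val total = a + c                // trace = total variance
    // Eigenvector for lambda1; handle the already-diagonal case separately.
    val (vx, vy) =
      if (b != 0.0) (b, lambda1 - a)
      else if (a >= c) (1.0, 0.0)
      else (0.0, 1.0)
    val norm = math.sqrt(vx * vx + vy * vy)
    val ratio = if (total == 0.0) 0.0 else lambda1 / total
    ((vx / norm, vy / norm), ratio)
  }

  def main(args: Array[String]): Unit = {
    // Made-up, strongly correlated points for illustration only.
    val pts = Seq((1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8))
    val (direction, ratio) = firstComponent(pts)
    println(s"direction = $direction, explained variance ratio = $ratio")
  }
}
```

The sign of a principal direction is arbitrary, so a different implementation may legitimately return the negated vector.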


### Transforming the scaledData with the fitted PCA model

In [20]:
val pcaDF = pca.transform(scaledData)

pcaDF: org.apache.spark.sql.DataFrame = [features: vector, scaledFeatures: vector ... 1 more field]


In [21]:
pcaDF.show(5)

+--------------------+--------------------+--------------------+
|            features|      scaledFeatures|         pcaFeatures|
+--------------------+--------------------+--------------------+
|[0.0,17.99,10.38,...|[0.0,5.1049235941...|[21.6219973823647...|
|[1.0,20.57,17.77,...|[0.00608270930682...|[15.1217370347581...|
|[2.0,19.69,21.25,...|[0.01216541861364...|[18.4325856097776...|
|[3.0,11.42,20.38,...|[0.01824812792047...|[18.9549565028936...|
|[4.0,20.29,14.34,...|[0.02433083722729...|[16.7333072691961...|
+--------------------+--------------------+--------------------+
only showing top 5 rows



### Show the new pcaFeatures

In [23]:
val result = pcaDF.select("pcaFeatures")

result: org.apache.spark.sql.DataFrame = [pcaFeatures: vector]


In [25]:
result.show(5,false)

+------------------------------------------------------------------------------+
|pcaFeatures                                                                   |
+------------------------------------------------------------------------------+
|[21.62199738236476,8.516595739466684,-3.7318474175794782,-0.4181244970133412] |
|[15.121737034758134,2.697138979042207,-2.3546461829874357,-2.59498897333438]  |
|[18.432585609777654,5.697069543518227,-2.9058070696230303,-3.0552108608152326]|
|[18.95495650289368,16.025442209800573,-5.934803967957989,-4.158068180951641]  |
|[16.73330726919616,4.995746645493643,-0.998499731787013,-0.8269447324688084]  |
+------------------------------------------------------------------------------+
only showing top 5 rows



### Use .head() to confirm that each pcaFeatures vector has only 4 principal components

In [26]:
result.head(1)

res9: Array[org.apache.spark.sql.Row] = Array([[21.62199738236476,8.516595739466684,-3.7318474175794782,-0.4181244970133412]])
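
The reason each row ends up with exactly 4 values is that projecting a vector onto k principal directions yields one coordinate per direction. The sketch below illustrates this in plain Scala with made-up directions. `ProjectionSketch` is a hypothetical helper, and it assumes the projection is a plain dot product with each direction, without subtracting the feature means first.

```scala
object ProjectionSketch {
  // Dot product of two vectors of equal length.
  def dot(u: Array[Double], v: Array[Double]): Double = {
    require(u.length == v.length, "dimension mismatch")
    u.zip(v).map { case (a, b) => a * b }.sum
  }

  // One output coordinate per principal direction, so the result
  // has components.length entries regardless of the input dimension.
  def project(x: Array[Double], components: Seq[Array[Double]]): Array[Double] =
    components.map(pc => dot(pc, x)).toArray

  def main(args: Array[String]): Unit = {
    // Illustrative values only: a 3-feature vector and 2 unit directions.
    val x = Array(3.0, 4.0, 0.0)
    val directions = Seq(Array(1.0, 0.0, 0.0), Array(0.0, 1.0, 0.0))
    println(project(x, directions).mkString("[", ", ", "]"))
  }
}
```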


### Closing the Spark session

In [27]:
spark.stop()

## Thank You!