# Building a classification model with PySpark and MLlib
# Breast Cancer Wisconsin case
##### @ author Frederic TWAHIRWA

### In this notebook, Decision Tree, Gradient-Boosted Tree, and Random Forest classifiers are used.

## Decision Tree
Decision trees are widely used because they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

## Ensemble Models : Gradient-Boosted Tree and Random Forest
An ensemble method is a learning algorithm that creates a model composed of a set of other base models. spark.mllib supports two major ensemble algorithms: GradientBoostedTrees and RandomForest. Both use decision trees as their base models. This notebook uses their DataFrame-based counterparts in pyspark.ml, GBTClassifier and RandomForestClassifier.

## Gradient-Boosted Trees vs. Random Forests
Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:

- GBTs train one tree at a time, so they can take longer to train than random forests. Random Forests can train multiple trees in parallel.
- On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and training smaller trees takes less time.
- Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting. (In statistical language, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)
- Random Forests can be easier to tune since performance improves monotonically with the number of trees (whereas performance can start to decrease for GBTs if the number of trees grows too large).

In short, both algorithms can be effective, and the choice should be based on the particular dataset.

Reference: Apache Spark (https://spark.apache.org/docs/2.2.0/mllib-ensembles.html)
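The variance-reduction point for Random Forests can be illustrated with an idealized calculation. The sketch below assumes an odd number of trees that vote independently, each correct with the same probability `p`. Real trees trained on the same data are correlated, so the gain is smaller in practice. The function name `majority_vote_accuracy` is hypothetical and not part of Spark.

```python
from math import comb

def majority_vote_accuracy(p, n_trees):
    """Probability that a majority of n_trees independent voters is correct.

    Assumes n_trees is odd and each voter is correct with probability p.
    """
    if n_trees % 2 == 0:
        raise ValueError("n_trees must be odd to avoid ties")
    majority = n_trees // 2 + 1
    # Sum the binomial probabilities of every outcome with a correct majority.
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(majority, n_trees + 1)
    )

# Accuracy of the vote as the (idealized) ensemble grows.
for n in (1, 5, 21, 41):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
```

Under these assumptions the vote only improves as trees are added, which is the intuition behind "more trees do not hurt" for Random Forests.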

## Breast Cancer Wisconsin (Diagnostic) Data Set

For this example, the "Breast Cancer Wisconsin (Diagnostic)" data set from the UCI Machine Learning Repository is used. Each of the 569 rows has 32 comma-separated columns:

- Column 1: ID number
- Column 2: diagnosis (M = malignant, B = benign)
- Columns 3-32: 30 real-valued features (results of analyses and measurements)

More information: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)



## Initialize a Spark session

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession\
        .builder\
        .appName("ClassificationExample")\
        .getOrCreate() # required for DataFrames

## Load data

In [3]:
## !wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

In [4]:
# show head
!head -n 1 ./Documents/MachineLearning/Breast_cancer_wisconsin/wdbc.data

842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189


In [5]:
# count the number of lines
!wc -l ./Documents/MachineLearning/Breast_cancer_wisconsin/wdbc.data

569 ./Documents/MachineLearning/Breast_cancer_wisconsin/wdbc.data


In [6]:
# import Vectors to create local vectors (there are 2 types of vectors: dense and sparse)
# import StringIndexer to transform a string label into a numeric index
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer

#### Load a text file and convert each line to a (label, features) tuple.

In [7]:
# Load a text file and convert each line to a (label, features) tuple.

data_list = []

with open("./Documents/MachineLearning/Breast_cancer_wisconsin/wdbc.data") as infile:
    for line in infile:
        tokens = line.rstrip("\n").split(",")        
        y = tokens[1]
        features = Vectors.dense([float(x) for x in tokens[2:]])        
        
        data_list.append((y, features))
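The parsing step in the loop above can be factored into a small function that checks the field count before any values reach Spark. This is a sketch in plain Python: `parse_wdbc_line` is a hypothetical helper, it returns a plain list instead of a `Vectors.dense` object, and the sample line is the first row of `wdbc.data` shown earlier.

```python
def parse_wdbc_line(line):
    """Split one wdbc.data line into (diagnosis, list of 30 floats)."""
    tokens = line.strip().split(",")
    if len(tokens) != 32:
        raise ValueError("expected 32 fields, got {}".format(len(tokens)))
    # tokens[0] is the ID number and is deliberately dropped.
    return tokens[1], [float(x) for x in tokens[2:]]

sample_line = (
    "842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,"
    "0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,"
    "0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,"
    "0.4601,0.1189"
)

label, features = parse_wdbc_line(sample_line)
print(label, len(features), features[:3])
```

Failing early on a malformed row is usually easier to debug than a type error raised later inside `createDataFrame`.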

### Create a DataFrame

In [8]:
#create a DataFrame
inputDF = spark.createDataFrame(data_list, ["label", "features"])

In [9]:
#show the DataFrame
#inputDF.show()
inputDF.limit(10).toPandas()

| | label | features |
|---|---|---|
| 0 | M | [17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, ... |
| 1 | M | [20.57, 17.77, 132.9, 1326.0, 0.08474, 0.07864... |
| 2 | M | [19.69, 21.25, 130.0, 1203.0, 0.1096, 0.1599, ... |
| 3 | M | [11.42, 20.38, 77.58, 386.1, 0.1425, 0.2839, 0... |
| 4 | M | [20.29, 14.34, 135.1, 1297.0, 0.1003, 0.1328, ... |
| 5 | M | [12.45, 15.7, 82.57, 477.1, 0.1278, 0.17, 0.15... |
| 6 | M | [18.25, 19.98, 119.6, 1040.0, 0.09463, 0.109, ... |
| 7 | M | [13.71, 20.83, 90.2, 577.9, 0.1189, 0.1645, 0.... |
| 8 | M | [13.0, 21.82, 87.5, 519.8, 0.1273, 0.1932, 0.1... |
| 9 | M | [12.46, 24.04, 83.97, 475.9, 0.1186, 0.2396, 0... |


### Indexing the label

In [10]:
# index the label; StringIndexer assigns 0.0 to the most frequent label (here B => 0.0, M => 1.0)
stringIndexer = StringIndexer(inputCol = "label", outputCol = "labelIndexed")
si_model = stringIndexer.fit(inputDF)
inputDF2 = si_model.transform(inputDF)
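By default `StringIndexer` orders labels by descending frequency, which is why the more common class B becomes 0.0 and M becomes 1.0. The plain-Python sketch below mimics that ordering rule on a toy list. The helper `index_labels_by_frequency` is hypothetical, and the alphabetical tie-break is an assumption made for the sketch.

```python
from collections import Counter

def index_labels_by_frequency(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (an assumption here).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

toy_labels = ["M", "B", "B", "M", "B"]  # B appears 3 times, M twice
mapping = index_labels_by_frequency(toy_labels)
print(mapping)  # {'B': 0.0, 'M': 1.0}
```

Because the mapping depends on class frequencies, it is worth checking which class received index 1.0 before reading precision, recall, or the confusion matrix.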

In [11]:
inputDF2.show()
#inputDF2.limit(20).toPandas()

+-----+--------------------+------------+
|label|            features|labelIndexed|
+-----+--------------------+------------+
|    M|[17.99,10.38,122....|         1.0|
|    M|[20.57,17.77,132....|         1.0|
|    M|[19.69,21.25,130....|         1.0|
|    M|[11.42,20.38,77.5...|         1.0|
|    M|[20.29,14.34,135....|         1.0|
|    M|[12.45,15.7,82.57...|         1.0|
|    M|[18.25,19.98,119....|         1.0|
|    M|[13.71,20.83,90.2...|         1.0|
|    M|[13.0,21.82,87.5,...|         1.0|
|    M|[12.46,24.04,83.9...|         1.0|
|    M|[16.02,23.24,102....|         1.0|
|    M|[15.78,17.89,103....|         1.0|
|    M|[19.17,24.8,132.4...|         1.0|
|    M|[15.85,23.95,103....|         1.0|
|    M|[13.73,22.61,93.6...|         1.0|
|    M|[14.54,27.54,96.7...|         1.0|
|    M|[14.68,20.13,94.7...|         1.0|
|    M|[16.13,20.68,108....|         1.0|
|    M|[19.81,22.15,130....|         1.0|
|    B|[13.54,14.36,87.4...|         0.0|
+-----+--------------------+------------+

### Train/test split

In [12]:
(trainingData, testData) = inputDF2.randomSplit([0.7, 0.3], seed = 57)
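`randomSplit` assigns rows to the splits by random sampling, so the 0.7/0.3 weights are targets rather than exact proportions (the test set used below ends up with 164 of the 569 rows, about 29%). The following plain-Python analogue illustrates the idea only. `random_split` is a hypothetical function, it is not Spark's implementation, and it will not reproduce Spark's partition even with the same seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using an independent uniform draw."""
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if x < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(569)), [0.7, 0.3], seed=57)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when comparing several models on the same test set as this notebook does.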

## Training Decision Tree

In [13]:
# import Decision Tree classifier and classification Evaluator
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [14]:
# create the estimator that will be trained on the indexed label
decisionTree = DecisionTreeClassifier(labelCol = "labelIndexed")

In [15]:
# call the method fit to the training data, and obtain a decision tree model
dtModel = decisionTree.fit(trainingData)

### Some results (number of nodes, depth, feature importances)

In [16]:
# model summary: number of nodes, depth, feature importances, number of features
print ("Numbers of node are : {}".format(dtModel.numNodes))
print ("the depth of the model is  : {}".format(dtModel.depth))
print ("features importances  : {}".format(dtModel.featureImportances))
print ("Number of features  : {}".format(dtModel.numFeatures))

Numbers of node are : 27
the depth of the model is  : 5
features importances  : (30,[1,4,7,10,16,21,23,27,29],[0.0374235426794,0.038778742902,0.00541570897648,0.0140606515278,0.0046629181201,0.065269843249,0.690521755556,0.125268152215,0.0185986847745])
Number of features  : 30


##### Feature 23 (0-based index) is by far the most important, with an importance of about 0.69
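To make the sparse importance vector easier to read, the non-zero entries can be ranked and paired with feature names. The sketch below copies the values from the output above. The names are an assumption based on the UCI dataset description (ten measurements, each reported as mean, standard error, and worst value, in that column order) and are not produced by Spark.

```python
# Feature names built from the UCI dataset description (an assumption about
# column order: 10 measurements, each as mean, standard error, and worst).
measurements = ["radius", "texture", "perimeter", "area", "smoothness",
                "compactness", "concavity", "concave points", "symmetry",
                "fractal dimension"]
feature_names = ["{} {}".format(stat, m)
                 for stat in ("mean", "se", "worst")
                 for m in measurements]

# Non-zero importances copied from the dtModel.featureImportances output.
indices = [1, 4, 7, 10, 16, 21, 23, 27, 29]
values = [0.0374235426794, 0.038778742902, 0.00541570897648,
          0.0140606515278, 0.0046629181201, 0.065269843249,
          0.690521755556, 0.125268152215, 0.0185986847745]

# Rank by importance, largest first, and show the top three.
ranked = sorted(zip(indices, values), key=lambda pair: pair[1], reverse=True)
for idx, importance in ranked[:3]:
    print("{:2d}  {:<22s} {:.3f}".format(idx, feature_names[idx], importance))
```

Under that naming assumption, index 23 corresponds to the worst (largest) cell area, which agrees with the root split threshold of 877.0 in the tree printed below.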

In [17]:
# Visualize the Decision Tree
print (dtModel.toDebugString)

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4018ad760d13159556da) of depth 5 with 27 nodes
  If (feature 23 <= 877.0)
   If (feature 27 <= 0.13495000000000001)
    If (feature 10 <= 0.7658)
     If (feature 27 <= 0.11085)
      If (feature 10 <= 0.6265499999999999)
       Predict: 0.0
      Else (feature 10 > 0.6265499999999999)
       Predict: 0.0
     Else (feature 27 > 0.11085)
      If (feature 29 <= 0.07909)
       Predict: 1.0
      Else (feature 29 > 0.07909)
       Predict: 0.0
    Else (feature 10 > 0.7658)
     Predict: 1.0
   Else (feature 27 > 0.13495000000000001)
    If (feature 21 <= 25.674999999999997)
     If (feature 4 <= 0.12455)
      If (feature 16 <= 0.032535)
       Predict: 0.0
      Else (feature 16 > 0.032535)
       Predict: 0.0
     Else (feature 4 > 0.12455)
      Predict: 1.0
    Else (feature 21 > 25.674999999999997)
     Predict: 1.0
  Else (feature 23 > 877.0)
   If (feature 1 <= 15.705)
    If (feature 4 <= 0.096245)
     Predict: 0.0
  

### Evaluation of the Decision tree model

In [18]:
predictions_dt = dtModel.transform(testData)

In [19]:
#predictions_dt.select('label', 'labelIndexed', 'probability', 'prediction').show()
predictions_dt.select('label', 'labelIndexed', 'probability', 'prediction').limit(20).toPandas()

| | label | labelIndexed | probability | prediction |
|---|---|---|---|---|
| 0 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 1 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 2 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 3 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 4 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 5 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 6 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 7 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 8 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |
| 9 | B | 0.0 | [0.995555555556, 0.00444444444444] | 0.0 |


In [20]:
evaluator_dt = MulticlassClassificationEvaluator(labelCol = "labelIndexed",\
                                                 predictionCol = "prediction",\
                                                 metricName = "accuracy")
accuracy_dt = evaluator_dt.evaluate(predictions_dt)

print("Test Error = %g" % (1.0 - accuracy_dt))

Test Error = 0.054878


In [21]:
# import MulticlassMetrics to compute the confusion matrix
from pyspark.mllib.evaluation import MulticlassMetrics

In [22]:
predictionsAndLabels_dt= predictions_dt.select("prediction", "labelIndexed")

In [23]:
metrics_dt = MulticlassMetrics(predictionsAndLabels_dt.rdd)
metrics_dt.accuracy

0.9451219512195121

In [24]:
metrics_dt.confusionMatrix().toArray()

array([[ 92.,   7.],
       [  2.,  63.]])
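In this matrix the rows are the actual classes and the columns are the predicted classes, ordered by label index (0 = B, 1 = M). The sketch below derives accuracy and the per-class precision and recall for the malignant class from those four counts. The variable names are illustrative only.

```python
# Rows are actual classes, columns are predicted classes (0 = B, 1 = M).
confusion = [[92, 7],
             [2, 63]]

tn, fp = confusion[0]   # actual benign: predicted benign / predicted malignant
fn, tp = confusion[1]   # actual malignant: predicted benign / predicted malignant
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_m = tp / (tp + fp)   # share of malignant predictions that are correct
recall_m = tp / (tp + fn)      # share of malignant cases that were detected

print("accuracy    = {:.4f}".format(accuracy))
print("precision M = {:.4f}".format(precision_m))
print("recall M    = {:.4f}".format(recall_m))
```

For a diagnostic task, recall on the malignant class is often the number to watch: the two false negatives here are malignant tumours predicted as benign.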

In [25]:
# import BinaryClassificationMetrics to compute the area under the ROC curve
from pyspark.mllib.evaluation import BinaryClassificationMetrics

In [26]:
metrics_dt_bc = BinaryClassificationMetrics(predictionsAndLabels_dt.rdd)
metrics_dt_bc.areaUnderROC

0.9492618492618493
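Because `predictionsAndLabels_dt` holds hard 0/1 predictions rather than scores, the ROC curve has a single interior point, and the area under it reduces to the average of the true-positive rate and the true-negative rate. The sketch below reproduces the value above from the confusion-matrix counts. `auc_from_hard_predictions` is a hypothetical helper, not a Spark function.

```python
def auc_from_hard_predictions(confusion):
    """Area under the three-point ROC curve (0,0) -> (FPR,TPR) -> (1,1).

    `confusion` has actual classes in rows and predicted classes in columns.
    """
    (tn, fp), (fn, tp) = confusion
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # The trapezoid rule over the two segments simplifies to (1 + TPR - FPR) / 2.
    return (1.0 + tpr - fpr) / 2.0

# Counts taken from the decision tree confusion matrix shown earlier.
print(auc_from_hard_predictions([[92, 7], [2, 63]]))
```

An AUC computed from continuous scores would generally differ. One option, assuming the default `rawPrediction` column, is `BinaryClassificationEvaluator` from `pyspark.ml.evaluation` with `metricName="areaUnderROC"`.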

### The decision tree (with default parameters) reaches 94.5% accuracy on the test set

## Gradient-Boosted Decision Tree (GBDT)

In [27]:
# import the GBT classifier
from pyspark.ml.classification import GBTClassifier

In [28]:
# create the estimator, fit it on the training data, and predict on the test data
gbdt = GBTClassifier(labelCol = "labelIndexed", featuresCol = "features", maxIter = 50, stepSize = 0.1)
gbdtModel = gbdt.fit(trainingData)
prediction_gbdt = gbdtModel.transform(testData)
prediction_gbdt.select('label', 'labelIndexed', 'probability', 'prediction').show(10)

+-----+------------+--------------------+----------+
|label|labelIndexed|         probability|prediction|
+-----+------------+--------------------+----------+
|    B|         0.0|[0.97851682222474...|       0.0|
|    B|         0.0|[0.97871014383998...|       0.0|
|    B|         0.0|[0.97866117079892...|       0.0|
|    B|         0.0|[0.97871014383998...|       0.0|
|    B|         0.0|[0.97871014383998...|       0.0|
|    B|         0.0|[0.97835010698528...|       0.0|
|    B|         0.0|[0.97871014383998...|       0.0|
|    B|         0.0|[0.97868210148285...|       0.0|
|    B|         0.0|[0.97866117079892...|       0.0|
|    B|         0.0|[0.97856611929318...|       0.0|
+-----+------------+--------------------+----------+
only showing top 10 rows



### Evaluate the GBDT model

In [29]:
predictionsAndLabels_gbdt= prediction_gbdt.select("prediction", "labelIndexed")
metrics_gbdt = MulticlassMetrics(predictionsAndLabels_gbdt.rdd)
metrics_gbdt_bc = BinaryClassificationMetrics(predictionsAndLabels_gbdt.rdd)
print ("the accuracy with the GBDT model is : {}".format(metrics_gbdt.accuracy))
print ("the Area under ROC  with the GBDT model is : {}".format(metrics_gbdt_bc.areaUnderROC))
print ( " The confusion matrix with the GBDT model :")
print (metrics_gbdt.confusionMatrix().toArray())

the accuracy with the GBDT model is : 0.9512195121951219
the Area under ROC  with the GBDT model is : 0.9516705516705518
 The confusion matrix with the GBDT model :
[[ 94.   5.]
 [  3.  62.]]


In [30]:
# Feature importances
gbdtModel.featureImportances

SparseVector(30, {0: 0.0122, 1: 0.0218, 2: 0.0, 3: 0.0, 4: 0.0205, 5: 0.0198, 6: 0.0046, 7: 0.0039, 8: 0.0, 9: 0.0, 10: 0.0054, 11: 0.012, 12: 0.0115, 13: 0.0126, 14: 0.01, 15: 0.0206, 16: 0.0031, 17: 0.002, 18: 0.0, 19: 0.0014, 20: 0.002, 21: 0.0743, 22: 0.0054, 23: 0.603, 24: 0.0212, 25: 0.0033, 26: 0.0086, 27: 0.1173, 28: 0.0023, 29: 0.0011})

#### With the GBDT model, feature 23 is again the most important (importance of about 0.60)

## Random Forest model

In [31]:
# import the Random Forest classifier and model classes
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel

In [32]:
#### Train the RF model

In [33]:
rfClassifer = RandomForestClassifier(labelCol = "labelIndexed", numTrees = 40)

In [34]:
rfModel = rfClassifer.fit(trainingData)
prediction_rf = rfModel.transform(testData)
prediction_rf.select('label', 'labelIndexed', 'probability', 'prediction').show(10)

+-----+------------+--------------------+----------+
|label|labelIndexed|         probability|prediction|
+-----+------------+--------------------+----------+
|    B|         0.0|[0.99813186528588...|       0.0|
|    B|         0.0|[0.99863027038954...|       0.0|
|    B|         0.0|[0.99775307740709...|       0.0|
|    B|         0.0|[0.99779010009827...|       0.0|
|    B|         0.0|[0.99855451281379...|       0.0|
|    B|         0.0|[0.99668347687298...|       0.0|
|    B|         0.0|[0.99855451281379...|       0.0|
|    B|         0.0|[0.96894575917046...|       0.0|
|    B|         0.0|[0.99775307740709...|       0.0|
|    B|         0.0|[0.99900905826833...|       0.0|
+-----+------------+--------------------+----------+
only showing top 10 rows



### Evaluate the RF model

In [35]:
predictionsAndLabels_rf= prediction_rf.select("prediction", "labelIndexed")
metrics_rf = MulticlassMetrics(predictionsAndLabels_rf.rdd)
metrics_rf_bc = BinaryClassificationMetrics(predictionsAndLabels_rf.rdd)
print ("The accuracy with the Random Forest  model is : {}".format(metrics_rf.accuracy))
print ("The Area under ROC  with the Random Forest model is : {}".format(metrics_rf_bc.areaUnderROC))
print ( "The confusion matrix with the Random Forest model is :")
print (metrics_rf.confusionMatrix().toArray())

The accuracy with the Random Forest  model is : 0.9695121951219512
The Area under ROC  with the Random Forest model is : 0.9694638694638695
The confusion matrix with the Random Forest model is :
[[ 96.   3.]
 [  2.  63.]]


### Setting some parameters to improve performance

In [36]:
rfClassifer2 = RandomForestClassifier(labelCol = "labelIndexed",
                                      # numTrees = 40,  # left at the default (20 trees)
                                      impurity = "gini",
                                      maxDepth = 30)  # Spark currently only supports maxDepth <= 30
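The `impurity = "gini"` setting selects the criterion used to score candidate splits. As a rough illustration of what that criterion measures, the sketch below computes the Gini impurity of a node from its class counts and the impurity decrease of a split. The function names and the toy counts are hypothetical, and this is not Spark's implementation.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gain(parent, left, right):
    """Impurity decrease when `parent` is split into `left` and `right`."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted

print(gini([50, 50]))    # evenly mixed two-class node
print(gini([100, 0]))    # pure node
print(split_gain([50, 50], [45, 5], [5, 45]))
```

A mixed node scores 0.5, a pure node scores 0.0, and a split is preferred when it yields a larger impurity decrease.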

In [37]:
rfModel2 = rfClassifer2.fit(trainingData)
prediction_rf2 = rfModel2.transform(testData)
prediction_rf2.select('label', 'labelIndexed', 'probability', 'prediction').show(10)

+-----+------------+-----------+----------+
|label|labelIndexed|probability|prediction|
+-----+------------+-----------+----------+
|    B|         0.0|  [1.0,0.0]|       0.0|
|    B|         0.0|  [1.0,0.0]|       0.0|
|    B|         0.0|  [1.0,0.0]|       0.0|
|    B|         0.0|  [1.0,0.0]|       0.0|
|    B|         0.0|  [1.0,0.0]|       0.0|
|    B|         0.0|  [1.0,0.0]|       0.0|
|    B|         0.0|  [1.0,0.0]|       0.0|
|    B|         0.0|[0.95,0.05]|       0.0|
|    B|         0.0|  [1.0,0.0]|       0.0|
|    B|         0.0|  [1.0,0.0]|       0.0|
+-----+------------+-----------+----------+
only showing top 10 rows



In [38]:
### Evaluate the 2nd RF model

In [39]:
predictionsAndLabels_rf2= prediction_rf2.select("prediction", "labelIndexed")
metrics_rf2 = MulticlassMetrics(predictionsAndLabels_rf2.rdd)
metrics_rf2_bc = BinaryClassificationMetrics(predictionsAndLabels_rf2.rdd)
print ("The accuracy with the Random Forest  model (the 2nd model)  is : {}".format(metrics_rf2.accuracy))
print ("The Area under ROC  with the Random Forest model (the 2nd model) is : {}" \
       .format(metrics_rf2_bc.areaUnderROC))
print ( "The confusion matrix with the Random Forest model (the 2nd model) is :")
print (metrics_rf2.confusionMatrix().toArray())

The accuracy with the Random Forest  model (the 2nd model)  is : 0.975609756097561
The Area under ROC  with the Random Forest model (the 2nd model) is : 0.9771561771561772
The confusion matrix with the Random Forest model (the 2nd model) is :
[[ 96.   3.]
 [  1.  64.]]


### Of the models above, Random Forest performed slightly better on this test set (97.0% and 97.6% accuracy, versus 94.5% for the decision tree and 95.1% for GBDT).

References:

- https://spark.apache.org/docs/2.2.0/mllib-ensembles.html
- https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-tree-classifier
- https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier
- https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier