### Spark MLlib - Random Forest

**Description**

- Random Forest is one of the most popular machine learning algorithms for classification and regression.
- It is an ensemble method.
- The Random Forest model builds several decision trees, and each tree predicts a result individually. In the end, the forest combines the trees' votes, and the class with the most votes becomes the prediction.
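
The voting step can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: the function name `majority_vote` and the sample votes are hypothetical, and Spark's classifier may combine per-tree class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    # most_common(1) yields [(label, count)] for the most frequent label
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single customer
votes = [1.0, 0.0, 1.0, 1.0, 0.0]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

Three of the five hypothetical trees vote for class 1.0, so that class is the forest's prediction.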

**Pros:** It usually offers good accuracy, handles many predictor variables efficiently, and parallelizes well because the trees are trained independently. Some implementations also cope well with missing values.

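**Cons:** It is slower to train and to score than a single decision tree, and its predictions can be biased, for example toward the majority class on imbalanced data.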

**Application:** Scientific research, medical diagnosis.

### Classifying a Bank's Credit Risk

In [1]:
import math

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import PCA
from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row, SparkSession

In [2]:
spSession = SparkSession.builder.master('local').appName('BankCreditRisk').getOrCreate()

In [3]:
rddBank01 = spSession.sparkContext.textFile('aux/datasets/bank.csv')

**We can cache the RDD so that later actions do not re-read the file.**

In [4]:
rddBank01.cache()

aux/datasets/bank.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [5]:
rddBank01.count()

542

In [6]:
rddBank01.take(5)

['"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"',
 '30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"',
 '33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"yes"',
 '35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"yes"',
 '30;"management";"married";"tertiary";"no";1476;"yes";"yes";"unknown";3;"jun";199;4;-1;0;"unknown";"yes"']

In [7]:
header = rddBank01.first()

In [8]:
rddBank02 = rddBank01.filter(lambda row: row != header)

In [9]:
rddBank02.count()

541

### Data Cleaning

In [10]:
def dataCleaning(strRow):
    listAttr = strRow.replace('\"', '').split(';')
    
    age       = float(listAttr[0])
    balance   = float(listAttr[5])
    single    = 1.0 if listAttr[2]  == "single"    else 0.0
    married   = 1.0 if listAttr[2]  == "married"   else 0.0
    divorced  = 1.0 if listAttr[2]  == "divorced"  else 0.0
    primary   = 1.0 if listAttr[3]  == "primary"   else 0.0
    secondary = 1.0 if listAttr[3]  == "secondary" else 0.0
    tertiary  = 1.0 if listAttr[3]  == "tertiary"  else 0.0
    default   = 0.0 if listAttr[4]  == "no"        else 1.0
    loan      = 0.0 if listAttr[7]  == "no"        else 1.0
    outcome   = 0.0 if listAttr[16] == "no"        else 1.0
    
    row = Row(
        AGE = age,
        BALANCE  = balance,
        SINGLE = single,
        MARRIED  = married,
        DIVORCED = divorced,
        PRIMARY = primary,
        SECONDARY = secondary,
        TERTIARY = tertiary,
        DEFAULT = default,
        LOAN = loan,
        OUTCOME = outcome
    )
    
    return row

In [11]:
rddBank03 = rddBank02.map(dataCleaning)

In [12]:
rddBank03.take(5)

[Row(AGE=30.0, BALANCE=1787.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=1.0, SECONDARY=0.0, TERTIARY=0.0, DEFAULT=0.0, LOAN=0.0, OUTCOME=0.0),
 Row(AGE=33.0, BALANCE=4789.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=1.0, TERTIARY=0.0, DEFAULT=0.0, LOAN=1.0, OUTCOME=1.0),
 Row(AGE=35.0, BALANCE=1350.0, SINGLE=1.0, MARRIED=0.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=0.0, TERTIARY=1.0, DEFAULT=0.0, LOAN=0.0, OUTCOME=1.0),
 Row(AGE=30.0, BALANCE=1476.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=0.0, TERTIARY=1.0, DEFAULT=0.0, LOAN=1.0, OUTCOME=1.0),
 Row(AGE=59.0, BALANCE=0.0, SINGLE=0.0, MARRIED=1.0, DIVORCED=0.0, PRIMARY=0.0, SECONDARY=1.0, TERTIARY=0.0, DEFAULT=0.0, LOAN=0.0, OUTCOME=0.0)]
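
As an aside, the same parsing and one-hot encoding can be sketched with Python's standard `csv` module, which handles the quoted fields without a manual `replace`. The helper `one_hot` is a hypothetical name introduced for illustration; the sample line is the first data row printed earlier.

```python
import csv

# First data row of bank.csv, as printed by rddBank01.take(5)
line = '30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"'

# csv.reader splits on ';' and strips the surrounding double quotes
fields = next(csv.reader([line], delimiter=';', quotechar='"'))

def one_hot(value, categories):
    """Map a categorical value to one 0.0/1.0 flag per category."""
    return {category.upper(): 1.0 if value == category else 0.0
            for category in categories}

marital = one_hot(fields[2], ['single', 'married', 'divorced'])
education = one_hot(fields[3], ['primary', 'secondary', 'tertiary'])
print(marital)
print(education)
```

A value outside the listed categories (for example an "unknown" education) yields all zeros, which matches the behavior of the conditional expressions in `dataCleaning`.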

### Exploratory Data Analysis

**Converting the RDD to a DataFrame**

In [13]:
dfBank = spSession.createDataFrame(rddBank03)

In [14]:
dfBank.describe(['AGE', 'BALANCE', 'SINGLE', 'MARRIED', 'DIVORCED']).show()

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|               AGE|           BALANCE|            SINGLE|           MARRIED|           DIVORCED|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               541|               541|               541|               541|                541|
|   mean| 41.26987060998152|1444.7818853974122|0.2754158964879852|0.6155268022181146|0.10905730129390019|
| stddev|10.555374170161665|2423.2722735171924|0.4471370479760759|0.4869207382098541| 0.3119995822161848|
|    min|              19.0|           -1206.0|               0.0|               0.0|                0.0|
|    max|              78.0|           16873.0|               1.0|               1.0|                1.0|
+-------+------------------+------------------+------------------+------------------+-------------------+



In [15]:
dfBank.describe(['DEFAULT', 'LOAN', 'OUTCOME']).show()

+-------+--------------------+-------------------+-------------------+
|summary|             DEFAULT|               LOAN|            OUTCOME|
+-------+--------------------+-------------------+-------------------+
|  count|                 541|                541|                541|
|   mean|0.022181146025878003|0.16266173752310537| 0.3974121996303142|
| stddev|  0.1474086424402979|0.36939832735881994|0.48981549262335145|
|    min|                 0.0|                0.0|                0.0|
|    max|                 1.0|                1.0|                1.0|
+-------+--------------------+-------------------+-------------------+



In [17]:
for column in dfBank.columns:
    if not(isinstance(dfBank.select(column).take(1)[0][0], str)):
        print(f"OUTCOME correlation with {column}: {dfBank.stat.corr('OUTCOME', column)}")

OUTCOME correlation with AGE: -0.18232104327365253
OUTCOME correlation with BALANCE: 0.03657486611997681
OUTCOME correlation with SINGLE: 0.46323284934360515
OUTCOME correlation with MARRIED: -0.37532412991335623
OUTCOME correlation with DIVORCED: -0.07812659940926987
OUTCOME correlation with PRIMARY: -0.12561548832677982
OUTCOME correlation with SECONDARY: 0.026392774894072973
OUTCOME correlation with TERTIARY: 0.08494840766635618
OUTCOME correlation with DEFAULT: -0.04536965206737378
OUTCOME correlation with LOAN: -0.030420586112717318
OUTCOME correlation with OUTCOME: 1.0
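
`DataFrame.stat.corr` computes the Pearson correlation coefficient by default. The plain-Python sketch below shows the calculation; the function name `pearson` is hypothetical, and the two lists hold only the SINGLE and OUTCOME values of the first five cleaned rows, so the result is a toy figure and will not match the full-dataset value above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# SINGLE and OUTCOME from the first five cleaned rows
single_sample = [0.0, 0.0, 1.0, 0.0, 0.0]
outcome_sample = [0.0, 1.0, 1.0, 1.0, 0.0]
r = pearson(single_sample, outcome_sample)
print(r)
```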


### Data Pre-Processing

**Creating a LabeledPoint (target, Vector[features])**<br />
Columns that are irrelevant to the model (or only weakly correlated with the target) could be dropped at this step; in this example all ten features are kept.

In [18]:
def setLabeledPoint(row):
    labeledPoint = (
        row['OUTCOME'], 
        Vectors.dense([
            row['AGE'],
            row['BALANCE'],
            row['SINGLE'],
            row['MARRIED'],
            row['DIVORCED'],
            row['PRIMARY'],
            row['SECONDARY'],
            row['TERTIARY'],
            row['DEFAULT'],
            row['LOAN']
        ])
    )
    
    return labeledPoint

In [19]:
rddBank04 = dfBank.rdd.map(setLabeledPoint)

In [20]:
rddBank04.take(5)

[(0.0, DenseVector([30.0, 1787.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])),
 (1.0, DenseVector([33.0, 4789.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0])),
 (1.0, DenseVector([35.0, 1350.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0])),
 (1.0, DenseVector([30.0, 1476.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0])),
 (0.0, DenseVector([59.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]))]

In [21]:
dfBank = spSession.createDataFrame(rddBank04, ['label', 'features'])

In [22]:
dfBank.select('label', 'features').show(5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|[30.0,1787.0,0.0,...|
|  1.0|[33.0,4789.0,0.0,...|
|  1.0|[35.0,1350.0,1.0,...|
|  1.0|[30.0,1476.0,0.0,...|
|  0.0|[59.0,0.0,0.0,1.0...|
+-----+--------------------+
only showing top 5 rows



**Applying PCA**

In [23]:
pcaBank = PCA(k = 3, inputCol = 'features', outputCol = 'pca_features')

In [24]:
pcaModel = pcaBank.fit(dfBank)

In [25]:
pcaResult = pcaModel.transform(dfBank).select('label', 'pca_features')

In [26]:
pcaResult.show(5, truncate = False)

+-----+-------------------------------------------------------------+
|label|pca_features                                                 |
+-----+-------------------------------------------------------------+
|0.0  |[-1787.018897197381,28.86209683775529,0.06459982604876331]   |
|1.0  |[-4789.020177138492,29.922562636341947,0.9830243513096395]   |
|1.0  |[-1350.022213163262,34.10110809796688,-0.8951427168301695]   |
|1.0  |[-1476.0189517184556,29.051333993596703,-0.39527238680219323]|
|0.0  |[-0.037889185366442445,58.9897182000177,0.7290792383661886]  |
+-----+-------------------------------------------------------------+
only showing top 5 rows
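
The first principal component above is almost exactly the negated BALANCE value. PCA is sensitive to feature scale, and BALANCE has a far larger variance than the other columns. The sketch below uses the standard deviations printed by `describe()` earlier (only the columns shown there, rounded) to estimate BALANCE's share of the summed variance. Standardizing the features before PCA, for example with `pyspark.ml.feature.StandardScaler`, is a common remedy, though this notebook does not apply it.

```python
# Standard deviations copied from the describe() outputs (rounded)
stddev = {
    'AGE': 10.555,
    'BALANCE': 2423.272,
    'SINGLE': 0.447,
    'MARRIED': 0.487,
    'DIVORCED': 0.312,
    'DEFAULT': 0.147,
    'LOAN': 0.369,
}

# Variance is the square of the standard deviation
variance = {name: sd ** 2 for name, sd in stddev.items()}
balance_share = variance['BALANCE'] / sum(variance.values())
print(f'BALANCE share of summed variance: {balance_share:.5f}')
```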



**Creating a numeric index for the label target column**

In [27]:
stringIndexer = StringIndexer(inputCol = 'label', outputCol = 'indexed')

In [28]:
stringIndexerModel = stringIndexer.fit(pcaResult)

In [29]:
dfBank = stringIndexerModel.transform(pcaResult)

In [30]:
dfBank.take(5)

[Row(label=0.0, pca_features=DenseVector([-1787.0189, 28.8621, 0.0646]), indexed=0.0),
 Row(label=1.0, pca_features=DenseVector([-4789.0202, 29.9226, 0.983]), indexed=1.0),
 Row(label=1.0, pca_features=DenseVector([-1350.0222, 34.1011, -0.8951]), indexed=1.0),
 Row(label=1.0, pca_features=DenseVector([-1476.019, 29.0513, -0.3953]), indexed=1.0),
 Row(label=0.0, pca_features=DenseVector([-0.0379, 58.9897, 0.7291]), indexed=0.0)]
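
By default `StringIndexer` assigns index 0 to the most frequent label, which is why `indexed` matches `label` here: class 0.0 is the majority (the mean of OUTCOME is about 0.40). A plain-Python sketch of that frequency ordering follows; `frequency_index` and the sample labels are hypothetical, and tie-breaking in Spark may differ from the alphabetical rule assumed below.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    # Labels are compared as strings, as an indexer over strings would do
    counts = Counter(str(label) for label in labels)
    # Sort by descending count; ties broken alphabetically (an assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample in which 0.0 is the majority class
sample_labels = [0.0, 1.0, 0.0, 0.0, 1.0]
mapping = frequency_index(sample_labels)
print(mapping)
```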

### Machine Learning

In [31]:
(dataTraining, dataTest) = dfBank.randomSplit([.7, .3])

In [32]:
dataTraining.count()

376

In [33]:
dataTest.count()

165

In [34]:
dataTraining.count() + dataTest.count() == dfBank.count()

True
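
`randomSplit` assigns rows at random, so the 70/30 weights are only approximate (here 376 and 165 rows), and the split changes between runs unless a `seed` argument is passed. The sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper and does not reproduce Spark's sampling algorithm.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(541))  # stand-in for the 541 data rows
train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)
print(len(train_a), len(test_a))
```

With the same seed both calls return identical partitions, while the sizes only approximate the requested 70/30 proportion.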

In [35]:
randomForestClassifier = RandomForestClassifier(labelCol = 'indexed', featuresCol = 'pca_features')

In [36]:
model = randomForestClassifier.fit(dataTraining)

In [37]:
model

RandomForestClassificationModel: uid=RandomForestClassifier_318ab898e4ec, numTrees=20, numClasses=2, numFeatures=3

In [38]:
print(f'Number of features: {model.numFeatures}')
print(f'Number of classes: {model.numClasses}')

Number of features: 3
Number of classes: 2


In [39]:
predictions = model.transform(dataTest)

In [40]:
predictions.select('label', 'indexed', 'pca_features', 'prediction').show(5)

+-----+-------+--------------------+----------+
|label|indexed|        pca_features|prediction|
+-----+-------+--------------------+----------+
|  0.0|    0.0|[-14093.033692429...|       0.0|
|  0.0|    0.0|[-11494.034229470...|       0.0|
|  0.0|    0.0|[-9374.0231055509...|       1.0|
|  0.0|    0.0|[-7190.0255034198...|       0.0|
|  0.0|    0.0|[-5996.0302268844...|       0.0|
+-----+-------+--------------------+----------+
only showing top 5 rows



In [41]:
evaluator = MulticlassClassificationEvaluator(
    predictionCol = 'prediction', 
    labelCol = 'indexed', 
    metricName = 'accuracy')

In [42]:
evaluator.evaluate(predictions)

0.6909090909090909

**Confusion Matrix - Summing Up Predictions**

In [43]:
predictions.groupBy('indexed', 'prediction').count().show()

+-------+----------+-----+
|indexed|prediction|count|
+-------+----------+-----+
|    1.0|       1.0|   18|
|    0.0|       1.0|   12|
|    1.0|       0.0|   39|
|    0.0|       0.0|   96|
+-------+----------+-----+
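
The accuracy reported by the evaluator can be recomputed from these four counts, along with precision and recall for class 1.0. The variable names are illustrative; the counts are those in the table above.

```python
# Counts from the groupBy output, treating indexed == 1.0 as the positive class
true_positive = 18   # indexed 1.0, prediction 1.0
false_positive = 12  # indexed 0.0, prediction 1.0
false_negative = 39  # indexed 1.0, prediction 0.0
true_negative = 96   # indexed 0.0, prediction 0.0

total = true_positive + false_positive + false_negative + true_negative
accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f'accuracy:  {accuracy:.4f}')
print(f'precision: {precision:.4f}')
print(f'recall:    {recall:.4f}')
```

The accuracy equals 114/165, matching the evaluator's result, while recall of 18/57 shows that most class 1.0 cases in the test set are predicted as 0.0.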

