# Random Forest

## Step 1: Create the SparkSession Object


We start the Jupyter Notebook, import `SparkSession`, and create a new 
`SparkSession` object to use Spark.

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('random_forest').getOrCreate()

## Step 2: Read the Dataset

We then load the dataset into a Spark DataFrame, inferring the column types and reading the header row.

In [3]:
df = spark.read.csv('affairs.csv',inferSchema=True,header=True)

## Step 3: Exploratory Data Analysis

In this section, we take a closer look at the dataset: we view its rows, 
check its shape, and review summary statistics for each of the 
variables.

In [4]:
print((df.count(), len(df.columns)))

(6366, 6)


The above output confirms that the dataset has 6,366 rows and 6 columns. 
Next, we check the data types of the input columns to see whether any of 
them need to be cast to a different type.

In [5]:
df.printSchema()

root
 |-- rate_marriage: integer (nullable = true)
 |-- age: double (nullable = true)
 |-- yrs_married: double (nullable = true)
 |-- children: double (nullable = true)
 |-- religious: integer (nullable = true)
 |-- affairs: integer (nullable = true)



Let’s look at the first few rows of the dataset using the `show` 
function in Spark.

In [6]:
df.show(5)

+-------------+----+-----------+--------+---------+-------+
|rate_marriage| age|yrs_married|children|religious|affairs|
+-------------+----+-----------+--------+---------+-------+
|            5|32.0|        6.0|     1.0|        3|      0|
|            4|22.0|        2.5|     0.0|        2|      0|
|            3|32.0|        9.0|     3.0|        3|      1|
|            3|27.0|       13.0|     3.0|        1|      1|
|            4|22.0|        2.5|     0.0|        1|      1|
+-------------+----+-----------+--------+---------+-------+
only showing top 5 rows



In [7]:
df.describe().select('summary','rate_marriage','age', 'yrs_married','children','religious').show()

+-------+------------------+------------------+-----------------+------------------+------------------+
|summary|     rate_marriage|               age|      yrs_married|          children|         religious|
+-------+------------------+------------------+-----------------+------------------+------------------+
|  count|              6366|              6366|             6366|              6366|              6366|
|   mean| 4.109644989004084|29.082862079798932| 9.00942507068803|1.3968740182218033|2.4261702796104303|
| stddev|0.9614295945655025| 6.847881883668817|7.280119972766412| 1.433470828560344|0.8783688402641785|
|    min|                 1|              17.5|              0.5|               0.0|                 1|
|    max|                 5|              42.0|             23.0|               5.5|                 4|
+-------+------------------+------------------+-----------------+------------------+------------------+



We can observe that the average age is close to 29 years and that the 
average length of marriage is about 9 years.

Let us explore individual columns to understand the data in more 
detail. The `groupBy` function, used together with `count`, returns the 
frequency of each category in the data.

In [8]:
df.groupBy('affairs').count().show()

+-------+-----+
|affairs|count|
+-------+-----+
|      1| 2053|
|      0| 4313|
+-------+-----+



In [9]:
df.groupBy('rate_marriage').count().show()

+-------------+-----+
|rate_marriage|count|
+-------------+-----+
|            1|   99|
|            3|  993|
|            5| 2684|
|            4| 2242|
|            2|  348|
+-------------+-----+



In [10]:
df.groupBy('rate_marriage','affairs').count().orderBy('rate_marriage','affairs','count',ascending=True).show()

+-------------+-------+-----+
|rate_marriage|affairs|count|
+-------------+-------+-----+
|            1|      0|   25|
|            1|      1|   74|
|            2|      0|  127|
|            2|      1|  221|
|            3|      0|  446|
|            3|      1|  547|
|            4|      0| 1518|
|            4|      1|  724|
|            5|      0| 2197|
|            5|      1|  487|
+-------------+-------+-----+
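
The grouped counts above are easier to compare once they are turned into rates. The following is a minimal plain-Python sketch (no Spark required) that computes the share of rows with `affairs = 1` for each `rate_marriage` value, using the counts printed above. The helper name `positive_rate_by_group` is hypothetical and introduced only for illustration.

```python
# Counts copied from the groupBy('rate_marriage', 'affairs') output above:
# (rate_marriage, affairs) -> count
counts = {
    (1, 0): 25, (1, 1): 74,
    (2, 0): 127, (2, 1): 221,
    (3, 0): 446, (3, 1): 547,
    (4, 0): 1518, (4, 1): 724,
    (5, 0): 2197, (5, 1): 487,
}


def positive_rate_by_group(counts):
    """Return {group: share of rows with label 1} (hypothetical helper)."""
    totals = {}
    positives = {}
    for (group, label), n in counts.items():
        totals[group] = totals.get(group, 0) + n
        if label == 1:
            positives[group] = positives.get(group, 0) + n
    return {g: positives.get(g, 0) / totals[g] for g in sorted(totals)}


rates = positive_rate_by_group(counts)
for rating, rate in rates.items():
    print(f"rate_marriage={rating}: {rate:.1%} with affairs=1")
```

In this data the share falls steadily as the marriage rating rises, from roughly three in four at rating 1 to under one in five at rating 5.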



In [11]:
df.groupBy('religious','affairs').count().orderBy('religious','affairs','count',ascending=True).show()

+---------+-------+-----+
|religious|affairs|count|
+---------+-------+-----+
|        1|      0|  613|
|        1|      1|  408|
|        2|      0| 1448|
|        2|      1|  819|
|        3|      0| 1715|
|        3|      1|  707|
|        4|      0|  537|
|        4|      1|  119|
+---------+-------+-----+



In [12]:
df.groupBy('children','affairs').count().orderBy('children','affairs','count',ascending=True).show()

+--------+-------+-----+
|children|affairs|count|
+--------+-------+-----+
|     0.0|      0| 1912|
|     0.0|      1|  502|
|     1.0|      0|  747|
|     1.0|      1|  412|
|     2.0|      0|  873|
|     2.0|      1|  608|
|     3.0|      0|  460|
|     3.0|      1|  321|
|     4.0|      0|  197|
|     4.0|      1|  131|
|     5.5|      0|  124|
|     5.5|      1|   79|
+--------+-------+-----+



In [13]:
df.groupBy('affairs').mean().show()

+-------+------------------+------------------+------------------+------------------+------------------+------------+
|affairs|avg(rate_marriage)|          avg(age)|  avg(yrs_married)|     avg(children)|    avg(religious)|avg(affairs)|
+-------+------------------+------------------+------------------+------------------+------------------+------------+
|      1|3.6473453482708234|30.537018996590355|11.152459814905017|1.7289332683877252| 2.261568436434486|         1.0|
|      0| 4.329700904242986| 28.39067934152562| 7.989334569904939|1.2388128912589844|2.5045212149316023|         0.0|
+-------+------------------+------------------+------------------+------------------+------------------+------------+



## Step 4: Feature Engineering

This is the part where we create a single vector combining all input 
features by using Spark’s `VectorAssembler`.

In [14]:
from pyspark.ml.feature import VectorAssembler

df_assembler = VectorAssembler(inputCols=['rate_marriage', 'age', 'yrs_married', 'children', 'religious'], outputCol="features")
df = df_assembler.transform(df)

In [16]:
df.printSchema()

root
 |-- rate_marriage: integer (nullable = true)
 |-- age: double (nullable = true)
 |-- yrs_married: double (nullable = true)
 |-- children: double (nullable = true)
 |-- religious: integer (nullable = true)
 |-- affairs: integer (nullable = true)
 |-- features: vector (nullable = true)



As we can see, we now have one extra column named `features`, which 
combines all the input features into a single dense vector.

In [17]:
df.select(['features','affairs']).show(10,False)

+-----------------------+-------+
|features               |affairs|
+-----------------------+-------+
|[5.0,32.0,6.0,1.0,3.0] |0      |
|[4.0,22.0,2.5,0.0,2.0] |0      |
|[3.0,32.0,9.0,3.0,3.0] |1      |
|[3.0,27.0,13.0,3.0,1.0]|1      |
|[4.0,22.0,2.5,0.0,1.0] |1      |
|[4.0,37.0,16.5,4.0,3.0]|1      |
|[5.0,27.0,9.0,1.0,1.0] |1      |
|[4.0,27.0,9.0,0.0,2.0] |1      |
|[5.0,37.0,23.0,5.5,2.0]|1      |
|[5.0,37.0,23.0,5.5,2.0]|1      |
+-----------------------+-------+
only showing top 10 rows
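
Conceptually, the assembler concatenates the selected column values of each row, in `inputCols` order, into one vector of doubles. A minimal plain-Python sketch of that idea follows; the `assemble` function is a hypothetical stand-in used for illustration, not Spark's implementation.

```python
INPUT_COLS = ['rate_marriage', 'age', 'yrs_married', 'children', 'religious']


def assemble(row, input_cols=INPUT_COLS):
    """Concatenate the chosen columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


# First row of the dataset as shown by df.show(5)
first_row = {'rate_marriage': 5, 'age': 32.0, 'yrs_married': 6.0,
             'children': 1.0, 'religious': 3, 'affairs': 0}

features = assemble(first_row)
print(features)  # [5.0, 32.0, 6.0, 1.0, 3.0]
```

The target column `affairs` is deliberately left out of the input columns, so it never leaks into the feature vector.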



Let us select only the `features` column as input and the `affairs` column 
as output for training the random forest model.

In [18]:
model_df = df.select(['features','affairs'])

## Step 5: Splitting the Dataset

We have to split the dataset into training and test datasets in order to train 
and evaluate the performance of the random forest model. We split it in 
a 75/25 ratio and train our model on 75% of the dataset.

In [20]:
train_df, test_df = model_df.randomSplit([0.75,0.25])
print(train_df.count())

4743
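
The training count is not exactly 75% of 6,366 because `randomSplit` assigns rows at random, and the split will differ between runs when no seed is passed. The sketch below illustrates the idea in plain Python, assuming each row is assigned independently to the training set with probability 0.75. The `random_split` helper is hypothetical and is not Spark's implementation. In PySpark, passing a seed as the second argument of `randomSplit` makes the split repeatable.

```python
import random


def random_split(rows, train_weight=0.75, seed=42):
    """Assign each row independently to train or test (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(6366))  # stand-ins for the 6,366 rows of the dataset
train, test = random_split(rows)
print(len(train), len(test), round(len(train) / len(rows), 3))
```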


In [21]:
train_df.groupBy('affairs').count().show()

+-------+-----+
|affairs|count|
+-------+-----+
|      1| 1551|
|      0| 3192|
+-------+-----+



Comparing the class counts in the training set with those in the test set 
lets us check that the target class (`affairs`) is distributed similarly in both.

In [22]:
test_df.groupBy('affairs').count().show()

+-------+-----+
|affairs|count|
+-------+-----+
|      1|  502|
|      0| 1121|
+-------+-----+
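
To compare the class mix directly, the counts printed above can be converted to shares. This is a small plain-Python sketch using only the numbers from the outputs in this section; `positive_share` is a hypothetical helper name.

```python
# Class counts copied from the outputs above: {affairs: count}
full_counts = {1: 2053, 0: 4313}
train_counts = {1: 1551, 0: 3192}
test_counts = {1: 502, 0: 1121}


def positive_share(counts):
    """Share of rows whose label is 1."""
    return counts[1] / (counts[0] + counts[1])


for name, counts in [('full', full_counts),
                     ('train', train_counts),
                     ('test', test_counts)]:
    print(f"{name}: {positive_share(counts):.1%} of rows have affairs=1")
```

The positive share is between 31% and 33% in all three sets, so the split preserves the class mix reasonably well, although the classes themselves are imbalanced (roughly one positive for every two negatives).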



## Step 6: Build and Train Random Forest Model

In this part, we build and train the random forest model using `features` 
as the input and `affairs` as the output column.

In [23]:
from pyspark.ml.classification import RandomForestClassifier

rf_classifier = RandomForestClassifier(labelCol='affairs', numTrees=50).fit(train_df)

## Step 7: Evaluation on Test Data

Once we have trained our model on the training dataset, we can evaluate 
its performance on the test set.

In [24]:
rf_predictions = rf_classifier.transform(test_df)
rf_predictions.show()

+--------------------+-------+--------------------+--------------------+----------+
|            features|affairs|       rawPrediction|         probability|prediction|
+--------------------+-------+--------------------+--------------------+----------+
|[1.0,22.0,2.5,1.0...|      1|[25.4750535109005...|[0.50950107021801...|       0.0|
|[1.0,27.0,2.5,1.0...|      1|[25.6910529703782...|[0.51382105940756...|       0.0|
|[1.0,27.0,6.0,1.0...|      0|[17.2420790681183...|[0.34484158136236...|       1.0|
|[1.0,27.0,6.0,1.0...|      0|[17.2420790681183...|[0.34484158136236...|       1.0|
|[1.0,32.0,2.5,1.0...|      0|[25.3484798408766...|[0.50696959681753...|       0.0|
|[1.0,32.0,13.0,0....|      1|[21.3084220071231...|[0.42616844014246...|       1.0|
|[1.0,32.0,13.0,2....|      1|[13.9218400777464...|[0.27843680155492...|       1.0|
|[1.0,32.0,16.5,2....|      1|[13.7206914983931...|[0.27441382996786...|       1.0|
|[1.0,32.0,16.5,3....|      1|[15.6431715009444...|[0.31286343001888...|    
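
The three model output columns are related. In the rows shown, the first entry of `probability` equals the first entry of `rawPrediction` divided by 50, the number of trees, and `prediction` is the class with the larger probability. The sketch below reproduces that relationship for the displayed values in plain Python. It assumes the raw scores of the two classes sum to `numTrees`, which is consistent with this output but is an assumption rather than something stated in the source; the function name is hypothetical.

```python
NUM_TREES = 50  # numTrees passed to RandomForestClassifier above


def to_probability_and_prediction(raw_class0, num_trees=NUM_TREES):
    """Derive both class probabilities and the label from the class-0 raw score.

    Assumes the two raw scores sum to num_trees (an assumption, see above).
    """
    p0 = raw_class0 / num_trees
    p1 = 1.0 - p0
    prediction = 0.0 if p0 >= p1 else 1.0
    return p0, p1, prediction


# Class-0 raw scores as displayed (truncated) in the first and third rows
for raw0 in (25.4750535109005, 17.2420790681183):
    p0, p1, pred = to_probability_and_prediction(raw0)
    print(f"raw0={raw0:.4f} -> p0={p0:.4f}, p1={p1:.4f}, prediction={pred}")
```

Note how close the first row is to the 0.5 boundary: a class-0 probability of about 0.51 is enough to produce a prediction of 0.0.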

In [25]:
rf_predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0| 1340|
|       1.0|  283|
+----------+-----+



To evaluate these predictions, we import the 
classification evaluators.

In [27]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

**Accuracy**

In [30]:
rf_accuracy = MulticlassClassificationEvaluator(labelCol='affairs',metricName='accuracy').evaluate(rf_predictions)
print('The accuracy of RF on test data is {0:.0%}'.format(rf_accuracy))

The accuracy of RF on test data is 73%


**Precision**

In [31]:
rf_precision = MulticlassClassificationEvaluator(labelCol='affairs',metricName='weightedPrecision').evaluate(rf_predictions)
print('The precision rate on test data is {0:.0%}'.format(rf_precision))

The precision rate on test data is 72%
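
To make the two metrics concrete, the sketch below computes accuracy and weighted precision by hand in plain Python. The label lists are a small made-up example, not the model's predictions, and the function names are hypothetical. Weighted precision is taken here to mean per-class precision averaged with weights equal to each class's share of the true labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_precision(y_true, y_pred):
    """Per-class precision, weighted by each class's share of true labels."""
    support = Counter(y_true)      # true-label counts per class
    predicted = Counter(y_pred)    # predicted-label counts per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        score += precision * n_true / total
    return score


# Made-up labels for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy(y_true, y_pred)
wp = weighted_precision(y_true, y_pred)
print(f"accuracy={acc:.2f}, weighted precision={wp:.4f}")
```

Because the weights follow the class frequencies, the majority class dominates the weighted figure, which is worth keeping in mind with an imbalanced target such as `affairs`.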


**AUC**

In [32]:
rf_auc = BinaryClassificationEvaluator(labelCol='affairs').evaluate(rf_predictions)
print(rf_auc)

0.7572226704244571
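
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC that way on a small made-up set of scores (not the model's output). `pairwise_auc` is a hypothetical helper, and Spark derives its value from the ROC curve rather than by enumerating pairs.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for the positive class, for illustration only
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0, 0]

auc = pairwise_auc(scores, labels)
print(round(auc, 4))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the value of about 0.76 reported above sits between the two.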


As mentioned in the earlier part, a random forest reports the importance of 
each feature in terms of predictive power, which is very useful for identifying 
the variables that contribute the most to predictions.

In [33]:
rf_classifier.featureImportances

SparseVector(5, {0: 0.612, 1: 0.0217, 2: 0.2234, 3: 0.0636, 4: 0.0793})

We used five features, and their importances are available through the 
model's `featureImportances` attribute. To know which input feature is mapped 
to which index value, we can use the metadata of the `features` column.

In [34]:
df.schema["features"].metadata["ml_attr"]["attrs"]

{'numeric': [{'idx': 0, 'name': 'rate_marriage'},
  {'idx': 1, 'name': 'age'},
  {'idx': 2, 'name': 'yrs_married'},
  {'idx': 3, 'name': 'children'},
  {'idx': 4, 'name': 'religious'}]}

So, `rate_marriage` is the most important feature from a prediction 
standpoint, followed by `yrs_married`. The least significant variable seems to 
be `age`.
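
The index-to-name mapping and the importance scores can be combined into a ranked list, which makes this reading easier to verify. The following plain-Python sketch uses the values printed above; in a live session the same inputs would come from `rf_classifier.featureImportances` and the column metadata.

```python
# Names in index order, from the 'features' column metadata above
feature_names = ['rate_marriage', 'age', 'yrs_married', 'children', 'religious']

# Importances copied from the SparseVector output above: index -> score
importances = {0: 0.612, 1: 0.0217, 2: 0.2234, 3: 0.0636, 4: 0.0793}

# Pair each name with its score and sort from most to least important
ranked = sorted(
    ((feature_names[idx], score) for idx, score in importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name:<14}{score:.4f}")
```

The scores sum to 1, so each one can be read as a share of the total importance: `rate_marriage` alone accounts for about 61%.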

## Step 8: Saving the Model

Save the ML model

In [35]:
from pyspark.ml.classification import RandomForestClassificationModel

rf_classifier.save("./RF_model")

Load the ML model

In [36]:
rf = RandomForestClassificationModel.load("./RF_model")