# Random Forests



In [None]:
!apt-get update

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

In [None]:
import findspark
findspark.init()

# **Code**
Build a random forest model using Spark’s MLlib library and
predict the target variable from the input features.

Step **1**: Create the SparkSession Object

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('random_forest').getOrCreate()


Step **2**: Read the Dataset

In [None]:
df= spark.read.csv('affairs.csv',inferSchema=True,header=True)

Step **3**: Exploratory Data Analysis

In [None]:
print((df.count(),len(df.columns)))

(6366, 6)


In [None]:
df.printSchema()

root
 |-- rate_marriage: integer (nullable = true)
 |-- age: double (nullable = true)
 |-- yrs_married: double (nullable = true)
 |-- children: double (nullable = true)
 |-- religious: integer (nullable = true)
 |-- affairs: integer (nullable = true)



In [None]:
df.show()

+-------------+----+-----------+--------+---------+-------+
|rate_marriage| age|yrs_married|children|religious|affairs|
+-------------+----+-----------+--------+---------+-------+
|            5|32.0|        6.0|     1.0|        3|      0|
|            4|22.0|        2.5|     0.0|        2|      0|
|            3|32.0|        9.0|     3.0|        3|      1|
|            3|27.0|       13.0|     3.0|        1|      1|
|            4|22.0|        2.5|     0.0|        1|      1|
|            4|37.0|       16.5|     4.0|        3|      1|
|            5|27.0|        9.0|     1.0|        1|      1|
|            4|27.0|        9.0|     0.0|        2|      1|
|            5|37.0|       23.0|     5.5|        2|      1|
|            5|37.0|       23.0|     5.5|        2|      1|
|            3|22.0|        2.5|     0.0|        2|      1|
|            3|27.0|        6.0|     0.0|        1|      1|
|            2|27.0|        6.0|     2.0|        1|      1|
+-------------+----+-----------+--------+---------+-------+

We can now use the describe function to review summary statistics of
the dataset.

In [None]:
df.describe().select('summary','rate_marriage','age','yrs_married','children','religious').show()

+-------+------------------+------------------+-----------------+------------------+------------------+
|summary|     rate_marriage|               age|      yrs_married|          children|         religious|
+-------+------------------+------------------+-----------------+------------------+------------------+
|  count|              6366|              6366|             6366|              6366|              6366|
|   mean| 4.109644989004084|29.082862079798932| 9.00942507068803|1.3968740182218033|2.4261702796104303|
| stddev|0.9614295945655025| 6.847881883668817|7.280119972766412| 1.433470828560344|0.8783688402641785|
|    min|                 1|              17.5|              0.5|               0.0|                 1|
|    max|                 5|              42.0|             23.0|               5.5|                 4|
+-------+------------------+------------------+-----------------+------------------+------------------+



Let us explore individual columns to understand the data in more
detail. The groupBy function combined with count returns the
frequency of each category in the data.

In [None]:
df.groupBy('affairs').count().show()

+-------+-----+
|affairs|count|
+-------+-----+
|      1| 2053|
|      0| 4313|
+-------+-----+



About 32% of the people (2,053 out of 6,366) are involved in some
sort of extramarital affair.

In [None]:
df.groupBy('rate_marriage').count().show()

+-------------+-----+
|rate_marriage|count|
+-------------+-----+
|            1|   99|
|            3|  993|
|            5| 2684|
|            4| 2242|
|            2|  348|
+-------------+-----+



The majority of people (about 77%) rate their marriage highly (4 or 5), and
the rest rate it on the lower side. Let’s drill down a little further to
understand whether the marriage rating is related to the affairs variable.

In [None]:
df.groupBy('rate_marriage','affairs').count().orderBy('rate_marriage','affairs','count',ascending=True).show()

+-------------+-------+-----+
|rate_marriage|affairs|count|
+-------------+-------+-----+
|            1|      0|   25|
|            1|      1|   74|
|            2|      0|  127|
|            2|      1|  221|
|            3|      0|  446|
|            3|      1|  547|
|            4|      0| 1518|
|            4|      1|  724|
|            5|      0| 2197|
|            5|      1|  487|
+-------------+-------+-----+
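The counts above are easier to compare as rates. The following minimal sketch recomputes, in plain Python, the share of respondents reporting an affair at each marriage rating. The counts are copied from the table above; the `counts` dictionary and the `affair_rate` helper are illustrative names, not part of the notebook.

```python
# (no_affair, affair) counts per marriage rating, copied from the table above
counts = {
    1: (25, 74),
    2: (127, 221),
    3: (446, 547),
    4: (1518, 724),
    5: (2197, 487),
}


def affair_rate(no_affair, affair):
    """Share of respondents in a group who report an affair."""
    return affair / (no_affair + affair)


rates = {rating: affair_rate(*pair) for rating, pair in counts.items()}

for rating, rate in sorted(rates.items()):
    print(f"rate_marriage={rating}: {rate:.1%}")
```

With these counts the rate falls steadily as the rating rises, from roughly three in four respondents at rating 1 to fewer than one in five at rating 5.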



In [None]:
df.groupBy('religious','affairs').count().orderBy('religious','affairs','count',ascending=True).show()

+---------+-------+-----+
|religious|affairs|count|
+---------+-------+-----+
|        1|      0|  613|
|        1|      1|  408|
|        2|      0| 1448|
|        2|      1|  819|
|        3|      0| 1715|
|        3|      1|  707|
|        4|      0|  537|
|        4|      1|  119|
+---------+-------+-----+



People who rate themselves lower on the religious scale have a
higher rate of affair involvement: about 40% at level 1 versus about 18%
at level 4.

In [None]:
df.groupBy('affairs','children').count().orderBy('affairs','children','count').show()

+-------+--------+-----+
|affairs|children|count|
+-------+--------+-----+
|      0|     0.0| 1912|
|      0|     1.0|  747|
|      0|     2.0|  873|
|      0|     3.0|  460|
|      0|     4.0|  197|
|      0|     5.5|  124|
|      1|     0.0|  502|
|      1|     1.0|  412|
|      1|     2.0|  608|
|      1|     3.0|  321|
|      1|     4.0|  131|
|      1|     5.5|   79|
+-------+--------+-----+



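The raw counts above do not clearly show a trend between the number of
children and the chance of being involved in an affair. Converting them
to rates helps: about 21% of respondents with no children report an
affair, against 36% to 41% of those with one or more children, with no
clear pattern among the latter.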

In [None]:
#Use the groupBy function along with the mean to know more about the dataset
df.groupBy('affairs').mean().show()

+-------+------------------+------------------+------------------+------------------+------------------+------------+
|affairs|avg(rate_marriage)|          avg(age)|  avg(yrs_married)|     avg(children)|    avg(religious)|avg(affairs)|
+-------+------------------+------------------+------------------+------------------+------------------+------------+
|      1|3.6473453482708234|30.537018996590355|11.152459814905017|1.7289332683877252| 2.261568436434486|         1.0|
|      0| 4.329700904242986| 28.39067934152562| 7.989334569904939|1.2388128912589844|2.5045212149316023|         0.0|
+-------+------------------+------------------+------------------+------------------+------------------+------------+



So, on average, the people who have affairs rate their marriages lower
(3.65 versus 4.33) and are slightly older (30.5 versus 28.4 years). They
have also been married longer (11.2 versus 8.0 years) and rate themselves
as less religious (2.26 versus 2.50).

Step **4**: Feature Engineering

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
df_assembler = VectorAssembler(inputCols= ['rate_marriage','age','yrs_married','children','religious'],outputCol='features')
df =df_assembler.transform(df)
df.printSchema()

root
 |-- rate_marriage: integer (nullable = true)
 |-- age: double (nullable = true)
 |-- yrs_married: double (nullable = true)
 |-- children: double (nullable = true)
 |-- religious: integer (nullable = true)
 |-- affairs: integer (nullable = true)
 |-- features: vector (nullable = true)



As we can see, we now have one extra column named features, which
combines all the input features into a single vector.

In [None]:
model_df = df.select('features','affairs')

Step **5**: Splitting the Dataset

In [None]:
train_df,test_df = model_df.randomSplit([0.75,0.25])
train_df.groupBy('affairs').count().show()

+-------+-----+
|affairs|count|
+-------+-----+
|      1| 1545|
|      0| 3241|
+-------+-----+



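randomSplit does not stratify by the target class (‘affairs’), so it does
not guarantee balanced proportions; we check the class counts in the
training and test sets to confirm they are similar. Because no seed is
passed, the exact counts change from run to run; passing one, for example
randomSplit([0.75,0.25], seed=42), makes the split reproducible.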

In [None]:
test_df.groupBy('affairs').count().show()

+-------+-----+
|affairs|count|
+-------+-----+
|      1|  508|
|      0| 1072|
+-------+-----+
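As a quick check on the split, the sketch below compares the share of the positive class in the two sets using the counts printed above. It is plain Python for illustration only; `train_counts`, `test_counts`, and `positive_share` are names introduced here, and the counts will differ on a rerun because the split is random.

```python
# Class counts copied from the two groupBy outputs above: (affairs=0, affairs=1)
train_counts = (3241, 1545)
test_counts = (1072, 508)


def positive_share(counts):
    """Fraction of rows whose label is 1."""
    negatives, positives = counts
    return positives / (negatives + positives)


train_share = positive_share(train_counts)
test_share = positive_share(test_counts)
test_fraction = sum(test_counts) / (sum(train_counts) + sum(test_counts))

print(f"positive share in train:  {train_share:.1%}")
print(f"positive share in test:   {test_share:.1%}")
print(f"fraction of rows in test: {test_fraction:.1%}")
```

Both sets come out at roughly 32% positives, and the test set holds close to the requested 25% of the rows.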



Step **6**: Build and Train Random Forest Model

In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf_classifier = RandomForestClassifier(labelCol='affairs',numTrees=50).fit(train_df)

There are many hyperparameters that can be set to tune the
performance of the model, but we are choosing the default ones here,
except for the number of decision trees to build (numTrees=50).

Step **7**: Evaluation on Test Data


In [None]:
rf_predictions=rf_classifier.transform(test_df)

In [None]:
rf_predictions.show()

+--------------------+-------+--------------------+--------------------+----------+
|            features|affairs|       rawPrediction|         probability|prediction|
+--------------------+-------+--------------------+--------------------+----------+
|[1.0,27.0,2.5,1.0...|      1|[21.7055600424835...|[0.43411120084967...|       1.0|
|[1.0,27.0,6.0,1.0...|      1|[18.9464042910167...|[0.37892808582033...|       1.0|
|[1.0,27.0,6.0,1.0...|      0|[17.7000559162815...|[0.35400111832563...|       1.0|
|[1.0,27.0,6.0,1.0...|      0|[17.6632372843970...|[0.35326474568794...|       1.0|
|[1.0,27.0,9.0,2.0...|      1|[15.7081951289924...|[0.31416390257984...|       1.0|
|[1.0,27.0,13.0,2....|      1|[15.6270411156630...|[0.31254082231326...|       1.0|
|[1.0,32.0,9.0,3.0...|      1|[15.9350126791239...|[0.31870025358247...|       1.0|
|[1.0,32.0,13.0,2....|      0|[17.2515120817907...|[0.34503024163581...|       1.0|
+--------------------+-------+--------------------+--------------------+----------+

The first column in the predictions table holds the input features of the
test data. The second column is the actual label of the test data.
The third column (rawPrediction) is an unnormalized confidence score
for each of the two possible outputs. The fourth column (probability) is
the estimated conditional probability of each class label, and the final
column is the class predicted by the random forest classifier.
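The relationship between the rawPrediction and probability columns can be checked by hand. The sketch below assumes that the raw scores of a random forest sum to the number of trees, so that dividing by their total gives the probability vector; that assumption is consistent with the first row shown above (21.7055... / 50 = 0.4341...). Only the first component is visible in the truncated output, so the second component is derived under the same assumption, and the variable names are illustrative.

```python
NUM_TREES = 50  # matches numTrees=50 used when fitting the classifier

# First component as displayed in the first row of the output above.
# The second component is NOT shown there; it is derived under the
# assumption that the raw scores sum to the number of trees.
raw_first = 21.7055600424835
raw_prediction = [raw_first, NUM_TREES - raw_first]

# Normalize the raw scores so they sum to 1
total = sum(raw_prediction)
probability = [score / total for score in raw_prediction]

# The predicted class is the index with the largest probability
prediction = float(probability.index(max(probability)))

print(probability)
print(prediction)
```

The first probability reproduces the value displayed in the table, and since the second class has the larger probability the predicted label is 1.0, as in the output.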

Apply a groupBy function on the prediction
column to find out the number of predictions made for the positive and
negative classes.

In [None]:
rf_predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0| 1252|
|       1.0|  328|
+----------+-----+



To evaluate these predictions, we import the
classification evaluators.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Accuracy

In [None]:
rf_accuracy = MulticlassClassificationEvaluator(labelCol='affairs',metricName='accuracy').evaluate(rf_predictions)

In [None]:
print('The accuracy of RF on test data is {0:.0%}'.format(rf_accuracy))

The accuracy of RF on test data is 72%
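An accuracy figure is easier to judge against a trivial baseline. The sketch below computes the accuracy of always predicting the majority class, using the test-set class counts printed in Step 5. It is an illustrative calculation in plain Python, not something the notebook runs, and the variable names are introduced here.

```python
# Test-set class counts copied from the groupBy output in Step 5
test_counts = {0: 1072, 1: 508}

# A classifier that always predicts the most frequent class
majority_class = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_class] / sum(test_counts.values())

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.1%}")
```

The baseline is about 68%, so the reported 72% is a gain of a few percentage points over always predicting "no affair".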


# Precision

In [None]:
rf_precision=MulticlassClassificationEvaluator(labelCol='affairs',metricName='weightedPrecision').evaluate(rf_predictions)

In [None]:
print('The precision rate on test data is {0:.0%}'.format(rf_precision))

The precision rate on test data is 70%
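Note that the metric requested above is weightedPrecision, which averages the per-class precision using each class's share of the true labels as its weight, rather than reporting the precision of the positive class alone. The following sketch shows that calculation in plain Python. The `weighted_precision` helper and the ten-row toy data are hypothetical and are not taken from the affairs dataset.

```python
from collections import Counter


def weighted_precision(labels, predictions):
    """Per-class precision, weighted by each class's share of the true labels."""
    label_counts = Counter(labels)
    predicted_counts = Counter(predictions)
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    total = len(labels)

    score = 0.0
    for cls, n_true in label_counts.items():
        n_pred = predicted_counts.get(cls, 0)
        # Precision is taken as 0 when the class is never predicted
        precision = correct.get(cls, 0) / n_pred if n_pred else 0.0
        score += (n_true / total) * precision
    return score


# Hypothetical toy data, for illustration only
labels      = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

toy_weighted_precision = weighted_precision(labels, predictions)
print(f"{toy_weighted_precision:.4f}")
```

In the toy data, class 0 has precision 5/7 with weight 0.6 and class 1 has precision 2/3 with weight 0.4, so the result is dominated by the larger class.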


# AUC

In [None]:
rf_auc=BinaryClassificationEvaluator(labelCol='affairs').evaluate(rf_predictions)

In [None]:
print( rf_auc)

0.7417183276530734
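BinaryClassificationEvaluator reports the area under the ROC curve by default. One way to read that number is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC from that pairwise definition in plain Python. It illustrates the metric only and is not how Spark computes it (Spark works from the ROC curve); the `auc_from_scores` helper and the toy scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: true labels and predicted scores for class 1
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.35, 0.7, 0.8]

toy_auc = auc_from_scores(labels, scores)
print(f"{toy_auc:.4f}")
```

Under this reading, the value of about 0.74 reported above means the model ranks a positive example above a negative one roughly three times out of four; 0.5 would correspond to random ranking.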


A random forest also reports the importance of each
feature in terms of predictive power, which is very useful for identifying
the variables that contribute the most to predictions.

In [None]:
rf_classifier.featureImportances

SparseVector(5, {0: 0.6156, 1: 0.0275, 2: 0.2394, 3: 0.0591, 4: 0.0583})

We used five features, and their importances are available through the
featureImportances attribute of the fitted model. To find out which input
feature is mapped to which index, we can use the column metadata.

In [None]:
df.schema["features"].metadata["ml_attr"]["attrs"]

{'numeric': [{'idx': 0, 'name': 'rate_marriage'},
  {'idx': 1, 'name': 'age'},
  {'idx': 2, 'name': 'yrs_married'},
  {'idx': 3, 'name': 'children'},
  {'idx': 4, 'name': 'religious'}]}

So, rate_marriage is the most important feature from a prediction
standpoint (about 0.62), followed by yrs_married (about 0.24). The least
significant variable seems to be age (about 0.03).
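The index-to-name lookup can also be done in code rather than by eye. The sketch below joins the two outputs shown above in plain Python; both the importance values and the metadata dictionary are copied from those outputs, and the variable names are illustrative.

```python
# Importances copied from the SparseVector output above (index -> value)
importances = {0: 0.6156, 1: 0.0275, 2: 0.2394, 3: 0.0591, 4: 0.0583}

# Attribute metadata copied from the output above
attrs = {
    'numeric': [
        {'idx': 0, 'name': 'rate_marriage'},
        {'idx': 1, 'name': 'age'},
        {'idx': 2, 'name': 'yrs_married'},
        {'idx': 3, 'name': 'children'},
        {'idx': 4, 'name': 'religious'},
    ]
}

# Flatten the metadata into an index -> feature name lookup
index_to_name = {a['idx']: a['name'] for group in attrs.values() for a in group}

# Pair each importance with its feature name and sort, largest first
ranked = sorted(
    ((index_to_name[i], value) for i, value in importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name:<15}{value:.4f}")
```

In a live session the same dictionaries could presumably be built from rf_classifier.featureImportances and the metadata expression used above, instead of being typed in by hand.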

Step **8**: Saving the Model

1-Save the ML model

In [None]:
pwd

'/content'

In [None]:
from pyspark.ml.classification import RandomForestClassificationModel
rf_classifier.save("/content/RF_model")

2-Load the ML model

In [None]:
rf=RandomForestClassificationModel.load("/content/RF_model")
#new_predictions=rf.transform(new_df)
#The new predictions table would contain a column with the model predictions