# Tree Methods 

This notebook compares three tree-based classification methods:

* A single decision tree
* A random forest
* A gradient boosted tree classifier

**Goal: predict whether a college is private or public (the `Private` column) from the remaining columns:**

    Private A factor with levels No and Yes indicating private or public university
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate

In [59]:
# Initialize pyspark
import findspark
findspark.init()
import pyspark

In [60]:
# Initialize and create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('college').getOrCreate()

In [61]:
# Using Spark to read the college data set
data = spark.read.csv('College.csv', header=True, inferSchema=True)

In [62]:
# Printing the first row of the dataframe
data.head()

Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)

In [63]:
# Printing the schema of the dataframe
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



### Spark Formatting of Data

In [64]:
# A few things we need to do before Spark can accept the data!
# It needs to be in the form of two columns
# ("label","features")

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [65]:
data.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

In [66]:
#Ignoring the School column: it is a free-text name and not useful as a feature

In [67]:
#Assembling all the predictor (independent) columns into a single vector column "features"

assembler = VectorAssembler(inputCols=['Apps','Accept','Enroll','Top10perc','Top25perc','F_Undergrad',
                             'P_Undergrad','Outstate','Room_Board','Books','Personal','PhD','Terminal',
                             'S_F_Ratio','perc_alumni','Expend','Grad_Rate'], outputCol='features')

In [68]:
output = assembler.transform(data)
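
The hand-typed `inputCols` list could also be derived from `data.columns`, which avoids typos when the schema is wide. A minimal sketch, assuming the column list printed above; the names `columns`, `excluded`, and `feature_cols` are illustrative and not part of the original notebook:

```python
# Column names as printed by data.columns above.
columns = ['School', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
           'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate',
           'Room_Board', 'Books', 'Personal', 'PhD', 'Terminal',
           'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate']

# School is an identifier and Private is the label, so neither is a feature.
excluded = {'School', 'Private'}
feature_cols = [c for c in columns if c not in excluded]

print(len(feature_cols))  # 17
print(feature_cols[:3])   # ['Apps', 'Accept', 'Enroll']
```

With a live session, `VectorAssembler(inputCols=feature_cols, outputCol='features')` should then behave the same as the explicit list.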

*Since the label column "Private" is a categorical string, it needs to be converted to a numeric type.*

In [69]:
from pyspark.ml.feature import StringIndexer

In [70]:
indexer = StringIndexer(inputCol='Private', outputCol='PrivateInd')

In [71]:
indexer_model = indexer.fit(output)

In [72]:
fixed_output = indexer_model.transform(output)

In [73]:
fixed_output.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PrivateInd: double (nullable = false)



In [74]:
fixed_output.select('Private','PrivateInd').show(3)

+-------+----------+
|Private|PrivateInd|
+-------+----------+
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
+-------+----------+
only showing top 3 rows
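
`Yes` maps to `0.0` because `StringIndexer` by default orders labels by descending frequency (`stringOrderType='frequencyDesc'`), so the most common value gets index 0. The pure-Python sketch below imitates that rule on made-up toy data; `index_by_frequency` is a hypothetical helper, not a Spark API, and it does not reproduce Spark's tie-breaking:

```python
from collections import Counter

def index_by_frequency(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # most_common() sorts by descending count.
    return {label: float(i) for i, (label, _) in enumerate(counts.most_common())}

# Toy label column (made-up values, not the real College.csv counts).
toy_labels = ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
mapping = index_by_frequency(toy_labels)
print(mapping)  # {'Yes': 0.0, 'No': 1.0}
```

In this notebook that means label 0.0 stands for private colleges and 1.0 for public ones, which matters when reading the predictions and confusion matrices.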



In [75]:
final_data = fixed_output.select('PrivateInd','features')

In [76]:
final_data.show(3)

+----------+--------------------+
|PrivateInd|            features|
+----------+--------------------+
|       0.0|[1660.0,1232.0,72...|
|       0.0|[2186.0,1924.0,51...|
|       0.0|[1428.0,1097.0,33...|
+----------+--------------------+
only showing top 3 rows



__Splitting the data into training and test sets: the training set is used to fit the models, and the test set is used to evaluate the fitted models__

In [77]:
train_data,test_data = final_data.randomSplit([0.7,0.3])
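
`randomSplit` assigns rows to splits by random sampling, so the 70/30 proportions are approximate and, without a `seed` argument, the split (and every accuracy reported later) changes from run to run. The sketch below imitates per-row sampling in plain Python to show the effect; `toy_random_split` is a hypothetical helper and only an approximation of what Spark does internally:

```python
import random

def toy_random_split(rows, train_weight, seed):
    """Assign each row independently to train or test with a seeded RNG."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is drawn on its own, so the split sizes are not exact.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))
train, test = toy_random_split(rows, train_weight=0.7, seed=42)
print(len(train), len(test))  # roughly 700 and 300, not exactly

# The same seed reproduces the same split.
train_again, _ = toy_random_split(rows, train_weight=0.7, seed=42)
print(train == train_again)  # True
```

In the notebook itself, passing a seed, for example `final_data.randomSplit([0.7, 0.3], seed=42)`, should make the results repeatable (the value 42 is arbitrary).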

### Setting up the Tree Classifier Models

In [78]:
from pyspark.ml.classification import DecisionTreeClassifier,RandomForestClassifier,GBTClassifier

Creating all three models:

In [79]:
dtc = DecisionTreeClassifier(labelCol='PrivateInd', featuresCol='features')
rfc = RandomForestClassifier(labelCol='PrivateInd', featuresCol='features')
gbt = GBTClassifier(labelCol='PrivateInd', featuresCol='features')

Training all three models:

In [80]:
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

Getting the predictions of all three models on the test dataset:

In [81]:
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

In [82]:
dtc_predictions.show(3)

+----------+--------------------+-------------+-----------+----------+
|PrivateInd|            features|rawPrediction|probability|prediction|
+----------+--------------------+-------------+-----------+----------+
|       0.0|[81.0,72.0,51.0,3...|   [16.0,0.0]|  [1.0,0.0]|       0.0|
|       0.0|[150.0,130.0,88.0...|  [299.0,0.0]|  [1.0,0.0]|       0.0|
|       0.0|[167.0,130.0,46.0...|  [299.0,0.0]|  [1.0,0.0]|       0.0|
+----------+--------------------+-------------+-----------+----------+
only showing top 3 rows



In [83]:
rfc_predictions.show(3)

+----------+--------------------+--------------------+--------------------+----------+
|PrivateInd|            features|       rawPrediction|         probability|prediction|
+----------+--------------------+--------------------+--------------------+----------+
|       0.0|[81.0,72.0,51.0,3...|[18.8130082383486...|[0.94065041191743...|       0.0|
|       0.0|[150.0,130.0,88.0...|[19.8229420131831...|[0.99114710065915...|       0.0|
|       0.0|[167.0,130.0,46.0...|[19.8130082383486...|[0.99065041191743...|       0.0|
+----------+--------------------+--------------------+--------------------+----------+
only showing top 3 rows



In [84]:
gbt_predictions.show(3)

+----------+--------------------+--------------------+--------------------+----------+
|PrivateInd|            features|       rawPrediction|         probability|prediction|
+----------+--------------------+--------------------+--------------------+----------+
|       0.0|[81.0,72.0,51.0,3...|[1.37160769408461...|[0.93952903486028...|       0.0|
|       0.0|[150.0,130.0,88.0...|[1.54645239565077...|[0.95659912404290...|       0.0|
|       0.0|[167.0,130.0,46.0...|[1.54645239565077...|[0.95659912404290...|       0.0|
+----------+--------------------+--------------------+--------------------+----------+
only showing top 3 rows
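
The `probability` column is derived from `rawPrediction` differently in each model. For the decision tree, the raw values look like class counts at the leaf, normalized to sum to 1. For the random forest, the numbers shown are consistent with summing per-tree class probabilities and dividing by the number of trees (20 by default). For the GBT classifier, they are consistent with a logistic transform of twice the raw score. The sketch below checks this arithmetic against the first row of each table, using only the digits visible before truncation; the helper names are illustrative:

```python
import math

def normalize(raw):
    """Scale a vector of non-negative scores so it sums to 1."""
    total = sum(raw)
    return [value / total for value in raw]

def gbt_probability(raw_class0):
    """Logistic transform of twice the raw score, as the GBT rows suggest."""
    return 1.0 / (1.0 + math.exp(-2.0 * raw_class0))

# Decision tree, first row: rawPrediction [16.0, 0.0]
print(normalize([16.0, 0.0]))             # [1.0, 0.0]

# Random forest, first row: first component of rawPrediction, 20 trees assumed
print(18.8130082383486 / 20)              # about 0.9407, matching the table

# GBT, first row: first component of rawPrediction
print(gbt_probability(1.37160769408461))  # about 0.9395, matching the table
```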



#### Evaluating the models using MulticlassClassificationEvaluator

In [85]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [86]:
# Compare predictions with the true labels and compute test accuracy
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateInd", predictionCol="prediction", metricName="accuracy")

In [87]:
dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)

In [88]:
print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('-'*80)
print('An ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 90.18%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 91.96%
--------------------------------------------------------------------------------
An ensemble using GBT had an accuracy of: 91.07%


**Converting the predictions to RDDs and evaluating them with MulticlassMetrics to print the confusion matrices**

In [89]:
from pyspark.mllib.evaluation import MulticlassMetrics

In [90]:
dtc_predictionAndLabel = dtc_predictions.select('prediction','PrivateInd').rdd
rfc_predictionAndLabel = rfc_predictions.select('prediction','PrivateInd').rdd
gbt_predictionAndLabel = gbt_predictions.select('prediction','PrivateInd').rdd

In [91]:
dtc_metrics = MulticlassMetrics(dtc_predictionAndLabel)
rfc_metrics = MulticlassMetrics(rfc_predictionAndLabel)
gbt_metrics = MulticlassMetrics(gbt_predictionAndLabel)

Printing the confusion matrix of each model:

In [92]:
print(dtc_metrics.confusionMatrix())

DenseMatrix([[143.,  15.],
             [  7.,  59.]])


In [93]:
print(rfc_metrics.confusionMatrix())

DenseMatrix([[150.,   8.],
             [ 10.,  56.]])


In [94]:
print(gbt_metrics.confusionMatrix())

DenseMatrix([[146.,  12.],
             [  8.,  58.]])


In [95]:
print("Accuracy of DecisionTreeClassifier Model:",dtc_metrics.accuracy)

Accuracy of DecisionTreeClassifier Model: 0.9017857142857143


In [96]:
print("Accuracy of RandomForestClassifier Model:",rfc_metrics.accuracy)

Accuracy of RandomForestClassifier Model: 0.9196428571428571


In [97]:
print("Accuracy of GBTClassifier Model:",gbt_metrics.accuracy)

Accuracy of GBTClassifier Model: 0.9107142857142857
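
The accuracies can be recomputed by hand from the confusion matrices, which also shows what accuracy alone hides. In `MulticlassMetrics.confusionMatrix()`, rows are true labels and columns are predicted labels in ascending label order, so row 0 is label 0.0 (`Private = Yes`). A small sketch using the decision tree matrix printed above; the helper names are illustrative:

```python
def accuracy(matrix):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall_per_class(matrix):
    """For each true label (row), the share of rows predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# Decision tree confusion matrix from the output above.
dtc_matrix = [[143, 15],
              [7, 59]]

print(accuracy(dtc_matrix))          # 202 / 224, about 0.9018
print(recall_per_class(dtc_matrix))  # about [0.905, 0.894]

# Always predicting the majority class (label 0.0) would already score:
majority_baseline = sum(dtc_matrix[0]) / sum(sum(r) for r in dtc_matrix)
print(majority_baseline)             # 158 / 224, about 0.705
```

Applied to the other two matrices, `accuracy` reproduces the 0.9196 and 0.9107 figures. The majority-class baseline of about 70% is a useful reference point when judging the roughly 90% accuracies reported here.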


In [None]:
#Closing spark session
spark.stop()