# Tree Methods

Scenario:

We have a dataset describing a group of US universities. Each row holds the features of a university together with a label, Private or Public. The aim of the project is to build a machine learning model that classifies a university as Private or Public from its features. To do that I will use three tree-based classifiers: DecisionTreeClassifier, GBTClassifier, and RandomForestClassifier.

The dataset:

Let's start!

In [7]:
# Basic imports
from pyspark.sql import SparkSession

In [8]:
# Creation of the spark session
spark = SparkSession.builder.appName('treeMethods').getOrCreate()

In [9]:
# Reading the data
data = spark.read.csv('College.csv', inferSchema= True, header=True)

In [12]:
# Basic exploration
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [6]:
data.head(1)

[Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)]

# Data preparation

After a look at the data, let's assemble the features into a dense vector.

In [13]:
# Import of the vector assembler method
from pyspark.ml.feature import VectorAssembler

In [14]:
# Instancing the object
# All the input columns are the features I will use to build the model
assembler = VectorAssembler(inputCols=['Apps','Accept','Enroll','Top10perc','Top25perc','F_Undergrad','P_Undergrad','Outstate',
                                       'Room_Board','Books','Personal','PhD','Terminal','S_F_Ratio','perc_alumni','Expend','Grad_Rate'], 
                            outputCol='features') 
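
Typing all 17 column names by hand is error-prone. Here is a minimal sketch of building the same list programmatically, assuming the column names printed by `data.printSchema()` above. The names `all_columns`, `excluded`, and `feature_cols` are illustrative and not part of the original notebook.

```python
# Column names as printed by data.printSchema(); in the notebook this
# list would come from data.columns instead of being typed out.
all_columns = [
    'School', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
    'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
    'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni',
    'Expend', 'Grad_Rate',
]

# 'School' is an identifier and 'Private' is the target, so neither is a feature.
excluded = {'School', 'Private'}

# Keep every remaining column, preserving the original order.
feature_cols = [c for c in all_columns if c not in excluded]
print(len(feature_cols), feature_cols[0], feature_cols[-1])
```

The resulting list could then be passed as `inputCols=feature_cols` to `VectorAssembler`.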


In [15]:
# Here I transform the dataframe using the structure of the assembler
output = assembler.transform(data)

In [18]:
# Checking if the "features" column has been created
output.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)



As you can see, the target variable, the "Private" column, is a string, and our classifiers can't work on strings. To solve this problem we have to convert the strings into numeric indices.

In [19]:
# Import StringIndexer in order to convert the strings
from pyspark.ml.feature import StringIndexer

In [20]:
# Instancing the object
# Here I pass as input the column I want to convert
# The output column will carry the conversion
indexer = StringIndexer(inputCol='Private', outputCol='PrivateIndex')

# Converting the dataframe
output_fixed = indexer.fit(output).transform(output)

In [21]:
# Checking the conversion
output_fixed.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = true)



Nice, we can see that the "PrivateIndex" column has been created and that its values are doubles.
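
By default `StringIndexer` orders labels by descending frequency, so the most common label gets index 0.0. Here is a pure-Python sketch of that rule, using made-up labels rather than the real column. The helper `index_labels` is hypothetical and not a Spark API.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically here,
    # an assumption made only to keep this sketch deterministic.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Made-up sample, not taken from College.csv
sample = ['Yes', 'Yes', 'No', 'Yes', 'No']
mapping = index_labels(sample)
print(mapping)
```

Which of 'Yes' and 'No' maps to 0.0 in the real data depends on their counts in `College.csv`. The fitted indexer exposes the labels in index order through `indexer.fit(output).labels`.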

# Creation of the models

Now our dataframe is ready to use. The next steps are to reduce it to the columns we need and to split it into a training set and a test set.

In [22]:
# Here I take only the columns I need for the models
final_data = output_fixed.select('features', 'PrivateIndex')

In [23]:
# Creation of training set and test set
training_data, test_data = final_data.randomSplit([0.7, 0.3])
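
`randomSplit` assigns rows at random, so the split sizes are only approximately 70/30 and change between runs unless a seed is passed. A rough pure-Python sketch of the idea follows. It is an illustration, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using a seeded uniform draw."""
    total = sum(weights)
    # Cumulative upper bounds of each split on the [0, 1) interval
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        # The row goes to the first split whose upper bound exceeds the draw
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits

train, test = random_split(range(1000), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

In the notebook, passing a seed, for example `final_data.randomSplit([0.7, 0.3], seed=42)` (the value 42 is arbitrary), would make the metrics in the evaluation section reproducible.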

In [24]:
# Import of the models
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier)
# The Pipeline import is left commented out because it is not used in this notebook
#from pyspark.ml import Pipeline

In [26]:
# Instancing the models
# 1) Decision Tree Classifier
dtc = DecisionTreeClassifier(labelCol='PrivateIndex', featuresCol='features')

# 2) RandomForestClassifier
# In this case I decide the number of trees I want to build
rfc = RandomForestClassifier(labelCol='PrivateIndex', featuresCol='features', numTrees=150)

# 3) GBTClassifier
gbc = GBTClassifier(labelCol='PrivateIndex', featuresCol='features')
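
A random forest trains many decision trees on random subsets of the rows and features and then combines their outputs; `numTrees` sets how many trees are built. One common way to combine them is to average the per-tree class probabilities and pick the largest. Here is a toy sketch with made-up numbers. The helpers are hypothetical and this is not Spark's code.

```python
def average_probabilities(per_tree_probs):
    """Average the class-probability vectors produced by each tree."""
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    return [sum(p[k] for p in per_tree_probs) / n_trees
            for k in range(n_classes)]

def predict(per_tree_probs):
    """Return the index of the class with the highest averaged probability."""
    avg = average_probabilities(per_tree_probs)
    return float(max(range(len(avg)), key=lambda k: avg[k]))

# Made-up outputs of three trees for one row: [P(class 0), P(class 1)]
tree_outputs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
print(average_probabilities(tree_outputs), predict(tree_outputs))
```

Averaging over more trees tends to smooth out the errors of individual trees, at the cost of longer training time.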

In [27]:
# Creation of the models on the training data
dtc_model = dtc.fit(training_data)
rfc_model = rfc.fit(training_data)
gbc_model = gbc.fit(training_data)

In [28]:
# Testing the models on the test set
dtc_predict = dtc_model.transform(test_data)
rfc_predict = rfc_model.transform(test_data)
gbc_predict = gbc_model.transform(test_data)

In [29]:
# Let's have a look at the gradient-boosted tree predictions first
gbc_predict.show()

+--------------------+------------+--------------------+--------------------+----------+
|            features|PrivateIndex|       rawPrediction|         probability|prediction|
+--------------------+------------+--------------------+--------------------+----------+
|[81.0,72.0,51.0,3...|         0.0|[1.54404859124739...|[0.95639908666787...|       0.0|
|[100.0,90.0,35.0,...|         0.0|[1.52190005518274...|[0.95451410360351...|       0.0|
|[141.0,118.0,55.0...|         0.0|[1.30168608465055...|[0.93107829166289...|       0.0|
|[150.0,130.0,88.0...|         0.0|[1.54404859124739...|[0.95639908666787...|       0.0|
|[191.0,165.0,63.0...|         0.0|[1.52972462825513...|[0.95518872926387...|       0.0|
|[202.0,184.0,122....|         0.0|[1.54404859124739...|[0.95639908666787...|       0.0|
|[212.0,197.0,91.0...|         0.0|[1.54404859124739...|[0.95639908666787...|       0.0|
|[213.0,166.0,85.0...|         0.0|[1.54404859124739...|[0.95639908666787...|       0.0|
|[235.0,217.0,121....
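
The `rawPrediction` and `probability` columns are related. In the first row above, a raw value starting with 1.544 goes with a probability starting with 0.956. The sketch below assumes a logistic link applied to twice the raw margin, an assumption inferred from the printed numbers; `margin_to_probability` is a hypothetical helper, not a Spark function.

```python
import math

def margin_to_probability(margin):
    """Turn a raw margin into a probability with a logistic link on 2 * margin."""
    return 1.0 / (1.0 + math.exp(-2.0 * margin))

# Leading digits of the first rawPrediction entry shown in the table above
raw = 1.54404859124739
print(margin_to_probability(raw))
```

The printed value is about 0.9564, in line with the first `probability` entry in the table. A margin of 0 maps to 0.5, the point where the model is undecided between the two classes.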

# Evaluations

In the following section I will use two evaluation methods:
1) The BinaryClassificationEvaluator
2) The MulticlassClassificationEvaluator

In [31]:
# Import
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [32]:
# Instancing the object for the classification
# Here I pay attention to pass the right label column for the evaluator to work on
# The metric we use will be the area under the ROC curve
myBinaryEval = BinaryClassificationEvaluator(labelCol='PrivateIndex', metricName='areaUnderROC')

The area under the ROC curve ranges from 0 to 1. A value of 0.5 corresponds to random guessing, so an acceptable model must score above 0.5.
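
One way to read the area under the ROC curve is as the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row. Here is a small pure-Python sketch of that pairwise definition on made-up data. The evaluator itself computes the area from the ROC curve, so this is an equivalent illustration rather than its implementation, and `auc` is a hypothetical helper.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

A model that scores every row identically gets 0.5 under this definition, which is why 0.5 is the baseline to beat.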

In [34]:
# Note: despite the "accuracy" label, the value printed here is the area under the ROC curve
print ('DTC_accuracy = ', myBinaryEval.evaluate(dtc_predict))

DTC_accuracy =  0.9406991260923845


In [35]:
# Judging by the value, the RandomForestClassifier seems to work better
print ('RFC_accuracy = ', myBinaryEval.evaluate(rfc_predict))

RFC_accuracy =  0.970869746150646


In [36]:
print ('GBC_accuracy = ', myBinaryEval.evaluate(gbc_predict))

GBC_accuracy =  0.9508426966292134


Wow! All the models score high, close to 1.

In [50]:
dtc_predict.printSchema()

root
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [51]:
rfc_predict.printSchema()

root
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [52]:
gbc_predict.printSchema()

root
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [43]:
# In this case I use the module that lets me work on multiclass variables
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [44]:
acc_eval = MulticlassClassificationEvaluator(labelCol='PrivateIndex', metricName='accuracy')
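
With `metricName='accuracy'` the evaluator reports the fraction of rows whose `prediction` equals the label. A minimal pure-Python sketch on made-up values follows; `accuracy` is a hypothetical helper, not the evaluator's code.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted class equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Made-up labels and predictions: 4 of the 5 rows match
labels = [0.0, 0.0, 1.0, 1.0, 0.0]
predictions = [0.0, 1.0, 1.0, 1.0, 0.0]
print(accuracy(labels, predictions))  # 0.8
```

Accuracy looks only at the final predicted class, while the area under the ROC curve looks at how the scores rank the rows, so the two metrics can differ for the same model.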

In [45]:
# Evaluating the accuracy of the RandomForestClassifier
rfc_accuracy = acc_eval.evaluate(rfc_predict)
rfc_accuracy

0.9353448275862069

In [46]:
# Evaluating the accuracy of the DecisionTreeClassifier
dtc_accuracy = acc_eval.evaluate(dtc_predict)
dtc_accuracy

0.9094827586206896

In [47]:
# Evaluating the accuracy of the GBTClassifier
gbc_accuracy = acc_eval.evaluate(gbc_predict)
gbc_accuracy

0.9137931034482759

OK, even with the MulticlassClassificationEvaluator we get nice results! Now our models are ready to be used on new unlabeled data.

# Thanks for your attention!