# Decision Tree Classification Models  
## Introduction  
The following Python code demonstrates basic __Decision Tree__ classification using Spark. We create a small weather data set for predicting whether someone will 'Play' tennis. The data is hard-coded as a list of lists and loaded into a dataframe. The dataframe is then mapped into an RDD of labeled point vectors. __Notice__ that since a `LabeledPoint` feature vector holds only numeric values, the categorical variables are recoded as binary indicator variables. For example, __outlook__ = `sunny`, `overcast` or `rainy` is replaced by three variables: `sunny` (1 or 0), `overcast` (1 or 0), `rainy` (1 or 0).
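
As a minimal, Spark-free sketch of the indicator encoding just described: the names `OUTLOOK_LEVELS` and `encode_row` are illustrative assumptions, not part of Spark or of the notebook code. The sketch only shows what one encoded feature list looks like.

```python
# Spark-free sketch of the indicator (one-hot) encoding.
# OUTLOOK_LEVELS and encode_row are illustrative names, not a Spark API.
OUTLOOK_LEVELS = ['sunny', 'overcast', 'rainy']

def encode_row(outlook, temp, humid, windy):
    """Return the numeric feature list for one weather observation."""
    # One 0/1 indicator per outlook level, in a fixed order
    indicators = [1 if outlook == level else 0 for level in OUTLOOK_LEVELS]
    # The windy flag arrives as the string 'TRUE' or 'FALSE'
    windy_flag = 1 if windy == 'TRUE' else 0
    return indicators + [temp, humid, windy_flag]

features = encode_row('overcast', 83, 86, 'FALSE')
print(features)
```

The resulting positions (0 = sunny, 1 = overcast, 2 = rainy, 3 = temp, 4 = humid, 5 = windy) are the `feature N` indices that appear when the trained tree is printed.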

In [1]:
# Initialize the environment
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Create the `rawdata`, loosely based on the UCI weather data set
rawdata = [
['sunny',85,85,'FALSE',0],
['sunny',80,90,'TRUE',0],
['overcast',83,86,'FALSE',1],
['rainy',70,96,'FALSE',1],
['rainy',68,80,'FALSE',1],
['rainy',65,70,'TRUE',0],
['overcast',64,65,'TRUE',1],
['sunny',72,95,'FALSE',0],
['sunny',69,70,'FALSE',1],
['rainy',75,80,'FALSE',1],
['sunny',75,70,'TRUE',1],
['overcast',72,90,'TRUE',1],
['overcast',81,75,'FALSE',1],
['rainy',71,91,'TRUE',0]
]

# Create a Data Frame from the `rawdata`
from pyspark.sql import SQLContext,Row
sqlContext = SQLContext(sc)

data_df = sqlContext.createDataFrame(rawdata,
   ['outlook','temp','humid','windy','play'])

# Transform categorical variables into indicator variables
out2index = {'sunny':[1,0,0],'overcast':[0,1,0],'rainy':[0,0,1]}

# Make an RDD of labeled vectors
def newrow(dfrow):
    outrow = list(out2index.get((dfrow[0])))  #get dictionary entry for outlook
    outrow.append(dfrow[1])   #temp
    outrow.append(dfrow[2])   #humidity
    if dfrow[3]=='TRUE':      #windy
        outrow.append(1)
    else:
        outrow.append(0)
    return (LabeledPoint(dfrow[4],outrow))

datax_rdd=data_df.rdd.map(newrow)  #map over the dataframe's underlying RDD of Rows

- Verify the __RDD__ contents and the row count.

In [2]:
datax_rdd.take(5)

[LabeledPoint(0.0, [1.0,0.0,0.0,85.0,85.0,0.0]),
 LabeledPoint(0.0, [1.0,0.0,0.0,80.0,90.0,1.0]),
 LabeledPoint(1.0, [0.0,1.0,0.0,83.0,86.0,0.0]),
 LabeledPoint(1.0, [0.0,0.0,1.0,70.0,96.0,0.0]),
 LabeledPoint(1.0, [0.0,0.0,1.0,68.0,80.0,0.0])]

In [3]:
datax_rdd.count()

14

## Building the Decision Tree Model

In [4]:
from pyspark.mllib.tree import DecisionTree
dt_model = DecisionTree.trainClassifier(datax_rdd,2,{},impurity='entropy',
          maxDepth=3,maxBins=32, minInstancesPerNode=2)  

- The `trainClassifier` function returns a decision tree model object. We can inspect the object by entering its name, and list its attributes with `dir()`.

In [5]:
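# Notes on the trainClassifier arguments used above:
#   2  is numClasses (the labels are 0 and 1)
#   {} is categoricalFeaturesInfo; empty means every feature is treated as continuous
#   maxDepth limits the tree depth; maxBins is the number of bins used to discretize features
# To do regression, omit numClasses and use DecisionTree.trainRegressor instead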
dt_model
dir(dt_model)

['__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_java_loader_class',
 '_java_model',
 '_load_java',
 '_sc',
 'call',
 'depth',
 'load',
 'numNodes',
 'predict',
 'save',
 'toDebugString']

- View the Decision Tree with `toDebugString()`

In [6]:
print(dt_model.toDebugString())

DecisionTreeModel classifier of depth 3 with 9 nodes
  If (feature 1 <= 0.0)
   If (feature 4 <= 80.0)
    If (feature 3 <= 68.0)
     Predict: 0.0
    Else (feature 3 > 68.0)
     Predict: 1.0
   Else (feature 4 > 80.0)
    If (feature 0 <= 0.0)
     Predict: 0.0
    Else (feature 0 > 0.0)
     Predict: 0.0
  Else (feature 1 > 0.0)
   Predict: 1.0



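- __Notice__ that the node count reported in the first line (9) is the number of `Predict` leaf nodes (5) plus the number of `If` split nodes (4); each `If`/`Else` pair is a single split. Next, we collect the training data and test the accuracy of the model on it.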
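
The count can be checked directly from the printed text. The sketch below pastes the tree output in as a string so that it runs without Spark; it is a simple line count over that text, not a use of any MLlib API.

```python
# The tree text printed above, pasted in as a string so this runs without Spark.
debug_string = """DecisionTreeModel classifier of depth 3 with 9 nodes
  If (feature 1 <= 0.0)
   If (feature 4 <= 80.0)
    If (feature 3 <= 68.0)
     Predict: 0.0
    Else (feature 3 > 68.0)
     Predict: 1.0
   Else (feature 4 > 80.0)
    If (feature 0 <= 0.0)
     Predict: 0.0
    Else (feature 0 > 0.0)
     Predict: 0.0
  Else (feature 1 > 0.0)
   Predict: 1.0
"""

# Skip the header line, then classify each remaining line by its first word
lines = [ln.strip() for ln in debug_string.splitlines()[1:]]
split_nodes = sum(1 for ln in lines if ln.startswith('If '))     # one per If/Else pair
leaf_nodes = sum(1 for ln in lines if ln.startswith('Predict'))  # one per leaf
print('%d split + %d leaf = %d nodes' % (split_nodes, leaf_nodes,
                                         split_nodes + leaf_nodes))
```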

In [7]:
datax_col = datax_rdd.collect()   #if datax_rdd was big, use sample or take

#redo the confusion matrix code (it would be more efficient to pass a model)
dt_cf_mat=np.zeros([2,2])  #num of classes; rows = actual label, columns = prediction
for pnt in datax_col:
    predctn = dt_model.predict(np.array(pnt.features))
    dt_cf_mat[int(pnt.label)][int(predctn)]+=1  #cast float labels to integer indices
corrcnt=0
for i in range(2):
    corrcnt+=dt_cf_mat[i][i]
dt_per_corr=corrcnt/dt_cf_mat.sum()
print('Decision Tree: Confusion Matrix and Percent Correct')
print(dt_cf_mat)
print(dt_per_corr)

Decision Tree: Confusion Matrix and Percent Correct
[[ 5.  0.]
 [ 2.  7.]]
0.857142857143
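
The matrix above can be reproduced without Spark by transcribing the printed tree by hand. The function `tree_predict` below is such a transcription (an illustrative helper, not the MLlib model), applied to the 14 training rows encoded in the same feature order.

```python
# Hand transcription of the printed tree, so the counts can be checked without Spark.
# Feature order: [sunny, overcast, rainy, temp, humid, windy]
def tree_predict(f):
    if f[1] <= 0:                          # feature 1: not overcast
        if f[4] <= 80:                     # feature 4: humidity
            return 0 if f[3] <= 68 else 1  # feature 3: temperature
        return 0                           # both sides of the feature 0 split predict 0
    return 1                               # overcast

# (features, label) pairs encoded from `rawdata`
points = [
    ([1, 0, 0, 85, 85, 0], 0), ([1, 0, 0, 80, 90, 1], 0),
    ([0, 1, 0, 83, 86, 0], 1), ([0, 0, 1, 70, 96, 0], 1),
    ([0, 0, 1, 68, 80, 0], 1), ([0, 0, 1, 65, 70, 1], 0),
    ([0, 1, 0, 64, 65, 1], 1), ([1, 0, 0, 72, 95, 0], 0),
    ([1, 0, 0, 69, 70, 0], 1), ([0, 0, 1, 75, 80, 0], 1),
    ([1, 0, 0, 75, 70, 1], 1), ([0, 1, 0, 72, 90, 1], 1),
    ([0, 1, 0, 81, 75, 0], 1), ([0, 0, 1, 71, 91, 1], 0),
]

confusion = [[0, 0], [0, 0]]   # rows = actual label, columns = prediction
for features, label in points:
    confusion[label][tree_predict(features)] += 1

accuracy = float(confusion[0][0] + confusion[1][1]) / len(points)
missed = [features for features, label in points if tree_predict(features) != label]
print(confusion)
print(accuracy)
print(missed)
```

Both misclassified rows are rainy days on which the actual label was 1 (play) but the tree predicts 0. Note that this accuracy is measured on the training data itself, so it says little about performance on unseen data.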




- The first line of the `toDebugString()` output reports the depth of the tree and its number of nodes. Observe how the node count corresponds to the `If` and `Predict` lines of the printed tree.

## Test Model Performance by introducing "Dummy" Variables  
In this scenario, we re-run the model against the same data set, but introduce a "useless" variable that is constant, in a new column after the 'play' column.

In [8]:
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

#outlook,temperature,humidity,windy,play, copied from Weka's data example
rawdata=[
['sunny',85,85,'FALSE',0,1],
['sunny',80,90,'TRUE',0,1],
['overcast',83,86,'FALSE',1,1],
['rainy',70,96,'FALSE',1,1],
['rainy',68,80,'FALSE',1,1],
['rainy',65,70,'TRUE',0,1],
['overcast',64,65,'TRUE',1,1],
['sunny',72,95,'FALSE',0,1],
['sunny',69,70,'FALSE',1,1],
['rainy',75,80,'FALSE',1,1],
['sunny',75,70,'TRUE',1,1],
['overcast',72,90,'TRUE',1,1],
['overcast',81,75,'FALSE',1,1],
['rainy',71,91,'TRUE',0,1]
]

from pyspark.sql import SQLContext,Row
sqlContext = SQLContext(sc)

data_df=sqlContext.createDataFrame(rawdata,
                                   ['outlook','temp','humid','windy','play','mydummy']) #<--add field

#transform categoricals into indicator variables
out2index={'sunny':[1,0,0],'overcast':[0,1,0],'rainy':[0,0,1]}

#make RDD of labeled vectors
def newrow(dfrow):    
    outrow = list(out2index.get((dfrow[0])))  #get dictionary entry
    outrow.append(dfrow[1])   #temp    
    outrow.append(dfrow[2])   #humidity    
    if dfrow[3]=='TRUE':      #windy        
        outrow.append(1)    
    else:        
        outrow.append(0)    
    outrow.append(dfrow[5])  # <---- add this     
    return (LabeledPoint(dfrow[4],outrow))

datax_rdd=data_df.rdd.map(newrow)  #map over the dataframe's underlying RDD of Rows

In [9]:
datax_rdd.take(5)

[LabeledPoint(0.0, [1.0,0.0,0.0,85.0,85.0,0.0,1.0]),
 LabeledPoint(0.0, [1.0,0.0,0.0,80.0,90.0,1.0,1.0]),
 LabeledPoint(1.0, [0.0,1.0,0.0,83.0,86.0,0.0,1.0]),
 LabeledPoint(1.0, [0.0,0.0,1.0,70.0,96.0,0.0,1.0]),
 LabeledPoint(1.0, [0.0,0.0,1.0,68.0,80.0,0.0,1.0])]

In [10]:
from pyspark.mllib.tree import DecisionTree
dt_model = DecisionTree.trainClassifier(datax_rdd,2,{},impurity='entropy',
          maxDepth=3,maxBins=32, minInstancesPerNode=2)  

print(dt_model.toDebugString())

datax_col=datax_rdd.collect()   #if datax_rdd was big, use sample or take

#redo the confusion matrix code (it would be more efficient to pass a model)
dt_cf_mat=np.zeros([2,2])  #num of classes; rows = actual label, columns = prediction
for pnt in datax_col:
    predctn = dt_model.predict(np.array(pnt.features))
    dt_cf_mat[int(pnt.label)][int(predctn)]+=1  #cast float labels to integer indices
corrcnt=0
for i in range(2):
    corrcnt+=dt_cf_mat[i][i]
dt_per_corr=corrcnt/dt_cf_mat.sum()
print('Decision Tree: Confusion Matrix and Percent Correct')
print(dt_cf_mat)
print(dt_per_corr)

DecisionTreeModel classifier of depth 3 with 9 nodes
  If (feature 1 <= 0.0)
   If (feature 4 <= 80.0)
    If (feature 3 <= 68.0)
     Predict: 0.0
    Else (feature 3 > 68.0)
     Predict: 1.0
   Else (feature 4 > 80.0)
    If (feature 0 <= 0.0)
     Predict: 0.0
    Else (feature 0 > 0.0)
     Predict: 0.0
  Else (feature 1 > 0.0)
   Predict: 1.0

Decision Tree: Confusion Matrix and Percent Correct
[[ 5.  0.]
 [ 2.  7.]]
0.857142857143
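
The tree and the matrix match the first run: the constant `mydummy` column (feature 6) never appears in a split. One way to see why is to compute the entropy reduction of a candidate split by hand. The sketch below is a plain-Python illustration of the entropy criterion; the helper names are assumptions, and it is not a description of how MLlib searches for splits internally.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log(p, 2)
    return result

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the rows on `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    # Weighted entropy of the non-empty sides of the split
    weighted = sum(len(side) / float(len(labels)) * entropy(side)
                   for side in (left, right) if side)
    return entropy(labels) - weighted

# Columns taken from `rawdata`, in row order
play     = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
overcast = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
mydummy  = [1] * 14

print(information_gain(overcast, play, 0))  # positive: overcast separates some rows
print(information_gain(mydummy, play, 1))   # zero: every row lands on the same side
```

Because every row has the same `mydummy` value, any threshold sends all 14 rows to one side, so the split leaves the labels exactly as mixed as before.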




## Test Model Performance on New Unseen Data    
In this scenario, we run the model against a new data point to see how it performs on unseen data.
The `newpoint` corresponds to the following:

- Outlook binary indicator variables (sunny, overcast, rainy) = 1, 0, 0 (sunny)
- Temperature = 68
- Humidity = 79
- Windy = 0
- Useless constant (dummy variable) = 1

In [11]:
# Define a new data point
newpoint = np.array([1,0,0,68,79,0,1])

# Run the new test point through the Decision Tree model
test = dt_model.predict(newpoint)
print(test)

0.0


The Decision Tree result is completely different from the result for the same data point using the __Naive Bayes__ model. To understand why, we need to examine the model further.

In [12]:
print(dt_model.toDebugString())

DecisionTreeModel classifier of depth 3 with 9 nodes
  If (feature 1 <= 0.0)
   If (feature 4 <= 80.0)
    If (feature 3 <= 68.0)
     Predict: 0.0
    Else (feature 3 > 68.0)
     Predict: 1.0
   Else (feature 4 > 80.0)
    If (feature 0 <= 0.0)
     Predict: 0.0
    Else (feature 0 > 0.0)
     Predict: 0.0
  Else (feature 1 > 0.0)
   Predict: 1.0
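
To follow how `newpoint` reaches its prediction, the printed tree can be transcribed into a small nested structure and walked step by step. The `TREE` tuple layout, `FEATURE_NAMES` and `trace` below are illustrative assumptions made for this sketch; they are not an MLlib API.

```python
# Nested-tuple transcription of the printed tree (hypothetical structure):
# a split is (feature_index, threshold, left_subtree, right_subtree); a leaf is a float.
TREE = (1, 0.0,
        (4, 80.0,
         (3, 68.0, 0.0, 1.0),
         (0, 0.0, 0.0, 0.0)),
        1.0)
FEATURE_NAMES = ['sunny', 'overcast', 'rainy', 'temp', 'humid', 'windy', 'mydummy']

def trace(node, point):
    """Walk the tree for `point`; return (prediction, list of tests taken)."""
    path = []
    while isinstance(node, tuple):
        index, threshold, left, right = node
        go_left = point[index] <= threshold
        path.append('%s = %s %s %s' % (FEATURE_NAMES[index], point[index],
                                       '<=' if go_left else '>', threshold))
        node = left if go_left else right
    return node, path

prediction, path = trace(TREE, [1, 0, 0, 68, 79, 0, 1])
for step in path:
    print(step)
print(prediction)
```

The trace shows that the point is not overcast, has humidity at most 80, and has a temperature of exactly 68, which sits on the `temp <= 68.0` threshold. In this tree, the same point with a temperature of 69 would follow the `Else` branch and be predicted as 1.0.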

