# Simple Decision Tree Machine Learning Example

This is a simple example of creating and training a decision tree model using the Spark machine learning libraries. The model is trained and evaluated, then saved so that it can be used for future predictions.

Create a Spark Context to work with:

In [1]:
import pyspark

# Create a configuration object.
conf = (
    pyspark
      .SparkConf()
      .setMaster('local[*]')
      .setAppName('Simple Decision Tree Notebook')
)

# Create a Spark context for local work, reusing one if it already exists.
try:
    sc
except NameError:
    sc = pyspark.SparkContext(conf=conf)
    
print('Running with Spark version: ',sc.version)

Running with Spark version:  1.6.1


### Use the Forest CoverType dataset

The Covtype data set is available online from: [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/). The file _covtype.data.gz_ includes the type data and _covtype.info_ includes the metadata. This data was originally provided by Colorado State University.

This dataset has also been used in a [*Kaggle*](https://www.kaggle.com/c/forest-cover-type-prediction) competition.

First let's make sure that the file is available locally. The _urlretrieve()_ method downloads the file on every call, so we explicitly check whether it is already present before fetching it.

In [2]:
import os
import urllib.request

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'
localfile = 'data/covtype.data.gz'

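# Ensure we fetch the data if we need to.
if not os.path.isfile(localfile):
    print("Downloading data into: ", localfile)
    # Create the target directory first; urlretrieve() will not create it.
    os.makedirs(os.path.dirname(localfile), exist_ok=True)
    localfile, headers = urllib.request.urlretrieve(url, localfile)
else:
    print("Data file already present at: ", localfile)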

Data file already present at:  data/covtype.data.gz


Now that we have the data locally, we can load it into an RDD. A simple text file RDD is enough since this is an ASCII CSV file, and Spark reads the gzip-compressed file directly:

In [3]:
rawData = sc.textFile(localfile)
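Before handing the file to Spark, it can help to peek at a few raw lines locally. The following is a minimal sketch using only the standard library; `peek_lines` is a hypothetical helper name, not part of the original notebook.

```python
import gzip
from itertools import islice


def peek_lines(path, n=3):
    """Return the first n lines of a gzip-compressed text file."""
    # 'rt' opens the compressed stream in text mode, so we get str lines.
    with gzip.open(path, 'rt') as handle:
        return [line.rstrip('\n') for line in islice(handle, n)]
```

Calling `peek_lines(localfile)` should show comma-separated rows, assuming the download above completed.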

Now we can transform the raw data into a sequence of LabeledPoints (from MLlib):

In [4]:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Extract the dataset features and target.
def ingest(line):
    # Simple numeric features (some are one-hot encoded).
    # Last field is the label (training target).
    fields = [float(f) for f in line.split(',')]
    features = Vectors.dense(fields[:-1])

    # Subtract 1 from the label, since the DecisionTree model
    # expects class labels starting at 0.
    label = fields[-1] - 1
    return LabeledPoint(label, features)

pointdata = rawData.map(ingest)

Now that we have our data in a format we can train a model with, let's look at a few entries to see if they are as expected:

In [5]:
pointdata.take(5)

[LabeledPoint(4.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(4.0, [2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(1.0, [2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(1.0, [2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.

We see that we have a sequence of LabeledPoint objects, each with a numeric label (every field was converted to a float). Each point also carries a DenseVector of 54 float values corresponding to the features; the 55th column of the raw file is the label.

The first 10 features are numeric.  The next two are categorical and have been encoded as one-hot codes.  The _Wilderness Areas_ feature takes 4 columns, and the _Soil Type_ takes up 40 columns.

The _Cover Type_ factor, used as the label, is a numerically encoded categorical factor. It is important to ensure that this encoded value is not treated as a number: no ordering of the categories is implied. Each value stands alone and simply indicates a category, not a relationship with any other category.
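To make the column layout concrete, here is a small sketch that splits a feature list into its numeric part and the two category indices recovered from the one-hot groups. The helper name `decode` is hypothetical, and the cover type names are listed as I understand them from _covtype.info_; check them against that file before relying on them.

```python
# Layout assumed from the description above: 10 numeric columns,
# then 4 one-hot Wilderness Area columns, then 40 one-hot Soil Type columns.
N_NUMERIC, N_WILDERNESS, N_SOIL = 10, 4, 40

# Cover type names (assumed from covtype.info), indexed by the 0-based label.
COVER_TYPES = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
               'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']


def decode(features):
    """Split 54 feature values into numeric values and category indices."""
    values = list(features)
    if len(values) != N_NUMERIC + N_WILDERNESS + N_SOIL:
        raise ValueError('expected 54 features, got %d' % len(values))
    numeric = values[:N_NUMERIC]
    wilderness = values[N_NUMERIC:N_NUMERIC + N_WILDERNESS]
    soil = values[N_NUMERIC + N_WILDERNESS:]
    # index() finds the position of the single 1.0 in each one-hot group.
    return numeric, wilderness.index(1.0), soil.index(1.0)
```

For a LabeledPoint `p`, something like `decode(p.features.toArray())` and `COVER_TYPES[int(p.label)]` should recover readable categories. The indices are identifiers only; comparing them as numbers has no meaning.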

### Partition the Data

At this point, we can partition the data into a training set, a validation set, and a testing set. We will use the training set to train the model, and the validation set to measure how our training is performing. We can tune model hyperparameters freely against the validation set, since it plays no part in the final test measurement.

We will only run predictions against the testing dataset and no information from testing will be used during training or validation.  This keeps our confidence in the test performance measurements high.

For this example we can simply split the data randomly into the 3 sets of data, with 80% used for training, and 10% used for each of validation and testing:

In [6]:
trainData, cvData, testData = pointdata.randomSplit([0.8,0.1,0.1])
trainData.cache()
cvData.cache()
testData.cache()

print(trainData.count(),cvData.count(),testData.count())

465022 57874 58116


We can see that the random split has partitioned the original dataset into segments of roughly the requested sizes (about 80%, 10%, and 10% of the 581,012 records), which we can use for our different purposes.
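The counts are close to, but not exactly, 80/10/10 because each record is assigned to a set at random. The sketch below illustrates the idea in plain Python; it is not Spark's implementation, and `random_split` is a hypothetical name used only for illustration.

```python
import random


def random_split(items, weights, seed=None):
    """Assign each item to one bucket with probability proportional to weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. roughly [0.8, 0.9, 1.0] for [0.8, 0.1, 0.1].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    buckets = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket also catches any floating-point shortfall.
            if draw < upper or i == len(bounds) - 1:
                buckets[i].append(item)
                break
    return buckets
```

Every item lands in exactly one bucket, so the bucket sizes always sum to the input size while the individual sizes vary from run to run. `randomSplit` also accepts an optional seed as its second argument, which makes the partition repeatable.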

### Train a Decision Tree Model

At this point we can go ahead and train the model. Here we are using the Spark MLlib [DecisionTree](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree) and [DecisionTreeModel](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTreeModel) for this purpose. The documentation describes these classes and the available hyperparameters.

In [7]:
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel

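# 7 cover types; an empty categoricalFeaturesInfo treats every feature as continuous.
model = DecisionTree.trainClassifier(
    trainData,
    numClasses=7,
    categoricalFeaturesInfo={},
    impurity="gini",
    maxDepth=4,
    maxBins=100,
)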
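The `"gini"` argument in the call above selects Gini impurity as the criterion for choosing splits. As a rough illustration of what that quantity measures, the sketch below computes $1 - \sum_i p_i^2$ from the class counts at a single node; `gini_impurity` is a hypothetical helper, not a Spark function.

```python
def gini_impurity(class_counts):
    """Gini impurity, 1 - sum(p_i^2), for the class counts at one tree node."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)
```

A node holding a single class has impurity 0, while a node with seven equally represented classes has impurity 6/7. The tree prefers splits that reduce this value the most.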

### Score the Validation Data

At this point we have a trained model and we would like to evaluate its performance.  We do this with the validation data.  If we find that we want to increase the performance, we can adjust hyperparameters or even select a different model type depending on the outcome from this evaluation.

In [8]:
# Score the validation data using the model.
cvPredictions = model.predict( cvData.map(lambda x: x.features))

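# Label the results.
# Note: MulticlassMetrics documents its input as (prediction, label) pairs.
# The pairs built here are (label, prediction). Overall accuracy is unaffected,
# but rows of the confusion matrix then correspond to predicted classes and
# columns to actual classes.
cvLabelsAndPredictions = cvData.map(lambda x: x.label).zip(cvPredictions)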

### Evaluate the Validation Results

The MLlib [MulticlassMetrics](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.evaluation.MulticlassMetrics) object gives us a simple way to evaluate the prediction performance of the model. It takes the labelled set of results formed in the previous step.

In [9]:
# Create a metrics object to evaluate the results.
from pyspark.mllib.evaluation import MulticlassMetrics
cvMetrics = MulticlassMetrics(cvLabelsAndPredictions)

And now we can extract some performance results from the metrics object:

In [10]:
print('                 precision: ',cvMetrics.precision(),'\n',
      '                   recall: ',cvMetrics.recall(),'\n',
      '                 fMeasure: ',cvMetrics.fMeasure(),'\n',
      '        weightedPrecision: ',cvMetrics.weightedPrecision,'\n',
      '           weightedRecall: ',cvMetrics.weightedRecall,'\n',
      'weightedFalsePositiveRate: ',cvMetrics.weightedFalsePositiveRate,'\n',
      ' weightedTruePositiveRate: ',cvMetrics.weightedTruePositiveRate,'\n',
      '\nConfusion Matrix:'
     )

import numpy as np
print(cvMetrics.confusionMatrix().toArray().astype(int))

                 precision:  0.6938003248436259 
                    recall:  0.6938003248436259 
                  fMeasure:  0.6938003248436259 
         weightedPrecision:  0.7403884816611251 
            weightedRecall:  0.6938003248436259 
 weightedFalsePositiveRate:  0.18594948465009822 
  weightedTruePositiveRate:  0.6938003248436259 
 
Confusion Matrix:
[[14174  5403     0     0  1132]
 [ 6606 22244   728   904    40]
 [    1   461  2898    28     0]
 [    0     3     0     8     0]
 [  343    35     0     0   829]]
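The three identical values for precision, recall, and fMeasure are consistent with the argument-less calls reporting one overall figure, the fraction of correct predictions. To show how such figures relate to a confusion matrix, here is a minimal plain-Python sketch. The `summarize` helper and the 3-class matrix are hypothetical and not taken from the results above, and the sketch assumes rows are actual classes and columns are predicted classes.

```python
def summarize(confusion):
    """Compute accuracy and per-class precision/recall from a square matrix.

    Assumes rows are actual classes and columns are predicted classes.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    precision, recall = [], []
    for i in range(n):
        predicted_i = sum(confusion[r][i] for r in range(n))  # column sum
        actual_i = sum(confusion[i])                          # row sum
        precision.append(confusion[i][i] / predicted_i if predicted_i else 0.0)
        recall.append(confusion[i][i] / actual_i if actual_i else 0.0)
    return correct / total, precision, recall


# Hypothetical 3-class matrix used only for illustration.
example = [[8, 2, 0],
           [1, 6, 3],
           [0, 2, 8]]
accuracy, precision, recall = summarize(example)
```

Here 22 of the 30 samples sit on the diagonal, so the accuracy is 22/30. Precision for a class divides its diagonal entry by the column sum, and recall divides it by the row sum.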


### Score and Evaluate the Test Data Results

Once we are satisfied with the model's predictive performance (hint: the performance above is not very satisfying!), we can evaluate the model on the test data. We expect slightly lower performance there than on the validation data, since some workflows iterate over the validation data more than once and so tune the model toward it.

In [11]:
# Score and label the test data using the model.
testPredictions = model.predict( testData.map(lambda x: x.features))
testLabelsAndPredictions = testData.map(lambda x: x.label).zip(testPredictions)

# Create a metrics object for the test results.
testMetrics = MulticlassMetrics(testLabelsAndPredictions)


In [12]:
print('                 precision: ',testMetrics.precision(),'\n',
      '                   recall: ',testMetrics.recall(),'\n',
      '                 fMeasure: ',testMetrics.fMeasure(),'\n',
      '        weightedPrecision: ',testMetrics.weightedPrecision,'\n',
      '           weightedRecall: ',testMetrics.weightedRecall,'\n',
      'weightedFalsePositiveRate: ',testMetrics.weightedFalsePositiveRate,'\n',
      ' weightedTruePositiveRate: ',testMetrics.weightedTruePositiveRate,'\n',
      '\nConfusion Matrix:'
     )

print(testMetrics.confusionMatrix().toArray().astype(int))

                 precision:  0.6945935714777342 
                    recall:  0.6945935714777342 
                  fMeasure:  0.6945935714777342 
         weightedPrecision:  0.7411035884452303 
            weightedRecall:  0.694593571477734 
 weightedFalsePositiveRate:  0.18594438180801656 
  weightedTruePositiveRate:  0.694593571477734 
 
Confusion Matrix:
[[14095  5353     0     0  1128]
 [ 6697 22519   733   949    35]
 [    1   465  2856    20     0]
 [    0     4     0    12     0]
 [  334    30     0     0   885]]


### Save the Model

Since we might want to use the model to perform scoring (prediction) on data at a later time, we can keep the model around.  This is especially important if we have invested much effort and time to develop and train the model to optimize its performance.

This is easily done by first storing the model. A model is stored within a specified directory that contains its data and metadata. The model data is stored in a series of [Parquet](https://parquet.apache.org/documentation/latest/) files for efficiency.

In [13]:
# Location where the model will be stored.
modelLocation = 'tree-model'

# Ensure that there is no model currently in the location.
# To keep several models, save each one to a different location.
import shutil
shutil.rmtree(modelLocation,ignore_errors=True)

# Actually save the model.
model.save(sc,modelLocation)

### Load the Stored Model

Now we can read in the stored model and use it to perform additional predictions.

In [14]:
sameModel = DecisionTreeModel.load(sc,modelLocation)

# Draw a 20% sample without replacement, using a fixed seed.
otherData = pointdata.sample(False, 0.2, 2112)
otherData.cache()
print('Samples to predict with previously stored model: ',otherData.count())

Samples to predict with previously stored model:  116409


In [15]:
# Predict with the model that was loaded back from storage.
otherPredictions = sameModel.predict(otherData.map(lambda x: x.features))
otherLabelsAndPredictions = otherData.map(lambda x: x.label).zip(otherPredictions)
otherMetrics = MulticlassMetrics(otherLabelsAndPredictions)

In [16]:
print('                 precision: ',otherMetrics.precision(),'\n',
      '                   recall: ',otherMetrics.recall(),'\n',
      '                 fMeasure: ',otherMetrics.fMeasure(),'\n',
      '        weightedPrecision: ',otherMetrics.weightedPrecision,'\n',
      '           weightedRecall: ',otherMetrics.weightedRecall,'\n',
      'weightedFalsePositiveRate: ',otherMetrics.weightedFalsePositiveRate,'\n',
      ' weightedTruePositiveRate: ',otherMetrics.weightedTruePositiveRate,'\n',
      '\nConfusion Matrix:'
     )

print(otherMetrics.confusionMatrix().toArray().astype(int))

                 precision:  0.6932625484283861 
                    recall:  0.6932625484283861 
                  fMeasure:  0.6932625484283861 
         weightedPrecision:  0.7399464558364611 
            weightedRecall:  0.6932625484283862 
 weightedFalsePositiveRate:  0.1860846008923653 
  weightedTruePositiveRate:  0.6932625484283862 
 
Confusion Matrix:
[[28393 10769     0     0  2270]
 [13434 44836  1474  1848    69]
 [    2   930  5683    57     0]
 [    0    12     0    25     0]
 [  669    75     0     0  1765]]
