## Random Forest for Star-Galaxy Classification


The following code allows us to use PySpark in the IPython Notebook.

In [1]:
import os
import sys

# Set SPARK_HOME to the root of your Spark installation
# (adjust the path for your machine)
os.environ['SPARK_HOME']="/Users/blorangest/Desktop/spark-1.3.1-bin-hadoop2.6"

# Append to sys.path so that pyspark and its py4j dependency can be found
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.8.2.1-src.zip'))

# Now we are ready to import Spark Modules
try:
    from pyspark.mllib.tree import RandomForest
    from pyspark.mllib.util import MLUtils
    from pyspark.mllib.regression import LabeledPoint
    from pyspark import SparkContext

except ImportError as e:
    print("Error importing Spark Modules: %s" % e)
    sys.exit(1)
import numpy as np

Now we set the variables that determine the input file and the properties of the random forest.
test_size is the fraction of the data held out to test the model (0.2 means 20%).
num_trees is the number of trees in the forest.
max_depth is the maximum depth of each tree. MLlib requires it to be no more than 30.

In [2]:
dataFile = "cfhtlens_matched.csv"
test_size = 0.2
num_trees = 6
max_depth = 10

This function adds colors to the feature data. Each color is the difference between two adjacent magnitudes, and the colors are appended to the list of magnitudes in place.

In [3]:
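def addColors(features):
    # range() is evaluated once, so appending inside the loop does not extend it
    for i in range(len(features) - 1):
        # color = difference of adjacent magnitudes
        features.append(features[i + 1] - features[i])
    return features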
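
To make the effect concrete, here is a minimal standalone sketch. The helper name add_colors_copy and the magnitude values are made up for illustration. Unlike addColors, it returns a new list instead of modifying its argument.

```python
def add_colors_copy(magnitudes):
    """Return the magnitudes followed by the differences of adjacent magnitudes."""
    colors = [magnitudes[i + 1] - magnitudes[i] for i in range(len(magnitudes) - 1)]
    return list(magnitudes) + colors

# Five hypothetical magnitudes (not taken from the dataset)
mags = [24.0, 23.0, 22.5, 22.0, 21.75]
features = add_colors_copy(mags)
print(features)
# [24.0, 23.0, 22.5, 22.0, 21.75, -1.0, -0.5, -0.5, -0.25]
```

Five magnitudes yield four colors, so each object ends up with nine features.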

This function preprocesses one line of the file. It keeps only the columns we need and puts the relevant data in a LabeledPoint object. Note that, in this particular dataset, one of the columns holds comma-separated values enclosed in quotes that all belong under a single heading. After the line is split on the quote character, this column is quotes[1].

In [4]:
def parse(line):
    # heads (column name -> column index) is a global built from the header row
    # Split on the quote character: text before, inside, and after the quoted column
    quotes = line.split('"')
    # Drop the empty strings left by the commas next to the quotes
    row = quotes[0].split(',')[:-1] + [quotes[1]] + quotes[2].split(',')[1:]
    label = float(row[heads['true_class']])
    want = ['MAG_u', 'MAG_g', 'MAG_r', 'MAG_i', 'MAG_z']
    want_index = [heads[w] for w in want]
    # Collect the wanted magnitudes in the order the columns appear in the file
    features = [float(row[i]) for i in range(len(row)) if i in want_index]
    features = addColors(features)
    return LabeledPoint(label, features)
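
The quote-splitting step can be checked in isolation. The sketch below uses a made-up line (not a row from the dataset) and compares the result with Python's standard csv module, which is a different technique from the one used in parse. It assumes, as parse does, that each line has exactly one quoted column.

```python
import csv

# Hypothetical line: the third column is quoted and contains commas
line = '17,23.1,"a,b,c",0,21.4'

quotes = line.split('"')
# quotes is now ['17,23.1,', 'a,b,c', ',0,21.4']
row = quotes[0].split(',')[:-1] + [quotes[1]] + quotes[2].split(',')[1:]

# Cross-check against the standard library CSV parser
row_csv = next(csv.reader([line]))
print(row)
print(row == row_csv)
```

Both approaches give five columns, with the quoted values kept together as a single entry.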

Now we initialize the SparkContext and load the raw data into an RDD.

In [5]:
sc = SparkContext(appName="stargalaxy")
rawData = sc.textFile(dataFile)  # RDD of text lines, 66389 elements

Now we remove the header row and use the parse function above to map the data into a new RDD that will be usable by the random forest.

In [6]:
header = rawData.first()
lines = rawData.filter(lambda x: x != header)  # now the header is gone
header_split = str(header).split(',')
# Map each column name to its index
heads = {}
for i, name in enumerate(header_split):
    heads[name] = i
data = lines.map(parse).cache() # RDD of LabeledPoints
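
As a small illustration of how the name-to-index mapping is used, here is a sketch with a made-up header and row. The real file has more columns, and the column named id is hypothetical.

```python
# Hypothetical header and row, much shorter than the real ones
header = 'id,MAG_u,MAG_g,true_class'
heads = {name: i for i, name in enumerate(header.split(','))}

row = ['42', '24.0', '23.0', '1']
# Look up a column by name instead of by a hard-coded position
label = float(row[heads['true_class']])
mag_g = float(row[heads['MAG_g']])
print(heads)
print(label, mag_g)
```

Looking columns up by name keeps the parsing code independent of the column order in the file.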

Split the data into training and testing sets.

In [8]:
(trainingData, testData) = data.randomSplit([1-test_size, test_size])
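
Note that randomSplit assigns each element to a split at random, so the two sets have only approximately the requested sizes. It also accepts a seed argument if a reproducible split is needed. The following pure-Python sketch illustrates the idea. It is not Spark's implementation, and the function name random_split is made up for this example.

```python
import random

def random_split(items, weights, seed):
    """Assign each item independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    splits = [[] for _ in weights]
    for item in items:
        u = rng.random() * total
        cumulative = 0.0
        for k, w in enumerate(weights):
            cumulative += w
            if u < cumulative:
                splits[k].append(item)
                break
        else:
            # Guard against floating-point round-off in the cumulative sum
            splits[-1].append(item)
    return splits

train, test = random_split(range(1000), [0.8, 0.2], seed=0)
print(len(train), len(test))
```

The two lengths always add up to 1000, but the test set holds roughly, not exactly, 200 items.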

Train the random forest.

In [9]:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=num_trees, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=max_depth, maxBins=32)
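
The impurity='gini' argument selects the Gini impurity as the criterion for choosing splits. As a reminder of what it measures, here is a small pure-Python sketch. The function name gini is ours and is not part of MLlib.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum over classes of the squared class fraction."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 1, 1]))  # evenly mixed node: 0.5
print(gini([1, 1, 1, 1]))  # pure node: 0.0
```

A split is good when it sends the data into child nodes whose impurity is lower than that of the parent.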

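Test the random forest model and print some metrics to evaluate its performance. Purity is the fraction of objects predicted as class 0 that truly belong to class 0, completeness is the fraction of true class 0 objects that are predicted as class 0, and 1 - Accuracy is the overall misclassification rate.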

In [10]:
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
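# Each element is a (label, prediction) pair: vp[0] is the label, vp[1] the prediction
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
# Mg: true 0 predicted 1; Ng: true 0 predicted 0; Ms: true 1 predicted 0
Mg = float(labelsAndPredictions.filter(lambda vp: vp[0] == 0 and vp[1] == 1).count())
Ng = float(labelsAndPredictions.filter(lambda vp: vp[0] == 0 and vp[1] == 0).count())
Ms = float(labelsAndPredictions.filter(lambda vp: vp[0] == 1 and vp[1] == 0).count())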
print ('Purity = ' + str(Ng / (Ng+Ms)))
print ('Completeness = ' + str(Ng / (Ng+Mg)))
print ('1 - Accuracy = ' + str(testErr))

Purity = 0.971548188653
Completeness = 0.987837720441
1 - Accuracy = 0.0356927256263
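
The same three metrics can be worked through by hand on a tiny example. This sketch uses plain Python lists and made-up labels and predictions, not results from the model. The function name class0_metrics is hypothetical.

```python
def class0_metrics(pairs):
    """pairs: list of (label, prediction). Returns (purity, completeness, error rate) for class 0."""
    n00 = sum(1 for v, p in pairs if v == 0 and p == 0)  # Ng in the cell above
    n01 = sum(1 for v, p in pairs if v == 0 and p == 1)  # Mg
    n10 = sum(1 for v, p in pairs if v == 1 and p == 0)  # Ms
    wrong = sum(1 for v, p in pairs if v != p)
    purity = n00 / float(n00 + n10)
    completeness = n00 / float(n00 + n01)
    return purity, completeness, wrong / float(len(pairs))

# Hypothetical (label, prediction) pairs
pairs = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
purity, completeness, error = class0_metrics(pairs)
print(purity, completeness, error)
```

Here three of the five objects predicted as class 0 are correct (purity 3/5), three of the four true class 0 objects are recovered (completeness 3/4), and three of the eight predictions are wrong (error 3/8).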


Then we can print a representation of each tree in the forest. Features 0 to 4 are the magnitudes and features 5 to 8 are the colors appended by addColors.

In [11]:
print(model.toDebugString())

TreeEnsembleModel classifier with 6 trees

  Tree 0:
    If (feature 5 <= -0.5125999999999991)
     If (feature 4 <= 20.6658)
      If (feature 8 <= -5.171399999999998)
       If (feature 6 <= 1.3135000000000012)
        If (feature 5 <= -2.5139999999999993)
         If (feature 3 <= 23.6463)
          If (feature 4 <= 17.4562)
           Predict: 1.0
          Else (feature 4 > 17.4562)
           If (feature 3 <= 23.5447)
            If (feature 7 <= 3.9162999999999997)
             Predict: 1.0
            Else (feature 7 > 3.9162999999999997)
             Predict: 0.0
           Else (feature 3 > 23.5447)
            If (feature 1 <= 18.2017)
             Predict: 0.0
            Else (feature 1 > 18.2017)
             If (feature 2 <= 16.992)
              Predict: 1.0
             Else (feature 2 > 16.992)
              Predict: 1.0
         Else (feature 3 > 23.6463)
          If (feature 4 <= 19.7566)
           If (feature 0 <= 23.2874)
            If (feature 2 <= 21.4621)
  