### Step 1: Load Predictions
#### Execute the first cell to load the classes used in this activity

In [1]:
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

#### Run the second cell to load the predictions CSV file created at the end of Week 3 Hands-On Classification in Spark into a DataFrame:

In [2]:
sqlContext = SQLContext(sc)
predictions = sqlContext.read.load('file:///home/cloudera/Downloads/big-data-4/predictions.csv', 
                          format='com.databricks.spark.csv', 
                          header='true',inferSchema='true')

### Step 2: Compute accuracy
#### Let's create an instance of MulticlassClassificationEvaluator to determine the accuracy of the predictions:

In [3]:
evaluator = MulticlassClassificationEvaluator(
   labelCol='label', predictionCol='prediction', metricName='precision')

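#### The first two arguments specify the names of the label and prediction columns, and the third argument specifies that we want the overall precision, which for multiclass classification is the same as accuracy (in Spark 2.0 and later this metric is named 'accuracy').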

#### We can compute the accuracy by calling evaluate():

In [4]:
accuracy = evaluator.evaluate(predictions)
print('Accuracy = %g ' % (accuracy))

Accuracy = 0.809524 
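
#### As a rough sketch of what this number means, accuracy is the fraction of rows whose prediction equals the label. The pairs below are hypothetical sample values, not rows from predictions.csv, and overall_accuracy is an illustrative helper, not a Spark function:

```python
# Hypothetical (prediction, label) pairs; NOT taken from predictions.csv.
pairs = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0), (1.0, 0.0)]

def overall_accuracy(pairs):
    """Return the fraction of pairs whose prediction equals the label."""
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / float(len(pairs))

sample_accuracy = overall_accuracy(pairs)
print('Accuracy = %g' % sample_accuracy)
```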


### Step 3: Display confusion matrix
#### The MulticlassMetrics class can be used to generate a confusion matrix of our classifier model. 
#### However, unlike MulticlassClassificationEvaluator, MulticlassMetrics works with RDDs of numbers and not DataFrames, so we need to convert our predictions DataFrame into an RDD.

#### If we use the rdd attribute of predictions, we see this is an RDD of Rows:

In [5]:
predictions.rdd.take(2)

[Row(prediction=1.0, label=1.0), Row(prediction=1.0, label=1.0)]

#### Instead, we can map each Row to a tuple to get an RDD of (prediction, label) number pairs:

In [8]:
predictions.rdd.map(tuple).take(2)

[(1.0, 1.0), (1.0, 1.0)]
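
#### This works because a Row behaves like a tuple of its column values, so tuple() keeps the values in column order. A minimal sketch without Spark, assuming a namedtuple as a stand-in for Row:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (an assumption so the sketch runs without Spark).
Row = namedtuple('Row', ['prediction', 'label'])

rows = [Row(prediction=1.0, label=1.0), Row(prediction=1.0, label=1.0)]

# tuple() drops the field names and keeps the values in column order.
pairs = list(map(tuple, rows))
print(pairs)
```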

#### Let's create an instance of MulticlassMetrics with this RDD:

In [10]:
metrics = MulticlassMetrics(predictions.rdd.map(tuple))

#### NOTE: the above command can take longer to execute than most Spark commands when first run in the notebook.

#### The confusionMatrix() method returns a Spark Matrix, which we can convert to a Python NumPy array and transpose to view:

In [11]:
metrics.confusionMatrix().toArray().transpose()

array([[ 87.,  26.],
       [ 14.,  83.]])
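
#### As a quick consistency check, the diagonal of the confusion matrix holds the correctly classified counts, so dividing the diagonal sum by the total should reproduce the accuracy from Step 2. A sketch in plain Python, with the values copied from the output above (the variable names are illustrative):

```python
# Values copied from the confusion matrix output above.
confusion = [[87.0, 26.0],
             [14.0, 83.0]]

# Correct predictions lie on the diagonal; transposing does not change it.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

accuracy_from_matrix = correct / total
print('Accuracy = %g' % accuracy_from_matrix)
```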