## CIS5560: PySpark Decision Tree Classifier in Databricks

### by Team 4 (Uche, Raymond, Tofunmi and Sweta) edited on 05/15/2020
Tested in Runtime 6.5 (Spark 2.4.5/2.4.0 Scala 2.11) of Databricks CE

## Prepare the Data
First, import the libraries you will need and prepare the training and test data:

In [3]:
import re

# Spark session and SQL helpers
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, TimestampType, DoubleType)

# ML pipeline, features, classifiers, tuning and evaluation
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.ml.feature import VectorAssembler, StringIndexer, VectorIndexer, MinMaxScaler
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
from pyspark.ml.evaluation import (BinaryClassificationEvaluator, RegressionEvaluator,
                                   MulticlassClassificationEvaluator)

In [4]:
%fs ls /FileStore/tables/df_ml.csv

path,name,size
dbfs:/FileStore/tables/df_ml.csv,df_ml.csv,77730282


## Create a DataFrame Schema
The schema should match the table schema.

In [6]:
# DataFrame schema, matching the table schema (Team 4)
df_mlSchema = StructType([
  StructField("user_id", IntegerType(), False),
  StructField("text", StringType(), False),
  StructField("date", TimestampType(), False),
  StructField("review_id", IntegerType(), False),
  StructField("business_id", IntegerType(), False),
  StructField("funny", IntegerType(), False),
  StructField("cool", IntegerType(), False),
  StructField("useful", IntegerType(), False),
  StructField("stars", IntegerType(), False),
])

In [7]:
IS_SPARK_SUBMIT_CLI = False
if IS_SPARK_SUBMIT_CLI:
    sc = SparkContext.getOrCreate()
    spark = SparkSession(sc)

## Load Dataset

Make sure the flag in the cell above is set to IS_SPARK_SUBMIT_CLI = False when running in Databricks. Remember to set it to True before exporting the notebook.

Read the CSV file from DBFS (Databricks File System).

## Follow the directions to read your table after uploading it under Data in the left frame
NOTE: See the schema above for the data types.

After the df_ml.csv file has been added under Data in the left frame, create a table through the UI using "Upload File".
Tick the header and infer schema options before creating the table.

In [10]:
if IS_SPARK_SUBMIT_CLI:
    df_ml = spark.read.csv('df_ml.csv', inferSchema=True, header=True)
else:
    df_ml = spark.sql("SELECT * FROM scaled_subset_csv")

In [11]:
df_ml.show(5)

## Create a New DataFrame with columns "user_id", "review_id", "business_id" and "stars" (label)
The label is derived from stars: 1 (positive review) if stars > 2, otherwise 0 (negative review).

These are the columns we used to build the Decision Tree Classifier model.

In [13]:
data = df_ml.select( "user_id", "review_id", "business_id", ((col("stars") > 2).cast("Double").alias("label")))

data.show(5)
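The labeling rule can be checked on plain values. The sketch below mirrors `(col("stars") > 2).cast("Double")` in plain Python; the helper name `to_label` is hypothetical and not part of the notebook.

```python
def to_label(stars):
    """Return 1.0 for a positive review (stars > 2), else 0.0."""
    return float(stars > 2)

# One label per possible star rating, 1 through 5
labels = {stars: to_label(stars) for stars in range(1, 6)}
print(labels)
```

Under this rule a 3-star review counts as positive, which affects how balanced the two classes are.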

In [14]:
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")
train_rows = train.count()
test_rows = test.count()
print ("Training Rows:", train_rows, " Testing Rows:", test_rows)
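`randomSplit` assigns rows to the splits by random sampling, so the 70/30 proportions are approximate, and without a seed the row counts change from run to run (`randomSplit` accepts an optional `seed` argument). A minimal plain-Python sketch of the idea, assuming each row is assigned independently; the function name `random_split` and the toy rows are illustrative only, and this is not Spark's implementation:

```python
import random

def random_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(rows)
print("Training Rows:", len(train_rows), " Testing Rows:", len(test_rows))
```

Calling `random_split(rows)` again with the same seed returns the same two lists, which is the property a seeded `randomSplit` gives you when comparing models across runs.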

### Build the Classifier
user_id, review_id and business_id are the columns we used to build the Decision Tree Classifier model.

#### Features
We use these features with a classification algorithm (**DecisionTreeClassifier**) to predict the label derived from stars (ratings).

The DecisionTreeClassifier class is an estimator, so you can use its **fit** method to train a model, or you can include it in a pipeline. It expects a feature vector column and a label column, so a VectorAssembler first combines the user_id, review_id and business_id columns into a single features vector.  
NOTE: all columns were normalized in a Python Jupyter notebook before the DataFrame was imported.

In [16]:
dtassembler = VectorAssembler(inputCols=["user_id", "review_id", "business_id"], outputCol="features")
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3) 
dtp = Pipeline(stages=[dtassembler, dt])

#### Add paramGrid and Validation

In [18]:
paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6])
             .addGrid(dt.maxBins, [20, 40])
             .build())
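The grid is the Cartesian product of the listed values, so it holds 3 x 2 = 6 parameter combinations. A plain-Python sketch of that expansion, where the dictionary keys stand in for `dt.maxDepth` and `dt.maxBins`:

```python
from itertools import product

max_depth_values = [1, 2, 6]
max_bins_values = [20, 40]

# Every (maxDepth, maxBins) pair becomes one candidate parameter map
param_maps = [
    {"maxDepth": depth, "maxBins": bins}
    for depth, bins in product(max_depth_values, max_bins_values)
]
print(len(param_maps))
for param_map in param_maps:
    print(param_map)
```

Each additional value in either list multiplies the number of models to train, so the grid is best kept small on Databricks CE.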

### Use _TrainValidationSplit_ to build a general model

In [20]:
dt_tvs = TrainValidationSplit(estimator=dtp, evaluator=MulticlassClassificationEvaluator(), estimatorParamMaps=paramGrid, trainRatio=0.8)

dtModel = dt_tvs.fit(train)

### Test the Classifier
Now that we've trained the classifier, let's see how accurately it predicts the known labels in the test set.

In [22]:
prediction = dtModel.transform(test)
predicted = prediction.select("features", "prediction", "trueLabel")
predicted.show(10)

## Calculate TP, FP, TN, and FN
Precision and recall are also calculated.

In [24]:
tp = float(predicted.filter("prediction == 1.0 AND truelabel == 1").count())
fp = float(predicted.filter("prediction == 1.0 AND truelabel == 0").count())
tn = float(predicted.filter("prediction == 0.0 AND truelabel == 0").count())
fn = float(predicted.filter("prediction == 0.0 AND truelabel == 1").count())
metrics = spark.createDataFrame([
      ("TP", tp),
      ("FP", fp),
      ("TN", tn),
      ("FN", fn),
      ("Precision", tp / (tp + fp)),
      ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()
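Precision is TP / (TP + FP) and recall is TP / (TP + FN). If the model never predicts the positive class, TP + FP is zero and the division in the cell above raises `ZeroDivisionError`. A hedged sketch of a guarded version follows; the helper `safe_ratio` and the counts are made up for illustration and are not results from this notebook:

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical confusion-matrix counts, not taken from the notebook's output
tp, fp, tn, fn = 80.0, 20.0, 60.0, 40.0

precision = safe_ratio(tp, tp + fp)
recall = safe_ratio(tp, tp + fn)
f1 = safe_ratio(2 * precision * recall, precision + recall)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```

Returning 0.0 for an undefined ratio is one convention among several; reporting it as missing would be equally defensible.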

In [25]:
# MulticlassClassificationEvaluator defaults to metricName="f1", so this value is an F1 score, not an AUC
dt_evaluator =  MulticlassClassificationEvaluator(labelCol="trueLabel", predictionCol="prediction")
#dt_evaluator =  BinaryClassificationEvaluator(labelCol="trueLabel", rawPredictionCol="prediction", metricName="areaUnderROC")
dt_f1 = dt_evaluator.evaluate(prediction)

print("F1 for Decision Tree Classifier = ", dt_f1)

## TrainValidationSplit F1 for Decision Tree Classifier =  0.6732488403529867

## Building the same model using a CrossValidator

In [28]:
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator

## number of folds = 5

In [30]:
# K (numFolds) = 5 here; other values to test: 2, 3, 10
# K = 10 takes too long
cv = CrossValidator(estimator=dtp, evaluator=BinaryClassificationEvaluator(),
                    estimatorParamMaps=paramGrid, numFolds=5)

# Fit with cross-validation; the returned CrossValidatorModel uses the best parameters found
model = cv.fit(train)
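`CrossValidator` trains every parameter combination once per fold, so it is considerably more expensive than `TrainValidationSplit`, which evaluates each combination on a single train/validation split. A back-of-the-envelope count for the grid used here, assuming one final refit of the best parameters on the full training set in both cases:

```python
grid_size = 3 * 2  # maxDepth values x maxBins values
num_folds = 5

tvs_fits = grid_size + 1              # one fit per combination, plus the final refit
cv_fits = grid_size * num_folds + 1   # one fit per combination per fold, plus the final refit
print("TrainValidationSplit fits:", tvs_fits)
print("CrossValidator fits:", cv_fits)
```

With 10 folds the same grid would need 61 fits, which is consistent with the note that K = 10 takes too long.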

In [31]:
prediction = model.transform(test)
predicted = prediction.select("features", "prediction", "trueLabel")
predicted.show(10)

## Calculate TP, FP, TN, and FN
Precision and recall are also calculated.

In [33]:
tp = float(predicted.filter("prediction == 1.0 AND truelabel == 1").count())
fp = float(predicted.filter("prediction == 1.0 AND truelabel == 0").count())
tn = float(predicted.filter("prediction == 0.0 AND truelabel == 0").count())
fn = float(predicted.filter("prediction == 0.0 AND truelabel == 1").count())
metrics = spark.createDataFrame([
      ("TP", tp),
      ("FP", fp),
      ("TN", tn),
      ("FN", fn),
      ("Precision", tp / (tp + fp)),
      ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()

In [34]:
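#dt_evaluator =  MulticlassClassificationEvaluator(labelCol="trueLabel", predictionCol="prediction")
# Note: rawPredictionCol="prediction" computes the AUC from the hard 0/1 predictions rather than from the rawPrediction scores
dt_evaluator =  BinaryClassificationEvaluator(labelCol="trueLabel", rawPredictionCol="prediction", metricName="areaUnderROC")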
dt_auc = dt_evaluator.evaluate(prediction)

print("AUC for Decision Tree Classifier = ", dt_auc)

## CrossValidator AUC for Decision Tree Classifier =  0.5
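Because `rawPredictionCol` is set to the hard 0/1 `prediction` column, the ROC curve has at most one interior point, and the area reduces to (TPR + TNR) / 2. A model that predicts the same class for every row then scores exactly 0.5. This is one possible reading of the value above, not something the notebook confirms. A plain-Python sketch with made-up counts; the function name is hypothetical:

```python
def auc_from_hard_predictions(tp, fp, tn, fn):
    """Trapezoidal area under the ROC curve through (0,0), (FPR,TPR), (1,1)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Two trapezoids: from (0,0) to (fpr,tpr), then from (fpr,tpr) to (1,1)
    return fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2

# Hypothetical counts: every row predicted positive
all_positive = auc_from_hard_predictions(tp=70, fp=30, tn=0, fn=0)
# Hypothetical counts: a model that separates the classes somewhat
mixed = auc_from_hard_predictions(tp=60, fp=10, tn=20, fn=10)
print(all_positive, mixed)
```

Checking the TP, FP, TN and FN counts from the metrics cell shows whether the model collapsed to a single class; evaluating on the `rawPrediction` column instead would use the class scores and give a finer-grained ROC curve.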