# Decision Tree Classification

---

# Create entry points to Spark

In [1]:
from pyspark import SparkContext
sc = SparkContext(master = 'local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Python Spark SQL basic example") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/04/28 09:48:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/04/28 09:48:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


# Decision tree classification with PySpark

## Import data

In [2]:
cuse = spark.read.csv('data/cuse_binary.csv', header=True, inferSchema=True)
cuse.show(5)

+---+---------+---------+---+
|age|education|wantsMore|  y|
+---+---------+---------+---+
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
+---+---------+---------+---+
only showing top 5 rows



## Process categorical columns
The following code uses a pipeline to do three things:

* **`StringIndexer`**: index all categorical columns
* **`OneHotEncoder`**: one-hot encode all indexed categorical columns
* **`VectorAssembler`**: assemble all feature columns into one vector column

### Categorical columns

In [3]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

# categorical columns
categorical_columns = cuse.columns[0:3]

In [4]:
categorical_columns

['age', 'education', 'wantsMore']

#### Build StringIndexer stages

In [5]:
stringindexer_stages = [StringIndexer(inputCol=c, outputCol='strindexed_' + c) for c in categorical_columns]
# encode label column and add it to stringindexer_stages
stringindexer_stages += [StringIndexer(inputCol='y', outputCol='label')]

In [6]:
all_stages = stringindexer_stages
pipeline = Pipeline(stages=all_stages)

In [7]:
pipeline_model = pipeline.fit(cuse)

In [8]:
indexed = pipeline_model.transform(cuse)

In [18]:
indexed.show(5)

+---+---------+---------+---+--------------+--------------------+--------------------+-----+
|age|education|wantsMore|  y|strindexed_age|strindexed_education|strindexed_wantsMore|label|
+---+---------+---------+---+--------------+--------------------+--------------------+-----+
|<25|      low|      yes|  0|           2.0|                 1.0|                 0.0|  0.0|
|<25|      low|      yes|  0|           2.0|                 1.0|                 0.0|  0.0|
|<25|      low|      yes|  0|           2.0|                 1.0|                 0.0|  0.0|
|<25|      low|      yes|  0|           2.0|                 1.0|                 0.0|  0.0|
|<25|      low|      yes|  0|           2.0|                 1.0|                 0.0|  0.0|
+---+---------+---------+---+--------------+--------------------+--------------------+-----+
only showing top 5 rows



In [10]:
indexed.select("strindexed_age").distinct().show()

+--------------+
|strindexed_age|
+--------------+
|           0.0|
|           1.0|
|           3.0|
|           2.0|
+--------------+



In [12]:
indexed.createOrReplaceTempView("indexed")
spark.sql("select distinct age, strindexed_age from indexed order by strindexed_age").show()




+-----+--------------+
|  age|strindexed_age|
+-----+--------------+
|30-39|           0.0|
|25-29|           1.0|
|  <25|           2.0|
|40-49|           3.0|
+-----+--------------+
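
The mapping above is consistent with `StringIndexer`'s default ordering, in which the most frequent label receives index 0.0. A minimal plain-Python sketch of that idea follows. The `string_index` helper and the toy column are made up for illustration (they are not part of Spark and not the real `cuse` data), and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

# Toy column with invented frequencies, not the real cuse data.
ages = ["30-39"] * 4 + ["25-29"] * 3 + ["<25"] * 2 + ["40-49"]
mapping = string_index(ages)
indexed_ages = [mapping[a] for a in ages]
print(mapping)
```

The indices carry no ordinal meaning: they only reflect how often each category occurs, which is one reason the indexed columns are one-hot encoded afterwards.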




#### Build OneHotEncoder stages

For help interpreting `OneHotEncoder` output, see:
https://stackoverflow.com/questions/42295001/how-to-interpret-results-of-spark-onehotencoder


In [13]:
onehotencoder_stages = [OneHotEncoder(inputCol='strindexed_' + c, outputCol='onehot_' + c) for c in categorical_columns]

#### Build VectorAssembler stage

The Spark machine learning API requires all individual feature columns to be packed into a single vector column. `VectorAssembler` does exactly that.


In [14]:
feature_columns = ['onehot_' + c for c in categorical_columns]
vectorassembler_stage = VectorAssembler(inputCols=feature_columns, outputCol='features') 
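
To make the assembling step concrete, here is a minimal plain-Python sketch of concatenating sparse vectors written as `(size, indices, values)` triples. The `assemble` helper is hypothetical and only mimics the idea; it is not how Spark implements `VectorAssembler`. The input values correspond to a row with age `<25`, education `low`, and wantsMore `yes`.

```python
def assemble(sparse_vectors):
    """Concatenate (size, indices, values) triples into one sparse triple."""
    total_size, indices, values = 0, [], []
    for size, idx, vals in sparse_vectors:
        # Shift each index by the combined length of the vectors before it.
        indices.extend(total_size + i for i in idx)
        values.extend(vals)
        total_size += size
    return total_size, indices, values

onehot_age = (3, [2], [1.0])         # age '<25'
onehot_education = (1, [], [])       # education 'low'
onehot_wants_more = (1, [0], [1.0])  # wantsMore 'yes'

features = assemble([onehot_age, onehot_education, onehot_wants_more])
print(features)
```

The three one-hot columns have sizes 3, 1, and 1, so the assembled `features` vector has 5 elements.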

#### Build pipeline model

In [15]:
# all stages
all_stages = stringindexer_stages + onehotencoder_stages + [vectorassembler_stage]
pipeline = Pipeline(stages=all_stages)

#### Fit pipeline model

In [16]:
pipeline_model = pipeline.fit(cuse)

#### Transform data

In [20]:
final_columns = feature_columns + ['features', 'label']
#cuse_df = pipeline_model.transform(cuse).\
#            select(final_columns)


cuse_df = pipeline_model.transform(cuse)            
cuse_df.show(5)

+---+---------+---------+---+--------------+--------------------+--------------------+-----+-------------+----------------+----------------+-------------------+
|age|education|wantsMore|  y|strindexed_age|strindexed_education|strindexed_wantsMore|label|   onehot_age|onehot_education|onehot_wantsMore|           features|
+---+---------+---------+---+--------------+--------------------+--------------------+-----+-------------+----------------+----------------+-------------------+
|<25|      low|      yes|  0|           2.0|                 1.0|                 0.0|  0.0|(3,[2],[1.0])|       (1,[],[])|   (1,[0],[1.0])|(5,[2,4],[1.0,1.0])|
|<25|      low|      yes|  0|           2.0|                 1.0|                 0.0|  0.0|(3,[2],[1.0])|       (1,[],[])|   (1,[0],[1.0])|(5,[2,4],[1.0,1.0])|
|<25|      low|      yes|  0|           2.0|                 1.0|                 0.0|  0.0|(3,[2],[1.0])|       (1,[],[])|   (1,[0],[1.0])|(5,[2,4],[1.0,1.0])|
|<25|      low|      yes|  0|     

In [21]:
cuse_df.createOrReplaceTempView("cuse_df")
spark.sql("select distinct age, strindexed_age, onehot_age from cuse_df order by strindexed_age").show()



+-----+--------------+-------------+
|  age|strindexed_age|   onehot_age|
+-----+--------------+-------------+
|30-39|           0.0|(3,[0],[1.0])|
|25-29|           1.0|(3,[1],[1.0])|
|  <25|           2.0|(3,[2],[1.0])|
|40-49|           3.0|    (3,[],[])|
+-----+--------------+-------------+



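### How OneHotEncoder encodes categories

`age` has 4 categories (30-39, 25-29, <25, 40-49). By default (`dropLast=True`), Spark's `OneHotEncoder` drops the last category, so each value is encoded as a vector of only 3 elements, and the last category (index 3.0) becomes the all-zeros vector.

Index => dense vector => sparse representation `(size, [indices], [values])`: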

0.0   =>     (1,0,0)  => Sparse (3,[0],[1.0])
1.0   =>     (0,1,0)  => Sparse (3,[1],[1.0])
2.0   =>     (0,0,1)  => Sparse (3,[2],[1.0])
3.0   =>     (0,0,0)  => Sparse (3,[],[])
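
The table above can be reproduced with a short plain-Python sketch. The `one_hot_sparse` and `to_dense` helpers are hypothetical names used for illustration, not Spark functions; they assume the drop-last behaviour described in this section.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return the (size, indices, values) triple for a category index."""
    size = num_categories - 1 if drop_last else num_categories
    i = int(index)
    if i < size:
        return size, [i], [1.0]
    # The dropped last category is represented by the all-zeros vector.
    return size, [], []

def to_dense(sparse):
    """Expand a (size, indices, values) triple into a plain list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

encoded = [one_hot_sparse(i, 4) for i in (0.0, 1.0, 2.0, 3.0)]
for sparse in encoded:
    print(sparse, to_dense(sparse))
```

Dropping one category loses no information, because the all-zeros vector still identifies it uniquely.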


### Split data into training and test datasets

In [14]:
training, test = cuse_df.randomSplit([0.8, 0.2], seed=1234)
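
A weighted random split assigns each row to a split independently, so the resulting sizes are only approximately 80% and 20%, while a fixed seed makes the assignment repeatable. The sketch below illustrates this in plain Python. The `random_split` helper is hypothetical and is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for k, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=1234)
print(len(train_rows), len(test_rows))
```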

### Build cross-validation model

#### Estimator

An `Estimator` has a `fit()` method that accepts a DataFrame and produces a `Model`, which is a `Transformer`. For example, a learning algorithm such as `LogisticRegression` is an `Estimator`; calling `fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer` that can transform a feature vector into a predicted label.
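
The estimator/model pattern described above can be sketched with a toy majority-class learner in plain Python. Both classes are hypothetical and stand in for the Spark abstractions; rows are plain dictionaries instead of DataFrame rows.

```python
from collections import Counter

class MajorityClassModel:
    """A fitted model: a transformer that appends a prediction to each row."""
    def __init__(self, majority_label):
        self.majority_label = majority_label

    def transform(self, rows):
        # Return new rows; the input rows are left unchanged.
        return [{**row, "prediction": self.majority_label} for row in rows]

class MajorityClassEstimator:
    """An estimator: fit() learns from data and returns a model."""
    def __init__(self, label_col="label"):
        self.label_col = label_col

    def fit(self, rows):
        counts = Counter(row[self.label_col] for row in rows)
        return MajorityClassModel(counts.most_common(1)[0][0])

toy_rows = [{"label": 0.0}, {"label": 0.0}, {"label": 1.0}]
model = MajorityClassEstimator().fit(toy_rows)
predictions = model.transform(toy_rows)
print(predictions)
```

The estimator holds only configuration, while everything learned from the data lives in the returned model.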


In [15]:
from pyspark.ml.classification import DecisionTreeClassifier

# decision tree estimator that reads the assembled features and the indexed label
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

#### Parameter grid

https://spark.apache.org/docs/latest/ml-tuning.html


In [16]:
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(dt.maxDepth, [2,3,4,5]).\
    build()
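
A parameter grid is the cross product of the candidate values of every parameter added to it. The plain-Python sketch below shows how such a grid expands. `build_param_grid` is a hypothetical helper, and the second parameter (`maxBins`) is added only to illustrate the cross product; it is not tuned in this notebook.

```python
from itertools import product

def build_param_grid(grid):
    """Expand {param_name: [values]} into a list of {param_name: value} maps."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

# The grid used in this notebook: one parameter with four candidate values.
single = build_param_grid({"maxDepth": [2, 3, 4, 5]})

# Hypothetical second parameter, added only to show the cross product.
double = build_param_grid({"maxDepth": [2, 3, 4, 5], "maxBins": [16, 32]})

print(len(single), len(double))
```

Each extra parameter multiplies the number of candidate models, so grids grow quickly.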

#### Evaluator

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.Evaluator.html



In [17]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")
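
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it from that pairwise definition in plain Python, with made-up labels and scores. Spark derives the metric from the ROC curve instead, so this is an illustration of the quantity, not of Spark's algorithm.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Toy data, invented for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(toy_labels, toy_scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.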

#### Build cross-validation model

In [18]:
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)
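
With `numFolds=4`, each candidate parameter map is trained four times, each time holding out a different quarter of the data for validation. The plain-Python sketch below shows the fold bookkeeping. `k_fold_indices` is a hypothetical helper that assigns rows to folds round-robin for simplicity, whereas Spark assigns them randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    # Round-robin assignment of row indices to folds (sketch assumption).
    folds = [list(range(k, n_rows, num_folds)) for k in range(num_folds)]
    for k in range(num_folds):
        validation = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation

splits = list(k_fold_indices(n_rows=12, num_folds=4))
num_param_maps = 4  # maxDepth in [2, 3, 4, 5]
fits_during_search = len(splits) * num_param_maps

for train, validation in splits:
    print(len(train), len(validation))
print(fits_during_search)
```

Here the search trains 16 models in total (4 folds times 4 parameter maps) before the best setting is chosen.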

#### Fit cross-validation model

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html?highlight=crossvalidator


In [19]:
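# Note: this fits on the full dataset, so the held-out `test` rows also influence
# model selection. Fit on `training` instead for an unbiased test evaluation.
cv_model = cv.fit(cuse_df)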

#### Prediction

In [20]:
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']

##### Prediction on training data

In [21]:
pred_training_cv = cv_model.transform(training)
pred_training_cv.select(show_columns).show(5, truncate=False)

+---------+-----+----------+-------------+----------------------------------------+
|features |label|prediction|rawPrediction|probability                             |
+---------+-----+----------+-------------+----------------------------------------+
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
+---------+-----+----------+-------------+----------------------------------------+
only showing top 5 rows
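
In the output above, `rawPrediction` appears to hold per-class counts for the tree leaf a row falls into, and `probability` is those counts normalized to sum to 1. The plain-Python sketch below checks that relationship on the displayed values. The `normalize` helper is a hypothetical name.

```python
def normalize(raw_prediction):
    """Scale per-class counts so that they sum to 1."""
    total = sum(raw_prediction)
    return [count / total for count in raw_prediction]

raw_prediction = [203.0, 237.0]  # values taken from the output above
probability = normalize(raw_prediction)

# The predicted class is the index with the highest probability.
prediction = float(probability.index(max(probability)))
print(probability, prediction)
```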



##### Prediction on test data

In [22]:
pred_test_cv = cv_model.transform(test)
pred_test_cv.select(show_columns).show(5, truncate=False)

+---------+-----+----------+-------------+----------------------------------------+
|features |label|prediction|rawPrediction|probability                             |
+---------+-----+----------+-------------+----------------------------------------+
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
|(5,[],[])|0.0  |1.0       |[203.0,237.0]|[0.46136363636363636,0.5386363636363637]|
+---------+-----+----------+-------------+----------------------------------------+
only showing top 5 rows



### Confusion matrix

The DataFrame-based `pyspark.ml` evaluators do not return a confusion matrix directly, but we can easily build one by combining a few methods from the RDD class: `zipWithIndex()` turns each (label, prediction) row into a key, and `countByKey()` counts how often each key occurs.

In [23]:
label_and_pred = cv_model.transform(cuse_df).select('label', 'prediction')
label_and_pred.rdd.zipWithIndex().countByKey()


defaultdict(int,
            {Row(label=0.0, prediction=0.0): 897,
             Row(label=0.0, prediction=1.0): 203,
             Row(label=1.0, prediction=0.0): 270,
             Row(label=1.0, prediction=1.0): 237})
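
Common summary metrics follow directly from these four counts. The plain-Python sketch below derives accuracy, precision, and recall, treating label 1.0 as the positive class (an assumption of this sketch). The counts are copied from the output above.

```python
# (label, prediction) -> count, copied from the confusion matrix above.
confusion = {
    (0.0, 0.0): 897,
    (0.0, 1.0): 203,
    (1.0, 0.0): 270,
    (1.0, 1.0): 237,
}

tn = confusion[(0.0, 0.0)]  # true negatives
fp = confusion[(0.0, 1.0)]  # false positives
fn = confusion[(1.0, 0.0)]  # false negatives
tp = confusion[(1.0, 1.0)]  # true positives

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Because these counts come from predictions on the full dataset, which the model was also fitted on, they are likely to be optimistic compared with performance on unseen data.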

### Parameters from the best model

In [24]:
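# Spark 3.0+ exposes parameter getters on fitted models; on older versions,
# fall back to cv_model.bestModel._java_obj.getMaxDepth().
print('The best MaxDepth is:', cv_model.bestModel.getMaxDepth())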

The best MaxDepth is: 3
