# [CPSC 310](https://github.com/GonzagaCPSC310) Data Mining
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Decision Trees
What are our learning objectives for this lesson?
* Learn decision tree terminology
* Introduce the TDIDT algorithm and clashes

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes
* [Data Science from Scratch](https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X/ref=sr_1_1?ie=UTF8&qid=1491521130&sr=8-1&keywords=joel+grus) by Joel Grus

## Intro to Decision Tree Classifiers
$k$-NN and Naive Bayes are "instance-at-a-time" classifiers
* Given a new instance, use training set to predict class label
* Hard to know "why" or what overall pattern led to prediction
* Highly dependent on particular instance given (its attribute values)

Decision trees are "rule"-based classifiers
* Build a set of general rules from training set
* Like a "compiled" version of the training set
* Use rules (not training set) to classify new instances

Rules are basic if-then statements:

>IF $att_1 = val_1^1 \wedge att_2 = val_1^2 \wedge ...$ THEN $class = label_1$

>IF $att_1 = val_2^1 \wedge att_2 = val_2^2 \wedge ...$ THEN $class = label_2$

>IF $att_1 = val_3^1 \wedge att_2 = val_3^2 \wedge ...$ THEN $class = label_3$
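
To make the if-then form concrete, here is a minimal sketch of rules stored as data and applied to an instance. The rule format, the `classify_with_rules` name, and the two rules themselves are hypothetical (they are made up for illustration, not learned from any dataset); the attribute names preview the iPhone example used later in this lesson.

```python
# Hypothetical rule format: (conditions, label), where conditions maps an
# attribute name to the value it must have. These rules are made up.
rules = [
    ({"standing": 1, "job_status": 3}, "no"),
    ({"standing": 2, "credit_rating": "fair"}, "yes"),
]

def classify_with_rules(instance, rules, default=None):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        # A rule fires only if every attribute test is satisfied (logical AND).
        if all(instance.get(att) == val for att, val in conditions.items()):
            return label
    return default  # no rule matched

prediction = classify_with_rules(
    {"standing": 2, "job_status": 1, "credit_rating": "fair"}, rules
)
print(prediction)
```

Note that the training set is not consulted at prediction time; only the rules are.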

The rules are captured in a "decision tree"
* Internal nodes denote attributes (e.g., job status, standing, etc.)
* Edges denote values of the attribute
* Leaves denote class labels (e.g., buys iphone = yes)
    * Either stating a prediction
    * Or giving the distribution...
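
One possible way to hold such a tree in code is a nested dictionary, sketched below. The representation (`"attribute"` and `"branches"` keys, with plain strings as leaves), the `predict` function, and the toy tree are all assumptions made for illustration; the tree is not learned from any dataset.

```python
# Hypothetical representation: an internal node is a dict with an
# "attribute" and a "branches" mapping (edge value -> subtree);
# a leaf is simply a class label string.
tree = {
    "attribute": "standing",
    "branches": {
        1: "no",
        2: {
            "attribute": "credit_rating",
            "branches": {"fair": "yes", "excellent": "no"},
        },
    },
}

def predict(tree, instance):
    """Follow the edges matching the instance's values until reaching a leaf."""
    node = tree
    while isinstance(node, dict):          # still at an internal node
        value = instance[node["attribute"]]
        node = node["branches"][value]     # raises KeyError for an unseen value
    return node

print(predict(tree, {"standing": 2, "credit_rating": "fair"}))
```

A leaf could instead store a label distribution (for example, counts per class) when a single prediction is not enough.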


### Lab Task 1
Consider the following example dataset for predicting iPhone purchases, the iPhone Purchases (Fake) dataset:

|standing |job_status |credit_rating |buys_iphone|
|-|-|-|-|
|1 |3 |fair |no|
|1 |3 |excellent |no|
|2 |3 |fair |yes|
|2 |2 |fair |yes|
|2 |1 |fair |yes|
|2 |1 |excellent |no|
|2 |1 |excellent |yes|
|1 |2 |fair |no|
|1 |1 |fair |yes|
|2 |2 |fair |yes|
|1 |2 |excellent |yes|
|2 |2 |excellent |yes|
|2 |3 |fair |yes|
|2 |2 |excellent |no|
|2 |3 |fair |yes|

A *clash* is when two or more instances in a partition have the same combination of attribute values but different classifications. 

Bramer's definition of the Top-Down Induction of Decision Trees (TDIDT) algorithm assumes the *adequacy condition*, which ensures that no two instances with identical attribute values have different class labels (i.e., no clashes).

Does the iPhone dataset have any clashes?
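
After answering by hand, one way to check the answer is to group instances by their attribute values and look for groups with more than one class label. This is only a sketch; the `find_clashes` name and the list-of-lists table layout (class label in the last column) are assumptions.

```python
from collections import defaultdict

# iPhone Purchases (Fake) dataset; the last column is the class label.
table = [
    [1, 3, "fair", "no"],
    [1, 3, "excellent", "no"],
    [2, 3, "fair", "yes"],
    [2, 2, "fair", "yes"],
    [2, 1, "fair", "yes"],
    [2, 1, "excellent", "no"],
    [2, 1, "excellent", "yes"],
    [1, 2, "fair", "no"],
    [1, 1, "fair", "yes"],
    [2, 2, "fair", "yes"],
    [1, 2, "excellent", "yes"],
    [2, 2, "excellent", "yes"],
    [2, 3, "fair", "yes"],
    [2, 2, "excellent", "no"],
    [2, 3, "fair", "yes"],
]

def find_clashes(table):
    """Map each attribute-value combination to its set of labels and keep
    only the combinations that have more than one distinct label."""
    labels_by_combo = defaultdict(set)
    for *attributes, label in table:
        labels_by_combo[tuple(attributes)].add(label)
    return {combo: labels for combo, labels in labels_by_combo.items()
            if len(labels) > 1}

clashes = find_clashes(table)
print(clashes)  # an empty dict would mean the adequacy condition holds
```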

### Lab Task 2
Here is an example decision tree for the iPhone dataset:
<img src="https://raw.githubusercontent.com/GonzagaCPSC310/U5-Decision-Trees/master/figures/iphone_decision_tree_example.png" width="850"/>

* Note that attribute values are drawn as inner "oval" nodes (not as edge labels)
* These represent "partitions" of the training set
* Leaf nodes give distribution of class labels

Extract the rules from this decision tree.

## The TDIDT (Top-Down Induction of Decision Trees) Algorithm
Basic Approach (uses recursion!):
* At each step, pick an attribute ("attribute selection")
* Partition data by attribute values ... this creates pairwise disjoint partitions
* Repeat until one of the following occurs (base cases):
    1. Partition has only class labels that are the same ... no clashes, make a leaf node
    2. No more attributes to partition ... reached the end of a branch and there may be clashes, see options below
    3. No more instances to partition ... see options below
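
The recursion above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a definitive implementation: attribute selection just takes the first remaining attribute, clashes in case 2 are resolved by majority vote (one of the options discussed later in these notes), and partitions are built only from values that actually occur in the data, so case 3 never arises here. The names `tdidt` and `majority_label` and the nested-dict tree format are hypothetical.

```python
from collections import Counter

def majority_label(rows):
    """Most frequent class label (last column); ties go to the label seen first."""
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def tdidt(rows, attributes):
    """rows: lists whose last item is the class label.
    attributes: column indexes still available for splitting."""
    labels = {row[-1] for row in rows}
    if len(labels) == 1:             # case 1: all labels agree, make a leaf
        return labels.pop()
    if not attributes:               # case 2: out of attributes (a clash)
        return majority_label(rows)
    att = attributes[0]              # placeholder "attribute selection"
    remaining = attributes[1:]
    partitions = {}
    for row in rows:                 # pairwise disjoint partitions by value
        partitions.setdefault(row[att], []).append(row)
    return {
        "attribute": att,
        "branches": {value: tdidt(part, remaining)
                     for value, part in partitions.items()},
    }

# Tiny made-up table: two attributes, then the class label.
rows = [["a", "x", "yes"], ["a", "y", "no"], ["b", "x", "yes"]]
tree = tdidt(rows, [0, 1])
print(tree)
```

Swapping in a different attribute selection strategy only requires changing the line that chooses `att`.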

    
### More on Case 3
Assume we have the following:
<img src="https://raw.githubusercontent.com/GonzagaCPSC310/U5-Decision-Trees/master/figures/decision_tree_one_attr.png" width="300"/>

* Where the partition for att1=v1 has many instances
* But the partition for att1=v2 has no instances
* What are our options?
    1. Do Nothing: Leave value out of tree (creates incomplete decision tree)
    2. Backtrack: replace Attribute 1 node with leaf node (possibly w/clashes, see options below)
* For the first choice, we won't be able to classify all instances
* We also need to know the possible attribute values ahead of time
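
A small sketch of why the attribute domains matter: if partitions are created for every value in a known domain, an empty partition shows up explicitly and can then be handled with one of the options above. The `partition_by` helper, the `domain` parameter, and the toy rows are assumptions made for illustration.

```python
def partition_by(rows, att_index, domain):
    """Group rows by the value in column att_index, creating one
    (possibly empty) partition for every value in the known domain."""
    partitions = {value: [] for value in domain}
    for row in rows:
        partitions[row[att_index]].append(row)
    return partitions

# Made-up data: every instance has att1 = v1, so v2 gets no instances.
rows = [["v1", "yes"], ["v1", "no"], ["v1", "yes"]]
partitions = partition_by(rows, 0, domain=["v1", "v2"])

empty_values = [value for value, part in partitions.items() if not part]
print(empty_values)
```

Without the `domain` argument, the value `v2` would never appear in the tree at all, which corresponds to the "Do Nothing" option.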

### Handling Clashes for Prediction
1. "Majority Voting"... select the class with highest number of instances
2. "Intuition"... that is, use common sense and pick one (hand modify tree)
3. "Discard"... remove the branch from the node above
    * Similar to case 3 above
    * Results in "missing" attribute combos (some instances can't be classified)
    * e.g., just remove two 50/50 branches from iPhone example tree
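
Majority voting is simple to sketch, but a 50/50 partition needs an explicit tie rule. The version below reports whether a tie occurred so the caller can decide what to do; breaking ties alphabetically is an arbitrary choice made here, and the `majority_vote` name is hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Return (winner, is_tie). On a tie, the winner is the tied label that
    comes first alphabetically, which is an arbitrary choice for this sketch."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    return winners[0], len(winners) > 1

print(majority_vote(["yes", "yes", "no"]))
print(majority_vote(["no", "yes"]))
```

When `is_tie` is true, voting alone does not settle the prediction, so one of the other options (intuition or discarding the branch) may be preferable.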

### Lab Task 3
Use TDIDT to create a decision tree for the iPhone example
* Randomly select attributes as your "attribute selection" approach
* Extract the rules for your decision tree

### Lab Task 4
Consider the following dataset. Each instance is a list of attribute values describing a job candidate:
* Level of expertise (string): Junior, Mid, Senior
* Preferred language (string): Java, Python, R
* Whether she is active on twitter (boolean): yes, no
* Whether she has a PhD (boolean): yes, no
* CLASS: Interviewed well? (boolean): True, False
    
|level|lang|tweets|phd|interviewed_well|
|-|-|-|-|-|
|Senior|Java|no|no|False|
|Senior|Java|no|yes|False|
|Mid|Python|no|no|True|
|Junior|Python|no|no|True|
|Junior|R|yes|no|True|
|Junior|R|yes|yes|False|
|Mid|R|yes|yes|True|
|Senior|Python|no|no|False|
|Senior|R|yes|no|True|
|Junior|Python|yes|no|True|
|Senior|Python|yes|yes|True|
|Mid|Python|no|yes|True|
|Mid|Java|yes|no|True|
|Junior|Python|no|yes|False|

Construct a decision tree that first splits on `level`. For the `Senior` partition, split on `tweets`. For the `Junior` partition, split on `phd`. Label each of your leaf nodes with the class proportion of the partition of that subtree.
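
For labeling leaves, a small helper that turns a partition's class labels into proportions may be handy. The sketch below (the `class_proportions` name and the `"count/total"` output format are assumptions) is demonstrated on the class column of the whole dataset rather than on any one partition.

```python
from collections import Counter

def class_proportions(labels):
    """Map each class label to a 'count/total' string for the given partition."""
    counts = Counter(labels)
    total = len(labels)
    return {label: f"{n}/{total}" for label, n in counts.items()}

# interviewed_well column of the full dataset, in table order.
interviewed_well = ["False", "False", "True", "True", "True", "False", "True",
                    "False", "True", "True", "True", "True", "True", "False"]
proportions = class_proportions(interviewed_well)
print(proportions)
```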

### Lab Task 5
Use your tree from the previous task to classify the following test instances:
1. $X_{1}$ = `["Junior", "Java", "yes", "no"]`
1. $X_{2}$ = `["Junior", "Java", "yes", "yes"]`