<div style="text-align: right">INFO 6105 Data Science Eng Methods and Tools, Lecture 9</div>
<div style="text-align: right">Dino Konstantopoulos, ! April 2019, with material from Joe McCarthy and Chris Roach</div>

Last semester a student emailed me this video and told me that's what my lectures felt like:

<br />
<center>
<img src = ipynb.images/6105stat.gif width = 250 />
</center>

So I tried to slow down my lectures. But we have a lot of material to cover today because we're actually going to encode a decision tree algorithm. On Wednesday, we'll have a lab with `scikit-learn` implementations.

We spent the first part of the semester learning about how to think of data and fit the data to a model. Now let's make it more complicated and think about how to design machines to do the same thing :-) 

<br />
<center>
<img src = ipynb.images/irobot.jpg width = 300 />
</center>



# 1. Decision Trees and regression trees

The idea is to split a dataset based on **homogeneity of data**. A **decision tree** is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar **values** (homogeneous). 

On the other hand, in a **regression tree**, since the target (dependent) variable is a real valued number, we fit a regression model to the target variable using each of the independent variables. 

Then, for each independent variable, the data is split at several candidate split points. At each split point we calculate the Sum of Squared Errors (SSE) between the predicted and the actual values. The variable and split point giving the minimum SSE are selected for the node. This process then continues recursively until all the data is covered. Each split point may belong to a different independent variable.
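To make the split search concrete, here is a minimal sketch for a single independent variable. The function names `sse` and `best_split` and the toy numbers are made up for illustration; a real regression tree repeats this search over every variable and recurses on each side.

```python
def sse(values):
    """Sum of squared errors around the mean (the mean is the leaf's prediction)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the best binary split on one variable."""
    best = (None, float('inf'))
    # every distinct value except the smallest is a candidate threshold
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# hypothetical toy data: day of year vs. max temperature
days = [10, 20, 30, 100, 110, 120]
temps = [30.0, 32.0, 31.0, 55.0, 57.0, 56.0]
threshold, total_sse = best_split(days, temps)
print(threshold, total_sse)
```

The split lands between the cold days and the warm days, because that is where the two groups are each closest to their own mean.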

All of us actually use decision trees in our daily life! To illustrate the concept, let's use an everyday example: predicting tomorrow’s maximum temperature for Boston. Wasn't today so much colder than yesterday?!

In order to answer the single max temperature question, we need to work through an entire series of queries. We start by forming an initial reasonable range given our domain knowledge, which for Boston is very little. Let's say that it's 30–60 degrees Fahrenheit. Gradually, through a set of questions and answers, we will reduce this range until we are confident enough to make a single prediction.

What makes a good question to start with? What kind of independent variable should we split the data by? Well, if we want to limit the range **as much as possible** initially, let's think of the most relevant question to ask. Since temperature is highly dependent on time of year, a decent place to start would be: what season are we in? Winter *close to spring*, right? So we can limit the prediction range to 30–50 degrees because we have an idea of what the general max temperatures are in Boston winter-close-to-spring-time (yes, yesterday was 60 but that was an outlier). This first question already cuts our range by a lot. We use that independent variable as our first node variable. But this question isn’t quite enough to narrow down our estimate, so we need to find out more information for our second node.

A good follow-up question is: what is the historical average max temperature on this day? For Boston, the answer is 36 degrees. This allows us to further restrict our range of consideration to, let's say, 30–40 degrees. 

Two questions (two nodes) are still not quite enough to make a prediction because this year might be warmer or colder than average. Therefore, we also want to look at the max temperature today to get an idea of whether the year has been unusually warm or cold. Our question is simple: what is the maximum temperature today? If the max temperature today was lower than the historical average for today's date, this year might be running cold, and our estimate for tomorrow should be a little lower than the historical average. At this point, we can feel pretty confident in making a prediction of 35 degrees for the max temperature tomorrow. 

So, to arrive at an estimate, we used a series of questions, with each question narrowing our possible values until we were confident enough to make a single prediction. Following one path (the most probable one) down the tree, we used 3 nodes to make a decision. 

We also need to complete all paths and add nodes to all split points so we have a decision for each leaf of the tree (we did not do this in our questioning above).

**Regression Forests** are different from a single tree: they are an **ensemble** of different regression trees. These models work on the principle of the **wisdom of the crowd**. In short, it is better to consider the opinions of 1000 different people with not much knowledge than to consider the opinion of only one expert (provided the 1000 people err independently and have accuracy better than random guessing, i.e., more than 50% on a yes/no question). There is actually a mathematical proof of this (Condorcet's jury theorem).
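We can check the wisdom-of-the-crowd claim numerically for a yes/no question. The sketch below assumes every voter is right with the same probability and that voters err independently (real trees in a forest are only approximately independent); the function name is made up for illustration.

```python
from math import comb


def majority_correct_probability(n_voters, p_correct):
    """Probability that more than half of n independent voters are right."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# each voter is only slightly better than a coin flip
for n in (1, 11, 101):
    print(n, round(majority_correct_probability(n, 0.55), 3))
```

With voters that are only slightly better than a coin flip, the majority gets more reliable as the crowd grows. With voters worse than a coin flip, the same formula shows the majority getting worse.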

Ok, is our intuition about the algorithm in place?

# 2. The Computer Science *Tree* 

In computer science, a **tree** is a widely used abstract data type (ADT) that simulates a hierarchical tree structure, with a root value and subtrees of children with a parent node, represented as a set of linked nodes. A tree ends in leaf nodes and is represented upside down, with the root at the top.

A **decision tree** is a decision support tool that uses a tree-like model of decisions and their possible outcomes. Much like a [graph](https://en.wikipedia.org/wiki/Graph_theory) is a way to display transitions of state machines, a [tree](https://en.wikipedia.org/wiki/Tree_(graph_theory)) is a specialization of a graph that displays conditional control statements.

<br />
<center>
<img src = ipynb.images/election.png width = 500 />
</center>

Decision trees are commonly used in operations research and [Machine Learning](https://en.wikipedia.org/wiki/Decision_tree_learning).

Some techniques, often called **ensemble methods**, construct *more than one* decision tree, and thus talk about **decision forests** rather than trees. For example:

- **Boosted trees** incrementally build an ensemble by training each new tree to emphasize the training instances previously mis-modeled. A typical example is [AdaBoost](https://en.wikipedia.org/wiki/AdaBoost#LogitBoost)
- **Bootstrap aggregated** (or **bagged**) decision trees, an early ensemble method, build multiple decision trees by repeatedly resampling training data *with replacement*, and voting the trees for a consensus prediction
- A **random forest classifier** is a specific type of bootstrap aggregating in which each tree also considers only a random subset of the input features when choosing a split
- A **rotation forest** trains every decision tree by first applying principal component analysis (PCA) on a random subset of the input features
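The two ingredients of bagging, resampling *with replacement* and voting, fit in a few lines. This is only a sketch: `bootstrap_sample` and `majority_vote` are names made up for illustration, and the step of training one tree per resample is left out.

```python
import random
from collections import Counter


def bootstrap_sample(instances, rng):
    """Draw len(instances) items with replacement (some repeat, some are left out)."""
    return [rng.choice(instances) for _ in instances]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the run is repeatable
data = list(range(10))  # stand-in for ten training instances
sample = bootstrap_sample(data, rng)
print(sample)
print(majority_vote(['e', 'p', 'e']))  # three hypothetical tree predictions
```

Each tree in the ensemble would be trained on its own resample, and the ensemble's prediction is the majority vote over the trees.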

Decision Trees are simple to understand, maybe the simplest, albeit powerful, ML method there is, since trees can also be displayed graphically in a way that is easy for non-experts to understand. Trees are able to handle both numerical and categorical data. Other techniques are usually more specialized: for example, relation rules can be used only with nominal variables, while neural networks can be used only with numerical variables or categoricals converted to 0-1 values.

Decision trees use a **white box model**: If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a **black box model**, the explanation for the results is typically difficult to understand, as for example with artificial neural networks (unless they're Bayesian).

Decision Trees perform well with large datasets: large amounts of data can be analysed using standard computing resources in reasonable time. They also mirror human decision making pretty closely.

Finally, Decision trees can be sampled using **MCMC**: By constructing a Markov chain that has the desired distribution as its equilibrium distribution, we can obtain a sample of the desired distribution by observing the chain after a number of steps. For example, [here](https://www2.stat.duke.edu/courses/Fall05/sta395/casper1.pdf).

In [1]:
from IPython.display import display, Image, HTML

# 3. The Algorithm type: Supervised Classification

[Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning), of which **Classification** is the most common flavor, is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances. In unsupervised learning, you are essentially missing the output.

<img src="http://www.nltk.org/images/supervised-classification.png" title="Supervised Classification, from NLTK book, Chapter 6" alt="nltk_ch06_supervised-classification.png" style="width: 500px" />

> (a) During *training*, a **feature extractor** is used to convert each **input value** to a **feature set**. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and **labels** are fed into the **machine learning algorithm** to generate a **model**. (b) During *prediction*, the same feature extractor is used to convert **unseen inputs** to feature sets. These feature sets are then fed into the model, which generates **predicted labels**.

# 4. The data: UCI Mushrooms

The [Center for Machine Learning and Intelligent Systems](http://cml.ics.uci.edu/) at the University of California, Irvine (UCI), hosts a [Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) containing over 200 publicly available data sets. It is yours truly's most loved (and used) dataset archive.

<img src="ipynb.images/mushrooms.jpg" alt="mushroom" width = 400/>

<a href="https://archive.ics.uci.edu/ml/datasets/Mushroom"><img src="https://archive.ics.uci.edu/ml/assets/MLimages/Large73.jpg"  style="margin: 0px 0px 5px 20px; width: 125px; float: right;" title="Mushrooms from Agaricus and Lepiota Family" alt="mushroom"/></a>
For our decision tree data, we will use the [mushroom](https://archive.ics.uci.edu/ml/datasets/Mushroom) data set, used in Chapter 3 of Provost & Fawcett's [data science book](https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking/dp/1449361323).

The following description of the dataset is provided at the UCI repository:

> This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525 [The Audubon Society Field Guide to North American Mushrooms, 1981]). Each species is identified as:
> - definitely edible
> - definitely poisonous
> - of unknown edibility and not recommended.
>
> This latter class is usually combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom.
> 
> **Number of Instances**: 8124
> 
> **Number of Attributes**: 22 (all nominally valued)
> 
> **Attribute Information**: (*classes*: edible=e, poisonous=p)
> 
> 1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
> 2. *cap-surface*: fibrous=f, grooves=g, scaly=y, smooth=s
> 3. *cap-color*: brown=n ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
> 4. *bruises?*: bruises=t, no=f
> 5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
> 6. *gill-attachment*: attached=a, descending=d, free=f, notched=n
> 7. *gill-spacing*: close=c, crowded=w, distant=d
> 8. *gill-size*: broad=b, narrow=n
> 9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
> 10. *stalk-shape*: enlarging=e, tapering=t
> 11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
> 12. *stalk-surface-above-ring*: fibrous=f, scaly=y, silky=k, smooth=s
> 13. *stalk-surface-below-ring*: fibrous=f, scaly=y, silky=k, smooth=s
> 14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
> 15. *stalk-color-below-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
> 16. *veil-type*: partial=p, universal=u
> 17. *veil-color*: brown=n, orange=o, white=w, yellow=y
> 18. *ring-number*: none=n, one=o, two=t
> 19. *ring-type*: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
> 20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
> 21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
> 22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
> 
> **Missing Attribute Values**: 2480 of them (denoted by "?"), all for attribute #11, *stalk-root* (index 11 in each row when the class label sits at index 0).
> 
> **Class Distribution**: -- edible: 4208 (51.8%) -- poisonous: 3916 (48.2%) -- total: 8124 instances

The [data file](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data) associated with this dataset, up on blackboard, has one instance of a hypothetical mushroom per line, with abbreviations for the values of the class and each of the other 22 attributes separated by commas.

Here is a sample line from the data file:

p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d

This instance represents a mushroom with the following attribute values (highlighted in **bold**):

*class*: edible=e, **poisonous=p**

1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, **knobbed=k**, sunken=s
2. *cap-surface*: **fibrous=f**, grooves=g, scaly=y, smooth=s
3. *cap-color*: **brown=n** ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
4. *bruises?*: bruises=t, **no=f**
5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, **none=n**, pungent=p, spicy=s
6. *gill-attachment*: attached=a, descending=d, **free=f**, notched=n
7. *gill-spacing*: **close=c**, crowded=w, distant=d
8. *gill-size*: broad=b, **narrow=n**
9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, **white=w**, yellow=y
10. *stalk-shape*: **enlarging=e**, tapering=t
11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, **missing=?**
12. *stalk-surface-above-ring*: fibrous=f, scaly=y, **silky=k**, smooth=s
13. *stalk-surface-below-ring*: fibrous=f, **scaly=y**, silky=k, smooth=s
14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, **white=w**, yellow=y
15. *stalk-color-below-ring*: **brown=n**, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
16. *veil-type*: **partial=p**, universal=u
17. *veil-color*: brown=n, orange=o, **white=w**, yellow=y
18. *ring-number*: none=n, **one=o**, two=t
19. *ring-type*: cobwebby=c, **evanescent=e**, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, **white=w**, yellow=y
21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, **several=v**, solitary=y
22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, **woods=d**



In [2]:
attribute_names = ['class', 
                   'cap-shape', 'cap-surface', 'cap-color', 
                   'bruises?', 
                   'odor', 
                   'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 
                   'stalk-shape', 'stalk-root', 
                   'stalk-surface-above-ring', 'stalk-surface-below-ring', 
                   'stalk-color-above-ring', 'stalk-color-below-ring',
                   'veil-type', 'veil-color', 
                   'ring-number', 'ring-type', 
                   'spore-print-color', 
                   'population', 
                   'habitat']
print(attribute_names)

['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
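As a quick sanity check, we can pair these names with the sample line shown earlier. This is just an illustrative sketch: `sample_line` and `sample` are names introduced here, and the list of names is repeated so the snippet runs on its own.

```python
attribute_names = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor',
                   'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
                   'stalk-shape', 'stalk-root',
                   'stalk-surface-above-ring', 'stalk-surface-below-ring',
                   'stalk-color-above-ring', 'stalk-color-below-ring',
                   'veil-type', 'veil-color', 'ring-number', 'ring-type',
                   'spore-print-color', 'population', 'habitat']

# the sample line from the data file, quoted earlier in this section
sample_line = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'

# zip pairs the i-th name with the i-th value
sample = dict(zip(attribute_names, sample_line.split(',')))
print(sample['class'], sample['odor'], sample['stalk-root'])
```

The class label comes first, and the `?` shows up under `stalk-root`, which is the only attribute with missing values.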


We read in our data file (up on blackboard): agaricus-lepiota.data. [Lepiota](https://en.wikipedia.org/wiki/Lepiota) is a genus of gilled mushrooms in the family [Agaricaceae](https://en.wikipedia.org/wiki/Agaricaceae). All Lepiota species are ground-dwelling [saprotrophs](https://en.wikipedia.org/wiki/Saprotrophic_nutrition) with a preference for rich, calcareous soils. We build a list of instances, where each instance is a list of attribute values.

The following code creates that list. Each instance holds the same 23 comma-separated values as the sample line shown above, split into a Python list. 

In [4]:
all_instances = []  # initialize instances to an empty list
data_filename = 'data/agaricus-lepiota.data'

with open(data_filename, 'r') as f:
    for line in f:  # 'line' will be bound to the next line in f in each for loop iteration
        all_instances.append(line.strip().split(','))
        
print('Read', len(all_instances), 'instances from', data_filename)
# we don't want to print all the instances, so we'll just print the first one to verify
print('First instance:', all_instances[0]) 

Read 8124 instances from data/agaricus-lepiota.data
First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']


#### *Exercise*: Using a Python list comprehension, convert comma-separated strings to '|' separated strings.
```python
DELIMITER = '|'
delimited_string = ''
token_list = ['a', 'b', 'c']

delimited_string = DELIMITER.join([...])
delimited_string
```

In [6]:
DELIMITER = '|'
delimited_string = ''
token_list = ['a', 'b', 'c']

delimited_string = DELIMITER.join([token for token in token_list])
delimited_string

'a|b|c'

#### Missing values & "clean" instances

As noted in the initial description of the UCI mushroom set above, 2480 of the 8124 instances have missing attribute values (denoted by `'?'`). We will simply ignore any such instances and restrict our focus to only the *clean* instances (with no missing values).


In [7]:
UNKNOWN_VALUE = '?'
clean_instances = [instance
                   for instance in all_instances
                   if UNKNOWN_VALUE not in instance]

print(len(clean_instances), 'clean instances')

5644 clean instances


#### *Exercise*: Rewrite the code above using list comprehensions
```python
[instance for ... if ...]
```

#### Using Dictionaries

In [8]:
attribute_values_cap_type = {'b': 'bell', 
                             'c': 'conical', 
                             'x': 'convex', 
                             'f': 'flat', 
                             'k': 'knobbed', 
                             's': 'sunken'}

for attribute_value_abbrev in attribute_values_cap_type:
    print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])

b = bell
c = conical
x = convex
f = flat
k = knobbed
s = sunken


If we want to count the numbers of **edible** and **poisonous** mushrooms in the *clean_instances* list we created earlier:

In [9]:
edible_count = 0
for instance in clean_instances:
    if instance[0] == 'e':
        edible_count += 1  # this is shorthand for edible_count = edible_count + 1

print('There are', edible_count, 'edible mushrooms among the', 
      len(clean_instances), 'clean instances')

There are 3488 edible mushrooms among the 5644 clean instances


If we want to count the number of occurrences (**frequencies**) of each possible value for an attribute, we can create a dictionary where each dictionary key is an attribute value and each dictionary value is the count of instances with that attribute value.

Using an ordinary dictionary, we must be careful to create a new dictionary entry the first time we see a new attribute value (that is not already contained in the dictionary).

In [10]:
cap_state_value_counts = {}
for instance in clean_instances:
    cap_state_value = instance[1]  # cap-shape (named cap_state here) is the 2nd item, at index 1
    if cap_state_value not in cap_state_value_counts:
        # first occurrence, must explicitly initialize counter for this cap_state_value
        cap_state_value_counts[cap_state_value] = 0
    cap_state_value_counts[cap_state_value] += 1

print('Counts for each value of cap-state:')
for value in cap_state_value_counts:
    print(value, ':', cap_state_value_counts[value])

Counts for each value of cap-state:
x : 2840
b : 300
s : 32
f : 2432
k : 36
c : 4


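#### *Exercise*: Create the same `attribute_values_cap_type` dictionary using a dictionary comprehension
```python
abbrevs = ['b', 'c', 'x', 'f', 'k', 's']
names = ['bell', 'conical', 'convex', 'flat', 'knobbed', 'sunken']
attribute_values_cap_type_2 = {... for ... in zip(abbrevs, names)}
print(attribute_values_cap_type_2)
```

In [11]:
# Taking the first letter of each name would wrongly map 'convex' to 'c' (its code is 'x'),
# so we pair each abbreviation with its name explicitly.
abbrevs = ['b', 'c', 'x', 'f', 'k', 's']
names = ['bell', 'conical', 'convex', 'flat', 'knobbed', 'sunken']
attribute_values_cap_type_2 = {abbrev: name for abbrev, name in zip(abbrevs, names)}
print(attribute_values_cap_type_2)

{'b': 'bell', 'c': 'conical', 'x': 'convex', 'f': 'flat', 'k': 'knobbed', 's': 'sunken'}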


You could also have used a `Counter` object, which, when instantiated with a list of items, returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. 

In [12]:
from collections import Counter
cap_state_value_counts = Counter()

counts = Counter(['a', 'b', 'c', 'a', 'b', 'a'])
print(counts)
print(counts.most_common())

Counter({'a': 3, 'b': 2, 'c': 1})
[('a', 3), ('b', 2), ('c', 1)]


In [14]:
# Rebuild `cap_state_value_counts` using a Counter object.
from collections import Counter

cap_state_value_counts = Counter()
for instance in clean_instances:
    cap_state_value = instance[1]
    # no need to explicitly initialize counters for cap_state_value; all start at zero
    cap_state_value_counts[cap_state_value] += 1

print('Counts for each value of cap-state:')
for value in cap_state_value_counts:
    print(value, ':', cap_state_value_counts[value])

Counts for each value of cap-state:
x : 2840
b : 300
s : 32
f : 2432
k : 36
c : 4


#### *Exercise*: Do it better using a list comprehension constructor in Counter(...):

Rebuild `cap_state_value_counts` using a Counter object.
```python
from collections import Counter

cap_state_value_counts = Counter(...)

print('Counts for each value of cap-state:')
for value in cap_state_value_counts:
    print(value, ':', cap_state_value_counts[value])
```

# 5. The Classifier: A Decision Tree

The image below depicts a decision tree created from the UCI mushroom dataset. It is taken from [here](http://gieseanw.wordpress.com/2012/03/03/decision-tree-learning/).

* a white box represents an *internal node* (and the label represents the *attribute* being tested)
* a blue box represents an attribute value (an *outcome* of the *test* of that attribute)
* a green box represents a *leaf node* with a *class label* of *edible*
* a red box represents a *leaf node* with a *class label* of *poisonous*

<img src="http://gieseanw.files.wordpress.com/2012/03/mushroomdecisiontree.png" style="width: 800px;" />

The UCI mushroom dataset consists entirely of [categorical variables](https://en.wikipedia.org/wiki/Categorical_variable), i.e., every variable (or *attribute*) has an **enumerated set of possible values**. 

Our decision tree will only accommodate *categorical variables*. It is based on [this](https://github.com/chrisspen/dtree) codebase, which itself is based on an article (since deleted) by Chris Roach.

This algorithm will be our introduction to the 3 basic steps of Machine Learning:
* ***Create*** a decision tree using a set of *training* instances
* ***Classify*** (predict class labels) for a set of *test* instances (!= *training* instances) using a simple decision tree
* ***Evaluate*** the performance of a simple decision tree on classifying a set of test instances



### Non-parametric supervised learning

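Decision Trees (DTs) are a **non-parametric** *supervised* learning method used for classification and regression: they assume no fixed functional form for the data, and the tree simply grows with the training set. There are hyper-parameters (such as the maximum depth of the tree), but a basic tree works without tuning any of them. That is why ML neophytes *love* decision trees.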

A decision node has two or more branches. A **leaf node** represents a classification or ***decision***. An **inner** node represents an intermediate (non-terminal) classification, or ***splitting***. The topmost decision node in a tree, which corresponds to the *best* overall predictor, is called the **root node**. 

<br />
<center>
<img src = ipynb.images/dectree.png width = 400 />
</center>

**Splitting** is the process of partitioning the data set into subsets. Splits are formed on a particular variable.

<br />
<center>
<img src = ipynb.images/splitting.png width = 600 />
</center>

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The `ID3` algorithm uses **entropy** to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if a two-class sample is equally divided, it has an entropy of one.

The **information gain** is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding an attribute that returns the highest information gain (i.e., the most homogeneous branches).

### Entropy

When building a supervised classification model, the frequency distribution of attribute values is potentially a factor in determining the relative importance of each attribute in the model building process. That distribution is exactly the dictionary we built up in our Probability counting framework.

This distribution can be used to compute [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)), a measure of **disorder in a dataset**. We compute entropy by multiplying the proportion of instances of each class label by the $\log$ of that proportion, and then taking the negative sum of those terms.

For a 2-class (binary) classification task:

$\text{entropy}(S) = - p_1 \log_2 (p_1) - p_2 \log_2 (p_2)$

where $p_i$ is the proportion (relative frequency) of class *i* within the set *S*.
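To get a feel for the formula, here is a small sketch that evaluates it for a few class proportions. The function name `binary_entropy` is made up for illustration, and a term with zero proportion is treated as contributing zero, which is the usual convention.

```python
import math


def binary_entropy(p1):
    """Entropy in bits of a two-class set where class 1 has proportion p1."""
    total = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0:  # by convention, 0 * log2(0) counts as 0
            total -= p * math.log2(p)
    return total


for p1 in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p1, round(binary_entropy(p1), 3))
```

A pure set (all one class) has zero entropy, an evenly split set has an entropy of one bit, and everything else falls in between.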

Why entropy is a good metric for splitting involves a bit of math, which we skip. If you want to see the math, look [here](https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md), then [here](https://www.unine.ch/files/live/sites/imi/files/shared/documents/papers/Gini_index_fulltext.pdf).

We know that the proportion of `clean_instances` that are labeled `'e'` (class `edible`) in the UCI dataset is $3488 \div 5644 = 0.618$, and the proportion labeled `'p'` (class `poisonous`) is $2156 \div 5644 = 0.382$. Thus the entropy of our mushroom dataset is:


In [15]:
import math

entropy = \
    - (3488 / 5644) * math.log(3488 / 5644, 2) \
    - (2156 / 5644) * math.log(2156 / 5644, 2)
print(entropy)

0.9594413373534086


Now download `simple_ml.py`, a Python helper module for decision trees, from blackboard. Create a folder called `pythoncode` in your `C:/Users/<Username>` folder, and copy the file to that folder. In that module, a function `entropy(instances)` computes the entropy of `instances`. You may assume the class label is in position 0.

*Note: We'll see in subsequent lectures that the class label in most datasets is the **last** rather than the **first** item on each row.*

In [16]:
import sys
sys.path.append('pythoncode/')

import simple_ml

# delete 'simple_ml.' in the function call below to test your function
print(simple_ml.entropy(clean_instances))

0.9594413373534086
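If you want to try writing the function yourself, one possible version looks like the sketch below. This is not necessarily how `simple_ml.entropy` is written; the name `entropy_sketch` and the `class_index` parameter are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy_sketch(instances, class_index=0):
    """Entropy in bits of the class labels found at class_index."""
    label_counts = Counter(instance[class_index] for instance in instances)
    n = len(instances)
    return -sum((count / n) * math.log2(count / n)
                for count in label_counts.values())


# tiny made-up dataset: two edible and two poisonous instances
toy = [['e', 'x'], ['e', 'f'], ['p', 'x'], ['p', 'f']]
print(entropy_sketch(toy))
```

Because it counts whatever labels it finds, the same function also works for more than two classes.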


### Information Gain

Informally, a decision tree is constructed from a set of instances using a recursive algorithm that: 

* Selects the *best* attribute 
* Splits the set into subsets based on the values of that attribute (each subset is composed of instances from the original set that have the same value for that attribute)
* Repeats the process on each of these subsets until a stopping condition is met (e.g., a subset has no instances or has instances which all have the same class label)

Entropy is a metric that can be used in selecting the best attribute for each split: the best attribute is the one resulting in the *largest decrease in entropy* for a set of instances.

*Information gain* measures the decrease in entropy that results from splitting a set of instances based on an attribute.

$IG(S, a) = \text{entropy}(S) - [p(s_1) \cdot \text{entropy}(s_1) + p(s_2) \cdot \text{entropy}(s_2) + \dots + p(s_n) \cdot \text{entropy}(s_n)]$

where 
* $n$ is the number of distinct values of attribute $a$
* $s_i$ is the subset of $S$ where all instances have the $i$th value of $a$
* $p(s_i)$ is the proportion of instances in $S$ that have the $i$th value of $a$
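The formula translates almost word for word into code. The sketch below is an illustration, not the `simple_ml` implementation: the names `entropy_of` and `information_gain_sketch` and the toy rows are made up.

```python
import math
from collections import Counter, defaultdict


def entropy_of(instances, class_index=0):
    """Entropy in bits of the class labels found at class_index."""
    counts = Counter(instance[class_index] for instance in instances)
    n = len(instances)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain_sketch(instances, attribute_index, class_index=0):
    """entropy(S) minus the size-weighted entropy of the subsets s_1..s_n."""
    subsets = defaultdict(list)
    for instance in instances:
        subsets[instance[attribute_index]].append(instance)
    n = len(instances)
    remainder = sum(len(s) / n * entropy_of(s, class_index)
                    for s in subsets.values())
    return entropy_of(instances, class_index) - remainder


# made-up rows: [class, odor, cap-shape]
toy = [['e', 'n', 'x'], ['e', 'n', 'f'], ['p', 'f', 'x'], ['p', 'f', 'f']]
print(information_gain_sketch(toy, 1))  # odor separates the classes perfectly
print(information_gain_sketch(toy, 2))  # cap-shape tells us nothing
```

In this toy set, splitting on odor removes all the disorder, while splitting on cap-shape leaves each subset as mixed as the original.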

We'll use `information_gain()` in `simple_ml` to print the information gain for each of the attributes in the mushroom dataset.

In [17]:
print('Information gain for the different mushroom attributes:', end='\n\n')
for i in range(1, len(attribute_names)):
    print('{:5.3f}  {:2} {}'.format(
        simple_ml.information_gain(clean_instances, i), i, attribute_names[i]))

Information gain for the different mushroom attributes:

0.017   1 cap-shape
0.005   2 cap-surface
0.195   3 cap-color
0.140   4 bruises?
0.860   5 odor
0.004   6 gill-attachment
0.058   7 gill-spacing
0.032   8 gill-size
0.213   9 gill-color
0.275  10 stalk-shape
0.097  11 stalk-root
0.425  12 stalk-surface-above-ring
0.409  13 stalk-surface-below-ring
0.306  14 stalk-color-above-ring
0.279  15 stalk-color-below-ring
0.000  16 veil-type
0.002  17 veil-color
0.012  18 ring-number
0.463  19 ring-type
0.583  20 spore-print-color
0.110  21 population
0.101  22 habitat


We can sort the attributes in decreasing order of information gain, which shows that `odor` is the best attribute for the first split in a decision tree that models the instances in this dataset.

#### *Exercise*: Sort the attributes in decreasing order of information gain

### Decision Tree

We will implement a modified version of the [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) algorithm for building a  decision tree.

    ID3 (Examples, Target_Attribute, Candidate_Attributes)
        Create a Root node for the tree
        If all examples have the same value of the Target_Attribute, 
            Return the single-node tree Root with label = that value 
        If the list of Candidate_Attributes is empty,
            Return the single node tree Root,
                with label = most common value of Target_Attribute in the examples.
        Otherwise Begin
            A ← The Attribute that best classifies examples (most information gain)
            Decision Tree attribute for Root = A.
            For each possible value, v_i, of A,
                Add a new tree branch below Root, corresponding to the test A = v_i.
                Let Examples(v_i) be the subset of examples that have the value v_i for A
                If Examples(v_i) is empty,
                    Below this new branch add a leaf node 
                        with label = most common target value in the examples
                Else 
                    Below this new branch add the subtree 
                        ID3 (Examples(v_i), Target_Attribute, Attributes – {A})
        End
        Return Root

In building a decision tree, we will need to split the instances based on the index of the *best* attribute, i.e., the attribute that offers the *highest information gain*. We will use separate utility functions to handle these subtasks. To simplify, we rely exclusively on attribute *indexes* rather than attribute *names*.

First, define a function, **`split_instances(instances, attribute_index)`**, to split a set of instances based on any attribute. This function will return a dictionary where each *key* is a distinct value of the specified `attribute_index`, and the *value* of each key is a list representing the subset of `instances` that have that `attribute_index` value.

Use a [**`defaultdict`**](http://docs.python.org/2/library/collections.html#defaultdict-objects), a specialized dictionary class in the [**`collections`**](http://docs.python.org/2/library/collections.html) module that automatically creates an appropriate default value for a new key. For example, a `defaultdict(int)` automatically initializes a new dictionary entry to 0 (zero). A `defaultdict(list)` automatically initializes a new dictionary entry to the empty list (`[]`).
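As a small illustration of both defaults, the sketch below groups a few words by length with `defaultdict(list)` and counts letters with `defaultdict(int)`. The sample strings and variable names are made up for this example and are unrelated to the mushroom data.

```python
from collections import defaultdict

# Hypothetical sample data, unrelated to the mushroom dataset
words = ['cap', 'gill', 'odor', 'ring']

# defaultdict(list): a missing key starts out as an empty list
words_by_length = defaultdict(list)
for word in words:
    words_by_length[len(word)].append(word)

# defaultdict(int): a missing key starts out as 0
letter_counts = defaultdict(int)
for letter in 'spore':
    letter_counts[letter] += 1

print(dict(words_by_length))
print(dict(letter_counts))
```

Neither loop needs to check whether a key already exists before appending or incrementing, which is the convenience `split_instances()` relies on.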

In [18]:
from collections import defaultdict

def split_instances(instances, attribute_index):
    '''Returns a dictionary that splits a list of instances 
        according to their values of a specified attribute index
    
    Each key of the dictionary is a distinct value of attribute_index,
    and the value for each key is a list representing 
       the subset of instances that have that value for the attribute
    '''
    partitions = defaultdict(list)
    for instance in instances:
        partitions[instance[attribute_index]].append(instance)
    return partitions

To test the function, we will partition `clean_instances` based on the `odor` attribute (index position 5) and print out the size (number of instances) in each partition rather than the lists of instances in each partition. These are all the possible odor categories:

In [19]:
partitions = split_instances(clean_instances, 5)
print([(partition, len(partitions[partition])) for partition in partitions])

[('p', 256), ('a', 400), ('l', 400), ('n', 2776), ('f', 1584), ('c', 192), ('m', 36)]


Now we have our first split, and we know we can split instances based on a particular attribute. We want to be able to choose the *best* attribute with which to split the instances at every node, where *best* is defined as the attribute that provides the *greatest information gain* if instances were split based on that attribute. 

We restrict candidate attributes so that we don't try to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate).

### Helper Functions

`choose_best_attribute_index(instances, candidate_attribute_indexes)` returns the index in the list of `candidate_attribute_indexes` that provides the highest information gain if `instances` are split based on that attribute index.

In [20]:
print('Best attribute index:', 
      simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names))))

Best attribute index: 5
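The implementation inside `simple_ml` is not shown here, so the following is only a minimal sketch of how an attribute could be chosen by information gain, assuming Shannon entropy over the class labels. The names `entropy`, `information_gain`, `best_attribute_index_sketch` and `toy_instances` are hypothetical and are not part of `simple_ml`.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(instances, class_index=0):
    '''Shannon entropy (in bits) of the class labels in instances'''
    counts = Counter(instance[class_index] for instance in instances)
    total = len(instances)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(instances, attribute_index, class_index=0):
    '''Entropy before the split minus the weighted entropy after the split'''
    partitions = defaultdict(list)
    for instance in instances:
        partitions[instance[attribute_index]].append(instance)
    remainder = sum(len(subset) / len(instances) * entropy(subset, class_index)
                    for subset in partitions.values())
    return entropy(instances, class_index) - remainder

def best_attribute_index_sketch(instances, candidate_attribute_indexes, class_index=0):
    '''Hypothetical stand-in: the candidate index with the highest information gain'''
    return max(candidate_attribute_indexes,
               key=lambda i: information_gain(instances, i, class_index))

# Toy data (made up): index 0 is the class, index 1 separates the classes
# perfectly, and index 2 carries no information about the class
toy_instances = [['e', 'a', 'x'], ['e', 'a', 'y'], ['p', 'f', 'x'], ['p', 'f', 'y']]
print(best_attribute_index_sketch(toy_instances, [1, 2]))
```

On the toy data, splitting on index 1 leaves two pure partitions (gain of 1 bit), while splitting on index 2 leaves two evenly mixed partitions (gain of 0), so index 1 is selected.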


A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree (it's still a probabilistic algorithm). We will need a function that determines the majority value for the class index among a set of instances. One way to do this is to use the [`Counter`](https://docs.python.org/2/library/collections.html#counter-objects) class.

In [21]:
from collections import Counter

class_counts = Counter()  # create an empty counter
for instance in clean_instances:
    class_counts[instance[0]] += 1
    
print('class_counts: {}\n  most_common(1): {}\n  most_common(1)[0][0]: {}'.format(
    class_counts,
    class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count
    class_counts.most_common(1)[0][0]))  # the most common value (1st element in that tuple)

class_counts: Counter({'e': 3488, 'p': 2156})
  most_common(1): [('e', 3488)]
  most_common(1)[0][0]: e


#### *Exercise*: Use a list comprehension to improve on the cell above
```python
class_counts = Counter([instance[0] for ...])
```

It is often useful to compute the number of unique values and/or the total number of values in a `Counter`.

The number of unique values is simply the number of dictionary entries.

The total number of values can be computed by taking the [**`sum()`**](https://docs.python.org/2/library/functions.html#sum) of all the counts (the *value* of each *key: value* pair ... or *key, value* tuple, if we use `Counter().most_common()`).

In [22]:
print('Number of unique values: {}'.format(len(class_counts)))
print('Total number of values:  {}'.format(sum([v 
                                                for k, v in class_counts.most_common()])))

Number of unique values: 2
Total number of values:  5644
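Because a `Counter` is a dictionary subclass, the same two quantities can also be read directly from its keys and values. A minimal sketch on a made-up list of labels (`toy_counts` is a hypothetical name):

```python
from collections import Counter

# Made-up class labels standing in for instance[0] values
toy_counts = Counter(['e', 'p', 'e', 'e'])

num_unique = len(toy_counts)       # number of distinct labels
total = sum(toy_counts.values())   # total number of labels counted

print(num_unique, total)
```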


#### *Aside*

Note that Python has a flexible mechanism for testing truth values: in an **if** condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as *False* (i.e., *not True*):

In [23]:
for x in [False, None, 0, 0.0, "", [], {}, ()]:
    print('"{}" is'.format(x), end=' ')
    if x:
        print(True)
    else:
        print(False)

"False" is False
"None" is False
"0" is False
"0.0" is False
"" is False
"[]" is False
"{}" is False
"()" is False


Sometimes, particularly with function parameters, it is helpful to differentiate `None` from empty lists and other data structures with a `False` truth value (one common use case is illustrated in `create_decision_tree()` below).
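A minimal sketch of that pattern, using a hypothetical helper named `resolve_candidates` (not part of the lecture code): only a missing argument triggers the default, while an explicitly empty list is passed through unchanged.

```python
def resolve_candidates(instance_length, candidate_indexes=None, class_index=0):
    '''Hypothetical helper: fill in default candidates only when none were given'''
    if candidate_indexes is None:
        # No argument supplied: use every index except the class label
        return [i for i in range(instance_length) if i != class_index]
    # An explicit list, even an empty one, is returned unchanged
    return candidate_indexes

print(resolve_candidates(4))      # default candidates
print(resolve_candidates(4, []))  # the empty list is respected
```

Testing with `if not candidate_indexes:` instead would treat the empty list like a missing argument and silently replace it with the defaults.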

In [24]:
for x in [False, None, 0, 0.0, "", [], {}, ()]:
    print('"{} is None" is'.format(x), end=' ')
    if x is None:
        print(True)
    else:
        print(False)

"False is None" is False
"None is None" is True
"0 is None" is False
"0.0 is None" is False
" is None" is False
"[] is None" is False
"{} is None" is False
"() is None" is False


In [25]:
for x in [False, None, 0, 0.0, "", [], {}, ()]:
    print('"{}" is {}'.format(x, True if x else False)) 

"False" is False
"None" is False
"0" is False
"0.0" is False
"" is False
"[]" is False
"{}" is False
"()" is False


`majority_value(instances, class_index)` returns the most frequently occurring value of `class_index` in `instances`. The `class_index` parameter should be optional, and have a default value of `0` (zero).


In [26]:
print('Majority value of index {}: {}'.format(
    0, simple_ml.majority_value(clean_instances))) 

# although there is only one class_index for the dataset, 
# we'll test the function by specifying other indexes using optional / keyword arguments
print('Majority value of index {}: {}'.format(
    1, simple_ml.majority_value(clean_instances, 1)))  # using argument order
print('Majority value of index {}: {}'.format(
    2, simple_ml.majority_value(clean_instances, class_index=2)))  # using keyword argument

Majority value of index 0: e
Majority value of index 1: x
Majority value of index 2: y
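The body of `simple_ml.majority_value()` is not shown here. One way such a function could be written with `Counter` is sketched below; `majority_value_sketch` and `toy_instances` are hypothetical names, and the behavior on ties (whichever value `most_common()` lists first) is an assumption of this sketch.

```python
from collections import Counter

def majority_value_sketch(instances, class_index=0):
    '''Hypothetical stand-in: most frequent value found at class_index'''
    counts = Counter(instance[class_index] for instance in instances)
    return counts.most_common(1)[0][0]

# Made-up instances: the class label is at index 0
toy_instances = [['e', 'x'], ['p', 'x'], ['e', 'b'], ['e', 'x']]
print(majority_value_sketch(toy_instances))                 # by default, index 0
print(majority_value_sketch(toy_instances, class_index=1))  # any other index
```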


### Building a Simple Decision Tree

The recursive `create_decision_tree()` function below uses an optional parameter, `class_index`, which defaults to `0`. This is to accommodate other datasets in which the class label is the ***last*** element on each line (which would be most easily specified by using a `-1` value). Most data files in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) have class labels as either the first element or the last element.

To show how the decision tree is being built, an optional `trace` parameter, when non-zero, will generate trace information as the tree is constructed. The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator), `trace + 1 if trace else 0`.

In [27]:
def create_decision_tree(instances, 
                         candidate_attribute_indexes=None, 
                         class_index=0, 
                         default_class=None, 
                         trace=0):
    '''Returns a new decision tree trained on a list of instances.
    
    The tree is constructed by recursively selecting and splitting instances based on 
    the highest information_gain of the candidate_attribute_indexes.
    
    The class label is found in position class_index.
    
    The default_class is the majority value for the current node's parent in the tree.
    A positive (int) trace value will generate trace information 
        with increasing levels of indentation.
    
    Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python 
        by Christopher Roach,
    http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3
    '''
    
    # if no candidate_attribute_indexes are provided, 
    # assume that we will use all but the target_attribute_index
    # Note that None != [], 
    # as an empty candidate_attribute_indexes list is a recursion stopping condition
    if candidate_attribute_indexes is None:
        candidate_attribute_indexes = [i 
                                       for i in range(len(instances[0])) 
                                       if i != class_index]
        # Note: do not use candidate_attribute_indexes.remove(class_index)
        # as this would destructively modify the argument,
        # causing problems during recursive calls
        
    class_labels_and_counts = Counter([instance[class_index] for instance in instances])

    # If the dataset is empty or the candidate attributes list is empty, 
    # return the default value
    if not instances or not candidate_attribute_indexes:
        if trace:
            print('{}Using default class {}'.format('< ' * trace, default_class))
        return default_class
    
    # If all the instances have the same class label, return that class label
    elif len(class_labels_and_counts) == 1:
        class_label = class_labels_and_counts.most_common(1)[0][0]
        if trace:
            print('{}All {} instances have label {}'.format(
                '< ' * trace, len(instances), class_label))
        return class_label
    else:
        default_class = simple_ml.majority_value(instances, class_index)

        # Choose the next best attribute index to best classify the instances
        best_index = simple_ml.choose_best_attribute_index(
            instances, candidate_attribute_indexes, class_index)        
        if trace:
            print('{}Creating tree node for attribute index {}'.format(
                    '> ' * trace, best_index))

        # Create a new decision tree node with the best attribute index 
        # and an empty dictionary object (for now)
        tree = {best_index:{}}

        # Create a new decision tree sub-node (branch) for each of the values 
        # in the best attribute field
        partitions = simple_ml.split_instances(instances, best_index)

        # Remove that attribute from the set of candidates for further splits
        remaining_candidate_attribute_indexes = [i 
                                                 for i in candidate_attribute_indexes 
                                                 if i != best_index]
        for attribute_value in partitions:
            if trace:
                print('{}Creating subtree for value {} ({}, {}, {}, {})'.format(
                    '> ' * trace,
                    attribute_value, 
                    len(partitions[attribute_value]), 
                    len(remaining_candidate_attribute_indexes), 
                    class_index, 
                    default_class))
                
            # Create a subtree for each value of the best attribute
            subtree = create_decision_tree(
                partitions[attribute_value],
                remaining_candidate_attribute_indexes,
                class_index,
                default_class,
                trace + 1 if trace else 0)

            # Add the new subtree to the empty dictionary object 
            # in the new tree/node we just created
            tree[best_index][attribute_value] = subtree

    return tree

# split instances into separate training and testing sets
training_instances = clean_instances[:-20]
test_instances = clean_instances[-20:]
tree = create_decision_tree(training_instances, trace=1)  # remove trace=1 to turn off tracing
print(tree)

> Creating tree node for attribute index 5
> Creating subtree for value p (256, 21, 0, e)
< < All 256 instances have label p
> Creating subtree for value a (400, 21, 0, e)
< < All 400 instances have label e
> Creating subtree for value l (400, 21, 0, e)
< < All 400 instances have label e
> Creating subtree for value n (2764, 21, 0, e)
> > Creating tree node for attribute index 20
> > Creating subtree for value n (1296, 20, 0, e)
< < < All 1296 instances have label e
> > Creating subtree for value k (1296, 20, 0, e)
< < < All 1296 instances have label e
> > Creating subtree for value r (72, 20, 0, e)
< < < All 72 instances have label p
> > Creating subtree for value w (100, 20, 0, e)
> > > Creating tree node for attribute index 21
> > > Creating subtree for value v (60, 19, 0, e)
< < < < All 60 instances have label e
> > > Creating subtree for value c (16, 19, 0, e)
< < < < All 16 instances have label p
> > > Creating subtree for value y (24, 19, 0, e)
< < < < All 24 instances have labe

The structure of the tree shown above is difficult to discern from the normal printed representation of a dictionary.

Python's [**`pprint`**](http://docs.python.org/2/library/pprint.html) module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.

The [**`pprint.pprint(object, stream=None, indent=1, width=80, depth=None)`**](http://docs.python.org/2/library/pprint.html#pprint.pprint) method will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http://docs.python.org/2/library/sys.html#sys.stdout), the same destination as `print` function output), using `indent` spaces to differentiate nesting levels, using up to a maximum `width` columns and up to a maximum nesting level `depth` (`None` indicating no maximum).

In [28]:
from pprint import pprint

pprint(tree)

{5: {'a': 'e',
     'c': 'p',
     'f': 'p',
     'l': 'e',
     'm': 'p',
     'n': {20: {'k': 'e',
                'n': 'e',
                'r': 'p',
                'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},
     'p': 'p'}}
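Another way to read the nested dictionary is to flatten it into one rule per leaf. The sketch below assumes the same representation (a node is a dictionary with a single attribute index as its key, and a leaf is a bare class label); `tree_to_rules` and `sample_tree` are hypothetical names, and the sample is a hand-copied fragment of the tree printed above.

```python
def tree_to_rules(tree, path=()):
    '''Hypothetical helper: list of (conditions, label) pairs, one per leaf'''
    if not isinstance(tree, dict):
        # A leaf: the conditions collected so far lead to this label
        return [(path, tree)]
    attribute_index = next(iter(tree))  # each node has exactly one key
    rules = []
    for value, subtree in tree[attribute_index].items():
        rules.extend(tree_to_rules(subtree, path + ((attribute_index, value),)))
    return rules

# Hand-copied fragment of the tree above, for illustration only
sample_tree = {5: {'a': 'e', 'n': {20: {'k': 'e', 'r': 'p'}}, 'p': 'p'}}

rules = tree_to_rules(sample_tree)
for conditions, label in rules:
    tests = ' and '.join('attribute {} == {!r}'.format(i, v) for i, v in conditions)
    print('if {} then {}'.format(tests, label))
```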


### Classifying Instances with a Simple Decision Tree

Usually, when we construct a decision tree based on a set of *training* instances, we do so with the intent of using that tree to classify a set of one or more *test* instances.

We define a function, **`classify(tree, instance, default_class=None)`**, to use a decision tree to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.

We will use a design pattern in which we will use a series of `if` statements, each of which returns a value if the condition is true, rather than a nested series of `if`, `elif` and/or `else` clauses, as it helps constrain the levels of indentation in the function.

In [29]:
def classify(tree, instance, default_class=None):
    '''Returns a classification label for instance, given a decision tree'''
    if not tree:  # if the node is empty, return the default class
        return default_class
    if not isinstance(tree, dict):  # if the node is a leaf, return its class label
        return tree
    attribute_index = list(tree.keys())[0]  # using list(dict.keys()) for Python 3 compatibility
    attribute_values = list(tree.values())[0]
    instance_attribute_value = instance[attribute_index]
    if instance_attribute_value not in attribute_values:  # this value was not in training data
        return default_class
    # recursively traverse the subtree (branch) associated with instance_attribute_value
    return classify(attribute_values[instance_attribute_value], instance, default_class)

for instance in test_instances:
    predicted_label = classify(tree, instance)
    actual_label = instance[0]
    print('predicted: {}; actual: {}'.format(predicted_label, actual_label))

predicted: p; actual: p
predicted: p; actual: p
predicted: p; actual: p
predicted: e; actual: e
predicted: e; actual: e
predicted: p; actual: p
predicted: e; actual: e
predicted: e; actual: e
predicted: e; actual: e
predicted: p; actual: p
predicted: e; actual: e
predicted: e; actual: e
predicted: e; actual: e
predicted: p; actual: p
predicted: e; actual: e
predicted: e; actual: e
predicted: e; actual: e
predicted: e; actual: e
predicted: p; actual: p
predicted: p; actual: p


### Evaluating the Accuracy of a Simple Decision Tree

It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.

The advantage of this training/test split is that visual inspection of the classifications (sometimes called *predictions*) is relatively straightforward, revealing that all 20 instances were correctly classified.

There are a variety of metrics that can be used to evaluate the performance of a model. Scikit Learn's [Model Evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html) library provides an overview and implementation of several possible metrics. For now, we simply measure the *accuracy* of a model, i.e., the percentage of test instances that are correctly classified (*true positives* and *true negatives*).

The accuracy of the model above, given the set of 20 test instances, is 100% (20/20).

The function below calculates the classification accuracy of a `tree` over a set of `test_instances` (with an optional `class_index` parameter indicating the position of the class label in each instance).

In [30]:
def classification_accuracy(tree, test_instances, class_index=0, default_class=None):
    '''Returns the accuracy of classifying test_instances with tree, 
    where the class label is in position class_index'''
    num_correct = 0
    for i in range(len(test_instances)):
        prediction = classify(tree, test_instances[i], default_class)
        actual_value = test_instances[i][class_index]
        if prediction == actual_value:
            num_correct += 1
    return num_correct / len(test_instances)

print(classification_accuracy(tree, test_instances))

1.0
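Accuracy folds *true positives* and *true negatives* into a single number. As a rough sketch of how the four underlying counts could be tallied, assuming every prediction is one of exactly two labels and treating `'p'` (poisonous) as the positive class, consider the hypothetical helper below; the label lists are made up.

```python
from collections import Counter

def confusion_counts(predicted_labels, actual_labels, positive_label='p'):
    '''Hypothetical helper: counts of TP, TN, FP and FN for two-class labels'''
    counts = Counter()
    for predicted, actual in zip(predicted_labels, actual_labels):
        if predicted == actual:
            counts['TP' if actual == positive_label else 'TN'] += 1
        else:
            counts['FP' if predicted == positive_label else 'FN'] += 1
    return counts

# Made-up predictions and actual labels
predicted = ['p', 'e', 'p', 'e']
actual = ['p', 'e', 'e', 'p']

counts = confusion_counts(predicted, actual)
accuracy = (counts['TP'] + counts['TN']) / len(actual)
print(dict(counts), accuracy)
```

For mushrooms the two kinds of error are not equally costly: a false negative here means a poisonous mushroom labeled edible.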


In addition to showing the percentage of correctly classified instances, it may be helpful to return the actual counts of correctly and incorrectly classified instances, e.g., if we want to compile a total count of correctly and incorrectly classified instances over a collection of test instances.

We use the [**`zip([iterable, ...])`**](http://docs.python.org/2.7/library/functions.html#zip) function, which combines 2 or more sequences or iterables; the function yields tuples, where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables. In Python 2 the result is a list; in Python 3 it is an iterator, which is why the example wraps it in `list()`.

In [31]:
list(zip([0, 1, 2], ['a', 'b', 'c']))

[(0, 'a'), (1, 'b'), (2, 'c')]

We use list comprehensions, the `Counter` class and the `zip()` function to modify `classification_accuracy()` so that it returns a packed tuple with: 

* the percentage of instances correctly classified
* the number of correctly classified instances
* the number of incorrectly classified instances

We also modify the function to use `instances` rather than `test_instances`, as we sometimes want to be able to evaluate the accuracy of a model when tested on the training instances used to create it.

In [32]:
def classification_accuracy(tree, instances, class_index=0, default_class=None):
    '''Returns a tuple (accuracy, number correct, number incorrect)
    from classifying instances with tree, 
    where the class label is in position class_index'''
    predicted_labels = [classify(tree, instance, default_class) 
                        for instance in instances]
    actual_labels = [x[class_index] 
                     for x in instances]
    counts = Counter([x == y 
                      for x, y in zip(predicted_labels, actual_labels)])
    return counts[True] / len(instances), counts[True], counts[False]

print(classification_accuracy(tree, test_instances))

(1.0, 20, 0)


We sometimes want to partition instances into subsets of equal size to measure performance. One metric this partitioning allows us to compute is a [learning curve](https://en.wikipedia.org/wiki/Learning_curve), i.e., assess how well the model performs based on the size of its training set. Another use of these partitions (aka *folds*) would be to conduct an [*n-fold cross validation*](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) evaluation.

The following function, **`partition_instances(instances, num_partitions)`**, partitions a set of `instances` into `num_partitions` relatively equal-sized subsets.

In [33]:
def partition_instances(instances, num_partitions):
    '''Returns a list of relatively equally sized disjoint sublists (partitions) 
    of the list of instances'''
    return [[instances[j] 
             for j in range(i, len(instances), num_partitions)]
            for i in range(num_partitions)]

Before testing this function on the 5644 `clean_instances` from the UCI mushroom dataset, we create a small number of simplified instances to verify that the function has the desired behavior.

In [34]:
instance_length = 3
num_instances = 5

simplified_instances = [[j 
                         for j in range(i, instance_length + i)] 
                        for i in range(num_instances)]

print('Instances:', simplified_instances)
partitions = partition_instances(simplified_instances, 2)
print('Partitions:', partitions)

Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]


The [**`enumerate(sequence, start=0)`**](http://docs.python.org/2.7/library/functions.html#enumerate) function creates an iterator that successively returns the index and value of each element in a `sequence`, beginning at the `start` index.

In [35]:
for i, x in enumerate(['a', 'b', 'c']):
    print(i, x)

0 a
1 b
2 c


We can use `enumerate()` to facilitate slightly more rigorous testing of our `partition_instances` function on our `simplified_instances`.

Note that since we are printing values rather than accumulating values, we will not use nested list comprehensions for this task.

In [36]:
for i in range(num_instances):
    print('\n# partitions:', i)
    for j, partition in enumerate(partition_instances(simplified_instances, i)):
        print('partition {}: {}'.format(j, partition))


# partitions: 0

# partitions: 1
partition 0: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

# partitions: 2
partition 0: [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
partition 1: [[1, 2, 3], [3, 4, 5]]

# partitions: 3
partition 0: [[0, 1, 2], [3, 4, 5]]
partition 1: [[1, 2, 3], [4, 5, 6]]
partition 2: [[2, 3, 4]]

# partitions: 4
partition 0: [[0, 1, 2], [4, 5, 6]]
partition 1: [[1, 2, 3]]
partition 2: [[2, 3, 4]]
partition 3: [[3, 4, 5]]


Returning our attention to the UCI mushroom dataset, the following will partition our `clean_instances` into 10 relatively equally sized disjoint subsets. We will use a list comprehension to print out the length of each partition.

In [37]:
partitions = partition_instances(clean_instances, 10)
print([len(partition) for partition in partitions])

[565, 565, 565, 565, 564, 564, 564, 564, 564, 564]
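The same partitions can serve as folds for n-fold cross validation: each fold takes one turn as the test set while the remaining folds are merged into the training set. A minimal sketch, with `partition_instances()` repeated from above so the block runs on its own, a hypothetical generator named `cross_validation_splits`, and plain integers standing in for instances:

```python
def partition_instances(instances, num_partitions):
    '''Same as above: relatively equally sized disjoint sublists'''
    return [[instances[j]
             for j in range(i, len(instances), num_partitions)]
            for i in range(num_partitions)]

def cross_validation_splits(instances, num_folds):
    '''Hypothetical helper: yields one (training, test) pair per fold'''
    folds = partition_instances(instances, num_folds)
    for k in range(num_folds):
        test = folds[k]
        # Merge every fold except the k-th into the training set
        training = [instance
                    for j, fold in enumerate(folds) if j != k
                    for instance in fold]
        yield training, test

# Integers stand in for instances, to make the splits easy to inspect
splits = list(cross_validation_splits(list(range(10)), 3))
for training, test in splits:
    print('train on {} instances, test on {}'.format(len(training), test))
```

Averaging an accuracy measure over all folds would then give an estimate that does not depend on which single partition happened to be held out.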


The following shows the different trees that are constructed based on partition 0 (first 10th) of `clean_instances`, partitions 0 and 1 (first 2/10ths) of `clean_instances` and all `clean_instances`.

In [38]:
tree0 = create_decision_tree(partitions[0])
print('Tree trained with {} instances:'.format(len(partitions[0])))
pprint(tree0)
print()

tree1 = create_decision_tree(partitions[0] + partitions[1])
print('Tree trained with {} instances:'.format(len(partitions[0] + partitions[1])))
pprint(tree1)
print()

tree = create_decision_tree(clean_instances)
print('Tree trained with {} instances:'.format(len(clean_instances)))
pprint(tree)

Tree trained with 565 instances:
{5: {'a': 'e',
     'c': 'p',
     'f': 'p',
     'l': 'e',
     'm': 'p',
     'n': {20: {'k': 'e', 'n': 'e', 'r': 'p', 'w': 'e'}},
     'p': 'p'}}

Tree trained with 1130 instances:
{5: {'a': 'e',
     'c': 'p',
     'f': 'p',
     'l': 'e',
     'm': 'p',
     'n': {20: {'k': 'e',
                'n': 'e',
                'r': 'p',
                'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},
     'p': 'p'}}

Tree trained with 5644 instances:
{5: {'a': 'e',
     'c': 'p',
     'f': 'p',
     'l': 'e',
     'm': 'p',
     'n': {20: {'k': 'e',
                'n': 'e',
                'r': 'p',
                'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},
     'p': 'p'}}


The only difference between the first two trees - *tree0* and *tree1* - is that in the first tree, instances with no `odor` (attribute index `5` is `'n'`) and a `spore-print-color` of white (attribute `20` = `'w'`) are classified as `edible` (`'e'`). With additional training data in the 2nd partition, an additional distinction is made such that instances with no `odor`, a white `spore-print-color` and a clustered `population` (attribute `21` = `'c'`) are classified as `poisonous` (`'p'`), while all other instances with no `odor` and a white `spore-print-color` (and any other value for the `population` attribute) are classified as `edible` (`'e'`).

Note that there is no difference between `tree1` and `tree` (the tree trained with all instances). This early convergence on an optimal model is uncommon on most datasets (outside the UCI repository).

### Learning curves

Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.

We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.

[**`list.extend(L)`**](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) enables us to extend `list` by appending all the items in another list, `L`, to the end of `list`.

In [39]:
x = [1, 2, 3]
x.extend([4, 5])
print(x)

[1, 2, 3, 4, 5]


We now define the function, **`compute_learning_curve(instances, num_partitions=10)`**, which takes a list of `instances` and partitions it into `num_partitions` relatively equally sized disjoint partitions. It then iteratively evaluates the accuracy of models trained with an incrementally increasing combination of instances from the first `num_partitions - 1` partitions and tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. 

The function will return a list of `num_partitions - 1` tuples, each pairing the size of a training set with the result of `classification_accuracy()` for a tree trained with that set and tested on the last partition (index `num_partitions - 1`). This will provide some indication of the relative impact of the size of the training set on model performance.

In [40]:
def compute_learning_curve(instances, num_partitions=10):
    '''Returns a list of training sizes and scores for incrementally increasing partitions.

    The list contains 2-element tuples, each representing a training size and score.
    The i-th training size is the number of instances in partitions 0 through i.
    The i-th score is the result of classification_accuracy() for a tree trained 
    with instances from partitions 0 through i
    and tested on instances from partition num_partitions - 1 (the last partition).'''
    
    partitions = partition_instances(instances, num_partitions)
    test_instances = partitions[-1][:]
    training_instances = []
    accuracy_list = []
    for i in range(0, num_partitions - 1):
        # after the extend() below, the training set is composed of partitions 0 through i
        training_instances.extend(partitions[i][:])
        tree = create_decision_tree(training_instances)
        partition_accuracy = classification_accuracy(tree, test_instances)
        accuracy_list.append((len(training_instances), partition_accuracy))
    return accuracy_list

accuracy_list = compute_learning_curve(clean_instances)
print(accuracy_list)

[(565, (0.9964539007092199, 562, 2)), (1130, (1.0, 564, 0)), (1695, (1.0, 564, 0)), (2260, (1.0, 564, 0)), (2824, (1.0, 564, 0)), (3388, (1.0, 564, 0)), (3952, (1.0, 564, 0)), (4516, (1.0, 564, 0)), (5080, (1.0, 564, 0))]


The UCI mushroom dataset is a particularly clean and simple data set, enabling quick convergence on an optimal decision tree for classifying new instances using relatively few training instances. 

We can use a larger number of smaller partitions to see a little more variation in accuracy performance.

In [41]:
accuracy_list = compute_learning_curve(clean_instances, 100)
print(accuracy_list[:10])

[(57, (0.9821428571428571, 55, 1)), (114, (1.0, 56, 0)), (171, (0.9821428571428571, 55, 1)), (228, (1.0, 56, 0)), (285, (1.0, 56, 0)), (342, (1.0, 56, 0)), (399, (1.0, 56, 0)), (456, (1.0, 56, 0)), (513, (1.0, 56, 0)), (570, (1.0, 56, 0))]
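Each entry pairs a training size with the `(accuracy, correct, incorrect)` tuple returned by `classification_accuracy()`, so the two series of a learning curve can be pulled apart with list comprehensions. The sketch below hard-codes the first three entries of the output above rather than recomputing them; the variable names are chosen for this example.

```python
# First three entries copied from the output above
accuracy_list = [(57, (0.9821428571428571, 55, 1)),
                 (114, (1.0, 56, 0)),
                 (171, (0.9821428571428571, 55, 1))]

# x-axis of the learning curve: training set sizes
training_sizes = [size for size, _ in accuracy_list]

# y-axis of the learning curve: accuracy is the first element of each result
accuracies = [result[0] for _, result in accuracy_list]

# Smallest training size in this excerpt that reached perfect accuracy
first_perfect = next(size for size, result in accuracy_list if result[0] == 1.0)

print(training_sizes)
print(accuracies)
print(first_perfect)
```

The two lists can be handed to any plotting library as x and y values.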


### Python Class to Encapsulate a Simple Decision Tree

The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.

Note that other machine learning libraries may use different terminology for some of the functions defined above. For example, in the [`sklearn.tree.DecisionTreeClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class (and in most `sklearn` classifier classes), the method for constructing a classifier is named [`fit()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) - since it "fits" the data to a model - and the method for classifying instances is named [`predict()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) - since it is predicting the class label for an instance.

In keeping with this common terminology, the code below defines a class, **`SimpleDecisionTree`**, with a single pseudo-protected member variable `_tree`, three public methods - `fit()`, `predict()` and `pprint()` - and two auxiliary methods - `_create_tree()` and `_predict()` - to augment the `fit()` and `predict()` methods, respectively. 

The `_create_tree()` method is identical to the `create_decision_tree()` function above, minus the tracing and with the inclusion of the `self` parameter (as it is now a class method rather than a function); the `fit()` method fills in the default candidate attribute indexes, calls `_create_tree()` and stores the result in `_tree`. The `predict()` method is a similarly modified version of the `classify()` function, with the added capability to predict the label of either a single instance or a list of instances. The `classification_accuracy()` method is similar to the function of the same name (with the addition of the `self` parameter). The `pprint()` method prints the tree in a human-readable format.



In [42]:
class SimpleDecisionTree:

    _tree = {}  # class-level default; fit() rebinds self._tree on the instance

    def __init__(self):
        # this is where we would initialize any parameters to the SimpleDecisionTree
        pass
            
    def fit(self, 
            instances, 
            candidate_attribute_indexes=None,
            target_attribute_index=0,
            default_class=None):
        if not candidate_attribute_indexes:
            candidate_attribute_indexes = [i 
                                           for i in range(len(instances[0]))
                                           if i != target_attribute_index]
        self._tree = self._create_tree(instances,
                                       candidate_attribute_indexes,
                                       target_attribute_index,
                                       default_class)
        
    def _create_tree(self,
                     instances,
                     candidate_attribute_indexes,
                     target_attribute_index=0,
                     default_class=None):
        class_labels_and_counts = Counter([instance[target_attribute_index] 
                                           for instance in instances])
        if not instances or not candidate_attribute_indexes:
            return default_class
        elif len(class_labels_and_counts) == 1:
            class_label = class_labels_and_counts.most_common(1)[0][0]
            return class_label
        else:
            default_class = simple_ml.majority_value(instances, target_attribute_index)
            best_index = simple_ml.choose_best_attribute_index(instances, 
                                                               candidate_attribute_indexes, 
                                                               target_attribute_index)
            tree = {best_index:{}}
            partitions = simple_ml.split_instances(instances, best_index)
            remaining_candidate_attribute_indexes = [i 
                                                     for i in candidate_attribute_indexes 
                                                     if i != best_index]
            for attribute_value in partitions:
                subtree = self._create_tree(
                    partitions[attribute_value],
                    remaining_candidate_attribute_indexes,
                    target_attribute_index,
                    default_class)
                tree[best_index][attribute_value] = subtree
            return tree
    
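    def predict(self, instances, default_class=None):
        # a single instance is a flat sequence of attribute values;
        # a batch is a sequence whose elements are themselves lists or tuples
        if instances and not isinstance(instances[0], (list, tuple)):
            return self._predict(self._tree, instances, default_class)
        else:
            return [self._predict(self._tree, instance, default_class) 
                    for instance in instances]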
    
    def _predict(self, tree, instance, default_class=None):
        if not tree:
            return default_class
        if not isinstance(tree, dict):
            return tree
        attribute_index = list(tree.keys())[0]  # using list(dict.keys()) for Py3 compatibility
        attribute_values = list(tree.values())[0]
        instance_attribute_value = instance[attribute_index]
        if instance_attribute_value not in attribute_values:
            return default_class
        return self._predict(attribute_values[instance_attribute_value],
                             instance,
                             default_class)
    
    def classification_accuracy(self, instances, default_class=None):
        predicted_labels = self.predict(instances, default_class)
        # the actual label is assumed to sit at index 0 (the default target_attribute_index)
        actual_labels = [x[0] for x in instances]
        counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])
        # returns (accuracy, number correct, number incorrect)
        return counts[True] / len(instances), counts[True], counts[False]
    
    def pprint(self):
        pprint(self._tree)

The following statements instantiate a `SimpleDecisionTree`, fit it using all but the last 20 `clean_instances`, print out the tree using its `pprint()` method, and then use the `predict()` method to print the classification of the last 20 `clean_instances`.

In [43]:
simple_decision_tree = SimpleDecisionTree()
simple_decision_tree.fit(training_instances)
simple_decision_tree.pprint()
print()

predicted_labels = simple_decision_tree.predict(test_instances)
actual_labels = [instance[0] for instance in test_instances]
for predicted_label, actual_label in zip(predicted_labels, actual_labels):
    print('Model: {}; truth: {}'.format(predicted_label, actual_label))
print()
print('Classification accuracy:', simple_decision_tree.classification_accuracy(test_instances))

{5: {'a': 'e',
     'c': 'p',
     'f': 'p',
     'l': 'e',
     'm': 'p',
     'n': {20: {'k': 'e',
                'n': 'e',
                'r': 'p',
                'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},
     'p': 'p'}}

Model: p; truth: p
Model: p; truth: p
Model: p; truth: p
Model: e; truth: e
Model: e; truth: e
Model: p; truth: p
Model: e; truth: e
Model: e; truth: e
Model: e; truth: e
Model: p; truth: p
Model: e; truth: e
Model: e; truth: e
Model: e; truth: e
Model: p; truth: p
Model: e; truth: e
Model: e; truth: e
Model: e; truth: e
Model: e; truth: e
Model: p; truth: p
Model: p; truth: p

Classification accuracy: (1.0, 20, 0)
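
To see what `_predict()` does with this tree, here is a minimal sketch that walks the nested dictionary and records each test it performs along the way. The function name `trace_prediction` and the sample instance are hypothetical, made up for illustration. The tree is copied from the output above.

```python
# The tree printed above: {attribute_index: {attribute_value: subtree_or_label}}
tree = {5: {'a': 'e', 'c': 'p', 'f': 'p', 'l': 'e', 'm': 'p',
            'n': {20: {'k': 'e', 'n': 'e', 'r': 'p',
                       'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},
            'p': 'p'}}


def trace_prediction(tree, instance, default_class=None):
    """Return (label, path); path lists the (attribute_index, value) tests taken."""
    path = []
    node = tree
    # keep descending while we are at an internal (non-empty dict) node
    while isinstance(node, dict) and node:
        attribute_index = next(iter(node))      # the single key of this node
        value = instance[attribute_index]
        path.append((attribute_index, value))
        branches = node[attribute_index]
        if value not in branches:               # value never seen during training
            return default_class, path
        node = branches[value]
    # an empty node or a missing leaf falls back to the default class
    return (node if node else default_class), path


# Hypothetical instance: only attributes 5, 20 and 21 matter for this tree
instance = ['?'] * 23
instance[5], instance[20], instance[21] = 'n', 'w', 'v'
label, path = trace_prediction(tree, instance)
print(label, path)
```

Each step looks up one attribute of the instance and follows the matching branch, so the length of the path is the depth at which the instance was classified.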


## 4. Conclusion

[Scikit-Learn](http://scikit-learn.org/) has more full-featured decision tree building functions, as well as other types of machine learning algorithms. We'll use scikit-learn's decision forest building algorithm in our lab.

#### The ID3 and C4.5 Algorithms
[Ross Quinlan](https://en.wikipedia.org/wiki/Ross_Quinlan) invented the ID3 algorithm that uses entropy and information gain to recursively create decision trees. This algorithm had some weaknesses, such as the inability to handle numerical attributes or attributes with missing values. His extension to ID3, C4.5, addressed those weaknesses.
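
As a compact reminder of the two quantities ID3 relies on, here is a minimal sketch of entropy and information gain for categorical attributes. The function names and the four-row toy dataset are assumptions for illustration, and this is not necessarily how the helper functions used earlier are written.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(instances, attribute_index, target_attribute_index=0):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    labels = [instance[target_attribute_index] for instance in instances]
    # group the labels by the value each instance has for the chosen attribute
    partitions = {}
    for instance in instances:
        partitions.setdefault(instance[attribute_index], []).append(
            instance[target_attribute_index])
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in partitions.values())
    return entropy(labels) - remainder


# Made-up toy data: [label, attribute 1, attribute 2]
toy = [['e', 'a', 'w'], ['e', 'a', 'n'], ['p', 'f', 'w'], ['p', 'f', 'n']]
print(information_gain(toy, 1))  # attribute 1 separates the classes perfectly
print(information_gain(toy, 2))  # attribute 2 tells us nothing about the class
```

ID3 picks the attribute with the highest gain at each node, which in this toy example is attribute 1.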

#### Overfit
Overfitting means that your prediction model is fitted too closely to your training data, noise included. Think about the hypothetical case where a node in our mushroom decision tree has 1,000 examples that are poisonous and only 1 that is edible. Should we really split the tree again here, or would it be a better idea to just assume that all mushrooms at this node are poisonous? Perhaps that 1 edible mushroom was simply mislabeled by our scientists. Or perhaps there is some attribute of the mushrooms that the scientists did not take into account and that might have split the examples better earlier on? It’s a good idea to assume that there is noise in our examples. If we don’t, we may build a tree that is too affected by this noise and thus overfits our training examples. New examples could then have lower classification accuracy.
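
One way to act on the 1,000-to-1 example is to stop splitting as soon as a node is almost pure. The sketch below is only an illustration: the function name `should_stop_splitting` and the threshold `min_purity=0.99` are assumptions, not something our `SimpleDecisionTree` implements.

```python
from collections import Counter


def should_stop_splitting(labels, min_purity=0.99):
    """Return (stop, majority_label); stop is True once the majority share reaches min_purity."""
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels) >= min_purity, majority_label


# the node from the text: 1,000 poisonous mushrooms and a single edible one
labels = ['p'] * 1000 + ['e']
print(should_stop_splitting(labels))
```

With this threshold the node becomes a leaf labeled poisonous, and the lone edible example is treated as noise.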

To alleviate the problem of overfit, it is customary to [prune](https://en.wikipedia.org/wiki/Pruning_(decision_trees)) decision trees, that is, to remove sections of the tree that provide little power to classify instances.

<br />
<center>
<img src = ipynb.images/pruned.jpeg width = 600 />
</center>
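
As a rough sketch of the mechanics, the function below applies the most conservative form of pruning to our nested-dictionary trees: it collapses any subtree whose leaves all carry the same label, since splitting there adds no classification power. The function name and the sample tree are hypothetical. Practical pruning methods go further and use held-out data to decide which subtrees to remove.

```python
def collapse_uniform_subtrees(tree):
    """Replace every subtree whose leaves all share one label by that label."""
    if not isinstance(tree, dict) or not tree:
        return tree                              # a leaf (or an empty tree): nothing to prune
    attribute_index = next(iter(tree))
    # prune bottom-up: first simplify every branch below this node
    pruned_branches = {value: collapse_uniform_subtrees(subtree)
                       for value, subtree in tree[attribute_index].items()}
    branch_values = list(pruned_branches.values())
    all_leaves = all(not isinstance(branch, dict) for branch in branch_values)
    if all_leaves and len(set(branch_values)) == 1:
        return branch_values[0]                  # every branch agrees: keep just the label
    return {attribute_index: pruned_branches}


# Hypothetical tree in which the split on attribute 20 is redundant
tree = {5: {'a': 'e', 'n': {20: {'k': 'e', 'n': 'e'}}, 'p': 'p'}}
pruned = collapse_uniform_subtrees(tree)
print(pruned)
```

Note that the pruned tree can answer differently for attribute values it never saw in training: where the original tree would fall back to the default class, the collapsed leaf returns its label.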

#### MLB

How could you apply this ML algorithm to problems we tackled in class? Well, for one, you could just extensively compile *hundreds* of MLB statistics, put them all in a spreadsheet, and let *your laptop tell you which statistics are the most important* by having it build a decision tree (the most important factors are closer to the root of the tree). Then you can model these parameters, or simply have the decision tree predict winners and losers based on new data.
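
Reading importance off a tree can be sketched in a few lines. The hypothetical helper below, `attribute_depths`, records the shallowest depth at which each attribute is tested in a nested-dictionary tree like the ones we built. It is shown on the mushroom tree from above, since we have no MLB tree yet.

```python
def attribute_depths(tree, depth=0, depths=None):
    """Map each attribute index to the shallowest depth at which the tree tests it."""
    if depths is None:
        depths = {}
    if isinstance(tree, dict):
        for attribute_index, branches in tree.items():
            depths[attribute_index] = min(depth, depths.get(attribute_index, depth))
            for subtree in branches.values():
                attribute_depths(subtree, depth + 1, depths)
    return depths


# the mushroom tree printed earlier
tree = {5: {'a': 'e', 'c': 'p', 'f': 'p', 'l': 'e', 'm': 'p',
            'n': {20: {'k': 'e', 'n': 'e', 'r': 'p',
                       'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},
            'p': 'p'}}

# attributes closest to the root come first
ranked = sorted(attribute_depths(tree).items(), key=lambda item: item[1])
print(ranked)  # [(5, 0), (20, 1), (21, 2)]
```

For the mushrooms, attribute 5 sits at the root and is therefore the first thing to look at. With a table of baseball statistics, the same ranking would tell you which columns to model first.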

Don't you hate me for telling you only now? :-)
