# Building a Decision Tree

## Introduction to the Data

In the last mission, we used a data set on U.S. income from the 1994 census; we'll continue using it here. It contains information on marital status, age, type of work, and more. The target column, `high_income`, indicates a salary less than or equal to 50k per year (`0`), or more than 50k per year (`1`).

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

## Overview of the ID3 Algorithm

In the last mission, we learned about the basics of decision trees, including entropy and information gain. In this mission, we'll build on those concepts to construct a full decision tree in Python and use it to make predictions.

We'll use the [ID3 Algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) for constructing decision trees to accomplish this. This algorithm relies on [recursion](https://en.wikipedia.org/wiki/Recursion_(computer_science)), and following it requires some understanding of time complexity. If you're unfamiliar with these topics, we suggest trying [our Data Structures and Algorithms course](https://www.dataquest.io/course/data-structures-algorithms). We also suggest learning about lambda functions through our [command line course](https://www.dataquest.io/mission/112/lambda-functions/).

In general, recursion is the process of splitting a large problem into smaller chunks of the same kind. A recursive function calls itself on each smaller chunk until it reaches a case simple enough to solve directly, then combines the results into a final output.
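To make the idea concrete, here is a minimal sketch of a recursive function. The name `nested_sum` and the nested-list input are made up for illustration and are not part of the income data set.

```python
def nested_sum(values):
    """Recursively add up every number in a possibly nested list."""
    total = 0
    for item in values:
        if isinstance(item, list):
            # A smaller chunk of the same problem: the function calls itself.
            total += nested_sum(item)
        else:
            # Base case: a plain number needs no further splitting.
            total += item
    return total


print(nested_sum([1, [2, 3], [4, [5, 6]]]))  # 21
```

Each call handles one level of nesting and hands anything deeper to another call, which is the same pattern we'll rely on when building the tree.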

Building a tree is a perfect use case for recursive algorithms. At each node, we'll call a recursive function that will split the data into two branches. Each branch will lead to a node, and the function will call itself to build the tree out.

We've created a pseudocode version of the full ID3 Algorithm below. Pseudocode is a plain-text outline of a piece of code that explains how it works. Exploring the pseudocode for an algorithm is a good way to understand it better before trying to code it.

```python
def id3(data, target, columns)
    1 Create a node for the tree
    2 If all values of the target attribute are 1, Return the node, with label = 1
    3 If all values of the target attribute are 0, Return the node, with label = 0
    4 Using information gain, find A, the column that splits the data best
    5 Find the median value in column A
    6 Split column A into values below or equal to the median (0), and values above the median (1)
    7 For each possible value (0 or 1), vi, of A,
    8    Add a new tree branch below the node that corresponds to rows of data where A = vi
    9    Let Examples(vi) be the subset of examples that have the value vi for A
   10    Below this new branch add the subtree id3(Examples(vi), target, columns)
   11 Return the node
```

We've made a minor modification to the algorithm so that it only creates two branches from each node. This will simplify the process of constructing the tree, and make it easier to demonstrate the principles it involves.
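The two-branch split at the median can be sketched as follows. This is only an illustration: the helper name `split_on_median`, the sample ages, and the choice to store rows as a list of dictionaries are all assumptions made for the example, not the data structures used later in the mission.

```python
from statistics import median


def split_on_median(rows, column):
    """Split rows (a list of dicts) into two branches at the median of `column`."""
    cutoff = median(row[column] for row in rows)
    # Branch 0: values below or equal to the median.
    left = [row for row in rows if row[column] <= cutoff]
    # Branch 1: values above the median.
    right = [row for row in rows if row[column] > cutoff]
    return cutoff, left, right


# Hypothetical sample rows, not taken from the census data.
rows = [{"age": 25}, {"age": 38}, {"age": 52}, {"age": 41}, {"age": 19}]
cutoff, left, right = split_on_median(rows, "age")
print(cutoff, len(left), len(right))  # 38 3 2
```

Because every row lands in exactly one of the two lists, each node never needs more than two branches, whatever the number of distinct values in the column.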

The recursive nature of the algorithm comes into play on line `10`. Every node in the tree will call the `id3()` function, and the final tree will be the result of all of these calls.
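As a rough preview of how the pseudocode could translate into Python, here is a compact sketch. It makes several assumptions that are not in the pseudocode: rows are stored as a list of dictionaries, nodes are plain dictionaries, the helper names (`entropy`, `split`, `information_gain`) are invented for this example, and a majority-label fallback is added for the case where the best column cannot separate the rows any further.

```python
import math
from statistics import median


def entropy(rows, target):
    """Shannon entropy (in bits) of a binary 0/1 target column."""
    result = 0.0
    for label in (0, 1):
        count = sum(1 for row in rows if row[target] == label)
        if count:
            p = count / len(rows)
            result -= p * math.log2(p)
    return result


def split(rows, column):
    """Split rows at the median of `column` (<= median goes left)."""
    cutoff = median(row[column] for row in rows)
    left = [row for row in rows if row[column] <= cutoff]
    right = [row for row in rows if row[column] > cutoff]
    return cutoff, left, right


def information_gain(rows, column, target):
    """Entropy before the split minus the weighted entropy after it."""
    _, left, right = split(rows, column)
    remainder = sum(
        len(part) / len(rows) * entropy(part, target)
        for part in (left, right)
        if part
    )
    return entropy(rows, target) - remainder


def id3(rows, target, columns):
    """Build a two-branch decision tree as nested dictionaries."""
    labels = {row[target] for row in rows}
    if len(labels) == 1:
        # Steps 2 and 3: every row has the same label, so this is a leaf.
        return {"label": labels.pop()}
    # Step 4: pick the column with the highest information gain.
    best = max(columns, key=lambda c: information_gain(rows, c, target))
    # Steps 5 and 6: split on that column's median.
    cutoff, left, right = split(rows, best)
    if not left or not right:
        # Assumption beyond the pseudocode: stop with the majority label
        # when the split leaves one side empty.
        ones = sum(row[target] for row in rows)
        return {"label": int(ones * 2 >= len(rows))}
    # Steps 7 to 10: recurse into each branch.
    return {
        "column": best,
        "median": cutoff,
        "left": id3(left, target, columns),
        "right": id3(right, target, columns),
    }


# Hypothetical rows for illustration only.
rows = [
    {"age": 25, "hours": 40, "high_income": 0},
    {"age": 30, "hours": 20, "high_income": 0},
    {"age": 45, "hours": 45, "high_income": 1},
    {"age": 50, "hours": 38, "high_income": 1},
]
tree = id3(rows, "high_income", ["age", "hours"])
print(tree)
```

On these four rows, splitting on `age` separates the two labels completely, so the sketch produces a root node with two leaves. The recursive calls in the final `return` statement play the role of line `10` in the pseudocode.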