# Supervised Learning: How to Build a Classification Tree in Python

## When to use a classification tree

- The response variable (the thing you are trying to predict) is a logical (true or false) or categorical variable.
- You need easy interpretability of the results.
- You want a computationally cheap model.
- You want to convert continuous features into categorical features as a feature engineering step for a deep learning model.

## Do you actually want an ensemble of classification trees?

Classification trees are very simple models which often have weak predictive power. In most cases, you need to increase their power and robustness by using an ensemble of trees. That means:

- a bagging method like random forests, or
- a boosting method like gradient boosting.
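As a rough sketch of how these options look in scikit-learn, the snippet below builds a single tree next to one bagging-style and one boosting-style ensemble. The helper name `candidate_models` and the `random_state` value are illustrative assumptions, and every estimator is left at its default settings rather than tuned.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def candidate_models(random_state=0):
    """Return a single tree and two tree ensembles, all with default settings.

    `candidate_models` is a hypothetical helper for this sketch, not a
    scikit-learn function.
    """
    return {
        "single tree": DecisionTreeClassifier(random_state=random_state),
        "random forest (bagging)": RandomForestClassifier(random_state=random_state),
        "gradient boosting": GradientBoostingClassifier(random_state=random_state),
    }
```

All three estimators share the same `.fit()` and `.predict()` interface, so the rest of this workflow would apply unchanged if you swapped one in for the single tree.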

## Which Python packages can you use?

- scikit-learn (used here)
- PyCaret

## Case study: determining the variety of raisin

Automated food detection is useful in food production, food safety, and dietary monitoring. 

The process traditionally has two steps:

1. Convert images of food into numeric features, like dimensions, shape, color.
2. Run a classification model on those features.

This raisin dataset, sourced from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Raisin+Dataset), contains the output of step 1 for two varieties of Turkish raisin.

We'll need **pandas** for importing the data and doing some manipulation, then **scikit-learn** for modeling, and **matplotlib** for plotting.

The dataset is in a CSV file named `"raisins.csv"`.
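A minimal sketch of the import step might look like the following. The helper name `load_raisins` is hypothetical, and the column check assumes the file's header matches the names listed in the data dictionary in the next section.

```python
import pandas as pd

# Column names as listed in this tutorial's data dictionary; the header of
# the actual file is assumed to match them.
EXPECTED_COLUMNS = [
    "Area", "MajorAxisLength", "MinorAxisLength", "Eccentricity",
    "ConvexArea", "Extent", "Perimeter", "Variety",
]


def load_raisins(path="raisins.csv"):
    """Read the raisin CSV (one row per raisin) and check its columns."""
    raisins = pd.read_csv(path)
    missing = set(EXPECTED_COLUMNS) - set(raisins.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return raisins
```

Calling `raisins = load_raisins()` and then `raisins.head()` is a quick way to confirm that the file was read as expected.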

## Data dictionary

Each row represents one raisin. Some columns are easier to interpret if you think of the raisin as being an ellipse.

![](ellipse.png)

- **Area**: Area of the raisin in pixels. Approximately `pi * a * b`.
- **MajorAxisLength**: The length of the longest diameter of the raisin in pixels. Equal to `2 * b`.
- **MinorAxisLength**: The length of the shortest diameter of the raisin in pixels. Equal to `2 * a`.
- **Eccentricity**: How close to circular is the raisin? Equal to `sqrt(1 - (2 * a) ** 2 / (2 * b) ** 2)`, which is 0 for a circle and approaches 1 as the raisin becomes more elongated.
- **ConvexArea**: Area of the smallest convex shape enclosing the raisin, in pixels. Approximately `pi * a * b`, and slightly more than Area.
- **Extent**: Fraction of a rectangle drawn around the raisin that contains the raisin image.
- **Perimeter**: Perimeter of the raisin in pixels. Approximately `pi * (3 * (a + b) - sqrt((3 * a + b) * (a + 3 * b)))` (Ramanujan's approximation for the perimeter of an ellipse).
- **Variety**: The variety of raisin. Either **Kecimen** (sour black grape) or **Besni** (pale grape).
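To make the formulas above concrete, here is a small sketch that evaluates them for an idealized ellipse. It assumes, as the formulas imply, that `a` and `b` are half the minor and major axis lengths; the helper name `ellipse_features` is hypothetical. ConvexArea and Extent are left out because they depend on the actual raisin outline rather than on `a` and `b` alone.

```python
import math


def ellipse_features(a, b):
    """Idealized features of an ellipse with semi-minor axis a and semi-major axis b."""
    if not 0 < a <= b:
        raise ValueError("expected 0 < a <= b")
    return {
        "Area": math.pi * a * b,
        "MajorAxisLength": 2 * b,
        "MinorAxisLength": 2 * a,
        "Eccentricity": math.sqrt(1 - (2 * a) ** 2 / (2 * b) ** 2),
        # Ramanujan's approximation for the perimeter of an ellipse.
        "Perimeter": math.pi * (3 * (a + b) - math.sqrt((3 * a + b) * (a + 3 * b))),
    }


# Sanity check: a circle of radius 1 has eccentricity 0, area pi,
# and perimeter 2 * pi.
circle = ellipse_features(1, 1)
```

Real raisins are only roughly elliptical, so the values in the dataset will not satisfy these relationships exactly.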

## Splitting into response and explanatory columns

The response column is `"Variety"`. The explanatory (input) columns are all the other columns.
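One way to do this with pandas is sketched below. The helper name `split_response` and the variable names `X` and `y` are conventions assumed for this example, not something the dataset requires.

```python
import pandas as pd


def split_response(raisins, response="Variety"):
    """Return (X, y): the explanatory columns and the response column."""
    X = raisins.drop(columns=response)  # everything except the response
    y = raisins[response]               # the column to predict
    return X, y
```

With the data in a DataFrame called `raisins` (an assumed name), `X, y = split_response(raisins)` gives a DataFrame of the seven numeric features and a Series of varieties.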

## Splitting into training and testing sets

The explanatory and response datasets need to be split into training and testing sets. 

Here we'll use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with the default arguments.
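A minimal sketch of the split is shown here. With the defaults, scikit-learn shuffles the rows and holds out 25% of them for testing. The wrapper name `split_train_test` is hypothetical, and the optional `random_state` argument is an addition to the defaults that only makes the shuffle reproducible.

```python
from sklearn.model_selection import train_test_split


def split_train_test(X, y, random_state=None):
    """Split X and y into training and testing sets with the default sizes."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    return X_train, X_test, y_train, y_test
```

The four returned objects keep their original row indexes, so each `y_train` value still lines up with its row in `X_train`.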

## Fitting the model to the training set

The data is now ready to model. The first modeling step is to create a [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) object.

Use the [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) method to fit the model to the training set.
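The two steps can be sketched as follows. The helper name `fit_tree` and the argument names `X_train` and `y_train` are assumptions for this example; all tree hyperparameters are left at their defaults, and `random_state` is only there to make ties between equally good splits reproducible.

```python
from sklearn.tree import DecisionTreeClassifier


def fit_tree(X_train, y_train, random_state=None):
    """Create a classification tree and fit it to the training set."""
    model = DecisionTreeClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    return model
```

After fitting, `model.classes_` lists the varieties the tree learned and `model.get_depth()` reports how deep the tree grew.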

## Making predictions on the testing set

You can calculate the predicted response with the [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) method.
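As a sketch, the predictions can be wrapped in a pandas Series so that they stay aligned with the rows of the testing set. The helper name `predict_variety` and the Series name `predicted_variety` are hypothetical; `model` is assumed to be an already fitted classifier.

```python
import pandas as pd


def predict_variety(model, X_test):
    """Predict the variety for each row of X_test, keeping the row index."""
    predictions = model.predict(X_test)  # NumPy array of class labels
    return pd.Series(predictions, index=X_test.index, name="predicted_variety")
```

Keeping the index makes it easy to put predicted and actual varieties side by side when checking individual raisins.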

## Assessing model performance

There are four possible outcomes, depending on whether the actual variety and the predicted variety are each Besni or Kecimen. The confusion matrix, created with [`confusion_matrix()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html), shows the count of each case. In the table below, Kecimen is treated as the positive class.

|                    | **predicted Besni** | **predicted Kecimen** |
|:-------------------|:--------------------|:----------------------|
| **actual Besni**   | correct             | false positive        |
| **actual Kecimen** | false negative      | correct               |
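The layout in the table can be reproduced with a small sketch like this one, which labels the raw counts so the rows and columns are easy to read. The helper name `labeled_confusion_matrix` is hypothetical; passing `labels` explicitly fixes the order to Besni then Kecimen, matching the table.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def labeled_confusion_matrix(y_true, y_pred, labels=("Besni", "Kecimen")):
    """Return the confusion matrix as a DataFrame with readable labels.

    Rows are the actual varieties and columns are the predicted varieties.
    """
    counts = confusion_matrix(y_true, y_pred, labels=list(labels))
    return pd.DataFrame(
        counts,
        index=[f"actual {label}" for label in labels],
        columns=[f"predicted {label}" for label in labels],
    )
```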

[`accuracy_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) computes accuracy, a commonly used measure of model performance.

**Accuracy**: What fraction of the values were correctly predicted? (Sum of diagonal divided by sum of all values.)
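The sketch below checks that definition against `accuracy_score()`. The labels are made up purely to illustrate the arithmetic and are not results from the raisin model; the helper name `accuracy_from_confusion_matrix` is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix


def accuracy_from_confusion_matrix(counts):
    """Sum of the diagonal divided by the sum of all values."""
    counts = np.asarray(counts)
    return np.trace(counts) / counts.sum()


# Made-up labels for illustration only (not real model output).
y_true = ["Besni", "Besni", "Kecimen", "Kecimen", "Kecimen"]
y_pred = ["Besni", "Kecimen", "Kecimen", "Kecimen", "Besni"]

manual = accuracy_from_confusion_matrix(confusion_matrix(y_true, y_pred))
library = accuracy_score(y_true, y_pred)
print(manual, library)  # both are 3 correct out of 5, i.e. 0.6
```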

Visualizing the tree helps you see how the decisions are made. Using [`plot_tree()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html) is the simplest way to do this.
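A plotting sketch along these lines is shown below. The helper name `draw_tree`, the figure size, and the choice to draw only the top levels (`max_depth=2`) are assumptions made to keep the picture readable; `model` is assumed to be an already fitted `DecisionTreeClassifier`.

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree


def draw_tree(model, feature_names, max_depth=2):
    """Draw the top levels of a fitted classification tree."""
    fig, ax = plt.subplots(figsize=(12, 6))
    plot_tree(
        model,
        feature_names=list(feature_names),
        class_names=[str(c) for c in model.classes_],
        filled=True,          # shade each node by its majority class
        max_depth=max_depth,  # only the drawing is truncated, not the model
        ax=ax,
    )
    return fig
```

Each node in the drawing shows the splitting rule, the impurity, the number of training samples that reached it, and the majority class.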

## Want to learn more?

- The scikit-learn tutorial on [Understanding the decision tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html) explains in more detail how to interpret the resulting tree.
- Several DataCamp courses cover decision trees with scikit-learn. Start with [Machine Learning with Tree-Based Models in Python](https://app.datacamp.com/learn/courses/machine-learning-with-tree-based-models-in-python), then consider industry-specific variants like [Machine Learning for Finance in Python](https://app.datacamp.com/learn/courses/machine-learning-for-finance-in-python) and [Machine Learning for Marketing in Python](https://app.datacamp.com/learn/courses/machine-learning-for-marketing-in-python).
- Try applying your skills with this [Decision Tree Classification](https://app.datacamp.com/workspace/templates/playbook-python-decision-tree-marketing-data) Workspace template.