---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Pandas (self-study)

### 🔗 **Link**: https://bit.ly/WA_LEC7_NONLINEAR

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

# 1. 🌳 Learning a Decision Tree

In this notebook, we'll learn about decision trees and how to use them to perform classification and regression. Let's start by importing the libraries we need and loading the dataset.

## 1.1 Getting the data

In [None]:
import os
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(style='ticks', palette='Set2')


url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original"
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
                'model', 'origin', 'car_name']

mpg_df = pd.read_csv(url,
                     sep=r"\s+",  # columns are separated by runs of whitespace
                     header=None,
                     names=column_names).dropna()

mpg_df.head()

## 1.2 Turning the regression problem into a classification problem
Instead of predicting the exact mileage of a car, we'll predict whether it gets high or low mileage. This turns our regression problem into a classification problem.

In [None]:
mpg_df["mpg"].hist()

Somewhat arbitrarily, let's say that cars with an MPG greater than the median get good miles per gallon.

In [None]:
# find the median
median_mpg = mpg_df["mpg"].median()

# print the median
print("the median MPG is: %s" % median_mpg)

# A simple way to transform every value in a pandas column:

# 1. we define a function with the operation we want to perform on one value
def is_high_mpg(mpg):
    return 1 if mpg > median_mpg else 0

# 2. we then apply this function to every row of a column ("mpg")
#    and we save the output as a new column ("is_high_mpg")
mpg_df["is_high_mpg"] = mpg_df["mpg"].apply(is_high_mpg)

# let's see the result
mpg_df.iloc[:20]
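
The same labels can also be produced without a helper function, by comparing the whole column against the median in one step. The sketch below is only an illustration: it uses a small made-up Series in place of `mpg_df["mpg"]`, and the names `mpg`, `labels_apply` and `labels_vectorized` are hypothetical.

```python
import pandas as pd

# Made-up MPG values standing in for mpg_df["mpg"]
mpg = pd.Series([18.0, 15.0, 36.0, 24.0, 31.0])
median_mpg = mpg.median()

# Element-wise version: call a function once per value
labels_apply = mpg.apply(lambda value: 1 if value > median_mpg else 0)

# Vectorized version: compare the whole column at once,
# then cast the resulting booleans to 0/1
labels_vectorized = (mpg > median_mpg).astype(int)

print(labels_vectorized.tolist())
```

Both versions give the same labels. The vectorized form is usually faster on large columns, while `apply` is handy when the per-row logic is too complex to express as a single comparison.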

Our algorithm's task is now to predict whether a car gets high or low MPG or, in other words, which class a car belongs to.

We use `1` to designate the high-MPG class and `0` for the low-MPG class.

## 1.3 What's a decision tree?
The best way to learn is to read Chapter 3 from "Data Science for Business" by Provost and Fawcett. Nevertheless, we'll try to give you a brief overview of decision trees here.

A **decision tree**, as the name suggests, can be visualized as a tree structure. It is one of the most commonly used machine learning algorithms and is useful for both classification (categorizing items) and regression (predicting numerical values).
- For our purposes, we'll only look at the classification case.

### Components of a Decision Tree:

- **Root Node**: This represents the entire dataset. The root node splits into child nodes based on a decision rule.
- **Decision Nodes**: Each node represents a question about one feature. For each possible answer to that question, the tree directs us to a child node.
- **Leaf Nodes**: These nodes have no children, but instead contain the outcome or the final decision (classification or value). They do not split any further.

### How does a Decision Tree work?
When we're fitting our data to produce a decision tree, the algorithm goes through the following steps:
- **Selection of attribute**: The algorithm looks for the most informative feature according to a chosen criterion (such as entropy or Gini impurity). By informative, we mean that asking a question about this feature should split the dataset well into the different classes.
- **Decision Making**: Once an attribute is selected, the dataset is split based on the values of this attribute. This process is performed recursively, producing a tree-like model of decisions.
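
To make "informative" a bit more concrete, here is a minimal sketch of the entropy criterion and of the information gain of a candidate split. It is an illustration under simplifying assumptions, not how scikit-learn implements it internally: the helper names `entropy` and `information_gain` and the example labels are made up for this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    children_entropy = weight_left * entropy(left) + weight_right * entropy(right)
    return entropy(parent) - children_entropy

# Hypothetical labels: 1 = high MPG, 0 = low MPG
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# A split that separates the two classes perfectly...
perfect_gain = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
# ...versus a split whose children are just as mixed as the parent
useless_gain = information_gain(parent, [1, 1, 0, 0], [1, 1, 0, 0])

print(perfect_gain, useless_gain)
```

The perfect split removes all of the parent's uncertainty (a gain of 1 bit), while the useless split removes none. At each node, the algorithm prefers the question with the highest gain.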

### A simple example
Let's consider a simple decision-making scenario: deciding whether we should play tennis based on weather conditions.
- The root node may ask: "Is it raining?"
    - If 'Yes', the tree might conclude, 'Do not play tennis.' (This becomes a leaf node).
    - If 'No', it may lead to a decision node: "Is the humidity high?"
        - If 'Yes', the tree might conclude, 'Do not play tennis.'
        - If 'No', the tree would then conclude, 'Play tennis.'
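
The tennis tree above can be written directly as code. In this sketch, the function name `should_play_tennis` and its two boolean arguments are hypothetical, chosen only to mirror the questions in the example.

```python
def should_play_tennis(is_raining, humidity_is_high):
    # Root node: "Is it raining?"
    if is_raining:
        return "Do not play tennis"  # leaf node
    # Decision node: "Is the humidity high?"
    if humidity_is_high:
        return "Do not play tennis"  # leaf node
    return "Play tennis"             # leaf node

print(should_play_tennis(is_raining=False, humidity_is_high=False))
```

Every path from the root to a leaf corresponds to one combination of answers, and every input ends up in exactly one leaf.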




## 1.4 Building a decision tree

We'll use scikit-learn's `DecisionTreeClassifier`.

In [None]:
from sklearn.tree import DecisionTreeClassifier

predictor_cols = ["weight", "acceleration", "cylinders", "displacement"]

# Let's define the model (tree)
# max_depth limits how deep the tree can grow; criterion sets the measure used to choose splits
decision_tree = DecisionTreeClassifier(max_depth=3, criterion="entropy")

# Fit the model to the data: the features first, then the labels
decision_tree.fit(mpg_df[predictor_cols], mpg_df["is_high_mpg"])

We now have a fitted classification tree, so let's visualize it! The helper below writes the image to a directory called "images" next to this notebook, and creates that directory if it does not exist.

You also need the `dot` command-line tool, which ships with Graphviz. If you don't have it, you can install it with
- `brew install graphviz` on a Mac.
- `sudo apt-get install graphviz` on Linux.
- `conda install graphviz`, or by [downloading it](https://graphviz.gitlab.io/_pages/Download/Download_windows.html), on Windows.

In [None]:
import os
import subprocess

from IPython.display import Image
from sklearn.tree import export_graphviz

def visualize_tree(decision_tree, feature_names, class_names, directory="./images", name="tree", proportion=True, cleanup=True):

    # Create the output directory if it does not exist yet
    os.makedirs(directory, exist_ok=True)

    # Export our decision tree to graphviz format
    dot_name = os.path.join(directory, f"{name}.dot")
    export_graphviz(decision_tree,
                    out_file=dot_name,
                    feature_names=feature_names,
                    class_names=class_names,
                    proportion=proportion)

    # Call graphviz to make an image file from our decision tree
    image_name = os.path.join(directory, f"{name}.png")

    try:
        subprocess.check_call(['dot', '-Tpng', dot_name, '-o', image_name])
    except (subprocess.CalledProcessError, FileNotFoundError):
        # FileNotFoundError is raised when the "dot" executable is not installed
        print("Could not run dot, i.e. graphviz, to produce the visualization")
        return None
    finally:
        # Cleanup .dot file
        if cleanup:
            os.remove(dot_name)

    # Return the .png image so we can see it
    return Image(filename=image_name)

visualize_tree(decision_tree, predictor_cols, ["n", "y"])


Check out the decision tree.
- The first question it "asks" of any data point is if the displacement is <= 190.5.
  - If the answer is Yes, then it goes to the left child, and asks if the weight is <= 2278.5
  - If the answer is No, then it goes to the right child, and asks if the acceleration is <= 15.9

... and so on.

In other words, the tree is a series of nested "if" statements, much like the "Guess Who?" game or [Akinator](https://en.akinator.com/theme-selection).
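
One way to see these nested "if" statements as text is scikit-learn's `export_text`, which prints a fitted tree as indented rules and does not require Graphviz. The sketch below fits a tiny tree on made-up cars rather than on `mpg_df`, so the numbers and the names `X`, `y`, `tree` and `rules` are purely illustrative.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up cars: [weight, cylinders], with 1 = high MPG and 0 = low MPG
X = [[1500, 4], [2000, 4], [3500, 8], [4000, 8]]
y = [1, 1, 0, 0]

# A depth-1 tree asks a single question before reaching a leaf
tree = DecisionTreeClassifier(max_depth=1, criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the fitted tree as indented if/else rules
rules = export_text(tree, feature_names=["weight", "cylinders"])
print(rules)
```

Each level of indentation in the printed rules corresponds to one level of nesting in the equivalent "if" statements.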

---
# 2. ❓ Are our predictions good?


How good is our model? Let's start by getting the model's prediction for every car and placing it next to the true label.

In [None]:
# get some predictions from the model
preds = decision_tree.predict(mpg_df[predictor_cols])

predictions_mpg_df = mpg_df.assign(predictions=preds)

predictions_mpg_df.head(50)

To measure how good the predictions are, we'll use "accuracy": the fraction of cars for which we correctly predicted whether the car was high MPG or not.

It's easy to compute the accuracy yourself, or you can use scikit-learn's `accuracy_score` function.

In [None]:
from sklearn import metrics
# accuracy_score expects the true labels first, then the predictions
accuracy = metrics.accuracy_score(predictions_mpg_df["is_high_mpg"], predictions_mpg_df["predictions"])
print("Accuracy = %.3f" % accuracy)
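
Computing the accuracy by hand amounts to counting how often the prediction matches the true label and dividing by the number of cars. A minimal sketch, using made-up labels instead of the columns of `predictions_mpg_df`:

```python
# Made-up true labels and predictions (1 = high MPG, 0 = low MPG)
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the positions where the prediction matches the true label
correct = sum(1 for truth, pred in zip(true_labels, predictions) if truth == pred)

# Accuracy is the share of correct predictions
accuracy = correct / len(true_labels)
print("Accuracy = %.3f" % accuracy)
```

Here 6 of the 8 predictions match, so the accuracy is 0.75.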

Is this good? Should we be satisfied with this level of accuracy?

We'll find out how to answer this question in our next class :)