# Will John play tennis?

In this first example we will use the example from the video and try to predict whether John will play tennis on a particular day. We will use scikit-learn to build the decision tree and visualize it with Pydotplus and Graphviz. Graphviz is a tool for drawing graphs described in DOT files. Pydotplus is a Python interface to Graphviz's Dot language.

First install Graphviz and Pydotplus:

1. Download and install Graphviz: https://graphviz.gitlab.io/_pages/Download/Download_windows.html


2. !pip install graphviz<br>
   !pip install pydotplus


3. Add the Graphviz installation path (C:...\graphviz\bin) to the Path variable via Control Panel > System and Security > System > Advanced System Settings > Environment Variables > Path > Edit > New


4. Very Important: Restart your Jupyter Notebook/machine.

In [None]:
!pip install graphviz
!pip install pydotplus
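
After the restart, it can be worth checking that the Graphviz `dot` executable is actually reachable from the notebook, since a missing Path entry is a common cause of errors in the visualization step. The following is a minimal sketch using only the Python standard library; the printed messages are illustrative.

```python
import shutil

# Look up the Graphviz "dot" executable on the current PATH.
dot_path = shutil.which("dot")

if dot_path is None:
    print("Graphviz 'dot' was not found on PATH; revisit step 3 and restart.")
else:
    print(f"Graphviz 'dot' found at: {dot_path}")
```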

## 1. Import packages and classes

The first step is to import the packages:

In [None]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

## 2. Provide the data

The second step is to define the data to work with. Below you will find the data from the example in Excel format. We will load the data, build the tree and, at the end, try to predict whether John will play tennis on a rainy day with high humidity and weak wind.

<img src="./resources/tennis.png" style="height: 300px"/>

Let's import the data using pandas:

In [None]:
# load dataset
tennis_df = pd.read_csv("resources/tennis1.csv", sep=';')
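
The `sep=';'` argument matters because `read_csv` splits on commas by default. The sketch below illustrates this on a small in-memory sample. The column names follow the notebook, but the two rows are made up for illustration and are not the contents of `tennis1.csv`.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout.
sample_csv = StringIO(
    "outlook;humidity;wind;play\n"
    "sunny;high;weak;no\n"
    "overcast;high;weak;yes\n"
)

# With sep=";" each field ends up in its own column.
sample_df = pd.read_csv(sample_csv, sep=";")
print(sample_df.columns.tolist())
print(sample_df.shape)
```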

We can then print the imported values:

In [None]:
print(tennis_df)

## 3. Split the data

Given the input features *outlook*, *humidity* and *wind*, we will predict whether John will *play* or not. Let's split the features and the target variable:

In [None]:
# split dataset in features and target variable

feature_cols = ['outlook', 'humidity', 'wind']

X = tennis_df[feature_cols]
y = tennis_df[['play']] # target variable

In [None]:
print(X)

In [None]:
print(y)
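
Note that selecting with double brackets, as in `tennis_df[['play']]`, returns a one-column DataFrame rather than a Series. `DecisionTreeClassifier` accepts both forms, but the difference shows up in the printed output and in the shape. A small sketch on a made-up frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature frame with the same 'play' column name.
demo_df = pd.DataFrame({"play": ["no", "yes", "yes"]})

y_frame = demo_df[["play"]]   # double brackets: DataFrame with one column
y_series = demo_df["play"]    # single brackets: one-dimensional Series

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```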

## 4. Train the classifier

The next step is to train the classifier (decision tree) with the data. As you know, training is always necessary for supervised learning algorithms.

In [None]:
clf = DecisionTreeClassifier()
clf = clf.fit(X, y)

When you run the code above, you should get the following error:

```
ValueError: could not convert string to float: 'sunny'
```

The decision trees implemented in scikit-learn accept only numerical features, and these features are always interpreted as continuous numeric variables. *sunny, overcast and rain* are categorical values, which scikit-learn's decision trees do not support directly.

The simplest way to solve this is to replace our categorical string values with numerical values. However, simply replacing strings with numbers should normally be avoided: because the result is treated as a continuous numerical feature, any coding you use will induce an order that simply does not exist in your data.

For example, coding ['sunny', 'overcast', 'rain'] as [1, 2, 3] would produce odd results such as 'sunny' being lower than 'overcast', and averaging a 'sunny' and a 'rain' would give an 'overcast'.
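
The artificial ordering can be made concrete with a few lines of plain Python. This sketch uses exactly the coding from the example above; the variable names are only illustrative.

```python
# The example coding from the text above.
codes = {"sunny": 1, "overcast": 2, "rain": 3}
labels = {code: name for name, code in codes.items()}

# The coding makes 'sunny' compare as smaller than 'overcast'.
print(codes["sunny"] < codes["overcast"])

# Averaging 'sunny' and 'rain' lands exactly on the code for 'overcast'.
mean_code = (codes["sunny"] + codes["rain"]) / 2
print(mean_code, labels[mean_code])
```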

But since it is the simplest solution, we are going to do it anyway. You can always search the internet for better solutions.
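
One commonly used alternative is one-hot encoding, which creates one indicator column per category and therefore implies no order. The following is a minimal sketch with pandas' `get_dummies` on a made-up column; it is not used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical sample; only the column name matches the tutorial's data.
demo = pd.DataFrame({"outlook": ["sunny", "overcast", "rain", "sunny"]})

# One indicator column per category, so no order between categories is implied.
one_hot = pd.get_dummies(demo, columns=["outlook"], dtype=int)
print(one_hot)
```

Each row has exactly one indicator set to 1, so a tree split on such a column asks a yes/no question about a single category.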

We will use the `category_encoders` package to replace the categories with numeric values, so let's install it first:

In [None]:
!pip install --upgrade category_encoders

In [None]:
print(X)

In [None]:
print(feature_cols)

In [None]:
import category_encoders as ce

ce_ord = ce.OrdinalEncoder(cols = feature_cols)
X_cat = ce_ord.fit_transform(X)

In [None]:
print(X_cat)
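
To make the printed codes easier to interpret: the ordinal encoder appears to number the categories of each column in order of first appearance, starting at 1. That is an assumption worth verifying against the output of `print(X_cat)`. Under that assumption, the same kind of encoding can be sketched with plain pandas on made-up values:

```python
import pandas as pd

# Hypothetical values; the real ones come from tennis1.csv.
demo = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain"],
    "wind": ["weak", "strong", "weak", "weak"],
})

# factorize() numbers categories from 0 in order of first appearance,
# so adding 1 gives codes that start at 1.
demo_cat = pd.DataFrame(
    {col: pd.factorize(demo[col])[0] + 1 for col in demo.columns},
    index=demo.index,
)
print(demo_cat)
```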

Now we can try to train the classifier (decision tree) with the data.

In [None]:
clf = DecisionTreeClassifier(criterion = "entropy")
clf = clf.fit(X_cat, y)

## 5. Visualize the decision tree

Now we can use the packages from above (graphviz and pydotplus) to visualize our decision tree.

In [None]:
from io import StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file = dot_data, filled = True, rounded = True,
                special_characters = True, feature_names = feature_cols, class_names=['no','yes'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png(), width = 550)
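
If Graphviz is not available, scikit-learn's `export_text` gives a plain-text view of a fitted tree. The sketch below uses a tiny, made-up, already-encoded dataset so that it runs on its own; for the tree in this notebook the corresponding call would presumably be `export_text(clf, feature_names=feature_cols)`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical, already-encoded toy data: one feature, two classes.
X_demo = [[1], [2], [3], [1]]
y_demo = ["no", "yes", "yes", "no"]

demo_clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
demo_clf.fit(X_demo, y_demo)

# Plain-text rendering of the fitted tree; no Graphviz required.
tree_text = export_text(demo_clf, feature_names=["outlook"])
print(tree_text)
```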

Remarks:
    
- The algorithm keeps adding levels until the subsets in the leaves are pure (entropy = 0).
- Since scikit-learn treats every feature as continuous, the conditions in the nodes take the form `feature <= threshold` with only two outcomes (true or false), even though the features in our example are categorical: *outlook is sunny, overcast or rain*.
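
The entropy value shown in each node can be reproduced by hand. The following sketch computes the Shannon entropy in bits from class counts; the counts are hypothetical and not taken from `tennis1.csv`.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing to the sum.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# Hypothetical class counts for two nodes.
mixed_node = entropy([9, 5])   # both classes present
pure_node = entropy([4, 0])    # only one class left

print(round(mixed_node, 3))
print(pure_node)
```

A node with a single class has entropy 0, which is the condition under which the tree stops splitting that branch.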

## 6. Make predictions - Exercise

We wanted to predict whether John would play tennis on a rainy day with high humidity and weak wind. Can you first convert the features into numeric values and then use the tree above to determine whether John will play tennis?

In [None]:
# answer?
# [outlook, humidity, wind] = [ , , ]
# Will John play tennis? YES / NO

You can check for yourself whether your prediction was right.

In [None]:
# [outlook, humidity, wind] as numeric codes; compare with the output of print(X_cat)
prediction = clf.predict([[3, 1, 1]])
print(prediction)
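
Instead of reading the codes off the printed tables by hand, the lookup table can be rebuilt by pairing each raw value with its encoded value. The sketch below uses made-up stand-ins for `X` and `X_cat`; in the notebook one would pass the real frames instead.

```python
import pandas as pd

# Hypothetical raw and encoded columns, standing in for X and X_cat.
raw = pd.DataFrame({"outlook": ["sunny", "overcast", "rain", "sunny"]})
encoded = pd.DataFrame({"outlook": [1, 2, 3, 1]})

# Pair each raw value with its code to rebuild the lookup table per column.
lookup = {
    col: dict(zip(raw[col], encoded[col]))
    for col in raw.columns
}
print(lookup["outlook"]["rain"])
```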

## 7. An extra feature temperature - Exercise

In the resources you will find a file `tennis2.csv` with an extra feature. Can you use this file to predict whether John will still play tennis on a HOT rainy day with high humidity and weak wind?

In [None]:
# load dataset


In [None]:
# print dataset


In [None]:
# split dataset in features and target variable


In [None]:
# encode the categories


In [None]:
# print the encoded categories


In [None]:
# fit the classifier


In [None]:
# print the decision tree


In [None]:
# make prediction
 

In [None]:
# Will John still play tennis on a HOT rainy day with high humidity and weak wind?
# answer