<a href="https://colab.research.google.com/github/axel-sirota/decision-trees-and-random-forests/blob/main/1_Decision_Trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

In [None]:
# Glass identification data set
import pandas as pd
from sklearn import linear_model, model_selection, metrics
import warnings
warnings.filterwarnings('ignore')

In [None]:

%%writefile get_data.sh
mkdir -p data
if [ ! -f data/admissions.csv ]; then
  wget -O data/admissions.csv https://www.dropbox.com/scl/fi/bjcutl89xibf3r99yc8q0/admissions.csv?rlkey=n36lo0iffob0j73rys1vf3cn5&dl=0
fi
if [ ! -f data/bank.csv ]; then
  wget -O data/bank.csv https://www.dropbox.com/scl/fi/ukxqbfalj3rx5nyzven9j/bank.csv?rlkey=hfrax97bwt45dq9ag0jdpsgsp&dl=0
fi
if [ ! -f data/evergreen_sites.tsv ]; then
  wget -O data/evergreen_sites.tsv https://www.dropbox.com/scl/fi/c310bmln3pv8vdlbweo1k/evergreen_sites.tsv?rlkey=kie6jqkr4klw26b9gnowinyd9&dl=0
fi
if [ ! -f data/glass.csv ]; then
  wget -O data/glass.csv https://www.dropbox.com/scl/fi/dv522a61am4dsc3vkfp4p/glass.csv?rlkey=6l9v685sw98plzj2myvtjpes6&dl=0
fi
if [ ! -f data/titanic.csv ]; then
  wget -O data/titanic.csv https://www.dropbox.com/scl/fi/csnl3vpbq94i4vxpfoe2w/titanic.csv?rlkey=6q576c7lp0e25tb5khvz066l9&dl=0
fi

In [None]:
!bash get_data.sh

Decision trees are deterministic classification algorithms, and the way they reach a classification is very intuitive: each prediction follows a chain of simple yes/no questions about the features.

To see them in action, check the following link:

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
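To make the idea of "asking a question about a feature" concrete, here is a minimal sketch of how one split could be chosen: try each candidate threshold on a single feature and keep the one with the lowest weighted Gini impurity. The values below are made up for illustration, and the helper names `gini` and `best_threshold` are hypothetical. scikit-learn's `DecisionTreeClassifier` uses the Gini criterion by default, but its actual implementation is more elaborate than this.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split `value <= threshold`."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Made-up toy data: one feature and a binary label
al_toy = [0.5, 0.9, 1.1, 1.3, 1.8, 2.1, 2.4, 2.9]
label_toy = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(al_toy, label_toy)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node, then recurses into the two resulting groups until a stopping rule such as `max_depth` is reached.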

# Let's build a tree to classify

<img src="https://www.dropbox.com/scl/fi/z6xbyrn4phkxcr4ca1qyd/tree.jpg?rlkey=we02at2ie3dmnssyxtiadw7yv&raw=1"  align="center"/>

In [None]:
glass = pd.read_csv('data/glass.csv')
glass.columns = ['ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass["household"] = glass.glass_type.apply(lambda x : x//4)
glass.head(2)
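The `x // 4` trick in the cell above turns the multi-class `glass_type` into a binary target. The sketch below shows the resulting mapping, assuming the usual UCI coding of the glass identification data (types 1 to 3 are window glass, types 5 to 7 are containers, tableware and headlamps, and type 4 does not occur). The type names are taken from that assumed coding, not from this notebook.

```python
# Assumed UCI coding of glass_type (type 4 is absent from that data set)
glass_types = {
    1: "building windows (float)",
    2: "building windows (non-float)",
    3: "vehicle windows (float)",
    5: "containers",
    6: "tableware",
    7: "headlamps",
}

# Integer division by 4 sends 1, 2, 3 to 0 and 5, 6, 7 to 1
household_flag = {t: t // 4 for t in glass_types}

for t, name in glass_types.items():
    print(t, name, household_flag[t])
```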

In [None]:
X = glass[['na','fe','al','k']]
y = glass.household
X.head()

In [None]:
y.head()

In [None]:
!pip install pydotplus

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# for visualizing the tree
import pydotplus
from IPython.display import Image
from sklearn.model_selection import train_test_split
# Create decision tree classifier object
tree_model = DecisionTreeClassifier(random_state=0,max_depth=5)

In [None]:
tree_model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=99)

# Fit the decision tree on the training data
tree_model.fit(X_train,y_train)

# Predict on the test data
y_pred=tree_model.predict(X_test)
print(y_pred)

# Calculate the score (mean accuracy on the test set)
tree_model.score(X_test,y_test)
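For a classifier, `score` returns the mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch of that computation on made-up labels (the helper `accuracy` and the toy lists are illustrative, not part of the notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 6 of the 8 predictions are correct
y_true_toy = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred_toy = [0, 0, 1, 0, 0, 1, 1, 0]

toy_accuracy = accuracy(y_true_toy, y_pred_toy)
print(toy_accuracy)
```

Accuracy alone can be misleading when one class is much more frequent than the other, which is why a confusion matrix is also worth inspecting.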

In [None]:
X.columns.values

In [None]:
# Create DOT data
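# class_names must follow the ascending order of the classes:
# 0 (not household) -> 'no', 1 (household) -> 'yes'
dot_data = tree.export_graphviz(tree_model, out_file=None,
                                feature_names=X.columns.values,
                                class_names=['no','yes'])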

# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)

# Show graph
Image(graph.create_png())

In [None]:
# Let's review the confusion matrix for the example above

metrics.confusion_matrix(y_test, y_pred)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
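The numbers in the classification report can be derived from the confusion matrix. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so for a binary problem it reads `[[TN, FP], [FN, TP]]`. The sketch below recomputes precision, recall and F1 for the positive class from a made-up matrix. The helper `binary_report` and the counts are illustrative only.

```python
def binary_report(matrix):
    """Precision, recall and F1 for class 1 from a 2x2 confusion matrix.

    The matrix follows scikit-learn's layout: rows are true classes,
    columns are predicted classes, i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts, not the result of the glass model
toy_matrix = [[30, 3],
              [2, 8]]

precision, recall, f1 = binary_report(toy_matrix)
print(precision, recall, f1)
```

Precision answers "of the samples predicted as class 1, how many really are class 1?", while recall answers "of the real class 1 samples, how many did the model find?".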

The problem with trees is that they tend to overfit. What can we do about it? Let's check an idea from this link:

http://www.r2d3.us/visual-intro-to-machine-learning-part-2/

Apparently, combining a multitude of trees could help. Let's explore that idea in the next notebook.
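One way to see the overfitting is to compare training and test accuracy as the tree is allowed to grow deeper. The sketch below does this on synthetic data from `make_classification` with some label noise. The data set and its parameters are assumptions chosen for illustration, not the glass data, and the exact numbers will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y); an assumption for illustration only
X_toy, y_toy = make_classification(n_samples=400, n_features=8,
                                   n_informative=3, flip_y=0.15,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.25, random_state=99)

results = {}
for depth in (2, 5, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(depth, results[depth])
```

An unrestricted tree can memorize the training set, noise included, so its training accuracy reaches 1.0. The gap between training and test accuracy is the symptom to watch for, and limiting `max_depth` is one simple way to reduce it.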


# Now you do it

Predict who survives or not.

>  Use test_size = 0.25 and random_state = 99

<img src="https://www.dropbox.com/scl/fi/qt7g1wgsnpne43cfwumu0/hands_on.jpg?rlkey=q1zyeuoiuvofnzux4iylfo6ax&raw=1" width="100" height="100" align="right"/>

**Below we load some data on the Titanic.**

In [None]:
import pandas as pd
from sklearn import linear_model, model_selection, metrics

titanic = pd.read_csv('data/titanic.csv')
titanic.head()