<a href="https://colab.research.google.com/github/bill-mca/CYBN8001-Build-Skills/blob/SW5/task-5/Zoo_decision_tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training Data

We will use the 'zoo dataset' to create a decision tree model.

[Zoo dataset](http://archive.ics.uci.edu/ml/datasets/Zoo?ref=datanews.io) -- As described in the dataset information sheet:


In [1]:
# First, we'll download the zoo dataset to a local (temporary) folder
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data

--2024-08-01 08:48:16--  http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘zoo.data’

zoo.data                [ <=>                ]   4.03K  --.-KB/s    in 0s      

2024-08-01 08:48:16 (89.7 MB/s) - ‘zoo.data’ saved [4126]



In [3]:
# We can also download and display the dataset's description:
# This command downloads the relevant file
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.names
# This command displays the file's contents
!cat zoo.names

--2024-08-01 08:48:31--  http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘zoo.names.1’

zoo.names.1             [ <=>                ]   2.53K  --.-KB/s    in 0s      

2024-08-01 08:48:31 (185 MB/s) - ‘zoo.names.1’ saved [2587]

1. Title: Zoo database

2. Source Information
   -- Creator: Richard Forsyth
   -- Donor: Richard S. Forsyth 
             8 Grosvenor Avenue
             Mapperley Park
             Nottingham NG3 5DX
             0602-621676
   -- Date: 5/15/1990
 
3. Past Usage:
   -- None known other than what is shown in Forsyth's PC/BEAGLE User's Guide.

4. Relevant Information:
   -- A simple database containing 17 Boolean-valued attri

## Data Ingestion

Here we'll "ingest" the data by importing it into [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html):

In [4]:
from IPython.display import display
import pandas as pd

# Because the data file doesn't have header names, we'll list them here
# You can find a description of the data file at http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.names
feature_names = ['animal name', 'hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail', 'domestic', 'catsize', 'type']

# Import the "zoo" dataset
zoo = pd.read_csv('zoo.data', names = feature_names)

# Let's take a peek at the data
zoo.head()

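|   | animal name | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardvark | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| 1 | antelope | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |
| 2 | bass | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 4 |
| 3 | bear | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| 4 | boar | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |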


In [5]:
# We'll now import a few useful packages

# NumPy is a numerical computing library,
# useful for arrays and common math operations
import numpy as np
# Matplotlib is a common plotting library
import matplotlib.pyplot as plt
# Seaborn is handy for creating beautiful plots
import seaborn as sns; sns.set()

# Feature Extraction

We'll now split up the data into features and labels. We will not do any special pre-processing to generate features, but you are of course welcome to experiment with engineering new features if you have an intuition for how it will improve your model performance.  

In [6]:
features = zoo.loc[:, 'hair':'catsize'] # Omit animal name
labels = zoo.loc[:, 'type']
# We then convert the feature and labels dataframes to
# numpy ndarrays, which can interface with the scikit-learn models
X = features.to_numpy()
y = labels.to_numpy()
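
As one example of the feature engineering mentioned above, the sketch below one-hot encodes the `legs` column, which is the only feature here that holds a count rather than a 0/1 flag. The tiny `demo` dataframe is a made-up stand-in for the zoo features, and whether this encoding helps a decision tree is something to test rather than assume.

```python
import pandas as pd

# Tiny stand-in for the zoo features (three rows, three columns; values are made up)
demo = pd.DataFrame({'hair': [1, 0, 0], 'fins': [0, 1, 0], 'legs': [4, 0, 2]})

# Replace the numeric 'legs' column with one 0/1 column per observed value
legs_onehot = pd.get_dummies(demo['legs'], prefix='legs', dtype=int)
engineered = pd.concat([demo.drop(columns='legs'), legs_onehot], axis=1)

print(engineered.columns.tolist())
```

In the notebook, the same two lines could be applied to `features` before calling `to_numpy()`, which would change the number of columns in `X`.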

In [7]:
# To help familiarize yourself with these matrices,
# have a look at their 'shapes' and understand why they are so.
print('X shape', X.shape)
print('y shape', y.shape)

X shape (101, 16)
y shape (101,)


# ML Algorithm and Model

The model you will use to create classifications is called a “Decision Tree”. You have likely seen these charted as thought diagrams, and they are a popular tool among biologists for species identification. The decision tree is also a powerful and relatively intuitive machine learning model, which makes it a good way to practice classification in the ML pipeline.

The decision tree model is described in *A visual introduction to machine learning* (parts [I](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/) and [II](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)) by Stephanie Yee and Tony Chu, 2018.

We also give one example of pseudo-code for training a simple version of the tree (with binary splits) in:

[A Course in Machine Learning, ch 1., Decision Trees](http://ciml.info/dl/v0_99/ciml-v0_99-ch01.pdf), by Hal Daumé III, 2015

We encourage you to read it carefully and work with peers to understand its behavior.

Additional resources: To use the scikit-learn decision tree algorithm, have a look at its [documentation](https://scikit-learn.org/stable/modules/tree.html). A more advanced ensemble of decision trees is called a “random forest”. We do not cover it in class, but you are welcome to learn more about this approach. One useful resource on visualizing decision trees and generating random forests is [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html).
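
To make the idea of a binary split concrete, here is a minimal sketch of how a single split could be chosen on 0/1 features like those in the zoo data. It scores each candidate split with Gini impurity, which is scikit-learn's default criterion; the pseudo-code in the CIML chapter may use a different score. The helper names `gini` and `best_binary_split` and the toy arrays are made up for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels (0.0 means pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_binary_split(X, y):
    """Return (feature index, weighted impurity) of the best 0/1 split."""
    best_feature, best_score = None, np.inf
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        left, right = y[mask], y[~mask]
        # Weight each side's impurity by the share of rows it receives
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_feature, best_score = j, score
    return best_feature, best_score

# Toy data (hypothetical): columns stand for 'feathers' and 'aquatic'
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]])
y_toy = np.array([2, 2, 1, 1, 1])

feature, impurity = best_binary_split(X_toy, y_toy)
print(feature, impurity)
```

A full tree repeats this choice recursively on each side of the split until a stopping rule, such as a maximum depth, is reached.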

In [1]:
# INITIALIZE YOUR MODEL HERE

# I copied this code across from the documentation for
# class sklearn.tree.DecisionTreeClassifier
# My plan is to modify the code to use the zoo dataset
# after I've worked through this easy example with the iris
# data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Note: this overwrites the zoo X and y defined above with the iris data
iris = load_iris()
X = iris.data
y = iris.target

# Cap the tree at 3 leaves; fix the seed so results are reproducible
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)


## Quality Metric and Model Tuning

First, let's begin by splitting our data into training and test sets. You can do so using scikit-learn's [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. When setting its parameters, be mindful to record your decisions. (You can also create a validation set by using `train_test_split` a second time.)

In [2]:
# SPLIT YOUR DATA INTO TRAIN/TEST SETS HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
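
The validation set mentioned above can be carved out with a second call to `train_test_split`. The sketch below assumes a 60/20/20-style split and uses random stand-in arrays with the same shapes as the zoo `X` and `y`; the `_demo` names and the chosen proportions are illustrative, not prescribed by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as the zoo X and y (values are random)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(101, 16))
y_demo = rng.integers(1, 8, size=101)

# First split: hold out 20% of the rows as the test set
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Second split: carve a validation set out of the remaining rows
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
```

When `test_size` is omitted, scikit-learn holds out 25% of the rows, and fixing `random_state` makes the split repeatable, which is worth noting when you record your decisions.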

### Train your model

Try training your decision tree model below on the training set.

In [3]:
# TRAIN YOUR MODEL HERE
blag = clf.fit(X_train, y_train)

## Evaluate your model

Start by measuring your model's test set accuracy using the `DecisionTreeClassifier`'s `score` method.

In [13]:
# EVALUATE YOUR MODEL ON TEST SET HERE
# score() needs the held-out features and labels; it returns mean accuracy
blag.score(X_test, y_test)
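
The `score` method takes a feature matrix and the matching true labels, and returns the mean accuracy of the model's predictions on them. The self-contained sketch below uses the iris data as a stand-in (variable names such as `model` and `X_te` are illustrative) and checks that `score` agrees with computing accuracy from `predict` by hand.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook these would be your own train/test arrays
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)

model = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_tr, y_tr)

# Mean accuracy on the held-out test set
test_accuracy = model.score(X_te, y_te)

# The same number, computed from the predictions explicitly
manual_accuracy = accuracy_score(y_te, model.predict(X_te))

print(test_accuracy, manual_accuracy)
```

Comparing `model.score(X_tr, y_tr)` with the test accuracy is a quick way to spot overfitting: a large gap suggests the tree has memorized the training set.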

## Visualizing your model

A good first way to investigate your model is by visualizing it! We can do so using code similar to that provided by scikit-learn on the documentation page for the DecisionTreeClassifier model.

You may also find their "[Understanding the decision tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py)" useful for understanding your model's behavior.

In [None]:
from sklearn import tree
import graphviz

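# clf = # YOUR DECISION TREE CLASSIFIER HERE
# Note: `features` and `labels` are the zoo dataframes, so clf must have been
# trained on the zoo data (16 features) for the names below to line up
dot_data = tree.export_graphviz(clf, out_file=None,
                      feature_names=list(features.columns),
                      class_names=sorted(map(str, labels.unique())),
                      filled=True, rounded=True,
                      special_characters=True)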
graph = graphviz.Source(dot_data)
graph
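
If Graphviz is not available, scikit-learn's `export_text` offers a plain-text view of the same tree structure. The sketch below is a stand-alone illustration on the iris data; the names `small_tree` and `tree_rules` are made up, and for the zoo model you would pass your own classifier and `list(features.columns)` instead.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()

# A deliberately small tree so the printed rules stay short
small_tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
small_tree.fit(iris_data.data, iris_data.target)

# One indented line per split or leaf, labelled with the feature names
tree_rules = export_text(small_tree, feature_names=list(iris_data.feature_names))
print(tree_rules)
```

Each indentation level corresponds to one level of depth in the tree, and each `class:` line is a leaf.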

## Tuning your model

Hyperparameters like max depth can drastically affect your model's performance. Use k-fold cross validation to determine a good `max_depth` for your decision tree. Plot the cross validation score for each `max_depth` setting. Be sure to record how many folds you selected.

You can learn how to do k-fold cross validation with scikit-learn from the [documentation](https://scikit-learn.org/stable/modules/cross_validation.html).

In [None]:
# YOUR CROSS VALIDATION CODE HERE
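
One possible shape for this cell is sketched below, under a few assumptions: 5 folds, candidate depths 1 through 8, and the iris data as a stand-in so the block runs on its own. The names `X_cv`, `y_cv`, `n_folds`, and `depths` are illustrative choices rather than part of the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook, use your training features and labels
X_cv, y_cv = load_iris(return_X_y=True)

n_folds = 5            # assumption: record whichever k you choose
depths = range(1, 9)   # candidate max_depth values (arbitrary range)
mean_scores = []

for depth in depths:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # One accuracy value per fold; average them to get a single score per depth
    fold_scores = cross_val_score(candidate, X_cv, y_cv, cv=n_folds)
    mean_scores.append(fold_scores.mean())

best_depth = list(depths)[int(np.argmax(mean_scores))]
print('best max_depth:', best_depth)

# Plot the mean cross-validation score for each depth
fig, ax = plt.subplots()
ax.plot(list(depths), mean_scores, marker='o')
ax.set_xlabel('max_depth')
ax.set_ylabel(f'mean accuracy over {n_folds} folds')
ax.set_title('Cross-validation score by tree depth')
```

Running the search on the training set only keeps the test set untouched for the final evaluation. With an integer `cv` and a classifier, scikit-learn uses stratified folds, so on a small dataset it may warn when a class has fewer members than folds.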

You've now completed the ML pipeline for a classifier model. Yay!!
