# **CSI 382 - Data Mining and Knowledge Discovery**

# **Lab 6 - Decision Trees**

One attractive classification method involves the construction of a decision tree, a collection of decision nodes, connected by branches, extending downward from the root node until terminating in leaf nodes. Beginning at the root node, which by convention is placed at the top of the decision tree diagram, attributes are tested at the decision nodes, with each possible outcome resulting in a branch. Each branch then leads either to another decision node or to a terminating leaf node.

# **Dataset for Lab 6**

The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

* class: car acceptability
* buying: buying price
* maint: price of the maintenance
* doors: number of doors
* persons: capacity in terms of persons to carry
* lug_boot: the size of luggage boot
* safety: estimated safety of the car

The Car Evaluation Database contains examples with the structural information removed, i.e., it directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.

Because the underlying concept structure is known, this database may be particularly useful for testing constructive induction and structure discovery methods.

**Attribute Information:**

Class Values: unacc, acc, good, vgood

Attributes:

* buying: vhigh, high, med, low.
* maint: vhigh, high, med, low.
* doors: 2, 3, 4, 5more.
* persons: 2, 4, more.
* lug_boot: small, med, big.
* safety: low, med, high.
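
As a quick sanity check on the attribute space, the sketch below enumerates every combination of the attribute values listed above. The dictionary name `ATTRIBUTES` is introduced here for illustration only. If the CSV is the complete Car Evaluation Database, its row count should match the number of combinations, but that is an assumption about the file rather than something this snippet verifies.

```python
from itertools import product
from math import prod

# Attribute values exactly as listed in the lab handout
ATTRIBUTES = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Size of the full attribute space: 4 * 4 * 4 * 3 * 3 * 3
n_combinations = prod(len(values) for values in ATTRIBUTES.values())

# Enumerate every possible car description as a tuple of attribute values
all_combinations = list(product(*ATTRIBUTES.values()))

print(n_combinations)          # 1728
print(all_combinations[0])     # first combination in listing order
```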

The dataset can be found here in this [URL](https://drive.google.com/file/d/1wzsmycx2KlW637VTBS9vKoSNYWik9Vaf/view?usp=sharing)

**For today we need the upgraded category_encoders package. So we need to run the following code.**

In [None]:
!pip install category_encoders

## **Loading the dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/car_evaluation.csv')

#Check number of rows and columns in the dataset
print("The dataset has %d rows and %d columns." % df.shape)

# **Exploratory Data Analysis**

Let's look into some attributes of the dataset first before preprocessing



In [None]:
# view dimensions of dataset

df.shape

In [None]:
df.head(10)

In [None]:
# Check data types
df.info()

In [None]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']


for col in col_names:
    print(df[col].value_counts())

In [None]:
df['class'].value_counts()

# **Dataset Preprocessing**

We need to transform all categorical data into numerical data. That's why we apply an encoder from the category_encoders package to our dataset.
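
Ordinal encoding simply replaces each category with an integer. The sketch below shows the idea in plain Python with an explicit, hand-chosen ordering; the names `LEVELS`, `CODES`, and `encode_row` are hypothetical helpers, not part of category_encoders. As far as the category_encoders documentation describes, its `OrdinalEncoder` does not assume a natural order unless a mapping is supplied, so the integers it assigns may differ from the ones shown here.

```python
# Hand-written orderings (an assumption: lowest level first)
LEVELS = {
    "buying": ["low", "med", "high", "vhigh"],
    "maint": ["low", "med", "high", "vhigh"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Build one {category: integer} lookup per column, starting at 1
CODES = {
    col: {level: i for i, level in enumerate(levels, start=1)}
    for col, levels in LEVELS.items()
}

def encode_row(row):
    """Replace every categorical value in a row (dict) with its integer code."""
    return {col: CODES[col][str(value)] for col, value in row.items()}

example = {"buying": "vhigh", "maint": "med", "doors": "2",
           "persons": "more", "lug_boot": "small", "safety": "high"}
encoded = encode_row(example)
print(encoded)
```

Keeping the order explicit matters for trees: a threshold split such as `safety <= 1.5` only has a sensible reading when larger codes mean "more" of the attribute.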

In [None]:
# check missing values in variables

df.isnull().sum()

In [None]:
import category_encoders as ce

In [None]:
encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])

df = encoder.fit_transform(df)

In [None]:
df.head(10)

# **Preparing dataset to be fed into Model**

The target/response variable in our dataset is **class**. So we are putting the class labels in our target variable $y$.

The other variables/predictors are the columns **[buying, maint, doors, persons, lug_boot, safety]** and should be put in our training variable $X$.

In [None]:
X = df.drop(['class'], axis=1)

y = df['class']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
X_train.head(10)

# **Decision Tree - CART**

We will now build our model of Decision Tree Classifier.

The classification and regression trees (CART) method was suggested by Breiman et al. [1] in 1984. The decision trees produced by CART are strictly binary, containing exactly two branches for each decision node. CART recursively partitions the records in the training data set into subsets of records with similar values for the target attribute. The CART algorithm grows the tree by conducting, for each decision node, an exhaustive search of all available variables and all possible splitting values, selecting the optimal split according to the following criteria (from Kennedy et al. [2]).

CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.

scikit-learn uses an optimised version of the CART algorithm; however, the scikit-learn implementation does not support categorical variables for now, which is why the attributes were encoded as integers above.
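
To make the "exhaustive search over splitting values" concrete, here is a minimal sketch for a single numeric feature. It scores each candidate threshold by the weighted Gini impurity of the two child nodes and keeps the lowest. This illustrates the idea only and is not scikit-learn's actual implementation; `gini_of`, `best_binary_split`, and the toy data are hypothetical.

```python
from collections import Counter

def gini_of(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    n = len(values)
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = len(left) / n * gini_of(left) + len(right) / n * gini_of(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Toy data (made up): an ordinal-encoded safety column, 1=low, 2=med, 3=high
toy_safety = [1, 1, 2, 2, 3, 3]
toy_class = ["unacc", "unacc", "acc", "acc", "acc", "acc"]

threshold, impurity = best_binary_split(toy_safety, toy_class)
print(threshold, impurity)  # 1.5 0.0
```

A full tree builder would repeat this search over every feature at every node and recurse into the two children until a stopping rule such as `max_depth` is reached.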

## **Measures for selecting the Best Split**

The measures developed for selecting the best split are often based on the degree of impurity of the child nodes. The smaller the degree of impurity, the more skewed the class distribution. For example, a node with class distribution (0,1) has zero impurity, whereas a node with uniform class distribution has the highest impurity. Examples of impurity measures include:
   
* Entropy($t$) = $-\sum_{i=0}^{c-1}{p(i|t)\log_{2}p(i|t)}$

* Gini($t$) = $1-\sum_{i=0}^{c-1}{[p(i|t)]^2}$

* Classification Error($t$) = $1-\max_{i}[p(i|t)]$

where $p(i|t)$ is the fraction of records at node $t$ that belong to class $i$, $c$ is the number of classes, and $0\log_{2}0=0$ in entropy calculations.
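
The three measures can be computed directly from a node's class distribution. The sketch below is a small illustration; the function names are chosen here and are not scikit-learn APIs. Each function takes the list of class probabilities $p(i|t)$ at a node.

```python
import math

def entropy(probs):
    """Entropy in bits; zero probabilities are skipped because 0*log2(0) = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def classification_error(probs):
    """Classification error: 1 minus the probability of the majority class."""
    return 1.0 - max(probs)

# Compare the measures on a pure, a skewed, and a uniform two-class node
for dist in [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5)]:
    print(dist,
          round(entropy(dist), 4),
          round(gini(dist), 4),
          round(classification_error(dist), 4))
```

All three measures are zero for the pure node (0, 1) and reach their maximum for the uniform node (0.5, 0.5), matching the description above.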

In [None]:
# Find more about scikit-learn's implementation of decision trees here - https://scikit-learn.org/stable/modules/tree.html

from sklearn.tree import DecisionTreeClassifier

In [None]:
# limit the tree to a maximum depth of 7 (default criterion is 'gini');
# random_state fixes the random seed so that results are reproducible
clf_gini = DecisionTreeClassifier(max_depth=7, random_state=42)

# fit the model
clf_gini.fit(X_train, y_train)

In [None]:
# Getting some predictions from the testing set
y_pred_gini = clf_gini.predict(X_test)

y_pred_gini

In [None]:
# Finding the testing accuracy of the model
from sklearn.metrics import accuracy_score

print('Test accuracy score with criterion gini index: {0:0.4f}'. format(accuracy_score(y_test, y_pred_gini)))

In [None]:
# Finding the training accuracy of the model
y_pred_train_gini = clf_gini.predict(X_train)

y_pred_train_gini

In [None]:
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_gini)))

In [None]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))

In [None]:
# plotting the splits
import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(96,48))

# clf_gini was already fitted above, so it can be plotted without refitting
tree.plot_tree(clf_gini)

In [None]:
import graphviz
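# class_names must list the class labels in the classifier's own order
# (clf_gini.classes_), not the per-row target column y_train
dot_data = tree.export_graphviz(clf_gini, out_file=None,
                              feature_names=list(X_train.columns),
                              class_names=list(clf_gini.classes_),
                              filled=True, rounded=True,
                              special_characters=True)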

graph = graphviz.Source(dot_data)

graph

In [None]:
# Save the figure for future reference
graph.render(filename='cart',directory='/content/')

# **Evaluating the Model - CART**

We often use a metric called confusion matrix for evaluating the accuracy of a model.

A confusion matrix is a technique for summarizing the performance of a classification algorithm.

Classification accuracy alone can be misleading if you have an unequal number of observations in each class or if you have more than two classes in your dataset.

Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.
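
The sketch below illustrates why accuracy alone can mislead on imbalanced data. The class counts are made up for illustration and are not taken from the lab dataset. A "model" that always predicts the majority class still scores a high accuracy while never recognising any minority class.

```python
from collections import Counter

# Made-up imbalanced labels: 70 / 22 / 4 / 4 records per class
y_true = ["unacc"] * 70 + ["acc"] * 22 + ["good"] * 4 + ["vgood"] * 4

# Baseline that ignores the inputs and always predicts the most frequent class
majority_class = Counter(y_true).most_common(1)[0][0]
y_hat = [majority_class] * len(y_true)

# Overall accuracy of the baseline
accuracy = sum(a == p for a, p in zip(y_true, y_hat)) / len(y_true)

# Per-class recall: fraction of each actual class that was predicted correctly
recall = {
    cls: sum(1 for a, p in zip(y_true, y_hat) if a == cls and p == cls) / n
    for cls, n in Counter(y_true).items()
}

print(accuracy)  # 0.7
print(recall)
```

The confusion matrix and per-class metrics expose this failure immediately, which is the motivation for the sections that follow.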



## **Confusion Matrix**

A confusion matrix is an $n \times n$ matrix, where $n$ is the number of class labels. Each cell counts how many records of one actual class were assigned to one predicted class.

The desired outcome for a confusion matrix is high counts on the main (top-left to bottom-right) diagonal and 0 in every off-diagonal cell.

## **Calculating a Confusion Matrix**

Below is the process for calculating a confusion Matrix.

1. You need a test dataset or a validation dataset with expected outcome values.
2. Make a prediction for each row in your test dataset.
3. From the expected outcomes and predictions count:
    * The number of correct predictions for each class.
    * The number of incorrect predictions for each class, organized by the class that was predicted.

These numbers are then organized into a table, or a matrix as follows:

* Expected down the side: Each row of the matrix corresponds to an actual (expected) class.
* Predicted across the top: Each column of the matrix corresponds to a predicted class.

The counts of correct and incorrect classification are then filled into the table.

The number of correct predictions for a class goes into the cell where the expected row and the predicted column are both that class, that is, on the main diagonal.

In the same way, the number of incorrect predictions for a class goes into the expected row for that class and the column of the class that was predicted instead.
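
The counting procedure can be sketched in a few lines of plain Python. The helper `build_confusion_matrix` and the toy labels are hypothetical; rows index the actual class and columns the predicted class, which is also the layout `sklearn.metrics.confusion_matrix` uses.

```python
def build_confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Toy example with two classes (made-up labels)
toy_labels = ["acc", "unacc"]
toy_actual = ["acc", "acc", "unacc", "unacc", "unacc"]
toy_predicted = ["acc", "unacc", "unacc", "unacc", "acc"]

toy_cm = build_confusion_matrix(toy_actual, toy_predicted, toy_labels)
for label, row in zip(toy_labels, toy_cm):
    print(label, row)

# Accuracy is the sum of the main diagonal divided by the total count
toy_accuracy = sum(toy_cm[i][i] for i in range(len(toy_labels))) / len(toy_actual)
print(toy_accuracy)  # 0.6
```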

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_gini)

print('Confusion matrix\n\n', cm)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf_gini.classes_)
disp.plot()

plt.savefig('/content/drive/MyDrive/CSI 382 - Datasets/cart_confusion_matrix.png')

## **Support and Confidence**

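The **support** of the decision rule refers to the proportion of records in the dataset that rest in that particular terminal leaf node. The **confidence** of the rule refers to the proportion of records in the leaf node for which the decision rule is true. Note that the `support` column printed by `classification_report` is a different quantity: it is the number of true instances of each class in the evaluated set.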
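
A minimal sketch of these two definitions, using made-up `(leaf_id, true_label)` pairs. For a fitted scikit-learn tree, the leaf of each record could be obtained with the classifier's `apply` method, but the pairs below are hand-written for illustration, and `rule_stats` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def rule_stats(records):
    """Compute support and confidence of each leaf's rule from (leaf_id, label) pairs."""
    by_leaf = defaultdict(list)
    for leaf_id, label in records:
        by_leaf[leaf_id].append(label)

    total = len(records)
    stats = {}
    for leaf_id, labels in by_leaf.items():
        # The leaf predicts its most frequent class
        predicted, hits = Counter(labels).most_common(1)[0]
        stats[leaf_id] = {
            "predicted": predicted,
            "support": len(labels) / total,       # share of all records in this leaf
            "confidence": hits / len(labels),     # share of the leaf the rule gets right
        }
    return stats

# Made-up records: three fall in leaf 0, two in leaf 1
toy_records = [(0, "unacc"), (0, "unacc"), (0, "acc"), (1, "acc"), (1, "acc")]
toy_stats = rule_stats(toy_records)
for leaf_id, s in toy_stats.items():
    print(leaf_id, s)
```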

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_gini))

# **Decision Tree - C4.5**

The C4.5 algorithm is Quinlan’s extension of his own ID3 algorithm for generating decision trees (DOI: 10.5555/152181). Just as with CART, the C4.5 algorithm recursively visits each decision node, selecting the optimal split, until no further splits are possible. However, there are interesting differences between CART and C4.5:

* Unlike CART, the C4.5 algorithm is not restricted to binary splits. Whereas CART always produces a binary tree, C4.5 produces a tree of more variable shape.
* For categorical attributes, C4.5 by default produces a separate branch for each value of the categorical attribute. This may result in a more “congested” tree than desired, since some values may have low frequency or may naturally be associated with other values.
* The C4.5 method for measuring node homogeneity is quite different from the CART method and is examined in detail below.


**The C4.5 algorithm uses the concept of information gain or entropy reduction to select the optimal split.**

C4.5 uses this concept of entropy as follows. Suppose that we have a candidate split $S$, which partitions the training data set $T$ into several subsets, $T_1, T_2, \dots , T_k$.
The mean information requirement can then be calculated as the weighted sum of the entropies for the individual subsets, as follows:

$H_S(T) = \sum_{i=1}^{k}{P_{i}H(T_{i})}$

where $P_i$ represents the proportion of records in subset $i$ . We may then define our
information gain to be $gain(S) = H(T) - H_S(T)$, that is, the increase in information produced by partitioning the training data $T$ according to this candidate split $S$. At
each decision node, C4.5 chooses the optimal split to be the split that has the greatest information gain, gain(S).
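
The gain computation can be sketched as follows for a multiway split on one categorical attribute, with one branch per attribute value. The helper names and the toy data are hypothetical. The code computes $H(T)$, the weighted entropy of the subsets after the split, and their difference.

```python
import math
from collections import Counter, defaultdict

def label_entropy(labels):
    """Entropy H (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """gain(S) = H(T) - H_S(T) for a split with one branch per attribute value."""
    subsets = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        subsets[value].append(label)
    n = len(labels)
    # Weighted sum of subset entropies, with P_i = |T_i| / |T|
    h_split = sum(len(s) / n * label_entropy(s) for s in subsets.values())
    return label_entropy(labels) - h_split

# Toy data (made up): six cars, split on the safety attribute
toy_safety = ["low", "low", "med", "med", "high", "high"]
toy_class = ["unacc", "unacc", "unacc", "acc", "acc", "acc"]

gain = information_gain(toy_safety, toy_class)
print(round(gain, 4))  # 0.6667
```

Here $H(T) = 1$ bit because the two classes are balanced. Only the `med` branch remains mixed, so the weighted entropy after the split is $\frac{2}{6} \cdot 1 = \frac{1}{3}$ and the gain is $\frac{2}{3}$.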

In [None]:
# use entropy as the split criterion with a maximum depth of 7; random_state fixes the seed
# note: scikit-learn still grows a binary CART-style tree here; only the impurity measure changes
clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=42)

# fit the model
clf_en.fit(X_train, y_train)

In [None]:
# Getting some predictions from the testing set
y_pred_en = clf_en.predict(X_test)

In [None]:
# Getting some predictions from the training set
y_pred_train_en = clf_en.predict(X_train)

y_pred_train_en

In [None]:
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_en)))

In [None]:
print('Training set score: {:.4f}'.format(clf_en.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_en.score(X_test, y_test)))

In [None]:
from sklearn import tree

plt.figure(figsize=(12,8))

# clf_en was already fitted above, so it can be plotted without refitting
tree.plot_tree(clf_en)

In [None]:
import graphviz
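# class_names must list the class labels in the classifier's own order
# (clf_en.classes_), not the per-row target column y_train
dot_data = tree.export_graphviz(clf_en, out_file=None,
                              feature_names=list(X_train.columns),
                              class_names=list(clf_en.classes_),
                              filled=True, rounded=True,
                              special_characters=True)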

graph = graphviz.Source(dot_data)

graph

In [None]:
# Save the figure for future reference
graph.render(filename='C4.5.dot',directory='/content/drive/MyDrive/CSI 382 - Datasets/')

# **Evaluating the model - C4.5**

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_en)

print('Confusion matrix\n\n', cm)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf_en.classes_)
disp.plot()

# use a separate file name so the CART confusion matrix saved earlier is not overwritten
plt.savefig('/content/drive/MyDrive/CSI 382 - Datasets/c45_confusion_matrix.png')

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_en))

# **That's all for today!**

# **Tasks**

**Dataset**:

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient. To learn more about the dataset, you can follow this [link](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset).

Attribute Information

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12. stroke: 1 if the patient had a stroke or 0 if not (**target**)

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

Link to the dataset - [URL](https://drive.google.com/file/d/1xtS5tHc-hkJV4SUwYsyVFmVgtGYNhN7G/view?usp=sharing)

## **Do the following tasks**:

1. Preprocess the dataset if required
2. Apply both configurations of Decision Tree algorithm
    * Visualize the tree for both CART and C4.5
3. Maximize your accuracy!!
4. Analyze the confusion matrix in both cases.