# Tutorial: Decision Tree Classifier

The Decision Tree is one of the most popular machine learning algorithms. It uses a tree-like structure of decisions and their possible outcomes to solve a problem. It is a supervised learning algorithm that can be used for both classification and regression.

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
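
To make this structure concrete, here is a minimal sketch that represents a tree as nested Python dictionaries and classifies a record by walking from the root to a leaf. The tree (`toy_tree`), the `classify` helper, and the chosen attribute tests are hypothetical and hand-written for illustration; they are not learned from any data.

```python
# Internal node: {"attribute": name, "branches": {attribute value: subtree}}
# Leaf node: a plain string holding the class label
toy_tree = {
    "attribute": "safety",  # root node tests the 'safety' attribute
    "branches": {
        "low": "unacc",  # leaf
        "med": {
            "attribute": "persons",
            "branches": {"2": "unacc", "4": "acc", "more": "acc"},
        },
        "high": {
            "attribute": "persons",
            "branches": {"2": "unacc", "4": "acc", "more": "acc"},
        },
    },
}


def classify(node, record):
    """Follow the branch matching the record's value until a leaf is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(toy_tree, {"safety": "low", "persons": "4"}))
print(classify(toy_tree, {"safety": "high", "persons": "4"}))
```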

We make a few assumptions when implementing the Decision Tree algorithm:

- At the beginning, the whole training set is considered as the root.
- Feature values are preferably categorical. Continuous values are discretized before the model is built.
- Records are distributed recursively on the basis of attribute values.
- The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical measure.
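
One common statistical measure for choosing which attribute to place at a node is the Gini impurity, which is also the criterion used later in this tutorial. The sketch below is a simplified illustration, not scikit-learn's implementation: it assumes purely categorical attributes and one branch per category, and the helper names (`gini`, `weighted_gini`) and the four toy records are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, attribute, target):
    """Impurity left after splitting the rows on every value of one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    n = len(rows)
    return sum(len(group) / n * gini(group) for group in groups.values())


# Hypothetical records: 'safety' separates the classes, 'lug_boot' does not
rows = [
    {"safety": "low", "lug_boot": "small", "class": "unacc"},
    {"safety": "low", "lug_boot": "big", "class": "unacc"},
    {"safety": "high", "lug_boot": "small", "class": "acc"},
    {"safety": "high", "lug_boot": "big", "class": "acc"},
]

# The attribute with the lowest remaining impurity is the preferred split
best = min(["safety", "lug_boot"], key=lambda a: weighted_gini(rows, a, "class"))
print(best)
```

Under these assumptions, the attribute that leaves the purest groups is picked first, and the same procedure is repeated on each resulting subset.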


# Import Libraries

In [78]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
# suppress all warnings; the second positional argument is a message pattern, not a mode
warnings.filterwarnings('ignore')

# Import dataset

In [9]:
dataset=pd.read_csv(r'C:\Users\Lenovo\Desktop\Arun DS\Task 18-30\TASK- 20 to 30\car-evaluation-data-set_ TASK 22\car_evaluation.csv',header=None)

# Exploratory Data Analysis

Now I will explore the data to gain insights from it.

In [11]:
dataset.shape

(1728, 7)

We can see that there are 1728 instances and 7 variables in the data set.

In [12]:
dataset.head()

       0      1  2  3      4     5      6
0  vhigh  vhigh  2  2  small   low  unacc
1  vhigh  vhigh  2  2  small   med  unacc
2  vhigh  vhigh  2  2  small  high  unacc
3  vhigh  vhigh  2  2    med   low  unacc
4  vhigh  vhigh  2  2    med   med  unacc


Rename columns:

We can see that the dataset does not have proper column names. The columns are merely labelled 0, 1, 2, and so on. We should give them proper names, which I will do as follows:

In [13]:
col_names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
dataset.columns=col_names
dataset.head()

   buying  maint  doors  persons  lug_boot  safety  class
0   vhigh  vhigh      2        2     small     low  unacc
1   vhigh  vhigh      2        2     small     med  unacc
2   vhigh  vhigh      2        2     small    high  unacc
3   vhigh  vhigh      2        2       med     low  unacc
4   vhigh  vhigh      2        2       med     med  unacc


We can see that the columns now have meaningful names.
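
As an alternative, the names can be supplied when the file is read. The sketch below relies on the `names` parameter of `pd.read_csv` and uses a two-row in-memory sample (copied from the rows shown above) in place of the real file, so no file path is needed here. Passing `dtype=str` is my own choice to keep every column as text.

```python
import io

import pandas as pd

# Two rows copied from the head() output, standing in for the CSV file
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
)

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# header=None: the file has no header row; names=...: use these labels instead
df = pd.read_csv(sample, header=None, names=col_names, dtype=str)
print(df.head())
```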

In [15]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB


Frequency distribution of values in variables
Now, I will check the frequency counts of categorical variables.

In [19]:
col_names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
for col in col_names:
    print(dataset[col].value_counts())

low      432
high     432
vhigh    432
med      432
Name: buying, dtype: int64
low      432
high     432
vhigh    432
med      432
Name: maint, dtype: int64
4        432
5more    432
2        432
3        432
Name: doors, dtype: int64
4       576
2       576
more    576
Name: persons, dtype: int64
small    576
med      576
big      576
Name: lug_boot, dtype: int64
low     576
high    576
med     576
Name: safety, dtype: int64
unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64


We can see that doors and persons contain non-numeric values such as 5more and more, so they are categorical in nature. I will therefore treat them as categorical variables.

Summary of variables

There are 7 variables in the dataset. All the variables are of categorical data type.
These are given by buying, maint, doors, persons, lug_boot, safety and class.

class is the target variable.

In [21]:
dataset['class'].value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64

The class target variable is ordinal in nature.
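
If the ordering of the target matters for later analysis, pandas can record it explicitly. The sketch below assumes the order unacc < acc < good < vgood (inferred from the label names, not stated in the data file) and uses a short hypothetical list of labels rather than the full column.

```python
import pandas as pd

# Assumed ranking of the class labels, from worst to best
levels = ["unacc", "acc", "good", "vgood"]

# Hypothetical sample of labels standing in for dataset['class']
labels = ["acc", "unacc", "vgood", "good", "unacc"]

ordered = pd.Series(pd.Categorical(labels, categories=levels, ordered=True))

print(ordered.cat.codes.tolist())   # integer rank of each label
print(ordered.min(), ordered.max())
print((ordered >= "acc").sum())     # how many records are at least 'acc'
```

An ordered categorical allows comparisons such as `>=`, which a plain object column does not support in a meaningful way.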

Let's check whether any of the variables have missing values.

In [22]:
#checking missing values in variables
dataset.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

We can see that there are no missing values in the dataset. I have checked the frequency distribution of values previously. It also confirms that there are no missing values in the dataset.

# Declare Feature Vector and Target Variable

In [23]:
dataset.shape

(1728, 7)

In [24]:
dataset.head()

   buying  maint  doors  persons  lug_boot  safety  class
0   vhigh  vhigh      2        2     small     low  unacc
1   vhigh  vhigh      2        2     small     med  unacc
2   vhigh  vhigh      2        2     small    high  unacc
3   vhigh  vhigh      2        2       med     low  unacc
4   vhigh  vhigh      2        2       med     med  unacc


In [50]:
X=dataset.drop(['class'],axis=1)
y=dataset['class']

# Splitting data into training and testing set

In [51]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)

In [52]:
X_train.shape,X_test.shape

((1157, 6), (571, 6))
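
Because the classes are heavily imbalanced (69 good and 65 vgood records out of 1728), a plain random split can leave the rare classes unevenly represented. One option, shown here as a sketch rather than as what this tutorial does, is the `stratify` argument of `train_test_split`. The labels are rebuilt from the class counts printed earlier, and the single feature column is a hypothetical placeholder.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() output
counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}
y_demo = [label for label, n in counts.items() for _ in range(n)]
X_demo = [[i] for i in range(len(y_demo))]  # placeholder feature (hypothetical)

# stratify=y_demo keeps the class proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

test_counts = Counter(y_te)
for label, n in counts.items():
    print(label, test_counts[label], round(test_counts[label] / n, 3))
```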

# Feature Engineering

Feature Engineering is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.

First, I will check the data types of variables again.

In [53]:
#check datatypes in X_train
X_train.dtypes

buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
dtype: object

Encode categorical variables:
    
Now, I will encode the categorical variables.

In [54]:
X_train.head()

      buying  maint  doors  persons  lug_boot  safety
  48   vhigh  vhigh      3     more       med     low
 468    high  vhigh      3        4     small     low
 155   vhigh   high      3     more     small    high
1721     low    low  5more     more     small    high
1208     med    low      2     more     small    high


We can see that all the variables are of ordinal categorical data type.
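
Since the variables are ordinal, it can be useful to control which integer each category receives, so that the numeric order matches the natural ranking. The following is a minimal pure-Python sketch of such a mapping. The level orderings in `LEVELS` are my assumption based on the category names, and `encode` is a hypothetical helper, not part of category_encoders.

```python
# Assumed natural ordering of each variable, from lowest to highest
LEVELS = {
    "buying": ["low", "med", "high", "vhigh"],
    "maint": ["low", "med", "high", "vhigh"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Build one {category: rank} dictionary per column, with ranks starting at 1
MAPPING = {
    col: {level: rank for rank, level in enumerate(levels, start=1)}
    for col, levels in LEVELS.items()
}


def encode(record):
    """Replace each category in a record with its rank."""
    return {col: MAPPING[col][value] for col, value in record.items()}


# Row 48 from the X_train.head() output
row_48 = {"buying": "vhigh", "maint": "vhigh", "doors": "3",
          "persons": "more", "lug_boot": "med", "safety": "low"}
print(encode(row_48))
```

Because the mapping is fixed up front, the same dictionary can be applied to the training and test sets, which keeps both encoded consistently.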

In [55]:
#Import category encoders
import category_encoders as ce

In [56]:
#encode variables with ordinal encoding
encoder=ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
#fit the encoder on the training set only, then reuse the same mapping for the test set
X_train=encoder.fit_transform(X_train)
X_test=encoder.transform(X_test)

In [57]:
X_train.head()

      buying  maint  doors  persons  lug_boot  safety
  48       1      1      1        1         1       1
 468       2      1      1        2         2       1
 155       1      2      1        1         2       2
1721       3      3      2        1         2       2
1208       4      3      3        1         2       2


In [58]:
X_test.head()

      buying  maint  doors  persons  lug_boot  safety
 599       2      2      4        3         1       2
1201       4      3      3        2         1       3
 628       2      2      2        3         3       3
1498       3      2      2        2         1       3
1263       4      3      4        1         1       1


We now have the training and test sets ready for model building.

# Decision Tree Classifier with criterion gini index

In [59]:
#Import Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

In [60]:
# instantiate the DecisionTreeClassifier model with criterion gini index
clf_gini=DecisionTreeClassifier(criterion='gini',max_depth=3,random_state=0)

#fit the model
clf_gini.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')
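
To see which tests a fitted tree actually applies, more recent scikit-learn versions provide `export_text` in `sklearn.tree`. The sketch below fits a depth-1 tree on a tiny hypothetical dataset with a single ordinal-encoded feature, so the printed rules are not those of the car evaluation model above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: one encoded feature, where value 1 always means 'unacc'
X_toy = [[1], [1], [2], [2], [3], [3]]
y_toy = ["unacc", "unacc", "acc", "acc", "acc", "acc"]

toy_clf = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=0)
toy_clf.fit(X_toy, y_toy)

# Text rendering of the tree: one line per test and per leaf
rules = export_text(toy_clf, feature_names=["safety"])
print(rules)
```

The same call could be applied to `clf_gini` with the six column names, assuming the installed scikit-learn version includes `export_text`.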

Predict the Test set results with criterion gini index

In [61]:
y_pred_gini=clf_gini.predict(X_test)

In [64]:
from sklearn.metrics import accuracy_score
print('Model Accuracy score with criterion gini index:',accuracy_score(y_test,y_pred_gini))

Model Accuracy score with criterion gini index: 0.8021015761821366


Here, y_test holds the true class labels and y_pred_gini holds the predicted class labels for the test set.
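
Accuracy is simply the fraction of records whose predicted label equals the true label. The sketch below reproduces that calculation in plain Python on five hypothetical labels; `accuracy` is an illustrative helper, not the scikit-learn function.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels: three of the five predictions are correct
toy_true = ["unacc", "acc", "unacc", "good", "unacc"]
toy_pred = ["unacc", "unacc", "unacc", "acc", "unacc"]

toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```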

In [69]:
y_pred_train_gini=clf_gini.predict(X_train)
y_pred_train_gini

array(['unacc', 'unacc', 'unacc', ..., 'unacc', 'unacc', 'acc'],
      dtype=object)

In [72]:
print('Training set accuracy score: {0:0.4f}'.format(accuracy_score(y_train,y_pred_train_gini)))

Training set accuracy score: 0.7865


Check for overfitting and underfitting

In [73]:
#print the scores of training and testing sets
print('Training set accuracy score: {0:0.4f}'.format(accuracy_score(y_train,y_pred_train_gini)))
print('Testing set accuracy score: {0:0.4f}'.format(accuracy_score(y_test,y_pred_gini)))

Training set accuracy score: 0.7865
Testing set accuracy score: 0.8021


Here, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.
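
A related sanity check, offered here as a suggestion rather than as part of the original analysis, is to compare the test accuracy with a baseline that always predicts the most frequent class. The numbers below are taken from this tutorial's own outputs: 571 test records, of which 397 are unacc (the support column of the classification report further down).

```python
# Figures copied from the outputs in this tutorial
test_size = 571
majority_count = 397   # unacc records in the test set
train_acc = 0.7865
test_acc = 0.8021

# Accuracy of a model that always predicts 'unacc'
baseline_acc = majority_count / test_size

# A large positive gap would hint at overfitting
gap = train_acc - test_acc

print(f"Majority-class baseline:   {baseline_acc:.4f}")
print(f"Train minus test accuracy: {gap:+.4f}")
print(f"Test accuracy above baseline: {test_acc - baseline_acc:+.4f}")
```

A test accuracy close to this baseline would suggest that the model is mostly predicting the majority class, even when the training and test scores agree.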

In [82]:
from sklearn.metrics import confusion_matrix,classification_report
print('Confusion Matrix: \n',confusion_matrix(y_test,y_pred_gini))
#print('Classification Report: \n',classification_report(y_test,y_pred_gini))

Confusion Matrix: 
 [[ 73   0  56   0]
 [ 20   0   0   0]
 [ 12   0 385   0]
 [ 25   0   0   0]]


In [83]:
print('Classification Report: \n',classification_report(y_test,y_pred_gini))

Classification Report: 
               precision    recall  f1-score   support

         acc       0.56      0.57      0.56       129
        good       0.00      0.00      0.00        20
       unacc       0.87      0.97      0.92       397
       vgood       0.00      0.00      0.00        25

   micro avg       0.80      0.80      0.80       571
   macro avg       0.36      0.38      0.37       571
weighted avg       0.73      0.80      0.77       571
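
The per-class figures in the report can be recomputed from the confusion matrix: recall divides each diagonal entry by its row sum, and precision divides it by its column sum. The sketch below does this in plain Python with the matrix printed above, assuming rows are true labels, columns are predicted labels, and both follow the report's label order (acc, good, unacc, vgood). A class that is never predicted has a column sum of zero, so its precision is undefined and is set to 0.0 here.

```python
labels = ["acc", "good", "unacc", "vgood"]

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = [
    [73, 0, 56, 0],
    [20, 0, 0, 0],
    [12, 0, 385, 0],
    [25, 0, 0, 0],
]


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


scores = {}
for i, label in enumerate(labels):
    row_sum = sum(cm[i])                 # records that truly belong to the class
    col_sum = sum(row[i] for row in cm)  # records predicted as the class
    precision = safe_div(cm[i][i], col_sum)
    recall = safe_div(cm[i][i], row_sum)
    scores[label] = (precision, recall)
    print(f"{label:>6}  precision={precision:.2f}  recall={recall:.2f}")
```

The matrix shows that the depth-3 tree never predicts good or vgood, which is why both of those rows in the report are zero.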



