# Decision Tree

What is a decision tree?

1) Supervised Learning method

2) Decision support tool that uses a tree-like graph or model of decisions and their possible consequences.

3) Forms the basis of ensemble variants such as boosted decision trees and random forests.

4) Can be used for categorical as well as continuous variables.

![](Capture.PNG)

![](Capture2.PNG)

Problem statement: predict, from census data, whether a person's income exceeds $50K per year.

To download the data, follow this link: https://archive.ics.uci.edu/ml/datasets/adult

In [2]:
#Import Libraries
import pandas as pd

In [3]:
#Read dataset
data=pd.read_csv('04+-+decisiontreeAdultIncome.csv')

In [4]:
data

,age,wc,education,marital status,race,gender,hours per week,IncomeClass
0,38,Private,HS-grad,Divorced,White,Male,40,<=50K
1,28,Private,Bachelors,Married,Black,Female,40,<=50K
2,37,Private,Masters,Married,White,Female,40,<=50K
3,31,Private,Masters,Never-married,White,Female,50,>50K
4,42,Private,Bachelors,Married,White,Male,40,>50K
...,...,...,...,...,...,...,...,...
19782,53,Private,Masters,Married,White,Male,40,>50K
19783,22,Private,Some-college,Never-married,White,Male,40,<=50K
19784,40,Private,HS-grad,Married,White,Male,40,>50K
19785,58,Private,HS-grad,Widowed,White,Female,40,<=50K


Description of the dataset and its features

In [5]:
data.age.unique()

array([38, 28, 37, 31, 42, 23, 32, 40, 59, 56, 19, 39, 49, 20, 45, 30, 21,
       24, 25, 57, 53, 44, 18, 47, 50, 43, 41, 48, 29, 36, 79, 27, 52, 46,
       33, 34, 76, 55, 22, 70, 51, 35, 26, 60, 90, 54, 65, 58, 64, 61, 62,
       66, 74, 67, 71, 63, 78, 69, 73, 68, 77, 75, 17, 80, 72, 81, 83, 84,
       85, 82, 88, 86], dtype=int64)

In [6]:
data.wc.unique()

array([' Private', ' Local-gov', ' Federal-gov', ' Never-worked'],
      dtype=object)

In [7]:
data.education.unique()

array([' HS-grad', ' Bachelors', ' Masters', ' Some-college',
       ' Doctorate', ' Prof-school', ' Preschool'], dtype=object)

In [14]:
data.race.unique()

array([' White', ' Black', ' Other', ' Asian-Pac-Islander',
       ' Amer-Indian-Eskimo'], dtype=object)

In [15]:
data.gender.unique()

array([' Male', ' Female'], dtype=object)

In [19]:
data.IncomeClass.unique()

array([' <=50K', ' >50K'], dtype=object)

In [22]:
#Check null values
data.isnull().sum(axis=0)

age               0
wc                0
education         0
marital status    0
race              0
gender            0
hours per week    0
IncomeClass       0
dtype: int64

In [23]:
#Check the data types
data.dtypes

age                int64
wc                object
education         object
marital status    object
race              object
gender            object
hours per week     int64
IncomeClass       object
dtype: object

In [24]:
#Create dummy variables
data_prep=pd.get_dummies(data,drop_first=True)
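To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row frame (`toy` is made up for illustration and is not part of the Adult dataset). It compares the columns produced with and without dropping the first level of the categorical column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Adult dataset
toy = pd.DataFrame({
    "age": [38, 28, 37],
    "education": ["HS-grad", "Bachelors", "Masters"],
})

# One indicator column per category level
all_levels = pd.get_dummies(toy)

# Same encoding, but the first level (in sorted order) is dropped,
# so it becomes the implicit baseline category
dropped = pd.get_dummies(toy, drop_first=True)

print(list(all_levels.columns))
print(list(dropped.columns))
```

With `drop_first=True`, a row whose remaining `education_*` indicators are all zero belongs to the dropped level (`Bachelors` in this toy frame). Numeric columns such as `age` pass through unchanged.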

In [25]:
data_prep

,age,hours per week,wc_ Local-gov,wc_ Never-worked,wc_ Private,education_ Doctorate,education_ HS-grad,education_ Masters,education_ Preschool,education_ Prof-school,education_ Some-college,marital status_ Never-married,marital status_ Widowed,marital status_Married,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,gender_ Male,IncomeClass_ >50K
0,38,40,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0
1,28,40,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
2,37,40,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0
3,31,50,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1
4,42,40,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19782,53,40,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1
19783,22,40,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0
19784,40,40,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1
19785,58,40,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0


In [27]:
#Create the X and Y variables
X=data_prep.iloc[:,:-1]

In [28]:
Y=data_prep.iloc[:,-1]

In [30]:
#Create Train and Test dataset

In [31]:
#Split the X and Y dataset into training and testing sets

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, Y_train, Y_test= train_test_split(X,Y, test_size=0.3,random_state=1234, stratify=Y)
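The `stratify` argument asks `train_test_split` to keep the class proportions roughly equal in the training and test sets. The sketch below checks this on hypothetical toy data (`X_toy` and `y_toy` are made up for illustration, with 20% positive labels), reusing the same `test_size` and `random_state` as above.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% of them in the positive class
X_toy = [[i] for i in range(100)]
y_toy = [1] * 20 + [0] * 80

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1234, stratify=y_toy
)

# Share of positive labels in each part; both should be close to 0.20
train_rate = sum(y_tr) / len(y_tr)
test_rate = sum(y_te) / len(y_te)
print(len(y_tr), len(y_te), train_rate, test_rate)
```

Without `stratify`, the split is purely random, so the positive share in a small test set can drift away from the share in the full data.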

In [34]:
from sklearn.tree import DecisionTreeClassifier

sklearn.tree.DecisionTreeClassifier parameters

1) max_depth- maximum depth of the tree

2) min_samples_split- minimum number of samples a node must contain before it can be split

3) min_samples_leaf- minimum number of samples required at a leaf node

4) max_leaf_nodes- maximum number of leaf nodes

5) splitter- strategy used to choose the split at each node: the best split ("best") or the best random split ("random")

How do we decide which feature gives the best split?
Entropy is one of the criteria. It measures the diversity, or impurity, of the data. Data that is well ordered has low diversity and therefore low entropy, whereas data that is randomly mixed is impure and has high entropy. Our aim is to choose splits that produce lower entropy values.

![](Capture3.PNG)

![](Capture4.PNG)
The entropy here is approximately 0.88. This is a high entropy, meaning a high level of disorder (a low level of purity). With two classes and base-2 logarithms, entropy lies between 0 and 1. With more than two classes it can exceed 1, but the interpretation is the same: higher values mean more disorder.
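As a rough sketch of how such a value is computed, the function below implements Shannon entropy with base-2 logarithms, H = -Σ p·log2(p), over a list of class proportions. The function name `entropy` and the example proportions are assumptions for illustration; the 0.3/0.7 split is a hypothetical case that happens to give a value near 0.88 and is not read off the figure.

```python
import math

def entropy(proportions):
    """Shannon entropy (base 2) of class proportions that sum to 1."""
    # Classes with proportion 0 are skipped: they contribute nothing
    return sum(-p * math.log2(p) for p in proportions if p > 0)

pure = entropy([1.0, 0.0])             # only one class present
even = entropy([0.5, 0.5])             # two classes, evenly mixed
skewed = entropy([0.3, 0.7])           # hypothetical 30/70 mix, about 0.88
three_way = entropy([1/3, 1/3, 1/3])   # three classes: value exceeds 1

print(pure, even, round(skewed, 4), round(three_way, 4))
```

A pure node gives 0, an even two-class mix gives 1, and an even three-class mix gives log2(3), which is why entropy can exceed 1 when there are more than two classes.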

![](Capture5.PNG)

The x-axis shows the proportion of data points in each bubble that belong to the positive class, and the y-axis shows the corresponding entropy. The graph has an inverted ‘U’ shape. Entropy is lowest at the extremes, when a bubble contains either no positive instances or only positive instances: a pure bubble has zero disorder. Entropy is highest in the middle, when the bubble is evenly split between positive and negative instances: disorder is at its maximum because there is no majority class.

![](1.PNG)
![](2.PNG)
![](3.PNG)
According to the entropy criterion, credit history has the lower entropy, so we choose credit history as the feature to split on.

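The second criterion is known as Gini. The name comes from economics, where the Gini coefficient measures inequality; in a decision tree, Gini impurity measures how mixed the classes in a node are.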
![](4.PNG)
Credit history has the lower Gini value.
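For comparison with entropy, here is a minimal sketch of Gini impurity, computed as 1 - Σ p² over the class proportions in a node. The function name `gini_impurity` and the example proportions are assumptions for illustration and are not taken from the figures.

```python
def gini_impurity(proportions):
    """Gini impurity of class proportions that sum to 1."""
    return 1.0 - sum(p * p for p in proportions)

pure = gini_impurity([1.0, 0.0])     # only one class present
even = gini_impurity([0.5, 0.5])     # two classes, evenly mixed
skewed = gini_impurity([0.3, 0.7])   # hypothetical 30/70 mix

print(pure, even, round(skewed, 2))
```

Like entropy, Gini impurity is 0 for a pure node and largest for an even mix, but for two classes its maximum is 0.5 rather than 1.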

Information Gain-
![](5.PNG)
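Information gain is commonly defined as the entropy of the parent node minus the size-weighted average entropy of the child nodes produced by a split. The sketch below works directly from lists of labels; the helper names and the toy labels are hypothetical and chosen only to show the two extreme cases.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy_of_labels(ch) for ch in children)
    return entropy_of_labels(parent) - weighted

# Hypothetical parent node: 4 "yes" and 4 "no"
parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split whose children are as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(gain_perfect, gain_useless)
```

The feature whose split yields the highest information gain is the one that reduces disorder the most, which is the same as saying its children have the lowest weighted entropy.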

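6) max_features- number of features to consider when looking for the best split

7) presort- whether to presort the data to speed up fitting (deprecated and removed in recent scikit-learn versions)

8) criterion- function used to measure the quality of a split, for example "gini" or "entropy"

9) min_impurity_decrease- a node is split only if the split decreases the impurity by at least this value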
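The parameters listed above are passed as keyword arguments to `DecisionTreeClassifier`. The sketch below fits one constrained tree and one tree with default settings on synthetic data from `make_classification`; the data and the specific parameter values are assumptions chosen for illustration, not tuned settings for the Adult dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1234)

# A constrained tree: entropy criterion, limited depth, minimum leaf size
shallow = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,
    min_samples_leaf=10,
    random_state=1234,
).fit(X_demo, y_demo)

# Default settings: the tree grows until no further split is possible
full = DecisionTreeClassifier(random_state=1234).fit(X_demo, y_demo)

print("shallow:", shallow.get_depth(), shallow.get_n_leaves())
print("full:   ", full.get_depth(), full.get_n_leaves())
```

Limits such as `max_depth` and `min_samples_leaf` keep the tree small, which usually makes it easier to read and less prone to overfitting than a fully grown tree.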


In [36]:
#Import Decision Tree classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

In [39]:
#Train the model
dtc= DecisionTreeClassifier(random_state=1234)
dtc.fit(X_train, Y_train)

DecisionTreeClassifier(random_state=1234)

In [40]:
Y_predict=dtc.predict(X_test)

In [41]:
Y_predict

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

In [43]:
#Evaluate the model
from sklearn.metrics import confusion_matrix

In [46]:
cm=confusion_matrix(Y_test, Y_predict)

In [47]:
cm

array([[3814,  559],
       [ 800,  764]], dtype=int64)

In [45]:
score=dtc.score(X_test, Y_test)

In [48]:
score

0.7710965133906014
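The score above is the accuracy, and it can be recovered by hand from the confusion matrix. The sketch below copies the four counts from the matrix printed earlier (in scikit-learn's layout, rows are actual classes and columns are predicted classes) and also derives precision and recall for the `>50K` class, which the notebook does not compute.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class)
tn, fp = 3814, 559   # actual <=50K: predicted <=50K, predicted >50K
fn, tp = 800, 764    # actual >50K:  predicted <=50K, predicted >50K

total = tn + fp + fn + tp

# Fraction of all test rows that were classified correctly
accuracy = (tn + tp) / total
# Of the rows predicted >50K, the fraction that really are >50K
precision = tp / (tp + fp)
# Of the rows that really are >50K, the fraction the model found
recall = tp / (tp + fn)

print(total, accuracy, round(precision, 4), round(recall, 4))
```

Accuracy matches the reported score of about 0.771, while recall for the `>50K` class is below 0.5, so the model misses more than half of the high-income cases even though its overall accuracy looks reasonable.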