A decision tree is a supervised learning model used to solve classification and regression problems.

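Types:
1. CART (Classification and Regression Trees)- for solving both classification and regression problems; splits are typically chosen with the Gini index (classification) or variance reduction (regression).
2. ID3- for solving classification problems; splits are chosen with information gain.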

At each level of decision tree construction, a feature must be selected as the splitting attribute (the data is split into subsets based on the values of that feature).

Simple mechanism:

a. Based on the data provided, the features should produce a regression or classification output

b. Select the best feature to split the dataset on

c. Split the data using the best feature as the basis

d. Repeat the splitting on each subset until every subset is pure or a stopping condition (such as a maximum depth) is reached

Entropy for a two-class dataset: range 0 to 1

Representation: $E(D) = -p_{yes}\log_2(p_{yes}) - p_{no}\log_2(p_{no})$
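
As a quick check of the two-class formula, the sketch below evaluates it for a few class proportions. The helper name `binary_entropy` is made up for this illustration and is not part of the notebook's code; it assumes that a class with probability 0 contributes nothing to the sum.

```python
import math

def binary_entropy(p_yes):
    """Entropy in bits of a two-class dataset where a fraction p_yes is 'yes'."""
    total = 0.0
    for p in (p_yes, 1.0 - p_yes):
        # A class with zero probability contributes nothing to the sum
        if p > 0:
            total -= p * math.log2(p)
    return total

print(binary_entropy(1.0))  # pure dataset: lowest entropy
print(binary_entropy(0.5))  # evenly mixed dataset: highest entropy
print(binary_entropy(0.9))  # mostly one class: somewhere in between
```

The pure case sits at the bottom of the range and the 50/50 case at the top, which is where the 0 to 1 range comes from.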

Choosing splitting features if 'n' features in the datasets:

Entropy- a measure of impurity in the dataset. If all samples belong to the same class the dataset is 'pure', otherwise it is 'impure'. For two classes, entropy is 0 for a pure dataset and reaches its maximum of 1 when the classes are evenly split.

'n' class dataset: range 0 to log2(n)

Representation: $E(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$; c is the number of class labels (the 'n' above), $p_i$ is the proportion of samples belonging to class i
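
The upper bound of log2(n) can be checked with a small sketch. `entropy_of_labels` is a hypothetical helper written for this illustration; it computes the proportions from a plain list of labels.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy in bits of a sequence of class labels."""
    n = len(labels)
    total = 0.0
    for count in Counter(labels).values():
        p = count / n  # proportion of samples in this class
        total -= p * math.log2(p)
    return total

# Four equally frequent classes reach the upper bound log2(4)
four_classes = ["a", "b", "c", "d"]
print(entropy_of_labels(four_classes), math.log2(4))

# A single class gives the lower bound
print(entropy_of_labels(["a", "a", "a"]))
```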

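1. Information Gain: a measure of the reduction in entropy produced by a split; the lower the weighted entropy of the child nodes, the higher the information gain. Primarily used for selecting the best splitting feature.

E(root) - weighted average of E(child)

Representation: $IG = E(\text{root}) - \sum_{\text{child}} \frac{|\text{child}|}{|\text{root}|} E(\text{child})$, where the sum runs over the subsets produced by splitting on the feature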
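
To make the weighting concrete, here is a minimal sketch on a toy label list. The helper names and the toy splits are invented for this illustration; they are not the notebook's functions or the Titanic data.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy in bits of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy_of_labels(child) for child in children)
    return entropy_of_labels(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_split = [[1, 1, 1, 1], [0, 0, 0, 0]]  # each child is pure
useless_split = [[1, 1, 0, 0], [1, 1, 0, 0]]  # each child is as mixed as the parent

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that produces pure children recovers the whole parent entropy as gain, while a split that leaves the class mix unchanged gains nothing, so the feature with the larger gain is the better splitting candidate.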

2. Gini Index- a measure of the impurity of a split; 0 means pure. It ranges from 0 up to 0.5 for two classes and stays below 1 for any number of classes.

$Gini = 1 - (p_{yes}^2 + p_{no}^2)$
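
A short sketch of the two-class Gini formula, using a hypothetical helper `gini` that takes the 'yes' proportion:

```python
def gini(p_yes):
    """Gini index of a two-class dataset where a fraction p_yes is 'yes'."""
    p_no = 1.0 - p_yes
    return 1.0 - (p_yes ** 2 + p_no ** 2)

print(gini(1.0))  # pure dataset
print(gini(0.5))  # evenly mixed dataset, the two-class maximum
print(gini(0.9))  # mostly one class
```

Like entropy, the value is lowest for a pure dataset and highest for an even mix, so either measure can rank candidate splits.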

In [30]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [89]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
df=pd.read_csv('train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [22]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [23]:
unwanted_cols=['PassengerId','Name','Ticket','Fare','Cabin','Embarked']

In [24]:
df=df.drop(unwanted_cols,axis=1)

In [25]:
df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch'], dtype='object')

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 41.9+ KB


In [28]:
#Fill missing ages with the mean age (Age is the only column with missing values)
df['Age']=df['Age'].fillna(df['Age'].mean())

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 41.9+ KB


In [33]:
lbl_enc=LabelEncoder()

df['Sex']=lbl_enc.fit_transform(df['Sex'])
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int64  
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
dtypes: float64(1), int64(5)
memory usage: 41.9 KB
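
LabelEncoder numbers the distinct values in sorted order, so 'female' becomes 0 and 'male' becomes 1. The plain-Python sketch below mimics that behaviour for illustration; `label_encode` is a hypothetical helper, not a scikit-learn function.

```python
def label_encode(values):
    """Number the sorted distinct values from 0 and encode each value with its number."""
    classes = sorted(set(values))
    mapping = {cls: idx for idx, cls in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Sex values of the first five passengers shown earlier
codes, mapping = label_encode(["male", "female", "female", "female", "male"])
print(mapping)
print(codes)
```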


In [34]:
df.columns=['output', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']
df.head()

Unnamed: 0,output,Pclass,Sex,Age,SibSp,Parch
0,0,3,1,22.0,1,0
1,1,1,0,38.0,1,0
2,1,3,0,26.0,0,0
3,1,1,0,35.0,1,0
4,0,3,1,35.0,0,0


In [119]:
#Entropy calculation function
def entropy(cols):
    #Get the unique values and how often each one occurs
    count=np.unique(cols,return_counts=True)
    ent=0.0
    for i in count[1]:
        #Probability of occurrence
        prob=i/cols.shape[0]
        #Representation of entropy formula
        ent+=(-1.0*prob*np.log2(prob))
    return ent

In [120]:
#Separating dataset
def separate_data(data,fkey,fval):
    #Rows whose feature value is at least the threshold go right, the rest go left
    mask=data[fkey]>=fval
    rightr=data[mask]
    leftl=data[~mask]
    return rightr,leftl

In [132]:
#Information gain function
def information_gain(data,fkey,fval):
    right,left=separate_data(data,fkey,fval)
    #Fraction of the samples that falls into each child
    l=float(left.shape[0])/data.shape[0]
    r=float(right.shape[0])/data.shape[0]
    #If one side is empty the split separates nothing, so return a sentinel that never wins
    if left.shape[0]==0 or right.shape[0]==0:
        return -99999
    #Parent entropy minus the weighted entropy of the children
    info_gain=entropy(data.output)-(l*entropy(left.output)+r*entropy(right.output))
    return info_gain

In [133]:
class DecisionTree:
    #Initialize required variables for the tree
    def __init__(self,depth=0,max_depth=3):
        self.left=None
        self.right=None
        self.fkey=None
        self.fval=None
        self.depth=depth
        self.max_depth=max_depth
        self.target=None

    def train(self,X_train):
        feature=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']
        info_gain_list=[]
        for i in feature:
            #Calculate information gain for all features
            info_gain=information_gain(X_train,i,X_train[i].mean())
            info_gain_list.append(info_gain)
        #The feature with the highest information gain becomes this node's split
        self.fkey=feature[np.argmax(info_gain_list)]
        #Use the column mean as the split threshold
        self.fval=X_train[self.fkey].mean()
        print('Split',self.fkey)
        n_right,n_left=separate_data(X_train,self.fkey,self.fval)
        #Reset the row labels of each subset
        n_right=n_right.reset_index(drop=True)
        n_left=n_left.reset_index(drop=True)
        #Leaf: one side is empty, so stop and predict the majority class
        if n_left.shape[0]==0 or n_right.shape[0]==0:
            if X_train.output.mean()>=0.5:
                self.target='SURVIVE'
            else:
                self.target='DEAD'
            return 
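        #Leaf: depth limit exceeded (with '>' nodes at max_depth still split), so predict the majority class
        if self.depth>self.max_depth: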
            if X_train.output.mean()>=0.5:
                self.target='SURVIVE'
            else:
                self.target='DEAD'
            return 
            
        self.left=DecisionTree(self.depth+1,self.max_depth)
        self.left.train(n_left)
        self.right=DecisionTree(self.depth+1,self.max_depth)
        self.right.train(n_right)
        if X_train.output.mean()>=0.5:
            self.target='SURVIVE'
        else:
            self.target='DEAD'
        return 

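    def predict(self,test):
        #Values at or above the threshold go right, matching separate_data
        if test[self.fkey]>=self.fval:
            #Check for last node
            if self.right is None:
                return self.target
            return self.right.predict(test)
        #Everything else goes left
        if self.left is None:
            return self.target
        return self.left.predict(test)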
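
Prediction walks down the tree from the root: at each node the sample's value for the node's feature is compared with the node's threshold, and the walk continues right or left until a leaf is reached. The sketch below shows this on a hand-built tree. The `Node` class, the thresholds and the leaf labels are all made up for illustration and are not taken from the trained tree.

```python
class Node:
    """Hypothetical minimal tree node; a node without children is a leaf."""
    def __init__(self, fkey=None, fval=None, left=None, right=None, target=None):
        self.fkey = fkey      # feature tested at this node
        self.fval = fval      # threshold for that feature
        self.left = left      # subtree for values below the threshold
        self.right = right    # subtree for values at or above the threshold
        self.target = target  # label returned when this node is a leaf

def predict(node, sample):
    # Stop at a leaf and return its label
    if node.left is None and node.right is None:
        return node.target
    # Otherwise follow the branch that matches the sample's feature value
    if sample[node.fkey] >= node.fval:
        return predict(node.right, sample)
    return predict(node.left, sample)

# Hand-built tree: split on Sex first, then on Age for the right branch
tree = Node(
    fkey="Sex", fval=0.5,
    left=Node(target="SURVIVE"),
    right=Node(fkey="Age", fval=10.0,
               left=Node(target="SURVIVE"),
               right=Node(target="DEAD")),
)

print(predict(tree, {"Sex": 0, "Age": 30.0}))
print(predict(tree, {"Sex": 1, "Age": 30.0}))
print(predict(tree, {"Sex": 1, "Age": 5.0}))
```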

In [142]:
split=int(0.8*df.shape[0])
train_data=df[:split]
test_data=df[split:]
test_data=test_data.reset_index(drop=True)
test_data

Unnamed: 0,output,Pclass,Sex,Age,SibSp,Parch
0,1,1,1,48.000000,1,0
1,0,3,1,29.000000,0,0
2,0,2,1,52.000000,0,0
3,0,3,1,19.000000,0,0
4,1,1,0,38.000000,0,0
...,...,...,...,...,...,...
174,0,2,1,27.000000,0,0
175,1,1,0,19.000000,0,0
176,0,3,0,29.699118,1,2
177,1,1,1,26.000000,0,0
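
The split above takes the first 80% of the rows in file order. If the row order were related to the label, a shuffled split might be a safer choice. A minimal sketch of one, assuming a fixed seed for reproducibility; `train_test_split_indices` is a hypothetical helper and the row count of 10 is only for illustration.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.8, seed=0):
    """Return shuffled row positions for the training part and the test part."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable
    random.Random(seed).shuffle(indices)
    split = int(train_fraction * n_rows)
    return indices[:split], indices[split:]

train_idx, test_idx = train_test_split_indices(10)
print(len(train_idx), len(test_idx))
```

With a DataFrame, the two index lists could then be passed to `df.iloc[...]` to select the rows for each part.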


In [143]:
dec_tre=DecisionTree()

In [144]:
dec_tre.train(train_data)

Split Sex
Split Pclass
Split Age
Split SibSp
Split Pclass
Split SibSp
Split SibSp
Split Parch
Split Pclass
Split Parch
Split SibSp
Split Age
Split Age
Split SibSp
Split Parch
Split Parch
Split Pclass
Split Pclass
Split Age
Split SibSp
Split Age
Split Parch
Split SibSp
Split Age
Split Age
Split Age
Split SibSp
Split Parch
Split Age
Split Age
Split Parch


In [151]:
dec_tre.right.fkey

'Pclass'

In [152]:
dec_tre.left.fkey

'Pclass'

In [157]:
dec_tre.left.right.fkey

'Parch'

In [158]:
dec_tre.left.left.fkey

'Age'