<a href="https://colab.research.google.com/github/DevanshParmar/ICG-Summer-Program-2021-DS/blob/main/Decision_Tree_Model_on_Titanic_Survival_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Decision Tree Model on Titanic Survival Dataset**
This notebook implements a decision tree classifier from scratch and applies it to the Titanic survival dataset.

#### **Uploads**
Setting up libraries and uploading dataset files.

In [None]:
import numpy as np
import pandas as pd
from google.colab import files

In [None]:
upload_train = files.upload()
upload_test = files.upload()

Saving train.csv to train.csv


Saving test.csv to test.csv


#### **Data Preprocessing**
We define a function that loads a CSV file and cleans the dataframe: it drops the PassId column, encodes Sex as a number (male=0, female=1), and fills missing Age values with the median age.

In [None]:
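def uploadto(address):
    # Load a CSV file and return a cleaned, all-numeric dataframe.
    data = pd.read_csv(address)
    # Assumes the file has exactly these eight columns, in this order.
    col_names = ['PassId', 'Survived', 'PClass', 'Sex', 'Age', 'SibSp', 'ParCh', 'Fare']
    data.columns = col_names
    # Encode Sex as a number: male=0, female=1.
    data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
    # PassId is not used as a feature.
    data = data.drop(columns='PassId')
    # Fill missing ages with the median age.
    data['Age'] = data['Age'].fillna(data['Age'].median())
    print(data)
    return data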
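
The two cleaning steps can be checked on a toy frame. This is a minimal sketch, not the notebook's own code: the rows are made up, and `Series.map` with a column-wise `fillna` is just one way to express the encoding and the median fill.

```python
import pandas as pd

# Toy rows for illustration; not taken from the Titanic files.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 30.0],
})

# Encode Sex as male=0, female=1.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Fill the missing age with the median of the known ages (22 and 30).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

The missing age becomes 26.0, the median of the two known values.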

In [None]:
train_df = uploadto('/content/train.csv')
test_df = uploadto('/content/test.csv')

     Survived  PClass  Sex   Age  SibSp  ParCh     Fare
0           0       3    0  22.0      1      0   7.2500
1           1       1    1  38.0      1      0  71.2833
2           1       3    1  26.0      0      0   7.9250
3           1       1    1  35.0      1      0  53.1000
4           0       3    0  35.0      0      0   8.0500
..        ...     ...  ...   ...    ...    ...      ...
615         1       2    1  24.0      1      2  65.0000
616         0       3    0  34.0      1      1  14.4000
617         0       3    1  26.0      1      0  16.1000
618         1       2    1   4.0      2      1  39.0000
619         0       2    0  26.0      0      0  10.5000

[620 rows x 7 columns]
     Survived  PClass  Sex   Age  SibSp  ParCh     Fare
0           0       3    0  27.0      1      0  14.4542
1           1       1    0  42.0      1      0  52.5542
2           1       3    0  20.0      1      1  15.7417
3           0       3    0  21.0      0      0   7.8542
4           0       3   

#### **Decision Tree Functions**
In the next three blocks, we define:
1. Entropy Function
2. Division Algorithm
3. Information Gain Function

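Together they decide which feature each node splits on: entropy measures how mixed the labels are, the division step splits the rows at a threshold, and information gain scores how much that split reduces the entropy.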
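
For reference, entropy and information gain are usually written as follows. This is the standard textbook form; the symbols ($S$, $S_{\text{left}}$, $S_{\text{right}}$, $p_c$, $t$) are chosen here for illustration and do not appear in the notebook.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
IG(S, t) = H(S) - \left( \frac{|S_{\text{left}}|}{|S|}\, H(S_{\text{left}}) + \frac{|S_{\text{right}}|}{|S|}\, H(S_{\text{right}}) \right)
```

Here $p_c$ is the fraction of rows in $S$ with class $c$, and the threshold $t$ sends rows whose feature value is below $t$ to $S_{\text{left}}$ and all other rows to $S_{\text{right}}$.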

In [None]:
def entropy(target_col):
    # Shannon entropy (in bits) of the class labels in target_col.
    _, counts = np.unique(target_col, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))
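
A quick sanity check helps when reading entropy values. The sketch below uses a stand-alone helper, `label_entropy` (a hypothetical name, written only for this example), on three made-up label sets: an even mix, a pure set, and a skewed one.

```python
import numpy as np

def label_entropy(labels):
    # Shannon entropy in bits of an array of class labels (illustrative helper).
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

balanced = label_entropy([0, 0, 1, 1])   # half survived, half did not
pure = label_entropy([1, 1, 1, 1])       # everyone survived
skewed = label_entropy([0, 0, 0, 1])     # one survivor in four

print(round(balanced, 4), round(skewed, 4))
```

An even mix gives the maximum of 1 bit for two classes, a pure set gives 0, and the skewed set falls in between at roughly 0.81 bits.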

In [None]:
def division(input_data, title, mean):
    # Split the rows at the threshold: values >= mean go right, the rest go left.
    # A boolean mask avoids building the two frames one row at a time.
    mask = input_data[title] >= mean
    right = input_data[mask]
    left = input_data[~mask]
    return right, left

In [None]:
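def iGain(input_data, title, mean):
    # Information gain from splitting input_data on column `title` at threshold `mean`.
    right, left = division(input_data, title, mean)
    # A split that leaves one side empty is useless; the sentinel ranks it below any real split.
    if left.shape[0] == 0 or right.shape[0] == 0:
        return -99999
    k = input_data.shape[0]
    left_ratio = left.shape[0] / k
    right_ratio = right.shape[0] / k
    # Parent entropy minus the size-weighted entropy of the two children.
    igain = entropy(input_data.Survived) - (left_ratio * entropy(left.Survived) + right_ratio * entropy(right.Survived))
    return igain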
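
To see how information gain ranks features, here is a small worked example on made-up data. It is a sketch under stated assumptions: `label_entropy` and `threshold_gain` are hypothetical stand-alone helpers that work on plain arrays, and a degenerate split returns 0.0 here rather than the notebook's -99999 sentinel.

```python
import numpy as np

def label_entropy(labels):
    # Shannon entropy in bits of an array of class labels.
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def threshold_gain(values, labels, threshold):
    # Information gain of splitting at `threshold`; values >= threshold go right.
    values, labels = np.asarray(values), np.asarray(labels)
    right = labels[values >= threshold]
    left = labels[values < threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * label_entropy(left)
                + len(right) * label_entropy(right)) / len(labels)
    return label_entropy(labels) - weighted

# Toy data for illustration; not taken from the Titanic files.
sex      = [0, 0, 0, 1, 1, 1]
age      = [20, 40, 30, 25, 35, 45]
survived = [0, 0, 0, 1, 1, 1]

# Split each feature at its mean, as the notebook does.
gain_sex = threshold_gain(sex, survived, np.mean(sex))
gain_age = threshold_gain(age, survived, np.mean(age))

print(round(gain_sex, 4), round(gain_age, 4))
```

In this toy set, splitting on `sex` separates the classes perfectly (gain of 1 bit), while splitting on `age` at its mean barely helps (about 0.08 bits), so a tree built this way would split on `sex` first.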

#### **Modeling**
In the next block we define the decision tree model. 
1. The first function inside the class initialises the model.
2. The second is the main training module.
3. The third function is the prediction module.

In [None]:
class DT:
    def __init__(self, depth=0, max_depth=5):
        self.left = None          # subtree for rows with feature < threshold
        self.right = None         # subtree for rows with feature >= threshold
        self.title_name = None    # feature this node splits on
        self.mean_val = None      # split threshold (mean of that feature)
        self.depth = depth
        self.max_depth = max_depth
        self.target = None        # majority class of the rows reaching this node

    def train_model(self, input_train):
        features = ['PClass', 'Sex', 'Age', 'SibSp', 'ParCh', 'Fare']
        # Majority class at this node; it is the prediction if the node stays a leaf.
        self.target = 1 if input_train.Survived.mean() >= 0.5 else 0

        # Split on the feature whose mean threshold gives the highest information gain.
        iGains = [iGain(input_train, col, input_train[col].mean()) for col in features]
        self.title_name = features[np.argmax(iGains)]
        self.mean_val = input_train[self.title_name].mean()

        r_data, l_data = division(input_train, self.title_name, self.mean_val)
        r_data = r_data.reset_index(drop=True)
        l_data = l_data.reset_index(drop=True)

        # Stop if the split leaves one side empty or the depth limit is reached.
        if l_data.shape[0] == 0 or r_data.shape[0] == 0 or self.depth >= self.max_depth:
            return

        self.left = DT(self.depth + 1, self.max_depth)
        self.left.train_model(l_data)
        self.right = DT(self.depth + 1, self.max_depth)
        self.right.train_model(r_data)

    def predictions(self, row):
        # Leaf node: return the stored majority class.
        if self.left is None or self.right is None:
            return self.target
        # Same rule as division(): values >= threshold go right, so ties are not dropped.
        if row[self.title_name] >= self.mean_val:
            return self.right.predictions(row)
        return self.left.predictions(row)

In [None]:
model = DT()
model.train_model(train_df)
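
It can help to look at the splits a fitted tree has learned. The sketch below is an illustrative helper, not part of the original notebook: `describe_tree` is a hypothetical function that only relies on the attribute names used by `DT` (`left`, `right`, `title_name`, `mean_val`, `target`), and the tree it prints is a hand-built stand-in rather than a trained model.

```python
from types import SimpleNamespace

def describe_tree(node, indent=0):
    """Return text lines describing a tree whose nodes use DT's attribute names."""
    pad = "  " * indent
    # A node without children is a leaf and predicts its stored class.
    if node.left is None or node.right is None:
        return [f"{pad}predict {node.target}"]
    lines = [f"{pad}{node.title_name} < {node.mean_val:.2f}"]
    lines += describe_tree(node.left, indent + 1)
    lines.append(f"{pad}{node.title_name} >= {node.mean_val:.2f}")
    lines += describe_tree(node.right, indent + 1)
    return lines

def leaf(target):
    # Hand-built leaf for the demo.
    return SimpleNamespace(left=None, right=None, title_name=None,
                           mean_val=None, target=target)

# One split on Sex at 0.5: left predicts 0, right predicts 1.
toy = SimpleNamespace(title_name="Sex", mean_val=0.5, target=0,
                      left=leaf(0), right=leaf(1))

tree_lines = describe_tree(toy)
print("\n".join(tree_lines))
```

In the notebook, the same helper could be called as `print("\n".join(describe_tree(model)))` once `model` has been trained.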

#### **Predictions and Accuracy**
The next two blocks compute evaluation metrics for the model: accuracy, loss (here, the number of misclassified rows), F1 score, sensitivity (recall), and precision.

In [None]:
def stats(data):
    # Predict every row with the globally trained `model`.
    prediction = []
    for i in range(data.shape[0]):
        prediction.append(model.predictions(data.loc[i]))
    prediction = np.array(prediction)
    survive_data = np.array(data['Survived'])

    # Confusion-matrix counts; `loss` is the number of misclassified rows.
    loss = 0
    f_neg = 0
    f_pos = 0
    t_neg = 0
    t_pos = 0

    for i, j in zip(prediction, survive_data):
        if i == 1 and j == 1:
            t_pos += 1
        elif i == 1 and j == 0:
            f_pos += 1
            loss += 1
        elif i == 0 and j == 1:
            f_neg += 1
            loss += 1
        else:
            t_neg += 1

    # Guard each ratio so an empty denominator gives 0 instead of raising ZeroDivisionError.
    rec = t_pos / (t_pos + f_neg) if (t_pos + f_neg) else 0.0
    prc = t_pos / (t_pos + f_pos) if (t_pos + f_pos) else 0.0
    acc = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    f1s = 2 * prc * rec / (prc + rec) if (prc + rec) else 0.0

    print('   Accuracy is {:.2f}%'.format(100*acc))
    print('       Loss is',loss)
    print('   F1 Score is {:.4f}'.format(f1s))
    print('Sensitivity is {:.4f}'.format(rec))
    print('  Precision is {:.4f}'.format(prc))
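
The formulas in `stats` are easier to follow with concrete numbers. The sketch below applies the same four formulas to made-up confusion-matrix counts; `confusion_metrics` is a hypothetical helper name, and the counts are for illustration only, not results from the Titanic data.

```python
def confusion_metrics(t_pos, f_pos, t_neg, f_neg):
    # Same formulas as stats(), applied directly to confusion-matrix counts.
    recall = t_pos / (t_pos + f_neg)          # share of actual survivors found
    precision = t_pos / (t_pos + f_pos)       # share of predicted survivors that are correct
    accuracy = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 20 passengers, 4 of them misclassified.
m = confusion_metrics(t_pos=6, f_pos=2, t_neg=10, f_neg=2)
print(m)
```

With these counts, 16 of 20 rows are correct (accuracy 0.8), and precision, recall, and F1 all equal 0.75 because the numbers of false positives and false negatives match.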

In [None]:
print("Statistics for Training dataset are:")
print(" ")
stats(train_df)
print(" ")
print(" ")
print("Statistics for Test dataset are:")
print(" ")
stats(test_df)

Statistics for Training dataset are:
 
   Accuracy is 82.90%
       Loss is 106
   F1 Score is 0.7686
Sensitivity is 0.7273
  Precision is 0.8148
 
 
Statistics for Test dataset are:
 
   Accuracy is 83.39%
       Loss is 45
   F1 Score is 0.7458
Sensitivity is 0.6804
  Precision is 0.8250


#### **References**

1. Gagan Panwar's YouTube playlist on the same topic was a great help: www.youtube.com/playlist?list=PL9mhv0CavXYg3KFKct0JnslSwBCpAd_g0
2. Some Towards Data Science (TDS) articles were helpful, especially: www.towardsdatascience.com/decision-trees-for-classification-id3-algorithm-explained-89df76e72df1
3. This Exsilio blog post was very helpful in interpreting the final statistics of the model: www.blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/