# 01. DT 1 - Introduction to Decision Trees

### Decision Trees
   - A simple tree-like structure in which the model makes a decision at every node
   - Useful for simple tasks
   - One of the most popular algorithms
   - Easy to explain: the decision process can be traced step by step

### Why are decision trees popular?
   - Easy to interpret and present
   - Well-defined logic that mimics human reasoning
   - Ensembles of decision trees, such as Random Forests, are more powerful classifiers
   - Feature values are preferred to be **categorical**. If the values are continuous, they are discretized prior to building the model.
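
As a rough illustration of the discretization step mentioned in the last bullet, the sketch below bins a continuous column into categories with `pd.cut`. The ages, bin edges, and labels are made-up values chosen only for the example.

```python
import pandas as pd

# Hypothetical ages; the bin edges and labels are illustrative choices.
ages = pd.Series([2.0, 14.0, 22.0, 38.0, 54.0])

# pd.cut assigns each value to an interval of the form (left, right].
age_group = pd.cut(
    ages,
    bins=[0, 12, 18, 60],
    labels=["child", "teen", "adult"],
)

print(age_group.tolist())
```

Each age is replaced by the label of the interval it falls into, so the tree can treat the feature as categorical.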

### Build Decision Trees
- Two common algorithms
   - CART (Classification and Regression Trees) → uses the Gini index as its metric for classification
   - ID3 (Iterative Dichotomiser 3) → uses entropy and information gain as its metrics
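
For comparison with the entropy-based metric, here is a minimal sketch of the Gini index for a column of class labels, computed as 1 minus the sum of squared class probabilities. The function name `gini_index` is a hypothetical helper and not part of any library.

```python
import numpy as np

def gini_index(col):
    """Gini impurity of a 1-D array of class labels: 1 - sum(p_i ** 2)."""
    _, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()  # class probabilities
    return 1.0 - np.sum(p ** 2)

mixed = gini_index(np.array([1, 1, 0, 0]))  # evenly mixed binary labels
pure = gini_index(np.array([1, 1, 1, 1]))   # a single class only
print(mixed, pure)
```

Like entropy, the value is largest for an evenly mixed node and zero for a pure node.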

# 02. DT 3 - Process Kaggle Titanic Dataset

### Decision Trees
Problem: Titanic Survivor Prediction (Kaggle challenge)

### Learning Goals
- How to pre-process the data
    - Dropping features that are not useful
    - Filling in the missing values (data imputation)
- Creating a binary decision tree from scratch

In [1]:
import numpy as np

In [2]:
import pandas as pd

In [3]:
data=pd.read_csv("../Csv Files/titanic.csv")

In [4]:
data.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
data.head(n=10)

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


In [6]:
print(data.shape)

(891, 12)


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
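
The non-null counts above imply 177 missing values in `Age` (891 - 714), 687 in `Cabin` (891 - 204), and 2 in `Embarked` (891 - 889). A more direct way to get such counts is `isnull().sum()`. The sketch below uses a tiny made-up frame, since the CSV itself is not reproduced here.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# isnull() marks missing cells; sum() counts them per column.
missing = toy.isnull().sum()
print(missing)
```

A column that is mostly missing, as `Cabin` is here, is a natural candidate for dropping rather than imputing.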


In [8]:
columns_to_drop=["PassengerId","Name","Ticket","Cabin","Embarked"]

In [10]:
data.drop?

In [11]:
# `columns=` already selects the column axis, so `axis=1` is not needed.
# The equivalent older form is data.drop(columns_to_drop, axis=1), where
# axis is 0 / 'index' for rows or 1 / 'columns' for columns.
data_clean=data.drop(columns=columns_to_drop)

In [12]:
data_clean.head()

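|  | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 |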


In [15]:
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
data_clean["Sex"]=le.fit_transform(data_clean["Sex"])

In [16]:
data_clean.head()

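|  | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 |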
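
In the output above, `female` became 0 and `male` became 1, because `LabelEncoder` numbers the class labels in sorted order. The sketch below reproduces that mapping with plain pandas to make it explicit. It is an illustration of the idea rather than a replacement for the encoder, and the sample values are made up.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])

# Build the label -> code mapping from the sorted unique labels,
# which gives female=0 and male=1 as in the output above.
classes = sorted(sex.unique())
mapping = {label: code for code, label in enumerate(classes)}

encoded = sex.map(mapping)
print(mapping)
print(encoded.tolist())
```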


In [17]:
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int64
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
dtypes: float64(2), int64(5)
memory usage: 48.8 KB


In [20]:
data_clean.fillna?
# fillna fills entries that are NA (Not Available) or NaN (Not a Number)

In [21]:
# Only Age has missing values, so fill that column with its own mean
data_clean["Age"]=data_clean["Age"].fillna(data_clean["Age"].mean())

In [22]:
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int64
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
dtypes: float64(2), int64(5)
memory usage: 48.8 KB


In [23]:
# We can also use SimpleImputer from sklearn.impute (called Imputer in older scikit-learn versions)
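
A minimal sketch of the scikit-learn route, assuming a recent release where the class is `sklearn.impute.SimpleImputer`. The frame is made-up stand-in data, not the Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in data with one missing age.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 8.4583, 7.925],
})

# strategy="mean" replaces each missing value with the column mean.
imputer = SimpleImputer(strategy="mean")

# fit_transform expects 2-D input and returns a 2-D array, hence ravel().
toy["Age"] = imputer.fit_transform(toy[["Age"]]).ravel()
print(toy["Age"].tolist())
```

One reason to prefer the imputer over `fillna` is that a fitted imputer can later apply the same training-set mean to test data through `transform`.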

In [24]:
data_clean[2]

KeyError: 2

In [26]:
# data_clean[2] looks for a column labelled 2, hence the KeyError; use .loc to select a row by its index label
data_clean.loc[2]

Survived     1.000
Pclass       3.000
Sex          0.000
Age         26.000
SibSp        0.000
Parch        0.000
Fare         7.925
Name: 2, dtype: float64
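
To make the distinction concrete, here is a small sketch of the three kinds of lookup on a made-up frame whose index labels deliberately differ from the row positions.

```python
import pandas as pd

# Hypothetical frame with index labels 10, 11, 12.
toy = pd.DataFrame(
    {"Survived": [0, 1, 1], "Pclass": [3, 1, 3]},
    index=[10, 11, 12],
)

row_by_label = toy.loc[11]      # row whose index label is 11
row_by_position = toy.iloc[1]   # second row, regardless of its label
column = toy["Pclass"]          # square brackets select a column

print(row_by_label)
print(column.tolist())
```

With the default `RangeIndex` used in this notebook, labels and positions coincide, so `.loc[2]` and `.iloc[2]` return the same row.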

In [27]:
data_clean.loc?

In [32]:
input_cols=["Pclass","Sex","Age","SibSp","Parch","Fare"]
output_cols=["Survived"]

X=data_clean[input_cols]
Y=data_clean[output_cols]

print(X.shape,Y.shape)
print(type(X))

(891, 6) (891, 1)
<class 'pandas.core.frame.DataFrame'>


# 04. DT 4 - Implementing Information Gain

In [33]:
# Define Entropy and Information Gain

In [35]:
def entropy(col):
    # First pass: only inspect the unique values and their counts
    counts=np.unique(col,return_counts=True)
    N=float(col.shape[0])
    print(counts)

In [36]:
col=np.array([1,1,1,1,0,1,0])
entropy(col)

(array([0, 1]), array([2, 5]))


In [37]:
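def entropy(col):
    # counts[0] holds the unique labels, counts[1] how often each occurs
    counts=np.unique(col,return_counts=True)
    N=float(col.shape[0])
    ent=0.0
    for count in counts[1]:
        p=count/N  # probability of this class
        ent+=-1.0*p*np.log2(p)
    return ent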

In [38]:
col=np.array([1,1,1,1,0,1,0])
entropy(col)

0.863120568566631
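
The value above can be checked by hand. The column has two 0s and five 1s among seven labels, so the class probabilities are 2/7 and 5/7:

$$
H = -\sum_i p_i \log_2 p_i = -\tfrac{2}{7}\log_2\tfrac{2}{7} - \tfrac{5}{7}\log_2\tfrac{5}{7} \approx 0.5164 + 0.3467 = 0.8631
$$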

In [41]:
col=np.array([1,1,0,1,0,0])
entropy(col)
# Entropy is maximum when every class occurs equally often

1.0

In [43]:
col=np.array([1,1,1,1,1,1])
entropy(col)
# Entropy is zero when only one class occurs

0.0
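
Information gain builds directly on this entropy function: it is the parent entropy minus the size-weighted entropy of the two child nodes produced by a split. The sketch below is one possible formulation for a binary split on a numeric threshold. The name `information_gain`, its signature, and the sample arrays are assumptions made for illustration, not the notebook's own implementation.

```python
import numpy as np

def entropy(col):
    # Same definition as above: -sum(p * log2(p)) over the classes
    counts = np.unique(col, return_counts=True)[1]
    n = float(col.shape[0])
    ent = 0.0
    for count in counts:
        p = count / n
        ent += -1.0 * p * np.log2(p)
    return ent

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold (hypothetical helper)."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    # A split that leaves one side empty separates nothing
    if left.shape[0] == 0 or right.shape[0] == 0:
        return 0.0
    n = float(labels.shape[0])
    weighted = (left.shape[0] / n) * entropy(left) + (right.shape[0] / n) * entropy(right)
    return entropy(labels) - weighted

feature = np.array([1, 2, 3, 4, 5, 6])
labels = np.array([0, 0, 0, 1, 1, 1])

gain_perfect = information_gain(feature, labels, 3)   # separates the classes exactly
gain_useless = information_gain(feature, labels, 10)  # everything falls on one side
print(gain_perfect, gain_useless)
```

A tree builder would evaluate this quantity for each candidate feature and threshold and split on the pair with the highest gain.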