# Decision Trees and Random Forests

This notebook demonstrates how to build and train decision tree and random forest models using Python and popular machine learning libraries. We will cover the following steps:
1. Importing necessary libraries
2. Loading and preprocessing the data
3. Building the decision tree model
4. Building the random forest model
5. Training the models
6. Evaluating the models' performance

## 1. Importing Necessary Libraries

We start by importing NumPy for linear algebra and pandas for data processing. The scikit-learn modules used for model training are imported later, next to the models that use them.

In [1]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [2]:
data = pd.read_csv("../input/titanic/titanic-passengers.csv", sep = ';')

In [3]:
data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 343 | No | 2 | Collander, Mr. Erik Gustaf | male | 28.0 | 0 | 0 | 248740 | 13.0 | NaN | S |
| 1 | 76 | No | 3 | Moen, Mr. Sigurd Hansen | male | 25.0 | 0 | 0 | 348123 | 7.65 | F G73 | S |
| 2 | 641 | No | 3 | Jensen, Mr. Hans Peder | male | 20.0 | 0 | 0 | 350050 | 7.8542 | NaN | S |
| 3 | 568 | No | 3 | Palsson, Mrs. Nils (Alma Cornelia Berglund) | female | 29.0 | 0 | 4 | 349909 | 21.075 | NaN | S |
| 4 | 672 | No | 1 | Davidson, Mr. Thornton | male | 31.0 | 1 | 0 | F.C. 12750 | 52.0 | B71 | S |


In [4]:
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

# Preprocessing the data

In [5]:
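# fill missing ages with the mean age (assign back; inplace on a selected column is unreliable in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].mean())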

In [6]:
# fill missing cabins with the placeholder value 'G6'
data["Cabin"] = data["Cabin"].fillna('G6')

In [7]:
data.replace({'Sex':{'male': 1,'female':0}},inplace=True)

In [8]:
data.replace({'Survived':{'Yes': 1,'No':0}},inplace=True)
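The preprocessing steps above can be tried on a small made-up frame before running them on the real data. This is a minimal sketch: the `toy` DataFrame and its values are invented for illustration, and `Series.map` is used here as an alternative to `DataFrame.replace` for encoding the labels.

```python
import pandas as pd

# Toy frame that mimics the relevant Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Sex": ["male", "female", "female"],
    "Survived": ["No", "Yes", "Yes"],
})

# Fill the missing age with the mean of the known ages, assigning the result back.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# map() encodes the categories; any label missing from the dict would become NaN.
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})
toy["Survived"] = toy["Survived"].map({"Yes": 1, "No": 0})

print(toy)
```

Checking `toy.isna().sum()` afterwards is a quick way to confirm that no missing values remain in the encoded columns.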

# The Models

In [9]:
from sklearn.model_selection import train_test_split
from sklearn import tree   
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [10]:
# feature selection: drop the target and the text columns
# (PassengerId, an identifier, is still kept as a feature here)
x = data.drop(["Survived", "Name", "Cabin", "Ticket", "Embarked"], axis=1)
y = data["Survived"]

# splitting data into 80% train and 20% test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=10)

# applying the decision tree algorithm
model = DecisionTreeClassifier()
model.fit(x_train, y_train)     # fitting our model
y_pred = model.predict(x_test)  # predicting on the test set
print("score:{}".format(accuracy_score(y_test, y_pred)))  # accuracy on the test set

score:0.770949720670391
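A single train/test split gives one accuracy number, which can shift with the split and with the tree's internal randomness. One possible way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for `x` and `y` (an assumption, not the Titanic data), and the names `X_demo`, `y_demo`, and `tree_clf` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x and y (an assumption; not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Fixing random_state makes the fitted tree reproducible between runs.
tree_clf = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: five train/test splits, one accuracy per fold.
scores = cross_val_score(tree_clf, X_demo, y_demo, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy: {:.3f}".format(scores.mean()))
```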


In [11]:
import graphviz

In [12]:
def viz_tree(model, name):
    d = tree.export_graphviz(model)
    graph = graphviz.Source(d)
    graph.render(name)

In [13]:
viz_tree(model, 'tree')
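If the Graphviz binaries are not installed, scikit-learn's `export_text` offers a text-only view of a fitted tree. The following is a small sketch on synthetic data; the feature names `f0` to `f3` and the variable `small_tree` are placeholders invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features (an assumption; not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical column names

# A shallow tree keeps the printed rules short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(X_demo, y_demo)

# export_text returns the decision rules as an indented string.
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```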

# Changing Some Parameters

In [14]:
# applying the tree algorithm with a depth limit and a minimum leaf size
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=5)
model.fit(x_train, y_train)     # fitting our model
y_pred = model.predict(x_test)  # predicting on the test set
print("score:{}".format(accuracy_score(y_test, y_pred)))  # accuracy on the test set

score:0.8044692737430168


Note: limiting the tree with max_depth=5 and min_samples_leaf=5 gives better test accuracy (0.804 vs. 0.771).
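Rather than trying parameter values by hand, the same search can be sketched with `GridSearchCV`, which scores every combination by cross-validation on the training set. The grid values below are illustrative choices, and the data is synthetic (an assumption, not the Titanic data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic data).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=10
)

# Candidate values for the two parameters changed above (illustrative choices).
param_grid = {"max_depth": [3, 5, 7, None], "min_samples_leaf": [1, 5, 10]}

# Every combination is scored with 5-fold cross-validation on the training set.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
print("test accuracy of the refit best model:", search.score(X_te, y_te))
```

Keeping the test set out of the search means the final test accuracy is not used to pick the parameters.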

In [15]:
viz_tree(model, 'tree_2')

# Random Forests

In [16]:
from sklearn.ensemble import RandomForestClassifier #Importing Random Forest Classifier
from sklearn import metrics  # Importing metrics to test accuracy

In [17]:
clf=RandomForestClassifier(n_estimators=10)  #Creating a random forest with 10 decision trees
clf.fit(x_train, y_train)  #Training our model
y_pred=clf.predict(x_test)  #testing our model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  #Measuring the accuracy of our model

Accuracy: 0.8715083798882681


Note: better accuracy than the two previous models.
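One way to see why a forest can do better than a single tree is to look at how it combines its members. In scikit-learn, a `RandomForestClassifier` averages the class probabilities of its individual trees. The sketch below checks this on synthetic data; the names `forest`, `per_tree`, and `averaged` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Class probabilities from each tree: shape (n_trees, n_samples, n_classes).
per_tree = np.array([t.predict_proba(X_demo) for t in forest.estimators_])

# Averaging over the trees should reproduce the forest's own probabilities.
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X_demo)))

# The predicted class is the one with the highest averaged probability.
manual_pred = forest.classes_[averaged.argmax(axis=1)]
print((manual_pred == forest.predict(X_demo)).all())
```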

In [18]:
# export_graphviz takes a single tree, so render the first tree of the forest
viz_tree(clf.estimators_[0], 'random_forests')

In [19]:
clf=RandomForestClassifier(n_estimators=15)  #Creating a random forest with 15 decision trees
clf.fit(x_train, y_train)  #Training our model
y_pred=clf.predict(x_test)  #testing our model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  #Measuring the accuracy of our model
viz_tree(clf.estimators_[0], 'random_forests_2')  #Rendering the first tree of the forest

Accuracy: 0.8491620111731844


In [20]:
clf=RandomForestClassifier(n_estimators=20)  #Creating a random forest with 20 decision trees
clf.fit(x_train, y_train)  #Training our model
y_pred=clf.predict(x_test)  #testing our model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  #Measuring the accuracy of our model

Accuracy: 0.8491620111731844


In [21]:
clf=RandomForestClassifier(n_estimators=25)  #Creating a random forest with 25 decision trees
clf.fit(x_train, y_train)  #Training our model
y_pred=clf.predict(x_test)  #testing our model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  #Measuring the accuracy of our model

Accuracy: 0.8379888268156425


In [22]:
clf=RandomForestClassifier(n_estimators=30)  #Creating a random forest with 30 decision trees
clf.fit(x_train, y_train)  #Training our model
y_pred=clf.predict(x_test)  #testing our model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  #Measuring the accuracy of our model

Accuracy: 0.8379888268156425


In [23]:
clf=RandomForestClassifier(n_estimators=40)  #Creating a random forest with 40 decision trees
clf.fit(x_train, y_train)  #Training our model
y_pred=clf.predict(x_test)  #testing our model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  #Measuring the accuracy of our model

Accuracy: 0.8379888268156425


In [24]:
clf=RandomForestClassifier(n_estimators=35)  #Creating a random forest with 35 decision trees
clf.fit(x_train, y_train)  #Training our model
y_pred=clf.predict(x_test)  #testing our model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  #Measuring the accuracy of our model

Accuracy: 0.8659217877094972


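NOTES:
After a few trials, the best test accuracy was seen with n_estimators = 10 (0.872) and n_estimators = 35 (0.866).
Random forests were better overall than single decision trees (0.771 and 0.804). Because no random_state is set on the classifiers, these scores will vary somewhat from run to run.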
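The repeated cells for different `n_estimators` values could be collapsed into one loop. The sketch below does that with a fixed `random_state` and 5-fold cross-validation, so that differences between settings are less likely to come from chance alone. It runs on synthetic data (an assumption, not the Titanic data), and `results` and `best_n` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Mean cross-validated accuracy for each forest size tried in the notebook.
results = {}
for n in [10, 15, 20, 25, 30, 35, 40]:
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    results[n] = cross_val_score(forest, X_demo, y_demo, cv=5).mean()

for n, acc in results.items():
    print("n_estimators={:2d}  mean accuracy={:.3f}".format(n, acc))

# Forest size with the highest mean accuracy on this data.
best_n = max(results, key=results.get)
print("best n_estimators:", best_n)
```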