#### Practicing and trying out DecisionTree on the Titanic dataset

In [167]:
import pandas as pd

In [168]:
from sklearn import datasets
data = datasets.fetch_openml(name='titanic', version=1, as_frame=True)


In [169]:
print(data.DESCR)

**Author**: Frank E. Harrell Jr., Thomas Cason  
**Source**: [Vanderbilt Biostatistics](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)  
**Please cite**:   

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information from the crew, but it does contain actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variable

In [170]:
data

{'data':       pclass                                             name     sex  \
 0        1.0                    Allen, Miss. Elisabeth Walton  female   
 1        1.0                   Allison, Master. Hudson Trevor    male   
 2        1.0                     Allison, Miss. Helen Loraine  female   
 3        1.0             Allison, Mr. Hudson Joshua Creighton    male   
 4        1.0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   
 ...      ...                                              ...     ...   
 1304     3.0                             Zabour, Miss. Hileni  female   
 1305     3.0                            Zabour, Miss. Thamine  female   
 1306     3.0                        Zakarian, Mr. Mapriededer    male   
 1307     3.0                              Zakarian, Mr. Ortin    male   
 1308     3.0                               Zimmerman, Mr. Leo    male   
 
           age  sibsp  parch  ticket      fare    cabin embarked boat   body  \
 0     29.0000    0.0 

In [171]:
X, y = datasets.fetch_openml(name='titanic', version=1, as_frame=True, return_X_y=True)


In [172]:
# features
X.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [173]:
# label - whether the person survived or not (0 or 1)
y.head()

0    1
1    1
2    0
3    0
4    0
Name: survived, dtype: category
Categories (2, object): ['0', '1']
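
The output above shows that the labels are stored as a categorical column whose categories are the strings '0' and '1'. scikit-learn classifiers accept such labels as they are, but if integer labels are preferred (for example to compute a survival rate with `.mean()`), a cast along the following lines should work. This is a minimal sketch on a toy series; `y_toy` and `y_int` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy labels mimicking the categorical target above (string categories '0' and '1')
y_toy = pd.Series(["1", "1", "0", "0", "0"], name="survived", dtype="category")

# Cast the string categories to plain integers
y_int = y_toy.astype(int)

print(y_int.tolist())
print(y_int.mean())  # fraction of positive labels in the toy series
```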

Drop the features that we will not use for making the predictions: name, sibsp, parch, ticket, cabin, embarked, boat, body, home.dest. Note that boat and body are recorded after the sinking, so keeping them would leak the outcome into the features.

In [174]:
X.drop(columns=['name', 'sibsp', 'parch', 'ticket', 'cabin', 'embarked', 'boat', 'body', 'home.dest'], inplace=True)

In [175]:
X

Unnamed: 0,pclass,sex,age,fare
0,1.0,female,29.0000,211.3375
1,1.0,male,0.9167,151.5500
2,1.0,female,2.0000,151.5500
3,1.0,male,30.0000,151.5500
4,1.0,female,25.0000,151.5500
...,...,...,...,...
1304,3.0,female,14.5000,14.4542
1305,3.0,female,,14.4542
1306,3.0,male,26.5000,7.2250
1307,3.0,male,27.0000,7.2250


`sex` is a categorical variable, so it has to be encoded as a number before fitting a model. We can map it to 0 for male and 1 for female using the `OrdinalEncoder`.

In [176]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['male', 'female']]).set_output(transform='pandas')
oe.fit_transform(X[['sex']])

Unnamed: 0,sex
0,1.0
1,0.0
2,1.0
3,0.0
4,1.0
...,...
1304,1.0
1305,1.0
1306,0.0
1307,0.0
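
To make the mapping explicit, here is a small sketch on a toy column (the values are illustrative, not taken from the dataset). It assumes the same `categories=[['male', 'female']]` argument as above: the position of each label in that list becomes its code, and `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for the 'sex' column (illustrative values only)
toy = pd.DataFrame({"sex": ["female", "male", "female", "male"]})

# The index of each label in `categories` becomes its code: male -> 0, female -> 1
encoder = OrdinalEncoder(categories=[["male", "female"]])
codes = encoder.fit_transform(toy[["sex"]])

print(codes.ravel())                              # encoded values
print(encoder.categories_)                        # label order used by the encoder
print(encoder.inverse_transform(codes).ravel())   # back to the original labels
```

Without the `categories` argument the encoder sorts the labels, which would give female the code 0 and male the code 1.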


In [177]:
X['sex'] = oe.fit_transform(X[['sex']])
X

Unnamed: 0,pclass,sex,age,fare
0,1.0,1.0,29.0000,211.3375
1,1.0,0.0,0.9167,151.5500
2,1.0,1.0,2.0000,151.5500
3,1.0,0.0,30.0000,151.5500
4,1.0,1.0,25.0000,151.5500
...,...,...,...,...
1304,3.0,1.0,14.5000,14.4542
1305,3.0,1.0,,14.4542
1306,3.0,0.0,26.5000,7.2250
1307,3.0,0.0,27.0000,7.2250


In [178]:
X.isna().sum()

pclass      0
sex         0
age       263
fare        1
dtype: int64

It looks like `age` and `fare` contain some `NaN` values, so we can use imputation to fill them in, for example with the mean of each column.

In [179]:
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='mean').set_output(transform='pandas')
si.fit_transform(X[['age']])

Unnamed: 0,age
0,29.000000
1,0.916700
2,2.000000
3,30.000000
4,25.000000
...,...
1304,14.500000
1305,29.881135
1306,26.500000
1307,27.000000


In [180]:
X['age'] = si.fit_transform(X[['age']])
X['fare'] = si.fit_transform(X[['fare']])

In [181]:
X

Unnamed: 0,pclass,sex,age,fare
0,1.0,1.0,29.000000,211.3375
1,1.0,0.0,0.916700,151.5500
2,1.0,1.0,2.000000,151.5500
3,1.0,0.0,30.000000,151.5500
4,1.0,1.0,25.000000,151.5500
...,...,...,...,...
1304,3.0,1.0,14.500000,14.4542
1305,3.0,1.0,29.881135,14.4542
1306,3.0,0.0,26.500000,7.2250
1307,3.0,0.0,27.000000,7.2250


In [182]:
X.isna().sum()

pclass    0
sex       0
age       0
fare      0
dtype: int64

#### Performing the train test split

In [183]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [184]:
X_train.shape, y_train.shape

((1047, 4), (1047,))

#### Fitting the DecisionTree model

In [185]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

Model score on the test data

In [186]:
model.score(X_test, y_test)

0.7442748091603053

Model score on the train data

In [187]:
model.score(X_train, y_train)

0.9684813753581661
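
The gap between the train score (about 0.97) and the test score (about 0.74) suggests that the fully grown tree is overfitting. One common remedy is to limit the depth of the tree. The sketch below compares an unrestricted tree with a shallow one on synthetic stand-in data from `make_classification`; the value `max_depth=3` is an arbitrary choice for illustration, and the scores it prints say nothing about what the Titanic data would give.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scores = {}
trees = {}
for depth in (None, 3):  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xd_train, yd_train)
    trees[depth] = tree
    scores[depth] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

In practice the depth would be chosen with cross-validation (for example `GridSearchCV`) rather than fixed by hand.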

#### Fitting the SVM model

In [204]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)

In [205]:
model.score(X_test, y_test)

0.6679389312977099

In [206]:
model.score(X_train, y_train)

0.6723973256924546
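
Here the train and test scores are both low and close to each other, which points to underfitting rather than overfitting. `SVC` with its default RBF kernel is sensitive to the scale of the features, and in this notebook `fare` spans a much wider range than `pclass` or `sex`. Standardizing the features before the SVM is a common thing to try. The sketch below shows the pattern with a pipeline on synthetic stand-in data where one feature is deliberately put on a larger scale; whether it helps on the Titanic features is not verified here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_demo[:, 0] *= 100.0  # put one feature on a much larger scale, loosely like 'fare'

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fit on the training data only; the pipeline reuses it at predict time
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(Xs_train, ys_train)

test_score = scaled_svc.score(Xs_test, ys_test)
print(f"test accuracy with scaling: {test_score:.3f}")
```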