# Learning Scikit-learn: Machine Learning in Python

## Notebook for Chapter 2: Supervised Learning: Explaining Titanic Hypothesis with Decision Trees

### Preprocessing

First, we must load the dataset. We assume it is located in the `data/titanic.csv` file.

In [7]:
import csv
import numpy as np

# Open the CSV file in read mode
with open('data/titanic.csv', 'r', encoding='utf-8') as csvfile:
    titanic_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Header contains feature names
    feature_names = np.array(next(titanic_reader))
    
    # Load dataset and target classes
    titanic_X, titanic_y = [], []
    for row in titanic_reader:
        titanic_X.append(row)
        titanic_y.append(row[2]) # The target value is "survived"
    
    titanic_X = np.array(titanic_X)
    titanic_y = np.array(titanic_y)

print("Feature Names:", feature_names)
print("First 5 rows of Data:")
print(titanic_X[:5])
print("First 5 Target Values:")
print(titanic_y[:5])



Feature Names: ['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest'
 'room' 'ticket' 'boat' 'sex']
First 5 rows of Data:
[['1' '1st' '1' 'Allen, Miss Elisabeth Walton' '29.0000' 'Southampton'
  'St Louis, MO' 'B-5' '24160 L221' '2' 'female']
 ['2' '1st' '0' 'Allison, Miss Helen Loraine' ' 2.0000' 'Southampton'
  'Montreal, PQ / Chesterville, ON' 'C26' '' '' 'female']
 ['3' '1st' '0' 'Allison, Mr Hudson Joshua Creighton' '30.0000'
  'Southampton' 'Montreal, PQ / Chesterville, ON' 'C26' '' '(135)' 'male']
 ['4' '1st' '0' 'Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)'
  '25.0000' 'Southampton' 'Montreal, PQ / Chesterville, ON' 'C26' '' ''
  'female']
 ['5' '1st' '1' 'Allison, Master Hudson Trevor' ' 0.9167' 'Southampton'
  'Montreal, PQ / Chesterville, ON' 'C22' '' '11' 'male']]
First 5 Target Values:
['1' '0' '0' '0' '1']


In [11]:
print(feature_names, titanic_X[0], titanic_y[0])


['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest'
 'room' 'ticket' 'boat' 'sex'] ['1' '1st' '1' 'Allen, Miss Elisabeth Walton' '29.0000' 'Southampton'
 'St Louis, MO' 'B-5' '24160 L221' '2' 'female'] 1


Keep only three features: class (1st, 2nd, 3rd), age (float), and sex (male, female).

In [13]:
# we keep the class, the age and the sex
titanic_X = titanic_X[:, [1, 4, 10]]
feature_names = feature_names[[1, 4, 10]]


In [15]:
print(feature_names)
print(titanic_X[12], titanic_y[12])

['pclass' 'age' 'sex']
['1st' 'NA' 'female'] 1
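
The positional indices 1, 4 and 10 are easy to get wrong if the column order ever changes. As a minimal sketch, assuming the header row printed earlier, the same selection can be written by looking each name up in `feature_names`. The `wanted` list and the two-row `rows` sample are illustrative only.

```python
import numpy as np

# Header as printed by the loading cell
feature_names = np.array(['row.names', 'pclass', 'survived', 'name', 'age',
                          'embarked', 'home.dest', 'room', 'ticket', 'boat', 'sex'])

# Two sample rows copied from the loading cell's output
rows = np.array([
    ['1', '1st', '1', 'Allen, Miss Elisabeth Walton', '29.0000', 'Southampton',
     'St Louis, MO', 'B-5', '24160 L221', '2', 'female'],
    ['3', '1st', '0', 'Allison, Mr Hudson Joshua Creighton', '30.0000', 'Southampton',
     'Montreal, PQ / Chesterville, ON', 'C26', '', '(135)', 'male'],
])

# Resolve each wanted column name to its position in the header
wanted = ['pclass', 'age', 'sex']
column_index = [int(np.flatnonzero(feature_names == name)[0]) for name in wanted]

selected = rows[:, column_index]
print(column_index)
print(selected)
```

Looking names up this way fails loudly (with an `IndexError`) if a column is missing, instead of silently selecting the wrong one.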


Handle missing values ('NA') in the 'age' feature by replacing them with the mean age.

In [26]:
import pandas as pd
import numpy as np
import csv

# Read the CSV file into a pandas DataFrame
data = pd.read_csv('data/titanic.csv', delimiter=',', quotechar='"')

# pandas already reads 'NA' as NaN; also map any 'unknown' placeholder to NaN
data['age'] = data['age'].replace('unknown', np.nan)

# Convert the 'age' column to numeric, coercing unparseable values to NaN
data['age'] = pd.to_numeric(data['age'], errors='coerce')

# Fill NaN values in the 'age' column with the mean age
# (assign the result back instead of using inplace=True on a column selection)
mean_age = data['age'].mean()
data['age'] = data['age'].fillna(mean_age)

# Convert the DataFrame to a NumPy array
titanic_X = data.values

# Extract the feature names and target variable
feature_names = data.columns.values
titanic_y = data['survived'].values

print("Feature Names:", feature_names)
print("First 5 rows of Data:")
print(titanic_X[:5])
print("First 5 Target Values:")
print(titanic_y[:5])



Feature Names: ['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest'
 'room' 'ticket' 'boat' 'sex']
First 5 rows of Data:
[[1 '1st' 1 'Allen, Miss Elisabeth Walton' 29.0 'Southampton'
  'St Louis, MO' 'B-5' '24160 L221' '2' 'female']
 [2 '1st' 0 'Allison, Miss Helen Loraine' 2.0 'Southampton'
  'Montreal, PQ / Chesterville, ON' 'C26' nan nan 'female']
 [3 '1st' 0 'Allison, Mr Hudson Joshua Creighton' 30.0 'Southampton'
  'Montreal, PQ / Chesterville, ON' 'C26' nan '(135)' 'male']
 [4 '1st' 0 'Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)' 25.0
  'Southampton' 'Montreal, PQ / Chesterville, ON' 'C26' nan nan 'female']
 [5 '1st' 1 'Allison, Master Hudson Trevor' 0.9167 'Southampton'
  'Montreal, PQ / Chesterville, ON' 'C22' nan '11' 'male']]
First 5 Target Values:
[1 0 0 0 1]


In [28]:
print(feature_names)
print(titanic_X[12], titanic_y[12])


['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest'
 'room' 'ticket' 'boat' 'sex']
[13 '1st' 1 'Aubert, Mrs Leontine Pauline' 31.19418104265403 'Cherbourg'
 'Paris, France' 'B-35' '17477 L69 6s' '9' 'female'] 1
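
The pandas cell above reloads every column of the CSV. If only the string array from the earlier steps is at hand, the same mean imputation can be sketched with NumPy alone. The `ages` values below are a made-up sample, with 'NA' marking a missing entry as in the CSV.

```python
import numpy as np

# Hypothetical 'age' column as strings, the way csv.reader delivers it
ages = np.array(['29.0000', 'NA', ' 2.0000', 'NA', '30.0000'])

# Convert the known ages to float and leave the missing ones as NaN
known = ages != 'NA'
numeric_ages = np.full(ages.shape, np.nan)
numeric_ages[known] = [float(a) for a in ages[known]]

# Replace every missing age with the mean of the known ages
mean_age = np.nanmean(numeric_ages)
numeric_ages[~known] = mean_age

print(mean_age)
print(numeric_ages)
```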


Class and sex are categorical features. Sex can be converted to a binary value (0 = female, 1 = male):

In [35]:
# Encode sex 
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
label_encoder = enc.fit(titanic_X[:, 2])
print ("Categorical classes:", label_encoder.classes_)
integer_classes = label_encoder.transform(label_encoder.classes_)
print ("Integer classes:", integer_classes)
t = label_encoder.transform(titanic_X[:, 2])
titanic_X[:, 2] = t


Categorical classes: [0 1]
Integer classes: [0 1]
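
In the reloaded array, column 2 holds `survived`, which is why the classes printed above are `[0 1]` rather than the two sex labels. A minimal sketch of the intended encoding follows, applied to a made-up `sex` sample (in the full array this would be the column at index 10 of `feature_names`).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the 'sex' column
sex = np.array(['female', 'female', 'male', 'female', 'male'])

# LabelEncoder sorts the labels, so 'female' maps to 0 and 'male' to 1
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(sex)

print("Categorical classes:", label_encoder.classes_)
print("Encoded values:", encoded)
```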


In [37]:
print(feature_names)
print(titanic_X[12], titanic_y[12])


['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest'
 'room' 'ticket' 'boat' 'sex']
[13 '1st' 1 'Aubert, Mrs Leontine Pauline' 31.19418104265403 'Cherbourg'
 'Paris, France' 'B-35' '17477 L69 6s' '9' 'female'] 1


Now we have to convert the class. Since there are three different classes, we cannot convert them to a binary value, and using 0/1/2 would imply an order, which we do not want. Instead, we use OneHotEncoder to get three different attributes, one per class.

In [44]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Sample dataset to emulate the Titanic dataset structure
data = {
    'id': [129237, 123083, 123087],
    'name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina'],
    'survived': [0, 1, 1],
    'pclass': [3, 1, 3],
    'sex': ['male', 'female', 'female'],
    'age': [22, 38, 26],
    'sibsp': [1, 1, 0],
    'parch': [0, 0, 0],
    'ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282'],
    'fare': [7.25, 71.2833, 7.925],
    'cabin': ['', 'C85', ''],
    'embarked': ['S', 'C', 'S']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Convert categorical features to numerical values using LabelEncoder
label_encoders = {}
for column in ['sex', 'embarked']:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le

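# Apply OneHotEncoder to the 'pclass' column
# (.toarray() densifies the default sparse result in any scikit-learn version)
enc = OneHotEncoder()
pclass_reshaped = df[['pclass']].values.reshape(-1, 1)
pclass_one_hot = enc.fit_transform(pclass_reshaped).toarray()

# Add the one-hot encoded columns to the original DataFrame.
# Columns are numbered by position in enc.categories_, not by class value:
# this sample holds only classes 1 and 3, so 'class_2' flags pclass == 3.
pclass_df = pd.DataFrame(pclass_one_hot, columns=[f'class_{int(i)}' for i in range(1, pclass_one_hot.shape[1] + 1)])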
df = pd.concat([df, pclass_df], axis=1)

# Drop the original 'pclass', 'name', 'ticket', and 'cabin' columns
df.drop(columns=['pclass', 'name', 'ticket', 'cabin'], inplace=True)

# Display the DataFrame with the new features
print("DataFrame with One-Hot Encoded 'pclass' column:")
print(df)

# Update feature names
feature_names = df.columns.tolist()
print("Updated feature names:", feature_names)

# Convert to numpy arrays for further processing
titanic_X = df.drop(columns=['survived']).values
titanic_y = df['survived'].values

# Convert to float for numerical computations
titanic_X = titanic_X.astype(float)
titanic_y = titanic_y.astype(float)

# Print the feature matrix and target vector
print("Feature matrix (titanic_X):")
print(titanic_X)
print("Target vector (titanic_y):")
print(titanic_y)


DataFrame with One-Hot Encoded 'pclass' column:
       id  survived  sex  age  sibsp  parch     fare  embarked  class_1  \
0  129237         0    1   22      1      0   7.2500         1      0.0   
1  123083         1    0   38      1      0  71.2833         0      1.0   
2  123087         1    0   26      0      0   7.9250         1      0.0   

   class_2  
0      1.0  
1      0.0  
2      1.0  
Updated feature names: ['id', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class_1', 'class_2']
Feature matrix (titanic_X):
[[1.29237e+05 1.00000e+00 2.20000e+01 1.00000e+00 0.00000e+00 7.25000e+00
  1.00000e+00 0.00000e+00 1.00000e+00]
 [1.23083e+05 0.00000e+00 3.80000e+01 1.00000e+00 0.00000e+00 7.12833e+01
  0.00000e+00 1.00000e+00 0.00000e+00]
 [1.23087e+05 0.00000e+00 2.60000e+01 0.00000e+00 0.00000e+00 7.92500e+00
  1.00000e+00 0.00000e+00 1.00000e+00]]
Target vector (titanic_y):
[0. 1. 1.]


In [46]:
print(feature_names)
print(titanic_X[0], titanic_y[0])

['id', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class_1', 'class_2']
[1.29237e+05 1.00000e+00 2.20000e+01 1.00000e+00 0.00000e+00 7.25000e+00
 1.00000e+00 0.00000e+00 1.00000e+00] 0.0
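
In the output above, `class_2` actually flags third class, because the sample contains only classes 1 and 3 and the columns are numbered by position. One way to avoid that mismatch, sketched here on a made-up `pclass` sample, is to build the column names from the encoder's `categories_` attribute. Calling `.toarray()` on the default sparse result keeps the sketch independent of the `sparse` / `sparse_output` argument, whose name changed between scikit-learn releases.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the 'pclass' column, shaped as one feature column
pclass = np.array(['1st', '3rd', '2nd', '3rd']).reshape(-1, 1)

# Fit the encoder and densify the sparse result
enc = OneHotEncoder()
one_hot = enc.fit_transform(pclass).toarray()

# Name each output column after the category it represents
class_columns = [f"class_{category}" for category in enc.categories_[0]]

print(class_columns)
print(one_hot)
```

Each row has exactly one 1, in the column named after its class, so no artificial ordering between classes is introduced.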


Separate training and test sets

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(titanic_X, titanic_y, test_size=0.25, random_state=33)
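
As a small illustration of what this call returns, the sketch below splits a made-up 20-row array with the same `test_size` and `random_state`. The arrays `X_demo` and `y_demo` are placeholders, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples with 2 features each, and alternating labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# Hold out 25% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=33)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=33)
print(np.array_equal(X_te, X_te2))
```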


### Decision Trees

Fit a decision tree with the data.

In [53]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf = clf.fit(X_train, y_train)
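
With `criterion='entropy'`, candidate splits are ranked by information gain, that is, by how much they reduce the entropy of the target. A minimal sketch of that calculation is shown below; the helper names `entropy` and `information_gain` and the eight-passenger sample are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting `labels` with a boolean mask."""
    left, right = labels[mask], labels[~mask]
    weight_left = len(left) / len(labels)
    return (entropy(labels)
            - weight_left * entropy(left)
            - (1 - weight_left) * entropy(right))

# Hypothetical sample: survival flag and a candidate split on sex
survived = np.array([1, 1, 1, 0, 0, 0, 0, 1])
is_female = np.array([True, True, True, False, False, False, False, False])

gain = information_gain(survived, is_female)
print("Parent entropy:", entropy(survived))
print("Information gain of the split:", round(gain, 3))
```

The tree builder evaluates many such candidate splits at each node and keeps the one with the highest gain, subject to limits such as `max_depth` and `min_samples_leaf`.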

Show the built tree using pydotplus. Rendering requires the Graphviz executables to be installed and available on the PATH.

In [60]:
import pydotplus
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from io import StringIO
import os
from IPython.display import Image

# Ensure the Graphviz path is added to the system PATH environment variable
os.environ["PATH"] += os.pathsep + r'/usr/local/bin'

# Load example dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train the decision tree classifier
clf = DecisionTreeClassifier()
clf = clf.fit(X, y)

# Generate DOT data
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, 
                     feature_names=iris.feature_names,  
                     class_names=iris.target_names,  
                     filled=True, rounded=True,  
                     special_characters=True)

# Create graph from DOT data
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

# Save the graph as a PNG file
graph.write_png("iris.png")

# Display the image
Image(filename="iris.png")



InvocationException: GraphViz's executables not found
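
When the Graphviz executables are not available, as in the error above, scikit-learn's `export_text` gives a plain-text view of the tree without any external dependency. The sketch below reuses the iris data from the cell above; the hyperparameters mirror the earlier classifier, and `random_state=33` is an assumption added only for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the same example dataset as the cell above
iris = load_iris()

# Fit a shallow tree so the printed rules stay readable
clf_text = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                  min_samples_leaf=5, random_state=33)
clf_text.fit(iris.data, iris.target)

# Render the decision rules as indented text
report = export_text(clf_text, feature_names=list(iris.feature_names))
print(report)
```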

Measure accuracy on the training set, then accuracy, precision, recall, and F1 on the test set.

In [66]:
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Function to measure performance
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    
    if show_accuracy:
        print("Accuracy: {0:.3f}".format(metrics.accuracy_score(y, y_pred)))
    
    if show_classification_report:
        print("Classification Report")
        print(metrics.classification_report(y, y_pred))
    
    if show_confusion_matrix:
        print("Confusion Matrix")
        print(metrics.confusion_matrix(y, y_pred))

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Measure performance on training data
print("Performance on Training Data:")
measure_performance(X_train, y_train, clf, show_classification_report=False, show_confusion_matrix=False)

# Measure performance on test data
print("Performance on Test Data:")
measure_performance(X_test, y_test, clf)


Performance on Training Data:
Accuracy: 1.000
Performance on Test Data:
Accuracy: 1.000
Classification Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

Confusion Matrix
[[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]


Perform leave-one-out cross-validation to get a performance estimate that does not depend on a single train/test split.

In [70]:
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn import metrics
import numpy as np
from scipy.stats import sem
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def loo_cv(X_train, y_train, clf):
    # Perform Leave-One-Out cross-validation
    loo = LeaveOneOut()
    scores = np.zeros(X_train.shape[0])
    
    for train_index, test_index in loo.split(X_train):
        X_train_cv, X_test_cv = X_train[train_index], X_train[test_index]
        y_train_cv, y_test_cv = y_train[train_index], y_train[test_index]
        clf = clf.fit(X_train_cv, y_train_cv)
        y_pred = clf.predict(X_test_cv)
        scores[test_index] = metrics.accuracy_score(y_test_cv, y_pred)
    
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))

# Load example dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train a classifier
clf = DecisionTreeClassifier()

# Perform Leave-One-Out cross-validation
loo_cv(X, y, clf)




Mean score: 0.953 (+/-0.017)
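
The hand-written loop is close to what `cross_val_score` does when given a `LeaveOneOut` splitter. A minimal sketch on the same iris data follows; `random_state=33` is an added assumption for repeatability, so the score need not match the figure above exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf_loo = DecisionTreeClassifier(random_state=33)

# One fit per sample: each score is 1.0 (correct) or 0.0 (wrong)
scores = cross_val_score(clf_loo, iris.data, iris.target, cv=LeaveOneOut())

# Standard error of the mean, equivalent to scipy.stats.sem
std_err = scores.std(ddof=1) / np.sqrt(len(scores))
print("Mean score: {0:.3f} (+/-{1:.3f})".format(scores.mean(), std_err))
```

Because `cross_val_score` clones the estimator for every fold, the classifier passed in is left unfitted, unlike in the manual loop.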


In [72]:
loo_cv(X_train, y_train, clf)


Mean score: 0.924 (+/-0.026)


Try to improve performance using Random Forests

In [75]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=33)
clf = clf.fit(X_train, y_train)
loo_cv(X_train, y_train, clf)

Mean score: 0.924 (+/-0.026)


To estimate performance on future data, train on the training set and test on the evaluation set.

In [77]:
clf_dt = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf_dt.fit(X_train, y_train)
measure_performance(X_test, y_test, clf_dt)



Accuracy: 0.978
Classification Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45

Confusion Matrix
[[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]
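
The figures in the classification report can be recovered from the confusion matrix alone. As a sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, the calculation for the matrix above is:

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class)
cm = np.array([[19, 0, 0],
               [0, 12, 1],
               [0, 0, 13]])

true_positives = np.diag(cm)

# Precision: correct predictions of a class / all predictions of that class
precision = true_positives / cm.sum(axis=0)

# Recall: correct predictions of a class / all true members of that class
recall = true_positives / cm.sum(axis=1)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: all correct predictions / all samples
accuracy = true_positives.sum() / cm.sum()

print("Precision:", np.round(precision, 2))
print("Recall:   ", np.round(recall, 2))
print("F1:       ", np.round(f1, 2))
print("Accuracy: {0:.3f}".format(accuracy))
```

The single off-diagonal entry, one class 1 sample predicted as class 2, is what lowers the recall of class 1 and the precision of class 2.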
