First, we will load the relevant libraries and functions. Since we will be comparing ensemble models to our own ensemble and also to a baseline model, we will import the following ensemble classes: RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, and VotingClassifier. We will compare these to the performance of a Logistic Regression and a KNN model. We will also use the Decision Tree model to build our own ensemble.

In [71]:
# import relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import time # to measure how long the models take
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, precision_recall_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [58]:
# read the Titanic data set from seaborn
data = sns.load_dataset('titanic')

In [59]:
data

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [60]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


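Now we will clean the data. Since we are already familiar with this data set, we know what to do: drop the columns deck, class, who, adult_male, embark_town, alive, and alone, which are either mostly missing (deck) or repeat information held in other columns. Then fill in the missing values for age with the mean age of the passenger's sex, and create a new feature for total family size. The categorical columns are converted to dummy variables later, inside the preprocessing step. Let's begin: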

In [61]:
# drop columns
df = data.drop(['deck','class','who','adult_male','embark_town','alive','alone'], axis=1)

In [62]:
# fill in missing values for Age
df['age'] = df['age'].fillna(df.groupby('sex')['age'].transform('mean'))

In [63]:
# add total family size
df['fam'] = df['parch'] + df['sibsp']

In [64]:
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,fam
0,0,3,male,22.0,1,0,7.25,S,1
1,1,1,female,38.0,1,0,71.2833,C,1
2,1,3,female,26.0,0,0,7.925,S,0
3,1,1,female,35.0,1,0,53.1,S,1
4,0,3,male,35.0,0,0,8.05,S,0
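
The imputation and family-size steps above can be checked on a small frame where the answer is easy to verify by hand. This is a minimal sketch: the `toy` frame and its values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic set
toy = pd.DataFrame({
    "sex":   ["male", "male", "male", "female", "female"],
    "age":   [20.0, 40.0, np.nan, 28.0, np.nan],
    "sibsp": [1, 0, 0, 2, 0],
    "parch": [0, 0, 1, 1, 0],
})

# Mean age per sex, broadcast back to one value per row
group_mean = toy.groupby("sex")["age"].transform("mean")

# Only the missing entries are replaced; known ages are left untouched
toy["age"] = toy["age"].fillna(group_mean)

# Family size as siblings/spouses plus parents/children
toy["fam"] = toy["parch"] + toy["sibsp"]

print(toy)
```

The missing male age becomes 30.0 (the mean of 20 and 40) and the missing female age becomes 28.0. Note that `fam` counts relatives on board and does not include the passenger.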


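Now that the data is ready, we can begin comparing models. First we will separate the target from the features, then define a preprocessor that standardizes the numeric columns with StandardScaler and one-hot encodes the categorical columns with OneHotEncoder, and finally split the data into training and testing sets.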

In [67]:
y = df["survived"]
X = df.drop("survived", axis=1)

In [69]:
# Identify numerical and categorical columns
num_cols = X.select_dtypes(include=['float64', 'int64']).columns
cat_cols = X.select_dtypes(include=['object']).columns


In [72]:
# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(), cat_cols)])

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
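
The split above does not set a seed, so the rows that land in the test set, and therefore every score reported below, will differ from run to run. A hedged sketch of a reproducible, class-balanced split follows; the arrays `X_toy` and `y_toy` and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(y_tr), len(y_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```

With stratification, both parts keep the 40% share of positives, which is useful on a data set as small as this one, where an unlucky split can shift the metrics noticeably.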

In [75]:
# Set up models to compare - I am adding some initial parameters

knn = KNeighborsClassifier(n_neighbors=10)
log_reg = LogisticRegression()
dt = DecisionTreeClassifier(max_depth=20)
rf = RandomForestClassifier()
ada = AdaBoostClassifier()
bag = BaggingClassifier()
voting = VotingClassifier(estimators=[('lr', log_reg), ('knn', knn), ('dt', dt)])

In [76]:
classifiers = {
    'K-Nearest Neighbors': knn,
    'Logistic Regression': log_reg,
    'Decision Tree': dt,
    'Random Forest': rf,
    'AdaBoost': ada,
    'Bagging': bag,
    'Voting': voting
}
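
The VotingClassifier defined above uses the default `voting='hard'`. The difference between hard and soft voting can be sketched by hand for a single passenger. The probabilities below are invented for illustration, and thresholding the mean at 0.5 is a simplification of what scikit-learn does for the binary case.

```python
from collections import Counter

# Hypothetical class-1 probabilities from three models for one passenger
probas = {"lr": 0.45, "knn": 0.40, "dt": 0.90}

# Hard voting: each model casts one vote for its own predicted label
votes = [int(p >= 0.5) for p in probas.values()]
hard_label = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities first, then threshold once
mean_proba = sum(probas.values()) / len(probas)
soft_label = int(mean_proba >= 0.5)

print(votes, hard_label)
print(round(mean_proba, 3), soft_label)
```

Here two lukewarm "no" votes outnumber one confident "yes" under hard voting, while soft voting lets the confident model tip the average above 0.5. The two schemes can disagree whenever the models differ in how sure they are.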

In [77]:
# Create dictionary to store the results of each model
results = {}


In [78]:
# Loop through list of models to compare performance
for name, clf in classifiers.items():
    start_time = time.time()
    
    # Create pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', clf)])
    
    # Fit the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    
    # Compute metrics
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    
    # Store results
    results[name] = {
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'Accuracy': accuracy,
        'Time (s)': elapsed_time
    }

# Convert results to DataFrame for easier viewing
results_df = pd.DataFrame(results).T
print(results_df)

                     Precision    Recall  F1-Score  Accuracy  Time (s)
K-Nearest Neighbors   0.807018  0.686567  0.741935  0.821229  0.181435
Logistic Regression   0.681159  0.701493  0.691176  0.765363  0.073475
Decision Tree         0.742857  0.776119  0.759124  0.815642  0.032267
Random Forest         0.739130  0.761194  0.750000  0.810056  0.218131
AdaBoost              0.750000  0.761194  0.755556  0.815642  0.103001
Bagging               0.796875  0.761194  0.778626  0.837989  0.040676
Voting                0.781250  0.746269  0.763359  0.826816  0.039168


### Interpretation

* The KNN model has the highest precision and the lowest recall. This means that when it predicts a survivor it is usually right, but it wrongly classifies a fair number of actual survivors as non-survivors (false negatives). The F1 score is decent and accuracy is fairly high, but compute time is the second highest. This is not a very efficient model.

* Logistic Regression has the lowest precision, F1 score, and accuracy, and the second-lowest recall. Its run time sits in the middle of the pack.

* The Decision Tree with max_depth set to 20 (the maximum number of levels in the tree, not a number of trees) is the fastest model, has the highest recall, and has decent performance metrics overall.

* The Random Forest model is the slowest of all and, surprisingly, scores lower than the single Decision Tree on every metric.

* AdaBoost takes roughly half the time of the Random Forest model and outperforms on most metrics.

* Bagging outperforms both AdaBoost and RandomForest, and is very efficient.

* Voting, which lets us combine the models we are interested in, is the second-fastest model (just behind the Decision Tree) and has decent metrics across the board.
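
To make the precision and recall trade-off concrete, the KNN row can be reproduced from confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are inferred from the reported metrics, assuming a test set of 179 rows (20% of 891, rounded up).

```python
# Confusion counts inferred from the KNN row, assuming 179 test rows
tp, fp, fn, tn = 46, 11, 21, 101

precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 6), round(recall, 6), round(f1, 6), round(accuracy, 6))
# 0.807018 0.686567 0.741935 0.821229
```

Under these counts, KNN raises only 11 false alarms but misses 21 of the 67 actual survivors, which is exactly the high-precision, low-recall pattern described above.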


Now let's tune each model and see if performance improves.

In the next cell, I import the warnings module and add a filter that hides UserWarning messages. The tuning code below issues warnings because not every hyperparameter combination produces a reasonable outcome. I don't want a huge display of warnings, so I am just suppressing them.

In [90]:
import warnings

warnings.filterwarnings('ignore', category=UserWarning)
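
The filter above applies to the rest of the session. If you would rather silence warnings only around one piece of code, the standard library's `warnings.catch_warnings` context manager offers a scoped alternative. In this sketch, `noisy_fit` is a hypothetical stand-in for a model fit that emits a UserWarning.

```python
import warnings


def noisy_fit():
    """Hypothetical stand-in for a model fit that emits a UserWarning."""
    warnings.warn("this combination did not converge", UserWarning)
    return "fitted"


# Suppress UserWarning only inside this block
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    quiet_result = noisy_fit()

# Outside that block the earlier filters are restored; record the
# warning here to show that it is emitted again
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_fit()

print(quiet_result, loud_result, len(caught))
```

This keeps unrelated warnings visible elsewhere in the notebook, which can help when something other than the grid search goes wrong.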

In this next section, we test a few hyperparameter ranges for each model with a grid search. You can pick whatever hyperparameters you're interested in testing. Play around with the parameters of each model and compare the results!

In [92]:
# Import additional libraries
from sklearn.model_selection import GridSearchCV

# Hyperparameter grids for tuning
knn_params = {'classifier__n_neighbors': [3, 5, 7, 20, 30, 50, 100]}
log_reg_params = {'classifier__C': [0.1, 1, 10]}
dt_params = {'classifier__max_depth': [10,20,30,40,50]}
rf_params = {'classifier__n_estimators': [50, 100, 150], 'classifier__max_depth': [None, 10, 20, 30, 50]}
ada_params = {'classifier__n_estimators': [25, 50, 75]}
bag_params = {'classifier__n_estimators': [5, 10, 20]}
voting_params = {'classifier__voting': ['hard', 'soft']}

params_dict = {
    'K-Nearest Neighbors': knn_params,
    'Logistic Regression': log_reg_params,
    'Decision Tree': dt_params,
    'Random Forest': rf_params,
    'AdaBoost': ada_params,
    'Bagging': bag_params,
    'Voting': voting_params
}

# Initialize results dictionary for tuned models
tuned_results = {}

# Loop through classifiers for tuning
for name, clf in classifiers.items():
    start_time = time.time()
    
    # Create pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', clf)])
    
    # Create GridSearchCV object
    grid = GridSearchCV(pipeline, params_dict[name], cv=5)
    
    # Fit the model
    grid.fit(X_train, y_train)
    
    # Get the best estimator and predict
    best_model = grid.best_estimator_
    y_pred = best_model.predict(X_test)
    
    # Compute metrics
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    
    # Store results
    tuned_results[name] = {
        'Best Params': grid.best_params_,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'Time (s)': elapsed_time
    }

# Convert results to DataFrame for easier viewing
tuned_results_df = pd.DataFrame(tuned_results).T
print(tuned_results_df)


                                                           Best Params  \
K-Nearest Neighbors                     {'classifier__n_neighbors': 3}   
Logistic Regression                             {'classifier__C': 0.1}   
Decision Tree                            {'classifier__max_depth': 10}   
Random Forest        {'classifier__max_depth': None, 'classifier__n...   
AdaBoost                              {'classifier__n_estimators': 25}   
Bagging                                {'classifier__n_estimators': 5}   
Voting                                  {'classifier__voting': 'hard'}   

                    Precision    Recall  F1-Score  Time (s)  
K-Nearest Neighbors  0.793651  0.746269  0.769231  0.712588  
Logistic Regression  0.738462  0.716418  0.727273  0.230882  
Decision Tree        0.796875  0.761194  0.778626  0.319531  
Random Forest        0.757576  0.746269   0.75188  11.84034  
AdaBoost             0.720588  0.731343  0.725926  1.217149  
Bagging              0.770492  0.70

Let's review the results. Note that Time (s) now covers the whole grid search for each model, so it is not comparable to the untuned timings. Even after tuning, the Random Forest takes by far the longest (about 11.8 s) without beating KNN: its F1 score is 0.752 against 0.769 for KNN. I would argue that either KNN or Voting is the best choice for this problem. From here, we can continue working on optimizing the KNN model.
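
Much of the Random Forest's long tuning time comes from the size of its grid. A quick way to estimate the cost of a grid search before running it is to count the candidate combinations and multiply by the number of folds. The sketch below copies two of the grids from the tuning cell and assumes the GridSearchCV default of refitting the best candidate once on the full training set.

```python
from itertools import product

# Grids copied from the tuning cell above
grids = {
    "K-Nearest Neighbors": {"classifier__n_neighbors": [3, 5, 7, 20, 30, 50, 100]},
    "Random Forest": {
        "classifier__n_estimators": [50, 100, 150],
        "classifier__max_depth": [None, 10, 20, 30, 50],
    },
}
cv_folds = 5

fit_counts = {}
for name, grid in grids.items():
    # Every combination of the listed values is one candidate
    n_candidates = len(list(product(*grid.values())))
    # One fit per candidate per fold, plus one final refit (assumed default)
    fit_counts[name] = n_candidates * cv_folds + 1
    print(name, n_candidates, fit_counts[name])
```

KNN has 7 candidates and 36 fits, while the Random Forest has 15 candidates and 76 fits, each of which trains 50 to 150 trees. Adding a second or third parameter to a grid multiplies the count, so it pays to keep the ranges small at first.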