<a href="https://www.kaggle.com/code/yeemeitsang/titanic-xgboost-rf-ensemble?scriptVersionId=129648907" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Introduction**<br>
Welcome to my Kaggle notebook, where I'll guide you through building a hybrid model of an XGBoost model and a Random Forest model.

The XGBoost model requires no filling in of null values and only needs one-hot encoding for categorical features. On the other hand, the Random Forest model requires imputation of missing values by replacing them with the respective median values, in addition to one-hot encoding.

To find the optimal hyperparameters for each model that would minimize the log loss, we conduct grid searches with cross-validation splitting. We use two different methods for grid searching: the XGBoost model uses itertools and a custom function, while the Random Forest model uses Scikit-learn's GridSearchCV.

Both models are given equal weight in the hybrid model. With minimal feature engineering, the hybrid model achieves an accuracy of approximately 0.77 on the competition test dataset.

If you're interested in exploring other machine learning techniques for this problem, I also have a TensorFlow Keras Sequential model and a Scikit-learn Logistic Regression model in [my GitHub repository](https://github.com/a-t-em/Kaggle-titanic-competition). Feel free to check it out and see how these models compare to the hybrid model presented here.

In [1]:
#import libraries
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import make_scorer, log_loss
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import itertools



**Load and prepare training data**

In [2]:
#load training dataset
df_train = pd.read_csv('/kaggle/input/titanic/train.csv')
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
#split data into features and targets
X_train = df_train.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
y_train = df_train.Survived
#observe the first few rows of each
display(X_train.head())
y_train[:5]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.925,S
3,1,female,35.0,1,0,53.1,S
4,3,male,35.0,0,0,8.05,S


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [4]:
#define the categorical and numerical features for preprocessing
cat_features = ['Sex', 'Embarked']
num_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

**Build xgboost model**

- preprocessing

In [5]:
#define the column transformer to one-hot encode categorical features
xgb_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)],
    remainder='passthrough'
)

- conduct grid search to find optimal hyperparameters

In [6]:
#define parameter space
param_space = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1.0, 2.0, 4.0],
    'n_estimators': [10, 50, 100],
    'learning_rate': [0.01, 0.1],
}

#get all possible combinations using itertools
param_combinations = itertools.product(param_space['max_depth'], 
                                       param_space['min_child_weight'], 
                                       param_space['n_estimators'], 
                                       param_space['learning_rate'])

In [7]:
params = []
scores = []

#loop over all combinations and calculate their log losses
for max_depth, min_child_weight, n_estimators, learning_rate in param_combinations:
    score_folds = []
    #perform cross validation
    kf = KFold(n_splits=5, shuffle=True)
    #split data into training and validation data for each fold
    for tr_idx, va_idx in kf.split(X_train):
        tr_x, va_x = X_train.iloc[tr_idx], X_train.iloc[va_idx]
        tr_y, va_y = y_train.iloc[tr_idx], y_train.iloc[va_idx]
        #use pipeline to facilitate processing
        pipeline = Pipeline([
                        ('preprocessor', xgb_preprocessor),
                        ('model', xgb.XGBClassifier(n_estimators=n_estimators, \
                                                    max_depth=max_depth, \
                                                    min_child_weight=min_child_weight, \
                                                    learning_rate=learning_rate))
                   ])
        pipeline.fit(tr_x, tr_y)
        #predict on validation data
        va_pred = pipeline.predict_proba(va_x)[:, 1]
        #calculate log loss
        logloss = log_loss(va_y, va_pred)
        score_folds.append(logloss)
    
    #take the average score across all folds for each combination
    score_mean = np.mean(score_folds)
    #record the params and the average score
    params.append((max_depth, min_child_weight, n_estimators, learning_rate))
    scores.append(score_mean)

In [8]:
#show the minimum log loss from all parameter combinations
display(np.array(scores).min())
best_idx = np.argsort(scores)[0]
#get the parameters that minimizes log loss
xgb_best_param = params[best_idx]
#view best parameters
display(xgb_best_param)
#get params stored in tuple as variables
max_depth, min_child_weight, n_estimators, learning_rate = xgb_best_param

0.40799503066353837

(5, 4.0, 50, 0.1)

- finalize model

In [9]:
# plug the best params values back into the model
xgb_pipeline = Pipeline([
                        ('preprocessor', xgb_preprocessor),
                        ('model', xgb.XGBClassifier(n_estimators=n_estimators, \
                                                    max_depth=max_depth, \
                                                    min_child_weight=min_child_weight, 
                                                    learning_rate=learning_rate))
                   ])
#fit pipeline
xgb_pipeline.fit(X_train, y_train)

**Build random forest model**

- preprocessing

In [10]:
#fill in missing values and normalize them
minmax_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

#bundle together with one hot encoding for categorical values 
rf_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
        ('minmax', minmax_transformer, num_features)]
)

- conduct grid search to find optimal hyperparameters

In [11]:
#define parameter space
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 15]
}

In [12]:
#define model and pipeline
model_rf = RandomForestClassifier()
rf_pipeline = Pipeline(steps=[
    ('preprocessor', rf_preprocessor),
    ('classifier', model_rf)
])
rf_pipeline.fit(X_train, y_train)

In [13]:
#create custom scorer for negative log loss
scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
#perform grid search with 5 fold cross validation
grid_search = GridSearchCV(rf_pipeline, param_grid=param_grid, cv=5, scoring=scorer)
grid_search.fit(X_train, y_train)
#view the best params and the best score 
grid_search.best_params_, grid_search.best_score_

({'classifier__max_depth': 5, 'classifier__n_estimators': 50},
 -0.4181848361259025)

In [14]:
#get best params as variables
n_estimators = grid_search.best_params_['classifier__n_estimators']
max_depth = grid_search.best_params_['classifier__max_depth']
n_estimators, max_depth

(50, 5)

- finalize model

In [15]:
#plug best params back into the model
model_rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
rf_pipeline.fit(X_train, y_train)

**Predict on test dataset**

In [16]:
#load test dataset
df_test = pd.read_csv('/kaggle/input/titanic/test.csv')
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [17]:
#predict with xgboost model
xgb_preds = xgb_pipeline.predict_proba(df_test)[:, 1]
#view first few prediction probabilities
xgb_preds[:5]

array([0.06452899, 0.29458043, 0.08683947, 0.1409582 , 0.4075573 ],
      dtype=float32)

In [18]:
#predict with random forest model
rf_preds = rf_pipeline.predict_proba(df_test)[:, 1]
#view first few prediction probabilities
rf_preds[:5]

array([0.06, 0.21, 0.17, 0.67, 0.58])

In [19]:
#set weights for each model and combine results 
pred = xgb_preds*0.5 + rf_preds*0.5
#set threshold and derive labels
pred_label = np.where(pred > 0.5, 1, 0)
#view first few labels
pred_label[:5]

array([0, 0, 0, 0, 0])

**Prepare submission file**

In [20]:
#load sample file
df_sub = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
df_sub.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [21]:
#replace target column with predicted labels
df_sub['Survived'] = pred_label
#view stats to see if they make sense
df_sub.Survived.describe()

count    418.000000
mean       0.337321
std        0.473362
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64

In [22]:
#compile csv file for submission
df_sub.to_csv('submission.csv', index=False)