### Ensemble Method

#### Introduction
Ensemble methods are powerful techniques in machine learning that ***combine predictions from multiple models*** to improve accuracy and robustness. By aggregating the outputs of various models, ensemble methods can <u>***mitigate individual model weaknesses and leverage their strengths, resulting in better overall performance***</u>.

One common approach in ensemble methods is to use ***weighted averages*** of predictions from different models. This approach assigns weights to each model's prediction and combines them to produce a final prediction. <u>The weights can be optimized to minimize a loss function, such as log loss, to achieve the best possible performance</u>.
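
For reference, the weighted average and the log loss it is tuned against can be written as below. The notation is introduced here for illustration and is an assumption, not part of the original text: \( p_{ik} \) is model \( k \)'s predicted probability for sample \( i \), \( w_k \) is that model's weight, \( K \) is the number of models, \( N \) is the number of samples, and \( y_i \in \{0, 1\} \) is the true label.

\[
\hat{p}_i = \sum_{k=1}^{K} w_k \, p_{ik}, \qquad w_k \ge 0, \qquad \sum_{k=1}^{K} w_k = 1
\]

\[
\mathcal{L}(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \right]
\]

Optimizing the weights means choosing \( w \) to make \( \mathcal{L}(w) \) as small as possible while respecting the two conditions on \( w \).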

#### Weighted Average Ensemble with Optimization

To illustrate this, we will use predictions from several models and optimize the weights assigned to each model's predictions to minimize the log loss.

#### Step-by-Step Process

1. **Generate Initial Predictions**:
   We start by generating predictions from multiple models.

2. **Define the Objective Function**:
   The objective function is the log loss, which we aim to minimize.

3. **Optimize the Weights**:
   Using constrained optimization, we find the optimal weights that minimize the log loss.

4. **Calculate the Final Combined Predictions**:
   Use the optimized weights to calculate the combined predictions.

5. **Evaluate the Combined Predictions**:
   Calculate the accuracy and other metrics for the combined predictions.

### Explanation of Constraints

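- **Equality Constraint**: A condition that requires an expression to be exactly zero. Here, `np.sum(w) - 1 = 0` forces the weights \( w \) to sum to 1.
- **Usage in Optimization**: The optimizer only accepts solutions that satisfy the constraint, so it searches over weight vectors that sum to 1 rather than over all possible weights.
- **Why Important**: Together with the bounds that keep each weight between 0 and 1, the constraint makes the combined prediction a convex combination of the model outputs. It therefore stays within the range of the individual predictions and remains a valid probability.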
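
A minimal sketch of why the sum-to-one constraint matters, using made-up prediction values (the array `preds`, the helper `sum_to_one`, and both weight vectors are hypothetical and chosen only for illustration). It shows that bounds alone are not enough: every weight can lie in [0, 1] and the combined output can still exceed 1.

```python
import numpy as np

# Hypothetical predictions from three models for four samples (illustrative values only).
preds = np.array([
    [0.90, 0.80, 0.70],
    [0.10, 0.20, 0.40],
    [0.60, 0.50, 0.55],
    [0.05, 0.10, 0.00],
])

def sum_to_one(w):
    """Equality constraint in SciPy's convention: satisfied when the return value is 0."""
    return np.sum(w) - 1

valid_w = np.array([0.5, 0.3, 0.2])    # non-negative and sums to 1
invalid_w = np.array([0.9, 0.8, 0.7])  # each weight is within [0, 1], but the sum is 2.4

combined_valid = preds @ valid_w       # convex combination of each row
combined_invalid = preds @ invalid_w   # no longer bounded by 1

print("constraint value (valid):  ", sum_to_one(valid_w))
print("max combined (valid):      ", combined_valid.max())
print("constraint value (invalid):", sum_to_one(invalid_w))
print("max combined (invalid):    ", combined_invalid.max())
```

With the valid weights, every combined value lies between the smallest and largest prediction in its row. With the invalid weights, the first row combines to more than 1, which cannot be read as a probability.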

### Summary

Ensemble methods combine predictions from multiple models and can improve on the performance of any single model. With a weighted average, the weights can be tuned to minimize a loss function such as log loss, which lets the optimizer favor the models that contribute most and assign near-zero weight to the rest. The equality constraint on the weights, together with the bounds, keeps the combined output valid and interpretable as a probability.

The example that follows applies this approach to five previously trained classifiers on the Titanic dataset.


In [19]:
# joblib is used below to load the previously saved models
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import KNNImputer, SimpleImputer

# Models
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier

# to tune the hyperparameters
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt

In [20]:
# Load the dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)

# Extract features and target
X = df.drop(columns=['survived', 'alive', 'pclass'])
y = df['survived']
# Declare the target (survived) as categorical; pclass was dropped above
y = y.astype('category')

# One-hot encode categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)

# Fill missing data using KNN imputation
imputer = KNNImputer(n_neighbors=5)  # You can set n_neighbors to the desired value
X_oh = pd.DataFrame(imputer.fit_transform(X_encoded), columns=X_encoded.columns)

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(exclude=['object']).columns.tolist()

# Impute numerical columns using KNN imputation
imputer_num = KNNImputer(n_neighbors=5)
X[numerical_cols] = imputer_num.fit_transform(X[numerical_cols])

# Impute categorical columns using SimpleImputer with the most frequent strategy
imputer_cat = SimpleImputer(strategy='most_frequent')
X[categorical_cols] = imputer_cat.fit_transform(X[categorical_cols])

# Ordinal encode categorical columns
encoder = OrdinalEncoder()
X[categorical_cols] = encoder.fit_transform(X[categorical_cols])

# Convert categorical columns to category type
for col in categorical_cols:
    X[col] = X[col].astype('category')

# Split the data into training and testing sets
X_train_oh, X_test_oh, y_train_oh, y_test_oh = train_test_split(X_oh, y, test_size=0.3, random_state=42, stratify=y)
X_train_ord, X_test_ord, y_train_ord, y_test_ord = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [21]:
# Load the models
best_logistic_model = joblib.load('./models/best_logistic_model.pkl')
best_catboost_model = joblib.load('./models/best_catboost_model.pkl')
best_xgb_model = joblib.load('./models/best_xgb_model.pkl')
best_lgb_model = joblib.load('./models/best_lgb_model.pkl')
best_rf_model = joblib.load('./models/best_rf_model.pkl')

In [8]:
# Define the models list
models = [
    ('logistic_regression', best_logistic_model, X_test_oh),
    ('catboost', best_catboost_model, X_test_oh),
    ('xgboost', best_xgb_model, X_test_ord),
    ('lightgbm', best_lgb_model, X_test_ord),
    ('random_forest', best_rf_model, X_test_ord)
]

In [22]:
# Generate predictions
predictions = {
    model_name: model.predict_proba(X_test)[:, 1] for model_name, model, X_test in models
}

# Convert predictions to a pandas DataFrame
predictions_df = pd.DataFrame(predictions)

In [23]:
predictions_df

|     | logistic_regression | catboost | xgboost | lightgbm | random_forest |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.505286 | 0.605833 | 0.317799 | 0.128509 | 0.270507 |
| 1 | 0.097941 | 0.098159 | 0.204992 | 0.076535 | 0.073525 |
| 2 | 0.084963 | 0.096180 | 0.227610 | 0.171168 | 0.189331 |
| 3 | 0.082094 | 0.052252 | 0.187112 | 0.027836 | 0.057506 |
| 4 | 0.055174 | 0.092683 | 0.195731 | 0.075663 | 0.105241 |
| ... | ... | ... | ... | ... | ... |
| 263 | 0.768450 | 0.844021 | 0.588063 | 0.755902 | 0.674954 |
| 264 | 0.312985 | 0.264240 | 0.468992 | 0.500815 | 0.420177 |
| 265 | 0.096045 | 0.094722 | 0.194328 | 0.063431 | 0.067604 |
| 266 | 0.072284 | 0.124410 | 0.195126 | 0.179979 | 0.111126 |


In [27]:
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss, accuracy_score

# Function to calculate the weighted average predictions
def weighted_predictions(weights, predictions_df):
    weighted_preds = np.dot(predictions_df.values, weights)
    return weighted_preds

# Objective function to minimize (log loss)
def objective_function(weights, predictions_df, y_true):
    weighted_preds = weighted_predictions(weights, predictions_df)
    return log_loss(y_true, weighted_preds)

# Initial weights (equal weights to start with)
initial_weights = np.ones(predictions_df.shape[1]) / predictions_df.shape[1]

# Bounds: weights should be between 0 and 1
bounds = [(0, 1)] * predictions_df.shape[1]

# Constraints: weights sum to 1
constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}

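# Optimize weights using the SLSQP algorithm.
# Note: the weights are fitted on the same test labels that are used for
# evaluation below, so the reported metrics are likely optimistic.
result = minimize(
    objective_function,                 # The function to minimize
    initial_weights,                    # Initial guess for the weights
    args=(predictions_df, y_test_ord),  # Additional arguments passed to the objective function
    method='SLSQP',                     # Made explicit; SciPy also selects SLSQP by default when constraints are given
    bounds=bounds,                      # Bounds for each weight
    constraints=constraints             # Constraints for the optimization
)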

# Get the optimized weights
optimized_weights = result.x

# Calculate the final combined predictions using the optimized weights
final_predictions = weighted_predictions(optimized_weights, predictions_df)

# Convert final predictions to binary outcomes
final_predictions_binary = (final_predictions >= 0.5).astype(int)

# Evaluate the combined predictions
final_log_loss = log_loss(y_test_ord, final_predictions)
final_accuracy = accuracy_score(y_test_ord, final_predictions_binary)

print(f"Optimized Weights: {optimized_weights}")
print(f"Final Log Loss: {final_log_loss}")
print(f"Final Accuracy: {final_accuracy}")


Optimized Weights: [4.44418743e-01 2.26797597e-01 1.24683250e-18 3.28783660e-01
 1.33356867e-17]
Final Log Loss: 0.4116370431312995
Final Accuracy: 0.8246268656716418
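
Because the weights in the cell above are fitted against `y_test_ord` and then scored on the same rows, the reported log loss and accuracy probably flatter the ensemble. A minimal sketch of one alternative follows, assuming synthetic data in place of the notebook's predictions: fit the weights on one part of the data and score them on another. The helper `fit_weights`, the arrays `P` and `y`, and the noise levels are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Synthetic stand-in for predictions_df: three noisy "models" scoring 400 samples.
n = 400
y = rng.integers(0, 2, size=n)
noise_levels = [0.15, 0.25, 0.45]
P = np.column_stack([
    np.clip(y + rng.normal(0, s, size=n), 0.01, 0.99) for s in noise_levels
])

def clipped(p):
    """Keep probabilities away from 0 and 1 so the log loss stays finite."""
    return np.clip(p, 1e-6, 1 - 1e-6)

def fit_weights(P_fit, y_fit):
    """Find weights in [0, 1] that sum to 1 and minimize log loss on (P_fit, y_fit)."""
    k = P_fit.shape[1]
    result = minimize(
        lambda w: log_loss(y_fit, clipped(P_fit @ w)),
        np.full(k, 1.0 / k),                 # start from equal weights
        method="SLSQP",
        bounds=[(0, 1)] * k,
        constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1},
    )
    return result.x

# Fit on the first half, evaluate on the untouched second half.
half = n // 2
weights = fit_weights(P[:half], y[:half])
held_out_loss = log_loss(y[half:], clipped(P[half:] @ weights))

print("weights:", np.round(weights, 3))
print("held-out log loss:", round(held_out_loss, 4))
```

In the notebook itself, the same idea would mean carving a validation set out of the training data, generating model predictions on it, and fitting the weights there before touching the test set.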


In [17]:
predictions_df['final_predictions'] = final_predictions
predictions_df['final_predictions_binary'] = final_predictions_binary
predictions_df['y_true'] = y_test_ord.values
predictions_df.head(30)

|     | logistic_regression | catboost | xgboost | lightgbm | random_forest | final_predictions | final_predictions_binary | y_true |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.505286 | 0.605833 | 0.317799 | 0.128509 | 0.270507 | 0.404211 | 0 | 0 |
| 1 | 0.097941 | 0.098159 | 0.204992 | 0.076535 | 0.073525 | 0.090952 | 0 | 0 |
| 2 | 0.084963 | 0.09618 | 0.22761 | 0.171168 | 0.189331 | 0.11585 | 0 | 0 |
| 3 | 0.082094 | 0.052252 | 0.187112 | 0.027836 | 0.057506 | 0.057487 | 0 | 1 |
| 4 | 0.055174 | 0.092683 | 0.195731 | 0.075663 | 0.105241 | 0.070417 | 0 | 1 |
| 5 | 0.117588 | 0.193432 | 0.205152 | 0.137574 | 0.170325 | 0.14136 | 0 | 0 |
| 6 | 0.150361 | 0.041531 | 0.185214 | 0.022098 | 0.027323 | 0.083508 | 0 | 0 |
| 7 | 0.972699 | 0.882552 | 0.706221 | 0.93549 | 0.877583 | 0.94002 | 1 | 1 |
| 8 | 0.077099 | 0.121145 | 0.214668 | 0.160282 | 0.113782 | 0.114438 | 0 | 0 |
| 9 | 0.846926 | 0.930686 | 0.704011 | 0.93482 | 0.958956 | 0.894821 | 1 | 1 |
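
One way to judge whether the weighted ensemble helps is to score it alongside each base model on the same rows. The sketch below does this for the ten rows displayed above, copied by hand into a hypothetical DataFrame named `sample`. Ten rows are far too few to draw conclusions from; on the real data the same loop would run over the full `predictions_df`.

```python
import pandas as pd
from sklearn.metrics import log_loss

# The ten displayed rows, copied by hand (a small sample, not the full test set).
sample = pd.DataFrame({
    "logistic_regression": [0.505286, 0.097941, 0.084963, 0.082094, 0.055174,
                            0.117588, 0.150361, 0.972699, 0.077099, 0.846926],
    "catboost":            [0.605833, 0.098159, 0.09618, 0.052252, 0.092683,
                            0.193432, 0.041531, 0.882552, 0.121145, 0.930686],
    "xgboost":             [0.317799, 0.204992, 0.22761, 0.187112, 0.195731,
                            0.205152, 0.185214, 0.706221, 0.214668, 0.704011],
    "lightgbm":            [0.128509, 0.076535, 0.171168, 0.027836, 0.075663,
                            0.137574, 0.022098, 0.93549, 0.160282, 0.93482],
    "random_forest":       [0.270507, 0.073525, 0.189331, 0.057506, 0.105241,
                            0.170325, 0.027323, 0.877583, 0.113782, 0.958956],
    "final_predictions":   [0.404211, 0.090952, 0.11585, 0.057487, 0.070417,
                            0.14136, 0.083508, 0.94002, 0.114438, 0.894821],
    "y_true":              [0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
})

# Log loss of every probability column against the true labels.
score_cols = [c for c in sample.columns if c != "y_true"]
losses = {col: log_loss(sample["y_true"], sample[col]) for col in score_cols}

# Print from best (lowest loss) to worst.
for name, value in sorted(losses.items(), key=lambda kv: kv[1]):
    print(f"{name:>20}: {value:.4f}")
```

Rows 3 and 4 are survivors that every model scores below 0.2, so they dominate the log loss of all six columns in this small sample.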
