# Stacking
Stacking is an ensemble method that creates a strong metamodel trained on the predictions of several independent base models.

- Comparison with Boosting and Bagging:
    * Training Data:
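        - Stacking trains the base models on the full training set; the metamodel is then trained on the base models' predictions for that same data (scikit-learn's `StackingClassifier` produces these predictions with internal cross-validation).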
        - Boosting and bagging use sampling techniques to create training sets for their base models.
    * Base Models:
        - In stacking, base models can be different algorithms (e.g., logistic regression, random forest), leveraging their unique strengths.
        - Boosting and bagging often use a large number of similar base models (sometimes hundreds).
    * Metamodel Complexity:
        - Stacking builds a more complex metamodel that learns from the base model predictions.
        - Boosting and bagging typically use naive metamodels that average or vote on predictions.
- Functionality of the Metamodel:
    * The metamodel can prioritize and weight the contributions of different base models based on their unique insights.
    * Additional data can be incorporated into the metamodel alongside base model predictions, depending on the problem.
- Advantages of Stacking:
    * Utilizing different algorithms allows for diverse insights from the training data.
    * The framework combines outputs from various models, enhancing predictive power.

To summarize: stacking fits several different base models on the full training set and feeds their predictions into a metamodel.
The metamodel is trained to produce the final prediction, learning how much weight to give each base model.
Because the base models can capture different patterns in the data, their combination can predict better than any single one of them.
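
The flow described above can be sketched by hand, without `StackingClassifier`. This is a minimal illustration under stated assumptions, not the notebook's method: the data is synthetic (`make_classification`), and the variable names (`base_models`, `meta_train`, `meta_model`) are made up for the example. It shows the two stages: out-of-fold base-model predictions become the metamodel's training features, and the metamodel produces the final prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in data (an assumption; not the dataset used in this notebook)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# 1. Out-of-fold class-1 probabilities from each base model
#    become the metamodel's training features
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# 2. Train the metamodel on those predictions
meta_model = LogisticRegression().fit(meta_train, y_tr)

# 3. Refit each base model on the full training set for prediction time
for m in base_models:
    m.fit(X_tr, y_tr)

# 4. Predict: base-model probabilities go in, the final prediction comes out
meta_test = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base_models])
final_pred = meta_model.predict(meta_test)
print('Hold-out accuracy:', round((final_pred == y_te).mean(), 3))
```

The metamodel's coefficients (`meta_model.coef_`) give a rough view of how much weight each base model receives.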


In [59]:
# import relevant libraries
import joblib
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [61]:
# Check the hyperparameters

# Define the estimators
estimators = [('gb', GradientBoostingClassifier()), ('rf', RandomForestClassifier())]
# Initialize the model
sc = StackingClassifier(estimators = estimators)
sc.get_params()

{'cv': None,
 'estimators': [('gb', GradientBoostingClassifier()),
  ('rf', RandomForestClassifier())],
 'final_estimator': None,
 'n_jobs': None,
 'passthrough': False,
 'stack_method': 'auto',
 'verbose': 0,
 'gb': GradientBoostingClassifier(),
 'rf': RandomForestClassifier(),
 'gb__ccp_alpha': 0.0,
 'gb__criterion': 'friedman_mse',
 'gb__init': None,
 'gb__learning_rate': 0.1,
 'gb__loss': 'log_loss',
 'gb__max_depth': 3,
 'gb__max_features': None,
 'gb__max_leaf_nodes': None,
 'gb__min_impurity_decrease': 0.0,
 'gb__min_samples_leaf': 1,
 'gb__min_samples_split': 2,
 'gb__min_weight_fraction_leaf': 0.0,
 'gb__n_estimators': 100,
 'gb__n_iter_no_change': None,
 'gb__random_state': None,
 'gb__subsample': 1.0,
 'gb__tol': 0.0001,
 'gb__validation_fraction': 0.1,
 'gb__verbose': 0,
 'gb__warm_start': False,
 'rf__bootstrap': True,
 'rf__ccp_alpha': 0.0,
 'rf__class_weight': None,
 'rf__criterion': 'gini',
 'rf__max_depth': None,
 'rf__max_features': 'sqrt',
 'rf__max_leaf_nodes': None
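
Two of the defaults listed above are worth a closer look: `final_estimator=None` and `passthrough=False`. The sketch below, on synthetic data with a hypothetical helper `make_stack`, checks what the default metamodel turns out to be after fitting and how `passthrough` changes the input the metamodel receives. It is an illustration only, using smaller base models than the notebook to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)

# Synthetic binary-classification data with 6 features (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

def make_stack(passthrough):
    """Hypothetical helper: build a small two-model stack."""
    estimators = [
        ('gb', GradientBoostingClassifier(n_estimators=20, random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=20, random_state=0)),
    ]
    return StackingClassifier(estimators=estimators, passthrough=passthrough)

plain = make_stack(False).fit(X_demo, y_demo)
with_features = make_stack(True).fit(X_demo, y_demo)

# With final_estimator=None, the fitted metamodel is a LogisticRegression
print(type(plain.final_estimator_).__name__)

# transform() returns what the metamodel sees: one probability column per
# base model for a binary problem, plus the original features if passthrough=True
print(plain.transform(X_demo).shape)
print(with_features.transform(X_demo).shape)
```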

In [63]:
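# Get the training data
# Note: the features file carries a leftover unnamed index column (see head() below), which is passed to the models as a feature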
X_train = pd.read_csv('./data/train_features.csv')
y_train = pd.read_csv('./data/train_labels.csv')
X_train.head()

Unnamed: 0.1,Unnamed: 0,Pclass,Sex,Age,Fare,Family_cnt,Cabin_ind
0,570,2,0,62.0,10.5,0,0
1,787,3,0,8.0,29.125,5,0
2,74,3,0,32.0,56.4958,0,0
3,113,3,1,20.0,9.825,1,0
4,635,2,1,28.0,13.0,0,0


In [65]:
y_train.head()

Unnamed: 0,Survived
0,1
1,0
2,1
3,0
4,1


In [67]:
# Helper function: print the best parameters, then the mean CV accuracy (+/- 2 standard deviations) for each parameter combination
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean,3), round(std*2,3),params))

In [79]:
# Define the hyperparameter grid for tuning
parameters = {
    'gb__n_estimators': [50,100],
    'rf__n_estimators': [50,100],
    'final_estimator': [LogisticRegression(C=0.1), LogisticRegression(C=1), LogisticRegression(C=10)],
    'passthrough': [True, False]  # True: metamodel sees the original features plus base-model predictions; False: predictions only
}

# Set up GridSearchCV with 5-fold cross-validation
cv = GridSearchCV(sc, parameters, cv=5)

# Fit the model on the training data with the hyperparameter grid
cv.fit(X_train, y_train.values.ravel())

# Print the results of the grid search to find the best parameters and performance
print_results(cv)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

BEST PARAMS: {'final_estimator': LogisticRegression(C=1), 'gb__n_estimators': 50, 'passthrough': True, 'rf__n_estimators': 100}

0.82 (+/-0.142) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 50, 'passthrough': True, 'rf__n_estimators': 50}
0.818 (+/-0.139) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 50, 'passthrough': True, 'rf__n_estimators': 100}
0.824 (+/-0.068) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 50, 'passthrough': False, 'rf__n_estimators': 50}
0.824 (+/-0.062) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 50, 'passthrough': False, 'rf__n_estimators': 100}
0.824 (+/-0.142) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 100, 'passthrough': True, 'rf__n_estimators': 50}
0.824 (+/-0.125) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 100, 'passthrough': True, 'rf__n_estimators': 100}
0.824 (+/-0.058) for {'final_estimator': Lo
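
The convergence warning in the output above comes from the `LogisticRegression` metamodel, most likely when `passthrough=True` feeds it unscaled raw features. One possible remedy, following the warning's own advice, is to scale the metamodel's inputs and raise `max_iter`. The sketch below is an assumption-laden illustration on synthetic data, not a rerun of the notebook: the step names `scale` and `lr` are arbitrary, and one column is inflated on purpose to mimic a raw-scale feature.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption); inflate one column to mimic an unscaled feature
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo[:, 0] *= 500

# Metamodel as a pipeline: standardize its inputs, then fit logistic regression
meta = Pipeline([
    ('scale', StandardScaler()),
    ('lr', LogisticRegression(max_iter=1000)),
])

stack = StackingClassifier(
    estimators=[
        ('gb', GradientBoostingClassifier(n_estimators=20, random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=meta,
    passthrough=True,
)

# Nested parameter name: final_estimator -> pipeline step 'lr' -> C
grid = {'final_estimator__lr__C': [0.1, 1, 10]}
search = GridSearchCV(stack, grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Tuning `C` through a nested parameter name, rather than listing whole `LogisticRegression(...)` objects, also keeps the `max_iter` setting and the scaler fixed across the grid.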


In [81]:
# Get the best estimator (the stacked model refit with the best parameters)
cv.best_estimator_

In [83]:
# Get the best parameters
cv.best_params_

{'final_estimator': LogisticRegression(C=1),
 'gb__n_estimators': 50,
 'passthrough': True,
 'rf__n_estimators': 100}

In [88]:
# Save (pickle) the best model to disk
joblib.dump(cv.best_estimator_, './data/models/stacked_model.pkl')

['./data/models/stacked_model.pkl']
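
A saved model can be read back with `joblib.load` and used for prediction straight away. The round trip below is a small self-contained sketch: it uses a stand-in `LogisticRegression` on synthetic data and a hypothetical temporary file name, so it does not touch `./data/models/stacked_model.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model and data (assumptions; not the stacked model from this notebook)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name
    joblib.dump(model, path)       # write the fitted model to disk
    reloaded = joblib.load(path)   # read it back as a ready-to-use estimator

# The reloaded model should reproduce the original model's predictions
same = (reloaded.predict(X_demo) == model.predict(X_demo)).all()
print('Predictions match after reload:', same)
```

The same `joblib.load` call applies to the stacked model saved above; it is generally safest to load pickled models with the same scikit-learn version that wrote them.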