Voting Classifier
The Voting Classifier is a scikit-learn meta-classifier that combines several Machine Learning estimators, whether similar or conceptually different. It assembles a panel of experts, represented by models such as decision trees, k-nearest neighbors or logistic regression, and then puts their predictions to a vote.
Scikit-learn's VotingClassifier class enables you to carry out a hard or soft vote.

In 'hard' voting, each classification model predicts a label, and the final label produced is the one predicted most frequently.
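The majority rule described above can be sketched with plain NumPy. This is only an illustration: the prediction matrix and the `hard_vote` helper are made up for the example and are not part of scikit-learn.

```python
import numpy as np

# Hypothetical label predictions from three classifiers for four samples
# (rows = classifiers, columns = samples); the values are invented.
predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])

def hard_vote(preds):
    """Return, for each sample (column), the label predicted most often."""
    return np.array([np.bincount(column).argmax() for column in preds.T])

final_labels = hard_vote(predictions)
print(final_labels)  # [0 1 0 0]
```

When two labels receive the same number of votes, `argmax` keeps the smallest label. As far as I know, scikit-learn's `VotingClassifier` also breaks ties by ascending class order, but check the documentation for your version.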

In 'soft' voting, each model returns a probability for each class, the probabilities are averaged across models, and the class with the highest average probability is predicted (only recommended if the classifiers are well calibrated).
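A small sketch, again with invented numbers, shows how averaging probabilities can give a different answer from a majority vote on the same three models. One very confident model can outweigh two hesitant ones, which is why calibration matters.

```python
import numpy as np

# Hypothetical class probabilities for ONE sample from three classifiers
# (rows = classifiers, columns = classes 0 and 1); the values are invented.
probas = np.array([
    [0.90, 0.10],   # very confident in class 0
    [0.40, 0.60],   # leans towards class 1
    [0.45, 0.55],   # leans towards class 1
])

# Soft vote: average the probabilities, then take the most probable class.
average_proba = probas.mean(axis=0)          # approximately [0.583, 0.417]
soft_label = int(average_proba.argmax())     # 0

# Hard vote on the same models: each model votes for its own top class.
votes = probas.argmax(axis=1)                # [0, 1, 1]
hard_label = int(np.bincount(votes).argmax())  # 1

print("soft:", soft_label, "hard:", hard_label)
```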

In both cases, you can assign a weight to each estimator, so that the models you trust most count for more in the vote.
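As a minimal sketch of weighting, the `weights` parameter of `VotingClassifier` takes one value per estimator. The synthetic dataset from `make_classification` and the weights `[1, 2, 2]` are arbitrary assumptions chosen for illustration; they are not tuned and are unrelated to the diabetes data used later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, used only to keep the example self-contained.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

weighted_vclf = VotingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier(n_neighbors=3)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',
    weights=[1, 2, 2],  # illustrative: the forest and the regression count twice as much as KNN
)
weighted_vclf.fit(X, y)

# With soft voting, predict_proba returns the weighted average of the models' probabilities.
weighted_proba = weighted_vclf.predict_proba(X[:5])
print(weighted_proba)
```

In practice the weights are usually chosen by cross-validation rather than set by hand.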

Stacking
Stacking is an ensemble method in which several Machine Learning algorithms are trained in parallel on the same data, and their predictions are then used to train a new model that learns how best to combine the initial estimators.
This method proceeds as follows:

The first step involves specifying a list of L base algorithms and their corresponding parameters, as well as the meta-learning algorithm.
Each of the L algorithms is then trained on the training set, containing N observations.
Cross-validation is used to obtain predictions from each of the models for the N observations.
The meta-learning algorithm is then trained on these cross-validated predictions, which serve as its input features, with the original labels as its target.
The ensemble model ultimately consists of the set of L base algorithms and the meta-learning model and can be used to generate predictions on a new dataset.
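The steps above can be sketched by hand with `cross_val_predict`. This is a simplified illustration under stated assumptions (synthetic data, two arbitrary base models, a logistic regression as meta-learner), not a description of how `StackingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the first 200 rows play the role of the N training observations,
# the last 100 rows play the role of a new dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, y_fit, X_new = X[:200], y[:200], X[200:]

# Step 1: choose the L base algorithms (here L = 2) and the meta-learner.
base_models = [
    KNeighborsClassifier(n_neighbors=3),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]
meta_model = LogisticRegression()

# Step 2: cross-validated (out-of-fold) predictions for the N observations,
# so that no row is predicted by a model that saw it during training.
meta_features = np.column_stack([
    cross_val_predict(model, X_fit, y_fit, cv=5, method='predict_proba')[:, 1]
    for model in base_models
])

# Step 3: train the meta-learner on those predictions.
meta_model.fit(meta_features, y_fit)

# Step 4: refit each base model on all N observations for later use.
for model in base_models:
    model.fit(X_fit, y_fit)

# Step 5: predict on new data by chaining base models and meta-learner.
new_meta_features = np.column_stack(
    [model.predict_proba(X_new)[:, 1] for model in base_models]
)
stacked_pred = meta_model.predict(new_meta_features)
print(stacked_pred[:10])
```

Using out-of-fold predictions in step 2 is the key design choice: training the meta-learner on in-sample predictions would reward base models that overfit.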

In [19]:

from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.model_selection import train_test_split, KFold, cross_validate
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import os

import pandas as pd
import kagglehub

# Download the dataset; kagglehub returns the local directory that holds the files
path = kagglehub.dataset_download("mathchi/diabetes-data-set")

# Build the file path from the returned directory instead of a machine-specific path
df = pd.read_csv(os.path.join(path, 'diabetes.csv'))
df.head()
df.head()



Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [23]:
data = df.drop('Outcome', axis=1)
target = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=4)
clf1 = KNeighborsClassifier(n_neighbors=3)
clf2 = RandomForestClassifier(random_state=123)
clf3 = LogisticRegression(max_iter=1000)
vclf = VotingClassifier(estimators=[('knn', clf1), ('rf', clf2), ('lr', clf3)], voting='hard')
cv3 = KFold(n_splits=3, random_state=111, shuffle=True)

for clf, label in zip([clf1, clf2, clf3, vclf], ['KNN', 'Random Forest', 'Logistic Regression', 'Voting Classifier']):
    scores = cross_validate(clf, X_train, y_train, cv=cv3, scoring=['accuracy','f1'])
    print("[%s]: \n Accuracy: %0.2f (+/- %0.2f)" % (label, scores['test_accuracy'].mean(), scores['test_accuracy'].std()),
          "F1 score: %0.2f (+/- %0.2f)" % (scores['test_f1'].mean(), scores['test_f1'].std()))

[KNN]: 
 Accuracy: 0.71 (+/- 0.02) F1 score: 0.58 (+/- 0.02)
[Random Forest]: 
 Accuracy: 0.76 (+/- 0.03) F1 score: 0.63 (+/- 0.04)
[Logistic Regression]: 
 Accuracy: 0.76 (+/- 0.02) F1 score: 0.61 (+/- 0.05)
[Voting Classifier]: 
 Accuracy: 0.77 (+/- 0.00) F1 score: 0.64 (+/- 0.02)


In [26]:
from sklearn.model_selection import GridSearchCV

params = {
    'estimators': [[('knn', clf1), ('lr', clf3)], 
                  [('knn', clf1), ('rf', clf2)]]
}
grid = GridSearchCV(estimator=vclf, param_grid=params, cv=5)
grid = grid.fit(X_train, y_train)
print(grid.best_params_)

{'estimators': [('knn', KNeighborsClassifier(n_neighbors=3)), ('lr', LogisticRegression(max_iter=1000))]}


In [27]:
sclf = StackingClassifier(estimators=[('knn', clf1), ('rf', clf2), ('lr', clf3)], final_estimator=clf3)

scores = cross_validate(sclf, X_train, y_train, cv=cv3, scoring=['accuracy', 'f1'])
    
print("[StackingClassifier]: \n Accuracy: %0.2f (+/- %0.2f)\n" % (scores['test_accuracy'].mean(), scores['test_accuracy'].std()),
      "F1 score: %0.2f (+/- %0.2f)" % (scores['test_f1'].mean(), scores['test_f1'].std()))


[StackingClassifier]: 
 Accuracy: 0.76 (+/- 0.02)
 F1 score: 0.62 (+/- 0.04)


In [28]:
vclf.fit(X_train, y_train)
sclf.fit(X_train, y_train)

print("Acc :", vclf.score(X_test, y_test))
print("Acc :", sclf.score(X_test, y_test))

Acc : 0.8138528138528138
Acc : 0.8008658008658008


In [29]:
clf2.fit(X_train, y_train)

print("Acc :", clf2.score(X_test, y_test))

Acc : 0.7878787878787878


In this notebook, we saw two ensemble methods: the Voting Classifier and Stacking. Unlike Bagging or Boosting, these methods aim to combine several estimators that already perform well, in order to benefit from their respective strengths.

The Voting Classifier is a meta-classifier from scikit-learn which allows you to combine different Machine Learning models, whether similar or conceptually different. It can perform a 'hard' vote, where each model predicts a label and the final label is determined by the majority vote, or a 'soft' vote, where the probabilities predicted by each model are averaged.

Stacking is an ensemble method that trains multiple machine learning algorithms simultaneously, then uses their predictions to train a new model. This method involves specifying a list of base algorithms and a meta-learning algorithm, then training each algorithm on a dataset and using cross-validation to obtain the predictions. Finally, the meta-learning model is trained on these predictions.

With very large datasets, or when computation time must stay short, these methods are not necessarily the best choice. Some algorithms implement boosting or bagging in an optimized way to achieve solid performance at a lower computational cost. This is the case, for example, of XGBoost, which you can discover in the following notebook!