# <p style="background-color:green;font-family:newtimeroman;font-size:200%;color:white;text-align:center;border-radius:20px 20px;"><b>Stacking Method (Titanic DataSet)</b></p>
![](https://www.techproeducation.com/logo/headerlogo.svg)


# Classification with Stacking Methods in Machine Learning

In this notebook, we cover **Stacking**, an ensemble method used to improve model performance in machine learning. We also walk through a hands-on example on a well-known dataset to show how stacking is applied in practice.

## Table of Contents

1. [Introduction](#Introduction)
2. [What is Stacking?](#What-is-Stacking?)
3. [Loading the Dataset and Exploratory Data Analysis](#Loading-the-Dataset-and-Exploratory-Data-Analysis)
4. [Data Preprocessing](#Data-Preprocessing)
5. [Training Base Learners](#Training-Base-Learners)
6. [Training the Meta Learner](#Training-the-Meta-Learner)
7. [Model Performance Evaluation](#Model-Performance-Evaluation)
8. [Hyperparameter Tuning and Performance Evaluation using GridSearchCV](#Hyperparameter-Tuning-and-Performance-Evaluation-using-GridSearchCV)
9. [Comparison of Multiple Classification Algorithms](#Comparison-of-Multiple-Classification-Algorithms)
10. [Conclusion](#Conclusion)

## Introduction

Combining several models to obtain a stronger model that generalizes better is a common approach in machine learning. This family of techniques is known as **ensemble methods**, and **Stacking** is one of the most popular among them.

In this notebook, we examine stacking in detail and apply it to a classification problem on the **Titanic** dataset.

## What is Stacking?

**Stacking** is an ensemble method that combines multiple machine learning models to improve predictive performance. Instead of relying on a single model, it trains several models and learns how to combine their predictions into a final one.

### The Logic of Stacking:
- Different machine learning algorithms can produce different predictions on the same data.
- Every algorithm has strengths and weaknesses. Stacking aims to exploit the strengths of each model.
- First, several base models are trained; their predictions are then used as input by a **meta model**.
- The meta model combines the base models' predictions to produce the final prediction.

### Stacking Steps:
1. **Base Models:** In the first layer, several machine learning algorithms are trained on the data.
2. **Meta Model:** In the second layer, a meta model is trained that takes the base models' predictions as input and produces the final prediction from them.
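
The two layers can be sketched by hand to make the data flow explicit. This is a minimal illustration on synthetic data, not the Titanic pipeline used later; the choice of base models, the 5-fold setting, and all variable names are assumptions made for the example. The base models' out-of-fold probabilities become the training features of the meta model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only, not the Titanic set)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Layer 1: out-of-fold probabilities, so the meta model never sees a prediction
# that a base model made on a row it was trained on
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the full training split for use at prediction time
for m in base_models:
    m.fit(X_tr, y_tr)
meta_test = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base_models])

# Layer 2: the meta model learns how to combine the base predictions
meta_model = LogisticRegression()
meta_model.fit(meta_train, y_tr)
stacked_accuracy = meta_model.score(meta_test, y_te)
print(f"Stacked accuracy: {stacked_accuracy:.3f}")
```

scikit-learn's `StackingClassifier`, used later in this notebook, automates the same pattern.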

### Advantages of Stacking:
- **Generalization:** Can reduce the overfitting risk that comes with relying on a single model.
- **Performance:** Can reach higher accuracy by combining the strengths of different models.
- **Flexibility:** Different model types (tree-based models, linear models, etc.) can be used together.

### Popular Stacking Applications:
- **Kaggle Competitions:** Stacking is especially popular among Kaggle competitors.
- **Large Datasets and Hard Problems:** Stacking can be effective on complex datasets and difficult classification or regression problems.

## Loading the Dataset and Exploratory Data Analysis

The **Titanic** dataset consists of three files:

- **train.csv:** used to train the model.
- **test.csv:** used to generate predictions.
- **gender_submission.csv:** shows the format in which predictions should be submitted.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

In [5]:
train_df = pd.read_csv("train.csv")
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Data Preprocessing

In this section, we handle missing values and prepare the dataset for machine learning models.

In [9]:
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [11]:
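# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to train_df
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])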
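
The fill values above are computed on the whole training file before it is split into training and validation parts. A stricter variant learns the fill values from the training rows only and reuses them on the validation rows. The sketch below shows that idea on a tiny made-up frame; the values, the split point, and the names `toy`, `toy_train`, and `toy_val` are hypothetical and not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "Embarked": ["S", "C", None, "S", "Q", "S"],
})

# Treat the first four rows as the training split and the rest as validation
toy_train, toy_val = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the fill values from the training rows only
age_median = toy_train["Age"].median()
embarked_mode = toy_train["Embarked"].mode()[0]

# Apply the same values to both splits
for part in (toy_train, toy_val):
    part["Age"] = part["Age"].fillna(age_median)
    part["Embarked"] = part["Embarked"].fillna(embarked_mode)

print(age_median, embarked_mode)
print(toy_val)
```

On a dataset of this size the difference is likely small, but the pattern keeps validation rows from influencing preprocessing.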

In [19]:
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

In [29]:
train_df.drop(columns="Cabin",inplace = True)

In [33]:
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


In [39]:
X = train_df.drop(columns = ["PassengerId", "Survived","Name","Ticket"])
y = train_df["Survived"] 

In [41]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [51]:
categorical_features = ["Sex","Embarked"]

In [53]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)
    ],
    remainder=StandardScaler()  # Apply StandardScaler to the non-categorical columns
)

In [55]:
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

In [57]:
X_train = pipeline.fit_transform(X_train)
X_val = pipeline.transform(X_val)

In [59]:
print("Transformed Training Data:")
print(X_train)
print("Transformed Validation Data:")
print(X_val)

Transformed Training Data:
[[ 1.          0.          1.         ... -0.47072241 -0.47934164
  -0.07868358]
 [ 1.          0.          1.         ... -0.47072241 -0.47934164
  -0.37714494]
 [ 1.          0.          1.         ... -0.47072241 -0.47934164
  -0.47486697]
 ...
 [ 1.          0.          1.         ...  1.23056874 -0.47934164
  -0.35580399]
 [ 0.          0.          1.         ...  0.37992316  2.04874166
   1.68320121]
 [ 1.          0.          1.         ... -0.47072241  0.78470001
   0.86074761]]
Transformed Validation Data:
[[ 1.          0.          0.         ...  0.37992316  0.78470001
  -0.33390078]
 [ 1.          0.          1.         ... -0.47072241 -0.47934164
  -0.42528387]
 [ 1.          0.          1.         ... -0.47072241 -0.47934164
  -0.47486697]
 ...
 [ 0.          0.          1.         ...  0.37992316  5.8408666
  -0.02308312]
 [ 0.          0.          1.         ... -0.47072241 -0.47934164
  -0.42528387]
 [ 0.          0.          1.         ...  

In [61]:
train_df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

## Training Base Learners

In this section, we train several base models: Logistic Regression, Random Forest, and K-Nearest Neighbors (V1), followed by CatBoost, XGBoost, and LightGBM (V2).

In [63]:
# V1 Base Learners - Logistic Regression, RandomForest, KNN
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [65]:
# Define and train V1 base learners
log_clf = LogisticRegression()
log_clf.fit(X_train, y_train)

In [67]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

In [69]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

In [71]:
# V2 Base Learners - CatBoost, XGBoost, LightGBM
from catboost import CatBoostClassifier
import xgboost as xgb
import lightgbm as lgb

In [73]:
# Define and train V2 base learners
catboost_clf = CatBoostClassifier(verbose=0)
catboost_clf.fit(X_train, y_train)

<catboost.core.CatBoostClassifier at 0x26cf96e39b0>

In [74]:
xgboost_clf = xgb.XGBClassifier()
xgboost_clf.fit(X_train, y_train)

In [75]:
lightgbm_clf = lgb.LGBMClassifier()
lightgbm_clf.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 268, number of negative: 444
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001281 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 198
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376404 -> initscore=-0.504838
[LightGBM] [Info] Start training from score -0.504838


## Training the Meta Learner

In this section, we train the meta learner using a stacking classifier. The meta learner makes the final predictions from the outputs of the base models.

In [79]:
# Stacking Classifier for V1
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier

In [81]:
# Define the stacking classifier for V1
estimators_v1 = [
    ('lr', log_clf),
    ('rf', rf_clf),
    ('knn', knn_clf)
]

In [83]:
stacking_clf_v1 = StackingClassifier(
    estimators=estimators_v1,
    final_estimator=GradientBoostingClassifier(),
    cv=5
)

In [85]:
stacking_clf_v1.fit(X_train,y_train)
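
One detail worth knowing: with an integer `cv`, `StackingClassifier.fit` clones the base learners it is given and refits the clones, so the models fitted earlier are not reused. The sketch below checks this on synthetic data; the data and the names `X_demo`, `pre_fitted_knn`, and `stack` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Fit a base learner up front, as the notebook does
pre_fitted_knn = KNeighborsClassifier().fit(X_demo, y_demo)

stack = StackingClassifier(
    estimators=[("knn", pre_fitted_knn), ("lr", LogisticRegression())],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_demo, y_demo)

# The fitted base learners are stored separately from the objects passed in
refit_knn = stack.named_estimators_["knn"]
print("Same object as the pre-fitted model:", refit_knn is pre_fitted_knn)

# Meta-features seen by the final estimator: one probability column per base model
meta_features = stack.transform(X_demo)
print("Meta-feature shape:", meta_features.shape)
```

Fitting the base learners separately is therefore useful for inspecting them on their own, but it is not required before building the stack.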

In [87]:
# Stacking Classifier for V2
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
# Define the stacking classifier for V2
estimators_v2 = [
    ('catboost', catboost_clf),
    ('xgboost', xgboost_clf),
    ('lightgbm', lightgbm_clf)
]

In [89]:
stacking_clf_v2 = StackingClassifier(
    estimators=estimators_v2,
    final_estimator=GradientBoostingClassifier(),
    cv=3
)

In [91]:
# Train V2 stacking classifier
stacking_clf_v2.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 268, number of negative: 444
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000178 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 198
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376404 -> initscore=-0.504838
[LightGBM] [Info] Start training from score -0.504838
[LightGBM] [Info] Number of positive: 178, number of negative: 296
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000167 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 165
[LightGBM] [Info] Number of data points in the train set: 474, number of used features: 8
[LightGBM] [Info] [binary:BoostFro

## Model Performance Evaluation

In this section, we evaluate the stacking models on the validation set using accuracy, precision, recall, and F1 score.

In [95]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Predict and evaluate V1 on validation set
y_val_pred_v1 = stacking_clf_v1.predict(X_val)
# Evaluation metrics for V1
accuracy_v1 = accuracy_score(y_val, y_val_pred_v1)
precision_v1 = precision_score(y_val, y_val_pred_v1)
recall_v1 = recall_score(y_val, y_val_pred_v1)
f1_v1 = f1_score(y_val, y_val_pred_v1)

In [99]:
# Print V1 metrics
print("V1 Metrics:")
print(f"Accuracy: {accuracy_v1:.4f}")
print(f"Precision: {precision_v1:.4f}")
print(f"Recall: {recall_v1:.4f}")
print(f"F1 Score: {f1_v1:.4f}")


V1 Metrics:
Accuracy: 0.8380
Precision: 0.8358
Recall: 0.7568
F1 Score: 0.7943
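
To see how the four numbers relate, they can be recomputed from confusion-matrix counts. The counts below are not printed anywhere in this notebook; they are inferred from the V1 metrics above, assuming 179 validation rows (20% of 891) of which 74 are survivors.

```python
# Confusion-matrix counts inferred from the V1 metrics printed above
# (assumption: 179 validation rows, 74 of them positive)
tp, fp, fn, tn = 56, 11, 18, 94

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all rows predicted correctly
precision = tp / (tp + fp)                  # share of predicted survivors who survived
recall = tp / (tp + fn)                     # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

Under these assumed counts, the model misses 18 of 74 survivors, which is why recall is lower than precision.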


In [101]:
# Predict and evaluate V2 on validation set
y_val_pred_v2 = stacking_clf_v2.predict(X_val)

In [103]:
# Evaluation metrics for V2
accuracy_v2 = accuracy_score(y_val, y_val_pred_v2)
precision_v2 = precision_score(y_val, y_val_pred_v2)
recall_v2 = recall_score(y_val, y_val_pred_v2)
f1_v2 = f1_score(y_val, y_val_pred_v2)

In [105]:
# Print V2 metrics
print("V2 Metrics:")
print(f"Accuracy: {accuracy_v2:.4f}")
print(f"Precision: {precision_v2:.4f}")
print(f"Recall: {recall_v2:.4f}")
print(f"F1 Score: {f1_v2:.4f}")

V2 Metrics:
Accuracy: 0.7989
Precision: 0.8167
Recall: 0.6622
F1 Score: 0.7313


In [107]:
print("Comparison between V1 and V2:")
print(f"Accuracy V1: {accuracy_v1:.4f} | V2: {accuracy_v2:.4f}")
print(f"Precision V1: {precision_v1:.4f} | V2: {precision_v2:.4f}")
print(f"Recall V1: {recall_v1:.4f} | V2: {recall_v2:.4f}")
print(f"F1 Score V1: {f1_v1:.4f} | V2: {f1_v2:.4f}")

Comparison between V1 and V2:
Accuracy V1: 0.8380 | V2: 0.7989
Precision V1: 0.8358 | V2: 0.8167
Recall V1: 0.7568 | V2: 0.6622
F1 Score V1: 0.7943 | V2: 0.7313


## Hyperparameter Tuning and Performance Evaluation using GridSearchCV

### V1

In [111]:
from sklearn.model_selection import GridSearchCV
# Defining V1 base learners
base_learners_v1 = [
    ('lr', LogisticRegression()),
    ('rf', RandomForestClassifier()),
    ('knn', KNeighborsClassifier())
]

In [123]:
# Defining the stacking classifier for V1
stacking_clf_v1 = StackingClassifier(
    estimators=base_learners_v1,
    final_estimator=GradientBoostingClassifier(),
    cv=3
)

In [125]:
# Defining the parameter grid for GridSearchCV for V1
param_grid_v1 = {
    'rf__n_estimators': [50, 100],  # Random Forest hyperparameters
    'rf__max_depth': [5, 10, 15],
    'knn__n_neighbors': [3, 5, 7],  # KNN hyperparameters
    'final_estimator__learning_rate': [0.01, 0.1, 0.2],  # Gradient Boosting hyperparameters
    'final_estimator__n_estimators': [50, 100]
}

In [127]:
# Using GridSearchCV for hyperparameter optimization for V1
grid_search_v1 = GridSearchCV(estimator=stacking_clf_v1, param_grid=param_grid_v1, cv=3, n_jobs=-1, verbose=0)

In [129]:
# Fit the grid search for V1
grid_search_v1.fit(X_train, y_train)
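
Grid search cost grows multiplicatively with each added hyperparameter, so it helps to count the fits before running it. The sketch below copies the two parameter grids from this notebook and counts candidates and outer fits; the helper `count_fits` is a hypothetical name introduced here.

```python
from math import prod

# Parameter grids copied from this notebook
param_grid_v1 = {
    'rf__n_estimators': [50, 100],
    'rf__max_depth': [5, 10, 15],
    'knn__n_neighbors': [3, 5, 7],
    'final_estimator__learning_rate': [0.01, 0.1, 0.2],
    'final_estimator__n_estimators': [50, 100],
}
param_grid_v2 = {
    'xgboost__n_estimators': [50, 100],
    'xgboost__max_depth': [5, 10],
    'lightgbm__n_estimators': [50, 100],
    'final_estimator__learning_rate': [0.01, 0.1],
    'final_estimator__n_estimators': [50, 100],
}

def count_fits(grid, folds):
    """Return (number of candidates, number of outer cross-validation fits)."""
    candidates = prod(len(values) for values in grid.values())
    return candidates, candidates * folds

outer_folds = 3  # cv=3 passed to GridSearchCV
v1_counts = count_fits(param_grid_v1, outer_folds)
v2_counts = count_fits(param_grid_v2, outer_folds)
print("V1:", v1_counts)
print("V2:", v2_counts)
```

The V2 result matches the `Fitting 3 folds for each of 32 candidates, totalling 96 fits` line logged by the V2 grid search. Each of those fits trains a full stacking classifier, which itself runs an internal 3-fold cross-validation over the base learners, so the real amount of training is several times higher.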

In [130]:
# Get the best parameters and score for V1
print("V1 Best parameters found: ", grid_search_v1.best_params_)
print("V1 Best cross-validation score: {:.4f}".format(grid_search_v1.best_score_))

# Evaluate the optimized model on the validation set for V1
best_model_v1 = grid_search_v1.best_estimator_
y_val_pred_v1 = best_model_v1.predict(X_val)

# Performance evaluation for V1
accuracy_v1 = accuracy_score(y_val, y_val_pred_v1)
precision_v1 = precision_score(y_val, y_val_pred_v1)
recall_v1 = recall_score(y_val, y_val_pred_v1)
f1_v1 = f1_score(y_val, y_val_pred_v1)

# Print evaluation metrics
print("V1 Metrics:")
print(f"Accuracy: {accuracy_v1:.4f}")
print(f"Precision: {precision_v1:.4f}")
print(f"Recall: {recall_v1:.4f}")
print(f"F1 Score: {f1_v1:.4f}")

V1 Best parameters found:  {'final_estimator__learning_rate': 0.01, 'final_estimator__n_estimators': 100, 'knn__n_neighbors': 3, 'rf__max_depth': 10, 'rf__n_estimators': 50}
V1 Best cross-validation score: 0.8202
V1 Metrics:
Accuracy: 0.8212
Precision: 0.8500
Recall: 0.6892
F1 Score: 0.7612


### V2

In [133]:
# Defining V2 base learners
base_learners_v2 = [
    ('catboost', CatBoostClassifier(verbose=0)),
    ('xgboost', xgb.XGBClassifier()),
    ('lightgbm', lgb.LGBMClassifier())
]

In [134]:
# Defining the stacking classifier for V2
stacking_clf_v2 = StackingClassifier(
    estimators=base_learners_v2,
    final_estimator=GradientBoostingClassifier(),
    cv=3
)


In [135]:
# Defining the parameter grid for GridSearchCV for V2
param_grid_v2 = {
    'xgboost__n_estimators': [50, 100],  # XGBoost hyperparameters
    'xgboost__max_depth': [5, 10],
    'lightgbm__n_estimators': [50, 100],  # LightGBM hyperparameters
    'final_estimator__learning_rate': [0.01, 0.1],  # Gradient Boosting hyperparameters
    'final_estimator__n_estimators': [50, 100]
}

In [136]:
# Using GridSearchCV for hyperparameter optimization for V2
grid_search_v2 = GridSearchCV(estimator=stacking_clf_v2, param_grid=param_grid_v2, cv=3, n_jobs=-1, verbose=2)

In [137]:
# Fit the grid search for V2
grid_search_v2.fit(X_train, y_train)

Fitting 3 folds for each of 32 candidates, totalling 96 fits
[LightGBM] [Info] Number of positive: 268, number of negative: 444
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000178 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 198
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376404 -> initscore=-0.504838
[LightGBM] [Info] Start training from score -0.504838
[LightGBM] [Info] Number of positive: 178, number of negative: 296
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000132 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 165
[LightGBM] [Info] Number of data points in the train set: 474, 

In [138]:
# Get the best parameters and score for V2
print("V2 Best parameters found: ", grid_search_v2.best_params_)
print("V2 Best cross-validation score: {:.4f}".format(grid_search_v2.best_score_))

V2 Best parameters found:  {'final_estimator__learning_rate': 0.01, 'final_estimator__n_estimators': 50, 'lightgbm__n_estimators': 50, 'xgboost__max_depth': 10, 'xgboost__n_estimators': 50}
V2 Best cross-validation score: 0.8174


In [139]:
# Evaluate the optimized model on the validation set for V2
best_model_v2 = grid_search_v2.best_estimator_
y_val_pred_v2 = best_model_v2.predict(X_val)

In [140]:
# Performance evaluation for V2
accuracy_v2 = accuracy_score(y_val, y_val_pred_v2)
precision_v2 = precision_score(y_val, y_val_pred_v2)
recall_v2 = recall_score(y_val, y_val_pred_v2)
f1_v2 = f1_score(y_val, y_val_pred_v2)

In [141]:
# Print evaluation metrics
print("V2 Metrics:")
print(f"Accuracy: {accuracy_v2:.4f}")
print(f"Precision: {precision_v2:.4f}")
print(f"Recall: {recall_v2:.4f}")
print(f"F1 Score: {f1_v2:.4f}")

V2 Metrics:
Accuracy: 0.8101
Precision: 0.8846
Recall: 0.6216
F1 Score: 0.7302


In [142]:
print("Comparison between V1 and V2 with hyperparameter tuning:")
print(f"Accuracy V1: {accuracy_v1:.4f} | V2: {accuracy_v2:.4f}")
print(f"Precision V1: {precision_v1:.4f} | V2: {precision_v2:.4f}")
print(f"Recall V1: {recall_v1:.4f} | V2: {recall_v2:.4f}")
print(f"F1 Score V1: {f1_v1:.4f} | V2: {f1_v2:.4f}")

Comparison between V1 and V2 with hyperparameter tuning:
Accuracy V1: 0.8212 | V2: 0.8101
Precision V1: 0.8500 | V2: 0.8846
Recall V1: 0.6892 | V2: 0.6216
F1 Score V1: 0.7612 | V2: 0.7302


## Comparison of Multiple Classification Algorithms

In [161]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [163]:
# Define all models
models = {
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'LightGBM': lgb.LGBMClassifier(),
    'XGBoost': xgb.XGBClassifier(),
    'CatBoost': CatBoostClassifier(verbose=0)
}

In [165]:
# Empty list to store performance results for other models
performance_data = []
# Calculate performance metrics for other models
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    # Make predictions on the validation set
    y_val_pred = model.predict(X_val)
    # Calculate performance metrics
    accuracy = accuracy_score(y_val, y_val_pred)
    precision = precision_score(y_val, y_val_pred)
    recall = recall_score(y_val, y_val_pred)
    f1 = f1_score(y_val, y_val_pred)
    # Append the results to the list
    performance_data.append([name, accuracy, precision, recall, f1])


[LightGBM] [Info] Number of positive: 268, number of negative: 444
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000370 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 198
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.376404 -> initscore=-0.504838
[LightGBM] [Info] Start training from score -0.504838

In [167]:
# Collect the tuned stacking results (V1 and V2) as extra rows.
# Building a new list instead of appending to performance_data keeps this cell
# safe to re-run without duplicating the stacking rows.
stacking_rows = [
    ['Stacking (V1)', accuracy_v1, precision_v1, recall_v1, f1_v1],
    ['Stacking (V2)', accuracy_v2, precision_v2, recall_v2, f1_v2],
]
# Convert the combined results to a DataFrame
df = pd.DataFrame(
    performance_data + stacking_rows,
    columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score']
)
# Display the DataFrame
print(df)

                  Model  Accuracy  Precision    Recall  F1 Score
0   Logistic Regression  0.810056   0.785714  0.743243  0.763889
1                   KNN  0.804469   0.782609  0.729730  0.755245
2                   SVM  0.815642   0.805970  0.729730  0.765957
3         Decision Tree  0.782123   0.733333  0.743243  0.738255
4         Random Forest  0.821229   0.783784  0.783784  0.783784
5              AdaBoost  0.804469   0.767123  0.756757  0.761905
6     Gradient Boosting  0.804469   0.819672  0.675676  0.740741
7              LightGBM  0.826816   0.794521  0.783784  0.789116
8               XGBoost  0.821229   0.800000  0.756757  0.777778
9              CatBoost  0.821229   0.850000  0.689189  0.761194
10        Stacking (V1)  0.821229   0.850000  0.689189  0.761194
11        Stacking (V2)  0.810056   0.884615  0.621622  0.730159
