# Water Quality and Potability Classification
  
#### [Dataset URL](https://www.kaggle.com/datasets/uom190346a/water-quality-and-potability)
  
## Dataset description

This dataset contains water quality measurements and assessments of whether the water is fit for human consumption (potability). Each row represents one water sample with its measured features, and the "Potability" column indicates whether the sample is safe to drink. The main goal is to assess and predict potability from the water quality features. The dataset can be used to evaluate the safety and suitability of water sources for human consumption, to make informed decisions about water treatment, and to ensure compliance with water quality standards.

## Feature description

- pH: The pH level of the water.
- Hardness: Water hardness, a measure of mineral content.
- Solids: Total dissolved solids in the water.
- Chloramines: Concentration of chloramines in the water.
- Sulfate: Concentration of sulfates in the water.
- Conductivity: Electrical conductivity of the water.
- Organic_carbon: Organic carbon content of the water.
- Trihalomethanes: Concentration of trihalomethanes in the water.
- Turbidity: Turbidity level, a measure of water clarity.
- Potability: Target variable; indicates whether the water is fit for consumption, taking the values 1 (fit for consumption, "potable") and 0 (unfit for consumption, "not potable").

## Dataset parameters

- Number of records: 3276
- Number of features: 9
- Missing data: Yes (columns: pH, Sulfate, and Trihalomethanes)
- Outliers: Yes (approx. 1.22% of the whole dataset)
- Problem type: Classification (Potability - No (0), Yes (1))

## Class distribution

| Class | Number of records | Percentage |
|-------|-------------------|------------|
| 0     | 1998              | 60.99%     |
| 1     | 1278              | 39.01%     |

## Importing libraries

In [1]:
import pickle
import random
import time

import xgboost as xgb
import numpy as np
import optuna
from deap import base, creator, tools, algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV

## Loading the variables

The variables are loaded from a file generated by the DataAnalysis.ipynb notebook using the pickle library, which serializes and deserializes data. The data variant selected is the one that achieved the best results in stage 2, building and training the models.

In [2]:
with open('data_dump/normalizedStdInterpolateVars.pkl', 'rb') as f:
    normalized_std_interpolate = pickle.load(f)
    scaler_std_interpolate = pickle.load(f)
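
The loading cell reads two objects from a single file, which only works when they are read in the order they were written. The sketch below illustrates that round trip, assuming the file was produced with two consecutive `pickle.dump` calls (the source does not show the writing side). The file name and the stand-in objects are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the DataFrame and the scaler dumped by the analysis notebook.
data = {"ph": [7.0, 6.5], "Potability": [1, 0]}
scaler = {"mean": 6.75, "scale": 0.25}

path = os.path.join(tempfile.mkdtemp(), "vars.pkl")

# Writer side: two consecutive dump calls store two objects in one file.
with open(path, "wb") as f:
    pickle.dump(data, f)
    pickle.dump(scaler, f)

# Reader side: the load calls must follow the same order as the dumps.
with open(path, "rb") as f:
    loaded_data = pickle.load(f)
    loaded_scaler = pickle.load(f)
```

Swapping the two `pickle.load` calls would silently assign the scaler to the data variable, so the order is part of the file's implicit contract.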

## Setting the random seed

In [None]:
random_seed = 42

## Functions for splitting the dataset

Two functions were created for splitting the data. `split_df_train_test` splits the data into two sets: test and training. The split preserves the class proportions (it is stratified), which prevents a situation in which only one class ends up in the training set.

In [3]:
def split_df_train_test(data, test_size, seed):
    np.random.seed(seed)

    unique_labels = data['Potability'].unique()
    label_counts = data['Potability'].value_counts()

    test_indices = []

    for label in unique_labels:
        num_label_samples = label_counts[label]
        num_test_samples = int(test_size * num_label_samples)
        label_indices = data.index[data['Potability'] == label].tolist()
        label_test_indices = np.random.choice(label_indices, size=num_test_samples, replace=False)
        test_indices.extend(label_test_indices)

    train_indices = np.setdiff1d(data.index, test_indices)

    train_set = data.loc[train_indices]
    test_set = data.loc[test_indices]

    return train_set, test_set
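
For comparison, a similar stratified split can be obtained with scikit-learn's `train_test_split` and its `stratify` argument. This is a sketch on synthetic data, not the notebook's method; the `feature` column and the 60/40 class ratio are assumptions chosen to resemble the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 60% class 0, 40% class 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Potability": np.repeat([0, 1], [600, 400]),
})

# stratify keeps the class proportions equal in both parts of the split.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Potability"], random_state=123
)

# Share of class 1 in each part; both should stay close to 0.4.
train_share = train_df["Potability"].mean()
test_share = test_df["Potability"].mean()
```

Unlike the hand-written function, this variant uses a local `random_state` instead of reseeding NumPy's global generator.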

### Function for computing the F1-score metric

In [4]:
def calculate_f1_score(y_true, y_pred):
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    true_negatives = np.sum((y_true == 0) & (y_pred == 0))
    false_positives = np.sum((y_true == 0) & (y_pred == 1))
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))

    precision_positives = true_positives / (true_positives + false_positives)
    recall_positives = true_positives / (true_positives + false_negatives)
    f1_score_positives = 2 * (precision_positives * recall_positives) / (precision_positives + recall_positives)

    precision_negatives = true_negatives / (true_negatives + false_negatives)
    recall_negatives = true_negatives / (true_negatives + false_positives)
    f1_score_negatives = 2 * (precision_negatives * recall_negatives) / (precision_negatives + recall_negatives)

    f1_score = (f1_score_positives + f1_score_negatives) / 2
    return f1_score
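
The function above averages the F1 of both classes (macro averaging) and divides by zero whenever a class is never predicted, which yields `nan`. A minimal zero-safe sketch follows. It assumes that an undefined per-class F1 should count as 0, which is a convention rather than something the source specifies, and `safe_macro_f1` is a hypothetical name.

```python
import numpy as np
from sklearn.metrics import f1_score


def safe_macro_f1(y_true, y_pred):
    """Macro-averaged F1 for labels 0/1; an undefined per-class F1 counts as 0."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for label in (1, 0):
        tp = np.sum((y_true == label) & (y_pred == label))
        fp = np.sum((y_true != label) & (y_pred == label))
        fn = np.sum((y_true == label) & (y_pred != label))
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(per_class))


y_true = np.array([0, 0, 1, 1, 1, 0])
all_negative = np.zeros_like(y_true)  # the positive class is never predicted

score = safe_macro_f1(y_true, all_negative)
reference = f1_score(y_true, all_negative, average="macro", zero_division=0)
```

On this example the result agrees with scikit-learn's `f1_score(..., average="macro", zero_division=0)`. Note that the scikit-learn scoring string `'f1_macro'` corresponds to this averaging, while `'f1'` scores only the positive class.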

### Data split

Splitting the data with the "split_df_train_test" function.

In [5]:
train_std_interpolate, test_std_interpolate = split_df_train_test(normalized_std_interpolate, 0.2, 123)

### Class distribution after the split - test/train

In [6]:
train_class_counts = train_std_interpolate['Potability'].value_counts(normalize=True) * 100
test_class_counts = test_std_interpolate['Potability'].value_counts(normalize=True) * 100

print("Train set:")
print(train_class_counts)
print("\nTest set:")
print(test_class_counts)

Train set:
Potability
0    60.983982
1    39.016018
Name: proportion, dtype: float64

Test set:
Potability
0    61.009174
1    38.990826
Name: proportion, dtype: float64


## Random Forest - Scikit-Learn

In [7]:
class RandomForestClassifierWrapper:

    def __init__(self, train_set, test_set):
        self.train_set = train_set.iloc[:, :-1]
        self.test_set = test_set.iloc[:, :-1]
        self.train_label = train_set.iloc[:, -1]
        self.test_label = test_set.iloc[:, -1]
        self.model = None
        self.history = None
        self.test_pred = None
        self.hof = None
        self.best_optuna_trial = None
        self.best_score = None # TODO - wire up best_score and best_score_type so they are easy to access from the whole object.
        self.best_score_type = None
        self._create_model()

    def _create_model(self):
        self.model = RandomForestClassifier(random_state=random_seed)

    # ----------------- DEAP -----------------

    def deap_objective(self, individual, optimize_f1_score):
        n_estimators, max_depth, min_samples_split, min_samples_leaf, bootstrap, criterion = individual
        n_estimators = int(n_estimators)
        max_depth = int(max_depth) if max_depth > 0 else None
        min_samples_split = int(min_samples_split) if int(min_samples_split) > 1 else 2
        min_samples_leaf = int(min_samples_leaf) if int(min_samples_leaf) > 0 else 1
        bootstrap = bool(bootstrap)
        criterion = 'gini' if criterion < 0.5 else 'entropy'
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                       min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,
                                       bootstrap=bootstrap, criterion=criterion, random_state=random_seed)
        model.fit(self.train_set, self.train_label)
        predictions = model.predict(self.test_set)

        if optimize_f1_score:
            score = calculate_f1_score(self.test_label, predictions)
        else:
            score = accuracy_score(self.test_label, predictions)

        return score,

    def train_model_with_ga(self, optimize_f1_score=False):
        if not hasattr(creator, 'FitnessMax'):
            creator.create("FitnessMax", base.Fitness, weights=(1.0,))
        if not hasattr(creator, 'Individual'):
            creator.create("Individual", list, fitness=creator.FitnessMax)

        toolbox = base.Toolbox()

        toolbox.register("attr_n_estimators", random.randint, 10, 200)
        toolbox.register("attr_max_depth", random.randint, 1, 50)
        toolbox.register("attr_min_samples_split", random.randint, 2, 10)
        toolbox.register("attr_min_samples_leaf", random.randint, 1, 10)
        toolbox.register("attr_bootstrap", random.randint, 0, 1)
        toolbox.register("attr_criterion", random.uniform, 0, 1)

        toolbox.register("individual", tools.initCycle, creator.Individual,
                         (toolbox.attr_n_estimators, toolbox.attr_max_depth, toolbox.attr_min_samples_split,
                          toolbox.attr_min_samples_leaf, toolbox.attr_bootstrap, toolbox.attr_criterion), n=1)

        toolbox.register("population", tools.initRepeat, list, toolbox.individual)

        toolbox.register("evaluate", self.deap_objective, optimize_f1_score=optimize_f1_score)
        toolbox.register("mate", tools.cxTwoPoint)
        toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=1, indpb=0.2)
        toolbox.register("select", tools.selTournament, tournsize=5)

        pop = toolbox.population(n=100)

        self.hof = tools.HallOfFame(1)
        stats = tools.Statistics(lambda ind: ind.fitness.values)
        stats.register("avg", np.mean)
        stats.register("min", np.min)
        stats.register("max", np.max)

        pop, logbook = algorithms.eaMuPlusLambda(pop, toolbox, mu=30, lambda_=50, cxpb=0.5, mutpb=0.2,
                                                 ngen=5,
                                                 stats=stats, halloffame=self.hof, verbose=True)

        best_individual = tools.selBest(pop, 1)[0]

        self.model = RandomForestClassifier(n_estimators=int(best_individual[0]),
                                            max_depth=int(best_individual[1]) if best_individual[1] > 0 else None,
                                            min_samples_split=int(best_individual[2] if int(best_individual[2]) > 1 else 2),
                                            min_samples_leaf=int(best_individual[3] if int(best_individual[3]) > 0 else 1),
                                            bootstrap=bool(best_individual[4]),
                                            criterion='gini' if best_individual[5] < 0.5 else 'entropy',
                                            random_state=random_seed)
        self.model.fit(self.train_set, self.train_label)

    def get_deap_best_individual(self):
        return self.hof[0]

    # ----------------- Optuna -----------------

    def optuna_objective(self, trial, optimize_f1_score):
        n_estimators = trial.suggest_int("n_estimators", 10, 200)
        max_depth = trial.suggest_int("max_depth", 1, 50)
        min_samples_split = trial.suggest_int("min_samples_split", 2, 10)
        min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 10)
        bootstrap = trial.suggest_categorical("bootstrap", [True, False])
        criterion = trial.suggest_categorical("criterion", ['gini', 'entropy'])

        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                       min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,
                                       bootstrap=bootstrap, criterion=criterion, random_state=random_seed)
        model.fit(self.train_set, self.train_label)
        predictions = model.predict(self.test_set)

        if optimize_f1_score:
            score = calculate_f1_score(self.test_label, predictions)
        else:
            score = accuracy_score(self.test_label, predictions)

        return score

    def train_model_with_optuna(self, optimize_f1_score=False, n_trials=100):
        study = optuna.create_study(direction="maximize")
        study.optimize(lambda trial: self.optuna_objective(trial, optimize_f1_score), n_jobs=-1, n_trials=n_trials)

        self.best_optuna_trial = study.best_trial

        self.model = RandomForestClassifier(n_estimators=self.best_optuna_trial.params["n_estimators"],
                                            max_depth=self.best_optuna_trial.params["max_depth"],
                                            min_samples_split=self.best_optuna_trial.params["min_samples_split"],
                                            min_samples_leaf=self.best_optuna_trial.params["min_samples_leaf"],
                                            bootstrap=self.best_optuna_trial.params["bootstrap"],
                                            criterion=self.best_optuna_trial.params["criterion"],
                                            random_state=random_seed)
        self.model.fit(self.train_set, self.train_label)

    def get_optuna_best_result(self):
        return self.best_optuna_trial
    
    # ----------------- RandomizedSearchCV -----------------
    
    def train_model_CV(self, optimize_f1_score=False):
        param_distributions = {
            "n_estimators": list(range(10, 201)),
            "max_depth": [None] + list(range(1, 51)),
            "min_samples_split": list(range(2, 11)),
            "min_samples_leaf": list(range(1, 11)),
            "bootstrap": [True, False],
            "criterion": ['gini', 'entropy']
        }
        
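        # Note: sklearn's 'f1' scorer is the binary F1 of the positive class,
        # whereas calculate_f1_score averages the F1 of both classes.
        if optimize_f1_score:
            scoring = 'f1'
        else:
            scoring = 'accuracy'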

        random_search = RandomizedSearchCV(self.model, param_distributions=param_distributions, scoring=scoring, n_iter=12, cv=5, random_state=random_seed, n_jobs=-1)
        random_search.fit(self.train_set, self.train_label)
        self.model = random_search.best_estimator_

    def evaluate_model_CV(self, optimize_f1_score=False):
        self.test_pred = self.model.predict(self.test_set)
        
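        # Note: the branches are inverted relative to the flag name. When the search
        # optimized F1, accuracy is reported on the test set, and vice versa.
        if optimize_f1_score:
            score, score_type = accuracy_score(self.test_label, self.test_pred), "accuracy"
        else:
            score, score_type = calculate_f1_score(self.test_label, self.test_pred), "F1 score"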
            
        print('-' * 13 + ' Random Forest ' + '-' * 13)
        for key, value in self.model.get_params().items():
            print(f"{key}: {value}")
        print('-' * 41)
        print(f"Test {score_type}: {score:.4f}")
        print('-' * 41)
        return score
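
`tools.mutGaussian` adds floating-point noise to every gene, so integer hyperparameters can become floats or leave their valid range. The wrapper above decodes genes in two places (`deap_objective` and `train_model_with_ga`). A shared decoding helper is one way to keep both consistent. The sketch below uses a hypothetical `decode_rf_individual` function; rounding instead of truncating and the 0.5 threshold for `bootstrap` are assumptions, not the notebook's behavior.

```python
def decode_rf_individual(individual):
    """Map a raw DEAP gene list to valid RandomForestClassifier keyword arguments."""
    n_estimators, max_depth, min_split, min_leaf, bootstrap, criterion = individual
    return {
        "n_estimators": max(1, int(round(n_estimators))),
        # Assumption: a depth gene below 1 means "no depth limit".
        "max_depth": int(round(max_depth)) if max_depth >= 1 else None,
        "min_samples_split": max(2, int(round(min_split))),
        "min_samples_leaf": max(1, int(round(min_leaf))),
        # Assumption: threshold at 0.5 so a mutated float gene stays meaningful.
        "bootstrap": bootstrap >= 0.5,
        "criterion": "gini" if criterion < 0.5 else "entropy",
    }


# Example with mutated, non-integer genes.
params = decode_rf_individual([103.96222590292335, 37, 8, 1, -0.3, 0.2])
```

The returned dictionary can be passed as `RandomForestClassifier(**params, random_state=...)` both when scoring an individual and when fitting the final model.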

### XGBoost

In [8]:
class XGBoostClassifierWrapper:

    def __init__(self, train_set, test_set):
        self.train_set = train_set.iloc[:, :-1]
        self.test_set = test_set.iloc[:, :-1]
        self.train_label = train_set.iloc[:, -1]
        self.test_label = test_set.iloc[:, -1]
        self.model = None
        self.history = None
        self.test_pred = None
        self.hof = None
        self.best_optuna_trial = None
        self.best_score = None # TODO - wire up best_score and best_score_type so they are easy to access from the whole object.
        self.best_score_type = None
        self._create_model()

    def _create_model(self):
        self.model = xgb.XGBClassifier(random_state=random_seed)

    # ----------------- DEAP -----------------

    def deap_objective(self, individual, optimize_f1_score):
        n_estimators, max_depth, min_child_weight, gamma, subsample, colsample_bytree, learning_rate, reg_lambda, reg_alpha = individual
        n_estimators = int(n_estimators)
        max_depth = int(max_depth) if max_depth > 0 else None
        min_child_weight = int(min_child_weight) if int(min_child_weight) > 0 else 1
        gamma = max(0, gamma)
        subsample = min(max(0.1, subsample), 1)
        colsample_bytree = min(max(0.1, colsample_bytree), 1)
        learning_rate = min(max(0.01, learning_rate), 1)
        reg_lambda = max(0, reg_lambda)
        reg_alpha = max(0, reg_alpha)
        model = xgb.XGBClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                  min_child_weight=min_child_weight, gamma=gamma,
                                  subsample=subsample, colsample_bytree=colsample_bytree,
                                  learning_rate=learning_rate, reg_lambda=reg_lambda,
                                  reg_alpha=reg_alpha, random_state=random_seed)
        model.fit(self.train_set, self.train_label)
        predictions = model.predict(self.test_set)

        if optimize_f1_score:
            score = calculate_f1_score(self.test_label, predictions)
        else:
            score = accuracy_score(self.test_label, predictions)

        return score,

    def train_model_with_ga(self, optimize_f1_score=False):
        if not hasattr(creator, 'FitnessMax'):
            creator.create("FitnessMax", base.Fitness, weights=(1.0,))
        if not hasattr(creator, 'Individual'):
            creator.create("Individual", list, fitness=creator.FitnessMax)

        toolbox = base.Toolbox()

        toolbox.register("attr_n_estimators", random.randint, 10, 200)
        toolbox.register("attr_max_depth", random.randint, 1, 50)
        toolbox.register("attr_min_child_weight", random.randint, 1, 10)
        toolbox.register("attr_gamma", random.uniform, 0, 1)
        toolbox.register("attr_subsample", random.uniform, 0.1, 1)
        toolbox.register("attr_colsample_bytree", random.uniform, 0.1, 1)
        toolbox.register("attr_learning_rate", random.uniform, 0.01, 1)
        toolbox.register("attr_reg_lambda", random.uniform, 0, 1)
        toolbox.register("attr_reg_alpha", random.uniform, 0, 1)

        toolbox.register("individual", tools.initCycle, creator.Individual,
                         (toolbox.attr_n_estimators, toolbox.attr_max_depth, toolbox.attr_min_child_weight,
                          toolbox.attr_gamma, toolbox.attr_subsample, toolbox.attr_colsample_bytree,
                          toolbox.attr_learning_rate, toolbox.attr_reg_lambda, toolbox.attr_reg_alpha), n=1)

        toolbox.register("population", tools.initRepeat, list, toolbox.individual)

        toolbox.register("evaluate", self.deap_objective, optimize_f1_score=optimize_f1_score)
        toolbox.register("mate", tools.cxTwoPoint)
        toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=1, indpb=0.2)
        toolbox.register("select", tools.selTournament, tournsize=5)

        pop = toolbox.population(n=100)

        self.hof = tools.HallOfFame(1)
        stats = tools.Statistics(lambda ind: ind.fitness.values)
        stats.register("avg", np.mean)
        stats.register("min", np.min)
        stats.register("max", np.max)

        pop, logbook = algorithms.eaMuPlusLambda(pop, toolbox, mu=30, lambda_=50, cxpb=0.5, mutpb=0.2,
                                                 ngen=5,
                                                 stats=stats, halloffame=self.hof, verbose=True)

        best_individual = tools.selBest(pop, 1)[0]

        self.model = xgb.XGBClassifier(n_estimators=int(best_individual[0]),
                                       max_depth=int(best_individual[1]) if best_individual[1] > 0 else None,
                                       min_child_weight=int(best_individual[2] if int(best_individual[2]) > 0 else 1),
                                       gamma=max(0, best_individual[3]),
                                       subsample=min(max(0.1, best_individual[4]), 1),
                                       colsample_bytree=min(max(0.1, best_individual[5]), 1),
                                       learning_rate=min(max(0.01, best_individual[6]), 1),
                                       reg_lambda=max(0, best_individual[7]),
                                       reg_alpha=max(0, best_individual[8]),
                                       random_state=random_seed)
        self.model.fit(self.train_set, self.train_label)

    def get_deap_best_individual(self):
        return self.hof[0]

    # ----------------- Optuna -----------------

    def optuna_objective(self, trial, optimize_f1_score):
        n_estimators = trial.suggest_int("n_estimators", 10, 200)
        max_depth = trial.suggest_int("max_depth", 1, 50)
        min_child_weight = trial.suggest_int("min_child_weight", 1, 10)
        gamma = trial.suggest_float("gamma", 0, 1)
        subsample = trial.suggest_float("subsample", 0.1, 1)
        colsample_bytree = trial.suggest_float("colsample_bytree", 0.1, 1)
        learning_rate = trial.suggest_float("learning_rate", 0.01, 1)
        reg_lambda = trial.suggest_float("reg_lambda", 0, 1)
        reg_alpha = trial.suggest_float("reg_alpha", 0, 1)

        model = xgb.XGBClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                  min_child_weight=min_child_weight, gamma=gamma,
                                  subsample=subsample, colsample_bytree=colsample_bytree,
                                  learning_rate=learning_rate, reg_lambda=reg_lambda,
                                  reg_alpha=reg_alpha, random_state=random_seed)
        model.fit(self.train_set, self.train_label)
        predictions = model.predict(self.test_set)

        if optimize_f1_score:
            score = calculate_f1_score(self.test_label, predictions)
        else:
            score = accuracy_score(self.test_label, predictions)

        return score

    def train_model_with_optuna(self, optimize_f1_score=False, n_trials=100):
        study = optuna.create_study(direction="maximize")
        study.optimize(lambda trial: self.optuna_objective(trial, optimize_f1_score), n_jobs=-1, n_trials=n_trials)

        self.best_optuna_trial = study.best_trial

        self.model = xgb.XGBClassifier(n_estimators=self.best_optuna_trial.params["n_estimators"],
                                       max_depth=self.best_optuna_trial.params["max_depth"],
                                       min_child_weight=self.best_optuna_trial.params["min_child_weight"],
                                       gamma=self.best_optuna_trial.params["gamma"],
                                       subsample=self.best_optuna_trial.params["subsample"],
                                       colsample_bytree=self.best_optuna_trial.params["colsample_bytree"],
                                       learning_rate=self.best_optuna_trial.params["learning_rate"],
                                       reg_lambda=self.best_optuna_trial.params["reg_lambda"],
                                       reg_alpha=self.best_optuna_trial.params["reg_alpha"],
                                       random_state=random_seed)
        self.model.fit(self.train_set, self.train_label)

    def get_optuna_best_result(self):
        return self.best_optuna_trial

    # ----------------- RandomizedSearchCV -----------------

    def train_model_CV(self, optimize_f1_score=False):
        param_distributions = {
            "n_estimators": list(range(10, 201)),
            "max_depth": list(range(1, 51)),
            "min_child_weight": list(range(1, 11)),
            "gamma": [i/10.0 for i in range(0,5)],
            "subsample": [i/10.0 for i in range(6,10)],
            "colsample_bytree": [i/10.0 for i in range(6,10)],
            "learning_rate": [0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3],
            "reg_lambda": [i/10.0 for i in range(0,5)],
            "reg_alpha": [1e-5, 1e-2, 0.1, 1, 100]
        }
        
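        # Note: sklearn's 'f1' scorer is the binary F1 of the positive class,
        # whereas calculate_f1_score averages the F1 of both classes.
        if optimize_f1_score:
            scoring = 'f1'
        else:
            scoring = 'accuracy'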

        random_search = RandomizedSearchCV(self.model, param_distributions=param_distributions, scoring=scoring, n_iter=12, cv=5, random_state=random_seed, n_jobs=-1)
        random_search.fit(self.train_set, self.train_label)
        self.model = random_search.best_estimator_

    def evaluate_model_CV(self, optimize_f1_score=False):
        self.test_pred = self.model.predict(self.test_set)
        
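        # Note: the branches are inverted relative to the flag name. When the search
        # optimized F1, accuracy is reported on the test set, and vice versa.
        if optimize_f1_score:
            score, score_type = accuracy_score(self.test_label, self.test_pred), "accuracy"
        else:
            score, score_type = calculate_f1_score(self.test_label, self.test_pred), "F1 score"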
            
        print('-' * 13 + ' XGBoost ' + '-' * 13)
        for key, value in self.model.get_params().items():
            print(f"{key}: {value}")
        print('-' * 41)
        print(f"Test {score_type}: {score:.4f}")
        print('-' * 41)
        return score
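
In both wrappers the DEAP and Optuna objectives score each candidate on `self.test_set`, so the best value found during the search is likely an optimistic estimate. One common alternative is to hold out a validation part of the training data for the search and touch the test set only once. The sketch below shows the idea on synthetic data with a hand-written candidate list; all names and values in it are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real notebook would use its own train/test frames.
X, y = make_classification(n_samples=600, n_features=9, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)


def objective(params):
    """Score one candidate on the validation part only; the test set is untouched."""
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_fit, y_fit)
    return accuracy_score(y_val, model.predict(X_val))


candidates = [{"n_estimators": 20, "max_depth": 3},
              {"n_estimators": 20, "max_depth": None}]
best_params = max(candidates, key=objective)

# Refit on the full training data and report the test score once.
final_model = RandomForestClassifier(random_state=42, **best_params).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
```

The same `objective` shape can be plugged into a genetic algorithm or an Optuna study; only the data it scores on changes.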

### Training the Random Forest model - hyperparameter optimization with a genetic algorithm

In [9]:
def print_best_individual(model):
    best_individual = model.get_deap_best_individual()
    print("Best hiperparameters:")
    print(f"n_estimators: {best_individual[0]}")
    print(f"max_depth: {best_individual[1]}")
    print(f"min_samples_split: {best_individual[2]}")
    print(f"min_samples_leaf: {best_individual[3]}")
    print(f"bootstrap: {bool(best_individual[4])}")
    print(f"criterion: {'gini' if best_individual[5] < 0.5 else 'entropy'}")


print("DEAP - Random Forest - Accuracy: \n")
start_time = time.time()
RF_model_accuracy_optimizing_DEAP = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_accuracy_optimizing_DEAP.train_model_with_ga(optimize_f1_score=False)
print_best_individual(RF_model_accuracy_optimizing_DEAP)
end_time = time.time()
print(f"DEAP - Random Forest - Execution time for accuracy: {format(end_time - start_time, '.2f')} seconds")

print("-" * 50)

print("DEAP - Random Forest - F1 Score: \n")
start_time = time.time()
RF_model_f1score_optimizing_DEAP = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_f1score_optimizing_DEAP.train_model_with_ga(optimize_f1_score=True)
print_best_individual(RF_model_f1score_optimizing_DEAP)
end_time = time.time()
print(f"DEAP - Random Forest - Execution time for F1 Score: {format(end_time - start_time, '.2f')} seconds")

DEAP - Random Forest - Accuracy: 

gen	nevals	avg     	min     	max     
0  	100   	0.673609	0.619266	0.700306
1  	37    	0.686901	0.672783	0.698777
2  	36    	0.692661	0.686544	0.700306
3  	36    	0.696126	0.692661	0.700306
4  	32    	0.698573	0.695719	0.700306
5  	32    	0.700102	0.697248	0.700306
Best hiperparameters:
n_estimators: 65
max_depth: 31
min_samples_split: 2
min_samples_leaf: 1
bootstrap: False
criterion: entropy
DEAP - Random Forest - Execution time for accuracy: 468.04 seconds
--------------------------------------------------
DEAP - Random Forest - F1 Score: 



  precision_positives = true_positives / (true_positives + false_positives)


gen	nevals	avg	min	max
0  	100   	nan	nan	nan
1  	31    	0.630849	0.611587	0.658443
2  	30    	0.643454	0.625076	0.658443
3  	34    	0.654918	0.640072	0.659147
4  	35    	0.658532	0.653547	0.660366
5  	32    	0.659053	0.658443	0.659147
Best hiperparameters:
n_estimators: 103.96222590292335
max_depth: 37
min_samples_split: 8
min_samples_leaf: 1
bootstrap: False
criterion: gini
DEAP - Random Forest - Execution time for F1 Score: 458.38 seconds


### Training the Random Forest model - hyperparameter optimization with the Optuna library

In [10]:
def print_best_optuna_result(model):
    best_trial = model.get_optuna_best_result()
    print("Best hiperparameters:")
    for key, value in best_trial.params.items():
        print(f"  {key}: {value}")
    print("Best accuracy for this hiperparameters: " + str(best_trial.value))


print("Optuna - Random Forest - Accuracy: \n")
start_time = time.time()
RF_model_accuracy_optimizing_optuna = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_accuracy_optimizing_optuna.train_model_with_optuna(optimize_f1_score=False)
print_best_optuna_result(RF_model_accuracy_optimizing_optuna)
end_time = time.time()
print(f"Optuna - Random Forest - Execution time for Accuracy: {format(end_time - start_time, '.2f')} seconds")

print("-" * 50)

print("Optuna - F1 Score: \n")
start_time = time.time()
RF_model_f1score_optimizing_optuna = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_f1score_optimizing_optuna.train_model_with_optuna(optimize_f1_score=True)
print_best_optuna_result(RF_model_f1score_optimizing_optuna)
end_time = time.time()
print(f"Optuna - Random Forest - Execution time for F1 Score: {format(end_time - start_time, '.2f')} seconds")

[I 2024-06-07 00:42:39,338] A new study created in memory with name: no-name-f71fe9b2-3113-4f48-a58b-e36581ba2e0f


Optuna - Random Forest - Accuracy: 



[I 2024-06-07 00:42:40,471] Trial 15 finished with value: 0.6574923547400612 and parameters: {'n_estimators': 49, 'max_depth': 5, 'min_samples_split': 6, 'min_samples_leaf': 10, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 15 with value: 0.6574923547400612.
[I 2024-06-07 00:42:40,606] Trial 3 finished with value: 0.6697247706422018 and parameters: {'n_estimators': 37, 'max_depth': 46, 'min_samples_split': 4, 'min_samples_leaf': 10, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 3 with value: 0.6697247706422018.
[I 2024-06-07 00:42:41,132] Trial 10 finished with value: 0.6758409785932722 and parameters: {'n_estimators': 47, 'max_depth': 42, 'min_samples_split': 8, 'min_samples_leaf': 2, 'bootstrap': True, 'criterion': 'entropy'}. Best is trial 10 with value: 0.6758409785932722.
[I 2024-06-07 00:42:41,176] Trial 11 finished with value: 0.6666666666666666 and parameters: {'n_estimators': 55, 'max_depth': 30, 'min_samples_split': 3, 'min_samples_leaf': 2, 'bootstrap'

Best hiperparameters:
  n_estimators: 134
  max_depth: 32
  min_samples_split: 9
  min_samples_leaf: 1
  bootstrap: False
  criterion: entropy
Best accuracy for this hiperparameters: 0.7018348623853211
Optuna - Random Forest - Execution time for Accuracy: 30.44 seconds
--------------------------------------------------
Optuna - F1 Score: 



[I 2024-06-07 00:43:10,364] Trial 9 finished with value: 0.6164222873900292 and parameters: {'n_estimators': 13, 'max_depth': 44, 'min_samples_split': 2, 'min_samples_leaf': 2, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 9 with value: 0.6164222873900292.
[I 2024-06-07 00:43:10,478] Trial 12 finished with value: 0.6183438811538207 and parameters: {'n_estimators': 12, 'max_depth': 33, 'min_samples_split': 7, 'min_samples_leaf': 4, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 12 with value: 0.6183438811538207.
[I 2024-06-07 00:43:10,726] Trial 17 finished with value: 0.6278879673037625 and parameters: {'n_estimators': 22, 'max_depth': 30, 'min_samples_split': 8, 'min_samples_leaf': 4, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 17 with value: 0.6278879673037625.
[I 2024-06-07 00:43:10,782] Trial 2 finished with value: 0.6163432267884322 and parameters: {'n_estimators': 21, 'max_depth': 25, 'min_samples_split': 9, 'min_samples_leaf': 6, 'bootstrap': Fal

Best hiperparameters:
  n_estimators: 61
  max_depth: 23
  min_samples_split: 3
  min_samples_leaf: 3
  bootstrap: False
  criterion: gini
Best accuracy for this hiperparameters: 0.6580081960215145
Optuna - Random Forest - Execution time for F1 Score: 19.33 seconds


### Training the Random Forest model - hyperparameter optimization with RandomizedSearchCV

In [11]:
print("RandomizedSearchCV - Random Forest - Accuracy: \n")
start_time = time.time()
RF_model_accuracy_optimizing_CV = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_accuracy_optimizing_CV.train_model_CV(optimize_f1_score=False)
RF_model_accuracy_optimizing_CV.evaluate_model_CV(optimize_f1_score=False)
end_time = time.time()
print(f"RandomizedSearchCV - Execution time for Accuracy: {format(end_time - start_time, '.2f')} seconds")

print("-" * 50)

print("RandomizedSearchCV - Random Forest - F1 Score: \n")
start_time = time.time()
RF_model_f1score_optimizing_CV = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_f1score_optimizing_CV.train_model_CV(optimize_f1_score=True)
RF_model_f1score_optimizing_CV.evaluate_model_CV(optimize_f1_score=True)
end_time = time.time()
print(f"RandomizedSearchCV - Random Forest - Execution time for F1 Score: {format(end_time - start_time, '.2f')} seconds")

RandomizedSearchCV - Random Forest - Accuracy: 

------------- Random Forest -------------
bootstrap: False
ccp_alpha: 0.0
class_weight: None
criterion: entropy
max_depth: 8
max_features: sqrt
max_leaf_nodes: None
max_samples: None
min_impurity_decrease: 0.0
min_samples_leaf: 1
min_samples_split: 5
min_weight_fraction_leaf: 0.0
n_estimators: 154
n_jobs: None
oob_score: False
random_state: 42
verbose: 0
warm_start: False
-----------------------------------------
Test F1 score: 0.6015
-----------------------------------------
RandomizedSearchCV - Execution time for Accuracy: 11.36 seconds
--------------------------------------------------
RandomizedSearchCV - Random Forest - F1 Score: 

------------- Random Forest -------------
bootstrap: False
ccp_alpha: 0.0
class_weight: None
criterion: gini
max_depth: 35
max_features: sqrt
max_leaf_nodes: None
max_samples: None
min_impurity_decrease: 0.0
min_samples_leaf: 1
min_samples_split: 8
min_weight_fraction_leaf: 0.0
n_estimators: 164
n_jobs: N

### Training the XGBoost model - hyperparameter optimization with a genetic algorithm

In [12]:
def print_best_individual(model):
    """Print the genes of the best individual found by the genetic algorithm."""
    best_individual = model.get_deap_best_individual()
    print("Best hyperparameters:")
    print(f"n_estimators: {best_individual[0]}")
    print(f"max_depth: {best_individual[1]}")
    print(f"min_child_weight: {best_individual[2]}")
    print(f"gamma: {best_individual[3]}")
    print(f"subsample: {best_individual[4]}")
    print(f"colsample_bytree: {best_individual[5]}")
    print(f"learning_rate: {best_individual[6]}")
    print(f"reg_lambda: {best_individual[7]}")
    print(f"reg_alpha: {best_individual[8]}")

# Genetic algorithm with accuracy as the fitness function.
print("DEAP - XGBoost - Accuracy: \n")
start_time = time.time()
XGB_model_accuracy_optimizing_DEAP = XGBoostClassifierWrapper(train_std_interpolate, test_std_interpolate)
XGB_model_accuracy_optimizing_DEAP.train_model_with_ga(optimize_f1_score=False)
print_best_individual(XGB_model_accuracy_optimizing_DEAP)
end_time = time.time()
print(f"DEAP - XGBoost - Execution time for accuracy: {end_time - start_time:.2f} seconds")

print("-" * 50)

# Same algorithm with the F1 score as the fitness function.
print("DEAP - XGBoost - F1 Score: \n")
start_time = time.time()
XGB_model_f1score_optimizing_DEAP = XGBoostClassifierWrapper(train_std_interpolate, test_std_interpolate)
XGB_model_f1score_optimizing_DEAP.train_model_with_ga(optimize_f1_score=True)
print_best_individual(XGB_model_f1score_optimizing_DEAP)
end_time = time.time()
print(f"DEAP - XGBoost - Execution time for F1 Score: {end_time - start_time:.2f} seconds")

DEAP - XGBoost - Accuracy: 

gen	nevals	avg     	min     	max     
0  	100   	0.602385	0.518349	0.668196
1  	35    	0.632518	0.594801	0.668196
2  	31    	0.655148	0.617737	0.680428
3  	35    	0.673445	0.643731	0.680428
4  	37    	0.679817	0.675841	0.680428
5  	34    	0.680428	0.680428	0.680428
Best hyperparameters:
n_estimators: 40
max_depth: 31
min_child_weight: 4
gamma: 0.12154042583927926
subsample: 0.6698229012471432
colsample_bytree: 0.9535284233682184
learning_rate: 0.060241362214599276
reg_lambda: 0.8311679697280072
reg_alpha: 0.4034999043153308
DEAP - XGBoost - Execution time for accuracy: 18.25 seconds
--------------------------------------------------
DEAP - XGBoost - F1 Score: 



  f1_score_positives = 2 * (precision_positives * recall_positives) / (precision_positives + recall_positives)
  precision_positives = true_positives / (true_positives + false_positives)


gen	nevals	avg	min	max
0  	100   	nan	nan	nan
1  	35    	nan	nan	nan
2  	34    	nan	nan	nan
3  	30    	0.620514	0.606924	0.626963
4  	38    	0.627121	0.619003	0.632819
5  	37    	0.631893	0.626136	0.643153
Best hyperparameters:
n_estimators: 128
max_depth: 31
min_child_weight: 3
gamma: 0.7832163211188644
subsample: 0.8497178056296394
colsample_bytree: 0.8273468625163021
learning_rate: 0.03174712073389739
reg_lambda: -0.9197465419898736
reg_alpha: 0.09445075808653292
DEAP - XGBoost - Execution time for F1 Score: 18.89 seconds
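
Two details of the F1 run stand out: the statistics for generations 0 to 2 are `nan`, with warning fragments pointing at the precision and F1 divisions, and the best individual carries a negative `reg_lambda`. Both suggest that the fitness loop could use two small guards. The following is a minimal sketch, not the wrapper's actual code: `safe_f1` and `clamp_individual` are hypothetical helpers, and the ranges in `BOUNDS` are assumptions chosen only for illustration. The first returns 0.0 when the F1 score is undefined (for example, when no sample is predicted positive); the second clips each gene back into its range after crossover or mutation.

```python
def safe_f1(true_positives, false_positives, false_negatives):
    """F1 for the positive class, returning 0.0 where it would be undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0  # no positive predictions or no positive samples
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0  # no true positives at all
    return 2 * precision * recall / (precision + recall)


# Assumed (low, high) range per gene, in the order printed by print_best_individual.
BOUNDS = [
    (10, 200),    # n_estimators
    (1, 50),      # max_depth
    (1, 10),      # min_child_weight
    (0.0, 1.0),   # gamma
    (0.1, 1.0),   # subsample
    (0.1, 1.0),   # colsample_bytree
    (0.01, 1.0),  # learning_rate
    (0.0, 1.0),   # reg_lambda
    (0.0, 1.0),   # reg_alpha
]
INTEGER_GENES = {0, 1, 2}  # positions that must stay integers


def clamp_individual(individual):
    """Clip every gene into its range in place and round the integer genes."""
    for i, (low, high) in enumerate(BOUNDS):
        value = min(max(individual[i], low), high)
        individual[i] = int(round(value)) if i in INTEGER_GENES else value
    return individual


print(safe_f1(0, 0, 25))    # model predicted no positives
print(safe_f1(30, 10, 20))  # precision 0.75, recall 0.6
best = [128, 31, 3, 0.78, 0.85, 0.83, 0.03, -0.92, 0.09]
print(clamp_individual(best)[7])  # the negative reg_lambda is clipped to the lower bound
```

Scoring an undefined F1 as 0.0 keeps such individuals in the population with the worst possible fitness, so the per-generation average, minimum, and maximum remain finite.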


### Training the XGBoost model - hyperparameter optimization with the Optuna library

In [13]:
def print_best_optuna_result(model):
    """Print the parameters and objective value of the best Optuna trial."""
    best_trial = model.get_optuna_best_result()
    print("Best hyperparameters:")
    for key, value in best_trial.params.items():
        print(f"  {key}: {value}")
    # best_trial.value holds the objective the study optimized
    # (accuracy or F1 score, depending on optimize_f1_score).
    print(f"Best accuracy for these hyperparameters: {best_trial.value}")


# Optuna study with accuracy as the objective.
print("Optuna - XGBoost - Accuracy: \n")
start_time = time.time()
XGB_model_accuracy_optimizing_optuna = XGBoostClassifierWrapper(train_std_interpolate, test_std_interpolate)
XGB_model_accuracy_optimizing_optuna.train_model_with_optuna(optimize_f1_score=False)
print_best_optuna_result(XGB_model_accuracy_optimizing_optuna)
end_time = time.time()
print(f"Optuna - XGBoost - Execution time for Accuracy: {end_time - start_time:.2f} seconds")

print("-" * 50)

# Same study with the F1 score as the objective.
print("Optuna - XGBoost - F1 Score: \n")
start_time = time.time()
XGB_model_f1score_optimizing_optuna = XGBoostClassifierWrapper(train_std_interpolate, test_std_interpolate)
XGB_model_f1score_optimizing_optuna.train_model_with_optuna(optimize_f1_score=True)
print_best_optuna_result(XGB_model_f1score_optimizing_optuna)
end_time = time.time()
print(f"Optuna - XGBoost - Execution time for F1 Score: {end_time - start_time:.2f} seconds")

[I 2024-06-07 00:44:28,162] A new study created in memory with name: no-name-e193031a-89f7-4ef7-af71-70a17f42a009
[I 2024-06-07 00:44:28,308] Trial 3 finished with value: 0.6330275229357798 and parameters: {'n_estimators': 22, 'max_depth': 6, 'min_child_weight': 8, 'gamma': 0.508456788256398, 'subsample': 0.24934489957937955, 'colsample_bytree': 0.87620450068816, 'learning_rate': 0.2632083153833256, 'reg_lambda': 0.6121278066639152, 'reg_alpha': 0.42858296569017995}. Best is trial 3 with value: 0.6330275229357798.


Optuna - XGBoost - Accuracy: 



[I 2024-06-07 00:44:28,396] Trial 15 finished with value: 0.5397553516819572 and parameters: {'n_estimators': 19, 'max_depth': 21, 'min_child_weight': 7, 'gamma': 0.9456664900970976, 'subsample': 0.35670415115365894, 'colsample_bytree': 0.3577732055250682, 'learning_rate': 0.7566115348184841, 'reg_lambda': 0.4217564320220363, 'reg_alpha': 0.9388155837152007}. Best is trial 3 with value: 0.6330275229357798.
[I 2024-06-07 00:44:28,426] Trial 2 finished with value: 0.5443425076452599 and parameters: {'n_estimators': 77, 'max_depth': 29, 'min_child_weight': 10, 'gamma': 0.9983927172301522, 'subsample': 0.12205107256692944, 'colsample_bytree': 0.22193361137990414, 'learning_rate': 0.9056030286896299, 'reg_lambda': 0.9945151641931488, 'reg_alpha': 0.7107642458035356}. Best is trial 3 with value: 0.6330275229357798.
[I 2024-06-07 00:44:28,525] Trial 7 finished with value: 0.555045871559633 and parameters: {'n_estimators': 108, 'max_depth': 22, 'min_child_weight': 10, 'gamma': 0.93741080035797

Best hyperparameters:
  n_estimators: 162
  max_depth: 41
  min_child_weight: 2
  gamma: 0.35397734494389177
  subsample: 0.4285979184361682
  colsample_bytree: 0.8780724572623652
  learning_rate: 0.013791911115751097
  reg_lambda: 0.4233592487646939
  reg_alpha: 0.6638913242381421
Best accuracy for these hyperparameters: 0.6834862385321101
Optuna - XGBoost - Execution time for Accuracy: 7.64 seconds
--------------------------------------------------
Optuna - XGBoost - F1 Score: 



[I 2024-06-07 00:44:36,026] Trial 0 finished with value: 0.5415173703069014 and parameters: {'n_estimators': 61, 'max_depth': 14, 'min_child_weight': 4, 'gamma': 0.7878424938318213, 'subsample': 0.11228961997597124, 'colsample_bytree': 0.5121826104212664, 'learning_rate': 0.9562378803497771, 'reg_lambda': 0.7326475987717022, 'reg_alpha': 0.4152862310904586}. Best is trial 0 with value: 0.5415173703069014.
[I 2024-06-07 00:44:36,106] Trial 2 finished with value: 0.5709337112265785 and parameters: {'n_estimators': 58, 'max_depth': 26, 'min_child_weight': 10, 'gamma': 0.9608528053980017, 'subsample': 0.20971087608885777, 'colsample_bytree': 0.9856550451356655, 'learning_rate': 0.14713442950577837, 'reg_lambda': 0.46791744950506786, 'reg_alpha': 0.28328697509126366}. Best is trial 2 with value: 0.5709337112265785.
[I 2024-06-07 00:44:36,130] Trial 14 finished with value: 0.613576680672269 and parameters: {'n_estimators': 41, 'max_depth': 37, 'min_child_weight': 1, 'gamma': 0.81047285277184

Best hyperparameters:
  n_estimators: 77
  max_depth: 19
  min_child_weight: 3
  gamma: 0.6490987958071666
  subsample: 0.884037251990464
  colsample_bytree: 0.9997019639410412
  learning_rate: 0.24907556818144527
  reg_lambda: 0.7728594931105196
  reg_alpha: 0.6235350328179274
Best accuracy for these hyperparameters: 0.638274336283186
Optuna - XGBoost - Execution time for F1 Score: 5.22 seconds
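
The trial lines in the logs arrive out of numerical order (3, 15, 2, 7, ...), which is consistent with trials running in parallel, so the best value is easier to extract programmatically than to read off by eye. A minimal sketch, assuming the log format shown above: `parse_trial` is a hypothetical helper, and the parameter dictionaries in the sample lines are shortened to `{...}`. In a live session, the study's own `best_trial` is the more direct source.

```python
import re

# Matches e.g. "Trial 3 finished with value: 0.6330275229357798 and parameters".
TRIAL_RE = re.compile(r"Trial (\d+) finished with value: ([0-9.eE+-]+) and parameters")


def parse_trial(line):
    """Return (trial_number, value) for a finished-trial log line, else None."""
    match = TRIAL_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


# Values copied from the accuracy run above; parameter dicts shortened.
log_lines = [
    "[I 2024-06-07 00:44:28,308] Trial 3 finished with value: 0.6330275229357798 and parameters: {...}",
    "[I 2024-06-07 00:44:28,396] Trial 15 finished with value: 0.5397553516819572 and parameters: {...}",
    "[I 2024-06-07 00:44:28,426] Trial 2 finished with value: 0.5443425076452599 and parameters: {...}",
    "Optuna - XGBoost - Accuracy: ",  # interleaved stdout line, ignored by the parser
]

trials = [t for t in map(parse_trial, log_lines) if t is not None]
best_number, best_value = max(trials, key=lambda t: t[1])
print(f"Best trial so far: {best_number} with value {best_value}")
```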


### Training the XGBoost model - hyperparameter optimization with RandomizedSearchCV

In [14]:
# Random search with accuracy as the objective.
print("RandomizedSearchCV - XGBoost - Accuracy: \n")
start_time = time.time()
XGB_model_accuracy_optimizing_CV = XGBoostClassifierWrapper(train_std_interpolate, test_std_interpolate)
XGB_model_accuracy_optimizing_CV.train_model_CV(optimize_f1_score=False)
XGB_model_accuracy_optimizing_CV.evaluate_model_CV(optimize_f1_score=False)
end_time = time.time()
print(f"RandomizedSearchCV - Execution time for Accuracy: {end_time - start_time:.2f} seconds")

print("-" * 50)

# Same search with the F1 score as the objective.
print("RandomizedSearchCV - XGBoost - F1 Score: \n")
start_time = time.time()
XGB_model_f1score_optimizing_CV = XGBoostClassifierWrapper(train_std_interpolate, test_std_interpolate)
XGB_model_f1score_optimizing_CV.train_model_CV(optimize_f1_score=True)
XGB_model_f1score_optimizing_CV.evaluate_model_CV(optimize_f1_score=True)
end_time = time.time()
print(f"RandomizedSearchCV - XGBoost - Execution time for F1 Score: {end_time - start_time:.2f} seconds")

RandomizedSearchCV - XGBoost - Accuracy: 

------------- Random Forest -------------
objective: binary:logistic
base_score: None
booster: None
callbacks: None
colsample_bylevel: None
colsample_bynode: None
colsample_bytree: 0.8
device: None
early_stopping_rounds: None
enable_categorical: False
eval_metric: None
feature_types: None
gamma: 0.0
grow_policy: None
importance_type: None
interaction_constraints: None
learning_rate: 0.03
max_bin: None
max_cat_threshold: None
max_cat_to_onehot: None
max_delta_step: None
max_depth: 25
max_leaves: None
min_child_weight: 9
missing: nan
monotone_constraints: None
multi_strategy: None
n_estimators: 106
n_jobs: None
num_parallel_tree: None
random_state: 42
reg_alpha: 1e-05
reg_lambda: 0.3
sampling_method: None
scale_pos_weight: None
subsample: 0.8
tree_method: None
validate_parameters: None
verbosity: None
-----------------------------------------
Test F1 score: 0.6117
-----------------------------------------
RandomizedSearchCV - Execution time for 