# Water Quality and Potability Classification
  
#### [Dataset URL](https://www.kaggle.com/datasets/uom190346a/water-quality-and-potability)
  
## Dataset description

This dataset contains water quality measurements together with an assessment of whether the water is fit for human consumption, that is, its potability. Each row represents a single water sample described by a set of quality features, and the "Potability" column indicates whether that sample is safe to drink. The main goal is to assess and predict potability from these features. The dataset can be used to evaluate the safety and suitability of water sources for human consumption, to support informed decisions about water treatment, and to check compliance with water quality standards.

## Feature description

- pH: pH level of the water.
- Hardness: Water hardness, a measure of mineral content.
- Solids: Total dissolved solids in the water.
- Chloramines: Chloramine concentration in the water.
- Sulfate: Sulfate concentration in the water.
- Conductivity: Electrical conductivity of the water.
- Organic_carbon: Organic carbon content of the water.
- Trihalomethanes: Trihalomethane concentration in the water.
- Turbidity: Turbidity level, a measure of water clarity.
- Potability: Target variable; indicates whether the water is safe to drink, taking the value 1 ("potable") or 0 ("not potable").

## Dataset parameters

- Number of records: 3276
- Number of features: 9
- Missing data: Yes (columns: pH, Sulfate, and Trihalomethanes)
- Outliers: Yes (approx. 1.22% of the whole dataset)
- Problem type: Classification (Potability - No (0), Yes (1))

## Class distribution

| Class | Number of records | Percentage |
|-------|-------------------|------------|
| 0     | 1998              | 60.99%     |
| 1     | 1278              | 39.01%     |

## Importing libraries

In [1]:
import pickle
import random
import time

import numpy as np
import optuna
from deap import base, creator, tools, algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

## Loading the variables

The variables are loaded from a file generated by the DataAnalysis.ipynb notebook using the pickle library, which serializes and deserializes Python objects. The data variant selected here is the one that achieved the best results in stage 2, model building and training.

In [2]:
with open('data_dump/normalizedStdInterpolateVars.pkl', 'rb') as f:
    normalized_std_interpolate = pickle.load(f)
    scaler_std_interpolate = pickle.load(f)
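
The two consecutive `pickle.load(f)` calls work because a pickle file can hold several objects written one after another, and each `load` returns the next object in the order it was dumped. A minimal sketch of that round trip follows. It uses placeholder objects and a temporary file; the code that writes the real file lives in DataAnalysis.ipynb and is not shown here, so the writing side is an assumption.

```python
import os
import pickle
import tempfile

# Placeholder stand-ins for the DataFrame and the fitted scaler (illustration only).
fake_frame = {"ph": [7.0, 6.5], "Potability": [1, 0]}
fake_scaler = {"mean": 6.75, "std": 0.25}

path = os.path.join(tempfile.mkdtemp(), "vars.pkl")

# Writing side: objects are dumped one after another into the same file.
with open(path, "wb") as f:
    pickle.dump(fake_frame, f)
    pickle.dump(fake_scaler, f)

# Reading side: each load returns the next object, in the same order.
with open(path, "rb") as f:
    loaded_frame = pickle.load(f)
    loaded_scaler = pickle.load(f)

print(loaded_frame == fake_frame, loaded_scaler == fake_scaler)
```

Swapping the order of the two `load` calls would silently assign the scaler to the data variable, so the reading order has to mirror the writing order.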

## Functions for splitting the dataset

This section defines the helper used to split the data. `split_df_train_test` divides the data into a training set and a test set. The split is stratified: test samples are drawn separately from each class, so both sets keep roughly the original class proportions and the training set cannot end up containing only one class.

In [3]:
def split_df_train_test(data, test_size, seed):
    np.random.seed(seed)

    unique_labels = data['Potability'].unique()
    label_counts = data['Potability'].value_counts()

    test_indices = []

    for label in unique_labels:
        num_label_samples = label_counts[label]
        num_test_samples = int(test_size * num_label_samples)
        label_indices = data.index[data['Potability'] == label].tolist()
        label_test_indices = np.random.choice(label_indices, size=num_test_samples, replace=False)
        test_indices.extend(label_test_indices)

    train_indices = np.setdiff1d(data.index, test_indices)

    train_set = data.loc[train_indices]
    test_set = data.loc[test_indices]

    return train_set, test_set

### Function for computing the F1-score metric

In [4]:
def calculate_f1_score(y_true, y_pred):
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    true_negatives = np.sum((y_true == 0) & (y_pred == 0))
    false_positives = np.sum((y_true == 0) & (y_pred == 1))
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))

    precision_positives = true_positives / (true_positives + false_positives)
    recall_positives = true_positives / (true_positives + false_negatives)
    f1_score_positives = 2 * (precision_positives * recall_positives) / (precision_positives + recall_positives)

    precision_negatives = true_negatives / (true_negatives + false_negatives)
    recall_negatives = true_negatives / (true_negatives + false_positives)
    f1_score_negatives = 2 * (precision_negatives * recall_negatives) / (precision_negatives + recall_negatives)

    f1_score = (f1_score_positives + f1_score_negatives) / 2
    return f1_score
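
`calculate_f1_score` computes the F1-score separately for class 1 and class 0 and returns their unweighted mean, which is commonly called the macro-averaged F1. The sketch below repeats the calculation in plain Python on a made-up example. The helper names `f1_for_class` and `macro_f1_binary` and the labels are illustrative and do not come from the notebook. Unlike the notebook function, the sketch returns 0 for a class with no true positives instead of dividing by zero; that guard is an added assumption.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1-score treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No correct predictions for this class: use 0 instead of dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1_binary(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores for labels 0 and 1."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(macro_f1_binary(y_true, y_pred))
```

In this example class 1 scores 2/3 and class 0 scores 0.8, so the macro average is about 0.733. Because both classes carry equal weight, the metric penalizes a model that does well only on the majority class.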

### Data split

The data is split with the `split_df_train_test` function, using a 20% test share and seed 123.

In [5]:
train_std_interpolate, test_std_interpolate = split_df_train_test(normalized_std_interpolate, 0.2, 123)

### Class distribution after the split - train/test

In [6]:
train_class_counts = train_std_interpolate['Potability'].value_counts(normalize=True) * 100
test_class_counts = test_std_interpolate['Potability'].value_counts(normalize=True) * 100

print("Train set:")
print(train_class_counts)
print("\nTest set:")
print(test_class_counts)

Train set:
Potability
0    60.983982
1    39.016018
Name: proportion, dtype: float64

Test set:
Potability
0    61.009174
1    38.990826
Name: proportion, dtype: float64
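
These percentages follow directly from how `split_df_train_test` sizes the test set: it takes `int(test_size * n)` samples from each class. A quick arithmetic check, assuming the loaded data still has the class counts listed in the class distribution table (1998 and 1278):

```python
# Class counts from the class distribution table (assumed unchanged after preprocessing).
class_counts = {0: 1998, 1: 1278}
test_size = 0.2

# Per-class test counts, truncated the same way as in split_df_train_test.
test_counts = {label: int(test_size * n) for label, n in class_counts.items()}
train_counts = {label: n - test_counts[label] for label, n in class_counts.items()}

n_test = sum(test_counts.values())
n_train = sum(train_counts.values())

for label in class_counts:
    train_share = 100 * train_counts[label] / n_train
    test_share = 100 * test_counts[label] / n_test
    print(f"class {label}: train {train_share:.6f}%  test {test_share:.6f}%")
```

Under that assumption the test set has 654 samples (399 + 255) and the training set 2622, and the computed shares agree with the printed output. The small difference between the train and test percentages comes from truncating each class count with `int`.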


## Random Forest - Scikit-Learn

In [7]:
class RandomForestClassifierWrapper:

    def __init__(self, train_set, test_set):
        self.train_set = train_set.iloc[:, :-1]
        self.test_set = test_set.iloc[:, :-1]
        self.train_label = train_set.iloc[:, -1]
        self.test_label = test_set.iloc[:, -1]
        self.model = None
        self.history = None
        self.test_pred = None
        self.hof = None
        self.best_optuna_trial = None
        self._create_model()
        self.mu = 30
        self.ngen = 5
        self.total_individuals = self.mu * self.ngen
        self.eval_counter = 0

    def _create_model(self):
        self.model = RandomForestClassifier(random_state=42)

    def evalRFModel(self, individual, optimize_f1_score):
        n_estimators, max_depth, min_samples_split, min_samples_leaf, bootstrap, criterion = individual
        n_estimators = int(n_estimators)
        max_depth = int(max_depth) if max_depth > 0 else None
        min_samples_split = int(min_samples_split) if int(min_samples_split) > 1 else 2
        min_samples_leaf = int(min_samples_leaf) if int(min_samples_leaf) > 0 else 1
        bootstrap = bool(bootstrap)
        criterion = 'gini' if criterion < 0.5 else 'entropy'
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                       min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,
                                       bootstrap=bootstrap, criterion=criterion, random_state=42)
        model.fit(self.train_set, self.train_label)
        predictions = model.predict(self.test_set)

        if optimize_f1_score:
            score = calculate_f1_score(self.test_label, predictions)
        else:
            score = accuracy_score(self.test_label, predictions)

        return score,

    def train_model_with_ga(self, optimize_f1_score=False):
        if not hasattr(creator, 'FitnessMax'):
            creator.create("FitnessMax", base.Fitness, weights=(1.0,))
        if not hasattr(creator, 'Individual'):
            creator.create("Individual", list, fitness=creator.FitnessMax)

        toolbox = base.Toolbox()

        toolbox.register("attr_n_estimators", random.randint, 10, 200)
        toolbox.register("attr_max_depth", random.randint, 1, 50)
        toolbox.register("attr_min_samples_split", random.randint, 2, 10)
        toolbox.register("attr_min_samples_leaf", random.randint, 1, 10)
        toolbox.register("attr_bootstrap", random.randint, 0, 1)
        toolbox.register("attr_criterion", random.uniform, 0, 1)

        toolbox.register("individual", tools.initCycle, creator.Individual,
                         (toolbox.attr_n_estimators, toolbox.attr_max_depth, toolbox.attr_min_samples_split,
                          toolbox.attr_min_samples_leaf, toolbox.attr_bootstrap, toolbox.attr_criterion), n=1)

        toolbox.register("population", tools.initRepeat, list, toolbox.individual)

        toolbox.register("evaluate", self.evalRFModel, optimize_f1_score=optimize_f1_score)
        toolbox.register("mate", tools.cxTwoPoint)
        toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=1, indpb=0.2)
        toolbox.register("select", tools.selTournament, tournsize=5)

        pop = toolbox.population(n=100)

        self.hof = tools.HallOfFame(1)
        stats = tools.Statistics(lambda ind: ind.fitness.values)
        stats.register("avg", np.mean)
        stats.register("min", np.min)
        stats.register("max", np.max)

        pop, logbook = algorithms.eaMuPlusLambda(pop, toolbox, mu=self.mu, lambda_=50, cxpb=0.5, mutpb=0.2,
                                                 ngen=self.ngen,
                                                 stats=stats, halloffame=self.hof, verbose=True)

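        # Take the best individual from the hall of fame so the refitted model matches
        # the one reported by get_deap_best_individual(); tournament selection does not
        # guarantee that the best individual survives in the final population.
        best_individual = self.hof[0]

        # Decode the genes with the same rules as evalRFModel.
        self.model = RandomForestClassifier(n_estimators=int(best_individual[0]),
                                            max_depth=int(best_individual[1]) if best_individual[1] > 0 else None,
                                            min_samples_split=max(int(best_individual[2]), 2),
                                            min_samples_leaf=max(int(best_individual[3]), 1),
                                            bootstrap=bool(best_individual[4]),
                                            criterion='gini' if best_individual[5] < 0.5 else 'entropy',
                                            random_state=42)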
        self.model.fit(self.train_set, self.train_label)

    def optuna_objective(self, trial, optimize_f1_score):
        n_estimators = trial.suggest_int("n_estimators", 10, 200)
        max_depth = trial.suggest_int("max_depth", 1, 50)
        min_samples_split = trial.suggest_int("min_samples_split", 2, 10)
        min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 10)
        bootstrap = trial.suggest_categorical("bootstrap", [True, False])
        criterion = trial.suggest_categorical("criterion", ['gini', 'entropy'])

        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                       min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,
                                       bootstrap=bootstrap, criterion=criterion, random_state=42)
        model.fit(self.train_set, self.train_label)
        predictions = model.predict(self.test_set)

        if optimize_f1_score:
            score = calculate_f1_score(self.test_label, predictions)
        else:
            score = accuracy_score(self.test_label, predictions)

        return score

    def train_model_with_optuna(self, optimize_f1_score=False, n_trials=100):
        study = optuna.create_study(direction="maximize")
        study.optimize(lambda trial: self.optuna_objective(trial, optimize_f1_score), n_jobs=-1, n_trials=n_trials)

        self.best_optuna_trial = study.best_trial

        self.model = RandomForestClassifier(n_estimators=self.best_optuna_trial.params["n_estimators"],
                                            max_depth=self.best_optuna_trial.params["max_depth"],
                                            min_samples_split=self.best_optuna_trial.params["min_samples_split"],
                                            min_samples_leaf=self.best_optuna_trial.params["min_samples_leaf"],
                                            bootstrap=self.best_optuna_trial.params["bootstrap"],
                                            criterion=self.best_optuna_trial.params["criterion"],
                                            random_state=42)
        self.model.fit(self.train_set, self.train_label)

    def get_deap_best_individual(self):
        return self.hof[0]

    def get_optuna_best_result(self):
        return self.best_optuna_trial
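
Because `tools.mutGaussian` adds Gaussian noise to the genes, an individual may contain floats or values outside the ranges used at initialization. That is why `evalRFModel` casts and clamps each gene before building the model. The decoding step can be isolated as in the sketch below; `decode_individual` is a hypothetical helper that mirrors those rules and is not a function from the notebook.

```python
def decode_individual(individual):
    """Map a 6-gene individual to RandomForestClassifier keyword arguments.

    Mirrors the casting rules in evalRFModel; hypothetical helper for illustration.
    """
    n_estimators, max_depth, min_samples_split, min_samples_leaf, bootstrap, criterion = individual
    return {
        "n_estimators": int(n_estimators),
        # A non-positive depth gene is treated as "no depth limit".
        "max_depth": int(max_depth) if max_depth > 0 else None,
        # Integer min_samples_split is clamped to at least 2.
        "min_samples_split": max(int(min_samples_split), 2),
        # min_samples_leaf is clamped to at least 1.
        "min_samples_leaf": max(int(min_samples_leaf), 1),
        "bootstrap": bool(bootstrap),
        # The criterion gene is a float thresholded at 0.5.
        "criterion": "gini" if criterion < 0.5 else "entropy",
    }


# A mutated individual: floats, a negative depth, and too-small split and leaf values.
params = decode_individual([113.4, -0.7, 1.2, 0.3, 1, 0.81])
print(params)
```

Keeping the decoding in one place would let the fitness evaluation, the final refit, and the reporting code share the same rules.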

### Training the Random Forest model - hyperparameter optimization with a genetic algorithm

In [8]:
def print_best_individual(model):
    best_individual = model.get_deap_best_individual()
    print("Best hiperparameters:")
    print(f"n_estimators: {best_individual[0]}")
    print(f"max_depth: {best_individual[1]}")
    print(f"min_samples_split: {best_individual[2]}")
    print(f"min_samples_leaf: {best_individual[3]}")
    print(f"bootstrap: {bool(best_individual[4])}")
    print(f"criterion: {'gini' if best_individual[5] < 0.5 else 'entropy'}")


print("DEAP - Accuracy: \n")
start_time = time.time()
RF_model_accuracy_optimizing_DEAP = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_accuracy_optimizing_DEAP.train_model_with_ga(optimize_f1_score=False)
print_best_individual(RF_model_accuracy_optimizing_DEAP)
end_time = time.time()
print(f"DEAP - Execution time for accuracy: {format(end_time - start_time, '.2f')} seconds")

print("-" * 50)

print("DEAP - F1 Score: \n")
start_time = time.time()
RF_model_f1score_optimizing_DEAP = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_f1score_optimizing_DEAP.train_model_with_ga(optimize_f1_score=True)
print_best_individual(RF_model_f1score_optimizing_DEAP)
end_time = time.time()
print(f"DEAP - Execution time for F1 Score: {format(end_time - start_time, '.2f')} seconds")

DEAP - Accuracy: 

gen	nevals	avg     	min     	max     
0  	100   	0.672217	0.610092	0.697248
1  	37    	0.687207	0.67737 	0.695719
2  	37    	0.6921  	0.685015	0.695719
3  	35    	0.695515	0.691131	0.698777
4  	32    	0.696687	0.695719	0.698777
5  	33    	0.698063	0.695719	0.698777
Best hiperparameters:
n_estimators: 113
max_depth: 41
min_samples_split: 4
min_samples_leaf: 1
bootstrap: True
criterion: gini
DEAP - Execution time for accuracy: 470.99 seconds
--------------------------------------------------
DEAP - F1 Score: 

gen	nevals	avg     	min     	max     
0  	100   	0.598052	0.383184	0.647085
1  	37    	0.630906	0.608852	0.647085
2  	29    	0.639115	0.63358 	0.66714 
3  	33    	0.645971	0.63656 	0.66714 
4  	31    	0.657995	0.643608	0.66714 
5  	32    	0.665814	0.647245	0.66714 
Best hiperparameters:
n_estimators: 49
max_depth: 43
min_samples_split: 5
min_samples_leaf: 3
bootstrap: False
criterion: gini
DEAP - Execution time for F1 Score: 348.19 seconds
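
The `nevals` column stays near 35 after generation 0 because `eaMuPlusLambda` creates `lambda_=50` offspring per generation. Each offspring is produced by crossover (probability `cxpb=0.5`), by mutation (`mutpb=0.2`), or by plain reproduction (the remaining 0.3), and only crossed or mutated offspring need a new fitness evaluation. Generation 0 evaluates the whole initial population of 100. The simulation below reproduces that choice without DEAP; it is a sketch of the mechanism, not the library's code.

```python
import random

random.seed(0)

LAMBDA, CXPB, MUTPB = 50, 0.5, 0.2  # values used in train_model_with_ga


def simulated_nevals():
    """Count offspring that would need re-evaluation in one generation."""
    count = 0
    for _ in range(LAMBDA):
        # With probability cxpb + mutpb the offspring is modified and its fitness
        # is invalidated; otherwise it is a copy of a parent and keeps its fitness.
        if random.random() < CXPB + MUTPB:
            count += 1
    return count


generations = [simulated_nevals() for _ in range(1000)]
mean_nevals = sum(generations) / len(generations)
print(f"expected: {LAMBDA * (CXPB + MUTPB):.0f}, simulated mean: {mean_nevals:.1f}")
```

The logged values for generations 1 to 5 (29 to 37) are consistent with this expectation of 50 × 0.7 = 35 evaluations per generation.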


### Training the Random Forest model - hyperparameter optimization with the Optuna library

In [9]:
def print_best_optuna_result(model):
    best_trial = model.get_optuna_best_result()
    print("Best hiperparameters:")
    for key, value in best_trial.params.items():
        print(f"  {key}: {value}")
    print("Best accuracy for this hiperparameters: " + str(best_trial.value))


print("Optuna - Accuracy: \n")
start_time = time.time()
RF_model_accuracy_optimizing_optuna = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_accuracy_optimizing_optuna.train_model_with_optuna(optimize_f1_score=False)
print_best_optuna_result(RF_model_accuracy_optimizing_optuna)
end_time = time.time()
print(f"Optuna - Execution time for Accuracy: {format(end_time - start_time, '.2f')} seconds")

print("-" * 50)

print("Optuna - F1 Score: \n")
start_time = time.time()
RF_model_f1score_optimizing_optuna = RandomForestClassifierWrapper(train_std_interpolate, test_std_interpolate)
RF_model_f1score_optimizing_optuna.train_model_with_optuna(optimize_f1_score=True)
print_best_optuna_result(RF_model_f1score_optimizing_optuna)
end_time = time.time()
print(f"Optuna - Execution time for F1 Score: {format(end_time - start_time, '.2f')} seconds")

[I 2024-06-06 00:11:04,398] A new study created in memory with name: no-name-542915c1-b83f-44de-93e9-4d0763ed38a4


Optuna - Accuracy: 



[I 2024-06-06 00:11:04,930] Trial 17 finished with value: 0.6620795107033639 and parameters: {'n_estimators': 10, 'max_depth': 7, 'min_samples_split': 3, 'min_samples_leaf': 5, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 17 with value: 0.6620795107033639.
[I 2024-06-06 00:11:04,976] Trial 12 finished with value: 0.6743119266055045 and parameters: {'n_estimators': 10, 'max_depth': 11, 'min_samples_split': 9, 'min_samples_leaf': 5, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 12 with value: 0.6743119266055045.
[I 2024-06-06 00:11:05,076] Trial 14 finished with value: 0.6773700305810397 and parameters: {'n_estimators': 11, 'max_depth': 43, 'min_samples_split': 6, 'min_samples_leaf': 3, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 14 with value: 0.6773700305810397.
[I 2024-06-06 00:11:05,126] Trial 5 finished with value: 0.6697247706422018 and parameters: {'n_estimators': 13, 'max_depth': 35, 'min_samples_split': 7, 'min_samples_leaf': 10, 'bootstrap': 

Best hiperparameters:
  n_estimators: 57
  max_depth: 30
  min_samples_split: 2
  min_samples_leaf: 1
  bootstrap: False
  criterion: gini
Best accuracy for this hiperparameters: 0.7033639143730887
Optuna - Execution time for Accuracy: 21.32 seconds
--------------------------------------------------
Optuna - F1 Score: 



[I 2024-06-06 00:11:26,520] Trial 15 finished with value: 0.5990123918428845 and parameters: {'n_estimators': 21, 'max_depth': 30, 'min_samples_split': 8, 'min_samples_leaf': 6, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 15 with value: 0.5990123918428845.
[I 2024-06-06 00:11:26,632] Trial 16 finished with value: 0.4220279007195223 and parameters: {'n_estimators': 75, 'max_depth': 2, 'min_samples_split': 3, 'min_samples_leaf': 9, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 15 with value: 0.5990123918428845.
[I 2024-06-06 00:11:26,827] Trial 2 finished with value: 0.6142562231712779 and parameters: {'n_estimators': 27, 'max_depth': 30, 'min_samples_split': 7, 'min_samples_leaf': 5, 'bootstrap': True, 'criterion': 'entropy'}. Best is trial 2 with value: 0.6142562231712779.
[I 2024-06-06 00:11:27,090] Trial 22 finished with value: 0.4220279007195223 and parameters: {'n_estimators': 39, 'max_depth': 2, 'min_samples_split': 7, 'min_samples_leaf': 1, 'bootstrap': True, 

Best hiperparameters:
  n_estimators: 133
  max_depth: 27
  min_samples_split: 8
  min_samples_leaf: 1
  bootstrap: False
  criterion: entropy
Best accuracy for this hiperparameters: 0.6580081960215145
Optuna - Execution time for F1 Score: 21.82 seconds
