# Classification

This tutorial shows how to use the `PipegenieClassifier` class to build a model for classification tasks.

First, we need to load a dataset. This tutorial uses the `iris` dataset from the `sklearn` package, a simple dataset with 150 samples, 4 features and 3 classes.

In [1]:
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

print(iris.data.head())
print(iris.target_names)

X = iris.data
y = iris.target

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
['setosa' 'versicolor' 'virginica']


To check how well the model generalizes, we split the dataset into training and test sets, using 75% of the data for training and 25% for testing.

In [2]:
from pipegenie.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
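
To make the 75/25 split concrete, here is a minimal sketch of a seeded, shuffled hold-out split written with the standard library only. It is not pipegenie's implementation: `shuffled_split` is a hypothetical helper, and rounding the test share up is an assumption borrowed from scikit-learn's convention.

```python
import math
import random


def shuffled_split(n_samples, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle, and so the split, repeatable.
    random.Random(seed).shuffle(indices)
    # Assumption: round the test share up (150 * 0.25 = 37.5 -> 38).
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(150)
print(len(train_idx), len(test_idx))  # 112 38
```

Calling the helper twice with the same seed returns the same indices, which is the role `random_state=42` plays in the cell above.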

Once we have the data, we can create a model using the `PipegenieClassifier` class. `PipegenieClassifier` requires the path to the grammar file, and it also accepts other parameters such as the number of generations, the population size, the operators to apply and a seed for reproducibility. For the full list of parameters, see the documentation of the class.

In [3]:
from pipegenie.classification import PipegenieClassifier
from pipegenie.evolutionary.crossover import MultiCrossover
from pipegenie.evolutionary.mutation import MultiMutation
from pipegenie.evolutionary.replacement import ElitistGenerationalReplacement
from pipegenie.evolutionary.selection import TournamentSelection

model = PipegenieClassifier(
    # Check the folder `/tutorials/` to access grammar XML files
    grammar="sample-grammar-classification.xml",
    generations=5,
    pop_size=50,
    elite_size=5,
    crossover=MultiCrossover(0.8),
    mutation=MultiMutation(0.5),
    selection=TournamentSelection(3),
    replacement=ElitistGenerationalReplacement(5),
    n_jobs=5,
    seed=42,
    outdir="sample-results",
)
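
The `selection=TournamentSelection(3)` argument presumably configures tournament selection with a tournament size of 3; that reading is an assumption, since the tutorial does not spell it out. The following is a generic sketch of the technique, not pipegenie's implementation. `tournament_select` and the fitness values are made up for illustration, and drawing contenders without replacement is one common variant among several.

```python
import random


def tournament_select(fitnesses, tournament_size=3, rng=None):
    """Return the index of the fittest among randomly drawn contenders."""
    rng = rng or random.Random()
    # Draw distinct contenders uniformly at random from the population.
    contenders = rng.sample(range(len(fitnesses)), tournament_size)
    # The contender with the highest fitness wins the tournament.
    return max(contenders, key=lambda i: fitnesses[i])


rng = random.Random(42)
fitnesses = [0.33, 0.91, 0.95, 0.97, 0.88]  # made-up accuracy-like scores
parents = [tournament_select(fitnesses, 3, rng) for _ in range(4)]
print(parents)
```

A larger tournament size raises the selection pressure: with three distinct contenders drawn from these five individuals, the two weakest can never win.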

With the model created, we can now train it using the `fit` method, passing the features and labels of the training set.

In [4]:
model.fit(X_train, y_train)

                                       fitness                                            size                                        fitness_elite                                      size_elite              
                 ----------------------------------------------------    --------------------------------------    ----------------------------------------------------    --------------------------------------
gen    nevals    min           max           avg           std           min    max    avg           std           min           max           avg           std           min    max    avg           std       
0      50        0.33333       0.97143       0.88701       0.14218       1      5      2.64          1.3381        0.95238       0.97143       0.95619       0.007619      2      3      2.2           0.4       
1      45        0.33333       0.97143       0.91138       0.10502       1      5      2.7021        1.2702        0.95238       0.97143       0.9581        0.0

<pipegenie.classification.pipegenie_classifier.PipegenieClassifier at 0x7345ca5118a0>

After training, we can evaluate the model with the `score` method, which returns the accuracy of the model on the given features and labels. Alternatively, you can use the `predict` method to get the predicted labels, or the `predict_proba` method to get the probability of each class, and then pass those predictions to metrics from the `pipegenie` package, the `sklearn` package, or any other metric you want.

In [5]:
from pipegenie.metrics import accuracy_score, balanced_accuracy_score

print(f"Model score: {model.score(X_test, y_test)}")

y_pred = model.predict(X_test)

# Accuracy should be the same as model.score(X_test, y_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred)}")

Model score: 1.0
Accuracy: 1.0
Balanced accuracy: 1.0
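
Both metrics are 1.0 here, so the difference between them is not visible. As a minimal sketch, assuming the usual definition of balanced accuracy as the unweighted mean of per-class recall, the toy example below shows how the two can diverge on imbalanced labels. The helper functions and labels are illustrative and are not taken from `pipegenie.metrics`.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Imbalanced toy labels: eight samples of class 0, two of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a model that always predicts the majority class
print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```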


The [ensemble.txt](sample-results/ensemble.txt) output file lists the pipelines contained in the final ensemble model. Keep in mind that these pipelines may not be strictly sorted by fitness. This is because of the diversity weight, which favors pipelines whose predictions differ from those of the other pipelines in the ensemble. This can improve the generalization of the model on difficult datasets, but on simpler datasets, like the `iris` dataset, it may not be necessary. It can even happen that a pipeline with lower fitness is included precisely because its wrong predictions make it different from the others, which decreases the overall performance of the ensemble. In that case, you can try decreasing the diversity weight parameter, or set it to zero to avoid this behavior.
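
The toy example below illustrates how a diversity bonus can reorder pipelines. It is only a sketch under stated assumptions: the scoring rule (accuracy plus `diversity_weight` times the mean disagreement with the other pipelines) is hypothetical and is not pipegenie's actual formula, and the pipelines `A`, `B` and `C` are made-up prediction lists.

```python
def disagreement(preds, others):
    """Mean fraction of samples on which `preds` differs from each other pipeline."""
    rates = [sum(a != b for a, b in zip(preds, o)) / len(preds) for o in others]
    return sum(rates) / len(rates)


def rank(pipelines, y_true, diversity_weight):
    """Sort pipeline names by accuracy plus a weighted diversity bonus (hypothetical rule)."""
    scores = {}
    for name, preds in pipelines.items():
        fitness = sum(p == t for p, t in zip(preds, y_true)) / len(y_true)
        others = [o for n, o in pipelines.items() if n != name]
        scores[name] = fitness + diversity_weight * disagreement(preds, others)
    return sorted(scores, key=scores.get, reverse=True)


y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
pipelines = {
    "A": [1, 0, 0, 0, 1, 1, 1, 2, 2, 2],  # one error
    "B": [1, 1, 0, 0, 1, 1, 1, 2, 2, 2],  # two errors, close to A
    "C": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # three errors, unlike A and B
}
print(rank(pipelines, y_true, diversity_weight=0.0))  # ['A', 'B', 'C']
print(rank(pipelines, y_true, diversity_weight=0.8))  # ['A', 'C', 'B']
```

With a weight of zero the order follows fitness alone. With a positive weight, `C` moves ahead of `B` despite its lower accuracy, because its errors fall on different samples.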

The [best_pipeline.txt](sample-results/best_pipeline.txt) output file, in turn, contains the best pipeline found during the search process. This is the pipeline with the highest fitness value. It is always part of the ensemble model, although not necessarily in the first position, as mentioned before.
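
Both files are plain text, so they can be inspected directly from Python. The sketch below assumes the `outdir="sample-results"` value used earlier; `read_result` is a hypothetical helper, and nothing is assumed about the layout of the file contents.

```python
from pathlib import Path


def read_result(outdir, filename):
    """Return the text of an output file, or None if it does not exist."""
    path = Path(outdir) / filename
    return path.read_text(encoding="utf-8") if path.is_file() else None


for name in ("best_pipeline.txt", "ensemble.txt"):
    text = read_result("sample-results", name)
    print(f"--- {name} ---")
    print(text if text is not None else "(not found; run model.fit first)")
```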