# Regression

This tutorial shows how to use the `PipegenieRegressor` class to build a model for regression tasks.

First, we need to load the dataset. This tutorial uses the `diabetes` dataset from the `sklearn` package, a small dataset with 442 samples and 10 features.

In [6]:
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)

print(diabetes.data.head())

X = diabetes.data
y = diabetes.target

        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  
0 -0.002592  0.019907 -0.017646  
1 -0.039493 -0.068332 -0.092204  
2 -0.002592  0.002861 -0.025930  
3  0.034309  0.022688 -0.009362  
4 -0.002592 -0.031988 -0.046641  
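As a quick sanity check on the description above, the following sketch confirms the dataset dimensions and shows why the previewed values are so small: with the default `scaled=True`, `sklearn` returns each feature column mean-centered and scaled so that its sum of squares is about 1. The names `X_arr` and `y_arr` are illustrative only and are not used elsewhere in this tutorial.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load features and target as plain NumPy arrays (scaled=True is the default).
X_arr, y_arr = load_diabetes(return_X_y=True)

n_samples, n_features = X_arr.shape
print(f"samples: {n_samples}, features: {n_features}")

# Each feature column is mean-centered and scaled so that its
# sum of squares is approximately 1, hence the small magnitudes.
col_means = X_arr.mean(axis=0)
col_sum_squares = (X_arr ** 2).sum(axis=0)
print("columns centered:", np.allclose(col_means, 0.0, atol=1e-6))
print("unit sum of squares:", np.allclose(col_sum_squares, 1.0, atol=1e-6))
```

The target `y` is left in its original units, which is why the error values reported later in this tutorial are much larger than the feature values.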


To check how well the model generalizes, we split the dataset into training and test sets, using 75% of the data for training and 25% for testing.

In [7]:
from pipegenie.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=9)
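To see what a 75/25 split means in terms of sample counts, here is a minimal sketch that uses `sklearn.model_selection.train_test_split` as a stand-in. It assumes that the `pipegenie` function behaves the same way for these arguments, which this tutorial does not confirm. The names `X_all`, `X_tr`, `X_te`, and so on are illustrative only.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split as sk_train_test_split

X_all, y_all = load_diabetes(return_X_y=True)

# Assumption: sklearn's splitter is used here only to illustrate the sizes;
# the tutorial itself uses pipegenie.model_selection.train_test_split.
X_tr, X_te, y_tr, y_te = sk_train_test_split(
    X_all, y_all, test_size=0.25, random_state=9
)

# 442 * 0.25 = 110.5; sklearn rounds the test size up.
print(f"train samples: {len(X_tr)}, test samples: {len(X_te)}")
```

With `sklearn`, the 442 samples end up as 331 training samples and 111 test samples, so the split is only approximately 75/25.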

Once we have the data, we can create a model using the `PipegenieRegressor` class. It requires the path to the grammar file, and it also accepts other parameters such as the number of generations, the population size, the genetic operators to apply, and a seed for reproducibility. Keep in mind that the default fitness function is the root mean squared error, which must be minimized, so `maximization` has to be set to `False` for the search to optimize in the right direction. See the class documentation for the full list of parameters.

In [None]:
from pipegenie.regression import PipegenieRegressor
from pipegenie.evolutionary.crossover import MultiCrossover
from pipegenie.evolutionary.mutation import MultiMutation
from pipegenie.evolutionary.replacement import ElitistGenerationalReplacement
from pipegenie.evolutionary.selection import TournamentSelection

model = PipegenieRegressor(
    grammar="sample-grammar-regression.xml",
    generations=5,
    pop_size=50,
    elite_size=5,
    crossover=MultiCrossover(0.8),
    mutation=MultiMutation(0.5),
    selection=TournamentSelection(3),
    replacement=ElitistGenerationalReplacement(5),
    maximization=False,
    n_jobs=5,
    seed=9,
    outdir="sample-results",
)
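The reason for `maximization=False` can be illustrated with a small sketch, independent of `pipegenie`. The helper `rmse` and the three candidate "pipelines" below are hypothetical; they only show that, when fitness is an error measure, the best candidate is the one with the lowest value, and that maximizing would pick the worst one.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: lower means better predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [3.0, 5.0, 7.0]

# Hypothetical predictions from three candidate pipelines.
candidates = {
    "pipeline_a": [3.0, 5.0, 7.0],   # perfect predictions
    "pipeline_b": [4.0, 6.0, 8.0],   # every prediction off by 1
    "pipeline_c": [0.0, 5.0, 10.0],  # two predictions off by 3
}

fitness = {name: rmse(y_true, pred) for name, pred in candidates.items()}

# With an error-based fitness, the search must prefer the smallest value.
best_if_minimizing = min(fitness, key=fitness.get)
best_if_maximizing = max(fitness, key=fitness.get)

print(fitness)
print("minimizing picks:", best_if_minimizing)
print("maximizing picks:", best_if_maximizing)
```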

With the model created, we can now train it with the `fit` method, passing the features and target values of the training set.

In [9]:
model.fit(X_train, y_train)

                                       fitness                                            size                                        fitness_elite                                      size_elite              
                 ----------------------------------------------------    --------------------------------------    ----------------------------------------------------    --------------------------------------
gen    nevals    min           max           avg           std           min    max    avg           std           min           max           avg           std           min    max    avg           std       
0      50        52.129        1.1564e+28    3.0432e+26    1.8511e+27    1      6      2.88          1.3363        52.129        52.396        52.328        0.10354       1      3      2.2           0.9798    
1      40        52.129        75.473        55.035        4.719         1      5      2.3043        1.1955        52.129        52.396        52.328        0.1

<pipegenie.regression.pipegenie_regressor.PipegenieRegressor at 0x78b2a5d07dc0>

After training, we can evaluate the model's performance using either the `score` method or any other metric. The `score` method returns the R² score of the model on the given features and target values. Alternatively, you can use the `predict` method to obtain predictions for the given features and then compute other metrics from them, whether provided by the `pipegenie` package, the `sklearn` package, or any other source.

In [10]:
print(f"Model score: {model.score(X_test, y_test)}")

from pipegenie.metrics import r2_score, root_mean_squared_error

y_pred = model.predict(X_test)

print(f"R2 score: {r2_score(y_test, y_pred)}") # Should be the same as model.score(X_test, y_test)
print(f"RMSE: {root_mean_squared_error(y_test, y_pred)}")

Model score: 0.4291762507492479
R2 score: 0.4291762507492479
RMSE: 61.085220012340514
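To make the two reported metrics concrete, here is a minimal NumPy sketch of their standard definitions on a toy example. It assumes that the `pipegenie` metrics follow these textbook formulas, which this tutorial does not state explicitly; the function names and toy values are illustrative only.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (1.0 is a perfect fit)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 3.0, 3.5]

r2 = r2_by_hand(toy_true, toy_pred)
rmse = rmse_by_hand(toy_true, toy_pred)

# The two metrics are linked: R^2 = 1 - RMSE^2 / Var(y), using the
# population variance (ddof=0) of the true values.
r2_from_rmse = 1.0 - rmse ** 2 / np.var(toy_true)

print(f"R2: {r2:.4f}, RMSE: {rmse:.4f}, R2 from RMSE: {r2_from_rmse:.4f}")
```

Because of this link, RMSE is easiest to interpret relative to the spread of the target: the same RMSE corresponds to a higher R² when the target values vary more.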
