# Evaluate metrics and record fit and score times

#### Authors:

* Juan Carlos Alfaro Jiménez

In this notebook, we evaluate the performance of an estimator by cross-validation. Below, we detail the dataset used, the estimator tested and the results (in terms of score and time).

## 1. Data

First, we fetch the data to fit (`X`) and the target rankings to try to predict (`Y`) from the [`OpenML` repository](https://www.openml.org/u/25829/data) using the identifier of the dataset (`data_id`):

---

**Note:** All the parameters are provided by `guildai` via environment variables with the prefix `FLAG_`.

---
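
The flag-reading pattern used throughout this notebook (read a string from the environment, then cast it) can be sketched with a small helper. This is only an illustration: `read_flag` is a hypothetical function that is not part of `guildai`, and the variables are set by hand here to simulate what the experiment runner would export.

```python
import os


def read_flag(name, cast=str, default=None):
    """Read the environment variable FLAG_<NAME> and convert it with `cast`.

    Hypothetical helper, shown for illustration only.
    """
    key = f"FLAG_{name.upper()}"
    raw = os.environ.get(key)

    if raw is None:
        # Fall back to the default value, or fail loudly if there is none
        if default is None:
            raise KeyError(f"Missing flag: {key}")
        return default

    # Environment variables are always strings, so an explicit cast is needed
    return cast(raw)


# Simulate the variables that the experiment runner would export
os.environ["FLAG_N_SPLITS"] = "5"
os.environ["FLAG_PROBABILITY"] = "0.3"

n_splits_demo = read_flag("n_splits", cast=int)
probability_demo = read_flag("probability", cast=float)
```

Failing on a missing flag, instead of silently returning `None`, surfaces configuration errors before any data is fetched.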

In [None]:
import os

In [None]:
data_id = os.environ.get("FLAG_DATA_ID")

In [None]:
data_id = int(data_id)

In [None]:
from sklearn.datasets import fetch_openml

In [None]:
data = fetch_openml(data_id=data_id)

In [None]:
X, Y = data["data"], data["target"]

In [None]:
Y = Y.astype(float)

Let us print the name of the dataset:

In [None]:
name = data["details"]["name"]

In [None]:
print(f"The {name} dataset will be used.")

## 2. Estimator

Second, we import the estimator object (`estimator`) used to fit the data, selected according to the model (`module`) and the estimator type (`estimator_type`):

In [None]:
module = os.environ.get("GUILD_OP").split(":")[0]

In [None]:
estimator_type = os.environ.get("FLAG_ESTIMATOR_TYPE")

In [None]:
from ipynb.fs.full.estimators import get_estimator

In [None]:
estimator = get_estimator(module, estimator_type)

Let us print some information about the estimator:

In [None]:
print(f"The {module} {estimator_type} will be tested.")

Then, we initialize the sampler object (`sampler`) that deletes labels from the training dataset. Before that, we need the probability of deleting a label (`probability`) and the random number generator (`rng`):

---

**Note**: We define the `RandomState` instance outside of the function so that a different set of labels is deleted on each fold of the cross-validation.

---

In [None]:
probability = os.environ.get("FLAG_PROBABILITY")

In [None]:
probability = float(probability)

In [None]:
random_state = os.environ.get("FLAG_RANDOM_STATE")

In [None]:
random_state = int(random_state)

In [None]:
from sklearn.utils import check_random_state

In [None]:
rng = check_random_state(random_state)
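
To illustrate why the generator is created once, outside of the function, the following minimal sketch compares a shared `RandomState` with one that is re-seeded on every call. The helper `draw_mask` and the seed `0` are assumptions made for this example only.

```python
import numpy as np


def draw_mask(rng, shape, probability):
    """Draw a boolean mask where True occurs with the given probability."""
    return rng.choice([False, True], shape, p=[1 - probability, probability])


# Shared generator: its internal state advances between calls,
# so consecutive masks are different
shared_rng = np.random.RandomState(0)
first = draw_mask(shared_rng, (100, 5), 0.5)
second = draw_mask(shared_rng, (100, 5), 0.5)

# Re-seeded generator: every call restarts from the same state,
# so the same mask is drawn again and again
reseeded_first = draw_mask(np.random.RandomState(0), (100, 5), 0.5)
reseeded_second = draw_mask(np.random.RandomState(0), (100, 5), 0.5)

print((first != second).any())
print((reseeded_first == reseeded_second).all())
```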

Let us print the probability:

In [None]:
print(f"The probability for the deletion of a label is {probability}.")

Now, we define the function (`func`) used to delete labels. In particular, the following steps are required:

1. Generate a random sample of boolean values, where `True` is drawn with the given probability.
2. Replace with `NaN` the values where the random sample is `True`, that is, keep only the values where the negated sample (`~`) is `True`.
3. Compute dense ranks row-wise so that the remaining labels of each sample still form a valid ranking.
4. Fill `NaN` values with `-1` to encode the deleted labels.
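
The steps above can be traced on a toy example. The sketch below uses a fixed, hand-written mask instead of a random one so that the result is easy to follow. The names `Y_toy` and `mask` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy target rankings: rows are samples, columns are labels
Y_toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0]], columns=["a", "b", "c"]
)

# Step 1 (fixed here instead of random): True marks a label to delete
mask = np.array([[False, True, False], [False, False, True]])

# Step 2: keep the values where the negated mask is True, set the rest to NaN
masked = Y_toy.where(~mask, np.nan)

# Step 3: re-rank the remaining labels within each row
ranked = masked.rank(method="dense", axis=1)

# Step 4: encode the deleted labels with -1
encoded = ranked.fillna(-1)

print(encoded)
```

In the first row, deleting label `b` turns the ranking `1, 2, 3` into `1, -1, 2`: the remaining labels are re-ranked so that no gap is left.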

In [None]:
a = [False, True]

In [None]:
p = [1 - probability, probability]

In [None]:
import numpy as np

In [None]:
def func(X, Y):
    return X, Y.where(~rng.choice(a, Y.shape, p=p), np.nan).rank(method="dense", axis=1).fillna(-1)

Finally, we initialize the sampler object:

In [None]:
from imblearn import FunctionSampler

In [None]:
sampler = FunctionSampler(func=func, validate=False)

And combine the sampler and the estimator in a pipeline:

In [None]:
from imblearn.pipeline import make_pipeline

In [None]:
estimator = make_pipeline(sampler, estimator)

## 3. Cross-validation strategy

Third, we define the strategy to evaluate the performance of the cross-validated estimator on the test dataset. In particular, we use a $ r $ (`n_repeats`) $ \times $ $ k $ (`n_splits`) cross-validation method:

In [None]:
n_splits = os.environ.get("FLAG_N_SPLITS")

In [None]:
n_splits = int(n_splits)

In [None]:
n_repeats = os.environ.get("FLAG_N_REPEATS")

In [None]:
n_repeats = int(n_repeats)

In [None]:
from sklearn.model_selection import RepeatedKFold

In [None]:
cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=random_state)
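
As a quick check of what an $ r \times k $ strategy produces, the following sketch counts the splits generated by `RepeatedKFold` on a toy array. The values `n_splits=3`, `n_repeats=2` and the array `X_toy` are arbitrary choices for this example.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy data with 6 samples and 2 features
X_toy = np.arange(12).reshape(6, 2)

# 2 repetitions of a 3-fold cross-validation
cv_demo = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)

# Each split is a pair (train indices, test indices)
splits = list(cv_demo.split(X_toy))

total = cv_demo.get_n_splits()
test_sizes = [len(test) for _, test in splits]

print(f"{total} splits with test sizes {test_sizes}")
```

Each repetition reshuffles the samples before splitting, so the estimator is fitted and scored $ r \cdot k $ times in total.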

## 4. Evaluation

Fourth, we obtain the arrays of scores and times of the estimator for each split of the cross-validation, running the splits in parallel with `n_jobs` jobs:

In [None]:
n_jobs = os.environ.get("FLAG_N_JOBS")

In [None]:
n_jobs = int(n_jobs)

In [None]:
from sklearn.model_selection import cross_validate

In [None]:
scores = cross_validate(estimator, X, Y, cv=cv, n_jobs=n_jobs, return_train_score=True, return_estimator=True)
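
To make the structure of the returned object concrete, here is a minimal sketch that runs `cross_validate` with the same options on synthetic data. The `DummyRegressor` and the arrays `X_toy` and `y_toy` are stand-ins chosen for illustration, not the estimator or data of this experiment.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic data: 10 samples, 2 features
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

result = cross_validate(
    DummyRegressor(),
    X_toy,
    y_toy,
    cv=KFold(n_splits=5),
    return_train_score=True,
    return_estimator=True,
)

# The result is a dictionary with one entry per recorded quantity,
# each holding one value per cross-validation split
print(sorted(result))
```

The `estimator` entry holds the fitted estimator of each split, which is why it is separated from the numeric columns before building the final table.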

And extract information from the estimators for each cross-validation split:

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(scores)

In [None]:
df = df.drop("estimator", axis=1)

In [None]:
estimators = scores["estimator"]

In [None]:
from ipynb.fs.full.estimators import get_information

In [None]:
information = get_information(module, estimators)

In [None]:
objs = [df, information]

In [None]:
scores = pd.concat(objs, axis=1)

Now, we store the scores in a `csv` file:

In [None]:
scores.to_csv("scores.csv", index=False)

And save the estimators in a `tar.gz` file:

In [None]:
import tarfile

In [None]:
tar = tarfile.open("estimators.tar.gz", "w:gz")

In [None]:
from joblib import dump

In [None]:
for index, estimator in enumerate(estimators):
    # Define the path of the estimator file
    file = f"estimator_{index}.joblib"

    # Persist the estimator in the file
    dump(estimator, file)

    # Add the estimator file to the archive
    tar.add(file)

    # Remove the temporary estimator file
    os.remove(file)

In [None]:
tar.close()
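
An alternative way to write the archive is to open it in a `with` block, which closes it automatically even if an error occurs midway. The sketch below uses only the standard library: `pickle` replaces `joblib`, and plain dictionaries stand in for the fitted estimators. Both are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tarfile
import tempfile

# Stand-ins for the fitted estimators (plain dictionaries, for illustration)
objects = [{"fold": 0}, {"fold": 1}]

with tempfile.TemporaryDirectory() as directory:
    archive = os.path.join(directory, "objects.tar.gz")

    # The context manager closes the archive when the block ends
    with tarfile.open(archive, "w:gz") as tar_demo:
        for index, obj in enumerate(objects):
            path = os.path.join(directory, f"object_{index}.pkl")

            # Persist the object in a temporary file
            with open(path, "wb") as handle:
                pickle.dump(obj, handle)

            # Store only the file name in the archive, not the full path
            tar_demo.add(path, arcname=os.path.basename(path))

            # Remove the temporary file
            os.remove(path)

    # Reopen the archive to list its members
    with tarfile.open(archive, "r:gz") as tar_demo:
        members = tar_demo.getnames()

print(members)
```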

## 5. Results

Finally, we show the results, that is, the test score:

In [None]:
test_score = scores["test_score"].mean(axis=0)

In [None]:
print(f"test_score: {test_score}")

The train score:

In [None]:
train_score = scores["train_score"].mean(axis=0)

In [None]:
print(f"train_score: {train_score}")

The time for fitting the estimator:

In [None]:
fit_time = scores["fit_time"].mean(axis=0)

In [None]:
print(f"fit_time: {fit_time}")

And the time for scoring the estimator:

In [None]:
score_time = scores["score_time"].mean(axis=0)

In [None]:
print(f"score_time: {score_time}")
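
The four averages above can also be computed in a single call, together with their standard deviations. The following sketch assumes a table with the same column names; the numbers in `scores_demo` are invented for illustration and are not results of this experiment.

```python
import pandas as pd

# Invented values for two cross-validation splits
scores_demo = pd.DataFrame(
    {
        "fit_time": [0.10, 0.30],
        "score_time": [0.01, 0.03],
        "test_score": [0.80, 0.60],
        "train_score": [0.90, 0.70],
    }
)

# One row per statistic, one column per recorded quantity
summary = scores_demo.agg(["mean", "std"])

print(summary)
```

Reporting the standard deviation next to the mean shows how much the scores vary across splits.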