## Results manipulation with remayn

### 1. Running some experiments with GridSearchCV and saving the results

A Logistic Regression model and a Ridge Classifier are each trained with a grid-search cross-validation procedure (`GridSearchCV`). The results of every run are then saved, together with the best hyperparameters found and the best fitted model.

In [1]:
from remayn.result import make_result
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
import time
from shutil import rmtree
from remayn.result_set import ResultFolder

In [2]:
# Clean up the results folder if exists
rmtree('./results', ignore_errors=True)

# Repeat the experiment 10 times with different random seeds
for seed in range(10):
    for model, param_grid in [(LogisticRegression, {'C': [0.1, 1, 10], 'max_iter': [50, 100, 150]}),
                              (RidgeClassifier, {'alpha': [0.1, 1, 10], 'max_iter': [50, 100, 150]})]:
        # Generate a sample dataset
        X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=2, n_clusters_per_class=2, random_state=0)

        # Split the dataset into training and test sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

        # Train the model
        start_time = time.time()

        gs = GridSearchCV(model(), param_grid=param_grid, cv=5)
        gs.fit(X_train, y_train)

        train_time = time.time() - start_time

        # Make predictions
        y_train_pred = gs.predict(X_train)
        y_test_pred = gs.predict(X_test)

        # Prepare estimator config that is going to be saved
        estimator_config = gs.get_params()
        # Remove the 'estimator' key from the config, as it is not serializable
        estimator_config.pop('estimator')

        # Create a dictionary that represents the config of this experiment.
        # Any information relevant for the experiment can be included here.
        # In this case, all the hyperparameters of the estimator are included.
        experiment_config = {
            "estimator_config": estimator_config,
            "estimator_name": model.__name__,
            "seed": seed,
        }

        # Save the results of the experiment
        make_result(
            base_path='./results',
            config=experiment_config,
            targets=y_test,
            predictions=y_test_pred,
            train_targets=y_train,
            train_predictions=y_train_pred,
            time=train_time,

            # Save the best hyperparameters and the best model
            best_params=gs.best_params_,
            best_model=gs.best_estimator_
        ).save()
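
The cell above drops the `estimator` entry by hand because it holds a live estimator object. As a minimal sketch of a more general approach, the hypothetical helper `json_safe_config` below (not part of remayn) keeps only the entries that the standard `json` module can encode. It assumes the config ends up stored as JSON, which the `.json` info paths printed later in this notebook suggest but do not prove.

```python
import json


def json_safe_config(config):
    """Return a copy of `config` keeping only JSON-encodable entries.

    Hypothetical helper for illustration; it is not part of remayn.
    """
    safe = {}
    for key, value in config.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            # Skip values such as estimator objects that json cannot encode
            continue
        safe[key] = value
    return safe


class DummyEstimator:
    """Stand-in for a scikit-learn estimator object."""


raw_config = {"cv": 5, "param_grid": {"C": [0.1, 1, 10]}, "estimator": DummyEstimator()}
clean_config = json_safe_config(raw_config)
print(sorted(clean_config))  # ['cv', 'param_grid']
```

In the experiment loop, such a helper could be applied to `gs.get_params()` instead of popping individual keys, at the cost of silently discarding anything that cannot be encoded.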

### 2. Loading the results folder and visualizing the results

In [3]:
# Load the results from the folder
rf = ResultFolder('./results')
print(rf)

# Iterate over the results and print them
for i, result in enumerate(rf):
    print(result)

    # Print only the first 3 results
    if i == 2:
        break

ResultSet with 20 results
Config: {
    "estimator_config": {
        "cv": 5,
        "error_score": NaN,
        "estimator__C": 1.0,
        "estimator__class_weight": null,
        "estimator__dual": false,
        "estimator__fit_intercept": true,
        "estimator__intercept_scaling": 1,
        "estimator__l1_ratio": null,
        "estimator__max_iter": 100,
        "estimator__multi_class": "auto",
        "estimator__n_jobs": null,
        "estimator__penalty": "l2",
        "estimator__random_state": null,
        "estimator__solver": "lbfgs",
        "estimator__tol": 0.0001,
        "estimator__verbose": 0,
        "estimator__warm_start": false,
        "n_jobs": null,
        "param_grid": {
            "C": [
                0.1,
                1,
                10
            ],
            "max_iter": [
                50,
                100,
                150
            ]
        },
        "pre_dispatch": "2*n_jobs",
        "refit": true,
        "return_trai

### 3. Deleting a specific experiment

If we want to remove a specific experiment, for example to repeat it, we can find it and delete it using remayn functions. The directory where the results are stored should not be manipulated manually.

In [4]:
def filter_fn(result):
    return result.config['estimator_name'] == 'LogisticRegression' and result.config['seed'] == 0

# Filter the results
filtered_results = rf.filter(filter_fn)
print(filtered_results)

for result in filtered_results:
    print(f"Deleting the result {result}")
    # Delete the result from disk
    result.delete()

ResultSet with 1 result
Deleting the result Config: {
    "estimator_config": {
        "cv": 5,
        "error_score": NaN,
        "estimator__C": 1.0,
        "estimator__class_weight": null,
        "estimator__dual": false,
        "estimator__fit_intercept": true,
        "estimator__intercept_scaling": 1,
        "estimator__l1_ratio": null,
        "estimator__max_iter": 100,
        "estimator__multi_class": "auto",
        "estimator__n_jobs": null,
        "estimator__penalty": "l2",
        "estimator__random_state": null,
        "estimator__solver": "lbfgs",
        "estimator__tol": 0.0001,
        "estimator__verbose": 0,
        "estimator__warm_start": false,
        "n_jobs": null,
        "param_grid": {
            "C": [
                0.1,
                1,
                10
            ],
            "max_iter": [
                50,
                100,
                150
            ]
        },
        "pre_dispatch": "2*n_jobs",
        "refit": true,
  

Now we can load the result folder again and check that the result has been removed:

In [5]:
rf = ResultFolder('./results')
print(rf)

# Filtered results should be empty now
filtered_results = rf.filter(filter_fn)
print(filtered_results)

ResultSet with 19 results
ResultSet with 0 result
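
The filter function in this section hard-codes the estimator name and the seed. As a sketch, the hypothetical helper `config_matches` below (not part of remayn) builds such a predicate from keyword arguments. It assumes only that each result exposes a dict-like `.config`, as used in the cells above, and it is exercised here with stand-in objects rather than real results.

```python
from types import SimpleNamespace


def config_matches(**expected):
    """Build a predicate that compares `result.config` against `expected`.

    Hypothetical helper for illustration; it is not part of remayn.
    """
    def predicate(result):
        return all(result.config.get(key) == value for key, value in expected.items())
    return predicate


# Stand-in objects that only mimic the `.config` attribute used above
fake_results = [
    SimpleNamespace(config={"estimator_name": "LogisticRegression", "seed": 0}),
    SimpleNamespace(config={"estimator_name": "LogisticRegression", "seed": 1}),
    SimpleNamespace(config={"estimator_name": "RidgeClassifier", "seed": 0}),
]

select_fn = config_matches(estimator_name="LogisticRegression", seed=0)
matching = [r for r in fake_results if select_fn(r)]
print(len(matching))  # 1
```

The returned function could then be passed to `rf.filter(...)` in place of a hand-written `filter_fn`.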


### 4. Copying some experiments to a different path

In this example, we will copy all the RidgeClassifier experiments to a new directory named `results_ridge`.

In [6]:
def filter_fn(result):
    return result.config['estimator_name'] == 'RidgeClassifier'

# Remove the results in the new directory if any exists
rmtree('./results_ridge', ignore_errors=True)

for result in rf.filter(filter_fn):
    print(f'Moving the result {result}')

    # Copy the result to a different folder
    new_result = result.copy_to('./results_ridge')

    # The new result will show the new path
    print(new_result)

Moving the result Config: {
    "estimator_config": {
        "cv": 5,
        "error_score": NaN,
        "estimator__alpha": 1.0,
        "estimator__class_weight": null,
        "estimator__copy_X": true,
        "estimator__fit_intercept": true,
        "estimator__max_iter": null,
        "estimator__positive": false,
        "estimator__random_state": null,
        "estimator__solver": "auto",
        "estimator__tol": 0.0001,
        "n_jobs": null,
        "param_grid": {
            "alpha": [
                0.1,
                1,
                10
            ],
            "max_iter": [
                50,
                100,
                150
            ]
        },
        "pre_dispatch": "2*n_jobs",
        "refit": true,
        "return_train_score": false,
        "scoring": null,
        "verbose": 0
    },
    "estimator_name": "RidgeClassifier",
    "seed": 9
}
Results info path: results/d0496ae2-2a0b-4002-b2fe-20513ce89633.json (data not loaded)

Config: {
  

Now we can load the results in the new directory:

In [7]:
rf_ridge = ResultFolder('./results_ridge')
print(rf_ridge)

ResultSet with 10 results
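
A quick way to confirm what a folder contains is to count the results per estimator. The sketch below uses a hypothetical helper, `count_by_config` (not part of remayn), that relies only on iterating over results and reading `result.config`, both of which appear in the cells above. Stand-in objects are used so that the example runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_config(results, key):
    """Count results per value of `result.config[key]`.

    Hypothetical helper; `results` is any iterable of objects with a `.config` dict.
    """
    return Counter(result.config[key] for result in results)


# Stand-in objects: three RidgeClassifier runs and one LogisticRegression run
fake_results = [
    SimpleNamespace(config={"estimator_name": "RidgeClassifier", "seed": seed})
    for seed in range(3)
]
fake_results.append(SimpleNamespace(config={"estimator_name": "LogisticRegression", "seed": 0}))

counts = count_by_config(fake_results, "estimator_name")
print(counts["RidgeClassifier"], counts["LogisticRegression"])  # 3 1
```

Calling it as `count_by_config(rf_ridge, 'estimator_name')` should, under these assumptions, report only RidgeClassifier entries for the copied folder.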


### 5. Moving the results to a new location

Moving the results consists of copying them and then deleting them from the original location. We will repeat the previous example but, in this case, move the results instead of copying them.

In [8]:
def filter_fn(result):
    return result.config['estimator_name'] == 'RidgeClassifier'

# Remove the results in the new directory if any exists
rmtree('./results_ridge', ignore_errors=True)

for result in rf.filter(filter_fn):
    print(f'Moving the result {result}')

    # Move the result to a different folder
    new_result = result.copy_to('./results_ridge')
    result.delete()

    # The new result will show the new path
    print(new_result)

Moving the result Config: {
    "estimator_config": {
        "cv": 5,
        "error_score": NaN,
        "estimator__alpha": 1.0,
        "estimator__class_weight": null,
        "estimator__copy_X": true,
        "estimator__fit_intercept": true,
        "estimator__max_iter": null,
        "estimator__positive": false,
        "estimator__random_state": null,
        "estimator__solver": "auto",
        "estimator__tol": 0.0001,
        "n_jobs": null,
        "param_grid": {
            "alpha": [
                0.1,
                1,
                10
            ],
            "max_iter": [
                50,
                100,
                150
            ]
        },
        "pre_dispatch": "2*n_jobs",
        "refit": true,
        "return_train_score": false,
        "scoring": null,
        "verbose": 0
    },
    "estimator_name": "RidgeClassifier",
    "seed": 9
}
Results info path: results/d0496ae2-2a0b-4002-b2fe-20513ce89633.json (data not loaded)

Config: {
  

To check that the results have been successfully moved, we will create a dataframe for each directory. In this way, we can verify that the first contains only LogisticRegression results and the second contains only RidgeClassifier results.

In [9]:
# Define a simple metrics function
def compute_metrics(targets, predictions):
    return {
        'accuracy': (targets == predictions).mean()
    }

# Create the first dataframe
# Note that we have to reload the result folder because it may still include moved/deleted results
rf = ResultFolder('./results')
df1 = rf.create_dataframe(
    config_columns=['estimator_name', 'seed'],
    metrics_fn=compute_metrics,
)

df1

|   | config_estimator_name | config_seed | accuracy | time |
|---|---|---|---|---|
| 0 | LogisticRegression | 4 | 0.93 | 0.447752 |
| 1 | LogisticRegression | 6 | 0.95 | 0.4492 |
| 2 | LogisticRegression | 9 | 0.93 | 0.461416 |
| 3 | LogisticRegression | 2 | 0.95 | 0.458361 |
| 4 | LogisticRegression | 8 | 0.925 | 0.445751 |
| 5 | LogisticRegression | 1 | 0.945 | 0.438621 |
| 6 | LogisticRegression | 7 | 0.93 | 0.440201 |
| 7 | LogisticRegression | 3 | 0.97 | 0.438959 |
| 8 | LogisticRegression | 5 | 0.945 | 0.428503 |
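
The `compute_metrics` function above returns a single entry, which shows up as the `accuracy` column. As a sketch, the hypothetical `compute_metrics_extended` below also returns balanced accuracy (the mean of the per-class recalls), using only the standard library. Whether `create_dataframe` adds one column per returned key is an assumption drawn from the `accuracy` column in the output, not something this notebook confirms.

```python
def compute_metrics_extended(targets, predictions):
    """Return accuracy and balanced accuracy for two equal-length label sequences."""
    pairs = list(zip(targets, predictions))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    # Balanced accuracy: mean of the per-class recalls
    recalls = []
    for label in sorted(set(t for t, _ in pairs)):
        hits = [t == p for t, p in pairs if t == label]
        recalls.append(sum(hits) / len(hits))
    balanced_accuracy = sum(recalls) / len(recalls)

    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy}


# Class 0 has recall 2/3 and class 1 has recall 1/1
metrics = compute_metrics_extended([0, 0, 0, 1], [0, 0, 1, 1])
print(metrics["accuracy"])  # 0.75
```

The function has the same `(targets, predictions)` signature as `compute_metrics`, so it could be passed as `metrics_fn` in the same way.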


In [10]:
# Create the second dataframe
rf_ridge = ResultFolder('./results_ridge')
df2 = rf_ridge.create_dataframe(
    config_columns=['estimator_name'],
    metrics_fn=compute_metrics,
)

df2

|   | config_estimator_name | accuracy | time |
|---|---|---|---|
| 0 | RidgeClassifier | 0.925 | 0.291094 |
| 1 | RidgeClassifier | 0.96 | 0.259894 |
| 2 | RidgeClassifier | 0.92 | 0.293992 |
| 3 | RidgeClassifier | 0.95 | 0.290282 |
| 4 | RidgeClassifier | 0.925 | 0.294039 |
| 5 | RidgeClassifier | 0.935 | 0.287419 |
| 6 | RidgeClassifier | 0.945 | 0.332724 |
| 7 | RidgeClassifier | 0.965 | 0.29275 |
| 8 | RidgeClassifier | 0.94 | 0.287842 |
| 9 | RidgeClassifier | 0.92 | 0.305349 |
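
The copy-then-delete pattern used in this section can be wrapped in a small function. The sketch below defines a hypothetical `move_result` helper (not part of remayn) that relies only on the `copy_to` and `delete` methods shown above. It is exercised with an in-memory stand-in class, so nothing is written to or removed from disk.

```python
def move_result(result, destination):
    """Copy `result` to `destination`, then delete the original.

    Hypothetical helper; it relies only on the `copy_to` and `delete` methods.
    """
    new_result = result.copy_to(destination)
    # Delete only after the copy has returned without raising
    result.delete()
    return new_result


class FakeResult:
    """In-memory stand-in that records calls instead of touching the disk."""

    def __init__(self, path):
        self.path = path
        self.deleted = False

    def copy_to(self, destination):
        file_name = self.path.split("/")[-1]
        return FakeResult(f"{destination}/{file_name}")

    def delete(self):
        self.deleted = True


original = FakeResult("./results/example.json")
moved = move_result(original, "./results_ridge")
print(moved.path, original.deleted)  # ./results_ridge/example.json True
```

Deleting after the copy means that a failure while copying leaves the original result untouched.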


### 6. Combine two different result sets
It is also possible to combine the results stored in two different result sets (or folders).

In [11]:
# Load both ResultFolders
rf = ResultFolder('./results')
rf_ridge = ResultFolder('./results_ridge')

print(f"Results: {rf}")
print(f"Results ridge: {rf_ridge}")

# Combine them
combined_rf = rf + rf_ridge
print(f"Combined results: {combined_rf}")

# A ResultSet can also be subtracted from another one, which removes the results they have in common
subtracted_rf = combined_rf - rf_ridge
print(f"Subtracted results: {subtracted_rf}")

Results: ResultSet with 9 results
Results ridge: ResultSet with 10 results
Combined results: ResultSet with 19 results
Subtracted results: ResultSet with 9 results
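
When more than two result sets need to be merged, the `+` operator shown above can be applied repeatedly. The sketch below does this with `functools.reduce` in a hypothetical `combine_all` helper (not part of remayn). Plain lists stand in for result sets because they also support `+`. Lists concatenate without removing duplicates, and how remayn treats results present in both operands is not shown in this notebook.

```python
from functools import reduce
from operator import add


def combine_all(result_sets):
    """Combine a non-empty sequence of result sets with repeated `+`."""
    return reduce(add, result_sets)


# Plain lists stand in for result sets here, since they also support `+`
combined = combine_all([["a", "b"], ["c"], ["d", "e"]])
print(len(combined))  # 5
```

With real folders, the call would look like `combine_all([rf, rf_ridge])`, which is equivalent to `rf + rf_ridge`.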


### 7. Removing an experiment from the result set (or result folder)

We can also remove an experiment from the collection without physically deleting it from disk. This is done with the `.remove()` method of the `ResultSet`.

In [12]:
# Load the results
rf = ResultFolder('./results')
print(f"Initial ResultSet: {rf}")

# For simplicity, get the first result
first_result = list(rf)[0]

# Remove it from the ResultFolder
rf.remove(first_result)

print(f"ResultSet after removing {rf}")

# Check that if we load the result folder again, the removed result is still present
rf = ResultFolder('./results')
print(f"ResultSet after reloading {rf}")

# We can also remove it by config
rf.remove(first_result.config)

print(f"ResultSet after removing by config: {rf}")

Initial ResultSet: ResultSet with 9 results
ResultSet after removing ResultSet with 8 results
ResultSet after reloading ResultSet with 9 results
ResultSet after removing by config: ResultSet with 8 results
