# Saving and loading populations

Brush can also save and load entire populations.
Populations are stored as JSON, so the file is human readable. Likewise, an estimator can be given a previously saved population file to use as the starting point for the evolution.

In this notebook, we will walk through how to use the `save_population` and `load_population` parameters. 

We start by getting a sample dataset and splitting it into `X` and `y`:

In [1]:
import pandas as pd
from pybrush import BrushRegressor

# load data
df = pd.read_csv('../examples/datasets/d_enc.csv')
X = df.drop(columns='label')
y = df['label']

To save the population after the evolution finishes, you need to set the `save_population` parameter to a non-empty string. The final population is then stored in the file at that path.

In this example, we create a temporary file.

In [2]:
import pickle
import os, tempfile

pop_file = os.path.join(tempfile.mkdtemp(), 'population.json')

# set verbosity=2 to see the full report
est = BrushRegressor(
    functions=['SplitBest','Add','Mul','Sin','Cos','Exp','Logabs'],
    max_gens=10,
    objectives=["error", "complexity"],
    save_population=pop_file,
    use_arch=True, # Only the Pareto front of the last generation is stored in the archive
    verbosity=2
)

est.fit(X,y)
y_pred = est.predict(X)
print('score:', est.score(X,y))

Generation 1/10 [//////                                            ]
Train Loss (Med): 14.12979 (72.35345)
Val Loss (Med): 90.38514 (72.35345)
Median Size (Max): 3 (24)
Median complexity (Max): 20 (25928)
Time (s): 0.06040

Generation 2/10 [///////////                                       ]
Train Loss (Med): 14.12979 (17.94969)
Val Loss (Med): 14.12979 (17.94969)
Median Size (Max): 3 (20)
Median complexity (Max): 20 (19464)
Time (s): 0.11851

Generation 3/10 [////////////////                                  ]
Train Loss (Med): 10.84173 (17.94969)
Val Loss (Med): 14.12979 (17.94969)
Median Size (Max): 7 (21)
Median complexity (Max): 344 (10696)
Time (s): 0.18745

Generation 4/10 [/////////////////////                             ]
Train Loss (Med): 10.84173 (17.94969)
Val Loss (Med): 10.84173 (17.94969)
Median Size (Max): 7 (23)
Median complexity (Max): 344 (10696)
Time (s): 0.25553

Generation 5/10 [//////////////////////////                        ]
Train Loss (Med): 10.43983 (16.75

To load a previous population, set `load_population` to the path of a JSON file generated by Brush. Here we use the file written by the previous code block.

After loading the population, we run the evolution for 10 more generations. The log shows that the first generation started from the previous population, which means the population was successfully saved and loaded.

In [3]:
est = BrushRegressor(
    functions=['SplitBest','Add','Mul','Sin','Cos','Exp','Logabs'],
    load_population=pop_file,
    max_gens=10,
    verbosity=1
)

est.fit(X,y)
y_pred = est.predict(X)
print('score:', est.score(X,y))

Loaded population from /tmp/tmprhw9ljoe/population.json of size = 200
saving final population as archive...
score: 0.887846384165187


## Saving just the archive

If you want to use an expression other than the final `best_estimator_`, Brush provides the archive option.

The archive is just the Pareto front of the population. You can use `predict_archive` (and `predict_proba_archive` if using a `BrushClassifier`) to call the prediction methods for the entire archive instead of only the selected best individual.

But first, you need to enable this option with `use_arch=True`. When it is set to `False`, the entire final population is stored instead.

In [4]:
est = BrushRegressor(
    functions=['SplitBest','Add','Mul','Sin','Cos','Exp','Logabs'],
    load_population=pop_file,
    use_arch=True,
    max_gens=10,
    verbosity=1
)

est.fit(X,y)

# accessing first expression from the archive. It is serialized as a dict
print(est.archive_[0]['fitness'])

Loaded population from /tmp/tmprhw9ljoe/population.json of size = 200
{'complexity': 7032, 'crowding_dist': 0.0, 'dcounter': 0, 'depth': 3, 'dominated': [], 'linear_complexity': 45, 'loss': 10.137018203735352, 'loss_v': 10.137018203735352, 'rank': 1, 'size': 18, 'values': [10.137018203735352, 18.0], 'weights': [-1.0, -1.0], 'wvalues': [-10.137018203735352, -18.0]}
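
Because each archive entry is a plain dict, you can pick a different trade-off with ordinary Python. The sketch below assumes every entry carries a `fitness` dict with `loss` and `size` keys, as in the output above. `pick_from_archive` and the sample entries are hypothetical illustrations, not part of Brush.

```python
def pick_from_archive(archive, max_size):
    """Return the lowest-loss entry whose size does not exceed max_size.

    Hypothetical helper: assumes each entry is a dict with a 'fitness'
    dict holding 'loss' and 'size', as in the printed archive entry.
    Returns None when no entry is small enough.
    """
    candidates = [ind for ind in archive if ind['fitness']['size'] <= max_size]
    if not candidates:
        return None
    return min(candidates, key=lambda ind: ind['fitness']['loss'])


# Made-up entries for illustration only (not real Brush output)
toy_archive = [
    {'id': 1, 'fitness': {'loss': 10.1, 'size': 18}},
    {'id': 2, 'fitness': {'loss': 12.5, 'size': 7}},
    {'id': 3, 'fitness': {'loss': 14.1, 'size': 3}},
]

chosen = pick_from_archive(toy_archive, max_size=10)
print(chosen['id'])
```

With a fitted estimator you would pass `est.archive_` instead of `toy_archive`.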


You can open the serialized file and change individuals' programs manually.

This also allows us to have checkpoints in the execution.
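
Before editing a population file by hand, it helps to look at its layout. The sketch below makes no assumption about Brush's schema: it only loads the JSON and reports the type and size of each top-level entry. `summarize_json` and `describe` are hypothetical helpers, and the demo file is a stand-in written by the sketch itself, not a real Brush population.

```python
import json
import os
import tempfile


def describe(value):
    """Short description of a JSON value: its type, plus length for containers."""
    if isinstance(value, (list, dict)):
        return f'{type(value).__name__} of length {len(value)}'
    return type(value).__name__


def summarize_json(path):
    """Map each top-level key of a JSON file to a description of its value."""
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, dict):
        return {key: describe(value) for key, value in data.items()}
    return {'<root>': describe(data)}


# Stand-in file for illustration; a real run would pass pop_file instead
demo_file = os.path.join(tempfile.mkdtemp(), 'demo.json')
with open(demo_file, 'w') as f:
    json.dump({'individuals': [{}, {}, {}], 'note': 'made up'}, f)

print(summarize_json(demo_file))
```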

## Using population files with classification

To give another example, we do a two-step fit in the cells below.

First, we run the evolution and save the population to a file; then, we load it and keep evolving the individuals.

What is different, though, is the scorer: the first run uses `log`, and the second run uses `average_precision_score`. Both runs optimize the `error` and `complexity` objectives.

In [5]:
from pybrush import BrushClassifier

# load data
df = pd.read_csv('../examples/datasets/d_analcatdata_aids.csv')
X = df.drop(columns='target')
y = df['target']

pop_file = os.path.join(tempfile.mkdtemp(), 'population.json')

est = BrushClassifier(
    functions=['SplitBest','Add','Mul','Sin','Cos','Exp','Logabs'],
    max_gens=10,
    objectives=["error", "complexity"],
    scorer="log",
    save_population=pop_file,
    pop_size=200,
    verbosity=2
)

est.fit(X,y)

print("Best model:", est.best_estimator_.get_model())
print('score:', est.score(X,y))

Generation 1/10 [//////                                            ]
Train Loss (Med): 0.54848 (0.69315)
Val Loss (Med): 0.69315 (0.69315)
Median Size (Max): 5 (12)
Median complexity (Max): 128 (38816)
Time (s): 0.06602

Generation 2/10 [///////////                                       ]
Train Loss (Med): 0.54848 (0.69315)
Val Loss (Med): 0.54848 (0.69315)
Median Size (Max): 5 (12)
Median complexity (Max): 128 (38816)
Time (s): 0.12490

Generation 3/10 [////////////////                                  ]
Train Loss (Med): 0.54848 (0.69315)
Val Loss (Med): 0.54848 (0.69315)
Median Size (Max): 5 (12)
Median complexity (Max): 128 (3488)
Time (s): 0.18595

Generation 4/10 [/////////////////////                             ]
Train Loss (Med): 0.54848 (0.69315)
Val Loss (Med): 0.54848 (0.69315)
Median Size (Max): 5 (12)
Median complexity (Max): 80 (3488)
Time (s): 0.24847

Generation 5/10 [//////////////////////////                        ]
Train Loss (Med): 0.54848 (0.69315)
Val Loss (Med)

In [6]:
from sklearn.metrics import accuracy_score

accuracy_score(y, est.predict(X))

0.68

In [7]:
est = BrushClassifier(
    functions=['SplitBest','Add','Mul','Sin','Cos','Exp','Logabs'],
    load_population=pop_file,
    objectives=["error", "complexity"],
    scorer="average_precision_score",
    max_gens=10,
    validation_size=0.0,
    pop_size=200, # make sure this is the same as loaded pop
    use_arch=True,
    verbosity=1
)

est.fit(X,y)

print("Best model:", est.best_estimator_.get_model())
print('score:', est.score(X,y))

Loaded population from /tmp/tmp9ngz74qa/population.json of size = 400
Best model: Logistic(Sum(0.11283493,0.00*AIDS))
score: 0.5


We can inspect the fitness object, whose error is now computed with the average precision score metric:

In [8]:
# Fitness is (error, complexity)
print(est.best_estimator_.fitness)

Fitness(0.617195 992.000000 )


In [9]:
from sklearn.metrics import average_precision_score

# takes y_true as first argument, and y_pred as second argument.
average_precision_score(y, est.predict_proba(X)[:, 1])

0.6972469637643848
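
For reference, average precision summarizes the precision-recall curve as the mean of the precision values taken at each rank where a positive sample appears. The sketch below is a plain-Python illustration of that definition, assuming binary labels and no tied scores. `average_precision` and the toy data are illustrative only and are not how Brush or scikit-learn implement the metric internally.

```python
def average_precision(y_true, y_score):
    """Mean of precision@k over the ranks k that hold a positive sample.

    Illustrative only; assumes binary labels and no tied scores.
    """
    # Rank samples from the highest score to the lowest
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy labels and scores, made up for illustration
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(round(average_precision(y_toy, scores_toy), 4))
```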

## Serialization with pickle

You can save the entire model (best individual, parameters, and archive) with pickle. 

> At the current stage, Brush does not serialize the search space and dataset references, but only the necessary information to be able to load a previously trained model and do predictions with it.

In [10]:
est

In [11]:
import pickle

est_file = os.path.join(tempfile.mkdtemp(), 'est.pkl')

with open(est_file, 'wb') as f:
    pickle.dump(est, f)

In [12]:
# Use a context manager so the file handle is closed after loading
with open(est_file, 'rb') as f:
    loaded_est = pickle.load(f)

In [13]:
print(est.predict(X))
print(loaded_est.predict(X))

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True]


In [14]:
print(est.predict_archive(X)[0])
print(loaded_est.predict_archive(X)[0])

{'id': 447, 'y_pred': array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])}
{'id': 447, 'y_pred': array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])}
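
Rather than comparing the printed arrays by eye, the check can be done programmatically. The sketch below assumes only that both objects expose a `predict` method returning a one-dimensional sequence. `same_predictions` and `ConstantModel` are hypothetical names used for illustration.

```python
def same_predictions(est_a, est_b, X):
    """Return True when both estimators predict identical labels on X."""
    pred_a = list(est_a.predict(X))
    pred_b = list(est_b.predict(X))
    return pred_a == pred_b


class ConstantModel:
    """Stand-in with a predict method, used only to exercise the helper."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


print(same_predictions(ConstantModel(True), ConstantModel(True), [1, 2, 3]))
```

In this notebook the call would be `same_predictions(est, loaded_est, X)`.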


## Stop/resume the fitting of an estimator

The code below mimics how PyTorch models are typically trained: we can stop the training at any time and resume it later.

The idea is to demonstrate how to use population files to store checkpoints and continue from the last saved checkpoint.

In [18]:
def train(est, X, y):
    
    checkpoint = os.path.join(tempfile.mkdtemp(), 'brush_pop_checkpoint.json')
    
    step = 5
    max_gens = est.max_gens
    est.max_gens = step
    est.save_population = checkpoint
    est.load_population = ""
    
    # You can set validation_size to a value greater than zero
    # and shuffle_split to True to have random batches of data
    est.shuffle_split = True
    est.validation_size = 0.2
    
    for g in range(max_gens // step):
        print(f"Progress {g + 1}/{max_gens // step}")
        
        est.fit(X, y) # Notice that this will reset the MAB every time!

        # Enable loading the checkpoint after a first run
        est.load_population = checkpoint
        
        print("Best model:", est.best_estimator_.get_model())
        print('score     :', est.score(X, y))

    # Restoring initial state
    est.max_gens = max_gens
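
Because `train` writes its checkpoint into a fresh temporary directory, the file cannot be located again once the Python session ends. A variant that takes a fixed path could resume across sessions. The sketch below is one way to do this, assuming the estimator exposes `max_gens`, `save_population`, and `load_population` as plain attributes, as `train` does. `train_resumable` and its `checkpoint` argument are hypothetical names.

```python
import os


def train_resumable(est, X, y, checkpoint, step=5):
    """Fit in chunks of `step` generations, resuming from `checkpoint` if present.

    Hypothetical variant of `train`: `checkpoint` is a caller-chosen path
    that outlives the session.
    """
    total_gens = est.max_gens
    est.max_gens = step
    est.save_population = checkpoint
    # Start from the checkpoint only when a previous run left one behind
    est.load_population = checkpoint if os.path.exists(checkpoint) else ""

    try:
        for _ in range(total_gens // step):
            est.fit(X, y)
            # Every later chunk continues from the file just written
            est.load_population = checkpoint
    finally:
        # Restore the original budget even if fitting is interrupted
        est.max_gens = total_gens
    return est
```

This sketch always runs the full `max_gens` budget again after resuming; tracking how many generations a checkpoint already covers is left out.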

In [23]:
est = BrushClassifier(
    objectives=["error", "linear_complexity"],
    scorer="balanced_accuracy",
    max_gens=50,
    validation_size=0.2,
    pop_size=100,
    max_depth=20,
    max_size=50,
    verbosity=1
)

train(est, X, y)

Progress 1/10
saving final population as archive...
Saved population to file /tmp/tmpaiwl_q3b/brush_pop_checkpoint.json
Best model: Logistic(Sum(-0.6284293,0.00*AIDS))
score     : 0.6
Progress 2/10
Loaded population from /tmp/tmpaiwl_q3b/brush_pop_checkpoint.json of size = 200
saving final population as archive...
Saved population to file /tmp/tmpaiwl_q3b/brush_pop_checkpoint.json
Best model: Logistic(Sum(-0.5824446,0.00*AIDS))
score     : 0.68
Progress 3/10
Loaded population from /tmp/tmpaiwl_q3b/brush_pop_checkpoint.json of size = 200
saving final population as archive...
Saved population to file /tmp/tmpaiwl_q3b/brush_pop_checkpoint.json
Best model: Logistic(Sum(0.003987044,Sin(1.32*Log1p(Sin(-0.69*Log1p(-0.09*Prod(AIDS,-0.09*AIDS)))))))
score     : 0.8
Progress 4/10
Loaded population from /tmp/tmpaiwl_q3b/brush_pop_checkpoint.json of size = 200
saving final population as archive...
Saved population to file /tmp/tmpaiwl_q3b/brush_pop_checkpoint.json
Best model: Logistic(Sum(-0.03098