# Using Feat's archive

Feat optimizes a population of models. 
At the end of the run, it can be useful to explore this population to find a trade-off between objectives, 
such as performance and complexity. 

In this example, we apply Feat to a regression problem and visualize the archive of representations. 
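To make the idea of a trade-off between objectives concrete, here is a minimal sketch of how a set of non-dominated models can be picked out of a population when both error and complexity should be low. The `pareto_front` helper, the `error` key, and the population values are hypothetical and only for illustration; this is not FEAT's internal selection code.

```python
def pareto_front(models):
    """Return models that no other model beats on both error and complexity.

    Lower is better for both objectives. A model is dominated when another
    model is at least as good on both and strictly better on at least one.
    """
    front = []
    for m in models:
        dominated = any(
            o["error"] <= m["error"]
            and o["complexity"] <= m["complexity"]
            and (o["error"] < m["error"] or o["complexity"] < m["complexity"])
            for o in models
        )
        if not dominated:
            front.append(m)
    # order the front from simplest to most complex
    return sorted(front, key=lambda m: m["complexity"])


# Hypothetical population (illustrative values, not FEAT output)
population = [
    {"name": "a", "error": 0.90, "complexity": 1},
    {"name": "b", "error": 0.50, "complexity": 4},
    {"name": "c", "error": 0.55, "complexity": 6},  # worse than "b" on both
    {"name": "d", "error": 0.30, "complexity": 9},
]

front = pareto_front(population)
print([m["name"] for m in front])
```

Model "c" drops out because "b" is both simpler and more accurate; the remaining models each represent a different balance between the two objectives.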

Note: this code uses the Penn ML Benchmark Suite (https://github.com/EpistasisLab/penn-ml-benchmarks/) to fetch data. You can install it using `pip install pmlb`.


First, we import the data and create a train-test split.

In [1]:
from pmlb import fetch_data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import numpy as np
# fix the random state
random_state=42
dataset='690_visualizing_galaxy'
X, y = fetch_data(dataset,return_X_y=True)
X_t,X_v, y_t, y_v = train_test_split(X,y,train_size=0.75,test_size=0.25,random_state=random_state)

Then we set up a Feat instance and train the model, storing the final archive.

In [2]:
from feat import FeatRegressor


fest = FeatRegressor(pop_size=10, # population size
            gens=100, # maximum generations                            
            max_time=60, # max time in seconds 
            max_depth=2, # constrain features depth                                                      
            max_dim=5, # constrain representation dimensionality                                                      
            random_state=random_state,                                                            
            hillclimb=True, # use stochastic hillclimbing to optimize weights                                                   
            iters=10, # iterations of hillclimbing
            n_jobs=1, # restricts to single thread                                                      
            verbosity=2, # verbose output (printed to the terminal, not the notebook)
           ) 



In [4]:
print('FEAT version:', fest.__version__)
fest

FEAT version: 0.5.2.post54


In [None]:
# train the model
fest.fit(X_t,y_t)



In [None]:
# get the test score
test_score = {}
test_score['feat'] = mse(y_v,fest.predict(X_v))

# store the archive
archive = fest.get_archive(justfront=True)

# print the archive
print('complexity','fitness','validation fitness',
     'eqn')
order = np.argsort([a['complexity'] for a in archive])
complexity = []
fit_train = []
fit_test = []
eqn = []

for o in order:
    model = archive[o]
    if model['rank'] == 1:
        print(model['complexity'],
              model['fitness'],
              model['fitness_v'],
              model['eqn'],
             )

        complexity.append(model['complexity'])
        fit_train.append(model['fitness'])
        fit_test.append(model['fitness_v'])
        eqn.append(model['eqn'])

For comparison, we can fit an Elastic Net and Random Forest regression model to the same data.


In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=random_state)

rf.fit(X_t,y_t)

test_score['rf'] = mse(y_v,rf.predict(X_v))


In [None]:
from sklearn.linear_model import ElasticNet

linest = ElasticNet()

linest.fit(X_t,y_t)

test_score['elasticnet'] = mse(y_v,linest.predict(X_v))


Let's look at the test set mean squared errors by method.

In [None]:
test_score

## Visualizing the Archive

Let's visualize this archive with the test scores. This gives us a sense of how increasing the representation
complexity affects the quality of the model and its generalization.


In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import math

matplotlib.rcParams['figure.figsize'] = (10, 6)
%matplotlib inline 
sns.set_style('white')
h = plt.figure(figsize=(14,8))

# plot archive points 
plt.plot(fit_train,complexity,'--ro',label='Train',markersize=6)
plt.plot(fit_test,complexity,'--bx',label='Validation')
# some models to point out
best = np.argmin(np.array(fit_test))
# include the upper index in each slice so it is never empty (e.g. when best == 0)
middle = np.argmin(np.abs(np.array(fit_test[:best+1])-test_score['rf']))
small = np.argmin(np.abs(np.array(fit_test[:middle+1])-test_score['elasticnet']))

print('best:',complexity[best])
print('middle:',complexity[middle])
print('small:',complexity[small])
plt.plot(fit_test[best],complexity[best],'sk',markersize=16,markerfacecolor='none',label='Model Selection')

# test score lines
y1 = -1
y2 = np.max(complexity)+1
plt.plot((test_score['feat'],test_score['feat']),(y1,y2),'--k',label='FEAT Test',alpha=0.5)
plt.plot((test_score['rf'],test_score['rf']),(y1,y2),'-.xg',label='RF Test',alpha=0.5)
plt.plot((test_score['elasticnet'],test_score['elasticnet']),(y1,y2),'-sm',label='ElasticNet Test',alpha=0.5)

print('complexity',complexity)
xoff = 100
for e,t,c in zip(eqn,fit_test,complexity):
    if c in [complexity[best],complexity[middle],complexity[small]]:
        t = t+xoff
        tax = plt.text(t,c,r'$\leftarrow'+e+'$',size=18,horizontalalignment='left',
                      verticalalignment='center')
        tax.set_bbox(dict(facecolor='white', alpha=0.75, edgecolor='k'))

l = plt.legend(prop={'size': 16},loc=[1.01,0.25])
plt.xlabel('MSE',size=16)
plt.xlim(np.min(fit_train)*.75,np.max(fit_test)*2)
plt.gca().set_xscale('log')
plt.gca().set_yscale('log')

plt.gca().set_yticklabels('')
plt.gca().set_xticklabels('')

plt.ylabel('Complexity',size=18)
h.tight_layout()

plt.show()

Note that ElasticNet produces a similar test score to the linear representation
in Feat's archive, and that Random Forest's test score is near the representation shown in the middle.

The best model, marked with a square, is selected from the validation curve (blue line).
The validation curve shows how models begin to overfit as complexity grows.
By visualizing the archive, we can see that some lower-complexity models achieve nearly as good a validation score.
In this case, it may be preferable to choose one of those representations instead.
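One way to act on this observation is to pick the simplest model whose validation score is within some tolerance of the best one. The sketch below assumes archive entries are dictionaries with the `complexity` and `fitness_v` keys used earlier, and that `fitness_v` is a non-negative error such as MSE. The `simplest_within_tolerance` helper, the 5% tolerance, and the example values are hypothetical, not part of FEAT.

```python
def simplest_within_tolerance(models, tol=0.05):
    """Return the lowest-complexity model whose validation error is
    within a relative tolerance `tol` of the best validation error."""
    best = min(m["fitness_v"] for m in models)
    candidates = [m for m in models if m["fitness_v"] <= best * (1 + tol)]
    return min(candidates, key=lambda m: m["complexity"])


# Hypothetical archive entries (illustrative values, not FEAT output)
example_archive = [
    {"complexity": 2, "fitness_v": 900.0},
    {"complexity": 5, "fitness_v": 410.0},
    {"complexity": 9, "fitness_v": 400.0},   # lowest validation error
    {"complexity": 14, "fitness_v": 430.0},  # more complex and worse: overfitting
]

chosen = simplest_within_tolerance(example_archive, tol=0.05)
print(chosen)
```

With a tolerance of zero, this reduces to choosing the model with the lowest validation error; larger tolerances trade a little validation accuracy for a simpler representation.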

By default, FEAT will choose the model with the lowest validation error, marked with a square above. 
Let's look at that model.

The function `get_model()` prints a table of the learned features, optionally ordered by the magnitude of their weights.

In [None]:
print(fest.get_model(sort=False))
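As a rough illustration of what ordering by weight magnitude means, the sketch below sorts a small list of feature and weight pairs by absolute weight, so that large negative weights rank alongside large positive ones. The feature strings and weights are made up, and the code does not reproduce how `get_model()` builds its table.

```python
# Hypothetical (feature, weight) pairs, for illustration only
features = [
    ("x0", 0.5),
    ("sin(x1)", -2.0),
    ("x2*x3", 1.2),
]

# sort by the magnitude of the weight, largest first
by_magnitude = sorted(features, key=lambda fw: abs(fw[1]), reverse=True)

for name, weight in by_magnitude:
    print(f"{name:10s} {weight:+.2f}")
```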