# Fine-tune models with Eureqa

## Summary

[Eureqa](https://docs.datarobot.com/en/docs/modeling/analyze-models/describe/eureqa.html) is a symbolic regression algorithm that searches the space of mathematical expressions to find the best fit for a given dataset, while minimizing both error and complexity. Eureqa came out of Cornell's AI Lab in 2009, developed into a company called Nutonian, and then was acquired by DataRobot in 2017. Eureqa now lives as a blueprint within DataRobot.

### Why use Eureqa?

1. Eureqa returns human-readable and interpretable analytic expressions, which are easily reviewed by subject matter experts.

2. Eureqa excels at feature selection; the algorithm is forced to reduce complexity during the model building process. For example, if the dataset has 20 different features used to predict the target variable, then the search for a simple expression would result in an expression that only uses the strongest predictors.

3. Eureqa works well with small datasets, so it's very popular with scientific researchers who gather data from physical experiments that don’t produce massive amounts of data.

4. Eureqa provides an easy way to incorporate domain knowledge. If you know the underlying relationship in the system that you're modeling, you can give Eureqa a "hint" (for example, the formula for heat transfer or how house prices work in a particular neighborhood) as a building block or a starting point to learn from. Eureqa will build machine learning corrections from there.
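
The trade-off between error and complexity described above can be illustrated with a toy example. The sketch below is not Eureqa's algorithm; it only shows how a set of candidate expressions can be reduced to those that no other candidate beats on both error and complexity. The `Candidate` type, the complexity scores, and the error values are all made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    expression: str  # human-readable formula
    complexity: int  # hypothetical size score (for example, a count of terms and operators)
    error: float     # hypothetical fit error (lower is better)


def pareto_front(candidates):
    """Keep the candidates that no other candidate beats on both complexity and error."""
    front = []
    for c in candidates:
        dominated = any(
            o.complexity <= c.complexity
            and o.error <= c.error
            and (o.complexity < c.complexity or o.error < c.error)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.complexity)


# Made-up candidates for illustration only; these are not Eureqa outputs.
candidates = [
    Candidate("a2 = 1.5", 1, 9.0),
    Candidate("a2 = -a1", 2, 4.0),
    Candidate("a2 = x1 + x2", 3, 6.0),  # worse than "a2 = -a1" on both axes
    Candidate("a2 = -a1 * cos(x1 - x2)", 6, 1.2),
    Candidate("a2 = -a1 * cos(x1 - x2) + v2", 8, 1.3),  # more complex and less accurate
]

front = pareto_front(candidates)
for c in front:
    print(f"complexity={c.complexity:2d}  error={c.error:4.1f}  {c.expression}")
```

Each surviving candidate is the most accurate expression at its level of complexity, which is why a simple expression tends to retain only the strongest predictors.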

### Background

In classical Newtonian mechanics, it is possible to derive that the behavior (position, velocity, and acceleration) of a [double pendulum](https://en.wikipedia.org/wiki/Double_pendulum) is exactly modeled by the following equation:

`a2 = v1^2 * sin(x1 - x2) - a1 * cos(x2 - x1) - g * sin(x2)`

Where `g` is the acceleration due to gravity, approximately 9.81 m/s<sup>2</sup> on the surface of the Earth (the exact value varies slightly with location, mainly latitude and altitude). Assume that the two pendulums are both of unit length; otherwise additional length coefficients are involved.

You can use DataRobot to predict the acceleration of the second pendulum (`a2`).

## Setup

### Import libraries

In [1]:
import os
import sys
import time
import warnings

import datarobot as dr
import pandas as pd
import requests

warnings.filterwarnings("ignore")

# wider .head()s
pd.options.display.width = 0
pd.options.display.max_columns = 200
pd.options.display.max_rows = 2000

RANDOM_SEED = 321

### Connect to DataRobot

Read more about different options for [connecting to DataRobot from the client](https://docs.datarobot.com/en/docs/api/api-quickstart/api-qs.html).

In [4]:
# The URL may vary depending on your hosting preference; this example is for DataRobot Managed AI Cloud
DATAROBOT_ENDPOINT = "https://app.datarobot.com/api/v2"

# The API Token can be found by clicking the avatar icon and then </> Developer Tools
DATAROBOT_API_TOKEN = "<INSERT YOUR DataRobot API Token>"

client = dr.Client(
    token=DATAROBOT_API_TOKEN,
    endpoint=DATAROBOT_ENDPOINT,
    user_agent_suffix="AIA-AE-EUQ-110",  # Optional but helps DataRobot improve this workflow
)

dr.client._global_client = client
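
To avoid pasting a token into the notebook, the credentials can be read from environment variables instead. The following is a minimal sketch: the helper `load_datarobot_credentials` is hypothetical, and the variable names `DATAROBOT_ENDPOINT` and `DATAROBOT_API_TOKEN` are an assumption that simply mirrors the Python names used above, so adjust them to your environment.

```python
import os


def load_datarobot_credentials(env=None):
    """Return (endpoint, token) from environment variables; raise if the token is missing."""
    env = os.environ if env is None else env
    # Fall back to the Managed AI Cloud endpoint used in this notebook.
    endpoint = env.get("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2")
    token = env.get("DATAROBOT_API_TOKEN")
    if not token:
        raise RuntimeError("Set the DATAROBOT_API_TOKEN environment variable first.")
    return endpoint, token


# An explicit mapping is passed here so the snippet runs without real credentials.
endpoint, token = load_datarobot_credentials({"DATAROBOT_API_TOKEN": "example-token"})
print(endpoint)
```

In the notebook, call `load_datarobot_credentials()` without arguments and pass the returned values to `dr.Client(token=token, endpoint=endpoint)`.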

### Import data

Data dictionary:
* _t_: Timestep (seconds)
* _x1_: Position (m) of the first pendulum
* _x2_: Position (m) of the second pendulum
* _v1_: Velocity (m/s) of the first pendulum
* _v2_: Velocity (m/s) of the second pendulum
* _a1_: Acceleration (m/$s^2$) of the first pendulum
* _a2_: Acceleration (m/$s^2$) of the second pendulum

In [2]:
df = pd.read_csv(
    "https://s3.amazonaws.com/datarobot_public_datasets/double-pendulum.csv"
)
df.head()

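|   | t | x1 | x2 | v1 | v2 | a1 | a2 |
|---|---|----|----|----|----|----|----|
| 0 | 0.0 | 2.36 | 3.14 | -0.01 | -0.01 | -9.24 | 6.53 |
| 1 | 0.000862 | 2.36 | 3.14 | -0.018 | -0.00437 | -9.24 | 6.53 |
| 2 | 0.00172 | 2.36 | 3.14 | -0.0259 | 0.00126 | -9.24 | 6.53 |
| 3 | 0.00259 | 2.36 | 3.14 | -0.0339 | 0.00689 | -9.24 | 6.53 |
| 4 | 0.00345 | 2.36 | 3.14 | -0.0418 | 0.0125 | -9.24 | 6.53 |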


In [3]:
len(df)

2429
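
As a quick sanity check, the equation from the Background section can be evaluated on the first row of the dataset. This is a minimal sketch: the function name `pendulum_a2` is introduced here for illustration, `g = 9.81` is an assumed value, and the inputs are the rounded values displayed by `df.head()`, so the result should agree with the recorded `a2` only to within rounding.

```python
import math

G = 9.81  # assumed acceleration due to gravity, in m/s^2


def pendulum_a2(x1, x2, v1, a1, g=G):
    """Acceleration of the second pendulum for unit-length arms (Background equation)."""
    return v1**2 * math.sin(x1 - x2) - a1 * math.cos(x2 - x1) - g * math.sin(x2)


# Rounded values copied from the first row of df.head().
row = {"x1": 2.36, "x2": 3.14, "v1": -0.01, "a1": -9.24, "a2": 6.53}

predicted = pendulum_a2(row["x1"], row["x2"], row["v1"], row["a1"])
print(f"equation: {predicted:.2f}, dataset: {row['a2']:.2f}")
```

The same function can be applied to whole columns by swapping `math` for `numpy`, which gives a baseline residual to compare any model against.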

## Create a DataRobot project

For the example covered in this workflow, you want to predict `a2`, the acceleration of the second pendulum. To do so, use the other columns in the dataset as features.

### Start Autopilot

In [5]:
project_eureqa = dr.Project.create(
    sourcedata=df,
    project_name="AI_Accelerator_Eureqa_{}".format(
        # pd.datetime is unavailable in recent pandas releases; pd.Timestamp.now() is equivalent here
        pd.Timestamp.now().strftime("%Y-%m-%d %H:%M")
    ),
)

# Set the project's target and set quick mode
project_eureqa.set_target(
    target="a2",
    mode="quick",
    worker_count=-1,
    advanced_options=dr.AdvancedOptions(seed=RANDOM_SEED),
)

Project(AI_Accelerator_Eureqa_2023-02-22 15:49)

### Add Eureqa blueprints

Once Autopilot has finished running, add and train a Eureqa blueprint from the Repository.

In [6]:
# There are many Eureqa blueprints available in the Repository
# The "Eureqa Regressor" (or "Classifier") blueprints are the original Eureqa evolutionary model
# The "Eureqa Generalized Additive Model" blueprints use a two-stage approach that first fits an XGBoost model, then runs Eureqa on the smoothed signal
# The number of generations controls how long each blueprint continues searching
for bp in project_eureqa.get_blueprints():
    if bp.model_type.startswith("Eureqa"):
        print(bp.model_type)

Eureqa Regressor (Instant Search: 40 Generations)
Eureqa Generalized Additive Model (40 Generations)
Eureqa Generalized Additive Model (1000 Generations)
Eureqa Generalized Additive Model (10000 Generations)
Eureqa Regressor (Quick Search: 250 Generations)
Eureqa Regressor (Default Search: 3000 Generations)


In [7]:
# Run the Eureqa 3k gen regressor blueprint from the repository, adding it to the Leaderboard
eq_bp = next(
    bp
    for bp in project_eureqa.get_blueprints()
    if bp.model_type == "Eureqa Regressor (Default Search: 3000 Generations)"
)
job_id = project_eureqa.train(eq_bp.id)
eq_model = dr.ModelJob.get(project_eureqa.id, job_id).get_result_when_complete()
job = eq_model.cross_validate()
job.wait_for_completion()
eq_model

Model('Eureqa Regressor (Default Search: 3000 Generations)')

In [8]:
def view_cv_scores(project, metric="RMSE"):
    """
    Return the Leaderboard models that have a cross-validation score for
    `metric`, sorted from lowest (best for RMSE) to highest.
    """
    models = project.get_models()
    leaderboard = []
    for m in models:
        leaderboard.append([m.model_type, m.metrics[metric]["crossValidation"]])
    leaderboard_df = pd.DataFrame(leaderboard, columns=["model", "cv_score"]).dropna()
    return leaderboard_df.sort_values(by="cv_score")

In [9]:
# See where the unmodified Eureqa model initially shows up on the Leaderboard
project_eureqa.wait_for_autopilot()
view_cv_scores(project_eureqa)

In progress: 16, queued: 0 (waited: 0s)
In progress: 16, queued: 0 (waited: 1s)
In progress: 15, queued: 0 (waited: 1s)
In progress: 15, queued: 0 (waited: 2s)
In progress: 15, queued: 0 (waited: 3s)
In progress: 13, queued: 0 (waited: 5s)
In progress: 10, queued: 0 (waited: 9s)
In progress: 4, queued: 0 (waited: 15s)
In progress: 0, queued: 0 (waited: 29s)
In progress: 0, queued: 0 (waited: 49s)
In progress: 0, queued: 0 (waited: 69s)
In progress: 1, queued: 0 (waited: 90s)
In progress: 1, queued: 0 (waited: 110s)
In progress: 1, queued: 0 (waited: 130s)
In progress: 0, queued: 0 (waited: 151s)
In progress: 0, queued: 0 (waited: 171s)
In progress: 0, queued: 0 (waited: 192s)


|   | model | cv_score |
|---|-------|----------|
| 1 | Light Gradient Boosting on ElasticNet Predicti... | 1.931188 |
| 0 | Light Gradient Boosting on ElasticNet Predicti... | 1.93267 |
| 2 | Light Gradient Boosted Trees Regressor with Ea... | 2.17385 |
| 3 | eXtreme Gradient Boosted Trees Regressor | 2.251016 |
| 4 | RandomForest Regressor | 2.698264 |
| 7 | Eureqa Regressor (Default Search: 3000 Generat... | 14.649408 |


## Tune Eureqa

There are some use cases where Eureqa shows up at the top of the Leaderboard just by using default settings, but other cases (including this one) need modifications to fit domain knowledge into how Eureqa is run.

In this case, rotation is a key component in predicting the movement of a double pendulum. Therefore, modify the Eureqa model to include the `sin`/`cos` building blocks.

In [None]:
# This is how you begin tuning Eureqa models
# Examine the available tuning options
tune = eq_model.start_advanced_tuning_session()
tune.get_parameters()["tuning_parameters"]

In [11]:
# Look specifically at the sin/cos operators and the target expression; the operators are disabled by default
[
    param
    for param in tune.get_parameters()["tuning_parameters"]
    if param["parameter_name"].endswith("__sine")
    or param["parameter_name"].endswith("__cosine")
    or param["parameter_name"] == "target_expression_string"
]

[{'parameter_name': 'building_block__cosine',
  'parameter_id': 'eyJhcmciOiJidWlsZGluZ19ibG9ja19fY29zaW5lIiwidmlkIjoiMiJ9',
  'default_value': 'Disabled',
  'current_value': 'Disabled',
  'task_name': 'Eureqa Regressor (Default Search: 3000 Generations)',
  'constraints': {'select': {'values': ['Disabled']},
   'int': {'min': 0, 'max': 100, 'supports_grid_search': False}},
  'vertex_id': '2',
  'value': None},
 {'parameter_name': 'building_block__sine',
  'parameter_id': 'eyJhcmciOiJidWlsZGluZ19ibG9ja19fc2luZSIsInZpZCI6IjIifQ',
  'default_value': 'Disabled',
  'current_value': 'Disabled',
  'task_name': 'Eureqa Regressor (Default Search: 3000 Generations)',
  'constraints': {'select': {'values': ['Disabled']},
   'int': {'min': 0, 'max': 100, 'supports_grid_search': False}},
  'vertex_id': '2',
  'value': None},
 {'parameter_name': 'target_expression_string',
  'parameter_id': 'eyJhcmciOiJ0YXJnZXRfZXhwcmVzc2lvbl9zdHJpbmciLCJ2aWQiOiIyIn0',
  'default_value': '',
  'current_value': '',
 

In [12]:
# First, disable all building blocks
print("Setting to disabled:")
for param in tune.get_parameters()["tuning_parameters"]:
    if (
        param["parameter_name"].startswith("building_block")
        and param["current_value"] != "Disabled"
    ):
        tune.set_parameter(parameter_name=param["parameter_name"], value="Disabled")
        print(param["parameter_name"], param["current_value"])

# Now, enable the specific building blocks that you know are relevant to this problem
# The final expression should contain trigonometric functions and have the form f(x1, x2, v1, a1)
tune.set_parameter(parameter_name="building_block__sine", value=1)
tune.set_parameter(parameter_name="building_block__cosine", value=1)
tune.set_parameter(parameter_name="target_expression_string", value="f(x1, x2, v1, a1)")
tune.set_parameter(parameter_name="building_block__input_variable", value=1)
tune.set_parameter(parameter_name="building_block__constant", value=1)
tune.set_parameter(parameter_name="building_block__addition", value=1)
tune.set_parameter(parameter_name="building_block__subtraction", value=0)
tune.set_parameter(parameter_name="building_block__multiplication", value=0)

# Now run this tuned Eureqa model
tune.description = "Enabled sin/cos"
job = tune.run()

tuned_eq = job.get_result_when_complete()
job = tuned_eq.cross_validate()
job.wait_for_completion()

Setting to disabled:
building_block__addition 0
building_block__constant 0
building_block__division 2
building_block__if-then-else 1
building_block__input_variable 1
building_block__less-than 1
building_block__logistic_function 4
building_block__maximum 1
building_block__minimum 1
building_block__multiplication 0
building_block__natural_logarithm 2
building_block__square_root 1
building_block__step_function 2
building_block__subtraction 0
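
The disable-then-enable pattern used in the cell above can be wrapped in a small helper so that different combinations of building blocks are easy to try. This is a sketch under stated assumptions: `configure_building_blocks` is a hypothetical name, it relies only on the `get_parameters()` and `set_parameter()` calls already used in this notebook, and `FakeTuningSession` is a stand-in that lets the example run without a DataRobot connection.

```python
def configure_building_blocks(tune, enabled):
    """Disable every active building block, then apply the values in `enabled`.

    `tune` is any object exposing get_parameters() and set_parameter() in the
    way the tuning session above does; `enabled` maps parameter names to values.
    Returns the names of the building blocks that were switched off.
    """
    disabled = []
    for param in tune.get_parameters()["tuning_parameters"]:
        name = param["parameter_name"]
        if name.startswith("building_block") and param["current_value"] != "Disabled":
            tune.set_parameter(parameter_name=name, value="Disabled")
            disabled.append(name)
    for name, value in enabled.items():
        tune.set_parameter(parameter_name=name, value=value)
    return disabled


class FakeTuningSession:
    """Hypothetical stand-in for a tuning session, used only to exercise the helper."""

    def __init__(self, values):
        self.values = dict(values)

    def get_parameters(self):
        return {
            "tuning_parameters": [
                {"parameter_name": k, "current_value": v} for k, v in self.values.items()
            ]
        }

    def set_parameter(self, parameter_name, value):
        self.values[parameter_name] = value


session = FakeTuningSession(
    {"building_block__division": 2, "building_block__sine": "Disabled"}
)
switched_off = configure_building_blocks(
    session, {"building_block__sine": 1, "building_block__cosine": 1}
)
print(switched_off, session.values)
```

With a real session, the call would take the `tune` object from the earlier cell together with a dictionary of the building blocks to enable.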


In [13]:
view_cv_scores(project_eureqa)

|   | model | cv_score |
|---|-------|----------|
| 0 | Eureqa Regressor (Default Search: 3000 Generat... | 0.8634 |
| 2 | Light Gradient Boosting on ElasticNet Predicti... | 1.931188 |
| 1 | Light Gradient Boosting on ElasticNet Predicti... | 1.93267 |
| 3 | Light Gradient Boosted Trees Regressor with Ea... | 2.17385 |
| 4 | eXtreme Gradient Boosted Trees Regressor | 2.251016 |
| 5 | RandomForest Regressor | 2.698264 |
| 8 | Eureqa Regressor (Default Search: 3000 Generat... | 14.649408 |


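The tuned Eureqa model now tops the Leaderboard with a cross-validation RMSE of 0.8634, compared with 14.649408 for the untuned Eureqa model and 1.931188 for the best Autopilot model.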

## Compare DataRobot results with the physical equation

Remember that physicists have determined that the equation governing the motion of a double pendulum is:

`a2 = v1^2 * sin(x1 - x2) - a1 * cos(x2 - x1) - g * sin(x2)`

Where `g` is the acceleration due to gravity, approximately 9.81 m/s<sup>2</sup>.

In [14]:
# Access the closed form of the solution
pareto = tuned_eq.get_pareto_front()
solutions = pareto.solutions
best_solution = solutions[-1]
best_solution.expression

'Target = 0.99845736281396*v1^2*sin(x1 - x2) - 9.86312521430275*sin(x2) - 0.999583191303474*a1*cos(x1 - x2)'

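The Eureqa model has produced a coefficient of about 9.86 on the `sin(x2)` term, remarkably close to the value of `g` (approximately 9.81 m/s<sup>2</sup>). The other two coefficients are within 0.2% of 1, and because cosine is an even function, `cos(x1 - x2)` equals `cos(x2 - x1)`, so the expression matches the physical equation term by term.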
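
As a rough check of how close the fitted coefficients are to the physical ones, the relative differences can be computed directly. The fitted values below are copied from the Eureqa expression above; the expected values come from the physical equation, with `g = 9.81` as an assumed value.

```python
# Coefficient magnitudes copied from the Eureqa expression above.
fitted = {
    "v1^2*sin(x1 - x2)": 0.99845736281396,
    "sin(x2)": 9.86312521430275,
    "a1*cos(x1 - x2)": 0.999583191303474,
}

# Magnitudes from the physical equation; g = 9.81 m/s^2 is an assumed value.
expected = {
    "v1^2*sin(x1 - x2)": 1.0,
    "sin(x2)": 9.81,
    "a1*cos(x1 - x2)": 1.0,
}

relative_error = {
    term: abs(fitted[term] - expected[term]) / expected[term] for term in fitted
}
for term, err in relative_error.items():
    print(f"{term:20s} {err:.2%}")
```

Every coefficient lands within 1% of its physical counterpart, with the `sin(x2)` term showing the largest deviation.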