[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alucantonio/data_enhanced_simulation/blob/master/11_SymbolicRegression.ipynb)

# Symbolic regression

Symbolic regression (SR) is a machine learning technique that aims to discover mathematical
expressions that best describe a dataset. Unlike traditional regression methods that fit
predefined equation structures (like linear or polynomial regression), symbolic
regression searches through a space of possible mathematical expressions to find both
the form and parameters of the equation.

Here's a simple example:

- Traditional regression might try to fit $y = ax + b$
- Symbolic regression could discover $y = \sin(x^2) + x/2$

The main advantages of SR over other machine learning techniques are:

- No need to assume a specific form for the relationship
- Can discover novel mathematical relationships
- Results are interpretable mathematical expressions

Mathematical expressions can be represented and manipulated as _expression trees_:

<figure>
    <img src="Genetic_Program_Tree.png" alt="Caption" width="300" />
    <figcaption>Representation of a mathematical expression as a tree (from Wikipedia).</figcaption>
</figure>
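
As a minimal sketch of this idea, the snippet below stores an expression as nested tuples and evaluates it recursively. The tuple layout, the `FUNCTIONS` table and the `evaluate` helper are illustrative assumptions, not the representation used by any particular GP library.

```python
import math
import operator

# Hypothetical encoding: an internal node is (function_name, child, ...),
# a leaf is either a variable name (str) or a numeric constant.
FUNCTIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
    "sin": math.sin,
}


def evaluate(node, variables):
    """Recursively evaluate an expression tree for the given variable values."""
    if isinstance(node, tuple):
        name, *children = node
        args = [evaluate(child, variables) for child in children]
        return FUNCTIONS[name](*args)
    if isinstance(node, str):
        return variables[node]
    return float(node)


# Tree for sin(x^2) + x/2, the expression used in the example above
tree = ("add", ("sin", ("mul", "x", "x")), ("div", "x", 2.0))
value = evaluate(tree, {"x": 1.5})
```

Here the root node is the outermost operation (`add`), and the leaves are the variable `x` and the constant `2.0`.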

## Genetic Programming-based symbolic regression

Symbolic regression is typically performed with **genetic programming (GP)**, an _evolutionary_ technique that evolves a
population of trees (individuals) to find the best-fitting expression.
In GP, the variables and constants of an expression are the leaves of the tree and are called _terminals_, while the arithmetic
operations are internal nodes called _functions_. The sets of
allowed functions and terminals together form the _primitive set_ of a GP
system.

As preliminary steps of any GP run, we need to define the primitive set and the **fitness** function,
which measures how good an individual (candidate expression) is at fitting a given
dataset. For example, the fitness could be related to the MSE on the training set, and
it could include a penalty term to favor simpler solutions (i.e. shorter expressions).
We should also define some stopping criteria for the evolution, such as the maximum
number of _generations_ or a threshold on the fitness.
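
One possible fitness of this kind is sketched below: the MSE plus a cost proportional to the number of nodes in the expression. The function names and the value of `length_penalty` are assumptions chosen for illustration; in this convention a lower fitness is better.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def penalized_fitness(y_true, y_pred, expression_length, length_penalty=1e-3):
    """Data misfit plus a cost per tree node (lower is better)."""
    return mse(y_true, y_pred) + length_penalty * expression_length


y_true = [1.0, 2.0, 3.0]

# A short expression with a small error...
short_fit = penalized_fitness(y_true, [1.1, 1.9, 3.2], expression_length=3)
# ...versus a long expression that fits the data exactly
long_fit = penalized_fitness(y_true, [1.0, 2.0, 3.0], expression_length=40)
```

With this (arbitrary) penalty, the short expression gets the better fitness even though the long one interpolates the data, which is the intended effect of the regularization.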

Pseudo-code of the GP-based SR algorithm:
```pseudo
Initialize population (random)
Evaluate fitness of each individual

For each generation:
    Select parents using selection strategy
    Apply crossover and mutation to create offspring
    Evaluate fitness of new individuals
    Update population (replace least fit)
    
    If stopping criteria met:
        Break

Return best individual
```


Typical parent selection methods are **tournament** and **uniform** selection, which are analogous to
those found in **genetic algorithms**. Cross-over and mutation operators are also typical of
genetic algorithms; in the context of GP, they are applied to the expression trees:

<figure>
    <img src="gp_schematic.png" alt="Caption" width="600" />
    <figcaption>Cross-over and mutation in Genetic Programming (taken from Quade et al., 2016).</figcaption>
</figure>
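
The operators described above can be sketched on trees stored as nested tuples `(function_name, child, ...)`. This encoding and all helper names are assumptions made for illustration; real GP libraries use their own data structures and usually add depth or length limits that are omitted here.

```python
import random


def tournament_select(population, fitnesses, tournament_size, rng):
    """Draw `tournament_size` individuals at random; return the one with the lowest fitness."""
    contenders = rng.sample(range(len(population)), tournament_size)
    winner = min(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def subtree_paths(tree, path=()):
    """Yield the path (tuple of child indices) to every node of the tree."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get_subtree(tree, path):
    """Follow a path of child indices down from the root."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new_subtree),) + tree[i + 1:]


def crossover(parent_a, parent_b, rng):
    """Replace a random subtree of parent_a with a random subtree of parent_b."""
    path_a = rng.choice(list(subtree_paths(parent_a)))
    path_b = rng.choice(list(subtree_paths(parent_b)))
    return replace_subtree(parent_a, path_a, get_subtree(parent_b, path_b))


def mutate(tree, terminals, rng):
    """Simplified mutation: replace a random subtree with a random terminal."""
    path = rng.choice(list(subtree_paths(tree)))
    return replace_subtree(tree, path, rng.choice(terminals))


rng = random.Random(0)
parent_a = ("add", "x", ("mul", "x", 2.0))
parent_b = ("sub", ("cos", "x"), 1.0)
child = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, ["x", 1.0], rng)
```

Because tuples are immutable, `crossover` and `mutate` return new trees and leave the parents untouched, so the same parent can safely be selected more than once.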

To perform a GP run, we should set some hyperparameters, which can be adjusted
during model **validation**. The most important control parameter is the _population
size_, as it controls the number of parallel explorations of the solution space. Other control
parameters include the probabilities of performing the genetic operations (cross-over
and mutation), the number of individuals involved in tournaments (in the case of
tournament selection), and regularization factors (such as the penalty on expression length).

## Discovering the dynamics of an environment

In this exercise, you will use symbolic regression to discover the evolution equation for the velocity in the
`MountainCarContinuous` environment (see the
[docs](https://gymnasium.farama.org/environments/classic_control/mountain_car_continuous/)) contained
in `gymnasium`. The equation implemented in the environment is:

$$ v_{t+1} = v_t + 0.0015\, a_t - 0.0025 \cos(3 x_t) $$

where $a_t$ is the action (float between -1 and 1) and $x_t$ is the position.
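
For reference, the update rule above can be written as a small Python function. This is only a direct transcription of the formula: the function name is hypothetical, and any clipping of the velocity or boundary handling performed by the environment is not reproduced.

```python
import math


def next_velocity(position, velocity, action):
    """Velocity update v_{t+1} = v_t + 0.0015 a_t - 0.0025 cos(3 x_t)."""
    return velocity + 0.0015 * action - 0.0025 * math.cos(3.0 * position)


# Car at rest at x = -0.5, pushed with the maximum positive action
v_next = next_velocity(position=-0.5, velocity=0.0, action=1.0)
```

A function like this can later be used to check a discovered expression against the known dynamics on a few sample points.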

As a symbolic regression tool, we will use the
[`pyoperon`](https://github.com/heal-research/pyoperon) library. **Study** this
[example](https://github.com/heal-research/pyoperon/blob/main/example/operon-sklearn.ipynb)
and the [docs](https://operongp.readthedocs.io/en/latest/) of `operon` before trying to solve the problem.

In [None]:
!pip install pyoperon

In [None]:
import gymnasium as gym

env = gym.make("MountainCarContinuous-v0")

1. Generate a training set for symbolic regression made of 5000 samples, where each
   sample is a list ($x_t$, $v_t$, $a_t$, $v_{t+1}$) recorded while interacting with the
   environment. Create the arrays $X$ and $y$ (features and labels).

In [None]:
#@title Solution:

import pandas as pd

num_samples = 5000  # Number of samples to generate

# Initialize storage for dataset
data = []

# Reset the environment to a random initial state
state, _ = env.reset()

for _ in range(num_samples):
    # Get current position and velocity from the state
    position, velocity = state
    
    # Use the environment's action space to sample a random action
    action = env.action_space.sample()
    
    # Apply the action and observe the next state
    next_state, _, terminated, truncated, _ = env.step(action)
    
    # Extract next velocity
    next_velocity = next_state[1]
    
    # Append the current state, action, and next velocity to the dataset
    data.append([position, velocity, action[0], next_velocity])
    
    # Check if the episode has ended
    if terminated or truncated:
        # Reset the environment and start a new episode
        state, _ = env.reset()
    else:
        # Update the state for the next step
        state = next_state

# Convert to a Pandas DataFrame
columns = ["position", "velocity", "action", "next_velocity"]
dataset = pd.DataFrame(data, columns=columns)

# Release the environment; the samples are now stored in `dataset`
env.close()

# Features (x_t, v_t, a_t) and labels (v_{t+1}) as NumPy arrays
X = dataset[["position", "velocity", "action"]].to_numpy()
y = dataset["next_velocity"].to_numpy()

2. Use `pyoperon` to find the analytical expression of the equation for the evolution of
   the velocity. Adjust the parameters `population_size`, `generations`,
   `tournament_size`, `max_length`, `optimizer_iterations` and `allowed_symbols` (start
   with the default values). Evaluate the $R^2$ score on the training and the test sets.
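
As a reminder of what the $R^2$ score measures, the sketch below computes the coefficient of determination, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, in plain Python. The helper name `r2_score` is chosen here for illustration; in practice you would rely on the scoring provided by the regression library.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# A perfect model reproduces every label
perfect = r2_score(y_true, y_true)
# A model that always predicts the mean of the labels
baseline = r2_score(y_true, [2.5] * 4)
```

A perfect model scores 1, while always predicting the mean of the labels scores 0, so values close to 1 on both the training and test sets indicate a good fit that also generalizes.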

In [None]:
#@title Solution:

from sklearn.model_selection import train_test_split
from pyoperon.sklearn import SymbolicRegressor

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, shuffle=True)

reg = SymbolicRegressor(
    allowed_symbols='add,mul,sub,cos,constant,variable',
    optimizer_iterations=10,
    max_length=15,
    n_threads=32,  # adjust to the number of available CPU cores
    objectives=['r2'],
    generations=100,
)

reg.fit(X_train, y_train)
print("Train score:", reg.score(X_train, y_train))
print("Test score:", reg.score(X_test, y_test))

3. Use the `get_model_string` method of the `SymbolicRegressor` object that you have
   just fitted to extract and print the string of the best model. Use the [`simplify`](https://docs.sympy.org/latest/tutorials/intro-tutorial/simplification.html)
   function of the `sympy` library to simplify the expression.
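
To illustrate the simplification step in isolation, the sketch below applies `sympy` to a hand-written model string (assuming `sympy` is installed). The string is hypothetical and not the output of an actual run; the variable names `X1`, `X2`, `X3` are assumed here to stand for position, velocity and action.

```python
import sympy as sp

# Hypothetical model string with redundant constants and parentheses
model_string = "((X2 + (0.0015 * X3)) + ((-0.0005) * (5.0 * cos((3.0 * X1)))))"

# Parse the string into a symbolic expression and simplify it
expr = sp.simplify(sp.sympify(model_string))
print(expr)

# Evaluate the simplified expression at x = 0, v = 0, a = 1
X1, X2, X3 = sp.symbols("X1 X2 X3")
check = float(expr.subs({X1: 0.0, X2: 0.0, X3: 1.0}))
```

Simplification merges the constant factors, which makes it easier to compare the discovered coefficients with those of the known equation.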

In [None]:
!pip install sympy

In [None]:
#@title Solution:

import sympy as sp

print(sp.simplify(reg.get_model_string(reg.model_)))

**Bonus exercise**: Discover the equations of motion of different falling objects using
the experimental dataset contained [here](https://github.com/briandesilva/discovery-of-physics-from-data).