[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alucantonio/data_enhanced_simulation/blob/master/11_SymbolicRegression.ipynb)

# Symbolic regression

Symbolic regression (SR) is a machine learning technique that aims to discover mathematical
expressions that best describe a dataset. Unlike traditional regression methods that fit
predefined equation structures (like linear or polynomial regression), symbolic
regression searches through a space of possible mathematical expressions to find both
the form and parameters of the equation.

Here's a simple example:

- Traditional regression might try to fit $y = ax + b$
- Symbolic regression could discover $y = \sin(x^2) + x/2$

The main advantages of SR over other machine learning techniques are:

- No need to assume a specific form for the relationship
- Can discover novel mathematical relationships
- Results are interpretable mathematical expressions

Mathematical expressions can be represented and manipulated as _expression trees_:

<figure>
    <img src="Genetic_Program_Tree.png" alt="Expression tree of a mathematical formula" width="300" />
    <figcaption>Representation of a mathematical expression as a tree (from Wikipedia).</figcaption>
</figure>
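
As a minimal sketch of this idea, the expression $\sin(x^2) + x/2$ from the example above can be stored as a tree and evaluated recursively. The nested-tuple encoding and the helper names `evaluate` and `size` are illustrative assumptions, not the format of any particular GP library.

```python
import math
import operator

# Internal nodes: (function_name, child, ...). Leaves: a variable name or a number.
FUNCTIONS = {
    "add": operator.add,
    "mul": operator.mul,
    "div": operator.truediv,
    "sin": math.sin,
}

# Tree for sin(x * x) + x / 2
tree = ("add", ("sin", ("mul", "x", "x")), ("div", "x", 2.0))


def evaluate(node, variables):
    """Recursively evaluate an expression tree for the given variable values."""
    if isinstance(node, tuple):
        name, *children = node
        args = [evaluate(child, variables) for child in children]
        return FUNCTIONS[name](*args)
    if isinstance(node, str):
        return variables[node]  # variable leaf
    return node  # constant leaf


def size(node):
    """Number of nodes in the tree, a simple measure of expression complexity."""
    if isinstance(node, tuple):
        return 1 + sum(size(child) for child in node[1:])
    return 1


value = evaluate(tree, {"x": 1.5})
print(value, size(tree))
```

Storing expressions this way makes it straightforward to count nodes, copy subtrees, or swap them, which is what the evolutionary operators described later rely on.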

## Genetic Programming-based symbolic regression

Symbolic regression is typically performed with **genetic programming (GP)**, an _evolutionary_ technique that evolves a
population of trees (individuals) to find the best-fitting expression.
In GP, the variables and constants of an expression are the leaves of the tree and are called _terminals_, while the arithmetic
operations are internal nodes called _functions_. The sets of
allowed functions and terminals together form the _primitive set_ of a GP
system.
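
To make the notion of a primitive set concrete, here is a small sketch that draws random expression trees from one. The particular functions, the variable names `x1` and `x2`, the constant range, and the 30% chance of stopping early are all assumptions chosen for illustration.

```python
import random

# Hypothetical primitive set: function names with their arity, plus terminals.
FUNCTION_SET = {"add": 2, "sub": 2, "mul": 2, "cos": 1}
VARIABLES = ["x1", "x2"]


def random_terminal(rng):
    """Pick a variable or a random constant."""
    if rng.random() < 0.5:
        return rng.choice(VARIABLES)
    return round(rng.uniform(-1.0, 1.0), 3)


def random_tree(rng, max_depth):
    """Grow a random expression tree of depth at most max_depth."""
    if max_depth == 0 or rng.random() < 0.3:
        return random_terminal(rng)
    name = rng.choice(sorted(FUNCTION_SET))
    children = [random_tree(rng, max_depth - 1) for _ in range(FUNCTION_SET[name])]
    return (name, *children)


def depth(node):
    """Depth of a tree; a single terminal has depth 0."""
    if isinstance(node, tuple):
        return 1 + max(depth(child) for child in node[1:])
    return 0


rng = random.Random(0)
population = [random_tree(rng, max_depth=3) for _ in range(5)]
for individual in population:
    print(individual)
```

A list of such trees can serve as the random initial population of a GP run.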

As preliminary steps of any GP run, we need to define the primitive set and the **fitness** function,
which measures how good an individual (candidate expression) is at fitting a given
dataset. For example, the fitness could be related to the MSE on the training set, and
it could include a penalty term to favor simpler solutions (i.e. shorter expressions).
We should also define some stopping criteria for the evolution, such as the maximum
number of _generations_ or a threshold on the fitness.
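
A fitness function of this kind can be sketched as follows. The tiny evaluator (only `add` and `mul` on one variable), the penalty weight of 0.01, and the toy data are assumptions made for the example; lower values mean a better individual.

```python
def evaluate(node, x):
    """Evaluate a tiny expression tree with a single variable "x"."""
    if isinstance(node, tuple):
        op, left, right = node
        a, b = evaluate(left, x), evaluate(right, x)
        return a + b if op == "add" else a * b
    return x if node == "x" else node


def size(node):
    """Number of nodes in the tree."""
    if isinstance(node, tuple):
        return 1 + sum(size(child) for child in node[1:])
    return 1


def fitness(tree, xs, ys, penalty=0.01):
    """MSE on the data plus a penalty proportional to tree size (lower is better)."""
    mse = sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + penalty * size(tree)


# Toy dataset generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]

exact = ("add", ("mul", 2.0, "x"), 1.0)  # 2x + 1
bloated = ("add", ("add", ("mul", 2.0, "x"), 1.0), ("mul", 0.0, "x"))  # same values, more nodes
crude = ("mul", 2.0, "x")  # 2x, misses the offset

for name, tree in [("exact", exact), ("bloated", bloated), ("crude", crude)]:
    print(name, fitness(tree, xs, ys))
```

The `exact` and `bloated` trees fit the data equally well, so only the size penalty separates them, while `crude` is shorter but pays for its larger error.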

Pseudo-code of the GP-based SR algorithm:
```pseudo
Initialize population (random)
Evaluate fitness of each individual

For each generation:
    Select parents using selection strategy
    Apply crossover and mutation to create offspring
    Evaluate fitness of new individuals
    Update population (replace least fit)
    
    If stopping criteria met:
        Break

Return best individual
```


Typical parent selection methods are **tournament** and **uniform** selection, which are analogous to
those found in **genetic algorithms**. Crossover and mutation operators are also typical of
genetic algorithms; in the context of GP, they are applied to the expression trees:

<figure>
    <img src="gp_schematic.png" alt="Schematic of crossover and mutation applied to expression trees" width="600" />
    <figcaption>Crossover and mutation in Genetic Programming (taken from Quade et al., 2016).</figcaption>
</figure>
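
The operators can be sketched on trees stored as nested tuples. This is a simplified illustration under assumed names and encodings: the crossover shown produces a single child, and the mutation simply replaces a random subtree with a terminal, whereas actual GP libraries offer several variants of both.

```python
import random


def paths(node, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(node, tuple):
        for i, child in enumerate(node[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(node, path):
    """Return the subtree found at the given path."""
    for i in path:
        node = node[i]
    return node


def replace(node, path, subtree):
    """Return a copy of the tree with the subtree at `path` replaced."""
    if not path:
        return subtree
    i = path[0]
    return node[:i] + (replace(node[i], path[1:], subtree),) + node[i + 1:]


def tournament(population, scores, rng, k=2):
    """Pick k individuals at random and return the one with the lowest score."""
    contenders = rng.sample(range(len(population)), k)
    return population[min(contenders, key=lambda i: scores[i])]


def crossover(parent_a, parent_b, rng):
    """Replace a random subtree of parent_a with a random subtree of parent_b."""
    path_a = rng.choice(list(paths(parent_a)))
    path_b = rng.choice(list(paths(parent_b)))
    return replace(parent_a, path_a, get(parent_b, path_b))


def mutate(tree, rng, terminals=("x", 1.0, 2.0)):
    """Replace a random subtree with a randomly chosen terminal."""
    path = rng.choice(list(paths(tree)))
    return replace(tree, path, rng.choice(terminals))


rng = random.Random(42)
population = [
    ("add", "x", 1.0),
    ("mul", ("add", "x", 2.0), "x"),
    ("sub", "x", ("mul", "x", "x")),
]
scores = [0.5, 0.1, 0.9]  # made-up fitness values (lower is better)

parent_a = tournament(population, scores, rng)
parent_b = tournament(population, scores, rng)
child = mutate(crossover(parent_a, parent_b, rng), rng)
print(child)
```

Because tuples are immutable, `replace` builds a new tree instead of modifying a parent, so the parents remain available for further selection rounds.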

## Discovering the dynamics of the environment

As a test case, we use SR to recover the velocity update rule of the `MountainCarContinuous-v0` environment from transitions collected with random actions.

In [170]:
import gymnasium as gym
import numpy as np
import pandas as pd

# Initialize the MountainCarContinuous-v0 environment
env = gym.make("MountainCarContinuous-v0")

# Parameters
num_samples = 20000  # Number of samples to generate

# Initialize storage for dataset
data = []

# Reset the environment to a random initial state
state, _ = env.reset()

for _ in range(num_samples):
    # Get current position and velocity from the state
    position, velocity = state
    
    # Use the environment's action space to sample a random action
    action = env.action_space.sample()
    
    # Apply the action and observe the next state
    next_state, _, done, truncated, _ = env.step(action)
    
    # Extract next velocity
    next_velocity = next_state[1]
    
    # Append the current state, action, and next velocity to the dataset
    data.append([position, velocity, action[0], next_velocity])
    
    # Check if the episode has ended
    if done or truncated:
        # Reset the environment and start a new episode
        state, _ = env.reset()
    else:
        # Update the state for the next step
        state = next_state

# Convert to a Pandas DataFrame
columns = ["position", "velocity", "action", "next_velocity"]
dataset = pd.DataFrame(data, columns=columns)

# Dataset is now stored in the `dataset` variable.
env.close()

In [171]:
dataset

|       | position  | velocity | action    | next_velocity |
|-------|-----------|----------|-----------|---------------|
| 0     | -0.566266 | 0.000000 | 0.504779  | 0.001076      |
| 1     | -0.565190 | 0.001076 | -0.154902 | 0.001155      |
| 2     | -0.564034 | 0.001155 | -0.687653 | 0.000426      |
| 3     | -0.563608 | 0.000426 | -0.244649 | 0.000358      |
| 4     | -0.563250 | 0.000358 | -0.159764 | 0.000416      |
| ...   | ...       | ...      | ...       | ...           |
| 19995 | -0.556397 | 0.004981 | -0.829353 | 0.003982      |
| 19996 | -0.552415 | 0.003982 | -0.301219 | 0.003746      |
| 19997 | -0.548669 | 0.003746 | 0.354045  | 0.004465      |
| 19998 | -0.544203 | 0.004465 | 0.191231  | 0.004907      |


In [172]:
X = dataset[["position", "velocity", "action"]].to_numpy()
y = dataset[["next_velocity"]].to_numpy().ravel()

In [173]:
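# Exact velocity update: next_velocity = velocity + 0.0015*action - 0.0025*cos(3*position)
# Columns of X: 0 = position, 1 = velocity, 2 = action
exact_next_velocity = X[:,1] + 0.0015*X[:,2] - 0.0025*np.cos(3*X[:,0])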

In [174]:
exact_next_velocity

array([0.0010763 , 0.00115507, 0.00042611, ..., 0.00446539, 0.00490667,
       0.00651842], dtype=float32)

In [175]:
y

array([0.0010763 , 0.00115507, 0.00042611, ..., 0.00446539, 0.00490667,
       0.00651842], dtype=float32)
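
The agreement between `exact_next_velocity` and `y` can also be checked by hand on a single sample. The sketch below rewrites the same update in plain Python and applies it to the first row of the dataset. The function and constant names are illustrative, and the sketch ignores the bounds that the environment additionally imposes on the velocity.

```python
import math

POWER = 0.0015    # coefficient multiplying the action
GRAVITY = 0.0025  # coefficient multiplying cos(3 * position)


def next_velocity(position, velocity, action):
    """One step of the velocity update: v + 0.0015 * a - 0.0025 * cos(3 * p)."""
    return velocity + POWER * action - GRAVITY * math.cos(3.0 * position)


# First row of the dataset shown above
predicted = next_velocity(position=-0.566266, velocity=0.0, action=0.504779)
recorded = 0.001076  # next_velocity stored in the first row

print(f"predicted={predicted:.6f} recorded={recorded:.6f}")
```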

In [219]:
from sklearn.model_selection import train_test_split

from pyoperon.sklearn import SymbolicRegressor

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, shuffle=True)

# Scaling is skipped here: the regressor is trained on the raw features and target,
# so the fitted expression is written in terms of the original variables.

reg = SymbolicRegressor(
    # Primitive set: functions (add, mul, sub, cos) and terminals (constant, variable)
    allowed_symbols='add,mul,sub,cos,constant,variable',
    optimizer_iterations=10,  # local optimization iterations for the numeric coefficients
    max_length=15,            # maximum number of nodes in an expression tree
    n_threads=32,
    objectives=['r2'],        # fitness based on the coefficient of determination
    population_size=500,
    tournament_size=2,
    generations=100,
)

reg.fit(X_train, y_train)
print(reg.score(X_train, y_train))  # R^2 on the training set
print(reg.score(X_test, y_test))    # R^2 on the test set

1.0
1.0


In [220]:
reg.get_model_string(reg.model_)

'((-0.000) + (1.000 * ((cos(((3.000 * X1) + (-0.000))) * (-0.003)) - (((-0.002) * X3) - ((1.000 * X2) + (0.000 * X1))))))'

In [221]:
import sympy as sp

print(sp.simplify(reg.get_model_string(reg.model_)))

1.0*X2 + 0.002*X3 - 0.003*cos(3.0*X1)

Here `X1`, `X2` and `X3` correspond to position, velocity and action, respectively. The recovered expression has the same structure as the exact update rule, and its coefficients are consistent with 0.0015 and 0.0025 once these are rounded to the three decimals shown in the printed model.
