# Symbolic Regression on Feynman Equations using Genetic Programming

Symbolic regression is a type of regression analysis where the goal is to discover mathematical expressions that best describe a given dataset. Unlike traditional regression, symbolic regression does not assume a predefined model structure. Instead, it searches for both the structure and the parameters that best fit the data. This approach can yield interpretable, analytical models that help uncover underlying relationships in the data.
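
To make "searching for both structure and parameters" concrete, here is a minimal sketch in plain Python. It is not how `gplearn` represents programs internally: the nested-tuple encoding, the `evaluate` and `rmse` helpers, and the synthetic sample are all assumptions made for illustration. Two candidate structures are scored against data generated from the Euclidean distance between two points.

```python
import math
import random

# Hypothetical encoding: an expression is a variable name, a constant,
# or a tuple (operator, operand, ...). For illustration only.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "sqrt": lambda a: math.sqrt(abs(a)),  # protected: never fails on negatives
}

def evaluate(expr, row):
    """Recursively evaluate an expression tree on one data row (a dict)."""
    if isinstance(expr, str):
        return row[expr]
    if isinstance(expr, (int, float)):
        return float(expr)
    op, *args = expr
    return OPS[op](*(evaluate(a, row) for a in args))

def rmse(expr, rows, targets):
    """Root mean squared error of an expression over a dataset."""
    errors = [(evaluate(expr, r) - t) ** 2 for r, t in zip(rows, targets)]
    return math.sqrt(sum(errors) / len(errors))

# Synthetic data: distance between (x1, y1) and (x2, y2).
rng = random.Random(0)
rows = [{k: rng.uniform(1, 5) for k in ("x1", "x2", "y1", "y2")}
        for _ in range(200)]
targets = [math.hypot(r["x2"] - r["x1"], r["y2"] - r["y1"]) for r in rows]

dx = ("sub", "x2", "x1")
dy = ("sub", "y2", "y1")
wrong_structure = ("add", dx, dy)
right_structure = ("sqrt", ("add", ("mul", dx, dx), ("mul", dy, dy)))

print("wrong structure RMSE:", rmse(wrong_structure, rows, targets))
print("right structure RMSE:", rmse(right_structure, rows, targets))
```

The second candidate reproduces the data up to floating-point rounding, while the first cannot, whatever constants are attached to it. A symbolic regressor has to find that structure on its own.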

In this notebook, we will apply symbolic regression to a set of well-known physical equations: the Feynman Equations. These equations are drawn from Richard Feynman's *Lectures on Physics* and describe fundamental physical phenomena in areas such as mechanics, electromagnetism, and thermodynamics.

We will perform symbolic regression with **genetic programming** (GP), using `gplearn`, a `scikit-learn`-inspired Python library for GP.

Let us import some useful modules. 
If you are using `conda`, you can install `graphviz` with the following commands:

```
conda install graphviz
conda install python-graphviz
conda install pydot
```


In [9]:
# Install PMLB (Penn Machine Learning Benchmarks) from GitHub
%pip install -U git+https://github.com/EpistasisLab/pmlb



In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from IPython.display import clear_output
from pmlb import fetch_data
import gplearn.genetic as gp
import matplotlib.pyplot as plt
import graphviz

Read the equations from the CSV file and fetch the corresponding data.

In [2]:
eq_df = pd.read_csv('../Data/FeynmanEquations.csv')
eq_df.dropna(axis = 0, how = 'all', inplace = True)
eq_df.head()

Unnamed: 0,Filename,Number,Output,Formula,# variables,v1_name,v1_low,v1_high,v2_name,v2_low,...,v7_high,v8_name,v8_low,v8_high,v9_name,v9_low,v9_high,v10_name,v10_low,v10_high
0,I.6.2a,1.0,f,exp(-theta**2/2)/sqrt(2*pi),1.0,theta,1.0,3.0,,,...,,,,,,,,,,
1,I.6.2,2.0,f,exp(-(theta/sigma)**2/2)/(sqrt(2*pi)*sigma),2.0,sigma,1.0,3.0,theta,1.0,...,,,,,,,,,,
2,I.6.2b,3.0,f,exp(-((theta-theta1)/sigma)**2/2)/(sqrt(2*pi)*...,3.0,sigma,1.0,3.0,theta,1.0,...,,,,,,,,,,
3,I.8.14,4.0,d,sqrt((x2-x1)**2+(y2-y1)**2),4.0,x1,1.0,5.0,x2,1.0,...,,,,,,,,,,
4,I.9.18,5.0,F,G*m1*m2/((x2-x1)**2+(y2-y1)**2+(z2-z1)**2),9.0,m1,1.0,2.0,m2,1.0,...,2.0,z1,3.0,4.0,z2,1.0,2.0,,,


In [3]:
eq_df.Filename = eq_df.Filename.apply(lambda x: 'feynman_' + x.replace('.', '_'))
eq_df = eq_df.loc[:, ['Filename', 'Formula']]

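# The equation listed as I.15.1 in the original source is named feynman_I_15_10 in PMLB.
# Series.replace with a dict matches whole values, so no other filename is affected.
eq_df['Filename'] = eq_df['Filename'].replace({'feynman_I_15_1': 'feynman_I_15_10'})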
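
The renaming rule used in this cell can be sketched as a standalone function. The names `to_pmlb_name` and `overrides` are hypothetical helpers introduced here, not part of `pmlb`. The sketch assumes that dataset names are the equation IDs with dots replaced by underscores and a `feynman_` prefix, plus the single exception noted in the cell.

```python
def to_pmlb_name(feynman_id, overrides=None):
    """Map an equation ID such as 'I.8.14' to a PMLB-style dataset name."""
    overrides = overrides or {}
    name = "feynman_" + feynman_id.replace(".", "_")
    # Exact-match lookup, so an override can never alter a longer name.
    return overrides.get(name, name)

# The one exception mentioned in this notebook.
overrides = {"feynman_I_15_1": "feynman_I_15_10"}

print(to_pmlb_name("I.8.14", overrides))
print(to_pmlb_name("I.15.1", overrides))
```

Looking up whole names is a little safer than a substring replacement, which could also rewrite any other name that happens to start with the same characters.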

Select a subset of equations, since testing all of them would take too long.

In [4]:
dataset_names = eq_df["Filename"].to_list()
datasets_to_test_names = dataset_names[3:8]

In [5]:
datasets={}
for name in datasets_to_test_names:
    datasets[name] = fetch_data(name)

Now, for each equation in the dataset, perform symbolic regression using GP. Split each dataset into a training set and a validation set. Select a validation metric to evaluate performance, noting that this metric is not necessarily the same as the GP fitness function. Experiment with different sets of hyperparameters to observe how the results change.

Take a look at the [documentation](https://gplearn.readthedocs.io/en/stable/intro.html).
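
The distinction between the fitness function and the validation metric can be illustrated without any library. The sketch below is an assumption-laden toy, with hypothetical helpers `holdout_split`, `rmse`, and `r2`. It splits a list of rows into training and validation parts and computes two metrics on the same predictions: RMSE (lower is better) and the coefficient of determination R² (higher is better).

```python
import math
import random

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle indices and split rows into (train, validation) lists."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_val = int(round(len(rows) * test_fraction))
    val = [rows[i] for i in idx[:n_val]]
    train = [rows[i] for i in idx[n_val:]]
    return train, val

def rmse(y_true, y_pred):
    """Root mean squared error."""
    sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

rows = list(range(10))
train, val = holdout_split(rows)
print(len(train), len(val))

# Toy targets and predictions, chosen only to exercise the two metrics.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print("RMSE:", rmse(y_true, y_pred))
print("R^2: ", r2(y_true, y_pred))
```

In the notebook itself, `train_test_split` and `r2_score` from `scikit-learn` play these roles. The point of the sketch is only that the two metrics rank models in opposite directions, which matters when reading a validation curve.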

In [6]:
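# Compatibility shim: gplearn calls the estimator method `_validate_data`,
# while recent scikit-learn releases provide `validate_data` as a function.
# Patch the method back onto gplearn's base class so that fit() keeps working.
from gplearn.genetic import BaseSymbolic
from sklearn.utils.validation import validate_data

BaseSymbolic._validate_data = lambda self, *args, **kwargs: validate_data(
    self,
    *args,
    **kwargs,
)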

In [7]:
random_state = 0

#hyperparameters
max_gen = 50
fset = ("add", "sub", "mul", "div", "sqrt", "log", "abs", "sin") # Hint: you can also define your own!
pop_size = 500
tournament_size = 3
parsimony_coefficient = 0.01
p_crossover = 0.8
p_subtree_mutation = 0.05
p_hoist_mutation = 0.05
p_point_mutation = 0.05
fitness = "rmse"
val_fit = "rmse" # not used
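
Two of these hyperparameters are easier to reason about with a small sketch. To our understanding of the `gplearn` documentation, the parsimony coefficient adds a cost proportional to program length to the raw fitness, and tournament selection draws `tournament_size` programs at random and keeps the fittest. The code below is a simplified illustration under those assumptions, not `gplearn`'s implementation, and the individuals in `population` are made up.

```python
import random

def penalized_fitness(raw_rmse, length, parsimony_coefficient=0.01):
    """Lower is better: add a cost proportional to program length."""
    return raw_rmse + parsimony_coefficient * length

def tournament(population, tournament_size, rng):
    """Return the best of `tournament_size` randomly drawn individuals."""
    contenders = rng.sample(population, tournament_size)
    return min(
        contenders,
        key=lambda ind: penalized_fitness(ind["rmse"], ind["length"]),
    )

# Made-up individuals, purely to exercise the functions.
population = [
    {"name": "short", "rmse": 0.50, "length": 5},
    {"name": "long", "rmse": 0.45, "length": 40},
    {"name": "medium", "rmse": 0.48, "length": 12},
]

rng = random.Random(0)
winner = tournament(population, tournament_size=3, rng=rng)
print(winner["name"])

# The four operator probabilities set above sum to 0.95; the remainder
# is the share of offspring copied unchanged (reproduction).
p_reproduction = 1.0 - (0.8 + 0.05 + 0.05 + 0.05)
print(round(p_reproduction, 2))
```

Here the longest program has the lowest raw error but loses the tournament once its length is penalized. Raising `parsimony_coefficient` strengthens that pressure toward compact formulas, and lowering it lets larger trees survive.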


In [14]:
for name, df in datasets.items():
    print(f"name {name} :\n {df}")

name feynman_I_8_14 :
              x1        x2        y1        y2    target
0      2.982188  1.035629  4.865618  2.303791  3.217460
1      1.605159  2.459064  4.388424  1.136661  3.362010
2      4.514210  4.468167  1.110199  4.640095  3.530196
3      2.653459  2.930368  2.057558  1.698226  0.453650
4      3.240077  4.505289  1.533158  3.633271  2.451783
...         ...       ...       ...       ...       ...
99995  1.811641  1.699586  2.205231  4.559877  2.357311
99996  2.046862  2.901612  1.652238  3.206971  1.774201
99997  3.862267  4.321029  4.439863  3.786394  0.798426
99998  4.786567  1.106425  3.091165  1.830566  3.890059
99999  1.714058  4.634211  2.306098  3.846372  3.301475

[100000 rows x 5 columns]
name feynman_I_9_18 :
              m1        m2         G        x1        x2        y1        y2  \
0      1.859444  1.589649  1.597527  3.298641  1.573143  3.549269  1.220063   
1      1.031807  1.190607  1.289690  3.721903  1.287928  3.576842  1.614692   
2      1.994773  1

In [None]:
np.random.seed(random_state)
results = pd.DataFrame(columns=['dataset', 'best_fit', 'original', 'equation'])
loss_histories = {}

for cnt, (name, df) in enumerate(datasets.items(), start=1):
    # Hold out 20% of the samples for validation
    X = df.drop(columns=["target"])
    y = df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)

    var_names = X.columns.tolist()

    # Start with a single generation; warm_start=True lets each later call
    # to fit() continue from the current population instead of restarting
    sr = gp.SymbolicRegressor(population_size=pop_size,
                              generations=1,
                              tournament_size=tournament_size,
                              function_set=fset,
                              metric=fitness,
                              parsimony_coefficient=parsimony_coefficient, # Penalization of large trees
                              p_crossover=p_crossover, # Probability of subtree crossover
                              p_subtree_mutation=p_subtree_mutation, # Probability of subtree mutation
                              p_hoist_mutation=p_hoist_mutation, # Probability of hoist mutation
                              p_point_mutation=p_point_mutation, # Probability of point mutation
                              feature_names=var_names,
                              verbose=0, # Set to 1 to print the fitness values
                              random_state=random_state,
                              warm_start=True)

    # Evolve one generation at a time and record the validation score
    loss_history = []
    for gen in range(1, max_gen + 1):
        sr.set_params(generations=gen)
        sr.fit(X_train, y_train)
        # score() returns R^2 on the held-out data (higher is better)
        loss_history.append(sr.score(X_test, y_test))

    orig = eq_df[eq_df.Filename == name]['Formula'].tolist()

    best_fit = loss_history[-1]
    loss_histories[name] = loss_history
    results.loc[len(results)] = [name, best_fit, orig[0], sr._program]
    clear_output()
    print(f'{cnt} {name} best_fit: {best_fit}')

clear_output()
results

Visualize the results.

In [None]:
# Plot the history for the validation metrics
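
One possible way to fill in this cell is sketched below. It assumes that the histories are stored as a dictionary mapping each dataset name to a list with one validation score per generation, which is how `loss_histories` is built in the training loop. The function name `plot_histories` and its `ylabel` argument are hypothetical.

```python
import matplotlib.pyplot as plt

def plot_histories(histories, ylabel="validation score"):
    """Draw one curve per dataset from a {name: [score per generation]} dict."""
    fig, ax = plt.subplots(figsize=(8, 5))
    for name, history in histories.items():
        generations = range(1, len(history) + 1)
        ax.plot(generations, history, label=name)
    ax.set_xlabel("generation")
    ax.set_ylabel(ylabel)
    ax.legend()
    return ax
```

In the notebook this would be called as `plot_histories(loss_histories)` followed by `plt.show()`. Drawing all datasets on a single set of axes makes it easy to compare how quickly each equation is recovered, although a very poor early score on one dataset can stretch the vertical scale for the others.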

In [None]:
# Render the expression tree of one evolved program (here row 3 of `results`)
dot_data = results.loc[3, "equation"].export_graphviz()
graphviz.Source(dot_data)