# Symbolic Regression on Feynman Equations using Genetic Programming

Symbolic regression is a type of regression analysis where the goal is to discover mathematical expressions that best describe a given dataset. Unlike traditional regression, symbolic regression does not assume a predefined model structure. Instead, it searches for both the structure and the parameters that best fit the data. This approach can yield interpretable, analytical models that help uncover underlying relationships in the data.
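
As a rough illustration of searching over both structure and parameters, the sketch below tries a handful of hand-picked candidate structures and fits one scale parameter for each by least squares. The toy data, the candidate list, and the helper `fit_scale` are assumptions made for this demo. GP explores a far larger space of expressions than this brute-force loop.

```python
import numpy as np

# Toy data generated from a known law, y = 3 * x**2 (an assumption for this demo).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=200)
y = 3.0 * x**2

# Hypothetical candidate structures, each of the form y = a * g(x).
candidates = {
    "a * x": x,
    "a * x**2": x**2,
    "a * sqrt(x)": np.sqrt(x),
    "a * exp(x)": np.exp(x),
}

def fit_scale(g, y):
    """Least-squares estimate of a in y = a * g, plus the resulting MSE."""
    a = float(g @ y / (g @ g))
    mse = float(np.mean((y - a * g) ** 2))
    return a, mse

# Search over structures; fit the parameter separately for each one.
fits = {name: fit_scale(g, y) for name, g in candidates.items()}
best_name = min(fits, key=lambda name: fits[name][1])
best_a, best_mse = fits[best_name]
print(f"best structure: {best_name}, a = {best_a:.3f}, mse = {best_mse:.2e}")
```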

In this notebook, we will apply symbolic regression to a set of well-known physical equations: the Feynman equations. These equations, taken from the Feynman Lectures on Physics, describe fundamental physical phenomena in areas such as mechanics, electromagnetism, and thermodynamics.

We will perform symbolic regression with **genetic programming** (GP), using `gplearn`, a `scikit-learn`-inspired Python library for GP.

Let us first install and import the modules we need.
If you are using `conda`, you can install `graphviz` and its Python interfaces with the following commands:

```
conda install graphviz
conda install python-graphviz
conda install pydot
```


In [None]:
# Install PMLB (Penn Machine Learning Benchmarks), which provides the Feynman datasets
%pip install -U git+https://github.com/EpistasisLab/pmlb

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from IPython.display import clear_output
from pmlb import fetch_data
import gplearn.genetic as gp
import matplotlib.pyplot as plt
import graphviz

Read the equations from the CSV file and fetch the corresponding data.

In [4]:
eq_df = pd.read_csv('../Data/FeynmanEquations.csv')
eq_df.dropna(axis = 0, how = 'all', inplace = True)
eq_df.head()

Unnamed: 0,Filename,Number,Output,Formula,# variables,v1_name,v1_low,v1_high,v2_name,v2_low,...,v7_high,v8_name,v8_low,v8_high,v9_name,v9_low,v9_high,v10_name,v10_low,v10_high
0,I.6.2a,1.0,f,exp(-theta**2/2)/sqrt(2*pi),1.0,theta,1.0,3.0,,,...,,,,,,,,,,
1,I.6.2,2.0,f,exp(-(theta/sigma)**2/2)/(sqrt(2*pi)*sigma),2.0,sigma,1.0,3.0,theta,1.0,...,,,,,,,,,,
2,I.6.2b,3.0,f,exp(-((theta-theta1)/sigma)**2/2)/(sqrt(2*pi)*...,3.0,sigma,1.0,3.0,theta,1.0,...,,,,,,,,,,
3,I.8.14,4.0,d,sqrt((x2-x1)**2+(y2-y1)**2),4.0,x1,1.0,5.0,x2,1.0,...,,,,,,,,,,
4,I.9.18,5.0,F,G*m1*m2/((x2-x1)**2+(y2-y1)**2+(z2-z1)**2),9.0,m1,1.0,2.0,m2,1.0,...,2.0,z1,3.0,4.0,z2,1.0,2.0,,,


In [5]:
eq_df.Filename = eq_df.Filename.apply(lambda x: 'feynman_' + x.replace('.', '_'))
eq_df = eq_df.loc[:, ['Filename', 'Formula']]

# The equation listed as I.15.1 in the original source is named feynman_I_15_10 in PMLB.
# Series.replace with a dict matches whole values only, so no other name is affected.
eq_df.Filename = eq_df.Filename.replace({'feynman_I_15_1': 'feynman_I_15_10'})

Select a subset of the equations, since there is not enough time to test them all.

In [6]:
dataset_names = eq_df["Filename"].to_list()
datasets_to_test_names = dataset_names[3:8]

In [8]:
datasets={}
for name in datasets_to_test_names:
    datasets[name] = fetch_data(name)

ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

Now, for each equation in the dataset, perform symbolic regression using GP. Split each dataset into a training set and a validation set. Select a validation metric to evaluate performance, noting that this metric is not necessarily the same as the GP fitness function. Experiment with different sets of hyperparameters to observe how the results change.

Take a look at the [documentation](https://gplearn.readthedocs.io/en/stable/intro.html).
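
A minimal sketch of the split and of a validation metric, assuming each dataset is a DataFrame with feature columns and a `target` column. The synthetic `demo_df`, the 25% validation fraction, and the choice of R² are assumptions for illustration, not prescribed settings.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one PMLB dataset: feature columns plus a 'target' column.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"m": rng.uniform(1, 5, 500), "v": rng.uniform(1, 5, 500)})
demo_df["target"] = 0.5 * demo_df["m"] * demo_df["v"] ** 2

X = demo_df.drop(columns=["target"]).to_numpy()
y = demo_df["target"].to_numpy()

# Hold out 25% of the rows for validation (the fraction is an arbitrary choice).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Score a hand-written candidate formula on the validation set with R^2.
y_pred = 0.5 * X_val[:, 0] * X_val[:, 1] ** 2
val_r2 = r2_score(y_val, y_pred)
print(X_train.shape, X_val.shape, val_r2)
```

The GP fitness (for example an error measure on the training set) drives the search, while a score such as `val_r2` only checks how well the evolved formula generalizes to held-out rows.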

In [None]:
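# Compatibility shim: gplearn calls the private `_validate_data` method, which recent
# scikit-learn releases replaced with the `validate_data` function imported below.
from gplearn.genetic import BaseSymbolic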
from sklearn.utils.validation import validate_data

BaseSymbolic._validate_data = lambda self, *args, **kwargs: validate_data(
    self,
    *args,
    **kwargs,
)

In [None]:
random_state = 0

#hyperparameters
max_gen = 
fset = () # Hint: you can also define your own!
pop_size = 
tournament_size = 
parsimony_coefficient = 
p_crossover =
p_subtree_mutation = 
p_hoist_mutation=
p_point_mutation=
fitness =
val_fit = 


In [None]:
np.random.seed(random_state)
results = pd.DataFrame(columns=['dataset','best_fit', 'original', 'equation'])
loss_histories = {}

cnt = 0
for name, df in datasets.items():
    cnt = cnt + 1
    # CODE HERE
    
    loss_history = []
    var_names = df.drop(columns=['target']).columns.tolist()
    
    sr = gp.SymbolicRegressor(population_size=pop_size,
                                tournament_size=tournament_size,
                                function_set=fset,
                                parsimony_coefficient=parsimony_coefficient,
                                p_crossover=p_crossover,
                                p_subtree_mutation=p_subtree_mutation, # Probability of subtree mutation
                                p_hoist_mutation=p_hoist_mutation, # Small probability of hoist mutation
                                p_point_mutation=p_point_mutation, # Small probability of point mutation
                                generations=1,
                                random_state=random_state,
                                feature_names=var_names,
                                warm_start=True,  # keep the population between successive fit() calls
                                metric=fitness
    )
    
    for i in range(0, max_gen+1):
        # CODE HERE
    
    orig = eq_df[eq_df.Filename == name]['Formula'].tolist()
    
    best_fit = loss_history[-1]
    loss_histories[name] = loss_history
    results.loc[len(results)] = [name, best_fit, orig[0], sr._program]
    clear_output()
    print(f'{cnt} {name} best_fit: {best_fit}')

clear_output()
results
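
The inner loop marked `# CODE HERE` could be organized along the lines of the helper below. This is a sketch, not the intended solution: the function name `evolve_with_history` and the split arrays `X_train`, `y_train`, `X_val`, `y_val` are assumptions, and it runs exactly `max_gen` generations.

```python
def evolve_with_history(sr, X_train, y_train, X_val, y_val, max_gen, val_fit):
    """Evolve `sr` one generation at a time and record a validation score.

    `sr` is assumed to be a gplearn SymbolicRegressor created with
    generations=1 and warm_start=True, so each call to fit() continues
    from the current population instead of starting over.
    """
    history = []
    for gen in range(1, max_gen + 1):
        # With warm_start, `generations` is the cumulative total, so it has
        # to grow by one before every call to fit().
        sr.set_params(generations=gen)
        sr.fit(X_train, y_train)
        history.append(val_fit(y_val, sr.predict(X_val)))
    return history
```

Inside the dataset loop it would be called as `loss_history = evolve_with_history(sr, X_train, y_train, X_val, y_val, max_gen, val_fit)`, after splitting `df` into training and validation arrays.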

Visualize the results.

In [None]:
# Plot the history for the validation metrics
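
A possible way to draw these curves, assuming `loss_histories` maps each dataset name to a list with one validation score per generation. The helper name `plot_loss_histories` and the axis labels are illustrative choices.

```python
import matplotlib.pyplot as plt

def plot_loss_histories(loss_histories, metric_name="validation score"):
    """Draw one curve per dataset from a {name: [score per generation]} dict."""
    fig, ax = plt.subplots(figsize=(8, 5))
    for name, history in loss_histories.items():
        ax.plot(range(1, len(history) + 1), history, label=name)
    ax.set_xlabel("generation")
    ax.set_ylabel(metric_name)
    ax.legend()
    return fig, ax
```

In the notebook, `plot_loss_histories(loss_histories)` followed by `plt.show()` would display the figure. A logarithmic y-axis (`ax.set_yscale("log")`) may help when the metric is an error that spans several orders of magnitude.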

In [None]:
# Render the expression tree of one evolved program (here, row 3 of `results`)
dot_data = results.loc[3, "equation"].export_graphviz()
graphviz.Source(dot_data)