# 2022 Flatiron Machine Learning x Science Summer School

## Step 1: Create generic data from algebraic equations

In this step, we want to create data from algebraic equations of the form $f(g(x))$, where $g: \mathbb{R}^a \rightarrow \mathbb{R}^b$, $f: \mathbb{R}^b \rightarrow \mathbb{R}^c$ and $f \circ g: \mathbb{R}^a \rightarrow \mathbb{R}^c$. One example could be

$g(x) = x_1 \cdot \cos(x_3)$

$f(x) = x^2 + x$

$f(g(x)) = g(x)^2 + g(x) = x_1^2 \cdot \cos^2(x_3) + x_1 \cdot \cos(x_3)$

where $a = 3$, $b = 1$ and $c = 1$. 
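
The example can be checked numerically. The following is a minimal sketch, assuming zero-based column indexing (so $x_1$ and $x_3$ map to columns 0 and 2); the helper names `g_example` and `f_example` are hypothetical and not part of the notebook.

```python
import numpy as np

def g_example(x):
    """g: R^3 -> R^1 with g(x) = x_1 * cos(x_3); x has shape (n, 3)."""
    return (x[:, 0] * np.cos(x[:, 2])).reshape(-1, 1)

def f_example(y):
    """f: R^1 -> R^1 with f(y) = y^2 + y; y has shape (n, 1)."""
    return y**2 + y

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))

# composition f(g(x))
composed = f_example(g_example(x))

# closed form from the text: x_1^2 cos^2(x_3) + x_1 cos(x_3)
closed_form = (
    x[:, 0]**2 * np.cos(x[:, 2])**2 + x[:, 0] * np.cos(x[:, 2])
).reshape(-1, 1)
```

Both arrays have shape `(n, c) = (5, 1)` and agree up to floating-point error.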

Ideally, we would like to define $f$ and $g$ such that $f \circ g$ is difficult to discover for symbolic regression, while $f$ and $g$ individually are easily discoverable.

Let's get started!

In [1]:
import os
import numpy as np

### Step 1.1: Create standard input data

Since the input data in machine learning is often standardized as part of the preprocessing anyway, we simply sample from $\mathcal{N}(0, 1)$.

In [2]:
seed = 0

inputs = {
    "X01": (int(1e3), 2),
    "X02": (int(1e3), 2),
    "X03": (int(1e3), 3),
    "X04": (int(1e3), 3),
    "X05": (int(1e3), 5),
    "X00": (int(1e3), 2),
    "X06": (int(1e3), 8),
    "X07": (int(1e3), 2),
    "X09": (int(1e3), 2),
    "X10": (int(1e3), 2),
}

In [3]:
np.random.seed(seed)

data = {}
for var in inputs:
    data[var] = np.random.normal(size=inputs[var])

In [4]:
data_path = "data_1k"
data_ext = ".gz"
reload = False

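# load additional input data (files named X*.gz) that was not generated above
for file_name in os.listdir(data_path):
    var = file_name.split('.')[0]
    file_ext = file_name.split('.')[-1]

    # startswith also handles hidden files such as ".DS_Store", where var is empty
    if var.startswith("X") and file_ext == data_ext[1:] and (reload or var not in data):
        load_data = np.loadtxt(os.path.join(data_path, file_name))
        # np.loadtxt returns a 1-D array for single-column files; restore shape (n, 1)
        if len(load_data.shape) == 1:
            load_data = load_data.reshape(-1,1)
        data[var] = load_data
        print(f"Loaded {var} data.")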

Loaded X08 data.
Loaded X10_100 data.
Loaded X11 data.
Loaded X11_100 data.
Loaded X11_1000 data.
Loaded X11_10000 data.
Loaded X11_std0100 data.
Loaded X11_std0125 data.
Loaded X11_std0200 data.
Loaded X11_std0300 data.
Loaded X11_std1000 data.


### Step 1.2: Create target function data

As discussed above, we want to define easily discoverable functions $f$ and $g$ whose composition $f \circ g$ is difficult to discover for symbolic regression.

Additionally, it would be good to cover different combinations of $b$ and $c$.

For $b \gt 1$, a potential issue is that there might not be a unique solution. For $a = 2$, $b = 2$, $c = 1$, the example

$g(x) = [x_1^2, x_1 x_2]$

$f(x) = x_1 + x_2$

allows adding any function $h(x_1)$ to $g_1$ while subtracting it from $g_2$ without changing $f(g(x))$, so there are infinitely many solutions. Raising $c$, however, can provide additional constraints. For example,

$g(x) = [x_1^2, x_1 x_2]$

$f(x) = [x_1 + x_2, x_1^2]$

would avoid this issue, because $g(x) = [x_1^2 + h(x_1), x_1 x_2 - h(x_1)]$ would lead to $f(g(x)) = [x_1^2 + x_1 x_2, (x_1^2 + h(x_1))^2]$, whose second component differs from the target $x_1^4$ for a generic $h \neq 0$.
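
The argument above can be illustrated numerically. This is a sketch under the assumption of one arbitrary choice, $h(x_1) = \sin(x_1)$; the names `f_scalar` and `f_vector` are hypothetical stand-ins for the $c = 1$ and $c = 2$ versions of $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
x1, x2 = x[:, 0], x[:, 1]

h = np.sin(x1)  # arbitrary illustrative choice of h(x_1)

# original g and a shifted variant with +h on g_1 and -h on g_2
g = np.stack([x1**2, x1 * x2], axis=1)
g_shifted = np.stack([x1**2 + h, x1 * x2 - h], axis=1)

def f_scalar(y):
    """c = 1: f(y) = y_1 + y_2."""
    return y[:, 0] + y[:, 1]

def f_vector(y):
    """c = 2: f(y) = [y_1 + y_2, y_1^2]."""
    return np.stack([y[:, 0] + y[:, 1], y[:, 0]**2], axis=1)

# with c = 1 the shift cancels, so both g's explain the data equally well
same_for_c1 = np.allclose(f_scalar(g), f_scalar(g_shifted))

# with c = 2 the second output exposes the shift
same_for_c2 = np.allclose(f_vector(g), f_vector(g_shifted))
```

Here `same_for_c1` is true while `same_for_c2` is false, matching the reasoning in the text.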

What are the available mathematical operators in PySR? See https://astroautomata.com/PySR/#/operators.

* Unary: `neg`, `square`, `cube`, `exp`, `abs`, `log_abs = log(abs(x) + 1e-8)`, `log10_abs`, `log2_abs`, `log1p_abs = log(abs(x) + 1)`, `sqrt_abs = sqrt(abs(x))`, `sin`, `cos`, `tan`, `sinh`, `cosh`, `tanh`, `atan`, `asinh`, `acosh_abs`, `atanh_clip = atanh((x+1)%2 - 1)`, `erf`, `erfc`, `gamma`, `relu`, `round`, `floor`, `ceil`, `sign`

* Binary: `plus`, `sub`, `mult`, `pow`, `div`, `greater`, `mod`, `logical_or`, `logical_and`

We select the following operators:

* Unary: `sin`, `cos`, `exp`, `log_abs`

* Binary: `plus`, `sub`, `mult`, (`pow`)

Additionally, we support adding Gaussian noise to each function. The noise level is given as the standard deviation $\sigma$, and we set $\sigma = 0$ for now.
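
As a minimal sketch of the noise step, assuming the noise level is passed as a standard deviation, the hypothetical helper `add_gaussian_noise` below returns a noisy copy of its input and leaves the data unchanged for a noise level of zero.

```python
import numpy as np

def add_gaussian_noise(values, sigma, rng):
    """Return a noisy copy of `values`; `sigma` is the noise standard deviation."""
    return values + rng.normal(scale=sigma, size=values.shape)

rng = np.random.default_rng(0)
clean = np.linspace(0.0, 1.0, 5)

# sigma = 0 adds exact zeros, so the data is unchanged
unchanged = add_gaussian_noise(clean, 0.0, rng)

# a positive sigma perturbs every sample
noisy = add_gaussian_noise(clean, 0.1, rng)
```

Returning a new array rather than modifying `values` in place keeps the clean data available for comparison.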

Functions:

* $a=2$, $b=1$, $c=1$:

    * $g^1(x) = x_0^2 + \text{cos}(x_1) + x_0 \cdot x_1$

    * $f^1(y) = y^2 + y$

* $a=2$, $b=3$, $c=1$:

    * $g^2(x) = [x_0^2, \text{cos}(x_1), x_0 \cdot x_1]$

    * $f^2(y) = (y_0 + y_1 + y_2)^2 + y_0 + y_1 + y_2$

* $a=3$, $b=1$, $c=1$:

    * $g^3(x) = \text{sin}(x_0) \cdot \text{cos}(x_1) + x_2^3 + \text{exp}(x_2)$

    * $f^3(y) = y^2 + 2.745 \cdot y$

* $a=3$, $b=3$, $c=1$:

    * $g^4(x) = [\text{sin}(x_0) \cdot \text{cos}(x_1), x_2^3, \text{exp}(x_2)]$

    * $f^4(y) = y_0^2 + y_1 \cdot y_2$

* $a=5$, $b=3$, $c=1$:

    * $g^5(x) = [x_0, x_1 \cdot x_3^2, \text{log}(\text{abs}(x_3 + x_4))]$

    * $f^5(y) = (y_0 + y_1) \cdot \text{exp}(0.31 \cdot y_2)$
    
* $a=2$, $b=3$, $c=1$:

    * $g^0(x) = [x_0^2, \text{cos}(x_1), x_0 \cdot x_1]$

    * $f^0(y) = y_0 + y_1 + y_2$
    
* $a=8$, $b=3$, $c=1$:

    * $g^6(x) = [x_0^2, \text{cos}(x_3), x_5 \cdot x_7]$

    * $f^6(y) = y_0 + y_1 + y_2$
    
* $a=2$, $b=3$, $c=1$:

    * $g^7(x) = [2.7 \cdot x_0^2, 5 \cdot \text{cos}(3 \cdot x_1), 4.5 \cdot x_0 \cdot x_1]$

    * $f^7(y) = y_0 + y_1 + y_2$
    
* $a=2$, $b=5$, $c=1$:

    * $g^9(x) = [1.5 \cdot x_0^2, 3.5 \cdot \text{sin}(2.5 \cdot x_1), 3.0 \cdot x_0 \cdot \text{cos}(0.5 \cdot x_0), x_0 \cdot x_1, 0.5 \cdot x_1 \cdot \text{exp}(x_0)]$

    * $f^9(y) = y_0 + y_1 + y_2 + y_3 + y_4$
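
Several pairs in this list describe the same composite function with a different latent dimension $b$. The sketch below checks this for $f^1 \circ g^1$ ($b = 1$) and $f^2 \circ g^2$ ($b = 3$). It assumes both are evaluated on the same input sample, which differs from the notebook, where `X01` and `X02` are drawn independently. The function names `g1`, `f1`, `g2` and `f2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))

def g1(x):
    """g^1: R^2 -> R^1, returned with shape (n, 1)."""
    return (x[:, 0]**2 + np.cos(x[:, 1]) + x[:, 0] * x[:, 1]).reshape(-1, 1)

def f1(y):
    """f^1(y) = y^2 + y."""
    return y[:, 0]**2 + y[:, 0]

def g2(x):
    """g^2: R^2 -> R^3, returned with shape (n, 3)."""
    return np.stack([x[:, 0]**2, np.cos(x[:, 1]), x[:, 0] * x[:, 1]], axis=1)

def f2(y):
    """f^2(y) = (y_0 + y_1 + y_2)^2 + y_0 + y_1 + y_2."""
    s = y.sum(axis=1)
    return s**2 + s

out_b1 = f1(g1(x))
out_b3 = f2(g2(x))
```

The two outputs agree up to floating-point error, so the pairs differ only in how the intermediate representation is split across latent dimensions.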

In [5]:
fun_tups = [
    ("G01", ["X01[:,0]**2 + np.cos(X01[:,1]) + X01[:,0] * X01[:,1]"], 0),
    ("F01", ["G01[:,0]**2 + G01[:,0]"], 0),
    ("G02", ["X02[:,0]**2", "np.cos(X02[:,1])", "X02[:,0] * X02[:,1]"], 0),
    ("F02", ["(G02[:,0] + G02[:,1] + G02[:,2])**2 + G02[:,0] + G02[:,1] + G02[:,2]"], 0),
    ("G03", ["np.sin(X03[:,0]) * np.cos(X03[:,1]) + X03[:,2]**3 + np.exp(X03[:,2])"], 0),
    ("F03", ["G03[:,0]**2 + 2.745 * G03[:,0]"], 0),
    ("G04", ["np.sin(X04[:,0]) * np.cos(X04[:,1])", "X04[:,2]**3", "np.exp(X04[:,2])"], 0),
    ("F04", ["G04[:,0]**2 + G04[:,1] * G04[:,2]"], 0),
    ("G05", ["X05[:,0]", "X05[:,1] * X05[:,3]**2", "np.log(np.abs(X05[:,3] + X05[:,4]))"], 0),
    ("F05", ["(G05[:,0] + G05[:,1]) * np.exp(0.31 * G05[:,2])"], 0),
    ("G00", ["X00[:,0]**2", "np.cos(X00[:,1])", "X00[:,0] * X00[:,1]"], 0),
    ("F00", ["G00[:,0] + G00[:,1] + G00[:,2]"], 0),
    ("G06", ["X06[:,0]**2", "np.cos(X06[:,3])", "X06[:,5] * X06[:,7]"], 0),
    ("F06", ["G06[:,0] + G06[:,1] + G06[:,2]"], 0),
    ("G07", ["2.7*X07[:,0]**2", "5*np.cos(3*X07[:,1])", "4.5*X07[:,0]*X07[:,1]"], 0),
    ("F07", ["G07[:,0] + G07[:,1] + G07[:,2]"], 0),
    ("G09", ["1.5*X09[:,0]**2", 
             "3.5*np.sin(2.5*X09[:,1])", 
             "3.0*X09[:,0]*np.cos(0.5*X09[:,0])", 
             "X09[:,0]*X09[:,1]",
             "0.5*X09[:,1]*np.exp(X09[:,0])"], 0),
    ("F09", ["G09[:,0] + G09[:,1] + G09[:,2] + G09[:,3] + G09[:,4]"], 0),
    ("G11", ["-0.08*(X11[:,0] + 0.165)**2 - 0.21", 
             "0.0785*(X11[:,0] - 0.63)**2 - 0.252", 
             "0.0895*(X11[:,0] + 0.21)**2 - 0.0375"], 0),
    ("F11", ["G11[:,0] * G11[:,1] + np.sin(G11[:,2])"], 0),
]

In [6]:
for fun_tup in fun_tups:
    
    fun_name = fun_tup[0]
    print(f"Evaluating {fun_name}.")

    res = []    
    for fun in fun_tup[1]:
        
        # evaluate target function string
        fun_data = eval(fun, {'np': np}, data)

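        # add Gaussian noise using this function's own noise scale;
        # build a new array so that views into the input data are not modified
        fun_data = fun_data + np.random.normal(scale=fun_tup[2], size=fun_data.shape[0])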

        res.append(fun_data)

    data[fun_name] = np.array(res).T

Evaluating G01.
Evaluating F01.
Evaluating G02.
Evaluating F02.
Evaluating G03.
Evaluating F03.
Evaluating G04.
Evaluating F04.
Evaluating G05.
Evaluating F05.
Evaluating G00.
Evaluating F00.
Evaluating G06.
Evaluating F06.
Evaluating G07.
Evaluating F07.
Evaluating G09.
Evaluating F09.
Evaluating G11.
Evaluating F11.
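
The evaluation loop relies on two mechanisms: `eval` resolves array names through the `data` dictionary, and the list of per-component results is stacked into an array of shape `(n, b)`. The following sketch isolates both on a small example; the entry name `"G"` and the input name `"X"` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = {"X": rng.normal(size=(4, 2))}

# hypothetical entry in the same (name, expressions, noise scale) format
name, expressions, noise_scale = (
    "G",
    ["X[:,0]**2", "np.cos(X[:,1])", "X[:,0] * X[:,1]"],
    0,
)

columns = []
for expr in expressions:
    # globals expose only `np`; array names are looked up in `data` as locals
    values = eval(expr, {"np": np}, data)
    columns.append(values)

# list of b arrays of shape (n,) -> one array of shape (n, b)
data[name] = np.array(columns).T
```

Because each result is stored back into `data`, later expressions (such as the `F` entries) can refer to earlier ones (the `G` entries) by name.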


### Step 1.3: Save data

In [7]:
data_path = "data_1k"
data_ext = ".gz"
info_ext = ".info"
update = False

# create data folder
os.makedirs(data_path, exist_ok=True)

# save input data
for var in inputs:
    if update or var + data_ext not in os.listdir(data_path):
        np.savetxt(os.path.join(data_path, var + data_ext), data[var])
        print(f"Saved {var} data.")

# save target data
for fun_tup in fun_tups:
    var = fun_tup[0]
    
    if update or var + data_ext not in os.listdir(data_path):
        np.savetxt(os.path.join(data_path, var + data_ext), data[var])

        with open(os.path.join(data_path, var + info_ext), "w") as f:
            for fun in fun_tup[1]:
                f.write(fun + '\n')

            f.write(str(fun_tup[2]) + '\n')

        print(f"Saved {var} data.")

Saved G11 data.
Saved F11 data.
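
To see how the saved files behave when read back, the following sketch performs a round trip through a temporary directory. The file name `X_demo.gz` is hypothetical; the reshape mirrors the handling of single-column files in the loading cell of Step 1.1.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(size=(10, 1))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_demo.gz")
    # a ".gz" suffix makes np.savetxt write gzip-compressed text
    np.savetxt(path, original)
    # np.loadtxt reads gzip-compressed files transparently
    loaded = np.loadtxt(path)

# a single-column file comes back 1-D, so restore the (n, 1) shape
if loaded.ndim == 1:
    loaded = loaded.reshape(-1, 1)
```

After the reshape, `loaded` has the same shape as `original` and matches it up to the precision of the text format.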
