# 2022 Flatiron Machine Learning x Science Summer School

## Step 1: Create generic data from algebraic equations

In this step, we want to create data from algebraic equations of the form $f(g(x))$, where $g: \mathbb{R}^a \rightarrow \mathbb{R}^b$, $f: \mathbb{R}^b \rightarrow \mathbb{R}^c$ and $f \circ g: \mathbb{R}^a \rightarrow \mathbb{R}^c$. One example could be

$g(x) = x_1 \cdot \cos(x_3)$

$f(x) = x^2 + x$

$f(g(x)) = g(x)^2 + g(x) = x_1^2 \cdot \cos^2(x_3) + x_1 \cdot \cos(x_3)$

where $a = 3$, $b = 1$ and $c = 1$. Ideally, we would like to define $f$ and $g$ such that $f \circ g$ is difficult to discover for symbolic regression, while $f$ and $g$ individually are easily discoverable.
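As a minimal sketch of the example above, the composition can be evaluated numerically and compared against the expanded expression. The names `g`, `f`, `composed` and `expanded` are illustrative only and are not part of the notebook's pipeline; the columns of `x` are zero-based, so $x_1$ and $x_3$ correspond to columns 0 and 2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # 5 samples with a = 3 input dimensions


def g(x):
    # g: R^3 -> R^1, uses x_1 and x_3 (columns 0 and 2)
    return x[:, 0] * np.cos(x[:, 2])


def f(z):
    # f: R^1 -> R^1
    return z**2 + z


composed = f(g(x))
expanded = x[:, 0] ** 2 * np.cos(x[:, 2]) ** 2 + x[:, 0] * np.cos(x[:, 2])

# both routes give the same values up to floating-point error
print(np.allclose(composed, expanded))
```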

Note that for $b \gt 1$, one potential issue is that there is no unique solution. For $a = 2$, $b = 2$, $c = 1$, one example is

$g(x) = [x_1^2, x_1 x_2]$

$f(x) = x_1 + x_2$

where $h(x) = [x_1^2 + x_1, x_1 x_2 - x_1]$ is an equally valid choice for the inner function, i.e. $f(g(x)) = f(h(x))$. More generally, there are infinitely many such solutions: any quantity that is added to $g_1$ and subtracted from $g_2$, even at only some of the data points, leaves $f \circ g$ unchanged.
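The non-uniqueness can be checked numerically. The following sketch (with illustrative variable names, assuming standard normal inputs as used later in this notebook) confirms that $g$ and $h$ differ as latent representations while producing the same output under $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))  # a = 2
x1, x2 = x[:, 0], x[:, 1]

# two different inner functions with b = 2
g = np.stack([x1**2, x1 * x2], axis=1)
h = np.stack([x1**2 + x1, x1 * x2 - x1], axis=1)


def f(z):
    # f: R^2 -> R^1, sum of the two latent components
    return z[:, 0] + z[:, 1]


same_output = np.allclose(f(g), f(h))  # the composed outputs agree
same_latent = np.allclose(g, h)        # the latent values do not
print(same_output, same_latent)  # True False
```

Observing only $x$ and $f(g(x))$ therefore cannot distinguish between $g$ and $h$.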

We expect this issue to be one of the main challenges of this project.

Let's get started!

In [1]:
import os
import numpy as np

### Step 1.1: Create standard input data

Since the input data in machine learning is often standardized as part of the preprocessing anyway, we simply sample from $\mathcal{N}(0, 1)$. 

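Furthermore, we specify the input data to be 10-dimensional in general, which provides enough flexibility to define complex target functions. Note that the cell below currently sets `input_size = 2` for the input `X04`, since the active function `G04` only uses two input dimensions.

We start by sampling 1000 data points, i.e. an array of shape `(data_size, input_size)`.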

In [2]:
seed = 0

input_name = "X04"
input_size = 2

data_size = int(1e3)

In [3]:
np.random.seed(seed)

data = {}
data[input_name] = np.random.normal(size=(data_size, input_size))

### Step 1.2: Create target function data

As discussed above, we want to define easily discoverable functions $f$ and $g$ where their composition $f \circ g$ is difficult to discover for symbolic regression.

Additionally, it would be good to cover different combinations of $b$ and $c$.

For $b > 1$, there is the above-mentioned identifiability issue. Raising $c$, however, could provide more information and reduce the issue. For example,

$g(x) = [x_1^2, x_1 x_2]$

$f(x) = x_1 + x_2$

allows adding any function $h(x_1)$ to $g_1$ and subtracting it from $g_2$ without changing $f \circ g$. In contrast,

$g(x) = [x_1^2, x_1 x_2]$

$f(x) = [x_1 + x_2, x_1^2]$

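would largely avoid this issue, as $g(x) = [x_1^2 + h(x_1), x_1 x_2 - h(x_1)]$ would lead to $f(g(x)) = [x_1^2 + x_1 x_2, (x_1^2 + h(x_1))^2]$, whose second component equals the original $x_1^4$ only where $h(x_1) = 0$ or $h(x_1) = -2 x_1^2$.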
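A small numerical sketch of this argument follows. The shift `np.sin(x1)` is an arbitrary, hypothetical choice of $h(x_1)$, and the names `f_scalar` and `f_vector` are illustrative; they stand for the $c = 1$ and $c = 2$ versions of $f$ above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
x1, x2 = x[:, 0], x[:, 1]

shift = np.sin(x1)  # hypothetical h(x_1), an arbitrary nonzero choice

g = np.stack([x1**2, x1 * x2], axis=1)
g_shifted = np.stack([x1**2 + shift, x1 * x2 - shift], axis=1)


def f_scalar(z):
    # c = 1: only the sum of the latent components is observed
    return z[:, 0] + z[:, 1]


def f_vector(z):
    # c = 2: the sum and the square of the first latent component are observed
    return np.stack([z[:, 0] + z[:, 1], z[:, 0] ** 2], axis=1)


scalar_unchanged = np.allclose(f_scalar(g), f_scalar(g_shifted))
vector_unchanged = np.allclose(f_vector(g), f_vector(g_shifted))
print(scalar_unchanged, vector_unchanged)  # True False
```

With the scalar output the shift is invisible, whereas the additional output component exposes it.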

What are the available mathematical operators in PySR? See https://astroautomata.com/PySR/#/operators.

* Unary: `neg`, `square`, `cube`, `exp`, `abs`, `log_abs = log(abs(x) + 1e-8)`, `log10_abs`, `log2_abs`, `log1p_abs = log(abs(x) + 1)`, `sqrt_abs = sqrt(abs(x))`, `sin`, `cos`, `tan`, `sinh`, `cosh`, `tanh`, `atan`, `asinh`, `acosh_abs`, `atanh_clip = atanh((x+1)%2 - 1)`, `erf`, `erfc`, `gamma`, `relu`, `round`, `floor`, `ceil`, `sign`

* Binary: `plus`, `sub`, `mult`, `pow`, `div`, `greater`, `mod`, `logical_or`, `logical_and`

We select the following operators:

* Unary: `sin`, `cos`, `exp`, `log_abs`

* Binary: `plus`, `sub`, `mult`, (`pow`)

Additionally, we define the functionality to add Gaussian noise to each function, but we set $\sigma^2 = 0$ for now.
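The noise step can be sketched in isolation as follows. The helper `add_gaussian_noise` is a hypothetical name introduced here for illustration and is not used elsewhere in the notebook. Note that NumPy's `scale` argument is the standard deviation $\sigma$, not the variance $\sigma^2$; for $\sigma = 0$ the two coincide and the data is returned unchanged.

```python
import numpy as np


def add_gaussian_noise(values, sigma, rng):
    """Return a noisy copy of `values`; `sigma` is the noise standard deviation."""
    return values + rng.normal(scale=sigma, size=values.shape)


rng = np.random.default_rng(0)
clean = np.linspace(-1.0, 1.0, 5)

noiseless = add_gaussian_noise(clean, 0.0, rng)  # sigma = 0: no change
noisy = add_gaussian_noise(clean, 0.1, rng)      # sigma > 0: perturbed copy

print(np.array_equal(noiseless, clean))  # True
```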

The data is not split into training, validation and testing data at this point.

Functions:

* $b=1$, $c=1$:

    * $g^1(x) = x_0 + \text{cos}(x_2) + x_3 \cdot x_6 + x_7^2$

    * $f^1(x) = x^2$

    * $g^2(x) = \text{sin}(x_0) \cdot \text{cos}(x_1) + x_5^3 + \text{exp}(x_9)$

    * $f^2(x) = 3.95 x^2 + x$

    * $g^3(x) = x_0 - x_3 \cdot x_5^2 + \text{log}(\text{abs}(x_7 + x_8))$

    * $f^3(x) = x \cdot \text{exp}(0.25 x)$

* $b=3$, $c=1$:

    * $g^4(x) = $

    * $f^4(x) = $

    * $g^5(x) = $

    * $f^5(x) = $

    * $g^6(x) = $

    * $f^6(x) = $

* $b=3$, $c=3$:

    * $g^7(x) = $

    * $f^7(x) = $

    * $g^8(x) = $

    * $f^8(x) = $

    * $g^9(x) = $

    * $f^9(x) = $

* $b=5$, $c=1$:

    * $g^{10}(x) = $

    * $f^{10}(x) = $

    * $g^{11}(x) = $

    * $f^{11}(x) = $

    * $g^{12}(x) = $

    * $f^{12}(x) = $

* $b=5$, $c=3$:

    * $g^{13}(x) = $

    * $f^{13}(x) = $

    * $g^{14}(x) = $

    * $f^{14}(x) = $

    * $g^{15}(x) = $

    * $f^{15}(x) = $

* $b=5$, $c=5$:

    * $g^{16}(x) = $

    * $f^{16}(x) = $

    * $g^{17}(x) = $

    * $f^{17}(x) = $

    * $g^{18}(x) = $

    * $f^{18}(x) = $

In [4]:
fun_tups = [
    # ("G01", ["X01[:,0] + np.cos(X01[:,2]) + X01[:,3] * X01[:,6] + X01[:,7]**2"], 0),
    # ("F01", ["G01[:,0]**2"], 0),
    # ("G02", ["np.sin(X01[:,0]) * np.cos(X01[:,1]) + X01[:,5]**3 + np.exp(X01[:,9])"], 0),
    # ("F02", ["3.95 * G02[:,0]**2 + G02[:,0]"], 0),
    # ("G03", ["X01[:,0] - X01[:,3] * X01[:,5]**2 + np.log(np.abs(X01[:,7] + X01[:,8]))"], 0),
    # ("F03", ["G03[:,0] * np.exp(0.25 * G03[:,0])"], 0),
    ("G04", ["X04[:,0] + np.cos(X04[:,1]) + X04[:,0] * X04[:,1] + X04[:,1]**2"], 0),
    ("F04", ["G04[:,0]**2"], 0),
]

In [5]:
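for fun_name, fun_strs, sigma in fun_tups:
    print(f"Evaluating {fun_name}.")

    res = []
    for fun in fun_strs:
        # evaluate the target function string; arrays in `data` are visible by name
        fun_data = eval(fun, {'np': np}, data)

        # add Gaussian noise using this function's own standard deviation
        # (not in place, so an expression returning a view of `data` cannot modify its source)
        fun_data = fun_data + np.random.normal(scale=sigma, size=data_size)

        res.append(fun_data)

    # stack the components as columns: shape (data_size, number of components)
    data[fun_name] = np.array(res).T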

Evaluating G04.
Evaluating F04.


### Step 1.3: Save data

In [6]:
data_path = "data"
data_ext = ".gz"
info_ext = ".info"

# create data folder
os.makedirs(data_path, exist_ok=True)

# save input data
var = input_name
np.savetxt(os.path.join(data_path, var + data_ext), data[var])
print(f"Saved {var} data.")

# save target data
for fun_tup in fun_tups:
    var = fun_tup[0]

    np.savetxt(os.path.join(data_path, var + data_ext), data[var])

    with open(os.path.join(data_path, var + info_ext), "w") as f:
        for fun in fun_tup[1]:
            f.write(fun + '\n')

        f.write(str(fun_tup[2]) + '\n')

    print(f"Saved {var} data.")

Saved X04 data.
Saved G04 data.
Saved F04 data.
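To read the saved files back, `np.loadtxt` can be used, since both `np.savetxt` and `np.loadtxt` handle the `.gz` extension transparently. The sketch below is a self-contained round trip in a temporary directory with a made-up single-column array, so the file name `example.gz` is illustrative only. Passing `ndmin=2` keeps the column dimension, which would otherwise be squeezed for single-output functions such as `G04`.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
example = rng.normal(size=(10, 1))  # one output component, like G04

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.gz")

    # the .gz extension makes savetxt write a gzip-compressed text file
    np.savetxt(path, example)

    # ndmin=2 preserves the (n_samples, n_components) layout
    restored = np.loadtxt(path, ndmin=2)

round_trip_ok = np.allclose(example, restored)
print(restored.shape, round_trip_ok)
```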
