# 2022 Flatiron Machine Learning x Science Summer School

## Step 1: Create generic data from algebraic equations

In this step, we want to create data from algebraic equations of the form $f(g(x))$, where $g: \mathbb{R}^a \rightarrow \mathbb{R}^b$, $f: \mathbb{R}^b \rightarrow \mathbb{R}^c$ and $f \circ g: \mathbb{R}^a \rightarrow \mathbb{R}^c$. One example could be

$g(x) = x_1 \cdot \cos(x_3)$

$f(x) = x^2 + x$

$f(g(x)) = g(x)^2 + g(x) = x_1^2 \cdot \cos^2(x_3) + x_1 \cdot \cos(x_3)$

where $a = 3$, $b = 1$ and $c = 1$. Ideally, we would like to define $f$ and $g$ such that $f \circ g$ is difficult to discover for symbolic regression, while $f$ and $g$ individually are easily discoverable.
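As a quick numerical check of the example above, the following sketch evaluates $g$, $f$ and their composition on random inputs and compares the result with the expanded expression. The names `g`, `f`, `x`, `composed` and `expanded` are illustrative and not part of the notebook.

```python
import numpy as np

def g(x):
    # g: R^3 -> R^1, g(x) = x_1 * cos(x_3); columns 0 and 2 in zero-based indexing
    return x[:, 0] * np.cos(x[:, 2])

def f(z):
    # f: R^1 -> R^1, f(z) = z^2 + z
    return z**2 + z

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # 5 samples with a = 3 input features

composed = f(g(x))
expanded = x[:, 0]**2 * np.cos(x[:, 2])**2 + x[:, 0] * np.cos(x[:, 2])

# The nested evaluation and the expanded formula should agree
print(np.allclose(composed, expanded))
```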

Note that for $b \gt 1$, one potential issue is that there is no unique solution. For $a = 2$, $b = 2$, $c = 1$, one example is

$g(x) = [x_1^2, x_1 x_2]$

$f(x) = x_1 + x_2$

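where $h(x) = [x_1^2 + x_1, x_1 x_2 - x_1]$ is an equally valid inner function, i.e. $f(g(x)) = f(h(x))$, because the $x_1$ added to the first component cancels against the $x_1$ subtracted from the second when $f$ sums them. In fact, there are infinitely many such solutions: any function $\delta(x)$, even one that is nonzero only at certain points, can be added to $g_1$ and subtracted from $g_2$ without changing $f \circ g$.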

We expect this non-uniqueness to be one of the main challenges of this project.
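The non-uniqueness described above can be checked numerically. The sketch below implements $g$, $h$ and $f$ from the example (the Python names and the random test inputs are illustrative) and shows that the two inner functions differ while the composed outputs agree.

```python
import numpy as np

def g(x):
    # g(x) = [x_1^2, x_1 * x_2]
    return np.stack([x[:, 0]**2, x[:, 0] * x[:, 1]], axis=1)

def h(x):
    # h(x) = [x_1^2 + x_1, x_1 * x_2 - x_1]
    return np.stack([x[:, 0]**2 + x[:, 0], x[:, 0] * x[:, 1] - x[:, 0]], axis=1)

def f(z):
    # f(z) = z_1 + z_2
    return z[:, 0] + z[:, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))  # a = 2 input features

same_output = np.allclose(f(g(x)), f(h(x)))  # the compositions agree
same_latent = np.allclose(g(x), h(x))        # the intermediate values do not

print(same_output, same_latent)
```

Data for $f \circ g$ alone therefore cannot distinguish $g$ from $h$; only access to the intermediate values would.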

Let's get started!

In [1]:
import os
import numpy as np

### Step 1.1: Create standard input data

Distribution of input data: standard normal, $\mathcal{N}(0, 1)$

Size of input data: 1000 samples with 10 features each

In [23]:
seed = 0

data_size = int(1e3)

input_name = "X01"
input_size = 10

In [24]:
np.random.seed(seed)

data = {}
data[input_name] = np.random.normal(size=(data_size, input_size))
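`np.random.seed` belongs to NumPy's legacy random API. A possible alternative, sketched here under the assumption that reproducing the exact values of the cell above is not required, uses a local `Generator` from `np.random.default_rng`. The generator draws different numbers than the legacy seed does, and the name `x01` is illustrative.

```python
import numpy as np

seed = 0
data_size = int(1e3)
input_size = 10

# A local generator avoids touching NumPy's global random state
rng = np.random.default_rng(seed)
x01 = rng.normal(size=(data_size, input_size))

# Sanity checks: shape, and sample mean / standard deviation close to 0 / 1
print(x01.shape)
print(x01.mean(), x01.std())
```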

### Step 1.2: Create target function data

What kind of target functions do we want to investigate? Which operators do we include?

- addition
- multiplication
- subtraction
- exponential
- power
- logarithm
- sine
- cosine

Still to decide:

- noise
- train/validation/test split?
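The split is left open above. One possible approach, sketched here with an assumed 80/10/10 ratio that the notebook does not specify, is to shuffle the sample indices once and slice them, so that inputs and all target arrays can be indexed consistently. All names in this block are illustrative.

```python
import numpy as np

data_size = 1000
rng = np.random.default_rng(0)

# Shuffle the row indices once; reuse them for every array that shares these rows
perm = rng.permutation(data_size)

# Assumed 80/10/10 split, computed with integer arithmetic
n_train = data_size * 8 // 10
n_val = data_size // 10

train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```

Indexing with the same index arrays, for example `data["X01"][train_idx]` and `data["F01"][train_idx]`, keeps inputs and targets aligned.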

In [25]:
fun_tups = [
    ("G01", ["X01[:,0]**2", "X01[:,1] * X01[:,2]"], 0),
    ("F01", ["G01[:,0] + G01[:,1]"], 0)
]

In [27]:
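for fun_name, fun_strs, noise_scale in fun_tups:

    print(f"Evaluating {fun_name}.")

    res = []
    for fun in fun_strs:

        # evaluate target function string; `data` is passed as locals so that
        # eval() does not insert "__builtins__" into it, and `np` is exposed
        # for strings that call NumPy functions
        fun_data = eval(fun, {"np": np}, data)

        # add Gaussian noise using this function's own scale (the original
        # always read the scale of the first tuple); avoid += so that a view
        # into the input data is never modified in place
        fun_data = fun_data + np.random.normal(scale=noise_scale, size=data_size)

        res.append(fun_data)

    # stack the components as columns: shape (data_size, number of components)
    data[fun_name] = np.array(res).T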

Evaluating G01.
Evaluating F01.
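The operator list in Step 1.2 includes sine and cosine. A function string that calls NumPy needs `np` to be visible when it is evaluated. The sketch below shows one way to do that for a single string. The target string and the stand-in `data` dictionary are hypothetical examples, not part of the notebook's `fun_tups`.

```python
import numpy as np

rng = np.random.default_rng(0)
data = {"X01": rng.normal(size=(1000, 10))}  # stand-in for the notebook's data dict

# Hypothetical target string that uses a NumPy function
fun = "X01[:,0] * np.cos(X01[:,2])"

# Globals expose only `np`; `data` is passed as locals, so eval() resolves
# X01 from it without adding a "__builtins__" key to the dictionary
fun_data = eval(fun, {"np": np}, data)

print(fun_data.shape, sorted(data.keys()))
```

Passing the dictionary as locals rather than globals keeps it free of interpreter bookkeeping, which matters if the code later iterates over its keys.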


### Step 1.3: Save data

In [22]:
data_path = "data"
data_ext = ".gz"
info_ext = ".info"

# create data folder
os.makedirs(data_path, exist_ok=True)

# save input data
var = input_name
np.savetxt(os.path.join(data_path, var + data_ext), data[var])
print(f"Saved {var} data.")

# save target data
for fun_tup in fun_tups:
    var = fun_tup[0]

    np.savetxt(os.path.join(data_path, var + data_ext), data[var])

    with open(os.path.join(data_path, var + info_ext), "w") as f:
        for fun in fun_tup[1]:
            f.write(fun + '\n')

        f.write(str(fun_tup[2]) + '\n')

    print(f"Saved {var} data.")

Saved X01 data.
Saved G01 data.
Saved F01 data.
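To read the files back, `np.loadtxt` decompresses `.gz` files transparently, mirroring the automatic compression that `np.savetxt` applies for that extension. The sketch below round-trips a stand-in single-column array through a temporary directory. The array contents, the temporary path and the variable names are illustrative, while the `.info` layout (one function string per line, noise scale last) follows the save cell above.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
f01 = rng.normal(size=(1000, 1))  # stand-in for a single-output target such as F01

with tempfile.TemporaryDirectory() as data_path:
    file_path = os.path.join(data_path, "F01.gz")
    np.savetxt(file_path, f01)  # compressed because the name ends in .gz

    # By default loadtxt squeezes a single column to 1-D; ndmin=2 keeps it 2-D
    loaded_default = np.loadtxt(file_path)
    loaded_2d = np.loadtxt(file_path, ndmin=2)

    # Write and re-read an .info file in the same layout as the save cell
    info_path = os.path.join(data_path, "F01.info")
    with open(info_path, "w") as fh:
        fh.write("G01[:,0] + G01[:,1]\n")
        fh.write("0\n")
    with open(info_path) as fh:
        lines = fh.read().splitlines()

# All lines but the last are function strings; the last is the noise scale
functions, noise_scale = lines[:-1], float(lines[-1])

print(loaded_default.shape, loaded_2d.shape, functions, noise_scale)
```

Using `ndmin=2` when loading keeps column indexing such as `F01[:,0]` valid even for single-output targets.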
