# A gentle introduction

## Preamble
In this tutorial we look at how *py-dsyre* encodes expressions. It is perhaps best to start from a concrete example.

Consider the following mathematical formula $f(x,c) = \frac{x+\sin(cx)}{cx}$ containing one independent variable $x$ and one constant $c$. 

One possible decomposition into simple operations, each at most binary, is:

$$
\begin{array}{l}
    u_0 = x \\
    u_1 = c \\
    u_2 = u_0 \cdot u_1 \\
    u_3 = \sin(u_2) \\
    u_4 = u_3+u_0 \\
    u_5 = u_4 / u_2
\end{array}
$$

This sequence of decompositions is what *py-dsyre* encodes in a fixed-length **genotype** made of integer triplets, each representing one of the $u_i$.

If we assign the ids $+: 0, \cdot: 1, /: 2, \sin: 3$ to the operations used here, the above sequence of decompositions is uniquely determined by the triplets:
$$
\mathbf x = [\underbrace{1, 0, 1}_{u_2}, \underbrace{3, 2, 0}_{u_3}, \underbrace{0, 3, 0}_{u_4}, \underbrace{2, 4, 2}_{u_5}]
$$
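Note that the independent variable and the constant define the first two entries, $u_0=x$ and $u_1=c$, so no triplets are needed for them. Each triplet reads (operation id, index of the first argument, index of the second argument); for the unary $\sin$ in $u_3$ the last entry plays no role.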

Any **genotype** is then easily expanded into the various $u_i$, ALL of which *py-dsyre* treats as the **phenotype**, that is, as candidate models. Unlike Cartesian Genetic Programming and other popular genetic programming approaches, *py-dsyre* neither defines output nodes nor makes use of introns.

Let's see how this looks. First we import the module:

In [1]:
import pydsyre as dsy
import numpy as np

We start by defining the main object that manipulates the genotypes and phenotypes of our symbolic regression system, telling it the number of input terminals (variables and constants) and the kernels (binary and unary operations) we intend to use:

In [2]:
ex = dsy.expression(nvars=1, 
                    ncons=1, 
                    kernels=["sum", "mul", "div", "sin"])

We then define a genotype by assembling the triplets detailed above:

In [3]:
geno = [1,0,1,3,2,0,0,3,0,2,4,2]

Finally, we peek at the symbolic representation of the phenotype and see that indeed it contains the correct expressions:

In [4]:
print(ex.sphenotype(geno=geno, vars=["x"], cons=["c"]))

['x', 'c', '(x*c)', 'sin((x*c))', '(sin((x*c))+x)', '((sin((x*c))+x)/(x*c))']
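
To make the decoding rule explicit, here is a minimal, library-free sketch of how a flat list of triplets can be expanded into symbolic strings. It is not *py-dsyre*'s implementation: the function name `decode_symbolic` and the `BINARY` lookup table are hypothetical, and the kernel ids follow the assignment used in the text ($+: 0, \cdot: 1, /: 2, \sin: 3$).

```python
# Library-free sketch of the triplet decoding; NOT py-dsyre's own code.
# Kernel ids follow the text: sum=0, mul=1, div=2, sin=3.
KERNELS = ["sum", "mul", "div", "sin"]
BINARY = {"sum": "+", "mul": "*", "div": "/"}  # hypothetical helper table


def decode_symbolic(geno, terminals, kernels=KERNELS):
    """Expand a flat list of (op, arg1, arg2) triplets into one string per u_i."""
    if len(geno) % 3 != 0:
        raise ValueError("genotype length must be a multiple of 3")
    # u_0, u_1, ... start out as the terminals (variables, then constants)
    u = list(terminals)
    for k in range(0, len(geno), 3):
        op, a, b = geno[k:k + 3]
        name = kernels[op]
        if name in BINARY:
            u.append(f"({u[a]}{BINARY[name]}{u[b]})")
        else:
            # Unary kernel: the second argument index is ignored
            u.append(f"{name}({u[a]})")
    return u


symbolic = decode_symbolic([1, 0, 1, 3, 2, 0, 0, 3, 0, 2, 4, 2], ["x", "c"])
print(symbolic)
```

For the genotype used above, this sketch reproduces the same six strings that `sphenotype` printed.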


## Working with randomness
Now that we have understood how the basic idea works, we can quickly take a look at how to **generate** and **manipulate** genotypes. We start by creating a random genotype of length 10, that is, of 10 triplets. Since the expression has *nvars*=1 and *ncons*=1, the number of expressed models (the dimensionality of the vector $[u_0, u_1, \ldots]$) will be 12:

In [5]:
# We create a random genotype
geno = ex.random_genotype(length = 10)
# We compute the symbolic form of the phenotype
sphen = ex.sphenotype(geno=geno, vars=["x"], cons=["c"])
print(f"Number of models: {len(sphen)}\nPhenotype: {sphen}")

Number of models: 12
Phenotype: ['x', 'c', '(c/c)', '(c+(c/c))', '(x+(c+(c/c)))', 'sin(x)', 'sin(x)', '(sin(x)+c)', '((c/c)+(x+(c+(c/c))))', '(((c/c)+(x+(c+(c/c))))/(x+(c+(c/c))))', '(sin(x)*(sin(x)+c))', '(((c/c)+(x+(c+(c/c))))/(c/c))']
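
One way to draw a valid random genotype by hand is sketched below, under the assumption that the arguments of the $k$-th triplet may only point at terminals or at earlier $u_j$. The function `random_genotype_sketch` and its parameters are hypothetical; how *py-dsyre* samples internally is not shown in this tutorial.

```python
import random


def random_genotype_sketch(n_triplets, n_terminals, n_kernels, rng=random):
    """Draw a flat genotype whose argument indices only reference earlier u_j.

    Assumption (not confirmed by py-dsyre): every entry is drawn uniformly.
    """
    geno = []
    for k in range(n_triplets):
        # When the k-th triplet is built, u_0 .. u_{n_terminals + k - 1} exist
        n_available = n_terminals + k
        geno += [
            rng.randrange(n_kernels),    # operation id
            rng.randrange(n_available),  # index of the first argument
            rng.randrange(n_available),  # index of the second argument
        ]
    return geno


# 10 triplets, 2 terminals (x and c), 4 kernels, fixed seed for repeatability
geno = random_genotype_sketch(10, 2, 4, rng=random.Random(0))
print(len(geno), geno)
```

With 10 triplets and 2 terminals the flat list holds 30 integers and expands to 12 expressions, matching the model count printed above.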


Let us now mutate at random three elements in the genotype and see what effect it has on the phenotype:

In [6]:
mutated_geno = ex.mutate(geno = geno, N = 3)
sphen = ex.sphenotype(geno=mutated_geno, vars=["x"], cons=["c"])
print(f"Number of models: {len(sphen)}\nPhenotype: {sphen}")

Number of models: 12
Phenotype: ['x', 'c', '(c/c)', '(c+(c/c))', '(x+(c+(c/c)))', 'sin(x)', 'sin(x)', '(sin(x)+c)', '((c/c)+sin(x))', '(((c/c)+sin(x))/(c+(c/c)))', '(sin(x)*(sin(x)+c))', '(((c/c)+sin(x))/(c/c))']
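
The effect of a mutation can be mimicked with a few lines of plain Python. This is only a sketch under stated assumptions: the function `mutate_sketch` is hypothetical, it picks distinct positions uniformly at random, and it redraws each one inside the range that keeps the genotype valid. Whether `ex.mutate` follows the same rules is not stated in this tutorial.

```python
import random


def mutate_sketch(geno, n_mutations, n_terminals, n_kernels, rng=random):
    """Return a copy of geno with n_mutations distinct entries redrawn in range."""
    mutated = list(geno)
    for idx in rng.sample(range(len(mutated)), n_mutations):
        triplet, pos = divmod(idx, 3)
        # Entry 0 of a triplet is a kernel id; entries 1 and 2 index earlier u_j
        upper = n_kernels if pos == 0 else n_terminals + triplet
        mutated[idx] = rng.randrange(upper)
    return mutated


base = [1, 0, 1, 3, 2, 0, 0, 3, 0, 2, 4, 2]
mutated = mutate_sketch(base, n_mutations=3, n_terminals=2, n_kernels=4,
                        rng=random.Random(1))
print(base)
print(mutated)
```

In this sketch a redrawn entry may happen to take its old value, and a change in the ignored argument of a unary kernel leaves the phenotype untouched, so fewer than three expressions may visibly differ.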


## Evaluating a phenotype on data
All the models contained in the phenotype need to be evaluated and assessed on data. To show how this is done we now use an expression with 3 variables and 2 constants:

In [7]:
ex = dsy.expression(nvars=3, 
                    ncons=2, 
                    kernels=["sum", "mul", "div", "sin"])


We then generate a random genotype, this time of greater length:

In [8]:
geno = ex.random_genotype(length = 100)
# Out of curiosity we also print the last expression in the phenotype
print(ex.sphenotype(geno)[-1])

(((((x2/x0)*x1)*x2)*(((x2/x0)*x1)*x2))+((sin(x0)*((x0/c0)/x0))*((((x2/x0)*x1)*x2)+c0)))


We create a meaningless dataset:

In [9]:
# The data are just 1024 points drawn from a standard normal distribution
X = np.random.randn(1024, 3)

Then we compute the phenotype with the model constants (arbitrarily) set to [0.2, -0.4]:

In [16]:
phen = ex.phenotype(geno = geno, xs = X, cons = [0.2, -0.4])
print("First 20 values in the phenotype for the first instance: ", phen[0][:20])

First 20 values in the phenotype for the first instance:  [-1.3137669274472548, -0.48195417432681165, 0.5837987234947484, 0.2, -0.4, -0.444370086731527, -0.9671493967307924, -6.568834637236273, 0.21416601824622677, 0.1250298480681002, 0.01563246290793222, 5.0, 0.21253257371430564, -0.9671493967307924, 0.19278166973072466, -0.26778815608058487, 0.11254569270198322, 0.3250298480681002, 0.9410322215777032, 0.21093616406148863]
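
The numerical evaluation can be sketched with NumPy alone. The function `phenotype_sketch` below is hypothetical and assumes plain, unprotected division and the kernel ids used in the Preamble; it returns one row per data point and one column per $u_i$, mirroring the layout of `phen[0]` above. It is checked against the formula $f(x,c) = \frac{x+\sin(cx)}{cx}$.

```python
import numpy as np


def phenotype_sketch(geno, xs, cons):
    """Evaluate every u_i on every data point; result shape (n_points, n_models).

    Assumptions (not py-dsyre's code): ids sum=0, mul=1, div=2, sin=3 and
    plain division without any protection against zero denominators.
    """
    xs = np.asarray(xs, dtype=float)
    # The first columns are the variables, followed by the broadcast constants
    u = [xs[:, j] for j in range(xs.shape[1])]
    u += [np.full(xs.shape[0], float(c)) for c in cons]
    for k in range(0, len(geno), 3):
        op, a, b = geno[k:k + 3]
        if op == 0:
            u.append(u[a] + u[b])
        elif op == 1:
            u.append(u[a] * u[b])
        elif op == 2:
            u.append(u[a] / u[b])
        elif op == 3:
            u.append(np.sin(u[a]))  # unary: the second argument is ignored
        else:
            raise ValueError(f"unknown kernel id {op}")
    return np.stack(u, axis=1)


x = np.array([[0.5], [1.0], [2.0]])  # three points, one variable
c = 2.0
phen_sketch = phenotype_sketch([1, 0, 1, 3, 2, 0, 0, 3, 0, 2, 4, 2], x, [c])
expected = (x[:, 0] + np.sin(c * x[:, 0])) / (c * x[:, 0])
print(np.allclose(phen_sketch[:, -1], expected))
```

The last column holds $u_5$, so it should agree with the closed-form formula, while the first two columns simply repeat the inputs and the constant.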


Finally, we time it:

In [17]:
%timeit ex.phenotype(geno = geno, xs = X, cons = [0.2, -0.4])

32.5 µs ± 82.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
