# PTO on GE
## Some GE symbolic regression problems

This notebook visualises results from PTO-GE on symbolic regression problems of two kinds: problems defined by datasets, and problems defined by polynomials with randomly generated coefficients.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os.path
import seaborn as sns
%matplotlib inline

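def count_nodes(s):
    # Estimate the node count of an expression string:
    # 1 + number of commas + number of opening parentheses.
    return 1 + s.count(",") + s.count("(")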
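
As a quick sanity check of `count_nodes`, the sketch below applies it to a few expression strings. The strings are hypothetical examples in function-call syntax (the format actually stored in the `fn` column is not shown in this notebook). For that syntax, each `(` marks one function node and each `,` one additional argument, so the formula gives the exact number of tree nodes.

```python
def count_nodes(s):
    # Same formula as in the notebook: 1 + commas + opening parentheses.
    return 1 + s.count(",") + s.count("(")

# Hypothetical expressions in function-call syntax (an assumption about the format).
examples = ["x", "sin(x)", "add(x, mul(y, z))"]
node_counts = {e: count_nodes(e) for e in examples}

for expression, count in node_counts.items():
    print(expression, "->", count)
```

For an infix string such as `(x + y)` the same formula gives 2 rather than 3, so in that case it should be read as a size proxy rather than an exact count.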

# Problems from datasets

We have:

* 5 problem instances (Pagie 2D, Dow Chemical, Tower, Housing, Vladislavleva 4).
* 4 solvers (RS, HC, LA, EA)
* 2 generators. Each generates a random string by recursively making a random choice at each non-terminal in the grammar; one uses the grammar in BNF format and the other uses an "executable" format.
* 2 trace types (linear and structured).
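
To make the BNF-style generator concrete, here is a minimal sketch of a recursive derivation that makes one random choice per non-terminal. The toy grammar, the `derive` function, and the `max_depth` cut-off are all assumptions made for illustration; this is not the `sr.bnf` grammar or the generator code used in the experiments.

```python
import random

# Hypothetical toy grammar, NOT the sr.bnf grammar used in the experiments.
# Each non-terminal maps to a list of productions; a production is a list of symbols.
GRAMMAR = {
    "<expr>": [["<op>", "(", "<expr>", ", ", "<expr>", ")"], ["<var>"], ["<const>"]],
    "<op>": [["add"], ["sub"], ["mul"]],
    "<var>": [["x0"], ["x1"]],
    "<const>": [["0.5"], ["1.0"]],
}

def derive(symbol, rng, depth=0, max_depth=5):
    """Expand `symbol` recursively, choosing one production at random each time."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Assumed depth limit: drop self-recursive productions so derivation terminates.
        productions = [p for p in productions if symbol not in p]
    production = rng.choice(productions)
    return "".join(derive(s, rng, depth + 1, max_depth) for s in production)

rng = random.Random(0)
sample = derive("<expr>", rng)
print(sample)
```

Seeding `random.Random` makes the derivation reproducible, which is convenient when comparing generators.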

Our budget is 20,000 evaluations per run. We run each combination of parameters 30 times.

Our results will be:

* Objective function value, that is, training fitness: the negative of the RMSE. Higher is better. (We also have results on unseen data, but they are not visualised for now.)
* Node count in the best solution. Lower is better.
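
For reference, the objective described above (negative RMSE, so that higher is better) can be sketched as follows. The function name `neg_rmse` and the sample values are illustrative assumptions, not the fitness code used to produce the result files.

```python
import math

def neg_rmse(y_true, y_pred):
    """Return the negative root-mean-square error; 0 is a perfect fit."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)

# Illustrative values only.
perfect = neg_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
off_by_one = neg_rmse([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(perfect, off_by_one)
```

Because the sign is flipped, a maximising solver can use this value directly.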

We simply use boxplots to visualise the results. The boxplots appear decisive, so we do not run statistical tests.

Each figure represents the results of one problem instance. Within each figure, eight boxplots represent the combinations of (solver, trace type): 4 solvers by 2 trace types.

The **overall conclusion** is that RS is the worst, EA second-worst, and the two hill-climbing approaches (HC and LA) the best. The structured trace improves on the linear trace in all cases: sometimes the difference is clear-cut, and sometimes the distributions still overlap. There are a couple of cases where the EA with linear trace is worse than RS.

**Generators**: one possible explanation for the poor EA performance is that the naive generator (a simple recursive grammar derivation function, using a BNF grammar just as in previous implementations of GE) does not allow PTO to capture problem structure. Therefore, we implement a second generator: an "executable" grammar where each non-terminal is a random function, as in `test_tracer.py`. To see the effect, compare the first (BNF grammar) and second (executable grammar) plots for each problem. The results are identical (not just similar) for all problems and setups *except* the EA with structured trace, where the executable grammar gives a small but noticeable improvement: for each problem, compare ('EA', 1) in the first plot with ('EA', 1) in the second plot. Even with this improvement, the EA is still worse than the hill-climbing approaches.
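
The idea of an "executable" grammar, where each non-terminal is itself a random function, can be sketched as below. This is an illustrative toy under stated assumptions, not the code from `test_tracer.py`: the function names, the set of operators and terminals, the 0.5 probabilities, and `MAX_DEPTH` are all hypothetical.

```python
import random

rng = random.Random(0)   # seeded for reproducibility
MAX_DEPTH = 5            # assumed depth limit so that recursion terminates

# Each non-terminal of the (hypothetical) grammar is a Python function
# that makes its own random choice and calls other non-terminals directly.
def op():
    return rng.choice(["add", "sub", "mul"])

def var():
    return rng.choice(["x0", "x1"])

def const():
    return rng.choice(["0.5", "1.0"])

def expr(depth=0):
    # Choose a leaf at the depth limit, or with probability 0.5 otherwise.
    if depth >= MAX_DEPTH or rng.random() < 0.5:
        return var() if rng.random() < 0.5 else const()
    return "%s(%s, %s)" % (op(), expr(depth + 1), expr(depth + 1))

sample = expr()
print(sample)
```

Compared with interpreting a BNF data structure, the grammar here is ordinary code, so each random decision is made inside the function for the non-terminal it belongs to.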

The total runtime for the "original" algorithm was less than an hour on a 24-core, 24 GB RAM Mac Pro running Ubuntu Linux.

Boxplots are created and shown below. Further results are discussed afterwards.

In [None]:
filenames = ["GE_results.dat"]
problems = [
    "Pagie2D",
    "DowNorm",
    "HousingNorm",
    "TowerNorm",
    "Vladislavleva4"
    ]
generators = ["BNF", "exec"]


In [None]:
for filename in filenames:
    d = pd.read_csv(filename, delimiter="\t", 
                    names=["problem", "grammar", "solver", "generator", "str_trace", "budget", "rep", "obj", "test_obj", "fn"])
    # shorten the names
    d.loc[d["generator"] == "GE_randsol", "generator"] = "BNF"
    d.loc[d["generator"] == "GE_randsol_sr_nobnf", "generator"] = "exec"
    d["node_count"] = d.apply(lambda row: count_nodes(row.fn), axis=1)
    
    for problem in problems:
        for generator in generators:
            d[(d["problem"] == problem) & (d["generator"] == generator)].boxplot(
                column="obj", by=["solver", "str_trace"], grid=False)
            plt.title(": ".join(("GE", problem, generator)))
            plt.suptitle("")
            plt.ylabel("Objective")
            
            d[(d["problem"] == problem) & (d["generator"] == generator)].boxplot(
                column="node_count", by=["solver", "str_trace"], grid=False)
            plt.title(": ".join(("GE", problem, generator)))
            plt.suptitle("")
            plt.ylabel("Node count")
            

Duplicate avoidance
===

Another possible explanation for the poor EA result is the repetition/duplication of many individuals, which is probably more common in EAs than in HC. A simple cache is added to each algorithm to avoid duplicates. Thus, we now have:

* For each solver, a new version with duplicate avoidance
* 1 grammar: to save runtime, the following experiment uses only the executable grammar.

The **main conclusion** now is that performance is slightly better, but not decisively. When we add duplicate avoidance, individuals become much larger, because they are forced out of the part of the space where all individuals are small. Memory usage grows hugely: more than 2 GB per run. Fewer runs can execute simultaneously on our 24-core server; combined with the longer time needed to carry out extra mutations, this means runtime is now several days.
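
A minimal sketch of duplicate avoidance in a hill climber is shown below, assuming the cache is a set of already-evaluated individuals and that a duplicate candidate is mutated again until it is new. The function, its parameters (including `max_tries`), and the toy bit-string problem are illustrative assumptions, not the implementation used for these results.

```python
import random

def hill_climb_no_duplicates(fitness, random_solution, mutate, budget, rng, max_tries=100):
    """Hill climber that never evaluates the same individual twice."""
    seen = set()                       # cache of every individual evaluated so far
    best = random_solution(rng)
    best_fit = fitness(best)
    seen.add(best)
    evaluations = 1
    while evaluations < budget:
        candidate = mutate(best, rng)
        tries = 1
        # Extra mutations: keep mutating until the candidate is new (or we give up).
        while candidate in seen and tries < max_tries:
            candidate = mutate(candidate, rng)
            tries += 1
        if candidate in seen:
            break                      # could not find an unseen individual
        seen.add(candidate)
        cand_fit = fitness(candidate)
        evaluations += 1
        if cand_fit >= best_fit:
            best, best_fit = candidate, cand_fit
    return best, best_fit, len(seen)

# Toy problem (hypothetical): maximise the number of ones in an 8-bit tuple.
def random_bits(rng):
    return tuple(rng.randint(0, 1) for _ in range(8))

def flip_one_bit(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

best, best_fit, n_seen = hill_climb_no_duplicates(
    fitness=sum, random_solution=random_bits, mutate=flip_one_bit,
    budget=50, rng=random.Random(0))
print(best, best_fit, n_seen)
```

In this sketch the `seen` set grows by one entry per evaluation, and repeated mutation pushes candidates further from the current best; both effects are in line with the memory growth and larger individuals reported above.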

In [None]:
# in the experiment with duplicate avoidance,
# we only used the "executable" grammar, not the bnf
filenames = ["GE_duplicate_results.dat"]
generators = ["exec"]


In [None]:
for filename in filenames:
    d = pd.read_csv(filename, delimiter="\t", 
                    names=["problem", "grammar", "solver", "generator", "str_trace", "budget", "rep", "obj", "test_obj", "fn"])
    # shorten the names
    d.loc[d["generator"] == "GE_randsol", "generator"] = "BNF"
    d.loc[d["generator"] == "GE_randsol_sr_nobnf", "generator"] = "exec"
    d["node_count"] = d.apply(lambda row: count_nodes(row.fn), axis=1)
    
    for problem in problems:
        for generator in generators:
            d[(d["problem"] == problem) & (d["generator"] == generator)].boxplot(
                column="obj", by=["solver", "str_trace"], grid=False)
            plt.title(": ".join(("GE no dups", problem, generator)))
            plt.suptitle("")
            plt.ylabel("Objective")
            
            d[(d["problem"] == problem) & (d["generator"] == generator)].boxplot(
                column="node_count", by=["solver", "str_trace"], grid=False)
            plt.title(": ".join(("GE no dups", problem, generator)))
            plt.suptitle("")
            plt.ylabel("Node count")
            

Performance by generation
===


Next, we look at performance by generation, comparing trace types and solvers over the course of the run. Our suspicion is that the EA should beat HC; if it does not, perhaps the runs are too short, which may benefit HC relative to the EA. The runs are 20,000 iterations. Based on these results, that does not seem to be the case: the EA and HC methods plateau in much the same way.

For each dataset, we show two plots: one for all solvers with structured trace, the other for all solvers with linear trace. The structured trace is **much** better than linear for the EA. The gap is smaller for HC and LA, but the structured trace is still **decisively** better. In several cases, the EA with linear trace *stagnates* after about one quarter of the run has elapsed.

In [None]:
dirname = "/Users/jmmcd/Desktop/GE_duplicate_results_gens"

problems = [
    "Pagie2D",
    "DowNorm",
    "HousingNorm",
    "TowerNorm",
    "Vladislavleva4"
    ]
grammar_file = "sr.bnf"
generators = ["GE_randsol_sr_nobnf"]
solvers = ["RS", "HC", "LA", "EA"]
str_traces = [False, True]
reps = 30
budget = 20000

for problem in problems:
    for generator in generators:
        for str_trace in str_traces:
            for solver in solvers:
                basename = "_".join((problem, grammar_file, solver, generator, str(int(str_trace)), str(budget)))
                filenames = [basename + "_" + str(rep) + ".gens" for rep in range(reps)]
                
                d = np.array([np.genfromtxt(os.path.join(dirname, filename))[:budget, :] for filename in filenames])
                d = d.mean(axis=0)
                plt.plot(d[:, 0], d[:, 1], label=solver)
                
            plt.title(": ".join((problem, "Structured" if str_trace else "Linear", generator)))
            plt.xlabel("Iteration")
            plt.ylabel("Objective")
            plt.legend()
            plt.show()

# Random polynomials

Next, we consider a set of problems defined by target polynomials. We have the same solvers and trace types as before. We use just the executable grammar, and we switch off duplicate avoidance. The problems are defined as polynomials of $n \in \{1, 2, 3\}$ variables, of degree $d \in \{2, 4, 6, 8, 10, 12, 14, 16, 18, 20\}$. The coefficients for all terms are randomly generated.
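
One way such target polynomials could be constructed is sketched below. It assumes that "all terms" means every monomial of total degree at most $d$ in the $n$ variables, and that coefficients are drawn uniformly from $[-1, 1]$; both are assumptions made for illustration, as are the function names, and the generator actually used for these experiments may differ.

```python
import itertools
import random

def random_polynomial(n_vars, degree, rng):
    """Map each exponent tuple with total degree <= `degree` to a random coefficient."""
    terms = {}
    for exps in itertools.product(range(degree + 1), repeat=n_vars):
        if sum(exps) <= degree:
            # Assumed coefficient distribution: uniform on [-1, 1].
            terms[exps] = rng.uniform(-1.0, 1.0)
    return terms

def evaluate(poly, x):
    """Evaluate the polynomial at the point `x` (a sequence of n_vars numbers)."""
    total = 0.0
    for exps, coef in poly.items():
        term = coef
        for xi, e in zip(x, exps):
            term *= xi ** e
        total += term
    return total

rng = random.Random(0)
poly = random_polynomial(n_vars=2, degree=4, rng=rng)
print(len(poly), "terms; value at (0.5, -1.0):", evaluate(poly, (0.5, -1.0)))
```

Under these assumptions the number of terms is $\binom{n + d}{d}$, for example 15 terms for $n = 2$, $d = 4$, so the targets grow quickly with both the degree and the number of variables.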

**Results**: the *number of variables* ($n$) doesn't make a big difference. The degree does: higher degree is harder. The structured trace helps the EA a lot: with linear trace, it is about the same as random search; with structured trace, it is nearly as good as the hill-climbing approaches. Concerning *node count*: RS always has the smallest node count. Recall the node count is the number of nodes in the best individual per run. We are considering training error only, not unseen data. 

In [None]:
filename = "/Users/jmmcd/Desktop/GE_results_poly.dat"
generators = ["exec"]
problems = ["poly_%d_%d" % (d, n) for d in range(2, 21, 2) for n in range(1, 4)]
solvers = ["RS", "HC", "EA"]

In [None]:
d = pd.read_csv(filename, delimiter="\t", 
                names=["problem", "grammar", "solver", "generator", "str_trace", "budget", "rep", "obj", "test_obj", "fn"])
# shorten the names
d.loc[d["generator"] == "GE_randsol", "generator"] = "BNF"
d.loc[d["generator"] == "GE_randsol_sr_nobnf", "generator"] = "exec"
d["node_count"] = d.apply(lambda row: count_nodes(row.fn), axis=1)

d.head()

In [None]:
for n in range(1, 4):
    for str_trace in [False, True]:
        for field in ["obj", "node_count"]:
            for solver in solvers:
                x = list(range(2, 21, 2))
                y = [d[
                    (d["problem"] == "poly_%d_%d" % (deg, n)) & 
                    (d["solver"] == solver) &
                    (d["str_trace"] == str_trace)
                      ][field].mean() for deg in x]
                plt.plot(x, y, label=solver)
            # label the figure once, after all solvers have been plotted
            plt.title(": ".join(("GE", "poly", str(n), "vars", "str trace", str(str_trace))))
            plt.ylabel("Objective" if field == "obj" else "Node count")
            plt.xlabel("Degree")
            plt.xticks(range(2, 21, 2))
            plt.legend()
            plt.show()

# Random polynomials with more variables

We have a "background hypothesis" that EAs tend to beat HC on harder/larger problems. The random polynomial results above suggest that a large polynomial degree does not make the EA win, but a larger number of variables might help. So, in the following we fix the degree at $d=4$ and scale $n \in \{1, \ldots, 10\}$. The results do show that the EA holds a larger advantage over RS as we scale up, but it still loses ground relative to HC.

**Importantly**, the EA, HC and LA methods are considerably better with the structured trace, versus linear trace, when we scale up.

Concerning node-count, the RS still makes smaller solutions, but there isn't a strong effect of scaling with $n$ (perhaps a surprise).