# Data Generation with Flatland

This notebook outlines how training datasets are generated in Flatland. It will evolve into documentation of how to use the tool through a more formal, simplified API.

### Here's the idea

Some modern protein structure prediction approaches are complicated to implement. It therefore seems useful to have a simulator that generates data with at least the same structure as the data researchers use in such systems. Over time, these simulators can be progressively improved to add a minimal level of realism, which should help with initial model debugging. For example, one might simplify the simulation parameters until a model trains effectively, then return to the more complex form once the simpler problem has been solved. We hope this creates a much smoother path to solving the larger problem than the one often followed by those attacking it directly.

Further, even when training larger systems on real data, it is important that system components remain integrated and that both the system and its individual components continue to function correctly. Simple toy test cases are often used for this purpose in software testing, but in ML software engineering it sometimes helps if these are slightly realistic. We are also interested in understanding how various additional sources of data might enhance the performance of structure prediction systems.

Below we will evolve populations of polymers using a trivially simple fitness metric, retaining a "genetic history" of the evolved populations along the way. Then we will compute structures for these polymers using JAX MD. For each "solved" structure we will compute a pairwise "residue" distance matrix and a vector of "bond" angles. Lastly, we will simulate a compound-protein interaction experiment, again using JAX MD.
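To make the two structural features concrete, here is a minimal NumPy sketch of computing a pairwise distance matrix and a vector of bond angles from 2D coordinates. The function names and the example coordinates are illustrative assumptions, not Flatland's API, and the structures in this notebook come from JAX MD rather than from hand-written positions.

```python
import numpy as np

def pairwise_distances(positions):
    """Return the (n, n) matrix of Euclidean distances between residues."""
    diffs = positions[:, None, :] - positions[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

def bond_angles(positions):
    """Return the n - 2 interior angles (radians) at each non-terminal residue."""
    bonds = positions[1:] - positions[:-1]   # vectors along consecutive bonds
    a = -bonds[:-1]                          # from middle residue back to previous
    b = bonds[1:]                            # from middle residue on to next
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical 2D coordinates for a 4-residue polymer (not simulation output).
positions = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
distances = pairwise_distances(positions)
angles = bond_angles(positions)
```

A polymer of n residues yields an n x n symmetric distance matrix and n - 2 angles, one per interior residue.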


### Setup

Ensure the most recent version of Flatland is installed.

In [None]:

!pip install git+https://github.com/cayley-group/flatland.git --quiet


In [1]:

import pprint
from flatland import dataset




In [7]:

ds = dataset.FlatlandBase()


## Configuration

Here we'll configure our evolutionary simulations. Since this is a demo, we keep them very simple: polymers of length 10 with elements drawn from an alphabet of size 3, and just 2 population members evolved for only 2 generations.

In [3]:

pprint.pprint(dict(ds._simulation_config()._asdict()))


{'alphabet_size': 3,
 'genome_length': 10,
 'keep_full_population_history': True,
 'mutation_rate': 0.15,
 'num_generations': 2,
 'pop_size': 2,
 'report_every': 10}
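To illustrate how parameters like these might drive a simulation, here is a mutation-only sketch. `SimulationConfig`, `mutate`, and the loop are hypothetical stand-ins written for this example; they are not Flatland's implementation, and selection on fitness is deliberately omitted.

```python
import collections
import numpy as np

# Hypothetical stand-in mirroring the fields printed above.
SimulationConfig = collections.namedtuple("SimulationConfig", [
    "alphabet_size", "genome_length", "keep_full_population_history",
    "mutation_rate", "num_generations", "pop_size", "report_every"])

config = SimulationConfig(
    alphabet_size=3, genome_length=10, keep_full_population_history=True,
    mutation_rate=0.15, num_generations=2, pop_size=2, report_every=10)

def mutate(population, config, rng):
    """Resample each position with probability `mutation_rate` (assumed scheme)."""
    mask = rng.random(population.shape) < config.mutation_rate
    replacements = rng.integers(0, config.alphabet_size, size=population.shape)
    return np.where(mask, replacements, population)

rng = np.random.default_rng(0)
population = rng.integers(
    0, config.alphabet_size, size=(config.pop_size, config.genome_length))

# Keep every generation, in the spirit of `keep_full_population_history`.
history = [population]
for _ in range(config.num_generations):
    population = mutate(population, config, rng)
    history.append(population)
```

With these settings `history` holds the initial population plus one entry per generation, each of shape `(pop_size, genome_length)`.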


## Run simulations

Now we'll evolve our polymer populations using a trivial fitness measure: how closely the average of the integer encodings of a polymer's elements comes to 1.0. Alternatively, we could specify a `fitness_fn` that simulates the polymers, computes their energy, and treats the simulated polymer energy as the measure of fitness (energies are more negative for more energetically stable structures). Likewise, we could simulate the interaction of polymers with a set of compounds and define fitness as selectivity, i.e. the system being low-energy only when it includes one or more target compounds.
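As a rough illustration of the trivial fitness measure, the sketch below scores each polymer by the negative absolute deviation of its mean encoding from 1.0. The function name, the sign convention, and the example population are assumptions; the exact formula Flatland uses is not shown in this notebook.

```python
import numpy as np

def mean_closeness_fitness(population, target=1.0):
    """Score polymers by closeness of their mean integer encoding to `target`.

    Assumed form: negative absolute deviation, so 0.0 is a perfect score.
    """
    population = np.asarray(population, dtype=np.float32)
    return -np.abs(population.mean(axis=1) - target)

# Hypothetical population: 3 polymers of length 10 over the alphabet {0, 1, 2}.
population = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # mean 1.0
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],   # mean 0.9
    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],   # mean 2.0
])
fitness = mean_closeness_fitness(population)
fittest = int(np.argmax(fitness))
```

An energy-based `fitness_fn` would have the same shape, taking a population and returning one score per polymer, but would run a simulation instead of taking a mean.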


#### Evolve the training population

In the future we could scale this up arbitrarily by running as many of these simulations in parallel as we like and aggregating the results. This is analogous to simulating the independent evolution of polymer families that are evolutionarily distant.
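One way such scaling could look is sketched below: several independent simulations run concurrently, one per shard, and their results are concatenated. `simulate_shard` is a hypothetical stand-in that just samples random polymers; it is not a Flatland function, and a real version would run a full evolutionary simulation per shard.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def simulate_shard(shard_id, pop_size=2, genome_length=10, alphabet_size=3):
    """Hypothetical stand-in for one independent evolutionary simulation."""
    rng = np.random.default_rng(shard_id)  # distinct seed per shard
    return rng.integers(0, alphabet_size, size=(pop_size, genome_length))

shard_ids = range(4)
with ThreadPoolExecutor(max_workers=4) as pool:
    shards = list(pool.map(simulate_shard, shard_ids))

# Aggregate the shards and record which family each polymer came from.
polymers = np.concatenate(shards, axis=0)
family_ids = np.repeat(np.arange(len(shards)), [len(s) for s in shards])
```

Keeping `family_ids` alongside the aggregated polymers preserves the fact that each shard evolved independently of the others.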

In [4]:

ds.simulate_dataset(split="train", shard_id=0)


Current average fitness: 0.29999998


INFO:absl:Generating examples for polymer population of size 2
INFO:absl:Finished processing 1 examples.
INFO:absl:Finished processing 2 examples.
INFO:absl:Finished generating examples, writing to disk.


#### Evolve the test population

For now, we evolve a single test polymer population. This is analogous to the problem of inferring the structure of a polymer that is evolutionarily distant from any we have seen before.

In some sense, this kind of generalization may not be necessary, as solvers may do well to just memorize solutions to the kinds of structures that are known to occur in nature. One way to interpret the recent success of using evolutionary information for folding is that it does exactly this: it cues solvers on how to reuse previously accumulated knowledge about how certain subsequences fold.

The benefit of such a test would concern completely novel polymers that are not homologous to anything currently known to occur in nature, or at least ones that arise from a part of the evolutionary tree that diverged anciently from the part our model was trained on.

It would also be feasible to construct a test set that shares a closer evolutionary history with the training populations: hold out some polymers from those populations for testing, so that alignments are shared across both splits.
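A hold-out of this kind could be sketched as follows. `holdout_split` and its parameters are hypothetical names chosen for this example, and the notebook itself does not perform this split.

```python
import numpy as np

def holdout_split(num_polymers, test_fraction, seed=0):
    """Randomly partition polymer indices of one population into train and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_polymers)
    num_test = max(1, int(round(num_polymers * test_fraction)))
    test_idx = np.sort(order[:num_test])
    train_idx = np.sort(order[num_test:])
    return train_idx, test_idx

# Hypothetical population of 10 polymers that share one evolutionary history.
train_idx, test_idx = holdout_split(num_polymers=10, test_fraction=0.2)
```

Because both index sets come from the same evolved population, the held-out polymers remain homologous to the training ones, in contrast to the independently evolved test population used in this notebook.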


In [5]:

ds.simulate_dataset(split="test", shard_id=0)


INFO:absl:Generating examples for polymer population of size 2


Current average fitness: 0.29999998


INFO:absl:Finished processing 1 examples.
INFO:absl:Finished processing 2 examples.
INFO:absl:Finished generating examples, writing to disk.


#### Evolve the validation population

The validation set is generated in the same way as the test set.

In [6]:

ds.simulate_dataset(split="validation", shard_id=0)


INFO:absl:Generating examples for polymer population of size 2


Current average fitness: 0.29999998


INFO:absl:Finished processing 1 examples.
INFO:absl:Finished processing 2 examples.
INFO:absl:Finished generating examples, writing to disk.
