# Data Generation with Flatland

This notebook outlines how training datasets will be generated in Flatland. It will evolve into documentation of how to use the tool through a more formal, simplified API.

### Here's the idea

Some modern protein structure prediction approaches are complicated to implement. A simulator that generates data with at least the same structure as the data researchers feed into such systems therefore seems useful. Over time, the simulator can be improved progressively to add a minimal level of realism, which should help with initial model debugging. For example, one might simplify the simulation parameters until a model trains effectively, then return to the more complex form once the simpler problem has been solved. We hope this creates a much smoother path to solving the larger problem than attacking it directly.

Further, even when training larger systems on real data, it is important that the components remain integrated and that both the system and its individual components continue to function correctly. Software test engineering often uses simple toy test cases for this purpose, but in ML software engineering it sometimes helps if these are slightly realistic. We are also interested in understanding the potential of various additional sources of data to enhance the performance of structure prediction systems.

Below we will evolve a population of polymers using a trivially simple fitness metric and, in the course of that, retain a "genetic history" of the evolved population. Then we will compute structures for these polymers using JAX MD. For each "solved" structure we will compute a pairwise "residue" distance matrix and a vector of "bond" angles. Lastly, we will simulate a compound-protein interaction experiment, again using JAX MD.
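To make the first and third steps concrete, here is a minimal standard-library sketch. It is not Flatland's implementation. The function names (`evolve`, `distance_matrix`, `bond_angles`), the toy fitness (count of symbol 0), and the truncation-selection scheme are all assumptions made for illustration. The structure-solving step is skipped entirely, so the feature functions take 2D coordinates as given.

```python
import math
import random


def evolve(alphabet_size, population_size, genome_length,
           mutation_rate, num_generations, seed=0):
    """Evolve integer-coded polymers; return one snapshot per generation.

    Hypothetical helper, not part of Flatland.
    """
    rng = random.Random(seed)

    def fitness(genome):
        # Assumed toy fitness: number of occurrences of symbol 0.
        return genome.count(0)

    population = [[rng.randrange(alphabet_size) for _ in range(genome_length)]
                  for _ in range(population_size)]
    history = [population]  # the "genetic history"
    for _ in range(num_generations):
        # Truncation selection: the fitter half become parents.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(1, population_size // 2)]
        # Each child copies a random parent, mutating each site independently.
        population = [
            [rng.randrange(alphabet_size) if rng.random() < mutation_rate else s
             for s in rng.choice(parents)]
            for _ in range(population_size)
        ]
        history.append(population)
    return history


def distance_matrix(coords):
    """Pairwise Euclidean distances between "residue" positions."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def bond_angles(coords):
    """Unsigned angle (radians) at each interior point of a 2D chain."""
    angles = []
    for prev, mid, nxt in zip(coords, coords[1:], coords[2:]):
        ux, uy = prev[0] - mid[0], prev[1] - mid[1]
        vx, vy = nxt[0] - mid[0], nxt[1] - mid[1]
        cross = ux * vy - uy * vx
        dot = ux * vx + uy * vy
        angles.append(abs(math.atan2(cross, dot)))
    return angles
```

With the configuration values used later in this notebook, `evolve(3, 500, 10, 0.15, 50)` would return 51 snapshots (the initial population plus one per generation), each holding 500 genomes of length 10. A chain of `n` points yields an `n` by `n` distance matrix and `n - 2` angles.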


### Setup

Ensure the most recent version of Flatland is installed.

In [None]:

!pip install git+https://github.com/cayley-group/flatland.git --quiet


### Configuration

In [4]:

alphabet_size = 3
population_size = 500
genome_length = 10
mutation_rate = 0.15
num_generations = 50

dimension = 2
box_size = 6.8

num_water_particles = 100


### Work in progress

We are porting the prototype contents of dev/data-generation-dev.ipynb into a well-designed, tested library, which this notebook will then access through a greatly simplified interface. Other JAX-based projects, e.g. [Trax](https://github.com/google/trax), keep datasets organized with [TensorFlow Datasets](https://github.com/tensorflow/datasets) and add a `tfds.as_numpy` call to the stream to convert from tensors, e.g. [here](https://github.com/google/trax/blob/2583058c4d52865a29bf9f6684b5919cd4a385e5/trax/supervised/pretrain_finetune.py#L32). This makes use of a rich system for organizing datasets and for rapidly serializing and de-serializing them, but incurs the overhead of the `as_numpy` call. Another option would be to emulate the relevant subset of TFDS functionality for our specific purposes.
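To give a rough sense of what the second option might involve, the sketch below writes examples to sharded JSON-lines files and streams them back as plain Python objects, using only the standard library. The names `write_shards` and `read_shards`, the file naming scheme, and the example record layout are all hypothetical. None of this reflects an existing Flatland or TFDS API, and a real implementation would likely want a binary format for large arrays.

```python
import json
import os
import tempfile


def write_shards(examples, directory, shard_size):
    """Write examples to numbered JSON-lines shards (hypothetical layout)."""
    paths = []
    for start in range(0, len(examples), shard_size):
        name = f"shard-{start // shard_size:05d}.jsonl"
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            for example in examples[start:start + shard_size]:
                f.write(json.dumps(example) + "\n")
        paths.append(path)
    return paths


def read_shards(paths):
    """Stream examples back one at a time, with no tensor conversion step."""
    for path in paths:
        with open(path) as f:
            for line in f:
                yield json.loads(line)


# Usage: round-trip a few toy records through a temporary directory.
examples = [{"genome": [0, 1, 2], "distances": [[0.0, 1.0], [1.0, 0.0]]}
            for _ in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_shards(examples, tmp, shard_size=2)
    restored = list(read_shards(paths))
```

Because the reader is a generator, examples are loaded lazily, one shard at a time, which is the part of the TFDS streaming behavior this sketch tries to mimic.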
