Diverse dataset generation for program synthesis

djslzx/datagen

datagen

David J. Lee, 2022-2024

Synthesize novel and diverse datasets for low-resource program synthesis domains.

Domains

  • Lindenmayer systems
  • MuJoCo/2D ant walker programs
  • Python programming puzzles
  • Regular expressions

Samples

Lindenmayer systems

(Images: Sample 1, Sample 2, Sample 3, Sample 4)

Ant walker paths

(Image: ant outputs)

Programming problems

See examples/puzzles

Framework

Dataset generation is treated as a Markov chain Monte Carlo (MCMC) search over sets of candidate programs, with the following proposal distributions:

  • LLM: generate new programs by prompting an LLM with program-mutation prompts (Python puzzles)
  • PCFG: fit a probabilistic context-free grammar (PCFG) to a program or set of programs using the inside-outside algorithm, then sample from the fitted grammar
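
As a rough illustration of the PCFG proposal, the sketch below samples a token sequence from a small, hand-written grammar. It covers only the sampling half: the rule probabilities are made up rather than fitted with inside-outside, and `TOY_PCFG`, `sample`, and the depth cap are hypothetical names and choices for this example, not code from this repository.

```python
import random

# Hypothetical toy PCFG: nonterminal -> list of (probability, right-hand side).
# Any symbol that is not a key of the grammar is treated as a terminal.
# The probabilities are invented for illustration, not fitted to data.
TOY_PCFG = {
    "EXPR": [
        (0.50, ["F"]),
        (0.20, ["EXPR", "EXPR"]),
        (0.15, ["+", "EXPR"]),
        (0.15, ["[", "EXPR", "]"]),
    ],
}


def sample(grammar, symbol, rng, max_depth=20):
    """Expand `symbol` top-down, picking each rule according to its probability."""
    if symbol not in grammar:
        return [symbol]  # terminal symbol: emit as-is
    rules = grammar[symbol]
    if max_depth <= 0:
        # Depth cap (an assumption of this sketch): fall back to the first
        # rule, which must contain only terminals.
        rhs = rules[0][1]
    else:
        weights = [p for p, _ in rules]
        bodies = [body for _, body in rules]
        rhs = rng.choices(bodies, weights=weights, k=1)[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample(grammar, child, rng, max_depth - 1))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print("".join(sample(TOY_PCFG, "EXPR", rng)))
```

In an MCMC setting, each such sample would serve as one proposed candidate program.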

and the following target distributions:

  • energy: maximize a physically motivated "energy" function over program embeddings (pulled from an off-the-shelf embedding model and compressed using dimensionality reduction techniques when necessary)
  • variance: maximize the sum of distances between each embedding and the mean embedding, among other variations
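
To make the variance target concrete, here is a minimal sketch that scores a set of embeddings by the sum of their Euclidean distances to the mean, together with one Metropolis-style accept/reject step over a candidate set. The choice of Euclidean distance, the swap-one-member move, the `temperature` parameter, and all function names are assumptions made for this example; the repository may do each of these differently.

```python
import math
import random


def mean_vector(embs):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(embs[0])
    return [sum(e[i] for e in embs) / len(embs) for i in range(dim)]


def variance_score(embs):
    """Sum of Euclidean distances from each embedding to the mean embedding."""
    mu = mean_vector(embs)
    return sum(math.dist(e, mu) for e in embs)


def metropolis_step(current, candidate, score, rng, temperature=1.0):
    """Swap `candidate` in for one random member of `current`, then accept or reject.

    Assumes a symmetric proposal and a target density proportional to
    exp(score / temperature); both are assumptions of this sketch.
    """
    i = rng.randrange(len(current))
    proposed = current[:i] + [candidate] + current[i + 1:]
    delta = score(proposed) - score(current)
    # Always accept improvements; accept worse sets with probability exp(delta / T).
    if delta >= 0 or rng.random() < math.exp(delta / temperature):
        return proposed
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    dataset = [[0.0, 0.0], [0.0, 0.0]]
    dataset = metropolis_step(dataset, [3.0, 4.0], variance_score, rng)
    print(dataset, variance_score(dataset))
```

Lower temperatures make the step closer to greedy hill climbing, while higher temperatures accept more diversity-reducing swaps.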

Note: this is research code, so browse at your own peril. :)
