Skip to content

a parsing framework that virtualizes and parallelizes the act of parsing. my future phd thesis

License

AGPL-3.0, Unknown licenses found

Licenses found

AGPL-3.0
LICENSE
Unknown
license-header-template
Notifications You must be signed in to change notification settings

cosmicexplorer/simultaneous-productions

Repository files navigation

simultaneous-productions

[0/3] overall timeline

refs

code TODO [33%]

  1. [X] remove no_std impl
  2. [X] graphviz for every intermediate graph in grammar processing! [3/3]
    1. [X] S.P. [2/2]
      1. [X] implement graphvis for basic S.P. in text_backend
      2. [X] expose basic S.P. and rest of text_backend publicly!
    2. [X] TokenGrammar
    3. [X] PreprocessedGrammar
  3. [X] text-based grammar input format (for generating test cases)! [4/4]
    1. [X] make basic grammar format
    2. [X] expose it publicly!
    3. [X] make it support literal > chars!
      • a single literal ~>~ is escaped by simply doubling it: ~>>~ becomes ~>~!
        • this made more sense than needing to resort to a separate escape character such as backslash!
    4. [X] make it go through SerializableGrammar [4/4]
      1. [X] implement SP::parse()
      2. [X] implement SP::serialize() [1/1]
        1. [X] make it accept any production name with doubled $$ if needed!
      3. [X] put SerializableGrammar in grammar_specification::constraints!
      4. [X] do proptest tests!!!
  4. [ ] make parsing understand stack cycles [0/3]
    1. [ ] represent an entire stack cycle graph as a case of NamedOrAnonStep
    2. [ ] make it possible to traverse/resume all paths through the stack cycle graph in e.g. ParseableGrammar::stack_diff_pair_zipper
    3. [ ] graphvis for ParseableGrammar!
  5. [ ] introduce a mode to generate a string from a given PreprocessedGrammar
  6. [-] enable regex operators! [2/8]
    1. [X] introduce Group as a CaseElement!
    2. [X] enable grouping with arbitrary depth in SPTextFormat
      • e.g. (<ab> -> $A$) to group CaseHead lists
      • THIS WAS CRAZY HARD AND EXTREMELY USEFUL!!!
    3. [ ] VERIFY THAT ALL OF THESE TRANSFORMATIONS PARSE CORRECTLY ONCE PARSING WORKS!
      • i.e.: make parsing with stack cycles work first!!!
    4. [-] suffix [1/4]
      1. [X] ?
      2. [ ] *
      3. [ ] +
      4. [ ] {n,m}
    5. [ ] alternator |
    6. [ ] literal
      • e.g. /a(b)+/, with // as a new type of CaseHead
    7. [ ] generate these operators for serde tests!
    8. [ ] do proptest tests for equality of PreprocessedGrammar instances (not just serde)!
      • e.g. <a>? is equivalent to (<a>)?
  7. [ ] implement typed callbacks [0/5]
    1. [ ] execute a callback whenever a sub-parse succeeds
    2. [ ] signal parse failures
    3. [ ] implement typed callbacks (for e.g. AST construction) by stuffing intermediate arguments in the stack symbols!
    4. [ ] replace reconstruction.rs with this!
    5. [ ] do a graphviz for this!
  8. [ ] implement context-sensitive and recursively-enumerable grammatical constructs!
    • also consider backrefs (can we make these sub-exponential?)!
  9. [ ] make a single iteration of parsing/resolution into a re-entrant(/lock-free?) process!
    • we’ll probably want to explicitly separate:
      1. selecting an adjacent pair to resolve: pop off an adjacent pair of sub-parses from the sub-parse forest.
      2. resolving the adjacent pair: push the merged sub-parse back onto the sub-parse forest.
    • having a lock/atomic flag per sub-parse might be the easiest way to do this?
      • atomic flag is more feasible than usual for this use case, because there are so many other sub-parses to choose from
      • could make it: per sub-parse /greater than ~k~ tokens wide/?

S.P. paper

abstract

grammar specification

motivation

  • with some basic example(s), without introducing new notation
  • <<eureka>> realization that this nice interface that i’d been looking for happened to be highly amenable to parallelizable/incremental parsing
    • and that i was starting off looking for something totally different!
    • not too much into depth

usage

  • introduce notation for describing the grammar, ellipses, etc
  • delve a little into eureka

parsing algorithm description

describe the input

  • the input is a SimultaneousProductions instance
  • make sure to make it clear how this generalizes to arbitrary tokens, not just text
    • and try to go into why

the lowering steps, eventually into PreprocessedGrammar

  • to be figured out in code

applying parsing

  • to be figured out in code

analysis

runtime

  • this is where you can show people how everyone has always been wrong. this should be the first section. no games.

reduction from SAT

  • don’t even need to mention this except in the abstract maybe? it can be a fun surprise and make the reader go “huh, i guess that’s where the runtime comes from”
    • make it clear how this doesn’t become a nondeterministic turing machine
      • maybe this has something to do with the fact that it only processes straight line input? this might be wrong

differences from “formal grammars”

  • but don’t even go into this too much, just enough to explain how we can have better performance with a better interface
  • make sure to explain what has been wrong about parsing and not get caught up in why

implementation

  • talk a little about how rust is a truly fantastic language to implement algorithms in
    • move construction by default and lifetimes are amazing for correctness
  • benchmarks
    • what use cases does it do better or worse on?
    • what’s holding it back?
  • PARALLELISM
    • this needs some intense thought, because this is how we can demonstrate massive speedups over other methods

unknown / future work

  • simd or other stuff
    • enough to show i’ve thought about how to implement it on a microprocessor level as well
    • gives people who know what they’re talking about enough of a ladder to almost immediately do that

[0/3] running it in reverse to guess grammars YES, BEFORE PUBLISHING! (BUT AFTER THE FORWARD ALGORITHM)

  • this may all be invalidated by tweet translation
  • this is a good idea because we have proven the model can be reduced from SAT
    • and therefore capable of arbitrary computation, or that’s the idea
    • so if you figure out how to tweak the knobs you can maybe assume it’ll be a <<perfectly general inference method>>
      • (the idea of this is completely bonkers to me)
  • IF YOU DON’T PUBLISH THIS ALONG WITH THE ORIGINAL PAPER, SOMEONE ELSE WHO IS MORE FAMOUS WILL, SO YES, IT NEEDS TO BE IN HERE, AND IT NEEDS TO BE DEVELOPED
    • this is a sad but unfortunate reality
    • if you do this right though, then you really have you choice of <<phd>> locked in
      • so in that case, no need to rush
  • this should be a separate paper
    • but it would need to be posted at the exact same time thanks to lack of trust
    • should cite the first paper
  • [ ] find a good example of a nondeterministic sequentual input which isn’t necessarily hierarchical
    • <<DNA/RNA>>
      • there may be many strong examples of this throughout bio which are not related to genes
        • alternative: guessing chaotic models based off of readings taken at regular intervals
          • e.g. heartbeat, see “Does God Play Dice?” CITE THAT BOOK!!!
    • <<natural language>>
  • [ ] determine a good statistical model to tweak
    • honestly, i would be very surprised if the answer wasn’t “hook up a monte carlo tree search and call it a day”
  • [ ] get a good result
    • this is maybe going to be easier with natural language than with DNA/RNA due to data availability, however:
      1. i care about bio
      2. the natural language field is oversaturated and it’ll be hard to get a unique result
      3. i don’t think anyone is doing anything like this in bioinformatics (and i think they should be)
        • and i want that phd
    • patience is key, i have forever
    • we definitely want a good result, but we don’t need to go as hard as on the initial algorithm
      • i would love to take on a collaborator, but i don’t <<trust>> anyone enough
      • so we want something here that:
        1. is pretty significant
          • demonstrates clear advancement of the state of the art
          • could be considered a founding paper of a field
        2. shows i know what i’m talking about
        3. shows the idea was mine
  • this work is likely to spark ideas about the original algorithm!

tweet translation

  • a hell of a shower thought <2019-01-21 Mon 13:23:24> (MLK day)

why this is the best idea ever

  • allows me to stay at twitter (forever?)
    • twitter likes using patents defensively (has taken a pledge to do so? FIND THE WORDING OF THIS PLEDGE)
    • if the rsc pitch works, then i can even remain on the build team, which would be incredible
  • gives me ML hardware, expertise, and guidance
  • provides a FANTASTIC, maybe the BEST example of why “S.P in reverse” (“P.S.”?) is a great idea
    • tweets are <<small bits of language>>, UNLIKE what other machine translation services train on (presumably)
      • S.P. allows for cross-serial dependencies and is a perfectly general inference method (?)
      • S.P. works in parallel by default as opposed to running sequentially across a long string of text
  • allows twitter to do its own translation
    • can’t tell if this is immediately a win for cost/maintainability/flexibility reasons
      • it probably is, though, just because we don’t have to ship our text to an external service
        • and if the external service only knows about the individual tweet it’s asked to translate?
          • then the fact that tweets are small bits of language that twitter alone can train on at scale might mean we can achieve domain-specific accuracy that would be impossible for an external service to achieve

[0/2] rollout / pitch inside twitter

[ ] <<S.P. real benchmarking>>
start off with S.P. and showing there is some nontrivial speedup against at least lex/bison
  • <<rsc>>: this continues the investment in tooling performance as per rsc and expands the already-unprecedented mindshare we have for making compilers fast and easy to use
    • (i think this is a very good pitch line)
  • start off with either of the following, to demonstrate some nontrivial speedup in specific scenarios:
    1. implementing rsc pre-parsing to decouple file ingestion from compilation
    2. implementing rsc pre-parsing along with S.P. at the same time
  • it may not be necessary to do it along with pre-parsing for rsc, but pre-parsing may be a good way for me to become familiar enough with the performance characteristics and benchmarking so that i can know whether to make the S.P. proposal
[ ] <<P.S. prototype>>
demonstrate some prototype of P.S. (reverse) working
  • this might be hard without asking for help
  • people are going to assume i think this is a good idea because it’s my pet project
    • that can be fine, if we make part of the pitch “give me time to develop this P.S. concept” along with S.P
      • find clear success criteria to propose
      • iterate on the application
      • might be possible to get someone else excited about trying this or showing it doesn’t work
        • “showing it doesn’t work” would be an acceptable end goal for me, because i can then know for a fact it is ok to publish S.P. by itself, and be sure that i’m not missing out on <<the real prize>>
          • “the real prize” part can be a good pitch line
            • it explains why i myself really want to investigate it, and why i really wanted to work with twitter for this
              • (along with the relationship of S.P to rsc work)
            • in the contex of “i am a compiler person who wants to write compilers” (easy to show), this is believable
            • it also might excite someone else
  • “P.S.” also sounds like “post script”, and if i put that in the proposal, people will think it is funny and also maybe see more how it is the secondary goal
  • in pitch, can ask for “second half of the year” to work on P.S. (or something)

old

A Scala parser combinator library efficiently implementing “simultaneous productions”, a model equivalent to a Turing Machine (I think). The method of simultaneous productions allows specifying languages extremely naturally, and maps perfectly to the parser combinator operations I have in mind. It can also be implemented with a linear (?) partitioning algorithm.

Ideal Code

let expr = sp![
  E = ( e: E ) => e;
  E = (base:E "^" exp:E) => Pow(base, exp);
  /* \.E = { \.base[.E] "^" \.exp[.E] } => $Pow(.base, .exp); */
  /* \.E = { \.[.E] "^" \.[.E] } ~=> $Pow; */
  E = (E "^" E) => Pow(_.1, _.2);
  E = (E "^" E) => Pow;
  E = E "*" E;
  E = E "/" E;
  E = E "+" E;
  E = E "-" E;
  E = IntegerLiteral => IntLit(_);
  E = FloatingPointLiteral => FPLit(_);
];
val FloatingPointLiteral = sp.productions(
  ("float-signed" -> Cases(Parser(Tok("-") * Ref("float-unsigned"), { - _._2 }),
                           Parser(Tok("+") * Ref("float-unsigned"), { _._2 }))),
  // NB: should make sure sp.NumberLiterals returns 0 for an empty string
  ("float-base" -> SingleCase(sp.NumberLiterals)),
  ("float-mantissa" -> SingleCase(sp.NumberLiterals)),
  ("float-unsigned" -> Cases(Parser(Ref("float-base"), { toFloat(sp.parseIntegral(_._1)) }),
                             Parser(Ref("float-base") * Tok(".") * Ref("float-mantissa"), {
                               // glossing over the details of converting e.g. ".123" to 1/10 + 2/10 + 3/10
                               case (base, _, mantissa) => toFloat(sp.parseIntegral(base)) + sp.parseFloat(mantissa)
                             }))),
  ("exponent" -> Cases(Parser(Tok("e") * Ref("exponent-negated")),
                       Parser(Tok("e") * Ref("exponent-unsigned")))),
  ("exponent-negated" -> SingleCase(Parser(Tok("-") * Ref("exponent-unsigned")))),
  // sp.NumberLiterals is a whole Parser, and should probably return a string
  ("exponent-unsigned" -> SingleCase(sp.NumberLiterals)),
)

val WithWeirdIntegerLiterals = Grammars.C.productions.entry[IntegerLiteral] // Use a type-indexed map!
  .replaceCases
  .addCase(('A', SomeSubProductionType, 'C') ~> { (a: Token, inner: SomeSubProductionType, c: Token) =>
    IntegerLiteral(s"${a}${inner.toString}${c}")
  }.build() // This could be hidden behind an implicit.
)

[0/6]

  • [ ] specify a simple language so that it compiles
    • use fixed strings instead of regex for now
    • use strings instead of type-indexing the productions for now
  • [ ] implement the simple language so that it can be parsed
  • [ ] figure out how to allow productions to be type-indexed and require type-checking for that type in all the cases of the production
  • [ ] make a simple language that is usable for some simple task
    • csv parsing? or at least a simple subset of it
  • [ ] develop benchmarking and (fuzz)? testing methods
  • [ ] parse C and C++

LICENSE

AGPL 3.0+

About

a parsing framework that virtualizes and parallelizes the act of parsing. my future phd thesis

Resources

License

AGPL-3.0, Unknown licenses found

Licenses found

AGPL-3.0
LICENSE
Unknown
license-header-template

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published