- [ ] implement S.P. (see code TODO)
- [ ] begin S.P. real benchmarking, potentially alongside rsc pre-parsing
- [ ] hack up a P.S. prototype and develop a pitch
- the pitch should explain that i haven’t published yet because i didn’t want to get scooped on the real prize
- [X] remove ~no_std~ impl
- [X] graphviz for every intermediate graph in grammar processing! [3/3]
  - [X] S.P. [2/2]
    - [X] implement graphviz for basic S.P. in ~text_backend~
    - [X] expose basic S.P. and rest of ~text_backend~ publicly!
  - [X] ~TokenGrammar~
  - [X] ~PreprocessedGrammar~
- [X] text-based grammar input format (for generating test cases)! [4/4]
  - [X] make basic grammar format
  - [X] expose it publicly!
  - [X] make it support literal ~>~ chars!
    - a single literal ~>~ is escaped by simply doubling it: ~>>~ becomes ~>~! (see the sketch just after this list)
      - this made more sense than needing to resort to a separate escape character such as backslash!
  - [X] make it go through ~SerializableGrammar~ [4/4]
    - [X] implement ~SP::parse()~
    - [X] implement ~SP::serialize()~ [1/1]
      - [X] make it accept any production name with doubled ~$$~ if needed!
    - [X] put ~SerializableGrammar~ in ~grammar_specification::constraints~!
    - [X] do ~proptest~ tests!!!
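A minimal sketch of the doubling rule above, just to pin it down; this is illustrative only (a made-up helper), not the real ~SPTextFormat~ parser:

#+begin_src rust
/// Illustrative only: undo the doubling used to escape literal '>' characters
/// in the text grammar format (">>" becomes ">").
fn unescape_gt(raw: &str) -> String {
    raw.replace(">>", ">")
}

fn main() {
    assert_eq!(unescape_gt("a>>b"), "a>b");
    assert_eq!(unescape_gt(">>>>"), ">>");
}
#+end_src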
- [ ] make parsing understand stack cycles [0/3]
  - [ ] represent an entire stack cycle graph as a case of ~NamedOrAnonStep~
  - [ ] make it possible to traverse/resume all paths through the stack cycle graph in e.g. ~ParseableGrammar::stack_diff_pair_zipper~
  - [ ] graphviz for ~ParseableGrammar~!
- [ ] introduce a mode to generate a string from a given ~PreprocessedGrammar~
- [-] enable regex operators! [2/8]
  - [X] introduce ~Group~ as a ~CaseElement~!
  - [X] enable grouping with arbitrary depth in ~SPTextFormat~
    - e.g. ~(<ab> -> $A$)~ to group ~CaseHead~ lists
    - THIS WAS CRAZY HARD AND EXTREMELY USEFUL!!!
  - [ ] VERIFY THAT ALL OF THESE TRANSFORMATIONS PARSE CORRECTLY ONCE PARSING WORKS!
    - i.e.: make parsing with stack cycles work first!!!
  - [-] suffix [1/4]
    - [X] ~?~
    - [ ] ~*~
    - [ ] ~+~
    - [ ] ~{n,m}~
  - [ ] alternator ~|~
  - [ ] literal
    - e.g. ~/a(b)+/~, with ~//~ as a new type of ~CaseHead~
  - [ ] generate these operators for ~serde~ tests!
  - [ ] do ~proptest~ tests for equality of ~PreprocessedGrammar~ instances (not just serde)!
    - e.g. ~<a>?~ is equivalent to ~(<a>)?~ (one plausible lowering of ~?~ is sketched just after this list)
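To make that last equivalence concrete, here is a hedged sketch of one way a ~?~ suffix could be lowered into plain cases before preprocessing, so that ~<a>?~ and ~(<a>)?~ end up as the same two cases. ~Element~ and ~lower_optionals~ are invented names for illustration, not the real crate types:

#+begin_src rust
/// Invented element type: a case is a sequence of these after text parsing.
#[derive(Debug)]
enum Element {
    Required(String),
    Optional(String), // e.g. from `<a>?` or `(<a>)?`
}

/// Expand every optional element into "kept" and "dropped" variants, so one
/// case containing n optionals becomes 2^n plain cases.
fn lower_optionals(case: &[Element]) -> Vec<Vec<String>> {
    let mut expanded: Vec<Vec<String>> = vec![vec![]];
    for elem in case {
        match elem {
            Element::Required(s) => {
                for c in &mut expanded {
                    c.push(s.clone());
                }
            }
            Element::Optional(s) => {
                // Duplicate every case built so far: once without the element
                // (already there), once with it appended.
                let mut with_it = expanded.clone();
                for c in &mut with_it {
                    c.push(s.clone());
                }
                expanded.extend(with_it);
            }
        }
    }
    expanded
}

fn main() {
    // Both `<a>?` and `(<a>)?` should lower to the same two cases: [] and ["<a>"].
    let cases = lower_optionals(&[Element::Optional("<a>".to_string())]);
    assert_eq!(cases, vec![Vec::<String>::new(), vec!["<a>".to_string()]]);
    println!("{:?}", cases);
}
#+end_src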
- [ ] implement typed callbacks [0/5]
  - [ ] execute a callback whenever a sub-parse succeeds
  - [ ] signal parse failures
  - [ ] implement typed callbacks (for e.g. AST construction) by stuffing intermediate arguments in the stack symbols! (rough shape sketched just after this list)
  - [ ] replace ~reconstruction.rs~ with this!
  - [ ] do a graphviz for this!
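A very rough, hedged sketch of what a typed case callback could look like, assuming nothing about the eventual design: ~Case~, ~Expr~, and the callback signatures below are invented for illustration, and in the real thing the intermediate values would be carried inside the stack symbols rather than handed around directly.

#+begin_src rust
/// Toy AST for a tiny arithmetic fragment.
#[derive(Debug)]
enum Expr {
    Int(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// A typed callback for one case of a production whose output type is `Out`:
/// it receives the already-typed results of the case's sub-parses.
struct Case<Args, Out> {
    build: fn(Args) -> Out,
}

fn main() {
    // Case for something like E -> E "+" E: combine two sub-expressions.
    let add_case: Case<(Expr, Expr), Expr> = Case {
        build: |(lhs, rhs)| Expr::Add(Box::new(lhs), Box::new(rhs)),
    };
    // Case for something like E -> IntegerLiteral.
    let int_case: Case<i64, Expr> = Case { build: Expr::Int };

    // Driving the callbacks by hand, standing in for successful sub-parses.
    let sum = (add_case.build)(((int_case.build)(2), (int_case.build)(3)));
    println!("{:?}", sum);
}
#+end_src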
- [ ] implement context-sensitive and recursively-enumerable grammatical constructs!
  - also consider backrefs (can we make these sub-exponential?)!
- [ ] make a single iteration of parsing/resolution into a re-entrant(/lock-free?) process! (rough sketch after this list)
  - we’ll probably want to explicitly separate:
    - selecting an adjacent pair to resolve: pop off an adjacent pair of sub-parses from the sub-parse forest.
    - resolving the adjacent pair: push the merged sub-parse back onto the sub-parse forest.
  - having a lock/atomic flag per sub-parse might be the easiest way to do this?
  - atomic flag is more feasible than usual for this use case, because there are so many other sub-parses to choose from
  - could make it: per sub-parse /greater than ~k~ tokens wide/?
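A hedged sketch of the claim/resolve/push-back split described above; ~SubParse~, ~try_claim~, and ~resolve_adjacent~ are invented names, the merge is stubbed out, and nothing here is the real sub-parse forest. The point is just that a per-sub-parse atomic flag lets many workers grab adjacent pairs without a global lock, and backing off on contention is cheap because there are plenty of other pairs to pick:

#+begin_src rust
use std::sync::atomic::{AtomicBool, Ordering};

/// One completed sub-parse over a token span (placeholder representation).
struct SubParse {
    claimed: AtomicBool,
    start: usize,
    end: usize,
}

impl SubParse {
    fn new(start: usize, end: usize) -> Self {
        SubParse { claimed: AtomicBool::new(false), start, end }
    }

    /// Try to take exclusive ownership of this sub-parse; false means some
    /// other worker got there first.
    fn try_claim(&self) -> bool {
        self.claimed
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    fn release(&self) {
        self.claimed.store(false, Ordering::Release);
    }
}

/// One resolution step: claim an adjacent pair, merge it, and return the
/// merged sub-parse to be pushed back onto the forest. The merge here is a
/// stub; the real step would resolve the stack diffs of the two sides.
fn resolve_adjacent(left: &SubParse, right: &SubParse) -> Option<SubParse> {
    debug_assert_eq!(left.end, right.start);
    if !left.try_claim() {
        return None;
    }
    if !right.try_claim() {
        // Back off instead of spinning: plenty of other pairs to choose from.
        left.release();
        return None;
    }
    Some(SubParse::new(left.start, right.end))
}

fn main() {
    let (a, b) = (SubParse::new(0, 3), SubParse::new(3, 7));
    let merged = resolve_adjacent(&a, &b).expect("both halves were unclaimed");
    assert_eq!((merged.start, merged.end), (0, 7));
}
#+end_src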
- with some basic example(s), without introducing new notation
- <<eureka>> realization that this nice interface that i’d been looking for happened to be highly amenable to parallelizable/incremental parsing
- and that i was starting off looking for something totally different!
- without going into too much depth
- introduce notation for describing the grammar, ellipses, etc
- delve a little into eureka
- the input is a ~SimultaneousProductions~ instance
  - make sure to make it clear how this generalizes to arbitrary tokens, not just text
- and try to go into why
- to be figured out in code
- this is where you can show people how everyone has always been wrong. this should be the first section. no games.
- don’t even need to mention this except in the abstract maybe? it can be a fun surprise and make the reader go “huh, i guess that’s where the runtime comes from”
- make it clear how this doesn’t become a nondeterministic turing machine
  - maybe this has something to do with the fact that it only processes straight line input? this might be wrong
- but don’t even go into this too much, just enough to explain how we can have better performance with a better interface
- make sure to explain what has been wrong about parsing and not get caught up in why
- talk a little about how rust is a truly fantastic language to implement algorithms in
- moves by default and lifetimes are amazing for correctness
- benchmarks
- what use cases does it do better or worse on?
- what’s holding it back?
- PARALLELISM
- this needs some intense thought, because this is how we can demonstrate massive speedups over other methods
- simd or other stuff
- enough to show i’ve thought about how to implement it on a microprocessor level as well
- gives people who know what they’re talking about enough of a ladder to almost immediately do that
- running it in reverse to guess grammars [0/3]: YES, BEFORE PUBLISHING! (BUT AFTER THE FORWARD ALGORITHM)
- this may all be invalidated by tweet translation
- this is a good idea because we have proven the model can be reduced from SAT
- and therefore capable of arbitrary computation, or that’s the idea
- so if you figure out how to tweak the knobs you can maybe assume it’ll be a <<perfectly general inference method>>
- (the idea of this is completely bonkers to me)
- IF YOU DON’T PUBLISH THIS ALONG WITH THE ORIGINAL PAPER, SOMEONE ELSE WHO IS MORE FAMOUS WILL, SO YES, IT NEEDS TO BE IN HERE, AND IT NEEDS TO BE DEVELOPED
- this is a sad but unavoidable reality
- if you do this right though, then you really have your choice of <<phd>> locked in
- so in that case, no need to rush
- this should be a separate paper
- but it would need to be posted at the exact same time thanks to lack of trust
- should cite the first paper
- [ ] find a good example of a nondeterministic sequential input which isn’t necessarily hierarchical
- <<DNA/RNA>>
- there may be many strong examples of this throughout bio which are not related to genes
- alternative: guessing chaotic models based off of readings taken at regular intervals
- e.g. heartbeat, see “Does God Play Dice?” CITE THAT BOOK!!!
- <<natural language>>
- tweet translation!!!!!!
- [ ] determine a good statistical model to tweak
- honestly, i would be very surprised if the answer wasn’t “hook up a monte carlo tree search and call it a day”
- [ ] get a good result
- this is maybe going to be easier with natural language than with DNA/RNA due to data availability, however:
- i care about bio
- the natural language field is oversaturated and it’ll be hard to get a unique result
- i don’t think anyone is doing anything like this in bioinformatics (and i think they should be)
- and i want that phd
- patience is key, i have forever
- we definitely want a good result, but we don’t need to go as hard as on the initial algorithm
- i would love to take on a collaborator, but i don’t <<trust>> anyone enough
- so we want something here that:
- is pretty significant
- demonstrates clear advancement of the state of the art
- could be considered a founding paper of a field
- shows i know what i’m talking about
- shows the idea was mine
- this work is likely to spark ideas about the original algorithm!
- a hell of a shower thought <2019-01-21 Mon 13:23:24> (MLK day)
- allows me to stay at twitter (forever?)
- twitter likes using patents defensively (has taken a pledge to do so? FIND THE WORDING OF THIS PLEDGE)
- if the rsc pitch works, then i can even remain on the build team, which would be incredible
- gives me ML hardware, expertise, and guidance
- provides a FANTASTIC, maybe the BEST example of why “S.P in reverse” (“P.S.”?) is a great idea
- tweets are <<small bits of language>>, UNLIKE what other machine translation services train on (presumably)
- S.P. allows for cross-serial dependencies and is a perfectly general inference method (?)
- S.P. works in parallel by default as opposed to running sequentially across a long string of text
- allows twitter to do its own translation
- can’t tell if this is immediately a win for cost/maintainability/flexibility reasons
- it probably is, though, just because we don’t have to ship our text to an external service
- and if the external service only knows about the individual tweet it’s asked to translate?
- then the fact that tweets are small bits of language that twitter alone can train on at scale might mean we can achieve domain-specific accuracy that would be impossible for an external service to achieve
- [ ] <<S.P. real benchmarking>>
- start off with S.P. and showing there is some nontrivial speedup against at least lex/bison
- <<rsc>>: this continues the investment in tooling performance as per rsc and expands the already-unprecedented mindshare we have for making compilers fast and easy to use
- (i think this is a very good pitch line)
- start off with either of the following, to demonstrate some nontrivial speedup in specific scenarios:
- implementing rsc pre-parsing to decouple file ingestion from compilation
- implementing rsc pre-parsing along with S.P. at the same time
- it may not be necessary to do it along with pre-parsing for rsc, but pre-parsing may be a good way for me to become familiar enough with the performance characteristics and benchmarking so that i can know whether to make the S.P. proposal
- [ ] <<P.S. prototype>>
- demonstrate some prototype of P.S. (reverse) working
- this might be hard without asking for help
- people are going to assume i think this is a good idea because it’s my pet project
- that can be fine, if we make part of the pitch “give me time to develop this P.S. concept” along with S.P
- find clear success criteria to propose
- iterate on the application
- might be possible to get someone else excited about trying this or showing it doesn’t work
- “showing it doesn’t work” would be an acceptable end goal for me, because i can then know for a fact it is ok to publish S.P. by itself, and be sure that i’m not missing out on <<the real prize>>
- “the real prize” part can be a good pitch line
- it explains why i myself really want to investigate it, and why i really wanted to work with twitter for this
- (along with the relationship of S.P. to rsc work)
- in the context of “i am a compiler person who wants to write compilers” (easy to show), this is believable
- it also might excite someone else
- “P.S.” also sounds like “post script”, and if i put that in the proposal, people will think it is funny and also maybe see more how it is the secondary goal
- in pitch, can ask for “second half of the year” to work on P.S. (or something)
A Scala parser combinator library efficiently implementing “simultaneous productions”, a model equivalent to a Turing Machine (I think). The method of simultaneous productions allows specifying languages extremely naturally, and maps perfectly to the parser combinator operations I have in mind. It can also be implemented with a linear (?) partitioning algorithm.
#+begin_src
let expr = sp![
  E = ( e: E ) => e;
  E = (base:E "^" exp:E) => Pow(base, exp);
  /* \.E = { \.base[.E] "^" \.exp[.E] } => $Pow(.base, .exp); */
  /* \.E = { \.[.E] "^" \.[.E] } ~=> $Pow; */
  E = (E "^" E) => Pow(_.1, _.2);
  E = (E "^" E) => Pow;
  E = E "*" E;
  E = E "/" E;
  E = E "+" E;
  E = E "-" E;
  E = IntegerLiteral => IntLit(_);
  E = FloatingPointLiteral => FPLit(_);
];
#+end_src
#+begin_src scala
val FloatingPointLiteral = sp.productions(
  ("float-signed" -> Cases(Parser(Tok("-") * Ref("float-unsigned"), { p => -p._2 }),
                           Parser(Tok("+") * Ref("float-unsigned"), { _._2 }))),
  // NB: should make sure sp.NumberLiterals returns 0 for an empty string
  ("float-base" -> SingleCase(sp.NumberLiterals)),
  ("float-mantissa" -> SingleCase(sp.NumberLiterals)),
  ("float-unsigned" -> Cases(Parser(Ref("float-base"), { p => toFloat(sp.parseIntegral(p._1)) }),
                             Parser(Ref("float-base") * Tok(".") * Ref("float-mantissa"), {
                               // glossing over the details of converting e.g. ".123" to 1/10 + 2/100 + 3/1000
                               case (base, _, mantissa) => toFloat(sp.parseIntegral(base)) + sp.parseFloat(mantissa)
                             }))),
  ("exponent" -> Cases(Parser(Tok("e") * Ref("exponent-negated")),
                       Parser(Tok("e") * Ref("exponent-unsigned")))),
  ("exponent-negated" -> SingleCase(Parser(Tok("-") * Ref("exponent-unsigned")))),
  // sp.NumberLiterals is a whole Parser, and should probably return a string
  ("exponent-unsigned" -> SingleCase(sp.NumberLiterals)),
)

val WithWeirdIntegerLiterals = Grammars.C.productions.entry[IntegerLiteral] // Use a type-indexed map!
  .replaceCases
  .addCase(('A', SomeSubProductionType, 'C') ~> { (a: Token, inner: SomeSubProductionType, c: Token) =>
    IntegerLiteral(s"${a}${inner.toString}${c}")
  }.build() // This could be hidden behind an implicit.
  )
#+end_src
- [ ] specify a simple language so that it compiles
- use fixed strings instead of regex for now
- use strings instead of type-indexing the productions for now
- [ ] implement the simple language so that it can be parsed
- [ ] figure out how to allow productions to be type-indexed and require type-checking for that type in all the cases of the production
- [ ] make a simple language that is usable for some simple task
- csv parsing? or at least a simple subset of it
- [ ] develop benchmarking and (fuzz)? testing methods
- [ ] parse C and C++