
GSoC Progress #1

Closed
31 tasks done
mr-martian opened this issue Jun 14, 2019 · 20 comments

@mr-martian
Collaborator

mr-martian commented Jun 14, 2019

General TODO list:

  • Formalism and compiler
    • Multiple output chunks
    • Interpolation (for clitics, etc.)
      • This is currently possible but potentially messier than we want
    • Capitalization (probably using pseudo attribute "lemcase")
    • Clean up distinction between input and output rules
    • Tag replacement rules?
    • Implement non-overwriting (currently being parsed and ignored)
    • Conjoined LUs
  • Interpreter
    • Profile
    • Speed up, if possible (currently takes about twice as long as chunker/interchunk/postchunk)
  • XML compiler
    • Understand what postchunk does
    • Deal with naming conflicts
    • Detect literal chunks
    • Test
  • eng->spa rules
    • Verbs are a bit of a mess and seem to get overwritten a lot (this should maybe be a compiler thing)
  • Tests
  • Documentation
    • More examples
    • More comments in code
      • Comments in header files
      • Explanations of algorithms in code
    • And just clean up code in general
    • Ensure documentation aligns with current behavior
@mr-martian
Collaborator Author

Now there are tests, and they have revealed some bugs, particularly with default attributes.

@mr-martian
Collaborator Author

All 8 tests now pass, and I should probably write some more.

I'm wondering if it might be a good idea to add a way to specify that a rule can only apply in certain layers. A use case might be if you had a rule for adj NP and a fallback rule for adj. Without the restriction, the fallback rule always applies because the input is n, not NP.

Clipping and default attributes usually work, but they seem to occasionally fail, so tomorrow I'll mess with those a bit and also add a few other bits and pieces.

@mr-martian
Collaborator Author

I restructured the compiler and wrote some more tests.

The eng-spa rules seem to be occasionally failing at agreement and I'm not sure why (corresponding tests work ok). It's also not entirely clear that overriding output patterns is working properly. I'll have to write some tests for that as well.

@mr-martian
Collaborator Author

Profiling yesterday indicated that the most significant time-sink was repeatedly removing the first element of a vector. I replaced the vector in question with a list and the result was a roughly 3x speedup, making the parser faster than chunker and interchunk running partially in parallel.
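
For reference, a minimal sketch of the pattern behind that change (the names here are illustrative, not the parser's actual code). Erasing the front of a std::vector shifts every remaining element left, so draining a queue that way is quadratic overall, while std::list unlinks its head node in constant time:

#include <list>
#include <vector>

// Removing the first element of a vector is O(n) per call, because all
// remaining elements are shifted down; draining n items costs O(n^2).
void drainVector(std::vector<int>& queue) {
    while (!queue.empty()) {
        int item = queue.front();
        (void)item; // process item
        queue.erase(queue.begin());
    }
}

// A list unlinks its head node in O(1), so the same loop is O(n).
void drainList(std::list<int>& queue) {
    while (!queue.empty()) {
        int item = queue.front();
        (void)item; // process item
        queue.pop_front();
    }
}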

The largest remaining cost is the match function, which is called more than 2 million times on 150 thousand tokens due to the multipass nature of the parser.

Once the parser is converted to some version of LR, it looks like as much as a quarter of the program's runtime will be spent in Chunk::chunkPart. It may be possible to modify the interface with PCRE to remove unnecessary conversions between string, wstring, and C strings, but beyond that, further optimization seems unlikely to be worthwhile, so I've marked it as completed for now.
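
A hypothetical illustration of that conversion overhead (the real Chunk class and its PCRE interface differ): if the wide-character form is re-narrowed on every match, caching the converted bytes on the chunk amortizes the cost across the millions of match calls.

#include <string>

// Hypothetical sketch, not the project's actual Chunk class: convert
// the wide surface form once and reuse it for every match call,
// instead of re-converting on each one.
struct CachedChunk {
    std::wstring surface;
    mutable std::string narrow; // lazily-filled narrow copy
    mutable bool cached = false;

    const char* matchForm() const {
        if (!cached) {
            // Naive narrowing; only safe for ASCII, shown for brevity.
            narrow.assign(surface.begin(), surface.end());
            cached = true;
        }
        return narrow.c_str();
    }
};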

@mr-martian
Collaborator Author

For unknown reasons, the profiling mentioned in the previous comment gave 1.3s as the running time with ~60 rules on 75k tokens, but every subsequent run has given that time as about 3s, which is roughly level with the existing transfer system.

The first attempt to switch from multipass to LR on Thursday and Friday went badly due to an overuse of std::map and std::vector. A second attempt this morning yielded runtimes just under 3s.

I did some optimizing today and got the runtime down to about 2.2s. I think some further gains might be possible, but they would require messing with strings, so I'm not sure if it's worth it.

@mr-martian
Collaborator Author

The LR version is in the transducer-lr branch.

Under LR, the REJECTRULE operation presents some problems. Rejecting in favor of another rule of the same length is fine, but if the intent is that a shorter rule be used, it seems like that would require some sort of backtracking, so I'm not sure what to do about that.

@mr-martian
Collaborator Author

I've switched to GLR, which should fix the problems from the last comment.

Reduce-reduce conflicts are resolved as they were under multipass and LR (by length and weight). Shift-reduce conflicts involve splitting the stack.

Unfortunately, this makes the stack a directed graph, which seems to slow things down significantly. I'll post more details when the profiler finishes.
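
A hypothetical sketch of that conflict handling (the parser's real data structures differ): reduce-reduce conflicts select one rule by length and weight, while a shift-reduce conflict clones the branch so both analyses can continue, and shared tails between clones are what turn the stack into a graph.

#include <memory>
#include <vector>

struct Rule {
    int length;    // symbols on the right-hand side
    double weight;
};

struct StackBranch {
    std::vector<int> states;              // states on this branch
    std::shared_ptr<StackBranch> parent;  // shared tails make the stack a graph
};

// Reduce-reduce: prefer the longer rule, breaking ties by weight.
const Rule* pickReduction(const std::vector<const Rule*>& candidates) {
    const Rule* best = nullptr;
    for (const Rule* r : candidates) {
        if (!best || r->length > best->length ||
            (r->length == best->length && r->weight > best->weight)) {
            best = r;
        }
    }
    return best;
}

// Shift-reduce: keep the original branch for the shift and split off a
// copy that performs the reduction instead.
std::shared_ptr<StackBranch> splitForReduce(
        const std::shared_ptr<StackBranch>& branch,
        const Rule& rule, int gotoState) {
    auto reduced = std::make_shared<StackBranch>(*branch);
    for (int i = 0; i < rule.length && !reduced->states.empty(); ++i) {
        reduced->states.pop_back();       // pop the right-hand side
    }
    reduced->states.push_back(gotoState); // push the nonterminal's state
    return reduced;
}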

@mr-martian
Collaborator Author

As it turns out, the substantial increase in time was due to a value being set wrong and occasionally causing an infinite loop. I'm still not completely certain what was causing it, but it doesn't seem to be happening anymore and the overall runtime is now better than multipass.

However, multipass and GLR sometimes apply rules in different orders, which means that the output is different, so some of the test rules may need to be rewritten. It's also possible that this difference in behavior is actually another bug. Either way, writing more tests will hopefully make things clearer.

@mr-martian
Collaborator Author

Most obsolete code has now been deleted or moved to dev/. Most non-obsolete functions and classes now actually have descriptive names.

As it turns out, none of the existing tests had shift-reduce conflicts. A grammar without conflicts will run pretty quickly, but one with conflicts will eat up a lot of memory and run rather slowly due to all the stack splitting.

Given that each rule is executed separately on each branch, I'm now pretty sure that having some rules run at output rather than at parse would substantially speed things up.

Clipping tags currently accounts for close to half the runtime. It may be possible to make the compiler generate fewer clip instructions or to have the tokenizer break the input stream into tags.
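
A sketch of the second idea, with hypothetical names (the real tokenizer differs): if a lexical unit like dog<n><pl> is split into lemma and tags once at tokenization, later clips become index lookups instead of regex matches over the whole string.

#include <cstddef>
#include <string>
#include <vector>

struct TokenizedLU {
    std::string lemma;
    std::vector<std::string> tags;
};

// Split "dog<n><pl>" into lemma "dog" and tags {"n", "pl"} once, up
// front, so that clipping a tag later is a cheap vector lookup.
TokenizedLU tokenizeLU(const std::string& lu) {
    TokenizedLU out;
    std::size_t open = lu.find('<');
    out.lemma = lu.substr(0, open);
    while (open != std::string::npos) {
        std::size_t close = lu.find('>', open);
        if (close == std::string::npos) break; // malformed input; keep what we have
        out.tags.push_back(lu.substr(open + 1, close - open - 1));
        open = lu.find('<', close);
    }
    return out;
}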

@mr-martian
Collaborator Author

Adding output rules sped things up and saved enough copies that I can now actually run the test data I've been using. Unfortunately it takes 20 seconds and the memory usage goes as high as 5GB.

I'll see if the profiler turns up anything. If not, I think I can get the compiler to generate less bytecode on the input side.

@mr-martian
Collaborator Author

I added a pool allocator for chunks and stack branches that resets on output, which seems to have sped things up a bit (now 15s), and it also substantially decreased the memory usage, which looks like it's now down to about 1GB.

The allocator's statistics also indicate that, as things are currently set up, there's at least one sentence in my test data that allocates more than 500,000 chunks while parsing.
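
A minimal sketch of the idea (not the project's actual allocator): chunks are bump-allocated out of one large buffer, and everything is reclaimed at once by resetting the offset after each output.

#include <cstddef>
#include <cstdlib>
#include <new>

class ChunkPool {
    char* buffer;
    std::size_t capacity;
    std::size_t offset = 0;

public:
    explicit ChunkPool(std::size_t bytes)
        : buffer(static_cast<char*>(std::malloc(bytes))), capacity(bytes) {}
    ~ChunkPool() { std::free(buffer); }

    void* allocate(std::size_t bytes) {
        // Round up so every returned pointer stays suitably aligned.
        const std::size_t a = alignof(std::max_align_t);
        bytes = (bytes + a - 1) / a * a;
        if (offset + bytes > capacity) throw std::bad_alloc();
        void* p = buffer + offset;
        offset += bytes;
        return p;
    }

    // After each output, every chunk allocated since the last reset is
    // reclaimed at once, with no per-object bookkeeping.
    void reset() { offset = 0; }
};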

@mr-martian
Collaborator Author

I just noticed something that is either a bug or a really annoying caveat. A set of rules like

NP -> n adj {2 1} |
      adj NP {1 2}

With input n adj cm, this will produce n adj cm rather than adj n cm: the parser gets to the adjective and sees that a shift might be possible for the unreduced version because of the second rule, but there is no possible continuation for the reduced version, so that branch is discarded.

@mr-martian
Collaborator Author

I started on an XML compiler today. In the absence of substantial setbacks I might have it working tomorrow.

@mr-martian
Collaborator Author

I finished most of the TODOs in the XML compiler, but didn't get as far as actually testing it.

Things still unfinished:

  • <transfer default="chunk"> vs <transfer default="lu">
  • update the output rule if we change the lemma of a chunk
  • The DTD seems to imply that macros can take variable numbers of arguments
  • The DTD says that <append> can modify clips
  • What does the queue attribute on clips do?
  • <mlu> in the chunker has a check against producing ++ and +#
  • I implemented <chunk> with both name and namefrom being empty as having the lemma "default"
  • <clip> inside <chunk> could use more error checking

@mr-martian
Collaborator Author

It compiles the eng-spa files without apparent issue, but for some reason the first transition in the transducer is \x82 rather than ^ so nothing matches and I haven't been able to properly test it.

@mr-martian
Collaborator Author

mr-martian commented Jul 5, 2019

Issues with the XML compiler and interpreter:

  • Capitalization is not automatic
  • If t1x generates a chunk with the same tags as the input LU, it will apply twice, giving weird results
  • It looks like the set of rules that's applying isn't quite the same, but I'm not sure why

However, it does appear to be somewhat faster than chunker/interchunk.

@jonorthwash
Member

> With input n adj cm, this will produce n adj cm rather than adj n cm: the parser gets to the adjective and sees that a shift might be possible for the unreduced version because of the second rule, but there is no possible continuation for the reduced version, so that branch is discarded.

This is basically identical to the HFST bug.

@mr-martian
Collaborator Author

> With input n adj cm, this will produce n adj cm rather than adj n cm: the parser gets to the adjective and sees that a shift might be possible for the unreduced version because of the second rule, but there is no possible continuation for the reduced version, so that branch is discarded.
>
> This is basically identical to the HFST bug.

One possible solution to that might be to not discard branches when reading a blank; then it would read the comma and both branches would die, but I don't know if there's a more complicated scenario where that would fail.

@jonorthwash
Member

@ftyers, thoughts on the above issue?

@mr-martian
Collaborator Author

The solution that I've found to the HFST-like bug is to write something like the following:

fallback: _;
nonexistent: _;
fallback -> NP nonexistent {1};

Since no rule generates nonexistent, this rule will never apply, but will make NP look like it's not a dead end, thereby keeping it alive long enough to be output.

Unfortunately, implementing this solution slows things down considerably (roughly a factor of 4, it looks like).
