
GSoC Progress #1

Closed
31 tasks done
mr-martian opened this issue Jun 14, 2019 · 20 comments

@mr-martian
Collaborator

mr-martian commented Jun 14, 2019

General TODO list:

  • Formalism and compiler
    • Multiple output chunks
    • Interpolation (for clitics, etc.)
      • This is currently possible but potentially messier than we want
    • Capitalization (probably using pseudo attribute "lemcase")
    • Clean up distinction between input and output rules
    • Tag replacement rules?
    • Implement non-overwriting (currently being parsed and ignored)
    • Conjoined LUs
  • Interpreter
    • Profile
    • Speed up, if possible (currently takes about twice as long as chunker/interchunk/postchunk)
  • XML compiler
    • Understand what postchunk does
    • Deal with naming conflicts
    • Detect literal chunks
    • Test
  • eng->spa rules
    • Verbs are a bit of a mess and seem to get overwritten a lot (this should maybe be a compiler thing)
  • Tests
  • Documentation
    • More examples
    • More comments in code
      • Comments in header files
      • Explanations of algorithms in code
    • And just clean up code in general
    • Ensure documentation aligns with current behavior
@mr-martian
Collaborator Author

Now there are tests, and they have revealed some bugs, particularly with default attributes.

@mr-martian
Collaborator Author

All 8 tests now pass, and I should probably write some more.

I'm wondering if it might be a good idea to add a way to specify that a rule can only apply in certain layers. A use case might be if you had a rule for adj NP and a fallback rule for adj. Without the restriction, the fallback rule always applies because the input is n, not NP.

Clipping and default attributes usually work, but they seem to occasionally fail, so tomorrow I'll mess with those a bit and also add a few other bits and pieces.

@mr-martian
Collaborator Author

I restructured the compiler and wrote some more tests.

The eng-spa rules seem to be occasionally failing at agreement and I'm not sure why (corresponding tests work ok). It's also not entirely clear that overriding output patterns is working properly. I'll have to write some tests for that as well.

@mr-martian
Collaborator Author

Profiling yesterday indicated that the most significant time-sink was repeatedly removing the first element of a vector. I replaced the vector in question with a list and the result was a roughly 3x speedup, making the parser faster than chunker and interchunk running partially in parallel.
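
For reference, a minimal sketch of the pattern behind that change (the names here are illustrative, not the parser's actual code). Erasing the front of a std::vector shifts every remaining element left, so draining a queue that way is quadratic overall, while std::list unlinks its head node in constant time:

#include <list>
#include <vector>

// Removing the first element of a vector is O(n) per call, because all
// remaining elements are shifted down; draining n items costs O(n^2).
void drainVector(std::vector<int>& queue) {
    while (!queue.empty()) {
        int item = queue.front();
        (void)item; // process item
        queue.erase(queue.begin());
    }
}

// A list unlinks its head node in O(1), so the same loop is O(n).
void drainList(std::list<int>& queue) {
    while (!queue.empty()) {
        int item = queue.front();
        (void)item; // process item
        queue.pop_front();
    }
}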

The largest remaining cost is the match function, which is called more than 2 million times on 150 thousand tokens due to the multipass nature of the parser.

Once the parser is converted to some version of LR, it looks like as much as a quarter of the program's runtime will be spent in Chunk::chunkPart. It may be possible to modify the interface with PCRE to remove unnecessary conversions between string, wstring, and C strings, but beyond that, further optimization seems unlikely to be worthwhile, so I've marked it as completed for now.
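
A hypothetical illustration of that conversion overhead (the real Chunk class and its PCRE interface differ): if the wide-character form is re-narrowed on every match, caching the converted bytes on the chunk amortizes the cost across the millions of match calls.

#include <string>

// Hypothetical sketch, not the project's actual Chunk class: convert
// the wide surface form once and reuse it for every match call,
// instead of re-converting on each one.
struct CachedChunk {
    std::wstring surface;
    mutable std::string narrow; // lazily-filled narrow copy
    mutable bool cached = false;

    const char* matchForm() const {
        if (!cached) {
            // Naive narrowing; only safe for ASCII, shown for brevity.
            narrow.assign(surface.begin(), surface.end());
            cached = true;
        }
        return narrow.c_str();
    }
};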

@mr-martian
Collaborator Author

For unknown reasons, the profiling mentioned in the previous comment gave 1.3s as the running time with ~60 rules on 75k tokens, but every subsequent run has given that time as about 3s, which is roughly level with the existing transfer system.

The first attempt to switch from multipass to LR on Thursday and Friday went badly due to an overuse of std::map and std::vector. A second attempt this morning yielded runtimes just under 3s.

I did some optimizing today and got the runtime down to about 2.2s. I think some further gains might be possible, but they would require messing with strings, so I'm not sure if it's worth it.

@mr-martian
Collaborator Author

The LR version is in the transducer-lr branch.

Under LR, the REJECTRULE operation presents some problems. Rejecting in favor of another rule of the same length is fine, but if the intent is that a shorter rule be used, it seems like that would require some sort of backtracking, so I'm not sure what to do about that.

@mr-martian
Collaborator Author

I've switched to GLR, which should fix the problems from the last comment.

Reduce-reduce conflicts are resolved as they were under multipass and LR (by length and weight). Shift-reduce conflicts involve splitting the stack.

Unfortunately, this makes the stack a directed graph, which seems to slow things down significantly. I'll post more details when the profiler finishes.
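
A hypothetical sketch of that conflict handling (the parser's real data structures differ): reduce-reduce conflicts select one rule by length and weight, while a shift-reduce conflict clones the branch so both analyses can continue, and shared tails between clones are what turn the stack into a graph.

#include <memory>
#include <vector>

struct Rule {
    int length;    // symbols on the right-hand side
    double weight;
};

struct StackBranch {
    std::vector<int> states;              // states on this branch
    std::shared_ptr<StackBranch> parent;  // shared tails make the stack a graph
};

// Reduce-reduce: prefer the longer rule, breaking ties by weight.
const Rule* pickReduction(const std::vector<const Rule*>& candidates) {
    const Rule* best = nullptr;
    for (const Rule* r : candidates) {
        if (!best || r->length > best->length ||
            (r->length == best->length && r->weight > best->weight)) {
            best = r;
        }
    }
    return best;
}

// Shift-reduce: keep the original branch for the shift and split off a
// copy that performs the reduction instead.
std::shared_ptr<StackBranch> splitForReduce(
        const std::shared_ptr<StackBranch>& branch,
        const Rule& rule, int gotoState) {
    auto reduced = std::make_shared<StackBranch>(*branch);
    for (int i = 0; i < rule.length && !reduced->states.empty(); ++i) {
        reduced->states.pop_back();       // pop the right-hand side
    }
    reduced->states.push_back(gotoState); // push the nonterminal's state
    return reduced;
}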

@mr-martian
Collaborator Author

As it turns out, the substantial increase in time was due to a value being set wrong and occasionally causing an infinite loop. I'm still not completely certain what was causing it, but it doesn't seem to be happening anymore and the overall runtime is now better than multipass.

However, multipass and GLR sometimes apply rules in different orders, which means that the output is different, so some of the test rules may need to be rewritten. It's also possible that this difference in behavior is actually another bug. Either way, writing more tests will hopefully make things clearer.

@mr-martian
Collaborator Author

Most obsolete code has now been deleted or moved to dev/. Most non-obsolete functions and classes now actually have descriptive names.

As it turns out, none of the existing tests had shift-reduce conflicts. A grammar without conflicts will run pretty quickly, but one with conflicts will eat up a lot of memory and run rather slowly due to all the stack splitting.

Given that each rule is executed separately on each branch, I'm now pretty sure that having some rules run at output rather than at parse would substantially speed things up.

Clipping tags currently accounts for close to half the runtime. It may be possible to make the compiler generate fewer clip instructions or to have the tokenizer break the input stream into tags.
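
A sketch of the second idea, with hypothetical names (the real tokenizer differs): if a lexical unit like dog<n><pl> is split into lemma and tags once at tokenization, later clips become index lookups instead of regex matches over the whole string.

#include <cstddef>
#include <string>
#include <vector>

struct TokenizedLU {
    std::string lemma;
    std::vector<std::string> tags;
};

// Split "dog<n><pl>" into lemma "dog" and tags {"n", "pl"} once, up
// front, so that clipping a tag later is a cheap vector lookup.
TokenizedLU tokenizeLU(const std::string& lu) {
    TokenizedLU out;
    std::size_t open = lu.find('<');
    out.lemma = lu.substr(0, open);
    while (open != std::string::npos) {
        std::size_t close = lu.find('>', open);
        if (close == std::string::npos) break; // malformed input; keep what we have
        out.tags.push_back(lu.substr(open + 1, close - open - 1));
        open = lu.find('<', close);
    }
    return out;
}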

@mr-martian
Collaborator Author

Adding output rules sped things up and saved enough copies that I can now actually run the test data I've been using. Unfortunately it takes 20 seconds and the memory usage goes as high as 5GB.

I'll see if the profiler turns up anything. If not, I think I can get the compiler to generate less bytecode on the input side.

@mr-martian
Collaborator Author

I added a pool allocator for chunks and stack branches that resets on output, which seems to have sped things up a bit (now 15s), and it also substantially decreased the memory usage, which looks like it's now down to about 1GB.

The allocator's statistics also indicate that, as things are currently set up, there's at least one sentence in my test data that allocates more than 500,000 chunks while parsing.
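
A minimal sketch of the idea (not the project's actual allocator): chunks are bump-allocated out of one large buffer, and everything is reclaimed at once by resetting the offset after each output.

#include <cstddef>
#include <cstdlib>
#include <new>

class ChunkPool {
    char* buffer;
    std::size_t capacity;
    std::size_t offset = 0;

public:
    explicit ChunkPool(std::size_t bytes)
        : buffer(static_cast<char*>(std::malloc(bytes))), capacity(bytes) {}
    ~ChunkPool() { std::free(buffer); }

    void* allocate(std::size_t bytes) {
        // Round up so every returned pointer stays suitably aligned.
        const std::size_t a = alignof(std::max_align_t);
        bytes = (bytes + a - 1) / a * a;
        if (offset + bytes > capacity) throw std::bad_alloc();
        void* p = buffer + offset;
        offset += bytes;
        return p;
    }

    // After each output, every chunk allocated since the last reset is
    // reclaimed at once, with no per-object bookkeeping.
    void reset() { offset = 0; }
};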

@mr-martian
Collaborator Author

I just noticed something that is either a bug or a really annoying caveat. A set of rules like

NP -> n adj {2 1} |
      adj NP {1 2}

With input n adj cm, this will produce n adj cm rather than adj n cm: the parser gets to the adjective and sees that a shift might be possible for the unreduced version because of the second rule, but there is no possible continuation for the reduced version, so that branch is discarded.

@mr-martian
Collaborator Author

I started on an XML compiler today. In the absence of substantial setbacks I might have it working tomorrow.

@mr-martian
Collaborator Author

I finished most of the TODOs in the XML compiler, but didn't get as far as actually testing it.

Things still unfinished:

  • <transfer default="chunk"> vs <transfer default="lu">
  • update the output rule if we change the lemma of a chunk
  • The DTD seems to imply that macros can take variable numbers of arguments
  • The DTD says that <append> can modify clips
  • What does the queue attribute on clips do?
  • <mlu> in the chunker has a check against producing ++ and +#
  • I implemented <chunk> with both name and namefrom being empty as having the lemma "default"
  • <clip> inside <chunk> could use more error checking

@mr-martian
Collaborator Author

It compiles the eng-spa files without apparent issue, but for some reason the first transition in the transducer is \x82 rather than ^ so nothing matches and I haven't been able to properly test it.

@mr-martian
Collaborator Author

mr-martian commented Jul 5, 2019

Issues with the XML compiler and interpreter:

  • Capitalization is not automatic
  • If t1x generates a chunk with the same tags as the input LU, it will apply twice, giving weird results
  • It looks like the set of rules that's applying isn't quite the same, but I'm not sure why

However, it does appear to be somewhat faster than chunker/interchunk.

@jonorthwash
Member

> With input n adj cm, this will produce n adj cm rather than adj n cm: the parser gets to the adjective and sees that a shift might be possible for the unreduced version because of the second rule, but there is no possible continuation for the reduced version, so that branch is discarded.

This is basically identical to the HFST bug.

@mr-martian
Collaborator Author

> With input n adj cm, this will produce n adj cm rather than adj n cm: the parser gets to the adjective and sees that a shift might be possible for the unreduced version because of the second rule, but there is no possible continuation for the reduced version, so that branch is discarded.
>
> This is basically identical to the HFST bug.

One possible solution to that might be to not discard branches when reading a blank; then it would read the comma and both branches would die, but I don't know if there's a more complicated scenario where that would fail.

@jonorthwash
Member

@ftyers, thoughts on the above issue?

@mr-martian
Collaborator Author

The solution that I've found to the HFST-like bug is to write something like the following:

fallback: _;
nonexistent: _;
fallback -> NP nonexistent {1};

Since no rule generates nonexistent, this rule will never apply, but will make NP look like it's not a dead end, thereby keeping it alive long enough to be output.

Unfortunately, implementing this solution slows things down considerably (roughly a factor of 4, it looks like).
