This bug results in the behaviour described by #7035.
For v3 I refactored the parser transition system and some of the StateClass internals. In doing this, I introduced a bug that could result in states incorrectly being marked as finished. Parses in such states would be left disconnected, which showed up as spurious sentence boundaries.
The details of this are tricky and technical, but for the record:
The parser's transition system is based on the arc-eager transitions, but it allows "non-monotonic" actions that can correct for previous mistakes. One such action: if the model predicts Reduce when the top stack item does not have a head, that item is placed back in the buffer instead of being discarded (an "unshift").
In order to improve the efficiency of the beam parsing, I changed the StateC data structure so that these rebuffered items are accumulated in a vector, but otherwise the position in the sentence is marked by an integer. This lets us copy states more efficiently than the previous structure that maintained a queue of all words in the buffer.
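As a rough illustration of that design (a Python sketch with hypothetical field names, not spaCy's actual Cython `StateC`): the buffer is just an integer offset into the sentence plus a small vector of rebuffered words, so copying a state is cheap.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    # Hypothetical sketch of the StateC idea: the buffer is an offset
    # into the sentence plus a small vector of rebuffered words,
    # rather than a full queue of every remaining word.
    length: int                      # number of words in the sentence
    buffer_offset: int = 0           # index of the next unread word
    rebuffered: List[int] = field(default_factory=list)
    stack: List[int] = field(default_factory=list)

    def copy(self) -> "State":
        # Copying is O(stack + rebuffered), not O(words remaining):
        # the main buffer is represented by a single integer.
        return State(self.length, self.buffer_offset,
                     list(self.rebuffered), list(self.stack))

s = State(length=1000, buffer_offset=3, rebuffered=[2], stack=[0, 1])
s2 = s.copy()
assert s2 == s and s2 is not s
```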
The bug I introduced was failing to check the state._rebuffered queue when testing whether we were at the end of a sentence. Therefore, if we reached the end of the sentence and there were words on the rebuffered queue, the state would say "Okay, I have no words in the buffer and no words on the stack, we're finished!". But we weren't finished --- we hadn't processed the rebuffered words.
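Sketched in the same hypothetical terms (these are illustrative names, not spaCy's real functions), the buggy check only compared the buffer offset against the sentence length, while the fix also consults the rebuffered queue:

```python
def eol_buggy(buffer_offset: int, length: int, rebuffered: list) -> bool:
    # Buggy check: ignores words that were unshifted back onto the buffer.
    return buffer_offset >= length

def eol_fixed(buffer_offset: int, length: int, rebuffered: list) -> bool:
    # Fixed check: the buffer is only empty once the rebuffered
    # queue has been drained as well.
    return buffer_offset >= length and not rebuffered

# End of a 5-word sentence with one word unshifted back onto the buffer:
assert eol_buggy(5, 5, [3])          # wrongly reports the parse finished
assert not eol_fixed(5, 5, [3])      # the unshifted word still needs parsing
```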
Getting to the end of the sentence and rebuffering words is part of the intended design of the transition system, especially for short sentences. Consider this example of the bug, using the transitions predicted by the en_core_web_sm model. We'll write the state like this, Stack | Buffer, and then the transition on the next line, followed by the resulting state.
| Severe pain , after trauma
Shift
Severe | pain , after trauma
L-amod (pain, Severe)
| pain , after trauma
Shift
pain | , after trauma
Shift (This is an error, R-pobj is correct here)
pain , | after trauma
Shift (This is also an error: we should Reduce (unshift) the "," so we can attach 'after' to 'pain')
pain , after | trauma
R-pobj (after, trauma)
pain , after trauma |
Reduce
pain , after |
At this state we have "pain", "," and "after" on the stack, and none of them have heads (they were all moved onto the stack with Shift, rather than Right-Arc). We're in this state due to previous mistakes, but we want to teach the parser to recover from it. The non-monotonic "unshift" action lets us do this. The parser model correctly predicts the next action:
pain , after |
Reduce (unshift)
pain , | after
This is where the bug occurs. At this state, the parse would be regarded as complete, and we'd have three sentences headed by the words that weren't attached yet. After the bug fix, the parser continues correctly:
pain , after |
Reduce (unshift)
pain , | after
Reduce (unshift)
pain | , after
R-punct (pain, ,)
pain , | after
Reduce
pain | after
R-prep (pain, after)
pain after |
Reduce
pain |
Reduce
|
* Add test for #7035
* Update test for issue 7056
* Fix test
* Fix transitions method used in testing
* Fix state eol detection when rebuffer
* Clean up redundant fix