This bug results in the behaviour described by #7035.
For v3 I refactored the parser transition system and some of the StateClass internals. In doing this, I introduced a bug that could result in states incorrectly being marked as finished. Parses in such states would be left disconnected, which showed up as spurious sentence boundaries.
The details of this are tricky and technical, but for the record:
The parser's transition system is based on the arc-eager transitions, but it allows "non-monotonic" actions that can correct for previous mistakes. One such action: if the model predicts Reduce when the top stack item does not have a head, that item is placed back in the buffer instead of being discarded (an "unshift").
In order to improve the efficiency of the beam parsing, I changed the StateC data structure so that these rebuffered items are accumulated in a vector, but otherwise the position in the sentence is marked by an integer. This lets us copy states more efficiently than the previous structure that maintained a queue of all words in the buffer.
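As a rough illustration of that design (a Python sketch with hypothetical field names, not spaCy's actual Cython `StateC`): the buffer is just an integer offset into the sentence plus a small vector of rebuffered words, so copying a state is cheap.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    # Hypothetical sketch of the StateC idea: the buffer is an offset
    # into the sentence plus a small vector of rebuffered words,
    # rather than a full queue of every remaining word.
    length: int                      # number of words in the sentence
    buffer_offset: int = 0           # index of the next unread word
    rebuffered: List[int] = field(default_factory=list)
    stack: List[int] = field(default_factory=list)

    def copy(self) -> "State":
        # Copying is O(stack + rebuffered), not O(words remaining):
        # the main buffer is represented by a single integer.
        return State(self.length, self.buffer_offset,
                     list(self.rebuffered), list(self.stack))

s = State(length=1000, buffer_offset=3, rebuffered=[2], stack=[0, 1])
s2 = s.copy()
assert s2 == s and s2 is not s
```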
The bug I introduced was failing to check the state._rebuffered queue when testing whether we were at the end of a sentence. Therefore, if we reached the end of the sentence and there were words on the rebuffered queue, the state would say "Okay, I have no words in the buffer and no words on the stack, we're finished!". But we weren't finished --- we hadn't processed the rebuffered words.
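Sketched in the same hypothetical terms (these are illustrative names, not spaCy's real functions), the buggy check only compared the buffer offset against the sentence length, while the fix also consults the rebuffered queue:

```python
def eol_buggy(buffer_offset: int, length: int, rebuffered: list) -> bool:
    # Buggy check: ignores words that were unshifted back onto the buffer.
    return buffer_offset >= length

def eol_fixed(buffer_offset: int, length: int, rebuffered: list) -> bool:
    # Fixed check: the buffer is only empty once the rebuffered
    # queue has been drained as well.
    return buffer_offset >= length and not rebuffered

# End of a 5-word sentence with one word unshifted back onto the buffer:
assert eol_buggy(5, 5, [3])          # wrongly reports the parse finished
assert not eol_fixed(5, 5, [3])      # the unshifted word still needs parsing
```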
Getting to the end of the sentence and rebuffering words is part of the intended design of the transition system, especially for short sentences. Consider this example of the bug, using the transitions predicted by the en_core_web_sm model. We'll write the state like this, Stack | Buffer, and then the transition on the next line, followed by the resulting state.
| Severe pain , after trauma
Shift
Severe | pain , after trauma
L-amod (pain, Severe)
| pain , after trauma
Shift
pain | , after trauma
Shift (This is an error, R-pobj is correct here)
pain , | after trauma
Shift (This is also an error: we should Reduce (unshift) the "," so we can attach 'after' to 'pain')
pain , after | trauma
R-pobj (after, trauma)
pain , after trauma |
Reduce
pain , after |
At this state we have "pain", "," and "after" on the stack, and none of them have heads (they were all moved onto the stack with Shift, rather than Right-Arc). We're in this state due to previous mistakes, but we want to teach the parser to recover from it. The non-monotonic "unshift" action lets us do this. The parser model correctly predicts the next action:
pain , after |
Reduce (unshift)
pain , | after
This is where the bug occurs. At this state, the parse would be regarded as complete, and we'd have three sentences headed by the words that weren't attached yet. After the bug fix, the parser continues correctly:
pain , after |
Reduce (unshift)
pain , | after
Reduce (unshift)
pain | , after
R-punct (pain, ,)
pain , | after
Reduce
pain | after
R-prep (pain, after)
pain after |
Reduce
pain |
Reduce
|
* Add test for #7035
* Update test for issue 7056
* Fix test
* Fix transitions method used in testing
* Fix state eol detection when rebuffer
* Clean up redundant fix