
Refactor Token and Sentence Positional Properties #3001

Merged: 5 commits into master on Nov 27, 2022

Conversation

dobbersc
Collaborator

This PR refactors the positional properties of Token and Sentence that are inherited from DataPoint, replacing the self.start_pos and self.end_pos attributes.

For the Sentence, the two ways of accessing positional information behaved differently, producing inconsistent results even though they should effectively be aliases. These are the inconsistencies:

  1. Initializing the Sentence with str, i.e. untokenized input.

    • We allow the user to set an offset start_position in the init, but the start_position property does not respect it and always returns zero, whereas start_pos does include the offset (see the sketch after this list).
      Suggestion: Only use and expose the property inherited from DataPoint. Having multiple attributes that do the same thing under different names gets confusing.
  2. Initializing the Sentence with List[str], i.e. pre-tokenized input.

    • Same concern as in (1).
    • In addition to (1), end_pos and end_position do not behave the same:
      from flair.data import Sentence
      s = Sentence(['This', 'is', 'an', 'example', '.'])
      print(s.end_position)  # Prints 17 -> Corresponding to the character-level end position
      print(s.end_pos)  # Prints 5 -> Corresponding to the token-level end position
      Suggestion: Always use the character-level end position since the token-level end position is accessible with len(s).
    • The token start and end positions were incorrect.
      from flair.data import Sentence
      sentence = Sentence(["This", "is", "a", "sentence", "."])
      print([(token.start_position, token.end_position) for token in sentence])  
      # Prints [(0, 4), (5, 7), (7, 8), (8, 16), (16, 17)] but expected [(0, 4), (5, 7), (8, 9), (10, 18), (19, 20)]
      Suggestion: Do not use two separate code paths to construct the tokens. Instead, reduce the case of initializing the Sentence with List[str] to the case of initializing with str.
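
As a minimal sketch of inconsistency (1): only the Sentence constructor and its start_position parameter come from the PR description itself; the offset value 10 is an arbitrary example, and the printed values reflect the pre-refactor behaviour described above.

    from flair.data import Sentence
    s = Sentence("This is an example.", start_position=10)
    print(s.start_position)  # Prints 0 -> the property ignores the offset given in the init
    print(s.start_pos)  # Prints 10 -> the attribute includes the offset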

Please see the commits; each one corresponds in isolation to one of the suggestions above.

@alanakbik
Collaborator

@dobbersc thanks for fixing this! I see you removed the try-catch block in the token offset calculation. I actually don't remember why we needed this, and we have no unit test for a problem case, so removing it is fine.

@alanakbik merged commit 734ea96 into master on Nov 27, 2022
@alanakbik deleted the refactor-position-properties branch on Nov 27, 2022, 07:47
@dobbersc
Collaborator Author

> @dobbersc thanks for fixing this! I see you removed the try-catch block in the token offset calculation. I actually don't remember why we needed this, and we have no unit test for a problem case, so removing it is fine.

From my debugging, I found that the try-catch was only needed for initialization with List[str]. Since current_offset was calculated from character lengths, the indices did not align with the words in the given list, and the error handling then produced the incorrect token start and end positions. Now that we join the words from the list into a single string, the try-catch is no longer needed.
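
For context, here is a minimal standalone sketch (plain Python, not the actual flair internals) of why the character offsets line up once the pre-tokenized words are joined with single spaces:

    def token_offsets(words):
        """Character-level (start, end) offsets when words are joined by single spaces."""
        offsets, cursor = [], 0
        for word in words:
            offsets.append((cursor, cursor + len(word)))
            cursor += len(word) + 1  # advance past the word and the joining space
        return offsets

    print(token_offsets(["This", "is", "a", "sentence", "."]))
    # [(0, 4), (5, 7), (8, 9), (10, 18), (19, 20)] -> matches the expected positions above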
