
Refactor Token and Sentence Positional Properties #3001

Merged: 5 commits into master on Nov 27, 2022

Conversation

dobbersc
Collaborator

This PR refactors the positional properties of Token and Sentence that are inherited from DataPoint, replacing the self.start_pos and self.end_pos attributes.

For the Sentence, the two ways of accessing positional information behaved differently, producing inconsistent results even though they should effectively be aliases. These are the inconsistencies:

  1. Initializing the Sentence with str, i.e. untokenized input.

    • We allow the user to set an offset start_position in the init, but the start_position property does not respect it and always returns zero, whereas start_pos does include the offset (see the sketch after this list).
      Suggestion: Only use and expose the property inherited from DataPoint. Having multiple attributes that do the same thing under different names gets confusing.
  2. Initializing the Sentence with List[str], i.e. pre-tokenized input.

    • Same concern as in (1).
    • In addition to (1), end_pos and end_position do not behave the same:
      from flair.data import Sentence
      s = Sentence(['This', 'is', 'an', 'example', '.'])
      print(s.end_position)  # Prints 17 -> Corresponding to the character-level end position
      print(s.end_pos)  # Prints 5 -> Corresponding to the token-level end position
      Suggestion: Always use the character-level end position since the token-level end position is accessible with len(s).
    • The token start and end positions were incorrect.
      from flair.data import Sentence
      sentence = Sentence(["This", "is", "a", "sentence", "."])
      print([(token.start_position, token.end_position) for token in sentence])  
      # Prints [(0, 4), (5, 7), (7, 8), (8, 16), (16, 17)] but expected [(0, 4), (5, 7), (8, 9), (10, 18), (19, 20)]
      Suggestion: Do not use two separate code paths to construct the tokens. Instead, reduce the case of initializing the Sentence with List[str] to the case of initializing with str.
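
As a minimal sketch of inconsistency (1): only the Sentence constructor and its start_position parameter come from the PR description itself; the offset value 10 is an arbitrary example, and the printed values reflect the pre-refactor behaviour described above.

    from flair.data import Sentence
    s = Sentence("This is an example.", start_position=10)
    print(s.start_position)  # Prints 0 -> the property ignores the offset given in the init
    print(s.start_pos)  # Prints 10 -> the attribute includes the offset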

Please see the commits; each one corresponds in isolation to one of the suggestions above.

@alanakbik
Collaborator

@dobbersc thanks for fixing this! I see you removed the try-catch block in the token offset calculation. I actually don't remember why we needed this, and we have no unit test for a problem case, so removing it is fine.

@alanakbik merged commit 734ea96 into master on Nov 27, 2022
@alanakbik deleted the refactor-position-properties branch on Nov 27, 2022, 07:47
@dobbersc
Collaborator Author

> @dobbersc thanks for fixing this! I see you removed the try-catch block in the token offset calculation. I actually don't remember why we needed this, and we have no unit test for a problem case, so removing it is fine.

From my debugging, I found that the try-catch was only needed for initialization with List[str]. Since current_offset was calculated from character lengths, the indices did not align with the words in the given list, and the error handling then produced the incorrect token start and end positions. Now that we join the words from the list into a single string, the try-catch is no longer needed.
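
For context, here is a minimal standalone sketch (plain Python, not the actual flair internals) of why the character offsets line up once the pre-tokenized words are joined with single spaces:

    def token_offsets(words):
        """Character-level (start, end) offsets when words are joined by single spaces."""
        offsets, cursor = [], 0
        for word in words:
            offsets.append((cursor, cursor + len(word)))
            cursor += len(word) + 1  # advance past the word and the joining space
        return offsets

    print(token_offsets(["This", "is", "a", "sentence", "."]))
    # [(0, 4), (5, 7), (8, 9), (10, 18), (19, 20)] -> matches the expected positions above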
