Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve whitespace during PreProcessor.split() #1121

Merged
merged 1 commit into from Jun 2, 2021

Conversation

brandenchan
Copy link
Contributor

This PR solves #1023.

Previously, PreProcessor.split() would normalize whitespace as an unintended side effect (split_by = "word" and respect_sentence_boundary=False). This is because text would go through these processing steps:

elements = text.split(" ")
segments = windowed(elements, n=split_length, step=split_length)
for seg in segments:
    txt = " ".join([t for t in seg if not t])

The if not t in the list comprehension is used to filter out Nones in the list but would also remove whitespace elements which would appear in the list if we had more than one space in a row in the original text. This list comprehension has been changed so that whitespaces are fully maintained.

This PR also removes a warning message that used to trigger when whitespace normalisation had inadvertently occurred.

@brandenchan
Copy link
Contributor Author

One small side effect of this approach is that clusters of spaces will be broken up into tokens that will count as an element. This means that when we perform windowing, some of these spaces will count towards the window word count.

I think such cases are rare enough that this shouldn't have any considerable impact.

Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one! If life would always be that simple.

Just food for thought: The solution does only split on " " and not other whitespace chars or newlines.
Using re.split("\\s","Ti\ts a s\nmuh \n maeh") does what we need but removes the formatting, so removes tabs and newlines when we glue the tokens back together. Do you think keeping newlines etc is needed in later stages?

I am also fine with just splitting on " " and leaving as is. Ready to merge.

@brandenchan
Copy link
Contributor Author

Nice one! If life would always be that simple.

Just food for thought: The solution does only split on " " and not other whitespace chars or newlines.
Using re.split("\\s","Ti\ts a s\nmuh \n maeh") does what we need but removes the formatting, so removes tabs and newlines when we glue the tokens back together. Do you think keeping newlines etc is needed in later stages?

I am also fine with just splitting on " " and leaving as is. Ready to merge.

If I understand you correctly, then I am in favor of a reversible, non-destructive style of splitting that's currently implemented by this PR. This issue came up in the first place because our processing was removing duplicate spaces in the text and therefore original label spans could no longer be found in the text.

I agree that this current solution is very naive about \t and \n, but I think its good for us to keep PreProcessor.split() non-destructive, and leave whitespace handling to PreProcessor.clean()

@Timoeller
Copy link
Contributor

Timoeller commented Jun 1, 2021

I like this non destructive splitting, so lets merge like this.

This issue came up in the first place because our processing was removing duplicate spaces in the text and therefore original label spans could no longer be found in the text.

I thought the method would keep the offsets as is since it adds empty strings even when there is multiple tabs/newlines/other whitespaces. We just cannot reconstruct the original string because we join on " " only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants