Preserve whitespace during PreProcessor.split() #1121

brandenchan · 2021-06-01T14:49:11Z

This PR solves #1023.

Previously, PreProcessor.split() normalized whitespace as an unintended side effect when called with split_by="word" and respect_sentence_boundary=False. This happened because the text went through these processing steps:

from more_itertools import windowed

elements = text.split(" ")  # consecutive spaces produce "" elements
segments = windowed(elements, n=split_length, step=split_length)  # last window is padded with None
for seg in segments:
    txt = " ".join([t for t in seg if t])

The if t filter in the list comprehension is there to drop the Nones in the list, but it also removes the empty strings that appear in the list whenever the original text contains more than one space in a row. The list comprehension has been changed so that whitespace is fully preserved.
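
For illustration, the difference between the two filters can be reproduced without Haystack. This is only a sketch: windowed_chunks, join_old and join_new are hypothetical names, the helper is a stand-in for more_itertools.windowed, and the "is not None" filter is an assumption about what the fix looks like, not a copy of the merged code.

```python
def windowed_chunks(elements, n):
    """Stand-in for more_itertools.windowed(elements, n=n, step=n).

    Yields tuples of length n and pads the last one with None.
    """
    for start in range(0, len(elements), n):
        chunk = tuple(elements[start:start + n])
        yield chunk + (None,) * (n - len(chunk))


def join_old(seg):
    # Truthiness filter: drops the None padding, but also the ""
    # elements that text.split(" ") produces for consecutive spaces.
    return " ".join([t for t in seg if t])


def join_new(seg):
    # Identity filter (assumed fix): drops only the None padding.
    return " ".join([t for t in seg if t is not None])


text = "label  span with   extra spaces"
elements = text.split(" ")

old = [join_old(seg) for seg in windowed_chunks(elements, 3)]
new = [join_new(seg) for seg in windowed_chunks(elements, 3)]

print(repr(" ".join(old)))  # runs of spaces collapsed
print(repr(" ".join(new)))  # identical to the original text
```

With the identity filter, gluing the segments back together with a single space gives back the original text, which is what keeps label spans searchable.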

This PR also removes a warning message that was triggered whenever whitespace normalization had inadvertently occurred.

brandenchan · 2021-06-01T15:01:08Z

One small side effect of this approach is that a run of consecutive spaces is broken up into empty tokens, each of which counts as an element. When we perform windowing, these tokens therefore count towards the window's word count.

I think such cases are rare enough that this shouldn't have any considerable impact.
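
To make the side effect concrete, here is a small sketch that counts elements the way a split on a single space does. The sample strings and variable names are invented for illustration.

```python
single_spaced = "one two three four"
double_spaced = "one  two  three  four"

# Each extra space adds one empty-string element to the list.
single_elements = single_spaced.split(" ")
double_elements = double_spaced.split(" ")

split_length = 4

# Actual words that land in the first window of split_length elements.
first_window_single = [t for t in single_elements[:split_length] if t]
first_window_double = [t for t in double_elements[:split_length] if t]

print(len(single_elements), first_window_single)
print(len(double_elements), first_window_double)
```

In the double-spaced case the first window of four elements holds only two real words, because the empty strings take up the other two slots.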

Timoeller

Nice one! If only life were always that simple.

Just food for thought: the solution only splits on " " and not on other whitespace characters such as tabs or newlines.
Using re.split("\\s","Ti\ts a s\nmuh \n maeh") does what we need but loses the formatting: tabs and newlines are removed when we glue the tokens back together. Do you think keeping newlines etc. is needed in later stages?

I am also fine with just splitting on " " and leaving it as is. Ready to merge.
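
A quick sketch of the trade-off described in this comment, using the same sample string. The capturing-group variant at the end is an assumption added for comparison; it is a standard feature of re.split but not something this PR uses.

```python
import re

text = "Ti\ts a s\nmuh \n maeh"

# Splitting on any whitespace character discards the separators,
# so joining on " " turns tabs and newlines into plain spaces.
tokens = re.split(r"\s", text)
rejoined = " ".join(tokens)

# Assumption, not part of the PR: a capturing group keeps every
# separator as its own list element, so "".join() restores the text.
tokens_with_seps = re.split(r"(\s)", text)
restored = "".join(tokens_with_seps)

print(tokens)
print(repr(rejoined))
print(restored == text)
```

The first variant is lossy about which whitespace character was there; the second is reversible at the cost of separator elements in the token list.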

brandenchan · 2021-06-01T15:32:35Z

> Nice one! If only life were always that simple.
>
> Just food for thought: the solution only splits on " " and not on other whitespace characters such as tabs or newlines.
> Using re.split("\\s","Ti\ts a s\nmuh \n maeh") does what we need but loses the formatting: tabs and newlines are removed when we glue the tokens back together. Do you think keeping newlines etc. is needed in later stages?
>
> I am also fine with just splitting on " " and leaving it as is. Ready to merge.

If I understand you correctly, then I am in favor of a reversible, non-destructive style of splitting that's currently implemented by this PR. This issue came up in the first place because our processing was removing duplicate spaces in the text and therefore original label spans could no longer be found in the text.

I agree that the current solution is very naive about \t and \n, but I think it's good for us to keep PreProcessor.split() non-destructive and leave whitespace handling to PreProcessor.clean().

Timoeller · 2021-06-01T16:28:19Z

I like this non-destructive splitting, so let's merge it like this.

> This issue came up in the first place because our processing was removing duplicate spaces in the text and therefore original label spans could no longer be found in the text.

I thought the method would keep the offsets as they are, since it adds empty strings even when there are multiple tabs, newlines, or other whitespace characters. We just cannot reconstruct the original string because we join on " " only.
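
The point about offsets can be checked with a small sketch. It assumes "the method" refers to splitting on a single whitespace character with re.split and joining on " "; the sample text and label are invented.

```python
import re

text = "Flights to New\nYork  depart daily"
label = "New\nYork"
start = text.index(label)

# Every whitespace character is replaced by exactly one space, so the
# string length and all character offsets stay the same.
normalized = " ".join(re.split(r"\s", text))

same_length = len(normalized) == len(text)
span_at_offset = normalized[start:start + len(label)]
label_found_by_search = label in normalized

print(same_length, repr(span_at_offset), label_found_by_search)
```

The span still sits at the same offset, but it now reads "New York" with a space, so a string search for the original label fails and the original text cannot be rebuilt from the tokens.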

Preserve whitespace

fcbb299

brandenchan requested a review from Timoeller June 1, 2021 14:49

Timoeller approved these changes Jun 1, 2021

brandenchan merged commit d8c47ed into master Jun 2, 2021

brandenchan deleted the preserve_whitespace branch June 2, 2021 10:08

brandenchan mentioned this pull request Jun 2, 2021

Document splitting based on word count can cause whitespace normalization #1023

Closed

This was referenced Jul 15, 2021

Fix requirements etalab-ia/piaf-ml#114

Merged

Leading space in answers since haykstack 0.9.0 #1299

Closed
