
Improve preprocessing and adding of eval data #780

Merged: 2 commits, Feb 1, 2021

Conversation

@Timoeller (Contributor) commented Jan 27, 2021

There were problems with empty text in documents during preprocessing, as mentioned in #751 (comment)

This PR fixes #763

It now also collects the question IDs whose answers could not be converted and prints statistics about them, so it also closes #774

@Timoeller requested a review from tholor on January 27, 2021 14:21
@tholor (Member) left a comment

Looking good. Is split_respect_sentence_boundary not supported in eval because it's messing up the offsets?

@Timoeller (Contributor, Author) commented Jan 27, 2021

Exactly. Under the hood it uses nltk's sentence splitting, which might remove characters between sentences, e.g. when doing:

>>> nltk.tokenize.sent_tokenize("This is a test.  With probs.")
['This is a test.', 'With probs.']
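
For illustration, a minimal sketch (assuming the answer positions are character offsets into the original document text) of how that dropped space shifts the offsets:

# requires the punkt data: nltk.download('punkt')
import nltk

doc = "This is a test.  With probs."            # note the double space
sentences = nltk.tokenize.sent_tokenize(doc)    # ['This is a test.', 'With probs.']
rejoined = " ".join(sentences)                  # "This is a test. With probs."

answer_start = doc.find("With")                 # 17 in the original document
print(rejoined[answer_start:])                  # "ith probs." -> off by one after rejoining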

@Timoeller changed the title from "Remove empty document when splitting text" to "Improve preprocessing and adding of eval data" on Jan 27, 2021
@xpatronum (Contributor) commented:

Someone double check this, but the list_splits defined on line 103 could contain empty slices.
Specifically, if the first sentence contains more than self.split_length words, the condition below is triggered:

if word_count + current_word_count > self.split_length:
    list_splits.append(current_slice)

The quick fix is to add an additional check:

if word_count + current_word_count > self.split_length:
    if len(current_slice) > 0:
        list_splits.append(current_slice)

@Timoeller (Contributor, Author) commented:

Hey @thenewera-ru, good catch.
We removed support for split_respect_sentence_boundary in the second-to-last commit in this PR. Do you agree that this should prevent the bug you described?

@xpatronum (Contributor) commented:

> Hey @thenewera-ru, good catch.
> We removed support for split_respect_sentence_boundary in the second-to-last commit in this PR. Do you agree that this should prevent the bug you described?

I tested the fixed code on my own data and so far it works perfectly. The code in the commit basically does the same thing, just at the end. IMHO it is a bit clearer not to append an empty current_slice to list_splits at all, rather than adding it and checking if len(' '.join(text)) > 0: at the end.
But the output is the same either way, as I said previously, since in the end you form text: str from each current_slice stored in list_splits.

P.S. I'm NOT opening a pull request because I'm strongly against the nltk module. It works well for English texts but performs poorly on others, in my case Russian. For that purpose I would prefer switching to spaCy, since it has prebuilt linguistic rules for most languages:

from spacy.lang.en import English
from spacy.lang.ru import Russian
from langdetect import detect

text = ('The dursley family of number four privet drive was the reason that harry never enjoyed his summer holidays. '
        'Uncle vernon aunt petunia and their son dudley were harry’s only living relatives. '
        'They were muggles and they had a very medieval attitude toward magic. '
        'Harry’s dead parents who had been a witch and wizard themselves were never mentioned under the dursleys’ roof.')

language = detect(text)  # e.g. 'en' or 'ru'
if language == 'en':
    nlp = English()
elif language == 'ru':
    nlp = Russian()
else:
    raise NotImplementedError(f"No sentence splitter configured for language '{language}'")

nlp.add_pipe(nlp.create_pipe('sentencizer'))  # rule-based sentence boundaries (spaCy 2.x API)
doc = nlp(text)
for s in doc.sents:
    pure_text = s.text

At least for language == 'ru', spaCy performs much better than nltk. I've heard from other folks working in industry too that nltk struggles on certain types of text.

P.P.S.
I would also consider adding a parameter to split on phrases, e.g. :param split_on_phrase: .... For that purpose there's the fantastic open-source library flashtext, which is much faster than Python's re on big texts.
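
As a rough illustration, a minimal sketch of what such an option could look like; KeywordProcessor and extract_keywords(..., span_info=True) are real flashtext APIs, while the split_on_phrases helper and the phrase list are made up for this example:

from flashtext import KeywordProcessor

def split_on_phrases(text, phrases):
    processor = KeywordProcessor()
    for phrase in phrases:
        processor.add_keyword(phrase)
    # span_info=True returns (keyword, start, end) character offsets of each match
    spans = processor.extract_keywords(text, span_info=True)
    splits, previous_end = [], 0
    for _, start, end in spans:
        splits.append(text[previous_end:start])
        previous_end = start  # keep the phrase at the head of the next split
    splits.append(text[previous_end:])
    return [s for s in splits if s.strip()]

chunks = split_on_phrases(
    "Chapter one. Harry lived with the Dursleys. Chapter two. He went to Hogwarts.",
    phrases=["Chapter one", "Chapter two"],
)
# ['Chapter one. Harry lived with the Dursleys. ', 'Chapter two. He went to Hogwarts.']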

Of course I can share my code in a separate repo if anyone is interested. Since more than half of the original code is modified, I'm not opening a PR. Let's discuss it first.

@Timoeller (Contributor, Author) commented:

Hey, I like your suggestion and prefer spaCy over nltk as well. Unfortunately spaCy's nlp pipeline can be rather slow during processing, especially if all pipeline components are used. Nevertheless we need a proper solution for languages other than English, and we would dearly value your contribution.

How about you raise a Work In Progress PR with very preliminary code and we continue the discussion from there? You could also remove spaCy pipeline components for better speed and benchmark both approaches.
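
As a rough sketch of that benchmarking idea (the model name en_core_web_sm is an assumption, and the create_pipe call follows the spaCy 2.x API used above):

import time
import spacy

# load a full pipeline but disable the expensive components, keeping only tokenization
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
nlp.add_pipe(nlp.create_pipe("sentencizer"))  # cheap rule-based sentence boundaries

docs = ["This is a test. With probs."] * 1000
start = time.time()
for doc in nlp.pipe(docs):
    sentences = [sent.text for sent in doc.sents]
print(f"spaCy sentencizer: {time.time() - start:.2f}s for {len(docs)} docs")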

@Timoeller (Contributor, Author) commented:

@thenewera-ru I will merge this PR for now.

As mentioned, feel free to open a very preliminary WIP PR using spaCy's nlp pipeline to start the discussion.
