
Improve preprocessing and adding of eval data #780

Merged: 2 commits, Feb 1, 2021

Conversation

@Timoeller (Contributor) commented Jan 27, 2021

There were problems with empty text in documents during preprocessing, as mentioned in #751 (comment)

This PR fixes #763

It now also collects the question IDs whose answers could not be converted and prints statistics about them, so it also closes #774

@Timoeller requested a review from tholor on January 27, 2021 14:21
@tholor (Member) left a comment

Looking good. Is split_respect_sentence_boundary not supported in eval because it's messing up the offsets?

@Timoeller (Contributor, Author) commented Jan 27, 2021

Exactly. Under the hood it uses nltk's sentence splitting, which might remove characters between sentences, e.g. when doing:

>>> nltk.tokenize.sent_tokenize("This is a test.  With probs.")
['This is a test.', 'With probs.']
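
For illustration, a minimal sketch (assuming the answer positions are character offsets into the original document text) of how that dropped space shifts the offsets:

# requires the punkt data: nltk.download('punkt')
import nltk

doc = "This is a test.  With probs."            # note the double space
sentences = nltk.tokenize.sent_tokenize(doc)    # ['This is a test.', 'With probs.']
rejoined = " ".join(sentences)                  # "This is a test. With probs."

answer_start = doc.find("With")                 # 17 in the original document
print(rejoined[answer_start:])                  # "ith probs." -> off by one after rejoining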

@Timoeller changed the title from "Remove empty document when splitting text" to "Improve preprocessing and adding of eval data" on Jan 27, 2021
@xpatronum (Contributor) commented:

Someone double check this, but the list_splits defined on line 103 could contain empty slices.
Specifically, if the first sentence contains more than self.split_length words, the condition below is triggered:

if word_count + current_word_count > self.split_length:
    list_splits.append(current_slice)

The quick fix is to add an additional check:

if word_count + current_word_count > self.split_length:
    if len(current_slice) > 0:
        list_splits.append(current_slice)

@Timoeller (Contributor, Author) commented:

Hey @thenewera-ru, good catch.
We removed support for split_respect_sentence_boundary in the second-to-last commit in this PR. Do you agree that this should prevent the bug you described?

@xpatronum (Contributor) commented:

> Hey @thenewera-ru, good catch.
> We removed support for split_respect_sentence_boundary in the second-to-last commit in this PR. Do you agree that this should prevent the bug you described?

I tested the fixed code on my own data and so far it works perfectly. The code in the commit basically does the same thing, just at the end. IMHO it is a bit clearer not to append an empty current_slice to list_splits at all, rather than adding it and checking if len(' '.join(text)) > 0: at the end.
But the output is the same either way, as I said previously, since in the end you form text: str from each current_slice stored in list_splits.

P.S. I'm NOT opening a pull request because I'm strongly against the nltk module. It works well for English texts but performs poorly on others, in my case Russian. For that purpose I would prefer switching to spaCy, since it has prebuilt linguistic rules for most languages:

from spacy.lang.en import English
from spacy.lang.ru import Russian
from langdetect import detect

text = ('The dursley family of number four privet drive was the reason that harry never enjoyed his summer holidays. '
        'Uncle vernon aunt petunia and their son dudley were harry’s only living relatives. '
        'They were muggles and they had a very medieval attitude toward magic. '
        'Harry’s dead parents who had been a witch and wizard themselves were never mentioned under the dursleys’ roof.')

language = detect(text)  # e.g. 'en' or 'ru'
if language == 'en':
    nlp = English()
elif language == 'ru':
    nlp = Russian()
else:
    raise NotImplementedError(f"No sentence splitter configured for language '{language}'")

nlp.add_pipe(nlp.create_pipe('sentencizer'))  # rule-based sentence boundaries (spaCy 2.x API)
doc = nlp(text)
for s in doc.sents:
    pure_text = s.text

At least for language == 'ru', spaCy performs much better than nltk. I've heard from other folks working in industry too that nltk struggles on certain types of text.

P.P.S.
I would also consider adding a parameter to split on phrases, e.g. :param split_on_phrase: .... For that purpose there's the fantastic open-source library flashtext, which is much faster than Python's re on big texts.
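
As a rough illustration, a minimal sketch of what such an option could look like; KeywordProcessor and extract_keywords(..., span_info=True) are real flashtext APIs, while the split_on_phrases helper and the phrase list are made up for this example:

from flashtext import KeywordProcessor

def split_on_phrases(text, phrases):
    processor = KeywordProcessor()
    for phrase in phrases:
        processor.add_keyword(phrase)
    # span_info=True returns (keyword, start, end) character offsets of each match
    spans = processor.extract_keywords(text, span_info=True)
    splits, previous_end = [], 0
    for _, start, end in spans:
        splits.append(text[previous_end:start])
        previous_end = start  # keep the phrase at the head of the next split
    splits.append(text[previous_end:])
    return [s for s in splits if s.strip()]

chunks = split_on_phrases(
    "Chapter one. Harry lived with the Dursleys. Chapter two. He went to Hogwarts.",
    phrases=["Chapter one", "Chapter two"],
)
# ['Chapter one. Harry lived with the Dursleys. ', 'Chapter two. He went to Hogwarts.']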

Of course I can share my code in a separate repo if anyone is interested. Since more than half of the original code is modified, I'm not opening a PR. Let's discuss it first.

@Timoeller (Contributor, Author) commented:

Hey, I like your suggestion and prefer spaCy over nltk as well. Unfortunately spaCy's nlp pipeline can be rather slow during processing, especially if all pipeline components are used. Nevertheless we need a proper solution for languages other than English, and we would dearly value your contribution.

How about you raise a Work In Progress PR with very preliminary code and we continue the discussion from there? You could also remove spaCy pipeline components for better speed and benchmark both approaches.
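
As a rough sketch of that benchmarking idea (the model name en_core_web_sm is an assumption, and the create_pipe call follows the spaCy 2.x API used above):

import time
import spacy

# load a full pipeline but disable the expensive components, keeping only tokenization
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
nlp.add_pipe(nlp.create_pipe("sentencizer"))  # cheap rule-based sentence boundaries

docs = ["This is a test. With probs."] * 1000
start = time.time()
for doc in nlp.pipe(docs):
    sentences = [sent.text for sent in doc.sents]
print(f"spaCy sentencizer: {time.time() - start:.2f}s for {len(docs)} docs")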

@Timoeller (Contributor, Author) commented:

@thenewera-ru I will merge this PR for now.

As mentioned, feel free to open a very preliminary WIP PR using spaCy's nlp pipeline to start the discussion.
