Improve preprocessing and adding of eval data #780
Conversation
Looking good. Is `split_respect_sentence_boundary` not supported in eval because it's messing up the offsets?
Exactly. Under the hood it uses nltk's sentence splitting, which might remove symbols between sentences, e.g. when doing something like the following.
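A minimal sketch of that failure mode (an illustrative example, not the exact snippet from the codebase; `sent_tokenize` needs nltk's `punkt` data downloaded):

```python
from nltk.tokenize import sent_tokenize  # needs: nltk.download('punkt')

text = "First sentence.\n\nSecond sentence."
sentences = sent_tokenize(text)
print(sentences)  # ['First sentence.', 'Second sentence.']

# The "\n\n" between the sentences is dropped, so re-joining the sentences
# yields a string of a different length, and any character offsets annotated
# on the original text no longer line up.
rejoined = " ".join(sentences)
assert len(rejoined) != len(text)
```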
Someone double check this, but the current code does:

```python
if word_count + current_word_count > self.split_length:
    list_splits.append(current_slice)
```

The quick fix is to add an additional check so that empty slices are never appended:

```python
if word_count + current_word_count > self.split_length:
    if len(current_slice) > 0:
        list_splits.append(current_slice)
```
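For context, here is a minimal, hypothetical reconstruction of the surrounding splitting loop (`split_by_sentence` and its arguments are made up for illustration, this is not the actual Haystack code) showing how an empty slice gets appended when the very first sentence already exceeds `split_length`:

```python
def split_by_sentence(sentences, split_length):
    """Group sentences into slices of at most split_length words each."""
    list_splits, current_slice, current_word_count = [], [], 0
    for sentence in sentences:
        word_count = len(sentence.split())
        if word_count + current_word_count > split_length:
            # Without this guard, a first sentence longer than split_length
            # appends the still-empty current_slice to list_splits.
            if len(current_slice) > 0:
                list_splits.append(current_slice)
            current_slice, current_word_count = [], 0
        current_slice.append(sentence)
        current_word_count += word_count
    if current_slice:
        list_splits.append(current_slice)
    return list_splits

# Without the guard this would return [[], [...]]; with it, the empty slice is skipped.
print(split_by_sentence(["one sentence with more words than the limit"], split_length=4))
# -> [['one sentence with more words than the limit']]
```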
Hey @thenewera-ru, good catch.
I tested the fixed code on my own data and so far it works perfectly. Basically, the code in the commit does the same, just at the end. It is a bit clearer when we do not add empty slices.

P.S. I'm NOT opening the pull request because I'm strongly against nltk's sentence splitting. This is roughly what I use with spacy instead:

```python
from spacy.lang.en import English
from spacy.lang.ru import Russian
from langdetect import detect

text = 'The dursley family of number four privet drive was the reason that harry never enjoyed his summer holidays.\
Uncle vernon aunt petunia and their son dudley were harry’s only living relatives. \
They were muggles and they had a very medieval attitude toward magic.\
Harry’s dead parents who had been a witch and wizard themselves were never mentioned under the dursleys’ roof.'

# Detect the language and pick the matching blank pipeline
language = detect(text)
if language == 'en':
    nlp = English()
elif language == 'ru':
    nlp = Russian()
else:
    raise NotImplementedError(f"No pipeline for language: {language}")

# Rule-based sentence splitting without loading a full statistical model
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # spaCy 2.x API

doc = nlp(text)
for s in doc.sents:
    pure_text = s.text
```

At least for `en` and `ru` this works well.

P.P.S. Of course I can share my code in a separate repo if anyone is interested. Since more than half of the original code is modified, I'm not opening a PR here.
Hey, I like your suggestion and prefer spaCy over nltk as well. Unfortunately, spaCy's nlp pipeline can be rather slow during processing, especially if all pipeline components are used. Nevertheless, we need a proper solution for languages other than English, and we would dearly value your contribution. How about you raise a Work In Progress PR with very preliminary code and we continue the discussion from there? You could also remove spaCy pipeline components for better speed and benchmark both approaches, along the lines of the sketch below.
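A rough benchmark sketch (hypothetical; the corpus is made up, and it reuses the spaCy 2.x API from the snippet above):

```python
import time

from nltk.tokenize import sent_tokenize  # needs: nltk.download('punkt')
from spacy.lang.en import English

# Minimal spaCy pipeline: only the rule-based sentencizer, no tagger/parser/NER
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # spaCy 2.x; in spaCy 3: nlp.add_pipe('sentencizer')

text = "This is a sentence. " * 10_000

start = time.perf_counter()
nltk_sentences = sent_tokenize(text)
print(f"nltk:  {time.perf_counter() - start:.3f}s, {len(nltk_sentences)} sentences")

start = time.perf_counter()
spacy_sentences = [s.text for s in nlp(text).sents]
print(f"spacy: {time.perf_counter() - start:.3f}s, {len(spacy_sentences)} sentences")
```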
@thenewera-ru I will merge this PR for now. As mentioned, feel free to open a very preliminary WIP PR using spaCy's nlp pipeline to start the discussion.
There were problems with empty text in documents during preprocessing, as mentioned in #751 (comment).
This PR fixes #763.
It now also collects the question IDs where the answer could not be converted and prints statistics about them, so it closes #774.
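A minimal sketch of that bookkeeping (hypothetical names, not the actual implementation in this PR):

```python
from typing import Callable, Dict, List

def collect_unconvertible_ids(questions: List[Dict], can_convert: Callable[[Dict], bool]) -> List[str]:
    """Collect IDs of questions whose answers could not be converted and print stats."""
    problem_ids = [q["id"] for q in questions if not can_convert(q)]
    print(f"Could not convert answers for {len(problem_ids)} of {len(questions)} questions: {problem_ids}")
    return problem_ids
```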