added a named param to records() to allow stripping of section headings #224
textacy/datasets/wikipedia.py
Outdated
if len(page['text']) < min_len:
    continue
page['title'] = title
page['page_id'] = page_id
page['text'] = title + '\n\n' + page['text']
if keep_section_headings:
Looks like this only handles the case when keep_section_headings is True. You'll need an else statement for the other case.
will do
after looking at this, page['text'] already has its value; we only want to modify its value if we want to prepend the title, so unless I'm missing something we don't need an else
you're correct, i misread my own code 😅
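The conclusion of this thread can be sketched as below. This is a minimal stand-alone illustration and not the actual textacy code: `finalize_page` is a hypothetical helper, and the flag name is the one shown in the diff at this point in the review.

```python
def finalize_page(page, title, page_id, keep_section_headings=True):
    """Hypothetical helper mirroring the lines under review."""
    page['title'] = title
    page['page_id'] = page_id
    # page['text'] already holds the extracted body. It only needs to be
    # touched when the title should be prepended, so no else branch is needed.
    if keep_section_headings:
        page['text'] = title + '\n\n' + page['text']
    return page
```

With the flag set to False the text passes through unchanged, which is why the missing else is harmless here.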
textacy/datasets/wikipedia.py
Outdated
@@ -402,6 +402,8 @@ def records(self, min_len=100, limit=-1, fast=False):
fast (bool): If True, text is extracted using a faster method but
which gives lower quality results. Otherwise, a slower but better
method is used to extract article text.
keep_section_headings (bool): Whether to include section headings and
I think I'd prefer to follow the naming convention of the underlying library, and call this include_headings. Could you make that change everywhere?
I agree. will do
what do you think about adding a minimal wikimedia dump file, with just a few pages (could be bogus), as test data so that unit tests (for this and any other changes) could be run against it? spot-checking of stuff makes me kind of nervous...
Agreed, though that doesn't necessarily have to be in this PR. I have inconsistent test coverage for textacy; unfortunately, this particular functionality is not well tested.
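A rough idea of what such a test fixture could look like is sketched below. Everything here is an assumption for illustration: the XML is a bogus, heavily simplified stand-in for a Wikimedia dump (real dumps are namespaced and compressed), and `iter_pages` is a hypothetical reader built on the standard library, not textacy's parser.

```python
import xml.etree.ElementTree as ET

# Bogus two-page "dump"; the element layout is a simplification, not the
# real export schema.
TINY_DUMP = """\
<mediawiki>
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><text>Intro sentence about foo.

== History ==
Foo was invented long ago.</text></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><text>Bar is short.</text></revision>
  </page>
</mediawiki>
"""


def iter_pages(xml_string):
    """Yield (page_id, title, raw_text) for each page element."""
    root = ET.fromstring(xml_string)
    for page in root.iter('page'):
        yield (
            page.findtext('id'),
            page.findtext('title'),
            page.findtext('revision/text'),
        )


pages = list(iter_pages(TINY_DUMP))
```

A unit test could then assert on the extracted text of the "Foo" page with and without headings, instead of spot-checking files by hand.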
Hey @ckot, I see your new issues. I'd like to close out existing PRs before opening related new ones, if only to keep things neat. Would you be able to make the changes I mentioned above? If not, I'm happy to do this myself; it's a small change :) But I didn't want to take the credit for your good work. Let me know!
I changed the named param to 'include_headers' as you requested, and tested (via my own application which uses this, due to this not having a unit test). I agree regarding wanting to push this PR before having me push out any more. Although all my PRs have been quite simple, it's easier for me as well to not have multiple outstanding PRs and thus have lots of branches on my fork and need to keep them all in sync.
Rather than requiring the user to parse out section headings from the extracted page text, I added a keep_section_headings named param (default True) to records().
Description
Section headings are currently kept in the page text, requiring the user to manually strip them out.
Motivation and Context
Section headings remain in the page text, delimited by newlines, but unfortunately it takes some intelligence to tell whether a given line is a section heading or simply a short sentence.
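To illustrate why stripping headings after extraction is unreliable, here is a minimal sketch of a naive heuristic. The function `looks_like_heading` and its word-count threshold are hypothetical and are not part of textacy.

```python
def looks_like_heading(line, max_words=4):
    """Naive guess: a short line with no terminal punctuation is a heading."""
    stripped = line.strip()
    if not stripped:
        return False
    return len(stripped.split()) <= max_words and stripped[-1] not in '.!?'


# A genuine heading is caught...
caught = looks_like_heading("Early life")
# ...but a short body line without punctuation is flagged as well...
false_positive = looks_like_heading("Red, green, blue")
# ...and a heading phrased as a question is missed.
missed = not looks_like_heading("What is a monad?")
```

Handling headings at extraction time, where the markup still identifies them, avoids this guesswork.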
How Has This Been Tested?
I simply saved the page text to files and verified that the section headings (and page title) aren't present in the page text when I pass False to this parameter. Everything else is the same.