Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added a named param to records() to allow stripping of section headings #224

Merged
merged 3 commits into from Feb 2, 2019

Conversation

ckot
Copy link
Contributor

@ckot ckot commented Jan 19, 2019

Rather than requiring the user to parse out section headings from the extracted page text, I added a keep_section_headings named param (default True) to records()

Description

Section headings are currently kept in the page text, requiring the user to manually strip them out.

Motivation and Context

Section headings remain in the page text, delimited by newlines, but unfortunately it requires intelligence to determine whether they are a section heading or simply a short sentence

How Has This Been Tested?

I simply saved the pages text to files and verified that the section headings (and page-title) aren't present in the page text when I pass False to this parameter. Everything else is the same.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation, and I have updated it accordingly.

if len(page['text']) < min_len:
continue
page['title'] = title
page['page_id'] = page_id
page['text'] = title + '\n\n' + page['text']
if keep_section_headings:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this only handles the case when keep_section_headings is True. You'll need an else statement for the other case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after looking at this, page['text'] already has it's value, we only want to modify it's value if we want to prepend the title, so unless I'm missing something we don't need an else

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're correct, i misread my own code 😅

@@ -402,6 +402,8 @@ def records(self, min_len=100, limit=-1, fast=False):
fast (bool): If True, text is extracted using a faster method but
which gives lower quality results. Otherwise, a slower but better
method is used to extract article text.
keep_section_headings (bool): Whether to include section headings and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer to follow the naming convention of the underlying library, and call this include_headings. Could you make that change everywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think about adding a minimal wikimedia dump file, which just a few pages (could be bogus) as test data so that unit tests (for this and any other changes) could be run against? spot-checking of stuff makes me kind of nervous...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, though that doesn't necessarily have to be in this PR. I have inconsistent test coverage for textacy; unfortunately, this particular functionality is not well tested.

@bdewilde
Copy link
Collaborator

Hey @ckot , I see your new issues. I'd like to close out existing PRs before opening related new ones, if only to keep things neat. Would you be able to make the changes I mentioned above? If not, I'm happy to do this myself — it's a small change :) — but didn't want to take the credit for your good work. Let me know!

@ckot
Copy link
Contributor Author

ckot commented Feb 1, 2019

I changed the named param to 'include_headers' as you requested, and tested (via my own application which uses this, due to this not having a unit test).

I agree regarding wanting to push this PR before having me push out any more. Although all my PRs have been quite simple, it's easier for me as well to not have multiple outstanding PRs and thus have lots of branches on my fork and need to keep them all in sync.

@bdewilde bdewilde merged commit 7e1f3a8 into chartbeat-labs:master Feb 2, 2019
@ckot ckot deleted the feat-strip-section-headings branch February 2, 2019 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants