added a named param to records() to allow stripping of section headings #224

ckot · 2019-01-19T20:53:28Z

Rather than requiring the user to parse out section headings from the extracted page text, I added a keep_section_headings named parameter (default True) to records().

Description

Section headings are currently kept in the page text, requiring the user to manually strip them out.

Motivation and Context

Section headings remain in the page text, delimited by newlines. Unfortunately, telling a section heading apart from a short sentence takes judgment, which makes stripping them after the fact unreliable.
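
To illustrate the ambiguity, here is a minimal sketch. The sample text and the three-word threshold are made up for illustration and are not taken from a real dump or from textacy:

```python
# Made-up page text: headings and body sentences are both newline-delimited.
text = (
    "Alpha\n\n"
    "History\n\n"
    "It began in 1900.\n\n"
    "It grew.\n\n"
    "Legacy\n\n"
    "It is remembered."
)

# Naive heuristic (an assumption for illustration): treat any line of
# three words or fewer as a heading.
short_lines = [line for line in text.split("\n\n") if len(line.split()) <= 3]

# The heuristic catches the headings, but also short ordinary sentences
# such as "It grew.", so a length check alone cannot separate the two.
print(short_lines)
```

Dropping headings at extraction time, where the markup still identifies them, avoids this guesswork.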

How Has This Been Tested?

I saved the page text to files and verified that the section headings (and the page title) are absent from the page text when I pass False for this parameter. Everything else is unchanged.

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation, and I have updated it accordingly.


bdewilde · 2019-01-26T23:35:47Z

textacy/datasets/wikipedia.py

            if len(page['text']) < min_len:
                continue
            page['title'] = title
            page['page_id'] = page_id
-            page['text'] = title + '\n\n' + page['text']
+            if keep_section_headings:


Looks like this only handles the case when keep_section_headings is True. You'll need an else statement for the other case.

After looking at this: page['text'] already has its value, and we only want to modify it when prepending the title, so unless I'm missing something we don't need an else.

you're correct, i misread my own code 😅
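
The point settled in this thread can be sketched as follows. This is a minimal illustration, not the code in the PR: finalize_page_text is a hypothetical helper, and only the conditional mirrors the diff under review.

```python
def finalize_page_text(page, title, keep_section_headings=True):
    """Hypothetical helper (not part of textacy) showing why no else is needed."""
    # page["text"] already holds the extracted text at this point.
    if keep_section_headings:
        # Only this branch changes anything: prepend the title.
        page["text"] = title + "\n\n" + page["text"]
    # When the flag is False, the text is simply left as it was.
    return page


kept = finalize_page_text({"text": "Body."}, "Title")
stripped = finalize_page_text({"text": "Body."}, "Title", keep_section_headings=False)
```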

bdewilde · 2019-01-26T23:37:10Z

textacy/datasets/wikipedia.py

@@ -402,6 +402,8 @@ def records(self, min_len=100, limit=-1, fast=False):
            fast (bool): If True, text is extracted using a faster method but
                which gives lower quality results. Otherwise, a slower but better
                method is used to extract article text.
+            keep_section_headings (bool): Whether to include section headings and


I think I'd prefer to follow the naming convention of the underlying library, and call this include_headings. Could you make that change everywhere?

I agree. will do

What do you think about adding a minimal Wikimedia dump file with just a few pages (could be bogus) as test data, so that unit tests (for this and any other changes) could be run against it? Spot-checking makes me kind of nervous...

Agreed, though that doesn't necessarily have to be in this PR. I have inconsistent test coverage for textacy; unfortunately, this particular functionality is not well tested.
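
A rough sketch of what such a test could look like, under loud assumptions: BOGUS_PAGES is made-up data standing in for a tiny dump fixture, and fake_records is a stand-in for records(), not textacy's implementation. Only the include_headings flag name comes from this thread.

```python
# Bogus pages standing in for a tiny Wikimedia dump fixture (made-up content).
BOGUS_PAGES = [
    {
        "page_id": "1",
        "title": "Alpha",
        "sections": [
            ("History", "Alpha was founded long ago."),
            ("Legacy", "It is remembered."),
        ],
    },
]


def fake_records(pages, include_headings=True):
    """Stand-in for records(); NOT textacy's implementation."""
    for page in pages:
        # Title and section headings are included only when the flag is set.
        parts = [page["title"]] if include_headings else []
        for heading, body in page["sections"]:
            if include_headings:
                parts.append(heading)
            parts.append(body)
        yield {
            "page_id": page["page_id"],
            "title": page["title"],
            "text": "\n\n".join(parts),
        }


def test_headings_kept_by_default():
    text = next(fake_records(BOGUS_PAGES))["text"]
    assert text.startswith("Alpha\n\n")
    assert "History" in text.split("\n\n")


def test_headings_stripped_when_disabled():
    text = next(fake_records(BOGUS_PAGES, include_headings=False))["text"]
    assert text == "Alpha was founded long ago.\n\nIt is remembered."
```

A real version would read the fixture through the actual dataset class instead of the stand-in, so both flag values are exercised against the real extraction path.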

bdewilde · 2019-01-31T17:06:51Z

Hey @ckot , I see your new issues. I'd like to close out existing PRs before opening related new ones, if only to keep things neat. Would you be able to make the changes I mentioned above? If not, I'm happy to do this myself — it's a small change :) — but didn't want to take the credit for your good work. Let me know!


ckot · 2019-02-01T03:56:05Z

I changed the named param to 'include_headings' as you requested, and tested it via my own application that uses it, since there is no unit test for this.

I agree about finishing this PR before I open any more. Although all my PRs have been quite simple, it's also easier for me not to have multiple outstanding PRs, with lots of branches on my fork that I need to keep in sync.

added a named parameter keep_section_headings (default True) to records()

b8b2613

bdewilde requested changes Jan 26, 2019

ckot added 2 commits January 31, 2019 21:31

Merge branch 'master' into feat-strip-section-headings

80e8b41

renamed all occurances of 'keep_section_headings' to 'include_headings' for consistancy purposes

b28478c

bdewilde approved these changes Feb 2, 2019

bdewilde merged commit 7e1f3a8 into chartbeat-labs:master Feb 2, 2019

ckot deleted the feat-strip-section-headings branch February 2, 2019 18:32
