
Method page.sections returns HTML markup in some cases #42

Closed
delchiaro opened this issue Feb 23, 2018 · 6 comments

Comments

@delchiaro

delchiaro commented Feb 23, 2018

Hello,
I'm using this library to get textual descriptions for classes in the CUB 2011 dataset.

For each of the 200 bird classes in the CUB dataset, I fetch the corresponding Wikipedia page and look at the sections via the page.sections property.
In some cases I get HTML markup inside the section titles, for example:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
page = wikipedia.page('Pied billed Grebe')
print(page.sections)

output:
[u'Taxonomy and name', u'Subspecies<sup>&#91;8&#93;</sup>', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']

Then, if I use the page.section(str) method with the string u'Subspecies<sup>&#91;8&#93;</sup>':

print(page.section(page.sections[1]))

output: None

The correct string to find the section with the page.section(str) method is simply 'Subspecies'.

I managed to fix this issue by implementing this method:

import re

def fixed_sections(page_content, verbose=False):
    sections = []
    # Match section headings of the form '\n== Title ==\n' (any heading level)
    section_regexp = r'\n==* .* ==*\n'
    # re.findall always returns a list (possibly empty), so no None check is needed
    for obj in re.findall(section_regexp, page_content):
        obj = obj.lstrip('\n= ').rstrip(' =\n')
        sections.append(obj)
        if verbose:
            print("Found section: {}".format(obj))
    return sections

correct_sections = fixed_sections(page.content)
print(correct_sections)
print(page.section(correct_sections[1]))

With this code I get the correct output, i.e. the content of the section (sub-section in this case):

[u'Taxonomy and name', u'Subspecies', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
P. p. podiceps, (Linnaeus, 1758), North America to Panama & Cuba.
P. p. antillarum, (Bangs, 1913), Greater & Lesser Antilles.
P. p. antarcticus, (Lesson, 1842), South America to central Chile & Argentina.

This fix works for me, but it requires running a regular expression over each page's content, so it may not be optimal.
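A lighter alternative (just a sketch, not the library's actual fix) could be to strip the markup from the titles that page.sections already returns, instead of re-parsing page.content; the clean_title helper below is hypothetical:

```python
import re

def clean_title(title):
    # Hypothetical helper: drop <sup>...</sup> reference markers
    # (e.g. '<sup>&#91;8&#93;</sup>', i.e. '[8]') and any remaining tags.
    title = re.sub(r'<sup>.*?</sup>', '', title)
    return re.sub(r'<[^>]+>', '', title).strip()

print(clean_title(u'Subspecies<sup>&#91;8&#93;</sup>'))  # -> Subspecies
```

This avoids scanning the whole page content, but it would silently drop any reference marker text inside the title.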

@barrust
Owner

barrust commented Feb 23, 2018

Thank you for your interest. I noticed something like this long ago but forgot to get back to it. Since sections are only parsed on demand, I am not opposed to using regex. If you want to submit a PR to fix the section title parsing, I would love to review it!

@barrust
Owner

barrust commented Feb 23, 2018

I think two or three tests will fail once this is changed. If you submit a PR and they are failing, I can help fix them!

@barrust
Owner

barrust commented Mar 8, 2018

@nagash91 I had some time this evening, so I incorporated your change into the 0.3.17 branch. Thank you for the code for this change! I will likely merge it into the main branch in a day or so and then push an updated version to PyPI.

@barrust barrust mentioned this issue Mar 9, 2018
barrust added a commit that referenced this issue Mar 9, 2018
* Add fix to use the `query-continue` parameter to continue to pull category members [issue #39](#39)
* Better handle large categorymember selections
* Add better handling of exception attributes including adding them to the documentation
* Correct the pulling of the section titles without additional markup [#42](#42)
* Handle memoization of unicode parameters in python 2.7
* ***Change default timeout*** for HTTP requests to 15 seconds
@barrust
Owner

barrust commented Mar 9, 2018

This has been published in version 0.4.0; please let me know if you encounter further issues!

@barrust barrust closed this as completed Mar 9, 2018
@delchiaro
Author

@barrust I tried your latest version and the bug is fixed.
Thank you, and sorry for not submitting the fix myself; I have been really busy lately.

@barrust
Owner

barrust commented Mar 20, 2018

No problem! Glad it worked and thank you for reporting and providing the solution!

2 participants