
Method page.sections returns HTML markup in some cases #42

Closed
delchiaro opened this issue Feb 23, 2018 · 6 comments

Comments

@delchiaro

delchiaro commented Feb 23, 2018

Hello,
I'm using this library to get textual descriptions for classes in the CUB 2011 dataset.

For each of the 200 bird classes in the CUB dataset, I fetch the corresponding Wikipedia page and look at the sections via the page.sections property.
In some cases I get HTML markup inside the section titles, for example:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
page = wikipedia.page('Pied billed Grebe')
print(page.sections)

output:
[u'Taxonomy and name', u'Subspecies<sup>&#91;8&#93;</sup>', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']

Then, if I use the page.section(str) method with the string u'Subspecies<sup>&#91;8&#93;</sup>':

print(page.section(page.sections[1]))

output: None

The correct string to find the section with the page.section(str) method is simply 'Subspecies'.

I managed to fix this issue by implementing this method:

import re

def fixed_sections(page_content, verbose=False):
    sections = []
    # Match section headings of the form '\n== Title ==\n' (any heading level)
    section_regexp = r'\n==* .* ==*\n'
    # re.findall always returns a list (possibly empty), so no None check is needed
    for obj in re.findall(section_regexp, page_content):
        obj = obj.lstrip('\n= ').rstrip(' =\n')
        sections.append(obj)
        if verbose:
            print("Found section: {}".format(obj))
    return sections

correct_sections = fixed_sections(page.content)
print(correct_sections)
print(page.section(correct_sections[1]))

With this code I get the correct output, i.e. the content of the section (sub-section in this case):

[u'Taxonomy and name', u'Subspecies', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
P. p. podiceps, (Linnaeus, 1758), North America to Panama & Cuba.
P. p. antillarum, (Bangs, 1913), Greater & Lesser Antilles.
P. p. antarcticus, (Lesson, 1842), South America to central Chile & Argentina.

This fix works for me, but it requires running a regular expression over each page's content, so it may not be optimal.
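A lighter alternative (just a sketch, not the library's actual fix) could be to strip the markup from the titles that page.sections already returns, instead of re-parsing page.content; the clean_title helper below is hypothetical:

```python
import re

def clean_title(title):
    # Hypothetical helper: drop <sup>...</sup> reference markers
    # (e.g. '<sup>&#91;8&#93;</sup>', i.e. '[8]') and any remaining tags.
    title = re.sub(r'<sup>.*?</sup>', '', title)
    return re.sub(r'<[^>]+>', '', title).strip()

print(clean_title(u'Subspecies<sup>&#91;8&#93;</sup>'))  # -> Subspecies
```

This avoids scanning the whole page content, but it would silently drop any reference marker text inside the title.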

@barrust
Owner

barrust commented Feb 23, 2018

Thank you for your interest. I noticed something like this long ago but forgot to get back to it. Since sections are only parsed on demand, I am not opposed to using regex. If you want to submit a PR to fix the section title parsing, I would love to review it!

@barrust
Owner

barrust commented Feb 23, 2018

I think two or three tests will fail once this is changed. If you submit a PR and they are failing, I can help fix them!

@barrust
Owner

barrust commented Mar 8, 2018

@nagash91 I had some time this evening, so I incorporated your change into the 0.3.17 branch. Thank you for the code for this change! I will likely merge it into the main branch in a day or so and then push an updated version to PyPI.

@barrust barrust mentioned this issue Mar 9, 2018
barrust added a commit that referenced this issue Mar 9, 2018
* Add fix to use the `query-continue` parameter to continue to pull category members [issue #39](#39)
* Better handle large categorymember selections
* Add better handling of exception attributes including adding them to the documentation
* Correct the pulling of the section titles without additional markup [#42](#42)
* Handle memoization of unicode parameters in python 2.7
* ***Change default timeout*** for HTTP requests to 15 seconds
@barrust
Owner

barrust commented Mar 9, 2018

This has been published in version 0.4.0; please let me know if you encounter further issues!

@barrust barrust closed this as completed Mar 9, 2018
@delchiaro
Author

@barrust I tried your latest version and the bug is fixed.
Thank you, and sorry for not submitting the fix myself; I have been really busy lately.

@barrust
Owner

barrust commented Mar 20, 2018

No problem! Glad it worked and thank you for reporting and providing the solution!

2 participants