Method `page.sections` returns HTML markup in some cases #42
Comments
Thank you for your interest. I noticed something like this long ago but forgot to get back to it. As sections are only used on demand, I am not opposed to using regex. If you want to submit a PR to fix the section title parsing, I would love to review it!
I think two or three tests will fail once this is changed. If you submit a PR and they are failing, I can help fix them!
@nagash91 I had some time this evening, so I incorporated your change into the 0.3.17 branch. Thank you for the code to make this change! I will likely merge this into the main branch in a day or so and then push an updated version to PyPI.
* Add fix to use the `query-continue` parameter to continue to pull category members [issue #39](#39)
* Better handle large categorymember selections
* Add better handling of exception attributes including adding them to the documentation
* Correct the pulling of the section titles without additional markup [#42](#42)
* Handle memoization of unicode parameters in python 2.7
* ***Change default timeout*** for HTTP requests to 15 seconds
This has been published in version 0.4.0; please let me know if you encounter further issues!
@barrust I tried your latest version and the bug is fixed.
No problem! Glad it worked, and thank you for reporting and providing the solution!
Hello,
I'm using this library to get textual descriptions for classes in the CUB 2011 dataset.
For each of the 200 bird classes in the CUB dataset, I get the corresponding Wikipedia page and look at its sections with the `page.sections` property. In some cases I get HTML markup inside the section titles, for example:
output:
[u'Taxonomy and name', u'Subspecies<sup>[8]</sup>', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
Then, if I use the `page.section(str)` method with the string `u'Subspecies<sup>[8]</sup>'`:
output:
None
The correct string to find the section with the `page.section(str)` method is simply `'Subspecies'`.
I actually managed to fix this issue by implementing this method:
With this code I get the correct output, i.e. the content of the section (sub-section in this case):
This fix works for me, but it requires running a regular expression for each page, so it may not be optimal.
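For readers following along, here is a minimal sketch of the kind of regex cleanup described above. It is not the code from this report, nor necessarily what the library does internally: the helper names `clean_section_title` and `clean_section_titles` are hypothetical, and the sketch assumes that footnote markers arrive as `<sup>...</sup>` elements and that any other tags can simply be dropped while keeping their text.

```python
import re

# Assumption: footnote markers such as <sup>[8]</sup> should be removed
# together with their contents.
_SUP_RE = re.compile(r"<sup\b[^>]*>.*?</sup>", re.IGNORECASE | re.DOTALL)
# Any remaining tag is removed, but the text it wraps is kept.
_TAG_RE = re.compile(r"<[^>]+>")


def clean_section_title(title):
    """Return a section title with footnote markers and HTML tags removed."""
    without_refs = _SUP_RE.sub("", title)
    without_tags = _TAG_RE.sub("", without_refs)
    return without_tags.strip()


def clean_section_titles(titles):
    """Apply clean_section_title to every title in a list."""
    return [clean_section_title(title) for title in titles]


if __name__ == "__main__":
    raw = ["Taxonomy and name", "Subspecies<sup>[8]</sup>", "Description"]
    print(clean_section_titles(raw))
```

With titles cleaned this way, a lookup such as `page.section('Subspecies')` would use the plain title. The two patterns are compiled once at import time, so the per-page cost is one pass over each title.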