Querying more than (500/5000) results in an automated way? #39

Closed
lovelaced opened this issue Feb 14, 2018 · 6 comments

@lovelaced

Hi,

I'm attempting to get the results of a query (categorymembers) where the number of results is more than 500. I'm cool with doing multiple queries, but is there a way to continue where I left off? I know the "cmcontinue" param is available in raw_res in the categorymembers function, but I'm not sure whether I can use it directly to get the results I want, or if I'm missing something.

For example, let's say I want to use the user default max (500) to get a list of all the pages that exist in a category, but there are 8000 pages in the category. Is it possible to loop a query to get all the pages?

@barrust
Owner

barrust commented Feb 15, 2018

Can you provide me with the category you are looking to pull from? The category tree does get all records but this could be something added in.

@lovelaced
Author

I'm looking at this:
http://practicalplants.org/wiki/Category:Plant

Thanks, I gave categorytree a try as well, but strangely it only gives me a short list of pages (the entire first page and halfway through the second page, up to Anemone altaica, which is 500 entries).

{'Plant': {'depth': 0, 'sub-categories': {}, 'links': ['Abelia triflora', ... , 'Anemone altaica'], 'parent-categories': []}}

The categorymembers query returns 500 entries as well if I set "results" to anything higher than 500. I'm calling it like this:

all_plant_names = plantwiki.categorymembers("Plant", results=8000, subcategories=False)

Forgive me if I'm overlooking something obvious. If there's something that needs changing here I'm also happy to contribute.
Thanks so much for your help!

@barrust
Owner

barrust commented Feb 16, 2018

You should be able to get all the category members using the following:

from mediawiki import MediaWiki

wiki = MediaWiki('http://practicalplants.org/w/api.php')
all_plant_names = wiki.categorymembers('Plant', results=None, subcategories=False)

After a bit of poking around, it seems that pymediawiki looks for `continue` in the response, whereas the API you are hitting sends back `query-continue`. I will need to look into it some more, but if you replace `continue` with `query-continue` in the following check, it should work (~ line 590):

if 'continue' not in raw_res or last_cont == raw_res['continue']:
    break

It may take me some time to get this working and to see if this is a long-term change or if the query-continue is an older construct.
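
To make the difference between the two response styles concrete, here is a minimal sketch of a continuation loop that accepts either `continue` or `query-continue`. It is not the pymediawiki implementation; the names `next_continue_params`, `iter_category_members`, and the `fetch` callable are hypothetical, and the response shapes in the comments are assumptions about what the two API generations return.

```python
from typing import Callable, Dict, Iterator, Optional

Params = Dict[str, str]


def next_continue_params(raw_res: dict, list_name: str = "categorymembers") -> Optional[Params]:
    """Return the parameters for the next batch, or None when there is none."""
    # Assumed newer style: {"continue": {"cmcontinue": "...", "continue": "-||"}}
    if "continue" in raw_res:
        return {key: str(value) for key, value in raw_res["continue"].items()}
    # Assumed older style: {"query-continue": {"categorymembers": {"cmcontinue": "..."}}}
    legacy = raw_res.get("query-continue", {}).get(list_name)
    if legacy:
        return {key: str(value) for key, value in legacy.items()}
    return None


def iter_category_members(
    fetch: Callable[[Params], dict], category: str, batch: int = 500
) -> Iterator[str]:
    """Yield every page title in a category, following continuation tokens.

    `fetch` is any callable that sends the parameters to api.php and
    returns the decoded JSON response as a dict.
    """
    base = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": str(batch),
        "format": "json",
    }
    cont: Params = {}
    while True:
        raw_res = fetch({**base, **cont})
        for member in raw_res.get("query", {}).get("categorymembers", []):
            yield member["title"]
        new_cont = next_continue_params(raw_res)
        # Stop when there is no token, or the server repeats the last one.
        if new_cont is None or new_cont == cont:
            break
        cont = new_cont
```

With the `requests` package, `fetch` could be something like `lambda p: requests.get('http://practicalplants.org/w/api.php', params=p).json()`. Passing the fetcher in as an argument keeps the loop easy to exercise against canned responses.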

@barrust
Owner

barrust commented Feb 18, 2018

@lovelaced I seem to have tracked the issue down to a difference between actions and props and how they handle the continue parameter. I have a fix in the works, with a test to ensure that it is resolved. I hope to push it to PyPI in the coming day. If you would like to try the fix, it is currently in the hotfix/query-continue branch.

With this, if you set the number of results to return to None, it will pull back all the category members.

@barrust
Owner

barrust commented Feb 18, 2018

I pushed the change to PyPI as version 0.3.17; if you could upgrade pymediawiki and test again, that would be great! I actually used this category in the test suite.
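
For anyone checking whether an installed copy already includes this release, a small sketch along these lines may help. It relies only on the standard library's `importlib.metadata` (Python 3.8+); the helper names `parse_version` and `has_query_continue_fix` are made up for illustration, and the simple numeric comparison assumes plain dotted versions such as 0.3.17.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a plain dotted version like '0.3.17' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def has_query_continue_fix(installed, minimum="0.3.17"):
    """True if the installed version is at least the release named above."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("pymediawiki")
    print(installed, has_query_continue_fix(installed))
except PackageNotFoundError:
    print("pymediawiki is not installed")
```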

@barrust
Owner

barrust commented Feb 25, 2018

@lovelaced I am going to close this issue. If something is still not working, please reopen and let me know!

barrust closed this as completed Feb 25, 2018
barrust mentioned this issue Mar 9, 2018
barrust added a commit that referenced this issue Mar 9, 2018
* Add fix to use the `query-continue` parameter to continue to pull category members [issue #39](#39)
* Better handle large categorymember selections
* Add better handling of exception attributes including adding them to the documentation
* Correct the pulling of the section titles without additional markup [#42](#42)
* Handle memoization of unicode parameters in python 2.7
* ***Change default timeout*** for HTTP requests to 15 seconds