Querying more than (500/5000) results in an automated way? #39

lovelaced · 2018-02-14T15:46:59Z

Hi,

I'm attempting to get the results of a query (categorymembers) where the number of results is more than 500. I'm cool with doing multiple queries, but is there a way to continue where I left off? I know the "cmcontinue" param is available in raw_res in the categorymembers function, but I'm not sure if I can leverage it directly to get the results I want, or if I'm missing something.

For example, let's say I want to use the default user max (500) to get a list of all the pages that exist in a category, but there are 8000 pages in the category. Is it possible to loop a query to get all the pages?

barrust · 2018-02-15T22:08:33Z

Can you provide me with the category you are looking to pull from? The category tree does get all records but this could be something added in.

lovelaced · 2018-02-16T12:24:55Z

I'm looking at this:
http://practicalplants.org/wiki/Category:Plant

Thanks, I gave categorytree a try as well but it only gives me a short list of pages, strangely (the entire first page and halfway through the second page, up to Anemone altaica, which is 500 entries).

{'Plant': {'depth': 0, 'sub-categories': {}, 'links': ['Abelia triflora', ... , 'Anemone altaica'], 'parent-categories': []}}

The categorymembers query returns 500 entries as well if I set "results" to anything higher than 500. I'm calling it like this:

all_plant_names = plantwiki.categorymembers("Plant", results=8000, subcategories=False)

Forgive me if I'm overlooking something obvious. If there's something that needs changing here I'm also happy to contribute.
Thanks so much for your help!

barrust · 2018-02-16T14:57:11Z

You should be able to get all the category members using the following:

from mediawiki import MediaWiki

wiki = MediaWiki('http://practicalplants.org/w/api.php')
all_plant_names = wiki.categorymembers('Plant', results=None, subcategories=False)

After a bit of poking around, it seems that pymediawiki looks for `continue` in the API response, whereas the API you are hitting sends back `query-continue`. I will need to look into it some more, but if you replace `continue` with `query-continue` in the following check (around line 590), it should work:

if 'continue' not in raw_res or last_cont == raw_res['continue']:
    break

It may take me some time to get this working and to see if this is a long-term change or if the query-continue is an older construct.
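For reference, the difference between the two response styles can be sketched as a small continuation loop against the raw MediaWiki API. This is only an illustration, not pymediawiki's actual code: the helper names `next_continue_params` and `all_category_members` and the injected `fetch` callable are hypothetical. It assumes the modern response carries a top-level `continue` object, while the older style nests the token under `query-continue` keyed by the list name.

```python
def next_continue_params(raw_res, list_name="categorymembers"):
    """Return the parameters to add to the next request, or None when done.

    Hypothetical helper: accepts both the modern `continue` object and the
    older `query-continue` object, which is keyed by module name.
    """
    if "continue" in raw_res:
        return dict(raw_res["continue"])
    if "query-continue" in raw_res:
        return dict(raw_res["query-continue"].get(list_name, {}))
    return None


def all_category_members(fetch, category, limit=500):
    """Collect every page title in a category, following continuation tokens.

    `fetch` is an assumed callable that takes a dict of query parameters
    and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": limit,
        "format": "json",
    }
    titles = []
    cont = {}
    while True:
        raw_res = fetch({**params, **cont})
        titles.extend(m["title"] for m in raw_res["query"]["categorymembers"])
        new_cont = next_continue_params(raw_res)
        # Stop when there is no token, or the token did not advance.
        if not new_cont or new_cont == cont:
            break
        cont = new_cont
    return titles
```

With the `requests` library, `fetch` could be something like `lambda p: requests.get(api_url, params=p, timeout=15).json()`, where `api_url` is the wiki's `api.php` endpoint. Passing the fetcher in keeps the loop easy to test without network access.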

barrust · 2018-02-18T03:31:57Z

@lovelaced I seem to have tracked down the issue to a difference between actions and props and how they handle the continue parameter. I have a fix in the works with a test to ensure that it is resolved. I hope to push it to PyPI in the coming day. If you would like to try the fix, it is currently in the hotfix/query-continue branch.

With this, if you set the number of results to return to None, it will pull back all the category members.

barrust · 2018-02-18T21:40:27Z

I pushed the change to PyPI as version 0.3.17; if you could upgrade pymediawiki and test again, that would be great! I actually used this category in the test suite.

barrust · 2018-02-25T15:50:22Z

@lovelaced I am going to close this issue. If something is still not working, please reopen and let me know!

* Add fix to use the `query-continue` parameter to continue to pull category members [issue #39](#39)
* Better handle large categorymember selections
* Add better handling of exception attributes including adding them to the documentation
* Correct the pulling of the section titles without additional markup [#42](#42)
* Handle memoization of unicode parameters in python 2.7
* ***Change default timeout*** for HTTP requests to 15 seconds

barrust closed this as completed Feb 25, 2018

barrust mentioned this issue Mar 9, 2018

0.4.0 #43

Merged
