
Pagination #30

Closed
jennybc opened this issue Jan 8, 2015 · 39 comments

Comments

@jennybc
Contributor

jennybc commented Jan 8, 2015

Have you thought about pagination at all?

Requests that return multiple items will be paginated to 30 items by default.

https://developer.github.com/v3/#pagination

I'm not sure what the conventions are for API wrappers and assisting the user to fetch multiple or all pages. But the Ruby wrapper for the GitHub API talks about this in its README:

For smallish resource lists, Octokit provides auto pagination. When this is enabled, calls for paginated resources will fetch and concatenate the results from every page into a single array.

I'm thinking about this because I'm using get.my.repositories() and have realized that by default I'm getting only the first 30. Before I start playing around with explicit requests for specific pages, I'm wondering if you are contemplating adding any auto pagination to your package.

@cscheid
Owner

cscheid commented Jan 8, 2015

That's a good question.

This is not a particularly strong opinion, but all else being equal, I would prefer to let pagination be handled downstream. There are a few reasons for this:

  1. Right now rgithub has a very strong one-to-one equivalence between one call to rgithub and one HTTP request. This has the nice benefit that call rate is very predictable, and it's much less likely that you'll hit rate limits by accident.
  2. It's easy to increase page size. I think per_page works uniformly across GitHub's API, which means that (because of 1) per_page behaves uniformly in rgithub as well.
  3. It's more work :)

I would have assumed that GitHub itself wouldn't be a fan of auto-pagination, because it will hit their servers more often than necessary, but octokit is written by GitHubbers, so I'm clearly wrong there.

So let's look at it from the other side, and assume we want to implement auto-pagination. What would it look like in the API?

Option 1) We could make it a named parameter that is interpreted in a special way. This would mean that code which calls rgithub functions that return paginated results would change very little.

At the same time, I like that the base implementation of rgithub is very uniform: I don't want to have to add anything new to github.R: a general .api.request seems to be the right way to do it.

So the solution is to find all the HTTP entry points that return paginated results, and write a wrapper. Something like .paginated.api.get.request, and then have get.my.repositories, etc. all call .paginated.api.get.request instead of .api.get.request.
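For concreteness, such a wrapper might look roughly like this. This is a sketch only: the name .paginated.api.get.request comes from the paragraph above, the exact signature of .api.get.request is assumed, and stopping on an empty page is just one possible convention.

```r
# Hypothetical wrapper: call .api.get.request repeatedly with an
# increasing page number until a page comes back empty, then splice
# the per-page contents together into a single result.
.paginated.api.get.request <- function(ctx, path, params = list()) {
  pages <- list()
  page <- 1
  repeat {
    params$page <- page
    req <- .api.get.request(ctx, path, params = params)
    if (length(req$content) == 0) break
    pages[[page]] <- req$content
    page <- page + 1
  }
  req$content <- unlist(pages, recursive = FALSE)
  req
}
```

Functions like get.my.repositories would then delegate to this wrapper instead of .api.get.request, leaving the rest of the package untouched.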

Option 2) For all entry points that return paginated results, create a new entry point that does auto-pagination. So you'd have get.my.repositories and get.my.autopaginated.repositories (bear with my terrible naming skills for now).

Which would you prefer?

@jennybc
Contributor Author

jennybc commented Jan 8, 2015

This is not a proper response, but I'm adding another link, where the GitHubbers really dig into the details of Traversing with pagination. This seems relevant to the discussion regardless.

@jennybc
Contributor Author

jennybc commented Jan 8, 2015

I haven't done enough of this manually yet to say how deeply I want this integrated into rgithub itself. So why don't I do that first? I appreciate the discussion. Even if you go with your instinct to keep this out of your package, maybe we could collaborate on a vignette that shows some idioms for, e.g., traversing all pages.

@jennybc jennybc mentioned this issue Jan 8, 2015
@cscheid
Owner

cscheid commented Jan 9, 2015

Your mention of the idiom made me think of a general “pagination wrapper” call which uses “computing-on-the-language” (to use Hadley’s parlance) to do the right thing. Something like this:

get.my.repositories() is the nonpaginated version, but

auto.page(get.my.repositories()) actually does the right thing for paginating the inner call.

I think that should be doable, looks elegant, and wouldn’t require changing the base code at all.

What do you think?

@jennybc
Contributor Author

jennybc commented Jan 20, 2015

I think that is a great idea! I will try to turn my evolving "manual" approach into that. I have already written helpers to parse the link headers, which will be generally useful for the auto.page() function you propose.

@aronlindberg
Contributor

Maybe this is a simple question, but I can't figure it out. How do I implement a "manual" approach? I'm trying to get a list of all files changed by a specific pull request:

pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context())

However, it seems to stop at 30 entries/files or so. I've also tried, e.g.:

pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), page = 3, per_page = 30)

but that gives: Error in get.pull.request.files(owner = "rails", repo = "rails", id = 572, : unused arguments (page = 3, per_page = 30)

How can I access them all?

@jennybc
Contributor Author

jennybc commented Feb 6, 2015

You will have to do all of this "manually". I have written some functions to facilitate this but this is all in my own local fork of this package and is very rough. Here's the sketch of what I do. In my case, I was trying to list repositories, but I would imagine the same thing works for pull requests.

First, max out the number of repos per page, i.e. set it to 100, and see if that gets everything.

repos  <- get.organization.repositories(org = "STAT545-UBC", per_page = 100)

In the repos result, see if the headers contain a link field:

repos$headers$link

If it does not, you have gotten everything! YAY! You're done.

If it does, you absolutely must traverse pages. Boo. We continue.

I get the total number of repositories by asking for all repositories, setting per_page to 1:

repos  <- get.organization.repositories(org = "STAT545-UBC", per_page = 1)

Then you must parse the information in the link field of the header. Here's where I've written functions to turn this into a tidy data.frame of links, one per row. You extract the page number of the link labelled "last" and you've learned how many repos there are.

You go back to asking for 100 repos per page. Determine how many pages the results will be broken into. Do this inside some iterative structure: request the repos with per_page set to 100, parse the link field of the headers to get the URL for the next page, and repeat until you're done. Then you also have to glue all the results together again manually, which was my motivation for this question I put out on Twitter.

In short, this is possible but a total pain in the butt.
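To make the iterative structure just described concrete, here is roughly what it could look like. This is illustrative only: it leans on the page and per_page parameters plus a crude check for a rel="next" entry in the link header, and the function name is made up, not part of rgithub.

```r
# Illustrative loop: request 100 repos at a time, keep going while the
# link header still advertises a "next" page, then glue the pages together.
fetch_all_org_repos <- function(org, per_page = 100) {
  pages <- list()
  page <- 1
  repeat {
    res <- get.organization.repositories(org = org,
                                         per_page = per_page, page = page)
    pages[[page]] <- res$content
    link <- res$headers$link
    if (is.null(link) || !grepl('rel="next"', link)) break
    page <- page + 1
  }
  unlist(pages, recursive = FALSE)  # reassemble one flat list of repos
}
```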

@aronlindberg
Contributor

I don't think the per_page argument works for all API calls, for example : pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), per_page = 1) returns the error: unused argument (per_page = 1).

@jennybc, would you mind pointing to where in your forked repo you have the workaround?

@jennybc
Contributor Author

jennybc commented Feb 6, 2015

@aronlindberg I haven't pushed that stuff to my fork on GitHub. It's in a truly embarrassing state.

If you look at the header of the example of listing pull requests via the API, you'll see there are indeed the usual "next" and "last" links:

https://developer.github.com/v3/pulls/#list-pull-requests

so obviously pagination happens, as of course you have already discovered.

Looking at the relevant function in this package, I see there is no way for the user to pass per_page:

rgithub/R/pulls.R

Lines 84 to 85 in 153bde1

get.pull.request.files <- function(owner, repo, id, ctx = get.github.context())
.api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"))

Contrast that to the function for requesting a list of organization repositories, which uses the argument:

rgithub/R/repositories.R

Lines 35 to 36 in 153bde1

get.organization.repositories <- function(org, ..., ctx = get.github.context())
.api.get.request(ctx, c("orgs", org, "repos"), params=list(...))

Best case, the definition of github::get.pull.request.files() just needs to be modified to pass extra parameters along correctly, and then the manual workflow outlined above is possible. Anecdotally, from my own usage, I think there are other functions that have the same problem.

@aronlindberg
Contributor

I tried adding params=list(...) to get.pull.request.files so that it reads:

get.pull.request.files <- function(owner, repo, id, ctx = get.github.context()) .api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"), params=list(...))

However, calling get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), page = 1) still generates unused argument (page = 1). I'm new at this - can anyone point me in the right direction for how to solve this? I looked at:

https://developer.github.com/v3/pulls/#list-pull-requests-files

and it looks like I should be able to pass the page parameter there...

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

Isn't it per_page?

@aronlindberg
Contributor

Neither works… Shouldn't I be able to pass both of them once the ... parameter is implemented correctly?

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

I'm going to try changing def'n of get.pull.request.files() and see if I succeed ….

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

OK that worked. I'll make a PR to @cscheid in a moment. Here you can prove it to yourself:

library(github)

# I store my access token in an environment variable
ctx <- create.github.context(access_token = Sys.getenv("GITHUB_TOKEN"))

## using rgithub "as is"
req <- get.pull.request.files(owner = "rails", repo = "rails", id = 572)
length(req$content) # by default, 30 items are retrieved
req$headers$link # I can see 30 items per page will imply we need 3 pages

## new definition of get.pull.request.files()
## I will make a PR to cscheid/rgithub momentarily
jfun <- function(owner, repo, id, ..., ctx = get.github.context())
  github:::.api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"),
                   params=list(...))

jreq <- jfun(owner = "rails", repo = "rails", id = 572, per_page = 3)
length(jreq$content) # yep, it's honoring my request for 3 per page
jreq$headers$link # at this rate, we'll need 23 pages

@aronlindberg
Contributor

Cool! =) So I was using:
ctx = interactive.login("client_id", "client_secret")

while you are using
ctx <- create.github.context(access_token = Sys.getenv("GITHUB_TOKEN"))

Is it then that the token allows you to pass params=list(...) but not client_id/client_secret?

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

No, I don't think this has anything to do with our different approaches to authentication.

When you redefined get.pull.request.files() above, the definition doesn't actually include the dots (...) in the formal arguments. So when you added params=list(...) to the call to .api.get.request(), R complains that it doesn't know what to do with additional arguments, such as per_page.

@aronlindberg
Contributor

Aaaah! I see, yes, it works with my authentication too! The dots were just sooooo small, so I missed them. =) Thanks!

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

@aronlindberg I've made a PR now (#39). This change is reflected in my fork, in case you're in a huge hurry. You could install from my fork to get the functionality and switch back once it's merged.

@aronlindberg
Contributor

@jennybc - thanks, I was able to implement the same change in my own fork. Thanks for having a dialogue about it; it helps me better understand how these functions are written. Now I can start trying to construct a manual pagination approach for the script I'm writing.

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

@aronlindberg Here's the function from my "private and embarrassing" unpushed branch that's a good start on digesting link headers for the depagination operation 😄

https://gist.github.com/jennybc/862a01dc9243118d83c9#file-digest_header_links-r

As you can see, I use some functions from plyr and dplyr, in addition to stringr, which this package already imports. This data.frame is helpful for determining how many pages there are and getting the URL for the next page.
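For readers who don't want to follow the gist, a minimal base-R version of the same idea might look like this. The function name, regexes, and column names below are my own illustration, not taken from the gist, and assume the usual numeric-pagination form of the header.

```r
# Parse a GitHub Link header such as
#   <https://api.github.com/...?page=2>; rel="next", <...?page=23>; rel="last"
# into a tidy data.frame with one link per row.
digest_link_header <- function(link) {
  pieces <- strsplit(link, ",\\s*")[[1]]
  data.frame(
    rel  = sub('^.*rel="([^"]+)".*$', "\\1", pieces),
    page = as.integer(sub("^.*[?&]page=([0-9]+).*$", "\\1", pieces)),
    url  = sub("^<([^>]+)>.*$", "\\1", pieces),
    stringsAsFactors = FALSE
  )
}
```

The row labelled "last" then gives the page count, and the row labelled "next" gives the URL to request next.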

@aronlindberg
Contributor

Thanks @jennybc that's helpful! It seems that since ... can be passed, when I run get.pull.request.files(owner = "django", repo = "django", id = i, ctx = get.github.context(), per_page=1000) I can get all the results that I want. Do you know if there is an absolute limit to the per_page argument?

@jennybc
Contributor Author

jennybc commented Feb 13, 2015

I thought it was 100. Are you actually getting 1000 items in one request?

@aronlindberg
Contributor

Whoops. I think it's pretty clear that the limit is 100: https://developer.github.com/guides/traversing-with-pagination/. I'll have to start digging into your manual mechanism...

@aronlindberg
Contributor

A rather messy example (mine) of how to implement @jennybc 's function can be found here: https://gist.github.com/aronlindberg/2a9e9802579b2d239655

@jennybc
Contributor Author

jennybc commented Feb 13, 2015

Glad you are making progress @aronlindberg! I think that will work for your specific application (pull request files).

However, to build something into this package, you actually have to rely on the "next" links, rather than requesting specific pages via page = i. Here's the relevant excerpt from the official docs re pagination:

Keep in mind that you should always rely on these link relations provided to you. Don’t try to guess or construct your own URL. Some API calls, like listing commits on a repository, use pagination results that are based on SHA values, not numbers.

So, using a while loop and creeping from one page to the next looks virtually unavoidable. In the long-term and in the general case, @cscheid's inclination to create a general pagination wrapper seems promising.
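Here is a hedged sketch of what creeping along the "next" links could look like. Since rgithub doesn't expose a way to GET an arbitrary URL, this illustration drops down to httr directly; the function name and token handling are assumptions for the example, not rgithub API.

```r
library(httr)

# Follow rel="next" links until there are none left, as the GitHub docs
# advise; this also works for SHA-based pagination, where page numbers
# can't be constructed by hand.
fetch_all_pages <- function(url, token) {
  items <- list()
  while (!is.null(url)) {
    resp <- GET(url, add_headers(Authorization = paste("token", token)))
    stop_for_status(resp)
    items <- c(items, content(resp))
    link <- headers(resp)$link
    url <- if (!is.null(link) && grepl('rel="next"', link)) {
      sub('.*<([^>]+)>; rel="next".*', "\\1", link)
    } else {
      NULL
    }
  }
  items
}
```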

@hadley

hadley commented Oct 22, 2015

Just skimming, but I don't think you should need computing on the language to do this - if the response object was a bit richer, I think you should be able to do autopagination by inspecting the request metadata.

@jennybc
Contributor Author

jennybc commented Oct 22, 2015

I am now enjoying the page traversal @gaborcsardi put into gh.

https://github.com/gaborcsardi/gh/blob/master/R/pagination.R

@cscheid
Owner

cscheid commented Oct 22, 2015

Oh wait, any reason we're using rgithub instead of jumping ship to gh?

EDIT: That one is a little too low-level for my taste. So we'll live with rgithub and its crumminess for now.

@hadley

hadley commented Oct 22, 2015

Maybe rgithub could rely on gh for the lowlevel stuff?

@realAkhmed

Just wanted to chime in. I did create a simple paginator for rgithub along the lines @hadley seems to be describing (one that inspects the request results, given a rich-enough response object). I intended to submit it as a pull request, but given that this discussion has already started, it may make sense to introduce it right now.

I've used it a lot over the last few months and it seems to work reliably.

It could be used as follows:

# without paginator
first_30_forks <- get.repository.forks(owner_name, repo_name)

# with paginator
all_forks <- auto.page( get.repository.forks(owner_name, repo_name) )

Here is the code of that future pull request (since it is short, I am including it here as a comment as well):

# Page automatically until the response comes back empty
auto.page <- function(f) {
  f_call <- substitute(f)
  stopifnot(is.call(f_call))

  i <- 1
  req <- list()
  result_lst <- list()

  repeat {
    # Specify the page to download
    f_call$page <- i

    req <- eval(f_call, parent.frame())

    # The last page has empty content
    if (length(req$content) == 0) break

    result_lst[[i]] <- req$content
    i <- i + 1
  }

  result_req <- req
  result_req$content <- unlist(result_lst, recursive = FALSE)

  result_req
}

@cscheid
Owner

cscheid commented Oct 22, 2015

@akhmed1, that looks great! @jennybc, @hadley: does that look sufficiently in style? I'm always wary of substitute, eval, and parent.frame, so I'd prefer to get some expert reviewers to go over it :)

@realAkhmed

@cscheid great! Would love to hear the feedback! One word of caution:

  • some existing API functions in rgithub do not allow the page argument to be passed to GitHub, while other functions do.

For instance, get.organization.teams is one example that does not. It is currently defined as:

get.organization.teams <- function(org, ctx = get.github.context())
  .api.get.request(ctx, c("orgs", org, "teams"))

However, auto.page needs the ability to pass the page argument to the API (otherwise pagination can't work). The good news: this is a trivial change to the current function declarations, such as:

get.organization.teams2 <- function(org, ctx = get.github.context(), ...)
  .api.get.request(ctx, c("orgs", org, "teams"), params = list(...))

The latter would work perfectly with auto.page.

@cscheid
Owner

cscheid commented Oct 22, 2015

@akhmed1, that's issue #40 :)

@jennybc
Contributor Author

jennybc commented Oct 22, 2015

One observation from the pagination workaround I was using with this package:

Once you deal with the #40 problem, you can make an initial request with per_page = 1. Then dig the total number of objects out of the link header for the "last" page. This way there's no mystery about how many things you can get. You can pick your own page size and make the real request(s) w/o having to grow objects or test for empty content. This may not really matter, but I liked it better this way.
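A sketch of that probe-then-fetch idea (the regex and variable names are mine, and it assumes the numeric-pagination case):

```r
# Probe with per_page = 1: the "last" page number then equals the total
# number of items, so the real requests can be sized up front.
probe <- get.organization.repositories(org = "STAT545-UBC", per_page = 1)
last  <- sub('.*[?&]page=([0-9]+)[^>]*>; rel="last".*', "\\1",
             probe$headers$link)
n_items  <- as.integer(last)
per_page <- 100
n_pages  <- ceiling(n_items / per_page)
# now fetch pages 1..n_pages with per_page = 100 and combine the results
```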

@realAkhmed

@jennybc Your points are well-taken! Unfortunately, that approach did not work for me (we are using GitHub Enterprise): the last and next page links were not reliable; sometimes empty strings were returned when other pages did exist. Not sure if GitHub.com has the same issues, though.

@jennybc
Contributor Author

jennybc commented Oct 23, 2015

That's weird! No, I haven't had that problem on GitHub.com.

In Traversing with Pagination they try to scare us away from asking for specific pages by number:

Keep in mind that you should always rely on these link relations provided to you. Don’t try to guess or construct your own URL. Some API calls, like listing commits on a repository, use pagination results that are based on SHA values, not numbers.

but ... I've never accessed an exotic endpoint w/ non-numeric pagination. Sounds like you have something that works 😃.

@krlmlr krlmlr mentioned this issue Mar 31, 2016
@Kinzowa

Kinzowa commented Oct 29, 2017

Hello, is it just me, or does pagination not work with GET /repos/:owner/:repo/stats/contributors? It looks like results are limited to 100 even on the repository graph page, and the argument page=2 doesn't return the expected result.

https://developer.github.com/v3/repos/statistics/

@cscheid
Owner

cscheid commented Oct 29, 2017

Yes, this is an issue. As you can see, I've been unable to work on this for a long time now. I would really welcome a pull request if you have one. Sorry about that.

@cscheid
Owner

cscheid commented Jun 5, 2018

Closed (wontfix) by #70.

@cscheid cscheid closed this as completed Jun 5, 2018