
Pagination #30

Closed
jennybc opened this issue Jan 8, 2015 · 39 comments

Comments

@jennybc
Contributor

jennybc commented Jan 8, 2015

Have you thought about pagination at all?

Requests that return multiple items will be paginated to 30 items by default.

https://developer.github.com/v3/#pagination

I'm not sure what the conventions are for API wrappers and assisting the user to fetch multiple or all pages. But the Ruby wrapper for the GitHub API talks about this in its README:

For smallish resource lists, Octokit provides auto pagination. When this is enabled, calls for paginated resources will fetch and concatenate the results from every page into a single array.

I'm thinking about this because I'm using get.my.repositories() and have realized that by default I'm getting only the first 30. Before I start playing around with explicit requests for specific pages, I'm wondering if you are contemplating adding any auto pagination to your package.

@cscheid
Owner

cscheid commented Jan 8, 2015

That's a good question.

This is not a particularly strong opinion, but all else being equal, I would prefer to let pagination be handled downstream. There are a few reasons for this:

  1. Right now rgithub has a very strong one-to-one equivalence between one call to rgithub and one HTTP request. This has the nice benefit that call rate is very predictable, and it's much less likely that you'll hit rate limits by accident.
  2. It's easy to increase page size. I think per_page works uniformly across GitHub's API, which means that (because of 1) per_page behaves uniformly in rgithub as well.
  3. It's more work :)

I would have assumed that GitHub itself wouldn't be a fan of auto-pagination, because it will hit their servers more often than necessary, but octokit is written by GitHubbers, so I'm clearly wrong there.

So let's look at it from the other side, and assume we want to implement auto-pagination. What would it look like in the API?

Option 1) We could make it a named parameter that is interpreted in a special way. This would mean that code which calls rgithub functions that return paginated results would change very little.

At the same time, I like that the base implementation of rgithub is very uniform: I don't want to have to add anything new to github.R: a general .api.request seems to be the right way to do it.

So the solution is to find all the HTTP entry points that return paginated results, and write a wrapper. Something like .paginated.api.get.request, and then have get.my.repositories, etc. all call .paginated.api.get.request instead of .api.get.request.
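For concreteness, such a wrapper might look roughly like this. This is a sketch only: the name .paginated.api.get.request comes from the paragraph above, the exact signature of .api.get.request is assumed, and stopping on an empty page is just one possible convention.

```r
# Hypothetical wrapper: call .api.get.request repeatedly with an
# increasing page number until a page comes back empty, then splice
# the per-page contents together into a single result.
.paginated.api.get.request <- function(ctx, path, params = list()) {
  pages <- list()
  page <- 1
  repeat {
    params$page <- page
    req <- .api.get.request(ctx, path, params = params)
    if (length(req$content) == 0) break
    pages[[page]] <- req$content
    page <- page + 1
  }
  req$content <- unlist(pages, recursive = FALSE)
  req
}
```

Functions like get.my.repositories would then delegate to this wrapper instead of .api.get.request, leaving the rest of the package untouched.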

Option 2) For all entry points that return paginated results, create a new entry point that does auto-pagination. So you'd have get.my.repositories and get.my.autopaginated.repositories (bear with my terrible naming skills for now).

Which would you prefer?

@jennybc
Contributor Author

jennybc commented Jan 8, 2015

This is not a proper response, but I'm adding another link, where the GitHubbers really dig into the details of Traversing with pagination. This seems relevant to the discussion regardless.

@jennybc
Contributor Author

jennybc commented Jan 8, 2015

I haven't done enough of this manually yet to say how deeply I want this integrated into rgithub itself. So why don't I do that first? I appreciate the discussion. Even if you go with your instinct to keep this out of your package, maybe we could collaborate on a vignette that shows some idioms for, e.g., traversing all pages.

@jennybc jennybc mentioned this issue Jan 8, 2015
@cscheid
Owner

cscheid commented Jan 9, 2015

Your mention of the idiom made me think of a general “pagination wrapper” call which uses “computing-on-the-language” (to use Hadley’s parlance) to do the right thing. Something like this:

get.my.repositories() is the nonpaginated version, but

auto.page(get.my.repositories()) actually does the right thing for paginating the inner call.

I think that should be doable, looks elegant, and wouldn’t require changing the base code at all.

What do you think?

@jennybc
Contributor Author

jennybc commented Jan 20, 2015

I think that is a great idea! I will try to turn my evolving "manual" approach into that. I have already written helpers to parse the link headers, which will be generally useful for the auto.page() function you propose.

@aronlindberg
Contributor

Maybe this is a simple question, but I can't figure it out. How do I implement a "manual" approach? I'm trying to get a list of all files changed by a specific pull request:

pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context())

However, it seems to stop at 30 entries/files or so. I've also tried, e.g.:

pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), page = 3, per_page = 30)

but that gives: Error in get.pull.request.files(owner = "rails", repo = "rails", id = 572, : unused arguments (page = 3, per_page = 30)

How can I access them all?

@jennybc
Contributor Author

jennybc commented Feb 6, 2015

You will have to do all of this "manually". I have written some functions to facilitate this but this is all in my own local fork of this package and is very rough. Here's the sketch of what I do. In my case, I was trying to list repositories, but I would imagine the same thing works for pull requests.

First, max out the number of repos per page, i.e. set it to 100, and see if that gets everything.

repos  <- get.organization.repositories(org = "STAT545-UBC", per_page = 100)

In the repos result, see if the headers contain a link field:

repos$headers$link

If it does not, you have gotten everything! YAY! You're done.

If it does, you absolutely must traverse pages. Boo. We continue.

I get the total number of repositories by asking for all repositories, setting per_page to 1:

repos  <- get.organization.repositories(org = "STAT545-UBC", per_page = 1)

Then you must parse the information in the link field of the header. Here's where I've written functions to turn this into a tidy data.frame of links, one per row. You extract the page number of the link labelled "last" and you've learned how many repos there are.

You go back to asking for 100 repos per page. Determine how many pages the results will be broken into. Do this inside some iterative structure: request the repos with per_page set to 100, parse the link field of the headers to get the URL for the next page, and repeat until you're done. Then you also have to glue all the results together again manually, which was my motivation for this question I put out on Twitter.

In short, this is possible but a total pain in the butt.
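To make the iterative structure just described concrete, here is roughly what it could look like. This is illustrative only: it leans on the page and per_page parameters plus a crude check for a rel="next" entry in the link header, and the function name is made up, not part of rgithub.

```r
# Illustrative loop: request 100 repos at a time, keep going while the
# link header still advertises a "next" page, then glue the pages together.
fetch_all_org_repos <- function(org, per_page = 100) {
  pages <- list()
  page <- 1
  repeat {
    res <- get.organization.repositories(org = org,
                                         per_page = per_page, page = page)
    pages[[page]] <- res$content
    link <- res$headers$link
    if (is.null(link) || !grepl('rel="next"', link)) break
    page <- page + 1
  }
  unlist(pages, recursive = FALSE)  # reassemble one flat list of repos
}
```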

@aronlindberg
Contributor

I don't think the per_page argument works for all API calls, for example : pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), per_page = 1) returns the error: unused argument (per_page = 1).

@jennybc, would you mind pointing to where in your forked repo you have the workaround?

@jennybc
Contributor Author

jennybc commented Feb 6, 2015

@aronlindberg I haven't pushed that stuff to my fork on GitHub. It's in a truly embarrassing state.

If you look at the header of the example of listing pull requests via the API, you'll see there are indeed the usual "next" and "last" links:

https://developer.github.com/v3/pulls/#list-pull-requests

so obviously pagination happens, as of course you have already discovered.

Looking at the relevant function in this package, I see there is no way for the user to pass per_page:

rgithub/R/pulls.R

Lines 84 to 85 in 153bde1

get.pull.request.files <- function(owner, repo, id, ctx = get.github.context())
.api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"))

Contrast that to the function for requesting a list of organization repositories, which uses the argument:

rgithub/R/repositories.R

Lines 35 to 36 in 153bde1

get.organization.repositories <- function(org, ..., ctx = get.github.context())
.api.get.request(ctx, c("orgs", org, "repos"), params=list(...))

Best case, the definition of github::get.pull.request.files() just needs to be modified to pass extra parameters along correctly, and then the manual workflow outlined above is possible. Anecdotally, from my own usage, I think there are other functions that have the same problem.

@aronlindberg
Contributor

I tried adding params=list(...) to get.pull.request.files so that it reads:

get.pull.request.files <- function(owner, repo, id, ctx = get.github.context()) .api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"), params=list(...))

However, calling get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), page = 1) still generates unused argument (page = 1). I'm new at this - can anyone point me in the right direction for how to solve this? I looked at:

https://developer.github.com/v3/pulls/#list-pull-requests-files

and it looks like I should be able to pass the page parameter there...

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

Isn't it per_page?

@aronlindberg
Contributor

Neither works… Shouldn't I be able to pass both of them once the ... parameter is implemented correctly?

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

I'm going to try changing def'n of get.pull.request.files() and see if I succeed ….

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

OK that worked. I'll make a PR to @cscheid in a moment. Here you can prove it to yourself:

library(github)

# I store my access token in an environment variable
ctx <- create.github.context(access_token = Sys.getenv("GITHUB_TOKEN"))

## using rgithub "as is"
req <- get.pull.request.files(owner = "rails", repo = "rails", id = 572)
length(req$content) # by default, 30 items are retrieved
req$headers$link # I can see 30 items per page will imply we need 3 pages

## new definition of get.pull.request.files()
## I will make a PR to cscheid/rgithub momentarily
jfun <- function(owner, repo, id, ..., ctx = get.github.context())
  github:::.api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"),
                   params=list(...))

jreq <- jfun(owner = "rails", repo = "rails", id = 572, per_page = 3)
length(jreq$content) # yep, it's honoring my request for 3 per page
jreq$headers$link # at this rate, we'll need 23 pages

@aronlindberg
Contributor

Cool! =) So I was using:
ctx = interactive.login("client_id", "client_secret")

while you are using
ctx <- create.github.context(access_token = Sys.getenv("GITHUB_TOKEN"))

Is it then that the token allows you to pass params=list(...) but not client_id/client_secret?

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

No, I don't think this has anything to do with our different approaches to authentication.

When you redefined get.pull.request.files() above, the definition doesn't actually include the dots (...) in the formal arguments. So when you added params=list(...) to the call to .api.get.request(), R complains that it doesn't know what to do with additional arguments, such as per_page.

@aronlindberg
Contributor

Aaaah! I see, yes, it works with my authentication too! The dots were just sooooo small, so I missed them. =) Thanks!

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

@aronlindberg I've made a PR now (#39). This change is reflected in my fork, in case you're in a huge hurry. You could install from my fork to get the functionality and switch back once it's merged.

@aronlindberg
Contributor

@jennybc - thanks, I was able to implement the same change in my own fork. Thanks for having a dialogue about it; it helps me better understand how these functions are written. Now I can start trying to construct a manual pagination approach for the script I'm writing.

@jennybc
Contributor Author

jennybc commented Feb 7, 2015

@aronlindberg Here's the function from my "private and embarrassing" unpushed branch that's a good start on digesting link headers for the depagination operation 😄

https://gist.github.com/jennybc/862a01dc9243118d83c9#file-digest_header_links-r

As you can see, I use some functions from plyr and dplyr, in addition to stringr, which this package already imports. This data.frame is helpful for determining how many pages there are and getting the URL for the next page.
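For readers who don't want to follow the gist, a minimal base-R version of the same idea might look like this. The function name, regexes, and column names below are my own illustration, not taken from the gist, and assume the usual numeric-pagination form of the header.

```r
# Parse a GitHub Link header such as
#   <https://api.github.com/...?page=2>; rel="next", <...?page=23>; rel="last"
# into a tidy data.frame with one link per row.
digest_link_header <- function(link) {
  pieces <- strsplit(link, ",\\s*")[[1]]
  data.frame(
    rel  = sub('^.*rel="([^"]+)".*$', "\\1", pieces),
    page = as.integer(sub("^.*[?&]page=([0-9]+).*$", "\\1", pieces)),
    url  = sub("^<([^>]+)>.*$", "\\1", pieces),
    stringsAsFactors = FALSE
  )
}
```

The row labelled "last" then gives the page count, and the row labelled "next" gives the URL to request next.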

@aronlindberg
Contributor

Thanks @jennybc that's helpful! It seems that since ... can be passed, when I run get.pull.request.files(owner = "django", repo = "django", id = i, ctx = get.github.context(), per_page=1000) I can get all the results that I want. Do you know if there is an absolute limit to the per_page argument?

@jennybc
Contributor Author

jennybc commented Feb 13, 2015

I thought it was 100. Are you actually getting 1000 items in one request?

@aronlindberg
Contributor

Whoops. I think it's pretty clear that the limit is 100: https://developer.github.com/guides/traversing-with-pagination/. I'll have to start digging into your manual mechanism...

@aronlindberg
Contributor

A rather messy example (mine) of how to implement @jennybc 's function can be found here: https://gist.github.com/aronlindberg/2a9e9802579b2d239655

@jennybc
Contributor Author

jennybc commented Feb 13, 2015

Glad you are making progress @aronlindberg! I think that will work for your specific application (pull request files).

However, to build something into this package, you actually have to rely on the "next" links, rather than requesting specific pages via page = i. Here's the relevant excerpt from the official docs re pagination:

Keep in mind that you should always rely on these link relations provided to you. Don’t try to guess or construct your own URL. Some API calls, like listing commits on a repository, use pagination results that are based on SHA values, not numbers.

So, using a while loop and creeping from one page to the next looks virtually unavoidable. In the long-term and in the general case, @cscheid's inclination to create a general pagination wrapper seems promising.
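Here is a hedged sketch of what creeping along the "next" links could look like. Since rgithub doesn't expose a way to GET an arbitrary URL, this illustration drops down to httr directly; the function name and token handling are assumptions for the example, not rgithub API.

```r
library(httr)

# Follow rel="next" links until there are none left, as the GitHub docs
# advise; this also works for SHA-based pagination, where page numbers
# can't be constructed by hand.
fetch_all_pages <- function(url, token) {
  items <- list()
  while (!is.null(url)) {
    resp <- GET(url, add_headers(Authorization = paste("token", token)))
    stop_for_status(resp)
    items <- c(items, content(resp))
    link <- headers(resp)$link
    url <- if (!is.null(link) && grepl('rel="next"', link)) {
      sub('.*<([^>]+)>; rel="next".*', "\\1", link)
    } else {
      NULL
    }
  }
  items
}
```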

@hadley

hadley commented Oct 22, 2015

Just skimming, but I don't think you should need computing on the language to do this - if the response object was a bit richer, I think you should be able to do autopagination by inspecting the request metadata.

@jennybc
Contributor Author

jennybc commented Oct 22, 2015

I am now enjoying the page traversal @gaborcsardi put into gh.

https://github.com/gaborcsardi/gh/blob/master/R/pagination.R

@cscheid
Owner

cscheid commented Oct 22, 2015

Oh wait, any reason we're using rgithub instead of jumping ship to gh?

EDIT: That one is a little too low-level for my taste. So we'll live with rgithub and its crumminess for now.

@hadley

hadley commented Oct 22, 2015

Maybe rgithub could rely on gh for the lowlevel stuff?

@realAkhmed

Just wanted to chime in. I did create a simple paginator for rgithub along the lines @hadley seems to be describing (one that inspects the request results, given a rich-enough response object). I intended to submit it as a pull request, but given that this discussion has already started, it may make sense to introduce it right now.

I've used it a lot over the last few months and it seems to work reliably.

It could be used as follows:

# without paginator
first_30_forks <- get.repository.forks(owner_name, repo_name)

# with paginator
all_forks <- auto.page( get.repository.forks(owner_name, repo_name) )

Here is the code of that future pull request (since it is short, I am including it here as a comment as well):

# Page automatically until the response comes back empty
auto.page <- function(f) {
  f_call <- substitute(f)
  stopifnot(is.call(f_call))

  i <- 1
  req <- list()
  result_lst <- list()

  repeat {
    # Specify the page to download
    f_call$page <- i

    req <- eval(f_call, parent.frame())

    # The last page has empty content
    if (length(req$content) == 0) break

    result_lst[[i]] <- req$content
    i <- i + 1
  }

  result_req <- req
  result_req$content <- unlist(result_lst, recursive = FALSE)

  result_req
}

@cscheid
Owner

cscheid commented Oct 22, 2015

@akhmed1, that looks great! @jennybc, @hadley: does that look sufficiently in style? I'm always wary of substitute, eval, and parent.frame, so I'd prefer to get some expert reviewers to go over it :)

@realAkhmed

@cscheid great! Would love to hear the feedback! One word of caution:

  • some existing API functions in rgithub do not allow the page argument to be passed to GitHub, while other functions do.

For instance, get.organization.teams is one example that does not. It is currently defined as:

get.organization.teams <- function(org, ctx = get.github.context())
  .api.get.request(ctx, c("orgs", org, "teams"))

However, auto.page needs the ability to pass the page argument to the API (otherwise pagination can't work). The good news: this is a trivial change to the current function declarations, such as:

get.organization.teams2 <- function(org, ctx = get.github.context(), ...)
  .api.get.request(ctx, c("orgs", org, "teams"), params = list(...))

The latter would work perfectly with auto.page.

@cscheid
Owner

cscheid commented Oct 22, 2015

@akhmed1, that's issue #40 :)

@jennybc
Contributor Author

jennybc commented Oct 22, 2015

One observation from the pagination workaround I was using with this package:

Once you deal with the #40 problem, you can make an initial request with per_page = 1. Then dig the total number of objects out of the link header for the "last" page. This way there's no mystery about how many things you can get. You can pick your own page size and make the real request(s) w/o having to grow objects or test for empty content. This may not really matter, but I liked it better this way.
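A sketch of that probe-then-fetch idea (the regex and variable names are mine, and it assumes the numeric-pagination case):

```r
# Probe with per_page = 1: the "last" page number then equals the total
# number of items, so the real requests can be sized up front.
probe <- get.organization.repositories(org = "STAT545-UBC", per_page = 1)
last  <- sub('.*[?&]page=([0-9]+)[^>]*>; rel="last".*', "\\1",
             probe$headers$link)
n_items  <- as.integer(last)
per_page <- 100
n_pages  <- ceiling(n_items / per_page)
# now fetch pages 1..n_pages with per_page = 100 and combine the results
```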

@realAkhmed

@jennybc Your points are well-taken! Unfortunately, that approach did not work for me (we are using GitHub Enterprise): the last and next page links were not reliable; sometimes empty strings were returned when other pages did exist. Not sure if GitHub.com has the same issues, though.

@jennybc
Contributor Author

jennybc commented Oct 23, 2015

That's weird! No, I haven't had that problem on GitHub.com.

In Traversing with Pagination they try to scare us away from asking for specific pages by number:

Keep in mind that you should always rely on these link relations provided to you. Don’t try to guess or construct your own URL. Some API calls, like listing commits on a repository, use pagination results that are based on SHA values, not numbers.

but ... I've never accessed an exotic endpoint w/ non-numeric pagination. Sounds like you have something that works 😃.

@krlmlr krlmlr mentioned this issue Mar 31, 2016
@Kinzowa

Kinzowa commented Oct 29, 2017

Hello, is it just me, or does pagination not work with GET /repos/:owner/:repo/stats/contributors? It looks like results are limited to 100 even on the repository graph page, and the argument page=2 doesn't return the expected result.

https://developer.github.com/v3/repos/statistics/

@cscheid
Owner

cscheid commented Oct 29, 2017

Yes, this is an issue. As you can see, I've been unable to work on this for a long time now. I would really welcome a pull request if you have one. Sorry about that.

@cscheid
Owner

cscheid commented Jun 5, 2018

Closed (wontfix) by #70.

@cscheid cscheid closed this as completed Jun 5, 2018