Add BaseGraphqlService, support [github] V4 API #3763

chris48s · 2019-07-23T20:15:09Z

This bit of work sprouted off from #3610 but I've decided to do it in its own PR because this will require some review to get right and trying to load all this into a PR that is already a ~500 line diff is going to make that annoying. Lets get this bit right then I'll go back, rebase #3610 onto this work and pick that up.

Main things going on here are:

Add a generic BaseGraphqlService - we'll probably need this in future anyway
Add a GithubAuthV4Service service for badges which query the Github V4 API - most of the work here was to accommodate one change in the V4 API: rate limit usage info is now returned in the response body on request. As well as supporting move [githubtag] and [githubrelease] under /v, move variants to query params #3610 this will allow us to address other open feature requests requiring the V4 API. e.g: Badge request: GitHub dependency graph #1550, Badge request: GitHub Package Registry, NPM variants #3608
Just for illustrative purposes, I've switched the github forks badge to use the graphql API. There's no real reason we need to do this. I just picked a really simple "get one value from the API, stick it on a badge" example to migrate as I think it helps to see a concrete implementation of the base class (both in dev and review). If we want to switch this back to use the V3/REST API, I can revert 160569e before merging.

shields-ci · 2019-07-23T20:15:56Z

	Warnings
⚠️	This PR modified service code for `github` but not its test code. That's okay so long as it's refactoring existing code.
⚠️	📚 Remember to ensure any changes to `config.private` in `services/github/github-helpers.js` are reflected in the server secrets documentation

	Messages
📖	✨ Thanks for your contribution to Shields, @chris48s!

Generated by 🚫 dangerJS against ba974b0

chris48s · 2019-07-23T20:15:36Z

core/base-service/base-graphql.js

+    schema,
+    url,
+    query,
+    variables = {},


I've made query and variables top-level params here. I'm in 2 minds about whether it makes more sense to collect them into an object. Also note here I deliberately haven't implemented mutations. In general, calling a shields badge should read data from upstream service, not write data to it, so we shouldn't need mutations.

core/base-service/base-graphql.js

services/github/github-api-provider.js

chris48s · 2019-07-23T20:17:55Z

services/github/github-auth-service.js

+    */
+    attrs.query = mergeQueries(
+      attrs.query,
+      'query { rateLimit { limit cost remaining resetAt } }'


Explicitly asking the GH API to just tell us what our remaining usage is in the body seems like an easier approach than attempting to parse the query and calculate the projected cost of each call and less likely to result in a discrepency: https://developer.github.com/v4/guides/resource-limitations/#calculating-nodes-in-a-call

That said, we probably do need to watch out for query count on contributions. With the V3 API, its pretty hard for someone to contribute a badge that smashes our usage limit because 1 http request = 1 point, but with a graphql a single http request can cost you a whole truckload of points. Not sure what the solution is for that other than to watch out for it in review.

How about establishing a max number of queries, which could be overridden per service? If the response exceeds the configured number it could error out rather than rendering. That way we'll see it right in the code, and can even search for it.

On the other hand, we have a rather huge amount of GitHub quota – I want to say something like 2.2k tokens – so maybe we don't need to worry about this.

Are you thinking of that as a runtime check or a static check? Either way, I'm not against adding it, but I think it can be raised as an issue to address in a later PR.

Since we don't know the answer until the response comes back, it'd have to be a runtime check. Agreed, it can be addressed later.

Right, ok. I misunderstood and thought you were still talking about pre-parsing the query before we send it.

Throwing the exception after we've already made the query is obviously simpler, but also of more limited value (esp if it makes it through to production) in that we've already made the request we don't want to make before deciding we didn't want to make it, if you see what I mean :)

Maybe its a useful MVP to start with, in that if we start getting exceptions in sentry saying "lol - just rinsed our token and now we're not even going to render a badge with all the expensive data we just requested" we can at least be reactive. Bear in mind it might lull us into a false sense of security though as we'd be unlikely to throw it in dev because we treat global tokens differently from pooled ones. My instinct is most contributors probably use a global token as opposed to a pool with one token in it (I know that's what I usually do - I think working on this PR is the first time I've set up a pool locally).

Unfortunately I think we do still actually need to work out the query cost up-front to implement it in a useful way.

chris48s · 2019-07-23T20:18:11Z

services/github/github-helpers.js

+function graphqlErrorHandler(errors) {
+  if (errors[0].type === 'NOT_FOUND') {
+    throw new NotFound({ prettyMessage: 'repo not found' })
+  } else {


In time, it might be useful to add other cases in here, but I don't know what they are yet.

paulmelnikow

Hey Chris, this is looking really good. Not an insignificant piece of work! I left a few comments though there's nothing really major here.

As for porting the service or not, one strategy might be to reserve the GraphQL endpoints for the things that really need it, and the other might be to port things over to GraphQL as much as possible.

Is there a performance difference?

services/github/github-api-provider.js

paulmelnikow · 2019-07-23T21:07:51Z

services/github/github-api-provider.js

+  }
+
+  getRateLimitFromBody(body) {
+    const parsedBody = JSON.parse(body)


Hmm, it would be nice to avoid parsing the body twice for all these responses. Any idea if there's a good way to do that?

Yeah this is a bit annoying.. I think we could, but I think its also a fairly major change. At the moment we're essentially doing all this token work inside the call to _request(), which means we can deal with manipulating the request and processing the response in the same place. We can also deal with standardTokens, searchTokens and graphqlTokens in the same GithubApiProvider class. In order to avoid double-parsing, we'd need to restructure so we can still manipulate the request, but then update the token pool after we call _parseJson() in either _requestJson() or _requestGraphql().

I've not really completely thought through the consequences of all this, but here's a couple of thoughts:

One way to deal with that would be to split this up so we deal with standardTokens / searchTokens in by overriding the _requestJson() method in GithubAuthV3Service.. and graphqlTokens by overriding the _requestGraphql() method of GithubAuthV4Service. Although at that point if GithubAuthV4Service overrides _requestGraphql(), its not really a subclass of BaseGraphqlService anymore.

Maybe another way would be to allow the _requestJson()/_requestGraphql()/_requestFomat() methods in base classes to accept an optional callback param or something that cna be used to run a post-process after parsing but before we return the response.

Any further thoughts or ideas?

How about having GithubAuthV4Service rely on BaseGraphqlService to do the json parsing, but pass in an empty schema so the validation is a no-op. Then the token-related stuff can happen, then GithubAuthV4Service can invoke _validate?

Partly related, I dislike deep inheritance because the indirection makes it hard for people unfamiliar with the code to dig through. Each layer is one more thing to keep in your head at once.

So I think it would be good to make BaseGraphqlService extend directly from BaseService. All that would take is exporting parseJson as its own function (or moving it to its own file).

There are a couple places in GitHub where we invoke _parseJson from within a service, and those will be made cleaner by importing it directly.

In Gerritt there's also a place where we override it, though it'd be less magical and easier to understand if we used BaseService directly (invoking _request, doing the trim, then invoking parse and validate).

The refactor is a bit involved so maybe it's best to leave it for a follow-on.

As a first step on this, I've moved parseJson() out to a helper and switched BaseGraphqlService to inherit from BaseService. That seems sensible anyway, but I've not taken it further.

Not sure I agree that overriding a method from the base class is "magical" in the context of gerrit though. Its what pretty much every other function we declare in a badge service class is doing - seems like a pretty consistent pattern 🤷‍♂️

services/github/github-api-provider.js

services/github/github-auth-service.js

paulmelnikow · 2019-07-23T21:17:21Z

services/github/github-auth-service.js

+    */
+    attrs.query = mergeQueries(
+      attrs.query,
+      'query { rateLimit { limit cost remaining resetAt } }'


How about establishing a max number of queries, which could be overridden per service? If the response exceeds the configured number it could error out rather than rendering. That way we'll see it right in the code, and can even search for it.

On the other hand, we have a rather huge amount of GitHub quota – I want to say something like 2.2k tokens – so maybe we don't need to worry about this.

services/github/github-forks.service.js

core/base-service/base-graphql.js

core/base-service/graphql.js

chris48s · 2019-07-26T20:50:45Z

As for porting the service or not, one strategy might be to reserve the GraphQL endpoints for the things that really need it, and the other might be to port things over to GraphQL as much as possible.

Is there a performance difference?

I thought about this a bit already. As I see it, there are a few tradeoffs to think about here (some performance-related and some not). Predictably there is not a single clear-cut answer in all cases:

In most cases, a sensible graphql query will return a smaller JSON response than a REST call. For example, in the forks example we need to request this object with the V3 API to get one value: https://api.github.com/repos/badges/shields but with graphql the response is only like { data: { repository: { forks: { totalCount: 1984 } }, rateLimit: { limit: 5000, cost: 1, remaining: 4999, resetAt: '2019-07-24T21:23:26Z' } } } That's less bytes over the wire and a faster object to parse.
Beyond that, I think we can probably assume that making one HTTP request to the V3 API and making one HTTP request to the V4 API are basically identical in terms of performance (even if theoretically one or other takes longer to calculate/serve on their end) because the main thing we're waiting for is network latency.
With the V3 API, one request = one point off the usage limit. With the V4 API one request may be many points off the usage limit (but will be a minimum of one). Assuming we can produce the same badge with one REST call or a graphql request: In the best-case scenario, the graphql query will be the same cost as the REST request. In all other cases, graphql will be more expensive in this respect. Given we won't accept a badge that makes more than one V3 request, using the V3 request will sometimes be more efficient on our token usage.
At this point in time, REST is more familiar to contributors and more well-understood.
I don't think we're in any danger of GitHub deprecating the V3 API in the near future. They haven't published any info about when it will be deprecated or retired: https://developer.github.com/v3/versions/ . That said, presumably at some point in time it will go away and we'll need to migrate everything to V4 one day anyway, unless graphql goes out of fashion or something.

paulmelnikow · 2019-07-26T21:56:52Z

I thought about this a bit already. As I see it, there are a few tradeoffs to think about here (some performance-related and some not). Predictably there is not a single clear-cut answer in all cases:

In most cases, a sensible graphql query will return a smaller JSON response than a REST call. For example, in the forks example we need to request this object with the V3 API to get one value: https://api.github.com/repos/badges/shields but with graphql the response is only like { data: { repository: { forks: { totalCount: 1984 } }, rateLimit: { limit: 5000, cost: 1, remaining: 4999, resetAt: '2019-07-24T21:23:26Z' } } } That's less bytes over the wire and a faster object to parse.

Beyond that, I think we can probably assume that making one HTTP request to the V3 API and making one HTTP request to the V4 API are basically identical in terms of performance (even if theoretically one or other takes longer to calculate/serve on their end) because the main thing we're waiting for is network latency.

With the V3 API, one request = one point off the usage limit. With the V4 API one request may be many points off the usage limit (but will be a minimum of one). Assuming we can produce the same badge with one REST call or a graphql request: In the best-case scenario, the graphql query will be the same cost as the REST request. In all other cases, graphql will be more expensive in this respect. Given we won't accept a badge that makes more than one V3 request, using the V3 request will sometimes be more efficient on our token usage.

At this point in time, REST is more familiar to contributors and more well-understood.

I don't think we're in any danger of GitHub deprecating the V3 API in the near future. They haven't published any info about when it will be deprecated or retired: https://developer.github.com/v3/versions/ . That said, presumably at some point in time it will go away and we'll need to migrate everything to V4 one day anyway, unless graphql goes out of fashion or something.

Everything you're saying makes sense.

Assuming the GraphQL API is otherwise fast, since network latency is the key issue I'd guess the smaller responses probably will have a bigger impact than anything else.

Could you paste what you wrote into a comment in github-auth-service.js? Or else distill it down to what you think contributors need to know?

chris48s · 2019-07-29T20:41:37Z

Cheers. I'll raise issues for the additional bits of work

shields-deployment · 2019-07-29T20:42:07Z

This pull request was merged to master branch. This change is now waiting for deployment, which will usually happen within a few days. Stay tuned by joining our #ops channel on Discord!

After deployment, changes are copied to gh-pages branch:

chris48s added 4 commits July 23, 2019 21:04

add base class for Graphql APIs

2aeb24e

add GithubAuthV4Service + updates to GH token pool

063da6a

update github forks to use GithubAuthV4Service

160569e

rename GithubAuthService to GithubAuthV3Service

08c7d5d

chris48s added service-badge Accepted and actionable changes, features, and bugs core Server, BaseService, GitHub auth labels Jul 23, 2019

chris48s commented Jul 23, 2019

View reviewed changes

paulmelnikow reviewed Jul 23, 2019

View reviewed changes

shields-cd temporarily deployed to shields-staging-pr-3763 July 23, 2019 22:16 Inactive

chris48s added 3 commits July 26, 2019 19:50

change function names

ab9ec5f

create new objects instead of modifying in-place

2c19bb2

use gql template strings

9940d9a

shields-cd temporarily deployed to shields-staging-pr-3763 July 26, 2019 20:41 Inactive

chris48s added 3 commits July 28, 2019 17:21

graphqlErrorHandler --> transformErrors

2256db6

add a comment about V3 vs V4 tradeoff

be8b655

move parseJson out to a shared helper

b4b5bf6

shields-cd temporarily deployed to shields-staging-pr-3763 July 28, 2019 17:33 Inactive

paulmelnikow approved these changes Jul 28, 2019

View reviewed changes

Merge branch 'master' into base-graphql

ba974b0

shields-cd deployed to shields-staging-pr-3763 July 29, 2019 20:33 View deployment

chris48s merged commit 75ee413 into badges:master Jul 29, 2019

This was referenced Jul 29, 2019

Avoid parsing body twice in GithubAuthV4Service #3809

Open

Warn about expensive queries in GithubAuthV4Service #3810

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add BaseGraphqlService, support [github] V4 API #3763

Add BaseGraphqlService, support [github] V4 API #3763

chris48s commented Jul 23, 2019

shields-ci commented Jul 23, 2019 •

edited

chris48s Jul 23, 2019

chris48s Jul 23, 2019

paulmelnikow Jul 23, 2019

chris48s Jul 26, 2019

paulmelnikow Jul 26, 2019

chris48s Jul 28, 2019

chris48s Jul 23, 2019

paulmelnikow left a comment

paulmelnikow Jul 23, 2019

chris48s Jul 26, 2019

paulmelnikow Jul 26, 2019

chris48s Jul 28, 2019

paulmelnikow Jul 23, 2019

chris48s commented Jul 26, 2019

paulmelnikow commented Jul 26, 2019

chris48s commented Jul 29, 2019

shields-deployment bot commented Jul 29, 2019

Add BaseGraphqlService, support [github] V4 API #3763

Add BaseGraphqlService, support [github] V4 API #3763

Conversation

chris48s commented Jul 23, 2019

shields-ci commented Jul 23, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paulmelnikow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chris48s commented Jul 26, 2019

paulmelnikow commented Jul 26, 2019

chris48s commented Jul 29, 2019

shields-deployment bot commented Jul 29, 2019

shields-ci commented Jul 23, 2019 •

edited