[apollo-gateway] Question about nullability and cross-service joins #860

lennyburdette · 2020-02-28T19:55:22Z

Hi @trevor-scheer, @jhampton and friends,

Any time the query planner calls another service, there's an additional chance of failure. Consider the following two-service federated graph:

type User @key(fields: "id") {
  id: ID!
  name: String!
}

extend type User @key(fields: "id") {
  id: ID! @external
  reviews: ReviewConnection!
}

type ReviewConnection {
  edges: [ReviewEdge!]!
  nodes: [Review!]!
  pageInfo: PageInfo!
}

Because User.reviews is a non-nullable ReviewConnection, if the request to the Reviews service fails (due to a network partition or service downtime) the response will fail validation and we won't get any data at all.

query UserReviews($id: ID!) {
  user(id: $id) {
    id
    name
    reviews { # returning null here means we won't get `user.id` or `user.name`
      nodes {
        id
        content
      }
   }
}

What are your thoughts on making User.reviews nullable to better support partial data? It's kind of a shame to leak implementation details into the schema. Are y'all documenting best practices like this anywhere?

Would we possibly want a schema linter that enforces that cross-service data is nullable to allow partial data?

Alternatively, would it be possible to handle failures like this?

{
  "data": {
    "user": {
      "id": "1",
      "name": "Lenny",
      "reviews": {
        "nodes": [],
      }
    }
  },
  "errors": [
    {
      "message": "Error fetching User.reviews: Service is down",
      "path": ["user", "reviews"]
    }
  ]
}

I have no idea how you'd do this generically, but maintaining the contract by returning empty data plus an error would be very cool.

The text was updated successfully, but these errors were encountered:

jodosha · 2020-04-09T15:53:01Z

👍 On this request.

Here's an example of this problem based on https://github.com/apollographql/federation-demo

Steps to reproduce:

Instead of starting the services with npm run start-services, open 4 different shell tabs and cd into each of them to manually start each service with npm start.
Start the gateway with npm run start-gateway
Run a query like the one listed below 👇
Stop the inventory service
Run again the query, you will see inStock with null values, and an error without path node. See https://gist.github.com/jodosha/df091662096a7df4c33f30f378b304da

{
  topProducts {
    name
    inStock
  }
}

I believe that having path will help consumers (e.g. UI) to write more robust code, as they can to understand why inStock is null.

Another concerning aspect is the error message:

"request to http://localhost:4004/graphql failed, reason: connect ECONNREFUSED 127.0.0.1:4004"

It leaks sensitive details like the GQL endpoint that it supposed to abstract.

Thanks for your consideration.

jhampton · 2020-07-15T17:42:45Z

A few follow-up thoughts:

My take here is that the correct behavior is to ensure (in the DOWNSTREAM SERVICE), that responses respect nullability, and that the types in question can optionally return a union of the request type or requested type-error.

If the user experience expects a given set of child objects to be available and they’re not, a reasonable pattern would be to hide the related UI. But if the type is marked non-null, the right answer might be to contact support (because there must be something there in order to complete a task or flow.)
Having a consistent rule applied here might be challenging, but having a general guidance feels achievable:

Partial data should be returned unless that partial data renders the user’s task unachievable or breaks a business rule, in which case an exception is returned.
If partial data can be made available, it is the implementing service’s responsibility to ensure a valid result set at all times (in the case of nullability)
Optionally, a service can choose to provide additional context in the form of a union of the expected type and an expressive error type

So as a graph administrator, I can apply these guidelines, make it a part of schema/code review, and write tests for it.

Thank you for the callout @jodosha on the leaking of error details. This could be reformatted using the existing Apollo Server hooks: https://github.com/apollographql/apollo-server/blob/570f548b88750a06fbf5f67a4abe78fb0f870ccd/docs/source/data/errors.md#masking-and-logging-errors

I'd like to close this out as an active issue, but please feel free to re-open if this surfaces with better or more-specific patterns and practices from the field.

Thank you!

jodosha · 2020-11-11T10:08:26Z

@jhampton Thanks for the pointer to masking errors, but still there is a bit that is missing IMO: the correlation between the error and the field, ideally with path information in error payload.

If you look again at the output, there is no hint for the caller that inStock is null because of that INTERNAL_SERVER_ERROR.

From the caller perspective I don't know if inStock is null because it's a nullable field or because the service is down and Apollo Gateway implementation happens to return null.

I assume that the query planner can associate fields (like inStock) to HTTP requests to downstream services, and enrich each errors node with a path information.

EDIT: Another option would be to add a callback to RemoteGraphQLDataSource subclasses that would be triggered in case of error. Something like this:

class MyGraphQLDatasource extends RemoteGraphQLDataSource {
  didReceiveError({ error, response, request, context }) {
  }
}

const gateway = new ApolloGateway({
  buildService({ name, url }) {
    return new MyGraphQLDatasource({ url });
  },
});

Ideally context (or request) should contain information regarding the HTTP request to somehow correlate it to inStock. In this way, I can at least enrich error with this information.

kindermax · 2021-07-06T18:43:56Z

I am also concerned about the way gateway handles partial responses if one of the underlying subgraph services are unable to respond.
In @lennyburdette example, there is only one extending field reviews: ReviewConnection! and potentialy it can be made optional, but that is a no-go in my opinion. What if reviews service exposes a lot more fields. Are they all have to be declared as optional just to satisfy the case when reviews is down for some time ? It does not make sence.

I think it should be a way:

to have non-optional fields on subgraph service
for a gateway to return as much queried data as possible and for the rest of fields just provide a sensible errors.

This will make a whole federated gateway way more fault tolerant.

prasek · 2021-07-06T22:36:07Z

Thanks @kindermax will take a look 👍

cc @martijnwalraven @pcmanus @abernix - something to consider for the next round of Federation changes

mjfaga · 2021-07-15T03:24:01Z

Throwing my two cents in here. I would agree that forcing null in extending services is non-ideal for the reasons @kindermax sited above.

Without breaking existing GQL spec expectations with regards to nullable vs. non-nullable fields, I wonder if something like a @faultTolerant directive (happy to debate a legit name for this at some point) on extended fields of a required type in federation can instruct the gateway to either:

(preferred) Keep fields as required, but return a custom ServiceOutageError value type hydrated with the correct information, During the federation process, this value type would be exposed by compiling it into the final schema as a union type.
Expose fields as nullable, but treat them as non-nullable internally, and if the underlying service experiences some sort of network error/500, handle that and populate the errors array in the response appropriately.

There is also a potential option to make this sort of functionality default in federated implementations, avoiding the use of a directive attribution unless you want to opt out of this sort of functionality... something like a @propogateNetworkFailures directive or the like.

Option 1 Example

extend type User @key(fields: "id") {
  id: ID! @external
  reviews: ReviewConnection! @faultTolerant
}

type ReviewConnection {
  edges: [ReviewEdge!]!
  nodes: [Review!]!
  pageInfo: PageInfo!
}

would compile to the following in the supergraph schema and guarantee a result set that clients explicitly detect and have to handle

extend type User @key(fields: "id") {
  id: ID! @external
  reviews: UserReviewsFaultTolerantPayload!
}

union UserReviewsFaultTolerantPayload = ReviewConnection | ServiceOutageError

type ReviewConnection {
  edges: [ReviewEdge!]!
  nodes: [Review!]!
  pageInfo: PageInfo!
}

# I'm sure we can come up with something better here :)
type ServiceOutageError {
  serviceName: String!
  statusCode: Int!
  message: String!
}

Option 2 Example

extend type User @key(fields: "id") {
  id: ID! @external
  reviews: ReviewConnection! @faultTolerant
}

type ReviewConnection {
  edges: [ReviewEdge!]!
  nodes: [Review!]!
  pageInfo: PageInfo!
}

would simply compile to the following in the supergraph schema, where again, the errors object would be hydrated with the appropriate content that clients could use to decision off of

extend type User @key(fields: "id") {
  id: ID! @external
  reviews: ReviewConnection
}

type ReviewConnection {
  edges: [ReviewEdge!]!
  nodes: [Review!]!
  pageInfo: PageInfo!
}

mjfaga · 2021-07-16T12:42:41Z

The more I think about what I wrote above, the less I like it. Option 1 above actually feels like it starts to leak the internals of federation into our frontend by explicitly defining where services extend other services. I can't decided if this is a necessarily evil or not.

Ensuring the gateway is returning sensible errors feels like a reasonable way to handle this, but either way, knowing that there was a service outage and handling that in our frontend doesn't feel much different.

It will be interesting to see what comes of graphql/graphql-spec#867 and if that plays a role in how we think about this.

Mithras · 2022-07-01T19:05:53Z

I'm not sure why this is closed but I'd like to add that the problem is present even when all subgraphs are up and running as expected. Please read my explanation here: apollographql/router#1308 (comment)

jhampton self-assigned this Feb 28, 2020

jhampton closed this as completed Jul 15, 2020

glasser transferred this issue from apollographql/apollo-server Jul 6, 2021

abernix mentioned this issue Jul 1, 2022

Null response when resolving a non-null filed from a different subgraph apollographql/router#1308

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[apollo-gateway] Question about nullability and cross-service joins #860

[apollo-gateway] Question about nullability and cross-service joins #860

lennyburdette commented Feb 28, 2020

jodosha commented Apr 9, 2020

jhampton commented Jul 15, 2020

jodosha commented Nov 11, 2020 •

edited

kindermax commented Jul 6, 2021

prasek commented Jul 6, 2021 •

edited

mjfaga commented Jul 15, 2021

mjfaga commented Jul 16, 2021 •

edited

Mithras commented Jul 1, 2022

[apollo-gateway] Question about nullability and cross-service joins #860

[apollo-gateway] Question about nullability and cross-service joins #860

Comments

lennyburdette commented Feb 28, 2020

jodosha commented Apr 9, 2020

jhampton commented Jul 15, 2020

jodosha commented Nov 11, 2020 • edited

kindermax commented Jul 6, 2021

prasek commented Jul 6, 2021 • edited

mjfaga commented Jul 15, 2021

Option 1 Example

Option 2 Example

mjfaga commented Jul 16, 2021 • edited

Mithras commented Jul 1, 2022

jodosha commented Nov 11, 2020 •

edited

prasek commented Jul 6, 2021 •

edited

mjfaga commented Jul 16, 2021 •

edited