Null response when resolving a non-null field from a different subgraph #1308

Mithras · 2022-06-24T21:38:23Z

Describe the bug
This one is tricky. Let me try to explain with an example:

Let's assume we have two subgraphs: SubgraphA and SubgraphB.
SubgraphA defines MyEntity like this:

```graphql
type MyEntity @key(fields: "id") {
  id: Int!
  propertyA: String!
}
```

SubgraphB defines MyEntity like this:

```graphql
type MyEntity @key(fields: "id") {
  id: Int!
  propertyB: String!
}
```

Now suppose SubgraphA also defines a root field myEntity: MyEntity! that returns the MyEntity with id=1.

If we then run a query like this:

```graphql
{
  myEntity { # resolved from SubgraphA
    id # resolved from SubgraphA
    propertyA # resolved from SubgraphA
    propertyB # resolved from SubgraphB
  }
}
```

Two things can happen:

1. If SubgraphB can resolve MyEntity with id=1, everything works fine.
2. BUG: If SubgraphB cannot resolve MyEntity with id=1, the whole response is set to null.
   By "cannot resolve" I mean that SubgraphB doesn't know about MyEntity with id=1 and returns { "data": { "_entities": [null] } } for:
```
{
  _entities(representations: [{ __typename: "MyEntity", id: 1 }]) {
    propertyB
  }
}
```
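For readers unfamiliar with entity fetches, here is a minimal sketch of the follow-up request a query planner sends to SubgraphB, assuming the standard Federation entry point _entities(representations: [_Any!]!). buildEntitiesFetch is a hypothetical helper for illustration only, not the Router's actual implementation, and it handles a single key field only.

```javascript
// Hypothetical helper (not Router code): builds the follow-up request a query
// planner sends to a subgraph to fetch extra fields for entities that another
// subgraph already resolved. Assumes the standard Federation entry point
// _entities(representations: [_Any!]!) and a single @key field.
function buildEntitiesFetch(typename, keyField, parents, selection) {
  // One representation per parent object: __typename plus the key field value.
  const representations = parents.map((parent) => ({
    __typename: typename,
    [keyField]: parent[keyField],
  }));
  // The inline fragment selects the requested fields on the concrete entity type.
  const query =
    "query($representations: [_Any!]!) { " +
    "_entities(representations: $representations) { " +
    `... on ${typename} { ${selection.join(" ")} } } }`;
  return { query, variables: { representations } };
}

// The fetch SubgraphB would receive for the myEntity example (id=1).
const request = buildEntitiesFetch("MyEntity", "id", [{ id: 1, propertyA: "A" }], ["propertyB"]);
console.log(JSON.stringify(request.variables));
// {"representations":[{"__typename":"MyEntity","id":1}]}
```

If SubgraphB does not know id 1, the matching slot in its response is null ({ "data": { "_entities": [null] } }), and that null is what ends up in place of propertyB on myEntity.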

Expected behavior
The Gateway simply omitted propertyB in case 2 above. The Router should probably do the same.

Desktop:

OS: Ubuntu 20.04
Version: v0.9.5

Additional context
Everything works as expected if we make propertyB nullable, but that's not a solution, because it would require making pretty much all fields except IDs nullable.

Maybe related: #1290
Somewhat related: #1304


Mithras · 2022-06-24T21:42:18Z

@Geal

Geal · 2022-06-28T12:56:30Z

This is a tricky one, because the router is doing the right thing here. It has to present the supergraph as if it were one monolithic graph, so the basic error-handling rules apply: if propertyB is missing, then, since it is non-null, the wrapping object must be nullified, regardless of which subgraph provides the field. If propertyB were handled by SubgraphA and SubgraphA returned { "myEntity": { "id": 123, "propertyA": "A" } }, we would have to nullify the object too.

That said, the gateway's behaviour does look useful here. Perhaps it should instead be provided by client controlled nullability: https://github.com/graphql/graphql-wg/blob/main/rfcs/ClientControlledNullability.md#-behavior

Mithras · 2022-06-28T16:45:25Z

I understand that it technically violates the composed schema, but as I said, returning { data: null } is not a viable solution in these cases. That pretty much means that Federation can not be used with non-null fields and everybody will be forced to mark everything as nullable. In that case, what's the point of null checks in the Router if all fields have to be nullable anyway? And if that's the case, you might as well fail composition in the Rover CLI or Apollo Studio when there are non-null fields in entities shared by subgraphs.
Field nullability cannot be enforced when multiple subgraphs are involved. Maybe rover supergraph compose should automatically change non-null fields to nullable fields if you want to ensure that the response and the supergraph schema always match?

Mithras · 2022-06-28T17:00:52Z

@Geal Do you think you could add something like --unsafe-null-checks to make the Router behave the same as the Gateway, at least as a short-term solution? It's a major breaking change for people trying to migrate to the Router.

Geal · 2022-06-29T08:02:39Z

@Mithras modifying the error management in that way would be a significant amount of work, and would potentially break other parts, so this is not something we can do as a short term solution, it will require some thought before we reach a satisfying solution.

First, can you give me some information about your gateway setup (i.e. package version, Federation 1 or 2, any custom extensions)? We tried to reproduce this with the gateway, and it behaves like the router: https://stackblitz.com/edit/basic-federation-7ak1t8?file=one.js,index.js

Mithras · 2022-06-29T18:19:14Z

I tested it, and yes, that's because we skip validation in the Gateway for performance reasons (about 2x throughput without validation) using this yarn patch:

```diff
diff --git a/execution/execute.js b/execution/execute.js
index ec5ddccc06aef550cdb55a32dec348deddcdc00b..b5f17b872d433c9cd5273c11d9b1290044bad80a 100644
--- a/execution/execute.js
+++ b/execution/execute.js
@@ -580,6 +580,14 @@ function completeValue(exeContext, returnType, fieldNodes, info, path, result) {
   } // If field type is NonNull, complete for inner type, and throw field error
   // if result is null.
 
+  let rootPath = path;
+  while(rootPath.prev){
+    rootPath = rootPath.prev;
+  }
+  if(rootPath.key !== "__schema") {
+    return result;
+  }
+
   if ((0, _definition.isNonNullType)(returnType)) {
     const completed = completeValue(
       exeContext,
```

registered via the resolutions field in package.json:

```json
{
  "resolutions": {
    "graphql@16.4.0": "patch:graphql@npm:16.4.0#.yarn/patches/graphql-npm-16.4.0-e6908d8ae7.patch"
  }
}
```

because we don't really see much benefit of re-validating already valid responses from subgraphs at a cost of halving throughput.
Well, I guess it is not a breaking change after all, but I'm still confused about how we should use Federation if this is by design. Can you provide best practices for this?
I see these options:

1. Manually make all federated fields nullable. Pros: no changes on the Apollo side. Cons: if somebody accidentally adds a non-nullable field, everything breaks. This might be mitigated with graphql-inspector, but this option adds extra work and potential outages for every Apollo Federation customer.
2. Add an option to rover supergraph compose and Apollo Studio to automatically change non-nullable federated fields to nullable. This sounds like a relatively easy and safe feature to add. It would guarantee that the supergraph schema always matches the response, and adding a non-nullable field would never break anything.
3. Add an option to the Router to skip validation. Pros: on top of solving the nullability issue, it would probably boost performance. Cons: the schema might not match the response.
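A minimal sketch of what option 2 could look like, assuming a simplified schema model where field types are plain strings such as "String!" rather than a real GraphQL AST. composeRelaxed is a hypothetical function, not an existing Rover or Apollo Studio feature: it merges entity fields across subgraphs and strips the non-null marker from every non-key field of an entity that more than one subgraph contributes to.

```javascript
// Hypothetical sketch of "option 2" (not a Rover/Studio API). Field types are
// plain strings; a real implementation would operate on a GraphQL AST.
function composeRelaxed(subgraphs) {
  const composed = {}; // typeName -> { keys, fields, owners }
  for (const subgraph of subgraphs) {
    for (const [typeName, def] of Object.entries(subgraph.types)) {
      const entry = (composed[typeName] ??= { keys: def.keys, fields: {}, owners: new Set() });
      entry.owners.add(subgraph.name);
      Object.assign(entry.fields, def.fields);
    }
  }
  for (const entry of Object.values(composed)) {
    if (entry.owners.size < 2) continue; // single owner: nothing to relax
    for (const [field, type] of Object.entries(entry.fields)) {
      // Strip only the outermost "!" ("String!" -> "String", "[T!]!" -> "[T!]").
      if (!entry.keys.includes(field) && type.endsWith("!")) {
        entry.fields[field] = type.slice(0, -1);
      }
    }
  }
  return composed;
}

const composed = composeRelaxed([
  { name: "SubgraphA", types: { MyEntity: { keys: ["id"], fields: { id: "Int!", propertyA: "String!" } } } },
  { name: "SubgraphB", types: { MyEntity: { keys: ["id"], fields: { id: "Int!", propertyB: "String!" } } } },
]);
console.log(composed.MyEntity.fields);
// { id: 'Int!', propertyA: 'String', propertyB: 'String' }
```

Key fields stay non-null in this sketch because every entity fetch starts from a representation that already carries them.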

Personally, I favor option 2, but I would probably opt in to option 3 as well if it were available, because for me performance >>> a couple of schema mismatches. Our subgraphs always return responses that match their schema, and re-validating them in the Router/Gateway doesn't seem necessary to me. I wouldn't use option 3 without option 2, though, because some generated clients might not work if a non-nullable field is missing from the response.

Geal · 2022-06-30T09:00:23Z

About the performance issue: I think you can revisit that decision for the router, because validation is much cheaper there. This last validation step can catch bugs such as a subgraph instance that did not get the latest upgrade and returns invalid data (valid from its own point of view).
In general, it's not recommended to use non-nullable fields too much, even outside of federation, because they complicate schema migrations (and the fields could still come from other async sources, like a database).

It's more a matter of how you design your graph. In some contexts you can still process a response without one of the requested fields, so that field should be nullable, like propertyB in your example. If you cannot process the response without one of the fields, then yes, it must be non-nullable, and you need to make sure it can be obtained reliably, either by increasing the stability/redundancy of the service providing it or by moving that field's implementation back alongside the other fields.

So none of the 3 options is really satisfactory: option 1 is too coarse-grained (you might still benefit from non-nullable fields in some parts of the schema), 2 is bound to create surprising issues, and 3 will not happen, because the bare minimum we need to do is follow the GraphQL specification.

Mithras · 2022-06-30T18:26:36Z

> I think you can revisit that decision for the router, because validation is much cheaper there

Sure. The only reason I did this in the Gateway was the 2x performance boost. The Router seems to be faster than the Gateway even with validation enabled.

> It's more a matter of how you design your graph.

I disagree. Again, the problem is adding a non-null field to an entity will ALWAYS make some requests return { data: null }. It's not a matter of how YOU design. You are FORCED to make everything nullable; otherwise some responses WILL BE null. The only question is whether tooling does it for you or you have to do it yourself.

> 2 is bound to create surprising issues

What issues do you see with option 2? I'm suggesting an optional flag that changes non-null fields to nullable fields. First of all, it's an option: people who are OK with null responses (I'd really like to see anybody who is OK with that, though) don't have to opt in. Second, making non-null fields nullable will never result in any schema violations. I don't see any issues with that.

Just to make sure we are on the same page. When more than one subgraph contributes to entity fields, it is not possible to guarantee these fields won't be null. It does not make sense to have them non-nullable in supergraph schema. Non-null fields can only be enforced at each individual subgraph level. Never at supergraph level.

Geal · 2022-07-01T08:38:14Z

> Just to make sure we are on the same page. When more than one subgraph contributes to entity fields, it is not possible to guarantee these fields won't be null

It cannot be guaranteed, in the same way that an entity's field in a monolithic service could fail for various reasons (missing data, can't find file, unresponsive DB, etc).

> It does not make sense to have them non-nullable in supergraph schema. Non-null fields can only be enforced at each individual subgraph level. Never at supergraph level.

That's where we are not aligned. It has nothing to do with federation. Marking a field as non-nullable means one thing: when it is requested, it does not make sense to get the entity's data if that field is missing. If the entity should be available even without that field, then that field has to be nullable, and that's a decision made in the schema design, not in the router or Rover.

Where I (personally) think non-nullability creates issues is that it should not be a server-side decision: the client should be able to decide which data it can live without. So we're looking at the ongoing work around client controlled nullability.

This discussion goes beyond the scope of what we are doing in the router, which follows the consensus around the GraphQL specification and federation. So if this has to change, you should start by opening an issue on https://github.com/apollographql/federation and proposing updates to the nullability behaviour.

Mithras · 2022-07-01T16:10:12Z

> It cannot be guaranteed, in the same way that an entity's field in a monolithic service could fail for various reasons (missing data, can't find file, unresponsive DB, etc).

The Router doesn't fail. It returns null without any reason. There is a reason services return errors instead of just null.

This isn't going anywhere, but let me give it another good try.
Back to our example:

Subgraph A is guaranteed to return propertyA for all entities it knows about, so it defines this:

```graphql
type MyEntity @key(fields: "id") {
  id: Int!
  propertyA: String!
}
```

Is this wrong? I don't think so. propertyA is not nullable as far as Subgraph A is concerned.

Subgraph B is guaranteed to return propertyB for all entities it knows about, so it defines this:

```graphql
type MyEntity @key(fields: "id") {
  id: Int!
  propertyB: String!
}
```

Is this wrong? I don't think so. propertyB is not nullable as far as Subgraph B is concerned.

Then Apollo composes them into:

```graphql
type MyEntity @key(fields: "id") {
  id: Int!
  propertyA: String!
  propertyB: String!
}
```

Is this correct? I don't think so. There is no way to guarantee that neither propertyA nor propertyB is null.
There are only two possible ways of querying MyEntity, and both can result in a null response:

1. Subgraph A resolves id and propertyA, then { _entities(representations: [{ __typename: "MyEntity", id: XXX }]){ propertyB } } is sent to Subgraph B, which may or may not return null. When it returns null, the composed schema is wrong.
2. Subgraph B resolves id and propertyB, then { _entities(representations: [{ __typename: "MyEntity", id: XXX }]){ propertyA } } is sent to Subgraph A, which may or may not return null. When it returns null, the composed schema is wrong.

Does this make sense? There is no way to guarantee that fields won't be null in the composed schema, even when all subgraphs and the Router are working exactly as they should. Expecting both Subgraph A and Subgraph B to have the exact set of entities is naïve. It's only possible if both subgraphs are synchronized with a consensus algorithm like Paxos or Raft. Nobody is going to do that for federated services in the real world.
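The scenario above can be simulated in a few lines. This is a hypothetical sketch: in-memory Maps stand in for each subgraph's data, resolveMyEntity(id) stands in for SubgraphA's myEntity root field plus the _entities fetch to SubgraphB, every queried field is assumed non-null in the composed schema, and the error message wording is modeled on graphql-js rather than taken from the Router.

```javascript
// Hypothetical simulation: in-memory Maps stand in for each subgraph's data.
// Both subgraph schemas are honest locally, yet the composed non-null field can
// still be missing when the two entity sets drift apart.
const subgraphA = new Map([[1, { propertyA: "A1" }], [2, { propertyA: "A2" }]]);
const subgraphB = new Map([[1, { propertyB: "B1" }]]); // id 2 not created here yet

// Resolves { myEntity { id propertyA propertyB } } for a given id, assuming
// every field, including myEntity itself, is non-null in the composed schema.
function resolveMyEntity(id) {
  const a = subgraphA.get(id) ?? null;
  const b = subgraphB.get(id) ?? null; // like _entities returning [null]
  const entity = { id, propertyA: a?.propertyA ?? null, propertyB: b?.propertyB ?? null };
  const missing = Object.keys(entity).find((k) => entity[k] === null);
  if (missing !== undefined) {
    // The null non-null field nullifies myEntity; myEntity is non-null too, so
    // the null reaches the root and data becomes null (plus a field error).
    return {
      data: null,
      errors: [{ message: `Cannot return null for non-nullable field MyEntity.${missing}.`, path: ["myEntity", missing] }],
    };
  }
  return { data: { myEntity: entity } };
}

console.log(JSON.stringify(resolveMyEntity(1).data));
// {"myEntity":{"id":1,"propertyA":"A1","propertyB":"B1"}}
console.log(resolveMyEntity(2).data); // null
```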

abernix · 2022-07-01T16:58:58Z

> It returns null without any reason. There is a reason services return errors instead of just null.

If the lack of errors is the sole point of contention, that is very fair and perhaps #528 is the resolution for you. It should be clear in the errors where the null was encountered even if that field is not present in the response. Let's focus on #528 if that is the sticking point.

> 2. BUG: If SubgraphB can not resolve MyEntity with id=1, the whole response is set to null.

This is the specified behavior of GraphQL. The "whole response is set to null" because the field error bubbles up to the Query type. If any object type along that path were nullable, that object would be the point where the nulling occurs. If the (implicit, when not specified) schema { query: Query } type were itself non-nullable, it would be impossible to return any data at all and only errors would be present.
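The bubbling rule described above can be sketched as follows. This is a simplified model, not graphql-js or Router code: complete, execute and the { kind, nonNull, fields } type descriptors are hypothetical, and only objects and scalars are handled (no lists). It also records a field error with a path, as the GraphQL specification asks for when a null is propagated.

```javascript
// Simplified model of GraphQL null propagation (not graphql-js or Router code).
// Types: { kind: "scalar" | "object", nonNull: boolean, fields?: {...} }.
class NullBubble extends Error {}

function complete(type, value, path, errors) {
  if (value === null || value === undefined) {
    if (type.nonNull) {
      // A null in a non-null position is a field error...
      errors.push({ message: `Cannot return null for non-nullable field ${path.join(".")}`, path });
      throw new NullBubble(); // ...and the null propagates to the parent.
    }
    return null;
  }
  if (type.kind === "scalar") return value;
  const result = {};
  try {
    for (const [name, fieldType] of Object.entries(type.fields)) {
      result[name] = complete(fieldType, value[name], [...path, name], errors);
    }
  } catch (e) {
    if (!(e instanceof NullBubble) || type.nonNull) throw e; // keep bubbling
    return null; // the nearest nullable object absorbs the null
  }
  return result;
}

function execute(queryType, rootValue) {
  const errors = [];
  // The root of "data" is nullable, so a null that bubbles all the way up
  // turns the whole data entry into null.
  const data = complete({ ...queryType, nonNull: false }, rootValue, [], errors);
  return errors.length ? { data, errors } : { data };
}

const entityFields = {
  id: { kind: "scalar", nonNull: true },
  propertyA: { kind: "scalar", nonNull: true },
  propertyB: { kind: "scalar", nonNull: true },
};
// SubgraphB returned null for the entity, so propertyB is absent.
const root = { myEntity: { id: 1, propertyA: "A" } };
const query = (nonNull) => ({ kind: "object", fields: { myEntity: { kind: "object", nonNull, fields: entityFields } } });

console.log(execute(query(true), root).data); // myEntity: MyEntity!  -> null
console.log(execute(query(false), root).data); // myEntity: MyEntity  -> { myEntity: null }
```

Making myEntity nullable moves the point where the null stops, which is the schema-design lever discussed throughout this thread.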

> 2. Add an option to rover supergraph compose and Apollo Studio to automatically change non-nullable federated fields to nullable. This sounds like a relatively easy and safe feature to add.

Many users rely on behavior other than what you are describing here, but we can certainly imagine how a federated directive could help, and, as already stated, Client-Controlled Nullability might be what you want. Linting rules for the graph could also convey to your developers that non-null fields should be avoided.

Some particular requests across federated graphs do rely on non-nullability to enforce and benefit from a simplified client experience by not returning what are, to them, unstable results. This doesn't sound like it is your use case!

> 2. By "can not resolve" I mean SubgraphB doesn't know about MyEntity with id=1

If this is the case, then you should make propertyB nullable. Otherwise you will not be satisfying the contract of the schema, which says the field is non-nullable, nor the specified behavior of field errors bubbling up to the nearest nullable parent field.

> Expecting both Subgraph A and Subgraph B to have the exact set of entities is naïve.

I don't understand this. The entities are keyed by a primary key. That primary key is the way to look up the entity in both subgraphs. If one of the subgraphs can't be assured to find that entity by its primary key and you want the rest of the operation to execute in its absence then you should again specify that in the schema by removing the non-null operator. This is a schema designer's choice.

> That pretty much means that Federation can not be used with non-null fields and everybody will be forced to mark everything as nullable.
> Again, the problem is adding a non-null field to an entity will ALWAYS make some requests return { data: null }.

I find the combination of "ALWAYS" and "some requests" here quite confusing. Some requests will return data: null, sure. That is the contract with the schema and that is the specified behavior of GraphQL, right? This might be the case if you have built a subgraph that never returns the thing it's being asked for by its @key (or is particularly flaky!) and have chosen in your schema design to make such a field non-nullable via the non-null operator. If some of your requests are failing or missing the expected lookup, then change those fields to be nullable.

> we don't really see much benefit of re-validating already valid responses from subgraphs at a cost of halving throughput.
> we skip validation in Gateway for performance reasons (about 2x throughput without validation) with this yarn patch

As pointed out above, this might be a comfortable exception in your case, but it doesn't guard against mis-implementation defensively enough to still uphold the schema contract the client expects. I'm curious why you didn't decide to disable validation in the subgraphs rather than at the point closest to the client, where validation perhaps matters the most? You'd still be halving the number of validations you're doing while upholding the contract with the schema.

I can certainly understand how the behavior you're encountering isn't what you want. Feel free to respond to anything I said above, but I'm going to close this as "won't fix", because I'm certain that, aside from the error bubbling I noted above (which, again, is important to fix), this is best fixed by tooling outside of the Router: the Router merely operates on the supergraph it has been given as input, and the schema is a contract we should obey.

To be clear, that doesn't mean there isn't a good solution that lives elsewhere, just not in the Router.
You're free to transform that schema before you pass it to the Router, and I can imagine that future tooling will help with that. As suggested above, I would open an issue on the federation repository if you need something in the federation model changed, though I'll note that it sounds quite similar to apollographql/federation#860.

Thanks for opening this conversation originally!

Mithras · 2022-07-01T18:47:16Z

> If this is the case, then you should make propertyB nullable. If you do not do that then you will not be satisfying the contract of the schema which says it's not nullable and the specified bubbling behavior of field errors to parent null fields.

> If one of the subgraphs can't be assured to find that entity by its primary key and you want the rest of the operation to execute in its absence then you should again specify that in the schema by removing the non-null operator. This is a schema designer's choice.

This is the point. It is NOT POSSIBLE (at least not without a consensus algorithm in place) to satisfy the non-null contract when fields are federated. I tried to explain it in my previous comment.

If something is not possible, it shouldn't be a designer's choice; it should be enforced. So far it seems that Federation either requires all subgraphs to mark every field in Federated entities as nullable or to implement consensus across all subgraphs. Otherwise it's not correct, because it makes claims it can't guarantee. Do I need to elaborate on why it's not possible? Think of a case where an entity is added by a user. How do you add it to ALL subgraphs SIMULTANEOUSLY without consensus?

> You're free to transform that schema before you pass it to the Router and I can imagine that future tooling will help with that.

I'd like the Rover CLI and Apollo Studio to do that.

abernix · 2022-07-01T21:44:25Z

> mark every field in Federated entities as nullable

> it would require making pretty much all fields except IDs nullable

@Mithras I just want to clarify something, since both of these sentences give me the impression that you're talking about marking or making something nullable as an additional step, one that could easily be missed (which resonates with me, if I'm understanding correctly).

Nullable is the default in GraphQL. It's the addition of the ! that makes it non-nullable.

Honestly, I don't believe that's obvious about GraphQL. That's beyond the scope of this conversation, but probably worth clarifying, since it sounds like we're not seeing eye to eye on a couple of things.

I still think you're looking for tooling to help you govern the way your graph is built (e.g., linting), and I believe that Rover and Studio are great places for such tooling, as you've suggested.

Mithras · 2022-07-01T22:28:41Z

It doesn't matter whether you mark fields nullable or non-nullable. The result is the same: nullability can only be guaranteed at the subgraph level, not at the supergraph level.

Mithras · 2022-07-07T01:23:29Z

Here is a simple graphql-inspector rule in case somebody else doesn't want to get { data: null } in production:

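```javascript
// graphql-inspector custom rule: report every non-nullable field declared on a
// federated entity (an object type carrying @key) as a BREAKING change.
module.exports = ({ changes, newSchema }) => {
    const typeMap = newSchema.getTypeMap();
    for (const typeName in typeMap) {
        const type = typeMap[typeName];
        // Only object types can be entities.
        if (type[Symbol.toStringTag] !== "GraphQLObjectType") {
            continue;
        }
        // Built-in and programmatically created types have no SDL AST node.
        if (!type.astNode) {
            continue;
        }
        const directives = type.astNode.directives ?? [];
        const keyDirectives = directives.filter(x => x.name.value === "key");
        if (!keyDirectives.length) {
            continue;
        }
        // To exempt key fields, uncomment the lines below (this naive check
        // only handles single-field keys such as @key(fields: "id")).
        // const keys = keyDirectives.map(x => x.arguments.find(y => y.name.value === "fields").value.value);

        // Note: fields added via `extend type` live in type.extensionASTNodes
        // and are not checked here.
        for (const field of type.astNode.fields ?? []) {
            // if (keys.includes(field.name.value)) {
            //     continue;
            // }
            if (field.type.kind === "NonNullType") {
                changes.push({
                    criticality: { level: "BREAKING" },
                    type: "NON_NULL_FIELD_IN_FEDERATED_ENTITY",
                    message: `Non-nullable field '${field.name.value}' in federated type '${typeName}'`,
                    path: `${typeName}.${field.name.value}`
                });
            }
        }
    }
    return changes;
};
```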

carldunham · 2023-12-14T06:35:43Z

We are seeing a possibly related issue with nullable fields being skipped. A query like

```graphql
query($skip: Boolean = false) {
  viewer @skip(if: $skip) {
    login
  }
}
```

returns (with variables {"skip": true})

```json
{
  "data": null
}
```

while

```graphql
query {
  viewer @skip(if: true) {
    login
  }
}
```

returns

```json
{
  "data": {}
}
```

Per https://spec.graphql.org/October2021/#sec-Data:

> If an error was raised during the execution that prevented a valid response, the data entry in the response should be null.

but this does not seem to be an error case (if it were, the versions with and without variables would behave the same). The spec is a bit vague about what happens when there is no error but also nothing to return.
It also describes other error cases that would result in an errors array and no data field being returned.
I'm not sure how all the cases would present, but passing a variable certainly shouldn't change the ultimate shape of the response, I would think.
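For comparison, a minimal sketch of how @skip/@include are evaluated during field collection, assuming a hypothetical representation where a directive's if argument is either a literal boolean or { variable: name }. Variable values are looked up first in the provided variables and then in the operation's declared defaults, so in this sketch both operations above collect the same (empty) set of root fields and produce { data: {} }. The viewer resolver and its data are placeholders.

```javascript
// Hypothetical sketch of @skip/@include evaluation during field collection.
// A directive's "if" argument is either a literal boolean or { variable: name }.
function resolveArg(arg, variableValues, variableDefaults) {
  if (arg !== null && typeof arg === "object" && "variable" in arg) {
    // Provided variable values win; otherwise fall back to the declared default.
    return arg.variable in variableValues ? variableValues[arg.variable] : variableDefaults[arg.variable];
  }
  return arg; // literal true / false
}

function shouldIncludeField(directives, variableValues = {}, variableDefaults = {}) {
  for (const directive of directives) {
    const value = resolveArg(directive.args.if, variableValues, variableDefaults);
    if (directive.name === "skip" && value === true) return false;
    if (directive.name === "include" && value !== true) return false;
  }
  return true;
}

// Collect and resolve the root fields; skipped fields are simply absent.
function executeRoot(fields, variableValues = {}, variableDefaults = {}) {
  const data = {};
  for (const field of fields) {
    if (shouldIncludeField(field.directives, variableValues, variableDefaults)) {
      data[field.name] = field.resolve();
    }
  }
  return { data };
}

// viewer with a placeholder resolver (hypothetical data).
const viewer = (directives) => [{ name: "viewer", directives, resolve: () => ({ login: "octocat" }) }];

// viewer @skip(if: true)
console.log(executeRoot(viewer([{ name: "skip", args: { if: true } }])));
// { data: {} }

// query($skip: Boolean = false) { viewer @skip(if: $skip) { login } } with {"skip": true}
console.log(executeRoot(viewer([{ name: "skip", args: { if: { variable: "skip" } } }]), { skip: true }, { skip: false }));
// { data: {} }
```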

Mithras added raised by user triage labels Jun 24, 2022

Mithras changed the title from "Null response when federated entity is null when non-nullable field from subgraph" to "Null response when resolving a non-null field from a different subgraph" Jun 24, 2022

abernix closed this as not planned Jul 1, 2022

Mithras mentioned this issue Jul 1, 2022

[apollo-gateway] Question about nullability and cross-service joins apollographql/federation#860

Closed

Mithras mentioned this issue Jul 5, 2022

add a message in the errors field when field nullability rules trigger #1304

Closed

abernix removed the triage label Aug 12, 2022

carldunham mentioned this issue Dec 15, 2023

Inconsistent/invalid null data response when using variables #4397

Closed
