Performance of computed columns in connections #265

Closed
ferdinandsalis opened this issue Dec 6, 2016 · 11 comments

Comments

@ferdinandsalis
Contributor

I am lacking the proper terminology to express the problem correctly, but here you go.

The computed-column procedure is executed once for every row in the table, so each additional row adds significantly to the query's execution time. I am just guessing here, but could we maybe use WITH queries (CTEs) to 'inline' the procedure calls? 🙈

You can find a 'test case' here https://github.com/ferdinandsalis/computed-columns-perf

@calebmer please let me know if you were expecting something else for the issue and the test case.

@benjie
Member

benjie commented Dec 6, 2016

For the query:

{
  allPosts {
    edges {
      node {
        id
        body
        summary #computed column
      }
    }
  }
}

we could potentially solve it with a single SQL statement along the lines of:

select
  to_json(__local_0__) as value,
  json_build_object(
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <

and then combine value/computed_value with Object.assign or similar. You could go further with explicit column lists (like what's required by #151), minimising the encoding of data that isn't used, e.g. something like:

select
  json_build_object(
    'id',
    to_json(__local_0__."id"),
    'body',
    to_json(__local_0__."body"),
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <

(I'm not sure if the to_jsons inside the json_build_objects are actually required, I suspect not.)
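The "combine value/computed_value with Object.assign" step could be sketched like this (a minimal sketch only; the row shape assumes each DB row was parsed into plain objects mirroring the value and computed_value columns in the SQL above):

```javascript
// Sketch: merge the plain row JSON with the computed-column JSON.
// Each row is assumed to look like { value, computed_value }, matching
// the two columns selected in the first SQL statement above.
function mergeComputedColumns(rows) {
  return rows.map(row => Object.assign({}, row.value, row.computed_value));
}

const rows = [
  { value: { id: 1, body: 'Hello' }, computed_value: { summary: 'He...' } },
  { value: { id: 2, body: 'World' }, computed_value: { summary: 'Wo...' } },
];

const merged = mergeComputedColumns(rows);
// merged[0] → { id: 1, body: 'Hello', summary: 'He...' }
```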

@benjie
Member

benjie commented Dec 6, 2016

Also, if we were to do this we'd also be able to sort by computed columns as requested in #262 by @Valorize:

select
  json_build_object(
    'id',
    to_json(__local_0__."id"),
    'body',
    to_json(__local_0__."body"),
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "perf"."post_summary"(__local_0__) using <

(this would only work for computed columns that take no additional arguments, as the ones that take arguments cannot be expressed as static strings in our orderBy API currently)

@benjie
Member

benjie commented Dec 8, 2016

So I spent an hour digging through how PostGraphQL works, thought it might be useful to share some of my scattered learnings. I'm a GraphQL newbie, so apologies if some of this is obvious.

We start in src/postgraphql/schema/createPostGraphQLSchema, which runs the introspection query against the PG database and then passes its learnings on to createGqlSchema. Of particular interest for this issue (#265) are the _hooks defined in this file for objectTypeFieldEntries, which call out to createPgProcedureObjectTypeGqlFieldEntry, whose resolve method is what's responsible for running the extra SQL queries referenced in @ferdinandsalis' test case.

The root of the schema generation is src/graphql/schema/createGqlSchema which recurses through and creates everything that's necessary based on the learnings from the postgresql introspection. It defines the query root node.

This then loops through, creating the type for each "collection" (table) via createCollectionQueryFieldEntries.

The allPosts in the test case is generated here, which passes out to createConnectionGqlField to fetch the data.

If you spit out page when you get toward the end of createConnectionGqlField you'll get an object of the form:

{
  values: [
    {value: {id: 1, field: 'value', otherField: 'otherValue'}, cursor: [1]},
    {value: {...}, cursor: ...},
    ...
  ],
  hasNextPage: function...,
  hasPreviousPage: function...,
}

Where the value objects are the result of to_json(__local_0__).

It's only later that this data is fed back into the resolver for the computed columns (via the _hooks mentioned above), which is why postgraphql feeds the data it has already fetched back into the database in order to run the procedures against it again.
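The per-row behaviour described here (the pattern this issue is about) amounts to something like the following sketch, where runQuery is a hypothetical helper that executes SQL and returns parsed rows, not PostGraphQL's actual internals:

```javascript
// Sketch of the current behaviour: one extra SQL round trip per row to
// evaluate the computed column, i.e. N rows => N queries.
// `runQuery(sql, params)` is a hypothetical helper returning parsed rows.
async function addSummaries(runQuery, rows) {
  return Promise.all(rows.map(async row => {
    const [r] = await runQuery(
      'select "perf"."post_summary"(p) as summary from "perf"."post" as p where "id" = $1',
      [row.id]
    );
    return Object.assign({}, row, { summary: r.summary });
  }));
}
```

This is why each additional row in the connection adds a whole extra query to the request.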


The computed columns are effectively added here: https://github.com/calebmer/postgraphql/blob/c43ab8d852b4f83704571287c64662b1a3e6d915/src/graphql/schema/collection/getCollectionGqlType.ts#L81-L84; the columns are fetched later on in that file, here: https://github.com/calebmer/postgraphql/blob/c43ab8d852b4f83704571287c64662b1a3e6d915/src/graphql/schema/collection/getCollectionGqlType.ts#L108

@benjie
Member

benjie commented Dec 8, 2016

What we optimally want to do is find the list of all the fields that are required for a particular row/collection and then run them all via one query, as above. It's not yet clear to me how to go about doing that.

@ferdinandsalis
Contributor Author

@benjie thanks for sharing your learnings! This is very helpful to me.

@calebmer
Collaborator

calebmer commented Dec 9, 2016

What we optimally want to do is find the list of all the fields that are required for a particular row/collection and then run them all via one query, as above. It's not yet clear to me how to go about doing that.

Exactly. This has been on my mind for a while, because it also means we could do away with select * and just select the columns we need. It actually wouldn’t be too hard to implement this on the GraphQL side, it just becomes complex if we want to integrate it in a more general way. Some of the complexities are as follows:

  • If we don’t have all the columns, the isTypeOf checks break, since they currently do a structural check. However, this structural check could either be optimized by including a type tag in the values, or removed entirely, as currently its only purpose is to resolve the type of Nodes, which could be done in a resolveType function on the Node interface.
  • Connection procedures are tricky to do this way because things start to get recursive. If you have a procedure person_latest_posts that returns setof post and you want to get the post_summary procedure on post things get tricky.

Another less efficient but easier to implement option would be to split this query @benjie wrote effectively into two queries:

select
  to_json(__local_0__) as value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <

Then once we get that back:

select
  json_build_object(
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" in ($1)
order by "id" using <

Where $1 is the ids we got back from the first query.

Not as efficient as the optimal solution, but more efficient than what we have now. We should at least be able to optimize the most common case (returning a scalar from a procedure), and I have a PR that I’ve been sitting on which should make this easier to implement. I’ll see if I can clean it up and send it soon.
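The two-query approach above could look roughly like this on the JavaScript side (a sketch under assumptions: runQuery is a hypothetical SQL helper and the merge logic is illustrative, not PostGraphQL's actual code):

```javascript
// Sketch of the two-step fetch: one query for the base rows, then a
// single batched query for all computed columns, keyed by id.
// `runQuery(sql, params)` is a hypothetical helper returning parsed rows.
async function fetchPostsWithSummaries(runQuery) {
  // Step 1: fetch the base rows.
  const posts = await runQuery(
    'select to_json(p) as value from "perf"."post" as p order by "id"'
  );
  const ids = posts.map(row => row.value.id);

  // Step 2: ONE query for every computed column, instead of one per row.
  const computed = await runQuery(
    `select "id", json_build_object(
       'summary', "perf"."post_summary"(p)
     ) as computed_value
     from "perf"."post" as p
     where "id" = any($1)`,
    [ids]
  );

  // Merge the two result sets by id.
  const byId = new Map(computed.map(r => [r.id, r.computed_value]));
  return posts.map(row =>
    Object.assign({}, row.value, byId.get(row.value.id))
  );
}
```

Regardless of row count, this issues exactly two queries instead of N+1.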

@benjie
Member

benjie commented Dec 11, 2016

Excellent feedback @calebmer, thanks for helping us to understand.

Regarding isTypeOf, I think the type tag you suggest would be best: it's very easy to add that to the list of fields selected (select to_json(__local_0__) as value, 'perf.post' as type from ...) or to mix it into the JSON object.
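The type-tag idea could resolve a Node's type without any structural check, along these lines (a sketch only; the type field name and the map contents are assumptions for illustration, not actual PostGraphQL code):

```javascript
// Sketch: resolve a Node's GraphQL type from a type tag selected in SQL
// (e.g. `select ..., 'perf.post' as type`), replacing the structural
// isTypeOf check that breaks when not all columns are selected.
const typeTagToGqlTypeName = {
  'perf.post': 'Post',
  'perf.person': 'Person',
};

function resolveType(value) {
  const gqlTypeName = typeTagToGqlTypeName[value.type];
  if (!gqlTypeName) throw new Error('Unknown type tag: ' + value.type);
  return gqlTypeName;
}
```

A function like this could back a resolveType on the Node interface, since modern graphql-js accepts a type name string from resolveType.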

My API has some large JSON objects and arrays so your suggested modification which doesn't pipe the fetched data straight back into postgres again would be considerably better than the current solution 👍 . Some of my queries take 800ms+ locally at the moment; this should reduce their duration significantly, not least because it goes from O(N) to O(1) queries.

@valoricDe
Contributor

I would be super happy with any performance boost. My requests currently take about 1360ms for 20 items with some other nested items.

@benjie
Member

benjie commented Feb 4, 2017

I believe I've implemented the optimal solution in #340; there's still work to be done before it can be merged but the foundations are mostly there.

@benjie
Member

benjie commented Feb 4, 2017

@Valorize Are you able to try my latest stuff in #342 to see if it solves your 1360ms issue?

Check out master, install everything, then run DEBUG="" ./scripts/dev -c postgresql://localhost/YOUR_DATABASE_NAME --schema YOUR_SCHEMA_NAME and run your query in GraphiQL. Run it a few times so V8 can warm up, and take the lowest runtime (ignoring the introspection query that might run when GraphiQL boots up).

Then git checkout bg/performance-experiment-2 (the dev server should automatically restart itself) and run the query again.

You can then repeat the process with the DEBUG="" part removed so that you can count the number of SQL queries that are executed.

@benjie
Member

benjie commented Jul 9, 2017

Fixed in PostGraphQL v4 👍

@benjie benjie closed this as completed Jul 9, 2017
@benjie benjie modified the milestones: Next, 4.0 Aug 15, 2018
benjie added a commit that referenced this issue Jan 27, 2020
Fixes #812

If you define a type in a non-exposed schema, or define a type but don't grant access to it, then that type will generally not appear in your schema (if `ignoreRBAC: false`). If it does appear, chances are it's been pulled in by a SQL function or similar; if this is the case then pulling it through with no columns doesn't make any sense (and if there's no PK it can cause a crash as the type now has no fields). This PR adds attributes back on tables that haven't been granted *any* read permissions.