Performance of computed columns in connections #265

Closed
ferdinandsalis opened this issue Dec 6, 2016 · 11 comments

Comments

@ferdinandsalis
Contributor

I am lacking the proper terminology to express the problem correctly, but here you go.

The computed-column procedure is executed once for every row in the table, so each additional row adds significantly to the query's execution time. I am just guessing here, but could we maybe use WITH queries (CTEs) to 'inline' the procedure calls? 🙈

You can find a 'test case' here https://github.com/ferdinandsalis/computed-columns-perf

@calebmer please let me know if you were expecting something else for the issue and the test case.

@benjie
Member

benjie commented Dec 6, 2016

For the query:

{
  allPosts {
    edges {
      node {
        id
        body
        summary #computed column
      }
    }
  }
}

we could potentially solve it with a single SQL statement along the lines of:

select
  to_json(__local_0__) as value,
  json_build_object(
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <

and then combine value/computed_value with Object.assign or similar. You could go further with explicit column lists (like what's required by #151), minimising the encoding of data that isn't used, e.g. something like:

select
  json_build_object(
    'id',
    to_json(__local_0__."id"),
    'body',
    to_json(__local_0__."body"),
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <

(I'm not sure if the to_jsons inside the json_build_objects are actually required, I suspect not.)
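The "combine value/computed_value with Object.assign" step could be sketched like this (a minimal sketch only; the row shape assumes each DB row was parsed into plain objects mirroring the value and computed_value columns in the SQL above):

```javascript
// Sketch: merge the plain row JSON with the computed-column JSON.
// Each row is assumed to look like { value, computed_value }, matching
// the two columns selected in the first SQL statement above.
function mergeComputedColumns(rows) {
  return rows.map(row => Object.assign({}, row.value, row.computed_value));
}

const rows = [
  { value: { id: 1, body: 'Hello' }, computed_value: { summary: 'He...' } },
  { value: { id: 2, body: 'World' }, computed_value: { summary: 'Wo...' } },
];

const merged = mergeComputedColumns(rows);
// merged[0] → { id: 1, body: 'Hello', summary: 'He...' }
```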

@benjie
Member

benjie commented Dec 6, 2016

Also, if we were to do this we'd also be able to sort by computed columns as requested in #262 by @Valorize:

select
  json_build_object(
    'id',
    to_json(__local_0__."id"),
    'body',
    to_json(__local_0__."body"),
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" is not null
order by "perf"."post_summary"(__local_0__) using <

(this would only work for computed columns that take no additional arguments, as the ones that take arguments cannot be expressed as static strings in our orderBy API currently)

@benjie
Member

benjie commented Dec 8, 2016

So I spent an hour digging through how PostGraphQL works, thought it might be useful to share some of my scattered learnings. I'm a GraphQL newbie, so apologies if some of this is obvious.

We start in src/postgraphql/schema/createPostGraphQLSchema, which runs the introspection query against the PG database and then passes its learnings on to createGqlSchema. Of particular interest for this issue (#265) are the _hooks defined in this file for objectTypeFieldEntries, which call out to createPgProcedureObjectTypeGqlFieldEntry, whose resolve method is what's responsible for running the extra SQL queries referenced in @ferdinandsalis' test case.

The root of the schema generation is src/graphql/schema/createGqlSchema which recurses through and creates everything that's necessary based on the learnings from the postgresql introspection. It defines the query root node.

This then loops through, creating the type for each "collection" (table) via createCollectionQueryFieldEntries.

The allPosts in the test case is generated here, which passes out to createConnectionGqlField to fetch the data.

If you spit out page when you get toward the end of createConnectionGqlField you'll get an object of the form:

{
  values: [
    {value: {id: 1, field: 'value', otherField: 'otherValue'}, cursor: [1]},
    {value: {...}, cursor: ...},
    ...
  ],
  hasNextPage: function...,
  hasPreviousPage: function...,
}

Where the value objects are the result of to_json(__local_0__).

It's only later that this data is fed back into the resolver for the computed columns (via the _hooks mentioned above), which is why postgraphql feeds the data it has already fetched back into the database in order to run the procedures against it again.
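The per-row behaviour described here (the pattern this issue is about) amounts to something like the following sketch, where runQuery is a hypothetical helper that executes SQL and returns parsed rows, not PostGraphQL's actual internals:

```javascript
// Sketch of the current behaviour: one extra SQL round trip per row to
// evaluate the computed column, i.e. N rows => N queries.
// `runQuery(sql, params)` is a hypothetical helper returning parsed rows.
async function addSummaries(runQuery, rows) {
  return Promise.all(rows.map(async row => {
    const [r] = await runQuery(
      'select "perf"."post_summary"(p) as summary from "perf"."post" as p where "id" = $1',
      [row.id]
    );
    return Object.assign({}, row, { summary: r.summary });
  }));
}
```

This is why each additional row in the connection adds a whole extra query to the request.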


The computed columns are effectively added here: https://github.com/calebmer/postgraphql/blob/c43ab8d852b4f83704571287c64662b1a3e6d915/src/graphql/schema/collection/getCollectionGqlType.ts#L81-L84; the columns are fetched later on in that file, here: https://github.com/calebmer/postgraphql/blob/c43ab8d852b4f83704571287c64662b1a3e6d915/src/graphql/schema/collection/getCollectionGqlType.ts#L108

@benjie
Member

benjie commented Dec 8, 2016

What we optimally want to do is find the list of all the fields that are required for a particular row/collection and then run them all via one query, as above. It's not yet clear to me how to go about doing that.

@ferdinandsalis
Contributor Author

@benjie thanks for sharing your learnings! This is very helpful to me.

@calebmer
Collaborator

calebmer commented Dec 9, 2016

What we optimally want to do is find the list of all the fields that are required for a particular row/collection and then run them all via one query, as above. It's not yet clear to me how to go about doing that.

Exactly. This has been on my mind for a while, because it also means we could do away with select * and just select the columns we need. It actually wouldn’t be too hard to implement this on the GraphQL side, it just becomes complex if we want to integrate it in a more general way. Some of the complexities are as follows:

  • If we don’t have all the columns, the isTypeOf checks break, since they currently do a structural check. However, this structural check could either be optimized by including a type tag in the values, or removed entirely, as currently its only purpose is to resolve the type of Nodes, which could be done in a resolveType function on the Node interface.
  • Connection procedures are tricky to do this way because things start to get recursive. If you have a procedure person_latest_posts that returns setof post and you want to get the post_summary procedure on post things get tricky.

Another less efficient but easier to implement option would be to split this query @benjie wrote effectively into two queries:

select
  to_json(__local_0__) as value
from "perf"."post" as __local_0__
where "id" is not null
order by "id" using <

Then once we get that back:

select
  json_build_object(
    'summary',
    to_json("perf"."post_summary"(__local_0__))
  ) as computed_value
from "perf"."post" as __local_0__
where "id" in ($1)
order by "id" using <

Where $1 is the ids we got back from the first query.

Not as efficient as the optimal solution, but more efficient than what we have now. We should at least be able to optimize the most common case (returning a scalar from a procedure), and I have a PR that I’ve been sitting on which should make this easier to implement. I’ll see if I can clean it up and send it soon.
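The two-query approach above could look roughly like this on the JavaScript side (a sketch under assumptions: runQuery is a hypothetical SQL helper and the merge logic is illustrative, not PostGraphQL's actual code):

```javascript
// Sketch of the two-step fetch: one query for the base rows, then a
// single batched query for all computed columns, keyed by id.
// `runQuery(sql, params)` is a hypothetical helper returning parsed rows.
async function fetchPostsWithSummaries(runQuery) {
  // Step 1: fetch the base rows.
  const posts = await runQuery(
    'select to_json(p) as value from "perf"."post" as p order by "id"'
  );
  const ids = posts.map(row => row.value.id);

  // Step 2: ONE query for every computed column, instead of one per row.
  const computed = await runQuery(
    `select "id", json_build_object(
       'summary', "perf"."post_summary"(p)
     ) as computed_value
     from "perf"."post" as p
     where "id" = any($1)`,
    [ids]
  );

  // Merge the two result sets by id.
  const byId = new Map(computed.map(r => [r.id, r.computed_value]));
  return posts.map(row =>
    Object.assign({}, row.value, byId.get(row.value.id))
  );
}
```

Regardless of row count, this issues exactly two queries instead of N+1.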

@benjie
Member

benjie commented Dec 11, 2016

Excellent feedback @calebmer, thanks for helping us to understand.

Regarding isTypeOf, I think the type tag you suggest would be best: it's very easy to add that to the list of fields selected (select to_json(__local_0__) as value, 'perf.post' as type from ...) or to mix it into the JSON object.
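The type-tag idea could resolve a Node's type without any structural check, along these lines (a sketch only; the type field name and the map contents are assumptions for illustration, not actual PostGraphQL code):

```javascript
// Sketch: resolve a Node's GraphQL type from a type tag selected in SQL
// (e.g. `select ..., 'perf.post' as type`), replacing the structural
// isTypeOf check that breaks when not all columns are selected.
const typeTagToGqlTypeName = {
  'perf.post': 'Post',
  'perf.person': 'Person',
};

function resolveType(value) {
  const gqlTypeName = typeTagToGqlTypeName[value.type];
  if (!gqlTypeName) throw new Error('Unknown type tag: ' + value.type);
  return gqlTypeName;
}
```

A function like this could back a resolveType on the Node interface, since modern graphql-js accepts a type name string from resolveType.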

My API has some large JSON objects and arrays so your suggested modification which doesn't pipe the fetched data straight back into postgres again would be considerably better than the current solution 👍 . Some of my queries take 800ms+ locally at the moment; this should reduce their duration significantly, not least because it goes from O(N) to O(1) queries.

@valoricDe
Contributor

I would be super happy with any performance boost. My requests currently take about 1360ms for 20 items with some other nested items.

@benjie
Member

benjie commented Feb 4, 2017

I believe I've implemented the optimal solution in #340; there's still work to be done before it can be merged but the foundations are mostly there.

@benjie
Member

benjie commented Feb 4, 2017

@Valorize Are you able to try my latest stuff in #342 to see if it solves your 1360ms issue?

Check out master, install everything, then run DEBUG="" ./scripts/dev -c postgresql://localhost/YOUR_DATABASE_NAME --schema YOUR_SCHEMA_NAME and run your query in GraphiQL. Run it a few times so V8 can warm up, and take the lowest runtime (ignoring the introspection query that might run when GraphiQL boots up).

Then git checkout bg/performance-experiment-2 (the dev server should automatically restart itself) and run the query again.

You can then repeat the process with the DEBUG="" part removed so that you can count the number of SQL queries that are executed.

@benjie
Member

benjie commented Jul 9, 2017

Fixed in PostGraphQL v4 👍

@benjie benjie closed this as completed Jul 9, 2017
@benjie benjie modified the milestones: Next, 4.0 Aug 15, 2018
benjie added a commit that referenced this issue Jan 27, 2020
Fixes #812

If you define a type in a non-exposed schema, or define a type but don't grant access to it, then that type will generally not appear in your schema (if `ignoreRBAC: false`). If it does appear, chances are it's been pulled in by a SQL function or similar; if this is the case then pulling it through with no columns doesn't make any sense (and if there's no PK it can cause a crash as the type now has no fields). This PR adds attributes back on tables that haven't been granted *any* read permissions.