
Automatically parsing subqueries as composite type #1801

Open
mitar opened this issue Jan 4, 2019 · 5 comments

@mitar commented Jan 4, 2019
So in PostgreSQL one can do something like:

SELECT _id, body, (SELECT posts FROM posts WHERE posts._id=comments.post_id) AS post FROM comments

This returns post as a composite type. Currently, this package returns:

{ _id: 'TWjntjispSBvbZQtv',
    body: { title: 'Comment title 0' },
    post: '(XKP3PeHoDRW2a7hFi,"{""title"": ""Post title 0""}")' }

Ideally, it should return something like:

{ _id: '5B4gBiAnhGY64W4GR',
    body: { title: 'Comment title 0' },
    post:
     { _id: 'yDwGwJ6x5Qc8HhotD', body: { title: 'Post title 0' } } }

I know I can achieve this with a query like:

SELECT _id, body, (SELECT to_json(posts) FROM posts WHERE posts._id=comments.post_id) AS post FROM comments

But frankly, I am not sure why I would want to convert it to JSON and back just to get it over the wire in the correct structure.

I have seen some other issues about composite types, but from my short exploration this seems doable. PostgreSQL exposes the necessary information, though it might require additional queries, which I believe could be cached.

I tried the following script:

const util = require('util');

const {Pool} = require('pg');

const CONNECTION_CONFIG = {
  user: 'postgres',
  database: 'postgres',
  password: 'pass',
};

const pool = new Pool(CONNECTION_CONFIG);

(async () => {
  await pool.query(`
    CREATE OR REPLACE FUNCTION random_id() RETURNS TEXT LANGUAGE SQL AS $$
      SELECT array_to_string(
        array(
          SELECT SUBSTRING('23456789ABCDEFGHJKLMNPQRSTWXYZabcdefghijkmnopqrstuvwxyz' FROM floor(random()*55)::int+1 FOR 1) FROM generate_series(1, 17)
        ),
        ''
      );
    $$;
    DROP TABLE IF EXISTS comments;
    DROP TABLE IF EXISTS posts;
    CREATE TABLE posts (
      _id CHAR(17) PRIMARY KEY DEFAULT random_id(),
      body JSONB NOT NULL DEFAULT '{}'::JSONB
    );
    CREATE TABLE comments (
      _id CHAR(17) PRIMARY KEY DEFAULT random_id(),
      post_id CHAR(17) NOT NULL REFERENCES posts(_id),
      body JSONB NOT NULL DEFAULT '{}'::JSONB
    );
    DELETE FROM comments;
    DELETE FROM posts;
  `);

  let result;
  for (let i = 0; i < 5; i++) {
    result = await pool.query(`
      INSERT INTO posts (body) VALUES($1) RETURNING _id;
    `, [{'title': `Post title ${i}`}]);

    const postId = result.rows[0]._id;

    for (let j = 0; j < 10; j++) {
      await pool.query(`
        INSERT INTO comments (post_id, body) VALUES($1, $2);
      `, [postId, {'title': `Comment title ${j}`}]);
    }
  }

  result = await pool.query(`
    SELECT _id, body, (SELECT posts FROM posts WHERE posts._id=comments.post_id) AS post FROM comments
  `);

  console.log(util.inspect(result.rows, {depth: null}));

  result = await pool.query(`
    SELECT attname, atttypid FROM pg_catalog.pg_attribute LEFT JOIN pg_catalog.pg_type ON (pg_attribute.attrelid=pg_type.typrelid) WHERE pg_type.oid=$1 AND attisdropped=false AND attnum>=1 ORDER BY attnum
  `, [result.fields[2].dataTypeID]);

  console.log(util.inspect(result.rows, {depth: null}));

  await pool.end();
})();

Results:

[ { _id: 'BgCnoip8HiNt5Lant',
    body: { title: 'Comment title 0' },
    post: '(hjZPGf2ykyeYgFhCe,"{""title"": ""Post title 0""}")' },
  ...,
  { _id: 'Rh3cc3qGHN7v3Seq3',
    body: { title: 'Comment title 9' },
    post: '(zLLeSQFyt844ZX2a6,"{""title"": ""Post title 4""}")' } ]
[ { attname: '_id', atttypid: 1042 },
  { attname: 'body', atttypid: 3802 } ]

As you can see, it is possible to get both the names and the type IDs for the values inside post. With some recursion, this could be converted, no? Information on how to parse the tuple itself is here. Then we would have to parse each individual element.
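The tuple text itself looks parseable with a small hand-written scanner. Here is a rough sketch (parseComposite is my own hypothetical helper, not part of pg), following the row-literal rules: parenthesized, comma-separated, double-quoted where needed, `""` for an embedded quote, and an empty field meaning SQL NULL:

```javascript
// Rough sketch of a parser for the composite text format shown above,
// e.g. '(XKP3PeHoDRW2a7hFi,"{""title"": ""Post title 0""}")'.
// Returns the raw field values as strings, or null for NULL fields.
// Malformed input is not handled carefully here.
function parseComposite(text) {
  if (!text.startsWith('(') || !text.endsWith(')')) {
    throw new Error(`invalid composite literal: ${text}`);
  }
  const fields = [];
  let i = 1;
  const end = text.length - 1; // index of the closing ')'
  while (i <= end) {
    if (i === end || text[i] === ',') {
      fields.push(null); // an empty field means SQL NULL
      i += 1;
      continue;
    }
    let value = '';
    if (text[i] === '"') {
      i += 1; // quoted field: "" and \" are escaped characters
      while (i < text.length) {
        if (text[i] === '"') {
          if (text[i + 1] === '"') { value += '"'; i += 2; }
          else { i += 1; break; } // closing quote
        } else if (text[i] === '\\') {
          value += text[i + 1]; i += 2;
        } else {
          value += text[i]; i += 1;
        }
      }
    } else {
      // unquoted field: runs until the next ',' or the final ')'
      while (i < end && text[i] !== ',') { value += text[i]; i += 1; }
    }
    fields.push(value);
    i += 1; // step over the ',' (or past the final ')')
  }
  return fields;
}

console.log(parseComposite('(XKP3PeHoDRW2a7hFi,"{""title"": ""Post title 0""}")'));
// → [ 'XKP3PeHoDRW2a7hFi', '{"title": "Post title 0"}' ]
```

The individual values would then still need to be run through the per-OID parsers, as discussed above.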

@noinkling (Contributor) commented May 28, 2019

I'd love to see this happen, but it seems like a lot of work due to having to query the system catalog (at what point do you do this? do you do it lazily or eagerly? do you do it automatically or require manual registration?), cache that information, and ensure the cache doesn't become stale if things change. Related discussion for a .NET driver that implemented it: npgsql/npgsql#441

It's a shame the nested type information isn't provided by the protocol itself...

Edit: here's a link to the relevant queries/business logic in that npgsql library, if it helps:

Another edit: this is how it seems to be done in Python land, much easier to follow:

@noinkling (Contributor) commented May 31, 2019

Some previous discussions for posterity:

@vitaly-t's abandoned attempt at a minimal parser (only gets the individual values as strings):

Another library in another language (Ruby this time) that's able to handle them:

I'm hoping if we pool together enough information we can figure out a decent solution, which I'm assuming at this point would live in an external package.

@vitaly-t (Contributor) commented May 31, 2019

@noinkling You are looking at it in the wrong way. The main reason I abandoned my attempts is that the tuple format is now considered a legacy, sort-of obsolete, whereas JSON+JSONB is the new recommended approach. And this is where the PostgreSQL team has been focusing since v9 of the server, to make JSON as fast as possible. Tuples are history now. You should avoid using them.

@noinkling (Contributor) commented May 31, 2019

@vitaly-t JSON only supports four primitive types: strings, a single JS-style "number" type, booleans and null. If you have data using a type that doesn't match one of those (the timestamp/date types, for example), you lose type information and the ability for the library to automatically parse those values (via pg-types). You're left with having to do your own parsing on a query-by-query basis (since you need to know the structure of the result), or using a hacky and complicated workaround like manually outputting types in the result.

And even ignoring that, there's the fact that casting to JSON in Postgres can have less than ideal results in some edge cases. For example:

=# SELECT original, to_json(original) FROM (VALUES (timestamptz '20000-01-01'), (timestamptz '500-01-01 BC')) v (original);

         original          │            to_json
───────────────────────────┼────────────────────────────────
 20000-01-01 00:00:00+00   │ "20000-01-01T00:00:00+00:00"
 0500-01-01 00:00:00+00 BC │ "0500-01-01T00:00:00+00:00 BC"

Believe it or not, those aren't valid ISO 8601 date strings:

https://en.wikipedia.org/wiki/ISO_8601#Years

To represent years before 0000 or after 9999, the standard also permits the expansion of the year representation but only by prior agreement between the sender and the receiver. An expanded year representation [±Y̲YYYY] must have an agreed-upon number of extra year digits beyond the four-digit minimum, and it must be prefixed with a + or − sign instead of the more common AD/BC (or CE/BCE) notation; by convention 1 BC is labelled +0000, 2 BC is labeled −0001, and so on.

In JS:

> new Date("20000-01-01T00:00:00+00:00")
Invalid Date
> new Date("0500-01-01T00:00:00+00:00 BC")
Invalid Date

In contrast, the date parser written for this library already supports the original formats fine (but not the JSON ones). In order to use that parser, you could keep the original format in JSON by casting to text first, but suffice it to say that it can make certain queries significantly more complicated/verbose/ugly, and you still have the first issue.

If composite types/tuples/row values/records (whatever they're called) were parsed properly I wouldn't have to worry about any of that.

@mitar (Author) commented May 31, 2019

the tuple format is now considered a legacy, sort-of obsolete, whereas JSON+JSONB is the new recommended approach

Can you provide any support for this claim? I have not seen that anywhere.

Moreover, JSON does not support many types of values. For example, Infinity and NaN cannot be transported with standard JSON (and PostgreSQL emits strict standard JSON).

And this is where PostgreSQL team has been focusing after v9 of the server, to make JSON as fast as possible.

That is true. JSON is really fast. Even more, parsing JSON on the Node side is much faster than anything else because it is so heavily optimized in Node.js. I made some benchmarks to confirm that.

On the other hand, parsing JSON is recursive, and you do not know what you are getting until you get it, so you have to scan and then convert. With the tuple approach, PostgreSQL gives you all the information about the structure in advance, so at least in theory you should be able to parse the object directly, knowing what is where, and just map values without first having to scan what each token is. I have seen this done in some projects, but I cannot find a reference now.
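To sketch that "structure known in advance" idea: with the attname/atttypid pairs from the pg_attribute query in my script above, mapping the raw text fields of a tuple to a typed object is a direct positional walk. buildObject and typeParsers are hypothetical stand-ins (typeParsers would really be pg-types, and only the two OIDs used in this issue are covered here):

```javascript
// Hypothetical sketch: attrs comes from the pg_attribute metadata,
// rawFields from splitting the tuple text into per-field strings.
const typeParsers = {
  1042: v => v,             // bpchar (CHAR(17)): keep as string
  3802: v => JSON.parse(v), // jsonb
};

function buildObject(attrs, rawFields) {
  const obj = {};
  attrs.forEach((attr, i) => {
    const raw = rawFields[i];
    // NULL fields pass through; everything else is parsed by its type OID
    obj[attr.attname] = raw === null ? null : typeParsers[attr.atttypid](raw);
  });
  return obj;
}

// e.g. with the metadata printed by the script above:
const attrs = [
  { attname: '_id', atttypid: 1042 },
  { attname: 'body', atttypid: 3802 },
];
console.log(buildObject(attrs, ['XKP3PeHoDRW2a7hFi', '{"title": "Post title 0"}']));
// → { _id: 'XKP3PeHoDRW2a7hFi', body: { title: 'Post title 0' } }
```

No scanning is needed at this level: the structure and the parser for each position are known before the first byte of the value is looked at.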

And if messaging were done in something like Cap'n Proto, parsing would take zero time, instead of the current copying from one in-memory representation to another.
