
Humbly Checking for Status Updates #10

Open
WonderPanda opened this issue Sep 30, 2020 · 15 comments
@WonderPanda commented Sep 30, 2020

First off, thanks for all the incredible things you do for the open source community @benjie. You're definitely an inspiration on multiple levels. Hopefully this issue doesn't come across as a nuisance, so please feel free to close it if this just isn't on your radar right now.

I went down a crazy rabbit hole today trying to find some kind of tool I could use inside our repo to apply consistent formatting to Postgres SQL migrations and a bunch of more complex Postgres functions that we manage. I'm spoiled into assuming that these kinds of tools are readily accessible in the Node ecosystem, but there don't seem to be many good options out there. This project definitely seems the most promising, both because of its integration with Prettier and because of the architectural approach you're taking.

I was just wondering whether you plan to revisit and continue work on this, or whether it's on the back burner because of your other, higher-profile projects (totally understandable). Anyway, thanks again, and a huge thumbs up: if this ever got released, it would definitely help out both the org I work for and me personally!

@benjie (Owner) commented Sep 30, 2020

Thanks for your kind words ❤️

This project eats at me. I really, really want to implement it. But time and again, whatever approach I take, it becomes clear that I need to build a new PostgreSQL parser to solve it. None of the existing ones will do: they're either not sufficiently accurate, or they're based on Postgres' own parser, which (unintuitively) has a wide range of issues that make it unsuitable for this task.

It's very frustrating. I pick at it in my hobby hours from time to time, and I'm pulling on a few threads in private. I've got some interesting ideas that might turn out great, or might not turn into anything (except maybe some pull requests to Postgres' docs...). Mostly I need to invest some time in learning parsing/lexing properly, rather than the dabbling I've done in the past.

I've been considering crowd-funding some work on it (e.g. via Kickstarter), but I'm not 100% sure I can deliver, and I'm not sure there's sufficient demand for the fundraise to be successful either; so right now it remains a hobby project that gently eats at my soul.

FWIW the best formatter I’m aware of is pgFormatter.

@WellspringCS

@benjie it pains me to think of this unfinished task eating away at you. I'm watching this project because I'm keen to see a good solution. But please don't lose sleep on it until people (like myself!) are paying you to do it. :)

@karlhorky

Would any of the newer parsing libraries like sql-surveyor offer the accuracy / capabilities that you would be looking for?

@benjie (Owner) commented Nov 19, 2020

Does it do deparsing too? If not, someone has to write code that converts the AST back to SQL for every single possible AST node, which is a lot of work, as I'm sure you can imagine. Also, if it's not PostgreSQL-specific, it's unlikely to have full coverage of the PostgreSQL features that many of us use; but I'd certainly like to see the results of running it against some decent schemas. A good starting point might be Graphile Starter's schema: https://github.com/graphile/starter/blob/main/data/schema.sql
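To illustrate what "deparsing" means here, a minimal sketch in TypeScript: a walker that turns an AST back into SQL text. The node shapes below are a hypothetical toy, not sql-surveyor's or any real parser's output; a real deparser needs a case for every node PostgreSQL can produce, which is the "lot of work" being described.

```typescript
// Toy AST for a tiny subset of SQL (hypothetical shapes for illustration only).
type Expr =
  | { kind: "literal"; value: string }
  | { kind: "ident"; name: string }
  | { kind: "call"; fn: string; args: Expr[] };

type Stmt = { kind: "select"; columns: Expr[] };

// A deparser walks the AST and emits SQL text. Every node kind needs its
// own printing rule; omit one and that construct cannot round-trip.
function deparseExpr(e: Expr): string {
  switch (e.kind) {
    case "literal":
      return e.value;
    case "ident":
      return e.name;
    case "call":
      return `${e.fn}(${e.args.map(deparseExpr).join(", ")})`;
  }
}

function deparse(s: Stmt): string {
  return `SELECT ${s.columns.map(deparseExpr).join(", ")};`;
}

// One statement from the schema discussed in this thread, as a toy AST:
const stmt: Stmt = {
  kind: "select",
  columns: [
    {
      kind: "call",
      fn: "pg_catalog.set_config",
      args: [
        { kind: "literal", value: "'search_path'" },
        { kind: "literal", value: "''" },
        { kind: "literal", value: "false" },
      ],
    },
  ],
};

console.log(deparse(stmt));
// SELECT pg_catalog.set_config('search_path', '', false);
```

A formatter is essentially this deparser plus layout decisions (line breaks, indentation), which is why a parser without a matching printer only solves half the problem.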

@karlhorky commented Nov 19, 2020

Deparsing (I guess aka printing? Or maybe I'm misunderstanding...) is not one of its goals, so that would need to be built. (Unless I'm misunderstanding, I guess pretty-printing usually makes up the majority of the code of a Prettier plugin anyway.)

But maybe @mtriff or @modeldba (maintainers at https://github.com/modeldba/sql-surveyor) could shed more light here on how this could be done and how difficult it would be.

Here's the output of that Graphile schema when parsed by sql-surveyor using SQLDialect.PLpgSQL: file.log

Excerpt:

ParsedSql {
  parsedQueries: {
    '163': ParsedQuery {
      outputColumns: [],
      referencedColumns: [],
      referencedTables: {},
      tokens: {
        '163': Token {
          value: 'SET',
          location: TokenLocation {
            lineStart: 8,
            lineEnd: 8,
            startIndex: 163,
            stopIndex: 165
          }
        },
        '167': Token {
          value: 'statement_timeout',
          location: TokenLocation {
            lineStart: 8,
            lineEnd: 8,
            startIndex: 167,
            stopIndex: 183
          }
        },
        '185': Token {
          value: '=',
          location: TokenLocation {
            lineStart: 8,
            lineEnd: 8,
            startIndex: 185,
            stopIndex: 185
          }
        },
        '187': Token {
          value: '0',
          location: TokenLocation {
            lineStart: 8,
            lineEnd: 8,
            startIndex: 187,
            stopIndex: 187
          }
        },
        '188': Token {
          value: ';',
          location: TokenLocation {
            lineStart: 8,
            lineEnd: 8,
            startIndex: 188,
            stopIndex: 188
          }
        }
      },
      query: 'SET statement_timeout = 0',
      queryType: 'DDL',
      queryLocation: TokenLocation {
        lineStart: 8,
        lineEnd: 8,
        startIndex: 163,
        stopIndex: 187
      },
      queryErrors: [],
      subqueries: {},
      commonTableExpressions: {}
    },
    '190': ParsedQuery {
      outputColumns: [],
      referencedColumns: [],
      referencedTables: {},
      tokens: {
        '190': Token {
          value: 'SET',
          location: TokenLocation {
            lineStart: 9,
            lineEnd: 9,
            startIndex: 190,
            stopIndex: 192
          }
        },
        '194': Token {
          value: 'lock_timeout',
          location: TokenLocation {
            lineStart: 9,
            lineEnd: 9,
            startIndex: 194,
            stopIndex: 205
          }
        },
        '207': Token {
          value: '=',
          location: TokenLocation {
            lineStart: 9,
            lineEnd: 9,
            startIndex: 207,
            stopIndex: 207
          }
        },
        '209': Token {
          value: '0',
          location: TokenLocation {
            lineStart: 9,
            lineEnd: 9,
            startIndex: 209,
            stopIndex: 209
          }
        },
        '210': Token {
          value: ';',
          location: TokenLocation {
            lineStart: 9,
            lineEnd: 9,
            startIndex: 210,
            stopIndex: 210
          }
        }
      },
      query: 'SET lock_timeout = 0',
      queryType: 'DDL',
      queryLocation: TokenLocation {
        lineStart: 9,
        lineEnd: 9,
        startIndex: 190,
        stopIndex: 209
      },
      queryErrors: [],
      subqueries: {},
      commonTableExpressions: {}
    },
    // ...
  }
}

@karlhorky

Or if by deparsing you mean getting the original SQL representation back as part of the parsed data structure, I guess that is happening (both in the Token.value fields and in the ParsedQuery.query field).

@mtriff commented Nov 19, 2020

Jumping into this conversation a bit blind, so I'm not totally familiar with this project and its goals.

Under the hood, sql-surveyor uses this ANTLR4 grammar to parse PL/pgSQL.

I'm assuming that the deparsing requirement is in support of this from the README:

Where a statement can be expressed in many orders, express it in the same order that pg_dump would

Deparsing like this isn't really a goal of sql-surveyor, but might be possible to build from the output if some additional metadata were added (i.e. specifying which tokens are reserved keywords, identifying expressions, etc.). But, like Benjie said, still a lot of work regardless.

@benjie (Owner) commented Nov 20, 2020

Thanks @karlhorky and @mtriff.

Looking at file.log, the first issue I see is that comments (not the COMMENT statement, but literal -- comments) are not represented in the AST; without them we cannot read a file, convert it to an AST, and write it back out without losing user content (which is what Prettier needs to do; it's a code formatter). Perhaps support for comments wouldn't be too complex to add to the grammar? I'm very new to parsers/lexers and have never worked with an ANTLR4 grammar before.
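The round-trip requirement above can be made concrete with a small sketch: before a formatter can reattach comments to AST nodes, it at least needs them extracted with positions. This is a deliberately naive illustration (a real SQL lexer must know that `--` inside a string literal is not a comment, handle `/* */` blocks, etc.), not an implementation.

```typescript
// A formatter must parse -> print without losing user content, so comments
// need to be captured somewhere, e.g. with their source positions, so they
// can be attached to nearby AST nodes. Naive sketch: line comments only.
interface SqlComment {
  line: number; // 1-based source line
  text: string; // the comment including the "--" marker
}

function extractLineComments(sql: string): SqlComment[] {
  const out: SqlComment[] = [];
  sql.split("\n").forEach((lineText, i) => {
    const idx = lineText.indexOf("--");
    // Caveat: this wrongly fires on "--" inside string literals; a real
    // lexer tracks quoting state. Good enough to show the data we need.
    if (idx !== -1) out.push({ line: i + 1, text: lineText.slice(idx) });
  });
  return out;
}

const sql = [
  "-- Reset timeouts before running migrations",
  "SET statement_timeout = 0;",
  "SET lock_timeout = 0; -- belt and braces",
].join("\n");

console.log(extractLineComments(sql));
// [ { line: 1, text: '-- Reset timeouts before running migrations' },
//   { line: 3, text: '-- belt and braces' } ]
```

If the parser's AST drops this information (as the file.log output suggests sql-surveyor does), the formatter would emit the two SET statements but silently delete both comments, which is unacceptable for a Prettier plugin.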

It also seems like we'd need to use the token stream, as the digests aren't generally sufficient for our needs. For example, in the parsed SELECT pg_catalog.set_config('search_path', '', false); statement below, the selection is represented as a single columnName string, which means we couldn't cascade the formatting into the function arguments.

    '325': ParsedQuery {
      outputColumns: [
        OutputColumn {
          columnName: "pg_catalog.set_config('search_path', '', false)",
          columnAlias: null,
          tableName: null,
          tableAlias: null
        }
      ],
      referencedColumns: [],
      referencedTables: {},
      tokens: {
        '325': Token {
          value: 'SELECT',
          location: TokenLocation {
            lineStart: 13,
            lineEnd: 13,
            startIndex: 325,
            stopIndex: 330
          }
        },
        '332': Token {
          value: 'pg_catalog',
          location: TokenLocation {
            lineStart: 13,
            lineEnd: 13,
            startIndex: 332,
            stopIndex: 341
          }
        },
        '342': Token {
          value: '.',

...

        '378': Token {
          value: ')',
          location: TokenLocation {
            lineStart: 13,
            lineEnd: 13,
            startIndex: 378,
            stopIndex: 378
          }
        },
        '379': Token {
          value: ';',
          location: TokenLocation {
            lineStart: 13,
            lineEnd: 13,
            startIndex: 379,
            stopIndex: 379
          }
        }
      },
      query: "SELECT pg_catalog.set_config('search_path', '', false)",
      queryType: 'DML',
      queryLocation: TokenLocation {
        lineStart: 13,
        lineEnd: 13,
        startIndex: 325,
        stopIndex: 378
      },
      queryErrors: [],
      subqueries: {},
      commonTableExpressions: {}
    },

Alas I don't think sql-surveyor would give us many advantages over pg-query-native-latest which we're using currently.

@mtriff commented Nov 23, 2020

The comments are available in the AST, but they're loaded onto a separate channel and sql-surveyor ignores them (not relevant for the use case). You may want to try playing around with the low-level library behind sql-surveyor, antlr4ts-sql. It will give you access to the entire AST. You'll also be able to go as granular as you need on the tokens so you can get access to all the function arguments, etc.

If you go this route, you'll probably want to implement a PL/pgSQL parser directly with antlr4ts eventually, since you don't need all the other dialects supported by antlr4ts-sql. But the package will save you from configuring and generating the parser/lexer while you're experimenting.
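The "separate channel" mechanism mentioned above can be sketched without the library: ANTLR lexers tag each token with a channel number, comments and whitespace typically go to a hidden channel, and a consumer that only reads the default channel never sees them. The `Tok` type below is our own simplified model (channel numbering follows ANTLR's convention of 0 = default, 1 = hidden), not antlr4ts's actual API.

```typescript
// Simplified model of ANTLR token channels. A tool like sql-surveyor reads
// only the default channel, so hidden-channel tokens (comments) are invisible
// to it even though the lexer produced them.
const DEFAULT_CHANNEL = 0;
const HIDDEN_CHANNEL = 1;

interface Tok {
  text: string;
  channel: number;
}

const tokens: Tok[] = [
  { text: "-- set timeout", channel: HIDDEN_CHANNEL },
  { text: "SET", channel: DEFAULT_CHANNEL },
  { text: "statement_timeout", channel: DEFAULT_CHANNEL },
  { text: "=", channel: DEFAULT_CHANNEL },
  { text: "0", channel: DEFAULT_CHANNEL },
];

// What the parser sees (default channel only):
const visible = tokens
  .filter((t) => t.channel === DEFAULT_CHANNEL)
  .map((t) => t.text);
console.log(visible.join(" ")); // SET statement_timeout = 0

// What a formatter must additionally recover (hidden channel):
const hidden = tokens
  .filter((t) => t.channel === HIDDEN_CHANNEL)
  .map((t) => t.text);
console.log(hidden); // [ '-- set timeout' ]
```

So the comments are not lost at the lexing stage; a formatter built directly on the token stream could read both channels and reattach the hidden tokens during printing.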

@benjie (Owner) commented Nov 23, 2020

Awesome, thanks @mtriff. At the moment this project is deep on the back burner, as I don't have sufficient time available to work on it (when I tried before, I figured it'd take all my OSS hours for a few months to get a decent solution out; I am a beginner at parsers/lexers, after all). But I'm keeping my eyes open for better solutions, and your pointers help 👍

@bradleyayers

@benjie (Owner) commented Feb 24, 2021

@wesselvdv

@benjie It seems the issue you raised has been resolved? Are there any more major blockers?

@benjie (Owner) commented Mar 29, 2021

Time ;) See the open PR

@karlhorky

Wrote about some alternatives over here too, in case anyone wants to try some things in the meantime:
