
Add support for attaching Postgres databases as a pluggable storage catalog, and fix many issues #111


Merged
merged 96 commits on Oct 17, 2023

Conversation

@Mytherin (Contributor) commented Oct 17, 2023

This is a giant PR that reworks most of the extension.

Pluggable Storage

This PR enables attaching Postgres databases as a pluggable catalog using the ATTACH command, similar to what is possible for SQLite databases (see duckdb/duckdb#6066).

Here is an example of how this works:

ATTACH 'dbname=postgresscanner' AS postgres (TYPE POSTGRES);
CREATE TABLE postgres.pg_tbl(id INT);
INSERT INTO postgres.pg_tbl VALUES (42), (84), (NULL);
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
└───────┘

The following operations are supported:

CREATE TABLE
CREATE TABLE postgres.pg_tbl(id INT);
INSERT INTO
INSERT INTO postgres.pg_tbl VALUES (42), (84), (NULL);
COPY
COPY (SELECT 1) TO 'file.parquet';
COPY postgres.pg_tbl FROM 'file.parquet';
SELECT
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
│     1 │
└───────┘
DESCRIBE
DESCRIBE postgres.pg_tbl;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ id          │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL  │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
DELETE
DELETE FROM postgres.pg_tbl WHERE id=1;
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
└───────┘
UPDATE
UPDATE postgres.pg_tbl SET id=id+1;
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    43 │
│    85 │
│  NULL │
└───────┘
Transactions (BEGIN, COMMIT, ROLLBACK)
BEGIN;
DELETE FROM postgres.pg_tbl;
FROM postgres.pg_tbl;
┌────────┐
│   id   │
│ int32  │
├────────┤
│ 0 rows │
└────────┘
ROLLBACK;
FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    43 │
│    85 │
│  NULL │
└───────┘
ALTER TABLE
ALTER TABLE postgres.pg_tbl ADD COLUMN name VARCHAR;
FROM postgres.pg_tbl;
┌───────┬─────────┐
│  id   │  name   │
│ int32 │ varchar │
├───────┼─────────┤
│    43 │ NULL    │
│    85 │ NULL    │
│  NULL │ NULL    │
└───────┴─────────┘
DROP
DROP TABLE postgres.pg_tbl;
FROM postgres.pg_tbl;
-- Table with name pg_tbl does not exist!
CREATE SCHEMA
CREATE SCHEMA postgres.new_schema;
CREATE TABLE postgres.new_schema.new_tbl(i INT);
INSERT INTO postgres.new_schema.new_tbl VALUES (42);
SELECT * FROM postgres.new_schema.new_tbl;
┌───────┐
│   i   │
│ int32 │
├───────┤
│    42 │
└───────┘
DROP SCHEMA postgres.new_schema CASCADE;
CREATE VIEW
CREATE VIEW postgres.v1 AS SELECT 42;
FROM postgres.v1;
┌───────┐
│  42   │
│ int32 │
├───────┤
│    42 │
└───────┘
CREATE INDEX
CREATE TABLE postgres.pg_tbl(id INT);
CREATE INDEX i_index ON postgres.pg_tbl(id);
CREATE TYPE
USE postgres;
CREATE TYPE my_point AS STRUCT(x INT, y INT);
CREATE TABLE table_with_point(point MY_POINT);
INSERT INTO table_with_point VALUES ({'x': 42, 'y': 84});
FROM table_with_point;
┌──────────────────────────────┐
│            point             │
│ struct(x integer, y integer) │
├──────────────────────────────┤
│ {'x': 42, 'y': 84}           │
└──────────────────────────────┘

Metadata caching

The metadata (e.g. which schemas exist, which tables they contain, and what the structure of those tables is) is read once and then cached. That means that if, for example, a new table is created through a different connection, the cached metadata will not know about the existence of this table. There is a new function, pg_clear_cache, that can be used to clear the metadata cache and force fresh metadata to be read from the Postgres instance.

CALL pg_clear_cache();
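
For example, if a table is created through a second connection after the database is attached, it only becomes visible after the cache is cleared. A sketch, assuming a hypothetical table fresh_tbl created from another Postgres session:

-- meanwhile, in a separate Postgres session: CREATE TABLE fresh_tbl(i INT);
FROM postgres.fresh_tbl;
-- Error: the cached metadata does not know about fresh_tbl yet
CALL pg_clear_cache();
FROM postgres.fresh_tbl;
-- now succeeds, as the metadata is re-read from the Postgres instance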

Extra Parameters/Configuration

This PR adds the following settings to the extension:

D FROM duckdb_settings() WHERE name LIKE 'pg_%';
┌─────────────────────────────────┬─────────┬────────────────────────────────────────────────────────────────────────────┬────────────┐
│              name               │  value  │                                description                                 │ input_type │
│             varchar             │ varchar │                                  varchar                                   │  varchar   │
├─────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────┼────────────┤
│ pg_debug_show_queries           │ false   │ DEBUG SETTING: print all queries sent to Postgres to stdout                │ BOOLEAN    │
│ pg_experimental_filter_pushdown │ false   │ Whether or not to use filter pushdown (currently experimental)             │ BOOLEAN    │
│ pg_connection_limit             │ 64      │ The maximum amount of concurrent Postgres connections                      │ UBIGINT    │
│ pg_array_as_varchar             │ false   │ Read Postgres arrays as varchar - enables reading mixed dimensional arrays │ BOOLEAN    │
│ pg_pages_per_task               │ 1000    │ The amount of pages per task                                               │ UBIGINT    │
│ pg_use_binary_copy              │ true    │ Whether or not to use BINARY copy to read data                             │ BOOLEAN    │
└─────────────────────────────────┴─────────┴────────────────────────────────────────────────────────────────────────────┴────────────┘
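
These settings behave like regular DuckDB settings and are changed with SET. For instance, a sketch that enables the experimental filter pushdown and lowers the task granularity (the chosen values are illustrative):

SET pg_experimental_filter_pushdown=true;
SET pg_pages_per_task=100;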

Fixes

This PR reworks the extension significantly, and also fixes many outstanding issues.

Fixes #16
Fixes #68
Fixes #69
Fixes #73
Fixes #79
Fixes #81
Fixes #82
Fixes #85
Fixes #92
Fixes #93
Fixes #94
Fixes #96
Fixes #100
Fixes #102
Supersedes #59
Supersedes #99
Supersedes #107

Decimals

There are a number of fixes for decimals, in particular:

  • Large decimals (width > 38, the maximum width that DuckDB itself supports) are now read as doubles (see the sketch below)
  • Large decimals are read directly into doubles
  • Overflows when reading decimals are fixed and are now correctly checked for; the set of decimals that is tested has been greatly expanded
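
As an illustrative sketch of the first fix, a NUMERIC column wider than DuckDB's 38-digit maximum is exposed as a DOUBLE instead of failing (the table name and values are hypothetical):

-- in Postgres
CREATE TABLE wide_decimals(d NUMERIC(45, 5));
INSERT INTO wide_decimals VALUES (1234567890123456789012345678901234567890.12345);
-- in DuckDB
ATTACH 'dbname=test' AS postgres (TYPE POSTGRES);
SELECT d FROM postgres.wide_decimals; -- column d is read as a DOUBLE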

Support for querying views

Views can now be queried, e.g.:

Postgres
CREATE VIEW v1 AS SELECT 42;
DuckDB
ATTACH 'dbname=test' AS postgres (TYPE POSTGRES);
SELECT * FROM postgres.v1;
┌──────────┐
│ ?column? │
│  int32   │
├──────────┤
│       42 │
└──────────┘

Duplicate column/table names

In Postgres, column and table names are case sensitive. Hence this is valid SQL:

CREATE TABLE mixed_names("COL" INT, "col" INT);

This conflicts with DuckDB, which is case insensitive. This PR fixes the issue by renaming conflicting columns when they are read into DuckDB, instead of throwing an error. For example:

D SELECT * FROM POSTGRES_SCAN('dbname=test', 'public', 'mixed_names');
┌───────┬───────┐
│  COL  │ col:1 │
│ int32 │ int32 │
├───────┴───────┤
│    0 rows     │
└───────────────┘

Connection Pool

This PR introduces the concept of a "connection pool", which limits how many parallel connections are opened during scans. The default limit is 64, and it can be adjusted using the pg_connection_limit setting (e.g. SET pg_connection_limit=1000).
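
A minimal sketch, assuming a large hypothetical table big_tbl in the attached database:

SET pg_connection_limit=8;
-- the scan below now opens at most 8 parallel connections to Postgres
SELECT count(*) FROM postgres.big_tbl;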

Complex Types

This PR adds support for complex types, including composite types (created through CREATE TYPE), and multidimensional arrays. This support is for both ingesting data (using e.g. CREATE TABLE/INSERT) as well as for reading data.
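
A short sketch of a multidimensional array round trip, assuming multidimensional Postgres arrays map to DuckDB's nested list types (the table name is illustrative):

CREATE TABLE postgres.matrix_tbl(m INT[][]);
INSERT INTO postgres.matrix_tbl VALUES ([[1, 2], [3, 4]]);
SELECT m FROM postgres.matrix_tbl;
-- [[1, 2], [3, 4]]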

ATTACH time

While this PR does not fix the issue of pg_attach taking a long time to execute, the new ATTACH method (which should be the preferred way of attaching going forward) does not suffer from the same problem: it issues a single query to fetch the column names/types of all tables at once, instead of sending one query per table.
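
To observe this, the pg_debug_show_queries setting listed above can be used to print every query the extension sends to Postgres while attaching (a sketch; the database name is illustrative):

SET pg_debug_show_queries=true;
ATTACH 'dbname=test' AS postgres (TYPE POSTGRES);
-- the catalog queries sent during ATTACH are printed to stdout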

@Nintorac

Hey, great effort!!

Is there an ETA on when this will make it into a DuckDB release?

@Mytherin
Contributor Author

Should be in next week's 0.9.2 release

@tmontes commented Nov 14, 2023

Perfect timing! ...in the process of leading the first duckdb/PostgreSQL integration project.

Thanks a lot!
