@Mytherin Mytherin commented Oct 17, 2023

This is a giant PR that reworks most of the extension.

Pluggable Storage

This PR enables attaching Postgres databases as a pluggable catalog using the ATTACH command, similar to what is possible for SQLite databases (see duckdb/duckdb#6066).

Here is an example of how this works:

ATTACH 'dbname=postgresscanner' AS postgres (TYPE POSTGRES);
CREATE TABLE postgres.pg_tbl(id INT);
INSERT INTO postgres.pg_tbl VALUES (42), (84), (NULL);
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
└───────┘

The following operations are supported:

CREATE TABLE
CREATE TABLE postgres.pg_tbl(id INT);
INSERT INTO
INSERT INTO postgres.pg_tbl VALUES (42), (84), (NULL);
COPY
COPY (SELECT 1) TO 'file.parquet';
COPY postgres.pg_tbl FROM 'file.parquet';
SELECT
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
│     1 │
└───────┘
DESCRIBE
DESCRIBE postgres.pg_tbl;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ id          │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL  │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
DELETE
DELETE FROM postgres.pg_tbl WHERE id=1;
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
└───────┘
UPDATE
UPDATE postgres.pg_tbl SET id=id+1;
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    43 │
│    85 │
│  NULL │
└───────┘
Transactions (BEGIN, COMMIT, ROLLBACK)
BEGIN;
DELETE FROM postgres.pg_tbl;
FROM postgres.pg_tbl;
┌────────┐
│   id   │
│ int32  │
├────────┤
│ 0 rows │
└────────┘
ROLLBACK;
FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    43 │
│    85 │
│  NULL │
└───────┘
ALTER TABLE
ALTER TABLE postgres.pg_tbl ADD COLUMN name VARCHAR;
FROM postgres.pg_tbl;
┌───────┬─────────┐
│  id   │  name   │
│ int32 │ varchar │
├───────┼─────────┤
│    43 │ NULL    │
│    85 │ NULL    │
│  NULL │ NULL    │
└───────┴─────────┘
DROP
DROP TABLE postgres.pg_tbl;
FROM postgres.pg_tbl;
-- Table with name pg_tbl does not exist!
CREATE SCHEMA
CREATE SCHEMA postgres.new_schema;
CREATE TABLE postgres.new_schema.new_tbl(i INT);
INSERT INTO postgres.new_schema.new_tbl VALUES (42);
SELECT * FROM postgres.new_schema.new_tbl;
┌───────┐
│   i   │
│ int32 │
├───────┤
│    42 │
└───────┘
DROP SCHEMA postgres.new_schema CASCADE;
CREATE VIEW
CREATE VIEW postgres.v1 AS SELECT 42;
FROM postgres.v1;
┌───────┐
│  42   │
│ int32 │
├───────┤
│    42 │
└───────┘
CREATE INDEX
CREATE TABLE postgres.pg_tbl(id INT);
CREATE INDEX i_index ON postgres.pg_tbl(id);
CREATE TYPE
USE postgres;
CREATE TYPE my_point AS STRUCT(x INT, y INT);
CREATE TABLE table_with_point(point MY_POINT);
INSERT INTO table_with_point VALUES ({'x': 42, 'y': 84});
FROM table_with_point;
┌──────────────────────────────┐
│            point             │
│ struct(x integer, y integer) │
├──────────────────────────────┤
│ {'x': 42, 'y': 84}           │
└──────────────────────────────┘

Metadata caching

The metadata (e.g. which schemas exist, which tables they contain, and what their structure is) is read once and then cached. This means that if, for example, a new table is created through a different connection, the cached metadata will not know about it. A new function, pg_clear_cache, clears the metadata cache and forces the metadata to be re-read from the Postgres instance.

CALL pg_clear_cache();
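
For example, a table created through a separate connection (for instance via psql) only becomes visible after the cache is cleared. A hypothetical sketch (the table name is made up for illustration):

```sql
-- in a separate psql session: CREATE TABLE other_tbl(i INT);
FROM postgres.other_tbl;  -- fails: the cached metadata does not contain other_tbl
CALL pg_clear_cache();    -- clear the cached metadata
FROM postgres.other_tbl;  -- succeeds: metadata is re-read from Postgres
```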

Extra Parameters/Configuration

This PR adds the following settings to the extension:

D FROM duckdb_settings() WHERE name LIKE 'pg_%';
┌─────────────────────────────────┬─────────┬────────────────────────────────────────────────────────────────────────────┬────────────┐
│              name               │  value  │                                description                                 │ input_type │
│             varchar             │ varchar │                                  varchar                                   │  varchar   │
├─────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────┼────────────┤
│ pg_debug_show_queries           │ false   │ DEBUG SETTING: print all queries sent to Postgres to stdout                │ BOOLEAN    │
│ pg_experimental_filter_pushdown │ false   │ Whether or not to use filter pushdown (currently experimental)             │ BOOLEAN    │
│ pg_connection_limit             │ 64      │ The maximum amount of concurrent Postgres connections                      │ UBIGINT    │
│ pg_array_as_varchar             │ false   │ Read Postgres arrays as varchar - enables reading mixed dimensional arrays │ BOOLEAN    │
│ pg_pages_per_task               │ 1000    │ The amount of pages per task                                               │ UBIGINT    │
│ pg_use_binary_copy              │ true    │ Whether or not to use BINARY copy to read data                             │ BOOLEAN    │
└─────────────────────────────────┴─────────┴────────────────────────────────────────────────────────────────────────────┴────────────┘

Fixes

This PR reworks the extension significantly - and also fixes many outstanding issues.

Fixes #16
Fixes #68
Fixes #69
Fixes #73
Fixes #79
Fixes #81
Fixes #82
Fixes #85
Fixes #92
Fixes #93
Fixes #94
Fixes #96
Fixes #100
Fixes #102
Supersedes #59
Supersedes #99
Supersedes #107

Decimals

There are a number of fixes for decimals, in particular:

  • Large decimals (i.e. width > 38, which is the maximum width DuckDB itself supports) are read directly into doubles
  • Overflows when reading decimals are now correctly checked for, and the set of decimals covered by the tests has been greatly expanded
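
As an illustrative sketch (the table name is hypothetical), a Postgres NUMERIC column wider than the 38 digits DuckDB supports would be surfaced as a DOUBLE:

```sql
-- in Postgres: CREATE TABLE wide_decimals(d NUMERIC(50, 10));
ATTACH 'dbname=postgresscanner' AS postgres (TYPE POSTGRES);
DESCRIBE postgres.wide_decimals;
-- the column_type of d is reported as DOUBLE, since width 50 exceeds DuckDB's maximum of 38
```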

Support for querying views

Views can now be queried, e.g.:

Postgres
CREATE VIEW v1 AS SELECT 42;
DuckDB
ATTACH 'dbname=test' AS postgres (TYPE POSTGRES);
SELECT * FROM postgres.v1;
┌──────────┐
│ ?column? │
│  int32   │
├──────────┤
│       42 │
└──────────┘

Duplicate column/table names

In Postgres, quoted column and table names are case sensitive. Hence this is valid SQL:

CREATE TABLE mixed_names("COL" INT, "col" INT);

This conflicts with DuckDB, which is case insensitive. This PR resolves the conflict by renaming the duplicate columns when they are read into DuckDB, instead of throwing an error. For example:

D SELECT * FROM POSTGRES_SCAN('dbname=test', 'public', 'mixed_names');
┌───────┬───────┐
│  COL  │ col:1 │
│ int32 │ int32 │
├───────┴───────┤
│    0 rows     │
└───────────────┘

Connection Pool

This PR introduces the concept of a "connection pool", which limits how many parallel connections will be opened during scans. The default limit is 64, and it can be adjusted using the pg_connection_limit setting (e.g. SET pg_connection_limit=1000).
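
For example, to lower the limit for a resource-constrained Postgres instance and verify the new value (a sketch using the settings described in this PR):

```sql
SET pg_connection_limit=8;
SELECT value FROM duckdb_settings() WHERE name='pg_connection_limit';
```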

Complex Types

This PR adds support for complex types, including composite types (created through CREATE TYPE) and multidimensional arrays. This support covers both ingesting data (e.g. using CREATE TABLE/INSERT) and reading data.

ATTACH time

While this PR does not fix the issue of pg_attach taking a long time to execute, the new ATTACH method (which should be the preferred way of attaching going forward) should not suffer from the same problem: it issues a single query to fetch the column names/types of all tables at once, instead of sending one query per table.

@Nintorac

Hey, great effort!!

Is there an ETA on when this will make it into a DuckDB release?

@Mytherin
Contributor Author

Should be in next week for 0.9.2

@tmontes

tmontes commented Nov 14, 2023

Perfect timing! ...in the process of leading the first duckdb/PostgreSQL integration project.

Thanks a lot!
