@Mytherin Mytherin commented Oct 17, 2023

This is a giant PR that reworks most of the extension.

Pluggable Storage

This PR enables attaching Postgres databases as a pluggable catalog using the ATTACH command, similar to what is possible for SQLite databases (see duckdb/duckdb#6066).

Here is an example of how this works:

ATTACH 'dbname=postgresscanner' AS postgres (TYPE POSTGRES);
CREATE TABLE postgres.pg_tbl(id INT);
INSERT INTO postgres.pg_tbl VALUES (42), (84), (NULL);
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
└───────┘

The following operations are supported:

CREATE TABLE
CREATE TABLE postgres.pg_tbl(id INT);
INSERT INTO
INSERT INTO postgres.pg_tbl VALUES (42), (84), (NULL);
COPY
COPY (SELECT 1) TO 'file.parquet';
COPY postgres.pg_tbl FROM 'file.parquet';
SELECT
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
│     1 │
└───────┘
DESCRIBE
DESCRIBE postgres.pg_tbl;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ id          │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL  │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
DELETE
DELETE FROM postgres.pg_tbl WHERE id=1;
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    42 │
│    84 │
│  NULL │
└───────┘
UPDATE
UPDATE postgres.pg_tbl SET id=id+1;
SELECT * FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    43 │
│    85 │
│  NULL │
└───────┘
Transactions (BEGIN, COMMIT, ROLLBACK)
BEGIN;
DELETE FROM postgres.pg_tbl;
FROM postgres.pg_tbl;
┌────────┐
│   id   │
│ int32  │
├────────┤
│ 0 rows │
└────────┘
ROLLBACK;
FROM postgres.pg_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│    43 │
│    85 │
│  NULL │
└───────┘
ALTER TABLE
ALTER TABLE postgres.pg_tbl ADD COLUMN name VARCHAR;
FROM postgres.pg_tbl;
┌───────┬─────────┐
│  id   │  name   │
│ int32 │ varchar │
├───────┼─────────┤
│    43 │ NULL    │
│    85 │ NULL    │
│  NULL │ NULL    │
└───────┴─────────┘
DROP
DROP TABLE postgres.pg_tbl;
FROM postgres.pg_tbl;
-- Table with name pg_tbl does not exist!
CREATE SCHEMA
CREATE SCHEMA postgres.new_schema;
CREATE TABLE postgres.new_schema.new_tbl(i INT);
INSERT INTO postgres.new_schema.new_tbl VALUES (42);
SELECT * FROM postgres.new_schema.new_tbl;
┌───────┐
│   i   │
│ int32 │
├───────┤
│    42 │
└───────┘
DROP SCHEMA postgres.new_schema CASCADE;
CREATE VIEW
CREATE VIEW postgres.v1 AS SELECT 42;
FROM postgres.v1;
┌───────┐
│  42   │
│ int32 │
├───────┤
│    42 │
└───────┘
CREATE INDEX
CREATE TABLE postgres.pg_tbl(id INT);
CREATE INDEX i_index ON postgres.pg_tbl(id);
CREATE TYPE
USE postgres;
CREATE TYPE my_point AS STRUCT(x INT, y INT);
CREATE TABLE table_with_point(point MY_POINT);
INSERT INTO table_with_point VALUES ({'x': 42, 'y': 84});
FROM table_with_point;
┌──────────────────────────────┐
│            point             │
│ struct(x integer, y integer) │
├──────────────────────────────┤
│ {'x': 42, 'y': 84}           │
└──────────────────────────────┘

Metadata caching

The metadata (e.g. which schemas exist, which tables they contain, and what their structure is) is read once and then cached. This means that if, for example, a new table is created through a different connection, the cached metadata will not know about it. A new function, pg_clear_cache, clears the metadata cache and forces the metadata to be re-read from the Postgres instance.

CALL pg_clear_cache();
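
For example, a table created through a separate connection (for instance via psql) only becomes visible after the cache is cleared. A hypothetical sketch (the table name is made up for illustration):

```sql
-- in a separate psql session: CREATE TABLE other_tbl(i INT);
FROM postgres.other_tbl;  -- fails: the cached metadata does not contain other_tbl
CALL pg_clear_cache();    -- clear the cached metadata
FROM postgres.other_tbl;  -- succeeds: metadata is re-read from Postgres
```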

Extra Parameters/Configuration

This PR adds the following settings to the extension:

D FROM duckdb_settings() WHERE name LIKE 'pg_%';
┌─────────────────────────────────┬─────────┬────────────────────────────────────────────────────────────────────────────┬────────────┐
│              name               │  value  │                                description                                 │ input_type │
│             varchar             │ varchar │                                  varchar                                   │  varchar   │
├─────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────┼────────────┤
│ pg_debug_show_queries           │ false   │ DEBUG SETTING: print all queries sent to Postgres to stdout                │ BOOLEAN    │
│ pg_experimental_filter_pushdown │ false   │ Whether or not to use filter pushdown (currently experimental)             │ BOOLEAN    │
│ pg_connection_limit             │ 64      │ The maximum amount of concurrent Postgres connections                      │ UBIGINT    │
│ pg_array_as_varchar             │ false   │ Read Postgres arrays as varchar - enables reading mixed dimensional arrays │ BOOLEAN    │
│ pg_pages_per_task               │ 1000    │ The amount of pages per task                                               │ UBIGINT    │
│ pg_use_binary_copy              │ true    │ Whether or not to use BINARY copy to read data                             │ BOOLEAN    │
└─────────────────────────────────┴─────────┴────────────────────────────────────────────────────────────────────────────┴────────────┘

Fixes

This PR reworks the extension significantly - and also fixes many outstanding issues.

Fixes #16
Fixes #68
Fixes #69
Fixes #73
Fixes #79
Fixes #81
Fixes #82
Fixes #85
Fixes #92
Fixes #93
Fixes #94
Fixes #96
Fixes #100
Fixes #102
Supersedes #59
Supersedes #99
Supersedes #107

Decimals

There are a number of fixes for decimals, in particular:

  • Large decimals (i.e. width > 38, which is the maximum width DuckDB itself supports) are read directly into doubles
  • Overflows when reading decimals are now correctly checked for, and the set of decimals covered by the tests has been greatly expanded
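
As an illustrative sketch (the table name is hypothetical), a Postgres NUMERIC column wider than the 38 digits DuckDB supports would be surfaced as a DOUBLE:

```sql
-- in Postgres: CREATE TABLE wide_decimals(d NUMERIC(50, 10));
ATTACH 'dbname=postgresscanner' AS postgres (TYPE POSTGRES);
DESCRIBE postgres.wide_decimals;
-- the column_type of d is reported as DOUBLE, since width 50 exceeds DuckDB's maximum of 38
```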

Support for querying views

Views can now be queried, e.g.:

Postgres
CREATE VIEW v1 AS SELECT 42;
DuckDB
ATTACH 'dbname=test' AS postgres (TYPE POSTGRES);
SELECT * FROM postgres.v1;
┌──────────┐
│ ?column? │
│  int32   │
├──────────┤
│       42 │
└──────────┘

Duplicate column/table names

In Postgres, quoted column and table names are case sensitive. Hence this is valid SQL:

CREATE TABLE mixed_names("COL" INT, "col" INT);

This conflicts with DuckDB, which is case insensitive. This PR resolves the conflict by renaming the duplicate columns when they are read into DuckDB, instead of throwing an error. For example:

D SELECT * FROM POSTGRES_SCAN('dbname=test', 'public', 'mixed_names');
┌───────┬───────┐
│  COL  │ col:1 │
│ int32 │ int32 │
├───────┴───────┤
│    0 rows     │
└───────────────┘

Connection Pool

This PR introduces the concept of a "connection pool", which limits how many parallel connections will be opened during scans. The default limit is 64, and it can be adjusted using the pg_connection_limit setting (e.g. SET pg_connection_limit=1000).
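
For example, to lower the limit for a resource-constrained Postgres instance and verify the new value (a sketch using the settings described in this PR):

```sql
SET pg_connection_limit=8;
SELECT value FROM duckdb_settings() WHERE name='pg_connection_limit';
```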

Complex Types

This PR adds support for complex types, including composite types (created through CREATE TYPE) and multidimensional arrays. This support covers both ingesting data (e.g. using CREATE TABLE/INSERT) and reading data.

ATTACH time

While this PR does not fix the issue of pg_attach taking a long time to execute, the new ATTACH method (which should be the preferred way of attaching going forward) should not suffer from the same problem: it issues a single query to fetch the column names/types of all tables at once, instead of sending one query per table.

@Nintorac

Hey, great effort!!

Is there an ETA on when this will make it into a DuckDB release?

@Mytherin
Contributor Author

Should be in next week for 0.9.2

@tmontes

tmontes commented Nov 14, 2023

Perfect timing! ...in the process of leading the first duckdb/PostgreSQL integration project.

Thanks a lot!
