Sql catalog #229

JanKaul · 2024-03-04T15:14:04Z

This PR implements the basic operations for a SQL catalog. The implementation uses the sqlx crate, which supports Postgres, MySQL, and SQLite.

The update_table method is to be implemented later.

JanKaul · 2024-03-04T16:40:53Z

PTAL @liurenjie1024 @Xuanwo @ZENOTME @Fokko

martin-g · 2024-03-05T21:38:21Z

crates/catalog/sql/src/catalog.rs

+        let rows = connection.transaction(|txn|{
+            let name = self.name.clone();
+            Box::pin(async move {
+            sqlx::query(&format!("select distinct table_namespace from iceberg_tables where catalog_name = '{}';",&name)).fetch_all(&mut **txn).await


Should we be concerned about SQL injection here? Or are the catalog, namespace, and table names assumed to be safe?

odysa · 2024-03-05T21:54:20Z

crates/catalog/sql/src/catalog.rs

+    }
+
+    async fn load_table(&self, identifier: &TableIdent) -> Result<Table> {
+        let metadata_location = {


Do we need to check the cache first, given that the entry is inserted later?

The cache is only intended for update_table. Optimistically assuming that the metadata_location hasn't changed since the table was loaded, the metadata and metadata_location from the cache can be used directly to perform the update. This way the database has to be queried only once in the optimistic case.
If the metadata_location has changed, the update has to be more involved.

I would not use the cache for loading tables.

I'm thinking maybe we should have a standalone data structure for caching, just like CachingCatalog in Java.
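The optimistic update path described in this thread can be sketched as a compare-and-swap on metadata_location. This is a minimal illustration, not the PR's code: `FakeStore` is a hypothetical in-memory stand-in for the iceberg_tables rows, and `CommitError` is an invented error type. In a real catalog the same check would presumably live in the WHERE clause of the UPDATE statement.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum CommitError {
    TableNotFound,
    /// Someone else committed first; `current` is what the store holds now.
    Conflict { current: String },
}

/// Hypothetical in-memory stand-in for the catalog table:
/// table name -> metadata_location.
struct FakeStore {
    rows: HashMap<String, String>,
}

impl FakeStore {
    /// Swap in `new_location` only if the stored location still equals
    /// `expected`, i.e. the value the caller cached when loading the table.
    fn commit(
        &mut self,
        table: &str,
        expected: &str,
        new_location: &str,
    ) -> Result<(), CommitError> {
        match self.rows.get_mut(table) {
            None => Err(CommitError::TableNotFound),
            Some(current) if current.as_str() == expected => {
                *current = new_location.to_string();
                Ok(())
            }
            Some(current) => Err(CommitError::Conflict {
                current: current.clone(),
            }),
        }
    }
}
```

A caller holding a stale cached location gets `Conflict` back and has to reload the metadata before retrying, which is the "more involved" path mentioned above.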

odysa · 2024-03-05T21:59:57Z

crates/catalog/sql/src/catalog.rs

+                Box::pin(async move {
+                    sqlx::query(
+                        "create table if not exists iceberg_namespace_properties (
+                                catalog_name text not null,


Use VARCHAR(255) here. Maybe we can copy the SQL from the Java Iceberg implementation?

+ CATALOG_NAME + " VARCHAR(255) NOT NULL," + NAMESPACE_NAME + " VARCHAR(255) NOT NULL," + NAMESPACE_PROPERTY_KEY + " VARCHAR(255)," + NAMESPACE_PROPERTY_VALUE + " VARCHAR(1000),"
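As a rough sketch of how the column types above could be carried over to Rust, the statement can be assembled from constants. The table name and catalog_name appear in the reviewed hunk; the other three column names are assumptions made for this illustration, and the quoted Java fragment is incomplete, so no key constraint is shown.

```rust
// Table name and `catalog_name` come from the reviewed hunk; the remaining
// column names are assumptions made for this sketch.
const NAMESPACE_PROPERTIES_TABLE: &str = "iceberg_namespace_properties";
const CATALOG_NAME: &str = "catalog_name";
const NAMESPACE_NAME: &str = "namespace";
const PROPERTY_KEY: &str = "property_key";
const PROPERTY_VALUE: &str = "property_value";

/// Build the CREATE TABLE statement with the VARCHAR widths quoted above.
fn create_namespace_properties_sql() -> String {
    format!(
        "CREATE TABLE IF NOT EXISTS {table} (\
         {catalog} VARCHAR(255) NOT NULL, \
         {namespace} VARCHAR(255) NOT NULL, \
         {key} VARCHAR(255), \
         {value} VARCHAR(1000))",
        table = NAMESPACE_PROPERTIES_TABLE,
        catalog = CATALOG_NAME,
        namespace = NAMESPACE_NAME,
        key = PROPERTY_KEY,
        value = PROPERTY_VALUE,
    )
}
```

Only identifiers defined in the source are interpolated here, never user input, so this does not reintroduce the injection concern raised elsewhere in the review.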

liurenjie1024 · 2024-03-06T01:58:30Z

crates/catalog/sql/src/catalog.rs

+        let rows = connection.transaction(|txn|{
+            let name = self.name.clone();
+            Box::pin(async move {
+            sqlx::query(&format!("select distinct table_namespace from iceberg_tables where catalog_name = '{}';",&name)).fetch_all(&mut **txn).await


Good point; it's better to use a prepared statement here.
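To illustrate the difference, here is a standard-library-only sketch that contrasts splicing the catalog name into the SQL text with keeping it as a separate parameter. `Statement` is a hypothetical type invented for this example. With sqlx, the value would be passed through `.bind(...)` on `sqlx::query(...)` instead, and the placeholder syntax depends on the database (`$1` for Postgres, `?` for MySQL and SQLite).

```rust
/// Hypothetical container: SQL text with placeholders plus its parameters.
#[derive(Debug, PartialEq)]
struct Statement {
    sql: &'static str,
    params: Vec<String>,
}

/// Unsafe: the value becomes part of the SQL text, as in the reviewed hunk.
fn interpolated(catalog_name: &str) -> String {
    format!(
        "select distinct table_namespace from iceberg_tables where catalog_name = '{}';",
        catalog_name
    )
}

/// Safer: the SQL text is fixed and the value travels separately, so the
/// database driver never parses it as SQL.
fn parameterized(catalog_name: &str) -> Statement {
    Statement {
        sql: "select distinct table_namespace from iceberg_tables where catalog_name = ?;",
        params: vec![catalog_name.to_string()],
    }
}
```

A name such as `x' or '1'='1` changes the meaning of the interpolated statement, while the parameterized text stays identical for every input.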

liurenjie1024 · 2024-03-06T02:41:41Z

crates/catalog/sql/src/catalog.rs

+        let rows = connection.transaction(|txn|{
+            let name = self.name.clone();
+            Box::pin(async move {
+            sqlx::query(&format!("select distinct table_namespace from iceberg_tables where catalog_name = '{}';",&name)).fetch_all(&mut **txn).await


There are several issues with this implementation:

If parent is None, we should list all namespaces.

We should also count the namespaces in iceberg_namespace_properties.

We should list only sub-namespaces.

See the Java implementation for reference.
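The three points above can be sketched without a database. This is a minimal illustration under assumptions: the two input slices stand in for the distinct namespace strings read from iceberg_tables and iceberg_namespace_properties, `list_namespaces` is a hypothetical free function, and "sub-namespace" is taken to mean any namespace nested strictly below the parent. Restricting the result to direct children would be a small variation.

```rust
use std::collections::BTreeSet;

/// Merge namespaces from both sources, drop duplicates, and keep either
/// everything (`parent == None`) or only namespaces nested under `parent`.
fn list_namespaces(
    from_tables: &[&str],
    from_properties: &[&str],
    parent: Option<&[&str]>,
) -> Vec<Vec<String>> {
    // A BTreeSet removes duplicates and yields a deterministic order.
    let all: BTreeSet<Vec<String>> = from_tables
        .iter()
        .chain(from_properties.iter())
        .map(|raw| raw.split('.').map(ToString::to_string).collect::<Vec<_>>())
        .collect();

    all.into_iter()
        .filter(|ns| match parent {
            None => true,
            // Strictly longer than the parent and sharing its full prefix.
            Some(p) => {
                ns.len() > p.len()
                    && ns.iter().zip(p.iter()).all(|(a, b)| a.as_str() == *b)
            }
        })
        .collect()
}
```

Note that a namespace that only has properties and no tables (such as `y.z` in the test data one might use) is still listed, which is the reason for consulting both sources.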

liurenjie1024 · 2024-03-06T02:46:38Z

crates/catalog/sql/src/catalog.rs

+                        y.table_namespace
+                            .split('.')
+                            .map(ToString::to_string)
+                            .collect::<Vec<_>>(),


How about extracting this into a common method on NamespaceIdent?
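One possible shape for such a helper, shown on a hypothetical stand-in type rather than the crate's actual NamespaceIdent: `Namespace`, `from_dotted`, and `to_dotted` are names invented for this sketch, and rejecting empty levels is an assumption about the desired behavior.

```rust
/// Hypothetical stand-in for the crate's namespace identifier type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Namespace(Vec<String>);

impl Namespace {
    /// Parse a dot-separated string such as "a.b" into its levels.
    /// Returns None if any level is empty (e.g. "" or "a..b").
    fn from_dotted(raw: &str) -> Option<Self> {
        let levels: Vec<String> = raw.split('.').map(ToString::to_string).collect();
        if levels.iter().any(String::is_empty) {
            None
        } else {
            Some(Namespace(levels))
        }
    }

    /// Encode the levels back into the dot-separated form stored in SQL.
    fn to_dotted(&self) -> String {
        self.0.join(".")
    }
}
```

Keeping both directions next to each other makes it harder for the encoding used when writing rows to drift from the one used when reading them.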

liurenjie1024 · 2024-03-06T02:54:20Z

crates/catalog/sql/src/catalog.rs

+    }
+
+    async fn load_table(&self, identifier: &TableIdent) -> Result<Table> {
+        let metadata_location = {


Also, the insertion should not be blind; we need to check its version first. My suggestion is to remove the cache for now so that things don't get too complicated.

liurenjie1024 · 2024-03-06T03:13:54Z

crates/iceberg/src/spec/table_metadata.rs

@@ -44,21 +49,27 @@ pub(crate) static INITIAL_SEQUENCE_NUMBER: i64 = 0;
 /// Reference to [`TableMetadata`].
 pub type TableMetadataRef = Arc<TableMetadata>;

-#[derive(Debug, PartialEq, Deserialize, Eq, Clone)]
+#[derive(Debug, PartialEq, Deserialize, Eq, Clone, TypedBuilder)]


We had a discussion in this PR about the table metadata builder. I have concerns about this derived builder, since it's error-prone and not easy to review. TableMetadataBuilder will be heavily used by the transaction API, and we will need to do a lot of checks in it. I would suggest maintaining this struct manually. What do you think?
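For illustration, a hand-written builder makes the validation explicit in `build`, which a derived builder would not do on its own. This is only a sketch: `Metadata` and `MetadataBuilder` are hypothetical stand-ins with three example fields, not the crate's TableMetadata, and the specific checks are assumptions.

```rust
/// Hypothetical, heavily reduced stand-in for table metadata.
#[derive(Debug, PartialEq)]
struct Metadata {
    format_version: u8,
    location: String,
    last_sequence_number: i64,
}

/// Hand-written builder: required fields are Options until validated.
#[derive(Default)]
struct MetadataBuilder {
    format_version: Option<u8>,
    location: Option<String>,
    last_sequence_number: i64, // defaults to 0, the initial sequence number
}

impl MetadataBuilder {
    fn format_version(mut self, version: u8) -> Self {
        self.format_version = Some(version);
        self
    }

    fn location(mut self, location: impl Into<String>) -> Self {
        self.location = Some(location.into());
        self
    }

    /// All invariants are checked in one reviewable place.
    fn build(self) -> Result<Metadata, String> {
        let format_version = self.format_version.ok_or("format_version is required")?;
        if !(1..=2).contains(&format_version) {
            return Err(format!("unsupported format version {format_version}"));
        }
        let location = self
            .location
            .filter(|l| !l.is_empty())
            .ok_or("location is required")?;
        Ok(Metadata {
            format_version,
            location,
            last_sequence_number: self.last_sequence_number,
        })
    }
}
```

The trade-off is more boilerplate per field in exchange for checks that reviewers can read directly.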

liurenjie1024 · 2024-04-16T14:19:22Z

cc @JanKaul Is this PR ready for review, or do you need to make more updates?

JanKaul · 2024-04-17T05:24:08Z

I have to add a couple more changes. I'll notify you when I'm finished.

himadripal · 2024-04-18T04:55:13Z

@JanKaul WDYT? I think this PR is ready for review, I can add the update and delete in a separate PR.

liurenjie1024 · 2024-04-25T03:33:16Z

Cool, I'll take a look first.

JanKaul · 2024-04-25T08:51:21Z

Thank you all for your helpful comments. I think the PR is ready for review again.

@liurenjie1024 @sdd @odysa @ZENOTME @martin-g

liurenjie1024

Thanks @JanKaul for this PR, this is a huge step forward! I think there are some places we can improve a little to make it more robust.

liurenjie1024 · 2024-04-25T10:45:01Z

crates/catalog/sql/src/catalog.rs

+    name: String,
+    connection: AnyPool,
+    storage: FileIO,
+    cache: Arc<DashMap<TableIdent, (String, TableMetadata)>>,


I'm hesitant to add a cache here. Maybe we can add something like CachingCatalog in Java, so that all catalog implementations can benefit from it?

liurenjie1024 · 2024-04-25T14:33:14Z

crates/catalog/sql/src/catalog.rs

+        Ok(table)
+    }
+
+    async fn create_table(


There are some things missing here:

We should first check that the namespace exists.

The location is optional; when it is omitted, a subdirectory of the warehouse should be used as the location.

I would suggest referring to the Python implementation.
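The two missing checks can be sketched as a small pure function. Everything here is an assumption for illustration: `resolve_table_location` and `CreateError` are hypothetical names, the set of known namespaces stands in for a database lookup, and the default layout (warehouse, then one directory per namespace level, then the table name) is one plausible choice rather than the layout used by any particular implementation.

```rust
use std::collections::HashSet;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum CreateError {
    NamespaceNotFound(String),
}

/// Check the namespace first, then pick the explicit location if given,
/// otherwise derive one below the warehouse root.
fn resolve_table_location(
    known_namespaces: &HashSet<String>,
    warehouse: &str,
    namespace: &str,
    table: &str,
    location: Option<&str>,
) -> Result<String, CreateError> {
    if !known_namespaces.contains(namespace) {
        return Err(CreateError::NamespaceNotFound(namespace.to_string()));
    }
    Ok(match location {
        Some(explicit) => explicit.to_string(),
        // Assumed default layout: <warehouse>/<ns level>/.../<table>.
        None => format!(
            "{}/{}/{}",
            warehouse.trim_end_matches('/'),
            namespace.replace('.', "/"),
            table
        ),
    })
}
```

Doing the namespace check before anything is written avoids leaving metadata files behind for a table that can never be registered.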

JanKaul requested review from liurenjie1024, Fokko and Xuanwo March 4, 2024 16:41

sdd reviewed Mar 5, 2024

crates/catalog/sql/Cargo.toml (outdated)

sdd reviewed Mar 5, 2024

crates/catalog/sql/src/catalog.rs (outdated)

sdd reviewed Mar 5, 2024

crates/catalog/sql/src/catalog.rs

martin-g reviewed Mar 5, 2024

crates/catalog/sql/src/error.rs (outdated)

odysa reviewed Mar 5, 2024

liurenjie1024 mentioned this pull request Mar 7, 2024

Discussion: Design of TableMetadataBuilder. #232

Open

liurenjie1024 reviewed Mar 11, 2024

liurenjie1024 linked an issue Mar 11, 2024 that may be closed by this pull request

Implement Sql Catalog. #248

Open

JanKaul force-pushed the sql-catalog branch from b286678 to 307fe1d Compare March 13, 2024 10:25

JanKaul force-pushed the sql-catalog branch from 307fe1d to 042fd03 Compare April 8, 2024 18:23

Fokko added this to the 0.3.0 Release milestone Apr 24, 2024

Fokko mentioned this pull request Apr 24, 2024

Tracking issues of iceberg-rust v0.3.0 #348

Open

72 tasks

JanKaul and others added 9 commits April 25, 2024 09:38

initialize sql catalog crate

baa8855

implement basic sql catalog functionality

6a50576

fix clippy warnings

1a280b1

fix format

5c5ae7a

fix cargo sort

8a78d3a

fix ordering

c9d739d

fix ordering

3c78b27

fix connection pool issue for sql catalog

34d5a37

use prepared statements

f0009a9

hpal and others added 10 commits April 25, 2024 09:38

move the sqllite database creation part inside test case.

2fde285

run test

7479c70

style fix and few additional logic removal and add todo check.

473a9f5

rebase

5fb887b

create sqlconfig, fix rest of the tests and remove todo

46c9afe

Merge pull request #1 from himadripal/sql-catalog-conn-pool-fix

bdf64c0

fix connection pool issue for sql catalog

Merge pull request #2 from himadripal/sql-catalog-conn-pool-fix-2

8e2dfdc

create sqlconfig, fix rest of the tests and remove todo

use varchar

bda84da

use static strings for sql identifiers

5c050f7

fix namespace encoding

9893068

JanKaul force-pushed the sql-catalog branch from b2cbc5d to 9893068 Compare April 25, 2024 07:38

JanKaul and others added 2 commits April 25, 2024 09:40

fix typo

2e6f5d1

Fix heading

6e79b37

Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>

liurenjie1024 reviewed Apr 25, 2024

JanKaul and others added 13 commits April 26, 2024 20:43

Update crates/catalog/sql/src/catalog.rs

b450245

Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>

rename fileio

7929eb3

simplify conversion from string to NamespaceIdent

0d67077

use uri for database connection

2e3d708

use features for sqlx database support

e7c4148

use statics for pool default values

d9df683

move derive Default

0012fc7

create new tempdir for every test run

e3c464e

use table record type

3aea7c1

use parent for list namespaces

b0a34cd

fix typo

6a6479c

fix clippy warning

3b78264

use statics for namespace sql

d5e1b7a
