Map semantics for tables with primary keys. #772

Closed
14 of 16 tasks
ryzhyk opened this issue Sep 25, 2023 · 5 comments
Labels
adapters Issues related to the adapters crate SQL compiler Related to the SQL compiler

Comments

@ryzhyk
Contributor

ryzhyk commented Sep 25, 2023

We currently use set semantics for all input tables, i.e., duplicate insertions and deletions are ignored and all weights are 1.
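To make the distinction concrete, here is a minimal sketch that contrasts set semantics with map (update and delete by key) semantics. It uses only standard-library collections, not DBSP types; `apply_set` and `apply_map` are hypothetical helper names invented for this illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Set semantics: a record is either present or absent; duplicates are ignored.
fn apply_set(table: &mut BTreeSet<(u64, String)>, insert: bool, rec: (u64, String)) {
    if insert {
        table.insert(rec); // no-op if the record is already present
    } else {
        table.remove(&rec); // no-op if the record is absent
    }
}

/// Map semantics: `Some(value)` inserts or overwrites, `None` deletes by key.
fn apply_map(table: &mut BTreeMap<u64, String>, key: u64, update: Option<String>) {
    match update {
        Some(value) => {
            table.insert(key, value);
        }
        None => {
            table.remove(&key);
        }
    }
}

fn main() {
    let mut set = BTreeSet::new();
    apply_set(&mut set, true, (1, "alice".to_string()));
    apply_set(&mut set, true, (1, "alice".to_string())); // duplicate, ignored
    apply_set(&mut set, true, (1, "bob".to_string())); // a second, distinct record
    println!("set: {:?}", set);

    let mut map = BTreeMap::new();
    apply_map(&mut map, 1, Some("alice".to_string()));
    apply_map(&mut map, 1, Some("bob".to_string())); // overwrites the value for key 1
    apply_map(&mut map, 2, Some("carol".to_string()));
    apply_map(&mut map, 2, None); // delete by key; the old value is not needed
    println!("map: {:?}", map);
}
```

With set semantics, changing the name for id 1 leaves two records unless the old one is deleted explicitly by its full contents; with map semantics, the key alone is enough to replace or remove it.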

In many applications, including CDC, we need to delete or update records by key. DBSP already supports this via the add_input_map method. What's missing is support in the layers built on top of it: the JIT compiler, the catalog API, and the SQL compiler.

Implementation plan:

  • New JIT compiler input nodes ([JIT] Set & Map Sources #866 )
    • Rename existing input node types: Source -> SourceZSet, SourceMap -> SourceIndexedZSet or some other choice of names that will distinguish these existing operators from the new operators (see below)
    • Add SourceSet node type backed by Circuit::add_input_set
    • Add SourceMap node type backed by Circuit::add_input_map
  • JIT deserialization demands: deserialization demands for SourceIndexedZSet and SourceMap need to define schemas for both keys and values. ([JIT] Set & Map Sources #866 )
  • JIT input handles: we currently only have JsonZSetHandle. We need to add JsonSetHandle, JsonIndexedZSetHandle and JsonMapHandle. Corresponding APIs in a separate comment below. ([JIT] Set & Map Sources #866 )
  • Catalog API: add register_input_map method to the static Rust API (Primary key support #826)
  • SQL compiler (Primary key support #826)
    • Rust mode:
      • For tables with a primary key, generate a struct definition for the primary key. It has the same form as the value type's struct but contains only the primary key columns.
      • Call Catalog::register_input_map instead of register_input_set for tables with a primary key. Example (KEY is the key struct name):
        catalog.register_input_map::<_, TABLE, KEY>("DEMOGRAPHICS", DEMOGRAPHICS.clone(), handle0);
    • JIT mode: For tables with primary keys, generate the SourceMap node instead of SourceZSet (which means the compiler will also need to create a layout for the key type). ([SQL] support for tables with primary keys in JIT #891)
  • adapters integration test
  • Test with Debezium
  • Pipeline manager integration test (Primary key support #826)
  • Docs
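As an illustration of the Rust-mode item above, the following is a rough sketch of a value struct and its key struct. The table PART(ID BIGINT NOT NULL PRIMARY KEY, NAME VARCHAR), the names `Part` and `PartKey`, and the `From` conversion are assumptions made for this example; the code the SQL compiler actually generates may look different.

```rust
/// Value type: one field per column of the hypothetical table
/// PART(ID BIGINT NOT NULL PRIMARY KEY, NAME VARCHAR).
#[derive(Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct Part {
    id: i64,
    name: Option<String>, // nullable column
}

/// Key type: declared like the value type, but restricted to the
/// primary key columns.
#[derive(Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct PartKey {
    id: i64,
}

/// Value-to-key projection (an assumption of this sketch, not a documented
/// part of the generated code).
impl From<&Part> for PartKey {
    fn from(p: &Part) -> Self {
        PartKey { id: p.id }
    }
}

fn main() {
    let part = Part { id: 7, name: Some("bolt".to_string()) };
    let key = PartKey::from(&part);
    println!("{:?} -> {:?}", part, key);
}
```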
@ryzhyk ryzhyk added SQL compiler Related to the SQL compiler JIT adapters Issues related to the adapters crate labels Sep 25, 2023
@ryzhyk ryzhyk added this to the v0.1.5 milestone Sep 25, 2023
@ryzhyk
Contributor Author

ryzhyk commented Sep 25, 2023

A more detailed summary of proposed JIT input handle API changes.

The current API in facade/handle.rs:

pub struct JsonZSetHandle {
    handle: CollectionHandle<Row, i32>,
    deserialize_fn: DeserializeJsonFn,
    vtable: &'static VTable,
    updates: Vec<(Row, i32)>,
}

The first thing we probably want to do is generalize this type so it works for all formats, not just JSON. This requires a generalized definition of DeserializeFn that hides serde_json::Value inside a JSON-specific closure.
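One possible shape for such a generalized DeserializeFn, sketched under stated assumptions: `Row` is replaced by a plain `Vec<String>`, the signature (raw bytes in, row or error message out) is a guess, and `delimited_deserializer` is a hypothetical stand-in for a format-specific constructor. A JSON version would follow the same pattern, keeping serde_json::Value inside its own closure.

```rust
/// Stand-in for the JIT's `Row`; here just a list of string fields.
type Row = Vec<String>;

/// Hypothetical format-agnostic signature: raw bytes in, a row or an error out.
type DeserializeFn = Box<dyn Fn(&[u8]) -> Result<Row, String>>;

/// Builds a deserializer for delimiter-separated records with `arity` fields.
/// All format-specific details stay inside the returned closure.
fn delimited_deserializer(delim: char, arity: usize) -> DeserializeFn {
    Box::new(move |bytes: &[u8]| -> Result<Row, String> {
        let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
        let row: Row = text.trim().split(delim).map(|f| f.trim().to_string()).collect();
        if row.len() == arity {
            Ok(row)
        } else {
            Err(format!("expected {} fields, got {}", arity, row.len()))
        }
    })
}

/// Format-independent handle: it only sees `DeserializeFn`, never the wire format.
struct ZSetHandle {
    deserialize_fn: DeserializeFn,
    updates: Vec<(Row, i32)>,
}

impl ZSetHandle {
    /// Parses one record and buffers it with the given weight.
    fn push(&mut self, bytes: &[u8], weight: i32) -> Result<(), String> {
        let row = (self.deserialize_fn)(bytes)?;
        self.updates.push((row, weight));
        Ok(())
    }
}

fn main() {
    let mut csv = ZSetHandle { deserialize_fn: delimited_deserializer(',', 2), updates: Vec::new() };
    let mut tsv = ZSetHandle { deserialize_fn: delimited_deserializer('\t', 2), updates: Vec::new() };
    csv.push(b"1,widget", 1).unwrap();
    tsv.push(b"2\tgadget", -1).unwrap();
    println!("{:?} {:?}", csv.updates, tsv.updates);
}
```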

Next, we add two more types of handles:

pub struct SetHandle {
    handle: CollectionHandle<Row, bool>,
    deserialize_fn: DeserializeFn,
    vtable: &'static VTable,
    updates: Vec<(Row, bool)>,
}

pub struct MapHandle {
    handle: CollectionHandle<Row, Option<Row>>,
    deserialize_fn: DeserializeFn,
    vtable: &'static VTable,
    updates: Vec<(Row, Option<Row>)>,
    key_func: ????
}

key_func is the tricky part here. It is needed to insert new (key, value) pairs into the map. In practice, input data contains only the value, and the key is implicitly defined by a subset of the value's columns. key_func implements this value-to-key mapping. The information needed to generate this function must be included in the SourceMap deserialization demand.
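A minimal sketch of what key_func could look like, assuming the deserialization demand carries the indices of the key columns. Everything here is illustrative: `Row` is modelled as `Vec<String>`, and `KeyFunc`, `key_func_from_columns`, and the `insert`/`delete`/`flush` methods are hypothetical names, not the JIT's actual API.

```rust
use std::collections::BTreeMap;

/// Stand-in for the JIT's dynamically typed `Row`.
type Row = Vec<String>;

/// Hypothetical shape for `key_func`: projects a value row onto its key columns.
type KeyFunc = Box<dyn Fn(&Row) -> Row>;

/// Builds a `key_func` from key column indices. Panics at call time if an
/// index is out of range for the row it is applied to.
fn key_func_from_columns(key_columns: Vec<usize>) -> KeyFunc {
    Box::new(move |value: &Row| -> Row {
        key_columns.iter().map(|&i| value[i].clone()).collect()
    })
}

struct MapHandle {
    key_func: KeyFunc,
    updates: Vec<(Row, Option<Row>)>,
}

impl MapHandle {
    /// Insert or update: only the value arrives; the key is derived from it.
    fn insert(&mut self, value: Row) {
        let key = (self.key_func)(&value);
        self.updates.push((key, Some(value)));
    }

    /// Delete by key.
    fn delete(&mut self, key: Row) {
        self.updates.push((key, None));
    }

    /// Applies the buffered updates to a table, in order.
    fn flush(&mut self, table: &mut BTreeMap<Row, Row>) {
        for (key, update) in self.updates.drain(..) {
            match update {
                Some(value) => {
                    table.insert(key, value);
                }
                None => {
                    table.remove(&key);
                }
            }
        }
    }
}

fn main() {
    // Column 0 is the primary key in this example.
    let mut handle = MapHandle { key_func: key_func_from_columns(vec![0]), updates: Vec::new() };
    let mut table = BTreeMap::new();
    handle.insert(vec!["1".to_string(), "alice".to_string()]);
    handle.insert(vec!["1".to_string(), "bob".to_string()]); // same key: overwrite
    handle.delete(vec!["1".to_string()]);
    handle.flush(&mut table);
    println!("{:?}", table);
}
```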

@ryzhyk
Contributor Author

ryzhyk commented Oct 11, 2023

Moving to the next milestone (we still need to integrate #866 with the SQL compiler).

@ryzhyk
Contributor Author

ryzhyk commented Oct 11, 2023

@mihaibudiu, here are some examples of the new IR format from #866:

"Source": {
    "layout": {
        "Map": [3, 2]
    },
    "kind": "ZSet",
    "table": "T"
}

"Source": {
    "layout": { "Set": 1 },
    "kind": "ZSet",
    "table": "T"
}

@ryzhyk
Contributor Author

ryzhyk commented Oct 11, 2023

@mihaibudiu, one more TODO for the SQL compiler that I missed: we need to add primary key information to the schema file, e.g.,

  {
    "name" : "PART",
    "fields" : [ {
      "name" : "ID",
      "case_sensitive" : false,
      "columntype" : {
        "type" : "BIGINT",
        "nullable" : false
      }
    }, {
      "name" : "NAME",
      "case_sensitive" : false,
      "columntype" : {
        "type" : "VARCHAR",
        "nullable" : true,
        "precision" : -1
      }
    } ],
    "primary_key" : ["ID"]
  }

The order of column names in the primary key definition must match the layout you generate for JIT.
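To illustrate why the order matters, here is a small sketch that resolves primary key column names to column indices while preserving the order of the primary_key list. The function `key_column_indices` and the three-column example are hypothetical; exact, case-sensitive name matching is an assumption of this sketch.

```rust
/// Resolves primary key column names to column indices, preserving the order
/// in which the key columns are listed (not the order of the table's fields).
fn key_column_indices(fields: &[&str], primary_key: &[&str]) -> Result<Vec<usize>, String> {
    primary_key
        .iter()
        .map(|pk| {
            fields
                .iter()
                .position(|f| f == pk)
                .ok_or_else(|| format!("primary key column {} not found", pk))
        })
        .collect()
}

fn main() {
    // The PART table from the schema example: a single-column key.
    println!("{:?}", key_column_indices(&["ID", "NAME"], &["ID"]));

    // A hypothetical composite key declared as (C, A): the key layout would
    // then be [C, A], which differs from the field order [A, C].
    println!("{:?}", key_column_indices(&["A", "B", "C"], &["C", "A"]));
}
```

If the schema file listed the same key as ["A", "C"] while the JIT key layout used [C, A], keys built from the schema would not line up with the layout, which is the mismatch the note above warns about.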

@mihaibudiu
Collaborator

I think #891 completes the SQL compiler side of this issue.
