[REFACTOR] type annotation -> CocoIndex type encoding logic in Python SDK should return strong-typed schema class #1083

@georgeh0

Some Background

Data types / schemas come in multiple forms:

  1. Python native type annotations, e.g. int, dict[str, Any], or a specific data class. They're used directly in user code as type hints.
  2. AnalyzedTypeInfo: basically a more structured representation of 1. Used only internally by our Python SDK.
  3. Strong-typed schema classes in CocoIndex's type system. These classes mirror the engine's data schema representation. They're exposed to some third-party APIs, e.g. custom targets (custom target connectors can inspect the schema of the data being exported to them), and also to custom functions / sources in the future.
  4. Generic-typed JSON-equivalent values, in types such as dict[str, Any] (for a JSON object), list[dict[str, Any]] (for a JSON array), str (for a JSON string), etc. They can be passed directly to and from the engine in Rust.
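To make the difference between forms 1, 3 and 4 concrete, here is a minimal sketch for a small data class. `FieldSketch`, `StructSketch`, the kind names and the JSON layout are hypothetical stand-ins chosen for illustration; they are not the SDK's actual schema classes or the engine's actual wire format. Only the existence of an `encode()` method on form 3 comes from this issue.

```python
from dataclasses import dataclass
from typing import Any


# Form 1: a native Python annotation, as a user would write it.
@dataclass
class Person:
    name: str
    age: int


# Form 3 (hypothetical stand-ins): strongly typed schema classes.
@dataclass
class FieldSketch:
    name: str
    kind: str  # illustrative kind name, e.g. "Str" or "Int64"


@dataclass
class StructSketch:
    fields: list[FieldSketch]

    def encode(self) -> dict[str, Any]:
        # Form 4: a generic JSON-equivalent value built from form 3.
        return {
            "kind": "Struct",
            "fields": [
                {"name": f.name, "type": {"kind": f.kind}} for f in self.fields
            ],
        }


# Form 3 instance describing Person (written by hand in this sketch).
person_schema = StructSketch(
    fields=[FieldSketch("name", "Str"), FieldSketch("age", "Int64")]
)

# Form 4: plain dicts and lists, with no type checker support.
person_json = person_schema.encode()
```

A type checker can validate `person_schema.fields[1].kind`, whereas `person_json["fields"][1]["type"]["kind"]` is only checked at runtime.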

Task

We have logic to convert Python native type annotations to engine types. Currently we do 1->2->4 (code), because 3 was introduced only recently.

We want to:

  • Change the logic of 2->4 to 2->3, i.e. convert AnalyzedTypeInfo to the strong-typed schema representation first. This will make our code easier to read and maintain (3 is easier to build than 4, and can leverage mypy type checks etc.)
  • Once we have 3, existing callers can simply call the encode() method to get 4, so we don't have to expose convenience methods that directly return 4 in the typing package.
  • Tests in test_typing.py should be updated accordingly, to check the output of 3 instead of 4 (3 is more structured than 4, and easier to check).
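The proposed flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK's implementation: `AnalyzedSketch`, `SchemaSketch`, `analyze`, `to_schema` and the kind names are all hypothetical, and only scalars and `list[...]` are handled.

```python
import typing
from dataclasses import dataclass
from typing import Any


@dataclass
class AnalyzedSketch:
    """Hypothetical stand-in for form 2 (not the SDK's AnalyzedTypeInfo)."""
    base: Any
    elem: "AnalyzedSketch | None" = None


@dataclass
class SchemaSketch:
    """Hypothetical stand-in for a form 3 schema class."""
    kind: str
    element: "SchemaSketch | None" = None

    def encode(self) -> dict[str, Any]:
        # 3 -> 4: callers that need JSON-equivalent values call this.
        out: dict[str, Any] = {"kind": self.kind}
        if self.element is not None:
            out["element"] = self.element.encode()
        return out


# Illustrative mapping from Python scalar types to kind names.
_KINDS = {int: "Int64", str: "Str", float: "Float64", bool: "Bool"}


def analyze(annotation: Any) -> AnalyzedSketch:
    """1 -> 2: turn a native annotation into a structured description."""
    if typing.get_origin(annotation) is list:
        (arg,) = typing.get_args(annotation)
        return AnalyzedSketch(base=list, elem=analyze(arg))
    return AnalyzedSketch(base=annotation)


def to_schema(info: AnalyzedSketch) -> SchemaSketch:
    """2 -> 3: build the strongly typed schema instead of raw dicts."""
    if info.base is list:
        assert info.elem is not None
        return SchemaSketch(kind="List", element=to_schema(info.elem))
    return SchemaSketch(kind=_KINDS[info.base])


schema = to_schema(analyze(list[int]))

# A test in the new style compares against form 3 directly...
assert schema == SchemaSketch(kind="List", element=SchemaSketch(kind="Int64"))
# ...and form 4 remains one encode() call away for existing callers.
assert schema.encode() == {"kind": "List", "element": {"kind": "Int64"}}
```

Comparing dataclass instances keeps test failures readable and lets mypy catch misspelled field names, which a comparison of nested dicts would not.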
