First round of advanced search on ducpy#227
Conversation
Jorgedanisc
ci: update semantic-release CI environment variables
Jorgedanisc
ci: update PR comment handling to use POST method
Jorgedanisc
ci: update PR comment API endpoint and ducjs package
fix: new release references wasm as internal
fix: remove pkg from files and update wasm paths
fix: improve delta encoding on the version graph
feat: restructure external files schema with revisions
Pull request overview
Adds an “advanced search” capability to ducpy by introducing an FTS5-backed element search pipeline (with Python-side scoring/grouping) and corresponding tests, plus updates to the SQLite search schema to improve prefix/diacritic-insensitive matching.
Changes:
- Enhanced `schema/search.sql` FTS5 configuration (tokenizer, prefix indexing, idempotent creation) and added index rebuild statements.
- Introduced the `ducpy.search` module (`search_duc_elements`, schema ensure/rebuild, result types) with hybrid SQLite+Python ranking.
- Added a dedicated pytest suite validating exact, prefix, grouping, empty-query, and limit behaviors.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| schema/search.sql | Improves FTS5 configuration (unicode61 + prefix) and adds rebuild backfill statements. |
| packages/ducpy/src/ducpy/search/search_elements.py | Implements the search API, schema bootstrap, candidate scoring, and grouped results logic. |
| packages/ducpy/src/ducpy/search/__init__.py | Exposes the search API/types from the ducpy.search package. |
| packages/ducpy/src/ducpy/__init__.py | Re-exports search helpers at the top-level ducpy namespace. |
| packages/ducpy/src/tests/src/test_search_module.py | Adds integration-style tests exercising search behaviors against real .duc assets. |
```python
def _find_schema_dir() -> Path:
    current = Path(__file__).resolve()
    for parent in current.parents:
        candidate = parent / "schema"
        if (candidate / "search.sql").exists():
            return candidate
    raise FileNotFoundError(
        "Could not locate schema/search.sql. Ensure the search module is running inside the DUC repository."
    )


def _read_search_schema_sql() -> str:
    return (_find_schema_dir() / "search.sql").read_text(encoding="utf-8")
```
_find_schema_dir() searches for schema/search.sql by walking up the filesystem and raises if not running inside the monorepo. Since ducpy is distributed via PyPI, this will break in normal installations where the workspace-level schema/ directory is not present. Consider shipping search.sql as package data (e.g., under ducpy/search/schema/) and loading it via importlib.resources, or embedding the SQL at build time, so ensure_search_schema() works outside the repo.
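A minimal sketch of the reviewer's suggestion: ship the SQL as package data and load it with `importlib.resources`, so no filesystem walk is needed. The `ducpy/search/schema/` layout is the reviewer's hypothetical example, not the PR's actual packaging.

```python
# Sketch: load search.sql as bundled package data instead of walking up
# the filesystem. Assumes search.sql is packaged under ducpy/search/schema/
# (hypothetical layout; the real package data path may differ).
from importlib import resources


def _read_search_schema_sql() -> str:
    """Read the bundled search.sql so it also works outside the monorepo."""
    return (
        resources.files("ducpy.search")      # package that ships the data file
        .joinpath("schema/search.sql")
        .read_text(encoding="utf-8")
    )
```

This requires declaring `search.sql` as package data (e.g. via `package-data` in `pyproject.toml`) so it is included in the wheel.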
```python
@dataclass(slots=True)
class DucSearchResult:
    """Compatibility placeholder for exported result types."""


@dataclass(slots=True)
class DucElementSearchResult:
    """One result row for a single element."""

    element_id: str
    element_type: str
    matches: list[str]
    score: float

    def to_dict(self) -> dict[str, Any]:
        return {
            "element_id": self.element_id,
            "element_type": self.element_type,
            "matches": self.matches,
            "score": round(self.score, 6),
        }


@dataclass(slots=True)
class DucFileSearchResult:
    """One grouped result row for repeated file-backed elements."""

    file_id: str
    element_type: str
    matches: list[str]
    score: float
    hits: int
    element_ids: list[str]

    def to_dict(self) -> dict[str, Any]:
        return {
            "file_id": self.file_id,
            "element_type": self.element_type,
            "matches": self.matches,
            "score": round(self.score, 6),
            "hits": self.hits,
            "element_ids": self.element_ids,
        }


DucSearchResult = DucElementSearchResult | DucFileSearchResult
```
DucSearchResult is first declared as an (empty) dataclass and then immediately reassigned to a union type alias. This makes the placeholder class unreachable and can confuse runtime users (e.g., isinstance(x, DucSearchResult) won’t work) and type checkers. Either remove the placeholder dataclass and declare the union using TypeAlias, or keep DucSearchResult as a real base class and have the concrete result types inherit from it.
```python
def ensure_search_schema(conn: sqlite3.Connection, rebuild: bool = False) -> None:
    """Create missing FTS search tables and triggers for a DUC database.

    Args:
        conn: Open SQLite connection for a ``.duc`` database.
        rebuild: When ``True``, reruns the FTS rebuild statements even if the
            search schema already exists.
    """
    existing_tables = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type IN ('table', 'view')")
    }
    if rebuild or not _SEARCH_TABLES.issubset(existing_tables):
        conn.executescript(_read_search_schema_sql())
        conn.commit()
```
ensure_search_schema() only checks for the presence of the FTS tables, but the schema’s correctness also depends on the associated triggers existing (to keep the external-content FTS indexes in sync). If tables exist but triggers are missing (or were created under different names), this function will skip initialization and the index will silently stop updating. Consider also checking sqlite_master for the expected trigger names (or a schema version marker) before deciding to skip schema application.
```diff
 CREATE VIRTUAL TABLE IF NOT EXISTS search_elements USING fts5(
     label,
     description,
     content='elements',
-    content_rowid='rowid'
+    content_rowid='rowid',
+    tokenize='unicode61 remove_diacritics 2',
+    prefix='2 3 4 5 6 7 8 9 10'
 );
```
Using IF NOT EXISTS for the FTS virtual tables/triggers means that running this script against a database that already has an older version of these objects will not apply changes (e.g., tokenizer/prefix options or trigger definitions). If you expect this script to serve as an upgrade path (not just first-time creation), consider explicitly dropping/recreating the FTS tables and triggers (or provide a versioned migration path) so schema changes take effect.
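One way to provide that versioned migration path is to track a schema version in `PRAGMA user_version` and drop/recreate the FTS objects when it is stale, so tokenizer and prefix changes actually take effect on existing databases. The object names and version number below are illustrative, not the PR's real schema.

```python
# Sketch of a versioned upgrade path. Assumes the search objects are named
# search_elements / search_elements_{ai,ad,au}; adjust to the real schema.
import sqlite3

SEARCH_SCHEMA_VERSION = 2


def migrate_search_schema(conn: sqlite3.Connection, schema_sql: str) -> None:
    (current,) = conn.execute("PRAGMA user_version").fetchone()
    if current >= SEARCH_SCHEMA_VERSION:
        return  # already on the latest search schema
    # IF NOT EXISTS would silently keep stale definitions, so drop first.
    conn.executescript(
        "DROP TRIGGER IF EXISTS search_elements_ai;"
        "DROP TRIGGER IF EXISTS search_elements_ad;"
        "DROP TRIGGER IF EXISTS search_elements_au;"
        "DROP TABLE IF EXISTS search_elements;"
    )
    conn.executescript(schema_sql)
    conn.execute(f"PRAGMA user_version = {SEARCH_SCHEMA_VERSION}")
    conn.commit()
```

Dropping an external-content FTS table is cheap since the source rows live in `elements`; the rebuild statements in `search.sql` repopulate the index afterwards.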
…n SQL migrations for the builder - use the sql_builder instead of hardcoding db connectivity on the search
🎉 This PR is included in version 3.2.1 🎉 The release is available on:
Your semantic-release bot 📦🚀
🎉 This PR is included in version 3.7.1 🎉 The release is available on:
Your semantic-release bot 📦🚀
🎉 This PR is included in version 3.6.1 🎉 The release is available on:
Your semantic-release bot 📦🚀
No description provided.