Conversation
…n SQL migrations for the builder - use the sql_builder instead of hardcoding db connectivity on the search
First round of advanced search on ducpy
Jorgedanisc
📋 Release Preview
Pull request overview
This PR updates the DUC SQLite schema to enhance FTS5-based search (diacritic-insensitive tokenization + prefix indexing), introduces a migration to rebuild existing FTS tables, and adds a Python (ducpy) search API with tests while ensuring Python-side schema migrations are applied when opening .duc databases.
Changes:
- Rebuild/upgrade FTS5 virtual tables to use `unicode61 remove_diacritics 2` and `prefix=...`, plus add explicit backfill (rebuild) steps.
- Add schema migration `3000002 → 3000003` and bump `PRAGMA user_version` to `3000003`.
- Add `ducpy.search` module (hybrid FTS + Python reranking) with new pytest coverage, and update `DucSQL` to apply SQL migrations on open.
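The tokenizer and prefix options adopted here can be seen end to end in a small sketch. This is an illustrative table, not the PR's schema, using Python's bundled SQLite (assumes an FTS5-enabled build):

```python
# Demonstrates the two FTS5 options the PR adopts:
#   tokenize = 'unicode61 remove_diacritics 2'  -> diacritic-insensitive matching
#   prefix = '2 3'                              -> prefix indexes for short prefixes
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE demo USING fts5("
    "label, tokenize = 'unicode61 remove_diacritics 2', prefix = '2 3')"
)
conn.execute("INSERT INTO demo(label) VALUES ('Façade élévation')")

# remove_diacritics 2 strips accents from both the indexed text and the
# query, so a plain-ASCII query matches the accented row.
exact_hit = conn.execute(
    "SELECT label FROM demo WHERE demo MATCH 'facade'").fetchone()

# 'élévation' is indexed as 'elevation', so the 3-char prefix query 'ele*'
# is served from the prefix index.
prefix_hit = conn.execute(
    "SELECT label FROM demo WHERE demo MATCH 'ele*'").fetchone()

print(exact_hit, prefix_hit)
```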
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| schema/search.sql | Adds tokenizer/prefix settings and rebuild statements for FTS tables; switches to IF NOT EXISTS for FTS tables/triggers. |
| schema/migrations/3000002_to_3000003.sql | New migration that drops/recreates FTS tables and rebuilds the indexes; bumps user_version. |
| schema/duc.sql | Bumps PRAGMA user_version to 3000003. |
| packages/ducrs/src/api/version_control.rs | Removes outdated module doc bullet. |
| packages/ducpy/src/ducpy/builders/sql_builder.py | Adds migration discovery/application to keep opened DBs current. |
| packages/ducpy/src/ducpy/search/search_elements.py | New Python search implementation and result types. |
| packages/ducpy/src/ducpy/search/__init__.py | Exposes the search API from the new package. |
| packages/ducpy/src/ducpy/__init__.py | Re-exports search helpers at top-level ducpy. |
| packages/ducpy/src/tests/src/test_search_module.py | New tests covering search behavior (exact, prefix, grouping, limit, empty query). |
```sql
-- 1. Drop old FTS tables (triggers are automatically dropped)
DROP TABLE IF EXISTS search_elements;
DROP TABLE IF EXISTS search_element_text;
DROP TABLE IF EXISTS search_element_doc;
DROP TABLE IF EXISTS search_element_model;
DROP TABLE IF EXISTS search_blocks;
```
Dropping the FTS tables will NOT drop the existing sync triggers, because the triggers are defined on elements / element_text / etc (not on the FTS tables). On an upgrade, the old trg_* triggers will still exist, so the subsequent CREATE TRIGGER trg_* statements will fail with "trigger already exists" (and/or keep stale trigger definitions if you switch to IF NOT EXISTS). Explicitly DROP TRIGGER IF EXISTS ... for each trg_* before recreating the tables/triggers, and consider fixing the misleading comment here.
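The failure mode is easy to reproduce. A minimal sketch with hypothetical table and trigger names (not the actual DUC schema): the trigger is defined on the source table, so dropping the FTS-style target table leaves it behind.

```python
# Shows that DROP TABLE only drops triggers defined ON that table; a trigger
# defined on another table survives and must be dropped explicitly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements(id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE search_elements(label TEXT);
    CREATE TRIGGER trg_elements_ai AFTER INSERT ON elements BEGIN
        INSERT INTO search_elements(label) VALUES (new.label);
    END;
""")

conn.execute("DROP TABLE search_elements")
triggers = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'trigger'")]
print(triggers)  # the trigger survived the DROP TABLE

# The fix suggested above: drop triggers explicitly before recreating them.
conn.execute("DROP TRIGGER IF EXISTS trg_elements_ai")
remaining = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'trigger'")]
print(remaining)
```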
```python
conn.executescript(sql)
current = conn.execute("PRAGMA user_version").fetchone()[0]
```
_apply_migrations can get stuck in an infinite loop if a migration script fails to advance PRAGMA user_version (e.g., script error, missing final PRAGMA, or PRAGMA guarded by a conditional). Add a progress check after executescript (e.g., ensure the new user_version is > current and ideally equals the expected to_v) and raise a clear error if not.
Suggested change:

```python
prev_version = current
conn.executescript(sql)
new_version = conn.execute("PRAGMA user_version").fetchone()[0]
if new_version <= prev_version:
    raise RuntimeError(
        "Migration did not advance PRAGMA user_version: "
        f"attempted to migrate from {prev_version} to {to_v}, "
        f"but user_version is still {new_version}."
    )
if new_version != to_v:
    raise RuntimeError(
        "Migration set unexpected PRAGMA user_version: "
        f"attempted to migrate from {prev_version} to {to_v}, "
        f"but user_version is {new_version}."
    )
current = new_version
```
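Putting the guard in context, here is a runnable sketch of a migration loop with the suggested progress check. The migration registry is a hypothetical in-memory dict; ducpy actually discovers migration scripts on disk.

```python
# Apply numbered migrations until the target version is reached, failing
# fast if a script does not advance PRAGMA user_version.
import sqlite3

MIGRATIONS = {  # (from_version, to_version) -> SQL script (hypothetical)
    (0, 1): "CREATE TABLE a(x); PRAGMA user_version = 1;",
    (1, 2): "CREATE TABLE b(x); PRAGMA user_version = 2;",
}

def apply_migrations(conn: sqlite3.Connection, target: int) -> int:
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    while current < target:
        step = next(((f, t) for (f, t) in MIGRATIONS if f == current), None)
        if step is None:
            raise RuntimeError(f"No migration found from version {current}")
        to_v = step[1]
        conn.executescript(MIGRATIONS[step])
        new_version = conn.execute("PRAGMA user_version").fetchone()[0]
        if new_version <= current:
            # Without this check, a script that forgets its final PRAGMA
            # would leave `current` unchanged and loop forever.
            raise RuntimeError(
                f"Migration {current} -> {to_v} did not advance "
                f"user_version (still {new_version})")
        current = new_version
    return current

conn = sqlite3.connect(":memory:")
final = apply_migrations(conn, 2)
print(final)  # 2
```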
```python
@dataclass(slots=True)
class DucSearchResult:
    """Compatibility placeholder for exported result types."""


@dataclass(slots=True)
class DucElementSearchResult:
```
DucSearchResult is defined twice: first as a (empty) dataclass, then later reassigned to a DucElementSearchResult | DucFileSearchResult union type alias. The second assignment overwrites the class at runtime, which is confusing and can break consumers expecting DucSearchResult to be a class. Either remove the placeholder dataclass, or rename it (e.g., DucSearchResultBase) and keep DucSearchResult exclusively as the type alias (or vice versa).
```python
for _variant_name, _expression, variant_boost in _build_query_variants(query):
    for element in elements:
        if element.get("is_deleted"):
            continue

        element_id = element.get("id")
        element_type = element.get("type")
        if not element_id or not element_type:
            continue

        aggregate = aggregates.get(element_id)
        if aggregate is None:
            aggregate = _ElementAggregate(
                element_id=element_id,
                raw_element_type=element_type,
                label=element.get("label") or "",
                description=element.get("description"),
            )
            aggregates[element_id] = aggregate

        for field_name, source_weight in field_weights.items():
            raw_text = element.get(field_name)
            score, _similarity = _evaluate_match_text(
                query,
                raw_text,
                fts_rank=None,
                source_weight=source_weight,
                variant_boost=variant_boost,
            )
            if score > 0.0 and raw_text:
                aggregate.add_match(raw_text, score)
```
In _collect_candidates_from_parsed_duc, the loop over _build_query_variants(query) ignores the generated FTS expressions and only uses variant_boost. Because the boosts are all <= 1.0 and _evaluate_match_text doesn't depend on the expression, this repeats the same work up to 3× without affecting the final best score (the 1.0-boost pass dominates). Consider removing the variants loop for the non-SQLite path (or refactor so variants actually change the matching logic) to avoid unnecessary CPU work.
Suggested change:

```python
# For non-SQLite search, query variants do not affect matching logic,
# and the highest boost is 1.0, so we only need a single pass here.
variant_boost = 1.0
for element in elements:
    if element.get("is_deleted"):
        continue
    element_id = element.get("id")
    element_type = element.get("type")
    if not element_id or not element_type:
        continue
    aggregate = aggregates.get(element_id)
    if aggregate is None:
        aggregate = _ElementAggregate(
            element_id=element_id,
            raw_element_type=element_type,
            label=element.get("label") or "",
            description=element.get("description"),
        )
        aggregates[element_id] = aggregate
    for field_name, source_weight in field_weights.items():
        raw_text = element.get(field_name)
        score, _similarity = _evaluate_match_text(
            query,
            raw_text,
            fts_rank=None,
            source_weight=source_weight,
            variant_boost=variant_boost,
        )
        if score > 0.0 and raw_text:
            aggregate.add_match(raw_text, score)
```
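The dominance argument behind dropping the extra passes is simple arithmetic; a sketch with illustrative numbers (not values from the PR):

```python
# With per-variant boosts capped at 1.0 and matching logic independent of
# the FTS expression, the best score over all passes equals the score of
# the boost-1.0 pass alone.
base_score = 0.8                  # hypothetical _evaluate_match_text score
variant_boosts = [1.0, 0.9, 0.75]  # illustrative boosts, all <= 1.0

best_multi_pass = max(base_score * b for b in variant_boosts)
best_single_pass = base_score * 1.0
print(best_multi_pass == best_single_pass)  # True
```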
```python
asset_path = _asset_input_path(asset_name)
assert asset_path.exists(), f"Missing asset file: {asset_path}"

json_path = Path(test_output_dir+"/search_results") / f"{test_name}.json"
```
Path construction here uses string concatenation with a hard-coded / (test_output_dir + "/search_results"), which is not portable on Windows and bypasses pathlib's normalization. Prefer Path(test_output_dir) / "search_results" / f"{test_name}.json" (and avoid the double-space before the f-string).
Suggested change:

```python
json_path = Path(test_output_dir) / "search_results" / f"{test_name}.json"
```
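For reference, a self-contained sketch of the portable construction (placeholder directory and test names), showing that pathlib joins segments with the correct separator for the platform:

```python
# Join path segments with pathlib instead of concatenating strings with a
# hard-coded '/', so the result is portable and normalized.
from pathlib import Path

test_output_dir = "out"     # placeholder; supplied by the test fixture
test_name = "exact_match"   # placeholder test name

json_path = Path(test_output_dir) / "search_results" / f"{test_name}.json"
print(json_path.parts)  # ('out', 'search_results', 'exact_match.json')
```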
```python
from .search_elements import (DucElementSearchResult, DucFileSearchResult,
                              DucSearchResponse, DucSearchResult,
                              search_duc_elements)
```
The import list formatting/indentation is inconsistent here, which makes the module harder to scan and is likely to fight auto-formatters. Reformat this into a standard multi-line import (one name per line or a single line if it fits) to keep style consistent.
Suggested change:

```python
from .search_elements import (
    DucElementSearchResult,
    DucFileSearchResult,
    DucSearchResponse,
    DucSearchResult,
    search_duc_elements,
)
```
🎉 This PR is included in version 3.2.1 🎉 The release is available on:
Your semantic-release bot 📦🚀

🎉 This PR is included in version 3.7.1 🎉 The release is available on:
Your semantic-release bot 📦🚀

🎉 This PR is included in version 3.6.1 🎉 The release is available on:
Your semantic-release bot 📦🚀
No description provided.