docs: add dataset relationship engine design document and architectur…#40116
hjadm wants to merge 1 commit into
…e analysis
- superset_relationship_design.md: comprehensive 3-phase technical design for the dataset relationships engine (backend models, dual-mode JOIN engine for same-DB and cross-DB, REST API, React Flow visual canvas, cross-filtering, drill-down hierarchies, and filter propagation)
- analise_superset.md: detailed architecture analysis of the Apache Superset repository covering monorepo structure, tech stack, code patterns, and improvement suggestions

```python
            qry = qry.join(
                target_table,
                join_condition,
                isouter=(rel.join_type == "LEFT"),
            )
    # === END NEW ===

    # ... rest of the filter, groupby, orderby pipeline ...
    return qry
```

#### Mode 2: Cross-Database (Pandas Merge)

When the datasets live in different databases, the engine runs separate queries and merges the resulting DataFrames at the application level.

**Injection point:** `superset/models/helpers.py` → `ExploreMixin.get_query_result()`

```python
# Pseudocode for the change to get_query_result()
def get_query_result(self, query_obj, ...):
    # Run the primary query
    primary_result = self.query(query_obj)
    primary_df = primary_result.df

    # === NEW: Cross-database merges ===
    if query_obj.relationships:
        for rel in query_obj.relationships:
            if rel.is_cross_database:
                # Memory guard
                if len(primary_df) > RELATIONSHIP_MAX_MERGE_ROWS:
                    raise RelationshipMergeError(
                        f"Primary dataset exceeds {RELATIONSHIP_MAX_MERGE_ROWS} rows limit"
                    )

                # Run the query against the target dataset
                target_result = rel.target_dataset.query(
                    _build_target_query(rel, primary_df)
                )
                target_df = target_result.df

                if len(target_df) > RELATIONSHIP_MAX_MERGE_ROWS:
                    raise RelationshipMergeError(
                        f"Target dataset exceeds {RELATIONSHIP_MAX_MERGE_ROWS} rows limit"
                    )

                # Merge the DataFrames
                primary_df = primary_df.merge(
                    target_df,
                    left_on=[cm.source_column for cm in rel.column_mappings],
                    right_on=[cm.target_column for cm in rel.column_mappings],
                    how=rel.join_type.lower(),
```

🟠 Architect Review — HIGH
JOIN semantics in the design are inconsistent with the declared join_type options: in same-database mode only LEFT vs non-LEFT is honored (so RIGHT and FULL behave as INNER), and in cross-database mode join_type is passed directly to pandas.DataFrame.merge(how=...), so FULL becomes "full", which is not a valid how value.
Suggestion: Constrain join_type to the subset both engines can correctly implement or define explicit per-engine mappings (e.g., map FULL to a proper SQL full outer join and to how="outer" in pandas) so that the same join_type yields consistent, valid behavior in both same-DB and cross-DB paths.
**Path:** docs/superset_relationship_design.md
**Line:** 179:229
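
One way to read the suggestion above is as a single lookup table that both engines consult, so that each declared `join_type` resolves to valid arguments on the SQL side and on the pandas side. The sketch below is illustrative only: `JoinSpec`, `JOIN_SPECS`, and `resolve_join` are hypothetical names that do not appear in the design document, and treating `RIGHT` as a `LEFT` join with the two sides swapped is an assumption about how the same-database path could emulate it, since SQLAlchemy's `join()` exposes only `isouter` and `full`.

```python
from typing import NamedTuple


class JoinSpec(NamedTuple):
    """How one declared join_type is expressed in each engine (hypothetical)."""

    isouter: bool     # keyword argument for SQLAlchemy's join()
    full: bool        # keyword argument for SQLAlchemy's join()
    swap_sides: bool  # assumption: RIGHT is emulated as LEFT with sides swapped
    pandas_how: str   # value for pandas.DataFrame.merge(how=...)


# Hypothetical mapping; the four join_type names are taken from the review comment.
JOIN_SPECS = {
    "INNER": JoinSpec(isouter=False, full=False, swap_sides=False, pandas_how="inner"),
    "LEFT": JoinSpec(isouter=True, full=False, swap_sides=False, pandas_how="left"),
    "RIGHT": JoinSpec(isouter=True, full=False, swap_sides=True, pandas_how="right"),
    "FULL": JoinSpec(isouter=True, full=True, swap_sides=False, pandas_how="outer"),
}


def resolve_join(join_type: str) -> JoinSpec:
    """Return the per-engine settings for join_type, rejecting unknown values."""
    try:
        return JOIN_SPECS[join_type.upper()]
    except KeyError:
        raise ValueError(f"Unsupported join_type: {join_type!r}") from None
```

With a table like this, the same-database path could pass `isouter` and `full` to `qry.join(...)`, and the cross-database path could pass `pandas_how` to `merge(how=...)`, so `FULL` would no longer reach pandas as `"full"`. Rejecting unknown values up front keeps the two paths from silently diverging.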

    source_dataset_id INTEGER NOT NULL REFERENCES ab_datasets(id) ON DELETE CASCADE,

    -- Target dataset
    target_dataset_id INTEGER NOT NULL REFERENCES ab_datasets(id) ON DELETE CASCADE,

🟠 Architect Review — HIGH
The design uses inconsistent dataset table names: the initial DDL references ab_datasets(id) while the ORM and migration snippets use tables.id, conflicting with the actual Superset dataset model (SqlaTable.__tablename__ = "tables").
Suggestion: Normalize the design to reference the real dataset table (tables.id for SqlaTable) consistently across the DDL, ORM model, and migration examples to avoid implementing against a non-existent ab_datasets table.
**Path:** docs/superset_relationship_design.md
**Line:** 46:50
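
To make the suggested normalization concrete, here is a minimal sketch using Python's built-in `sqlite3` module, in which both foreign keys point at `tables(id)` instead of `ab_datasets(id)`. Only the `tables(id)` target and the `ON DELETE CASCADE` clauses come from the review and the DDL excerpt; the `dataset_relationships` table name, the reduced `tables` schema, and the helper functions are assumptions made for illustration, and SQLite stands in for whatever metadata database a real deployment uses.

```python
import sqlite3

# Hypothetical, reduced schema: only the columns needed to show the references.
DDL = """
CREATE TABLE tables (
    id INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL
);
CREATE TABLE dataset_relationships (
    id INTEGER PRIMARY KEY,
    -- Source dataset
    source_dataset_id INTEGER NOT NULL REFERENCES tables(id) ON DELETE CASCADE,
    -- Target dataset
    target_dataset_id INTEGER NOT NULL REFERENCES tables(id) ON DELETE CASCADE
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two datasets and one relationship."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    conn.executemany(
        "INSERT INTO tables (id, table_name) VALUES (?, ?)",
        [(1, "orders"), (2, "customers")],
    )
    conn.execute(
        "INSERT INTO dataset_relationships (source_dataset_id, target_dataset_id) "
        "VALUES (1, 2)"
    )
    conn.commit()
    return conn


def count_relationships(conn: sqlite3.Connection) -> int:
    """Return the number of rows in dataset_relationships."""
    return conn.execute("SELECT COUNT(*) FROM dataset_relationships").fetchone()[0]
```

Deleting a row from `tables` in this demo also removes any relationship that references it, which is the behavior the `ON DELETE CASCADE` clauses in the excerpt are meant to provide once they point at an existing table.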