Skip to content

docs: add dataset relationship engine design document and architectur…#40116

Closed
hjadm wants to merge 1 commit into
apache:masterfrom
hjadm:feature/dataset-relationships-design
Closed

docs: add dataset relationship engine design document and architectur…#40116
hjadm wants to merge 1 commit into
apache:masterfrom
hjadm:feature/dataset-relationships-design

Conversation

@hjadm
Copy link
Copy Markdown

@hjadm hjadm commented May 14, 2026

…e analysis

  • superset_relationship_design.md: comprehensive 3-phase technical design for dataset relationships engine (backend models, dual-mode JOIN engine for same-DB and cross-DB, REST API, React Flow visual canvas, cross-filtering, drill-down hierarchies, and filter propagation)
  • analise_superset.md: detailed architecture analysis of the Apache Superset repository covering monorepo structure, tech stack, code patterns, and improvement suggestions

SUMMARY

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

…e analysis

- superset_relationship_design.md: comprehensive 3-phase technical design for
  dataset relationships engine (backend models, dual-mode JOIN engine for
  same-DB and cross-DB, REST API, React Flow visual canvas, cross-filtering,
  drill-down hierarchies, and filter propagation)
- analise_superset.md: detailed architecture analysis of the Apache Superset
  repository covering monorepo structure, tech stack, code patterns, and
  improvement suggestions
@dosubot dosubot Bot added design:proposal Design proposals doc:developer Developer documentation labels May 14, 2026
@bito-code-review
Copy link
Copy Markdown
Contributor

bito-code-review Bot commented May 14, 2026

Bito Automatic Review Skipped - Files Excluded

Bito didn't auto-review this change because all changed files are in the exclusion list for automatic reviews. No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change the excluded files settings here, or contact your Bito workspace admin at evan@preset.io.

@github-actions github-actions Bot added the doc Namespace | Anything related to documentation label May 14, 2026
@hjadm hjadm closed this May 14, 2026
@netlify
Copy link
Copy Markdown

netlify Bot commented May 14, 2026

Deploy Preview for superset-docs-preview ready!

Name Link
🔨 Latest commit bd8a63f
🔍 Latest deploy log https://app.netlify.com/projects/superset-docs-preview/deploys/6a059b5d8192be00077cc3fa
😎 Deploy Preview https://deploy-preview-40116--superset-docs-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

Comment on lines +179 to +229
qry = qry.join(
target_table,
join_condition,
isouter=(rel.join_type == "LEFT"),
)
# === END NEW ===

# ... resto do pipeline de filtros, groupby, orderby ...
return qry
```

#### Modo 2 — Cross-Database (Pandas Merge)

Quando os datasets estão em bancos diferentes, o engine executa queries separadas e faz merge dos DataFrames resultantes no nível da aplicação.

**Ponto de injeção:** `superset/models/helpers.py` → `ExploreMixin.get_query_result()`

```python
# Pseudocódigo da modificação em get_query_result()
def get_query_result(self, query_obj, ...):
# Executa query principal
primary_result = self.query(query_obj)
primary_df = primary_result.df

# === NEW: Cross-database merges ===
if query_obj.relationships:
for rel in query_obj.relationships:
if rel.is_cross_database:
# Proteção de memória
if len(primary_df) > RELATIONSHIP_MAX_MERGE_ROWS:
raise RelationshipMergeError(
f"Primary dataset exceeds {RELATIONSHIP_MAX_MERGE_ROWS} rows limit"
)

# Executa query no dataset target
target_result = rel.target_dataset.query(
_build_target_query(rel, primary_df)
)
target_df = target_result.df

if len(target_df) > RELATIONSHIP_MAX_MERGE_ROWS:
raise RelationshipMergeError(
f"Target dataset exceeds {RELATIONSHIP_MAX_MERGE_ROWS} rows limit"
)

# Merge dos DataFrames
primary_df = primary_df.merge(
target_df,
left_on=[cm.source_column for cm in rel.column_mappings],
right_on=[cm.target_column for cm in rel.column_mappings],
how=rel.join_type.lower(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Architect Review — HIGH

JOIN semantics in the design are inconsistent with the declared join_type options: in same-database mode only LEFT vs non-LEFT is honored (so RIGHT and FULL behave as INNER), and in cross-database mode join_type is passed directly to pandas.DataFrame.merge(how=...), so FULL becomes "full", which is not a valid how value.

Suggestion: Constrain join_type to the subset both engines can correctly implement or define explicit per-engine mappings (e.g., map FULL to a proper SQL full outer join and to how="outer" in pandas) so that the same join_type yields consistent, valid behavior in both same-DB and cross-DB paths.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood.

**Path:** docs/superset_relationship_design.md
**Line:** 179:229
**Comment:**
	*HIGH: JOIN semantics in the design are inconsistent with the declared `join_type` options: in same-database mode only `LEFT` vs non-`LEFT` is honored (so `RIGHT` and `FULL` behave as `INNER`), and in cross-database mode `join_type` is passed directly to `pandas.DataFrame.merge(how=...)`, so `FULL` becomes `"full"`, which is not a valid `how` value.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

Comment on lines +46 to +50
source_dataset_id INTEGER NOT NULL REFERENCES ab_datasets(id) ON DELETE CASCADE,

-- Target dataset
target_dataset_id INTEGER NOT NULL REFERENCES ab_datasets(id) ON DELETE CASCADE,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Architect Review — HIGH

The design uses inconsistent dataset table names: the initial DDL references ab_datasets(id) while the ORM and migration snippets use tables.id, conflicting with the actual Superset dataset model (SqlaTable.__tablename__ = "tables").

Suggestion: Normalize the design to reference the real dataset table (tables.id for SqlaTable) consistently across the DDL, ORM model, and migration examples to avoid implementing against a non-existent ab_datasets table.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood.

**Path:** docs/superset_relationship_design.md
**Line:** 46:50
**Comment:**
	*HIGH: The design uses inconsistent dataset table names: the initial DDL references `ab_datasets(id)` while the ORM and migration snippets use `tables.id`, conflicting with the actual Superset dataset model (`SqlaTable.__tablename__ = "tables"`).

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

design:proposal Design proposals doc:developer Developer documentation doc Namespace | Anything related to documentation size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant