[Hackathon] feat: Data Profiling Panel — Instant Dataset Analysis Before You Run #5114

Closed. EmilySun621 wants to merge 4 commits into apache:main from EmilySun621:hackathon/data-profiling

EmilySun621 commented May 16, 2026

🎯 Problem

Researchers drag a CSV onto the canvas and start building operators blind — they don't know how many missing values there are, which columns are IDs, or whether the target variable is imbalanced. They find out after the workflow fails.

💡 Solution

One-click data profiling that reads the real CSV and tells you everything before you write a single operator.


✨ Features

1. Data Quality Score (0-100)

Single number summarizing dataset health, with sub-score breakdown:

  • ✅ Completeness — missing value percentage across columns
  • ✅ Duplicates — duplicate row detection
  • ⚠️ Outliers — values beyond 3 standard deviations
  • ⚠️ Constant columns — zero-information columns
  • ⚠️ High cardinality — likely ID columns
  • ⚠️ Class imbalance — skewed target distribution

Color scale: 🟢 90-100 Excellent · 🟡 70-89 Good · 🟠 50-69 Needs attention · 🔴 0-49 Poor
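A minimal sketch of how a score and its color band could be derived from the sub-scores listed above. The band thresholds come from the color scale in this description; the `QualityInputs` shape and every penalty weight are assumptions for illustration, not the formula in `data-profiling.utils.ts`.

```typescript
// Hypothetical input shape; field names are illustrative only.
interface QualityInputs {
  missingFraction: number;        // share of missing cells, 0..1
  duplicateFraction: number;      // share of duplicate rows, 0..1
  outlierFraction: number;        // share of numeric values beyond 3 sigma, 0..1
  constantColumns: number;        // columns with a single unique value
  highCardinalityColumns: number; // likely ID columns
  totalColumns: number;
}

type QualityBand = "Excellent" | "Good" | "Needs attention" | "Poor";

// Thresholds follow the color scale: 90-100, 70-89, 50-69, 0-49.
function qualityBand(score: number): QualityBand {
  if (score >= 90) return "Excellent";
  if (score >= 70) return "Good";
  if (score >= 50) return "Needs attention";
  return "Poor";
}

// Assumed weights: each penalty is capped so no single issue dominates.
function qualityScore(q: QualityInputs): number {
  const cols = Math.max(q.totalColumns, 1);
  const penalty =
    Math.min(40, q.missingFraction * 100) +
    Math.min(20, q.duplicateFraction * 100) +
    Math.min(15, q.outlierFraction * 100) +
    Math.min(15, (q.constantColumns / cols) * 100) +
    Math.min(10, (q.highCardinalityColumns / cols) * 50);
  return Math.max(0, Math.round(100 - penalty));
}
```

Capping each penalty keeps the score interpretable: a dataset with one severe problem still lands in a lower band without collapsing to zero.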


2. Suggested Cleaning Actions

Rule-based, no LLM — zero hallucination risk:

  • 🔧 Impute — "HbA1c has 12.3% missing — use median imputation" → [Add to Workflow]
  • 🗑️ Drop column — "smoker_flag has only 1 unique value" → [Add to Workflow]
  • 📋 Remove duplicates — "23 rows (3.0%) are exact duplicates" → [Add to Workflow]
  • 🏷️ Flag ID — "patient_id is 100% unique — drop before modeling" → [Add to Workflow]
  • 📊 Review outliers — "income has 42 outliers (5.5%) beyond 3σ" → [Copy hint]

Sorted by severity: critical → warning → info.
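The rule-based flow above can be sketched as a pure function over per-column summaries. The `ColumnSummary` and `Suggestion` shapes, the severity assigned to each rule, and the 10% missing threshold (borrowed from the red-highlight rule in the column cards) are assumptions, not the PR's actual rules.

```typescript
type Severity = "critical" | "warning" | "info";

// Hypothetical per-column summary; not the PR's Column type.
interface ColumnSummary {
  name: string;
  rowCount: number;
  missingCount: number;
  uniqueCount: number;
}

interface Suggestion {
  severity: Severity;
  column: string;
  message: string;
}

const SEVERITY_ORDER: Record<Severity, number> = { critical: 0, warning: 1, info: 2 };

function suggestCleaning(columns: ColumnSummary[]): Suggestion[] {
  const out: Suggestion[] = [];
  for (const c of columns) {
    const missingPct = c.rowCount > 0 ? (c.missingCount / c.rowCount) * 100 : 0;
    if (c.uniqueCount === 1) {
      // Constant column: carries no information.
      out.push({ severity: "warning", column: c.name, message: `${c.name} has only 1 unique value` });
    } else if (c.rowCount > 1 && c.missingCount === 0 && c.uniqueCount === c.rowCount) {
      // Every value distinct: likely an identifier.
      out.push({ severity: "info", column: c.name, message: `${c.name} is 100% unique; drop before modeling` });
    }
    if (missingPct > 10) {
      out.push({
        severity: "critical",
        column: c.name,
        message: `${c.name} has ${missingPct.toFixed(1)}% missing; consider imputation`,
      });
    }
  }
  // Sort critical first, then warning, then info.
  return out.sort((a, b) => SEVERITY_ORDER[a.severity] - SEVERITY_ORDER[b.severity]);
}
```

Because every rule is a deterministic check on counts, the same dataset always yields the same suggestions.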


3. Column Role Detection (auto)

Heuristic classification of each column's ML role:

  • 🎯 Target — columns named target/label/class/outcome, or low-cardinality categoricals
  • 🏷️ ID — high-cardinality columns or names matching id/index/patient_id
  • 📊 Feature — numeric and categorical columns for modeling
  • 📅 Datetime — date/time columns
  • Constant — single-value columns (flag for removal)

Summary: "1 possible target: Species · 1 ID: Id · 4 features: SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm"
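A rough sketch of the role heuristic described above, checked against the Iris columns in the summary. The regular expressions, the "all values unique" test for IDs, and the cutoff of 10 distinct values for a possible target are assumptions; the PR does not state its exact thresholds.

```typescript
type ColumnRole = "target" | "id" | "feature" | "datetime" | "constant";

// Hypothetical input shape for illustration.
interface RoleInput {
  name: string;
  dtype: "numeric" | "categorical" | "datetime";
  uniqueCount: number;
  rowCount: number;
}

const TARGET_NAMES = /^(target|label|class|outcome)$/i;
// Matches "id", "Id", "index", and suffixed forms such as "patient_id".
const ID_NAMES = /(^|_)(id|index)$/i;

function detectRole(c: RoleInput): ColumnRole {
  if (c.uniqueCount <= 1) return "constant";
  if (c.dtype === "datetime") return "datetime";
  if (TARGET_NAMES.test(c.name)) return "target";
  // Assumed high-cardinality rule: every row has a distinct value.
  if (ID_NAMES.test(c.name) || c.uniqueCount === c.rowCount) return "id";
  // Assumed low-cardinality rule: at most 10 distinct categories.
  if (c.dtype === "categorical" && c.uniqueCount <= 10) return "target";
  return "feature";
}
```

The checks run from most to least specific, so a constant column named `label` is flagged for removal rather than proposed as a target.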


4. Per-Column Statistics

Each column card displays:

  • Name + type badge (numeric/categorical) + role badge (Target/ID/Feature)
  • Numeric columns: mean, median, std, min, max, range
  • Numeric columns: inline SVG histogram (10 bins)
  • Categorical columns: unique count, top values with counts
  • Missing value warning (red highlight if > 10%)
  • Role suggestion ("Use as input feature" / "Drop before modeling")
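For reference, the numeric statistics on each card (mean, median, std, min, max, range, a 3σ outlier count, and a 10-bin histogram) can be computed in one pass over a sorted copy of the values. This is a sketch under assumptions: it uses the population standard deviation and equal-width bins, neither of which the PR specifies, and the `NumericStats` shape is hypothetical.

```typescript
// Hypothetical result shape for illustration.
interface NumericStats {
  mean: number;
  median: number;
  std: number;
  min: number;
  max: number;
  range: number;
  outlierCount: number; // values more than 3 standard deviations from the mean
  histogram: number[];  // counts per equal-width bin
}

function numericStats(values: number[], bins = 10): NumericStats {
  if (values.length === 0) throw new Error("numericStats requires at least one value");
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((s, v) => s + v, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population standard deviation (divide by n); an assumption.
  const std = Math.sqrt(sorted.reduce((s, v) => s + (v - mean) * (v - mean), 0) / n);
  const min = sorted[0];
  const max = sorted[n - 1];
  const outlierCount = std === 0 ? 0 : sorted.filter(v => Math.abs(v - mean) > 3 * std).length;

  const histogram = new Array<number>(bins).fill(0);
  const width = (max - min) / bins;
  for (const v of sorted) {
    // The maximum value falls on the upper edge, so clamp it into the last bin.
    const i = width === 0 ? 0 : Math.min(bins - 1, Math.floor((v - min) / width));
    histogram[i] += 1;
  }
  return { mean, median, std, min, max, range: max - min, outlierCount, histogram };
}
```

The `histogram` array maps directly onto bar heights for an inline SVG: each count divided by the largest count gives a relative height.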

5. Overview Tabs

  • Columns — all column cards with stats and histograms
  • Missing — missing value summary across all columns
  • Correlations — correlation matrix for numeric columns
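The Correlations tab implies a pairwise Pearson coefficient over numeric columns. A minimal sketch follows; the cap of 8 columns mirrors the commit message in this PR, while the function names and the assumption that all columns are equal-length arrays with missing values already removed are illustrative.

```typescript
// Pearson correlation coefficient of two equal-length samples.
function pearson(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length === 0) {
    throw new Error("pearson requires two non-empty arrays of equal length");
  }
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  const denom = Math.sqrt(sxx * syy);
  // A constant column has zero variance, so the coefficient is undefined.
  return denom === 0 ? NaN : sxy / denom;
}

// Builds a symmetric matrix over at most maxColumns numeric columns.
function correlationMatrix(
  columns: Record<string, number[]>,
  maxColumns = 8
): { names: string[]; matrix: number[][] } {
  const names = Object.keys(columns).slice(0, maxColumns);
  const matrix = names.map(a => names.map(b => pearson(columns[a], columns[b])));
  return { names, matrix };
}
```

Capping the column count keeps the matrix readable in a modal and bounds the work at a fixed number of pairs.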

6. Real Data, Not Mock

Reads the actual CSV through Texera's file-service API, parses the first 5000 rows, computes statistics, and renders them in the panel. Mock data is used only as a fallback, when the source is not a dataset path or when fetching or parsing fails.


📸 Screenshots

| Quality Score + Suggestions | Column Roles + Stats |
| -- | -- |
| Quality Score bar, sub-score badges, cleaning action cards with "Add to Workflow" buttons | Auto-detected roles, overview stats (rows/cols/dupes), per-column histograms |

🎬 Demo

  1. Open workflow → click CSV File Scan operator
  2. Properties panel → click "📊 Profile Data"
  3. Modal opens → Quality Score: 100/100 "Excellent" (Iris is clean)
  4. Column Roles: Species = 🎯 possible target, Id = 🏷️ ID, 4 📊 features
  5. Scroll columns → see histograms for SepalLength, PetalWidth
  6. Switch to diabetes dataset → Score drops, suggestions appear: "Impute HbA1c", "Drop patient_id"

📁 Files Changed

New files:

  • data-profiling-panel/data-profiling.types.ts — Profile, Column, Suggestion, Role types
  • data-profiling-panel/data-profiling.utils.ts — Quality score, suggestions, role detection algorithms
  • data-profiling-panel/data-profiling.service.ts — CSV fetch, parse, compute stats
  • data-profiling-panel/data-profiling-panel.component.* — Panel UI with score, suggestions, columns, correlations
  • data-profiling-panel/data-profiling-modal.component.ts — Modal wrapper

Modified (additive only):

  • operator-property-edit-frame — Added "📊 Profile Data" button for CSV/scan operators

✅ Testing

  • Angular typecheck: clean
  • Profile button renders on CSV File Scan operators
  • Real Iris.csv: 150 rows, 6 columns, Score 100, correct role detection
  • Histograms render for all numeric columns
  • Suggestions generate correctly for datasets with issues
  • Quality score formula produces consistent results

Emily Sun and others added 4 commits May 15, 2026 21:55
This bundles the feature work that built up on this branch:

- Custom agents: dashboard CRUD page and editor dialog (48px icon tile,
  chip-style guardrails, model selector). Each custom agent now carries a
  LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the
  agent-service so different agents can use different models.

- Conversation history is scoped per (workflowId, agentId): switching
  agent or workflow yields a different conversation list. localStorage
  key: texera.workflowConversations.v1.{workflowId}.{agentId}.

- Time machine: workflow snapshot list, revert, and agent-tagged
  checkpoints. New workflow-history-tool in agent-service backs the
  "undo my last change" flow; amber gains a WorkflowSnapshotResource;
  sql/updates/23.sql adds the snapshot table.

- Operator-aware custom-agent prompts: the system prompt now injects the
  full operator catalog with a "prefer built-in operators over Python
  UDFs" rule, sourced from WorkflowSystemMetadata at request time.

- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5
  and gpt-5-mini in bin/litellm-config.yaml.

- Agent panel rewritten around the (conversation list / chat) two-view
  model with subscription-managed list reloads and per-step persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, role detection

Adds a Data Profiling Panel triggered from data-source operator properties
(CSV/JSON/Parquet/FileScan). The panel surfaces three derived views on top of
a single profile response — no new backend calls:

  - Data Quality Score (0–100): completeness, duplicates, outliers, constant
    columns, high-cardinality categoricals, and class-imbalance penalties,
    with a colored progress bar and sub-score badges.
  - Auto-Suggest Cleaning Actions: severity-sorted rules (drop sparse/ID/
    constant cols, impute via median/mode, deduplicate, review outliers) with
    an Add-to-Workflow button that copies an operator hint to the clipboard.
  - Column Relationship Detector: heuristic ID/target/feature/datetime/
    constant classification with badges per column and an auto-detected
    summary section.

Wires a small "📊 Profile Data" button into the operator property editor that
opens the panel as a draggable modal seeded with the operator's file path.
Backend integration is intentionally a follow-up; the service ships a
deterministic mock so the UX is fully exercised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ble rule

Adds a console.debug so we can see what operatorType is on the selected
operator (helps when the rule doesn't match an unexpected name). Also
broadens the profileable regex to include Text/File so anything that looks
remotely like a data source shows the button.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataProfilingService now fetches the actual dataset file via
DatasetService.retrieveDatasetVersionSingleFile (presign-download endpoint),
parses with papaparse (first 5000 rows for performance), and runs a new
pure-TS profiler that computes:

  - dtype inference per column (numeric / datetime / boolean / categorical / text)
  - per-column: count, missing, missingPercent, unique, plus dtype-specific stats
  - numeric: mean, median, std, min, max, ±3σ outlier count, 10-bin histogram
  - categorical/boolean: top-5 value counts
  - dataset-level: row-key duplicate count
  - Pearson correlation matrix across (up to 8) numeric columns

If the source isn't a dataset path or any step fails (fetch / parse / empty
headers), we fall back to the deterministic mock so the panel always renders.
The panel header now shows a short filename (full path on hover) and surfaces
fetch/parse errors inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
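Two of the profiler steps named in the commit above, dtype inference and the row-key duplicate count, can be sketched in plain TypeScript over already-parsed string cells. The missing-value tokens, the ISO-style date pattern, and the categorical cutoff are assumptions; the commit names the five dtypes but not the rules that separate them.

```typescript
type InferredDType = "numeric" | "datetime" | "boolean" | "categorical" | "text";

// Assumed tokens treated as missing.
const MISSING_TOKENS = new Set(["", "na", "n/a", "nan", "null"]);

function isMissing(v: string | null | undefined): boolean {
  return v == null || MISSING_TOKENS.has(v.trim().toLowerCase());
}

function inferDType(values: Array<string | null | undefined>): InferredDType {
  const present = values
    .filter((v): v is string => !isMissing(v))
    .map(v => v.trim());
  if (present.length === 0) return "text";
  if (present.every(v => /^(true|false)$/i.test(v))) return "boolean";
  if (present.every(v => Number.isFinite(Number(v)))) return "numeric";
  // Assumed: only ISO-like dates (YYYY-MM-DD...) count as datetime.
  if (present.every(v => /^\d{4}-\d{2}-\d{2}/.test(v) && !Number.isNaN(Date.parse(v)))) {
    return "datetime";
  }
  // Assumed cutoff: few distinct values relative to row count means categorical.
  const unique = new Set(present).size;
  return unique <= Math.max(20, present.length * 0.05) ? "categorical" : "text";
}

// Counts rows whose full set of cell values was already seen earlier.
function countDuplicateRows(rows: Array<Record<string, string>>, headers: string[]): number {
  const seen = new Set<string>();
  let duplicates = 0;
  for (const row of rows) {
    // JSON keeps cell boundaries unambiguous, unlike joining on a delimiter.
    const key = JSON.stringify(headers.map(h => row[h] ?? ""));
    if (seen.has(key)) duplicates += 1;
    else seen.add(key);
  }
  return duplicates;
}
```

Serializing each row into a key makes duplicate detection a single pass with a set, which stays cheap at the 5000-row sample size the commit mentions.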
github-actions bot added the engine, ddl-change (Changes to the TexeraDB DDL), frontend (Changes related to the frontend GUI), dev, common, and agent-service labels on May 16, 2026
@Yicong-Huang
Contributor

Thank you for participating in the hackathon. We really enjoyed your idea and appreciate the contribution.

We are now archiving hackathon submissions by closing the submission PRs. However, we strongly encourage you to continue developing your idea and explore the possibility of merging it into the main branch.

To move forward, please:

  1. Open an issue describing your idea, and link this PR as a reference.
  2. Discuss the design and implementation plan with us in the issue.
  3. Open smaller, focused PRs so they can go through proper review.

Thanks again for your participation and contribution!
