[Hackathon] feat: Data Profiling Panel — Instant Dataset Analysis Before You Run by EmilySun621 · Pull Request #5114 · apache/texera

EmilySun621 · 2026-05-16T19:46:51Z

🎯 Problem

Researchers drag a CSV onto the canvas and start building operators blind — they don't know how many missing values there are, which columns are IDs, or whether the target variable is imbalanced. They find out after the workflow fails.

💡 Solution

One-click data profiling that reads the real CSV and tells you everything before you write a single operator.

✨ Features

1. Data Quality Score (0-100)

Single number summarizing dataset health, with sub-score breakdown:

✅ Completeness — missing value percentage across columns
✅ Duplicates — duplicate row detection
⚠️ Outliers — values beyond 3 standard deviations
⚠️ Constant columns — zero-information columns
⚠️ High cardinality — likely ID columns
⚠️ Class imbalance — skewed target distribution

Color scale: 🟢 90-100 Excellent · 🟡 70-89 Good · 🟠 50-69 Needs attention · 🔴 0-49 Poor
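A minimal sketch of how a score could map onto the color scale above. The band thresholds come from this description; the penalty model (`QualityPenalties`, `qualityScore`) is a hypothetical illustration, since the actual weights live in `data-profiling.utils.ts` and are not reproduced here.

```typescript
// Band labels and thresholds follow the color scale in this description.
type QualityBand = "Excellent" | "Good" | "Needs attention" | "Poor";

function qualityBand(score: number): QualityBand {
  if (score >= 90) return "Excellent";
  if (score >= 70) return "Good";
  if (score >= 50) return "Needs attention";
  return "Poor";
}

// Hypothetical penalty model (an assumption, not the PR's formula):
// each sub-check contributes a non-negative penalty in score points.
interface QualityPenalties {
  completeness: number;
  duplicates: number;
  outliers: number;
  constantColumns: number;
  highCardinality: number;
  classImbalance: number;
}

// Score is 100 minus the summed penalties, rounded and clamped to [0, 100].
function qualityScore(p: QualityPenalties): number {
  const total =
    p.completeness +
    p.duplicates +
    p.outliers +
    p.constantColumns +
    p.highCardinality +
    p.classImbalance;
  return Math.max(0, Math.min(100, Math.round(100 - total)));
}
```

Clamping keeps the score inside the 0-100 range even when several penalties stack up on a badly damaged dataset.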

2. Suggested Cleaning Actions

Rule-based, no LLM — zero hallucination risk:

🔧 Impute — "HbA1c has 12.3% missing — use median imputation" → [Add to Workflow]
🗑️ Drop column — "smoker_flag has only 1 unique value" → [Add to Workflow]
📋 Remove duplicates — "23 rows (3.0%) are exact duplicates" → [Add to Workflow]
🏷️ Flag ID — "patient_id is 100% unique — drop before modeling" → [Add to Workflow]
📊 Review outliers — "income has 42 outliers (5.5%) beyond 3σ" → [Copy hint]

Sorted by severity: critical → warning → info.
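The rule-based approach can be sketched roughly as follows. The interfaces, the severity assigned to each rule, and the function names are assumptions made for illustration; only the rule ideas (constant column, fully unique column, missing values with median/mode imputation) and the critical → warning → info ordering come from this description.

```typescript
type Severity = "critical" | "warning" | "info";

// Hypothetical per-column summary; the PR's real types may differ.
interface ColumnSummary {
  name: string;
  rowCount: number;
  missing: number;
  unique: number;
  isNumeric: boolean;
}

interface Suggestion {
  severity: Severity;
  action: "impute" | "drop_column" | "flag_id";
  message: string;
}

const SEVERITY_RANK: Record<Severity, number> = { critical: 0, warning: 1, info: 2 };

function suggestForColumn(col: ColumnSummary): Suggestion[] {
  const out: Suggestion[] = [];
  if (col.unique === 1) {
    // A constant column carries no information; nothing else applies to it.
    out.push({ severity: "warning", action: "drop_column", message: `${col.name} has only 1 unique value` });
    return out;
  }
  if (col.missing === 0 && col.unique === col.rowCount) {
    out.push({ severity: "info", action: "flag_id", message: `${col.name} is 100% unique: drop before modeling` });
  }
  if (col.missing > 0) {
    const pct = (col.missing / col.rowCount) * 100;
    const method = col.isNumeric ? "median" : "mode";
    out.push({
      // Assumed threshold: more than 10% missing is treated as critical.
      severity: pct > 10 ? "critical" : "warning",
      action: "impute",
      message: `${col.name} has ${pct.toFixed(1)}% missing: use ${method} imputation`,
    });
  }
  return out;
}

// Collect suggestions for all columns, then order critical -> warning -> info.
function buildSuggestions(cols: ColumnSummary[]): Suggestion[] {
  const all: Suggestion[] = [];
  for (const col of cols) {
    all.push(...suggestForColumn(col));
  }
  return all.sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}
```

Because every suggestion is produced by a fixed rule over computed statistics, the same dataset always yields the same list.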

3. Column Role Detection (auto)

Heuristic classification of each column's ML role:

🎯 Target — columns named target/label/class/outcome, or low-cardinality categoricals
🏷️ ID — high-cardinality columns or names matching id/index/patient_id
📊 Feature — numeric and categorical columns for modeling
📅 Datetime — date/time columns
⚪ Constant — single-value columns (flag for removal)

Summary: "1 possible target: Species · 1 ID: Id · 4 features: SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm"
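A rough sketch of such a heuristic, under stated assumptions: the regular expressions, the `MAX_TARGET_CLASSES` cutoff, and the order in which the rules are checked are illustrative choices, not the PR's actual implementation.

```typescript
type ColumnRole = "target" | "id" | "feature" | "datetime" | "constant";

// Hypothetical column descriptor for this sketch.
interface ColumnInfo {
  name: string;
  dtype: "numeric" | "categorical" | "datetime";
  unique: number;
  rowCount: number;
}

const TARGET_NAME = /^(target|label|class|outcome)$/i;
// Matches "id", "Id", "index", and names ending in "_id" such as "patient_id".
const ID_NAME = /(^|_)(id|index)$/i;
// Assumed cutoff for "low-cardinality"; the real threshold is not stated.
const MAX_TARGET_CLASSES = 10;

function detectRole(col: ColumnInfo): ColumnRole {
  if (col.unique <= 1) return "constant";
  if (col.dtype === "datetime") return "datetime";
  if (ID_NAME.test(col.name)) return "id";
  // Only categoricals use the all-unique rule here, so continuous
  // numeric features are not mistaken for IDs.
  if (col.dtype === "categorical" && col.unique === col.rowCount) return "id";
  if (TARGET_NAME.test(col.name)) return "target";
  if (col.dtype === "categorical" && col.unique <= MAX_TARGET_CLASSES) return "target";
  return "feature";
}
```

With Iris-like inputs this yields `Id` as an ID, `Species` as a possible target, and the four measurement columns as features, matching the summary line above.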

4. Per-Column Statistics

Each column card displays:

Name + type badge (numeric/categorical) + role badge (Target/ID/Feature)
Numeric columns: mean, median, std, min, max, range
Numeric columns: inline SVG histogram (10 bins)
Categorical columns: unique count, top values with counts
Missing value warning (red highlight if > 10%)
Role suggestion ("Use as input feature" / "Drop before modeling")
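The numeric statistics and the 10-bin histogram listed above could be computed along these lines. This is a sketch, assuming population standard deviation and equal-width bins; the names `NumericStats` and `numericStats` are made up for the example.

```typescript
interface NumericStats {
  mean: number;
  median: number;
  std: number;
  min: number;
  max: number;
  range: number;
  outliers: number; // values more than 3 standard deviations from the mean
  histogram: number[]; // counts per equal-width bin
}

function numericStats(values: number[], bins = 10): NumericStats {
  if (values.length === 0) throw new Error("numericStats needs at least one value");
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const min = sorted[0] as number;
  const max = sorted[n - 1] as number;

  const mid = Math.floor(n / 2);
  const median =
    n % 2 === 1
      ? (sorted[mid] as number)
      : ((sorted[mid - 1] as number) + (sorted[mid] as number)) / 2;

  const mean = sorted.reduce((sum, v) => sum + v, 0) / n;
  // Population variance (divide by n); an assumption for this sketch.
  const variance = sorted.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  const std = Math.sqrt(variance);
  const outliers = std === 0 ? 0 : sorted.filter(v => Math.abs(v - mean) > 3 * std).length;

  const histogram = new Array<number>(bins).fill(0);
  const width = (max - min) / bins;
  for (const v of sorted) {
    // The maximum value falls on the upper edge, so clamp it into the last bin.
    const idx = width === 0 ? 0 : Math.min(bins - 1, Math.floor((v - min) / width));
    histogram[idx] = (histogram[idx] ?? 0) + 1;
  }

  return { mean, median, std, min, max, range: max - min, outliers, histogram };
}
```

The histogram counts are what an inline SVG renderer would turn into bar heights, typically scaled by the largest bin.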

5. Overview Tabs

Columns — all column cards with stats and histograms
Missing — missing value summary across all columns
Correlations — correlation matrix for numeric columns
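For the Correlations tab, the commit notes on this PR mention a Pearson correlation matrix across up to 8 numeric columns. A minimal sketch of that computation follows; it assumes the column arrays are row-aligned with missing values already removed, and it returns 0 for zero-variance columns, which is a choice made for this example.

```typescript
// Pearson correlation coefficient of two equally long numeric arrays.
function pearson(x: number[], y: number[]): number {
  const n = Math.min(x.length, y.length);
  if (n === 0) return 0;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += x[i] as number;
    sumY += y[i] as number;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = (x[i] as number) - meanX;
    const dy = (y[i] as number) - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  const denom = Math.sqrt(sxx * syy);
  // Zero variance in either column makes the coefficient undefined; use 0 here.
  return denom === 0 ? 0 : sxy / denom;
}

// Builds a square matrix over at most `maxColumns` columns, in key order.
function correlationMatrix(
  columns: Record<string, number[]>,
  maxColumns = 8
): { names: string[]; matrix: number[][] } {
  const names = Object.keys(columns).slice(0, maxColumns);
  const matrix = names.map(a =>
    names.map(b => pearson(columns[a] as number[], columns[b] as number[]))
  );
  return { names, matrix };
}
```

Capping the column count keeps the matrix readable in a modal and bounds the work at a fixed number of pairwise passes.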

6. Real Data, Not Mock

Reads the actual CSV through Texera's file-service API, parses it, computes statistics, and renders the results in real time. Falls back to mock data only if the API call fails.

📸 Screenshots

| Quality Score + Suggestions | Column Roles + Stats |
| -- | -- |
| Quality Score bar, sub-score badges, cleaning action cards with "Add to Workflow" buttons | Auto-detected roles, overview stats (rows/cols/dupes), per-column histograms |

🎬 Demo

Open workflow → click CSV File Scan operator
Properties panel → click "📊 Profile Data"
Modal opens → Quality Score: 100/100 "Excellent" (Iris is clean)
Column Roles: Species = 🎯 possible target, Id = 🏷️ ID, 4 📊 features
Scroll columns → see histograms for SepalLength, PetalWidth
Switch to diabetes dataset → Score drops, suggestions appear: "Impute HbA1c", "Drop patient_id"

📁 Files Changed

New files:

data-profiling-panel/data-profiling.types.ts — Profile, Column, Suggestion, Role types
data-profiling-panel/data-profiling.utils.ts — Quality score, suggestions, role detection algorithms
data-profiling-panel/data-profiling.service.ts — CSV fetch, parse, compute stats
data-profiling-panel/data-profiling-panel.component.* — Panel UI with score, suggestions, columns, correlations
data-profiling-panel/data-profiling-modal.component.ts — Modal wrapper

Modified (additive only):

operator-property-edit-frame — Added "📊 Profile Data" button for CSV/scan operators

✅ Testing

Angular typecheck: clean
Profile button renders on CSV File Scan operators
Real Iris.csv: 150 rows, 6 columns, Score 100, correct role detection
Histograms render for all numeric columns
Suggestions generate correctly for datasets with issues
Quality score formula produces consistent results

This bundles the feature work that built up on this branch:

- Custom agents: dashboard CRUD page and editor dialog (48px icon tile, chip-style guardrails, model selector). Each custom agent now carries a LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the agent-service so different agents can use different models.
- Conversation history is scoped per (workflowId, agentId): switching agent or workflow yields a different conversation list. localStorage key: texera.workflowConversations.v1.{workflowId}.{agentId}.
- Time machine: workflow snapshot list, revert, and agent-tagged checkpoints. New workflow-history-tool in agent-service backs the "undo my last change" flow; amber gains a WorkflowSnapshotResource; sql/updates/23.sql adds the snapshot table.
- Operator-aware custom-agent prompts: the system prompt now injects the full operator catalog with a "prefer built-in operators over Python UDFs" rule, sourced from WorkflowSystemMetadata at request time.
- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5 and gpt-5-mini in bin/litellm-config.yaml.
- Agent panel rewritten around the (conversation list / chat) two-view model with subscription-managed list reloads and per-step persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
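The per-(workflowId, agentId) scoping of conversation history can be illustrated with a small sketch. Only the key format comes from the commit message; the `KeyValueStore` interface and the helper names are hypothetical stand-ins for `localStorage` and whatever the branch actually uses.

```typescript
// Minimal storage abstraction so the sketch runs outside a browser;
// window.localStorage satisfies this shape.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Key format as stated in the commit message.
function conversationStorageKey(workflowId: number | string, agentId: number | string): string {
  return `texera.workflowConversations.v1.${workflowId}.${agentId}`;
}

// Hypothetical helpers: conversations are stored as a JSON array of IDs.
function saveConversationIds(
  store: KeyValueStore,
  workflowId: number | string,
  agentId: number | string,
  ids: string[]
): void {
  store.setItem(conversationStorageKey(workflowId, agentId), JSON.stringify(ids));
}

function loadConversationIds(
  store: KeyValueStore,
  workflowId: number | string,
  agentId: number | string
): string[] {
  const raw = store.getItem(conversationStorageKey(workflowId, agentId));
  return raw === null ? [] : (JSON.parse(raw) as string[]);
}
```

Because both IDs are part of the key, switching either the workflow or the agent reads a different entry and therefore a different conversation list.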

…, role detection

Adds a Data Profiling Panel triggered from data-source operator properties (CSV/JSON/Parquet/FileScan). The panel surfaces three derived views on top of a single profile response, with no new backend calls:

- Data Quality Score (0–100): completeness, duplicates, outliers, constant columns, high-cardinality categoricals, and class-imbalance penalties, with a colored progress bar and sub-score badges.
- Auto-Suggest Cleaning Actions: severity-sorted rules (drop sparse/ID/constant cols, impute via median/mode, deduplicate, review outliers) with an Add-to-Workflow button that copies an operator hint to the clipboard.
- Column Relationship Detector: heuristic ID/target/feature/datetime/constant classification with badges per column and an auto-detected summary section.

Wires a small "📊 Profile Data" button into the operator property editor that opens the panel as a draggable modal seeded with the operator's file path. Backend integration is intentionally a follow-up; the service ships a deterministic mock so the UX is fully exercised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ble rule

Adds a console.debug so we can see what operatorType is on the selected operator (helps when the rule doesn't match an unexpected name). Also broadens the profileable regex to include Text/File so anything that looks remotely like a data source shows the button.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DataProfilingService now fetches the actual dataset file via DatasetService.retrieveDatasetVersionSingleFile (presign-download endpoint), parses with papaparse (first 5000 rows for performance), and runs a new pure-TS profiler that computes:

- dtype inference per column (numeric / datetime / boolean / categorical / text)
- per-column: count, missing, missingPercent, unique, plus dtype-specific stats
- numeric: mean, median, std, min, max, ±3σ outlier count, 10-bin histogram
- categorical/boolean: top-5 value counts
- dataset-level: row-key duplicate count
- Pearson correlation matrix across (up to 8) numeric columns

If the source isn't a dataset path or any step fails (fetch / parse / empty headers), we fall back to the deterministic mock so the panel always renders. The panel header now shows a short filename (full path on hover) and surfaces fetch/parse errors inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
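Two of the profiler steps named in this commit, dtype inference and row-key duplicate counting, could look roughly like the sketch below. The specific rules (accepted boolean literals, ISO-style date check, the 0.5 unique-ratio cutoff between categorical and text) are assumptions for illustration and are not taken from the PR's code.

```typescript
type Dtype = "numeric" | "datetime" | "boolean" | "categorical" | "text";

const BOOLEAN_VALUES = new Set(["true", "false"]);
// Assumed date shape: values starting with YYYY-MM-DD.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}/;

// Infers a column dtype from raw CSV cell strings; empty cells count as missing.
function inferDtype(raw: string[]): Dtype {
  const values = raw.map(v => v.trim()).filter(v => v !== "");
  if (values.length === 0) return "text";
  if (values.every(v => BOOLEAN_VALUES.has(v.toLowerCase()))) return "boolean";
  if (values.every(v => Number.isFinite(Number(v)))) return "numeric";
  if (values.every(v => ISO_DATE.test(v) && !Number.isNaN(Date.parse(v)))) return "datetime";
  // Assumed cutoff: many repeated values suggests a categorical column.
  const uniqueRatio = new Set(values).size / values.length;
  return uniqueRatio <= 0.5 ? "categorical" : "text";
}

// Counts rows whose full cell sequence has already been seen (exact duplicates).
function countDuplicateRows(rows: string[][]): number {
  const seen = new Set<string>();
  let duplicates = 0;
  for (const row of rows) {
    const key = JSON.stringify(row);
    if (seen.has(key)) {
      duplicates += 1;
    } else {
      seen.add(key);
    }
  }
  return duplicates;
}
```

Using the serialized row as a key keeps duplicate detection to a single pass over the parsed rows, which matters less at 5000 rows but keeps the profiler simple.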

Yicong-Huang · 2026-05-23T18:02:51Z

Thank you for participating in the hackathon. We really enjoyed your idea and appreciate the contribution.

We are now archiving hackathon submissions by closing the submission PRs. However, we strongly encourage you to continue developing your idea and explore the possibility of merging it into the main branch.

To move forward, please:

Open an issue describing your idea, and link this PR as a reference.
Discuss the design and implementation plan with us in the issue.
Open smaller, focused PRs so they can go through proper review.

Thanks again for your participation and contribution!

Emily Sun and others added 4 commits May 15, 2026 21:55

github-actions Bot assigned EmilySun621 May 16, 2026

github-actions Bot added the engine, ddl-change (Changes to the TexeraDB DDL), frontend (Changes related to the frontend GUI), dev, common, and agent-service labels May 16, 2026

Yicong-Huang closed this May 23, 2026

EmilySun621 wants to merge 4 commits into apache:main from EmilySun621:hackathon/data-profiling
