Skip to content

Flink FULL_DATA_TYPE parser#3305

Merged
James Robinson (jlrobins) merged 19 commits intomainfrom
flink_simplified_datatype_parser
Mar 3, 2026
Merged

Flink FULL_DATA_TYPE parser#3305
James Robinson (jlrobins) merged 19 commits intomainfrom
flink_simplified_datatype_parser

Conversation

@jlrobins
Copy link
Contributor

@jlrobins James Robinson (jlrobins) commented Feb 26, 2026

Summary of Changes

  • Implement a traditional LL(1) recursive descent parser for Flink FULL_DATA_TYPE strings:

    • Parse all Flink type variants: scalars, parameterized, arrays, multisets
    • Support ROW types with backtick-quoted field names and optional comments
    • Support MAP types with exactly 2 members (key, value)
    • Minimal regex use - character-by-character parsing for clarity
    • Proper handling of nullability markers (NOT NULL, NULL)
    • Aimed to avoid most all O(n^2) string operations when building up intermediate strings.
  • The parser is split into two layers, one general probably reusable recursive descent (peek(), consume(), ...) generic layer, and then a higher layer FlinkTypeParser which uses the lower layer to parse its particular grammar. The entirety of the parser is hidden away by a simple exported function parseFlinkType(full_data_type_string) which will return a single or a tree of FlinkType.

  • Implements a type-safe discriminated union: FlinkType with two sub-types: ScalarFlinkType (no members) and CompoundFlinkType (required members), discriminated by either isCompoundFlinkType() type guard or just by checking if .kind === FlinkTypeKind.SCALAR as the output nodes from the parser. A toplevel node with or without interior members will be returned.

This is first of a two part effort. This creates the recursive descent parser core and a Flink FULL_DATA_TYPE-centric parser using it that can then parse Flink FULL_DATA_TYPE strings into object/object trees.

A followup branch will use those object/trees to make arbitrarily deep nested treeview items for the Flink table/view columns.

No UI / features changes present in this branch.

But while here, touched up Gulpfile.js to allow for regexes like "gulp test -t flinkTypeParser|ParserState' to be passed to gulp test and testRun.

Be not dismayed, there's only ~500 LOC of new models + parser code. And then so many tests and corner case tests. But since the strings we'll be parsing should all be well formed since coming from the Flink INFORMATION_SCHEMA, the defensiveness of the parsing and tests is ... Very Thorough Coding.

Pull request checklist

Please check if your PR fulfills the following (if applicable):

Tests

  • Added new
  • Updated existing
  • Deleted existing

Release notes

  • Does anything in this PR need to be mentioned in the user-facing CHANGELOG?

- Create simplified FlinkType model for UI purposes
  - Replace complex hierarchy with simple, focused interfaces
  - Add FlinkTypeKind enum (SCALAR, ROW, MAP, ARRAY, MULTISET)
  - Use CompoundFlinkType for ROW and MAP types with members array

- Implement recursive descent parser with peek()/consume() operations
  - Parse all Flink type variants: scalars, parameterized, arrays, multisets
  - Support ROW types with backtick-quoted field names and optional comments
  - Support MAP types with exactly 2 members (key, value)
  - Minimal regex use - character-by-character parsing for clarity
  - Proper handling of nullability markers (NOT NULL, NULL)
Store the full element type structure in a `members` array instead of
flattening the information. This allows full type hierarchy preservation
for nested containers (ARRAY<ROW<>>, MULTISET<ARRAY<>>, etc).

Changes:
- Move `members?: FlinkType[]` from CompoundFlinkType to base FlinkType
- Remove deprecated `areMembersNullable` field
- ARRAY/MULTISET now return CompoundFlinkType with single-element members
- Update type model docstrings to clarify ARRAY/MULTISET participation
- Enhance tests to verify leaf nodes are scalar (NOT compound) and that
  nested container types can be cleanly discriminated

Test coverage:
- ARRAY<ROW<>>, ARRAY<MAP<>>, ARRAY<MULTISET<>>
- MULTISET<ARRAY<>>, MULTISET<ROW<>>
- ARRAY<ARRAY<>> (doubly nested containers)
- All 62 parser tests passing

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
… unterminated comments

Addresses Copilot PR review comments:
- Add validation that closing backtick exists when parsing backtick-quoted ROW field names,
  throws clear error instead of silently consuming wrong character
- Add validation in parseComment() to throw error if EOF reached without closing quote,
  preventing truncated comments

These changes improve error reporting for malformed input and prevent confusing downstream
errors from propagating.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Addresses two medium priority issues:

1. Fix O(n²) performance in consumeWhile() by using slice() instead of
   string concatenation in loop. Previously, concatenating characters one
   at a time caused quadratic time complexity for large inputs.

2. Add negative offset bounds checking in peekAt(). The bounds check
   now properly handles negative offsets by checking `idx < 0` in
   addition to the upper bound check.

Also adds comprehensive test suite (17 tests) covering:
- Negative offset edge cases (before input start)
- consumeWhile() correctness and performance (including 10000-char stress test)
- Position advancement after consuming
- Multiple sequential consumeWhile() calls
- Edge cases (empty input, trimming behavior)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Addresses low priority feedback from code review:

1. Fix docstring typo in flinkTypes.ts: 'TIME STAMP WITH TIME ZONE' →
   'TIMESTAMP WITH TIME ZONE' for consistency with Flink SQL standards

2. Clean up tryConsume() lint rule disabling by replacing unused-var
   disabling for-of loop with simpler for-loop over keyword.length

3. Remove reference to non-existent REAL_WORLD_FULL_DATA_TYPE_EXAMPLES.md
   file in test describe block name (tests remain valid)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…e test coverage

- Updated JSDoc comment to accurately describe what the function validates (checks for non-empty members array, validates SCALAR+members is invalid)
- Added 7 new test cases covering all compound type kinds (ARRAY, ROW, MAP, MULTISET)
- Added tests for scalar types without members and edge cases (empty members array)
- Added test for nested compound type structures (ARRAY<ROW<>>)
- All 84 flinkTypeParser tests passing, 3055 total unit tests passing

This addresses Copilot comment ID 2861056430 by ensuring the documentation accurately reflects the implementation's actual behavior.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Fixes TypeScript compilation error TS2304: Cannot find name 'FlinkType'
in src/parsers/flinkTypeParser.test.ts due to using FlinkType in type
annotations for the new test cases.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…test -t

Add support for running multiple test suites in a single pass by:
- Detecting pipe-separated patterns (e.g., "pattern1|pattern2") and converting to regex OR syntax
- Detecting regex syntax (brackets, parentheses, wildcards, etc.)
- Using Mocha's grep() method for regex patterns (via GREP env var)
- Preserving FGREP behavior for simple substring matching (backward compatible)

Allows flexible test filtering:
  npx gulp test -t "parser|builder"          # Multiple test suites
  npx gulp test -t "flinkType.*test$"        # Regex patterns
  npx gulp test -t "async|integration"       # Various patterns

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 26, 2026 22:22
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements a recursive descent parser for Flink SQL data types, creating the foundation for displaying arbitrarily nested type structures in VS Code tree views. The parser handles all Flink type variants including scalars, parameterized types, arrays, multisets, ROW types with field names and comments, and MAP types.

Changes:

  • Adds FlinkType model interfaces and enum for representing Flink SQL types in a simplified, UI-friendly structure
  • Implements ParserState utility class providing character-level parsing primitives (peek/consume operations) with configurable delimiters
  • Creates FlinkTypeParser using recursive descent parsing to handle all Flink type variants with proper nullability and comment support
  • Enhances test infrastructure to support both exact string matching (FGREP) and regex patterns (GREP) for running specific tests

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/models/flinkTypes.ts Defines FlinkType and CompoundFlinkType interfaces with type guard validation for representing Flink SQL types
src/parsers/parserState.ts Implements reusable ParserState class with character-level parsing operations, delimiter matching, and identifier/keyword parsing
src/parsers/parserState.test.ts Comprehensive unit tests for ParserState covering edge cases, error handling, and custom pattern support
src/parsers/flinkTypeParser.ts Implements FlinkTypeParser using recursive descent to parse all Flink type variants with nullability and comment handling
src/parsers/flinkTypeParser.test.ts Extensive test suite covering scalar types, nested structures, real-world examples, error cases, and degenerate inputs
src/testing.ts Adds GREP support for regex-based test filtering alongside existing FGREP for exact string matching
Gulpfile.js Implements regex pattern detection and conversion for test filtering with pipe-separated patterns

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

- Fix GREP type safety in testing.ts: convert env var to RegExp with error handling
- Update comment wording for FGREP to clarify fixed-string/substring behavior
- Remove duplicate old comment in test file (line 489)
- Update stale Playlist test comment that claimed parser limitations no longer exist
- All test comments now accurately reflect current parser capabilities

All 129 parser tests passing.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@jlrobins James Robinson (jlrobins) marked this pull request as ready for review February 26, 2026 23:30
@jlrobins James Robinson (jlrobins) requested a review from a team as a code owner February 26, 2026 23:30
- Remove unnecessary escape characters in regex pattern (Gulpfile.js:740)
- Remove unused catch parameter (src/testing.ts:41)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Copy link
Contributor

@shouples Dave Shoup (shouples) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! I only had a couple of minor questions, but shouldn't block merging - looking forward to seeing this in action in the Flink Database view 👏

testFilter is never reassigned after initialization, so it should be
declared as const rather than let.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@jlrobins James Robinson (jlrobins) marked this pull request as draft March 2, 2026 14:43
Changes:
- Created BaseFlinkType interface with common fields (kind, dataType, isFieldNullable, fieldName, comment)
- ScalarFlinkType extends BaseFlinkType with kind=SCALAR, no members field
- CompoundFlinkType extends BaseFlinkType with kind=ROW|ARRAY|MULTISET|MAP and required members
- FlinkType is now a union type: ScalarFlinkType | CompoundFlinkType
- Updated isCompoundFlinkType() type guard to simple kind check (no validation throws)

Benefits:
✅ TypeScript prevents accessing .members on scalar types (compile-time safety)
✅ Non-optional members field guarantees non-empty for compounds (via construction)
✅ Type structure self-documents invariants
✅ DRY: common fields defined once in BaseFlinkType
✅ Zero runtime overhead
✅ Extractable for use in other TypeScript codebases

Updated tests:
- Removed tests that expected isCompoundFlinkType() to throw on invalid states
- Added tests verifying type guard correctly discriminates all type kinds

All 92 parser tests passing, TypeScript strict mode satisfied.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@jlrobins James Robinson (jlrobins) marked this pull request as ready for review March 2, 2026 14:56
@sonarqube-confluent
Copy link

@jlrobins James Robinson (jlrobins) merged commit 11c67e6 into main Mar 3, 2026
14 checks passed
@jlrobins James Robinson (jlrobins) deleted the flink_simplified_datatype_parser branch March 3, 2026 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants