
feat(describegpt): date/datetime content-type tokens with LLM-inferred chrono format #3884

Merged: jqnatividad merged 11 commits into master from feat/describegpt-date-content-type on May 22, 2026


@jqnatividad
Collaborator

Summary

describegpt --infer-content-type previously classified every temporal column as content_type = "unknown". CONTENT_TYPE_VOCAB had no date/datetime token, so the LLM (told to use only allowed tokens) correctly fell back to unknown. A data dictionary labeling a DateTime column "unknown" reads as a tooling failure.

This PR adds date and datetime content-type tokens that carry an LLM-inferred chrono strftime format as a suffix (e.g. datetime:%m/%d/%Y %I:%M:%S %p), reusing the existing duration:N suffix machinery, and wires the format into qsv synthesize.

How it works

  • Deterministic token, LLM-inferred format. The bare date/datetime token is stamped deterministically from the stats Type column (mirroring the unique_id stamp). The LLM only supplies the :<fmt> suffix, inferred from the raw frequency-distribution values.
  • Two-stage validation. normalize_datetime_token syntactically validates the strftime suffix (chrono StrftimeItems); validate_date_formats semantically validates it against a real frequency sample and strips a format that doesn't parse the data.
  • All-midnight reclassification. qsv stats types a column DateTime whenever any value carries a non-midnight time, which over-reports columns that are plainly dates stored with a zero time-of-day (e.g. NYC 311 Created Date, where every sampled value is MM/DD/YYYY 12:00:00 AM). downgrade_all_midnight_datetime_columns reclassifies such a column to date and strips the time specifiers from <fmt> (datetime:%m/%d/%Y %I:%M:%S %p → date:%m/%d/%Y). A column with a real time anywhere in its sample (e.g. Due Date) stays datetime.
  • synthesize consumption. parse_date_format extracts the format; build_date / DateQuantile / next() emit generated dates in the inferred on-disk format instead of hardcoded ISO. Bare tokens fall back to %Y-%m-%d / RFC 3339.
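
To make the token shape concrete, here is a minimal Rust sketch of splitting a content-type token into its kind and optional format suffix. It is not the PR's `parse_date_format`: `Temporal` and `parse_temporal_token` are hypothetical names, the code uses only the standard library, and it does not validate the suffix as a chrono strftime string.

```rust
/// Hypothetical classification of a temporal content-type token.
#[derive(Debug, PartialEq)]
enum Temporal {
    Date,
    DateTime,
}

/// Split `date:<fmt>` / `datetime:<fmt>` into (kind, optional format).
/// Returns None for any non-temporal token (for example `duration:5`).
/// An empty suffix is treated the same as a bare token.
fn parse_temporal_token(token: &str) -> Option<(Temporal, Option<&str>)> {
    // Split on the FIRST colon only: the format itself may contain colons.
    let (head, suffix) = match token.split_once(':') {
        Some((h, s)) => (h, Some(s)),
        None => (token, None),
    };
    let kind = match head {
        "date" => Temporal::Date,
        "datetime" => Temporal::DateTime,
        _ => return None,
    };
    let fmt = suffix.filter(|s| !s.trim().is_empty());
    Some((kind, fmt))
}
```

Splitting on the first colon matters because a format such as `%I:%M:%S` contains colons of its own. A caller that receives `None` for the format would apply the bare-token fallbacks described in the last bullet.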

Example

For NYC 311 data:

| Column | stats Type | content_type |
|---|---|---|
| Created Date | DateTime | date:%m/%d/%Y |
| Closed Date | DateTime | date:%m/%d/%Y |
| Due Date | DateTime | datetime:%m/%d/%Y %I:%M:%S %p |

Testing

  • cargo build --bin qsv -F all_features — clean; cargo clippy — no new warnings.
  • New unit tests in dictionary.rs (normalize/stamp/merge/validate/downgrade/strip_time_from_format) and faker_map.rs (parse_date_format, is_faker_token).
  • New integration test synthesize_with_dictionary_applies_inferred_date_format.
  • Regression: describegpt 70 + 65, synthesize 20 + 37, dictionary unit 43 — all pass.
  • The first-pass and refine prompts (plus the byte-identical DEFAULT_DICTIONARY_REFINE_PROMPT const) are updated; the default_dictionary_refine_prompt_matches_resource sync test passes.

🤖 Generated with Claude Code

jqnatividad and others added 2 commits May 21, 2026 18:27
…d chrono format

describegpt --infer-content-type previously classified date/datetime
columns as "unknown" because CONTENT_TYPE_VOCAB had no date token.

Add `date` and `datetime` content-type tokens that carry an LLM-inferred
chrono strftime format as a suffix (e.g. datetime:%m/%d/%Y %I:%M:%S %p),
reusing the existing duration:N suffix machinery:

- The bare date/datetime token is stamped deterministically from the
  stats Type column (mirroring the unique_id stamp); the LLM supplies
  only the <fmt> suffix.
- normalize_datetime_token syntactically validates the strftime suffix
  (chrono StrftimeItems); validate_date_formats semantically validates
  it against real frequency-distribution samples, stripping a format
  that does not parse the data.
- merge_content_type lets the LLM contribute only the format suffix to
  a code-stamped date column - it cannot reclassify a column or add a
  date token to a non-date column.
- synthesize consumes the inferred format: parse_date_format extracts
  it and build_date / DateQuantile emit generated dates in the original
  on-disk format instead of the hardcoded ISO output.
- First-pass and refine prompts (and the mirrored const) updated to
  instruct the LLM on the date:<fmt> / datetime:<fmt> forms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
qsv stats types a column DateTime whenever any single value carries a
non-midnight time, which over-reports columns whose values are plainly
dates stored with a zero time-of-day (e.g. NYC 311 "Created Date",
where every frequency-sampled value is "MM/DD/YYYY 12:00:00 AM").

Add downgrade_all_midnight_datetime_columns, a deterministic post-merge
step run after validate_date_formats: when every parseable frequency
sample of a datetime:<fmt> column falls exactly on midnight, the column
is reclassified as date and strip_time_from_format drops the time
specifiers from <fmt> (datetime:%m/%d/%Y %I:%M:%S %p becomes
date:%m/%d/%Y). Unparseable samples (e.g. NULL sentinels) are ignored
rather than blocking the downgrade; a column with a real time anywhere
in its frequency sample (e.g. "Due Date") stays datetime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
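
The format-stripping step in this commit can be sketched as follows. This is a standard-library approximation, not the PR's `strip_time_from_format`: it assumes the date specifiers precede the time specifiers, the list of time specifiers is an illustrative subset, and the trailing-separator set is the wider one a later review fix in this PR describes.

```rust
/// Sketch: cut the format at the first time specifier, then trim any
/// dangling separator. Returns None if no date part is left.
/// Assumption: the date part comes before the time part.
fn strip_time_from_format(fmt: &str) -> Option<String> {
    // Illustrative subset of strftime time specifiers (assumption).
    const TIME_SPECS: &[u8] = b"HIMSpPfTRXrkl";
    // Separators that may be left dangling once the time part is removed.
    const TRIM: &[char] = &[' ', '\t', 'T', ',', '-', '/', '.', ':', '_'];

    let bytes = fmt.as_bytes();
    let mut cut = fmt.len();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            // Skip an optional padding modifier, as in `%-d` or `%_H`.
            let mut j = i + 1;
            if j < bytes.len() && matches!(bytes[j], b'-' | b'_' | b'0') {
                j += 1;
            }
            if j < bytes.len() && TIME_SPECS.contains(&bytes[j]) {
                cut = i; // `%` is ASCII, so this is a valid char boundary
                break;
            }
            i = j + 1; // step over the specifier (also handles `%%`)
        } else {
            i += 1;
        }
    }
    let date_part = fmt[..cut].trim_end_matches(TRIM);
    if date_part.is_empty() {
        None
    } else {
        Some(date_part.to_string())
    }
}
```

With this sketch, `%m/%d/%Y %I:%M:%S %p` reduces to `%m/%d/%Y`, and a format that is all time specifiers yields `None` rather than an empty date format.
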
@codacy-production

codacy-production Bot commented May 21, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


🟢 Metrics 23 complexity

Metric Results
Complexity 23



jqnatividad and others added 8 commits May 21, 2026 18:52
Address roborev review (job 2345): validate_date_formats checked the
LLM-inferred date/datetime strftime suffix against only the first usable
frequency value, so an ambiguous format could survive incorrectly -
`%m/%d/%Y` parses a first sample `01/02/2020` even when a later sample
`13/02/2020` proves the column is actually `%d/%m/%Y`, making synthesize
emit dates in the wrong layout.

Now every usable frequency sample must parse with the inferred format;
if any fails, the suffix is stripped back to the bare token. Extract a
shared `usable_samples_by_field` helper (also used by
`downgrade_all_midnight_datetime_columns`) and add a regression test
where the first sample is ambiguous but a later one disambiguates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
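
The ambiguity this commit fixes can be reproduced with a small standard-library sketch. `Order` and `parses` are hypothetical stand-ins for a strftime format and a chrono parse (they only handle slash-separated numeric dates and only range-check month and day); the point is the difference between checking the first sample and checking all of them.

```rust
/// Hypothetical stand-in for a strftime format: the field order of a
/// slash-separated numeric date.
#[derive(Clone, Copy)]
enum Order {
    MonthDayYear, // plays the role of %m/%d/%Y
    DayMonthYear, // plays the role of %d/%m/%Y
}

/// Toy parse check: three numeric fields, month in 1..=12, day in 1..=31.
fn parses(sample: &str, order: Order) -> bool {
    let parsed: Result<Vec<u32>, _> =
        sample.split('/').map(|p| p.parse::<u32>()).collect();
    let parts = match parsed {
        Ok(p) if p.len() == 3 => p,
        _ => return false,
    };
    let (month, day) = match order {
        Order::MonthDayYear => (parts[0], parts[1]),
        Order::DayMonthYear => (parts[1], parts[0]),
    };
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// Old behaviour: only the first usable sample is checked.
fn first_sample_ok(samples: &[&str], order: Order) -> bool {
    samples.first().is_some_and(|&s| parses(s, order))
}

/// New behaviour: every usable sample must parse.
fn all_samples_ok(samples: &[&str], order: Order) -> bool {
    !samples.is_empty() && samples.iter().all(|&s| parses(s, order))
}
```

For the samples `01/02/2020` and `13/02/2020`, the first-sample check accepts the month-first order, while the all-samples check rejects it and accepts only the day-first order.
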
…ipping

Address roborev review (job 2346):

MEDIUM - downgrade_all_midnight_datetime_columns used filter_map, which
silently dropped any sample that failed to parse with the inferred
format. A real frequency value with a non-midnight time but a
mismatched format could be dropped, wrongly downgrading the column to
date. It now collects into Option<Vec<bool>>: any sample that fails to
parse blocks the downgrade and the column stays datetime.

LOW - strip_time_from_format only trimmed whitespace, `T` and `,`
before the time specifier, so a format like `%Y-%m-%d-%H:%M:%S`
downgraded to `date:%Y-%m-%d-` (trailing separator). The trim set now
also covers `-` `/` `.` `:` `_`.

Updated the downgrade test to assert an unparseable sample keeps the
column datetime, and added a strip_time_from_format case for the `-`
separator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
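
The MEDIUM finding comes down to `filter_map` versus collecting into `Option<Vec<bool>>`. The sketch below contrasts the two under stated assumptions: `toy_is_midnight` is a hypothetical stand-in for "parse with the inferred format and report whether the time is midnight", and the function names are illustrative rather than taken from the PR.

```rust
/// Hypothetical parser for `YYYY-MM-DD HH:MM:SS` only.
/// Returns None when the sample does not match that shape.
fn toy_is_midnight(sample: &str) -> Option<bool> {
    let (date, time) = sample.split_once(' ')?;
    if date.len() != 10 || time.len() != 8 {
        return None;
    }
    Some(time == "00:00:00")
}

/// Lenient (old) behaviour: unparseable samples are silently dropped.
fn downgrade_lenient<F>(samples: &[&str], is_midnight: F) -> bool
where
    F: Fn(&str) -> Option<bool>,
{
    let flags: Vec<bool> = samples.iter().filter_map(|&s| is_midnight(s)).collect();
    !flags.is_empty() && flags.iter().all(|&m| m)
}

/// Strict (new) behaviour: a single unparseable sample yields None for
/// the whole collection, which blocks the downgrade.
fn downgrade_strict<F>(samples: &[&str], is_midnight: F) -> bool
where
    F: Fn(&str) -> Option<bool>,
{
    let flags: Option<Vec<bool>> = samples.iter().map(|&s| is_midnight(s)).collect();
    flags.is_some_and(|f| !f.is_empty() && f.iter().all(|&m| m))
}
```

Collecting an iterator of `Option<bool>` into `Option<Vec<bool>>` short-circuits to `None` on the first failure, so a sample the format cannot parse keeps the column `datetime` instead of disappearing from the check.
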
Address roborev review (job 2347): usable_samples_by_field only
excluded rank-0 rows and the <ALL_UNIQUE> sentinel. `frequency
--pct-nulls` gives the null row a real (non-zero) rank, so the `(NULL)`
row was kept as a usable sample. Since validate_date_formats now
requires every sample to parse, a single ranked null row stripped an
otherwise-valid date:<fmt> / datetime:<fmt> suffix.

usable_samples_by_field now also excludes rows whose value is `(NULL)`
(frequency's default --null-text), so both validate_date_formats and
downgrade_all_midnight_datetime_columns ignore them. Added regression
test validate_date_formats_ignores_ranked_null_rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address roborev review (job 2348): the finding — a (NULL) row ranked by
`frequency --pct-nulls` blocking the datetime->date downgrade — shares
its root cause with job 2347 and was already fixed in f26d1b0:
usable_samples_by_field now excludes (NULL) rows by value, and
downgrade_all_midnight_datetime_columns consumes that helper, so a
ranked (NULL) row never reaches the strict parse check.

Add the downgrade-path regression test the review asked for: an
all-midnight datetime column with a valid sample plus a ranked (NULL)
row still downgrades to date.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address roborev review (job 2349): usable_samples_by_field excluded the
null row by matching the hard-coded `(NULL)` label. That was both too
narrow - `describegpt --freq-options` can pass a custom
`frequency --null-text`, whose ranked null row (`--pct-nulls`) then
slipped through and stripped valid date/datetime suffixes - and too
broad - a real data value literally equal to `(NULL)` would be dropped.

Identify the null row structurally instead: its `count` equals the
column's null count, which `DictionaryEntry.null_count` already carries.
usable_samples_by_field now takes the entries and excludes the row
whose count matches the field's null count, independent of `--null-text`
and `--pct-nulls`. Regression tests updated to set null_count and use a
custom null label (`<MISSING>`) to prove label-independence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address roborev review (job 2351): usable_samples_by_field identified the
frequency null row by `count == null_count` alone. That is too broad - a
real, non-null value whose count merely equals the column's null count
would be dropped, which can keep a bad date suffix or wrongly downgrade a
datetime column to date.

Require BOTH signals `frequency` guarantees for the emitted null row: its
`value` equals the configured `--null-text` AND its `count` equals the
column's null count. Value alone is too broad (a real datum reading as the
null label) and count alone is too broad (a real value sharing the null
count); together they confine the exclusion to the genuine null sentinel,
even when `frequency --pct-nulls` gives that row a real rank.

The configured null-text is threaded through validate_date_formats and
downgrade_all_midnight_datetime_columns. A new configured_null_text helper
parses `--null-text` out of `--freq-options` (default `(NULL)`). Regression
tests added: a real sample sharing the null count is not dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
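
A minimal sketch of the two-signal null-row check, assuming frequency rows are available as `(value, count)` pairs. The function name and the tuple layout are hypothetical, and the rank-0 and `<ALL_UNIQUE>` exclusions mentioned in earlier commits are left out to keep the example short.

```rust
/// Keep every frequency value except the genuine null sentinel, which is
/// identified by BOTH its label and its count.
fn usable_samples<'a>(
    rows: &[(&'a str, u64)],
    null_text: &str,
    null_count: u64,
) -> Vec<&'a str> {
    rows.iter()
        // Drop a row only when value AND count both match the null row.
        .filter(|row| !(row.0 == null_text && row.1 == null_count))
        .map(|row| row.0)
        .collect()
}
```

With a custom label `<MISSING>` and a null count of 3, the row `("<MISSING>", 3)` is dropped, but a real value that merely shares the count, or a real value that merely reads as the label, is kept.
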
Address roborev review (job 2352). Two findings, both genuine but very
narrow corners whose robust fixes are feature-level:

- Medium: a `file:`-backed frequency CSV generated with a custom
  `--null-text` - describegpt did not generate that CSV, so it cannot
  know the label and `configured_null_text` falls back to `(NULL)`.
- Low: a real data value identical to the null label AND sharing the
  column's null count is indistinguishable from the null sentinel in
  the CSV - an inherent ambiguity that only a controlled null-text
  could remove, which would pollute dictionary enumerations/examples.

The existing null-text + null-count check is correct for all normal
usage. Rather than add a new CLI flag, document the `file:` limitation
explicitly: the `--freq-options` USAGE text now states that a
`file:`-backed CSV is assumed to use frequency's default `(NULL)` null
text, and the `configured_null_text` doc comment spells out the
residual `file:` + custom `--null-text` + `--pct-nulls` gap. Help doc
regenerated from the updated USAGE text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
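
For illustration, the fallback behaviour described above could look roughly like this. It is a hypothetical sketch, not the PR's `configured_null_text`: it assumes the option string is whitespace-separated, ignores shell quoting, and treats any `file:` value as using the default label.

```rust
/// Sketch: extract `--null-text` from a frequency option string,
/// defaulting to `(NULL)`. Quoted values with spaces are NOT handled.
fn configured_null_text(freq_options: &str) -> String {
    const DEFAULT_NULL_TEXT: &str = "(NULL)";

    // A `file:`-backed CSV was generated elsewhere, so its label is
    // unknown here and the default is assumed.
    if freq_options.trim_start().starts_with("file:") {
        return DEFAULT_NULL_TEXT.to_string();
    }

    let mut tokens = freq_options.split_whitespace();
    while let Some(tok) = tokens.next() {
        if tok == "--null-text" {
            // Space-separated form: `--null-text <value>`.
            if let Some(value) = tokens.next() {
                return value.to_string();
            }
        } else if let Some(value) = tok.strip_prefix("--null-text=") {
            // Equals form: `--null-text=<value>`.
            return value.to_string();
        }
    }
    DEFAULT_NULL_TEXT.to_string()
}
```

The `file:` branch makes the documented gap visible: if that CSV was produced with a custom `--null-text`, this function still reports `(NULL)`.
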
Contributor

Copilot AI left a comment


Pull request overview

Adds first-class temporal content-type support to describegpt --infer-content-type by introducing deterministic date/datetime tokens that can carry an LLM-inferred chrono strftime suffix (e.g. datetime:%m/%d/%Y %I:%M:%S %p), validates the suffix against real frequency samples, and threads the validated format into qsv synthesize so generated outputs match the source on-disk representation instead of defaulting to ISO/RFC3339.

Changes:

  • Extend describegpt dictionary content-type vocabulary with date/datetime plus :<fmt> suffix handling, including merge rules and sample-based validation/downgrade logic.
  • Teach synthesize to parse and apply inferred date/datetime output formats while keeping date/datetime generation type-based (not faker-based).
  • Add unit + integration tests and update default prompts/docs to document the new suffix forms and known file: limitations around null-text.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| tests/test_synthesize.rs | Adds an integration test asserting synthesize applies inferred date formatting from a dictionary. |
| src/cmd/synthesize/generator.rs | Threads an optional inferred strftime format into the date/datetime quantile generator and applies it at render time. |
| src/cmd/synthesize/faker_map.rs | Adds date/datetime as non-faker tokens and implements parse_date_format to extract/validate the :<fmt> suffix. |
| src/cmd/describegpt/dictionary.rs | Adds date/datetime tokens to the vocab, normalization/merge rules, format validation against frequency samples, and all-midnight datetime→date downgrade. |
| src/cmd/describegpt.rs | Wires date-format validation/downgrade into the dictionary build pipeline and adds --freq-options null-text parsing (with documented file: limitation). |
| resources/describegpt_defaults.toml | Updates default prompts to instruct the LLM on date:<fmt> / datetime:<fmt> suffix usage and constraints. |
| docs/help/describegpt.md | Updates generated help to document the file: + custom --null-text limitation for date/datetime format validation. |

Comment thread tests/test_synthesize.rs Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@jqnatividad merged commit 688aba8 into master on May 22, 2026
18 checks passed
@jqnatividad deleted the feat/describegpt-date-content-type branch on May 22, 2026 01:02

2 participants