Skip to content

fix(#179): map qsv Date to Postgres date (not timestamp)#314

Merged
jqnatividad merged 3 commits into
mainfrom
fix-179-date-without-timestamp
May 18, 2026
Merged

fix(#179): map qsv Date to Postgres date (not timestamp)#314
jqnatividad merged 3 commits into
mainfrom
fix-179-date-without-timestamp

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

Closes #179.

Reporter: CSV with a date-only column ("2023-02-02") showed up in the CKAN preview as "2023-02-02T00:00:00" — DP+ was storing the column as Postgres timestamp and running RFC-3339 normalization, which adds the T00:00:00 time component for date-only inputs.

Root cause

A discrepancy between two sources of the same default:

  • config.py:144 inline fallback for type_mapping had "Date": "date" (correct).
  • config_declaration.yaml had "Date": "timestamp" (wrong).

The CKAN declaration default wins over an inline tk.config.get(key, fallback) fallback — the declaration populates tk.config BEFORE the fallback is consulted. So real deployments saw the buggy declaration value, never the correct inline one.

Fix

  1. config_declaration.yaml — change "Date": "timestamp" to "Date": "date". A column qsv infers as Date (date-only) now maps to Postgres date, skips qsv datefmt normalization (only timestamp columns are added to datetimecols_list), and lands in CKAN as 2023-02-02 instead of 2023-02-02T00:00:00.

  2. Secondary bug in analysis._parse_stats: the reverse-mapping from Postgres type_override back to qsv type names was missing the "date" entry. An operator setting type_override = "date" in the Data Dictionary on a column qsv inferred as DateTime would silently fall through to qsv's DateTime inference and be mapped right back to Postgres timestamp. Added "date": "Date" to the reverse map.

Test plan

  • 217/217 Python unit tests pass on dpp-test (was 213; +4 new).
  • New tests/test_date_without_timestamp.py:
    • Date column → Postgres date (the user-visible bug).
    • Date column NOT added to datetimecols_list (skips RFC-3339 normalization).
    • type_override = "date" survives the reverse-map (secondary bug).
    • Declaration default + inline fallback in config.py agree (catches re-regression).
  • End-to-end smoke on the integration stack: uploaded a CSV with a date-only event_date column containing values like 2023-02-02. Result on the persisted CKAN resource:
    • event_date field type: date (was timestamp pre-fix)
    • event_date row value: 2023-02-02 (was 2023-02-02T00:00:00 pre-fix)
  • Integration test in CI.

Test isolation note

Needed an autouse fixture to point ckanext.datapusher_plus.dictionary_stash_dir at a per-test tmp_path. Without it, the third test's mocked-existing-resource path writes a stash file to the global /tmp/dpp_dict_stash/ keyed by resource_id, and that stale stash leaks into the FIRST test's _parse_stats call via the retry-restore branch from #265. The fixture isolates the stash dir per test.

🤖 Generated with Claude Code

jqnatividad and others added 2 commits May 18, 2026 14:02
Closes #179.

Reporter: CSV with a date-only column ("2023-02-02") showed up in the
CKAN preview as "2023-02-02T00:00:00" — DP+ was storing the column as
Postgres ``timestamp`` and running RFC-3339 normalization, which adds
the ``T00:00:00`` time component for date-only inputs.

Root cause: a discrepancy between two sources of the same default.

* ``ckanext/datapusher_plus/config.py:144`` inline fallback for
  ``type_mapping`` had ``"Date": "date"`` (correct).
* ``ckanext/datapusher_plus/config_declaration.yaml`` had
  ``"Date": "timestamp"`` (wrong).

The CKAN declaration default wins over an inline ``tk.config.get(key,
fallback)`` fallback — the declaration populates ``tk.config`` BEFORE
the fallback is consulted. So real deployments saw the buggy
declaration value, never the correct inline one.

Fix:

1. ``config_declaration.yaml``: change ``"Date": "timestamp"`` to
   ``"Date": "date"`` to match the inline default. Now a column qsv
   infers as ``Date`` (date-only) maps to Postgres ``date``, skips
   ``qsv datefmt`` normalization (only ``timestamp`` columns are
   added to ``datetimecols_list``), and lands in CKAN as
   ``2023-02-02`` instead of ``2023-02-02T00:00:00``.

2. Secondary bug in ``analysis._parse_stats``: the reverse-mapping
   from Postgres ``type_override`` back to qsv type names was
   missing the ``"date"`` entry. An operator setting
   ``type_override = "date"`` in the Data Dictionary on a column
   qsv inferred as DateTime would silently fall through to qsv's
   ``DateTime`` inference and be mapped right back to Postgres
   ``timestamp`` — losing the operator's intent. Added
   ``"date": "Date"`` to the map.

Tests in new ``tests/test_date_without_timestamp.py``:

* ``test_date_column_becomes_postgres_date_not_timestamp`` — the
  user-visible #179 fix.
* ``test_date_column_not_added_to_datetimecols_for_normalization``
  — pins the qsv-datefmt-skip side of the fix (which is what
  prevents the T00:00:00 from being added).
* ``test_type_override_date_wins_over_qsv_datetime_inference`` —
  the secondary reverse-map fix.
* ``test_config_declaration_default_maps_date_to_date_not_timestamp``
  — pins both the declaration value AND that the inline fallback
  in config.py agrees, so a future "tidy up" PR can't silently
  re-introduce the bug.

Test isolation note: needed an autouse fixture to point
``ckanext.datapusher_plus.dictionary_stash_dir`` at a per-test
tmp_path. Without it, the third test's mocked-existing-resource path
writes a stash file to the global ``/tmp/dpp_dict_stash/`` keyed by
``resource_id``, and that stale stash leaks into the FIRST test's
``_parse_stats`` call via the retry-restore branch in #265. The
fixture isolates the stash dir per test.

End-to-end verification on the integration stack: uploaded a CSV
with a date-only column ``event_date`` containing values like
``2023-02-02``. Result on the persisted CKAN resource:

  datastore field types:
    event_date  →  date          (was: timestamp pre-fix)
  row values:
    event_date  →  '2023-02-02'  (was: '2023-02-02T00:00:00' pre-fix)

Tests: 217/217 unit tests pass on dpp-test (was 213; +4 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings:

**MEDIUM** — Subtle regression for ``AUTO_INDEX_DATES``. Pre-#179,
qsv-inferred Date columns mapped to Postgres ``timestamp``, so
``IndexingStage._get_datetime_columns``' ``header["type"] == "timestamp"``
check correctly picked them up for auto-indexing. Post-#179, date-only
columns are Postgres ``date`` — and the narrow equality silently
excluded them. Operators relying on AUTO_INDEX_DATES would see their
date-only indexes disappear after upgrade. Fixed: changed the check
to ``header["type"] in ("timestamp", "date")``. Note: the
``datetimecols_list`` populated in ``analysis._build_headers_dicts``
drives ``qsv datefmt`` normalization (which we want to SKIP for date-
only columns — that's the whole #179 fix), so I deliberately did NOT
include ``"date"`` there. The two lists now legitimately diverge:
normalization-list ⊂ indexing-list.

**LOW** — Brittle regex on ``config.py`` source. Replaced with
``ast`` walk that finds ``tk.config.get("ckanext.datapusher_plus.type_mapping",
"<json>")`` by structural pattern. Now reformat-immune (double-vs-
single quotes, line breaks, nearby comments with apostrophes).

**LOW** — Stats CSV fixture omitted the many extra columns qsv's real
``stats.csv`` emits (``sum``, ``mean``, ``stddev``, ``nullcount``, etc.).
``_parse_stats`` doesn't read them today; if a future change starts to,
the test would fail with a confusing KeyError. Added a sentinel
comment listing which columns ARE required and noting the maintenance
cost.

Added regression test ``test_auto_index_dates_still_indexes_date_only_columns``
that builds a minimal IndexingStage context with mixed ``date`` /
``timestamp`` / non-date columns and asserts both ``date`` and
``timestamp`` columns land in the AUTO_INDEX_DATES candidate list,
while text / numeric / bigint do not.

Tests: 218/218 unit tests pass on dpp-test (was 217; +1 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes the ingestion/type-mapping pipeline so qsv-inferred Date columns map to Postgres date (not timestamp), preventing date-only values like 2023-02-02 from being normalized/rendered as 2023-02-02T00:00:00 in CKAN previews (issue #179). Also addresses a related Data Dictionary type_override="date" reverse-mapping gap and keeps auto-indexing behavior consistent for date-like columns.

Changes:

  • Update declarative type_mapping default so Date → Postgres date (instead of timestamp).
  • Add date to the Postgres→qsv reverse map used when applying Data Dictionary overrides in the analysis stage.
  • Expand auto-index “date-like” detection to include both timestamp and date, and add regression tests covering the end-to-end behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
ckanext/datapusher_plus/config_declaration.yaml Corrects the declarative default mapping so qsv Date becomes Postgres date.
ckanext/datapusher_plus/jobs/stages/analysis.py Ensures Data Dictionary type_override="date" is correctly reverse-mapped back to qsv Date.
ckanext/datapusher_plus/jobs/stages/indexing.py Treats Postgres date as date-like for AUTO_INDEX_DATES (alongside timestamp).
tests/test_date_without_timestamp.py Adds regression coverage for #179 plus related override + auto-index behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_date_without_timestamp.py Outdated
Comment thread tests/test_date_without_timestamp.py
Comment thread ckanext/datapusher_plus/jobs/stages/indexing.py Outdated
Three findings, all applied:

* **Unused ``import re``** in ``tests/test_date_without_timestamp.py``
  (leftover from before the AST-walk refactor in the roborev #2257
  follow-up). Removed.

* **Unused ``tmp_path`` fixture parameter** on
  ``analysis_stage_context``. The fixture used to use ``tmp_path`` to
  point the dictionary-stash dir, but that responsibility moved to
  the new autouse ``_isolate_dictionary_stash`` fixture. Removed.

* **Misleading method/variable names** in ``indexing.py``:
  ``_get_datetime_columns`` / ``datetimecols_list`` now also pick up
  Postgres ``date`` columns (the AUTO_INDEX_DATES regression fix),
  so the "datetime"-only naming is no longer accurate. Renamed:
  - ``_get_datetime_columns`` -> ``_get_date_like_columns``
  - ``datetimecols_list`` -> ``date_like_cols_list``
  Updated surrounding docstrings + comments to say "date-like
  (Postgres ``timestamp`` or ``date``)". Deliberately did NOT
  rename the same-named variable in ``analysis.py`` — there the
  list is genuinely datetime-only (it drives ``qsv datefmt``
  normalization which we skip for date-only columns; that's the
  #179 fix). The two files now legitimately diverge on the name.
  Updated the regression test to call the new method name.

Tests: 218/218 unit tests still pass on dpp-test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 9a8a6c7 into main May 18, 2026
1 check passed
@jqnatividad jqnatividad deleted the fix-179-date-without-timestamp branch May 18, 2026 19:14
jqnatividad added a commit that referenced this pull request May 18, 2026
Three findings (1 MEDIUM, 2 LOW) on the memory-update commit; all
addressed.

- MEDIUM @ stale `tests/test_date_type_mapping.py` reference: this
  commit was *specifically* supposed to fix that reference but missed
  one instance (line 316). Fixed — and the same class of bug existed
  for three other filenames the reviewer didn't flag explicitly but
  which had the same root cause (me inventing filenames without
  verifying against `git log --diff-filter=A`). Corrected:
    test_file_hash_algorithms.py     -> test_file_hash_algorithm.py
    test_metadata_stage_file_hash.py -> test_metadata_hash_persistence.py
    test_download_result_rehydrate.py -> test_rehydrate_resource_identity.py
    test_date_type_mapping.py        -> test_date_without_timestamp.py
  Verified all four replacements resolve as real files in the tree.

- LOW @ inconsistent test-count math: the prior "+26 tests" claim
  enumerated files summing to 51, not 26. Rewrote the "Known result"
  section to spell out per-file gross counts (collected via
  `pytest --collect-only -q`: 21/4/10/3/4/5/4 = 51), then state the
  honest net delta (+26) and acknowledge the 25-test gap is from
  pre-existing tests removed/consolidated during the #307#314
  refactors. The math is now self-auditable and the framing no longer
  implies the listed adds sum to the delta.

- LOW @ forward-looking "Recognized in qsv ≥ 20.1" claim: tightened
  to "Empirically verified to be recognized as DateTime on qsv 20.1.0
  (macOS host check during #173 investigation)" and cross-referenced
  the test's skip guard for qsv ≥ 20.1, so the claim is grounded in
  a real observation and the next maintainer has a clean handoff
  path when the production qsv pin moves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Date without timestamp

2 participants