Historic date curation: 8,046 pre-1901 records need review #1

@turfptax

Summary

Analysis of all records with date_event before 1901 revealed ~800 records with date errors across 4 problem categories, embedded among ~7,200 legitimately ancient sighting records.

An analysis dataset has been extracted to temp/historic_pre1901.db with a date_analysis table pre-classified into categories for manual review.

Problem Categories

1. UFOCAT century-only 19// (692 records)

  • Raw date: 19// (YEAR=19, MO=empty, DAY=empty)
  • Parsed as: 0019-01-01 (zero-padded 2-digit year)
  • Actual meaning: "Sometime in the 1900s" — year is genuinely unknown, only century known
  • Evidence: Descriptions include modern events: "recalled abduction from orphanage as little girl", "Motion pictures", "radar confirmation". Cities: Sacramento, Miami, Chicago, Houston
  • Root cause: parse_ufocat_date() in import_ufocat.py line 39: f"{y:04d}" zero-pads 19 → 0019
  • Proposed fix: Set date_event = NULL (year genuinely unknown)
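
A minimal sketch of the proposed fix, for discussion only. It assumes parse_ufocat_date() receives the raw YEAR, MO and DAY fields as strings; the real signature in import_ufocat.py may differ. It also assumes that any 1- or 2-digit year with no month or day is a century marker, which has only been checked against the 19// case above.

```python
from typing import Optional


def parse_ufocat_date(year: str, month: str = "", day: str = "") -> Optional[str]:
    """Return an ISO date string, or None when the year is unknown.

    Hypothetical signature: the real function in import_ufocat.py may differ.
    """
    year, month, day = year.strip(), month.strip(), day.strip()
    if not year.isdigit():
        return None
    # Assumption: a 1- or 2-digit YEAR with empty MO and DAY is a century
    # marker ("19//" = sometime in the 1900s), not the year 19 AD.
    if len(year) <= 2 and not month and not day:
        return None
    y = int(year)
    # Missing or invalid month/day fall back to 01, matching the
    # "0019-01-01" output described above.
    m = int(month) if month.isdigit() and 1 <= int(month) <= 12 else 1
    d = int(day) if day.isdigit() and 1 <= int(day) <= 31 else 1
    return f"{y:04d}-{m:02d}-{d:02d}"
```

With this change parse_ufocat_date("19") returns None instead of "0019-01-01", while 3- and 4-digit years are formatted as before.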

2. UFOCAT 3-digit year ambiguity (88 records)

  • 3-digit raw years (034–999), mostly legitimately ancient
  • 2 mislabels flagged so far (one confirmed, one possible):
    • 195// → "H-BOMB TEST" (clearly 1950s, not 195 AD)
    • 188// with states CN, NZL, FRA, TUR and no descriptions (possibly 1880s)
  • Proposed fix: Manual classification in analysis DB, then targeted corrections
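
To help prioritise the manual pass, a small helper could flag 3-digit raw years that look like truncated decades. This is only a sketch: decade_candidate() is a hypothetical name, and the choice of prefixes 18, 19 and 20 is an assumption based on the two examples above. A hit is a prompt for review, not a correction, since 195 AD can also be a genuine date.

```python
from typing import Optional, Tuple


def decade_candidate(raw_year: str) -> Optional[Tuple[int, int]]:
    """Return the decade a 3-digit raw year might be a truncation of, if any."""
    raw_year = raw_year.strip()
    if len(raw_year) != 3 or not raw_year.isdigit():
        return None
    # Assumption: only prefixes 18, 19 and 20 are plausible truncations.
    if raw_year[:2] not in ("18", "19", "20"):
        return None
    start = int(raw_year) * 10
    return (start, start + 9)
```

For example, decade_candidate("195") gives (1950, 1959) and decade_candidate("034") gives None.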

3. UPDB mangled modern years (~20 records)

  • Raw JSON confirms broken years in upstream UPDB export
  • Pattern examples:
    • 0196 = 1962 (description says "June 22-23, 1962")
    • 0200 = ~2000 (Ellsworth AFB radar, modern descriptions)
    • 0191 = ~1991 (Boxford, "bright red object...upside down saucer")
    • 0100 = modern (Topanga Canyon, "triangle formation")
  • Proposed fix: Manual correction from description context where possible; NULL the rest
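
Where the description states the year outright, the correction can be suggested automatically. The sketch below is an assumption-laden helper, not existing code: year_from_description() is a hypothetical name, and the rule that the mangled digits must appear in order inside the real year is a guess that happens to fit the 0196, 0200 and 0191 examples. Anything without exactly one matching year is left for manual review.

```python
import re
from typing import Optional


def _is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def year_from_description(mangled: str, description: str) -> Optional[int]:
    """Suggest a corrected year from an explicit 19xx/20xx year in the text."""
    digits = mangled.lstrip("0")
    if not digits:
        return None
    candidates = {
        int(y)
        for y in re.findall(r"\b((?:19|20)\d{2})\b", description)
        if _is_subsequence(digits, y)
    }
    # Accept only an unambiguous match; otherwise defer to a human reviewer.
    return candidates.pop() if len(candidates) == 1 else None


suggested = year_from_description("0196", "June 22-23, 1962")  # 1962
```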

4. NUFORC data entry errors (~3 records)

  • 0205-01-05 = 2005 (description: "Dad seen outside window")
  • 1071-06-16 = ~2007 (description: "my friend spotted object in sky")
  • 1721-02-01 = ~2021 (description: "straight line of lights in sky")
  • Proposed fix: Manual correction from description context

Legitimately Ancient Records (~7,200)

These are correct and need no changes:

  • UFOCAT 4-digit years: 4,436 records (1001–1900)
  • UFO-search: 1,984 records (Geldreich Majestic Timeline, 61–1900 AD)
  • UPDB: ~760 records (1000–1900)
  • MUFON: 40 records (1890s)
  • NUFORC: ~23 records (historic sighting reports)
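
As a sanity check, the per-category counts can be tallied against the 8,046 in the title. The figures below are copied from this issue, with the approximate ("~") values taken at face value, so the exact match should be read with that in mind. The dictionary keys are labels made up for this sketch.

```python
# Counts copied from the issue text; "~" values are treated as exact.
problem_counts = {
    "ufocat_century_only": 692,
    "ufocat_3digit": 88,
    "updb_mangled": 20,
    "nuforc_entry_error": 3,
}
legit_counts = {
    "ufocat_4digit": 4436,
    "ufo_search": 1984,
    "updb": 760,
    "mufon": 40,
    "nuforc": 23,
}

problem_total = sum(problem_counts.values())  # 803
legit_total = sum(legit_counts.values())      # 7243
grand_total = problem_total + legit_total     # 8046
```

Note that the 88 three-digit UFOCAT records sit in the problem total even though most of them are expected to be legitimate.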

Analysis Tooling

  • extract_historic.py — Extracts pre-1901 records into temp/historic_pre1901.db
  • temp/historic_pre1901.db — Standalone SQLite DB with date_analysis table containing:
    • category — Auto-classified category
    • corrected_year — Manual override column (NULL = no correction needed)
    • notes — Reviewer notes
    • Views: v_category_summary, v_3digit_review, v_century_only, v_updb_review, v_timeline
  • temp/ANALYSIS.md — Detailed analysis report
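
For reviewers who have not opened the database yet, the layout can be pictured roughly as follows. Only category, corrected_year, notes and the view name v_category_summary come from this issue; the other columns, the category labels and the view definition are assumptions, and the sketch uses an in-memory database rather than the real file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real file is temp/historic_pre1901.db
conn.executescript("""
CREATE TABLE date_analysis (
    record_id      INTEGER PRIMARY KEY,  -- assumed key column
    source         TEXT,                 -- assumed
    date_event     TEXT,                 -- assumed: date as currently parsed
    category       TEXT,                 -- auto-classified category
    corrected_year INTEGER,              -- manual override, NULL = no correction
    notes          TEXT                  -- reviewer notes
);
-- Assumed definition; the real view may report different columns.
CREATE VIEW v_category_summary AS
    SELECT category,
           COUNT(*)              AS n_records,
           COUNT(corrected_year) AS n_corrected
    FROM date_analysis
    GROUP BY category;
""")

conn.executemany(
    "INSERT INTO date_analysis (source, date_event, category, corrected_year, notes) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("UFOCAT", "0019-01-01", "century_only", None, None),
        ("UPDB", "0196-01-01", "updb_mangled", 1962, "description gives the year"),
        ("NUFORC", "0205-01-05", "nuforc_entry_error", 2005, None),
    ],
)
summary = conn.execute(
    "SELECT * FROM v_category_summary ORDER BY category"
).fetchall()
```

Annotating a record then amounts to filling in corrected_year (and optionally notes) for that row.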

Next Steps

  • Annotate ~110 ambiguous records in the analysis DB (UFOCAT 3-digit + UPDB + NUFORC)
  • Decide on UFOCAT century-only handling (NULL vs. marker value)
  • Generate SQL fix statements from annotated analysis DB
  • Add tests for historic date fixes
  • Apply fixes to rebuild_db.py / import_ufocat.py
  • Update methodology documentation
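
One possible shape for the "generate SQL fix statements" step is sketched below. It assumes the analysis table has record_id and date_event columns alongside corrected_year, and that the main database stores records in a table named sighting keyed by id; both are placeholders to be replaced with the real schema. Century-only records are not handled here because the NULL vs. marker decision is still open.

```python
import sqlite3
from typing import Iterator


def fix_statements(analysis: sqlite3.Connection) -> Iterator[str]:
    """Yield one UPDATE per annotated row (corrected_year IS NOT NULL)."""
    rows = analysis.execute(
        "SELECT record_id, date_event, corrected_year FROM date_analysis "
        "WHERE corrected_year IS NOT NULL ORDER BY record_id"
    )
    for record_id, date_event, year in rows:
        # Keep the month and day; replace only the 4-digit year prefix.
        new_date = f"{int(year):04d}{date_event[4:]}"
        # 'sighting' and 'id' are hypothetical names for the main table.
        yield (
            f"UPDATE sighting SET date_event = '{new_date}' "
            f"WHERE id = {int(record_id)};"
        )


# Tiny in-memory demo; the record ids are made up.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE date_analysis "
    "(record_id INTEGER, date_event TEXT, corrected_year INTEGER)"
)
demo.execute("INSERT INTO date_analysis VALUES (1, '0205-01-05', 2005)")
demo.execute("INSERT INTO date_analysis VALUES (2, '1850-03-02', NULL)")
statements = list(fix_statements(demo))
```

Emitting the statements as text, rather than applying them directly, keeps the fixes reviewable before they touch rebuild_db.py or the main database.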
