# Notebook D v2: gold_person_scores

**Gamification layer** — thin presentation view over the existing priority pipeline.

### Design principle
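No scoring weights are hardcoded here; the only constants in this view are the display-level blend for the Overall score and the `story_ready` threshold. All scoring logic lives upstream: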
- `ref_signal_weights` → base scores per signal
- `ref_intent_category_weights` → category weights per intent
- `gold_person_integrity_priority` → already exposes `completeness_risk_score` and `evidence_fragility_score` as separate sub-scores
- `gold_person_narrative_priority` → `narrative_priority_score` (available once `fix_narrative_priority.ipynb` has run)

This view only does three things:
1. Reads the sub-scores from the upstream tables
2. Applies `PERCENT_RANK()` within each dimension and scales the 0–1 result to a 0–100 population-relative score
3. Joins in story status and proximity for the UI
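
As a minimal sketch of step 2, the query below applies the same inverted `PERCENT_RANK()` expression to five made-up raw scores. The `toy` CTE and its IDs `I1` to `I5` are illustrative assumptions, not rows from the real tables.

```sql
-- Toy data only: hypothetical people with evenly spaced raw risk scores.
WITH toy AS (
  SELECT *
  FROM (VALUES ('I1', 0), ('I2', 10), ('I3', 20), ('I4', 30), ('I5', 40))
    AS t(person_gedcom_id, completeness_raw)
)
SELECT
  person_gedcom_id,
  completeness_raw,
  -- PERCENT_RANK() = (rank - 1) / (rows - 1), so it runs from 0 to 1.
  -- Subtracting from 1 flips it: the lowest raw risk gets the highest score.
  ROUND(
    (1.0 - PERCENT_RANK() OVER (ORDER BY completeness_raw ASC)) * 100
  ) AS completeness_score
FROM toy
ORDER BY completeness_raw;
-- I1 (raw 0) scores 100, I2 75, I3 50, I4 25, I5 (raw 40) scores 0.
```

Because the score depends only on rank order, adding or removing people shifts everyone's score even when their own raw value is unchanged.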

### Prerequisites
Run `fix_narrative_priority.ipynb` before this notebook.

### Score methodology
| Score | Source column | Direction |
|---|---|---|
| Completeness | `gold_person_integrity_priority.completeness_risk_score` | Inverted (fewer problems = higher score) |
| Evidence | `gold_person_integrity_priority.evidence_fragility_score` | Inverted |
| Story Potential | `gold_person_narrative_priority.narrative_priority_score` | Direct (higher = better) |
| Overall | Weighted avg: Completeness 35% + Evidence 40% + Story 25% | — |
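
To make the Overall row concrete, here is a worked example with hypothetical dimension scores of 80 (Completeness), 60 (Evidence) and 40 (Story Potential). These values are assumed for illustration and are not taken from the data.

```sql
-- 80 * 0.35 + 60 * 0.40 + 40 * 0.25 = 28 + 24 + 10 = 62
SELECT ROUND(80 * 0.35 + 60 * 0.40 + 40 * 0.25) AS overall_score;
```

Evidence carries the largest weight, so a 10-point change in the Evidence score moves Overall by 4 points, compared with 3.5 for Completeness and 2.5 for Story Potential.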

**Safe to rerun** — `CREATE OR REPLACE VIEW`.

In [0]:
-- Cell 1: Create gold_person_scores view
-- =====================================================================
-- Reads sub-scores directly from upstream priority tables.
-- No weights hardcoded — all scoring logic stays in the priority pipeline.
-- PERCENT_RANK() produces population-relative 0-100 scores.

CREATE OR REPLACE VIEW genealogy.gold_person_scores AS

WITH

-- ----------------------------------------------------------------
-- Pull sub-scores from the integrity priority table
-- completeness_risk_score and evidence_fragility_score are already
-- split by category in gold_person_integrity_priority
-- ----------------------------------------------------------------
integrity AS (
  SELECT
    person_gedcom_id,
    COALESCE(completeness_risk_score, 0) AS completeness_raw,
    COALESCE(evidence_fragility_score, 0) AS evidence_raw
  FROM genealogy.gold_person_integrity_priority
),

-- ----------------------------------------------------------------
-- Pull narrative score — requires fix_narrative_priority to have run
-- ----------------------------------------------------------------
narrative AS (
  SELECT
    person_gedcom_id,
    COALESCE(narrative_priority_score, 0) AS narrative_raw
  FROM genealogy.gold_person_narrative_priority
),

-- ----------------------------------------------------------------
-- Percentile-rank all three dimensions across the full population.
-- Completeness and Evidence are INVERTED: higher raw score means
-- more problems, so we flip the percentile so the UI score reads
-- "higher = healthier".
-- Story Potential is DIRECT: higher raw = better story candidate.
-- ----------------------------------------------------------------
ranked AS (
  SELECT
    p.person_gedcom_id,
    i.completeness_raw,
    i.evidence_raw,
    n.narrative_raw,

    -- COALESCE in each window ORDER BY covers people with no upstream row
    -- (NULL after the LEFT JOIN), ranking them as raw 0 to match the CTE defaults.

    -- Completeness health score: invert percentile rank
    ROUND(
      (1.0 - PERCENT_RANK() OVER (ORDER BY COALESCE(i.completeness_raw, 0) ASC)) * 100
    ) AS completeness_score,

    -- Evidence health score: invert percentile rank
    ROUND(
      (1.0 - PERCENT_RANK() OVER (ORDER BY COALESCE(i.evidence_raw, 0) ASC)) * 100
    ) AS evidence_score,

    -- Story Potential: direct percentile rank
    ROUND(
      PERCENT_RANK() OVER (ORDER BY COALESCE(n.narrative_raw, 0) ASC) * 100
    ) AS story_potential_score

  FROM genealogy.gold_person_life p
  LEFT JOIN integrity i ON i.person_gedcom_id = p.person_gedcom_id
  LEFT JOIN narrative n ON n.person_gedcom_id = p.person_gedcom_id
),

-- ----------------------------------------------------------------
-- Story written status
-- ----------------------------------------------------------------
story_status AS (
  SELECT
    person_gedcom_id,
    story_written,
    story_title,
    story_doc_id
  FROM genealogy.silver_person_story_status
)

-- ----------------------------------------------------------------
-- Final view
-- ----------------------------------------------------------------
SELECT
  r.person_gedcom_id,
  p.given_name,
  p.first_name,
  p.surname,
  p.birth_year,
  p.death_year,
  b.branch,
  p.sex,

  -- Proximity from the signals table (already joined there)
  s.proximity         AS ancestral_proximity,
  s.is_direct_ancestor,

  -- Health scores (0-100, percentile-based, higher = healthier)
  r.completeness_score,
  r.evidence_score,
  r.story_potential_score,

  -- Overall: weighted average
  -- Weights here are display-only aggregation, not scoring weights.
  -- Tune these if you want the leaderboard to emphasise different dimensions.
  ROUND(
    r.completeness_score    * 0.35 +
    r.evidence_score        * 0.40 +
    r.story_potential_score * 0.25
  ) AS overall_score,

  -- Raw upstream scores (useful for debugging score movements)
  r.completeness_raw,
  r.evidence_raw,
  r.narrative_raw,

  -- Story tracking
  COALESCE(ss.story_written, FALSE) AS story_written,
  ss.story_title,
  ss.story_doc_id,

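  -- story_ready: high overall score and not yet written.
  -- Uses overall rather than story potential alone, because research needs
  -- to be relatively complete before a story is ready to write up.
  -- Threshold of 87 is empirical; run Cell 3c to calibrate.
  -- The weighted expression is repeated because not every SQL dialect lets a
  -- SELECT-list alias (overall_score) be referenced in the same SELECT.
  CASE
    WHEN ROUND(
           r.completeness_score    * 0.35 +
           r.evidence_score        * 0.40 +
           r.story_potential_score * 0.25
         ) >= 87
     AND NOT COALESCE(ss.story_written, FALSE)
    THEN TRUE
    ELSE FALSE
  END AS story_ready,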

  -- Human-readable proximity label for UI
  CASE
    WHEN s.proximity = 0             THEN 'Direct Ancestor'
    WHEN s.proximity = 1             THEN 'Close'
    WHEN s.proximity BETWEEN 2 AND 3 THEN 'Collateral'
    ELSE                                  'Distant'
  END AS proximity_label

FROM ranked r
JOIN  genealogy.gold_person_life              p   ON p.person_gedcom_id = r.person_gedcom_id
LEFT JOIN genealogy.gold_person_branch        b   ON b.person_gedcom_id = r.person_gedcom_id
LEFT JOIN genealogy.gold_research_person_signals s ON s.person_gedcom_id = r.person_gedcom_id
LEFT JOIN story_status                        ss  ON ss.person_gedcom_id = r.person_gedcom_id

In [0]:
-- Cell 2: Create gold_branch_scores rollup view
-- =====================================================================
-- Branch-level aggregation used by /scores API endpoint
-- for the branch health cards in the Scores tab UI.

CREATE OR REPLACE VIEW genealogy.gold_branch_scores AS

SELECT
  branch,
  COUNT(*)                                                        AS total_individuals,
  ROUND(AVG(completeness_score))                                  AS avg_completeness,
  ROUND(AVG(evidence_score))                                      AS avg_evidence,
  ROUND(AVG(story_potential_score))                               AS avg_story_potential,
  ROUND(AVG(overall_score))                                       AS avg_overall,
  SUM(CASE WHEN story_written            THEN 1 ELSE 0 END)       AS stories_written,
  SUM(CASE WHEN story_ready
            AND NOT story_written        THEN 1 ELSE 0 END)       AS stories_ready,
  COUNT(CASE WHEN ancestral_proximity = 0 THEN 1 END)             AS direct_ancestors,
  ROUND(AVG(CASE WHEN ancestral_proximity = 0
                 THEN overall_score END))                         AS ancestor_avg_overall,
  MIN(completeness_score)                                         AS min_completeness,
  MIN(evidence_score)                                             AS min_evidence
FROM genealogy.gold_person_scores
WHERE branch IS NOT NULL
GROUP BY branch
ORDER BY avg_overall DESC

In [0]:
-- Cell 3a: Verification — score distribution
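-- Expected: each percentile-ranked dimension is close to uniform when raw scores have few ties,
-- but overall_score is a weighted average of three such scores, so it tends to bunch toward the middle buckets.
-- A large spike in a single bucket suggests many tied or NULL raw scores in the upstream priority tables.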

SELECT
  FLOOR(overall_score / 10) * 10   AS score_bucket,
  COUNT(*)                          AS n,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM genealogy.gold_person_scores
GROUP BY 1
ORDER BY 1
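
In [0]:
-- Cell 3a (supplementary sketch): count people with no upstream score row
-- A hedged follow-up to the NULL note in Cell 3a, not part of the original checks.
-- Assumes a NULL raw column in the view means the person had no matching row
-- in the corresponding priority table (the LEFT JOINs in Cell 1 leave it NULL).

SELECT
  COUNT(*)                                          AS total_individuals,
  COUNT(*) FILTER (WHERE completeness_raw IS NULL)  AS missing_completeness,
  COUNT(*) FILTER (WHERE evidence_raw IS NULL)      AS missing_evidence,
  COUNT(*) FILTER (WHERE narrative_raw IS NULL)     AS missing_narrative
FROM genealogy.gold_person_scores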

In [0]:
-- Cell 3b: Verification — branch summary

SELECT * FROM genealogy.gold_branch_scores

In [0]:
-- Cell 3c: Calibrate story_ready threshold
-- Shows how many people qualify at different score thresholds.
-- If top band is too small (< ~10): lower threshold in Cell 1.
-- If top band is too large (> 50): raise threshold.

SELECT
  threshold,
  COUNT(CASE WHEN overall_score >= threshold
              AND NOT story_written THEN 1 END) AS story_ready_count
FROM genealogy.gold_person_scores
CROSS JOIN (VALUES (60), (65), (70), (75), (80), (85), (90)) AS t(threshold)
GROUP BY threshold
ORDER BY threshold

In [0]:
-- Cell 3d: Top 10 story-ready individuals not yet written

SELECT
  given_name,
  surname,
  birth_year,
  branch,
  proximity_label,
  completeness_score,
  evidence_score,
  story_potential_score,
  overall_score
FROM genealogy.gold_person_scores
WHERE story_ready = TRUE
ORDER BY story_potential_score DESC
LIMIT 10

In [0]:
-- Cell 3e: 50 Stories progress

SELECT
  COUNT(*) FILTER (WHERE story_written)                      AS written,
  COUNT(*) FILTER (WHERE story_ready AND NOT story_written)  AS ready_to_write,
  -- GREATEST keeps the remaining count at 0 once more than 50 stories are written
  GREATEST(50 - COUNT(*) FILTER (WHERE story_written), 0)    AS remaining_to_goal,
  COUNT(*)                                                    AS total_individuals
FROM genealogy.gold_person_scores