# AP Analysis Database Overview

In [84]:
from pathlib import Path
from IPython.display import display
import sqlite3
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)

In [85]:
DB_PATH = Path('../out/ap/analysis.db')
if not DB_PATH.exists():
    raise FileNotFoundError(f'Expected database at {DB_PATH.resolve()}')
conn = sqlite3.connect(DB_PATH)
conn.row_factory = sqlite3.Row

## Table Directory

In [86]:
tables = pd.read_sql("""
    SELECT name
    FROM sqlite_master
    WHERE type = 'table'
    ORDER BY name
""", conn)
display(tables)

# Count rows per table. Table names come from sqlite_master, and the
# double quotes guard against names that would otherwise need escaping.
row_counts = []
for table in tables['name']:
    count_sql = f'SELECT COUNT(*) AS rowcount FROM "{table}"'
    count = pd.read_sql(count_sql, conn)['rowcount'].iloc[0]
    row_counts.append({'table': table, 'rows': int(count)})
row_counts_df = (
    pd.DataFrame(row_counts)
    .sort_values('rows', ascending=False)
    .reset_index(drop=True)
)
display(row_counts_df)

Unnamed: 0,name
0,article_metrics
1,articles
2,entity_mentions
3,pair_anon_named_replacements
4,pair_claims
5,pair_frame_cues
6,pair_numeric_changes
7,pair_source_transitions
8,pair_sources_added
9,pair_sources_removed


Unnamed: 0,table,rows
0,entity_mentions,558182
1,versions,10639
2,version_metrics,9751
3,source_mentions,5043
4,pair_frame_cues,4983
5,pair_anon_named_replacements,4893
6,pair_source_transitions,4228
7,articles,2222
8,article_metrics,2069
9,pair_claims,2069
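
The row-count loop above wraps each table name in double quotes before interpolating it into SQL. A minimal standalone sketch of the same idea follows, using only the standard library and an in-memory toy database in place of `analysis.db`. The helper names `quote_identifier` and `table_row_counts` are hypothetical, and doubling embedded quotes is an extra precaution that the notebook itself does not take.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Quote an SQLite identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def table_row_counts(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Return (table, rows) pairs sorted by row count, largest first."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = []
    for name in names:
        sql = f"SELECT COUNT(*) FROM {quote_identifier(name)}"
        counts.append((name, conn.execute(sql).fetchone()[0]))
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Toy database; the second table name contains a quote on purpose.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (article_id INTEGER)")
demo.execute('CREATE TABLE "odd ""name" (x INTEGER)')
demo.executemany("INSERT INTO articles VALUES (?)", [(1,), (2,), (3,)])
demo.execute('INSERT INTO "odd ""name" VALUES (1)')

print(table_row_counts(demo))
```

Passing the notebook's own `conn` to `table_row_counts` should give the same numbers as `row_counts_df`, since the helper reads each row by position and so also works with `sqlite3.Row`.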


## Articles

This table holds the canonical identity for every story pulled from the upstream archive, giving analysts a stable anchor before they layer in prompt-derived annotations. The live-blog flag comes from [D4_live_blog_detect](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/D4_live_blog_detect.output.json#L1-L25), which screens for rolling updates.

**Fields**
- `article_id` — Unique integer assigned by the source archive to each story. It stays constant across all downstream tables so you can stitch the narrative together.
- `news_org` — Short code identifying the newsroom or wire service that produced the item. It keeps datasets clean when you mix multiple feeds in a single run.
- `url` — Canonical URL captured during ingestion. This is the pointer you can click when auditing the content in context.
- `title_first` — The earliest headline text seen for the story. It gives a baseline when you compare framing shifts.
- `title_final` — The last headline captured during the run. Studying the delta between first and final titles reveals editorial direction.
- `original_publication_time` — Timestamp when the pipeline first observed the story. It underpins cadence analyses.
- `total_edits` — Count of subsequent revisions after the seed version. Higher numbers warn that you should inspect version-level tables.
- `is_live_blog` — Boolean surfaced by the live-blog detector prompt, stored as 0 or 1. A value of 1 tells you the story bypassed deep LLM analysis to avoid blow-ups from rolling updates.

In [87]:
display(pd.read_sql("""
SELECT article_id, news_org, url, title_first, title_final, total_edits, is_live_blog
FROM articles
ORDER BY total_edits DESC
LIMIT 5
""", conn))

Unnamed: 0,article_id,news_org,url,title_first,title_final,total_edits,is_live_blog
0,17,ap,http://hosted.ap.org/dynamic/stories/U/US_POLI...,News from The Associated Press,News from The Associated Press,18,0
1,1020,ap,http://hosted.ap.org/dynamic/stories/U/US_FEDE...,News from The Associated Press,News from The Associated Press,18,0
2,1060,ap,http://hosted.ap.org/dynamic/stories/U/US_CONG...,News from The Associated Press,News from The Associated Press,18,0
3,1115,ap,http://hosted.ap.org/dynamic/stories/A/AS_AUST...,News from The Associated Press,News from The Associated Press,18,0
4,1836,ap,http://hosted.ap.org/dynamic/stories/U/US_DAMA...,News from The Associated Press,News from The Associated Press,18,1


In [88]:
display(pd.read_sql("""
SELECT COUNT(*) AS articles,
       SUM(is_live_blog) AS live_blog_articles,
       AVG(total_edits) AS avg_edits
FROM articles
""", conn))

Unnamed: 0,articles,live_blog_articles,avg_edits
0,2222,153,3.788
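
The numbers above line up with the row counts in the table directory: 2,222 articles minus 153 live blogs leaves 2,069, which is also the row count of `article_metrics`. That suggests, but does not prove, that live blogs are the articles without metrics. The sketch below shows one way to check this with a left join. It runs against a toy in-memory database, and the `PRIMARY KEY` declarations in the toy schema are assumptions.

```python
import sqlite3

# For each live-blog flag value, count articles and how many lack a metrics row.
COVERAGE_SQL = """
SELECT a.is_live_blog,
       COUNT(*) AS articles,
       SUM(m.article_id IS NULL) AS missing_metrics
FROM articles AS a
LEFT JOIN article_metrics AS m ON m.article_id = a.article_id
GROUP BY a.is_live_blog
ORDER BY a.is_live_blog
"""

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, is_live_blog INTEGER);
CREATE TABLE article_metrics (article_id INTEGER PRIMARY KEY);
INSERT INTO articles VALUES (1, 0), (2, 0), (3, 1);
INSERT INTO article_metrics VALUES (1), (2);
""")

coverage = demo.execute(COVERAGE_SQL).fetchall()
print(coverage)  # [(0, 2, 0), (1, 1, 1)] for the toy rows
```

On the real database the same string can be passed to `pd.read_sql(COVERAGE_SQL, conn)`. If the inference holds, `missing_metrics` would be zero for `is_live_blog = 0` and equal to `articles` for `is_live_blog = 1`.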


## Article Metrics

Article-level deltas summarise how the first and final versions diverged. They fuse framing differences from [B2_first_final_framing_compare](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/B2_first_final_framing_compare.output.json#L1-L50) with aggregates built from version-level metrics.

**Fields**
- `article_id` — Foreign key pointing back to the canonical story row. You use it to join the deltas with article metadata and version histories.
- `news_org` — Repeats the outlet code so you can facet metrics by publisher. It also guards against accidental cross-feed joins.
- `overstate_institutional_share` — Numeric delta showing whether institutional voices gain or lose share in the final story. Positive values flag a tilt toward official sources, while negative values capture a move toward non-institutional voices.
- `distinct_sources_delta` — Change in the count of unique canonical sources between first and final versions. It highlights whether the reporter broadened or narrowed the sourcing base over time.
- `anonymity_rate_delta` — Difference in the share of attributed words assigned to anonymous speakers. A positive jump signals heavier reliance on anonymity in the latest version.
- `hedge_density_delta` — Delta in hedge markers per 1,000 tokens between the first and last summaries. Large swings imply the tone got more cautious or more assertive.

In [89]:
display(pd.read_sql("""
SELECT article_id, news_org,
       overstate_institutional_share,
       distinct_sources_delta,
       anonymity_rate_delta,
       hedge_density_delta
FROM article_metrics
ORDER BY ABS(overstate_institutional_share) DESC
LIMIT 5
""", conn))

Unnamed: 0,article_id,news_org,overstate_institutional_share,distinct_sources_delta,anonymity_rate_delta,hedge_density_delta
0,2,ap,1.0,-3,0.0,-5.435
1,47,ap,-1.0,-3,0.0,-12.681
2,48,ap,1.0,0,0.0,-21.583
3,90,ap,1.0,3,-0.529,1.719
4,173,ap,-1.0,1,0.0,0.0


In [90]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       AVG(overstate_institutional_share) AS avg_bias_shift,
       AVG(distinct_sources_delta) AS avg_source_delta,
       AVG(hedge_density_delta) AS avg_hedge_delta
FROM article_metrics
""", conn))

Unnamed: 0,rows,avg_bias_shift,avg_source_delta,avg_hedge_delta
0,2069,-0.004,0.042,-0.019
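
The delta fields are described as final-minus-first comparisons. The pipeline code that computes them is not shown here, so the following is only a sketch of that convention: it picks the lowest and highest `version_num` per article and subtracts. The function name `first_final_delta` and the sample rows are hypothetical.

```python
def first_final_delta(rows):
    """Map article_id -> (value at final version) - (value at first version).

    rows: iterable of (article_id, version_num, value) tuples in any order.
    """
    first, final = {}, {}
    for article_id, version_num, value in rows:
        if article_id not in first or version_num < first[article_id][0]:
            first[article_id] = (version_num, value)
        if article_id not in final or version_num > final[article_id][0]:
            final[article_id] = (version_num, value)
    return {a: final[a][1] - first[a][1] for a in first}


# Hypothetical distinct_sources values per version.
demo_rows = [
    (1, 0, 5), (1, 1, 4), (1, 2, 2),   # article 1 narrows from 5 to 2 sources
    (2, 0, 1), (2, 3, 4),              # article 2 broadens from 1 to 4 sources
]
deltas = first_final_delta(demo_rows)
print(deltas)  # {1: -3, 2: 3}
```

With this sign convention a positive delta means the final version has more of the quantity than the first, which matches how the field descriptions read.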


## Versions

Each article revision captured from the source CMS gets a row here. These records come straight from the upstream database and ground every prompt output in time.

**Fields**
- `version_id` — Unique identifier for the specific revision. It connects the dots between raw CMS data and prompt results.
- `article_id` — Story identifier that groups every revision under a common parent. Use it to pivot from version timelines back to article baselines.
- `news_org` — Source outlet label repeated per revision. It makes cross-newsroom studies easier when you union datasets.
- `version_num` — Integer sequence ordering revisions for a given article. Analysts rely on it to figure out which version is “first” or “final.”
- `timestamp_utc` — Ingestion timestamp in UTC for when the revision arrived. It powers cadence and latency analyses.
- `title` — Headline text for that revision. Comparing these values helps spot headline edits even before diving into body text.
- `char_len` — Character length of the processed summary text. It acts as a rough proxy for story depth or trimming.

In [91]:
display(pd.read_sql("""
SELECT version_id, article_id, version_num, timestamp_utc, title, char_len
FROM versions
ORDER BY timestamp_utc DESC
LIMIT 5
""", conn))

Unnamed: 0,version_id,article_id,version_num,timestamp_utc,title,char_len
0,55675,5026,1,2017-11-09 11:41:06.009530,News from The Associated Press,4564
1,55657,1036,4,2017-11-09 09:36:46.452150,News from The Associated Press,4132
2,55656,1129,6,2017-11-09 09:34:09.854435,News from The Associated Press,889
3,55651,4095,17,2017-11-09 08:37:06.177899,News from The Associated Press,3836
4,55649,5033,13,2017-11-09 08:26:12.630042,News from The Associated Press,3762


In [92]:
display(pd.read_sql("""
SELECT COUNT(*) AS versions,
       COUNT(DISTINCT article_id) AS articles_with_versions,
       AVG(char_len) AS avg_char_len
FROM versions
""", conn))

Unnamed: 0,versions,articles_with_versions,avg_char_len
0,10639,2222,3459.972
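
As an illustration of the cadence analyses that `timestamp_utc` supports, the sketch below computes the gap in minutes between consecutive revisions of each article. It assumes the timestamp strings follow the `YYYY-MM-DD HH:MM:SS.ffffff` layout seen in the output above. The function name and sample rows are hypothetical.

```python
from datetime import datetime
from itertools import groupby


def minutes_between_versions(rows):
    """Return (article_id, from_num, to_num, minutes) for consecutive versions.

    rows: iterable of (article_id, version_num, timestamp_utc) tuples.
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    gaps = []
    for article_id, group in groupby(ordered, key=lambda r: r[0]):
        versions = list(group)
        for prev, curr in zip(versions, versions[1:]):
            delta = datetime.fromisoformat(curr[2]) - datetime.fromisoformat(prev[2])
            gaps.append((article_id, prev[1], curr[1], delta.total_seconds() / 60))
    return gaps


demo_rows = [
    (1, 0, "2017-11-09 08:00:00.000000"),
    (1, 2, "2017-11-09 10:00:00.000000"),
    (1, 1, "2017-11-09 08:30:00.000000"),
    (2, 0, "2017-11-09 09:00:00.000000"),  # single version, so no gap
]
gaps = minutes_between_versions(demo_rows)
print(gaps)  # [(1, 0, 1, 30.0), (1, 1, 2, 90.0)]
```

Rows for the real table could come from `conn.execute("SELECT article_id, version_num, timestamp_utc FROM versions")`.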


## Version Metrics

Per-version quantitative features translate detailed prompt annotations into numbers you can trend. They synthesise sourcing outputs from [A1_source_mentions](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A1_source_mentions.output.json#L1-L123), hedge context from [A2_hedge_window](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A2_hedge_window.output.json#L1-L59), and narrative roles from [N1_narrative_keywords](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/N1_narrative_keywords.output.json#L1-L56).

**Fields**
- `version_id` — Primary key that matches the revision row you are summarising. It is the join handle to the versions table and the prompt caches.
- `article_id` — Story identifier copied in so you can aggregate metrics without rejoining the versions table. It keeps per-version statistics scoped to the original article.
- `news_org` — Outlet label to support newsroom-level slicing. It lets you compare metrics across organisations quickly.
- `distinct_sources` — Count of canonical sources quoted in that revision. Higher values indicate broader sourcing at that point in time.
- `institutional_share_words` — Share of attributed words belonging to institutional categories such as government or corporate speakers. It flags when official voices dominate the narrative.
- `anonymous_source_share_words` — Fraction of attributed words tied to anonymous sources. Analysts track this number to understand transparency shifts.
- `hedge_density_per_1k` — Hedge markers per 1,000 summary tokens derived from the hedge window prompt. Spikes suggest the revision became more cautious or uncertain.

In [93]:
display(pd.read_sql("""
SELECT version_id,
       distinct_sources,
       institutional_share_words,
       anonymous_source_share_words,
       hedge_density_per_1k
FROM version_metrics
ORDER BY hedge_density_per_1k DESC
LIMIT 5
""", conn))

Unnamed: 0,version_id,distinct_sources,institutional_share_words,anonymous_source_share_words,hedge_density_per_1k
0,53399,1,1.0,0.0,48.951
1,355,5,0.521,0.167,40.0
2,48428,5,0.657,0.143,39.773
3,9934,6,0.531,0.354,38.462
4,543,5,0.048,0.422,36.81


In [94]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       AVG(distinct_sources) AS avg_distinct_sources,
       AVG(hedge_density_per_1k) AS avg_hedge_density
FROM version_metrics
""", conn))

Unnamed: 0,rows,avg_distinct_sources,avg_hedge_density
0,9751,0.388,0.436
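
The `hedge_density_per_1k` description implies a simple rate: hedge markers divided by summary tokens, scaled to 1,000. The sketch below spells that out. The exact tokeniser and the handling of empty texts in the pipeline are not documented here, so returning 0.0 for an empty text is an assumption, as are the example counts.

```python
def hedge_density_per_1k(hedge_markers: int, tokens: int) -> float:
    """Hedge markers per 1,000 tokens; 0.0 when there are no tokens (assumed)."""
    if tokens <= 0:
        return 0.0
    return 1000.0 * hedge_markers / tokens


# Hypothetical counts: 7 hedge markers in a 143-token summary.
print(round(hedge_density_per_1k(7, 143), 3))  # 48.951
```

Because the denominator is the text length, very short summaries can produce large densities from only a handful of markers, which is worth keeping in mind when reading the top rows of the ranking above.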


## Source Mentions

Every LLM-identified quote or paraphrase lands here with rich contextual metadata. The rows blend extraction from [A1_source_mentions](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A1_source_mentions.output.json#L1-L123), hedge cues from [A2_hedge_window](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A2_hedge_window.output.json#L1-L59), and narrative roles from [N1_narrative_keywords](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/N1_narrative_keywords.output.json#L1-L56).

**Fields**
- `version_id` — Identifier of the revision that produced the mention. It lets you align quotes with their exact snapshot in time.
- `article_id` — Story key repeated for convenience. It ensures you can aggregate mentions without bouncing through other tables.
- `news_org` — Outlet code tied to the version. It helps compare sourcing behaviour across newsrooms.
- `source_id_within_article` — Stable ID assigned by the pipeline to a canonical source within the article. It makes it easy to trace a source across multiple mentions.
- `source_canonical` — Normalised name chosen by the canonicaliser. Use it when grouping mentions irrespective of surface spelling.
- `source_surface` — Exact surface form the prompt observed in context. Analysts need it to compare canonical names with the words on the page.
- `source_type` — Coarse taxonomy such as “government” or “individual.” The category indicates which sector the speaker represents.
- `speech_style` — Indicator of whether the quote is direct, indirect, or mixed. It highlights how explicitly the article presents the source’s words.
- `attribution_verb` — Verb the prompt associates with the quote (e.g., “said,” “claimed”). Tracking verbs showcases tonal differences in attribution.
- `char_start` — Character offset where the mention begins in the summary text. It assists with downstream highlighting.
- `char_end` — Character offset where the mention ends. Combined with the start value it defines the span for text alignment.
- `sentence_index` — Index of the sentence containing the mention. It aids narrative sequencing and text extraction.
- `paragraph_index` — Index of the paragraph containing the mention. Analysts often filter for lede versus body mentions using this field.
- `is_in_title` — Boolean marking whether the source appears in the headline. Title mentions generally signal high prominence.
- `is_in_lede` — Boolean capturing whether the source sits in the first paragraph. Presence here indicates early emphasis.
- `attributed_text` — Exact text snippet attributed to the source. Use it for qualitative review or quoting back in reports.
- `is_anonymous` — Flag showing whether the speaker was unnamed. It enables anonymity-rate calculations.
- `anonymous_description` — Description supplied when the speaker is anonymous (e.g., “police official”). This text reveals how the newsroom framed the unnamed source.
- `anonymous_domain` — Category for the anonymous party such as government or corporate. It helps assess where anonymous sourcing concentrates.
- `evidence_type` — Prompt-labelled evidence category like “statistic” or “eyewitness.” It tells you how the source is supporting the narrative.
- `evidence_text` — Any additional supporting text the prompt captured. Analysts can mine it for fact-checking leads.
- `narrative_function` — Role assigned by the narrative keywords prompt (e.g., “central protagonist”). It describes why the source is present in the story arc.
- `centrality` — Prompt judgement of the source’s prominence (High, Medium, Low). It is useful when weighting voices in analyses.
- `perspective` — JSON list of stance descriptors such as “Supportive” or “Skeptical.” You can parse it to study viewpoint diversity.
- `doubted` — Binary mark derived from hedge analysis indicating skepticism toward the source. A value of 1 warns the statement was doubted in context.
- `hedge_count` — Number of hedge markers found around the quote. Analysts track it to see how cautious the writing was.
- `hedge_markers` — JSON array of the specific hedging phrases (e.g., “reportedly”). It provides text evidence for the hedge_count.
- `epistemic_verbs` — JSON array of epistemic verbs like “believe” or “estimate.” You can inspect it to understand tonal signalling.
- `hedge_stance` — Prompt label summarising the stance toward the source (supportive, neutral, skeptical, unclear). It offers an at-a-glance read on tone.
- `hedge_confidence` — Confidence score (1–5) from the hedge prompt. Higher scores mean the model is surer about the stance assessment.
- `prominence_lead_pct` — Normalised position of the mention within the summary text. Lower numbers mean the source appears earlier.
- `confidence` — Overall extraction confidence from the primary source mention prompt. Analysts can filter low-confidence rows when auditing.

In [95]:
display(pd.read_sql("""
SELECT version_id,
       source_canonical,
       source_surface,
       source_type,
       narrative_function,
       perspective,
       hedge_count,
       hedge_markers,
       hedge_stance,
       confidence
FROM source_mentions
LIMIT 5
""", conn))

Unnamed: 0,version_id,source_canonical,source_surface,source_type,narrative_function,perspective,hedge_count,hedge_markers,hedge_stance,confidence
0,1,Barack Obama,President Barack Obama,government,"""Key Actor"": The source is the main actor in t...","[\n ""Authoritative""\n]",0,[],neutral,5.0
1,1,Neil Eggleston,Neil Eggleston,government,"""Expert Context"": This source provides context...","[\n ""Informative"",\n ""Supportive""\n]",0,[],neutral,5.0
2,1,Chase Strangio,Chase Strangio,civil_society,"""Advocate"": The source is an advocate for the ...","[\n ""Supportive""\n]",0,[],supportive,5.0
3,1,Paul Ryan,House Speaker Paul Ryan,government,"""Counterpoint"": This source provides an opposi...","[\n ""Against"",\n ""Skeptical""\n]",0,[],skeptical,5.0
4,1,Josh Earnest,Josh Earnest,government,"""Spokesperson"": The source serves as a spokesp...","[\n ""Informative"",\n ""Neutral""\n]",0,[],neutral,5.0


In [96]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       SUM(is_anonymous) AS anonymous_mentions,
       AVG(confidence) AS avg_confidence,
       AVG(hedge_count) AS avg_hedge_count,
       SUM(doubted) AS skeptical_mentions
FROM source_mentions
""", conn))

Unnamed: 0,rows,anonymous_mentions,avg_confidence,avg_hedge_count,skeptical_mentions
0,5043,466,4.732,0.398,268
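
Several columns here (`perspective`, `hedge_markers`, `epistemic_verbs`) hold JSON arrays serialised as text, so they need parsing before you can count labels. A minimal sketch follows. The helper name `count_json_labels` is hypothetical, and silently skipping NULL or malformed cells is a design choice made for this example only.

```python
import json
from collections import Counter


def count_json_labels(cells):
    """Tally labels across JSON-array strings, skipping NULL or malformed cells."""
    tally = Counter()
    for cell in cells:
        if not cell:
            continue
        try:
            labels = json.loads(cell)
        except json.JSONDecodeError:
            continue
        if isinstance(labels, list):
            tally.update(str(label) for label in labels)
    return tally


# Cell values shaped like the `perspective` column shown above.
perspective_cells = [
    '[\n "Authoritative"\n]',
    '[\n "Informative",\n "Supportive"\n]',
    '[\n "Supportive"\n]',
    None,
    "not json",
]
tally = count_json_labels(perspective_cells)
print(tally.most_common(1))  # [('Supportive', 2)]
```

The doubled quotes visible in the output above come from how the DataFrame was exported; the stored cell is ordinary JSON, which is what `json.loads` expects.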


## Sources Aggregate

This roll-up compresses mention-level data into per-source summaries across an article. It combines signals from [A1_source_mentions](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A1_source_mentions.output.json#L1-L123), [A2_hedge_window](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A2_hedge_window.output.json#L1-L59), and [N1_narrative_keywords](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/N1_narrative_keywords.output.json#L1-L56) while tracking retention over time.

**Fields**
- `article_id` — Identifier tying the roll-up to the parent story. It ensures aggregates never blend across articles.
- `news_org` — Outlet label so you can examine retention by newsroom. It mirrors the structure of other tables for consistent joins.
- `source_id_within_article` — Stable ID representing the canonical source. Use it to trace the same actor through versions.
- `source_canonical` — Final canonical name chosen for the source. Analysts rely on it when comparing to external knowledge bases.
- `source_type` — High-level category of the source. Changes in this field show shifts in the mix of voices over time.
- `first_seen_version` — Version number where the source first appeared. It marks the debut of the voice within the story.
- `first_seen_time` — Timestamp tied to the first sighting. It lets you align sourcing changes with newsroom timelines.
- `last_seen_version` — Latest version number containing the source. Use it to detect disappearances or drop-offs.
- `last_seen_time` — Timestamp of the last sighting. It pairs with first_seen_time to show an appearance window.
- `num_mentions_total` — Total mentions (quotes or paraphrases) attributed to the source. Higher numbers signal stronger presence in the narrative.
- `num_versions_present` — Count of revisions where the source appears. Analysts link it to voice retention.
- `total_attributed_words` — Word count attributed to the source across the article. It gives a sense of how much narrative real estate they occupy.
- `voice_retention_index` — Ratio indicating how consistently the source remains present across consecutive versions. Values near 1.0 reflect durable sourcing.
- `mean_prominence` — Average of the prominence percentile across mentions. Lower numbers mean the source tends to show up earlier.
- `lead_appearance_count` — Number of lede paragraph appearances. It signals top-of-story emphasis.
- `title_appearance_count` — Number of headline mentions. Headline presence indicates extremely high importance.
- `doubted_any` — Boolean flag if any mention for the source was doubted. It summarises tone across the article.
- `deemphasized_any` — Boolean showing whether the source moved from prominent positions to later paragraphs. Analysts watch it for downgrading of voices.
- `disappeared_any` — Boolean capturing whether the source vanished before the final version. It highlights potentially dropped perspectives.

In [97]:
display(pd.read_sql("""
SELECT article_id,
       source_canonical,
       num_mentions_total,
       num_versions_present,
       voice_retention_index,
       lead_appearance_count,
       title_appearance_count
FROM sources_agg
ORDER BY num_mentions_total DESC
LIMIT 5
""", conn))

Unnamed: 0,article_id,source_canonical,num_mentions_total,num_versions_present,voice_retention_index,lead_appearance_count,title_appearance_count
0,25,Aldo Fasci,40,9,1.0,0,0
1,47,Barack Obama,35,10,0.889,4,0
2,114,Donald Trump,33,11,0.9,4,0
3,122,Rick Perry,31,5,1.0,0,0
4,330,Donald Trump,30,12,0.636,5,0


In [98]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       AVG(num_versions_present) AS avg_versions_present,
       AVG(voice_retention_index) AS avg_voice_retention,
       SUM(disappeared_any) AS sources_disappeared
FROM sources_agg
""", conn))

Unnamed: 0,rows,avg_versions_present,avg_voice_retention,sources_disappeared
0,1934,1.954,0.414,1166
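
The exact formula behind `voice_retention_index` is not given in this notebook. One plausible reading, offered purely as an assumption, is: of all consecutive version pairs whose earlier version contains the source, the fraction whose later version still contains it. The sketch below implements that reading. The function name and inputs are hypothetical.

```python
def voice_retention(all_versions, present_versions):
    """Share of consecutive version pairs in which a present source is kept.

    all_versions: version numbers of the article.
    present_versions: version numbers in which the source appears.
    Returns None when the source never appears before the final version.
    """
    ordered = sorted(all_versions)
    present = set(present_versions)
    kept = chances = 0
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev in present:
            chances += 1
            kept += nxt in present
    return kept / chances if chances else None


# Hypothetical article with 11 versions; the source is missing only in version 5.
present = [v for v in range(11) if v != 5]
print(voice_retention(range(11), present))
```

Under this reading a source that appears once and is dropped immediately scores 0.0, and a source that never drops out scores 1.0. Before relying on it, compare the sketch against the pipeline's own definition.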


## Entity Mentions

Named entities detected via spaCy live here as a contrast to the prompt-based source annotations. They offer a lightweight way to compare quoted actors with all proper nouns in the text.

**Fields**
- `version_id` — Revision identifier that produced the entity extraction. It lets you align entities with specific updates.
- `article_id` — Story identifier so you can group entities per article. Analysts often pivot on this field for coverage counts.
- `news_org` — Outlet code repeated to enable cross-newsroom comparisons. It keeps the table consistent with the rest of the schema.
- `entity_id_within_article` — Internal ID assigned to each unique normalised entity per article. It lets you follow the same entity across revisions.
- `entity_type` — spaCy label such as PERSON or ORG. It explains the general category of the entity.
- `canonical_name` — Lowercased normalised surface form. You can use it to match entities with external datasets.
- `char_start` — Starting character offset of the entity mention. It enables text alignment or highlighting in downstream tools.
- `char_end` — Ending character offset of the entity mention. Together with start it defines the span.
- `sentence_index` — Sentence index where the entity appears. Sequence numbers help you reconstruct context windows.
- `paragraph_index` — Paragraph index of the entity. Analysts use it to separate lede versus body mentions.

In [99]:
display(pd.read_sql("""
SELECT version_id,
       canonical_name,
       entity_type,
       char_start,
       char_end
FROM entity_mentions
LIMIT 5
""", conn))

Unnamed: 0,version_id,canonical_name,entity_type,char_start,char_end
0,1,washington,ORG,12,30
1,1,ap,ORG,31,33
2,1,barack obama,PERSON,97,109
3,1,his final days,DATE,143,157
4,1,chelsea manning s,PERSON,220,237


In [100]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       COUNT(DISTINCT canonical_name) AS distinct_entities,
       AVG(char_end - char_start) AS avg_span_length
FROM entity_mentions
""", conn))

Unnamed: 0,rows,distinct_entities,avg_span_length
0,558182,43838,9.105
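
To illustrate the comparison between quoted actors and all named entities, the sketch below lists `PERSON` entities that never show up as a canonical source in the same article. It uses a toy in-memory database. Matching on an exact lowercased name is an assumption and will miss variants such as the `chelsea manning s` form visible above.

```python
import sqlite3

# PERSON entities with no matching source_canonical in the same article.
UNQUOTED_PEOPLE_SQL = """
SELECT e.article_id, e.canonical_name, COUNT(*) AS entity_mentions
FROM entity_mentions AS e
WHERE e.entity_type = 'PERSON'
  AND NOT EXISTS (
      SELECT 1
      FROM source_mentions AS s
      WHERE s.article_id = e.article_id
        AND LOWER(s.source_canonical) = e.canonical_name
  )
GROUP BY e.article_id, e.canonical_name
ORDER BY entity_mentions DESC, e.canonical_name
"""

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE entity_mentions (article_id INTEGER, entity_type TEXT, canonical_name TEXT);
CREATE TABLE source_mentions (article_id INTEGER, source_canonical TEXT);
INSERT INTO entity_mentions VALUES
  (1, 'PERSON', 'barack obama'),
  (1, 'PERSON', 'chelsea manning'),
  (1, 'PERSON', 'chelsea manning'),
  (1, 'ORG', 'ap');
INSERT INTO source_mentions VALUES (1, 'Barack Obama');
""")

unquoted = demo.execute(UNQUOTED_PEOPLE_SQL).fetchall()
print(unquoted)  # [(1, 'chelsea manning', 2)]
```

SQLite's built-in `LOWER` only folds ASCII letters, so names with accented characters may need normalising in Python instead.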


## Version Pairs

Pairwise comparisons summarise how the first and final versions differ across sourcing, framing, and movement. The rows weave together motion insights from [P10_movement_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/P10_movement_pair.output.json#L1-L50), edit classifications from [A3_edit_type_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A3_edit_type_pair.output.json#L1-L84), and angle diagnostics from [D5_angle_change_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/D5_angle_change_pair.output.json#L1-L90).

**Fields**
- `article_id` — Story identifier tying the pair back to the original article. It keeps the comparison scoped correctly.
- `news_org` — Outlet label for the story. Use it to group pair analytics by newsroom.
- `from_version_id` — Identifier for the earlier version in the pair. Analysts usually treat this as the baseline snapshot.
- `to_version_id` — Identifier for the later version in the pair. It represents the post-edit state you are evaluating.
- `from_version_num` — Numeric order of the baseline version. It provides redundant clarity when debugging.
- `to_version_num` — Numeric order of the comparison partner. It verifies you are comparing the intended revisions.
- `delta_minutes` — Time gap between the two versions in minutes. Larger gaps hint that significant updates may have been published.
- `tokens_added` — Count of tokens appearing in the later version but not the earlier one. It measures the size of new material.
- `tokens_deleted` — Count of tokens removed in the later version. It shows how much text was trimmed or rewritten.
- `percent_text_new` — Share of the later summary that is fresh relative to the earlier text. High percentages mean the story dramatically changed.
- `movement_upweighted_summary` — Free-text summary of elements that gained emphasis per the movement prompt. It contextualises what readers would notice more.
- `movement_downweighted_summary` — Text describing what lost emphasis. Together with the upweighted summary it outlines narrative shifts.
- `movement_notes` — Additional bullet-style notes from the movement prompt. Analysts read these for nuance beyond the summaries.
- `movement_confidence` — Confidence score provided by the movement prompt. High values signal reliable qualitative insight.
- `movement_notable_shifts` — JSON string of individual shift snippets with direction labels. You can parse it to list the concrete changes.
- `edit_type` — Category supplied by the edit-type prompt (e.g., new_source_added). It offers a quick classification of the revision.
- `edit_summary` — Narrative blurb explaining the edit. It is helpful when writing up change logs.
- `edit_confidence` — Confidence score for the edit classification. Analysts can down-weight low-confidence rows when aggregating.
- `angle_changed` — Binary indicator from the angle prompt noting whether the story’s angle moved. A value of 1 calls for deeper inspection.
- `angle_change_category` — Detailed category describing the angle shift. It helps researchers bucket changes.
- `angle_summary` — Text summary of the angle change. Teams use it as a human-readable explainer.
- `title_alignment_notes` — Commentary on how the headline aligns with the body. Useful when checking headline-body coherence.
- `angle_confidence` — Confidence score from the angle prompt. High scores imply trustworthy diagnoses.
- `angle_evidence` — JSON array of evidence snippets the model cited. It provides traceability for the angle judgement.
- `title_jaccard_prev` — Jaccard similarity between the earlier headline and lede. It describes alignment before the edit.
- `title_jaccard_curr` — Jaccard similarity for the later headline and lede. Comparing to the previous value shows headline-body drift.
- `summary_jaccard` — Jaccard similarity between the two summaries. Lower numbers indicate substantial rewriting.

In [101]:
display(pd.read_sql("""
SELECT article_id,
       from_version_id,
       to_version_id,
       movement_confidence,
       edit_type,
       angle_change_category,
       angle_confidence,
       summary_jaccard
FROM version_pairs
LIMIT 5
""", conn))

Unnamed: 0,article_id,from_version_id,to_version_id,movement_confidence,edit_type,angle_change_category,angle_confidence,summary_jaccard
0,1,1,112,1,content_update,no_change,5,0.863
1,2,2,653,1,content_update,no_change,5,0.138
2,5,5,53399,1,content_update,no_change,5,0.083
3,10,10,493,1,content_update,no_change,5,0.254
4,11,12,118,1,content_update,no_change,5,0.238


In [102]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       AVG(tokens_added) AS avg_tokens_added,
       AVG(summary_jaccard) AS avg_summary_jaccard,
       AVG(angle_confidence) AS avg_angle_confidence
FROM version_pairs
""", conn))

Unnamed: 0,rows,avg_tokens_added,avg_summary_jaccard,avg_angle_confidence
0,2069,265.132,0.387,4.691
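
The three Jaccard fields measure set overlap: the size of the intersection of two token sets divided by the size of their union. The sketch below shows the calculation. The tokenisation (lowercasing and splitting on word characters) is an assumption, since the pipeline's tokeniser is not shown here, and the example strings are made up.

```python
import re


def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two texts' lowercase word sets."""
    tokens_a = set(re.findall(r"\w+", text_a.lower()))
    tokens_b = set(re.findall(r"\w+", text_b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical (assumed)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


earlier = "Obama commutes sentence"
later = "Obama commutes Manning sentence"
print(jaccard(earlier, later))  # 0.75: three shared words out of four distinct
```

A value near 1.0 means the two texts share almost all of their vocabulary, while values such as the 0.083 shown above indicate an almost complete rewrite.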


## Pair Sources Added

This table lists sources introduced between the paired versions. Entries come directly from the add/remove logic inside [A3_edit_type_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A3_edit_type_pair.output.json#L1-L84).

**Fields**
- `article_id` — Story identifier shared with the pair row. It ensures the source addition is attributed to the correct article.
- `news_org` — Outlet label associated with the article. It supports newsroom-level analysis of sourcing changes.
- `from_version_id` — Earlier version identifier used as the baseline. It clarifies where the comparison started.
- `to_version_id` — Later version identifier showing where the source appears. Analysts use it to drill into the updated text.
- `surface` — Surface form for the newly introduced source. It records the exact wording the prompt saw.
- `canonical` — Normalised name for the source. Use it for aggregation across articles or to match the mentions table.
- `type` — Source category assigned by the prompt. It reveals what kind of voice entered the story.

In [103]:
display(pd.read_sql("""
SELECT article_id,
       from_version_id,
       to_version_id,
       surface,
       canonical,
       type
FROM pair_sources_added
LIMIT 5
""", conn))

Unnamed: 0,article_id,from_version_id,to_version_id,surface,canonical,type
0,1,1,112,Sean Spicer,Sean Spicer,government
1,2,2,653,Donald Trump,Donald Trump,government
2,10,10,493,Jane Forbes Clark,Jane Forbes Clark,individual
3,10,10,493,Jack O'Connell,Jack O'Connell,individual
4,11,12,118,Deke Arndt,Deke Arndt,government


In [104]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       COUNT(DISTINCT canonical) AS distinct_sources_added,
       COUNT(*) FILTER (WHERE type = 'government') AS government_sources_added
FROM pair_sources_added
""", conn))

Unnamed: 0,rows,distinct_sources_added,government_sources_added
0,1021,919,392
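
The `COUNT(*) FILTER (WHERE ...)` form used above depends on a fairly recent SQLite build (aggregate `FILTER` arrived around version 3.30, as far as I am aware). If the query fails on an older runtime, a conditional sum is a portable alternative. The sketch below runs it on a toy table with made-up rows.

```python
import sqlite3

# Same summary as above, using SUM(CASE ...) in place of FILTER.
PORTABLE_SQL = """
SELECT COUNT(*) AS rows,
       COUNT(DISTINCT canonical) AS distinct_sources_added,
       SUM(CASE WHEN type = 'government' THEN 1 ELSE 0 END) AS government_sources_added
FROM pair_sources_added
"""

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE pair_sources_added (canonical TEXT, type TEXT);
INSERT INTO pair_sources_added VALUES
  ('Sean Spicer', 'government'),
  ('Donald Trump', 'government'),
  ('Jane Forbes Clark', 'individual'),
  ('Donald Trump', 'government');
""")

summary = demo.execute(PORTABLE_SQL).fetchone()
print(sqlite3.sqlite_version, summary)
```

One behavioural difference: on an empty table `SUM(CASE ...)` returns NULL, whereas `COUNT(*) FILTER` returns 0, so wrap the sum in `COALESCE(..., 0)` if that case matters.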


## Pair Sources Removed

This table mirrors the additions table but captures sources that disappear between versions. The records also come from [A3_edit_type_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/A3_edit_type_pair.output.json#L1-L84) so you can track attrition.

**Fields**
- `article_id` — Story identifier for the removal event. It confirms which article lost the voice.
- `news_org` — Outlet label tied to the article. It lets you compare drop-offs by newsroom.
- `from_version_id` — Earlier version identifier where the source was still present. It sets the baseline for the removal.
- `to_version_id` — Later version identifier after the source vanished. Analysts inspect this version to confirm the drop.
- `surface` — Surface form of the removed source. It documents how the source appeared before removal.
- `canonical` — Canonical name of the removed source. Grouping on this field reveals who got cut most often.
- `type` — Source category for the removed voice. It shows whether entire sectors are being pruned.

In [105]:
display(pd.read_sql("""
SELECT article_id,
       from_version_id,
       to_version_id,
       surface,
       canonical,
       type
FROM pair_sources_removed
LIMIT 5
""", conn))

Unnamed: 0,article_id,from_version_id,to_version_id,surface,canonical,type
0,2,2,653,Kathleen Hall Jamieson,Kathleen Hall Jamieson,individual
1,2,2,653,Michael Gerson,Michael Gerson,individual
2,2,2,653,Wayne Fields,Wayne Fields,individual
3,5,5,53399,Rep. Tom Price,Tom Price,government
4,5,5,53399,Sen. Patty Murray,Patty Murray,government


In [106]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       COUNT(DISTINCT canonical) AS distinct_sources_removed,
       COUNT(*) FILTER (WHERE type = 'individual') AS individuals_removed
FROM pair_sources_removed
""", conn))

Unnamed: 0,rows,distinct_sources_removed,individuals_removed
0,408,364,125
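
Because the removals table mirrors the additions table, the two can be stacked to see which source types gain or lose voices on balance. The block below is a minimal sketch, not the pipeline's own method: `demo` is a hypothetical in-memory stand-in for `analysis.db`, seeded only with the `LIMIT 5` sample rows displayed above, and `NET_CHURN_SQL` is an illustrative name. Against the real database, the same SQL can be passed to `pd.read_sql` with the notebook's `conn`.

```python
import sqlite3

# Hypothetical stand-in for analysis.db so the sketch runs on its own.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE pair_sources_added (canonical TEXT, type TEXT)")
demo.execute("CREATE TABLE pair_sources_removed (canonical TEXT, type TEXT)")

# Rows copied from the LIMIT 5 samples shown above (not the full tables).
added = [
    ("Sean Spicer", "government"),
    ("Donald Trump", "government"),
    ("Jane Forbes Clark", "individual"),
    ("Jack O'Connell", "individual"),
    ("Deke Arndt", "government"),
]
removed = [
    ("Kathleen Hall Jamieson", "individual"),
    ("Michael Gerson", "individual"),
    ("Wayne Fields", "individual"),
    ("Tom Price", "government"),
    ("Patty Murray", "government"),
]
demo.executemany("INSERT INTO pair_sources_added VALUES (?, ?)", added)
demo.executemany("INSERT INTO pair_sources_removed VALUES (?, ?)", removed)

# Stack both tables, tagging each row as an addition or a removal,
# then total the tags per source type.
NET_CHURN_SQL = """
SELECT type,
       SUM(added) AS added,
       SUM(removed) AS removed,
       SUM(added) - SUM(removed) AS net
FROM (
    SELECT type, 1 AS added, 0 AS removed FROM pair_sources_added
    UNION ALL
    SELECT type, 0 AS added, 1 AS removed FROM pair_sources_removed
)
GROUP BY type
ORDER BY type
"""
net_churn = demo.execute(NET_CHURN_SQL).fetchall()
for source_type, n_added, n_removed, net in net_churn:
    print(f"{source_type}: +{n_added} / -{n_removed} (net {net:+d})")
demo.close()
```

The figures from the five-row samples say nothing about the full tables; the point is the query shape.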


## Pair Source Transitions

Not every change is a clean addition or removal; sometimes a source's role shifts. These rows originate from [D5_angle_change_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/D5_angle_change_pair.output.json#L1-L90) and log when a source is promoted, demoted, added, or removed, together with the reasoning.

**Fields**
- `article_id` — Identifier ensuring the transition is attached to the correct story. It keeps downstream joins straightforward.
- `news_org` — Outlet label for the story. It helps you study role shifts across newsrooms.
- `from_version_id` — Baseline version identifier. It shows where the prior state was measured.
- `to_version_id` — Later version identifier after the transition. Analysts jump here to confirm the change in text.
- `canonical` — Canonicalised source name experiencing the transition. Sample values can still carry titles or affiliations (e.g., “Senator Patty Murray”), so check the naming before joining to the mentions table.
- `transition_type` — Prompt label such as “promoted” or “demoted.” It summarises how the source’s role changed.
- `reason_category` — Higher-level reason bucket (e.g., new_actor). It provides a quick rationale.
- `reason_detail` — Free-text detail elaborating on the category. Use it when writing qualitative summaries.

In [107]:
display(pd.read_sql("""
SELECT article_id,
       from_version_id,
       to_version_id,
       canonical,
       transition_type,
       reason_category
FROM pair_source_transitions
LIMIT 5
""", conn))

Unnamed: 0,article_id,from_version_id,to_version_id,canonical,transition_type,reason_category
0,240,737,10039,Senator Patty Murray,promoted,escalation
1,240,737,10039,Betsy DeVos,demoted,context_clarification
2,241,745,53132,Senate Minority Leader Chuck Schumer,added,new_actor
3,241,745,53132,Larry Levitt of the Kaiser Family Foundation,demoted,context_clarification
4,241,745,53132,Medical organizations and the U.S. Chamber of ...,added,new_actor


In [108]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       COUNT(*) FILTER (WHERE transition_type = 'promoted') AS promotions,
       COUNT(*) FILTER (WHERE transition_type = 'demoted') AS demotions
FROM pair_source_transitions
""", conn))

Unnamed: 0,rows,promotions,demotions
0,4228,1137,478
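
The summary above counts only promotions and demotions, while `reason_category` explains why each transition happened. A cross-tabulation of the two fields can be sketched as follows. This is an illustration under stated assumptions: `demo` is a hypothetical in-memory table seeded with the five sample rows shown above (the `canonical` column is left out because one sample value is truncated in the display), and `CROSSTAB_SQL` is an illustrative name.

```python
import sqlite3

# Hypothetical stand-in for analysis.db; only the columns the query needs.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE pair_source_transitions ("
    "article_id INTEGER, transition_type TEXT, reason_category TEXT)"
)

# Values copied from the LIMIT 5 sample displayed above.
sample_rows = [
    (240, "promoted", "escalation"),
    (240, "demoted", "context_clarification"),
    (241, "added", "new_actor"),
    (241, "demoted", "context_clarification"),
    (241, "added", "new_actor"),
]
demo.executemany(
    "INSERT INTO pair_source_transitions VALUES (?, ?, ?)", sample_rows
)

# Count each (transition_type, reason_category) combination.
CROSSTAB_SQL = """
SELECT transition_type, reason_category, COUNT(*) AS n
FROM pair_source_transitions
GROUP BY transition_type, reason_category
ORDER BY n DESC, transition_type
"""
crosstab = demo.execute(CROSSTAB_SQL).fetchall()
for transition_type, reason_category, n in crosstab:
    print(f"{transition_type:<10} {reason_category:<24} {n}")
demo.close()
```

Run against the full table, the same query shows which rationales dominate each transition type.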


## Pair Anon-Named Replacements

These rows flag when a source flips between anonymity and attribution. They are extracted from [P3_anon_named_replacement_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/P3_anon_named_replacement_pair.output.json#L1-L42) to highlight transparency shifts.

**Fields**
- `article_id` — Story identifier for the replacement event. It links the swap back to the main narrative.
- `news_org` — Outlet label for the story. It enables comparisons across newsrooms.
- `from_version_id` — Identifier for the earlier version. It contains the source’s original anonymity state.
- `to_version_id` — Identifier for the later version. It shows how the source is labelled after the change.
- `src` — Representation of the source before the edit. It may read “anonymous official” or a named person depending on the direction.
- `dst` — Representation after the edit. Analysts inspect it to see whether anonymity was introduced or removed.
- `direction` — Prompt label like “anon_to_named” describing the flow. It clarifies whether transparency improved or regressed.
- `likelihood` — Confidence score from the prompt; the sample rows range from 0.0 to 0.8. Higher numbers indicate the model is more confident the replacement happened.

In [109]:
display(pd.read_sql("""
SELECT article_id,
       from_version_id,
       to_version_id,
       src,
       dst,
       direction,
       likelihood
FROM pair_anon_named_replacements
LIMIT 5
""", conn))

Unnamed: 0,article_id,from_version_id,to_version_id,src,dst,direction,likelihood
0,11,12,118,Government scientists,Deke Arndt,anon_to_named,0.7
1,11,12,118,Gavin Schmidt,Gavin Schmidt,named_to_anon,0.0
2,11,12,118,NASA and the National Oceanic and Atmospheric ...,Deke Arndt,named_to_anon,0.4
3,11,12,118,NOAA,Arndt,named_to_anon,0.6
4,11,12,118,Gavin Schmidt,Schmidt,named_to_anon,0.8


In [110]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       COUNT(*) FILTER (WHERE direction = 'anon_to_named') AS anon_to_named,
       COUNT(*) FILTER (WHERE direction = 'named_to_anon') AS named_to_anon,
       AVG(likelihood) AS avg_likelihood
FROM pair_anon_named_replacements
""", conn))

Unnamed: 0,rows,anon_to_named,named_to_anon,avg_likelihood
0,4893,2346,2547,0.622
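
The sample above includes a row with `likelihood` 0.0, so raw direction counts mix confident and doubtful replacements. One way to separate them is to filter on a cut-off before counting. The sketch below assumes a cut-off of 0.5, which is an arbitrary illustrative choice and not a value from the pipeline; `demo` and `MIN_LIKELIHOOD` are hypothetical names, and the table holds only the five sample rows (without `src` and `dst`, since one sample value is truncated in the display).

```python
import sqlite3

# Hypothetical stand-in for analysis.db; only the columns the query needs.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE pair_anon_named_replacements ("
    "article_id INTEGER, direction TEXT, likelihood REAL)"
)

# Values copied from the LIMIT 5 sample displayed above.
sample_rows = [
    (11, "anon_to_named", 0.7),
    (11, "named_to_anon", 0.0),
    (11, "named_to_anon", 0.4),
    (11, "named_to_anon", 0.6),
    (11, "named_to_anon", 0.8),
]
demo.executemany(
    "INSERT INTO pair_anon_named_replacements VALUES (?, ?, ?)", sample_rows
)

MIN_LIKELIHOOD = 0.5  # assumed cut-off for illustration; tune to your needs

# Count replacements per direction, keeping only rows at or above the cut-off.
confident = demo.execute(
    """
    SELECT direction, COUNT(*) AS n
    FROM pair_anon_named_replacements
    WHERE likelihood >= ?
    GROUP BY direction
    ORDER BY direction
    """,
    (MIN_LIKELIHOOD,),
).fetchall()
print(confident)
demo.close()
```

Passing the cut-off as a bound parameter keeps the SQL unchanged while you experiment with different thresholds.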


## Pair Numeric Changes

Numeric claims frequently shift as facts are verified. [P7_numeric_changes_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/P7_numeric_changes_pair.output.json#L1-L59) records the old and new numbers so you can quantify corrections. In this snapshot of the database the table is empty (0 rows), so the sample and summary queries return no data.

**Fields**
- `article_id` — Story identifier to keep the numeric change tied to its article. It prevents cross-story mix-ups.
- `news_org` — Outlet label for the story. It lets you audit numeric edits by newsroom.
- `from_version_id` — Baseline version identifier containing the previous value. Analysts refer to it when verifying original numbers.
- `to_version_id` — Later version identifier with the updated value. It points you to the correction in context.
- `item` — Short description of what the number refers to (e.g., “injured protestors”). It helps you understand the subject without rereading the article.
- `prev` — Textual representation of the earlier numeric claim. It is the value that was replaced.
- `curr` — Textual representation of the new number. Comparing it with prev reveals the direction and magnitude of change.
- `delta` — Prompt-supplied description of how the number shifted. It provides a readable explanation such as “+5”.
- `unit` — Unit associated with the number (e.g., “people”). Analysts need it to contextualise changes.
- `source` — Source credited for the numeric claim. This shows whether the citation changed alongside the number.
- `change_type` — Prompt label such as “increase,” “decrease,” or “correction.” It categorises the nature of the update.
- `confidence` — Confidence score for the numeric analysis. Higher values suggest the change detection is reliable.

In [111]:
display(pd.read_sql("""
SELECT article_id,
       from_version_id,
       to_version_id,
       item,
       prev,
       curr,
       change_type,
       confidence
FROM pair_numeric_changes
LIMIT 5
""", conn))

Unnamed: 0,article_id,from_version_id,to_version_id,item,prev,curr,change_type,confidence


In [112]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       COUNT(*) FILTER (WHERE change_type = 'increase') AS increases,
       COUNT(*) FILTER (WHERE change_type = 'decrease') AS decreases,
       AVG(confidence) AS avg_confidence
FROM pair_numeric_changes
""", conn))

Unnamed: 0,rows,increases,decreases,avg_confidence
0,0,0,0,
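
The blank `avg_confidence` above is expected: in SQL, `AVG` over zero rows yields `NULL`, which arrives in Python as `None` (and in pandas as a missing value). A minimal sketch of checking for the empty case before using the average follows; `demo` is a hypothetical in-memory table, and its column types are assumptions.

```python
import sqlite3

# Hypothetical empty stand-in for pair_numeric_changes (column types assumed).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE pair_numeric_changes (change_type TEXT, confidence REAL)"
)

# COUNT(*) returns 0 on an empty table, while AVG returns NULL.
row_count, avg_confidence = demo.execute(
    "SELECT COUNT(*), AVG(confidence) FROM pair_numeric_changes"
).fetchone()

if row_count == 0:
    # Guard before formatting or comparing the average.
    print("pair_numeric_changes is empty; no average to report")
else:
    print(f"average confidence: {avg_confidence:.3f}")
demo.close()
```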


## Pair Claims

This table tracks narrative claims that were added, removed, or reframed. Records come from [P8_claims_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/P8_claims_pair.output.json#L1-L48) so you can audit substantive content shifts.

**Fields**
- `article_id` — Identifier for the article whose claim changed. It keeps the analysis grounded in a specific story.
- `news_org` — Outlet label associated with the article. It permits newsroom-level rollups of claim churn.
- `from_version_id` — Identifier for the earlier version containing the prior claim state. It is the anchor for before/after comparisons.
- `to_version_id` — Identifier for the later version reflecting the updated claim state. It points to the revised text.
- `claim_id` — Stable ID assigned by the prompt so you can match notes and references. It helps trace the same claim across analyses.
- `proposition` — Text of the claim under evaluation. This is what you would cite in qualitative write-ups.
- `status` — Prompt label indicating whether the claim was added, removed, updated, or left unchanged; unchanged claims carry the label `stable` in the sample rows. Analysts summarise change dynamics using this field.
- `change_note` — Free-text commentary describing how the claim shifted. It provides narrative colour beyond the status label.
- `confidence` — Confidence score from the claim prompt. High scores suggest the classification is dependable. Every sample row shows 5.0, so the scale differs from the 0 to 1 range seen in `likelihood`.

In [113]:
display(pd.read_sql("""
SELECT article_id,
       from_version_id,
       to_version_id,
       claim_id,
       proposition,
       status,
       confidence
FROM pair_claims
LIMIT 5
""", conn))

Unnamed: 0,article_id,from_version_id,to_version_id,claim_id,proposition,status,confidence
0,1,1,112,C1,No key factual or causal claims provided,stable,5.0
1,2,2,653,C1,No claims to track,stable,5.0
2,5,5,53399,C1,No claims to track,stable,5.0
3,10,10,493,C1,No claims to track,stable,5.0
4,11,12,118,C1,No key factual or causal claims provided for t...,stable,5.0


In [114]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       COUNT(*) FILTER (WHERE status = 'updated') AS updated_claims,
       COUNT(*) FILTER (WHERE status = 'added') AS added_claims,
       AVG(confidence) AS avg_confidence
FROM pair_claims
""", conn))

Unnamed: 0,rows,updated_claims,added_claims,avg_confidence
0,2069,0,0,5.0
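
Every sample row above is `stable`, and its `proposition` reads like a placeholder ("No claims to track"). Before analysing claim churn, it may help to measure how much of the table consists of such placeholders. The sketch below is an assumption-laden illustration: the `LIKE` pattern is inferred from the sample rows alone and may miss other phrasings, `demo` is a hypothetical in-memory table, and it is seeded with the four sample rows whose text is not truncated in the display.

```python
import sqlite3

# Hypothetical stand-in for analysis.db; only the columns the query needs.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE pair_claims ("
    "article_id INTEGER, proposition TEXT, status TEXT, confidence REAL)"
)

# The four untruncated rows from the LIMIT 5 sample displayed above.
sample_rows = [
    (1, "No key factual or causal claims provided", "stable", 5.0),
    (2, "No claims to track", "stable", 5.0),
    (5, "No claims to track", "stable", 5.0),
    (10, "No claims to track", "stable", 5.0),
]
demo.executemany("INSERT INTO pair_claims VALUES (?, ?, ?, ?)", sample_rows)

# Per status: total rows and rows whose proposition matches the assumed
# placeholder pattern ("No ... claims ...").
placeholder_share = demo.execute(
    """
    SELECT status,
           COUNT(*) AS n,
           SUM(CASE WHEN proposition LIKE 'No %claims%' THEN 1 ELSE 0 END)
               AS placeholders
    FROM pair_claims
    GROUP BY status
    ORDER BY status
    """
).fetchall()
print(placeholder_share)
demo.close()
```

If the placeholder count equals the row count on the full table as well, the table holds no substantive claims for this run.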


## Pair Frame Cues

Framing cues reveal rhetorical shifts even when facts stay constant. Rows come from [P9_frame_cues_pair](https://github.com/alex2awesome/news-edit-source-credibility/blob/main/news-edits-pipeline/prompts/P9_frame_cues_pair.output.json#L1-L41), which compares cues such as responsibility or victimhood.

**Fields**
- `article_id` — Story identifier anchoring the frame cue change. It keeps cues tied to their articles.
- `news_org` — Outlet label for the story. Use it to measure framing differences by newsroom.
- `from_version_id` — Identifier for the earlier version. It shows whether the cue existed initially.
- `to_version_id` — Identifier for the later version. It indicates if the cue emerged or faded.
- `cue` — Name of the framing cue (e.g., `law_and_order_emphasis` or `violence_highlight` in the sample rows). Group on it to compare framing shifts cue by cue.
- `prev` — Boolean showing whether the cue appeared in the earlier version. A value of 1 means it was already present.
- `curr` — Boolean showing whether the cue appears in the later version. Comparing prev and curr reveals the direction of change.
- `direction` — Prompt label such as “appeared,” “disappeared,” or “unchanged.” It succinctly narrates the cue movement.

In [115]:
display(pd.read_sql("""
SELECT article_id,
       from_version_id,
       to_version_id,
       cue,
       prev,
       curr,
       direction
FROM pair_frame_cues
LIMIT 5
""", conn))

Unnamed: 0,article_id,from_version_id,to_version_id,cue,prev,curr,direction
0,1,1,112,law_and_order_emphasis,0,0,unchanged
1,1,1,112,violence_highlight,0,0,unchanged
2,2,2,653,law_and_order_emphasis,0,0,unchanged
3,2,2,653,violence_highlight,0,0,unchanged
4,5,5,53399,law_and_order_emphasis,0,0,unchanged


In [116]:
display(pd.read_sql("""
SELECT COUNT(*) AS rows,
       COUNT(*) FILTER (WHERE direction = 'appeared') AS cues_appeared,
       COUNT(*) FILTER (WHERE direction = 'disappeared') AS cues_disappeared
FROM pair_frame_cues
""", conn))

Unnamed: 0,rows,cues_appeared,cues_disappeared
0,4983,131,138
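
Since `direction` is a prompt label while `prev` and `curr` are explicit flags, the two can be cross-checked. The sketch below derives a direction from the flags and lists rows where the stored label disagrees. The mapping (0 to 1 is "appeared", 1 to 0 is "disappeared", anything else is "unchanged") is an assumption based on the field descriptions above; `demo` is a hypothetical in-memory table seeded with the five sample rows.

```python
import sqlite3

# Hypothetical stand-in for analysis.db; only the columns the query needs.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE pair_frame_cues ("
    "article_id INTEGER, cue TEXT, prev INTEGER, curr INTEGER, direction TEXT)"
)

# Values copied from the LIMIT 5 sample displayed above.
sample_rows = [
    (1, "law_and_order_emphasis", 0, 0, "unchanged"),
    (1, "violence_highlight", 0, 0, "unchanged"),
    (2, "law_and_order_emphasis", 0, 0, "unchanged"),
    (2, "violence_highlight", 0, 0, "unchanged"),
    (5, "law_and_order_emphasis", 0, 0, "unchanged"),
]
demo.executemany("INSERT INTO pair_frame_cues VALUES (?, ?, ?, ?, ?)", sample_rows)

# Derive a direction from prev/curr, then keep rows whose stored label differs.
mismatches = demo.execute(
    """
    SELECT article_id, cue, prev, curr, direction, derived
    FROM (
        SELECT article_id, cue, prev, curr, direction,
               CASE WHEN prev = 0 AND curr = 1 THEN 'appeared'
                    WHEN prev = 1 AND curr = 0 THEN 'disappeared'
                    ELSE 'unchanged'
               END AS derived
        FROM pair_frame_cues
    )
    WHERE direction <> derived
    """
).fetchall()
checked = demo.execute("SELECT COUNT(*) FROM pair_frame_cues").fetchone()[0]
print(f"checked {checked} rows, {len(mismatches)} mismatches")
demo.close()
```

An empty result on the full table would indicate the labels and flags agree; any rows returned are worth reading in context.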


In [117]:
conn.close()