# Bluesky Ranker — Example Notebook

This notebook demonstrates the typical workflow:
- Fetch recent public posts into SQLite (upsert-by-URI)
- Load posts from SQLite into a Polars DataFrame
- Rank posts using the TopicRanker (TF–IDF/Count/SBERT)
- Inspect the top clusters and sample posts
- (Optional) Generate a per-handle cluster report to Markdown


> Note: This notebook expects a SQLite DB with posts.
> Create one from the bundled sample (no network access needed) with `python -m blueskyranker.sample_db --db newsflows_sample.db`,
> or fetch live data via the fetcher CLI.


In [1]:
# Imports and setup
import polars as pl
from blueskyranker.fetcher import Fetcher, ensure_db, load_posts_df
from blueskyranker.ranker import TopicRanker

## 1) Fetch recent posts into SQLite

- Adjust `max_age_days` (`--max-age-days` on the CLI) to control the time window.
- Upsert ensures engagement metrics refresh over time.
- You can also call the fetcher via CLI if you prefer.
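
The upsert-by-URI behaviour can be illustrated with plain `sqlite3`. This is a minimal sketch, not the fetcher's actual schema: the table name `posts_demo` and its three columns are assumptions made for illustration. It shows how a second fetch of the same URI refreshes the stored engagement count instead of adding a duplicate row.

```python
import sqlite3

# Hypothetical, simplified table; the real blueskyranker schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts_demo (uri TEXT PRIMARY KEY, text TEXT, like_count INTEGER)"
)

def upsert_post(conn, uri, text, like_count):
    """Insert a post, or refresh its fields if the URI already exists."""
    conn.execute(
        """
        INSERT INTO posts_demo (uri, text, like_count)
        VALUES (?, ?, ?)
        ON CONFLICT(uri) DO UPDATE SET
            text = excluded.text,
            like_count = excluded.like_count
        """,
        (uri, text, like_count),
    )
    conn.commit()

# The first fetch stores the post; a later fetch updates the like count in place.
upsert_post(conn, "at://example/post/1", "hello", 0)
upsert_post(conn, "at://example/post/1", "hello", 5)
rows = conn.execute("SELECT uri, like_count FROM posts_demo").fetchall()
print(rows)
```

The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.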


In [None]:
fetcher = Fetcher()
result = fetcher.fetch(max_age_days=1, 
                       extract_articles=True, 
                       extract_actors=False, 
                       handles=['news-flows-nl.bsky.social'])  
print(result)

# This run took about 5 minutes for 438 posts

Posts fetched (all handles): 0post [00:00, ?post/s]
Posts fetched (all handles): 100post [01:03,  1.15post/s]
Posts fetched (all handles): 200post [02:10,  1.85post/s]
Posts fetched (all handles): 300post [03:21,  2.03post/s]
Posts fetched (all handles): 400post [04:37,  1.61post/s]
Posts fetched (all handles): 438post [05:10,  1.16post/s]
Handles: 100%|██████████| 1/1 [05:10<00:00, 310.75s/handle]
Posts fetched (all handles): 438post [05:10,  1.41post/s]

✅ DONE news-flows-nl.bsky.social: upserted 438 posts into SQLite

FINAL REPORT

Handle: news-flows-nl.bsky.social
  Pages fetched         : 6
  Posts fetched         : 438
    - originals         : 438
    - replies           : 0
    - reposts           : 0
  Engagement (sums)
    - likes             : 11
    - reposts           : 7
    - replies           : 3
    - quotes            : 0
  Engagement (averages per post)
    - likes             : 0.03
    - reposts           : 0.02
    - replies           : 0.01
    - quotes            : 0.00
  Time range            : 2025-10-05T15:48:34+00:00  →  2025-10-06T15:22:14+00:00
  Time taken            : 310.73s
  Effective rate        : 1.41 posts/sec
  WARN embed anomalies  :
    - empty news_title  : 1
    - empty news_descr. : 4
    - empty news_uri    : 0

------------------------------------------------------------------------
All handles combined
------------------------------------------------------------------------
  Total pages    




## 2) Load posts from SQLite

- Choose a handle you want to rank.
- You can limit rows or change ordering as needed.


In [2]:
conn = ensure_db('newsflows.db')
handle = 'news-flows-nl.bsky.social'  # pick one of your handles
data = load_posts_df(conn, handle = handle, order_by='createdAt', descending=False)
data.head()

uri,cid,author_handle,author_did,indexedAt,createdAt,text,reply_root_uri,reply_parent_uri,is_repost,like_count,repost_count,reply_count,quote_count,news_title,news_description,news_uri,news_content,news_actors,createdAt_ns
str,str,str,str,str,str,str,null,null,i64,i64,i64,i64,i64,str,str,str,str,null,i64
"""at://did:plc:toz4no26o2x4vsbum…","""bafyreib237ndagghwlgbt2kpchbea…","""news-flows-nl.bsky.social""","""did:plc:toz4no26o2x4vsbum7cp4b…","""2025-10-06T07:52:00.461Z""","""2025-10-05T15:48:34.000000Z""","""Boerenland behouden, door het …",,,0,0,0,0,0,"""Column | Boerenland behouden, …","""In het Polderlab bij Oude Ade …","""https://www.nrc.nl/nieuws/2025…","""Het heet in goed Nederlands fa…",,1759679314000000000
"""at://did:plc:toz4no26o2x4vsbum…","""bafyreibgzdtubiuegsdwx4pgpbede…","""news-flows-nl.bsky.social""","""did:plc:toz4no26o2x4vsbum7cp4b…","""2025-10-05T16:07:04.758Z""","""2025-10-05T15:50:00.000000Z""","""WK-leider Oscar Piastri niet b…",,,0,0,0,0,0,"""WK-leider Oscar Piastri niet b…","""WK-leider Oscar Piastri baalde…","""https://www.ad.nl/formule-1/wk…",""",,En het is moeilijk om iemand…",,1759679400000000000
"""at://did:plc:toz4no26o2x4vsbum…","""bafyreihwezqte2c5cdfysiuuq77te…","""news-flows-nl.bsky.social""","""did:plc:toz4no26o2x4vsbum7cp4b…","""2025-10-05T19:26:38.560Z""","""2025-10-05T15:53:00.000000Z""","""Opgepakte Nederlanders weg uit…",,,0,0,0,0,0,"""Opgepakte Nederlanders weg uit…","""Osman Tastan (62) uit Steenber…","""https://www.ad.nl/binnenland/o…","""De schepen werden vorige week …",,1759679580000000000
"""at://did:plc:toz4no26o2x4vsbum…","""bafyreigiuy37m5rc3sijdv2fgpgfx…","""news-flows-nl.bsky.social""","""did:plc:toz4no26o2x4vsbum7cp4b…","""2025-10-05T16:07:07.458Z""","""2025-10-05T15:54:34.000000Z""","""Amerikaans minister Rubio waar…",,,0,0,0,0,0,"""Amerikaans minister Rubio waar…","""De Amerikaanse minister van Bu…","""https://www.rtl.nl/nieuws/buit…","""Israël heeft ingestemd met een…",,1759679674000000000
"""at://did:plc:toz4no26o2x4vsbum…","""bafyreidnphsy7goppg24xwx6pu3rh…","""news-flows-nl.bsky.social""","""did:plc:toz4no26o2x4vsbum7cp4b…","""2025-10-05T16:30:14.165Z""","""2025-10-05T15:55:00.000000Z""","""Vijf leden Nederlandse delegat…",,,0,0,0,0,0,"""LIVE Midden-Oosten | Vijf lede…","""Ook een vijfde lid van de Nede…","""https://www.ad.nl/buitenland/l…","""Op maandagmiddag arriveerden d…",,1759679700000000000


In [3]:
import pandas as pd
df = pd.DataFrame(data, columns=data.columns)
print(df.shape)
df.head()

(438, 20)


Unnamed: 0,uri,cid,author_handle,author_did,indexedAt,createdAt,text,reply_root_uri,reply_parent_uri,is_repost,like_count,repost_count,reply_count,quote_count,news_title,news_description,news_uri,news_content,news_actors,createdAt_ns
0,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreib237ndagghwlgbt2kpchbeavi7x6bcyks4r2fkb...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-06T07:52:00.461Z,2025-10-05T15:48:34.000000Z,"Boerenland behouden, door het te veranderen\n\...",,,0,0,0,0,0,"Column | Boerenland behouden, door het te vera...",In het Polderlab bij Oude Ade experimenteren L...,https://www.nrc.nl/nieuws/2025/10/05/boerenlan...,Het heet in goed Nederlands farm-to-table: een...,,1759679314000000000
1,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreibgzdtubiuegsdwx4pgpbedepmau6e5igajby62x...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-05T16:07:04.758Z,2025-10-05T15:50:00.000000Z,WK-leider Oscar Piastri niet blij met inhaalac...,,,0,0,0,0,0,WK-leider Oscar Piastri niet blij met inhaalac...,WK-leider Oscar Piastri baalde van de riskante...,https://www.ad.nl/formule-1/wk-leider-oscar-pi...,",,En het is moeilijk om iemand te verslaan die...",,1759679400000000000
2,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreihwezqte2c5cdfysiuuq77te7z6f45uijhlo2enb...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-05T19:26:38.560Z,2025-10-05T15:53:00.000000Z,Opgepakte Nederlanders weg uit Israël: 'We zij...,,,0,0,0,0,0,Opgepakte Nederlanders weg uit Israël: 'We zij...,Osman Tastan (62) uit Steenbergen is een van d...,https://www.ad.nl/binnenland/opgepakte-nederla...,De schepen werden vorige week in international...,,1759679580000000000
3,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreigiuy37m5rc3sijdv2fgpgfxlj45fwgldaocq62w...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-05T16:07:07.458Z,2025-10-05T15:54:34.000000Z,Amerikaans minister Rubio waarschuwt: oorlog i...,,,0,0,0,0,0,Amerikaans minister Rubio waarschuwt: oorlog i...,De Amerikaanse minister van Buitenlandse Zaken...,https://www.rtl.nl/nieuws/buitenland/artikel/5...,Israël heeft ingestemd met een Amerikaans twin...,,1759679674000000000
4,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreidnphsy7goppg24xwx6pu3rhi3oqqcubgczlc5nm...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-05T16:30:14.165Z,2025-10-05T15:55:00.000000Z,Vijf leden Nederlandse delegatie Gazavloot vri...,,,0,0,0,0,0,LIVE Midden-Oosten | Vijf leden Nederlandse de...,Ook een vijfde lid van de Nederlandse afvaardi...,https://www.ad.nl/buitenland/live-midden-ooste...,Op maandagmiddag arriveerden de eerste vijf ac...,,1759679700000000000


In [4]:
from blueskyranker.actor_annotator import ActorAnnotator
annotator = ActorAnnotator(model_name="gpt-oss:20b", seed=0)
print("Actor annotator initialized successfully!")

n_test = 50
print(f"Testing with {n_test} articles first...")
test_df = df.sample(n=n_test, random_state=123)

test_annotated = annotator.process_dataframe(
    df=test_df,
    text_column='news_content',
    id_column='uri'
)

Actor annotator initialized successfully!
Testing with 50 articles first...
Processing 50 articles...


Extracting actors:   0%|          | 0/50 [00:00<?, ?it/s]

Processing post at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky.feed.post/3m2hn3d64u52v...
Checking availability of model: gpt-oss:20b
Model gpt-oss:20b already available


Extracting actors:   0%|          | 0/50 [16:10<?, ?it/s]


KeyboardInterrupt: 

In [5]:
test_annotated.head()

Unnamed: 0,uri,cid,author_handle,author_did,indexedAt,createdAt,text,reply_root_uri,reply_parent_uri,is_repost,like_count,repost_count,reply_count,quote_count,news_title,news_description,news_uri,news_content,news_actors,createdAt_ns
13,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreicxwaw6qj5ferfmtpuzqjjyjfezjwx4cca2dmk5d...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-05T17:35:26.166Z,2025-10-05T16:23:24.000000Z,Verstappen komt na tweede plaats in Singapore ...,,,0,0,0,0,0,Verstappen komt na tweede plaats in Singapore ...,Formule 1: Max Verstappen werd tweede in Singa...,https://www.nrc.nl/nieuws/2025/10/05/verstappe...,Terwijl de coureurs kletsnat van het zweet zit...,,1759681404000000000
297,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreicaa6kutrvzyluv63rnyfeajwq6cecwoncdiismt...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-06T10:50:29.459Z,2025-10-06T10:26:23.000000Z,Nobelprijs voor Geneeskunde voor onderzoek naa...,,,0,0,0,0,0,Nobelprijs voor Geneeskunde voor onderzoek naa...,Twee Amerikanen en een Japanner hebben de Nobe...,https://www.nu.nl/wetenschap/6371458/nobelprij...,Door onze nieuwsredactie\n\n6 okt 2025 om 12:2...,"{""actors"": [{""actor_name"": ""Het Nobelcomit\u00...",1759746383000000000
142,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreiei7zlxfdbv534qmavgkwxrxmiat3qouetda5zvf...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-06T03:30:42.757Z,2025-10-06T03:01:00.000000Z,Bondskanselier Merz wil alleen Duitse deelname...,,,0,0,0,0,0,Bondskanselier Merz wil alleen Duitse deelname...,De Duitse bondskanselier Friedrich Merz wil da...,https://www.ad.nl/show/bondskanselier-merz-wil...,Of de uitspraak van Merz ook daadwerkelijk gev...,,1759719660000000000
201,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreiejdcpsq3y44xqq7dcxj4oc3pig2wdbcmm7tqte4...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-06T07:05:57.258Z,2025-10-06T06:45:00.000000Z,Tallon Griekspoor naar vierde ronde na opgave ...,,,0,0,0,0,0,Tallon Griekspoor naar vierde ronde na opgave ...,Tallon Griekspoor heeft Jannik Sinner verslage...,https://www.ad.nl/tennis/tallon-griekspoor-naa...,„Het is erg jammer voor hem en ik wens hem een...,"{""actors"": [{""actor_name"": ""Griekspoor"", ""acto...",1759733100000000000
292,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreibhsup4uysueg2nd3ympfgxzg2nawllyeh5vlxin...,news-flows-nl.bsky.social,did:plc:toz4no26o2x4vsbum7cp4bxp,2025-10-06T10:27:17.457Z,2025-10-06T10:15:53.000000Z,‘De oplossing voor onze overspannen samenlevin...,,,0,0,0,0,0,‘De oplossing voor onze overspannen samenlevin...,De telefoon van psycholoog Thijs Launspach sto...,https://www.ad.nl/werk/de-oplossing-voor-onze-...,Erik (54): ‘Ik ben hier na mijn studie begonne...,"{""actors"": [{""actor_name"": ""Erik (54)"", ""actor...",1759745753000000000


In [6]:
# Print the content of the first article where news_actors is missing
print(test_annotated[test_annotated['news_actors'].isna()].news_content.values[0])

Terwijl de coureurs kletsnat van het zweet zitten uit te hijgen na de zwaarste race van het jaar, glinsteren vuurwerkpluimen zondag in het water van de Marina Bay in Singapore. Vuurwerk ter ere van racewinnaar George Russell. En van McLaren, dat niet alleen zijn tiende constructeurstitel heeft veiliggesteld, maar óók opgelucht kan ademhalen als het gaat om de wereldtitel bij de coureurs.

Dat klinkt tegenstrijdig: Max Verstappen finishte onder het kunstlicht in de drukkend warme stadstaat als tweede, vóór McLaren-rijders Lando Norris en Oscar Piastri. Hij verkleinde zijn achterstand op het duo tot respectievelijk 63 (Piastri) en 41 (Norris) punten. Ruim een maand geleden lag hij nog meer dan honderd punten achter. Maar wie voorbij die feiten kijkt, moet op basis van het wedstrijdbeeld in Singapore concluderen dat de kans nog altijd heel klein is dat Verstappen het gat kan dichten in de resterende zes races, waarin elke overwinning 25 punten oplevert.

Het kán, in theorie. In het recent

In [7]:
# Expand test results to actor-level
test_actors_df = ActorAnnotator.expand_actors_to_rows(test_annotated)
print(f"Test results: {len(test_actors_df)} actors found from {len(test_annotated)} articles")
print("\nSample actors:")
print(test_actors_df[['actor_name', 'actor_function', 'actor_pp']].head(10))

Test results: 10 actors found from 10 articles

Sample actors:
              actor_name actor_function actor_pp
0        Het Nobelcomité              b         
1             Griekspoor              b         
2              Erik (54)              d         
3  Psycholoog Esmée (32)              b         
4        Daria Kasatkina              b         
5          Nelson Tanate              b         
6           Thérèse Boer              b         
7               Michelin              b         
8       Robin van Persie              b         
9             Miljuschka              d         


In [8]:
import re

import spacy

# The Dutch spaCy pipelines are named nl_core_news_*; 'nl_core_web_sm' does not
# exist, so loading it fails with OSError [E050].
!python -m spacy download nl_core_news_sm
nlp = spacy.load("nl_core_news_sm")

def clean_actor_name(name):
    # Remove text in parentheses, e.g. "Erik (54)" -> "Erik"
    return re.sub(r"\(.*?\)", "", name).strip()

def extract_core_name(full_name):
    """
    Use NER to find a named entity in the actor name.
    - PERSON, ORG or GPE: return the entity text, title-cased
    - Otherwise: return None (generic actor)
    """
    clean_name = clean_actor_name(full_name)
    doc = nlp(clean_name)

    # Return the first entity that carries a relevant label
    for ent in doc.ents:
        if ent.label_ in ("PERSON", "ORG", "GPE"):
            return ent.text.title()

    # Fallback: no relevant entity found (probably generic, like "government spokesperson")
    return None

In [None]:
test_actors_df['core_actor_name'] = test_actors_df['actor_name'].apply(extract_core_name)

(15, 12)


In [None]:
# Print the raw actor name next to its NER-derived core name for actors with function 'a'
for i in test_actors_df[test_actors_df['actor_function'] == 'a'].index:
    print("Actor Names:", test_actors_df.at[i, 'actor_name'])
    print("Actor NER:", test_actors_df.at[i, 'core_actor_name'])
    print("\n" + "="*80 + "\n")

In [None]:
# Requires: pip install SPARQLWrapper requests pandas
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import pandas as pd

WDQS = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "PartyLookup/0.1 (your-email@example.com)"}

def query_sparql(sparql):
    sparqlw = SPARQLWrapper(WDQS, agent=HEADERS["User-Agent"])
    sparqlw.setQuery(sparql)
    sparqlw.setReturnFormat(JSON)
    return sparqlw.query().convert()

def search_wikidata(name, language="en"):
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
        "limit": 1
    }
    # A timeout keeps a slow or unreachable API from hanging the notebook
    resp = requests.get("https://www.wikidata.org/w/api.php", params=params, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    hits = resp.json().get("search", [])
    return hits[0]["id"] if hits else None

def get_latest_party_name(name, language="en"):
    qid = search_wikidata(name, language=language)
    if not qid:
        return None
    
    sparql = f"""
    SELECT ?partyLabel ?start ?end WHERE {{
      VALUES ?person {{ wd:{qid} }}
      ?person p:P102 ?stmt .
      ?stmt ps:P102 ?party .
      OPTIONAL {{ ?stmt pq:P580 ?start. }}
      OPTIONAL {{ ?stmt pq:P582 ?end. }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language},en". }}
    }}
    """
    results = query_sparql(sparql)
    df = pd.DataFrame([{
        "party": r["partyLabel"]["value"],
        "start": r.get("start", {}).get("value"),
        "end": r.get("end", {}).get("value"),
    } for r in results["results"]["bindings"]])
    
    if df.empty:
        return None
    
    # Sort by start date (newest first), then by end date; missing dates sort last
    df['start'] = pd.to_datetime(df['start'], errors='coerce')
    df['end'] = pd.to_datetime(df['end'], errors='coerce')
    
    df = df.sort_values(by=['start', 'end'], ascending=[False, False]).reset_index(drop=True)
    
    return df['party'][0]


In [None]:
test_actors_df['party'] = test_actors_df['core_actor_name'].apply(lambda x: get_latest_party_name(x) if pd.notna(x) else None)

In [None]:
# Print the core actor name and the looked-up party for actors with function 'a'
for i in test_actors_df[test_actors_df['actor_function'] == 'a'].index:
    print("Actor Names:", test_actors_df.at[i, 'core_actor_name'])
    print("Actor party:", test_actors_df.at[i, 'party'])
    print("\n" + "="*80 + "\n")

In [27]:
actor_df.actor_function.value_counts(dropna=False)

actor_function
a    6
b    5
d    4
Name: count, dtype: int64

In [30]:
# Look up the party for core actor names when actor_function is 'a' (or missing)
def lookup_party(row):
    # pd.isna() also covers None, so no separate None check is needed
    if pd.isna(row['core_actor_name']):
        return None
    if pd.isna(row['actor_function']):
        return get_latest_party_name(row['core_actor_name'])
    if row['actor_function'].lower() == 'a':
        return get_latest_party_name(row['core_actor_name'])
    return None

actor_df['actor_wikiparty'] = actor_df.apply(lookup_party, axis=1)

In [31]:
actor_df.head()

Unnamed: 0,uri,text,news_title,news_description,news_uri,news_content,actor_name,actor_function,actor_pp,actor_index,total_actors_in_article,core_actor_name,actor_wikiparty
0,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,Kamil E. verklaart over moord Peter R. de Vrie...,Kamil E. verklaart over moord Peter R. de Vrie...,Hoofdverdachte Kamil E. heeft een verklaring a...,https://www.rtl.nl/boulevard/crime/artikel/553...,Kamil E. kwam in maart 2021 naar eigen zeggen ...,Kamil E.,d,,1,4,Kamil E.,
1,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,Kamil E. verklaart over moord Peter R. de Vrie...,Kamil E. verklaart over moord Peter R. de Vrie...,Hoofdverdachte Kamil E. heeft een verklaring a...,https://www.rtl.nl/boulevard/crime/artikel/553...,Kamil E. kwam in maart 2021 naar eigen zeggen ...,Justitie,a,,2,4,,
2,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,Kamil E. verklaart over moord Peter R. de Vrie...,Kamil E. verklaart over moord Peter R. de Vrie...,Hoofdverdachte Kamil E. heeft een verklaring a...,https://www.rtl.nl/boulevard/crime/artikel/553...,Kamil E. kwam in maart 2021 naar eigen zeggen ...,De rechtbank,a,,3,4,,
3,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,Kamil E. verklaart over moord Peter R. de Vrie...,Kamil E. verklaart over moord Peter R. de Vrie...,Hoofdverdachte Kamil E. heeft een verklaring a...,https://www.rtl.nl/boulevard/crime/artikel/553...,Kamil E. kwam in maart 2021 naar eigen zeggen ...,Het OM,a,,4,4,,
4,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,Maxime Meiland is 'heel opgelucht' dat vervolg...,Maxime Meiland is 'heel opgelucht' dat vervolg...,"Maxime Meiland is ""heel opgelucht"" dat ze niet...",https://www.nu.nl/achterklap/6371406/maxime-me...,Door onze nieuwsredactie\n\n5 okt 2025 om 20:4...,Meiland,d,,1,3,,


In [32]:
actor_df.actor_wikiparty.value_counts(dropna=False)

actor_wikiparty
None    15
Name: count, dtype: int64

In [33]:
actor_df[actor_df['actor_function'] == 'a'].actor_name.value_counts(dropna=False)

actor_name
Justitie                    1
De rechtbank                1
Het OM                      1
het gerechtshof Den Haag    1
Brekelmans                  1
De EU                       1
Name: count, dtype: int64

In [None]:
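# Crude domain label: drop "www." and keep everything before the first dot (e.g. "https://rtl")
df['domain'] = df['news_uri'].map(lambda x: " ".join(x.replace("www.","").split('.')[:1]))
# Flag posts whose news description is missing
df['isempty'] = df['news_description'].isnull()
pd.crosstab(df['domain'], df['isempty']).sort_values(by=True, ascending=False).head(20)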

isempty,False,True
domain,Unnamed: 1_level_1,Unnamed: 2_level_1
https://rtl,93,33
https://ad,111,23
https://nu,85,22
https://volkskrant,19,14
https://nos,31,12
https://nrc,31,9
https://geenstijl,4,7
https://metronieuws,13,1
https://mediacourant,3,0


In [12]:
# Read the bundled example_news.csv
example_df = pd.read_csv("blueskyranker/example_news.csv")
print(example_df.shape)
example_df.head()

(10000, 11)


Unnamed: 0,uri,cid,indexed_at,text,news_title,news_description,news_uri,reply_count,repost_count,like_count,quote_count
0,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreieqmukhyxpxajkfavvqbrholxef7is62mfklqhof...,2025-08-06T09:18:10Z,Besloten afscheidsdienst voor Hulk Hogan in Fl...,Besloten afscheidsdienst voor Hulk Hogan in Fl...,Familie en vrienden hebben afscheid genomen va...,https://www.rtl.nl/boulevard/artikel/5521988/b...,0,0,0,0
1,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreifh6zdh6vvra73cjspovrml33mlulk2ezfa2sh33...,2025-08-06T09:02:26Z,Slot kan twee aanvallers kwijtraken bij Liverp...,TransferTalk | Slot kan twee aanvallers kwijtr...,Met het nieuwe voetbalseizoen in aantocht draa...,https://www.ad.nl/transfernieuws/transfertalk-...,0,0,0,0
2,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreifknnduruehkzbt4hfiugdlyp3r3cccxrl6dtdh2...,2025-08-06T09:02:24Z,Nederlands meisje (3) verdronken bij Spaanse v...,Nederlands meisje (3) verdronken bij Spaanse v...,Een 3-jarig Nederlands meisje is maandag verdr...,https://www.ad.nl/buitenland/nederlands-meisje...,0,0,0,0
3,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreifrqfiaxii6lha5xaqvruj7nerovcq5g5cjd7l2z...,2025-08-06T09:02:22Z,Ook de verkoop van tweedehands Tesla’s dreigt ...,Ook de verkoop van tweedehands Tesla’s dreigt ...,Nadat de verkoop van nieuwe Tesla’s de afgelop...,https://www.ad.nl/auto/ook-de-verkoop-van-twee...,0,0,0,0
4,at://did:plc:toz4no26o2x4vsbum7cp4bxp/app.bsky...,bafyreibxhxtx54t6kbtqwmobrpn3x3aa3hfdadhhiu2jq...,2025-08-06T09:18:07Z,"Kandidatenlijst SGP blijft in beton gegoten, m...","Kandidatenlijst SGP blijft in beton gegoten, m...",De kandidatenlijst van de SGP voor de Tweede K...,https://www.volkskrant.nl/politiek/kandidatenl...,0,0,0,0


In [14]:
# print news_uri, news_description and text
example_df[['news_uri', 'news_description', 'text']].values[1]

array(['https://www.ad.nl/transfernieuws/transfertalk-slot-kan-twee-aanvallers-kwijtraken-bij-liverpool-manunited-bereikt-akkoord-met-spits-van-85-miljoen~a2335771/',
       'Met het nieuwe voetbalseizoen in aantocht draait de transfermolen op volle toeren. Welke spelers vinden voor september onderdak bij een nieuwe club, en wat zijn de laatste geruchten? Hieronder volg je...',
       'Slot kan twee aanvallers kwijtraken bij Liverpool, ManUnited bereikt akkoord met spits van 85 miljoen\n\nMet het nieuwe voetbalseizoen in aantocht draait de transfermolen op volle toeren. Welke spelers vinden voor september onderdak bij een nieuwe club, en wat zijn de laatste geruchten? Hieronder...'],
      dtype=object)

## 3) Rank posts by topic

- Methods: `networkclustering-tfidf`, `networkclustering-count`, `networkclustering-sbert` (slower, higher semantic quality).
- `similarity_threshold`: raise for fewer/tighter clusters.
- `vectorizer_stopwords`: 'english' | list of words | None.
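
To give a feel for what a similarity threshold does, here is a minimal, dependency-free sketch. It is not the `TopicRanker` implementation, whose internals are not shown in this notebook: the function names `cosine` and `cluster_by_threshold` are hypothetical, and plain bag-of-words cosine similarity with connected components stands in for the real vectorizer and network clustering.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def cluster_by_threshold(texts, similarity_threshold):
    """Link texts whose similarity reaches the threshold; return a cluster id per text."""
    vectors = [Counter(t.lower().split()) for t in texts]
    parent = list(range(len(texts)))  # union-find forest over text indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= similarity_threshold:
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(texts))]

texts = [
    "verstappen tweede in singapore",
    "verstappen wint punten in singapore",
    "nobelprijs voor geneeskunde toegekend",
]
labels = cluster_by_threshold(texts, similarity_threshold=0.2)
strict_labels = cluster_by_threshold(texts, similarity_threshold=0.9)
print(labels, strict_labels)
```

In this sketch the two Formula 1 headlines share a cluster at 0.2 but are split at 0.9, because a higher threshold removes the link between them.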


In [13]:
ranker = TopicRanker(
    returnformat='dataframe',
    method='networkclustering-sbert',  # try 'networkclustering-tfidf' for a faster, lexical alternative
    descending=True,
    similarity_threshold=0.2,
    vectorizer_stopwords='english',
    # Optional windows (days):
    cluster_window_days=7,
    engagement_window_days=3,
    push_window_days=1,
)
ranking = ranker.rank(data)
ranking.head()


TypeError: from_epoch() got an unexpected keyword argument 'unit'

## 4) Inspect top clusters and posts

- We show the 3 most engaged clusters.
- For each, we list the 5 most recent posts with key fields.


In [None]:
clusters = (
    ranking.group_by('cluster')
    .agg([
        pl.col('cluster_size').first().alias('size'),
        pl.col('cluster_engagement_count').first().alias('engagement')
    ])
    .sort('engagement', descending=True)
    .head(3)
)
for row in clusters.iter_rows(named=True):
    cid = row['cluster']
    size = int(row['size']) if row['size'] is not None else 0
    eng = int(row['engagement']) if row['engagement'] is not None else 0
    print(f"\n=== Cluster {cid} | size={size} | engagement={eng}")
    subset = (
        ranking.filter(pl.col('cluster') == cid)
        .sort('createdAt', descending=True)
        .head(5)
    )
    for rec in subset.select(['uri','text','news_title','news_description','news_uri']).iter_rows(named=True):
        print(f"- uri: {rec['uri']}")
        print(f"  text: {rec.get('text')}")
        print(f"  news_title: {rec.get('news_title')}")
        print(f"  news_description: {rec.get('news_description')}")
        print(f"  news_uri: {rec.get('news_uri')}")


## 5) (Optional) Generate a cluster report

- This writes `cluster_report.md` with top clusters per handle.
- You can adjust method, threshold, and stopwords.


In [None]:
from blueskyranker.cluster_report import generate_cluster_report
generate_cluster_report(db_path='newsflows.db', output_path='cluster_report.md',
                        method='networkclustering-sbert', sample_max=300,
                        similarity_threshold=0.2, vectorizer_stopwords='english')
print('Wrote cluster_report.md')


## 6) (Optional) End-to-end: fetch → rank → push (per handle)

- Runs the whole flow and logs a short cluster summary to `push.log`.


In [None]:
from blueskyranker.pipeline import run_fetch_rank_push
run_fetch_rank_push(
    handles=[handle],
    method='networkclustering-sbert', similarity_threshold=0.5,
    cluster_window_days=7, engagement_window_days=1, push_window_days=1,
    include_pins=False, test=True, log_path='push.log'
)


### Pipeline updates (priority and demotion)

- Priority assignment now starts at 1000 for the first item and decreases by 1 (1000, 999, 998, …). The minimum is clamped at 1. Items explicitly demoted are sent with priority 0.
- Demotion: by default, all posts from the last 48 hours that are not in the current prioritisation are sent with priority 0. Configure via `--demote-window-hours`.
- Export filenames use a human‑readable UTC timestamp: `push_{handle}_{YYYY-MM-DDTHH-mm-ssZ}.json`.
- Server responses: short responses print to stdout; long responses are saved to `push_exports/prioritize_response_{handle}_{YYYY-MM-DDTHH-mm-ssZ}.{json|txt}`.
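
The priority and filename rules above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: `assign_priorities` and `export_filename` are hypothetical helper names, and the inputs are simplified to lists of URIs.

```python
from datetime import datetime, timezone

def assign_priorities(ranked_uris, demoted_uris, start=1000, minimum=1):
    """Map URIs to priorities: start, start-1, ... clamped at minimum; demoted items get 0."""
    priorities = {
        uri: max(start - position, minimum)
        for position, uri in enumerate(ranked_uris)
    }
    # Demoted posts are those that are not part of the current prioritisation.
    for uri in demoted_uris:
        if uri not in priorities:
            priorities[uri] = 0
    return priorities

def export_filename(handle, now=None):
    """Build push_{handle}_{YYYY-MM-DDTHH-mm-ssZ}.json from a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"push_{handle}_{now.strftime('%Y-%m-%dT%H-%M-%SZ')}.json"

priorities = assign_priorities(["at://a", "at://b", "at://c"], ["at://old"])
long_list = assign_priorities([str(i) for i in range(1005)], [])
filename = export_filename(
    "news-flows-nl.bsky.social",
    datetime(2025, 10, 6, 15, 22, 14, tzinfo=timezone.utc),
)
print(priorities)
print(filename)
```

With more than 1000 ranked items, every item past position 999 stays at priority 1 because of the clamp.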

Example CLI:

```
python -m blueskyranker.pipeline \
  --handles news-flows-nl.bsky.social news-flows-fr.bsky.social \
  --method networkclustering-tfidf \
  --similarity-threshold 0.2 \
  --cluster-window-days 7 \
  --engagement-window-days 1 \
  --push-window-days 2 \
  --demote-last \
  --demote-window-hours 48 \
  --log-path push.log \
  --no-test
```

Programmatic call:

```python
from blueskyranker.pipeline import run_fetch_rank_push
run_fetch_rank_push(
    handles=['news-flows-nl.bsky.social'],
    method='networkclustering-tfidf', similarity_threshold=0.2,
    cluster_window_days=7, engagement_window_days=1, push_window_days=2,
    demote_last=True, demote_window_hours=48,
    include_pins=False, test=True, log_path='push.log')
```


### Ordering logic (time windows)

- Clustering window: clusters are built from posts in this window (e.g., 7 days).
- Engagement window: cluster engagement is computed here to derive `cluster_engagement_rank` (1 = most engaged).
- Push window: only posts in this window are eligible for the final feed.

Order of posts:

1) Filter to the push window.

2) Order clusters by engagement rank (most engaged first).

3) Within each cluster, sort by recency (newest first).

4) Interleave round‑robin across clusters in rank order (1, 2, 3, … then repeat).

Result: the first post is the most‑recent item from the most‑engaged cluster that has posts in the push window.
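
The four ordering steps can be sketched in plain Python. This is a simplified illustration, not the ranker's own code: `order_posts` is a hypothetical function, posts are plain dicts, and integers stand in for timestamps.

```python
from itertools import zip_longest

def order_posts(posts, cluster_rank, push_cutoff):
    """
    posts: list of dicts with 'uri', 'cluster' and 'created_at' (comparable timestamps).
    cluster_rank: dict mapping cluster -> engagement rank (1 = most engaged).
    push_cutoff: posts created before this value fall outside the push window.
    """
    # 1) Filter to the push window.
    eligible = [p for p in posts if p["created_at"] >= push_cutoff]

    # 2) Group by cluster and order the clusters by engagement rank.
    by_cluster = {}
    for post in eligible:
        by_cluster.setdefault(post["cluster"], []).append(post)
    ranked_clusters = sorted(by_cluster, key=lambda c: cluster_rank[c])

    # 3) Within each cluster, newest first.
    queues = [
        sorted(by_cluster[c], key=lambda p: p["created_at"], reverse=True)
        for c in ranked_clusters
    ]

    # 4) Round-robin: take one post per cluster per pass, in rank order.
    ordered = []
    for round_ in zip_longest(*queues):
        ordered.extend(p for p in round_ if p is not None)
    return [p["uri"] for p in ordered]

posts = [
    {"uri": "a1", "cluster": "A", "created_at": 5},
    {"uri": "a2", "cluster": "A", "created_at": 9},
    {"uri": "b1", "cluster": "B", "created_at": 7},
    {"uri": "b0", "cluster": "B", "created_at": 1},  # outside the push window
    {"uri": "c1", "cluster": "C", "created_at": 8},
]
feed = order_posts(posts, cluster_rank={"A": 2, "B": 1, "C": 3}, push_cutoff=4)
print(feed)
```

Cluster B is ranked first, so its only eligible post leads the feed. Cluster A's newest post follows, then cluster C's, and the second pass picks up A's remaining post.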
