# Bluesky Ranker — Example Notebook

This notebook demonstrates the typical workflow:
- Fetch recent public posts into SQLite (upsert-by-URI)
- Load posts from SQLite into a Polars DataFrame
- Rank posts using the TopicRanker (TF–IDF/Count/SBERT)
- Inspect the top clusters and sample posts
- (Optional) Generate a per-handle cluster report to Markdown
- (Optional) Run the end-to-end fetch → rank → push pipeline


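> Note: This notebook expects a SQLite database that already contains posts.
> Create one from the bundled sample (no network needed) with `python -m blueskyranker.sample_db --db newsflows_sample.db`,
> or fetch live data via the fetcher CLI.
> If you use the sample, point `ensure_db` in step 2 at `newsflows_sample.db` instead of `newsflows.db`.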


In [1]:
# Imports and setup
import polars as pl
from blueskyranker.fetcher import Fetcher, ensure_db, load_posts_df
from blueskyranker.ranker import TopicRanker


## 1) Fetch recent posts into SQLite

- Adjust `max_age_days` (`--max-age-days` on the CLI) to control the time window.
- Upsert ensures engagement metrics refresh over time.
- You can also call the fetcher via CLI if you prefer.
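
The upsert-by-URI idea can be illustrated with plain `sqlite3`. This is only a minimal sketch under stated assumptions: the three-column `posts` table and the `upsert_posts` helper are hypothetical and much smaller than whatever schema the fetcher really uses. It relies on SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or newer).

```python
import sqlite3

# Hypothetical, trimmed-down schema; the real table has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    uri        TEXT PRIMARY KEY,
    text       TEXT,
    like_count INTEGER DEFAULT 0
)
"""

# Insert a new row, or refresh the stored fields when the URI already exists.
UPSERT = """
INSERT INTO posts (uri, text, like_count)
VALUES (:uri, :text, :like_count)
ON CONFLICT(uri) DO UPDATE SET
    text       = excluded.text,
    like_count = excluded.like_count
"""


def upsert_posts(conn, posts):
    """Insert new posts; update text and like_count for URIs already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, posts)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_posts(conn, [{"uri": "at://example/1", "text": "hello", "like_count": 0}])
# A later fetch sees the same post with more likes: the row is updated, not duplicated.
upsert_posts(conn, [{"uri": "at://example/1", "text": "hello", "like_count": 3}])
rows = conn.execute("SELECT uri, like_count FROM posts").fetchall()
print(rows)
```

Because the URI is the primary key, re-running a fetch over the same time window never duplicates posts; it only refreshes their engagement counts.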


In [2]:
fetcher = Fetcher()
result = fetcher.fetch(max_age_days=7)  # change to your needs
print(result)


✅ DONE news-flows-nl.bsky.social: upserted 34 posts into SQLite
✅ DONE news-flows-ir.bsky.social: upserted 17 posts into SQLite
✅ DONE news-flows-cz.bsky.social: upserted 54 posts into SQLite
✅ DONE news-flows-fr.bsky.social: upserted 34 posts into SQLite

Handles: 100%|██████████| 4/4 [00:04<00:00,  1.03s/handle]
Posts fetched (all handles): 139post [00:04, 33.87post/s]

FINAL REPORT

Handle: news-flows-nl.bsky.social
  Pages fetched         : 2
  Posts fetched         : 34
    - originals         : 34
    - replies           : 0
    - reposts           : 0
  Engagement (sums)
    - likes             : 0
    - reposts           : 0
    - replies           : 0
    - quotes            : 0
  Engagement (averages per post)
    - likes             : 0.00
    - reposts           : 0.00
    - replies           : 0.00
    - quotes            : 0.00
  Time range            : 2025-09-10T11:15:47+00:00  →  2025-09-10T12:18:25+00:00
  Time taken            : 0.81s
  Effective rate        : 42.17 posts/sec
  WARN embed anomalies  :
    - empty news_title  : 5
    - empty news_descr. : 6
    - empty news_uri    : 0

Handle: news-flows-ir.bsky.social
  Pages fetched         : 2
  Posts fetched         : 17
    - originals         : 17
    - replies           : 0
    - reposts           : 0
  Engagement 




## 2) Load posts from SQLite

- Choose a handle you want to rank.
- You can limit rows or change ordering as needed.


In [8]:
conn = ensure_db('newsflows.db')
handle = 'news-flows-ir.bsky.social'  # pick one of your handles
data = load_posts_df(conn, handle=handle, order_by='createdAt', descending=False)
data.head()


uri,cid,author_handle,author_did,indexedAt,createdAt,text,reply_root_uri,reply_parent_uri,is_repost,like_count,repost_count,reply_count,quote_count,news_title,news_description,news_uri
str,str,str,str,str,str,str,null,null,i64,i64,i64,i64,i64,str,str,str
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreicfd5awvzowg3jbp75skakmq…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-03T19:43:52.709Z""","""2025-09-03T11:42:44.000000Z""","""'I was just a fan' - Loftus-Ch…",,,0,0,0,0,0,"""Ruben Loftus-Cheek: England mi…","""AC Milan midfielder Ruben Loft…","""https://www.bbc.com/sport/foot…"
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreibxzkrhgfje7c4lcsz25k6zm…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-03T12:05:03.103Z""","""2025-09-03T11:45:33.000000Z""","""Taoiseach and Olympic sailor j…",,,0,0,0,0,0,,,"""https://www.irishmirror.ie/new…"
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreia4yd4sacgffr6max3kvsvzc…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-04T02:50:22.010Z""","""2025-09-03T11:45:39.000000Z""","""14 Ways To Instantly Shut Down…",,,0,0,0,0,0,"""14 Ways To Instantly Shut Down…","""This is how you can silently s…","""https://www.yahoo.com/lifestyl…"
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreidbbucjp2hc7yzmocyka4fjs…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-03T12:05:03.206Z""","""2025-09-03T11:47:00.000000Z""","""Inside Ireland's transfer dead…",,,0,0,0,0,0,,,"""https://www.irishmirror.ie/spo…"
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreihhqhjk3np37ojg6j4ra3omf…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-03T12:05:03.223Z""","""2025-09-03T11:47:25.000000Z""","""Another Asian Hornet nest foun…",,,0,1,0,0,0,,,"""https://www.irishmirror.ie/new…"


In [4]:
import pandas as pd
# Count posts per news source: strip "www." and keep everything before the first dot
pd.DataFrame(data['news_uri']).map(lambda x: x.replace("www.", "").split('.')[0]).value_counts()

0                          
https://bfmtv                  2086
https://ad                      985
https://tf1info                 975
https://franceinfo              964
https://novinky                 813
                               ... 
https://lcsun-news                1
https://jmouders                  1
https://skysports                 1
https://smartasset                1
https://offshore-technology       1
Name: count, Length: 119, dtype: int64

In [5]:
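import pandas as pd
# Convert via dicts so pandas keeps the Polars column names
df = pd.DataFrame(data.to_dicts())
df['domain'] = df['news_uri'].map(lambda x: x.replace("www.", "").split('.')[0])
# True when the post's link card has no description
df['isempty'] = df['news_description'].isnull()
pd.crosstab(df['domain'], df['isempty']).sort_values(by=True, ascending=False).head(20)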

isempty,False,True
domain,Unnamed: 1_level_1,Unnamed: 2_level_1
https://bfmtv,348,1738
https://tf1info,265,710
https://franceinfo,280,684
https://bbc,218,410
https://novinky,440,373
https://idnes,425,316
https://leparisien,40,313
https://cnews,124,301
https://irozhlas,209,247
https://20minutes,77,244
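
The domain labels in the two tables above still carry the `https://` scheme because the lambda only strips `www.` and cuts at the first dot. A slightly more robust variant, sketched here with the standard library, parses the URL first. `domain_label` is a hypothetical helper, not part of `blueskyranker`.

```python
from urllib.parse import urlparse


def domain_label(url):
    """Return the first host label after an optional 'www.', or None if there is no host."""
    if not url:
        return None
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host.split(".")[0] or None


print(domain_label("https://www.bbc.com/sport/football"))  # bbc
print(domain_label("https://news.sky.com/story/example"))  # news
print(domain_label(None))                                  # None
```

Like the original lambda, this keeps only the first host label, so `news.sky.com` is reported as `news`. Mapping hosts to a registrable domain would need a public-suffix list, which is beyond this sketch.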


## 3) Rank posts by topic

- Methods: `networkclustering-tfidf`, `networkclustering-count`, `networkclustering-sbert` (slower, higher semantic quality).
- `similarity_threshold`: minimum similarity for two posts to count as related; raise it for tighter clusters.
- `vectorizer_stopwords`: 'english' | list of words | None.
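
To build intuition for `similarity_threshold`, here is a toy version of threshold-based network clustering in pure Python. It is a sketch under assumptions, not the `TopicRanker` implementation: it uses raw word counts, cosine similarity, and connected components, and the `cosine` and `cluster` helpers are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(texts, similarity_threshold):
    """Link texts whose similarity reaches the threshold; return connected components."""
    vectors = [Counter(t.lower().split()) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        # Union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if cosine(vectors[i], vectors[j]) >= similarity_threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


texts = [
    "israel launches strike on hamas leaders in qatar",
    "israel strike targets hamas leadership in doha",
    "donegal hero takes first managerial role",
    "asian hornet nest found in cork",
]
print(cluster(texts, 0.1))  # [[0, 1, 3], [2]]
print(cluster(texts, 0.2))  # [[0, 1], [2], [3]]
```

At the low threshold, the hornet story is pulled into the Israel cluster only because all three texts share the word "in". Raising the threshold, or removing stopwords, breaks that spurious link.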


In [9]:
ranker = TopicRanker(
    returnformat='dataframe',
    method='networkclustering-sbert',  # try 'networkclustering-tfidf' for a faster, purely lexical alternative
    descending=True,
    similarity_threshold=0.2,
    vectorizer_stopwords='english',
    # Optional windows (days):
    cluster_window_days=7,
    engagement_window_days=3,
    push_window_days=1,
)
ranking = ranker.rank(data)
ranking.head()


Do you really want to do this? You have 4165 texts, calculating sentence embeddings will be REALLY slow
Consider using another method, or submitting less document
Batches: 100%|██████████| 131/131 [00:11<00:00, 11.56it/s]


uri,cid,author_handle,author_did,indexedAt,createdAt,text,reply_root_uri,reply_parent_uri,is_repost,like_count,repost_count,reply_count,quote_count,news_title,news_description,news_uri,createdAt_dt,cluster,cluster_like_count,cluster_reply_count,cluster_quote_count,cluster_repost_count,cluster_size,cluster_engagement_count,cluster_engagement_rank,cluster_like_count_right,cluster_reply_count_right,cluster_quote_count_right,cluster_repost_count_right,cluster_size_right,cluster_engagement_count_right,cluster_engagement_rank_right
str,str,str,str,str,str,str,null,null,i64,i64,i64,i64,i64,str,str,str,"datetime[μs, UTC]",i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreibvspmip26g5h7vofmjf4g2f…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-09T15:25:42.508Z""","""2025-09-09T12:54:50.000000Z""","""Israel targets Hamas leadershi…",,,0,0,0,0,0,"""Israel targets Hamas leadershi…","""Israel launched an airstrike a…","""https://www.yahoo.com/news/art…",2025-09-09 12:54:50 UTC,5,6,3,0,1,176,10,6,3,2,0,0,96,5,4
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreiaaj22h5znt5wa4bzwkm4y4i…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-09T13:44:55.801Z""","""2025-09-09T13:07:00.000000Z""","""Israel launches strikes agains…",,,0,0,0,0,0,"""Israel launches strikes agains…","""The Israeli military has said …","""https://news.sky.com/story/isr…",2025-09-09 13:07:00 UTC,5,6,3,0,1,176,10,6,3,2,0,0,96,5,4
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreifeeoyebwlwimzwq476brvt6…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-09T15:25:49.207Z""","""2025-09-09T13:12:48.000000Z""","""Israel targeted Hamas leaders …",,,0,0,0,0,0,"""Israel targeted Hamas leaders …","""Israel launched a strike on Ha…","""https://www.yahoo.com/news/art…",2025-09-09 13:12:48 UTC,5,6,3,0,1,176,10,6,3,2,0,0,96,5,4
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreihfdtfbyi6wkjq3hyjyplbgg…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-09T20:45:18.307Z""","""2025-09-09T13:19:41.000000Z""","""Why are young people protestin…",,,0,0,0,0,0,,,"""https://www.bbc.com/news/artic…",2025-09-09 13:19:41 UTC,5,6,3,0,1,176,10,6,3,2,0,0,96,5,4
"""at://did:plc:vzmnljt7otfbbgrma…","""bafyreih7uxm32crczic45kiic24wj…","""news-flows-ir.bsky.social""","""did:plc:vzmnljt7otfbbgrmachtef…","""2025-09-09T14:03:46.303Z""","""2025-09-09T13:33:13.000000Z""","""Israel launches strike into Qa…",,,0,0,0,0,0,"""Israel launches strike into Qa…","""They were aiming for Hamas lea…","""https://www.joe.ie/news/israel…",2025-09-09 13:33:13 UTC,5,6,3,0,1,176,10,6,3,2,0,0,96,5,4


## 4) Inspect top clusters and posts

- We show the 3 most engaged clusters.
- For each, we list the 5 most recent posts with key fields.


In [10]:
clusters = (
    ranking.group_by('cluster')
    .agg([
        pl.col('cluster_size').first().alias('size'),
        pl.col('cluster_engagement_count').first().alias('engagement')
    ])
    .sort('engagement', descending=True)
    .head(3)
)
for row in clusters.iter_rows(named=True):
    cid = row['cluster']
    size = int(row['size']) if row['size'] is not None else 0
    eng = int(row['engagement']) if row['engagement'] is not None else 0
    print(f"\n=== Cluster {cid} | size={size} | engagement={eng}")
    subset = (
        ranking.filter(pl.col('cluster') == cid)
        .sort('createdAt', descending=True)
        .head(5)
    )
    for rec in subset.select(['uri','text','news_title','news_description','news_uri']).iter_rows(named=True):
        print(f"- uri: {rec['uri']}")
        print(f"  text: {rec.get('text')}")
        print(f"  news_title: {rec.get('news_title')}")
        print(f"  news_description: {rec.get('news_description')}")
        print(f"  news_uri: {rec.get('news_uri')}")



=== Cluster 2 | size=899 | engagement=35
- uri: at://did:plc:vzmnljt7otfbbgrmachtefxh/app.bsky.feed.post/3lyi7vhsrct2c
  text: Much-travelled Donegal hero takes on first inter-county managerial role

He has previously held coaching roles with club and county teams in Galway, Roscommon, Fermanagh, Westmeath and Donegal

  news_title: Much-travelled Donegal hero takes on first inter-county managerial role
  news_description: He has previously held coaching roles with club and county teams in Galway, Roscommon, Fermanagh, Westmeath and Donegal
  news_uri: https://www.irishmirror.ie/sport/gaa/gaelic-football/gaelic-football-news/much-travelled-donegal-hero-takes-35881053
- uri: at://did:plc:vzmnljt7otfbbgrmachtefxh/app.bsky.feed.post/3lyi6we4t672z
  text: Alastair Campbell says Israel's Doha strike like UK 'wiping out' Adams and McGuinness before GFA

‘It’s like… two days before Good Friday feels like it’s coming together, the British government decides to go and wipe out [Gerry] Adams an

## 5) (Optional) Generate a cluster report

- This writes `cluster_report.md` with top clusters per handle.
- You can adjust method, threshold, and stopwords.


In [None]:
from blueskyranker.cluster_report import generate_cluster_report
generate_cluster_report(db_path='newsflows.db', output_path='cluster_report.md',
                        method='networkclustering-sbert', sample_max=300,
                        similarity_threshold=0.2, vectorizer_stopwords='english')
print('Wrote cluster_report.md')


Do you really want to do this? You have 300 texts, calculating sentence embeddings will be REALLY slow
Consider using another method, or submitting less document
Batches: 100%|██████████| 10/10 [00:01<00:00,  6.63it/s]
Do you really want to do this? You have 300 texts, calculating sentence embeddings will be REALLY slow
Consider using another method, or submitting less document
Batches: 100%|██████████| 10/10 [00:01<00:00,  7.88it/s]
Do you really want to do this? You have 300 texts, calculating sentence embeddings will be REALLY slow
Consider using another method, or submitting less document
Batches: 100%|██████████| 10/10 [00:00<00:00, 11.42it/s]
Do you really want to do this? You have 300 texts, calculating sentence embeddings will be REALLY slow
Consider using another method, or submitting less document
Batches: 100%|██████████| 10/10 [00:01<00:00,  8.93it/s]

Wrote cluster_report.md


## 6) (Optional) End-to-end: fetch → rank → push (per handle)

- Runs the whole flow and logs a short cluster summary to `push.log`.
- The push step needs a reachable prioritisation server. In the run shown here nothing was listening on `localhost:3020`, so the cell ends with a `ConnectionError` after the fetch succeeds.


In [None]:
from blueskyranker.pipeline import run_fetch_rank_push
run_fetch_rank_push(
    handles=[handle],
    method='networkclustering-sbert', similarity_threshold=0.5,
    cluster_window_days=7, engagement_window_days=1, push_window_days=1,
    include_pins=False, test=True, log_path='push.log'
)


✅ DONE news-flows-ir.bsky.social: upserted 9 posts into SQLite

Handles: 100%|██████████| 1/1 [00:00<00:00,  1.17handle/s]
Posts fetched (all handles): 9post [00:00, 10.53post/s]


ConnectionError: HTTPSConnectionPool(host='localhost', port=3020): Max retries exceeded with url: /api/prioritize?test=true (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x12189aa50>: Failed to establish a new connection: [Errno 61] Connection refused'))





### Pipeline updates (priority and demotion)

- Priority assignment now starts at 1000 for the first item and decreases by 1 (1000, 999, 998, …). The minimum is clamped at 1. Items explicitly demoted are sent with priority 0.
- Demotion: by default, all posts from the last 48 hours that are not in the current prioritisation are sent with priority 0. Configure via `--demote-window-hours`.
- Export filenames use a human‑readable UTC timestamp: `push_{handle}_{YYYY-MM-DDTHH-mm-ssZ}.json`.
- Server responses: short responses print to stdout; long responses are saved to `push_exports/prioritize_response_{handle}_{YYYY-MM-DDTHH-mm-ssZ}.{json|txt}`.
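
The priority and filename rules above can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's code: `assign_priorities` and `export_filename` are hypothetical helpers, and the `strftime` pattern is my reading of `YYYY-MM-DDTHH-mm-ssZ`.

```python
from datetime import datetime, timezone


def assign_priorities(ranked_uris, demoted_uris=(), start=1000):
    """Give ranked posts 1000, 999, 998, ... (never below 1) and demoted posts 0."""
    priorities = {}
    for position, uri in enumerate(ranked_uris):
        priorities[uri] = max(start - position, 1)
    for uri in demoted_uris:
        # A post that is in the current ranking keeps its positive priority.
        priorities.setdefault(uri, 0)
    return priorities


def export_filename(handle, now=None):
    """Build push_{handle}_{UTC timestamp}.json with a filesystem-safe timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"push_{handle}_{now.strftime('%Y-%m-%dT%H-%M-%SZ')}.json"


print(assign_priorities(["a", "b", "c"], demoted_uris=["d"]))
print(export_filename("news-flows-nl.bsky.social",
                      datetime(2025, 9, 10, 12, 18, 25, tzinfo=timezone.utc)))
```

With more than 1000 ranked posts, every post from position 1000 onward shares priority 1, which keeps it above the explicitly demoted posts at 0.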

Example CLI:

```
python -m blueskyranker.pipeline \
  --handles news-flows-nl.bsky.social news-flows-fr.bsky.social \
  --method networkclustering-tfidf \
  --similarity-threshold 0.2 \
  --cluster-window-days 7 \
  --engagement-window-days 1 \
  --push-window-days 2 \
  --demote-last \
  --demote-window-hours 48 \
  --log-path push.log \
  --no-test
```

Programmatic call:

```python
from blueskyranker.pipeline import run_fetch_rank_push
run_fetch_rank_push(
    handles=['news-flows-nl.bsky.social'],
    method='networkclustering-tfidf', similarity_threshold=0.2,
    cluster_window_days=7, engagement_window_days=1, push_window_days=2,
    demote_last=True, demote_window_hours=48,
    include_pins=False, test=True, log_path='push.log')
```


### Ordering logic (time windows)

- Clustering window: clusters are built from posts in this window (e.g., 7 days).
- Engagement window: cluster engagement is computed here to derive `cluster_engagement_rank` (1 = most engaged).
- Push window: only posts in this window are eligible for the final feed.
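
As a rough sketch of how an engagement window could turn into `cluster_engagement_rank`: sum each cluster's engagement over posts inside the window, then rank clusters from most to least engaged. The ranking output in section 3 is consistent with engagement being likes + replies + quotes + reposts (6 + 3 + 0 + 1 = 10), but the tie-breaking rule and the `cluster_engagement_ranks` helper below are assumptions.

```python
from datetime import datetime, timedelta, timezone


def cluster_engagement_ranks(posts, now, engagement_window_days):
    """Rank clusters by summed engagement of posts inside the window (1 = most engaged)."""
    cutoff = now - timedelta(days=engagement_window_days)
    totals = {}
    for p in posts:
        totals.setdefault(p["cluster"], 0)  # every cluster gets a rank, even with no recent posts
        if p["created"] >= cutoff:
            totals[p["cluster"]] += p["likes"] + p["replies"] + p["quotes"] + p["reposts"]
    # Assumed tie-break: lower cluster id first
    ordered = sorted(totals, key=lambda c: (-totals[c], c))
    return {c: rank for rank, c in enumerate(ordered, start=1)}


now = datetime(2025, 9, 10, 12, 0, tzinfo=timezone.utc)
posts = [
    {"cluster": 5, "created": now - timedelta(hours=1), "likes": 6, "replies": 3, "quotes": 0, "reposts": 1},
    {"cluster": 2, "created": now - timedelta(hours=2), "likes": 1, "replies": 0, "quotes": 0, "reposts": 0},
    # Outside the 3-day window, so its 50 likes are ignored
    {"cluster": 2, "created": now - timedelta(days=10), "likes": 50, "replies": 0, "quotes": 0, "reposts": 0},
]
ranks = cluster_engagement_ranks(posts, now, engagement_window_days=3)
print(ranks)  # {5: 1, 2: 2}
```

A short engagement window favours clusters that are attracting attention now, even when an older cluster has accumulated more engagement overall.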

Order of posts:

1) Filter to the push window.

2) Order clusters by engagement rank (most engaged first).

3) Within each cluster, sort by recency (newest first).

4) Interleave round‑robin across clusters in rank order (1, 2, 3, … then repeat).

Result: the first post is the most‑recent item from the most‑engaged cluster that has posts in the push window.
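
The four ordering steps can be sketched as follows. This is a minimal illustration rather than the pipeline's implementation: `order_feed` is a hypothetical helper, and each post is assumed to be a dict that already carries its cluster's engagement rank under `rank`.

```python
from datetime import datetime, timedelta, timezone
from itertools import zip_longest


def order_feed(posts, now, push_window_days):
    """Filter to the push window, then interleave clusters round-robin by engagement rank."""
    cutoff = now - timedelta(days=push_window_days)
    # 1) Keep only posts inside the push window
    eligible = [p for p in posts if p["created"] >= cutoff]
    # 2) Group by cluster rank; clusters are visited from rank 1 upward
    by_rank = {}
    for p in eligible:
        by_rank.setdefault(p["rank"], []).append(p)
    # 3) Newest first within each cluster
    queues = [
        sorted(by_rank[rank], key=lambda p: p["created"], reverse=True)
        for rank in sorted(by_rank)
    ]
    # 4) Round-robin: one post per cluster per round, skipping exhausted clusters
    feed = []
    for round_ in zip_longest(*queues):
        feed.extend(p for p in round_ if p is not None)
    return feed


now = datetime(2025, 9, 10, 12, 0, tzinfo=timezone.utc)
posts = [
    {"uri": "a", "rank": 1, "created": now - timedelta(hours=2)},
    {"uri": "b", "rank": 1, "created": now - timedelta(hours=1)},
    {"uri": "c", "rank": 2, "created": now - timedelta(hours=3)},
    {"uri": "d", "rank": 1, "created": now - timedelta(days=3)},  # outside the push window
    {"uri": "e", "rank": 3, "created": now - timedelta(minutes=30)},
]
feed = order_feed(posts, now, push_window_days=1)
print([p["uri"] for p in feed])  # ['b', 'c', 'e', 'a']
```

Interleaving keeps a single large cluster from filling the top of the feed: each round surfaces at most one post per cluster before any cluster gets a second slot.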
