# News item curation and scoring demonstration

The hero area of the BC Randonneurs home page should feature newsletter articles, and the featured articles should be selected algorithmically.  Criteria for selecting articles should include:

- **Freshness:** Older articles should be automatically dropped after some time.
- **Quality:** We want to show articles that are interesting and well written.
- **Inspiration:** Images of majestic scenery would help to enhance the home page.
- **Diversity:** Images and articles that feature underrepresented populations may encourage reluctant prospective members to join.

The freshness of each item can automatically be inferred from its publication date.  For the other criteria, I propose that the newsletter editor should assign a *rating* to each article, ranging from 0 (✩✩✩✩✩) to 5 (⭐️⭐️⭐️⭐️⭐️), when uploading it to the server.

When serving the home page, the server would calculate a *score* for each article as follows:
$$ S = \frac{A}{1.5 ^ R - 1} $$
where
$$
\begin{align*}
  S &= \textrm{Score} \\
  A &= \textrm{Age of article (days since publication)} \\
  R &= \textrm{Rating (integer from 0 to 5 inclusive)}
\end{align*}
$$

The articles with the $n$ lowest scores would then be featured.
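
As a minimal sketch of this selection rule, here is a plain-Python version that does not depend on the pandas code used in the demo.  The function names `score` and `featured`, the example date, and the decisions to map a zero denominator to infinity and to skip not-yet-published items are illustrative assumptions, not part of the proposal.  The sample articles are taken from `news_items.csv`.

```python
import math
from datetime import date

def score(age_days: int, rating: int) -> float:
    """Irrelevance score S = A / (1.5**R - 1); lower is better."""
    denominator = 1.5 ** rating - 1
    if denominator == 0:
        # R = 0: the score is undefined, so treat it as "never featured"
        # (an assumption made for this sketch).
        return math.inf
    return age_days / denominator

def featured(items, today: date, n: int) -> list:
    """Return the titles of the n lowest-scoring items as of `today`."""
    scored = [
        (score((today - pubdate).days, rating), title)
        for title, pubdate, rating in items
        if pubdate <= today  # skip items not yet published (assumption)
    ]
    # Drop items whose score is infinite, then keep the n lowest scores.
    finite = sorted((s, t) for s, t in scored if math.isfinite(s))
    return [title for _, title in finite[:n]]

# (title, publication date, rating), copied from news_items.csv
ITEMS = [
    ("VanIsle 1200", date(2024, 6, 25), 5),
    ("My First 600", date(2024, 8, 26), 0),
    ("Alberta Bound 1000", date(2024, 9, 18), 5),
    ("Gorgeous Fall Permanent", date(2024, 9, 28), 1),
    ("Williams Lake NDTR 1000", date(2024, 10, 2), 5),
]

print(featured(ITEMS, date(2024, 10, 15), 3))
```

On the example date, the three five-star articles are selected, newest first; the one-star article published in late September already scores worse than the five-star article from June.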

To understand the formula, you can think of an article's relevance $S'$ as decaying exponentially with age, where the rate of decay depends on the rating:
$$ S' = e^{- \frac{A}{1.5 ^ R - 1}} $$

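Each additional ⭐️ awarded to an article multiplies its sustaining power by a factor that tends towards 1.5 as the rating increases (2.5× going from one star to two, about 1.6× going from four stars to five).  The $- 1$ in the denominator ensures that an article with $R = 0$ is never featured: the denominator becomes $1.5^0 - 1 = 0$, and division by zero yields an invalid score.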
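
One way to see the effect of the rating is to ask how old an article can get before its score reaches some fixed value.  The sketch below rearranges the formula to $A = S \cdot (1.5^R - 1)$; the helper name `days_until` and the threshold of 10 are hypothetical, chosen only for illustration.

```python
def days_until(score_threshold: float, rating: int) -> float:
    """Age in days at which an article's score S reaches score_threshold."""
    return score_threshold * (1.5 ** rating - 1)

# Days until the score reaches 10, for each possible rating
for rating in range(6):
    print(rating, days_until(10, rating))
```

With this threshold, a one-star article lasts 5 days and a five-star article lasts about 66 days, while an unrated article reaches any threshold immediately.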

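Since $S' = e^{-S}$ is a strictly decreasing function of $S$, ranking the articles by highest $S'$ gives the same order as ranking them by lowest $S$.  We can therefore take the computational shortcut of skipping the exponential and selecting the lowest scores $S$ instead of the highest scores $S'$.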
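
The equivalence of the two rankings can be checked numerically.  This is a small sketch with made-up article labels and scores, not data from the demo.

```python
import math

# Hypothetical scores S for four articles (lower is better)
scores = {"A": 1.97, "B": 4.09, "C": 16.99, "D": 34.0}

# The corresponding decayed values S' = exp(-S) (higher is better)
decayed = {name: math.exp(-s) for name, s in scores.items()}

by_lowest_s = sorted(scores, key=scores.get)
by_highest_s_prime = sorted(decayed, key=decayed.get, reverse=True)

print(by_lowest_s == by_highest_s_prime)
```

Skipping the exponential also avoids floating-point underflow: for very large $S$, $e^{-S}$ rounds to zero and distinct articles would become tied.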

---
Here is the core of the code for the demo.

In [12]:
from datetime import date
import os.path
from urllib.error import HTTPError
from IPython.display import clear_output, display
from ipywidgets import DatePicker, IntSlider, interactive
import numpy as np
import pandas as pd
import plotly.express as px

# This kludge is needed for plotly output to appear in Google Colab
try:
    from google.colab import output
    output.enable_custom_widget_manager()
except ImportError:
    pass

def read_csv(path, path_prefixes=('',), **kwargs):
    """Read a CSV file, trying each path prefix in turn until one succeeds."""
    last_error = None
    for path_prefix in path_prefixes:
        try:
            return pd.read_csv(os.path.join(path_prefix, path), **kwargs)
        except (FileNotFoundError, HTTPError) as e:
            # Remember the failure and fall through to the next prefix
            last_error = e
    raise last_error

def scores(item: pd.Series, /, end_date: np.datetime64):
    """Daily scores for one article (one row of df), covering at most the 91 days up to end_date."""
    dates = pd.date_range(max(item.pubdate, end_date - np.timedelta64(91, 'D')), end=end_date, freq='D', inclusive='both')
    ages = dates - item.pubdate
    return pd.Series((ages.days / (1.5 ** item.rating - 1)), index=dates)

def display_results(d: date, n: int):
    """Show the n lowest-scoring articles as of date d, plus a plot of score evolution."""
    end_date = np.datetime64(d)
    score_data = df.apply(scores, end_date=end_date, axis=1)
    top_results = score_data.iloc[:,-1:].sort_values(by=end_date).head(n)

    #score_data.transpose().plot(logy=True, ylabel="Irrelevance score (lower is better)", figsize=(15, 8))
    display(top_results)
    px.line(
        score_data.transpose(),
        log_y=True,
        title="Visualization of score evolution",
        labels={'index': "Date", 'value': "Irrelevance score (lower is better)", 'title': "Article Title"},
        width=1000, height=800,
    ).show()

Now we import the data.  If you edit the input file `news_items.csv`, you will need to run the notebook again (⏩) to reload the data.

In [2]:
df = read_csv('news_items.csv',
    path_prefixes=[
        '',
        '../input/bcr-web-demo',
        'https://raw.githubusercontent.com/dpoon/bcr_web_demo/HEAD',
    ],
    index_col='title',
    parse_dates=['pubdate'],
)
df

| title | author | pubdate | rating |
|---|---|---|---|
| VanIsle 1200 | Stephanie Briggs | 2024-06-25 | 5 |
| July Permanent #51 | Bob Goodison | 2024-07-15 | 1 |
| Kulshan 300 | Dan Parke | 2024-07-14 | 1 |
| Tinto's 500: A Blast Then A Bust!!!! | Gary Baker | 2024-07-07 | 0 |
| August Flood Route 200 | Karen Smith | 2024-08-11 | 0 |
| Parallels With Latitude | Patrick Jackman | 2024-08-11 | 1 |
| My First 600 | Lee Fish | 2024-08-26 | 0 |
| Alberta Bound 1000 | Bob Goodison | 2024-09-18 | 5 |
| Gorgeous Fall Permanent | Karen Smith | 2024-09-28 | 1 |
| Williams Lake NDTR 1000 | Dara Poon | 2024-10-02 | 5 |


Pick a date, and the widget computes the scores of the news items as of that date.  The $n$ items with the lowest scores would be featured on the home page.  The visualization at the bottom shows the evolution of the scores over time (plotted on a semilog scale for clarity).

In [13]:
interactive(display_results,
    d=DatePicker(description="Date", value=date.today()),
    n=IntSlider(description="Top", value=5, min=1, max=10),
)
