# Trending Wikipedia

1. Get trending Wikipedia articles from yesterday
2. Pass plain text from each article to OpenAI for suggestions as to why it is trending
3. Build HTML page to display each article and why it is trending

In [2]:
# !pip install requests tiktoken openai python-dotenv

In [3]:
from dotenv import load_dotenv
import os

load_dotenv()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]



The featured feed API returns a different response for today than it does for previous days.

This notebook is written for previous days only.
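One way to make the "previous days only" restriction explicit is to build the URL in a small helper that rejects today's date. This is a minimal sketch: `featured_feed_url` and `FEED_BASE` are hypothetical names that are not part of the notebook, and the check assumes the local date is a good enough stand-in for "today".

```python
import datetime
from typing import Optional

FEED_BASE = "https://api.wikimedia.org/feed/v1/wikipedia/en/featured/"

def featured_feed_url(day: datetime.date, today: Optional[datetime.date] = None) -> str:
    """Return the featured-feed URL for a past day (hypothetical helper)."""
    today = today or datetime.date.today()
    if day >= today:
        # The notebook only handles feeds for days that are already over
        raise ValueError("This notebook only handles previous days")
    return FEED_BASE + day.strftime("%Y/%m/%d")

print(featured_feed_url(datetime.date(2024, 12, 29), today=datetime.date(2024, 12, 30)))
```

The `today` parameter is only there so the check can be exercised with a fixed date.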

In [4]:
import requests
import datetime

today = datetime.datetime.now()
yesterday = today - datetime.timedelta(days=1)

date_to_query = yesterday
url = 'https://api.wikimedia.org/feed/v1/wikipedia/en/featured/' + date_to_query.strftime('%Y/%m/%d')


response = requests.get(url)
featured_feed = response.json()
print(f"API call: {url}")
print(f"Retrieved Wikipedia top article statistics for {date_to_query}")

API call: https://api.wikimedia.org/feed/v1/wikipedia/en/featured/2024/12/29
Retrieved Wikipedia top article statistics for 2024-12-29 17:51:16.083668


### Save API response to file

In [5]:
import os
import json

# Ensure the 'data' folder exists
file_directory = "data"
os.makedirs(file_directory, exist_ok=True)

# Define the filename based on the date
base_file_name = date_to_query.strftime('%Y-%m-%d')
file_path = f'{file_directory}/{base_file_name}.json'

# Save to JSON file (overwrite if it already exists)
with open(file_path, 'w', encoding='utf-8') as file:
    json.dump(featured_feed, file, indent=4, ensure_ascii=False)

print(f'Saved Wikipedia response to {file_path}')

Saved Wikipedia response to data/2024-12-29.json


# Keep an eye on the token count

Since the entire Wikipedia article goes into the context window, I want to keep an eye on the token count for each article. Here's what I've seen:

- Squid_Game_season_2 (16k)
- Olivia_Hussey (12k)
- Greg_Gumbel (6k)
- Bryant_Gumbel (8k)
- Nosferatu_(2024_film) (17k)
- Pushpa_2 (38k)
- Manmohan_Singh (38k)

So articles about "normal" people run roughly 5-12k tokens, whereas an Indian politician (Manmohan_Singh) and an Indian film (Pushpa_2) both hit 38k...

In [6]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens in `string` under the named tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

# Changing from Ollama to OpenAI...

When passing the entire article text to Ollama I was getting a great deal of hallucinations, so I decided to see what it looked like to pass the entire text to ChatGPT. I dropped it down to only the top article to test out the token count and cost.


- gpt-3.5-turbo-0125: 16,385 tokens, which is not enough for the Anthropology article with 26k tokens
- gpt-4-32k-0613: 32k tokens
- gpt-4-turbo: 128k tokens
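The limits listed above can be turned into a quick check of which models can hold a given article. This is only a sketch: `CONTEXT_LIMITS` and `models_that_fit` are hypothetical names, the "32k" and "128k" figures are written out as 32,768 and 128,000 on the assumption that those are the exact values, and the default reply reserve simply mirrors the `max_tokens=2048` setting used elsewhere in this notebook.

```python
# Context limits in tokens. The first value is quoted in the list above; the
# other two are ASSUMED exact values for "32k" and "128k".
CONTEXT_LIMITS = {
    "gpt-3.5-turbo-0125": 16_385,
    "gpt-4-32k-0613": 32_768,
    "gpt-4-turbo": 128_000,
}

def models_that_fit(token_count: int, reserve_for_reply: int = 2_048) -> list:
    """Return the models whose context window holds the prompt plus a reply budget."""
    needed = token_count + reserve_for_reply
    return [name for name, limit in CONTEXT_LIMITS.items() if needed <= limit]

print(models_that_fit(26_000))  # the 26k-token article mentioned above
print(models_that_fit(38_000))  # the longest articles seen so far
```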

### Iteration:

Passing the entire article through the context window cost an average of $0.50 per article... All of the articles I spot-checked had the relevant changes at the very top of the article. I ended up passing just the first 5000 characters to ChatGPT and got the same results for around $0.01 per article!
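The savings from truncating follow from simple arithmetic on the prompt size. The sketch below assumes a placeholder price per million input tokens; `ASSUMED_PRICE` is an illustrative value, not a quoted rate, and `estimate_input_cost` is a hypothetical helper. It also ignores output tokens, so it will understate the real bill.

```python
def estimate_input_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate the prompt cost in USD for a given token count and price."""
    return num_tokens / 1_000_000 * usd_per_million_tokens

# ASSUMPTION: placeholder price, not a quoted rate. Substitute the current
# input price of whichever model is being called.
ASSUMED_PRICE = 10.0  # USD per million input tokens

full_article = estimate_input_cost(38_000, ASSUMED_PRICE)  # longest article seen so far
truncated = estimate_input_cost(1_400, ASSUMED_PRICE)      # roughly 5000 characters of wikitext

print(f"full article: ${full_article:.2f}")
print(f"truncated:    ${truncated:.3f}")
print(f"ratio:        {full_article / truncated:.0f}x")
```

Whatever the actual price, the ratio between the two only depends on the token counts, which is why trimming to the top of the article pays off so quickly.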


In [21]:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

articles_with_reasons = []


for item in featured_feed['mostread']['articles'][:10]:
    title = item['title']
    views = item['views']
    link = item['content_urls']['desktop']['page']
    extract = item['extract']
    thumbnail = item.get('thumbnail', {}).get('source', None)
    print(f"Analyzing {title}")

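    # Download the raw wikitext of the article.
    # Passing params lets requests URL-encode titles containing characters like & or ?
    raw_response = requests.get(
        "https://en.wikipedia.org/w/index.php",
        params={"title": title, "action": "raw"},
    )
    print(raw_response.url)

    article_text = raw_response.text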
    article_text_truncated = article_text[:5000]
    
    print(f"Token count: {num_tokens_from_string(article_text, 'cl100k_base')}")
    print(f"Truncated Token count: {num_tokens_from_string(article_text_truncated, 'cl100k_base')}")


    prompt = f"Act as a professional news summarizer. Based on your knowledge of {title} and the following extract, explain in 1-2 sentences why the {title} article might be trending on Wikipedia on {date_to_query.strftime('%B %d, %Y')}:\n\n{article_text_truncated}"


    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=1,
        max_tokens=2048,
        top_p=1,
    )
    print(f"response: {response}")
    print(f"trendingreason: {response.choices[0].message.content}")
    
    article = {
        'title': title,
        'views': views,
        'link': link,
        'thumbnail': thumbnail,
        'extract': extract,
        'trendingreason': response.choices[0].message.content
    }

    articles_with_reasons.append(article)


Analyzing Squid_Game_season_2
https://en.wikipedia.org/w/index.php?title=Squid_Game_season_2&action=raw
Token count: 16382
Truncated Token count: 1401
response: ChatCompletion(id='chatcmpl-AkO4bGb9y5rQG6TlczypS1rtBMhua', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="The article on *Squid Game* Season 2 is likely trending on Wikipedia due to the show's recent release on December 26, 2024, which has generated significant buzz and media coverage. Following the success of its first season, the anticipation for the continuation of Seong Gi-hun's story, alongside the confirmation of a third season set for 2025, has captured the attention of fans and critics alike.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1735618989, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_0aa8d3e20b', usage=CompletionUsage(completion_tokens=86, prompt_tokens=1

I kept running into rate limit errors, but the results I was getting were very positive and close to what I was looking for.

On the downside, running this for two days hit the Tokens Per Minute limits and cost almost $7.

That is enough for this experiment.
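A common way to cope with rate limit errors is to retry with exponential backoff. The sketch below is not what this notebook ran; `call_with_backoff` is a hypothetical helper, and the delays and attempt count are arbitrary assumptions. It is demonstrated with a stand-in function so that it runs without an API key.

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)

# Stand-in for an API call that fails twice before succeeding
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(call_with_backoff(flaky_call, retry_on=(RuntimeError,), base_delay=0.01))
```

In the loop above, `fn` could be a lambda wrapping the `client.chat.completions.create(...)` call, with `retry_on` narrowed to the rate limit exception of the installed `openai` version (`openai.RateLimitError` in the 1.x client, as far as I know).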

### Save vital information to new file

In [19]:
file_path = f'{file_directory}/{base_file_name}-trending-reasons.json'

# Save to JSON file (overwrite if it already exists)
with open(file_path, 'w', encoding='utf-8') as file:
    json.dump(articles_with_reasons, file, indent=4, ensure_ascii=False)

print(f'articles_with_reasons saved to {file_path}')

articles_with_reasons saved to data/2024-12-29-trending-reasons.json


#### Build HTML Page to display the top 10 list complete with thumbnails and the reason generated by ChatGPT

In [36]:
# Start building the HTML
html_title = f"<h1>Wikipedia's most viewed articles on {date_to_query.strftime('%B %d, %Y')}</h1>"
html_list = "<ol>\n"

# Iterate through the data
for item in articles_with_reasons:
    title = item['title']
    link = item['link']
    thumbnail = item['thumbnail']
    trendingreason = item['trendingreason']
    views = item['views']
    extract = item['extract']

    # Handle null thumbnail
    if thumbnail:
        thumbnail_html = f'<img src="{thumbnail}" alt="Thumbnail for {title}"/><br>'
    else:
        thumbnail_html = '<p><em>No thumbnail available</em></p>'
    
    # Create a list item for each entry
    html_list += f"""
    <li>
        <h2>
          <a href="{link}" target="_blank">{title}</a><br>
        </h2>
        {thumbnail_html}
        <strong>Views:</strong> {views}<br><br>
        <strong>Reason for Trending:</strong> {trendingreason}
    </li>\n
    """

# Close the HTML list
html_list += "</ol>"

html_page = html_title + html_list
# Save to html file (overwrite if it already exists)
file_path = f'{file_directory}/{base_file_name}.html'

with open(file_path, 'w', encoding='utf-8') as file:
    file.write(html_page)

# Display the HTML in the notebook
from IPython.display import display, HTML
display(HTML(html_page))
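The cell above interpolates titles, links, and model output straight into the markup, so a value containing `&`, `<`, or `"` could break the page. A possible hardening step is to escape each value with the standard library's `html.escape` first. This is a sketch only: `render_item` is a hypothetical helper, it omits the thumbnail for brevity, and the sample values are made up for illustration.

```python
import html

def render_item(title: str, link: str, views: int, trendingreason: str) -> str:
    """Render one <li>, escaping every value that comes from the API or the model."""
    return (
        "<li>"
        f'<h2><a href="{html.escape(link, quote=True)}" target="_blank">{html.escape(title)}</a></h2>'
        f"<strong>Views:</strong> {views}<br><br>"
        f"<strong>Reason for Trending:</strong> {html.escape(trendingreason)}"
        "</li>"
    )

# Illustrative values, not taken from the feed
print(render_item("AT&T", "https://en.wikipedia.org/wiki/AT%26T", 1234, "Mentions <b> & other markup"))
```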