# Trending Wikipedia

1. Get trending Wikipedia articles from yesterday
2. Pass plain text from each article to OpenAI for suggestions as to why it is trending
3. Build HTML page to display each article and why it is trending

#### But this time with Phoenix to trace it and iterate!

In [9]:
# !pip install requests python-dotenv openai arize-phoenix openinference-instrumentation-openai

In [6]:
from dotenv import load_dotenv
import os

load_dotenv()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]



# The Phoenix part

1. Launch Phoenix
2. Register Phoenix as the trace provider
3. Initialize the OpenAI Instrumentor with the Phoenix trace provider

Once these steps are finished, all of our OpenAI calls will be recorded.

#### Launch Phoenix
This will launch Phoenix locally. We will connect to it later in the notebook.

Note: This cell will give benign errors if you run it twice.

In [11]:
import phoenix as px
px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x28daf2f00>

#### Connect OpenAI Instrumentor to Phoenix

In [None]:
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(
  project_name="trending-wikipedia-phoenix-tracing", # Default is 'default'
)

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: trending-wikipedia-phoenix-tracing
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



### Now each time we use OpenAI, it will be logged in our Phoenix application

The featured feed API call returns different data for today than it does for previous days.

The code below is written for previous days only.

In [46]:
import requests
import datetime

today = datetime.datetime.now()
yesterday = today - datetime.timedelta(days=1)

date_to_query = yesterday
url = 'https://api.wikimedia.org/feed/v1/wikipedia/en/featured/' + date_to_query.strftime('%Y/%m/%d')


response = requests.get(url)
featured_feed = response.json()
print(f"API call: {url}")
print(f"Retrieved Wikipedia top article statistics for {date_to_query}")

API call: https://api.wikimedia.org/feed/v1/wikipedia/en/featured/2025/01/04
Retrieved Wikipedia top article statistics for 2025-01-04 11:38:18.018141
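
The date handling above could be pulled into a small helper, which makes it easier to query other past days. This is a minimal sketch; `build_featured_url` and `FEED_BASE` are hypothetical names, not part of the original notebook.

```python
import datetime

# Base URL copied from the cell above
FEED_BASE = "https://api.wikimedia.org/feed/v1/wikipedia/en/featured/"

def build_featured_url(day: datetime.date) -> str:
    """Return the featured-feed URL for a given day (zero-padded YYYY/MM/DD)."""
    return FEED_BASE + day.strftime("%Y/%m/%d")

# Example: yesterday's feed, matching the approach used in this notebook
example_day = datetime.date.today() - datetime.timedelta(days=1)
print(build_featured_url(example_day))
```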


### Save API response to file

In [47]:
import os
import json

# Ensure the 'data' folder exists
file_directory = "data"
os.makedirs(file_directory, exist_ok=True)

# Define the filename based on the date
base_file_name = date_to_query.strftime('%Y-%m-%d')
file_path = f'{file_directory}/{base_file_name}.json'

# Save to JSON file (overwrite if it already exists)
with open(file_path, 'w', encoding='utf-8') as file:
    json.dump(featured_feed, file, indent=4, ensure_ascii=False)

print(f'Saved Wikipedia response to {file_path}')

Saved Wikipedia response to data/2025-01-04.json
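
With the response on disk, later runs could read it back instead of calling the API again. The following is a minimal sketch, assuming a hypothetical helper `load_or_fetch` that takes the fetch step as a callable; it is not part of the original notebook.

```python
import json
import os

def load_or_fetch(file_path, fetch):
    """Return JSON from file_path if it exists; otherwise call fetch() and save the result."""
    if os.path.exists(file_path):
        with open(file_path, "r", encoding="utf-8") as file:
            return json.load(file)

    data = fetch()  # only reached when no saved copy exists
    os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4, ensure_ascii=False)
    return data
```

In this notebook, `fetch` could be something like `lambda: requests.get(url).json()`, with `file_path` built from the date as above.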


### Make `article_list` with only relevant information

In [48]:
article_list = []

for item in featured_feed['mostread']['articles'][:20]:
    title = item['title']
    views = item['views']
    link = item['content_urls']['desktop']['page']
    extract = item['extract']
    thumbnail = item.get('thumbnail', {}).get('source', None)

    print(f"Getting full text of {title} article")


    # Download raw text of article
    url = f"https://en.wikipedia.org/w/index.php?title={title}&action=raw"
    print(url)

    article_text = requests.get(url).text
    article_text_truncated = article_text[:5000]

    
    
    article={
        'title': title,
        'views': views,
        'link': link,
        'thumbnail': thumbnail,
        'extract': extract,
        'text': article_text_truncated,
        'trendingreason': ''
    }

    article_list.append(article)

print(article_list)

Getting full text of Squid_Game_season_2 article
https://en.wikipedia.org/w/index.php?title=Squid_Game_season_2&action=raw
Getting full text of Luke_Littler article
https://en.wikipedia.org/w/index.php?title=Luke_Littler&action=raw
Getting full text of Marcus_Freeman article
https://en.wikipedia.org/w/index.php?title=Marcus_Freeman&action=raw
Getting full text of Nosferatu_(2024_film) article
https://en.wikipedia.org/w/index.php?title=Nosferatu_(2024_film)&action=raw
Getting full text of Squid_Game article
https://en.wikipedia.org/w/index.php?title=Squid_Game&action=raw
Getting full text of Human_metapneumovirus article
https://en.wikipedia.org/w/index.php?title=Human_metapneumovirus&action=raw
Getting full text of Pushpa_2:_The_Rule article
https://en.wikipedia.org/w/index.php?title=Pushpa_2:_The_Rule&action=raw
Getting full text of Wayne_Osmond article
https://en.wikipedia.org/w/index.php?title=Wayne_Osmond&action=raw
Getting full text of Avicii article
https://en.wikipedia.org/w/ind
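
The raw-text URL above interpolates the title directly, which works for these titles but could break for titles containing characters such as `&` or `?`. A minimal sketch of a safer approach using the standard library follows; `raw_article_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def raw_article_url(title: str) -> str:
    """Build the action=raw URL for a title, percent-encoding special characters."""
    query = urlencode({"title": title, "action": "raw"})
    return f"https://en.wikipedia.org/w/index.php?{query}"

print(raw_article_url("Pushpa_2:_The_Rule"))
```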

### Run through the article list and prompt ChatGPT to decipher reasons for trending based on the information provided

In [49]:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

for article in article_list:
    title = article['title']
    text = article['text']

    # Log after assigning title so the message names the article being analyzed
    print(f"Analyzing {title}")

    # Format the date for the prompt (e.g. "January 04, 2025")
    prompt = f"Act as a professional news summarizer. Based on your knowledge of {title} and the following extract. In 1 concise and confident sentence, explain why the {title} article might be trending on Wikipedia on {date_to_query.strftime('%B %d, %Y')}:\n\n{text}"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=1,
        max_tokens=512,
        top_p=1
    )
    print(f"trendingreason: {response.choices[0].message.content}")

    article['trendingreason'] = response.choices[0].message.content


Analyzing Squid_Game_season_2
trendingreason: The article on Squid Game Season 2 is likely trending on Wikipedia due to its recent release on December 26, 2024, generating significant viewer interest and discussion about its continuation shortly before the anticipated third and final season.
Analyzing Luke_Littler
trendingreason: Luke Littler is trending on Wikipedia due to his historic achievement as the youngest PDC World Darts Champion, winning the title at just 17 years and 347 days old.
Analyzing Marcus_Freeman
trendingreason: The article about Marcus Freeman might be trending on Wikipedia on January 4, 2025, due to his recognition as the Bobby Dodd Coach of the Year in 2024, signaling his prominent impact and achievements as the head coach of the Notre Dame Fighting Irish football team.
Analyzing Nosferatu_(2024_film)
trendingreason: The article about the Nosferatu (2024 film) is trending on Wikipedia due to its recent theatrical release, critical acclaim, and imp

I kept running into rate limit errors, but the results I was getting were very positive and close to what I was looking for.

On the downside, running this for two days hit the Tokens Per Minute limits and cost almost $7.

That is enough for this experiment.
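
One common way to soften rate limit errors is to retry with exponential backoff. The following is a minimal sketch, not what the notebook actually did; `call_with_backoff` and its parameters are hypothetical names chosen for illustration.

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

In this notebook, the `client.chat.completions.create(...)` call could be wrapped in a `lambda` and passed as `fn`, with `retry_on=(openai.RateLimitError,)` so that only rate limit errors are retried. Backoff helps with per-minute limits but does not reduce the overall cost.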

### Save vital information to new file

In [50]:
file_path = f'{file_directory}/{base_file_name}-trending-reasons.json'

# Save to JSON file (overwrite if it already exists)
with open(file_path, 'w', encoding='utf-8') as file:
    json.dump(article_list, file, indent=4, ensure_ascii=False)

print(f'articles_with_reasons saved to {file_path}')

articles_with_reasons saved to data/2025-01-04-trending-reasons.json


#### Build HTML page to display the top articles, complete with thumbnails and the reason generated by ChatGPT

In [51]:
# Start building the HTML
html_title = f"<h1>Wikipedia's most viewed articles on {date_to_query.strftime('%B %d, %Y')}</h1>"
html_list = "<ol>\n"

# Iterate through the data
for item in article_list:
    title = item['title']
    link = item['link']
    thumbnail = item['thumbnail']
    trendingreason = item['trendingreason']
    views = item['views']
    extract = item['extract']

    # Handle null thumbnail
    if thumbnail:
        thumbnail_html = f'<img src="{thumbnail}" alt="Thumbnail for {title}"/><br>'
    else:
        thumbnail_html = '<p><em>No thumbnail available</em></p>'
    
    # Create a list item for each entry
    html_list += f"""
    <li>
        <h2>
          <a href="{link}" target="_blank">{title}</a><br>
        </h2>
        {thumbnail_html}
        <strong>Views:</strong> {views}<br><br>
        <strong>Reason for Trending:</strong> {trendingreason}
    </li>\n
    """

# Close the HTML list
html_list += "</ol>"

html_page = html_title + html_list
# Save to html file (overwrite if it already exists)
file_path = f'{file_directory}/{base_file_name}.html'

with open(file_path, 'w', encoding='utf-8') as file:
    file.write(html_page)

# Display the HTML in the notebook
from IPython.display import display, HTML
display(HTML(html_page))
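
The page above inserts titles and model output into the HTML as-is. If any of that text contains characters like `<` or `&`, the markup could break. A minimal sketch of escaping with the standard library follows; `render_reason` is a hypothetical helper, not part of the original notebook.

```python
import html

def render_reason(title: str, reason: str) -> str:
    """Return an HTML fragment with the title and model-generated reason escaped."""
    safe_title = html.escape(title)
    safe_reason = html.escape(reason)
    return f"<h2>{safe_title}</h2>\n<strong>Reason for Trending:</strong> {safe_reason}"

print(render_reason("Tom & Jerry", "Trending after a <b>new</b> release."))
```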

# What I've learned playing around with Phoenix

The traces visible in Phoenix make it much easier to inspect the entire prompt and response than print statements in the notebook do. I see it as a magnifying glass and a complete historical record.

- For Jimmy Carter's article, the model needed the first 500 characters of the article text to pick up the general detail of his recent death in December 2024, and 2000 characters to pick up the full date of death
- Dropping `max_tokens` from 2048 to 512 made no discernible difference
- Changing the prompt from "In 1-2 sentences, explain why the..." to "In 1 concise and confident sentence, explain why..." gave results I prefer
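
One way to repeat the truncation experiment from the first bullet is to cut the same article to several lengths and send each cut through the traced prompt, so the runs can be compared side by side in Phoenix. A minimal sketch; `truncation_variants` and the chosen lengths are illustrative assumptions.

```python
def truncation_variants(text: str, lengths=(500, 2000, 5000)) -> dict:
    """Map each length to the article text cut to that many characters."""
    return {n: text[:n] for n in lengths}

sample_text = "x" * 3000  # stand-in for a downloaded article
for n, snippet in truncation_variants(sample_text).items():
    print(n, len(snippet))
```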

