# Trending Wikipedia articles using LangChain Memory to add context

When I ran the trending analysis in my previous notebooks, I saw articles that were related to each other, but one of them had no new information in the portion of the article I was passing to the model.

For instance, Jeff Baena passed away. His article trended and his recent death was correctly identified as the reason. His wife, Aubrey Plaza, had her article trending as well. But the reason for her trending article was vague and did not recognize her husband's death as the reason.

I am testing out LangChain's memory feature to try to solve this problem.

This current iteration filters out the articles that have already been trending. This makes for a more interesting list. I've also implemented a master HTML file that keeps track of every time this is run.

## Takeaways
- It's important to distinguish which problems are better solved with a function vs. an LLM. My thought right now is that the more structured the data is, the more likely a static function is the answer.

In [None]:
#!pip install langchain langchain-openai langchain-community openai

In [None]:
from dotenv import load_dotenv
import os

load_dotenv()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

In [None]:
import json
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder

### Get trending Wikipedia articles

In [None]:
import requests
import datetime

today = datetime.datetime.now()
yesterday = today - datetime.timedelta(days=1)

date_to_query = yesterday
url = 'https://api.wikimedia.org/feed/v1/wikipedia/en/featured/' + date_to_query.strftime('%Y/%m/%d')


print(f"API call: {url}")
response = requests.get(url)
response.raise_for_status()  # stop here if the request failed
featured_feed = response.json()

### Save to file

In [None]:
import os
import json

# Ensure the 'data' folder exists
file_directory = "data"
os.makedirs(file_directory, exist_ok=True)

# Define the filename based on the date
base_file_name = date_to_query.strftime('%Y-%m-%d')
file_path = f'{file_directory}/{base_file_name}.json'

# Save to JSON file (overwrite if it already exists)
with open(file_path, 'w', encoding='utf-8') as file:
    json.dump(featured_feed, file, indent=4, ensure_ascii=False)

print(f'Saved Wikipedia response to {file_path}')

### Build data structure with all relevant information and placeholders for LLM responses

In [None]:
article_list = []


for item in featured_feed['mostread']['articles']:
    title = item['title']
    views = item['views']
    link = item['content_urls']['desktop']['page']
    extract = item['extract']
    thumbnail = item.get('thumbnail', {}).get('source', None)
    view_history = item['view_history']

    article={
        'title': title,
        'views': views,
        'link': link,
        'thumbnail': thumbnail,
        'extract': extract,
        'text': '',  # filled in later, once the article text has been downloaded
        'trendingreason': '',
        'memorycontext': '',
        'view_history': view_history,
        'is_newly_trending': ''
    }

    article_list.append(article)

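### Filter out the already trending articles

- I first tried asking the LLM to decide whether an article was newly trending (the commented-out cell below). The results were erratic, so I went with the function further down instead.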

In [None]:
# # Prepare the LangChain components
# chat_model = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # Set temperature to 0 for deterministic results

# conversation_chain = ConversationChain(llm=chat_model, verbose=True)

# def is_newly_trending(view_history):
#     # Format view history as a string for the prompt
#     formatted_history = "\n".join(
#         [f"- {entry['date']}: {entry['views']} views" for entry in view_history]
#     )
    
#     # Construct the full input question
#     prompt = f"""
#     Given the following view history data:
#     {formatted_history}
    
#     A data point is considered to be trending if the final day shows a meaningful spike.
    
#     A "meaningful spike" means the views on the final day provided have 
#     increased significantly relative to the previous day.
    
#     Respond with only "true" or "false" without any additional explanation or text.
#     """

#     # Run the chain
#     response = conversation_chain.predict(input=prompt)
#     print(f"CODYBUG: response: {response}")
#     return response.strip().lower() == "true"

## Determine if the article is newly trending. If it is, add to new list

- If the article's views increased by more than a factor of 50 from the previous day (the `0.02` multiplier in the code below), I'm calling it newly trending



In [None]:
def is_newly_trending(view_history):
    # Need at least two days of data to compare
    if len(view_history) < 2:
        return False

    yesterdays_views = view_history[-2]['views']
    todays_views = view_history[-1]['views']

    # True when the latest day's views are more than 50x the previous day's
    return todays_views*0.02 > yesterdays_views

newly_trending_article_list = []

for article in article_list:
    newly_trending = is_newly_trending(article['view_history'])
    article['is_newly_trending'] = newly_trending
    
    if newly_trending:
        newly_trending_article_list.append(article)
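
To make the threshold easy to check by hand, here is a minimal sketch of the same day-over-day comparison with the ratio pulled out as a parameter. The name `spiked`, the `min_ratio` parameter, and the sample view counts are all made up for illustration; the entries only assume the `views` key used above.

```python
def spiked(view_history, min_ratio):
    """Return True when the last day's views exceed min_ratio times the previous day's."""
    if len(view_history) < 2:
        return False  # not enough data to compare two days
    previous_views = view_history[-2]["views"]
    latest_views = view_history[-1]["views"]
    return latest_views > min_ratio * previous_views


# Hypothetical view histories (not real Wikipedia data)
steady = [{"views": 40_000}, {"views": 45_000}]
moderate = [{"views": 1_000}, {"views": 8_000}]
sudden = [{"views": 1_000}, {"views": 60_000}]

for name, history in [("steady", steady), ("moderate", moderate), ("sudden", sudden)]:
    print(name, spiked(history, min_ratio=5), spiked(history, min_ratio=50))
```

With these numbers, the `moderate` history passes a ratio of 5 but not a ratio of 50, which shows how strongly the choice of ratio shapes the final list.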
  

In [None]:
for article in newly_trending_article_list:
    print(article['title'])
    print(article['is_newly_trending'])
    print(article['view_history'])
    print("")

### Get first 5000 characters of article

In [None]:
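for article in newly_trending_article_list:
    # Download the raw wikitext of the article. Passing the title via params
    # lets requests URL-encode characters such as & or ?
    response = requests.get(
        "https://en.wikipedia.org/w/index.php",
        params={"title": article['title'], "action": "raw"},
    )
    print(response.url)

    # Keep only the first 5000 characters
    article_text = response.text
    article_text_truncated = article_text[:5000]
    article['text'] = article_text_truncated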
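
Article titles can contain characters such as `&`, `?` or `/` that change the meaning of a URL when they are dropped into it unescaped. As a rough sketch of one way to guard against that, the helper below builds the `action=raw` URL with the standard library's `urlencode`. The function name `raw_article_url` and the sample titles are assumptions for illustration only.

```python
from urllib.parse import urlencode


def raw_article_url(title):
    """Build the action=raw URL for an article, percent-encoding the title."""
    query = urlencode({"title": title, "action": "raw"})
    return f"https://en.wikipedia.org/w/index.php?{query}"


# Sample titles chosen only because they contain special characters
print(raw_article_url("AC/DC"))
print(raw_article_url("Q&A (film)"))
```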

### Creating conversation chain with memory

In [None]:
trending_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful Wikipedia analyst and historian. You speak concisely and given the choice to say too much or too little, you say too little"),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

memory = ConversationBufferMemory(return_messages=True)

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.7,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

trending_conversation = ConversationChain(
    llm=llm,
    memory=memory,
    prompt=trending_prompt,
    verbose=True
)

#### Loop through all articles in data structure
- Use LangChain/ChatGPT to give suggestions why each one is trending
- Save reason to structure

In [None]:

for article in newly_trending_article_list:
    title = article['title']
    text = article['text']
    print(f"Analyzing {title}")

    prediction_prompt = f"Act as a professional news summarizer. Based on your knowledge of {title} and the following extract, explain in 1 concise and confident sentence why the {title} article might be trending on Wikipedia on {date_to_query.strftime('%B %d, %Y')}:\n\n{text}"

    response = trending_conversation.predict(input=prediction_prompt)
    print("trendingreason:", response)
    
    article['trendingreason'] =  response

#### Use conversation memory to derive more context

- Pass the memory from the first conversation into a new conversation
- Search for cross-article context among today's articles

In [None]:
memory_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful Wikipedia historian."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

# todays_memory = load_memory()
memory_conversation = ConversationChain(
    llm=llm,
    memory=trending_conversation.memory,
    prompt=memory_prompt,
    verbose=True
)

for article in newly_trending_article_list:
    title = article['title']
    print(f"Analyzing {title}")

    # Named relation_prompt so it does not shadow the memory_prompt template above
    relation_prompt = f"Does {title} relate to any other trending article? If yes, tell me why in 1 or 2 sentences."

    response = memory_conversation.predict(input=relation_prompt)
    print("memorycontext:", response)
    
    article['memorycontext'] =  response
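
The second pass can only find cross-article links because both chains read from the same message buffer, so every question and answer from the first pass is replayed as history. The sketch below imitates that idea in plain Python. `SharedBuffer`, `fake_llm`, and `ask` are hypothetical stand-ins written for this illustration and are not LangChain classes or functions.

```python
class SharedBuffer:
    """Toy stand-in for a conversation buffer: an ordered list of (role, text)."""

    def __init__(self):
        self.messages = []

    def add_exchange(self, human_text, ai_text):
        self.messages.append(("human", human_text))
        self.messages.append(("ai", ai_text))


def fake_llm(history, question):
    """Pretend model: reports how many earlier messages it was shown."""
    return f"(saw {len(history)} earlier messages) {question}"


def ask(buffer, question):
    # The full history is passed on every call, then the new exchange is stored.
    answer = fake_llm(list(buffer.messages), question)
    buffer.add_exchange(question, answer)
    return answer


buffer = SharedBuffer()

# First pass: one question per article (placeholder titles)
for title in ["Article_A", "Article_B"]:
    ask(buffer, f"Why is {title} trending?")

# Second pass reuses the same buffer, so it sees all four earlier messages
print(ask(buffer, "Does Article_A relate to any other trending article?"))
```

One consequence worth keeping in mind: a buffer like this grows with every exchange, so when each question carries a long extract, the history sent along with later questions gets large.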

#### Build HTML page to display the top 10 list, complete with:
- title
- thumbnail
- trending reason
- relation to other articles

In [None]:
# Start building the HTML
html_title = f"<h1>Newly Trending on {date_to_query.strftime('%B %d, %Y')}</h1>"
html_list = "<ol>\n"

# Iterate through the data
for item in newly_trending_article_list:
    title = item['title']
    link = item['link']
    thumbnail = item['thumbnail']
    trendingreason = item['trendingreason']
    
    memorycontext = item['memorycontext']
    extract = item['extract']

    # Handle null thumbnail
    if thumbnail:
        thumbnail_html = f'<img src="{thumbnail}" alt="Thumbnail for {title}"/><br>'
    else:
        thumbnail_html = '<p><em>No thumbnail available</em></p>'
    
    # Create a list item for each entry
    html_list += f"""
    <li>
        <h2>
          <a href="{link}" target="_blank">{title}</a><br>
        </h2>
        {thumbnail_html}
        <strong>Views:</strong> {item['views']}<br><br>
        <strong>Reason for Trending:</strong> {trendingreason}<br><br>
        <strong>Relation to other trending articles:</strong> {memorycontext}
        
    </li>\n
    """

# Close the HTML list
html_list += "</ol>"

html_page = html_title + html_list
# Save to html file (overwrite if it already exists)
file_path = f'{file_directory}/{base_file_name}.html'

with open(file_path, 'w', encoding='utf-8') as file:
    file.write(html_page)

# Prepend to the master file
master_file_path = f'{file_directory}/master.html'

# Read the existing content of the master file if it exists
if os.path.exists(master_file_path):
    with open(master_file_path, 'r', encoding='utf-8') as master_file:
        master_content = master_file.read()
else:
    master_content = ''

# Combine the new content with the old master content
updated_master_content = html_page + '\n' + master_content

# Save the updated content back to the master file
with open(master_file_path, 'w', encoding='utf-8') as master_file:
    master_file.write(updated_master_content)

# Display the HTML in the notebook (assuming Jupyter or similar)
from IPython.display import display, HTML
display(HTML(updated_master_content))
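
Because the trending reason and relation text come straight from the model, they may contain characters like `<` or `&` that would be read as markup once placed in the page. A minimal sketch of one way to handle this is to escape each value with the standard library's `html.escape` before building the list item. The helper name `render_item` and the sample values are assumptions for illustration, and the markup is a trimmed-down version of the list item above.

```python
import html


def render_item(title, link, reason):
    """Render one list item, escaping text so stray <, > or & cannot break the markup."""
    safe_title = html.escape(title)
    safe_link = html.escape(link, quote=True)
    safe_reason = html.escape(reason)
    return (
        "<li>"
        f'<h2><a href="{safe_link}" target="_blank">{safe_title}</a></h2>'
        f"<strong>Reason for Trending:</strong> {safe_reason}"
        "</li>"
    )


# Sample values chosen to include characters that need escaping
print(render_item("Q&A", "https://example.org/x?a=1&b=2", "Views rose <sharply> & suddenly."))
```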