AInewsbot.ipynb

- Automate collecting daily AI news
- Open URLs of news sites specified in `sources` dict (sources.yaml) using Selenium and Firefox
- Save HTML of each URL in htmldata directory
- Extract URLs from all files, create a pandas dataframe with url, title, src
- Use ChatGPT to filter only AI-related headlines by sending a prompt and formatted table of headlines
- Use SQLite to filter headlines previously seen 
- OPENAI_API_KEY should be in the environment or in a .env file
  
Alternative manual workflow to get HTML files if necessary
- Use Chrome, open e.g. Tech News bookmark folder, right-click and open all bookmarks in new window
- On Google News, make sure to switch to the AI tab
- On Google News, Feedly, Reddit, scroll to additional pages as desired
- Use SingleFile extension, 'save all tabs'
- Move files to htmldata directory
- Run lower part of notebook to process the data


1. initialize
2. fetch web pages
3. parse news story urls from web pages
4. filter headlines by relevance, not previously seen
5. perform topic analysis on headlines and order them by topic
6. summarize individual pages as bullet points
7. from bullet points, extract top 10 most common themes and stories of the day in order of importance
8. topic analysis of bullet points, categorize bullet points as belonging to particular themes
9. for each theme, make a summary and links. Here we want to iterate to improve summaries per specific criteria.
10. combine themes and send.

In [1]:
# import sys
# del sys.modules['ainb_const']


In [2]:
from datetime import datetime
import os
import yaml
import dotenv
import sqlite3
import unicodedata
import json
import pickle
from collections import Counter

import numpy as np
import pandas as pd
import umap
# import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN

# import bs4
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, urlparse

import multiprocessing
from concurrent.futures import ThreadPoolExecutor, as_completed
import asyncio
import aiohttp

from IPython.display import HTML, Image, Markdown, display
import markdown

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

from openai import OpenAI

from ainb_const import (DOWNLOAD_DIR, LOWCOST_MODEL, MODEL, CANONICAL_TOPICS,
                        SOURCECONFIG, FILTER_PROMPT, TOPIC_PROMPT,
                        SUMMARIZE_SYSTEM_PROMPT, SUMMARIZE_USER_PROMPT, FINAL_SUMMARY_PROMPT, TOP_CATEGORIES_PROMPT,
                        MAX_INPUT_TOKENS, MAX_OUTPUT_TOKENS, MAX_RETRIES, TEMPERATURE)
from ainb_utilities import (log, delete_files, filter_unseen_urls_db, insert_article, 
                            nearest_neighbor_sort, agglomerative_cluster_sort, traveling_salesman_sort_scipy,
                            unicode_to_ascii, send_gmail)
from ainb_webscrape import (get_driver, quit_drivers, launch_drivers, get_file, get_url, parse_file, 
                            get_og_tags, get_path_from_url, trimmed_href, process_source_queue_factory, 
                            process_url_queue_factory, DRIVERS)
from ainb_llm import paginate_df, process_pages, fetch_pages, fetch_openai, fetch_all_summaries, fetch_openai_summary, trunc_tokens


# need this to run async in jupyter since it already has an asyncio event loop running
import nest_asyncio
nest_asyncio.apply()


# Initialize

In [3]:
# OpenAI API module
client = OpenAI()

# Or can use REST API directly
API_URL = 'https://api.openai.com/v1/chat/completions'

headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {os.getenv("OPENAI_API_KEY")}',
}


In [4]:
#  load sources to scrape from sources.yaml
with open(SOURCECONFIG, "r") as stream:
    try:
        sources = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

log(f"Load {len(sources)} sources from {SOURCECONFIG}")

# make a reverse dict to map output file titles to source names
sources_reverse = {}
for k, v in sources.items():
    log(f"{k} -> {v['url']} -> {v['title']}.html")
    v['sourcename'] = k
    # map filename (title) to source name
    sources_reverse[v['title']] = k

log(f"Mapped {len(sources_reverse)} source page titles to sources")


2024-07-09 15:08:55,829 - AInewsbot - INFO - Load 17 sources from sources.yaml
2024-07-09 15:08:55,830 - AInewsbot - INFO - Ars Technica -> https://arstechnica.com/ -> Ars Technica.html
2024-07-09 15:08:55,830 - AInewsbot - INFO - Bloomberg Tech -> https://www.bloomberg.com/technology -> Bloomberg Technology - Bloomberg.html
2024-07-09 15:08:55,831 - AInewsbot - INFO - Business Insider -> https://www.businessinsider.com/tech -> Tech - Business Insider.html
2024-07-09 15:08:55,831 - AInewsbot - INFO - FT Tech -> https://www.ft.com/technology -> Technology.html
2024-07-09 15:08:55,831 - AInewsbot - INFO - Feedly AI -> https://feedly.com/i/aiFeeds?options=eyJsYXllcnMiOlt7InBhcnRzIjpbeyJpZCI6Im5scC9mL3RvcGljLzMwMDAifV0sInNlYXJjaEhpbnQiOiJ0ZWNobm9sb2d5IiwidHlwZSI6Im1hdGNoZXMiLCJzYWxpZW5jZSI6ImFib3V0In1dLCJidW5kbGVzIjpbeyJ0eXBlIjoic3RyZWFtIiwiaWQiOiJ1c2VyLzYyZWViYjlmLTcxNTEtNGY5YS1hOGM3LTlhNTdiODIwNTMwOC9jYXRlZ29yeS9HYWRnZXRzIn1dfQ -> Discover and Add New Feedly AI Feeds.html
2024-07-09 15:0

20

In [5]:
sources

{'Ars Technica': {'include': ['^https://arstechnica.com/(\\w+)/(\\d+)/(\\d+)/'],
  'title': 'Ars Technica',
  'url': 'https://arstechnica.com/',
  'sourcename': 'Ars Technica'},
 'Bloomberg Tech': {'include': ['^https://www.bloomberg.com/news/'],
  'title': 'Bloomberg Technology - Bloomberg',
  'url': 'https://www.bloomberg.com/technology',
  'sourcename': 'Bloomberg Tech'},
 'Business Insider': {'exclude': ['^https://www.insider.com',
   '^https://www.passionfroot.me'],
  'title': 'Tech - Business Insider',
  'url': 'https://www.businessinsider.com/tech',
  'sourcename': 'Business Insider'},
 'FT Tech': {'include': ['https://www.ft.com/content/'],
  'title': 'Technology',
  'url': 'https://www.ft.com/technology',
  'sourcename': 'FT Tech'},
 'Feedly AI': {'exclude': ['^https://feedly.com',
   '^https://s1.feedly.com',
   '^https://blog.feedly.com'],
  'scroll': 5,
  'initial_sleep': 30,
  'title': 'Discover and Add New Feedly AI Feeds',
  'url': 'https://feedly.com/i/aiFeeds?options=e

In [6]:
sources_reverse


{'Ars Technica': 'Ars Technica',
 'Bloomberg Technology - Bloomberg': 'Bloomberg Tech',
 'Tech - Business Insider': 'Business Insider',
 'Technology': 'FT Tech',
 'Discover and Add New Feedly AI Feeds': 'Feedly AI',
 'Google News - Technology - Artificial intelligence': 'Google News',
 'Hacker News Page 1': 'Hacker News',
 'Hacker News Page 2': 'Hacker News 2',
 'HackerNoon - read, write and learn about any technology': 'HackerNoon',
 'Technology - The New York Times': 'NYT Tech',
 'top scoring links _ multi': 'Reddit',
 'Techmeme': 'Techmeme',
 'The Register_ Enterprise Technology News and Analysis': 'The Register',
 'Artificial Intelligence - The Verge': 'The Verge',
 'AI News _ VentureBeat': 'VentureBeat',
 'Technology - WSJ.com': 'WSJ Tech',
 'Technology - The Washington Post': 'WaPo Tech'}

In [7]:
# determine files already in htmldata directory
# List all paths in the directory matching today's date
nfiles = 50
files = [os.path.join(DOWNLOAD_DIR, file)
         for file in os.listdir(DOWNLOAD_DIR)]
# Get the current date
today = datetime.now()
year, month, day = today.year, today.month, today.day
datestr = datetime.now().strftime("%m_%d_%Y")

# filter files only
files = [file for file in files if os.path.isfile(file)]

# Sort files by modification time and take top 50
files.sort(key=lambda x: os.path.getmtime(x), reverse=True)
files = files[:nfiles]

# keep only .html files with today's date in the name
files = [
    file for file in files if datestr in file and file.endswith(".html")]
log(len(files))
for file in files:
    log(file)

saved_pages = []
for file in files:
    filename = os.path.basename(file)
    # locate date like '01_14_2024' in filename
    position = filename.find(" (" + datestr)
    basename = filename[:position]
    # match to source name
    sourcename = sources_reverse.get(basename)
    if sourcename is None:
        log(f"Skipping {basename}, no sourcename metadata")
        continue
    sources[sourcename]['latest'] = file
    saved_pages.append((sourcename, file))

2024-07-09 15:09:04,191 - AInewsbot - INFO - 17
2024-07-09 15:09:04,193 - AInewsbot - INFO - htmldata/Technology - The Washington Post (07_09_2024 11_21_39 AM).html
2024-07-09 15:09:04,193 - AInewsbot - INFO - htmldata/Technology - WSJ.com (07_09_2024 11_21_34 AM).html
2024-07-09 15:09:04,194 - AInewsbot - INFO - htmldata/AI News _ VentureBeat (07_09_2024 11_21_32 AM).html
2024-07-09 15:09:04,194 - AInewsbot - INFO - htmldata/Discover and Add New Feedly AI Feeds (07_09_2024 11_21_28 AM).html
2024-07-09 15:09:04,195 - AInewsbot - INFO - htmldata/top scoring links _ multi (07_09_2024 11_21_22 AM).html
2024-07-09 15:09:04,195 - AInewsbot - INFO - htmldata/Artificial Intelligence - The Verge (07_09_2024 11_21_21 AM).html
2024-07-09 15:09:04,196 - AInewsbot - INFO - htmldata/The Register_ Enterprise Technology News and Analysis (07_09_2024 11_21_10 AM).html
2024-07-09 15:09:04,196 - AInewsbot - INFO - htmldata/Techmeme (07_09_2024 11_21_00 AM).html
2024-07-09 15:09:04,197 - AInewsbot - INFO
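The cell above recovers the page title by searching the filename for `" (" + datestr`. A regular expression makes the expected pattern explicit and returns `None` for names that do not match, instead of slicing at position -1. This is a minimal sketch, assuming the naming seen in the log output (`<title> (MM_DD_YYYY HH_MM_SS AM).html`); `split_saved_filename` is a hypothetical helper, not part of the notebook's modules.

```python
import re

# Assumed filename pattern, inferred from the log output above:
#   "<page title> (MM_DD_YYYY HH_MM_SS AM|PM).html"
SAVED_PAGE_RE = re.compile(
    r"^(?P<title>.+) \((?P<date>\d{2}_\d{2}_\d{4}) \d{1,2}_\d{2}_\d{2} [AP]M\)\.html$"
)


def split_saved_filename(filename):
    """Return (title, date) for a saved page, or None if the name does not match."""
    match = SAVED_PAGE_RE.match(filename)
    if match is None:
        return None
    return match.group("title"), match.group("date")


example = "Technology - WSJ.com (07_09_2024 11_21_34 AM).html"
parsed = split_saved_filename(example)
```

The returned title can then be looked up in `sources_reverse` exactly as the loop above does with `basename`.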

# Fetch and save source pages

In [None]:
# Fetch HTML files from sources

# empty download directory
delete_files(DOWNLOAD_DIR)

# save each file specified from sources
num_browsers = 3
log(f"Saving HTML files using {num_browsers} browsers")

# Create a queue for multiprocessing and populate it 
queue = multiprocessing.Queue()
for item in sources.values():
    queue.put(item)
    
# Function to take the queue and pop entries off and process until none are left
# lets you create an array of functions with different args
callable = process_source_queue_factory(queue)

saved_pages = launch_drivers(num_browsers, callable)


In [None]:
log(f"Saved {len(saved_pages)} pages")

print(len(saved_pages))
for sourcename, page in saved_pages:
    sources[sourcename]['latest'] = page
    log(f"{sourcename} -> {page}")
    

# Extract news URLs from saved pages

In [8]:
# Parse news URLs and titles from downloaded HTML files
log("Parsing html files")
all_urls = []
for sourcename, filename in saved_pages:
    log(sourcename +' -> ' + filename)
    log(f"{sourcename}", "parse loop")
    links = parse_file(sources[sourcename])
    log(f"{len(links)} links found", "parse loop")
    all_urls.extend(links)

log(f"found {len(all_urls)} links", "parse loop")

# make a pandas dataframe of all the links found
orig_df = (
    pd.DataFrame(all_urls)
    .groupby("url")
    .first()
    .reset_index()
    .sort_values("src")[["src", "title", "url"]]
    .reset_index(drop=True)
    .reset_index(drop=False)
    .rename(columns={"index": "id"})
)
orig_df.head()


2024-07-09 15:09:15,875 - AInewsbot - INFO - Parsing html files
2024-07-09 15:09:15,877 - AInewsbot - INFO - WaPo Tech -> htmldata/Technology - The Washington Post (07_09_2024 11_21_39 AM).html
2024-07-09 15:09:15,878 - AInewsbot - INFO - parse loop - WaPo Tech
2024-07-09 15:09:15,907 - AInewsbot - INFO - parse_file - found 160 raw links
2024-07-09 15:09:15,911 - AInewsbot - INFO - parse_file - found 20 filtered links
2024-07-09 15:09:15,912 - AInewsbot - INFO - parse loop - 20 links found
2024-07-09 15:09:15,912 - AInewsbot - INFO - WSJ Tech -> htmldata/Technology - WSJ.com (07_09_2024 11_21_34 AM).html
2024-07-09 15:09:15,912 - AInewsbot - INFO - parse loop - WSJ Tech
2024-07-09 15:09:15,954 - AInewsbot - INFO - parse_file - found 512 raw links
2024-07-09 15:09:15,961 - AInewsbot - INFO - parse_file - found 8 filtered links
2024-07-09 15:09:15,961 - AInewsbot - INFO - parse loop - 8 links found
2024-07-09 15:09:15,961 - AInewsbot - INFO - VentureBeat -> htmldata/AI News _ VentureBeat

Unnamed: 0,id,src,title,url
0,0,Ars Technica,What we know about microdosing candy illnesses...,https://arstechnica.com/science/2024/07/author...
1,1,Ars Technica,Why 1994’sLair of Squidwas the weirdest pack-i...,https://arstechnica.com/google/2024/07/how-i-f...
2,2,Ars Technica,Paul Sutter walks us through the future of cli...,https://arstechnica.com/science/2022/04/paul-s...
3,3,Ars Technica,Alaska’s top-heavy glaciers are approaching an...,https://arstechnica.com/science/2024/07/alaska...
4,4,Ars Technica,Egalitarian oddity found in the Neolithic,https://arstechnica.com/science/2024/07/egalit...


In [None]:
# # extracts all links from history where isAI=1
# # useful for training dimensionality reduction
# conn = sqlite3.connect('articles.db')
# c = conn.cursor()
# #  and timestamp > '2024-07-01' 
# query = "select * from news_articles where isAI=1 order by id"
# ai_history_df = pd.read_sql_query(query, conn)
# ai_history_df

In [None]:
# # clean up sqlite database if you want to rerun the job from a given point
# conn.execute(f"delete from news_articles where timestamp > '2024-07-08 19:15'")
# # conn.execute(f"delete from news_articles where id > 220230")
# # Committing the changes
# conn.commit()

# # Close the connection
# conn.close()


# Filter URLs to new AI headlines only

In [19]:
# filter urls we've already seen in previous runs and saved in SQLite
filtered_df = filter_unseen_urls_db(orig_df, before_date='2024-07-09 06:00:00')
len(filtered_df)

2024-07-09 15:18:14,142 - AInewsbot - INFO - Existing URLs: 125015
2024-07-09 15:18:14,170 - AInewsbot - INFO - New URLs: 576


576
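The filtering step compares scraped URLs against those already stored in SQLite. The internals of `filter_unseen_urls_db` are not shown in this notebook, so the following is only a sketch of the idea. It assumes a `news_articles` table with a `url` column (the table name appears in the commented-out query above; the rest of the schema is an assumption), uses an in-memory database, and ignores the `before_date` argument. `filter_unseen_urls` is a hypothetical name.

```python
import sqlite3


def filter_unseen_urls(conn, candidates):
    """Return the candidate URLs that are not yet stored in news_articles."""
    seen = {row[0] for row in conn.execute("SELECT url FROM news_articles")}
    return [url for url in candidates if url not in seen]


# In-memory database with an assumed minimal schema, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news_articles (id INTEGER PRIMARY KEY, url TEXT UNIQUE)")
conn.execute("INSERT INTO news_articles (url) VALUES (?)", ("https://example.com/old",))
conn.commit()

new_urls = filter_unseen_urls(
    conn, ["https://example.com/old", "https://example.com/new"]
)
conn.close()
```

Loading the seen URLs into a set keeps each membership check cheap, which matters with a history of over 100,000 URLs.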

In [None]:
# use chatgpt to filter AI-related headlines using a prompt to OpenAI
print(FILTER_PROMPT)


In [None]:
# make pages that fit in a reasonably sized (MAXPAGELEN or MAX_INPUT_TOKENS) prompt
pages = paginate_df(filtered_df)
log(f"Paginated {len(pages)} pages")
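Pagination splits the headlines into chunks small enough to fit in one prompt. How `paginate_df` measures size is not shown here; the sketch below assumes a simple row-count limit and plain dicts rather than a dataframe. `paginate_rows` and `max_rows` are hypothetical.

```python
def paginate_rows(rows, max_rows=50):
    """Split a list of row dicts into consecutive pages of at most max_rows rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    return [rows[start:start + max_rows] for start in range(0, len(rows), max_rows)]


# 120 placeholder headlines split into pages of at most 50 rows
rows = [{"id": i, "title": f"headline {i}"} for i in range(120)]
pages = paginate_rows(rows, max_rows=50)
page_sizes = [len(page) for page in pages]
```

A token-based limit would follow the same shape, accumulating rows until an estimated token count is reached.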


In [None]:
# use REST API directly. OpenAI python API doesn't support concurrent requests from a single client
# this runs fast with async aiohttp and on gpt-3.5 (15 seconds vs 2 minutes synchronously with gpt-4o)
# the old API supported submitting multiple payloads in a single completion request
# current API supports a slow 'batch' submission https://platform.openai.com/docs/guides/rate-limits/usage-tiers
# there is a more complex example here - https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py

log("start classify")
enriched_urls = asyncio.run(fetch_pages(pages, prompt=FILTER_PROMPT))
log("end classify")

enriched_df = pd.DataFrame(enriched_urls)
print(len(enriched_df))
log(f"isAI: {len(enriched_df.loc[enriched_df['isAI']])}")
log(f"not isAI: {len(enriched_df.loc[~enriched_df['isAI']])}")
enriched_df.head()


In [None]:
# merge returned df with isAI column into original df on id column
merged_df = pd.merge(filtered_df, enriched_df, on="id", how="outer")
merged_df['date'] = datetime.now().date()
merged_df.head()


In [None]:
# should be empty, shouldn't get back rows that don't match to existing
log(f"Unmatched response rows: {len(merged_df.loc[merged_df['src'].isna()])}")
# should be empty, should get back all rows from orig
log(f"Unmatched source rows: {len(merged_df.loc[merged_df['isAI'].isna()])}")


In [None]:
# update SQLite database with all seen URLs
conn = sqlite3.connect('articles.db')
cursor = conn.cursor()

for row in merged_df.itertuples():
    insert_article(conn, cursor, row.src, row.title,
                   row.url, row.isAI, row.date)


In [None]:
# keep headlines that are related to AI
AIdf = merged_df.loc[merged_df["isAI"]==1] \
    .reset_index(drop=True)  \
    .reset_index()  \
    .drop(columns=["id"])  \
    .rename(columns={'index': 'id'})

log(f"Found {len(AIdf)} AI headlines")
AIdf

In [None]:
# map title to ascii characters to avoid some dupes with e.g. different quote symbols

def unicode_to_ascii(input_string):
    # Normalize the Unicode string to NFKD form
    normalized_string = unicodedata.normalize('NFKD', input_string)
    
    # Encode to ASCII bytes, ignoring characters that cannot be converted
    ascii_bytes = normalized_string.encode('ascii', 'ignore')
    
    # Convert bytes back to a string
    ascii_string = ascii_bytes.decode('ascii')
    
    return ascii_string

AIdf['title'] = AIdf['title'].apply(unicode_to_ascii)


In [None]:
# dedupe identical headlines
AIdf['title_clean'] = AIdf['title'].map(lambda s: "".join(s.split()))
AIdf = AIdf.sort_values("src") \
    .groupby("title_clean") \
    .first() \
    .reset_index(drop=True) \
    .drop(columns=['id']) \
    .reset_index() \
    .rename(columns={'index': 'id'})

log(f"Found {len(AIdf)} unique AI headlines")
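The two cells above dedupe on a key built by folding the title to ASCII and removing all whitespace. The same idea in plain Python, as a small sketch without pandas; `dedupe_key` and `dedupe_headlines` are hypothetical names and the sample titles are made up.

```python
import unicodedata


def dedupe_key(title):
    """Build a comparison key: ASCII-folded title with all whitespace removed."""
    folded = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    return "".join(folded.split())


def dedupe_headlines(titles):
    """Keep the first headline seen for each key, preserving input order."""
    seen = set()
    unique = []
    for title in titles:
        key = dedupe_key(title)
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique


titles = [
    "OpenAI\u00a0unveils new model",  # contains a non-breaking space
    "OpenAI unveils new model",
    "Nvidia hits record high",
]
unique_titles = dedupe_headlines(titles)
```

Note that ASCII folding drops characters with no ASCII equivalent, so a curly apostrophe is removed rather than converted to a straight one.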


In [None]:
# map google news headlines to redirect, kind of unnecessary
from urllib.parse import urlparse

redirect_dict = {}
for row in AIdf.itertuples():
    if urlparse(row.url).netloc != 'news.google.com':
        continue
    print('news.google.com', end=" -> ")
    # first hop: ask for the redirect target without following it
    response = requests.get(row.url, allow_redirects=False, timeout=10)
    redirect_url = response.headers.get('Location')
    if redirect_url and urlparse(redirect_url).netloc == 'news.google.com':
        # second hop: the first target may itself be a news.google.com URL
        print('news.google.com', end=" -> ")
        response = requests.get(redirect_url, allow_redirects=False, timeout=10)
        redirect_url = response.headers.get('Location')
    # keep the original URL if no Location header came back
    if redirect_url:
        redirect_dict[row.url] = redirect_url
    print(redirect_url)

AIdf['url'] = AIdf['url'].apply(lambda url: redirect_dict.get(url, url))


In [None]:
AIdf

# Topic analysis
Here we are trying to identify the top topics of the day, to help make a nice summary. 

1st approach - do dimensionality reduction on the headline embeddings with UMAP and cluster with DBSCAN.

2nd approach
 - extract topics from headline using a prompt
 - human canonicalizes topics
 - assign headlines to topics using a prompt
 
The final summary is pretty inconsistent; it would be nice to give ChatGPT a prompt that says: summarize these bullet points using this categorization.
 

In [None]:
# attempt to extract top topics 
print(TOPIC_PROMPT)


In [None]:
# get topics
pages = paginate_df(AIdf)

# apply this prompt to AI headlines
log("start topic extraction")
response = asyncio.run(fetch_pages(pages, prompt=TOPIC_PROMPT))
log("end topic extraction")

topic_df = pd.DataFrame(response)
print(len(topic_df))
topic_df.head()


In [None]:
all_topics = [item for row in topic_df.itertuples() for item in row.topics]
item_counts = Counter(all_topics)
for x in item_counts.most_common():
    print(x)
    

In [None]:
# evergreen topics to hopefully map headlines to canonical standardized topics
# review extracted topics and add
CANONICAL_TOPICS = [
    "Policy and regulation",
    "AI economic impacts",
    "Robots",
    "Autonomous vehicles",
    "AI job market",
    "LLMs",
    "Healthcare",
    "Fintech",
    "Education",
    "Entertainment",
    "Startup funding",
    "IPOs",
    "Ethical issues",
    "Legal issues",
    "Cybersecurity",
    "AI doom",
    'Stocks',
    'Climate',
    'Scams',
    'Privacy',
    'Intellectual Property',
    'Code assistants',
    'Customer service',
    'Reinforcement Learning',
    'Open Source',
    'Language Models',
    'China',
    'Military',
    'Semiconductor chips',
    'Sustainability',
    'Agriculture',
    'Gen AI',
    'Testing',
    
    'Nvidia',
    'Google',
    'OpenAI',
    'Meta',
    'Apple',
    'Microsoft',
    'Salesforce',
    'Uber',
    'AMD',
    'Netflix',
    'Disney',
    'Amazon',
    'Cloudflare',
    'Anthropic',
    'Cohere',
    'Baidu',
    'Big Tech',
    'Samsung',
    'Tesla',
    
    'ChatGPT',
    'WhatsApp',
    'Gemini',
    'Claude',
    'Copilot',
    
    'Elon Musk',
    'Bill Gates',
    'Sam Altman',
    'Mustafa Suleyman',
    'Sundar Pichai',
    'Yann LeCun',
    'Geoffrey Hinton',
    'Mark Zuckerberg',
]
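Before the human review, a rough first pass can show which extracted topics already match the canonical list and which are new. This is a minimal sketch, assuming case-insensitive exact matching is good enough for a first look; `canonicalize` and the sample data are hypothetical.

```python
from collections import Counter


def canonicalize(topics, canonical_topics):
    """Map extracted topics onto canonical names (case-insensitive exact match).

    Returns (matched, unmatched): matched uses the canonical spelling,
    unmatched keeps the original strings for a human to review.
    """
    lookup = {name.lower(): name for name in canonical_topics}
    matched, unmatched = [], []
    for topic in topics:
        canonical = lookup.get(topic.strip().lower())
        if canonical is not None:
            matched.append(canonical)
        else:
            unmatched.append(topic)
    return matched, unmatched


canonical = ["OpenAI", "Nvidia", "Policy and regulation"]
extracted = ["openai", "NVIDIA", "Quantum computing", "OpenAI"]
matched, unmatched = canonicalize(extracted, canonical)
matched_counts = Counter(matched)
```

Frequent entries in `unmatched` are the candidates worth adding to the evergreen list.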

In [None]:
# you could try it with new cats or new cats + evergreen
# but probably look at new cats and human in the loop should add good new cats today to evergreen list
# new_cats = list(json.loads(response.choices[0].message.content).values())[0]
# categories = sorted(list(set(new_cats + evergreen)))
categories = sorted(CANONICAL_TOPICS)
categories


In [None]:
async def categorize_story(headline, categories, session, 
                           model=LOWCOST_MODEL,
                           temperature=0.5,
                           max_retries=MAX_RETRIES):
    
    retlist = []
    if type(categories) is not list:
        categories = [categories]
    for topic in categories:
        cat_prompt = f"""You are a news topic categorization assistant. I will provide a headline 
and a topic. You will respond with a JSON object {{'response': 1}} if the news headline matches 
the news topic and {{'response': 0}} if it does not. Check carefully and only return {{'response': 1}}
if the headline closely matches the topic. If the headline is not a close match or if unsure, 
return {{'response': 0}}
Headline:
{headline}
Topic:
{topic}
"""
        for i in range(max_retries):
            try:
                messages=[
                          {"role": "user", "content": cat_prompt
                          }]

                payload = {"model":  model,
                           "response_format": {"type": "json_object"},
                           "messages": messages,
                           "temperature": temperature
                           }
                response = await fetch_openai(session, payload)
                response_dict = json.loads(response["choices"][0]["message"]["content"])
                response_val = response_dict['response']
                if response_val == 1:
                    retlist.append(topic)
                break
            except Exception as exc:
                log(f"Error: {exc}")

            
    return retlist
        

h = "Utility stocks are Wall Streets secret backdoor to AI"
catdict = dict()

async with aiohttp.ClientSession() as session:
    for i, row in enumerate(AIdf.itertuples()):
        tasks = []
        log(f"Categorizing headline {row.id+1} of {len(AIdf)}")
        h = row.title
        log(h)
        for c in categories:
            task = asyncio.create_task(categorize_story(h, c, session))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        catdict[row.id] = [item for sublist in responses for item in sublist]
        log(str(catdict[row.id]))
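The retry loop in `categorize_story` retries immediately after an error. When errors come from rate limits, a pause that grows between attempts is a common alternative. The sketch below shows that pattern with a stand-in coroutine instead of a real API call; `with_retries`, `base_delay`, and `flaky_call` are hypothetical.

```python
import asyncio


async def with_retries(make_call, max_retries=3, base_delay=0.5):
    """Await make_call() up to max_retries times, sleeping between failures.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_retries):
        try:
            return await make_call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice, then succeeds
attempts = {"count": 0}


async def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return {"response": 1}


result = asyncio.run(with_retries(flaky_call, max_retries=3, base_delay=0.01))
```

Inside the notebook the coroutine can also be awaited directly, e.g. `await with_retries(lambda: fetch_openai(session, payload))`.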
        


In [None]:
catdict

In [None]:
# drop a stale column from a previous run, if present
AIdf = AIdf.drop(columns=["assigned_topics"], errors="ignore")

In [None]:
topic_df["assigned_topics"] = topic_df["id"].apply(lambda id: catdict.get(id, []))
topic_df

In [None]:
def clean_topics(row):
    topics = [x.title() for x in row.topics if x.lower() not in {"ai", "artificial intelligence"}]
    assigned_topics = [x.title() for x in row.assigned_topics]
    combined = sorted(list(set(topics + assigned_topics)))
    combined = [s.replace("Ai", "AI") for s in combined]
    combined = [s.replace("Genai", "Gen AI") for s in combined]
    
    return ", ".join(combined)

# store as topic_str, the column name used by the merge and the markdown output below
topic_df["topic_str"] = topic_df.apply(clean_topics, axis=1)
topic_df

In [None]:
# merge returned df into original df
merged_df = pd.merge(AIdf, topic_df[["id", "topic_str"]], on="id", how="outer")
merged_df['title_topic_str'] = merged_df.apply(lambda row: f'{row.title} (Topics: {row.topic_str})', axis=1)

merged_df


In [None]:
AIdf = merged_df

### Semantic sort

In [None]:
# use embeddings to sort headlines by semantic similarity
log(f"Fetching embeddings for {len(AIdf)} headlines")
embedding_model = 'text-embedding-3-large'
response = client.embeddings.create(input=AIdf['title_topic_str'].tolist(),
                                    model=embedding_model)
embedding_df = pd.DataFrame([e.model_dump()['embedding'] for e in response.data])

sorted_indices = agglomerative_cluster_sort(embedding_df)
AIdf = AIdf.iloc[sorted_indices] \
    .reset_index(drop=True) \
    .reset_index() \
    .drop(columns=["id"]) \
    .rename(columns={'index': 'id'})

# reorder embedding_df rows to match (plain [] indexing would select columns)
embedding_df = embedding_df.iloc[sorted_indices].reset_index(drop=True)

AIdf
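Sorting by semantic similarity means ordering rows so that adjacent headlines have similar embeddings. `agglomerative_cluster_sort` lives in `ainb_utilities` and is not shown here. As a simpler illustration of the idea, the sketch below does a greedy nearest-neighbor ordering over cosine similarity in pure Python with toy 2-D vectors. It is hypothetical and not the helper used above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def greedy_similarity_order(vectors, start=0):
    """Return row indices so each row is followed by its most similar unvisited row."""
    if not vectors:
        return []
    order = [start]
    remaining = set(range(len(vectors))) - {start}
    while remaining:
        current = vectors[order[-1]]
        nearest = max(remaining, key=lambda i: cosine_similarity(current, vectors[i]))
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Toy 2-D "embeddings": rows 0 and 2 point one way, rows 1 and 3 another
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
order = greedy_similarity_order(vectors)
```

The resulting index list can be applied with `DataFrame.iloc` to both the headline frame and the embedding frame so the two stay aligned.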


In [None]:
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    display(AIdf[["title"]])
    

### Cluster with DBSCAN

In [None]:
# embedding_model = 'text-embedding-3-large'
# chunksize = 1000
# e_results = []
# for start in range(0, len(ai_history_df), chunksize):
#     tempdf = ai_history_df.iloc[start:start+chunksize]
#     templist = tempdf['title'].tolist()
#     log(f"Fetching embeddings for {len(templist)} headlines starting at row {start}")
#     response = client.embeddings.create(input=templist,
#                                         model=embedding_model)
#     e_results.append(pd.DataFrame([e.model_dump()['embedding'] for e in response.data]))


In [None]:
# historical_embedding_df = pd.concat(e_results)
# historical_embedding_df.shape


In [None]:
# historical_embedding_df.to_pickle('historical_embedding_df.pkl')


In [None]:
log(f"Fetching embeddings for {len(AIdf)} headlines")
embedding_model = 'text-embedding-3-large'
response = client.embeddings.create(input=AIdf['title'].tolist(),
                                    model=embedding_model)
embedding_df = pd.DataFrame([e.model_dump()['embedding'] for e in response.data])


In [None]:
# reducer = umap.UMAP(n_components=30)  # Reducing to 30 dimensions
# reduced_data = reducer.fit_transform(historical_embedding_df)
# with open('umap_model.pkl', 'wb') as file:
#     pickle.dump(reducer, file)

In [None]:
with open("umap_model.pkl", 'rb') as file:
    # Load the model from the file
    reducer = pickle.load(file)
    

In [None]:
reduced_data = reducer.transform(embedding_df)


In [None]:
np.isnan(reduced_data).any()


In [None]:
# Apply a Clustering Algorithm (e.g., K-Means)
kmeans = KMeans(n_clusters=20)  
clusters = kmeans.fit_predict(reduced_data)

# Evaluate the Clustering
silhouette_avg = silhouette_score(reduced_data, clusters)
print(f'Silhouette Score: {silhouette_avg}')

# Visualization with UMAP (optional)
# reducer_2d = umap.UMAP(n_components=2)  # Reducing to 2 dimensions for visualization
# reduced_data_2d = reducer_2d.fit_transform(embedding_df)

# plt.scatter(reduced_data_2d[:, 0], reduced_data_2d[:, 1], c=clusters, cmap='viridis', s=5)
# plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
# plt.title('UMAP Projection of the News Headlines Clusters')
# plt.show()


In [None]:
dbscan = DBSCAN(eps=0.5, min_samples=3)  # Adjust eps and min_samples as needed
clusters = dbscan.fit_predict(reduced_data)

AIdf['cluster'] = clusters

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    for i in range(30):
        tmpdf = AIdf.loc[AIdf['cluster']==i][["id", "title"]]
        if len(tmpdf) ==0:
            break
        display(tmpdf)
    


# Save and email headlines


In [None]:
html_str = ""
for row in AIdf.itertuples():
    log(f"[{row.Index}. {row.title} - {row.src}]({row.url})")
    html_str += f'{row.Index}.<a href="{row.url}">{row.title} - {row.src}</a><br />\n'


In [None]:
# save headlines
with open('headlines.html', 'w') as f:
    f.write(html_str)


In [None]:
# send mail
log("Sending headlines email")
subject = f'AI headlines {datetime.now().strftime("%H:%M:%S")}'
send_gmail(subject, html_str)


# Save individual pages 

In [None]:
# fetch pages
# Create a queue for multiprocessing and populate it 
log("Queuing URLs for scraping")

queue = multiprocessing.Queue()
for row in AIdf.itertuples():
    queue.put((row.id, row.url, row.title))
    

In [None]:
# scrape urls in queue asynchronously
num_browsers = 4

callable = process_url_queue_factory(queue)

log(f"fetching {len(AIdf)} pages using {num_browsers} browsers")
saved_pages = launch_drivers(num_browsers, callable)


In [None]:
pages_df = pd.DataFrame(saved_pages)
pages_df.columns = ['id', 'url', 'title', 'path']
pages_df

In [None]:
AIdf = pd.merge(AIdf, pages_df[["id", "path"]], on='id', how="inner")


In [None]:
AIdf

# Summarize individual pages

In [None]:
print(SUMMARIZE_SYSTEM_PROMPT)


In [None]:
print(SUMMARIZE_USER_PROMPT)


In [None]:
# Here we are fetching all at once, could be 200 summaries, so we are firing off 200 REST requests at once
# This seems like a bad idea, could loop through and fire off e.g. 10 at a time, or use queues and workers (seems pointless)
# But it works and runs fast on 3.5 and if ChatGPT doesn't like it they could throttle it

log("Starting summarize")
responses = await fetch_all_summaries(AIdf)
log(f"Received {len(responses)} summaries")
print(responses[0])
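The comment above suggests firing off e.g. 10 requests at a time instead of all at once. One way to do that without queues and workers is an `asyncio.Semaphore` around each request. The sketch below uses a stand-in coroutine rather than a real REST call; `gather_limited`, `fake_request`, and the limit of 10 are assumptions for illustration.

```python
import asyncio


async def gather_limited(coroutine_factories, limit=10):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with semaphore:
            return await factory()

    # gather returns results in the same order as the input factories
    return await asyncio.gather(*(run_one(f) for f in coroutine_factories))


# Stand-in for a summary request; tracks how many are running at once
state = {"running": 0, "peak": 0}


async def fake_request(i):
    state["running"] += 1
    state["peak"] = max(state["peak"], state["running"])
    await asyncio.sleep(0.01)
    state["running"] -= 1
    return i * 2


factories = [lambda i=i: fake_request(i) for i in range(25)]
results = asyncio.run(gather_limited(factories, limit=10))
```

Passing factories rather than coroutine objects means no request is created until its slot is free.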


In [None]:
# bring summaries into dict
response_dict = {}
for i, response in responses:
    try:
        response_str = response["choices"][0]["message"]["content"]
        response_dict[i] = response_str
    except Exception as exc:
        print(exc)
        
len(response_dict)

In [None]:
markdown_str = ''

for i, row in enumerate(AIdf.itertuples()):
    mdstr = f"[{i+1}. {row.title} - {row.src}]({row.url})  \n\n {row.topic_str} \n\n{response_dict[row.id]} \n\n"
    display(Markdown(mdstr))
    markdown_str += mdstr
    

In [None]:
# Convert Markdown to HTML
html_str = markdown.markdown(markdown_str, extensions=['extra'])
# display(HTML(html_str))


In [None]:
# save bullets
with open('bullets.md', 'w') as f:
    f.write(markdown_str)


In [None]:
log("Sending bullet points email")
subject = f'AI news bullets {datetime.now().strftime("%H:%M:%S")}'
send_gmail(subject, html_str)


# Ask ChatGPT for top categories

In [None]:
print(TOP_CATEGORIES_PROMPT)

In [None]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
              {"role": "user", "content": TOP_CATEGORIES_PROMPT + markdown_str
              }],
    n=1,
    response_format={"type": "json_object"},
    temperature=0.5
)


In [None]:
list(json.loads(response.choices[0].message.content).values())[0]

# Final Summary

In [None]:
markdown_str = ''
for i, row in enumerate(AIdf.itertuples()):
    mdstr = f"[{i+1}. {row.title} - {row.src}]({row.url})  \n\n"
    if 0 < len(catdict[row.id]) < 11 :
        topicstr = ", ".join(catdict[row.id])
        mdstr += f"Topics: {topicstr}\n\n"
    mdstr += f"{response_dict[row.id]} \n\n"
    display(Markdown(mdstr))
    markdown_str += mdstr
    

In [None]:
FINAL_SUMMARY_PROMPT = f"""You are a summarization assistant. I will provide a list of today's news articles about AI
and summary bullet points in markdown format. Bullet points will have a title and URL, a list of topics discussed, 
and a bullet-point summary of the article. You are tasked with identifying and summarizing the key themes,
common facts, and recurring elements. Your goal is to create a concise summary containing about 20 of the most 
frequently mentioned topics and developments.


Example Input Bullet Points:

[2. Sentient closes $85M seed round for open-source AI](https://cointelegraph.com/news/sentient-85-million-round-open-source-ai)

AI startup funding, New AI products

- Sentient secured $85 million in a seed funding round led by Peter Thiel's Founders Fund, Pantera Capital, and Framework Ventures for their open-source AI platform.
- The startup aims to incentivize AI developers with its blockchain protocol and incentive mechanism, allowing for the evolution of open artificial general intelligence.
- The tech industry is witnessing a rise in decentralized AI startups combining blockchain

Examples of important stories:

Major investments and funding rounds
Key technological advancements or breakthroughs
Frequently mentioned companies, organizations, or figures
Notable statements by AI leaders
Any other recurring themes or notable patterns

Instructions:

Read the summary bullet points closely.
Use only information provided in them and provide the most common facts without commentary or elaboration.
Write in the professional but engaging, narrative style of a tech reporter for a national publication.
Be balanced, professional, informative, providing accurate, clear, concise summaries in a respectful neutral tone.
Focus on the most common elements across the bullet points and group similar items together.
Headers must be as short and simple as possible: use "Health Care" and not "AI developments in Health Care" or "AI in Health Care"
Ensure that you provide at least one link from the provided text for each item in the summary.
You must include at least 10 and no more than 25 items in the summary.

Example Output Format:

# Today's AI News

### Security and Privacy:
- ChatGPT Mac app had a security flaw exposing user conversations in plain text. ([Macworld](https://www.macworld.com/article/2386267/chatgpt-mac-sandboxing-security-flaw-apple-intelligence.html))
- Brazil suspended Meta from using Instagram and Facebook posts for AI training over privacy concerns. ([BBC](https://www.bbc.com/news/articles/c7291l3nvwv))

### Health Care:
- AI can predict Alzheimer's disease with 70% accuracy up to seven years in advance. ([Decrypt](https://decrypt.co/238449/ai-alzheimers-detection-70-percent-accurate-study))
- New AI system detects 13 cancers with 98% accuracy, revolutionizing cancer diagnosis. ([India Express](https://news.google.com/articles/CBMiiAFodHRwczovL2luZGlhbmV4cHJl))

Bullet Points to Summarize:

"""

In [None]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
              {"role": "user", "content": FINAL_SUMMARY_PROMPT + markdown_str
              }],
    n=1,   
    temperature=0.5
)


In [None]:
response_str = response.choices[0].message.content
response_str = response_str.replace("$", "\\$")
display(Markdown(response_str))


In [None]:
log("Sending full summary email")
subject = f'AI news summary {datetime.now().strftime("%H:%M:%S")}'
final_html_str = markdown.markdown(response_str, extensions=['extra'])
display(HTML(final_html_str))
send_gmail(subject, final_html_str)


In [None]:
log("Finished")


In [None]:
redirect_dict