AInewsbot.ipynb

- Automate collecting daily AI news
- Open URLs of news sites specified in `sources` dict (sources.yaml) using Selenium and Firefox
- Save HTML of each URL in htmldata directory
- Extract URLs from all files, create a pandas dataframe with url, title, src
- Use ChatGPT to filter only AI-related headlines by sending a prompt and formatted table of headlines
- Use SQLite to filter headlines previously seen 
- OPENAI_API_KEY should be in the environment or in a .env file
  
Alternative manual workflow to get HTML files if necessary
- Use Chrome, open e.g. Tech News bookmark folder, right-click and open all bookmarks in new window
- On Google News, make sure to switch to the AI tab
- On Google News, Feedly, Reddit, scroll to additional pages as desired
- Use SingleFile extension, 'save all tabs'
- Move files to htmldata directory
- Run lower part of notebook to process the data


In [1]:
from datetime import datetime
import time
import re
import os
import yaml
import dotenv
import sqlite3
import unicodedata

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# import bs4
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By
# use Firefox rather than Chrome because it updates less often and updates can be disabled
# recommend importing a profile from Chrome to carry over cookies and passwords
# a profile with more user cruft looks less like a bot
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

from openai import OpenAI

from ainb_const import (DOWNLOAD_DIR,
                        SOURCECONFIG, PROMPT)
from ainb_utilities import (log, delete_files, filter_unseen_urls_db,
                            insert_article, nearest_neighbor_sort,
                            agglomerative_cluster_sort)
from ainb_webscrape import (init_browser, get_file, parse_file, get_og_tags,
                            get_path_from_url, trimmed_href)
from ainb_llm import paginate_df, process_pages

In [2]:
SOURCECONFIG = "sources.yaml"
DOWNLOAD_DIR = "htmldata"

# load secrets, credentials from .env
dotenv.load_dotenv()


True

In [3]:
PROMPT = """
Please serve as a research assistant for the purpose of categorizing news articles based on their relevance to artificial intelligence (AI). 
Your main responsibility will involve processing and classifying news articles formatted as JSON objects.

Classification Criteria: Based on the title of each story, you are to classify whether the story primarily pertains to AI or not. 
Consider AI-related content to broadly include topics such as machine learning, robotics, computer vision, reinforcement learning, 
large language models, related topics, and specific references to AI entities like OpenAI or ChatGPT.

Input Specification: You will receive a list of news stories formatted as JSON objects separated by the delimiter "|". 
Each object includes an 'id' and a 'title'. For instance:
|
{'stories':
[{'id': 97, 'title': 'AI to predict dementia, detect cancer'},
 {'id': 103,'title': 'Figure robot learns to make coffee by watching humans for 10 hours'},
 {'id': 104,'title': 'Baby trapped in refrigerator eats own foot'},
 {'id': 210,'title': 'ChatGPT removes, then reinstates a summarization assistant without explanation.'},
 {'id': 298,'title': 'The 5 most interesting PC monitors from CES 2024'},
 ]
}
|

Output Specification: For each story, your output should be a JSON object containing the original 'id' and a new field 'isAI',
which is a boolean indicating if the story is about AI. This output should be enclosed in the delimiter "~".
The output schema must be strictly adhered to, without any additional fields. Example output:
~
{'stories':
[{'id': 97, 'isAI': true},
 {'id': 103, 'isAI': true},
 {'id': 104, 'isAI': false},
 {'id': 210, 'isAI': true},
 {'id': 298, 'isAI': false}]
}
~

Strictly ensure that each output object accurately reflects the corresponding input object in terms of the 'id' field 
and that the 'isAI' field accurately represents the AI relevance of the story as determined by the title.

The list of news stories to classify and enrich is:

"""

In [4]:
get_og_tags('https://druce.ai')


{'og:site_name': 'Druce.ai',
 'og:title': 'Druce.ai',
 'og:type': 'website',
 'og:description': "Druce's Blog on Machine Learning, Tech, Markets and Economics",
 'og:url': 'https://druce.ai/',
 'title': 'Druce.ai'}
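The source of `get_og_tags` is not shown in this notebook. As a rough sketch of the idea only, the following collects Open Graph `<meta>` tags and the page title from an HTML string using the standard library. The names `OGTagParser` and `og_tags_from_html` are hypothetical, and the real helper presumably fetches the page over the network and may parse it differently.

```python
from html.parser import HTMLParser


class OGTagParser(HTMLParser):
    """Collect <meta property="og:..."> tags and the <title> text."""

    def __init__(self):
        super().__init__()
        self.tags = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.tags[prop] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.tags["title"] = data.strip()


def og_tags_from_html(html_text):
    """Hypothetical stand-in for get_og_tags that works on an HTML string."""
    parser = OGTagParser()
    parser.feed(html_text)
    return parser.tags


sample_html = """<html><head>
<title>Druce.ai</title>
<meta property="og:title" content="Druce.ai" />
<meta property="og:type" content="website" />
</head><body></body></html>"""
og_result = og_tags_from_html(sample_html)
```

Working on a string rather than a URL keeps the sketch testable offline; a fetch step would sit in front of it.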

In [5]:
get_path_from_url('https://druce.ai/2024/03/gemini-summarize-book')


'/2024/03/gemini-summarize-book'

In [6]:
trimmed_href('https://druce.ai/2024/03/gemini-summarize-book?xyz')


'https://druce.ai/2024/03/gemini-summarize-book'
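The two URL helpers above come from `ainb_webscrape`, whose code is not included here. A minimal sketch that reproduces the outputs shown, assuming they are thin wrappers around `urllib.parse`, could look like this. The function names `path_from_url` and `strip_query` are hypothetical.

```python
from urllib.parse import urlparse, urlunparse


def path_from_url(url):
    """Return only the path component of a URL."""
    return urlparse(url).path


def strip_query(url):
    """Drop params, query string and fragment, keeping scheme, host and path."""
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))


example_path = path_from_url("https://druce.ai/2024/03/gemini-summarize-book")
example_trimmed = strip_query("https://druce.ai/2024/03/gemini-summarize-book?xyz")
```

Stripping the query string helps treat links that differ only in tracking parameters as the same URL, though it would also merge pages that use the query string to identify content.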

In [7]:
#  load sources to scrape from sources.yaml
with open(SOURCECONFIG, "r") as stream:
    try:
        sources = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

log(f"Load {len(sources)} sources")



2024-05-01 13:06:43,941 - AInewsbot - INFO - Load 17 sources


20

In [8]:
sources_reverse = {}
for k, v in sources.items():
    log(f"{k} -> {v['url']} -> {v['title']}.html")
    v['sourcename'] = k
    # map filename (title) to source name
    sources_reverse[v['title']] = k

sources_reverse

2024-05-01 13:06:43,945 - AInewsbot - INFO - Ars Technica -> https://arstechnica.com/ -> Ars Technica.html
2024-05-01 13:06:43,945 - AInewsbot - INFO - Bloomberg Tech -> https://www.bloomberg.com/technology -> Bloomberg Technology - Bloomberg.html
2024-05-01 13:06:43,945 - AInewsbot - INFO - Business Insider -> https://www.businessinsider.com/tech -> Tech - Business Insider.html
2024-05-01 13:06:43,945 - AInewsbot - INFO - FT Tech -> https://www.ft.com/technology -> Technology.html
2024-05-01 13:06:43,946 - AInewsbot - INFO - Feedly AI -> https://feedly.com/i/aiFeeds?options=eyJsYXllcnMiOlt7InBhcnRzIjpbeyJpZCI6Im5scC9mL3RvcGljLzMwMDAifV0sInNlYXJjaEhpbnQiOiJ0ZWNobm9sb2d5IiwidHlwZSI6Im1hdGNoZXMiLCJzYWxpZW5jZSI6ImFib3V0In1dLCJidW5kbGVzIjpbeyJ0eXBlIjoic3RyZWFtIiwiaWQiOiJ1c2VyLzYyZWViYjlmLTcxNTEtNGY5YS1hOGM3LTlhNTdiODIwNTMwOC9jYXRlZ29yeS9HYWRnZXRzIn1dfQ -> Discover and Add New Feedly AI Feeds.html
2024-05-01 13:06:43,946 - AInewsbot - INFO - Google News -> https://news.google.com/topics/CAA

{'Ars Technica': 'Ars Technica',
 'Bloomberg Technology - Bloomberg': 'Bloomberg Tech',
 'Tech - Business Insider': 'Business Insider',
 'Technology': 'FT Tech',
 'Discover and Add New Feedly AI Feeds': 'Feedly AI',
 'Google News - Technology - Artificial intelligence': 'Google News',
 'Hacker News Page 1': 'Hacker News',
 'Hacker News Page 2': 'Hacker News 2',
 'HackerNoon - read, write and learn about any technology': 'HackerNoon',
 'Technology - The New York Times': 'NYT Tech',
 'top scoring links _ multi': 'Reddit',
 'Techmeme': 'Techmeme',
 'The Register_ Enterprise Technology News and Analysis': 'The Register',
 'Artificial Intelligence - The Verge': 'The Verge',
 'AI News _ VentureBeat': 'VentureBeat',
 'Technology - WSJ.com': 'WSJ Tech',
 'Technology - The Washington Post': 'WaPo Tech'}

In [9]:
# get list of files in htmldata directory
# List all paths in the directory matching today's date
nfiles = 50
files = [os.path.join(DOWNLOAD_DIR, file)
         for file in os.listdir(DOWNLOAD_DIR)]

# Get the current date
today = datetime.now()
year, month, day = today.year, today.month, today.day
datestr = datetime.now().strftime("%m_%d_%Y")

# filter files only
files = [file for file in files if os.path.isfile(file)]

# Sort files by modification time, newest first, and keep the most recent nfiles
files.sort(key=os.path.getmtime, reverse=True)
files = files[:nfiles]

# keep only .html files with today's date in the name
files = [
    file for file in files if datestr in file and file.endswith(".html")]
log(len(files))
for file in files:
    log(file)

saved_pages = []
for file in files:
    filename = os.path.basename(file)
    # locate date like '01_14_2024' in filename
    position = filename.find(" (" + datestr)
    basename = filename[:position]
    # match to source name
    sourcename = sources_reverse.get(basename)
    if sourcename is None:
        log(f"Skipping {basename}, no sourcename metadata")
        continue
    sources[sourcename]['latest'] = file
    saved_pages.append((sourcename, file))

2024-05-01 13:06:43,954 - AInewsbot - INFO - 17
2024-05-01 13:06:43,954 - AInewsbot - INFO - htmldata/Technology - The Washington Post (05_01_2024 10_45_05 AM).html
2024-05-01 13:06:43,955 - AInewsbot - INFO - htmldata/Technology - WSJ.com (05_01_2024 10_44_55 AM).html
2024-05-01 13:06:43,955 - AInewsbot - INFO - htmldata/AI News _ VentureBeat (05_01_2024 10_44_44 AM).html
2024-05-01 13:06:43,955 - AInewsbot - INFO - htmldata/Artificial Intelligence - The Verge (05_01_2024 10_44_33 AM).html
2024-05-01 13:06:43,955 - AInewsbot - INFO - htmldata/The Register_ Enterprise Technology News and Analysis (05_01_2024 10_44_23 AM).html
2024-05-01 13:06:43,955 - AInewsbot - INFO - htmldata/Techmeme (05_01_2024 10_44_12 AM).html
2024-05-01 13:06:43,955 - AInewsbot - INFO - htmldata/top scoring links _ multi (05_01_2024 10_44_01 AM).html
2024-05-01 13:06:43,955 - AInewsbot - INFO - htmldata/Technology - The New York Times (05_01_2024 10_43_28 AM).html
2024-05-01 13:06:43,956 - AInewsbot - INFO - ht
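The cell above recovers the page title with `filename.find(" (" + datestr)`. When the pattern is absent, `find` returns -1 and `filename[:position]` silently drops the last character instead of failing. One possible alternative, sketched here under the assumption that saved files always follow the `Title (MM_DD_YYYY HH_MM_SS AM).html` pattern seen in the log, is a regular expression that returns `None` for anything else. `SAVED_NAME_RE` and `split_saved_name` are hypothetical names.

```python
import re

# Matches names like "Techmeme (05_01_2024 10_44_12 AM).html"
SAVED_NAME_RE = re.compile(
    r"^(?P<title>.+) \((?P<date>\d{2}_\d{2}_\d{4}) "
    r"(?P<time>\d{2}_\d{2}_\d{2} [AP]M)\)\.html$"
)


def split_saved_name(filename):
    """Return (title, date) for a saved page, or None if the name does not match."""
    match = SAVED_NAME_RE.match(filename)
    if match is None:
        return None
    return match.group("title"), match.group("date")


parsed = split_saved_name("Techmeme (05_01_2024 10_44_12 AM).html")
long_title = split_saved_name("Technology - The Washington Post (05_01_2024 10_45_05 AM).html")
unparsed = split_saved_name("notes.txt")
```

The returned title can then be looked up in `sources_reverse` exactly as the cell does with `basename`.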

In [10]:
# Get HTML files from sources

# empty download directory
delete_files(DOWNLOAD_DIR)

# launch browser via selenium driver
driver = init_browser()

# save each file specified from sources
log("Saving HTML files")
saved_pages = []
for sourcename, sourcedict in sources.items():
    log(f'Processing {sourcename}')
    sourcefile = get_file(sourcedict, driver=driver)
    saved_pages.append((sourcename, sourcefile))

# Close the browser
log("Quit webdriver")
driver.quit()
# finished downloading files


2024-05-01 13:06:43,962 - AInewsbot - INFO - init_browser - Initializing webdriver
2024-05-01 13:06:55,910 - AInewsbot - INFO - init_browser - Initialized webdriver profile
2024-05-01 13:06:55,910 - AInewsbot - INFO - init_browser - Initialized webdriver service
2024-05-01 13:07:34,022 - AInewsbot - INFO - init_browser - Initialized webdriver
2024-05-01 13:07:34,055 - AInewsbot - INFO - Saving HTML files
2024-05-01 13:07:34,056 - AInewsbot - INFO - Processing Ars Technica
2024-05-01 13:07:34,056 - AInewsbot - INFO - get_files(Ars Technica) - starting get_files https://arstechnica.com/
2024-05-01 13:07:44,924 - AInewsbot - INFO - get_files(Ars Technica) - Saving Ars Technica (05_01_2024 01_07_44 PM).html as utf-8
2024-05-01 13:07:44,926 - AInewsbot - INFO - Processing Bloomberg Tech
2024-05-01 13:07:44,926 - AInewsbot - INFO - get_files(Bloomberg Technology - Bloomberg) - starting get_files https://www.bloomberg.com/technology
2024-05-01 13:07:56,088 - AInewsbot - INFO - Message: Unable

In [11]:
print(len(saved_pages))
for sourcename, page in saved_pages:
    # sources[sourcename]['latest'] = page
    print(sourcename, '->', page)
    

17
Ars Technica -> htmldata/Ars Technica (05_01_2024 01_07_44 PM).html
Bloomberg Tech -> htmldata/Bloomberg Technology - Bloomberg (05_01_2024 01_07_56 PM).html
Business Insider -> htmldata/Tech - Business Insider (05_01_2024 01_08_06 PM).html
FT Tech -> htmldata/Technology (05_01_2024 01_08_17 PM).html
Feedly AI -> htmldata/Discover and Add New Feedly AI Feeds (05_01_2024 01_09_17 PM).html
Google News -> htmldata/Google News - Technology - Artificial intelligence (05_01_2024 01_09_49 PM).html
Hacker News -> htmldata/Hacker News Page 1 (05_01_2024 01_10_00 PM).html
Hacker News 2 -> htmldata/Hacker News Page 2 (05_01_2024 01_10_10 PM).html
HackerNoon -> htmldata/HackerNoon - read, write and learn about any technology (05_01_2024 01_10_21 PM).html
NYT Tech -> htmldata/Technology - The New York Times (05_01_2024 01_10_32 PM).html
Reddit -> htmldata/top scoring links _ multi (05_01_2024 01_11_04 PM).html
Techmeme -> htmldata/Techmeme (05_01_2024 01_11_15 PM).html
The Register -> htmldata/T

In [12]:
# Parse news URLs and titles from downloaded HTML files
log("parsing html files")
all_urls = []
for sourcename, filename in saved_pages:
    print(sourcename, '->', filename, flush=True)
    log(f"{sourcename}", "parse loop")
    links = parse_file(sources[sourcename])
    log(f"{len(links)} links found", "parse loop")
    all_urls.extend(links)

log(f"found {len(all_urls)} links", "parse loop")

2024-05-01 13:12:10,750 - AInewsbot - INFO - parsing html files


Ars Technica -> htmldata/Ars Technica (05_01_2024 01_07_44 PM).html


2024-05-01 13:12:10,751 - AInewsbot - INFO - parse loop - Ars Technica
2024-05-01 13:12:10,768 - AInewsbot - INFO - parse_file - found 252 raw links
2024-05-01 13:12:10,771 - AInewsbot - INFO - parse_file - found 27 filtered links
2024-05-01 13:12:10,771 - AInewsbot - INFO - parse loop - 27 links found


Bloomberg Tech -> htmldata/Bloomberg Technology - Bloomberg (05_01_2024 01_07_56 PM).html


2024-05-01 13:12:10,772 - AInewsbot - INFO - parse loop - Bloomberg Tech
2024-05-01 13:12:10,791 - AInewsbot - INFO - parse_file - found 298 raw links
2024-05-01 13:12:10,794 - AInewsbot - INFO - parse_file - found 52 filtered links
2024-05-01 13:12:10,794 - AInewsbot - INFO - parse loop - 52 links found


Business Insider -> htmldata/Tech - Business Insider (05_01_2024 01_08_06 PM).html


2024-05-01 13:12:10,795 - AInewsbot - INFO - parse loop - Business Insider
2024-05-01 13:12:10,820 - AInewsbot - INFO - parse_file - found 339 raw links
2024-05-01 13:12:10,825 - AInewsbot - INFO - parse_file - found 65 filtered links
2024-05-01 13:12:10,825 - AInewsbot - INFO - parse loop - 65 links found


FT Tech -> htmldata/Technology (05_01_2024 01_08_17 PM).html


2024-05-01 13:12:10,825 - AInewsbot - INFO - parse loop - FT Tech
2024-05-01 13:12:10,889 - AInewsbot - INFO - parse_file - found 457 raw links
2024-05-01 13:12:10,894 - AInewsbot - INFO - parse_file - found 104 filtered links
2024-05-01 13:12:10,894 - AInewsbot - INFO - parse loop - 104 links found


Feedly AI -> htmldata/Discover and Add New Feedly AI Feeds (05_01_2024 01_09_17 PM).html


2024-05-01 13:12:10,895 - AInewsbot - INFO - parse loop - Feedly AI
2024-05-01 13:12:10,922 - AInewsbot - INFO - parse_file - found 229 raw links
2024-05-01 13:12:10,925 - AInewsbot - INFO - parse_file - found 61 filtered links
2024-05-01 13:12:10,925 - AInewsbot - INFO - parse loop - 61 links found


Google News -> htmldata/Google News - Technology - Artificial intelligence (05_01_2024 01_09_49 PM).html


2024-05-01 13:12:10,925 - AInewsbot - INFO - parse loop - Google News
2024-05-01 13:12:11,220 - AInewsbot - INFO - parse_file - found 1123 raw links
2024-05-01 13:12:11,228 - AInewsbot - INFO - parse_file - found 491 filtered links
2024-05-01 13:12:11,229 - AInewsbot - INFO - parse loop - 491 links found


Hacker News -> htmldata/Hacker News Page 1 (05_01_2024 01_10_00 PM).html


2024-05-01 13:12:11,229 - AInewsbot - INFO - parse loop - Hacker News
2024-05-01 13:12:11,240 - AInewsbot - INFO - parse_file - found 255 raw links
2024-05-01 13:12:11,242 - AInewsbot - INFO - parse_file - found 21 filtered links
2024-05-01 13:12:11,242 - AInewsbot - INFO - parse loop - 21 links found


Hacker News 2 -> htmldata/Hacker News Page 2 (05_01_2024 01_10_10 PM).html


2024-05-01 13:12:11,243 - AInewsbot - INFO - parse loop - Hacker News 2
2024-05-01 13:12:11,253 - AInewsbot - INFO - parse_file - found 259 raw links
2024-05-01 13:12:11,256 - AInewsbot - INFO - parse_file - found 24 filtered links
2024-05-01 13:12:11,256 - AInewsbot - INFO - parse loop - 24 links found


HackerNoon -> htmldata/HackerNoon - read, write and learn about any technology (05_01_2024 01_10_21 PM).html


2024-05-01 13:12:11,256 - AInewsbot - INFO - parse loop - HackerNoon
2024-05-01 13:12:11,310 - AInewsbot - INFO - parse_file - found 567 raw links
2024-05-01 13:12:11,318 - AInewsbot - INFO - parse_file - found 92 filtered links
2024-05-01 13:12:11,318 - AInewsbot - INFO - parse loop - 92 links found


NYT Tech -> htmldata/Technology - The New York Times (05_01_2024 01_10_32 PM).html


2024-05-01 13:12:11,318 - AInewsbot - INFO - parse loop - NYT Tech
2024-05-01 13:12:11,328 - AInewsbot - INFO - parse_file - found 72 raw links
2024-05-01 13:12:11,330 - AInewsbot - INFO - parse_file - found 17 filtered links
2024-05-01 13:12:11,330 - AInewsbot - INFO - parse loop - 17 links found


Reddit -> htmldata/top scoring links _ multi (05_01_2024 01_11_04 PM).html


2024-05-01 13:12:11,330 - AInewsbot - INFO - parse loop - Reddit
2024-05-01 13:12:11,458 - AInewsbot - INFO - parse_file - found 563 raw links
2024-05-01 13:12:11,468 - AInewsbot - INFO - parse_file - found 366 filtered links
2024-05-01 13:12:11,469 - AInewsbot - INFO - parse loop - 366 links found


Techmeme -> htmldata/Techmeme (05_01_2024 01_11_15 PM).html


2024-05-01 13:12:11,469 - AInewsbot - INFO - parse loop - Techmeme
2024-05-01 13:12:11,485 - AInewsbot - INFO - parse_file - found 351 raw links
2024-05-01 13:12:11,489 - AInewsbot - INFO - parse_file - found 148 filtered links
2024-05-01 13:12:11,490 - AInewsbot - INFO - parse loop - 148 links found


The Register -> htmldata/The Register_ Enterprise Technology News and Analysis (05_01_2024 01_11_26 PM).html


2024-05-01 13:12:11,490 - AInewsbot - INFO - parse loop - The Register
2024-05-01 13:12:11,506 - AInewsbot - INFO - parse_file - found 201 raw links
2024-05-01 13:12:11,509 - AInewsbot - INFO - parse_file - found 89 filtered links
2024-05-01 13:12:11,509 - AInewsbot - INFO - parse loop - 89 links found


The Verge -> htmldata/Artificial Intelligence - The Verge (05_01_2024 01_11_36 PM).html


2024-05-01 13:12:11,510 - AInewsbot - INFO - parse loop - The Verge
2024-05-01 13:12:11,535 - AInewsbot - INFO - parse_file - found 302 raw links
2024-05-01 13:12:11,538 - AInewsbot - INFO - parse_file - found 33 filtered links
2024-05-01 13:12:11,538 - AInewsbot - INFO - parse loop - 33 links found


VentureBeat -> htmldata/AI News _ VentureBeat (05_01_2024 01_11_47 PM).html


2024-05-01 13:12:11,539 - AInewsbot - INFO - parse loop - VentureBeat
2024-05-01 13:12:11,554 - AInewsbot - INFO - parse_file - found 324 raw links
2024-05-01 13:12:11,558 - AInewsbot - INFO - parse_file - found 46 filtered links
2024-05-01 13:12:11,558 - AInewsbot - INFO - parse loop - 46 links found


WSJ Tech -> htmldata/Technology - WSJ.com (05_01_2024 01_11_58 PM).html


2024-05-01 13:12:11,558 - AInewsbot - INFO - parse loop - WSJ Tech
2024-05-01 13:12:11,589 - AInewsbot - INFO - parse_file - found 500 raw links
2024-05-01 13:12:11,595 - AInewsbot - INFO - parse_file - found 4 filtered links
2024-05-01 13:12:11,595 - AInewsbot - INFO - parse loop - 4 links found


WaPo Tech -> htmldata/Technology - The Washington Post (05_01_2024 01_12_08 PM).html


2024-05-01 13:12:11,595 - AInewsbot - INFO - parse loop - WaPo Tech
2024-05-01 13:12:11,646 - AInewsbot - INFO - parse_file - found 160 raw links
2024-05-01 13:12:11,648 - AInewsbot - INFO - parse_file - found 27 filtered links
2024-05-01 13:12:11,649 - AInewsbot - INFO - parse loop - 27 links found
2024-05-01 13:12:11,649 - AInewsbot - INFO - parse loop - found 1667 links


20
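`parse_file` reports raw and filtered link counts, but its filtering rules live in `ainb_webscrape` and are not shown. The sketch below only illustrates the general shape of the step: collect every `<a>` element, resolve relative URLs, and keep links whose text is long enough to be a headline. The minimum-length filter and the names `LinkCollector` and `extract_links` are assumptions for illustration, not the notebook's actual logic.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # collapse runs of whitespace inside the link text
            title = " ".join("".join(self._text).split())
            self.links.append((self._href, title))
            self._href = None


def extract_links(html_text, base_url, src, min_title_len=10):
    """Return dicts with url, title, src; the length filter is an assumed heuristic."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [
        {"url": urljoin(base_url, href), "title": title, "src": src}
        for href, title in collector.links
        if len(title) >= min_title_len
    ]


page = ('<a href="/ai/story">ChatGPT shows better moral judgement</a>'
        '<a href="/about">About</a>')
links = extract_links(page, "https://arstechnica.com/", "Ars Technica")
```

The resulting list of dicts has the same `url`, `title`, `src` keys that the next cell turns into a dataframe.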

In [13]:
# make a pandas dataframe of all the links found
orig_df = (
    pd.DataFrame(all_urls)
    .groupby("url")
    .first()
    .reset_index()
    .sort_values("src")[["src", "title", "url"]]
    .reset_index(drop=True)
    .reset_index(drop=False)
    .rename(columns={"index": "id"})
)
print(len(orig_df))
orig_df.head()

1357


Unnamed: 0,id,src,title,url
0,0,Ars Technica,The BASIC programming language turns 60,https://arstechnica.com/gadgets/2024/05/the-ba...
1,1,Ars Technica,New space company seeks to solve orbital mobil...,https://arstechnica.com/space/2024/04/new-spac...
2,2,Ars Technica,NASA lays out how SpaceX will refuel Starships...,https://arstechnica.com/space/2024/04/nasa-exp...
3,3,Ars Technica,Account compromise of “unprecedented scale” us...,https://arstechnica.com/security/2024/04/every...
4,4,Ars Technica,Health care giant comes clean about recent hac...,https://arstechnica.com/security/2024/04/chang...


In [14]:
# datestr = '2024-04-29'

# conn = sqlite3.connect('articles.db')

# c = conn.cursor()
# query = f"select * from news_articles where article_date > '{datestr}' order by article_date desc"
# df = pd.read_sql_query(query, conn)
# df



In [15]:

# conn.execute(f"delete from news_articles where article_date > '{datestr}'")

# # Committing the changes
# conn.commit()

# # Closing the connection
# conn.close()


In [16]:
filtered_df = filter_unseen_urls_db(orig_df)


2024-05-01 13:12:11,829 - AInewsbot - INFO - Existing URLs: 58252
2024-05-01 13:12:11,840 - AInewsbot - INFO - New URLs: 295
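`filter_unseen_urls_db` is defined in `ainb_utilities` and not shown. A minimal sketch of the same idea, assuming a `news_articles` table with a `url` column (the table name appears in the commented-out query above; the remaining columns are guesses based on the arguments passed to `insert_article`), might be:

```python
import sqlite3


def filter_unseen(conn, candidates):
    """Return the candidate dicts whose 'url' is not yet stored in news_articles."""
    seen = {row[0] for row in conn.execute("SELECT url FROM news_articles")}
    return [item for item in candidates if item["url"] not in seen]


# In-memory database with an assumed, simplified schema
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news_articles ("
    "src TEXT, title TEXT, url TEXT UNIQUE, isAI BOOLEAN, article_date DATE)"
)
conn.execute(
    "INSERT INTO news_articles VALUES (?, ?, ?, ?, ?)",
    ("Ars Technica", "Old story", "https://example.com/old", 0, "2024-04-30"),
)
conn.commit()

candidates = [
    {"url": "https://example.com/old", "title": "Old story"},
    {"url": "https://example.com/new", "title": "New story"},
]
unseen = filter_unseen(conn, candidates)
conn.close()
```

Loading all known URLs into a set is simple and fast at the scale logged above (tens of thousands of rows); a much larger table might call for querying only the candidate URLs instead.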


In [17]:
# use chatgpt to filter AI-related headlines using a prompt to OpenAI

client = OpenAI()

# make pages that fit in a reasonably sized prompt
pages = paginate_df(filtered_df)

enriched_urls = process_pages(client, PROMPT, pages)

enriched_df = pd.DataFrame(enriched_urls)
enriched_df.head()

2024-05-01 13:12:11,971 - AInewsbot - INFO - send page 1 of 6, 50 items 
2024-05-01 13:12:39,212 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-01 13:12:39,227 - AInewsbot - INFO - 13:12:39 got dict with 50 items 
2024-05-01 13:12:39,228 - AInewsbot - INFO - send page 2 of 6, 50 items 
2024-05-01 13:13:08,718 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-01 13:13:08,720 - AInewsbot - INFO - 13:13:08 got dict with 50 items 
2024-05-01 13:13:08,721 - AInewsbot - INFO - send page 3 of 6, 50 items 
2024-05-01 13:13:31,707 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-01 13:13:31,710 - AInewsbot - INFO - 13:13:31 got dict with 50 items 
2024-05-01 13:13:31,711 - AInewsbot - INFO - send page 4 of 6, 50 items 
2024-05-01 13:14:11,419 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.

Unnamed: 0,id,isAI
0,0,False
1,9,False
2,10,False
3,11,True
4,20,True


In [18]:
log("isAI", len(enriched_df.loc[enriched_df["isAI"]]))
log("not isAI", len(enriched_df.loc[~enriched_df["isAI"]]))


2024-05-01 13:15:19,309 - AInewsbot - INFO - 115 - isAI
2024-05-01 13:15:19,311 - AInewsbot - INFO - 180 - not isAI


20

In [19]:
# merge returned df into original df
merged_df = pd.merge(filtered_df, enriched_df, on="id", how="outer")
merged_df['date'] = datetime.now().date()
merged_df.head()


Unnamed: 0,id,src,title,url,isAI,date
0,0,Ars Technica,The BASIC programming language turns 60,https://arstechnica.com/gadgets/2024/05/the-ba...,False,2024-05-01
1,9,Ars Technica,Europe’s ambitious satellite Internet project ...,https://arstechnica.com/space/2024/05/europes-...,False,2024-05-01
2,10,Ars Technica,iOS 17.5 makes it less of a hassle to send you...,https://arstechnica.com/gadgets/2024/05/repair...,False,2024-05-01
3,11,Ars Technica,Rabbit R1 AI box revealed to just be an Androi...,https://arstechnica.com/gadgets/2024/05/rabbit...,True,2024-05-01
4,20,Ars Technica,ChatGPT shows better moral judgement than a co...,https://arstechnica.com/ai/2024/05/chatgpt-sho...,True,2024-05-01


In [20]:
# should be empty, shouldn't get back rows that don't match to existing
log(f"Unmatched response rows: {len(merged_df.loc[merged_df['src'].isna()])}")
# should be empty, should get back all rows from orig
log(f"Unmatched source rows: {len(merged_df.loc[merged_df['isAI'].isna()])}")


2024-05-01 13:15:19,325 - AInewsbot - INFO - Unmatched response rows: 0
2024-05-01 13:15:19,326 - AInewsbot - INFO - Unmatched source rows: 0


20

In [21]:
# update SQLite database with all seen articles
conn = sqlite3.connect('articles.db')
cursor = conn.cursor()
for row in merged_df.itertuples():
    insert_article(conn, cursor, row.src, row.title,
                   row.url, row.isAI, row.date)
    

In [22]:
AIdf = merged_df.loc[merged_df["isAI"]].reset_index(drop=True)
log(f"Found {len(AIdf)} AI headlines")


2024-05-01 13:15:19,515 - AInewsbot - INFO - Found 115 AI headlines


20

In [23]:
# map title to ascii characters to avoid some dupes with e.g. different quote symbols

def unicode_to_ascii(input_string):
    # Normalize the Unicode string to NFKD form
    normalized_string = unicodedata.normalize('NFKD', input_string)
    
    # Encode to ASCII bytes, ignoring characters that cannot be converted
    ascii_bytes = normalized_string.encode('ascii', 'ignore')
    
    # Convert bytes back to a string
    ascii_string = ascii_bytes.decode('ascii')
    
    return ascii_string

AIdf['title'] = AIdf['title'].apply(unicode_to_ascii)
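One caveat with NFKD normalization followed by `encode('ascii', 'ignore')`: characters with no ASCII decomposition, such as curly quotes, are removed rather than converted, so a title with a typographic apostrophe and the same title with a straight apostrophe still differ after conversion. A possible variant, offered as a sketch and not as the notebook's method, translates common typographic punctuation first. `PUNCTUATION_MAP` and `to_ascii_keep_quotes` are hypothetical names.

```python
import unicodedata

# Map common typographic punctuation to ASCII equivalents before normalizing
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def to_ascii_keep_quotes(text):
    """Translate curly quotes and dashes, then strip remaining non-ASCII characters."""
    text = text.translate(PUNCTUATION_MAP)
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


curly = to_ascii_keep_quotes("Europe\u2019s caf\u00e9")
straight = to_ascii_keep_quotes("Europe's cafe")
```

With this variant both spellings map to the same string, which is what the deduplication step in the next cell relies on.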


In [24]:
# dedupe identical headlines
AIdf['title_clean'] = AIdf['title'].map(lambda s: "".join(s.split()))
AIdf = AIdf.sort_values("src") \
    .groupby("title_clean") \
    .first() \
    .reset_index()
log(f"Found {len(AIdf)} unique AI headlines")


2024-05-01 13:15:19,523 - AInewsbot - INFO - Found 112 unique AI headlines


20

In [28]:
log(f"Fetching embeddings for {len(AIdf)} headlines")
embedding_model = 'text-embedding-3-large'
response = client.embeddings.create(input=AIdf['title'].tolist(),
                                    model=embedding_model)
embedding_df = pd.DataFrame([e.model_dump()['embedding'] for e in response.data])
# embedding_array = embedding_df.values

# # find index of most central headline
# centroid = embedding_array.mean(axis=0)
# distances = np.linalg.norm(embedding_array - centroid, axis=1)
# start_index = np.argmin(distances)

# # Get the sorted indices and use them to sort the df
# sorted_indices = nearest_neighbor_sort(embedding_array, start_index)
# AIdf = AIdf.iloc[sorted_indices]


2024-05-01 13:29:20,773 - AInewsbot - INFO - Fetching embeddings for 112 headlines
2024-05-01 13:29:21,181 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
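The commented-out lines in the cell above describe an ordering strategy: start from the headline closest to the centroid of all embeddings, then repeatedly move to the nearest unvisited headline. `nearest_neighbor_sort` itself is not shown, so here is a small standard-library sketch of that greedy idea on plain tuples. A NumPy version would be more typical for real embeddings, and the function names are hypothetical.

```python
import math


def most_central_index(vectors):
    """Index of the vector closest to the centroid (mean) of all vectors."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return min(range(len(vectors)), key=lambda i: math.dist(vectors[i], centroid))


def greedy_nearest_neighbor_order(vectors, start_index=0):
    """Visit each vector once, always moving to the closest unvisited one."""
    remaining = set(range(len(vectors)))
    remaining.discard(start_index)
    order = [start_index]
    while remaining:
        current = vectors[order[-1]]
        # break distance ties by the lower index so the result is deterministic
        nearest = min(remaining, key=lambda i: (math.dist(current, vectors[i]), i))
        order.append(nearest)
        remaining.remove(nearest)
    return order


points = [(0.0, 0.0), (10.0, 0.0), (1.0, 0.0), (11.0, 0.0), (5.0, 0.0)]
start = most_central_index(points)
order = greedy_nearest_neighbor_order(points, start)
```

The greedy walk keeps similar items adjacent but can leave a long jump near the end, which may be one reason the notebook now uses `agglomerative_cluster_sort` in the next cell.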


In [29]:
leaf_order = agglomerative_cluster_sort(embedding_df)
AIdf = AIdf.iloc[leaf_order]


In [30]:
AIdf = AIdf.reset_index(drop=True)
AIdf

Unnamed: 0,title_clean,id,src,title,url,isAI,date
0,ChatGPTshowsbettermoraljudgementthanacollegeun...,20,Ars Technica,ChatGPT shows better moral judgement than a co...,https://arstechnica.com/ai/2024/05/chatgpt-sho...,True,2024-05-01
1,Excessiveuseofwordslike'commendable'and'meticu...,388,Google News,Excessive use of words like 'commendable' and ...,https://news.google.com/articles/CBMirwFodHRwc...,True,2024-05-01
2,AwaytomakeChatGPTlesschattywhenaskingforcode,967,Reddit,A way to make ChatGPT less chatty when asking ...,https://www.reddit.com/r/ChatGPT/comments/1chi...,True,2024-05-01
3,NationalArchivesBansEmployeeUseofChatGPT,203,Feedly AI,National Archives Bans Employee Use of ChatGPT,https://www.404media.co/national-archives-bans...,True,2024-05-01
4,ChatGPTjustgottwoupdatesthataddressmajorpainpo...,225,Feedly AI,ChatGPT just got two updates that address majo...,https://www.zdnet.com/article/chatgpt-just-got...,True,2024-05-01
...,...,...,...,...,...,...,...
107,Exclusive:NewAI-poweredIterablefeatureshelpbra...,226,Feedly AI,Exclusive: New AI-powered Iterable features he...,https://venturebeat.com/ai/exclusive-new-ai-po...,True,2024-05-01
108,LatticelaunchesnewAIfeaturesforperformancemana...,246,Feedly AI,Lattice launches new AI features for performan...,https://www.hr-brew.com/stories/2024/05/01/lat...,True,2024-05-01
109,AIvoiceanalysisgivessuicidehotlineworkersanemo...,232,Feedly AI,AI voice analysis gives suicide hotline worker...,https://newatlas.com/technology/ai-emotion-spe...,True,2024-05-01
110,"ShowHN:IbuiltthisAIsupportedcareerproduct,soyo...",236,Feedly AI,Show HN: I built this AI supported career prod...,https://interviewtraininggermany.com/course/tr...,True,2024-05-01


In [31]:
html_str = ""
for row in AIdf.itertuples():
    log(f"[{row.Index}. {row.title} - {row.src}]({row.url})")
    html_str += f'{row.Index}.<a href="{row.url}">{row.title} - {row.src}</a><br />\n'


2024-05-01 13:29:31,684 - AInewsbot - INFO - [0. ChatGPT shows better moral judgement than a college undergrad - Ars Technica](https://arstechnica.com/ai/2024/05/chatgpt-shows-better-moral-judgement-than-a-college-undergrad/)
2024-05-01 13:29:31,687 - AInewsbot - INFO - [1. Excessive use of words like 'commendable' and 'meticulous' suggests ChatGPT has been used in thousands of scientific studies - Google News](https://news.google.com/articles/CBMirwFodHRwczovL2VuZ2xpc2guZWxwYWlzLmNvbS9zY2llbmNlLXRlY2gvMjAyNC0wNC0yNS9leGNlc3NpdmUtdXNlLW9mLXdvcmRzLWxpa2UtY29tbWVuZGFibGUtYW5kLW1ldGljdWxvdXMtc3VnZ2VzdC1jaGF0Z3B0LWhhcy1iZWVuLXVzZWQtaW4tdGhvdXNhbmRzLW9mLXNjaWVudGlmaWMtc3R1ZGllcy5odG1s0gEA)
2024-05-01 13:29:31,688 - AInewsbot - INFO - [2. A way to make ChatGPT less chatty when asking for code - Reddit](https://www.reddit.com/r/ChatGPT/comments/1chi1os/a_way_to_make_chatgpt_less_chatty_when_asking_for/)
2024-05-01 13:29:31,689 - AInewsbot - INFO - [3. National Archives Bans Employee Use of Ch
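The loop above interpolates titles and URLs directly into HTML. Headlines containing `&`, `<` or quotes could then produce malformed markup in the email. A hedged sketch of the same rendering with escaping, using plain dicts in place of dataframe rows and a hypothetical `render_links` helper:

```python
from html import escape


def render_links(rows):
    """Build numbered HTML links, escaping text and attribute values."""
    lines = []
    for index, row in enumerate(rows):
        url = escape(row["url"], quote=True)
        label = escape(f'{row["title"]} - {row["src"]}')
        lines.append(f'{index}. <a href="{url}">{label}</a><br />')
    return "\n".join(lines)


rows = [{"title": "AT&T tests <AI> bots", "src": "Example",
         "url": "https://example.com/?a=1&b=2"}]
rendered = render_links(rows)
```

Escaping `&` inside the `href` is valid HTML; mail clients decode it back to the original URL when the link is followed.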

In [32]:
log("Sending mail")
from_addr = os.getenv("GMAIL_USER")
to_addr = os.getenv("GMAIL_USER")
subject = 'AI news ' + datetime.now().strftime('%H:%M:%S')
body = f"""
<html>
    <head></head>
    <body>
    <div>
    {html_str}
    </div>
    </body>
</html>
"""

# Setup the MIME
message = MIMEMultipart()
message['From'] = from_addr
message['To'] = to_addr
message['Subject'] = subject
message.attach(MIMEText(body, 'html'))

# Create SMTP session
with smtplib.SMTP('smtp.gmail.com', 587) as server:
    server.starttls()  # Secure the connection
    server.login(from_addr, os.getenv("GMAIL_PASSWORD"))
    text = message.as_string()
    server.sendmail(from_addr, to_addr, text)

log("Finished")


2024-05-01 13:29:32,688 - AInewsbot - INFO - Sending mail
2024-05-01 13:29:34,000 - AInewsbot - INFO - Finished


20
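When changing the email step, it can help to build the message separately from sending it so the headers and body can be checked without an SMTP connection. A small sketch, with a hypothetical `build_message` helper and placeholder addresses:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(sender, recipient, subject, html_body):
    """Assemble an HTML email without sending it, so it can be inspected first."""
    message = MIMEMultipart()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.attach(MIMEText(html_body, "html"))
    return message


# Placeholder addresses and body, not the notebook's real values
msg = build_message("bot@example.com", "me@example.com", "AI news 13:29:32",
                    "<html><body><div>0. headline</div></body></html>")
html_part = msg.get_payload()[0]
```

The returned object can be passed to `server.sendmail(sender, recipient, msg.as_string())` in the same way the cell above does.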