AInewsbot.ipynb

- Automate collecting daily AI news
- Open URLs of news sites specified in `sources` dict (sources.yaml) using Selenium and Firefox
- Save HTML of each URL in htmldata directory
- Extract URLs from all files, create a pandas dataframe with url, title, src
- Use ChatGPT to filter only AI-related headlines by sending a prompt and formatted table of headlines
- Use SQLite to filter headlines previously seen 
- OPENAI_API_KEY should be in the environment or in a .env file
  
Alternative manual workflow to get HTML files if necessary
- Use Chrome, open e.g. Tech News bookmark folder, right-click and open all bookmarks in new window
- On Google News, make sure to switch to the AI tab
- On Google News, Feedly, Reddit, scroll to additional pages as desired
- Use SingleFile extension, 'save all tabs'
- Move files to htmldata directory
- Run lower part of notebook to process the data


In [36]:
from datetime import datetime
import time
import re
import os
import yaml
import dotenv
import sqlite3
import unicodedata

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# import bs4
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By
# use firefox v. chrome b/c it updates less often, can disable updates
# recommend importing profile from Chrome for cookies, passwords
# looks less like a bot with more user cruft in the profile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

from openai import OpenAI

from ainb_const import (DOWNLOAD_DIR,
                        SOURCECONFIG, PROMPT)
from ainb_utilities import log, delete_files, filter_unseen_urls_db, insert_article, nearest_neighbor_sort
from ainb_webscrape import init_browser, get_file, parse_file, get_og_tags, get_path_from_url, trimmed_href
from ainb_llm import paginate_df, process_pages

In [2]:
SOURCECONFIG = "sources.yaml"
DOWNLOAD_DIR = "htmldata"

# load secrets, credentials from .env
dotenv.load_dotenv()


True

In [3]:
PROMPT = """
You will act as a research assistant classifying news stories as related to artificial intelligence (AI) or unrelated to AI.

Your task is to read JSON format objects from an input list of news stories using the schema below delimited by |,
and output JSON format objects for each using the schema below delimited by ~.

Define a list of objects representing news stories in JSON format as in the following example:
|
{'stories':
[{'id': 97, 'title': 'AI to predict dementia, detect cancer'},
 {'id': 103,'title': 'Figure robot learns to make coffee by watching humans for 10 hours'},
 {'id': 104,'title': 'Baby trapped in refrigerator eats own foot'},
 {'id': 210,'title': 'ChatGPT removes, then reinstates a summarization assistant without explanation.'},
 {'id': 298,'title': 'The 5 most interesting PC monitors from CES 2024'},
 ]
}
|

Based on the title, you will classify each story as being about AI or not.

For each object, you will output the input id field, and a field named isAI which is true if the input title is about AI and false if the input title is not about AI.

When extracting information please make sure it matches the JSON format below exactly. Do not output any attributes that do not appear in the schema below.
~
{'stories':
[{'id': 97, 'isAI': true},
 {'id': 103, 'isAI': true},
 {'id': 104, 'isAI': false},
 {'id': 210, 'isAI': true},
 {'id': 298, 'isAI': false}]
}
~

You may interpret the term AI broadly as pertaining to
- machine learning models
- large language models
- robotics
- reinforcement learning
- computer vision
- OpenAI
- ChatGPT
- other closely related topics.

You will return an array of valid JSON objects.

The field 'id' in the output must match the field 'id' in the input EXACTLY.

The field 'isAI' must be either true or false.

The list of news stories to classify and enrich is:


"""

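The prompt ends with "The list of news stories to classify and enrich is:", so a page of headlines is presumably appended before each request. `process_pages` is not shown in this notebook, so the following is only a minimal sketch of that step. `build_prompt` and `parse_response` are hypothetical names, and the sketch assumes the model replies with strict double-quoted JSON rather than the single-quoted style used in the prompt's examples.

```python
import json


def build_prompt(prompt, stories):
    """Append one page of stories to the prompt as a JSON object (hypothetical helper)."""
    payload = {"stories": [{"id": s["id"], "title": s["title"]} for s in stories]}
    return prompt + json.dumps(payload)


def parse_response(text, stories):
    """Parse the reply and check that it covers exactly the ids that were sent."""
    results = json.loads(text)["stories"]
    sent = sorted(s["id"] for s in stories)
    received = sorted(r["id"] for r in results)
    if sent != received:
        raise ValueError(f"id mismatch: sent {sent}, received {received}")
    for r in results:
        if not isinstance(r["isAI"], bool):
            raise ValueError(f"isAI must be true or false, got {r['isAI']!r}")
    return results


page = [
    {"id": 97, "title": "AI to predict dementia, detect cancer"},
    {"id": 298, "title": "The 5 most interesting PC monitors from CES 2024"},
]
request_text = build_prompt("Classify these stories:\n", page)

# A made-up reply in the shape the prompt asks for
reply = '{"stories": [{"id": 97, "isAI": true}, {"id": 298, "isAI": false}]}'
classified = parse_response(reply, page)
```

Checking the returned ids against the ids sent catches dropped or invented rows before the merge step further down.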
In [4]:
get_og_tags('https://druce.ai')


2024-04-28 13:34:04,430 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): druce.ai:443
2024-04-28 13:34:04,485 - urllib3.connectionpool - DEBUG - https://druce.ai:443 "GET / HTTP/1.1" 200 32623


{'og:site_name': 'Druce.ai',
 'og:title': 'Druce.ai',
 'og:type': 'website',
 'og:description': "Druce's Blog on Machine Learning, Tech, Markets and Economics",
 'og:url': 'https://druce.ai/',
 'title': 'Druce.ai'}
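
The implementation of `get_og_tags` lives in `ainb_webscrape` and is not shown here. Judging by the output above, it returns the page's Open Graph `<meta>` properties plus the `<title>`. A rough standard-library sketch of that extraction, working on an HTML string instead of fetching a URL (`OGTagParser` and `og_tags_from_html` are hypothetical names):

```python
from html.parser import HTMLParser


class OGTagParser(HTMLParser):
    """Collect <meta property="og:..."> values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.tags = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.tags[prop] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.tags["title"] = self.tags.get("title", "") + data.strip()


def og_tags_from_html(html):
    """Return a dict of og:* properties and the page title found in html."""
    parser = OGTagParser()
    parser.feed(html)
    return parser.tags


sample = """<html><head>
<title>Druce.ai</title>
<meta property="og:title" content="Druce.ai">
<meta property="og:type" content="website">
</head><body></body></html>"""
tags = og_tags_from_html(sample)
```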

In [5]:
get_path_from_url('https://druce.ai/2024/03/gemini-summarize-book')


'/2024/03/gemini-summarize-book'

In [6]:
trimmed_href('https://druce.ai/2024/03/gemini-summarize-book?xyz')


'https://druce.ai/2024/03/gemini-summarize-book'
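
The two helpers above are imported from `ainb_webscrape`, so their source is not visible here. The same results can be reproduced with `urllib.parse`. This is a sketch under that assumption, with hypothetical names chosen so as not to shadow the real helpers:

```python
from urllib.parse import urlparse, urlunparse


def path_from_url(url):
    """Return only the path component of a URL."""
    return urlparse(url).path


def strip_query(url):
    """Drop params, query string and fragment, keeping scheme, host and path."""
    parts = urlparse(url)
    return urlunparse(parts._replace(params="", query="", fragment=""))


path = path_from_url("https://druce.ai/2024/03/gemini-summarize-book")
trimmed = strip_query("https://druce.ai/2024/03/gemini-summarize-book?xyz")
```

Trimming the query string before deduplicating means that tracking parameters do not make the same article look like several different URLs.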

In [7]:
# load sources to scrape from sources.yaml
with open(SOURCECONFIG, "r") as stream:
    try:
        sources = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

log(f"Load {len(sources)} sources")



2024-04-28 13:34:04,535 - AInewsbot - INFO - Load 17 sources


20

In [8]:
sources_reverse = {}
for k, v in sources.items():
    log(f"{k} -> {v['url']} -> {v['title']}.html")
    v['sourcename'] = k
    # map filename (title) to source name
    sources_reverse[v['title']] = k

sources_reverse

2024-04-28 13:34:04,538 - AInewsbot - INFO - Ars Technica -> https://arstechnica.com/ -> Ars Technica.html
2024-04-28 13:34:04,539 - AInewsbot - INFO - Bloomberg Tech -> https://www.bloomberg.com/technology -> Bloomberg Technology - Bloomberg.html
2024-04-28 13:34:04,539 - AInewsbot - INFO - Business Insider -> https://www.businessinsider.com/tech -> Tech - Business Insider.html
2024-04-28 13:34:04,539 - AInewsbot - INFO - FT Tech -> https://www.ft.com/technology -> Technology.html
2024-04-28 13:34:04,540 - AInewsbot - INFO - Feedly AI -> https://feedly.com/i/aiFeeds?options=eyJsYXllcnMiOlt7InBhcnRzIjpbeyJpZCI6Im5scC9mL3RvcGljLzMwMDAifV0sInNlYXJjaEhpbnQiOiJ0ZWNobm9sb2d5IiwidHlwZSI6Im1hdGNoZXMiLCJzYWxpZW5jZSI6ImFib3V0In1dLCJidW5kbGVzIjpbeyJ0eXBlIjoic3RyZWFtIiwiaWQiOiJ1c2VyLzYyZWViYjlmLTcxNTEtNGY5YS1hOGM3LTlhNTdiODIwNTMwOC9jYXRlZ29yeS9HYWRnZXRzIn1dfQ -> Discover and Add New Feedly AI Feeds.html
2024-04-28 13:34:04,540 - AInewsbot - INFO - Google News -> https://news.google.com/topics/CAA

{'Ars Technica': 'Ars Technica',
 'Bloomberg Technology - Bloomberg': 'Bloomberg Tech',
 'Tech - Business Insider': 'Business Insider',
 'Technology': 'FT Tech',
 'Discover and Add New Feedly AI Feeds': 'Feedly AI',
 'Google News - Technology - Artificial intelligence': 'Google News',
 'Hacker News Page 1': 'Hacker News',
 'Hacker News Page 2': 'Hacker News 2',
 'HackerNoon - read, write and learn about any technology': 'HackerNoon',
 'Technology - The New York Times': 'NYT Tech',
 'top scoring links _ multi': 'Reddit',
 'Techmeme': 'Techmeme',
 'The Register_ Enterprise Technology News and Analysis': 'The Register',
 'Artificial Intelligence - The Verge': 'The Verge',
 'AI News _ VentureBeat': 'VentureBeat',
 'Technology - WSJ.com': 'WSJ Tech',
 'Technology - The Washington Post': 'WaPo Tech'}

In [21]:
# # get existing files
# # List all paths in the directory matching today's date
# nfiles = 50

# # Get the current date
# today = datetime.now()
# year, month, day = today.year, today.month, today.day

# datestr = datetime.now().strftime("%m_%d_%Y")

# # log(f"Year: {year}, Month: {month}, Day: {day}")

# files = [os.path.join(DOWNLOAD_DIR, file) for file in os.listdir(DOWNLOAD_DIR)]
# # filter files only
# files = [file for file in files if os.path.isfile(file)]

# # Sort files by modification time and keep the nfiles most recent
# files.sort(key=lambda x: os.path.getmtime(x), reverse=True)
# files = files[:nfiles]

# # keep only files with today's date ending in .html
# files = [file for file in files if datestr in file and file.endswith(".html")]
# log(len(files))
# for file in files:
#     log(file)

2024-04-28 13:44:02,910 - AInewsbot - INFO - 17
2024-04-28 13:44:02,913 - AInewsbot - INFO - htmldata/Technology - The Washington Post (04_28_2024 01_39_31 PM).html
2024-04-28 13:44:02,914 - AInewsbot - INFO - htmldata/Technology - WSJ.com (04_28_2024 01_39_20 PM).html
2024-04-28 13:44:02,914 - AInewsbot - INFO - htmldata/AI News _ VentureBeat (04_28_2024 01_39_09 PM).html
2024-04-28 13:44:02,915 - AInewsbot - INFO - htmldata/Artificial Intelligence - The Verge (04_28_2024 01_38_59 PM).html
2024-04-28 13:44:02,916 - AInewsbot - INFO - htmldata/The Register_ Enterprise Technology News and Analysis (04_28_2024 01_38_48 PM).html
2024-04-28 13:44:02,916 - AInewsbot - INFO - htmldata/Techmeme (04_28_2024 01_38_38 PM).html
2024-04-28 13:44:02,917 - AInewsbot - INFO - htmldata/top scoring links _ multi (04_28_2024 01_38_27 PM).html
2024-04-28 13:44:02,917 - AInewsbot - INFO - htmldata/Technology - The New York Times (04_28_2024 01_37_54 PM).html
2024-04-28 13:44:02,918 - AInewsbot - INFO - ht

In [22]:
# get a proper file list from existing files, instead of using list returned
# saved_pages = []
# for file in files:
    
# # Extract source name from path
#     filename = os.path.basename(file)

#     # Find the position of the date stamp, e.g. ' (04_28_2024', in the filename
#     position = filename.find(" (" + datestr)
#     basename = filename[:position]
#     sourcename = sources_reverse.get(basename)
#     if sourcename is None:
#         log(f"Skipping {basename}, no sourcename metadata")
#         continue
#     sources[sourcename]['latest'] = file
#     saved_pages.append((sourcename, file))
    
# saved_pages


[('WaPo Tech',
  'htmldata/Technology - The Washington Post (04_28_2024 01_39_31 PM).html'),
 ('WSJ Tech', 'htmldata/Technology - WSJ.com (04_28_2024 01_39_20 PM).html'),
 ('VentureBeat',
  'htmldata/AI News _ VentureBeat (04_28_2024 01_39_09 PM).html'),
 ('The Verge',
  'htmldata/Artificial Intelligence - The Verge (04_28_2024 01_38_59 PM).html'),
 ('The Register',
  'htmldata/The Register_ Enterprise Technology News and Analysis (04_28_2024 01_38_48 PM).html'),
 ('Techmeme', 'htmldata/Techmeme (04_28_2024 01_38_38 PM).html'),
 ('Reddit',
  'htmldata/top scoring links _ multi (04_28_2024 01_38_27 PM).html'),
 ('NYT Tech',
  'htmldata/Technology - The New York Times (04_28_2024 01_37_54 PM).html'),
 ('HackerNoon',
  'htmldata/HackerNoon - read, write and learn about any technology (04_28_2024 01_37_43 PM).html'),
 ('Hacker News 2',
  'htmldata/Hacker News Page 2 (04_28_2024 01_37_31 PM).html'),
 ('Hacker News', 'htmldata/Hacker News Page 1 (04_28_2024 01_37_21 PM).html'),
 ('Google New

In [None]:
# Get HTML files from sources

# empty download directory
delete_files(DOWNLOAD_DIR)

# launch browser via selenium driver
driver = init_browser()

# save each file specified from sources
log("Saving HTML files")
saved_pages = []
for sourcename, sourcedict in sources.items():
    log(f'Processing {sourcename}')
    sourcefile = get_file(sourcedict, driver=driver)
    saved_pages.append((sourcename, sourcefile))

# Close the browser
log("Quit webdriver")
driver.quit()
# finished downloading files


2024-04-28 13:34:06,496 - AInewsbot - INFO - init_browser - Initializing webdriver
2024-04-28 13:34:18,119 - AInewsbot - INFO - init_browser - Initialized webdriver profile
2024-04-28 13:34:18,120 - AInewsbot - INFO - init_browser - Initialized webdriver service
2024-04-28 13:34:18,134 - selenium.webdriver.common.service - DEBUG - Started executable: `/Users/drucev/webdrivers/geckodriver` in a child process with pid: 98882 using 0 to output -3
2024-04-28 13:34:43,439 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:53933/session {'capabilities': {'firstMatch': [{}], 'alwaysMatch': {'browserName': 'firefox', 'acceptInsecureCerts': True, 'moz:debuggerAddress': True, 'pageLoadStrategy': <PageLoadStrategy.normal: 'normal'>, 'moz:firefoxOptions': {'profile': 'UEsDBBQAAAAIAAVtm1i4mTJ/BQEAAACAAAAaAAAAc3RvcmFnZS1zeW5jLXYyLnNxbGl0ZS1zaG3t3DtOlGEUBuB3hmEGRRDUwRHl...'}}}}
2024-04-28 13:34:43,441 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): loca

In [15]:
print(len(saved_pages))
for sourcename, page in saved_pages:
    # sources[sourcename]['latest'] = page
    print(sourcename, '->', page)
    

17
Ars Technica -> htmldata/Ars Technica (04_28_2024 01_35_05 PM).html
Bloomberg Tech -> htmldata/Bloomberg Technology - Bloomberg (04_28_2024 01_35_16 PM).html
Business Insider -> htmldata/Tech - Business Insider (04_28_2024 01_35_27 PM).html
FT Tech -> htmldata/Technology (04_28_2024 01_35_37 PM).html
Feedly AI -> htmldata/Discover and Add New Feedly AI Feeds (04_28_2024 01_36_38 PM).html
Google News -> htmldata/Google News - Technology - Artificial intelligence (04_28_2024 01_37_10 PM).html
Hacker News -> htmldata/Hacker News Page 1 (04_28_2024 01_37_21 PM).html
Hacker News 2 -> htmldata/Hacker News Page 2 (04_28_2024 01_37_31 PM).html
HackerNoon -> htmldata/HackerNoon - read, write and learn about any technology (04_28_2024 01_37_43 PM).html
NYT Tech -> htmldata/Technology - The New York Times (04_28_2024 01_37_54 PM).html
Reddit -> htmldata/top scoring links _ multi (04_28_2024 01_38_27 PM).html
Techmeme -> htmldata/Techmeme (04_28_2024 01_38_38 PM).html
The Register -> htmldata/T

In [24]:
# Parse news URLs and titles from downloaded HTML files
log("parsing html files")
all_urls = []
for sourcename, filename in saved_pages:
    print(sourcename, '->', filename, flush=True)
    log(f"{sourcename}", "parse loop")
    links = parse_file(sources[sourcename])
    log(f"{len(links)} links found", "parse loop")
    all_urls.extend(links)

log(f"found {len(all_urls)} links", "parse loop")

2024-04-28 13:46:06,319 - AInewsbot - INFO - parsing html files


WaPo Tech -> htmldata/Technology - The Washington Post (04_28_2024 01_39_31 PM).html


2024-04-28 13:46:06,321 - AInewsbot - INFO - parse loop - WaPo Tech
2024-04-28 13:46:06,346 - AInewsbot - INFO - parse_file - found 167 raw links
2024-04-28 13:46:06,350 - AInewsbot - INFO - parse_file - found 28 filtered links
2024-04-28 13:46:06,351 - AInewsbot - INFO - parse loop - 28 links found


WSJ Tech -> htmldata/Technology - WSJ.com (04_28_2024 01_39_20 PM).html


2024-04-28 13:46:06,352 - AInewsbot - INFO - parse loop - WSJ Tech
2024-04-28 13:46:06,392 - AInewsbot - INFO - parse_file - found 493 raw links
2024-04-28 13:46:06,398 - AInewsbot - INFO - parse_file - found 6 filtered links
2024-04-28 13:46:06,398 - AInewsbot - INFO - parse loop - 6 links found


VentureBeat -> htmldata/AI News _ VentureBeat (04_28_2024 01_39_09 PM).html


2024-04-28 13:46:06,398 - AInewsbot - INFO - parse loop - VentureBeat
2024-04-28 13:46:06,415 - AInewsbot - INFO - parse_file - found 323 raw links
2024-04-28 13:46:06,419 - AInewsbot - INFO - parse_file - found 45 filtered links
2024-04-28 13:46:06,420 - AInewsbot - INFO - parse loop - 45 links found


The Verge -> htmldata/Artificial Intelligence - The Verge (04_28_2024 01_38_59 PM).html


2024-04-28 13:46:06,420 - AInewsbot - INFO - parse loop - The Verge
2024-04-28 13:46:06,496 - AInewsbot - INFO - parse_file - found 303 raw links
2024-04-28 13:46:06,499 - AInewsbot - INFO - parse_file - found 25 filtered links
2024-04-28 13:46:06,500 - AInewsbot - INFO - parse loop - 25 links found


The Register -> htmldata/The Register_ Enterprise Technology News and Analysis (04_28_2024 01_38_48 PM).html


2024-04-28 13:46:06,500 - AInewsbot - INFO - parse loop - The Register
2024-04-28 13:46:06,517 - AInewsbot - INFO - parse_file - found 200 raw links
2024-04-28 13:46:06,520 - AInewsbot - INFO - parse_file - found 88 filtered links
2024-04-28 13:46:06,520 - AInewsbot - INFO - parse loop - 88 links found


Techmeme -> htmldata/Techmeme (04_28_2024 01_38_38 PM).html


2024-04-28 13:46:06,521 - AInewsbot - INFO - parse loop - Techmeme
2024-04-28 13:46:06,537 - AInewsbot - INFO - parse_file - found 369 raw links
2024-04-28 13:46:06,542 - AInewsbot - INFO - parse_file - found 157 filtered links
2024-04-28 13:46:06,542 - AInewsbot - INFO - parse loop - 157 links found


Reddit -> htmldata/top scoring links _ multi (04_28_2024 01_38_27 PM).html


2024-04-28 13:46:06,542 - AInewsbot - INFO - parse loop - Reddit
2024-04-28 13:46:06,623 - AInewsbot - INFO - parse_file - found 555 raw links
2024-04-28 13:46:06,632 - AInewsbot - INFO - parse_file - found 361 filtered links
2024-04-28 13:46:06,633 - AInewsbot - INFO - parse loop - 361 links found


NYT Tech -> htmldata/Technology - The New York Times (04_28_2024 01_37_54 PM).html


2024-04-28 13:46:06,633 - AInewsbot - INFO - parse loop - NYT Tech
2024-04-28 13:46:06,643 - AInewsbot - INFO - parse_file - found 72 raw links
2024-04-28 13:46:06,644 - AInewsbot - INFO - parse_file - found 18 filtered links
2024-04-28 13:46:06,644 - AInewsbot - INFO - parse loop - 18 links found


HackerNoon -> htmldata/HackerNoon - read, write and learn about any technology (04_28_2024 01_37_43 PM).html


2024-04-28 13:46:06,645 - AInewsbot - INFO - parse loop - HackerNoon
2024-04-28 13:46:06,741 - AInewsbot - INFO - parse_file - found 554 raw links
2024-04-28 13:46:06,748 - AInewsbot - INFO - parse_file - found 83 filtered links
2024-04-28 13:46:06,749 - AInewsbot - INFO - parse loop - 83 links found


Hacker News 2 -> htmldata/Hacker News Page 2 (04_28_2024 01_37_31 PM).html


2024-04-28 13:46:06,749 - AInewsbot - INFO - parse loop - Hacker News 2
2024-04-28 13:46:06,760 - AInewsbot - INFO - parse_file - found 261 raw links
2024-04-28 13:46:06,769 - AInewsbot - INFO - parse_file - found 18 filtered links
2024-04-28 13:46:06,769 - AInewsbot - INFO - parse loop - 18 links found


Hacker News -> htmldata/Hacker News Page 1 (04_28_2024 01_37_21 PM).html


2024-04-28 13:46:06,769 - AInewsbot - INFO - parse loop - Hacker News
2024-04-28 13:46:06,781 - AInewsbot - INFO - parse_file - found 257 raw links
2024-04-28 13:46:06,784 - AInewsbot - INFO - parse_file - found 27 filtered links
2024-04-28 13:46:06,785 - AInewsbot - INFO - parse loop - 27 links found


Google News -> htmldata/Google News - Technology - Artificial intelligence (04_28_2024 01_37_10 PM).html


2024-04-28 13:46:06,785 - AInewsbot - INFO - parse loop - Google News
2024-04-28 13:46:07,053 - AInewsbot - INFO - parse_file - found 959 raw links
2024-04-28 13:46:07,059 - AInewsbot - INFO - parse_file - found 411 filtered links
2024-04-28 13:46:07,060 - AInewsbot - INFO - parse loop - 411 links found


Feedly AI -> htmldata/Discover and Add New Feedly AI Feeds (04_28_2024 01_36_38 PM).html


2024-04-28 13:46:07,060 - AInewsbot - INFO - parse loop - Feedly AI
2024-04-28 13:46:07,087 - AInewsbot - INFO - parse_file - found 223 raw links
2024-04-28 13:46:07,090 - AInewsbot - INFO - parse_file - found 65 filtered links
2024-04-28 13:46:07,090 - AInewsbot - INFO - parse loop - 65 links found


FT Tech -> htmldata/Technology (04_28_2024 01_35_37 PM).html


2024-04-28 13:46:07,091 - AInewsbot - INFO - parse loop - FT Tech
2024-04-28 13:46:07,116 - AInewsbot - INFO - parse_file - found 457 raw links
2024-04-28 13:46:07,121 - AInewsbot - INFO - parse_file - found 104 filtered links
2024-04-28 13:46:07,121 - AInewsbot - INFO - parse loop - 104 links found


Business Insider -> htmldata/Tech - Business Insider (04_28_2024 01_35_27 PM).html


2024-04-28 13:46:07,122 - AInewsbot - INFO - parse loop - Business Insider
2024-04-28 13:46:07,145 - AInewsbot - INFO - parse_file - found 339 raw links
2024-04-28 13:46:07,149 - AInewsbot - INFO - parse_file - found 64 filtered links
2024-04-28 13:46:07,150 - AInewsbot - INFO - parse loop - 64 links found


Bloomberg Tech -> htmldata/Bloomberg Technology - Bloomberg (04_28_2024 01_35_16 PM).html


2024-04-28 13:46:07,150 - AInewsbot - INFO - parse loop - Bloomberg Tech
2024-04-28 13:46:07,172 - AInewsbot - INFO - parse_file - found 303 raw links
2024-04-28 13:46:07,175 - AInewsbot - INFO - parse_file - found 52 filtered links
2024-04-28 13:46:07,175 - AInewsbot - INFO - parse loop - 52 links found


Ars Technica -> htmldata/Ars Technica (04_28_2024 01_35_05 PM).html


2024-04-28 13:46:07,176 - AInewsbot - INFO - parse loop - Ars Technica
2024-04-28 13:46:07,190 - AInewsbot - INFO - parse_file - found 252 raw links
2024-04-28 13:46:07,192 - AInewsbot - INFO - parse_file - found 29 filtered links
2024-04-28 13:46:07,192 - AInewsbot - INFO - parse loop - 29 links found
2024-04-28 13:46:07,193 - AInewsbot - INFO - parse loop - found 1581 links


20

In [25]:
# make a pandas dataframe of all the links found
orig_df = (
    pd.DataFrame(all_urls)
    .groupby("url")
    .first()
    .reset_index()
    .sort_values("src")[["src", "title", "url"]]
    .reset_index(drop=True)
    .reset_index(drop=False)
    .rename(columns={"index": "id"})
)
print(len(orig_df))
orig_df.head()

1284


Unnamed: 0,id,src,title,url
0,0,Ars Technica,Russia stands alone in vetoing UN resolution o...,https://arstechnica.com/space/2024/04/no-surpr...
1,1,Ars Technica,"Ubuntu 24.04 LTS, Noble Numbat, overhauls its ...",https://arstechnica.com/gadgets/2024/04/ubuntu...
2,2,Ars Technica,Can an online library of classic video games e...,https://arstechnica.com/gaming/2024/04/can-an-...
3,3,Ars Technica,Garry’s Mod is taking down 20 years’ worth of “...,https://arstechnica.com/gaming/2024/04/garrys-...
4,4,Ars Technica,Switch 2 reportedly replaces slide-in Joy-Cons...,https://arstechnica.com/gaming/2024/04/report-...


In [26]:
filtered_df = filter_unseen_urls_db(orig_df)


2024-04-28 13:46:21,781 - AInewsbot - INFO - Existing URLs: 54702
2024-04-28 13:46:21,789 - AInewsbot - INFO - New URLs: 131
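
`filter_unseen_urls_db` comes from `ainb_utilities` and its source is not shown. The log suggests it compares the candidate URLs with those already stored in SQLite and keeps only the new ones. A minimal sketch of that idea follows. The table name and columns are assumptions based on the arguments passed to `insert_article` later in the notebook, and `filter_unseen` is a hypothetical name.

```python
import sqlite3


def filter_unseen(conn, candidates):
    """Return the candidates whose 'url' is not yet in the (assumed) articles table."""
    seen = {row[0] for row in conn.execute("SELECT url FROM articles")}
    return [c for c in candidates if c["url"] not in seen]


# In-memory database with an assumed schema, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles "
    "(src TEXT, title TEXT, url TEXT UNIQUE, isAI BOOLEAN, article_date DATE)"
)
conn.execute(
    "INSERT INTO articles (src, title, url, isAI, article_date) VALUES (?, ?, ?, ?, ?)",
    ("Example", "Old story", "https://example.com/old", True, "2024-04-27"),
)
conn.commit()

candidates = [
    {"src": "Example", "title": "Old story", "url": "https://example.com/old"},
    {"src": "Example", "title": "New story", "url": "https://example.com/new"},
]
new_items = filter_unseen(conn, candidates)
conn.close()
```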


In [27]:
# use chatgpt to filter AI-related headlines using a prompt to OpenAI

client = OpenAI()

# make pages that fit in a reasonably sized prompt
pages = paginate_df(filtered_df)

enriched_urls = process_pages(client, PROMPT, pages)

enriched_df = pd.DataFrame(enriched_urls)
enriched_df.head()

2024-04-28 13:46:29,264 - httpx - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-04-28 13:46:29,267 - httpx - DEBUG - load_verify_locations cafile='/opt/anaconda3/envs/ainewsbot/lib/python3.9/site-packages/certifi/cacert.pem'
2024-04-28 13:46:29,462 - AInewsbot - INFO - send page 1 of 3, 50 items 
2024-04-28 13:46:29,464 - openai._base_client - DEBUG - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': '\nYou will act as a research assistant classifying news stories as related to artificial intelligence (AI) or unrelated to AI.\n\nYour task is to read JSON format objects from an input list of news stories using the schema below delimited by |,\nand output JSON format objects for each using the schema below delimited by ~.\n\nDefine a list of objects representing news stories in JSON format as in the following example:\n|\n{\'stories\':\n[{\'id\': 97, \'title\': \'AI to pre

Unnamed: 0,id,isAI
0,40,False
1,55,False
2,172,False
3,188,True
4,189,True


In [28]:
log("isAI", len(enriched_df.loc[enriched_df["isAI"]]))
log("not isAI", len(enriched_df.loc[~enriched_df["isAI"]]))


2024-04-28 13:47:47,652 - AInewsbot - INFO - 39 - isAI
2024-04-28 13:47:47,655 - AInewsbot - INFO - 92 - not isAI


20

In [29]:
# merge returned df into original df
merged_df = pd.merge(filtered_df, enriched_df, on="id", how="outer")
merged_df['date'] = datetime.now().date()
merged_df.head()


Unnamed: 0,id,src,title,url,isAI,date
0,40,Bloomberg Tech,Musk Makes Surprise China Visit in Search of T...,https://www.bloomberg.com/news/articles/2024-0...,False,2024-04-28
1,55,Bloomberg Tech,French Government Makes Offer for Part of Atos...,https://www.bloomberg.com/news/articles/2024-0...,False,2024-04-28
2,172,FT Tech,‘Call my agent’ producer backed by KKR buys Ge...,https://www.ft.com/content/439a5c2c-f335-43cc-...,False,2024-04-28
3,188,Feedly AI,Midjourney still reigns as an AI image generat...,https://qz.com/how-to-create-ai-images-on-midj...,True,2024-04-28
4,189,Feedly AI,Thousands of explicit AI 'girlfriend' ads foun...,https://sea.mashable.com/tech/32322/thousands-...,True,2024-04-28


In [30]:
# should be empty, shouldn't get back rows that don't match to existing
log(f"Unmatched response rows: {len(merged_df.loc[merged_df['src'].isna()])}")
# should be empty, should get back all rows from orig
log(f"Unmatched source rows: {len(merged_df.loc[merged_df['isAI'].isna()])}")


2024-04-28 13:47:56,581 - AInewsbot - INFO - Unmatched response rows: 0
2024-04-28 13:47:56,584 - AInewsbot - INFO - Unmatched source rows: 0


20

In [31]:
# update SQLite database with all seen articles
conn = sqlite3.connect('articles.db')
cursor = conn.cursor()
for row in merged_df.itertuples():
    insert_article(conn, cursor, row.src, row.title,
                   row.url, row.isAI, row.date)
    

In [32]:
AIdf = merged_df.loc[merged_df["isAI"]].reset_index(drop=True)
log(f"Found {len(AIdf)} AI headlines")


2024-04-28 13:48:08,925 - AInewsbot - INFO - Found 39 AI headlines


20

In [33]:
# map title to ascii characters to avoid some dupes with e.g. different quote symbols

def unicode_to_ascii(input_string):
    # Normalize the Unicode string to NFKD form
    normalized_string = unicodedata.normalize('NFKD', input_string)
    
    # Encode to ASCII bytes, ignoring characters that cannot be converted
    ascii_bytes = normalized_string.encode('ascii', 'ignore')
    
    # Convert bytes back to a string
    ascii_string = ascii_bytes.decode('ascii')
    
    return ascii_string

AIdf['title'] = AIdf['title'].apply(unicode_to_ascii)
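
A short standalone check of what this normalization does. This is a sketch that repeats the same three steps under the hypothetical name `to_ascii`. Accented letters decompose under NFKD and keep their base letter, whereas curly quotes have no ASCII decomposition and are simply dropped:

```python
import unicodedata


def to_ascii(text):
    """NFKD-normalize, then drop anything that is not ASCII."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


# Accents decompose into base letter + combining mark; the mark is dropped
accented = to_ascii("caf\u00e9 na\u00efve")

# Curly quotes (U+2019, U+201C, U+201D) do not decompose, so they vanish
curly = to_ascii("Garry\u2019s \u201cMod\u201d")
```

One consequence is that two titles differing only in the style of curly quote collapse to the same string, but a title with a straight apostrophe (`Garry's`) still differs from the curly-quote version (`Garrys`).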


In [34]:
# dedupe identical headlines
AIdf['title_clean'] = AIdf['title'].map(lambda s: "".join(s.split()))
AIdf = AIdf.sort_values("src") \
    .groupby("title_clean") \
    .first() \
    .reset_index()
log(f"Found {len(AIdf)} unique AI headlines")


2024-04-28 13:48:13,496 - AInewsbot - INFO - Found 37 unique AI headlines


20

In [37]:
log(f"Fetching embeddings for {len(AIdf)} headlines")
embedding_model = 'text-embedding-3-small'
response = client.embeddings.create(input=AIdf['title'].tolist(),
                                    model=embedding_model)
embedding_df = pd.DataFrame([e.embedding for e in response.data])
embedding_array = embedding_df.values

# find index of most central headline
centroid = embedding_array.mean(axis=0)
distances = np.linalg.norm(embedding_array - centroid, axis=1)
start_index = np.argmin(distances)

# Get the sorted indices and use them to sort the df
sorted_indices = nearest_neighbor_sort(embedding_array, start_index)
AIdf = AIdf.iloc[sorted_indices]


2024-04-28 13:48:56,448 - AInewsbot - INFO - Fetching embeddings for 37 headlines
2024-04-28 13:48:56,458 - openai._base_client - DEBUG - Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x144552c10>, 'json_data': {'input': ['A2RL May Be Autonomous, but Humans Are at Its Heart', 'AI-enhanced Flock Safety camera helps lead to arrest of Marquis Earl-Lee Savannah in deadly Blue Springs, Missouri shooting case', 'AI-enhanced camera technology helps solve Blue Springs murder', 'AI-enhanced camera technology helps solve murder', 'AI unleashes innovation yet raises ethical concerns, say pension executives | Asset Owners', 'Apple Offers Peek at Its AI Language Model as iOS 18 Looms', 'Apple removes three apps from App Store that claimed in ads they could create AI porn', 'ChatGPT Guide for Early Retirement: Realistic Blueprint for Your 40s and 50s', 'ChatGPT gives you completely made up links if you tell it t
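
`nearest_neighbor_sort` is imported from `ainb_utilities` and its source is not shown. The cell above starts from the headline closest to the centroid and orders the rest so that similar headlines end up next to each other. One simple way to get such an ordering is a greedy walk that always moves to the closest unvisited point. The sketch below illustrates that technique in pure Python. It is an assumption about the approach, not the notebook's actual implementation, and `greedy_nearest_neighbor_order` is a hypothetical name.

```python
import math


def greedy_nearest_neighbor_order(points, start_index):
    """Visit every point once, always moving to the closest unvisited point."""
    remaining = set(range(len(points)))
    remaining.remove(start_index)
    order = [start_index]
    while remaining:
        last = points[order[-1]]
        # Ties are broken by the lower index so the result is deterministic
        nxt = min(remaining, key=lambda i: (math.dist(last, points[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Two clusters: points 0 and 2 are close together, as are points 1 and 3
points = [(0.0, 0.0), (5.0, 5.0), (1.0, 0.0), (5.0, 6.0)]
order = greedy_nearest_neighbor_order(points, start_index=0)
```

The greedy walk takes O(n²) distance computations, which is fine for a few dozen headlines.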

In [38]:
AIdf = AIdf.reset_index(drop=True)
AIdf

Unnamed: 0,title_clean,id,src,title,url,isAI,date
0,NavigatingAIdevelopmentandgovernance,533,Google News,Navigating AI development and governance,https://news.google.com/articles/CBMiYWh0dHBzO...,True,2024-04-28
1,TheAIecosystemiscomplexanddynamic:Itsregulatio...,317,Google News,The AI ecosystem is complex and dynamic: Its r...,https://news.google.com/articles/CBMic2h0dHBzO...,True,2024-04-28
2,"AIunleashesinnovationyetraisesethicalconcerns,...",265,Google News,AI unleashes innovation yet raises ethical con...,https://news.google.com/articles/CBMid2h0dHBzO...,True,2024-04-28
3,TheThreatofAISafetytoAmericanAILeadership,567,Google News,The Threat of AI Safety to American AI Leadership,https://news.google.com/articles/CBMia2h0dHBzO...,True,2024-04-28
4,USHomelandSecurityEstablishesBlue-RibbonBoardw...,320,Google News,US Homeland Security Establishes Blue-Ribbon B...,https://news.google.com/articles/CBMic2h0dHBzO...,True,2024-04-28
5,"TheAItradeisback,asconfidenceinBigTechsurges",199,Feedly AI,"The AI trade is back, as confidence in Big Tec...",https://finance.yahoo.com/news/the-ai-trade-is...,True,2024-04-28
6,"MicrosoftandAlphabet:'KeeponBuying,'SayTopAnal...",286,Google News,"Microsoft and Alphabet: 'Keep on Buying,' Say ...",https://news.google.com/articles/CBMicWh0dHBzO...,True,2024-04-28
7,TopAICertificationsfor2024:ElevateYourCareer!,459,Google News,Top AI Certifications for 2024: Elevate Your C...,https://news.google.com/articles/CBMiT2h0dHBzO...,True,2024-04-28
8,TopArtificialIntelligenceAICoursesforBeginners...,574,Google News,Top Artificial Intelligence AI Courses for Beg...,https://news.google.com/articles/CBMiZWh0dHBzO...,True,2024-04-28
9,ThebestfreeAIcourses(andwhetherAI'micro-degree...,483,Google News,The best free AI courses (and whether AI 'micr...,https://news.google.com/articles/CBMiN2h0dHBzO...,True,2024-04-28


In [39]:
html_str = ""
for row in AIdf.itertuples():
    log(f"[{row.Index}. {row.title} - {row.src}]({row.url})")
    html_str += f'{row.Index}.<a href="{row.url}">{row.title} - {row.src}</a><br />\n'


2024-04-28 13:49:14,192 - AInewsbot - INFO - [0. Navigating AI development and governance - Google News](https://news.google.com/articles/CBMiYWh0dHBzOi8vbmV3cy5jZ3RuLmNvbS9uZXdzLzIwMjQtMDQtMjcvTmF2aWdhdGluZy1BSS1kZXZlbG9wbWVudC1hbmQtZ292ZXJuYW5jZS0xdDhFS3pCS3REcS9wLmh0bWzSAQA)
2024-04-28 13:49:14,194 - AInewsbot - INFO - [1. The AI ecosystem is complex and dynamic: Its regulation should acknowledge that - Google News](https://news.google.com/articles/CBMic2h0dHBzOi8vdGhlaGlsbC5jb20vb3Bpbmlvbi80NjIyNDI3LXRoZS1haS1lY29zeXN0ZW0taXMtY29tcGxleC1hbmQtZHluYW1pYy1pdHMtcmVndWxhdGlvbi1zaG91bGQtYWNrbm93bGVkZ2UtdGhhdC_SAXdodHRwczovL3RoZWhpbGwuY29tL29waW5pb24vNDYyMjQyNy10aGUtYWktZWNvc3lzdGVtLWlzLWNvbXBsZXgtYW5kLWR5bmFtaWMtaXRzLXJlZ3VsYXRpb24tc2hvdWxkLWFja25vd2xlZGdlLXRoYXQvYW1wLw)
2024-04-28 13:49:14,195 - AInewsbot - INFO - [2. AI unleashes innovation yet raises ethical concerns, say pension executives | Asset Owners - Google News](https://news.google.com/articles/CBMid2h0dHBzOi8vd3d3LmFzaWFuaW52
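
The loop above interpolates titles and URLs directly into HTML. Headlines containing `&`, `<` or quotes could then produce malformed markup in the email. A minimal sketch of the same line format with escaping via the standard library's `html.escape` follows (`headline_link` is a hypothetical helper and the sample values are made up):

```python
from html import escape


def headline_link(index, title, src, url):
    """Build one numbered link line with the text and the href attribute escaped."""
    return (f'{index}. <a href="{escape(url, quote=True)}">'
            f'{escape(title)} - {escape(src)}</a><br />\n')


line = headline_link(0, "AT&T <beta> AI", "Feedly AI",
                     "https://example.com/a?x=1&y=2")
```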

In [40]:
log("Sending mail")
from_addr = os.getenv("GMAIL_USER")
to_addr = os.getenv("GMAIL_USER")
subject = 'AI news ' + datetime.now().strftime('%H:%M:%S')
body = f"""
<html>
    <head></head>
    <body>
    <div>
    {html_str}
    </div>
    </body>
</html>
"""

# Setup the MIME
message = MIMEMultipart()
message['From'] = from_addr
message['To'] = to_addr
message['Subject'] = subject
message.attach(MIMEText(body, 'html'))

# Create SMTP session
with smtplib.SMTP('smtp.gmail.com', 587) as server:
    server.starttls()  # Secure the connection
    server.login(from_addr, os.getenv("GMAIL_PASSWORD"))
    server.sendmail(from_addr, to_addr, message.as_string())

log("Finished")


2024-04-28 13:49:21,163 - AInewsbot - INFO - Sending mail
2024-04-28 13:49:23,231 - AInewsbot - INFO - Finished


20