AInewsbot.ipynb

- Automate collecting daily AI news
- Open URLs of news sites specified in the `sources` dict (sources.yaml) using Selenium and Firefox
- Save HTML of each URL in htmldata directory
- Extract URLs from all files, create a pandas dataframe with url, title, src
- Use ChatGPT to filter only AI-related headlines by sending a prompt and formatted table of headlines
- Use SQLite to filter headlines previously seen 
- OPENAI_API_KEY should be in the environment or in a .env file
  
Alternative manual workflow to get HTML files if necessary
- Use Chrome, open e.g. Tech News bookmark folder, right-click and open all bookmarks in new window
- On Google News, make sure to switch to the AI tab
- On Google News, Feedly, Reddit, scroll to additional pages as desired
- Use SingleFile extension, 'save all tabs'
- Move files to htmldata directory
- Run lower part of notebook to process the data


In [45]:
from datetime import datetime
import os
import yaml
import dotenv
import sqlite3
import unicodedata
import json

import numpy as np
import pandas as pd

# import bs4
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, urlparse

from concurrent.futures import ThreadPoolExecutor, as_completed
import asyncio
import aiohttp

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

from openai import OpenAI

from ainb_const import (DOWNLOAD_DIR, MODEL,
                        SOURCECONFIG, PROMPT)
from ainb_utilities import (log, delete_files, filter_unseen_urls_db, insert_article,
                            nearest_neighbor_sort, agglomerative_cluster_sort,
                            traveling_salesman_sort_scipy)
from ainb_webscrape import (get_driver, quit_drivers, get_file, parse_file, get_og_tags,
                            get_path_from_url, trimmed_href, DRIVERS)
from ainb_llm import paginate_df, process_pages

# needed because jupyter is already running an async event loop
import nest_asyncio

In [3]:
# PROMPT = """
# You will act as a research assistant to categorize news articles based on their relevance
# to the topic of artificial intelligence (AI). You will process and classify news headlines
# formatted as JSON objects.

# Input Specification:
# You will receive a list of news stories formatted as JSON objects.
# Each object will include an 'id' and a 'title'. For instance:
# [{'id': 97, 'title': 'AI to predict dementia, detect cancer'},
#  {'id': 103,'title': 'Figure robot learns to make coffee by watching humans for 10 hours'},
#  {'id': 103,'title': 'Baby trapped in refrigerator eats own foot'},
#  {'id': 210,'title': 'ChatGPT removes, then reinstates a summarization assistant without explanation.'},
#  {'id': 298,'title': 'The 5 most interesting PC monitors from CES 2024'},
#  ]

# Classification Criteria:
# Classify each story based on its title to determine whether it primarily pertains to AI.
# Broadly define AI-related content to include topics such as machine learning, robotics,
# computer vision, reinforcement learning, large language models, and related topics. Also
# include specific references to AI-related entities and individuals and products such as
# OpenAI, ChatGPT, Elon Musk, Sam Altman, Anthropic Claude, Google Gemini, Copilot,
# Perplexity.ai, Midjourney, etc.

# Output Specification:
# You will return a JSON object with the field 'stories' containing the list of classification results.
# For each story, your output will be a JSON object containing the original 'id' and a new field 'isAI',
# a boolean indicating if the story is about AI. The output schema must be strictly adhered to, without
# any additional fields. Example output:
# {'stories':
# [{'id': 97, 'isAI': true},
#  {'id': 103, 'isAI': true},
#  {'id': 103, 'isAI': false},
#  {'id': 210, 'isAI': true},
#  {'id': 298, 'isAI': false}]
# }

# Ensure that each output object accurately reflects the corresponding input object in terms of the 'id' field
# and that the 'isAI' field accurately represents the AI relevance of the story as determined by the title.

# The list of news stories to classify and enrich is:

# """

In [4]:
print(PROMPT)


You will act as a research assistant to categorize news articles based on their relevance
to the topic of artificial intelligence (AI). You will process and classify news headlines
formatted as JSON objects.

Input Specification:
You will receive a list of news stories formatted as JSON objects.
Each object will include an 'id' and a 'title'. For instance:
[{'id': 97, 'title': 'AI to predict dementia, detect cancer'},
 {'id': 103,'title': 'Figure robot learns to make coffee by watching humans for 10 hours'},
 {'id': 103,'title': 'Baby trapped in refrigerator eats own foot'},
 {'id': 210,'title': 'ChatGPT removes, then reinstates a summarization assistant without explanation.'},
 {'id': 298,'title': 'The 5 most interesting PC monitors from CES 2024'},
 ]

Classification Criteria:
Classify each story based on its title to determine whether it primarily pertains to AI.
Broadly define AI-related content to include topics such as machine learning, robotics,
computer vision, reinforcement l

In [5]:
get_og_tags('https://druce.ai')


{'og:site_name': 'Druce.ai',
 'og:title': 'Druce.ai',
 'og:type': 'website',
 'og:description': "Druce's Blog on Machine Learning, Tech, Markets and Economics",
 'og:url': 'https://druce.ai/',
 'title': 'Druce.ai'}

In [6]:
get_path_from_url('https://druce.ai/2024/03/gemini-summarize-book')


'/2024/03/gemini-summarize-book'

In [7]:
trimmed_href('https://druce.ai/2024/03/gemini-summarize-book?xyz')


'https://druce.ai/2024/03/gemini-summarize-book'
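
The two helpers above come from `ainb_webscrape`, whose source is not shown in this notebook. As a rough sketch of equivalent behavior, assuming they only extract the path and drop the query string, the same results can be reproduced with `urllib.parse`. The names `path_from_url` and `strip_query` are hypothetical stand-ins, not the module's actual functions.

```python
from urllib.parse import urlparse, urlunparse


def path_from_url(url: str) -> str:
    """Return only the path component of a URL."""
    return urlparse(url).path


def strip_query(url: str) -> str:
    """Drop params, query string and fragment; keep scheme, host and path."""
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))


print(path_from_url("https://druce.ai/2024/03/gemini-summarize-book"))
print(strip_query("https://druce.ai/2024/03/gemini-summarize-book?xyz"))
```

Stripping the query string before deduplication means the same article reached through different tracking parameters maps to a single URL.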

In [33]:
#  load sources to scrape from sources.yaml
with open(SOURCECONFIG, "r") as stream:
    try:
        sources = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

log(f"Load {len(sources)} sources")



2024-06-21 12:38:53,221 - AInewsbot - INFO - Load 17 sources


20

In [61]:
sources

{'Ars Technica': {'include': ['^https://arstechnica.com/(\\w+)/(\\d+)/(\\d+)/'],
  'title': 'Ars Technica',
  'url': 'https://arstechnica.com/',
  'latest': 'htmldata/Ars Technica (06_21_2024 12_32_22 PM).html',
  'sourcename': 'Ars Technica'},
 'Bloomberg Tech': {'include': ['^https://www.bloomberg.com/news/'],
  'title': 'Bloomberg Technology - Bloomberg',
  'url': 'https://www.bloomberg.com/technology',
  'latest': 'htmldata/Bloomberg Technology - Bloomberg (06_21_2024 12_32_23 PM).html',
  'sourcename': 'Bloomberg Tech'},
 'Business Insider': {'exclude': ['^https://www.insider.com',
   '^https://www.passionfroot.me'],
  'title': 'Tech - Business Insider',
  'url': 'https://www.businessinsider.com/tech',
  'latest': 'htmldata/Tech - Business Insider (06_21_2024 12_32_22 PM).html',
  'sourcename': 'Business Insider'},
 'FT Tech': {'include': ['https://www.ft.com/content/'],
  'title': 'Technology',
  'url': 'https://www.ft.com/technology',
  'latest': 'htmldata/Technology (06_21_2024
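
Each source carries optional `include` and `exclude` lists of regular expressions. How `parse_file` applies them is not shown here, so the following is only a sketch under one plausible assumption: a link is kept if it matches at least one `include` pattern (when any are given) and no `exclude` pattern. The function name `keep_url` is hypothetical.

```python
import re


def keep_url(url: str, source: dict) -> bool:
    """Assumed rule: match any 'include' pattern (if given) and no 'exclude' pattern."""
    include = source.get("include", [])
    exclude = source.get("exclude", [])
    if include and not any(re.search(p, url) for p in include):
        return False
    return not any(re.search(p, url) for p in exclude)


# patterns copied from the sources dict above
ars = {"include": [r"^https://arstechnica.com/(\w+)/(\d+)/(\d+)/"]}
bi = {"exclude": [r"^https://www.insider.com", r"^https://www.passionfroot.me"]}

print(keep_url("https://arstechnica.com/science/2024/06/some-story/", ars))
print(keep_url("https://arstechnica.com/about-us/", ars))
print(keep_url("https://www.insider.com/anything", bi))
```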

In [9]:
# make a reverse dict to map output file titles to source names
sources_reverse = {}
for k, v in sources.items():
    log(f"{k} -> {v['url']} -> {v['title']}.html")
    v['sourcename'] = k
    # map filename (title) to source name
    sources_reverse[v['title']] = k

sources_reverse

2024-06-21 11:30:42,424 - AInewsbot - INFO - Ars Technica -> https://arstechnica.com/ -> Ars Technica.html
2024-06-21 11:30:42,426 - AInewsbot - INFO - Bloomberg Tech -> https://www.bloomberg.com/technology -> Bloomberg Technology - Bloomberg.html
2024-06-21 11:30:42,427 - AInewsbot - INFO - Business Insider -> https://www.businessinsider.com/tech -> Tech - Business Insider.html
2024-06-21 11:30:42,428 - AInewsbot - INFO - FT Tech -> https://www.ft.com/technology -> Technology.html
2024-06-21 11:30:42,429 - AInewsbot - INFO - Feedly AI -> https://feedly.com/i/aiFeeds?options=eyJsYXllcnMiOlt7InBhcnRzIjpbeyJpZCI6Im5scC9mL3RvcGljLzMwMDAifV0sInNlYXJjaEhpbnQiOiJ0ZWNobm9sb2d5IiwidHlwZSI6Im1hdGNoZXMiLCJzYWxpZW5jZSI6ImFib3V0In1dLCJidW5kbGVzIjpbeyJ0eXBlIjoic3RyZWFtIiwiaWQiOiJ1c2VyLzYyZWViYjlmLTcxNTEtNGY5YS1hOGM3LTlhNTdiODIwNTMwOC9jYXRlZ29yeS9HYWRnZXRzIn1dfQ -> Discover and Add New Feedly AI Feeds.html
2024-06-21 11:30:42,430 - AInewsbot - INFO - Google News -> https://news.google.com/topics/CAA

{'Ars Technica': 'Ars Technica',
 'Bloomberg Technology - Bloomberg': 'Bloomberg Tech',
 'Tech - Business Insider': 'Business Insider',
 'Technology': 'FT Tech',
 'Discover and Add New Feedly AI Feeds': 'Feedly AI',
 'Google News - Technology - Artificial intelligence': 'Google News',
 'Hacker News Page 1': 'Hacker News',
 'Hacker News Page 2': 'Hacker News 2',
 'HackerNoon - read, write and learn about any technology': 'HackerNoon',
 'Technology - The New York Times': 'NYT Tech',
 'top scoring links _ multi': 'Reddit',
 'Techmeme': 'Techmeme',
 'The Register_ Enterprise Technology News and Analysis': 'The Register',
 'Artificial Intelligence - The Verge': 'The Verge',
 'AI News _ VentureBeat': 'VentureBeat',
 'Technology - WSJ.com': 'WSJ Tech',
 'Technology - The Washington Post': 'WaPo Tech'}

In [10]:
# determine files in htmldata directory
# List all paths in the directory matching today's date
nfiles = 50
files = [os.path.join(DOWNLOAD_DIR, file)
         for file in os.listdir(DOWNLOAD_DIR)]

# Get the current date
today = datetime.now()
year, month, day = today.year, today.month, today.day
datestr = datetime.now().strftime("%m_%d_%Y")

# filter files only
files = [file for file in files if os.path.isfile(file)]

# Sort files by modification time and take the most recent nfiles
files.sort(key=lambda x: os.path.getmtime(x), reverse=True)
files = files[:nfiles]

# keep only .html files with today's date in the name
files = [
    file for file in files if datestr in file and file.endswith(".html")]
log(len(files))
for file in files:
    log(file)

saved_pages = []
for file in files:
    filename = os.path.basename(file)
    # locate date like '01_14_2024' in filename
    position = filename.find(" (" + datestr)
    basename = filename[:position]
    # match to source name
    sourcename = sources_reverse.get(basename)
    if sourcename is None:
        log(f"Skipping {basename}, no sourcename metadata")
        continue
    sources[sourcename]['latest'] = file
    saved_pages.append((sourcename, file))

2024-06-21 11:30:45,457 - AInewsbot - INFO - 18
2024-06-21 11:30:45,458 - AInewsbot - INFO - htmldata/Ars Technica (06_21_2024 11_12_51 AM).html
2024-06-21 11:30:45,459 - AInewsbot - INFO - htmldata/Technology - The Washington Post (06_21_2024 10_22_15 AM).html
2024-06-21 11:30:45,460 - AInewsbot - INFO - htmldata/Technology - WSJ.com (06_21_2024 10_22_04 AM).html
2024-06-21 11:30:45,460 - AInewsbot - INFO - htmldata/AI News _ VentureBeat (06_21_2024 10_21_53 AM).html
2024-06-21 11:30:45,461 - AInewsbot - INFO - htmldata/Artificial Intelligence - The Verge (06_21_2024 10_21_42 AM).html
2024-06-21 11:30:45,462 - AInewsbot - INFO - htmldata/The Register_ Enterprise Technology News and Analysis (06_21_2024 10_21_32 AM).html
2024-06-21 11:30:45,463 - AInewsbot - INFO - htmldata/Techmeme (06_21_2024 10_21_21 AM).html
2024-06-21 11:30:45,464 - AInewsbot - INFO - htmldata/top scoring links _ multi (06_21_2024 10_21_10 AM).html
2024-06-21 11:30:45,465 - AInewsbot - INFO - htmldata/Technology -
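
The cell above locates the date with `str.find`, which returns -1 on a miss, so `filename[:position]` would silently chop the last character instead of failing. A stricter alternative is sketched below with a regular expression. It assumes every saved page is named `<page title> (MM_DD_YYYY HH_MM_SS AM|PM).html`, as in the log lines above. `SAVED_NAME` and `split_saved_name` are hypothetical names.

```python
import re

# Assumed filename layout: "<page title> (MM_DD_YYYY HH_MM_SS AM|PM).html"
SAVED_NAME = re.compile(
    r"^(?P<title>.+) \((?P<date>\d{2}_\d{2}_\d{4}) "
    r"(?P<time>\d{1,2}_\d{2}_\d{2} [AP]M)\)\.html$"
)


def split_saved_name(filename: str):
    """Return (title, date) for a saved page, or None if the name does not match."""
    m = SAVED_NAME.match(filename)
    if m is None:
        return None
    return m.group("title"), m.group("date")


print(split_saved_name("Ars Technica (06_21_2024 11_12_51 AM).html"))
print(split_saved_name("notes.txt"))
```

The returned title can then be looked up in `sources_reverse` exactly as `basename` is in the loop above.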

In [23]:
# takes 5 minutes without multiprocessing
# get driver takes 50 seconds, so call it 1 + 4
# 2 drivers should take 2 + 2, 3: 3 + 4/3
# could save a minute by using 2 webdrivers

# Fetch HTML files from sources
# empty download directory
delete_files(DOWNLOAD_DIR)

# save each file specified from sources
log("Saving HTML files")

# Create a queue and populate it with the source dicts
import multiprocessing
queue = multiprocessing.Queue()
for item in sources.values():
    queue.put(item)
    
# Worker that pops entries off the queue and fetches them until none are left
# could probably just access queue as a global or just define 1 function
# this pattern lets you create an array of functions with different args
from queue import Empty

def process_queue_factory(q):
    def process_queue():
        # launch browser via selenium driver
        driver = get_driver()
        saved_pages = []
        while True:
            # q.empty() followed by q.get() can block forever if another
            # worker takes the last item in between, so use a timed get
            try:
                sourcedict = q.get(timeout=1)
            except Empty:
                break
            sourcename = sourcedict['sourcename']
            log(f'Processing {sourcename}')
            sourcefile = get_file(sourcedict, driver)
            saved_pages.append((sourcename, sourcefile))
        # Close the browser
        log("Quit webdriver")
        driver.quit()
        return saved_pages
    return process_queue

# named worker to avoid shadowing the builtin callable()
worker = process_queue_factory(queue)
# saved_pages = worker()

def launch_drivers(num_drivers):
    with ThreadPoolExecutor(max_workers=num_drivers) as executor:
        # Create a list of future objects
        futures = [executor.submit(worker) for _ in range(num_drivers)]

        # Collect each worker's list of saved pages as it completes
        retarray = [future.result() for future in as_completed(futures)]

    return retarray

results = launch_drivers(3)

2024-06-21 12:31:04,712 - AInewsbot - INFO - Saving HTML files
2024-06-21 12:31:04,716 - AInewsbot - INFO - get_driver - 3590 Initializing webdriver
2024-06-21 12:31:04,717 - AInewsbot - INFO - get_driver - 3590 Initializing webdriver
2024-06-21 12:31:04,718 - AInewsbot - INFO - get_driver - 3590 Initializing webdriver
2024-06-21 12:31:21,229 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-06-21 12:31:21,230 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-06-21 12:31:21,230 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-06-21 12:31:21,231 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-06-21 12:31:21,231 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-06-21 12:31:21,231 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-06-21 12:32:10,629 - AInewsbot - INFO - get_driver - Initialized webdriver
2024-06-21 12:32:10,630 - AInewsbot - INFO - get_driver - Initialized w

2024-06-21 12:32:55,812 - AInewsbot - INFO - get_files(Google News - Technology - Artificial intelligence) - 3590 Saving Google News - Technology - Artificial intelligence (06_21_2024 12_32_55 PM).html as utf-8
2024-06-21 12:32:55,817 - AInewsbot - INFO - Processing NYT Tech
2024-06-21 12:32:55,817 - AInewsbot - INFO - get_files(Technology - The New York Times) - 3590 starting get_files https://www.nytimes.com/section/technology
2024-06-21 12:33:03,317 - AInewsbot - INFO - get_files(Discover and Add New Feedly AI Feeds) - 3590 Loading additional infinite scroll items
2024-06-21 12:33:04,187 - AInewsbot - INFO - Message: Unable to locate element: //meta[@http-equiv='Content-Type']; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:192:5
NoSuchEle

2024-06-21 12:33:45,225 - AInewsbot - INFO - Quit webdriver
2024-06-21 12:33:47,101 - AInewsbot - INFO - Message: Unable to locate element: //meta[@http-equiv='Content-Type']; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:192:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:510:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:136:16

2024-06-21 12:33:47,102 - AInewsbot - INFO - get_files(Technology - The Washington Post) - 3590 Saving Technology - The Washington Post (06_21_2024 12_33_47 PM).html as utf-8
2024-06-21 12:33:47,106 - AInewsbot - INFO - Quit webdriver
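
The comments in the cell above wonder whether a single function would do instead of a factory. Since `executor.submit` forwards extra arguments to the function it runs, the queue can be passed directly. The sketch below shows that variant with only the standard library. `drain` and `run_workers` are hypothetical names, and `str.upper` stands in for the real `get_file` call, which would need a browser.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def drain(q: queue.Queue, fetch):
    """Pop items until the queue is empty and return what this worker fetched."""
    done = []
    while True:
        try:
            item = q.get_nowait()  # raises queue.Empty instead of blocking
        except queue.Empty:
            return done
        done.append(fetch(item))


def run_workers(items, fetch, num_workers=3):
    q = queue.Queue()
    for item in items:
        q.put(item)
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(drain, q, fetch) for _ in range(num_workers)]
        # flatten the per-worker lists into one list
        return [r for f in futures for r in f.result()]


# hypothetical stand-in for get_file(); a real worker would drive a browser here
demo_pages = run_workers(["a", "b", "c", "d", "e"], fetch=str.upper)
print(sorted(demo_pages))
```

Which worker handles which item varies from run to run, so the result is sorted before printing.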


In [27]:
saved_pages = []
for r in results:
    saved_pages.extend(r)
saved_pages

[('Bloomberg Tech',
  'htmldata/Bloomberg Technology - Bloomberg (06_21_2024 12_32_23 PM).html'),
 ('Google News',
  'htmldata/Google News - Technology - Artificial intelligence (06_21_2024 12_32_55 PM).html'),
 ('NYT Tech',
  'htmldata/Technology - The New York Times (06_21_2024 12_33_06 PM).html'),
 ('Techmeme', 'htmldata/Techmeme (06_21_2024 12_33_17 PM).html'),
 ('The Register',
  'htmldata/The Register_ Enterprise Technology News and Analysis (06_21_2024 12_33_27 PM).html'),
 ('VentureBeat',
  'htmldata/AI News _ VentureBeat (06_21_2024 12_33_38 PM).html'),
 ('Ars Technica', 'htmldata/Ars Technica (06_21_2024 12_32_22 PM).html'),
 ('Feedly AI',
  'htmldata/Discover and Add New Feedly AI Feeds (06_21_2024 12_33_23 PM).html'),
 ('The Verge',
  'htmldata/Artificial Intelligence - The Verge (06_21_2024 12_33_33 PM).html'),
 ('WSJ Tech', 'htmldata/Technology - WSJ.com (06_21_2024 12_33_45 PM).html'),
 ('Business Insider',
  'htmldata/Tech - Business Insider (06_21_2024 12_32_22 PM).htm

In [None]:
# gives error, TypeError: cannot pickle '_thread.lock' object
# not sure why, should be straightforward
# overkill to save a minute but 2 ways to do it:
#  function that takes the queue as an arg, gets driver, keeps popping items until queue is empty, quits driver 
#  use a pool.map() on a function that takes a single url as an arg, after all complete, call quit_drivers()

# import multiprocessing
# from queue import Queue

# # Function to download and save the web page
# def download_page(url_queue, download_dir):
#     driver = get_driver()

#     while not url_queue.empty():
#         url = url_queue.get()
#         log(url)
#         try:
#             driver.get(url)
#             # Save page source
#             file_name = os.path.join(download_dir, f"{url_queue.qsize()}.html")
#             with open(file_name, 'w', encoding='utf-8') as file:
#                 file.write(driver.page_source)
#             print(f"Saved {url} to {file_name}")
#         except Exception as e:
#             print(f"Failed to download {url}: {e}")
#         finally:
#             url_queue.task_done()

#     driver.quit()

# # Create a queue and add URLs
# url_queue = Queue()
# urls = [v['url'] for v in sources.values()]
# for url in urls:
#     url_queue.put(url)

# # Create and start processes
# processes = []

# for i in range(3):  # 3 Selenium WebDriver instances
#     process = multiprocessing.Process(target=download_page, args=(url_queue, DOWNLOAD_DIR))
#     process.start()
#     processes.append(process)

# # Wait for all processes to finish
# for process in processes:
#     process.join()

# print("All pages have been downloaded.")
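
A likely explanation for the pickling error, assuming the notebook ran with the `spawn` start method (the default on macOS and Windows): the arguments of `multiprocessing.Process` are pickled to be sent to the child, and a `queue.Queue` holds thread locks that cannot be pickled. The snippet below reproduces the failure without starting any process.

```python
import pickle
import queue

error_message = None
try:
    # queue.Queue stores a threading lock internally, which pickle rejects
    pickle.dumps(queue.Queue())
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If that is the cause, `multiprocessing.Queue` (or `multiprocessing.JoinableQueue`, which also provides `task_done`) is the queue type intended for passing to child processes.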

In [28]:
print(len(saved_pages))
for sourcename, page in saved_pages:
    # sources[sourcename]['latest'] = page
    print(sourcename, '->', page)
    

17
Bloomberg Tech -> htmldata/Bloomberg Technology - Bloomberg (06_21_2024 12_32_23 PM).html
Google News -> htmldata/Google News - Technology - Artificial intelligence (06_21_2024 12_32_55 PM).html
NYT Tech -> htmldata/Technology - The New York Times (06_21_2024 12_33_06 PM).html
Techmeme -> htmldata/Techmeme (06_21_2024 12_33_17 PM).html
The Register -> htmldata/The Register_ Enterprise Technology News and Analysis (06_21_2024 12_33_27 PM).html
VentureBeat -> htmldata/AI News _ VentureBeat (06_21_2024 12_33_38 PM).html
Ars Technica -> htmldata/Ars Technica (06_21_2024 12_32_22 PM).html
Feedly AI -> htmldata/Discover and Add New Feedly AI Feeds (06_21_2024 12_33_23 PM).html
The Verge -> htmldata/Artificial Intelligence - The Verge (06_21_2024 12_33_33 PM).html
WSJ Tech -> htmldata/Technology - WSJ.com (06_21_2024 12_33_45 PM).html
Business Insider -> htmldata/Tech - Business Insider (06_21_2024 12_32_22 PM).html
FT Tech -> htmldata/Technology (06_21_2024 12_32_32 PM).html
Hacker News -

In [37]:
# for sourcename, filename in saved_pages:
#     sources[sourcename]["latest"]=filename
#     sources[sourcename]["sourcename"]=sourcename
    
    

In [39]:
# Parse news URLs and titles from downloaded HTML files
log("parsing html files")
all_urls = []
for sourcename, filename in saved_pages:
    print(sourcename, '->', filename, flush=True)
    log(f"{sourcename}", "parse loop")
    links = parse_file(sources[sourcename])
    log(f"{len(links)} links found", "parse loop")
    all_urls.extend(links)

log(f"found {len(all_urls)} links", "parse loop")

2024-06-21 12:41:59,487 - AInewsbot - INFO - parsing html files


Bloomberg Tech -> htmldata/Bloomberg Technology - Bloomberg (06_21_2024 12_32_23 PM).html


2024-06-21 12:41:59,489 - AInewsbot - INFO - parse loop - Bloomberg Tech
2024-06-21 12:41:59,525 - AInewsbot - INFO - parse_file - found 296 raw links
2024-06-21 12:41:59,529 - AInewsbot - INFO - parse_file - found 51 filtered links
2024-06-21 12:41:59,529 - AInewsbot - INFO - parse loop - 51 links found


Google News -> htmldata/Google News - Technology - Artificial intelligence (06_21_2024 12_32_55 PM).html


2024-06-21 12:41:59,529 - AInewsbot - INFO - parse loop - Google News
2024-06-21 12:41:59,784 - AInewsbot - INFO - parse_file - found 1077 raw links
2024-06-21 12:41:59,791 - AInewsbot - INFO - parse_file - found 467 filtered links
2024-06-21 12:41:59,791 - AInewsbot - INFO - parse loop - 467 links found


NYT Tech -> htmldata/Technology - The New York Times (06_21_2024 12_33_06 PM).html


2024-06-21 12:41:59,792 - AInewsbot - INFO - parse loop - NYT Tech
2024-06-21 12:41:59,802 - AInewsbot - INFO - parse_file - found 72 raw links
2024-06-21 12:41:59,803 - AInewsbot - INFO - parse_file - found 19 filtered links
2024-06-21 12:41:59,803 - AInewsbot - INFO - parse loop - 19 links found


Techmeme -> htmldata/Techmeme (06_21_2024 12_33_17 PM).html


2024-06-21 12:41:59,804 - AInewsbot - INFO - parse loop - Techmeme
2024-06-21 12:41:59,822 - AInewsbot - INFO - parse_file - found 384 raw links
2024-06-21 12:41:59,827 - AInewsbot - INFO - parse_file - found 161 filtered links
2024-06-21 12:41:59,827 - AInewsbot - INFO - parse loop - 161 links found


The Register -> htmldata/The Register_ Enterprise Technology News and Analysis (06_21_2024 12_33_27 PM).html


2024-06-21 12:41:59,827 - AInewsbot - INFO - parse loop - The Register
2024-06-21 12:41:59,935 - AInewsbot - INFO - parse_file - found 200 raw links
2024-06-21 12:41:59,938 - AInewsbot - INFO - parse_file - found 89 filtered links
2024-06-21 12:41:59,939 - AInewsbot - INFO - parse loop - 89 links found


VentureBeat -> htmldata/AI News _ VentureBeat (06_21_2024 12_33_38 PM).html


2024-06-21 12:41:59,939 - AInewsbot - INFO - parse loop - VentureBeat
2024-06-21 12:41:59,956 - AInewsbot - INFO - parse_file - found 322 raw links
2024-06-21 12:41:59,960 - AInewsbot - INFO - parse_file - found 44 filtered links
2024-06-21 12:41:59,960 - AInewsbot - INFO - parse loop - 44 links found


Ars Technica -> htmldata/Ars Technica (06_21_2024 12_32_22 PM).html


2024-06-21 12:41:59,960 - AInewsbot - INFO - parse loop - Ars Technica
2024-06-21 12:41:59,975 - AInewsbot - INFO - parse_file - found 252 raw links
2024-06-21 12:41:59,977 - AInewsbot - INFO - parse_file - found 24 filtered links
2024-06-21 12:41:59,978 - AInewsbot - INFO - parse loop - 24 links found


Feedly AI -> htmldata/Discover and Add New Feedly AI Feeds (06_21_2024 12_33_23 PM).html


2024-06-21 12:41:59,978 - AInewsbot - INFO - parse loop - Feedly AI
2024-06-21 12:42:00,007 - AInewsbot - INFO - parse_file - found 258 raw links
2024-06-21 12:42:00,010 - AInewsbot - INFO - parse_file - found 69 filtered links
2024-06-21 12:42:00,011 - AInewsbot - INFO - parse loop - 69 links found


The Verge -> htmldata/Artificial Intelligence - The Verge (06_21_2024 12_33_33 PM).html


2024-06-21 12:42:00,011 - AInewsbot - INFO - parse loop - The Verge
2024-06-21 12:42:00,037 - AInewsbot - INFO - parse_file - found 311 raw links
2024-06-21 12:42:00,040 - AInewsbot - INFO - parse_file - found 35 filtered links
2024-06-21 12:42:00,040 - AInewsbot - INFO - parse loop - 35 links found


WSJ Tech -> htmldata/Technology - WSJ.com (06_21_2024 12_33_45 PM).html


2024-06-21 12:42:00,041 - AInewsbot - INFO - parse loop - WSJ Tech
2024-06-21 12:42:00,074 - AInewsbot - INFO - parse_file - found 517 raw links
2024-06-21 12:42:00,080 - AInewsbot - INFO - parse_file - found 6 filtered links
2024-06-21 12:42:00,081 - AInewsbot - INFO - parse loop - 6 links found


Business Insider -> htmldata/Tech - Business Insider (06_21_2024 12_32_22 PM).html


2024-06-21 12:42:00,081 - AInewsbot - INFO - parse loop - Business Insider
2024-06-21 12:42:00,103 - AInewsbot - INFO - parse_file - found 315 raw links
2024-06-21 12:42:00,107 - AInewsbot - INFO - parse_file - found 56 filtered links
2024-06-21 12:42:00,107 - AInewsbot - INFO - parse loop - 56 links found


FT Tech -> htmldata/Technology (06_21_2024 12_32_32 PM).html


2024-06-21 12:42:00,107 - AInewsbot - INFO - parse loop - FT Tech
2024-06-21 12:42:00,134 - AInewsbot - INFO - parse_file - found 457 raw links
2024-06-21 12:42:00,139 - AInewsbot - INFO - parse_file - found 102 filtered links
2024-06-21 12:42:00,139 - AInewsbot - INFO - parse loop - 102 links found


Hacker News -> htmldata/Hacker News Page 1 (06_21_2024 12_32_43 PM).html


2024-06-21 12:42:00,139 - AInewsbot - INFO - parse loop - Hacker News
2024-06-21 12:42:00,150 - AInewsbot - INFO - parse_file - found 257 raw links
2024-06-21 12:42:00,153 - AInewsbot - INFO - parse_file - found 29 filtered links
2024-06-21 12:42:00,153 - AInewsbot - INFO - parse loop - 29 links found


Hacker News 2 -> htmldata/Hacker News Page 2 (06_21_2024 12_32_53 PM).html


2024-06-21 12:42:00,154 - AInewsbot - INFO - parse loop - Hacker News 2
2024-06-21 12:42:00,164 - AInewsbot - INFO - parse_file - found 260 raw links
2024-06-21 12:42:00,167 - AInewsbot - INFO - parse_file - found 21 filtered links
2024-06-21 12:42:00,167 - AInewsbot - INFO - parse loop - 21 links found


HackerNoon -> htmldata/HackerNoon - read, write and learn about any technology (06_21_2024 12_33_04 PM).html


2024-06-21 12:42:00,168 - AInewsbot - INFO - parse loop - HackerNoon
2024-06-21 12:42:00,221 - AInewsbot - INFO - parse_file - found 569 raw links
2024-06-21 12:42:00,230 - AInewsbot - INFO - parse_file - found 88 filtered links
2024-06-21 12:42:00,230 - AInewsbot - INFO - parse loop - 88 links found


Reddit -> htmldata/top scoring links _ multi (06_21_2024 12_33_36 PM).html


2024-06-21 12:42:00,230 - AInewsbot - INFO - parse loop - Reddit
2024-06-21 12:42:00,410 - AInewsbot - INFO - parse_file - found 644 raw links
2024-06-21 12:42:00,422 - AInewsbot - INFO - parse_file - found 420 filtered links
2024-06-21 12:42:00,422 - AInewsbot - INFO - parse loop - 420 links found


WaPo Tech -> htmldata/Technology - The Washington Post (06_21_2024 12_33_47 PM).html


2024-06-21 12:42:00,423 - AInewsbot - INFO - parse loop - WaPo Tech
2024-06-21 12:42:00,435 - AInewsbot - INFO - parse_file - found 156 raw links
2024-06-21 12:42:00,437 - AInewsbot - INFO - parse_file - found 29 filtered links
2024-06-21 12:42:00,437 - AInewsbot - INFO - parse loop - 29 links found
2024-06-21 12:42:00,437 - AInewsbot - INFO - parse loop - found 1710 links


20

In [40]:
# make a pandas dataframe of all the links found
orig_df = (
    pd.DataFrame(all_urls)
    .groupby("url")
    .first()
    .reset_index()
    .sort_values("src")[["src", "title", "url"]]
    .reset_index(drop=True)
    .reset_index(drop=False)
    .rename(columns={"index": "id"})
)
print(len(orig_df))
orig_df.head()

1404


Unnamed: 0,id,src,title,url
0,0,Ars Technica,Dell said return to the office or else—nearly ...,https://arstechnica.com/gadgets/2024/06/nearly...
1,1,Ars Technica,Supermassive black hole roars to life as astro...,https://arstechnica.com/science/2024/06/superm...
2,2,Ars Technica,When did humans start social knowledge accumul...,https://arstechnica.com/science/2024/06/stone-...
3,3,Ars Technica,Microdosing candy-linked illnesses double; pos...,https://arstechnica.com/science/2024/06/still-...
4,4,Ars Technica,Radioactive drugs strike cancer with precision,https://arstechnica.com/science/2024/06/radioa...


In [None]:
# datestr = '2024-06-20'

# conn = sqlite3.connect('articles.db')

# c = conn.cursor()
# query = f"select * from news_articles where article_date > '{datestr}' order by article_date desc"
# df = pd.read_sql_query(query, conn)
# df



In [None]:
# conn.execute(f"delete from news_articles where article_date > '{datestr}'")

# # Committing the changes
# conn.commit()

# # Closing the connection
# conn.close()


In [41]:
filtered_df = filter_unseen_urls_db(orig_df)


2024-06-21 12:42:19,383 - AInewsbot - INFO - Existing URLs: 108013
2024-06-21 12:42:19,406 - AInewsbot - INFO - New URLs: 398


In [42]:
# use chatgpt to filter AI-related headlines using a prompt to OpenAI

client = OpenAI()

# make pages that fit in a reasonably sized prompt
pages = paginate_df(filtered_df)

In [43]:
pages

[[{'id': 9,
   'title': 'Family whose roof was damaged by space debris files claims against NASA'},
  {'id': 51,
   'title': 'US Moves Closer to Restricting Outbound Investment in China for Chips, AI Tech'},
  {'id': 53,
   'title': 'Orange Said to Mull 40% Stake Sale in Carrier, Exiting Mauritius'},
  {'id': 54, 'title': 'Rumors Go Dark as Video-Game Leakers Face a Reckoning'},
  {'id': 56,
   'title': 'Nvidia Sheds $200 Billion in Value After Short Run as Top Stock'},
  {'id': 62,
   'title': 'OpenAI Buys Enterprise Startup to Help Customers Sift Through Data'},
  {'id': 63,
   'title': 'Apple’s AI Rally Puts Valuation at Risk of Outpacing Reality'},
  {'id': 79,
   'title': 'First Neuralink patient explains what could happen if his brain-chip implant gets hacked'},
  {'id': 96, 'title': 'How to become an influencer and start making money'},
  {'id': 99,
   'title': 'I got laid off from Snap and it felt like a dream come true after 20 years in tech. I quickly went from shock to antic
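
`paginate_df` is defined in `ainb_llm` and not shown here. Judging from the output, it yields lists of `{'id', 'title'}` dicts. A minimal sketch of the chunking step, assuming a fixed page size of 50 and a hypothetical helper name `paginate`:

```python
def paginate(items, page_size=50):
    """Split a list into consecutive pages of at most page_size items."""
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]


# 398 dummy headlines, matching the count of new URLs logged earlier
headlines = [{"id": i, "title": f"headline {i}"} for i in range(398)]
pages_demo = paginate(headlines)
print(len(pages_demo), len(pages_demo[-1]))
```

With 398 items this gives seven full pages and a final page of 48.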

In [44]:
# updated ainb_llm.py to use async/await
# the lowest-level client.chat.completions.create call on the synchronous client does not support await
# gives: object ChatCompletion can't be used in 'await' expression
# this still takes 2 minutes for 456 URLs
# the OpenAI python API still blocks / throttles

# print(datetime.now())
# enriched_urls = await process_pages(client, PROMPT, pages)
# print(datetime.now())


2024-06-21 12:42:25,926 - AInewsbot - INFO - send page 1 of 8, 50 items 
2024-06-21 12:42:25,927 - AInewsbot - INFO - send page 2 of 8, 50 items 
2024-06-21 12:42:25,928 - AInewsbot - INFO - send page 3 of 8, 50 items 
2024-06-21 12:42:25,929 - AInewsbot - INFO - send page 4 of 8, 50 items 
2024-06-21 12:42:25,930 - AInewsbot - INFO - send page 5 of 8, 50 items 
2024-06-21 12:42:25,930 - AInewsbot - INFO - send page 6 of 8, 50 items 
2024-06-21 12:42:25,931 - AInewsbot - INFO - send page 7 of 8, 50 items 
2024-06-21 12:42:25,932 - AInewsbot - INFO - send page 8 of 8, 48 items 


2024-06-21 12:42:25.926292


2024-06-21 12:42:37,096 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-21 12:42:37,110 - AInewsbot - INFO - object ChatCompletion can't be used in 'await' expression - An exception occurred on attempt 1:
2024-06-21 12:42:47,111 - AInewsbot - INFO - Attempt 2...
2024-06-21 12:42:57,582 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-21 12:42:57,585 - AInewsbot - INFO - object ChatCompletion can't be used in 'await' expression - An exception occurred on attempt 2:
2024-06-21 12:43:07,589 - AInewsbot - INFO - Attempt 3...
2024-06-21 12:43:17,556 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-21 12:43:17,559 - AInewsbot - INFO - object ChatCompletion can't be used in 'await' expression - An exception occurred on attempt 3:
2024-06-21 12:43:27,564 - AInewsbot - INFO - Retries exceeded.
2024-06-21 12:43:38,112 - httpx - I

AttributeError: 'NoneType' object has no attribute 'choices'
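
The repeated error above arises because the synchronous `OpenAI` client returns a `ChatCompletion` object, not an awaitable. One way to overlap blocking calls without changing the client is to run each in a worker thread with `asyncio.to_thread` (Python 3.9+). This is only a sketch: `blocking_call` is a hypothetical stand-in that sleeps instead of contacting the API. The `openai` package also provides an `AsyncOpenAI` client whose methods can be awaited directly.

```python
import asyncio
import time


def blocking_call(page_number: int) -> str:
    """Hypothetical stand-in for a synchronous client.chat.completions.create call."""
    time.sleep(0.1)
    return f"page {page_number} done"


async def run_all(n_pages: int):
    # each blocking call runs in a worker thread; gather preserves input order
    tasks = [asyncio.to_thread(blocking_call, i) for i in range(n_pages)]
    return await asyncio.gather(*tasks)


# in a notebook cell, `await run_all(4)` works directly; asyncio.run is for scripts
demo_results = asyncio.run(run_all(4))
print(demo_results)
```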

In [46]:
API_URL = 'https://api.openai.com/v1/chat/completions'

headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {os.getenv("OPENAI_API_KEY")}',
}
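
If OPENAI_API_KEY is unset, the f-string above silently produces the header value "Bearer None". A small sketch of failing early instead; `build_headers` is a hypothetical helper, not part of the notebook's modules.

```python
import os


def build_headers(api_key):
    """Return request headers, failing early if the key is missing."""
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }


# "sk-example" is a placeholder so the sketch runs without a real key
headers_example = build_headers(os.getenv("OPENAI_API_KEY") or "sk-example")
print(sorted(headers_example))
```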



In [47]:
# this runs fast with async aiohttp (15 seconds vs 2 minutes)
# the old API supported submitting multiple payloads in a single completion request
# current API supports a slow 'batch' submission https://platform.openai.com/docs/guides/rate-limits/usage-tiers
# there is a more complex example here - https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py

# need this to run async in jupyter since it already has an asyncio event loop running
import nest_asyncio
nest_asyncio.apply()

async def fetch_openai(session, payload):
    async with session.post(API_URL, headers=headers, json=payload) as response:
        return await response.json()


async def fetch_pages(prompt, pages):
    log(f"{datetime.now().strftime('%H:%M:%S')} Sending {len(pages)} pages to OpenAI.")

    # make a prompt and payload for each page
    payloads = [{"model":  MODEL,
                 "response_format": {"type": "json_object"},
                 "messages": [{"role": "user",
                               "content": prompt + json.dumps(p)
                               }]
                 } for p in pages]

    async with aiohttp.ClientSession() as session:
        tasks = []
        for i, payload in enumerate(payloads):
            log(f"{datetime.now().strftime('%H:%M:%S')} Sending page {i}.")
            task = asyncio.create_task(fetch_openai(session, payload))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)

    # validate and process the responses
    log(f"{datetime.now().strftime('%H:%M:%S')} Processing responses... ")

    retlist = []
    for i, response in enumerate(responses):
        try:
            response_dict = json.loads(
                response["choices"][0]["message"]["content"])
        except Exception as e:
            # chain the original exception so the root cause stays visible
            raise TypeError("Error: Invalid response " + str(e)) from e

        if isinstance(response_dict, dict):
            response_list = response_dict.get("stories")
        else:
            raise TypeError("Error: Invalid response type")

        if not isinstance(response_list, list):
            raise TypeError("Error: Invalid response type")

        log(f"{datetime.now().strftime('%H:%M:%S')} got list with {len(response_list)} items ")

        # check all sent present in response
        # seems to match, i.e. responses received correctly and in order
        # could add this to fetch_openai and implement retry
        sent_ids = [s['id'] for s in pages[i]]
        received_ids = [r['id'] for r in response_list]
        difference = set(sent_ids) - set(received_ids)

        if difference:
            log(f"missing items, {str(difference)}")

        retlist.extend(response_list)

    log(f"{datetime.now().strftime('%H:%M:%S')} Processed {len(retlist)} responses.")

    return retlist


# Run the main function
enriched_urls = asyncio.run(fetch_pages(PROMPT, pages))
print(datetime.now())


2024-06-21 13:09:25,143 - AInewsbot - INFO - 13:09:25 Sending 8 pages to OpenAI.
2024-06-21 13:09:36,291 - AInewsbot - INFO - 13:09:36 Processing responses... 
2024-06-21 13:09:36,293 - AInewsbot - INFO - 13:09:36 got list with 50 items 
2024-06-21 13:09:36,294 - AInewsbot - INFO - 13:09:36 got list with 50 items 
2024-06-21 13:09:36,295 - AInewsbot - INFO - 13:09:36 got list with 50 items 
2024-06-21 13:09:36,295 - AInewsbot - INFO - 13:09:36 got list with 50 items 
2024-06-21 13:09:36,296 - AInewsbot - INFO - 13:09:36 got list with 50 items 
2024-06-21 13:09:36,298 - AInewsbot - INFO - 13:09:36 got list with 50 items 
2024-06-21 13:09:36,300 - AInewsbot - INFO - 13:09:36 got list with 50 items 
2024-06-21 13:09:36,300 - AInewsbot - INFO - 13:09:36 got list with 48 items 
2024-06-21 13:09:36,301 - AInewsbot - INFO - 13:09:36 Processed 398 responses.


2024-06-21 13:09:36.301797
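
The comments in `fetch_pages` mention moving the sent/received id check into the fetch step and adding a retry. One way that could look is sketched below. `fetch_with_retry`, `missing_ids` and `flaky_fetch` are hypothetical names, and `fetch` stands in for the actual request to the API, so this is an illustration under those assumptions rather than the notebook's implementation.

```python
import asyncio


def missing_ids(sent_items, received_items):
    """Return ids that were sent but did not come back."""
    return {s["id"] for s in sent_items} - {r["id"] for r in received_items}


async def fetch_with_retry(fetch, page, max_attempts=3, base_delay=0.01):
    """Call `fetch(page)` until every sent id is returned or attempts run out.

    `fetch` is any coroutine function returning a list of dicts with an "id"
    key; it stands in for the request to the API (an assumption).
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            items = await fetch(page)
            lost = missing_ids(page, items)
            if not lost:
                return items
            last_error = ValueError(f"missing ids: {sorted(lost)}")
        except (KeyError, TypeError, ValueError) as exc:
            last_error = exc
        if attempt < max_attempts - 1:
            # exponential backoff before the next attempt
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


calls = {"count": 0}


async def flaky_fetch(page):
    """Hypothetical fetcher that drops an item on its first call."""
    calls["count"] += 1
    items = [{"id": s["id"], "isAI": False} for s in page]
    return items[:-1] if calls["count"] == 1 else items


page = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]
result = asyncio.run(fetch_with_retry(flaky_fetch, page))
print(calls["count"], [r["id"] for r in result])
```

Retrying per page keeps the other pages' results intact when a single response is incomplete.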


In [63]:
for sourcename, filename in saved_pages:
    sources[sourcename]['latest'] = filename

In [48]:
enriched_df = pd.DataFrame(enriched_urls)
print(len(enriched_df))
enriched_df.head()


398


Unnamed: 0,id,isAI
0,9,False
1,51,True
2,53,False
3,54,False
4,56,False


In [49]:
log("isAI", len(enriched_df.loc[enriched_df["isAI"]]))
log("not isAI", len(enriched_df.loc[~enriched_df["isAI"]]))


2024-06-21 13:10:52,453 - AInewsbot - INFO - 161 - isAI
2024-06-21 13:10:52,456 - AInewsbot - INFO - 237 - not isAI


20

In [50]:
# merge returned df into original df
merged_df = pd.merge(filtered_df, enriched_df, on="id", how="outer")
merged_df['date'] = datetime.now().date()
merged_df.head()


Unnamed: 0,id,src,title,url,isAI,date
0,9,Ars Technica,Family whose roof was damaged by space debris ...,https://arstechnica.com/space/2024/06/family-w...,False,2024-06-21
1,51,Bloomberg Tech,US Moves Closer to Restricting Outbound Invest...,https://www.bloomberg.com/news/articles/2024-0...,True,2024-06-21
2,53,Bloomberg Tech,"Orange Said to Mull 40% Stake Sale in Carrier,...",https://www.bloomberg.com/news/articles/2024-0...,False,2024-06-21
3,54,Bloomberg Tech,Rumors Go Dark as Video-Game Leakers Face a Re...,https://www.bloomberg.com/news/newsletters/202...,False,2024-06-21
4,56,Bloomberg Tech,Nvidia Sheds $200 Billion in Value After Short...,https://www.bloomberg.com/news/articles/2024-0...,False,2024-06-21


In [51]:
# should be 0: the response shouldn't contain ids that don't match an existing row
log(f"Unmatched response rows: {len(merged_df.loc[merged_df['src'].isna()])}")
# should be 0: every row of the original df should come back with an isAI value
log(f"Unmatched source rows: {len(merged_df.loc[merged_df['isAI'].isna()])}")


2024-06-21 13:10:59,784 - AInewsbot - INFO - Unmatched response rows: 0
2024-06-21 13:10:59,787 - AInewsbot - INFO - Unmatched source rows: 0


20

In [52]:
# update SQLite database with all seen articles
conn = sqlite3.connect('articles.db')
cursor = conn.cursor()
for row in merged_df.itertuples():
    insert_article(conn, cursor, row.src, row.title,
                   row.url, row.isAI, row.date)
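
`insert_article` lives in `ainb_utilities` and its body is not shown here. As a rough illustration of recording seen articles so that a repeated URL is stored only once, the sketch below uses an in-memory database. The table name, column names and the UNIQUE constraint on `url` are assumptions, not the actual schema of articles.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for illustration
conn.execute(
    """CREATE TABLE news_articles (
           id INTEGER PRIMARY KEY,
           src TEXT, title TEXT, url TEXT UNIQUE,
           isAI BOOLEAN, article_date DATE)"""
)

rows = [
    ("Ars Technica", "Headline A", "https://example.com/a", False, "2024-06-21"),
    ("Reddit", "Headline A again", "https://example.com/a", False, "2024-06-21"),
    ("Bloomberg Tech", "Headline B", "https://example.com/b", True, "2024-06-21"),
]

# used as a context manager, the connection commits on success
# and rolls back if an exception is raised
with conn:
    conn.executemany(
        "INSERT OR IGNORE INTO news_articles "
        "(src, title, url, isAI, article_date) VALUES (?, ?, ?, ?, ?)",
        rows,
    )

stored = conn.execute("SELECT COUNT(*) FROM news_articles").fetchone()[0]
print(stored)
conn.close()
```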


In [53]:
AIdf = merged_df.loc[merged_df["isAI"]].reset_index(drop=True)
log(f"Found {len(AIdf)} AI headlines")


2024-06-21 13:11:02,198 - AInewsbot - INFO - Found 161 AI headlines


20

In [54]:
# map title to ascii characters to avoid some dupes with e.g. different quote symbols
# note: characters with no ASCII decomposition (e.g. curly quotes) are dropped, not replaced

def unicode_to_ascii(input_string):
    # Normalize the Unicode string to NFKD form
    normalized_string = unicodedata.normalize('NFKD', input_string)
    
    # Encode to ASCII bytes, ignoring characters that cannot be converted
    ascii_bytes = normalized_string.encode('ascii', 'ignore')
    
    # Convert bytes back to a string
    ascii_string = ascii_bytes.decode('ascii')
    
    return ascii_string

AIdf['title'] = AIdf['title'].apply(unicode_to_ascii)
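
NFKD has no decomposition for curly quotes, so `encode('ascii', 'ignore')` removes them. That may be why both "McDonald's" and "McDonalds" survive the dedupe in the headline listing further down. A possible variant that translates typographic punctuation before normalizing is sketched here; `PUNCT_MAP` and `unicode_to_ascii_keep_quotes` are hypothetical names.

```python
import unicodedata

# typographic punctuation with no NFKD decomposition, mapped to ASCII by hand
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def unicode_to_ascii_keep_quotes(input_string):
    """Like unicode_to_ascii, but translate curly quotes and dashes first."""
    translated = input_string.translate(PUNCT_MAP)
    normalized = unicodedata.normalize("NFKD", translated)
    return normalized.encode("ascii", "ignore").decode("ascii")


curly = "McDonald\u2019s Ends A.I. Drive-Thru"
straight = "McDonald's Ends A.I. Drive-Thru"
print(unicode_to_ascii_keep_quotes(curly) == straight)
```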


In [55]:
# dedupe identical headlines
AIdf['title_clean'] = AIdf['title'].map(lambda s: "".join(s.split()))
AIdf = AIdf.sort_values("src") \
    .groupby("title_clean") \
    .first() \
    .reset_index()
log(f"Found {len(AIdf)} unique AI headlines")


2024-06-21 13:11:04,053 - AInewsbot - INFO - Found 154 unique AI headlines


20

In [56]:
log(f"Fetching embeddings for {len(AIdf)} headlines")
embedding_model = 'text-embedding-3-large'
response = client.embeddings.create(input=AIdf['title'].tolist(),
                                    model=embedding_model)
embedding_df = pd.DataFrame([e.model_dump()['embedding'] for e in response.data])
embedding_array = embedding_df.values

# find index of most central headline
# (start_index is only needed by nearest_neighbor_sort, commented out below)
centroid = embedding_array.mean(axis=0)
distances = np.linalg.norm(embedding_array - centroid, axis=1)
start_index = np.argmin(distances)

# Get the sorted indices and use them to sort the df
# sorted_indices = nearest_neighbor_sort(embedding_array, start_index)
sorted_indices = traveling_salesman_sort_scipy(embedding_array)
AIdf = AIdf.iloc[sorted_indices]


2024-06-21 13:11:04,697 - AInewsbot - INFO - Fetching embeddings for 154 headlines
2024-06-21 13:11:05,278 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
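
The sort helpers used above (`nearest_neighbor_sort`, `traveling_salesman_sort_scipy`) come from `ainb_utilities` and are not shown. To illustrate the general idea of ordering headlines so that similar ones sit next to each other, here is a pure-Python greedy nearest-neighbor sketch starting from the most central point. It is an assumption about the approach, not the helper's actual implementation, and both function names are hypothetical.

```python
import math


def centroid_start(vectors):
    """Index of the vector closest to the mean of all vectors."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return min(range(len(vectors)),
               key=lambda i: math.dist(vectors[i], centroid))


def greedy_nearest_neighbor_order(vectors, start_index):
    """Visit every vector once, always moving to the closest unvisited one."""
    order = [start_index]
    remaining = set(range(len(vectors))) - {start_index}
    while remaining:
        last = vectors[order[-1]]
        # ties are broken by the lower index to keep the result deterministic
        nxt = min(remaining, key=lambda i: (math.dist(vectors[i], last), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# toy 2-D "embeddings": two tight groups and one point in between
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (4.0, 4.0)]
start = centroid_start(points)
order = greedy_nearest_neighbor_order(points, start)
print(start, order)
```

The resulting index list can be passed to `AIdf.iloc[...]` in the same way as `sorted_indices` above.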


In [57]:
# leaf_order = agglomerative_cluster_sort(embedding_df)
# AIdf = AIdf.iloc[leaf_order]


In [58]:
AIdf = AIdf.reset_index(drop=True)
with pd.option_context('display.max_rows', None):
    display(AIdf[["title"]])

Unnamed: 0,title
0,"""I'm not excited about AI making stuff that a ..."
1,$10 A Month For Alexa? Amazon Reportedly Plots...
2,"'AI is not creative, you are': How AI will cha..."
3,'Remarkable' Alexa with AI could cost $5 to $1...
4,'Reverse Turing test' asks AI agents to spot a...
5,10 Features Which Makes Claude 3.5 Sonnet The ...
6,1 Unstoppable Stock Powering Nvidia and the AI...
7,260 McNuggets? McDonald's Ends A.I. Drive-Thro...
8,260 McNuggets? McDonalds Ends A.I. Drive-Throu...
9,81% of workers using AI are more productive. H...


In [59]:
html_str = ""
for row in AIdf.itertuples():
    log(f"[{row.Index}. {row.title} - {row.src}]({row.url})")
    html_str += f'{row.Index}.<a href="{row.url}">{row.title} - {row.src}</a><br />\n'


2024-06-21 13:11:09,797 - AInewsbot - INFO - [0. "I'm not excited about AI making stuff that a human could do - I want to hear things that humans have never imagined": Imogen Heap on AI, making her own voice model, and a new era of musical collaboration - Feedly AI](https://www.musicradar.com/news/imogen-heap-interview)
2024-06-21 13:11:09,799 - AInewsbot - INFO - [1. $10 A Month For Alexa? Amazon Reportedly Plots Overhaul Of Iconic Voice Assistant - Google News](https://news.google.com/articles/CBMifWh0dHBzOi8vd3d3LmZvcmJlcy5jb20vc2l0ZXMvcm9iZXJ0aGFydC8yMDI0LzA2LzIxL2FtYXpvbi1tYXktY2hhcmdlLWZvci1hbGV4YS1yZXBvcnQtc3VnZ2VzdC1tYWpvci1vdmVyaGF1bC1vZi1haS1hc3Npc3RhbnQv0gEA)
2024-06-21 13:11:09,801 - AInewsbot - INFO - [2. 'AI is not creative, you are': How AI will change the way we work, according to Microsoft - Google News](https://news.google.com/articles/CBMifWh0dHBzOi8vd3d3LmV1cm9uZXdzLmNvbS9uZXh0LzIwMjQvMDYvMjEvYWktaXMtbm90LWNyZWF0aXZlLXlvdS1hcmUtaG93LWFpLXdpbGwtY2hhbmdlLXRoZS13YXktd2

2024-06-21 13:11:09,813 - AInewsbot - INFO - [28. Apple: AI With Privacy & Security, A Unique Combination (NASDAQ:AAPL) - Google News](https://news.google.com/articles/CBMicmh0dHBzOi8vc2Vla2luZ2FscGhhLmNvbS9hcnRpY2xlLzQ3MDAzMjQtYXBwbGUtYWktd2l0aC1wcml2YWN5LXNlY3VyaXR5LXVuaXF1ZS1jb21iaW5hdGlvbj9zb3VyY2U9ZmVlZF9zeW1ib2xfQUFQTNIBAA)
2024-06-21 13:11:09,813 - AInewsbot - INFO - [29. Apple Faces a Tough Task in Keeping AI Data Secure and Private - Google News](https://news.google.com/articles/CBMib2h0dHBzOi8vd3d3LmNuZXQuY29tL3RlY2gvc2VydmljZXMtYW5kLXNvZnR3YXJlL2FwcGxlLWZhY2VzLWEtdG91Z2gtdGFzay1pbi1rZWVwaW5nLWFpLWRhdGEtc2VjdXJlLWFuZC1wcml2YXRlL9IBAA)
2024-06-21 13:11:09,813 - AInewsbot - INFO - [30. Apple Unveils Its Artificial Intelligence (AI) Plans. Is This a Signal to Buy the Stock? - Google News](https://news.google.com/articles/CBMiVmh0dHBzOi8vZmluYW5jZS55YWhvby5jb20vbmV3cy9hcHBsZS11bnZlaWxzLWFydGlmaWNpYWwtaW50ZWxsaWdlbmNlLWFpLTA3MTAwMDQyMS5odG1s0gEA)
2024-06-21 13:11:09,813 - AInewsbo

2024-06-21 13:11:09,823 - AInewsbot - INFO - [54. China's AI leader Sensetime unveils US$260 million share placement plan - Google News](https://news.google.com/articles/CBMiemh0dHBzOi8vd3d3LnNjbXAuY29tL2J1c2luZXNzL21hcmtldHMvYXJ0aWNsZS8zMjY3NDQ0L2NoaW5hcy1zZW5zZXRpbWUtc2xpZGVzLTctd2Vlay1sb3dzLWFmdGVyLXBsYW4tcGxhY2Utc2hhcmVzLWRpc2NvdW500gEA)
2024-06-21 13:11:09,823 - AInewsbot - INFO - [55. China's Sensor-Equipped AI Sexbots to Provide Unparalleled User Experience as They React With Both Movements, Speech - Google News](https://news.google.com/articles/CBMifGh0dHBzOi8vd3d3LnRlY2h0aW1lcy5jb20vYXJ0aWNsZXMvMzA1ODc4LzIwMjQwNjIwL2NoaW5hLXNlbnNvci1lcXVpcHBlZC1haS1zZXhib3RzLXByb3ZpZGUtdW5wYXJhbGxlbGVkLXVzZXItZXhwZXJpZW5jZS5odG3SAQA)
2024-06-21 13:11:09,823 - AInewsbot - INFO - [56. Cisco, ServiceRocket, and Checkr CFOs on how AI is changing the way they do their jobs - Google News](https://news.google.com/articles/CBMiUGh0dHBzOi8vZmluYW5jZS55YWhvby5jb20vbmV3cy9jaXNjby1zZXJ2aWNlcm9ja2V0LWNoZWN

2024-06-21 13:11:09,828 - AInewsbot - INFO - [82. I helped develop a ChatGPT tool to assist with learning and research on virtually any topic. It generates responses backed with peer-reviewed literature and can also summarize research articles. Its meant to be like an interactive encyclopedia. Heres the link: www.academicai.io - Reddit](https://www.reddit.com/r/ChatGPT/comments/1dl4roa/i_helped_develop_a_chatgpt_tool_to_assist_with/)
2024-06-21 13:11:09,828 - AInewsbot - INFO - [83. LANL: AI Can Help Forecast Toxic Blue-Green Tides - Feedly AI](https://losalamosreporter.com/2024/06/21/lanl-ai-can-help-forecast-toxic-blue-green-tides/)
2024-06-21 13:11:09,828 - AInewsbot - INFO - [84. Machine learning prediction of prime editing efficiency across diverse chromatin contexts - Google News](https://news.google.com/articles/CBMiMmh0dHBzOi8vd3d3Lm5hdHVyZS5jb20vYXJ0aWNsZXMvczQxNTg3LTAyNC0wMjI2OC0y0gEA)
2024-06-21 13:11:09,829 - AInewsbot - INFO - [86. Microsoft Edge 126 launches in the Stable

2024-06-21 13:11:09,833 - AInewsbot - INFO - [110. Premiere of AI film cancelled following social media backlash and 200 complaints - Feedly AI](https://www.nme.com/news/film/premiere-of-ai-film-cancelled-following-social-media-backlash-and-200-complaints-3767697)
2024-06-21 13:11:09,834 - AInewsbot - INFO - [111. Premiere of First Movie Written by AI Is Axed After Backlash - Reddit](https://www.reddit.com/r/technology/comments/1dkssdy/premiere_of_first_movie_written_by_ai_is_axed/)
2024-06-21 13:11:09,834 - AInewsbot - INFO - [112. Prior Authorization Platform Humata Health Closes $25M Investment - Google News](https://news.google.com/articles/CBMieWh0dHBzOi8vd3d3LmJ1c2luZXNzd2lyZS5jb20vbmV3cy9ob21lLzIwMjQwNjIwMTEzODMxL2VuL1ByaW9yLUF1dGhvcml6YXRpb24tUGxhdGZvcm0tSHVtYXRhLUhlYWx0aC1DbG9zZXMtMjVNLUludmVzdG1lbnTSAQA)
2024-06-21 13:11:09,835 - AInewsbot - INFO - [113. Reddit training A.I. is bad for humanity. - Google News](https://news.google.com/articles/CBMiOWh0dHBzOi8vc2xhdGUuY29tL3RlY

2024-06-21 13:11:09,841 - AInewsbot - INFO - [140. VB Transform 2024: Find out if new AI inference platforms have what it takes to topple Nvidia - VentureBeat](https://venturebeat.com/ai/vb-transform-2024-find-out-if-new-ai-inference-platforms-have-what-it-takes-to-topple-nvidia/)
2024-06-21 13:11:09,841 - AInewsbot - INFO - [141. Venture capitals AI love affair: A bubble waiting to pop? - Google News](https://news.google.com/articles/CBMiXGh0dHBzOi8vdGhlaGlsbC5jb20vb3Bpbmlvbi80NzMyMjQyLXZlbnR1cmUtY2FwaXRhbHMtYWktbG92ZS1hZmZhaXItYS1idWJibGUtd2FpdGluZy10by1wb3Av0gFgaHR0cHM6Ly90aGVoaWxsLmNvbS9vcGluaW9uLzQ3MzIyNDItdmVudHVyZS1jYXBpdGFscy1haS1sb3ZlLWFmZmFpci1hLWJ1YmJsZS13YWl0aW5nLXRvLXBvcC9hbXAv)
2024-06-21 13:11:09,841 - AInewsbot - INFO - [142. Waymo robotaxis set to cruise past red tape into LA and beyond - Feedly AI](https://go.theregister.com/feed/www.theregister.com/2024/06/21/waymo_expansion/)
2024-06-21 13:11:09,841 - AInewsbot - INFO - [143. Waymo robotaxis set to cruise past red t
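
Titles and URLs are interpolated into `html_str` unescaped, so a headline containing `<`, `&` or a double quote could break the markup of the email. A minimal sketch using the standard library's `html.escape`; `headline_link` is a hypothetical helper and the sample values are made up.

```python
from html import escape


def headline_link(index, title, src, url):
    """Return one HTML line for a headline, escaping text and attribute."""
    label = escape(f"{title} - {src}")
    href = escape(url, quote=True)
    return f'{index}. <a href="{href}">{label}</a><br />\n'


line = headline_link(0, 'R&D "breakthrough" <beta>', "Example News",
                     "https://example.com/?a=1&b=2")
print(line)
```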

In [60]:
log("Sending mail")
from_addr = os.getenv("GMAIL_USER")
to_addr = os.getenv("GMAIL_USER")
subject = 'AI news ' + datetime.now().strftime('%H:%M:%S')
body = f"""
<html>
    <head></head>
    <body>
    <div>
    {html_str}
    </div>
    </body>
</html>
"""

# Setup the MIME
message = MIMEMultipart()
message['From'] = from_addr
message['To'] = to_addr
message['Subject'] = subject
message.attach(MIMEText(body, 'html'))

# Create SMTP session
with smtplib.SMTP('smtp.gmail.com', 587) as server:
    server.starttls()  # Secure the connection
    server.login(from_addr, os.getenv("GMAIL_PASSWORD"))
    text = message.as_string()
    server.sendmail(from_addr, to_addr, text)

log("Finished")


2024-06-21 13:11:11,385 - AInewsbot - INFO - Sending mail
2024-06-21 13:11:12,965 - AInewsbot - INFO - Finished
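
Building the message in a separate function makes it possible to inspect the result without SMTP credentials or a network connection. A small sketch, where `build_message` is a hypothetical helper and the addresses are placeholders:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(from_addr, to_addr, subject, html_body):
    """Assemble an HTML email without sending it."""
    message = MIMEMultipart()
    message["From"] = from_addr
    message["To"] = to_addr
    message["Subject"] = subject
    message.attach(MIMEText(html_body, "html"))
    return message


msg = build_message("bot@example.com", "me@example.com",
                    "AI news 2024-06-21", "<html><body>hello</body></html>")
print(msg["Subject"], msg.get_content_type())
```

The returned object can then be passed to `server.sendmail(from_addr, to_addr, msg.as_string())` inside the SMTP session.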


20