AInewsbot.ipynb

- Automate collecting daily AI news
- Open URLs of news sites specified in `sources` dict (sources.yaml) using Selenium and Firefox
- Save HTML of each URL in htmldata directory
- Extract URLs from all files, create a pandas dataframe with url, title, src
- Use ChatGPT to filter only AI-related headlines by sending a prompt and formatted table of headlines
- Use SQLite to filter headlines previously seen 
- OPENAI_API_KEY should be in the environment or in a .env file
  
Alternative manual workflow to get HTML files if necessary
- Use Chrome, open e.g. Tech News bookmark folder, right-click and open all bookmarks in new window
- On Google News, make sure to switch to the AI tab
- On Google News, Feedly, Reddit, scroll to additional pages as desired
- Use SingleFile extension, 'save all tabs'
- Move files to htmldata directory
- Run lower part of notebook to process the data


In [1]:
# import sys
# del sys.modules['ainb_utilities']
# del sys.modules['ainb_webscrape']


In [2]:
from datetime import datetime
import os
import yaml
import dotenv
import sqlite3
import unicodedata
import json

import numpy as np
import pandas as pd

# import bs4
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, urlparse

import multiprocessing
from concurrent.futures import ThreadPoolExecutor, as_completed
import asyncio
import aiohttp

from IPython.display import HTML, Image, Markdown, display
import markdown

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

from openai import OpenAI

from ainb_const import (DOWNLOAD_DIR, LOWCOST_MODEL, MODEL,
                        SOURCECONFIG, PROMPT, MAX_INPUT_TOKENS,
                        SUMMARIZE_SYSTEM_PROMPT, SUMMARIZE_USER_PROMPT, FINAL_SUMMARY_PROMPT)
from ainb_utilities import log, delete_files, filter_unseen_urls_db, insert_article, nearest_neighbor_sort, agglomerative_cluster_sort, traveling_salesman_sort_scipy
from ainb_webscrape import get_driver, quit_drivers, launch_drivers, get_file, get_url, parse_file, get_og_tags, get_path_from_url, trimmed_href, process_source_queue_factory, process_url_queue_factory, DRIVERS
from ainb_llm import paginate_df, process_pages, fetch_pages, fetch_openai, fetch_all2, fetch_openai2


# nest_asyncio is needed because Jupyter is already running an asyncio event loop
# (asyncio itself is imported above)
import nest_asyncio

In [3]:
# PROMPT = """
# You will act as a research assistant to categorize news articles based on their relevance
# to the topic of artificial intelligence (AI). You will process and classify news headlines
# formatted as JSON objects.

# Input Specification:
# You will receive a list of news stories formatted as JSON objects.
# Each object will include an 'id' and a 'title'. For instance:
# [{'id': 97, 'title': 'AI to predict dementia, detect cancer'},
#  {'id': 103,'title': 'Figure robot learns to make coffee by watching humans for 10 hours'},
#  {'id': 103,'title': 'Baby trapped in refrigerator eats own foot'},
#  {'id': 210,'title': 'ChatGPT removes, then reinstates a summarization assistant without explanation.'},
#  {'id': 298,'title': 'The 5 most interesting PC monitors from CES 2024'},
#  ]

# Classification Criteria:
# Classify each story based on its title to determine whether it primarily pertains to AI.
# Broadly define AI-related content to include topics such as machine learning, robotics,
# computer vision, reinforcement learning, large language models, and related topics. Also
# include specific references to AI-related entities and individuals and products such as
# OpenAI, ChatGPT, Elon Musk, Sam Altman, Anthropic Claude, Google Gemini, Copilot,
# Perplexity.ai, Midjourney, etc.

# Output Specification:
# You will return a JSON object with the field 'stories' containing the list of classification results.
# For each story, your output will be a JSON object containing the original 'id' and a new field 'isAI',
# a boolean indicating if the story is about AI. The output schema must be strictly adhered to, without
# any additional fields. Example output:
# {'stories':
# [{'id': 97, 'isAI': true},
#  {'id': 103, 'isAI': true},
#  {'id': 103, 'isAI': false},
#  {'id': 210, 'isAI': true},
#  {'id': 298, 'isAI': false}]
# }

# Ensure that each output object accurately reflects the corresponding input object in terms of the 'id' field
# and that the 'isAI' field accurately represents the AI relevance of the story as determined by the title.

# The list of news stories to classify and enrich is:

# """

In [4]:
print(PROMPT)


You will act as a research assistant to categorize news articles based on their relevance
to the topic of artificial intelligence (AI). You will process and classify news headlines
formatted as JSON objects.

Input Specification:
You will receive a list of news stories formatted as JSON objects.
Each object will include an 'id' and a 'title'. For instance:
[{'id': 97, 'title': 'AI to predict dementia, detect cancer'},
 {'id': 103,'title': 'Figure robot learns to make coffee by watching humans for 10 hours'},
 {'id': 103,'title': 'Baby trapped in refrigerator eats own foot'},
 {'id': 210,'title': 'ChatGPT removes, then reinstates a summarization assistant without explanation.'},
 {'id': 298,'title': 'The 5 most interesting PC monitors from CES 2024'},
 ]

Classification Criteria:
Classify each story based on its title to determine whether it primarily pertains to AI.
Broadly define AI-related content to include topics such as machine learning, robotics,
computer vision, reinforcement l

In [5]:
get_og_tags('https://druce.ai')


{'og:site_name': 'Druce.ai',
 'og:title': 'Druce.ai',
 'og:type': 'website',
 'og:description': "Druce's Blog on Machine Learning, Tech, Markets and Economics",
 'og:url': 'https://druce.ai/',
 'title': 'Druce.ai'}
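
The implementation of `get_og_tags` lives in `ainb_webscrape` and is not shown in this notebook. As a rough sketch of the idea, assuming the page HTML is already in hand, the standard-library `HTMLParser` can collect `og:*` meta properties and the `<title>` text. The names `OGTagParser` and `extract_og_tags` are hypothetical, and the real helper may well use BeautifulSoup instead.

```python
from html.parser import HTMLParser


class OGTagParser(HTMLParser):
    """Collect Open Graph <meta property="og:..."> tags and the <title> text."""

    def __init__(self):
        super().__init__()
        self.tags = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.tags[prop] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # only text inside <title> is of interest
        if self._in_title and data.strip():
            self.tags["title"] = data.strip()


def extract_og_tags(html):
    """Hypothetical stand-in for get_og_tags that works on an HTML string."""
    parser = OGTagParser()
    parser.feed(html)
    return parser.tags


sample = """<html><head>
<title>Druce.ai</title>
<meta property="og:site_name" content="Druce.ai">
<meta property="og:type" content="website">
</head><body></body></html>"""
print(extract_og_tags(sample))
```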

In [6]:
get_path_from_url('https://druce.ai/2024/03/gemini-summarize-book')


'/2024/03/gemini-summarize-book'

In [7]:
trimmed_href('https://druce.ai/2024/03/gemini-summarize-book?xyz')


'https://druce.ai/2024/03/gemini-summarize-book'
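
The two URL helpers above come from `ainb_webscrape`, whose source is not shown here. A minimal sketch that reproduces the same two outputs with `urllib.parse` might look like this; `path_from_url` and `strip_query` are hypothetical names, not the module's own.

```python
from urllib.parse import urlparse, urlunparse


def path_from_url(url):
    """Return only the path component of a URL."""
    return urlparse(url).path


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlparse(url)
    return urlunparse(parts._replace(params="", query="", fragment=""))


print(path_from_url("https://druce.ai/2024/03/gemini-summarize-book"))
print(strip_query("https://druce.ai/2024/03/gemini-summarize-book?xyz"))
```

Trimming the query string before deduplication means the same article reached through different tracking parameters is stored once.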

In [8]:
# load sources to scrape from sources.yaml
with open(SOURCECONFIG, "r") as stream:
    try:
        sources = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        # re-raise so later cells do not run with `sources` undefined
        print(exc)
        raise

log(f"Load {len(sources)} sources")



2024-07-03 20:06:20,344 - AInewsbot - INFO - Load 17 sources


20

In [9]:
sources

{'Ars Technica': {'include': ['^https://arstechnica.com/(\\w+)/(\\d+)/(\\d+)/'],
  'title': 'Ars Technica',
  'url': 'https://arstechnica.com/'},
 'Bloomberg Tech': {'include': ['^https://www.bloomberg.com/news/'],
  'title': 'Bloomberg Technology - Bloomberg',
  'url': 'https://www.bloomberg.com/technology'},
 'Business Insider': {'exclude': ['^https://www.insider.com',
   '^https://www.passionfroot.me'],
  'title': 'Tech - Business Insider',
  'url': 'https://www.businessinsider.com/tech'},
 'FT Tech': {'include': ['https://www.ft.com/content/'],
  'title': 'Technology',
  'url': 'https://www.ft.com/technology'},
 'Feedly AI': {'exclude': ['^https://feedly.com',
   '^https://s1.feedly.com',
   '^https://blog.feedly.com'],
  'scroll': 5,
  'initial_sleep': 30,
  'title': 'Discover and Add New Feedly AI Feeds',
  'url': 'https://feedly.com/i/aiFeeds?options=eyJsYXllcnMiOlt7InBhcnRzIjpbeyJpZCI6Im5scC9mL3RvcGljLzMwMDAifV0sInNlYXJjaEhpbnQiOiJ0ZWNobm9sb2d5IiwidHlwZSI6Im1hdGNoZXMiLCJzYWxpZW

In [10]:
# make a reverse dict to map output file titles to source names
sources_reverse = {}
for k, v in sources.items():
    log(f"{k} -> {v['url']} -> {v['title']}.html")
    v['sourcename'] = k
    # map filename (title) to source name
    sources_reverse[v['title']] = k

sources_reverse

2024-07-03 20:06:24,754 - AInewsbot - INFO - Ars Technica -> https://arstechnica.com/ -> Ars Technica.html
2024-07-03 20:06:24,756 - AInewsbot - INFO - Bloomberg Tech -> https://www.bloomberg.com/technology -> Bloomberg Technology - Bloomberg.html
2024-07-03 20:06:24,757 - AInewsbot - INFO - Business Insider -> https://www.businessinsider.com/tech -> Tech - Business Insider.html
2024-07-03 20:06:24,758 - AInewsbot - INFO - FT Tech -> https://www.ft.com/technology -> Technology.html
2024-07-03 20:06:24,759 - AInewsbot - INFO - Feedly AI -> https://feedly.com/i/aiFeeds?options=eyJsYXllcnMiOlt7InBhcnRzIjpbeyJpZCI6Im5scC9mL3RvcGljLzMwMDAifV0sInNlYXJjaEhpbnQiOiJ0ZWNobm9sb2d5IiwidHlwZSI6Im1hdGNoZXMiLCJzYWxpZW5jZSI6ImFib3V0In1dLCJidW5kbGVzIjpbeyJ0eXBlIjoic3RyZWFtIiwiaWQiOiJ1c2VyLzYyZWViYjlmLTcxNTEtNGY5YS1hOGM3LTlhNTdiODIwNTMwOC9jYXRlZ29yeS9HYWRnZXRzIn1dfQ -> Discover and Add New Feedly AI Feeds.html
2024-07-03 20:06:24,759 - AInewsbot - INFO - Google News -> https://news.google.com/topics/CAA

{'Ars Technica': 'Ars Technica',
 'Bloomberg Technology - Bloomberg': 'Bloomberg Tech',
 'Tech - Business Insider': 'Business Insider',
 'Technology': 'FT Tech',
 'Discover and Add New Feedly AI Feeds': 'Feedly AI',
 'Google News - Technology - Artificial intelligence': 'Google News',
 'Hacker News Page 1': 'Hacker News',
 'Hacker News Page 2': 'Hacker News 2',
 'HackerNoon - read, write and learn about any technology': 'HackerNoon',
 'Technology - The New York Times': 'NYT Tech',
 'top scoring links _ multi': 'Reddit',
 'Techmeme': 'Techmeme',
 'The Register_ Enterprise Technology News and Analysis': 'The Register',
 'Artificial Intelligence - The Verge': 'The Verge',
 'AI News _ VentureBeat': 'VentureBeat',
 'Technology - WSJ.com': 'WSJ Tech',
 'Technology - The Washington Post': 'WaPo Tech'}

In [11]:
# determine files in htmldata directory
# List all paths in the directory matching today's date
nfiles = 50
files = [os.path.join(DOWNLOAD_DIR, file)
         for file in os.listdir(DOWNLOAD_DIR)]

# Get the current date
today = datetime.now()
year, month, day = today.year, today.month, today.day
datestr = today.strftime("%m_%d_%Y")

# keep regular files only
files = [file for file in files if os.path.isfile(file)]

# Sort files by modification time, newest first, and keep the newest nfiles
files.sort(key=os.path.getmtime, reverse=True)
files = files[:nfiles]

# keep only .html files with today's date in the name
files = [
    file for file in files if datestr in file and file.endswith(".html")]
log(len(files))
for file in files:
    log(file)

saved_pages = []
for file in files:
    filename = os.path.basename(file)
    # locate date like '01_14_2024' in filename
    position = filename.find(" (" + datestr)
    basename = filename[:position]
    # match to source name
    sourcename = sources_reverse.get(basename)
    if sourcename is None:
        log(f"Skipping {basename}, no sourcename metadata")
        continue
    sources[sourcename]['latest'] = file
    saved_pages.append((sourcename, file))

2024-07-03 20:06:35,398 - AInewsbot - INFO - 0
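
The cell above recovers the source title by slicing the filename at `" (" + datestr`. Note that `str.find` returns -1 when the marker is absent, so the slice would silently drop the last character instead of failing. As an alternative sketch, a regular expression can split the saved name explicitly; the pattern assumes the `<title> (MM_DD_YYYY HH_MM_SS AM/PM).html` naming seen in this notebook's output, and `split_saved_name` is a hypothetical helper.

```python
import os
import re

# Matches "<title> (MM_DD_YYYY HH_MM_SS AM|PM).html"
SAVED_NAME_RE = re.compile(
    r"^(?P<title>.+) \((?P<date>\d{2}_\d{2}_\d{4}) (?P<time>\d{2}_\d{2}_\d{2} [AP]M)\)\.html$"
)


def split_saved_name(path):
    """Return (title, date string) for a saved page, or None if the name does not match."""
    match = SAVED_NAME_RE.match(os.path.basename(path))
    if match is None:
        return None
    return match.group("title"), match.group("date")


print(split_saved_name("htmldata/Ars Technica (07_03_2024 08_08_56 PM).html"))
print(split_saved_name("htmldata/notes.txt"))
```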


In [14]:
# Fetch HTML files from sources

# takes 5 minutes without multiprocessing
# get driver takes 50 seconds, so call it 1 minute of startup + 4 minutes of fetching
# with n drivers, roughly n minutes of startup + 4/n of fetching: 2 -> 2 + 2, 3 -> 3 + 4/3
# could save a minute by using 2 webdrivers

# empty download directory
delete_files(DOWNLOAD_DIR)

# save each file specified from sources
log("Saving HTML files")

# Create a queue for multiprocessing and populate it
queue = multiprocessing.Queue()
for item in sources.values():
    queue.put(item)

# The factory returns a function that pops entries off the queue and fetches
# each one until none are left. The workers could just access the queue as a
# global, but this pattern lets you create an array of functions with different args.
num_browsers = 3

callable = process_source_queue_factory(queue)

results = launch_drivers(num_browsers, callable)

2024-07-03 20:07:37,304 - AInewsbot - INFO - Saving HTML files
2024-07-03 20:07:37,308 - AInewsbot - INFO - get_driver - 93788 Initializing webdriver
2024-07-03 20:07:37,310 - AInewsbot - INFO - get_driver - 93788 Initializing webdriver
2024-07-03 20:07:37,311 - AInewsbot - INFO - get_driver - 93788 Initializing webdriver
2024-07-03 20:07:54,753 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-03 20:07:54,753 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-03 20:07:54,753 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-03 20:07:54,754 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-03 20:07:54,754 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-03 20:07:54,755 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-03 20:08:44,358 - AInewsbot - INFO - get_driver - Initialized webdriver
2024-07-03 20:08:44,364 - AInewsbot - INFO - get_driver - Initialize

2024-07-03 20:09:29,272 - AInewsbot - INFO - get_files(Google News - Technology - Artificial intelligence) - Saving Google News - Technology - Artificial intelligence (07_03_2024 08_09_29 PM).html as utf-8
2024-07-03 20:09:29,277 - AInewsbot - INFO - Processing NYT Tech
2024-07-03 20:09:29,277 - AInewsbot - INFO - get_files(Technology - The New York Times) - starting get_files https://www.nytimes.com/section/technology
2024-07-03 20:09:37,780 - AInewsbot - INFO - get_files(Discover and Add New Feedly AI Feeds) - Loading additional infinite scroll items
2024-07-03 20:09:38,719 - AInewsbot - INFO - Message: Unable to locate element: //meta[@http-equiv='Content-Type']; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:192:5
NoSuchElementError@chrom

2024-07-03 20:10:23,417 - AInewsbot - INFO - get_files(Technology - WSJ.com) - Saving Technology - WSJ.com (07_03_2024 08_10_23 PM).html as utf-8
2024-07-03 20:10:23,418 - AInewsbot - INFO - Quit webdriver
2024-07-03 20:10:28,942 - AInewsbot - INFO - Message: Unable to locate element: //meta[@http-equiv='Content-Type']; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:192:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:510:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:136:16

2024-07-03 20:10:28,943 - AInewsbot - INFO - get_files(Technology - The Washington Post) - Saving Technology - The Washington Post (07_03_2024 08_10_28 PM).html as utf-8
2024-07-03 20:10:28,946 - AInewsbot - INFO - Quit webdriver
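
`process_source_queue_factory` and `launch_drivers` are defined in `ainb_webscrape` and not shown here. The log above appears to report the same process id for all three drivers, which suggests the workers share one process. Under that assumption, the pattern can be sketched with threads and a shared queue; `make_worker` and `run_workers` are hypothetical names, and the body only records what a real worker would fetch.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def make_worker(work_queue):
    """Return a function that drains the shared queue and returns what it processed."""
    def worker(worker_id):
        done = []
        while True:
            try:
                item = work_queue.get_nowait()
            except queue.Empty:
                return done
            # a real worker would fetch item["url"] with its own webdriver here
            done.append((item["sourcename"], f"worker {worker_id}"))
    return worker


def run_workers(n_workers, worker):
    """Run n_workers copies of worker in threads; return one result list per worker."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, range(n_workers)))


work_queue = queue.Queue()
for name in ["Ars Technica", "FT Tech", "Hacker News", "Techmeme"]:
    work_queue.put({"sourcename": name, "url": "https://example.com/"})

results = run_workers(3, make_worker(work_queue))
processed = sorted(name for batch in results for name, _ in batch)
print(processed)
```

Because each worker returns its own list, the caller gets a list of lists, which is why the next cell flattens `results`.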


In [15]:
# flatten results
saved_pages = []
for r in results:
    saved_pages.extend(r)
saved_pages

[('Ars Technica', 'htmldata/Ars Technica (07_03_2024 08_08_56 PM).html'),
 ('FT Tech', 'htmldata/Technology (07_03_2024 08_09_07 PM).html'),
 ('Hacker News', 'htmldata/Hacker News Page 1 (07_03_2024 08_09_17 PM).html'),
 ('Hacker News 2',
  'htmldata/Hacker News Page 2 (07_03_2024 08_09_27 PM).html'),
 ('HackerNoon',
  'htmldata/HackerNoon - read, write and learn about any technology (07_03_2024 08_09_38 PM).html'),
 ('Reddit',
  'htmldata/top scoring links _ multi (07_03_2024 08_10_11 PM).html'),
 ('VentureBeat',
  'htmldata/AI News _ VentureBeat (07_03_2024 08_10_22 PM).html'),
 ('Bloomberg Tech',
  'htmldata/Bloomberg Technology - Bloomberg (07_03_2024 08_08_57 PM).html'),
 ('Google News',
  'htmldata/Google News - Technology - Artificial intelligence (07_03_2024 08_09_29 PM).html'),
 ('NYT Tech',
  'htmldata/Technology - The New York Times (07_03_2024 08_09_39 PM).html'),
 ('Techmeme', 'htmldata/Techmeme (07_03_2024 08_09_50 PM).html'),
 ('The Register',
  'htmldata/The Register_ E

In [None]:
# gives error, TypeError: cannot pickle '_thread.lock' object
# not sure why, should be straightforward
# overkill to save a minute but 2 ways to do it:
#  function that takes the queue as an arg, gets driver, keeps popping items until queue is empty, quits driver 
#  use a pool.map() on a function that takes a single url as an arg, after all complete, call quit_drivers()

# import multiprocessing
# from queue import Queue

# # Function to download and save the web page
# def download_page(url_queue, download_dir):
#     driver = get_driver()

#     while not url_queue.empty():
#         url = url_queue.get()
#         log(url)
#         try:
#             driver.get(url)
#             # Save page source
#             file_name = os.path.join(download_dir, f"{url_queue.qsize()}.html")
#             with open(file_name, 'w', encoding='utf-8') as file:
#                 file.write(driver.page_source)
#             print(f"Saved {url} to {file_name}")
#         except Exception as e:
#             print(f"Failed to download {url}: {e}")
#         finally:
#             url_queue.task_done()

#     driver.quit()

# # Create a queue and add URLs
# url_queue = Queue()
# urls = [v['url'] for v in sources.values()]
# for url in urls:
#     url_queue.put(url)

# # Create and start processes
# processes = []

# for i in range(3):  # 3 Selenium WebDriver instances
#     process = multiprocessing.Process(target=download_page, args=(url_queue, DOWNLOAD_DIR))
#     process.start()
#     processes.append(process)

# # Wait for all processes to finish
# for process in processes:
#     process.join()

# print("All pages have been downloaded.")
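
A likely cause of the `TypeError` noted at the top of this cell: `queue.Queue` holds thread locks, and the arguments of a `multiprocessing.Process` are pickled when the start method is `spawn` (the default on macOS and Windows). This is an inference from the error message, not something confirmed in the notebook. The snippet below reproduces the pickling failure in isolation.

```python
import pickle
import queue

try:
    pickle.dumps(queue.Queue())
    error = None
except TypeError as exc:
    # queue.Queue contains threading locks, which cannot be pickled
    error = str(exc)

print(error)
```

A `multiprocessing.Queue`, as used in the working cell earlier, is designed to be shared with child processes and avoids this problem.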

In [16]:
print(len(saved_pages))
for sourcename, page in saved_pages:
    sources[sourcename]['latest'] = page
    print(sourcename, '->', page)
    

17
Ars Technica -> htmldata/Ars Technica (07_03_2024 08_08_56 PM).html
FT Tech -> htmldata/Technology (07_03_2024 08_09_07 PM).html
Hacker News -> htmldata/Hacker News Page 1 (07_03_2024 08_09_17 PM).html
Hacker News 2 -> htmldata/Hacker News Page 2 (07_03_2024 08_09_27 PM).html
HackerNoon -> htmldata/HackerNoon - read, write and learn about any technology (07_03_2024 08_09_38 PM).html
Reddit -> htmldata/top scoring links _ multi (07_03_2024 08_10_11 PM).html
VentureBeat -> htmldata/AI News _ VentureBeat (07_03_2024 08_10_22 PM).html
Bloomberg Tech -> htmldata/Bloomberg Technology - Bloomberg (07_03_2024 08_08_57 PM).html
Google News -> htmldata/Google News - Technology - Artificial intelligence (07_03_2024 08_09_29 PM).html
NYT Tech -> htmldata/Technology - The New York Times (07_03_2024 08_09_39 PM).html
Techmeme -> htmldata/Techmeme (07_03_2024 08_09_50 PM).html
The Register -> htmldata/The Register_ Enterprise Technology News and Analysis (07_03_2024 08_10_01 PM).html
The Verge -> 

In [17]:
# for sourcename, filename in saved_pages:
#     sources[sourcename]["latest"]=filename
#     sources[sourcename]["sourcename"]=sourcename
    
    

In [18]:
# Parse news URLs and titles from downloaded HTML files
log("parsing html files")
all_urls = []
for sourcename, filename in saved_pages:
    print(sourcename, '->', filename, flush=True)
    log(f"{sourcename}", "parse loop")
    links = parse_file(sources[sourcename])
    log(f"{len(links)} links found", "parse loop")
    all_urls.extend(links)

log(f"found {len(all_urls)} links", "parse loop")

2024-07-03 20:11:41,931 - AInewsbot - INFO - parsing html files


Ars Technica -> htmldata/Ars Technica (07_03_2024 08_08_56 PM).html


2024-07-03 20:11:41,934 - AInewsbot - INFO - parse loop - Ars Technica
2024-07-03 20:11:41,986 - AInewsbot - INFO - parse_file - found 252 raw links
2024-07-03 20:11:41,990 - AInewsbot - INFO - parse_file - found 29 filtered links
2024-07-03 20:11:41,990 - AInewsbot - INFO - parse loop - 29 links found


FT Tech -> htmldata/Technology (07_03_2024 08_09_07 PM).html


2024-07-03 20:11:41,991 - AInewsbot - INFO - parse loop - FT Tech
2024-07-03 20:11:42,022 - AInewsbot - INFO - parse_file - found 458 raw links
2024-07-03 20:11:42,027 - AInewsbot - INFO - parse_file - found 105 filtered links
2024-07-03 20:11:42,028 - AInewsbot - INFO - parse loop - 105 links found


Hacker News -> htmldata/Hacker News Page 1 (07_03_2024 08_09_17 PM).html


2024-07-03 20:11:42,028 - AInewsbot - INFO - parse loop - Hacker News
2024-07-03 20:11:42,039 - AInewsbot - INFO - parse_file - found 257 raw links
2024-07-03 20:11:42,042 - AInewsbot - INFO - parse_file - found 27 filtered links
2024-07-03 20:11:42,042 - AInewsbot - INFO - parse loop - 27 links found


Hacker News 2 -> htmldata/Hacker News Page 2 (07_03_2024 08_09_27 PM).html


2024-07-03 20:11:42,042 - AInewsbot - INFO - parse loop - Hacker News 2
2024-07-03 20:11:42,052 - AInewsbot - INFO - parse_file - found 261 raw links
2024-07-03 20:11:42,055 - AInewsbot - INFO - parse_file - found 27 filtered links
2024-07-03 20:11:42,056 - AInewsbot - INFO - parse loop - 27 links found


HackerNoon -> htmldata/HackerNoon - read, write and learn about any technology (07_03_2024 08_09_38 PM).html


2024-07-03 20:11:42,056 - AInewsbot - INFO - parse loop - HackerNoon
2024-07-03 20:11:42,112 - AInewsbot - INFO - parse_file - found 602 raw links
2024-07-03 20:11:42,122 - AInewsbot - INFO - parse_file - found 104 filtered links
2024-07-03 20:11:42,122 - AInewsbot - INFO - parse loop - 104 links found


Reddit -> htmldata/top scoring links _ multi (07_03_2024 08_10_11 PM).html


2024-07-03 20:11:42,122 - AInewsbot - INFO - parse loop - Reddit
2024-07-03 20:11:42,202 - AInewsbot - INFO - parse_file - found 557 raw links
2024-07-03 20:11:42,212 - AInewsbot - INFO - parse_file - found 373 filtered links
2024-07-03 20:11:42,213 - AInewsbot - INFO - parse loop - 373 links found


VentureBeat -> htmldata/AI News _ VentureBeat (07_03_2024 08_10_22 PM).html


2024-07-03 20:11:42,213 - AInewsbot - INFO - parse loop - VentureBeat
2024-07-03 20:11:42,229 - AInewsbot - INFO - parse_file - found 326 raw links
2024-07-03 20:11:42,233 - AInewsbot - INFO - parse_file - found 44 filtered links
2024-07-03 20:11:42,233 - AInewsbot - INFO - parse loop - 44 links found


Bloomberg Tech -> htmldata/Bloomberg Technology - Bloomberg (07_03_2024 08_08_57 PM).html


2024-07-03 20:11:42,233 - AInewsbot - INFO - parse loop - Bloomberg Tech
2024-07-03 20:11:42,393 - AInewsbot - INFO - parse_file - found 289 raw links
2024-07-03 20:11:42,396 - AInewsbot - INFO - parse_file - found 50 filtered links
2024-07-03 20:11:42,396 - AInewsbot - INFO - parse loop - 50 links found


Google News -> htmldata/Google News - Technology - Artificial intelligence (07_03_2024 08_09_29 PM).html


2024-07-03 20:11:42,397 - AInewsbot - INFO - parse loop - Google News
2024-07-03 20:11:42,691 - AInewsbot - INFO - parse_file - found 1032 raw links
2024-07-03 20:11:42,697 - AInewsbot - INFO - parse_file - found 448 filtered links
2024-07-03 20:11:42,698 - AInewsbot - INFO - parse loop - 448 links found


NYT Tech -> htmldata/Technology - The New York Times (07_03_2024 08_09_39 PM).html


2024-07-03 20:11:42,698 - AInewsbot - INFO - parse loop - NYT Tech
2024-07-03 20:11:42,708 - AInewsbot - INFO - parse_file - found 72 raw links
2024-07-03 20:11:42,709 - AInewsbot - INFO - parse_file - found 19 filtered links
2024-07-03 20:11:42,709 - AInewsbot - INFO - parse loop - 19 links found


Techmeme -> htmldata/Techmeme (07_03_2024 08_09_50 PM).html


2024-07-03 20:11:42,709 - AInewsbot - INFO - parse loop - Techmeme
2024-07-03 20:11:42,722 - AInewsbot - INFO - parse_file - found 266 raw links
2024-07-03 20:11:42,726 - AInewsbot - INFO - parse_file - found 103 filtered links
2024-07-03 20:11:42,726 - AInewsbot - INFO - parse loop - 103 links found


The Register -> htmldata/The Register_ Enterprise Technology News and Analysis (07_03_2024 08_10_01 PM).html


2024-07-03 20:11:42,727 - AInewsbot - INFO - parse loop - The Register
2024-07-03 20:11:42,743 - AInewsbot - INFO - parse_file - found 202 raw links
2024-07-03 20:11:42,746 - AInewsbot - INFO - parse_file - found 89 filtered links
2024-07-03 20:11:42,747 - AInewsbot - INFO - parse loop - 89 links found


The Verge -> htmldata/Artificial Intelligence - The Verge (07_03_2024 08_10_12 PM).html


2024-07-03 20:11:42,747 - AInewsbot - INFO - parse loop - The Verge
2024-07-03 20:11:42,770 - AInewsbot - INFO - parse_file - found 312 raw links
2024-07-03 20:11:42,774 - AInewsbot - INFO - parse_file - found 37 filtered links
2024-07-03 20:11:42,774 - AInewsbot - INFO - parse loop - 37 links found


WSJ Tech -> htmldata/Technology - WSJ.com (07_03_2024 08_10_23 PM).html


2024-07-03 20:11:42,775 - AInewsbot - INFO - parse loop - WSJ Tech
2024-07-03 20:11:42,804 - AInewsbot - INFO - parse_file - found 503 raw links
2024-07-03 20:11:42,810 - AInewsbot - INFO - parse_file - found 10 filtered links
2024-07-03 20:11:42,811 - AInewsbot - INFO - parse loop - 10 links found


Business Insider -> htmldata/Tech - Business Insider (07_03_2024 08_08_57 PM).html


2024-07-03 20:11:42,811 - AInewsbot - INFO - parse loop - Business Insider
2024-07-03 20:11:42,831 - AInewsbot - INFO - parse_file - found 310 raw links
2024-07-03 20:11:42,835 - AInewsbot - INFO - parse_file - found 48 filtered links
2024-07-03 20:11:42,835 - AInewsbot - INFO - parse loop - 48 links found


Feedly AI -> htmldata/Discover and Add New Feedly AI Feeds (07_03_2024 08_10_17 PM).html


2024-07-03 20:11:42,835 - AInewsbot - INFO - parse loop - Feedly AI
2024-07-03 20:11:42,862 - AInewsbot - INFO - parse_file - found 235 raw links
2024-07-03 20:11:42,865 - AInewsbot - INFO - parse_file - found 64 filtered links
2024-07-03 20:11:42,865 - AInewsbot - INFO - parse loop - 64 links found


WaPo Tech -> htmldata/Technology - The Washington Post (07_03_2024 08_10_28 PM).html


2024-07-03 20:11:42,866 - AInewsbot - INFO - parse loop - WaPo Tech
2024-07-03 20:11:42,877 - AInewsbot - INFO - parse_file - found 155 raw links
2024-07-03 20:11:42,878 - AInewsbot - INFO - parse_file - found 25 filtered links
2024-07-03 20:11:42,879 - AInewsbot - INFO - parse loop - 25 links found
2024-07-03 20:11:42,879 - AInewsbot - INFO - parse loop - found 1602 links
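
The gap between "raw links" and "filtered links" in the log comes from the `include` and `exclude` regex lists in `sources`. How `parse_file` applies them is not shown, so the following is only a sketch under two assumptions: a URL must match at least one `include` pattern when that list is present, and any `exclude` match drops it. `keep_url` is a hypothetical helper.

```python
import re


def keep_url(url, include=None, exclude=None):
    """Apply optional include/exclude regex lists to one URL."""
    if include and not any(re.search(p, url) for p in include):
        return False
    if exclude and any(re.search(p, url) for p in exclude):
        return False
    return True


# include pattern copied from the 'Ars Technica' entry in sources
ars = {"include": [r"^https://arstechnica.com/(\w+)/(\d+)/(\d+)/"]}
links = [
    "https://arstechnica.com/gadgets/2024/07/example-story/",
    "https://arstechnica.com/about-us/",
]
print([u for u in links
       if keep_url(u, include=ars.get("include"), exclude=ars.get("exclude"))])
```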


20

In [19]:
# make a pandas dataframe of all the links found
orig_df = (
    pd.DataFrame(all_urls)
    .groupby("url")
    .first()
    .reset_index()
    .sort_values("src")[["src", "title", "url"]]
    .reset_index(drop=True)
    .reset_index(drop=False)
    .rename(columns={"index": "id"})
)
print(len(orig_df))
orig_df.head()

1301


Unnamed: 0,id,src,title,url
0,0,Ars Technica,Paul Sutter walks us through the future of cli...,https://arstechnica.com/science/2022/04/paul-s...
1,1,Ars Technica,"Apple Vision Pro, new cameras fail user-repair...",https://arstechnica.com/gadgets/2024/07/apple-...
2,2,Ars Technica,"Bleeding subscribers, cable companies force th...",https://arstechnica.com/gadgets/2024/07/bleedi...
3,3,Ars Technica,Google’s greenhouse gas emissions jump 48% in ...,https://arstechnica.com/gadgets/2024/07/google...
4,4,Ars Technica,"Japan wins 2-year “war on floppy disks,” kills...",https://arstechnica.com/gadgets/2024/07/japans...


In [34]:
datestr = '2024-07-02'

conn = sqlite3.connect('articles.db')

# bind the date as a parameter instead of formatting it into the SQL string
query = "select * from news_articles where article_date > ? and isAI = 1 order by id desc limit 220"
df = pd.read_sql_query(query, conn, params=(datestr,))
# filtered_df=df.reset_index(). \
#   drop(columns='id'). \
#   rename(columns={'index': 'id'})

df



Unnamed: 0,id,src,title,url,isAI,article_date,timestamp
0,120771,The Verge,Google will now generate disclosures for polit...,https://www.theverge.com/2024/7/3/24191669/goo...,1,2024-07-03,2024-07-03 20:22:58
1,120770,The Verge,OpenAI’s ChatGPT Mac app was storing conversat...,https://www.theverge.com/2024/7/3/24191636/ope...,1,2024-07-03,2024-07-03 20:22:58
2,120769,The Verge,Perplexity’s ‘Pro Search’ AI upgrade makes it ...,https://www.theverge.com/2024/7/3/24191431/per...,1,2024-07-03,2024-07-03 20:22:58
3,120765,The Register,Meta training AI models on citizen data gets a...,https://www.theregister.com/2024/07/03/brazil_...,1,2024-07-03,2024-07-03 20:22:58
4,120747,Techmeme,"Sources: Harvey, which builds generative AI to...",https://www.theinformation.com/articles/harvey...,1,2024-07-03,2024-07-03 20:22:58
...,...,...,...,...,...,...,...
215,120054,HackerNoon,Defining Diversity and Inclusion in AI,https://hackernoon.com/defining-diversity-and-...,1,2024-07-03,
216,120051,HackerNoon,Gamified Learning With An AI Board Game Tourna...,https://hackernoon.com/gamified-learning-with-...,1,2024-07-03,
217,120048,HackerNoon,AI for All: Operationalizing Diversity and Inc...,https://hackernoon.com/ai-for-all-operationali...,1,2024-07-03,
218,120039,HackerNoon,Research Methodology for Operationalizing Dive...,https://hackernoon.com/research-methodology-fo...,1,2024-07-03,


In [35]:
conn.execute("delete from news_articles where timestamp > '2024-07-02 13:00'")
# conn.execute(f"delete from news_articles where id > 220230")
# Committing the changes
conn.commit()

# Closing the connection
conn.close()


In [22]:
filtered_df = filter_unseen_urls_db(orig_df)


2024-07-03 20:12:34,210 - AInewsbot - INFO - Existing URLs: 120230
2024-07-03 20:12:34,242 - AInewsbot - INFO - New URLs: 548
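
`filter_unseen_urls_db` is defined in `ainb_utilities` and its source is not shown. A minimal sketch of the underlying idea, assuming a `news_articles` table with a `url` column as in the queries above, is to load the known URLs into a set and keep only the new ones. The schema here is simplified and `unseen_urls` is a hypothetical helper.

```python
import sqlite3


def unseen_urls(conn, urls):
    """Return the URLs from `urls` that are not yet stored in news_articles."""
    seen = {row[0] for row in conn.execute("SELECT url FROM news_articles")}
    return [u for u in urls if u not in seen]


# in-memory database with a simplified, assumed schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news_articles (id INTEGER PRIMARY KEY, url TEXT UNIQUE)")
conn.execute("INSERT INTO news_articles (url) VALUES (?)", ("https://example.com/old",))
conn.commit()

fresh = unseen_urls(conn, ["https://example.com/old", "https://example.com/new"])
print(fresh)
conn.close()
```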


In [23]:
# use chatgpt to filter AI-related headlines using a prompt to OpenAI

client = OpenAI()

# make pages that fit in a reasonably sized prompt
pages = paginate_df(filtered_df)


In [24]:
pages

[[{'id': 4,
   'title': 'Japan wins 2-year “war on floppy disks,” kills regulations requiring old tech'},
  {'id': 46,
   'title': 'Francisco, KKR Are Said to Consider Takeover of Instructure'},
  {'id': 100,
   'title': 'Influencer talent firm Night is laying off its team that aimed to turn creators into TV and movie stars'},
  {'id': 120,
   'title': 'Being a director in China has just become much tougher'},
  {'id': 147,
   'title': 'Singapore tries to break free of the ‘Crazy Rich Asians’ cliché'},
  {'id': 149,
   'title': 'Jeffrey Katzenberg faces ire of Hollywood donors after Biden debate'},
  {'id': 185, 'title': 'Why deep tech VC Driving Forces is shutting down'},
  {'id': 186,
   'title': 'Would having an AI boss be better than your current human one?'},
  {'id': 187,
   'title': 'China-based inventors lead on global GenAI patents: UN report'},
  {'id': 188, 'title': 'AI key issue in Hollywood contract'},
  {'id': 189,
   'title': "If AI is the 'gas guzzler' of data, how do w
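
`paginate_df` comes from `ainb_llm` and is not shown. The "got list with 50 items" log line later in the notebook suggests pages of 50 headlines, though the real function may instead size pages by token count (`MAX_INPUT_TOKENS` is imported). Assuming a fixed page size, the chunking can be sketched as follows; `paginate` is a hypothetical name.

```python
def paginate(rows, page_size=50):
    """Split a list of {'id', 'title'} dicts into consecutive pages of at most page_size."""
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]


rows = [{"id": i, "title": f"headline {i}"} for i in range(120)]
pages = paginate(rows)
print([len(p) for p in pages])
```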

In [25]:
# updated ainb_llm.py to use async/await
# the lowest-level client.chat.completions.create call does not support await;
# it gives "object ChatCompletion can't be used in 'await' expression"
# this still takes 2 minutes for 456 URLs
# the OpenAI python API still blocks / throttles

# print(datetime.now())
# enriched_urls = await process_pages(client, PROMPT, pages)
# print(datetime.now())
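
The comments above describe a blocking client call that defeats `await`. To illustrate what non-blocking requests buy, here is a small sketch that replaces the API call with `asyncio.sleep`; `fake_request` and `fetch_all` are hypothetical stand-ins, not functions from `ainb_llm`. Five 0.1-second "requests" issued through `asyncio.gather` finish in roughly the time of one. The `openai` package also provides an `AsyncOpenAI` client whose methods can be awaited, which may be another route.

```python
import asyncio
import time


async def fake_request(page_number, delay=0.1):
    """Stand-in for one API call; awaits instead of blocking the event loop."""
    await asyncio.sleep(delay)
    return page_number


async def fetch_all(n_pages):
    """Issue all requests at once and wait for them together."""
    return await asyncio.gather(*(fake_request(i) for i in range(n_pages)))


start = time.perf_counter()
responses = asyncio.run(fetch_all(5))
elapsed = time.perf_counter() - start
print(responses, f"{elapsed:.2f}s")
```

Inside Jupyter, `asyncio.run` needs `nest_asyncio.apply()` first because an event loop is already running.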


In [26]:
API_URL = 'https://api.openai.com/v1/chat/completions'

headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {os.getenv("OPENAI_API_KEY")}',
}
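
One caveat with the headers above: `os.getenv` returns `None` when the variable is unset, so the header would silently become the literal string `Bearer None`. A small sketch that fails early instead, where `build_headers` is a hypothetical helper:

```python
import os


def build_headers(env=os.environ):
    """Build request headers, failing early if the API key is not set."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }


# a plain dict stands in for the environment in this example
print(build_headers({"OPENAI_API_KEY": "sk-example"})["Authorization"])
```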



In [27]:
# this runs fast with async aiohttp and on gpt-3.5 (15 seconds vs 2 minutes)
# the old API supported submitting multiple payloads in a single completion request
# current API supports a slow 'batch' submission https://platform.openai.com/docs/guides/rate-limits/usage-tiers
# there is a more complex example here - https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py

# need this to run async in jupyter since it already has an asyncio event loop running
import nest_asyncio
nest_asyncio.apply()

# async def fetch_openai(session, payload):
#     async with session.post(API_URL, headers=headers, json=payload) as response:
#         return await response.json()


# async def fetch_pages(prompt, pages):
#     log(f"{datetime.now().strftime('%H:%M:%S')} Sending {len(pages)} pages to OpenAI.")

#     # make a prompt and payload for each page
#     payloads = [{"model":  LOWCOST_MODEL,
#                  "response_format": {"type": "json_object"},
#                  "messages": [{"role": "user",
#                                "content": prompt + json.dumps(p)
#                                }]
#                  } for p in pages]

#     async with aiohttp.ClientSession() as session:
#         tasks = []
#         for i, payload in enumerate(payloads):
#             log(f"{datetime.now().strftime('%H:%M:%S')} Sending page {i}.")
#             task = asyncio.create_task(fetch_openai(session, payload))
#             tasks.append(task)

#         responses = await asyncio.gather(*tasks)

#     # validate and process the responses
#     log(f"{datetime.now().strftime('%H:%M:%S')} Processing responses... ")

#     retlist = []
#     for i, response in enumerate(responses):
#         try:
#             response_dict = json.loads(
#                 response["choices"][0]["message"]["content"])
#         except Exception as e:
#             raise TypeError("Error: Invalid response " + str(e))

#         if type(response_dict) is dict:
#             response_list = response_dict.get("stories")
#         else:
#             raise TypeError("Error: Invalid response type")

#         if type(response_list) is not list:
#             raise TypeError("Error: Invalid response type")

#         log(f"{datetime.now().strftime('%H:%M:%S')} got list with {len(response_list)} items ")

#         # check all sent present in response
#         # seems to match, i.e. responses received correctly and in order
#         # could add this to fetch_openai and implement retry
#         sent_ids = [s['id'] for s in pages[i]]
#         received_ids = [r['id'] for r in response_list]
#         difference = set(sent_ids) - set(received_ids)

#         if difference:
#             log(f"missing items, {str(difference)}")

#         retlist.extend(response_list)

#     log(f"{datetime.now().strftime('%H:%M:%S')} Processed {len(retlist)} responses.")

#     return retlist


# Run the main function
print(datetime.now())
enriched_urls = asyncio.run(fetch_pages(PROMPT, pages))
print(datetime.now())


2024-07-03 20:12:53.972822


2024-07-03 20:13:07,347 - AInewsbot - INFO - 20:13:07 Processing responses... 
2024-07-03 20:13:07,348 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,349 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,350 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,351 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,352 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,353 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,354 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,355 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,355 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,356 - AInewsbot - INFO - 20:13:07 got list with 50 items 
2024-07-03 20:13:07,357 - AInewsbot - INFO - 20:13:07 got list with 48 items 
2024-07-03 20:13:07,357 - AInewsbot - INFO - 20:13:07 Processed

2024-07-03 20:13:07.358761
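
The comments in the cell above suggest adding a retry to `fetch_openai`. One possible shape is a small wrapper with exponential backoff. This is a sketch only: `with_retries`, `make_request`, `max_attempts` and `base_delay` are hypothetical names, and the wrapper retries on any exception rather than on specific API errors.

```python
import asyncio
import random


async def with_retries(make_request, max_attempts=3, base_delay=1.0):
    """Await `make_request()` until it succeeds or attempts run out.

    `make_request` is a zero-argument callable returning a fresh coroutine,
    since a coroutine object cannot be awaited twice.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_request()
        except Exception:
            if attempt == max_attempts:
                raise
            # exponential backoff with a little jitter: ~1s, ~2s, ~4s, ...
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

Usage would look like `await with_retries(lambda: fetch_openai(session, payload))`. Because `fetch_openai` returns the JSON body even for error responses, it would also need to raise on failure (for example with aiohttp's `response.raise_for_status()`) for the retry to trigger.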


In [28]:
enriched_df = pd.DataFrame(enriched_urls)
print(len(enriched_df))
enriched_df.head()


548


Unnamed: 0,id,isAI
0,4,False
1,46,False
2,100,False
3,120,False
4,147,False


In [29]:
log("isAI", len(enriched_df.loc[enriched_df["isAI"]]))
log("not isAI", len(enriched_df.loc[~enriched_df["isAI"]]))


2024-07-03 20:13:20,785 - AInewsbot - INFO - 186 - isAI
2024-07-03 20:13:20,789 - AInewsbot - INFO - 362 - not isAI


20

In [None]:
# merge returned df into original df
merged_df = pd.merge(filtered_df, enriched_df, on="id", how="outer")
merged_df['date'] = datetime.now().date()
merged_df.head()


In [None]:
# should be 0: every response row should match an existing source row
log(f"Unmatched response rows: {len(merged_df.loc[merged_df['src'].isna()])}")
# should be 0: every source row should come back with an isAI value
log(f"Unmatched source rows: {len(merged_df.loc[merged_df['isAI'].isna()])}")


In [None]:
# update SQLite database with all seen articles
conn = sqlite3.connect('articles.db')
cursor = conn.cursor()

for row in merged_df.itertuples():
    insert_article(conn, cursor, row.src, row.title,
                   row.url, row.isAI, row.date)


In [None]:
AIdf = merged_df.loc[merged_df["isAI"]==1].reset_index(drop=True)
log(f"Found {len(AIdf)} AI headlines")


In [None]:
# map title to ascii characters to avoid some dupes with e.g. different quote symbols

def unicode_to_ascii(input_string):
    # Normalize the Unicode string to NFKD form
    normalized_string = unicodedata.normalize('NFKD', input_string)
    
    # Encode to ASCII bytes, ignoring characters that cannot be converted
    ascii_bytes = normalized_string.encode('ascii', 'ignore')
    
    # Convert bytes back to a string
    ascii_string = ascii_bytes.decode('ascii')
    
    return ascii_string

AIdf['title'] = AIdf['title'].apply(unicode_to_ascii)
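
One caveat with the cell above: NFKD has no ASCII decomposition for curly quotes or dashes, so `encode('ascii', 'ignore')` drops them instead of converting them. A curly-apostrophe "AI’s" becomes "AIs", while a straight-apostrophe "AI's" is unchanged, so the two variants still differ. A possible refinement is to map common typographic punctuation to ASCII first. `PUNCT_MAP` and `to_ascii` are hypothetical names, and the mapping covers only a few characters.

```python
import unicodedata

# Typographic punctuation that NFKD does not decompose to ASCII
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def to_ascii(text):
    """Map common typographic punctuation to ASCII, then drop what remains."""
    text = text.translate(PUNCT_MAP)
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")
```

With this variant, both apostrophe styles normalize to the same title, so the later dedupe step can catch them.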


In [None]:
# dedupe identical headlines
AIdf['title_clean'] = AIdf['title'].map(lambda s: "".join(s.split()))
AIdf = AIdf.sort_values("src") \
    .groupby("title_clean") \
    .first() \
    .reset_index()
log(f"Found {len(AIdf)} unique AI headlines")


In [None]:
log(f"Fetching embeddings for {len(AIdf)} headlines")
embedding_model = 'text-embedding-3-large'
response = client.embeddings.create(input=AIdf['title'].tolist(),
                                    model=embedding_model)
embedding_df = pd.DataFrame([e.model_dump()['embedding'] for e in response.data])

sorted_indices = agglomerative_cluster_sort(embedding_df)
AIdf = AIdf.iloc[sorted_indices].reset_index(drop=True)
AIdf


In [None]:
AIdf=AIdf.reset_index(drop=True)
with pd.option_context('display.max_rows', None):
    display(AIdf[["title"]])

In [None]:
html_str = ""
for row in AIdf.itertuples():
    log(f"[{row.Index}. {row.title} - {row.src}]({row.url})")
    html_str += f'{row.Index}.<a href="{row.url}">{row.title} - {row.src}</a><br />\n'
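
The loop above interpolates titles and URLs into HTML as-is, so a headline containing `&`, `<` or a quote character could produce malformed markup in the email. A hedged sketch of an escaped variant using the standard library's `html.escape` follows; `link_line` is a hypothetical helper, not part of the notebook.

```python
import html


def link_line(index, title, src, url):
    """Return one numbered HTML link line with title, source and URL escaped."""
    text = html.escape(f"{title} - {src}")
    href = html.escape(url, quote=True)
    return f'{index}. <a href="{href}">{text}</a><br />\n'
```

The loop body would then become `html_str += link_line(row.Index, row.title, row.src, row.url)`.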


In [None]:
log("Sending mail")
from_addr = os.getenv("GMAIL_USER")
to_addr = os.getenv("GMAIL_USER")
subject = 'AI news ' + datetime.now().strftime('%H:%M:%S')
body = f"""
<html>
    <head></head>
    <body>
    <div>
    {html_str}
    </div>
    </body>
</html>
"""

# Setup the MIME
message = MIMEMultipart()
message['From'] = from_addr
message['To'] = to_addr
message['Subject'] = subject
message.attach(MIMEText(body, 'html'))

# Create SMTP session
with smtplib.SMTP('smtp.gmail.com', 587) as server:
    server.starttls()  # Secure the connection
    server.login(os.getenv("GMAIL_USER"), os.getenv("GMAIL_PASSWORD"))
    text = message.as_string()
    server.sendmail(from_addr, to_addr, text)

log("Finished")
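
The mail-sending steps above recur later in the notebook with a different subject and body, so they could be factored into two helpers. This is a sketch under that assumption: `build_message` and `send_gmail` are hypothetical names, and `send_gmail` needs network access and valid Gmail credentials.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib


def build_message(subject, html_body, from_addr, to_addr):
    """Wrap an HTML fragment in a MIME message with the usual headers."""
    message = MIMEMultipart()
    message["From"] = from_addr
    message["To"] = to_addr
    message["Subject"] = subject
    body = f"<html><head></head><body><div>{html_body}</div></body></html>"
    message.attach(MIMEText(body, "html"))
    return message


def send_gmail(message, user, password):
    """Send `message` through Gmail's STARTTLS endpoint (network required)."""
    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()  # upgrade the connection to TLS before logging in
        server.login(user, password)
        # send_message takes sender and recipients from the message headers
        server.send_message(message)
```

A cell would then reduce to `send_gmail(build_message(subject, html_str, from_addr, to_addr), os.getenv("GMAIL_USER"), os.getenv("GMAIL_PASSWORD"))`.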


In [None]:
# fetch pages
# Create a queue for multiprocessing and populate it 
print("Enqueuing URLs")

queue = multiprocessing.Queue()
for row in AIdf.itertuples():
    queue.put((row.id, row.url, row.title))
    

In [None]:
# import importlib
# import ainb_webscrape
# # Make changes to your_module.py here, if desired

# # Reload the module
# importlib.reload(ainb_webscrape)


In [None]:
# import sys
# del sys.modules['ainb_webscrape']


In [None]:
# # ddriver = get_driver()   
# get_url(AIdf.iloc[1].url, AIdf.iloc[1].title, ddriver)


In [None]:
# process queue asynchronously
num_browsers = 4
# # Function to take the queue and pop entries off and process until none are left
# # lets you create an array of functions with different args
# def process_queue_factory(q):
#     def process_queue():
#         # launch browser via selenium driver
#         driver = get_driver()    
#         saved_pages = []
#         while not q.empty():
#             i, url, title = q.get()
#             log(f'Processing {url}')
#             savefile = get_url(url, title, driver)
#             saved_pages.append((i, url, title, savefile))
#         # Close the browser
#         log("Quit webdriver")
#         driver.quit()
#         return saved_pages
#     return process_queue

process_queue = process_url_queue_factory(queue)  # renamed to avoid shadowing the builtin callable()
# saved_pages = process_queue()

print(f"fetching {len(AIdf)} pages using {num_browsers} browsers")
results = launch_drivers(num_browsers, process_queue)
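
The commented-out `process_queue_factory` above loops on `while not q.empty()` followed by `q.get()`. With several workers, another worker can take the last item between the check and the get, leaving the caller blocked. The sketch below shows a race-free drain pattern. It uses `queue.Queue` and threads rather than `multiprocessing.Queue`, `drain` and `run_workers` are hypothetical names, and it makes no claim about how `launch_drivers` or `process_url_queue_factory` are implemented in `ainb_webscrape`.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def drain(q, handle):
    """Pop items until the queue is empty, returning handle(item) for each.

    get_nowait() raises queue.Empty instead of blocking, so a worker that
    loses the race for the last item simply returns.
    """
    results = []
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return results
        results.append(handle(item))


def run_workers(q, handle, num_workers):
    """Run `num_workers` drain loops in threads; return one result list each."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(drain, q, handle) for _ in range(num_workers)]
        return [f.result() for f in futures]
```

The return value is a list of per-worker lists, which matches the shape that the flattening cell expects from `results`.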


In [None]:
# flatten results
saved_pages = []
for r in results:
    saved_pages.extend(r)
saved_pages
len(saved_pages)

In [None]:
pages_df = pd.DataFrame(saved_pages)
pages_df.columns = ['id', 'url', 'title', 'path']
pages_df

In [None]:
AIdf = pd.merge(AIdf, pages_df[["id", "path"]], on='id', how="inner")


In [None]:
print(SUMMARIZE_SYSTEM_PROMPT)


In [None]:
print(SUMMARIZE_USER_PROMPT)


In [None]:
async def fetch_openai2(session, payload, i):
    """
    Asynchronously posts a payload to the OpenAI URL using an aiohttp ClientSession.

    Parameters:
    - session (aiohttp.ClientSession): The aiohttp ClientSession object used for making HTTP requests.
    - payload (dict): The payload to be sent in the request body as JSON.
    - i (int): an id returned with the response, so each summary can be mapped back to its original request

    Returns:
    - tuple: (i, response), where response is the full JSON response from the OpenAI API as a dict.

    Raises:
    - aiohttp.ClientError: If there is an error during the HTTP request.

    Example usage:
    ```
    async with aiohttp.ClientSession() as session:
        i, response = await fetch_openai2(session, payload, i)
        print(response)
    ```
    """
    async with session.post(API_URL, headers=headers, json=payload) as response:
        retval = await response.json()
        return (i, retval)
    

In [None]:
async def fetch_all2(page_df):
    tasks = []
    responses = []
    async with aiohttp.ClientSession() as session:

        for row in page_df.itertuples():

            # Read the HTML file
            try:
                with open(row.path, 'r', encoding='utf-8') as file:
                    html_content = file.read()
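            except (OSError, TypeError, UnicodeDecodeError) as exc:
                # missing path, unreadable file, or content that is not valid UTF-8
                print(f"Skipping {row.id} : {row.path} ({exc})")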
                continue

            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')

            # Filter out script and style elements
            for script_or_style in soup(['script', 'style']):
                script_or_style.extract()

            # Get text and strip leading/trailing whitespace
            visible_text = soup.get_text(separator=' ', strip=True)
            # NOTE: this slice truncates by characters, not tokens
            visible_text = visible_text[:MAX_INPUT_TOKENS]

            userprompt = f"""{SUMMARIZE_USER_PROMPT}:
{visible_text}
            """

            payload = {"model":  LOWCOST_MODEL,
                        "messages": [{"role": "system",
                                      "content": SUMMARIZE_SYSTEM_PROMPT
                                     },
                                     {"role": "user",
                                      "content": userprompt
                                     }]
                       }

            task = asyncio.create_task(fetch_openai2(session, payload, row.id))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)
        return responses


In [None]:
print(datetime.now())
responses = await fetch_all2(AIdf)
print(datetime.now())
print(len(responses))
responses[0]


In [None]:
# bring summaries into dict
response_dict = {}
for i, response in responses:
    try:
        response_str = response["choices"][0]["message"]["content"]
        response_dict[i] = response_str
    except Exception as exc:
        print(exc)
        
len(response_dict)

In [None]:
markdown_str = ''

for i, row in enumerate(AIdf.itertuples()):
    # pages that were skipped or whose request failed have no summary; use an empty string
    mdstr = f"[{i+1}. {row.title}]({row.url})  \n\n{response_dict.get(row.id, '')} \n\n"
    display(Markdown(mdstr))
    markdown_str += mdstr
    

In [None]:
html_str[:1000]

In [None]:
# Convert Markdown to HTML
html_str = markdown.markdown(markdown_str, extensions=['extra'])
display(HTML(html_str))


In [None]:
log("Sending mail")
from_addr = os.getenv("GMAIL_USER")
to_addr = os.getenv("GMAIL_USER")
subject = 'AI news summaries ' + datetime.now().strftime('%H:%M:%S')
body = f"""
<html>
    <head></head>
    <body>
    <div>
    {html_str}
    </div>
    </body>
</html>
"""

# Setup the MIME
message = MIMEMultipart()
message['From'] = from_addr
message['To'] = to_addr
message['Subject'] = subject
message.attach(MIMEText(body, 'html'))

# Create SMTP session
with smtplib.SMTP('smtp.gmail.com', 587) as server:
    server.starttls()  # Secure the connection
    server.login(os.getenv("GMAIL_USER"), os.getenv("GMAIL_PASSWORD"))
    text = message.as_string()
    server.sendmail(from_addr, to_addr, text)

log("Finished")


In [None]:
print(FINAL_SUMMARY_PROMPT) 


In [None]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": FINAL_SUMMARY_PROMPT + markdown_str
              }],
    n=3,   
    temperature=0.5
)


In [None]:
SYSTEM_PROMPT


In [None]:
response_str = response.choices[0].message.content
response_str = response_str.replace("$", "\\$")
display(Markdown(response_str))


In [None]:
response_str = response.choices[1].message.content
response_str = response_str.replace("$", "\\$")
display(Markdown(response_str))


In [None]:
response_str = response.choices[2].message.content
response_str = response_str.replace("$", "\\$")
display(Markdown(response_str))


In [None]:
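# the request above sets n=3, so only choices[0..2] exist; guard the index
if len(response.choices) > 4:
    response_str = response.choices[4].message.content
    response_str = response_str.replace("$", "\\$")
    display(Markdown(response_str))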


In [None]:
import markdown

In [None]:
markdown_str[:100]

In [None]:
markdown_str = response.choices[0].message.content
# Convert Markdown to HTML
html_str = markdown.markdown(markdown_str)
display(HTML(html_str))


In [None]:
log("Sending mail")
from_addr = os.getenv("GMAIL_USER")
to_addr = os.getenv("GMAIL_USER")
subject = 'AI news summaries ' + datetime.now().strftime('%H:%M:%S')
body = f"""
<html>
    <head></head>
    <body>
    <div>
    {html_str}
    </div>
    </body>
</html>
"""

# Setup the MIME
message = MIMEMultipart()
message['From'] = from_addr
message['To'] = to_addr
message['Subject'] = subject
message.attach(MIMEText(body, 'html'))

# Create SMTP session
with smtplib.SMTP('smtp.gmail.com', 587) as server:
    server.starttls()  # Secure the connection
    server.login(os.getenv("GMAIL_USER"), os.getenv("GMAIL_PASSWORD"))
    text = message.as_string()
    server.sendmail(from_addr, to_addr, text)

log("Finished")


In [None]:
clusters


In [None]:
AIdf['clusters'] = clusters
AIdf.loc[AIdf['clusters']==4][['title']]


In [None]:
AIdf['cluster']=clusters