# AInewsbot.ipynb - Automate collecting daily AI news

1. initial scrape of front pages of tech sites
  - Open URLs of news sites specified in `sources` dict (sources.yaml) using Selenium and Firefox
  - Save HTML of each URL in htmldata directory  
  - Extract URLs from all files, create a pandas dataframe with url, title, src

2. Filter and clean to AI-related headlines not seen before
  - Use ChatGPT prompt to filter only AI-related headlines by sending a prompt and formatted table of headlines
  - Use SQLite to filter headlines previously seen 
  - remove duplicate URLs and headlines
  - ensure there are pretty source names for each news site

3. Topic analysis, make a list of topics for each headline
  - using a prompt, check each headline against a number of evergreen AI topics, e.g. deepfakes, regulation, AI in education
  - extract free-form topics from each headline
  - combine topics into topic list for each headline
  - cluster headlines using dimensionality-reduced embeddings and DBSCAN; ask chatgpt to name each cluster
  - sort headlines by doing a traveling salesman shortest traversal in embedding space

4. Summarize individual news story pages in 3 bullets using a prompt

5. create a large markdown file with all bullet points and topics

6. give the markdown file to ChatGPT and ask it to make a list of the most popular and important topics of the day

7. human should make a list of the day's topics, combining the chatgpt response to the question with the cluster topics

8. Put summaries in vector store along with metadata. For each topic, retrieve all associated stories and have chatgpt write a digest of those stories in the given format.

9. assemble stories into first draft of newsletter for rewriting as necessary

todo:

use langgraph for final editing workflow
1. prompt to edit final copy for dupes, combine similar sections, copy edit
2. have a reviewer prompt check if there are any bullet points to move to a different section 
3. if so, have an editor prompt remove them, then return to 2 until there is nothing left to move
4. have a reviewer check each section, identify bullet points that are similar to other bullet points in the section and have identical links. rewrite combining so there is no duplication. 
5. identify sections that are short or similar to other sections and suggest sections that should be combined
6. have an editor prompt merge short sections, return to 4, until no orphan sections left
7. maybe final copy-edit prompt

Original, alternative manual workflow to get HTML files if necessary
- Use Chrome, open e.g. Tech News bookmark folder, right-click and open all bookmarks in new window
- on Google News, make sure to switch to the AI tab
- on Google News, Feedly, Reddit, scroll to additional pages as desired
- Use SingleFile extension, 'save all tabs'
- Move files to htmldata directory
- Run lower part of notebook to process the data


In [1]:
# import sys
# del sys.modules['ainb_utilities']


In [2]:
from datetime import datetime
import os
import yaml
import dotenv
import sqlite3
import unicodedata
import json
import pickle
from collections import Counter
import shutil

import numpy as np
import pandas as pd
import umap
# import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN

# import bs4
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import trafilatura

import multiprocessing
from concurrent.futures import ThreadPoolExecutor, as_completed
import asyncio
import aiohttp

from IPython.display import HTML, Image, Markdown, display
import markdown

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

from langchain.vectorstores import Chroma
# from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings

from openai import OpenAI

from ainb_const import (DOWNLOAD_DIR, LOWCOST_MODEL, MODEL, CANONICAL_TOPICS,
                        SOURCECONFIG, FILTER_PROMPT, TOPIC_PROMPT,
                        SUMMARIZE_SYSTEM_PROMPT, SUMMARIZE_USER_PROMPT, FINAL_SUMMARY_PROMPT, TOP_CATEGORIES_PROMPT,
                        MAX_INPUT_TOKENS, MAX_OUTPUT_TOKENS, MAX_RETRIES, TEMPERATURE)
from ainb_utilities import (log, delete_files, filter_unseen_urls_db, insert_article, 
                            nearest_neighbor_sort, agglomerative_cluster_sort, traveling_salesman_sort_scipy,
                            unicode_to_ascii, send_gmail)
from ainb_webscrape import (get_driver, quit_drivers, launch_drivers, get_file, get_url, parse_file, 
                            get_og_tags, get_path_from_url, trimmed_href, process_source_queue_factory, 
                            process_url_queue_factory, get_google_news_redirects)
from ainb_llm import (paginate_df, process_pages, fetch_pages, fetch_openai, fetch_all_summaries, 
                      fetch_openai_summary, trunc_tokens, categorize_headline)


# need this to run async in jupyter since it already has an asyncio event loop running
import nest_asyncio
nest_asyncio.apply()


# Initialize

In [3]:
before_date = '2024-07-21 09:00:00'

In [4]:
# OpenAI API module
client = OpenAI()

# Or can use REST API directly
API_URL = 'https://api.openai.com/v1/chat/completions'

headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {os.getenv("OPENAI_API_KEY")}',
}


In [5]:
#  load sources to scrape from sources.yaml
with open(SOURCECONFIG, "r") as stream:
    try:
        sources = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

log(f"Load {len(sources)} sources from {SOURCECONFIG}")

# make a reverse dict to map output file titles to source names
sources_reverse = {}
for k, v in sources.items():
    log(f"{k} -> {v['url']} -> {v['title']}.html")
    v['sourcename'] = k
    # map filename (title) to source name
    sources_reverse[v['title']] = k

log(f"Mapped {len(sources_reverse)} source page titles to sources")


2024-07-21 15:07:13,850 - AInewsbot - INFO - Load 17 sources from sources.yaml
2024-07-21 15:07:13,851 - AInewsbot - INFO - Ars Technica -> https://arstechnica.com/ -> Ars Technica.html
2024-07-21 15:07:13,851 - AInewsbot - INFO - Bloomberg Tech -> https://www.bloomberg.com/technology -> Bloomberg Technology - Bloomberg.html
2024-07-21 15:07:13,851 - AInewsbot - INFO - Business Insider -> https://www.businessinsider.com/tech -> Tech - Business Insider.html
2024-07-21 15:07:13,852 - AInewsbot - INFO - FT Tech -> https://www.ft.com/technology -> Technology.html
2024-07-21 15:07:13,852 - AInewsbot - INFO - Feedly AI -> https://feedly.com/i/aiFeeds?options=eyJsYXllcnMiOlt7InBhcnRzIjpbeyJpZCI6Im5scC9mL3RvcGljLzMwMDAifV0sInNlYXJjaEhpbnQiOiJ0ZWNobm9sb2d5IiwidHlwZSI6Im1hdGNoZXMiLCJzYWxpZW5jZSI6ImFib3V0In1dLCJidW5kbGVzIjpbeyJ0eXBlIjoic3RyZWFtIiwiaWQiOiJ1c2VyLzYyZWViYjlmLTcxNTEtNGY5YS1hOGM3LTlhNTdiODIwNTMwOC9jYXRlZ29yeS9HYWRnZXRzIn1dfQ -> Discover and Add New Feedly AI Feeds.html
2024-07-21 15:0

In [6]:
sources

{'Ars Technica': {'include': ['^https://arstechnica.com/(\\w+)/(\\d+)/(\\d+)/'],
  'title': 'Ars Technica',
  'url': 'https://arstechnica.com/',
  'sourcename': 'Ars Technica'},
 'Bloomberg Tech': {'include': ['^https://www.bloomberg.com/news/'],
  'title': 'Bloomberg Technology - Bloomberg',
  'url': 'https://www.bloomberg.com/technology',
  'sourcename': 'Bloomberg Tech'},
 'Business Insider': {'exclude': ['^https://www.insider.com',
   '^https://www.passionfroot.me'],
  'title': 'Tech - Business Insider',
  'url': 'https://www.businessinsider.com/tech',
  'sourcename': 'Business Insider'},
 'FT Tech': {'include': ['https://www.ft.com/content/'],
  'title': 'Technology',
  'url': 'https://www.ft.com/technology',
  'sourcename': 'FT Tech'},
 'Feedly AI': {'exclude': ['^https://feedly.com',
   '^https://s1.feedly.com',
   '^https://blog.feedly.com'],
  'scroll': 5,
  'initial_sleep': 30,
  'title': 'Discover and Add New Feedly AI Feeds',
  'url': 'https://feedly.com/i/aiFeeds?options=e

In [7]:
sources_reverse


{'Ars Technica': 'Ars Technica',
 'Bloomberg Technology - Bloomberg': 'Bloomberg Tech',
 'Tech - Business Insider': 'Business Insider',
 'Technology': 'FT Tech',
 'Discover and Add New Feedly AI Feeds': 'Feedly AI',
 'Google News - Technology - Artificial intelligence': 'Google News',
 'Hacker News Page 1': 'Hacker News',
 'Hacker News Page 2': 'Hacker News 2',
 'HackerNoon - read, write and learn about any technology': 'HackerNoon',
 'Technology - The New York Times': 'NYT Tech',
 'top scoring links _ multi': 'Reddit',
 'Techmeme': 'Techmeme',
 'The Register_ Enterprise Technology News and Analysis': 'The Register',
 'Artificial Intelligence - The Verge': 'The Verge',
 'AI News _ VentureBeat': 'VentureBeat',
 'Technology - WSJ.com': 'WSJ Tech',
 'Technology - The Washington Post': 'WaPo Tech'}
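
The `include` and `exclude` entries in `sources` are lists of regular expressions. A minimal sketch of how such patterns could be used to filter candidate links follows. The helper name `keep_url` and the precedence rules (an exclude match always rejects, and an empty include list accepts everything else) are assumptions for illustration, not necessarily what `parse_file` does.

```python
import re

def keep_url(url, include=None, exclude=None):
    """Hypothetical helper: True if url passes the assumed include/exclude rules."""
    # Any matching exclude pattern rejects the URL outright.
    if any(re.search(pattern, url) for pattern in (exclude or [])):
        return False
    # With no include patterns, everything that was not excluded is kept.
    if not include:
        return True
    return any(re.search(pattern, url) for pattern in include)

# Patterns copied from the sources dict shown above.
ars = {"include": [r"^https://arstechnica.com/(\w+)/(\d+)/(\d+)/"]}
feedly = {"exclude": [r"^https://feedly.com", r"^https://s1.feedly.com"]}

print(keep_url("https://arstechnica.com/ai/2024/07/some-story/", **ars))
print(keep_url("https://arstechnica.com/about-us/", **ars))
print(keep_url("https://feedly.com/i/welcome", **feedly))
print(keep_url("https://example.com/story", **feedly))
```

Using `re.search` keeps unanchored patterns such as the FT one usable, while patterns that start with `^` still match only at the beginning of the URL.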

In [8]:
# determine files already in htmldata directory
# List all paths in the directory matching today's date
nfiles = 50
files = [os.path.join(DOWNLOAD_DIR, file)
         for file in os.listdir(DOWNLOAD_DIR)]
# Get the current date
today = datetime.now()
year, month, day = today.year, today.month, today.day
datestr = datetime.now().strftime("%m_%d_%Y")

# filter files only
files = [file for file in files if os.path.isfile(file)]

# Sort files by modification time and take the newest nfiles
files.sort(key=lambda x: os.path.getmtime(x), reverse=True)
files = files[:nfiles]

# keep only .html files with today's date in the name
files = [
    file for file in files if datestr in file and file.endswith(".html")]
log(len(files))
for file in files:
    log(file)

saved_pages = []
for file in files:
    filename = os.path.basename(file)
    # locate date like '01_14_2024' in filename
    position = filename.find(" (" + datestr)
    basename = filename[:position]
    # match to source name
    sourcename = sources_reverse.get(basename)
    if sourcename is None:
        log(f"Skipping {basename}, no sourcename metadata")
        continue
    sources[sourcename]['latest'] = file
    saved_pages.append((sourcename, file))
    

2024-07-21 15:07:20,372 - AInewsbot - INFO - 0


In [9]:
log(f"{len(files)} files found")


2024-07-21 15:07:54,204 - AInewsbot - INFO - 0 files found
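
The cell above recovers the page title from a saved file name by locating the date suffix. A small sketch of the same idea with a regular expression follows. The pattern is inferred from file names in the log output such as `Technology - WSJ.com (07_21_2024 03_19_09 PM).html`, and `source_for_file` is a hypothetical helper name.

```python
import re

# Assumed naming scheme: "<page title> (MM_DD_YYYY <time>).html"
SAVED_NAME = re.compile(r"^(?P<title>.+) \((?P<date>\d{2}_\d{2}_\d{4}) [^)]*\)\.html$")

def source_for_file(filename, title_to_source):
    """Hypothetical helper: map a saved file name to its source name, or None."""
    match = SAVED_NAME.match(filename)
    if match is None:
        return None  # not a saved source page
    return title_to_source.get(match.group("title"))

# Two entries taken from sources_reverse above.
title_to_source = {
    "Technology - WSJ.com": "WSJ Tech",
    "Ars Technica": "Ars Technica",
}

print(source_for_file("Technology - WSJ.com (07_21_2024 03_19_09 PM).html", title_to_source))
print(source_for_file("notes.txt", title_to_source))
```

A file name that does not follow the pattern yields `None` instead of a truncated title, so unrelated files in the download directory are skipped cleanly.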


# Fetch and save source pages

In [12]:
from multiprocessing import Pool, cpu_count
cpu_count()

In [None]:
dir(Pool)

In [14]:
# Fetch HTML files from sources

# empty download directory
delete_files(DOWNLOAD_DIR)

# save each file specified from sources
num_browsers = 4
log(f"Saving HTML files using {num_browsers} browsers")

# initialize drivers
with ThreadPoolExecutor(max_workers=num_browsers) as executor:
    # Initialize drivers in parallel; call get_driver() with no arguments
    # rather than passing the range index through as its first argument
    drivers = list(executor.map(lambda _: get_driver(), range(num_browsers)))


2024-07-21 15:11:32,692 - AInewsbot - INFO - Saving HTML files using 4 browsers
2024-07-21 15:11:32,699 - AInewsbot - INFO - get_driver - 67021 Initializing webdriver
2024-07-21 15:11:32,701 - AInewsbot - INFO - get_driver - 67021 Initializing webdriver
2024-07-21 15:11:32,702 - AInewsbot - INFO - get_driver - 67021 Initializing webdriver
2024-07-21 15:11:32,702 - AInewsbot - INFO - get_driver - 67021 Initializing webdriver
2024-07-21 15:11:53,275 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-21 15:11:53,276 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-21 15:11:53,276 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-21 15:11:53,276 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-21 15:11:53,277 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-21 15:11:53,277 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-21 15:11:53,278 - AInewsbot - INFO -

In [15]:
d = get_driver()



2024-07-21 15:12:21,468 - AInewsbot - INFO - get_driver - 67021 Initializing webdriver
2024-07-21 15:12:34,118 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-21 15:12:34,119 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-21 15:13:14,893 - AInewsbot - INFO - get_driver - Initialized webdriver


In [16]:
d.quit()

In [None]:
for driver in drivers:
    driver.quit()


In [17]:
# Fetch HTML files from sources

# empty download directory
delete_files(DOWNLOAD_DIR)

# save each file specified from sources
num_browsers = 3
log(f"Saving HTML files using {num_browsers} browsers")

# Create a queue for multiprocessing and populate it 
queue = multiprocessing.Queue()
for item in sources.values():
    queue.put(item)
    
# Build a worker function that pops entries off the queue and processes them until none are left;
# the factory lets you create an array of functions with different args
source_worker = process_source_queue_factory(queue)

saved_pages = launch_drivers(num_browsers, source_worker)


2024-07-21 15:16:05,824 - AInewsbot - INFO - Saving HTML files using 3 browsers
2024-07-21 15:16:05,848 - AInewsbot - INFO - get_driver - 67021 Initializing webdriver
2024-07-21 15:16:05,850 - AInewsbot - INFO - get_driver - 67021 Initializing webdriver
2024-07-21 15:16:05,851 - AInewsbot - INFO - get_driver - 67021 Initializing webdriver
2024-07-21 15:16:22,670 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-21 15:16:22,671 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-21 15:16:22,671 - AInewsbot - INFO - get_driver - Initialized webdriver profile
2024-07-21 15:16:22,672 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-21 15:16:22,672 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-21 15:16:22,672 - AInewsbot - INFO - get_driver - Initialized webdriver service
2024-07-21 15:17:29,609 - AInewsbot - INFO - get_driver - Initialized webdriver
2024-07-21 15:17:29,703 - AInewsbot - INFO - Proces

2024-07-21 15:18:15,529 - AInewsbot - INFO - get_files(Google News - Technology - Artificial intelligence) - Saving Google News - Technology - Artificial intelligence (07_21_2024 03_18_15 PM).html as utf-8
2024-07-21 15:18:15,534 - AInewsbot - INFO - Processing NYT Tech
2024-07-21 15:18:15,535 - AInewsbot - INFO - get_files(Technology - The New York Times) - starting get_files https://www.nytimes.com/section/technology
2024-07-21 15:18:24,039 - AInewsbot - INFO - get_files(Discover and Add New Feedly AI Feeds) - Loading additional infinite scroll items
2024-07-21 15:18:24,751 - AInewsbot - INFO - Message: Unable to locate element: //meta[@http-equiv='Content-Type']; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:192:5
NoSuchElementError@chrom

2024-07-21 15:19:09,771 - AInewsbot - INFO - get_files(Technology - WSJ.com) - Saving Technology - WSJ.com (07_21_2024 03_19_09 PM).html as utf-8
2024-07-21 15:19:09,772 - AInewsbot - INFO - Quit webdriver
2024-07-21 15:19:15,171 - AInewsbot - INFO - Message: Unable to locate element: //meta[@http-equiv='Content-Type']; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:192:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:510:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:136:16

2024-07-21 15:19:15,172 - AInewsbot - INFO - get_files(Technology - The Washington Post) - Saving Technology - The Washington Post (07_21_2024 03_19_15 PM).html as utf-8
2024-07-21 15:19:15,175 - AInewsbot - INFO - Quit webdriver


In [None]:
log(f"Saved {len(saved_pages)} pages")

print(len(saved_pages))
for sourcename, page in saved_pages:
    sources[sourcename]['latest'] = page
    log(f"{sourcename} -> {page}")
    

# Extract news URLs from saved pages

In [None]:
# Parse news URLs and titles from downloaded HTML files
log("Parsing html files")
all_urls = []
for sourcename, filename in saved_pages:
    log(sourcename +' -> ' + filename)
    log(f"{sourcename}", "parse loop")
    links = parse_file(sources[sourcename])
    log(f"{len(links)} links found", "parse loop")
    all_urls.extend(links)

log(f"found {len(all_urls)} links", "parse loop")

# make a pandas dataframe of all the links found
orig_df = (
    pd.DataFrame(all_urls)
    .groupby("url")
    .first()
    .reset_index()
    .sort_values("src")[["src", "title", "url"]]
    .reset_index(drop=True)
    .reset_index(drop=False)
    .rename(columns={"index": "id"})
)
orig_df.head()


In [None]:
# # extracts all links from history where isAI=1
# # useful for training dimensionality reduction
# conn = sqlite3.connect('articles.db')
# c = conn.cursor()
# #  and timestamp > '2024-07-01' 
# query = "select * from news_articles where isAI=1 order by id"
# ai_history_df = pd.read_sql_query(query, conn)
# ai_history_df

# Filter URLs to new AI headlines only

In [None]:
# filter urls we've already seen in previous runs and saved in SQLite
filtered_df = filter_unseen_urls_db(orig_df, before_date=before_date)
len(filtered_df)
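
The filtering step above can be pictured with a small in-memory SQLite example. This is a sketch under assumptions: the table name `news_articles` appears in the commented-out queries in this notebook, but the two-column schema, the helper name `unseen_urls`, and the reading of `before_date` (only rows recorded before that time count as already seen) are guesses, not the actual `filter_unseen_urls_db` implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table has more columns.
conn.execute("CREATE TABLE news_articles (url TEXT PRIMARY KEY, timestamp TEXT)")
conn.executemany(
    "INSERT INTO news_articles VALUES (?, ?)",
    [
        ("https://example.com/old", "2024-07-20 08:00:00"),
        ("https://example.com/recent", "2024-07-21 12:00:00"),
    ],
)

def unseen_urls(conn, urls, before_date=None):
    """Hypothetical helper: keep URLs not recorded (before before_date, if given)."""
    query = "SELECT url FROM news_articles"
    params = ()
    if before_date is not None:
        # ISO-formatted timestamps compare correctly as strings
        query += " WHERE timestamp < ?"
        params = (before_date,)
    seen = {row[0] for row in conn.execute(query, params)}
    return [url for url in urls if url not in seen]

candidates = [
    "https://example.com/old",
    "https://example.com/recent",
    "https://example.com/new",
]
print(unseen_urls(conn, candidates))
print(unseen_urls(conn, candidates, before_date="2024-07-21 09:00:00"))
```

Under this reading, a cutoff lets the notebook be re-run later the same day without discarding headlines that were first stored by the earlier run.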


In [None]:
# use chatgpt to filter AI-related headlines using a prompt to OpenAI
print(FILTER_PROMPT)


In [None]:
# make pages that fit in a reasonably sized (MAXPAGELEN or MAX_INPUT_TOKENS) prompt
pages = paginate_df(filtered_df)
log(f"Paginated {len(pages)} pages")
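
Pagination here means splitting the headline rows into batches small enough for one prompt each. A rough sketch of the idea follows, using character counts as a stand-in for tokens. The function name `paginate_rows` and the `max_chars` budget are hypothetical; `paginate_df` may count tokens or rows differently.

```python
def paginate_rows(rows, max_chars=120):
    """Hypothetical helper: group rows into pages whose titles fit a size budget."""
    pages, current, size = [], [], 0
    for row in rows:
        row_len = len(row["title"])
        # start a new page when adding this row would exceed the budget
        if current and size + row_len > max_chars:
            pages.append(current)
            current, size = [], 0
        current.append(row)
        size += row_len
    if current:
        pages.append(current)
    return pages

rows = [{"id": i, "title": "x" * 50} for i in range(5)]
pages = paginate_rows(rows, max_chars=120)
print([len(page) for page in pages])
```

A row that is larger than the budget on its own still gets a page to itself, so no headline is dropped.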


In [None]:
# use REST API directly. OpenAI python API doesn't support concurrent requests from a single client
# this runs fast with async aiohttp and on gpt-3.5 (15 seconds vs 2 minutes synchronously with gpt-4o)
# the old API supported submitting multiple payloads in a single completion request
# current API supports a slow 'batch' submission https://platform.openai.com/docs/guides/rate-limits/usage-tiers
# there is a more complex example here - https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py

log("start classify")
enriched_urls = asyncio.run(fetch_pages(pages, prompt=FILTER_PROMPT))
log("end classify")

enriched_df = pd.DataFrame(enriched_urls)
print(len(enriched_df))
log(f"isAI: {len(enriched_df.loc[enriched_df['isAI']])}")
log(f"not isAI: {len(enriched_df.loc[~enriched_df['isAI']])}")
enriched_df.head()


In [None]:
# merge returned df with isAI column into original df on id column
merged_df = pd.merge(filtered_df, enriched_df, on="id", how="outer")
merged_df['date'] = datetime.now().date()
merged_df.head()


In [None]:
# should be empty, shouldn't get back rows that don't match to existing
log(f"Unmatched response rows: {len(merged_df.loc[merged_df['src'].isna()])}")
# should be empty, should get back all rows from orig
log(f"Unmatched source rows: {len(merged_df.loc[merged_df['isAI'].isna()])}")


In [None]:
# keep headlines that are related to AI
AIdf = merged_df.loc[merged_df["isAI"]==1] \
    .reset_index(drop=True)  \
    .reset_index()  \
    .drop(columns=["id"])  \
    .rename(columns={'index': 'id'})

log(f"Found {len(AIdf)} AI headlines")

AIdf

In [None]:
# map title to ascii characters to avoid some dupes with e.g. different quote symbols

AIdf['title'] = AIdf['title'].apply(unicode_to_ascii)


In [None]:
# dedupe identical headlines
AIdf['title_clean'] = AIdf['title'].map(lambda s: "".join(s.split()))
AIdf = AIdf.sort_values("src") \
    .groupby("title_clean") \
    .first() \
    .reset_index(drop=True) \
    .drop(columns=['id']) \
    .reset_index() \
    .rename(columns={'index': 'id'})

log(f"Found {len(AIdf)} unique AI headlines")
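
The two cells above treat headlines as duplicates when they differ only in punctuation style or whitespace. A standard-library sketch of that normalization follows. The function names and the punctuation table are illustrative assumptions; the real `unicode_to_ascii` in `ainb_utilities` may handle more cases.

```python
import unicodedata

# Curly quotes and dashes have no ASCII decomposition, so map them explicitly.
_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

def to_ascii(text):
    """Hypothetical stand-in for unicode_to_ascii."""
    text = text.translate(_PUNCT)
    decomposed = unicodedata.normalize("NFKD", text)  # splits accents from letters
    return decomposed.encode("ascii", "ignore").decode("ascii")

def dedupe_key(title):
    # drop all whitespace, as the title_clean column does
    return "".join(to_ascii(title).split())

titles = ["OpenAI\u2019s new model", "OpenAI's  new model", "Caf\u00e9 robots"]
seen, unique = set(), []
for title in titles:
    key = dedupe_key(title)
    if key not in seen:
        seen.add(key)
        unique.append(title)

print(unique)
```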


In [None]:
# map google news headlines to redirect

AIdf = get_google_news_redirects(AIdf)


In [None]:
# must do this after fixing google actualurl
AIdf['hostname']=AIdf['actual_url'].apply(lambda url: urlparse(url).netloc)
AIdf.head()

### Get site names and update site names based on URL

In [None]:
# get site_name
conn = sqlite3.connect('articles.db')
c = conn.cursor()
#  and timestamp > '2024-07-01' 
query = "select * from sites"
sites_df = pd.read_sql_query(query, conn)
sites_dict = {row.hostname:row.site_name for row in sites_df.itertuples()}

sites_df

In [None]:
AIdf['site_name'] = AIdf['hostname'].apply(lambda hostname: sites_dict.get(hostname, ""))
AIdf.loc[AIdf['site_name']==""]

In [None]:
async def get_site_name(session, row):
    cat_prompt = f"""
based on this url and your knowledge of the Web, what is the name of the site? https://{row.hostname}

return the response as a json object of the form {{"url": "www.yankodesign.com", "site_name": "Yanko Design"}}

    """
    try:
        messages=[
                  {"role": "user", "content": cat_prompt
                  }]

        payload = {"model":  LOWCOST_MODEL,
                   "response_format": {"type": "json_object"},
                   "messages": messages,
                   "temperature": 0
                   }
        response = await fetch_openai(session, payload)
        response_dict = json.loads(response["choices"][0]["message"]["content"])
        return response_dict
    except Exception as exc:
        print(exc)
                
tasks = []
async with aiohttp.ClientSession() as session:
    for row in AIdf.loc[AIdf['site_name']==""].itertuples():
        task = asyncio.create_task(get_site_name(session, row))
        tasks.append(task)
    responses = await asyncio.gather(*tasks)

responses


In [None]:
# update site_dict from responses
new_urls = []
for r in responses:
    if r['url'].startswith('https://'):
        r['url'] = r['url'][8:]
    new_urls.append(r['url'])
    sites_dict[r['url']] = r['site_name']
    print(r['url'], r['site_name'])

AIdf['site_name'] = AIdf['hostname'].apply(lambda hostname: sites_dict.get(hostname, hostname))



In [None]:
for url in new_urls:
    sqlstr = "INSERT OR IGNORE INTO sites (hostname, site_name) VALUES (?, ?);"
    print(url, '->', sites_dict[url])
    conn.execute(sqlstr, (url, sites_dict[url]))
    conn.commit()


In [None]:
# update SQLite database with all seen URLs
conn = sqlite3.connect('articles.db')
cursor = conn.cursor()

for row in AIdf.itertuples():
    insert_article(conn, cursor, row.src, row.hostname, row.title,
                   row.url, row.actual_url, row.isAI, row.date)


# Topic analysis
Try to identify the top topics of the day, to help make a nice summary. 

1st approach - do dimensionality reduction on the headline embeddings with UMAP and cluster with DBSCAN.

2nd approach
 - extract topics from headline using a prompt
 - human canonicalizes topics
 - assign headlines to topics using a prompt
 
The final summary is pretty inconsistent; it would be nice to give chatgpt a prompt that says: summarize these bullet points using this categorization.
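
To make the first approach concrete, here is a small pure-Python sketch of how DBSCAN assigns cluster labels and marks noise. It is an illustration on toy 2-D points, not the scikit-learn implementation the notebook calls, and the function name `dbscan_labels` is hypothetical. As in scikit-learn, a point counts toward its own neighborhood.

```python
from math import dist

NOISE = -1

def dbscan_labels(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE if it joins no cluster."""
    labels = [None] * len(points)

    def neighbors(i):
        # every point within eps of point i, including i itself
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster
        frontier = [j for j in nbrs if j != i]
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                frontier.extend(j_nbrs)  # j is a core point, keep expanding
        cluster += 1
    return labels

points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
labels = dbscan_labels(points, eps=0.5, min_samples=3)
# move noise to the end of the sort order, as the notebook does with 999
labels = [999 if label == NOISE else label for label in labels]
print(labels)
```

Relabeling noise as 999 only matters for sorting: headlines that fit no cluster end up after all named clusters.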
 

### Fit dimensionality reduction model

In [None]:
# # train dimensionality reduction, only need to do this every few months and pickle the model to reflect new topics
# # extracts all links from history where isAI=1
# conn = sqlite3.connect('articles.db')
# c = conn.cursor()
# #  and timestamp > '2024-07-01' 
# query = "select * from news_articles where isAI=1 order by id desc limit 20000"
# ai_history_df = pd.read_sql_query(query, conn)
# len(ai_history_df)

In [None]:
# embedding_model="text-embedding-3-large"
# embedding_df_list = []
# pages = paginate_df(ai_history_df, maxpagelen=1000, max_input_tokens=8192)

# for p in pages:
#     response = client.embeddings.create(input=[obj['title'] for obj in p],
#                                         model=embedding_model)
#     embedding_df_list.append(pd.DataFrame([e.model_dump()['embedding'] for e in response.data]))

# embedding_df = pd.concat(embedding_df_list, axis=0, ignore_index=True)

# embedding_df.to_pickle("historical_embedding_df.pkl")


In [None]:
# # Initialize the UMAP reducer
# reducer = umap.UMAP(n_components=30)
# # Fit the reducer to the data without transforming
# reducer.fit(embedding_df)
# # Pickle the reducer
# with open('reducer.pkl', 'wb') as f:
#     pickle.dump(reducer, f)
# print("UMAP reducer pickled and saved as 'reducer.pkl'")

In [None]:
# attempt to extract top topics 
print(TOPIC_PROMPT)


In [None]:
# get topics
pages = paginate_df(AIdf)

# apply this prompt to AI headlines
log("start topic extraction")
response = asyncio.run(fetch_pages(pages, prompt=TOPIC_PROMPT))
log("end topic extraction")

In [None]:
topic_df = pd.DataFrame(response)
topic_df = topic_df.rename(columns={'topics': 'extracted_topics'})
print(len(topic_df))
topic_df.head()


In [None]:
all_topics = [item.lower() for row in topic_df.itertuples() for item in row.extracted_topics]
item_counts = Counter(all_topics)
filtered_topics = [item for item in item_counts if item_counts[item] >= 2 and item not in {'technology', 'ai', 'artificial intelligence'}]
print(len(filtered_topics))
sorted(filtered_topics)


In [None]:
topic_df['extracted_topics'] = topic_df['extracted_topics'].apply(lambda l: [t.title() for t in l if t.lower() in filtered_topics])

In [None]:
# evergreen topics to hopefully map headlines to canonical standardized topics
# review extracted topics and add
# you could try it with new cats or new cats + evergreen
# but probably look at new cats and human in the loop should add good new cats today to evergreen list
# new_cats = list(json.loads(response.choices[0].message.content).values())[0]
# categories = sorted(list(set(new_cats + evergreen)))
categories = sorted(CANONICAL_TOPICS)
for c in categories:
    print(c)

In [None]:
[t for t in filtered_topics if t.lower() not in [u.lower() for u in CANONICAL_TOPICS]]

In [None]:
AIdf


In [None]:
# maybe try to set timeout in categorize_headline
catdict = dict()

async with aiohttp.ClientSession() as session:
    for i, row in enumerate(AIdf.itertuples()):
        tasks = []
        log(f"Categorizing headline {row.id+1} of {len(AIdf)}")
        h = row.title
        log(h)
        for c in categories:
            task = asyncio.create_task(categorize_headline(h, c, session))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        catdict[row.id] = [item for sublist in responses for item in sublist]
        log(str(catdict[row.id]))
        
catdict

In [None]:
topic_df['assigned_topics'] = topic_df['id'].apply(lambda i: catdict.get(i, []))
topic_df

In [None]:
lcategories = set([c.lower() for c in categories])

In [None]:
import re

def clean_topics(row):
    topics = [x.title() for x in row.extracted_topics if x.lower() not in {"technology", "ai", "artificial intelligence"}]
    assigned_topics = [x.title() for x in row.assigned_topics if x.lower() in lcategories]
    combined = sorted(set(topics + assigned_topics))
    # repair capitalization changed by str.title(); match whole words so e.g. "Airbnb" is left alone
    combined = [re.sub(r"\bAi\b", "AI", s) for s in combined]
    combined = [s.replace("Genai", "Gen AI") for s in combined]
    combined = [s.replace("Openai", "OpenAI") for s in combined]
    return combined

topic_df["topics"] = topic_df.apply(clean_topics, axis=1)
topic_df["topic_str"] = topic_df.apply(lambda row: ", ".join(row.topics), axis=1)
topic_df

In [None]:
AIdf


In [None]:
AIdf


In [None]:
# for idempotency: drop columns left over from a previous run of this cell, if present
AIdf = AIdf.drop(columns=["title_topic_str", "topic_str"], errors="ignore")

AIdf = pd.merge(AIdf, topic_df[["id", "topic_str"]], on="id", how="inner")
AIdf['title_topic_str'] = AIdf.apply(lambda row: f'{row.title} (Topics: {row.topic_str})', axis=1)
AIdf


In [None]:
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    display(AIdf.loc[AIdf["topic_str"]==""][['title']])


### Semantic sort

In [None]:
# use embeddings to sort headlines by semantic similarity
log(f"Fetching embeddings for {len(AIdf)} headlines")
embedding_model = 'text-embedding-3-large'
response = client.embeddings.create(input=AIdf['title_topic_str'].tolist(),
                                    model=embedding_model)
embedding_df = pd.DataFrame([e.model_dump()['embedding'] for e in response.data])

# approximate a traveling salesman ordering with an agglomerative cluster sort
log("Sort with agglomerative cluster sort")
sorted_indices = agglomerative_cluster_sort(embedding_df)
AIdf['sort_order'] = sorted_indices

# do dimensionality reduction on embedding_df and cluster analysis
log(f"Perform dimensionality reduction")
with open("reducer.pkl", 'rb') as file:
    # Load the model from the file
    reducer = pickle.load(file)
reduced_data = reducer.transform(embedding_df)
log(f"Cluster with DBSCAN")
dbscan = DBSCAN(eps=0.4, min_samples=3)  # Adjust eps and min_samples as needed
AIdf['cluster'] = dbscan.fit_predict(reduced_data)
AIdf.loc[AIdf['cluster'] == -1, 'cluster'] = 999
    
# sort first by clusters found by DBSCAN, then by semantic ordering
AIdf = AIdf.sort_values(['cluster', 'sort_order']) \
    .reset_index(drop=True) \
    .reset_index() \
    .drop(columns=["id"]) \
    .rename(columns={'index': 'id'})

AIdf
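
The notebook imports several ordering helpers (`nearest_neighbor_sort`, `agglomerative_cluster_sort`, `traveling_salesman_sort_scipy`) whose implementations are not shown here. As an illustration of the underlying idea, the sketch below builds a greedy nearest-neighbor tour through toy vectors, which is a common cheap approximation to a shortest traversal. The function names are hypothetical and the real helpers may behave differently.

```python
from math import dist

def nearest_neighbor_order(vectors, start=0):
    """Greedy tour: repeatedly visit the closest unvisited vector."""
    if not vectors:
        return []
    remaining = set(range(len(vectors)))
    remaining.discard(start)
    order = [start]
    while remaining:
        last = vectors[order[-1]]
        # ties are broken by the lower index so the result is deterministic
        nxt = min(remaining, key=lambda j: (dist(last, vectors[j]), j))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def order_to_rank(order):
    """Convert a visiting order into a per-row rank usable as a sort key."""
    rank = [0] * len(order)
    for position, idx in enumerate(order):
        rank[idx] = position
    return rank

vectors = [(0, 0), (10, 0), (1, 0), (9, 0)]
order = nearest_neighbor_order(vectors)
print(order)
print(order_to_rank(order))
```

The distinction between the two outputs matters when storing the result in a column: `order` lists row indices in visiting sequence, while `order_to_rank` gives each row its position in that sequence, which is what a sort key needs.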


In [None]:
async def write_topic_name(session, topic_list_str, max_retries=3, model=LOWCOST_MODEL):

    TOPIC_WRITER_PROMPT = f"""
You are a topic writing assistant. I will provide a list of headlines with extracted topics in parentheses. 
Your task is to propose a name for a topic that very simply, clearly and accurately captures all the provided 
headlines in less than 7 words. You will output a JSON object with the key "topic_title".

Example Input:
In the latest issue of Caixins weekly magazine: CATL Bets on 'Skateboard Chassis' and Battery Swaps to Dispell Market Concerns (powered by AI) (Topics: Battery Swaps, Catl, China, Market Concerns, Skateboard Chassis)

AI, cheap EVs, future Chevy  the week (Topics: Chevy, Evs)

Electric Vehicles and AI: Driving the Consumer & World Forward (Topics: Consumer, Electric Vehicles, Technology)

Example Output:
{{"topic_title": "Electric Vehicles"}}

Task
Propose the name for the overall topic based on the following provided headlines and individual topics:

{topic_list_str}
"""

    for i in range(max_retries):
        try:
            messages=[
                      {"role": "user", "content": TOPIC_WRITER_PROMPT
                      }]

            payload = {"model":  model,
                       "response_format": {"type": "json_object"},
                       "messages": messages,
                       "temperature": 0
                       }
            response = await fetch_openai(session, payload)
            response_dict = json.loads(response["choices"][0]["message"]["content"])
            return response_dict
        except Exception as exc:
            log(f"Error: {exc}")

    # all retries failed
    return {}
        

# show clusters
cluster_topics = []
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    async with aiohttp.ClientSession() as session:

        for i in range(30):
            tmpdf = AIdf.loc[AIdf['cluster']==i][["id", "title_topic_str"]]
            if len(tmpdf) ==0:
                break
            display(tmpdf)
            title_topic_str_list = ("\n\n".join(tmpdf['title_topic_str'].to_list()))
            cluster_topic = await write_topic_name(session, title_topic_str_list)
            cluster_topics.append(cluster_topic)
            print(cluster_topic)
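
`write_topic_name` retries immediately after a failure. If the failures come from rate limits, pausing between attempts may help. The sketch below shows one possible retry wrapper with exponential backoff. The name `call_with_retries`, the delay values, and the `flaky_call` stand-in for the API request are all hypothetical.

```python
import asyncio

async def call_with_retries(make_call, max_retries=3, base_delay=0.01, default=None):
    """Await make_call(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries):
        try:
            return await make_call()
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
            if attempt < max_retries - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    return default  # every attempt failed

attempts = 0

async def flaky_call():
    # stand-in for an API request that fails twice, then succeeds
    global attempts
    attempts += 1
    if attempts < 3:
        raise RuntimeError("temporary failure")
    return {"topic_title": "Electric Vehicles"}

result = asyncio.run(call_with_retries(flaky_call, default={}))
print(result)
```

Inside Jupyter the final call can be awaited directly, or run with `asyncio.run` once `nest_asyncio.apply()` has been called as at the top of this notebook.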

    


In [None]:
# we could extract top words using tfidf, something like 
# vectorizer = TfidfVectorizer(stop_words='english')
# tfidf_matrix = vectorizer.fit_transform(documents)
# feature_names = vectorizer.get_feature_names_out()
# topics = []
# for i in range(n_topics):
#     # Get the documents in this cluster
#     cluster_docs = [doc for doc, label in zip(documents, cluster_labels) if label == i]
#     cluster_metadatas = [meta for meta, label in zip(metadatas, cluster_labels) if label == i]

#     # Get the top words for this cluster based on TF-IDF scores
#     tfidf_scores = tfidf_matrix[cluster_labels == i].sum(axis=0).A1
#     top_word_indices = tfidf_scores.argsort()[-n_words_per_topic:][::-1]
#     top_words = [feature_names[index] for index in top_word_indices]
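
The commented sketch above relies on scikit-learn's `TfidfVectorizer`. As a dependency-free illustration of the same idea, the block below computes a simplified TF-IDF (term frequency times `log(N / document frequency)`) and sums the scores per cluster. The function name and sample data are hypothetical, and the weighting is not identical to scikit-learn's smoothed, normalized variant.

```python
import math
from collections import Counter, defaultdict


def top_words_per_cluster(documents, cluster_labels, n_words=3):
    """Return {cluster_label: [top words]} ranked by summed TF-IDF score."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: number of documents each word appears in
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    scores = defaultdict(Counter)
    for tokens, label in zip(tokenized, cluster_labels):
        if not tokens:
            continue  # skip empty documents to avoid dividing by zero
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])  # 0 for words in every document
            scores[label][word] += tf * idf

    return {label: [word for word, _ in counter.most_common(n_words)]
            for label, counter in scores.items()}


documents = [
    "ai chip chip demand",
    "ai chip supply",
    "ai antitrust antitrust probe",
    "ai antitrust ruling",
]
cluster_labels = [0, 0, 1, 1]
top_words = top_words_per_cluster(documents, cluster_labels, n_words=1)
```

Words that occur in every document (here `ai`) get an IDF of zero, so they never dominate a cluster's top words.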


In [None]:
# write_topic_name returns {} when every retry fails, so fall back to an empty name
cluster_topic_list = [obj.get('topic_title', '') for obj in cluster_topics]
cluster_topic_list

In [None]:
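# DBSCAN labels noise points -1; check the lower bound so they do not pick up the last cluster's name
AIdf['cluster_name'] = AIdf['cluster'].apply(lambda i: cluster_topic_list[i] if 0 <= i < len(cluster_topic_list) else "")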


# Save and email headlines


In [None]:
html_str = ""
for row in AIdf.itertuples():
    log(f"[{row.Index}. {row.title} - {row.site_name}]({row.actual_url})")
    html_str += f'{row.Index}.<a href="{row.actual_url}">{row.title} - {row.site_name}</a><br />\n'
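
The loop above interpolates titles directly into HTML, so a title containing `<` or `&` could break the markup. A minimal sketch using the standard library's `html.escape`; `headline_html` is a hypothetical helper, not a function from this notebook.

```python
from html import escape


def headline_html(index, title, site_name, url):
    """Build one linked headline line, escaping the link text and the URL attribute."""
    safe_text = escape(f"{title} - {site_name}")
    safe_url = escape(url, quote=True)
    return f'{index}.<a href="{safe_url}">{safe_text}</a><br />\n'


line = headline_html(0, "AT&T tests <AI> agents", "Example News",
                     "https://example.com/a?x=1&y=2")
```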


In [None]:
# save headlines
with open('headlines.html', 'w') as f:
    f.write(html_str)


In [None]:
# send mail
log("Sending headlines email")
subject = f'AI headlines {datetime.now().strftime("%H:%M:%S")}'
send_gmail(subject, html_str)


# Save individual pages 

In [None]:
# fetch pages
# Create a queue for multiprocessing and populate it 
log("Queuing URLs for scraping")

queue = multiprocessing.Queue()
for row in AIdf.itertuples():
    queue.put((row.id, row.actual_url, row.title))


In [None]:
# scrape urls in queue asynchronously
num_browsers = 4

process_queue = process_url_queue_factory(queue)  # avoid shadowing the built-in callable()

log(f"fetching {len(AIdf)} pages using {num_browsers} browsers")
saved_pages = launch_drivers(num_browsers, process_queue)
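
`process_url_queue_factory` and `launch_drivers` are defined elsewhere, so the block below only illustrates the general pattern: a factory returns a worker that drains a shared queue, and several workers run in parallel. It uses threads and `queue.Queue` rather than Selenium and `multiprocessing`, and every name in it is hypothetical.

```python
import queue
import threading


def make_queue_worker(work_queue, handle_item):
    """Return a zero-argument worker that drains work_queue and collects results."""
    def worker():
        results = []
        while True:
            try:
                item = work_queue.get_nowait()
            except queue.Empty:
                return results  # queue drained, this worker is done
            results.append(handle_item(*item))
    return worker


def run_workers(n_workers, worker):
    """Run n_workers copies of worker in threads and merge their results."""
    merged = []
    lock = threading.Lock()

    def run_one():
        partial = worker()
        with lock:
            merged.extend(partial)

    threads = [threading.Thread(target=run_one) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return merged


work_queue = queue.Queue()
for item in [(0, "https://example.com/a", "A"), (1, "https://example.com/b", "B")]:
    work_queue.put(item)


def fake_save(story_id, url, title):
    """Stand-in for saving a page: returns (id, url, title, path) without any network access."""
    return (story_id, url, title, f"pages/{story_id}.html")


saved = run_workers(2, make_queue_worker(work_queue, fake_save))
```

Results arrive in whatever order the workers finish, which is why the notebook merges on `id` afterwards rather than relying on position.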


In [None]:
pages_df = pd.DataFrame(saved_pages)
pages_df.columns = ['id', 'actual_url', 'title', 'path']
pages_df

In [None]:
AIdf = pd.merge(AIdf, pages_df[["id", "path"]], on='id', how="inner")


In [None]:
AIdf

# Summarize individual pages

In [None]:
print(SUMMARIZE_SYSTEM_PROMPT)


In [None]:
print(SUMMARIZE_USER_PROMPT)


In [None]:
# This fetches every summary at once: with ~200 stories, that means ~200 concurrent REST requests.
# A gentler approach would send them in batches (e.g. 10 at a time) or use queues and workers (probably overkill).
# In practice it works and runs fast on GPT-3.5; if OpenAI objects, the API can throttle the requests.

log("Starting summarize")
responses = await fetch_all_summaries(AIdf)
log(f"Received {len(responses)} summaries")
print(responses[0])
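
One way to cap the number of simultaneous requests, as the comment above suggests, is an `asyncio.Semaphore`. A minimal sketch, assuming each request can be wrapped in a zero-argument coroutine function; `gather_with_limit` and `fake_summary` are hypothetical names, and `fetch_all_summaries` itself is not shown in this notebook section.

```python
import asyncio


async def gather_with_limit(coro_funcs, limit=10):
    """Run zero-argument coroutine functions with at most `limit` in flight; keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(func):
        async with semaphore:
            return await func()

    return await asyncio.gather(*(run_one(func) for func in coro_funcs))


stats = {"in_flight": 0, "peak": 0}


async def fake_summary(i):
    """Hypothetical stand-in for one summary request; records peak concurrency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return i, f"summary {i}"


async def main():
    tasks = [lambda i=i: fake_summary(i) for i in range(25)]
    return await gather_with_limit(tasks, limit=5)


responses = asyncio.run(main())
```

In a notebook cell, replace `asyncio.run(main())` with `await main()`, since the kernel already runs an event loop.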


In [None]:
# bring summaries into dict
response_dict = {}
for i, response in responses:
    try:
        response_str = response["choices"][0]["message"]["content"]
        response_dict[i] = response_str
    except Exception as exc:
        print(exc)
        
len(response_dict)

In [None]:
AIdf['hostname']=AIdf['actual_url'].apply(lambda url: urlparse(url).netloc)
AIdf['site_name'] = AIdf['hostname'].apply(lambda hostname: sites_dict.get(hostname, ""))
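
With `sites_dict.get(hostname, "")`, stories from hosts missing in `sites_dict` end up with a blank site name. A small sketch of a fallback that uses the bare hostname instead; `site_name_for` and the sample dictionary entry are hypothetical.

```python
from urllib.parse import urlparse


def site_name_for(url, sites_dict):
    """Look up a display name by hostname; fall back to the hostname without 'www.'."""
    hostname = urlparse(url).netloc
    fallback = hostname[4:] if hostname.startswith("www.") else hostname
    return sites_dict.get(hostname, fallback)


sites_dict = {"www.theverge.com": "The Verge"}  # hypothetical entry
known = site_name_for("https://www.theverge.com/2024/7/19/story", sites_dict)
unknown = site_name_for("https://www.example.org/post", sites_dict)
```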


In [None]:
AIdf


In [None]:
# make text for email and also collect data for vector store
markdown_str = ''
vectorstore_list = []
metadata_list=[]
for i, row in enumerate(AIdf.itertuples()):
    topics = []
    if row.cluster_name:
        topics.append(row.cluster_name)
    if row.topic_str:
        topics.append(row.topic_str)
    topic_str = ", ".join(topics)

    summary = response_dict.get(row.id, "")  # empty if the summary request failed
    mdstr = f"[{i+1}. {row.title} - {row.site_name}]({row.actual_url})  \n\n {topic_str}  \n\n{summary} \n\n"
    # simpler version for vector store
    vectorstore_list.append(f"[{row.title} - {row.site_name}]({row.actual_url})\n\nTopics: {row.topic_str} \n\n{summary}\n\n")
    metadata_list.append({'id': row.id, 'title': row.title, 'url': row.actual_url, 'site': row.site_name})
    display(Markdown(mdstr))
    markdown_str += mdstr
    

In [None]:
display(Markdown( vectorstore_list[16]))

In [None]:
print(metadata_list[16])


In [None]:
# Create Document objects with the paragraphs and corresponding metadata
docs = [Document(page_content=paragraph, metadata=meta) 
        for paragraph, meta in zip(vectorstore_list, metadata_list)]
len(docs)

In [None]:
print(docs[16])


In [None]:
# persist_directory = "/Users/drucev/projects/AInewsbot/chroma_db_openai"
try:
    del vectorstore
except Exception as e:
    log(f"{e}")

try:
    shutil.rmtree(persist_directory)
    log(f"Directory '{persist_directory}' and all its contents have been removed successfully.")
except Exception as e:
    log(f"Remove directory error: {e}")
        
embeddings_openAI = OpenAIEmbeddings(model='text-embedding-3-small')
vectorstore = Chroma.from_documents(docs, embeddings_openAI)


In [None]:
# Perform a similarity search
query = "What is the latest with openai?"
results = vectorstore.similarity_search_with_score(query, 
                                        k=20,
                                       )  # k is the number of results to return
# Print the results
urldict = {}
for doc, score in results:
    if urldict.get(doc.metadata['url']):
        continue
    urldict[doc.metadata['url']] = 1
    if score < 1.25:
        print(f"Score:   {score}")
        print(f"Content: {doc.page_content}\n")
        print(f"Metadata: {doc.metadata}\n")
        print("---")
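
The filtering in the cell above (skip repeated URLs, keep only scores under a cutoff) can be factored into a helper. A minimal sketch, assuming the score is a distance where lower means more similar, as the `score < 1.25` test implies. It works on plain `(metadata, content, score)` tuples rather than real `Document` objects, and `filter_results` is a hypothetical name.

```python
def filter_results(results, cutoff=1.25):
    """Keep the first hit per URL whose distance score is below cutoff."""
    seen_urls = set()
    kept = []
    for metadata, content, score in results:
        url = metadata["url"]
        if url in seen_urls:
            continue  # a hit for this URL was already handled
        seen_urls.add(url)
        if score < cutoff:
            kept.append((url, content, score))
    return kept


sample_results = [
    ({"url": "https://example.com/a"}, "story A", 0.80),
    ({"url": "https://example.com/a"}, "story A again", 0.85),
    ({"url": "https://example.com/b"}, "story B", 1.10),
    ({"url": "https://example.com/c"}, "story C", 1.40),
]
kept = filter_results(sample_results)
```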

In [None]:
# # or use local embeddings with sentence_transformers
# # Initialize your embedding model
# embeddings_hf = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# # Create the vector store with a persist_directory
# persist_directory = "/Users/drucev/projects/AInewsbot/chroma_db_huggingface"
# vectorstore_hf = Chroma.from_documents(
#     documents=docs,
#     embedding=embeddings_hf,
#     persist_directory=persist_directory
# )

# # Perform a similarity search
# query = "What is the latest with OpenAI?"
# results = vectorstore_hf.similarity_search(query, k=10)  # k is the number of results to return

# # Print the results
# for doc in results:
#     print(f"Content: {doc.page_content}\n")
#     print(f"Metadata: {doc.metadata}\n")
#     print("---")
    


In [None]:
# Convert Markdown to HTML
html_str = markdown.markdown(markdown_str, extensions=['extra'])
# display(HTML(html_str))


In [None]:
# save bullets
with open('bullets.md', 'w') as f:
    f.write(markdown_str)


In [None]:
log("Sending bullet points email")
subject = f'AI news bullets {datetime.now().strftime("%H:%M:%S")}'
send_gmail(subject, html_str)


# Ask ChatGPT for top categories

In [None]:
print(TOP_CATEGORIES_PROMPT)

In [None]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
              {"role": "user", "content": TOP_CATEGORIES_PROMPT + markdown_str
              }],
    n=1,
    response_format={"type": "json_object"},
    temperature=0.5
)


In [None]:
suggested_categories = list(json.loads(response.choices[0].message.content).values())[0]
suggested_categories
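
Taking the first value of the parsed JSON object works when the model returns a single key holding a list, but it raises on malformed JSON or an empty object. A more defensive sketch; `first_list_value` is a hypothetical helper.

```python
import json


def first_list_value(json_text):
    """Return the first list among the top-level values of a JSON object, else []."""
    try:
        parsed = json.loads(json_text)
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, list):
        return parsed
    if isinstance(parsed, dict):
        for value in parsed.values():
            if isinstance(value, list):
                return value
    return []


good = first_list_value('{"topics": ["AI chips", "AI funding"]}')
empty = first_list_value("{}")
broken = first_list_value("not json")
```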

In [None]:
cluster_topic_list

In [None]:
# human category edit 
my_cats = [
 'AI in drug discovery',
 'Machine learning in structural biology',
 'nference and Takeda AI partnership',
 'Ant-inspired AI for drones',
 "Fei-Fei Li's $1B AI startup",
 "Artificial Agency's $16M funding",
 "CTERA's $80M funding",
 "Anthropic's $100M AI fund",
 "Samsung's AI image generation",
 'EU antitrust probe on AI deals',
 'Salesforce AI service agent',
 "Meta's AI regulatory issues",
 'Nvidia and Mistral AI model',
 'TSMC AI chip demand surge',
 'AI search engines',
 'AI in cybersecurity',
 'AI in Healthcare & Precision Medicine',
 'Autonomous Drones',
 'AI Startup Funding',
 'Claude AI for Data Analysis',
 'AI Startup Funding',
 'AI Smartphones',
 'Big Tech Antitrust Investigations',
 'Large Language Models',
 'AI-Powered Customer Service Agents',
 'AI Ethics and Regulation',
 'AI Art by Microsoft',
 'AI Chip Demand Surge',
 'Text-To-Image Diffusion Models',
 'ChatGPT in AI Dominance'
]
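
The hand-edited list above contains `'AI Startup Funding'` twice, which would produce two identical sections. A small sketch that removes repeats while keeping order; `dedupe_categories` is a hypothetical helper, and the case-insensitive comparison is an assumption about what should count as a duplicate.

```python
def dedupe_categories(categories):
    """Drop repeated categories (case-insensitive), keeping the first spelling and the order."""
    seen = set()
    unique = []
    for cat in categories:
        key = cat.strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(cat)
    return unique


sample = ["AI Startup Funding", "AI Smartphones", "AI Startup Funding", "ai smartphones"]
unique_cats = dedupe_categories(sample)
```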

In [None]:
md_str = ""
doc_list = []
docid_list = []
similarity_cutoff = 1.25
for cat in my_cats:
    docstr = f"# {cat} \n\n"
    # Perform a similarity search
    results = vectorstore.similarity_search_with_score(cat, 
                                                       k=10,
                                                      )
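    # deduplicate by URL and stop at the first result beyond the cutoff
    # (assumes the score is a distance, so results arrive most-similar first)
    urldict = {}
    for doc, score in results:
        if urldict.get(doc.metadata['url']):
            continue
        urldict[doc.metadata['url']] = 1
        if score > similarity_cutoff:
            break
        docstr += f"{doc.page_content}\n"
        docid_list.append(doc.metadata['id'])
    # always append, even with no matches, so doc_list stays aligned with my_cats
    doc_list.append(docstr)
    md_str += docstr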
        
        
display(Markdown(md_str))
            


In [None]:
docid_list

In [None]:
# write sections individually

mail_md_str = ""

for current_topic, cat in enumerate(my_cats):

    section_prompt = f"""
You are an advanced summarization assistant, a sophisticated AI system
designed to write a compelling summary of news input.

Input:
I will provide a markdown list of today's news articles on the topic: {my_cats[current_topic]}.
The input will be in the format
[Site-name-s1](url-s1)
Story-Title-s1

Topics: s1-topic1, s1-topic2, s1-topic3

- s1-bullet-point-1
- s1-bullet-point-2
- s1-bullet-point-3

[Site-name-s2](url-s2)
Story-Title-s2

Topics: s2-topic1, s2-topic2, s2-topic3

- s2-bullet-point-1
- s2-bullet-point-2
- s2-bullet-point-3

Instructions:

Read the input closely.
USE ONLY INFORMATION PROVIDED IN THE INPUT.
Provide the most significant facts without commentary or elaboration.
Write an engaging summary consisting of a title and at least 1 and no more than 5 bullet points.
Use as few bullet points as you need to provide the most significant facts.
Each bullet should contain one sentence with one link.
Each bullet should not repeat points or information from previous bullet points.
DO NOT REPEAT LINKS FROM PREVIOUS BULLET POINTS.
Write in the professional but engaging, narrative style of a tech reporter for a national publication.
Be balanced, professional, informative, providing accurate, clear, concise summaries in a respectful neutral tone.

Please check carefully that you only use information provided in the following input, and that no bullet point
repeats information or links previously provided.

Example Output Format Template (EXAMPLE ONLY, DO NOT OUTPUT THIS TEMPLATE):

# Engaging title

- bullet point a - [site name a](site url a)
- bullet point b - [site name b ](site url b)

Input:

{doc_list[current_topic]}
"""

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
                  {"role": "user", "content": section_prompt
                  }],
        n=1,   
        temperature=0.2
    )

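    response_str = response.choices[0].message.content
    # keep the unescaped text for the email; escape $ only for notebook display,
    # where Jupyter would otherwise render it as LaTeX math
    mail_md_str += response_str + " \n\n"
    display(Markdown(response_str.replace("$", "\\$")))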


In [None]:
print(mail_md_str)

In [None]:
log("Sending full summary email ")
subject = f'AI news summary {datetime.now().strftime("%H:%M:%S")}'
final_html_str = markdown.markdown(mail_md_str, extensions=['extra'])
display(HTML(final_html_str))
send_gmail(subject, final_html_str)


# Final Summary

In [None]:
# summarize by just giving selected stories in semantic order and hinting how to write it
AIdf.loc[AIdf['id'].isin(set(docid_list))]

In [None]:
docs[16].page_content

In [None]:
# rebuild markdown_str from only the stories selected for the topic sections
# note: docs[row.id] assumes each story's id equals its position in docs
markdown_str = ''

for i, row in enumerate(AIdf.loc[AIdf['id'].isin(set(docid_list))].itertuples()):
    mdstr = docs[row.id].page_content
    display(Markdown(mdstr.replace('$', '\\$')))
    markdown_str += mdstr
    

In [None]:
# clean up the list of topics in a human-in-the-loop workflow:
# loop through each topic and summarize it,
# then combine the summaries for a final prompt

In [None]:
TESTPROMPT = f"""
You are an advanced summarization assistant, a sophisticated AI system
designed to write a compelling summary of news input. You are able to categorize information, 
and identify trends from large volumes of news.

Objective: 
I will provide the text of today's news articles about AI and summary bullet points in markdown format.
Bullet points will contain a title and URL, a list of topics discussed, and a bullet-point summary of
the article. You are tasked with identifying and summarizing the most important news, recurring themes,
common facts and items. Your job is to create a concise summary of today's topics and developments.
You will write an engaging summary of today's news encompassing the most important and frequently 
mentioned topics and themes.
You will write in the professional but engaging, narrative style of a tech reporter for a national publication.
You will be balanced, professional, informative, providing accurate, clear, concise summaries in a neutral tone.
You will group stories into related topics.

Input Format Template:

[Site-name-s1](url-s1)
Story-Title-s1

Topics: s1-topic1, s1-topic2, s1-topic3

- s1-bullet-point-1
- s1-bullet-point-2
- s1-bullet-point-3

[Site-name-s2](url-s2)
Story-Title-s2

Topics: s2-topic1, s2-topic2, s2-topic3

- s2-bullet-point-1
- s2-bullet-point-2
- s2-bullet-point-3

Example Output Format Template (EXAMPLE ONLY, DO NOT OUTPUT THIS TEMPLATE):

# Engaging-topic-title-1

- bullet-point-1a - [site-name-1a](site-url-1a)
- bullet-point-1b - [site-name-1b](site-url-1b)

# Engaging-topic-title-2

- bullet-point-2a - [site-name-2a](site-url-2a)
- bullet-point-2b - [site-name-2b](site-url-2b)

Instructions:

Read the input closely.
Very important: USE ONLY INFORMATION PROVIDED IN THE INPUT.
Provide the most significant facts without commentary or elaboration.
Each bullet should contain one sentence with one link.
Each bullet should not repeat points or information from previous bullet points.

Please check carefully that you only use information provided in the following input, that you include
all links in the input, and that no bullet point repeats information or links previously provided.

Input:

"""



In [None]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
              {"role": "user", "content": TESTPROMPT + markdown_str
              }],
    n=1,   
    temperature=0.5
)

response_str = response.choices[0].message.content
# escape $ only for notebook display, so Jupyter does not render it as LaTeX math;
# response_str stays unescaped for the email sent below
display(Markdown(response_str.replace("$", "\\$")))


In [None]:
# re-display the summary without calling the API again
display(Markdown(response_str.replace("$", "\\$")))


In [None]:
log("Sending full summary email ")
subject = f'AI news summary {datetime.now().strftime("%H:%M:%S")}'
final_html_str = markdown.markdown(response_str, extensions=['extra'])
display(HTML(final_html_str))
send_gmail(subject, final_html_str)


In [None]:
log("Finished")


In [None]:
"""You will act like a professional editor with expertise in content optimization.
You are skilled at refining and enhancing written materials, specializing in
ensuring clarity, conciseness, and coherence in various types of documents,
including newsletters.

Objective: Edit the markdown newsletter provided below by removing any redundant
sentences or bullet points that restate previous points and contain the same link.
Leave intact bullet points that are unique and provide distinct information.

Step-by-step instructions:

Carefully read through the entire newsletter to understand the overall structure and content.
Identify sentences and bullet points that repeat information or provide identical links.
Remove all redundant sentences and bullet points that do not contribute new information or unique links.
Ensure that the remaining content flows logically and maintains the intended message and tone of the newsletter.
Double-check the final edited version for any inconsistencies or errors introduced during the editing process.
Take a deep breath and work on this problem step-by-step.
"""

In [None]:
"""You will act like a professional editor with expertise in content optimization.
You are skilled at reviewing and enhancing written materials, specializing in
helping improve clarity, conciseness, and coherence in various types of documents,
including newsletters.

Objective: Review the markdown newsletter provided below and advise on ways to improve it.
Note any links which are repeated, any sections which are similar and could be combined,
and any copy edits. You will only provide suggestions, and not rewrite the copy.

Step-by-step instructions:

Carefully read through the entire newsletter to understand the overall structure and content.
Identify sentences and bullet points that repeat information and provide identical links and should be removed.
Identify any sections which could be combined because they contain similar but not identical content.
Suggest improvements to any sections which are not clear, concise, and coherent.
Take a deep breath and work on this problem step-by-step.
"""

In [None]:
mail_md_str

In [None]:
display(Markdown(mail_md_str.replace("$", "\\$")))


In [None]:
edit_prompt1 = f"""You will act like a professional editor with expertise in content optimization.
You are skilled at reviewing and enhancing written materials, specializing in
helping improve clarity, conciseness, and coherence in various types of documents,
including newsletters.

Objective: Review the markdown newsletter provided below.
It consists of a series of sections, each of which contains several bullet points.
For each section, review each bullet point and advise if it should be moved to a different section.
You will only provide suggestions, and not rewrite the newsletter or provide other comments except
instructions regarding moving bullet points between sections.

Step-by-step instructions:

Carefully read through the entire newsletter to understand the overall structure and content.
Note the titles of the various sections.
Identify sentences and bullet points that should be moved to a different section. Write the
bullet point and the section it should be moved to.
If no bullet points should be moved for a given section, state that no action is required for that section.

Check carefully to make sure all similar bullet points end up grouped together in the same section.

Take a deep breath and work on this problem step-by-step.

Newsletter to edit: 
{mail_md_str}
"""
response = client.chat.completions.create(
    model=MODEL,
    messages=[
              {"role": "user", "content": edit_prompt1
              }],
    n=1,   
    temperature=0.2
)

response_str1 = response.choices[0].message.content
display(Markdown(response_str1.replace("$", "\\$")))


In [None]:
edit_prompt2 = f"""You will act like a professional editor with expertise in content optimization.
You are skilled at reviewing and enhancing written materials, specializing in
helping improve clarity, conciseness, and coherence in various types of documents,
including newsletters.

Objective: Below are editing instructions followed by a markdown newsletter.
Carefully review the editing instructions and the markdown newsletter provided below.
The newsletter consists of a series of sections, each of which contains several bullet points.
Move bullet points from one section to another according to the editing instructions below.
If there is no change to a specific section, include it unchanged in the response as it appears in the input.
Respond with the updated newsletter in markdown format.

Editing instructions:

Carefully read through the entire newsletter to understand the overall structure and content.
Note the titles of the various sections. Then make only the following changes:
{response_str1}

Newsletter to edit: 
{mail_md_str}

"""

response = client.chat.completions.create(
    model=MODEL,
    messages=[
              {"role": "user", "content": edit_prompt2
              }],
    n=1,   
    temperature=0.2
)
response_str2 = response.choices[0].message.content
display(Markdown(response_str2.replace("$", "\\$")))


In [None]:
edit_prompt3 = f"""You will act like a professional editor with expertise in content optimization.
You are skilled at reviewing and enhancing written materials, specializing in
helping improve clarity, conciseness, and coherence in various types of documents,
including newsletters.

Objective: Carefully review each section of the markdown newsletter provided below. 
Each section consists of several bullet points. 

For each section, identify and combine redundant bullet points:

Instructions: 
For each section, identify bullet points containing identical URLs to other bullet points in the same section.
Rewrite the section, combining these similar bullet points to eliminate duplication.
Do not duplicate any URLs within a section.
Check the response carefully and ensure that no links are duplicated within a section.

Newsletter to edit: 
{response_str2}

"""

response = client.chat.completions.create(
    model=MODEL,
    messages=[
              {"role": "user", "content": edit_prompt3
              }],
    n=1,   
    temperature=0.2
)
response_str3 = response.choices[0].message.content
display(Markdown(response_str3.replace("$", "\\$")))
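
The three prompts above run once each. The review-then-edit loop described in the todo (repeat until the reviewer finds nothing left to move) can be sketched without any framework. This is only an outline with stub functions standing in for the reviewer and editor prompts; it is not a langgraph implementation, and all names and sample data are hypothetical.

```python
def review_and_edit(newsletter, review, edit, max_rounds=5):
    """Alternate review and edit until the reviewer has no moves or max_rounds is reached."""
    for _ in range(max_rounds):
        moves = review(newsletter)
        if not moves:
            break
        newsletter = edit(newsletter, moves)
    return newsletter


def fake_review(sections):
    """Stub reviewer: flag funding bullets that sit outside the Funding section."""
    return [(name, bullet)
            for name, bullets in sections.items() if name != "Funding"
            for bullet in bullets if "raises" in bullet]


def fake_edit(sections, moves):
    """Stub editor: move each flagged bullet into the Funding section."""
    updated = {name: list(bullets) for name, bullets in sections.items()}
    for name, bullet in moves:
        updated[name].remove(bullet)
        updated["Funding"].append(bullet)
    return updated


draft = {
    "Chips": ["TSMC demand surges", "Saronic raises $175M"],
    "Funding": ["CTERA raises $80M"],
}
final = review_and_edit(draft, fake_review, fake_edit)
```

The `max_rounds` cap guards against a reviewer and editor that keep moving the same bullet back and forth.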


In [None]:
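# drop the first 11 characters, presumably a leading code-fence marker (three backticks plus "markdown") added by the model
display(Markdown(response_str3[11:].replace("$", "\\$")))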
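
A fixed offset such as `[11:]` breaks whenever the model omits the wrapper or uses a different fence label. A sketch of a helper that removes a single wrapping code fence only when one is present; `strip_code_fence` is a hypothetical name, and the assumption that the model sometimes wraps its answer in a fence is inferred from the slice above.

```python
import re

FENCE = "`" * 3  # built indirectly so this example contains no literal fence marker
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(text):
    """Return the text inside a single wrapping code fence, or the text unchanged."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text


fenced = FENCE + "markdown\n# Title\n\n- bullet\n" + FENCE
unwrapped = strip_code_fence(fenced)
untouched = strip_code_fence("# Title\n\n- bullet")
```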


In [None]:
response_str3.replace("$", "\\$")

In [None]:
PROMPT = f"""You will act as a professional editor with a strong background in technology journalism.
You have a deep understanding of current and emerging technology trends, and the ability to 
produce, edit, and curate high-quality content that engages and informs readers. You are 
especially skilled at reviewing and enhancing tech writing, helping improve clarity, conciseness, 
and coherence, and ensuring its accuracy and relevance.

Objective: Carefully review each section of the markdown newsletter provided below, which
contains several sections consisting of bullet points. Edit the newsletter for issues according
to the detailed instructions below, and respond with the updated newsletter or 'OK' if no changes
are needed.

Instructions: 
For each section, review the title and edit it to be as short, engaging, and consistent with the bullets
in the section as possible.
Remove or combine bullet points which are highly duplicative or redundant.
Make bullet points as concise as possible with facts.
Respond with the updated newsletter only in markdown format, without editorial comment, or the word 'OK' 
if no changes are recommended.

Newsletter to edit: 
{mail_md_str}
"""


response = client.chat.completions.create(
    model=MODEL,
    messages=[
              {"role": "user", "content": PROMPT
              }],
    n=1,   
    temperature=0.2
)
response_str3 = response.choices[0].message.content
display(Markdown(response_str3.replace("$", "\\$")))


In [None]:
mail_md_str = response_str3

In [None]:
z = """
# AI in Drug Discovery

- SLAC and Stanford develop AI method to enhance materials discovery - [Google News](https://news.google.com/articles/CBMiSGh0dHBzOi8vcGh5cy5vcmcvbmV3cy8yMDI0LTA3LWFpLWFwcHJvYWNoLW1hdGVyaWFscy1kaXNjb3Zlcnktc3RhZ2UuaHRtbNIBAA?hl=en-US&gl=US&ceid=US:en).
- Intel and IOC create Athlete365, an AI chatbot for Olympic athletes - [Google News](https://news.google.com/articles/CBMidmh0dHBzOi8vd3d3LmludGMuY29tL25ld3MtZXZlbnRzL3ByZXNzLXJlbGVhc2VzL2RldGFpbC8xNzAyL2Zyb20tYXRobGV0ZXMtdG8tZ2VuYWktZGV2ZWxvcGVycy1pbnRlbC10YWNrbGVzLXJlYWwtd29ybGTSAQA?hl=en-US&gl=US&ceid=US:en).
- Thoughtful AI raises \$20M for AI-powered revenue cycle automation - [Google News](https://news.google.com/articles/CBMiSGh0dHBzOi8vd3d3LmZpbnNtZXMuY29tLzIwMjQvMDcvdGhvdWdodGZ1bC1haS1yYWlzZXMtMjBtLWluLWZ1bmRpbmcuaHRtbNIBAA?hl=en-US&gl=US&ceid=US:en).
- OpenAI and Broadcom discuss new AI chip to reduce GPU reliance - [Google News](https://news.google.com/articles/CBMiVGh0dHBzOi8vZmluYW5jZS55YWhvby5jb20vbmV3cy9vcGVuYWktaG9sZHMtdGFsa3MtYnJvYWRjb20tZGV2ZWxvcGluZy0yMTAwMzEwODkuaHRtbNIBAA?hl=en-US&gl=US&ceid=US:en).
- AI Fund and KX Venture Capital partner to boost Thai AI startups - [Google News](https://news.google.com/articles/CBMiZmh0dHBzOi8vd3d3LmJhbmdrb2twb3N0LmNvbS9idXNpbmVzcy9nZW5lcmFsLzI4MzE5MDMvYWktZnVuZC1reHZjLXBhcnRuZXItdG8tZGV2ZWxvcC1sb2NhbC1haS1zdGFydHVwc9IBAA?hl=en-US&gl=US&ceid=US:en).
# Machine Learning in Structural Biology

- MIT uses machine learning to enhance high-entropy materials design - [Google News](https://news.google.com/articles/CBMiSGh0dHBzOi8vdGVjaHhwbG9yZS5jb20vbmV3cy8yMDI0LTA3LW1hY2hpbmUtc2VjcmV0cy1hZHZhbmNlZC1hbGxveXMuaHRtbNIBAA?hl=en-US&gl=US&ceid=US:en).

# OpenAI and Broadcom Partnership

- OpenAI and Broadcom to develop new AI chip - [Financial Times](https://www.ft.com/content/496a0c33-1af3-4dbf-977f-04d6804a8d28).
- Broadcom projects over \$11B in AI sales for FY24 - [TipRanks](https://www.tipranks.com/news/broadcom-nasdaqavgo-eyes-ai-chip-collaboration-with-openai).
- OpenAI hires ex-Google employees for AI server chip development - [Yahoo Finance](https://finance.yahoo.com/news/openai-holds-talks-broadcom-developing-210031089.html).

# Ant-inspired AI for Drones

- Army tests Black Hornet 3 drones for squad deployment - [Fox 7 Austin](https://news.google.com/articles/CBMiYGh0dHBzOi8vd3d3LmZveDdhdXN0aW4uY29tL25ld3MvYXJteS10ZXN0aW5nLXBvY2tldC1zaXplZC1kcm9uZXMtY291bGQtc29vbi1iZS1oYW5kcy1ldmVyeS1zcXVhZNIBZGh0dHBzOi8vd3d3LmZveDdhdXN0aW4uY29tL25ld3MvYXJteS10ZXN0aW5nLXBvY2tldC1zaXplZC1kcm9uZXMtY291bGQtc29vbi1iZS1oYW5kcy1ldmVyeS1zcXVhZC5hbXA?hl=en-US&gl=US&ceid=US:en).
- Drone warfare in Ukraine shifts military helicopter tactics - [Defense News](https://news.google.com/articles/CBMie2h0dHBzOi8vd3d3LmRlZmVuc2VuZXdzLmNvbS9nbG9iYWwvZXVyb3BlLzIwMjQvMDcvMTkvZHJvbmUtd2FyZmFyZS1pbi11a3JhaW5lLXByb21wdHMtZnJlc2gtdGhpbmtpbmctaW4taGVsaWNvcHRlci10YWN0aWNzL9IBAA?hl=en-US&gl=US&ceid=US:en).
- Ukraine's use of AI-driven unmanned systems raises ethical concerns - [American Magazine](https://news.google.com/articles/CBMigwFodHRwczovL3d3dy5hbWVyaWNhbWFnYXppbmUub3JnL3BvbGl0aWNzLXNvY2lldHkvMjAyNC8wNy8xOC91a3JhaW5lLWxldGhhbC1hdXRvbm9tb3VzLXdlYXBvbnMtc3lzdGVtcy1wb3BlLWZyYW5jaXMtdW4taW50ZXJuYXRpb25hbNIBAA?hl=en-US&gl=US&ceid=US:en).
- U.S. and China discuss AI risks in military contexts - [Foreign Policy](https://news.google.com/articles/CBMiT2h0dHBzOi8vZm9yZWlnbnBvbGljeS5jb20vMjAyNC8wNy8xOC9jaGluYS1taWxpdGFyeS1haS1hcnRpZmljaWFsLWludGVsbGlnZW5jZS_SAQA?hl=en-US&gl=US&ceid=US:en).
- Pentagon emphasizes "human-machine teaming" in military - [IEEE Spectrum](https://spectrum.ieee.org/robot-dog-vacuum).

# Fei-Fei Li's AI Startup

- Blackstone aims to be the largest AI infrastructure investor with \$2T in data center expenditures - [Google News](https://news.google.com/articles/CBMiaGh0dHBzOi8vd3d3LmJ1c2luZXNzaW5zaWRlci5jb20vYmxhY2tzdG9uZS1zdGV2ZS1zY2h3YXJ6bWFuLW9uLWFpLWluZnJhc3RydWN0dXJlLWludmVzdG1lbnQtZ29hbHMtMjAyNC030gEA?hl=en-US&gl=US&ceid=US:en).
- OpenAI developing AI chip to reduce Nvidia reliance - [The Verge](https://www.theverge.com/2024/7/19/24201737/openai-wants-in-on-the-ai-chip-business).
- Thoughtful AI raises \$20M for AI-powered revenue cycle automation - [Google News](https://news.google.com/articles/CBMiSGh0dHBzOi8vd3d3LmZpbnNtZXMuY29tLzIwMjQvMDcvdGhvdWdodGZ1bC1haS1yYWlzZXMtMjBtLWluLWZ1bmRpbmcuaHRtbNIBAA?hl=en-US&gl=US&ceid=US:en).
- Jared Leto invests in Captions, a generative AI startup valued at \$500M - [Google News](https://news.google.com/articles/CBMihAFodHRwczovL3d3dy5mb3huZXdzLmNvbS9lbnRlcnRhaW5tZW50L2phcmVkLWxldG8taW52ZXN0cy01MDBtLWFpLXN0YXJ0dXAtZGVzcGl0ZS1jYWxscy1mcm9tLW90aGVyLXN0YXJzLXNodXQtZG93bi1jb250cm92ZXJzaWFsLXRlY2jSAYgBaHR0cHM6Ly93d3cuZm94bmV3cy5jb20vZW50ZXJ0YWlubWVudC9qYXJlZC1sZXRvLWludmVzdHMtNTAwbS1haS1zdGFydHVwLWRlc3BpdGUtY2FsbHMtZnJvbS1vdGhlci1zdGFycy1zaHV0LWRvd24tY29udHJvdmVyc2lhbC10ZWNoLmFtcA?hl=en-US&gl=US&ceid=US:en).
- Major tech companies face financial risks in AI investments - [Google News](https://news.google.com/articles/CBMibGh0dHBzOi8vd3d3Lm1hcmtldHdhdGNoLmNvbS9zdG9yeS9taWNyb3NvZnQtbWV0YS1hbWF6b24tYW5kLWdvb2dsZS1mYWNlLXRoaXMtZ3Jvd2luZy1yaXNrLWFyb3VuZC1haS0yOGJjYTRhN9IBcGh0dHBzOi8vd3d3Lm1hcmtldHdhdGNoLmNvbS9hbXAvc3RvcnkvbWljcm9zb2Z0LW1ldGEtYW1hem9uLWFuZC1nb29nbGUtZmFjZS10aGlzLWdyb3dpbmctcmlzay1hcm91bmQtYWktMjhiY2E0YTc?hl=en-US&gl=US&ceid=US:en).

# Artificial Agency's Funding

- Saronic raises \$175M for autonomous military boats, valued at over \$1B - [Forbes](https://www.forbes.com/sites/davidjeans/2024/07/18/andreessen-horowitz-saronic-funding/).
- Jared Leto invests in Captions, a generative AI startup valued at \$500M - [Google News](https://news.google.com/articles/CBMihAFodHRwczovL3d3dy5mb3huZXdzLmNvbS9lbnRlcnRhaW5tZW50L2phcmVkLWxldG8taW52ZXN0cy01MDBtLWFpLXN0YXJ0dXAtZGVzcGl0ZS1jYWxscy1mcm9tLW90aGVyLXN0YXJzLXNodXQtZG93bi1jb250cm92ZXJzaWFsLXRlY2jSAYgBaHR0cHM6Ly93d3cuZm94bmV3cy5jb20vZW50ZXJ0YWlubWVudC9qYXJlZC1sZXRvLWludmVzdHMtNTAwbS1haS1zdGFydHVwLWRlc3BpdGUtY2FsbHMtZnJvbS1vdGhlci1zdGFycy1zaHV0LWRvd24tY29udHJvdmVyc2lhbC10ZWNoLmFtcA?hl=en-US&gl=US&ceid=US:en).
- AI Fund and KX Venture Capital partner to boost Thai AI startups - [Google News](https://news.google.com/articles/CBMiZmh0dHBzOi8vd3d3LmJhbmdrb2twb3N0LmNvbS9idXNpbmVzcy9nZW5lcmFsLzI4MzE5MDMvYWktZnVuZC1reHZjLXBhcnRuZXItdG8tZGV2ZWxvcC1sb2NhbC1haS1zdGFydHVwc9IBAA?hl=en-US&gl=US&ceid=US:en).
- TSMC to allocate chip capacity for OpenAI's chips if large orders are placed - [Google News](https://news.google.com/articles/CBMiU2h0dHBzOi8vd2NjZnRlY2guY29tL3RzbWMtaXMtd2lsbGluZy10by1hbGxvY2F0ZS1jYXBhY2l0eS1mb3Itb3BlbmFpcy1jaGlwcy1yZXBvcnQv0gFXaHR0cHM6Ly93Y2NmdGVjaC5jb20vdHNtYy1pcy13aWxsaW5nLXRvLWFsbG9jYXRlLWNhcGFjaXR5LWZvci1vcGVuYWlzLWNoaXBzLXJlcG9ydC9hbXAv?hl=en-US&gl=US&ceid=US:en).

# CTERA's \$80M Funding

- CTERA raises \$80M in Series D funding led by Red Dot Capital Partners - [TechCrunch](https://techcrunch.com/2023/10/01/ctera-80m-series-d-funding/).
- Funds to accelerate product development, expand global sales, and enhance customer support - [TechCrunch](https://techcrunch.com/2023/10/01/ctera-80m-series-d-funding/).
- CTERA to innovate cloud storage solutions and strengthen enterprise market position - [TechCrunch](https://techcrunch.com/2023/10/01/ctera-80m-series-d-funding/).

# Anthropic's \$100M AI Fund

- Blackstone aims to be the largest AI infrastructure investor with \$2T in data center expenditures - [Google News](https://news.google.com/articles/CBMiaGh0dHBzOi8vd3d3LmJ1c2luZXNzaW5zaWRlci5jb20vYmxhY2tzdG9uZS1zdGV2ZS1zY2h3YXJ6bWFuLW9uLWFpLWluZnJhc3RydWN0dXJlLWludmVzdG1lbnQtZ29hbHMtMjAyNC030gEA?hl=en-US&gl=US&ceid=US:en).
- Jared Leto invests in Captions, a generative AI startup valued at \$500M - [Google News](https://news.google.com/articles/CBMihAFodHRwczovL3d3dy5mb3huZXdzLmNvbS9lbnRlcnRhaW5tZW50L2phcmVkLWxldG8taW52ZXN0cy01MDBtLWFpLXN0YXJ0dXAtZGVzcGl0ZS1jYWxscy1mcm9tLW90aGVyLXN0YXJzLXNodXQtZG93bi1jb250cm92ZXJzaWFsLXRlY2jSAYgBaHR0cHM6Ly93d3cuZm94bmV3cy5jb20vZW50ZXJ0YWlubWVudC9qYXJlZC1sZXRvLWludmVzdHMtNTAwbS1haS1zdGFydHVwLWRlc3BpdGUtY2FsbHMtZnJvbS1vdGhlci1zdGFycy1zaHV0LWRvd24tY29udHJvdmVyc2lhbC10ZWNoLmFtcA?hl=en-US&gl=US&ceid=US:en).
- TSMC to allocate chip capacity for OpenAI's chips if large orders are placed - [Google News](https://news.google.com/articles/CBMiU2h0dHBzOi8vd2NjZnRlY2guY29tL3RzbWMtaXMtd2lsbGluZy10by1hbGxvY2F0ZS1jYXBhY2l0eS1mb3Itb3BlbmFpcy1jaGlwcy1yZXBvcnQv0gFXaHR0cHM6Ly93Y2NmdGVjaC5jb20vdHNtYy1pcy13aWxsaW5nLXRvLWFsbG9jYXRlLWNhcGFjaXR5LWZvci1vcGVuYWlzLWNoaXBzLXJlcG9ydC9hbXAv?hl=en-US&gl=US&ceid=US:en)
"""


display(Markdown(z))


In [None]:
import os
from datetime import date

from openbb import obb

# Get today's date (import only `date` so the `datetime` class used in earlier cells is not shadowed)
today = date.today().strftime("%Y-%m-%d")

# log in
obb.account.login(email=os.environ['OPENBB_USER'], password=os.environ['OPENBB_PW'], remember_me=True)

# Fetch recent company news for large-cap tech tickers (not yet filtered by date or AI topic)
results = obb.news.company("META, MSFT, GOOG, AAPL, AMZN, NVDA, TSLA", provider='yfinance', limit=20).to_df()


In [None]:
results
