# **News scraper + summarizer + sender** 

This notebook scrapes the latest Philippine news, summarizes each article with a simple LLM, and sends the result as a newsletter through Gmail.

In [1]:
# pip freeze

In [20]:
import random
import time
import warnings
from datetime import date, datetime, timedelta

import requests
from bs4 import BeautifulSoup

## **I. Webscrape a News site**

- **POC ver**: Scrape the first 3 pages of Inquirer's "Nation" section and keep only the articles published on the previous day
- **Next steps**:
  1. Add 2nd news website source: CNN Philippines "News & Buzz" section has potential
  2. Check similarity of news from source 1 vs source 2 with LLM to avoid duplicates: Use vector db?
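
Before reaching for an LLM or a vector database, a plain string-similarity check on titles may already catch many cross-source duplicates. The sketch below uses `difflib` from the standard library as a stand-in for the embedding-based comparison mentioned above; the function names and the 0.8 threshold are assumptions for illustration, not tuned values.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase a title and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_near_duplicate(title_a: str, title_b: str, threshold: float = 0.8) -> bool:
    """Return True when two titles are similar enough to be treated as one story.

    The threshold is an assumed starting point and would need tuning on real data.
    """
    ratio = SequenceMatcher(
        None, normalize_title(title_a), normalize_title(title_b)
    ).ratio()
    return ratio >= threshold
```

Titles that pass this check could then be skipped or merged before summarization; stories reported with very different headlines would still need a semantic comparison.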

### **A. Inquirer**

In [21]:
# URL = "https://newsinfo.inquirer.net/category/nation"
# page = requests.get(URL).text

# page_no=3
# URL = f"https://newsinfo.inquirer.net/category/nation/page/{str(page_no)}"
# page = requests.get(URL).text

In [22]:
# soup = BeautifulSoup(page, "lxml")

In [23]:
# articles = soup.find_all("div", {"id": "ch-ls-box"})

Get all latest news links

In [24]:
TODAY = date.today()
PREV_DATE = TODAY + timedelta(-1)
PREV_DATE_STR = PREV_DATE.strftime("%B %d, %Y")

## Manual assign
PREV_DATE_STR = "January 17, 2024"
PREV_DATE_STR

'January 17, 2024'
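
`strftime("%B %d, %Y")` zero-pads the day (for example `January 07, 2024`), so comparing date strings only works if the site pads single-digit days the same way, which is not confirmed here. A minimal sketch of a more tolerant comparison is to parse the scraped text into a `date` first; `parse_pub_date` is a hypothetical helper name, and month-name parsing assumes an English locale.

```python
from datetime import date, datetime


def parse_pub_date(pub_date_text: str) -> date:
    """Parse text such as 'January 7, 2024' or 'January 07, 2024' into a date."""
    # strptime accepts the day with or without a leading zero
    return datetime.strptime(pub_date_text.strip(), "%B %d, %Y").date()


# Compare dates instead of strings
print(parse_pub_date("January 7, 2024") == date(2024, 1, 7))  # True
```

With this helper, the scraping loop could compare `parse_pub_date(pub_date)` against `PREV_DATE` rather than against `PREV_DATE_STR`.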

In [38]:
latest_news_links = []

for page_no in range(1, 4):
    articles = []
    attempts = 0
    while not articles and attempts < 5:
        if page_no == 1:
            URL = "https://newsinfo.inquirer.net/category/nation"
        else:
            URL = f"https://newsinfo.inquirer.net/category/nation/page/{page_no}"
        page = requests.get(URL).text
        soup = BeautifulSoup(page, "lxml")
        articles = soup.find_all("div", {"id": "ch-ls-box"})
        if not articles:
            # Add random delay per page retry to avoid bot-behavior
            time.sleep(5 + random.randint(0, 5))
            attempts += 1

    if attempts == 5:
        print(f"Failed to fetch articles from page {page_no} after 5 attempts.")
        continue

    for article in articles:
        try:
            news_link = article.a["href"]
            pub_date = article.find("div", {"id": "ch-postdate"}).span.text
            print(news_link, pub_date)
            if pub_date == PREV_DATE_STR:
                latest_news_links.append(news_link)
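        except (TypeError, AttributeError):
            # Skips div tags mainly for styling: a missing <a> raises TypeError,
            # a missing post-date div or span raises AttributeError
            pass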

    # Add random delay per page access to avoid bot-behavior
    time.sleep(5 + random.randint(0, 5))
    
assert latest_news_links, f"No news from {PREV_DATE_STR} was fetched."

https://newsinfo.inquirer.net/1890758/bir-raises-vat-exemption-cap-for-housing-to-p3-6m January 18, 2024
https://newsinfo.inquirer.net/1890756/about-time-transport-groups-say-of-puv-modernization January 18, 2024
https://newsinfo.inquirer.net/1890753/bi-probe-exposes-459-foreigners-in-visa-scam January 18, 2024
https://newsinfo.inquirer.net/1890751/philhealth-pag-ibig-incomes-still-up-even-with-no-premium-hikes January 18, 2024
https://newsinfo.inquirer.net/1890749/pedro-m-calayag-84 January 18, 2024
https://newsinfo.inquirer.net/1890744/cha-cha-signature-sheets-start-to-arrive-at-comelec-offices-no-choice-but-to-receive-them January 18, 2024
https://newsinfo.inquirer.net/1890739/starbucks-sorry-for-signage-limiting-seniors-discount January 18, 2024
https://newsinfo.inquirer.net/1890737/from-house-a-call-to-scrutinize-senates-cha-cha-measure January 18, 2024
https://newsinfo.inquirer.net/1890733/shrinkflation-dti-okays-downsizing-of-key-goods January 18, 2024
https://newsinfo.inquirer.

In [39]:
latest_news_links

['https://newsinfo.inquirer.net/1890240/ca-to-check-credentials-of-171-marcos-appointees',
 'https://newsinfo.inquirer.net/1890238/da-applies-finishing-touches-to-ph-vietnam-rice-deal',
 'https://newsinfo.inquirer.net/1890242/cops-block-groups-protesting-puv-modernization-program',
 'https://newsinfo.inquirer.net/1890240/ca-to-check-credentials-of-171-marcos-appointees',
 'https://newsinfo.inquirer.net/1890238/da-applies-finishing-touches-to-ph-vietnam-rice-deal']
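
The list above contains two links twice, so the same article would be fetched, summarized, and emailed more than once. A minimal sketch for removing repeats while keeping the original order; `dedupe_keep_order` is a hypothetical helper name.

```python
def dedupe_keep_order(links: list[str]) -> list[str]:
    """Remove repeated links while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(links))


sample_links = [
    "https://newsinfo.inquirer.net/1890240/ca-to-check-credentials-of-171-marcos-appointees",
    "https://newsinfo.inquirer.net/1890238/da-applies-finishing-touches-to-ph-vietnam-rice-deal",
    "https://newsinfo.inquirer.net/1890242/cops-block-groups-protesting-puv-modernization-program",
    "https://newsinfo.inquirer.net/1890240/ca-to-check-credentials-of-171-marcos-appointees",
    "https://newsinfo.inquirer.net/1890238/da-applies-finishing-touches-to-ph-vietnam-rice-deal",
]
unique_links = dedupe_keep_order(sample_links)
print(len(unique_links))  # 3
```

Applying the same call to `latest_news_links` before extracting content would avoid the duplicate work.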

Extract news content per link:

- **NOTE**: This code sometimes fails because the website behaves inconsistently when it receives requests. It may return no HTML at all, which leaves the title and content empty
    - The same problem can occur when getting the latest news links (previous block of code)
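
One way to cope with the empty responses described in the note is to reuse the retry-with-random-delay idea from the listing loop for article pages as well. The sketch below is a generic wrapper, not the notebook's method: `fetch_with_retry` and its parameters are hypothetical names, and the caller supplies both the fetch function and the check that decides whether a response is usable.

```python
import random
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 5,
    base_delay: float = 5.0,
    jitter: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until is_valid(result) is true or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        html = fetch()
        if is_valid(html):
            return html
        if attempt < max_attempts:
            # Random delay between retries, as in the listing-page loop
            sleep(base_delay + random.uniform(0, jitter))
    return None


# Hypothetical usage with the notebook's names (not executed here):
# article_page = fetch_with_retry(
#     fetch=lambda: requests.get(news_link, timeout=10).text,
#     is_valid=lambda html: 'id="article_content"' in html,
# )
```

A `None` result signals that every attempt failed, so the caller can skip that link instead of crashing on a missing title.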

In [40]:
news_data = []

for news_link in latest_news_links:
    article_page = requests.get(news_link).text
    article_soup = BeautifulSoup(article_page, "lxml")

    # Get news title and body; either can be missing when the site returns an empty page
    title_tag = article_soup.find("h1", {"class": "entry-title"})
    page_body = article_soup.find("div", {"id": "article_content"})
    if title_tag is None or page_body is None:
        print(f"Skipping {news_link}: no title or content found.")
        continue
    title = title_tag.text
    # Get news content by concatenating the text of each unstyled paragraph
    paragraphs = page_body.find_all("p", {"class": ""})
    content = "".join(f"{paragraph.text}\n " for paragraph in paragraphs)

    # Add to data list
    news_data.append({"Title": title, "Content": content, "Link": news_link})

In [41]:
news_data

[{'Title': 'CA to check credentials of 171 Marcos appointees',
  'Content': '\n MANILA, Philippines — A total of 171 presidential appointees, mostly military officials and newly designated Finance Secretary Ralph Recto, will go through the scrutiny of the Commission on Appointments (CA) when Congress resumes session on Jan. 22.\n CA Assistant Minority Leader and Surigao del Sur Rep. Johnny Pimentel encouraged the public on Tuesday to formally inform the body of their opposition to or complaint against any of the appointees scheduled for confirmation. Any information, written report, sworn and notarized complaints, or opposition should be submitted to the CA secretariat, he added.\n “The CA has already received the appointment papers of Secretary Recto, along with the nomination papers of two ambassadors and the promotion papers of 168 senior military officers,” Pimentel said in a statement.\n The two diplomats are Flerida Ann Camille Mayo, the Philippine ambassador to Cambodia, and Edg
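
The scraped `Content` starts with a newline and joins paragraphs with `"\n "`, and article text stored later in this notebook also carries non-breaking spaces (`\xa0`). Normalizing the whitespace before summarization may give the model cleaner input. A minimal sketch, where `clean_paragraphs` is a hypothetical helper name:

```python
def clean_paragraphs(raw_content: str) -> str:
    """Collapse whitespace inside each paragraph and drop empty paragraphs."""
    # str.split() with no arguments also splits on non-breaking spaces (\xa0)
    paragraphs = (" ".join(part.split()) for part in raw_content.split("\n"))
    return "\n".join(p for p in paragraphs if p)


sample = "\n MANILA, Philippines: we don't\xa0 need to meet.\n CA Assistant\n "
print(clean_paragraphs(sample))
```

The cleaned text keeps one paragraph per line, so it can still be split back into paragraphs if needed.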

## **II. Summarize news using LLM**

- **POC ver**: News summarizer using *Fine-tuned LongT5 transformer* from Hugging Face
    - Noticeable model rot due to the lack of online training and fine-tuning: some words are not summarized properly and the model starts to hallucinate
- **Next steps**:
  1. Try other LLMs: Mistral, Phi, Quantized versions, etc
  2. Other gimmicks: Add text-to-image models and attach in email

In [42]:
from transformers import pipeline

summarizer_llm = pipeline(
    "summarization", "pszemraj/long-t5-tglobal-base-16384-book-summary", max_length=256
)

# summary = summarizer_llm(content)
# summarized_news = summary[0]["summary_text"]
# print(summarized_news)

Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.


Store summarized news

In [43]:
news_data_final = [
    # Add key "Summary" with summary of news content to the orig dict of each news
    dict(
        news_dict,
        **{"Summary": summarizer_llm(news_dict["Content"])[0]["summary_text"]},
    )
    for news_dict in news_data
]



In [44]:
news_data_final

[{'Title': 'CA to check credentials of 171 Marcos appointees',
  'Content': '\n MANILA, Philippines — A total of 171 presidential appointees, mostly military officials and newly designated Finance Secretary Ralph Recto, will go through the scrutiny of the Commission on Appointments (CA) when Congress resumes session on Jan. 22.\n CA Assistant Minority Leader and Surigao del Sur Rep. Johnny Pimentel encouraged the public on Tuesday to formally inform the body of their opposition to or complaint against any of the appointees scheduled for confirmation. Any information, written report, sworn and notarized complaints, or opposition should be submitted to the CA secretariat, he added.\n “The CA has already received the appointment papers of Secretary Recto, along with the nomination papers of two ambassadors and the promotion papers of 168 senior military officers,” Pimentel said in a statement.\n The two diplomats are Flerida Ann Camille Mayo, the Philippine ambassador to Cambodia, and Edg

In [18]:
# title = news_data[2]["Title"]
# content = news_data[2]["Content"]
# content

## **III. Gmail Email Sender**

- **POC ver**: Simple style and HTML based. No file attachments
- **Next steps**:
  1. Better style - use CSS?
  2. Add attachments produced by other generative models 

In [17]:
# import os
# import smtplib
# import ssl
# from email.message import EmailMessage

# # Define email sender and receiver
# email_sender = 'phnewsletterbot@gmail.com'
# # Never hardcode the app password; GMAIL_APP_PASSWORD is an assumed variable name
# email_password = os.environ['GMAIL_APP_PASSWORD'] # GET THIS IN https://myaccount.google.com/u/4/apppasswords
# email_receiver = 'earljohn.crusina05@gmail.com'

# # Set the subject and body of the email
# subject = f'Philippine News For {PREV_DATE_STR}'
# body = f"""
# 1. {title} | <a href="{news_link}">Philstar</a> \n\t Summary: {summarized_news}
# """


# em = EmailMessage()
# em['From'] = email_sender
# em['To'] = email_receiver
# em['Subject'] = subject
# em.set_content(body)

# # Add SSL (layer of security)
# context = ssl.create_default_context()

# # Log in and send the email
# with smtplib.SMTP_SSL('smtp.gmail.com', 465, context=context) as smtp:
#     smtp.login(email_sender, email_password)
#     smtp.sendmail(email_sender, email_receiver, em.as_string())

In [14]:
import os
import smtplib
import ssl
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

news_outlet = "INQUIRER"

# Define email sender and receiver
sender_email = "phnewsletterbot@gmail.com"
# Read the Gmail app password from the environment instead of hardcoding it.
# GMAIL_APP_PASSWORD is an assumed variable name.
# GET THE PASSWORD IN https://myaccount.google.com/u/4/apppasswords
password = os.environ["GMAIL_APP_PASSWORD"]
receiver_email = "earljohn.crusina05@gmail.com"

message = MIMEMultipart("alternative")
message["Subject"] = f"Philippine News For {PREV_DATE_STR}"
message["From"] = sender_email
message["To"] = receiver_email


# Create the plain-text and HTML version of your message
text = """\
Good day! Here is yesterday's news from the Philippines:
"""

html = """\
<html>
  <head>
  </head>
  <body style="font-size:16px">
    <p>Good day! Here is yesterday's news from the Philippines:</p>
    <ol>
"""

# Add each news article to the HTML content
for i, article in enumerate(news_data_final, start=1):
    title = article["Title"]
    link = article["Link"]
    summarized_news = article["Summary"]

    # Add the article to the HTML content in a numbered list format
    html += f"""
      <li>
        <b>{title}</b> | <a href="{link}">{news_outlet}</a> 
        <br>
        <p>Summary: {summarized_news}</p>
        <br>
      </li>
    """
    # Add the article to the plain-text content
    text += f"{i}. {title} | {news_outlet}\n\tSummary: {summarized_news}\n"

# Close the HTML tags
html += """\
    </ol>
  </body>
</html>
"""

# Turn these into plain/html MIMEText objects
part1 = MIMEText(text, "plain")
part2 = MIMEText(html, "html")

# Add HTML/plain-text parts to MIMEMultipart message
message.attach(part1)
message.attach(part2)

# Create a secure connection with the server and send the email
context = ssl.create_default_context()
with smtplib.SMTP_SSL("smtp.gmail.com", 465, context=context) as server:
    server.login(sender_email, password)
    server.sendmail(sender_email, receiver_email, message.as_string())
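
The loop above interpolates titles and summaries into the HTML as-is, so a character such as `&` or `<` in a headline could break the markup. The sketch below shows one way to escape the values and to build the same plain-text plus HTML message with `email.message.EmailMessage`. `build_newsletter` is a hypothetical helper, its layout is simplified, and it only builds the message without sending it.

```python
import html as html_lib
from email.message import EmailMessage


def build_newsletter(articles: list[dict], date_label: str, outlet: str) -> EmailMessage:
    """Build a plain-text + HTML newsletter without sending it."""
    text_lines = []
    html_items = []
    for i, article in enumerate(articles, start=1):
        title, link, summary = article["Title"], article["Link"], article["Summary"]
        text_lines.append(f"{i}. {title} | {outlet}\n\tSummary: {summary}")
        # Escape every scraped or generated value before placing it in HTML
        html_items.append(
            f"<li><b>{html_lib.escape(title)}</b> | "
            f'<a href="{html_lib.escape(link, quote=True)}">{html_lib.escape(outlet)}</a>'
            f"<p>{html_lib.escape(summary)}</p></li>"
        )

    message = EmailMessage()
    message["Subject"] = f"Philippine News For {date_label}"
    message.set_content("\n".join(text_lines))  # plain-text part
    message.add_alternative(  # HTML part, preferred by most mail clients
        f"<html><body><ol>{''.join(html_items)}</ol></body></html>", subtype="html"
    )
    return message
```

After setting `message["From"]` and `message["To"]`, the result could be passed to `server.send_message(message)` inside the same `smtplib.SMTP_SSL` block used above.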


## **IV. DynamoDB NoSQL Cloud database**

- **POC ver**: Store news for the last 3 days only, to stay within free-tier storage
- **IMPORTANT**: Need to stay within the AWS Free Tier limits:
    - 25 GB of storage
    - 25 provisioned Read Capacity Units (RCUs)
    - 25 provisioned Write Capacity Units (WCUs)
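
To relate these limits to the news records, recall that one write capacity unit covers one standard write per second for an item of up to 1 KB, so a full article costs several units per write. The sketch below gives a rough estimate only: it approximates item size by the length of the JSON encoding, which is not exactly how DynamoDB measures items, and both function names are hypothetical.

```python
import json
import math


def item_size_bytes(item: dict) -> int:
    """Rough item size: UTF-8 length of the JSON encoding (an approximation)."""
    return len(json.dumps(item, ensure_ascii=False).encode("utf-8"))


def write_units(item: dict) -> int:
    """Estimated WCUs for one standard write: one unit per 1 KB, rounded up."""
    return max(1, math.ceil(item_size_bytes(item) / 1024))


# A record with about 5,000 characters of content needs roughly 5 WCUs per write
print(write_units({"Content": "a" * 5000}))  # 5
```

Running `write_units` over each record in `news_data_final` would give a quick sense of how close a daily insert comes to the provisioned 25 WCUs.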

In [15]:
# pip install boto3

In [16]:
PREV_DATE_ID = PREV_DATE.strftime("%Y%m%d")
PREV_DATE_ID

'20240115'

In [17]:
# PREV_DATE_ID = '20231024'

In [18]:
# Add IDs for DynamoDB table
# (loop variable is named news_item so it does not overwrite the news_data list)
for idx, news_item in enumerate(news_data_final):
    news_item["dateID"] = PREV_DATE_ID
    # Number newsIDs in reverse so the last item in the list gets ID 1
    news_item["newsID"] = f"{PREV_DATE_ID}-{len(news_data_final) - idx}"
    news_item["CreationDate"] = datetime.now().isoformat()

In [19]:
news_data_final

[{'Title': 'No need to meet with House on changing economic provisions, says Zubiri',
  'Content': '\n MANILA, Philippines — There is no longer need for the Senate and the House of Representatives to convene jointly to propose amendments to economic provisions of the 1987 Constitution.\n This is according to Senate President Juan Miguel Zubiri, who filed Resolution of Both Houses\xa0 (RBH6) No. 6 on Monday, proposing amendments to certain economic provisions of the Constitution.\n “No need, we don’t\xa0 need to meet,” Zubiri said in a press conference when asked if the\xa0 two chambers\xa0 would have to meet jointly through a constituent assembly to discuss the proposed amendments.\n “And there’s no specific instruction on the Constitution. We can meet separately for\xa0 that,” he added.\n Article 17 of\xa0 the Constitution\xa0 simply\xa0 states that “Congress upon a vote of three-fourths of all its members” may propose amendments\xa0 to or revision\xa0 of the 1987 Constitution.\n The\

In [20]:
from boto3 import resource
from boto3.dynamodb.conditions import Key

news_nosql_db = resource("dynamodb").Table("news-data-last3days")

In [21]:
def insert_news(news_data_list: list) -> None:
    """Inserts news records into the DynamoDB table, one put_item call per record.

    Parameters
    ----------
    news_data_list : list
        List of dictionaries, each with all key-value pairs of one news item.
    """
    print("Inserting news...")
    for news_data in news_data_list:
        # Insert 1 news
        response = news_nosql_db.put_item(Item=news_data)
        print(response)

    print("Done insert!")

In [22]:
def batch_delete_old_news(news_to_delete: list) -> None:
    """Delete old newsID records in DynamoDB table by batch to clear storage.

    Parameters
    ----------
    news_to_delete : list
        List of dictionaries with dateID and newsID of records to be deleted.
    """
    print("Deleting old news...")
    # Break out if input is empty
    if not news_to_delete:
        print("No news to delete.")
        return

    # NOTE - batch_writer buffers the deletes and sends them to DynamoDB in batches.
    # batch.delete_item() itself returns None, so the response printed below is always None.
    with news_nosql_db.batch_writer() as batch:
        for news_data in news_to_delete:
            # Delete 1 news
            part_key = news_data["dateID"]
            sort_key = news_data["newsID"]
            response = batch.delete_item(Key={"dateID": part_key, "newsID": sort_key})
            print(f"dateID:{part_key} newsID:{sort_key} || {response}")

    # Check if the dateIDs were successfully deleted
    print("Scanning again the updated table...")
    response_whole_db = news_nosql_db.scan()
    remaining_dateIDs = {item["dateID"] for item in response_whole_db["Items"]}
    deleted_dateIDs = {news["dateID"] for news in news_to_delete}

    assert remaining_dateIDs.isdisjoint(
        deleted_dateIDs
    ), "Not all dateIDs were deleted!"

    print("Done deletion!")

In [23]:
def count_news_with_dateID(news_dateID: str) -> int:
    """Returns the number of news with Partition Key of dateID

    Parameters
    ----------
    news_dateID : str
        dateID key to query in DynamoDB table.

    Returns
    -------
    int
        Total number of news with dateID Partition key.
    """
    # Query records with dateID as key
    filtering_exp = Key("dateID").eq(news_dateID)
    response = news_nosql_db.query(KeyConditionExpression=filtering_exp)

    # Count records
    news_list = response["Items"]
    total_news = len(news_list)
    return total_news

In [24]:
def get_old_news_to_delete(days_to_keep: int = 3) -> list:
    """Returns the dateID/newsID keys of old news in the DynamoDB table to delete so storage stays limited.

    Parameters
    ----------
    days_to_keep : int, optional
        Number of days to look back to get min date to keep, by default 3.

    Returns
    -------
    list
        Contains dictionary of news to be deleted with Partition Key dateID and Sort Key newsID.
    """
    # Get min dateID to keep in table
    min_date_stored = TODAY + timedelta(-days_to_keep)
    min_date_id = min_date_stored.strftime("%Y%m%d")

    # Scan entire table
    print("Scanning the entire table...")
    response_whole_db = news_nosql_db.scan()
    # Get all dateIDs so far
    old_dateIDs = set([item["dateID"] for item in response_whole_db["Items"]])
    # Select dateIDs for deletion
    delete_dateIDs = [dateID for dateID in old_dateIDs if dateID < min_date_id]
    print("Done retrieving dateIDs \n")

    # Get newsID to be deleted per dateID
    print("Generating newsIDs to delete...")
    news_to_delete = []
    for dateID in delete_dateIDs:
        total_news = count_news_with_dateID(news_dateID=dateID)
        # Generate newsID based on logic of how it was made before using total record count
        newsID_list = [
            {"dateID": dateID, "newsID": dateID + "-" + str(idx + 1)}
            for idx in range(total_news)
        ]
        news_to_delete += newsID_list

    if not news_to_delete:
        warnings.warn("No old records were found.", Warning)
    else:
        print("Done!")

    return news_to_delete
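
Two assumptions in the function above are worth noting. A single `scan()` call returns at most 1 MB of data and sets `LastEvaluatedKey` when more items remain, and rebuilding newsIDs from a record count only works if the stored IDs run from 1 to N without gaps. The sketch below is one possible alternative, not the notebook's method: it follows the scan pagination and collects the key pairs that are actually stored. `scan_old_keys` is a hypothetical helper, and `table` is expected to behave like a boto3 DynamoDB `Table`.

```python
def scan_old_keys(table, min_date_id: str) -> list[dict]:
    """Collect dateID/newsID pairs older than min_date_id, following scan pagination."""
    # Fetch only the key attributes to keep each page small
    scan_kwargs = {"ProjectionExpression": "dateID, newsID"}
    old_keys = []
    while True:
        response = table.scan(**scan_kwargs)
        for item in response.get("Items", []):
            if item["dateID"] < min_date_id:
                old_keys.append({"dateID": item["dateID"], "newsID": item["newsID"]})
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            break  # no more pages
        scan_kwargs["ExclusiveStartKey"] = last_key
    return old_keys


# Hypothetical usage with the notebook's names (not executed here):
# news_to_delete = scan_old_keys(news_nosql_db, min_date_id="20240113")
```

The returned list has the same shape as the output of `get_old_news_to_delete`, so it could be passed to `batch_delete_old_news` unchanged.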

**INSERT news (as of prev day) scraped today**

In [25]:
insert_news(news_data_list=news_data_final)

Inserting news...
{'ResponseMetadata': {'RequestId': '3M0LSC89M8DKEANJT35FEHE4C3VV4KQNSO5AEMVJF66Q9ASUAAJG', 'HTTPStatusCode': 200, 'HTTPHeaders': {'server': 'Server', 'date': 'Tue, 16 Jan 2024 13:15:21 GMT', 'content-type': 'application/x-amz-json-1.0', 'content-length': '2', 'connection': 'keep-alive', 'x-amzn-requestid': '3M0LSC89M8DKEANJT35FEHE4C3VV4KQNSO5AEMVJF66Q9ASUAAJG', 'x-amz-crc32': '2745614147'}, 'RetryAttempts': 0}}
{'ResponseMetadata': {'RequestId': 'S9RH4NPV7GLDL30HDRO1GS8GS3VV4KQNSO5AEMVJF66Q9ASUAAJG', 'HTTPStatusCode': 200, 'HTTPHeaders': {'server': 'Server', 'date': 'Tue, 16 Jan 2024 13:15:21 GMT', 'content-type': 'application/x-amz-json-1.0', 'content-length': '2', 'connection': 'keep-alive', 'x-amzn-requestid': 'S9RH4NPV7GLDL30HDRO1GS8GS3VV4KQNSO5AEMVJF66Q9ASUAAJG', 'x-amz-crc32': '2745614147'}, 'RetryAttempts': 0}}
{'ResponseMetadata': {'RequestId': 'M4C1BVJQFC2UPI0JIVMMJMEH1RVV4KQNSO5AEMVJF66Q9ASUAAJG', 'HTTPStatusCode': 200, 'HTTPHeaders': {'server': 'Server', 'd

**QUERY news older than 3 days**

In [26]:
news_to_delete = get_old_news_to_delete()
news_to_delete

Scanning the entire table...
Done retrieving dateIDs 

Generating newsIDs to delete...
Done!


[{'dateID': '20240111', 'newsID': '20240111-1'},
 {'dateID': '20240111', 'newsID': '20240111-2'},
 {'dateID': '20240111', 'newsID': '20240111-3'},
 {'dateID': '20240111', 'newsID': '20240111-4'},
 {'dateID': '20240111', 'newsID': '20240111-5'},
 {'dateID': '20240111', 'newsID': '20240111-6'},
 {'dateID': '20240111', 'newsID': '20240111-7'},
 {'dateID': '20240112', 'newsID': '20240112-1'},
 {'dateID': '20240112', 'newsID': '20240112-2'},
 {'dateID': '20240112', 'newsID': '20240112-3'},
 {'dateID': '20240112', 'newsID': '20240112-4'},
 {'dateID': '20240112', 'newsID': '20240112-5'},
 {'dateID': '20240112', 'newsID': '20240112-6'},
 {'dateID': '20240112', 'newsID': '20240112-7'},
 {'dateID': '20240112', 'newsID': '20240112-8'},
 {'dateID': '20240112', 'newsID': '20240112-9'}]

**DELETE news older than 3 days**

In [27]:
batch_delete_old_news(news_to_delete=news_to_delete)

Deleting old news...
dateID:20240111 newsID:20240111-1 || None
dateID:20240111 newsID:20240111-2 || None
dateID:20240111 newsID:20240111-3 || None
dateID:20240111 newsID:20240111-4 || None
dateID:20240111 newsID:20240111-5 || None
dateID:20240111 newsID:20240111-6 || None
dateID:20240111 newsID:20240111-7 || None
dateID:20240112 newsID:20240112-1 || None
dateID:20240112 newsID:20240112-2 || None
dateID:20240112 newsID:20240112-3 || None
dateID:20240112 newsID:20240112-4 || None
dateID:20240112 newsID:20240112-5 || None
dateID:20240112 newsID:20240112-6 || None
dateID:20240112 newsID:20240112-7 || None
dateID:20240112 newsID:20240112-8 || None
dateID:20240112 newsID:20240112-9 || None
Scanning again the updated table...
Done deletion!
