<a href="https://colab.research.google.com/github/W-Bjwa04/DLD/blob/main/notebook/googlenews_summarize_vertex_langchain-git.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarizing Google News Headlines Using the Vertex AI PaLM API & LangChain

In this tutorial, we summarize news articles gathered from Google News using the following components:

- GNews: collects news titles and metadata from Google News
- LangChain's UnstructuredURLLoader: retrieves the article content
- Vertex AI PaLM API: generates the news summary

The Vertex AI PaLM API provides access to large language models (LLMs) that can be used for a variety of tasks, including text summarization. The text of this tutorial refers to the `text-bison@001` model from the PaLM API; the model-loading cell further down uses `gemini-2.0-flash` through `langchain_google_genai`.

Reference and credit to the following resources:
- https://github.com/ranahaani/GNews
- https://alphasec.io/summarize-google-news-results-with-langchain-and-serper-api/

## Objectives:
- Learn how to use GNews API, Langchain's UnstructuredURLLoader, and Vertex PALM API to perform news summarization
- Create a news summarization function that can be used to automate the process of generating news summaries
- Gain a better understanding of the different steps involved in news summarization

## Installation & Preparation

In [1]:
# install all required packages
!pip -q install langchain langchain_community langchain-google-genai
!pip install google-cloud-aiplatform
!pip install gnews
!pip install unstructured


In [2]:
# restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [9]:
# import required packages
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
# document loaders live in the langchain_community package in recent LangChain releases
from langchain_community.document_loaders import UnstructuredURLLoader


In [None]:
# authenticate to google cloud account
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [None]:
# Google Cloud project ID
# replace with your own project ID

import vertexai

PROJECT_ID = "my-project-id"  # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location="us-central1")

## Calling GNews API to Get News Metadata
Limit the news period with the following time operators:
 - h = hours (e.g. 12h)
 - d = days (e.g. 7d)
 - m = months (e.g. 6m)
 - y = years (e.g. 1y)

Example:

`google_news.period = '3d'  # news from the last 3 days`
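
As an illustration of the period notation, the sketch below converts a period string into a `timedelta`. The helper name `period_to_timedelta` and the 30-day month and 365-day year are assumptions made for this example; they are not taken from the gnews library, which accepts the string directly.

```python
import re
from datetime import timedelta

# Approximate day counts for the coarse units. These values are assumptions
# for illustration, not values taken from the gnews library.
_UNIT_TO_DAYS = {"d": 1, "m": 30, "y": 365}


def period_to_timedelta(period: str) -> timedelta:
    """Convert a period string such as '12h' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hdmy])", period.strip())
    if match is None:
        raise ValueError(f"Unsupported period: {period!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "h":
        return timedelta(hours=amount)
    return timedelta(days=amount * _UNIT_TO_DAYS[unit])


print(period_to_timedelta("3d"))
```

A check like this can catch a mistyped period before the string is assigned to `google_news.period`.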

In [2]:
!pip install langchain_community

from gnews import GNews

google_news = GNews()
google_news.period = '1d'  # news from the last day
google_news.max_results = 5  # maximum number of results per keyword
google_news.country = 'PK'  # news from a specific country (PK = Pakistan)
google_news.language = 'en'  # news in a specific language (en = English)
google_news.exclude_websites = ['yahoo.com', 'cnn.com', 'msn.com']  # exclude news from these websites

# use a date range if required
# google_news.start_date = (2023, 1, 1)  # search from 1 January 2023
# google_news.end_date = (2023, 4, 1)  # search until 1 April 2023

# get news by keyword
news_by_keyword = google_news.get_news('Politics')


In [3]:
#check collected news metadata
news_by_keyword

[{'title': 'Kashmir conflict: how it impacts Indian, Pakistani politics - DW',
  'description': 'Kashmir conflict: how it impacts Indian, Pakistani politics  DW',
  'published date': 'Mon, 12 May 2025 15:49:47 GMT',
  'url': 'https://news.google.com/rss/articles/CBMilAFBVV95cUxOT3FacllRREdvOE5hQWJiLVVtWDdQcjVEczI0WDY2T3FoZEJIbnZYdkoycms5dHZXX3ZiMHBRQVI1M1FYblY1Tlh3RldpcDIza0N1VXZ0REwwVzR1YnN3RTZ0eFFDTGtiYW9HdzdvMlFwdXV5clItQzFLWGhGbDFycVhCWXB5ZGxNZWJmQjdkcUI2SEYt0gGUAUFVX3lxTFBuamFWTXJVQVpKa3czTVNtS0JWWXc2a0kzR3lVRDJRb1kzR3h5emZWYjZibWNQalhILXFFOXZqejBYemFnanZCWFAwWG5VSUlXOGRlc29MRktBLWxqZVdna041dl8zUkJtSFFxUVlFTWZLRGJPbHpIZ1NUempUc0hVbXZFVWJrbWdVelFueFlfdENiUEI?oc=5&hl=en-US&gl=US&ceid=US:en',
  'publisher': {'href': 'https://www.dw.com', 'title': 'DW'}},
 {'title': 'Current State of Politics in Albania and the Upcoming 2025 General Elections - New Lines Institute',
  'description': 'Current State of Politics in Albania and the Upcoming 2025 General Elections  New Lines Institute',
  

In [None]:
#test another method
#instead of search by keyword, let's retrieve top news from google-news

#get top news from the last 7 days
google_news = GNews(language='id', country='ID', period='7d',
                    start_date=None, end_date=None, max_results=10)
top_news = google_news.get_top_news()

#check collected news metadata
top_news

[{'title': 'Putra Megawati Sopiri Ganjar dan Rombongan Melaju di Atas Karpet Merah Rakernas IV PDI-P - Kompas.com - Nasional Kompas.com',
  'description': 'Putra Megawati Sopiri Ganjar dan Rombongan Melaju di Atas Karpet Merah Rakernas IV PDI-P - Kompas.com  Nasional Kompas.comEffendi Gazali: Ganjar Berhasil Jadi Bintang Rakernas PDIP | Kanal Pemilu Tepercaya  CNN IndonesiaPuan Maharani Heran Minimnya Tepuk Tangan di Rakernas PDIP: Kayak Nonton Wayang  Nasional TempoMomen Ganjar dan Jokowi Gandeng Megawati di Rakernas, PDI-P: Jauhkan Berbagai Spekulasi - Kompas.com  Nasional Kompas.comPakar Sebut Jokowi Sudah Bayangkan Ganjar Dilantik Jadi Presiden RI  detikNewsLihat Liputan Lengkap di Google Berita',
  'published date': 'Sat, 30 Sep 2023 10:31:00 GMT',
  'url': 'https://news.google.com/rss/articles/CBMie2h0dHBzOi8vbmFzaW9uYWwua29tcGFzLmNvbS9yZWFkLzIwMjMvMDkvMzAvMTczMTU5MDEvcHV0cmEtbWVnYXdhdGktc29waXJpLWdhbmphci1kYW4tcm9tYm9uZ2FuLW1lbGFqdS1kaS1hdGFzLWthcnBldC1tZXJhaNIBf2h0dHBzOi8vYW1wL

In [7]:
# collect metadata again with a keyword search
# (gnews also offers get_news_by_topic; available topics include:
# WORLD, NATION, BUSINESS, TECHNOLOGY, ENTERTAINMENT, SPORTS, SCIENCE, HEALTH)

from gnews import GNews

google_news = GNews(language='en', country='PK', max_results=5, exclude_websites=['yahoo.com', 'msn.com'])
latest_news = google_news.get_news('politics')  # keyword search for 'politics'
latest_news

[{'title': 'Kashmir conflict: how it impacts Indian, Pakistani politics - DW',
  'description': 'Kashmir conflict: how it impacts Indian, Pakistani politics  DW',
  'published date': 'Mon, 12 May 2025 15:49:47 GMT',
  'url': 'https://news.google.com/rss/articles/CBMilAFBVV95cUxOT3FacllRREdvOE5hQWJiLVVtWDdQcjVEczI0WDY2T3FoZEJIbnZYdkoycms5dHZXX3ZiMHBRQVI1M1FYblY1Tlh3RldpcDIza0N1VXZ0REwwVzR1YnN3RTZ0eFFDTGtiYW9HdzdvMlFwdXV5clItQzFLWGhGbDFycVhCWXB5ZGxNZWJmQjdkcUI2SEYt0gGUAUFVX3lxTFBuamFWTXJVQVpKa3czTVNtS0JWWXc2a0kzR3lVRDJRb1kzR3h5emZWYjZibWNQalhILXFFOXZqejBYemFnanZCWFAwWG5VSUlXOGRlc29MRktBLWxqZVdna041dl8zUkJtSFFxUVlFTWZLRGJPbHpIZ1NUempUc0hVbXZFVWJrbWdVelFueFlfdENiUEI?oc=5&hl=en-US&gl=US&ceid=US:en',
  'publisher': {'href': 'https://www.dw.com', 'title': 'DW'}},
 {'title': 'Politics & Diplomacy - Atlantic Council',
  'description': 'Politics & Diplomacy  Atlantic Council',
  'published date': 'Thu, 08 May 2025 07:00:00 GMT',
  'url': 'https://news.google.com/rss/articles/CBMiaEFVX3lxTE1ic1Jk
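
The `published date` field in the metadata above is a plain string, so ordering items by recency needs a parsing step. The sketch below shows one way to do that with the standard library; the helper name `sort_newest_first` is hypothetical, and the sample records are abbreviated copies of two items from the output above.

```python
from email.utils import parsedate_to_datetime


def sort_newest_first(items):
    """Return news items ordered by their 'published date' field, newest first."""
    return sorted(
        items,
        key=lambda item: parsedate_to_datetime(item["published date"]),
        reverse=True,
    )


# Two abbreviated records using titles and dates from the output above.
sample = [
    {"title": "Politics & Diplomacy - Atlantic Council",
     "published date": "Thu, 08 May 2025 07:00:00 GMT"},
    {"title": "Kashmir conflict: how it impacts Indian, Pakistani politics - DW",
     "published date": "Mon, 12 May 2025 15:49:47 GMT"},
]

for item in sort_newest_first(sample):
    print(item["published date"], "|", item["title"])
```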

## Extract news content
The `UnstructuredURLLoader` from the LangChain library is a useful tool for retrieving the HTML content behind a URL. It is a wrapper around the HTML partition brick from the [Unstructured](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-html) library. We use it as a news content extractor, feeding it the URLs collected in the previous steps.

In [10]:
# test extracting content from the URLs in latest_news

urls = [latest_news[0]['url'],
        latest_news[1]['url'],
        ]

loader = UnstructuredURLLoader(urls=urls)
content = loader.load()

#check news content
content

[Document(metadata={'source': 'https://news.google.com/rss/articles/CBMilAFBVV95cUxOT3FacllRREdvOE5hQWJiLVVtWDdQcjVEczI0WDY2T3FoZEJIbnZYdkoycms5dHZXX3ZiMHBRQVI1M1FYblY1Tlh3RldpcDIza0N1VXZ0REwwVzR1YnN3RTZ0eFFDTGtiYW9HdzdvMlFwdXV5clItQzFLWGhGbDFycVhCWXB5ZGxNZWJmQjdkcUI2SEYt0gGUAUFVX3lxTFBuamFWTXJVQVpKa3czTVNtS0JWWXc2a0kzR3lVRDJRb1kzR3h5emZWYjZibWNQalhILXFFOXZqejBYemFnanZCWFAwWG5VSUlXOGRlc29MRktBLWxqZVdna041dl8zUkJtSFFxUVlFTWZLRGJPbHpIZ1NUempUc0hVbXZFVWJrbWdVelFueFlfdENiUEI?oc=5&hl=en-US&gl=US&ceid=US:en'}, page_content=''),
 Document(metadata={'source': 'https://news.google.com/rss/articles/CBMiaEFVX3lxTE1ic1JkWTc5TnFfMWxhSEFxTFZORmp4LWdybkhFYmhVam4yU3NOR2dGOVF3cnVLRzdyYW5JZlAzLWxQQnJvZU1qaG9kcUxGSmpwZ05acHI3eXh6QWNjREQzeEYtaVR0bmlI?oc=5&hl=en-US&gl=US&ceid=US:en'}, page_content='')]
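
Both documents in the output above have an empty `page_content`, which would leave the model nothing to summarize. One possible cause, stated here as an assumption rather than a confirmed diagnosis, is that the `news.google.com/rss/articles/...` links point to a redirect page instead of the article itself. The sketch below shows a simple way to detect such cases before summarizing; `split_by_content` is a hypothetical helper, and `SimpleNamespace` objects stand in for LangChain `Document` instances so the example runs without network access.

```python
from types import SimpleNamespace


def split_by_content(docs):
    """Separate documents that have text from those whose page_content is blank."""
    with_text, empty = [], []
    for doc in docs:
        (with_text if doc.page_content.strip() else empty).append(doc)
    return with_text, empty


# Stand-in objects that mimic the page_content and metadata attributes of a Document.
docs = [
    SimpleNamespace(page_content="", metadata={"source": "https://example.com/a"}),
    SimpleNamespace(page_content="Some article text.", metadata={"source": "https://example.com/b"}),
]

with_text, empty = split_by_content(docs)
print(len(with_text), "with text,", len(empty), "empty")
for doc in empty:
    print("No content extracted from:", doc.metadata["source"])
```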

## Summarize News with Vertex PALM API

The next step is calling the LLM to generate the news summary (the text names `text-bison@001`; the model-loading cell below uses `gemini-2.0-flash`). We need to supply a prompt that tells the model how to summarize the text.


**Prompting**

Correct prompting is essential for getting accurate results from an LLM. The `prompt_template` tells the model to generate a news summary under the following rules:
  * The summary consists of at most 100 words
  * If the text cannot be found or an error occurs, return: "Content empty"
  * Use only material from the supplied text
  * Write the summary in Bahasa Indonesia
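
To see what the model actually receives, the sketch below fills the `{text}` placeholder of a prompt with the same wording. It uses Python's built-in `str.format` as a stand-in for `PromptTemplate.format`, so it runs without LangChain; the function name `render_prompt` and the sample article text are assumptions made for illustration.

```python
# Same rules as the tutorial's prompt, written without the extra indentation.
template = (
    "Generate summary for the following text, using the following steps:\n"
    "1. summary consists of maximum 100 words\n"
    '2. If the text cannot be found or error, return: "Content empty"\n'
    "3. Use only materials from the text supplied\n"
    "4. Create summary in Bahasa Indonesia\n"
    "\n"
    '"{text}"\n'
    "SUMMARY:"
)


def render_prompt(text: str) -> str:
    """Insert the article text into the {text} placeholder of the template."""
    return template.format(text=text)


print(render_prompt("Example article body."))
```

Printing the rendered prompt is a quick way to confirm that the article text lands between the quotes and that the prompt ends with `SUMMARY:`.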



In [None]:
#prompting to perform news summary
prompt_template = """Generate summary for the following text, using the following steps:
                     1. summary consists of maximum 100 words
                     2. If the text cannot be found or error, return: "Content empty"
                     3. Use only materials from the text supplied
                     4. Create summary in Bahasa Indonesia

                    "{text}"
                    SUMMARY:"""

prompt = PromptTemplate.from_template(prompt_template)

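# load the Gemini model
# (requires the langchain-google-genai package and a Google AI API key,
# for example in the GOOGLE_API_KEY environment variable)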
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

Wrap the summarization process inside a function that loops over a collection of news items. The `generate_summary` function performs the following steps:
- Retrieve the news content from each URL
- Generate a summary for each article
- Print the output

In [None]:
# create a function that generates a summary for each item in a list of news metadata
# the "stuff" chain passes the whole article to the model in a single prompt
chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt)

def generate_summary(docnews):
    for item in docnews:
        # load the URL and extract the news content
        loader = UnstructuredURLLoader(urls=[item['url']])
        data = loader.load()

        # summarize; invoke() returns a dict with the summary under "output_text"
        summary = chain.invoke({"input_documents": data})["output_text"]

        # show the summary for each news headline
        print(item['title'])
        print(item['publisher']['title'], item['published date'])
        print(summary, '\n')
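
The function above prints its results and calls the model even when no content was extracted. As a rough sketch of an alternative, the variant below returns the summaries as a list and skips the model call for empty articles. The name `summarize_items` and the two callables `load_text` and `summarize` are hypothetical; in the notebook they would wrap `UnstructuredURLLoader` and the summarize chain, while here they are replaced with stand-ins so the example runs without network access or a model.

```python
def summarize_items(items, load_text, summarize):
    """Return one result dict per news item.

    load_text and summarize are hypothetical callables supplied by the caller:
    load_text(url) -> str and summarize(text) -> str.
    """
    results = []
    for item in items:
        try:
            text = load_text(item["url"])
        except Exception as exc:  # a failed download should not stop the loop
            text = ""
            print(f"Could not load {item['url']}: {exc}")
        # Skip the model call when there is nothing to summarize.
        summary = summarize(text) if text.strip() else "Content empty"
        results.append({"title": item["title"], "summary": summary})
    return results


# Usage with stand-in functions; no network or model is needed.
fake_pages = {"https://example.com/a": "Article text.", "https://example.com/b": ""}
results = summarize_items(
    [
        {"title": "A", "url": "https://example.com/a"},
        {"title": "B", "url": "https://example.com/b"},
    ],
    load_text=lambda url: fake_pages[url],
    summarize=lambda text: text.upper(),
)
print(results)
```

Passing the loader and the summarizer as arguments keeps the loop easy to test and makes it possible to swap models without touching the loop itself.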

In [None]:
#call the function and generate summary for news by keyword
generate_summary(news_by_keyword)

Tarif Kereta Cepat Jakarta-Kota Bandung Rp 350 Ribu, Mahal? - CNBC Indonesia
CNBC Indonesia Sat, 30 Sep 2023 09:45:00 GMT
 Kereta Cepat Jakarta-Bandung akan diresmikan pada 2 Oktober 2023. Tarifnya diperkirakan sekitar Rp300.000-Rp350.000 untuk kelas ekonomi. Setelah uji coba gratis, tiket akan dikenakan biaya. Presiden Jokowi ingin harga tiket terjangkau dan bisa didiskon untuk menarik minat masyarakat. 

Siaran Pers: Kereta Cepat "Whoosh" Diharapkan Perkuat Capaian ... - Kemenparekraf
Kemenparekraf Sat, 30 Sep 2023 05:53:06 GMT
 Kereta Cepat Jakarta-Bandung atau "Whoosh" akan resmi beroperasi mulai 2 Oktober 2023. Kereta cepat ini diharapkan dapat memperkuat capaian target wisatawan nusantara dan mancanegara di tahun 2023. 

Kemenparekraf dikatakan Dessy senantiasa mendorong pelaku industri agar mulai membuat paket-paket perjalanan wisata dengan memasukkan kereta cepat sebagai salah satu daya tarik ataupun transportasi pilihan. 

Kereta Cepat Jakarta-Bandung "Whoosh" terbagi dalam ti

In [None]:
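# call the function and generate summaries for news by topic
# news_by_topic is not defined in an earlier cell, so it is fetched here;
# 'SPORTS' is one of the available topics and is used as an example
news_by_topic = google_news.get_news_by_topic('SPORTS')
generate_summary(news_by_topic)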

Tottenham Vs Liverpool: Badan Wasit Ngaku Salah, Gol Luis Diaz Harusnya Sah - detikSport
detikSport Sat, 30 Sep 2023 23:00:00 GMT
 Tottenham Hotspur vs Liverpool: Badan Wasit Profesional Inggris (PGMOL) mengakui adanya kesalahan dalam pertandingan tersebut. Gol Luis Diaz yang dianulir seharusnya sah. PGMOL akan melakukan tinjauan penuh atas insiden tersebut. 

MU Vs Palace: Setan Merah Kalah! - detikSport
detikSport Sat, 30 Sep 2023 15:58:02 GMT
 Manchester United kalah 0-1 dari Crystal Palace di Old Trafford pada lanjutan Liga Inggris, Sabtu (30/9/2023) malam WIB. Gol tunggal kemenangan Palace dicetak oleh Joachim Andersen pada menit ke-25. MU gagal memanfaatkan sejumlah peluang dan kesulitan menembus pertahanan Palace yang disiplin. Kekalahan ini membuat MU tertahan di posisi 10 klasemen sementara dengan sembilan poin dari tujuh pertandingan. 

Rating Pemain Manchester City Versus Wolverhampton Wanderers: Erling Haaland Cuma Sekadar Kameo Dalam Kekalahan Pertama - Goal.com
Goal.com S