# News of the Day

In this notebook, we'll show how to use [Unstructured.IO](https://unstructured.io/), [ChromaDB](https://www.trychroma.com/), and [LangChain](https://github.com/langchain-ai/langchain) to summarize topics from the front page of CNN Lite. Without tooling from the modern LLM stack, this would have been a time-consuming project. With Unstructured, Chroma, and LangChain, the entire workflow takes fewer than two dozen lines of code.

## Gather links with `unstructured`

First, we'll gather links from the [CNN Lite](https://lite.cnn.com/) homepage using the `partition_html` function from `unstructured`. When `unstructured` partitions HTML pages, links are included in the metadata for each element, making link collection a simple task.

In [1]:
from unstructured.partition.html import partition_html

In [2]:
cnn_lite_url = "https://lite.cnn.com/"

In [3]:
elements = partition_html(url=cnn_lite_url)

In [4]:
links = []

for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")

In [5]:
len(links)

98
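
The filtering step above can also be sketched as a small standalone function, which makes it easy to check without fetching the page. This is only a sketch under stated assumptions: `collect_article_links` is a hypothetical helper, the sample paths are made up, and each inner list stands in for one element's `metadata.link_urls`. It uses `urljoin` from the standard library instead of slicing off the leading slash, which gives the same result for root-relative paths.

```python
from urllib.parse import urljoin


def collect_article_links(link_lists, base_url, path_prefix):
    """Build absolute URLs from the first link of each element.

    link_lists: one entry per element, either None or a list of relative URLs
    (a stand-in for element.metadata.link_urls).
    Only links whose path starts with path_prefix (for example "2024") are kept.
    """
    links = []
    for link_urls in link_lists:
        if not link_urls:
            continue  # element has no links
        first = link_urls[0]
        if first.lstrip("/").startswith(path_prefix):
            links.append(urljoin(base_url, first))
    return links


# Hypothetical sample data, not real CNN Lite paths.
sample_link_lists = [
    ["/2024/03/13/politics/example-story"],
    None,
    ["/terms"],
    ["/2024/03/12/business/another-story", "/ignored-second-link"],
]

article_links = collect_article_links(sample_link_lists, "https://lite.cnn.com/", "2024")
print(article_links)
```

Like the cell above, the sketch looks only at the first link of each element and does not remove duplicates.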

## Ingest individual articles with `UnstructuredURLLoader`

Now that we have the links, we can preprocess individual news articles with `UnstructuredURLLoader`. `UnstructuredURLLoader` fetches content from the web and then uses the `unstructured` `partition` function to extract content and metadata. In this example we preprocess HTML files, but the loader also works with other response types such as `application/pdf`. After calling `.load()`, the result is a list of `langchain` `Document` objects.

In [6]:
from langchain.document_loaders import UnstructuredURLLoader

loaders = UnstructuredURLLoader(urls=links[:20], show_progress_bar=True)

In [7]:
docs = loaders.load()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.67it/s]


In [8]:
docs[0]

Document(page_content='CNN\n\n3/13/2024\n\nRFK Jr.’s VP prospect Aaron Rodgers has shared\xa0false\xa0Sandy Hook conspiracy theories\xa0in private conversations\n\nBy Pamela Brown and Jake Tapper, CNN\n\nUpdated: \n        5:33 PM EDT, Wed March 13, 2024\n\nSource: CNN\n\nIndependent presidential candidate Robert F. Kennedy Jr. has confirmed that among his potential vice-presidential prospects is New York Jets quarterback Aaron Rodgers,\xa0who\xa0in private conversations shared deranged conspiracy theories\xa0about the 2012 Sandy Hook school shooting not being real.\n\nCNN knows of two people with whom Rodgers has enthusiastically shared these stories,\xa0including with Pamela Brown, one of the journalists writing this piece.\n\nBrown was covering the Kentucky Derby for CNN in\xa02013\xa0when she was introduced to Rodgers, then with the Green Bay Packers, at a post-Derby party. Hearing that she was a journalist with CNN, Rodgers immediately began attacking the news media for covering u

## Load documents into ChromaDB

With the documents preprocessed, we're now ready to load them into ChromaDB. We accomplish this easily by using the OpenAI embeddings and the Chroma vectorstore from `langchain`. This workflow will vectorize the documents using the OpenAI embeddings endpoint, and then load the documents and associated vectors into Chroma. Once the documents are in Chroma, we can perform a similarity search to retrieve documents related to our topic of interest.

In [9]:
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings

In [10]:
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)

In [11]:
query_docs = vectorstore.similarity_search(
    "What is behind the rapid increase in car insurance rates?", k=1
)
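
To build some intuition for what a similarity search does, here is a minimal pure-Python sketch that ranks toy vectors against a query vector and returns the top `k`. Everything in it is an assumption made for illustration: the three-dimensional vectors are made up (real embeddings have many more dimensions), `cosine_similarity` and `top_k` are hypothetical helpers, and cosine similarity is used only as an example. The distance metric and index that Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=1):
    """Return the names of the k documents most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Made-up toy "embeddings" keyed by a short document label.
toy_vectors = {
    "car-insurance": [0.9, 0.1, 0.0],
    "election": [0.0, 0.2, 0.9],
    "weather": [0.1, 0.9, 0.1],
}
toy_query = [0.8, 0.2, 0.1]

nearest = top_k(toy_query, toy_vectors, k=1)
print(nearest)
```

With `k=1`, as in the query above, only the single closest document is returned, so the summary in the next section is based on one article.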

## Summarize the Documents

After retrieving relevant documents from Chroma, we're ready to summarize them! There are multiple ways to accomplish this in `langchain`, but `load_summarize_chain` is the easiest. Simply choose an LLM, load the summarization chain, and you're ready to summarize the documents. Here we limit the summary to snippets related to our topic of choice.

In [12]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

In [13]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
chain = load_summarize_chain(llm, chain_type="stuff")

In [14]:
print(chain.run(query_docs))

Car insurance rates in the US have increased by almost 21% in the past year, contributing to the overall inflation rate. The rise can be attributed to rising car repair costs, more severe and frequent car accidents, and riskier driving behaviors. The increase in rates varies by state, with Nevada experiencing the highest jump and North Carolina the smallest. While rates are expected to moderate nationally in the second half of 2024, some markets may continue to see increases.
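
The `"stuff"` chain type places all of the retrieved documents into a single prompt and sends that prompt to the LLM in one call. The sketch below illustrates the idea without calling any model. It rests on assumptions: `build_stuff_prompt` is a hypothetical helper, and the prompt wording and separator are illustrative rather than a copy of what `langchain` sends.

```python
def build_stuff_prompt(texts, separator="\n\n"):
    """Join all document texts and wrap them in one summarization prompt.

    The prompt wording here is illustrative only, not LangChain's exact template.
    """
    combined = separator.join(texts)
    return (
        "Write a concise summary of the following:\n\n"
        f"{combined}\n\n"
        "CONCISE SUMMARY:"
    )


# Placeholder texts standing in for the page_content of retrieved documents.
example_prompt = build_stuff_prompt(["first article text", "second article text"])
print(example_prompt)
```

Because everything goes into one prompt, this approach only works when the combined documents fit within the model's context window, which is one reason to retrieve a small number of documents and use a model with a larger context.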
