# News of the Day

In this notebook, we'll show how to use [Unstructured.IO](https://unstructured.io/), [ChromaDB](https://www.trychroma.com/), and [LangChain](https://github.com/langchain-ai/langchain) to summarize topics from the front page of CNN Lite. Without tooling from the modern LLM stack, this would have been a time-consuming project. With Unstructured, Chroma, and LangChain, the entire workflow takes fewer than two dozen lines of code.

## Gather links with `unstructured`

First, we'll gather links from the [CNN Lite](https://lite.cnn.com/) homepage using the `partition_html` function from `unstructured`. When `unstructured` partitions HTML pages, links are included in the metadata for each element, which makes link collection a simple task.

In [1]:
from unstructured.partition.html import partition_html

In [2]:
cnn_lite_url = "https://lite.cnn.com/"

In [3]:
elements = partition_html(url=cnn_lite_url)

In [4]:
links = []
for element in elements:
    if element.metadata.links is not None:
        relative_link = element.metadata.links[0]["url"][1:]
        if relative_link.startswith("2023"):
            links.append(f"{cnn_lite_url}{relative_link}")

In [5]:
links[:5]

['https://lite.cnn.com/2023/08/07/health/breast-cancer-overdiagnosis/index.html',
 'https://lite.cnn.com/2023/08/08/investing/ups-earnings/index.html',
 'https://lite.cnn.com/2023/08/08/business/molson-coors-blue-run-spirits-acquisition/index.html',
 'https://lite.cnn.com/2023/08/08/politics/ukraine-counteroffensive-us-briefings/index.html',
 'https://lite.cnn.com/2023/08/08/europe/italian-cheesemaker-dies-italy-intl-scli/index.html']
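
The filtering step above can also be sketched as a small standalone function. This is a minimal sketch, not part of the original notebook: `collect_article_links` and the sample data are hypothetical, and the input is assumed to be one list of `{"text": ..., "url": ...}` dictionaries (or `None`) per element, mirroring `element.metadata.links`. Unlike the loop above, it looks at every link in an element rather than only the first, and it drops duplicates.

```python
from urllib.parse import urljoin


def collect_article_links(link_lists, base_url, year_prefix="2023"):
    """Return absolute article URLs, de-duplicated, in first-seen order."""
    seen = set()
    article_links = []
    for links in link_lists:
        # Elements without links may carry None (or an empty list).
        for link in links or []:
            href = link.get("url", "")
            # Keep only relative article paths such as "/2023/08/08/...".
            if not href.startswith(f"/{year_prefix}"):
                continue
            absolute = urljoin(base_url, href)
            if absolute not in seen:
                seen.add(absolute)
                article_links.append(absolute)
    return article_links


# Made-up sample data standing in for element.metadata.links values.
sample_link_lists = [
    None,
    [{"text": "UPS earnings", "url": "/2023/08/08/investing/ups-earnings/index.html"}],
    [{"text": "Terms of Use", "url": "/terms"}],
    [{"text": "UPS earnings", "url": "/2023/08/08/investing/ups-earnings/index.html"}],
]
article_links = collect_article_links(sample_link_lists, "https://lite.cnn.com/")
print(article_links)
```

In the notebook, the same function could be called as `collect_article_links((e.metadata.links for e in elements), cnn_lite_url)`.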

## Ingest individual articles with `UnstructuredURLLoader`

Now that we have the links, we can preprocess individual news articles with `UnstructuredURLLoader`. `UnstructuredURLLoader` fetches content from the web and then uses the `unstructured` `partition` function to extract content and metadata. In this example we preprocess HTML files, but it works with other response types such as `application/pdf` as well. After calling `.load()`, the result is a list of `langchain` `Document` objects.

In [6]:
from langchain.document_loaders import UnstructuredURLLoader

loaders = UnstructuredURLLoader(urls=links, show_progress_bar=True)

In [7]:
docs = loaders.load()

100%|████████████████████████████████████████| 97/97 [00:11<00:00,  8.41it/s]

In [8]:
docs[0]

Document(page_content='CNN\n\n8/8/2023\n\nOlder women’s breast cancer is often overdiagnosed, study finds, raising risk of unnecessary treatment\n\nBy Amanda Musa, CNN\n\nUpdated: \n        8:38 AM EDT, Tue August 8, 2023\n\nSource: CNN\n\nA breast cancer diagnosis is an all-too-common reality for women around the world. In the US, about 240,000 cases of breast cancer are diagnosed in women every year, the US Centers for Disease Control and Prevention estimates.\n\nHealth care providers and patients alike are usually inclined to pursue treatment to stop the disease. But some experts say that it isn’t always necessary to treat breast cancer in older women with aggressive therapy.\n\nA study published Monday in the Annals of Internal Medicine found that large numbers of American women ages 70 to 85 are potentially overdiagnosed with breast cancer and therefore could receive unnecessary treatment.\n\n“Overdiagnosis refers to this phenomenon where we find breast cancers on screening that n
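
When loading many URLs, it can be useful to check which ones did not produce usable text. The following is a rough sketch under stated assumptions: `StubDocument` is a hypothetical stand-in for a `langchain` `Document`, `find_missing_sources` is an illustrative helper, and each loaded document is assumed to record its URL under `metadata["source"]` (inspect `docs[0].metadata` before relying on that).

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in for a langchain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def find_missing_sources(urls, documents):
    """Return the URLs that have no non-empty document, in input order."""
    loaded = {
        doc.metadata.get("source")
        for doc in documents
        if doc.page_content.strip()
    }
    return [url for url in urls if url not in loaded]


# Made-up URLs and documents for illustration only.
sample_urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
sample_docs = [
    StubDocument("Article A text", {"source": "https://example.com/a"}),
    StubDocument("   ", {"source": "https://example.com/b"}),  # empty body
]
missing = find_missing_sources(sample_urls, sample_docs)
print(missing)
```

With the notebook's variables, the equivalent call would be `find_missing_sources(links, docs)`.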

## Load documents into ChromaDB

With the documents preprocessed, we're now ready to load them into ChromaDB. We accomplish this easily by using the OpenAI embeddings and the Chroma vectorstore from `langchain`. This workflow will vectorize the documents using the OpenAI embeddings endpoint, and then load the documents and associated vectors into Chroma. Once the documents are in Chroma, we can perform a similarity search to retrieve documents related to our topic of interest.

In [9]:
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings

In [10]:
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)

In [11]:
query_docs = vectorstore.similarity_search("Update on the coup in Niger.", k=1)
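
To build intuition for what a similarity search does, here is a conceptual sketch in plain Python. It is not how Chroma is implemented: the three-dimensional vectors are made up, `cosine_similarity` and `top_k` are illustrative helpers, and Chroma's actual index and distance metric may differ. The idea is only that the query and the documents are embedded as vectors, and the `k` documents whose vectors lie closest to the query are returned.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=1):
    """Return the names of the k vectors most similar to the query."""
    scored = sorted(
        indexed_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy "embeddings"; real embedding vectors have many more dimensions.
toy_index = {
    "niger-coup": [0.9, 0.1, 0.0],
    "ups-earnings": [0.1, 0.9, 0.1],
    "cheesemaker": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]
nearest = top_k(toy_query, toy_index, k=1)
print(nearest)
```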

## Summarize the documents

After retrieving relevant documents from Chroma, we're ready to summarize them! There are multiple ways to accomplish this in `langchain`, but `load_summarize_chain` is the easiest. Simply choose an LLM, load the summarization chain, and you're ready to summarize the documents. Here we limit the summary to snippets related to our topic of choice.

In [12]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

In [13]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
chain = load_summarize_chain(llm, chain_type="stuff")

In [14]:
print(chain.run(query_docs))

Niger's military has deployed reinforcements to the capital city after refusing to cede power following a deadline set by a regional bloc. The military junta took control of the country in a coup last month, leading to political chaos. The Economic Community of West African States (ECOWAS) has imposed sanctions and threatened military intervention if the junta does not step down. The situation remains uncertain, with ECOWAS leaders seeking a diplomatic solution but willing to use force as a last resort. The uncertainty has caused concern among residents, who are stocking up on supplies and attempting to flee the capital. The future of Niger's elected government is important to its democratic neighbors and Western partners, including the United States and France. Russia's mercenary group Wagner has also shown interest in the situation, potentially seeking to gain influence in the region.
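
The `"stuff"` chain type used above places all retrieved documents into a single prompt and sends it to the LLM in one call. The sketch below illustrates that idea only; `build_stuff_prompt` is a hypothetical helper, and the instruction wording and separator are assumptions rather than `langchain`'s actual prompt template.

```python
def build_stuff_prompt(texts, instruction="Write a concise summary of the following:"):
    """Concatenate all non-empty texts into one summarization prompt."""
    joined = "\n\n".join(text.strip() for text in texts if text.strip())
    return f"{instruction}\n\n{joined}\n\nCONCISE SUMMARY:"


# Made-up snippets standing in for the page_content of retrieved documents.
sample_texts = [
    "Niger's military has deployed reinforcements to the capital.",
    "ECOWAS has imposed sanctions on the junta.",
]
prompt = build_stuff_prompt(sample_texts)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason to retrieve a small number of documents (`k=1` above) and to use a model with a larger context.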
