# News of the Day

In this notebook, we'll show how to use [Unstructured.IO](https://unstructured.io/), [ChromaDB](https://www.trychroma.com/), and [LangChain](https://github.com/langchain-ai/langchain) to summarize topics from the front page of CNN Lite. Without tooling from the modern LLM stack, this would have been a time-consuming project. With Unstructured, Chroma, and LangChain, the entire workflow takes fewer than two dozen lines of code.

## Gather links with `unstructured`

First, we'll gather links from the [CNN Lite](https://lite.cnn.com/) homepage using the `partition_html` function from `unstructured`. When `unstructured` partitions HTML pages, links are included in the metadata for each element, making link collection a simple task.

In [1]:
from unstructured.partition.html import partition_html

In [2]:
cnn_lite_url = "https://lite.cnn.com/"

In [3]:
elements = partition_html(url=cnn_lite_url)

In [4]:
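links = []
for element in elements:
    # Skip elements that carry no links (None or an empty list).
    if not element.metadata.links:
        continue
    # Drop the leading "/" so the path can be appended to the base URL.
    relative_link = element.metadata.links[0]["url"][1:]
    # Article paths in this crawl begin with the publication year.
    if relative_link.startswith("2023"):
        links.append(f"{cnn_lite_url}{relative_link}")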

In [5]:
links[:5]

['https://lite.cnn.com/2023/08/07/entertainment/william-friedkin-death/index.html',
 'https://lite.cnn.com/2023/08/07/us/alabama-boat-dock-fight-warrants/index.html',
 'https://lite.cnn.com/2023/08/07/us/extreme-heat-death-toll-underestimate-climate/index.html',
 'https://lite.cnn.com/2023/08/06/us/charles-gregory-missing-st-augustine/index.html',
 'https://lite.cnn.com/2023/08/07/opinions/womens-world-cup-morocco-nigeria-south-africa-jamaica-aziz/index.html']
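
The filter above is tied to a single year. As a minimal sketch of a more general variant, the helper below assumes that article paths start with a `/YYYY/MM/DD/` segment, as the sample output suggests. The name `collect_article_links`, the regular expression, and the `/terms` sample path are illustrative assumptions, not part of `unstructured`.

```python
import re
from urllib.parse import urljoin

# Assumption: article paths begin with a /YYYY/MM/DD/ date segment.
ARTICLE_PATH = re.compile(r"^/\d{4}/\d{2}/\d{2}/")


def collect_article_links(base_url, hrefs):
    """Return absolute article URLs, de-duplicated, in first-seen order."""
    seen = set()
    article_links = []
    for href in hrefs:
        if not ARTICLE_PATH.match(href):
            continue  # skip navigation, footer, and other non-article links
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            article_links.append(absolute)
    return article_links


# Hypothetical hrefs; "/terms" stands in for any non-article link.
sample_hrefs = [
    "/",
    "/2023/08/07/us/alabama-boat-dock-fight-warrants/index.html",
    "/2023/08/07/us/alabama-boat-dock-fight-warrants/index.html",
    "/terms",
]
print(collect_article_links("https://lite.cnn.com/", sample_hrefs))
```

With the notebook's data, the `hrefs` argument could be built from the first link of each element that has one, mirroring the loop above.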

## Ingest individual articles with `UnstructuredURLLoader`

Now that we have the links, we can preprocess individual news articles with `UnstructuredURLLoader`. `UnstructuredURLLoader` fetches content from the web and then uses the `unstructured` `partition` function to extract content and metadata. In this example we preprocess HTML files, but it works with other response types such as `application/pdf` as well. After calling `.load()`, the result is a list of `langchain` `Document` objects.

In [6]:
from langchain.document_loaders import UnstructuredURLLoader

loaders = UnstructuredURLLoader(
    urls=links, mode="elements", show_progress_bar=True,
)

In [7]:
docs = loaders.load()

100%|█████████████████████████████████████████████████████████████████████| 94/94 [00:11<00:00,  8.34it/s]


In [8]:
docs[2]

Document(page_content='William Friedkin, ‘Exorcist’ director, dead at 87', metadata={'filetype': 'text/html', 'page_number': 1, 'url': 'https://lite.cnn.com/2023/08/07/entertainment/william-friedkin-death/index.html', 'category': 'Title'})
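
Because `mode="elements"` yields one `Document` per page element, a single article is spread across many entries. If whole-article text is needed, one possible approach is to group elements by the `url` metadata key shown above. The sketch below uses plain `(text, metadata)` tuples as stand-ins so it runs without `langchain`; `group_text_by_url` and the placeholder body text are hypothetical.

```python
from collections import defaultdict


def group_text_by_url(records):
    """Join element texts that share a URL, preserving element order."""
    grouped = defaultdict(list)
    for text, metadata in records:
        grouped[metadata["url"]].append(text)
    return {url: "\n".join(parts) for url, parts in grouped.items()}


# Stand-in records. With real data, something like
# [(d.page_content, d.metadata) for d in docs] would play this role.
url = "https://lite.cnn.com/2023/08/07/entertainment/william-friedkin-death/index.html"
records = [
    ("William Friedkin, ‘Exorcist’ director, dead at 87", {"url": url, "category": "Title"}),
    ("(placeholder body text)", {"url": url, "category": "NarrativeText"}),
]
articles = group_text_by_url(records)
print(articles[url])
```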

## Load documents into ChromaDB

With the documents preprocessed, we're now ready to load them into ChromaDB. We accomplish this easily by using the OpenAI embeddings and the Chroma vectorstore from `langchain`. This workflow will vectorize the documents using the OpenAI embeddings endpoint, and then load the documents and associated vectors into Chroma. Once the documents are in Chroma, we can perform a similarity search to retrieve documents related to our topic of interest.

In [9]:
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings

In [10]:
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)

In [11]:
query_docs = vectorstore.similarity_search("Update on the coup in Niger.", k=10)
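
Conceptually, a similarity search embeds the query and returns the `k` stored texts whose vectors lie closest to it. The pure-Python sketch below illustrates that idea with cosine similarity on tiny made-up vectors. It is not how Chroma is implemented: the distance metric and index Chroma actually uses may differ, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up two-dimensional "embeddings"; real embeddings have far more dimensions.
stored = [
    ("doc A", [1.0, 0.0]),
    ("doc B", [0.0, 1.0]),
    ("doc C", [0.9, 0.1]),
]
print(top_k([1.0, 0.0], stored, k=2))  # ['doc A', 'doc C']
```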

## Summarize the documents

After retrieving relevant documents from Chroma, we're ready to summarize them! There are multiple ways to accomplish this in `langchain`, but `load_summarize_chain` is the easiest. Simply choose an LLM, load the summarization chain, and you're ready to summarize the documents. Because the chain only sees the documents returned by the similarity search, the summary is limited to snippets related to our topic of choice.

In [12]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

In [13]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
chain = load_summarize_chain(llm, chain_type="stuff")

In [14]:
print(chain.run(query_docs))

The military coup in Niger took place in late July, with President Mohamed Bazoum being seized by the presidential guard. National institutions were shut down and protests from both sides ensued. The coup leaders are now facing a deadline to give up power or face possible military action from neighboring countries. Pro-coup protests have been taking place throughout the country. The Economic Community of West African States (ECOWAS) has enacted sanctions and issued an ultimatum to the military junta. The United States and some Western nations have condemned the coup, and Niger's armed forces have brought in reinforcements in preparation for potential military intervention. As the deadline expired, Niger's airspace closed due to the threat of intervention from neighboring countries.
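
The `"stuff"` chain type places all retrieved documents into a single prompt and makes one call to the model. The sketch below shows that idea in plain Python. The helper `stuff_prompt` and its instruction wording are assumptions for illustration, not the prompt LangChain actually sends.

```python
def stuff_prompt(texts, instruction="Write a concise summary of the following:"):
    """Concatenate all texts into one prompt, separated by blank lines."""
    joined = "\n\n".join(texts)
    return f"{instruction}\n\n{joined}"


# Placeholder snippets. With real data, the texts could come from
# [d.page_content for d in query_docs].
snippets = ["first retrieved snippet", "second retrieved snippet"]
print(stuff_prompt(snippets))
```

Since everything goes into one request, the combined text has to fit within the model's context window, which is presumably why a 16k-context model is chosen above.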
