# Summarization pipeline

This notebook summarizes the content of a list of URLs using Unstructured, LangChain, and OpenAI. For each URL, we download the page, clean the extracted text, and send it to an LLM to produce the summary.

In [None]:
from langchain.document_loaders import UnstructuredURLLoader
from langchain.docstore.document import Document
from unstructured.cleaners.core import remove_punctuation, clean, clean_extra_whitespace
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
from tqdm import tqdm  # progress bar used in the summarization loop
import time
import tiktoken

You'll need an OpenAI API key ;) (Or you can try Cohere, which allows some free use for testing purposes.)

In [None]:
openai_key="YOUR-KEY"
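
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The following is only a sketch: the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for this example, not something the notebook requires.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY", default: str = "YOUR-KEY") -> str:
    """Return the key stored in `env_var`, or `default` when it is unset or empty.

    Both the helper and the variable name are hypothetical choices for this sketch.
    """
    value = os.environ.get(env_var, "").strip()
    return value or default


# Falls back to the placeholder when the environment variable is not defined.
openai_key = load_api_key()
```

With this in place, the key stays out of the saved notebook, and the placeholder still makes it obvious when nothing has been configured.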

In [None]:
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
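
The `"stuff"` chain used later puts the whole document into a single prompt, so a long page can exceed the model's context window. A token counter like the one above can be used to check this beforehand. The sketch below is an assumption about how such a check might look: `fits_in_context`, `truncate_to_budget`, and the `reserved_for_output` default are hypothetical, and the counter is passed in as a callable so the example runs without any tokenizer installed.

```python
from typing import Callable


def fits_in_context(text: str, count_tokens: Callable[[str], int],
                    context_limit: int, reserved_for_output: int = 256) -> bool:
    """True if the text plus the room reserved for the summary fits in the limit."""
    return count_tokens(text) + reserved_for_output <= context_limit


def truncate_to_budget(text: str, count_tokens: Callable[[str], int],
                       budget: int) -> str:
    """Keep the longest prefix of whole words whose token count is <= budget.

    Uses a binary search over the number of words, assuming the token count
    does not decrease when more words are appended.
    """
    words = text.split()
    lo, hi = 0, len(words)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_tokens(" ".join(words[:mid])) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return " ".join(words[:lo])


# Stand-in counter for the demo: one token per whitespace-separated word.
def word_count(s: str) -> int:
    return len(s.split())


print(fits_in_context("a b c", word_count, context_limit=10, reserved_for_output=5))
print(truncate_to_budget("a b c d e", word_count, budget=3))
```

In the notebook, the counter could be something like `lambda s: num_tokens_from_string(s, encoding_name)`, with the encoding name and the context limit looked up for whichever model you choose.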

In [None]:
def generate_document(url):
    "Given a URL, return a LangChain Document for further processing"
    loader = UnstructuredURLLoader(urls=[url],
                mode="elements",
                post_processors=[clean,remove_punctuation,clean_extra_whitespace])
    elements = loader.load()
    selected_elements = [e for e in elements if e.metadata['category']=="NarrativeText"]
    full_clean = " ".join([e.page_content for e in selected_elements])
    return Document(page_content=full_clean, metadata={"source":url})
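
To see what the category filter does without downloading anything, here is a small offline sketch. `FakeElement` is a hypothetical stand-in for the documents returned by the loader in `"elements"` mode, and the sample categories are made up for illustration. Note that it uses `.get("category")` rather than `['category']`, so an element without that key is skipped instead of raising a `KeyError`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeElement:
    """Hypothetical stand-in for a loaded element: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def join_narrative_text(elements) -> str:
    """Keep only elements tagged as NarrativeText and join their text with spaces."""
    selected = [e for e in elements if e.metadata.get("category") == "NarrativeText"]
    return " ".join(e.page_content for e in selected)


sample = [
    FakeElement("Site menu", {"category": "Title"}),
    FakeElement("First paragraph.", {"category": "NarrativeText"}),
    FakeElement("Share this", {"category": "ListItem"}),
    FakeElement("Second paragraph.", {"category": "NarrativeText"}),
    FakeElement("No category here"),  # missing key: skipped, not an error
]

print(join_narrative_text(sample))  # First paragraph. Second paragraph.
```

Titles, list items, and other non-narrative elements are dropped, which is why navigation text and share widgets do not end up in the prompt.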

In [None]:
#@cachier(cache_dir="cache_folder") #If running locally, this caches results across repeated calls
def summarize_document(url,model_name):
    "Given a URL, return the summary from the OpenAI model"
    # Use the requested model instead of a hard-coded one
    llm = OpenAI(model_name=model_name,temperature=0,openai_api_key=openai_key)
    # "stuff" sends the whole document to the model in a single prompt
    chain = load_summarize_chain(llm, chain_type="stuff")
    tmp_doc = generate_document(url)
    summary = chain.run([tmp_doc])
    return clean_extra_whitespace(summary)

# URLs to summarize

Fill the list with the URLs you want to summarize.

In [None]:
urls= ["https://edition.cnn.com/2007/SHOWBIZ/Movies/07/23/potter.radcliffe.reut/index.html",
       "https://edition.cnn.com/2007/US/07/13/btsc.obrien.criminallyinsane/index.html"]

# Extraction with Unstructured

In [None]:
#Computes summaries for urls
summary_unstructured_curie = {}
for url in tqdm(urls):
    summary_unstructured_curie[url] = summarize_document(url,"curie")
    #time.sleep(15) #enable for live running

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.70s/it]


In [None]:
summary_unstructured_curie

{'https://edition.cnn.com/2007/SHOWBIZ/Movies/07/23/potter.radcliffe.reut/index.html': '"LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he\'ll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I\'ll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe\'s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite hi