<a href="https://colab.research.google.com/github/fdeloscogna/Python_experiment/blob/main/post_summarization_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization pipeline

This notebook summarizes the content of a list of URLs using Unstructured, LangChain, and OpenAI. For each URL, we download the page, clean the text, and send it to an LLM to produce the summary.

In [1]:
!pip install langchain
!pip install unstructured
!pip install tiktoken
!pip install openai
!pip install cohere

Collecting langchain
  Downloading langchain-0.1.16-py3-none-any.whl (817 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.32 (from langchain)
  Downloading langchain_community-0.0.33-py3-none-any.whl (1.9 MB)
Collecting langchain-core<0.2.0,>=0.1.42 (from langchain)
  Downloading langchain_core-0.1.43-py3-none-any.whl (289 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Downl

In [2]:
# LangChain 0.1.x moved integrations into langchain_community / langchain_core;
# the old `langchain.*` import paths are deprecated.
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_core.documents import Document
from unstructured.cleaners.core import remove_punctuation, clean, clean_extra_whitespace
from langchain_community.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
import time
import tiktoken
from tqdm import tqdm
import cohere

You'll need an OpenAI API key ;) (Or you can try Cohere, which allows some free use for testing purposes.)

In [3]:
import os
openai_key = os.environ["OPENAI_API_KEY"]  # read from the environment; never commit a real key

In [8]:
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [9]:
def generate_document(url):
    """Given a URL, return a LangChain Document for further processing."""
    loader = UnstructuredURLLoader(
        urls=[url],
        mode="elements",
        post_processors=[clean, remove_punctuation, clean_extra_whitespace],
    )
    elements = loader.load()
    # Keep only body text; elements without a category are skipped instead of raising KeyError
    selected_elements = [e for e in elements if e.metadata.get("category") == "NarrativeText"]
    full_clean = " ".join(e.page_content for e in selected_elements)
    return Document(page_content=full_clean, metadata={"source": url})

In [17]:
# @cachier(cache_dir="cache_folder")  # When running locally, this caches results across repeated calls
def summarize_document(url, model_name):
    """Given a URL, return the summary produced by the given OpenAI model."""
    # Use the model_name argument instead of a hard-coded model
    llm = OpenAI(model_name=model_name, temperature=0, openai_api_key=openai_key)
    # "stuff" places the whole document into a single prompt
    chain = load_summarize_chain(llm, chain_type="stuff")
    tmp_doc = generate_document(url)
    summary = chain.run([tmp_doc])
    return clean_extra_whitespace(summary)

# URLs to summarize

Fill the list with the URLs you want to summarize.

In [18]:
urls= ["https://www.businessinsider.com/nvidia-chips-lamini-ai-amd-jensen-huang-sharon-zhou-2024-4"]

# Extraction with Unstructured

In [19]:
# Compute a summary for each URL
summary_unstructured_curie = {}
for url in tqdm(urls):
    summary_unstructured_curie[url] = summarize_document(url, "babbage-002")
    # time.sleep(15)  # enable on live runs to throttle requests

  0%|          | 0/1 [00:04<?, ?it/s]


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
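
A note on the error above: a 429 with `insufficient_quota` is a billing problem, so retrying will not help until the account has credit. For ordinary, transient rate limits, though, the commented-out `time.sleep(15)` in the loop can be replaced by a retry with exponential backoff. The helper below is a minimal sketch using only the standard library; `call_with_backoff` and its parameters are illustrative names, not part of LangChain or the OpenAI client.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Exponential delay with a little random jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)


# Hypothetical usage inside the loop above:
# summary_unstructured_curie[url] = call_with_backoff(
#     lambda: summarize_document(url, "babbage-002")
# )
```

In practice you would pass the client's rate-limit exception class as `retry_on` so that other failures are raised immediately.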

In [None]:
summary_unstructured_curie

{'https://edition.cnn.com/2007/SHOWBIZ/Movies/07/23/potter.radcliffe.reut/index.html': '"LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he\'ll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I\'ll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe\'s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite hi

In [26]:
from transformers import pipeline

# Use BART in PyTorch
# summarizer = pipeline("summarization")
# summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20)

# Use T5 in TensorFlow
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")
summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Your max_length is set to 20, but your input_length is only 13. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=6)


[{'summary_text': 'an apple a day, keeps the doctor away from the doctor .'}]

In [28]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

# Create a summarizer pipeline
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")

def fetch_document_text(url):
    # Fetch the HTML content from the URL
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract text from the HTML content
    document_text = soup.get_text()

    return document_text

def summarize_document_from_url(url, min_length=5, max_length=20):
    # Fetch document text from the URL
    document_text = fetch_document_text(url)

    # Summarize the document text
    summary = summarizer(document_text, min_length=min_length, max_length=max_length)

    return summary[0]['summary_text']

# Example URL
url = "https://www.businessinsider.com/nvidia-chips-lamini-ai-amd-jensen-huang-sharon-zhou-2024-4"

# Summarize the document from the URL
summary = summarize_document_from_url(url)

print("Summary:")
print(summary)


All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Summary:
lamini AI aims to make it easy for enterprises to train and train AI models .


In [29]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

# Create a summarizer pipeline
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")

def fetch_document_text(url):
    # Fetch the HTML content from the URL
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract text from the HTML content
    document_text = soup.get_text()

    return document_text

def summarize_document_from_url(url, chunk_size=1024, min_length=5, max_length=20):
    # Fetch document text from the URL
    document_text = fetch_document_text(url)

    # Split the document text into fixed-size chunks of chunk_size characters (not tokens)
    chunks = [document_text[i:i+chunk_size] for i in range(0, len(document_text), chunk_size)]

    # Summarize each chunk of the document text
    summarized_chunks = []
    for chunk in chunks:
        summary = summarizer(chunk, min_length=min_length, max_length=max_length)
        summarized_chunks.append(summary[0]['summary_text'])

    # Combine the summarized chunks into a single summary
    summary = " ".join(summarized_chunks)

    return summary

# Example URL
url = "https://www.businessinsider.com/nvidia-chips-lamini-ai-amd-jensen-huang-sharon-zhou-2024-4"

# Summarize the document from the URL
summary = summarize_document_from_url(url)

print("Summary:")
print(summary)


All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Summary:
business strategy Economy Finance Retail Advertising Careers Media Real Estate Small Business Tech Science AI Sustainability Enterprise Transportation Stocks Indices Crypto Currencies ETFs Lifestyle Entertainment Culture Travel Food Health . h m s Close icon Two crossed lines that form an 'X' Close Sharon Zhou, the AI founder and CEO of lamini AI, is doing just fine without Lamini AI CEO has been using rival AMD's GPUs to take her startup forward  Nvidia's chips became the hottest property of the generative AI boom . Zhou is the first person to major in both classics and computer science at Harvard . Lamini's platform exclusively uses GPUs from Nvidia's main rival, AMD cofounder and former Nvidia software architect played key role in decision-making .  AMD is on its way to building a rival system that they would eventually test . many AMD's new chip, the MI300X, is the "highest performing accelerator in an expert discusses the hardware and infrastructure needed to properly run
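
The summary above reads choppily partly because the text is cut every 1024 characters, which can split a word or sentence in half. One possible refinement is to break chunks on whitespace. The function below is a sketch of that idea; `chunk_on_whitespace` is a hypothetical helper, not something the notebook or Transformers provides, and it still counts characters rather than model tokens.

```python
def chunk_on_whitespace(text, max_chars=1024):
    """Split text into chunks of at most max_chars characters without cutting words."""
    chunks, current, current_len = [], [], 0
    for word in text.split():  # split() also collapses runs of newlines and spaces
        # Account for the joining space when the chunk already has words in it
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            # A single word longer than max_chars ends up alone in an oversized chunk
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_on_whitespace("aa bb cc dd", max_chars=5))  # ['aa bb', 'cc dd']
```

It could stand in for the list comprehension that builds `chunks` in `summarize_document_from_url`.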

In [22]:
!pip install requests
!pip install beautifulsoup4
!pip install transformers



In [20]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

# Load a pre-trained model for summarization
summarizer = pipeline("summarization")

def fetch_document_text(url):
    # Fetch the HTML content from the URL
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract text from the HTML content
    document_text = soup.get_text()

    return document_text

def summarize_document_from_url(url):
    # Fetch document text from the URL
    document_text = fetch_document_text(url)

    # Summarize the document text
    summary = summarizer(document_text, max_length=100, min_length=30, do_sample=False)

    # Extract the summarized text from the result
    summarized_text = summary[0]['summary_text']

    return summarized_text

# Example URL
url = "https://www.businessinsider.com/nvidia-chips-lamini-ai-amd-jensen-huang-sharon-zhou-2024-4"

# Summarize the document from the URL
summary = summarize_document_from_url(url)

print("Summary:")
print(summary)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (4791 > 1024). Running this sequence through the model will result in indexing errors


IndexError: index out of range in self
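
The `IndexError` follows from the warning just before it: the page tokenizes to 4791 tokens, while the model accepts at most 1024. The limit is in tokens, so the input has to be cut (or chunked) by token count, not by character count. Below is a minimal sketch of token-based truncation. `truncate_to_token_limit` is a hypothetical helper that takes any `encode`/`decode` pair; the whitespace tokenizer is only a stand-in so the example runs without downloading a model.

```python
def truncate_to_token_limit(text, encode, decode, max_tokens=1024):
    """Return text unchanged if it fits, else the decoded first max_tokens tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Toy whitespace "tokenizer" used purely for illustration
encode = str.split
decode = " ".join

print(truncate_to_token_limit("one two three four", encode, decode, max_tokens=2))  # one two
```

With a real tokenizer you would pass its own `encode` and `decode` methods, and leave some headroom below the limit for any special tokens the model adds.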

In [24]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

# Pin the model and revision so the pipeline does not fall back to an unpinned default
model_name = "sshleifer/distilbart-cnn-12-6"
revision = "a4f8f3e"

# Load the specified model for summarization
summarizer = pipeline("summarization", model=model_name, revision=revision)


def fetch_document_text(url):
    # Fetch the HTML content from the URL
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract text from the HTML content
    document_text = soup.get_text()

    return document_text

def summarize_document_from_url(url, max_chars=1024):
    # Fetch document text from the URL
    document_text = fetch_document_text(url)

    # Keep only the first max_chars characters. This is a crude guard: the model's
    # limit is 1024 tokens, not characters, so most of the page is discarded.
    truncated_document_text = document_text[:max_chars]

    # Summarize the truncated document text
    summary = summarizer(truncated_document_text, max_length=100, min_length=30, do_sample=False)

    # Extract the summarized text from the result
    summarized_text = summary[0]['summary_text']

    return summarized_text

# Example URL
url = "https://www.businessinsider.com/nvidia-chips-lamini-ai-amd-jensen-huang-sharon-zhou-2024-4"

# Summarize the document from the URL
summary = summarize_document_from_url(url)

print("Summary:")
print(summary)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Summary:
 A magnifying glass indicates, "Click to perform a search", and a vertical stack of three evenly spaced horizontal lines .      -  - "Meet Sharon Zhou, the AI Founder Doing Just Fine Without Nvidia Chips"   The AI Founder doing Just Fine without Nvidia Chips .
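
The summary above is dominated by icon descriptions and menu labels because `soup.get_text()` returns all text on the page, not just the article body. One way to reduce that noise is to keep only the text inside `<p>` tags. The sketch below does this with the standard library's `html.parser` so that it runs without extra dependencies; `ParagraphExtractor` and `extract_paragraph_text` are illustrative names, and real pages may need site-specific rules on top of this.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text found inside <p> tags, ignoring script/style content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None   # list of text pieces while inside a <p>, else None
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p" and self._buffer is None:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._buffer is not None:
            # Join the pieces and normalize whitespace
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None and not self._skip_depth:
            self._buffer.append(data)


def extract_paragraph_text(html):
    """Return the text of all closed <p> elements, joined by spaces."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.paragraphs)


sample = "<nav>Menu Home</nav><p>Hello <b>world</b>.</p><script>var x = 1;</script>"
print(extract_paragraph_text(sample))  # Hello world.
```

With BeautifulSoup, which the cells above already import, roughly the same effect comes from `" ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))`.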


In [4]:
!pip install cohere



In [5]:
import os
import cohere
co = cohere.Client(os.environ["COHERE_API_KEY"])  # read from the environment; never commit a real key

In [6]:
text ="""
Who needs Nvidia?
In April last year, Zhou and her cofounder Greg Diamos, based in Palo Alto, brought their new startup, Lamini AI, out of stealth.
In September, Zhou revealed that Lamini's platform had been building customized LLMs with customers over the past year by exclusively using GPUs from Nvidia's main rival, AMD, the chip giant run by Huang's cousin, Lisa Su.

It was a big deal given that almost everyone seemed to be exclusively obsessed with H100 — GPUs that Nvidia has struggled to meet the demand of amid supply constraints. Lamini's reveal even came with a video of Zhou teasing Nvidia about the shortage.

A few things helped the decision. For one, her cofounder Diamos played a key role in helping make the realization that GPUs other than those from Nvidia work perfectly well.

As a former Nvidia software architect, Diamos understood that while GPU hardware was vital for getting top performance out of AI models — he was, after all, the coauthor of a paper on "scaling laws" that showed the importance of computing power — software was important too.

Diamos was witness to that having worked on CUDA, the software first developed by Nvidia in the 2000s. It makes using AI models with GPUs like the H100 and Nvidia's new Blackwell chip, as simple as a plug-and-play system.

So it became clear that if another company could build a similar software ecosystem around its GPUs, there'd be no reason they couldn't compete with Nvidia. Fortunately for them, after consulting with Diamos, according to Zhou, AMD was on its way to building a rival system that they would eventually test.

"Greg and I were just jamming on things, so this has been years in the making, and then once the prototypes worked we were just like let's just double down on this," Zhou said.

More broadly, Zhou recognizes that businesses are so "excited to use LLMs," but many may not want to — or simply can't afford to — wait around for Nvidia to shore up enough supply of its GPUs to meet the demand.

It's another reason AMD has proven so valuable to her ambitions. Thanks to its GPUs being more available, Zhou was confident that Lamini could offer "infrastructure that makes meeting that skyrocketing demand" for LLMs possible.

"This is because Lamini fully utilizes LLM compute at 10x performance and makes it possible to scale quickly without supply constraints, by offering vendor-agnostic compute options, i.e. it's indiscernible to customers to run Lamini on Nvidia and AMD GPUs," she explained.

No wonder the company is ready to double down on AMD. In January, Zhou shared an image to X of the MI300X — AMD's new chip first unveiled in December by CEO Su as the "highest performing accelerator in the world" — live in production at Lamini.

Nvidia's Huang might be leading one of the most powerful companies in Silicon Valley now, but the competition is coming for him. Or as Zhou said of AMD: "They have a real horse in this race."""

In [7]:
response = co.summarize(
    text=text,
    model='command',
    length='medium',
    extractiveness='medium'
)

summary = response.summary

In [8]:
summary

'After consulting with her cofounder, former Nvidia software architect Greg Diamos, Zhou found that AMD was building a similar software ecosystem around its GPUs to compete with Nvidia. Zhou\'s company, Lamini AI, has since been building customised large language models (LLMs) with customers using AMD\'s GPUs, as opposed to Nvidia\'s, due to their better supply availability. Lamini AI has found that it can offer "infrastructure that makes meeting that skyrocketing demand" for LLMs using AMD\'s hardware, which is 10 times faster and can scale quickly without supply constraints.'