# Text Summarizer with LangChain and Deep Lake using different approaches to deal with the context

In this lesson, we look at the challenge of summarizing text efficiently in the digital age. We discuss the "stuff," "map-reduce," and "refine" strategies for handling large amounts of text and extracting valuable information. We also highlight the customizability of LangChain, which allows personalized prompts, multilingual summaries, and storage of content in a Deep Lake vector store. With these tools, you can save time, enhance knowledge retention, and improve your understanding of various topics. Enjoy the tailored experience of data storage and summarization with LangChain.


## Workflow:

1. Extract the text from the PDF file.
2. Summarize the extracted text using LangChain with three different approaches: stuff, refine, and map_reduce.
3. Add the text chunks to a Deep Lake database and retrieve information from it.

## Summarization with LangChain

We first import the necessary classes and utilities from the LangChain library.

In [3]:
from langchain import OpenAI, LLMChain
from langchain.chains.mapreduce import MapReduceChain
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

llm = OpenAI(model_name="text-davinci-003", temperature=0)



The following code creates an instance of the RecursiveCharacterTextSplitter class, which is responsible for splitting input text into smaller chunks.

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
)

## Read the PDF file

In [5]:

# importing required modules 
from pypdf import PdfReader 
  
# creating a pdf reader object 
reader = PdfReader('data/Retrieve rerank generate.pdf') 
  
# printing number of pages in pdf file 
print(len(reader.pages)) 
  
# getting the first page from the pdf file 
page = reader.pages[0] 
  
# extracting text from page 
text = page.extract_text() 
print(text)


15
Re2G: Retrieve, Rerank, Generate
Michael Glass1, Gaetano Rossiello1, Md Faisal Mahbub Chowdhury1,
Ankita Rajaram Naik1 2,Pengshan Cai1 2,Alﬁo Gliozzo1
1IBM Research AI, Yorktown Heights, NY , USA
2University of Massachusetts Amherst, MA, USA
Abstract
As demonstrated by GPT-3 and T5, transform-
ers grow in capability as parameter spaces be-
come larger and larger. However, for tasks
that require a large amount of knowledge, non-
parametric memory allows models to grow dra-
matically with a sub-linear increase in compu-
tational cost and GPU memory requirements.
Recent models such as RAG and REALM
have introduced retrieval into conditional gen-
eration. These models incorporate neural ini-
tial retrieval from a corpus of passages. We
build on this line of research, proposing Re2G,
which combines both neural initial retrieval
and reranking into a BART-based sequence-
to-sequence generation. Our reranking ap-
proach also permits merging retrieval results
from sources with incomparable s

The text splitter created earlier is configured with a chunk_size of 1000 characters, no chunk_overlap, and spaces, commas, and newline characters as separators. This breaks the input text into manageable pieces that the language model can process efficiently.
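
To make the effect of chunk_size concrete, here is a minimal pure-Python sketch of greedy chunking on a single separator. It is a simplified illustration and not LangChain's actual algorithm, which tries each separator in turn; the function name greedy_split and the sample sentence are hypothetical.

```python
def greedy_split(text: str, chunk_size: int, separator: str = " ") -> list[str]:
    """Pack separator-delimited pieces into chunks of at most chunk_size characters.

    Simplified sketch: a single piece longer than chunk_size is kept whole.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        # Try to append the next piece to the chunk being built
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The chunk is full: store it and start a new one with this piece
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "retrieval and reranking improve generation quality"
chunks = greedy_split(sample, chunk_size=20)
print(chunks)
```

With no overlap, joining the chunks with the separator gives back the original text, which is why no content is duplicated between chunks.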

Next, we concatenate the text of every page of the PDF into a single string and split it using the .split_text() method.

In [6]:
from langchain.docstore.document import Document

# Create a string with the full text
text=""
for page in reader.pages:
    text += page.extract_text()
    text += '\n'
# Show the first 300 characters
print(text[:300])
# Split the text into chunks
texts = text_splitter.split_text(text)
docs = [Document(page_content=t) for t in texts[:4]]
# Show the number of documents
print(len(docs))

Re2G: Retrieve, Rerank, Generate
Michael Glass1, Gaetano Rossiello1, Md Faisal Mahbub Chowdhury1,
Ankita Rajaram Naik1 2,Pengshan Cai1 2,Alﬁo Gliozzo1
1IBM Research AI, Yorktown Heights, NY , USA
2University of Massachusetts Amherst, MA, USA
Abstract
As demonstrated by GPT-3 and T5, transform-
ers g
4


In [7]:
docs

[Document(page_content='Re2G: Retrieve, Rerank, Generate\nMichael Glass1, Gaetano Rossiello1, Md Faisal Mahbub Chowdhury1,\nAnkita Rajaram Naik1 2,Pengshan Cai1 2,Alﬁo Gliozzo1\n1IBM Research AI, Yorktown Heights, NY , USA\n2University of Massachusetts Amherst, MA, USA\nAbstract\nAs demonstrated by GPT-3 and T5, transform-\ners grow in capability as parameter spaces be-\ncome larger and larger. However, for tasks\nthat require a large amount of knowledge, non-\nparametric memory allows models to grow dra-\nmatically with a sub-linear increase in compu-\ntational cost and GPU memory requirements.\nRecent models such as RAG and REALM\nhave introduced retrieval into conditional gen-\neration. These models incorporate neural ini-\ntial retrieval from a corpus of passages. We\nbuild on this line of research, proposing Re2G,\nwhich combines both neural initial retrieval\nand reranking into a BART-based sequence-\nto-sequence generation. Our reranking ap-\nproach also permits merging retrieva

Each Document object is initialized with the content of a chunk from the texts list. The [:4] slice notation indicates that only the first four chunks will be used to create the Document objects.

In [8]:
from langchain.chains.summarize import load_summarize_chain
import textwrap

chain = load_summarize_chain(llm, chain_type="map_reduce")

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)

 This paper introduces Re2G, a BART-based sequence-to-sequence generation approach that combines
neural initial retrieval and reranking. It allows for a large amount of knowledge to be incorporated
into the model with a sub-linear increase in computational cost and GPU memory requirements, and
permits merging retrieval results from sources with incomparable data. The system is tested on four
diverse tasks and shows large gains of 9-34% over the previous state-of-the-art on the KILT
leaderboard. The code is available as open source.


💡
The textwrap library in Python provides a convenient way to wrap and format plain text by adjusting line breaks in an input paragraph. It is particularly useful when displaying text within a limited width, such as in console outputs, emails, or other formatted text displays. The library includes convenience functions like wrap, fill, and shorten, as well as the TextWrapper class that handles most of the work. If you’re curious, the textwrap page of the Python standard library documentation describes other functions that can be useful depending on your needs.
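
As a quick illustration of the three convenience functions mentioned above, the following sketch applies them to one sentence. The sample text and the width of 40 are arbitrary choices made for the example.

```python
import textwrap

sample = (
    "Re2G combines neural initial retrieval and reranking "
    "into a BART-based sequence-to-sequence generation."
)

# wrap() returns a list of lines, each at most `width` characters long
lines = textwrap.wrap(sample, width=40)

# fill() returns the same lines joined with newlines into one string
filled = textwrap.fill(sample, width=40)

# shorten() collapses whitespace and truncates to `width`, ending with the placeholder
short = textwrap.shorten(sample, width=40, placeholder="...")

print(filled)
print(short)
```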

The following line of code prints the default prompt template used by the map_reduce chain. In the next sections, we change the prompt and try other summarization methods.

In [9]:
print( chain.llm_chain.prompt.template )

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:
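
In the map_reduce approach, this template is applied to each chunk separately (the map step), and the partial summaries are then summarized together (the reduce step). A minimal sketch of that control flow follows. It assumes a hypothetical fake_llm function standing in for a real model call, and it is not LangChain's implementation.

```python
from typing import Callable

TEMPLATE = 'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'


def map_reduce_summarize(chunks: list[str], llm: Callable[[str], str]) -> str:
    # Map step: summarize every chunk independently with the same prompt
    partial_summaries = [llm(TEMPLATE.format(text=chunk)) for chunk in chunks]
    # Reduce step: summarize the concatenated partial summaries
    combined = "\n\n".join(partial_summaries)
    return llm(TEMPLATE.format(text=combined))


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: records the prompt, returns a numbered placeholder."""
    calls.append(prompt)
    return f"summary #{len(calls)}"


result = map_reduce_summarize(["chunk one", "chunk two", "chunk three"], fake_llm)
print(result)
print(len(calls))  # three map calls plus one reduce call
```

Because the map calls do not depend on each other, they can in principle run in parallel, at the cost of each chunk being summarized without knowledge of the others.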


### Stuff approach

The "stuff" approach is the simplest and most naive one: all the text from the document is placed in a single prompt. This method may raise exceptions if the combined text is longer than the available context size of the LLM, so it is not well suited to large amounts of text.
We’re going to experiment with the prompt below, which outputs the summary as bullet points. We then initialize the summarization chain with stuff as the chain_type and this prompt.

In [10]:
prompt_template = """Write a concise bullet point summary of the following:


{text}


CONCISE SUMMARY IN BULLET POINTS:"""

BULLET_POINT_PROMPT = PromptTemplate(template=prompt_template, 
                        input_variables=["text"])

chain = load_summarize_chain(llm, 
                             chain_type="stuff", 
                             prompt=BULLET_POINT_PROMPT)

output_summary = chain.run(docs)

wrapped_text = textwrap.fill(output_summary, 
                             width=1000,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)


- Re2G is a new approach that combines neural initial retrieval and reranking into a BART-based sequence-to-sequence generation.
- Re2G enables merging retrieval results from sources with incomparable scores, e.g. an ensemble of BM25 and neural initial retrieval.
- Re2G is trained end-to-end using a novel variation of knowledge distillation.
- Re2G is evaluated on four diverse tasks from KILT: slot filling, question answering, fact checking and dialog.
- Re2G achieves large gains in these tasks, with relative gains of 9% to 34% over the previous state-of-the-art on the KILT leaderboard.
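
Because the stuff approach places every chunk in one prompt, it can help to estimate the prompt size before calling the model. The sketch below assumes roughly four characters per token (a common rule of thumb for English text, not an exact count) and an illustrative 4,000-token budget; the helper names are hypothetical.

```python
CHARS_PER_TOKEN = 4   # assumption: rough average for English text
TOKEN_BUDGET = 4000   # assumption: illustrative context limit


def build_stuff_prompt(chunks: list[str], template: str) -> str:
    """Join all chunks and insert them into a single prompt."""
    return template.format(text="\n\n".join(chunks))


def fits_in_context(prompt: str, budget: int = TOKEN_BUDGET) -> bool:
    """Estimate the token count from the character count and compare it to the budget."""
    estimated_tokens = len(prompt) / CHARS_PER_TOKEN
    return estimated_tokens <= budget


template = "Write a concise bullet point summary of the following:\n\n{text}\n\nSUMMARY:"
small = build_stuff_prompt(["a" * 1000] * 4, template)    # four 1000-character chunks
large = build_stuff_prompt(["a" * 1000] * 40, template)   # forty 1000-character chunks
print(fits_in_context(small), fits_in_context(large))
```

When the check fails, map_reduce or refine is usually the safer choice, since both send one chunk at a time.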


### Refine approach

The 'refine' summarization chain is a method for generating more accurate and context-aware summaries. This chain type iteratively refines the summary by adding context as it goes: it first generates a summary of the first chunk, and then, for each successive chunk, it updates the work-in-progress summary with the new information from that chunk.
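
The iterative process can be sketched in a few lines of plain Python. This is an illustration of the idea only: the prompt wording is invented, fake_llm is a hypothetical stand-in for a model call, and none of it is LangChain's actual code.

```python
from typing import Callable


def refine_summarize(chunks: list[str], llm: Callable[[str], str]) -> str:
    if not chunks:
        raise ValueError("at least one chunk is required")
    # First chunk: produce an initial summary
    summary = llm(f"Write a concise summary of the following:\n\n{chunks[0]}")
    # Remaining chunks: update the running summary one chunk at a time
    for chunk in chunks[1:]:
        summary = llm(
            "Here is an existing summary:\n"
            f"{summary}\n\n"
            "Refine it using the additional context below:\n"
            f"{chunk}"
        )
    return summary


prompts: list[str] = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: records the prompt, returns a versioned placeholder."""
    prompts.append(prompt)
    return f"summary v{len(prompts)}"


final = refine_summarize(["first chunk", "second chunk", "third chunk"], fake_llm)
print(final)
```

Each call depends on the previous result, so the calls run sequentially: one model call per chunk.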

In [11]:
chain = load_summarize_chain(llm, chain_type="refine")

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)

  Re2G is a new model that combines neural initial retrieval and reranking into a BART-based
sequence-to-sequence generation. This model allows for a large amount of knowledge to be
incorporated into the model with a sub-linear increase in computational cost and GPU memory
requirements. Re2G also permits merging retrieval results from sources with incomparable data and
scores, enabling an ensemble of BM25 and neural initial retrieval. To train our system end-to-end,
we introduce a novel variation of knowledge distillation to train the initial retrieval, reranker
and generation using only ground truth on the target sequence output. We find large gains in four
diverse tasks from the KILT benchmark: zero-shot slot filling, question answering, fact checking and
dialog, with relative gains of 9% to 34% over the previous state-of-the-art. We make our code
available as open source.


## Deep Lake as a vector database
Now, we’re ready to import Deep Lake and build a database with embedded documents:

In [12]:
from langchain.vectorstores import DeepLake
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

# create Deep Lake dataset
my_activeloop_org_id = "edumunozsala"
my_activeloop_dataset_name = "langchain_course_youtube_summarizer"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
db.add_documents(docs)



Your Deep Lake dataset has been successfully created!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/edumunozsala/langchain_course_youtube_summarizer
hub://edumunozsala/langchain_course_youtube_summarizer loaded successfully.



Dataset(path='hub://edumunozsala/langchain_course_youtube_summarizer', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape     dtype  compression
  -------   -------   -------   -------  ------- 
 embedding  generic  (4, 1536)  float32   None   
    ids      text     (4, 1)      str     None   
 metadata    json     (4, 1)      str     None   
   text      text     (4, 1)      str     None   


 

['2cc11c53-6c3e-11ee-900d-cc2f714963ed',
 '2cc11c54-6c3e-11ee-a955-cc2f714963ed',
 '2cc11c55-6c3e-11ee-b861-cc2f714963ed',
 '2cc11c56-6c3e-11ee-808d-cc2f714963ed']

To retrieve information from the database, we need to construct a retriever object.

In [13]:
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['k'] = 4

The distance metric determines how the retriever measures "distance" or similarity between data points in the database. Setting distance_metric to 'cos' makes the retriever use cosine similarity, which measures the cosine of the angle between two non-zero vectors and is often used in information retrieval to compare documents or pieces of text. Setting 'k' to 4 makes the retriever return the 4 closest results according to that metric when a search is performed. Since this dataset holds only four chunks, every stored chunk is returned here.
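
For intuition, cosine similarity and top-k selection can be written in plain Python as follows. This is a toy sketch with two-dimensional vectors and hypothetical document ids; it does not reflect how Deep Lake performs the search internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors most similar to the query, best first."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings"; the real ones in this dataset have 1536 dimensions
store = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
nearest = top_k([1.0, 0.1], store, k=2)
print(nearest)
```

Cosine similarity ignores vector length and compares direction only, so a vector and any positive multiple of it score 1.0.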

In [14]:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of text from a document to answer the question as a summary in bullet points. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Summarized answer in bullet points:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

Lastly, we pass the custom prompt through the chain_type_kwargs argument and pick the ‘stuff’ variation as the chain type. You can test the other chain types as well, as seen previously.
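
With the stuff chain type, the retrieved chunks end up in the {context} placeholder and the user query fills {question}. The sketch below shows that substitution with plain str.format. Joining the chunks with a blank line is an assumption made for illustration, and the shortened template and sample chunks are invented for the example.

```python
prompt_template = """Use the following pieces of context to answer the question in bullet points.

{context}

Question: {question}
Summarized answer in bullet points:"""

# Hypothetical retrieved chunks, standing in for what a retriever would return
retrieved_chunks = [
    "Re2G combines neural initial retrieval and reranking.",
    "Re2G is evaluated on four tasks from KILT.",
]

# Assumption: chunks are joined with a blank line before being inserted
context = "\n\n".join(retrieved_chunks)
final_prompt = prompt_template.format(
    context=context, question="Summarize the mentions of Re2G"
)
print(final_prompt)
```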

In [15]:
from langchain.chains import RetrievalQA

chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=retriever,
                                 chain_type_kwargs=chain_type_kwargs)

print(qa.run("Summarize the mentions of Re2G"))


- Re2G is a new approach that combines both neural initial retrieval and reranking into a BART-based sequence-to-sequence generation.
- Re2G permits merging retrieval results from sources with incomparable scores, enabling an ensemble of BM25 and neural initial retrieval.
- Re2G is trained using a novel variation of knowledge distillation to train the initial retrieval, reranker and generation using only ground truth on the target sequence output.
- Re2G has been evaluated on four diverse tasks from KILT: slot filling, question answering, fact checking and dialog, with relative gains of 9% to 34% over the previous state-of-the-art on the KILT leaderboard.


Of course, you can always tweak the prompt to get the desired result, experiment more with modified prompts using different types of chains and find the most suitable combination. Ultimately, the choice of strategy depends on the specific needs and constraints of your project.