# Text Summarization

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/digillia/Digillia-Colab/blob/main/use-cases/text-summarization.ipynb)

## Load PDFs to Summarize

In [None]:
# !pip3 install -U pypdf

In [11]:
import os
from pypdf import PdfReader

In [15]:
# Directory containing the PDF files
pdf_directory = "./text-summarization"

# Text of every page, across all PDFs
texts = []

for file_name in os.listdir(pdf_directory):
    # Skip anything that is not a PDF
    if file_name.endswith('.pdf'):
        reader = PdfReader(os.path.join(pdf_directory, file_name))
        # Extract the text of each page
        for page in reader.pages:
            texts.append(page.extract_text())

print(len(texts))
print(texts[4])

68
1. G enerative AI’s impact on 
productivity could add trillions 
of dollars in value to the global economy. Our latest research 
estimates that generative AI could add the equivalent of $2.6 trillion to $4.4 trillion annually across the 63 use cases we analyzed—by comparison, the United Kingdom’s entire GDP in 2021 was $3.1 trillion. This would increase the impact of all artificial intelligence by 15 to 40 percent. This estimate would roughly double if we include the impact of embedding generative AI into software that is currently used for other tasks beyond those use cases.
2.
 A
bout 75 percent of the value that 
generative AI use cases could deliver falls across four areas: Customer operations, marketing and sales, software engineering, and R&D . Across 16 business 
functions, we examined 63 use cases in which the technology can address specific business challenges in ways that produce one or more measurable outcomes. Examples include generative AI’s ability to support interacti
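The extracted page above keeps the PDF's layout line breaks, so sentences are split mid-line. A minimal sketch of a cleanup step, assuming that collapsing all whitespace is acceptable for summarization input; `normalize_page_text` is a hypothetical helper, not part of pypdf, and it does not repair split drop caps such as "G enerative".

```python
import re

def normalize_page_text(raw: str) -> str:
    """Collapse layout whitespace left over from PDF text extraction."""
    # Treat every run of whitespace (spaces, tabs, newlines) as a single space.
    return re.sub(r"\s+", " ", raw).strip()

sample = "productivity could add trillions \nof dollars in value to the global economy."
print(normalize_page_text(sample))
```

It could be applied to each page before summarizing, for example `texts = [normalize_page_text(t) for t in texts]`.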

## Load OpenAI Key

In [None]:
# !pip3 install -U openai

In [None]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read the local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

## Text Summarization using LlamaIndex

Sources:
- https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html

In [27]:
# !pip3 install -U llama-index

In [29]:
import openai
from llama_index.llms import OpenAI
from llama_index import (
  SimpleDirectoryReader,
  ListIndex,
  ServiceContext,
  get_response_synthesizer,
)
from llama_index.schema import Document
from llama_index.indices.document_summary import DocumentSummaryIndex

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [30]:
documents = SimpleDirectoryReader(pdf_directory).load_data()
# documents = list(map(lambda x: Document(text=x), texts))

# LLM (gpt-3.5-turbo)
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=1024)

# VectorStoreIndex
# index = VectorStoreIndex.from_documents(documents)

# DocumentSummaryIndex
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

query_engine = index.as_query_engine()
print(query_engine.query("Could you summarize the given context? Return your response which covers the key points of the text and does not miss anything important, please."))


Parsing nodes:   0%|          | 0/68 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/68 [00:00<?, ?it/s]

current doc id: 3ac07eac-9adf-472d-a7b5-4e324ebb21ca


current doc id: 2193fd32-41b8-4a60-98af-881c784a70fb
current doc id: 5f28876d-6f4f-4da3-9626-808f6332d481
current doc id: d1435ec5-5395-4cef-90ad-2832e6d4c56b
current doc id: 00fef8ff-0942-422a-bee0-531840e30960
current doc id: 9e9e4717-0dd8-4795-9d98-6496cb805261
current doc id: e71c4da8-56b0-4dc0-af4f-c87629fc9704
current doc id: 2a247adc-9b33-48a7-8929-81d1b1742ca7
current doc id: 1874a493-12bd-4567-9b94-1f49292a1d9b
current doc id: 8f3973da-f26f-42bf-87d8-da1b50658178
current doc id: ad57cac9-9b90-414c-95a1-858eba560ad1
current doc id: 52ce632b-dc81-40e0-b045-d3f7c9a590dc
current doc id: ed4360ed-cf31-4e56-b574-b9252916f49e
current doc id: 1c768725-5f7b-4ce0-be43-04324c39adf6
current doc id: 5edfcef9-30bd-4650-907f-dc40b9f0b3b1
current doc id: da4db72e-2cc0-4dfb-b7ac-5032a7d11cc8
current doc id: 5e1331a8-6739-4aed-a0e5-f6bc6f34997c
current doc id: c5b2a214-87c9-4273-ad8b-a4701763434d
current doc id: de5b743a-23d8-42b8-b76f-d3cf62393abd
current doc id: 8152ff12-ca0a-4ed7-b7f7-51589f

Generating embeddings:   0%|          | 0/68 [00:00<?, ?it/s]

The given context is an acknowledgment section of a report titled "The economic potential of generative AI: The next productivity frontier." The research for this report was led by several individuals from McKinsey's offices around the world. The project team included various members who contributed to the research. The report also acknowledges the expertise and perspectives of many other McKinsey colleagues. External advisers, including Martin Neil Baily, Ethan Mollick, Éric Moulines, and Gaël Richard, provided additional insights. The report was edited by Stephanie Strom and David DeLallo, with contributions from other colleagues. The research is independent and funded by McKinsey partners, and any errors in the analysis are the responsibility of the authors.


## Text Summarization using LangChain

Sources:
- https://python.langchain.com/docs/use_cases/summarization

In [34]:
# !pip3 install -U python-dotenv
# !pip3 install -U langchain
# !pip3 install -U langchain-openai

### PDFs Summarization (MapReduce Approach)

In [46]:
from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFDirectoryLoader
#from langchain_community.document_loaders import WebBaseLoader

from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

In [56]:
documents = PyPDFDirectoryLoader(pdf_directory).load()
# documents = list(map(lambda x: Document(page_content=x), texts))

llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, temperature=0)

chain = load_summarize_chain(llm, chain_type="map_reduce", token_max=3000)
chain.run(documents)

'Generative AI has the potential to revolutionize industries and improve productivity by automating work activities. It can transform customer operations, marketing and sales, software engineering, and R&D. However, there are challenges to address, such as managing risks and ensuring a smooth transition for workers. Generative AI has the potential to add trillions of dollars to the global economy, but it will take time to fully realize its benefits. The adoption of automation lags behind its potential, but generative AI has accelerated the potential for automation. It is expected to have the biggest impact on knowledge work and more-educated workers. The deployment of generative AI could boost productivity growth and compensate for declining employment growth. However, workers will need support and assistance to adapt to new roles. Stakeholders need to address the risks and challenges associated with generative AI and prepare for its opportunities. The article also discusses the factor

In [52]:
from langchain.chains.llm import LLMChain
#from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.text_splitter import CharacterTextSplitter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

In [55]:
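# Longer prompt variant, kept for reference; it is overridden by the assignment below.
template = """The following is a set of documents
{docs}
Based on this list of docs, please identify the main themes
Helpful Answer:"""

# Prompt actually used by the chain
template = "Summarize the main themes in these retrieved docs: {docs}"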
prompt = PromptTemplate.from_template(template)

# Chain
def format_docs(documents):
    return "\n\n".join(doc.page_content for doc in documents)

# chain = LLMChain(llm=llm, prompt=prompt)
chain = {"docs": format_docs} | prompt | llm | StrOutputParser()

# Run
chain.invoke(documents)

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens. However, your messages resulted in 31547 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
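The error above reports 31547 tokens against a 4097-token limit, because every page is stuffed into a single prompt. One way around this is to group pages into batches that each fit the budget, summarize each batch, and then combine the results. The sketch below covers only the batching step. It is plain Python rather than a LangChain API; `approx_token_count` and `batch_by_token_budget` are hypothetical names, and the four-characters-per-token estimate is an assumption that a real tokenizer should replace.

```python
from typing import Callable, List

def approx_token_count(text: str) -> int:
    """Rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)

def batch_by_token_budget(
    pages: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[List[str]]:
    """Group consecutive pages so each group stays within max_tokens."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for page in pages:
        cost = count_tokens(page)
        # Start a new batch when adding this page would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current = []
            used = 0
        current.append(page)
        used += cost
    if current:
        batches.append(current)
    return batches

pages = ["a" * 4000, "b" * 4000, "c" * 4000, "d" * 2000]
batches = batch_by_token_budget(pages, max_tokens=3000)
print([len(b) for b in batches])  # -> [3, 1]
```

A page that is larger than the budget on its own still ends up alone in a batch, so it would need to be split further before being sent to the model.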

## Text Summarization using Google T5

Sources:
- https://huggingface.co/docs/transformers/tasks/summarization
- https://huggingface.co/docs/transformers/model_doc/t5
- https://www.analyticsvidhya.com/blog/2023/06/pdf-summarization-with-transformers-in-python/
- https://blog.research.google/2022/03/auto-generated-summaries-in-google-docs.html
- https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html
- https://medium.com/gopenai/text-summarization-using-flan-t5-5ded2e4ce182
- https://medium.com/artificialis/t5-for-text-summarization-in-7-lines-of-code-b665c9e40771

In [20]:
# !pip3 install -U torch
# !pip3 install -U tqdm
# !pip3 install -U transformers


# To clear the model cache:
# !pip3 install -U "huggingface_hub[cli]"
# !huggingface-cli delete-cache

In [22]:
import torch
from tqdm import tqdm

# from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

### Initialize the Model and Tokenizer

In [17]:
# https://huggingface.co/docs/transformers/model_doc/t5
tr_name = 't5-base'

# https://huggingface.co/docs/transformers/model_doc/t5v1.1
# tr_name = 'google/t5-v1_1-base'

# tokenizer = T5Tokenizer.from_pretrained(tr_name)
# model = T5ForConditionalGeneration.from_pretrained(tr_name)

tokenizer = AutoTokenizer.from_pretrained(tr_name)
model = AutoModelForSeq2SeqLM.from_pretrained(tr_name, return_dict=True)

### Simple Summarization

In [10]:
text = ("Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains.[11] The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business.[12][13] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data.[14][15] In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.[16]")
inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
output = model.generate(inputs, min_length=80, max_length=100)
summary = tokenizer.decode(output[0])
print(summary)

<pad> data science is an interdisciplinary field focused on extracting knowledge from typically large data sets. it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, graphic design, complex systems, communication and business. the field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions.</s>
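The printed summary starts with `<pad>` and ends with `</s>`, which are T5 special tokens. Passing `skip_special_tokens=True` to `tokenizer.decode` is the usual way to drop them at decode time. As a minimal sketch, the same cleanup can be done on a string that has already been decoded; `strip_special_tokens` is a hypothetical helper and its default token list is an assumption based on the output above.

```python
def strip_special_tokens(decoded: str, special_tokens=("<pad>", "</s>")) -> str:
    """Remove tokenizer special tokens from an already-decoded string."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    # Collapse any whitespace left behind by the removed tokens.
    return " ".join(decoded.split())

raw = "<pad> data science is an interdisciplinary field.</s>"
print(strip_special_tokens(raw))  # -> data science is an interdisciplinary field.
```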


### PDFs Summarization (MapReduce Approach)

In [24]:
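# AutoTokenizer and AutoModelForSeq2SeqLM are already imported above;
# `tokenizer` and `model` come from the initialization cell.
summaries = []

def summarize(text):
    # Encode with the T5 "summarize: " task prefix, truncating long inputs
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
    # Beam search with a length penalty that favors longer summaries
    outputs = model.generate(inputs, max_length=1000, min_length=100, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0])

# Map: summarize each page
for text in tqdm(texts):
    summaries.append(summarize(text))

# Reduce: summarize the concatenated page summaries
summary = summarize(" ".join(summaries))
print(summary)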

100%|██████████| 68/68 [11:10<00:00,  9.86s/it]


<pad> a broader set of stakeholders are grappling with generative AI’s potential to deliver value across the global economy. generative AI could enable labor productivity growth of 0.1 to 0.6 percent annually through 2040, depending on the rate of technology adoption and redeployment of worker time. a broader set of stakeholders are grappling with generative AI’s potential to deliver value across the global economy. a broader set of stakeholders are grappling with generative AI’s impact on business and society.</s>
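In the reduce step above, 68 page summaries of at least 100 tokens each are joined and then truncated to 1024 tokens, so the final pass likely sees only the first handful of pages. A hedged alternative is to reduce in several rounds, summarizing small groups until one summary remains. The sketch below assumes a `summarize_fn` callable with the same shape as the notebook's `summarize(text)`; `hierarchical_reduce` and the default `group_size` of 8 are illustrative choices, not part of any library.

```python
from typing import Callable, List

def hierarchical_reduce(
    summaries: List[str],
    summarize_fn: Callable[[str], str],
    group_size: int = 8,
) -> str:
    """Repeatedly summarize groups of summaries until one remains."""
    if not summaries:
        return ""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level = list(summaries)
    while len(level) > 1:
        # Each round replaces every group of `group_size` texts with one summary.
        level = [
            summarize_fn(" ".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
    return level[0]

def fake_summarize(text: str) -> str:
    # Stand-in for the notebook's summarize(); keeps the first five words.
    return " ".join(text.split()[:5])

page_summaries = [f"summary of page {i}" for i in range(68)]
print(hierarchical_reduce(page_summaries, fake_summarize))
```

With 68 inputs and groups of 8, this takes three rounds (9, then 2, then 1 call). In the notebook it would be called as `hierarchical_reduce(summaries, summarize)`, with `group_size` chosen so that each joined group fits the model's input limit.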
