# Text Summarization

[![Index](https://img.shields.io/badge/Index-blue)](../index.ipynb)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/digillia/Digillia-Colab/blob/main/use-cases/text-summarization.ipynb)

In [7]:
from IPython import get_ipython

# Uncomment to install
# !pip3 install -q -U python-dotenv
# !pip3 install -q -U torch
# !pip3 install -q -U tqdm
# !pip3 install -q -U transformers

# Always install these on Google Colab
if 'google.colab' in str(get_ipython()):
  # BEGIN
  # llmx causes multiple dependency conflicts with the libraries installed afterwards
  # see https://community.openai.com/t/error-while-importing-openai-from-open-import-openai/578166
  # see https://stackoverflow.com/questions/77759146/issue-installing-openai-in-colab
  !pip3 uninstall llmx -y
  # END
  !pip3 install -q -U langchain
  !pip3 install -q -U langchain-openai
  !pip3 install -q -U llama-index
  !pip3 install -q -U openai
  !pip3 install -q -U pypdf

In [8]:
# This Python variable can be accessed by bash commands
pdf_directory = "text-summarization"

# Fetch the data on Google Colab
if 'google.colab' in str(get_ipython()):
  !curl --create-dirs -O --output-dir $pdf_directory "https://raw.githubusercontent.com/digillia/Digillia-Colab/main/use-cases/text-summarization/sample_1.pdf"

## Loading the Texts to Summarize

In [9]:
import os
from pypdf import PdfReader

# Empty list to store page text
texts = []

# Loop through each PDF file
for file_name in os.listdir(pdf_directory):
    
    # Check if file is a PDF
    if file_name.endswith('.pdf'):
        
        # Create PDF file object
        reader = PdfReader(os.path.join(pdf_directory, file_name))
        
        # Loop through pages and extract text
        for page in reader.pages:
            
            # Extract text from page
            texts.append(page.extract_text())
            
print(len(texts))
print(texts[4])

68
1. G enerative AI’s impact on 
productivity could add trillions 
of dollars in value to the global economy. Our latest research 
estimates that generative AI could add the equivalent of $2.6 trillion to $4.4 trillion annually across the 63 use cases we analyzed—by comparison, the United Kingdom’s entire GDP in 2021 was $3.1 trillion. This would increase the impact of all artificial intelligence by 15 to 40 percent. This estimate would roughly double if we include the impact of embedding generative AI into software that is currently used for other tasks beyond those use cases.
2.
 A
bout 75 percent of the value that 
generative AI use cases could deliver falls across four areas: Customer operations, marketing and sales, software engineering, and R&D . Across 16 business 
functions, we examined 63 use cases in which the technology can address specific business challenges in ways that produce one or more measurable outcomes. Examples include generative AI’s ability to support interacti
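The extracted page text above shows typical PDF artifacts: hard line breaks in the middle of sentences and stray spaces. A minimal sketch of a cleanup pass follows, assuming that collapsing all whitespace is acceptable before summarization. `normalize_page_text` is a hypothetical helper, not part of pypdf, and it does not repair words split by extraction such as "G enerative".

```python
import re

def normalize_page_text(text: str) -> str:
    """Collapse line breaks and runs of whitespace into single spaces."""
    # Join hard-wrapped lines, then squeeze repeated spaces and tabs.
    return re.sub(r"\s+", " ", text).strip()

sample = "2.\n A\nbout 75 percent of the value that \ngenerative AI use cases could deliver"
print(normalize_page_text(sample))
```

It could be applied to each page before appending, for example `texts.append(normalize_page_text(page.extract_text()))`.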

## Load OpenAI Key

In [10]:
import openai

if 'google.colab' in str(get_ipython()):
  from google.colab import userdata
  openai.api_key = userdata.get('OPENAI_API_KEY')
else:
  from dotenv import load_dotenv, find_dotenv
  _ = load_dotenv(find_dotenv()) # read local .env file
  openai.api_key = os.environ['OPENAI_API_KEY']

## Text Summarization using LlamaIndex

Sources:
- https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html

In [11]:
from llama_index.llms import OpenAI
from llama_index import (
  SimpleDirectoryReader,
  # ListIndex,
  ServiceContext,
  get_response_synthesizer,
)
# from llama_index.schema import Document
from llama_index.indices.document_summary import DocumentSummaryIndex

documents = SimpleDirectoryReader(pdf_directory).load_data()
# documents = list(map(lambda x: Document(text=x), texts))

# LLM (gpt-3.5-turbo)
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=1024)

# VectorStoreIndex
# index = VectorStoreIndex.from_documents(documents)

# DocumentSummaryIndex
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

query_engine = index.as_query_engine()
print(query_engine.query("Could you summarize the given context? Return your response which covers the key points of the text and does not miss anything important, please."))




Parsing nodes:   0%|          | 0/68 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/68 [00:00<?, ?it/s]

current doc id: 34259558-4e7c-4a64-a74b-e033cad14de7
current doc id: b58c5a30-edf5-4aa1-bde3-ef119dd66c7a
current doc id: 54600fd8-96ae-471a-983a-67a733705414
current doc id: b4d5ad70-be67-4eeb-ac5b-105f84904e56
current doc id: 43a97230-cabb-4459-8c8d-5f90f0cf0477
current doc id: 1e108719-7c1a-4bfa-838e-0a0fb7ad2e1e
current doc id: a6752d59-bab1-4794-8a29-f1be8d97d6d0
current doc id: c959acdf-eb6d-4b26-8cc6-1c41d4e848c9
current doc id: e31ace52-caf0-4262-a06d-5697aa17ea97
current doc id: be3f6e86-e1aa-4960-9564-0309d8e2d679
current doc id: 17987ff3-7d15-4f7f-a440-c2e2ff36329f
current doc id: 13fbc6b1-a3fb-435e-8c0b-33c93a52de5d
current doc id: 8c33145e-067d-4307-97ae-565b7d5be9ca
current doc id: 1319a361-ba9e-4c05-ba63-2e88b06b14a2
current doc id: 573c647a-76ce-4e95-a00c-a9d1bd56e5eb
current doc id: 4e22abc2-4311-4440-be30-8bf5f89af0db
current doc id: 30d2af22-3f9f-4bca-ab78-86bb65f3ae65
current doc id: 922582e2-e5c6-4ec5-bd41-00e3a9696396
current doc id: a40fd348-9352-4bba-a5b8-a55f2a

## Text Summarization using LangChain

Sources:
- https://python.langchain.com/docs/use_cases/summarization

### PDFs Summarization (MapReduce Approach)

In [None]:
# from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFDirectoryLoader
#from langchain_community.document_loaders import WebBaseLoader

from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

documents = PyPDFDirectoryLoader(pdf_directory).load()
# documents = list(map(lambda x: Document(page_content=x), texts))

llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, temperature=0)

chain = load_summarize_chain(llm, chain_type="map_reduce", token_max=3000)
chain.run(documents)

'Generative AI has the potential to revolutionize industries and improve productivity by automating work activities. It can transform customer operations, marketing and sales, software engineering, and R&D. However, there are challenges to address, such as managing risks and ensuring a smooth transition for workers. Generative AI has the potential to add trillions of dollars to the global economy, but it will take time to fully realize its benefits. The adoption of automation lags behind its potential, but generative AI has accelerated the potential for automation. It is expected to have the biggest impact on knowledge work and more-educated workers. The deployment of generative AI could boost productivity growth and compensate for declining employment growth. However, workers will need support and assistance to adapt to new roles. Stakeholders need to address the risks and challenges associated with generative AI and prepare for its opportunities. The article also discusses the factor

In [None]:
# from langchain.chains.llm import LLMChain
# from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
# from langchain.text_splitter import CharacterTextSplitter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

template = """The following is a set of documents
{docs}
Based on this list of docs, please identify the main themes 
Helpful Answer:"""

# Shorter alternative: this assignment overrides the multi-line template above
template = "Summarize the main themes in these retrieved docs: {docs}"
prompt = PromptTemplate.from_template(template)

# Chain
def format_docs(documents):
    return "\n\n".join(doc.page_content for doc in documents)

# chain = LLMChain(llm=llm, prompt=prompt)
chain = {"docs": format_docs} | prompt | llm | StrOutputParser()

# Run
chain.invoke(documents)

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens. However, your messages resulted in 31547 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
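The error above occurs because all pages are joined into one prompt (31547 tokens) while the model accepts 4097. One way around it is to pack the pages into batches that each stay under a token budget. The sketch below is only an illustration: `rough_token_count` and `batch_by_token_budget` are hypothetical helpers, and the 4-characters-per-token estimate is an assumption for English text, not the model's real tokenizer.

```python
from typing import Callable, List

def rough_token_count(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)

def batch_by_token_budget(
    pieces: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[List[str]]:
    """Greedily pack consecutive pieces into batches that stay under max_tokens.

    A single piece larger than max_tokens still gets its own (oversized) batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for piece in pieces:
        cost = count_tokens(piece)
        # Start a new batch when adding this piece would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(piece)
        used += cost
    if current:
        batches.append(current)
    return batches

pages = ["a" * 4000, "b" * 4000, "c" * 4000, "d" * 2000]
batches = batch_by_token_budget(pages, max_tokens=2500)
print([len(batch) for batch in batches])
```

Each batch could then be summarized on its own and the partial summaries combined in a second call, which is the same idea as the `map_reduce` chain used in the previous cell.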

## Text Summarization using Google T5

Sources:
- https://huggingface.co/docs/transformers/tasks/summarization
- https://huggingface.co/docs/transformers/model_doc/t5
- https://www.analyticsvidhya.com/blog/2023/06/pdf-summarization-with-transformers-in-python/
- https://blog.research.google/2022/03/auto-generated-summaries-in-google-docs.html
- https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html
- https://medium.com/gopenai/text-summarization-using-flan-t5-5ded2e4ce182
- https://medium.com/artificialis/t5-for-text-summarization-in-7-lines-of-code-b665c9e40771

In [None]:
# To clear the model cache
# !pip3 install -q -U "huggingface_hub[cli]"
# !huggingface-cli delete-cache

### Initialize the Model and Tokenizer

In [None]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# https://huggingface.co/docs/transformers/model_doc/t5
tr_name = 't5-base'

# https://huggingface.co/docs/transformers/model_doc/t5v1.1
# tr_name = 'google/t5-v1_1-base'

# tokenizer = T5Tokenizer.from_pretrained(tr_name)
# model = T5ForConditionalGeneration.from_pretrained(tr_name)

tokenizer = AutoTokenizer.from_pretrained(tr_name)
model = AutoModelForSeq2SeqLM.from_pretrained(tr_name, return_dict=True)

### Simple Summarization

In [None]:
text = ("Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains.[11] The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business.[12][13] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data.[14][15] In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.[16]")
inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
output = model.generate(inputs, min_length=80, max_length=100)
summary = tokenizer.decode(output[0])
print(summary)

<pad> data science is an interdisciplinary field focused on extracting knowledge from typically large data sets. it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, graphic design, complex systems, communication and business. the field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions.</s>
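The decoded output above still contains the `<pad>` and `</s>` markers. With transformers, passing `skip_special_tokens=True` to `tokenizer.decode` drops them at decoding time. For strings that are already decoded, a small cleanup can be sketched as follows; `strip_special_tokens` is a hypothetical helper and the token list is an assumption based on the T5 vocabulary.

```python
import re

# Markers seen in the outputs above, plus T5's <unk> and <extra_id_N> sentinels.
SPECIAL_TOKENS = re.compile(r"<pad>|</s>|<unk>|<extra_id_\d+>")

def strip_special_tokens(decoded: str) -> str:
    """Remove T5 special-token markers from an already-decoded string."""
    without_tokens = SPECIAL_TOKENS.sub(" ", decoded)
    # Collapse the whitespace left behind by the removed markers.
    return re.sub(r"\s+", " ", without_tokens).strip()

print(strip_special_tokens("<pad> data science is an interdisciplinary field.</s>"))
```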


### PDFs Summarization (MapReduce Approach)

In [None]:
from tqdm import tqdm

summaries = []

def summarize(text):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model.generate(inputs, max_length=1000, min_length=100, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0])    

for text in tqdm(texts):
    summaries.append(summarize(text))

summary = summarize(" ".join(summaries))
print(summary)

100%|██████████| 68/68 [11:10<00:00,  9.86s/it]


<pad> a broader set of stakeholders are grappling with generative AI’s potential to deliver value across the global economy. generative AI could enable labor productivity growth of 0.1 to 0.6 percent annually through 2040, depending on the rate of technology adoption and redeployment of worker time. a broader set of stakeholders are grappling with generative AI’s potential to deliver value across the global economy. a broader set of stakeholders are grappling with generative AI’s impact on business and society.</s>
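The final summary above repeats itself, and the reduce step only sees the start of the joined page summaries because `summarize` truncates its input to 1024 tokens. A possible alternative is to reduce in several rounds, summarizing small groups of summaries until one remains. The sketch below assumes a summarizer is passed in as a function; `hierarchical_reduce` and `fake_summarize` are hypothetical names used for illustration only.

```python
from typing import Callable, List

def hierarchical_reduce(
    summaries: List[str],
    summarize_fn: Callable[[str], str],
    group_size: int = 4,
) -> str:
    """Summarize groups of summaries repeatedly until a single summary remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not summaries:
        return ""
    level = list(summaries)
    # Each round shrinks the list by roughly a factor of group_size.
    while len(level) > 1:
        level = [
            summarize_fn(" ".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
    return level[0]

# Stand-in summarizer for illustration: keeps the first three words.
def fake_summarize(text: str) -> str:
    return " ".join(text.split()[:3])

print(hierarchical_reduce(["a b", "c d", "e f", "g h", "i j"], fake_summarize, group_size=2))
```

In this notebook the call would look like `hierarchical_reduce(summaries, summarize)`, with `group_size` chosen so that each joined group fits within the truncation limit of `summarize`.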
