# Résumé de Textes

Nous démontrons ici différentes méthodes simples pour résumer des documents en utilisant des modèles de langage larges.

[![Index](https://img.shields.io/badge/Index-blue)](../index.ipynb)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/digillia/Digillia-Colab/blob/main/use-cases/text-summarization.ipynb)

In [None]:
import os
import sys

# Supprimer les commentaires pour installer (requirements.txt)
# !pip3 install -q -U python-dotenv
# !pip3 install -q -U torch
# !pip3 install -q -U tqdm
# !pip3 install -q -U transformers

# À installer dans tous les cas pour Google Colab et Github
if 'google.colab' in sys.modules or 'CI' in os.environ:
  # BEGIN
  # llmx cause de multiples problèmes de dépendances avec les librairies installées ensuite
  # see https://community.openai.com/t/error-while-importing-openai-from-open-import-openai/578166
  # see https://stackoverflow.com/questions/77759146/issue-installing-openai-in-colab
  !pip3 uninstall llmx -y
  # END
  !pip3 install -q -U langchain
  !pip3 install -q -U langchain-openai
  !pip3 install -q -U langchain-community
  !pip3 install -q -U llama-index
  !pip3 install -q -U openai
  !pip3 install -q -U pypdf

In [9]:
# Cette variable python est accessible depuis les commandes bash
work_directory = "text-summarization"

# Récupération des données pour Google Colab et Github
if 'google.colab' in sys.modules or 'CI' in os.environ:
  !curl --create-dirs -O --output-dir $work_directory "https://raw.githubusercontent.com/digillia/Digillia-Colab/main/use-cases/text-summarization/sample_1.pdf"

In [10]:
# Fonction utilitaire pour afficher du texte sur plusieurs lignes
def word_wrap(string, n_chars=72):
    if len(string) < n_chars:
        return string
    else:
        return string[:n_chars].rsplit(' ', 1)[0] + '\n' + word_wrap(string[len(string[:n_chars].rsplit(' ', 1)[0])+1:], n_chars)

## Chargement des textes à résumer

In [11]:
import os
from pypdf import PdfReader

# Empty list to store page text
texts = []

# Loop through each PDF file
for file_name in os.listdir(work_directory):
    
    # Check if file is a PDF
    if file_name.endswith('.pdf'):
        
        # Create PDF file object
        reader = PdfReader(os.path.join(work_directory, file_name))
        
        # Loop through pages and extract text
        for page in reader.pages:
            
            # Extract text from page
            texts.append(page.extract_text())
            
print(len(texts))
print(word_wrap(texts[4]))

12
 
Karan Singh, Assistant Professor of Operations Research 


Characteristics common to both language models and supervised
learning: 
1. Predicting Well is the Yardstick. A prediction rule is
good as long as it makes 
reasonable predictions on average. Compared
to more ambitious sub-disciplines in 
statistics, any statements about
causality, p-values, and recovering latent structure are 
absent. We
are similarly impervious to such considerations in language models.
Such 
simplicity of goals enables very flexible prediction rules in
machine learning. Although 
seeming modest in its aim, the art of
machine learning has long been to cast as many 
disparate problems as
questions about prediction as possible. Predicting house prices 
from
square footage is a regular regression task. But, for reverse image
captioning, is 
“predicting” a (high-dimensional) image given a few
words a reasonable or well-defined 
classification task? Yet, this is
how machine learning algorithms function. 
2. M

## Chargement de la Clé pour OpenAI

Il vous faut obtenir d'Open AI une clé pour exécuter ce notebook Jupyter. Vous pouvez consulter [Where do I find my API key?](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key). Ensuite, le chargement se fait soit à partir de l'environnement (fichier `.env`), soit à partir des secrets de Google Colab.

In [12]:
import openai

openai_api_key = None
if 'google.colab' in sys.modules:
  from google.colab import userdata
  openai_api_key = userdata.get('OPENAI_API_KEY')
else:
  from dotenv import load_dotenv, find_dotenv
  _ = load_dotenv(find_dotenv()) # read local .env file
  openai_api_key  = os.environ['OPENAI_API_KEY']
openai.api_key = openai_api_key

## Résumé de Texte avec LlamaIndex

Cette approche utilise l'API Restful d'Open AI par le framework LlamaIndex 

Docs: https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html

In [13]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import (
  SimpleDirectoryReader,
  get_response_synthesizer,
)
from llama_index.core.indices import DocumentSummaryIndex

documents = SimpleDirectoryReader(work_directory).load_data()
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
embedding_model = OpenAIEmbedding(model="text-embedding-3-small")
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
index = DocumentSummaryIndex.from_documents(
    documents,
    llm=llm,
    embedding_model=embedding_model,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)
query_engine = index.as_query_engine()
query = query_engine.query("Could you summarize the given context? Return your response which covers the key points of the text and does not miss anything important, please.")
print(word_wrap(query.response))

Parsing nodes:   0%|          | 0/12 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/12 [00:00<?, ?it/s]

current doc id: c9029ca0-5428-43ec-9c29-6086ca915d1f
current doc id: 2a7cfa6d-6676-4c2d-a783-b85123da7767
current doc id: 0950fedd-fa57-4c1c-8a6a-36dcbad2d63c
current doc id: 48589af2-16d1-4cd0-a820-39c218b918e0
current doc id: 1985f9b5-ef73-433c-b6af-6051932660cf
current doc id: 0a0f0ee5-d551-4d06-8e3f-80cff29284c5
current doc id: d05317dd-a324-4b24-8aa6-01e7c6ff0372
current doc id: 462e16e5-35be-4067-92d7-c1ce2ddbdb12
current doc id: a119bb38-bc7d-491f-a888-c3f68c771e58
current doc id: f3633939-6b76-4cbd-a1fe-40b2b2628f92
current doc id: 3e825353-5d1c-484e-b33a-1b9c9c14e2b8
current doc id: 3528b425-eed1-458a-9234-de6f2275f4ca


Generating embeddings:   0%|          | 0/12 [00:00<?, ?it/s]

Language models predict the next word based on a context window of
preceding words by outputting a probability distribution over all
possible words. This randomness in predictions allows for varied
completions on different runs. Sampling the next word probabilistically
is crucial for generating high-quality text, as opposed to choosing the
most likely word greedily. This stochastic process enables models like
ChatGPT to offer diverse responses to the same prompt.


## Résumé de Texte avec LangChain

Cette approche utilise l'API Restful d'Open AI par le framework LangChain 

Docs: https://python.langchain.com/docs/use_cases/summarization

In [14]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_openai import ChatOpenAI

documents = PyPDFDirectoryLoader(work_directory).load()
llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, openai_api_key=openai_api_key, temperature=0)

### L'Approche Prompt (se heurte à la taille du contexte)

In [26]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from openai import BadRequestError

def format_docs(documents):
    return "\n\n".join(doc.page_content for doc in documents)

template = "Summarize the main themes in these retrieved docs: {docs}"
prompt = PromptTemplate.from_template(template)
chain = {"docs": format_docs} | prompt | llm | StrOutputParser()
try:
    response = chain.invoke(documents)
    print(word_wrap(response))
except BadRequestError as e:
    print(e.body['message'])

The main themes in these documents include the introduction and
explanation of generative artificial intelligence (GenAI) tools, the
impact of these tools on business processes and operations, the
technical aspects of language models, the advancements in machine
learning and artificial intelligence that have led to the development
of large language models, the use of word embeddings and contrastive
learning in creating representations for textual data, the role of
transformers in capturing long-range relationships in text, the concept
of in-context learning for repurposing pre-trained language models for
specific tasks, and the potential future applications of foundation
models in various domains. The documents also discuss the
democratization of using large language models like GPT3 through
in-context learning and the importance of prompt engineering in
eliciting useful outputs from these models.


### L'Approche MapReduce

Il s'agit de résumer chacun des documents, puis de résumer l'ensemble des résumés.

In [16]:
from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(llm, chain_type="map_reduce", token_max=3000)
response = chain.invoke(documents)
print(word_wrap(response['output_text']))

Generative artificial intelligence (GenAI) tools are a new class of AI
algorithms that can create novel content based on user prompts, with
recent advancements in machine learning leading to human-level
performance. This has the potential to revolutionize business processes
and create new opportunities, attracting significant investment and
interest. The report by Karan Singh explores large language models
(LLMs) like GPT3, discussing their principles, functions, and impact,
as well as the importance of initial prompts and supervised learning.
The evolution of neural network architectures from RNNs to Transformers
has improved language models, with models like GPT3 showing high
performance but high training costs. In-context learning approaches
like InstructGPT aim to address limitations in following instructions.
The future of AI lies in foundation models like GPT4, with tradeoffs in
performance-resource balance.


### L'Approche Refine

Il s'agit de résumer chaque document avec le résumé des précédents, jusqu'à atteindre le dernier.

In [17]:
from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(llm, chain_type="refine")
response = chain.invoke(documents)
print(word_wrap(response['output_text']))

The discussion by Karan Singh, Assistant Professor of Operations
Research, on contrastive learning and the evolution of neural network
architectures highlights the practicality and effectiveness of
pre-training large language models like GPT2 and GPT3 on auxiliary
tasks before fine-tuning them on specific labeled examples for
downstream tasks. The introduction of transformers has revolutionized
the field by efficiently capturing long-range relations between tokens.
The paper on GPT3's architecture introduces in-context learning as a
cost-effective way to repurpose pre-trained language models for
specific tasks, making them more accessible to end-users. While
fine-tuning may still offer performance gains, GPT3 significantly
reduces the computational load, democratizing the use of large language
models. The report also discusses the development of InstructGPT, which
aligns models to follow user instructions through supervised and
reinforcement learning, and the concept of foundation mode

## Résumé de Texte avec Google T5

Cette approche utilise directement un modèle de langage large en local.

Docs:
- https://huggingface.co/docs/transformers/tasks/summarization
- https://huggingface.co/docs/transformers/model_doc/t5
- https://www.analyticsvidhya.com/blog/2023/06/pdf-summarization-with-transformers-in-python/
- https://blog.research.google/2022/03/auto-generated-summaries-in-google-docs.html
- https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html
- https://medium.com/gopenai/text-summarization-using-flan-t5-5ded2e4ce182
- https://medium.com/artificialis/t5-for-text-summarization-in-7-lines-of-code-b665c9e40771

In [18]:
# Pour supprimer le modèle du cache
# !pip3 install -q -U "huggingface_hub[cli]"
# !huggingface-cli delete-cache

###  Initialisation du Modèle et de son Tokenizer

Notez que les modèles T5-base ont environ 220 millions de paramètres, à comparer avec les environ 175 milliards de paramètres de GPT 3.5 (Open AI). On ne peut pas en attendre la même qualité de résumés.

In [19]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# https://huggingface.co/docs/transformers/model_doc/t5
tr_name = 'T5-base'

# https://huggingface.co/docs/transformers/model_doc/t5v1.1
# tr_name = 'google/t5-v1_1-base'

# https://huggingface.co/docs/transformers/model_doc/flan-t5
# tr_name = 'google/flan-t5-base'

# tokenizer = T5Tokenizer.from_pretrained(tr_name)
# model = T5ForConditionalGeneration.from_pretrained(tr_name)

tokenizer = AutoTokenizer.from_pretrained(tr_name)
model = AutoModelForSeq2SeqLM.from_pretrained(tr_name, return_dict=True)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

### Résumé d'une variable

In [20]:
text = ("Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains.[11] The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business.[12][13] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data.[14][15] In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.[16]")
inputs = tokenizer.encode("sumarize: " + text, return_tensors='pt', max_length=512, truncation=True)
output = model.generate(inputs, min_length=80, max_length=100)
summary = tokenizer.decode(output[0])
print(word_wrap(summary))

<pad> data science is an interdisciplinary field focused on extracting
knowledge from typically large data sets . it incorporates skills from
computer science, statistics, information science, mathematics, data
visualization, information visualization, data sonification, graphic
design, complex systems, communication and business . the field
encompasses preparing data for analysis, formulating data science
problems, analyzing data, developing data-driven solutions .</s>


### Résumé de Textes par l'Approche MapReduce

In [21]:
from tqdm import tqdm

summaries = []

def summarize(text):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs, max_new_tokens=1000, min_new_tokens=250, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0])    

for text in tqdm(texts):
    summaries.append(summarize(text))

summary = summarize("\n\n".join(summaries))
print(word_wrap(summary))

100%|██████████| 12/12 [06:38<00:00, 33.20s/it]


<pad> recent advances in machine learning (ML), massive datasets, and
substantial increases in computing power have propelled such tools to
human-level performance on academic and professional benchmarks .
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­