# Text Summarization

Here we demonstrate several simple methods for summarizing documents with large language models.

[![Index](https://img.shields.io/badge/Index-blue)](../index.ipynb)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/digillia/Digillia-Colab/blob/main/use-cases/text-summarization.ipynb)

In [15]:
import os
import sys

# Uncomment to install (see requirements.txt)
# !pip3 install -q -U ipywidgets
# !pip3 install -q -U python-dotenv
# !pip3 install -q -U torch
# !pip3 install -q -U tqdm
# !pip3 install -q -U transformers

# Always install these on Google Colab and GitHub (CI)
if 'google.colab' in sys.modules or 'CI' in os.environ:
  # BEGIN
  # llmx causes multiple dependency conflicts with the libraries installed afterwards
  # see https://community.openai.com/t/error-while-importing-openai-from-open-import-openai/578166
  # see https://stackoverflow.com/questions/77759146/issue-installing-openai-in-colab
  !pip3 uninstall llmx -y
  # END
  !pip3 install -q -U langchain
  !pip3 install -q -U langchain-openai
  !pip3 install -q -U llama-index
  !pip3 install -q -U openai
  !pip3 install -q -U pypdf

In [16]:
# This Python variable is accessible from shell commands
work_directory = "text-summarization"

# Fetch the data on Google Colab and GitHub (CI)
if 'google.colab' in sys.modules or 'CI' in os.environ:
  !curl --create-dirs -O --output-dir $work_directory "https://raw.githubusercontent.com/digillia/Digillia-Colab/main/use-cases/text-summarization/sample_1.pdf"

In [17]:
# Utility function to display text wrapped over several lines
def word_wrap(string, n_chars=72):
    if len(string) < n_chars:
        return string
    else:
        return string[:n_chars].rsplit(' ', 1)[0] + '\n' + word_wrap(string[len(string[:n_chars].rsplit(' ', 1)[0])+1:], n_chars)

## Loading the Texts to Summarize

In [18]:
import os
from pypdf import PdfReader

# Empty list to store page text
texts = []

# Loop through each PDF file
for file_name in os.listdir(work_directory):
    
    # Check if file is a PDF
    if file_name.endswith('.pdf'):
        
        # Create PDF file object
        reader = PdfReader(os.path.join(work_directory, file_name))
        
        # Loop through pages and extract text
        for page in reader.pages:
            
            # Extract text from page
            texts.append(page.extract_text())
            
print(len(texts))
print(word_wrap(texts[4]))

12
 Karan Singh, Assistant Professor of Operations Research 

Characteristics common to both language models and supervised learning:
1.Predicting Well is the Yardstick. A prediction rule is good as long
as it makes reasonable predictions on average. Compared to more
ambitious sub-disciplines in statistics, any statements about
causality, p-values, and recovering latent structure are absent. We are
similarly impervious to such considerations in language models. Such
simplicity of goals enables very flexible prediction rules in machine
learning. Although seeming modest in its aim, the art of machine
learning has long been to cast as many disparate problems as questions
about prediction as possible. Predicting house prices from square
footage is a regular regression task. But, for reverse image
captioning, is “predicting” a (high-dimensional) image given a few
words a reasonable or well-defined classification task? Yet, this is
how machine learning algorithms function. 2.Model Agnosticis

## Loading the OpenAI Key

You need an API key from OpenAI to run this Jupyter notebook. See [Where do I find my API key?](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key). The key is then loaded either from the environment (`.env` file) or from the Google Colab secrets.

In [19]:
import openai

openai_api_key = None
if 'google.colab' in sys.modules:
  from google.colab import userdata
  openai_api_key = userdata.get('OPENAI_API_KEY')
else:
  from dotenv import load_dotenv, find_dotenv
  _ = load_dotenv(find_dotenv()) # read local .env file
  openai_api_key  = os.environ['OPENAI_API_KEY']
openai.api_key = openai_api_key
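
When the key is missing, `os.environ['OPENAI_API_KEY']` raises a bare `KeyError`, which can be confusing for a first-time reader. The following is a minimal sketch of a more explicit check. The helper name `require_api_key` and its `environ` parameter are hypothetical and not part of the notebook or of any library.

```python
import os

def require_api_key(name="OPENAI_API_KEY", environ=None):
    """Return the value of an environment variable or raise a readable error.

    `environ` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or to the Colab secrets."
        )
    return value

# Example with a fake environment (the key below is a placeholder, not a real key).
print(require_api_key(environ={"OPENAI_API_KEY": " sk-placeholder "}))
```

In the cell above, `os.environ['OPENAI_API_KEY']` could be swapped for `require_api_key()` once `load_dotenv` has run.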

## Text Summarization with LlamaIndex

This approach calls the OpenAI REST API through the LlamaIndex framework.

Docs: https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html

In [20]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import (
  SimpleDirectoryReader,
  get_response_synthesizer,
)
from llama_index.core.indices import DocumentSummaryIndex

documents = SimpleDirectoryReader(work_directory).load_data()
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
embedding_model = OpenAIEmbedding(model="text-embedding-3-small")
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
index = DocumentSummaryIndex.from_documents(
    documents,
    llm=llm,
    embed_model=embedding_model,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)
query_engine = index.as_query_engine()
query = query_engine.query("Could you summarize the given context? Return your response which covers the key points of the text and does not miss anything important, please.")
print(word_wrap(query.response))

Parsing nodes:   0%|          | 0/12 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/12 [00:00<?, ?it/s]

current doc id: 80e21738-60e9-4c5b-9f62-c1ad031eadbb
current doc id: e7fe4290-efef-4045-b767-994e58f2e559
current doc id: 5155facd-f403-4e4d-94b1-6f1f23aaa0c9
current doc id: 7b5096a7-22d7-4eec-9157-ccae18871a8d
current doc id: 384a58bb-6d1c-49f0-8b66-341c67244a07
current doc id: a8c8647d-dfbf-4d9d-9b66-b5f5cefde09a
current doc id: fd845f07-eb80-4950-aff9-99b46b990e99
current doc id: f9128ffb-3838-4b03-bb10-a44475917961
current doc id: 6047c5a5-7be7-4f20-860e-952e72f4dc4a
current doc id: 90a4724c-f857-4b4f-8128-7b8b300f7669
current doc id: 00f9b1b1-24e8-4da4-bd48-02e9f6289d2f
current doc id: e244c9a1-7980-4b12-883f-236196bdf94f


Generating embeddings:   0%|          | 0/12 [00:00<?, ?it/s]

The text discusses the challenges of using large language models for
supervised learning tasks due to computational constraints. It
introduces in-context learning as a more cost-effective approach to
repurpose pre-trained language models for specific tasks by providing
labeled examples in the prompt. This method eliminates the need for
computationally expensive adjustments to the model's parameters, making
it more user-friendly. While fine-tuning may offer performance gains,
in-context learning reduces the computational load, making large
language models more accessible for end-users. The text also mentions
GPT3's advancement in narrowing the performance gap between fine-tuning
and in-context learning, thereby democratizing the use of large
language models.


## Text Summarization with LangChain

This approach calls the OpenAI REST API through the LangChain framework.

Docs: https://python.langchain.com/docs/use_cases/summarization

In [21]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_openai import ChatOpenAI

documents = PyPDFDirectoryLoader(work_directory).load()
llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, openai_api_key=openai_api_key, temperature=0)

### The Prompt Approach (runs into the context size limit)

In [22]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from openai import BadRequestError

def format_docs(documents):
    return "\n\n".join(doc.page_content for doc in documents)

template = "Summarize the main themes in these retrieved docs: {docs}"
prompt = PromptTemplate.from_template(template)
chain = {"docs": format_docs} | prompt | llm | StrOutputParser()
try:
    chain.invoke(documents)
except BadRequestError as e:
    print(e.body['message'])
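
The request above fails because the concatenated pages exceed the model's context window. A rough pre-flight check can flag this before any API call is made. The sketch below relies on assumptions: about 4 characters per token is a common rule of thumb, not a tokenizer, and the `context_window=16_000` value is purely illustrative, so check the limit of the model you actually use. The function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate; chars_per_token is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1

def fits_in_context(texts, context_window, reserved_for_output=500):
    """Return (fits, estimated_prompt_tokens) for the concatenated texts."""
    prompt = "\n\n".join(texts)
    needed = estimate_tokens(prompt)
    return needed + reserved_for_output <= context_window, needed

# 12 synthetic pages of 10,000 characters each stand in for the PDF pages.
pages = ["word " * 2000] * 12
ok, needed = fits_in_context(pages, context_window=16_000)
print(ok, needed)
```

When the check fails, the MapReduce and Refine approaches below are the usual ways out, since they never send all pages in a single prompt.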

### The MapReduce Approach

The idea is to summarize each document, then summarize the collection of summaries.

In [23]:
from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(llm, chain_type="map_reduce", token_max=3000)
response = chain.invoke(documents)
print(word_wrap(response['output_text']))

Generative artificial intelligence (GenAI) tools are advanced
algorithms that can create new content based on user prompts in various
formats, reaching human-level performance and offering potential
benefits for businesses. The report by Karan Singh explores new-era AI
technologies, such as large language models (LLMs), and their impact on
tasks and future advances. It discusses the importance of initial
prompts in language models, supervised learning techniques, and the
evolution of neural network architectures. The paper also highlights
the limitations of models like GPT3 and introduces new models like
InstructGPT and foundation models. Tradeoffs in models like GPT4 are
discussed, along with methods for improving text generation.
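
The map-reduce pattern described in this section can be sketched without any framework. This is an illustration under stated assumptions, not LangChain's implementation: `first_sentence` is a hypothetical stand-in for an LLM call, and a word count stands in for a token budget in the spirit of `token_max`.

```python
from typing import Callable, List

def first_sentence(text: str) -> str:
    """Stand-in summarizer (hypothetical): keep only the first sentence."""
    head = text.strip().split(". ")[0].rstrip(".")
    return head + "."

def n_words(texts: List[str]) -> int:
    return sum(len(t.split()) for t in texts)

def map_reduce_summarize(docs: List[str], summarize: Callable[[str], str],
                         max_words: int = 50) -> str:
    """Summarize each document, then summarize the combined summaries.

    If the combined summaries exceed max_words, they are grouped into
    batches that fit, each batch is summarized, and the process repeats.
    """
    if not docs:
        return ""
    summaries = [summarize(d) for d in docs]                  # map step
    while len(summaries) > 1 and n_words(summaries) > max_words:
        batches, current = [], []
        for s in summaries:
            if current and n_words(current) + len(s.split()) > max_words:
                batches.append(current)
                current = []
            current.append(s)
        batches.append(current)
        if len(batches) == len(summaries):
            break  # every summary is over budget on its own; stop collapsing
        summaries = [summarize(" ".join(b)) for b in batches]  # collapse step
    return summarize(" ".join(summaries))                     # reduce step

docs = [
    "Cats sleep a lot. They also purr.",
    "Dogs bark loudly. They like walks.",
    "Birds sing at dawn. Some migrate.",
]
# With a real LLM in place of first_sentence, this would be a true summary.
print(map_reduce_summarize(docs, first_sentence))
```

The collapse loop matters only when the intermediate summaries are too long to fit in one final prompt. For a handful of short summaries it is skipped entirely.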


### The Refine Approach

The idea is to summarize each document together with the summary of the previous ones, until the last document is reached.

In [24]:
from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(llm, chain_type="refine")
response = chain.invoke(documents)
print(word_wrap(response['output_text']))

The existing summary provides a comprehensive overview of Generative
artificial intelligence (GenAI) tools, particularly large language
models (LLMs), and their impact on various business processes.
Assistant Professor Karan Singh's report delves into the functioning
and principles of GenAI, focusing on LLMs and their successful
applications in commercial settings. The report also discusses the
challenges and advancements in supervised learning, deep learning, and
the evolution of language processing tasks using word embeddings,
recurrent neural networks, and transformers. It highlights the
practical applications of large language models, specifically
in-context learning, as a cost-effective and efficient way to repurpose
pre-trained models for specific tasks. The report also touches on the
future potential of foundation models to expand the use of LLMs in
other domains. The report also provides insights into the tradeoffs
between fine-tuning and in-context learning, as well as the per
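
The refine pattern amounts to a left fold over the documents: a running summary is updated once per document. The sketch below shows that control flow only. The `initial` and `refine` functions are hypothetical stand-ins for LLM calls and do not reflect LangChain's prompts or internals.

```python
from typing import Callable, List

def refine_summarize(docs: List[str],
                     initial: Callable[[str], str],
                     refine: Callable[[str, str], str]) -> str:
    """Summarize the first document, then refine the summary with each next one."""
    if not docs:
        return ""
    summary = initial(docs[0])
    for doc in docs[1:]:
        # Each step sees only the running summary and one new document.
        summary = refine(summary, doc)
    return summary

# Hypothetical stand-ins for LLM calls: keep the first three words of each document.
def initial(doc: str) -> str:
    return " ".join(doc.split()[:3])

def refine(summary: str, doc: str) -> str:
    return summary + " | " + " ".join(doc.split()[:3])

print(refine_summarize(["alpha beta gamma delta", "one two three four"], initial, refine))
# prints: alpha beta gamma | one two three
```

Unlike map-reduce, the steps are sequential, so the calls cannot be parallelized, and the length of the running summary has to be kept in check by the refine step itself.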

## Text Summarization with Google T5

This approach runs a large language model directly on the local machine.

Docs:
- https://huggingface.co/docs/transformers/tasks/summarization
- https://huggingface.co/docs/transformers/model_doc/t5
- https://www.analyticsvidhya.com/blog/2023/06/pdf-summarization-with-transformers-in-python/
- https://blog.research.google/2022/03/auto-generated-summaries-in-google-docs.html
- https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html
- https://medium.com/gopenai/text-summarization-using-flan-t5-5ded2e4ce182
- https://medium.com/artificialis/t5-for-text-summarization-in-7-lines-of-code-b665c9e40771

In [25]:
# To remove the model from the cache
# !pip3 install -q -U "huggingface_hub[cli]"
# !huggingface-cli delete-cache

### Initializing the Model and Its Tokenizer

Note that T5-base models have about 220 million parameters, compared with the roughly 175 billion parameters reported for GPT-3, the model family that GPT-3.5 (OpenAI) derives from; OpenAI has not published an official figure for GPT-3.5. Summaries of the same quality should not be expected.
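
To put the two parameter counts in perspective, here is a back-of-the-envelope calculation of the memory needed just to hold the weights. It assumes 32-bit floats (4 bytes per parameter) and ignores activations, optimizer state, and any quantization. The function name is hypothetical.

```python
def weights_size_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

t5_base_params = 220e6   # about 220 million, as quoted above
gpt3_params = 175e9      # about 175 billion, as quoted above

print(round(weights_size_gb(t5_base_params), 2))   # 0.88
print(round(weights_size_gb(gpt3_params), 1))      # 700.0
print(round(gpt3_params / t5_base_params))         # 795
```

Under these assumptions the smaller model fits comfortably on a laptop, while the larger one is roughly 800 times bigger, which is why it is reached through an API here.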

In [26]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# https://huggingface.co/docs/transformers/model_doc/t5
tr_name = 't5-base'

# https://huggingface.co/docs/transformers/model_doc/t5v1.1
# tr_name = 'google/t5-v1_1-base'

# https://huggingface.co/docs/transformers/model_doc/flan-t5
# tr_name = 'google/flan-t5-base'

# tokenizer = T5Tokenizer.from_pretrained(tr_name)
# model = T5ForConditionalGeneration.from_pretrained(tr_name)

tokenizer = AutoTokenizer.from_pretrained(tr_name)
model = AutoModelForSeq2SeqLM.from_pretrained(tr_name, return_dict=True)

### Summarizing a Variable

In [27]:
text = ("Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains.[11] The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business.[12][13] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data.[14][15] In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.[16]")
inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
output = model.generate(inputs, min_length=80, max_length=100)
summary = tokenizer.decode(output[0])
print(word_wrap(summary))

<pad> data science is an interdisciplinary field focused on extracting
knowledge from typically large data sets. it incorporates skills from
computer science, statistics, information science, mathematics, data
visualization, information visualization, data sonification, graphic
design, complex systems, communication and business. the field
encompasses preparing data for analysis, formulating data science
problems, analyzing data, developing data-driven solutions.</s>
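
The `<pad>` and `</s>` markers in the output are special tokens that `tokenizer.decode` keeps by default. Hugging Face tokenizers accept `skip_special_tokens=True` in `decode` to drop them at the source. For text that has already been decoded, a small post-processing helper does the job. The sketch below is pure Python, and the token list is an assumption based on the markers visible in the output above.

```python
import re

# Assumed set of T5 special tokens, based on the markers seen in the output.
SPECIAL_TOKENS = ("<pad>", "</s>", "<unk>")

def strip_special_tokens(text, tokens=SPECIAL_TOKENS):
    """Remove special-token markers and collapse the whitespace left behind."""
    for tok in tokens:
        text = text.replace(tok, " ")
    return re.sub(r"\s+", " ", text).strip()

raw = "<pad> data science is an interdisciplinary field.</s>"
print(strip_special_tokens(raw))
# prints: data science is an interdisciplinary field.
```

Note that the helper also collapses line breaks into single spaces, which is fine for a short summary that is re-wrapped with `word_wrap` afterwards.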


### Summarizing Texts with the MapReduce Approach

In [28]:
from tqdm import tqdm

summaries = []

def summarize(text):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs, max_new_tokens=1000, min_new_tokens=250, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0])    

for text in tqdm(texts):
    summaries.append(summarize(text))

summary = summarize("\n\n".join(summaries))
print(word_wrap(summary))

100%|██████████| 12/12 [08:30<00:00, 42.55s/it]


<pad> openAI’s chatGPT, a conversational web app based on a generative
(multimodal) language model, took five days to reach one million users.
on the business side, the number of jobs mentioning AI-related skills
quadrupled from 2022 to 2023.
[remaining output omitted: a long run of repeated filler characters]
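
Because `summarize` encodes with `max_length=512, truncation=True`, any input beyond 512 tokens is silently dropped, and that includes most of the joined summaries in the final reduce step. One way around this is to split long text into pieces that fit and summarize each piece. The sketch below uses whitespace-separated words as a stand-in for tokens, which is an assumption: a word often maps to more than one token, so `max_words` should sit well below 512. The default of 300 is arbitrary and the function name is hypothetical.

```python
from typing import List

def chunk_words(text: str, max_words: int = 300, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most max_words words.

    `overlap` repeats that many words at the start of the next chunk so that
    sentences cut at a boundary keep some context.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

print(chunk_words("a b c d e f g", max_words=3))
# prints: ['a b c', 'd e f', 'g']
```

Each chunk could then be passed to `summarize`, and the reduce step applied to groups of summaries that together stay under the same budget.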