# Text Summarization

This notebook demonstrates several simple methods for summarizing documents with large language models.

[![Index](https://img.shields.io/badge/Index-blue)](../index.ipynb)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/digillia/Digillia-Colab/blob/main/use-cases/text-summarization.ipynb)

In [1]:
import os
import sys

# Uncomment to install (requirements.txt)
# !pip3 install -q -U ipywidgets
# !pip3 install -q -U python-dotenv
# !pip3 install -q -U torch
# !pip3 install -q -U tqdm
# !pip3 install -q -U transformers

# Always install for Google Colab and GitHub
if 'google.colab' in sys.modules or 'CI' in os.environ:
  # BEGIN
  # llmx causes multiple dependency problems with the libraries installed afterwards
  # see https://community.openai.com/t/error-while-importing-openai-from-open-import-openai/578166
  # see https://stackoverflow.com/questions/77759146/issue-installing-openai-in-colab
  !pip3 uninstall llmx -y
  # END
  !pip3 install -q -U langchain
  !pip3 install -q -U langchain-openai
  !pip3 install -q -U llama-index
  !pip3 install -q -U openai
  !pip3 install -q -U pypdf

In [2]:
# This Python variable is accessible from bash commands
work_directory = "text-summarization"

# Fetch the data for Google Colab and GitHub
if 'google.colab' in sys.modules or 'CI' in os.environ:
  !curl --create-dirs -O --output-dir $work_directory "https://raw.githubusercontent.com/digillia/Digillia-Colab/main/use-cases/text-summarization/sample_1.pdf"

In [3]:
# Utility function to display text on multiple lines
def word_wrap(string, n_chars=72):
    if len(string) < n_chars:
        return string
    else:
        return string[:n_chars].rsplit(' ', 1)[0] + '\n' + word_wrap(string[len(string[:n_chars].rsplit(' ', 1)[0])+1:], n_chars)
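
As an alternative sketch, the standard library's `textwrap` module can do similar wrapping without recursion. The helper name `wrap_text` is hypothetical and not part of the notebook. Unlike `word_wrap`, it wraps each existing line separately, so line breaks already present in the text are kept.

```python
import textwrap

def wrap_text(string, n_chars=72):
    """Wrap each line of `string` to at most `n_chars` characters."""
    wrapped_lines = []
    for line in string.splitlines():
        # textwrap.fill returns '' for an empty line, which keeps blank lines
        wrapped_lines.append(textwrap.fill(line, width=n_chars))
    return "\n".join(wrapped_lines)

sample = "word " * 40
print(wrap_text(sample))
```

Because it loops rather than recursing, this variant is not limited by Python's recursion depth on very long inputs.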

## Loading the Texts to Summarize

In [4]:
import os
from pypdf import PdfReader

# Empty list to store page text
texts = []

# Loop through each PDF file
for file_name in os.listdir(work_directory):
    
    # Check if file is a PDF
    if file_name.endswith('.pdf'):
        
        # Create PDF file object
        reader = PdfReader(os.path.join(work_directory, file_name))
        
        # Loop through pages and extract text
        for page in reader.pages:
            
            # Extract text from page
            texts.append(page.extract_text())
            
print(len(texts))
print(word_wrap(texts[4]))

12
 Karan Singh, Assistant Professor of Operations Research 

Characteristics common to both language models and supervised learning:
1.Predicting Well is the Yardstick. A prediction rule is good as long
as it makes reasonable predictions on average. Compared to more
ambitious sub-disciplines in statistics, any statements about
causality, p-values, and recovering latent structure are absent. We are
similarly impervious to such considerations in language models. Such
simplicity of goals enables very flexible prediction rules in machine
learning. Although seeming modest in its aim, the art of machine
learning has long been to cast as many disparate problems as questions
about prediction as possible. Predicting house prices from square
footage is a regular regression task. But, for reverse image
captioning, is “predicting” a (high-dimensional) image given a few
words a reasonable or well-defined classification task? Yet, this is
how machine learning algorithms function. 2.Model Agnosticis

## Loading the OpenAI Key

You need an API key from OpenAI to run this Jupyter notebook. See [Where do I find my API key?](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key). The key is then loaded either from the environment (`.env` file) or from Google Colab secrets.

In [5]:
import openai

openai_api_key = None
if 'google.colab' in sys.modules:
  from google.colab import userdata
  openai_api_key = userdata.get('OPENAI_API_KEY')
else:
  from dotenv import load_dotenv, find_dotenv
  _ = load_dotenv(find_dotenv()) # read local .env file
  openai_api_key  = os.environ['OPENAI_API_KEY']
openai.api_key = openai_api_key
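
Outside Colab, the cell above reads `os.environ['OPENAI_API_KEY']`, which raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier lookup follows; the helper name `require_env` and its error message are assumptions for illustration, not part of the notebook.

```python
import os

def require_env(name, environ=None):
    """Return the value of environment variable `name`, or raise a clear error."""
    # Allow an explicit mapping so the helper can be tried without a real key
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or to the Colab secrets"
        )
    return value

# Example with a placeholder mapping instead of the real environment
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-placeholder"}))
```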

## Text Summarization with LlamaIndex

This approach calls OpenAI's REST API through the LlamaIndex framework.

Docs: https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html

In [6]:
from llama_index.llms import OpenAI
from llama_index import (
  SimpleDirectoryReader,
  ServiceContext,
  get_response_synthesizer,
)
from llama_index.indices.document_summary import DocumentSummaryIndex

documents = SimpleDirectoryReader(work_directory).load_data()
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=1024)
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)
query_engine = index.as_query_engine()
query = query_engine.query("Could you summarize the given context? Return your response which covers the key points of the text and does not miss anything important, please.")
print(word_wrap(query.response))




Parsing nodes:   0%|          | 0/12 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/12 [00:00<?, ?it/s]

current doc id: 5ca561cc-5f98-4365-8f08-fdea1efe9691
current doc id: 2cf51eac-43f5-41f2-9f10-6bf704d0aeab
current doc id: eb291e4c-fd52-415b-8bb3-dfa9c7ee4c9b
current doc id: f5f41f56-119f-4171-9924-b9d2ce675296
current doc id: 2854b57a-aaba-4acb-a593-54f1a1edc613
current doc id: ecda5e70-284f-4475-b0da-0b436f42bf31
current doc id: 0926d181-cc46-406a-8122-95a3e228b35c
current doc id: 859c7d43-410c-436d-9e92-2c03e1bdbf67
current doc id: 7b4e7b86-8314-4285-ab68-44e2376facd5
current doc id: 858d5197-4988-4136-958e-20700de1a94d
current doc id: 6d22e68d-b53f-4e8a-bb52-d9c2131e944c
current doc id: 7034fa56-ef33-4907-b511-c9b0610deaae


Generating embeddings:   0%|          | 0/12 [00:00<?, ?it/s]

The context information mentions various details related to the
performance and usage of GPT4, a language model developed by OpenAI. It
highlights the tradeoffs between fine-tuning and in-context learning in
such models. It also refers to the performance of GPT4 on academic,
professional, and programming exams, with a specific mention of its
performance on the bar exam. However, there are concerns raised by
others regarding this performance. The context also includes references
to industry and research leaders' opinions on GPT4. Additionally, it
mentions the initial adoption statistics for ChatGPT, investments in
GenAI, and the user bases for GenAI. The context briefly explains the
use of Beam Search for producing text and prompt engineering for
structuring initial text inputs. Finally, it provides links to papers
related to various topics, including GPT3, Transformers, Word2Vec, and
instruct GPT.


## Text Summarization with LangChain

This approach calls OpenAI's REST API through the LangChain framework.

Docs: https://python.langchain.com/docs/use_cases/summarization

In [7]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_openai import ChatOpenAI

documents = PyPDFDirectoryLoader(work_directory).load()
llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, openai_api_key=openai_api_key, temperature=0)

### The Prompt Approach (hits the context size limit)

In [8]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from openai import BadRequestError

def format_docs(documents):
    return "\n\n".join(doc.page_content for doc in documents)

template = "Summarize the main themes in these retrieved docs: {docs}"
prompt = PromptTemplate.from_template(template)
chain = {"docs": format_docs} | prompt | llm | StrOutputParser()
try:
    chain.invoke(documents)
except BadRequestError as e:
    print(e.body['message'])

This model's maximum context length is 4097 tokens. However, your messages resulted in 4370 tokens. Please reduce the length of the messages.
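
The error above reports a 4097-token limit. One way to anticipate it is to estimate the prompt size before calling the model. The sketch below uses a rule of thumb of about 4 characters per token for English text, which is an assumption and not a real tokenizer. The function names and the `reserved_for_answer` budget are also hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is a rule of thumb, not a tokenizer."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits_context(text, context_limit=4097, reserved_for_answer=256):
    """Check whether the estimated prompt size leaves room for the answer."""
    return estimate_tokens(text) + reserved_for_answer <= context_limit

prompt = "Summarize the main themes in these retrieved docs: " + "lorem ipsum " * 2000
print(estimate_tokens(prompt), fits_context(prompt))
```

For an exact count, a real tokenizer for the target model would be needed; this estimate only helps decide when to switch to a chunked approach.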


### The MapReduce Approach

Each document is summarized on its own, then the set of summaries is summarized in turn.
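
The idea can be sketched independently of any framework. In the sketch below, `summarize` stands for any function that maps a text to a shorter text. The stand-in `first_words` is hypothetical and only keeps the snippet runnable without an API call. This is not LangChain's implementation.

```python
def map_reduce_summarize(documents, summarize):
    """Summarize each document (map), then summarize the joined summaries (reduce)."""
    partial_summaries = [summarize(doc) for doc in documents]  # map step
    return summarize("\n\n".join(partial_summaries))           # reduce step

# Hypothetical stand-in for an LLM call: keep only the first three words
def first_words(text, n=3):
    return " ".join(text.split()[:n])

docs = ["Cats sleep a lot. They also purr.", "Dogs bark. They fetch."]
print(map_reduce_summarize(docs, first_words))
```

The map step has no dependency between documents, so the per-document calls could in principle run in parallel.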

In [9]:
from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(llm, chain_type="map_reduce", token_max=3000)
response = chain.invoke(documents)
print(word_wrap(response['output_text']))

Generative artificial intelligence (GenAI) tools, specifically large
language models (LLMs), have made significant progress in producing new
content based on user prompts. These tools have the potential to
enhance business processes and have attracted increased interest and
investment. The report explores the principles and functions of LLMs,
their impact, and future advances. It discusses the importance of the
initial prompt, the development of modern LLMs, and the limitations of
classical supervised learning. The article also highlights the
advancements in representation learning through deep learning
algorithms and the architecture of the GPT3 model. It introduces
OpenAI's InstructGPT model and the concept of foundation models.


### The Refine Approach

Each document is summarized together with the summary of the preceding ones, until the last document is reached.
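
A framework-free sketch of this loop follows. The callables `summarize_first` and `refine` are hypothetical placeholders for LLM calls, and the lambdas below are toy stand-ins so the snippet runs on its own. This is not LangChain's implementation.

```python
def refine_summarize(documents, summarize_first, refine):
    """Summarize the first document, then update the summary with each next one."""
    if not documents:
        return ""
    summary = summarize_first(documents[0])
    for doc in documents[1:]:
        # Each step sees the running summary plus one new document
        summary = refine(summary, doc)
    return summary

# Toy stand-ins: keep the first word of each document
summarize_first = lambda doc: doc.split()[0]
refine = lambda summary, doc: summary + " " + doc.split()[0]

print(refine_summarize(["alpha one", "beta two", "gamma three"], summarize_first, refine))
```

Unlike the map step of MapReduce, each call here depends on the previous one, so the documents are processed sequentially.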

In [10]:
from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(llm, chain_type="refine")
response = chain.invoke(documents)
print(word_wrap(response['output_text']))

Generative artificial intelligence (GenAI) tools, such as large
language models (LLMs), have made significant progress and are now
comparable to human-level performance on academic and professional
benchmarks. These tools have the potential to enhance business
processes and enable new deliverables. This report primarily focuses on
LLMs capable of generating novel text from textual prompts, but it also
mentions the successful deployment of image- and code-based GenAI
models by companies like Adobe and Github. The report highlights the
importance of prompt engineering and the impact of the initial prompt
on the quality and coherence of the generated text. It discusses the
development of modern large language models and their differences from
traditional predictive models. The report also explores the limitations
of pre-deep-learning-era supervised algorithms and the revolution in
deep learning that automated the process of representation learning. It
introduces the concept of transformer

## Text Summarization with Google T5

This approach runs a large language model directly on the local machine.

Docs:
- https://huggingface.co/docs/transformers/tasks/summarization
- https://huggingface.co/docs/transformers/model_doc/t5
- https://www.analyticsvidhya.com/blog/2023/06/pdf-summarization-with-transformers-in-python/
- https://blog.research.google/2022/03/auto-generated-summaries-in-google-docs.html
- https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html
- https://medium.com/gopenai/text-summarization-using-flan-t5-5ded2e4ce182
- https://medium.com/artificialis/t5-for-text-summarization-in-7-lines-of-code-b665c9e40771

In [11]:
# To remove the model from the cache
# !pip3 install -q -U "huggingface_hub[cli]"
# !huggingface-cli delete-cache

### Initializing the Model and Its Tokenizer

Note that T5-base models have about 220 million parameters, compared with the roughly 175 billion parameters reported for GPT-3 (OpenAI has not published a figure for GPT-3.5). The same summary quality cannot be expected.

In [12]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# https://huggingface.co/docs/transformers/model_doc/t5
tr_name = 'T5-base'

# https://huggingface.co/docs/transformers/model_doc/t5v1.1
# tr_name = 'google/t5-v1_1-base'

# https://huggingface.co/docs/transformers/model_doc/flan-t5
# tr_name = 'google/flan-t5-base'

# tokenizer = T5Tokenizer.from_pretrained(tr_name)
# model = T5ForConditionalGeneration.from_pretrained(tr_name)

tokenizer = AutoTokenizer.from_pretrained(tr_name)
model = AutoModelForSeq2SeqLM.from_pretrained(tr_name, return_dict=True)

### Summarizing a Variable

In [13]:
text = ("Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains.[11] The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business.[12][13] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data.[14][15] In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.[16]")
inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
output = model.generate(inputs, min_length=80, max_length=100)
summary = tokenizer.decode(output[0])
print(word_wrap(summary))

<pad> data science is an interdisciplinary field focused on extracting
knowledge from typically large data sets. it incorporates skills from
computer science, statistics, information science, mathematics, data
visualization, information visualization, data sonification, graphic
design, complex systems, communication and business. the field
encompasses preparing data for analysis, formulating data science
problems, analyzing data, developing data-driven solutions.</s>
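
The decoded output above starts with `<pad>` and ends with `</s>`, which are T5 special tokens. With Hugging Face tokenizers, passing `skip_special_tokens=True` to `tokenizer.decode` drops them at decoding time. For text that has already been decoded, a small post-processing helper is one option. The name `strip_special_tokens` and its default token list are assumptions, covering only the two tokens seen in this output.

```python
def strip_special_tokens(text, special_tokens=("<pad>", "</s>")):
    """Remove the listed special-token strings and trim surrounding whitespace."""
    for token in special_tokens:
        text = text.replace(token, "")
    return text.strip()

decoded = "<pad> data science is an interdisciplinary field.</s>"
print(strip_special_tokens(decoded))
```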


### Summarizing Texts with the MapReduce Approach
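
The T5 cells in this notebook pass `max_length=512, truncation=True` to the tokenizer, so any input beyond 512 tokens is silently dropped. That includes the joined summaries in a reduce step. One possible mitigation is to split long text into smaller chunks before summarizing. The sketch below splits on whitespace with a word budget. Words are only a rough proxy for tokens, and both the name `split_into_chunks` and the default `max_words=300` are assumptions.

```python
def split_into_chunks(text, max_words=300):
    """Split `text` into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

chunks = split_into_chunks("one two three four five", max_words=2)
print(chunks)
```

Each chunk could then be summarized separately, and the resulting summaries combined in the same way as the per-page summaries.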

In [14]:
from tqdm import tqdm

summaries = []

def summarize(text):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs, max_new_tokens=1000, min_new_tokens=250, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0])    

for text in tqdm(texts):
    summaries.append(summarize(text))

summary = summarize("\n\n".join(summaries))
print(word_wrap(summary))

100%|██████████| 12/12 [09:39<00:00, 48.27s/it]


<pad> openAI’s chatGPT, a conversational web app based on a generative
(multimodal) language model, took five days to reach one million users.
on the business side, the number of jobs mentioning AI-related skills
quadrupled from 2022 to 2023.
[the remaining output is a long run of soft-hyphen characters (U+00AD), omitted here]