# Text Summarization

This notebook demonstrates several simple methods for summarizing documents with large language models.

[![Index](https://img.shields.io/badge/Index-blue)](../index.ipynb)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/digillia/Digillia-Colab/blob/main/use-cases/text-summarization.ipynb)

In [22]:
from IPython import get_ipython

# Uncomment the lines below to install
# !pip3 install -q -U python-dotenv
# !pip3 install -q -U torch
# !pip3 install -q -U tqdm
# !pip3 install -q -U transformers

# Always installed when running on Google Colab
if 'google.colab' in str(get_ipython()):
  # BEGIN
  # llmx causes multiple dependency conflicts with the libraries installed afterwards
  # see https://community.openai.com/t/error-while-importing-openai-from-open-import-openai/578166
  # see https://stackoverflow.com/questions/77759146/issue-installing-openai-in-colab
  !pip3 uninstall llmx -y
  # END
  !pip3 install -q -U langchain
  !pip3 install -q -U langchain-openai
  !pip3 install -q -U llama-index
  !pip3 install -q -U openai
  !pip3 install -q -U pypdf

In [23]:
# This Python variable can be accessed by bash commands
pdf_directory = "text-summarization"

# Fetch the data when running on Google Colab
if 'google.colab' in str(get_ipython()):
  !curl --create-dirs -O --output-dir $pdf_directory "https://raw.githubusercontent.com/digillia/Digillia-Colab/main/use-cases/text-summarization/sample_1.pdf"

## Loading the Texts to Summarize

In [24]:
import os
from pypdf import PdfReader
from ipywidgets import HTML

# Empty list to store page text
texts = []

# Loop through each PDF file
for file_name in os.listdir(pdf_directory):
    
    # Check if file is a PDF
    if file_name.endswith('.pdf'):
        
        # Create PDF file object
        reader = PdfReader(os.path.join(pdf_directory, file_name))
        
        # Loop through pages and extract text
        for page in reader.pages:
            
            # Extract text from page
            texts.append(page.extract_text())
            
print(len(texts))
HTML(texts[4])

12


HTML(value=' Karan Singh, Assistant Professor of Operations Research \n Characteristics common to both languag…

## Loading the OpenAI Key

You need an OpenAI API key to run this Jupyter notebook; see [Where do I find my API key?](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key). The key is then loaded either from the environment (`.env` file) or from Google Colab secrets.

In [25]:
import openai

if 'google.colab' in str(get_ipython()):
  from google.colab import userdata
  openai.api_key = userdata.get('OPENAI_API_KEY')
else:
  from dotenv import load_dotenv, find_dotenv
  _ = load_dotenv(find_dotenv()) # read local .env file
  openai.api_key = os.environ['OPENAI_API_KEY']

## Text Summarization with LlamaIndex

This approach calls OpenAI's RESTful API through the LlamaIndex framework.

Docs: https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html

In [27]:
from llama_index.llms import OpenAI
from llama_index import (
  SimpleDirectoryReader,
  ServiceContext,
  get_response_synthesizer,
)
from llama_index.indices.document_summary import DocumentSummaryIndex
from ipywidgets import HTML

documents = SimpleDirectoryReader(pdf_directory).load_data()
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=1024)
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)
query_engine = index.as_query_engine()
query = query_engine.query("Could you summarize the given context? Return your response which covers the key points of the text and does not miss anything important, please.")
HTML(query.response)


Parsing nodes:   0%|          | 0/12 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/12 [00:00<?, ?it/s]

current doc id: fe666552-335e-4ce8-9a2b-e019b4785e87
current doc id: cd1d91e7-4f7f-4373-90fb-ee97288dc9cc
current doc id: 068c8817-e176-4deb-be42-d48097e3e67d
current doc id: 1279b706-57d0-4c95-a3dc-724161271427
current doc id: 58fd1974-f14f-4120-b5ae-3ed2521770b5
current doc id: ba05641a-8893-4526-9729-79c2f392f882
current doc id: c5ab2fc9-e73e-4da2-bffb-c9bb14b7ae92
current doc id: 3d952e4a-1489-4f26-9790-7a82ba15ed22
current doc id: e423d7be-aae9-4008-b8c7-fdfdfcea2623
current doc id: 64b8a6a1-de0a-46f8-943a-c8d07cf51e26
current doc id: 0e72056f-f811-4b66-9d7c-bbf771cb93c1
current doc id: baefa322-e3bc-4593-9eb1-a3fe00c2e8da


Generating embeddings:   0%|          | 0/12 [00:00<?, ?it/s]

HTML(value="The context information mentions various details related to the performance and tradeoffs of GPT4,…

## Text Summarization with LangChain

This approach calls OpenAI's RESTful API through the LangChain framework.

Docs: https://python.langchain.com/docs/use_cases/summarization

In [31]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_openai import ChatOpenAI

documents = PyPDFDirectoryLoader(pdf_directory).load()
llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, temperature=0)

### The Prompt Approach (runs into the context size limit)

In [40]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from openai import BadRequestError

def format_docs(documents):
    return "\n\n".join(doc.page_content for doc in documents)

template = "Summarize the main themes in these retrieved docs: {docs}"
prompt = PromptTemplate.from_template(template)
chain = {"docs": format_docs} | prompt | llm | StrOutputParser()
try:
    chain.invoke(documents)
except BadRequestError as e:
    print(e.body['message'])

This model's maximum context length is 4097 tokens. However, your messages resulted in 4370 tokens. Please reduce the length of the messages.
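
The error above occurs because every page is concatenated into a single prompt. As a rough illustration of how the input could be kept within a limit, the sketch below greedily packs page texts into batches under a token budget. The helpers `estimate_tokens` and `pack_into_batches` are hypothetical names, and the four-characters-per-token estimate is only a crude heuristic for English text, not the model's actual tokenizer.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; a real tokenizer will differ.
    return max(1, len(text) // 4)


def pack_into_batches(pages: List[str], budget: int) -> List[List[str]]:
    """Greedily group consecutive pages so each batch fits the token budget.

    A single page larger than the budget still ends up alone in its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for page in pages:
        cost = estimate_tokens(page)
        # Start a new batch when adding this page would exceed the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(page)
        used += cost
    if current:
        batches.append(current)
    return batches


# Toy pages of 400, 800 and 1200 characters (about 100, 200 and 300 tokens).
pages = ["a" * 400, "b" * 800, "c" * 1200]
batches = pack_into_batches(pages, budget=350)
print([len(batch) for batch in batches])
```

Each batch could then be summarized separately, which is the idea behind the approaches in the following sections.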


### The MapReduce Approach

Each document is summarized on its own, then the collected summaries are summarized together.

In [33]:
from langchain.chains.summarize import load_summarize_chain
from ipywidgets import HTML

chain = load_summarize_chain(llm, chain_type="map_reduce", token_max=3000)
response = chain.invoke(documents)
HTML(response['output_text'])

HTML(value='Generative artificial intelligence (GenAI) tools, specifically large language models (LLMs), have …
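
To make the two steps explicit, here is a minimal sketch of the map-reduce pattern in plain Python. It is not how `load_summarize_chain` is implemented; `map_reduce_summary` and the stand-in summarizer `keep_first_words` are hypothetical names used only for illustration, and the stand-in simply truncates text instead of calling a model.

```python
from typing import Callable, List


def map_reduce_summary(texts: List[str], summarize: Callable[[str], str]) -> str:
    # Map step: summarize every document independently.
    partial_summaries = [summarize(text) for text in texts]
    # Reduce step: summarize the concatenation of the partial summaries.
    return summarize("\n\n".join(partial_summaries))


def keep_first_words(text: str, limit: int = 4) -> str:
    # Hypothetical stand-in for a model call: keep only the first few words.
    return " ".join(text.split()[:limit])


docs = [
    "Cats sleep a lot. They also hunt at night.",
    "Dogs like to play. They need daily walks.",
]
print(map_reduce_summary(docs, keep_first_words))
```

Because the map step treats each document independently, those calls can run in parallel; only the reduce step needs all partial summaries at once.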

### The Refine Approach

Each document is summarized in turn, together with the running summary of the documents that came before it.

In [42]:
from langchain.chains.summarize import load_summarize_chain
from ipywidgets import HTML

chain = load_summarize_chain(llm, chain_type="refine")
response = chain.invoke(documents)
HTML(response['output_text'])

HTML(value="Generative artificial intelligence (GenAI) tools, such as large language models (LLMs), have made …
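
The refine pattern can be sketched as a simple fold over the documents. This is an illustration under stated assumptions rather than LangChain's implementation: `refine_summary` and the stand-in `append_first_word` are hypothetical names, and the stand-in merely appends the first word of each text where a real setup would prompt a model with the running summary and the next document.

```python
from typing import Callable, List


def refine_summary(texts: List[str], refine: Callable[[str, str], str]) -> str:
    # Start from an empty summary and fold each document into it in order.
    summary = ""
    for text in texts:
        summary = refine(summary, text)
    return summary


def append_first_word(summary: str, text: str) -> str:
    # Hypothetical stand-in for a model call: extend the running summary
    # with the first word of the new text.
    words = text.split()
    addition = words[0] if words else ""
    return (summary + " " + addition).strip()


docs = ["alpha beta", "gamma delta", "epsilon zeta"]
print(refine_summary(docs, append_first_word))
```

Unlike the map step of MapReduce, each call here depends on the previous result, so the documents have to be processed sequentially.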

## Text Summarization with Google T5

This approach runs a large language model directly on the local machine.

Docs:
- https://huggingface.co/docs/transformers/tasks/summarization
- https://huggingface.co/docs/transformers/model_doc/t5
- https://www.analyticsvidhya.com/blog/2023/06/pdf-summarization-with-transformers-in-python/
- https://blog.research.google/2022/03/auto-generated-summaries-in-google-docs.html
- https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html
- https://medium.com/gopenai/text-summarization-using-flan-t5-5ded2e4ce182
- https://medium.com/artificialis/t5-for-text-summarization-in-7-lines-of-code-b665c9e40771

In [None]:
# To clear the model cache
# !pip3 install -q -U "huggingface_hub[cli]"
# !huggingface-cli delete-cache

### Initializing the Model and Its Tokenizer

In [43]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# https://huggingface.co/docs/transformers/model_doc/t5
tr_name = 'T5-base'

# https://huggingface.co/docs/transformers/model_doc/t5v1.1
# tr_name = 'google/t5-v1_1-base'

# tokenizer = T5Tokenizer.from_pretrained(tr_name)
# model = T5ForConditionalGeneration.from_pretrained(tr_name)

tokenizer = AutoTokenizer.from_pretrained(tr_name)
model = AutoModelForSeq2SeqLM.from_pretrained(tr_name, return_dict=True)

### Summarizing a Variable

In [44]:
from ipywidgets import HTML

text = ("Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains.[11] The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business.[12][13] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data.[14][15] In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.[16]")
inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
output = model.generate(inputs, min_length=80, max_length=100)
summary = tokenizer.decode(output[0])
HTML(summary)

HTML(value='<pad> data science is an interdisciplinary field focused on extracting knowledge from typically la…

### Summarizing Texts with the MapReduce Approach

In [47]:
from tqdm import tqdm
from ipywidgets import HTML

summaries = []

def summarize(text):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model.generate(inputs, max_length=1000, min_length=250, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0])    

for text in tqdm(texts):
    summaries.append(summarize(text))

summary = summarize("\n\n".join(summaries))
HTML(summary)

  0%|          | 0/12 [00:00<?, ?it/s]