<a href="https://colab.research.google.com/github/davidlealo/100profes/blob/master/resumen_libro_grande.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set up the environment in Google Colab


In [1]:
!pip install openai langchain PyPDF2 transformers

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


## Interface for uploading the book
Colab's file-upload widget lets you upload the book, for example as a PDF or a plain-text file.

In [2]:
from google.colab import files

uploaded = files.upload()

for filename in uploaded.keys():
    print(f"Uploaded file: {filename}")


Saving (El Libro Universitario) Martin Heidegger - Caminos de bosque-Alianza Editorial (2010).pdf to (El Libro Universitario) Martin Heidegger - Caminos de bosque-Alianza Editorial (2010).pdf
Uploaded file: (El Libro Universitario) Martin Heidegger - Caminos de bosque-Alianza Editorial (2010).pdf


## Process the file
If the file is a PDF:

In [3]:
from PyPDF2 import PdfReader

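def extract_text_from_pdf(pdf_path):
    """Extract and concatenate the text of every page in the PDF."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # Pages without a text layer (e.g. scanned images) may yield nothing
        text += page.extract_text() or ""
    return text

# Use the first uploaded file
file_path = list(uploaded.keys())[0]
book_text = extract_text_from_pdf(file_path)
print(f"The book has {len(book_text)} characters.")


The book has 802910 characters.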


## Split the text into chunks
Split the book into manageable chunks that the model can process.

In [5]:
def split_text(text, max_length=2000):
    """Group sentences into chunks of roughly max_length characters."""
    sentences = text.split('. ')
    chunks = []
    chunk = ""
    for sentence in sentences:
        if len(chunk) + len(sentence) <= max_length:
            chunk += sentence + ". "
        else:
            # Avoid appending an empty chunk when the very first sentence is too long
            if chunk:
                chunks.append(chunk.strip())
            chunk = sentence + ". "
    if chunk:
        chunks.append(chunk.strip())
    return chunks

text_chunks = split_text(book_text)
print(f"The book was split into {len(text_chunks)} chunks.")


The book was split into 423 chunks.
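
Splitting only on `'. '` misses sentences that end in `?` or `!`, and a single very long sentence can still exceed the limit. The following is a minimal sketch of a stricter splitter under the same character-based limit. The name `split_into_chunks` and the demo text are hypothetical and not part of the notebook.

```python
import re

def split_into_chunks(text, max_length=2000):
    """Group sentences into chunks of at most max_length characters."""
    # Split after '.', '!' or '?' followed by whitespace, keeping the punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself longer than the limit
        pieces = [sentence[i:i + max_length] for i in range(0, len(sentence), max_length)]
        for piece in pieces:
            if current and len(current) + 1 + len(piece) > max_length:
                chunks.append(current)
                current = piece
            else:
                current = f"{current} {piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks

# Hypothetical demo text: two short sentences, one oversized sentence, one short one
demo = "First sentence. Second one? " + "x" * 50 + ". Last!"
demo_chunks = split_into_chunks(demo, max_length=30)
print(demo_chunks)
```

Unlike the cell above, this variant guarantees that no chunk exceeds `max_length`, at the cost of cutting an oversized sentence mid-word.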


## Generate a summary of each chunk
We will use either OpenAI or a Hugging Face model to summarize each chunk.

### With OpenAI (requires an OpenAI API key):


In [None]:
from openai import OpenAI
from getpass import getpass

# Prompt for the API key without echoing it to the screen
client = OpenAI(api_key=getpass("Please enter your OpenAI API key: "))

def summarize_text(text):
    """
    Summarize the given text with OpenAI's gpt-3.5-turbo model.
    """
    try:
        # openai>=1.0 (what a plain `pip install openai` provides) removed
        # openai.ChatCompletion; requests go through a client object instead
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            # The prompt stays in Spanish ("Summarize the following text") so the
            # summary is likely to come back in the book's language
            messages=[{"role": "user", "content": f"Resume el siguiente texto: {text}"}],
            max_tokens=500,
            temperature=0.7,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error summarizing the text: {e}")
        return ""

# Process each chunk and generate the summaries
try:
    summaries = [summarize_text(chunk) for chunk in text_chunks]
    full_summary = "\n".join(summaries)
    print("Full summary generated.")
except Exception as e:
    print(f"An error occurred: {e}")
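
Sending hundreds of chunks to a remote API makes transient failures (rate limits, timeouts) likely, and a chunk that fails once is simply lost from the summary. The following is a minimal sketch of a retry wrapper with exponential backoff. `with_retries`, `flaky_summarize`, and the delay values are hypothetical. The wrapper only helps if the wrapped function raises on failure rather than returning an empty string.

```python
import time

def with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Wrap func so that failed calls are retried with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # Out of attempts: surface the last error
                # Wait 1x, 2x, 4x, ... the base delay before trying again
                sleep(base_delay * 2 ** (attempt - 1))
    return wrapper

# Hypothetical stand-in for an API call that fails twice before succeeding
calls = {"count": 0}

def flaky_summarize(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return text[:10]

delays = []  # Record the delays instead of actually sleeping in this demo
robust_summarize = with_retries(flaky_summarize, sleep=delays.append)
result = robust_summarize("A long passage of text")
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant. In the notebook the default `time.sleep` would be used.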


### With Hugging Face (models such as facebook/bart-large-cnn):


In [6]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_text_hf(text):
    # truncation=True guards against chunks longer than the model's input limit
    summary = summarizer(text, max_length=130, min_length=30, do_sample=False, truncation=True)
    return summary[0]['summary_text']

summaries = [summarize_text_hf(chunk) for chunk in text_chunks]
full_summary = "\n".join(summaries)

print("Full summary generated.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

Device set to use cpu


Full summary generated.
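
Joining one summary per chunk still yields a long text when the book has hundreds of chunks. One common option is to summarize the summaries in groups until the result is short enough. The following is a minimal sketch of that idea. `summarize_hierarchically`, `max_chars`, and `group_size` are hypothetical, and it assumes that a group of summaries fits within the model's input limit.

```python
def summarize_hierarchically(chunks, summarize, max_chars=4000, group_size=5):
    """Summarize chunks, then summarize groups of summaries until short enough."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks the list")
    summaries = [summarize(chunk) for chunk in chunks]
    # Keep merging while the combined summaries are longer than the target
    while len(summaries) > 1 and sum(len(s) for s in summaries) > max_chars:
        grouped = [
            " ".join(summaries[i:i + group_size])
            for i in range(0, len(summaries), group_size)
        ]
        summaries = [summarize(group) for group in grouped]
    return "\n".join(summaries)

# Stand-in summarizer for the demo: keep only the first 20 characters
def fake_summarize(text):
    return text[:20]

demo_chunks = ["word " * 100] * 30
demo_summary = summarize_hierarchically(demo_chunks, fake_summarize, max_chars=200)
print(demo_summary.count("\n") + 1, "summaries remain")
```

In the notebook, `summarize` would be a function such as `summarize_text_hf`. Each extra round costs more model calls and loses detail.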


## Export the summary

In [7]:
# Write as UTF-8 so accented characters are saved consistently
with open("resumen_libro.txt", "w", encoding="utf-8") as f:
    f.write(full_summary)

from google.colab import files
files.download("resumen_libro.txt")



## New version

In [1]:
!pip install transformers
!pip install torch
!pip install PyPDF2
!pip install python-docx
!pip install ebooklib
!pip install beautifulsoup4


Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2
Collecting ebooklib
  Downloading EbookLib-0.18.tar.gz (115 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ebooklib
  Building wheel fo

In [3]:
# Load libraries
import os
from transformers import pipeline, AutoTokenizer
from PyPDF2 import PdfReader
from docx import Document
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
import re
import torch
from google.colab import files

In [4]:
# Configure the summarization model (device=-1 runs the pipeline on the CPU)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=-1)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

Device set to use cpu


In [None]:
# Functions for processing the different file formats

def extract_text_from_pdf(file_path):
    try:
        print("Processing PDF file...")
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text
        return text.strip() if text.strip() else "Error: Could not extract any text from the PDF."
    except Exception as e:
        return f"Error processing the PDF: {e}"

def extract_text_from_txt(file_path):
    try:
        print("Processing TXT file...")
        with open(file_path, "r", encoding="utf-8") as file:
            text = file.read()
        return text.strip()
    except Exception as e:
        return f"Error processing the TXT: {e}"

def extract_text_from_doc(file_path):
    try:
        print("Processing DOC/DOCX file...")
        # python-docx reads only .docx; a legacy .doc file ends up in the except branch
        doc = Document(file_path)
        text = "\n".join([paragraph.text for paragraph in doc.paragraphs])
        return text.strip()
    except Exception as e:
        return f"Error processing the DOC/DOCX: {e}"

def extract_text_from_epub(file_path):
    try:
        print("Processing EPUB file...")
        book = epub.read_epub(file_path)
        text = ""
        for item in book.get_items():
            if item.get_type() == ebooklib.ITEM_DOCUMENT:
                soup = BeautifulSoup(item.get_content(), "html.parser")
                text += soup.get_text() + "\n"
        return text.strip()
    except Exception as e:
        return f"Error processing the EPUB: {e}"

# Clean the extracted text
def clean_text(text):
    print("Cleaning text...")
    # Collapse runs of whitespace, then drop everything except word characters,
    # whitespace, periods and commas
    text = re.sub(r"\s+", " ", text)
    return re.sub(r"[^\w\s.,]", "", text).strip()

# Split the text into chunks based on token counts
def split_text(text, max_length=1024):
    print("Splitting text into chunks...")
    # Heuristic margin: reserve room for special tokens and for small differences
    # between per-sentence token counts and the token count of the joined chunk
    budget = max_length - 24
    chunks, chunk, chunk_tokens = [], "", 0
    for sentence in text.split('. '):
        piece = sentence + ". "
        # Tokenize each sentence once; re-tokenizing the growing chunk on every
        # iteration is very slow for a whole book
        n_tokens = len(tokenizer(piece, add_special_tokens=False)['input_ids'])
        if chunk and chunk_tokens + n_tokens > budget:
            chunks.append(chunk.strip())
            chunk, chunk_tokens = "", 0
        chunk += piece
        chunk_tokens += n_tokens
    if chunk:
        chunks.append(chunk.strip())
    print(f"Chunks created: {len(chunks)}")
    return chunks

# Summarize a chunk of text
def summarize_text(text):
    try:
        print("Generating summary...")
        # truncation=True guards against a single sentence longer than the input limit
        summary = summarizer(text, max_length=130, min_length=30, do_sample=False, truncation=True)
        return summary[0]['summary_text']
    except Exception as e:
        return f"Error generating summary: {e}"

# Main function: extract the text from the file and summarize it
def summarize_file(file_path):
    extension = file_path.split('.')[-1].lower()
    if extension == "pdf":
        text = extract_text_from_pdf(file_path)
    elif extension == "txt":
        text = extract_text_from_txt(file_path)
    elif extension in ["doc", "docx"]:
        text = extract_text_from_doc(file_path)
    elif extension == "epub":
        text = extract_text_from_epub(file_path)
    else:
        return "Error: Unsupported format. Use PDF, TXT, DOC, DOCX or EPUB."

    # The extractors report failures as strings that start with "Error"; checking
    # only the prefix avoids rejecting books that merely contain the word
    if text.startswith("Error"):
        return text

    text = clean_text(text)
    chunks = split_text(text)
    summaries = [summarize_text(chunk) for chunk in chunks]
    return "\n\n".join(summaries)

# Upload a file from Colab
print("Please upload a file to process...")
uploaded = files.upload()

for file_name in uploaded.keys():
    print(f"Processing {file_name}...")
    summary = summarize_file(file_name)

    # Save the summary to a text file
    output_file = "resumen_generado.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(summary)

    print(f"Summary saved to {output_file}")
    files.download(output_file)


Please upload a file to process...


Saving Jacques Ranciere - El filosofo y sus pobres-Universidad Nacional de General Sarmiento INADI (2013).pdf to Jacques Ranciere - El filosofo y sus pobres-Universidad Nacional de General Sarmiento INADI (2013).pdf
Processing Jacques Ranciere - El filosofo y sus pobres-Universidad Nacional de General Sarmiento INADI (2013).pdf...
Processing PDF file...
Cleaning text...
Splitting text into chunks...
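
Reporting failures as strings means every caller has to inspect the returned text to tell an error from real content. A possible alternative is to raise exceptions and choose the extractor from a lookup table. The following is a minimal sketch of that approach. `extract_text`, `UnsupportedFormatError`, and the demo extractor are hypothetical names, not part of the notebook.

```python
from pathlib import Path

class UnsupportedFormatError(ValueError):
    """Raised when no extractor is registered for a file extension."""

def extract_text(file_path, extractors):
    """Pick the extractor registered for the file's extension and run it."""
    extension = Path(file_path).suffix.lower().lstrip(".")
    try:
        extractor = extractors[extension]
    except KeyError:
        supported = ", ".join(sorted(extractors))
        raise UnsupportedFormatError(
            f"Unsupported format '{extension}'. Supported: {supported}"
        ) from None
    text = extractor(file_path)
    if not text.strip():
        raise ValueError(f"No text could be extracted from {file_path}")
    return text

# Demo with a stand-in extractor; the notebook's real functions would be
# registered as {"pdf": extract_text_from_pdf, "txt": extract_text_from_txt, ...}
demo_extractors = {"txt": lambda path: "hello"}
demo_text = extract_text("book.TXT", demo_extractors)

try:
    extract_text("book.mobi", demo_extractors)
    demo_error = None
except UnsupportedFormatError as exc:
    demo_error = str(exc)
print(demo_text, "|", demo_error)
```

With this shape, supporting a new format means adding one dictionary entry, and errors can no longer end up written into the summary file as if they were content.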


In [None]:
# Upload a file from Colab
print("Please upload a file to process...")
uploaded = files.upload()

for file_name in uploaded.keys():
    print(f"Processing {file_name}...")
    summary = summarize_file(file_name)

    # Build the output file name from the original file name
    base_name = os.path.splitext(file_name)[0]  # File name without the extension
    output_file = f"Resumen_{base_name}.txt"

    # Save the summary to a text file
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(summary)

    print(f"Summary saved to {output_file}")
    files.download(output_file)
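
Summarizing a whole book on a CPU can take a long time, and a Colab session may disconnect before the loop finishes. The following is a minimal sketch that saves progress after every chunk so a rerun can resume where it stopped. `summarize_with_checkpoint` and the checkpoint file name are hypothetical, and it assumes the chunks are identical and in the same order on every run.

```python
import json
import os
import tempfile

def summarize_with_checkpoint(chunks, summarize, checkpoint_path):
    """Summarize chunks one by one, saving progress after each chunk."""
    done = []
    if os.path.exists(checkpoint_path):
        # Resume: load the summaries finished in a previous run
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            done = json.load(f)
    for chunk in chunks[len(done):]:
        done.append(summarize(chunk))
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(done, f, ensure_ascii=False)
    return done

# Demo with a stand-in summarizer (str.upper) and a temporary checkpoint file
demo_path = os.path.join(tempfile.mkdtemp(), "summaries.json")
first_run = summarize_with_checkpoint(["a", "b"], str.upper, demo_path)
second_run = summarize_with_checkpoint(["a", "b", "c"], str.upper, demo_path)
print(first_run, second_run)
```

In the demo, the second call reuses the two saved summaries and only processes the third chunk.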