<a href="https://colab.research.google.com/github/Viny2030/UNED/blob/main/10_Extract_text_from_documents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extract text from documents

Up to this point, all the examples have worked with sections of text that were already split by some other means. What happens if we're working with documents? First we need to get the text out of these documents, then figure out how to index it to best support vector search.

This notebook shows how documents can have text extracted and split to support vector search and retrieval augmented generation (RAG).

# Install dependencies

Install `txtai` and all dependencies. Since this notebook is using optional pipelines, we need to install the pipeline extras package.

In [1]:
%%capture
!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]

# Get test data
!wget -N https://github.com/neuml/txtai/releases/download/v6.2.0/tests.tar.gz
!tar -xvzf tests.tar.gz

# Install NLTK
import nltk
nltk.download(['punkt', 'punkt_tab'])

# Create a Textractor instance

The Textractor instance is the main entry point for extracting text. This pipeline is backed by Apache Tika, a robust text extraction library written in Java. [Apache Tika](https://tika.apache.org/0.9/formats.html) supports a large number of file formats: PDF, Word, Excel, HTML and others. The [Python Tika package](https://github.com/chrismattmann/tika-python) automatically installs Tika and starts a local REST API instance used to read extracted data.

*Note: This requires Java to be installed locally.*

In [3]:
# Java is not a pip package; just confirm that a runtime is on the PATH
!java --version

openjdk 11.0.25 2024-10-15
OpenJDK Runtime Environment (build 11.0.25+9-post-Ubuntu-1ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.25+9-post-Ubuntu-1ubuntu122.04, mixed mode, sharing)


In [2]:
%%capture

from txtai.pipeline import Textractor

# Create textractor model
textractor = Textractor()

# Extract text

The example below shows how to extract text from a file.

In [4]:
textractor("txtai/article.pdf")

INFO:tika.tika:Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
INFO:tika.tika:Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.


'Introducing txtai, an AI-powered search engine \nbuilt on Transformers\n\nAdd Natural Language Understanding to any application\n\nSearch is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s \nthe foundation of the internet and an ever-growing challenge that is never solved or done.\n\nThe field of Natural Language Processing (NLP) is rapidly evolving with a number of new \ndevelopments. Large-scale general language models are an exciting new capability allowing us to add \namazing functionality quickly with limited compute and people. Innovation continues with new models\nand advancements coming in at what seems a weekly basis.\n\nThis article introduces txtai, an AI-powered search engine that enables Natural Language \nUnderstanding (NLU) based search in any application.\n\nIntroducing txtai\ntxtai builds an AI-powered index over sections of text. txtai supports building text indices to perform \nsimilarity searches and create extract

Note that the text from the article was extracted into a single string. Depending on the article, this may be acceptable. For long articles, you'll often want to split the content into logical sections to build better downstream vectors.

# Extract sentences

Sentence extraction uses a model that specializes in sentence detection. This call returns a list of sentences.
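
To make the shape of that result concrete, here is a minimal sketch of sentence splitting. It is a naive regular-expression stand-in, not the sentence detection model the pipeline uses, and it will mis-split abbreviations such as "e.g. Text". The `split_sentences` helper and the sample text are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after . ! or ? when followed by whitespace and a capital letter."""
    # Collapse hard line wraps left over from extraction (the real pipeline output keeps them)
    text = re.sub(r"\s+", " ", text).strip()
    # Split on the whitespace between sentence-ending punctuation and an uppercase letter
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part for part in parts if part]

sample = (
    "Search is the base of many applications. Once data starts to pile up, users want to\n"
    "be able to find it. It's the foundation of the internet."
)
sentences = split_sentences(sample)
print(sentences)
```

As with the pipeline call, the result is a plain list of strings, one per sentence.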

In [5]:
textractor = Textractor(sentences=True)
textractor("txtai/article.pdf")

['Introducing txtai, an AI-powered search engine \nbuilt on Transformers\n\nAdd Natural Language Understanding to any application\n\nSearch is the base of many applications.',
 'Once data starts to pile up, users want to be able to find it.',
 'It’s \nthe foundation of the internet and an ever-growing challenge that is never solved or done.',
 'The field of Natural Language Processing (NLP) is rapidly evolving with a number of new \ndevelopments.',
 'Large-scale general language models are an exciting new capability allowing us to add \namazing functionality quickly with limited compute and people.',
 'Innovation continues with new models\nand advancements coming in at what seems a weekly basis.',
 'This article introduces txtai, an AI-powered search engine that enables Natural Language \nUnderstanding (NLU) based search in any application.',
 'Introducing txtai\ntxtai builds an AI-powered index over sections of text.',
 'txtai supports building text indices to perform \nsimilarity sea

Now the document is split up at the sentence level. These sentences can be fed to a workflow that adds each sentence to an embeddings index. Depending on the task, this may work well. Alternatively, it may be even better to split at the paragraph level.
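
One way to hand sentences to an index is to pair each one with a unique id. The sketch below builds `(id, text, tags)` tuples, a row shape that txtai's `Embeddings.index` accepts. The `to_rows` helper, the id scheme, and the model name in the commented lines are assumptions for illustration. The indexing call is left as a comment because it downloads a model.

```python
def to_rows(sentences, source):
    """Yield (id, text, tags) tuples, one per non-empty sentence."""
    for position, sentence in enumerate(sentences):
        # Collapse line wraps and repeated spaces left over from extraction
        text = " ".join(sentence.split())
        if text:
            # Hypothetical id scheme: source path plus sentence position
            yield (f"{source}#{position}", text, None)

sentences = [
    "Once data starts to pile up, users want to be able to find it.",
    "It's \nthe foundation of the internet and an ever-growing challenge that is never solved or done.",
]
rows = list(to_rows(sentences, "txtai/article.pdf"))
print(rows[0])

# With txtai installed, the rows could then be indexed along these lines
# (the model path is an assumption, not taken from this notebook):
# from txtai.embeddings import Embeddings
# embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
# embeddings.index(rows)
```

Keeping the source path in the id makes it possible to trace a search hit back to the document it came from.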

# Extract paragraphs

Paragraph detection looks for consecutive newlines. This call returns a list of paragraphs.
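
The idea of splitting on consecutive newlines can be sketched in a few lines of plain Python. This is an illustration of the technique, not the library's implementation; `split_paragraphs` and the sample text are hypothetical.

```python
import re

def split_paragraphs(text):
    """Split on runs of two or more newlines and drop empty chunks."""
    chunks = re.split(r"\n{2,}", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

sample = (
    "Introducing txtai, an AI-powered search engine \nbuilt on Transformers\n\n"
    "Add Natural Language Understanding to any application\n\n"
    "Search is the base of many applications."
)
paragraphs = split_paragraphs(sample)
for paragraph in paragraphs:
    print(paragraph, "\n----")
```

A single newline, such as the line wrap inside the title, stays within its paragraph. Only a blank line starts a new one.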

In [6]:
textractor = Textractor(paragraphs=True)
for paragraph in textractor("txtai/article.pdf"):
  print(paragraph, "\n----")

Introducing txtai, an AI-powered search engine 
built on Transformers 
----
Add Natural Language Understanding to any application 
----
Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s 
the foundation of the internet and an ever-growing challenge that is never solved or done. 
----
The field of Natural Language Processing (NLP) is rapidly evolving with a number of new 
developments. Large-scale general language models are an exciting new capability allowing us to add 
amazing functionality quickly with limited compute and people. Innovation continues with new models
and advancements coming in at what seems a weekly basis. 
----
This article introduces txtai, an AI-powered search engine that enables Natural Language 
Understanding (NLU) based search in any application. 
----
Introducing txtai
txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform 
similarity searches and create e

# Extract sections

Section extraction is format dependent. If page breaks are available, each section is a page. Otherwise, this call returns logical sections, such as those delimited by headings.
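
The page-or-heading behavior can be approximated as follows. This is a rough sketch under two assumptions that the notebook does not confirm: that page breaks appear in the extracted text as form feed characters (`\f`), and that logical sections are otherwise separated by a wider gap of three newlines. `split_sections` is a hypothetical helper, not the library's code.

```python
def split_sections(text):
    """Split on form feeds when present, otherwise on triple-newline gaps."""
    # Assumption: a page break shows up as a form feed character
    separator = "\f" if "\f" in text else "\n\n\n"
    return [section.strip() for section in text.split(separator) if section.strip()]

paged = "Page one text.\n\nStill page one.\fPage two text."
unpaged = "Heading A\nBody A.\n\n\nHeading B\nBody B."

print(split_sections(paged))
print(split_sections(unpaged))
```

In the paged sample the blank line inside page one does not start a new section, because the form feed takes precedence as the separator.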

In [7]:
textractor = Textractor(sections=True)
print("\n[PAGE BREAK]\n".join(section for section in textractor("txtai/article.pdf")))

Introducing txtai, an AI-powered search engine 
built on Transformers

Add Natural Language Understanding to any application

Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s 
the foundation of the internet and an ever-growing challenge that is never solved or done.

The field of Natural Language Processing (NLP) is rapidly evolving with a number of new 
developments. Large-scale general language models are an exciting new capability allowing us to add 
amazing functionality quickly with limited compute and people. Innovation continues with new models
and advancements coming in at what seems a weekly basis.

This article introduces txtai, an AI-powered search engine that enables Natural Language 
Understanding (NLU) based search in any application.

Introducing txtai
txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform 
similarity searches and create extractive question-answer