# Bimonthly Project: Information Retrieval System Based on Reuters-21578

**Prof. Iván Carrera**

**May 27, 2024**

## 1. Introduction

The goal of this project is to design, build, program, and deploy an Information Retrieval System (IRS) using the Reuters-21578 corpus. The project is divided into several phases, described below.

## 2. Project Phases

### 2.1. Data Acquisition

**Objective:** Obtain and prepare the Reuters-21578 corpus.

**Tasks:**

- Download the Reuters-21578 corpus.
- Decompress and organize the files.
- Document the data acquisition process.
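
The decompress-and-organize task can be sketched as follows. This is a minimal sketch that assumes the corpus has already been downloaded as a single archive; the function name `extraer_corpus`, the archive name, and the destination folder are hypothetical and not part of the original notebook. It unpacks the archive and reports how many files each folder contains, which is a convenient figure to record when documenting the acquisition step.

```python
import os
import shutil

def extraer_corpus(ruta_archivo, destino):
    """Unpack the archive into `destino` and count the files in each folder."""
    # unpack_archive infers the format (.zip, .tar.gz, ...) from the extension
    shutil.unpack_archive(ruta_archivo, destino)
    conteo = {}
    for carpeta, _, archivos in os.walk(destino):
        if archivos:
            # Folder names are reported relative to the destination directory
            conteo[os.path.relpath(carpeta, destino)] = len(archivos)
    return conteo

# Hypothetical usage; the archive name and target folder are assumptions:
# print(extraer_corpus('reuters.zip', 'datos/'))
```

The returned dictionary can be compared against the document count printed when the files are loaded.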

### 2.2. Preprocessing

**Objective:** Clean and prepare the data for analysis.





In [1]:
import os
import re
import pandas as pd

from nltk.stem import SnowballStemmer

**Tasks:**

- Extract the relevant content from the documents.


In [2]:
# Directory containing the corpus files
directorio = 'reuters/training/'

# Read the contents of each file into a list
documentos = []
for archivo in os.listdir(directorio):
    ruta_archivo = os.path.join(directorio, archivo)
    with open(ruta_archivo, 'r', encoding='latin-1') as f:
        documentos.append(f.read())

print(f'Loaded {len(documentos)} documents.')


Loaded 7769 documents.


- Clean the data: remove unwanted characters, normalize the text, etc.


In [3]:
def limpiar_texto(texto):
    # Remove unwanted characters (keep only letters and whitespace)
    texto_limpio = re.sub(r'[^a-zA-Z\s]', '', texto)
    # Normalize to lowercase
    texto_limpio = texto_limpio.lower()
    # Collapse repeated whitespace
    texto_limpio = re.sub(r'\s+', ' ', texto_limpio).strip()
    return texto_limpio


In [4]:
documentos_limpios = [limpiar_texto(doc) for doc in documentos]
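
Because `limpiar_texto` deletes every non-letter character, words joined by punctuation are fused into a single token; this is where tokens such as `junejuli` and `augsept` in the later outputs come from. The sketch below shows one possible alternative, not the notebook's method: replacing unwanted characters with a space instead of removing them. The name `limpiar_texto_con_espacios` is hypothetical.

```python
import re

def limpiar_texto_con_espacios(texto):
    """Variant of limpiar_texto that keeps word boundaries at punctuation."""
    # Replace anything that is not a letter or whitespace with a space
    texto_limpio = re.sub(r'[^a-zA-Z\s]', ' ', texto)
    # Normalize to lowercase
    texto_limpio = texto_limpio.lower()
    # Collapse the runs of spaces created by the substitution
    return re.sub(r'\s+', ' ', texto_limpio).strip()

ejemplo = 'June/July 1,750 dlrs, Aug/Sept'
print(limpiar_texto_con_espacios(ejemplo))  # june july dlrs aug sept
```

Whether to split or fuse such tokens is a design choice; splitting generally yields a smaller vocabulary with fewer one-off terms.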

- Tokenization: split the text into words or tokens.


In [5]:
# Split the text into words on whitespace
def separar(doc):
    palabras = doc.split()
    return palabras

In [6]:
documentos_tokenizados_split = [separar(doc) for doc in documentos_limpios]

In [7]:
documentos_tokenizados_split[0]

['bahia',
 'cocoa',
 'review',
 'showers',
 'continued',
 'throughout',
 'the',
 'week',
 'in',
 'the',
 'bahia',
 'cocoa',
 'zone',
 'alleviating',
 'the',
 'drought',
 'since',
 'early',
 'january',
 'and',
 'improving',
 'prospects',
 'for',
 'the',
 'coming',
 'temporao',
 'although',
 'normal',
 'humidity',
 'levels',
 'have',
 'not',
 'been',
 'restored',
 'comissaria',
 'smith',
 'said',
 'in',
 'its',
 'weekly',
 'review',
 'the',
 'dry',
 'period',
 'means',
 'the',
 'temporao',
 'will',
 'be',
 'late',
 'this',
 'year',
 'arrivals',
 'for',
 'the',
 'week',
 'ended',
 'february',
 'were',
 'bags',
 'of',
 'kilos',
 'making',
 'a',
 'cumulative',
 'total',
 'for',
 'the',
 'season',
 'of',
 'mln',
 'against',
 'at',
 'the',
 'same',
 'stage',
 'last',
 'year',
 'again',
 'it',
 'seems',
 'that',
 'cocoa',
 'delivered',
 'earlier',
 'on',
 'consignment',
 'was',
 'included',
 'in',
 'the',
 'arrivals',
 'figures',
 'comissaria',
 'smith',
 'said',
 'there',
 'is',
 'still',
 ...]

- Remove stop words and apply stemming or lemmatization.
- Document each preprocessing step.

In [8]:
# Load the stop words from the file
ruta_stop_words = 'reuters/stopwords'
with open(ruta_stop_words, 'r', encoding='latin-1') as f:
    stop_words = set(f.read().split())

In [9]:
# Use the Snowball stemmer (swap in another stemmer if you prefer)
stemmer = SnowballStemmer('english')

def procesar_tokens(tokens):
    # Remove stop words
    tokens_filtrados = [token for token in tokens if token not in stop_words]
    # Apply stemming
    tokens_stemmizados = [stemmer.stem(token) for token in tokens_filtrados]
    return tokens_stemmizados

In [10]:
documentos_procesados = [procesar_tokens(doc) for doc in documentos_tokenizados_split]

In [11]:
documentos_procesados

[['bahia',
  'cocoa',
  'review',
  'shower',
  'continu',
  'week',
  'bahia',
  'cocoa',
  'zone',
  'allevi',
  'drought',
  'earli',
  'januari',
  'improv',
  'prospect',
  'come',
  'temporao',
  'normal',
  'humid',
  'level',
  'restor',
  'comissaria',
  'smith',
  'week',
  'review',
  'dri',
  'period',
  'mean',
  'temporao',
  'late',
  'year',
  'arriv',
  'week',
  'end',
  'februari',
  'bag',
  'kilo',
  'make',
  'cumul',
  'total',
  'season',
  'mln',
  'stage',
  'year',
  'cocoa',
  'deliv',
  'earlier',
  'consign',
  'includ',
  'arriv',
  'figur',
  'comissaria',
  'smith',
  'doubt',
  'crop',
  'cocoa',
  'harvest',
  'practic',
  'end',
  'total',
  'bahia',
  'crop',
  'estim',
  'mln',
  'bag',
  'sale',
  'stand',
  'mln',
  'hundr',
  'thousand',
  'bag',
  'hand',
  'farmer',
  'middlemen',
  'export',
  'processor',
  'doubt',
  'cocoa',
  'fit',
  'export',
  'shipper',
  'experienc',
  'dificulti',
  'obtain',
  'bahia',
  'superior',
  'certif',
  ...]]

### 2.3. Vector Space Representation of the Data

**Objective:** Convert the texts into a form that algorithms can process.

**Tasks:**

- Use techniques such as Bag of Words (BoW) and TF-IDF to vectorize the text.




In [12]:
# Join the processed tokens back into a single string per document
documentos_procesados_texto = [' '.join(doc) for doc in documentos_procesados]


In [13]:
documentos_procesados_texto

['bahia cocoa review shower continu week bahia cocoa zone allevi drought earli januari improv prospect come temporao normal humid level restor comissaria smith week review dri period mean temporao late year arriv week end februari bag kilo make cumul total season mln stage year cocoa deliv earlier consign includ arriv figur comissaria smith doubt crop cocoa harvest practic end total bahia crop estim mln bag sale stand mln hundr thousand bag hand farmer middlemen export processor doubt cocoa fit export shipper experienc dificulti obtain bahia superior certif view lower qualiti recent week farmer sold good part cocoa held consign comissaria smith spot bean price rose cruzado arroba kilo bean shipper reluct offer nearbi shipment limit sale book march shipment dlrs tonn port name crop sale light open port junejuli dlrs dlrs york juli augsept dlrs tonn fob routin sale butter made marchapril sold dlrs aprilmay butter time york junejuli dlrs augsept dlrs time york sept octdec dlrs time york d

In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer_bow = CountVectorizer()
vectorizer_tfidf = TfidfVectorizer()

In [15]:
X_bow = vectorizer_bow.fit_transform(documentos_procesados_texto)
print(f'BoW shape: {X_bow.shape}')


BoW shape: (7769, 21411)


In [16]:
# Note: toarray() materializes a dense 7769 x 21411 matrix; fine for inspection, but memory-hungry
df_bow = pd.DataFrame(X_bow.toarray(), columns=vectorizer_bow.get_feature_names_out())

In [17]:
df_bow

Unnamed: 0,aa,aaa,aachen,aaminus,aancor,aap,aaplus,aar,aarnoud,aaron,...,zorinski,zseven,zuccherifici,zuckerman,zulia,zurich,zurichbas,zuyuan,zverev,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7764,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7765,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7766,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7767,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# TF-IDF

X_tfidf = vectorizer_tfidf.fit_transform(documentos_procesados_texto)
print(f'TF-IDF shape: {X_tfidf.shape}')

TF-IDF shape: (7769, 21411)
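
To make the two weighting schemes concrete, the following pure-Python sketch computes Bag of Words counts and TF-IDF weights for three toy documents. It assumes the default `TfidfVectorizer` settings, that is, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row; the function name `tfidf` and the toy documents are illustrative and do not come from the notebook.

```python
import math
from collections import Counter

def tfidf(docs_tokens):
    """Return (vocabulary, rows), where each row maps term -> TF-IDF weight."""
    n = len(docs_tokens)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    filas = []
    for tokens in docs_tokens:
        tf = Counter(tokens)  # raw counts: this is the Bag of Words row
        pesos = {t: c * idf[t] for t, c in tf.items()}
        # L2-normalize the row so every non-empty document has unit length
        norma = math.sqrt(sum(w * w for w in pesos.values()))
        filas.append({t: w / norma for t, w in pesos.items()} if norma else {})
    return sorted(df), filas

docs = [['cocoa', 'bahia', 'cocoa'], ['cocoa', 'export'], ['wheat', 'export']]
vocab, filas = tfidf(docs)
print(vocab)  # ['bahia', 'cocoa', 'export', 'wheat']
print(filas[0])
```

BoW and TF-IDF share the same vocabulary, which is why both matrices above have the same shape; only the cell values differ.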


In [19]:
# Convert to a DataFrame for inspection
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
df_tfidf


Unnamed: 0,aa,aaa,aachen,aaminus,aancor,aap,aaplus,aar,aarnoud,aaron,...,zorinski,zseven,zuccherifici,zuckerman,zulia,zurich,zurichbas,zuyuan,zverev,zzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7764,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7766,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7767,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- Evaluate the different vectorization techniques.
- Document the methods and results obtained.

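In [22]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Assumption: `etiquetas` holds one category label per document, in the same
# order as `documentos`. It is not defined earlier in this notebook and must be
# loaded first. train_test_split only returns four values when it receives both
# the features and the labels; passing the texts alone returns two.
X_train, X_test, y_train, y_test = train_test_split(
    documentos_procesados_texto, etiquetas, test_size=0.2, random_state=42)

# Fit the BoW vocabulary on the training split only, then reuse it for the test split
X_train_bow = vectorizer_bow.fit_transform(X_train)
X_test_bow = vectorizer_bow.transform(X_test)

# Train a Naive Bayes classifier on the BoW features
nb_classifier_bow = MultinomialNB()
nb_classifier_bow.fit(X_train_bow, y_train)
predicciones_bow = nb_classifier_bow.predict(X_test_bow)
print(classification_report(y_test, predicciones_bow))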

### 2.4. Indexing

**Objective:** Create an index that enables efficient searches.

**Tasks:**

- Build an inverted index that maps terms to documents.
- Implement and optimize data structures for the index.
- Document the index construction process.
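
As an illustration of the first task, here is a minimal sketch of an inverted index built from tokenized documents such as those produced in the preprocessing phase. The function names `construir_indice` and `buscar_and`, the posting format `(doc_id, term_frequency)`, and the toy documents are assumptions made for the example, not a prescribed design.

```python
from collections import defaultdict

def construir_indice(docs_tokens):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    indice = defaultdict(dict)
    for doc_id, tokens in enumerate(docs_tokens):
        for token in tokens:
            indice[token][doc_id] = indice[token].get(doc_id, 0) + 1
    return {termino: sorted(post.items()) for termino, post in indice.items()}

def buscar_and(indice, terminos):
    """Return the ids of the documents that contain every query term."""
    conjuntos = [{d for d, _ in indice.get(t, [])} for t in terminos]
    return sorted(set.intersection(*conjuntos)) if conjuntos else []

docs = [['cocoa', 'bahia', 'cocoa'], ['cocoa', 'export'], ['wheat', 'export']]
indice = construir_indice(docs)
print(indice['cocoa'])                          # [(0, 2), (1, 1)]
print(buscar_and(indice, ['cocoa', 'export']))  # [1]
```

Keeping the postings sorted by document id is one common optimization, since it allows lists to be intersected by merging rather than by building sets.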

### 2.5. Search Engine Design

**Objective:** Implement the search functionality.

**Tasks:**

- Develop the logic for processing user queries.
- Implement similarity algorithms such as cosine similarity or Jaccard.
- Develop a ranking algorithm to order the results.
- Document the architecture and algorithms used.
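
The similarity and ranking tasks can be sketched in a few lines. The example below is a simplified illustration that uses raw term counts as vector weights; a real system would more likely use TF-IDF weights and would first pass the query through the same cleaning, stop-word removal, and stemming as the documents. The function names `coseno`, `jaccard`, and `rankear` are hypothetical.

```python
import math
from collections import Counter

def coseno(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    producto = sum(w * b.get(t, 0.0) for t, w in a.items())
    norma_a = math.sqrt(sum(w * w for w in a.values()))
    norma_b = math.sqrt(sum(w * w for w in b.values()))
    return producto / (norma_a * norma_b) if norma_a and norma_b else 0.0

def jaccard(a, b):
    """Jaccard similarity between two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rankear(consulta, docs_tokens, k=3):
    """Return up to k (doc_id, score) pairs sorted by descending cosine score."""
    q = Counter(consulta)
    puntajes = [(i, coseno(q, Counter(d))) for i, d in enumerate(docs_tokens)]
    # Documents that share no term with the query are dropped
    puntajes = [(i, s) for i, s in puntajes if s > 0]
    return sorted(puntajes, key=lambda par: par[1], reverse=True)[:k]

docs = [['cocoa', 'bahia', 'cocoa'], ['cocoa', 'export'], ['wheat', 'export']]
print(rankear(['cocoa', 'export'], docs))
print(jaccard(['cocoa', 'export'], docs[1]))  # 1.0
```

Cosine similarity takes term weights into account while Jaccard only considers set overlap, so comparing both on the same queries is a natural experiment for the evaluation phase.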

### 2.6. System Evaluation

**Objective:** Measure the effectiveness of the system.

**Tasks:**

- Define a set of evaluation metrics (precision, recall, F1-score).
- Run tests using the test set of the corpus.
- Compare the performance of different system configurations.
- Document the results and analysis.
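
For a single query, the three metrics can be computed from the set of retrieved documents and the set of relevant documents. The sketch below assumes that relevance judgments are available as a collection of document ids per query; the function name `evaluar` and the example ids are illustrative.

```python
def evaluar(recuperados, relevantes):
    """Return (precision, recall, f1) for one query."""
    recuperados, relevantes = set(recuperados), set(relevantes)
    aciertos = len(recuperados & relevantes)
    # Precision: fraction of the retrieved documents that are relevant
    precision = aciertos / len(recuperados) if recuperados else 0.0
    # Recall: fraction of the relevant documents that were retrieved
    recall = aciertos / len(relevantes) if relevantes else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 documents retrieved, 3 relevant, 2 in common
print(evaluar([1, 2, 3, 4], [2, 4, 5]))
```

Here precision is 2/4, recall is 2/3, and F1 is 4/7. Averaging these values over all test queries gives a single figure per system configuration.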

### 2.7. Web User Interface

**Objective:** Create an interface for interacting with the system.

**Tasks:**

- Design a web interface where users can enter queries.
- Display the search results in a clear and orderly manner.
- Implement additional features such as filters and display options.
- Document the design and functionality of the interface.
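
One possible way to connect a JavaScript front end to the Python search engine is a small HTTP endpoint that returns JSON. The sketch below covers only that hypothetical back-end piece using the standard library; the `/search` path, the `q` parameter, the response fields, and the placeholder `buscar` function are all assumptions, and a web framework could be used instead.

```python
import json
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse

def buscar(consulta):
    """Placeholder search function (hypothetical); swap in the real engine."""
    return [{'doc_id': 0, 'score': 1.0}] if consulta else []

def responder(ruta):
    """Turn a request path such as '/search?q=cocoa' into a JSON string."""
    parametros = parse_qs(urlparse(ruta).query)
    consulta = parametros.get('q', [''])[0].strip()
    return json.dumps({'query': consulta, 'results': buscar(consulta)})

class ManejadorBusqueda(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serialize the search results and send them as a JSON response
        cuerpo = responder(self.path).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json; charset=utf-8')
        self.send_header('Content-Length', str(len(cuerpo)))
        self.end_headers()
        self.wfile.write(cuerpo)

# To try it locally (left commented out so the sketch does not block):
# from http.server import HTTPServer
# HTTPServer(('localhost', 8000), ManejadorBusqueda).serve_forever()

print(responder('/search?q=cocoa+export'))
```

Keeping the query handling in a plain function such as `responder` makes it testable without starting a server.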

## 3. Final Submission

- **Complete Documentation:** Covering the processes, decisions made, and results of each phase.
- **Source Code:** Organized and well commented.
- **Evaluation Report:** Detailed analysis of the system evaluation.
- **System Demonstration:** Working presentation of the system through the web interface.

## 4. Technical Requirements

- **Programming Languages:** Python (preprocessing and modeling), JavaScript (for the web interface).

## 5. Project Evaluation

- **Functionality:** (35%) Effectiveness and efficiency of information retrieval.
- **Documentation:** (35%) Clarity of the documentation for each phase.
- **Innovation and Creativity:** (15%) In the implementation of techniques and the user interface.
- **Final Presentation:** (15%) Quality and clarity of the system demonstration.