# Bimonthly Project: Information Retrieval System Based on Reuters-21578

**Prof. Iván Carrera**

**May 27, 2024**

## 1. Introduction

The goal of this project is to design, build, program, and deploy an Information Retrieval (IR) system using the Reuters-21578 corpus. The project is divided into several phases, which are described below.

## 2. Project Phases

### 2.1. Data Acquisition

**Objective:** Obtain and prepare the Reuters-21578 corpus.

**Tasks:**

- Download the Reuters-21578 corpus.
- Decompress and organize the files.
- Document the data acquisition process.
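
The acquisition steps can be scripted so that they are repeatable. The sketch below assumes the corpus has already been downloaded as a local ZIP archive (the archive name in the usage note is hypothetical) and that, once extracted, it has the layout the later cells rely on: a `training/` directory plus the `cats.txt` and `stopwords` files. The helper names are illustrative and not part of the notebook.

```python
import zipfile
from pathlib import Path

# Entries that the later cells read; taken from the paths used in this notebook.
ENTRADAS_ESPERADAS = ("training", "cats.txt", "stopwords")

def extraer_corpus(ruta_zip, destino):
    """Extract a local ZIP archive into `destino` and return that directory."""
    destino = Path(destino)
    destino.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(ruta_zip) as zf:
        zf.extractall(destino)
    return destino

def entradas_faltantes(raiz_corpus):
    """Return the expected entries that are missing under `raiz_corpus`."""
    raiz = Path(raiz_corpus)
    return [nombre for nombre in ENTRADAS_ESPERADAS if not (raiz / nombre).exists()]
```

A call such as `extraer_corpus('reuters.zip', '.')` followed by `entradas_faltantes('reuters')` would return an empty list when the layout is complete, which is a convenient check to record in the acquisition notes.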

### 2.2. Preprocessing

**Objective:** Clean and prepare the data for analysis.





In [1]:
import os
import re
import pandas as pd
from nltk.stem import SnowballStemmer

**Tasks:**

- Extract the relevant content from the documents.


In [2]:
# Directory containing the training files
directorio = 'reuters/training/'

# Read each file's content into a list; the file names are numeric document IDs
documentos = []
nombres_archivos = []

for archivo in os.listdir(directorio):
    ruta_archivo = os.path.join(directorio, archivo)
    with open(ruta_archivo, 'r', encoding='latin-1') as f:
        documentos.append(f.read())
        nombres_archivos.append(int(archivo))


print(f'Se han cargado {len(documentos)} documentos.')


Se han cargado 7769 documentos.


In [3]:
documentos

['BAHIA COCOA REVIEW\n  Showers continued throughout the week in\n  the Bahia cocoa zone, alleviating the drought since early\n  January and improving prospects for the coming temporao,\n  although normal humidity levels have not been restored,\n  Comissaria Smith said in its weekly review.\n      The dry period means the temporao will be late this year.\n      Arrivals for the week ended February 22 were 155,221 bags\n  of 60 kilos making a cumulative total for the season of 5.93\n  mln against 5.81 at the same stage last year. Again it seems\n  that cocoa delivered earlier on consignment was included in the\n  arrivals figures.\n      Comissaria Smith said there is still some doubt as to how\n  much old crop cocoa is still available as harvesting has\n  practically come to an end. With total Bahia crop estimates\n  around 6.4 mln bags and sales standing at almost 6.2 mln there\n  are a few hundred thousand bags still in the hands of farmers,\n  middlemen, exporters and processors.\n 

- Clean the data: remove unwanted characters, normalize the text, etc.


In [4]:
def limpiar_texto(texto):
    # Remove unwanted characters (keep only letters and whitespace)
    texto_limpio = re.sub(r'[^a-zA-Z\s]', '', texto)
    # Normalize to lowercase
    texto_limpio = texto_limpio.lower()
    # Collapse runs of whitespace and trim both ends
    texto_limpio = re.sub(r'\s+', ' ', texto_limpio).strip()
    return texto_limpio


In [5]:
documentos_limpios = [limpiar_texto(doc) for doc in documentos]
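
One side effect of deleting every non-letter character is that words joined by punctuation are fused into one token: "June/July" becomes `junejuly`, which is why stems such as `junejuli` and `augsept` appear in the processed output further down. A possible variant, shown here only as a sketch (the name `limpiar_texto_espacios` is hypothetical and not part of the notebook), replaces those characters with a space instead:

```python
import re

def limpiar_texto_espacios(texto):
    # Replace anything that is not a letter or whitespace with a space,
    # so "June/July" yields two tokens instead of one fused token.
    texto_limpio = re.sub(r'[^a-zA-Z\s]', ' ', texto)
    texto_limpio = texto_limpio.lower()
    # Collapse the runs of whitespace created by the substitution.
    return re.sub(r'\s+', ' ', texto_limpio).strip()

print(limpiar_texto_espacios('June/July 5.93 mln'))  # june july mln
```

The trade-off is that abbreviations are split as well ("U.S." becomes `u s`), so which variant works better would need to be checked against the evaluation in section 2.6.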

- Tokenization: split the text into words or tokens.


In [6]:
# Split on whitespace into words
def separar(doc):
    palabras = doc.split()
    return palabras

In [7]:
documentos_tokenizados_split = [separar(doc) for doc in documentos_limpios]

In [8]:
documentos_tokenizados_split[0]

['bahia',
 'cocoa',
 'review',
 'showers',
 'continued',
 'throughout',
 'the',
 'week',
 'in',
 'the',
 'bahia',
 'cocoa',
 'zone',
 'alleviating',
 'the',
 'drought',
 'since',
 'early',
 'january',
 'and',
 'improving',
 'prospects',
 'for',
 'the',
 'coming',
 'temporao',
 'although',
 'normal',
 'humidity',
 'levels',
 'have',
 'not',
 'been',
 'restored',
 'comissaria',
 'smith',
 'said',
 'in',
 'its',
 'weekly',
 'review',
 'the',
 'dry',
 'period',
 'means',
 'the',
 'temporao',
 'will',
 'be',
 'late',
 'this',
 'year',
 'arrivals',
 'for',
 'the',
 'week',
 'ended',
 'february',
 'were',
 'bags',
 'of',
 'kilos',
 'making',
 'a',
 'cumulative',
 'total',
 'for',
 'the',
 'season',
 'of',
 'mln',
 'against',
 'at',
 'the',
 'same',
 'stage',
 'last',
 'year',
 'again',
 'it',
 'seems',
 'that',
 'cocoa',
 'delivered',
 'earlier',
 'on',
 'consignment',
 'was',
 'included',
 'in',
 'the',
 'arrivals',
 'figures',
 'comissaria',
 'smith',
 'said',
 'there',
 'is',
 'still',
 's

- Remove stop words and apply stemming or lemmatization.
- Document each preprocessing step.

In [9]:
# Load the stop words from the file
ruta_stop_words = 'reuters/stopwords'
with open(ruta_stop_words, 'r', encoding='latin-1') as f:
    stop_words = set(f.read().split())

In [10]:
# Use the Snowball stemmer (another stemmer can be substituted)
stemmer = SnowballStemmer('english')

def procesar_tokens(tokens):
    # Remove stop words
    tokens_filtrados = [token for token in tokens if token not in stop_words]
    # Apply stemming
    tokens_stemmizados = [stemmer.stem(token) for token in tokens_filtrados]
    return tokens_stemmizados

In [11]:
documentos_procesados = [procesar_tokens(doc) for doc in documentos_tokenizados_split]
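
The same three steps (cleaning, tokenizing, filtering plus stemming) are later applied to queries as well, so wrapping them in a single function helps keep both paths consistent. The sketch below is self-contained: it takes the stop-word set and the stemming function as parameters, and the small stop-word set and identity "stemmer" in the demo are stand-ins, not the corpus stop-word file or the Snowball stemmer. The name `preprocesar` is hypothetical.

```python
import re

def preprocesar(texto, stop_words, stem=lambda token: token):
    """Clean, tokenize, drop stop words, and stem a raw string.

    `stem` is any callable that maps a token to its stem; in the notebook
    this role is played by `stemmer.stem`.
    """
    texto = re.sub(r'[^a-zA-Z\s]', '', texto).lower()
    tokens = texto.split()
    return [stem(token) for token in tokens if token not in stop_words]

# Demo with stand-in values (hypothetical stop words, identity stemmer).
stop_words_demo = {'the', 'in', 'of'}
print(preprocesar('Showers continued in the Bahia cocoa zone.', stop_words_demo))
# ['showers', 'continued', 'bahia', 'cocoa', 'zone']
```

With the notebook's own objects, the call would look like `preprocesar(doc, stop_words, stemmer.stem)`.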

In [12]:
documentos_procesados

[['bahia',
  'cocoa',
  'review',
  'shower',
  'continu',
  'week',
  'bahia',
  'cocoa',
  'zone',
  'allevi',
  'drought',
  'earli',
  'januari',
  'improv',
  'prospect',
  'come',
  'temporao',
  'normal',
  'humid',
  'level',
  'restor',
  'comissaria',
  'smith',
  'week',
  'review',
  'dri',
  'period',
  'mean',
  'temporao',
  'late',
  'year',
  'arriv',
  'week',
  'end',
  'februari',
  'bag',
  'kilo',
  'make',
  'cumul',
  'total',
  'season',
  'mln',
  'stage',
  'year',
  'cocoa',
  'deliv',
  'earlier',
  'consign',
  'includ',
  'arriv',
  'figur',
  'comissaria',
  'smith',
  'doubt',
  'crop',
  'cocoa',
  'harvest',
  'practic',
  'end',
  'total',
  'bahia',
  'crop',
  'estim',
  'mln',
  'bag',
  'sale',
  'stand',
  'mln',
  'hundr',
  'thousand',
  'bag',
  'hand',
  'farmer',
  'middlemen',
  'export',
  'processor',
  'doubt',
  'cocoa',
  'fit',
  'export',
  'shipper',
  'experienc',
  'dificulti',
  'obtain',
  'bahia',
  'superior',
  'certif',
  '

### 2.3. Vector Space Representation of the Data

**Objective:** Convert the texts into a form that algorithms can process.

**Tasks:**

- Use techniques such as Bag of Words (BoW) and TF-IDF to vectorize the text.




In [13]:
# Join the processed tokens back into a single string per document
documentos_procesados_texto = [' '.join(doc) for doc in documentos_procesados]


In [14]:
documentos_procesados_texto

['bahia cocoa review shower continu week bahia cocoa zone allevi drought earli januari improv prospect come temporao normal humid level restor comissaria smith week review dri period mean temporao late year arriv week end februari bag kilo make cumul total season mln stage year cocoa deliv earlier consign includ arriv figur comissaria smith doubt crop cocoa harvest practic end total bahia crop estim mln bag sale stand mln hundr thousand bag hand farmer middlemen export processor doubt cocoa fit export shipper experienc dificulti obtain bahia superior certif view lower qualiti recent week farmer sold good part cocoa held consign comissaria smith spot bean price rose cruzado arroba kilo bean shipper reluct offer nearbi shipment limit sale book march shipment dlrs tonn port name crop sale light open port junejuli dlrs dlrs york juli augsept dlrs tonn fob routin sale butter made marchapril sold dlrs aprilmay butter time york junejuli dlrs augsept dlrs time york sept octdec dlrs time york d

In [15]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer_bow = CountVectorizer()
vectorizer_tfidf = TfidfVectorizer()

In [16]:
X_bow = vectorizer_bow.fit_transform(documentos_procesados_texto)

print(f'BoW shape: {X_bow.shape}')


BoW shape: (7769, 21411)


In [17]:
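# toarray() builds a dense 7769 x 21411 matrix (roughly 1.3 GB with 8-byte integers); use it for inspection only
df_bow = pd.DataFrame(X_bow.toarray(), columns=vectorizer_bow.get_feature_names_out())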

In [18]:
df_bow

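| | aa | aaa | aachen | aaminus | aancor | aap | aaplus | aar | aarnoud | aaron | ... | zorinski | zseven | zuccherifici | zuckerman | zulia | zurich | zurichbas | zuyuan | zverev | zzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7764 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7765 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7766 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7767 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |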


In [19]:
# TF-IDF

X_tfidf = vectorizer_tfidf.fit_transform(documentos_procesados_texto)
print(f'TF-IDF shape: {X_tfidf.shape}')

TF-IDF shape: (7769, 21411)


In [20]:
# Convert to a DataFrame for inspection
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
df_tfidf


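| | aa | aaa | aachen | aaminus | aancor | aap | aaplus | aar | aarnoud | aaron | ... | zorinski | zseven | zuccherifici | zuckerman | zulia | zurich | zurichbas | zuyuan | zverev | zzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7764 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7765 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7766 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7767 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |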


- Evaluate the different vectorization techniques.
- Document the methods and the results obtained.
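
To make the difference between the two vectorization techniques concrete, the sketch below computes both by hand on a toy corpus. It follows the formula that scikit-learn documents for `TfidfVectorizer` with its default settings: a smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count and followed by L2 normalization of each row. The toy documents and the function names are illustrative.

```python
import math
from collections import Counter

def bow(tokens):
    """Bag of Words: raw term counts for one tokenized document."""
    return Counter(tokens)

def tfidf(corpus_tokens):
    """TF-IDF rows as dicts, with smoothed idf and L2 normalization."""
    n = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))  # document frequency: count each document once
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    filas = []
    for tokens in corpus_tokens:
        pesos = {t: c * idf[t] for t, c in bow(tokens).items()}
        norma = math.sqrt(sum(w * w for w in pesos.values()))
        filas.append({t: w / norma for t, w in pesos.items()} if norma else pesos)
    return filas

corpus = [['cocoa', 'bahia', 'cocoa'], ['cocoa', 'price'], ['wheat', 'price']]
filas = tfidf(corpus)
# 'bahia' occurs in one document and 'cocoa' in two, so 'bahia' has the larger idf;
# in document 0 'cocoa' still ends up with the larger weight because it occurs twice.
print(sorted(filas[0].items()))
```

BoW keeps only the counts, so frequent but uninformative terms dominate; TF-IDF rescales each count by how rare the term is across the corpus, which is the main point to check when comparing the two representations.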

### 2.4. Indexing

**Objective:** Create an index that enables efficient searches.

**Tasks:**

- Build an inverted index that maps terms to documents.
- Implement and optimize data structures for the index.
- Document the index construction process.



In [21]:
def construir_indice_invertido(documentos):
    indice_invertido = {}  # Plain dictionary: term -> set of document indices
    for doc_id, doc in enumerate(documentos):
        for palabra in doc:
            if palabra not in indice_invertido:
                indice_invertido[palabra] = set()  # Start a posting set for a new term
            indice_invertido[palabra].add(doc_id)  # Record that the term occurs in doc_id
    return indice_invertido

indice_invertido = construir_indice_invertido(documentos_procesados)
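
One way to approach the "optimize data structures" task is to store, for each term, how often it occurs in each document rather than only the set of document indices. The following sketch uses `collections.defaultdict` for that; the function name and the toy documents are illustrative, and this is one possible layout rather than the structure the notebook uses.

```python
from collections import defaultdict

def construir_indice_con_frecuencias(documentos):
    """Map each term to {doc_id: term frequency in that document}."""
    indice = defaultdict(dict)
    for doc_id, tokens in enumerate(documentos):
        for token in tokens:
            indice[token][doc_id] = indice[token].get(doc_id, 0) + 1
    return dict(indice)

docs = [['cocoa', 'bahia', 'cocoa'], ['cocoa', 'price'], ['wheat', 'price']]
indice = construir_indice_con_frecuencias(docs)
print(indice['cocoa'])          # {0: 2, 1: 1}
print(sorted(indice['price']))  # [1, 2]
```

The keys of each inner dictionary give the same posting set as before, and the stored counts make it possible to score documents without going back to the document-term matrix.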


In [22]:
indice_invertido

{'bahia': {0, 941, 1240, 2047},
 'cocoa': {0,
  9,
  82,
  262,
  288,
  300,
  310,
  318,
  319,
  364,
  366,
  389,
  394,
  474,
  489,
  780,
  862,
  941,
  944,
  1166,
  1190,
  1518,
  1528,
  1723,
  1755,
  2016,
  2047,
  2237,
  2573,
  2947,
  3088,
  3389,
  3408,
  3457,
  4013,
  4226,
  4286,
  4643,
  4661,
  4708,
  4805,
  4881,
  4959,
  5132,
  5281,
  5444,
  5448,
  5505,
  5900,
  6036,
  6078,
  6691,
  7027,
  7093,
  7107,
  7432,
  7489,
  7702,
  7736},
 'review': {0,
  25,
  115,
  271,
  332,
  430,
  502,
  505,
  506,
  593,
  627,
  686,
  757,
  770,
  780,
  813,
  819,
  826,
  830,
  854,
  892,
  957,
  1105,
  1218,
  1278,
  1308,
  1362,
  1576,
  1728,
  1778,
  1887,
  1890,
  1999,
  2018,
  2047,
  2141,
  2177,
  2199,
  2299,
  2401,
  2493,
  2527,
  2549,
  2683,
  2704,
  2759,
  2795,
  2847,
  2848,
  2852,
  2862,
  2912,
  2920,
  2935,
  2939,
  3099,
  3108,
  3112,
  3120,
  3214,
  3263,
  3337,
  3368,
  3370,
  3446,
  347

In [23]:
len(indice_invertido)

21411

In [24]:
def obtener_documentos_relevantes(consulta, indice_invertido):
    # Apply the same preprocessing to the query as to the documents
    consulta_procesada = procesar_tokens(separar(limpiar_texto(consulta)))
    # Candidate set: indices of the documents that contain at least one query term
    documentos_relevantes = set()
    # Iterate over each term of the query
    for palabra in consulta_procesada:
        # Look the term up in the inverted index
        if palabra in indice_invertido:
            # Add the indices of the documents that contain the term
            documentos_relevantes.update(indice_invertido[palabra])
    # Similarity scoring and ranking are done in a separate step
    return documentos_relevantes
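
The function above returns the union of the posting sets, so a multi-term query matches any document that contains at least one of its terms. A stricter conjunctive (AND) variant can be sketched as follows. It is self-contained and takes query terms that have already been preprocessed; the function name and the demo index are hypothetical.

```python
def documentos_con_todos_los_terminos(terminos, indice_invertido):
    """Return the document indices that contain every term in `terminos`."""
    postings = [indice_invertido.get(t, set()) for t in terminos]
    if not postings:
        return set()
    # Intersect starting from the shortest posting set to keep the work small.
    postings.sort(key=len)
    resultado = set(postings[0])
    for p in postings[1:]:
        resultado &= p
        if not resultado:
            break
    return resultado

indice_demo = {'cocoa': {0, 1, 4}, 'price': {1, 2, 4}, 'bahia': {0}}
print(sorted(documentos_con_todos_los_terminos(['cocoa', 'price'], indice_demo)))  # [1, 4]
```

The union favors recall and the intersection favors precision; for single-term queries such as `earn` the two behave identically.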

In [25]:
consulta_procesada = procesar_tokens(separar(limpiar_texto('earn')))
consulta_vector = vectorizer_bow.transform([' '.join(consulta_procesada)])
df_consulta = pd.DataFrame(consulta_vector.toarray(), columns=vectorizer_bow.get_feature_names_out())
df_consulta

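| | aa | aaa | aachen | aaminus | aancor | aap | aaplus | aar | aarnoud | aaron | ... | zorinski | zseven | zuccherifici | zuckerman | zulia | zurich | zurichbas | zuyuan | zverev | zzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |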


In [26]:
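documentos_encontrados = obtener_documentos_relevantes('earn', indice_invertido)
# Row i of bow_2 corresponds to the i-th element of list(documentos_encontrados);
# the set must not be modified before the ranking step, which relies on that order.
bow_2 = X_bow[list(documentos_encontrados)]
df_bow_2 = pd.DataFrame(bow_2.toarray(), columns=vectorizer_bow.get_feature_names_out())
df_bow_2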

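| | aa | aaa | aachen | aaminus | aancor | aap | aaplus | aar | aarnoud | aaron | ... | zorinski | zseven | zuccherifici | zuckerman | zulia | zurich | zurichbas | zuyuan | zverev | zzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 524 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 525 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 526 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 527 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |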


In [27]:
documentos_encontrados

{4,
 13,
 41,
 57,
 62,
 70,
 78,
 91,
 100,
 115,
 117,
 119,
 124,
 160,
 177,
 182,
 188,
 199,
 281,
 282,
 285,
 298,
 299,
 301,
 322,
 341,
 348,
 381,
 382,
 407,
 411,
 412,
 443,
 451,
 489,
 502,
 515,
 527,
 533,
 534,
 537,
 539,
 550,
 585,
 596,
 606,
 622,
 636,
 640,
 653,
 665,
 666,
 716,
 717,
 729,
 753,
 758,
 777,
 780,
 790,
 792,
 818,
 868,
 889,
 894,
 913,
 931,
 967,
 973,
 983,
 1021,
 1042,
 1057,
 1070,
 1079,
 1082,
 1100,
 1101,
 1102,
 1103,
 1105,
 1107,
 1110,
 1115,
 1143,
 1147,
 1219,
 1224,
 1228,
 1233,
 1245,
 1249,
 1250,
 1283,
 1284,
 1297,
 1326,
 1332,
 1349,
 1354,
 1373,
 1425,
 1427,
 1429,
 1439,
 1490,
 1504,
 1522,
 1545,
 1547,
 1549,
 1551,
 1570,
 1632,
 1645,
 1647,
 1654,
 1665,
 1679,
 1697,
 1745,
 1768,
 1778,
 1786,
 1788,
 1793,
 1801,
 1802,
 1813,
 1823,
 1836,
 1860,
 1895,
 1900,
 1920,
 1938,
 1944,
 1961,
 1965,
 1966,
 1970,
 1976,
 1992,
 2010,
 2034,
 2060,
 2080,
 2091,
 2092,
 2099,
 2107,
 2140,
 2154,
 2179,
 

### 2.5. Search Engine Design

**Objective:** Implement the search functionality.

**Tasks:**

- Develop the logic for processing user queries.
- Implement similarity measures such as cosine similarity or Jaccard similarity.
- Develop a ranking algorithm to order the results.
- Document the architecture and the algorithms used.
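
Jaccard similarity, one of the measures named above, compares the sets of distinct terms in the query and in a document: J(A, B) = |A ∩ B| / |A ∪ B|. A minimal sketch follows, with illustrative names and toy data; unlike cosine similarity over TF-IDF vectors, it ignores both term frequency and term rarity.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token sequences, in [0, 1]."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty inputs
    return len(a & b) / len(a | b)

def ranking_jaccard(consulta_tokens, documentos_tokens):
    """Return (doc_id, score) pairs sorted by decreasing Jaccard score."""
    puntuaciones = [(i, jaccard(consulta_tokens, d)) for i, d in enumerate(documentos_tokens)]
    return sorted(puntuaciones, key=lambda par: par[1], reverse=True)

docs = [['cocoa', 'bahia', 'cocoa'], ['cocoa', 'price'], ['wheat', 'price']]
print(ranking_jaccard(['cocoa', 'price'], docs))
# [(1, 1.0), (0, 0.3333333333333333), (2, 0.3333333333333333)]
```

Documents 0 and 2 tie because each shares one of the two query terms and has one extra term; the repeated `cocoa` in document 0 has no effect on the score.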

In [28]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity between the query and each candidate document
similitud_coseno = cosine_similarity(consulta_vector, bow_2).flatten()
# Pair each document index with its score (same iteration order as the rows of bow_2)
similitud_coseno_id = [(doc_id, similitud_coseno[pos]) for pos, doc_id in enumerate(documentos_encontrados)]
similitud_coseno_id.sort(key=lambda x: x[1], reverse=True)
# Show the results ordered by similarity
print(f"Documentos ordenados por similitud: {similitud_coseno_id}")

Documentos ordenados por similitud: [(5540, 0.5157106231293966), (1100, 0.5), (1332, 0.49393916995360654), (4946, 0.4931969619160719), (382, 0.4789474720713997), (2099, 0.4662524041201569), (4577, 0.4472135954999579), (1103, 0.42874646285627205), (3809, 0.423999152002544), (2691, 0.3956282840374722), (1057, 0.3922322702763681), (5684, 0.39056673294247163), (3323, 0.3872983346207417), (596, 0.3721042037676254), (3434, 0.36719403681726276), (7758, 0.3651483716701107), (533, 0.3611575592573076), (4976, 0.35856858280031806), (6447, 0.35355339059327373), (2034, 0.33927557187198837), (534, 0.3380617018914066), (3736, 0.3348247650912125), (2491, 0.3321819194149599), (7677, 0.32826608214930636), (666, 0.32551538350846376), (381, 0.31622776601683794), (1900, 0.31622776601683794), (4811, 0.3113995776646092), (4433, 0.3095292930136547), (7711, 0.3086066999241838), (2698, 0.30779350562554625), (3760, 0.3061862178478973), (6408, 0.30323921743156135), (7727, 0.30151134457776363), (913, 0.29934217004
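
For reference, the cosine similarity computed above is the dot product of the two vectors divided by the product of their Euclidean norms. The sketch below implements it on sparse vectors stored as dictionaries and keeps only the k best results with `heapq.nlargest`, which avoids sorting the full candidate list. The names and the toy vectors are illustrative.

```python
import heapq
import math

def coseno(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    punto = sum(peso * v.get(termino, 0.0) for termino, peso in u.items())
    norma_u = math.sqrt(sum(p * p for p in u.values()))
    norma_v = math.sqrt(sum(p * p for p in v.values()))
    if norma_u == 0 or norma_v == 0:
        return 0.0
    return punto / (norma_u * norma_v)

def top_k(consulta, candidatos, k=10):
    """Return the k (doc_id, score) pairs with the highest cosine score."""
    puntuaciones = ((doc_id, coseno(consulta, vec)) for doc_id, vec in candidatos.items())
    return heapq.nlargest(k, puntuaciones, key=lambda par: par[1])

candidatos = {7: {'earn': 1, 'net': 1}, 9: {'earn': 2}, 12: {'profit': 3}}
print(top_k({'earn': 1}, candidatos, k=2))
```

In this toy example document 9 ranks first: it contains only the query term, so its vector points in the same direction as the query even though document 7 also mentions `earn`.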

### TF-IDF

In [29]:
# Vectorize the same processed query with the TF-IDF vectorizer
consulta_vector = vectorizer_tfidf.transform([' '.join(consulta_procesada)])
df_consulta = pd.DataFrame(consulta_vector.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
df_consulta

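| | aa | aaa | aachen | aaminus | aancor | aap | aaplus | aar | aarnoud | aaron | ... | zorinski | zseven | zuccherifici | zuckerman | zulia | zurich | zurichbas | zuyuan | zverev | zzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |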


In [30]:
tfidf_2 = X_tfidf[list(documentos_encontrados)]
df_tfidf_2 = pd.DataFrame(tfidf_2.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
df_tfidf_2

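| | aa | aaa | aachen | aaminus | aancor | aap | aaplus | aar | aarnoud | aaron | ... | zorinski | zseven | zuccherifici | zuckerman | zulia | zurich | zurichbas | zuyuan | zverev | zzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 524 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 525 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 526 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 527 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |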


In [31]:
# Compute the cosine similarity between the TF-IDF query vector and each candidate document
similitud_coseno = cosine_similarity(consulta_vector, tfidf_2).flatten()
# Pair each document index with its score (same iteration order as the rows of tfidf_2)
similitud_coseno_id = [(doc_id, similitud_coseno[pos]) for pos, doc_id in enumerate(documentos_encontrados)]
similitud_coseno_id.sort(key=lambda x: x[1], reverse=True)
# Show the results ordered by similarity
print(f"Documentos ordenados por similitud: {similitud_coseno_id}")

Documentos ordenados por similitud: [(382, 0.49153764965803437), (5540, 0.45032175264960234), (1332, 0.41469841235281935), (1100, 0.4119406327626826), (1103, 0.4032011159094924), (1057, 0.3793017954006838), (4946, 0.37352652391771163), (2099, 0.3644402416780455), (2034, 0.3573488554704223), (596, 0.35292019766683935), (7758, 0.33733045702226083), (3809, 0.33243520091921014), (2491, 0.3293812803360798), (533, 0.3206743306373214), (4577, 0.31203050047833053), (3760, 0.31022917710052533), (5676, 0.3019556604309389), (5853, 0.30117644439006114), (7711, 0.29631662383284646), (3736, 0.29427619345554085), (4823, 0.29009564706096597), (7677, 0.28915252780565875), (7707, 0.2883324018139173), (1219, 0.28668331355964466), (2080, 0.2838225416407841), (2691, 0.28222245515883315), (5684, 0.2810956482997161), (3434, 0.27886355439314825), (7727, 0.2786827238426178), (3323, 0.2759642944755592), (5606, 0.27533914610084237), (100, 0.2711262296357742), (2707, 0.26953184759759885), (534, 0.2687216530429661

### 2.6. System Evaluation

**Objective:** Measure the effectiveness of the system.

**Tasks:**

- Define a set of evaluation metrics (precision, recall, F1-score).
- Run tests using the test split of the corpus.
- Compare the performance of different system configurations.
- Document the results and the analysis.
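
For a single query, the three metrics listed above reduce to set operations between the retrieved and the relevant document IDs. The sketch below, with an illustrative function name and made-up IDs, states the definitions directly.

```python
def precision_recall_f1(recuperados, relevantes):
    """Precision, recall, and F1 for one query, from two collections of doc IDs."""
    recuperados, relevantes = set(recuperados), set(relevantes)
    aciertos = len(recuperados & relevantes)  # true positives
    precision = aciertos / len(recuperados) if recuperados else 0.0
    recall = aciertos / len(relevantes) if relevantes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: 4 documents retrieved, 3 of them relevant, 6 relevant overall.
print(precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))  # (0.75, 0.5, 0.6)
```

Precision penalizes retrieving irrelevant documents, recall penalizes missing relevant ones, and F1 is their harmonic mean, so it stays low whenever either of the two is low.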

In [32]:
# Parse cats.txt to obtain the ground truth (category labels per document)
ruta_cats = 'reuters/cats.txt'
gran_verdad = {}

with open(ruta_cats, 'r', encoding='latin-1') as f:
    for linea in f:
        if linea.startswith('training/'):
            partes = linea.strip().split()
            doc_id = int(partes[0].split('/')[1])  # Document ID (the file name)
            categorias = partes[1:]  # Category labels
            gran_verdad[doc_id] = categorias

# Check the structure of gran_verdad
print(f'gran_verdad: {list(gran_verdad.items())[:5]}')

gran_verdad: [(1, ['cocoa']), (5, ['sorghum', 'oat', 'barley', 'corn', 'wheat', 'grain']), (6, ['wheat', 'sorghum', 'grain', 'sunseed', 'corn', 'oilseed', 'soybean', 'sun-oil', 'soy-oil', 'lin-oil', 'veg-oil']), (9, ['earn']), (10, ['acq'])]


In [33]:
consulta_procesada 

['earn']

In [34]:
# Define relevance by category match
categorias_consulta = {'earn'}  # Categories expected for the query
documentos_relevantes_esperados = set()

for doc_id, categorias in gran_verdad.items():
    if any(categoria in categorias_consulta for categoria in categorias):
        documentos_relevantes_esperados.add(doc_id)

In [35]:
documentos_relevantes_esperados = list(documentos_relevantes_esperados)
documentos_relevantes_esperados

[9,
 11,
 12,
 13,
 14,
 18,
 23,
 24,
 8218,
 27,
 8219,
 8221,
 8223,
 8226,
 8227,
 36,
 37,
 38,
 8229,
 40,
 41,
 8236,
 50,
 53,
 8245,
 56,
 8253,
 8254,
 64,
 65,
 66,
 8258,
 8262,
 71,
 74,
 8269,
 8270,
 82,
 83,
 8275,
 85,
 86,
 87,
 8276,
 89,
 93,
 8286,
 8289,
 98,
 8291,
 8292,
 8294,
 8296,
 8297,
 108,
 8300,
 113,
 8311,
 8313,
 8314,
 123,
 8317,
 126,
 8318,
 8320,
 129,
 130,
 8322,
 138,
 139,
 140,
 142,
 143,
 8334,
 145,
 146,
 147,
 8335,
 8337,
 8338,
 151,
 152,
 8339,
 8340,
 8349,
 8350,
 160,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 8364,
 8366,
 183,
 8378,
 187,
 196,
 8388,
 8391,
 201,
 202,
 8394,
 8399,
 210,
 212,
 214,
 8407,
 8409,
 8424,
 8425,
 8426,
 8435,
 8447,
 8448,
 8449,
 8454,
 8457,
 8458,
 279,
 8475,
 8479,
 8480,
 8481,
 8487,
 8488,
 8489,
 8490,
 299,
 8491,
 8501,
 8502,
 8506,
 317,
 8510,
 8513,
 8520,
 8521,
 8526,
 8528,
 8530,
 8533,
 8534,
 345,
 8540,
 8541,
 8545,
 8547,
 356,
 8555,
 8556,
 8559,
 8561,
 8562,
 8565,

In [36]:
# Row indices of the retrieved documents, ordered by TF-IDF cosine similarity
# (despite the variable name, these are positions in `documentos`, not file names)
documentos_encontrados_nombres_tfidf = [int(i) for i, _ in similitud_coseno_id]
documentos_encontrados_nombres_tfidf

[382,
 5540,
 1332,
 1100,
 1103,
 1057,
 4946,
 2099,
 2034,
 596,
 7758,
 3809,
 2491,
 533,
 4577,
 3760,
 5676,
 5853,
 7711,
 3736,
 4823,
 7677,
 7707,
 1219,
 2080,
 2691,
 5684,
 3434,
 7727,
 3323,
 5606,
 100,
 2707,
 534,
 4433,
 7590,
 7379,
 4674,
 6407,
 666,
 381,
 4976,
 6447,
 1965,
 1490,
 1944,
 4811,
 6408,
 913,
 1250,
 1900,
 665,
 2526,
 322,
 7490,
 5590,
 5013,
 3652,
 1961,
 1439,
 1115,
 4106,
 4119,
 527,
 3711,
 5656,
 4496,
 2092,
 6072,
 4933,
 6112,
 2140,
 5270,
 70,
 7481,
 4871,
 5552,
 1788,
 5849,
 5662,
 5654,
 4989,
 4163,
 6169,
 1551,
 2698,
 1224,
 6820,
 2806,
 6918,
 7155,
 1813,
 285,
 7509,
 5485,
 3675,
 7274,
 182,
 1522,
 4057,
 2845,
 585,
 7038,
 6326,
 7172,
 6309,
 7040,
 1283,
 281,
 4,
 124,
 1349,
 5136,
 1654,
 2860,
 3129,
 5904,
 5822,
 5863,
 6492,
 6488,
 5534,
 1354,
 1042,
 2466,
 5032,
 1021,
 1802,
 188,
 6815,
 5322,
 7603,
 5699,
 2787,
 3747,
 2850,
 4580,
 6057,
 6683,
 2937,
 2727,
 7735,
 6115,
 7188,
 931,
 7101,
 

In [37]:
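# Map the file-name IDs from cats.txt to row indices in `documentos` (dict lookup instead of list.index)
indice_por_nombre = {nombre: i for i, nombre in enumerate(nombres_archivos)}
documentos_relevantes_esperados_nuevo = [indice_por_nombre[id_doc] for id_doc in documentos_relevantes_esperados]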

In [38]:
documentos_relevantes_esperados_nuevo

[7125,
 635,
 1297,
 1859,
 2159,
 2526,
 2809,
 2873,
 6629,
 3061,
 6630,
 6632,
 6634,
 6637,
 6638,
 3650,
 3712,
 3777,
 6639,
 3904,
 3981,
 6641,
 4537,
 4741,
 6644,
 4960,
 6648,
 6649,
 5438,
 5512,
 5577,
 6652,
 6654,
 5923,
 6096,
 6656,
 6657,
 6616,
 6676,
 6659,
 6789,
 6859,
 6932,
 6660,
 7056,
 7331,
 6663,
 6666,
 7634,
 6668,
 6669,
 6671,
 6672,
 6673,
 514,
 6678,
 838,
 6683,
 6684,
 6685,
 1490,
 6686,
 1691,
 6687,
 6689,
 1811,
 1860,
 6690,
 2116,
 2137,
 2160,
 2191,
 2211,
 6695,
 2252,
 2284,
 2311,
 6696,
 6697,
 6698,
 2358,
 2362,
 6699,
 6701,
 6706,
 6708,
 2410,
 2426,
 2432,
 2439,
 2449,
 2455,
 2463,
 2471,
 6716,
 6717,
 2538,
 6724,
 2562,
 2611,
 6726,
 6728,
 2644,
 2649,
 6729,
 6731,
 2687,
 2700,
 2710,
 6738,
 6739,
 6746,
 6747,
 6748,
 6756,
 6763,
 6764,
 6765,
 6768,
 6769,
 6770,
 3117,
 6774,
 6776,
 6778,
 6779,
 6780,
 6781,
 6782,
 6784,
 3246,
 6785,
 6791,
 6792,
 6793,
 3379,
 6796,
 6798,
 6803,
 6804,
 6806,
 6807,
 6809,
 6

In [39]:
from sklearn.metrics import precision_score, recall_score, f1_score
# Evaluate the TF-IDF results (sets make each membership test constant-time)
relevantes = set(documentos_relevantes_esperados_nuevo)
recuperados = set(documentos_encontrados_nombres_tfidf)
y_true = [1 if i in relevantes else 0 for i in range(len(documentos))]
y_pred = [1 if i in recuperados else 0 for i in range(len(documentos))]


In [40]:
y_true

[0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,


In [41]:
y_pred

[0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


In [42]:
# Compute precision, recall, and F1-score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f'Precisión: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1}')

Precisión: 0.725897920604915
Recall: 0.1334723670490094
F1-score: 0.22548443922489725
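
The scores above treat the result as an unordered set, and they appear consistent with 384 relevant documents among 529 retrieved, out of 2877 relevant documents in total: recall is bounded by the candidate set, which only contains documents with the literal stem `earn`, while relevance is defined by the category label. Since the engine returns a ranked list, rank-aware measures such as precision at k and average precision are a possible complement. The sketch below uses illustrative names and made-up IDs; it is not part of the notebook's evaluation.

```python
def precision_en_k(ranking, relevantes, k):
    """Fraction of the first k ranked doc IDs that are relevant."""
    if k <= 0:
        return 0.0
    relevantes = set(relevantes)
    return sum(1 for doc_id in ranking[:k] if doc_id in relevantes) / k

def precision_promedio(ranking, relevantes):
    """Average precision: mean of precision@k over the ranks k that hold a relevant doc."""
    relevantes = set(relevantes)
    if not relevantes:
        return 0.0
    aciertos, suma = 0, 0.0
    for posicion, doc_id in enumerate(ranking, start=1):
        if doc_id in relevantes:
            aciertos += 1
            suma += aciertos / posicion
    # Relevant documents that were never retrieved contribute zero.
    return suma / len(relevantes)

ranking = [10, 20, 30, 40]    # made-up ranked result list
relevantes = {10, 30, 99}     # made-up relevance judgments
print(precision_en_k(ranking, relevantes, 2))  # 0.5
print(precision_promedio(ranking, relevantes))
```

Reporting precision at a small k (for example 10) alongside the set-based scores would show whether the top of the ranking is useful even when overall recall is low.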


### 2.7. Web User Interface

**Objective:** Create an interface for interacting with the system.

**Tasks:**

- Design a web interface where users can enter queries.
- Display the search results clearly and in ranked order.
- Implement additional features such as filters and display options.
- Document the design and features of the interface.
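
On the Python side, the web interface mainly needs a function that turns a query string into a JSON payload that the JavaScript front end can render. The sketch below uses only the standard library; the parameter names `q` and `k`, the payload fields, and the `buscar` callback (standing in for the notebook's retrieval and ranking code) are all assumptions made for illustration.

```python
import json
from urllib.parse import parse_qs

def responder_consulta(query_string, buscar, k_por_defecto=10):
    """Build a JSON response for a query string such as 'q=earn&k=5'.

    `buscar` is a hypothetical callable: it takes the raw query text and
    returns a list of (doc_id, score) pairs sorted by decreasing score.
    """
    parametros = parse_qs(query_string)
    consulta = parametros.get('q', [''])[0].strip()
    try:
        k = max(1, int(parametros.get('k', [k_por_defecto])[0]))
    except ValueError:
        k = k_por_defecto
    resultados = buscar(consulta)[:k] if consulta else []
    return json.dumps({
        'query': consulta,
        'results': [{'doc_id': d, 'score': round(s, 4)} for d, s in resultados],
    })

# Stand-in search function returning fixed, made-up results.
def buscar_demo(consulta):
    return [(3, 0.9), (8, 0.5), (1, 0.2)]

print(responder_consulta('q=earn&k=2', buscar_demo))
```

Keeping this function independent of any web framework means it can be tested on its own and then wired into whichever server the project ends up using.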

## 3. Final Deliverables

- **Complete Documentation:** Covering the processes, the decisions made, and the results of each phase.
- **Source Code:** Organized and well commented.
- **Evaluation Report:** Detailed analysis of the system evaluation.
- **System Demonstration:** Working presentation of the system through the web interface.

## 4. Technical Requirements

- **Programming Languages:** Python (preprocessing and modeling), JavaScript (for the web interface).

## 5. Project Grading

- **Functionality:** (35%) Effectiveness and efficiency of information retrieval.
- **Documentation:** (35%) Clarity of the documentation for each phase.
- **Innovation and Creativity:** (15%) In the implementation of techniques and the user interface.
- **Final Presentation:** (15%) Quality and clarity of the system demonstration.