# Proyecto Bimestral: Sistema de Recuperación de Información basado en Reuters-21578

**Prof. Iván Carrera**

**27 de mayo de 2024**

## 1. Introducción

El objetivo de este proyecto es diseñar, construir, programar y desplegar un Sistema de Recuperación de Información (SRI) utilizando el corpus Reuters-21578. El proyecto se dividirá en varias fases, que se describen a continuación.

## 2. Fases del Proyecto

### 2.1. Adquisición de Datos

**Objetivo:** Obtener y preparar el corpus Reuters-21578.

**Tareas:**

- Descargar el corpus Reuters-21578.
- Descomprimir y organizar los archivos.
- Documentar el proceso de adquisición de datos.

### 2.2. Preprocesamiento

**Objetivo:** Limpiar y preparar los datos para su análisis.


importamos las librerias que se necesitan


In [22]:
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import cosine_similarity
import funciones as fn

Cargamos el corpus y los nombres de cada texto en sus respectivas variables


In [2]:
# Directorio donde se encuentran los archivos
directorio = 'reuters/training/'
# Leer el contenido de los archivos en una lista
documentos = []
nombres_archivos = []
for archivo in os.listdir(directorio):
    ruta_archivo = os.path.join(directorio, archivo)
    with open(ruta_archivo, 'r', encoding='latin-1') as f:
        documentos.append(f.read())
        nombres_archivos.append(int(archivo))
print(f'Se han cargado {len(documentos)} documentos.')

Se han cargado 7769 documentos.


Eliminamos los caracteres no deseados y normalizamos todos los textos dentro de la lista documentos


In [3]:
documentos_limpios = [fn.limpiar_texto(doc) for doc in documentos]

Dividimos los textos que se encuentran en documentos_limpios en palabras o tokens.


In [4]:
documentos_tokenizados = [fn.separar(doc) for doc in documentos_limpios]
documentos_tokenizados

[['bahia',
  'cocoa',
  'review',
  'showers',
  'continued',
  'throughout',
  'the',
  'week',
  'in',
  'the',
  'bahia',
  'cocoa',
  'zone',
  'alleviating',
  'the',
  'drought',
  'since',
  'early',
  'january',
  'and',
  'improving',
  'prospects',
  'for',
  'the',
  'coming',
  'temporao',
  'although',
  'normal',
  'humidity',
  'levels',
  'have',
  'not',
  'been',
  'restored',
  'comissaria',
  'smith',
  'said',
  'in',
  'its',
  'weekly',
  'review',
  'the',
  'dry',
  'period',
  'means',
  'the',
  'temporao',
  'will',
  'be',
  'late',
  'this',
  'year',
  'arrivals',
  'for',
  'the',
  'week',
  'ended',
  'february',
  'were',
  'bags',
  'of',
  'kilos',
  'making',
  'a',
  'cumulative',
  'total',
  'for',
  'the',
  'season',
  'of',
  'mln',
  'against',
  'at',
  'the',
  'same',
  'stage',
  'last',
  'year',
  'again',
  'it',
  'seems',
  'that',
  'cocoa',
  'delivered',
  'earlier',
  'on',
  'consignment',
  'was',
  'included',
  'in',
  'the'

cargamos las stop words en una variable


In [5]:
# Cargar las stop words desde el archivo
ruta_stop_words = 'reuters/stopwords'
with open(ruta_stop_words, 'r', encoding='latin-1') as f:
    stop_words = set(fn.separar(f.read()))
stop_words

{'a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

Eliminamos las stop words y despues aplicamos stemming o lematización a cada uno de los documentos tokenizados


In [6]:
documentos_tokenizados_procesados = [fn.procesar_tokens(doc, stop_words) for doc in documentos_tokenizados]
documentos_tokenizados_procesados
pd.DataFrame(documentos_tokenizados_procesados)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,713,714,715,716,717,718,719,720,721,722
0,bahia,cocoa,review,shower,continu,week,bahia,cocoa,zone,allevi,...,,,,,,,,,,
1,comput,termin,system,ltcpml,complet,sale,comput,termin,system,complet,...,,,,,,,,,,
2,nz,trade,bank,deposit,growth,rise,slight,zealand,trade,bank,...,,,,,,,,,,
3,nation,amus,up,viacom,ltvia,bid,viacom,intern,ltnation,amus,...,,,,,,,,,,
4,roger,ltrog,see,st,qtr,net,signific,roger,corp,quarter,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7764,uk,money,market,shortag,forecast,revis,bank,england,revis,forecast,...,,,,,,,,,,
7765,knightridd,ltkrn,set,quarter,qtli,div,cts,cts,prior,pay,...,,,,,,,,,,
7766,technitrol,lttnl,set,quarter,qtli,div,cts,cts,prior,pay,...,,,,,,,,,,
7767,nationwid,cellular,servic,ltncel,qtr,shr,loss,cts,loss,cts,...,,,,,,,,,,


### 2.3. Representación de Datos en Espacio Vectorial

**Objetivo:** Convertir los textos en una forma que los algoritmos puedan procesar.

Unimos los tokens procesados nuevamente en un solo string por documento


In [7]:
# Unir los tokens procesados nuevamente en un solo string por documento
documentos_procesados_unidos = [' '.join(doc) for doc in documentos_tokenizados_procesados]
documentos_procesados_unidos

['bahia cocoa review shower continu week bahia cocoa zone allevi drought earli januari improv prospect come temporao normal humid level restor comissaria smith week review dri period mean temporao late year arriv week end februari bag kilo make cumul total season mln stage year cocoa deliv earlier consign includ arriv figur comissaria smith doubt crop cocoa harvest practic end total bahia crop estim mln bag sale stand mln hundr thousand bag hand farmer middlemen export processor doubt cocoa fit export shipper experienc dificulti obtain bahia superior certif view lower qualiti recent week farmer sold good part cocoa held consign comissaria smith spot bean price rose cruzado arroba kilo bean shipper reluct offer nearbi shipment limit sale book march shipment dlrs tonn port name crop sale light open port junejuli dlrs dlrs york juli augsept dlrs tonn fob routin sale butter made marchapril sold dlrs aprilmay butter time york junejuli dlrs augsept dlrs time york sept octdec dlrs time york d

####  Bag of Words (BoW)

creamos el Bag of Words (BoW) a partir de documentos_procesados_texto


In [8]:
vectorizer_bow = CountVectorizer(binary=True)
matriz_bow = vectorizer_bow.fit_transform(documentos_procesados_unidos)
df_bow = pd.DataFrame(matriz_bow.toarray(), columns=vectorizer_bow.get_feature_names_out())
df_bow

Unnamed: 0,aa,aaa,aachen,aaminus,aancor,aap,aaplus,aar,aarnoud,aaron,...,zorinski,zseven,zuccherifici,zuckerman,zulia,zurich,zurichbas,zuyuan,zverev,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7764,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7765,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7766,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7767,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### TF-IDF
creamos el TF-IDF para vectorizar el texto a partir de documentos_procesados_texto

In [9]:
# TF-IDF
vectorizer_tfidf = TfidfVectorizer()
matriz_tfidf = vectorizer_tfidf.fit_transform(documentos_procesados_unidos)
# Convertir a DataFrame para visualizar
df_tfidf = pd.DataFrame(matriz_tfidf.toarray(),columns=vectorizer_tfidf.get_feature_names_out())
df_tfidf

Unnamed: 0,aa,aaa,aachen,aaminus,aancor,aap,aaplus,aar,aarnoud,aaron,...,zorinski,zseven,zuccherifici,zuckerman,zulia,zurich,zurichbas,zuyuan,zverev,zzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7764,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7766,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7767,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- Evaluar las diferentes técnicas de vectorización.
- Documentar los métodos y resultados obtenidos.


### 2.4. Indexación

**Objetivo:** Crear un índice que permita búsquedas eficientes.

**Tareas:**

- Construir un índice invertido que mapee términos a documentos.
- Implementar y optimizar estructuras de datos para el índice.
- Documentar el proceso de construcción del índice.


In [10]:
indice_invertido = fn.construir_indice_invertido(documentos_tokenizados_procesados)
indice_invertido

{'bahia': {0, 941, 1240, 2047},
 'cocoa': {0,
  9,
  82,
  262,
  288,
  300,
  310,
  318,
  319,
  364,
  366,
  389,
  394,
  474,
  489,
  780,
  862,
  941,
  944,
  1166,
  1190,
  1518,
  1528,
  1723,
  1755,
  2016,
  2047,
  2237,
  2573,
  2947,
  3088,
  3389,
  3408,
  3457,
  4013,
  4226,
  4286,
  4643,
  4661,
  4708,
  4805,
  4881,
  4959,
  5132,
  5281,
  5444,
  5448,
  5505,
  5900,
  6036,
  6078,
  6691,
  7027,
  7093,
  7107,
  7432,
  7489,
  7702,
  7736},
 'review': {0,
  25,
  115,
  271,
  332,
  430,
  502,
  505,
  506,
  593,
  627,
  686,
  757,
  770,
  780,
  813,
  819,
  826,
  830,
  854,
  892,
  957,
  1105,
  1218,
  1278,
  1308,
  1362,
  1576,
  1728,
  1778,
  1887,
  1890,
  1999,
  2018,
  2047,
  2141,
  2177,
  2199,
  2299,
  2401,
  2493,
  2527,
  2549,
  2683,
  2704,
  2759,
  2795,
  2847,
  2848,
  2852,
  2862,
  2912,
  2920,
  2935,
  2939,
  3099,
  3108,
  3112,
  3120,
  3214,
  3263,
  3337,
  3368,
  3370,
  3446,
  347

### 2.5. Diseño del Motor de Búsqueda

**Objetivo:** Implementar la funcionalidad de búsqueda.

**Tareas:**

- Desarrollar la lógica para procesar consultas de usuarios.
- Implementar algoritmos de similitud como similitud coseno o Jaccard.
- Desarrollar un algoritmo de ranking para ordenar los resultados.
- Documentar la arquitectura y los algoritmos utilizados.


realizamos un input para ingresar la consulta deseada y la procesamos

In [15]:
consulta = input()
consulta_procesada = fn.procesar_tokens(
    fn.separar(fn.limpiar_texto(consulta)), stop_words)
consulta_procesada

['earn']

#### BOW

Realizamos la matriz BOW para la consulta

In [12]:
consulta_vector = vectorizer_bow.transform([' '.join(consulta_procesada)])
df_consulta = pd.DataFrame(consulta_vector.toarray(), columns=vectorizer_bow.get_feature_names_out())
df_consulta

Unnamed: 0,aa,aaa,aachen,aaminus,aancor,aap,aaplus,aar,aarnoud,aaron,...,zorinski,zseven,zuccherifici,zuckerman,zulia,zurich,zurichbas,zuyuan,zverev,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Encontramos los documentos relevantes

In [65]:
documentos_relevantes = fn.obtener_documentos_relevantes(consulta_procesada, indice_invertido)
documentos_relevantes

{4,
 13,
 41,
 57,
 62,
 70,
 78,
 91,
 100,
 115,
 117,
 119,
 124,
 160,
 177,
 182,
 188,
 199,
 281,
 282,
 285,
 298,
 299,
 301,
 322,
 341,
 348,
 381,
 382,
 407,
 411,
 412,
 443,
 451,
 489,
 502,
 515,
 527,
 533,
 534,
 537,
 539,
 550,
 585,
 596,
 606,
 622,
 636,
 640,
 653,
 665,
 666,
 716,
 717,
 729,
 753,
 758,
 777,
 780,
 790,
 792,
 818,
 868,
 889,
 894,
 913,
 931,
 967,
 973,
 983,
 1021,
 1042,
 1057,
 1070,
 1079,
 1082,
 1100,
 1101,
 1102,
 1103,
 1105,
 1107,
 1110,
 1115,
 1143,
 1147,
 1219,
 1224,
 1228,
 1233,
 1245,
 1249,
 1250,
 1283,
 1284,
 1297,
 1326,
 1332,
 1349,
 1354,
 1373,
 1425,
 1427,
 1429,
 1439,
 1490,
 1504,
 1522,
 1545,
 1547,
 1549,
 1551,
 1570,
 1632,
 1645,
 1647,
 1654,
 1665,
 1679,
 1697,
 1745,
 1768,
 1778,
 1786,
 1788,
 1793,
 1801,
 1802,
 1813,
 1823,
 1836,
 1860,
 1895,
 1900,
 1920,
 1938,
 1944,
 1961,
 1965,
 1966,
 1970,
 1976,
 1992,
 2010,
 2034,
 2060,
 2080,
 2091,
 2092,
 2099,
 2107,
 2140,
 2154,
 2179,
 

usando los documentos relevantes filtrar la matriz_bow para tener menos vectores que estan llenas solo de 0

In [16]:
bow_filtrada = matriz_bow[list(documentos_relevantes)]
df_bow_filtrada = pd.DataFrame(bow_filtrada.toarray(), columns=vectorizer_bow.get_feature_names_out())
df_bow_filtrada

Unnamed: 0,aa,aaa,aachen,aaminus,aancor,aap,aaplus,aar,aarnoud,aaron,...,zorinski,zseven,zuccherifici,zuckerman,zulia,zurich,zurichbas,zuyuan,zverev,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
524,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
525,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
526,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
527,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [69]:
# Calcular jaccard
similitud_jaccard = []
for vector_bow in bow_filtrada:
    score = jaccard_score(consulta_vector.toarray().flatten(), vector_bow.toarray().flatten(),average='binary')
    similitud_jaccard.append(float(score))
similitud_jaccard_id = [(iddoc, similitud_jac) for iddoc, similitud_jac in zip(documentos_relevantes,similitud_jaccard)]
similitud_jaccard_id.sort(key=lambda x: x[1], reverse=True)
similitud_jaccard_id    


[(4577, 0.2),
 (7727, 0.125),
 (5853, 0.1111111111111111),
 (381, 0.1),
 (534, 0.07692307692307693),
 (1522, 0.06666666666666667),
 (1788, 0.06666666666666667),
 (115, 0.0625),
 (596, 0.0625),
 (3747, 0.0625),
 (2860, 0.058823529411764705),
 (3125, 0.058823529411764705),
 (3633, 0.058823529411764705),
 (2091, 0.05555555555555555),
 (931, 0.05555555555555555),
 (322, 0.05263157894736842),
 (4949, 0.05263157894736842),
 (5534, 0.05263157894736842),
 (1665, 0.05263157894736842),
 (1786, 0.05263157894736842),
 (1813, 0.05263157894736842),
 (1900, 0.05263157894736842),
 (6557, 0.05),
 (4608, 0.05),
 (4946, 0.05),
 (2987, 0.05),
 (1354, 0.05),
 (7707, 0.05),
 (100, 0.047619047619047616),
 (4481, 0.047619047619047616),
 (4823, 0.047619047619047616),
 (5013, 0.047619047619047616),
 (5176, 0.047619047619047616),
 (57, 0.045454545454545456),
 (4417, 0.045454545454545456),
 (6488, 0.045454545454545456),
 (2707, 0.045454545454545456),
 (6057, 0.045454545454545456),
 (4979, 0.043478260869565216),
 

[(4577, 0.2),
 (7727, 0.125),
 (5853, 0.1111111111111111),
 (381, 0.1),
 (534, 0.07692307692307693),
 (1522, 0.06666666666666667),
 (1788, 0.06666666666666667),
 (115, 0.0625),
 (596, 0.0625),
 (3747, 0.0625),
 (2860, 0.058823529411764705),
 (3125, 0.058823529411764705),
 (3633, 0.058823529411764705),
 (2091, 0.05555555555555555),
 (931, 0.05555555555555555),
 (322, 0.05263157894736842),
 (4949, 0.05263157894736842),
 (5534, 0.05263157894736842),
 (1665, 0.05263157894736842),
 (1786, 0.05263157894736842),
 (1813, 0.05263157894736842),
 (1900, 0.05263157894736842),
 (6557, 0.05),
 (4608, 0.05),
 (4946, 0.05),
 (2987, 0.05),
 (1354, 0.05),
 (7707, 0.05),
 (100, 0.047619047619047616),
 (4481, 0.047619047619047616),
 (4823, 0.047619047619047616),
 (5013, 0.047619047619047616),
 (5176, 0.047619047619047616),
 (57, 0.045454545454545456),
 (4417, 0.045454545454545456),
 (6488, 0.045454545454545456),
 (2707, 0.045454545454545456),
 (6057, 0.045454545454545456),
 (4979, 0.043478260869565216),
 

In [78]:

similitud_coseno = cosine_similarity(consulta_vector, bow_2).flatten()
similitud_coseno_id = [(doc_id, similitud_coseno[id])
                       for id, doc_id in enumerate(documentos_relevantes)]
similitud_coseno_id.sort(key=lambda x: x[1], reverse=True)
# Mostrar resultados ordenados por similitud
print(f"Documentos ordenados por similitud: {similitud_coseno_id}")

Documentos ordenados por similitud: [(4577, 0.4472135954999579), (7727, 0.35355339059327373), (5853, 0.3333333333333333), (381, 0.31622776601683794), (534, 0.2773500981126146), (1522, 0.2581988897471611), (1788, 0.2581988897471611), (115, 0.25), (596, 0.25), (3747, 0.25), (2860, 0.24253562503633297), (3125, 0.24253562503633297), (3633, 0.24253562503633297), (2091, 0.23570226039551587), (931, 0.23570226039551587), (322, 0.22941573387056174), (4949, 0.22941573387056174), (5534, 0.22941573387056174), (1665, 0.22941573387056174), (1786, 0.22941573387056174), (1813, 0.22941573387056174), (1900, 0.22941573387056174), (6557, 0.22360679774997896), (4608, 0.22360679774997896), (4946, 0.22360679774997896), (2987, 0.22360679774997896), (1354, 0.22360679774997896), (7707, 0.22360679774997896), (100, 0.2182178902359924), (4481, 0.2182178902359924), (4823, 0.2182178902359924), (5013, 0.2182178902359924), (5176, 0.2182178902359924), (57, 0.21320071635561041), (4417, 0.21320071635561041), (6488, 0.213

In [109]:
similitud_coseno = cosine_similarity(consulta_vector, bow_2).flatten()
len(similitud_coseno)

529

In [136]:
similitud_coseno = cosine_similarity(
    consulta_vector, matriz_bow).flatten()[:529]
len(similitud_coseno)

529

In [137]:
similitud_coseno_id = [(id, simcos)
                       for id, simcos in enumerate(similitud_coseno)]
similitud_coseno_id.sort(key=lambda x: x[1], reverse=True)
print(f"Documentos ordenados por similitud: {similitud_coseno_id}")

Documentos ordenados por similitud: [(381, 0.31622776601683794), (115, 0.25), (322, 0.22941573387056174), (100, 0.2182178902359924), (57, 0.21320071635561041), (62, 0.19611613513818404), (117, 0.19611613513818404), (13, 0.18569533817705186), (78, 0.18257418583505536), (188, 0.18257418583505536), (282, 0.17407765595569785), (4, 0.1690308509457033), (382, 0.1690308509457033), (124, 0.16666666666666666), (199, 0.1643989873053573), (341, 0.16012815380508713), (177, 0.15617376188860607), (281, 0.15617376188860607), (70, 0.14744195615489714), (411, 0.14002800840280097), (41, 0.1386750490563073), (527, 0.1336306209562122), (301, 0.13245323570650439), (348, 0.13130643285972254), (298, 0.1270001270001905), (91, 0.1259881576697424), (515, 0.12309149097933272), (299, 0.1125087900926024), (407, 0.11043152607484653), (451, 0.1), (285, 0.0936585811581694), (412, 0.09284766908852593), (443, 0.09016696346674323), (182, 0.0890870806374748), (119, 0.08804509063256238), (160, 0.07761505257063328), (489, 

### Tfidf


In [79]:
consulta_vector = vectorizer_tfidf.transform([' '.join(consulta_procesada)])
df_consulta = pd.DataFrame(consulta_vector.toarray(
), columns=vectorizer_bow.get_feature_names_out())
df_consulta

Unnamed: 0,aa,aaa,aachen,aaminus,aancor,aap,aaplus,aar,aarnoud,aaron,...,zorinski,zseven,zuccherifici,zuckerman,zulia,zurich,zurichbas,zuyuan,zverev,zzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [80]:
tfidf_2 = matriz_tfidf[list(documentos_encontrados)]
df_tfidf_2 = pd.DataFrame(
    tfidf_2.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
df_tfidf_2

Unnamed: 0,aa,aaa,aachen,aaminus,aancor,aap,aaplus,aar,aarnoud,aaron,...,zorinski,zseven,zuccherifici,zuckerman,zulia,zurich,zurichbas,zuyuan,zverev,zzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
524,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [93]:
# Calcular similitud coseno
similitud_coseno = cosine_similarity(consulta_vector, matriz_tfidf).flatten()
similitud_coseno_id = [(doc_id, similitud_coseno[id])
                       for id, doc_id in enumerate(documentos_encontrados)]
similitud_coseno_id.sort(key=lambda x: x[1], reverse=True)
# Mostrar resultados ordenados por similitud
print(f"Documentos ordenados por similitud: {similitud_coseno_id}")

Documentos ordenados por similitud: [(5606, 0.4915376496580345), (4481, 0.2711262296357742), (7647, 0.25048339427887467), (1249, 0.21719391227247944), (4093, 0.20430932290714113), (6435, 0.1900968045831106), (1101, 0.1766195533913958), (6871, 0.17407740751975723), (3129, 0.1673717119045084), (2060, 0.16733218026248908), (502, 0.1670340084059524), (753, 0.15498556682601583), (348, 0.14485717234941675), (2491, 0.13812774020830243), (1147, 0.1358572856363652), (6583, 0.12435604660121245), (5236, 0.11510672534914702), (2299, 0.10710942578737467), (2354, 0.0964918971693197), (7509, 0.08728784734303713), (4368, 0.08105620223201236), (5676, 0.07968706397536664), (7719, 0.07797778076953579), (4257, 0.07592512588011467), (790, 0.07457680942153785), (1143, 0.07347918249368694), (1082, 0.0712002517556392), (2092, 0.06484804816089412), (1992, 0.06235052927650263), (6854, 0.05881881527909289), (5430, 0.04750765242000413), (1697, 0.043361178036393166), (5849, 0.04199728653377434), (1570, 0.036428560

### 2.6. Evaluación del Sistema

**Objetivo:** Medir la efectividad del sistema.

**Tareas:**

- Definir un conjunto de métricas de evaluación (precisión, recall, F1-score).
- Realizar pruebas utilizando el conjunto de prueba del corpus.
- Comparar el rendimiento de diferentes configuraciones del sistema.
- Documentar los resultados y análisis.


In [138]:
# Parsear el archivo cats.txt para obtener la verdad de terreno (ground truth)
ruta_cats = 'reuters/cats.txt'
gran_verdad = {}

with open(ruta_cats, 'r', encoding='latin-1') as f:
    for linea in f:
        if linea.startswith('training/'):
            partes = linea.strip().split()
            # Obtener el ID del documento
            doc_id = int(partes[0].split('/')[1])
            categorias = partes[1:]  # Obtener las categorías
            gran_verdad[doc_id] = categorias

# Verificar la estructura de ground_truth
print(f'gran_verdad: {list(gran_verdad.items())[:5]}')

gran_verdad: [(1, ['cocoa']), (5, ['sorghum', 'oat', 'barley', 'corn', 'wheat', 'grain']), (6, ['wheat', 'sorghum', 'grain', 'sunseed', 'corn', 'oilseed', 'soybean', 'sun-oil', 'soy-oil', 'lin-oil', 'veg-oil']), (9, ['earn']), (10, ['acq'])]


In [139]:
consulta_procesada

['earn']

In [140]:
# Definir la relevancia basada en la coincidencia de categorías
categorias_consulta = {'earn'}  # Categorías esperadas para la consulta
documentos_relevantes_esperados = set()

for doc_id, categorias in gran_verdad.items():
    if any(categoria in categorias_consulta for categoria in categorias):
        documentos_relevantes_esperados.add((doc_id))

In [141]:
documentos_relevantes_esperados = list(documentos_relevantes_esperados)
documentos_relevantes_esperados

[9,
 11,
 12,
 13,
 14,
 18,
 23,
 24,
 8218,
 27,
 8219,
 8221,
 8223,
 8226,
 8227,
 36,
 37,
 38,
 8229,
 40,
 41,
 8236,
 50,
 53,
 8245,
 56,
 8253,
 8254,
 64,
 65,
 66,
 8258,
 8262,
 71,
 74,
 8269,
 8270,
 82,
 83,
 8275,
 85,
 86,
 87,
 8276,
 89,
 93,
 8286,
 8289,
 98,
 8291,
 8292,
 8294,
 8296,
 8297,
 108,
 8300,
 113,
 8311,
 8313,
 8314,
 123,
 8317,
 126,
 8318,
 8320,
 129,
 130,
 8322,
 138,
 139,
 140,
 142,
 143,
 8334,
 145,
 146,
 147,
 8335,
 8337,
 8338,
 151,
 152,
 8339,
 8340,
 8349,
 8350,
 160,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 8364,
 8366,
 183,
 8378,
 187,
 196,
 8388,
 8391,
 201,
 202,
 8394,
 8399,
 210,
 212,
 214,
 8407,
 8409,
 8424,
 8425,
 8426,
 8435,
 8447,
 8448,
 8449,
 8454,
 8457,
 8458,
 279,
 8475,
 8479,
 8480,
 8481,
 8487,
 8488,
 8489,
 8490,
 299,
 8491,
 8501,
 8502,
 8506,
 317,
 8510,
 8513,
 8520,
 8521,
 8526,
 8528,
 8530,
 8533,
 8534,
 345,
 8540,
 8541,
 8545,
 8547,
 356,
 8555,
 8556,
 8559,
 8561,
 8562,
 8565,

In [142]:
# Obtener nombres de archivos ordenados por similitud coseno usando TF-IDF
documentos_encontrados_nombres_tfidf = [int(i) for i, _ in similitud_coseno_id]
documentos_encontrados_nombres_tfidf

[381,
 115,
 322,
 100,
 57,
 62,
 117,
 13,
 78,
 188,
 282,
 4,
 382,
 124,
 199,
 341,
 177,
 281,
 70,
 411,
 41,
 527,
 301,
 348,
 298,
 91,
 515,
 299,
 407,
 451,
 285,
 412,
 443,
 182,
 119,
 160,
 489,
 502,
 0,
 1,
 2,
 3,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 58,
 59,
 60,
 61,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 116,
 118,
 120,
 121,
 122,
 123,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,


In [143]:
documentos_relevantes_esperados_nuevo = []
for id_doc in documentos_relevantes_esperados:
    documentos_relevantes_esperados_nuevo.append(
        nombres_archivos.index(id_doc))

In [144]:
documentos_relevantes_esperados_nuevo

[7125,
 635,
 1297,
 1859,
 2159,
 2526,
 2809,
 2873,
 6629,
 3061,
 6630,
 6632,
 6634,
 6637,
 6638,
 3650,
 3712,
 3777,
 6639,
 3904,
 3981,
 6641,
 4537,
 4741,
 6644,
 4960,
 6648,
 6649,
 5438,
 5512,
 5577,
 6652,
 6654,
 5923,
 6096,
 6656,
 6657,
 6616,
 6676,
 6659,
 6789,
 6859,
 6932,
 6660,
 7056,
 7331,
 6663,
 6666,
 7634,
 6668,
 6669,
 6671,
 6672,
 6673,
 514,
 6678,
 838,
 6683,
 6684,
 6685,
 1490,
 6686,
 1691,
 6687,
 6689,
 1811,
 1860,
 6690,
 2116,
 2137,
 2160,
 2191,
 2211,
 6695,
 2252,
 2284,
 2311,
 6696,
 6697,
 6698,
 2358,
 2362,
 6699,
 6701,
 6706,
 6708,
 2410,
 2426,
 2432,
 2439,
 2449,
 2455,
 2463,
 2471,
 6716,
 6717,
 2538,
 6724,
 2562,
 2611,
 6726,
 6728,
 2644,
 2649,
 6729,
 6731,
 2687,
 2700,
 2710,
 6738,
 6739,
 6746,
 6747,
 6748,
 6756,
 6763,
 6764,
 6765,
 6768,
 6769,
 6770,
 3117,
 6774,
 6776,
 6778,
 6779,
 6780,
 6781,
 6782,
 6784,
 3246,
 6785,
 6791,
 6792,
 6793,
 3379,
 6796,
 6798,
 6803,
 6804,
 6806,
 6807,
 6809,
 6

In [145]:
from sklearn.metrics import precision_score, recall_score, f1_score
# Evaluar resultados de TF-IDF
y_true = [1 if i in documentos_relevantes_esperados_nuevo else 0 for i in range(
    len(documentos))]
y_pred = [1 if i in documentos_encontrados_nombres_tfidf else 0 for i in range(
    len(documentos))]

In [146]:
y_true

[0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,


In [147]:
y_pred

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,


In [148]:
# Calcular precisión, recall y F1-score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f'Precisión: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1}')

Precisión: 0.3270321361058601
Recall: 0.06013208202989225
F1-score: 0.10158543746330007


el primero bueno me salio:

- Precisión: 0.725897920604915
- Recall: 0.1334723670490094
- F1-score: 0.22548443922489722


metricas de evalucion para todas las categorias


### 2.7. Interfaz Web de Usuario

**Objetivo:** Crear una interfaz para interactuar con el sistema.

**Tareas:**

- Diseñar una interfaz web donde los usuarios puedan ingresar consultas.
- Mostrar los resultados de búsqueda de manera clara y ordenada.
- Implementar características adicionales como filtros y opciones de visualización.
- Documentar el diseño y funcionalidades de la interfaz.

## 3. Entrega Final

- **Documentación Completa:** Incluyendo los procesos, decisiones tomadas, y resultados de cada fase.
- **Código Fuente:** Organizado y bien comentado.
- **Informe de Evaluación:** Análisis detallado de la evaluación del sistema.
- **Demostración del Sistema:** Presentación funcional del sistema a través de la interfaz web.

## 4. Requisitos Técnicos

- **Lenguajes de Programación:** Python (preprocesamiento y modelado), JavaScript (para la interfaz web).

## 5. Evaluación del Proyecto

- **Funcionamiento:** (35%) Efectividad y eficiencia en la recuperación de información.
- **Documentación:** (35%) Claridad en la documentación de cada fase.
- **Innovación y Creatividad:** (15%) En la implementación de técnicas y la interfaz de usuario.
- **Presentación Final:** (15%) Calidad y claridad de la demostración del sistema.
