# Assignment 1
## Part 2

Comparison of search engine strategies
Next, you will implement a search engine with four different strategies.
1. Binary search using an inverted index (BSII)
3. Ranked retrieval and document vectorization (RRDV)
You must write your own implementation using numpy and pandas.
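
As a rough illustration of the ranked retrieval strategy (RRDV) listed above, the sketch below builds tf-idf vectors with numpy and ranks documents by cosine similarity. The toy documents, the weighting scheme (log-scaled term frequency times `log10(N / df)`), and the helper names `count_vector`, `tfidf`, and `rank` are assumptions made for illustration; the assignment text does not prescribe them.

```python
import numpy as np

# Toy, already-preprocessed documents (hypothetical data for illustration only)
docs = {
    "d1": ["william", "beaumont", "physician"],
    "d2": ["william", "shakespear", "playwright"],
    "d3": ["physician", "hospit"],
}

vocab = sorted({t for tokens in docs.values() for t in tokens})
term_to_col = {t: j for j, t in enumerate(vocab)}
doc_ids = list(docs)

def count_vector(tokens):
    """Raw term counts over the shared vocabulary; unknown terms are ignored."""
    v = np.zeros(len(vocab))
    for t in tokens:
        j = term_to_col.get(t)
        if j is not None:
            v[j] += 1
    return v

# Document-term count matrix, document frequencies, and idf weights
tf = np.vstack([count_vector(docs[d]) for d in doc_ids])
df = (tf > 0).sum(axis=0)
idf = np.log10(len(doc_ids) / df)

def tfidf(counts):
    """Log-scaled tf times idf; zero counts stay zero."""
    logtf = np.zeros_like(counts)
    mask = counts > 0
    logtf[mask] = 1 + np.log10(counts[mask])
    return logtf * idf

doc_matrix = tfidf(tf)

def rank(query_tokens):
    """Return (doc_id, cosine score) pairs with a positive score, best first."""
    q = tfidf(count_vector(query_tokens))
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q)
    scores = np.divide(doc_matrix @ q, norms,
                       out=np.zeros(len(doc_ids)), where=norms > 0)
    order = np.argsort(-scores, kind="stable")
    return [(doc_ids[i], float(scores[i])) for i in order if scores[i] > 0]

ranking = rank(["william", "physician"])
```

Here `d1` matches both query terms and ranks first. `d3` ranks above `d2` because it is shorter, so the shared term carries more of its vector length.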

Dataset: three files make up the dataset:
- "Docs raw texts" contains 331 documents in NAF format (XML; use the title and the content to model each document).
- "Queries raw texts" contains 35 queries.
- "relevance-judgments.tsv" lists, for each query, the documents considered relevant. These relevance judgments were compiled manually by human judges and serve as the ground truth for evaluation.
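
The requirement to model each document from its title and its content could be handled roughly as follows. This sketch assumes the title lives in the `title` attribute of `nafHeader/fileDesc` and the body in the `raw` element; the sample XML is made up, and `document_text` is a hypothetical helper, so the layout should be checked against the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal NAF document; real files may differ in layout.
sample_naf = """<NAF version="v3" xml:lang="en">
  <nafHeader>
    <fileDesc title="William Beaumont" />
  </nafHeader>
  <raw><![CDATA[William Beaumont was a surgeon in the U.S. Army.]]></raw>
</NAF>"""

def document_text(root):
    """Return title and raw content joined, tolerating missing elements."""
    file_desc = root.find("nafHeader/fileDesc")
    title = file_desc.get("title", "") if file_desc is not None else ""
    raw = root.find("raw")
    content = raw.text if raw is not None and raw.text else ""
    return f"{title}\n{content}".strip()

text = document_text(ET.fromstring(sample_naf))
```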

Preprocessing steps: for the following parts, preprocess documents and queries with word-level tokenization, stop-word removal, normalization, and stemming.
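
The four steps can be sketched with the standard library alone. The stop-word list here is a tiny illustrative set, and `crude_stem` is a placeholder suffix stripper, not the Porter algorithm; both are assumptions standing in for a full stop-word list and a real stemmer.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "was"}

def normalize(text):
    """Lowercase and strip diacritics so 'Café' and 'cafe' match."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def crude_stem(token):
    """Very rough suffix stripping; a placeholder, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Normalize, tokenize at word level, drop stop words, then stem."""
    tokens = re.findall(r"\w+", normalize(text))
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The surgeons treated patients in the Café")
```

Documents and queries must go through the same function. Otherwise a query term can be stemmed differently from the indexed term and never match.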


In [28]:
import numpy as np
import pandas as pd
import zipfile
import os
import xml.etree.ElementTree as ET
from nltk.tokenize import word_tokenize
from nltk.corpus import PlaintextCorpusReader

In [29]:
# Extract the data from the compressed files

# Paths to the compressed files; each one is extracted into a folder named after it
compressed_files = ['docs-raw-texts.zip', 'queries-raw-texts.zip']

# Extract files from each compressed file
for compressed_file in compressed_files:
    with zipfile.ZipFile(compressed_file, 'r') as zip_ref:
        target_folder = os.path.splitext(compressed_file)[0]  # Remove the ".zip" extension

        # Create the target folder; exist_ok avoids an error when the cell is re-run
        os.makedirs(target_folder, exist_ok=True)

        # Extract all files to the target folder
        zip_ref.extractall(target_folder)

print("Extraction complete")


Extraction complete


## Corpus creation


### Data loading

In [30]:
import os
import xml.etree.ElementTree as ET
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

# Download the NLTK stopwords resource
nltk.download('stopwords')

# Directory containing the extracted XML files
xml_files_directory = 'docs-raw-texts'

# NLTK setup
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Function to preprocess the document text
def preprocess_text(text):
    tokens = tokenizer.tokenize(text.strip().lower())
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return tokens

# Dictionary to store the inverted index
inverted_index = {}

# Iterate over XML files in the directory (sorted, so the order is deterministic)
for filename in sorted(os.listdir(xml_files_directory)):
    if filename.endswith('.naf'):
        xml_path = os.path.join(xml_files_directory, filename)
        
        # Parse the XML file
        tree = ET.parse(xml_path)
        root = tree.getroot()
        
        # Extract the body text from the <raw> element (the title is not included here)
        content = root.find('raw').text
        
        # Preprocess the content
        preprocessed_tokens = preprocess_text(content)
        
        # Add one posting per term occurrence, so a file name repeats once per occurrence
        for term in preprocessed_tokens:
            if term in inverted_index:
                inverted_index[term].append(filename)
            else:
                inverted_index[term] = [filename]

print("Inverted index created.")

# Keep a second name for the index; this is an alias to the same dict, not a copy
inverted_index_variable = inverted_index


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Inverted index created.
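
The index built above appends the file name once per term occurrence, so a file name repeats for every occurrence of the term in that file. For Boolean retrieval, one sorted entry per document is usually enough. A minimal sketch under that assumption, using made-up tokens and the hypothetical helpers `build_index`, `intersect`, and `boolean_and`:

```python
from collections import defaultdict

# Hypothetical preprocessed documents: file name -> list of stemmed tokens
docs = {
    "d001.naf": ["william", "beaumont", "william", "surgeon"],
    "d002.naf": ["william", "playwright"],
    "d003.naf": ["surgeon", "hospit"],
}

def build_index(tokenized_docs):
    """Map each term to a sorted list of the distinct documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted posting lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def boolean_and(index, terms):
    """Documents containing every term; shortest posting lists are merged first."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

index = build_index(docs)
hits = boolean_and(index, ["william", "surgeon"])
```

If term frequencies are needed later for ranking, they can be kept in a separate structure instead of repeating file names in the posting list.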


In [31]:
inverted_index_variable[preprocess_text("william")[0]]

['wes2015.d001.naf',
 'wes2015.d001.naf',
 'wes2015.d001.naf',
 'wes2015.d001.naf',
 'wes2015.d001.naf',
 'wes2015.d009.naf',
 'wes2015.d015.naf',
 'wes2015.d015.naf',
 'wes2015.d015.naf',
 'wes2015.d015.naf',
 'wes2015.d015.naf',
 'wes2015.d028.naf',
 'wes2015.d028.naf',
 'wes2015.d028.naf',
 'wes2015.d028.naf',
 'wes2015.d035.naf',
 'wes2015.d035.naf',
 'wes2015.d055.naf',
 'wes2015.d055.naf',
 'wes2015.d055.naf',
 'wes2015.d056.naf',
 'wes2015.d056.naf',
 'wes2015.d056.naf',
 'wes2015.d056.naf',
 'wes2015.d069.naf',
 'wes2015.d069.naf',
 'wes2015.d069.naf',
 'wes2015.d069.naf',
 'wes2015.d069.naf',
 'wes2015.d078.naf',
 'wes2015.d078.naf',
 'wes2015.d078.naf',
 'wes2015.d078.naf',
 'wes2015.d088.naf',
 'wes2015.d088.naf',
 'wes2015.d088.naf',
 'wes2015.d091.naf',
 'wes2015.d092.naf',
 'wes2015.d095.naf',
 'wes2015.d098.naf',
 'wes2015.d098.naf',
 'wes2015.d102.naf',
 'wes2015.d102.naf',
 'wes2015.d102.naf',
 'wes2015.d102.naf',
 'wes2015.d106.naf',
 'wes2015.d109.naf',
 'wes2015.d11