# Assignment 1
## Part 2

Comparison of search engine strategies
Next, you will implement a search engine with four different strategies.
1. Binary search using an inverted index (BSII)
3. Ranked retrieval and document vectorization (RRDV)
You must write your own implementation using numpy and pandas.

Dataset: the dataset consists of three files:
- "Docs raw texts" contains 331 documents in NAF format (XML; you must use the title and the content to model each document).
- "Queries raw texts" contains 35 queries.
- "relevance-judgments.tsv" lists, for each query, the documents considered relevant. These relevance judgments were produced manually by human judges and serve as the ground truth for evaluation.

Preprocessing steps: for the following parts, you must preprocess documents and queries with word-level tokenization, stop-word removal, normalization, and stemming.


In [2]:
!pip install nltk


In [4]:
import numpy as np
import zipfile
import os
import xml.etree.ElementTree as ET
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

## Data extraction

In [2]:
# Extract the data from the compressed files
compressed_files = ['docs-raw-texts.zip', 'queries-raw-texts.zip']

for compressed_file in compressed_files:
    # Each archive is extracted into a folder named after it, without the ".zip" extension
    target_folder = os.path.splitext(compressed_file)[0]

    # Skip archives that were already extracted
    if not os.path.exists(target_folder):
        os.mkdir(target_folder)
        with zipfile.ZipFile(compressed_file, 'r') as zip_ref:
            zip_ref.extractall(target_folder)

print("Extraction complete")


Extraction complete


### Directory paths

In [3]:
# Directories containing the required files; change them here if necessary
xml_files_directory = 'docs-raw-texts'

queries_directory = "queries-raw-texts"

## Building the inverted index


### Text preprocessing function

In [4]:
# Download the NLTK stopwords resource
nltk.download('stopwords')

# NLTK setup
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Preprocessing function, applied to every model input (queries and documents)
def preprocess_text(text: str) -> list[str]:
    """Preprocess a text.

    Args:
        text (str): The text to process.

    Returns:
        list[str]: The stemmed tokens of the text, lowercased and with stop words removed.
    """
    text = text.strip().lower()  # normalization: trim surrounding whitespace and lowercase
    tokens = tokenizer.tokenize(text)  # tokenization on runs of word characters (\w+)
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]  # stop-word removal and stemming
    return tokens

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USUARIO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
preprocess_text(" On 26 July 2023, a coup d'état ocCuRreD in the Republic of the Niger, in which the country's presidential guard removed...; and detained President Mohamed Bazoum.    ")

['26',
 'juli',
 '2023',
 'coup',
 'état',
 'occur',
 'republ',
 'niger',
 'countri',
 'presidenti',
 'guard',
 'remov',
 'detain',
 'presid',
 'moham',
 'bazoum']

### Building the inverted-index dictionary from the text corpus

In [6]:
# Extract raw text from .naf files
def extract_raw_text(xml_path: str, title: bool = False) -> str:
    """Extract the raw text from a .naf file.

    Args:
        xml_path (str): Path to the .naf file.
        title (bool): If True, the document title is prepended to the extracted text.

    Returns:
        str: The raw text of the .naf file.
    """
    # Parse the XML file
    root = ET.parse(xml_path).getroot()
    # Extract content from the XML
    raw_text = root.find('raw').text
    if title:
        # Prepend the title when requested
        return root.find(".//nafHeader/fileDesc").get("title") + ", " + raw_text
    return raw_text
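
`extract_raw_text` relies on two parts of a NAF file: the `title` attribute of `nafHeader/fileDesc` and the text of the `raw` element. The sketch below shows those lookups on a hypothetical, heavily simplified NAF string. `SAMPLE_NAF` and its contents are invented for illustration; the real files in the dataset contain many more layers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified NAF document (assumption: only the
# elements read by the notebook are included here).
SAMPLE_NAF = """<NAF>
  <nafHeader>
    <fileDesc title="Niger coup" />
    <public publicId="d001" />
  </nafHeader>
  <raw>A coup occurred in Niger.</raw>
</NAF>"""

root = ET.fromstring(SAMPLE_NAF)

# Same lookups the notebook uses for documents and queries
title = root.find(".//nafHeader/fileDesc").get("title")
doc_id = root.find("nafHeader/public").get("publicId")
raw_text = root.find("raw").text

# Title and body joined the same way as extract_raw_text(..., title=True)
combined = title + ", " + raw_text
print(doc_id, "->", combined)
```

Joining the title and the body with `", "` keeps the two from fusing into a single token when the text is tokenized later.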

In [7]:
# Dictionary to store the inverted index (term -> list of documents)
inverted_index = {}

# Dictionary to store term frequencies per document (term -> {document: frequency})
term_freq_per_document = {}

# Iterate over XML files in the directory
for filename in os.listdir(xml_files_directory):
    if filename.endswith('.naf'):
        xml_path = os.path.join(xml_files_directory, filename)
        # Extract content from the XML
        content = extract_raw_text(xml_path, title=True)
        
        # Preprocess the content
        preprocessed_tokens = preprocess_text(content)
        
        # Create the inverted index and update term frequencies per document
        for term in preprocessed_tokens:
            if term in inverted_index:
                if filename not in inverted_index[term]:
                    inverted_index[term].append(filename)
            else:
                inverted_index[term] = [filename]
            
            if term in term_freq_per_document:
                if filename in term_freq_per_document[term]:
                    term_freq_per_document[term][filename] += 1
                else:
                    term_freq_per_document[term][filename] = 1
            else:
                term_freq_per_document[term] = {filename: 1}

print("Inverted index created.")


Inverted index created.


In [8]:
print("list of documents where the term is found")
test_term = preprocess_text("confused")[0]
inverted_index[test_term]

list of documents where the term is found


['wes2015.d137.naf', 'wes2015.d280.naf']

In [9]:
print("frequency value of term:")
for term in inverted_index[test_term]:
    print(term_freq_per_document[test_term][term])

frequency value of term:
1
1
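
The two dictionaries above (term to document list, and term to per-document frequency) can also be built more compactly with `collections.defaultdict` and `collections.Counter`. This is a minimal sketch on a hypothetical toy corpus, not the notebook's actual data. The names `toy_docs`, `toy_index` and `toy_term_freq` are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: document name -> already preprocessed tokens
toy_docs = {
    "d1": ["coup", "niger", "coup"],
    "d2": ["niger", "presid"],
}

toy_index = defaultdict(list)      # term -> list of documents containing it
toy_term_freq = defaultdict(dict)  # term -> {document: frequency}

for doc_name, tokens in toy_docs.items():
    # Counter yields each distinct term once, together with its count,
    # so no membership checks are needed before appending.
    for term, count in Counter(tokens).items():
        toy_index[term].append(doc_name)
        toy_term_freq[term][doc_name] = count

print(dict(toy_index))
print(dict(toy_term_freq))
```

Counting per document first means each document is appended to a posting list only once, which removes the `if filename not in ...` scan over the list.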


### Boolean query

In [10]:
def boolean_query(inverted_index: dict, query: str) -> list[str]:
    """Run a Boolean query against an inverted index.

    Uppercase AND, OR and NOT are treated as operators and applied left to
    right, without precedence. NOT removes the documents of the next term.
    A term with no explicit operator is combined with OR. The first term
    seeds the result set.

    Args:
        inverted_index (dict): The inverted index.
        query (str): The Boolean query.

    Returns:
        list[str]: The documents that satisfy the query.
    """
    operators = {'AND', 'OR', 'NOT'}
    result_set = None
    operator = 'OR'
    # Operators are read from the raw query, before preprocessing, so that
    # stop-word removal cannot drop them.
    for raw_token in query.split():
        if raw_token in operators:
            operator = raw_token
            continue
        # A raw token may yield no term (stop word) or several terms
        terms = preprocess_text(raw_token)
        for term in terms:
            postings = set(inverted_index.get(term, []))
            if result_set is None:
                result_set = postings
            elif operator == 'AND':
                result_set &= postings  # set intersection
            elif operator == 'NOT':
                result_set -= postings  # set difference
            else:
                result_set |= postings  # set union
        if terms:
            operator = 'OR'
    return list(result_set or [])


In [11]:
query = "confused"

# Perform queries
result = boolean_query(inverted_index, query)
print("Query 1 result:", result)

Query 1 result: ['wes2015.d280.naf', 'wes2015.d137.naf']


In [12]:
query = "Anne"

# Perform queries
result = boolean_query(inverted_index, query)
print("Query 1 result:", result)

Query 1 result: ['wes2015.d105.naf', 'wes2015.d241.naf', 'wes2015.d193.naf', 'wes2015.d019.naf', 'wes2015.d102.naf', 'wes2015.d212.naf', 'wes2015.d316.naf', 'wes2015.d190.naf', 'wes2015.d247.naf', 'wes2015.d137.naf']


In [13]:
query = "confused AND Anne"

# Perform queries
result = boolean_query(inverted_index, query)
print("Query 1 result:", result)

Query 1 result: ['wes2015.d137.naf']


In [14]:
query = "confused NOT Anne"

# Perform queries
result = boolean_query(inverted_index, query)
print("Query 1 result:", result)

Query 1 result: ['wes2015.d280.naf']


In [15]:
query = "confused OR Anne AND Language NOT william"

# Perform queries
result = boolean_query(inverted_index, query)
print("Query 1 result:", result)

Query 1 result: ['wes2015.d137.naf']


### Binary queries

In [16]:
queries_results = {}

for filename in os.listdir(queries_directory):
    if filename.endswith('.naf'):
        query_path = os.path.join(queries_directory, filename)
        
        # Parse the XML file
        tree = ET.parse(query_path)
        root = tree.getroot()
        
        # Extract query content from the XML
        query_id = root.find('nafHeader/public').get('publicId')
        query_content = root.find('raw').text
        
        # Preprocess the query content
        preprocessed_query = preprocess_text(query_content)
        queries_results[query_id] = preprocessed_query

In [17]:
# Perform AND queries using the inverted index
results = {}
for query_id, query_terms in queries_results.items():
    # Intersect the posting sets of all query terms; an empty query matches nothing
    postings = [set(inverted_index.get(term, [])) for term in query_terms]
    results[query_id] = set.intersection(*postings) if postings else set()
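
The AND queries above intersect Python sets. A common alternative is to keep each posting list sorted and intersect two lists with a linear merge. The sketch below shows that technique as an illustration, not as the method used in this notebook. The function name `intersect_sorted` and the sample posting lists are hypothetical.

```python
def intersect_sorted(postings_a: list[str], postings_b: list[str]) -> list[str]:
    """Intersect two sorted posting lists with a linear merge."""
    i, j = 0, 0
    common = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # Document appears in both lists: keep it and advance both pointers
            common.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return common


# Illustrative posting lists (hypothetical contents), sorted by document name
postings_x = sorted(["wes2015.d280.naf", "wes2015.d137.naf"])
postings_y = sorted(["wes2015.d190.naf", "wes2015.d019.naf", "wes2015.d137.naf"])

merged = intersect_sorted(postings_x, postings_y)
print(merged)
```

The merge visits each list at most once, so it runs in time proportional to the combined length of the two lists and returns the result already sorted.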

In [18]:
# Write the results to a file
output_filename = 'BSII-AND-queries_results.tsv'
with open(output_filename, 'w') as output_file:
    for query_id, docs in results.items():
        doc_numbers = [doc.split('.')[1] for doc in docs]
        doc_numbers_str = ','.join(doc_numbers)
        output_file.write(f"{query_id}\t{doc_numbers_str}\n")

print("Query results written to", output_filename)

Query results written to BSII-AND-queries_results.tsv
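
Once the results are written, they can be compared against the relevance judgments. The sketch below computes precision and recall for a single query from two sets of document identifiers. It assumes nothing about the layout of "relevance-judgments.tsv". The function `precision_recall` and the sample sets are hypothetical.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers for a single query
retrieved_docs = {"d137", "d280", "d019"}
relevant_docs = {"d137", "d019", "d105", "d241"}

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Returning 0.0 when a set is empty is one possible convention. It avoids a division by zero for queries that retrieve no documents.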
