# Exploring Ranking Models in Information Retrieval

## Objective
Understand the practical implementation and differences between the Vector Space Model and the Binary Independence Model in ranking documents relative to a user query.

### Step 1: Data Preprocessing

Make sure the documents from the previous task are still loaded and preprocessed, so that the data is clean and ready for querying.
If they are not, write a function that loads and preprocesses the text documents from a specified directory: read each file, convert the text to lowercase for uniform processing, and store the results in a dictionary keyed by filename.
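The loading step described above can be wrapped in a reusable function. The following is a minimal sketch, assuming UTF-8 encoded `.txt` files; the name `load_documents` is hypothetical and not part of the original notebook.

```python
import os

def load_documents(corpus_dir):
    """Read every .txt file in corpus_dir and return {filename: lowercased text}."""
    documents = {}
    for filename in sorted(os.listdir(corpus_dir)):
        if filename.endswith(".txt"):
            file_path = os.path.join(corpus_dir, filename)
            with open(file_path, "r", encoding="utf-8") as file:
                # Lowercase so that later matching is case-insensitive
                documents[filename] = file.read().lower()
    return documents

# Example call (the path is the one used in this notebook):
# documents = load_documents("../../week01/data")
```

Returning the dictionary instead of filling a global keeps the function easy to reuse in later cells.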

In [1]:
!pip install beautifulsoup4
!pip install requests



In [18]:
import os
import time
import requests
import re
from collections import defaultdict
from bs4 import BeautifulSoup
import operator

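# Matches runs of characters that are neither alphanumeric nor whitespace (used to strip punctuation)
regExpresion = r'[^0-9a-zA-Z\s]+'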

# Define the path to the directory containing the text files
CORPUS_DIR = '../../week01/data'
documents = {}
for filename in os.listdir(CORPUS_DIR):
    if filename.endswith('.txt'):
        file_path = os.path.join(CORPUS_DIR, filename)
        print(file_path)
        with open(file_path, 'r', encoding='utf-8') as file:
            documents[filename] = file.read().lower()  # Read and convert to lowercase

../../week01/data\A Christmas Carol in Prose Being a Ghost Story of Christmas.txt
../../week01/data\A Dolls House  a play.txt
../../week01/data\A Modest Proposal.txt
../../week01/data\A Room with a View.txt
../../week01/data\A Study in Scarlet.txt
../../week01/data\A Tale of Two Cities.txt
../../week01/data\Adventures of Huckleberry Finn.txt
../../week01/data\Alices Adventures in Wonderland.txt
../../week01/data\Ang Filibusterismo Karugtng ng Noli Me Tangere.txt
../../week01/data\Anne of Green Gables.txt
../../week01/data\Anthem.txt
../../week01/data\Beyond Good and Evil.txt
../../week01/data\Commentary on Genesis Vol 2 Luther on Sin and the Flood.txt
../../week01/data\Cranford.txt
../../week01/data\Crime and Punishment.txt
../../week01/data\Daddy Takes Us to the Garden.txt
../../week01/data\datasource.txt
../../week01/data\Don Quijote.txt
../../week01/data\Don Quixote.txt
../../week01/data\Dracula.txt
../../week01/data\Dubliners.txt
../../week01/data\Emma.txt
../../week01/data\Franken

In [15]:
libros = []
inverted_index = {}  # term -> set of document ids (positions in `libros`)

def obtener_indice_invertido():
    """Build an inverted index mapping each term to the ids of the documents containing it."""
    global libros
    libros = list(documents.keys())
    print(libros)
    print("Building inverted index...")

    for libro_id, libro in enumerate(libros):
        file = os.path.join(CORPUS_DIR, libro.strip())
        print(file)
        try:
            with open(file, 'r', encoding='utf-8') as f:
                # Lowercase to match the preprocessing applied to `documents`
                content = f.read().lower()
        except (OSError, UnicodeDecodeError):
            print(f"Error opening file {file}. It does not exist or cannot be read.")
            continue
        # Strip punctuation, then split on whitespace to get the terms
        content = re.sub(regExpresion, '', content)
        for word in content.split():
            inverted_index.setdefault(word, set()).add(libro_id)

    print("Printing inverted index:")
    for key, value in inverted_index.items():
        print(key, ':', value)

In [16]:
obtener_indice_invertido()

['A Christmas Carol in Prose Being a Ghost Story of Christmas.txt', 'A Dolls House  a play.txt', 'A Modest Proposal.txt', 'A Room with a View.txt', 'A Study in Scarlet.txt', 'A Tale of Two Cities.txt', 'Adventures of Huckleberry Finn.txt', 'Alices Adventures in Wonderland.txt', 'Ang Filibusterismo Karugtng ng Noli Me Tangere.txt', 'Anne of Green Gables.txt', 'Anthem.txt', 'Beyond Good and Evil.txt', 'Commentary on Genesis Vol 2 Luther on Sin and the Flood.txt', 'Cranford.txt', 'Crime and Punishment.txt', 'Daddy Takes Us to the Garden.txt', 'datasource.txt', 'Don Quijote.txt', 'Don Quixote.txt', 'Dracula.txt', 'Dubliners.txt', 'Emma.txt', 'Frankenstein Or The Modern Prometheus.txt', 'Great Expectations.txt', 'Grimms Fairy Tales.txt', 'Heart of Darkness.txt', 'History of Tom Jones a Foundling.txt', 'How it feels to be colored me.txt', 'Jane Eyre An Autobiography.txt', 'Leviathan.txt', 'Little Women Or Meg Jo Beth and Amy.txt', 'Little Women.txt', 'Meditations.txt', 'Metamorphosis.txt', '

### Step 2: Vector Space Model (VSM)

Task: Implement a simple Vector Space Model using term frequency.

Requirements:
* _Document and Query Representation:_ Convert each document and the query into a vector where each dimension corresponds to a term from the corpus. Use simple term frequency for weighting.
* _Cosine Similarity Calculation:_ Calculate the cosine similarity between the query vector and each document vector.
* _Ranking:_ Rank the documents based on their cosine similarity scores from highest to lowest.
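The three requirements above can be sketched as follows. This is a minimal illustration, not a reference solution: it stores each vector sparsely as a `Counter` of raw term counts, reuses the punctuation-stripping regex from Step 1, and runs on a tiny made-up corpus (`toy_docs`) instead of the real `documents` dictionary. The function names are hypothetical.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Sparse term-frequency vector: term -> raw count."""
    cleaned = re.sub(r"[^0-9a-zA-Z\s]+", "", text.lower())
    return Counter(cleaned.split())

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_vsm(query, docs):
    """Return (filename, score) pairs sorted from highest to lowest similarity."""
    query_vec = tf_vector(query)
    scores = {name: cosine_similarity(query_vec, tf_vector(text))
              for name, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy corpus (illustrative only, not taken from the notebook's data)
toy_docs = {
    "d1.txt": "the ghost of christmas past",
    "d2.txt": "a ghost story a ghost tale",
    "d3.txt": "pride and prejudice",
}
vsm_ranking = rank_vsm("ghost story", toy_docs)
for name, score in vsm_ranking:
    print(f"{name}: {score:.3f}")
```

On the real corpus, the same call would be `rank_vsm(query, documents)`. Sparse vectors avoid allocating one dimension per corpus term for every document, while giving the same cosine values as dense vectors would.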

### Step 3: Binary Independence Model (BIM)

Task: Implement a basic Binary Independence Model to rank documents.

Requirements:
* _Binary Representation:_ Represent the corpus and the query in binary vectors (1 if the term is present, 0 otherwise).
* _Probability Estimation:_ Assume arbitrary probabilities for the presence of each term in relevant and non-relevant documents.
* _Relevance Scoring:_ Calculate the relevance score for each document based on the product of probabilities for terms present in the query.
* _Ranking:_ Rank the documents based on their relevance scores from highest to lowest.
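One way to turn these requirements into code is sketched below. It rests on several assumptions that are not in the task statement: each query term found in a document contributes the odds ratio p(1 - u) / (u(1 - p)), where p and u are the assumed probabilities of the term appearing in relevant and non-relevant documents; the default values 0.7 and 0.2 are arbitrary; and a document matching no query term keeps the neutral score 1.0. The names `rank_bim`, `term_probs`, and `toy_docs` are hypothetical.

```python
import re

def term_set(text):
    """Binary representation: the set of distinct terms present in the text."""
    cleaned = re.sub(r"[^0-9a-zA-Z\s]+", "", text.lower())
    return set(cleaned.split())

def rank_bim(query, docs, term_probs=None, default_probs=(0.7, 0.2)):
    """Rank docs by the product of per-term odds ratios over matching query terms.

    term_probs maps term -> (p, u); terms not listed use default_probs.
    All probabilities are assumed (arbitrary), not estimated from data.
    """
    term_probs = term_probs or {}
    query_terms = term_set(query)
    scores = {}
    for name, text in docs.items():
        score = 1.0  # empty product: no matching query term
        for term in query_terms & term_set(text):
            p, u = term_probs.get(term, default_probs)
            score *= (p * (1 - u)) / (u * (1 - p))
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy corpus (illustrative only, not taken from the notebook's data)
toy_docs = {
    "d1.txt": "the ghost of christmas past",
    "d2.txt": "a ghost story a ghost tale",
    "d3.txt": "pride and prejudice",
}
bim_ranking = rank_bim("ghost story", toy_docs, term_probs={"story": (0.9, 0.1)})
for name, score in bim_ranking:
    print(f"{name}: {score:.2f}")
```

Unlike the Vector Space Model, this score ignores how often a term occurs: only presence or absence matters, so repeating "ghost" in a document does not raise its score.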