# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.

### Step 2: Create an Inverted Index

Create an inverted index from the documents. The index maps each word to the set of IDs of the documents in which it appears, so finding the documents for a term is a single lookup instead of a scan of every document.
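
As a minimal sketch of that mapping, assuming documents are already held in a dictionary of ID to text: the function name `build_inverted_index`, the sample documents, and the whitespace tokenization are illustrative assumptions, not part of the exercise.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: plain whitespace splitting stands in for a real tokenizer
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)

# Hypothetical sample data
sample_docs = {
    "d1": "the whale and the ship",
    "d2": "the ghost ship",
}
sample_index = build_inverted_index(sample_docs)
print(sorted(sample_index["ship"]))   # ['d1', 'd2']
print(sorted(sample_index["whale"]))  # ['d1']
```

Storing sets rather than lists makes the later Boolean operations direct set operations.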

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
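
One possible way to approach both bullets, offered as a sketch rather than the required solution: split the query on whitespace, treat uppercase `AND`, `OR`, and `NOT` as operators, and evaluate left to right with set operations. The names `parse_query` and `boolean_search`, the lack of operator precedence and parentheses, the implicit AND between adjacent terms, the lowercase index keys, and the toy index are all assumptions made for illustration.

```python
OPERATORS = {"AND", "OR", "NOT"}

def parse_query(query):
    """Split a query into ('OP', name) and ('TERM', word) tuples."""
    tokens = []
    for raw in query.split():
        if raw in OPERATORS:
            tokens.append(("OP", raw))
        else:
            tokens.append(("TERM", raw.lower()))  # assumes lowercase index keys
    return tokens

def boolean_search(index, all_docs, query):
    """Evaluate a flat Boolean query left to right (no precedence, no parentheses)."""
    result = None    # running result set; None until the first term is seen
    pending = "AND"  # binary operator to apply to the next term
    negate = False   # True when the next term is preceded by NOT
    for kind, value in parse_query(query):
        if kind == "OP":
            if value == "NOT":
                negate = not negate
            else:
                pending = value
            continue
        docs = set(index.get(value, ()))
        if negate:
            docs = all_docs - docs  # NOT needs the full set of document IDs
            negate = False
        if result is None:
            result = docs
        elif pending == "OR":
            result = result | docs
        else:
            result = result & docs
        pending = "AND"  # adjacent terms default to AND
    return result if result is not None else set()

# Hypothetical toy data
toy_index = {
    "whale": {"moby.txt", "ocean.txt"},
    "ship": {"moby.txt", "island.txt"},
    "ghost": {"carol.txt"},
}
toy_docs = {"moby.txt", "ocean.txt", "island.txt", "carol.txt"}

print(sorted(boolean_search(toy_index, toy_docs, "whale AND ship")))      # ['moby.txt']
print(sorted(boolean_search(toy_index, toy_docs, "whale OR ghost")))      # ['carol.txt', 'moby.txt', 'ocean.txt']
print(sorted(boolean_search(toy_index, toy_docs, "ship AND NOT whale")))  # ['island.txt']
```

A fuller solution would likely add parentheses and operator precedence, for example by converting the query to postfix form before evaluating it.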

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query. Handle queries that match no documents, for example by showing a clear message instead of an empty list.
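
A small sketch of such output handling, assuming the search step returns a set of filenames. The function name `format_results` and the message wording are hypothetical.

```python
def format_results(query, matches):
    """Return display lines for a result set, with a message when it is empty."""
    if not matches:
        return [f"No documents match the query '{query}'."]
    lines = [f"{len(matches)} document(s) match the query '{query}':"]
    lines.extend(f"  - {name}" for name in sorted(matches))
    return lines

for line in format_results("whale AND ship", {"moby.txt"}):
    print(line)
for line in format_results("unicorn", set()):
    print(line)
```

Returning the lines instead of printing inside the function keeps the formatting easy to test and reuse.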

## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
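
On the efficiency criterion, one common tactic for multi-term AND queries is to intersect the smallest posting sets first, so intermediate results stay small, and to stop as soon as the result is empty. The sketch below assumes posting sets are Python sets; the name `intersect_all` is hypothetical.

```python
def intersect_all(posting_sets):
    """Intersect several posting sets, starting with the smallest one."""
    if not posting_sets:
        return set()
    ordered = sorted(posting_sets, key=len)
    result = set(ordered[0])  # copy so the index itself is never modified
    for postings in ordered[1:]:
        result &= postings
        if not result:  # nothing can match anymore, stop early
            break
    return result

print(intersect_all([{1, 2, 3, 4}, {2, 3}, {3, 5}]))  # {3}
```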

This exercise will deepen your understanding of how search engines process and respond to user queries.

## Development
-   Fernando Cardenas

## Notes
An inverted index is a data structure that enables fast, efficient retrieval of information from a collection of text documents.

A Python dictionary is an efficient data structure for storing an inverted index.

Each word (search term) can be mapped to the set of identifiers of the documents in which that word appears:

inverted_index = {
    'word1': {doc_id1, doc_id2},
    'word2': {doc_id2, doc_id3},
    ...
}
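
With sets as values, the three Boolean operators map directly onto Python set operations. The sketch below uses concrete integer IDs and hypothetical keys in place of the placeholders above; `all_doc_ids` is an assumed extra variable, needed because NOT is defined relative to the whole collection.

```python
inverted_index_example = {
    "term_a": {1, 2},
    "term_b": {2, 3},
}
all_doc_ids = {1, 2, 3}

both = inverted_index_example["term_a"] & inverted_index_example["term_b"]    # term_a AND term_b
either = inverted_index_example["term_a"] | inverted_index_example["term_b"]  # term_a OR term_b
without_b = all_doc_ids - inverted_index_example["term_b"]                    # NOT term_b

print(sorted(both))       # [2]
print(sorted(either))     # [1, 2, 3]
print(sorted(without_b))  # [1]
```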

To create an inverted index:

1. Text preprocessing
    a. Remove special characters, punctuation, and irrelevant symbols from the text.
    b. Convert all text to lowercase to ensure uniform matching.
    c. Word segmentation: split the text into individual words or tokens.
    d. Stop-word removal: words such as "la", "de", and "a" add no value to the search.
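
Steps a to d can be sketched with the standard library alone. This is an illustration under assumptions: the `preprocess` name and the tiny `STOP_WORDS` set are hypothetical, and `str.split` stands in for a proper tokenizer such as NLTK's `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"la", "de", "a", "el", "en", "y"}

def preprocess(text):
    """Apply steps a-d: strip symbols, lowercase, tokenize, drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text)  # a. replace punctuation and symbols with spaces
    text = text.lower()                   # b. lowercase for uniform matching
    tokens = text.split()                 # c. split on whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # d. remove stop words

print(preprocess("La casa de Ana, a la orilla del mar."))
# ['casa', 'ana', 'orilla', 'del', 'mar']
```

The same function would need to be applied to queries as well as documents, otherwise query terms and index keys will not match.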


In [2]:
# Importing libraries
import os
import pandas as pd
from nltk.tokenize import word_tokenize

In [3]:
folder_path = 'C:/Users/usuario/Fer-Pc/Escritorio/EPN/2024-A/SEPTIMO_SEMESTRE/RECUPERACION_DE_INFORMACION/ir24a/week01/data'


In [4]:
def load_documents(folder_path):
    """Read every file in folder_path and return a {filename: content} dict."""
    documents = {}
    for filename in os.listdir(folder_path):
        path = os.path.join(folder_path, filename)
        with open(path, 'r', encoding='utf-8') as file:
            documents[filename] = file.read()
    return documents

In [5]:
documents = load_documents(folder_path)

In [6]:
def inverted_index(documents):
    """Map each token to the list of filenames whose text contains it."""
    index = {}
    for filename, content in documents.items():
        tokens = word_tokenize(content)
        # set() ensures each file is recorded at most once per token
        for token in set(tokens):
            index.setdefault(token, []).append(filename)
    return index

In [7]:
index = inverted_index(documents)
len(index)

312796

In [8]:
index_df = pd.DataFrame.from_dict(index, orient='index')
print(index_df)

                                                             0   \
occupies      A Christmas Carol in Prose Being a Ghost Story...   
plain         A Christmas Carol in Prose Being a Ghost Story...   
present       A Christmas Carol in Prose Being a Ghost Story...   
followers     A Christmas Carol in Prose Being a Ghost Story...   
LICENSE       A Christmas Carol in Prose Being a Ghost Story...   
...                                                         ...   
tea-canister                              Wuthering Heights.txt   
yonder—Ech                                Wuthering Heights.txt   
eyes—there                                Wuthering Heights.txt   
ruffian—I                                 Wuthering Heights.txt   
cham                                      Wuthering Heights.txt   

                                                      1   \
occupies      A Short History of English Agriculture.txt   
plain                                 A Doll's House.txt   
present        

In [9]:
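def query_inverted_index(index, query):
    """Return the files containing at least one query token (OR semantics)."""
    result = set()
    for token in word_tokenize(query):
        # Lookup is case-sensitive because the index keys are not lowercased
        if token in index:
            result.update(index[token])
    return result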

In [10]:
index = inverted_index(documents) 
query = "lock"
result = query_inverted_index(index, query)
print("Documents containing the word '{}':".format(query))
for documento in result:
    print(documento)

Documents containing the word 'lock':
Anne of Green Gables.txt
Little Women.txt
Biographical Anecdotes of William Hogarth, With a Catalogue of His Works.txt
The Yellow Wallpaper.txt
The Works of the Rev.txt
The Kama Sutra of Vatsyayana.txt
The Brothers Karamazov.txt
The Picture of Dorian Gray.txt
Frankenstein.txt
Walden and On The Duty Of Civil Disobedience.txt
The Iliad.txt
A Christmas Carol in Prose Being a Ghost Story of Christmas.txt
Adventures of Huckleberry Finn.txt
Chronicles of London Bridge.txt
Moby Dick.txt
Twenty years after.txt
Grimms' Fairy Tales.txt
The Complete Works of William Shakespeare.txt
A Tale of Two Cities.txt
Jane Eyre- An Autobiography.txt
Pride and Prejudice.txt
The Adventures of Ferdinand Count Fathom ÔÇö Complete.txt
Kentucky in American Letters.txt
Alice's Adventures in Wonderland.txt
Metamorphosis.txt
Treasure Island.txt
A Doll's House.txt
The Importance of Being Earnest- A Trivial Comedy for Serious People.txt
Ulysses.txt
The Count of Monte Cristo.txt
Sta