# Extended Boolean Model

The standard Boolean model only reports whether a document matches a query, so it cannot rank results. The Extended Boolean Model addresses this by incorporating term weights from a bag-of-words representation and by allowing partial matching. Each document receives a similarity score instead of a true/false result, so the results can be ordered by relevance.

We'll use the Fuzzy Algebraic technique, applied here to queries with two operands joined by a single AND or OR, to score documents:

- 𝑠𝑖𝑚(𝑄1 ∧ 𝑄2, 𝐷𝑖) = 𝑠𝑖𝑚(𝑄1, 𝐷𝑖) × 𝑠𝑖𝑚(𝑄2, 𝐷𝑖)

- 𝑠𝑖𝑚(𝑄1 ∨ 𝑄2, 𝐷𝑖) = 𝑠𝑖𝑚(𝑄1, 𝐷𝑖) + 𝑠𝑖𝑚(𝑄2, 𝐷𝑖) - 𝑠𝑖𝑚(𝑄1, 𝐷𝑖) × 𝑠𝑖𝑚(𝑄2, 𝐷𝑖)
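
To make the two formulas concrete, here is a minimal sketch with made-up similarity values (0.5 and 0.4 are illustrative, not taken from the collection). The helper names `fuzzy_and` and `fuzzy_or` are hypothetical and are not used elsewhere in this notebook.

```python
def fuzzy_and(sim1, sim2):
    # Algebraic product: sim(Q1 AND Q2, Di) = sim1 * sim2
    return sim1 * sim2


def fuzzy_or(sim1, sim2):
    # Algebraic sum: sim(Q1 OR Q2, Di) = sim1 + sim2 - sim1 * sim2
    return sim1 + sim2 - sim1 * sim2


# Illustrative similarity values only
print(round(fuzzy_and(0.5, 0.4), 4))  # 0.2
print(round(fuzzy_or(0.5, 0.4), 4))   # 0.7
```

As long as both similarities lie in [0, 1], the AND score is never higher than either operand and the OR score is never lower than either operand.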

In [19]:
# List that will hold the text extracted from each PDF
documents = []

In [20]:
# Extract text from PDF files
import pdfplumber


def extract_text_from_pdf(file_path):
    text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # Fall back to "" in case extract_text() returns None for a page without text
            text += page.extract_text() or ""
    return text


docs_collection = ['computer_science.pdf', 'physics.pdf', 'art.pdf',
                   'football_history.pdf', 'artificial_intelligence.pdf']


for path in docs_collection:
    # Extract the text of every document in the collection and store it
    doc = extract_text_from_pdf(path)
    documents.append(doc)


In [73]:
import spacy
from collections import Counter

# Step 1: Load spaCy model
# You can also use the larger model "en_core_web_lg" for better accuracy, but it is more expensive.
nlp = spacy.load("en_core_web_sm")

# Step 2: Preprocess and create a vocabulary
vocab = set()  # Set to store unique vocabulary terms
preprocessed_docs = []  # List to hold preprocessed documents

for doc in documents:
    # Apply spaCy NLP pipeline to the document
    doc_nlp = nlp(doc)
    # Filter and lemmatize tokens, excluding stop words and certain POS tags
    filtered_tokens = [
        token.lemma_.upper() for token in doc_nlp
        if not token.is_stop and token.pos_ not in {"DET", "ADP", "AUX", "CCONJ", "ADJ", "PUNCT", "SPACE"}
    ]
    # Update vocabulary set with filtered tokens
    # Vocab keeps only unique tokens
    vocab.update(filtered_tokens)
    preprocessed_docs.append(filtered_tokens)  # Append filtered tokens to preprocessed documents list

# Sorted list for consistent vector representation
vocab = sorted(vocab)

# Step 3: Convert documents to term-frequency vectors
doc_vectors = []

for tokens in preprocessed_docs:
    # Using bag of words
    # The vector stores how often each vocab term appears in the document
    # Example:
    # Document (tokens): ['AI', 'ART', 'AI', 'SPORT']
    # Vocabulary (vocab): ['AI', 'ART', 'MATH', 'PHYSICS', 'SPORT']
    # Resulting doc_vector: [2, 1, 0, 0, 1]
    # Count every token once instead of scanning the token list per vocab term
    counts = Counter(tokens)
    doc_vectors.append([counts[term] for term in vocab])


def get_query_vector(term, vocab):
    # Build a binary query vector aligned with vocab:
    # 1 at the position of the query term, 0 everywhere else
    return [1 if token == term else 0 for token in vocab]

def calculate_similarity(query_vector, doc_vec):
    # Similarity of one query term to one document:
    # (sum of the term frequencies selected by the query vector) / len(doc_vec)
    # Note: len(doc_vec) is the vocabulary size, not the number of tokens in
    # the document, so every document is normalized by the same constant.
    return sum(doc_vec[i] for i, val in enumerate(query_vector) if val == 1) / len(doc_vec)


def search_docs(query, doc_vectors, vocab):
    # Split the query into terms and an optional operator (AND / OR)
    query = query.upper().split()
    # Get only the terms
    terms = [t for t in query if t not in {'AND', 'OR'}]
    # Get only the operator (the first one found, if any)
    operator = next((t for t in query if t in {'AND', 'OR'}), None)

    # An empty query has nothing to score
    if not terms:
        print("No query terms given")
        return

    # Generate binary vectors for query terms
    # Note: terms are upper-cased but not lemmatized, so they must match the
    # lemmas stored in vocab. Only the first two terms are scored.
    query_vectors = [get_query_vector(term, vocab) for term in terms]

    relevant_docs = []

    for doc_index, doc_vec in enumerate(doc_vectors):
        # First we calculate the per-term similarities
        # Note: sim2 is None if the query has only one term
        sim1 = calculate_similarity(query_vectors[0], doc_vec)
        sim2 = calculate_similarity(query_vectors[1], doc_vec) if len(query_vectors) > 1 else None

        if operator == 'AND' and sim2 is not None:
            # score = 𝑠𝑖𝑚(𝑄1 ∧ 𝑄2, 𝐷𝑖)
            # 𝑠𝑖𝑚(𝑄1 ∧ 𝑄2, 𝐷𝑖) = 𝑠𝑖𝑚(𝑄1, 𝐷𝑖) × 𝑠𝑖𝑚(𝑄2, 𝐷𝑖)
            score = sim1 * sim2
        elif operator == 'OR' and sim2 is not None:
            # score = 𝑠𝑖𝑚(𝑄1 ∨ 𝑄2, 𝐷𝑖)
            # 𝑠𝑖𝑚(𝑄1 ∨ 𝑄2, 𝐷𝑖) = 𝑠𝑖𝑚(𝑄1, 𝐷𝑖) + 𝑠𝑖𝑚(𝑄2, 𝐷𝑖) - 𝑠𝑖𝑚(𝑄1, 𝐷𝑖) × 𝑠𝑖𝑚(𝑄2, 𝐷𝑖)
            score = sim1 + sim2 - sim1 * sim2
        else:
            # Single-term query (no operator, or no second term)
            score = sim1

        # Store only positive scores
        if score > 0:
            relevant_docs.append({'doc_index': doc_index, 'doc_score': score})

    # Sort relevant documents by descending score
    relevant_docs.sort(key=lambda x: x['doc_score'], reverse=True)

    # Show relevant documents
    for doc in relevant_docs:
        print(doc)
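
The scoring steps can also be traced by hand on a tiny collection. The sketch below is a stand-alone illustration that needs neither spaCy nor the PDFs: the token lists in `toy_docs` are invented, and `term_similarity` is a hypothetical helper that mirrors the normalization used in `calculate_similarity` (term frequency divided by vocabulary size).

```python
from collections import Counter

# Hypothetical, already-preprocessed token lists (stand-ins for the PDFs)
toy_docs = [
    ["QUANTUM", "PHYSICS", "QUANTUM", "TECHNOLOGY"],
    ["ART", "PAINT", "ART"],
    ["TECHNOLOGY", "COMPUTER"],
]

# Sorted vocabulary so every vector uses the same term order
toy_vocab = sorted({token for tokens in toy_docs for token in tokens})

# Term-frequency vector for each toy document
toy_vectors = []
for tokens in toy_docs:
    counts = Counter(tokens)
    toy_vectors.append([counts[term] for term in toy_vocab])


def term_similarity(term, doc_vec, vocab):
    # Term frequency divided by the vocabulary size; 0.0 for unknown terms
    if term not in vocab:
        return 0.0
    return doc_vec[vocab.index(term)] / len(vocab)


# Score "QUANTUM AND TECHNOLOGY" and "QUANTUM OR TECHNOLOGY" per document
toy_scores = []
for i, vec in enumerate(toy_vectors):
    s1 = term_similarity("QUANTUM", vec, toy_vocab)
    s2 = term_similarity("TECHNOLOGY", vec, toy_vocab)
    and_score = s1 * s2
    or_score = s1 + s2 - s1 * s2
    toy_scores.append((and_score, or_score))
    print(i, round(and_score, 4), round(or_score, 4))
```

Only the first toy document contains both terms, so it is the only one with a non-zero AND score. The third document still gets a non-zero OR score from TECHNOLOGY alone, which is the partial matching the model is meant to provide.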


In [74]:
search_docs("quantum and technology", doc_vectors, vocab)

{'doc_index': 1, 'doc_score': 8.007647303174531e-05}
{'doc_index': 0, 'doc_score': 1.5014338693452247e-05}


In [75]:
search_docs("technology or quantum", doc_vectors, vocab)

{'doc_index': 1, 'doc_score': 0.037951243437482796}
{'doc_index': 0, 'doc_score': 0.008933531522604087}
{'doc_index': 3, 'doc_score': 0.0044742729306487695}
{'doc_index': 4, 'doc_score': 0.0044742729306487695}


In [76]:
search_docs("technology", doc_vectors, vocab)

{'doc_index': 0, 'doc_score': 0.006711409395973154}
{'doc_index': 3, 'doc_score': 0.0044742729306487695}
{'doc_index': 4, 'doc_score': 0.0044742729306487695}
{'doc_index': 1, 'doc_score': 0.0022371364653243847}
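
As a sanity check, the printed scores for document 1 can be reproduced by hand. The numbers below are inferred from the outputs above and are not stated anywhere in the notebook: a vocabulary of 447 terms (since 0.0022371364653243847 is 1/447), one occurrence of TECHNOLOGY and sixteen occurrences of QUANTUM in document 1. Treat them as assumptions.

```python
import math

# Assumptions inferred from the printed scores, not stated in the notebook
VOCAB_SIZE = 447        # 0.0022371364653243847 is 1 / 447
quantum_count = 16      # occurrences of QUANTUM in document 1 (inferred)
technology_count = 1    # occurrences of TECHNOLOGY in document 1 (inferred)

sim_quantum = quantum_count / VOCAB_SIZE
sim_technology = technology_count / VOCAB_SIZE

# Fuzzy Algebraic AND / OR
and_score = sim_quantum * sim_technology
or_score = sim_quantum + sim_technology - sim_quantum * sim_technology

# Compare with the scores printed for document 1 in the AND and OR queries
and_matches = math.isclose(and_score, 8.007647303174531e-05, rel_tol=1e-9)
or_matches = math.isclose(or_score, 0.037951243437482796, rel_tol=1e-9)
print(and_matches, or_matches)
```

The check also shows why the AND scores are so small: multiplying two values well below 1 shrinks the result, while the OR score stays close to the sum of the two term scores.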


In [77]:
search_docs("art", doc_vectors, vocab)

{'doc_index': 2, 'doc_score': 0.015659955257270694}


# Conclusion

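The rankings match what we would expect from the collection: physics.pdf (index 1) ranks first for both queries that involve "quantum", and "art" returns only art.pdf (index 2). The OR query also returns documents 3 and 4, which contain "technology" but not "quantum", showing partial matching at work. If we check the PDF documents, we can confirm that the search identifies and ranks the most relevant documents.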

### Advantages

- Provides ranked results and partial matches, so users see which documents best fit a query instead of an unordered set of exact matches.

### Disadvantages

- Users might struggle to express complex queries with simple AND/OR combinations, and the implementation above handles only one operator between two terms.


We have only explored the Fuzzy Algebraic variant. Other variants of the Extended Boolean Model compute the AND and OR operators differently, including Fuzzy Set, Soft Boolean Operator, Paice Model, and P-Norm Model.
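
For comparison, two of these variants can be sketched in a few lines. This is an illustration under assumptions rather than part of the notebook's pipeline: it uses the forms commonly given for two equally weighted query terms, the function names are hypothetical, and `p=2.0` is an arbitrary default.

```python
def fuzzy_set_and(sim1, sim2):
    # Fuzzy Set: AND takes the weaker of the two similarities
    return min(sim1, sim2)


def fuzzy_set_or(sim1, sim2):
    # Fuzzy Set: OR takes the stronger of the two similarities
    return max(sim1, sim2)


def p_norm_or(sim1, sim2, p=2.0):
    # P-Norm OR for two equally weighted terms
    return ((sim1 ** p + sim2 ** p) / 2) ** (1 / p)


def p_norm_and(sim1, sim2, p=2.0):
    # P-Norm AND for two equally weighted terms
    return 1 - (((1 - sim1) ** p + (1 - sim2) ** p) / 2) ** (1 / p)


# Illustrative similarity values only
s1, s2 = 0.5, 0.4
print(fuzzy_set_and(s1, s2), fuzzy_set_or(s1, s2))
print(p_norm_and(s1, s2), p_norm_or(s1, s2))
```

With `p=1` both P-Norm operators reduce to the plain average of the two similarities, and larger values of `p` make them behave more strictly, moving toward the min/max behaviour of the Fuzzy Set variant.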