# Standard Boolean Model

The standard Boolean model, introduced in the 1970s, is a classic text retrieval method. Back then, computing power was limited, and data was stored on sequential tape drives. The model treats documents and queries as **sets of words** and uses Boolean logic (AND, OR, NOT) to decide which documents match: a document is retrieved only if its set of words satisfies the Boolean expression in the query. Although more advanced retrieval methods have since been developed, the Boolean model is still used because its exact-match semantics are simple and predictable.

We will create a standard Boolean model that performs the following operations:

- 𝑄 = 𝑡: The term 𝑡 must be present.
- 𝑄 = ¬𝑡: The term 𝑡 must not be present.
- 𝑄 = 𝑄1 ∨ 𝑄2: Either sub-query 𝑄1 or sub-query 𝑄2 must be satisfied.
- 𝑄 = 𝑄1 ∧ 𝑄2: Both sub-query 𝑄1 and sub-query 𝑄2 must be satisfied.

To simplify, we will focus on using only one operation at a time. Our code will not support multiple or mixed operations.
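Before building the full pipeline, the four operations above can be sketched directly with Python sets. This is only an illustration under stated assumptions: the `docs` dictionary is hypothetical toy data, and the helper names (`matching`, `q_not`, `q_or`, `q_and`) are invented for this sketch and are not used in the rest of the notebook.

```python
# Toy collection (hypothetical data): each document is modelled as a set of terms.
docs = {
    0: {"ART", "PAINTING", "COLOR"},
    1: {"PHYSICS", "QUANTUM", "ENERGY"},
    2: {"ART", "PHYSICS", "LIGHT"},
}

all_ids = set(docs)


def matching(term):
    """Q = t: ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


def q_not(term):
    """Q = NOT t: every document except those that contain the term."""
    return all_ids - matching(term)


def q_or(t1, t2):
    """Q = t1 OR t2: union of the two result sets."""
    return matching(t1) | matching(t2)


def q_and(t1, t2):
    """Q = t1 AND t2: intersection of the two result sets."""
    return matching(t1) & matching(t2)


print(sorted(matching("ART")))          # [0, 2]
print(sorted(q_not("QUANTUM")))         # [0, 2]
print(sorted(q_or("ART", "QUANTUM")))   # [0, 1, 2]
print(sorted(q_and("ART", "PHYSICS")))  # [2]
```

Each operator maps to one set operation (difference, union, intersection), which is why the model is easy to reason about.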

In [170]:
# List that stores the text extracted from each PDF
documents = []

In [171]:
# Extract text from PDF files
import pdfplumber


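def extract_text_from_pdf(file_path):
    """Return the text of all pages in the PDF, joined by newlines."""
    page_texts = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # A page without a text layer may yield no text, so fall back to ""
            page_texts.append(page.extract_text() or "")
    # Join with newlines so the last word of a page does not merge with the next page
    return "\n".join(page_texts)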


docs_collection = ['computer_science.pdf', 'physics.pdf', 'art.pdf',
                   'football_history.pdf', 'artificial_intelligence.pdf']


for path in docs_collection:
    # Extract the text of each document in the collection and store it
    doc = extract_text_from_pdf(path)
    documents.append(doc)


In [173]:
import spacy

# Step 1: Load the small English spaCy model.
# The larger "en_core_web_lg" model can be more accurate, but it is bigger and slower to load.
nlp = spacy.load("en_core_web_sm")

# Step 2: Preprocess and create a vocabulary
vocab = set()  # Set to store unique vocabulary terms
preprocessed_docs = []  # List to hold preprocessed documents

for doc in documents:
    # Apply spaCy NLP pipeline to the document
    doc_nlp = nlp(doc)
    # Filter and lemmatize tokens, excluding stop words and certain POS tags
    filtered_tokens = [
        token.lemma_.upper() for token in doc_nlp
        if not token.is_stop and token.pos_ not in {"DET", "ADP", "AUX", "CCONJ", "ADJ", "PUNCT", "SPACE"}
    ]
    # Update vocabulary set with filtered tokens
    # Vocab keeps only unique tokens
    vocab.update(filtered_tokens)
    preprocessed_docs.append(filtered_tokens)  # Append filtered tokens to preprocessed documents list

# Sorted list for consistent vector representation
vocab = sorted(vocab)

# Step 3: Convert documents to Boolean vectors
doc_vectors = []  # List to hold document vectors

for tokens in preprocessed_docs:
    # Build a Boolean vector: 1 if the vocabulary term occurs in the document, else 0
    token_set = set(tokens)  # Set membership is faster than scanning the list
    doc_vectors.append([1 if term in token_set else 0 for term in vocab])

def search_docs(query, doc_vectors, vocab):
    """Print the index of every document that matches a single-operator Boolean query."""
    operators = {'AND', 'OR', 'NOT'}
    # Uppercase the query to match the vocabulary, then split it on whitespace.
    # Note: query terms are not lemmatized, so they must match a vocabulary entry exactly.
    full_query = query.upper().split()
    # Keep only the terms
    terms = [t for t in full_query if t not in operators]
    # Get the first operator used in the query (None if there is none)
    operation = next((t for t in full_query if t in operators), None)
    if not terms:
        return  # Nothing to search for

    # Create Boolean vector for the query
    query_vector = [1 if token in terms else 0 for token in vocab]
    # A term missing from the vocabulary cannot occur in any document
    has_unknown_term = any(t not in vocab for t in terms)

    for doc_index, doc_vec in enumerate(doc_vectors):
        if operation == 'AND':
            # Every query term must be known and present in the document
            match = not has_unknown_term and all(
                d == 1 for q, d in zip(query_vector, doc_vec) if q == 1)
        elif operation == 'NOT':
            # None of the query terms may be present in the document
            match = all(d == 0 for q, d in zip(query_vector, doc_vec) if q == 1)
        else:
            # OR, or no operator: at least one query term must be present
            match = any(d == 1 for q, d in zip(query_vector, doc_vec) if q == 1)

        # Print document index if it matches the query
        if match:
            print(f'Doc: {doc_index}')
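
To see how the Boolean vectors and the three comparisons behave without loading spaCy or any PDF, here is a minimal sketch on hand-written token lists. Everything in it is an assumption made for illustration: `toy_docs`, `to_vector`, and `find` are hypothetical names, and `find` returns a list of indices instead of printing them.

```python
# Hypothetical, already-preprocessed token lists standing in for the PDFs.
toy_docs = [
    ["ART", "PAINTING", "COLOR"],
    ["PHYSICS", "QUANTUM", "ENERGY"],
    ["ART", "PHYSICS", "LIGHT"],
]

# Sorted vocabulary so that every vector uses the same term order.
toy_vocab = sorted({term for tokens in toy_docs for term in tokens})


def to_vector(tokens, vocab):
    """Return a 0/1 vector with a 1 for every vocabulary term found in tokens."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


toy_vectors = [to_vector(tokens, toy_vocab) for tokens in toy_docs]


def find(terms, operation=None):
    """Return the indices of the toy documents that match the query terms."""
    query = to_vector(terms, toy_vocab)
    hits = []
    for index, doc_vec in enumerate(toy_vectors):
        # Document bits at the positions where the query vector has a 1.
        selected = [d for q, d in zip(query, doc_vec) if q == 1]
        if operation == "AND":
            match = all(selected)      # every query term is present
        elif operation == "NOT":
            match = not any(selected)  # no query term is present
        else:
            match = any(selected)      # OR, or a single term
        if match:
            hits.append(index)
    return hits


print(toy_vocab)
# ['ART', 'COLOR', 'ENERGY', 'LIGHT', 'PAINTING', 'PHYSICS', 'QUANTUM']
print(toy_vectors[2])                   # [1, 0, 0, 1, 0, 1, 0]
print(find(["ART", "PHYSICS"], "AND"))  # [2]
print(find(["QUANTUM"], "NOT"))         # [0, 2]
print(find(["ART", "QUANTUM"], "OR"))   # [0, 1, 2]
```

Returning a list rather than printing makes the results easy to check with assertions, which is one possible refinement of the function above.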


# Evaluation

Let’s test different queries and see the results.

In [179]:
query = "ai"

search_docs(query, doc_vectors, vocab)

Doc: 0
Doc: 4


In [180]:
query = "not quantum"

search_docs(query, doc_vectors, vocab)

Doc: 2
Doc: 3
Doc: 4


In [185]:
query = "art and physics"

search_docs(query, doc_vectors, vocab)

In [186]:
query = "art or physics"

search_docs(query, doc_vectors, vocab)

Doc: 1
Doc: 2


# Conclusion

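The results are consistent with the topics of the PDF documents. For example, `art or physics` returns documents 1 and 2 (`physics.pdf` and `art.pdf`), while `art and physics` returns nothing because no single document contains both terms. Checking the PDF documents confirms that the search finds the relevant ones.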

### Advantages

- **Simplicity**: The Standard Boolean Model is straightforward, offering a clear and intuitive description of query semantics.
- **Ease of Implementation**: It is easy to implement and understand, making it user-friendly.

### Disadvantages

- **Lack of Control**: There is limited control over the number of retrieved documents, which can lead to either too few or too many results.
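- **Binary Nature**: A document either matches the query or it does not. There is no ranking and no partial matching, so a document that contains most of the query terms is treated the same as one that contains none, which may not align with user expectations.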
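
The lack of control over result size can be illustrated with a small sketch. The `postings` dictionary below is hypothetical data (term mapped to the ids of the documents that contain it), not taken from the PDFs used above.

```python
# Hypothetical postings: term -> ids of the documents that contain it.
postings = {
    "ART": {0, 2, 5, 7},
    "PHYSICS": {1, 2, 3},
    "HISTORY": {4, 5, 6, 7, 8},
}

terms = ["ART", "PHYSICS", "HISTORY"]

# AND over all terms: only documents that contain every term survive.
and_result = set.intersection(*(postings[t] for t in terms))
# OR over all terms: any document that contains at least one term is returned.
or_result = set.union(*(postings[t] for t in terms))

print(len(and_result))  # 0
print(len(or_result))   # 9
```

With the same three terms, AND returns nothing and OR returns every document, and in both cases the results come back unranked.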