
**Information Retrieval Lab**

This lab implements the Boolean retrieval model, a fundamental approach in information retrieval (IR) that uses Boolean logic to retrieve documents by exact keyword matching.

In [93]:
Chapter_1 = {
  1: "In the Name of Allah—the Most Compassionate, Most Merciful.",
  2: "All praise is for Allah—Lord of all worlds",
  3: "the Most Compassionate, Most Merciful",
  4: "Master of the Day of Judgment.",
  5: "You ˹alone˺ we worship and You ˹alone˺ we ask for help.",
  6: "Guide us along the Straight Path,",
  7: "the Path of those You have blessed—not those You are displeased with, or those who are astray."
}

Chapter_1

{1: 'In the Name of Allah—the Most Compassionate, Most Merciful.',
 2: 'All praise is for Allah—Lord of all worlds',
 3: 'the Most Compassionate, Most Merciful',
 4: 'Master of the Day of Judgment.',
 5: 'You ˹alone˺ we worship and You ˹alone˺ we ask for help.',
 6: 'Guide us along the Straight Path,',
 7: 'the Path of those You have blessed—not those You are displeased with, or those who are astray.'}

**Preprocessing (tokenization, punctuation removal, lowercasing) and creating a vocabulary for the dataset**

*   Split each document into words on whitespace, hyphens, em dashes, and en dashes
*   Removed punctuation from each word
*   Converted each word to lowercase with .lower()
*   Stored each word in a set named vocab







In [94]:
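import re

def preprocess_text(doc):
  """Build the vocabulary: the set of unique lowercase words across all documents."""
  vocab = set()

  for text in doc.values():
    # Split on whitespace, hyphens, em dashes (U+2014) and en dashes (U+2013).
    for token in re.split(r'\s|-|\u2014|\u2013', text):
      # Strip punctuation and other non-word characters, then lowercase.
      word = re.sub(r'[^\w\s]', '', token).lower()
      if word:  # skip empty strings left over after splitting and stripping
        vocab.add(word)

  return vocab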

In [95]:
vocab = preprocess_text(Chapter_1)
vocab

{'all',
 'allah',
 'alone',
 'along',
 'and',
 'are',
 'ask',
 'astray',
 'blessed',
 'compassionate',
 'day',
 'displeased',
 'for',
 'guide',
 'have',
 'help',
 'in',
 'is',
 'judgment',
 'lord',
 'master',
 'merciful',
 'most',
 'name',
 'not',
 'of',
 'or',
 'path',
 'praise',
 'straight',
 'the',
 'those',
 'us',
 'we',
 'who',
 'with',
 'worlds',
 'worship',
 'you'}

**Removing Stop Words**

Stop words were removed from the vocabulary using the English stop-word list from the NLTK library.

In [96]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [97]:
vocab = {word for word in vocab if word not in stop_words}
vocab

{'allah',
 'alone',
 'along',
 'ask',
 'astray',
 'blessed',
 'compassionate',
 'day',
 'displeased',
 'guide',
 'help',
 'judgment',
 'lord',
 'master',
 'merciful',
 'name',
 'path',
 'praise',
 'straight',
 'us',
 'worlds',
 'worship'}

**Creating Term Document Incidence Matrix**

We built a term-document incidence matrix, a dictionary where:

Keys are words (terms from the vocabulary).

Values are lists of 0s and 1s with one entry per document, in document-ID order: 1 if the document contains the term, 0 otherwise.
A Boolean query can then be answered by combining the rows of its terms.


In [98]:
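def build_term_doc_matrix(doc, vocab):
  """Return {term: [0 or 1 per document]} in the order of doc.values()."""
  # Tokenize each document once, using the same rules as preprocess_text.
  doc_tokens = [
    {re.sub(r'[^\w\s]', '', t).lower() for t in re.split(r'\s|-|\u2014|\u2013', text)}
    for text in doc.values()
  ]

  matrix = {}
  for word in vocab:
    # Whole-token membership: a substring test would also find a short
    # term such as 'us' inside a longer word.
    matrix[word] = [1 if word in tokens else 0 for tokens in doc_tokens]
  return matrix

# The result gets its own name so that it does not overwrite the function.
term_doc_matrix = build_term_doc_matrix(Chapter_1, vocab)
term_doc_matrix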

{'guide': [0, 0, 0, 0, 0, 1, 0],
 'praise': [0, 1, 0, 0, 0, 0, 0],
 'name': [1, 0, 0, 0, 0, 0, 0],
 'worship': [0, 0, 0, 0, 1, 0, 0],
 'help': [0, 0, 0, 0, 1, 0, 0],
 'allah': [1, 1, 0, 0, 0, 0, 0],
 'compassionate': [1, 0, 1, 0, 0, 0, 0],
 'along': [0, 0, 0, 0, 0, 1, 0],
 'astray': [0, 0, 0, 0, 0, 0, 1],
 'blessed': [0, 0, 0, 0, 0, 0, 1],
 'day': [0, 0, 0, 1, 0, 0, 0],
 'judgment': [0, 0, 0, 1, 0, 0, 0],
 'displeased': [0, 0, 0, 0, 0, 0, 1],
 'path': [0, 0, 0, 0, 0, 1, 1],
 'us': [0, 0, 0, 0, 0, 1, 0],
 'lord': [0, 1, 0, 0, 0, 0, 0],
 'alone': [0, 0, 0, 0, 1, 0, 0],
 'worlds': [0, 1, 0, 0, 0, 0, 0],
 'straight': [0, 0, 0, 0, 0, 1, 0],
 'master': [0, 0, 0, 1, 0, 0, 0],
 'ask': [0, 0, 0, 0, 1, 0, 0],
 'merciful': [1, 0, 1, 0, 0, 0, 0]}
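The rows above can answer a Boolean query directly: combining two rows element by element gives the row for the combined query. The following is a minimal sketch of that idea, not part of the lab code. The two rows are copied from the output above, and the helper names `and_rows` and `matching_doc_ids` are hypothetical.

```python
# Rows copied from the term-document incidence matrix above
# (one entry per document, documents numbered 1 to 7).
allah_row = [1, 1, 0, 0, 0, 0, 0]
merciful_row = [1, 0, 1, 0, 0, 0, 0]


def and_rows(row_a, row_b):
    """Element-wise AND of two 0/1 incidence rows (hypothetical helper)."""
    return [a & b for a, b in zip(row_a, row_b)]


def matching_doc_ids(row):
    """Convert a 0/1 row into 1-based document IDs (hypothetical helper)."""
    return [position + 1 for position, bit in enumerate(row) if bit]


combined = and_rows(allah_row, merciful_row)
print(combined)                    # [1, 0, 0, 0, 0, 0, 0]
print(matching_doc_ids(combined))  # [1]
```

Only document 1 contains both terms, which agrees with the two rows shown in the matrix.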

**Creating Inverted Index**

We built an inverted index, a dictionary where:

Keys are words (terms from the vocabulary).

Values are lists of the IDs of the documents that contain the corresponding term, in ascending order.
This allows quick lookup for Boolean queries, because each term maps directly to its documents.

In [99]:
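def inverted_index(doc, vocab):
  """Return {term: [IDs of the documents that contain the term]}."""
  index = {word: [] for word in vocab}

  for doc_id, text in doc.items():
    # Tokenize with the same rules as preprocess_text so that only whole
    # words match; a substring test would also find a short term such as
    # 'us' inside a longer word.
    tokens = {re.sub(r'[^\w\s]', '', t).lower()
              for t in re.split(r'\s|-|\u2014|\u2013', text)}
    for word in tokens:
      if word in index:
        index[word].append(doc_id)
  return index

index = inverted_index(Chapter_1, vocab)
index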

{'guide': [6],
 'praise': [2],
 'name': [1],
 'worship': [5],
 'help': [5],
 'allah': [1, 2],
 'compassionate': [1, 3],
 'along': [6],
 'astray': [7],
 'blessed': [7],
 'day': [4],
 'judgment': [4],
 'displeased': [7],
 'path': [6, 7],
 'us': [6],
 'lord': [2],
 'alone': [5],
 'worlds': [2],
 'straight': [6],
 'master': [4],
 'ask': [5],
 'merciful': [1, 3]}
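Because each posting list above is in ascending document-ID order, two lists can also be intersected with a single linear merge instead of converting them to sets. The sketch below shows this standard alternative; it is an illustration rather than the method used in this lab, and the function name `intersect_postings` is hypothetical.

```python
def intersect_postings(postings_a, postings_b):
    """Intersect two ascending lists of document IDs with a linear merge."""
    result = []
    i, j = 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # The ID appears in both lists: keep it and advance both pointers.
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # the smaller ID cannot appear later in the other list
        else:
            j += 1
    return result


# Posting lists copied from the inverted index above.
print(intersect_postings([1, 2], [1, 3]))  # allah AND merciful -> [1]
print(intersect_postings([6, 7], [1, 3]))  # path AND merciful  -> []
```

The merge visits each ID at most once, so its cost grows with the combined length of the two lists.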

**Boolean retrieval for a single word**

The function takes one parameter, the word entered by the user.
It looks up the word's document IDs in the inverted index and prints the text of each matching document.

In [110]:
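def get_doc(word):
  # Lowercase the word to match the index keys, and fall back to an
  # empty list for words that are not in the index.
  doc_ids = index.get(word.lower(), [])
  for doc_id, text in Chapter_1.items():
    if doc_id in doc_ids:
      print(text)


get_doc('allah')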

In the Name of Allah—the Most Compassionate, Most Merciful.
All praise is for Allah—Lord of all worlds


**We implemented Boolean query handling using set operations:**

A query has the form "term operator term", for example "allah and merciful".

**AND:** Returns documents containing both terms.

**OR:** Returns documents containing at least one term.

**NOT:** Returns documents containing the first term but not the second.
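As a quick illustration of the three operators, the sketch below applies Python set operations to two posting lists copied from the inverted index output. The last line shows a unary NOT (every document that lacks a term), which the lab does not implement; it assumes the documents are numbered 1 to 7, and the name `all_doc_ids` is hypothetical.

```python
# Posting lists copied from the inverted index output, converted to sets.
allah = {1, 2}
merciful = {1, 3}

# Assumption: the collection holds documents 1 to 7.
all_doc_ids = set(range(1, 8))

and_result = sorted(allah & merciful)      # documents with both terms
or_result = sorted(allah | merciful)       # documents with at least one term
not_result = sorted(allah - merciful)      # first term but not the second
complement = sorted(all_doc_ids - allah)   # unary NOT: documents without 'allah'

print(and_result)   # [1]
print(or_result)    # [1, 2, 3]
print(not_result)   # [2]
print(complement)   # [3, 4, 5, 6, 7]
```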

In [111]:
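def query(text):
  # Expected form: "<term> <operator> <term>", e.g. 'allah and merciful'.
  parts = text.lower().split()
  if len(parts) != 3:
    raise ValueError("query must have the form: term operator term")
  fword, op, sword = parts

  # Terms that are not in the index match no documents.
  first = set(index.get(fword, []))
  second = set(index.get(sword, []))

  if op == 'and':
    doc = first & second
  elif op == 'or':
    doc = first | second
  elif op == 'not':
    doc = first - second
  else:
    raise ValueError(f"unknown operator: {op}")

  for k, v in Chapter_1.items():
    if k in doc:
      print(v)

query('allah and merciful')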

In the Name of Allah—the Most Compassionate, Most Merciful.


**Key Findings**

Some words in the dataset are joined by a dash, such as the em dash (—) in "Allah—the".

Using .split() alone splits only on whitespace, so words joined by a dash stay connected as a single token.

Solution: a regular expression (re.split(r"\s|-|\u2014|\u2013", text)) was used to split on:

*   Whitespace
*   Hyphens (-)
*   En dashes (–)
*   Em dashes (—)
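The difference between the two splitting approaches can be seen on a single verse. This is a small sketch using verse 2 from `Chapter_1`; the em dash is written as the escape `\u2014` so the character is unambiguous, and the variable names are chosen here for illustration.

```python
import re

# Verse 2 of Chapter_1; \u2014 is the em dash between "Allah" and "Lord".
verse = "All praise is for Allah\u2014Lord of all worlds"

# str.split() breaks on whitespace only, so the em dash keeps two words joined.
whitespace_tokens = verse.split()
print(len(whitespace_tokens))  # 8

# The lab's pattern also breaks on hyphens, em dashes and en dashes.
regex_tokens = re.split(r"\s|-|\u2014|\u2013", verse)
print(regex_tokens)
# ['All', 'praise', 'is', 'for', 'Allah', 'Lord', 'of', 'all', 'worlds']
```

With whitespace splitting, "Allah" and "Lord" remain one token, so neither word would match a query term on its own.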



**Document Representation: Term-Document Matrix vs. Inverted Index**

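A term-document matrix stores one 0/1 entry for every combination of term and document, so its size grows with the number of terms times the number of documents, and most entries are 0. In this lab that is 22 terms × 7 documents = 154 entries.
The inverted index is more efficient for searching, as it maps each term directly to the IDs of the documents that contain it and stores nothing for the others. In this lab it holds 26 document IDs in total.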
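The space difference can be checked by counting stored entries. The sketch below uses three terms copied from the lab output; the dictionary names `matrix_sample` and `postings_sample` are hypothetical.

```python
# Three rows and posting lists copied from the lab output (7 documents).
matrix_sample = {
    'allah': [1, 1, 0, 0, 0, 0, 0],
    'path': [0, 0, 0, 0, 0, 1, 1],
    'master': [0, 0, 0, 1, 0, 0, 0],
}
postings_sample = {
    'allah': [1, 2],
    'path': [6, 7],
    'master': [4],
}

# The matrix stores one cell per (term, document) pair.
matrix_cells = sum(len(row) for row in matrix_sample.values())

# The index stores one entry per (term, document) pair that actually occurs.
posting_entries = sum(len(ids) for ids in postings_sample.values())

print(matrix_cells)     # 21
print(posting_entries)  # 5
```

The number of posting entries equals the number of 1s in the matrix, so the gap widens as the collection grows and the matrix becomes sparser.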





**Conclusion**

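This lab demonstrated how a Boolean retrieval model works on a collection of seven documents. Preprocessing (splitting on dashes, removing punctuation, lowercasing, and removing stop words) lets query terms match the indexed terms exactly, and the inverted index answers a query by looking up each term's document IDs directly. Future enhancements could add ranking of results and support more complex queries, such as queries with more than one operator.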
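As one possible direction for more complex queries, the sketch below evaluates a chain of terms and operators strictly from left to right. It is an illustration under stated assumptions, not part of the lab: there are no parentheses and no operator precedence, unknown terms match no documents, and the names `evaluate_chain` and `sample_index` are hypothetical.

```python
def evaluate_chain(query_text, index):
    """Evaluate 'term op term op term ...' strictly left to right.

    Assumptions: no parentheses, no operator precedence, and terms that
    are missing from the index match no documents.
    """
    tokens = query_text.lower().split()
    if len(tokens) % 2 == 0:
        raise ValueError("expected: term (operator term)*")

    result = set(index.get(tokens[0], []))
    for position in range(1, len(tokens), 2):
        op = tokens[position]
        other = set(index.get(tokens[position + 1], []))
        if op == 'and':
            result &= other
        elif op == 'or':
            result |= other
        elif op == 'not':
            result -= other
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)


# A small slice of the inverted index built in this lab.
sample_index = {'allah': [1, 2], 'merciful': [1, 3], 'path': [6, 7]}

print(evaluate_chain('allah or merciful not path', sample_index))  # [1, 2, 3]
print(evaluate_chain('allah and merciful or path', sample_index))  # [1, 6, 7]
```

Left-to-right evaluation keeps the sketch short; a fuller query language would need precedence rules or parentheses.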