
**Implementation of Vector Space Model**


In [2]:
documents = {
    1: "Information retrieval is the retrieval process of finding relevant documents according to user's information need.",
    2: "Search engines use indexing for fast retrieval.",
    3: "An index maps words to document IDs.",
    4: "Some information retrieval systems don't work optimally because of no text preprocessing"
}

Importing Libraries

In [3]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
import re
import math
import numpy as np

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [4]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

Define a preprocessing function that takes the corpus as an argument and performs the following steps:

*   Create a vocabulary
*   Lower-case normalization
*   Remove stop words
*   Perform lemmatization

In [5]:
def preprocessing(doc):
  """Build the vocabulary: strip punctuation, lower-case, drop stop words, lemmatize."""
  vocab = set()

  for line in doc.values():
    for word in line.split():
      # Strip punctuation and normalise case before the stop-word check
      word = re.sub(r'[^\w\s]', '', word).lower()
      if not word or word in stop_words:
        continue
      # Lemmatize as a noun (the default), then as a verb
      word = lemmatizer.lemmatize(word)
      word = lemmatizer.lemmatize(word, pos='v')
      vocab.add(word)

  return vocab

vocab = preprocessing(documents)
vocab

{'accord',
 'document',
 'dont',
 'engine',
 'fast',
 'find',
 'id',
 'index',
 'information',
 'map',
 'need',
 'optimally',
 'preprocessing',
 'process',
 'relevant',
 'retrieval',
 'search',
 'system',
 'text',
 'use',
 'user',
 'word',
 'work'}
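
The `'dont'` and `'user'` entries above trace back to the punctuation-stripping step: the regex removes apostrophes, so "don't" becomes "dont" and "user's" becomes "users" before the stop-word check runs. A minimal sketch of that step in isolation, using a tiny hand-written stop-word set (an assumption for illustration, not NLTK's list) and a hypothetical helper `strip_punctuation`:

```python
import re

# Hypothetical mini stop-word list, used only for this illustration
demo_stop_words = {"don't", "is", "the"}

def strip_punctuation(token):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", token)

tokens = ["don't", "user's", "IDs."]
cleaned = [strip_punctuation(t).lower() for t in tokens]
print(cleaned)  # ['dont', 'users', 'ids']

# "dont" no longer matches the stop word "don't", so it survives filtering
kept = [t for t in cleaned if t not in demo_stop_words]
print(kept)     # ['dont', 'users', 'ids']
```

Checking tokens against the stop-word list before stripping punctuation would be one way to drop such contractions, if that is the desired behavior.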

**Creating a Term Frequency Matrix**

Steps performed:

* Preprocess each document and build a dictionary where the keys are the terms and the values are lists holding each term's frequency of occurrence in every document


In [6]:
def cal_tf(vocab,doc):
  tf = {word: [] for word in vocab}

  for line in doc.values():
    # Apply the same preprocessing as the vocabulary: strip punctuation,
    # lower-case, drop stop words, then lemmatize as noun and as verb
    words = [re.sub(r'[^\w\s]', '', w).lower() for w in line.split()]
    words = [w for w in words if w and w not in stop_words]
    words = [lemmatizer.lemmatize(w) for w in words]
    words = [lemmatizer.lemmatize(w,pos='v') for w in words]
    for word in vocab:
      tf[word].append(words.count(word))
  return tf

tf_matrix = cal_tf(vocab,documents)
tf_matrix

{'find': [1, 0, 0, 0],
 'dont': [0, 0, 0, 1],
 'process': [1, 0, 0, 0],
 'text': [0, 0, 0, 1],
 'preprocessing': [0, 0, 0, 1],
 'map': [0, 0, 1, 0],
 'relevant': [1, 0, 0, 0],
 'optimally': [0, 0, 0, 1],
 'search': [0, 1, 0, 0],
 'engine': [0, 1, 0, 0],
 'system': [0, 0, 0, 1],
 'id': [0, 0, 1, 0],
 'use': [0, 1, 0, 0],
 'index': [0, 1, 1, 0],
 'fast': [0, 1, 0, 0],
 'accord': [1, 0, 0, 0],
 'retrieval': [2, 1, 0, 1],
 'need': [1, 0, 0, 0],
 'work': [0, 0, 0, 1],
 'word': [0, 0, 1, 0],
 'user': [1, 0, 0, 0],
 'document': [1, 0, 1, 0],
 'information': [2, 0, 0, 1]}
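
`cal_tf` counts each vocabulary term with `list.count`, which rescans the token list once per term. As a hedged alternative sketch, `collections.Counter` can count every token of a document in a single pass. The token lists below are written out by hand and are assumed to be what the preprocessing steps produce for the four documents; `tf_from_tokens` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Assumed result of preprocessing each document (hand-written for illustration)
doc_tokens = [
    ["information", "retrieval", "retrieval", "process", "find", "relevant",
     "document", "accord", "user", "information", "need"],
    ["search", "engine", "use", "index", "fast", "retrieval"],
    ["index", "map", "word", "document", "id"],
    ["information", "retrieval", "system", "dont", "work", "optimally",
     "text", "preprocessing"],
]

def tf_from_tokens(vocab, tokenized_docs):
    """Return {term: [count in doc 1, count in doc 2, ...]}."""
    counters = [Counter(tokens) for tokens in tokenized_docs]  # one pass per doc
    # A Counter returns 0 for missing keys, so absent terms need no special case
    return {term: [c[term] for c in counters] for term in vocab}

demo_vocab = {"retrieval", "information", "index", "document"}
demo_tf = tf_from_tokens(demo_vocab, doc_tokens)
print(demo_tf["retrieval"])    # [2, 1, 0, 1]
print(demo_tf["information"])  # [2, 0, 0, 1]
```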

**Creating a weighted term frequency matrix**

* Terms as keys and 1 + log10(tf) as values when tf > 0, otherwise 0

In [7]:
def cal_tf_log_matrix(tf_matrix):
  tf_log_matrix = {}

  for key,value in tf_matrix.items():
    tf_log_matrix[key] = []
    for i in value:
      if i > 0:
        tf_log_matrix[key].append(1 + math.log10(i))
      else:
        tf_log_matrix[key].append(0)
  return tf_log_matrix
tf_log_matrix = cal_tf_log_matrix(tf_matrix)
tf_log_matrix

{'find': [1.0, 0, 0, 0],
 'dont': [0, 0, 0, 1.0],
 'process': [1.0, 0, 0, 0],
 'text': [0, 0, 0, 1.0],
 'preprocessing': [0, 0, 0, 1.0],
 'map': [0, 0, 1.0, 0],
 'relevant': [1.0, 0, 0, 0],
 'optimally': [0, 0, 0, 1.0],
 'search': [0, 1.0, 0, 0],
 'engine': [0, 1.0, 0, 0],
 'system': [0, 0, 0, 1.0],
 'id': [0, 0, 1.0, 0],
 'use': [0, 1.0, 0, 0],
 'index': [0, 1.0, 1.0, 0],
 'fast': [0, 1.0, 0, 0],
 'accord': [1.0, 0, 0, 0],
 'retrieval': [1.3010299956639813, 1.0, 0, 1.0],
 'need': [1.0, 0, 0, 0],
 'work': [0, 0, 0, 1.0],
 'word': [0, 0, 1.0, 0],
 'user': [1.0, 0, 0, 0],
 'document': [1.0, 0, 1.0, 0],
 'information': [1.3010299956639813, 0, 0, 1.0]}
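
As a worked example of the weighting used above (base-10 logarithm, as in `math.log10`), the weight of term t with raw count tf in document d can be written as:

$$
w_{t,d} =
\begin{cases}
1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\
0 & \text{otherwise}
\end{cases}
$$

For 'retrieval' in document 1 the count is 2, so the weight is $1 + \log_{10} 2 \approx 1.3010$, while a count of 1 gives exactly $1 + \log_{10} 1 = 1$. The logarithm dampens repeated occurrences: doubling the count adds only about 0.30 to the weight.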

**Creating idf matrix**

* Define a function for the IDF matrix that takes the vocabulary, the tf matrix, and the documents as arguments

* Create a dictionary where the keys are terms and the values are log10(N/df), where N is the total number of documents (4 in this case) and df is the document frequency, i.e. the number of documents that contain the term

In [28]:
def cal_idf(vocab,tf_matrix,doc):
  idf_matrix = {}
  n_docs = len(doc)

  for word in vocab:
    # Document frequency: number of documents in which the term occurs
    df = sum(1 for count in tf_matrix.get(word, []) if count > 0)
    # Terms that occur in no document keep an idf of 0 (avoids division by zero)
    idf_matrix[word] = math.log10(n_docs / df) if df > 0 else 0
  return idf_matrix

idf_matrix = cal_idf(vocab,tf_matrix,documents)
idf_matrix

{'find': 0.6020599913279624,
 'dont': 0.6020599913279624,
 'process': 0.6020599913279624,
 'text': 0.6020599913279624,
 'preprocessing': 0.6020599913279624,
 'map': 0.6020599913279624,
 'relevant': 0.6020599913279624,
 'optimally': 0.6020599913279624,
 'search': 0.6020599913279624,
 'engine': 0.6020599913279624,
 'system': 0.6020599913279624,
 'id': 0.6020599913279624,
 'use': 0.6020599913279624,
 'index': 0.3010299956639812,
 'fast': 0.6020599913279624,
 'accord': 0.6020599913279624,
 'retrieval': 0.12493873660829992,
 'need': 0.6020599913279624,
 'work': 0.6020599913279624,
 'word': 0.6020599913279624,
 'user': 0.6020599913279624,
 'document': 0.3010299956639812,
 'information': 0.3010299956639812}
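
The same computation can be checked on a few rows in isolation. This is a sketch rather than a replacement for `cal_idf`: the document frequency is the number of non-zero entries in a term's tf row, and idf is log10(N/df). `idf_from_tf` and the three-row `demo_tf` are hypothetical names; the rows are copied from the tf matrix above.

```python
import math

# Three rows copied from the term-frequency matrix above
demo_tf = {
    "retrieval": [2, 1, 0, 1],
    "index": [0, 1, 1, 0],
    "find": [1, 0, 0, 0],
}

def idf_from_tf(tf_rows, n_docs):
    """Return {term: log10(N / df)}, with 0.0 for terms that occur nowhere."""
    idf = {}
    for term, counts in tf_rows.items():
        df = sum(1 for c in counts if c > 0)  # documents containing the term
        idf[term] = math.log10(n_docs / df) if df else 0.0
    return idf

demo_idf = idf_from_tf(demo_tf, n_docs=4)
for term, value in demo_idf.items():
    print(f"{term}: {value:.4f}")
# retrieval: 0.1249
# index: 0.3010
# find: 0.6021
```

Terms that appear in fewer documents receive a higher idf, which is why 'find' (one document) outweighs 'retrieval' (three documents).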

**Creating a tf-idf matrix**

* Create a dictionary for the tf-idf matrix where the keys are terms and the values are tf-idf weights (the log-weighted tf multiplied by the idf)

In [30]:
def cal_tf_idf_matrix(tf_log_matrix,idf):
  tf_idf_matrix = {}

  for key,value in tf_log_matrix.items():
    tf_idf_matrix[key] = []
    for i in value:
      tf_idf_matrix[key].append(i*idf[key])
  return tf_idf_matrix

tf_idf_matrix = cal_tf_idf_matrix(tf_log_matrix,idf_matrix)
tf_idf_matrix

{'find': [0.6020599913279624, 0.0, 0.0, 0.0],
 'dont': [0.0, 0.0, 0.0, 0.6020599913279624],
 'process': [0.6020599913279624, 0.0, 0.0, 0.0],
 'text': [0.0, 0.0, 0.0, 0.6020599913279624],
 'preprocessing': [0.0, 0.0, 0.0, 0.6020599913279624],
 'map': [0.0, 0.0, 0.6020599913279624, 0.0],
 'relevant': [0.6020599913279624, 0.0, 0.0, 0.0],
 'optimally': [0.0, 0.0, 0.0, 0.6020599913279624],
 'search': [0.0, 0.6020599913279624, 0.0, 0.0],
 'engine': [0.0, 0.6020599913279624, 0.0, 0.0],
 'system': [0.0, 0.0, 0.0, 0.6020599913279624],
 'id': [0.0, 0.0, 0.6020599913279624, 0.0],
 'use': [0.0, 0.6020599913279624, 0.0, 0.0],
 'index': [0.0, 0.3010299956639812, 0.3010299956639812, 0.0],
 'fast': [0.0, 0.6020599913279624, 0.0, 0.0],
 'accord': [0.6020599913279624, 0.0, 0.0, 0.0],
 'retrieval': [0.16254904394775974,
  0.12493873660829992,
  0.0,
  0.12493873660829992],
 'need': [0.6020599913279624, 0.0, 0.0, 0.0],
 'work': [0.0, 0.0, 0.0, 0.6020599913279624],
 'word': [0.0, 0.0, 0.6020599913279624, 0.0],
 ...
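
As a worked check of one entry, the tf-idf weight is the log-weighted tf multiplied by the idf. For 'retrieval' in document 1 (tf = 2, df = 3, N = 4):

$$
\text{tf-idf}_{t,d}
= \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \cdot \log_{10}\frac{N}{\mathrm{df}_t}
= \left(1 + \log_{10} 2\right) \cdot \log_{10}\frac{4}{3}
\approx 1.3010 \times 0.1249
\approx 0.1625
$$

This agrees with the first value in the 'retrieval' row. Because the term occurs in three of the four documents, its idf is low and it contributes little even though it is one of the most frequent terms in document 1.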

**Query Processing**

* Performing preprocessing
* Calculating tf of query
* Calculating weighted tf of query
* Calculating tf-idf of query


In [46]:
query = "Information Retrieval Processes"

query_vocab = vocab

query_tf = cal_tf(query_vocab, {1: query})

query_tf_log = cal_tf_log_matrix(query_tf)

query_tf_idf = {}
for term in query_vocab:
    query_tf_idf[term] = [query_tf_log[term][0] * idf_matrix.get(term, 0)]

query_tf_idf

{'find': [0.0],
 'dont': [0.0],
 'process': [0.6020599913279624],
 'text': [0.0],
 'preprocessing': [0.0],
 'map': [0.0],
 'relevant': [0.0],
 'optimally': [0.0],
 'search': [0.0],
 'engine': [0.0],
 'system': [0.0],
 'id': [0.0],
 'use': [0.0],
 'index': [0.0],
 'fast': [0.0],
 'accord': [0.0],
 'retrieval': [0.12493873660829992],
 'need': [0.0],
 'work': [0.0],
 'word': [0.0],
 'user': [0.0],
 'document': [0.0],
 'information': [0.3010299956639812]}

In [43]:
def cosine_similarity(vector1, vector2):
  """Calculates the cosine similarity between two vectors."""
  dot_product = np.dot(vector1, vector2)
  magnitude1 = np.linalg.norm(vector1)
  magnitude2 = np.linalg.norm(vector2)
  # A zero vector has no direction; return 0 instead of dividing by zero
  if magnitude1 == 0 or magnitude2 == 0:
    return 0.0
  similarity = dot_product / (magnitude1 * magnitude2)
  return similarity

In [44]:
doc_vectors = []
for doc_id in range(1, len(documents) + 1):
    doc_vector = []
    for term in query_vocab:  # Ensure order matches query_tf_idf
        tf_idf_value = tf_idf_matrix.get(term, [0] * len(documents))[doc_id - 1]
        doc_vector.append(tf_idf_value)
    doc_vectors.append(doc_vector)

query_vector = [query_tf_idf.get(term, [0])[0] for term in query_vocab] # Ensure order matches doc_vectors

In [45]:
# Calculate cosine similarity for each document
similarities = {}
for i, doc_vector in enumerate(doc_vectors):
  similarity = cosine_similarity(query_vector, doc_vector)
  similarities[i+1] = similarity  # Store similarity with document ID

# Rank documents based on similarity (descending order)
ranked_documents = sorted(similarities.items(), key=lambda item: item[1], reverse=True)

# Display the ranked documents
print("Ranked Documents:")
for doc_id, similarity in ranked_documents:
  print(f"Document {doc_id}: Similarity = {similarity}")

Ranked Documents:
Document 1: Similarity = 0.46767925108253683
Document 4: Similarity = 0.10273571045099146
Document 2: Similarity = 0.018277676453873085
Document 3: Similarity = 0.0
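
The top score can be reproduced by hand. The sketch below is a plain-Python cosine similarity (no NumPy) applied to the non-zero tf-idf entries of document 1 and of the query; dimensions that are zero in both vectors are left out because they change neither the dot product nor the norms. `cosine` is a hypothetical helper, and the weights are recomputed from the tf and df values shown earlier rather than read from the notebook state.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length sequences; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

idf_1 = math.log10(4 / 1)   # term occurs in one document
idf_2 = math.log10(4 / 2)   # term occurs in two documents
idf_3 = math.log10(4 / 3)   # term occurs in three documents
w2 = 1 + math.log10(2)      # log-weighted tf for a raw count of 2

# Term order: process, retrieval, information, find, relevant, accord,
#             need, user, document
doc1 = [idf_1, w2 * idf_3, w2 * idf_2, idf_1, idf_1, idf_1, idf_1, idf_1, idf_2]
query = [idf_1, idf_3, idf_2, 0, 0, 0, 0, 0, 0]

print(round(cosine(query, doc1), 4))  # 0.4677
```

Document 1 ranks first because it is the only document that contains all three query terms, including the high-idf term 'process'.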


**Conclusion**

In this notebook we implemented ranked retrieval with the Vector Space Model:

* Created a vocabulary for the corpus
* Applied preprocessing (stop-word removal, normalization, lemmatization)
* Created the TF matrix
* Created the log-weighted TF matrix
* Created the IDF matrix
* Created the TF-IDF matrix
* Created the TF-IDF vector for the query
* Calculated cosine similarity to rank the documents

Key Findings:

* If a query term is not present in the vocabulary, it is an out-of-vocabulary term: it is given an idf of zero, so it does not contribute to the cosine similarity.
* By default, the WordNet lemmatizer lemmatizes words as nouns.
* To lemmatize verbs, pass the 'pos' parameter to the function with the value 'v'.
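
The first finding can be illustrated with a small sketch: looking up idf with `dict.get(term, 0.0)` gives an out-of-vocabulary query term a weight of zero, so it adds nothing to the dot product with any document vector. The idf values are copied from the notebook output; the query tokens, including the out-of-vocabulary term 'ranking', and the helper `query_weights` are hypothetical.

```python
import math

# idf values for three vocabulary terms, copied from the output above
demo_idf = {
    "information": 0.3010299956639812,
    "retrieval": 0.12493873660829992,
    "process": 0.6020599913279624,
}

def query_weights(tokens, idf):
    """Return {term: (1 + log10(tf)) * idf}, using idf 0.0 for unknown terms."""
    weights = {}
    for term in set(tokens):
        tf = tokens.count(term)
        weights[term] = (1 + math.log10(tf)) * idf.get(term, 0.0)
    return weights

# "ranking" is not in the vocabulary, so its weight is zero
weights = query_weights(["information", "retrieval", "ranking"], demo_idf)
print(weights["ranking"])    # 0.0
print(weights["retrieval"])  # 0.12493873660829992
```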