This IRS (Information Retrieval System) project is based on the vector space model.
It uses TF-IDF vectorization, cosine similarity, and ranked retrieval.

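**TF-IDF Vectorization:** The code uses `TfidfVectorizer` from scikit-learn to transform the documents into TF-IDF (Term Frequency-Inverse Document Frequency) vectors, with English stop words removed (`stop_words='english'`). TF-IDF weights each term in each document relative to the entire corpus: the weight grows with how often the term occurs in the document and shrinks with the number of documents that contain it.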
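As a minimal sketch of this weighting, the snippet below fits `TfidfVectorizer` on a three-sentence toy corpus. The corpus and the variable names are hypothetical and chosen only for illustration; they are not the project's documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
docs = [
    "data science uses statistics",
    "machine learning is part of data science",
    "the cat sat on the mat",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index in the matrix.
first_doc = tfidf_matrix[0].toarray().ravel()
weight_data = first_doc[vectorizer.vocabulary_["data"]]
weight_statistics = first_doc[vectorizer.vocabulary_["statistics"]]

# "statistics" occurs in one document and "data" in two, so within the
# first document the rarer term receives the larger weight.
print(weight_statistics > weight_data)

# With the default norm="l2", every non-empty row has unit length.
print(np.linalg.norm(first_doc))

# Stop words such as "the" never enter the vocabulary.
print("the" in vectorizer.vocabulary_)
```

Because rows are L2-normalized by default, the absolute weights depend on how many terms a document contains; only their relative sizes within a document are easy to interpret.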

**Cosine Similarity:** The code computes the cosine similarity between the query vector and each document vector. Cosine similarity is the cosine of the angle between two non-zero vectors. For non-negative TF-IDF vectors it ranges from 0 (no terms in common) to 1 (same direction), and the vector space model uses it to rank documents by their relevance to a query.
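For two vectors a and b, the measure is cos(a, b) = (a · b) / (‖a‖ ‖b‖). The sketch below computes it by hand with NumPy and compares the result with scikit-learn's `cosine_similarity`. The helper `cosine` and the three small vectors are assumptions made up for this example, not part of the project code.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine(a, b):
    """Cosine of the angle between two non-zero 1-D vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors over a three-term vocabulary.
query = np.array([1.0, 1.0, 0.0])
doc_a = np.array([2.0, 2.0, 0.0])  # same direction as the query, different length
doc_b = np.array([0.0, 0.0, 3.0])  # shares no terms with the query

manual_a = cosine(query, doc_a)
manual_b = cosine(query, doc_b)

# scikit-learn expects 2-D inputs: one row per vector.
sk_scores = cosine_similarity(query.reshape(1, -1), np.vstack([doc_a, doc_b])).flatten()

print(manual_a, manual_b)
print(sk_scores)
```

`doc_a` scores (up to floating-point rounding) 1 even though it is twice as long as the query, which is why cosine similarity is insensitive to document length; `doc_b` scores 0 because it has no term in common with the query.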

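**Ranking and Retrieval:** The code returns the k documents with the highest cosine similarity scores (`top_k`, 5 by default) as the search results, in descending order of score. The ranking reflects how closely each document matches the query in the vector space. Documents that share no terms with the query still appear, with a score of 0.00, when fewer than k documents match.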
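The top-k selection can be sketched with `numpy.argsort`, which sorts in ascending order, so the result is reversed before truncating. The function `top_k_indices` and its `drop_zero` option are hypothetical additions for illustration; the project code itself does not filter out zero scores.

```python
import numpy as np


def top_k_indices(scores, k=5, drop_zero=False):
    """Return indices of the k highest scores, best first (hypothetical helper).

    drop_zero is an assumed extra option: when True, documents whose
    score is 0 (no shared terms with the query) are left out.
    """
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1][:k]  # ascending sort, reversed, truncated
    if drop_zero:
        order = [i for i in order if scores[i] > 0]
    return [int(i) for i in order]


# Made-up similarity scores for five documents.
scores = [0.0, 0.42, 0.0, 0.50, 0.10]

top_three = top_k_indices(scores, k=3)
non_zero_only = top_k_indices(scores, k=5, drop_zero=True)

print(top_three)      # [3, 1, 4]
print(non_zero_only)  # [3, 1, 4]
```

Dropping zero scores is a design choice: it keeps unrelated documents out of the results, at the cost of sometimes returning fewer than k entries.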

In [4]:
from google.colab import drive
drive.mount('/content/drive')



In [6]:
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [12]:


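class InformationRetrievalSystem:
    """TF-IDF / cosine-similarity search over the text files in a directory."""

    def __init__(self, documents_dir):
        self.documents_dir = documents_dir
        # Keep regular files only; open() would fail on a subdirectory.
        candidate_paths = [os.path.join(documents_dir, filename) for filename in os.listdir(documents_dir)]
        self.document_paths = [path for path in candidate_paths if os.path.isfile(path)]
        self.document_texts = self.load_documents()

        # Fit the TF-IDF vocabulary on the corpus; English stop words are dropped.
        self.vectorizer = TfidfVectorizer(stop_words='english')
        self.tfidf_matrix = self.vectorizer.fit_transform(self.document_texts)

    def load_documents(self):
        """Read every document as UTF-8 text, in the same order as document_paths."""
        document_texts = []
        for document_path in self.document_paths:
            with open(document_path, 'r', encoding='utf-8') as file:
                document_texts.append(file.read())
        return document_texts

    def search(self, query, top_k=5):
        """Return up to top_k (path, similarity) pairs, most similar first."""
        # Project the query onto the vocabulary fitted on the corpus.
        query_vector = self.vectorizer.transform([query])
        cosine_similarities = cosine_similarity(query_vector, self.tfidf_matrix).flatten()
        # argsort is ascending: take the last top_k indices, then reverse them.
        top_indices = cosine_similarities.argsort()[-top_k:][::-1]
        top_documents = [(self.document_paths[idx], cosine_similarities[idx]) for idx in top_indices]
        return top_documents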

if __name__ == "__main__":
    documents_dir = '/content/drive/MyDrive/Colab Notebooks/IRS'
    ir_system = InformationRetrievalSystem(documents_dir)

    # Simple interactive loop: read a query, print the ranked results.
    while True:
        query = input("Enter your query (type 'exit' to quit): ")
        if query.strip().lower() == 'exit':
            break
        top_documents = ir_system.search(query)
        # Report the actual number of results instead of a hard-coded 5.
        print(f"Top {len(top_documents)} documents related to the query:")
        for doc_path, similarity in top_documents:
            print(f"{doc_path} (Similarity: {similarity:.2f})")


Enter your query (type 'exit' to quit): Data Science
Top 5 documents related to the query:
/content/drive/MyDrive/Colab Notebooks/IRS/doc5.txt (Similarity: 0.50)
/content/drive/MyDrive/Colab Notebooks/IRS/doc4.txt (Similarity: 0.42)
/content/drive/MyDrive/Colab Notebooks/IRS/doc3.txt (Similarity: 0.00)
/content/drive/MyDrive/Colab Notebooks/IRS/doc1.txt (Similarity: 0.00)
/content/drive/MyDrive/Colab Notebooks/IRS/doc2.txt (Similarity: 0.00)
Enter your query (type 'exit' to quit): data
Top 5 documents related to the query:
/content/drive/MyDrive/Colab Notebooks/IRS/doc5.txt (Similarity: 0.47)
/content/drive/MyDrive/Colab Notebooks/IRS/doc4.txt (Similarity: 0.39)
/content/drive/MyDrive/Colab Notebooks/IRS/doc3.txt (Similarity: 0.00)
/content/drive/MyDrive/Colab Notebooks/IRS/doc1.txt (Similarity: 0.00)
/content/drive/MyDrive/Colab Notebooks/IRS/doc2.txt (Similarity: 0.00)
Enter your query (type 'exit' to quit): is
Top 5 documents related to the query:
/content/drive/MyDrive/Colab Notebo