### Strategy:
- **Vectorization:** Convert the documents and the query into vectors using a TF-IDF vectorizer.
- **Similarity Measurement:** Measure the similarity between the query vector and each document vector.
- **Retrieval:** Retrieve the documents with the highest similarity scores.
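
To make the three steps concrete before bringing in scikit-learn, here is a minimal pure-Python sketch on a toy corpus. It is an illustration under stated assumptions, not what the notebook runs: the whitespace tokenizer, the helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`, `retrieve`) and the toy documents are all hypothetical. The IDF formula is the smoothed variant `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which matches the default settings of `TfidfVectorizer`, although scikit-learn tokenizes differently.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Term frequency times IDF, then L2-normalised; unseen terms are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def retrieve(query, docs, top_n=2):
    # Step 1: vectorize documents and query in the same TF-IDF space.
    idf = fit_idf(docs)
    doc_vecs = [tfidf_vector(d, idf) for d in docs]
    q = tfidf_vector(query, idf)
    # Step 2: score every document against the query.
    scores = [cosine(q, v) for v in doc_vecs]
    # Step 3: return the top_n (index, score) pairs, best first.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_n]]


# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "a history of the american revolution",
    "a cookbook of italian pasta recipes",
    "american history for beginners",
]
results = retrieve("american history book", toy_docs)
for idx, score in results:
    print(f"Document {idx}: {toy_docs[idx]} (similarity: {score:.2f})")
```

The query word "book" never appears in the toy corpus, so it contributes nothing to the query vector. This is the main limitation of lexical retrieval: only exact vocabulary overlap counts.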


In [71]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import transformers 
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



In [94]:
# Build the corpus from the first 100 book-review contexts in the "SubjQA" dataset
data=load_dataset('subjqa','books',trust_remote_code=True)
corpus=data['train']['context'][:100]

In [106]:

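# Learn the vocabulary and IDF weights, then encode the corpus as a sparse TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Densify only for inspection; toarray() returns a plain ndarray rather than np.matrix
tfidf_dense = tfidf_matrix.toarray()
tfidf_df = pd.DataFrame(tfidf_dense, columns=vectorizer.get_feature_names_out())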
tfidf_df

Unnamed: 0,000,05,100,11,12,153this,16,173rd,1890,18th,...,you,young,younger,your,yours,yourself,zamperini,zany,zen,zombies
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.163891,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.163891,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.046832,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.035068,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000


In [105]:
# Sample query
query = "Which book should I read to start studying history of America?"

# Project the query into the same TF-IDF space as the corpus
query_vector = vectorizer.transform([query])
cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

# Get the indices of the top N most similar documents, highest score first
top_n = 2
related_docs_indices = np.argsort(cosine_similarities)[::-1][:top_n]

print(f"Query: {query}")
print("\nTop related documents:")
for idx in related_docs_indices:
    print(f"Document {idx + 1}: {corpus[idx]} \n(Similarity: {cosine_similarities[idx]:.2f})\n")

Query: Which book should I read to start studying history of America?

Top related documents:
Document 64: The double standard of Political Correctness may have some validity insofar as it is a response to past injustices, mostly of the racial variety, but for Michael Moore to title his book "Stupid White Men" leaves him open, in my view, to as much criticism as I would rightly receive had I written a book called "Dumbellionite Negroes". The easy put-down of Moore is to say something like, "It takes one to know one," but in all fairness, Moore is not stupid. Furthermore, conservatives such as myself love to tout the "marketplace of ideas," because it is in this free discourse of Democracy where we succeed, whether it be talk radio, best selling books, Fox News, the rising tide of country music vs. Hollywood's recent failures, or the simple fact that Republicans now dominate the White House, the Congress, the Senate, governorships and state legislatures. We must begrudgingly accept the 

In [107]:
# Other retrieval techniques could also be used here, e.g. BM25 or DPR.
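
As a rough illustration of the BM25 alternative mentioned above, the following sketch scores documents with the Okapi BM25 formula in plain Python. Everything here is an assumption for illustration: the function name `bm25_scores`, the whitespace tokenizer, the toy corpus, and the parameter values `k1=1.5` and `b=0.75`, which are common defaults rather than tuned settings. The IDF term uses the non-negative variant `ln((N - df + 0.5) / (df + 0.5) + 1)`, and the corpus is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalisation: longer-than-average documents are penalised.
        length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, used only for illustration.
bm25_docs = [
    "a history of the american revolution",
    "a cookbook of italian pasta recipes",
    "american history for beginners",
]
bm25_result = bm25_scores("american history", bm25_docs)
best = max(range(len(bm25_docs)), key=lambda i: bm25_result[i])
print(f"Best match: {bm25_docs[best]}")
```

Unlike TF-IDF with cosine similarity, BM25 saturates term frequency through `k1` and controls length normalisation through `b`. In this toy example the shorter of the two matching documents therefore scores higher.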