# Implementation of a Simple Text-Based Search Engine



Words must first be converted to a numerical form before we can perform operations on them. Two common ways to do this are:
- Bag of Words
    - Based on the number of occurrences of each word in a document
    - Word order does not matter in this representation
- TF-IDF (Term Frequency - Inverse Document Frequency)
    - Based on the frequency of a term in a document, weighted down by how many documents in the entire set contain that term
    - Again, word order does not matter

In [1]:
import requests
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [3]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [4]:
document_df = pd.DataFrame(documents, columns = ['course', 'section', 'question', 'text'])
document_df.head()

| | course | section | question | text |
|---|---|---|---|---|
| 0 | data-engineering-zoomcamp | General course-related questions | Course - When will the course start? | The purpose of this document is to capture fre... |
| 1 | data-engineering-zoomcamp | General course-related questions | Course - What are the prerequisites for this c... | GitHub - DataTalksClub data-engineering-zoomca... |
| 2 | data-engineering-zoomcamp | General course-related questions | Course - Can I still join the course after the... | Yes, even if you don't register, you're still ... |
| 3 | data-engineering-zoomcamp | General course-related questions | Course - I have registered for the Data Engine... | You don't need it. You're accepted. You can al... |
| 4 | data-engineering-zoomcamp | General course-related questions | Course - What can I do before the course starts? | You can start by installing and setting up all... |


## Finding similar documents given a query

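Transforming the query into its numerical (TF-IDF) form and computing its cosine similarity with every row of the document matrix yields one similarity score per document, which lets us fetch the documents most similar to the query. Because `TfidfVectorizer` L2-normalizes its output by default, this cosine similarity is the same as the plain dot product between the query vector and each document vector.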
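
To make the scoring step concrete, here is a minimal sketch on three invented FAQ-style sentences (assumptions for the example, not rows from the dataset). It computes the scores both as a sparse dot product and with `cosine_similarity`, and checks that the two agree under `TfidfVectorizer`'s default L2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-corpus, used only to illustrate the scoring step.
toy_docs = [
    "when does the course start",
    "how do I install docker",
    "can I join the course late",
]
toy_query = "is it too late to join the course"

vect = TfidfVectorizer()                    # default norm="l2": rows have unit length
doc_matrix = vect.fit_transform(toy_docs)   # shape (n_docs, n_terms)
query_vec = vect.transform([toy_query])     # shape (1, n_terms), same vocabulary

# Dot product of every document row with the query vector.
dot_scores = (doc_matrix @ query_vec.T).toarray().ravel()

# Cosine similarity, as used in the rest of this notebook.
cos_scores = cosine_similarity(doc_matrix, query_vec).ravel()

print(np.allclose(dot_scores, cos_scores))
best = int(np.argmax(cos_scores))
print(toy_docs[best])
```

Query words that are absent from the fitted vocabulary are simply ignored by `transform`, so a document that shares no known term with the query gets a score of 0.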

In [21]:
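# Index every text field; 'course' is kept aside for exact-match filtering
fields = list(document_df.columns)
fields.remove('course')

embeddings = {}   # field name -> TF-IDF matrix (documents x terms)
vectorizers = {}  # field name -> fitted vectorizer, reused to transform queries

for field in fields:
    # Drop English stop words and terms that occur in fewer than 5 documents
    vect = TfidfVectorizer(stop_words="english", min_df=5)
    num_matrix = vect.fit_transform(document_df[field])
    embeddings[field] = num_matrix
    vectorizers[field] = vect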

In [22]:
embeddings

{'section': <Compressed Sparse Row sparse matrix of dtype 'float64'
 	with 3090 stored elements and shape (948, 66)>,
 'question': <Compressed Sparse Row sparse matrix of dtype 'float64'
 	with 3431 stored elements and shape (948, 291)>,
 'text': <Compressed Sparse Row sparse matrix of dtype 'float64'
 	with 23808 stored elements and shape (948, 1333)>}

In [23]:
vectorizers

{'section': TfidfVectorizer(min_df=5, stop_words='english'),
 'question': TfidfVectorizer(min_df=5, stop_words='english'),
 'text': TfidfVectorizer(min_df=5, stop_words='english')}

In [30]:
query = "I just discovered the course, is it too late to join?"
filters = {'course': 'data-engineering-zoomcamp'}
boosts = {'question': 3}

In [31]:
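# Accumulate one combined score per document across all indexed fields
score = np.zeros(len(document_df))

for field in fields:
    # Project the query into this field's TF-IDF vocabulary
    num_query = vectorizers[field].transform([query])
    similarity_score = cosine_similarity(embeddings[field], num_query).flatten()
    # Fields missing from `boosts` keep a neutral weight of 1.0
    score_booster = boosts.get(field, 1.0)
    score += score_booster * similarity_score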

### Adding filters to the query (for example, by course)

In [32]:
for field, value in filters.items():
    # 1 where the document matches the filter, 0 otherwise
    mask = (document_df[field] == value).astype(int).values
    score = score * mask

In [33]:
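# np.argsort sorts in ascending order, so the last five indices belong to the
# five highest scores, with the best match listed last
sorted_score_indices = np.argsort(score)
top_5_indices = sorted_score_indices[-5:]
document_df.iloc[top_5_indices]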

| | course | section | question | text |
|---|---|---|---|---|
| 4 | data-engineering-zoomcamp | General course-related questions | Course - What can I do before the course starts? | You can start by installing and setting up all... |
| 34 | data-engineering-zoomcamp | General course-related questions | How can we contribute to the course? | Star the repo! Share it with friends if you fi... |
| 1 | data-engineering-zoomcamp | General course-related questions | Course - What are the prerequisites for this c... | GitHub - DataTalksClub data-engineering-zoomca... |
| 0 | data-engineering-zoomcamp | General course-related questions | Course - When will the course start? | The purpose of this document is to capture fre... |
| 7 | data-engineering-zoomcamp | General course-related questions | Course - Can I follow the course after it fini... | Yes, we will keep all the materials after the ... |
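
The steps above (vectorize each field, score, boost, filter, rank) could be bundled into two small helpers. The following is only a sketch under stated assumptions: `build_index` and `search` are hypothetical names, the toy DataFrame is invented, and, unlike the cells above, results are returned best match first and documents with a final score of 0 are dropped.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_index(df, text_fields, **vectorizer_kwargs):
    """Fit one TF-IDF vectorizer per text field (hypothetical helper)."""
    vectorizers, matrices = {}, {}
    for field in text_fields:
        vect = TfidfVectorizer(**vectorizer_kwargs)
        matrices[field] = vect.fit_transform(df[field])
        vectorizers[field] = vect
    return vectorizers, matrices


def search(df, vectorizers, matrices, query, filters=None, boosts=None, top_k=5):
    """Return up to top_k matching rows of df, best match first."""
    filters = filters or {}
    boosts = boosts or {}
    score = np.zeros(len(df))
    # Sum the boosted cosine similarities over every indexed field.
    for field, vect in vectorizers.items():
        query_vec = vect.transform([query])
        sim = cosine_similarity(matrices[field], query_vec).ravel()
        score += boosts.get(field, 1.0) * sim
    # Zero out documents that fail any exact-match filter.
    for field, value in filters.items():
        score *= (df[field] == value).to_numpy()
    # Sort descending, then drop rows whose score is 0 (a design choice here).
    top = np.argsort(-score)[:top_k]
    top = top[score[top] > 0]
    return df.iloc[top]


# Invented toy data; min_df is left at its default because the corpus is tiny.
toy_df = pd.DataFrame({
    "course": ["de", "de", "ml"],
    "question": [
        "Can I join the course late?",
        "How do I install Docker?",
        "Can I join the course late?",
    ],
    "text": [
        "Yes, you can still join after the start date.",
        "Follow the Docker installation guide.",
        "Yes, late registration is allowed.",
    ],
})

vectorizers, matrices = build_index(toy_df, ["question", "text"], stop_words="english")
results = search(
    toy_df, vectorizers, matrices,
    "is it too late to join the course?",
    filters={"course": "de"},
    boosts={"question": 3},
)
print(results)
```

Dropping zero-score rows keeps filtered-out documents from appearing when fewer than `top_k` documents match, which the plain `argsort` slice above does not guard against.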
