# Search Engine Workshop (prep for LLM-Zoomcamp)

Reference notebook:  https://github.com/alexeygrigorev/build-your-own-search-engine \
Video: https://www.youtube.com/watch?v=nMrGK5QgPVE

In [3]:
import pandas as pd

In [4]:
import requests 

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [5]:
documents[2]

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp'}

In [6]:
pd.DataFrame(documents)

Unnamed: 0,text,section,question,course
0,The purpose of this document is to capture fre...,General course-related questions,Course - When will the course start?,data-engineering-zoomcamp
1,GitHub - DataTalksClub data-engineering-zoomca...,General course-related questions,Course - What are the prerequisites for this c...,data-engineering-zoomcamp
2,"Yes, even if you don't register, you're still ...",General course-related questions,Course - Can I still join the course after the...,data-engineering-zoomcamp
3,You don't need it. You're accepted. You can al...,General course-related questions,Course - I have registered for the Data Engine...,data-engineering-zoomcamp
4,You can start by installing and setting up all...,General course-related questions,Course - What can I do before the course starts?,data-engineering-zoomcamp
...,...,...,...,...
943,Problem description\nThis is the step in the c...,Module 6: Best practices,Github actions: Permission denied error when e...,mlops-zoomcamp
944,Problem description\nWhen a docker-compose fil...,Module 6: Best practices,Managing Multiple Docker Containers with docke...,mlops-zoomcamp
945,Problem description\nIf you are having problem...,Module 6: Best practices,AWS regions need to match docker-compose,mlops-zoomcamp
946,Problem description\nPre-commit command was fa...,Module 6: Best practices,Isort Pre-commit,mlops-zoomcamp


In [7]:
df = pd.DataFrame(documents, columns=['course', 'section', 'question', 'text'])

In [8]:
df.head()

Unnamed: 0,course,section,question,text
0,data-engineering-zoomcamp,General course-related questions,Course - When will the course start?,The purpose of this document is to capture fre...
1,data-engineering-zoomcamp,General course-related questions,Course - What are the prerequisites for this c...,GitHub - DataTalksClub data-engineering-zoomca...
2,data-engineering-zoomcamp,General course-related questions,Course - Can I still join the course after the...,"Yes, even if you don't register, you're still ..."
3,data-engineering-zoomcamp,General course-related questions,Course - I have registered for the Data Engine...,You don't need it. You're accepted. You can al...
4,data-engineering-zoomcamp,General course-related questions,Course - What can I do before the course starts?,You can start by installing and setting up all...


In [9]:
df.tail()

Unnamed: 0,course,section,question,text
943,mlops-zoomcamp,Module 6: Best practices,Github actions: Permission denied error when e...,Problem description\nThis is the step in the c...
944,mlops-zoomcamp,Module 6: Best practices,Managing Multiple Docker Containers with docke...,Problem description\nWhen a docker-compose fil...
945,mlops-zoomcamp,Module 6: Best practices,AWS regions need to match docker-compose,Problem description\nIf you are having problem...
946,mlops-zoomcamp,Module 6: Best practices,Isort Pre-commit,Problem description\nPre-commit command was fa...
947,mlops-zoomcamp,Module 6: Best practices,How to destroy infrastructure created via GitH...,Problem description\nInfrastructure created in...


In [10]:
df[df.course == 'data-engineering-zoomcamp']

Unnamed: 0,course,section,question,text
0,data-engineering-zoomcamp,General course-related questions,Course - When will the course start?,The purpose of this document is to capture fre...
1,data-engineering-zoomcamp,General course-related questions,Course - What are the prerequisites for this c...,GitHub - DataTalksClub data-engineering-zoomca...
2,data-engineering-zoomcamp,General course-related questions,Course - Can I still join the course after the...,"Yes, even if you don't register, you're still ..."
3,data-engineering-zoomcamp,General course-related questions,Course - I have registered for the Data Engine...,You don't need it. You're accepted. You can al...
4,data-engineering-zoomcamp,General course-related questions,Course - What can I do before the course starts?,You can start by installing and setting up all...
...,...,...,...,...
430,data-engineering-zoomcamp,Workshop 2 - RisingWave,Unable to Open Dashboard as xdg-open doesn’t o...,Refer to the solution given in the first solut...
431,data-engineering-zoomcamp,Workshop 2 - RisingWave,Resolving Python Interpreter Path Inconsistenc...,Example Error:\nWhen attempting to execute a P...
432,data-engineering-zoomcamp,Workshop 2 - RisingWave,How does windowing work in Sql?,Ans : Windowing in streaming SQL involves defi...
433,data-engineering-zoomcamp,Triggers in Mage via CLI,"Encountering the error ""ModuleNotFoundError: N...","Python 3.12.1, is not compatible with kafka-py..."


## Vector spaces

- turn the docs into vectors
- term-document matrix:
    - rows: documents
    - columns: words/tokens
- bag of words:
    - word order is lost
    - sparse matrix
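
The bullets above can be sketched by hand before reaching for a library. This is a minimal illustration on a made-up two-sentence corpus, with whitespace tokenization as a simplifying assumption (a real vectorizer also handles punctuation and casing):

```python
from collections import Counter

# Toy corpus, made up for illustration
docs = [
    "the course starts in january",
    "register before the course starts",
]

# Whitespace tokenization (an assumption to keep the sketch short)
tokenized = [doc.split() for doc in docs]

# Vocabulary: one column per distinct token, in sorted order
vocab = sorted({token for tokens in tokenized for token in tokens})

# Term-document matrix: one row per document, one count per vocabulary token
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[token] for token in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row only records how often a token occurs, so the order of the words in the sentence cannot be recovered from it.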

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
cv = CountVectorizer(min_df=5)

In [13]:
cv.fit(df.text)

CountVectorizer(min_df=5)


In [14]:
cv.get_feature_names_out()

array(['01', '02', '03', ..., 'youtube', 'zip', 'zoomcamp'],
      shape=(1524,), dtype=object)

In [15]:
doc_examples = [
    "Course starts on 15th Jan 2024",
    "Prerequisites listed on GitHub",
    "Submit homeworks after start date",
    "Registration not required for participation",
    "Setup Google Cloud and Python before course"
]

In [16]:
cv = CountVectorizer(stop_words='english')

In [17]:
cv.fit(doc_examples)

CountVectorizer(stop_words='english')


In [18]:
cv.get_feature_names_out()

array(['15th', '2024', 'cloud', 'course', 'date', 'github', 'google',
       'homeworks', 'jan', 'listed', 'participation', 'prerequisites',
       'python', 'registration', 'required', 'setup', 'start', 'starts',
       'submit'], dtype=object)

In [19]:
X = cv.transform(doc_examples)

In [20]:
X.todense()

matrix([[1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]])

In [21]:
pd.DataFrame(X.todense(), columns=cv.get_feature_names_out()).T

Unnamed: 0,0,1,2,3,4
15th,1,0,0,0,0
2024,1,0,0,0,0
cloud,0,0,0,0,1
course,1,0,0,0,1
date,0,0,1,0,0
github,0,1,0,0,0
google,0,0,0,0,1
homeworks,0,0,1,0,0
jan,1,0,0,0,0
listed,0,1,0,0,0


In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

cv = TfidfVectorizer(stop_words='english', min_df=5)
X = cv.fit_transform(df.text)

names = cv.get_feature_names_out()

df_docs = pd.DataFrame(X.toarray(), columns=names)
df_docs.round(2)

Unnamed: 0,01,02,03,04,05,06,09,10,100,11,...,y_val,yaml,year,yellow,yellow_tripdata_2021,yes,yml,youtube,zip,zoomcamp
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.00
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.43
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.28,0.00,0.0,0.0,0.00
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.00
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
943,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.00,0.11,0.0,0.0,0.00
944,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.00
945,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.17,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.00
946,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.00


In [23]:
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 23808 stored elements and shape (948, 1333)>
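
For context on what ends up in `X`: with its default settings, scikit-learn's `TfidfVectorizer` weights each count by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and then scales every row to unit L2 length. A minimal sketch of that weighting on a made-up count matrix (`counts` and `doc_freq` are illustrative names, not part of the notebook):

```python
import numpy as np

# Made-up counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
], dtype=float)

n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency; rare terms get a larger weight
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 length
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(idf.round(3))
print(tfidf.round(3))
```

The term in the middle column appears in every document, so it receives the smallest weight.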

In [24]:
query = "I just discovered the course, is it too late to join?"

q = cv.transform([query])
q.toarray()

array([[0., 0., 0., ..., 0., 0., 0.]], shape=(1, 1333))

In [25]:
query_dict = dict(zip(names, q.toarray()[0]))
query_dict

{'01': np.float64(0.0),
 '02': np.float64(0.0),
 '03': np.float64(0.0),
 '04': np.float64(0.0),
 '05': np.float64(0.0),
 '06': np.float64(0.0),
 '09': np.float64(0.0),
 '10': np.float64(0.0),
 '100': np.float64(0.0),
 '11': np.float64(0.0),
 '12': np.float64(0.0),
 '127': np.float64(0.0),
 '13': np.float64(0.0),
 '14': np.float64(0.0),
 '15': np.float64(0.0),
 '16': np.float64(0.0),
 '17': np.float64(0.0),
 '19': np.float64(0.0),
 '1st': np.float64(0.0),
 '20': np.float64(0.0),
 '2019': np.float64(0.0),
 '2020': np.float64(0.0),
 '2021': np.float64(0.0),
 '2022': np.float64(0.0),
 '2023': np.float64(0.0),
 '2024': np.float64(0.0),
 '21': np.float64(0.0),
 '22': np.float64(0.0),
 '24': np.float64(0.0),
 '25': np.float64(0.0),
 '2pacx': np.float64(0.0),
 '30': np.float64(0.0),
 '35': np.float64(0.0),
 '403': np.float64(0.0),
 '42': np.float64(0.0),
 '50': np.float64(0.0),
 '5000': np.float64(0.0),
 '5431': np.float64(0.0),
 '5432': np.float64(0.0),
 '60': np.float64(0.0),
 '600': np.floa

In [26]:
# Similarity between the query and every document.
# TfidfVectorizer L2-normalizes each row by default, so this dot product equals the cosine similarity.

X.dot(q.T).todense()

matrix([[0.48049682],
        [0.        ],
        [0.        ],
        [0.2083882 ],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.17557272],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.15870689],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.09680922],
        [0.        ],
        [0.        ],
        [0.07529201],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.29986763],
        [0.10520675],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.27447476],
        [0.12828407],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.05163407],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.03156309],
        [0.04914818],
        [0.07138962],
        [0.        ],
        [0.04329773],
        [0.        ],
        [0
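
The dot product works as a cosine similarity here because both the document rows and the query vector have unit length. A small sketch with made-up vectors (`doc_vec` and `query_vec` are illustrative names) shows the two computations agreeing:

```python
import numpy as np

# Made-up, unnormalized vectors
doc_vec = np.array([3.0, 0.0, 4.0])
query_vec = np.array([1.0, 2.0, 2.0])

# Cosine similarity from its definition
cosine = doc_vec @ query_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec))

# Normalize first, then take a plain dot product
doc_unit = doc_vec / np.linalg.norm(doc_vec)
query_unit = query_vec / np.linalg.norm(query_vec)
dot_of_units = doc_unit @ query_unit

print(cosine, dot_of_units)
```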

In [27]:
# cosine_similarity from sklearn computes the same scores for all documents in one call
from sklearn.metrics.pairwise import cosine_similarity

In [28]:
cosine_similarity(X, q) 

array([[0.48049682],
       [0.        ],
       [0.        ],
       [0.2083882 ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.17557272],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.15870689],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.09680922],
       [0.        ],
       [0.        ],
       [0.07529201],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.29986763],
       [0.10520675],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.27447476],
       [0.12828407],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.05163407],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.03156309],
       [0.04914818],
       [0.07138962],
       [0.        ],
       [0.04329773],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.   

In [29]:
# Flatten the (n_docs, 1) column of scores into a 1-D vector
score = cosine_similarity(X, q).flatten()

In [30]:
import numpy as np

In [31]:
# np.argsort returns the indices of the documents rather than the scores themselves,
# ordered from the lowest score to the highest.
# We are interested in the 5 highest similarity scores, which are the last 5 indices.
np.argsort(score)

array([555, 839, 838, 837, 836, 835, 834, 833, 832, 863, 862, 861, 860,
       840, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 575,
       534, 514, 513, 512, 543, 542, 541, 540, 539, 538, 537, 536, 535,
       574, 533, 532, 531, 530, 529, 559, 558, 556, 843, 842, 841, 592,
       607, 606, 605, 604, 603, 602, 601, 600, 597, 596, 595, 594, 576,
       622, 621, 620, 618, 617, 616, 615, 614, 613, 612, 611, 610, 590,
       573, 572, 571, 570, 569, 567, 566, 564, 563, 562, 560, 591, 515,
       589, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 477, 437,
       434, 433, 432, 463, 462, 461, 460, 459, 453, 479, 478, 438, 476,
       859, 856, 855, 854, 853, 852, 851, 850, 849, 848, 879, 421, 402,
       401,  44, 431, 429, 428, 427, 426, 425, 424, 423, 422, 878, 420,
       419, 418, 417, 416, 447, 446, 444, 443, 442, 441, 497, 480, 510,
       509, 508, 507, 506, 505, 504, 501, 500, 499, 498, 481, 496, 527,
       526, 524, 523, 521, 520, 519, 518, 517, 516, 494, 877, 87
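
Because `np.argsort` sorts in ascending order, the best matches sit at the end of the array. A short sketch with made-up scores shows two common ways to pull out the top results:

```python
import numpy as np

# Made-up similarity scores for six documents
scores = np.array([0.48, 0.0, 0.21, 0.05, 0.30, 0.18])

# Ascending order: lowest score first
ascending = np.argsort(scores)

# Last three entries are the three best, with the strongest match last
top3_ascending = ascending[-3:]

# Negating the scores puts the strongest match first
top3 = np.argsort(-scores)[:3]

print(top3_ascending, top3)
```

The later cells in this notebook use both forms: `np.argsort(score)[-5:]` returns the best match last, while `np.argsort(-score)[:n]` returns it first.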

In [32]:
df.iloc[449].text

'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.'

In [33]:
fields = ['section', 'question', 'text']

In [34]:
matrices = {}
vectorizers = {}

for f in fields:
    cv = TfidfVectorizer(stop_words='english', min_df=5)
    X = cv.fit_transform(df[f])
    matrices[f] = X
    vectorizers[f] = cv

In [35]:
df

Unnamed: 0,course,section,question,text
0,data-engineering-zoomcamp,General course-related questions,Course - When will the course start?,The purpose of this document is to capture fre...
1,data-engineering-zoomcamp,General course-related questions,Course - What are the prerequisites for this c...,GitHub - DataTalksClub data-engineering-zoomca...
2,data-engineering-zoomcamp,General course-related questions,Course - Can I still join the course after the...,"Yes, even if you don't register, you're still ..."
3,data-engineering-zoomcamp,General course-related questions,Course - I have registered for the Data Engine...,You don't need it. You're accepted. You can al...
4,data-engineering-zoomcamp,General course-related questions,Course - What can I do before the course starts?,You can start by installing and setting up all...
...,...,...,...,...
943,mlops-zoomcamp,Module 6: Best practices,Github actions: Permission denied error when e...,Problem description\nThis is the step in the c...
944,mlops-zoomcamp,Module 6: Best practices,Managing Multiple Docker Containers with docke...,Problem description\nWhen a docker-compose fil...
945,mlops-zoomcamp,Module 6: Best practices,AWS regions need to match docker-compose,Problem description\nIf you are having problem...
946,mlops-zoomcamp,Module 6: Best practices,Isort Pre-commit,Problem description\nPre-commit command was fa...


In [36]:
vectorizers

{'section': TfidfVectorizer(min_df=5, stop_words='english'),
 'question': TfidfVectorizer(min_df=5, stop_words='english'),
 'text': TfidfVectorizer(min_df=5, stop_words='english')}

In [37]:
n = len(df)

In [38]:
n

948

In [39]:
score = np.zeros(n)

query = "I just discovered the course, is it too late to join?"

for f in fields:
    q = vectorizers[f].transform([query])
    X = matrices[f]

    f_score = cosine_similarity(X, q).flatten()

    score = score + f_score

In [40]:
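# Named "filters" to avoid shadowing Python's built-in filter()
filters = {
    'course': 'data-engineering-zoomcamp'
}

In [44]:
# Zero out the score of every document that does not match the filter
for field, value in filters.items():
    mask = (df[field] == value).astype(int).values
    score = score * mask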

In [45]:
idx = np.argsort(score)[-5:]

In [46]:
df.iloc[idx]

Unnamed: 0,course,section,question,text
4,data-engineering-zoomcamp,General course-related questions,Course - What can I do before the course starts?,You can start by installing and setting up all...
34,data-engineering-zoomcamp,General course-related questions,How can we contribute to the course?,Star the repo! Share it with friends if you fi...
1,data-engineering-zoomcamp,General course-related questions,Course - What are the prerequisites for this c...,GitHub - DataTalksClub data-engineering-zoomca...
7,data-engineering-zoomcamp,General course-related questions,Course - Can I follow the course after it fini...,"Yes, we will keep all the materials after the ..."
0,data-engineering-zoomcamp,General course-related questions,Course - When will the course start?,The purpose of this document is to capture fre...


In [63]:
# To get better results, we can boost the fields where a match with the query matters more.
# Here a match in "question" counts for more than a match in "text".

score = np.zeros(n)

query = "I just discovered the course, is it too late to join?"

boosts = {
    'question': 3,
    'text': 0.5
}

for f in fields:
    q = vectorizers[f].transform([query])
    X = matrices[f]

    f_score = cosine_similarity(X, q).flatten()

    boost = boosts.get(f, 1.0)

    score = score + boost * f_score

In [64]:
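# Named "filters" to avoid shadowing Python's built-in filter()
filters = {
    'course': 'data-engineering-zoomcamp'
}

In [65]:
# Zero out the score of every document that does not match the filter
for field, value in filters.items():
    mask = (df[field] == value).astype(int).values
    score = score * mask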

In [66]:
idx = np.argsort(score)[-5:]

In [67]:
df.iloc[idx]

Unnamed: 0,course,section,question,text
5,data-engineering-zoomcamp,General course-related questions,Course - how many Zoomcamps in a year?,"There are 3 Zoom Camps in a year, as of 2024. ..."
4,data-engineering-zoomcamp,General course-related questions,Course - What can I do before the course starts?,You can start by installing and setting up all...
34,data-engineering-zoomcamp,General course-related questions,How can we contribute to the course?,Star the repo! Share it with friends if you fi...
1,data-engineering-zoomcamp,General course-related questions,Course - What are the prerequisites for this c...,GitHub - DataTalksClub data-engineering-zoomca...
7,data-engineering-zoomcamp,General course-related questions,Course - Can I follow the course after it fini...,"Yes, we will keep all the materials after the ..."
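
The boosting and filtering steps come down to a weighted sum followed by a mask. A compact sketch on made-up per-field scores (the arrays and the short course labels are illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up cosine similarities per field for four documents
field_scores = {
    "question": np.array([0.1, 0.6, 0.0, 0.3]),
    "text": np.array([0.4, 0.2, 0.5, 0.0]),
}
boosts = {"question": 3.0, "text": 0.5}

# Made-up course label for each document
courses = np.array(["de", "ml", "de", "de"])

# Weighted sum of the field scores; fields without a boost default to 1.0
total = np.zeros(4)
for field, s in field_scores.items():
    total += boosts.get(field, 1.0) * s

# Filtering: documents from other courses get a score of zero
mask = courses == "de"
total = total * mask

# Best match first
ranking = np.argsort(-total)
print(total, ranking)
```

Filtered-out documents are not removed, only scored as zero, so they can still show up at the tail of the results when fewer documents match than were requested.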


## Putting it all together

In [68]:
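class TextSearch:
    """TF-IDF search over several text fields with optional boosts and filters."""

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.matrices = {}
        self.vectorizers = {}

    def fit(self, records, vectorizer_params=None):
        # None instead of a mutable default; it means "use the TfidfVectorizer defaults"
        vectorizer_params = vectorizer_params or {}
        self.df = pd.DataFrame(records)

        # One vectorizer and one TF-IDF matrix per text field
        for f in self.text_fields:
            cv = TfidfVectorizer(**vectorizer_params)
            X = cv.fit_transform(self.df[f])
            self.matrices[f] = X
            self.vectorizers[f] = cv

    def search(self, query, n_results=10, boost=None, filters=None):
        boost = boost or {}
        filters = filters or {}
        score = np.zeros(len(self.df))

        # Sum the per-field cosine similarities, weighted by each field's boost
        for f in self.text_fields:
            b = boost.get(f, 1.0)
            q = self.vectorizers[f].transform([query])
            s = cosine_similarity(self.matrices[f], q).flatten()
            score = score + b * s

        # Zero out documents that do not match every filter
        for field, value in filters.items():
            mask = (self.df[field] == value).values
            score = score * mask

        # Highest scores first
        idx = np.argsort(-score)[:n_results]
        results = self.df.iloc[idx]
        return results.to_dict(orient='records')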

In [69]:
index = TextSearch(
    text_fields=['section', 'question', 'text']
)
index.fit(documents)

index.search(
    query='I just signed up. Is it too late to join the course?',
    n_results=5,
    boost={'question': 3.0},
    filters={'course': 'data-engineering-zoomcamp'}
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineerin

### (See minsearch code that Alexey created to implement the above process)


## Vector Search

Vector search is useful when the words in the query and the documents don't match exactly.  This section shows a few different types of vector search.  The first applies singular value decomposition (SVD), which is a way to reduce dimensionality while preserving as much information as possible.
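
To make the idea concrete, here is a minimal sketch of a truncated SVD with plain NumPy on a small random matrix standing in for the document-term matrix (`A`, `k`, and `q` are illustrative names). Keeping only the first `k` singular directions gives each document a `k`-dimensional embedding, and a query is projected onto the same directions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a small document-term matrix: 6 documents, 5 terms
A = rng.random((6, 5))

# Full (thin) SVD: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest directions
k = 2
A_emb = U[:, :k] * S[:k]       # document embeddings, shape (6, 2)

# The same embeddings, obtained by projecting onto the top-k right singular vectors
A_proj = A @ Vt[:k].T

# A new query vector is projected the same way
q = rng.random((1, 5))
q_emb = q @ Vt[:k].T

print(A_emb.shape, q_emb.shape)
```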

In [70]:
# Document vectors (the TF-IDF matrix)
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 23808 stored elements and shape (948, 1333)>

In [71]:
from sklearn.decomposition import TruncatedSVD

In [73]:
X = matrices['text']
cv = vectorizers['text']

In [74]:
svd = TruncatedSVD(n_components=16)
X_emb = svd.fit_transform(X)

In [75]:
X_emb.shape

(948, 16)

In [76]:
X_emb[0]

array([ 0.09653079, -0.08210454, -0.10018914, -0.08088948,  0.06961661,
       -0.06096289,  0.02329437, -0.09692912, -0.24640051,  0.29372813,
        0.09957282,  0.06467022,  0.01350107, -0.10302662,  0.01484633,
        0.0047849 ])

In [78]:
# representing the query 

query = 'I just signed up. Is it too late to join the course?'

Q = cv.transform([query])
Q_emb = svd.transform(Q)
Q_emb[0]

array([ 0.05790176, -0.03845671, -0.05538784, -0.02823445,  0.04001751,
       -0.06301191,  0.01383448, -0.06642338, -0.17336885,  0.18981088,
        0.07654223,  0.06638852,  0.00524203, -0.072997  ,  0.00971868,
       -0.02389385])

In [84]:
# Dot product of the first document's embedding and the query embedding
np.dot(X_emb[0], Q_emb[0])

np.float64(0.003963165680206169)

In [85]:
score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:10]
df.loc[idx]

Unnamed: 0,course,section,question,text
779,machine-learning-zoomcamp,Miscellaneous,Reproducibility,Problem description:\nDo we have to run everyt...
810,mlops-zoomcamp,+-General course questions,Format for questions: [Problem title],MLOps Zoomcamp FAQ\nThe purpose of this docume...
221,data-engineering-zoomcamp,error: Error while reading table: trips_data_a...,GCP BQ - Invalid project ID . Project IDs must...,Problem occurs when misplacing content after f...
816,mlops-zoomcamp,+-General course questions,Can I still graduate when I didn’t complete ho...,"In order to obtain the certificate, completion..."
677,machine-learning-zoomcamp,8. Neural Networks and Deep Learning,The same accuracy on epochs,Problem description\nThe accuracy and the loss...
755,machine-learning-zoomcamp,Projects (Midterm and Capstone),Problem title,Problem description\nSolution description\n(op...
892,mlops-zoomcamp,Module 3: Orchestration,Problem title,Problem description\nSolution description\n(op...
946,mlops-zoomcamp,Module 6: Best practices,Isort Pre-commit,Problem description\nPre-commit command was fa...
940,mlops-zoomcamp,Module 6: Best practices,Git commit with pre-commit hook raises error ‘...,Problem description\ngit commit -m 'Updated xx...
791,machine-learning-zoomcamp,Miscellaneous,Chart for classes and predictions,How to visualize the predictions per classes a...


There are some benefits to searching this way, including taking care of synonyms.  We also get dense vectors that we can use for other purposes, such as training an ML model, which would be much more difficult with the original sparse vectors.

One issue to consider: the negative values in the SVD embeddings are hard to interpret.  A different way to compress this matrix, **Non-Negative Matrix Factorization** (NMF), avoids them.
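
As a rough illustration of what such a factorization does, the sketch below factors a small non-negative matrix `V` into two non-negative matrices `W` and `H` using the classic multiplicative update rule. This is only one of several ways to fit an NMF and is not necessarily what scikit-learn does by default; the matrix and all names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-negative stand-in for a TF-IDF matrix: 6 documents, 5 terms
V = rng.random((6, 5))

k = 2
W = rng.random((6, k))   # document embeddings (one row per document)
H = rng.random((k, 5))   # components expressed over the terms
eps = 1e-9               # guards against division by zero

initial_error = np.linalg.norm(V - W @ H)

# Multiplicative updates keep every entry of W and H non-negative
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

final_error = np.linalg.norm(V - W @ H)
print(initial_error, final_error)
```

Because every entry of `W` stays at or above zero, each dimension can be read as "how much of this component the document contains".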

In [87]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=16)
X_emb = nmf.fit_transform(X)
X_emb[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00063186, 0.        , 0.27011993, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        ])

In [88]:
Q = cv.transform([query])
Q_emb = nmf.transform(Q)
Q_emb[0]

array([0.        , 0.00241631, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.17952633, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.00239042,
       0.        ])

In [90]:
score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:10]
df.loc[idx]

Unnamed: 0,course,section,question,text
2,data-engineering-zoomcamp,General course-related questions,Course - Can I still join the course after the...,"Yes, even if you don't register, you're still ..."
449,machine-learning-zoomcamp,General course-related questions,The course has already started. Can I still jo...,"Yes, you can. You won’t be able to submit some..."
814,mlops-zoomcamp,+-General course questions,What if my answer is not exactly the same as t...,Please choose the closest one to your answer. ...
11,data-engineering-zoomcamp,General course-related questions,Certificate - Can I follow the course in a sel...,"No, you can only get a certificate if you fini..."
0,data-engineering-zoomcamp,General course-related questions,Course - When will the course start?,The purpose of this document is to capture fre...
764,machine-learning-zoomcamp,Projects (Midterm and Capstone),What If I submitted only two projects and fail...,If you have submitted two projects (and peer-r...
7,data-engineering-zoomcamp,General course-related questions,Course - Can I follow the course after it fini...,"Yes, we will keep all the materials after the ..."
451,machine-learning-zoomcamp,General course-related questions,Can I submit the homework after the due date?,"No, it’s not possible. The form is closed afte..."
436,machine-learning-zoomcamp,General course-related questions,Is it going to be live? When?,"The course videos are pre-recorded, you can st..."
450,machine-learning-zoomcamp,General course-related questions,When does the next iteration start?,The course is available in the self-paced mode...


These are the simplest possible methods for creating embeddings.  The issue with these approaches is that the input is a bag of words, so the information about word order is lost.  Sometimes word order is quite important.  For RAG, it might not be that important, especially for this dataset, but some situations, including more advanced RAG, may demand another approach.
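
A tiny example of the word-order problem, using two made-up sentences that say different things but contain exactly the same words:

```python
from collections import Counter

# Two made-up sentences with different meanings
a = "the course starts after the workshop"
b = "the workshop starts after the course"

# Bag of words: token counts only, no positions
bag_a = Counter(a.split())
bag_b = Counter(b.split())

same_bag = bag_a == bag_b
print(same_bag)
```

Any method built on these counts, including TF-IDF, SVD, and NMF, sees the two sentences as identical.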

Book reference:  *Introduction to Information Retrieval*.
The workshop also recommends searching for "metrics for ranking" when selecting embedders, choosing the number of dimensions, etc.

## BERT

BERT is a neural network (a transformer) that turns a document into an embedding.  These embeddings capture not only semantic similarity but also word order.  This is especially useful for phrases like "find out more" or contractions ("don't do that"...).


For this part, we'll get models from Hugging Face.

In [91]:
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # Set the model to evaluation mode if not training

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [92]:
texts = [
    "Yes, we will keep all the materials after the course finishes.",
    "You can follow the course at your own pace after it finishes"
]
encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
encoded_input


{'input_ids': tensor([[  101,  2748,  1010,  2057,  2097,  2562,  2035,  1996,  4475,  2044,
          1996,  2607, 12321,  1012,   102],
        [  101,  2017,  2064,  3582,  1996,  2607,  2012,  2115,  2219,  6393,
          2044,  2009, 12321,   102,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

In [93]:
with torch.no_grad():  # Disable gradient calculation for inference
    outputs = model(**encoded_input)
    hidden_states = outputs.last_hidden_state

In [94]:
hidden_states.shape

torch.Size([2, 15, 768])

In [95]:
hidden_states[0]

tensor([[ 0.1010,  0.0181,  0.1303,  ..., -0.2932,  0.1863,  0.6615],
        [ 1.0608, -0.1242,  0.1370,  ..., -0.1605,  1.0429,  0.3532],
        [ 0.1802,  0.0776,  0.3941,  ..., -0.1379,  0.5974,  0.1704],
        ...,
        [ 0.4738, -0.0184,  0.2186,  ..., -0.0013, -0.0833, -0.2170],
        [ 0.6516,  0.1216, -0.2494,  ...,  0.1557, -0.5632, -0.4310],
        [ 0.7164,  0.2157, -0.0281,  ...,  0.2281, -0.6725, -0.3245]])

In [96]:
hidden_states[0].shape

torch.Size([15, 768])

In [97]:
sentence_embeddings = hidden_states.mean(dim=1)
sentence_embeddings.shape

torch.Size([2, 768])

In [98]:
sentence_embeddings

tensor([[ 0.3600, -0.1607,  0.3545,  ...,  0.0429,  0.0348, -0.0382],
        [ 0.1785, -0.5000,  0.2528,  ..., -0.1141, -0.3361,  0.4110]])
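
One caveat about `hidden_states.mean(dim=1)`: it averages over every token position, including padding (the second sentence in `texts` has one padded position, the trailing 0 in its `attention_mask`). A common alternative is to average only over real tokens. The sketch below shows the difference in NumPy with made-up numbers; in the notebook the same arithmetic would be applied to the torch tensors:

```python
import numpy as np

# Made-up token embeddings: 2 sentences, 3 token positions, 2 dimensions
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[2.0, 0.0], [4.0, 0.0], [9.0, 9.0]],   # last position is padding
])
attention_mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])

# Plain mean: the padding position is included in the average
plain_mean = hidden.mean(axis=1)

# Masked mean: padding is zeroed out and the sum is divided by the real token count
mask = attention_mask[:, :, None]           # shape (2, 3, 1), broadcasts over dimensions
masked_mean = (hidden * mask).sum(axis=1) / mask.sum(axis=1)

print(plain_mean)
print(masked_mean)
```

The first sentence has no padding, so both means agree for it; only the second sentence changes.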

In [99]:
X.shape

(948, 1333)

While BERT is overkill for this example, it is much more powerful for more complex documents or queries.

In [100]:
X_emb = sentence_embeddings.numpy()

In [101]:
def make_batches(seq, n):
    result = []
    for i in range(0, len(seq), n):
        batch = seq[i:i+n]
        result.append(batch)
    return result

In [102]:
from tqdm.auto import tqdm
texts = df['text'].tolist()
text_batches = make_batches(texts, 8)

all_embeddings = []

for batch in tqdm(text_batches):
    encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**encoded_input)
        hidden_states = outputs.last_hidden_state
        
        batch_embeddings = hidden_states.mean(dim=1)
        batch_embeddings_np = batch_embeddings.cpu().numpy()
        all_embeddings.append(batch_embeddings_np)

final_embeddings = np.vstack(all_embeddings)

In [103]:
def compute_embeddings(texts, batch_size=8):
    text_batches = make_batches(texts, batch_size)  # use the batch_size argument instead of a hard-coded 8
    
    all_embeddings = []
    
    for batch in tqdm(text_batches):
        encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    
        with torch.no_grad():
            outputs = model(**encoded_input)
            hidden_states = outputs.last_hidden_state
            
            batch_embeddings = hidden_states.mean(dim=1)
            batch_embeddings_np = batch_embeddings.cpu().numpy()
            all_embeddings.append(batch_embeddings_np)
    
    final_embeddings = np.vstack(all_embeddings)
    return final_embeddings

In [104]:
X_text = compute_embeddings(df['text'].tolist())

In [105]:
X_text

array([[-0.00456303, -0.11667512,  0.6274718 , ..., -0.03659191,
         0.10031679,  0.0292713 ],
       [-0.1423361 , -0.1985392 ,  0.28455415, ..., -0.01139053,
        -0.1539977 ,  0.09535079],
       [ 0.19672246, -0.08461305,  0.28200513, ...,  0.11395867,
        -0.06448027, -0.0128261 ],
       ...,
       [-0.2821744 , -0.33324358,  0.29784983, ..., -0.35042733,
         0.03266054,  0.09537254],
       [-0.42807093, -0.39468756,  0.3094198 , ..., -0.05943285,
        -0.12965176,  0.07887058],
       [-0.16892126, -0.25146285,  0.47843292, ..., -0.18535416,
        -0.1610892 ,  0.27272922]], shape=(948, 768), dtype=float32)
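
These embeddings can be ranked against a query embedding in the same way as the SVD and NMF embeddings earlier. The sketch below assumes the query would be embedded with the same model (for example through `compute_embeddings([query])`); to stay runnable without downloading BERT, small random arrays stand in for the real embeddings, and `cosine_scores` is a hypothetical helper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real embeddings: 5 documents, 8 dimensions
doc_embeddings = rng.normal(size=(5, 8))

# A stand-in query that lies very close to document 3
query_embedding = doc_embeddings[3] + 0.01 * rng.normal(size=8)

def cosine_scores(docs, query):
    """Cosine similarity between each row of docs and a single query vector."""
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    return docs_unit @ query_unit

scores = cosine_scores(doc_embeddings, query_embedding)

# Indices of the three best matches, strongest first
top = np.argsort(-scores)[:3]
print(top, scores[top].round(3))
```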