## Week 3: Vector Search

## Q1. Getting the embeddings model

In [1]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd

In [2]:
embedding_model = SentenceTransformer("multi-qa-distilbert-cos-v1")

In [3]:
user_question = "I just discovered the course. Can I still join it?"

In [4]:
vector_question = embedding_model.encode(user_question)



**What's the first value of the resulting vector?**

In [5]:
vector_question[0]

0.07822264

*Answer:* 0.07 (0.078, truncated to two decimals)

### Prepare the documents

In [6]:
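import requests

# GitHub serves the raw file when ?raw=1 is appended to the blob URL
base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '03-vector-search/eval/documents-with-ids.json'

docs_url = f'{base_url}/{relative_url}?raw=1'
docs_response = requests.get(docs_url)
docs_response.raise_for_status()  # fail early if the download did not succeed
documents = docs_response.json()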

In [7]:
filtered_documents = [d for d in documents if d["course"] == "machine-learning-zoomcamp"]
len(filtered_documents)

375

## Q2. Creating the embeddings

In [8]:
embeddings = []

for doc in filtered_documents:
    # Embed question and answer together as one string
    qa_text = f"{doc['question']} {doc['text']}"
    vector = embedding_model.encode(qa_text).tolist()
    embeddings.append(vector)

In [10]:
X = np.array(embeddings)
X.shape

(375, 768)

## Q3. Search

**What's the highest score in the results?**

In [11]:
v = vector_question
scores = X.dot(v)

In [15]:
scores.max()

0.6506573332655323

*Answer:* 0.65
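
A plain dot product can stand in for cosine similarity only when the vectors have unit length. That is assumed here for `multi-qa-distilbert-cos-v1`; this notebook does not verify it. A minimal sketch with random stand-in vectors (no model involved, all names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 5 "document" vectors and 1 "query" vector, 8 dimensions each
docs = rng.normal(size=(5, 8))
query = rng.normal(size=8)

# Normalise everything to unit length
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

# Cosine similarity computed from the raw (unnormalised) vectors
cosine = docs.dot(query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Dot product of the unit-length vectors
dot_unit = docs_unit.dot(query_unit)

print(np.allclose(cosine, dot_unit))  # True
```

To check the assumption on the real data, `np.linalg.norm(X, axis=1)` should come out close to 1 for every row.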

### Vector search

In [12]:
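class VectorSearchEngine:
    def __init__(self, documents, embeddings):
        # embeddings[i] must be the vector for documents[i]
        self.documents = documents
        self.embeddings = embeddings

    def search(self, v_query, num_results=10):
        # Dot product of every document vector with the query
        scores = self.embeddings.dot(v_query)
        # Indices of the highest-scoring documents, best first
        idx = np.argsort(-scores)[:num_results]
        return [self.documents[i] for i in idx]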

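# X holds the embeddings of filtered_documents, so pass that same list:
# search() returns documents by row index, and the two must line up.
search_engine = VectorSearchEngine(documents=filtered_documents, embeddings=X)
search_engine.search(v, num_results=5)

*Note: the output below was recorded with `documents=documents` (the unfiltered list), which is why it contains `data-engineering-zoomcamp` entries.*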

[{'text': 'You can find the latest and up-to-date deadlines here: https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD84CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml\nAlso, take note of Announcements from @Au-Tomator for any extensions or other news. Or, the form may also show the updated deadline, if Instructor(s) has updated it.',
  'section': 'General course-related questions',
  'question': 'Homework - What are homework and project deadlines?',
  'course': 'data-engineering-zoomcamp',
  'id': 'a1daf537'},
 {'text': 'After you submit your homework it will be graded based on the amount of questions in a particular homework. You can see how many points you have right on the page of the homework up top. Additionally in the leaderboard you will find the sum of all points you’ve earned - points for Homeworks, FAQs and Learning in Public. If homework is clear, others work as follows: if you submit something to FAQ, you get one point, for each learning i
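
The engine looks documents up by position, so the document list and the embedding matrix have to be in the same order and have the same length. A small sketch with made-up 2-D vectors (`toy_docs` and `toy_X` are hypothetical, not course data) shows how the ranking falls out:

```python
import numpy as np


class VectorSearchEngine:
    def __init__(self, documents, embeddings):
        self.documents = documents
        self.embeddings = embeddings

    def search(self, v_query, num_results=10):
        scores = self.embeddings.dot(v_query)
        idx = np.argsort(-scores)[:num_results]
        return [self.documents[i] for i in idx]


# Hypothetical toy data: row i of toy_X is the embedding of toy_docs[i]
toy_docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
toy_X = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

engine = VectorSearchEngine(documents=toy_docs, embeddings=toy_X)

# Scores for this query are a=0.0, b=1.0, c=0.8
top = engine.search(np.array([0.0, 1.0]), num_results=2)
print([d["id"] for d in top])  # ['b', 'c']
```

A cheap guard in a real notebook is `assert len(documents) == embeddings.shape[0]` before building the engine.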

## Q4. Hit-rate for our search engine

In [13]:
base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '03-vector-search/eval/ground-truth-data.csv'
ground_truth_url = f'{base_url}/{relative_url}?raw=1'

df_ground_truth = pd.read_csv(ground_truth_url)
df_ground_truth = df_ground_truth[df_ground_truth.course == 'machine-learning-zoomcamp']
ground_truth = df_ground_truth.to_dict(orient='records')
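
The hit-rate itself is the share of ground-truth questions whose expected document shows up among the returned results. A minimal sketch on toy data, assuming each ground-truth record has a `question` and a `document` key (the expected document id); `toy_vectors` and `toy_search` are hypothetical stand-ins for the embedding model and the search engine:

```python
import numpy as np


def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    hits = sum(1 for line in relevance_total if any(line))
    return hits / len(relevance_total)


def evaluate(ground_truth, search_function):
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        # True where a returned document is the expected one
        relevance_total.append([d["id"] == q["document"] for d in results])
    return hit_rate(relevance_total)


# Hypothetical toy index: row i of toy_X belongs to toy_docs[i]
toy_docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
toy_X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# Stand-in for the embedding model: a fixed vector per question
toy_vectors = {"q1": np.array([1.0, 0.0]), "q2": np.array([0.0, 1.0])}

toy_ground_truth = [
    {"question": "q1", "document": "a"},  # top result is "a": hit
    {"question": "q2", "document": "c"},  # top result is "b": miss
]


def toy_search(q, num_results=1):
    scores = toy_X.dot(toy_vectors[q["question"]])
    idx = np.argsort(-scores)[:num_results]
    return [toy_docs[i] for i in idx]


print(evaluate(toy_ground_truth, toy_search))  # 0.5
```

In this notebook the search function would presumably encode `q['question']` with `embedding_model` and call `search_engine.search(..., num_results=5)`; that wiring is a sketch, not a recorded run.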

## Q5. Indexing with Elasticsearch

## Q6. Hit-rate for Elasticsearch