# LLM Zoomcamp - Homework 2: Vector Search

This notebook contains solutions for Homework 2, which focuses on vector search.

## Q1. Embedding the query

Embed the query: 'I just discovered the course. Can I join now?'. Use the 'jinaai/jina-embeddings-v2-small-en' model.

You should get a numpy array of size 512.

What's the minimal value in this array?

Options:
- -0.51
- -0.11
- 0
- 0.51

In [None]:
# Install required packages if needed
# pip install -q "qdrant-client[fastembed]>=1.14.2"

In [2]:
# Q1 Solution: Embedding the query
from fastembed import TextEmbedding
import numpy as np

# Initialize the embedding model
embedding_model = TextEmbedding("jinaai/jina-embeddings-v2-small-en")

# Embed the query
query = "I just discovered the course. Can I join now?"
query_embedding = list(embedding_model.embed([query]))[0]

# Convert to numpy array
query_vector = np.array(query_embedding)

print(f"Query: {query}")
print(f"Embedding shape: {query_vector.shape}")
print(f"Minimal value: {query_vector.min()}")
print(f"Maximal value: {query_vector.max()}")

Query: I just discovered the course. Can I join now?
Embedding shape: (512,)
Minimal value: -0.11726373551188797
Maximal value: 0.13307955253468784


## Q2. Cosine similarity with another vector

The vectors that our embedding model returns are already normalized: their length is 1.0.

You can check that by using the norm function:
```python
import numpy as np
np.linalg.norm(q)
```

Which means that we can simply compute the dot product between two vectors to learn the cosine similarity between them.

For example, if you compute the cosine of the query vector with itself, the result will be 1.0:
```python
q.dot(q)
```
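
To see why the dot product is enough, here is a minimal sketch with two made-up 3-dimensional vectors (not real embeddings). The helper name `cosine` is only an illustration: it computes the full formula so the result can be compared with the plain dot product of the normalized vectors.

```python
import numpy as np

def cosine(u, v):
    """Full cosine similarity: dot product divided by the product of the norms."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors, chosen arbitrarily for illustration
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 1.0])

# Scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

full = cosine(a, b)              # full formula on the raw vectors
shortcut = a_unit.dot(b_unit)    # plain dot product on the unit vectors

print(np.isclose(full, shortcut))
print(np.isclose(a_unit.dot(a_unit), 1.0))
```

Once the vectors have length 1, the denominator of the cosine formula is 1, so only the dot product remains.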

Now let's embed this document:
```python
doc = 'Can I still join the course after the start date?'
```

What's the cosine similarity between the vector for the query and the vector for the document?

Options:
- 0.3
- 0.5
- 0.7
- 0.9

In [4]:
# Q2 Solution: Cosine similarity with another vector

# First, let's verify that vectors are normalized
print(f"Query vector norm: {np.linalg.norm(query_vector)}")
print(f"Query vector dot product with itself: {query_vector.dot(query_vector)}")

# Embed the document
doc = 'Can I still join the course after the start date?'
doc_embedding = list(embedding_model.embed([doc]))[0]
doc_vector = np.array(doc_embedding)

print(f"\nDocument: {doc}")
print(f"Document vector norm: {np.linalg.norm(doc_vector)}")

# Compute cosine similarity (dot product since vectors are normalized)
cosine_similarity = query_vector.dot(doc_vector)

print(f"\nCosine similarity between query and document: {cosine_similarity}")

Query vector norm: 1.0
Query vector dot product with itself: 1.0

Document: Can I still join the course after the start date?
Document vector norm: 0.9999999999999999

Cosine similarity between query and document: 0.9008528856818037
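
The document norm above prints as 0.9999999999999999 rather than exactly 1.0. That is ordinary floating-point rounding, so a normalization check is safer with a tolerance than with `==`. A small sketch, assuming an arbitrary toy vector in place of a real embedding:

```python
import numpy as np

# Arbitrary toy vector standing in for an embedding
v = np.array([0.3, -1.2, 0.7, 2.5])
v_unit = v / np.linalg.norm(v)

norm = np.linalg.norm(v_unit)

# Compare with a tolerance instead of testing norm == 1.0
is_normalized = bool(np.isclose(norm, 1.0))
print(is_normalized)
```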


## Q3. Ranking by cosine

For Q3 and Q4 we will use these documents:

```python
documents = [
    {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
     'section': 'General course-related questions',
     'question': 'Course - Can I still join the course after the start date?',
     'course': 'data-engineering-zoomcamp'},
    {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
     'section': 'General course-related questions',
     'question': 'Course - Can I follow the course after it finishes?',
     'course': 'data-engineering-zoomcamp'},
    {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon't forget to register in DataTalks.Club's Slack and join the channel.",
     'section': 'General course-related questions',
     'question': 'Course - When will the course start?',
     'course': 'data-engineering-zoomcamp'},
    {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
     'section': 'General course-related questions',
     'question': 'Course - What can I do before the course starts?',
     'course': 'data-engineering-zoomcamp'},
    {'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.',
     'section': 'General course-related questions',
     'question': 'How can we contribute to the course?',
     'course': 'data-engineering-zoomcamp'}
]
```

Compute the embeddings for the text field, and compute the cosine between the query vector and all the documents.

What's the document index with the highest similarity? (Indexing starts from 0):

Options:
- 0
- 1
- 2
- 3
- 4

**Hint:** if you put all the embeddings of the text field in one matrix V (a single 2-dimensional numpy array), then computing the cosine becomes a matrix multiplication: `V.dot(q)`
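
The hint can be illustrated on a tiny example. The rows of `V` and the query `q` below are made-up unit-length vectors, not model output; the point is only that one matrix-vector product yields one cosine score per document, and `np.argmax` picks the best row.

```python
import numpy as np

# Three made-up unit-length "document" vectors stacked as rows of V
V = np.array([
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
])
q = np.array([0.8, 0.6])  # unit-length toy query

scores = V.dot(q)              # one cosine score per row of V
best = int(np.argmax(scores))  # index of the most similar row
ranking = np.argsort(-scores)  # all row indices, most similar first

print(scores, best, ranking)
```

`np.argsort(-scores)` is handy when the full ranking is needed rather than only the top document.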

In [6]:
# Q3 Solution: Ranking by cosine

# Define the documents
documents = [{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.',
  'section': 'General course-related questions',
  'question': 'How can we contribute to the course?',
  'course': 'data-engineering-zoomcamp'}]

# Extract text from documents
texts = [doc['text'] for doc in documents]

# Embed all documents
print("Embedding documents...")
document_embeddings = list(embedding_model.embed(texts))

# Convert to numpy array matrix (V)
V = np.array(document_embeddings)
print(f"Document embeddings matrix shape: {V.shape}")

# Compute cosine similarities using matrix multiplication
cosine_similarities = V.dot(query_vector)

print(f"\nQuery: {query}")
print("\nCosine similarities with documents:")
for i, (doc, similarity) in enumerate(zip(documents, cosine_similarities)):
    print(f"Document {i}: {similarity:.4f}")
    print(f"  Question: {doc['question']}")
    print(f"  Text preview: {doc['text'][:100]}...")
    print()

# Find the document with highest similarity
max_similarity_index = np.argmax(cosine_similarities)
max_similarity = cosine_similarities[max_similarity_index]

print(f"Document with highest similarity:")
print(f"Index: {max_similarity_index}")
print(f"Similarity: {max_similarity:.4f}")
print(f"Question: {documents[max_similarity_index]['question']}")

Embedding documents...
Document embeddings matrix shape: (5, 512)

Query: I just discovered the course. Can I join now?

Cosine similarities with documents:
Document 0: 0.7630
  Question: Course - Can I still join the course after the start date?
  Text preview: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, t...

Document 1: 0.8182
  Question: Course - Can I follow the course after it finishes?
  Text preview: Yes, we will keep all the materials after the course finishes, so you can follow the course at your ...

Document 2: 0.8085
  Question: Course - When will the course start?
  Text preview: The purpose of this document is to capture frequently asked technical questions
The exact day and ho...

Document 3: 0.7133
  Question: Course - What can I do before the course starts?
  Text preview: You can start by installing and setting up all the dependencies and requirements:
Google cloud accou...

Document 4: 0.7304
  Question: How can

## Q4. Ranking by cosine, version two

Now let's calculate a new field, which is a concatenation of question and text:

```python
full_text = doc['question'] + ' ' + doc['text']
```
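
Applied to a whole list of records, the concatenation is a one-line comprehension. The two records below are invented for illustration and only mimic the `question` and `text` keys of the homework documents.

```python
# Hypothetical records with the same keys as the homework documents
sample_docs = [
    {"question": "Can I join late?", "text": "Yes, you can."},
    {"question": "Where is the syllabus?", "text": "In the course repository."},
]

# Build the combined field for every record
full_texts = [d["question"] + " " + d["text"] for d in sample_docs]

for i, full_text in enumerate(full_texts):
    print(i, full_text)
```

Because the question is usually phrased much like a user query, including it can change which document scores highest.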

Embed this field and compute the cosine between it and the query vector. What's the highest scoring document?

Options:
- 0
- 1
- 2
- 3
- 4

Is it different from Q3? If yes, why?

In [7]:
# Q4 Solution: Ranking by cosine, version two

# Create full_text field by concatenating question and text
full_texts = [doc['question'] + ' ' + doc['text'] for doc in documents]

print("Full text examples:")
for i, full_text in enumerate(full_texts):
    print(f"Document {i}: {full_text[:150]}...")
    print()

# Embed all full_text documents
print("Embedding full texts...")
full_text_embeddings = list(embedding_model.embed(full_texts))

# Convert to numpy array matrix
V_full = np.array(full_text_embeddings)
print(f"Full text embeddings matrix shape: {V_full.shape}")

# Compute cosine similarities using matrix multiplication
cosine_similarities_full = V_full.dot(query_vector)

print(f"\nQuery: {query}")
print("\nCosine similarities with full text (question + text):")
for i, (doc, similarity) in enumerate(zip(documents, cosine_similarities_full)):
    print(f"Document {i}: {similarity:.4f}")
    print(f"  Question: {doc['question']}")
    print()

# Find the document with highest similarity
max_similarity_index_full = np.argmax(cosine_similarities_full)
max_similarity_full = cosine_similarities_full[max_similarity_index_full]

print(f"Document with highest similarity (full text):")
print(f"Index: {max_similarity_index_full}")
print(f"Similarity: {max_similarity_full:.4f}")
print(f"Question: {documents[max_similarity_index_full]['question']}")

# Compare with Q3 results
print(f"\n=== Comparison with Q3 ===")
print(f"Q3 (text only) - Highest similarity index: {max_similarity_index}")
print(f"Q4 (question + text) - Highest similarity index: {max_similarity_index_full}")

if max_similarity_index != max_similarity_index_full:
    print(f"\n✓ Results are DIFFERENT!")
    print(f"Why: Including the 'question' field in the embedding provides more semantic context.")
    print(f"The query '{query}' is semantically closer to some questions than just the text content.")
    print(f"This shows that the question titles contain important semantic information for matching.")
else:
    print(f"\n✓ Results are the SAME!")
    print(f"The same document had the highest similarity in both approaches.")

Full text examples:
Document 0: Course - Can I still join the course after the start date? Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, h...

Document 1: Course - Can I follow the course after it finishes? Yes, we will keep all the materials after the course finishes, so you can follow the course at you...

Document 2: Course - When will the course start? The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the cou...

Document 3: Course - What can I do before the course starts? You can start by installing and setting up all the dependencies and requirements:
Google cloud accoun...

Document 4: How can we contribute to the course? Star the repo! Share it with friends if you find it useful ❣️
Create a PR if you see you can improve the text or ...

Embedding full texts...
Full text embeddings matrix shape: (5, 512)

Query: I just discovered the course. Can I join now?

Cosine similarities wi

## Q5. Selecting the embedding model

Now let's select a smaller embedding model. What's the smallest dimensionality for models in fastembed?

Options:
- 128
- 256
- 384
- 512

One of these models is `BAAI/bge-small-en`. Let's use it.
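
Finding the smallest dimensionality is a minimum over a list of model records. The sketch below uses a hypothetical list with placeholder model names; it only assumes each record is a dict with `model` and `dim` keys, which are the keys the solution cell reads from `TextEmbedding.list_supported_models()`.

```python
# Hypothetical records; the names and dimensions are placeholders
hypothetical_models = [
    {"model": "example/model-large", "dim": 1024},
    {"model": "example/model-base", "dim": 768},
    {"model": "example/model-small", "dim": 384},
]

# Record with the smallest embedding dimensionality
smallest = min(hypothetical_models, key=lambda m: m["dim"])
smallest_dim = smallest["dim"]

# Every model that shares that dimensionality
candidates = [m["model"] for m in hypothetical_models if m["dim"] == smallest_dim]

print(smallest_dim, candidates)
```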

In [10]:
# Q5 Solution: Selecting the embedding model

# List all supported models and find their dimensions

dimensions = []
for model in TextEmbedding.list_supported_models():
    dim = model['dim']
    dimensions.append(dim)

# Find the smallest dimension
unique_dims = sorted(set(dimensions))
print(f"\nUnique dimensions available: {unique_dims}")
print(f"Smallest dimension: {min(unique_dims)}")

# Look specifically for BAAI/bge-small-en
print(f"\nLooking for BAAI/bge-small-en model:")
for model in TextEmbedding.list_supported_models():
    if 'bge-small-en' in model['model']:
        print(f"Found: {model['model']}")
        print(f"Dimension: {model['dim']}")
        print(f"Description: {model['description']}")
        
# Record the name of the smaller model for Q6
small_model_name = "BAAI/bge-small-en"
print(f"\nUsing model for Q6: {small_model_name}")


Unique dimensions available: [384, 512, 768, 1024]
Smallest dimension: 384

Looking for BAAI/bge-small-en model:
Found: BAAI/bge-small-en
Dimension: 384
Description: Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: necessary, 2023 year.
Found: BAAI/bge-small-en-v1.5
Dimension: 384
Description: Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.

Using model for Q6: BAAI/bge-small-en


## Q6. Indexing with Qdrant (2 points)

For the last question, we will use more documents.

We will select only the FAQ records from the Machine Learning Zoomcamp:

```python
import requests 

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']
    if course_name != 'machine-learning-zoomcamp':
        continue

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
```
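
The filtering logic can be tried offline on a small in-memory sample. The data below is invented and only mirrors the nesting of `documents.json` (a list of courses, each with a `documents` list); `select_course` is a hypothetical helper name. Unlike the loop above, this sketch copies each record instead of mutating it.

```python
# Invented sample that mirrors the nesting of documents.json
sample_raw = [
    {
        "course": "machine-learning-zoomcamp",
        "documents": [
            {"question": "Q1", "text": "A1", "section": "General"},
            {"question": "Q2", "text": "A2", "section": "General"},
        ],
    },
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"question": "Q3", "text": "A3", "section": "General"},
        ],
    },
]

def select_course(raw, course_name):
    """Flatten the records of one course and tag each with its course name."""
    selected = []
    for course in raw:
        if course["course"] != course_name:
            continue
        for doc in course["documents"]:
            # Copy the record so the input data stays unchanged
            selected.append({**doc, "course": course_name})
    return selected

ml_sample = select_course(sample_raw, "machine-learning-zoomcamp")
print(len(ml_sample))
```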

Add them to Qdrant using the model from Q5.

When adding the data, use both question and answer fields:
```python
text = doc['question'] + ' ' + doc['text']
```

After the data is inserted, use the question from Q1 for querying the collection.
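
With a cosine-distance collection, the score of each returned record is the cosine similarity between the query vector and the stored vector. As a rough mental model, the query behaves like the brute-force search sketched below. This is an exact scan over toy vectors, not how Qdrant searches internally, and `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np

def top_k_cosine(V, q, k=5):
    """Return (index, score) pairs for the k rows of V most similar to q."""
    # Normalize rows and query so the dot product equals cosine similarity
    V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    scores = V_unit.dot(q_unit)
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy "collection" of four 2-dimensional vectors and a toy query
V = np.array([
    [2.0, 0.0],
    [1.0, 1.0],
    [0.0, 3.0],
    [-1.0, 0.0],
])
q = np.array([1.0, 0.5])

results = top_k_cosine(V, q, k=2)
highest_score = results[0][1]  # score of the first returned record
print(results)
```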

What's the highest score in the results? (The score for the first returned record):

Options:
- 0.97
- 0.87
- 0.77
- 0.67

In [11]:
import requests
from qdrant_client import QdrantClient, models

# Initialize smaller embedding model from Q5
small_embedding_model = TextEmbedding(small_model_name)

# Get the dimension of the smaller model
small_model_info = None
for model in TextEmbedding.list_supported_models():
    if model['model'] == small_model_name:
        small_model_info = model
        break

SMALL_EMBEDDING_DIMENSIONALITY = small_model_info['dim']

# Fetch documents
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

# Filter only machine-learning-zoomcamp documents
ml_documents = []
for course in documents_raw:
    course_name = course['course']
    if course_name != 'machine-learning-zoomcamp':
        continue
    
    for doc in course['documents']:
        doc['course'] = course_name
        ml_documents.append(doc)

print(f"\nFound {len(ml_documents)} machine learning zoomcamp documents")

# Initialize Qdrant client
client = QdrantClient("http://localhost:6333")

# Create new collection for Q6
collection_name_q6 = "ml-zoomcamp-homework"

# Delete collection if it exists (for clean start)
try:
    client.delete_collection(collection_name_q6)
    print(f"Deleted existing collection: {collection_name_q6}")
except Exception:
    print(f"Collection {collection_name_q6} doesn't exist, creating new one")

# Create the collection
client.create_collection(
    collection_name=collection_name_q6,
    vectors_config=models.VectorParams(
        size=SMALL_EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)

# Prepare points for insertion
points = []
for i, doc in enumerate(ml_documents):
    # Concatenate question and text
    full_text = doc['question'] + ' ' + doc['text']
    
    point = models.PointStruct(
        id=i,
        vector=models.Document(text=full_text, model=small_model_name),
        payload={
            "text": doc['text'],
            "section": doc['section'],
            "question": doc['question'], 
            "course": doc['course'],
            "full_text": full_text
        }
    )
    points.append(point)

# Insert points
print("Inserting points into Qdrant...")
client.upsert(
    collection_name=collection_name_q6,
    points=points
)
print("✓ Points inserted successfully")

# Query using the question from Q1
query_q1 = "I just discovered the course. Can I join now?"
print(f"\nQuerying with: '{query_q1}'")

# Search in the collection
search_results = client.query_points(
    collection_name=collection_name_q6,
    query=models.Document(
        text=query_q1,
        model=small_model_name
    ),
    limit=5,
    with_payload=True
)

# Get the highest score (first result)
highest_score = search_results.points[0].score
print(f"Highest score (first returned record): {highest_score:.4f}")


Using model: BAAI/bge-small-en
Model dimension: 384
\nFound 375 machine learning zoomcamp documents
Deleted existing collection: ml-zoomcamp-homework
Created collection: ml-zoomcamp-homework
\nPrepared 375 points for insertion
Inserting points into Qdrant...
✓ Points inserted successfully
\nQuerying with: 'I just discovered the course. Can I join now?'
\nSearch results:
Rank 1:
  Score: 0.8703
  Question: The course has already started. Can I still join it?
  Course: machine-learning-zoomcamp
  Text preview: Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the ...

Rank 2:
  Score: 0.8692
  Question: How long is the course?
  Course: machine-learning-zoomcamp
  Text preview: Approximately 4 months, but may take more if you want to do some extra activities (an extra project,...

Rank 3:
  Score: 0.8683
  Question: I’m new to Slack and can’t find the course channel. Where is it?
  Course: machine-learning-zoomcamp
  Text preview: Here’s how 