# Lesson 1 - Semantic Search

Welcome to Lesson 1. 

To access the `requirements.txt` file, go to `File` and click on `Open`.
 
I hope you enjoy this course!

### Import the Needed Packages

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [51]:
from datasets import load_dataset
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

import os
import time
import torch

from dotenv import load_dotenv
load_dotenv('/Volumes/Kaam-Dhanda/PersonalProjects/Trail-Workspace/Vector_Database/.env')

True

In [52]:
from tqdm.auto import tqdm

### Load the Dataset

In [31]:
dataset = load_dataset(
    "glue",
    "qqp",
    split="train[240000:290000]"
)

In [37]:
dataset = dataset[:]

In [44]:
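# Collect the text of both question columns into a single list
questions = []
for record in dataset['question1']:
    questions.append(record)

for record in dataset['question2']:
    questions.append(record)

# Deduplicated copy; note it is stored in `question` (singular) and used later for the upsert
question = list(set(questions))
# The preview and count below still use the full `questions` list, duplicates included
print('\n'.join(questions[:10]))
print('-' * 50)
print(f'Number of questions: {len(questions)}')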

Are cruise missiles fake?
How do I start a designing business?
I forgot my Facebook password and email password. How can I log into Facebook?
What's the best monitor size for a 1920x1080 resolution?
What does 'insight' mean? If I were to give my insights on a philosophical reading/topic, how should I structure my answer?
Laptop with 1.7GHz Intel Core i3 4005U, 8GB RAM, and Nvidia GeForce 930M(2gb). Is it good for medium gaming without lag?
What is a suitable solar panel installation provider near Marina, California CA?
Mainly who is responsible for corruption in India, and why?
What are the differences in life between Chinese and western cultures?
What is the best way to charge a capacitor?
--------------------------------------------------
Number of questions: 100000
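
The cell above removes duplicates with `set`, which does not preserve order, so the deduplicated list (and therefore which question gets which vector ID) can differ between runs. Here is a minimal sketch of an order-preserving alternative, assuming reproducible IDs matter to you. `dedupe_keep_order` is a hypothetical helper and the sample questions are toy data, not part of the course code.

```python
def dedupe_keep_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict preserves insertion order (Python 3.7+), so the keys come out
    # in the order they were first seen.
    return list(dict.fromkeys(items))


# Toy data for illustration only
sample = [
    "How do I learn Python?",
    "What is a vector database?",
    "How do I learn Python?",
]
unique = dedupe_keep_order(sample)
print(unique)
```

With this approach, `question[:vector_limit]` would always select the same questions for a given dataset slice.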


### Check CUDA and Set Up the Model

**Note**: "Checking CUDA" means checking whether a GPU is available for faster compute. This course runs on CPUs, so some code cells may take a little longer to run.

We are using the *all-MiniLM-L6-v2* sentence-transformers model, which maps sentences to a 384-dimensional dense vector space.

In [45]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print('Sorry no cuda.')
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

Sorry no cuda.
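
If you run this notebook on your own machine, the device check can be extended a little. The sketch below is an optional variation, not the course's method: it also tries PyTorch's `mps` backend (Apple-silicon GPUs) and falls back to `'cpu'` when PyTorch is missing. `pick_device` is a hypothetical helper name, and the CPU fallback on import failure is an assumption made for illustration.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Assumption: fall back to CPU when PyTorch is not installed.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend exists only in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed as the `device` argument of `SentenceTransformer`, the same way the cell above does.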


In [46]:
query = 'which city is the most populated in the world?'
xq = model.encode(query)
xq.shape

(384,)

### Set Up Pinecone

In [53]:
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
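
`os.getenv` returns `None` when the variable is missing, and the resulting error only shows up later when the client is used. A small sketch of failing early instead, assuming you prefer a clear message at this point. `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for this example.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file."
        )
    return value


# Demo with a hypothetical variable so no real key is needed.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```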

In [56]:
pinecone = Pinecone(api_key=PINECONE_API_KEY)
INDEX_NAME = 'dl-ai-ehbsbzticlwqtafsvvpixsspbkmydweq'

# Start from a clean slate: delete the index if it already exists
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)
print(INDEX_NAME)

# The index dimension must match the embedding size of the model (384 here)
pinecone.create_index(
    name=INDEX_NAME,
    dimension=model.get_sentence_embedding_dimension(),
    metric='cosine',
    spec=ServerlessSpec(cloud='aws', region='us-east-1'),
)

index = pinecone.Index(INDEX_NAME)
print(index)

dl-ai-ehbsbzticlwqtafsvvpixsspbkmydweq
<pinecone.db_data.index.Index object at 0x34fa12cf0>
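
The index is created with `metric='cosine'`, so the scores returned by later queries are cosine similarities between the query embedding and the stored embeddings. As a rough illustration of what that metric measures, here is a pure-Python sketch on toy 2-D vectors. `cosine_similarity` is a hypothetical helper, and this is not how Pinecone computes scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1; perpendicular ones score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Because cosine similarity ignores vector length, only the direction of an embedding affects its score.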


### Create Embeddings and Upsert to Pinecone

In [62]:
batch_size = 200
vector_limit = 10000

# `question` is the deduplicated list built earlier; keep only the first 10,000 entries
questions = question[:vector_limit]

for i in tqdm(range(0, len(questions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(questions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in questions[i:i_end]]
    
    # create embeddings
    xc = model.encode(questions[i:i_end])
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

100%|██████████| 50/50 [01:33<00:00,  1.87s/it]
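
The loop above walks the list in steps of `batch_size` and clamps the end index with `min`, which is why 10,000 questions at 200 per batch give the 50 iterations shown in the progress bar. The same index arithmetic can be sketched on its own; `batch_ranges` is a hypothetical helper name.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in chunks."""
    for start in range(0, total, batch_size):
        # The last batch may be shorter, so clamp the end index.
        yield start, min(start + batch_size, total)


ranges = list(batch_ranges(10000, 200))
print(len(ranges), ranges[0], ranges[-1])

# An uneven total leaves a shorter final batch.
print(list(batch_ranges(5, 2)))
```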


In [63]:
index.describe_index_stats()

{'_response_info': {'raw_headers': {'connection': 'keep-alive',
                                    'content-length': '189',
                                    'content-type': 'application/json',
                                    'date': 'Sun, 28 Dec 2025 19:15:31 GMT',
                                    'grpc-status': '0',
                                    'server': 'envoy',
                                    'x-envoy-upstream-service-time': '39',
                                    'x-pinecone-request-id': '3477912484416197520',
                                    'x-pinecone-request-latency-ms': '39',
                                    'x-pinecone-response-duration-ms': '40'}},
 'dimension': 384,
 'index_fullness': 0.0,
 'memoryFullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'__default__': {'vector_count': 10000}},
 'storageFullness': 0.0,
 'total_vector_count': 10000,
 'vector_type': 'dense'}

### Run Your Query

In [72]:
# small helper function so we can repeat queries later
def run_query(query):
    """Embed the query, fetch the 10 nearest questions, and print score : text."""
    embedding = model.encode(query).tolist()
    results = index.query(top_k=10, vector=embedding, include_metadata=True, include_values=False)
    for result in results['matches']:
        print(f"{round(result['score'], 2)} : {result['metadata']['text']}")

In [73]:
run_query('which city has the highest population in the world?')

0.65 : What are the most dangerous cities in the world?
0.58 : Which country has the most ethnically diverse population in the world?
0.54 : What's the highest mountain in the world?
0.54 : Which is India's highest peak?
0.54 : Who has the most money in the world?
0.53 : What is the least known country in the world?
0.5 : How many people die in the world per second?
0.49 : Which city has the best biriyani in India?
0.49 : Which are the top 10 dangerous cities in the US? Why are they dangerous?
0.48 : What are the most boring capital cities?
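
Conceptually, the query returns the `top_k` stored vectors that score highest against the query embedding. Pinecone does this at scale with its own indexing; the sketch below only illustrates the idea with a brute-force scan over toy 2-D vectors. The names `cosine`, `brute_force_top_k`, and `store` are hypothetical, and the vectors are made up for the example.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query_vec, store, top_k=2):
    """Score every stored vector against the query and keep the best top_k."""
    scored = ((cosine(query_vec, vec), text) for text, vec in store.items())
    return heapq.nlargest(top_k, scored)


# Toy 2-D "embeddings"; the real ones from the model have 384 dimensions.
store = {
    "close match": [1.0, 0.1],
    "partial match": [1.0, 1.0],
    "unrelated": [0.0, 1.0],
}
results = brute_force_top_k([1.0, 0.0], store)
for score, text in results:
    print(f"{round(score, 2)} : {text}")
```

A full scan like this costs one similarity computation per stored vector, which is the work a vector database is designed to avoid on large collections.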


In [66]:
query = 'how do i make chocolate cake?'
run_query(query)

0.57 : How do I make cream cheese?
0.56 : What is the best way to eat Ferrero Rocher chocolates?
0.55 : How do you make baking soda?
0.51 : How can I make Whipped cream at home?
0.49 : What is the best chocolate brand according to you?
0.48 : How is baking soda made?
0.46 : Where can I get good quality cupcakes and a lot of different flavor in Gold Coast?
0.45 : Where do we get the best cakes in Mumbai?
0.44 : Why does the Easter bunny give away chocolate eggs?
0.43 : What are the best ways to cook sweet potatoes?


In [74]:
query = 'Why am I so Lonely?'
run_query(query)

0.73 : What can I do to not feel lonely?
0.68 : Why do people feel lonely even though they are with people?
0.67 : What do you do when you feel lonely?
0.66 : What can you do when you are lonely?
0.65 : What should be done when one feels lonely?
0.63 : I think I'm doomed to be single for the rest of my life. Why do I feel this way and how can I fix it?
0.53 : How can I get over the loneliness I feel that stems from being the only atheist I know?
0.53 : Why am I afraid with no reason?
0.52 : I'm a high school student. Why am I attracted to unavailable girls?
0.49 : Why is life sad?
