# Semantic Search
Semantic search retrieves items by meaning rather than by exact keyword match. It is typically implemented with vector similarity search (also called vector search): texts are stored as vectors in a vector database, and a distance metric is used to find the stored vectors most similar to a given query vector. This task is commonly referred to as nearest neighbor search.
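
As a rough illustration of nearest neighbor search, the sketch below ranks a handful of made-up 3-dimensional vectors against a query by brute force. The helper names (`cosine_similarity`, `nearest_neighbors`) and the toy vectors are assumptions for illustration only; a vector database would normally do this at scale with an index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, top_k=2):
    """Return (index, score) pairs for the top_k most similar vectors."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional "embeddings" (made up for illustration)
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

for idx, score in nearest_neighbors(query, vectors):
    print(idx, round(score, 3))
```

The two vectors that point in nearly the same direction as the query are ranked first, regardless of how the remaining vectors are ordered.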

![](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png)

**Note:** To learn more about semantic search, see this [article](https://txt.cohere.com/what-is-semantic-search/).



### Semantic Search with Hugging Face LLM and Pinecone

#### 1. Loading the Quora Questions Dataset
   - Loading the dataset containing questions from Quora.

#### 2. Converting Questions to Vectors
   - Encoding each question in the dataset into a high-dimensional vector representation using a Hugging Face sentence-transformers model.

#### 3. Indexing Vectors in Pinecone
   - Storing and indexing the question vectors in Pinecone to enable efficient similarity search.

#### 4. Performing Semantic Search
   - Using the indexed vectors in Pinecone to perform semantic search given a query question.

#### 5. Displaying Search Results
   - Presenting the top K most similar questions for the query.

### Step 1: Install the Required Libraries and Import the Needed Packages

Get a Pinecone API key by signing up or logging in at the [Pinecone console](https://app.pinecone.io).

In [1]:
%pip install -q -U datasets sentence-transformers pinecone-client


In [2]:
from google.colab import userdata
PINECONE_API_KEY=userdata.get('PINECONE_API_KEY')

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone, ServerlessSpec
import os
import time
import torch

In [5]:
from tqdm.auto import tqdm

### Step 2: Load the [Hugging Face Dataset](https://huggingface.co/datasets/quora)

In [6]:
dataset = load_dataset('quora', split='train[240000:290000]')


In [7]:
# Let's take a look at a sample of our dataset
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

### Let's store all the unique questions in one place

In [8]:
# Collect the text of every question in the dataset
questions = []

# Each record holds a pair of questions; add both texts to the list
for record in dataset['questions']:
    questions.extend(record['text'])

# Remove duplicates by converting the list to a set and back to a list.
# The de-duplicated list is stored in `question` (singular) and is the one indexed later;
# note that a set does not preserve the original order.
question = list(set(questions))

# Print the first 10 questions from the full list (before de-duplication)
print('\n'.join(questions[:10]))

# Print a separator line for clarity
print('-' * 50)

# Print the total number of questions, including duplicates
print(f'Number of questions: {len(questions)}')


What is the truth of life?
What's the evil truth of life?
Which is the best smartphone under 20K in India?
Which is the best smartphone with in 20k in India?
Steps taken by Canadian government to improve literacy rate?
Can I send homemade herbal hair oil from India to US via postal or private courier services?
What is a good way to lose 30 pounds in 2 months?
What can I do to lose 30 pounds in 2 months?
Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?
How do you graph x + 2y = -2?
--------------------------------------------------
Number of questions: 100000
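
The cell above removes duplicates with `set`, which does not guarantee a stable order for strings, so the subset that gets indexed later can differ between runs. If reproducibility matters, one possible alternative is sketched below, assuming an order-preserving de-duplication is acceptable. `unique_in_order` is a hypothetical helper, not part of the original notebook.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

sample = [
    'What is the truth of life?',
    'How do you graph x + 2y = -2?',
    'What is the truth of life?',
]
unique = unique_in_order(sample)
print(len(sample), len(unique))
print(unique)
```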


### Step 3: Check CUDA and Set Up the Model

**Note**: "Checking cuda" refers to checking if you have access to GPUs (faster compute). In this course, we are using CPUs. So, you might notice some code cells taking a little longer to run.

We use the *all-MiniLM-L6-v2* sentence-transformers model, which maps sentences to a 384-dimensional dense vector space.

In [9]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print('Sorry no cuda.')

#This model maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

Sorry no cuda.



In [10]:
query = 'which city is the most populated in the world?'
xq = model.encode(query)
xq.shape

(384,)

As we can see, the model converts a sentence into a 384-dimensional vector.


### Step 4: Set Up Pinecone and Create an Index Object

A **vector database** stores data as high-dimensional vectors, representing features or attributes. Each vector has a set number of dimensions, ranging from tens to thousands, depending on data complexity.

When creating a vector index in Pinecone, it's important to specify a distance metric, which determines how distances between vectors are calculated and influences similarity measurements.

Pinecone is a cloud-based vector database, providing storage, indexing, and querying of high-dimensional vectors efficiently through internet-accessible APIs.

- **Index**: An index in Pinecone, built on vector embeddings, allows for fast and efficient searching, similarity calculations, and other operations, enabling quick retrieval of similar vectors based on proximity in the high-dimensional vector space.

- **Vector Store**: Pinecone serves as a specialized vector database optimized for storing and querying high-dimensional vectors. These vectors represent objects in a numerical space, with each dimension corresponding to object features or attributes. Pinecone's capabilities are ideal for applications like similarity search, recommendation systems, and natural language processing.


### Distance-Based Metrics

![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*FTVRr_Wqz-3_k6Mk6G4kew.png)
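
To make the choice of metric more concrete, the sketch below computes cosine similarity, Euclidean distance, and dot product for a few hand-picked 2-dimensional vectors. The vectors and helper names are assumptions for illustration. It also checks a standard identity: once vectors are scaled to unit length, the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
c = [4.0, 3.0]  # same length as a, different direction

# Cosine ignores length (a and b count as identical); Euclidean does not
print('a vs b:', cosine(a, b), euclidean(a, b), dot(a, b))
print('a vs c:', cosine(a, c), euclidean(a, c), dot(a, c))

# For unit-length vectors the three metrics carry the same information
ua, uc = normalize(a), normalize(c)
print(math.isclose(dot(ua, uc), cosine(a, c)))
print(math.isclose(euclidean(ua, uc) ** 2, 2 - 2 * cosine(a, c)))
```

Cosine similarity is a common choice for sentence embeddings because it compares direction rather than magnitude.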



In [11]:
from pinecone import Pinecone, PodSpec

# Initialize the Pinecone client with the API key
pinecone = Pinecone(api_key=PINECONE_API_KEY)

# Name of the index used throughout this notebook
INDEX_NAME = "semantic-search"

# If an index with this name already exists, delete it so we start from an empty index
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)

# Print the name of the index about to be created
print(INDEX_NAME)

# Create a new index whose dimension matches the embedding model (384)
# and that ranks vectors by cosine similarity
pinecone.create_index(
    name=INDEX_NAME,
    dimension=model.get_sentence_embedding_dimension(),
    metric='cosine',
    # Pod-based starter environment; a serverless alternative would be
    # ServerlessSpec(cloud='aws', region='us-west-2')
    spec=PodSpec(environment="gcp-starter"),
)

# Get a handle to the newly created index
index = pinecone.Index(INDEX_NAME)

# Print the index object
print(index)

semantic-search
<pinecone.data.index.Index object at 0x7f6824ba3a30>


### Step 5: Create Embeddings, Metadata, and IDs for Each Batch of Questions and Upsert to Pinecone

1. **Embeddings**: Encode each batch of questions into vectors with the model.
2. **Metadata**: Attach additional information to each question (for example source or category); this notebook stores the question text.
3. **IDs**: Assign a unique identifier to each question.
4. **Upsert to Pinecone**: Use Pinecone's API to store the embeddings, metadata, and IDs in the index.

Processing the questions in batches keeps each request small, and storing the text as metadata lets query results return the matching questions directly.
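
The batching logic behind these steps can be sketched without a model or a Pinecone index. The generator below is a hypothetical helper (`iter_batches` is not part of the notebook or of any library). It only shows how consecutive slices, string IDs, and metadata dictionaries line up, and that the number of batches is `ceil(len(texts) / batch_size)`.

```python
import math

def iter_batches(texts, batch_size):
    """Yield (ids, metadatas, batch_texts) for consecutive slices of texts."""
    for start in range(0, len(texts), batch_size):
        # The last batch may be shorter than batch_size
        end = min(start + batch_size, len(texts))
        batch_texts = texts[start:end]
        # IDs are the positions of the texts, stored as strings
        ids = [str(i) for i in range(start, end)]
        # Keep the original text alongside each vector
        metadatas = [{'text': text} for text in batch_texts]
        yield ids, metadatas, batch_texts

texts = ['q0', 'q1', 'q2', 'q3', 'q4']
batches = list(iter_batches(texts, batch_size=2))

print(len(batches), math.ceil(len(texts) / 2))
for ids, metadatas, _ in batches:
    print(ids, metadatas)
```

In the notebook's setting, each `batch_texts` slice would be passed to the model to produce the embeddings that are upserted together with `ids` and `metadatas`.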

In [12]:
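batch_size = 200      # number of questions embedded and upserted per request
vector_limit = 8000   # index only the first 8,000 de-duplicated questions

questions = question[:vector_limit]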

for i in tqdm(range(0, len(questions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(questions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in questions[i:i_end]]
    # create embeddings
    xc = model.encode(questions[i:i_end])
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)


In [13]:
# Check index statistics: vector count per namespace, number of dimensions, and index fullness.
# The reported count can lag behind recent upserts, which is likely why it is below 8,000 here.
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.068,
 'namespaces': {'': {'vector_count': 6800}},
 'total_vector_count': 6800}

### Step 6: Run Your Query

In [14]:
# Define a helper function to run a query using the given model and index
def run_query(query):
    # Encode the query using the model and convert it to a list
    embedding = model.encode(query).tolist()

    # Query the index with the encoded query
    # Set top_k=10 to retrieve the top 10 results
    # Set include_metadata=True to include metadata in the results
    # Set include_values=False to exclude vector values in the results
    results = index.query(top_k=10, vector=embedding, include_metadata=True, include_values=False)

    # Iterate over the matches in the results
    for result in results['matches']:
        # Print the score and corresponding metadata text for each match
        print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
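
`run_query` prints its results, which is convenient in a notebook but awkward to reuse. A possible variant that returns the matches instead is sketched below. It assumes only the response shape used above (`results['matches']`, with `score` and `metadata['text']` per match). `extract_matches` and its `min_score` parameter are hypothetical, and `fake_results` is a hand-written stand-in rather than real Pinecone output.

```python
def extract_matches(results, min_score=0.0):
    """Return (score, text) pairs from a query response, best first."""
    pairs = [
        (match['score'], match['metadata']['text'])
        for match in results['matches']
        if match['score'] >= min_score
    ]
    # Sort explicitly so the order does not depend on the input
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)

# Hand-written stand-in for a query response (not real Pinecone output)
fake_results = {
    'matches': [
        {'id': '1', 'score': 0.56,
         'metadata': {'text': 'What are the best date cake recipes?'}},
        {'id': '0', 'score': 0.77,
         'metadata': {'text': 'What are some ways to add chocolate chips to a cake mix?'}},
        {'id': '2', 'score': 0.46,
         'metadata': {'text': 'How can I make homemade pancakes without baking soda?'}},
    ]
}

for score, text in extract_matches(fake_results, min_score=0.5):
    print(f'{score:.2f}: {text}')
```

A score threshold like this can help filter out weak matches when the index contains no question close to the query.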

### Step 7: Get the Semantic Search Results for a Query

During semantic search, the query is encoded into a vector with the same model used for the questions, and Pinecone compares that vector with the indexed question vectors. Similarity is measured with cosine similarity (the metric chosen when the index was created), which scores how closely two vectors point in the same direction. The questions with the highest scores are returned as the results closest in meaning to the query.

![](https://radimrehurek.com/gensim/_images/sphx_glr_run_scm_001.png)

In [15]:
run_query('which city has the highest population in the world?')

0.7: Which is the most beautiful city in world?
0.64: Which city has the most museums per capita?
0.56: What is the best tourist place in world?
0.56: What is best city in India?
0.54: What are the largest slums in the world?
0.53: What is the least known country in the world?
0.52: What are the most unsafe cities in America?
0.52: What are the most powerful countries in the world?
0.51: What are the largest cities in Spain and what are they most known for?
0.5: Where is the most beautiful beach in the world?


In [16]:
query = 'how do i make chocolate cake?'
run_query(query)

0.77: What are some ways to add chocolate chips to a cake mix?
0.56: What are the best date cake recipes?
0.55: How can I learn about baking cakes and desserts?
0.53: How do you make a perfume out of Skittles?
0.5: How do I make a red fondant?
0.49: What is a red velvet cake?
0.49: How do I make art?
0.49: Are You Looking For Tasty Chocolates in Bangalore?
0.46: How can one make the Mint Mojito coffee at home similar to the one at Phillz?
0.46: How can I make homemade pancakes without baking soda?


As we can see, semantic search returns questions that are related in meaning to the query, even when they use different wording.