#### This Notebook use Vector Search and store Embedding into a vector store along with Indexing

#### Author: Saurabh Mangal (saurabhmangal@google.com)
##### Date: 21st Feb
##### Description: This notebook contains part 3 of lab

 Copyright (c) [2024] [saurabhmangal@] -- 
 This notebook is licensed under the Commercial License.

### Querying a created index

In [143]:
import json

# build dicts for product names and embs
product_names = {}
product_embs = {}
product_text = {}
with open("questions_test.json") as f:
    for l in f.readlines():
        p = json.loads(l)
        id = p["id"]
        product_names[id] = p["id"]
        product_text[id] = p['splitted_texts_chunks']
        product_embs[id] = p["embedding"]

In [144]:
# get the embedding for ID 6523 "cloudveil women's excursion short"
# you can also try with other IDs such as 12711, 18090, 19536 and 11863
query_emb = product_embs["0"]

In [148]:
# run query
response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID, queries=[query_emb], num_neighbors=3
)

# show the results
for idx, neighbor in enumerate(response[0]):
    print(f"{neighbor.distance:.2f} {product_names[neighbor.id]} {product_text[neighbor.id]}")

1.00 0 CHAPTER ONE
THE BOY WHO LIVED
M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say
0.83 270 CHAPTER TWO
THE VANISHING GLASS
N early ten years had passed since the Dursleys had woken up to find
0.79 94 ut the Potters.…
Mrs. Dursley came into the living room carrying two cups of tea. It was
no good. He


### Run Query
Finally it's ready to use Vector Search. In the following code, it creates an embedding for a test question, and find similar question with the Vector Search.

In [149]:
import time
import tqdm  # to show a progress bar

# get embeddings for a list of texts
BATCH_SIZE = 5

# Load the text embeddings model
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

def get_embeddings_wrapper(texts):
    embs = []
    for i in tqdm.tqdm(range(0, len(texts), BATCH_SIZE)):
        time.sleep(1)  # to avoid the quota error
        result = model.get_embeddings(texts[i : i + BATCH_SIZE])
        embs = embs + [e.values for e in result]
    return embs

In [150]:

df = pd.read_csv('df_exploded_2.csv')


test_embeddings = get_embeddings_wrapper(["Who is the best help to Harry Potter?"])
# Test query
response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=test_embeddings,
    num_neighbors=20,
)

# show the result
import numpy as np

for idx, neighbor in enumerate(response[0]):
    id = np.int64(neighbor.id)
    similar = df.query("id == @id", engine="python")
    print(f"{neighbor.distance:.4f} {similar.splitted_texts_chunks.values[0]}")


100%|██████████| 1/1 [00:01<00:00,  1.11s/it]

0.0592 the way up the street, screaming for sweets. Harry Potter come
and live here!”
“It’s the best place 
0.0421  last place you would expect astonishing things to

0.0291 The

0.0286  calmly. “Voldemort had powers I will
never have.”
“Only because you’re too — well — noble to use th
0.0278 y way here.”
Professor McGonagall sniffed angrily.
“Oh yes, I’ve celebrating, all right,” she said i
0.0273 id Professor McGonagall. “And I don’t suppose you’re going to
tell me why you’re here, of all places
0.0269 Well, I just thought…maybe…it was something to do with…you
know…her crowd.”
Mrs. Dursley sipped her 
0.0268 cked Mrs.
Dursley on the cheek, and tried to kiss Dudley good-bye but missed, because
Dudley was now
0.0264 er sister and her good-for-nothing
husband were as unDursleyish as it was possible to be. The Dursle
0.0262 We’ve had precious
little to celebrate for eleven years.”
“I know that,” said Professor McGonagall i
0.0254 em.”
“It’s lucky it’s dark. I haven’t blushed so much sinc




### Get an existing Index
To get an index object that already exists, replace the following [your-index-id] with the index ID and run the cell. You can check the ID on the Vector Search Console > INDEXES tab.


In [151]:
%pip install --upgrade google-cloud-aiplatform -q

Note: you may need to restart the kernel to use updated packages.


In [152]:
from google.cloud import aiplatform

PROJECT_ID = "my-project-0004-346516"
REGION = LOCATION = "us-central1"
VPC_NETWORK = "matching-engine-vpc"
PEERING_RANGE_NAME = "ann-langchain-me-range"

aiplatform.init(project=PROJECT_ID, location=LOCATION)

### Update all this information below 

#### this setting is obtained from matching engine end point
##### https://console.cloud.google.com/vertex-ai/locations/us-central1/index-endpoints/3345510418113101824/deployed-indexes/vs_quickstart_deployed_02060053?


In [153]:
# my_index_id = "vs-quickstart-index-endpoint-02051523"  # @param {type:"string"}
# my_index = aiplatform.MatchingEngineIndex(my_index_id)
del(my_index)
my_index = aiplatform.MatchingEngineIndex(
    index_name='projects/255766800726/locations/us-central1/indexes/7702109738896457728'
)

my_index_endpoint_id = "projects/255766800726/locations/us-central1/indexEndpoints/3018929076384563200"



# my_index_endpoint_id = "[your-index-endpoint-id]"  # @param {type:"string"}
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(my_index_endpoint_id)


#### Querying the earlier created index

In [154]:
from langchain_google_vertexai import VertexAIEmbeddings

embeddings = VertexAIEmbeddings(model="textembedding-gecko@001")

# text = "This is a test document."

# query_result = embeddings.embed_query(text)


Model_name will become a required arg for VertexAIEmbeddings starting from Feb-01-2024. Currently the default is set to textembedding-gecko@001


In [155]:
# this is embedding vector (should be created by calling the embeddings models)

text = "harry potter owl and the green colur boy."

test_embeddings = embeddings.embed_query(text)
print("preview embeddings",test_embeddings[0:2])

preview embeddings [-0.0006251202430576086, 0.03292103856801987]


### Update the information below

In [156]:
# this setting is obtained from matching ending https://console.cloud.google.com/vertex-ai/locations/us-central1/index-endpoints/3345510418113101824/deployed-indexes/vs_quickstart_deployed_02060053?project=my-project-0004-346516


from google.cloud import aiplatform_v1

# Set variables for the current deployed index.
API_ENDPOINT="87003740.us-central1-255766800726.vdb.vertexai.goog"
INDEX_ENDPOINT= "projects/255766800726/locations/us-central1/indexEndpoints/3018929076384563200"
DEPLOYED_INDEX_ID="vs_feature_deployed_02201235"
neighbor_count = 10

In [157]:


# Configure Vector Search client
client_options = {
  "api_endpoint": API_ENDPOINT
}
vector_search_client = aiplatform_v1.MatchServiceClient(
  client_options=client_options,
)

# Build FindNeighborsRequest object
datapoint = aiplatform_v1.IndexDatapoint(
  feature_vector=test_embeddings
)
query = aiplatform_v1.FindNeighborsRequest.Query(
  datapoint=datapoint,
  # The number of nearest neighbors to be retrieved
  neighbor_count=neighbor_count
)
request = aiplatform_v1.FindNeighborsRequest(
  index_endpoint=INDEX_ENDPOINT,
  deployed_index_id=DEPLOYED_INDEX_ID,
  # Request can have multiple queries
  queries=[query],
  return_full_datapoint=False,
)

# Execute the request
response = vector_search_client.find_neighbors(request)

# Handle the response
# print(response)

df_new = pd.DataFrame()
print('neighbor_count', neighbor_count)
for i in range(0,neighbor_count):
    x=response.nearest_neighbors[0]
    # print('id',x.neighbors[i].datapoint.datapoint_id, 'type', type(x.neighbors[i].datapoint.datapoint_id), 'distance',x.neighbors[i].distance)
    
    df_match = df.loc[df['id'] == int(x.neighbors[i].datapoint.datapoint_id) ]

    # Append the matching rows to the new DataFrame
    df_new = pd.concat([df_new, df_match])

# Print the new DataFrame
print(df_new)


neighbor_count 10
     index                                     pagewise_texts  page_id  \
46       1  something peculiar — a cat reading a map. For ...        2   
132      4  agree.”\nHe didn’t say another word on the sub...        5   
147      5  nearest street lamp went out with a little pop...        6   
202      7  Professor McGonagall’s voice trembled as she w...        8   
218      8  see how much better off he’ll be, growing up a...        9   
133      4  agree.”\nHe didn’t say another word on the sub...        5   
97       3  learned a new word (“Won’t!”). Mr. Dursley tri...        4   
141      5  nearest street lamp went out with a little pop...        6   
154      5  nearest street lamp went out with a little pop...        6   
161      6  “It certainly seems so,” said Dumbledore. “We ...        7   

                                        splitted_texts  \
46   something peculiar — a cat reading a map. For ...   
132  agree.”\nHe didn’t say another word on the sub

In [158]:
def get_id_with_embedding_matching(test_embeddings) :
    
    datapoint = aiplatform_v1.IndexDatapoint(
      feature_vector=test_embeddings
    )
    query = aiplatform_v1.FindNeighborsRequest.Query(
      datapoint=datapoint,
      # The number of nearest neighbors to be retrieved
      neighbor_count=neighbor_count
    )
    request = aiplatform_v1.FindNeighborsRequest(
      index_endpoint=INDEX_ENDPOINT,
      deployed_index_id=DEPLOYED_INDEX_ID,
      # Request can have multiple queries
      queries=[query],
      return_full_datapoint=False,
    )

    # Execute the request
    response = vector_search_client.find_neighbors(request)
    
    df_new = pd.DataFrame()

    for i in range(0,neighbor_count):
        x=response.nearest_neighbors[0]
        # print('id',x.neighbors[i].datapoint.datapoint_id, 'distance',x.neighbors[i].distance)

        df_match = df.loc[df['id'] == int(x.neighbors[i].datapoint.datapoint_id) ]

        # Append the matching rows to the new DataFrame
        df_new = pd.concat([df_new, df_match])

    # Print the new DataFrame
    # print(df_new)
    
    i,j,k = df_new.index[0:3]
    print(i,j,k)
    
    pagewise_texts_v1 = df_new.loc[i, 'pagewise_texts']
    pagewise_texts_v2 = df_new.loc[j, 'pagewise_texts']
    pagewise_texts_v3 = df_new.loc[k, 'pagewise_texts']
    
    splitted_texts_v1 = df_new.loc[i, 'splitted_texts']
    splitted_texts_v2 = df_new.loc[j, 'pagewise_texts']
    splitted_texts_v3 = df_new.loc[k, 'pagewise_texts']
    
    splitted_texts_chunks_v1 = df_new.loc[i, 'splitted_texts_chunks']
    splitted_texts_chunks_v2 = df_new.loc[j, 'splitted_texts_chunks']
    splitted_texts_chunks_v3 = df_new.loc[k, 'splitted_texts_chunks']
    
    page_id_v1 = df_new.loc[i, 'page_id'] 
    page_id_v2 = df_new.loc[j, 'page_id'] 
    page_id_v3 = df_new.loc[k, 'page_id'] 
    
    return(pagewise_texts_v1,pagewise_texts_v2,pagewise_texts_v3,
           splitted_texts_v1,splitted_texts_v2,splitted_texts_v3,
           splitted_texts_chunks_v1,splitted_texts_chunks_v2,splitted_texts_chunks_v3,
        page_id_v1,page_id_v2,page_id_v3,i,j,k)

In [159]:
import pandas as pd
filename = "./harry_potte_qa.csv"
df_qa = pd.read_csv(filename, sep ="|")

df_qa.head()

Unnamed: 0,Question,Answer
0,What is the name of the magical creature that ...,Thestral
1,What is the name of the school newspaper at Ho...,The Daily Prophet
2,What is the name of the magical map that shows...,Marauder's Map
3,Which Hogwarts house does Luna Lovegood belong...,Ravenclaw
4,What magical creature is known for guarding Gr...,Ukrainian Ironbelly (a dragon)


In [160]:
df_qa.columns

Index(['Question', 'Answer'], dtype='object')

In [161]:
# for i in range(0, len(df_qa)):
#     df_qa.loc[i, "Question_emb"] = embeddings.embed_query( df_qa.loc[i, "Question"])
#     # print("preview embeddings",test_embeddings[0:2])
    
import csv
import csv

with open('harry_potte_qa.csv', 'r') as input_file, open('harry_potte_qa_output.csv', 'w', newline='') as output_file:

  # Create CSV reader and writer objects
  reader = csv.reader(input_file, delimiter='|')
  writer = csv.writer(output_file, delimiter='|')

  # Read and write the header row
  header = next(reader) + ['i','j','k','pagewise_texts_v1','pagewise_texts_v2','pagewise_texts_v3','splitted_texts_v1','splitted_texts_v2','splitted_texts_v3','splitted_texts_chunks_v1','splitted_texts_chunks_v2','splitted_texts_chunks_v3','page_id_v1','page_id_v2','page_id_v3']
  writer.writerow(header)

  # Loop through the remaining rows
  for i, row in enumerate(reader):
    question = row[0].split('|')[0]  # Use 'i' to access the correct element in the row
    question_emb = embeddings.embed_query( question )
    pagewise_texts_v1,pagewise_texts_v2,pagewise_texts_v3,splitted_texts_v1,splitted_texts_v2,splitted_texts_v3,splitted_texts_chunks_v1,splitted_texts_chunks_v2,splitted_texts_chunks_v3,page_id_v1,page_id_v2,page_id_v3,i,j,k = get_id_with_embedding_matching(question_emb) 
    
    # print( i , question)
    row_out = row + [i,j,k,pagewise_texts_v1,pagewise_texts_v2,pagewise_texts_v3,splitted_texts_v1,splitted_texts_v2,splitted_texts_v3,splitted_texts_chunks_v1,splitted_texts_chunks_v2,splitted_texts_chunks_v3,page_id_v1,page_id_v2,page_id_v3]
    
    # Write the row to the output file
    writer.writerow(row_out)

# Usage example:
! head -n 2 harry_potte_qa_output.csv

202 11 147
46 292 202
11 97 292
202 11 46
9 11 46
202 261 9
183 226 238
161 46 11
202 24 9
133 161 202
11 202 273
202 292 315
202 261 161
167 202 238
292 273 11
202 11 100
11 20 311
202 147 247
202 144 46
202 147 124
24 202 9
11 261 202
202 46 146
124 161 143
11 202 9
124 147 146
46 11 133
11 327 311
161 202 267
282 298 11
202 147 46
11 202 184
202 11 161
46 11 202
202 161 11
46 161 124
11 161 202
202 171 161
11 292 46
202 247 276
11 24 311
46 261 1
46 161 202
46 171 141
261 11 202
11 202 327
143 184 289
11 46 9
202 20 11
46 202 261
46 202 247
24 202 141
202 11 100
202 171 11
46 249 24
202 46 141
184 11 161
202 261 282
202 11 292
202 161 46
202 46 261
124 161 166
261 161 46
5 202 184
11 97 202
46 171 5
311 46 202
46 202 11
202 171 46
11 161 46
202 1 141
261 46 289
202 68 161
11 292 46
46 184 261
184 202 46
171 261 96
11 184 20
202 5 184
202 261 11
11 202 161
202 11 261
202 16 12
20 327 238
184 202 171
4 20 46
11 273 4
161 261 289
11 9 161
4 11 20
4 9 11
11 161 46
24 161 11
202 46 11
16

In [162]:
import pandas as pd
filename = "./harry_potte_qa_output.csv"
df_qa = pd.read_csv(filename, sep ="|")

df_qa.head()

Unnamed: 0,Question,Answer,i,j,k,pagewise_texts_v1,pagewise_texts_v2,pagewise_texts_v3,splitted_texts_v1,splitted_texts_v2,splitted_texts_v3,splitted_texts_chunks_v1,splitted_texts_chunks_v2,splitted_texts_chunks_v3,page_id_v1,page_id_v2,page_id_v3
0,What is the name of the magical creature that ...,Thestral,202,11,147,Professor McGonagall’s voice trembled as she w...,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,nearest street lamp went out with a little pop...,We may never know.”\nProfessor McGonagall pull...,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,nearest street lamp went out with a little pop...,"the way up the street, screaming for sweets. H...",er sister and her good-for-nothing\nhusband we...,y way here.”\nProfessor McGonagall sniffed ang...,8,1,6
1,What is the name of the school newspaper at Ho...,The Daily Prophet,46,292,202,something peculiar — a cat reading a map. For ...,"pulling a spider off one of them, put them on....",Professor McGonagall’s voice trembled as she w...,something peculiar — a cat reading a map. For ...,"pulling a spider off one of them, put them on....",Professor McGonagall’s voice trembled as she w...,five\ndifferent people. He made several import...,"the stairs was full of them, and that was whe...","the way up the street, screaming for sweets. H...",2,14,8
2,What is the name of the magical map that shows...,Marauder's Map,11,97,292,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,learned a new word (“Won’t!”). Mr. Dursley tri...,"pulling a spider off one of them, put them on....",CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,learned a new word (“Won’t!”). Mr. Dursley tri...,"pulling a spider off one of them, put them on....",er sister and her good-for-nothing\nhusband we...,"After all,\nthey normally pretended she didn’...","the stairs was full of them, and that was whe...",1,4,14
3,Which Hogwarts house does Luna Lovegood belong...,Ravenclaw,202,11,46,Professor McGonagall’s voice trembled as she w...,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,something peculiar — a cat reading a map. For ...,We may never know.”\nProfessor McGonagall pull...,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,something peculiar — a cat reading a map. For ...,"the way up the street, screaming for sweets. H...",er sister and her good-for-nothing\nhusband we...,five\ndifferent people. He made several import...,8,1,2
4,What magical creature is known for guarding Gr...,Ukrainian Ironbelly (a dragon),9,11,46,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,something peculiar — a cat reading a map. For ...,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,CHAPTER ONE\nTHE BOY WHO LIVED\nM r. and Mrs. ...,something peculiar — a cat reading a map. For ...,y\ncould bear it if anyone found out about the...,er sister and her good-for-nothing\nhusband we...,five\ndifferent people. He made several import...,1,1,2


In [163]:
# question_emb = embeddings.embed_query( question )

In [164]:
# i,j,k = df_new.index[0:3]
# print(i,j,k)

In [165]:
# get_id_with_embedding_matching(question_emb)