In [None]:
!pip install voyageai

Collecting voyageai
  Downloading voyageai-0.1.7-py3-none-any.whl (25 kB)
Collecting aiolimiter<2.0.0,>=1.1.0 (from voyageai)
  Downloading aiolimiter-1.1.0-py3-none-any.whl (7.2 kB)
Installing collected packages: aiolimiter, voyageai
Successfully installed aiolimiter-1.1.0 voyageai-0.1.7


In [None]:
import os
import voyageai
os.environ['VOYAGE_API_KEY'] = "<your-voyage-api-key>"  # placeholder: never commit a real key to a notebook
vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

In [None]:
os.environ['VOYAGE_API_KEY']

'<your-voyage-api-key>'

# Vectorize/embed the documents

In [None]:
# Prepare data
documents = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen.",
    "20th-century innovations, from radios to smartphones, centered on electronic advancements.",
    "Rivers provide water, irrigation, and habitat for aquatic species, vital for ecosystems.",
    "Apple’s conference call to discuss fourth fiscal quarter results and business updates is scheduled for Thursday, November 2, 2023 at 2:00 p.m. PT / 5:00 p.m. ET.",
    "Shakespeare's works, like 'Hamlet' and 'A Midsummer Night's Dream,' endure in literature."
]

In [None]:
# Embed the documents
# Use the same model as for the query below; embeddings from different models are not comparable
documents_embeddings = vo.embed(documents, model="voyage-2", input_type="document").embeddings

If you are working with more than 8 documents, you will need to embed them in batches of at most 8, for example with a for loop.
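A minimal sketch of such a loop follows. The helper name `embed_in_batches` and the `embed_fn` argument are hypothetical, introduced only so the batching logic can run without an API key; the batch size of 8 comes from the limit stated above. The stand-in embedder is not a real model.

```python
def embed_in_batches(texts, embed_fn, batch_size=8):
    """Embed `texts` in chunks of at most `batch_size`, preserving order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedder (assumption, for illustration only): maps each text to a
# one-element vector holding its length and records the batch sizes it saw.
seen_batch_sizes = []

def fake_embed(batch):
    seen_batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

texts = [f"document {i}" for i in range(20)]
vectors = embed_in_batches(texts, fake_embed)
print(len(vectors), seen_batch_sizes)
```

With the real client, `embed_fn` would be a function that calls `vo.embed` on the batch with the same `model` and `input_type` as in the cell above and returns its `.embeddings`.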

# A minimalist retrieval system

The key property of these embeddings is that the cosine similarity between two embeddings captures the semantic relatedness of the corresponding passages. This lets us use the embeddings for semantic retrieval (search).
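To make the measure concrete, here is a plain-Python sketch of cosine similarity on toy 2-D vectors. The function name `plain_cosine_similarity` is hypothetical and the vectors are made up; real embeddings have many more dimensions, but the formula is the same: the dot product divided by the product of the two vector norms.

```python
import math

def plain_cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (illustrative only, not real embeddings)
same_direction = plain_cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = plain_cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = plain_cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not length, which is why a vector and a scaled copy of it score (up to rounding) 1.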

Suppose the user sends a "query" (e.g., a question or a comment) to the chatbot:

In [None]:
query = "When is Apple's conference call scheduled?"

To find the document most similar to the query among the existing data, we first embed (vectorize) the query:

In [None]:
# Get the embedding of the query
query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]

**Nearest neighbor search:** We can find the closest embeddings among the document embeddings by cosine similarity and retrieve the corresponding documents using the `k_nearest_neighbors` function defined below.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def k_nearest_neighbors(query_embedding, documents_embeddings, k=5):
    # Convert the inputs to numpy arrays
    query_embedding = np.array(query_embedding)
    documents_embeddings = np.array(documents_embeddings)

    # Reshape the query embedding to shape (1, n) so it is compatible with cosine_similarity
    query_embedding = query_embedding.reshape(1, -1)

    # Compute the cosine similarity between the query and every document
    cosine_sim = cosine_similarity(query_embedding, documents_embeddings)

    # Sort the document indices by similarity in descending order
    sorted_indices = np.argsort(cosine_sim[0])[::-1]

    # Take the top k indices and the corresponding embeddings
    top_k_related_indices = sorted_indices[:k]
    top_k_related_embeddings = documents_embeddings[top_k_related_indices]
    top_k_related_embeddings = [list(row) for row in top_k_related_embeddings]  # convert to list

    return top_k_related_embeddings, top_k_related_indices

In [None]:
# Use the nearest neighbor algorithm to find the document with the highest similarity
retrieved_embd, retrieved_embd_index = k_nearest_neighbors(query_embedding, documents_embeddings, k=1)
retrieved_doc = [documents[index] for index in retrieved_embd_index]

print(retrieved_doc)

['Apple’s conference call to discuss fourth fiscal quarter results and business updates is scheduled for Thursday, November 2, 2023 at 2:00 p.m. PT / 5:00 p.m. ET.']


**$k$-nearest neighbors search ($k$-NN):** It is often useful to retrieve not only the closest document but the $k$ closest documents. The `k_nearest_neighbors` function does this; nearest neighbor search is the special case $k=1$.

In [None]:
# Use the k-nearest neighbor algorithm to identify the top-k documents with the highest similarity
retrieved_embds, retrieved_embd_indices = k_nearest_neighbors(query_embedding, documents_embeddings, k=3)
retrieved_docs = [documents[index] for index in retrieved_embd_indices]

print(retrieved_docs)

['Apple’s conference call to discuss fourth fiscal quarter results and business updates is scheduled for Thursday, November 2, 2023 at 2:00 p.m. PT / 5:00 p.m. ET.', '20th-century innovations, from radios to smartphones, centered on electronic advancements.', 'Photosynthesis in plants converts light energy into glucose and produces essential oxygen.']
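The printed list shows the ranking but not how similar each document is to the query, so lower-ranked results may be only weakly related. One possible variation, sketched here in plain Python on toy 2-D vectors, returns the similarity score with each document. The helper names `_cosine` and `top_k_with_scores` and all the data are hypothetical; this is not the notebook's `k_nearest_neighbors` function.

```python
import math

def _cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_with_scores(query_vec, doc_vecs, docs, k=3):
    """Return the k (score, document) pairs with the highest cosine similarity."""
    scored = [(_cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy data (illustrative only, not real embeddings)
toy_docs = ["doc A", "doc B", "doc C"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]  # points mostly along doc A's direction

ranked = top_k_with_scores(toy_query, toy_vecs, toy_docs, k=2)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

Calling the same helper with `k=1` returns only the first of these pairs, which is the nearest neighbor case. A score threshold could then be applied to drop weakly related results, though a sensible cutoff depends on the model and data.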
