# Anthropic / VoyageAI Embeddings

(C) 2024 by [Damir Cavar](http://damir.cavar.me/)

The embeddings from [VoyageAI](https://www.voyageai.com/) are recommended by [Anthropic](https://www.anthropic.com/). You need a VoyageAI API key to use them.

The Python module `voyageai` is required.

In [None]:
!pip install -U voyageai

I store my API keys in a `secret.py` file in the same directory as this notebook. It defines the variable `voyageai_apikey`, which is imported here:

In [17]:
import voyageai
import csv
import os
from secret import voyageai_apikey
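
As an alternative to a `secret.py` file, the key could be read from an environment variable. The following is only a minimal sketch and not part of the original notebook: the helper name `load_api_key` is hypothetical, and the variable name `VOYAGE_API_KEY` is an assumption that should be checked against the VoyageAI documentation.

```python
import os

def load_api_key(env_var="VOYAGE_API_KEY"):
    """Return the API key stored in the given environment variable.

    `env_var` is an assumed name; adjust it to your own setup.
    Raises a RuntimeError if the variable is unset or empty.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return key
```

With this helper, the client could be created with `voyageai.Client(api_key=load_api_key())`, which keeps the key out of the notebook directory.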

We need to define the embedding model that we want to use, as well as the maximum batch size, that is, the number of words that we submit to the VoyageAI API in a single request:

In [18]:
model = "voyage-3"
batch_size = 128

We can create a client now to communicate with the remote VoyageAI API:

In [19]:
vo = voyageai.Client(api_key=voyageai_apikey)

The following function requests the embeddings for a word list and returns them:

In [20]:
def get_embeddings(wordlist):
    return vo.embed(wordlist, model=model, input_type="document").embeddings

The following function appends the embeddings to a CSV file. If the file does not exist yet, it first writes a header row:

In [24]:
def save_embeddings(wordlist, embeddings, output_file):
    # Create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
    # Write the header row only when the file is newly created
    write_header = not os.path.exists(output_file)
    with open(output_file, mode='a', encoding='utf-8', newline='') as ofp:
        writer = csv.writer(ofp)
        if write_header:
            writer.writerow(["word"] + [str(i) for i in range(len(embeddings[0]))])
        for word, embedding in zip(wordlist, embeddings):
            writer.writerow([word] + list(embedding))  # word followed by its embedding values
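
To use the stored vectors later, the CSV file can be read back into memory. The following is a minimal sketch that assumes the layout written by `save_embeddings` (one header row, then one row per word with the word in the first column); the function name `load_embeddings` is hypothetical and not part of the original notebook.

```python
import csv

def load_embeddings(input_file):
    """Read a CSV in the layout described above into a dict: word -> list of floats."""
    vectors = {}
    with open(input_file, mode="r", encoding="utf-8", newline="") as ifp:
        reader = csv.reader(ifp)
        next(reader, None)  # skip the header row ("word", "0", "1", ...)
        for row in reader:
            if row:  # ignore blank lines
                vectors[row[0]] = [float(value) for value in row[1:]]
    return vectors
```

If a word occurs in several rows, for example after running the notebook twice, the last row wins in this sketch.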

Let us try a set of words:

In [22]:
words = """
cat dog bird fish
car truck bike bus
apple banana orange pear
"""

The following code removes duplicate words and splits the word list into batches of at most `batch_size` words. For each batch, it requests the embeddings and appends them to the target CSV file.

In [25]:
new_words = list(dict.fromkeys(words.split()))  # remove duplicates, keep the original order
new_words_lists = [ new_words[i:i+batch_size] for i in range(0, len(new_words), batch_size) ]
output_file = os.path.join("data", "voyage_embeddings.csv")
for nwl in new_words_lists:
    if nwl:
        embeddings = get_embeddings(nwl)
        save_embeddings(nwl, embeddings, output_file)
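
The batching step can be tried in isolation, without calling the API. The following sketch uses a hypothetical generator `batches` and a deliberately small batch size of 2 to show how a list is split into consecutive slices.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

sample = ["cat", "dog", "bird", "fish", "car"]
chunks = list(batches(sample, 2))
print(chunks)  # [['cat', 'dog'], ['bird', 'fish'], ['car']]
```

The last batch may be shorter than the others, and an empty input produces no batches at all, so no request is sent in that case.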
