# OpenAI Embeddings

(C) 2024 by [Damir Cavar](http://damir.cavar.me/)

This notebook retrieves embeddings for words and phrases from OpenAI's embedding models. The following table compares the models:

| Model | Pages per dollar (approx.) | Performance on MTEB eval | Max input (tokens) |
|---|---|---|---|
| text-embedding-3-small | 62,500 | 62.3% | 8191 |
| text-embedding-3-large | 9,615 | 64.6% | 8191 |
| text-embedding-ada-002 | 12,500 | 61.0% | 8191 |
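
The pages-per-dollar figures translate into a rough cost estimate. The following is a minimal sketch that uses the values from the table above; the function name `estimate_cost_usd` and the page count are illustrative assumptions, and since billing is per token, the result should be treated as approximate:

```python
# Approximate pages per dollar, copied from the table above.
PAGES_PER_DOLLAR = {
    "text-embedding-3-small": 62_500,
    "text-embedding-3-large": 9_615,
    "text-embedding-ada-002": 12_500,
}


def estimate_cost_usd(pages, model):
    """Return a rough cost in US dollars for embedding `pages` pages."""
    return pages / PAGES_PER_DOLLAR[model]


# Hypothetical workload: 10,000 pages of text.
for model_name in PAGES_PER_DOLLAR:
    print(f"{model_name}: ${estimate_cost_usd(10_000, model_name):.2f}")
```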

For this code to work you will need the `openai` and `tenacity` Python modules.

In [None]:
!pip install -U openai tenacity

Create a file named `secret.py` in the same folder as this notebook, if it does not exist yet. In it, define the string variable `openai_apikey` and set it to your API key from OpenAI.
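
As a sketch, `secret.py` might contain a single assignment like the following. The value shown is a placeholder, not a real key:

```python
# secret.py
# Placeholder value (assumption for illustration); replace it with your own key.
openai_apikey = "sk-your-key-here"
```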

In [1]:
import openai
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type
from secret import openai_apikey

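Use either `text-embedding-3-small` or `text-embedding-3-large` for the embeddings. `EMBEDDING_CTX_LENGTH` is the maximum input length in tokens, and `EMBEDDING_ENCODING` names the tokenizer encoding used by these models.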

In [2]:
EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING = 'cl100k_base'
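
The constants `EMBEDDING_CTX_LENGTH` and `EMBEDDING_ENCODING` can be used to keep inputs within the model's limit. The following is a minimal sketch, assuming the separate `tiktoken` package is installed (`pip install tiktoken`); the helper names `truncate_tokens` and `truncate_text` are hypothetical and not part of this notebook:

```python
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING = "cl100k_base"


def truncate_tokens(tokens, max_tokens=EMBEDDING_CTX_LENGTH):
    """Keep at most `max_tokens` tokens from the start of a token list."""
    return list(tokens[:max_tokens])


def truncate_text(text, encoding_name=EMBEDDING_ENCODING,
                  max_tokens=EMBEDDING_CTX_LENGTH):
    """Tokenize `text`, truncate the tokens, and decode them back to a string."""
    import tiktoken  # imported here so truncate_tokens works without it

    encoding = tiktoken.get_encoding(encoding_name)
    tokens = truncate_tokens(encoding.encode(text), max_tokens)
    return encoding.decode(tokens)
```

Truncation discards the tail of a long text. Splitting the text into chunks and embedding each chunk separately is an alternative when the tail matters.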

Create a client for the OpenAI API endpoint:

In [3]:
client = openai.OpenAI(api_key=openai_apikey)

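The following function fetches the embedding vector for the string `text_or_tokens`. If the endpoint is busy or imposes a delay, the call is attempted up to six times, with a random exponential wait between 1 and 20 seconds between attempts. A `BadRequestError` is not retried: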

In [4]:
@retry(wait=wait_random_exponential(min=1, max=20),
       stop=stop_after_attempt(6),
       retry=retry_if_not_exception_type(openai.BadRequestError))
def get_embedding(text_or_tokens, client, model=EMBEDDING_MODEL):
    # Newlines are replaced with spaces before the request is sent.
    text_or_tokens = text_or_tokens.replace("\n", " ")
    return client.embeddings.create(input=text_or_tokens, model=model).data[0].embedding
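
To make the retry behaviour concrete, here is a plain-Python sketch of retrying with randomized exponential backoff. It is a simplified illustration of the idea, not a reproduction of the `tenacity` implementation; the names `retry_with_backoff` and `flaky` and the `sleep` parameter are hypothetical:

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=20.0,
                       sleep=time.sleep):
    """Call `func` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # The upper bound doubles with each attempt, capped at max_wait.
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))


calls = {"count": 0}


def flaky():
    """Fail on the first two calls, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("endpoint busy")
    return "ok"


# A no-op sleep keeps the demonstration fast.
result = retry_with_backoff(flaky, sleep=lambda seconds: None)
print(result, calls["count"])  # prints: ok 3
```

Unlike the decorated function above, this sketch retries on every exception type; a real version would let errors such as `BadRequestError` propagate immediately, since repeating an invalid request cannot succeed.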

Use this word list:

In [5]:
wordlist = list(set("""
cat dog fish bird
car truck bike bus
""".split()))

Loop over the word list and request the embedding vector from the OpenAI API endpoint:

In [7]:
for word in wordlist:
    try:
        embeddings = get_embedding(word, client)
        print(word, embeddings)
    except openai.BadRequestError as e:
        print(e)

car [-0.021504336968064308, 0.037356678396463394, 0.00785928312689066, 0.004147026222199202, -0.005187963135540485, 0.018360624089837074, -0.027674710378050804, -0.004673765506595373, -0.01412998791784048, 0.008176999166607857, -0.013586526736617088, -0.008498894982039928, -0.010844138450920582, -0.034112632274627686, -0.03471462056040764, 0.002023347420617938, -0.05551663786172867, -0.007871824316680431, 0.014798862859606743, -0.01797601953148842, 0.032557498663663864, -0.013410947285592556, 0.011839090846478939, 0.027958981692790985, 0.04250701889395714, 0.03367786481976509, 0.007541567552834749, 0.03023315779864788, -0.04849344864487648, -0.027591101825237274, 0.011421043425798416, 0.05026596784591675, -0.014104905538260937, 0.026303516700863838, -0.03461429104208946, 0.028494082391262054, -0.0015791724435985088, -0.0023661458399146795, -0.008795708417892456, 0.005463873967528343, -0.001173666911199689, -0.010292316786944866, -0.028744909912347794, 0.002888704650104046, 0.0140965441
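
The loop above sends one request per word. The embeddings endpoint also accepts a list of strings as `input`, so the whole word list can be sent in a single call and the resulting vectors compared. The following is a minimal sketch; `embed_batch` and `cosine_similarity` are hypothetical helper names, and the default model name mirrors `EMBEDDING_MODEL`:

```python
import math


def embed_batch(texts, client, model="text-embedding-3-large"):
    """Request embeddings for a list of strings in a single API call."""
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries the index of its input string.
    ordered = sorted(response.data, key=lambda item: item.index)
    return {text: item.embedding for text, item in zip(texts, ordered)}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With a real client, `vectors = embed_batch(wordlist, client)` followed by `cosine_similarity(vectors["cat"], vectors["dog"])` would give a similarity score for the two words. The retry decorator from `get_embedding` could be applied to `embed_batch` in the same way.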
