# Toy example of getting embeddings and using a vector store

In [1]:
import requests
from collections.abc import Sequence
import json

## Obtain embeddings

### Load the OpenAI API key

The key is stored in a file that is not tracked by the repo. Paste the key obtained from OpenAI into secretkey.txt.

In [2]:
API_SECRET_FN = "secretkey.txt"

with open(API_SECRET_FN, "r") as fh:
    APIKEY = fh.read().strip()

### Function to send embedding request

In [3]:
def get_embeddings(inp_text: str) -> list:
    """Send natural language text to the OpenAI Embeddings API.

    Args:
        inp_text (str): The text to embed

    Returns:
        list: the embedding vector (a list of floats) for the input text
    """
    header = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {APIKEY}"
    }
    
    data = {
        "input": inp_text,
        "model": "text-embedding-ada-002"
    }
    
    result = requests.post(
        url = "https://api.openai.com/v1/embeddings",
        headers = header,
        json = data
    )
    # Fail on HTTP errors (e.g. an invalid key) instead of a KeyError below
    result.raise_for_status()
    
    # A single input string yields exactly one entry in "data"
    return result.json()["data"][0]["embedding"]

### Example data

Here are 6 sentences, chosen to come from 2 different semantic contexts: food and repairing vehicles.

In [4]:
texts = [
    "burgers are good",
    "I love kebab",
    "Fries are the best",
    "I fix cars",
    "airplanes are easy to repair",
    "sometimes boats can break"
]

### Obtain embeddings for the toy dataset

In [5]:
embs = [get_embeddings(txt) for txt in texts]
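
The list comprehension above sends one request per sentence. The embeddings endpoint also accepts a list of strings as `input` and returns one entry per string, each carrying an `index` field. The sketch below only builds the request body and parses a response. The helper names `build_batch_payload` and `parse_batch_response` and the 2-dimensional vectors are made up for illustration and are not part of the notebook.

```python
def build_batch_payload(texts, model="text-embedding-ada-002"):
    """Build the JSON body for a single request that covers all texts."""
    return {"input": list(texts), "model": model}


def parse_batch_response(payload):
    """Return the embeddings ordered by the `index` field of each entry."""
    entries = sorted(payload["data"], key=lambda entry: entry["index"])
    return [entry["embedding"] for entry in entries]


# Hypothetical response with made-up vectors, deliberately out of order
fake_response = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}

batch_embs = parse_batch_response(fake_response)
print(batch_embs)  # [[0.1, 0.2], [0.3, 0.4]]
```

In the notebook, the payload would be passed as `json=` to the same `requests.post` call used in `get_embeddings`, and the parsed JSON response handed to `parse_batch_response`.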

## Store embeddings in database

I am working with a PostgreSQL database that runs in a Docker container defined in ./database.

The database has the vector extension (pgvector) activated, which provides the main functionality:
* Storing vectors of variable length
* Distance operations against a query vector
* Indexing of the vector column
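
The notebook does not show how the extension was activated or how the `embed` table was created; presumably this happens in the container setup under ./database. The sketch below shows one way it could look. The `id` column, the helper name `setup_schema`, and the fixed dimension are assumptions; 1536 is the output size of text-embedding-ada-002, and the column names match the insert statement used later.

```python
# Assumed schema: the notebook itself does not show this step.
EMBEDDING_DIM = 1536  # output size of text-embedding-ada-002

SETUP_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE TABLE IF NOT EXISTS embed ("
    "id bigserial PRIMARY KEY, "
    "rawtext text, "
    f"embeddings vector({EMBEDDING_DIM}))",
]


def setup_schema(engine):
    """Run the setup statements on a SQLAlchemy engine (hypothetical helper)."""
    # Imported lazily so the statements above can be inspected without SQLAlchemy
    from sqlalchemy import text

    # engine.begin() commits on success and rolls back on error
    with engine.begin() as conn:
        for statement in SETUP_STATEMENTS:
            conn.execute(text(statement))
```

Declaring the dimension on the column lets the database reject vectors of the wrong length at insert time.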

In [6]:
from sqlalchemy import create_engine, text

### Establish connection to database

In [7]:
engine = create_engine("postgresql+psycopg2://root:password@localhost:5432/postgres")

### Insert embeddings into database

In [8]:
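with engine.connect() as c:
    # Pair each sentence with its embedding instead of indexing both lists
    for rawtext, emb in zip(texts, embs):
        params = {
            # The vector is passed as a text literal such as [0.1,0.2,...]
            "insemb": str(emb).replace(" ", ""),
            "rawtext": rawtext
            }
        c.execute(text("insert into embed (rawtext, embeddings) values (:rawtext, :insemb)"), parameters = params)
    
    c.commit()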
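
The insert above turns each embedding into a text literal with `str(...).replace(" ", "")`, which relies on the embedding being a plain Python list. As a minimal sketch, the same literal can be built explicitly so that it also works for other sequences such as tuples. The helper name `to_vector_literal` is made up for illustration.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as a vector text literal, e.g. [0.1,0.2]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


literal = to_vector_literal((0.1, -0.25, 3))
print(literal)  # [0.1,-0.25,3.0]
```

For a list of floats the result is identical to what the notebook's `str(...).replace(" ", "")` produces.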

### Get distance from a query

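Compute the cosine distance (the `<=>` operator) between each stored embedding and a query embedding. The query below has no ORDER BY clause, so the rows come back unsorted; adding `order by embeddings <=> :qemb` would put the closest embeddings on top, since the default ordering is ASC.

Note how the distances differ between the 2 contexts: the three food sentences are all closer to the query than the three repair sentences.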
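
To make the numbers easier to read, here is a minimal pure-Python sketch of cosine distance, the quantity the `<=>` operator returns: one minus the cosine of the angle between two vectors. The function name and the 2-dimensional example vectors are chosen for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 (opposite direction)
```

The distance ranges from 0 to 2 and ignores vector length, so only the direction of the embeddings matters.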

In [9]:
querytext = "burgers go great with fries"
queryemb = get_embeddings(querytext)

with engine.connect() as c:
    res = c.execute(text("select embeddings <=> :qemb, rawtext from embed;"), parameters = {"qemb": str(queryemb).replace(" ", "")})
    # Fetch inside the block so the rows are read while the connection is open
    rows = res.fetchall()
    
for r in rows:
    print(r)

(0.0727817242665948, 'burgers are good')
(0.20274515086528044, 'I love kebab')
(0.08842932123787695, 'Fries are the best')
(0.2441495830388496, 'I fix cars')
(0.2594703722631435, 'airplanes are easy to repair')
(0.2721928901212477, 'sometimes boats can break')


### Create an index

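An index brings no benefit on a dataset this small; it is included only to show the method.

Vector columns can be indexed using IVFFlat, an approximate nearest neighbor (ANN) index. It partitions the vectors into lists (clusters), each represented by a cluster center, and a query scans only the lists whose centers are closest to the query vector.

An important parameter in the process is the number of such lists. For a fixed number of lists scanned per query, more lists mean faster search but worse recall. With lists = 1, as used here, every vector falls into the same list, so the search is effectively exhaustive.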
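
The trade-off can be illustrated with a toy inverted-file search in pure Python. This is a sketch of the idea, not of the extension's implementation: the centers are fixed by hand rather than learned by clustering, Euclidean distance is used instead of cosine distance, and all names and data points are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_lists(vectors, centers):
    """Assign each vector index to the list of its nearest center."""
    lists = {c: [] for c in range(len(centers))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centers)), key=lambda c: euclidean(vec, centers[c]))
        lists[nearest].append(idx)
    return lists


def search(query, vectors, centers, lists, probes=1, k=1):
    """Scan only the `probes` lists whose centers are closest to the query."""
    ranked = sorted(range(len(centers)), key=lambda c: euclidean(query, centers[c]))
    candidates = [idx for c in ranked[:probes] for idx in lists[c]]
    return sorted(candidates, key=lambda idx: euclidean(query, vectors[idx]))[:k]


# Made-up data: five points on a line and two hand-picked centers
vectors = [(-0.2, 0.0), (0.1, 0.0), (0.45, 0.0), (0.9, 0.0), (1.2, 0.0)]
centers = [(0.0, 0.0), (1.0, 0.0)]
lists = build_lists(vectors, centers)
query = (0.55, 0.0)

print(lists)                                               # {0: [0, 1, 2], 1: [3, 4]}
print(search(query, vectors, centers, lists, probes=1))    # [3]
print(search(query, vectors, centers, lists, probes=2))    # [2]
```

Scanning one list returns vector 3 although vector 2 is the true nearest neighbor, because vector 2 sits in the other list. Scanning both lists finds it, at the cost of comparing against every vector.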

In [10]:
with engine.connect() as c:
    c.execute(text("CREATE INDEX ON embed USING ivfflat (embeddings vector_cosine_ops) WITH (lists = 1)"))
    c.commit()