#### OpenAI Embeddings
Source:
- OpenAI Guide: https://platform.openai.com/docs/guides/embeddings
- Embedding Documentation: https://platform.openai.com/docs/api-reference/embeddings/create

OpenAI’s text embeddings measure the relatedness of text strings.
- An embedding is a vector (list) of floating point numbers.
- The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
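
As a rough illustration of the distance idea, the sketch below compares Euclidean distances between three made-up 3-dimensional vectors. The vectors and their names (`cat`, `kitten`, `car`) are hypothetical stand-ins, not real model output; real embeddings have far more dimensions.

```python
import numpy as np

# Hypothetical toy "embeddings" (not produced by any model).
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.2, 0.9])


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))


d_close = euclidean_distance(cat, kitten)
d_far = euclidean_distance(cat, car)

# The related pair should end up closer together than the unrelated pair.
print(f"cat vs kitten: {d_close:.3f}")
print(f"cat vs car:    {d_far:.3f}")
```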

Default embedding vector length and knowledge cutoff:
- text-embedding-3-small: 1536 dimensions; knowledge cutoff September 2021
- text-embedding-3-large: 3072 dimensions; knowledge cutoff September 2021

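Note: *OpenAI embeddings are normalized to length 1, i.e. the Frobenius norm of each embedding is 1.* <br>
Frobenius normalization: <br>
$A_{\text{norm}} = \frac{A}{\|A\|_F}$ <br>
where $A$ is a matrix with elements $a_{ij}$ and <br>
$\| A \|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}$ <br>
For a single embedding vector, the Frobenius norm reduces to the Euclidean ($L_2$) norm.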
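
The normalization formula can be checked numerically. The following is a minimal sketch: `frobenius_normalize` is a hypothetical helper name and the input is a made-up vector rather than a real embedding.

```python
import numpy as np


def frobenius_normalize(a):
    """Return a / ||a||_F; an all-zero input is returned unchanged."""
    a = np.asarray(a, dtype=float)
    # np.linalg.norm defaults to the Frobenius norm for 2-D input
    # and to the Euclidean (L2) norm for 1-D input.
    norm = np.linalg.norm(a)
    return a if norm == 0 else a / norm


vector = [3.0, 4.0]  # made-up values, not a real embedding
unit = frobenius_normalize(vector)

print(unit)
print(np.isclose(np.linalg.norm(unit), 1.0))
```

The same check can be run on an embedding returned by the API to confirm that its norm is approximately 1.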


In [22]:
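import os

import numpy as np
from openai import OpenAI

# Read the API key from the environment instead of hard-coding it in the notebook.
open_ai_key = os.environ.get("OPENAI_API_KEY")
openAI_params = {
    'api_key': open_ai_key
}
client = OpenAI(**openAI_params)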

In [101]:
# Example: Getting embeddings (Single String Example)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["My name is Abhinandan"]   
)

## Analyze the response
response_dict = response.to_dict()

# Single quotes inside the f-strings keep this valid on Python versions before 3.12.
print(f"Model used: {response_dict['model']}")
print(f"object returned: {response_dict['object']}")
print(f"Usage: {response_dict['usage']}")

embeddings_data = response_dict['data']
print(f"Number of Embeddings: {len(embeddings_data)}\n")
for i, embeddings in enumerate(embeddings_data):
    print(f"Details about Embeddings: {i+1}")
    print("Shape of Embeddings", np.shape(embeddings['embedding']))
    print("Index of Embeddings", embeddings['index'])
    print("Type of Embeddings", embeddings['object'])
    print("\n")

Model used: text-embedding-3-small
object returned: list
Usage: {'prompt_tokens': 6, 'total_tokens': 6}
Number of Embeddings: 1

Details about Embeddings: 1
Shape of Embeddings (1536,)
Index of Embeddings 0
Type of Embeddings embedding




In [52]:
# Example: Getting embeddings (Array of String Example)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["My name is Abhinandan", "He likes Ice Cream"]   
)

## Analyze the response
response_dict = response.to_dict()

print(f"Model used: {response_dict['model']}")
print(f"object returned: {response_dict['object']}")
print(f"Usage: {response_dict['usage']}")

embeddings_data = response_dict['data']
print(f"Number of Embeddings: {len(embeddings_data)}\n")
for i, embeddings in enumerate(embeddings_data):
    print(f"Details about Embeddings: {i+1}")
    print("Shape of Embeddings", np.shape(embeddings['embedding']))
    print("Index of Embeddings", embeddings['index'])
    print("Type of Embeddings", embeddings['object'])
    print("\n")

Model used: text-embedding-3-small
object returned: list
Usage: {'prompt_tokens': 10, 'total_tokens': 10}
Number of Embeddings: 2

Details about Embeddings: 1
Shape of Embeddings (1536,)
Index of Embeddings 0
Type of Embeddings embedding


Details about Embeddings: 2
Shape of Embeddings (1536,)
Index of Embeddings 1
Type of Embeddings embedding




In [53]:
# Example: Getting embeddings (Array of Integers, i.e. one pre-tokenized input given as token IDs)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[0, 1, 2, 3] 
)

## Analyze the response
response_dict = response.to_dict()

print(f"Model used: {response_dict['model']}")
print(f"object returned: {response_dict['object']}")
print(f"Usage: {response_dict['usage']}")

embeddings_data = response_dict['data']
print(f"Number of Embeddings: {len(embeddings_data)}\n")
for i, embeddings in enumerate(embeddings_data):
    print(f"Details about Embeddings: {i+1}")
    print("Shape of Embeddings", np.shape(embeddings['embedding']))
    print("Index of Embeddings", embeddings['index'])
    print("Type of Embeddings", embeddings['object'])
    print("\n")

Model used: text-embedding-3-small
object returned: list
Usage: {'prompt_tokens': 4, 'total_tokens': 4}
Number of Embeddings: 1

Details about Embeddings: 1
Shape of Embeddings (1536,)
Index of Embeddings 0
Type of Embeddings embedding




In [54]:
# Example: Getting embeddings (Array of Array of Integers, i.e. a batch of pre-tokenized inputs)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[[0, 1, 2, 3], [10,12, 22], [19, 14, 18]]
)

## Analyze the response
response_dict = response.to_dict()

print(f"Model used: {response_dict['model']}")
print(f"object returned: {response_dict['object']}")
print(f"Usage: {response_dict['usage']}")

embeddings_data = response_dict['data']
print(f"Number of Embeddings: {len(embeddings_data)}\n")
for i, embeddings in enumerate(embeddings_data):
    print(f"Details about Embeddings: {i+1}")
    print("Shape of Embeddings", np.shape(embeddings['embedding']))
    print("Index of Embeddings", embeddings['index'])
    print("Type of Embeddings", embeddings['object'])
    print("\n")

Model used: text-embedding-3-small
object returned: list
Usage: {'prompt_tokens': 10, 'total_tokens': 10}
Number of Embeddings: 3

Details about Embeddings: 1
Shape of Embeddings (1536,)
Index of Embeddings 0
Type of Embeddings embedding


Details about Embeddings: 2
Shape of Embeddings (1536,)
Index of Embeddings 1
Type of Embeddings embedding


Details about Embeddings: 3
Shape of Embeddings (1536,)
Index of Embeddings 2
Type of Embeddings embedding




#### Additional Parameters
- encoding_format: The format in which to return the embeddings. Can be either float (the default) or base64.
- dimensions: The number of dimensions the output embeddings should have. Only supported in text-embedding-3 and later models.
- user: A unique identifier representing your end user, which can help OpenAI monitor and detect abuse.
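
When encoding_format is base64, each embedding arrives as a string rather than a list of floats. The sketch below shows one way such a string could be decoded, assuming the payload is a sequence of little-endian 32-bit floats. That layout is an assumption to verify against the API reference. `decode_base64_embedding` is a hypothetical helper, and the payload here is built locally instead of coming from the API.

```python
import base64

import numpy as np


def decode_base64_embedding(encoded: str) -> np.ndarray:
    """Decode a base64 string into a float32 vector.

    Assumes the bytes are little-endian 32-bit floats.
    """
    raw = base64.b64decode(encoded)
    return np.frombuffer(raw, dtype="<f4")


# Stand-in payload built locally (not an API response).
original = np.array([0.5, -0.25, 0.125], dtype="<f4")
encoded = base64.b64encode(original.tobytes()).decode("ascii")

decoded = decode_base64_embedding(encoded)
print(decoded)
```

Requesting base64 mainly reduces response size; the decoded values are the same numbers the float format would return, up to float32 precision.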

In [117]:
# Example: Getting embeddings with additional parameters (dimensions, encoding_format)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[[0, 1, 2, 3], [10,12, 22], [19, 14, 18]],
    dimensions=12,
    encoding_format="float"
)

## Analyze the response
response_dict = response.to_dict()

print(f"Model used: {response_dict['model']}")
print(f"object returned: {response_dict['object']}")
print(f"Usage: {response_dict['usage']}")

embeddings_data = response_dict['data']
print(f"Number of Embeddings: {len(embeddings_data)}\n")
for i, embeddings in enumerate(embeddings_data):
    print(f"Details about Embeddings: {i+1}")
    print("Shape of Embeddings", np.shape(embeddings['embedding']))
    print("Index of Embeddings", embeddings['index'])
    print("Type of Embeddings", embeddings['object'])
    print("\n")

Model used: text-embedding-3-small
object returned: list
Usage: {'prompt_tokens': 10, 'total_tokens': 10}
Number of Embeddings: 3

Details about Embeddings: 1
Shape of Embeddings (12,)
Index of Embeddings 0
Type of Embeddings embedding


Details about Embeddings: 2
Shape of Embeddings (12,)
Index of Embeddings 1
Type of Embeddings embedding


Details about Embeddings: 3
Shape of Embeddings (12,)
Index of Embeddings 2
Type of Embeddings embedding
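
A full-length embedding that is already stored can also be shortened locally by keeping its first k values and rescaling the result to unit length. The sketch below illustrates that idea on a made-up vector. `shorten_embedding` is a hypothetical helper, and this is not claimed to reproduce exactly what the API does when dimensions is set.

```python
import numpy as np


def shorten_embedding(embedding, k):
    """Keep the first k values and rescale the result to unit L2 length."""
    truncated = np.asarray(embedding, dtype=float)[:k]
    norm = np.linalg.norm(truncated)
    # Guard against division by zero if the kept values are all zero.
    return truncated if norm == 0 else truncated / norm


full = np.array([0.5, 0.5, 0.5, 0.5])  # made-up unit-length vector
short = shorten_embedding(full, 2)

print(short.shape)
print(np.isclose(np.linalg.norm(short), 1.0))
```

Re-normalizing matters because truncation alone leaves a vector with norm below 1, which would distort distance and dot-product comparisons.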




#### Embedding utils

#### Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are, based on the cosine of the angle between them.

$\text{cosine similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$

Interpretation of Values: <br>
- **Cosine similarity = 1**: The vectors point in the same direction.
- **Cosine similarity = 0**: The vectors are orthogonal (at 90 degrees to each other), meaning they have no similarity.
- **Cosine similarity = -1**: The vectors point in opposite directions.

In [112]:
import numpy as np
from typing import List

def cosine_similarity(x: List | np.ndarray, y: List | np.ndarray) -> float:
    """
    Calculates the cosine similarity between two vectors

    Args:
      x (List | np.ndarray): Vector 1
      y (List | np.ndarray): Vector 2

    Returns:
        float: Cosine similarity between the two vectors (0.0 if either is a zero vector)
    """

    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)
    # Cosine similarity is undefined for a zero vector; fall back to 0.0 in that case.
    cosine = np.dot(x, y) / (norm_x * norm_y) if norm_x and norm_y else 0.0
    return cosine

x = [0, 0, 1, 9]
y = [2, 54, 13, 15]
cosine_similarity(x, y)

np.float64(0.28390859282968584)
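
For unit-length vectors, cosine similarity reduces to a plain dot product, and it is tied to Euclidean distance by $\|\mathbf{a} - \mathbf{b}\|^2 = 2 - 2\cos\theta$. Ranking by smallest distance or by largest cosine similarity therefore gives the same order. The sketch below checks both facts on the two example vectors after normalizing them; `normalize` is a hypothetical helper.

```python
import numpy as np


def normalize(v):
    """Scale a non-zero vector to unit L2 length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


a = normalize([0, 0, 1, 9])
b = normalize([2, 54, 13, 15])

cosine = float(np.dot(a, b))              # dot product of unit vectors
distance = float(np.linalg.norm(a - b))   # Euclidean distance between them

print(f"cosine similarity: {cosine:.6f}")
print(f"squared distance:  {distance ** 2:.6f}")
print(np.isclose(distance ** 2, 2 - 2 * cosine))
```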