# Embeddings

In this notebook, we explore how to access different types of embeddings
in LlamaIndex:


1.   OpenAI
2.   Google Gemini
3.   CohereAI
4.   Open-Source from HuggingFace



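Install the required packages by running the command below in Anaconda Prompt (Windows) or a terminal (Linux or macOS). The OpenAI and HuggingFace sections also import from the llama-index-embeddings-openai and llama-index-embeddings-huggingface packages, and key loading uses python-dotenv; install those too if they are not already present.

pip install llama-index-embeddings-gemini llama-index-embeddings-cohere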

## Load the Keys

In [8]:
import os
from dotenv import load_dotenv, find_dotenv

In [9]:
# Replace this path with the location of your own .env file
load_dotenv('/home/santhosh/Projects/courses/Pinnacle/.env')

True

In [10]:
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
GOOGLE_API_KEY = os.environ['GOOGLE_API_KEY']
HUGGINGFACE_API_KEY = os.environ['HUGGINGFACE_API_KEY']
COHERE_API_KEY = os.environ["COHERE_API_KEY"]
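
Indexing `os.environ` directly raises a `KeyError` on the first missing key. As a minimal sketch, assuming you would rather see every missing key at once, a small helper can report them before any client is created. The name `missing_keys` and the `demo_env` dictionary are hypothetical and used only for illustration.

```python
import os

def missing_keys(names, env=None):
    """Return the names that are absent or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

REQUIRED = ["OPENAI_API_KEY", "GOOGLE_API_KEY", "HUGGINGFACE_API_KEY", "COHERE_API_KEY"]

# A plain dict stands in for os.environ so the example runs without a real .env file.
demo_env = {"OPENAI_API_KEY": "sk-demo", "COHERE_API_KEY": ""}
print(missing_keys(REQUIRED, demo_env))
```

In the notebook itself, calling `missing_keys(REQUIRED)` after `load_dotenv(...)` would check the real environment.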

## 1. OpenAI Embeddings

In [4]:
from llama_index.embeddings.openai import OpenAIEmbedding

In [11]:
# Define embedding model with OpenAI Embedding
embed_model = OpenAIEmbedding(model='text-embedding-3-small', api_key=OPENAI_API_KEY)

In [15]:
# Get the text embedding
embedding = embed_model.get_text_embedding("The cat sat on the mat")


In [16]:
# Get the dimension of the embedding
len(embedding)

1536

In [17]:
embedding[:10] 

[-0.030747132375836372,
 -0.0495988205075264,
 -0.005045865662395954,
 -0.0015119818272069097,
 0.03620351850986481,
 -0.002039680490270257,
 -0.008921581320464611,
 0.027152638882398605,
 0.0070855459198355675,
 -0.011837258003652096]

In [18]:
# You can get embeddings in batches
embeddings = embed_model.get_text_embedding_batch([
    "What are Embeddings?",
    "In 1967, a professor at MIT built the first ever NLP program Eliza to understand natural language.",
])

In [19]:
len(embeddings)

2

In [20]:
embeddings[0][:5] # embeddings of the 1st sentence

[0.009798001497983932,
 -0.029589135199785233,
 -0.008733644150197506,
 0.006137794349342585,
 -0.010891923680901527]

In [21]:
embeddings[1][:5]  # embeddings of the 2nd sentence

[-0.052310045808553696,
 0.011894790455698967,
 -0.002292225370183587,
 0.00246118544600904,
 0.014733320102095604]

In [22]:
len(embeddings[1]) # length of each embedding

1536
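
Each embedding returned above is a plain Python list of floats, so two of them can be compared directly. The sketch below shows one common choice, cosine similarity, written with the standard library only. The helper name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not part of the LlamaIndex API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors stand in for real embeddings so the example runs offline.
v1 = [1.0, 0.0, 0.0]
v2 = [0.0, 1.0, 0.0]
v3 = [2.0, 0.0, 0.0]
print(cosine_similarity(v1, v2))  # orthogonal vectors
print(cosine_similarity(v1, v3))  # same direction, different length
```

With real data, `cosine_similarity(embeddings[0], embeddings[1])` would compare the two sentences embedded in the batch call.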

## 2. Using Google Gemini Embeddings

In [23]:
# imports
from llama_index.embeddings.gemini import GeminiEmbedding

In [25]:
model_name = "models/text-embedding-004"

In [26]:
embed_model = GeminiEmbedding(model_name=model_name, api_key=GOOGLE_API_KEY)

In [27]:
embeddings = embed_model.get_text_embedding("A journey to the centre of Earth")

In [28]:
print(f"Dimension of embeddings: {len(embeddings)}")

Dimension of embeddings: 768


In [29]:
embeddings[:5]

[0.0062123113, -0.0035789867, -0.036939133, 0.047532395, 0.047872357]

## 3. Using CohereAI Embeddings

In [30]:
from llama_index.embeddings.cohere import CohereEmbedding

In [31]:
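# Cohere v3 embedding models require an input_type:
# "search_query" suits queries, "search_document" suits texts that will be indexed
embed_model = CohereEmbedding(
    cohere_api_key=COHERE_API_KEY,
    model_name="embed-english-v3.0",
    input_type="search_query",
)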

In [32]:
embeddings = embed_model.get_text_embedding("Hello CohereAI!")

In [34]:
print(len(embeddings))

1024


In [35]:
print(embeddings[:5])

[-0.041931152, -0.022384644, -0.07067871, -0.011886597, -0.019210815]


## 4. Open-Source Embeddings from HuggingFace

In [36]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [38]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Alternative: "BAAI/bge-large-en-v1.5" (a larger model from the same family)

In [39]:
embedding = embed_model.get_text_embedding("Roses are red, and the sky is blue!?")

In [40]:
len(embedding)

384

In [41]:
embeddings = embed_model.get_text_embedding_batch(["Hugging Face Text Embeddings Inference",
                                                   "OpenAI Embedding",
                                                   "Open Source",
                                                   "Closed Source"])

In [42]:
len(embeddings)

4

In [43]:
embeddings[0][:5]

[-0.03793426603078842,
 0.031132353469729424,
 -0.04273267462849617,
 -0.017657337710261345,
 0.00661458820104599]

In [44]:
len(embeddings[0])

384
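
The two checks above, `len(embeddings)` and `len(embeddings[0])`, can be folded into one helper that also confirms every vector in the batch has the same length. This is a sketch only: `batch_shape` is a hypothetical name, and placeholder vectors replace the real model output so the example runs without downloading a model.

```python
def batch_shape(embeddings):
    """Return (count, dimension) for a batch, raising if vector lengths differ."""
    if not embeddings:
        return (0, 0)
    dims = {len(vector) for vector in embeddings}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return (len(embeddings), dims.pop())

# Placeholder batch: 4 vectors of 384 zeros, matching the shape seen above.
fake_batch = [[0.0] * 384 for _ in range(4)]
print(batch_shape(fake_batch))
```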

## 5. Loading a SOTA Embedding Model

In [35]:
embed_model = HuggingFaceEmbedding(model_name='WhereIsAI/UAE-Large-V1')

In [36]:
embedding = embed_model.get_text_embedding("Hugging Face Text Embeddings Inference")

In [37]:
print(embedding[:5])

[0.02925274148583412, 0.011324609629809856, 0.010814081877470016, -0.017124859616160393, 0.00806861650198698]


In [38]:
len(embedding)

1024
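
The models in this notebook return vectors of different sizes, which affects how much space a stored collection takes. The sketch below gathers the dimensions reported by the `len(...)` calls above and estimates raw storage, assuming each value is kept as a 32-bit float. That storage format is an assumption, and real vector stores add their own overhead.

```python
# Dimensions as reported by len(...) in the cells above.
DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "models/text-embedding-004": 768,
    "embed-english-v3.0": 1024,
    "BAAI/bge-small-en-v1.5": 384,
    "WhereIsAI/UAE-Large-V1": 1024,
}

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def storage_megabytes(dimension, num_vectors):
    """Approximate raw storage for num_vectors embeddings, in MB (10**6 bytes)."""
    return dimension * BYTES_PER_FLOAT32 * num_vectors / 1_000_000

for name, dim in DIMENSIONS.items():
    size = storage_megabytes(dim, 100_000)
    print(f"{name}: {dim} dims, about {size:.1f} MB per 100k vectors")
```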

You can check the top-ranked embedding models on the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard