# LlamaIndex Bottoms-Up Development - Embeddings
Embeddings are numerical representations of text. To generate embeddings for text, a specific model is required.

In LlamaIndex, the default embedding model is `text-embedding-ada-002` from OpenAI. You can also use any embedding model offered by LangChain and Hugging Face through our `LangchainEmbedding` wrapper.

In this notebook, we cover the low-level usage for both OpenAI embeddings and HuggingFace embeddings.

In [1]:
import openai
import os

os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
openai.api_key = os.environ["OPENAI_API_KEY"]

In [17]:
from llama_index.embeddings import OpenAIEmbedding
openai_embedding = OpenAIEmbedding()
embed = openai_embedding.get_text_embedding("hello world!")
print(len(embed))
print(embed[:10])

1536
[-0.007699163164943457, -0.005479877814650536, -0.015905963256955147, -0.0334259532392025, -0.01677805744111538, -0.0032573381904512644, -0.015437375754117966, -0.0020988842006772757, -0.0029791139531880617, -0.026969850063323975]
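
To make "numerical representation" a little more concrete, here is a minimal sketch of how two embedding vectors are typically compared, using cosine similarity. The three short vectors are made-up toy values, not real model output, and `cosine_similarity` is a helper defined here for illustration rather than a LlamaIndex function.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: similar direction
print(cosine_similarity(cat, car))     # close to 0: nearly orthogonal
```

Real embeddings work the same way, only with many more dimensions (1536 in the output above).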


## Custom Embeddings
While LlamaIndex can integrate with any embeddings offered by LangChain, you can also implement the `BaseEmbedding` class and run your own custom embedding model!

For this, we will use the `InstructorEmbedding` pip package to run the `hkunlp/instructor-large` model, found here: https://huggingface.co/hkunlp/instructor-large

In [3]:
# Install dependencies
# !pip install InstructorEmbedding torch transformers sentence_transformers

Test the embeddings! Instructor embeddings work by passing the model an instruction that tells it to represent text for a particular domain.

This makes sense for our llama-docs-bot, since we are searching very specific documentation!

Let's quickly test to make sure everything works.

In [18]:
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)

load INSTRUCTOR_Transformer
max_seq_length  512
[[-6.15552627e-02  1.04199704e-02  5.88438474e-03  1.93768851e-02
   5.71417809e-02  2.57655438e-02 -4.01949983e-05 -2.80044544e-02
  -2.92965565e-02  4.91884872e-02  6.78200200e-02  2.18692329e-02
   4.54528667e-02  1.50187155e-02 -4.84451763e-02 -3.25259715e-02
  -3.56492773e-02  1.19935405e-02 -6.83917757e-03  3.03126313e-02
   5.17491512e-02  3.48140411e-02  4.91032843e-03  6.68928549e-02
   1.52824540e-02  3.54217142e-02  1.07743582e-02  6.89828768e-02
   4.44019474e-02 -3.23419608e-02  1.24268020e-02 -2.15528086e-02
  -1.62690766e-02 -4.15058173e-02 -2.42291158e-03 -3.07157822e-03
   4.27047275e-02  1.56428572e-02  2.57812925e-02  5.92843145e-02
  -1.99174173e-02  1.32361818e-02  1.08408015e-02 -4.00610566e-02
  -1.36213051e-03 -1.57032814e-02 -2.53812131e-02 -1.31972972e-02
  -7.83779565e-03 -1.14009101e-02 -4.82025519e-02 -2.58416049e-02
  -4.98769898e-03  4.98239547e-02  1.19490270e-02 -5.55060506e-02
  -2.82120295e-02 -3.3220872

Looks good! But we can see the output is batched (i.e. a list of lists), so we need to undo the batching in our implementation!
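
As a rough illustration of what "undoing the batching" means, the sketch below uses `toy_encode`, a hypothetical stand-in for a batched encoder. It is not the `INSTRUCTOR` API: the real model returns a NumPy array, while this stand-in returns plain nested lists so the example runs without any dependencies.

```python
from typing import List


def toy_encode(pairs: List[List[str]]) -> List[List[float]]:
    """Hypothetical batched encoder: one vector per [instruction, text] pair."""
    return [[float(len(instruction)), float(len(text))] for instruction, text in pairs]


instruction = "Represent the Science title:"

# A single input still comes back wrapped in a batch of size 1 ...
batched = toy_encode([[instruction, "hello"]])
# ... so the single-text methods take element 0 to undo the batching.
single = batched[0]

# For several inputs, the batched output already has the shape we want.
many = toy_encode([[instruction, text] for text in ["hello", "hi"]])

print(len(batched), len(single))
print(len(many))
```

The single-text and single-query methods need the `[0]` step, while the batch method can return the encoder output as-is.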

There are only 4 methods we need to implement below.

In [19]:
from typing import Any, List
from InstructorEmbedding import INSTRUCTOR
from llama_index.embeddings.base import BaseEmbedding

class InstructorEmbeddings(BaseEmbedding):
    def __init__(
        self, 
        instructor_model_name: str = "hkunlp/instructor-large",
        instruction: str = "Represent the Computer Science text for retrieval:",
        **kwargs: Any,
    ) -> None:
        self._model = INSTRUCTOR(instructor_model_name)
        self._instruction = instruction
        super().__init__(**kwargs)

    def _get_query_embedding(self, query: str) -> List[float]:
        # use the instance's model (not a global) and unbatch the single result
        embeddings = self._model.encode([[self._instruction, query]])
        return embeddings[0].tolist()

    async def _aget_query_embedding(self, query: str) -> List[float]:
        # no native async support, so fall back to the sync implementation
        return self._get_query_embedding(query)

    def _get_text_embedding(self, text: str) -> List[float]:
        embeddings = self._model.encode([[self._instruction, text]])
        return embeddings[0].tolist()

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        # encode the whole batch at once; the output is already a list of lists
        embeddings = self._model.encode([[self._instruction, text] for text in texts])
        return embeddings.tolist()

In [20]:
# set the batch size to 1 to avoid memory issues
# if you have a large GPU, you can increase this
instructor_embeddings = InstructorEmbeddings(embed_batch_size=1)

load INSTRUCTOR_Transformer
max_seq_length  512


In [21]:
embed = instructor_embeddings.get_text_embedding("How do I create a vector index?")
print(len(embed))
print(embed[:10])

768
[0.003987083211541176, 0.01212295051664114, 0.0026905445847660303, 0.015817083418369293, -0.0055559673346579075, 0.03673828765749931, 0.010727006942033768, 0.006661377381533384, -0.0392913743853569, 0.013146862387657166]


## Custom Embeddings w/ LlamaIndex

Since the Instructor model has a max sequence length of 512 tokens, we set the chunk size to 512 as well.

However, if the input text is longer than that, no error is raised; only the first 512 tokens are captured in the embedding!
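
To see why chunking matters, here is a minimal sketch that splits a long text into pieces that each fit under a token limit. It assumes whitespace-separated words as a rough proxy for tokens; a real tokenizer usually produces more tokens than words, so treat the numbers as illustrative. `split_into_chunks` is a hypothetical helper, not part of LlamaIndex.

```python
from typing import List


def split_into_chunks(text: str, max_tokens: int = 512) -> List[str]:
    """Split text into pieces of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


# 1200 words would overflow a single 512-token window,
# so everything past the limit would be silently ignored.
long_text = "word " * 1200
chunks = split_into_chunks(long_text, max_tokens=512)
print([len(chunk.split()) for chunk in chunks])  # [512, 512, 176]
```

Setting `chunk_size=512` in the service context plays a similar role: each node handed to the embedding model should fit within the window.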

In [22]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=instructor_embeddings, chunk_size=512)
set_global_service_context(service_context)

In [9]:
import os
import sys
sys.path.append(os.path.join(os.getcwd(), '..'))

from llama_docs_bot.indexing import create_query_engine

# remove any existing indices
# !rm -rf ./*_index

query_engine = create_query_engine()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [10]:
response = query_engine.query('What is the Sub Question query engine?')
response.print_response_stream()

The Sub Question query engine is a system that breaks down complex queries into smaller subquestions through a process called query decomposition. It analyzes the query and identifies different parts or subqueries within it. Each subquery is then routed to a specific subindex within a composed graph, which represents a subset of the overall knowledge corpus. By transforming the original query into simpler subquestions, the engine is able to provide more suitable and targeted answers from the data. This approach is particularly useful for handling complex queries that require knowledge augmentation.

In [11]:
print(response.get_formatted_sources(length=256))

> Source (Node id: 313d0f40-e2b8-467c-841d-311b5b592e34): Sub question: How does the Sub Question query engine work?
Response: The Sub Question query engine works by breaking down a complex query into smaller subquestions. This is done through a process called query decomposition. The engine analyzes the query...

> Source (Node id: e14dc321-291f-4d8a-a0b3-6b6a6e6c0259): Sub question: What are the different components of the Sub Question query engine?
Response: The different components of the Sub Question query engine are:
1. Single-step query decomposition: This component transforms a complicated question into a simple...

> Source (Node id: c27eeb80-53ec-42df-a45c-3e170dcf7b2c): Sub question: How can I configure the Sub Question query engine?
Response: To configure the Sub Question query engine, you can follow these steps:

1. Identify the specific query transformation technique you want to use. In this case, it would be the si...


### Compare to default embeddings

Note that an index must be using the same embedding model at query time that was used to create the index.

So below, we delete the existing indices and rebuild them using OpenAI embeddings.
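
A small sketch of one way mixing models goes wrong: the Instructor vectors above have 768 dimensions, while the OpenAI vectors have 1536, so a similarity score between them cannot even be computed. `dot_product` is a hypothetical helper written for this illustration, and the zero-filled vectors are placeholders rather than real embeddings.

```python
from typing import List


def dot_product(query_vec: List[float], stored_vec: List[float]) -> float:
    """Dot product that refuses to compare vectors of different sizes."""
    if len(query_vec) != len(stored_vec):
        raise ValueError(
            f"dimension mismatch: query has {len(query_vec)}, "
            f"index has {len(stored_vec)}"
        )
    return sum(q * s for q, s in zip(query_vec, stored_vec))


stored_vec = [0.0] * 768   # placeholder for a vector stored with Instructor
query_vec = [0.0] * 1536   # placeholder for a query embedded with OpenAI

try:
    dot_product(query_vec, stored_vec)
except ValueError as err:
    print(err)
```

Even when two models happen to share a dimension, their vectors live in different spaces, so the scores would not be meaningful.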

In [12]:
from llama_index.embeddings.openai import OpenAIEmbedding

service_context = ServiceContext.from_defaults(llm=llm, embed_model=OpenAIEmbedding(), chunk_size=512)
set_global_service_context(service_context)

# delete old vector index so we can re-create it
!rm -rf ./*_index

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [13]:
query_engine = create_query_engine()

response = query_engine.query('What is the Sub Question query engine?')
response.print_response_stream()

The Sub Question query engine is a system that breaks down complex questions into smaller subqueries. It uses a query decomposition feature to transform the complicated question into a simpler one over a data collection. By routing the query to multiple subindexes within a composed graph, the engine provides sub-answers to the original question. It consists of components such as a query engine, natural language query input, rich response output, indices, retrievers, composed graph, and query decomposition. The engine can be configured by defining sub-questions, creating a query engine, configuring retrievers, composing the query engine, and testing and refining the system.

In [14]:
print(response.get_formatted_sources(length=256))

> Source (Node id: d3467693-43d5-4a9e-871c-b425d94e2285): Sub question: How does the Sub Question query engine work?
Response: The Sub Question query engine works by breaking down a complex question into smaller subqueries. It uses a single-step query decomposition feature to transform the complicated question...

> Source (Node id: 392bd3f5-83ad-4728-ad66-fe1cca196689): Sub question: What are the different components of the Sub Question query engine?
Response: The different components of the Sub Question query engine are:

1. Query engine: It is a generic interface that allows users to ask questions over their data.

2...

> Source (Node id: e9dca9c4-62b8-492e-8918-75a14eb83e75): Sub question: How can I configure the Sub Question query engine?
Response: To configure the Sub Question query engine, you need to follow these steps:

1. Define the sub-questions: Determine the specific questions you want the query engine to answer. Th...


# Conclusion
In this notebook, we showed how to use embeddings at a low level, as well as how to create your own embeddings class.

If you want to use these embeddings in your project (which we will be doing in future guides!), you can use the example below.