## Embedding Techniques

Embeddings convert text into numeric vectors, so that texts with similar meaning end up close to each other in vector space.

Calling `load_dotenv()` parses a `.env` file and loads the variables it finds as environment variables.



In [None]:
import os
from dotenv import load_dotenv

load_dotenv() # Load all the environment variables
print(f"OPENAI_API_KEY: {os.environ['OPENAI_API_KEY'][:10]}***") # Ensure OPENAI_API_KEY is set

OPENAI_API_KEY: sk-proj-tu***
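
To make the loading step more concrete, here is a minimal sketch of roughly what it does, using only the standard library. The helper `load_env_lines`, the sample lines, and the `fake_env` dict are hypothetical and exist only for illustration. The real `python-dotenv` parser also handles cases this sketch ignores, such as `export` prefixes and variable interpolation. The sketch assumes the default behaviour where variables that are already set are not overwritten.

```python
import os

def load_env_lines(lines, environ=os.environ):
    """Simplified stand-in for load_dotenv(): set KEY=VALUE pairs that are not already defined."""
    loaded = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and anything without an '=' separator
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, _, value = line.partition('=')
        key, value = key.strip(), value.strip().strip('"\'')
        # Existing variables win, mirroring load_dotenv's default override=False
        if key not in environ:
            environ[key] = value
            loaded[key] = value
    return loaded

# Hypothetical .env content, with a plain dict standing in for os.environ
fake_env = {'EXISTING': 'keep-me'}
loaded = load_env_lines(['# comment', 'DEMO_KEY="abc123"', 'EXISTING=overwritten'], environ=fake_env)
print(loaded)
print(fake_env)
```

Passing a plain dict instead of `os.environ` keeps the example free of side effects. In the notebook itself, `load_dotenv()` is the call to use.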


Next, create an embedding model using LangChain's `OpenAIEmbeddings` wrapper.

Doc: [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)

In [5]:
from langchain_openai import OpenAIEmbeddings

model_name = 'text-embedding-3-small'
embeddings = OpenAIEmbeddings(model=model_name)
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x110d0bcb0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x1115206e0>, model='text-embedding-3-small', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [6]:
text = 'This is a tutorial on OpenAI embeddings'
query_result = embeddings.embed_documents(texts=[text])

In [8]:
print(f'{len(query_result)} vectors with size {0 if len(query_result) == 0 else len(query_result[0])}')

1 vectors with size 1536


## Embed documents

Because we will use **ChromaDB**, install it first with `pip install chromadb`.

1. Load the TXT file using LangChain's `TextLoader`
2. Split the text into chunks using LangChain's `RecursiveCharacterTextSplitter`
3. Call `OpenAIEmbeddings` to embed the chunks into vectors
4. Store the embeddings in `ChromaDB`

In [10]:
import os
# In recent LangChain releases these classes live in separate packages
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

file_name = 'levski.txt'
assert os.path.exists(file_name)

text_loader = TextLoader(file_path=file_name)
text_docs = text_loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)
split_docs = text_splitter.split_documents(text_docs)
print(f'{len(split_docs)} chunks')  # A bare expression mid-cell would not be displayed

# Create OpenAI embeddings
model_name = 'text-embedding-3-small'
embedding = OpenAIEmbeddings(model=model_name)

# Create ChromaDB
chroma_store = Chroma.from_documents(documents=split_docs, embedding=embedding,
                                     persist_directory='./chromadb')
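
To illustrate what `chunk_size` and `chunk_overlap` mean, the following is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` is implemented: the real splitter first tries to break on separators such as paragraph and line breaks, while this sketch cuts purely by character position. The function name `chunk_with_overlap` and the sample string are hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size chunks; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = 'abcdefghijklmnopqrstuvwxyz'
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap is there so that a sentence cut at a chunk boundary still appears with some surrounding context in the neighbouring chunk.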

Query **ChromaDB** for the chunks most similar to a query string:

In [None]:
query = 'Levski have won a total of 74 trophies'
chroma_store.similarity_search(query=query, k=3)
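
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity on toy vectors in plain Python. It is an illustration only: the metric Chroma actually uses depends on how the collection is configured, and a real vector store uses an index rather than scanning every vector. The names `cosine_similarity` and `top_k` and all the vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy 3-dimensional vectors standing in for real 1536-dimensional embeddings
doc_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]
ranked = top_k(query_vec, doc_vecs, k=3)
print(ranked)
```

Here `k=3` plays the same role as the `k` argument passed to `similarity_search`: it limits the result to the three closest matches.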