**Coursebook: Understanding Embedding in LLM**

- Part 3 of Understanding Embedding in LLM
- Course Length: 9 hours
- Last Updated: July 2023
---

Developed by Algoritma's Research and Development division

## Background

The coursebook is part of the **Large Language Models Specialization** developed by [Algoritma](https://algorit.ma/). The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference etc.

# Understanding Embedding in LLM 

## Training Objectives

- **Understanding Embeddings in Large Language Models (LLM) for Natural Language Processing**
   - Basic concepts of embeddings in LLM
   - Usage of embeddings in natural language processing
   - Demonstration of embeddings usage in text analysis

- **Advanced Embeddings in Large Language Models (LLM) for Text Processing**
   - In-depth understanding of embeddings in LLM
   - Introduction to popular embedding techniques like Word2Vec and FastText
   - Implementation of embeddings in text processing using Python
   - Demonstration of embedding techniques in text processing tasks

- **Advanced Applications of Embeddings in Text Processing with Large Language Models (LLM)**
   - Application of embeddings in advanced text processing
   - Usage of embeddings for text classification and contextual understanding
   - Demonstration of embeddings usage in advanced tasks

## Understanding Embeddings in Large Language Models (LLM) for Natural Language Processing


We have built a GPT chatbot on top of an LLM that can generate answers from our own data. Now let's dive into how an LLM makes sense of natural language by looking at embeddings.

### Basic Concept of Embedding (Vector)

A vector (or embedding) is an array of numbers. That on its own is exciting, but what is even more exciting is that these arrays can represent more complex data like text, images, audio or even video. In the case of text, these representations are designed to capture semantic and syntactic relationships between words, allowing algorithms to understand and process language more effectively.

Word embeddings, specifically, are dense vector representations that encode the meaning of a word based on its context in a large corpus of text. In simpler terms, they map words to numerical vectors in a high-dimensional space, where similar words are located closer to each other. The resulting vectors are usually stored in a vector database (we will talk about this later).
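To make "similar words are located closer to each other" concrete, here is a minimal sketch with hand-picked 3-dimensional vectors. The numbers in `toy_vectors` are made up for illustration and do not come from any real embedding model, which would use hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(toy_vectors["cat"], toy_vectors["dog"]))  # small: related words
print(euclidean(toy_vectors["cat"], toy_vectors["car"]))  # larger: unrelated words
```

In this toy space "cat" sits much closer to "dog" than to "car", which is the property a trained embedding model tries to achieve for real vocabulary.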

These embeddings are created by an embedding model, and there are many embedding models to choose from. OpenAI also provides embedding models, but we will use a free, open-source model so that we don't run out of credit: the "all-MiniLM-L6-v2" embedding model.

Making embeddings can be visualised in the following way:

![embedding](assets/embedding.gif)

This embedding process is used in many LLM applications, for example a QnA system or a GPT chatbot. The question asked to the chatbot is embedded as well, and, based on a similarity search, the retriever returns the stored data whose embeddings are closest to the question. The LLM then uses that data to produce a coherent and well-structured answer.

But let's go through the concepts one by one, starting with how to turn raw text into vector form.

In [47]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [69]:
sentences = [
    "This is document about cat",
    "This is document about car",
    "Example of the long sentences: China increased its coal-fired power capacity by 42.9 GW, or 4.5%, in the 18 months to June 2019, according to a report by Global Energy Monitor. The study also found that another 121.3 GW of coal-fired power plants are under construction in China, which has pledged to reduce its coal usage. However, the country’s absolute coal consumption has still increased in line with rising energy demand. China accounts for more than 40% of the world's total coal generation capacity."
]

In [70]:
# Perform embedding using embed_documents()
embedded_sentences = embedding_function.embed_documents(sentences)

# show embedded result
embedded_sentences

[[0.03584069758653641,
  0.0844104215502739,
  0.004435782786458731,
  0.06132490187883377,
  -0.09679650515317917,
  -0.011024974286556244,
  -0.03996598348021507,
  0.02713777869939804,
  -0.03685332089662552,
  0.04033125191926956,
  0.022044340148568153,
  0.01601887308061123,
  -0.042161595076322556,
  0.010156717151403427,
  -0.09993438422679901,
  0.0030053777154535055,
  -0.10560218244791031,
  0.03724587708711624,
  0.013642314821481705,
  0.05051238834857941,
  0.02487064152956009,
  0.08287020027637482,
  0.05163392424583435,
  -0.014835968613624573,
  -0.0781128853559494,
  0.07514715194702148,
  -0.09232696890830994,
  -0.07739245146512985,
  0.021818267181515694,
  -0.01919436641037464,
  -0.07935808598995209,
  0.05101941525936127,
  0.039700187742710114,
  0.050386570394039154,
  -0.012143719010055065,
  -0.048662662506103516,
  0.060300786048173904,
  -0.01976785436272621,
  0.08548810333013535,
  0.10743720084428787,
  -0.049603745341300964,
  -0.06975889205932617,
  

In [71]:
# See len of embedded sentences
len(embedded_sentences)

3

We have three sentences and have embedded all of them. Let's check the length of each embedding.

In [57]:
# Shape of embedded sentences
for text in embedded_sentences:
    print(len(text))

384
384
384


In [72]:
# show the first 25 values of the first sentence's vector
embedded_sentences[0][:25]

[0.03584069758653641,
 0.0844104215502739,
 0.004435782786458731,
 0.06132490187883377,
 -0.09679650515317917,
 -0.011024974286556244,
 -0.03996598348021507,
 0.02713777869939804,
 -0.03685332089662552,
 0.04033125191926956,
 0.022044340148568153,
 0.01601887308061123,
 -0.042161595076322556,
 0.010156717151403427,
 -0.09993438422679901,
 0.0030053777154535055,
 -0.10560218244791031,
 0.03724587708711624,
 0.013642314821481705,
 0.05051238834857941,
 0.02487064152956009,
 0.08287020027637482,
 0.05163392424583435,
 -0.014835968613624573,
 -0.0781128853559494]

We see that the embedding produces a 384-dimensional vector no matter how short or long the sentence is. This is because [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like semantic search.
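One way to see why the output length does not depend on the input length: sentence-transformer models of this kind typically compute one vector per token and then pool those vectors into a single sentence vector (mean pooling, as far as the model card for all-MiniLM-L6-v2 describes). The sketch below uses made-up 4-dimensional token vectors instead of real 384-dimensional ones; `mean_pool` is a hypothetical helper, not the library's code.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Hypothetical token vectors (4 dimensions instead of 384, values made up).
short_sentence = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
long_sentence = [[0.1 * t, 0.0, 1.0, -0.5] for t in range(50)]

print(len(mean_pool(short_sentence)))  # 4
print(len(mean_pool(long_sentence)))   # 4
```

Two tokens or fifty tokens, the pooled vector has the same number of dimensions as a single token vector.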

### Find Similarity Between Documents

We have embedded three sentences (documents) above. What if we want to know which document contains information relevant to our question?

We can solve this problem by embedding our question into the same vector space and computing the similarity between the question and each document.

The similarity is measured with `cosine distance`: the lower the distance, the more similar the two vectors are.
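Cosine distance is defined as one minus the cosine of the angle between two vectors. As a minimal sketch of what is being computed (a plain-Python illustration, not scikit-learn's implementation), with a hypothetical helper named `cosine_distance`:

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # 0.0: same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # 1.0: orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0: opposite direction
```

Only the direction of the vectors matters, not their length, and the value ranges from 0 to 2. A distance slightly above 1 therefore means the two vectors point in slightly opposing directions.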

For example, suppose we want to know which document contains information about China and coal.

In [76]:
# Embed the question/query
embed_query = embedding_function.embed_documents(['China and coal'])

In [77]:
# Import the cosine_distance
from sklearn.metrics.pairwise import cosine_distances

# Compute the cosine distance between query and documents
cosine_distances(embed_query, embedded_sentences)

array([[0.97059793, 0.92054657, 0.43505046]])

Based on the result above, the third document has the lowest distance, which means it contains the most relevant information about "China and coal". This is not surprising, since the third document is a news summary about China's coal-fired power plants.
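Picking the best match by eye works for three documents, but the same step can be done in code. A small sketch, using the distances printed above and sorting the document indices from most to least similar:

```python
# Distances copied from the output above (query: "China and coal").
distances = [0.97059793, 0.92054657, 0.43505046]

# Sort document indices from most to least similar (smallest distance first).
ranking = sorted(range(len(distances)), key=lambda i: distances[i])
best = ranking[0]

print(ranking)  # [2, 1, 0]
print(best)     # 2, i.e. the third document
```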

Let's see another example. What if we want to know which document is about vehicles?

In [81]:
query = ["show document about vehicle"]
embed_query = embedding_function.embed_documents(query)

cosine_distances(embed_query, embedded_sentences)

array([[0.69098626, 0.20956952, 0.99632427]])

Since the second document has the lowest distance, let's look at the documents again.

In [84]:
# show all documents (the second one is the closest match)
sentences

['This is document about cat',
 'This is document about car',
 "Example of the long sentences: China increased its coal-fired power capacity by 42.9 GW, or 4.5%, in the 18 months to June 2019, according to a report by Global Energy Monitor. The study also found that another 121.3 GW of coal-fired power plants are under construction in China, which has pledged to reduce its coal usage. However, the country’s absolute coal consumption has still increased in line with rising energy demand. China accounts for more than 40% of the world's total coal generation capacity."]

Notice that the second document doesn't contain the word "vehicle" but does contain "car", and semantically we know that a car is a vehicle.

Also notice that the first and second documents differ by only one letter ("cat" vs. "car"), yet their distances to the query are far apart (0.69 vs. 0.21).

This is because the embedding model encodes the sentence based on its semantic meaning rather than on its letters or words, as a purely character- or word-matching approach would.
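For contrast, here is a sketch of what a purely character-based comparison would conclude about the two short documents. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for "letter-level" similarity; this is an illustrative choice, not something the embedding model uses.

```python
from difflib import SequenceMatcher

doc_cat = "This is document about cat"
doc_car = "This is document about car"

# Character-level similarity in [0, 1]; 1.0 means identical strings.
surface_similarity = SequenceMatcher(None, doc_cat, doc_car).ratio()
print(round(surface_similarity, 3))
```

At the character level the two sentences are almost identical, so a letter-based method could not tell which one is about a vehicle. The embedding distances above separate them clearly.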

Let's look at another example.

In [82]:
query = ["show document about animals"]
embed_query = embedding_function.embed_documents(query)

cosine_distances(embed_query, embedded_sentences)

array([[0.39287084, 0.61589003, 1.00654091]])

In [85]:
sentences[0]

'This is document about cat'

We see that even though none of the documents contains the word "animals", computing the cosine distance between the embedding vectors lets us find the document that best matches the semantic meaning of our query.

## Advanced Embedding in Large Language Models (LLM) for Text Processing

### Vector Database (CHROMA DB)

When working with Large Language Models (LLMs) like GPT-4 or Google's PaLM 2, we will often be working with large amounts of unstructured, textual data. Structured data can simply be stored in a SQL database, but that is much harder with unstructured data. When we have, for instance, many text files like the examples above with information on a certain topic, it helps to store this information in a different way so that the desired data can be retrieved as efficiently as possible. The answer to this: **Vector Databases**.

The specific vector database that we will use is the **ChromaDB** vector database.

[Chroma Website](https://docs.trychroma.com/getting-started#:~:text=Chroma%20is%20a%20database%20for,hosted%20version%20is%20coming%20soon!):

> Chroma is a database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine. ChromaDB

By using `Chroma` we don't have to embed the documents and compute the cosine distance manually.

`Chroma` is already integrated with `LangChain`. Let's import the necessary functions.

In [None]:
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

In [None]:
# load the document
loader = TextLoader("data_input/state_of_the_union.txt")
document = loader.load()

In [None]:
document

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citize

The document has been loaded, but as we can see it contains many paragraphs. Let's split the document into chunks along paragraph boundaries so it is easier to find which part is most similar to our query.

In [None]:
# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
text = text_splitter.split_documents(document)

text[:5]

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.', metadata={'source'
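The first chunk above contains several paragraphs, not just one. With `chunk_size=1000` the splitter breaks the text on its separator (by default a blank line, `"\n\n"`) and then merges neighbouring pieces as long as the chunk stays within the size limit. The function below is a simplified sketch of that idea; `split_by_paragraph` is a hypothetical helper and not LangChain's actual implementation.

```python
def split_by_paragraph(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge paragraphs into chunks of at most chunk_size characters.

    Simplified sketch: a single paragraph longer than chunk_size is kept whole.
    """
    chunks, current = [], ""
    for paragraph in text.split(separator):
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
print(split_by_paragraph(sample, chunk_size=10))
# ['aaaa\n\nbbbb', 'cccc\n\ndddd']
```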

We have split the document into chunks. Let's embed each chunk and store the text together with its embedding vector in Chroma.

In [None]:
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(text, embedding_function)

Chroma stores our ids, embeddings, documents and metadatas in a `collection`.

In [None]:
db._collection.get().keys()

dict_keys(['ids', 'embeddings', 'documents', 'metadatas'])

In [None]:
db._collection.get()['documents'][:5]

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.',
 'A former top litigator in private prac

In [None]:
db._collection.get()['ids'][:5]

['9af61905-21fe-11ee-a541-c85b76361b4c',
 '9af61925-21fe-11ee-82d8-c85b76361b4c',
 '9af61907-21fe-11ee-b064-c85b76361b4c',
 '9af61908-21fe-11ee-8cb4-c85b76361b4c',
 '9af61909-21fe-11ee-882d-c85b76361b4c']

In [None]:
db._collection.get()['metadatas'][:5]

[{'source': 'data_input/state_of_the_union.txt'},
 {'source': 'data_input/state_of_the_union.txt'},
 {'source': 'data_input/state_of_the_union.txt'},
 {'source': 'data_input/state_of_the_union.txt'},
 {'source': 'data_input/state_of_the_union.txt'}]

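To query the collection we can use `similarity_search_with_score`, which has the following signature and docstring: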

```python
(method) def similarity_search_with_score(
    query: str,
    k: int = DEFAULT_K,
    filter: Dict[str, str] | None = None,
    **kwargs: Any
) -> List[Tuple[Document, float]]
Run similarity search with Chroma with distance.

Args:
    query (str): Query text to search for.
    k (int): Number of results to return. Defaults to 4.
    filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.

Returns:
    List[Tuple[Document, float]]: List of documents most similar to the query text and cosine distance in float for each. Lower score represents more similarity.
```

In [None]:
# Embed query and find similar document
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search_with_score(query, k=5)

In [None]:
docs

[(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'data_input/state_of_the_union.txt'}),
  1.1972053050994873),
 (Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman

We see that `docs` returns the 5 most similar documents.
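Conceptually, a similarity search embeds the query, scores every stored vector against it and returns the `k` best matches. The sketch below shows that idea on a tiny in-memory list. The names `top_k` and `toy_store` and the 2-dimensional vectors are made up for illustration; a real vector database like Chroma uses an index rather than scanning every vector.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def top_k(query_vec, store, k=4):
    """Return the k (document, distance) pairs closest to query_vec."""
    scored = [(doc, cosine_distance(query_vec, vec)) for doc, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical store: (text, made-up 2-dimensional vector) pairs.
toy_store = [
    ("about judges", [0.9, 0.1]),
    ("about the economy", [0.1, 0.9]),
    ("about courts", [0.8, 0.3]),
]

results = top_k([1.0, 0.0], toy_store, k=2)
print([doc for doc, _ in results])  # ['about judges', 'about courts']
```

As in the output above, each result is a pair of a document and its score, ordered from lowest to highest distance.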

What if we have a bunch of txt files in one folder and want to embed all of them? We can load all files in the folder using `DirectoryLoader` and split them into chunks using `RecursiveCharacterTextSplitter`.

In [None]:
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Load all the text files in new_articles
loader = DirectoryLoader('data_input/new_articles/', glob="./*.txt", loader_cls=TextLoader)

documents = loader.load()

Because we have many text files, we split the text into chunks using `RecursiveCharacterTextSplitter`.

In [None]:
# split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text = text_splitter.split_documents(documents)

In [None]:
text[:5]

[Document(page_content='Welcome to The Interchange! If you received this in your inbox, thank you for signing up and your vote of confidence. If you’re reading this as a post on our site, sign up here so you can receive it directly in the future. Every week, we’ll take a look at the hottest fintech news of the previous week. This will include everything from funding rounds to trends to an analysis of a particular space to hot takes on a particular company or phenomenon. There’s a lot of fintech news out there and it’s our job to stay on top of it — and make sense of it — so you can stay in the know. — Mary Ann and Christine\n\nBusy, busy, busy\n\nIt was a busy week in startup and venture lands, and the fintech space was no exception.', metadata={'source': 'data_input\\new_articles\\05-07-fintech-space-continues-to-be-competitive-and-drama-filled.txt'}),
 Document(page_content='Busy, busy, busy\n\nIt was a busy week in startup and venture lands, and the fintech space was no exception.\n
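Notice that the second chunk starts with text that also ends the first chunk. That is the effect of `chunk_overlap=200`: consecutive chunks share some text so that a sentence cut at a boundary still appears complete in one of them. The sketch below illustrates overlap with a plain fixed-size sliding window. `sliding_chunks` is a hypothetical helper; the real `RecursiveCharacterTextSplitter` cuts at separators such as paragraph and line breaks rather than at fixed positions.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows whose starts are chunk_size - chunk_overlap apart."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk repeats the last two characters of the previous one, in the same way the article chunks above repeat up to 200 characters.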

We have split all our text. The following steps are the same as before.

In [None]:
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
vectordb = Chroma.from_documents(text, embedding_function)

# query it
query = "What is the news about Pando?"
docs = vectordb.similarity_search_with_score(query)

docs

[(Document(page_content='Pando was co-launched by Jayakrishnan and Abhijeet Manohar, who previously worked together at iDelivery, an India-based freight tech marketplace — and their first startup. The two saw firsthand manufacturers, distributors and retailers were struggling with legacy tech and point solutions to understand, optimize and manage their global logistics operations — or at least, that’s the story Jayakrishnan tells.\n\n“Supply chain leaders were trying to build their own tech and throwing people at the problem,” he said. “This caught our attention — we spent months talking to and building for enterprise users at warehouses, factories, freight yards and ports and eventually, in 2018, decided to start Pando to solve for global logistics through a software-as-a-service platform offering.”', metadata={'source': 'data_input\\new_articles\\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt'}),
  0.9159137606620789),
 (Document(page_content='Signaling that inv

### Saving to Disk

Embedding text takes a lot of time and resources, especially if we have a large amount of text.

To address this problem, we can save our vector database (including the sentence vectors) to disk, so that each time we need the vectors we just load the saved vector database.

To save the vector db to disk, simply initialize the Chroma client and pass the directory where you want the data to be saved.

> Caution: Chroma makes a best-effort to automatically save data to disk, however multiple in-memory clients can stomp each other's work. As a best practice, only have one client per path running at any given time.

> Protip: Sometimes you can call db.persist() to force a save

In [None]:
# load it into Chroma and save it to disk
vectordb = Chroma.from_documents(text, embedding_function, persist_directory="./chroma_db")
vectordb.persist()

The vector db is saved to the `./chroma_db` folder. Whenever we want to reuse the embeddings, we just load them from that folder.

Let's see how to load saved vector db in Chroma.

In [None]:
vectordb_load = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)

# query it
query = "What is the news about Pando?"
docs_load = vectordb_load.similarity_search_with_score(query)

docs_load

[(Document(page_content='Pando was co-launched by Jayakrishnan and Abhijeet Manohar, who previously worked together at iDelivery, an India-based freight tech marketplace — and their first startup. The two saw firsthand manufacturers, distributors and retailers were struggling with legacy tech and point solutions to understand, optimize and manage their global logistics operations — or at least, that’s the story Jayakrishnan tells.\n\n“Supply chain leaders were trying to build their own tech and throwing people at the problem,” he said. “This caught our attention — we spent months talking to and building for enterprise users at warehouses, factories, freight yards and ports and eventually, in 2018, decided to start Pando to solve for global logistics through a software-as-a-service platform offering.”', metadata={'source': 'data_input\\new_articles\\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt'}),
  0.9159137606620789),
 (Document(page_content='Signaling that inv

After loading the vector db, we can still run semantic search queries and get the same results as from the vector database before it was saved to disk.
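The reason the results are identical is that the stored vectors are written out and read back unchanged, so nothing has to be re-embedded. The sketch below shows the same save-and-reload idea with a plain JSON file in a temporary folder. The file name `store.json` and the record layout are assumptions for illustration and say nothing about Chroma's actual on-disk format.

```python
import json
import os
import tempfile

# Hypothetical records: text plus a made-up embedding vector.
records = [
    {"text": "Pando raises funding", "vector": [0.9, 0.1]},
    {"text": "Fintech weekly recap", "vector": [0.2, 0.8]},
]

def save_store(records, directory):
    """Write the records (texts and vectors) to a JSON file in directory."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "store.json"), "w", encoding="utf-8") as f:
        json.dump(records, f)

def load_store(directory):
    """Read the records back without recomputing any embedding."""
    with open(os.path.join(directory, "store.json"), encoding="utf-8") as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp:
    save_store(records, tmp)
    reloaded = load_store(tmp)

print(reloaded == records)  # True
```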


## Advanced Applications of Embedding in Text Processing with Large Language Models (LLM)

### Create QnA System from Vector Database

#### Create Retriever

In order to get the data out of the database again, we need to create a retriever. This retriever returns the documents (or chunks) most relevant to the question asked:

In [None]:
# # Create retriever
# retriever = vectordb.as_retriever()

#### Create a Q&A chain

In [None]:
# from langchain.llms import OpenAI
# from langchain.chains import RetrievalQA

# # create the chain to answer questions
# qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(),
#                                   chain_type="stuff",
#                                   retriever=retriever,
#                                   return_source_documents=True,
#                                   verbose=True)
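
The `chain_type="stuff"` option means that all retrieved documents are placed ("stuffed") into a single prompt, which is then sent to the LLM together with the question. The sketch below shows only that prompt-building step. The function `build_stuff_prompt`, the prompt wording and the two example chunks are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one prompt (the "stuff" idea)."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved chunks standing in for the retriever's output.
retrieved = ["Pando was started in 2018.", "Pando offers a logistics platform."]
prompt = build_stuff_prompt("When was Pando started?", retrieved)
print(prompt)
```

Because every retrieved chunk ends up in the same prompt, this approach only works while the combined text fits within the model's context window.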