### NOTES:

Typical challenges without RAG
1. The LLM only knows what was in its training data at its last update (knowledge cutoff)
2. Factual errors
3. It might not have the context needed for the question

So additional knowledge has to be supplied to the model. Options:
1. Fine-tuning
-- pretrained model + last layer retrained on proprietary data
-- expensive
-- needs expertise

2. In-context learning
-- using prompts to give context and examples
-- limited by the maximum token limit
-- needs memory to persist context across sessions


3. RAG
-- LLM + information retrieval
-- load docs
-- indexing & storage
-- create chunks -> manage chunk size
-- embeddings -> numeric transformation of each chunk
-- indexing -> helps with efficient retrieval
-- create a vector store -> stores the chunks and their embeddings for efficient retrieval
-- retrieve info -> embed the user query, match it against the vector store, and fetch the top k most similar chunks; these become the context for the LLM prompt
-- synthesize the response -> the top k relevant chunks (context) + user query are passed to the LLM to get an answer
-- query engine -> provides the output through the generator LLM
-- evaluation
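
The retrieve and synthesize steps above can be sketched with the standard library alone. This is a toy illustration, not how LlamaIndex implements them: `embed`, `similarity`, `retrieve` and `build_prompt` are hypothetical helper names, and a set of words with Jaccard overlap stands in for a real embedding model and vector similarity.

```python
import re

def embed(text: str) -> set[str]:
    """Toy 'embedding': the set of lowercase words (a real system uses a dense vector)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap, standing in for vector similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def retrieve(query: str, store: list[tuple[str, set[str]]], top_k: int = 2) -> list[str]:
    """Embed the query, score it against every stored chunk, return the top_k chunk texts."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user query into one prompt for the LLM."""
    context = "\n---\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

chunks = [
    "Self-attention relates different positions of a single sequence.",
    "Positional encodings inject information about token order.",
    "The encoder is a stack of six identical layers.",
]
# Indexing step: store each chunk next to its (toy) embedding.
vector_store = [(chunk, embed(chunk)) for chunk in chunks]

top_chunks = retrieve("What is self-attention?", vector_store, top_k=1)
prompt = build_prompt("What is self-attention?", top_chunks)
print(prompt)
```

The printed prompt is what would be sent to the generator LLM; swapping `embed` for a real embedding model and `similarity` for cosine similarity gives the usual dense-retrieval setup.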


## RAG Notes

### RAG
Data comes from 
-- API
-- Raw file
-- Vector store
-- Database

Needs 
-- framework like LlamaIndex or LangChain
-- embedding models
-- vector store
-- LLM

### LlamaIndex
-- single framework to build LLM search + retrieval (Q&A) type applications




In [2]:
#STEPS
#1. Ingest
#2. Index
#3. Retrieve
#4. Response Synthesizer
#5. Query
'''
%%capture

!pip install llama-index
!pip install python-dotenv

'''


In [91]:
import os
import tqdm
from pathlib import Path
from dotenv import load_dotenv



load_dotenv(r'C:\Users\myohollc\Documents\llamaindex\av_RAG_LlamaIndex\.env')

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

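# Re-export only the keys that were found: os.getenv returns None for a missing key,
# and assigning None to os.environ raises a TypeError.
if OPENAI_API_KEY:
    os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
if GOOGLE_API_KEY:
    os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY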


In [4]:
# `not key` catches both a missing key (None) and an empty string
if not GOOGLE_API_KEY:
    print('key not loaded')
else:
    print('key loaded')

if not OPENAI_API_KEY:
    print('key not loaded')
else:
    print('key loaded')

key loaded
key loaded
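
The check above prints the same message for both keys, so it does not say which one is missing. A small sketch of a helper that names the missing variables; `missing_keys` is a hypothetical name, and it is demonstrated on a plain dict so it does not depend on a real `.env` file.

```python
import os

def missing_keys(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Placeholder values only; no real keys are involved.
fake_env = {"OPENAI_API_KEY": "sk-placeholder", "GOOGLE_API_KEY": ""}
print(missing_keys(["OPENAI_API_KEY", "GOOGLE_API_KEY", "OTHER_KEY"], fake_env))
```

Calling `missing_keys(["OPENAI_API_KEY", "GOOGLE_API_KEY"])` without the second argument checks the real environment.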


In [5]:
### Step 0: Data Ingestion

In [6]:
###
from llama_index.core import SimpleDirectoryReader

## all pages get loaded as a list, where each element is a page

documents = SimpleDirectoryReader(input_files=["./data/transformers.pdf"]).load_data()

In [7]:
type(documents)
documents[0]
print(documents[0])
documents[0].text
documents[0].metadata
documents[0].id_


Doc ID: 7c85d717-7765-4182-8c33-0bc60c7943a7
Text: Provided proper attribution is provided, Google hereby grants
permission to reproduce the tables and figures in this paper solely
for use in journalistic or scholarly works. Attention Is All You Need
Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google
Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com
Jakob Usz...


'7c85d717-7765-4182-8c33-0bc60c7943a7'

In [8]:
### Step 1a: Embedding Model

In [9]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.gemini import GeminiEmbedding



In [10]:
#openai

#embed_model = OpenAIEmbedding(model="text-embedding-3-large")
#example usage
#embeddings = embed_model.get_text_embedding("OpenAI Embeddings")

#gemini
embed_model = GeminiEmbedding(model_name="models/embedding-001", title="this is a gemini document")

# example usage
embeddings = embed_model.get_text_embedding("Google Gemini Embeddings.")

In [11]:
type(embeddings)
embeddings[0:3]
len(embeddings)

768
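
Each embedding is a plain list of floats (768 of them here), and retrieval compares such vectors by similarity. A minimal sketch of cosine similarity, a common choice for this comparison; the 3-dimensional vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

# Tiny made-up vectors; real embeddings in this notebook have 768 dimensions.
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]   # same direction as the query, different length
far_vec = [0.0, 3.0, 0.0]     # orthogonal to the query
print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

Because only the direction matters, `close_vec` scores close to 1 even though it is twice as long as `query_vec`.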

In [12]:
### Step 1b: LLM

In [13]:
### openAI
'''
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-3.5-turbo",
    # api_key="some key",  # uses OPENAI_API_KEY env var by default
)
'''



In [14]:

### gemini
from llama_index.llms.gemini import Gemini

llm = Gemini(
    model="models/gemini-1.5-flash",
    # api_key="some key",  # uses GOOGLE_API_KEY env var by default
)

In [15]:
### Step 2: Indexing

In [16]:
from llama_index.core import VectorStoreIndex
# creates a datastore that has documents and the embeddings

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

In [17]:
type(index)

llama_index.core.indices.vector_store.base.VectorStoreIndex

In [18]:
### Step 3: Retrieval

In [19]:
# Set up index as a retriever to be able to query the index and get the responses, mainly top k chunks.
retriever = index.as_retriever()

In [20]:
retrieved_nodes = retriever.retrieve("What is self attention?")

In [21]:
retrieved_nodes

[NodeWithScore(node=TextNode(id_='fb7a5a27-b5f8-4d82-ab6e-fb95ec2bf3c3', embedding=None, metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data\\transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-27', 'last_modified_date': '2025-02-15'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='90c9486e-fb94-43a3-b96e-331cffc7fa3c', node_type='4', metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data\\transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-27', 'last_modified_date': '2025-02-15'}, hash='a251912bdd0963f1b37603da43eecebc172b05ad372017b68ce0564976555f8d')}, meta

In [22]:
type(retrieved_nodes)
len(retrieved_nodes)
retrieved_nodes[0].metadata
print(retrieved_nodes[0].text)

Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.
Decoder: The decoder is also composed of a stack of N = 6identical layers. 

In [23]:
print(retrieved_nodes[1].text)

Input-Input Layer5
The Law will never be perfect , but its application should be just - this is what we are missing , in my opinion . <EOS> <pad>
The Law will never be perfect , but its application should be just - this is what we are missing , in my opinion . <EOS> <pad>
Input-Input Layer5
The Law will never be perfect , but its application should be just - this is what we are missing , in my opinion . <EOS> <pad>
The Law will never be perfect , but its application should be just - this is what we are missing , in my opinion . <EOS> <pad>
Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top:
Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5
and 6. Note that the attentions are very sharp for this word.
14


In [24]:
retrieved_nodes[0].node

TextNode(id_='fb7a5a27-b5f8-4d82-ab6e-fb95ec2bf3c3', embedding=None, metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data\\transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-27', 'last_modified_date': '2025-02-15'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='90c9486e-fb94-43a3-b96e-331cffc7fa3c', node_type='4', metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data\\transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-27', 'last_modified_date': '2025-02-15'}, hash='a251912bdd0963f1b37603da43eecebc172b05ad372017b68ce0564976555f8d')}, metadata_template='{key}

In [25]:
### Step 4: Response Synthesis

In [26]:
from llama_index.core import get_response_synthesizer

# the response synthesizer turns the retrieved chunks plus the user query into an answer from the LLM
response_synthesizer = get_response_synthesizer(llm=llm)

In [27]:
type(response_synthesizer)

llama_index.core.response_synthesizers.compact_and_refine.CompactAndRefine
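
The default synthesizer is `CompactAndRefine`. The idea behind the name can be sketched roughly as follows: pack the retrieved chunks into as few prompts as fit the context budget, answer from the first one, then refine that answer with each remaining one. This is a simplified illustration, not LlamaIndex's code: `compact_chunks`, `refine_answer` and `fake_llm` are hypothetical names, the budget is counted in characters rather than tokens, and the prompt wording is invented.

```python
def compact_chunks(chunks, max_chars, separator="\n\n"):
    """Greedily merge consecutive chunks so each merged block stays within max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + separator + chunk
        if len(candidate) <= max_chars or not current:
            # Either it fits, or the chunk alone is oversized and must start its own block.
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed

def refine_answer(question, blocks, ask_llm):
    """Answer from the first block, then let every further block refine the running answer."""
    answer = ""
    for block in blocks:
        if not answer:
            prompt = f"Context:\n{block}\n\nQuestion: {question}"
        else:
            prompt = (f"Existing answer: {answer}\nNew context:\n{block}\n\n"
                      f"Refine the answer to: {question}")
        answer = ask_llm(prompt)
    return answer

# Stand-in for a real LLM call: records each prompt and returns a numbered draft.
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"draft {len(calls)}"

blocks = compact_chunks(["aaaa", "bbbb", "cccc"], max_chars=10)
final = refine_answer("What is self attention?", blocks, fake_llm)
print(len(blocks), final)
```

Compacting first keeps the number of LLM calls low: three chunks become two prompts in this toy run instead of three.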

In [28]:
### Step 5: Query Engine

In [29]:
# this is used to query the indexed document and receive synthesized responses from the LLM
# it takes the index and the response synthesizer
query_engine = index.as_query_engine(llm=llm, response_synthesizer=response_synthesizer)

In [30]:
response = query_engine.query("What is self attention?")

In [31]:
print(response.response)
print(len(response.response))

Self-attention is a mechanism that maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.  The output is calculated as a weighted sum.  In the decoder, self-attention is modified to prevent positions from attending to subsequent positions.

300


In [32]:
len(response.source_nodes)
response.source_nodes[0].text

'Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [11] around each of\nthe two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is\nLayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer\nitself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension dmodel = 512.\nDecoder: The decoder is also composed of a stack of N = 6ident

In [33]:
print(query_engine.query("Why do we need positional encodings?").response)

To utilize the order of a sequence in a model without recurrence or convolution, positional encodings are added to the input embeddings.  This injects information about the relative or absolute position of tokens in the sequence.



In [34]:
print(query_engine.query("What are the different types of Transformer Models?").response)

Based on the provided text, the described Transformer model uses an encoder-decoder structure.  The encoder is composed of six identical layers, each with a multi-head self-attention mechanism and a fully connected feed-forward network.  The decoder also has six identical layers, each with the two sub-layers of the encoder plus an additional multi-head attention sub-layer over the encoder's output.



In [35]:
print(query_engine.query("What are Encoder and Decoder blocks?").response)

The encoder comprises six identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.  Residual connections and layer normalization are used.  The decoder also has six identical layers, each with the two sub-layers found in the encoder, plus a third sub-layer for multi-head attention over the encoder's output.  Similar residual connections and layer normalization are employed, and a modification to the self-attention sub-layer prevents positions from attending to subsequent positions.



In [36]:
## Data Loaders

In [37]:
!pip install llama-index-readers-file llama-index-readers-web
!pip install unstructured



In [38]:
#!wget "https://github.com/prashant9501/RAG-using-LlamaIndex-course/raw/33675a285b06b4048af6fa0221dd5fd289157a8a/transformers.pdf" -O 'data/transformers.pdf'


In [39]:
### Using PDF reader

In [40]:
from pathlib import Path
from llama_index.readers.file import PDFReader

In [41]:
#simple pdf reader

loader = PDFReader()

In [42]:
pdf_documents = loader.load_data(Path("./data/transformers.pdf"))

In [43]:
len(pdf_documents)

15

In [44]:
### loading csv files

In [45]:
!wget https://datahack-prod.s3.amazonaws.com/train_file/train_v9rqX0R.csv -O 'data/transactions.csv'

'wget' is not recognized as an internal or external command,
operable program or batch file.
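
`wget` is not available in a default Windows shell, which is why the cell above fails. A portable alternative is to download the file from Python itself. A minimal sketch using only the standard library; `download_file` is a hypothetical helper name.

```python
import urllib.request
from pathlib import Path

def download_file(url: str, destination: str) -> Path:
    """Download url to destination, creating the parent folder if needed."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(target))
    return target

# Usage for the CSV in this notebook (performs a network request, so left commented out):
# download_file("https://datahack-prod.s3.amazonaws.com/train_file/train_v9rqX0R.csv",
#               "data/transactions.csv")
```

The same helper works on Windows, macOS and Linux, so the notebook no longer depends on which shell tools are installed.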


In [46]:
from llama_index.readers.file import CSVReader

loader = CSVReader()

csv_documents = loader.load_data(Path("data/transactions.csv"))

In [47]:
len(csv_documents)
csv_documents[0].metadata

{'filename': 'transactions.csv', 'extension': '.csv'}

In [48]:
### Loading Web Page

In [49]:
from llama_index.readers.web import UnstructuredURLLoader

loader = UnstructuredURLLoader(urls=["https://huggingface.co/blog/moe"])
web_documents = loader.load_data()

In [50]:
type(web_documents)
len(web_documents)
print(web_documents[0].text)

Back to Articles

Mixture of Experts Explained

Published December 11, 2023

Update on GitHub

Upvote

352

osanseviero Omar Sanseviero

lewtun Lewis Tunstall

philschmid Philipp Schmid

smangrul Sourab Mangrulkar

ybelkada Younes Belkada

pcuenq Pedro Cuenca

With the release of Mixtral 8x7B (announcement, model card), a class of transformer has become the hottest topic in the open AI community: Mixture of Experts, or MoEs for short. In this blog post, we take a look at the building blocks of MoEs, how they’re trained, and the tradeoffs to consider when serving them for inference.

Let’s dive in!

Table of Contents

What is a Mixture of Experts?

A Brief History of MoEs

What is Sparsity?

Load Balancing tokens for MoEs

MoEs and Transformers

Switch Transformers

Stabilizing training with router Z-loss

What does an expert learn?

How does scaling the number of experts impact pretraining?

Fine-tuning MoEs

When to use sparse MoEs vs dense models?

Making MoEs go brrr

Expert Paralle

In [51]:

from llama_index.core import Document

In [65]:
### Multiple files in different formats can be read directly from disk and merged into a single LlamaIndex Document
### Any other format (web page etc.) could perhaps be converted to .md and saved first
### or use this function



In [75]:
pwd

'c:\\Users\\myohollc\\Documents\\llamaindex\\av_RAG_LlamaIndex'

In [None]:


documents = SimpleDirectoryReader(input_files=['./data/transactions.csv',
                                    './data/transformers.pdf']).load_data()

In [62]:
len(documents)

16

In [58]:
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [60]:
with open('data/blog.html', 'w', encoding="utf-8") as file:
    file.write(document.text)

In [76]:
### Chunking

In [79]:
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

essay_document = SimpleDirectoryReader(input_files=["./data/paul_graham_essay.txt"], filename_as_id=True).load_data(show_progress=True)

Loading files: 100%|██████████| 1/1 [00:00<00:00, 225.99file/s]


In [97]:
len(essay_document)

1

In [86]:
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(essay_document)

In [87]:
len(nodes)

18
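
What `chunk_size` and `chunk_overlap` mean can be shown with a plain sliding window. This is a simplified sketch, not the `SentenceSplitter` algorithm, which also tries to keep sentences intact: `chunk_with_overlap` is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

words = "one two three four five six seven eight nine ten".split()
pieces = chunk_with_overlap(words, chunk_size=4, chunk_overlap=1)
for piece in pieces:
    print(piece)
```

Each chunk repeats the last word of the previous one, so a sentence cut at a chunk boundary still appears with some of its surrounding context.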

In [88]:
print(nodes[5].text)

So if you see a big painting of this type hanging in the apartment of a hedge fund manager, you know he paid millions of dollars for it. That's not always why artists have a signature style, but it's usually why buyers pay a lot for such work. [6]

There were plenty of earnest students too: kids who "could draw" in high school, and now had come to what was supposed to be the best art school in the country, to learn to draw even better. They tended to be confused and demoralized by what they found at RISD, but they kept going, because painting was what they did. I was not one of the kids who could draw in high school, but at RISD I was definitely closer to their tribe than the tribe of signature style seekers.

I learned a lot in the color class I took at RISD, but otherwise I was basically teaching myself to paint, and I could do that for free. So in 1993 I dropped out. I hung around Providence for a bit, and then my college friend Nancy Parmet did me a big favor. A rent-controlled apa

In [93]:
import tiktoken

# Count the tokens in each chunk independently of the splitter:
# tiktoken encodes the chunk text into token ids, and the length of that list is the token count.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")  # build the encoder once, outside the loop

for i, node in enumerate(nodes):
    print(f'number of tokens in node {i+1} - {len(encoding.encode(node.text))}')

number of tokens in node 1 - 967
number of tokens in node 2 - 963
number of tokens in node 3 - 975
number of tokens in node 4 - 949
number of tokens in node 5 - 947
number of tokens in node 6 - 949
number of tokens in node 7 - 942
number of tokens in node 8 - 973
number of tokens in node 9 - 965
number of tokens in node 10 - 967
number of tokens in node 11 - 966
number of tokens in node 12 - 969
number of tokens in node 13 - 958
number of tokens in node 14 - 972
number of tokens in node 15 - 954
number of tokens in node 16 - 960
number of tokens in node 17 - 981
number of tokens in node 18 - 326


In [96]:
print(nodes[0].text)

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the

In [98]:
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage #to be able to chat with OpenAI

In [99]:
llm = OpenAI(model='gpt-3.5-turbo')

In [101]:
message = [ChatMessage(role="system", content="answer the question accurately"),
            ChatMessage(role="user", content="why is the sky red?")
            ]

In [102]:
response = llm.chat(message)  # reuse the llm configured above instead of creating a new default client

Retrying llama_index.llms.openai.base.OpenAI._chat in 1.0 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}.
Retrying llama_index.llms.openai.base.OpenAI._chat in 1.68253079723454 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}.


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}