In [67]:
from langchain_community.document_loaders import TextLoader
import os
# from dotenv import load_dotenv
# load_dotenv() This can be used to load environment variables if there are any.
from langchain_community.llms import Ollama #we use open source model llama2 from Ollama
from langchain_community.document_loaders import WebBaseLoader #collecting data from websites
import bs4 #BeautifulSoup for web scraping
from langchain_community.document_loaders import PyPDFLoader #used to load data from .pdf
from langchain.text_splitter import RecursiveCharacterTextSplitter #used to split the given database into chunks
from langchain_openai import OpenAIEmbeddings #But paid
from langchain_community.vectorstores import Chroma
#from langchain_community.embeddings import FastEmbedEmbeddings #good opensource embedding
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm
from langchain_community.embeddings import HuggingFaceEmbeddings #Alternative embedding class; swap models and compare retrieval quality
from langchain_community.vectorstores import FAISS #Vector DB by Meta
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
#from langchain.embeddings import FastEmbedEmbeddings
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain.chains.combine_documents import create_stuff_documents_chain #This is a chain which combines the prompt and the LLM
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain #This combines a retriever and a create_stuff_documents_chain to take a user query and give the result


In this project, a simple retrieval-augmented generation (RAG) pipeline is built with LangChain. Given a user prompt, it retrieves relevant passages from a supplied set of documents and passes them to an LLM, so that the generated answer is grounded in that material and hallucination is reduced.

In [2]:
## Data Ingestion

loader=TextLoader("speech.txt")
text_documents=loader.load()
text_documents

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness be

In [3]:
'''Load data from a .html website'''

loader=WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
                     bs_kwargs=dict(parse_only=bs4.SoupStrainer(
                         class_=("post-title","post-content","post-header")

                     )))

text_documents=loader.load()

In [4]:
## Pdf reader
loader=PyPDFLoader('attention.pdf')
docs=loader.load()

In [36]:

text_splitter=RecursiveCharacterTextSplitter(chunk_size=2000,chunk_overlap=200)
documents=text_splitter.split_documents(docs)
documents[:5]

[Document(metadata={'source': 'attention.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\n
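
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping character windows. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraph and line boundaries. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.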

In [37]:
'''Chroma DB using FastEmbedEmbeddings'''
db_chroma = Chroma.from_documents(documents,FastEmbedEmbeddings())

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 13680.05it/s]


In [38]:
# Hugging Face access token. Use a placeholder here and keep the real token in the shell or a .env file; never commit it in a notebook.
os.environ['HUGGINGFACEHUB_API_TOKEN']="<your-hf-token>"

'''HuggingFace Embedding model using Sentence Transformer'''
huggingface_embeddings=HuggingFaceBgeEmbeddings(
    model_name= "sentence-transformers/all-MiniLM-l6-v2",
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings':True})




In [39]:
'''Chroma DB using HuggingFace embeddings'''
db_chroma_hf = Chroma.from_documents(documents,huggingface_embeddings)

In [61]:
'''Retrieval using FastEmbed Embeddings'''
query = "Masked multihead attention"
retrieved_results=db_chroma.similarity_search(query)
print(retrieved_results[1].page_content)

output values. These are concatenated and once again projected, resulting in the final values, as
depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions. With a single attention head, averaging inhibits this.
MultiHead( Q, K, V ) = Concat(head 1, ...,head h)WO
where head i= Attention( QWQ
i, KWK
i, V WV
i)
Where the projections are parameter matrices WQ
i∈Rdmodel×dk,WK
i∈Rdmodel×dk,WV
i∈Rdmodel×dv
andWO∈Rhdv×dmodel.
In this work we employ h= 8 parallel attention layers, or heads. For each of these we use
dk=dv=dmodel/h= 64 . Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
3.2.3 Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
•In "encoder-decoder attention" layers, the queries come from the previous decoder layer,


In [60]:
'''Retrieval using HuggingFace Embeddings'''
query = "What is Scaled dot product attention"
retrieved_results=db_chroma_hf.similarity_search(query)
print(retrieved_results[1].page_content)

dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has
extremely small gradients4. To counteract this effect, we scale the dot products by1√dk.
3.2.2 Multi-Head Attention
Instead of performing a single attention function with dmodel-dimensional keys, values and queries,
we found it beneficial to linearly project the queries, keys and values htimes with different, learned
linear projections to dk,dkanddvdimensions, respectively. On each of these projected versions of
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional
4To illustrate why the dot products get large, assume that the components of qandkare independent random
variables with mean 0and variance 1. Then their dot product, q·k=Pdk
i=1qiki, has mean 0and variance dk.
4
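
The two searches above rank chunks by how close their embedding vectors are to the query vector. The sketch below shows that idea with cosine similarity on toy 2-D vectors. It is an illustration under assumptions, not the stores' actual implementation: the metric a given vector store uses by default may differ (L2 distance, for instance), although for unit-length vectors, as produced with `normalize_embeddings=True`, ranking by L2 distance and by cosine similarity give the same order. The helper names and vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for three document chunks.
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
ranking = top_k([0.9, 0.1], doc_vecs, k=2)
print(ranking)
```

Indexing the result list, as the cells above do with `[1]` or `[0]`, simply picks one entry from such a ranking, so `[0]` is the closest chunk and `[1]` the runner-up.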


In [63]:
## FAISS Vector Database
db = FAISS.from_documents(documents, FastEmbedEmbeddings())
query = "What is Scaled dot product attention"
retrieved_results=db.similarity_search(query)
print(retrieved_results[0].page_content)

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 24105.20it/s]


Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
of the values, where the weight assigned to each value is computed by a compatibility function of the
query with the corresponding key.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of
queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the
query with all keys, divide each by√dk, and apply a softmax function to obtain the weights on the
values.
In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices KandV. We compute
the matrix of outputs as:
Attention( Q, K, V ) = softmax(QKT
√dk)V (1)
The two most commonly used attention functions are additive attention [ 2], and

In [64]:
## FAISS Vector Database
embeddings = FastEmbedEmbeddings()
db = FAISS.from_documents(documents, embeddings)
query = "Authors of Attentions is all you need paper"
retrieved_results=db.similarity_search(query)
print(retrieved_results[0].page_content)

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 15803.71it/s]


Input-Input Layer5
[Figure residue: the sentence "The Law will never be perfect , but its application should be just - this is what we are missing , in my opinion . <EOS> <pad>" printed one token per line, four times, with "Input-Input Layer5" repeated before the third copy]
Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top:
Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5
and 6. Note that the attentions are very sharp for this word.
14


In [31]:
'''parameters of FastEmbed Embeddings
model_name: str (default: "BAAI/bge-small-en-v1.5")
Name of the FastEmbedding model to use. You can find the list of supported models here.
max_length: int (default: 512)
The maximum number of tokens. Unknown behavior for values > 512.
cache_dir: Optional[str] (default: None)
The path to the cache directory. Defaults to local_cache in the parent directory.
threads: Optional[int] (default: None)
The number of threads a single onnxruntime session can use.
doc_embed_type: Literal["default", "passage"] (default: "default")
"default": Uses FastEmbed's default embedding method.
"passage": Prefixes the text with "passage" before embedding.
batch_size: int (default: 256)
Batch size for encoding. Higher values will use more memory, but be faster.
parallel: Optional[int] (default: None)'''


In [45]:
'''Ollama embeddings'''
#ollama pull llama2
embeddings = OllamaEmbeddings(
    model="llama2",
)

In [65]:
## FAISS Vector Database
db = FAISS.from_documents(documents, embeddings)
query = "Who was the author of the paper?"
retrieved_results=db.similarity_search(query)
print(retrieved_results[0].page_content)

[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated
corpus of english: The penn treebank. Computational linguistics , 19(2):313–330, 1993.
[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference ,
pages 152–159. ACL, June 2006.
[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention
model. In Empirical Methods in Natural Language Processing , 2016.
[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive
summarization. arXiv preprint arXiv:1705.04304 , 2017.
[29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact,
and interpretable tree annotation. In Proceedings of the 21st International Conference on
Computational Linguistics and 44th Annual Meeting of the ACL , pages 433–440. ACL, July
2006.
[30] Ofir P

In [50]:
'''LLM model'''
## Load the Llama 2 model served by Ollama
llm=Ollama(model="llama2")
llm

Ollama()

In [55]:
'''Chat prompt template: {input} is filled with the user query and {context} is filled automatically with the retrieved documents'''

prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context. 
Think step by step before providing a detailed answer. 
I will tip you $1000 if the user finds the answer helpful. 
<context>
{context}
</context>
Question: {input}""")

In [56]:
'''This chain takes a list of documents and formats them all into a prompt, then passes to LLM to obtain the final result'''
document_chain=create_stuff_documents_chain(llm,prompt)
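
As a rough picture of what "stuffing" means, the sketch below joins the text of every document and substitutes it into the `{context}` slot of a template. This is an approximation for illustration only, not LangChain's implementation; the `Doc` class, the `stuff_documents` helper, the shortened template and the blank-line separator are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


TEMPLATE = """Answer the following question based only on the provided context.
<context>
{context}
</context>
Question: {input}"""


def stuff_documents(docs: list[Doc], question: str, separator: str = "\n\n") -> str:
    """Concatenate all document texts and place them in the prompt's context slot."""
    context = separator.join(doc.page_content for doc in docs)
    return TEMPLATE.format(context=context, input=question)


docs = [Doc("Chunk about scaling by sqrt(dk)."), Doc("Chunk about multi-head attention.")]
filled = stuff_documents(docs, "What is scaled dot-product attention?")
print(filled)
```

Because every retrieved chunk goes into a single prompt, the combined length of the chunks has to fit within the model's context window, which is one reason to keep `chunk_size` and the number of retrieved chunks modest.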

In [66]:
"""
Retrievers: A retriever is an interface that returns documents given
 an unstructured query. It is more general than a vector store.
 A retriever does not need to be able to store documents, only to 
 return (or retrieve) them. Vector stores can be used as the backbone
 of a retriever, but there are other types of retrievers as well. 
 https://python.langchain.com/docs/modules/data_connection/retrievers/   
"""

retriever=db.as_retriever()
retriever

VectorStoreRetriever(tags=['FAISS', 'FastEmbedEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x7fd6dfd46d40>)
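
The point that a retriever only has to return documents, not store them, can be sketched with a small interface. Everything here is hypothetical and unrelated to LangChain's own classes: `Retriever` is a structural type with one method, and `CallableRetriever` owns no documents at all; it delegates to whatever search function it is given, in this case a toy keyword search.

```python
from typing import Callable, Protocol


class Retriever(Protocol):
    """Anything with a retrieve(query) method counts as a retriever."""
    def retrieve(self, query: str) -> list[str]: ...


class CallableRetriever:
    """Holds no documents; forwards the query to a search function and trims to k results."""
    def __init__(self, search_fn: Callable[[str], list[str]], k: int = 2) -> None:
        self.search_fn = search_fn
        self.k = k

    def retrieve(self, query: str) -> list[str]:
        return self.search_fn(query)[: self.k]


corpus = [
    "Scaled dot-product attention divides by sqrt(dk).",
    "Multi-head attention runs several heads in parallel.",
    "The Penn Treebank is an annotated corpus.",
]


def keyword_search(query: str) -> list[str]:
    """Toy backend: rank texts by the number of lowercase words shared with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.lower().split())), idx) for idx, text in enumerate(corpus)]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [corpus[idx] for _, idx in ranked]


retriever_sketch: Retriever = CallableRetriever(keyword_search, k=1)
hits = retriever_sketch.retrieve("multi-head attention")
print(hits)
```

A vector store fits the same shape: its similarity search plays the role of `keyword_search`, and the retriever is the thin wrapper around it.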

In [68]:
"""
Retrieval chain: This chain takes in a user inquiry, which is then
passed to the retriever to fetch relevant documents. Those documents 
(and original inputs) are then passed to an LLM to generate a response
https://python.langchain.com/docs/modules/chains/
"""

retrieval_chain=create_retrieval_chain(retriever,document_chain)
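
The flow described in the docstring above can be sketched in a few lines of plain Python. This is a simplified model under assumptions, not LangChain's code: `make_retrieval_chain`, `fake_retrieve` and `fake_answer` are hypothetical names, and the retriever and the document chain are replaced by ordinary functions. The returned dictionary uses the same three keys (`input`, `context`, `answer`) that the real chain's response shows further down.

```python
from typing import Callable


def make_retrieval_chain(
    retrieve: Callable[[str], list[str]],
    answer_from_docs: Callable[[dict], str],
) -> Callable[[dict], dict]:
    """Wire a retriever and a document-combining step into one callable."""
    def invoke(inputs: dict) -> dict:
        docs = retrieve(inputs["input"])  # step 1: fetch relevant chunks
        answer = answer_from_docs({"input": inputs["input"], "context": docs})  # step 2: generate
        return {**inputs, "context": docs, "answer": answer}
    return invoke


def fake_retrieve(query: str) -> list[str]:
    """Stand-in retriever that returns a single made-up chunk."""
    return [f"chunk mentioning {query}"]


def fake_answer(payload: dict) -> str:
    """Stand-in for the LLM call; reports how much context it received."""
    return f"Answer built from {len(payload['context'])} chunk(s)."


chain = make_retrieval_chain(fake_retrieve, fake_answer)
result = chain({"input": "Scaled Dot-Product Attention"})
print(result)
```

Keeping the retrieved chunks in the result under `context` makes it possible to check an answer against the passages it was supposedly based on.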

In [69]:
response = retrieval_chain.invoke({"input":"Scaled Dot-Product Attention"})

In [70]:
response

{'input': 'Scaled Dot-Product Attention',
 'context': [Document(metadata={'source': 'attention.pdf', 'page': 3}, page_content='Scaled Dot-Product Attention\n Multi-Head Attention\nFigure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several\nattention layers running in parallel.\nof the values, where the weight assigned to each value is computed by a compatibility function of the\nquery with the corresponding key.\n3.2.1 Scaled Dot-Product Attention\nWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of\nqueries and keys of dimension dk, and values of dimension dv. We compute the dot products of the\nquery with all keys, divide each by√dk, and apply a softmax function to obtain the weights on the\nvalues.\nIn practice, we compute the attention function on a set of queries simultaneously, packed together\ninto a matrix Q. The keys and values are also packed together into matrices KandV. We compute\nthe matrix

In [71]:
response['answer']

'The Scaled Dot-Product Attention mechanism is a type of attention used in the Transformer architecture. It is called "scaled dot-product attention" because it combines the ideas of dot-product attention and scaling.\n\nIn scaled dot-product attention, the compatibility function between the query and key is computed using a dot product. However, unlike traditional dot-product attention, the resulting values are then scaled by a factor of √dk before applying the softmax function to obtain the weights on the values. This scaling factor is important because it helps prevent the magnitude of the dot products from growing too large as the sequence length increases, which can cause problems during training and inference.\n\nThe main advantage of scaled dot-product attention is that it allows for faster and more space-efficient computation compared to traditional dot-product attention, especially for large sequences. However, it also introduces some additional complexity in terms of scaling a

In [72]:
response = retrieval_chain.invoke({"input":"Authors of Attention is all you need"})

In [73]:
response['answer']

'The authors of "Attention is All You Need" (Kaiser et al., 2016) propose a new attention mechanism based on multi-head attention, which improves the performance of neural machine translation models. They compare their proposed attention mechanism with additive attention and dot-product attention, showing that it outperforms both in terms of accuracy and computational efficiency.\n\nThe authors explain that the main issue with traditional attention mechanisms is that they can suffer from the "attention collapse" problem, where the model pays too much attention to a single part of the input and ignores the rest. To address this issue, they propose a multi-head attention mechanism that computes multiple attention weights in parallel, each with its own learnable weight matrix. This allows the model to attend to different parts of the input simultaneously and capture more contextual information.\n\nThe authors also analyze the computational complexity of different attention mechanisms and 