In [1]:
## Data Ingestion -> bringing data from different sources into one place 
### From txt file
from langchain_community.document_loaders import TextLoader
loader = TextLoader("speech.txt")
text_docs = loader.load()
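
The loader returns a list of document objects, each holding the text in `page_content` plus a `metadata` dict that records where it came from. The sketch below mimics that shape using only the standard library. It is not LangChain's implementation, and `SimpleDocument` and `load_text_file` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]
```

Returning a list even for a single file keeps the interface uniform with loaders that produce many documents, such as one document per PDF page.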

In [14]:
text_docs[0].page_content[:500]  # first 500 characters


'The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight withou'

In [2]:
### From web
from langchain_community.document_loaders import WebBaseLoader
import bs4
# loader = WebBaseLoader(
#     web_paths=("https://medium.com/@ashubhai/webrtc-applications-with-node-js-and-react-js-7f4d4313bace",),
#     bs_kwargs=dict(
#         parse_only=bs4.SoupStrainer(
#             class_=("pw-post-title","pw-post-body") 
#         )
#     )
# )

loader = WebBaseLoader(
    web_paths=("https://medium.com/@ashubhai/webrtc-applications-with-node-js-and-react-js-7f4d4313bace",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "p"  
        )
    )
)

web_docs = loader.load()



USER_AGENT environment variable not set, consider setting it to identify your requests.
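
Passing `bs4.SoupStrainer("p")` as `parse_only` restricts parsing to `<p>` elements, so headings, navigation and other markup never reach `page_content`. The sketch below illustrates the same filtering idea with the standard library's `html.parser`. It is a stand-in rather than what BeautifulSoup does internally, and `ParagraphExtractor` and `extract_paragraph_text` are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._inside_p = False
        self.paragraphs = []  # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        # Text of inline children such as <b> is still reported here.
        if self._inside_p:
            self.paragraphs[-1] += data


def extract_paragraph_text(html):
    """Return the text of all <p> elements joined without a separator."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.paragraphs)


sample_html = "<h1>Title</h1><p>First.</p><div>menu</div><p>Second.</p>"
print(extract_paragraph_text(sample_html))
```

Joining paragraphs without a separator is a deliberate choice in this sketch: it reproduces the effect visible in the loaded page text, where the end of one paragraph runs straight into the start of the next.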


In [3]:
web_docs[0].page_content[48:]

'(Web Real-Time Communication) is a groundbreaking technology enabling real-time multi-media communication within web browsers. By eliminating the need for third-party plugins or software, it facilitates seamless, efficient, and accessible browser-based interactions. In this article, we’ll explore how WebRTC works, its core components, and implement a real-time communication application using Node.js as the signaling server and React.js for the frontend.3. Open Standard:WebRTC’s open standard ensures compatibility across major browsers and platforms, encouraging widespread adoption.The above image shows two browsers directly connected to each other in a P2P mode. In a WebRTC P2P connection, media data (audio, video, etc.) is transmitted directly between the two peers without going through an intermediary server. This direct connection enhances efficiency and reduces latency.While media data is transmitted directly between peers, WebRTC requires certain servers for specific tasks like s

In [10]:
### From PDF
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("attention.pdf")
docs=loader.load()

In [11]:
import re
text = docs[0].page_content
clean_text = re.sub(r'\s+', ' ', text)  # collapse multiple spaces/newlines
print(clean_text)


Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.toronto.edu Łukasz Kaiser ∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring signiﬁcantly less time to train. Our model 

In [12]:
## Let's divide the documents into smaller chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
documents = text_splitter.split_documents(docs)
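
To see how `chunk_size=1000` and `chunk_overlap=200` interact, here is a simplified sliding-window splitter. It is only a sketch: the recursive splitter above also tries to break on separators such as paragraph and line boundaries, which this version ignores. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])
```

With these settings each window starts 800 characters after the previous one, so the last 200 characters of a chunk reappear at the start of the next. The overlap reduces the chance that a sentence needed to answer a question is cut in half at a chunk boundary.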

In [13]:
import re
text = documents[0].page_content
clean_text = re.sub(r'\s+', ' ', text)  # collapse multiple spaces/newlines
print(clean_text)


Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.toronto.edu Łukasz Kaiser ∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring signiﬁcantly


In [16]:
## Vector Embedding And Vector Store
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OllamaEmbeddings(model="llama3.2:1b")
db = Chroma.from_documents(documents[:30],embeddings)

In [17]:
query = "An attention function can be described as mapping a query"
retrieved_results = db.similarity_search(query)
print(retrieved_results[0].page_content)

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU
[20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as basic building
block, computing hidden representations in parallel for all input and output positions. In these models,
the number of operations required to relate signals from two arbitrary input or output positions grows
in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes
it more difﬁcult to learn dependencies between distant positions [ 11]. In the Transformer this is
reduced to a constant number of operations, albeit at the cost of reduced effective resolution due
to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as
described in section 3.2.
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions


In [20]:
## Advanced RAG: Retrievers and adding an LLM
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3.2:1b")


In [23]:
## Design prompt template 
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the context provided.
Think step by step before providing the detailed answer.
I will tip you $1000 if the user finds the answer satisfactory.
<context>
{context}
</context>
Question: {input}
""")


In [24]:
## Chain Introduction -> Create Stuff document chain
from langchain.chains.combine_documents import create_stuff_documents_chain
doc_chain = create_stuff_documents_chain(llm,prompt)
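
"Stuffing" means concatenating every retrieved chunk into the single `{context}` slot of the prompt before one call to the model. The sketch below shows that step with plain string formatting. The shortened template, the separator default and the name `stuff_documents` are assumptions for illustration, not the library's internals.

```python
PROMPT_TEMPLATE = (
    "Answer the following question based only on the context provided.\n"
    "<context>\n{context}\n</context>\n"
    "Question: {input}"
)


def stuff_documents(chunks, question, separator="\n\n"):
    """Join all chunks into one context string and fill the prompt template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, input=question)


print(stuff_documents(["First chunk.", "Second chunk."], "What do the chunks say?"))
```

This approach is simple and keeps all evidence in one prompt, but the combined text has to fit in the model's context window, which is one reason to limit how many chunks the retriever returns.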

In [25]:
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['Chroma', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x00000214FF45D950>, search_kwargs={})

In [26]:
# Retriever chain -> doc_chain + retriever
from langchain.chains import create_retrieval_chain
retrieval_chain = create_retrieval_chain(retriever,doc_chain)
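
The retrieval chain wires two steps together: fetch chunks for the `input` question, then hand those chunks and the question to the document chain. The sketch below reproduces that flow with plain callables. The toy retriever and the fake answer function stand in for the vector store and the LLM, and every name here is hypothetical.

```python
def make_retrieval_chain(retrieve, answer_from_docs):
    """Compose a retriever and a document chain into one callable.

    retrieve(query) returns a list of text chunks.
    answer_from_docs(context, question) returns an answer string.
    """
    def invoke(inputs):
        docs = retrieve(inputs["input"])
        answer = answer_from_docs(docs, inputs["input"])
        # Keep the original inputs and add the retrieved context and the answer.
        return {**inputs, "context": docs, "answer": answer}
    return invoke


corpus = [
    "attention maps a query to an output",
    "convolutions use fixed kernels",
]


def toy_retrieve(query):
    """Return chunks that share at least one word with the query."""
    words = set(query.lower().split())
    return [chunk for chunk in corpus if words & set(chunk.split())]


def toy_answer(context, question):
    """Fake LLM: echo the first chunk instead of generating text."""
    if not context:
        return "No context found."
    return f"Based on {len(context)} chunk(s): {context[0]}"


chain = make_retrieval_chain(toy_retrieve, toy_answer)
result = chain({"input": "What is attention"})
print(result["answer"])
```

Returning the retrieved chunks alongside the answer makes it possible to inspect which passages the answer was based on.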

In [29]:
response = retrieval_chain.invoke({"input": "An attention function can be described as mapping a query and a set"})

In [30]:
response['answer']

'An attention function in the context of neural networks can be described as mapping a query (usually represented by a vector) and a set of key vectors to a set of weighted values, where the weights are determined by the dot product of the query and the keys. The output is then normalized using the softmax function.\n\nIn more detail, an attention function typically consists of three main components:\n\n1. Query: This is the input that is being attended to.\n2. Keys: These are vectors that represent the information being considered.\n3. Values: These are weights or scores associated with each key vector.\n\nThe attention function can be represented mathematically as follows:\n\n`Attention Function = softmax( W * Q ) / sqrt(d)`\n\nWhere:\n\n* `W` is a weight matrix\n* `Q` is the query vector\n* `K` is the set of key vectors (usually represented by a matrix)\n* `d` is the number of dimensions in the input space\n\nThe weights `W` are learned during training and can vary depending on the 
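
The formula in the generated answer should be read with care: small models can paraphrase loosely. For comparison, the paper defines scaled dot-product attention as softmax(Q Kᵀ / sqrt(d_k)) V. The sketch below is a plain-Python rendering of that definition on small lists, written for illustration only and not taken from the notebook's pipeline.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """queries: n x d_k, keys: m x d_k, values: m x d_v, all as lists of lists."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # One score per key: dot product of query and key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


print(scaled_dot_product_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

Checking a generated answer against the retrieved context, or against the source document itself, is a useful habit when the chain is driven by a small model.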