## <h2 style="color:blue;text-align:center;"><strong>Document Retriever Search Engine</strong></h2>

Here the data comes from multiple sources (JSON and PDFs).

## Setting Up Environment Variables

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

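# langchain_openai reads the key from OPENAI_API_KEY; environment variable
# names are case-sensitive outside Windows, so accept either spelling from .env
api_key = os.getenv("OPENAI_API_KEY") or os.getenv("openai_api_key")
if not api_key:
    raise RuntimeError("Set OPENAI_API_KEY (or openai_api_key) in your .env file")
os.environ["OPENAI_API_KEY"] = api_key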

## Data Ingestion

### Download the Dataset

In [2]:
# If the download below fails, fetch the file manually from
# https://drive.google.com/file/d/1aZxZejfteVuofISodUrY2CDoyuPLYDGZ
# and upload it to Colab.
# !gdown 1aZxZejfteVuofISodUrY2CDoyuPLYDGZ

# Uncomment the gdown line above to download the data, then unzip the archive

### Load the JSON Data

In [3]:
from langchain.document_loaders import JSONLoader

loader=JSONLoader(
    file_path="./rag_docs/wikidata_rag_demo.jsonl",
    jq_schema=".",
    text_content=False,
    json_lines=True)
wiki_docs = loader.load()

In [4]:
wiki_docs[:2]

[Document(metadata={'source': 'C:\\Studies\\projects_bkr\\learn\\Analytics Vidya\\RAG\\Mini Projects\\Project_01\\rag_docs\\wikidata_rag_demo.jsonl', 'seq_num': 1}, page_content='{"id": "84801", "title": "Chinese New Year", "paragraphs": ["Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.", "The Chinese New Year is of the most important holidays for Chinese people all over the world. Its 7th day used to be used instead of birthdays to count people\'s ages in China. The holiday is still used to tell people which \\"animal\\" of the Chinese zodiac they are part of. The holiday is a ti

In [5]:
len(wiki_docs)

1801

In [6]:
# Each document should be in this format, because the embedding model
# expects only natural language in page_content
"""
Document(
    page_content="Chinese New Year, known in China as the Spring Festival ...",
    metadata={
        "title": "Chinese New Year",
        "id": "84801",
        "source": "Wikipedia"
    }
)
"""


In [7]:
type(wiki_docs[0])

langchain_core.documents.base.Document

In [8]:
# Changing the document format
from langchain.docstore.document import Document
import json

wiki_docs_processed=[]

for doc in wiki_docs:
    doc=json.loads(doc.page_content)
    metadata={
        "title":doc["title"],
        "id":doc["id"],
        "source":"Wikipedia"        
    }
    data=" ".join(doc["paragraphs"])
    
    wiki_docs_processed.append(Document(
        metadata=metadata,
        page_content=data
    ))

In [9]:
wiki_docs_processed[:2]

[Document(metadata={'title': 'Chinese New Year', 'id': '84801', 'source': 'Wikipedia'}, page_content='Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20. The Chinese New Year is of the most important holidays for Chinese people all over the world. Its 7th day used to be used instead of birthdays to count people\'s ages in China. The holiday is still used to tell people which "animal" of the Chinese zodiac they are part of. The holiday is a time for gifts to children and for family gatherings with large meals, just like Christmas in Europe and in other Christian areas. Unlike Christmas

### Load the Data from PDFs

In [10]:
import glob
pdf_files=glob.glob("./rag_docs/*.pdf")
pdf_files



['./rag_docs\\attention_paper.pdf',
 './rag_docs\\cnn_paper.pdf',
 './rag_docs\\resnet_paper.pdf',
 './rag_docs\\vision_transformer.pdf']

In [11]:
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_chunks(pdf_path):
    """
    Loads a PDF file and splits it into smaller text chunks.
    
    Args:
        pdf_path (str): Path to the PDF file.
    
    Returns:
        list: List of LangChain Document chunks.
    """
    print(f"Reading PDF: {pdf_path}")
    
    # Load PDF
    loader = PyMuPDFLoader(pdf_path)
    docs = loader.load()
    print(f"Loaded {len(docs)} pages from {pdf_path}")
    
    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=3500,
        chunk_overlap=200,   # small overlap improves context retention
        separators=["\n\n", "\n", " ", ""]
    )
    doc_chunks = splitter.split_documents(docs)
    
    print(f"Created {len(doc_chunks)} chunks from {pdf_path}")
    print()
    return doc_chunks


In [12]:
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(get_chunks(fp))


Reading PDF: ./rag_docs\attention_paper.pdf
Loaded 15 pages from ./rag_docs\attention_paper.pdf
Created 16 chunks from ./rag_docs\attention_paper.pdf

Reading PDF: ./rag_docs\cnn_paper.pdf
Loaded 11 pages from ./rag_docs\cnn_paper.pdf
Created 11 chunks from ./rag_docs\cnn_paper.pdf

Reading PDF: ./rag_docs\resnet_paper.pdf
Loaded 12 pages from ./rag_docs\resnet_paper.pdf
Created 24 chunks from ./rag_docs\resnet_paper.pdf

Reading PDF: ./rag_docs\vision_transformer.pdf
Loaded 22 pages from ./rag_docs\vision_transformer.pdf
Created 28 chunks from ./rag_docs\vision_transformer.pdf



In [13]:
len(paper_docs)

79
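
To make the roles of `chunk_size` and `chunk_overlap` in `get_chunks` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also works through its `separators` so chunks tend to break at paragraph, line, or word boundaries). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters, where each
    window starts with the last chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each printed piece begins with the last three characters of the piece before it, which is the same idea as the 200-character overlap used above: text that straddles a chunk boundary still appears intact in at least one chunk.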

In [14]:
paper_docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-08-03T00:07:29+00:00', 'source': './rag_docs\\attention_paper.pdf', 'file_path': './rag_docs\\attention_paper.pdf', 'total_pages': 15, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2023-08-03T00:07:29+00:00', 'trapped': '', 'modDate': 'D:20230803000729Z', 'creationDate': 'D:20230803000729Z', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nG

In [15]:
total_docs = wiki_docs_processed + paper_docs
len(total_docs)

1880

## Vector Databases

One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded the same way, and the stored vectors that are 'most similar' to the embedded query are retrieved. A vector database takes care of storing the embedded data and performing the vector search for you.
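
The embed, store, and search loop can be sketched without any external service. The example below is illustrative only: the three-dimensional vectors are written by hand rather than produced by an embedding model, and `cosine_similarity` and `top_k` are hypothetical helper names, not part of Chroma or LangChain.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "vector store": (text, embedding) pairs with hand-made vectors.
store = [
    ("machine learning", [0.9, 0.1, 0.0]),
    ("chinese new year", [0.0, 0.2, 0.9]),
    ("neural networks", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded user query
results = top_k(query_vec, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A real vector database does the same ranking, but over high-dimensional embeddings and usually with an approximate nearest-neighbour index instead of the exhaustive scan shown here.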

### Chroma Vector DB

[Chroma](https://docs.trychroma.com/getting-started) is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

In [16]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

In [17]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings - takes < 30s on Colab
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  # need to set the distance function to cosine else it uses euclidean by default
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")
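
One caveat with a persisted collection: re-running the cell above against the same `persist_directory` is likely to add the same documents again, because without explicit IDs each run stores them under fresh ones. That may be why some retrieval outputs later in this notebook list the same document several times, although the source does not confirm it. A minimal sketch of one way to avoid this, assuming content-derived IDs are acceptable for this corpus; `stable_doc_id` is a hypothetical helper, and the resulting list would be passed through the `ids` argument of `Chroma.from_documents`.

```python
import hashlib
import json


def stable_doc_id(page_content: str, metadata: dict) -> str:
    """Derive a deterministic ID from a document's text and metadata.

    The same (text, metadata) pair always maps to the same ID, so a store
    that upserts by ID would overwrite the entry instead of duplicating it.
    """
    payload = json.dumps(
        {"content": page_content, "metadata": metadata},
        sort_keys=True,      # key order must not change the hash
        ensure_ascii=False,
        default=str,         # fall back to str() for non-JSON metadata values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Plain dicts stand in for LangChain Document objects in this sketch.
docs = [
    {"page_content": "Text of the first article", "metadata": {"id": "84801", "source": "Wikipedia"}},
    {"page_content": "Text of the second article", "metadata": {"id": "564928", "source": "Wikipedia"}},
]
ids = [stable_doc_id(d["page_content"], d["metadata"]) for d in docs]
print(ids)
```

With real `Document` objects the list comprehension would read `doc.page_content` and `doc.metadata` instead. Note that two chunks with identical text and identical metadata would share an ID under this scheme.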

### Load Vector DB from disk

Once you have a vector database on disk, you can load it and create a connection to it at any time.

In [18]:
chroma_db=Chroma(
    collection_name="my_db",
    embedding_function=openai_embed_model,
    persist_directory="./my_db",
)

## Vector Database Retrievers

Here we will explore the following retrieval strategies on our Vector Database:

- Similarity or Ranking based Retrieval
- Multi Query Retrieval
- Contextual Compression Retrieval
- Chained Retrieval Pipeline

### Similarity or Ranking based Retrieval

We use cosine similarity here and retrieve the top 3 most similar documents for the user's input query.

In [19]:
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 3})

In [20]:
from IPython.display import display, Markdown

def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()

In [21]:
query = "what is machine learning?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'title': 'Machine learning', 'id': '564928', 'source': 'Wikipedia'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'title': 'Machine learning', 'id': '564928', 'source': 'Wikipedia'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'source': 'Wikipedia', 'id': '564928', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.




In [22]:
query = "what is nlp?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'source': 'Wikipedia', 'title': 'Neurolinguistic programming', 'id': '335464'}
Content Brief:


Neurolinguistic programming is a way of communicating, created in the 1970s. It is often shortened to "NLP". The discipline assumes there is a link between neurological processes, language and behavior. According to NLP, it is possible to achieve certain goals in life by changing one's behaviour. Certain neuroscientists psychologists and linguists, believe that NLP is unsupported by current scientific evidence and that it uses incorrect and misleading terms and concepts. NLP was invented by Richard Bandler and John Grinder. According to these people, NLP can help solve problems such as phobias, depression, habit disorder, psychosomatic illnesses, and learning disorders.


Metadata: {'source': 'Wikipedia', 'title': 'Neurolinguistic programming', 'id': '335464'}
Content Brief:


Neurolinguistic programming is a way of communicating, created in the 1970s. It is often shortened to "NLP". The discipline assumes there is a link between neurological processes, language and behavior. According to NLP, it is possible to achieve certain goals in life by changing one's behaviour. Certain neuroscientists psychologists and linguists, believe that NLP is unsupported by current scientific evidence and that it uses incorrect and misleading terms and concepts. NLP was invented by Richard Bandler and John Grinder. According to these people, NLP can help solve problems such as phobias, depression, habit disorder, psychosomatic illnesses, and learning disorders.


Metadata: {'source': 'Wikipedia', 'id': '335464', 'title': 'Neurolinguistic programming'}
Content Brief:


Neurolinguistic programming is a way of communicating, created in the 1970s. It is often shortened to "NLP". The discipline assumes there is a link between neurological processes, language and behavior. According to NLP, it is possible to achieve certain goals in life by changing one's behaviour. Certain neuroscientists psychologists and linguists, believe that NLP is unsupported by current scientific evidence and that it uses incorrect and misleading terms and concepts. NLP was invented by Richard Bandler and John Grinder. According to these people, NLP can help solve problems such as phobias, depression, habit disorder, psychosomatic illnesses, and learning disorders.




### Multi Query Retrieval

Retrieval may produce different results with subtle changes in query wording, or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.

The [`MultiQueryRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents.
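
The "unique union" step can be sketched in a few lines. This is an illustration of the idea only, not LangChain's implementation: the query generator and the retriever are hard-coded stand-ins for the LLM and the vector store, and every name in the block is hypothetical.

```python
from typing import Callable


def multi_query_retrieve(
    query: str,
    generate_queries: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Run every generated query and return the unique union of the hits,
    preserving first-seen order."""
    seen = set()
    merged = []
    for variant in generate_queries(query):
        for doc in retrieve(variant):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Stand-in for the LLM: rephrases the query from three fixed angles.
def fake_generate_queries(query: str) -> list[str]:
    return [
        f"{query} definition",
        f"{query} applications",
        f"{query} key techniques",
    ]


# Stand-in for the vector store: a fixed mapping from query to document names.
FAKE_INDEX = {
    "what is ML? definition": ["doc_ml_intro", "doc_ai_history"],
    "what is ML? applications": ["doc_ml_intro", "doc_spam_filtering"],
    "what is ML? key techniques": ["doc_ml_intro"],
}


def fake_retrieve(query: str) -> list[str]:
    return FAKE_INDEX.get(query, [])


merged_docs = multi_query_retrieve("what is ML?", fake_generate_queries, fake_retrieve)
print(merged_docs)
```

Deduplication in this sketch relies on exact equality of the returned items, so two stored copies of the same text that differ in any field would both survive the union.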

In [23]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [24]:
from langchain.retrievers.multi_query import MultiQueryRetriever
# Set logging for the queries
import logging

similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 3})

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever, llm=chatgpt
)

logging.basicConfig()
# so we can see what queries are generated by the LLM
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [25]:
query = "what is a cnn?"
top_docs = mq_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does CNN stand for and what are its main functions?  ', 'Can you explain the concept and applications of convolutional neural networks?  ', 'What are the key features and uses of CNNs in machine learning?']


Metadata: {'source': 'Wikipedia', 'title': 'CNN', 'id': '3615'}
Content Brief:


The Cable News Network (CNN) is an American cable news television channel. It was founded in 1980 by Ted Turner. The Cable News Network first aired on television on June 1, 1980. The Cable News Network's first newscast was anchored (hosted) by David Walker and his wife Lois Hart. In its first year CNN hired many political analysts, including Rowland Evans and Robert Novak. On January 1, 1982 CNN launched a 24-hour sister newscast channel with no talk shows or commentary shows called CNN2. CNN broadcasts programs from its headquarters at the CNN Center in Atlanta, or from the Time Warner Center in New York City, or from studios in Washington, D.C., and Los Angeles. CNN is owned by Time Warner, and the U.S. news channel is a part of the Turner Broadcasting System. The hosts of its opinion shows are Don Lemon, Chris Cuomo, Fredricka Whitfield, Erin Burnett, Brianna Keiler and Brooke Baldwin. CNN has been criticized by the right-wing Media Research Center for having a left-wing bias. Accor


Metadata: {'source': 'Wikipedia', 'title': 'CNN', 'id': '3615'}
Content Brief:


The Cable News Network (CNN) is an American cable news television channel. It was founded in 1980 by Ted Turner. The Cable News Network first aired on television on June 1, 1980. The Cable News Network's first newscast was anchored (hosted) by David Walker and his wife Lois Hart. In its first year CNN hired many political analysts, including Rowland Evans and Robert Novak. On January 1, 1982 CNN launched a 24-hour sister newscast channel with no talk shows or commentary shows called CNN2. CNN broadcasts programs from its headquarters at the CNN Center in Atlanta, or from the Time Warner Center in New York City, or from studios in Washington, D.C., and Los Angeles. CNN is owned by Time Warner, and the U.S. news channel is a part of the Turner Broadcasting System. The hosts of its opinion shows are Don Lemon, Chris Cuomo, Fredricka Whitfield, Erin Burnett, Brianna Keiler and Brooke Baldwin. CNN has been criticized by the right-wing Media Research Center for having a left-wing bias. Accor


Metadata: {'title': 'CNN', 'source': 'Wikipedia', 'id': '3615'}
Content Brief:


The Cable News Network (CNN) is an American cable news television channel. It was founded in 1980 by Ted Turner. The Cable News Network first aired on television on June 1, 1980. The Cable News Network's first newscast was anchored (hosted) by David Walker and his wife Lois Hart. In its first year CNN hired many political analysts, including Rowland Evans and Robert Novak. On January 1, 1982 CNN launched a 24-hour sister newscast channel with no talk shows or commentary shows called CNN2. CNN broadcasts programs from its headquarters at the CNN Center in Atlanta, or from the Time Warner Center in New York City, or from studios in Washington, D.C., and Los Angeles. CNN is owned by Time Warner, and the U.S. news channel is a part of the Turner Broadcasting System. The hosts of its opinion shows are Don Lemon, Chris Cuomo, Fredricka Whitfield, Erin Burnett, Brianna Keiler and Brooke Baldwin. CNN has been criticized by the right-wing Media Research Center for having a left-wing bias. Accor


Metadata: {'creationDate': 'D:20151203014807Z', 'keywords': '', 'creator': 'LaTeX with hyperref package', 'producer': 'pdfTeX-1.40.12', 'format': 'PDF 1.5', 'page': 0, 'source': './rag_docs\\cnn_paper.pdf', 'moddate': '2015-12-03T01:48:07+00:00', 'title': '', 'trapped': '', 'modDate': 'D:20151203014807Z', 'subject': '', 'creationdate': '2015-12-03T01:48:07+00:00', 'file_path': './rag_docs\\cnn_paper.pdf', 'total_pages': 11, 'author': ''}
Content Brief:


An Introduction to Convolutional Neural Networks
Keiron O’Shea1 and Ryan Nash2
1 Department of Computer Science, Aberystwyth University, Ceredigion, SY23 3DB
keo7@aber.ac.uk
2 School of Computing and Communications, Lancaster University, Lancashire, LA1
4YW
nashrd@live.lancs.ac.uk
Abstract. The ﬁeld of machine learning has taken a dramatic twist in re-
cent times, with the rise of the Artiﬁcial Neural Network (ANN). These
biologically inspired computational models are able to far exceed the per-
formance of previous forms of artiﬁcial intelligence in common machine
learning tasks. One of the most impressive forms of ANN architecture is
that of the Convolutional Neural Network (CNN). CNNs are primarily
used to solve difﬁcult image-driven pattern recognition tasks and with
their precise yet simple architecture, offers a simpliﬁed method of getting
started with ANNs.
This document provides a brief introduction to CNNs, discussing recently
published papers and newly formed techniques in de


Metadata: {'keywords': '', 'moddate': '2015-12-03T01:48:07+00:00', 'creationDate': 'D:20151203014807Z', 'producer': 'pdfTeX-1.40.12', 'total_pages': 11, 'page': 0, 'subject': '', 'source': './rag_docs\\cnn_paper.pdf', 'format': 'PDF 1.5', 'trapped': '', 'author': '', 'file_path': './rag_docs\\cnn_paper.pdf', 'modDate': 'D:20151203014807Z', 'title': '', 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-12-03T01:48:07+00:00'}
Content Brief:


An Introduction to Convolutional Neural Networks
Keiron O’Shea1 and Ryan Nash2
1 Department of Computer Science, Aberystwyth University, Ceredigion, SY23 3DB
keo7@aber.ac.uk
2 School of Computing and Communications, Lancaster University, Lancashire, LA1
4YW
nashrd@live.lancs.ac.uk
Abstract. The ﬁeld of machine learning has taken a dramatic twist in re-
cent times, with the rise of the Artiﬁcial Neural Network (ANN). These
biologically inspired computational models are able to far exceed the per-
formance of previous forms of artiﬁcial intelligence in common machine
learning tasks. One of the most impressive forms of ANN architecture is
that of the Convolutional Neural Network (CNN). CNNs are primarily
used to solve difﬁcult image-driven pattern recognition tasks and with
their precise yet simple architecture, offers a simpliﬁed method of getting
started with ANNs.
This document provides a brief introduction to CNNs, discussing recently
published papers and newly formed techniques in de


Metadata: {'author': '', 'page': 0, 'subject': '', 'trapped': '', 'producer': 'pdfTeX-1.40.12', 'creationdate': '2015-12-03T01:48:07+00:00', 'creationDate': 'D:20151203014807Z', 'total_pages': 11, 'format': 'PDF 1.5', 'keywords': '', 'source': './rag_docs\\cnn_paper.pdf', 'creator': 'LaTeX with hyperref package', 'title': '', 'file_path': './rag_docs\\cnn_paper.pdf', 'moddate': '2015-12-03T01:48:07+00:00', 'modDate': 'D:20151203014807Z'}
Content Brief:


An Introduction to Convolutional Neural Networks
Keiron O’Shea1 and Ryan Nash2
1 Department of Computer Science, Aberystwyth University, Ceredigion, SY23 3DB
keo7@aber.ac.uk
2 School of Computing and Communications, Lancaster University, Lancashire, LA1
4YW
nashrd@live.lancs.ac.uk
Abstract. The ﬁeld of machine learning has taken a dramatic twist in re-
cent times, with the rise of the Artiﬁcial Neural Network (ANN). These
biologically inspired computational models are able to far exceed the per-
formance of previous forms of artiﬁcial intelligence in common machine
learning tasks. One of the most impressive forms of ANN architecture is
that of the Convolutional Neural Network (CNN). CNNs are primarily
used to solve difﬁcult image-driven pattern recognition tasks and with
their precise yet simple architecture, offers a simpliﬁed method of getting
started with ANNs.
This document provides a brief introduction to CNNs, discussing recently
published papers and newly formed techniques in de


Metadata: {'source': './rag_docs\\cnn_paper.pdf', 'title': '', 'creator': 'LaTeX with hyperref package', 'producer': 'pdfTeX-1.40.12', 'keywords': '', 'file_path': './rag_docs\\cnn_paper.pdf', 'total_pages': 11, 'modDate': 'D:20151203014807Z', 'creationDate': 'D:20151203014807Z', 'author': '', 'moddate': '2015-12-03T01:48:07+00:00', 'creationdate': '2015-12-03T01:48:07+00:00', 'page': 2, 'subject': '', 'format': 'PDF 1.5', 'trapped': ''}
Content Brief:


Introduction to Convolutional Neural Networks
3
more suited for image-focused tasks - whilst further reducing the parameters
required to set up the model.
One of the largest limitations of traditional forms of ANN is that they tend to
struggle with the computational complexity required to compute image data.
Common machine learning benchmarking datasets such as the MNIST database
of handwritten digits are suitable for most forms of ANN, due to its relatively
small image dimensionality of just 28 × 28. With this dataset a single neuron in
the ﬁrst hidden layer will contain 784 weights (28×28×1 where 1 bare in mind
that MNIST is normalised to just black and white values), which is manageable
for most forms of ANN.
If you consider a more substantial coloured image input of 64 × 64, the number
of weights on just a single neuron of the ﬁrst layer increases substantially to
12, 288. Also take into account that to deal with this scale of input, the network
will also need to be a lot larger th


Metadata: {'title': '', 'moddate': '2015-12-03T01:48:07+00:00', 'author': '', 'keywords': '', 'page': 2, 'modDate': 'D:20151203014807Z', 'file_path': './rag_docs\\cnn_paper.pdf', 'creationDate': 'D:20151203014807Z', 'producer': 'pdfTeX-1.40.12', 'total_pages': 11, 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-12-03T01:48:07+00:00', 'format': 'PDF 1.5', 'subject': '', 'source': './rag_docs\\cnn_paper.pdf', 'trapped': ''}
Content Brief:


Introduction to Convolutional Neural Networks
3
more suited for image-focused tasks - whilst further reducing the parameters
required to set up the model.
One of the largest limitations of traditional forms of ANN is that they tend to
struggle with the computational complexity required to compute image data.
Common machine learning benchmarking datasets such as the MNIST database
of handwritten digits are suitable for most forms of ANN, due to its relatively
small image dimensionality of just 28 × 28. With this dataset a single neuron in
the ﬁrst hidden layer will contain 784 weights (28×28×1 where 1 bare in mind
that MNIST is normalised to just black and white values), which is manageable
for most forms of ANN.
If you consider a more substantial coloured image input of 64 × 64, the number
of weights on just a single neuron of the ﬁrst layer increases substantially to
12, 288. Also take into account that to deal with this scale of input, the network
will also need to be a lot larger th


Metadata: {'trapped': '', 'creator': 'LaTeX with hyperref package', 'title': '', 'modDate': 'D:20151203014807Z', 'creationdate': '2015-12-03T01:48:07+00:00', 'moddate': '2015-12-03T01:48:07+00:00', 'producer': 'pdfTeX-1.40.12', 'total_pages': 11, 'source': './rag_docs\\cnn_paper.pdf', 'author': '', 'keywords': '', 'format': 'PDF 1.5', 'creationDate': 'D:20151203014807Z', 'subject': '', 'file_path': './rag_docs\\cnn_paper.pdf', 'page': 2}
Content Brief:


Introduction to Convolutional Neural Networks
3
more suited for image-focused tasks - whilst further reducing the parameters
required to set up the model.
One of the largest limitations of traditional forms of ANN is that they tend to
struggle with the computational complexity required to compute image data.
Common machine learning benchmarking datasets such as the MNIST database
of handwritten digits are suitable for most forms of ANN, due to its relatively
small image dimensionality of just 28 × 28. With this dataset a single neuron in
the ﬁrst hidden layer will contain 784 weights (28×28×1 where 1 bare in mind
that MNIST is normalised to just black and white values), which is manageable
for most forms of ANN.
If you consider a more substantial coloured image input of 64 × 64, the number
of weights on just a single neuron of the ﬁrst layer increases substantially to
12, 288. Also take into account that to deal with this scale of input, the network
will also need to be a lot larger th




In [26]:
query = "what is ML?"
top_docs = mq_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does machine learning (ML) refer to, and how does it work?  ', 'Can you explain the concept of machine learning and its applications?  ', 'What are the key principles and techniques involved in machine learning?']


Metadata: {'id': '564928', 'title': 'Machine learning', 'source': 'Wikipedia'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'title': 'Machine learning', 'id': '564928', 'source': 'Wikipedia'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '564928', 'title': 'Machine learning', 'source': 'Wikipedia'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.




In [27]:
len(top_docs)

3

### Contextual Compression Retrieval

The information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned.

This compression can happen in the form of:

- Removing the parts of each retrieved document that are not relevant to the query, by extracting only the passages that bear on it

- Filtering out whole documents that are not relevant to the query, without altering the content of the documents that remain
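
The two forms of compression can be contrasted with a small sketch. It uses shared keywords as a crude stand-in for the relevance judgement that an LLM-based compressor (such as the `LLMChainExtractor` used in the next cell) would make, so it only illustrates the shape of each operation. All function names here are hypothetical.

```python
import re


def _terms(text: str) -> set[str]:
    """Lowercased word set, used as a crude relevance signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_relevant(doc: str, query: str) -> str:
    """Form 1: keep only the sentences that share a term with the query."""
    query_terms = _terms(query)
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    kept = [s for s in sentences if _terms(s) & query_terms]
    return " ".join(kept)


def filter_documents(docs: list[str], query: str) -> list[str]:
    """Form 2: drop whole documents with no shared term; keep the rest unchanged."""
    query_terms = _terms(query)
    return [d for d in docs if _terms(d) & query_terms]


docs = [
    "Machine learning builds models from data. The weather was nice in 1959.",
    "The Cable News Network is a television channel.",
]
query = "machine learning models"

extracted = [extract_relevant(d, query) for d in docs]
filtered = filter_documents(docs, query)
print(extracted)
print(filtered)
```

Extraction shortens the first document and empties the second, while filtering keeps the first document word for word and drops the second entirely.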

In [28]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor


# extracts from each document only the content that is relevant to the query
compressor = LLMChainExtractor.from_llm(llm=chatgpt)

# retrieves the documents similar to query and then applies the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=mq_retriever
)

In [29]:
query = "what is ML?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does machine learning (ML) refer to in the field of artificial intelligence?  ', 'Can you explain the concept and applications of machine learning?  ', 'What are the key principles and techniques involved in machine learning?']


Metadata: {'source': 'Wikipedia', 'title': 'Machine learning', 'id': '564928'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'source': 'Wikipedia', 'id': '564928', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'source': 'Wikipedia', 'title': 'Machine learning', 'id': '564928'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.




## Build the RAG Pipeline

In [30]:
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            """

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

In [31]:
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_rag_chain = (
    {
        "context": (similarity_retriever
                      |
                    format_docs),
        "question": RunnablePassthrough()
    }
      |
    rag_prompt_template
      |
    chatgpt
)
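
To make the data flow through `qa_rag_chain` easier to follow, the sketch below replays the same steps with plain Python functions. It is only an illustration: `fake_retriever`, `fill_prompt`, and `fake_llm` are hypothetical stand-ins for the retriever, the prompt template, and the chat model, and no LangChain code is executed.

```python
def fake_retriever(question):
    # Hypothetical stand-in for similarity_retriever: returns raw text chunks.
    return [
        "Machine learning is a subfield of computer science.",
        "It builds a model from sample inputs.",
    ]


def format_chunks(chunks):
    # Same joining rule as format_docs above, applied to plain strings.
    return "\n\n".join(chunks)


def fill_prompt(inputs):
    # Stand-in for rag_prompt_template: substitutes both fields into a template.
    return (
        f"Question:\n{inputs['question']}\n\n"
        f"Context:\n{inputs['context']}\n\nAnswer:"
    )


def fake_llm(prompt):
    # Hypothetical stand-in for the chat model; it only reports the prompt size.
    return f"(model reply to a {len(prompt)}-character prompt)"


def run_chain(question):
    # Step 1: the dict sends the same input down both branches.
    inputs = {
        "context": format_chunks(fake_retriever(question)),  # retriever | format_docs
        "question": question,                                # RunnablePassthrough()
    }
    # Step 2: prompt template, then model, each consuming the previous output.
    prompt = fill_prompt(inputs)
    return prompt, fake_llm(prompt)


prompt, reply = run_chain("What is machine learning?")
print(prompt)
```

The point to notice is that the user's question is used twice: once as the retrieval query and once, unchanged, as the `question` field of the prompt.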



In [32]:
from IPython.display import display, Markdown

query = "What is machine learning?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed, a concept first articulated by Arthur Samuel in 1959. It originated from research in artificial intelligence and focuses on the study and construction of algorithms capable of learning and making predictions based on data.

Key aspects of machine learning include:

- **Learning from Data**: Machine learning algorithms can learn from sample inputs and build models that allow them to make predictions or decisions based on new data.
- **Algorithmic Flexibility**: While these algorithms follow programmed instructions, they also have the capability to adapt and improve their performance based on the data they process.
- **Applications**: Machine learning is particularly useful in scenarios where it is difficult or impossible to design and program explicit algorithms. Common applications include:
  - Spam filtering
  - Detection of network intruders or malicious insiders
  - Optical character recognition (OCR)
  - Search engines
  - Computer vision

In summary, machine learning enables computers to improve their performance on tasks through experience, making it a powerful tool in various domains.

In [33]:
query = "What do you know about Bharath?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

I don't know.

In [34]:
query = "CNN"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The Cable News Network (CNN) is an American cable news television channel that was founded in 1980 by Ted Turner. It first aired on television on June 1, 1980, with its inaugural newscast anchored by David Walker and his wife Lois Hart. In its first year, CNN hired several political analysts, including Rowland Evans and Robert Novak.

On January 1, 1982, CNN launched a 24-hour sister newscast channel called CNN2, which did not feature talk shows or commentary shows. CNN broadcasts its programs from various locations, including its headquarters at the CNN Center in Atlanta, the Time Warner Center in New York City, and studios in Washington, D.C., and Los Angeles.

CNN is owned by Time Warner and is part of the Turner Broadcasting System. The channel features opinion shows hosted by notable figures such as Don Lemon, Chris Cuomo, Fredricka Whitfield, Erin Burnett, Brianna Keiler, and Brooke Baldwin.

CNN has faced criticism for perceived biases. The right-wing Media Research Center has accused it of having a left-wing bias, although it claims that CNN is less left-leaning than the news divisions of ABC, NBC, and CBS. Additionally, CNN has been criticized by some Arabs for having a pro-American bias.

In [35]:
result

AIMessage(content='The Cable News Network (CNN) is an American cable news television channel that was founded in 1980 by Ted Turner. It first aired on television on June 1, 1980, with its inaugural newscast anchored by David Walker and his wife Lois Hart. In its first year, CNN hired several political analysts, including Rowland Evans and Robert Novak.\n\nOn January 1, 1982, CNN launched a 24-hour sister newscast channel called CNN2, which did not feature talk shows or commentary shows. CNN broadcasts its programs from various locations, including its headquarters at the CNN Center in Atlanta, the Time Warner Center in New York City, and studios in Washington, D.C., and Los Angeles.\n\nCNN is owned by Time Warner and is part of the Turner Broadcasting System. The channel features opinion shows hosted by notable figures such as Don Lemon, Chris Cuomo, Fredricka Whitfield, Erin Burnett, Brianna Keiler, and Brooke Baldwin.\n\nCNN has faced criticism for perceived biases. The right-wing Me

## RAG With Sources

In [36]:
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

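# Answer-generation chain: expects a dict with "context" (a list of Documents)
# and "question" (a string), and returns the answer as a plain string.
src_rag_response_chain = (
    {
        "context": itemgetter("context") | RunnableLambda(format_docs),
        "question": itemgetter("question"),
    }
    | rag_prompt_template
    | chatgpt
    | StrOutputParser()
)

# Retrieve the documents, keep them under "context", and attach the generated
# answer under "response" so the sources are returned alongside it.
rag_chain_w_sources = (
    {
        "context": similarity_retriever,
        "question": RunnablePassthrough(),
    }
    | RunnablePassthrough.assign(response=src_rag_response_chain)
)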
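
The key step in this cell is `RunnablePassthrough.assign`, which keeps the incoming dict and adds a new key computed from it. The sketch below imitates that idea in plain Python; `assign`, `fake_retriever`, and `fake_response_chain` are hypothetical helpers written for illustration, not LangChain's implementation.

```python
def assign(**steps):
    """Toy version of the 'assign' idea: copy the input dict and add new keys."""
    def run(inputs):
        out = dict(inputs)            # existing keys pass through unchanged
        for key, step in steps.items():
            out[key] = step(inputs)   # each new key is computed from the input
        return out
    return run


def fake_retriever(question):
    # Hypothetical stand-in for similarity_retriever.
    return ["chunk A", "chunk B"]


def fake_response_chain(inputs):
    # Hypothetical stand-in for src_rag_response_chain.
    return f"answer built from {len(inputs['context'])} chunks"


def rag_with_sources(question):
    inputs = {"context": fake_retriever(question), "question": question}
    return assign(response=fake_response_chain)(inputs)


result = rag_with_sources("What is machine learning")
print(list(result))
```

Because the retrieved documents stay in the dict under `context`, the caller gets both the answer and the sources it was generated from in a single result.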

In [37]:
query = "What is machine learning"
result = rag_chain_w_sources.invoke(query)
result

{'context': [Document(id='0040f3f4-00d6-4232-bcd7-8aa06d88a759', metadata={'title': 'Machine learning', 'id': '564928', 'source': 'Wikipedia'}, page_content='Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.'),
  Document(id='713ce55d-845e-45d8-ade4-985d1cdac86a', metadata={'source': 'Wikipedia', 'id': '5

In [38]:
result.keys(), type(result["context"])

(dict_keys(['context', 'question', 'response']), list)

In [39]:
query = "What is machine learning"
result = rag_chain_w_sources.invoke(query)


In [40]:
from IPython.display import display, Markdown

def display_output(result):
    # Display question as a heading
    display(Markdown(f"### ❓ Question\n\n{result['question']}"))

    # Display response
    display(Markdown(f"### 💡 Answer\n\n{result['response']}"))

    # Display sources
    display(Markdown("### 📚 Sources\n"))
    for i, source in enumerate(result["context"], 1):
        md_text = f"""
**Source {i}:**

- **Metadata:**  
  `{source.metadata}`  

- **Content (Brief):**  
  > {source.page_content[:400]}{'...' if len(source.page_content) > 400 else ''}
"""
        display(Markdown(md_text))


In [41]:
display_output(result)

### ❓ Question

What is machine learning

### 💡 Answer

Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed, a concept first articulated by Arthur Samuel in 1959. It originated from research in artificial intelligence and focuses on the study and construction of algorithms that can learn from and make predictions based on data.

Key aspects of machine learning include:

- **Learning from Data**: Machine learning algorithms build models from sample inputs, allowing them to make predictions or decisions based on new data.
- **Algorithmic Flexibility**: While these algorithms follow programmed instructions, they also possess the capability to adapt and improve their performance over time without human intervention.
- **Applications**: Machine learning is particularly useful in scenarios where designing and programming explicit algorithms is impractical. Common applications include:
  - Spam filtering
  - Detection of network intruders or malicious insiders
  - Optical character recognition (OCR)
  - Search engines
  - Computer vision

Overall, machine learning represents a significant advancement in how computers can process information and make decisions autonomously.

### 📚 Sources



**Source 1:**

- **Metadata:**  
  `{'title': 'Machine learning', 'source': 'Wikipedia', 'id': '564928'}`  

- **Content (Brief):**  
  > Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or dec...



**Source 2:**

- **Metadata:**  
  `{'source': 'Wikipedia', 'title': 'Machine learning', 'id': '564928'}`  

- **Content (Brief):**  
  > Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or dec...



**Source 3:**

- **Metadata:**  
  `{'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}`  

- **Content (Brief):**  
  > Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or dec...
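
The three sources listed here carry the same `id` and the same opening text, so the answer effectively rests on a single passage. One possible way to tidy the source list is to drop repeats before displaying it. The sketch below assumes that two documents are duplicates when their metadata `id` and their content both match; `Doc` and `dedupe_docs` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_docs(docs):
    """Keep the first occurrence of each (id, content) pair, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc.metadata.get("id"), doc.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


text = "Machine learning gives computers the ability to learn."
docs = [
    Doc(text, {"title": "Machine learning", "source": "Wikipedia", "id": "564928"}),
    Doc(text, {"source": "Wikipedia", "title": "Machine learning", "id": "564928"}),
    Doc("CNN is a cable news channel.", {"id": "1", "source": "Wikipedia"}),
    Doc(text, {"id": "564928", "source": "Wikipedia", "title": "Machine learning"}),
]

print(len(dedupe_docs(docs)))
```

Since the helper only reads `.metadata` and `.page_content`, it could in principle be applied to `result["context"]` before calling `display_output(result)`.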
