# Simple RAG System using LlamaIndex
* Notebook by Adam Lang
* Date: 3/6/2024
* In this notebook we will build a simple RAG system using LlamaIndex.

# Basic RAG Pipeline
1. Data ingestion
2. Indexing
3. Retriever
4. Response Synthesizer
5. Querying
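
The five stages above can be read as one data flow. Below is a minimal sketch of that flow in plain Python. `run_rag` and the lambda stand-ins are hypothetical helpers written for illustration only; they are not LlamaIndex APIs, and a real pipeline would use embeddings and an LLM in their place.

```python
def run_rag(query, documents, split, retrieve, synthesize):
    """Wire the pipeline stages together (hypothetical helper, not a LlamaIndex API)."""
    # Stages 1-2: ingest the documents and split them into chunks ("nodes")
    nodes = [chunk for doc in documents for chunk in split(doc)]
    # Stage 3: select the chunks relevant to the query
    context = retrieve(query, nodes)
    # Stage 4: turn query + context into an answer (normally an LLM call)
    return synthesize(query, context)


# Stage 5: querying, with trivial stand-ins for each stage
answer = run_rag(
    "What is DHS?",
    documents=["DHS is an AI conference. It runs for four days."],
    split=lambda doc: [s.strip() for s in doc.split(".") if s.strip()],
    retrieve=lambda q, nodes: [n for n in nodes if "DHS" in n],
    synthesize=lambda q, ctx: f"Q: {q} | context: {' '.join(ctx)}",
)
print(answer)
```

Passing the stages in as functions keeps the sketch short and shows that each stage can be swapped independently, which is the same idea the rest of the notebook follows with LlamaIndex components.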

### Set up llama-index and OpenAI key

In [None]:
!pip install llama-index



In [None]:
# set the OpenAI API key as an environment variable
import os
os.environ['OPENAI_API_KEY']  = '<your key here>'
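
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the key from the environment and only prompts when it is missing. `resolve_api_key` is a hypothetical helper name, not part of LlamaIndex or the OpenAI client.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only if it is missing.

    Hypothetical helper: `prompt` is injectable so the function can be
    exercised without an interactive terminal.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it does not end up in cell output
        key = prompt(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `resolve_api_key()` once at the top of the notebook would then replace the assignment above.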

# Stage 1: Data ingestion

## 1.1 Data Loaders

In [None]:
# download files (mkdir -p does not fail if the directory already exists)
!mkdir -p './data/'
!wget 'https://raw.githubusercontent.com/aravindpai/Speech-Recognition/c9c45731e966592b1805929fc1585c72e1f34f10/dhs.txt' -O './data/dhs.txt'

--2024-03-06 20:23:08--  https://raw.githubusercontent.com/aravindpai/Speech-Recognition/c9c45731e966592b1805929fc1585c72e1f34f10/dhs.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20789 (20K) [text/plain]
Saving to: ‘./data/dhs.txt’


2024-03-06 20:23:08 (22.8 MB/s) - ‘./data/dhs.txt’ saved [20789/20789]
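
Re-running a cell that creates a directory can fail when the directory is already there. A minimal sketch of an idempotent, pure-Python alternative to the shell commands follows. `ensure_data_dir` and `fetch` are hypothetical helper names; `fetch` needs network access and is only defined here, not called.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_data_dir(path="data"):
    """Create the directory if needed; succeed silently if it already exists."""
    data_dir = Path(path)
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


def fetch(url, dest_dir="data", filename="dhs.txt"):
    """Download `url` into `dest_dir/filename` and return the local path."""
    target = ensure_data_dir(dest_dir) / filename
    urlretrieve(url, str(target))
    return target
```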



In [None]:
# import SimpleDirectoryReader and load every file in the data directory
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()

In [None]:
# print document type
type(documents)

list

In [None]:
# length of documents
len(documents)

1

Summary:
* `documents` is a list holding a single `Document` object, one for the single file in `data`.

In [None]:
# inspect the Document object (id, metadata and text)
documents[0]

Document(id_='d545cd4a-b6d6-4328-ae82-a86ebfec5b32', embedding=None, metadata={'file_path': '/content/data/dhs.txt', 'file_name': '/content/data/dhs.txt', 'file_type': 'text/plain', 'file_size': 20789, 'creation_date': '2024-03-06', 'last_modified_date': '2024-03-06'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text="\ufeffDataHack Summit 2023 (DHS) India’s most Futuristic AI Conference organized by Analytics Vidhya.Analytics Vidhya is the World’s leading and India’s largest data science community.Analytics Vidhya is founded by Kunal Jain. Analytics Vidhya aims to build the next generation data science ecosystem across the globe.We have helped millions of people realize their data science dreams.We conduct hackathons, competitions, training & conferences 

## Embedding Model
* LlamaIndex supports various embedding models, listed here: https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html
* We will use the OpenAI embeddings.

In [None]:
# install the OpenAI embeddings integration
!pip install llama-index-embeddings-openai



In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

embed_model = OpenAIEmbedding()


## LLM
* Instantiate the OpenAI LLM.

In [None]:
from llama_index.llms.openai import OpenAI
llm = OpenAI()

# Stage 2: Indexing

In [None]:
from llama_index.core import Settings


In [None]:
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
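
`Settings.chunk_size` controls how large each chunk is when documents are split before embedding. To make the idea concrete, here is a toy splitter. It is a sketch under simplifying assumptions: it counts whitespace-separated words rather than tokens, and it is not the splitter LlamaIndex uses internally. `chunk_words` and its `overlap` parameter are hypothetical names.

```python
def chunk_words(text, chunk_size=512, overlap=0):
    """Split text into chunks of at most `chunk_size` words (toy example)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

Smaller chunks give more precise retrieval hits but less context per hit; larger chunks do the opposite.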

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)


# Stage 3: Retrieval

In [None]:
# build a retriever from the index
retriever = index.as_retriever()

In [None]:
# retrieve the nodes most relevant to the query
retrieved_nodes = retriever.retrieve("What is the overall theme of DHS?")
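
Conceptually, a vector retriever embeds the query, scores every chunk by similarity to it, and returns the best matches. The sketch below illustrates that with a toy bag-of-words "embedding" and cosine similarity. It is only an illustration: the real pipeline uses OpenAI embeddings, and `embed`, `cosine` and `retrieve` are hypothetical helpers rather than LlamaIndex functions.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts instead of a learned dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs, highest score first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = [
    "DHS is an AI conference",
    "Bananas are yellow",
    "The conference has many hands-on AI workshops",
]
for score, chunk in retrieve("AI conference", chunks):
    print(round(score, 3), chunk)
```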

In [None]:
# text of the first retrieved node
retrieved_nodes[0].text

"To become a sponsor for DataHack Summit 2023, please contact the conference organizers for more information.The format of DHS 2023 includes Live Keynotes, Power Talks, Hack Sessions, Generative AI Sessions, Workshops, Awards Evening, The AI Showcase.In the AI Showcase, check out the latest and the best in Artificial Intelligence from exciting startups, solution providers to bleeding edge hardware and software providers! Awards Evening is to recognize the best in AI, the awards night uplifts and inspires everyone present.This showcases the groundbreaking innovations and business in the AI landscape.Workshops are each day-long hands-on session aimed to make sure you learn Artificial Intelligence by doing it yourself.No more lectures – just code with the help of experts. Hack Session is no better way to understand AI than seeing an expert building it in front of your eyes.Each Hack Session is a 60 to 90 minutes long live interactive session with an expert working in front of you! Generat

In [None]:
# text of the second retrieved node
retrieved_nodes[1].text

"\ufeffDataHack Summit 2023 (DHS) India’s most Futuristic AI Conference organized by Analytics Vidhya.Analytics Vidhya is the World’s leading and India’s largest data science community.Analytics Vidhya is founded by Kunal Jain. Analytics Vidhya aims to build the next generation data science ecosystem across the globe.We have helped millions of people realize their data science dreams.We conduct hackathons, competitions, training & conferences and help companies find the right data science talent.\r\nDHS 2023 totally has 70+ AI Talks, 30+ Hack Sessions and 8+ Workshops.\r\nIt's the 4th edition of DHS.Here is the DHS website: https://www.analyticsvidhya.com/datahack-summit-2023/.The previous DHS happened in 2017, 2018 and 2019 at Bengaluru. The 4 day conference is taking place on 2nd – 5th August, 2023 at NIMHANS Convention Centre, Bengaluru. The 4th day consists of day-long workshops."

# Stage 4: Response Synthesis

In [None]:
from llama_index.core import get_response_synthesizer
from llama_index.core.data_structs import Node
from llama_index.core.response_synthesizers import ResponseMode



In [None]:
# instantiate response_synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.COMPACT
)
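
The compact mode concatenates the retrieved chunks so that fewer LLM calls are needed. The sketch below shows the packing idea only. It assumes a simple character budget in place of a token-based context window, and `compact_chunks` is a hypothetical helper, not the LlamaIndex implementation.

```python
def compact_chunks(chunks, max_chars=1000, sep="\n\n"):
    """Greedily merge consecutive chunks while the result fits in max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + sep + chunk
        if len(candidate) <= max_chars or not current:
            # fits in the budget (or is a single oversized chunk we keep whole)
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


# three chunks collapse into two prompts under a 10-character budget
print(compact_chunks(["aaaa", "bbbb", "cccc"], max_chars=10, sep="|"))
```

Each packed string would then go to the LLM in one call, instead of one call per chunk.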


# Stage 5: Query Engine

In [None]:
query_engine = index.as_query_engine(response_synthesizer=response_synthesizer)

In [None]:
response = query_engine.query("What is the main theme of DHS?")

In [None]:
# text of the generated answer
response.response

'The main theme of DHS is centered around Artificial Intelligence (AI) and its various applications and advancements.'

The nodes that were retrieved and sent to the LLM can be inspected via `response.source_nodes`. (With `ResponseMode.NO_TEXT`, the query engine only runs the retriever to fetch the nodes that would have been sent to the LLM, without actually sending them.)

In [None]:
response.source_nodes[1]

NodeWithScore(node=TextNode(id_='c3552d1c-2b64-4a86-8414-21b345be97c8', embedding=None, metadata={'file_path': '/content/data/dhs.txt', 'file_name': '/content/data/dhs.txt', 'file_type': 'text/plain', 'file_size': 20789, 'creation_date': '2024-03-06', 'last_modified_date': '2024-03-06'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='d545cd4a-b6d6-4328-ae82-a86ebfec5b32', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '/content/data/dhs.txt', 'file_name': '/content/data/dhs.txt', 'file_type': 'text/plain', 'file_size': 20789, 'creation_date': '2024-03-06', 'last_modified_date': '2024-03-06'}, hash='3a878a9c05c381383f2b30552d731995ad16394d8978bf5f69757311ae1ed3e0'), <NodeRelationship.NEXT: '3'

# End-to-End RAG Pipeline in 1 cell

In [None]:
import os
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

os.environ['OPENAI_API_KEY'] = '<your key here>'

# Stage 1 - Data ingestion: load every file in the data directory
documents = SimpleDirectoryReader("data").load_data()

# Set up the embedding model, LLM and chunk size
Settings.embed_model = OpenAIEmbedding()
Settings.llm = OpenAI()
Settings.chunk_size = 512

# Stage 2 - Indexing: build a vector index over the documents
index = VectorStoreIndex.from_documents(documents)

# Stages 3 + 4 - the query engine bundles a retriever and a
# response synthesizer with default settings
query_engine = index.as_query_engine()

# Stage 5 - Querying
print(query_engine.query("What is the main theme of DHS?").response)



The main theme of DHS is centered around Artificial Intelligence (AI) and fostering the data science ecosystem globally.
