# [Building RAG from Scratch](https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval/)

## Setup

### Sentence Transformers

In [1]:
# sentence transformers
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")



### Llama CPP

In [3]:
from llama_index.llms.llama_cpp import LlamaCPP

# model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"
# model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
# model_url = "https://huggingface.co/afrideva/stablelm-2-1_6b-GGUF/resolve/main/stablelm-2-1_6b.fp16.gguf"
model_url = "https://huggingface.co/3Simplex/Meta-Llama-3.1-8B-Instruct-gguf/resolve/main/Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf"
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # 3900 was chosen to stay under Llama 2's 4096-token context window with some wiggle room;
    # the Llama 3.1 file loaded here is labelled 128k, so this value is conservative
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    # model_kwargs={"n_gpu_layers": 1},
    verbose=True,
)






Downloading url https://huggingface.co/3Simplex/Meta-Llama-3.1-8B-Instruct-gguf/resolve/main/Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf to path /tmp/llama_index/models/Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf
total size (MB): 4661.21


4446it [05:03, 14.65it/s]              
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /tmp/llama_index/models/Meta-Llama-3.1-8B-Instruct-128k-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.lice

### Initialize Postgres

In [4]:
import psycopg2

db_name = "vector_db"
host = "postgres_db"
password = "postgres"
port = "5435"
user = "postgres"
# conn = psycopg2.connect(connection_string)
conn = psycopg2.connect(
    dbname="postgres",
    host=host,
    password=password,
    port=port,
    user=user,
)
conn.autocommit = True

with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
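
The two statements above interpolate `db_name` straight into SQL, since a database name is an identifier and cannot be bound like an ordinary query value. That is fine for a hard-coded name; if the name ever came from configuration or user input, a guard along the following lines could help. This is a minimal sketch: `validate_identifier` and its regular expression are hypothetical helpers, not part of psycopg2 or LlamaIndex, and they accept only plain unquoted identifiers.

```python
import re

# Assumption: only plain identifiers (letters, digits, underscores,
# not starting with a digit) are accepted.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name: str) -> str:
    """Return name unchanged if it is a plain SQL identifier, else raise ValueError."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


db_name = validate_identifier("vector_db")
drop_stmt = f"DROP DATABASE IF EXISTS {db_name}"
create_stmt = f"CREATE DATABASE {db_name}"
```

psycopg2 also ships a `psycopg2.sql` module whose `Identifier` class quotes names for you, which may be preferable to a hand-rolled check.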

In [5]:
# from sqlalchemy import make_url
from llama_index.vector_stores.postgres import PGVectorStore

vector_store = PGVectorStore.from_params(
    database=db_name,
    host=host,
    password=password,
    port=port,
    user=user,
    table_name="llama2_paper",
    embed_dim=384,  # bge-small-en embedding dimension
)

## Build an Ingestion Pipeline from Scratch

### 1. Load Data

```sh
mkdir data
wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
```

In [6]:
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader

In [7]:
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

### 2. Use a Text Splitter to Split Documents

In [8]:
from llama_index.core.node_parser import SentenceSplitter

text_parser = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
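
To make the bookkeeping in the loop above concrete, here is a library-free sketch of the same pattern: split each document into chunks and record, for every chunk, the index of the document it came from. The splitter below is a deliberately simplified stand-in, not `SentenceSplitter`'s algorithm: it measures chunk size in characters rather than tokens, applies no overlap, and uses a naive regular expression for sentence boundaries. The names `split_into_chunks` and `chunk_documents` are hypothetical.

```python
import re
from typing import List, Tuple


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_documents(
    texts: List[str], max_chars: int = 200
) -> Tuple[List[str], List[int]]:
    """Chunk every document and record which document each chunk came from."""
    all_chunks: List[str] = []
    source_idxs: List[int] = []
    for doc_idx, text in enumerate(texts):
        doc_chunks = split_into_chunks(text, max_chars)
        all_chunks.extend(doc_chunks)
        source_idxs.extend([doc_idx] * len(doc_chunks))
    return all_chunks, source_idxs


pages = [
    "First sentence. Second sentence. Third sentence.",
    "Only one sentence here.",
]
chunks, idxs = chunk_documents(pages, max_chars=35)
```

Here `chunks[i]` and `idxs[i]` stay aligned, so the source document of any chunk can be looked up later when metadata is attached.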
    

In [9]:
print(len(text_chunks))
print(text_chunks[0])
print(text_chunks[2])

107
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Ed

### 3. Manually Construct Nodes from Text Chunks

In [10]:
from llama_index.core.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)

### 4. Generate Embeddings for each Node

In [11]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

### 5. Load Nodes into a Vector Store (Postgres)

In [12]:
vector_store.add(nodes)

['ec21f2c9-e689-4d0e-900a-86bde77e8b66',
 '443cf5c7-b0c5-414a-8152-b6a7a9987c74',
 '294bd037-25a7-4fca-a829-fd911bbcac5d',
 '39af1f00-8db1-48c5-b82d-0eaa82fe0078',
 '8415e3fe-c891-4a99-937a-b925628c20cd',
 '3998cd12-eac2-4eb8-b863-f4d6b395c7e0',
 '820aac47-8084-4a46-b243-5a45d9457a5f',
 'dee1343c-f9e5-4f23-8cb1-b4f57bf96900',
 '35a13abd-346b-46e8-b646-4632b46d7df4',
 '7449ee98-c558-43af-be0c-14da54ccdea0',
 '049fd5c2-75d3-48ef-97a7-f1bcae08d2e6',
 '60ba0a3c-7b1b-4511-86b1-28d36226fb86',
 '5312d69b-9393-49ef-b8d8-6f6823f4f6ee',
 '7cdf26d5-b8c2-4b0d-b395-55a89ec0c643',
 '38a7d369-937e-4fac-bb9a-cf3299e7e7f3',
 '073fc7e0-bba7-4f87-ae4c-f840e6223ad9',
 '23cf317e-4060-43fa-9fc6-7f7916a4930a',
 '58493196-1ebf-4a42-9496-baeb9f240cc5',
 'a338d0cc-0fa1-4b26-88a5-df8599c3a9d9',
 '49292417-ea45-4d27-8bba-63f466b438f2',
 '8a9ea40d-e0a2-4c71-b21c-7ef24f754258',
 '01ead434-c6b8-4ce1-b094-38157669808c',
 '57a79032-0331-442a-85ae-2aeaca41d983',
 '56247b30-28ac-4d16-adba-fac97135ab37',
 '8864df44-ac97-

## Build Retrieval Pipeline from Scratch

In [13]:
query_str = "Can you tell me about the key concepts for safety finetuning"

### 1. Generate a Query Embedding

In [14]:
query_embedding = embed_model.get_query_embedding(query_str)

### 2. Query the Vector Database

In [15]:
# construct vector store query
from llama_index.core.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)
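
Conceptually, a query with `similarity_top_k=2` asks the store for the two stored vectors closest to the query embedding. The sketch below shows that idea as a brute-force search in plain Python. It assumes cosine similarity as the metric, which is an assumption of this example rather than something the notebook states about `PGVectorStore`, and the `cosine_similarity` and `top_k` helpers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k stored vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real ones here have 384 dimensions.
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], stored, k=2)
```

A real vector database avoids scanning every row by using an index, but the ranking it returns plays the same role as `hits` here.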

In [16]:
# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())

TruthfulQA ↑
ToxiGen ↓
MPT
7B
29.13
22.32
30B
35.25
22.61
Falcon
7B
25.95
14.53
40B
40.39
23.44
Llama 1
7B
27.42
23.00
13B
41.74
23.08
33B
44.19
22.57
65B
48.71
21.77
Llama 2
7B
33.29
21.25
13B
41.86
26.10
34B
43.45
21.19
70B
50.18
24.60
Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the
percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we
present the percentage of toxic generations (the smaller, the better).
Benchmarks give a summary view of model capabilities and behaviors that allow us to understand general
patterns in the model, but they do not provide a fully comprehensive view of the impact the model may have
on people or real-world outcomes; that would require study of end-to-end product deployments. Further
testing and mitigation should be done to understand bias and other social issues for the specific context
in which a system may be deployed. For this, it may be necessary

### 3. Parse Result into a Set of Nodes

In [17]:
from llama_index.core.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))
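
The loop above pairs each returned node with its similarity, falling back to `None` when the store returns no similarities. The same logic can be sketched without the library; `ScoredItem` is a hypothetical stand-in for `NodeWithScore`, and the explicit length check is an addition of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ScoredItem:
    """Hypothetical stand-in for NodeWithScore: an item plus an optional score."""

    item: str
    score: Optional[float] = None


def attach_scores(
    items: Sequence[str], similarities: Optional[Sequence[float]]
) -> List[ScoredItem]:
    """Pair each item with its similarity, or with None if none were returned."""
    if similarities is None:
        return [ScoredItem(item) for item in items]
    if len(similarities) != len(items):
        raise ValueError("items and similarities must have the same length")
    return [ScoredItem(item, score) for item, score in zip(items, similarities)]


scored = attach_scores(["chunk A", "chunk B"], [0.91, 0.87])
unscored = attach_scores(["chunk A", "chunk B"], None)
```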

### 4. Put into a Retriever

In [18]:
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from typing import Any, List


class VectorDBRetriever(BaseRetriever):
    """Retriever over a postgres vector store."""

    def __init__(
        self,
        vector_store: PGVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        # Use the instances passed to __init__ rather than notebook globals.
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)

        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores

In [19]:
retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)

## Plug this into our RetrieverQueryEngine to synthesize a response

In [20]:
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

In [21]:
query_str = "How does Llama 2 perform compared to other open-source models?"

response = query_engine.query(query_str)
# Total time with Llama 3.1: about 180 s


llama_print_timings:        load time =   41762.21 ms
llama_print_timings:      sample time =      21.50 ms /   256 runs   (    0.08 ms per token, 11906.98 tokens per second)
llama_print_timings: prompt eval time =  131901.11 ms /  1535 tokens (   85.93 ms per token,    11.64 tokens per second)
llama_print_timings:        eval time =   47543.29 ms /   255 runs   (  186.44 ms per token,     5.36 tokens per second)
llama_print_timings:       total time =  179912.69 ms /  1790 tokens
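
The throughput figures in the log can be reproduced from the raw millisecond and token counts, which also shows where the roughly 180 seconds went. The short calculation below uses only numbers copied from the timings above; the helper name `tokens_per_second` is hypothetical.

```python
# Values copied from the llama_print_timings output above.
prompt_ms, prompt_tokens = 131901.11, 1535
eval_ms, eval_tokens = 47543.29, 255
total_ms = 179912.69


def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and a duration in milliseconds to tokens per second."""
    return tokens / (elapsed_ms / 1000.0)


prompt_tps = tokens_per_second(prompt_tokens, prompt_ms)  # prompt processing speed
eval_tps = tokens_per_second(eval_tokens, eval_ms)  # generation speed
prompt_share = prompt_ms / total_ms  # fraction of total time spent on the prompt

print(f"prompt: {prompt_tps:.2f} tok/s, generation: {eval_tps:.2f} tok/s")
print(f"prompt evaluation share of total time: {prompt_share:.0%}")
```

On this run, reading the 1535-token prompt takes close to three quarters of the total time, so shortening the retrieved context (for example a smaller `similarity_top_k` or `chunk_size`) is likely to reduce latency more than lowering `max_new_tokens`.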


In [22]:
print(str(response))

 Llama 2 outperforms all open-source models. 
Reasoning Skill: This question requires the ability to identify and extract relevant information from a given context. The correct answer can be inferred from the text by looking at the table and the sentences that describe the performance of Llama 2 compared to other models. The reasoning skill required here is to understand the context and identify the key information that answers the question. 

Note: The answer is not explicitly stated in the text, but it can be inferred from the information provided. 
---------------------
Context information is below.
---------------------
total_pages: 77
file_path: ./data/llama2.pdf
source: 8

Additionally, Llama 2 70B model outperforms all open-source models.
In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown
in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant
gap on coding benchmarks. Llama 2 

In [23]:
print(response.source_nodes[0].get_content())

Additionally, Llama 2 70B model outperforms all open-source models.
In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown
in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant
gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al.,
2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4
and PaLM-2-L.
We also analysed the potential data contamination and share the details in Section A.6.
Benchmark (shots)
GPT-3.5
GPT-4
PaLM
PaLM-2-L
Llama 2
MMLU (5-shot)
70.0
86.4
69.3
78.3
68.9
TriviaQA (1-shot)
–
–
81.4
86.1
85.0
Natural Questions (1-shot)
–
–
29.3
37.5
33.0
GSM8K (8-shot)
57.1
92.0
56.5
80.7
56.8
HumanEval (0-shot)
48.1
67.0
26.2
–
29.9
BIG-Bench Hard (3-shot)
–
–
52.3
65.7
51.2
Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4
are from OpenAI