# Retrieval-Augmented Generation (RAG) with a llama.cpp Quantized Model

In [38]:
!wget https://huggingface.co/teleprint-me/llama-2-7b-chat-GGUF/resolve/main/llama-2-7b-chat.GGUF.q4_0.bin 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--2023-10-29 02:48:34--  https://huggingface.co/teleprint-me/llama-2-7b-chat-GGUF/resolve/main/llama-2-7b-chat.GGUF.q4_0.bin
Resolving huggingface.co (huggingface.co)... 13.35.166.114, 13.35.166.50, 13.35.166.69, ...
Connecting to huggingface.co (huggingface.co)|13.35.166.114|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/77/e7/77e703336477580d549e1c977e7a661d3910a24e7046ea89e5e52430a45d7ff1/7b8ac13e13c32bd00ba74670c287d078dcad02653430adc0d72edd5463d50094?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-7b-chat.GGUF.q4_0.bin%3B+filename%3D%22llama-2-7b-chat.GGUF.q4_0.bin%22%3B&response-content-type=application%2Foct
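
The `huggingface/tokenizers` warning at the top of this output appears when the notebook process is forked (here by `!wget`) after a tokenizer has already run in parallel. The message itself suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell before any tokenizer or embedding model is loaded (the value `"false"` is my choice; it trades tokenizer parallelism for a quieter log):

```python
import os

# Set this before the tokenizers library is first used, so the fork
# warning is not triggered by later shell commands such as !wget or !pip.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```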

### Install llama.cpp, llama-cpp-python, and chromadb
In my previous video, I showed how to build a quantized model with llama.cpp.

In this notebook, you will see how to do RAG on a quantized model so that you can query your own documents.

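Note: the model file downloaded at the top of this notebook is in GGUF format, which llama-cpp-python supports from version 0.1.79 onward, so the version pinned below may need to be raised.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.64 --no-cache-dir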

pip install chromadb

##### Step 1: Instantiate an embedding model, which will later be used to store data in the vector DB

In [14]:
!pip install langchain



In [16]:
!pip install sentence-transformers



In [20]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)
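
The embedding model maps each piece of text to a fixed-length vector (384 dimensions for `all-MiniLM-L6-v2`), and texts with similar meaning end up with vectors that point in similar directions. Cosine similarity is one common way to compare such vectors. The sketch below uses made-up 3-dimensional vectors in place of real embeddings, so it runs without downloading the model; the vector store used later may use a different distance measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, standing in for embed_model output).
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 1.1]   # points in nearly the same direction as the query
far_vec = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query_vec, close_vec) > cosine_similarity(query_vec, far_vec))
```

With the real model, `embed_model.embed_query(text)` would supply the vectors instead of the hand-written lists.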

##### Step 2: Process Custom Content into Chunks

In [21]:
!pip install jq



In [49]:
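from langchain.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load each element of the top-level "data" array as one document.
# text_content=False lets the loader accept elements that are not plain strings.
loader = JSONLoader(
    file_path='/kaggle/input/dset-qual/train_webmd_squad_v2_full.json',
    jq_schema='.data[]',
    text_content=False)

data = loader.load()

# Split the documents into chunks of at most 500 characters,
# with 50 characters of overlap between consecutive chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                               chunk_overlap=50)
all_splits = text_splitter.split_documents(data)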




JSONDecodeError: Extra data: line 1 column 62963122 (char 62963121)
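
An `Extra data` error means the JSON parser read one complete top-level value and then found more content after it (here at character 62963121). I have not inspected this file, so the following is only a sketch of one possible cause and workaround: if a file holds several JSON values written back to back, the standard library's `json.JSONDecoder.raw_decode` can read them one at a time. The helper name `iter_json_values` and the sample string are hypothetical.

```python
import json

def iter_json_values(text):
    """Yield each top-level JSON value found in `text`, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value

# Two JSON objects back to back: json.loads(sample) would raise "Extra data".
sample = '{"data": [1, 2]} {"data": [3]}'
values = list(iter_json_values(sample))
print(len(values))
```

Each recovered value could then be written to its own file, or turned into documents directly, before splitting.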

##### Step 3: Store the custom content into a Vector DB (Chroma)

In [23]:
!pip install chromadb


In [25]:
from langchain.vectorstores import Chroma

# Embed every chunk with the Step 1 embedding model and index the vectors in Chroma.
vectorstore = Chroma.from_documents(documents=all_splits, embedding=embed_model)



Batches:   0%|          | 0/485 [00:00<?, ?it/s]

##### Step 4: Set up the bindings for the llama.cpp quantized model and instantiate it

In [30]:
from langchain.embeddings import LlamaCppEmbeddings
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

n_gpu_layers = 32  # Number of model layers to offload to the GPU (on Apple Metal, 1 is enough).
n_batch = 512  # Tokens processed per batch; should be between 1 and n_ctx, sized to the available GPU memory.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])  # Streams generated tokens to stdout.

In [34]:
!pip install llama-cpp-python==0.1.49



In [40]:
#llama = LlamaCppEmbeddings(model_path="/data/llama.cpp/models/llama-2-7b-chat/ggml-model-q4_0.bin")
llm = LlamaCpp(
    model_path="/kaggle/working/llama-2-7b-chat.GGUF.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # Must be True; otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=False,
)


llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from /kaggle/working/llama-2-7b-chat.GGUF.q4_0.bin (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_gate.weight q4_0     [  4096, 110
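
The loader log lists most tensors as `q4_0`. As I understand the ggml layout, `q4_0` stores weights in blocks of 32, each block holding one 16-bit scale plus 32 four-bit values, which works out to 18 bytes per block or 4.5 bits per weight. The sketch below applies that arithmetic to one `4096 x 4096` attention matrix from the log and compares it with 16-bit storage; the constant and function names are my own.

```python
QK4_0 = 32                      # weights per q4_0 block
BLOCK_BYTES = 2 + QK4_0 // 2    # 2-byte fp16 scale + 32 four-bit values = 18 bytes

def q4_0_bytes(n_weights):
    """Storage needed for n_weights values in q4_0 (n_weights must fill whole blocks)."""
    assert n_weights % QK4_0 == 0
    return n_weights // QK4_0 * BLOCK_BYTES

n_weights = 4096 * 4096          # e.g. blk.0.attn_q.weight in the log above
q4_size = q4_0_bytes(n_weights)
f16_size = n_weights * 2         # the same tensor stored as 16-bit floats

print(q4_size / 2**20, "MiB in q4_0 vs", f16_size / 2**20, "MiB in f16")
```

This is why a 7B-parameter model that needs roughly 13 GB in 16-bit form fits in about 4 GB once quantized to `q4_0`.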

##### Step 5: Do a similarity search on the vector DB to retrieve data related to the query

In [42]:
question = "what are the tips in managing my bipolar disease"
docs = vectorstore.similarity_search(question)
#result = llm_chain(docs)
docs

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[Document(page_content='psychotherapy with behavioral techniques to help patients learn how to more effectively manage interpersonal problems, stay on their medications, and normalize their lifestyle habits. The STEP- BD study mentioned earlier found that in addition to medications, adding a structured psychotherapy -- such as cognitive behavioral therapy, interpersonal/social rhythm therapy, or family-focused therapy -- can speed up treatment response in bipolar depression by as much as 150%.", "answer_span": [29,', metadata={'seq_num': 516, 'source': '/kaggle/input/srikuuuu/val_webmd_squad_v2_consec.json'}),
 Document(page_content='psychotherapy with behavioral techniques to help patients learn how to more effectively manage interpersonal problems, stay on their medications, and normalize their lifestyle habits. The STEP- BD study mentioned earlier found that in addition to medications, adding a structured psychotherapy -- such as cognitive behavioral therapy, interpersonal/social rh
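
Conceptually, `similarity_search` embeds the question and returns the stored chunks whose vectors are closest to it (LangChain's Chroma wrapper returns 4 by default). The sketch below shows that idea with an exhaustive scan over a tiny in-memory list. It is not how Chroma works internally, which uses an approximate nearest-neighbour index; the chunk texts, vectors, and the `top_k` helper are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, store, k=4):
    """Return the texts of the k entries in `store` most similar to query_vec."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical (text, vector) pairs standing in for embedded chunks.
store = [
    ("chunk about psychotherapy", [0.9, 0.1, 0.0]),
    ("chunk about sleep schedules", [0.7, 0.3, 0.1]),
    ("chunk about an unrelated topic", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, store, k=2)
print(results)
```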

##### Step 6: Create a RAG pipeline that answers the query using the custom data as context

In [47]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)
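
With `chain_type='stuff'`, the retrieved chunks are all placed ("stuffed") into a single prompt together with the question, and the model is called once. The sketch below shows that assembly step in plain Python. The prompt wording and the `build_stuff_prompt` helper are hypothetical and only approximate what the chain does; the chunk texts are placeholders.

```python
def build_stuff_prompt(question, chunks):
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for the retriever's results.
chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("what are the tips in managing my bipolar disease", chunks)
print(prompt)
```

Because everything goes into one prompt, the retrieved chunks plus the question and the generated answer have to fit inside the model's context window (`n_ctx=2048` here), which is one reason to keep chunks small.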

In [48]:
rag_pipeline("what are the tips in managing my bipolar disease")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

 It is important to work closely with your healthcare provider when managing bipolar disorder. Your healthcare provider can help you develop a treatment plan that meets your individual needs and provides guidance on how to manage your symptoms effectively. Additionally, therapy can be helpful in developing coping skills and understanding the condition better.
In terms of tips for managing bipolar disease, some things to consider include:
* Getting regular exercise: Exercise has been shown to have a positive impact on mood stabilization and overall well-being.
* Maintaining a consistent sleep schedule: Having a regular sleep routine can help regulate mood and energy levels.
* Practicing stress management techniques: Stress can exacerbate symptoms of bipolar disorder, so finding ways to manage stress is important. This could include activities such as meditation or deep breathing exercises.
* Building a support network: Having a strong support network of friends and family can provide em

{'query': 'what are the tips in managing my bipolar disease',
 'result': ' It is important to work closely with your healthcare provider when managing bipolar disorder. Your healthcare provider can help you develop a treatment plan that meets your individual needs and provides guidance on how to manage your symptoms effectively. Additionally, therapy can be helpful in developing coping skills and understanding the condition better.\nIn terms of tips for managing bipolar disease, some things to consider include:\n* Getting regular exercise: Exercise has been shown to have a positive impact on mood stabilization and overall well-being.\n* Maintaining a consistent sleep schedule: Having a regular sleep routine can help regulate mood and energy levels.\n* Practicing stress management techniques: Stress can exacerbate symptoms of bipolar disorder, so finding ways to manage stress is important. This could include activities such as meditation or deep breathing exercises.\n* Building a suppor

In [None]:
rag_pipeline("how do the accelerators built by Quadratic help their customers")

  Accelerators created by Quadratic enable Ml/AI Model Lifecycle as a MLOPS suite, enabling the customer to quickly build models, train and deploy in a repeatable fashion.

{'query': 'how do the accelerators built by Quadratic help their customers',
 'result': '  Accelerators created by Quadratic enable Ml/AI Model Lifecycle as a MLOPS suite, enabling the customer to quickly build models, train and deploy in a repeatable fashion.'}

In [None]:
llm("what accelerators did quadratic build")

?

 nobody knows when or if quadratic will launch. the company has not provided any updates on its launch plans, and its website is no longer active.

Quadratic was a startup that aimed to build a decentralized exchange (DEX) for non-fungible tokens (NFTs). The platform was designed to provide a more secure and reliable way of trading NFTs compared to traditional centralized exchanges. However, the project appears to have been abandoned, and no further information is available on its launch plans or development progress.

Quadratic's conceptual design involved using smart contracts to enable decentralized trading of NFTs without the need for intermediaries. The platform was expected to offer a range of features, including support for multiple blockchain networks, an intuitive user interface, and automated liquidity provision through quadratic funding.

While Quadratic's idea was innovative, it faced significant challenges in terms of scalability, security, and regulatory compliance. Th

"?\n nobody knows when or if quadratic will launch. the company has not provided any updates on its launch plans, and its website is no longer active.\nQuadratic was a startup that aimed to build a decentralized exchange (DEX) for non-fungible tokens (NFTs). The platform was designed to provide a more secure and reliable way of trading NFTs compared to traditional centralized exchanges. However, the project appears to have been abandoned, and no further information is available on its launch plans or development progress.\nQuadratic's conceptual design involved using smart contracts to enable decentralized trading of NFTs without the need for intermediaries. The platform was expected to offer a range of features, including support for multiple blockchain networks, an intuitive user interface, and automated liquidity provision through quadratic funding.\nWhile Quadratic's idea was innovative, it faced significant challenges in terms of scalability, security, and regulatory compliance. T