# Introduction

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F769452%2Fb18d0513200d426e556b2b7b7c825981%2FRAG.png?generation=1695504022336680&alt=media"></img>

## Objective

Use Llama 2.0, Langchain and ChromaDB to create a Retrieval Augmented Generation (RAG) system. This will allow us to ask questions about our documents (that were not included in the training data), without fine-tunning the Large Language Model (LLM).
When using RAG, if you are given a question, you first do a retrieval step to fetch any relevant documents from a special database, a vector database where these documents were indexed. 

## Definitions

* LLM - Large Language Model  
* Llama 2.0 - LLM from Meta 
* Langchain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimmensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation (see below more details about RAGs)

## Model details

* **Model**: Llama 2  
* **Variation**: 7b-chat-hf  (7b: 7B dimm. hf: HuggingFace build)
* **Version**: V1  
* **Framework**: PyTorch  

LlaMA 2 model is pretrained and fine-tuned with 2 Trillion tokens and 7 to 70 Billion parameters which makes it one of the powerful open source models. It is a highly improvement over LlaMA 1 model.


## What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) has proven their ability to understand context and provide accurate answers to various NLP tasks, including summarization, Q&A, when prompted. While being able to provide very good answers to questions about information that they were trained with, they tend to hallucinate when the topic is about information that they do "not know", i.e. was not included in their training data. Retrieval Augmented Generation combines external resources with LLMs. The main two components of a RAG are therefore a retriever and a generator.  
 
The retriever part can be described as a system that is able to encode our data so that can be easily retrieved the relevant parts of it upon queriying it. The encoding is done using text embeddings, i.e. a model trained to create a vector representation of the information. The best option for implementing a retriever is a vector database. As vector database, there are multiple options, both open source or commercial products. Few examples are ChromaDB, Mevius, FAISS, Pinecone, Weaviate. Our option in this Notebook will be a local instance of ChromaDB (persistent).

For the generator part, the obvious option is a LLM. In this Notebook we will use a quantized LLaMA v2 model, from the Kaggle Models collection.  

The orchestration of the retriever and generator will be done using Langchain. A specialized function from Langchain allows us to create the receiver-generator in one line of code.

## More about this  

Do you want to learn more? Look into the `References` section for blog posts and in `More work on the same topic` for Notebooks about the technologies used here.

# Installations, imports, utils

In [1]:

!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12



[0m

In [2]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma


# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [3]:
model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

In [4]:
from huggingface_hub import login

# Replace 'your_hugging_face_token' with your actual Hugging Face token
token = "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"

# Log in to Hugging Face
login(token)


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /home/.cache/huggingface/token
Login successful


Prepare the model and the tokenizer.

In [5]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM


In [6]:
time_1 = time()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Prepare model, tokenizer: 68.693 sec.


Define the query pipeline.

In [7]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 0.425 sec.


In [8]:
def is_code_prompt(user_input):
    # You can expand this list with more code-related keywords or programming languages
    code_keywords = ["code", "script", "function", "class", "method", "algorithm", "program", "loop", "Python", "JavaScript", "C++", "Java"]
    
    return any(keyword.lower() in user_input.lower() for keyword in code_keywords)

# Define a function to handle code generation prompts
def handle_code_prompt(pipeline, tokenizer, prompt_to_test):
    print(f"Generating code for the prompt: {prompt_to_test}")
    
    # Generate code using the pipeline
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200
    )
    time_2 = time()
    
    print(f"Code generation time: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Generated Code: {seq['generated_text']}")
    
    # Optionally allow editing or further interaction with the code
    edit_code = input("\nDo you want to edit the code? (yes/no): ")
    if edit_code.lower() == 'yes':
        print("\nYou can now edit the code manually.")

# Function to test the model with a generic prompt
def test_model(tokenizer, pipeline, prompt_to_test):
    print(f"Processing query: {prompt_to_test}")
    
    if is_code_prompt(prompt_to_test):
        # Handle code-specific prompt
        handle_code_prompt(pipeline, tokenizer, prompt_to_test)
    else:
        # Handle non-code queries
        time_1 = time()
        sequences = pipeline(
            prompt_to_test,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
            max_length=200,
        )
        time_2 = time()
        
        print(f"Test inference: {round(time_2-time_1, 3)} sec.")
        for seq in sequences:
            print(f"Result: {seq['generated_text']}")

We define a function for testing the pipeline.

## Test the query pipeline

We test the pipeline with a query about the meaning of State of the Union (SOTU).

In [9]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is Adaptation of communication system based on changing user accessibility needs. Give just a definition. Keep it in 100 words.")

Processing query: Please explain what is Adaptation of communication system based on changing user accessibility needs. Give just a definition. Keep it in 100 words.
Test inference: 50.094 sec.
Result: Please explain what is Adaptation of communication system based on changing user accessibility needs. Give just a definition. Keep it in 100 words.
Adaptation of communication systems based on changing user accessibility needs involves modifying or customizing the system to accommodate individuals with varying levels of accessibility requirements, such as visual, auditory, motor, or cognitive impairments. This may involve using assistive technologies, such as screen readers, speech-to-text software, or Braille displays, to enable individuals with disabilities to access and use the system effectively.


# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We check the model with a HF pipeline, using a query about the meaning of State of the Union (SOTU).

In [10]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain Machine Learning for Network Automation. Give just a definition. Keep it in 100 words.")

'\nMachine Learning for Network Automation involves using algorithms and statistical models to analyze network data and automatically learn patterns, behaviors, and anomalies.ibility to predict network behavior, optimize network performance, and detect and respond to security threats.'

## Ingestion of data using Text loder

We will ingest the newest presidential address, from Jan 2023.

In [11]:
loader = TextLoader("train data.txt",
                    encoding="ISO-8859-1")
documents = loader.load()

## Split data in chunks

We split data in chunks using a recursive character text splitter.

In [12]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)

## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [13]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [14]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

## Initialize chain

In [15]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

## Test the Retrieval-Augmented Generation 


We define a test function, that will run the query and time it.

In [16]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

Let's check few queries.

In [17]:
query = "Given the frequent call drops experienced at my university, which utilizes an XSeries server and a star topology network, what specific recommendations would you propose to improve call quality?"
test_rag(qa, query)

Query: Given the frequent call drops experienced at my university, which utilizes an XSeries server and a star topology network, what specific recommendations would you propose to improve call quality?



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m
Inference time: 243.128 sec.

Result:   Thank you for reaching out! To improve call quality in your university's network, I would recommend a few specific actions based on the information provided:

1. Conduct Network Troubleshooting: Before making any recommendations, it's essential to identify the root cause of the call drops. I would suggest conducting a thorough network troubleshooting exercise to determine if there are any issues with the network infrastructure, such as poor signal strength, interference, or congestion.
2. Optimize Network Configuration: Once the network troubleshooting exercise is complete, you can optimize the network configuration to improve call quality. This may involve adjusting the netw

In [18]:
query = "Develop an algorithm to implemnt the recommendations you have provided"
test_rag(qa, query)

Query: Develop an algorithm to implemnt the recommendations you have provided



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m
Inference time: 160.244 sec.

Result:  
To implement the recommendations provided, we can develop an algorithm that leverages the knowledge graph and intelligent recommendation algorithms to identify the root cause of network abnormalities and provide recommended solutions to first-line experts. Here's a high-level outline of the algorithm:

1. Data Collection: Collect tens of thousands of expert experiences, radio network knowledge bases, and abnormal condition data from the network.
2. Knowledge Graph Construction: Build a knowledge graph using the collected data to establish relationships between network elements, abnormal conditions, and recommended solutions.
3. Intelligent Recommendation Algorithm Development: Develop an intelligent recommendation algorithm that can analyze the knowledge graph and provide reasons and recommended s

In [19]:
query = "generate a python code any existing an existing network"
test_rag(qa, query)

Query: generate a python code any existing an existing network



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m
Inference time: 324.314 sec.

Result:   I can provide you with some guidance on how to generate a Python code for creating a digital twin of an existing network. However, I must inform you that creating a digital twin of a complex network like a telecommunications network is a challenging task that requires significant expertise in network engineering, software development, and data analytics.

To start, you will need to gather information about the existing network, including its architecture, topology, and performance metrics. This information can be obtained through various means, such as network monitoring tools, network management systems, and performance monitoring systems.

Once you have gathered the necessary information, you can use Python programming language to create a digital twin of the network. There are several libraries and frameworks

## Document sources

Let's check the documents sources, for the last query run.

In [20]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    doc_details = doc.to_json()['kwargs']
    print("Source: ", doc_details['metadata']['source'])
    print("Text: ", doc_details['page_content'], "\n")

Query: generate a python code any existing an existing network
Retrieved documents: 4
Source:  train data.txt
Text:  OR
Top down bootstrapping of apps, services, NaaS, infrastructure, based on these
new SRCs.
Open issues E2E automation frameworks for composition of infrastructure, NaaS and services
do not exist.
Use case category Cat 1: describes a scenario related to core autonomous behaviour itself.
Reference
7.13.1 Use case requirements
Critical requirements
â AN-UC013-REQ-001: It is critical that autonomous networks (AN) enable plug and play of
network functions in the underlay and subsequent seamless participation of such network functions
in the AN functions.
NOTE â Examples of AN functions are creation and hosting of controllers. Plug and play may be executed
by manual or autonomous mechanisms.
7.13.2 Use case specific figures
None.
7.14 âGenerative adversarial Sandboxâ: (or hybrid closed loops)
Use case id FG-AN-usecase-014
Use case name âGenerative adversarial Sandbo

# Conclusions


We used Langchain, ChromaDB and Llama 2 as a LLM to build a Retrieval Augmented Generation solution. For testing, we were using the latest State of the Union address from Jan 2023.





# References  

[1] Murtuza Kazmi, Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data, https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476  

[2] Patrick Lewis, Ethan Perez, et. al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://browse.arxiv.org/pdf/2005.11401.pdf 

[3] Minhajul Hoque, Retrieval Augmented Generation: Grounding AI Responses in Factual Data, https://medium.com/@minh.hoque/retrieval-augmented-generation-grounding-ai-responses-in-factual-data-b7855c059322  

[4] Fangrui Liu	, Discover the Performance Gain with Retrieval Augmented Generation, https://thenewstack.io/discover-the-performance-gain-with-retrieval-augmented-generation/

[5] Andrew, How to use Retrieval-Augmented Generation (RAG) with Llama 2, https://agi-sphere.com/retrieval-augmented-generation-llama2/   

[6] Yogendra Sisodia, Retrieval Augmented Generation Using Llama2 And Falcon, https://medium.com/@scholarly360/retrieval-augmented-generation-using-llama2-and-falcon-ed26c7b14670   



In [21]:
import os

# Specify the configuration directory and file
config_dir = "/home/.jupyter/"
config_file = os.path.join(config_dir, "jupyter_notebook_config.py")

# Ensure the directory exists (it should, but just in case)
os.makedirs(config_dir, exist_ok=True)

# Create or write to the config file
config_content = """
c.NotebookApp.shutdown_no_activity_timeout = 10800  # Set to 2 hours (7200 seconds)
"""

with open(config_file, 'w') as f:
    f.write(config_content)

print(f"Configuration file created at: {config_file}")


Configuration file created at: /home/.jupyter/jupyter_notebook_config.py


In [None]:
!pip install flask
!pip install pyngrok

from pyngrok import ngrok
from flask import Flask, request, jsonify

# Set your ngrok authtoken
# ngrok.set_auth_token("2n2mw3PwI2JeIWWbW6dOx1Bwl5L_4Co5Hnaejk29ke6jvS3b7")

# Create a Flask app
app = Flask(__name__)

@app.route('/query', methods=['POST'])
def query_model():
    data = request.get_json()
    query = data.get('query', '')
    
    if not query:
        return jsonify({"error": "No query provided"}), 400

    try:
        # Generate response from the model (replace qa.run with actual logic)
        response = qa.run(query)
        # Optionally, test another function or logic here
        # test_rag(qa, query)
    except Exception as e:
        response = str(e)
    
    return jsonify({"id" : "winest_chatbot", "response": response})

# Run the Flask app
if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[0mhuggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[0m * Serving Flask app '__main__'
 * Debug mode: off


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.17.0.3:5000
[33mPress CTRL+C to quit[0m
172.17.0.3 - - [08/Oct/2024 15:30:37] "[33mGET / HTTP/1.1[0m" 404 -
172.17.0.3 - - [08/Oct/2024 15:30:41] "[31m[1mGET /query HTTP/1.1[0m" 405 -




[1m> Entering new RetrievalQA chain...[0m


172.17.0.3 - - [08/Oct/2024 15:34:40] "POST /query HTTP/1.1" 200 -



[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m


172.17.0.3 - - [08/Oct/2024 15:41:00] "POST /query HTTP/1.1" 200 -



[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m


172.17.0.3 - - [08/Oct/2024 15:49:31] "POST /query HTTP/1.1" 200 -



[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m


172.17.0.3 - - [08/Oct/2024 15:52:27] "[31m[1mPOST /query HTTP/1.1[0m" 400 -
172.17.0.3 - - [08/Oct/2024 15:54:34] "POST /query HTTP/1.1" 200 -



[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m


172.17.0.3 - - [08/Oct/2024 15:59:51] "POST /query HTTP/1.1" 200 -



[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m


172.17.0.3 - - [08/Oct/2024 16:22:48] "POST /query HTTP/1.1" 200 -



[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m


172.17.0.3 - - [08/Oct/2024 16:27:16] "POST /query HTTP/1.1" 200 -



[1m> Finished chain.[0m
