<a href="https://colab.research.google.com/github/dhavalsays/PPL_Python_Training/blob/master/QnA_Using_llama_GGUF_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Outcomes


1. **Model Selection and Evaluation:** Understand how to choose an appropriate quantized LLM for QnA development and evaluate its suitability for the task.

2. **Vector Representations:** Learn how to create vector representations for text data, enabling efficient search and similarity matching.

3. **Database Integration:** Gain experience in setting up a vector database (e.g., FAISS) for storing and retrieving vector embeddings efficiently.



# **Understanding Quantization: A Visual Analogy**

Imagine you have a stunning high-resolution photo of a breathtaking landscape.

Every detail is captured with precision, from the individual leaves on trees to the smallest pebbles on the ground. This photo is like the high-precision numbers (for example, 32-bit floating-point weights) that an artificial intelligence (AI) model stores.

However, there's a catch.

This high-resolution photo takes up a lot of storage space on your device, just as high-precision numbers demand significant memory and processing power in AI models.

## Enter Quantization

Quantization is like converting this high-resolution photo into a lower-resolution version. Instead of storing every value at full precision, we represent each number with fewer bits. This saves memory and speeds up calculations in AI models.
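
To make the idea concrete, here is a minimal sketch that maps a few 32-bit floats to 8-bit integers and back. The sample values, the symmetric scaling scheme, and the helper names `quantize_int8` and `dequantize` are illustrative assumptions; GGUF formats such as Q2_K use more elaborate block-wise schemes.

```python
from array import array

def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    # One shared scale for all values; fall back to 1.0 if every value is zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = array('b', (round(v / scale) for v in values))
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]

weights = array('f', [0.12, -0.53, 0.98, -1.27, 0.04])  # stored as 32-bit floats
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print("bytes as float32:", len(weights) * weights.itemsize)
print("bytes as int8:   ", len(q_weights) * q_weights.itemsize)
print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The 8-bit copy needs a quarter of the storage, and each restored value is off by at most half of one quantization step.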

### Balancing Act

Quantization is a balancing act between having a smaller file (or faster calculations) and maintaining precision. It's like deciding how much detail you're willing to sacrifice for speed and efficiency.
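
The trade-off can be seen numerically. The sketch below, which assumes a simple symmetric rounding scheme and evenly spaced test values (both chosen only for illustration), measures how the average round-trip error grows as the bit width shrinks.

```python
def quantization_error(values, bits):
    """Mean absolute round-trip error for symmetric quantization at a given bit width."""
    levels = 2 ** (bits - 1) - 1                  # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    scale = max(abs(v) for v in values) / levels  # size of one quantization step
    restored = [round(v / scale) * scale for v in values]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)

values = [i / 50 - 1 for i in range(101)]  # evenly spaced samples in [-1, 1]
errors = {bits: quantization_error(values, bits) for bits in (8, 4, 2)}

for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.4f}")
```

Fewer bits mean a smaller model but a larger error, which is why heavily quantized files (such as 2-bit variants) are the smallest and also the least faithful to the original weights.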

#### Types of Quantization




### **Loading Data**

In [None]:
import pandas as pd

faq_data = pd.read_csv('/content/codebasics_faqs.csv')

### **Loading Models**

In [None]:
import faiss  # used by the helper functions below
import numpy as np
from tqdm import tqdm
from InstructorEmbedding import INSTRUCTOR
from ctransformers import AutoModelForCausalLM

# Load the INSTRUCTOR embedding model.
embedding_model = INSTRUCTOR('hkunlp/instructor-xl')

# Set generation parameters for the language model.
temperature = 0.01  # Sampling temperature; values near 0 make the output close to deterministic.
n_gpu_layers = 50  # Number of model layers to offload to the GPU.
max_new_tokens = 200  # Maximum number of new tokens to generate.

# Hugging Face repository of the quantized (GGUF) language model.
model_name = "TheBloke/Llama-2-7b-Chat-GGUF"

# Load the language model with the settings above.
llm = AutoModelForCausalLM.from_pretrained(model_name,
                                          max_new_tokens=max_new_tokens,
                                          gpu_layers=n_gpu_layers,
                                          temperature=temperature)


load INSTRUCTOR_Transformer
max_seq_length  512



### **Helper Functions**

In [None]:
def generate_text_embedding(text: str):
    """
    Generates an embedding representation for a text document to facilitate retrieval for answering a question.

    Parameters:
    text (str): The text document for which an embedding is to be generated.

    Returns:
    list: A list representing the embedding of the text document.
    """
    instruction = "Represent document chunk (which could be text or a table) for retrieval for answering a question."

    # Encode the text document using the embedding model.
    result = embedding_model.encode([[instruction, text]])

    # Convert the result to a list and return it.
    return result[0].tolist()


In [None]:
def generate_query_embedding(query_text: str):
    """
    Generates an embedding representation for a query text to facilitate retrieval of similar text blocks
    in a document.

    Parameters:
    query_text (str): The query text for which an embedding is to be generated.

    Returns:
    list: A list representing the embedding of the query text.
    """
    # A description of the instruction for generating the query embedding.
    instruction = "Represent the query for retrieval. The query will be used to find text blocks in the document that are similar."

    # Encode the query text using the embedding model.
    result = embedding_model.encode([[instruction, query_text]])

    # Convert the result to a list and return it.
    return result[0].tolist()


In [None]:
def create_vector_database(data_frame: pd.DataFrame):
    """
    Creates a vector database and a content dictionary from a given DataFrame.

    Parameters:
    data_frame (pd.DataFrame): A DataFrame containing 'response' and 'embeddings' columns.

    Returns:
    faiss.IndexFlatIP: A Faiss vector database for similarity search.
    dict: A dictionary mapping vector indices to content chunks.

    Note:
    The 'response' column should contain text or content chunks, and the 'embeddings' column should contain
    the corresponding embeddings as lists or NumPy arrays of floats, all of the same length.
    """
    # Extract content chunks and embeddings from the DataFrame.
    content_chunks = data_frame['response'].tolist()
    embeddings_array = np.array(data_frame['embeddings'].to_list(), dtype=np.float32)

    # Get the dimensions of the embeddings array.
    n, d = embeddings_array.shape

    # Create a Faiss vector index and add the embeddings.
    db = faiss.IndexFlatIP(d)
    db.add(embeddings_array)

    # Create a content dictionary mapping indices to content chunks.
    content_dictionary = {ind: {'content': content} for ind, content in enumerate(content_chunks)}

    return db, content_dictionary
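
For intuition about what a flat inner-product index does at query time, here is a plain-Python sketch: it scores the query against every stored vector with a dot product and returns the best matches. The function names and toy 2-D vectors are hypothetical; Faiss performs the same exhaustive comparison far more efficiently.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product_search(stored, query, k):
    """Return the k (score, index) pairs with the highest inner product."""
    scores = [
        (sum(a * b for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(stored)
    ]
    scores.sort(reverse=True)
    return scores[:k]

# Toy "embeddings" standing in for the FAQ vectors.
stored = [normalize(v) for v in ([1.0, 0.0], [0.6, 0.8], [0.0, 1.0])]
query = normalize([0.9, 0.1])

top = inner_product_search(stored, query, k=2)
for score, idx in top:
    print(f"row {idx}: score = {score:.3f}")
```

Because the inner product only behaves like cosine similarity when vectors have unit length, it is worth checking whether your embedding model normalizes its output before relying on a fixed similarity threshold.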


In [None]:
def get_relevant_documents(question: str, db, content_dictionary, k=3) -> list:
    """
    Retrieves the most relevant documents from a vector database based on a given question.

    Parameters:
    question (str): The question for which relevant documents are sought.

    db (faiss.IndexFlatIP): The Faiss vector database for similarity search.

    content_dictionary (dict): A dictionary mapping indices to content chunks.

    k (int): The number of most relevant documents to retrieve (default is 3).

    Returns:
    list: A list of tuples containing the most relevant documents and their corresponding scores.
    Each tuple has the format (document_content, score).

    Note:
    The `question` parameter should be a text query.
    """
    # Generate an embedding for the query question.
    embedded_query = np.array([generate_query_embedding(question)], dtype=np.float32)

    # Run a range search: retrieve every stored vector whose inner product with the query is above 0.7.
    _, scores, row_ids = db.range_search(x=embedded_query, thresh=0.7)

    # Collect scores and row indices in a DataFrame.
    result_df = pd.DataFrame({'score': scores, 'Row_Index': row_ids})

    # Keep the k highest-scoring rows (nlargest returns them in descending order of score).
    result_df = result_df.nlargest(k, 'score')

    # Extract row indices and scores of the best documents.
    best_row_ids = result_df['Row_Index'].values.tolist()
    best_scores = result_df['score'].values.tolist()

    # Build a list of (document_content, score) tuples.
    most_relevant_documents = [
        (content_dictionary[row_id]['content'], score) for row_id, score in zip(best_row_ids, best_scores)
    ]

    return most_relevant_documents
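
The selection logic above boils down to two steps: keep only candidates whose similarity is above a threshold, then take the k best of those. Here is a small stand-alone sketch of that logic with made-up scores; `select_top_k` is a hypothetical helper, not part of Faiss or pandas.

```python
import heapq

def select_top_k(scored_rows, threshold, k):
    """Keep (row_id, score) pairs scoring above `threshold`, then return the k highest."""
    above = [(row_id, score) for row_id, score in scored_rows if score > threshold]
    return heapq.nlargest(k, above, key=lambda pair: pair[1])

# Made-up similarity scores for five rows.
candidates = [(0, 0.91), (1, 0.64), (2, 0.78), (3, 0.85), (4, 0.72)]

selected = select_top_k(candidates, threshold=0.7, k=3)
print(selected)  # [(0, 0.91), (3, 0.85), (2, 0.78)]
```

Note that when no candidate clears the threshold the result is an empty list, so the caller ends up with no context to pass to the language model.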


In [None]:
def generate_answer_from_relevant_documents(question: str, db, content_dictionary):
    """
    Generates an answer to a question based on relevant documents retrieved from a vector database.

    Parameters:
    question (str): The question for which an answer is to be generated.
    db (faiss.IndexFlatIP): The Faiss vector database for similarity search.
    content_dictionary (dict): A dictionary mapping indices to content chunks.

    Returns:
    str: The generated answer based on the relevant documents.

    Note:
    The `question` parameter should be a text query.
    """
    # Retrieve relevant documents using the provided question, database, and content dictionary.
    relevant_docs = get_relevant_documents(question, db, content_dictionary)

    # Build the prompt from the retrieved documents and the question.
    prompt = "Task: Answer the question by understanding and summarizing the context below in simple terms. Don't quote any context number." + \
        '\n\nContext: ' + '.\n'.join(doc for doc, _ in relevant_docs) + \
        '\n\n' + 'Question: ' + question + '\n\n' + 'Answer: '

    # Generate the answer with the language model loaded above, printing tokens as they stream in.
    generated_answer = ""
    for text in llm(prompt, stream=True):
        generated_answer += text
        print(text, end="", flush=True)
    return generated_answer + '. Hope this helps..!!'
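
To see what the language model actually receives, it can help to build the prompt separately and print it. The sketch below uses a hypothetical `build_prompt` helper with placeholder passages and a simplified task line; it mirrors the layout used above (task, context, question, answer cue) rather than reproducing it exactly.

```python
def build_prompt(question, documents):
    """Join retrieved passages into one context block and append the question."""
    context = ".\n".join(documents)
    return (
        "Task: Answer the question using the context below, in simple terms.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n\n"
        "Answer: "
    )

docs = ["Passage one", "Passage two"]  # placeholder passages
prompt = build_prompt("Example question?", docs)
print(prompt)
```

Ending the prompt with `Answer: ` cues the model to continue with the answer text rather than restating the question.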


### **Main Function Call**

In [None]:
# Add a new column to the faq_data DataFrame combining user query and response for each row.
faq_data['content'] = faq_data.apply(lambda x: 'User Query: ' + str(x['prompt']) + '\n' + 'Response: ' + str(x['response']), axis=1)

# Generate embeddings for the combined content using the generate_text_embedding function.
faq_data['embeddings'] = [generate_text_embedding(text) for text in tqdm(faq_data['content'].tolist())]

# Create a vector database and content dictionary from the faq_data DataFrame.
db, info_dict = create_vector_database(faq_data)

# Define a question for which you want to generate an answer.
question = "I want to buy this boot camp but do you provide job assistance? Also note that I have never done coding"

# Generate an answer based on the relevant documents in the vector database.
answer = generate_answer_from_relevant_documents(question, db, info_dict)

# Print the complete answer (its tokens were already streamed while it was generated).
print(answer)


In [None]:
answer = generate_answer_from_relevant_documents(question, db, info_dict)

 Yes, we provide job assistance to all our learners. Our bootcamp is designed to teach you the most relevant skills and knowledge required by employers in the IT/Data Analytics industry. We have a strong track record of placing our learners in jobs with top companies. However, please note that we cannot guarantee any specific job offer or salary package.
Our focus is on preparing you for the job market by teaching you the most relevant skills and knowledge required by employers in the IT/Data Analytics industry. We have a strong track record of placing our learners in jobs with top companies. Our course content includes both theoretical and practical training, ensuring that you are well-equipped to handle any challenge that comes your way during the bootcamp.
We understand that job placement is an important aspect of our bootcamp, and we take it very seriously. We have a dedicated career services team that works closely with our learners to help them find suitable job

## **Using the LangChain Approach**

### **Requirements**

In [None]:
# !pip install ctransformers
# !pip install transformers
# !pip install "ctransformers[cuda]"
# !pip install sentence_transformers
# !pip install InstructorEmbedding
# !pip install faiss-cpu
# !pip install langchain
# !pip3 install "huggingface-hub>=0.17.1"

### **Imports**

In [None]:
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders.csv_loader import CSVLoader
import pandas as pd
import numpy as np
from tqdm import tqdm
from ctransformers import AutoModelForCausalLM
from ctransformers.langchain import CTransformers
import huggingface_hub
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceBgeEmbeddings

### **Loading Data**

In [None]:
# Load data from a CSV file located at '/content/codebasics_faqs.csv'
loader = CSVLoader(file_path='/content/codebasics_faqs.csv')

# Store the loaded data in the 'data' variable
data = loader.load()
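
If the CSV file starts with a UTF-8 byte order mark (BOM), the first column name is read as `'\ufeffprompt'` instead of `'prompt'`. The sketch below reproduces this with the standard `csv` module on a made-up two-line file and shows that decoding with `utf-8-sig` strips the mark. The sample row is invented for illustration; whether your `CSVLoader` version accepts an `encoding` argument is an assumption to verify against its documentation.

```python
import csv
import io

# A tiny CSV payload that begins with a UTF-8 byte order mark (made-up content).
raw = "\ufeffprompt,response\nDo you offer refunds?,Yes\n".encode("utf-8")

def first_row(data: bytes, encoding: str) -> dict:
    """Decode the bytes with the given encoding and return the first CSV row."""
    reader = csv.DictReader(io.StringIO(data.decode(encoding)))
    return next(reader)

plain = first_row(raw, "utf-8")
stripped = first_row(raw, "utf-8-sig")

print(list(plain))     # ['\ufeffprompt', 'response']
print(list(stripped))  # ['prompt', 'response']
```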

### **Creating Vector DB**

In [None]:

# Initialize INSTRUCTOR embeddings from the Hugging Face model (device "cuda" requires a GPU runtime).
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large", model_kwargs={"device": "cuda"})

# Use 'instructor_embeddings' to generate embeddings.
embedding = instructor_embeddings

# Build a FAISS vector store from the documents in 'data'.
vectordb = FAISS.from_documents(documents=data,
                                 embedding=embedding)

# Create a retriever for querying the vector database
retriever = vectordb.as_retriever()

load INSTRUCTOR_Transformer
max_seq_length  512


### **Downloading Model on Local**

In [None]:
# Download a file from the Hugging Face Model Hub
# using the specified repository ID and filename

repo_id = 'TheBloke/Llama-2-7b-Chat-GGUF'
filename = 'llama-2-7b-chat.Q2_K.gguf'
local_dir = './'
huggingface_hub.hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)



'./llama-2-7b-chat.Q2_K.gguf'

### **Loading The Model**

In [None]:
import os

# Define the generation and runtime parameters for the CTransformers model.
param_config = {'temperature': 0.01, 'gpu_layers': 50, 'max_new_tokens': 128, 'context_length': 2048}

# Initialize the CTransformers model from the file downloaded above
# (built from local_dir and filename instead of a hard-coded '/content/' path).
llm = CTransformers(model=os.path.join(local_dir, filename), model_type='llama', config=param_config)
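
A quick sanity check on these settings: the prompt and the generated tokens share the same context window, so reserving room for the answer limits how long the prompt (task text, retrieved context, and question) can be. The sketch below does that arithmetic for the values above; `prompt_token_budget` is a hypothetical helper, and it assumes the window is split only between prompt and generated tokens.

```python
# The two settings from the configuration above that determine the budget.
param_config = {"context_length": 2048, "max_new_tokens": 128}

def prompt_token_budget(config):
    """Tokens left for the prompt once room for the generated answer is reserved."""
    return config["context_length"] - config["max_new_tokens"]

budget = prompt_token_budget(param_config)
print(budget)  # 1920
```

If retrieved passages are long, this budget is what decides how many of them can be placed in the prompt.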


### **Creating QA Chain**

In [None]:
# Create a question-answering chain using the CTransformers model, retriever, and chain type
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                      chain_type="stuff",
                                      retriever=retriever,
                                      return_source_documents=True)


In [None]:
qa_chain("What courses do you have?")

{'query': 'What courses do you have?',
 'result': ' We have a range of courses that cater to different levels and interests. For instance, we have an introductory course on data analysis which covers the fundamentals of data manipulation, visualization, and interpretation. Additionally, we have advanced courses on machine learning, deep learning, and data science for those who want to delve deeper into these topics.',
 'source_documents': [Document(page_content='\ufeffprompt: Is there any prerequisite for taking this course?\nresponse: The only prerequisite is that you need to have a functional laptop with at least 4GB ram, internet connection and a thrill to learn data analysis.', metadata={'source': '/content/codebasics_faqs.csv', 'row': 35}),
  Document(page_content='\ufeffprompt: What business concepts and domains are covered in this course?\nresponse: We have covered the core functions such as Sales, Marketing, Finance, and Supply Chain with their fundamentals related to this cour