# Overview

- Use [LangChain](https://python.langchain.com/en/latest/index.html) to **<font color='orange'>build a chatbot that can answer questions about</font>** the [Harry Potter books](https://www.kaggle.com/datasets/hinepo/harry-potter-books-in-pdf-1-7)
- **<font color='orange'>Flexible and customizable RAG pipeline (Retrieval Augmented Generation)</font>**
- Experiment with various LLMs (Large Language Models)
- Use [FAISS vector store](https://python.langchain.com/docs/integrations/vectorstores/faiss) to store text embeddings created with [Sentence Transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) from 🤗. FAISS can run on the GPU, which made creating the embeddings much faster than with Chroma (~3 min versus ~35 min)
- Use [Retrieval chain](https://python.langchain.com/docs/modules/data_connection/retrievers/) to retrieve relevant passages from embedded text
- Summarize retrieved passages
- Leverage Kaggle dual GPU (2 * T4) with [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index)
- Chat UI with [Gradio](https://www.gradio.app/guides/quickstart)

**<font color='green'>No need to create any API key to use this notebook! Everything is open source.</font>**

**<font color='orange'>Don't forget to upvote the notebook if you learn from it or use it!</font>**

### Models

- [TheBloke/wizardLM-7B-HF](https://huggingface.co/TheBloke/wizardLM-7B-HF)
- [daryl149/llama-2-7b-chat-hf](https://huggingface.co/daryl149/llama-2-7b-chat-hf)
- [daryl149/llama-2-13b-chat-hf](https://huggingface.co/daryl149/llama-2-13b-chat-hf)
- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)

![image.png](attachment:ceef601b-8cca-48a5-a433-54c0070f1f44.png)

img source: HinePo

In [None]:
! nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-14521443-7c25-a1e9-678a-68995178e97b)


# Installs

In [None]:
%%time

from IPython.display import clear_output

! pip install sentence_transformers==2.2.2

! pip install -qq -U langchain==0.1.10
! pip install -qq -U tiktoken
! pip install -qq -U pypdf
! pip install -qq -U faiss-gpu
! pip install -qq -U InstructorEmbedding

! pip install -qq -U transformers==4.38.2
! pip install -qq -U accelerate
! pip install -qq -U bitsandbytes
! pip install -qq -U torch==2.0.0

clear_output()

CPU times: user 10.5 s, sys: 1.51 s, total: 12 s
Wall time: 14min 2s


# Imports

In [None]:
%%time

import warnings
warnings.filterwarnings("ignore")

import os
import glob
import textwrap
import time

import langchain

CPU times: user 14.3 ms, sys: 5.3 ms, total: 19.6 ms
Wall time: 59.1 ms


In [None]:
### loaders
from langchain.document_loaders import PyPDFLoader, DirectoryLoader

### splits
from langchain.text_splitter import RecursiveCharacterTextSplitter

### prompts
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

In [None]:
### vector stores
from langchain.vectorstores import FAISS

### models
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceInstructEmbeddings

### retrievers
from langchain.chains import RetrievalQA

In [None]:
import torch
import transformers

In [None]:
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)

#clear_output()

# CFG

- CFG class enables easy and organized experimentation

In [36]:
class CFG:
    # LLMs
    model_name = 'mistral-7B' # wizardlm, llama2-7b-chat, llama2-13b-chat, mistral-7B
    temperature = 0
    top_p = 0.95
    repetition_penalty = 1.15

    # splitting
    split_chunk_size = 600
    split_overlap = 0

    # embeddings
    embeddings_model_repo = 'sentence-transformers/all-MiniLM-L6-v2'

    # similar passages
    k = 4

    # paths
    PDFs_path = 'sample_data/'
    Embeddings_path =  'faiss-hp-sentence-transformers'
    Output_folder = './vectordb'

# Define model

In [37]:
def get_model(model = CFG.model_name):

    print('\nDownloading model: ', model, '\n\n')

    if model == 'wizardlm':
        model_repo = 'TheBloke/wizardLM-7B-HF'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit = True,
            bnb_4bit_quant_type = "nf4",
            bnb_4bit_compute_dtype = torch.float16,
            bnb_4bit_use_double_quant = True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config = bnb_config,
            device_map = 'auto',
            low_cpu_mem_usage = True
        )

        max_len = 1024

    elif model == 'llama2-7b-chat':
        model_repo = 'daryl149/llama-2-7b-chat-hf'

        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit = True,
            bnb_4bit_quant_type = "nf4",
            bnb_4bit_compute_dtype = torch.float16,
            bnb_4bit_use_double_quant = True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config = bnb_config,
            device_map = 'auto',
            low_cpu_mem_usage = True,
            trust_remote_code = True
        )

        max_len = 2048

    elif model == 'llama2-13b-chat':
        model_repo = 'daryl149/llama-2-13b-chat-hf'

        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit = True,
            bnb_4bit_quant_type = "nf4",
            bnb_4bit_compute_dtype = torch.float16,
            bnb_4bit_use_double_quant = True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config = bnb_config,
            device_map = 'auto',
            low_cpu_mem_usage = True,
            trust_remote_code = True
        )

        max_len = 2048 # 8192

    elif model == 'mistral-7B':
        model_repo = 'mistralai/Mistral-7B-v0.1'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit = True,
            bnb_4bit_quant_type = "nf4",
            bnb_4bit_compute_dtype = torch.float16,
            bnb_4bit_use_double_quant = True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config = bnb_config,
            device_map = 'auto',
            low_cpu_mem_usage = True,
        )

        max_len = 1024

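    else:
        # fail early instead of returning undefined tokenizer / model / max_len
        raise NotImplementedError(f"Model not implemented (tokenizer and backbone): {model}")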

    return tokenizer, model, max_len

In [None]:
%%time
tokenizer, model, max_len = get_model(model = CFG.model_name)

clear_output()

CPU times: user 34.3 s, sys: 43.7 s, total: 1min 17s
Wall time: 3min 45s
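
As a rough back-of-the-envelope check on why 4-bit loading matters on a 16 GB T4, the sketch below estimates the memory needed for the weights alone. The helper name `weight_memory_gb` and the rounded parameter counts (7e9, 13e9) are assumptions for illustration; real usage is higher because of quantization constants, activations and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Rounded parameter counts, used only for illustration
for name, n_params in [("7B model", 7e9), ("13B model", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{name}: fp16 ~{fp16:.1f} GB, 4-bit ~{four_bit:.1f} GB")
```

Under these assumptions a 13B model would not fit on a single T4 in fp16, while its 4-bit weights leave room for activations.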


# 🤗 pipeline

- Hugging Face pipeline

In [None]:
### hugging face pipeline
pipe = pipeline(
    task = "text-generation",
    model = model,
    tokenizer = tokenizer,
    pad_token_id = tokenizer.eos_token_id,
#     do_sample = True,
    max_length = max_len,
    temperature = CFG.temperature,
    top_p = CFG.top_p,
    repetition_penalty = CFG.repetition_penalty
)

### langchain pipeline
llm = HuggingFacePipeline(pipeline = pipe)
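
Since `do_sample` is commented out above, generation should be greedy (the most likely token at each step), and `temperature` and `top_p` would only take effect once sampling is enabled. To illustrate what `top_p` does in that case, here is a minimal sketch of nucleus filtering on a toy distribution. The function `top_p_filter` and the toy probabilities are assumptions for illustration, not the Transformers implementation.

```python
def top_p_filter(probs: dict, top_p: float) -> dict:
    """Keep the smallest set of most likely tokens whose total probability
    reaches top_p, then renormalise so the kept probabilities sum to 1."""
    kept, total = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept.items()}


# Toy next-token distribution (made up)
toy = {"the": 0.5, "a": 0.3, "wand": 0.15, "owl": 0.05}

print("greedy choice:", max(toy, key=toy.get))
print("nucleus (top_p=0.9):", top_p_filter(toy, 0.9))
```

With `top_p = 0.9` the least likely token is dropped before sampling, while greedy decoding always returns the single most likely token.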

# 🦜🔗 Langchain

- Multiple document retriever with LangChain

# Loader

- [Directory loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for multiple files
- This step is not necessary if you are just loading the vector database
- This step is necessary if you are creating embeddings. In this case you need to:
    - load the PDF files
    - split them into chunks
    - create embeddings
    - save the embeddings in a vector store
- After that, you can just load the saved embeddings, run a similarity search with the user query, and then use the LLM to answer the question
    
You can comment out this section if you use the embeddings I already created.

In [None]:
%%time

loader = DirectoryLoader(
    CFG.PDFs_path,
    glob="./*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
    use_multithreading=True
)

documents = loader.load()

100%|██████████| 9/9 [00:05<00:00,  1.51it/s]

CPU times: user 5.82 s, sys: 51.5 ms, total: 5.87 s
Wall time: 5.96 s





In [None]:
documents[0][2].page_content

"© 2023 Nanyang Technological University, Singapore. All Rights Reserved.Sequential Data\nTextual Data:▪Sequential data is organised  in a specific order, often with a time -based or chronological sequence.\n▪The order in which the data points occur is essential for understanding the data's meaning.\n▪In sequential data, each data point is likely correlated to both the earlier and later data points in the \nsequence.\n▪The meaning  of a text often  depends  on the order  of words  and the grammatical  rules  \nthat govern  their arrangement .\n▪Understanding the text often requires knowledge of what came before and what follows.\n▪The meaning of a word or phrase can change based on the context provided by the \nsurrounding text.\n3"

In [None]:
print(f'We have {len(documents[0])} pages in total')

We have 27 pages in total


# Splitter

- Splitting the text into chunks so its passages are easily searchable for similarity
- This step is also only necessary if you are creating the embeddings
- [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/reference/modules/document_loaders.html?highlight=RecursiveCharacterTextSplitter#langchain.document_loaders.MWDumpLoader)

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CFG.split_chunk_size,
    chunk_overlap = CFG.split_overlap
)

texts = text_splitter.split_documents(documents[0])

print(f'We have created {len(texts)} chunks from {len(documents[0])} pages')

We have created 31 chunks from 27 pages
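
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which also tries to split on separators such as paragraphs and sentences first); the function `chunk_text` is a hypothetical helper for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Fixed-size character windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, overlap=0))  # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, chunk_size=4, overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

A non-zero overlap repeats the end of one chunk at the start of the next, which can help when an answer straddles a chunk boundary, at the cost of more chunks to embed.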


# Create Embeddings


- Embed and store the texts in a vector database (FAISS)
- [LangChain Vector Stores docs](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [FAISS - langchain](https://python.langchain.com/docs/integrations/vectorstores/faiss)
- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks - paper Aug/2019](https://arxiv.org/pdf/1908.10084.pdf)
- [This is a nice 4-minute video about vector stores](https://www.youtube.com/watch?v=dN0lsF2cvm4)

___

- If you use Chroma vector store it will take ~35 min to create embeddings
- If you use FAISS vector store on GPU it will take just ~3 min

___

We need to create the embeddings only once, and then we can just load the vector store and query the database using similarity search.

Loading the embeddings takes only a few seconds.

I uploaded the embeddings to a Kaggle Dataset so we just load it from [here](https://www.kaggle.com/datasets/hinepo/faiss-hp-sentence-transformers).

In [None]:
%%time

### we create the embeddings only if they do not exist yet in the output folder
if not os.path.exists(f"{CFG.Output_folder}/faiss_index_hp/index.faiss"):

    ### download embeddings model
    embeddings = HuggingFaceInstructEmbeddings(
        model_name = CFG.embeddings_model_repo,
        model_kwargs = {"device": "cuda"},
    )

    ### create embeddings and DB
    vectordb = FAISS.from_documents(
        documents = texts,
        embedding = embeddings
    )

    ### persist vector database
    vectordb.save_local(f"{CFG.Output_folder}/faiss_index_hp") # save in output folder
#     vectordb.save_local(f"{CFG.Embeddings_path}/faiss_index_hp") # save in input folder

load INSTRUCTOR_Transformer
max_seq_length  512
CPU times: user 191 ms, sys: 42.1 ms, total: 233 ms
Wall time: 224 ms


If creating embeddings, remember that on Kaggle we cannot write data to the input folder.

So just write (save) the embeddings to the output folder and then load them from there.

# Load vector database

- After saving the vector database, we just load it, either from the output folder or from the Kaggle Dataset mentioned above
- The embeddings function used to load the vector database must be the same as the one used to create it

In [None]:
%%time

### download embeddings model
embeddings = HuggingFaceInstructEmbeddings(
    model_name = CFG.embeddings_model_repo,
    model_kwargs = {"device": "cuda"},
)

### load vector DB embeddings
vectordb = FAISS.load_local(
    #CFG.Embeddings_path, # from input folder
    CFG.Output_folder + '/faiss_index_hp', # from output folder
    embeddings,
    allow_dangerous_deserialization=True
)

clear_output()

CPU times: user 83 ms, sys: 31.7 ms, total: 115 ms
Wall time: 115 ms


In [None]:
### test if vector DB was loaded correctly
vectordb.similarity_search('rnn')

[Document(page_content='© 2023 Nanyang Technological University, Singapore. All Rights Reserved.RNNs\n▪Recurrent Neural Networks (RNNs) are a type of neural network designed for tasks involving \nsequences or time series data.\n▪Unlike traditional feedforward neural networks, RNNs have connections that loop back on \nthemselves, allowing them to maintain memory of previous inputs.\n▪RNNs can model dependencies over time, making them great for tasks like language \ntranslation, speech recognition, and predicting future values in a time series.\n6', metadata={'source': 'sample_data/EE6405_W4_NM_For Students.pdf', 'page': 5}),
 Document(page_content='© 2023 Nanyang Technological University, Singapore. All Rights Reserved.Bi-Directional RNNs\n▪Bi-Directional RNNs consist of two separate RNNs: one moving forward through the input \nsequence and the other moving backward. \n▪Bi-Directional RNNs are especially useful in applications where understanding the context from \nboth past and future 
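
Conceptually, a similarity search embeds the query and returns the stored vectors closest to it. The sketch below does this by brute force with squared Euclidean (L2) distance, which, as far as I know, is what a flat FAISS index computes by default. The toy 3-dimensional vectors and the helpers `l2_squared` and `top_k` are assumptions for illustration, not the FAISS API.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Return (index, distance) pairs for the k stored vectors closest to the query."""
    scored = [(i, l2_squared(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings" standing in for real chunk vectors
store = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]

for index, distance in top_k(query, store, k=2):
    print(index, round(distance, 4))
```

Real indexes avoid recomputing everything in Python, but the ranking idea is the same: smaller distance means a more similar passage.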

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Prompt Template

- Custom prompt

In [None]:
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}
Question:{question}
Answer:"""


PROMPT = PromptTemplate(
    template = prompt_template,
    input_variables = ["context", "question"]
)

# Retriever chain

- Retriever to retrieve relevant passages
- Chain to answer questions
- [RetrievalQA: Chain for question-answering](https://python.langchain.com/docs/modules/data_connection/retrievers/)

In [None]:
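### search_type is an argument of as_retriever itself; search_kwargs only holds the search parameters
retriever = vectordb.as_retriever(search_type = "similarity", search_kwargs = {"k": CFG.k})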

qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever,
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = False,
    verbose = False
)
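
With `chain_type = "stuff"`, all retrieved passages are placed ("stuffed") into the single `{context}` slot of the prompt before the LLM is called once. The sketch below reproduces that idea with plain string formatting. The helper `build_stuff_prompt` and the example passages are hypothetical; the `"\n\n"` separator is assumed to match the chain's default document separator.

```python
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}
Question:{question}
Answer:"""


def build_stuff_prompt(passages, question):
    """Join every retrieved passage into one context block and fill the template."""
    context = "\n\n".join(passages)
    return prompt_template.format(context=context, question=question)


# Made-up passages standing in for the k retrieved chunks
passages = [
    "RNNs are neural networks designed for sequence data.",
    "LSTMs add gates that control what the cell state keeps.",
]
print(build_stuff_prompt(passages, "What is an RNN?"))
```

Because everything goes into one prompt, `k` and the chunk size together have to stay within the model's context length (`max_len`).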

In [None]:
### testing MMR search
question = "What is lstm?"
vectordb.max_marginal_relevance_search(question, k = CFG.k)

[Document(page_content='A streamlined version of LSTM with a simpler architecture, combining several gates for efficient learning \nof dependencies in sequences.–Bi-RNNs:\nAn RNN that processes data in both forward and backward directions to capture context from the entire \nsequence.\n26', metadata={'source': 'sample_data/EE6405_W4_NM_For Students.pdf', 'page': 25}),
 Document(page_content='© 2023 Nanyang Technological University, Singapore. All Rights Reserved.LSTM – Output Gates\n▪The output gate determines which parts of the cell state are used to generate the output at \nthe current time step. \n▪The sigmoid layer decides how the current input and hidden state contribute to the output.\n▪The cell state passes through tanh function for normalisation .\n▪The two outputs are multiplied to produce the final output. \n▪The output gate equation is given by:\n▪The hidden state is given by:𝑜𝑡=𝜎(𝑈𝑜𝑥𝑡+𝑊𝑜ℎ𝑡−1+𝑏(𝑜))\nℎ𝑡=𝑜𝑡tanh (𝑐𝑡)\n21', metadata={'source': 'sample_data/EE6405_W4_NM_For Stude
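
MMR (maximal marginal relevance) differs from plain similarity search in that it penalises passages that are too similar to ones already selected. Here is a minimal greedy sketch on toy 2-D vectors. The helpers `cosine` and `mmr` and the vectors are assumptions for illustration; `lambda_mult` borrows the name of LangChain's parameter (1.0 means pure relevance, lower values favour diversity).

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: return indices of k candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already picked (0 if nothing picked yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]  # the first two are near-duplicates

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]  (relevance only)
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]  (near-duplicate skipped)
```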

In [None]:
### testing similarity search
question = "what are Sequential Data?"
vectordb.similarity_search(question, k = CFG.k)

[Document(page_content="© 2023 Nanyang Technological University, Singapore. All Rights Reserved.Sequential Data\nTextual Data:▪Sequential data is organised  in a specific order, often with a time -based or chronological sequence.\n▪The order in which the data points occur is essential for understanding the data's meaning.\n▪In sequential data, each data point is likely correlated to both the earlier and later data points in the \nsequence.\n▪The meaning  of a text often  depends  on the order  of words  and the grammatical  rules  \nthat govern  their arrangement .", metadata={'source': 'sample_data/EE6405_W4_NM_For Students.pdf', 'page': 2}),
 Document(page_content='© 2023 Nanyang Technological University, Singapore. All Rights Reserved.Sequential Data\n▪To process  sequential  data effectively,  we need  models  that can account  for the \ninterconnectedness  of sequential  data.\n▪Models  can handle  sequential  data through  various  techniques  and architectures  designed  to \nca

# Post-process outputs

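- Format the LLM response
- Cite sources (PDFs): this part is currently commented out and needs `return_source_documents = True` in the chain
- Change the `width` parameter to format the output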

In [None]:
def wrap_text_preserve_newlines(text, width=600):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text


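def process_llm_response(llm_response):
    # The pipeline echoes the prompt, so keep only the text after the first "Answer:"
    # (fall back to the full text if the marker is missing)
    parts = llm_response['result'].split("Answer:")
    ans = parts[1] if len(parts) > 1 else parts[0]

    # Drop any follow-up question the model keeps generating
    ans = ans.split("Question:")[0].strip()
    ans = wrap_text_preserve_newlines(ans)

    # Optional: cite sources (requires return_source_documents = True in the chain)
    # sources_used = ' \n'.join(
    #     [
    #         source.metadata['source'].split('/')[-1][:-4]
    #         + ' - page: '
    #         + str(source.metadata['page'])
    #         for source in llm_response['source_documents']
    #     ]
    # )

    #ans = ans + '\n\nSources: \n' + sources_used
    return ans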

In [None]:
def llm_ans(query):
    llm_response = qa_chain.invoke(query)
    ans = process_llm_response(llm_response)
    return ans
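
To see why the post-processing is needed, the sketch below runs the same extraction idea on a made-up generation in which the model echoes the prompt and then keeps going with a second question. `extract_answer` is a hypothetical standalone variant of the logic in `process_llm_response` that also tolerates a missing `Answer:` marker; `fake_output` is invented for illustration.

```python
def extract_answer(generated: str) -> str:
    """Keep the text after the first "Answer:" and cut any follow-up "Question:"."""
    parts = generated.split("Answer:")
    answer = parts[1] if len(parts) > 1 else parts[0]
    return answer.split("Question:")[0].strip()


# Invented example of an echoed prompt followed by extra generated text
fake_output = (
    "Use only the following pieces of context to answer the question at the end.\n\n"
    "RNNs process sequences.\n"
    "Question:What is an RNN?\n"
    "Answer: A network for sequential data.\n"
    "Question:What is an LSTM?\n"
    "Answer: ..."
)

print(extract_answer(fake_output))  # A network for sequential data.
```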

# Ask questions

- Question Answering from multiple documents
- Invoke QA Chain
- Talk to your data

In [None]:
query = "What is TF-IDF?"
print(llm_ans(query))

TF-IDF stands for term frequency-inverse document frequency. It is a statistical measure used in information retrieval and natural language processing to evaluate how important a word is to a document in a collection or corpus. The idea behind TF-IDF is that words that appear frequently in a document are less useful as indicators of the topic of the document than words that appear rarely.


In [None]:
query = "Give me 5 examples of NLP applications and explain what they do"
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

Harry  Po t tera
nd the Sorcerer’s Stoneby
J.K. Rowlingi
llustrations by Mary GrandPréA
rthur A. Levine BooksA
n Imprint of Scholastic Inc.
00_F
ront Matter.indd   iii3/ 20/09   1:56:12 PMP
rocess Black

v Contents ON
ET
he Boy Who Lived .  /uniF6DCt
woT
he Vanishing Glass .  /uniF6DC/uniF640t
hreeT
he Letters from No One .  /uniF63B/uniF6DCf
ourT
he Keeper of the Keys .  /uniF63C/uniF63Ef
iveD
iagon Alley .  /uniF63E/uniF6DCsi
xT
he Journey from Platform Ni
ne and Three-quarters .  /uniF640/uniF640se
venT
he Sorting Hat .  /uniF6DC/uniF6DC/uniF63Be
ightT
he Potions Master .  /uniF6DC/uniF63B/uniF6DC00_F
ront Matter.indd   v3/ 9/09   3:12:39 PMP
rocess Black

n any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, w
ithout written permission of 

# Gradio Chat UI

- **<font color='orange'>At the moment this part only works on Google Colab. Gradio and Kaggle started having compatibility issues recently.</font>**
- If you plan to use the interface, run this section in Google Colab
- On Kaggle, keep this section commented out for now
- Screenshots of the chat UI are shown below

___

- Create a chat UI with [Gradio](https://www.gradio.app/guides/quickstart)
- [ChatInterface docs](https://www.gradio.app/docs/chatinterface)
- The notebook must be running for the chat interface to work

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
! pip install --upgrade gradio -qq
clear_output()

In [None]:
import gradio as gr
print(gr.__version__)

4.25.0


In [None]:
def predict(message, history):
    # output = message # debug mode

    output = str(llm_ans(message)).replace("\n", "<br/>")
    return output

demo = gr.ChatInterface(
    predict,
    title = f' Open-Source LLM ({CFG.model_name}) for EE6405 Question Answering'
)

demo.queue()
demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://78b60ebae03c23c229.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




![image.png](attachment:413fe7a3-6534-45b5-b6e3-7fc86e982cf1.png)

![image.png](attachment:976f4bf4-7626-4d4a-b773-3eebd7e9f000.png)

# Conclusions

- Feel free to fork and optimize the code. Lots of things can be improved.

- Things I found had the most impact on model output quality in my experiments:
    - Prompt engineering
    - Bigger models
    - Other model families
    - Splitting: chunk size, overlap
    - Search: Similarity, MMR, k
    - Pipeline parameters (temperature, top_p, penalty)
    - Embeddings function
    - LLM parameters (max len)


- LangChain, Hugging Face and Gradio are awesome libs!

- **<font color='orange'>If you liked this notebook, don't forget to show your support with an Upvote!</font>**

- In case you are interested in LLMs, I also have some other notebooks you might want to check:

    - [Instruction Finetuning](https://www.kaggle.com/code/hinepo/llm-instruction-finetuning-wandb)
    - [Preference Finetuning - LLM Alignment](https://www.kaggle.com/code/hinepo/llm-alignment-preference-finetuning)
    - [Synthetic Data for Finetuning](https://www.kaggle.com/code/hinepo/synthetic-data-creation-for-llms)
    - [Safeguards and Guardrails](https://www.kaggle.com/code/hinepo/llm-safeguards-and-guardrails)
    
___

🦜🔗🤗

![image.png](attachment:68773819-4358-4ded-be3e-f1d275103171.png)