<a href="https://colab.research.google.com/github/dobrevajovana/LangChain_RAG/blob/main/Md_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#<font color='orange'>**Md files Chatbot 🤗📁**</font>

This notebook contains complete code for the development of a chatbot/QA system.

#<font color='orange'> ⏭ Change your Runtime</font>
On the task bar, select Runtime --> Change runtime type, then choose any GPU that is available to you (it doesn't matter whether you are using basic or Pro Colab).

#<font color='orange'>Download .md files</font>
First we need to load the dataset, which contains all the .md files. To make the notebook easy for others to run, I download the whole folder to the local Colab filesystem. If the number of files grows, the folder can instead be read directly from Google Drive by mounting the drive.

In [None]:
!gdown [ID-OF-YOUR-FOLDER]

In [None]:
!unzip "/content/folder.zip"

#<font color='orange'>Requirements</font>
Install all the packages needed to run the code.

In [None]:
!pip3 install langchain
!pip3 install ipykernel jupyter
!pip3 install auto-gptq==0.2.2
!pip install requests openai transformers faiss-gpu
!pip install unstructured
!pip install sentence_transformers

#<font color='orange'>Building the chatbot</font>

In [None]:
import os
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from langchain.embeddings import HuggingFaceBgeEmbeddings
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline


##<font color='orange'>Loading & splitting the data</font>
The .md files sit in a folder and need to be loaded and split. There are two options:
1. 1st option (gave the best results): use the ready-made LangChain class **[DirectoryLoader](https://js.langchain.com/docs/api/document_loaders_fs_directory/classes/DirectoryLoader)**. We specify the type of files to process (.md) and the loader class, in our case **[UnstructuredMarkdownLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/markdown)**, which suits the structure of these files. With this technique we get as many documents as there are files, but the problem is the model, which accepts a maximum of 512 tokens, so each document has to be split into so-called chunks (pages). I use the **[RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)** class, which splits text recursively on separators and measures chunk length in characters; as a rule of thumb, 4 characters are about one token. I set the chunk size to 1000 characters rather than the maximum of 2048 characters (512 tokens times 4) because the splitting is not always precise. To avoid losing context, I set an overlap of 100 characters between consecutive chunks.
2. 2nd option (not as good): use the **[MarkdownHeaderTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata)** class to split the files by headers. I go through all the files in the directory, split each one by header, and then manually add the source file name and page number to the metadata. Chunk size is a further problem, handled by checking the content length: if a section has more than 1000 characters, I split it into two pages.
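The chunk-size arithmetic and the overlap idea from the 1st option can be sketched in plain Python. This is a simplified fixed-window splitter, not the algorithm `RecursiveCharacterTextSplitter` implements; the function name `split_with_overlap` and the sample text are illustrative assumptions.

```python
CHARS_PER_TOKEN = 4   # rough rule of thumb from the text above
MAX_TOKENS = 512      # model limit quoted in the text above

max_chars = CHARS_PER_TOKEN * MAX_TOKENS  # 2048 characters


def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical 2500-character document
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print(max_chars, [len(c) for c in chunks])
```

Keeping the chunk size well below `max_chars` leaves headroom for texts where the 4-characters-per-token estimate is too optimistic.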

In [None]:
# @title Select the type of loading & splitting and run the cell
option = '2nd option' # @param ["1st option", "2nd option"]

In [None]:
splits=[]
if option == '1st option':
  text_loader_kwargs={'mode': 'elements'}
  loader = DirectoryLoader('/content/folder/', glob="**/*.md", loader_cls=UnstructuredMarkdownLoader,loader_kwargs=text_loader_kwargs)
  docs = loader.load()
  # Char-level splits
  chunk_size = 1000
  chunk_overlap = 100
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=chunk_size, chunk_overlap=chunk_overlap
  )
  # Split
  splits = text_splitter.split_documents(docs)
else:
  splits=[]
  for filename in os.listdir('/content/folder/'):
      f = os.path.join('/content/folder/', filename)
      markdown_document = open(f,'r',encoding='utf-8').read()
      headers_to_split_on = [
          ("#", "Header 1"),
          ("##", "Header 2"),
          ("###", "Header 3"),
      ]

      markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
      md_header_splits = markdown_splitter.split_text(markdown_document)
      page_number = 1
      for split in md_header_splits:
        split.metadata['filename'] = filename
        text = split.page_content
        # Sections longer than 1000 characters are cut into two pages
        pieces = [text] if len(text) <= 1000 else [text[:1000], text[1000:]]
        for piece in pieces:
          # Build a separate Document per piece. Assigning the same object to
          # two names (split1 = split2 = split) would make both halves share one
          # page_content, so the second half would be lost.
          metadata = dict(split.metadata, page_number=page_number)
          splits.append(type(split)(page_content=piece, metadata=metadata))
          page_number += 1




## <font color='orange'>Vectorization of the Documents context</font>
Before the texts can be searched and passed to the LLM, they must first be vectorized. I could use an OpenAI embedding model, but if we later switch to an open-source LLM, the OpenAI embeddings would have to be replaced completely, which means a lot of extra work. Open-source embedding models, by contrast, work equally well with Llama, OpenAI, or any other LLM available on HuggingFace. I chose the BGE model because, according to the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard) published on HuggingFace, it is the best, and at the same time it does not take up too much RAM.
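Normalizing the embeddings matters because, for unit-length vectors, a plain dot product equals the cosine similarity. The sketch below checks this with toy 2-dimensional vectors; the helper names and values are illustrative and are not part of the embedding library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization the plain dot product equals the cosine similarity
normalized_dot = dot(l2_normalize(a), l2_normalize(b))
print(round(normalized_dot, 4), round(cosine_similarity(a, b), 4))
```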

In [None]:
model_name = "BAAI/bge-base-en"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

model_norm = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)


For fast document search I use [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss). If the document collection grows, FAISS can be replaced with a dedicated vector database, but for now it is our best option.
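Conceptually, a vector store answers a query by ranking the stored chunk vectors by similarity to the query vector. The sketch below does this exhaustively in plain Python. It illustrates the idea only and is not how FAISS is implemented; the function `top_k` and the toy vectors are assumptions. For unit-length vectors, ranking by highest inner product gives the same order as ranking by smallest Euclidean distance.

```python
def top_k(query, vectors, k):
    """Return the indices of the k vectors with the highest inner product with the query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy unit-length "embeddings" (hypothetical values, 2 dimensions for readability)
chunk_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
]
query_vector = [1.0, 0.0]

nearest = top_k(query_vector, chunk_vectors, k=2)
print(nearest)
```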

In [None]:
db = FAISS.from_documents(splits, model_norm)

##<font color='orange'>LLM Model 🤗</font>
For this task I decided to use a quantized model of Llama-2-13B-chat. Why did I choose this model?

1. If I load the model locally without quantization, I will not have enough disk space for it.
2. If I use HuggingFaceHub, which sends requests directly to Hugging Face without downloading the model locally, responses take a very long time, and after a while you need a subscription because the free tokens run out.
3. I did not use an OpenAI model because it also requires a subscription and is billed by the number of tokens; if this goes to production, the solution does not scale well financially.

With the model TheBloke/Llama-2-13B-chat-GPTQ we get:
1. Good performance
2. Free OpenSource model
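The size argument behind quantization can be made concrete with a back-of-the-envelope estimate. The sketch assumes roughly 13 billion parameters and counts the weights only; quantization metadata, activations, and the KV cache are ignored, so real figures will be somewhat higher. The helper `weight_size_gb` is hypothetical.

```python
PARAMS = 13e9  # Llama-2-13B: roughly 13 billion parameters (approximation)


def weight_size_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


sizes = {bits: weight_size_gb(PARAMS, bits) for bits in (16, 8, 4)}
for bits, gb in sizes.items():
    print(f"{bits}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions, 16-bit weights need about 26 GB, 8-bit about 13 GB, and 4-bit about 6.5 GB.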

In [None]:
# @title If you want to load model from OpenAI or use HuggingFace Hub here is the code you will need
# @markdown if you want to run this code please copy and execute it in a code block.
model = "OpenAI" #@param ["OpenAI","Llama-HuggingFaceHub"]
if model=='OpenAI':
  print("""
  import os
  from langchain.llms import OpenAI
  os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
  llm = OpenAI()
  """)
else:
  print( """
import os
from langchain import HuggingFaceHub
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_KEY
llm = HuggingFaceHub(repo_id="meta-llama/Llama-2-70b-chat-hf", model_kwargs={"temperature":0.0, "max_length":512})
""")




As mentioned, I use an open-source quantized Llama model, which can be downloaded and run locally at no cost.

In [None]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"
model_basename = "model"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

"""
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

"""
#To download from a specific branch, use the revision parameter, as in this example:

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-8bit-64g-actorder_True",
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        quantize_config=None,
        load_in_8bit=True)



Constructing custom prompt template:

In [None]:
prompt = "What is missing for your solution to be production-ready?"
template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
USER: {prompt}
ASSISTANT:
'''

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

# print(pipe(template)[0]['generated_text'])

*** Pipeline:


Before the LLM answers a query, the retriever searches the documents for relevant chunks, so we first have to specify how many documents should be taken into consideration. This parameter is called 'k' in the retriever component.
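The choice of k is a trade-off: more chunks give the model more evidence but take up more of the prompt. The sketch below estimates the token cost of k chunks and shows one simple way to join them into a single context string. It is an illustration under the same 4-characters-per-token rule of thumb, not LangChain's internal prompt construction; all names are hypothetical.

```python
CHARS_PER_TOKEN = 4   # same rough rule of thumb as in the splitting section
CHUNK_SIZE = 1000     # characters per chunk, as configured in the splitting section


def context_tokens(k, chunk_size=CHUNK_SIZE, chars_per_token=CHARS_PER_TOKEN):
    """Rough number of prompt tokens taken up by k retrieved chunks."""
    return k * chunk_size // chars_per_token


def build_context(chunks, k):
    """Join the k best-ranked chunks into one context string."""
    return "\n\n".join(chunks[:k])


# Hypothetical chunks, already sorted by similarity to the query
ranked_chunks = ["chunk A", "chunk B", "chunk C"]
print(context_tokens(5))
print(build_context(ranked_chunks, k=2))
```

With k = 5 and 1000-character chunks, the retrieved context alone takes roughly 1250 tokens.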

In [None]:
# @title Select how many documents to take into consideration when searching for the answer
k = "5" #@param ["1","2","3","4","5"]

In [None]:
llm = HuggingFacePipeline(pipeline=pipe)
conversation_chain = RetrievalQA.from_llm(
        llm=llm,
        retriever=db.as_retriever(search_kwargs={"k": int(k)}),
        return_source_documents=True,
    )

Formatting the output

In [None]:
import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def llm_chatbot(question):
    llm_response=conversation_chain(question)
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['filename'])

#<font color='Orange'>Ask the bot about your md files</font>

In [None]:
llm_chatbot("your query")