<a href="https://colab.research.google.com/github/allansuzuki/LLM_RAG_model/blob/main/treino_llm_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![author](https://img.shields.io/badge/author-allansuzuki-red.svg)](https://www.linkedin.com/in/allanysuzuki/) [![](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-365/) [![MIT license](https://img.shields.io/badge/License-MIT-yellow.svg)](http://perso.crans.org/besson/LICENSE.html) [![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/allansuzuki/LLM_RAG_model/issues)

This notebook follows the LangChain quickstart guides:
* https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa
* https://python.langchain.com/docs/guides/local_llms

For background on LangChain chains (`LLMChain`):
https://python.langchain.com/docs/modules/chains

It also builds on Mario Silva's project, which loads a local Llama 2 model from Hugging Face:
* https://github.com/MarioCSilva/RAG_QA_LLM/blob/main/llama2_RAG_QA.ipynb

Have you ever wanted an easy way to give an LLM more context about a subject, helping it narrow down to something more specific and add the details needed for a proper answer?

That is the main purpose of RAG (Retrieval-Augmented Generation): instead of training the model on new (and possibly private) data, we retrieve relevant passages from the provided documents or texts and pass them to the model so it can return a satisfactory answer.

In this use case we are going to use the LangChain framework to build the prompts and use an LLM to answer a question about a piece of documentation.

LangChain provides simple prompt templates that work across multiple LLMs; here it is integrated with Meta's Llama-2-chat model through a Hugging Face pipeline.



# Install

In [1]:
# necessary to install requirements properly
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install --upgrade --quiet langchain langchain-community chromadb bs4 huggingface_hub
!pip install -U -q transformers accelerate bitsandbytes sentence-transformers


In [3]:
# or get the requirements !pip freeze > requirements.txt
# and install based on requirements.txt !pip install -r requirements.txt

# Set model to load

Here we get the Hugging Face API token and the Hugging Face model_id

In [4]:
# get access tokens here -- need to add in the key icon on the left tab with `NAME` as variable name and `VALUE` with the secret value
# In this case, I added my Hugging Face token in the HF_TOKEN colab secrets
from google.colab import userdata
hf_auth = userdata.get('HF_TOKEN')

In [5]:
import transformers

# get the model id on hugging face
model_id = 'meta-llama/Llama-2-7b-chat-hf'

In [6]:
from torch import cuda, bfloat16

# get device for torch to run on GPU
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Quantization

Quantization, in the context of Large Language Models (LLMs), is the process of converting a model's weights from higher-precision data types to lower-precision ones. This lets larger, heavier models be loaded into less memory.

We are going to use bitsandbytes (through `BitsAndBytesConfig`) to apply the quantization.
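To make the idea concrete, here is a minimal sketch of quantization in plain Python. It uses simple absmax rounding to 4-bit signed integers, which is an assumption chosen for readability; it is not the NF4 scheme that bitsandbytes applies, and the function names are hypothetical.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit values
    # fall back to 1.0 when every weight is zero, to avoid dividing by zero
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.70, -0.32, 0.11, 0.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)         # [7, -3, 1, 0]
print(restored)  # close to the original weights, but not identical
```

Each weight now needs only 4 bits plus one shared scale, at the cost of a rounding error of at most half a quantization step.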

In [7]:
# set quantization configuration to load large model with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

In [8]:
# # how the quantization config looks like
# display(bnb_config)

# Model configurations

This loads the model configuration (architecture settings, not the weights) from Hugging Face.

In [None]:
# begin initializing HF items with hf auth token
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_auth  # `use_auth_token` is deprecated in recent transformers releases
)

In [10]:
# # how the model config looks like
# display(model_config)

# Load model

This loads the LLM with `AutoModelForCausalLM`, applying the quantization and model configs defined above. `device_map='auto'` lets the `accelerate` library place the model's layers on the available devices (GPU first, when present).

In [None]:
# load HF model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    token=hf_auth  # `use_auth_token` is deprecated in recent transformers releases
)

# # enable evaluation mode to allow model inference -- no need for now
# model.eval()

In [12]:
print(f"Model loaded on {device}")

Model loaded on cuda:0


# Load tokenizer

Tokenizing is the process of converting human-readable text into numbers: the tokenizer splits the text into pieces called `tokens` and maps each one to an integer ID from its vocabulary.

For example, the stop string `'\nHuman:'` used later in this notebook becomes `[1, 29871, 13, 29950, 7889, 29901]` after tokenizing. These IDs are the form in which the LLM actually receives text.
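As a rough illustration of the text-to-IDs mapping, here is a toy whitespace tokenizer. It is only a sketch: real tokenizers such as Llama 2's work on subword pieces with a fixed, pre-trained vocabulary, and the helper names below are hypothetical.

```python
def build_vocab(corpus):
    """Assign an integer ID to every distinct lowercase word; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for word in corpus.lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn text into a list of token IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Turn token IDs back into text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


vocab = build_vocab("hello world hello agents")
ids = encode("Hello world", vocab)

print(ids)                           # [1, 2]
print(decode(ids, vocab))            # hello world
print(encode("Hello llamas", vocab)) # [1, 0] -> unseen word maps to <unk>
```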

In [None]:
import transformers

# settings: model_id = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    token=hf_auth  # `use_auth_token` is deprecated in recent transformers releases
)

# Stop on tokens

We define stopping criteria so that the model stops generating text as soon as it produces one of these stop sequences, which keeps its output under control.

We use the tokenizer to convert the stop strings into token IDs, and `torch.LongTensor` to hold them as 64-bit integers so they can be compared directly with the model's `input_ids`.

In [14]:
stop_list = ['\nHuman:', '\n```\n']

# stop_token_ids in list
stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]

# display a sample
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

In [15]:
from torch import LongTensor

# stop_token_ids in tensor
stop_token_ids = [LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

In [16]:
from transformers import StoppingCriteria, StoppingCriteriaList
from torch import LongTensor, FloatTensor, eq

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: LongTensor, scores: FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
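
The check inside `StopOnTokens` is a suffix comparison, which can be sketched with plain lists. One caveat worth testing on your own setup: the token lists shown above start with id `1`, which we assume here is the beginning-of-sequence token added by the tokenizer. Generated text is unlikely to end with that token, so a stop sequence that keeps it may never match. The helper `strip_bos` below is hypothetical and only illustrates one way to handle this.

```python
BOS_ID = 1  # assumption: beginning-of-sequence token id, as the outputs above suggest


def strip_bos(ids):
    """Drop a leading BOS id so the stop sequence can match generated text."""
    return ids[1:] if ids and ids[0] == BOS_ID else ids


def ends_with(generated_ids, stop_ids):
    """True when the generated ids end with the stop sequence."""
    if not stop_ids or len(stop_ids) > len(generated_ids):
        return False
    return generated_ids[-len(stop_ids):] == stop_ids


def should_stop(generated_ids, stop_sequences):
    """True when any stop sequence matches the tail of the generated ids."""
    return any(ends_with(generated_ids, stop) for stop in stop_sequences)


raw_stops = [[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]
stop_sequences = [strip_bos(s) for s in raw_stops]

print(should_stop([306, 626, 29871, 13, 29950, 7889, 29901], stop_sequences))  # True
print(should_stop([306, 626], stop_sequences))                                  # False
```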

# Load documents

Imagine we want to know more about LLM agents, not from general knowledge alone but focused on a particular post at https://lilianweng.github.io.

We use `WebBaseLoader` and Beautiful Soup to scrape the content of that post.

In [17]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

In [18]:
# sample on document retrieved
print('docs sample:\nlen:',len(docs[0].page_content),'\n-----',docs[0].page_content[:500])

docs sample:
len: 42824 
----- 

      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


# Split documents

There are many strategies for feeding information to the model without overwhelming it with unused information.

The first thing we are going to do is use `RecursiveCharacterTextSplitter` to split the document into contiguous chunks that overlap slightly, so that context is not lost at chunk boundaries.
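
The effect of chunk size and overlap can be sketched with a fixed-size sliding window. This is a simplification under our own assumptions: the real splitter also tries to break on separators such as paragraphs and sentences, and `split_with_overlap` is a hypothetical helper, not a LangChain function.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)

for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk repeats the last four characters of the previous one, and `start_index` records where the chunk begins in the original text.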

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True    # stores each chunk's starting character offset as "start_index" metadata
)
all_splits = text_splitter.split_documents(docs)

In [20]:
# sample of how the document was split
print(actual_doc := all_splits[0].page_content,all_splits[0].metadata,'\nlen:',len(actual_doc),sep='\n')

LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory
{'source': 'https://lilianweng

# Embeddings

Now we want to capture the meaning of each chunk. For this we rely on embeddings: an embedding model assigns a numeric vector (the embedding) to a sequence of words. These vectors live in an embedding space where texts with similar meanings end up close to each other.

We are going to use an embedding model from the Hugging Face Hub.
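
"Close to each other" is usually measured with cosine similarity. The sketch below uses tiny hand-written vectors as hypothetical embeddings (real ones from a sentence-transformers model have hundreds of dimensions) just to show the comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# hypothetical 3-dimensional embeddings, invented for illustration only
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
}

sim_cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```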

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

hf_embedding = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={"device": "cuda"}
  )

# Vector store

Now that we have the document split into chunks and an embedding model, we store each chunk together with its embedding in a vector store, which can be queried quickly later to find the relevant chunks.

We are using `Chroma` to create this vector store from the documents we gathered.
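
Conceptually, a vector store keeps `(embedding, text)` pairs and ranks them against the embedding of a query. The sketch below is a toy in-memory version under our own assumptions: `TinyVectorStore` and the keyword-count `embed` function are hypothetical stand-ins, not how Chroma or the sentence-transformers model work internally.

```python
import math

KEYWORDS = ["agent", "memory", "planning", "tool"]  # hypothetical toy vocabulary


def embed(text):
    """Toy embedding: count how often each keyword appears in the text."""
    lowered = text.lower()
    return [lowered.count(word) for word in KEYWORDS]


def cosine(a, b):
    """Cosine similarity that returns 0.0 for all-zero vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class TinyVectorStore:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.items = []  # list of (vector, text) pairs

    def add_texts(self, texts):
        for text in texts:
            self.items.append((self.embed_fn(text), text))

    def similarity_search(self, query, k=2):
        """Return the k stored texts whose vectors are most similar to the query."""
        query_vec = self.embed_fn(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = TinyVectorStore(embed)
store.add_texts([
    "Planning breaks a task into subgoals",
    "Memory stores past observations",
    "Tool use lets the agent call APIs",
])
results = store.similarity_search("How does the agent handle planning?", k=2)
print(results)
```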

In [22]:
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=hf_embedding)

In [23]:
# that's how the first split looks like in our vector store
first_vector = vectorstore.get()['ids'][0]
vectorstore.get(first_vector)

{'ids': ['a236b272-c47a-11ee-b55b-0242ac1c000c'],
 'embeddings': None,
 'metadatas': [{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
   'start_index': 8}],
 'documents': ['LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agen

In [51]:
# the embedding (first 5 values) of the first paragraph of that split; this numeric form is what similarity search compares
print(hf_embedding.embed_documents(vectorstore.get(first_vector)['documents'][0].split('\n\n'))[0][:5],'...')

[0.014027159661054611, 0.007987729273736477, 0.005489276722073555, -0.016271350905299187, -0.02559383027255535] ...


# Use Cases on document search or QA

## Query relevant documents using Retriever (QA without LLM)

In this case we look at what the retriever alone returns when it searches all the information provided. The retriever returns the stored chunks most similar to the query; since we know these documents contain the answer, the result is reliable, but it is raw source text rather than a direct, contextualized answer.
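
The retriever in this notebook is configured with MMR (maximal marginal relevance) search, which balances relevance to the query against redundancy among the chunks already picked. Here is a minimal sketch of that selection rule in plain Python; the vectors are invented for illustration, and the function is our own simplified version rather than LangChain's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks by relevance only; lower values favour diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document (0.0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.3]
docs = [
    [1.0, 0.0],    # 0: relevant
    [0.99, 0.05],  # 1: relevant, near-duplicate of 0
    [0.6, 0.8],    # 2: less relevant, but different
]

print(mmr(query, docs, k=2, lambda_mult=0.5))  # [1, 2] -> skips the near-duplicate
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [1, 0] -> pure relevance ranking
```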

In [25]:
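# settings for retriever
similarity_top_k = 6
mmr_threshold = 0.7
# note: `score_threshold` belongs to search_type='similarity_score_threshold';
# MMR search is tuned through `k`, `fetch_k` and `lambda_mult` instead
search_kwargs = {
    'score_threshold' : mmr_threshold,
    'k' : similarity_top_k
    }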

retriever = vectorstore.as_retriever(search_type='mmr', search_kwargs = search_kwargs)

In [26]:
# example how it returns
retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")

In [27]:
print('docs retrieved:', len(retrieved_docs),'\nsample content',retrieved_docs[0].page_content)

docs retrieved: 6 
sample content Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.


## Answer using LLM text generation without RAG

In [28]:
from langchain_core.prompts import PromptTemplate

#build the template to prompt
template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, don't make the answer and say you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:"""

# build prompt
prompt = PromptTemplate.from_template(template)

In [29]:
# example of message
example_messages = prompt.invoke(
    {"context": "all possible context here", "question": "your question here"}
).to_messages()

print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, don't make the answer and say you don't know. Use three sentences maximum and keep the answer concise.
Question: your question here
Context: all possible context here
Answer:


In this case we see an LLM trying to answer without any document provided as context. It gives explanations and examples, but since it does not know what we are interested in, the answer does not fulfill our needs.

In [30]:
from langchain_community.llms import HuggingFacePipeline

temperature = 0.2
max_new_tokens = 512
repetition_penalty = 1.1

transformers_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=temperature,  # 'randomness' of outputs; lower values give more deterministic text
    max_new_tokens=max_new_tokens,  # max number of tokens to generate in the output
    repetition_penalty=repetition_penalty  # without this output begins repeating
)

llm = HuggingFacePipeline(pipeline=transformers_pipeline)

In [40]:
output_wo_rag = llm.invoke('What is Task Decomposition?')

In [42]:
#LLM response
output_wo_rag

'\n nobody likes to do tasks that are too big or overwhelming. So, how can we break down these tasks into smaller, more manageable pieces? This is where task decomposition comes in. Task decomposition is the process of breaking down a large task into smaller, more manageable sub-tasks. By doing so, you can make the task feel less overwhelming and more achievable. In this article, we will explore what task decomposition is, why it\'s important, and how to apply it to your work and personal life. What is Task Decomposition? Task decomposition is the process of breaking down a complex task into smaller, more manageable parts. It involves identifying the individual steps or components of the task and then prioritizing and organizing them in a way that makes sense for the project or task at hand. The goal of task decomposition is to create a clear and actionable plan that can help you complete the task more efficiently and effectively. Why is Task Decomposition Important? Task decomposition

## Answer using RAG and LLM text generation

As the last case shows, the model was able to answer, but with a long, generic answer. Ideally we would get this kind of natural-sounding answer combined with context about the subject. That is where RAG comes in to improve the model's answers.
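
At its core, the RAG step places the retrieved text into the `{context}` slot of the prompt. A common variation is to join the chunks' text into one string before filling the template. The sketch below shows that idea with plain Python; `Doc` is a hypothetical stand-in for a LangChain document, and `format_docs` and `build_rag_prompt` are illustrative helpers that do not come from the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


TEMPLATE = "Question: {question}\nContext: {context}\nAnswer:"


def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_rag_prompt(question, docs):
    """Fill the template with the question and the retrieved context."""
    return TEMPLATE.format(question=question, context=format_docs(docs))


retrieved = [
    Doc("Task decomposition can be done by LLM with simple prompting."),
    Doc("Tree of Thoughts extends CoT by exploring multiple reasoning possibilities."),
]
rag_prompt = build_rag_prompt("What is Task Decomposition?", retrieved)
print(rag_prompt)
```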

In [32]:
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser

chain = LLMChain(llm=llm, prompt=prompt, output_key='response')

In [33]:
output = chain.invoke({
    'question': (question := 'What is Task Decomposition?'),
    'context': retriever.invoke(question)
})

In [34]:
def pretty_print_response(output):
    print(f"Q: {output['question']}",
          f"\nA:{output['response']}\n",
          '-'*100,
          'DOCUMENTS SEARCHED:',
          '\n'.join(
              [f'--{i+1}:\nlink:{cont.metadata["source"]}\n{cont.page_content}\n' for i,cont in enumerate(output['context'])]
              ),
          sep='\n')

Now we have the ideal case: an LLM that answers in natural language while narrowing the answer to the context we provided.

In [35]:
pretty_print_response(output)

Q: What is Task Decomposition?

A: Task decomposition is the process of breaking down a complex task into smaller, more manageable parts. This can involve identifying the individual steps involved in completing the task, as well as determining the dependencies between those steps. By decomposing a task in this way, it becomes easier to understand and execute the task, as well as to identify any potential challenges or obstacles that may arise during completion.

----------------------------------------------------------------------------------------------------
DOCUMENTS SEARCHED:
--1:
link:https://lilianweng.github.io/posts/2023-06-23-agent/
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a c

-----
-----


We have now seen the different use cases for LLMs and retrievers on their own, and also the one highlighted in this exercise, which combines these two powerful tools to build, for example, a more accurate QA bot or document search engine.