<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/08_RAG.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 08. RAG

## Overview  
In this exercise, we will explore Retrieval-Augmented Generation (RAG) using the Solar framework. RAG combines retrieval-based techniques with generative models to improve the relevance and accuracy of generated responses by leveraging external knowledge sources. This notebook will guide you through implementing RAG and demonstrating its benefits in enhancing model outputs.

## Purpose of the Exercise
The purpose of this exercise is to integrate Retrieval-Augmented Generation into the Solar framework. By the end of this tutorial, users will understand how to use RAG to access external information and generate more informed and contextually accurate responses, thereby improving the performance and reliability of the language model.



## RAG: Retrieval Augmented Generation
- Large language models (LLMs) have a limited context size, so a long document cannot always be passed to the model in full.
- Not all context is relevant to a given question.
- RAG (Retrieval Augmented Generation) combines an LLM with retrieval: only the passages most relevant to the query are retrieved and handed to the model.
- Query → Retrieve (Search) → Results → (LLM) → Answer
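The Query → Retrieve → Results → LLM → Answer flow can be sketched in plain Python before any library is involved. This is a minimal illustration, not the notebook's implementation: `retrieve`, `build_prompt`, `rag_answer`, and `echo_llm` are hypothetical names, the retriever is a toy word-overlap ranker, and `echo_llm` simply returns its prompt in place of a real model call.

```python
from typing import Callable, List


def retrieve(query: str, passages: List[str], top_k: int = 2) -> List[str]:
    """Rank passages by how many query words they share (toy keyword search)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.lower().split())), p) for p in passages]
    # Keep only passages with at least one shared word, best first.
    matching = [item for item in scored if item[0] > 0]
    ranked = sorted(matching, key=lambda item: item[0], reverse=True)
    return [passage for _, passage in ranked[:top_k]]


def build_prompt(query: str, context: List[str]) -> str:
    """Put the question and the retrieved passages into one prompt string."""
    joined = "\n".join(context)
    return f"Question: {query}\n---\nContext: {joined}"


def rag_answer(query: str, passages: List[str], llm: Callable[[str], str], top_k: int = 2) -> str:
    """Retrieve first, then let the model answer from the retrieved context only."""
    context = retrieve(query, passages, top_k)
    return llm(build_prompt(query, context))


def echo_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call: returns the prompt unchanged.
    return prompt


passages = [
    "Bug classification labels software changes as clean or buggy.",
    "The weather in Seoul is mild in spring.",
    "A classifier needs training data from the project history.",
]
answer = rag_answer("what is bug classification", passages, echo_llm, top_k=1)
print(answer)
```

Only the first passage reaches the model here, which is the point of the retrieval step: the prompt stays small and on topic.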



In [None]:
! pip3 install -qU markdownify langchain-upstage rank_bm25 python-dotenv tokenizers transformers beautifulsoup4 langchain-community

In [2]:
# @title set API key
from pprint import pprint
import os

import warnings

warnings.filterwarnings("ignore")

if "google.colab" in str(get_ipython()):
    # Running in Google Colab. Please set the UPSTAGE_API_KEY in the Colab Secrets
    from google.colab import userdata

    os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
else:
    # Running locally. Please set the UPSTAGE_API_KEY in the .env file
    from dotenv import load_dotenv

    load_dotenv()

assert (
    "UPSTAGE_API_KEY" in os.environ
), "Please set the UPSTAGE_API_KEY environment variable"

In [6]:
from langchain_upstage import UpstageDocumentParseLoader


layzer = UpstageDocumentParseLoader("pdfs/kim-tse-2008.pdf", output_format="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [7]:
from IPython.display import display, HTML

display(HTML(docs[0].page_content[:1000]))

In [8]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage(model="solar-pro")

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most correct answer from the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [66]:
# Example of "Large language models (LLMs) have a limited context size."
chain.invoke({"question": "What is bug classficiation?", "Context": docs})

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4096 tokens. However, your messages resulted in 52377 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
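The request fails because the whole parsed document is far larger than the model's context window. A pre-flight size check makes that visible before the API call. The sketch below is an assumption-laden illustration: `count_tokens` and `fits_in_context` are hypothetical helpers, the 4096 limit is taken from the error message above, and whitespace splitting is only a crude stand-in for the model's real tokenizer (which usually produces more tokens than words).

```python
MAX_CONTEXT_TOKENS = 4096  # limit quoted in the error message above


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def fits_in_context(prompt: str, reserved_for_answer: int = 512) -> bool:
    """Return True if the prompt plus room for the answer fits in the window."""
    return count_tokens(prompt) + reserved_for_answer <= MAX_CONTEXT_TOKENS


short_prompt = "What is bug classification?"
long_prompt = "word " * 52377  # roughly the size reported in the error

print(fits_in_context(short_prompt))  # True
print(fits_in_context(long_prompt))   # False
```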

In [67]:
from transformers import AutoTokenizer
from langchain.text_splitter import TokenTextSplitter

solar_tokenizer = AutoTokenizer.from_pretrained("upstage/solar-pro-preview-instruct")

token_splitter = TokenTextSplitter.from_huggingface_tokenizer(
    solar_tokenizer, chunk_size=250, chunk_overlap=100
)

splits = token_splitter.split_documents(docs)
print(len(splits))

177
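With `chunk_size=250` and `chunk_overlap=100`, consecutive chunks share 100 tokens, so each new chunk starts 150 tokens after the previous one. The sketch below illustrates that sliding-window idea on a plain list of tokens. It is a simplified model of overlap-based splitting, not necessarily the exact logic inside `TokenTextSplitter`, and `split_tokens` is a hypothetical helper.

```python
from typing import List


def split_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window already reached the end of the text
        start += step
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = split_tokens(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.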


In [93]:
from langchain_community.retrievers import BM25Retriever

retriever = BM25Retriever.from_documents(splits)
retriever.invoke("What is bug classficiation?")[0]

Document(metadata={'total_pages': 16, 'coordinates': [[{'x': 0.0396, 'y': 0.0561}, {'x': 0.5623, 'y': 0.0561}, {'x': 0.5623, 'y': 0.0688}, {'x': 0.0396, 'y': 0.0688}], [{'x': 0.8614, 'y': 0.0571}, {'x': 0.8826, 'y': 0.0571}, {'x': 0.8826, 'y': 0.0675}, {'x': 0.8614, 'y': 0.0675}], [{'x': 0.1885, 'y': 0.0897}, {'x': 0.7413, 'y': 0.0897}, {'x': 0.7413, 'y': 0.1589}, {'x': 0.1885, 'y': 0.1589}], [{'x': 0.1016, 'y': 0.1696}, {'x': 0.824, 'y': 0.1696}, {'x': 0.824, 'y': 0.1877}, {'x': 0.1016, 'y': 0.1877}], [{'x': 0.0787, 'y': 0.2091}, {'x': 0.8515, 'y': 0.2091}, {'x': 0.8515, 'y': 0.356}, {'x': 0.0787, 'y': 0.356}], [{'x': 0.0785, 'y': 0.3693}, {'x': 0.8468, 'y': 0.3693}, {'x': 0.8468, 'y': 0.3972}, {'x': 0.0785, 'y': 0.3972}], [{'x': 0.4545, 'y': 0.3997}, {'x': 0.4711, 'y': 0.3997}, {'x': 0.4711, 'y': 0.417}, {'x': 0.4545, 'y': 0.417}], [{'x': 0.0391, 'y': 0.4394}, {'x': 0.1883, 'y': 0.4394}, {'x': 0.1883, 'y': 0.4553}, {'x': 0.0391, 'y': 0.4553}], [{'x': 0.04, 'y': 0.4603}, {'x': 0.4557,
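BM25 is a keyword-based ranking function: a chunk scores higher when it contains query terms that are rare across the collection, with a correction for chunk length. The sketch below implements a common textbook variant of the formula in plain Python to show the mechanics. The tokenization (lowercase, whitespace split), the `k1` and `b` values, and the IDF form are assumptions and may differ from what `BM25Retriever` does internally; `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with a textbook BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "bug classification labels each change as clean or buggy",
    "the classifier is trained on the project history",
    "feature requests are tracked in the same system",
]
exact = bm25_scores("bug classification", docs)
misspelled = bm25_scores("bug classficiation", docs)  # second term matches no document
print(exact)
print(misspelled)
```

Because matching is exact, the misspelled term adds nothing to any score, so the first document is found through "bug" alone.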

In [99]:
# Query 1: using the whole page_content (raw HTML) of each retrieved chunk
query = "What is bug classficiation?"
context_docs = retriever.invoke(query)

context_pagecontent = ""
for context_doc in context_docs:
    context_pagecontent += context_doc.page_content
chain.invoke({"question": query, "Context": context_pagecontent})

'Bug classification is the process of classifying software changes as either clean or buggy, and it requires about 100 changes to train a project-specific classification model before the predictive accuracy achieves a "usable" level of accuracy. The information is present in the context.'

In [94]:
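# Query 2: using the whole page_content (raw HTML) of each retrieved chunk
# Note: the output recorded below came from a run in which the context string
# passed to the chain was empty (the loop filled a differently named variable).
query = "How do find the bug?"
context_docs = retriever.invoke(query)

context_pagecontent = ""
for context_doc in context_docs:
    context_pagecontent += context_doc.page_content

chain.invoke({"question": query, "Context": context_pagecontent})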

'The information is not present in the context.'

# Exercise
It seems keyword search is not the best for LLM queries. What are some alternatives?
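One family of alternatives ranks passages by vector similarity instead of exact keyword overlap. Real systems typically use learned text embeddings for this. The sketch below swaps in character-trigram counts as a crude, dependency-free stand-in, only to show the ranking mechanics (vectorize, then compare with cosine similarity). `trigram_vector`, `cosine`, and `rank_by_similarity` are hypothetical helpers, not part of any library used in this notebook.

```python
import math
from collections import Counter
from typing import List


def trigram_vector(text: str) -> Counter:
    """Represent text as counts of its overlapping 3-character substrings."""
    padded = f" {text.lower()} "
    return Counter(padded[i : i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_by_similarity(query: str, passages: List[str]) -> List[str]:
    """Return passages ordered from most to least similar to the query."""
    q = trigram_vector(query)
    return sorted(passages, key=lambda p: cosine(q, trigram_vector(p)), reverse=True)


passages = [
    "weather forecast for seoul",
    "bug classification of software changes",
]
ranked = rank_by_similarity("bug classficiation", passages)
print(ranked[0])
```

Unlike exact keyword matching, the misspelled query still shares most of its trigrams with the relevant passage, so that passage ranks first.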

In [76]:
# Query 1: extracting the relevant text by parsing the HTML
from bs4 import BeautifulSoup

query = "What is bug classficiation?"
context_docs = retriever.invoke(query)

context = ""
for context_doc in context_docs:
    soup = BeautifulSoup(context_doc.page_content, "html.parser")
    # Keep only the text inside <p> elements, dropping tags and attributes.
    text_content = "".join(element.get_text() for element in soup.find_all("p"))
    context += text_content

chain.invoke({"question": query, "Context": context})

'Bug classification is the process of categorizing changes in software as bug-fixing changes or not. This is typically done using a classification technique, such as an SVM classifier, which requires about 100 changes to train a project-specific model before achieving a "usable" level of accuracy. In the context of bug tracking systems, one potential issue is that these systems are often used to record both bug reports and new feature additions, causing new feature changes to be identified as bug-fix changes. This can increase the number of changes flagged as bug fixes, and when using this algorithm, care needs to be taken to understand the meaning of changes identified as bugs.'

In [80]:
# Query 2: extracting the relevant text by parsing the HTML
from bs4 import BeautifulSoup

query = "How do find the bug?"
context_docs = retriever.invoke(query)

context = ""
for context_doc in context_docs:
    soup = BeautifulSoup(context_doc.page_content, "html.parser")
    # Keep only the text inside <p> elements, dropping tags and attributes.
    text_content = "".join(element.get_text() for element in soup.find_all("p"))
    context += text_content

chain.invoke({"question": query, "Context": context})

'To find the bug, Pan et al. use metrics computed over software slice data in conjunction with machine learning algorithms to find bug-prone software files or functions. They try to find faults in the whole code. Another approach focuses on file changes. One thread of research attempts to find buggy or clean code patterns in the history of development of a software project, like Williams and Hollingsworth, who use project histories to improve existing bug-finding tools and remove false positives from their approach.'
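The two cells above pass the model only the text found inside `<p>` elements, so headings, tables, and tag attributes no longer take up space in the context. The sketch below shows the same filtering on a small made-up HTML string. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; `ParagraphText` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # greater than 0 while inside a <p> element
        self.paragraphs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.paragraphs[-1] += data


html = (
    "<h1 id='1'>Title</h1>"
    "<p id='2'>Bug <b>classification</b> labels changes.</p>"
    "<table><tr><td>42</td></tr></table>"
    "<p>Second paragraph.</p>"
)
parser = ParagraphText()
parser.feed(html)
parser.close()
print(" ".join(parser.paragraphs))
```

Joining the paragraphs with a space keeps the last word of one paragraph from running into the first word of the next.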