<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/08_RAG.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 09. RAG with Embeddings

## Overview  
In this exercise, we explore Retrieval-Augmented Generation (RAG) using embeddings within the Solar framework. RAG pairs a retrieval step with a generative model so that answers are grounded in retrieved text rather than in the model's parameters alone. Embedding-based retrieval matches queries and documents by meaning instead of exact keywords, so the context passed to the model is more likely to be relevant. This notebook walks through the full pipeline: loading a PDF, splitting it into token-based chunks, embedding and indexing the chunks, and answering a question from the retrieved context.
 
## Purpose of the Exercise
The purpose of this exercise is to implement Retrieval-Augmented Generation with embeddings in the Solar framework. By the end of this tutorial, you will know how to index external documents with embeddings, retrieve the chunks most similar to a query, and pass them to the language model as context. Compared with simple keyword-based retrieval, this lets the model answer from passages that are semantically related to the question even when they do not share its exact wording.



## RAG with Embeddings: Enhanced Retrieval Augmented Generation
- Large language models (LLMs) have a limited context size
- Embeddings allow for semantic understanding of queries and documents
- Not all context is relevant to a given question
- Query → Embed → Retrieve (Semantic Search) → Results → (LLM) → Answer
- RAG with embeddings combines LLMs with semantic retrieval for more accurate and relevant information augmentation
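The Query → Embed → Retrieve flow listed above can be sketched offline with a toy example. This is only an illustration under loud assumptions: `embed` here is a bag-of-words counter standing in for a real embedding model, so it captures word overlap rather than meaning, and `toy_documents`, `embed`, `cosine_similarity`, and `retrieve` are hypothetical names that do not come from any library used in this notebook.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=2):
    """Embed the query, score every document, and return the k most similar ones."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Made-up documents for illustration only
toy_documents = [
    "bug classification predicts whether a software change is buggy or clean",
    "the weather today is sunny with a light breeze",
    "a classifier is trained on features from the revision history",
]

top = retrieve("what is bug classification", toy_documents, k=1)
print(top[0])
```

In the rest of the notebook, the embedding model and the vector store take over the roles of `embed` and `retrieve`, and the retrieved text is handed to the LLM to produce the answer.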

In [1]:
! pip3 install -qU markdownify langchain langchain-community langchain-upstage python-dotenv tokenizers transformers chromadb

In [2]:
# @title set API key
import os
import getpass
from pprint import pprint
import warnings

warnings.filterwarnings("ignore")

from IPython import get_ipython

if "google.colab" in str(get_ipython()):
    # Running in Google Colab. Please set the UPSTAGE_API_KEY in the Colab Secrets
    from google.colab import userdata
    os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
else:
    # Running locally. Please set the UPSTAGE_API_KEY in the .env file
    from dotenv import load_dotenv

    load_dotenv()

if "UPSTAGE_API_KEY" not in os.environ:
    os.environ["UPSTAGE_API_KEY"] = getpass.getpass("Enter your Upstage API key: ")


In [3]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader(
    "pdfs/kim-tse-2008.pdf", use_ocr=False, output_type="html"
)
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [4]:
from IPython.display import display, HTML

display(HTML(docs[0].page_content[:1000]))

In [5]:
from transformers import AutoTokenizer
from langchain.text_splitter import TokenTextSplitter

solar_tokenizer = AutoTokenizer.from_pretrained("upstage/solar-pro-preview-instruct")

token_splitter = TokenTextSplitter.from_huggingface_tokenizer(
    solar_tokenizer, chunk_size=250, chunk_overlap=100
)

splits = token_splitter.split_documents(docs)
print(len(splits))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


171
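To make the `chunk_size=250, chunk_overlap=100` setting concrete, the following sketch shows a sliding-window split over a list of tokens. It illustrates the general idea only; it is not claimed to be the exact algorithm inside `TokenTextSplitter`, and `split_with_overlap` and `fake_tokens` are hypothetical names introduced here.

```python
def split_with_overlap(tokens, chunk_size=250, chunk_overlap=100):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
        start += step
    return chunks


# Integers stand in for token ids (assumption: any sequence of tokens works the same way)
fake_tokens = list(range(600))
chunks = split_with_overlap(fake_tokens)
print(len(chunks), [len(c) for c in chunks])

# Consecutive chunks share their boundary tokens
print(chunks[0][-100:] == chunks[1][:100])
```

A larger overlap lowers the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of producing more chunks to embed and store.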


In [6]:

from langchain_upstage import UpstageEmbeddings
from langchain.vectorstores import Chroma

# Initialize Upstage embeddings
embeddings = UpstageEmbeddings(model="solar-embedding-1-large")

# Create Chroma vector store with Upstage embeddings
db = Chroma(embedding_function=embeddings)

# Add documents to the vector store
db.add_documents(splits)

# Define a retriever
retriever = db.as_retriever()



In [7]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most correct answer from the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [8]:
query = "What is bug classification?"
context_docs = retriever.invoke(query)
print(context_docs[0].page_content)

 bug classification in<br>that features (terms) are also extracted from the source code<br>and are then fed into classification or clustering algorithms.<br>These projects have goals other than predicting bugs,<br>including classifying software into broad functional cate-<br>gories [19], clustering related software project documents<br>[20], [28], and associating the source code with other<br>artifacts such as design documents [29].</p><br><p id='47' data-category='paragraph' style='font-size:16px'>Krovetz et al. use terms in the source code (as features)<br>and SVM to classify software projects into broad functional<br>categories such as communications, databases, games, and<br>math [19]. Their insight is that software projects in the same<br>category will share terms in their source code, thereby<br>permitting classification.</p><br><p id='48' data-category='paragraph' style='font-size:16px'>Research that categorizes or associates source code with<br>other documents (traceability rec

In [9]:
chain.invoke({"question": query, "Context": context_docs})

'Bug classification is a technique used in software engineering to predict the existence of bugs in software changes. It uses a machine learning classifier to determine whether a new software change is more similar to prior buggy changes or clean changes. The classifier is trained using features (in the machine learning sense) extracted from the revision history of a software project stored in its software configuration management repository. The trained classifier can classify changes as buggy or clean, with a certain accuracy and recall on average. Change classification has several desirable qualities, including the ability to predict the existence of bugs in software changes.'
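The call above passes the list of retrieved documents directly, so the prompt receives their string representation, including HTML markup and metadata. One possible refinement is to join only the cleaned text of each chunk before filling the prompt. The sketch below assumes a hypothetical `format_docs` helper and uses a `FakeDocument` stand-in so it runs offline; the regular expression is a rough heuristic that will not clean up tags cut in half by the token splitter.

```python
import re
from dataclasses import dataclass, field
from html import unescape


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document, used so this sketch runs offline."""
    page_content: str
    metadata: dict = field(default_factory=dict)


TAG_PATTERN = re.compile(r"<[^>]+>")


def format_docs(docs, separator="\n\n"):
    """Strip HTML tags from each chunk and join the cleaned text into one context string."""
    cleaned = []
    for doc in docs:
        text = TAG_PATTERN.sub(" ", doc.page_content)  # replace complete tags with a space
        text = unescape(text)                          # decode entities such as &amp;
        text = " ".join(text.split())                  # collapse runs of whitespace
        cleaned.append(text)
    return separator.join(cleaned)


# Made-up sample chunks in the same HTML style as the loader output
sample_docs = [
    FakeDocument("<p id='47'>Krovetz et al. use terms in the source code<br>and SVM.</p>"),
    FakeDocument("<p>Bug classification &amp; prediction</p>"),
]
print(format_docs(sample_docs))
```

With the real retriever, the same helper could be used as `chain.invoke({"question": query, "Context": format_docs(context_docs)})`, which keeps the prompt shorter and free of markup.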

# Exercise
How would you evaluate the performance of RAG with embeddings?
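As one possible starting point, the retrieval step can be scored separately from generation with a small hand-labeled set of questions and the chunk ids that should be retrieved for each. The sketch below computes hit rate at k and mean reciprocal rank (MRR). All data in it is made up for illustration: the chunk ids, the questions, and the relevance labels are assumptions, not results from this notebook.

```python
def hit_rate_at_k(results, relevant_ids, k=4):
    """Fraction of queries with at least one relevant chunk among the top k results."""
    hits = 0
    for query, ranked_ids in results.items():
        if any(doc_id in relevant_ids[query] for doc_id in ranked_ids[:k]):
            hits += 1
    return hits / len(results)


def mean_reciprocal_rank(results, relevant_ids):
    """Average of 1/rank of the first relevant chunk (0 when none is retrieved)."""
    total = 0.0
    for query, ranked_ids in results.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids[query]:
                total += 1.0 / rank
                break
    return total / len(results)


# Hypothetical ranked chunk ids returned by a retriever for each question
toy_results = {
    "What is bug classification?": ["c12", "c03", "c40"],
    "Which classifier is used?": ["c07", "c12", "c18"],
    "What datasets were studied?": ["c01", "c02", "c05"],
}
# Hypothetical hand-labeled relevant chunk ids
toy_relevant = {
    "What is bug classification?": {"c12"},
    "Which classifier is used?": {"c18"},
    "What datasets were studied?": {"c30"},
}

print(hit_rate_at_k(toy_results, toy_relevant, k=3))
print(mean_reciprocal_rank(toy_results, toy_relevant))
```

Retrieval metrics like these only cover half of the pipeline. The generated answers would still need a separate check, for example comparing them against reference answers or verifying that each claim is supported by the retrieved context.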