<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/cookbooks/upstage/Solar-LLM-ZeroToAll/08_Emb_RAG.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## RAG: Retrieval Augmented Generation.
- Large language models (LLMs) have a limited context size.
- TLDR
- Not all context is relevant to a given question
- Query -> Search -> Results -> (LLM) -> Answer

## Keyword VS Semantic Search 
![Vector](https://blog.dataiku.com/hs-fs/hubfs/dftt%202.webp?width=1346&height=632&name=dftt%202.webp)

from https://blog.dataiku.com/semantic-search-an-overlooked-nlp-superpower

![Emb_search](figures/emb_search.png)

from https://sreent.medium.com/llms-embeddings-and-vector-search-d4bd9362df56

In [19]:
! pip3 install -qU  markdownify  langchain-upstage rank_bm25 python-dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [21]:
#@title set API key
from pprint import pprint
import os

import warnings
warnings.filterwarnings('ignore')

from IPython import get_ipython

upstage_api_key_env_name = 'UPSTAGE_API_KEY'
def load_env():
    if 'google.colab' in str(get_ipython()):
        # Running in Google Colab
        from google.colab import userdata
        upstage_api_key = userdata.get(upstage_api_key_env_name)
        return os.environ.setdefault('UPSTAGE_API_KEY', upstage_api_key)
    else:
        # Running in local Jupyter Notebook
        from dotenv import load_dotenv
        load_dotenv()
        return os.environ.get(upstage_api_key_env_name)

UPSTAGE_API_KEY = load_env()

# Most powerful solar embedding
![Solar Embedding](figures/solar_emb.jpeg)



In [22]:
from langchain_upstage import UpstageEmbeddings

embeddings_model = UpstageEmbeddings(model="solar-embedding-1-large")
embeddings = embeddings_model.embed_documents(
    [
        "What is the best season to visit Korea?",
    ]
)

len(embeddings), len(embeddings[0])
print(embeddings[0])

[0.001220703125, -0.0296478271484375, -0.0066680908203125, 0.00951385498046875, 0.017791748046875, -0.002902984619140625, -0.0029201507568359375, 0.00937652587890625, 0.027618408203125, -0.0127716064453125, 0.0158843994140625, -0.01288604736328125, 0.01329803466796875, 0.00307464599609375, 0.00429534912109375, 0.00208282470703125, -0.0011320114135742188, -0.0105743408203125, 0.00218963623046875, -0.00579071044921875, -0.021270751953125, -0.0046234130859375, 0.007144927978515625, 0.01387786865234375, 0.00615692138671875, -0.0020961761474609375, -0.004512786865234375, 0.03021240234375, -0.02606201171875, 0.005687713623046875, 0.0028438568115234375, 0.003803253173828125, -0.005588531494140625, -0.01241302490234375, 0.003543853759765625, 0.01198577880859375, 0.0095367431640625, 0.01251220703125, -0.003719329833984375, -0.0097198486328125, 0.0021915435791015625, -0.02191162109375, 0.00269317626953125, 0.0006132125854492188, 0.0113372802734375, 0.016693115234375, 0.0204620361328125, 0.030212

In [23]:
from langchain_chroma import Chroma
from langchain_upstage import UpstageEmbeddings
from langchain.docstore.document import Document

sample_text_list = [
    "Korea is a beautiful country to visit in the spring.",
    "The best time to visit Korea is in the fall.",
    "Best way to find bug is using unit test.",
    "Python is a great programming language for beginners.",
    "Sung Kim is a great teacher.",
    "맛있는 좋은 과일을 많이 먹어 볼까?"
]

sample_docs = [Document(page_content=text) for text in sample_text_list]

vectorstore = Chroma.from_documents(
    documents=sample_docs,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
)

retriever = vectorstore.as_retriever()



In [24]:
result_docs = retriever.invoke("How to find problems in code?")
print(result_docs[0].page_content[:100])


Best way to find bug is using unit test.


In [25]:
result_docs = retriever.invoke("When to visit Korea?")
print(result_docs[0].page_content[:100])

The best time to visit Korea is in the fall.


In [26]:
result_docs = retriever.invoke("Who is a great prof?")
print(result_docs[0].page_content[:100])

<p id='77' data-category='paragraph' style='font-size:18px'>Yi Zhang received the MS and PhD degrees


In [27]:
result_docs = retriever.invoke("좋은 선생님")
print(result_docs[0].page_content[:100])

Sung Kim is a great teacher.


In [28]:
# RAG 1. load doc (done), 2. chunking, splits, 3. embeding - indexing, 4. retrieve

In [29]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [30]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 132


In [31]:
from langchain_chroma import Chroma

# 3. Embed & indexing
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
)

In [32]:
# 4. retrive
retriever = vectorstore.as_retriever()
result_docs = retriever.invoke("What is Bug Classification?")
print(len(result_docs))
print(result_docs[0].page_content[:100])

4
<p id='49' data-category='paragraph' style='font-size:16px'>Similar in spirit to change classificati


In [33]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide most correct answer from the following context. 
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [34]:
chain.invoke({"question": "What is bug classficiation?", "context": result_docs})

'Bug classification is a process where keywords in bug reports or change requests are extracted and used as features to train a machine learning classifier. The goal of the classification is to place a bug report into a specific category or to find the developer best suited to fix a bug.'

In [39]:
# Put themm all together
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup_and_retrieval = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
)

rag_chain = setup_and_retrieval | prompt_template | llm | StrOutputParser()

rag_chain.invoke("What is bug classification?")

'The information is not present in the context.'

# Excercise: Hybrid
Sometimes keyword search can be useful. Design a system that does keyword and semantic search, then combine the results. Use them as context for your RAG.