<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/07_LA_CAG.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 07. LA CAG (Credibility-Aware Generation)

## Overview  
In this exercise, we combine Layout Analysis (LA) with Credibility-Aware Generation (CAG) using Solar. The notebook parses a PDF into HTML with the Upstage document parser, splits and embeds the result into a Chroma vector store, and then answers questions about the document's tables from the retrieved context.

## Purpose of the Exercise
The purpose of this exercise is to ground generated answers in the parsed layout of a document, including its tables. By the end of this tutorial, you will be able to convert a PDF into structured HTML, index it for retrieval, and use the retrieved context to generate answers with Solar.


# No.1 accuracy in multiform table extraction
- Convert documents to maximize RAG performance
- LangChain provides powerful tools for text splitting and vectorization


![Layout Analyzer](https://github.com/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/figures/la.png?raw=1)

In [None]:
! pip3 install -qU markdownify langchain-upstage requests python-dotenv langchain-chroma langchain-text-splitters

In [3]:
# @title set API key
from pprint import pprint
import os

import warnings

warnings.filterwarnings("ignore")

if "google.colab" in str(get_ipython()):
    # Running in Google Colab. Please set the UPSTAGE_API_KEY in the Colab Secrets
    from google.colab import userdata

    os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
else:
    # Running locally. Please set the UPSTAGE_API_KEY in the .env file
    from dotenv import load_dotenv

    load_dotenv()

assert (
    "UPSTAGE_API_KEY" in os.environ
), "Please set the UPSTAGE_API_KEY environment variable"

![Solar sample document](https://github.com/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/figures/solar_sample.png?raw=1)

In [6]:
from langchain_upstage import UpstageDocumentParseLoader


loader = UpstageDocumentParseLoader("pdfs/solar_sample.pdf", output_format="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = loader.load()

In [76]:
from IPython.display import display, HTML
display(HTML(docs[0].page_content[:5000]))

| Model | Size | Type | H6 (Avg.) | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SOLAR 10.7B-Instruct | 11B ⇠ | Alignment-tuned | 74.20 | 71.08 | 88.16 | 66.21 | 71.43 | 83.58 | 64.75 |
| Qwen 72B | 72B ⇠ | Pretrained | 73.60 | 65.19 | 85.94 | 77.37 | 60.19 | 82.48 | 70.43 |
| Mixtral 8x7B-Instruct-v0.1 | 47B ⇠ | Instruction-tuned | 72.62 | 70.22 | 87.63 | 71.16 | 64.58 | 81.37 | 60.73 |
| Yi 34B-200K | 34B ⇠ | Pretrained | 70.81 | 65.36 | 85.58 | 76.06 | 53.64 | 82.56 | 61.64 |
| Yi 34B | 34B ⇠ | Pretrained | 69.42 | 64.59 | 85.69 | 76.35 | 56.23 | 83.03 | 50.64 |
| Mixtral 8x7B-v0.1 | 47B ⇠ | Pretrained | 68.42 | 66.04 | 86.49 | 71.82 | 46.78 | 81.93 | 57.47 |
| Llama 2 70B | 70B ⇠ | Pretrained | 67.87 | 67.32 | 87.33 | 69.83 | 44.92 | 83.74 | 54.06 |
| Falcon 180B | 180B ⇠ | Pretrained | 67.85 | 69.45 | 88.86 | 70.50 | 45.47 | 86.90 | 45.94 |
| SOLAR 10.7B | 11B | Pretrained | 66.04 | 61.95 | 84.60 | 65.48 | 45.04 | 83.66 | 55.50 |

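Because the loader was asked for `output_format="html"`, the table above keeps its row and cell structure, and that structure can be recovered with the standard library alone. The following is a minimal sketch under stated assumptions: `sample_html` is a hand-written fragment that reuses a few values from the table, not the parser's actual output, and `TableParser` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Illustrative fragment only; the real page_content is larger and may differ.
sample_html = """
<table>
  <tr><th>Model</th><th>H6 (Avg.)</th><th>MMLU</th></tr>
  <tr><td>SOLAR 10.7B-Instruct</td><td>74.20</td><td>66.21</td></tr>
  <tr><td>SOLAR 10.7B</td><td>66.04</td><td>65.48</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)

# Pair every body row with the header so values can be looked up by column name.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records[1]["MMLU"])
```

Looking values up by column name, rather than by position in a long string, is one way to check an LLM's answer against the parsed table.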

In [101]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 10

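The `chunk_size` and `chunk_overlap` arguments control how much text each split holds and how much neighbouring splits share. The sketch below illustrates the idea with a plain sliding window over characters. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which additionally tries to break at language-specific separators; `sliding_window_chunks` is a hypothetical helper written for this illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a value that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.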

In [108]:
from langchain_chroma import Chroma
from langchain_upstage import UpstageEmbeddings

# Initialize the vector store and the retriever
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=UpstageEmbeddings(model="solar-embedding-1-large"),
)

retriever = vectorstore.as_retriever()

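A retriever built from a vector store returns the chunks whose embeddings lie closest to the query embedding. The sketch below ranks chunks by cosine similarity, one common choice; the metric and index that Chroma actually uses depend on its configuration and may differ. The three-dimensional vectors are invented for illustration and do not come from `solar-embedding-1-large`, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings", invented for illustration only.
indexed = [
    ("chunk about benchmark tables", [0.9, 0.1, 0.0]),
    ("chunk about model architecture", [0.1, 0.9, 0.1]),
    ("chunk about training data", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]

results = top_k(query_vector, indexed, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```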
In [111]:
# Flatten metadata so every value is a str, int, float, or bool before adding documents to Chroma
def simplify_metadata(metadata):
    simplified_metadata = {}
    for key, value in metadata.items():
        if isinstance(value, (str, int, float, bool)):
            simplified_metadata[key] = value
        else:
            simplified_metadata[key] = str(value)  # Convert complex data types to string
    return simplified_metadata

In [115]:
for split in splits:
    split.metadata = simplify_metadata(split.metadata)
vectorstore.add_documents(splits)

['9eefa0e1-ed74-4657-909d-6bdf32089564',
 '1921d0e2-8e18-4c08-8bd3-e1b86e7850e4',
 '6ce7fa74-99bd-4918-b2f5-ac884b6c1f9f',
 '1d844f00-3e74-4487-83a0-75f5af731825',
 '9546d91c-ea16-45f5-8830-3e542f4bdc93',
 'c402bf57-44f2-4d36-b5fc-ac6501d8abf5',
 'd46f0c01-0e16-4918-b67e-5a0a3c27e014',
 'f5153594-3ce3-4392-b5cc-85664f3a0e82',
 '72619ac5-4fea-496c-8977-f1015f9a9a76',
 '53d5a656-3c15-41d9-b575-ba7cc7b2af64']

In [116]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage(model="solar-pro")

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most accurate answer based on the following context.
    Think step by step and look at the HTML tags and table values carefully to provide the most accurate answer.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [121]:
question = "Explain Table 2?"
search_result = retriever.invoke(question)

# Join the retrieved chunks into one context string, separated by blank lines
context = "\n\n".join(result.page_content for result in search_result)

chain.invoke({"question": question, "Context": context})


'Table 2 shows the evaluation results of different language models in the Open LLM Leaderboard. The results include scores for six tasks mentioned in Section 4.1 of the source document, as well as the H6 score (average of six tasks). The table also reports the size of the models in units of billions of parameters and the type of the model, which indicates the training stage and can be chosen from {Pretrained, Instruction-tuned, Alignment-tuned}. Models based on SOLAR 10.7B are colored purple. The best scores for H6 and the individual tasks are shown in bold. The information is not present in the context.'

In [126]:
question = "What is the MMLU score of SOLAR 10.7B?"
search_result = retriever.invoke(question)

# Join the retrieved chunks into one context string, separated by blank lines
context = "\n\n".join(result.page_content for result in search_result)

chain.invoke({"question": question, "Context": context})


'The MMLU score of SOLAR 10.7B is 65.48.\n\n---'

In [125]:
question = "What is the MMLU score of Mistral 7B-Instruct-v0.2?"
search_result = retriever.invoke(question)

# Join the retrieved chunks into one context string, separated by blank lines
context = "\n\n".join(result.page_content for result in search_result)

chain.invoke({"question": question, "Context": context})


'The MMLU score of Mistral 7B-Instruct-v0.2 is 60.78.'

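The three question cells above repeat the same steps: retrieve, concatenate, generate. A small helper can wrap them. This is a sketch, not part of the original notebook: `ask` is a hypothetical name, and `FakeDocument`, `FakeRetriever`, and `FakeChain` are stand-ins that exist only so the snippet runs without an API key. In the notebook you would pass the real `retriever` and `chain` objects instead.

```python
from dataclasses import dataclass


def ask(question, retriever, chain):
    """Retrieve context for a question, then run the prompt chain on it."""
    documents = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in documents)
    # The key "Context" matches the {Context} placeholder in the prompt template.
    return chain.invoke({"question": question, "Context": context})


# --- Stand-ins for illustration only; not part of LangChain or Upstage. ---
@dataclass
class FakeDocument:
    page_content: str


class FakeRetriever:
    def invoke(self, query):
        return [FakeDocument("<td>SOLAR 10.7B</td>"), FakeDocument("<td>65.48</td>")]


class FakeChain:
    def invoke(self, inputs):
        # Echo the inputs so the wiring between the two steps is visible.
        return f"Q: {inputs['question']} | context chars: {len(inputs['Context'])}"


answer = ask("What is the MMLU score of SOLAR 10.7B?", FakeRetriever(), FakeChain())
print(answer)
```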
# Exercise
Sometimes, even if we provide a table in Markdown or HTML format, the Large Language Model (LLM) may not extract the information correctly. How can you fix this issue?

Hint: Consider chain-of-thought (CoT) prompting, few-shot examples, or a divide-and-conquer strategy.

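One possible direction for the exercise, offered as a sketch rather than a reference solution: split the table into one short fact line per row (divide and conquer), then prepend a worked example that shows the reasoning steps (few-shot plus CoT). The helper names `rows_to_facts` and `build_prompt`, the "Model A" example, and the prompt wording are all assumptions made for illustration. The row values are taken from the table shown earlier.

```python
def rows_to_facts(header, rows):
    """Turn each table row into a 'Model: metric=value, ...' line."""
    facts = []
    for row in rows:
        name, *values = row
        pairs = ", ".join(f"{col}={val}" for col, val in zip(header[1:], values))
        facts.append(f"{name}: {pairs}")
    return facts


def build_prompt(facts, question):
    """Combine one worked example (few-shot + CoT) with the per-row facts."""
    example = (
        "Example\n"
        "Facts:\n"
        "Model A: MMLU=50.00, ARC=40.00\n"
        "Question: What is the ARC score of Model A?\n"
        "Reasoning: Find the line for Model A, then read the ARC value.\n"
        "Answer: 40.00\n"
    )
    return f"{example}\nFacts:\n" + "\n".join(facts) + f"\nQuestion: {question}\nReasoning:"


header = ["Model", "MMLU", "GSM8K"]
rows = [
    ["SOLAR 10.7B-Instruct", "66.21", "64.75"],
    ["SOLAR 10.7B", "65.48", "55.50"],
]

facts = rows_to_facts(header, rows)
prompt = build_prompt(facts, "What is the MMLU score of SOLAR 10.7B?")
print(prompt)
```

The resulting string could be sent to the model in place of the raw HTML context. Whether it improves accuracy on a given table is something to measure, not assume.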