<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/07_LA_CAG.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 07. LA CAG (Credibility-Aware Generation)

## Overview  
In this exercise, we combine Layout Analysis (LA) with Credibility-Aware Generation (CAG) using the Solar framework. The notebook converts a PDF into HTML with the layout analyzer, then asks a Solar model questions whose answers must come from that extracted content, including values inside tables.
 
## Purpose of the Exercise
The purpose of this exercise is to integrate Layout Analysis with Credibility-Aware Generation so that answers stay grounded in the source document. By the end of this tutorial, users will be able to extract document content (tables included), pass it to the model as context, and have the model either answer from that context or state that the information is not present.


# No.1 accuracy in multiform table extraction 
- Convert documents to maximize RAG performance 
- LangChain provides powerful tools for text splitting and vectorization


![Layout Analyzer](./figures/la.png)

In [1]:
! pip3 install -qU  markdownify  langchain-upstage==0.1.8rc0  requests  python-dotenv



In [2]:
# @title set API key
import os
import getpass
from pprint import pprint
import warnings

warnings.filterwarnings("ignore")

from IPython import get_ipython

if "google.colab" in str(get_ipython()):
    # Running in Google Colab. Please set the UPSTAGE_API_KEY in the Colab Secrets
    from google.colab import userdata
    os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
else:
    # Running locally. Please set the UPSTAGE_API_KEY in the .env file
    from dotenv import load_dotenv

    load_dotenv()

if "UPSTAGE_API_KEY" not in os.environ:
    os.environ["UPSTAGE_API_KEY"] = getpass.getpass("Enter your Upstage API key: ")


![Layout Analyzer](./figures/solar_sample.png)

In [3]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/solar_sample.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()
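
The comment in the cell above mentions `lazy_load`. A minimal sketch of consuming documents one at a time, assuming only that the loader exposes the standard LangChain `lazy_load()` iterator. `FakeLoader`, `FakeDocument`, `iter_page_lengths`, and the `"page"` metadata key are hypothetical stand-ins so the snippet runs without an API key; with the real loader you would pass `layzer` instead.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class FakeLoader:
    """Stand-in loader; the real one would be UpstageLayoutAnalysisLoader."""

    def __init__(self, pages: Iterable[str]):
        self._pages = list(pages)

    def lazy_load(self) -> Iterator[FakeDocument]:
        # Yield one document at a time instead of building a full list.
        for number, html in enumerate(self._pages, start=1):
            yield FakeDocument(html, {"page": number})


def iter_page_lengths(loader) -> Iterator[tuple]:
    """Consume documents one by one and report (page, character count)."""
    for doc in loader.lazy_load():
        yield doc.metadata.get("page"), len(doc.page_content)


page_lengths = list(iter_page_lengths(FakeLoader(["<p>a</p>", "<table></table>"])))
print(page_lengths)
```

Because each document is handled as soon as it is yielded, only one page needs to be held in memory at a time.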

In [4]:
from IPython.display import display, HTML

display(HTML(docs[0].page_content[:5000]))

| Model | Size | Type | H6 (Avg.) | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|---|---|
| SOLAR 10.7B-Instruct | ~11B | Alignment-tuned | 74.20 | 71.08 | 88.16 | 66.21 | 71.43 | 83.58 | 64.75 |
| Qwen 72B | ~72B | Pretrained | 73.60 | 65.19 | 85.94 | 77.37 | 60.19 | 82.48 | 70.43 |
| Mixtral 8x7B-Instruct-v0.1 | ~47B | Instruction-tuned | 72.62 | 70.22 | 87.63 | 71.16 | 64.58 | 81.37 | 60.73 |
| Yi 34B-200K | ~34B | Pretrained | 70.81 | 65.36 | 85.58 | 76.06 | 53.64 | 82.56 | 61.64 |
| Yi 34B | ~34B | Pretrained | 69.42 | 64.59 | 85.69 | 76.35 | 56.23 | 83.03 | 50.64 |
| Mixtral 8x7B-v0.1 | ~47B | Pretrained | 68.42 | 66.04 | 86.49 | 71.82 | 46.78 | 81.93 | 57.47 |
| Llama 2 70B | ~70B | Pretrained | 67.87 | 67.32 | 87.33 | 69.83 | 44.92 | 83.74 | 54.06 |
| Falcon 180B | ~180B | Pretrained | 67.85 | 69.45 | 88.86 | 70.50 | 45.47 | 86.90 | 45.94 |
| SOLAR 10.7B | ~11B | Pretrained | 66.04 | 61.95 | 84.60 | 65.48 | 45.04 | 83.66 | 55.50 |


In [5]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most correct answer from the following context.
    Think step by step and look at the HTML tags and table values carefully to provide the most correct answer.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()
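
The cells that follow pass the `docs` list itself as `Context`, so the template presumably receives the list's string form, metadata included. A common alternative, sketched here under that assumption, is to join each document's `page_content` first so the model sees only the extracted HTML. `format_docs` and the `PageStub` stand-in are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageStub:
    """Hypothetical stand-in for a loaded document with a page_content field."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of each document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


stub_docs = [PageStub("<h1>Table 2</h1>"), PageStub("<table>...</table>")]
context_text = format_docs(stub_docs)
print(context_text)
```

With the real objects, the result could be supplied as `{"question": ..., "Context": format_docs(docs)}`.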

In [6]:
chain.invoke({"question": "Explain Table 2?", "Context": docs})

'Table 2 presents evaluation results in the Open LLM Leaderboard for SOLAR 10.7B and SOLAR 10.7B-Instruct along with other top-performing models. The table includes scores for six tasks mentioned in Sec. 4.1, along with the H6 score (average of six tasks). The table also reports the size of the models in units of billions of parameters and indicates the training stage of the model (Pretrained, Instruction-tuned, or Alignment-tuned). Models based on SOLAR 10.7B are colored purple, and the best scores for H6 and the individual tasks are shown in bold.'

In [7]:
chain.invoke({"question": "What is MMLU scores of SOLAR 10.7B?", "Context": docs})

'The MMLU score of SOLAR 10.7B is 65.48.'
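
An answer like this can be cross-checked against the extracted table rather than trusted as is. A minimal sketch, assuming a few rows and columns from the table displayed earlier have been copied into a CSV string; `lookup_score` is a hypothetical helper, and the substring check is a deliberately simple stand-in for a real groundedness check.

```python
import csv
import io

# A few rows and columns copied from the table displayed earlier in this notebook.
TABLE_CSV = """Model,H6 (Avg.),ARC,HellaSwag,MMLU
SOLAR 10.7B-Instruct,74.20,71.08,88.16,66.21
Qwen 72B,73.60,65.19,85.94,77.37
SOLAR 10.7B,66.04,61.95,84.60,65.48
"""


def lookup_score(table_csv: str, model: str, column: str) -> str:
    """Return the cell for an exact model name, or raise KeyError."""
    for row in csv.DictReader(io.StringIO(table_csv)):
        if row["Model"] == model:
            return row[column]
    raise KeyError(model)


expected = lookup_score(TABLE_CSV, "SOLAR 10.7B", "MMLU")
llm_answer = "The MMLU score of SOLAR 10.7B is 65.48."
answer_matches = expected in llm_answer
print(expected, answer_matches)
```

Matching on the exact model name matters here, since "SOLAR 10.7B" is also a prefix of "SOLAR 10.7B-Instruct".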

In [8]:
chain.invoke(
    {"question": "What is MMLU scores of Mistral 7B-Instruct-v0.2?", "Context": docs}
)

'The MMLU scores of Mistral 7B-Instruct-v0.2 is 60.78.'

# Exercise 
Sometimes, even when a table is provided in Markdown or HTML format, the Large Language Model (LLM) does not extract the information correctly. How can you fix this issue?

Hint: Consider chain-of-thought (CoT) prompting, few-shot examples, or a divide-and-conquer strategy. 
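
One possible take on the divide-and-conquer idea, offered as a sketch rather than the intended solution: parse the HTML table first, keep only the header and the row for the model in question, and send that much smaller context to the LLM. The parser uses only the standard library; `TableRows`, `narrow_context`, and `SAMPLE_HTML` are hypothetical names, and the sample assumes a simple table without nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def narrow_context(html: str, model: str) -> str:
    """Keep only the header row and the row whose first cell equals `model`."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    matches = [row for row in body if row and row[0] == model]
    return "\n".join(" | ".join(row) for row in [header, *matches])


SAMPLE_HTML = """
<table>
  <tr><th>Model</th><th>MMLU</th></tr>
  <tr><td>SOLAR 10.7B-Instruct</td><td>66.21</td></tr>
  <tr><td>SOLAR 10.7B</td><td>65.48</td></tr>
</table>
"""

small_context = narrow_context(SAMPLE_HTML, "SOLAR 10.7B")
print(small_context)
```

The narrowed text could then be passed as `Context` to the chain defined earlier, so the model reads two short rows instead of the whole document.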
