# Paper Question-Answering System

In [81]:
%%capture --no-stderr
%pip install --upgrade --quiet langchain langchain-community langchain-openai langchainhub langchain-chroma pypdf bs4 langchain_core

## Paper Ingestion and Question-Answering System

Load the document

In [3]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../project/example_data/2408.00714v1.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

# print(len(docs))

Question answering with RAG

In [89]:
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain_openai import ChatOpenAI


# llm = ChatOpenAI(model="gpt-4o")
llm = ChatOpenAI(model="gpt-3.5-turbo")


In [90]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
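
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` is implemented: the real splitter prefers to cut at separators such as paragraph breaks, line breaks, and spaces, so its chunk boundaries will differ. The function name `split_with_overlap` and the sample string are made up for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

With `chunk_size=1000` and `chunk_overlap=200`, as in the cell above, each window advances by 800 characters in this simplified scheme.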

In [92]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "Give me some key points of the paper."})

results

{'input': 'Give me some key points of the paper.',
 'context': [Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='}\n]\nChallenges#\nAfter going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations:'),
  Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='}\n]\nChallenges#\nAfter going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations:'),
  Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='}\n]\nChallenges#\nAfter going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations:'),
  Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='}\n]\nChallenges#\nAfter going through key ideas and demos of building LLM-centered agents, I start to see a coup

In [93]:
results['answer']

'The paper discusses key ideas and demonstrations related to building LLM-centered agents. It highlights common limitations encountered in the process of developing such agents. The challenges include issues such as model biases, data limitations, and the need for interpretability and explainability in LLM-based systems.'

Test result: the chain does not keep track of the conversation context.

In [94]:
print("Welcome to the paperbot. If you want to quit, please enter 'exit'.")
while True:
    # Input
    user_input = input("User: ")

    # End the conversation when the user types "exit"
    if user_input.lower() == "exit":
        print("Thank you.")
        break

    # Generate and print the response
    results = rag_chain.invoke({"input": user_input})
    print(f"User: {user_input}")
    print(f"Assistant: {results['answer']}")

Welcome to the paperbot. If you want to quit, please enter 'exit'.
User: What is annotation?
Assistant: Annotation is the process of adding metadata or labels to data, such as images, videos, or text, to provide additional context or information. It helps in categorizing, organizing, and making the data more understandable for machines or humans. Annotations can include things like bounding boxes around objects, text descriptions, or classifications that can be used for training machine learning models.
User: What are common ways of doing it?
Assistant: Building LLM-centered agents commonly involves challenges such as lack of interpretability in generated text, difficulty in fine-tuning large language models, and potential biases in the training data. These limitations can impact the performance and reliability of the agents in real-world applications.
Thank you.
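
One plausible reading of the second turn above: the follow-up question is sent to the retriever as is, and "it" carries no information about annotation. The toy sketch below illustrates the idea with a word-overlap score. This is only an analogy, since the actual retriever compares embeddings rather than shared words. The passages, the rewritten question, and the helper names are all invented for this illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query: str, passage: str) -> int:
    """Count the words shared by the query and the passage."""
    return len(tokens(query) & tokens(passage))


# Toy passages written for this illustration; they are not taken from the paper.
toy_passages = [
    "Annotation means adding labels to data such as videos.",
    "Agents built around language models have several limitations.",
]

raw_question = "What are common ways of doing it?"
# Hypothetical standalone rewrite of the same question.
standalone_question = "What are common ways of doing annotation?"

raw_scores = [overlap_score(raw_question, p) for p in toy_passages]
standalone_scores = [overlap_score(standalone_question, p) for p in toy_passages]
print(raw_scores)         # [0, 0]
print(standalone_scores)  # [1, 0]
```

The raw follow-up matches neither passage, while the standalone version points at the annotation passage. Rewriting the question with the help of the chat history before retrieval is therefore a natural fix.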


## Conversational RAG

In [95]:
import bs4
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_chroma import Chroma
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

### Construct retriever ###
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

### Contextualize question ###
contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)


### Answer question ###
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)


### Statefully manage chat history ###
store = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]


conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)
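
The session handling above can be pictured with a small stand-in written in plain Python. It mimics the pattern only: look up the history by `session_id`, pass the earlier messages to the chain, then append the new question and answer. `ToyHistory`, `toy_store`, `get_toy_history`, `toy_turn`, and `count_context` are hypothetical names for this sketch and are not part of LangChain.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyHistory:
    """Hypothetical stand-in for a chat history: an ordered list of (role, text) pairs."""
    messages: list[tuple[str, str]] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.messages.append((role, text))


toy_store: dict[str, ToyHistory] = {}


def get_toy_history(session_id: str) -> ToyHistory:
    """Return the history for this session, creating an empty one on first use."""
    if session_id not in toy_store:
        toy_store[session_id] = ToyHistory()
    return toy_store[session_id]


def toy_turn(session_id: str, user_text: str,
             answer_fn: Callable[[str, list[tuple[str, str]]], str]) -> str:
    """Run one turn: read the history, compute an answer, then record both messages."""
    history = get_toy_history(session_id)
    answer = answer_fn(user_text, list(history.messages))
    history.add("human", user_text)
    history.add("ai", answer)
    return answer


def count_context(user_text: str, past: list[tuple[str, str]]) -> str:
    """Dummy answer function that only reports how much history it received."""
    return f"saw {len(past)} earlier messages"


first = toy_turn("abc123", "What is annotation?", count_context)
second = toy_turn("abc123", "What are common ways of doing it?", count_context)
other = toy_turn("xyz789", "Hello", count_context)
print(first)   # saw 0 earlier messages
print(second)  # saw 2 earlier messages
print(other)   # saw 0 earlier messages
```

Each `session_id` gets its own history, so two conversations running in the same process do not see each other's messages.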

In [96]:
print("Welcome to the paperbot. If you want to quit, please enter 'exit'.")
while True:
    # Input
    user_input = input("User: ")

    # End the conversation when the user types "exit"
    if user_input.lower() == "exit":
        print("Thank you.")
        break

    # Generate and print the response
    results = conversational_rag_chain.invoke(
        {"input": user_input},
        # The session_id selects (or creates) the "abc123" entry in `store`.
        config={"configurable": {"session_id": "abc123"}},
    )
    print(f"User: {user_input}")
    print(f"Assistant: {results['answer']}")

Welcome to the paperbot. If you want to quit, please enter 'exit'.
User: What is annotation?
Assistant: Annotation is the process of adding metadata or labels to data to provide additional information or context. In the context of data annotation for tasks like video segmentation, annotators mark or label objects in videos to train machine learning models. Annotations help algorithms understand and interpret the data more effectively.
User: What are common ways of doing it?
Assistant: Common ways of data annotation include manual annotation, where humans label data directly, and automatic annotation, where algorithms generate labels. Other methods include semi-automatic annotation, where humans correct or validate automated annotations, and crowdsourced annotation, where tasks are distributed to a large group of annotators. Each method has its own advantages and limitations depending on the type of data and task at hand.
Thank you.
